Add a standalone rbv-caption binary that generates image captions using
ONNX models. It fetches images via the CDN, writes captions to a new
captions table, and integrates with search (both quick and advanced modes).

Supported models:
- vit-gpt2: ViT encoder + GPT-2 decoder (auto-download from Xenova)
- florence-2-base: Florence-2 4-stage pipeline using fine-tuned variant
  from onnx-community (auto-download)
- blip-base, git-base: manual ONNX export required

Key implementation details:
- Florence-2 task tokens are natural language prompts, not special tokens
- Uses non-merged decoder ONNX models (no KV cache) for simplicity
- Systemd template unit for deploying multiple models concurrently
- Deploy script targets quadbrat for GPU inference
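
The first two details can be sketched as follows. This is a minimal
illustration in Python, not the code in this commit: `expand_task`,
`greedy_decode`, and the `step` callback are hypothetical names, and the
prompt strings are assumed to match the Florence-2 processor's mapping
(check them against the model's processor before relying on them).
Without a KV cache, each step re-runs the decoder over the whole token
prefix, which is slower but avoids threading past key/value tensors
through the ONNX session.

```python
from typing import Callable, List, Sequence

# Assumed task-token -> prompt mapping (verify against the model's processor).
TASK_PROMPTS = {
    "<CAPTION>": "What does the image describe?",
    "<DETAILED_CAPTION>": "Describe in detail what is shown in the image.",
    "<MORE_DETAILED_CAPTION>": "Describe with a paragraph what is shown in the image.",
}


def expand_task(task: str) -> str:
    """Return the natural-language prompt for a task token.

    Unknown strings are passed through unchanged, so a caller may also
    supply a literal prompt.
    """
    return TASK_PROMPTS.get(task, task)


def greedy_decode(
    step: Callable[[Sequence[int]], Sequence[float]],
    start_id: int,
    eos_id: int,
    max_len: int = 32,
) -> List[int]:
    """Greedy decoding without a KV cache.

    `step` is a hypothetical callback that runs the (non-merged) decoder
    on the full prefix and returns next-token logits. Because nothing is
    cached, the prefix is re-processed on every iteration.
    """
    tokens = [start_id]
    while len(tokens) < max_len:
        logits = step(tokens)  # full prefix every time: no cache to update
        next_id = max(range(len(logits)), key=logits.__getitem__)  # argmax
        if next_id == eos_id:
            break
        tokens.append(next_id)
    return tokens[1:]  # drop the start token
```

In a real pipeline `step` would wrap the decoder ONNX session together
with the encoder output; here it is left abstract so the loop can be
exercised with a stub.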

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>