doc: captions planning

2026-04-07 07:08:02 +03:00
parent 88c491f740
commit 65a93e877b
1 changed files with 240 additions and 0 deletions
--- a/doc/plan/caption.md
+++ b/doc/plan/caption.md
@@ -0,0 +1,240 @@
+# Image Captioning Service
+
+## Context
+
+Search currently relies on CLIP embeddings for semantic image matching, but no text descriptions are stored for images. Stored captions would enable direct text search against image descriptions. The captioning service must run independently on separate hardware (CPU or GPU) with no filesystem access to the images, so it fetches them from the CDN instead.
+
+## Architecture
+
+A new standalone binary `rbv-caption` that:
+- Connects to the same PostgreSQL database
+- Fetches batches of uncaptioned image IDs + CDN URLs from the DB
+- Downloads each image via HTTP from the CDN
+- Runs a captioning model (BLIP-base, GIT-base, or Florence-2-base) via ONNX
+- Writes captions back to a new `captions` table
+- Supports `--model` to select which captioning model to use
+- Runs independently of other rbv services, on any hardware
+
+## Changes
+
+### 1. Database Migration — `captions` table
+
+**New file: `migrations/0011_captions.sql`**
+
+```sql
+CREATE TABLE IF NOT EXISTS captions (
+    image_id    BYTEA       NOT NULL REFERENCES images(id) ON DELETE CASCADE,
+    model       TEXT        NOT NULL,
+    caption     TEXT        NOT NULL,
+    created_at  TIMESTAMPTZ NOT NULL DEFAULT now(),
+    PRIMARY KEY (image_id, model)
+);
+
+CREATE INDEX IF NOT EXISTS idx_captions_image ON captions (image_id);
+```
+
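+Composite PK on `(image_id, model)` so different models can each produce a caption for the same image. The `model` column stores a short identifier like `blip-base`, `git-base`, or `florence-2-base`. Note that the primary-key index already leads with `image_id`, so it serves lookups by `image_id` alone; the separate `idx_captions_image` index is redundant and could be dropped.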
+
+### 2. Entity — `Caption` struct
+
+**`crates/rbv-entity/src/caption.rs`** (new file)
+```rust
+pub struct Caption {
+    pub image_id: ImageId,
+    pub model: String,
+    pub caption: String,
+}
+```
+
+**`crates/rbv-entity/src/lib.rs`** — add `mod caption; pub use caption::*;`
+
+### 3. Data Layer — caption queries
+
+**`crates/rbv-data/src/caption.rs`** (new file)
+
+Functions:
+- `upsert_caption(pool, caption: &Caption)` — INSERT ON CONFLICT (image_id, model) DO UPDATE
+- `get_captions(pool, image_id: &ImageId)` — all captions for an image
+- `uncaptioned_image_urls(pool, model: &str, cdn_fs_prefix: &str, cdn_url_prefix: &str, batch_size: i64)` — the key batch query:
+
+```sql
+SELECT gi.image_id, g.path, gi.filename
+FROM gallery_images gi
+JOIN galleries g ON g.id = gi.gallery_id
+WHERE NOT EXISTS (
+    SELECT 1 FROM captions c
+    WHERE c.image_id = gi.image_id AND c.model = $1
+)
+LIMIT $2
+```
+
+The CDN prefix mapping is then applied in Rust rather than in SQL to construct the download URL (same logic as `resolveImageUrl` in the frontend): strip `fs_prefix` from the gallery path (`g.path`), prepend `url_prefix`, and append the filename.
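+
+A minimal sketch of that mapping, assuming the frontend logic is a plain prefix swap. The function name `resolve_image_url` and its signature are illustrative rather than existing code, and percent-encoding of path segments is deliberately left out:
+
+```rust
+/// Builds a CDN download URL from a gallery's filesystem path and a filename.
+/// Returns `None` when `gallery_path` is not under `fs_prefix`.
+/// Assumption: path segments need no percent-encoding.
+pub fn resolve_image_url(
+    gallery_path: &str,
+    filename: &str,
+    fs_prefix: &str,
+    url_prefix: &str,
+) -> Option<String> {
+    // Strip the filesystem prefix; bail out if the gallery is outside the mapped tree.
+    let rest = gallery_path.strip_prefix(fs_prefix)?;
+    // Reject partial-segment matches such as `/vault2` against the prefix `/vault`.
+    if !(rest.is_empty() || rest.starts_with('/') || fs_prefix.ends_with('/')) {
+        return None;
+    }
+    // Normalize slashes so prefixes with or without a trailing '/' behave the same.
+    let relative = rest.trim_matches('/');
+    let base = url_prefix.trim_end_matches('/');
+    if relative.is_empty() {
+        Some(format!("{base}/{filename}"))
+    } else {
+        Some(format!("{base}/{relative}/{filename}"))
+    }
+}
+```
+
+Returning `None` for unmapped paths lets the batch loop skip and log such images instead of requesting a malformed URL.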
+
+- `count_uncaptioned(pool, model: &str)` — for progress reporting
+
+**`crates/rbv-data/src/lib.rs`** — add `pub mod caption;`
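+
+The upsert listed above could use a statement along these lines. This is a sketch: the constant name `UPSERT_CAPTION_SQL` is hypothetical, and refreshing `created_at` on conflict is an assumption the plan does not specify.
+
+```rust
+/// Hypothetical constant holding the statement for `upsert_caption`.
+/// Binds: $1 = image_id, $2 = model, $3 = caption.
+/// Assumption: re-captioning an image refreshes `created_at`.
+pub const UPSERT_CAPTION_SQL: &str = "\
+INSERT INTO captions (image_id, model, caption)
+VALUES ($1, $2, $3)
+ON CONFLICT (image_id, model)
+DO UPDATE SET caption = EXCLUDED.caption,
+              created_at = now()";
+```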
+
+### 4. Caption Inference Crate
+
+**`crates/rbv-caption/`** (new crate)
+
+This is a standalone binary crate (not a library consumed by other crates). It keeps the ONNX captioning models separate from `rbv-infer` so that the existing CLI/API binaries don't need to link captioning model code.
+
+**`crates/rbv-caption/Cargo.toml`**:
+```toml
+[package]
+name = "rbv-caption"
+
+[dependencies]
+rbv-entity = { workspace = true }
+rbv-data = { workspace = true }
+clap = { workspace = true }
+sqlx = { workspace = true }
+tokio = { workspace = true }
+anyhow = { workspace = true }
+tracing = { workspace = true }
+tracing-subscriber = { workspace = true }
+ort = { workspace = true }
+tokenizers = { workspace = true }
+ndarray = { workspace = true }
+image = { workspace = true }
+reqwest = { workspace = true }  # version 0.12, default-features = false, features = ["rustls-tls"] declared at the workspace root
+```
+
+**`crates/rbv-caption/src/main.rs`**:
+- CLI args via clap:
+  - `--database` — PostgreSQL connection string
+  - `--model` — captioning model identifier (`blip-base`, `git-base`, `florence-2-base`)
+  - `--model-dir` — path to ONNX model directory
+  - `--cdn-map` — `fs_prefix=url_prefix` mapping (same format as API server)
+  - `--batch-size` — images per batch (default: 100)
+  - `--concurrency` — concurrent inference tasks (default: 1 for CPU, higher for GPU)
+- Main loop:
+  1. Query `uncaptioned_image_urls()` for a batch
+  2. If empty, log and exit
+  3. For each image in batch (with concurrency limit):
+     a. Download image bytes from CDN URL via reqwest
+     b. Run captioning model → get caption string
+     c. Upsert caption to DB
+  4. Log progress, repeat from step 1
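+
+The control flow of this loop can be sketched without any I/O. Here `fetch_batch` and `process` are hypothetical closures standing in for the database query and the download, inference and upsert step, and concurrency is omitted. The steps above leave open what happens when an image keeps failing: it stays uncaptioned, so the next query returns it again. This sketch therefore stops once a whole batch makes no progress; that stopping rule is an assumption, not part of the plan.
+
+```rust
+/// Drives the fetch -> caption -> store loop until no uncaptioned images remain.
+/// Returns `(succeeded, failed)` counts.
+pub fn run_batches<F, P>(mut fetch_batch: F, mut process: P) -> (usize, usize)
+where
+    F: FnMut() -> Vec<String>,
+    P: FnMut(&str) -> Result<(), String>,
+{
+    let (mut ok, mut failed) = (0usize, 0usize);
+    loop {
+        let batch = fetch_batch();
+        if batch.is_empty() {
+            break; // step 2: nothing left to caption
+        }
+        let mut progressed = false;
+        for url in &batch {
+            match process(url) {
+                Ok(()) => {
+                    ok += 1;
+                    progressed = true;
+                }
+                Err(_) => failed += 1,
+            }
+        }
+        // Failed images would be returned again by the next query;
+        // stop instead of spinning when an entire batch fails.
+        if !progressed {
+            break;
+        }
+    }
+    (ok, failed)
+}
+```
+
+A fuller version might record failures so that the batch query can exclude them, which would let the loop continue past a batch of unreachable images.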
+
+**`crates/rbv-caption/src/models/mod.rs`** — trait + model dispatch:
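+```rust
+use std::path::Path;
+use anyhow::Result;
+
+/// Implemented once per captioning model.
+pub trait CaptionModel: Send + Sync {
+    /// Takes encoded image bytes as downloaded from the CDN and returns the caption text.
+    fn caption(&self, image_bytes: &[u8]) -> Result<String>;
+}
+
+/// Maps `model_name` to its implementation; unknown names are an error. (Signature only.)
+pub fn load_model(model_name: &str, model_dir: &Path) -> Result<Box<dyn CaptionModel>>
+```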
+
+**`crates/rbv-caption/src/models/blip.rs`** — BLIP-base ONNX inference
+**`crates/rbv-caption/src/models/git.rs`** — GIT-base ONNX inference
+**`crates/rbv-caption/src/models/florence.rs`** — Florence-2-base ONNX inference
+
+Each model implementation:
+- Loads ONNX encoder + decoder models from `{model_dir}/captioning/{model_name}/`
+- Preprocesses image (resize, normalize per model requirements)
+- Runs autoregressive decoding (encoder output → decoder generates tokens one at a time)
+- Decodes tokens back to text using model's tokenizer
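+
+The autoregressive step is the same for all three models and can be sketched independently of ONNX. In this sketch `next_token` is a hypothetical closure standing in for one decoder run followed by `argmax` over the logits. The token IDs for BOS and EOS, and the choice of greedy decoding over beam search, are assumptions:
+
+```rust
+/// Greedy autoregressive decoding: repeatedly asks the decoder for the next
+/// token given everything generated so far, until EOS or `max_len` tokens
+/// (counting the leading BOS).
+pub fn greedy_decode<F>(bos: u32, eos: u32, max_len: usize, mut next_token: F) -> Vec<u32>
+where
+    F: FnMut(&[u32]) -> u32,
+{
+    let mut tokens = vec![bos];
+    while tokens.len() < max_len {
+        let next = next_token(&tokens);
+        if next == eos {
+            break;
+        }
+        tokens.push(next);
+    }
+    // Drop the leading BOS so only caption tokens reach the tokenizer.
+    tokens.remove(0);
+    tokens
+}
+
+/// Index of the largest logit, i.e. the greedy token choice.
+/// Returns `None` for an empty slice.
+pub fn argmax(logits: &[f32]) -> Option<u32> {
+    logits
+        .iter()
+        .enumerate()
+        .max_by(|a, b| a.1.total_cmp(b.1))
+        .map(|(i, _)| i as u32)
+}
+```
+
+The `max_len` cap matters because a model that never emits EOS would otherwise loop forever on a single image.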
+
+**Model directory layout**:
+```
+{model_dir}/captioning/
+├── blip-base/
+│   ├── encoder.onnx
+│   ├── decoder.onnx
+│   └── tokenizer.json
+├── git-base/
+│   ├── encoder.onnx
+│   ├── decoder.onnx
+│   └── tokenizer.json
+└── florence-2-base/
+    ├── encoder.onnx
+    ├── decoder.onnx
+    └── tokenizer.json
+```
+
+### 5. Search Integration
+
+**`crates/rbv-search/src/combined.rs`** — extend quick search to also search captions:
+
+In `search_quick()`, add a fifth concurrent query:
+```rust
+// alongside existing tag/subject/person/clip searches:
+rbv_data::caption::search_captions(pool, query, fetch_limit)
+```
+
+**`crates/rbv-data/src/caption.rs`** — add:
+```rust
+pub async fn search_captions(pool, query: &str, limit: i64) -> Result<Vec<Gallery>>
+// SQL: find galleries containing images whose captions match the query
+// SELECT DISTINCT g.* FROM galleries g
+// JOIN gallery_images gi ON gi.gallery_id = g.id
+// JOIN captions c ON c.image_id = gi.image_id
+// WHERE c.caption ILIKE '%' || $1 || '%'
+// LIMIT $2
+```
+
+For advanced search, add `caption` as a new filter field in `SearchParams`.
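+
+One detail of the `ILIKE` query above: `%` and `_` in the user's query act as wildcards unless escaped. A small helper could escape them before binding. The name `escape_like` is hypothetical, and the sketch relies on backslash being PostgreSQL's default `LIKE` escape character:
+
+```rust
+/// Escapes `\`, `%` and `_` so user input is matched literally by LIKE/ILIKE.
+pub fn escape_like(query: &str) -> String {
+    let mut out = String::with_capacity(query.len());
+    for ch in query.chars() {
+        // Prefix each special character with the escape character.
+        if matches!(ch, '\\' | '%' | '_') {
+            out.push('\\');
+        }
+        out.push(ch);
+    }
+    out
+}
+```
+
+The escaped string would be bound as `$1` in place of the raw query, so a search for `50%` matches that literal text rather than any caption containing `50`.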
+
+### 6. Workspace Registration
+
+**`Cargo.toml`** (workspace root) — add `"crates/rbv-caption"` to `members` and add `reqwest` to workspace dependencies.
+
+## Implementation Order
+
+1. Migration `0011_captions.sql`
+2. `rbv-entity` — Caption struct
+3. `rbv-data/caption.rs` — queries (upsert, uncaptioned batch, search)
+4. `rbv-caption` crate — CLI, main loop, image download, model trait
+5. `rbv-caption` — BLIP-base model implementation (start with one model)
+6. `rbv-caption` — GIT-base and Florence-2-base implementations
+7. Search integration — extend quick/advanced search to include captions
+8. `cargo build` — verify everything compiles
+
+## Model ONNX Export
+
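+The ONNX model files need to be exported before use. This is a one-time setup step per model. The models can be exported from HuggingFace using `optimum-cli` or Python scripts. Whatever the export tool produces, the files must end up matching the model directory layout above (`encoder.onnx`, `decoder.onnx`, `tokenizer.json` under `{model_dir}/captioning/{model_name}/`), which may require renaming or moving them: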
+
+```bash
+# BLIP-base
+optimum-cli export onnx --model Salesforce/blip-image-captioning-base blip-base/
+
+# GIT-base
+optimum-cli export onnx --model microsoft/git-base-coco git-base/
+
+# Florence-2-base
+# Florence-2 requires custom export due to its architecture
+```
+
+## Verification
+
+1. `cargo build` — all crates compile
+2. Run migration on database
+3. Export BLIP-base model to ONNX
+4. Run `rbv-caption --database ... --model blip-base --model-dir ... --cdn-map ...` on gramathea
+5. Verify captions appear in `captions` table
+6. Test search — quick search should return results matching caption text
+7. Export GIT-base and Florence-2-base, test on quadbrat with GPU
+
+## Usage
+
+```bash
+# On gramathea (CPU, BLIP-base):
+rbv-caption \
+  --database "$DB" \
+  --model blip-base \
+  --model-dir /path/to/models \
+  --cdn-map "/tank/data/rbv/vault=https://cdn.example.com/vault" \
+  --batch-size 100 \
+  --concurrency 4
+
+# On quadbrat (GPU, GIT-base):
+rbv-caption \
+  --database "$DB" \
+  --model git-base \
+  --model-dir /path/to/models \
+  --cdn-map "/tank/data/rbv/vault=https://cdn.example.com/vault" \
+  --batch-size 200 \
+  --concurrency 8
+```