rbv
Image gallery indexer and facial recognition tool.
Subcommands
migrate
Create or update the database schema.
rbv migrate --database <CONNSTR>
Run this once before first use, and again after pulling schema changes.
index
Extract CLIP embeddings and face detections from image galleries and store them in the database.
rbv index \
--target <PATH>... \
--database <CONNSTR> \
--model-dir <PATH> \
[--concurrency <N>] # ML concurrency, default 4
[--io-concurrency <N>] # file I/O concurrency, default 2× ML concurrency
[--reindex] # bypass gallery-level skip
[--ml-purge] # wipe all ML data and re-index from scratch
--target can be any of:
- A single gallery directory (contains index.json and tn/)
- A chunk directory (immediate children are galleries)
- A root directory (immediate children are chunks)
- Any arbitrary directory — galleries are discovered recursively
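The discovery rule above can be sketched roughly as follows. This is an illustration in Python, not the tool's own code: the function names are hypothetical, and the choice to stop descending once a gallery is found is an assumption.

```python
import os
from pathlib import Path


def is_gallery(path: Path) -> bool:
    """Treat a directory as a gallery if it holds index.json and a tn/ subdirectory."""
    return (path / "index.json").is_file() and (path / "tn").is_dir()


def discover_galleries(target: Path) -> list[Path]:
    """Walk target top-down and collect every gallery directory beneath it.

    Works the same whether target is a single gallery, a chunk, a root,
    or any other directory.
    """
    found = []
    for dirpath, dirnames, _filenames in os.walk(target):
        current = Path(dirpath)
        if is_gallery(current):
            found.append(current)
            # Assumption: galleries are not nested, so pruning the walk here is safe.
            dirnames.clear()
    return sorted(found)
```

Because the walk starts at the target itself, passing a single gallery directory returns just that gallery.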
Skip optimisations
Re-running index against the same target is safe and fast. Four layers of
skip logic avoid redundant work:
- Gallery-level skip: each gallery's image file count and total byte size are stored in the database. If neither has changed since the last successful index, the entire gallery is skipped without reading any files. Use --reindex to bypass this check.
- Batch existence check: for galleries that are not skipped, all known (filename, image_id, file_size) tuples are fetched in a single query rather than one query per image.
- File-size fast path: each file is stat()-ed (cheap syscall). If the filename and file size match the stored values, the file is skipped without being read or hashed.
- Content-hash dedup: if a file is read and its BLAKE3 hash matches an existing image (in this gallery or any other), ML processing is skipped.
Together these mean that a re-index of an unchanged 1.5 M image corpus completes in seconds rather than days.
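As a rough illustration of how the skip layers fit together, here is a minimal Python sketch. All names (GalleryStats, gallery_unchanged, classify_file) are hypothetical, and the hash function is passed in as a callable because BLAKE3 is not in the Python standard library. The real indexer may order or implement these checks differently.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class GalleryStats:
    file_count: int
    total_bytes: int


def gallery_unchanged(stored: Optional[GalleryStats],
                      on_disk: GalleryStats,
                      reindex: bool = False) -> bool:
    """Gallery-level skip: identical count and byte total, unless --reindex is set."""
    return not reindex and stored is not None and stored == on_disk


def classify_file(filename: str,
                  size_on_disk: int,
                  known_sizes: dict,
                  known_hashes: set,
                  read_and_hash: Callable[[str], str]) -> str:
    """Decide what to do with one file in a gallery that was not skipped.

    known_sizes maps filename -> stored file_size (None if never recorded).
    It stands in for the result of the single batch query.
    """
    # File-size fast path: only a stat() result is needed, no read.
    if known_sizes.get(filename) == size_on_disk:
        return "skip:size"
    # Content-hash dedup: the file must be read, but ML can still be avoided.
    if read_and_hash(filename) in known_hashes:
        return "skip:hash"
    return "process"
```

Only files that come back as "process" would reach ML inference; the earlier checks are ordered from cheapest to most expensive.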
Two-stage pipeline
File I/O (reads + hashing) and ML inference run under separate concurrency
limits. --io-concurrency controls how many files are read/hashed in
parallel (default: 2× --concurrency), while --concurrency controls ML
inference slots. This keeps the disk and the ML backend both saturated
instead of taking turns.
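The idea of two independent limits can be sketched with a pair of semaphores. This is a Python asyncio illustration under assumptions, not the tool's implementation: run_pipeline, read_and_hash and infer are hypothetical names, and the I/O default follows the 2× figure given in the index synopsis.

```python
import asyncio


async def run_pipeline(paths, read_and_hash, infer,
                       ml_concurrency=4, io_concurrency=None):
    """Run every path through an I/O stage and an ML stage with separate limits."""
    if io_concurrency is None:
        # Assumption: mirrors the documented default of 2x the ML concurrency.
        io_concurrency = 2 * ml_concurrency
    io_slots = asyncio.Semaphore(io_concurrency)
    ml_slots = asyncio.Semaphore(ml_concurrency)

    async def handle(path):
        async with io_slots:
            data = await read_and_hash(path)
        # The I/O slot is released before queueing for an ML slot, so the next
        # file can be read while this one waits for or runs inference.
        async with ml_slots:
            return await infer(data)

    # gather() returns results in the same order as the input paths.
    return await asyncio.gather(*(handle(p) for p in paths))
```

Releasing the I/O slot before acquiring an ML slot is what lets both stages stay busy at once. Holding both slots together would tie read throughput to inference throughput.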
Quality note: Indexing one gallery or the whole tree produces identical
embeddings. Recognition quality is determined entirely by cluster (below),
which always operates over the full database regardless of how many index
runs contributed to it.
backfill
Populate file_size and gallery stats from disk without running any ML
inference. This is a one-time migration helper — run it after upgrading to
make all skip optimisations effective on the very first index run.
rbv backfill \
--target <PATH>... \
--database <CONNSTR> \
[--concurrency <N>] # default 16
For each gallery that exists both on disk and in the database, backfill
calls stat() on every image file and writes file_size into gallery_images rows
where it is currently NULL. It also sets the gallery-level file_count and
total_bytes columns so that gallery-level skip works immediately.
No files are read (only stat()-ed) and no ML models are loaded, so this
runs very quickly even on large corpora.
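A stat-only pass of this kind might look like the following Python sketch. The function names and the IMAGE_EXTS set are assumptions made for illustration; which extensions the tool actually counts as images is not stated here.

```python
import os
from pathlib import Path

# Assumption: the real tool may recognise a different set of image extensions.
IMAGE_EXTS = {".jpg", ".jpeg", ".png", ".webp"}


def stat_gallery(gallery: Path):
    """Return (sizes, file_count, total_bytes) using only directory and stat calls."""
    sizes = {}
    with os.scandir(gallery) as entries:
        for entry in entries:
            if entry.is_file() and Path(entry.name).suffix.lower() in IMAGE_EXTS:
                sizes[entry.name] = entry.stat().st_size  # no file contents are read
    return sizes, len(sizes), sum(sizes.values())


def rows_needing_backfill(sizes_on_disk: dict, stored_sizes: dict):
    """Pick (filename, size) pairs whose stored file_size is NULL (None here)."""
    return [(name, size) for name, size in sorted(sizes_on_disk.items())
            if name in stored_sizes and stored_sizes[name] is None]
```

The file count and byte total from stat_gallery correspond to the gallery-level columns, and rows_needing_backfill models the "only where currently NULL" rule.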
cluster
Group all indexed face embeddings into person identities using cosine similarity and connected-components clustering.
rbv cluster \
--database <CONNSTR> \
[--threshold <FLOAT>] # default 0.65 (range 0.0–1.0)
- Only faces without an assigned person_id are clustered; existing assignments are preserved.
- Re-run after each index pass to assign identities to newly indexed faces.
- Raise --threshold (e.g. 0.75) for stricter grouping (fewer false merges, more splits). Lower it for looser grouping.
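To make the effect of the threshold concrete, here is a small Python sketch of connected-components clustering over cosine similarity, using a union-find structure. It rests on assumptions: an edge is drawn when similarity is at or above the threshold, all pairs are compared, and faces that already have a person_id are not modelled. The function names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def cluster(embeddings, threshold=0.65):
    """Label each embedding with the smallest index in its connected component."""
    n = len(embeddings)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            # Assumption: "similar enough" means similarity >= threshold.
            if cosine(embeddings[i], embeddings[j]) >= threshold:
                ri, rj = find(i), find(j)
                if ri != rj:
                    parent[max(ri, rj)] = min(ri, rj)
    return [find(i) for i in range(n)]
```

Because components are transitive, one borderline pair can chain two groups together. With the four vectors [1, 0], [0.9, 0.1], [0, 1], [0.1, 0.9], a threshold of 0.65 yields two groups, while 0.2 merges everything into one. This is the false-merge risk that raising --threshold reduces.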
Typical workflow
# 1. One-time setup
rbv migrate --database "$DATABASE_URL"
# 2. Index all galleries (incremental — safe to re-run)
rbv index --target /mnt/galleries --database "$DATABASE_URL" --model-dir /path/to/models
# 3. Cluster faces into persons
rbv cluster --database "$DATABASE_URL"
# 4. As new galleries are added, repeat steps 2–3
rbv index --target /mnt/galleries/new-chunk --database "$DATABASE_URL" --model-dir /path/to/models
rbv cluster --database "$DATABASE_URL"
After upgrading (one-time)
If you have an existing database from before the skip optimisations were
added, run backfill once to populate file sizes and gallery stats:
rbv migrate --database "$DATABASE_URL"
rbv backfill --target /mnt/galleries --database "$DATABASE_URL"
Subsequent index runs will then benefit from all skip layers immediately.
Resetting face assignments
To discard all clustering results and start fresh (e.g. after tuning
--threshold or after a large new batch is indexed):
-- Unassign all faces
UPDATE face_detections SET person_id = NULL;
-- Remove all person records (cascade-clears person_names too)
DELETE FROM persons;
Then re-run rbv cluster. The embeddings themselves are not touched, so
re-clustering is fast.
To reset a single person's faces without affecting others:
UPDATE face_detections SET person_id = NULL WHERE person_id = '<uuid>';
DELETE FROM persons WHERE id = '<uuid>';