rbv/crates/rbv-cli/README.md
rob thijssen 617fa34a23 Pre-warm connection pool and size it to match concurrency
Configure sqlx pool with min_connections = max_connections so all
connections are established at startup, avoiding slow-acquire warnings
from lazy mTLS handshakes. Add idle_timeout (5 min) to recycle stale
connections from prior runs, and reduce acquire_timeout to 10s for
faster failure.

Size the pool to io_concurrency + ml_concurrency + 2 to accommodate
the worst case where all IO tasks call image_exists concurrently.
Reduce default io_concurrency from 4× to 2× ML concurrency to keep
pool size within PostgreSQL's default max_connections.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-30 09:40:35 +03:00


rbv

Image gallery indexer and facial recognition tool.

Subcommands

migrate

Create or update the database schema.

rbv migrate --database <CONNSTR>

Run this once before first use, and again after pulling schema changes.


index

Extract CLIP embeddings and face detections from image galleries and store them in the database.

rbv index \
  --target <PATH>... \
  --database <CONNSTR> \
  --model-dir <PATH> \
  [--concurrency <N>]        # ML concurrency, default 4
  [--io-concurrency <N>]     # file I/O concurrency, default 2× ML concurrency
  [--reindex]                # bypass gallery-level skip
  [--ml-purge]               # wipe all ML data and re-index from scratch

--target can be any of:

  • A single gallery directory (contains index.json and tn/)
  • A chunk directory (immediate children are galleries)
  • A root directory (immediate children are chunks)
  • Any arbitrary directory — galleries are discovered recursively
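
The four target shapes above can all be handled by one recursive walk. The following is a minimal Python sketch of that idea, not rbv's actual implementation: `is_gallery` and `discover_galleries` are hypothetical names, and the sketch assumes galleries are never nested inside other galleries.

```python
from pathlib import Path


def is_gallery(path: Path) -> bool:
    """A gallery directory contains index.json and a tn/ subdirectory."""
    return (path / "index.json").is_file() and (path / "tn").is_dir()


def discover_galleries(target: Path) -> list[Path]:
    """Return every gallery at or below target, in sorted order.

    Works the same way whether target is a gallery, a chunk, a root,
    or an arbitrary directory, because it stops descending as soon as
    it finds a gallery (assumption: galleries are not nested).
    """
    if is_gallery(target):
        return [target]
    found: list[Path] = []
    for child in sorted(p for p in target.iterdir() if p.is_dir()):
        found.extend(discover_galleries(child))
    return found
```

Calling `discover_galleries` on a root, on one of its chunks, or on a single gallery returns the matching subset of the same gallery paths.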

Skip optimisations

Re-running index against the same target is safe and fast. Four layers of skip logic avoid redundant work:

  1. Gallery-level skip — each gallery's image file count and total byte size are stored in the database. If neither has changed since the last successful index, the entire gallery is skipped without reading any files. Use --reindex to bypass this check.

  2. Batch existence check — for galleries that are not skipped, all known (filename, image_id, file_size) tuples are fetched in a single query rather than one query per image.

  3. File-size fast path — each file is stat()-ed (cheap syscall). If the filename and file size match the stored values, the file is skipped without being read or hashed.

  4. Content-hash dedup — if a file is read and its BLAKE3 hash matches an existing image (in this gallery or any other), ML processing is skipped.

Together these mean that a re-index of an unchanged 1.5 M image corpus completes in seconds rather than days.
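
As an illustration of how the layers stack, here is a minimal Python sketch of the decision logic. It is not rbv's code: every name in it is hypothetical, `known_sizes` stands in for the result of the single batch query, and `hashlib.blake2b` is used only as a stand-in because BLAKE3 is not in Python's standard library.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class GalleryStats:
    file_count: int
    total_bytes: int


def gallery_unchanged(stored, current, reindex=False) -> bool:
    """Layer 1: skip the whole gallery when count and size both match."""
    return not reindex and stored is not None and stored == current


def plan_file(filename, size_on_disk, known_sizes, known_hashes, read_bytes) -> str:
    """Layers 3 and 4 for one file.

    known_sizes  -- {filename: file_size} fetched once per gallery (layer 2)
    known_hashes -- set of content hashes already in the database
    read_bytes   -- callable that reads the file; only called if needed
    """
    # Layer 3: same name and same size, so the file is never opened.
    if known_sizes.get(filename) == size_on_disk:
        return "skip-size"
    # Layer 4: the file is read and hashed, but ML is skipped for known content.
    digest = hashlib.blake2b(read_bytes(filename)).hexdigest()
    if digest in known_hashes:
        return "skip-ml"
    return "process"
```

The ordering matters: each layer is cheaper than the one after it, so most files in an unchanged corpus are rejected before any bytes are read.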

Two-stage pipeline

File I/O (reads + hashing) and ML inference run under separate concurrency limits. --io-concurrency controls how many files are read and hashed in parallel (default: 2× --concurrency), while --concurrency controls the number of ML inference slots. This keeps the disk and the ML backend both saturated instead of taking turns.
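
The two-stage idea can be sketched with two independently sized worker pools. This Python sketch is illustrative only and does not mirror rbv's Rust internals: `run_pipeline`, `read_file`, and `infer` are hypothetical names, `hashlib.blake2b` stands in for BLAKE3, the 2× default is taken from the option listing for `index`, and `db_pool_size` follows the sizing rule stated in the commit note at the top of this page (io_concurrency + ml_concurrency + 2).

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


def default_io_concurrency(ml_concurrency: int) -> int:
    """Default file I/O concurrency: 2x the ML concurrency."""
    return 2 * ml_concurrency


def db_pool_size(ml_concurrency: int, io_concurrency: int) -> int:
    """Connection pool size per the commit note: io + ml + 2."""
    return io_concurrency + ml_concurrency + 2


def run_pipeline(paths, read_file, infer, ml_concurrency=4, io_concurrency=None):
    """Read and hash files in one pool, run inference in another."""
    io_concurrency = io_concurrency or default_io_concurrency(ml_concurrency)

    def io_stage(path):
        data = read_file(path)
        return path, hashlib.blake2b(data).hexdigest(), data

    with ThreadPoolExecutor(max_workers=io_concurrency) as io_pool, \
         ThreadPoolExecutor(max_workers=ml_concurrency) as ml_pool:
        io_futures = [io_pool.submit(io_stage, p) for p in paths]
        pending = []
        for fut in io_futures:
            # Hand each file to the ML pool as soon as its read completes,
            # while later reads continue in the background.
            path, digest, data = fut.result()
            pending.append((path, digest, ml_pool.submit(infer, data)))
        return {path: (digest, ml.result()) for path, digest, ml in pending}
```

With the defaults this gives 4 ML slots, 8 I/O slots, and a pool of 8 + 4 + 2 = 14 database connections.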

Quality note: Indexing one gallery or the whole tree produces identical embeddings. Recognition quality is determined entirely by cluster (below), which always operates over the full database regardless of how many index runs contributed to it.


backfill

Populate file_size and gallery stats from disk without running any ML inference. This is a one-time migration helper — run it after upgrading to make all skip optimisations effective on the very first index run.

rbv backfill \
  --target <PATH>... \
  --database <CONNSTR> \
  [--concurrency <N>]    # default 16

For each gallery that exists both on disk and in the database, backfill calls stat() on every image file and writes file_size into gallery_images rows where it is currently NULL. It also sets the gallery-level file_count and total_bytes columns so that gallery-level skip works immediately.

No files are read (only stat()-ed) and no ML models are loaded, so this runs very quickly even on large corpora.
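
To show why a stat-only pass is cheap, here is a minimal Python sketch that computes a gallery's file count and total byte size without opening any file. It rests on assumptions not stated in this README: that images sit directly in the gallery directory, and that the set of image extensions looks like `IMAGE_SUFFIXES`. `gallery_stats` is a hypothetical name.

```python
import os
from pathlib import Path

# Assumption: the real tool's list of recognised extensions may differ.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".webp"}


def gallery_stats(gallery: Path) -> tuple[int, int]:
    """Return (file_count, total_bytes) using directory metadata only."""
    count = 0
    total = 0
    with os.scandir(gallery) as entries:
        for entry in entries:
            is_image = Path(entry.name).suffix.lower() in IMAGE_SUFFIXES
            if entry.is_file() and is_image:
                count += 1
                total += entry.stat().st_size  # metadata lookup, no read
    return count, total
```

The resulting pair corresponds to the file_count and total_bytes values that gallery-level skip compares on later index runs.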


cluster

Group all indexed face embeddings into person identities using cosine similarity and connected-components clustering.

rbv cluster \
  --database <CONNSTR> \
  [--threshold <FLOAT>]  # default 0.65 (range 0.0-1.0)

  • Only faces without an assigned person_id are clustered; existing assignments are preserved.
  • Re-run after each index pass to assign identities to newly indexed faces.
  • Raise --threshold (e.g. 0.75) for stricter grouping (fewer false merges, more splits). Lower it for looser grouping.
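
The effect of the threshold is easiest to see in a small example. The following Python sketch implements cosine similarity plus connected components with a union-find structure. It is a toy version under stated assumptions, not rbv's implementation: it compares all pairs (quadratic in the number of faces), treats a similarity equal to the threshold as a match, and uses 2-D vectors in place of real face embeddings. The name `cluster_faces` is hypothetical.

```python
import math


def cosine(a, b) -> float:
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def cluster_faces(embeddings, threshold=0.65):
    """Group indices whose similarity graph is connected at the threshold."""
    parent = list(range(len(embeddings)))

    def find(i):
        # Follow parent links to the root, halving the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(embeddings)):
        for j in range(i + 1, len(embeddings)):
            if cosine(embeddings[i], embeddings[j]) >= threshold:
                parent[find(j)] = find(i)  # merge the two components

    groups = {}
    for i in range(len(embeddings)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Faces 0 and 2 are dissimilar, but both resemble face 1 (similarity ~0.71).
faces = [(1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
loose = cluster_faces(faces, threshold=0.65)
strict = cluster_faces(faces, threshold=0.75)
```

Here `loose` puts all three faces in one group because connected components chain through face 1, while `strict` keeps them apart. This is the false-merge versus split trade-off that --threshold controls.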

Typical workflow

# 1. One-time setup
rbv migrate --database "$DATABASE_URL"

# 2. Index all galleries (incremental — safe to re-run)
rbv index --target /mnt/galleries --database "$DATABASE_URL" --model-dir /path/to/models

# 3. Cluster faces into persons
rbv cluster --database "$DATABASE_URL"

# 4. As new galleries are added, repeat steps 2-3
rbv index --target /mnt/galleries/new-chunk --database "$DATABASE_URL" --model-dir /path/to/models
rbv cluster --database "$DATABASE_URL"

After upgrading (one-time)

If you have an existing database from before the skip optimisations were added, run backfill once to populate file sizes and gallery stats:

rbv migrate --database "$DATABASE_URL"
rbv backfill --target /mnt/galleries --database "$DATABASE_URL"

Subsequent index runs will then benefit from all skip layers immediately.


Resetting face assignments

To discard all clustering results and start fresh (e.g. after tuning --threshold or after a large new batch is indexed):

-- Unassign all faces
UPDATE face_detections SET person_id = NULL;

-- Remove all person records (cascade-clears person_names too)
DELETE FROM persons;

Then re-run rbv cluster. The embeddings themselves are not touched, so re-clustering is fast.

To reset a single person's faces without affecting others:

UPDATE face_detections SET person_id = NULL WHERE person_id = '<uuid>';
DELETE FROM persons WHERE id = '<uuid>';