bench: reproducible benchmark harness + published numbers #22

Open
opened 2026-06-12 08:47:23 +00:00 by grenade · 1 comment
Owner

There are zero benchmarks in the tree. The 7/10 rating rests on unproven performance; this issue either proves it or tells us exactly what to fix.

Why: a reproducible benchmark table is the single biggest credibility artifact we can ship. "Near-frontier AI for mortals" needs numbers an outsider can verify.

Deliverables:

  • a committed harness (script or crate) that runs the same model + quant on helexa vs llama.cpp/Ollama (and vLLM where it fits consumer VRAM)
  • the harness measures what a single operator feels: time to first token (TTFT) at realistic agent prompt lengths (4k–32k), batch-1 decode tok/s, and cold-load time
  • runs on the real fleet (3060 / 4090 / 5090), including the heterogeneous-TP configuration peers cannot run
  • results published as doc/benchmarks.md; headline numbers surfaced in README
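
As a rough illustration of the two streaming metrics named above, the sketch below derives TTFT and batch-1 decode rate from per-token arrival timestamps. It is not the contents of the harness: `StreamMetrics` and `stream_metrics` are hypothetical names, and the timestamps are assumed to be client-side monotonic clock readings in seconds, which is weaker than the engine-truth measurement that #21 is meant to provide.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class StreamMetrics:
    ttft_s: float        # seconds from sending the request to the first token
    decode_tok_s: float  # decode rate over the tokens after the first one


def stream_metrics(request_start: float, token_times: Sequence[float]) -> StreamMetrics:
    """Derive TTFT and decode tok/s from per-token arrival times (hypothetical helper).

    request_start and token_times are assumed to come from the same
    monotonic clock, e.g. time.perf_counter(), in seconds.
    """
    if not token_times:
        raise ValueError("no tokens received")

    # TTFT covers queueing, prefill, and the first decode step.
    ttft = token_times[0] - request_start

    # The decode rate excludes the first token so prefill time does not
    # leak into it: N tokens span N - 1 inter-token intervals.
    decode_window = token_times[-1] - token_times[0]
    if len(token_times) < 2 or decode_window <= 0:
        decode_rate = float("nan")
    else:
        decode_rate = (len(token_times) - 1) / decode_window

    return StreamMetrics(ttft_s=ttft, decode_tok_s=decode_rate)
```

Keeping prefill out of the decode window is a design choice: it lets TTFT and decode tok/s be compared independently across prompt lengths.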

Dependencies: #21 (token-level metrics) for engine-truth measurement.

grenade added this to the 7 → 8 milestone 2026-06-12 08:47:23 +00:00
grenade added the p1-now label 2026-06-12 09:01:45 +00:00
Author
Owner

PR #32 lands the harness (script/bench.py) and the first published numbers (doc/benchmarks.md, 2026-06-12): 1.7B@3060 at 81 tok/s, 8B@4090 at 62 tok/s, and 27B@2×5090 Q6K TP=2 at a steady 35 tok/s with flat decode 128→4k. The 4k-prefill TTFT of 7.1 s is recorded as #23's before-number.

Remaining for full closure: the llama.cpp / Ollama comparison columns, using the same checkpoints on the same hosts (the harness accepts any OpenAI-compatible base URL via --label, so adding them is an install-and-run exercise), and cold-load timing (visible per-deploy in the journal + deploy validation; tracked under #1).
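
The reason one harness can cover helexa, llama.cpp and Ollama is that OpenAI-compatible servers stream chat completions in the same shape: server-sent-event lines of the form `data: {json}`, ending with `data: [DONE]`. The sketch below shows only that parsing step. `iter_content_deltas` is a hypothetical helper, not a function from script/bench.py, and one streamed chunk is not guaranteed to be exactly one token.

```python
import json
from typing import Iterable, Iterator


def iter_content_deltas(sse_lines: Iterable[str]) -> Iterator[str]:
    """Yield text deltas from an OpenAI-style streaming response (hypothetical helper).

    sse_lines is any iterable of decoded lines from the HTTP response body.
    """
    for raw in sse_lines:
        line = raw.strip()
        # Skip blank keep-alive lines and any non-data SSE fields.
        if not line.startswith("data:"):
            continue
        payload = line[len("data:"):].strip()
        # The stream ends with a literal [DONE] sentinel.
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            # Role-only and finish chunks carry no content and are skipped.
            text = (choice.get("delta") or {}).get("content")
            if text:
                yield text
```

Timestamping each yielded delta on the client gives arrival times that are comparable across engines, because every backend is measured through the same code path.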


Reference: helexa/helexa#22