helexa/helexa

bench: reproducible benchmark harness + published numbers #22


Open

opened 2026-06-12 08:47:23 +00:00 by grenade · 1 comment

grenade commented

2026-06-12 08:47:23 +00:00

Owner

There are zero benchmarks in the tree. The 7/10 rating rests on unproven performance; this issue either proves it or tells us exactly what to fix.

Why: the single biggest credibility artifact. "Near-frontier AI for mortals" needs a table an outsider can verify.

Deliverables:

- a committed harness (script or crate) that runs the same model + quant on helexa vs llama.cpp/Ollama (and vLLM where it fits consumer VRAM)
- measures what one operator feels: TTFT at realistic agent prompt lengths (4k–32k), batch-1 decode tok/s, cold-load time
- run on the real fleet (3060 / 4090 / 5090), including the heterogeneous-TP configuration peers cannot run
- results published as `doc/benchmarks.md`; headline numbers surfaced in README

Dependencies: #21 (token-level metrics) for engine-truth measurement.
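
For illustration only, the two client-side numbers named above (TTFT and batch-1 decode tok/s) could be derived from a timestamped OpenAI-style SSE stream roughly as follows. This is a sketch, not the harness itself: the names `StreamStats`, `parse_sse`, and `measure` are hypothetical, and it assumes the endpoint streams `data:` lines with `choices[].delta.content` and, optionally, a final frame carrying `usage.completion_tokens`.

```python
import json
from dataclasses import dataclass
from typing import Iterable, Iterator, Tuple


@dataclass
class StreamStats:
    ttft_s: float            # request start -> first content chunk
    decode_tok_s: float      # tokens after the first, per second
    completion_tokens: int   # from the usage frame, else the chunk count


def parse_sse(lines: Iterable[Tuple[float, str]]) -> Iterator[Tuple[float, dict]]:
    """Yield (timestamp, payload) for each `data:` line until `[DONE]`."""
    for ts, line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alives and SSE comments
        body = line[len("data:"):].strip()
        if body == "[DONE]":
            break
        yield ts, json.loads(body)


def measure(start: float, lines: Iterable[Tuple[float, str]]) -> StreamStats:
    """Compute TTFT and decode rate from (arrival_time, raw_line) pairs."""
    first = last = None
    chunks = 0
    usage_tokens = None
    for ts, payload in parse_sse(lines):
        usage = payload.get("usage")
        if usage:
            usage_tokens = usage.get("completion_tokens")
        for choice in payload.get("choices") or []:
            if (choice.get("delta") or {}).get("content"):
                if first is None:
                    first = ts
                last = ts
                chunks += 1
    if first is None:
        raise ValueError("stream produced no content")
    # Assumption: prefer the server-reported token count; fall back to
    # counting content chunks, which only approximates tokens.
    tokens = usage_tokens if usage_tokens is not None else chunks
    window = last - first
    rate = (tokens - 1) / window if window > 0 and tokens > 1 else 0.0
    return StreamStats(first - start, rate, tokens)
```

In a real run the timestamps would come from something like `time.perf_counter()` taken as each line arrives from a streaming HTTP client. Excluding the first token from the decode rate keeps prefill time out of the tok/s figure.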


grenade added this to the 7 → 8 milestone 2026-06-12 08:47:23 +00:00

grenade referenced this issue

2026-06-12 08:47:25 +00:00

release: tagged release + public writeup with benchmark numbers #26

grenade referenced this issue

2026-06-12 08:47:40 +00:00

perf(neuron): chunked delta-rule prefill for Gated DeltaNet #23

grenade referenced this issue

2026-06-12 08:47:41 +00:00

perf(neuron): speculative decoding with a small same-family drafter #25

grenade referenced this issue

2026-06-12 08:58:13 +00:00

Reduce TP=2 Q6K cold-load time for Qwen3.6-27B (~5 min today) #1

grenade referenced this issue

2026-06-12 08:58:14 +00:00

Vision: numerical validation against transformers reference #15

grenade added the p1-now label 2026-06-12 09:01:45 +00:00

grenade referenced this issue

2026-06-12 09:02:07 +00:00

tracking: prioritised path to closing every open issue #27

grenade referenced this issue from a commit

2026-06-12 12:11:56 +00:00

feat(gateway): per-request token metrics — TTFT and tok/s (#21)

grenade referenced this issue

2026-06-12 12:12:11 +00:00

feat(gateway): per-request token metrics — TTFT and tok/s (#21) #30

grenade referenced this issue

2026-06-12 12:37:56 +00:00

feat(neuron): emit final usage frame on SSE streams (stream_options.include_usage) #31

grenade referenced this issue from a commit

2026-06-12 12:39:16 +00:00

feat(bench): reproducible batch-1 benchmark harness + first fleet numbers (#22)

grenade referenced this issue

2026-06-12 12:39:34 +00:00

feat(bench): reproducible benchmark harness + first fleet numbers (#22) #32

grenade commented

2026-06-12 12:39:54 +00:00

Author

Owner

PR #32 lands the harness (`script/bench.py`) and the first published numbers (`doc/benchmarks.md`, 2026-06-12): 1.7B@3060 81 tok/s, 8B@4090 62 tok/s, 27B@2×5090 Q6K TP=2 at a steady 35 tok/s with flat decode 128→4k. The 4k-prefill TTFT of 7.1 s is recorded as #23's before-number.

Remaining for full closure: the llama.cpp / Ollama comparison columns (same checkpoints, same hosts; the harness accepts any OpenAI-compatible base URL via `--label`, so adding them is an install-and-run exercise), and cold-load timing (visible per-deploy in the journal + deploy validation; tracked under #1).
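
As a rough sketch of how the pending comparison columns might be laid out, the snippet below renders a markdown table with one column per engine and `n/a` wherever a run has not happened yet. The function `render_table` and the table layout are hypothetical and are not taken from `script/bench.py` or `doc/benchmarks.md`; the only numbers used are the helexa decode rates quoted above.

```python
from typing import Dict, List, Optional, Tuple


def render_table(
    configs: List[str],
    engines: List[str],
    tok_s: Dict[Tuple[str, str], Optional[float]],
) -> str:
    """Render decode tok/s as a markdown table, one column per engine."""
    header = "| config | " + " | ".join(engines) + " |"
    rule = "|---|" + "---:|" * len(engines)
    lines = [header, rule]
    for config in configs:
        cells = []
        for engine in engines:
            value = tok_s.get((config, engine))
            # Missing runs are shown explicitly rather than guessed.
            cells.append("n/a" if value is None else f"{value:.0f}")
        lines.append(f"| {config} | " + " | ".join(cells) + " |")
    return "\n".join(lines)


# helexa figures from this comment; other engines not yet measured.
measured = {
    ("1.7B@3060", "helexa"): 81.0,
    ("8B@4090", "helexa"): 62.0,
    ("27B@2×5090 Q6K TP=2", "helexa"): 35.0,
}
table = render_table(
    ["1.7B@3060", "8B@4090", "27B@2×5090 Q6K TP=2"],
    ["helexa", "llama.cpp", "Ollama"],
    measured,
)
print(table)
```

Keeping unmeasured cells as `n/a` means the table can be published before the comparison runs land without implying results that do not exist.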


grenade referenced this issue from a commit

2026-06-12 12:46:35 +00:00

Merge pull request 'feat(bench): reproducible benchmark harness + first fleet numbers (#22)' (#32) from feat/22-benchmark-harness into main

grenade referenced this issue

2026-06-12 17:05:53 +00:00

feat(neuron): prefix KV caching across requests #11

grenade referenced this issue

2026-06-12 17:52:16 +00:00

perf(neuron): chunked delta-rule prefill for Gated DeltaNet (#23) #39

grenade referenced this issue

2026-06-12 20:12:12 +00:00

perf(neuron): chunked delta-rule prefill for Gated DeltaNet #23

grenade referenced this issue

2026-06-12 20:36:19 +00:00

feat(neuron): numerical validation against the transformers reference (#15) #41

grenade referenced this issue

2026-06-13 05:26:20 +00:00

perf(neuron): speculative decoding with a small same-family drafter #25

grenade referenced this issue from a commit

2026-06-13 05:30:26 +00:00

feat(neuron): speculative decoding — acceptance core + config (#25, phase 1)

grenade referenced this issue

2026-06-13 05:30:39 +00:00

feat(neuron): speculative decoding — acceptance core + config (#25, phase 1) #45


Reference: helexa/helexa#22