Files
helexa/doc
rob thijssen a2e73a8907
All checks were successful
CI / Format (push) Successful in 40s
CI / Format (pull_request) Successful in 38s
CI / CUDA type-check (push) Successful in 2m8s
CI / CUDA type-check (pull_request) Successful in 2m8s
CI / Clippy (push) Successful in 2m23s
CI / Test (pull_request) Successful in 3m54s
CI / Test (push) Successful in 6m23s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
CI / Clippy (pull_request) Successful in 4m23s
CI / Build cortex SRPM (pull_request) Has been skipped
CI / Publish cortex to COPR (pull_request) Has been skipped
CI / Build neuron SRPM (pull_request) Has been skipped
CI / Publish neuron to COPR (pull_request) Has been skipped
CI / Bump version in source (pull_request) Has been skipped
feat(bench): reproducible batch-1 benchmark harness + first fleet numbers (#22)
script/bench.py: stdlib-only, works against any OpenAI-compatible /v1
endpoint (helexa, llama.cpp, Ollama, vLLM) so cross-engine tables are
a concatenation via the --label column. Measures the operator-felt
trio per (model, prompt-size) cell: TTFT (first SSE content chunk),
decode tok/s (visible tokens over the first→last chunk window,
chunk-per-token engine invariant since streaming usage frames aren't
emitted yet — #31), total wall-clock. Medians over N runs after one
warmup; append-only JSONL for longitudinal tracking.

Measurement traps found against the live fleet and handled:
- thinking models burn the budget invisibly (reasoning deltas are
  off-wire by default) — the prompt appends Qwen's /no_think soft
  switch
- short coalesced replies collapse the decode window to one TCP read
  — rates require a ≥200 ms window and the prompt demands ~300 words

doc/benchmarks.md: method, fleet table, and the first published
numbers (2026-06-12, 8f6f1d3): 1.7B@3060 81 tok/s, 8B@4090 62 tok/s,
27B@2×5090 Q6K TP=2 35 tok/s with flat decode from 128→4k context —
and the 7.1 s 4k-prefill TTFT recorded as #23's before-number.

Refs #22 (competitor baselines still pending — the harness is ready
for them)

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 15:39:13 +03:00
..