helexa/cortex

rob thijssen b179204fd3

build-prerelease / Package helexa-neuron-ada RPM (push) Blocked by required conditions
CI / Format (push) Successful in 34s
CI / Clippy (push) Successful in 2m12s
build-prerelease / Resolve version stamps (push) Successful in 3m41s
CI / Test (push) Successful in 5m1s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build neuron-blackwell (push) Successful in 3m32s
build-prerelease / Build neuron-ampere (push) Successful in 5m20s
build-prerelease / Build cortex binary (push) Successful in 12m20s
build-prerelease / Build neuron-ada (push) Successful in 5m17s
build-prerelease / Package cortex RPM (push) Successful in 1m25s
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been cancelled
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been cancelled

refactor(neuron): phase 2 — single-GPU forward + clear_kv route through device worker

Second slice of the per-device CUDA context-ownership refactor planned at
~/.claude/plans/plan-the-per-device-worker-abstract-micali.md. The two
spawn_blocking sites in `chat_completion` and `chat_completion_stream`
now route through the device worker thread on CUDA loads. CPU loads
keep the existing spawn_blocking + `Arc<Mutex<ModelArch>>` path; there's
no context to own and the channel hop would only add latency.
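The commit splits execution by load type and (in the bullets that follow) encodes that split as two mutually exclusive `LoadedModel` fields, `arch` and `arch_handle`. A minimal sketch of the resulting routing decision, assuming simplified stand-ins for `ModelArch` and `ArchHandle`; the `Route` enum and `route` function are hypothetical names, not part of the crate, and treating "both set" or "neither set" as an error is an assumption about how the invariant might be enforced.

```rust
use std::sync::{Arc, Mutex};

/// Hypothetical stand-ins for the real neuron types.
pub struct ModelArch;

#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub struct ArchHandle(pub u64);

/// Mirrors the two mutually exclusive fields described in the commit.
pub struct LoadedModel {
    pub arch: Option<Arc<Mutex<ModelArch>>>,
    pub arch_handle: Option<ArchHandle>,
}

/// Hypothetical: which execution path a request should take.
pub enum Route {
    /// CUDA load: send jobs to the device worker that owns the context.
    DeviceWorker(ArchHandle),
    /// CPU load: no context to own, so lock the arch on a blocking thread.
    SpawnBlocking(Arc<Mutex<ModelArch>>),
}

/// Picks a path from whichever field is set, rejecting invalid states.
pub fn route(model: &LoadedModel) -> Result<Route, &'static str> {
    match (&model.arch, model.arch_handle) {
        (None, Some(h)) => Ok(Route::DeviceWorker(h)),
        (Some(arch), None) => Ok(Route::SpawnBlocking(Arc::clone(arch))),
        (Some(_), Some(_)) => Err("arch and arch_handle are both set"),
        (None, None) => Err("model has neither arch nor arch_handle"),
    }
}
```

Returning an `Arc` clone leaves the CPU path as it was (a caller would move it into `spawn_blocking` and lock it there), while the CUDA path only ever carries a `Copy` handle across tasks.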

What this phase changes:

- `Job` gains `TransferIn`, `DropArch`, `ClearKv`, `ForwardLogits`. The
  worker's dispatch state grows a `HashMap<ArchHandle, Box<ModelArch>>`
  slab and a `next_handle` counter for minting opaque handles.
- `LoadedModel.arch: Arc<Mutex<ModelArch>>` → `Option<Arc<Mutex<ModelArch>>>`,
  plus a new `arch_handle: Option<ArchHandle>` field. The two are
  mutually exclusive: CUDA loads set `arch_handle = Some(_)` after
  transferring the boxed arch into the worker's slab; CPU loads keep
  `arch = Some(_)` for the legacy spawn_blocking path.
- New `run_inference_via_worker` and `stream_inference_via_worker`
  drive the prefill + decode loop by sending `Job::ForwardLogits` per
  step; the worker copies the resulting `[vocab]` logits to a
  CPU-side `Vec<f32>` before replying, so the async caller never holds a
  device-resident tensor. `apply_repeat_penalty` and
  `LogitsProcessor::sample` then run on a CPU candle tensor, so no CUDA
  context gets bound as a side effect on tokio worker threads.
- `logits_health_slice(&[f32])` complements the existing
  `logits_health(&Tensor)` so the new worker paths can compute
  health stats directly from the CPU vec.
- `unload_model` for the single-GPU CUDA path now sends
  `Job::DropArch { handle }` to the worker so the `Box<ModelArch>`
  drops on the thread that allocated its CUDA tensors. The `Drop` runs
  while that thread's CUDA context is bound, so the memory is freed on
  the context that allocated it.
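The bullets above describe a worker thread that owns every `ModelArch` in a handle-keyed slab and services `Job`s sent to it. A minimal sketch of that ownership pattern, assuming a plain `std::thread` plus `std::sync::mpsc` channels; the real worker presumably binds a CUDA context and holds candle tensors, neither of which is modeled here. The `Job` variant names, the slab, and `next_handle` follow the commit; `ModelArch` is a hypothetical toy whose `forward` returns one-hot logits, and `greedy_decode` is a hypothetical simplification of `run_inference_via_worker` (no prefill, no repeat penalty, argmax in place of `LogitsProcessor::sample`). The toy `Drop` reports its thread ID so a caller can check that `DropArch` frees the arch on the worker thread.

```rust
use std::collections::HashMap;
use std::sync::mpsc::{channel, Sender};
use std::thread::{self, JoinHandle, ThreadId};

/// Opaque handle the caller holds instead of the arch itself.
pub type ArchHandle = u64;

/// Hypothetical toy stand-in for `ModelArch`; the real one holds CUDA tensors.
pub struct ModelArch {
    vocab: usize,
    kv_len: usize,
    drop_log: Sender<ThreadId>, // reports which thread ran Drop
}

impl ModelArch {
    /// Placeholder forward: one-hot logits predicting `last_token + 1`.
    fn forward(&mut self, last_token: u32) -> Vec<f32> {
        self.kv_len += 1;
        let next = (last_token as usize + 1) % self.vocab;
        (0..self.vocab).map(|i| if i == next { 1.0 } else { 0.0 }).collect()
    }
}

impl Drop for ModelArch {
    fn drop(&mut self) {
        // Real code would free device memory here; the toy records the thread.
        let _ = self.drop_log.send(thread::current().id());
    }
}

pub enum Job {
    TransferIn { arch: Box<ModelArch>, reply: Sender<ArchHandle> },
    ForwardLogits { handle: ArchHandle, token: u32, reply: Sender<Option<Vec<f32>>> },
    ClearKv { handle: ArchHandle },
    DropArch { handle: ArchHandle, reply: Sender<()> },
}

/// Spawns the worker thread that owns every arch in its slab.
pub fn spawn_worker() -> (Sender<Job>, JoinHandle<()>) {
    let (tx, rx) = channel::<Job>();
    let join = thread::spawn(move || {
        let mut slab: HashMap<ArchHandle, Box<ModelArch>> = HashMap::new();
        let mut next_handle: ArchHandle = 0;
        for job in rx {
            match job {
                Job::TransferIn { arch, reply } => {
                    let handle = next_handle;
                    next_handle += 1;
                    slab.insert(handle, arch);
                    let _ = reply.send(handle);
                }
                Job::ForwardLogits { handle, token, reply } => {
                    // Only a CPU-side Vec<f32> leaves the worker.
                    let _ = reply.send(slab.get_mut(&handle).map(|a| a.forward(token)));
                }
                Job::ClearKv { handle } => {
                    if let Some(arch) = slab.get_mut(&handle) {
                        arch.kv_len = 0;
                    }
                }
                Job::DropArch { handle, reply } => {
                    // Removing the Box from the slab runs Drop on this thread.
                    drop(slab.remove(&handle));
                    let _ = reply.send(());
                }
            }
        }
        // Channel closed: anything left in `slab` also drops on this thread.
    });
    (tx, join)
}

/// Hypothetical decode loop: one ForwardLogits round trip per step,
/// then greedy argmax over the CPU-side logits.
pub fn greedy_decode(worker: &Sender<Job>, handle: ArchHandle, start: u32, steps: usize) -> Vec<u32> {
    let mut out = Vec::with_capacity(steps);
    let mut token = start;
    for _ in 0..steps {
        let (reply, rx) = channel();
        worker.send(Job::ForwardLogits { handle, token, reply }).expect("worker exited");
        let logits = rx.recv().expect("worker exited").expect("unknown handle");
        token = logits
            .iter()
            .enumerate()
            .max_by(|a, b| a.1.total_cmp(b.1))
            .map(|(i, _)| i as u32)
            .expect("empty logits");
        out.push(token);
    }
    out
}
```

Because every `Job` carries its own reply `Sender`, the caller never borrows into the slab; it holds only a `u64` handle, and all construction-independent destruction happens on the one thread that owns the arches.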

What this phase doesn't touch (yet):

- TP forward, TP load, and NCCL bring-up: still on spawn_blocking; deferred to phase 3.
- Single-GPU model load — still spawn_blocking, followed by a
  `Job::TransferIn` to move the freshly-built `ModelArch` into the
  worker slab. Phase 4 moves the load itself onto the worker thread
  and eliminates the bootstrap TransferIn.
- The `device_vram_mb` / `cuda_mem_mb` helpers — still present and
  used by the construction-time logs running inside spawn_blocking
  loads. Phase 4 cleanup folds them into `dispatch.rs`.

Public API unchanged. fmt + clippy clean; 37 lib tests + all
integration tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-27 09:55:08 +03:00

src

refactor(neuron): phase 2 — single-GPU forward + clear_kv route through device worker

2026-05-27 09:55:08 +03:00

tests

feat(neuron): bind listener before pre-warm, surface activation in /health

2026-05-26 15:18:04 +03:00

build.rs

feat(stage-8d-1): import mistralrs GDN CUDA kernels — build infra only

2026-05-21 11:34:11 +03:00

Cargo.toml

feat(stage-8d-7): direct safetensors fused-region loader

2026-05-21 17:49:35 +03:00