32 Commits

Author SHA1 Message Date
a435d3a99d Define concrete 'promising' threshold and enforce indicator diversity in ledger-informed prompt
- Replace vague "promising metrics" with avg_sharpe >= 0.5 AND >= 10 trades per instrument
- Add indicator-family diversity rule: if all prior strategies share the same core indicator
  (e.g. all Bollinger Bands), the first strategy of the new run must use a different family
- Give explicit examples of alternative families: MACD, ATR breakout, volume spike,
  donchian channel breakout, stochastic oscillator
- Extend the no-repeat ban to strategies with fewer than 5 trades per instrument
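
A minimal sketch of the two rules above, assuming "per instrument" means every instrument must reach the trade count. `InstrumentResult`, `is_promising` and `needs_new_family` are hypothetical names for illustration, not the repository's actual types.

```rust
/// Per-instrument outcome of one backtest (hypothetical shape).
pub struct InstrumentResult {
    pub sharpe: f64,
    pub trades: u32,
}

/// "Promising" per the rule above: average Sharpe of at least 0.5 and
/// at least 10 trades on every instrument.
pub fn is_promising(results: &[InstrumentResult]) -> bool {
    if results.is_empty() {
        return false;
    }
    let avg_sharpe = results.iter().map(|r| r.sharpe).sum::<f64>() / results.len() as f64;
    avg_sharpe >= 0.5 && results.iter().all(|r| r.trades >= 10)
}

/// True when every prior strategy uses the same indicator family, in which
/// case the first strategy of a new run should pick a different one.
pub fn needs_new_family(prior_families: &[&str]) -> bool {
    match prior_families.split_first() {
        Some((first, rest)) => rest.iter().all(|f| f == first),
        None => false,
    }
}
```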

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-10 14:21:55 +02:00
b476199de8 Fix ledger context being overridden by prescriptive initial prompt
The 13:20:03 run showed the ledger context was counterproductive: the
initial prompt's "Start with a multi-timeframe trend-following approach"
instruction caused the model to ignore the prior summary and repeat
EMA50-based strategies that produced 0 trades across all 15 iterations.

Two fixes:
- When prior_summary is present, replace the prescriptive starting
  instruction with one that explicitly defers to the ledger: refine the
  best prior strategy or try a different approach if all prior results
  were poor. Prevents the fixed instruction from overriding the context.
- Cap ledger entries per unique strategy at 3. A strategy repeated across
  11 iterations would contribute 33 entries, drowning out other approaches
  in the prior summary. 3 entries (one per instrument) is sufficient.
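
The per-strategy cap could look roughly like this. `LedgerEntry` and `cap_per_strategy` are hypothetical names, and the real ledger presumably carries more fields.

```rust
use std::collections::HashMap;

/// One ledger line: which strategy ran and the run it produced (hypothetical shape).
#[derive(Debug, Clone, PartialEq)]
pub struct LedgerEntry {
    pub strategy: String,
    pub run_id: String,
}

/// Keep at most `cap` entries per unique strategy, preserving ledger order.
pub fn cap_per_strategy(entries: &[LedgerEntry], cap: usize) -> Vec<LedgerEntry> {
    let mut seen: HashMap<&str, usize> = HashMap::new();
    let mut kept = Vec::new();
    for entry in entries {
        let count = seen.entry(entry.strategy.as_str()).or_insert(0);
        if *count < cap {
            *count += 1;
            kept.push(entry.clone());
        }
    }
    kept
}
```

With a cap of 3, the 33 entries from a strategy repeated across 11 iterations shrink to 3, so they no longer crowd out other strategies.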

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-10 13:54:35 +02:00
d76d3b9061 Use write_all for ledger entries to improve concurrent-write safety
writeln!(f, ...) makes two syscalls (data + newline) which can interleave
between concurrent processes even with O_APPEND. Serialise entry to bytes
and append the newline before write_all() so the entire entry lands in a
single write() syscall, which O_APPEND makes atomic on Linux local
filesystems for typical entry sizes.
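
A sketch of the pattern described above, using only the standard library. `append_entry` is a hypothetical name. Note that `write_all` can still loop on short writes, so the single-syscall behaviour is an expectation for small buffers on regular files, not a guarantee.

```rust
use std::fs::OpenOptions;
use std::io::{self, Write};
use std::path::Path;

/// Append one JSONL entry with the trailing newline already in the buffer,
/// so the whole entry is handed to the OS in one write_all() call.
pub fn append_entry(path: &Path, json_line: &str) -> io::Result<()> {
    let mut buf = Vec::with_capacity(json_line.len() + 1);
    buf.extend_from_slice(json_line.as_bytes());
    buf.push(b'\n');

    // append(true) opens the file with O_APPEND on Unix.
    let mut file = OpenOptions::new().create(true).append(true).open(path)?;
    file.write_all(&buf)
}
```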

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-10 13:12:38 +02:00
0945c94cc8 Add --ledger-file arg for explicit ledger path control
Defaults to <output_dir>/run_ledger.jsonl as before.
Pass --ledger-file to read from (and write to) a specific ledger,
enabling multiple ledger files to seed different search campaigns
or merge results from separate runs.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-10 13:10:22 +02:00
a0316be798 Add cross-run learning via run ledger and compare endpoint
Persist strategy + run_id to results/run_ledger.jsonl after each backtest.
On startup, load the ledger, fetch metrics via the new compare endpoint
(batched in groups of 50), group by strategy, rank by avg Sharpe, and
inject a summary of the top 5 and worst 3 prior strategies into the
iteration-1 prompt.
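
The batching and ranking steps might be sketched as follows. All function names are hypothetical, and the `(strategy, sharpe)` rows stand in for whatever the compare endpoint actually returns.

```rust
use std::collections::HashMap;

/// Split run IDs into request-sized batches (the commit uses groups of 50).
pub fn batches(run_ids: &[String], size: usize) -> Vec<&[String]> {
    run_ids.chunks(size.max(1)).collect()
}

/// Group (strategy, sharpe) rows by strategy and return (strategy, avg Sharpe)
/// pairs sorted from best to worst.
pub fn rank_by_avg_sharpe(rows: &[(String, f64)]) -> Vec<(String, f64)> {
    let mut sums: HashMap<&str, (f64, usize)> = HashMap::new();
    for (strategy, sharpe) in rows {
        let slot = sums.entry(strategy.as_str()).or_insert((0.0, 0));
        slot.0 += sharpe;
        slot.1 += 1;
    }
    let mut ranked: Vec<(String, f64)> = sums
        .into_iter()
        .map(|(s, (sum, n))| (s.to_string(), sum / n as f64))
        .collect();
    ranked.sort_by(|a, b| b.1.total_cmp(&a.1));
    ranked
}

/// Split a ranked list into the top `n_best` and the worst `n_worst`,
/// never reporting the same strategy in both groups.
pub fn best_and_worst(
    ranked: &[(String, f64)],
    n_best: usize,
    n_worst: usize,
) -> (&[(String, f64)], &[(String, f64)]) {
    let best_end = n_best.min(ranked.len());
    let worst_start = ranked.len().saturating_sub(n_worst).max(best_end);
    (&ranked[..best_end], &ranked[worst_start..])
}
```

Calling `best_and_worst(&ranked, 5, 3)` yields the two groups that go into the prompt summary.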

Also consumes the enriched result_summary fields from swym patch e47c18:
sortino_ratio, calmar_ratio, max_drawdown, pnl_return, avg_win, avg_loss,
max_win, max_loss, avg_hold_duration_secs. Sortino and max_drawdown are
appended to summary_line() when present.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-10 13:05:39 +02:00
609d64587b docs: cross-run learnings plan
2026-03-10 13:04:13 +02:00
6692bdb490 Prompt: fix method vs kind confusion causing 11/15 validation failures
The 12:11:39 run shows the model using {"method":"position_quantity"} for
every sell rule despite the existing CRITICAL note. Root cause: a contradictory
anti-pattern ("Never use an expression object for quantity") was fighting the
correct guidance, and the method/kind distinction wasn't emphatic enough.

- Expand the CRITICAL note to explicitly contrast: buy uses SizingMethod
  ("method"), sell uses Expr ("kind") — they are different object types.
- Remove the contradictory "never use an expression object" anti-pattern
  which conflicted with position_quantity and SizingMethod objects.
- Add a final anti-pattern bullet as a second reminder of the same mistake.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-10 12:24:57 +02:00
36689e3fbb Prompt: fix field+offset kind omission and add interval guidance
Two gaps revealed by the 2026-03-10T11:42:49 run:
- Iterations 11-15 all failed with "missing field 'kind'" when the model
  wrote {"field":"volume","offset":-1} without the required "kind":"field".
  Expand the existing kind-required note with explicit offset examples.
- Iteration 10 switched to 15m unprompted and got sharpe=-0.41 from
  overtrading. Add anti-pattern note: don't change interval when sharpe
  is negative — fix the signal logic instead.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-10 12:09:18 +02:00
87d31f8d7e Use flat result_summary fields from swym patch 8fb410311
BacktestResult::from_response now reads total_positions, winning_positions,
losing_positions, win_rate, profit_factor, net_pnl, total_pnl, sharpe_ratio,
and total_fees directly from the top-level result_summary object instead of
deriving them from backtest_metadata + balance delta.

Removes the quote/initial_balance parameters that were only needed for the
workaround. Restores the full summary_line format with all metrics.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-10 11:41:53 +02:00
3892ab37c1 fix: parse actual result_summary structure (backtest_metadata + assets)
The API doc described a flat result_summary that doesn't exist yet in the
deployed backend. The actual shape is:
  { backtest_metadata: { position_count }, assets: [...], condition_audit_summary }

- total_positions from backtest_metadata.position_count
- net_pnl from assets[quote].tear_sheet.balance_end.total - initial_balance
- win_rate, profit_factor, sharpe_ratio, total_fees, avg_bars_in_trade
  remain None until the API adds them

from_response() takes quote and initial_balance again to locate the
right asset and compute PnL. summary_line() only prints metrics that
are actually present. is_promising() falls back to net_pnl>0 + trades
when sharpe is unavailable.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-10 10:32:13 +02:00
85896752f2 fix: ValidationError.path optional, correct position_quantity usage in prompts
- ValidationError.path is Option<String> — the API omits it for top-level
  structural errors. The required String was causing every validate call to
  fail to deserialize, falling through to submission instead of catching errors.

- Log path as "(top-level)" when absent

- Prompts: add explicit CRITICAL note that {"method":"position_quantity"} is
  wrong — position_quantity is an Expr (uses "kind"), not a SizingMethod (uses
  "method"). The new SizingMethod examples caused the model to apply
  "method" to every exit rule throughout the run.

- Prompts: note that fixed_sum has no multiplier field (additionalProperties)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-10 09:45:17 +02:00
ee260ea4d5 fix: parse flat result_summary structure per updated API doc
The API result_summary is a flat object with top-level fields
(total_positions, win_rate, profit_factor, net_pnl, sharpe_ratio, etc.)
not a nested backtest_metadata/instruments map. This was causing all
metrics to parse as None/zero for every completed run.

- Rewrite BacktestResult::from_response() to read flat fields directly
- Replace parse_ratio_value/parse_decimal_str with a single parse_number()
  that accepts both JSON numbers and decimal strings
- Populate winning_positions, losing_positions, total_fees, avg_bars_in_trade
  (previously always None)
- Simplify from_response signature — exchange/base/quote no longer needed
- Add expected_count and coverage_pct to CandleCoverage struct
- Update all example sell rules to use position_quantity instead of "0.01"
- Note that "9999" is a valid sell-all alias (auto-capped by the API)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-10 09:37:55 +02:00
3f8d4de7fb feat: add declarative SizingMethod types from upstream schema
Upstream added three new quantity sizing objects alongside DecimalString and Expr:
- fixed_sum: buy N quote-currency worth at current price
- percent_of_balance: buy N% of named asset's free balance
- fixed_units: buy exactly N base units (semantic alias for decimal string)

Update dsl-schema.json to include the three definitions and expand
Action.quantity.oneOf to reference all five valid forms.

Update prompts.rs Quantity section to present the declarative methods
as the preferred approach — they're cleaner, more readable, and
instrument-agnostic compared to raw Expr composition.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-10 09:33:43 +02:00
7e1ff51ae0 feat: validate endpoint integration, Expr quantity sizing, apply_func input field fix
- Add /api/v1/strategies/validate client to SwymClient; wire into agent loop
  before submission so all DSL errors are surfaced in one round-trip
- Update dsl-schema.json to upstream: quantity is now oneOf[DecimalString, Expr],
  ExprApplyFunc uses "input" field (renamed from "expr")
- Update prompts: document expression-based quantity sizing (fixed-fraction and
  ATR-based examples), fix apply_func to use "input" not "expr" throughout
- Remove unused ValidationError import

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-10 09:12:12 +02:00
5146b3f764 fix: replace negligible 0.001 quantity with meaningful sizing guidance
The previous example quantity "0.001" represented <1% of the $10k
initial balance for BTC and near-zero exposure for ETH/SOL, making
P&L and Sharpe results statistically meaningless.

- Update Quantity section with instrument-appropriate reference values
  (BTC: 0.01 ≈ $800, ETH: 3.0 ≈ $600, SOL: 50.0 ≈ $700)
- Replace "0.001" with "0.01" in all four working examples
- Explain that 5–10% of $10k initial balance is the sizing target
- Explicitly warn against "0.001" as it produces negligible exposure

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-10 07:41:28 +02:00
759439313e fix: two Bollinger Band DSL errors from 50-iteration run
- bollinger_upper/lower func Exprs must NOT include a "field" parameter;
  they compute from close internally. Setting "field":"bollinger_upper"
  causes API rejection: expected one of open/high/low/close/volume.
- bollinger Condition "band" only accepts "above_upper" or "below_lower";
  "above_lower" and "below_upper" are invalid variants.

Both errors appeared repeatedly across the 50-iteration run, causing
failed backtest submissions on every Bollinger crossover strategy.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-10 07:39:09 +02:00
9a7761b452 fix: add hma/ma to unsupported list, clarify quantity exit semantics
- Add `hma` (Hull MA) and generic `ma` to unsupported func names —
  both were used by R1 and rejected by the API
- Note that Hull MA can be approximated via apply_func with wma
- Add `"all"` to the quantity placeholder blacklist; explain that exit
  rules must repeat the entry decimal — there is no "close all" concept

Observed in run 2026-03-09T20:10:55: 2 iterations failed on hma/ma,
3 iterations skipped by client-side validation on quantity="all".

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-09 20:23:30 +02:00
8d53d6383d fix: correct DSL mistakes from observed R1 failures
- ADX: clarify it is a FuncName inside {"kind":"func","name":"adx",...},
  not a Condition kind — with inline usage example (ADX > 25 filter)
- Expr "kind" field: add explicit note that every Expr object requires
  "kind"; {"field":"close"} without "kind" is rejected by the API
- MACD: add Example 4 showing full crossover strategy composed from
  bin_op(sub, ema12, ema26) and apply_func(ema,9) as signal line
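
For reference, the arithmetic that composition expresses can be sketched numerically. This is not the swym DSL: it is a plain Rust illustration with hypothetical helper names, and it seeds each EMA with the first value, which is one common convention but an assumption here.

```rust
/// Exponential moving average seeded with the first value.
pub fn ema(values: &[f64], period: usize) -> Vec<f64> {
    let alpha = 2.0 / (period as f64 + 1.0);
    let mut out = Vec::with_capacity(values.len());
    let mut prev: Option<f64> = None;
    for &v in values {
        let next = match prev {
            Some(p) => alpha * v + (1.0 - alpha) * p,
            None => v,
        };
        out.push(next);
        prev = Some(next);
    }
    out
}

/// MACD line = EMA(12) - EMA(26); signal line = EMA(9) of the MACD line.
pub fn macd(closes: &[f64]) -> (Vec<f64>, Vec<f64>) {
    let fast = ema(closes, 12);
    let slow = ema(closes, 26);
    let line: Vec<f64> = fast.iter().zip(&slow).map(|(f, s)| f - s).collect();
    let signal = ema(&line, 9);
    (line, signal)
}
```

A crossover strategy then enters when the MACD line crosses above the signal line.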

All three mistakes were observed across consecutive R1-32B runs and
caused repeated API submission failures. Each prompt addition follows
the same pattern as the successful bollinger_upper fix.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-09 20:11:05 +02:00
55e41b6795 fix: log R1 thinking, catch repeated DSL errors, add unsupported indicators
Three improvements from the 2026-03-09T18:45:04 run analysis:

**R1 thinking visibility (claude.rs, agent.rs)**
extract_think_content() returns the raw <think> block content before it
is stripped. agent.rs logs it at DEBUG level so 'RUST_LOG=debug' lets
you see why the model keeps repeating a mistake — currently the think
block is silently discarded after stripping.

**Prompt: unsupported indicators and bollinger_upper Expr mistake (prompts.rs)**
- bollinger_upper / bollinger_lower used as {"kind":"bollinger_upper",...}
  was the dominant failure in iters 9-15. Added explicit correction:
  use {"kind":"func","name":"bollinger_upper","period":20} in Expr context,
  never as a standalone kind.
- roc, hma, vwap, macd, cci, stoch are NOT in the swym schema. Added a
  clear "NOT supported" list alongside the supported func names.

**Repeated API error detection in diagnose_history (agent.rs)**
If the same "unknown variant `X`" error appears 2+ times in the last 4
iterations, a targeted diagnosis note is emitted naming the bad variant
and pointing to the DSL reference. This surfaces in the next iteration
prompt so the model gets actionable feedback before it spends more of the
backtest budget on the same mistake.
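
A rough sketch of that detection, assuming error messages are kept as plain strings per iteration. `unknown_variant` and `repeated_variant` are hypothetical names, not the actual helpers in agent.rs.

```rust
/// Extract X from a message containing "unknown variant `X`".
fn unknown_variant(msg: &str) -> Option<&str> {
    let rest = msg.split("unknown variant `").nth(1)?;
    rest.split('`').next()
}

/// Return the first variant name that appears in at least `min_hits` of the
/// last `window` error messages, if any.
pub fn repeated_variant<'a>(
    errors: &'a [String],
    window: usize,
    min_hits: usize,
) -> Option<&'a str> {
    let start = errors.len().saturating_sub(window);
    let recent: Vec<&str> = errors[start..]
        .iter()
        .filter_map(|e| unknown_variant(e))
        .collect();
    let hit = recent
        .iter()
        .copied()
        .find(|v| recent.iter().filter(|other| *other == v).count() >= min_hits);
    hit
}
```

The commit's rule corresponds to `repeated_variant(&errors, 4, 2)`.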

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-09 18:58:50 +02:00
51e452b607 feat: discover max_output_tokens from server at startup
Instead of hardcoding per-family token budgets, ClaudeClient queries the
server at startup and sets max_output_tokens = context_length / 2.

Two discovery strategies, tried in order:
1. LM Studio /api/v1/models — returns loaded_instances[].config.context_length
   (the actually-configured context, e.g. 64000) and max_context_length
   (theoretical max, e.g. 131072). We prefer the loaded value.
2. OpenAI-compat /v1/models/{id} — used as fallback for non-LM Studio
   backends that expose context_length on the model object.

If both fail, the family default is kept (DeepSeekR1=32768, Generic=8192).

lmstudio_context_length() matches model IDs with and without quantization
suffixes (@q4_k_m etc.) so the --model flag doesn't need to be exact.

For the current R1-32B setup: loaded context=64000 → max_output_tokens=32000,
giving the thinking pass plenty of room while reserving half for input.
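
The selection logic reduces to a short fallback chain. The sketch below leaves out the HTTP calls and takes the two discovered values as plain options. Function names are hypothetical, and the suffix handling assumes quantization is always appended after an `@`.

```rust
/// Pick the output-token budget: half the discovered context length,
/// preferring the loaded value, then the fallback source, then the
/// family default when neither discovery strategy succeeded.
pub fn max_output_tokens(loaded: Option<u32>, fallback: Option<u32>, family_default: u32) -> u32 {
    loaded
        .or(fallback)
        .map(|ctx| ctx / 2)
        .unwrap_or(family_default)
}

/// Strip a quantization suffix such as "@q4_k_m" so model IDs compare equal.
pub fn base_model_id(id: &str) -> &str {
    id.split('@').next().unwrap_or(id)
}
```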

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-09 18:44:41 +02:00
89f7ba66e0 feat: model-family-aware token budgets and prompt style
Add ModelFamily enum (config.rs) detected from the model name:
- DeepSeekR1: matched on "deepseek-r1", "r1-distill" — R1 thinking blocks
  consume thousands of output tokens before the JSON; max_output_tokens
  raised to 32768 and HTTP timeout to 300s; prompt tells the model its
  <think> output is stripped and only the bare JSON is used
- Generic: previous behaviour (8192 tokens, 120s timeout)

ClaudeClient stores the detected family and uses it for max_tokens and
the request timeout. family() accessor lets the caller (agent.rs) pass
it into system_prompt().

prompts::system_prompt() now accepts &ModelFamily and injects a
family-specific "output format" section in place of the hardcoded
"How to respond" block. New families can be added by extending the
enum and the match arms without touching prompt logic elsewhere.
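
A minimal sketch of such an enum, using the limits stated above. The case-insensitive matching and the method names are assumptions, not confirmed details of config.rs.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ModelFamily {
    DeepSeekR1,
    Generic,
}

impl ModelFamily {
    /// Detect the family from the model name (case-insensitive here by assumption).
    pub fn detect(model: &str) -> Self {
        let name = model.to_lowercase();
        if name.contains("deepseek-r1") || name.contains("r1-distill") {
            ModelFamily::DeepSeekR1
        } else {
            ModelFamily::Generic
        }
    }

    /// Default output-token budget per family.
    pub fn max_output_tokens(self) -> u32 {
        match self {
            ModelFamily::DeepSeekR1 => 32768,
            ModelFamily::Generic => 8192,
        }
    }

    /// HTTP request timeout in seconds per family.
    pub fn timeout_secs(self) -> u64 {
        match self {
            ModelFamily::DeepSeekR1 => 300,
            ModelFamily::Generic => 120,
        }
    }
}
```

Adding a family means one new variant plus one arm in each `match`, which the compiler enforces through exhaustiveness checking.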

Also: log full anyhow cause chain (:#) on JSON extraction failure and
show response length alongside the truncated preview, to make future
diagnosis easier.

Root cause of the 2026-03-09T18:29:22 run failure: R1's thinking tokens
counted against max_tokens:8192, leaving only ~500 chars for the actual
JSON, which was always truncated mid-object.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-09 18:39:51 +02:00
6f4f864d28 fix: increase max_tokens to 8192 for R1 reasoning overhead
R1 models use 500-2000 tokens for <think> blocks before the final
response. 4096 was too tight — the model would exhaust the budget
mid-thought and never emit the JSON.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-09 18:17:48 +02:00
185cb4586e fix: strip R1 think blocks before JSON extraction
DeepSeek-R1 models emit <think>...</think> before their actual response.
The brace-counting extractor would grab the first { inside the thinking
block (which contains partial JSON fragments) rather than the final
strategy JSON.

strip_think_blocks() removes all <think>...</think> sections including
unterminated blocks (truncated responses), leaving only the final output
for extract_json to process.
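
One way to write such a function with plain string searches is sketched below. It follows the behaviour described above but is not necessarily how the repository implements it.

```rust
/// Remove every <think>...</think> section. An unterminated <think>
/// (truncated response) drops everything from the tag to the end.
pub fn strip_think_blocks(text: &str) -> String {
    const OPEN: &str = "<think>";
    const CLOSE: &str = "</think>";

    let mut out = String::with_capacity(text.len());
    let mut rest = text;
    while let Some(start) = rest.find(OPEN) {
        // Keep the text before the opening tag.
        out.push_str(&rest[..start]);
        match rest[start..].find(CLOSE) {
            // Skip past the closing tag and continue scanning.
            Some(end) => rest = &rest[start + end + CLOSE.len()..],
            // No closing tag: discard the remainder.
            None => rest = "",
        }
    }
    out.push_str(rest);
    out
}
```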

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-09 18:17:06 +02:00
b947f48b01 feat: client-side validation, cycling detection, quantity prompt fix
- validate_strategy(): hard error if quantity is not a parseable decimal
  (catches "ATR_SIZED" etc. before sending to swym API); soft warning if
  a sell rule has no entry_price stop-loss or no bars_since_entry time exit
- Hard validation errors skip the backtest and feed errors back to the LLM
  via IterationRecord.validation_notes included in summary()
- json_contains_kind(): recursive helper to search strategy JSON tree
- diagnose_history(): add cycling detection — triggers is_converged when
  any avg_sharpe value appears 3+ times anywhere in history (not only as a
  streak in the last 3 iterations), catching the alternating RSI<30 / RSI<25
  pattern seen in the latest run
- prompts: clarify that quantity must parse as a float; list invalid
  placeholder strings ("ATR_SIZED", "FULL_BALANCE", "dynamic", etc.)
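
The quantity check and the cycling check can be sketched as follows. Both function names are hypothetical. Rejecting non-finite values and bucketing Sharpe values to four decimal places are assumptions made so the example is well defined.

```rust
use std::collections::HashMap;

/// Hard check: the quantity string must parse as a finite float.
/// (Rust's parser accepts "NaN" and "inf", hence the is_finite guard.)
pub fn is_valid_quantity(q: &str) -> bool {
    q.trim()
        .parse::<f64>()
        .map(|v| v.is_finite())
        .unwrap_or(false)
}

/// Cycling: some avg_sharpe value occurs at least `min_repeats` times
/// anywhere in the history, not only in a trailing streak.
pub fn is_cycling(avg_sharpes: &[f64], min_repeats: usize) -> bool {
    let mut counts: HashMap<i64, usize> = HashMap::new();
    for s in avg_sharpes {
        // Bucket to 4 decimal places so float values can be counted.
        let key = (s * 10_000.0).round() as i64;
        let n = counts.entry(key).or_insert(0);
        *n += 1;
        if *n >= min_repeats {
            return true;
        }
    }
    false
}
```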

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-09 17:56:59 +02:00
e27aabae34 feat(agent): improve LLM feedback loop and convergence detection
Three related improvements to help the model learn and explore effectively:

Strategy JSON in history: include the compact strategy JSON in each
IterationRecord::summary() so the LLM knows exactly what was tested in
every past iteration, not just the outcome metrics. Without this the model
had no record of what it tried once conversation history was trimmed.

Rule comment in audit: include rule_comment from the condition audit in
the formatted audit string so the LLM can correlate hit-rate data with
the rule's stated purpose.

Convergence detection and anti-anchoring: diagnose_history() now returns
(String, bool) where the bool signals that the last 3 iterations had
avg_sharpe spread < 0.03 (model stuck in local optimum). When converged:
- Emit a ⚠ CONVERGENCE DETECTED note listing untried candle intervals
- Suppress best_so_far JSON to break the anchoring effect that was
  causing the model to produce near-identical strategies for 13+ iterations
- Add a targeted "try a different approach" instruction
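
The spread test itself is small. A sketch under the stated thresholds (window of 3, spread below 0.03), with `is_converged` as a hypothetical free function instead of the boolean returned by diagnose_history():

```rust
/// Converged when the last `window` avg_sharpe values span less than
/// `max_spread` from smallest to largest.
pub fn is_converged(avg_sharpes: &[f64], window: usize, max_spread: f64) -> bool {
    if window == 0 || avg_sharpes.len() < window {
        return false;
    }
    let recent = &avg_sharpes[avg_sharpes.len() - window..];
    let max = recent.iter().copied().fold(f64::NEG_INFINITY, f64::max);
    let min = recent.iter().copied().fold(f64::INFINITY, f64::min);
    max - min < max_spread
}
```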

Also add volume-as-field clarification to the DSL mistakes section in
the system prompt, fixing the "unknown variant `volume`" submit error.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-09 14:38:07 +02:00
fb1145acae fix(swym): parse result_summary from actual API response structure
The swym API response structure differs from what the code previously
assumed. Fix all field extraction to match the real shape:

- total_positions: backtest_metadata.position_count (not top-level)
- sharpe_ratio, win_rate, profit_factor: instruments.{key}.{field}.value
  wrapped decimal strings (not plain floats); treat Decimal::MAX sentinel
  (~7.9e28) as None
- net_pnl: instruments.{key}.pnl (plain decimal string)
- instrument key derived as "{exchange_no_underscores}-{base}_{quote}"
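
Two of these rules are easy to illustrate in isolation. The sketch below uses `f64` where the real code deals with decimal strings from a decimal type. The 7.9e28 cutoff and the example exchange name used to exercise it are assumptions.

```rust
/// Values at or above this are treated as the Decimal::MAX sentinel (~7.9e28).
const DECIMAL_MAX_SENTINEL: f64 = 7.9e28;

/// Parse a decimal-string metric, mapping the sentinel (and junk) to None.
pub fn parse_metric(raw: &str) -> Option<f64> {
    let v: f64 = raw.trim().parse().ok()?;
    if !v.is_finite() || v >= DECIMAL_MAX_SENTINEL {
        None
    } else {
        Some(v)
    }
}

/// Build the instruments map key: "{exchange_no_underscores}-{base}_{quote}".
pub fn instrument_key(exchange: &str, base: &str, quote: &str) -> String {
    format!("{}-{}_{}", exchange.replace('_', ""), base, quote)
}
```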

Also fix coverage-based backtest_from clamping: after the coverage
check, compute the effective backtest start as the max first_open across
all instruments × common intervals, so strategies never fail with
"requested range outside available data". Log per-interval date ranges
for each instrument at startup.
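
The clamping amounts to taking a maximum. In this sketch, timestamps are plain integers (for example Unix seconds) and `effective_start` is a hypothetical name. It also assumes the requested start is kept when it is already later than every first_open.

```rust
/// Effective backtest start: the latest of the requested start and every
/// first_open across instruments and common intervals.
pub fn effective_start(requested: i64, first_opens: &[i64]) -> i64 {
    first_opens.iter().copied().fold(requested, i64::max)
}
```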

Additionally:
- Compact format_audit_summary to handle {"rules":[...],"total_bars":N}
  structure with per-condition true_count/evaluated breakdown
- Drop avg_bars from summary_line (field absent from API)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-09 14:22:29 +02:00
c7a2d65539 fix(prompts): forbid dynamic quantity expressions, require plain decimal string
The model was generating Expr objects for quantity (e.g. ATR-based sizing),
causing consistent QuantitySpec deserialization failures. Replace the
"prefer dynamic sizing" hint with an explicit rule: quantity must always
be a fixed decimal string like "0.001".

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-09 13:11:40 +02:00
292c101859 docs(prompts): add DSL expression kind reference and three working examples
Shows correct usage of rsi/bollinger/ema_trend condition shortcuts, entry_price
and bars_since_entry ExprKind values, and func/cross_over/bin_op expressions.
Also calls out common model mistakes (rsi as ExprKind, bars_since_entry as
FuncName, expr_field) and adds a note that spot strategies are long-only.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-09 13:09:01 +02:00
fc9b7e094a feat(agent): add strategy quality introspection
Log full strategy JSON at debug level, show full anyhow cause chain on
submit failures, surface condition_audit_summary for 0-trade results in
both logs and the summary fed back to the AI each iteration.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-09 12:58:49 +02:00
deb28f6714 chore: local defaults
2026-03-09 12:24:30 +02:00
b7aa458e40 feat(claude): add configurable API base URL via --anthropic-url
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-09 10:28:44 +02:00
934566879e chore: init
2026-03-09 10:15:33 +02:00