DESIGN_STRATEGIST.md §5. Support resume from a crash mid-strategist-loop.
_resume_strategist_state inspects investigation_rounds for a tail entry
without completed_at — an "open" round, i.e. one that started but never
closed. Two repairs:
  1. Mark the round closed with strategist_action="interrupted_resume"
     so the run history reflects what actually happened.
  2. Walk that round's leads; any still in "assigned" state are
     re-marked as "failed" with failure_reason="interrupted before
     complete". The Retry-failed-leads + Gap-analysis passes that run
     after the strategist loop can pick them up.
Returns max(round_number) + 1 — the round at which to resume the loop.
On a clean graph (no prior rounds) returns 1 and makes no changes.
_phase3_strategist_loop now calls this helper before the main for-loop
and uses its return value as start_round, so a resume run lands at the
right round number rather than restarting from R1.
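
The repair can be sketched roughly as below. This is a standalone
illustration, not the code in the commit: the Lead and Round classes,
their field names beyond those quoted above, and the free-function form
are assumptions made so the example runs on its own.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class Lead:  # hypothetical stand-in for the real Lead
    lead_id: str
    status: str = "assigned"
    failure_reason: Optional[str] = None


@dataclass
class Round:  # hypothetical stand-in for InvestigationRound
    round_number: int
    completed_at: Optional[str] = None
    strategist_action: Optional[str] = None
    leads: List[Lead] = field(default_factory=list)


def resume_strategist_state(rounds: List[Round]) -> int:
    """Close a dangling tail round and return the round to resume at."""
    if not rounds:
        return 1  # clean graph: nothing to repair
    tail = rounds[-1]
    if tail.completed_at is None:
        # Repair 1: close the open round and record why it was closed.
        tail.completed_at = datetime.now(timezone.utc).isoformat()
        tail.strategist_action = "interrupted_resume"
        # Repair 2: leads still "assigned" never finished; mark them failed
        # so the later retry pass can pick them up.
        for lead in tail.leads:
            if lead.status == "assigned":
                lead.status = "failed"
                lead.failure_reason = "interrupted before complete"
    return max(r.round_number for r in rounds) + 1
```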
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
DESIGN_STRATEGIST.md §4. Replace the fixed-round hypothesis-directed
loop with a belief-driven strategist loop that runs the strategist
agent once per round and dispatches the leads it proposes.
New helpers on Orchestrator:
  _budget_exceeded()
      Hard budget caps (tool_calls, wall_clock_minutes), complementing
      strategist self-throttling.
  _execute_strategist_lead(lead)
      Dispatch one lead serially; the next strategist round sees the
      cumulative effect of this lead's graph mutations.
  _phase3_strategist_loop()
      Main loop. Open round, run strategist, exit on declare_complete
      or empty proposals, otherwise dispatch each lead, judge new
      phenomena, close round, apply yield/budget checks.
  _phase3_legacy_loop()
      Fallback when strategist.enabled is false. Identical to the
      pre-DESIGN_STRATEGIST behaviour.
The run() entry point branches on strategist_cfg.enabled (default
true) and always follows up with _retry_failed_leads() + Gap
Analysis + mark_remaining_inconclusive() regardless of variant.
Orchestrator.__init__ also wires graph.budgets and
graph.run_start_monotonic from config so the budget_status tool
sees real numbers.
Integration tests use a mock strategist + mock workers to verify
declare_complete, propose_lead -> worker dispatch, zero-yield-streak
hard stop, and budget-cap-stops-the-loop.
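
The control flow of the loop might look like the sketch below. It is a
simplified model under stated assumptions: the strategist and worker
are plain callables, the zero-yield threshold of 2 and the max_rounds
default are illustrative values, and every name other than tool_calls
and wall_clock_minutes is hypothetical.

```python
import time


def budget_exceeded(used_tool_calls, start_monotonic, budgets):
    """Hard caps: True once either the call cap or the time cap is reached."""
    elapsed_min = (time.monotonic() - start_monotonic) / 60.0
    return (used_tool_calls >= budgets["tool_calls"]
            or elapsed_min >= budgets["wall_clock_minutes"])


def strategist_loop(strategist, dispatch, budgets,
                    start_round=1, max_rounds=10, max_zero_streak=2):
    """Run rounds until a stop condition fires; return (round, reason).

    strategist(round_number) -> ("complete", None) or ("propose", [lead, ...])
    dispatch(lead)           -> (tool_calls_used, new_items_found)
    """
    start = time.monotonic()
    used = 0
    zero_streak = 0
    for round_number in range(start_round, max_rounds + 1):
        action, leads = strategist(round_number)
        if action == "complete":
            return round_number, "declare_complete"
        if not leads:
            return round_number, "no_proposals"
        round_yield = 0
        for lead in leads:  # serial dispatch, one lead at a time
            calls, new_items = dispatch(lead)
            used += calls
            round_yield += new_items
        zero_streak = zero_streak + 1 if round_yield == 0 else 0
        if zero_streak >= max_zero_streak:
            return round_number, "zero_yield_streak"
        if budget_exceeded(used, start, budgets):
            return round_number, "budget_exceeded"
    return max_rounds, "max_rounds"
```

The four exit reasons correspond to the four behaviours the integration
tests cover; the real loop additionally opens and closes rounds and
judges new phenomena, which this sketch leaves out.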
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
DESIGN_STRATEGIST.md §3. The smallest possible agent — its entire
output per round is one decision: propose 1-3 leads (each citing a
real hypothesis it expects to move) OR declare the investigation
complete with a reason.
Constraint surface:
mandatory_record_tools = ("propose_lead", "declare_investigation_complete")
terminal_tools = ("declare_investigation_complete",)
The agent inherits the BaseAgent forced-retry mechanism: if it returns
without calling either action tool, the orchestrator force-prompts a
RECORD-only retry. declare_complete being terminal means the
tool_call_loop short-circuits the moment the strategist decides
we're done.
_register_graph_tools overrides BaseAgent's default to skip
_register_graph_write_tools entirely — the strategist NEVER writes
phenomena, entities, edges, or hypotheses directly. All graph
mutations come from the workers it dispatches via leads. This keeps
the planning agent's responsibility surface narrow: read the graph,
choose what to do next, that's it.
The prompt walks through the workflow (call graph_overview /
marginal_yield / budget_status / source_coverage first, then take
exactly one terminal action) with decision criteria for propose vs stop.
Registered in agent_factory._AGENT_CLASSES["strategist"].
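
How the two tuples interact can be illustrated with a small model. The
tuples are copied from above; run_tool_calls and needs_forced_retry are
hypothetical helpers written for this example and are not BaseAgent
APIs.

```python
MANDATORY_RECORD_TOOLS = ("propose_lead", "declare_investigation_complete")
TERMINAL_TOOLS = ("declare_investigation_complete",)


def run_tool_calls(planned_calls):
    """Execute calls in order, stopping right after the first terminal tool."""
    executed = []
    for name in planned_calls:
        executed.append(name)
        if name in TERMINAL_TOOLS:
            break  # short-circuit: nothing after a terminal tool runs
    return executed


def needs_forced_retry(executed):
    """True when the agent returned without calling any mandatory tool."""
    return not any(name in MANDATORY_RECORD_TOOLS for name in executed)


# A run that only reads the graph triggers the RECORD-only retry.
print(needs_forced_retry(["graph_overview", "budget_status"]))
```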
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
DESIGN_STRATEGIST.md §2.5. The strategist's two write actions.
propose_lead validates motivating_hypothesis exists in the graph,
validates expected_evidence_type is a real edge type, validates
source_id refers to a real source in the case — fast specific
errors so the strategist gets fixable feedback rather than a
generic crash. On success, calls graph.add_lead with proposed_by=
"strategist" and round_number=graph.current_strategist_round so
the round-completion code can collect this round's leads.
declare_investigation_complete sets graph.strategist_complete_requested
which the orchestrator inspects after each strategist run to decide
whether to break the loop. reason must come from a closed enum so
the audit log is consistent.
EvidenceGraph gains two transient run-context fields:
current_strategist_round — set by orchestrator at start of round
strategist_complete_requested — flipped by declare_complete
These are intentionally NOT persisted — they're per-run flags, not
graph state.
Both tools are required to be in
InvestigationStrategist.mandatory_record_tools (added in S4) so the
agent's forced-retry mechanism kicks in if it returns without taking a
documented decision.
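
The validation order in propose_lead can be sketched as follows. This
is an illustration only: the graph is modelled as a plain dict, the set
of edge types lists just the two named elsewhere in this log, and the
error wording is made up.

```python
# Assumption: only the two edge types mentioned in this log are listed.
EDGE_TYPES = {"direct_evidence", "supports"}


def propose_lead(graph, motivating_hypothesis, expected_evidence_type,
                 source_id, target_agent):
    """Validate each argument and return a specific, fixable error on failure."""
    if motivating_hypothesis not in graph["hypotheses"]:
        return {"ok": False,
                "error": f"unknown hypothesis {motivating_hypothesis!r}"}
    if expected_evidence_type not in EDGE_TYPES:
        return {"ok": False,
                "error": f"unknown edge type {expected_evidence_type!r}; "
                         f"expected one of {sorted(EDGE_TYPES)}"}
    if source_id not in graph["sources"]:
        return {"ok": False, "error": f"unknown source {source_id!r}"}
    lead = {
        "motivating_hypothesis": motivating_hypothesis,
        "expected_evidence_type": expected_evidence_type,
        "source_id": source_id,
        "target_agent": target_agent,
        "proposed_by": "strategist",
        # Stamp the round so round-completion code can collect these leads.
        "round_number": graph["current_strategist_round"],
    }
    graph["leads"].append(lead)
    return {"ok": True, "lead": lead}
```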
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
DESIGN_STRATEGIST.md §2. Four read-only view tools the strategist uses
to ground its decision each round.
  graph_overview()
      Hypotheses table (log_odds, conf, edges_in, distinct_sources,
      recent_flip), sources table, pending leads. distinct_sources is
      the critical signal: a hypothesis with 23 edges but only 1
      distinct source has weak cross-source corroboration and is a
      candidate for a corroboration-seeking lead.
  source_coverage(src)
      Per-source ✓/✗ against an expected-artefact catalogue. The
      catalogue is a set of heuristic hints, NOT a forced checklist.
      The footer reminds the strategist to investigate ✗ items only
      when an active hypothesis depends on them; this is the
      "checklist ability is available but not binding" guardrail.
  marginal_yield(N)
      New phenomena / edges / status flips per recent round. Two
      consecutive zero-yield rounds are a strong signal to declare
      complete.
  budget_status()
      Usage vs caps (tool_calls, rounds, wall clock). Pacing warnings
      at 70% / 90%.
tools/strategy.py also exports EXPECTED_ARTEFACTS, a per-source-type
table of (name, detector, value_for) entries. Detectors are
substring patterns on tool name + args; the matcher resolves at
call time against graph.tool_invocations. Catalogue covers iOS /
Android / Windows disk / media-collection / archive source types.
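
A substring-based catalogue match could look like the sketch below. The
catalogue entries, the invocation dict shape, and the function body are
illustrative assumptions; only the (name, detector, value_for) tuple
layout and the "substring on tool name + args" rule come from the text
above.

```python
# Illustrative entries only; the real catalogue contents are not shown here.
EXPECTED_ARTEFACTS = {
    "ios": [
        ("sms_database", "sms.db", "messaging hypotheses"),
        ("safari_history", "History.db", "browsing hypotheses"),
    ],
}


def source_coverage(source_type, tool_invocations):
    """Return (name, mark, value_for) rows, resolved at call time."""
    rows = []
    for name, detector, value_for in EXPECTED_ARTEFACTS.get(source_type, []):
        # The detector is a substring pattern over "tool name + args".
        seen = any(detector in f"{inv['tool']} {inv['args']}"
                   for inv in tool_invocations)
        rows.append((name, "✓" if seen else "✗", value_for))
    return rows


invocations = [
    {"tool": "sqlite_query",
     "args": "path=/private/var/mobile/Library/SMS/sms.db"},
]
for row in source_coverage("ios", invocations):
    print(row)
```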
All four tools registered in tool_registry, listed as read-only in
llm_client.READ_ONLY_TOOLS for parallel execution. They go through
the invocation-logging wrapper so the strategist's reads are
themselves auditable (the wrapper does NOT cache them — graph
state changes between calls).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
DESIGN_STRATEGIST.md §1. Foundation for the Phase 3 strategist loop.
Lead now carries four annotations that let the orchestrator measure
marginal yield per lead and dedupe strategist proposals:
- proposed_by (agent that proposed it: "strategist", "filesystem", …)
- motivating_hypothesis (hyp-id the lead is meant to corroborate/refute)
- expected_evidence_type (edge type the lead's worker should produce)
- round_number (0 = Phase 1 lead, ≥1 = strategist-proposed)
add_lead idempotently dedupes strategist proposals on
(motivating_hypothesis, expected_evidence_type, target_agent, source_id)
to prevent the "strategist loops on the same lead" failure mode.
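
The dedupe rule amounts to a keyed lookup before append. A minimal
sketch, assuming leads are plain dicts held in a list (the real
add_lead lives on EvidenceGraph and its internals are not shown here):

```python
DEDUPE_FIELDS = ("motivating_hypothesis", "expected_evidence_type",
                 "target_agent", "source_id")


def dedupe_key(lead):
    """Tuple of the four fields that identify a strategist proposal."""
    return tuple(lead.get(name) for name in DEDUPE_FIELDS)


def add_lead(leads, lead):
    """Append the lead unless an identical strategist proposal exists.

    Idempotent: re-proposing returns the existing lead unchanged.
    """
    if lead.get("proposed_by") == "strategist":
        key = dedupe_key(lead)
        for existing in leads:
            if (existing.get("proposed_by") == "strategist"
                    and dedupe_key(existing) == key):
                return existing
    leads.append(lead)
    return lead
```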
New InvestigationRound dataclass records per-round provenance: before/
after hypothesis status snapshots, phenomena + edge count deltas, and
the strategist's decision_rationale. ``new_phenomena_count``,
``new_edges_count``, ``status_flips`` are derived properties that the
marginal_yield tool will use.
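
One way the derived properties could be computed from the snapshots is
sketched below. The three property names are from this commit; the
snapshot field names and the choice to return status_flips as a list of
hypothesis ids are assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class InvestigationRound:
    round_number: int
    # Hypothetical snapshot fields: hypothesis id -> status string.
    status_before: Dict[str, str] = field(default_factory=dict)
    status_after: Dict[str, str] = field(default_factory=dict)
    phenomena_before: int = 0
    phenomena_after: int = 0
    edges_before: int = 0
    edges_after: int = 0
    decision_rationale: str = ""

    @property
    def new_phenomena_count(self) -> int:
        return self.phenomena_after - self.phenomena_before

    @property
    def new_edges_count(self) -> int:
        return self.edges_after - self.edges_before

    @property
    def status_flips(self) -> List[str]:
        """Hypotheses present in both snapshots whose status changed."""
        return [h for h, status in self.status_after.items()
                if h in self.status_before and self.status_before[h] != status]
```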
start_investigation_round / complete_investigation_round /
get_investigation_round / latest_round / leads_from_round complete the
lifecycle. complete is idempotent on already-closed rounds.
Lead.from_dict remains backward-compatible with state files written
before this commit. InvestigationRound persists as a top-level list in
graph_state.json (auto-save + load_state both wired).
EvidenceGraph also gains graph.budgets and graph.run_start_monotonic
fields that the budget_status view (S2) will read; orchestrator
populates them in S5.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
First full-case run (runs/2026-05-20T20-15-04/) produced hypotheses
with log_odds +31 (8 direct_evidence + 15 supports). That's the
naive-Bayes independence assumption breaking down: 15 different
phenomena all "supporting" the same hypothesis from one source are
not 15 independent pieces of evidence, they're highly correlated.
DESIGN.md §4.5 last bullet flagged this as an "unimplemented knob";
this commit implements it.
Rule: the k-th edge of a given (hyp_id, edge_type) contributes
log_lr_base / k instead of log_lr_base. The cumulative contribution
of N such edges is log_lr_base * H_N, where H_N is the N-th harmonic
number, so it grows like ln N + 0.58 rather than linearly in N.
Single-edge hypotheses are unaffected (k=1 divides by 1). Replaying
the 2026-05-20 graph's 108 edges under the new rule pulls the top
hypothesis from +31.0 to +8.75 and the smallest active hypothesis
from +4.0 to +2.08.
Also adds rank + log_lr_base to confidence_log entries so the
math is auditable from the persisted graph.
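
The damping rule can be checked with a short calculation. The base
values below (2.0 for direct_evidence, 1.0 for supports) are
assumptions: they are not stated in this message, but they reproduce
the quoted figures for a hypothesis with 8 direct_evidence and 15
supports edges, starting from log-odds 0.

```python
from collections import defaultdict

# Assumed base log-likelihood ratios; chosen because they reproduce
# the +31.0 and +8.75 figures quoted in this commit message.
ASSUMED_LOG_LR_BASE = {"direct_evidence": 2.0, "supports": 1.0}


def damped_log_odds(edges, log_lr_base):
    """Sum log_lr_base / k over edges, k = rank within (hyp_id, edge_type)."""
    rank = defaultdict(int)
    totals = defaultdict(float)
    confidence_log = []
    for hyp_id, edge_type in edges:
        rank[(hyp_id, edge_type)] += 1
        k = rank[(hyp_id, edge_type)]
        base = log_lr_base[edge_type]
        totals[hyp_id] += base / k
        # Keep rank and base so the arithmetic can be audited later.
        confidence_log.append({"hyp_id": hyp_id, "edge_type": edge_type,
                               "rank": k, "log_lr_base": base,
                               "contribution": base / k})
    return dict(totals), confidence_log


edges = [("h1", "direct_evidence")] * 8 + [("h1", "supports")] * 15
totals, log = damped_log_odds(edges, ASSUMED_LOG_LR_BASE)
undamped = sum(entry["log_lr_base"] for entry in log)
print(f"undamped={undamped:.2f} damped={totals['h1']:.2f}")
```

Under the same assumed bases, four supports edges give H_4 = 2.08,
which matches the quoted drop from +4.0 to +2.08.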
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Timeline agent on the 2026-05-20 full run produced 0 phenomena: the
initial round hit the max_iterations=60 cap before recording, and the
forced retry then hit the max_iterations=10 cap because every
grounding-rejected call burns one iteration in the new gateway. Two
changes restore depth without reintroducing the original "agent
wanders off and never records" failure:
  1. Raise retry cap 10 → 30. With grounding auto-rescue (prev commit)
     most rejections heal on the first retry, but some still need 2-3
     turns; 10 is empirically too tight, 30 leaves headroom.
  2. Narrow the retry tool surface to RECORD + graph-write +
     read-only-graph-query tools. Investigation tools (list_directory,
     sqlite_query, parse_registry_key) are dropped on retry so the
     agent can't restart its search loop; the retry is explicitly
     "record what you already found, then stop".
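
The narrowed retry surface is essentially a filter over tool kinds. A
sketch under assumptions: the kind labels, the record_phenomenon and
add_edge tool names, and the dict layout are hypothetical; only the
three investigation tool names and the cap of 30 come from this
message.

```python
RETRY_MAX_ITERATIONS = 30  # raised from 10 in this commit
RETRY_ALLOWED_KINDS = {"record", "graph_write", "graph_read"}

# Hypothetical registry: tool name -> kind.
TOOL_KINDS = {
    "record_phenomenon": "record",      # hypothetical RECORD tool
    "add_edge": "graph_write",          # hypothetical graph-write tool
    "graph_overview": "graph_read",     # kind assignment is an assumption
    "list_directory": "investigation",
    "sqlite_query": "investigation",
    "parse_registry_key": "investigation",
}


def retry_tool_surface(tool_kinds):
    """Names of the tools that stay available during a forced retry."""
    return sorted(name for name, kind in tool_kinds.items()
                  if kind in RETRY_ALLOWED_KINDS)


print(retry_tool_surface(TOOL_KINDS))
```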
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
First full-case run (runs/2026-05-20T20-15-04/) produced 83
GroundingError rejections, almost all from a single failure mode: the
LLM cites a plausible-looking inv-XXXXXXXX that doesn't exist, while
the fact's value is present verbatim in one of its real tool outputs.
The agent knew which tool it read from; it just mistyped the citation
id.
Two-layer fix in evidence_graph.validate_fact_grounding:
Layer A (silent heal): when the cited inv-id misses, search the same
agent / task's invocations for one whose output contains the value
(strict or normalised substring). If exactly one matches, rewrite
fact.invocation_id in place and accept. Multi-match is NOT auto-
rescued — the candidate ids go back to the LLM so it picks deliberately.
Layer B (informative retry): GroundingError now appends the agent's
recent invocation ids and a brief tool-call summary, so the LLM has
the real ids in front of it for the next attempt rather than
fabricating again from memory.
Both layers preserve the design invariant: the fact's value must still
be present in a real tool output — nothing new can land grounded that
wasn't already verifiable. Cross-agent / cross-task isolation is also
preserved (rescue candidates filtered on agent + task_id).
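
The two layers can be modelled as below. This is a simplified sketch,
not the code in evidence_graph: facts and invocations are plain dicts,
the normalisation (lower-case plus whitespace collapse) is an assumed
stand-in for the real one, and the returned status strings are made up.

```python
def _normalise(text):
    """Assumed normalisation: lower-case and collapse whitespace."""
    return " ".join(text.lower().split())


def _contains(output, value):
    """Strict substring first, then normalised substring."""
    return value in output or _normalise(value) in _normalise(output)


def validate_fact_grounding(fact, invocations, agent, task_id, recent=5):
    """Return (status, ids); may rewrite fact["invocation_id"] in place."""
    # Isolation: only this agent's invocations for this task are candidates.
    own = [inv for inv in invocations
           if inv["agent"] == agent and inv["task_id"] == task_id]
    cited = next((inv for inv in own
                  if inv["id"] == fact["invocation_id"]), None)
    if cited is not None:
        ok = _contains(cited["output"], fact["value"])
        return ("grounded" if ok else "rejected"), [cited["id"]]
    matches = [inv["id"] for inv in own
               if _contains(inv["output"], fact["value"])]
    if len(matches) == 1:
        fact["invocation_id"] = matches[0]  # Layer A: silent heal
        return "healed", matches
    if matches:
        # Multi-match: hand the candidates back, let the LLM choose.
        return "rejected", matches
    # Layer B: no match, so surface the agent's recent real ids.
    return "rejected", [inv["id"] for inv in own][-recent:]
```

In every accepted path the value is still found in a real tool output,
so the invariant described above holds in this model as well.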
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>