Commit Graph

7 Commits

Author SHA1 Message Date
BattleTag
ca96f29849 feat(strategist) S1: Lead extension + InvestigationRound model
DESIGN_STRATEGIST.md §1. Foundation for the Phase 3 strategist loop.

Lead now carries four annotations that let the orchestrator measure
marginal yield per lead and dedupe strategist proposals:
  - proposed_by         (agent that proposed it: "strategist", "filesystem", …)
  - motivating_hypothesis (hyp-id the lead is meant to corroborate/refute)
  - expected_evidence_type (edge type the lead's worker should produce)
  - round_number        (0 = Phase 1 lead, ≥1 = strategist-proposed)

add_lead idempotently dedupes strategist proposals on
(motivating_hypothesis, expected_evidence_type, target_agent, source_id)
to prevent the "strategist loops on the same lead" failure mode.

New InvestigationRound dataclass records per-round provenance: before/
after hypothesis status snapshots, phenomena + edge count deltas, and
the strategist's decision_rationale. ``new_phenomena_count``,
``new_edges_count``, ``status_flips`` are derived properties that the
marginal_yield tool will use.

start_investigation_round / complete_investigation_round /
get_investigation_round / latest_round / leads_from_round complete the
lifecycle. complete is idempotent on already-closed rounds.

Lead.from_dict is forward-compat for state files written before this
commit. InvestigationRound persists as a top-level list in
graph_state.json (auto-save + load_state both wired).

EvidenceGraph also gains graph.budgets and graph.run_start_monotonic
fields that the budget_status view (S2) will read; orchestrator
populates them in S5.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-21 02:18:35 -10:00
BattleTag
8020c24776 fix(graph): harmonic damping for repeated same-edge_type evidence
First full-case run (runs/2026-05-20T20-15-04/) produced hypotheses
with log_odds +31 (8 direct_evidence + 15 supports). That's the
naive-Bayes independence assumption breaking down: 15 different
phenomena all "supporting" the same hypothesis from one source are
not 15 independent pieces of evidence, they're highly correlated.
DESIGN.md §4.5 last bullet flagged this as a "未实施旋钮" — this
commit implements it.

Rule: the k-th edge of a given (hyp_id, edge_type) contributes
log_lr_base / k instead of log_lr_base. Cumulative is harmonic
sum H_N, bounded by ~ ln N. Single-edge hypotheses unaffected
(k=1 → /1 → no change). Replaying the 2026-05-20 graph's 108
edges under the new rule pulls the top hypothesis from +31.0 →
+8.75; the smallest active hypothesis from +4.0 → +2.08.

Also adds rank + log_lr_base to confidence_log entries so the
math is auditable from the persisted graph.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-21 02:16:37 -10:00
BattleTag
6b485b98f7 fix(grounding): auto-rescue hallucinated invocation_id + list real ids in error
First full-case run (runs/2026-05-20T20-15-04/) produced 83 GroundingError
rejections, almost all from a single failure mode: LLM cites a plausible-
looking inv-XXXXXXXX that doesn't exist, while the fact's value is in fact
present verbatim in one of its real tool outputs. The agent knew which
tool it read from, it just mis-typed the citation id.

Two-layer fix in evidence_graph.validate_fact_grounding:

  Layer A (silent heal): when the cited inv-id misses, search the same
  agent / task's invocations for one whose output contains the value
  (strict or normalised substring). If exactly one matches, rewrite
  fact.invocation_id in place and accept. Multi-match is NOT auto-
  rescued — the candidate ids go back to the LLM so it picks deliberately.

  Layer B (informative retry): GroundingError now appends the agent's
  recent invocation ids and a brief tool-call summary, so the LLM has
  the real ids in front of it for the next attempt rather than
  fabricating again from memory.

Both layers preserve the design invariant: the fact's value must still
be present in a real tool output — nothing new can land grounded that
wasn't already verifiable. Cross-agent / cross-task isolation is also
preserved (rescue candidates filtered on agent + task_id).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-21 02:14:20 -10:00
BattleTag
81ade8f7ac feat(refit): complete S1-S6 — case abstraction, grounding, log-odds, plugins, coref, multi-source
Consolidates the long-running refit work (DESIGN.md as authoritative spec)
into a single baseline commit. Six stages landed together:

  S1  Case + EvidenceSource abstraction; tools parameterised by source_id
      (case.py, main.py multi-source bootstrap, .bin extension support)
  S2  Grounding gateway in add_phenomenon: verified_facts cite real
      ToolInvocation ids; substring / normalised match enforced; agent +
      task scope checked. Phenomenon.description split into verified_facts
      (grounded) + interpretation (free text). [invocation: inv-xxx]
      prefix on every wrapped tool result so the LLM can cite.
  S3  Confidence as additive log-odds: edge_type → log10(LR) calibration
      table; commutative updates; supported / refuted thresholds derived
      from log_odds; hypothesis × evidence matrix view.
  S4  iOS plugin: unzip_archive + parse_plist / sqlite_tables /
      sqlite_query / parse_ios_keychain / read_idevice_info;
      IOSArtifactAgent; SOURCE_TYPE_AGENTS routing.
  S5  Cross-source entity resolution: typed identifiers on Entity,
      observe_identity gateway, auto coref hypothesis with shared /
      conflicting strong/weak LR edges, reversible same_as edges,
      actor_clusters() view.
  S6  Android partition probe + AndroidArtifactAgent; MediaAgent with
      OCR fallback; orchestrator Phase 1 iterates every analysable
      source; platform-aware get_triage_agent_type; ReportAgent renders
      actor clusters + per-source breakdown.

142 unit tests / 1 skipped — full coverage of the new gateway, log-odds
math, coref hypothesis fall-out, and orchestrator multi-source dispatch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-21 02:12:10 -10:00
BattleTag
444d58726a refactor: native tool calling + generic forced-retry + terminal exit
- llm_client: switch tool_call_loop from text-based <tool_call> regex
  to OpenAI-native tools=[...] / structured tool_calls field; accumulate
  delta.reasoning_content for DeepSeek thinking-mode echo-back; fold
  preserves system msg and aligns boundary to never orphan role:tool
- base_agent: generic forced-retry via mandatory_record_tools class attr
  (filesystem -> add_phenomenon, timeline -> add_temporal_edge,
  hypothesis -> add_hypothesis, report -> save_report); count via
  executor wrapper
- terminal_tools class attr + loop short-circuit: when a terminal tool
  is called, loop exits with its raw return as final_text. ReportAgent
  declares save_report as terminal - replaces the <answer>-tag stop
  signal that native tool calling broke
- _execute_*: return (raw, formatted) - terminal exit uses untruncated
  raw, conversation history uses 3000-char-capped formatted
- evidence_graph + orchestrator: LLM-derived InvestigationArea support
  (hypothesis-driven coverage check, replaces hardcoded _AREA_KEYWORDS /
  _AREA_TOOLS); manual yaml block kept as optional seed
- strip <answer> references from agent prompts (no longer load-bearing)

Verified on CFReDS image across 4 smoke runs: 0 JSON parse failures
(was 3); 22 temporal edges from Phase 4 (was 0); ReportAgent exits via
save_report (was max_iterations regression). 78/78 unit tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 13:51:19 +08:00
BattleTag
74e6bde13a refactor: lead provenance, unified link path, SSOT cleanup, configurable weights
Five interrelated cleanups:

1. Lead -> Phenomenon provenance
   - Phenomenon.from_lead_id field on the dataclass
   - BaseAgent.run(lead_id=...) writes self._current_lead_id
   - _add_phenomenon auto-injects from agent state (LLM unaware)
   - Orchestrator dispatch passes lead.id; Phase 1/2-auto/4/5 stay None
   - Merge path preserves the first non-None lead_id on collision

2. Unified Phenomenon <-> Hypothesis link path
   - HypothesisAgent only adds hypotheses, never links
   - link_phenomenon_to_hypothesis tool + executor removed
   - All links go through Orchestrator._judge_new_phenomena
   - Phase 2 unconditionally judges after hypothesis generation
   - Gap Analysis judges after each dispatch round
   (Three previously-missing judge calls now in place.)

3. SSOT in agent subclasses
   - Remove RoleTemplate dataclass, ROLE_TEMPLATES dict,
     _instantiate_from_template method
   - Each agent subclass owns name, role, and tool list
   - agent_factory.py shrinks from 299 to 153 lines
   - All 7 agents now route through _AGENT_CLASSES (filesystem,
     registry, communication, network, timeline were previously dead
     subclasses overridden by templates)

4. Configurable edge weights
   - HYPOTHESIS_EDGE_WEIGHTS -> _DEFAULT_EDGE_WEIGHTS (private default)
   - EvidenceGraph(edge_weights=...) override via config.yaml
   - hypothesis_edge_weights section in config.yaml (commented example)
   - main.py and regenerate_report.py read and pass through

5. regenerate_report.py auto-picks the latest run/*/graph_state.json
   when no CLI arg is given (was a hardcoded date path)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 14:10:15 +08:00
BattleTag
097d2ce472 Initial commit
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 17:36:26 +08:00