Commit Graph

10 Commits

Author SHA1 Message Date
BattleTag
81ade8f7ac feat(refit): complete S1-S6 — case abstraction, grounding, log-odds, plugins, coref, multi-source
Consolidates the long-running refit work (DESIGN.md as authoritative spec)
into a single baseline commit. Six stages landed together:

  S1  Case + EvidenceSource abstraction; tools parameterised by source_id
      (case.py, main.py multi-source bootstrap, .bin extension support)
  S2  Grounding gateway in add_phenomenon: verified_facts cite real
      ToolInvocation ids; substring / normalised match enforced; agent +
      task scope checked. Phenomenon.description split into verified_facts
      (grounded) + interpretation (free text). [invocation: inv-xxx]
      prefix on every wrapped tool result so the LLM can cite.
  S3  Confidence as additive log-odds: edge_type → log10(LR) calibration
      table; commutative updates; supported / refuted thresholds derived
      from log_odds; hypothesis × evidence matrix view.
  S4  iOS plugin: unzip_archive + parse_plist / sqlite_tables /
      sqlite_query / parse_ios_keychain / read_idevice_info;
      IOSArtifactAgent; SOURCE_TYPE_AGENTS routing.
  S5  Cross-source entity resolution: typed identifiers on Entity,
      observe_identity gateway, auto coref hypothesis with shared /
      conflicting strong/weak LR edges, reversible same_as edges,
      actor_clusters() view.
  S6  Android partition probe + AndroidArtifactAgent; MediaAgent with
      OCR fallback; orchestrator Phase 1 iterates every analysable
      source; platform-aware get_triage_agent_type; ReportAgent renders
      actor clusters + per-source breakdown.

142 unit tests / 1 skipped — full coverage of the new gateway, log-odds
math, coref hypothesis fall-out, and orchestrator multi-source dispatch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-21 02:12:10 -10:00
BattleTag
444d58726a refactor: native tool calling + generic forced-retry + terminal exit
- llm_client: switch tool_call_loop from text-based <tool_call> regex
  to OpenAI-native tools=[...] / structured tool_calls field; accumulate
  delta.reasoning_content for DeepSeek thinking-mode echo-back; fold
  preserves system msg and aligns boundary to never orphan role:tool
- base_agent: generic forced-retry via mandatory_record_tools class attr
  (filesystem -> add_phenomenon, timeline -> add_temporal_edge,
  hypothesis -> add_hypothesis, report -> save_report); count via
  executor wrapper
- terminal_tools class attr + loop short-circuit: when a terminal tool
  is called, loop exits with its raw return as final_text. ReportAgent
  declares save_report as terminal - replaces the <answer>-tag stop
  signal that native tool calling broke
- _execute_*: return (raw, formatted) - terminal exit uses untruncated
  raw, conversation history uses 3000-char-capped formatted
- evidence_graph + orchestrator: LLM-derived InvestigationArea support
  (hypothesis-driven coverage check, replaces hardcoded _AREA_KEYWORDS /
  _AREA_TOOLS); manual yaml block kept as optional seed
- strip <answer> references from agent prompts (no longer load-bearing)

Verified on CFReDS image across 4 smoke runs: 0 JSON parse failures
(was 3); 22 temporal edges from Phase 4 (was 0); ReportAgent exits via
save_report (was max_iterations regression). 78/78 unit tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 13:51:19 +08:00
BattleTag
0a2b344c84 fix: share _safe_json_loads with tool-call parser, not just orchestrator
Move _safe_json_loads from orchestrator.py to llm_client.py and have
_extract_tool_calls use it when parsing <tool_call> JSON blocks from
model output. orchestrator now imports it from llm_client.

Background: in the first full DeepSeek run (runs/2026-05-12T17-25-38),
~10 'Failed to parse tool call JSON' warnings appeared, all from regex
patterns where the LLM wrote \. or \* inside JSON string values:

  Failed to parse tool call JSON: {..., "pattern": "Outlook Express|...|\.dbx"}
  Failed to parse tool call JSON: {..., "pattern": "ethereal.*\.pcap"}
  Failed to parse tool call JSON: {..., "pattern": "lookatlan.*\.txt|..."}

These are exactly the kind of stray-backslash errors stage-1 sanitize
already handles for orchestrator JSON calls — but tool-call extraction
was using bare json.loads. Result: each failed tool call silently dropped
on the floor, the LLM never got a result, and at least one network agent
burned 14m26s spinning before hitting max_iterations=40.

Now the sanitize/log-on-failure path is shared. Verified against the
three failure cases from yesterday's log: all three now parse cleanly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 20:29:21 +08:00
BattleTag
76df34ed79 docs: add TODO marker for adaptive edge weights
Note that the hard-coded HYPOTHESIS_EDGE_WEIGHTS table is a temporary
choice; an adaptive scheme should be explored once the full pipeline
is stable.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 17:14:23 +08:00
BattleTag
893f5b5de2 fix: address agent boundary / JSON robustness / Phase 4 no-op from CFReDS run
Issues found running the system end-to-end on the NIST CFReDS Hacking Case
disk image (SCHARDT.001, Mr. Evil). Four interconnected fixes:

1. HypothesisAgent boundary leak (two layers)
   B.1 Tool set: BaseAgent._register_graph_tools was registering
       add_phenomenon / add_lead / link_to_entity for every agent. With
       an empty graph in Phase 2, HypothesisAgent "compensated" by
       inventing phenomena, dispatching leads, and linking entities.
   B.2 Prompt leak: BaseAgent's shared system prompt hard-coded "Call
       investigation tools (list_directory, parse_registry_key, etc.)".
       HypothesisAgent hallucinated list_directory and wasted 2 LLM
       rounds on 'unknown tool' errors before backing off.

   Fix:
   - Split _register_graph_tools into _register_graph_read_tools +
     _register_graph_write_tools.
   - HypothesisAgent, ReportAgent, TimelineAgent override
     _register_graph_tools to skip write tools.
   - HypothesisAgent and TimelineAgent override _build_system_prompt
     with focused, role-specific workflows (no Phase A-D investigation
     boilerplate).

2. JSON parse failures in Phase 3 lead generation (5/6 hypotheses lost)
   DeepSeek emits JSON with stray backslashes (Windows path references)
   and occasional minor syntax slips. Old single-stage sanitize couldn't
   recover; per-hypothesis fallback silently swallowed each failure.

   Fix:
   - _safe_json_loads: progressive — stage 0 as-is, stage 1 escape stray
     \X (anything not in valid JSON escape set), log raw input on final
     failure for diagnosis.
   - New _call_llm_for_json helper: on parse failure, append the error
     to the prompt and re-call LLM (self-correcting retry, up to 2).
   - All 4 LLM-JSON callsites in orchestrator refactored to use it.

3. Phase 1 sometimes skipped add_phenomenon (LLM treated <answer> as deliverable)
   Strengthen BaseAgent's RECORDING REQUIREMENT — explicit "your <answer>
   is DISCARDED; only graph mutations propagate" plus a new rule:
   negative findings (searched X, found nothing) MUST also be recorded
   as phenomena, since they constrain the hypothesis space.

4. Phase 4 Timeline was a no-op
   TimelineAgent inherited BaseAgent's Phase A-D prompt and never called
   add_temporal_edge — produced 0 temporal edges. Override the prompt
   with concrete workflow (build_filesystem_timeline ->
   get_timestamped_phenomena -> 15-40 add_temporal_edge calls) and
   restrict tool set to read-only + its 3 temporal tools.

Verified end-to-end: HypothesisAgent now 8 tools (no writes), ReportAgent
13 (no graph writes), TimelineAgent 10 (read + temporal + timeline).
All 60 unit tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 17:14:16 +08:00
BattleTag
0a966d8476 feat: switch LLM client to OpenAI SDK for DeepSeek compatibility
The previous LLMClient used raw httpx + Claude Messages API (/v1/messages,
x-api-key, Anthropic SSE event types). Incompatible with DeepSeek.

Rewrite LLMClient.__init__/chat/close to use openai.AsyncOpenAI:
- /v1/chat/completions endpoint, OpenAI message format
- Bearer auth, native SDK error types
- Stream chunks via async for + chunk.choices[0].delta.content

Tool calling protocol (ReAct text-based tags) and all surrounding helpers
(_apply_progressive_decay, _fold_old_messages, _partition_tool_calls,
tool_call_loop, etc.) are unchanged — endpoint-agnostic by design.

New optional config params surfaced to config.yaml.agent:
- reasoning_effort: "high" | "medium" | "low" — DeepSeek/o1-style depth
- thinking_enabled: bool — DeepSeek extra_body.thinking switch

main.py and regenerate_report.py pass these through to LLMClient.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 17:13:54 +08:00
BattleTag
31812a72ee test: track tests/ directory in version control
tests/test_optimizations.py — 60 pytest cases covering:
- EvidenceGraph: quality scoring, Jaccard merge, async safety,
  hypothesis confidence updates, asset library
- llm_client: tool-result truncation, parallel batch execution,
  progressive context decay, message folding
- orchestrator: parallel dispatch, batched lead generation,
  batched judging
- tool_registry: result cache key derivation

FakeAgent.run signatures updated to BaseAgent.run(task, lead_id=None).

Previously listed in .gitignore (which is itself untracked, so the
ignore rule lives only locally).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 14:10:31 +08:00
BattleTag
74e6bde13a refactor: lead provenance, unified link path, SSOT cleanup, configurable weights
Five interrelated cleanups:

1. Lead -> Phenomenon provenance
   - Phenomenon.from_lead_id field on the dataclass
   - BaseAgent.run(lead_id=...) writes self._current_lead_id
   - _add_phenomenon auto-injects from agent state (LLM unaware)
   - Orchestrator dispatch passes lead.id; Phase 1/2-auto/4/5 stay None
   - Merge path preserves the first non-None lead_id on collision

2. Unified Phenomenon <-> Hypothesis link path
   - HypothesisAgent only adds hypotheses, never links
   - link_phenomenon_to_hypothesis tool + executor removed
   - All links go through Orchestrator._judge_new_phenomena
   - Phase 2 unconditionally judges after hypothesis generation
   - Gap Analysis judges after each dispatch round
   (Three previously-missing judge calls now in place.)

3. SSOT in agent subclasses
   - Remove RoleTemplate dataclass, ROLE_TEMPLATES dict,
     _instantiate_from_template method
   - Each agent subclass owns name, role, and tool list
   - agent_factory.py shrinks from 299 to 153 lines
   - All 7 agents now route through _AGENT_CLASSES (filesystem,
     registry, communication, network, timeline were previously dead
     subclasses overridden by templates)

4. Configurable edge weights
   - HYPOTHESIS_EDGE_WEIGHTS -> _DEFAULT_EDGE_WEIGHTS (private default)
   - EvidenceGraph(edge_weights=...) override via config.yaml
   - hypothesis_edge_weights section in config.yaml (commented example)
   - main.py and regenerate_report.py read and pass through

5. regenerate_report.py auto-picks the latest run/*/graph_state.json
   when no CLI arg is given (was a hardcoded date path)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 14:10:15 +08:00
BattleTag
fde96c7d9f docs: rewrite README for EvidenceGraph + 5-phase + 7-agent architecture
Previous README described a Blackboard-based 4-phase, 6-agent system.
The actual code uses:
- EvidenceGraph with typed weighted edges (Phenomenon/Hypothesis/Entity)
- 5 phases (explicit Hypothesis Generation between survey and investigation)
- 7 agents (added HypothesisAgent)

Documents the confidence update formula, Phenomenon Jaccard merging,
Asset Library inode dedup, tool-result caching, Gap Analysis coverage
check, auto-persistence, and the resume mechanism.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 14:09:59 +08:00
BattleTag
097d2ce472 Initial commit
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 17:36:26 +08:00