MASForensic

Author	SHA1	Message	Date
BattleTag	65745d21dc	feat(strategist) S4: InvestigationStrategist agent DESIGN_STRATEGIST.md §3. The smallest possible agent; its entire output per round is one decision: propose 1-3 leads (each citing a real hypothesis it expects to move) OR declare the investigation complete with a reason. Constraint surface: mandatory_record_tools = ("propose_lead", "declare_investigation_complete"); terminal_tools = ("declare_investigation_complete",). The agent inherits the BaseAgent forced-retry mechanism: if it returns without calling either action tool, the orchestrator force-prompts a RECORD-only retry. declare_complete being terminal means the tool_call_loop short-circuits the moment the strategist decides we're done. _register_graph_tools overrides BaseAgent's default to skip _register_graph_write_tools entirely; the strategist NEVER writes phenomena, entities, edges, or hypotheses directly. All graph mutations come from the workers it dispatches via leads. This keeps the planning agent's responsibility surface narrow: read the graph, choose what to do next, that's it. Prompt walks through the workflow (call graph_overview / marginal_yield / budget_status / source_coverage first, then take exactly one terminal action) with decision criteria for propose vs stop. Registered in agent_factory._AGENT_CLASSES["strategist"]. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-21 02:22:05 -10:00
BattleTag	81ade8f7ac	feat(refit): complete S1-S6 — case abstraction, grounding, log-odds, plugins, coref, multi-source Consolidates the long-running refit work (DESIGN.md as authoritative spec) into a single baseline commit. Six stages landed together: S1 Case + EvidenceSource abstraction; tools parameterised by source_id (case.py, main.py multi-source bootstrap, .bin extension support) S2 Grounding gateway in add_phenomenon: verified_facts cite real ToolInvocation ids; substring / normalised match enforced; agent + task scope checked. Phenomenon.description split into verified_facts (grounded) + interpretation (free text). [invocation: inv-xxx] prefix on every wrapped tool result so the LLM can cite. S3 Confidence as additive log-odds: edge_type → log10(LR) calibration table; commutative updates; supported / refuted thresholds derived from log_odds; hypothesis × evidence matrix view. S4 iOS plugin: unzip_archive + parse_plist / sqlite_tables / sqlite_query / parse_ios_keychain / read_idevice_info; IOSArtifactAgent; SOURCE_TYPE_AGENTS routing. S5 Cross-source entity resolution: typed identifiers on Entity, observe_identity gateway, auto coref hypothesis with shared / conflicting strong/weak LR edges, reversible same_as edges, actor_clusters() view. S6 Android partition probe + AndroidArtifactAgent; MediaAgent with OCR fallback; orchestrator Phase 1 iterates every analysable source; platform-aware get_triage_agent_type; ReportAgent renders actor clusters + per-source breakdown. 142 unit tests / 1 skipped — full coverage of the new gateway, log-odds math, coref hypothesis fall-out, and orchestrator multi-source dispatch. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-21 02:12:10 -10:00
BattleTag	444d58726a	refactor: native tool calling + generic forced-retry + terminal exit - llm_client: switch tool_call_loop from text-based <tool_call> regex to OpenAI-native tools=[...] / structured tool_calls field; accumulate delta.reasoning_content for DeepSeek thinking-mode echo-back; fold preserves system msg and aligns boundary to never orphan role:tool - base_agent: generic forced-retry via mandatory_record_tools class attr (filesystem -> add_phenomenon, timeline -> add_temporal_edge, hypothesis -> add_hypothesis, report -> save_report); count via executor wrapper - terminal_tools class attr + loop short-circuit: when a terminal tool is called, loop exits with its raw return as final_text. ReportAgent declares save_report as terminal - replaces the <answer>-tag stop signal that native tool calling broke - _execute_*: return (raw, formatted) - terminal exit uses untruncated raw, conversation history uses 3000-char-capped formatted - evidence_graph + orchestrator: LLM-derived InvestigationArea support (hypothesis-driven coverage check, replaces hardcoded _AREA_KEYWORDS / _AREA_TOOLS); manual yaml block kept as optional seed - strip <answer> references from agent prompts (no longer load-bearing) Verified on CFReDS image across 4 smoke runs: 0 JSON parse failures (was 3); 22 temporal edges from Phase 4 (was 0); ReportAgent exits via save_report (was max_iterations regression). 78/78 unit tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-13 13:51:19 +08:00
BattleTag	893f5b5de2	fix: address agent boundary / JSON robustness / Phase 4 no-op from CFReDS run Issues found running the system end-to-end on the NIST CFReDS Hacking Case disk image (SCHARDT.001, Mr. Evil). Four interconnected fixes: 1. HypothesisAgent boundary leak (two layers) B.1 Tool set: BaseAgent._register_graph_tools was registering add_phenomenon / add_lead / link_to_entity for every agent. With an empty graph in Phase 2, HypothesisAgent "compensated" by inventing phenomena, dispatching leads, and linking entities. B.2 Prompt leak: BaseAgent's shared system prompt hard-coded "Call investigation tools (list_directory, parse_registry_key, etc.)". HypothesisAgent hallucinated list_directory and wasted 2 LLM rounds on 'unknown tool' errors before backing off. Fix: - Split _register_graph_tools into _register_graph_read_tools + _register_graph_write_tools. - HypothesisAgent, ReportAgent, TimelineAgent override _register_graph_tools to skip write tools. - HypothesisAgent and TimelineAgent override _build_system_prompt with focused, role-specific workflows (no Phase A-D investigation boilerplate). 2. JSON parse failures in Phase 3 lead generation (5/6 hypotheses lost) DeepSeek emits JSON with stray backslashes (Windows path references) and occasional minor syntax slips. Old single-stage sanitize couldn't recover; per-hypothesis fallback silently swallowed each failure. Fix: - _safe_json_loads: progressive (stage 0 as-is; stage 1 escape stray \X, i.e. anything not in the valid JSON escape set); log raw input on final failure for diagnosis. - New _call_llm_for_json helper: on parse failure, append the error to the prompt and re-call LLM (self-correcting retry, up to 2). - All 4 LLM-JSON callsites in orchestrator refactored to use it. 3. Phase 1 sometimes skipped add_phenomenon (LLM treated <answer> as deliverable) Strengthen BaseAgent's RECORDING REQUIREMENT: explicit "your <answer> is DISCARDED; only graph mutations propagate" plus a new rule: negative findings (searched X, found nothing) MUST also be recorded as phenomena, since they constrain the hypothesis space. 4. Phase 4 Timeline was a no-op TimelineAgent inherited BaseAgent's Phase A-D prompt and never called add_temporal_edge, so it produced 0 temporal edges. Override the prompt with concrete workflow (build_filesystem_timeline -> get_timestamped_phenomena -> 15-40 add_temporal_edge calls) and restrict tool set to read-only + its 3 temporal tools. Verified end-to-end: HypothesisAgent now 8 tools (no writes), ReportAgent 13 (no graph writes), TimelineAgent 10 (read + temporal + timeline). All 60 unit tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 17:14:16 +08:00
BattleTag	74e6bde13a	refactor: lead provenance, unified link path, SSOT cleanup, configurable weights Five interrelated cleanups: 1. Lead -> Phenomenon provenance - Phenomenon.from_lead_id field on the dataclass - BaseAgent.run(lead_id=...) writes self._current_lead_id - _add_phenomenon auto-injects from agent state (LLM unaware) - Orchestrator dispatch passes lead.id; Phase 1/2-auto/4/5 stay None - Merge path preserves the first non-None lead_id on collision 2. Unified Phenomenon <-> Hypothesis link path - HypothesisAgent only adds hypotheses, never links - link_phenomenon_to_hypothesis tool + executor removed - All links go through Orchestrator._judge_new_phenomena - Phase 2 unconditionally judges after hypothesis generation - Gap Analysis judges after each dispatch round (Three previously-missing judge calls now in place.) 3. SSOT in agent subclasses - Remove RoleTemplate dataclass, ROLE_TEMPLATES dict, _instantiate_from_template method - Each agent subclass owns name, role, and tool list - agent_factory.py shrinks from 299 to 153 lines - All 7 agents now route through _AGENT_CLASSES (filesystem, registry, communication, network, timeline were previously dead subclasses overridden by templates) 4. Configurable edge weights - HYPOTHESIS_EDGE_WEIGHTS -> _DEFAULT_EDGE_WEIGHTS (private default) - EvidenceGraph(edge_weights=...) override via config.yaml - hypothesis_edge_weights section in config.yaml (commented example) - main.py and regenerate_report.py read and pass through 5. regenerate_report.py auto-picks the latest run/*/graph_state.json when no CLI arg is given (was a hardcoded date path) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 14:10:15 +08:00
BattleTag	097d2ce472	Initial commit Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-09 17:36:26 +08:00
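
Commit 81ade8f7ac (stage S3) describes hypothesis confidence as additive log-odds: each edge type maps to a log10 likelihood ratio, updates commute, and the supported / refuted states are derived from thresholds on the running total. The sketch below illustrates that idea only. The table values, the threshold values, and the names `LOG10_LR`, `SUPPORTED_AT`, `REFUTED_AT`, `update_log_odds`, `probability`, and `status` are assumptions made for illustration; the log does not show the project's actual calibration.

```python
# Hypothetical calibration table: edge_type -> log10(likelihood ratio).
# Values are illustrative assumptions, not the project's real table.
LOG10_LR = {
    "strongly_supports": 1.0,
    "supports": 0.5,
    "contradicts": -0.5,
    "strongly_contradicts": -1.0,
}

# Hypothetical thresholds on the accumulated log10 odds.
SUPPORTED_AT = 2.0
REFUTED_AT = -2.0


def update_log_odds(prior_log_odds, edge_types):
    """Add the log10(LR) of every evidence edge to the prior.

    Addition is commutative, so the order of the edges does not matter.
    """
    return prior_log_odds + sum(LOG10_LR[edge] for edge in edge_types)


def probability(log_odds):
    """Convert log10 odds back to a probability in (0, 1)."""
    odds = 10.0 ** log_odds
    return odds / (1.0 + odds)


def status(log_odds):
    """Derive the hypothesis status from the accumulated log-odds."""
    if log_odds >= SUPPORTED_AT:
        return "supported"
    if log_odds <= REFUTED_AT:
        return "refuted"
    return "open"


edges = ["supports", "strongly_supports", "supports"]
total = update_log_odds(0.0, edges)
print(total, status(total))
```

Because the update is a plain sum, replaying the same edges in any order reaches the same total, which is the property the commit message calls "commutative updates".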

6 Commits
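
Commit 893f5b5de2 describes `_safe_json_loads` as a progressive parser: try the text as-is, then escape stray backslashes, and log the raw input if parsing still fails. A minimal sketch of that staging follows, assuming only what the commit message states. The function name `safe_json_loads_sketch` and the helper `_escape_stray` are hypothetical and are not the repository's code.

```python
import json
import logging
import re

logger = logging.getLogger(__name__)

# Characters that may legally follow a backslash inside a JSON string.
_VALID_ESCAPES = set('"\\/bfnrtu')

# Match a backslash together with the character after it, so that an
# already-escaped pair such as \\ is consumed as one unit.
_BACKSLASH_PAIR = re.compile(r"\\(.)", re.DOTALL)


def _escape_stray(match):
    """Double the backslash when the following character is not a valid escape."""
    ch = match.group(1)
    if ch in _VALID_ESCAPES:
        return match.group(0)
    return "\\\\" + ch


def safe_json_loads_sketch(text):
    """Stage 0: parse as-is. Stage 1: escape stray backslashes and retry."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        pass

    repaired = _BACKSLASH_PAIR.sub(_escape_stray, text)
    try:
        return json.loads(repaired)
    except json.JSONDecodeError:
        # Keep the raw input in the log so the failure can be diagnosed.
        logger.error("JSON parse failed after repair; raw input: %r", text)
        raise


raw = r'{"path": "C:\Windows\System32"}'
print(safe_json_loads_sketch(raw)["path"])
```

This repair is heuristic. A Windows path such as `C:\temp` contains `\t`, which is a valid JSON escape, so it parses in stage 0 as a tab instead of being repaired. A sequence such as `\users` still fails, because `\u` must be followed by four hex digits. The self-correcting LLM retry mentioned in the same commit would be the fallback for cases like that.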