MASForensic

Author	SHA1	Message	Date
BattleTag	6b485b98f7	fix(grounding): auto-rescue hallucinated invocation_id + list real ids in error First full-case run (runs/2026-05-20T20-15-04/) produced 83 GroundingError rejections, almost all from a single failure mode: LLM cites a plausible- looking inv-XXXXXXXX that doesn't exist, while the fact's value is in fact present verbatim in one of its real tool outputs. The agent knew which tool it read from, it just mis-typed the citation id. Two-layer fix in evidence_graph.validate_fact_grounding: Layer A (silent heal): when the cited inv-id misses, search the same agent / task's invocations for one whose output contains the value (strict or normalised substring). If exactly one matches, rewrite fact.invocation_id in place and accept. Multi-match is NOT auto- rescued — the candidate ids go back to the LLM so it picks deliberately. Layer B (informative retry): GroundingError now appends the agent's recent invocation ids and a brief tool-call summary, so the LLM has the real ids in front of it for the next attempt rather than fabricating again from memory. Both layers preserve the design invariant: the fact's value must still be present in a real tool output — nothing new can land grounded that wasn't already verifiable. Cross-agent / cross-task isolation is also preserved (rescue candidates filtered on agent + task_id). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-21 02:14:20 -10:00
BattleTag	81ade8f7ac	feat(refit): complete S1-S6 — case abstraction, grounding, log-odds, plugins, coref, multi-source Consolidates the long-running refit work (DESIGN.md as authoritative spec) into a single baseline commit. Six stages landed together: S1 Case + EvidenceSource abstraction; tools parameterised by source_id (case.py, main.py multi-source bootstrap, .bin extension support) S2 Grounding gateway in add_phenomenon: verified_facts cite real ToolInvocation ids; substring / normalised match enforced; agent + task scope checked. Phenomenon.description split into verified_facts (grounded) + interpretation (free text). [invocation: inv-xxx] prefix on every wrapped tool result so the LLM can cite. S3 Confidence as additive log-odds: edge_type → log10(LR) calibration table; commutative updates; supported / refuted thresholds derived from log_odds; hypothesis × evidence matrix view. S4 iOS plugin: unzip_archive + parse_plist / sqlite_tables / sqlite_query / parse_ios_keychain / read_idevice_info; IOSArtifactAgent; SOURCE_TYPE_AGENTS routing. S5 Cross-source entity resolution: typed identifiers on Entity, observe_identity gateway, auto coref hypothesis with shared / conflicting strong/weak LR edges, reversible same_as edges, actor_clusters() view. S6 Android partition probe + AndroidArtifactAgent; MediaAgent with OCR fallback; orchestrator Phase 1 iterates every analysable source; platform-aware get_triage_agent_type; ReportAgent renders actor clusters + per-source breakdown. 142 unit tests / 1 skipped — full coverage of the new gateway, log-odds math, coref hypothesis fall-out, and orchestrator multi-source dispatch. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-21 02:12:10 -10:00
BattleTag	444d58726a	refactor: native tool calling + generic forced-retry + terminal exit - llm_client: switch tool_call_loop from text-based <tool_call> regex to OpenAI-native tools=[...] / structured tool_calls field; accumulate delta.reasoning_content for DeepSeek thinking-mode echo-back; fold preserves system msg and aligns boundary to never orphan role:tool - base_agent: generic forced-retry via mandatory_record_tools class attr (filesystem -> add_phenomenon, timeline -> add_temporal_edge, hypothesis -> add_hypothesis, report -> save_report); count via executor wrapper - terminal_tools class attr + loop short-circuit: when a terminal tool is called, loop exits with its raw return as final_text. ReportAgent declares save_report as terminal - replaces the <answer>-tag stop signal that native tool calling broke - _execute_*: return (raw, formatted) - terminal exit uses untruncated raw, conversation history uses 3000-char-capped formatted - evidence_graph + orchestrator: LLM-derived InvestigationArea support (hypothesis-driven coverage check, replaces hardcoded _AREA_KEYWORDS / _AREA_TOOLS); manual yaml block kept as optional seed - strip <answer> references from agent prompts (no longer load-bearing) Verified on CFReDS image across 4 smoke runs: 0 JSON parse failures (was 3); 22 temporal edges from Phase 4 (was 0); ReportAgent exits via save_report (was max_iterations regression). 78/78 unit tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-13 13:51:19 +08:00
BattleTag	0a2b344c84	fix: share _safe_json_loads with tool-call parser, not just orchestrator Move _safe_json_loads from orchestrator.py to llm_client.py and have _extract_tool_calls use it when parsing <tool_call> JSON blocks from model output. orchestrator now imports it from llm_client. Background: in the first full DeepSeek run (runs/2026-05-12T17-25-38), ~10 'Failed to parse tool call JSON' warnings appeared, all from regex patterns where the LLM wrote \. or \* inside JSON string values: Failed to parse tool call JSON: {..., "pattern": "Outlook Express\|...\|\.dbx"} Failed to parse tool call JSON: {..., "pattern": "ethereal.\.pcap"} Failed to parse tool call JSON: {..., "pattern": "lookatlan.\.txt\|..."} These are exactly the kind of stray-backslash errors stage-1 sanitize already handles for orchestrator JSON calls — but tool-call extraction was using bare json.loads. Result: each failed tool call silently dropped on the floor, the LLM never got a result, and at least one network agent burned 14m26s spinning before hitting max_iterations=40. Now the sanitize/log-on-failure path is shared. Verified against the three failure cases from yesterday's log: all three now parse cleanly. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 20:29:21 +08:00
BattleTag	76df34ed79	docs: add TODO marker for adaptive edge weights Note that the hard-coded HYPOTHESIS_EDGE_WEIGHTS table is a temporary choice; an adaptive scheme should be explored once the full pipeline is stable. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 17:14:23 +08:00
BattleTag	893f5b5de2	fix: address agent boundary / JSON robustness / Phase 4 no-op from CFReDS run Issues found running the system end-to-end on the NIST CFReDS Hacking Case disk image (SCHARDT.001, Mr. Evil). Four interconnected fixes: 1. HypothesisAgent boundary leak (two layers) B.1 Tool set: BaseAgent._register_graph_tools was registering add_phenomenon / add_lead / link_to_entity for every agent. With an empty graph in Phase 2, HypothesisAgent "compensated" by inventing phenomena, dispatching leads, and linking entities. B.2 Prompt leak: BaseAgent's shared system prompt hard-coded "Call investigation tools (list_directory, parse_registry_key, etc.)". HypothesisAgent hallucinated list_directory and wasted 2 LLM rounds on 'unknown tool' errors before backing off. Fix: - Split _register_graph_tools into _register_graph_read_tools + _register_graph_write_tools. - HypothesisAgent, ReportAgent, TimelineAgent override _register_graph_tools to skip write tools. - HypothesisAgent and TimelineAgent override _build_system_prompt with focused, role-specific workflows (no Phase A-D investigation boilerplate). 2. JSON parse failures in Phase 3 lead generation (5/6 hypotheses lost) DeepSeek emits JSON with stray backslashes (Windows path references) and occasional minor syntax slips. Old single-stage sanitize couldn't recover; per-hypothesis fallback silently swallowed each failure. Fix: - _safe_json_loads: progressive — stage 0 as-is, stage 1 escape stray \X (anything not in valid JSON escape set), log raw input on final failure for diagnosis. - New _call_llm_for_json helper: on parse failure, append the error to the prompt and re-call LLM (self-correcting retry, up to 2). - All 4 LLM-JSON callsites in orchestrator refactored to use it. 3. Phase 1 sometimes skipped add_phenomenon (LLM treated <answer> as deliverable) Strengthen BaseAgent's RECORDING REQUIREMENT — explicit "your <answer> is DISCARDED; only graph mutations propagate" plus a new rule: negative findings (searched X, found nothing) MUST also be recorded as phenomena, since they constrain the hypothesis space. 4. Phase 4 Timeline was a no-op TimelineAgent inherited BaseAgent's Phase A-D prompt and never called add_temporal_edge — produced 0 temporal edges. Override the prompt with concrete workflow (build_filesystem_timeline -> get_timestamped_phenomena -> 15-40 add_temporal_edge calls) and restrict tool set to read-only + its 3 temporal tools. Verified end-to-end: HypothesisAgent now 8 tools (no writes), ReportAgent 13 (no graph writes), TimelineAgent 10 (read + temporal + timeline). All 60 unit tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 17:14:16 +08:00
BattleTag	0a966d8476	feat: switch LLM client to OpenAI SDK for DeepSeek compatibility The previous LLMClient used raw httpx + Claude Messages API (/v1/messages, x-api-key, Anthropic SSE event types). Incompatible with DeepSeek. Rewrite LLMClient.__init__/chat/close to use openai.AsyncOpenAI: - /v1/chat/completions endpoint, OpenAI message format - Bearer auth, native SDK error types - Stream chunks via async for + chunk.choices[0].delta.content Tool calling protocol (ReAct text-based tags) and all surrounding helpers (_apply_progressive_decay, _fold_old_messages, _partition_tool_calls, tool_call_loop, etc.) are unchanged — endpoint-agnostic by design. New optional config params surfaced to config.yaml.agent: - reasoning_effort: "high" | "medium" | "low" — DeepSeek/o1-style depth - thinking_enabled: bool — DeepSeek extra_body.thinking switch main.py and regenerate_report.py pass these through to LLMClient. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 17:13:54 +08:00
BattleTag	31812a72ee	test: track tests/ directory in version control tests/test_optimizations.py — 60 pytest cases covering: - EvidenceGraph: quality scoring, Jaccard merge, async safety, hypothesis confidence updates, asset library - llm_client: tool-result truncation, parallel batch execution, progressive context decay, message folding - orchestrator: parallel dispatch, batched lead generation, batched judging - tool_registry: result cache key derivation FakeAgent.run signatures updated to BaseAgent.run(task, lead_id=None). Previously listed in .gitignore (which is itself untracked, so the ignore rule lives only locally). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 14:10:31 +08:00
BattleTag	74e6bde13a	refactor: lead provenance, unified link path, SSOT cleanup, configurable weights Five interrelated cleanups: 1. Lead -> Phenomenon provenance - Phenomenon.from_lead_id field on the dataclass - BaseAgent.run(lead_id=...) writes self._current_lead_id - _add_phenomenon auto-injects from agent state (LLM unaware) - Orchestrator dispatch passes lead.id; Phase 1/2-auto/4/5 stay None - Merge path preserves the first non-None lead_id on collision 2. Unified Phenomenon <-> Hypothesis link path - HypothesisAgent only adds hypotheses, never links - link_phenomenon_to_hypothesis tool + executor removed - All links go through Orchestrator._judge_new_phenomena - Phase 2 unconditionally judges after hypothesis generation - Gap Analysis judges after each dispatch round (Three previously-missing judge calls now in place.) 3. SSOT in agent subclasses - Remove RoleTemplate dataclass, ROLE_TEMPLATES dict, _instantiate_from_template method - Each agent subclass owns name, role, and tool list - agent_factory.py shrinks from 299 to 153 lines - All 7 agents now route through _AGENT_CLASSES (filesystem, registry, communication, network, timeline were previously dead subclasses overridden by templates) 4. Configurable edge weights - HYPOTHESIS_EDGE_WEIGHTS -> _DEFAULT_EDGE_WEIGHTS (private default) - EvidenceGraph(edge_weights=...) override via config.yaml - hypothesis_edge_weights section in config.yaml (commented example) - main.py and regenerate_report.py read and pass through 5. regenerate_report.py auto-picks the latest run/*/graph_state.json when no CLI arg is given (was a hardcoded date path) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 14:10:15 +08:00
BattleTag	fde96c7d9f	docs: rewrite README for EvidenceGraph + 5-phase + 7-agent architecture Previous README described a Blackboard-based 4-phase, 6-agent system. The actual code uses: - EvidenceGraph with typed weighted edges (Phenomenon/Hypothesis/Entity) - 5 phases (explicit Hypothesis Generation between survey and investigation) - 7 agents (added HypothesisAgent) Documents the confidence update formula, Phenomenon Jaccard merging, Asset Library inode dedup, tool-result caching, Gap Analysis coverage check, auto-persistence, and the resume mechanism. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 14:09:59 +08:00
BattleTag	097d2ce472	Initial commit Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-09 17:36:26 +08:00
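
Commits 893f5b5de2 and 0a2b344c84 describe a progressive JSON loader: stage 0 parses the input as-is, stage 1 escapes stray backslashes (any backslash not followed by a valid JSON escape character), and the raw input is logged on final failure. The following is a minimal sketch of that idea, not the repository's code. The name `safe_json_loads_sketch`, the helper names, and the log wording are all hypothetical.

```python
import json
import logging
import re

logger = logging.getLogger(__name__)

# Backslash followed by any single character. Matching pairs left to right
# keeps an already-escaped "\\" together, so its second half is not
# mistaken for a stray backslash.
_ESCAPE_PAIR = re.compile(r"\\(.)", re.DOTALL)
# Characters that may legally follow a backslash inside a JSON string.
_VALID_ESCAPES = set('"\\/bfnrtu')


def _escape_stray(match: re.Match) -> str:
    """Keep valid escape pairs; double the backslash of invalid ones."""
    ch = match.group(1)
    if ch in _VALID_ESCAPES:
        return match.group(0)
    return "\\\\" + ch


def safe_json_loads_sketch(text: str):
    """Hypothetical two-stage loader modelled on the commit description."""
    # Stage 0: try the input unchanged.
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        pass
    # Stage 1: escape stray backslashes such as "\." or "\E", then retry.
    repaired = _ESCAPE_PAIR.sub(_escape_stray, text)
    try:
        return json.loads(repaired)
    except json.JSONDecodeError:
        # Final failure: log the raw input for diagnosis, then re-raise.
        logger.warning("Failed to parse JSON: %r", text)
        raise
```

With this sketch, `safe_json_loads_sketch(r'{"pattern": "ethereal.\.pcap"}')` parses on the second stage and yields a pattern containing a single literal backslash. Input that is broken for other reasons still raises `json.JSONDecodeError` after being logged.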

11 Commits
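
The two-layer grounding fix in commit 6b485b98f7 can be illustrated with a small sketch. It assumes simplified stand-ins for the project's types: the `Invocation` and `Fact` dataclasses, the `_normalise` rule, the function name `validate_fact_grounding_sketch`, and the error wording are all hypothetical, and rejecting a fact whose real cited invocation lacks the value is an assumption based on the S2 substring rule. Only the behaviour follows the commit message: rescue on exactly one match within the same agent and task, and otherwise an error that lists the real ids.

```python
from dataclasses import dataclass


@dataclass
class Invocation:  # hypothetical stand-in for the project's ToolInvocation
    id: str
    agent: str
    task_id: str
    output: str


@dataclass
class Fact:  # hypothetical stand-in for a verified fact
    value: str
    invocation_id: str


class GroundingError(Exception):
    pass


def _normalise(s: str) -> str:
    # Assumed normalisation: lower-case and collapse whitespace.
    return " ".join(s.lower().split())


def _contains(output: str, value: str) -> bool:
    # Strict substring first, then normalised substring.
    return value in output or _normalise(value) in _normalise(output)


def validate_fact_grounding_sketch(fact, invocations, agent, task_id):
    if not fact.value.strip():
        raise GroundingError("empty fact value cannot be grounded")
    # Cross-agent / cross-task isolation: only this agent's task counts.
    scoped = [i for i in invocations
              if i.agent == agent and i.task_id == task_id]
    cited = next((i for i in scoped if i.id == fact.invocation_id), None)
    if cited is not None:
        if _contains(cited.output, fact.value):
            return
        raise GroundingError(
            f"value not found in output of {cited.id}")
    # Layer A (silent heal): the cited id does not exist in scope.
    candidates = [i.id for i in scoped if _contains(i.output, fact.value)]
    if len(candidates) == 1:
        fact.invocation_id = candidates[0]  # rewrite in place and accept
        return
    # Layer B (informative retry): hand the real ids back to the LLM.
    raise GroundingError(
        f"unknown invocation_id {fact.invocation_id!r}; "
        f"candidates containing the value: {candidates}; "
        f"real ids in scope: {[i.id for i in scoped]}")
```

In both branches the fact's value must still appear in a real tool output, so the sketch keeps the invariant the commit states: a rescue only corrects the citation and never grounds a value that was not already verifiable.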