refactor: native tool calling + generic forced-retry + terminal exit

- llm_client: switch tool_call_loop from text-based <tool_call> regex to OpenAI-native tools=[...] / structured tool_calls field; accumulate delta.reasoning_content for DeepSeek thinking-mode echo-back; fold preserves system msg and aligns boundary to never orphan role:tool - base_agent: generic forced-retry via mandatory_record_tools class attr (filesystem -> add_phenomenon, timeline -> add_temporal_edge, hypothesis -> add_hypothesis, report -> save_report); count via executor wrapper - terminal_tools class attr + loop short-circuit: when a terminal tool is called, loop exits with its raw return as final_text. ReportAgent declares save_report as terminal - replaces the <answer>-tag stop signal that native tool calling broke - _execute_*: return (raw, formatted) - terminal exit uses untruncated raw, conversation history uses 3000-char-capped formatted - evidence_graph + orchestrator: LLM-derived InvestigationArea support (hypothesis-driven coverage check, replaces hardcoded _AREA_KEYWORDS / _AREA_TOOLS); manual yaml block kept as optional seed - strip <answer> references from agent prompts (no longer load-bearing) Verified on CFReDS image across 4 smoke runs: 0 JSON parse failures (was 3); 22 temporal edges from Phase 4 (was 0); ReportAgent exits via save_report (was max_iterations regression). 78/78 unit tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
fix: share _safe_json_loads with tool-call parser, not just orchestrator
2026-05-13 13:51:19 +08:00 · 2026-05-12 20:29:21 +08:00 · 2026-05-12 17:14:23 +08:00 · 2026-05-12 17:14:16 +08:00 · 2026-05-12 17:13:54 +08:00
13 changed files with 1818 additions and 429 deletions
--- a/README.md
+++ b/README.md
@@ -36,6 +36,8 @@ main.py                          入口：配置加载、镜像选择、断连
 | `Entity` | `ent-*` | 人、程序、主机、IP 等可复现的实体 |

 Phenomenon → Hypothesis 的边类型与权重写死在 `HYPOTHESIS_EDGE_WEIGHTS`：
+# TODO 
+当前流程跑通以后，寻找自适应方案

 | 边类型 | 权重 | 语义 |
 |---|---:|---|
@@ -65,14 +67,18 @@ Phenomenon → Hypothesis 的边类型与权重写死在 `HYPOTHESIS_EDGE_WEIGHT
 | **Phase 4** | TimelineAgent 用 `build_filesystem_timeline` 生成 MAC 时间线，与 Phenomenon 时间戳关联 |
 | **Phase 5** | ReportAgent 综合假设、证据、实体，生成 Markdown 报告 |

-### Gap Analysis（Phase 3 末）
+### Investigation Areas（hypothesis-derived）

-`config.yaml:investigation_areas` 列出必须覆盖的调查领域（系统信息、用户账户、网络配置、邮件配置、IRC 日志、PCAP、删除文件、Prefetch 等）。Orchestrator 两层判定覆盖情况：
+Phase 2 末尾 orchestrator 调一次 LLM 从所有 active hypothesis 派生 5-12 个 **InvestigationArea**（snake_case slug、description、suggested_agent、expected_keywords、expected_tools、priority、motivating_hypothesis_ids）。Areas 存进 `graph.investigation_areas`，序列化到 `runs/<ts>/investigation_areas.json`。两个用途：

-1. **关键词匹配**（`_AREA_KEYWORDS`）— 扫现有 Phenomenon 标题/描述
-2. **工具命中**（`_AREA_TOOLS`）— 检查是否调用过该领域的关键工具（如 `enumerate_users`、`parse_pcap_strings`）
+1. **Phase 3 主循环提示** — 每个 hypothesis 块附 `Expected areas: a, b, c`，LLM 仍自由选 lead 但有软引导
+2. **Phase 3 末尾 Gap Analysis** — 两层判定覆盖情况：
+   - **关键词匹配**：扫 Phenomenon 标题/描述对照 area.expected_keywords
+   - **工具命中**：检查 area.expected_tools 是否实际调用过

-未覆盖的领域自动派发 lead，最多 3 轮补漏。
+未覆盖的 area 自动派 lead（`suggested_agent` + `priority` + `motivating_hypothesis_ids[0]` 透传给 `Lead.hypothesis_id` 保留 provenance），最多 3 轮补漏。
+
+**手动 override**：`config.yaml:investigation_areas` 默认注释掉，纯 LLM 派生。取消注释可添加强制必查的领域，会先于 LLM 写入并通过 slug-based dedupe 保护不被覆盖（LLM 只会 augment keyword/tool 列表）。这是跨案件/跨平台适配的关键 —— 不再 hardcode Windows-specific 领域。

 ## Agent 体系

@@ -181,11 +187,12 @@ max_investigation_rounds: 5          # Phase 3 最大迭代轮数
 #   - title: "嫌疑人主动实施网络嗅探"
 #     description: "..."

-investigation_areas:                 # Gap Analysis 必须覆盖的领域
-  - area: system_info
-    agent: registry
-    task: "..."
-  # ...
+# investigation_areas:                 # 可选：手动 override（默认全 LLM 派生）
+#   - area: shutdown_time              #         LLM 通过 slug dedupe 只 augment
+#     agent: registry                  #         keyword/tool 列表，不覆盖 manual
+#     priority: 3
+#     keywords: [shutdown]
+#     tools: [get_shutdown_time]
 ```

 未配置 `hypotheses` 时由 HypothesisAgent 自动生成。
--- a/agents/hypothesis.py
+++ b/agents/hypothesis.py
@@ -1,7 +1,9 @@
 """Hypothesis Agent — generates investigative hypotheses from phenomena.

 Generates hypotheses only. Phenomenon→Hypothesis linking is handled centrally
-by Orchestrator._judge_new_phenomena, so all link logic lives in one place.
+by Orchestrator._judge_new_phenomena. Tool set is restricted to read-only
+graph queries + add_hypothesis to prevent the agent from creating phenomena,
+leads, or entity links.
 """

 from __future__ import annotations
@@ -22,11 +24,16 @@ class HypothesisAgent(BaseAgent):
        "and formulate investigative hypotheses about what happened on this system. "
        "Your ultimate goal: build the most complete picture of events that occurred."
    )
+    mandatory_record_tools = ("add_hypothesis",)

    def __init__(self, llm: LLMClient, graph: EvidenceGraph) -> None:
        super().__init__(llm, graph)
        self._register_hypothesis_tools()

+    def _register_graph_tools(self) -> None:
+        """Restrict to read-only graph tools. add_hypothesis is registered separately."""
+        self._register_graph_read_tools()
+
    def _register_hypothesis_tools(self) -> None:
        self.register_tool(
            name="add_hypothesis",
@@ -51,6 +58,32 @@ class HypothesisAgent(BaseAgent):
            executor=self._add_hypothesis,
        )

+    def _build_system_prompt(self, task: str) -> str:
+        """Focused prompt — no INVESTIGATE/RECORD/LINK workflow."""
+        return (
+            f"You are {self.name}, a forensic hypothesis analyst.\n"
+            f"Role: {self.role}\n\n"
+            f"Image: {self.graph.image_path}\n"
+            f"Current investigation state: {self.graph.stats_summary()}\n\n"
+            f"Your task: {task}\n\n"
+            f"WORKFLOW:\n"
+            f"1. Call list_phenomena and search_graph to review existing findings.\n"
+            f"2. For each hypothesis you want to record, call add_hypothesis (title + description).\n"
+            f"3. STOP after you have generated 3-7 hypotheses. Do not call any more tools.\n\n"
+            f"STRICT BOUNDARIES:\n"
+            f"- Your only mutation tool is add_hypothesis. Do NOT attempt list_directory, "
+            f"parse_registry_key, extract_file, or any disk-image investigation tools — "
+            f"they are not yours and you will get 'unknown tool' errors.\n"
+            f"- You CANNOT create phenomena, leads, or entity links. The orchestrator handles "
+            f"all phenomenon↔hypothesis linking after you finish.\n"
+            f"- Each hypothesis must be specific and testable. Avoid generic templates like "
+            f"'Unauthorized Remote Access' or 'Malware Deployment' unless concrete phenomena "
+            f"in the graph already point to them.\n"
+            f"- If the graph is empty, generate broad starting hypotheses and mark them "
+            f"clearly as exploratory in their description so downstream agents know they "
+            f"still need evidence."
+        )
+
    async def _add_hypothesis(self, title: str, description: str) -> str:
        hid = await self.graph.add_hypothesis(
            title=title,
--- a/agents/report.py
+++ b/agents/report.py
@@ -2,9 +2,6 @@

 from __future__ import annotations

-import json
-import os
-
 from base_agent import BaseAgent
 from evidence_graph import EvidenceGraph
 from llm_client import LLMClient
@@ -15,34 +12,46 @@ class ReportAgent(BaseAgent):
    role = (
        "Forensic report writer. You synthesize all findings from the investigation "
        "into a structured, professional forensic analysis report organized by hypotheses.\n\n"
-        "IMPORTANT: Only include findings that have a source_tool attribution (marked VERIFIED). "
+        "Only include findings that have a source_tool attribution (marked VERIFIED). "
        "If evidence lacks source attribution, mark it as UNVERIFIED. "
-        "Do NOT invent or fabricate any data, timestamps, or findings not present in the evidence.\n\n"
-        "CRITICAL: You MUST call save_report to write the final report."
+        "Do NOT invent or fabricate any data, timestamps, or findings not present in the evidence."
    )
+    # Calling save_report is BOTH the recording action and the completion
+    # signal. tool_call_loop returns the moment save_report executes; the
+    # tool's return value becomes the agent's final_text. The forced-retry
+    # mechanism fires if save_report is never called.
+    mandatory_record_tools = ("save_report",)
+    terminal_tools = ("save_report",)

    def __init__(self, llm: LLMClient, graph: EvidenceGraph) -> None:
        super().__init__(llm, graph)
        self._register_tools()

+    def _register_graph_tools(self) -> None:
+        """Restrict to read-only graph tools. Report agent does not mutate state."""
+        self._register_graph_read_tools()
+
    def _build_system_prompt(self, task: str) -> str:
-        """Report agent gets a clean prompt — no Phase A/B/C/D workflow."""
        return (
            f"You are a forensic report writer.\n"
            f"Role: {self.role}\n\n"
            f"Investigation state:\n{self.graph.stats_summary()}\n\n"
            f"Your task: {task}\n\n"
            f"WORKFLOW:\n"
-            f"1. Call get_hypotheses_with_evidence to get all hypotheses and their linked evidence\n"
-            f"2. Call get_all_phenomena to get detailed findings by category\n"
-            f"3. Call get_entities to get people, programs, and hosts\n"
-            f"4. Call get_case_info for case metadata\n"
-            f"5. Write the complete report directly in your <answer> block\n\n"
+            f"1. Call get_hypotheses_with_evidence, get_all_phenomena, get_entities, get_case_info "
+            f"   to gather all the data needed for the report. Make these calls in parallel.\n"
+            f"2. Assemble the complete markdown forensic report.\n"
+            f"3. Call save_report(content=<full markdown>, output_path=\"report.md\").\n"
+            f"   This single call is the completion signal — the run ENDS the moment it executes.\n"
+            f"   Do NOT call any read tools after this point; they will not run.\n"
+            f"   Do NOT write the report as free text outside of save_report; only the\n"
+            f"   `content` argument of save_report is persisted.\n\n"
            f"RULES:\n"
-            f"- Write the report DIRECTLY in <answer> — do NOT use save_report tool\n"
-            f"- Only include findings present in the evidence graph\n"
-            f"- Do NOT invent timestamps, file paths, or data not in the phenomena\n"
-            f"- The report must be complete — do not cut off mid-section\n"
+            f"- The report must be the complete markdown — do not cut off mid-section.\n"
+            f"- Only include findings present in the evidence graph.\n"
+            f"- Do NOT invent timestamps, file paths, or data not in the phenomena.\n"
+            f"- The `content` argument can be 10K+ chars. JSON-escape inner quotes (\\\") and\n"
+            f"  backslashes (\\\\) and newlines (\\n) correctly.\n"
        )

    def _register_tools(self) -> None:
@@ -182,10 +191,16 @@ class ReportAgent(BaseAgent):
        return "\n".join(lines)

    async def _save_report(self, content: str, output_path: str) -> str:
-        try:
-            os.makedirs(os.path.dirname(output_path) or ".", exist_ok=True)
-            with open(output_path, "w") as f:
-                f.write(content)
-            return f"Report saved to {output_path} ({len(content)} chars)"
-        except Exception as e:
-            return f"Error saving report: {e}"
+        """Save the report and return the content itself.
+
+        The content is returned (rather than a "saved to ..." status string)
+        so that when tool_call_loop short-circuits on this terminal tool,
+        `final_text` is the full markdown — orchestrator writes it to the
+        canonical report.md path under runs/<ts>/.
+
+        The output_path argument is kept for backward compat but the model's
+        chosen path is ignored — the orchestrator owns the persistence path.
+        """
+        if not content:
+            return ""
+        return content
--- a/agents/timeline.py
+++ b/agents/timeline.py
@@ -1,14 +1,21 @@
-"""Timeline Agent — correlates evidence across time."""
+"""Timeline Agent — connects existing phenomena with temporal edges.
+
+Operates on phenomena already in the graph. Does NOT investigate the disk
+image itself. The agent's only useful output is the temporal edges it
+creates between phenomena.
+"""

 from __future__ import annotations

-import json
+import logging

 from base_agent import BaseAgent
 from evidence_graph import EvidenceGraph
 from llm_client import LLMClient
 from tool_registry import TOOL_CATALOG

+logger = logging.getLogger(__name__)
+

 class TimelineAgent(BaseAgent):
    name = "timeline"
@@ -17,29 +24,39 @@ class TimelineAgent(BaseAgent):
        "MAC timestamps and correlate events across all phenomena categories in the "
        "evidence graph to reconstruct the sequence of activities on the system."
    )
+    mandatory_record_tools = ("add_temporal_edge",)

    def __init__(self, llm: LLMClient, graph: EvidenceGraph) -> None:
        super().__init__(llm, graph)
        self._register_tools()

+    def _register_graph_tools(self) -> None:
+        """Restrict to read-only graph tools — Timeline does not add phenomena."""
+        self._register_graph_read_tools()
+
    def _register_tools(self) -> None:
-        # Filesystem timeline tool from catalog
        td = TOOL_CATALOG.get("build_filesystem_timeline")
        if td:
            self.register_tool(td.name, td.description, td.input_schema, td.executor)

-        # Custom tool to get all phenomena with timestamps for correlation
        self.register_tool(
            name="get_timestamped_phenomena",
-            description="Get all phenomena that have timestamps, sorted chronologically. Use for timeline correlation.",
+            description=(
+                "Get all phenomena that have timestamps, sorted chronologically. "
+                "Returns each phenomenon's id, category, title, and a short description "
+                "preview. Use this as your primary input for temporal correlation."
+            ),
            input_schema={"type": "object", "properties": {}},
            executor=self._get_timestamped_phenomena,
        )

-        # Tool to add temporal edges between phenomena
        self.register_tool(
            name="add_temporal_edge",
-            description="Add a temporal relationship between two phenomena (before, after, or concurrent).",
+            description=(
+                "Add a temporal relationship edge between two existing phenomena. "
+                "Use 'before' when source phenomenon happened before target, "
+                "'concurrent' when they occurred within seconds of each other."
+            ),
            input_schema={
                "type": "object",
                "properties": {
@@ -56,6 +73,42 @@ class TimelineAgent(BaseAgent):
            executor=self._add_temporal_edge,
        )

+    def _build_system_prompt(self, task: str) -> str:
+        """Focused prompt — Timeline connects existing phenomena, doesn't investigate."""
+        return (
+            f"You are {self.name}, a forensic timeline correlation analyst.\n"
+            f"Role: {self.role}\n\n"
+            f"Image: {self.graph.image_path}\n"
+            f"Current state: {self.graph.stats_summary()}\n\n"
+            f"Your task: {task}\n\n"
+            f"WORKFLOW:\n"
+            f"1. Call build_filesystem_timeline once to materialize MAC times for the disk.\n"
+            f"2. Call get_timestamped_phenomena to see all phenomena with timestamps, "
+            f"sorted chronologically. THIS IS YOUR PRIMARY INPUT.\n"
+            f"3. For each meaningful temporal relationship between phenomena, call "
+            f"add_temporal_edge(source_id, target_id, relation). Use 'before' when "
+            f"source happened first (the common case); 'concurrent' for events within "
+            f"a few seconds of each other.\n"
+            f"   Examples of meaningful connections:\n"
+            f"     - 'Cain installer executed' (before) 'Cain.exe first execution'\n"
+            f"     - 'WHOIS first lookup'      (before) 'WHOIS second lookup'\n"
+            f"     - 'Recon tool cluster'      (before) 'Anti-forensics defrag'\n"
+            f"     - 'Tool installation'       (before) 'Tool execution'\n"
+            f"4. Aim for 15-40 temporal edges that connect the major events into a "
+            f"forensic story.\n"
+            f"5. STOP after recording all meaningful temporal edges. Do not call any more tools.\n\n"
+            f"STRICT BOUNDARIES:\n"
+            f"- Your job is to CONNECT existing phenomena, NOT to discover new ones. "
+            f"You CANNOT call add_phenomenon — the tool isn't yours.\n"
+            f"- Use ONLY phenomenon IDs returned by get_timestamped_phenomena or "
+            f"list_phenomena. NEVER fabricate IDs.\n"
+            f"- Connect events that tell a forensic story (recon -> exploit -> cover-up). "
+            f"Do not exhaustively pair every two phenomena; focus on causally-relevant "
+            f"sequences.\n"
+            f"- The orchestrator handles report writing in the next phase. Your only "
+            f"output that propagates is the temporal edges you create."
+        )
+
    async def _get_timestamped_phenomena(self) -> str:
        items = [
            ph for ph in self.graph.phenomena.values()
--- a/base_agent.py
+++ b/base_agent.py
@@ -31,11 +31,26 @@ class BaseAgent:
    name: str = "base"
    role: str = "A forensic analysis agent."

+    # Tools the agent MUST invoke at least once for the run to count as productive.
+    # If none of these were called when tool_call_loop returns, run() fires a
+    # forced retry with an explicit "you forgot to record" instruction.
+    # Subclasses override to declare their own recording responsibility
+    # (timeline → add_temporal_edge, hypothesis → add_hypothesis, report → save_report).
+    mandatory_record_tools: tuple[str, ...] = ("add_phenomenon",)
+
+    # Tools whose invocation ends the run immediately. After any terminal tool
+    # is called, tool_call_loop returns with that tool's result text as
+    # final_text. Used by agents whose "completion" is a single explicit
+    # action rather than "model decides to stop calling tools". For multi-call
+    # agents (filesystem records many phenomena) leave empty.
+    terminal_tools: tuple[str, ...] = ()
+
    def __init__(self, llm: LLMClient, graph: EvidenceGraph) -> None:
        self.llm = llm
        self.graph = graph
        self._tools: dict[str, dict] = {}  # name -> schema
        self._executors: dict[str, Any] = {}  # name -> async callable
+        self._record_call_counts: dict[str, int] = {}
        self._work_log: list[str] = []
        self._current_lead_id: str | None = None

@@ -52,7 +67,18 @@ class BaseAgent:
            "description": description,
            "input_schema": input_schema,
        }
-        self._executors[name] = executor
+        if name in self.mandatory_record_tools:
+            self._executors[name] = self._wrap_record_executor(name, executor)
+        else:
+            self._executors[name] = executor
+
+    def _wrap_record_executor(self, name: str, executor: Any) -> Any:
+        """Wrap a mandatory-record executor to count successful invocations."""
+        async def wrapped(*args, **kwargs):
+            result = await executor(*args, **kwargs)
+            self._record_call_counts[name] = self._record_call_counts.get(name, 0) + 1
+            return result
+        return wrapped

    def get_tool_definitions(self) -> list[dict]:
        """Get tool definitions in Claude API format."""
@@ -91,14 +117,21 @@ class BaseAgent:
            f"  FIRST call list_phenomena to get the current IDs — do NOT rely on memory.\n"
            f"  Then call link_to_entity for each relevant phenomenon.\n"
            f"  NEVER guess or fabricate a phenomenon ID. If an ID is not in list_phenomena output, it does not exist.\n\n"
-            f"Phase D — ANSWER:\n"
-            f"  Only give your <answer> AFTER completing Phases B and C.\n\n"
-            f"IMPORTANT:\n"
-            f"- You MUST call add_phenomenon at least once before finishing\n"
-            f"- Complete each phase before starting the next\n"
-            f"- Other agents can ONLY see what you write to the graph\n"
-            f"- If you don't record findings, they are LOST\n"
-            f"- Include relevant file paths, inode numbers, timestamps, and raw data\n\n"
+            f"Phase D — STOP:\n"
+            f"  Once all phenomena are recorded and entities linked, you are DONE.\n"
+            f"  Do not call any more tools. The orchestrator picks up automatically.\n\n"
+            f"CRITICAL — RECORDING REQUIREMENT:\n"
+            f"- Only graph mutations propagate to other agents and the final report.\n"
+            f"- You MUST call add_phenomenon for EVERY significant finding BEFORE you stop.\n"
+            f"- NEGATIVE findings count too. If you searched X (a directory, a pattern, "
+            f"a registry key) and found NOTHING, that absence IS evidence — call "
+            f"add_phenomenon with a 'No matches for X' title and the search scope in "
+            f"raw_data. Negative findings constrain the hypothesis space and prevent "
+            f"the next agent from wasting time re-searching.\n"
+            f"- If you stop without having called add_phenomenon at least once, the task "
+            f"is FAILED and a forced retry will fire.\n"
+            f"- Include exact file paths, inode numbers, timestamps, and the source_tool "
+            f"that produced each finding.\n\n"
            f"ANTI-HALLUCINATION RULES — STRICTLY ENFORCED:\n"
            f"- ONLY record findings that appear VERBATIM in tool results you received\n"
            f"- NEVER invent or guess timestamps, file paths, inode numbers, or program names\n"
@@ -116,6 +149,7 @@ class BaseAgent:
        self._current_lead_id = lead_id

        self._register_graph_tools()
+        self._record_call_counts.clear()

        system = self._build_system_prompt(task)
        messages = [{"role": "user", "content": task}]
@@ -124,12 +158,60 @@ class BaseAgent:
        ph_before = len(self.graph.phenomena)

        try:
-            final_text, _ = await self.llm.tool_call_loop(
+            final_text, conversation = await self.llm.tool_call_loop(
                messages=messages,
                tools=self.get_tool_definitions(),
                tool_executor=self._executors,
                system=system,
+                terminal_tools=self.terminal_tools,
            )
+
+            # Forced-record retry: if the agent has any mandatory recording
+            # tools but never invoked any of them, force one more round with
+            # an explicit "you forgot to record" instruction. The mandatory
+            # set is declared on the class — Timeline → add_temporal_edge,
+            # Hypothesis → add_hypothesis, ReportAgent → (). For agents with
+            # empty mandatory_record_tools this branch is a no-op.
+            registered_mandatory = [
+                t for t in self.mandatory_record_tools if t in self._executors
+            ]
+            recorded_any = any(
+                self._record_call_counts.get(t, 0) > 0
+                for t in registered_mandatory
+            )
+            if registered_mandatory and not recorded_any:
+                missing = "/".join(registered_mandatory)
+                logger.warning(
+                    "[%s] finished without calling any of [%s] — forcing RECORD retry",
+                    self.name, missing,
+                )
+                conversation.append({
+                    "role": "user",
+                    "content": (
+                        f"STOP. You produced an answer without ever calling "
+                        f"{missing}. Your answer is DISCARDED — only graph "
+                        f"mutations propagate to other agents and the final "
+                        f"report.\n\n"
+                        f"You MUST now call {missing} for every significant "
+                        f"finding from your prior investigation, including "
+                        f"exact identifiers, timestamps, and the source_tool "
+                        f"that produced each finding. If you genuinely found "
+                        f"NOTHING noteworthy, call the recording tool ONCE "
+                        f"with a 'No significant findings' style entry "
+                        f"summarizing what you searched.\n\n"
+                        f"Do not run more investigation tools. Just record "
+                        f"what you already found. Then end."
+                    ),
+                })
+                final_text, _ = await self.llm.tool_call_loop(
+                    messages=conversation,
+                    tools=self.get_tool_definitions(),
+                    tool_executor=self._executors,
+                    system=system,
+                    max_iterations=10,
+                    terminal_tools=self.terminal_tools,
+                )
+
            self._work_log.append(f"[Task: {task[:80]}] -> {final_text[:150]}")
        except Exception:
            self.graph.agent_status[self.name] = "failed"
@@ -145,9 +227,17 @@ class BaseAgent:
    # ---- Graph interaction tools --------------------------------------------

    def _register_graph_tools(self) -> None:
-        """Register tools for querying and writing to the evidence graph."""
+        """Register graph query + mutation tools.

-        # --- Read tools ---
+        Subclasses can override to restrict the toolset. For example, a
+        read-only agent (hypothesis, report) overrides this to skip
+        _register_graph_write_tools.
+        """
+        self._register_graph_read_tools()
+        self._register_graph_write_tools()
+
+    def _register_graph_read_tools(self) -> None:
+        """Register read-only graph + asset query tools."""

        self.register_tool(
            name="list_phenomena",
@@ -213,7 +303,49 @@ class BaseAgent:
            executor=self._get_hypothesis_status,
        )

-        # --- Write tools ---
+        self.register_tool(
+            name="list_assets",
+            description=(
+                "List all files extracted from the disk image. "
+                "Shows filename, category, size, local path, and inode. "
+                "Check this before calling extract_file to avoid re-extraction."
+            ),
+            input_schema={
+                "type": "object",
+                "properties": {
+                    "category": {
+                        "type": "string",
+                        "enum": [
+                            "registry_hive", "chat_log", "prefetch", "network_capture",
+                            "config_file", "address_book", "recycle_bin", "executable",
+                            "text_log", "other",
+                        ],
+                        "description": "Filter by category. Omit to list all.",
+                    },
+                },
+            },
+            executor=self._list_assets,
+        )
+
+        self.register_tool(
+            name="find_extracted_file",
+            description=(
+                "Find an already-extracted file by inode or filename. "
+                "Returns the local path so you can use it directly with "
+                "parse_registry_key, read_text_file, etc. without re-extracting."
+            ),
+            input_schema={
+                "type": "object",
+                "properties": {
+                    "inode": {"type": "string", "description": "Inode to look up."},
+                    "filename": {"type": "string", "description": "Filename or partial name to search."},
+                },
+            },
+            executor=self._find_extracted_file,
+        )
+
+    def _register_graph_write_tools(self) -> None:
+        """Register graph mutation tools (add_phenomenon, add_lead, link_to_entity)."""

        self.register_tool(
            name="add_phenomenon",
@@ -282,49 +414,6 @@ class BaseAgent:
            executor=self._link_to_entity,
        )

-        # --- Asset library tools ---
-
-        self.register_tool(
-            name="list_assets",
-            description=(
-                "List all files extracted from the disk image. "
-                "Shows filename, category, size, local path, and inode. "
-                "Check this before calling extract_file to avoid re-extraction."
-            ),
-            input_schema={
-                "type": "object",
-                "properties": {
-                    "category": {
-                        "type": "string",
-                        "enum": [
-                            "registry_hive", "chat_log", "prefetch", "network_capture",
-                            "config_file", "address_book", "recycle_bin", "executable",
-                            "text_log", "other",
-                        ],
-                        "description": "Filter by category. Omit to list all.",
-                    },
-                },
-            },
-            executor=self._list_assets,
-        )
-
-        self.register_tool(
-            name="find_extracted_file",
-            description=(
-                "Find an already-extracted file by inode or filename. "
-                "Returns the local path so you can use it directly with "
-                "parse_registry_key, read_text_file, etc. without re-extracting."
-            ),
-            input_schema={
-                "type": "object",
-                "properties": {
-                    "inode": {"type": "string", "description": "Inode to look up."},
-                    "filename": {"type": "string", "description": "Filename or partial name to search."},
-                },
-            },
-            executor=self._find_extracted_file,
-        )
-
    # ---- Tool executors -----------------------------------------------------

    async def _list_phenomena(self, category: str | None = None) -> str:
--- a/evidence_graph.py
+++ b/evidence_graph.py
@@ -197,6 +197,41 @@ class Lead:
        return cls(**d)


+@dataclass
+class InvestigationArea:
+    """An area to investigate to confirm/refute one or more hypotheses.
+
+    Derived by the orchestrator from active hypotheses after Phase 2; also
+    seeded from config.yaml:investigation_areas as an optional manual
+    override. Each area carries its own keywords + expected tools so the
+    gap-analysis coverage check is generic, not tied to hard-coded constants.
+    """
+
+    id: str                                     # "area-{slug}"
+    area: str                                   # snake_case slug (dedupe key)
+    description: str
+    suggested_agent: str                        # filesystem / registry / communication / network / timeline
+    expected_keywords: list[str] = field(default_factory=list)
+    expected_tools: list[str] = field(default_factory=list)
+    priority: int = 5                           # 1 (highest) - 10 (lowest)
+    motivating_hypothesis_ids: list[str] = field(default_factory=list)
+    created_by: str = ""                        # "manual" | "llm_derive" | "fallback"
+    created_at: str = ""
+
+    def to_dict(self) -> dict:
+        return asdict(self)
+
+    @classmethod
+    def from_dict(cls, d: dict) -> InvestigationArea:
+        return cls(**d)
+
+    def summary(self) -> str:
+        return (
+            f"[{self.area}] P{self.priority} agent={self.suggested_agent} "
+            f"(motivating: {len(self.motivating_hypothesis_ids)})"
+        )
+
+
@dataclass
 class ExtractedAsset:
    """A file extracted from the disk image and tracked in the asset library."""
@@ -270,6 +305,11 @@ class EvidenceGraph:
        self.asset_library: dict[str, ExtractedAsset] = {}
        self._inode_index: dict[str, str] = {}    # inode → asset_id

+        # Investigation areas — derived from hypotheses (LLM) and/or seeded
+        # from config.yaml:investigation_areas (manual override). Drives the
+        # gap-analysis coverage check.
+        self.investigation_areas: dict[str, InvestigationArea] = {}
+
        # Set by BaseAgent.run() before each agent execution
        self._current_agent: str = ""

@@ -295,6 +335,9 @@ class EvidenceGraph:
                "leads": [l.to_dict() for l in self.leads],
                "agent_status": dict(self.agent_status),
                "asset_library": {aid: a.to_dict() for aid, a in self.asset_library.items()},
+                "investigation_areas": {
+                    aid: a.to_dict() for aid, a in self.investigation_areas.items()
+                },
                "saved_at": datetime.now().isoformat(),
            }
            tmp = self._persist_path.with_suffix(".tmp")
@@ -345,6 +388,10 @@ class EvidenceGraph:
            asset = ExtractedAsset.from_dict(a_data)
            graph.asset_library[aid] = asset
            graph._inode_index[asset.inode] = aid
+        graph.investigation_areas = {
+            aid: InvestigationArea.from_dict(a)
+            for aid, a in data.get("investigation_areas", {}).items()
+        }
        graph._rebuild_adjacency()
        logger.info(
            "EvidenceGraph restored: %d phenomena, %d hypotheses, %d entities, "
@@ -656,6 +703,57 @@ class EvidenceGraph:
                    break
            self._auto_save()

+    # ---- Investigation areas -------------------------------------------------
+
+    async def add_investigation_area(
+        self,
+        area: str,
+        description: str,
+        suggested_agent: str,
+        expected_keywords: list[str] | None = None,
+        expected_tools: list[str] | None = None,
+        priority: int = 5,
+        motivating_hypothesis_ids: list[str] | None = None,
+        created_by: str = "",
+    ) -> tuple[str, bool]:
+        """Add or merge an investigation area. Dedupe key is the `area` slug.
+
+        On collision, union the three list fields (keywords / tools /
+        motivating_hypothesis_ids); description / suggested_agent / priority
+        are preserved from the first writer (manual seed wins over LLM derive).
+        Returns (id, was_existing).
+        """
+        async with self._lock:
+            for existing in self.investigation_areas.values():
+                if existing.area == area:
+                    for kw in (expected_keywords or []):
+                        if kw not in existing.expected_keywords:
+                            existing.expected_keywords.append(kw)
+                    for t in (expected_tools or []):
+                        if t not in existing.expected_tools:
+                            existing.expected_tools.append(t)
+                    for hid in (motivating_hypothesis_ids or []):
+                        if hid not in existing.motivating_hypothesis_ids:
+                            existing.motivating_hypothesis_ids.append(hid)
+                    self._auto_save()
+                    return existing.id, True
+
+            aid = f"area-{area}"
+            self.investigation_areas[aid] = InvestigationArea(
+                id=aid,
+                area=area,
+                description=description,
+                suggested_agent=suggested_agent,
+                expected_keywords=list(expected_keywords or []),
+                expected_tools=list(expected_tools or []),
+                priority=priority,
+                motivating_hypothesis_ids=list(motivating_hypothesis_ids or []),
+                created_by=created_by,
+                created_at=datetime.now().isoformat(),
+            )
+            self._auto_save()
+            return aid, False
+
    # ---- Asset library -------------------------------------------------------

    async def register_asset(
--- a/llm_client.py
+++ b/llm_client.py
@@ -1,8 +1,10 @@
-"""Custom LLM client using httpx for Claude Messages API via third-party proxy.
+"""LLM client via the OpenAI SDK (works with DeepSeek's OpenAI-compatible API).

-The proxy does not support Claude's native tool_use format (it strips the `tools`
-field from requests). So we embed tool definitions in the system prompt and parse
-structured JSON tool calls from the model's text output (ReAct-style).
+Tool calling uses the OpenAI-native `tools=[...]` parameter. The model
+returns structured tool_calls via the streaming protocol; we accumulate
+them, dispatch to our executors, and feed results back as `role: "tool"`
+messages. This eliminates the fragile "model writes JSON inside free
+text" problem of the previous ReAct text mode.
 """

 from __future__ import annotations
@@ -18,6 +20,7 @@ from dataclasses import dataclass, field
 from typing import Any

 import httpx
+from openai import APIConnectionError, APIError, APITimeoutError, AsyncOpenAI

 logger = logging.getLogger(__name__)

@@ -30,69 +33,81 @@ class LLMAPIError(Exception):
        self.attempts = attempts


-# Markers the model uses to signal tool calls and final answers
-TOOL_CALL_TAG = "<tool_call>"
-TOOL_CALL_END = "</tool_call>"
-TOOL_RESULT_TAG = "<tool_result>"
-TOOL_RESULT_END = "</tool_result>"
+# Optional answer tags — kept for backward compat with prompts that wrap
+# their final response in <answer>...</answer>. Native tool calling does
+# not need these (no tool_calls = final), but if the model continues to
+# emit them, we strip the tags so callers see clean text.
 ANSWER_TAG = "<answer>"
 ANSWER_END = "</answer>"


-def _build_tools_prompt(tools: list[dict]) -> str:
-    """Format tool definitions for inclusion in the system prompt."""
-    lines = ["You have access to the following tools:\n"]
-    for t in tools:
-        schema = t.get("input_schema", {})
-        props = schema.get("properties", {})
-        required = schema.get("required", [])
+def _to_openai_tools(tools: list[dict]) -> list[dict]:
+    """Convert internal tool definitions to OpenAI native function-tools format."""
+    return [
+        {
+            "type": "function",
+            "function": {
+                "name": t["name"],
+                "description": t["description"],
+                "parameters": t.get("input_schema", {"type": "object", "properties": {}}),
+            },
+        }
+        for t in tools
+    ]

-        params = []
-        for pname, pdef in props.items():
-            req = " (required)" if pname in required else ""
-            desc = pdef.get("description", "")
-            ptype = pdef.get("type", "string")
-            enum_vals = pdef.get("enum")
-            if enum_vals:
-                allowed = ", ".join(f'"{v}"' for v in enum_vals)
-                params.append(f"    - {pname}: {ptype}{req} — {desc} Allowed values: [{allowed}]")
-            else:
-                params.append(f"    - {pname}: {ptype}{req} — {desc}")

-        param_block = "\n".join(params) if params else "    (no parameters)"
-        lines.append(f"## {t['name']}\n{t['description']}\nParameters:\n{param_block}\n")
+def _extract_first_balanced(text: str, open_char: str, close_char: str) -> str | None:
+    """Return the first balanced [...] or {...} substring, or None if no balanced pair.

-    lines.append(
-        "## How to use tools\n"
-        "To call a tool, output a JSON block wrapped in XML tags like this:\n"
-        f"{TOOL_CALL_TAG}\n"
-        '{"name": "tool_name", "arguments": {"param1": "value1"}}\n'
-        f"{TOOL_CALL_END}\n\n"
-        "You can call multiple tools in sequence. After each tool call, you will receive the result in:\n"
-        f"{TOOL_RESULT_TAG}\n...result...\n{TOOL_RESULT_END}\n\n"
-        "When you have finished your analysis and have a final answer, wrap it in:\n"
-        f"{ANSWER_TAG}\nyour final answer here\n{ANSWER_END}\n\n"
-        "Think step by step. Call tools to gather evidence before drawing conclusions.\n"
-        "You MUST call at least one tool before giving your final answer."
+    Stack-based — handles nested brackets correctly (regex with .*? would
+    truncate at the first inner closing bracket, regex with .* would over-eat
+    trailing text). Brackets inside JSON string literals are ignored by
+    callers because the caller passes the result through json.loads which
+    re-parses with proper string handling.
+    """
+    start = text.find(open_char)
+    if start < 0:
+        return None
+    depth = 0
+    for i in range(start, len(text)):
+        c = text[i]
+        if c == open_char:
+            depth += 1
+        elif c == close_char:
+            depth -= 1
+            if depth == 0:
+                return text[start:i + 1]
+    return None
+
+
+def _safe_json_loads(text: str):
+    """Parse JSON with progressive sanitization for LLM-produced output.
+
+    Tries (0) as-is, (1) escape stray backslashes outside valid JSON escapes
+    (\\" \\\\ \\/ \\b \\f \\n \\r \\t \\uXXXX). On final failure, logs raw
+    input (first 600 chars) so we can diagnose what the model emitted.
+
+    Used by orchestrator JSON callsites (_call_llm_for_json) and by
+    tool_call_loop when parsing tool_call arguments returned by the API.
+    """
+    try:
+        return json.loads(text)
+    except json.JSONDecodeError:
+        pass
+
+    stage1 = re.sub(
+        r'\\(?!["\\/bfnrt]|u[0-9a-fA-F]{4})',
+        r'\\\\',
+        text,
    )
-    return "\n".join(lines)
-
-
-def _extract_tool_calls(text: str) -> list[dict]:
-    """Extract tool call JSON blocks from model output."""
-    pattern = re.compile(
-        re.escape(TOOL_CALL_TAG) + r"\s*(.*?)\s*" + re.escape(TOOL_CALL_END),
-        re.DOTALL,
-    )
-    calls = []
-    for match in pattern.finditer(text):
-        raw = match.group(1).strip()
-        try:
-            parsed = json.loads(raw)
-            calls.append(parsed)
-        except json.JSONDecodeError:
-            logger.warning("Failed to parse tool call JSON: %s", raw[:200])
-    return calls
+    try:
+        return json.loads(stage1)
+    except json.JSONDecodeError as e:
+        logger.warning(
+            "_safe_json_loads failed after sanitize (%s); raw head[:600]=%r",
+            e, text[:600],
+        )
+        raise


 def _extract_answer(text: str) -> str | None:
@@ -234,50 +249,41 @@ _DECAY_TIERS: list[tuple[int, int]] = [


 def _apply_progressive_decay(messages: list[dict]) -> list[dict]:
-    """Truncate tool results in older messages to save context space.
+    """Truncate the `content` of older `role: "tool"` messages to save context.

-    Operates in-place-style on a copy. Only touches user messages that
-    contain <tool_result> blocks (these are the tool-result messages
-    generated by tool_call_loop).
+    Each `role: "tool"` message in the conversation corresponds to one tool
+    call's result. We rank these messages by recency and progressively
+    truncate older ones according to `_DECAY_TIERS`.
    """
-    # Count rounds from the end. A "round" is a (assistant, user) pair.
-    # messages alternate: [user, assistant, user, assistant, user, ...]
-    # The initial user message is index 0, then pairs start at index 1.
    total = len(messages)
-    if total <= 10:  # not enough messages to bother
+    if total <= 10:
        return messages

-    result = []
-    # Count tool-result user messages from the end
-    tool_result_indices = [
-        i for i, m in enumerate(messages)
-        if m["role"] == "user" and TOOL_RESULT_TAG in m.get("content", "")
+    tool_msg_indices = [
+        i for i, m in enumerate(messages) if m.get("role") == "tool"
    ]

-    # Build a set of indices that need decay, mapped to their max_chars
    decay_map: dict[int, int] = {}
-    n_tool_msgs = len(tool_result_indices)
-    for rank, idx in enumerate(reversed(tool_result_indices)):
-        rounds_ago = rank  # 0 = most recent, 1 = second most recent, ...
+    for rank, idx in enumerate(reversed(tool_msg_indices)):
+        rounds_ago = rank
        for threshold, max_chars in _DECAY_TIERS:
            if rounds_ago < threshold:
                decay_map[idx] = max_chars
                break

+    result = []
    for i, msg in enumerate(messages):
        if i in decay_map:
            max_chars = decay_map[i]
-            content = msg["content"]
+            content = msg.get("content", "") or ""
            if len(content) > max_chars + 200:
-                # Truncate but preserve the tool_result tags structure
-                truncated = content[:max_chars]
-                # Count how many tool results are in this message
-                n_results = content.count(TOOL_RESULT_TAG)
-                truncated += (
-                    f"\n... [context compressed: {len(content)} -> {max_chars} chars, "
-                    f"{n_results} tool result(s)]"
+                truncated = (
+                    content[:max_chars]
+                    + f"\n... [context compressed: {len(content)} -> {max_chars} chars]"
                )
-                result.append({"role": msg["role"], "content": truncated})
+                new_msg = dict(msg)
+                new_msg["content"] = truncated
+                result.append(new_msg)
            else:
                result.append(msg)
        else:
@@ -301,44 +307,51 @@ _FOLD_SUMMARY_SYSTEM = (


 class LLMClient:
-    """Calls Claude Messages API through a third-party proxy using raw httpx.
+    """Async LLM client via the OpenAI SDK.

-    Uses prompt-based tool calling (ReAct pattern) since the proxy does not
-    support Claude's native tool_use format.
+    Works with any OpenAI-compatible endpoint (OpenAI, DeepSeek, ...).
+    Tool calling is text-based (ReAct) — see module docstring.
    """

    def __init__(
        self,
        base_url: str,
        api_key: str,
-        model: str = "claude-sonnet-4-6",
+        model: str = "deepseek-v4-pro",
        max_tokens: int = 4096,
        proxy: str | None = "auto",
+        reasoning_effort: str | None = None,
+        thinking_enabled: bool = False,
    ) -> None:
        self.base_url = base_url.rstrip("/")
        self.api_key = api_key
        self.model = model
        self.max_tokens = max_tokens
-        # proxy="auto": read from env; proxy=None/""/"none": no proxy; proxy="http://...": use it
+        self.reasoning_effort = reasoning_effort
+        self.thinking_enabled = thinking_enabled
+
+        # proxy="auto": read from env; proxy=None/""/"none": no proxy
        if proxy == "auto":
            proxy_url = os.environ.get("https_proxy") or os.environ.get("HTTPS_PROXY")
        elif proxy and proxy.lower() != "none":
            proxy_url = proxy
        else:
            proxy_url = None
-        self._client = httpx.AsyncClient(
+
+        http_client = (
+            httpx.AsyncClient(proxy=proxy_url, timeout=300.0)
+            if proxy_url else None
+        )
+
+        self._client = AsyncOpenAI(
+            api_key=self.api_key,
            base_url=self.base_url,
-            headers={
-                "x-api-key": self.api_key,
-                "anthropic-version": "2023-06-01",
-                "content-type": "application/json",
-            },
            timeout=300.0,
-            proxy=proxy_url,
+            http_client=http_client,
        )

    async def close(self) -> None:
-        await self._client.aclose()
+        await self._client.close()

    async def chat(
        self,
@@ -346,81 +359,144 @@ class LLMClient:
        system: str | None = None,
        max_retries: int = 5,
    ) -> str:
-        """Send a streaming chat request and return the assembled text response.
+        """Send a streaming chat completion and return the assembled text."""
+        full_messages: list[dict] = []
+        if system:
+            full_messages.append({"role": "system", "content": system})
+        full_messages.extend(messages)

-        Uses SSE streaming to keep the connection alive and avoid gateway
-        timeouts (504/524) on long-running completions.
-        """
-        import asyncio as _asyncio
-
-        payload: dict[str, Any] = {
+        kwargs: dict[str, Any] = {
            "model": self.model,
+            "messages": full_messages,
            "max_tokens": self.max_tokens,
-            "messages": messages,
            "stream": True,
        }
-        if system:
-            payload["system"] = system
+        if self.reasoning_effort:
+            kwargs["reasoning_effort"] = self.reasoning_effort
+        if self.thinking_enabled:
+            kwargs["extra_body"] = {"thinking": {"type": "enabled"}}

        for attempt in range(max_retries):
-            logger.debug("LLM request (stream): %d messages (attempt %d)", len(messages), attempt + 1)
+            logger.debug(
+                "LLM request (stream): %d messages (attempt %d)",
+                len(messages), attempt + 1,
+            )
            text_parts: list[str] = []
            try:
-                async with self._client.stream(
-                    "POST", "/v1/messages", json=payload,
-                ) as resp:
-                    # Check for HTTP errors before consuming stream
-                    if resp.status_code >= 400:
-                        body = await resp.aread()
-                        raise httpx.HTTPStatusError(
-                            f"Server error '{resp.status_code}' for url '{resp.url}'",
-                            request=resp.request,
-                            response=resp,
-                        )
-
-                    # Parse SSE events
-                    async for line in resp.aiter_lines():
-                        if not line.startswith("data: "):
-                            continue
-                        data_str = line[6:]  # strip "data: " prefix
-                        if data_str.strip() == "[DONE]":
-                            break
-                        try:
-                            event = json.loads(data_str)
-                        except json.JSONDecodeError:
-                            continue
-
-                        event_type = event.get("type", "")
-                        if event_type == "content_block_delta":
-                            delta = event.get("delta", {})
-                            if delta.get("type") == "text_delta":
-                                text_parts.append(delta["text"])
-                        elif event_type == "message_stop":
-                            break
-                        elif event_type == "error":
-                            err_msg = event.get("error", {}).get("message", "Unknown streaming error")
-                            raise httpx.HTTPStatusError(
-                                err_msg, request=resp.request, response=resp,
-                            )
+                stream = await self._client.chat.completions.create(**kwargs)
+                async for chunk in stream:
+                    if not chunk.choices:
+                        continue
+                    delta = chunk.choices[0].delta
+                    if delta.content:
+                        text_parts.append(delta.content)

                text = "".join(text_parts)
                logger.debug("LLM response (stream): %d chars", len(text))
                return text

-            except (httpx.HTTPStatusError, httpx.ConnectError, httpx.ReadTimeout, httpx.RemoteProtocolError) as e:
+            except (APIConnectionError, APITimeoutError, APIError) as e:
                if attempt < max_retries - 1:
                    wait = 2 ** attempt * 10
                    logger.warning("Request failed (%s), retrying in %ds...", e, wait)
-                    await _asyncio.sleep(wait)
+                    await asyncio.sleep(wait)
                else:
                    raise LLMAPIError(
                        f"LLM API unreachable after {max_retries} attempts: {e}",
                        attempts=max_retries,
                    ) from e

-        # Should not reach here, but just in case
        return ""

+    async def _chat_with_tools(
+        self,
+        messages: list[dict],
+        openai_tools: list[dict],
+        max_retries: int = 5,
+    ) -> tuple[str, str | None, list[dict]]:
+        """Stream a chat completion with native tool calling enabled.
+
+        Returns:
+            (text_content, reasoning_content, raw_tool_calls).
+            - reasoning_content is non-None when DeepSeek thinking mode is
+              active; the caller MUST echo it back in the assistant message
+              on subsequent requests, or the API returns HTTP 400.
+            - raw_tool_calls is a list of {"id","name","arguments"} dicts;
+              arguments is the raw JSON string returned by the API.
+        """
+        kwargs: dict[str, Any] = {
+            "model": self.model,
+            "messages": messages,
+            "max_tokens": self.max_tokens,
+            "stream": True,
+            "tools": openai_tools,
+        }
+        if self.reasoning_effort:
+            kwargs["reasoning_effort"] = self.reasoning_effort
+        if self.thinking_enabled:
+            kwargs["extra_body"] = {"thinking": {"type": "enabled"}}
+
+        for attempt in range(max_retries):
+            logger.debug(
+                "LLM request (stream+tools): %d messages, %d tools (attempt %d)",
+                len(messages), len(openai_tools), attempt + 1,
+            )
+            text_parts: list[str] = []
+            reasoning_parts: list[str] = []
+            tool_calls_acc: dict[int, dict] = {}  # index -> {id, name, arguments}
+            try:
+                stream = await self._client.chat.completions.create(**kwargs)
+                async for chunk in stream:
+                    if not chunk.choices:
+                        continue
+                    delta = chunk.choices[0].delta
+                    if delta.content:
+                        text_parts.append(delta.content)
+                    # DeepSeek thinking-mode: reasoning_content is returned
+                    # alongside content and MUST be echoed back on subsequent
+                    # requests, otherwise the API rejects with HTTP 400.
+                    rc = getattr(delta, "reasoning_content", None)
+                    if rc:
+                        reasoning_parts.append(rc)
+                    if delta.tool_calls:
+                        for tc_delta in delta.tool_calls:
+                            idx = tc_delta.index
+                            entry = tool_calls_acc.setdefault(
+                                idx, {"id": None, "name": None, "arguments": ""},
+                            )
+                            if tc_delta.id:
+                                entry["id"] = tc_delta.id
+                            fn = tc_delta.function
+                            if fn:
+                                if fn.name:
+                                    entry["name"] = fn.name
+                                if fn.arguments:
+                                    entry["arguments"] += fn.arguments
+
+                text = "".join(text_parts)
+                reasoning = "".join(reasoning_parts) or None
+                ordered = [tool_calls_acc[i] for i in sorted(tool_calls_acc)]
+                logger.debug(
+                    "LLM response (stream+tools): %d chars, %d reasoning chars, %d tool calls",
+                    len(text), len(reasoning or ""), len(ordered),
+                )
+                return text, reasoning, ordered
+
+            except (APIConnectionError, APITimeoutError, APIError) as e:
+                if attempt < max_retries - 1:
+                    wait = 2 ** attempt * 10
+                    logger.warning(
+                        "Tool-call request failed (%s), retrying in %ds...", e, wait,
+                    )
+                    await asyncio.sleep(wait)
+                else:
+                    raise LLMAPIError(
+                        f"LLM API unreachable after {max_retries} attempts: {e}",
+                        attempts=max_retries,
+                    ) from e
+
+        return "", None, []
+
    async def tool_call_loop(
        self,
        messages: list[dict],
@@ -428,87 +504,159 @@ class LLMClient:
        tool_executor: dict[str, Any],
        system: str | None = None,
        max_iterations: int = 40,
+        terminal_tools: tuple[str, ...] = (),
    ) -> tuple[str, list[dict]]:
-        """Run a ReAct-style tool-calling loop.
+        """Run a tool-calling loop using OpenAI-native tool calls.

-        The model outputs <tool_call> blocks which we parse and execute,
-        feeding results back as <tool_result> blocks until the model
-        outputs an <answer> block.
+        The model returns structured `tool_calls` in its message; we
+        dispatch them through our executor dict and feed each result back
+        as a `role: "tool"` message with the matching `tool_call_id`. The
+        loop ends when:
+          - the model returns a message with no tool_calls (normal exit), or
+          - any tool in `terminal_tools` is called — in that case, the loop
+            short-circuits with that tool's result text as final_text. This
+            gives agents (notably ReportAgent) an explicit completion signal
+            that the old `<answer>` text tag used to provide.

        Returns:
-            (final_text, all_messages)
+            (final_text, full_message_history)
        """
-        # Build system prompt with tool definitions
-        tools_prompt = _build_tools_prompt(tools)
-        full_system = f"{system}\n\n{tools_prompt}" if system else tools_prompt
+        terminal_set = set(terminal_tools)
+        openai_tools = _to_openai_tools(tools)

-        messages = list(messages)  # don't mutate caller's list
-        _folded = False  # Track whether we've already folded once this loop
+        # The caller may pass `messages` either as raw conversation (no system)
+        # together with `system=...`, OR as a complete history that already
+        # starts with the system message (retry path). Accept both shapes.
+        if messages and messages[0].get("role") == "system":
+            full_messages: list[dict] = list(messages)
+        else:
+            full_messages = []
+            if system:
+                full_messages.append({"role": "system", "content": system})
+            full_messages.extend(messages)
+        _folded = False

-        for i in range(max_iterations):
+        for _i in range(max_iterations):
            # ── Context compression before each API call ──────────────
-            # Stage A: progressively decay old tool results
-            messages = _apply_progressive_decay(messages)
-
-            # Stage B: fold oldest messages into LLM summary if too long
-            if not _folded and len(messages) > _FOLD_THRESHOLD:
-                messages = await self._fold_old_messages(messages, full_system)
+            full_messages = _apply_progressive_decay(full_messages)
+            if not _folded and len(full_messages) > _FOLD_THRESHOLD:
+                full_messages = await self._fold_old_messages(full_messages)
                _folded = True
-            elif _folded and len(messages) > _FOLD_THRESHOLD + _FOLD_KEEP_RECENT:
-                # Allow a second fold if messages grew back significantly
-                messages = await self._fold_old_messages(messages, full_system)
+            elif _folded and len(full_messages) > _FOLD_THRESHOLD + _FOLD_KEEP_RECENT:
+                full_messages = await self._fold_old_messages(full_messages)

-            text = await self.chat(messages, system=full_system)
+            text, reasoning, raw_tool_calls = await self._chat_with_tools(
+                full_messages, openai_tools,
+            )

-            # Check for final answer
-            answer = _extract_answer(text)
-            if answer is not None:
-                messages.append({"role": "assistant", "content": text})
-                return answer, messages
+            if not raw_tool_calls:
+                # Model produced a final response. Strip optional <answer>
+                # tags for backward compatibility with old prompts.
+                final_msg: dict[str, Any] = {"role": "assistant", "content": text}
+                if reasoning:
+                    final_msg["reasoning_content"] = reasoning
+                full_messages.append(final_msg)
+                answer = _extract_answer(text)
+                return (answer if answer is not None else text), full_messages

-            # Check for tool calls
-            tool_calls = _extract_tool_calls(text)
+            # Parse arguments + build internal call dicts
+            parsed_calls: list[dict] = []
+            for rc in raw_tool_calls:
+                args_str = rc.get("arguments", "") or ""
+                try:
+                    args = _safe_json_loads(args_str) if args_str.strip() else {}
+                except (json.JSONDecodeError, ValueError) as e:
+                    logger.warning(
+                        "Failed to parse arguments for tool %s: %s",
+                        rc.get("name"), e,
+                    )
+                    args = {}
+                parsed_calls.append({
+                    "id": rc.get("id"),
+                    "name": rc.get("name", ""),
+                    "arguments": args,
+                })

-            if not tool_calls:
-                # No tool calls and no answer tag — treat entire text as answer
-                messages.append({"role": "assistant", "content": text})
-                return text, messages
+            # Append the assistant turn with the raw tool_calls (and the
+            # DeepSeek-mandated reasoning_content echo-back), then execute.
+            asst_msg: dict[str, Any] = {
+                "role": "assistant",
+                "content": text or None,
+                "tool_calls": [
+                    {
+                        "id": rc.get("id"),
+                        "type": "function",
+                        "function": {
+                            "name": rc.get("name", ""),
+                            "arguments": rc.get("arguments", "") or "",
+                        },
+                    }
+                    for rc in raw_tool_calls
+                ],
+            }
+            if reasoning:
+                asst_msg["reasoning_content"] = reasoning
+            full_messages.append(asst_msg)

-            # Execute tool calls — read-only tools run in parallel
-            messages.append({"role": "assistant", "content": text})
-
-            result_parts = []
-            batches = _partition_tool_calls(tool_calls)
+            batches = _partition_tool_calls(parsed_calls)
            t_batch_start = time.monotonic()
-
+            # Each entry: (tool_call_dict, raw_result, formatted_for_llm)
+            executed: list[tuple[dict, str, str]] = []
            for batch in batches:
                if batch.is_read_only and len(batch.calls) > 1:
-                    batch_results = await self._execute_tool_batch_parallel(
+                    results = await self._execute_tool_batch_parallel(
                        batch.calls, tool_executor, tools,
                    )
-                    result_parts.extend(batch_results)
+                    for tc, (raw, formatted) in zip(batch.calls, results):
+                        executed.append((tc, raw, formatted))
                else:
                    for tc in batch.calls:
-                        result_parts.append(
-                            await self._execute_single_tool(tc, tool_executor, tools)
+                        raw, formatted = await self._execute_single_tool(
+                            tc, tool_executor, tools,
                        )
+                        executed.append((tc, raw, formatted))

-            # Emit folded tool-call summary for the terminal
            t_batch_elapsed = time.monotonic() - t_batch_start
-            _emit_tool_call_summary(tool_calls, t_batch_elapsed)
+            _emit_tool_call_summary(parsed_calls, t_batch_elapsed)

-            # Feed results back as a user message
-            result_message = "\n\n".join(result_parts)
-            messages.append({"role": "user", "content": result_message})
+            # Append formatted tool results to the conversation (this is
+            # what the LLM sees on subsequent rounds — truncated for context
+            # economy).
+            for tc, _raw, formatted in executed:
+                full_messages.append({
+                    "role": "tool",
+                    "tool_call_id": tc["id"],
+                    "content": formatted,
+                })
+
+            # Terminal-tool short-circuit: if the model called any tool in
+            # `terminal_tools`, end the loop immediately. The terminal tool's
+            # RAW result (untruncated) becomes final_text — the LLM may have
+            # produced a 20K-char report via save_report and we must not
+            # truncate it just because the LLM-facing copy is truncated.
+            if terminal_set:
+                for tc, raw, _formatted in executed:
+                    name = tc.get("name", "")
+                    if name in terminal_set:
+                        logger.info(
+                            "Terminal tool %s called — exiting tool_call_loop", name,
+                        )
+                        return raw, full_messages

        logger.warning("Tool call loop hit max iterations (%d)", max_iterations)
-        return "[Max tool call iterations reached]", messages
+        return "[Max tool call iterations reached]", full_messages

    async def _execute_single_tool(
        self, tc: dict, tool_executor: dict[str, Any],
        tools: list[dict] | None = None,
-    ) -> str:
-        """Execute a single tool call and return the formatted result."""
+    ) -> tuple[str, str]:
+        """Execute a single tool call.
+
+        Returns (raw_result, formatted_for_llm). `raw_result` is the
+        unmodified executor return (used by terminal-tool short-circuit as
+        final_text). `formatted_for_llm` is `[tool_name] {truncated}` and
+        is what gets fed back to the model as the tool message content.
+        """
        tool_name = tc.get("name", "")
        tool_args = tc.get("arguments", {})

@@ -519,72 +667,106 @@ class LLMClient:

        executor = tool_executor.get(tool_name)
        if executor is None:
-            result_text = f"Error: unknown tool '{tool_name}'"
+            raw = f"Error: unknown tool '{tool_name}'"
        else:
            try:
-                result_text = await executor(**tool_args)
+                raw = await executor(**tool_args)
            except Exception as e:
                logger.error("Tool %s failed: %s", tool_name, e)
-                result_text = f"Error executing {tool_name}: {e}"
+                raw = f"Error executing {tool_name}: {e}"

-        return (
-            f"{TOOL_RESULT_TAG}\n"
-            f"[{tool_name}] {_truncate_tool_result(result_text)}\n"
-            f"{TOOL_RESULT_END}"
-        )
+        formatted = f"[{tool_name}] {_truncate_tool_result(raw)}"
+        return raw, formatted

    async def _execute_tool_batch_parallel(
        self, calls: list[dict], tool_executor: dict[str, Any],
        tools: list[dict] | None = None,
-    ) -> list[str]:
-        """Execute multiple read-only tool calls concurrently."""
+    ) -> list[tuple[str, str]]:
+        """Execute multiple read-only tool calls concurrently.
+
+        Returns a list of (raw_result, formatted_for_llm) tuples in the
+        same order as `calls`.
+        """
        logger.info("Executing %d read-only tools in parallel", len(calls))

-        async def _run_one(tc: dict) -> str:
+        async def _run_one(tc: dict) -> tuple[str, str]:
            tool_name = tc.get("name", "")
            tool_args = tc.get("arguments", {})
            if tools:
                tool_args = _fix_tool_args(tool_name, tool_args, tools)
-            logger.info("Calling tool (parallel): %s(%s)", tool_name, json.dumps(tool_args, ensure_ascii=False))
+            logger.info(
+                "Calling tool (parallel): %s(%s)",
+                tool_name, json.dumps(tool_args, ensure_ascii=False),
+            )
            executor = tool_executor.get(tool_name)
            if executor is None:
-                result_text = f"Error: unknown tool '{tool_name}'"
+                raw = f"Error: unknown tool '{tool_name}'"
            else:
                try:
-                    result_text = await executor(**tool_args)
+                    raw = await executor(**tool_args)
                except Exception as e:
                    logger.error("Tool %s failed: %s", tool_name, e)
-                    result_text = f"Error executing {tool_name}: {e}"
-            return (
-                f"{TOOL_RESULT_TAG}\n"
-                f"[{tool_name}] {_truncate_tool_result(result_text)}\n"
-                f"{TOOL_RESULT_END}"
-            )
+                    raw = f"Error executing {tool_name}: {e}"
+            formatted = f"[{tool_name}] {_truncate_tool_result(raw)}"
+            return raw, formatted

        results = await asyncio.gather(*[_run_one(tc) for tc in calls])
        return list(results)

    async def _fold_old_messages(
-        self, messages: list[dict], system: str,
+        self, messages: list[dict],
    ) -> list[dict]:
        """Fold old messages into an LLM-generated summary (Stage B).

-        Keeps the most recent _FOLD_KEEP_RECENT messages intact and
-        replaces earlier ones with a single summary message.
+        Preserves the leading system message (if any), keeps the most
+        recent _FOLD_KEEP_RECENT messages intact, and replaces the older
+        middle slice with a single summary user message.
        """
-        n_to_fold = len(messages) - _FOLD_KEEP_RECENT
+        # Pin the system message — it must NEVER be summarized away.
+        system_msgs: list[dict] = []
+        body = messages
+        if messages and messages[0].get("role") == "system":
+            system_msgs = [messages[0]]
+            body = messages[1:]
+
+        n_to_fold = len(body) - _FOLD_KEEP_RECENT
        if n_to_fold <= 2:
            return messages

-        old_messages = messages[:n_to_fold]
-        recent_messages = messages[n_to_fold:]
+        # Pull the fold boundary forward so we never split an assistant turn
+        # from its matching tool results. The API rejects (HTTP 400) any
+        # `role: "tool"` message that does not immediately follow an
+        # `assistant` message with `tool_calls`. We walk the boundary into
+        # `recent_messages` while its head is a `role: "tool"` message, or
+        # while the prior `recent` message is `assistant{tool_calls}` whose
+        # paired tools span the boundary.
+        while n_to_fold < len(body):
+            head = body[n_to_fold]
+            if head.get("role") == "tool":
+                n_to_fold += 1
+                continue
+            break
+
+        if n_to_fold >= len(body):
+            # Everything got folded — nothing recent to keep.
+            return system_msgs + [body[0]] if system_msgs else messages
+
+        old_messages = body[:n_to_fold]
+        recent_messages = body[n_to_fold:]

-        # Build a text dump of old messages for summarization
        old_text_parts = []
        for msg in old_messages:
-            role = msg["role"]
-            content = msg.get("content", "")
-            # Truncate each message for the summary prompt to avoid overload
+            role = msg.get("role", "?")
+            content = msg.get("content") or ""
+            # Render tool_calls (assistant turn) compactly.
+            if role == "assistant" and msg.get("tool_calls"):
+                tc_names = [
+                    tc.get("function", {}).get("name", "?")
+                    for tc in msg["tool_calls"]
+                ]
+                content = (content + " " if content else "") + (
+                    "called: " + ", ".join(tc_names)
+                )
            if len(content) > 1000:
                content = content[:1000] + "..."
            old_text_parts.append(f"[{role}]: {content}")
@@ -608,7 +790,6 @@ class LLMClient:
            logger.warning("Context folding failed: %s — keeping original messages", e)
            return messages

-        # Replace old messages with a single summary
        summary_message = {
            "role": "user",
            "content": (
@@ -616,4 +797,4 @@ class LLMClient:
                f"messages in this conversation]\n\n{summary}"
            ),
        }
-        return [summary_message] + recent_messages
+        return system_msgs + [summary_message] + recent_messages
--- a/main.py
+++ b/main.py
@@ -219,6 +219,8 @@ async def async_main() -> None:
        model=agent_cfg["model"],
        max_tokens=agent_cfg.get("max_tokens", 4096),
        proxy=agent_cfg.get("proxy", "auto"),
+        reasoning_effort=agent_cfg.get("reasoning_effort"),
+        thinking_enabled=agent_cfg.get("thinking_enabled", False),
    )

    # Initialize evidence graph
--- a/orchestrator.py
+++ b/orchestrator.py
@@ -12,7 +12,8 @@ from pathlib import Path

 from agent_factory import AgentFactory
 from evidence_graph import EvidenceGraph
-from llm_client import LLMClient
+from llm_client import LLMClient, _extract_first_balanced, _safe_json_loads
+from tool_registry import TOOL_CATALOG

 logger = logging.getLogger(__name__)

@@ -93,6 +94,14 @@ class Orchestrator:
        "Omit phenomena that are unrelated. Be conservative — only link genuinely relevant evidence."
    )

+    _AREA_DERIVE_SYSTEM = (
+        "You are a forensic investigation strategist. Given a set of hypotheses, "
+        "decompose them into a minimal aggregate set of investigation areas. An "
+        "area is a focused, concrete question with the keywords and tool names an "
+        "answering phenomenon would mention. Aggregate aggressively — when two "
+        "hypotheses share an area, emit it once and list both hypothesis_ids."
+    )
+
    def __init__(
        self,
        llm: LLMClient,
@@ -216,6 +225,170 @@ class Orchestrator:
            "The ultimate goal is to reconstruct a detailed timeline of what happened on this host."
        )

+    # ---- Investigation areas (manual seed + LLM derive) ----------------------
+
+    _VALID_AGENT_TYPES = {"filesystem", "registry", "communication", "network", "timeline"}
+
+    async def _seed_manual_investigation_areas(self) -> None:
+        """Import config.yaml:investigation_areas entries (manual override).
+
+        Run early in Phase 2 so manual entries are in the graph before LLM
+        derivation; LLM derive then augments via slug-based dedupe.
+        """
+        for entry in self.config.get("investigation_areas", []):
+            area = entry.get("area")
+            if not area:
+                continue
+            await self.graph.add_investigation_area(
+                area=area,
+                description=entry.get("description", entry.get("task", "")),
+                suggested_agent=entry.get("agent", "filesystem"),
+                expected_keywords=entry.get("keywords", []),
+                expected_tools=entry.get("tools", []),
+                priority=entry.get("priority", 3),
+                created_by="manual",
+            )
+
+    async def _derive_investigation_areas(self) -> None:
+        """Ask LLM to derive investigation areas from active hypotheses.
+
+        Manual-seeded or already-populated graph (resume) → no-op. On LLM
+        failure or empty output, falls back to one-area-per-hypothesis.
+        """
+        if self.graph.investigation_areas:
+            return
+        active = [h for h in self.graph.hypotheses.values() if h.status == "active"]
+        if not active:
+            return
+
+        available_tools = sorted(TOOL_CATALOG.keys())
+        hyp_lines = "\n".join(
+            f"  [{h.id}] {h.title}: {h.description}" for h in active
+        )
+        prompt = (
+            f"Active hypotheses:\n{hyp_lines}\n\n"
+            f"Available agents: {sorted(self._VALID_AGENT_TYPES)}\n"
+            f"Available tool names (pick 1-3 per area for expected_tools): {available_tools}\n\n"
+            f"Emit 5-12 distinct investigation areas covering the FULL hypothesis set.\n"
+            f"Each area must include:\n"
+            f"  - area: snake_case slug (dedupe key)\n"
+            f"  - description: one sentence on what to find\n"
+            f"  - suggested_agent: one of the agents above\n"
+            f"  - expected_keywords: 3-8 lowercase tokens that an answering phenomenon would mention\n"
+            f"  - expected_tools: 1-3 tool names from the list above\n"
+            f"  - priority: 1 (highest) to 10\n"
+            f"  - motivating_hypothesis_ids: at least one [hyp-xxx] from above\n\n"
+            f"Aggregate aggressively — when two hypotheses share an area, emit it ONCE "
+            f"and list both ids in motivating_hypothesis_ids.\n\n"
+            f"Respond ONLY with JSON:\n"
+            f'[{{"area":"...","description":"...","suggested_agent":"...",'
+            f'"expected_keywords":[...],"expected_tools":[...],"priority":1-10,'
+            f'"motivating_hypothesis_ids":["hyp-xxx"]}}]'
+        )
+
+        try:
+            items = await self._call_llm_for_json(
+                system=self._AREA_DERIVE_SYSTEM,
+                user_prompt=prompt,
+                schema="array",
+            )
+        except Exception as e:
+            logger.warning("Area derivation LLM failed: %s — falling back", e)
+            await self._derive_areas_fallback(active)
+            return
+
+        valid_hyp_ids = set(self.graph.hypotheses.keys())
+        for it in items:
+            area = it.get("area", "").strip()
+            if not area:
+                continue
+            agent = it.get("suggested_agent", "filesystem")
+            if agent not in self._VALID_AGENT_TYPES:
+                agent = AGENT_ALIASES.get(agent, "filesystem")
+            tools = [t for t in it.get("expected_tools", []) if t in TOOL_CATALOG]
+            motivating = [
+                h for h in it.get("motivating_hypothesis_ids", [])
+                if h in valid_hyp_ids
+            ]
+            priority = max(1, min(10, int(it.get("priority", 5))))
+            await self.graph.add_investigation_area(
+                area=area,
+                description=it.get("description", ""),
+                suggested_agent=agent,
+                expected_keywords=[
+                    str(kw).lower() for kw in it.get("expected_keywords", [])
+                ],
+                expected_tools=tools,
+                priority=priority,
+                motivating_hypothesis_ids=motivating,
+                created_by="llm_derive",
+            )
+
+        if not self.graph.investigation_areas:
+            await self._derive_areas_fallback(active)
+
+    async def _derive_areas_fallback(self, active: list) -> None:
+        """One area per active hypothesis as a minimal safety net."""
+        for h in active:
+            slug = re.sub(r"[^a-z0-9_]+", "_", h.title.lower())[:40].strip("_")
+            if not slug:
+                slug = h.id.replace("-", "_")
+            await self.graph.add_investigation_area(
+                area=slug,
+                description=h.title,
+                suggested_agent="filesystem",
+                expected_keywords=h.title.lower().split()[:6],
+                expected_tools=[],
+                priority=5,
+                motivating_hypothesis_ids=[h.id],
+                created_by="fallback",
+            )
+
+    # ---- LLM JSON helper -----------------------------------------------------
+
+    async def _call_llm_for_json(
+        self,
+        system: str,
+        user_prompt: str,
+        schema: str = "array",
+        max_retries: int = 2,
+    ):
+        """Call LLM expecting JSON output; self-correct on parse failure.
+
+        Two layers of safety:
+        1. _safe_json_loads escapes stray backslashes outside valid JSON escapes.
+        2. On any remaining parse error, append the error to the prompt and ask
+           the LLM to retry, up to max_retries additional attempts.
+        """
+        error_hint = ""
+        last_err: Exception | None = None
+        open_c, close_c = ('[', ']') if schema == "array" else ('{', '}')
+
+        for attempt in range(max_retries + 1):
+            messages = [{"role": "user", "content": user_prompt + error_hint}]
+            response = await self.llm.chat(messages=messages, system=system)
+            candidate = _extract_first_balanced(response, open_c, close_c) or response
+            try:
+                return _safe_json_loads(candidate)
+            except (json.JSONDecodeError, ValueError) as e:
+                last_err = e
+                if attempt < max_retries:
+                    logger.info(
+                        "JSON parse attempt %d/%d failed (%s); retrying with hint",
+                        attempt + 1, max_retries + 1, e,
+                    )
+                error_hint = (
+                    f"\n\n[Your previous response could not be parsed as JSON: {e}]\n"
+                    f"Output STRICT JSON only — no markdown fences, no code blocks. "
+                    f"Do NOT include literal backslash characters in any string value; "
+                    f"rephrase using forward slashes or describe paths in English "
+                    f"(e.g. 'the Cain folder under Program Files' instead of "
+                    f"'C:\\Program Files\\Cain'). "
+                    f"The response must be a single JSON {schema}."
+                )
+        assert last_err is not None
+        raise last_err
+
    # ---- Hypothesis-directed investigation -----------------------------------

    async def _generate_hypothesis_leads(self) -> None:
@@ -232,10 +405,19 @@ class Orchestrator:
            existing = "\n".join(
                f"    - {r['node']} [{r['edge_type']}]" for r in related
            ) or "    (none yet)"
+            related_areas = [
+                a.area for a in self.graph.investigation_areas.values()
+                if hyp.id in a.motivating_hypothesis_ids
+            ]
+            expected_line = (
+                f"  Expected areas to investigate: {', '.join(related_areas)}\n"
+                if related_areas else ""
+            )
            hyp_blocks.append(
                f"Hypothesis [{hyp.id}]: {hyp.title}\n"
                f"  Description: {hyp.description}\n"
                f"  Current confidence: {hyp.confidence:.2f}\n"
+                f"{expected_line}"
                f"  Existing evidence:\n{existing}"
            )

@@ -252,15 +434,11 @@ class Orchestrator:
        )

        try:
-            response = await self.llm.chat(
-                messages=[{"role": "user", "content": prompt}],
+            tasks = await self._call_llm_for_json(
                system=self._LEAD_GEN_SYSTEM,
+                user_prompt=prompt,
+                schema="array",
            )
-            match = re.search(r'\[.*?\]', response, re.DOTALL)
-            if match:
-                tasks = json.loads(match.group())
-            else:
-                tasks = json.loads(response)

            for task in tasks:
                hyp_id = task.get("hypothesis_id", "")
@@ -286,12 +464,21 @@ class Orchestrator:
            existing_evidence = "\n".join(
                f"  - {r['node']} [{r['edge_type']}]" for r in related
            ) or "  (none yet)"
+            related_areas = [
+                a.area for a in self.graph.investigation_areas.values()
+                if hyp.id in a.motivating_hypothesis_ids
+            ]
+            expected_line = (
+                f"Expected areas to investigate: {', '.join(related_areas)}\n\n"
+                if related_areas else ""
+            )

            prompt = (
                f"Hypothesis: {hyp.title}\n"
                f"Description: {hyp.description}\n"
                f"Current confidence: {hyp.confidence:.2f}\n\n"
                f"Existing evidence linked to this hypothesis:\n{existing_evidence}\n\n"
+                f"{expected_line}"
                f"What additional evidence should we look for to CONFIRM or DENY this hypothesis?\n"
                f"List 1-3 specific, actionable investigation tasks.\n"
                f"For each, specify which agent type should handle it: "
@@ -300,12 +487,11 @@ class Orchestrator:
                f'[{{"agent": "agent_type", "task": "what to investigate", "priority": 1-10}}]'
            )
            try:
-                response = await self.llm.chat(
-                    messages=[{"role": "user", "content": prompt}],
+                tasks = await self._call_llm_for_json(
                    system=self._LEAD_GEN_SYSTEM,
+                    user_prompt=prompt,
+                    schema="array",
                )
-                match = re.search(r'\[.*?\]', response, re.DOTALL)
-                tasks = json.loads(match.group()) if match else json.loads(response)
                for task in tasks:
                    await self.graph.add_lead(
                        target_agent=task.get("agent", "filesystem"),
@@ -351,15 +537,11 @@ class Orchestrator:
        )

        try:
-            response = await self.llm.chat(
-                messages=[{"role": "user", "content": prompt}],
+            judgments = await self._call_llm_for_json(
                system=self._JUDGE_SYSTEM,
+                user_prompt=prompt,
+                schema="array",
            )
-            match = re.search(r'\[.*?\]', response, re.DOTALL)
-            if match:
-                judgments = json.loads(match.group())
-            else:
-                judgments = json.loads(response)

            for j in judgments:
                hyp_id = j.get("hypothesis_id", "")
@@ -402,12 +584,11 @@ class Orchestrator:
                f'[{{"phenomenon_id": "ph-xxx", "edge_type": "supports|contradicts|...", "reason": "brief explanation"}}]'
            )
            try:
-                response = await self.llm.chat(
-                    messages=[{"role": "user", "content": prompt}],
+                judgments = await self._call_llm_for_json(
                    system=self._JUDGE_SYSTEM,
+                    user_prompt=prompt,
+                    schema="array",
                )
-                match = re.search(r'\[.*?\]', response, re.DOTALL)
-                judgments = json.loads(match.group()) if match else json.loads(response)
                for j in judgments:
                    ph_id = j.get("phenomenon_id", "")
                    edge_type = j.get("edge_type", "")
@@ -428,74 +609,57 @@ class Orchestrator:

    # ---- Gap analysis (coverage check) ---------------------------------------

-    _AREA_KEYWORDS: dict[str, list[str]] = {
-        "system_info": ["install date", "registered owner", "product name", "windows xp", "system information"],
-        "user_accounts": ["user account", "enumerate", "sam hive", "administrator", "mr. evil"],
-        "shutdown_time": ["shutdown"],
-        "network_config": ["network interface", "network adapter", "ip address", "dhcp", "mac address", "network config"],
-        "installed_software": ["installed software", "program files", "installed program"],
-        "email_config": ["smtp", "pop3", "nntp", "email account", "email config"],
-        "chat_logs": ["irc", "mirc", "chat log", "channel"],
-        "network_activity": ["packet capture", "pcap", "interception", "http request", "user-agent"],
-        "deleted_files": ["deleted file", "recycle", "recycler"],
-        "execution_evidence": ["prefetch", "execution", "run count", "last execution"],
-    }
+    def _check_coverage(self) -> set[str]:
+        """Return slugs of investigation_areas already covered by phenomena.

-    # Deterministic coverage: if the canonical tool was called, the area is covered.
-    _AREA_TOOLS: dict[str, list[str]] = {
-        "system_info": ["get_system_info"],
-        "user_accounts": ["enumerate_users"],
-        "shutdown_time": ["get_shutdown_time"],
-        "network_config": ["get_network_interfaces"],
-        "installed_software": ["list_installed_software"],
-        "email_config": ["get_email_config"],
-        "network_activity": ["parse_pcap_strings"],
-        "deleted_files": ["count_deleted_files"],
-        "execution_evidence": ["parse_prefetch"],
-    }
+        Layer A: any expected_keyword found in evidence text (category +
+        title + description, lowercased).
+        Layer B: any expected_tool present in the source_tool set of recorded
+        phenomena (deterministic — the canonical tool was actually called).
+        """
+        evidence_text = " ".join(
+            f"{ph.category} {ph.title} {ph.description}".lower()
+            for ph in self.graph.phenomena.values()
+        )
+        used_tools: set[str] = {
+            ph.source_tool for ph in self.graph.phenomena.values() if ph.source_tool
+        }

-    def _check_coverage(self, areas: list[dict]) -> set[str]:
-        # Layer 1: keyword matching on category + title + description
-        evidence_text = ""
-        for ph in self.graph.phenomena.values():
-            evidence_text += f" {ph.category} {ph.title} {ph.description} ".lower()
-
-        # Layer 2: collect all source_tools that produced phenomena
-        used_tools: set[str] = {ph.source_tool for ph in self.graph.phenomena.values() if ph.source_tool}
-
-        covered = set()
-        for area in areas:
-            area_name = area["area"]
-            # Check keywords
-            keywords = self._AREA_KEYWORDS.get(area_name, [])
-            if any(kw in evidence_text for kw in keywords):
-                covered.add(area_name)
+        covered: set[str] = set()
+        for a in self.graph.investigation_areas.values():
+            if any(kw.lower() in evidence_text for kw in a.expected_keywords):
+                covered.add(a.area)
                continue
-            # Check source_tool
-            area_tools = self._AREA_TOOLS.get(area_name, [])
-            if any(tool in used_tools for tool in area_tools):
-                covered.add(area_name)
+            if any(t in used_tools for t in a.expected_tools):
+                covered.add(a.area)
        return covered

    async def _run_gap_analysis(self) -> None:
-        areas = self.config.get("investigation_areas", [])
+        areas = list(self.graph.investigation_areas.values())
        if not areas:
            return

-        covered = self._check_coverage(areas)
-        uncovered = [a for a in areas if a["area"] not in covered]
+        covered = self._check_coverage()
+        uncovered = [a for a in areas if a.area not in covered]

        if not uncovered:
            _log(f"All {len(areas)} investigation areas covered", event="progress")
            return

-        uncovered_names = ", ".join(a["area"] for a in uncovered)
-        _log(f"{len(uncovered)}/{len(areas)} areas uncovered: {uncovered_names}", event="dispatch")
-        for area in uncovered:
+        uncovered_names = ", ".join(a.area for a in uncovered)
+        _log(
+            f"{len(uncovered)}/{len(areas)} areas uncovered: {uncovered_names}",
+            event="dispatch",
+        )
+        for a in uncovered:
            await self.graph.add_lead(
-                target_agent=area["agent"],
-                description=area["task"],
-                priority=3,
+                target_agent=a.suggested_agent,
+                description=a.description,
+                priority=a.priority,
+                hypothesis_id=(
+                    a.motivating_hypothesis_ids[0]
+                    if a.motivating_hypothesis_ids else None
+                ),
            )

        for round_num in range(3):
@@ -542,6 +706,15 @@ class Orchestrator:
                json.dumps(leads_data, ensure_ascii=False, indent=2)
            )

+            # Investigation areas export
+            areas_data = {
+                aid: a.to_dict()
+                for aid, a in self.graph.investigation_areas.items()
+            }
+            (self.run_dir / "investigation_areas.json").write_text(
+                json.dumps(areas_data, ensure_ascii=False, indent=2)
+            )
+
            # Run metadata
            end_time = datetime.now()
            metadata = {
@@ -601,6 +774,11 @@ class Orchestrator:
            if resume_phase <= 2:
                _log("Phase 2: Hypothesis Generation", event="phase")
                t0 = time.monotonic()
+
+                # Seed manual investigation areas (if any) BEFORE LLM derive,
+                # so manual entries win the dedupe and LLM only augments.
+                await self._seed_manual_investigation_areas()
+
                manual_hypotheses = self.config.get("hypotheses", [])
                if manual_hypotheses:
                    await self._generate_hypotheses_manual(manual_hypotheses)
@@ -611,10 +789,17 @@ class Orchestrator:
                if self.graph.phenomena and self.graph.hypotheses:
                    await self._judge_new_phenomena()

+                # Derive investigation areas from active hypotheses.
+                # No-op if manual seed already populated or resume restored areas.
+                await self._derive_investigation_areas()
+
                for h in self.graph.hypotheses.values():
                    _log(f"  {h.summary()}", event="hypothesis")
+                for a in self.graph.investigation_areas.values():
+                    _log(f"  {a.summary()}", event="area")
                _log(
-                    f"+{len(self.graph.hypotheses)} hypotheses generated",
+                    f"+{len(self.graph.hypotheses)} hypotheses, "
+                    f"{len(self.graph.investigation_areas)} areas",
                    event="progress", elapsed=time.monotonic() - t0,
                )

--- a/pyproject.toml
+++ b/pyproject.toml
@@ -5,6 +5,7 @@ description = "Multi-Agent System for Digital Forensics"
 requires-python = ">=3.14"
 dependencies = [
    "httpx[socks]>=0.28.1",
+    "openai>=2.36.0",
    "pyyaml",
    "regipy>=6.2.1",
 ]
--- a/regenerate_report.py
+++ b/regenerate_report.py
@@ -45,6 +45,8 @@ async def main() -> None:
        api_key=agent_cfg["api_key"],
        model=agent_cfg["model"],
        max_tokens=16384,
+        reasoning_effort=agent_cfg.get("reasoning_effort"),
+        thinking_enabled=agent_cfg.get("thinking_enabled", False),
    )

    register_all_tools(graph.image_path, graph.partition_offset, graph)
--- a/tests/test_optimizations.py
+++ b/tests/test_optimizations.py
@@ -14,7 +14,6 @@ from evidence_graph import (
 from llm_client import (
    _truncate_tool_result, _partition_tool_calls, _ToolBatch, READ_ONLY_TOOLS,
    _apply_progressive_decay, _FOLD_THRESHOLD, _FOLD_KEEP_RECENT,
-    TOOL_RESULT_TAG, TOOL_RESULT_END,
 )
 from tool_registry import (
    _tool_result_cache, _cache_key, _make_cached, CACHEABLE_TOOLS,
@@ -598,7 +597,8 @@ class TestParallelToolExecution:
        elapsed = time.monotonic() - start

        assert len(results) == 3
-        assert all("ok" in r for r in results)
+        # results are (raw, formatted) tuples; both contain "ok"
+        assert all("ok" in raw and "ok" in formatted for raw, formatted in results)
        # 3 tasks × 50ms each should take ~50ms parallel, not ~150ms serial
        assert elapsed < 0.12, f"Expected parallel execution but took {elapsed:.3f}s"

@@ -627,8 +627,10 @@ class TestParallelToolExecution:

        results = await client._execute_tool_batch_parallel(calls, tool_executor)
        assert len(results) == 2
-        assert "success" in results[0]
-        assert "Error" in results[1]
+        raw0, formatted0 = results[0]
+        raw1, formatted1 = results[1]
+        assert "success" in raw0 and "success" in formatted0
+        assert "Error" in raw1 and "Error" in formatted1


 # ---------------------------------------------------------------------------
@@ -843,21 +845,27 @@ class TestToolResultCache:

 class TestProgressiveDecay:
    def _make_messages(self, n_rounds: int) -> list[dict]:
-        """Build a synthetic message list with n_rounds of (assistant, user) pairs."""
+        """Build a synthetic message list shaped like native tool calling:
+        user → (assistant w/ tool_calls → tool result)+
+        """
        messages = [{"role": "user", "content": "Start task"}]
        for i in range(n_rounds):
+            tc_id = f"call_{i}"
            messages.append({
                "role": "assistant",
-                "content": f"<tool_call>{{'name': 'tool_{i}'}}</tool_call>",
+                "content": None,
+                "tool_calls": [
+                    {
+                        "id": tc_id,
+                        "type": "function",
+                        "function": {"name": f"tool_{i}", "arguments": "{}"},
+                    },
+                ],
            })
-            # Tool result message with substantial content
            messages.append({
-                "role": "user",
-                "content": (
-                    f"{TOOL_RESULT_TAG}\n"
-                    f"[tool_{i}] {'x' * 2500}\n"
-                    f"{TOOL_RESULT_END}"
-                ),
+                "role": "tool",
+                "tool_call_id": tc_id,
+                "content": f"[tool_{i}] {'x' * 2500}",
            })
        return messages

@@ -865,21 +873,16 @@ class TestProgressiveDecay:
        msgs = self._make_messages(3)
        result = _apply_progressive_decay(msgs)
        assert len(result) == len(msgs)
-        # Content should be identical for short conversations
        for orig, decayed in zip(msgs, result):
-            assert orig["content"] == decayed["content"]
+            assert orig.get("content") == decayed.get("content")

    def test_old_messages_truncated(self):
        msgs = self._make_messages(20)
        result = _apply_progressive_decay(msgs)

-        # Recent tool results should be full length
-        last_tool_result = [m for m in result if m["role"] == "user" and TOOL_RESULT_TAG in m["content"]][-1]
-        assert len(last_tool_result["content"]) > 2000
-
-        # Oldest tool results should be truncated
-        first_tool_result = [m for m in result if m["role"] == "user" and TOOL_RESULT_TAG in m["content"]][0]
-        assert len(first_tool_result["content"]) < 500
+        tool_msgs = [m for m in result if m.get("role") == "tool"]
+        assert len(tool_msgs[-1]["content"]) > 2000
+        assert len(tool_msgs[0]["content"]) < 500

    def test_message_count_preserved(self):
        msgs = self._make_messages(20)
@@ -915,7 +918,7 @@ class TestMessageFolding:
            messages.append({"role": "assistant", "content": f"thinking step {i}"})
            messages.append({"role": "user", "content": f"tool result {i}: {'data ' * 50}"})

-        result = await client._fold_old_messages(messages, "system prompt")
+        result = await client._fold_old_messages(messages)

        # Should be significantly shorter
        assert len(result) < len(messages)
@@ -937,7 +940,7 @@ class TestMessageFolding:
            {"role": "assistant", "content": "hi"},
        ]

-        result = await client._fold_old_messages(messages, "system")
+        result = await client._fold_old_messages(messages)
        # Should return original (n_to_fold = 2 - 10 = negative, so no folding)
        assert result == messages
        client.chat.assert_not_called()
@@ -951,7 +954,555 @@ class TestMessageFolding:
        client.chat = AsyncMock(side_effect=Exception("API error"))

        messages = [{"role": "user", "content": f"msg {i}"} for i in range(40)]
-        result = await client._fold_old_messages(messages, "system")
+        result = await client._fold_old_messages(messages)

        # On failure, should return original messages unchanged
        assert len(result) == 40
+
+    @pytest.mark.asyncio
+    async def test_fold_boundary_never_orphans_tool_message(self):
+        """If the natural fold boundary would leave `role: "tool"` at the
+        head of `recent_messages`, fold must walk the boundary forward
+        until the head is non-tool. The API rejects orphan tool messages
+        with HTTP 400."""
+        from llm_client import LLMClient
+        from unittest.mock import AsyncMock
+
+        client = LLMClient.__new__(LLMClient)
+        client.chat = AsyncMock(return_value="summary")
+
+        # Build a long conversation of (assistant{tool_calls}, tool) pairs.
+        # Place the assistant at the exact n_to_fold boundary so its paired
+        # tool would otherwise be orphaned at the head of recent_messages.
+        msgs: list[dict] = [{"role": "user", "content": "task"}]
+        for i in range(30):
+            tc_id = f"call_{i}"
+            msgs.append({
+                "role": "assistant", "content": None,
+                "tool_calls": [{
+                    "id": tc_id, "type": "function",
+                    "function": {"name": f"t_{i}", "arguments": "{}"},
+                }],
+            })
+            msgs.append({"role": "tool", "tool_call_id": tc_id, "content": "ok"})
+
+        result = await client._fold_old_messages(msgs)
+        # No `role: "tool"` may appear without an `assistant{tool_calls}`
+        # immediately preceding it.
+        for i, m in enumerate(result):
+            if m.get("role") == "tool":
+                assert i > 0, "tool message cannot be first"
+                prev = result[i - 1]
+                assert prev.get("role") == "assistant" and prev.get("tool_calls"), (
+                    f"tool at index {i} preceded by {prev.get('role')} "
+                    f"(tool_calls={bool(prev.get('tool_calls'))})"
+                )
+
+
+# ---------------------------------------------------------------------------
+# Investigation areas: dataclass + derivation + coverage + dispatch
+# ---------------------------------------------------------------------------
+
+class TestInvestigationAreaDerivation:
+    @pytest.fixture
+    def graph(self):
+        return EvidenceGraph()
+
+    @pytest.mark.asyncio
+    async def test_add_investigation_area_dedupes_and_merges_lists(self, graph):
+        aid1, existed1 = await graph.add_investigation_area(
+            area="password_hashes", description="SAM dump",
+            suggested_agent="filesystem",
+            expected_keywords=["sam", "pwdump"],
+            expected_tools=["search_strings"],
+            motivating_hypothesis_ids=["hyp-a"],
+            created_by="llm_derive",
+        )
+        aid2, existed2 = await graph.add_investigation_area(
+            area="password_hashes", description="Other description",
+            suggested_agent="registry",
+            expected_keywords=["sam", "hashdump"],          # one new
+            expected_tools=["search_strings", "parse_registry_key"],   # one new
+            motivating_hypothesis_ids=["hyp-b"],              # new
+            created_by="manual",
+        )
+        assert not existed1
+        assert existed2
+        assert aid1 == aid2
+        a = graph.investigation_areas[aid1]
+        # Description and suggested_agent NOT overwritten (first-write wins)
+        assert a.description == "SAM dump"
+        assert a.suggested_agent == "filesystem"
+        # Three list fields are unioned
+        assert set(a.expected_keywords) == {"sam", "pwdump", "hashdump"}
+        assert set(a.expected_tools) == {"search_strings", "parse_registry_key"}
+        assert set(a.motivating_hypothesis_ids) == {"hyp-a", "hyp-b"}
+
+    @pytest.mark.asyncio
+    async def test_check_coverage_keyword_layer(self, graph):
+        await graph.add_phenomenon(
+            "fs", "filesystem", "Cain SAM dump artifact",
+            "Found sam.lst in the Cain folder",
+            source_tool="list_directory",
+        )
+        await graph.add_investigation_area(
+            area="password_hashes", description="SAM dump",
+            suggested_agent="filesystem",
+            expected_keywords=["sam.lst", "pwdump"],
+            expected_tools=["nonexistent_tool"],
+        )
+        from orchestrator import Orchestrator
+        from agent_factory import AgentFactory
+        from unittest.mock import AsyncMock
+        orch = Orchestrator(AsyncMock(), graph, AgentFactory(AsyncMock(), graph))
+        covered = orch._check_coverage()
+        assert "password_hashes" in covered
+
+    @pytest.mark.asyncio
+    async def test_check_coverage_tool_layer(self, graph):
+        await graph.add_phenomenon(
+            "reg", "registry", "User accounts",
+            "Found accounts",
+            source_tool="enumerate_users",
+        )
+        await graph.add_investigation_area(
+            area="user_accounts", description="Enum users",
+            suggested_agent="registry",
+            expected_keywords=["irrelevant"],
+            expected_tools=["enumerate_users"],
+        )
+        from orchestrator import Orchestrator
+        from agent_factory import AgentFactory
+        from unittest.mock import AsyncMock
+        orch = Orchestrator(AsyncMock(), graph, AgentFactory(AsyncMock(), graph))
+        covered = orch._check_coverage()
+        assert "user_accounts" in covered
+
+    @pytest.mark.asyncio
+    async def test_load_state_round_trip_areas(self, graph, tmp_path):
+        await graph.add_investigation_area(
+            area="x", description="d", suggested_agent="filesystem",
+            expected_keywords=["k1"], expected_tools=["t1"],
+            priority=2, motivating_hypothesis_ids=["hyp-a"],
+            created_by="manual",
+        )
+        path = tmp_path / "state.json"
+        graph.save_state(path)
+        g2 = EvidenceGraph.load_state(path)
+        assert len(g2.investigation_areas) == 1
+        a = list(g2.investigation_areas.values())[0]
+        assert a.area == "x"
+        assert a.expected_keywords == ["k1"]
+        assert a.priority == 2
+        assert a.motivating_hypothesis_ids == ["hyp-a"]
+        assert a.created_by == "manual"
+
+    @pytest.mark.asyncio
+    async def test_derive_no_op_when_areas_already_populated(self, graph):
+        """Resume safety: if areas are already in the graph (manual seed or
+        restored from disk), _derive_investigation_areas does nothing."""
+        from unittest.mock import AsyncMock
+        from orchestrator import Orchestrator
+        from agent_factory import AgentFactory
+
+        await graph.add_hypothesis("test", "desc", created_by="t")
+        await graph.add_investigation_area(
+            area="pre_existing", description="d", suggested_agent="filesystem",
+            created_by="manual",
+        )
+
+        llm = AsyncMock()
+        orch = Orchestrator(llm, graph, AgentFactory(llm, graph))
+        await orch._derive_investigation_areas()
+        # LLM should not have been called
+        assert llm.chat.call_count == 0
+        # Area count unchanged
+        assert len(graph.investigation_areas) == 1
+
+    @pytest.mark.asyncio
+    async def test_fallback_when_llm_returns_empty_list(self, graph):
+        from unittest.mock import AsyncMock
+        from orchestrator import Orchestrator
+        from agent_factory import AgentFactory
+
+        await graph.add_hypothesis("Some compromise", "desc", created_by="t")
+        llm = AsyncMock()
+        llm.chat.return_value = "[]"
+        orch = Orchestrator(llm, graph, AgentFactory(llm, graph))
+        await orch._derive_investigation_areas()
+        # Fallback creates one area per hypothesis
+        assert len(graph.investigation_areas) == 1
+        a = list(graph.investigation_areas.values())[0]
+        assert a.created_by == "fallback"
+
+    @pytest.mark.asyncio
+    async def test_unknown_tool_filtered_kept_keywords(self, graph):
+        """LLM emits a tool name not in TOOL_CATALOG; tool is filtered,
+        but the area itself with its keywords is kept."""
+        from unittest.mock import AsyncMock
+        from orchestrator import Orchestrator
+        from agent_factory import AgentFactory
+
+        h = await graph.add_hypothesis("h", "desc", created_by="t")
+        llm = AsyncMock()
+        llm.chat.return_value = (
+            '[{"area":"foo","description":"desc","suggested_agent":"filesystem",'
+            '"expected_keywords":["kw1","kw2"],"expected_tools":["nonexistent_tool"],'
+            f'"priority":2,"motivating_hypothesis_ids":["{h}"]}}]'
+        )
+        orch = Orchestrator(llm, graph, AgentFactory(llm, graph))
+        await orch._derive_investigation_areas()
+        assert "area-foo" in graph.investigation_areas
+        a = graph.investigation_areas["area-foo"]
+        assert a.expected_keywords == ["kw1", "kw2"]
+        assert a.expected_tools == []   # filtered out
+
+    @pytest.mark.asyncio
+    async def test_unknown_agent_resolved_via_AGENT_ALIASES(self, graph):
+        """LLM emits 'chat' (which is in AGENT_ALIASES → 'communication').
+        The area should land with the resolved agent name."""
+        from unittest.mock import AsyncMock
+        from orchestrator import Orchestrator
+        from agent_factory import AgentFactory
+
+        h = await graph.add_hypothesis("h", "desc", created_by="t")
+        llm = AsyncMock()
+        llm.chat.return_value = (
+            '[{"area":"chat_stuff","description":"d","suggested_agent":"chat",'
+            '"expected_keywords":["irc"],"expected_tools":[],'
+            f'"priority":3,"motivating_hypothesis_ids":["{h}"]}}]'
+        )
+        orch = Orchestrator(llm, graph, AgentFactory(llm, graph))
+        await orch._derive_investigation_areas()
+        a = graph.investigation_areas["area-chat_stuff"]
+        assert a.suggested_agent == "communication"
+
+    @staticmethod
+    def _agent_with_executor(graph, llm, tool_name: str, real_executor):
+        """Build a BaseAgent that registers tool_name via the real register_tool
+        path so the mandatory-record wrapper is engaged."""
+        from base_agent import BaseAgent
+        agent = BaseAgent(llm, graph)
+        agent.name = "test_agent"
+        # Bypass _register_graph_tools side-effects in run() — we register
+        # only what the test needs.
+        agent._register_graph_tools = lambda: None
+        agent.register_tool(
+            name=tool_name, description="", input_schema={}, executor=real_executor,
+        )
+        return agent
+
+    @pytest.mark.asyncio
+    async def test_forced_record_retry_fires_when_zero_phenomena(self):
+        """BaseAgent.run should automatically retry one more LLM round if
+        the agent finished without calling any mandatory recording tool."""
+        from unittest.mock import AsyncMock
+
+        graph = EvidenceGraph()
+        llm = AsyncMock()
+
+        async def real_add(**kw):
+            await graph.add_phenomenon(
+                source_agent="test", category="filesystem",
+                title="Forced retry record",
+                description="Recorded after STOP prompt",
+                source_tool="forced_retry",
+            )
+
+        agent = self._agent_with_executor(graph, llm, "add_phenomenon", real_add)
+
+        async def fake_tool_call_loop(messages, tools, tool_executor, system, max_iterations=40, terminal_tools=()):
+            already_retrying = any(
+                "STOP." in (m.get("content", "") if isinstance(m, dict) else "")
+                for m in messages
+            )
+            if not already_retrying:
+                return "Final answer without recording.", list(messages) + [
+                    {"role": "assistant", "content": "Final answer without recording."}
+                ]
+            await tool_executor["add_phenomenon"]()  # goes through wrapper
+            return "Recorded.", []
+
+        llm.tool_call_loop = fake_tool_call_loop
+        await agent.run("test task")
+        assert len(graph.phenomena) == 1
+        assert agent._record_call_counts["add_phenomenon"] == 1
+
+    @pytest.mark.asyncio
+    async def test_no_retry_when_mandatory_tool_was_called(self):
+        """Retry should NOT fire if a mandatory recording tool was invoked."""
+        from unittest.mock import AsyncMock
+
+        graph = EvidenceGraph()
+        llm = AsyncMock()
+        call_count = {"n": 0}
+
+        async def real_add(**kw):
+            await graph.add_phenomenon(
+                source_agent="test", category="filesystem", title="x",
+                description="y", source_tool="t",
+            )
+
+        agent = self._agent_with_executor(graph, llm, "add_phenomenon", real_add)
+
+        async def fake_tool_call_loop(messages, tools, tool_executor, system, max_iterations=40, terminal_tools=()):
+            call_count["n"] += 1
+            await tool_executor["add_phenomenon"]()  # wrapper increments count
+            return "done.", list(messages)
+
+        llm.tool_call_loop = fake_tool_call_loop
+        await agent.run("test task")
+        assert call_count["n"] == 1   # no retry
+
+    @pytest.mark.asyncio
+    async def test_no_retry_when_mandatory_tools_empty(self):
+        """ReportAgent declares mandatory_record_tools=() — retry should
+        not fire even with zero graph mutations (final text IS the output)."""
+        from unittest.mock import AsyncMock
+        from base_agent import BaseAgent
+
+        graph = EvidenceGraph()
+        llm = AsyncMock()
+        call_count = {"n": 0}
+
+        async def fake_tool_call_loop(messages, tools, tool_executor, system, max_iterations=40, terminal_tools=()):
+            call_count["n"] += 1
+            return "report body here", list(messages)
+
+        llm.tool_call_loop = fake_tool_call_loop
+
+        class ReportLike(BaseAgent):
+            mandatory_record_tools = ()
+
+        agent = ReportLike(llm, graph)
+        agent.name = "report_like"
+        agent._register_graph_tools = lambda: None
+        await agent.run("test task")
+        assert call_count["n"] == 1
+
+    @pytest.mark.asyncio
+    async def test_forced_retry_fires_for_timeline_agent(self):
+        """TimelineAgent.mandatory_record_tools=('add_temporal_edge',) — retry
+        should fire when timeline finishes without creating any temporal edges,
+        even though the agent does not have add_phenomenon."""
+        from unittest.mock import AsyncMock
+        from base_agent import BaseAgent
+
+        graph = EvidenceGraph()
+        llm = AsyncMock()
+        call_count = {"n": 0}
+
+        edge_added = {"n": 0}
+        async def real_add_edge(**kw):
+            edge_added["n"] += 1
+
+        class TimelineLike(BaseAgent):
+            mandatory_record_tools = ("add_temporal_edge",)
+
+        agent = TimelineLike(llm, graph)
+        agent.name = "timeline_like"
+        agent._register_graph_tools = lambda: None
+        agent.register_tool("add_temporal_edge", "", {}, real_add_edge)
+
+        async def fake_tool_call_loop(messages, tools, tool_executor, system, max_iterations=40, terminal_tools=()):
+            call_count["n"] += 1
+            already_retrying = any(
+                "STOP." in (m.get("content", "") if isinstance(m, dict) else "")
+                for m in messages
+            )
+            if not already_retrying:
+                return "answer", list(messages)
+            await tool_executor["add_temporal_edge"]()
+            return "recorded.", []
+
+        llm.tool_call_loop = fake_tool_call_loop
+        await agent.run("build timeline")
+        assert call_count["n"] == 2          # first + retry
+        assert edge_added["n"] == 1
+        assert agent._record_call_counts["add_temporal_edge"] == 1
+
+    # ---- terminal_tools: real LLMClient.tool_call_loop short-circuit -----
+
+    @pytest.mark.asyncio
+    async def test_terminal_tool_exits_loop_immediately(self):
+        """When a terminal tool is called, tool_call_loop must return
+        with that tool's result text as final_text — no further LLM calls."""
+        from unittest.mock import AsyncMock
+        from llm_client import LLMClient
+
+        client = LLMClient.__new__(LLMClient)
+        client.max_tokens = 4096
+        client.reasoning_effort = None
+        client.thinking_enabled = False
+        client.model = "test"
+        client._client = None
+
+        call_count = {"n": 0}
+
+        async def fake_chat_with_tools(messages, openai_tools):
+            call_count["n"] += 1
+            if call_count["n"] == 1:
+                # First turn: model calls a read tool then the terminal tool.
+                return "thinking aloud", None, [
+                    {"id": "tc1", "name": "read_tool", "arguments": "{}"},
+                    {"id": "tc2", "name": "save_report",
+                     "arguments": '{"content":"FINAL REPORT BODY","output_path":"r.md"}'},
+                ]
+            raise AssertionError("loop should have exited after terminal tool")
+
+        client._chat_with_tools = fake_chat_with_tools
+
+        async def read_tool():
+            return "some data"
+
+        async def save_report(content, output_path):
+            return content  # terminal tool returns content as final_text
+
+        tools = [
+            {"name": "read_tool", "description": "", "input_schema": {"type": "object", "properties": {}}},
+            {"name": "save_report", "description": "",
+             "input_schema": {"type": "object", "properties": {
+                 "content": {"type": "string"}, "output_path": {"type": "string"}}}},
+        ]
+        executors = {"read_tool": read_tool, "save_report": save_report}
+
+        final_text, _ = await client.tool_call_loop(
+            messages=[{"role": "user", "content": "go"}],
+            tools=tools, tool_executor=executors,
+            system="sys", terminal_tools=("save_report",),
+        )
+        assert final_text == "FINAL REPORT BODY"
+        assert call_count["n"] == 1  # never called a 2nd round
+
+    @pytest.mark.asyncio
+    async def test_no_terminal_short_circuit_when_not_declared(self):
+        """When terminal_tools is empty, the same call sequence should
+        run the read tool, run save_report-like tool, AND continue the loop
+        (i.e. another LLM round) until the model stops calling tools."""
+        from unittest.mock import AsyncMock
+        from llm_client import LLMClient
+
+        client = LLMClient.__new__(LLMClient)
+        client.max_tokens = 4096
+        client.reasoning_effort = None
+        client.thinking_enabled = False
+        client.model = "test"
+        client._client = None
+
+        call_count = {"n": 0}
+
+        async def fake_chat_with_tools(messages, openai_tools):
+            call_count["n"] += 1
+            if call_count["n"] == 1:
+                return "", None, [
+                    {"id": "tc1", "name": "add_phenomenon",
+                     "arguments": '{"title":"x","description":"y"}'},
+                ]
+            return "all done", None, []  # model stops calling tools
+
+        client._chat_with_tools = fake_chat_with_tools
+
+        async def add_phenomenon(title, description):
+            return f"recorded {title}"
+
+        tools = [
+            {"name": "add_phenomenon", "description": "",
+             "input_schema": {"type": "object", "properties": {
+                 "title": {"type": "string"}, "description": {"type": "string"}}}},
+        ]
+        executors = {"add_phenomenon": add_phenomenon}
+
+        final_text, _ = await client.tool_call_loop(
+            messages=[{"role": "user", "content": "go"}],
+            tools=tools, tool_executor=executors,
+            system="sys", terminal_tools=(),  # NOT terminal
+        )
+        assert final_text == "all done"
+        assert call_count["n"] == 2  # 2 rounds — terminal_tools empty, loop continues
+
+    @pytest.mark.asyncio
+    async def test_report_agent_terminal_tool_declared(self):
+        """ReportAgent should declare save_report as both mandatory and terminal."""
+        from agents.report import ReportAgent
+        assert ReportAgent.terminal_tools == ("save_report",)
+        assert ReportAgent.mandatory_record_tools == ("save_report",)
+
+    @pytest.mark.asyncio
+    async def test_terminal_tool_result_not_truncated(self):
+        """Terminal tool's raw return is used as final_text and must NOT
+        be truncated to 3000 chars (the truncation cap applies only to
+        LLM-context tool result messages). A 20K-char markdown report
+        passed through save_report should reach the caller intact."""
+        from llm_client import LLMClient
+
+        client = LLMClient.__new__(LLMClient)
+        client.max_tokens = 4096
+        client.reasoning_effort = None
+        client.thinking_enabled = False
+        client.model = "test"
+        client._client = None
+
+        long_report = "# Report\n" + ("- finding " + "x" * 100 + "\n") * 200
+        assert len(long_report) > 10000
+
+        call_count = {"n": 0}
+
+        async def fake_chat_with_tools(messages, openai_tools):
+            call_count["n"] += 1
+            if call_count["n"] == 1:
+                return "", None, [
+                    {"id": "tc1", "name": "save_report",
+                     "arguments": '{"content":"placeholder","output_path":"r.md"}'},
+                ]
+            raise AssertionError("loop should have exited")
+
+        client._chat_with_tools = fake_chat_with_tools
+
+        async def save_report(content, output_path):
+            return long_report  # ignore content arg; return long content
+
+        tools = [{"name": "save_report", "description": "",
+                  "input_schema": {"type": "object", "properties": {
+                      "content": {"type": "string"}, "output_path": {"type": "string"}}}}]
+        executors = {"save_report": save_report}
+
+        final_text, _ = await client.tool_call_loop(
+            messages=[{"role": "user", "content": "go"}],
+            tools=tools, tool_executor=executors,
+            system="sys", terminal_tools=("save_report",),
+        )
+        assert final_text == long_report
+        assert len(final_text) > 10000  # not truncated to 3000
+
+    @pytest.mark.asyncio
+    async def test_dispatch_uses_hypothesis_id_when_motivating_ids_present(self, graph):
+        from unittest.mock import AsyncMock
+        from orchestrator import Orchestrator
+        from agent_factory import AgentFactory
+
+        h = await graph.add_hypothesis("h", "desc", created_by="t")
+        await graph.add_investigation_area(
+            area="uncovered_area", description="d", suggested_agent="registry",
+            expected_keywords=["xyz_no_match"], expected_tools=[],
+            priority=2, motivating_hypothesis_ids=[h],
+            created_by="llm_derive",
+        )
+        orch = Orchestrator(AsyncMock(), graph, AgentFactory(AsyncMock(), graph))
+        # Don't actually dispatch (would call agents) — just hit the lead-add path
+        # by manually replicating what _run_gap_analysis does.
+        covered = orch._check_coverage()
+        assert "uncovered_area" not in covered
+        # Simulate dispatch
+        for a in graph.investigation_areas.values():
+            if a.area not in covered:
+                await graph.add_lead(
+                    target_agent=a.suggested_agent,
+                    description=a.description,
+                    priority=a.priority,
+                    hypothesis_id=(a.motivating_hypothesis_ids[0]
+                                   if a.motivating_hypothesis_ids else None),
+                )
+        assert len(graph.leads) == 1
+        assert graph.leads[0].hypothesis_id == h
+        assert graph.leads[0].target_agent == "registry"
+        assert graph.leads[0].priority == 2
--- a/uv.lock
+++ b/uv.lock
@@ -2,6 +2,15 @@ version = 1
 revision = 3
 requires-python = ">=3.14"

+[[package]]
+name = "annotated-types"
+version = "0.7.0"
+source = { registry = "https://pypi.org/simple" }
+sdist = { url = "https://files.pythonhosted.org/packages/ee/67/531ea369ba64dcff5ec9c3402f9f51bf748cec26dde048a2f973a4eea7f5/annotated_types-0.7.0.tar.gz", hash = "sha256:aff07c09a53a08bc8cfccb9c85b05f1aa9a2a6f23728d790723543408344ce89", size = 16081, upload-time = "2024-05-20T21:33:25.928Z" }
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/78/b6/6307fbef88d9b5ee7421e68d78a9f162e0da4900bc5f5793f6d3d0e34fb8/annotated_types-0.7.0-py3-none-any.whl", hash = "sha256:1f02e8b43a8fbbc3f3e0d4f0f4bfc8131bcb4eebe8849b8e5c773f3a1c582a53", size = 13643, upload-time = "2024-05-20T21:33:24.1Z" },
+]
+
 [[package]]
 name = "anyio"
 version = "4.13.0"
@@ -41,6 +50,15 @@ wheels = [
    { url = "https://files.pythonhosted.org/packages/b2/fb/08b3f4bf05da99aba8ffea52a558758def16e8516bc75ca94ff73587e7d3/construct-2.10.70-py3-none-any.whl", hash = "sha256:c80be81ef595a1a821ec69dc16099550ed22197615f4320b57cc9ce2a672cb30", size = 63020, upload-time = "2023-11-29T08:44:46.876Z" },
 ]

+[[package]]
+name = "distro"
+version = "1.9.0"
+source = { registry = "https://pypi.org/simple" }
+sdist = { url = "https://files.pythonhosted.org/packages/fc/f8/98eea607f65de6527f8a2e8885fc8015d3e6f5775df186e443e0964a11c3/distro-1.9.0.tar.gz", hash = "sha256:2fa77c6fd8940f116ee1d6b94a2f90b13b5ea8d019b98bc8bafdcabcdd9bdbed", size = 60722, upload-time = "2023-12-24T09:54:32.31Z" }
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/12/b3/231ffd4ab1fc9d679809f356cebee130ac7daa00d6d6f3206dd4fd137e9e/distro-1.9.0-py3-none-any.whl", hash = "sha256:7bffd925d65168f85027d8da9af6bddab658135b840670a223589bc0c8ef02b2", size = 20277, upload-time = "2023-12-24T09:54:30.421Z" },
+]
+
 [[package]]
 name = "h11"
 version = "0.16.0"
@@ -110,12 +128,48 @@ wheels = [
    { url = "https://files.pythonhosted.org/packages/cb/b1/3846dd7f199d53cb17f49cba7e651e9ce294d8497c8c150530ed11865bb8/iniconfig-2.3.0-py3-none-any.whl", hash = "sha256:f631c04d2c48c52b84d0d0549c99ff3859c98df65b3101406327ecc7d53fbf12", size = 7484, upload-time = "2025-10-18T21:55:41.639Z" },
 ]

+[[package]]
+name = "jiter"
+version = "0.14.0"
+source = { registry = "https://pypi.org/simple" }
+sdist = { url = "https://files.pythonhosted.org/packages/6e/c1/0cddc6eb17d4c53a99840953f95dd3accdc5cfc7a337b0e9b26476276be9/jiter-0.14.0.tar.gz", hash = "sha256:e8a39e66dac7153cf3f964a12aad515afa8d74938ec5cc0018adcdae5367c79e", size = 165725, upload-time = "2026-04-10T14:28:42.01Z" }
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/4f/1e/354ed92461b165bd581f9ef5150971a572c873ec3b68a916d5aa91da3cc2/jiter-0.14.0-cp314-cp314-macosx_10_12_x86_64.whl", hash = "sha256:6f396837fc7577871ca8c12edaf239ed9ccef3bbe39904ae9b8b63ce0a48b140", size = 315277, upload-time = "2026-04-10T14:27:18.109Z" },
+    { url = "https://files.pythonhosted.org/packages/a6/95/8c7c7028aa8636ac21b7a55faef3e34215e6ed0cbf5ae58258427f621aa3/jiter-0.14.0-cp314-cp314-macosx_11_0_arm64.whl", hash = "sha256:a4d50ea3d8ba4176f79754333bd35f1bbcd28e91adc13eb9b7ca91bc52a6cef9", size = 315923, upload-time = "2026-04-10T14:27:19.603Z" },
+    { url = "https://files.pythonhosted.org/packages/47/40/e2a852a44c4a089f2681a16611b7ce113224a80fd8504c46d78491b47220/jiter-0.14.0-cp314-cp314-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:ce17f8a050447d1b4153bda4fb7d26e6a9e74eb4f4a41913f30934c5075bf615", size = 344943, upload-time = "2026-04-10T14:27:21.262Z" },
+    { url = "https://files.pythonhosted.org/packages/fc/1f/670f92adee1e9895eac41e8a4d623b6da68c4d46249d8b556b60b63f949e/jiter-0.14.0-cp314-cp314-manylinux_2_17_armv7l.manylinux2014_armv7l.whl", hash = "sha256:f4f1c4b125e1652aefbc2e2c1617b60a160ab789d180e3d423c41439e5f32850", size = 369725, upload-time = "2026-04-10T14:27:22.766Z" },
+    { url = "https://files.pythonhosted.org/packages/01/2f/541c9ba567d05de1c4874a0f8f8c5e3fd78e2b874266623da9a775cf46e0/jiter-0.14.0-cp314-cp314-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:be808176a6a3a14321d18c603f2d40741858a7c4fc982f83232842689fe86dd9", size = 461210, upload-time = "2026-04-10T14:27:24.315Z" },
+    { url = "https://files.pythonhosted.org/packages/ce/a9/c31cbec09627e0d5de7aeaec7690dba03e090caa808fefd8133137cf45bc/jiter-0.14.0-cp314-cp314-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:26679d58ba816f88c3849306dd58cb863a90a1cf352cdd4ef67e30ccf8a77994", size = 380002, upload-time = "2026-04-10T14:27:26.155Z" },
+    { url = "https://files.pythonhosted.org/packages/50/02/3c05c1666c41904a2f607475a73e7a4763d1cbde2d18229c4f85b22dc253/jiter-0.14.0-cp314-cp314-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:80381f5a19af8fa9aef743f080e34f6b25ebd89656475f8cf0470ec6157052aa", size = 354678, upload-time = "2026-04-10T14:27:27.701Z" },
+    { url = "https://files.pythonhosted.org/packages/7d/97/e15b33545c2b13518f560d695f974b9891b311641bdcf178d63177e8801e/jiter-0.14.0-cp314-cp314-manylinux_2_31_riscv64.whl", hash = "sha256:004df5fdb8ecbd6d99f3227df18ba1a259254c4359736a2e6f036c944e02d7c5", size = 358920, upload-time = "2026-04-10T14:27:29.256Z" },
+    { url = "https://files.pythonhosted.org/packages/ad/d2/8b1461def6b96ba44530df20d07ef7a1c7da22f3f9bf1727e2d611077bf1/jiter-0.14.0-cp314-cp314-manylinux_2_5_i686.manylinux1_i686.whl", hash = "sha256:cff5708f7ed0fa098f2b53446c6fa74c48469118e5cd7497b4f1cd569ab06928", size = 394512, upload-time = "2026-04-10T14:27:31.344Z" },
+    { url = "https://files.pythonhosted.org/packages/e3/88/837566dd6ed6e452e8d3205355afd484ce44b2533edfa4ed73a298ea893e/jiter-0.14.0-cp314-cp314-musllinux_1_1_aarch64.whl", hash = "sha256:2492e5f06c36a976d25c7cc347a60e26d5470178d44cde1b9b75e60b4e519f28", size = 521120, upload-time = "2026-04-10T14:27:33.299Z" },
+    { url = "https://files.pythonhosted.org/packages/89/6b/b00b45c4d1b4c031777fe161d620b755b5b02cdade1e316dcb46e4471d63/jiter-0.14.0-cp314-cp314-musllinux_1_1_x86_64.whl", hash = "sha256:7609cfbe3a03d37bfdbf5052012d5a879e72b83168a363deae7b3a26564d57de", size = 553668, upload-time = "2026-04-10T14:27:34.868Z" },
+    { url = "https://files.pythonhosted.org/packages/ad/d8/6fe5b42011d19397433d345716eac16728ac241862a2aac9c91923c7509a/jiter-0.14.0-cp314-cp314-win32.whl", hash = "sha256:7282342d32e357543565286b6450378c3cd402eea333fc1ebe146f1fabb306fc", size = 207001, upload-time = "2026-04-10T14:27:36.455Z" },
+    { url = "https://files.pythonhosted.org/packages/e5/43/5c2e08da1efad5e410f0eaaabeadd954812612c33fbbd8fd5328b489139d/jiter-0.14.0-cp314-cp314-win_amd64.whl", hash = "sha256:bd77945f38866a448e73b0b7637366afa814d4617790ecd88a18ca74377e6c02", size = 202187, upload-time = "2026-04-10T14:27:38Z" },
+    { url = "https://files.pythonhosted.org/packages/aa/1f/6e39ac0b4cdfa23e606af5b245df5f9adaa76f35e0c5096790da430ca506/jiter-0.14.0-cp314-cp314-win_arm64.whl", hash = "sha256:f2d4c61da0821ee42e0cdf5489da60a6d074306313a377c2b35af464955a3611", size = 192257, upload-time = "2026-04-10T14:27:39.504Z" },
+    { url = "https://files.pythonhosted.org/packages/05/57/7dbc0ffbbb5176a27e3518716608aa464aee2e2887dc938f0b900a120449/jiter-0.14.0-cp314-cp314t-macosx_11_0_arm64.whl", hash = "sha256:1bf7ff85517dd2f20a5750081d2b75083c1b269cf75afc7511bdf1f9548beb3b", size = 323441, upload-time = "2026-04-10T14:27:41.039Z" },
+    { url = "https://files.pythonhosted.org/packages/83/6e/7b3314398d8983f06b557aa21b670511ec72d3b79a68ee5e4d9bff972286/jiter-0.14.0-cp314-cp314t-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:c8ef8791c3e78d6c6b157c6d360fbb5c715bebb8113bc6a9303c5caff012754a", size = 348109, upload-time = "2026-04-10T14:27:42.552Z" },
+    { url = "https://files.pythonhosted.org/packages/ae/4f/8dc674bcd7db6dba566de73c08c763c337058baff1dbeb34567045b27cdc/jiter-0.14.0-cp314-cp314t-manylinux_2_17_armv7l.manylinux2014_armv7l.whl", hash = "sha256:e74663b8b10da1fe0f4e4703fd7980d24ad17174b6bb35d8498d6e3ebce2ae6a", size = 368328, upload-time = "2026-04-10T14:27:44.574Z" },
+    { url = "https://files.pythonhosted.org/packages/3b/5f/188e09a1f20906f98bbdec44ed820e19f4e8eb8aff88b9d1a5a497587ff3/jiter-0.14.0-cp314-cp314t-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:1aca29ba52913f78362ec9c2da62f22cdc4c3083313403f90c15460979b84d9b", size = 463301, upload-time = "2026-04-10T14:27:46.717Z" },
+    { url = "https://files.pythonhosted.org/packages/ac/f0/19046ef965ed8f349e8554775bb12ff4352f443fbe12b95d31f575891256/jiter-0.14.0-cp314-cp314t-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:8b39b7d87a952b79949af5fef44d2544e58c21a28da7f1bae3ef166455c61746", size = 378891, upload-time = "2026-04-10T14:27:48.32Z" },
+    { url = "https://files.pythonhosted.org/packages/c4/c3/da43bd8431ee175695777ee78cf0e93eacbb47393ff493f18c45231b427d/jiter-0.14.0-cp314-cp314t-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:78d918a68b26e9fab068c2b5453577ef04943ab2807b9a6275df2a812599a310", size = 360749, upload-time = "2026-04-10T14:27:49.88Z" },
+    { url = "https://files.pythonhosted.org/packages/72/26/e054771be889707c6161dbdec9c23d33a9ec70945395d70f07cfea1e9a6f/jiter-0.14.0-cp314-cp314t-manylinux_2_31_riscv64.whl", hash = "sha256:b08997c35aee1201c1a5361466a8fb9162d03ae7bf6568df70b6c859f1e654a4", size = 358526, upload-time = "2026-04-10T14:27:51.504Z" },
+    { url = "https://files.pythonhosted.org/packages/c3/0f/7bea65ea2a6d91f2bf989ff11a18136644392bf2b0497a1fa50934c30a9c/jiter-0.14.0-cp314-cp314t-manylinux_2_5_i686.manylinux1_i686.whl", hash = "sha256:260bf7ca20704d58d41f669e5e9fe7fe2fa72901a6b324e79056f5d52e9c9be2", size = 393926, upload-time = "2026-04-10T14:27:53.368Z" },
+    { url = "https://files.pythonhosted.org/packages/3c/a1/b1ff7d70deef61ac0b7c6c2f12d2ace950cdeecb4fdc94500a0926802857/jiter-0.14.0-cp314-cp314t-musllinux_1_1_aarch64.whl", hash = "sha256:37826e3df29e60f30a382f9294348d0238ef127f4b5d7f5f8da78b5b9e050560", size = 521052, upload-time = "2026-04-10T14:27:55.058Z" },
+    { url = "https://files.pythonhosted.org/packages/0b/7b/3b0649983cbaf15eda26a414b5b1982e910c67bd6f7b1b490f3cfc76896a/jiter-0.14.0-cp314-cp314t-musllinux_1_1_x86_64.whl", hash = "sha256:645be49c46f2900937ba0eaf871ad5183c96858c0af74b6becc7f4e367e36e06", size = 553716, upload-time = "2026-04-10T14:27:57.269Z" },
+    { url = "https://files.pythonhosted.org/packages/97/f8/33d78c83bd93ae0c0af05293a6660f88a1977caef39a6d72a84afab94ce0/jiter-0.14.0-cp314-cp314t-win32.whl", hash = "sha256:2f7877ed45118de283786178eceaf877110abacd04fde31efff3940ae9672674", size = 207957, upload-time = "2026-04-10T14:27:59.285Z" },
+    { url = "https://files.pythonhosted.org/packages/d6/ac/2b760516c03e2227826d1f7025d89bf6bf6357a28fe75c2a2800873c50bf/jiter-0.14.0-cp314-cp314t-win_amd64.whl", hash = "sha256:14c0cb10337c49f5eafe8e7364daca5e29a020ea03580b8f8e6c597fed4e1588", size = 204690, upload-time = "2026-04-10T14:28:00.962Z" },
+    { url = "https://files.pythonhosted.org/packages/dc/2e/a44c20c58aeed0355f2d326969a181696aeb551a25195f47563908a815be/jiter-0.14.0-cp314-cp314t-win_arm64.whl", hash = "sha256:5419d4aa2024961da9fe12a9cfe7484996735dca99e8e090b5c88595ef1951ff", size = 191338, upload-time = "2026-04-10T14:28:02.853Z" },
+]
+
 [[package]]
 name = "masforensics"
 version = "0.1.0"
 source = { virtual = "." }
 dependencies = [
    { name = "httpx", extra = ["socks"] },
+    { name = "openai" },
    { name = "pyyaml" },
    { name = "regipy" },
 ]
@@ -129,6 +183,7 @@ dev = [
 [package.metadata]
 requires-dist = [
    { name = "httpx", extras = ["socks"], specifier = ">=0.28.1" },
+    { name = "openai", specifier = ">=2.36.0" },
    { name = "pyyaml" },
    { name = "regipy", specifier = ">=6.2.1" },
 ]
@@ -139,6 +194,25 @@ dev = [
    { name = "pytest-asyncio", specifier = ">=1.3.0" },
 ]

+[[package]]
+name = "openai"
+version = "2.36.0"
+source = { registry = "https://pypi.org/simple" }
+dependencies = [
+    { name = "anyio" },
+    { name = "distro" },
+    { name = "httpx" },
+    { name = "jiter" },
+    { name = "pydantic" },
+    { name = "sniffio" },
+    { name = "tqdm" },
+    { name = "typing-extensions" },
+]
+sdist = { url = "https://files.pythonhosted.org/packages/f4/a1/4d5e84cf51720fc1526cc49e10ac1961abcccb55b0efb3d970db1e9a2728/openai-2.36.0.tar.gz", hash = "sha256:139dea0edd2f1b30c33d46ae1a6929e03906254140318e4608e98fe8c566f2e7", size = 753003, upload-time = "2026-05-07T17:33:17.075Z" }
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/9d/1c/5d43735b2553baae2a5e899dcbcd0670a86930d993184d72ca909bf11c9b/openai-2.36.0-py3-none-any.whl", hash = "sha256:143f6194b548dbc2c921af1f1b03b9f14c85fed8a75b5b516f5bcc11a2a50c63", size = 1302361, upload-time = "2026-05-07T17:33:15.063Z" },
+]
+
 [[package]]
 name = "packaging"
 version = "26.0"
@@ -157,6 +231,62 @@ wheels = [
    { url = "https://files.pythonhosted.org/packages/54/20/4d324d65cc6d9205fabedc306948156824eb9f0ee1633355a8f7ec5c66bf/pluggy-1.6.0-py3-none-any.whl", hash = "sha256:e920276dd6813095e9377c0bc5566d94c932c33b27a3e3945d8389c374dd4746", size = 20538, upload-time = "2025-05-15T12:30:06.134Z" },
 ]

+[[package]]
+name = "pydantic"
+version = "2.13.4"
+source = { registry = "https://pypi.org/simple" }
+dependencies = [
+    { name = "annotated-types" },
+    { name = "pydantic-core" },
+    { name = "typing-extensions" },
+    { name = "typing-inspection" },
+]
+sdist = { url = "https://files.pythonhosted.org/packages/18/a5/b60d21ac674192f8ab0ba4e9fd860690f9b4a6e51ca5df118733b487d8d6/pydantic-2.13.4.tar.gz", hash = "sha256:c40756b57adaa8b1efeeced5c196f3f3b7c435f90e84ea7f443901bec8099ef6", size = 844775, upload-time = "2026-05-06T13:43:05.343Z" }
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/fd/7b/122376b1fd3c62c1ed9dc80c931ace4844b3c55407b6fb2d199377c9736f/pydantic-2.13.4-py3-none-any.whl", hash = "sha256:45a282cde31d808236fd7ea9d919b128653c8b38b393d1c4ab335c62924d9aba", size = 472262, upload-time = "2026-05-06T13:43:02.641Z" },
+]
+
+[[package]]
+name = "pydantic-core"
+version = "2.46.4"
+source = { registry = "https://pypi.org/simple" }
+dependencies = [
+    { name = "typing-extensions" },
+]
+sdist = { url = "https://files.pythonhosted.org/packages/9d/56/921726b776ace8d8f5db44c4ef961006580d91dc52b803c489fafd1aa249/pydantic_core-2.46.4.tar.gz", hash = "sha256:62f875393d7f270851f20523dd2e29f082bcc82292d66db2b64ea71f64b6e1c1", size = 471464, upload-time = "2026-05-06T13:37:06.98Z" }
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/8d/74/228a26ddad29c6672b805d9fd78e8d251cd04004fa7eed0e622096cd0250/pydantic_core-2.46.4-cp314-cp314-macosx_10_12_x86_64.whl", hash = "sha256:428e04521a40150c85216fc8b85e8d39fece235a9cf5e383761238c7fa9b96fb", size = 2102079, upload-time = "2026-05-06T13:38:41.019Z" },
+    { url = "https://files.pythonhosted.org/packages/ad/1f/8970b150a4b4365623ae00fc88603491f763c627311ae8031e3111356d6e/pydantic_core-2.46.4-cp314-cp314-macosx_11_0_arm64.whl", hash = "sha256:23ace664830ee0bfe014a0c7bc248b1f7f25ed7ad103852c317624a1083af462", size = 1952179, upload-time = "2026-05-06T13:36:59.812Z" },
+    { url = "https://files.pythonhosted.org/packages/95/30/5211a831ae054928054b2f79731661087a2bc5c01e825c672b3a4a8f1b3e/pydantic_core-2.46.4-cp314-cp314-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:ce5c1d2a8b27468f433ca974829c44060b8097eedc39933e3c206a90ee49c4a9", size = 1978926, upload-time = "2026-05-06T13:37:39.933Z" },
+    { url = "https://files.pythonhosted.org/packages/57/e9/689668733b1eb67adeef047db3c2e8788fcf65a7fd9c9e2b46b7744fe245/pydantic_core-2.46.4-cp314-cp314-manylinux_2_17_armv7l.manylinux2014_armv7l.whl", hash = "sha256:7283d57845ecf5a163403eb0702dfc220cc4fbdd18919cb5ccea4f95ee1cdab4", size = 2046785, upload-time = "2026-05-06T13:38:01.995Z" },
+    { url = "https://files.pythonhosted.org/packages/60/d9/6715260422ff50a2109878fd24d948a6c3446bb2664f34ee78cd972b3acd/pydantic_core-2.46.4-cp314-cp314-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:8daafc69c93ee8a0204506a3b6b30f586ef54028f52aeeeb5c4cfc5184fd5914", size = 2228733, upload-time = "2026-05-06T13:40:50.371Z" },
+    { url = "https://files.pythonhosted.org/packages/18/ae/fdb2f64316afca925640f8e70bb1a564b0ec2721c1389e25b8eb4bf9a299/pydantic_core-2.46.4-cp314-cp314-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:cd2213145bcc2ba85884d0ac63d222fece9209678f77b9b4d76f054c561adb28", size = 2307534, upload-time = "2026-05-06T13:37:21.531Z" },
+    { url = "https://files.pythonhosted.org/packages/89/1d/8eff589b45bb8190a9d12c49cfad0f176a5cbd1534908a6b5125e2886239/pydantic_core-2.46.4-cp314-cp314-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:7a5f930472650a82629163023e630d160863fce524c616f4e5186e5de9d9a49b", size = 2099732, upload-time = "2026-05-06T13:39:31.942Z" },
+    { url = "https://files.pythonhosted.org/packages/06/d5/ee5a3366637fee41dee51a1fc91562dcf12ddbc68fda34e6b253da2324bb/pydantic_core-2.46.4-cp314-cp314-manylinux_2_31_riscv64.whl", hash = "sha256:c1b3f518abeca3aa13c712fd202306e145abf59a18b094a6bafb2d2bbf59192c", size = 2129627, upload-time = "2026-05-06T13:37:25.033Z" },
+    { url = "https://files.pythonhosted.org/packages/94/33/2414be571d2c6a6c4d08be21f9292b6d3fdb08949a97b6dfe985017821db/pydantic_core-2.46.4-cp314-cp314-manylinux_2_5_i686.manylinux1_i686.whl", hash = "sha256:1a7dd0b3ee80d90150e3495a3a13ac34dbcbfd4f012996a6a1d8900e91b5c0fb", size = 2179141, upload-time = "2026-05-06T13:37:14.046Z" },
+    { url = "https://files.pythonhosted.org/packages/7b/79/7daa95be995be0eecc4cf75064cb33f9bbbfe3fe0158caf2f0d4a996a5c7/pydantic_core-2.46.4-cp314-cp314-musllinux_1_1_aarch64.whl", hash = "sha256:3fb702cd90b0446a3a1c5e470bfa0dd23c0233b676a9099ddcc964fa6ca13898", size = 2184325, upload-time = "2026-05-06T13:36:53.615Z" },
+    { url = "https://files.pythonhosted.org/packages/9f/cb/d0a382f5c0de8a222dc61c65348e0ce831b1f68e0a018450d31c2cace3a5/pydantic_core-2.46.4-cp314-cp314-musllinux_1_1_armv7l.whl", hash = "sha256:b8458003118a712e66286df6a707db01c52c0f52f7db8e4a38f0da1d3b94fc4e", size = 2323990, upload-time = "2026-05-06T13:40:29.971Z" },
+    { url = "https://files.pythonhosted.org/packages/05/db/d9ba624cc4a5aced1598e88c04fdbd8310c8a69b9d38b9a3d39ce3a61ed7/pydantic_core-2.46.4-cp314-cp314-musllinux_1_1_x86_64.whl", hash = "sha256:372429a130e469c9cd698925ce5fc50940b7a1336b0d82038e63d5bbc4edc519", size = 2369978, upload-time = "2026-05-06T13:37:23.027Z" },
+    { url = "https://files.pythonhosted.org/packages/f2/20/d15df15ba918c423461905802bfd2981c3af0bfa0e40d05e13edbfa48bc3/pydantic_core-2.46.4-cp314-cp314-win32.whl", hash = "sha256:85bb3611ff1802f3ee7fdd7dbff26b56f343fb432d57a4728fdd49b6ef35e2f4", size = 1966354, upload-time = "2026-05-06T13:38:03.499Z" },
+    { url = "https://files.pythonhosted.org/packages/fc/b6/6b8de4c0a7d7ab3004c439c80c5c1e0a3e8d78bbae19379b01960383d9e5/pydantic_core-2.46.4-cp314-cp314-win_amd64.whl", hash = "sha256:811ff8e9c313ab425368bcbb36e5c4ebd7108c2bbf4e4089cfbb0b01eff63fac", size = 2072238, upload-time = "2026-05-06T13:39:40.807Z" },
+    { url = "https://files.pythonhosted.org/packages/32/36/51eb763beec1f4cf59b1db243a7dcc39cbb41230f050a09b9d69faaf0a48/pydantic_core-2.46.4-cp314-cp314-win_arm64.whl", hash = "sha256:bfec22eab3c8cc2ceec0248aec886624116dc079afa027ecc8ad4a7e62010f8a", size = 2018251, upload-time = "2026-05-06T13:37:26.72Z" },
+    { url = "https://files.pythonhosted.org/packages/e8/91/855af51d625b23aa987116a19e231d2aaef9c4a415273ddc189b79a45fee/pydantic_core-2.46.4-cp314-cp314t-macosx_10_12_x86_64.whl", hash = "sha256:af8244b2bef6aaad6d92cda81372de7f8c8d36c9f0c3ea36e827c60e7d9467a0", size = 2099593, upload-time = "2026-05-06T13:39:47.682Z" },
+    { url = "https://files.pythonhosted.org/packages/fb/1b/8784a54c65edb5f49f0a14d6977cf1b209bba85a4c77445b255c2de58ab3/pydantic_core-2.46.4-cp314-cp314t-macosx_11_0_arm64.whl", hash = "sha256:5a4330cdbc57162e4b3aa303f588ba752257694c9c9be3e7ebb11b4aca659b5d", size = 1935226, upload-time = "2026-05-06T13:40:40.428Z" },
+    { url = "https://files.pythonhosted.org/packages/e8/e7/1955d28d1afc56dd4b3ad7cc0cf39df1b9852964cf16e5d13912756d6d6b/pydantic_core-2.46.4-cp314-cp314t-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:29c61fc04a3d840155ff08e475a04809278972fe6aef51e2720554e96367e34b", size = 1974605, upload-time = "2026-05-06T13:37:32.029Z" },
+    { url = "https://files.pythonhosted.org/packages/93/e2/3fedbf0ba7a22850e6e9fd78117f1c0f10f950182344d8a6c535d468fdd8/pydantic_core-2.46.4-cp314-cp314t-manylinux_2_17_armv7l.manylinux2014_armv7l.whl", hash = "sha256:c50f2528cf200c5eed56faf3f4e22fcd5f38c157a8b78576e6ba3168ec35f000", size = 2030777, upload-time = "2026-05-06T13:38:55.239Z" },
+    { url = "https://files.pythonhosted.org/packages/f8/61/46be275fcaaba0b4f5b9669dd852267ce1ff616592dccf7a7845588df091/pydantic_core-2.46.4-cp314-cp314t-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:0cbe8b01f948de4286c74cdd6c667aceb38f5c1e26f0693b3983d9d74887c65e", size = 2236641, upload-time = "2026-05-06T13:37:08.096Z" },
+    { url = "https://files.pythonhosted.org/packages/60/db/12e93e46a8bac9988be3c016860f83293daea8c716c029c9ace279036f2f/pydantic_core-2.46.4-cp314-cp314t-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:617d7e2ca7dcb8c5cf6bcb8c59b8832c94b36196bbf1cbd1bfb56ed341905edd", size = 2286404, upload-time = "2026-05-06T13:40:20.221Z" },
+    { url = "https://files.pythonhosted.org/packages/e2/4a/4d8b19008f38d31c53b8219cfedc2e3d5de5fe99d90076b7e767de29274f/pydantic_core-2.46.4-cp314-cp314t-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:7027560ee92211647d0d34e3f7cd6f50da56399d26a9c8ad0da286d3869a53f3", size = 2109219, upload-time = "2026-05-06T13:38:12.153Z" },
+    { url = "https://files.pythonhosted.org/packages/88/70/3cbc40978fefb7bb09c6708d40d4ad1a5d70fd7213c3d17f971de868ec1f/pydantic_core-2.46.4-cp314-cp314t-manylinux_2_31_riscv64.whl", hash = "sha256:f99626688942fb746e545232e7726926f3be91b5975f8b55327665fafda991c7", size = 2110594, upload-time = "2026-05-06T13:40:02.971Z" },
+    { url = "https://files.pythonhosted.org/packages/9d/20/b8d36736216e29491125531685b2f9e61aa5b4b2599893f8268551da3338/pydantic_core-2.46.4-cp314-cp314t-manylinux_2_5_i686.manylinux1_i686.whl", hash = "sha256:fc3e9034a63de20e15e8ade85358bc6efc614008cab72898b4b4952bea0509ff", size = 2159542, upload-time = "2026-05-06T13:39:27.506Z" },
+    { url = "https://files.pythonhosted.org/packages/1d/a2/367df868eb584dacf6bf82a389272406d7178e301c4ac82545ab98bc2dd9/pydantic_core-2.46.4-cp314-cp314t-musllinux_1_1_aarch64.whl", hash = "sha256:97e7cf2be5c77b7d1a9713a05605d49460d02c6078d38d8bef3cbe323c548424", size = 2168146, upload-time = "2026-05-06T13:38:31.93Z" },
+    { url = "https://files.pythonhosted.org/packages/c1/b8/4460f77f7e201893f649a29ab355dddd3beee8a97bcb1a320db414f9a06e/pydantic_core-2.46.4-cp314-cp314t-musllinux_1_1_armv7l.whl", hash = "sha256:3bf92c5d0e00fefaab325a4d27828fe6b6e2a21848686b5b60d2d9eeb09d76c6", size = 2306309, upload-time = "2026-05-06T13:37:44.717Z" },
+    { url = "https://files.pythonhosted.org/packages/64/c4/be2639293acd87dc8ddbcec41a73cee9b2ebf996fe6d892a1a74e88ad3f7/pydantic_core-2.46.4-cp314-cp314t-musllinux_1_1_x86_64.whl", hash = "sha256:3ecbc122d18468d06ca279dc26a8c2e2d5acb10943bb35e36ae92096dc3b5565", size = 2369736, upload-time = "2026-05-06T13:37:05.645Z" },
+    { url = "https://files.pythonhosted.org/packages/30/a6/9f9f380dbb301f67023bf8f707aaa75daadf84f7152d95c410fd7e81d994/pydantic_core-2.46.4-cp314-cp314t-win32.whl", hash = "sha256:e846ae7835bf0703ae43f534ab79a867146dadd59dc9ca5c8b53d5c8f7c9ef02", size = 1955575, upload-time = "2026-05-06T13:38:51.116Z" },
+    { url = "https://files.pythonhosted.org/packages/40/1f/f1eb9eb350e795d1af8586289746f5c5677d16043040d63710e22abc43c9/pydantic_core-2.46.4-cp314-cp314t-win_amd64.whl", hash = "sha256:2108ba5c1c1eca18030634489dc544844144ee36357f2f9f780b93e7ddbb44b5", size = 2051624, upload-time = "2026-05-06T13:38:21.672Z" },
+    { url = "https://files.pythonhosted.org/packages/f6/d2/42dd53d0a85c27606f316d3aa5d2869c4e8470a5ed6dec30e4a1abe19192/pydantic_core-2.46.4-cp314-cp314t-win_arm64.whl", hash = "sha256:4fcbe087dbc2068af7eda3aa87634eba216dbda64d1ae73c8684b621d33f6596", size = 2017325, upload-time = "2026-05-06T13:40:52.723Z" },
+]
+
 [[package]]
 name = "pygments"
 version = "2.20.0"
@@ -243,6 +373,15 @@ wheels = [
    { url = "https://files.pythonhosted.org/packages/65/eb/db13ab9b8d54e04f42b6619acca417ee37b07eb141a54884d13d20d7459e/regipy-6.2.1-py3-none-any.whl", hash = "sha256:b03110e5c4e12385e1ba53c032ccd120c6dcde1b71afb8c3b7aa4717a5a24e43", size = 134861, upload-time = "2026-01-22T15:26:05.653Z" },
 ]

+[[package]]
+name = "sniffio"
+version = "1.3.1"
+source = { registry = "https://pypi.org/simple" }
+sdist = { url = "https://files.pythonhosted.org/packages/a2/87/a6771e1546d97e7e041b6ae58d80074f81b7d5121207425c964ddf5cfdbd/sniffio-1.3.1.tar.gz", hash = "sha256:f4324edc670a0f49750a81b895f35c3adb843cca46f0530f79fc1babb23789dc", size = 20372, upload-time = "2024-02-25T23:20:04.057Z" }
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/e9/44/75a9c9421471a6c4805dbf2356f7c181a29c1879239abab1ea2cc8f38b40/sniffio-1.3.1-py3-none-any.whl", hash = "sha256:2f6da418d1f1e0fddd844478f41680e794e6051915791a034ff65e5f100525a2", size = 10235, upload-time = "2024-02-25T23:20:01.196Z" },
+]
+
 [[package]]
 name = "socksio"
 version = "1.0.0"
@@ -251,3 +390,36 @@ sdist = { url = "https://files.pythonhosted.org/packages/f8/5c/48a7d9495be3d1c65
 wheels = [
    { url = "https://files.pythonhosted.org/packages/37/c3/6eeb6034408dac0fa653d126c9204ade96b819c936e136c5e8a6897eee9c/socksio-1.0.0-py3-none-any.whl", hash = "sha256:95dc1f15f9b34e8d7b16f06d74b8ccf48f609af32ab33c608d08761c5dcbb1f3", size = 12763, upload-time = "2020-04-17T15:50:31.878Z" },
 ]
+
+[[package]]
+name = "tqdm"
+version = "4.67.3"
+source = { registry = "https://pypi.org/simple" }
+dependencies = [
+    { name = "colorama", marker = "sys_platform == 'win32'" },
+]
+sdist = { url = "https://files.pythonhosted.org/packages/09/a9/6ba95a270c6f1fbcd8dac228323f2777d886cb206987444e4bce66338dd4/tqdm-4.67.3.tar.gz", hash = "sha256:7d825f03f89244ef73f1d4ce193cb1774a8179fd96f31d7e1dcde62092b960bb", size = 169598, upload-time = "2026-02-03T17:35:53.048Z" }
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/16/e1/3079a9ff9b8e11b846c6ac5c8b5bfb7ff225eee721825310c91b3b50304f/tqdm-4.67.3-py3-none-any.whl", hash = "sha256:ee1e4c0e59148062281c49d80b25b67771a127c85fc9676d3be5f243206826bf", size = 78374, upload-time = "2026-02-03T17:35:50.982Z" },
+]
+
+[[package]]
+name = "typing-extensions"
+version = "4.15.0"
+source = { registry = "https://pypi.org/simple" }
+sdist = { url = "https://files.pythonhosted.org/packages/72/94/1a15dd82efb362ac84269196e94cf00f187f7ed21c242792a923cdb1c61f/typing_extensions-4.15.0.tar.gz", hash = "sha256:0cea48d173cc12fa28ecabc3b837ea3cf6f38c6d1136f85cbaaf598984861466", size = 109391, upload-time = "2025-08-25T13:49:26.313Z" }
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/18/67/36e9267722cc04a6b9f15c7f3441c2363321a3ea07da7ae0c0707beb2a9c/typing_extensions-4.15.0-py3-none-any.whl", hash = "sha256:f0fa19c6845758ab08074a0cfa8b7aecb71c999ca73d62883bc25cc018c4e548", size = 44614, upload-time = "2025-08-25T13:49:24.86Z" },
+]
+
+[[package]]
+name = "typing-inspection"
+version = "0.4.2"
+source = { registry = "https://pypi.org/simple" }
+dependencies = [
+    { name = "typing-extensions" },
+]
+sdist = { url = "https://files.pythonhosted.org/packages/55/e3/70399cb7dd41c10ac53367ae42139cf4b1ca5f36bb3dc6c9d33acdb43655/typing_inspection-0.4.2.tar.gz", hash = "sha256:ba561c48a67c5958007083d386c3295464928b01faa735ab8547c5692e87f464", size = 75949, upload-time = "2025-10-01T02:14:41.687Z" }
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/dc/9b/47798a6c91d8bdb567fe2698fe81e0c6b7cb7ef4d13da4114b41d239f65d/typing_inspection-0.4.2-py3-none-any.whl", hash = "sha256:4ed1cacbdc298c220f1bd249ed5287caa16f34d44ef4e9c3d0cbad5b521545e7", size = 14611, upload-time = "2025-10-01T02:14:40.154Z" },
+]
Author	SHA1	Message	Date
BattleTag	444d58726a	refactor: native tool calling + generic forced-retry + terminal exit - llm_client: switch tool_call_loop from text-based <tool_call> regex to OpenAI-native tools=[...] / structured tool_calls field; accumulate delta.reasoning_content for DeepSeek thinking-mode echo-back; fold preserves system msg and aligns boundary to never orphan role:tool - base_agent: generic forced-retry via mandatory_record_tools class attr (filesystem -> add_phenomenon, timeline -> add_temporal_edge, hypothesis -> add_hypothesis, report -> save_report); count via executor wrapper - terminal_tools class attr + loop short-circuit: when a terminal tool is called, loop exits with its raw return as final_text. ReportAgent declares save_report as terminal - replaces the <answer>-tag stop signal that native tool calling broke - _execute_*: return (raw, formatted) - terminal exit uses untruncated raw, conversation history uses 3000-char-capped formatted - evidence_graph + orchestrator: LLM-derived InvestigationArea support (hypothesis-driven coverage check, replaces hardcoded _AREA_KEYWORDS / _AREA_TOOLS); manual yaml block kept as optional seed - strip <answer> references from agent prompts (no longer load-bearing) Verified on CFReDS image across 4 smoke runs: 0 JSON parse failures (was 3); 22 temporal edges from Phase 4 (was 0); ReportAgent exits via save_report (was max_iterations regression). 78/78 unit tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-13 13:51:19 +08:00
BattleTag	0a2b344c84	fix: share _safe_json_loads with tool-call parser, not just orchestrator Move _safe_json_loads from orchestrator.py to llm_client.py and have _extract_tool_calls use it when parsing <tool_call> JSON blocks from model output. orchestrator now imports it from llm_client. Background: in the first full DeepSeek run (runs/2026-05-12T17-25-38), ~10 'Failed to parse tool call JSON' warnings appeared, all from regex patterns where the LLM wrote \. or \* inside JSON string values: Failed to parse tool call JSON: {..., "pattern": "Outlook Express\|...\|\.dbx"} Failed to parse tool call JSON: {..., "pattern": "ethereal.\.pcap"} Failed to parse tool call JSON: {..., "pattern": "lookatlan.\.txt\|..."} These are exactly the kind of stray-backslash errors stage-1 sanitize already handles for orchestrator JSON calls — but tool-call extraction was using bare json.loads. Result: each failed tool call silently dropped on the floor, the LLM never got a result, and at least one network agent burned 14m26s spinning before hitting max_iterations=40. Now the sanitize/log-on-failure path is shared. Verified against the three failure cases from yesterday's log: all three now parse cleanly. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 20:29:21 +08:00
BattleTag	76df34ed79	docs: add TODO marker for adaptive edge weights Note that the hard-coded HYPOTHESIS_EDGE_WEIGHTS table is a temporary choice; an adaptive scheme should be explored once the full pipeline is stable. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 17:14:23 +08:00
BattleTag	893f5b5de2	fix: address agent boundary / JSON robustness / Phase 4 no-op from CFReDS run Issues found running the system end-to-end on the NIST CFReDS Hacking Case disk image (SCHARDT.001, Mr. Evil). Four interconnected fixes: 1. HypothesisAgent boundary leak (two layers) B.1 Tool set: BaseAgent._register_graph_tools was registering add_phenomenon / add_lead / link_to_entity for every agent. With an empty graph in Phase 2, HypothesisAgent "compensated" by inventing phenomena, dispatching leads, and linking entities. B.2 Prompt leak: BaseAgent's shared system prompt hard-coded "Call investigation tools (list_directory, parse_registry_key, etc.)". HypothesisAgent hallucinated list_directory and wasted 2 LLM rounds on 'unknown tool' errors before backing off. Fix: - Split _register_graph_tools into _register_graph_read_tools + _register_graph_write_tools. - HypothesisAgent, ReportAgent, TimelineAgent override _register_graph_tools to skip write tools. - HypothesisAgent and TimelineAgent override _build_system_prompt with focused, role-specific workflows (no Phase A-D investigation boilerplate). 2. JSON parse failures in Phase 3 lead generation (5/6 hypotheses lost) DeepSeek emits JSON with stray backslashes (Windows path references) and occasional minor syntax slips. Old single-stage sanitize couldn't recover; per-hypothesis fallback silently swallowed each failure. Fix: - _safe_json_loads: progressive — stage 0 as-is, stage 1 escape stray \X (anything not in valid JSON escape set), log raw input on final failure for diagnosis. - New _call_llm_for_json helper: on parse failure, append the error to the prompt and re-call LLM (self-correcting retry, up to 2). - All 4 LLM-JSON callsites in orchestrator refactored to use it. 3. Phase 1 sometimes skipped add_phenomenon (LLM treated <answer> as deliverable) Strengthen BaseAgent's RECORDING REQUIREMENT — explicit "your <answer> is DISCARDED; only graph mutations propagate" plus a new rule: negative findings (searched X, found nothing) MUST also be recorded as phenomena, since they constrain the hypothesis space. 4. Phase 4 Timeline was a no-op TimelineAgent inherited BaseAgent's Phase A-D prompt and never called add_temporal_edge — produced 0 temporal edges. Override the prompt with concrete workflow (build_filesystem_timeline -> get_timestamped_phenomena -> 15-40 add_temporal_edge calls) and restrict tool set to read-only + its 3 temporal tools. Verified end-to-end: HypothesisAgent now 8 tools (no writes), ReportAgent 13 (no graph writes), TimelineAgent 10 (read + temporal + timeline). All 60 unit tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 17:14:16 +08:00
BattleTag	0a966d8476	feat: switch LLM client to OpenAI SDK for DeepSeek compatibility The previous LLMClient used raw httpx + Claude Messages API (/v1/messages, x-api-key, Anthropic SSE event types). Incompatible with DeepSeek. Rewrite LLMClient.__init__/chat/close to use openai.AsyncOpenAI: - /v1/chat/completions endpoint, OpenAI message format - Bearer auth, native SDK error types - Stream chunks via async for + chunk.choices[0].delta.content Tool calling protocol (ReAct text-based tags) and all surrounding helpers (_apply_progressive_decay, _fold_old_messages, _partition_tool_calls, tool_call_loop, etc.) are unchanged — endpoint-agnostic by design. New optional config params surfaced to config.yaml.agent: - reasoning_effort: "high" \| "medium" \| "low" — DeepSeek/o1-style depth - thinking_enabled: bool — DeepSeek extra_body.thinking switch main.py and regenerate_report.py pass these through to LLMClient. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 17:13:54 +08:00