refactor: native tool calling + generic forced-retry + terminal exit

- llm_client: switch tool_call_loop from text-based <tool_call> regex to OpenAI-native tools=[...] / structured tool_calls field; accumulate delta.reasoning_content for DeepSeek thinking-mode echo-back; fold preserves system msg and aligns boundary to never orphan role:tool
- base_agent: generic forced-retry via mandatory_record_tools class attr (filesystem -> add_phenomenon, timeline -> add_temporal_edge, hypothesis -> add_hypothesis, report -> save_report); count via executor wrapper
- terminal_tools class attr + loop short-circuit: when a terminal tool is called, loop exits with its raw return as final_text. ReportAgent declares save_report as terminal; this replaces the <answer>-tag stop signal that native tool calling broke
- _execute_*: return (raw, formatted); terminal exit uses untruncated raw, conversation history uses 3000-char-capped formatted
- evidence_graph + orchestrator: LLM-derived InvestigationArea support (hypothesis-driven coverage check, replaces hardcoded _AREA_KEYWORDS / _AREA_TOOLS); manual yaml block kept as optional seed
- strip <answer> references from agent prompts (no longer load-bearing)

Verified on CFReDS image across 4 smoke runs: 0 JSON parse failures (was 3); 22 temporal edges from Phase 4 (was 0); ReportAgent exits via save_report (was max_iterations regression). 78/78 unit tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 13:51:19 +08:00
parent 0a2b344c84
commit 444d58726a
9 changed files with 1356 additions and 298 deletions
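
For orientation, here is a minimal sketch of how the three mechanisms named in the commit message could fit together: the mandatory_record_tools forced retry, the terminal_tools short-circuit, and the (raw, formatted) return pair. It is not the repository's code. SketchAgent, ReportSketch, the scripted `turns` input (standing in for model responses), the nudge text, and the retry limit are all assumptions made for illustration; only the two class attribute names, the save_report tool name, and the 3000-character cap come from the commit message.

```python
from typing import Any, Callable, Dict, List, Optional, Tuple

MAX_FORMATTED_CHARS = 3000  # history cap stated in the commit message

ToolCall = Tuple[str, Dict[str, Any]]


class SketchAgent:
    """Hypothetical stand-in for BaseAgent; not the repository's implementation."""

    mandatory_record_tools: Tuple[str, ...] = ()
    terminal_tools: Tuple[str, ...] = ()

    def __init__(self, tools: Dict[str, Callable[..., str]]) -> None:
        self.tools = tools
        self.history: List[str] = []
        self.record_calls = 0
        self.forced_retries = 0

    def _execute(self, name: str, args: Dict[str, Any]) -> Tuple[str, str]:
        # Return (raw, formatted): raw is untruncated, formatted is capped.
        raw = str(self.tools[name](**args))
        if name in self.mandatory_record_tools:
            self.record_calls += 1  # counted in the executor wrapper
        return raw, raw[:MAX_FORMATTED_CHARS]

    def run(self, turns: List[List[ToolCall]],
            max_forced_retries: int = 1) -> Optional[str]:
        # Each entry of `turns` is one model turn; [] means "no tool calls".
        for calls in turns:
            if not calls:
                missing = bool(self.mandatory_record_tools) and self.record_calls == 0
                if missing and self.forced_retries < max_forced_retries:
                    # Forced retry: nudge the model instead of accepting the stop.
                    self.forced_retries += 1
                    self.history.append(
                        "You must call one of: "
                        + ", ".join(self.mandatory_record_tools)
                    )
                    continue
                return None
            for name, args in calls:
                raw, formatted = self._execute(name, args)
                self.history.append(formatted)  # history sees the capped text
                if name in self.terminal_tools:
                    return raw  # terminal exit uses the untruncated return
        return None


class ReportSketch(SketchAgent):
    mandatory_record_tools = ("save_report",)
    terminal_tools = ("save_report",)
```

In this sketch, a ReportSketch that stops without calling save_report gets one nudge, and a later save_report call ends the loop immediately with the full report text while the history keeps only the capped copy.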
--- a/agents/hypothesis.py
+++ b/agents/hypothesis.py
@@ -24,6 +24,7 @@ class HypothesisAgent(BaseAgent):
        "and formulate investigative hypotheses about what happened on this system. "
        "Your ultimate goal: build the most complete picture of events that occurred."
    )
+    mandatory_record_tools = ("add_hypothesis",)

    def __init__(self, llm: LLMClient, graph: EvidenceGraph) -> None:
        super().__init__(llm, graph)
@@ -68,7 +69,7 @@ class HypothesisAgent(BaseAgent):
            f"WORKFLOW:\n"
            f"1. Call list_phenomena and search_graph to review existing findings.\n"
            f"2. For each hypothesis you want to record, call add_hypothesis (title + description).\n"
-            f"3. Wrap a short summary in <answer> when you have generated 3-7 hypotheses.\n\n"
+            f"3. STOP after you have generated 3-7 hypotheses. Do not call any more tools.\n\n"
            f"STRICT BOUNDARIES:\n"
            f"- Your only mutation tool is add_hypothesis. Do NOT attempt list_directory, "
            f"parse_registry_key, extract_file, or any disk-image investigation tools — "