refactor: native tool calling + generic forced-retry + terminal exit

- llm_client: switch tool_call_loop from text-based <tool_call> regex
  to OpenAI-native tools=[...] / structured tool_calls field; accumulate
  delta.reasoning_content for DeepSeek thinking-mode echo-back; fold
  preserves system msg and aligns boundary to never orphan role:tool
- base_agent: generic forced-retry via mandatory_record_tools class attr
  (filesystem -> add_phenomenon, timeline -> add_temporal_edge,
  hypothesis -> add_hypothesis, report -> save_report); count via
  executor wrapper
- terminal_tools class attr + loop short-circuit: when a terminal tool
  is called, loop exits with its raw return as final_text. ReportAgent
  declares save_report as terminal - replaces the <answer>-tag stop
  signal that native tool calling broke
- _execute_*: return (raw, formatted) - terminal exit uses untruncated
  raw, conversation history uses 3000-char-capped formatted
- evidence_graph + orchestrator: LLM-derived InvestigationArea support
  (hypothesis-driven coverage check, replaces hardcoded _AREA_KEYWORDS /
  _AREA_TOOLS); manual yaml block kept as optional seed
- strip <answer> references from agent prompts (no longer load-bearing)

Verified on CFReDS image across 4 smoke runs: 0 JSON parse failures
(was 3); 22 temporal edges from Phase 4 (was 0); ReportAgent exits via
save_report (was max_iterations regression). 78/78 unit tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
BattleTag
2026-05-13 13:51:19 +08:00
parent 0a2b344c84
commit 444d58726a
9 changed files with 1356 additions and 298 deletions


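@@ -67,14 +67,18 @@ Phenomenon → Hypothesis edge types and weights are hardcoded in `HYPOTHESIS_EDGE_WEIGHT
| **Phase 4** | TimelineAgent uses `build_filesystem_timeline` to generate a MAC timeline and correlates it with Phenomenon timestamps |
| **Phase 5** | ReportAgent synthesizes hypotheses, evidence, and entities into a Markdown report |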
### Gap Analysis (end of Phase 3)
### Investigation Areas (hypothesis-derived)
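`config.yaml:investigation_areas` lists the investigation areas that must be covered (system info, user accounts, network configuration, mail configuration, IRC logs, PCAP, deleted files, Prefetch, etc.). The Orchestrator determines coverage in two tiers:
At the end of Phase 2, the orchestrator calls the LLM once to derive 5-12 **InvestigationArea** entries from all active hypotheses (snake_case slug, description, suggested_agent, expected_keywords, expected_tools, priority, motivating_hypothesis_ids). Areas are stored in `graph.investigation_areas` and serialized to `runs/<ts>/investigation_areas.json`. They serve two purposes: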
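1. **Keyword match** (`_AREA_KEYWORDS`): scan the titles/descriptions of existing Phenomena
2. **Tool hit** (`_AREA_TOOLS`): check whether the key tools for the area were called (e.g. `enumerate_users`, `parse_pcap_strings`)
1. **Phase 3 main-loop prompt**: each hypothesis block is annotated with `Expected areas: a, b, c`; the LLM still picks leads freely but gets soft guidance
2. **Gap Analysis at the end of Phase 3**: coverage is determined in two tiers:
   - **Keyword match**: scan Phenomenon titles/descriptions against area.expected_keywords
   - **Tool hit**: check whether area.expected_tools were actually called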
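Uncovered areas automatically get a lead dispatched, with at most 3 gap-filling rounds.
Uncovered areas automatically get a lead dispatched (using `suggested_agent` and `priority`, with `motivating_hypothesis_ids[0]` passed through to `Lead.hypothesis_id` to preserve provenance), with at most 3 gap-filling rounds.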
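
The two-tier coverage check and lead dispatch described above could be sketched roughly as follows. This is only an illustration: the field names come from the README, but the class layouts, the `is_covered` and `leads_for_gaps` helpers, and the choice to treat a hit in either tier as sufficient are assumptions, not the project's actual code.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class InvestigationArea:
    # Field names follow the README; the dataclass layout is assumed.
    slug: str
    suggested_agent: str
    expected_keywords: List[str] = field(default_factory=list)
    expected_tools: List[str] = field(default_factory=list)
    priority: int = 3
    motivating_hypothesis_ids: List[str] = field(default_factory=list)


@dataclass
class Lead:
    # Hypothetical minimal Lead; only hypothesis_id is named in the README.
    area: str
    agent: str
    priority: int
    hypothesis_id: Optional[str]


def is_covered(area: InvestigationArea, phenomena_texts: List[str],
               called_tools: Set[str]) -> bool:
    """Tier 1: keyword match on Phenomenon text. Tier 2: tool hit.

    Assumption: a hit in either tier counts as coverage.
    """
    haystack = " ".join(phenomena_texts).lower()
    if any(kw.lower() in haystack for kw in area.expected_keywords):
        return True
    return any(tool in called_tools for tool in area.expected_tools)


def leads_for_gaps(areas: List[InvestigationArea], phenomena_texts: List[str],
                   called_tools: Set[str]) -> List[Lead]:
    """Build one lead per uncovered area, carrying provenance along."""
    leads = []
    for area in areas:
        if is_covered(area, phenomena_texts, called_tools):
            continue
        ids = area.motivating_hypothesis_ids
        leads.append(Lead(
            area=area.slug,
            agent=area.suggested_agent,
            priority=area.priority,
            hypothesis_id=ids[0] if ids else None,  # first motivating hypothesis
        ))
    return leads
```

In this sketch the caller would rerun `leads_for_gaps` after each gap-filling round and stop after three rounds or once the returned list is empty.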
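**Manual override**: `config.yaml:investigation_areas` is commented out by default, so areas are purely LLM-derived. Uncomment it to add areas that must always be checked; these are written before the LLM's areas and are protected from being overwritten by slug-based dedupe (the LLM only augments the keyword/tool lists). This is the key to adapting across cases and platforms: Windows-specific areas are no longer hardcoded.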
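
One way to picture the slug-based dedupe is the minimal sketch below. It assumes areas are plain dicts using the keys from the `config.yaml` example (`area`, `agent`, `priority`, `keywords`, `tools`); the `merge_areas` helper is hypothetical, and the real merge logic may differ.

```python
from typing import Dict, List


def merge_areas(manual: List[dict], derived: List[dict]) -> Dict[str, dict]:
    """Merge manual seed areas with LLM-derived ones, keyed by slug.

    Manual entries are written first. An LLM-derived entry with the same
    slug never replaces them; it only appends new keywords/tools.
    """
    merged: Dict[str, dict] = {}

    def copy_area(area: dict) -> dict:
        # Copy the lists so the inputs are never mutated.
        out = dict(area)
        out["keywords"] = list(area.get("keywords", []))
        out["tools"] = list(area.get("tools", []))
        return out

    for area in manual:
        merged[area["area"]] = copy_area(area)

    for area in derived:
        slug = area["area"]
        if slug not in merged:
            merged[slug] = copy_area(area)
            continue
        existing = merged[slug]
        for key in ("keywords", "tools"):
            for item in area.get(key, []):
                if item not in existing[key]:
                    existing[key].append(item)  # augment only
    return merged
```

With a manual `shutdown_time` entry assigned to the `registry` agent, an LLM-derived `shutdown_time` that proposes a different agent would leave the agent and priority untouched and contribute only its extra keywords and tools.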
## Agent Architecture
@@ -183,11 +187,12 @@ max_investigation_rounds: 5 # maximum number of Phase 3 iteration rounds
# - title: "The suspect actively performed network sniffing"
# description: "..."
investigation_areas:  # areas that Gap Analysis must cover
  - area: system_info
    agent: registry
    task: "..."
  # ...
# investigation_areas:        # optional: manual override (default: fully LLM-derived)
#   - area: shutdown_time     # via slug dedupe, the LLM only augments the
#     agent: registry         # keyword/tool lists and never overwrites manual entries
#     priority: 3
#     keywords: [shutdown]
#     tools: [get_shutdown_time]
```
When `hypotheses` is not configured, HypothesisAgent generates them automatically.