docs(strategist) S8/9: DESIGN.md updates + DESIGN_STRATEGIST.md spec
DESIGN_STRATEGIST.md §11. The strategist refit is the first sub-design
big enough to need its own document, so it lives as a sibling to
DESIGN.md rather than inline.
DESIGN_STRATEGIST.md (new, 543 lines) covers:
  §0   Scope, non-goals, invariants preserved
  §1   Data model (Lead extension, InvestigationRound)
  §2   Six tools (graph_overview / source_coverage / marginal_yield /
       budget_status / propose_lead / declare_investigation_complete)
       with full input_schema
  §3   InvestigationStrategist agent class
  §4   Orchestrator Phase 3 loop pseudocode
  §5   Persistence + resume strategy
  §6   config schema
  §7   Test plan (8 scenarios)
  §8   9-step build order (matches commit history)
  §9   Risks + mitigations
  §10  Open questions
  §11  Required DESIGN.md updates (applied here)
  §12  What this design does NOT solve (exam-test coverage,
       vision-capable LLM, blockchain explorer, etc.)

DESIGN.md updates per §11:
  §4.5  Note harmonic damping is now landed
  §4.9  Phase 3 table row now points at the strategist loop +
        inline summary
  §5    Lead + InvestigationRound rows added to the data-model
        summary table
This commit closes the strategist refit. All 174 tests pass / 1 skipped.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
543
DESIGN_STRATEGIST.md
Normal file
@@ -0,0 +1,543 @@
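# Strategist Loop: Belief-Driven Rework of Phase 3

> This is a supplementary design document to DESIGN.md, covering the concrete rewrite of the §4.9 orchestrator Phase 3.
>
> **Trigger**: the first full 6-source live run on 2026-05-20 (`runs/2026-05-20T20-15-04/`)
> showed that Phase 3 does not work: none of the 8 pending leads was dispatched, because
> log-odds inflation made every hypothesis converge immediately. Even after the "harmonic damping" fix
> to the log-odds math (committed in `evidence_graph.py:update_hypothesis_confidence`),
> Phase 3 under the current architecture is still a mechanical "single-round trigger, rule-based convergence" flow,
> and the LLM has no say at the scheduling layer. This design turns Phase 3 into an LLM-driven exploration loop.
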
---

## 0. Scope

### What this does

Rework `orchestrator.py:Phase 3` from "single-round, rule-triggered" into "strategist loop, belief-driven":
add one `InvestigationStrategist` agent, 4 decision-view tools, and 2 decision-action tools,
and rewrite the orchestrator loop.

### What this does not do

- No change to Phase 1 (per-source triage stays as is)
- No change to Phase 2 (HypothesisAgent is untouched; the strategist may **call** it but does not replace it)
- No change to Phase 4/5 (timeline / report)
- No expert-grade per-source checklists (only a **soft-hint** list inside the `source_coverage` tool)
- No new graph node types; leads reuse the existing structure

### Invariants preserved

- The DESIGN.md §4.3 grounding gateway; all writes still go through it
- DESIGN.md §4.5 log-odds + harmonic damping
- The DESIGN.md §4.4 boundary between verified_facts and interpretation
- Disconnect recovery (`graph_state.json` stays serialization-compatible)

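### Design principles

1. **"The LLM proposes, code adjudicates" moves up to the scheduling layer**: DESIGN.md's first principle is
   currently honored only at the fact layer (grounding). At the scheduling layer, "whether to dig deeper,
   where to dig, and when to stop" is currently a hard-coded decision.
   This design gives the LLM authority over scheduling decisions.
2. **Exam capability exists but is not binding**: the toolset and the soft-hint list cover the artefact
   categories that exam-style scenarios need. Whether to examine a given artefact, and how deeply, is decided
   by the strategist based on the nature of the case, not forced by a predefined checklist.
3. **Explainable and auditable**: each round's strategist decision, motivation, and resulting yield is recorded
   in a persisted `InvestigationRound` and can be reviewed afterwards.
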
---

## 1. Data model changes

### 1.1 `Lead` gains 4 fields

`evidence_graph.py:Lead` currently has `(id, title, description, target_agent, source_id, status, …)`.
New fields:

```python
from dataclasses import dataclass


@dataclass
class Lead:
    # ... existing fields
    proposed_by: str = ""             # "strategist" | "filesystem" | ...: the proposing agent
    motivating_hypothesis: str = ""   # hyp-id this lead is meant to corroborate/refute
    expected_evidence_type: str = ""  # one of edge_types: the edge type the lead should produce
    round_number: int = 0             # strategist round that produced this lead
```

`motivating_hypothesis` is the key field: it ties a lead explicitly to a hypothesis, so that afterwards we can
compute "did running this lead actually change the hypothesis state", which is the strategist's marginal-yield metric.

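As a rough illustration of that metric, the sketch below checks whether a lead's motivating hypothesis changed status across a round. It assumes the statuses are available as plain `hyp_id → status` dicts taken before and after the lead ran; the function names and the dict-shaped lead are hypothetical and not part of the existing codebase.

```python
def status_flips(before: dict[str, str], after: dict[str, str]) -> dict[str, tuple[str, str]]:
    """Return {hyp_id: (old_status, new_status)} for hypotheses whose status changed."""
    flips = {}
    for hyp_id, new_status in after.items():
        old_status = before.get(hyp_id)
        # Hypotheses created during the round have no "before" status and are not flips.
        if old_status is not None and old_status != new_status:
            flips[hyp_id] = (old_status, new_status)
    return flips


def lead_moved_its_hypothesis(lead: dict, before: dict[str, str], after: dict[str, str]) -> bool:
    """True if the hypothesis named in lead['motivating_hypothesis'] changed status."""
    return lead["motivating_hypothesis"] in status_flips(before, after)


before = {"hyp-7fa9b13e": "active", "hyp-daa7c704": "supported"}
after = {"hyp-7fa9b13e": "supported", "hyp-daa7c704": "supported"}
lead = {"id": "lead-aaa", "motivating_hypothesis": "hyp-7fa9b13e"}
print(lead_moved_its_hypothesis(lead, before, after))  # True
```

A status flip is a coarse signal: a lead can move the log-odds noticeably without crossing a status threshold, so a real metric would probably also look at the confidence delta.
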
### 1.2 New `InvestigationRound` record

Records each round's strategist decision itself, because provenance must be auditable too:

```python
from dataclasses import dataclass, field


@dataclass
class InvestigationRound:
    id: str                        # "round-001"
    round_number: int
    started_at: str
    completed_at: str = ""
    strategist_action: str = ""    # "propose_leads" | "declare_complete"
    leads_proposed: list[str] = field(default_factory=list)
    leads_executed: list[str] = field(default_factory=list)
    hypothesis_status_snapshot_before: dict = field(default_factory=dict)  # hyp_id → status
    hypothesis_status_snapshot_after: dict = field(default_factory=dict)
    new_phenomena_count: int = 0
    new_edges_count: int = 0
    decision_rationale: str = ""   # the strategist's own account
```

Serialized together with the graph (added to `to_dict` / `from_dict`).

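A minimal sketch of that round trip, assuming the rounds are stored as a list of plain dicts under an `investigation_rounds` key. The helper names `rounds_to_state` / `rounds_from_state` are hypothetical, and the dataclass is trimmed to a few fields for brevity. Reading the key with a default keeps state files written before this change loadable.

```python
import json
from dataclasses import asdict, dataclass, field, fields


@dataclass
class InvestigationRound:  # trimmed copy of the dataclass above
    id: str
    round_number: int
    started_at: str
    completed_at: str = ""
    leads_proposed: list[str] = field(default_factory=list)
    hypothesis_status_snapshot_before: dict = field(default_factory=dict)


def rounds_to_state(rounds: list[InvestigationRound]) -> list[dict]:
    """Convert rounds to JSON-serializable dicts."""
    return [asdict(r) for r in rounds]


def rounds_from_state(state: dict) -> list[InvestigationRound]:
    """Rebuild rounds from a loaded state dict; a missing key yields an empty list."""
    known = {f.name for f in fields(InvestigationRound)}
    return [
        # Ignore unknown keys so newer state files do not crash older code.
        InvestigationRound(**{k: v for k, v in raw.items() if k in known})
        for raw in state.get("investigation_rounds", [])
    ]


original = [InvestigationRound("round-001", 1, "2026-05-20T20:15:04", leads_proposed=["lead-aaa"])]
state = json.loads(json.dumps({"investigation_rounds": rounds_to_state(original)}))
restored = rounds_from_state(state)
```
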
---

## 2. New tools

These live in a new file, `tools/strategy.py`, and are registered following the existing `TOOL_CATALOG` pattern.

### 2.1 `graph_overview()`: global situation (read-only)

**Signature**: `graph_overview() -> str`

**Output** (markdown, which an LLM reads more easily than JSON):

```markdown
# Investigation State

## Hypotheses (8)
| id | title | L | conf | status | edges_in | distinct_sources | flipped_in_last_2_rounds |
|----|-------|---|------|--------|----------|------------------|---------------------------|
| hyp-83db8748 | Multi-Device Composite | +8.75 | 0.99 | supported | 23 | 1 | no |
| hyp-daa7c704 | Multiple Identity Aliases | +9.21 | 0.99 | supported | 11 | 3 | no |
| hyp-7fa9b13e | Sunny.zip contains timer_a | +2.08 | 0.99 | supported | 4 | 1 | yes (active→supported in R2) |
| ...

## Sources (6)
| id | type | phenomena | identities | last_touched_in_round |
|----|------|-----------|------------|-----------------------|
| src-usb-leung | disk_image | 8 | 1 | R1 |
| ...

## Pending Leads (3)
| id | from | targeting | for_hypothesis | reason |
|----|------|-----------|----------------|--------|
| lead-aaa | filesystem | src-ios-chan/Safari | hyp-83db8748 | Safari history likely contains device-switching evidence |
```

**Key annotation**: the `distinct_sources` column exposes "this hypothesis is supported by only one source".
A strategist that sees all 23 edges coming from the android source is expected to conclude on its own that
"independent evidence from elsewhere is needed".

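How `edges_in` and `distinct_sources` might be derived can be sketched as follows. The input shape, an iterable of `(hypothesis_id, source_id)` pairs with one pair per supporting edge, is an assumption for illustration; the real evidence graph likely stores edges differently, and both function names are hypothetical.

```python
from collections import defaultdict


def edge_source_stats(edges):
    """Count incoming edges and distinct contributing sources per hypothesis."""
    counts = defaultdict(int)
    sources = defaultdict(set)
    for hyp_id, source_id in edges:
        counts[hyp_id] += 1
        sources[hyp_id].add(source_id)
    return {
        hyp_id: {"edges_in": counts[hyp_id], "distinct_sources": len(sources[hyp_id])}
        for hyp_id in counts
    }


def single_source_hypotheses(stats, min_edges=2):
    """Hypotheses with several edges that all come from one source."""
    return sorted(
        hyp_id
        for hyp_id, s in stats.items()
        if s["distinct_sources"] == 1 and s["edges_in"] >= min_edges
    )


edges = [
    ("hyp-83db8748", "src-android"), ("hyp-83db8748", "src-android"),
    ("hyp-83db8748", "src-android"),
    ("hyp-daa7c704", "src-android"), ("hyp-daa7c704", "src-ios-chan"),
]
stats = edge_source_stats(edges)
print(single_source_hypotheses(stats))  # ['hyp-83db8748']
```
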
### 2.2 `source_coverage(source_id: str)`: per-source coverage (read-only)

**Signature**: `source_coverage(source_id: str) -> str`

**Implementation**: scan `graph.tool_invocations`, keep the entries whose `source_id` matches the given source,
and group them by tool name + main args. Then compare against `EXPECTED_ARTEFACTS[source_type]` and mark every
item that was never touched with ✗.

```python
# tools/strategy.py
EXPECTED_ARTEFACTS: dict[str, list[dict]] = {
    "disk_image+windows": [
        {"name": "filesystem layout", "detector": "fls|mmls", "value_for": "deleted files, hidden partitions"},
        {"name": "registry hives", "detector": "parse_registry_key", "value_for": "user activity, installed software"},
        {"name": "browser history", "detector": "list_directory@AppData/.../History", "value_for": "URL access, downloads"},
        {"name": "prefetch", "detector": "extract_file@Windows/Prefetch", "value_for": "program execution evidence"},
        # ...
    ],
    "mobile_extraction": [
        {"name": "AddressBook", "detector": "sqlite_query@AddressBook.sqlitedb", "value_for": "contacts"},
        {"name": "SMS messages", "detector": "sqlite_query@sms.db", "value_for": "messaging content"},
        {"name": "WhatsApp messages", "detector": "sqlite_query@ChatStorage.sqlite", "value_for": "WhatsApp content"},
        {"name": "Call history", "detector": "sqlite_query@CallHistoryDB", "value_for": "call records"},
        {"name": "Safari history", "detector": "sqlite_query@History.db|read_text@Bookmarks.plist", "value_for": "web browsing"},
        {"name": "Photos library", "detector": "sqlite_query@Photos.sqlite", "value_for": "photo metadata, EXIF, geolocation"},
        {"name": "iCloud accounts", "detector": "parse_plist@Accounts3.sqlite|parse_keychain", "value_for": "Apple ID, services"},
        {"name": "App inventory", "detector": "list_directory@var/containers/Bundle/Application", "value_for": "installed apps"},
    ],
    "disk_image+android": [...],  # entries elided
    "media_collection": [
        {"name": "OCR text", "detector": "ocr_image", "value_for": "screenshot text"},
        {"name": "EXIF metadata", "detector": "exif_image", "value_for": "device, timestamps, geolocation"},
    ],
}
```

**Soft-hint semantics**: the output must always end with this sentence:

> Coverage hints are heuristics, not requirements. Skip an item if the case theory
> makes it irrelevant. Investigate ✗ items only when they could materially affect
> an active hypothesis.

This sentence is **the key to "exam capability exists but is not binding"**: on seeing a ✗ the LLM does not
dispatch blindly. It first looks at the hypothesis list and asks "does this artefact matter to any current hypothesis?"

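The comparison step could look roughly like the sketch below. It assumes a detector grammar that the document only implies: `|` separates alternatives and `@` separates a tool name from a path fragment that must appear in the invocation's target. The invocation dict shape (`tool`, `target`) and both function names are hypothetical.

```python
def detector_matches(detector: str, tool: str, target: str = "") -> bool:
    """True if any '|'-separated alternative matches this tool call."""
    for alternative in detector.split("|"):
        name, _, fragment = alternative.partition("@")
        # An alternative without '@' matches on tool name alone.
        if name == tool and (not fragment or fragment in target):
            return True
    return False


def coverage_lines(expected: list[dict], invocations: list[dict]) -> list[str]:
    """One line per expected artefact, marked covered or not yet touched."""
    lines = []
    for item in expected:
        hit = any(
            detector_matches(item["detector"], inv["tool"], inv.get("target", ""))
            for inv in invocations
        )
        mark = "✓" if hit else "✗"
        lines.append(f"{mark} {item['name']} ({item['value_for']})")
    return lines


expected = [
    {"name": "SMS messages", "detector": "sqlite_query@sms.db", "value_for": "messaging content"},
    {"name": "Safari history", "detector": "sqlite_query@History.db|read_text@Bookmarks.plist", "value_for": "web browsing"},
]
invocations = [{"tool": "sqlite_query", "target": "Library/SMS/sms.db"}]
report = coverage_lines(expected, invocations)
```

Plain substring matching would not handle a detector such as `list_directory@AppData/.../History`, where the path is elided; a real matcher would need glob or regex support for that case.
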
### 2.3 `marginal_yield(last_n_rounds: int = 2)`: marginal yield (read-only)

**Signature**: `marginal_yield(last_n_rounds: int = 2) -> str`

**Implementation**: scan the most recent N `InvestigationRound` records and count, per round:
- new phenomena
- new P→H edges
- hypothesis status flips (active→supported, or the reverse)

**Output**:

```markdown
# Marginal Yield (last 2 rounds)

| round | new_phenomena | new_edges | status_flips |
|-------|---------------|-----------|--------------|
| R3 | 5 | 7 | 1 |
| R4 | 2 | 1 | 0 |

Trend: decelerating (R4 yield 33% of R3).
Interpretation aid: yield trending to zero suggests diminishing
returns; consider declare_investigation_complete after one more probe.
```

The closing lines are LLM-friendly heuristic prose, not a mandatory signal.

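One way the trend line could be produced is sketched below. The document does not say how the three counters combine into a single yield figure, so summing them with equal weight is purely an assumption, as are the function names and the dict-shaped round records.

```python
def round_yield(r: dict) -> int:
    """Assumed yield: unweighted sum of the three per-round counters."""
    return r["new_phenomena"] + r["new_edges"] + r["status_flips"]


def yield_trend(rounds: list[dict]) -> str:
    """Compare the latest round's yield with the one before it."""
    if len(rounds) < 2:
        return "insufficient data"
    previous, latest = round_yield(rounds[-2]), round_yield(rounds[-1])
    if previous == 0:
        return "flat at zero" if latest == 0 else "accelerating"
    ratio = latest / previous
    label = "decelerating" if ratio < 1 else "accelerating" if ratio > 1 else "steady"
    return f"{label} (latest round at {ratio:.0%} of previous)"


rounds = [
    {"new_phenomena": 6, "new_edges": 8, "status_flips": 2},  # yield 16
    {"new_phenomena": 3, "new_edges": 1, "status_flips": 0},  # yield 4
]
print(yield_trend(rounds))  # decelerating (latest round at 25% of previous)
```
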
### 2.4 `budget_status()`: budget view (read-only)

**Signature**: `budget_status() -> str`

```markdown
# Budget Status

| metric | used | cap | pct |
|--------|------|-----|-----|
| tool_calls | 1248 | 5000 | 25% |
| strategist_rounds | 3 | 10 | 30% |
| wall_clock_minutes | 142 | 360 | 39% |

Phase 1 used 89% of allocated. Phase 2 used 4%. Phase 3 (strategist) so far: 7%.
```

Budgets are read from config.yaml; the new fields are listed in §6. With no budget configured, the run enters
unbounded mode (it relies only on the strategist declaring completion itself, plus a hard safety cap).

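The table rows and the unbounded fallback can be sketched as below. This assumes usage and caps arrive as flat dicts keyed by metric name, with a missing or zero cap meaning "unbounded"; the `safety_cap_rounds` default of 50 is an invented placeholder, since the document does not give a value for the hard safety cap.

```python
def budget_rows(used: dict, caps: dict) -> list[tuple]:
    """Build (metric, used, cap, pct) rows; metrics without a cap are unbounded."""
    rows = []
    for metric, value in used.items():
        cap = caps.get(metric)
        if not cap:  # None or 0: no limit configured
            rows.append((metric, value, "unbounded", "n/a"))
        else:
            rows.append((metric, value, str(cap), f"{value / cap:.0%}"))
    return rows


def budget_exceeded(used: dict, caps: dict, safety_cap_rounds: int = 50) -> bool:
    """Hard stop: any configured cap reached, or the safety cap on rounds."""
    if used.get("strategist_rounds", 0) >= safety_cap_rounds:
        return True
    return any(cap and used.get(metric, 0) >= cap for metric, cap in caps.items())


used = {"tool_calls": 1248, "strategist_rounds": 3, "wall_clock_minutes": 142}
caps = {"tool_calls": 5000, "strategist_rounds": 10, "wall_clock_minutes": 360}
for row in budget_rows(used, caps):
    print("| " + " | ".join(str(cell) for cell in row) + " |")
```
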
### 2.5 Decision-action tools (write)

Registered in the strategist's `mandatory_record_tools`. The strategist must call at least one of them every
round; otherwise forced-retry fires (reusing the existing mechanism).

**`propose_lead(...)`**:

```python
{
    "name": "propose_lead",
    "input_schema": {
        "type": "object",
        "required": [
            "description", "target_agent",
            "motivating_hypothesis", "expected_evidence_type",
        ],
        "properties": {
            "description": {
                "type": "string",
                "description": "1-2 sentence specific investigation request, including target source/artefact",
            },
            "target_agent": {
                "type": "string",
                "enum": ["filesystem","registry","communication","network","ios_artifact","android_artifact","media"],
            },
            "source_id": {"type": "string", "description": "which source to investigate"},
            "motivating_hypothesis": {
                "type": "string",
                "description": "hyp-id this lead is meant to corroborate or refute",
            },
            "expected_evidence_type": {
                "type": "string",
                "enum": ["direct_evidence","supports","contradicts","weakens","prerequisite_met","consequence_observed"],
            },
            "rationale": {"type": "string", "description": "why this fills a real gap"},
        }
    }
}
```

**`declare_investigation_complete(...)`**:

```python
{
    "name": "declare_investigation_complete",
    "input_schema": {
        "type": "object",
        "required": ["reason"],
        "properties": {
            "reason": {
                "type": "string",
                "enum": [
                    "marginal_yield_zero",
                    "budget_exhausted",
                    "all_hypotheses_resolved",
                    "coverage_saturated",
                    "other",
                ],
            },
            "rationale": {"type": "string"},
        }
    }
}
```

Terminal tool: calling it ends the loop (reusing the existing `terminal_tools` mechanism).

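For unit tests it can be handy to check a payload against these schemas without a full JSON Schema library. The sketch below is a hand-rolled checker covering only the keywords used above (`required`, string `type`, `enum`); it is an illustration, not how the existing tool framework validates input, and `validate_tool_input` is a hypothetical name.

```python
def validate_tool_input(schema: dict, payload: dict) -> list[str]:
    """Return a list of problems; an empty list means the payload is acceptable."""
    errors = []
    properties = schema.get("properties", {})
    for key in schema.get("required", []):
        if key not in payload:
            errors.append(f"missing required field: {key}")
    for key, value in payload.items():
        spec = properties.get(key)
        if spec is None:
            continue  # JSON Schema allows additional properties by default
        if spec.get("type") == "string" and not isinstance(value, str):
            errors.append(f"{key}: expected string")
        elif "enum" in spec and value not in spec["enum"]:
            errors.append(f"{key}: {value!r} not in {spec['enum']}")
    return errors


# Copy of the declare_investigation_complete input_schema shown above.
complete_schema = {
    "type": "object",
    "required": ["reason"],
    "properties": {
        "reason": {
            "type": "string",
            "enum": ["marginal_yield_zero", "budget_exhausted",
                     "all_hypotheses_resolved", "coverage_saturated", "other"],
        },
        "rationale": {"type": "string"},
    },
}

print(validate_tool_input(complete_schema, {"reason": "budget_exhausted"}))  # []
print(validate_tool_input(complete_schema, {}))  # ['missing required field: reason']
```
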
---

## 3. `InvestigationStrategist` agent

New file `agents/strategist.py`, about 150 lines.

```python
# BaseAgent and TOOL_CATALOG come from the existing codebase.
class InvestigationStrategist(BaseAgent):
    name = "strategist"
    role = (
        "You are the investigation strategist. You do not run forensic tools yourself. "
        "Your job is to read the current evidence graph and decide ONE of:\n"
        " (a) propose 1-3 new investigation leads that would materially affect an active hypothesis, or\n"
        " (b) declare the investigation complete.\n"
        "\n"
        "Use graph_overview / source_coverage / marginal_yield / budget_status to ground your judgment. "
        "DO NOT propose a lead that just adds more same-direction evidence to an already-supported hypothesis "
        "(harmonic damping makes it ~useless). DO propose leads when:\n"
        " - A hypothesis is supported by edges from only ONE source: get cross-source corroboration.\n"
        " - A hypothesis is in the active band (0.2 < conf < 0.8): it needs the deciding evidence.\n"
        " - A specific high-value artefact is uncovered on a source where the active hypotheses suggest it matters.\n"
        "\n"
        "Declare complete when marginal_yield is approaching zero AND no remaining active hypotheses have "
        "obvious investigation paths."
    )

    mandatory_record_tools = ("propose_lead", "declare_investigation_complete")
    terminal_tools = ("declare_investigation_complete",)

    def _register_graph_tools(self):
        # Read-only tools: the strategist NEVER writes phenomena/edges directly.
        # All graph writes happen via the workers it dispatches.
        self._register_graph_read_tools()
        # No graph_write_tools.
        # Add strategy-specific tools:
        for tool_name in (
            "graph_overview", "source_coverage", "marginal_yield", "budget_status",
            "propose_lead", "declare_investigation_complete",
        ):
            td = TOOL_CATALOG[tool_name]
            self.register_tool(td.name, td.description, td.input_schema, td.executor)
```

Registered in `agent_factory._AGENT_CLASSES["strategist"]`.

---

## 4. Orchestrator changes

### 4.1 Delete/replace: the current Phase 3

The current `orchestrator.py:Phase 3` logic (about 150 lines) is: check leads → dispatch workers →
check converged → exit. **Delete it.**

### 4.2 New Phase 3: strategist loop

```python
async def _phase3_strategist_loop(self, run_dir: Path) -> None:
    """Belief-driven investigation: strategist proposes, workers execute, repeat."""
    _log("Phase 3: Strategist-Driven Investigation", event="phase")

    strategist = self.factory.get_or_create_agent("strategist")
    # The round cap is the `strategist.max_rounds` key defined in §6.
    max_rounds = self.config.get("strategist", {}).get("max_rounds", 10)

    for round_num in range(1, max_rounds + 1):
        # 1. Record round start + snapshot
        rid = await self.graph.start_investigation_round(round_num)

        # 2. Strategist run
        _log(f"Strategist Round {round_num}", event="phase")
        await strategist.run(
            "Review the graph and decide the next investigation action. "
            f"This is round {round_num}/{max_rounds}. Budget used so far: see budget_status."
        )

        # 3. Did strategist declare complete?
        if self.graph.is_round_terminal(rid):
            _log(f"Strategist declared complete at round {round_num}", event="progress")
            # Close the round so that resume (§5) does not treat it as failed.
            await self.graph.complete_investigation_round(rid)
            break

        # 4. Collect new leads proposed this round
        new_leads = self.graph.leads_from_round(round_num)
        if not new_leads:
            _log(f"No leads proposed in round {round_num}; stopping", event="progress")
            await self.graph.complete_investigation_round(rid)
            break

        # 5. Dispatch each lead
        for lead in new_leads:
            await self._execute_lead(lead, round_num)

        # 6. Close round + record yield
        await self.graph.complete_investigation_round(rid)

        # 7. Hard budget check
        if self._budget_exceeded():
            _log(f"Budget exhausted at round {round_num}", event="progress")
            break
```

### 4.3 `_execute_lead` reuses the existing worker dispatch logic

```python
async def _execute_lead(self, lead: Lead, round_num: int) -> None:
    agent_type = AGENT_ALIASES.get(lead.target_agent, lead.target_agent)
    worker = self.factory.get_or_create_agent(agent_type)
    if worker is None:
        logger.warning(f"No worker for lead {lead.id}: {agent_type}")
        return

    # Point the graph at the lead's source, if one was given.
    src = self.graph.case.get_source(lead.source_id) if lead.source_id else None
    if src:
        self.graph.set_active_source(src)

    _log(
        f"Round {round_num} dispatching: {lead.description}",
        event="dispatch", agent=agent_type,
    )
    # `rationale` is a propose_lead input but is not among the Lead fields listed
    # in §1.1, so read it defensively.
    rationale = getattr(lead, "rationale", "")
    await worker.run(
        f"Investigate this specific lead from the strategist:\n\n"
        f"REQUEST: {lead.description}\n"
        f"MOTIVATING HYPOTHESIS: {lead.motivating_hypothesis}\n"
        f"EXPECTED EVIDENCE TYPE: {lead.expected_evidence_type}\n"
        f"RATIONALE: {rationale}\n\n"
        f"After investigating, record findings via add_phenomenon AND link relevant phenomena "
        f"to {lead.motivating_hypothesis} via the appropriate edge_type."
    )
    lead.status = "completed"
    self.graph._auto_save()
```

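### 4.4 Automatic hypothesis regeneration (optional, recommended)

New phenomena may give rise to **new hypotheses**, not just updates to existing ones. Let the strategist trigger
this explicitly with
`propose_lead(target_agent="hypothesis", description="re-examine recent phenomena for new hypotheses")`.
This is the strategist's own decision, not a timer; consistency is preferred over automatic scheduling.
Note that the `target_agent` enum in §2.5 does not list `"hypothesis"`, and that schema also requires
`motivating_hypothesis` and `expected_evidence_type`, so the schema has to accommodate this call.
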
---

## 5. State persistence

`graph_state.json` gains a new top-level key `investigation_rounds: list[InvestigationRound]`,
handled by `save_state` / `load_state`. On **disconnect recovery**:

- Find the most recent round that is not completed → treat that round as failed
- Restart from the next round
- Phenomena / edges from completed rounds are kept as a matter of course

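The recovery rule can be sketched as a small function over the loaded round records. It assumes each record is a dict with `round_number` and `completed_at`, and it marks failed rounds with a `status` key that is invented here for illustration (the `InvestigationRound` dataclass has no such field); `resume_round_number` is likewise a hypothetical name.

```python
def resume_round_number(rounds: list[dict]) -> int:
    """Mark unfinished rounds as failed and return the round number to start from."""
    next_round = 1
    for r in rounds:
        if not r["completed_at"]:
            r["status"] = "failed"  # hypothetical marker, see the note above
        # Failed rounds still consume their number, so resume after the highest one seen.
        next_round = max(next_round, r["round_number"] + 1)
    return next_round


rounds = [
    {"round_number": 1, "completed_at": "2026-05-20T21:00:00"},
    {"round_number": 2, "completed_at": "2026-05-20T22:10:00"},
    {"round_number": 3, "completed_at": ""},  # interrupted mid-round
]
print(resume_round_number(rounds))  # 4
```

The strategist loop would then need to start its `range` at this number rather than at 1, so that a resumed run does not reuse round numbers.
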
---

## 6. Configuration

New entries in `config.yaml`:

```yaml
strategist:
  enabled: true                             # false = use the old Phase 3 logic (safety fallback)
  max_rounds: 10
  hard_stop_marginal_yield_zero_rounds: 3   # force stop after 3 consecutive rounds with yield=0

budgets:
  tool_calls_total: 5000
  wall_clock_minutes_max: 480
```

---

## 7. Test strategy

New file `tests/test_strategist.py`, or add to `test_optimizations.py`. At minimum, test that:

1. When the strategist calls `declare_investigation_complete`, the loop exits immediately
2. When the strategist calls `propose_lead`, the lead enters the graph with the correct round_number
3. The round snapshot correctly captures before/after status
4. When the budget is exhausted, the loop is forced to stop even if the strategist wants to continue
5. Disconnect recovery: after a mid-run interruption, the restart begins from the next round
6. `graph_overview` output includes the `distinct_sources` annotation
7. `source_coverage` marks untouched items with ✗
8. `marginal_yield` numbers agree with `confidence_log`

No LLM integration tests are written: strategist behavior is verified with a mock LLM (this pattern already
exists, see `test_forced_record_retry_fires_when_zero_phenomena`).

---

## 8. Implementation order

Ordered by dependency (**one commit per step**: this is a structural change, so single-point rollback matters):

| Step | Content | Depends on | Estimated effort |
|---|---|---|---|
| 1 | Add 4 fields to `Lead` + `InvestigationRound` dataclass + serialization | none | 60 lines + tests |
| 2 | Implement `graph_overview` / `source_coverage` / `marginal_yield` / `budget_status` | 1 | 250 lines + tests |
| 3 | `propose_lead` / `declare_investigation_complete` tools | 1 | 80 lines + tests |
| 4 | `InvestigationStrategist` agent class | 2, 3 | 120 lines + tests |
| 5 | Orchestrator Phase 3 rewrite | 4 | 150 lines (replacing ~50 old lines) + tests |
| 6 | config schema + loading logic | 5 | 30 lines |
| 7 | Disconnect-recovery handling | 5 | 40 lines + tests |
| 8 | Real-case smoke run (small scale: USB only) | 7 | no code |
| 9 | Docs: rewrite DESIGN.md §4.9 + archive this file | 8 | docs |

Total: ~800 lines of new code + tests + docs.

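---

## 9. Risks + mitigations

| Risk | Mitigation |
|---|---|
| Strategist is too conservative (always declares complete) | Add prompt examples showing what "a situation worth digging into" looks like; validate on small samples during testing |
| Strategist is too aggressive (proposes 7+ leads every round) | The `propose_lead` tool schema caps each round at 3-5 leads; the prompt stresses "quality over quantity" |
| A single worker that cannot finish its lead causes a budget avalanche | The worker call's own max_iter is unchanged; the strategist budget is separate |
| The LLM does not pick up on hints such as `distinct_sources` | Append 1-2 sentences of plain-English interpretation to `graph_overview`, e.g. "Hypothesis X has 23 edges but all from one source → cross-source corroboration would strengthen it" |
| Leads produced during Phase 1 are ignored by the strategist | The strategist prompt states explicitly: "handle existing pending leads first, then produce new ones" |
| Infinite loop (strategist keeps producing the same lead) | Deduplicate leads on the triple `(motivating_hyp, expected_type, source_id)` |
| Maintenance cost of the `EXPECTED_ARTEFACTS` list | Deliberately kept as "soft hints": an incomplete list breaks nothing; some depth just relies more on the LLM's own initiative |
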
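The triple-based deduplication from the table above might look like this minimal sketch. It assumes leads are available as dicts carrying the three fields; where the check lives (for instance inside the `propose_lead` executor) is not specified in this document, and `lead_key` / `accept_lead` are hypothetical names.

```python
def lead_key(lead: dict) -> tuple[str, str, str]:
    """Identity of a lead for dedup purposes: hypothesis, evidence type, source."""
    return (
        lead["motivating_hypothesis"],
        lead["expected_evidence_type"],
        lead.get("source_id", ""),
    )


def accept_lead(seen: set, candidate: dict) -> bool:
    """Record the candidate and return True, or return False if it is a repeat."""
    key = lead_key(candidate)
    if key in seen:
        return False
    seen.add(key)
    return True


seen: set = set()
first = {"motivating_hypothesis": "hyp-83db8748", "expected_evidence_type": "supports",
         "source_id": "src-ios-chan", "description": "Check Safari history"}
repeat = dict(first, description="Look at Safari history again")
print(accept_lead(seen, first), accept_lead(seen, repeat))  # True False
```

Because the free-text `description` is not part of the key, a reworded request for the same evidence is still rejected; the trade-off is that two genuinely different artefacts on the same source, aimed at the same hypothesis and edge type, would collide.
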
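---

## 10. Open questions

1. **Should an InvestigationRound run the hypothesis agent by itself?**
   Leaning toward the strategist triggering it explicitly via a lead (better consistency), with no timed trigger.

2. **What to do on budget overrun: hard stop vs soft warning?**
   The current design hard-stops. One could add schema enforcement along the lines of "once the strategist sees
   less than 10% of the budget remaining, it may only call `declare_investigation_complete`".

3. **Should a cross-source "independence bonus" be folded into log-odds?**
   The last damping change used `1/k` and did not distinguish cross-source from same-source edges. If the bonus
   is folded in, the formula should become `1/k_within_source × bonus_for_distinct_sources`.
   This is separate follow-up work.

4. **Should the strategist's `rationale` output go through grounding?**
   The strategist does not write phenomena, but the `rationale` field may contain specific values
   ("based on inv-12345..."). Leaning toward not enforcing it: this is meta-level judgment, not fact grounding.

5. **Keep or delete the current Phase 3 `max_investigation_rounds` config?**
   Suggest keeping it as the fallback knob for `strategist.enabled=false`.
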
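To make question 3 concrete, the sketch below contrasts the two weightings. It rests on several assumptions that this document does not confirm: that harmonic damping means the k-th same-direction edge contributes `weight / k`, that edges are counted in insertion order, and that the bonus is a single multiplier supplied by the caller. The function names are hypothetical.

```python
from collections import defaultdict


def damped_sum(weights: list[float]) -> float:
    """Global damping: the k-th edge (1-indexed) contributes weight / k."""
    return sum(w / k for k, w in enumerate(weights, start=1))


def damped_sum_by_source(edges: list[tuple[str, float]],
                         bonus_for_distinct_sources: float = 1.0) -> float:
    """Per-source damping: k restarts at 1 for each source, then the bonus is applied."""
    per_source = defaultdict(list)
    for source_id, weight in edges:
        per_source[source_id].append(weight)
    within = sum(damped_sum(ws) for ws in per_source.values())
    return within * bonus_for_distinct_sources


# Four unit-weight edges, all from one source vs. split across two sources.
same_source = [("src-a", 1.0)] * 4
two_sources = [("src-a", 1.0), ("src-a", 1.0), ("src-b", 1.0), ("src-b", 1.0)]

print(round(damped_sum([w for _, w in same_source]), 4))  # 2.0833
print(damped_sum_by_source(two_sources))                  # 3.0
```

Under these assumptions, splitting the same four edges across two sources raises the total from about 2.08 to 3.0 even with a neutral bonus of 1.0, which is the kind of cross-source reward the question is about.
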
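---

## 11. Relationship to DESIGN.md

Once this document lands, DESIGN.md needs the following updates:

- **§4.5**: add a paragraph: "Also look at the **structure** of the log_odds: the edges_in count and
  distinct_sources are key signals for the strategist when deciding whether to dig deeper, not just the
  confidence value"
- **§4.9 Phase 3**: change the table content from "leads dispatched to source-aware agents" to
  "strategist loop: read the graph, propose, execute, review, stop / continue"
- **§8** (design trade-offs): add item 6: "The trade-off of putting an LLM in the scheduling layer: the
  strategist decides depth, but each round's budget is hard-limited by `budgets.*`; this is the
  'LLM proposes, code adjudicates' principle honored at the scheduling layer"
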
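---

## 12. Memo: problems this design does **not** solve

- The root cause of the 8% hit rate on exam questions is an **incomplete toolset** (no vision, no ZIP
  brute-forcing, no VeraCrypt mounting, no blockchain explorer), not a scheduling problem. The strategist
  makes the existing tools get used harder, but it will not conjure new tools out of thin air.
- LLM fabrication of `invocation_id` (already patched, see the `feedback_grounding_pending` memory) and
  log-odds inflation (already patched: harmonic damping) are **prerequisites** of this design and are out of
  its scope.
- Finer-grained per-edge-type Bayesian modeling (such as a cross-source independence bonus) is a separate project.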