Compare commits
10 Commits
6b485b98f7
...
8b964b5dec
| Author | SHA1 | Date | |
|---|---|---|---|
|
|
8b964b5dec | ||
|
|
388321ee30 | ||
|
|
093f3cec1f | ||
|
|
a103c17bdb | ||
|
|
65745d21dc | ||
|
|
ff3a05d7ce | ||
|
|
6ebbc675c1 | ||
|
|
ca96f29849 | ||
|
|
8020c24776 | ||
|
|
f04ccd4bc7 |
24
DESIGN.md
24
DESIGN.md
@@ -172,9 +172,11 @@ verified facts(带重跑指令的引证)与 interpretation(明确标注的
|
||||
|
||||
- 阈值不变(≥0.8 supported / ≤0.2 refuted),只是改由 `L_post` 推出。
|
||||
- `prior_prob` 成为可配置量(默认 0.5 → `L_prior=0`)。
|
||||
- **简化假设说明**:多条边按独立处理(朴素贝叶斯)。同类证据反复出现并非
|
||||
完全独立——加一个旋钮:同 `(hypothesis, edge_type)` 的边数封顶或衰减,避免
|
||||
「同一发现被多 agent 重复入图」虚高置信度(现有 Jaccard 去重已部分缓解)。
|
||||
- **同类证据调和衰减**(2026-05 落地):同 `(hypothesis, edge_type)` 的第 k 条边
|
||||
贡献 `log_lr_base / k`。累计 = `log_lr_base · H_N`(调和级数,~ ln N)。
|
||||
解决朴素贝叶斯独立性破产 + 同一发现被多 agent 重复入图导致 L=+31 的失控
|
||||
(2026-05-20 实战数据)。单条边不变(k=1, 衰减=1.0)。**结构信号**比绝对值
|
||||
更重要:strategist 看 `distinct_sources` 比看 confidence 数值更能判断证据厚度。
|
||||
|
||||
附带产出一个 **假设 × 证据矩阵**视图,供报告与线索选择使用。
|
||||
|
||||
@@ -235,11 +237,19 @@ network=浏览器/PCAP)。改为按**调查职能**组织,并增加平台特
|
||||
|---|---|
|
||||
| Phase 1 | 「单镜像初勘」→ **逐源并行 triage**,每源派类型适配的 agent |
|
||||
| Phase 2 | 假设跨源生成;身份共指假设在此首次登记 |
|
||||
| Phase 3 | leads 派发到源感知 agent;假设×证据矩阵实时更新 |
|
||||
| Phase 3 | **Strategist 循环**:LLM 元 agent 每轮看图决定 propose_lead 或 declare_complete;workers 执行 lead;hypothesis 边重判 — 详见 `DESIGN_STRATEGIST.md` |
|
||||
| Phase 4 | 跨源时间线合并,**按源做时区归一**(iOS UTC vs 安卓本地时间) |
|
||||
| Phase 5 | 一案一份综合报告:含假设结论、实体关联图、每条结论的 provenance 引证 |
|
||||
|
||||
断连恢复、运行归档逻辑保留,`graph_state.json` 增量纳入新字段。
|
||||
**Phase 3 的"LLM 决定深度"**(2026-05 实战暴露 Phase 3 单轮触发 + log-odds 通胀致使 8 个 pending leads 一个未派发后落地):调度层从代码硬决策("max_rounds=N, converged→stop")转为 LLM 元 agent 驱动。
|
||||
|
||||
- 新 agent `InvestigationStrategist`(`agents/strategist.py`)每轮取一个动作:propose 1-3 lead,或 declare_investigation_complete
|
||||
- 4 个只读视图工具:`graph_overview` / `source_coverage` / `marginal_yield` / `budget_status`(`tools/strategy.py`)让 LLM 看到调度信号
|
||||
- 2 个写入决策工具:`propose_lead` / `declare_investigation_complete` 是 strategist 的 mandatory_record
|
||||
- 编排器读 `config.yaml:strategist.*` + `config.yaml:budgets.*` 控制 max_rounds 和 hard caps
|
||||
- 看 `[[DESIGN_STRATEGIST]]` 获取完整数据模型、prompt 设计、断连恢复、风险/缓解
|
||||
|
||||
断连恢复、运行归档逻辑保留;`graph_state.json` 新增 `investigation_rounds[]` 数组持久化 strategist 每轮决策。
|
||||
|
||||
---
|
||||
|
||||
@@ -252,8 +262,10 @@ network=浏览器/PCAP)。改为按**调查职能**组织,并增加平台特
|
||||
| `Phenomenon` | + `source_id`;description 拆为 `verified_facts[]` + `interpretation`;澄清/移除语义含混的 `confidence`(默认 1.0),观测的可靠性由 grounding 表达 |
|
||||
| `Hypothesis` | + `prior_prob`、`log_odds`(累加量);`confidence` 改为派生值 |
|
||||
| `Entity` | + 类型化标识符集合;通过 `same_as` 边跨源连通 |
|
||||
| Phenomenon→Hypothesis 边 | 携带 `edge_type`,映射到 `log₁₀(LR)`(替换 `_DEFAULT_EDGE_WEIGHTS`) |
|
||||
| Phenomenon→Hypothesis 边 | 携带 `edge_type`,映射到 `log₁₀(LR)`(替换 `_DEFAULT_EDGE_WEIGHTS`);同 `(hyp, edge_type)` 的第 k 条边按 `1/k` 调和衰减 |
|
||||
| Entity→Entity 边 | **新增** `same_as`(由 coref 假设背书,可逆) |
|
||||
| `Lead` | + `proposed_by` / `motivating_hypothesis` / `expected_evidence_type` / `round_number`(strategist 注解) |
|
||||
| `InvestigationRound` | **新增**:strategist 每轮决策的 provenance + before/after 快照 + 收益指标 |
|
||||
|
||||
`evidence_graph.py` 的 `VALID_EDGE_TYPES`、序列化/反序列化、Jaccard 去重相应适配。
|
||||
|
||||
|
||||
543
DESIGN_STRATEGIST.md
Normal file
543
DESIGN_STRATEGIST.md
Normal file
@@ -0,0 +1,543 @@
|
||||
# Strategist Loop —— Phase 3 信念驱动改造
|
||||
|
||||
> 这是 DESIGN.md 的补充设计文档,针对 §4.9 编排器 Phase 3 的具体重写。
|
||||
>
|
||||
> **触发动因**:2026-05-20 第一次全 6-source 实战(`runs/2026-05-20T20-15-04/`)
|
||||
> 暴露 Phase 3 不工作——8 条 pending leads 一个都没派发,因为
|
||||
> log-odds 通胀让所有 hypothesis 立即 converged。即使在「调和衰减」修复
|
||||
> log-odds 数学后(commit 在 `evidence_graph.py:update_hypothesis_confidence`),
|
||||
> Phase 3 在当前架构下仍然是「单轮触发、规则收敛」的机械流程——LLM
|
||||
> 在调度层完全没有发言权。本设计把 Phase 3 改为 LLM 驱动的探索循环。
|
||||
|
||||
---
|
||||
|
||||
## 0. 范围
|
||||
|
||||
### 做什么
|
||||
|
||||
把 `orchestrator.py:Phase 3` 从「单轮、规则触发」改造为「strategist-loop、信念驱动」:
|
||||
新增一个 `InvestigationStrategist` agent + 4 个决策视图工具 + 2 个决策动作工具
|
||||
+ 编排器循环改写。
|
||||
|
||||
### 不做什么
|
||||
|
||||
- 不改 Phase 1(per-source triage 保持现状)
|
||||
- 不改 Phase 2(HypothesisAgent 不动;strategist 可以**调用**它,但不替代)
|
||||
- 不改 Phase 4/5(timeline / report)
|
||||
- 不写专家级 per-source 检查清单(只在 `source_coverage` 工具里塞**软提示**清单)
|
||||
- 不引入新的图节点类型;leads 复用现有结构
|
||||
|
||||
### 保留的不变式
|
||||
|
||||
- DESIGN.md §4.3 grounding 网关,所有写入仍走它
|
||||
- DESIGN.md §4.5 log-odds + 调和衰减
|
||||
- DESIGN.md §4.4 verified_facts vs interpretation 划界
|
||||
- 断连恢复(`graph_state.json` 序列化兼容)
|
||||
|
||||
### 设计原则
|
||||
|
||||
1. **"LLM 提议,代码裁决" 上移到调度层**:DESIGN.md 第一原则现在只在事实层
|
||||
(grounding)兑现,调度层「该不该深入、深入哪里、何时停」目前是代码硬决策。
|
||||
本设计让 LLM 持有调度决策权。
|
||||
2. **应试能力存在但不被绑死**:系统的工具集和软提示清单覆盖应试场景所需的工件
|
||||
类别;但是否查某个工件、查到什么深度,由 strategist 看具体案件性质决定,
|
||||
不被预定义清单强制。
|
||||
3. **可解释、可审计**:每一轮 strategist 决策、动机、产出收益都被记入持久化的
|
||||
`InvestigationRound`,可事后复盘。
|
||||
|
||||
---
|
||||
|
||||
## 1. 数据模型变更
|
||||
|
||||
### 1.1 `Lead` 扩 4 字段
|
||||
|
||||
`evidence_graph.py:Lead` 现有 `(id, title, description, target_agent, source_id, status, …)`。
|
||||
新增:
|
||||
|
||||
```python
|
||||
@dataclass
|
||||
class Lead:
|
||||
# ... existing fields
|
||||
proposed_by: str = "" # "strategist" | "filesystem" | ... — 提案 agent
|
||||
motivating_hypothesis: str = "" # hyp-id this lead is meant to corroborate/refute
|
||||
expected_evidence_type: str = "" # one of edge_types — 期望产出的边类型
|
||||
round_number: int = 0 # 哪一轮 strategist 产生
|
||||
```
|
||||
|
||||
`motivating_hypothesis` 是关键——它把 lead 和 hypothesis 显式挂钩,让事后能算
|
||||
"这条 lead 跑完到底有没有改变假设状态",即 strategist 的边际收益度量。
|
||||
|
||||
### 1.2 新增 `InvestigationRound` 节点
|
||||
|
||||
记录每一轮 strategist 的决策本身——provenance 也要可审计:
|
||||
|
||||
```python
|
||||
@dataclass
|
||||
class InvestigationRound:
|
||||
id: str # "round-001"
|
||||
round_number: int
|
||||
started_at: str
|
||||
completed_at: str = ""
|
||||
strategist_action: str = "" # "propose_leads" | "declare_complete"
|
||||
leads_proposed: list[str] = field(default_factory=list)
|
||||
leads_executed: list[str] = field(default_factory=list)
|
||||
hypothesis_status_snapshot_before: dict = field(default_factory=dict) # hyp_id → status
|
||||
hypothesis_status_snapshot_after: dict = field(default_factory=dict)
|
||||
new_phenomena_count: int = 0
|
||||
new_edges_count: int = 0
|
||||
decision_rationale: str = "" # strategist 自述
|
||||
```
|
||||
|
||||
随 graph 序列化(加进 `to_dict`/`from_dict`)。
|
||||
|
||||
---
|
||||
|
||||
## 2. 新工具
|
||||
|
||||
放在新文件 `tools/strategy.py`。按现有 `TOOL_CATALOG` 注册模式登记。
|
||||
|
||||
### 2.1 `graph_overview()` — 全局态势(只读)
|
||||
|
||||
**Signature**: `graph_overview() -> str`
|
||||
|
||||
**输出**(markdown,比 JSON 更易 LLM 解读):
|
||||
|
||||
```markdown
|
||||
# Investigation State
|
||||
|
||||
## Hypotheses (8)
|
||||
| id | title | L | conf | status | edges_in | distinct_sources | flipped_in_last_2_rounds |
|
||||
|----|-------|---|------|--------|----------|------------------|---------------------------|
|
||||
| hyp-83db8748 | Multi-Device Composite | +8.75 | 0.99 | supported | 23 | 1 | no |
|
||||
| hyp-daa7c704 | Multiple Identity Aliases | +9.21 | 0.99 | supported | 11 | 3 | no |
|
||||
| hyp-7fa9b13e | Sunny.zip contains timer_a | +2.08 | 0.99 | supported | 4 | 1 | yes (active→supported in R2) |
|
||||
| ...
|
||||
|
||||
## Sources (6)
|
||||
| id | type | phenomena | identities | last_touched_in_round |
|
||||
| src-usb-leung | disk_image | 8 | 1 | R1 |
|
||||
| ...
|
||||
|
||||
## Pending Leads (3)
|
||||
| id | from | targeting | for_hypothesis | reason |
|
||||
| lead-aaa | filesystem | src-ios-chan/Safari | hyp-83db8748 | Safari history likely contains device-switching evidence |
|
||||
```
|
||||
|
||||
**关键标注**:`distinct_sources` 一栏暴露了"这个假设只靠一个源支撑"——strategist
|
||||
看到 23 边都来自 android 源会自动判断"需要从别处独立证据"。
|
||||
|
||||
### 2.2 `source_coverage(source_id: str)` — 单源覆盖度(只读)
|
||||
|
||||
**Signature**: `source_coverage(source_id: str) -> str`
|
||||
|
||||
**实现**:扫 `graph.tool_invocations`,过滤 `source_id == 该源`,按工具名 + 主要 args
|
||||
分组。然后跟 `EXPECTED_ARTEFACTS[source_type]` 比对,未触达项打 ✗。
|
||||
|
||||
```python
|
||||
# tools/strategy.py
|
||||
EXPECTED_ARTEFACTS: dict[str, list[dict]] = {
|
||||
"disk_image+windows": [
|
||||
{"name": "filesystem layout", "detector": "fls|mmls", "value_for": "deleted files, hidden partitions"},
|
||||
{"name": "registry hives", "detector": "parse_registry_key", "value_for": "user activity, installed software"},
|
||||
{"name": "browser history", "detector": "list_directory@AppData/.../History", "value_for": "URL access, downloads"},
|
||||
{"name": "prefetch", "detector": "extract_file@Windows/Prefetch", "value_for": "program execution evidence"},
|
||||
# ...
|
||||
],
|
||||
"mobile_extraction": [
|
||||
{"name": "AddressBook", "detector": "sqlite_query@AddressBook.sqlitedb", "value_for": "contacts"},
|
||||
{"name": "SMS messages", "detector": "sqlite_query@sms.db", "value_for": "messaging content"},
|
||||
{"name": "WhatsApp messages", "detector": "sqlite_query@ChatStorage.sqlite", "value_for": "WhatsApp content"},
|
||||
{"name": "Call history", "detector": "sqlite_query@CallHistoryDB", "value_for": "call records"},
|
||||
{"name": "Safari history", "detector": "sqlite_query@History.db|read_text@Bookmarks.plist", "value_for": "web browsing"},
|
||||
{"name": "Photos library", "detector": "sqlite_query@Photos.sqlite", "value_for": "photo metadata, EXIF, geolocation"},
|
||||
{"name": "iCloud accounts", "detector": "parse_plist@Accounts3.sqlite|parse_keychain", "value_for": "Apple ID, services"},
|
||||
{"name": "App inventory", "detector": "list_directory@var/containers/Bundle/Application", "value_for": "installed apps"},
|
||||
],
|
||||
"disk_image+android": [...],
|
||||
"media_collection": [
|
||||
{"name": "OCR text", "detector": "ocr_image", "value_for": "screenshot text"},
|
||||
{"name": "EXIF metadata", "detector": "exif_image", "value_for": "device, timestamps, geolocation"},
|
||||
],
|
||||
}
|
||||
```
|
||||
|
||||
**软提示语义**:output 末尾必带一句:
|
||||
|
||||
> Coverage hints are heuristics, not requirements. Skip an item if the case theory
|
||||
> makes it irrelevant. Investigate ✗ items only when they could materially affect
|
||||
> an active hypothesis.
|
||||
|
||||
这一句是**"应试能力存在但不被绑死"的关键**——LLM 看到 ✗ 不会盲投,会先看
|
||||
hypothesis 列表问"这个工件对当前任何 hypothesis 有意义吗"。
|
||||
|
||||
### 2.3 `marginal_yield(last_n_rounds: int = 2)` — 边际收益(只读)
|
||||
|
||||
**Signature**: `marginal_yield(last_n_rounds: int = 2) -> str`
|
||||
|
||||
**实现**:扫最近 N 个 `InvestigationRound`,统计:
|
||||
- 每轮新增 phenomena 数
|
||||
- 每轮新增 P→H 边数
|
||||
- 每轮 hypothesis status flips 数(active→supported / 反向)
|
||||
|
||||
**输出**:
|
||||
|
||||
```markdown
|
||||
# Marginal Yield (last 2 rounds)
|
||||
|
||||
| round | new_phenomena | new_edges | status_flips |
|
||||
| R3 | 5 | 7 | 1 |
|
||||
| R4 | 2 | 1 | 0 |
|
||||
|
||||
Trend: decelerating (R4 yield 33% of R3).
|
||||
Recommendation interpretation aid: yield trending to zero suggests diminishing
|
||||
returns; consider declare_complete after one more probe.
|
||||
```
|
||||
|
||||
最后一行是 LLM-friendly heuristic prose,不是强制信号。
|
||||
|
||||
### 2.4 `budget_status()` — 预算视图(只读)
|
||||
|
||||
**Signature**: `budget_status() -> str`
|
||||
|
||||
```markdown
|
||||
# Budget Status
|
||||
|
||||
| metric | used | cap | pct |
|
||||
| tool_calls | 1248 | 5000 | 25% |
|
||||
| strategist_rounds | 3 | 10 | 30% |
|
||||
| wall_clock_minutes | 142 | 360 | 39% |
|
||||
|
||||
Phase 1 used 89% of allocated. Phase 2 used 4%. Phase 3 (strategist) so far: 7%.
|
||||
```
|
||||
|
||||
预算从 config.yaml 读,新增字段见 §6。无预算配置时进 unbounded 模式(仅靠
|
||||
strategist 自宣 complete + hard safety cap)。
|
||||
|
||||
### 2.5 决策动作工具(写入)
|
||||
|
||||
注册到 strategist 的 `mandatory_record_tools`。Strategist 每轮必须 call 至少一个,
|
||||
否则 forced-retry 触发(复用现有机制)。
|
||||
|
||||
**`propose_lead(...)`**:
|
||||
|
||||
```python
|
||||
{
|
||||
"name": "propose_lead",
|
||||
"input_schema": {
|
||||
"type": "object",
|
||||
"required": [
|
||||
"description", "target_agent",
|
||||
"motivating_hypothesis", "expected_evidence_type",
|
||||
],
|
||||
"properties": {
|
||||
"description": {
|
||||
"type": "string",
|
||||
"description": "1-2 sentence specific investigation request, including target source/artefact",
|
||||
},
|
||||
"target_agent": {
|
||||
"type": "string",
|
||||
"enum": ["filesystem","registry","communication","network","ios_artifact","android_artifact","media"],
|
||||
},
|
||||
"source_id": {"type": "string", "description": "which source to investigate"},
|
||||
"motivating_hypothesis": {
|
||||
"type": "string",
|
||||
"description": "hyp-id this lead is meant to corroborate or refute",
|
||||
},
|
||||
"expected_evidence_type": {
|
||||
"type": "string",
|
||||
"enum": ["direct_evidence","supports","contradicts","weakens","prerequisite_met","consequence_observed"],
|
||||
},
|
||||
"rationale": {"type": "string", "description": "why this fills a real gap"},
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
**`declare_investigation_complete(...)`**:
|
||||
|
||||
```python
|
||||
{
|
||||
"name": "declare_investigation_complete",
|
||||
"input_schema": {
|
||||
"type": "object",
|
||||
"required": ["reason"],
|
||||
"properties": {
|
||||
"reason": {
|
||||
"type": "string",
|
||||
"enum": [
|
||||
"marginal_yield_zero",
|
||||
"budget_exhausted",
|
||||
"all_hypotheses_resolved",
|
||||
"coverage_saturated",
|
||||
"other",
|
||||
],
|
||||
},
|
||||
"rationale": {"type": "string"},
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
Terminal tool —— 调用即结束循环(复用现有 `terminal_tools` 机制)。
|
||||
|
||||
---
|
||||
|
||||
## 3. `InvestigationStrategist` agent
|
||||
|
||||
新文件 `agents/strategist.py`,约 150 行。
|
||||
|
||||
```python
|
||||
class InvestigationStrategist(BaseAgent):
|
||||
name = "strategist"
|
||||
role = (
|
||||
"You are the investigation strategist. You do not run forensic tools yourself. "
|
||||
"Your job is to read the current evidence graph and decide ONE of:\n"
|
||||
" (a) propose 1-3 new investigation leads that would materially affect an active hypothesis, or\n"
|
||||
" (b) declare the investigation complete.\n"
|
||||
"\n"
|
||||
"Use graph_overview / source_coverage / marginal_yield / budget_status to ground your judgment. "
|
||||
"DO NOT propose a lead that just adds more same-direction evidence to an already-supported hypothesis "
|
||||
"(harmonic damping makes it ~useless). DO propose leads when:\n"
|
||||
" - A hypothesis is supported by edges from only ONE source — get cross-source corroboration.\n"
|
||||
" - A hypothesis is in the active band (0.2 < conf < 0.8) — it needs the deciding evidence.\n"
|
||||
" - A specific high-value artefact is uncovered on a source where the active hypotheses suggest it matters.\n"
|
||||
"\n"
|
||||
"Declare complete when marginal_yield is approaching zero AND no remaining active hypotheses have "
|
||||
"obvious investigation paths."
|
||||
)
|
||||
|
||||
mandatory_record_tools = ("propose_lead", "declare_investigation_complete")
|
||||
terminal_tools = ("declare_investigation_complete",)
|
||||
|
||||
def _register_graph_tools(self):
|
||||
# Read-only tools — strategist NEVER writes phenomena/edges directly.
|
||||
# All graph writes happen via the workers it dispatches.
|
||||
self._register_graph_read_tools()
|
||||
# No graph_write_tools.
|
||||
# Add strategy-specific tools:
|
||||
for tool_name in (
|
||||
"graph_overview", "source_coverage", "marginal_yield", "budget_status",
|
||||
"propose_lead", "declare_investigation_complete",
|
||||
):
|
||||
td = TOOL_CATALOG[tool_name]
|
||||
self.register_tool(td.name, td.description, td.input_schema, td.executor)
|
||||
```
|
||||
|
||||
注册到 `agent_factory._AGENT_CLASSES["strategist"]`。
|
||||
|
||||
---
|
||||
|
||||
## 4. 编排器改造
|
||||
|
||||
### 4.1 删除/替换:现在的 Phase 3
|
||||
|
||||
`orchestrator.py:Phase 3` 当前逻辑(约 150 行):检查 leads → 派 worker →
|
||||
检查 converged → 退出。**删除**。
|
||||
|
||||
### 4.2 新 Phase 3:strategist loop
|
||||
|
||||
```python
|
||||
async def _phase3_strategist_loop(self, run_dir: Path) -> None:
|
||||
"""Belief-driven investigation: strategist proposes, workers execute, repeat."""
|
||||
_log("Phase 3: Strategist-Driven Investigation", event="phase")
|
||||
|
||||
strategist = self.factory.get_or_create_agent("strategist")
|
||||
max_rounds = self.config.get("budgets", {}).get("strategist_rounds_max", 10)
|
||||
|
||||
for round_num in range(1, max_rounds + 1):
|
||||
# 1. Record round start + snapshot
|
||||
rid = await self.graph.start_investigation_round(round_num)
|
||||
|
||||
# 2. Strategist run
|
||||
_log(f"Strategist Round {round_num}", event="phase")
|
||||
await strategist.run(
|
||||
f"Review the graph and decide the next investigation action. "
|
||||
f"This is round {round_num}/{max_rounds}. Budget used so far: see budget_status."
|
||||
)
|
||||
|
||||
# 3. Did strategist declare complete?
|
||||
if self.graph.is_round_terminal(rid):
|
||||
_log(f"Strategist declared complete at round {round_num}", event="progress")
|
||||
break
|
||||
|
||||
# 4. Collect new leads proposed this round
|
||||
new_leads = self.graph.leads_from_round(round_num)
|
||||
if not new_leads:
|
||||
_log(f"No leads proposed in round {round_num} — stopping", event="progress")
|
||||
break
|
||||
|
||||
# 5. Dispatch each lead
|
||||
for lead in new_leads:
|
||||
await self._execute_lead(lead, round_num)
|
||||
|
||||
# 6. Close round + record yield
|
||||
await self.graph.complete_investigation_round(rid)
|
||||
|
||||
# 7. Hard budget check
|
||||
if self._budget_exceeded():
|
||||
_log(f"Budget exhausted at round {round_num}", event="progress")
|
||||
break
|
||||
```
|
||||
|
||||
### 4.3 `_execute_lead` 复用现有 worker 派发逻辑
|
||||
|
||||
```python
|
||||
async def _execute_lead(self, lead: Lead, round_num: int) -> None:
|
||||
agent_type = AGENT_ALIASES.get(lead.target_agent, lead.target_agent)
|
||||
worker = self.factory.get_or_create_agent(agent_type)
|
||||
if worker is None:
|
||||
logger.warning(f"No worker for lead {lead.id}: {agent_type}")
|
||||
return
|
||||
|
||||
src = self.graph.case.get_source(lead.source_id) if lead.source_id else None
|
||||
if src:
|
||||
self.graph.set_active_source(src)
|
||||
|
||||
_log(
|
||||
f"Round {round_num} dispatching: {lead.description}",
|
||||
event="dispatch", agent=agent_type,
|
||||
)
|
||||
await worker.run(
|
||||
f"Investigate this specific lead from the strategist:\n\n"
|
||||
f"REQUEST: {lead.description}\n"
|
||||
f"MOTIVATING HYPOTHESIS: {lead.motivating_hypothesis}\n"
|
||||
f"EXPECTED EVIDENCE TYPE: {lead.expected_evidence_type}\n"
|
||||
f"RATIONALE: {lead.rationale}\n\n"
|
||||
f"After investigating, record findings via add_phenomenon AND link relevant phenomena "
|
||||
f"to {lead.motivating_hypothesis} via the appropriate edge_type."
|
||||
)
|
||||
lead.status = "completed"
|
||||
self.graph._auto_save()
|
||||
```
|
||||
|
||||
### 4.4 自动 hypothesis 重生成(可选,建议加)
|
||||
|
||||
新增 phenomena 可能产生**新假设**(不只是更新现有假设)。让 strategist 用
|
||||
`propose_lead(target_agent="hypothesis", description="re-examine recent phenomena for new hypotheses")`
|
||||
显式触发——这是 strategist 自决定的,不是定时触发。一致性优于自动定时。
|
||||
|
||||
---
|
||||
|
||||
## 5. 状态持久化
|
||||
|
||||
`graph_state.json` 新增顶层 key `investigation_rounds: list[InvestigationRound]`。
|
||||
`save_state` / `load_state` 处理。**断连恢复**时:
|
||||
|
||||
- 找最近一个未 completed 的 round → 视为该 round 失败
|
||||
- 从下一个 round 重新开始
|
||||
- 已完成 round 的 phenomena / edges 自然保留
|
||||
|
||||
---
|
||||
|
||||
## 6. 配置
|
||||
|
||||
`config.yaml` 新增:
|
||||
|
||||
```yaml
|
||||
strategist:
|
||||
enabled: true # false = 走老 Phase 3 逻辑(safety fallback)
|
||||
max_rounds: 10
|
||||
hard_stop_marginal_yield_zero_rounds: 3 # 连续 3 轮 yield=0 强制停
|
||||
|
||||
budgets:
|
||||
tool_calls_total: 5000
|
||||
wall_clock_minutes_max: 480
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 7. 测试策略
|
||||
|
||||
新文件 `tests/test_strategist.py` 或加入 `test_optimizations.py`。最少要测:
|
||||
|
||||
1. Strategist 调 `declare_complete` 时 loop 立即退出
|
||||
2. Strategist 调 `propose_lead` 时 lead 入 graph 且 round_number 正确
|
||||
3. Round snapshot 正确捕获 before/after status
|
||||
4. 预算耗尽时即使 strategist 还想继续也强制停
|
||||
5. 断连恢复:中途中断后重启从下一 round 开始
|
||||
6. `graph_overview` 输出包含 `distinct_sources` 标注
|
||||
7. `source_coverage` 对未触达项标 ✗
|
||||
8. `marginal_yield` 数字与 `confidence_log` 一致
|
||||
|
||||
不写 LLM 集成测试——strategist 行为通过 mock LLM 验证(已有这种模式见
|
||||
`test_forced_record_retry_fires_when_zero_phenomena`)。
|
||||
|
||||
---
|
||||
|
||||
## 8. 实施顺序
|
||||
|
||||
按依赖排(**每步独立 commit**——结构性改造,单点回滚关键):
|
||||
|
||||
| 步 | 内容 | 依赖 | 工作量估算 |
|
||||
|---|---|---|---|
|
||||
| 1 | `Lead` 加 4 字段 + `InvestigationRound` 数据类 + 序列化 | — | 60 行 + 测试 |
|
||||
| 2 | `graph_overview` / `source_coverage` / `marginal_yield` / `budget_status` 实现 | 1 | 250 行 + 测试 |
|
||||
| 3 | `propose_lead` / `declare_investigation_complete` 工具 | 1 | 80 行 + 测试 |
|
||||
| 4 | `InvestigationStrategist` agent class | 2, 3 | 120 行 + 测试 |
|
||||
| 5 | 编排器 Phase 3 重写 | 4 | 150 行(替换 ~50 行旧)+ 测试 |
|
||||
| 6 | config schema + 加载逻辑 | 5 | 30 行 |
|
||||
| 7 | 断连恢复处理 | 5 | 40 行 + 测试 |
|
||||
| 8 | 真实案件 smoke run(小规模:USB only) | 7 | 0 代码 |
|
||||
| 9 | 文档:DESIGN.md §4.9 改写 + 本文件归档 | 8 | 文档 |
|
||||
|
||||
总:~800 行新代码 + 测试 + 文档。
|
||||
|
||||
---
|
||||
|
||||
## 9. 风险 + 缓解
|
||||
|
||||
| 风险 | 缓解 |
|
||||
|---|---|
|
||||
| Strategist 太保守(永远 declare_complete) | 加 prompt 例子展示什么是"该深入的情况";测试时小样本验证 |
|
||||
| Strategist 太激进(每轮都 propose 7+ leads) | `propose_lead` 工具 schema 限制每轮最多 3-5 个;prompt 强调"重质不重量" |
|
||||
| 单 worker 跑不完 lead 导致预算雪崩 | worker 调用本身 max_iter 不变;strategist 预算独立 |
|
||||
| LLM 不理解 `distinct_sources` 这种暗示 | `graph_overview` 末尾加 1-2 句 plain-English 解读 "Hypothesis X has 23 edges but all from one source → cross-source corroboration would strengthen it" |
|
||||
| Phase 1 触发产生的 leads 被 strategist 忽略 | strategist prompt 明确"先处理已有 pending leads,再产新的" |
|
||||
| 死循环(strategist 反复产同样 lead) | Lead 表上加 `(motivating_hyp, expected_type, source_id)` 三元组去重 |
|
||||
| `EXPECTED_ARTEFACTS` 清单维护成本 | 故意保持"软提示"——清单不完整也不会破,只是某些深度需要更多 LLM 自觉 |
|
||||
|
||||
---
|
||||
|
||||
## 10. 开放问题
|
||||
|
||||
1. **InvestigationRound 该不该自己跑 hypothesis agent?**
|
||||
倾向 strategist 用 lead 显式触发(一致性更好),不做定时触发。
|
||||
|
||||
2. **预算超用怎么办——硬停 vs 软警告?**
|
||||
当前设计硬停;可加 "strategist 看到 budget < 10% 时只能 declare_complete"
|
||||
的 schema enforcement。
|
||||
|
||||
3. **跨 source 边的"独立性奖励"是否纳入 log-odds?**
|
||||
上次衰减用了 `1/k`,没区分跨源 vs 同源。如果要纳入,公式应改为
|
||||
`1/k_within_source × bonus_for_distinct_sources`。这是后续单独工程。
|
||||
|
||||
4. **Strategist 输出的 `rationale` 该不该走 grounding?**
|
||||
它不会写 phenomena,但 `rationale` 字段可能包含具体值
|
||||
("based on inv-12345...")。倾向不强制——这是元层判断,不是事实落地。
|
||||
|
||||
5. **现 Phase 3 的 `max_investigation_rounds` config 留还是删?**
|
||||
建议留作 `strategist.enabled=false` 时的 fallback 旋钮。
|
||||
|
||||
---
|
||||
|
||||
## 11. 与 DESIGN.md 的关系
|
||||
|
||||
本文档落地后,DESIGN.md 需要的对应更新:
|
||||
|
||||
- **§4.5**:补一段「同时也要看 log_odds 的**结构**——edges_in 数 / distinct_sources
|
||||
是 strategist 判断是否深入的关键信号,不只是 confidence 数值」
|
||||
- **§4.9 Phase 3**:表格内容从「leads 派发到源感知 agent」改为
|
||||
「strategist 循环:看图、提案、执行、复盘、停 / 续」
|
||||
- **§8**(设计取舍):新增第 6 条:「调度层 LLM 化的取舍——strategist 决定深度,
|
||||
但每轮预算受 `budgets.*` 硬限制;这是"LLM 提议、代码裁决"原则在调度层的兑现」
|
||||
|
||||
---
|
||||
|
||||
## 12. 备忘:本设计**不解决**的问题
|
||||
|
||||
- 应试题 8% 命中率的根因是**工具集不全**(无 vision、无 ZIP 暴力破解、无 VeraCrypt
|
||||
挂载、无 blockchain explorer),不是调度问题。strategist 让现有工具被用得更狠,
|
||||
但不会凭空多出工具。
|
||||
- LLM 编造 `invocation_id`(已修补,见 `feedback_grounding_pending` memory)和
|
||||
log-odds 通胀(已修补:调和衰减)是本设计的**前置依赖**,不在本设计范围内。
|
||||
- Per-edge-type 的更精细贝叶斯建模(如跨源独立性 bonus)是独立工程。
|
||||
@@ -33,6 +33,7 @@ def _load_agent_classes() -> None:
|
||||
from agents.network import NetworkAgent
|
||||
from agents.registry import RegistryAgent
|
||||
from agents.report import ReportAgent
|
||||
from agents.strategist import InvestigationStrategist
|
||||
from agents.timeline import TimelineAgent
|
||||
_AGENT_CLASSES["filesystem"] = FileSystemAgent
|
||||
_AGENT_CLASSES["registry"] = RegistryAgent
|
||||
@@ -44,6 +45,7 @@ def _load_agent_classes() -> None:
|
||||
_AGENT_CLASSES["ios_artifact"] = IOSArtifactAgent
|
||||
_AGENT_CLASSES["android_artifact"] = AndroidArtifactAgent
|
||||
_AGENT_CLASSES["media"] = MediaAgent
|
||||
_AGENT_CLASSES["strategist"] = InvestigationStrategist
|
||||
|
||||
|
||||
# Triage agent per (source.type, platform). disk_image is ambiguous on its
|
||||
|
||||
134
agents/strategist.py
Normal file
134
agents/strategist.py
Normal file
@@ -0,0 +1,134 @@
|
||||
"""InvestigationStrategist — the LLM that decides depth vs breadth.
|
||||
|
||||
DESIGN_STRATEGIST.md §3.
|
||||
|
||||
The strategist does NOT run forensic tools. Its job per round is exactly one
|
||||
decision: propose 1-3 leads that would move an active hypothesis, OR declare
|
||||
the investigation complete. It reads the graph through four read-only views
|
||||
(graph_overview / source_coverage / marginal_yield / budget_status) and
|
||||
expresses its decision through two write tools (propose_lead /
|
||||
declare_investigation_complete).
|
||||
|
||||
This is the smallest possible agent in the system — the entire point is that
|
||||
strategy decisions live in one agent so they're auditable and the rest of the
|
||||
codebase doesn't carry implicit depth/breadth policy.
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import logging
|
||||
|
||||
from base_agent import BaseAgent
|
||||
from evidence_graph import EvidenceGraph
|
||||
from llm_client import LLMClient
|
||||
from tool_registry import TOOL_CATALOG
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
|
||||
class InvestigationStrategist(BaseAgent):
|
||||
name = "strategist"
|
||||
role = (
|
||||
"Investigation strategist. You do not run forensic tools yourself. "
|
||||
"Each round you take ONE decision: propose 1-3 new investigation leads "
|
||||
"that would materially affect an active hypothesis, OR declare the "
|
||||
"investigation complete. Your judgment is grounded in the graph "
|
||||
"(hypotheses, sources, coverage, marginal yield, budget) — never in "
|
||||
"speculation."
|
||||
)
|
||||
# At least one of these must be called every round, otherwise BaseAgent's
|
||||
# forced RECORD retry kicks in and re-prompts the strategist to take a
|
||||
# documented decision.
|
||||
mandatory_record_tools = ("propose_lead", "declare_investigation_complete")
|
||||
# declare_complete is terminal — calling it short-circuits the tool loop,
|
||||
# which is what we want (strategist returns immediately on "done").
|
||||
terminal_tools = ("declare_investigation_complete",)
|
||||
|
||||
# Strategist-specific tools, plus the read-only graph queries inherited
|
||||
# from BaseAgent. NO graph write tools (no add_phenomenon / link_to_entity
|
||||
# / observe_identity); the strategist must NOT mutate evidence directly.
|
||||
_STRATEGY_TOOLS = (
|
||||
"graph_overview",
|
||||
"source_coverage",
|
||||
"marginal_yield",
|
||||
"budget_status",
|
||||
"propose_lead",
|
||||
"declare_investigation_complete",
|
||||
)
|
||||
|
||||
def _register_graph_tools(self) -> None:
|
||||
"""Strategist gets read-only graph queries + the six strategy tools.
|
||||
|
||||
It does NOT get write tools (no add_phenomenon, observe_identity,
|
||||
link_to_entity, add_temporal_edge). Every graph mutation must come
|
||||
from a dispatched worker, not from the planner.
|
||||
"""
|
||||
self._register_graph_read_tools()
|
||||
for tool_name in self._STRATEGY_TOOLS:
|
||||
td = TOOL_CATALOG.get(tool_name)
|
||||
if td is None:
|
||||
logger.warning(
|
||||
"Strategist could not find tool %s in TOOL_CATALOG — "
|
||||
"register_all_tools must run before agent instantiation.",
|
||||
tool_name,
|
||||
)
|
||||
continue
|
||||
self.register_tool(td.name, td.description, td.input_schema, td.executor)
|
||||
|
||||
def _build_system_prompt(self, task: str) -> str:
|
||||
"""Strategist-specific prompt. Replaces the BaseAgent default which
|
||||
walks an INVESTIGATE→RECORD→LINK workflow that is wrong for a
|
||||
planner agent.
|
||||
"""
|
||||
return (
|
||||
f"You are {self.name}, the investigation strategist.\n"
|
||||
f"Role: {self.role}\n\n"
|
||||
f"Your task: {task}\n\n"
|
||||
f"WORKFLOW (do this exactly):\n"
|
||||
f" 1. Call graph_overview FIRST. Look at: which hypotheses are\n"
|
||||
f" active (conf 0.2-0.8) vs already supported/refuted; which\n"
|
||||
f" ones have many edges but only 1 distinct_source; which had\n"
|
||||
f" a recent_flip vs none in two rounds.\n"
|
||||
f" 2. Call marginal_yield to see if the last rounds produced anything.\n"
|
||||
f" 3. Call budget_status to know your runway.\n"
|
||||
f" 4. For each candidate lead direction, call source_coverage on\n"
|
||||
f" the relevant source to see what's been touched.\n"
|
||||
f" 5. Take exactly ONE of these terminal actions:\n"
|
||||
f" (a) Call propose_lead 1-3 times for leads that would\n"
|
||||
f" materially move an active hypothesis. STOP after this.\n"
|
||||
f" (b) Call declare_investigation_complete with a specific\n"
|
||||
f" reason. STOP after this.\n"
|
||||
f"\n"
|
||||
f"DECISION CRITERIA — when to propose vs when to stop:\n"
|
||||
f" PROPOSE when:\n"
|
||||
f" - A hypothesis is supported only by ONE source — get\n"
|
||||
f" cross-source corroboration. Same-source repeats are\n"
|
||||
f" cheap (harmonic damping).\n"
|
||||
f" - A hypothesis is in the active band (0.2 < conf < 0.8) —\n"
|
||||
f" it needs the deciding evidence.\n"
|
||||
f" - A high-value artefact is ✗ on source_coverage AND an\n"
|
||||
f" active hypothesis depends on the kind of evidence that\n"
|
||||
f" artefact would produce.\n"
|
||||
f" STOP (declare_complete) when:\n"
|
||||
f" - marginal_yield shows zero across 2+ rounds.\n"
|
||||
f" - budget_status warns ≥90% on tool_calls or rounds.\n"
|
||||
f" - all active hypotheses are resolved (supported or refuted).\n"
|
||||
f" - coverage saturation: every ✗ on every source is irrelevant\n"
|
||||
f" to active hypotheses.\n"
|
||||
f"\n"
|
||||
f"HARD RULES:\n"
|
||||
f" - You CANNOT call investigation tools (list_directory,\n"
|
||||
f" sqlite_query, parse_registry_key, extract_file, etc.) — your\n"
|
||||
f" job is to direct workers, not to investigate yourself.\n"
|
||||
f" - You CANNOT call write tools (add_phenomenon, observe_identity,\n"
|
||||
f" link_to_entity, add_hypothesis, add_temporal_edge). All\n"
|
||||
f" evidence mutations come from the workers you dispatch.\n"
|
||||
f" - Every propose_lead MUST cite a real hyp-id from\n"
|
||||
f" graph_overview's table — fabricated ids will be rejected.\n"
|
||||
f" - Don't propose more than 3 leads in one round. Quality over\n"
|
||||
f" quantity — a 4th lead almost always means you're not really\n"
|
||||
f" sure what would move the graph.\n"
|
||||
f" - Don't re-propose a lead that's already pending. The system\n"
|
||||
f" deduplicates (motivating_hyp, expected_type, agent, source)\n"
|
||||
f" so duplicates silently no-op, but they waste your budget."
|
||||
)
|
||||
@@ -228,12 +228,27 @@ class BaseAgent:
|
||||
f"what you already found. Then end."
|
||||
),
|
||||
})
|
||||
# Narrow the retry tool surface so the agent can't wander off
|
||||
# to investigate again — only RECORD and read-only graph
|
||||
# query tools survive. Each grounding-rejected call burns one
|
||||
# iteration, so the cap is 30 (not the original 10): a
|
||||
# Timeline agent writing ~10 temporal edges with one rejection
|
||||
# apiece needs ~20 turns under the rewritten gateway.
|
||||
retry_tool_names = set(registered_mandatory) | {
|
||||
"list_phenomena", "list_assets", "search_graph",
|
||||
"add_temporal_edge", "link_to_entity", "add_lead",
|
||||
"add_hypothesis", "save_report",
|
||||
}
|
||||
retry_tools = [
|
||||
td for td in self.get_tool_definitions()
|
||||
if td["name"] in retry_tool_names
|
||||
]
|
||||
final_text, _ = await self.llm.tool_call_loop(
|
||||
messages=conversation,
|
||||
tools=self.get_tool_definitions(),
|
||||
tools=retry_tools,
|
||||
tool_executor=self._executors,
|
||||
system=system,
|
||||
max_iterations=10,
|
||||
max_iterations=30,
|
||||
terminal_tools=self.terminal_tools,
|
||||
)
|
||||
|
||||
|
||||
71
config.example.yaml
Normal file
71
config.example.yaml
Normal file
@@ -0,0 +1,71 @@
|
||||
# MASForensics Configuration — template.
|
||||
#
|
||||
# Copy this file to `config.yaml` and fill in your API key. config.yaml is
|
||||
# git-ignored so secrets don't land in commits. The two files share schema;
|
||||
# only this template is tracked.
|
||||
|
||||
agent:
|
||||
base_url: "https://api.deepseek.com"
|
||||
api_key: "YOUR-API-KEY-HERE"
|
||||
model: "deepseek-v4-pro"
|
||||
max_tokens: 16384
|
||||
reasoning_effort: "high" # DeepSeek/o1-style reasoning depth; omit to disable
|
||||
thinking_enabled: true # DeepSeek extra_body.thinking switch
|
||||
|
||||
# Maximum rounds of hypothesis-directed investigation (Phase 3).
|
||||
# Only consulted when strategist.enabled is false (legacy fallback path).
|
||||
max_investigation_rounds: 1
|
||||
|
||||
# Phase 3 strategist loop (DESIGN_STRATEGIST.md). When enabled, the
|
||||
# InvestigationStrategist agent decides each round whether to propose new
|
||||
# leads or declare the investigation complete. When disabled, the legacy
|
||||
# fixed-round investigation loop runs instead.
|
||||
strategist:
|
||||
enabled: true
|
||||
max_rounds: 10
|
||||
# Safety net: if the strategist keeps proposing leads but yield (new
|
||||
# phenomena + edges + status flips) is zero for this many consecutive
|
||||
# rounds, the orchestrator force-stops Phase 3 regardless.
|
||||
hard_stop_marginal_yield_zero_rounds: 3
|
||||
|
||||
# Hard caps that bound the whole run. The strategist's budget_status tool
|
||||
# reads these to pace its proposals; the orchestrator also enforces them
|
||||
# as hard stops (DESIGN_STRATEGIST.md §4.2 step 7). Comment out any cap
|
||||
# to make it unbounded.
|
||||
budgets:
|
||||
tool_calls_total: 5000
|
||||
strategist_rounds_max: 10
|
||||
wall_clock_minutes_max: 480
|
||||
|
||||
# Optional: override the per-edge-type log₁₀(LR) calibration table.
|
||||
# Confidence updates accumulate these in odds space (additive, order-
|
||||
# independent), then map back to probability via sigmoid. Single edge
|
||||
# magnitudes: ≥ +0.602 lifts confidence above the 0.8 supported threshold,
|
||||
# ≤ −0.602 drops it below the 0.2 refuted threshold.
|
||||
# If omitted, evidence_graph._DEFAULT_LOG_LR is used.
|
||||
# hypothesis_log_lr:
|
||||
# direct_evidence: 2.0
|
||||
# supports: 1.0
|
||||
# consequence_observed: 1.0
|
||||
# prerequisite_met: 0.5
|
||||
# weakens: -0.5
|
||||
# contradicts: -2.0
|
||||
|
||||
# Optional: manually specify initial hypotheses. If omitted, the
|
||||
# HypothesisAgent auto-generates them from Phase 1 findings.
|
||||
# hypotheses:
|
||||
# - title: "..."
|
||||
# description: "..."
|
||||
|
||||
# Investigation areas — LLM-derived from active hypotheses after Phase 2.
|
||||
# Each entry below acts as a MANUAL OVERRIDE: it is seeded into the graph
|
||||
# before the LLM derives areas, so manual entries always survive (slug-based
|
||||
# dedupe; LLM only augments keyword/tool lists, never overwrites).
|
||||
#
|
||||
# investigation_areas:
|
||||
# - area: shutdown_time
|
||||
# description: "Last recorded shutdown time"
|
||||
# agent: registry
|
||||
# priority: 3
|
||||
# keywords: [shutdown, last shutdown]
|
||||
# tools: [get_shutdown_time]
|
||||
@@ -417,7 +417,14 @@ class Edge:
|
||||
|
||||
@dataclass
|
||||
class Lead:
|
||||
"""An investigative lead that should be followed up by an agent."""
|
||||
"""An investigative lead that should be followed up by an agent.
|
||||
|
||||
Phase 1 agents create leads as "things outside my scope but worth chasing".
|
||||
The strategist (DESIGN_STRATEGIST.md) also creates leads, and additionally
|
||||
annotates each with the hypothesis it's meant to corroborate or refute plus
|
||||
the kind of edge it expects to produce — so the orchestrator can later
|
||||
measure "did this lead actually change any belief".
|
||||
"""
|
||||
|
||||
id: str
|
||||
target_agent: str
|
||||
@@ -426,13 +433,77 @@ class Lead:
|
||||
context: dict = field(default_factory=dict)
|
||||
status: str = "pending" # pending, assigned, completed, failed
|
||||
hypothesis_id: str | None = None
|
||||
# Strategist-loop annotations. proposed_by names the agent that created
|
||||
# the lead ("filesystem", "strategist", ...). motivating_hypothesis and
|
||||
# expected_evidence_type let the orchestrator measure marginal yield.
|
||||
# round_number is 0 for Phase 1 leads, ≥1 for strategist-produced leads.
|
||||
proposed_by: str = ""
|
||||
motivating_hypothesis: str = ""
|
||||
expected_evidence_type: str = ""
|
||||
round_number: int = 0
|
||||
|
||||
def to_dict(self) -> dict:
|
||||
return asdict(self)
|
||||
|
||||
@classmethod
|
||||
def from_dict(cls, d: dict) -> Lead:
|
||||
return cls(**d)
|
||||
# Forward-compatible: old state files predate the strategist annotations.
|
||||
known = set(cls.__dataclass_fields__)
|
||||
return cls(**{k: v for k, v in d.items() if k in known})
|
||||
|
||||
|
||||
@dataclass
|
||||
class InvestigationRound:
|
||||
"""One round of strategist-driven investigation.
|
||||
|
||||
DESIGN_STRATEGIST.md §1.2: provenance for the strategist's decisions. Each
|
||||
round records what hypothesis statuses looked like before vs. after, what
|
||||
leads were proposed, which actually got executed, and how many new
|
||||
phenomena/edges resulted. ``marginal_yield`` over recent rounds is what
|
||||
the strategist consults to decide whether to keep digging or declare
|
||||
complete.
|
||||
"""
|
||||
|
||||
id: str # "round-{nnn}"
|
||||
round_number: int
|
||||
started_at: str
|
||||
completed_at: str = ""
|
||||
strategist_action: str = "" # "propose_leads" | "declare_complete"
|
||||
leads_proposed: list[str] = field(default_factory=list)
|
||||
leads_executed: list[str] = field(default_factory=list)
|
||||
hypothesis_status_snapshot_before: dict = field(default_factory=dict)
|
||||
hypothesis_status_snapshot_after: dict = field(default_factory=dict)
|
||||
phenomena_count_before: int = 0
|
||||
phenomena_count_after: int = 0
|
||||
edges_count_before: int = 0
|
||||
edges_count_after: int = 0
|
||||
decision_rationale: str = ""
|
||||
|
||||
@property
|
||||
def new_phenomena_count(self) -> int:
|
||||
return max(0, self.phenomena_count_after - self.phenomena_count_before)
|
||||
|
||||
@property
|
||||
def new_edges_count(self) -> int:
|
||||
return max(0, self.edges_count_after - self.edges_count_before)
|
||||
|
||||
@property
|
||||
def status_flips(self) -> int:
|
||||
before = self.hypothesis_status_snapshot_before
|
||||
after = self.hypothesis_status_snapshot_after
|
||||
flips = 0
|
||||
for hid, after_status in after.items():
|
||||
if before.get(hid) and before.get(hid) != after_status:
|
||||
flips += 1
|
||||
return flips
|
||||
|
||||
def to_dict(self) -> dict:
|
||||
return asdict(self)
|
||||
|
||||
@classmethod
|
||||
def from_dict(cls, d: dict) -> InvestigationRound:
|
||||
known = set(cls.__dataclass_fields__)
|
||||
return cls(**{k: v for k, v in d.items() if k in known})
|
||||
|
||||
|
||||
@dataclass
|
||||
@@ -598,6 +669,26 @@ class EvidenceGraph:
|
||||
# claimed fact values against real tool outputs.
|
||||
self.tool_invocations: dict[str, ToolInvocation] = {}
|
||||
|
||||
# Investigation rounds — provenance for the strategist's per-round
|
||||
# decisions (DESIGN_STRATEGIST.md). Empty for runs that don't reach
|
||||
# Phase 3 or that disable the strategist via config.
|
||||
self.investigation_rounds: list[InvestigationRound] = []
|
||||
|
||||
# Budget config + run-start monotonic clock. Set by the orchestrator
|
||||
# when it boots; the budget_status strategy tool reads these. None
|
||||
# means unbounded / not yet started.
|
||||
self.budgets: dict[str, int] = {}
|
||||
self.run_start_monotonic: float | None = None
|
||||
|
||||
# Current strategist round number. Set by the orchestrator at the
|
||||
# top of each strategist loop iteration so propose_lead / declare_
|
||||
# investigation_complete can tag their actions correctly. 0 when
|
||||
# the strategist is not running.
|
||||
self.current_strategist_round: int = 0
|
||||
# Set to True by declare_investigation_complete so the orchestrator
|
||||
# knows to break out of the strategist loop after this round.
|
||||
self.strategist_complete_requested: bool = False
|
||||
|
||||
# _current_agent / _current_task_id are exposed as @property below,
|
||||
# backed by module-level ContextVars (race-free under asyncio.gather).
|
||||
|
||||
@@ -658,6 +749,9 @@ class EvidenceGraph:
|
||||
"tool_invocations": {
|
||||
iid: inv.to_dict() for iid, inv in self.tool_invocations.items()
|
||||
},
|
||||
"investigation_rounds": [
|
||||
r.to_dict() for r in self.investigation_rounds
|
||||
],
|
||||
"saved_at": datetime.now().isoformat(),
|
||||
}
|
||||
tmp = self._persist_path.with_suffix(".tmp")
|
||||
@@ -730,6 +824,10 @@ class EvidenceGraph:
|
||||
iid: ToolInvocation.from_dict(inv)
|
||||
for iid, inv in data.get("tool_invocations", {}).items()
|
||||
}
|
||||
graph.investigation_rounds = [
|
||||
InvestigationRound.from_dict(r)
|
||||
for r in data.get("investigation_rounds", [])
|
||||
]
|
||||
graph._rebuild_adjacency()
|
||||
logger.info(
|
||||
"EvidenceGraph restored: %d phenomena, %d hypotheses, %d entities, "
|
||||
@@ -1020,9 +1118,20 @@ class EvidenceGraph:
|
||||
**Idempotency**: if a ``(phenomenon, hypothesis, edge_type)`` edge
|
||||
already exists, this is a no-op — the same agent re-recording the
|
||||
same link (or two agents linking via the orchestrator's batch
|
||||
judge and a manual override) does not double-count. Independent
|
||||
evidence — *different* phenomena pointing the same way — still
|
||||
accumulates fully.
|
||||
judge and a manual override) does not double-count.
|
||||
|
||||
**Harmonic damping of repeated same-direction evidence** (added
|
||||
post first full-case run, 2026-05-20): independent evidence —
|
||||
different phenomena pointing the same way — still accumulates,
|
||||
but with diminishing returns: the k-th edge of the same
|
||||
``(hyp_id, edge_type)`` contributes ``log_lr_base / k``. After N
|
||||
same-direction edges the cumulative contribution is
|
||||
``log_lr_base · H_N`` (harmonic sum, grows as ln N). This
|
||||
formalises the naive-Bayes-breakdown DESIGN.md §4.5 calls out:
|
||||
"同一发现被多 agent 重复入图". Single-edge hypotheses are
|
||||
unaffected (k=1, damping = 1.0). Cross-direction edges (supports
|
||||
vs contradicts) keep their own independent counts so a strong
|
||||
contradicting fact still bites against piled-on supports.
|
||||
"""
|
||||
if edge_type not in self.edge_log_lr:
|
||||
raise ValueError(
|
||||
@@ -1045,7 +1154,17 @@ class EvidenceGraph:
|
||||
):
|
||||
return hyp.confidence
|
||||
|
||||
log_lr = self.edge_log_lr[edge_type]
|
||||
# Harmonic damping rank: count existing edges of the SAME
|
||||
# edge_type already incident on this hypothesis. The new edge
|
||||
# becomes the (rank+1)-th of its kind. _adj_rev is keyed by
|
||||
# target so this is O(in-degree(hyp)) without scanning all edges.
|
||||
existing_same_type = sum(
|
||||
1 for e in self._adj_rev.get(hyp_id, [])
|
||||
if e.edge_type == edge_type
|
||||
)
|
||||
rank = existing_same_type + 1
|
||||
log_lr_base = self.edge_log_lr[edge_type]
|
||||
log_lr = log_lr_base / rank
|
||||
old_log_odds = hyp.log_odds
|
||||
old_conf = hyp.confidence
|
||||
new_log_odds = old_log_odds + log_lr
|
||||
@@ -1065,7 +1184,9 @@ class EvidenceGraph:
|
||||
"timestamp": datetime.now().isoformat(),
|
||||
"phenomenon_id": phenomenon_id,
|
||||
"edge_type": edge_type,
|
||||
"log_lr": log_lr,
|
||||
"log_lr_base": log_lr_base,
|
||||
"rank": rank,
|
||||
"log_lr": round(log_lr, 4),
|
||||
"old_log_odds": round(old_log_odds, 4),
|
||||
"new_log_odds": round(new_log_odds, 4),
|
||||
"old_confidence": round(old_conf, 4),
|
||||
@@ -1474,9 +1595,29 @@ class EvidenceGraph:
|
||||
priority: int = 5,
|
||||
context: dict | None = None,
|
||||
hypothesis_id: str | None = None,
|
||||
proposed_by: str = "",
|
||||
motivating_hypothesis: str = "",
|
||||
expected_evidence_type: str = "",
|
||||
round_number: int = 0,
|
||||
) -> str:
|
||||
async with self._lock:
|
||||
lid = f"lead-{uuid.uuid4().hex[:8]}"
|
||||
# Idempotency for strategist proposals: identical
|
||||
# (motivating_hypothesis, expected_evidence_type, target_agent,
|
||||
# source_id) triple should not be created twice — this guards
|
||||
# against the "strategist loops on the same lead" failure mode.
|
||||
if motivating_hypothesis and proposed_by == "strategist":
|
||||
source_id = (context or {}).get("source_id", "")
|
||||
for existing in self.leads:
|
||||
if (
|
||||
existing.proposed_by == "strategist"
|
||||
and existing.motivating_hypothesis == motivating_hypothesis
|
||||
and existing.expected_evidence_type == expected_evidence_type
|
||||
and existing.target_agent == target_agent
|
||||
and (existing.context or {}).get("source_id", "") == source_id
|
||||
and existing.status in ("pending", "assigned")
|
||||
):
|
||||
return existing.id
|
||||
self.leads.append(Lead(
|
||||
id=lid,
|
||||
target_agent=target_agent,
|
||||
@@ -1484,10 +1625,89 @@ class EvidenceGraph:
|
||||
priority=priority,
|
||||
context=context or {},
|
||||
hypothesis_id=hypothesis_id,
|
||||
proposed_by=proposed_by,
|
||||
motivating_hypothesis=motivating_hypothesis,
|
||||
expected_evidence_type=expected_evidence_type,
|
||||
round_number=round_number,
|
||||
))
|
||||
self._auto_save()
|
||||
return lid
|
||||
|
||||
# ---- Investigation rounds (strategist loop) ----------------------------
|
||||
|
||||
async def start_investigation_round(
|
||||
self, round_number: int,
|
||||
) -> str:
|
||||
"""Open a new investigation round + capture pre-round snapshot.
|
||||
|
||||
Called by the orchestrator at the top of each strategist iteration.
|
||||
The snapshot records hypothesis status, phenomena count, and edges
|
||||
count so that ``complete_investigation_round`` can compute the
|
||||
round's yield deltas.
|
||||
"""
|
||||
async with self._lock:
|
||||
rid = f"round-{round_number:03d}"
|
||||
snapshot_before = {
|
||||
hid: h.status for hid, h in self.hypotheses.items()
|
||||
}
|
||||
self.investigation_rounds.append(InvestigationRound(
|
||||
id=rid,
|
||||
round_number=round_number,
|
||||
started_at=datetime.now().isoformat(),
|
||||
hypothesis_status_snapshot_before=snapshot_before,
|
||||
phenomena_count_before=len(self.phenomena),
|
||||
edges_count_before=len(self.edges),
|
||||
))
|
||||
self._auto_save()
|
||||
return rid
|
||||
|
||||
async def complete_investigation_round(
|
||||
self,
|
||||
round_id: str,
|
||||
strategist_action: str = "propose_leads",
|
||||
leads_executed: list[str] | None = None,
|
||||
decision_rationale: str = "",
|
||||
) -> InvestigationRound | None:
|
||||
"""Close a round, recording after-snapshot + which leads got executed.
|
||||
|
||||
Idempotent on already-closed rounds (returns the existing record).
|
||||
"""
|
||||
async with self._lock:
|
||||
for r in self.investigation_rounds:
|
||||
if r.id != round_id:
|
||||
continue
|
||||
if r.completed_at:
|
||||
return r
|
||||
r.completed_at = datetime.now().isoformat()
|
||||
r.strategist_action = strategist_action
|
||||
r.leads_executed = list(leads_executed or [])
|
||||
r.leads_proposed = [
|
||||
l.id for l in self.leads
|
||||
if l.round_number == r.round_number
|
||||
and l.proposed_by == "strategist"
|
||||
]
|
||||
r.hypothesis_status_snapshot_after = {
|
||||
hid: h.status for hid, h in self.hypotheses.items()
|
||||
}
|
||||
r.phenomena_count_after = len(self.phenomena)
|
||||
r.edges_count_after = len(self.edges)
|
||||
r.decision_rationale = decision_rationale
|
||||
self._auto_save()
|
||||
return r
|
||||
return None
|
||||
|
||||
def get_investigation_round(self, round_id: str) -> InvestigationRound | None:
|
||||
for r in self.investigation_rounds:
|
||||
if r.id == round_id:
|
||||
return r
|
||||
return None
|
||||
|
||||
def latest_round(self) -> InvestigationRound | None:
|
||||
return self.investigation_rounds[-1] if self.investigation_rounds else None
|
||||
|
||||
def leads_from_round(self, round_number: int) -> list[Lead]:
|
||||
return [l for l in self.leads if l.round_number == round_number]
|
||||
|
||||
async def get_pending_leads(self, agent_type: str | None = None) -> list[Lead]:
|
||||
async with self._lock:
|
||||
leads = [l for l in self.leads if l.status == "pending"]
|
||||
|
||||
@@ -148,6 +148,8 @@ READ_ONLY_TOOLS: set[str] = {
|
||||
"parse_ios_keychain", "read_idevice_info",
|
||||
# Android + media reads (S6) — set_active_partition is NOT read-only.
|
||||
"probe_android_partitions", "ocr_image",
|
||||
# Strategist view tools (DESIGN_STRATEGIST.md §2) — pure renders.
|
||||
"graph_overview", "source_coverage", "marginal_yield", "budget_status",
|
||||
}
|
||||
|
||||
|
||||
|
||||
340
orchestrator.py
340
orchestrator.py
@@ -119,6 +119,11 @@ class Orchestrator:
|
||||
self._failure_count = 0
|
||||
self._max_failures = 3
|
||||
self._start_time = datetime.now()
|
||||
# Make budgets visible to strategy tools via the graph object. The
|
||||
# budget_status tool reads graph.budgets / graph.run_start_monotonic
|
||||
# directly so it does not need a back-reference to the orchestrator.
|
||||
self.graph.budgets = dict(self.config.get("budgets", {}) or {})
|
||||
self.graph.run_start_monotonic = time.monotonic()
|
||||
|
||||
def _resolve_agent_type(self, agent_type: str) -> str:
|
||||
return AGENT_ALIASES.get(agent_type, agent_type)
|
||||
@@ -195,6 +200,298 @@ class Orchestrator:
|
||||
lead.context["retry"] = True
|
||||
await self._dispatch_leads_parallel(failed)
|
||||
|
||||
# ---- Phase 3: strategist loop (DESIGN_STRATEGIST.md §4) ------------------
|
||||
|
||||
def _budget_exceeded(self) -> bool:
|
||||
"""Hard budget enforcement, complementing strategist self-throttling.
|
||||
|
||||
Any of these triggers an immediate Phase 3 exit even if the
|
||||
strategist hasn't called declare_investigation_complete. Each cap
|
||||
is optional — leave it out of config to make it unbounded.
|
||||
"""
|
||||
b = self.graph.budgets or {}
|
||||
tc_cap = b.get("tool_calls_total")
|
||||
if tc_cap and len(self.graph.tool_invocations) >= tc_cap:
|
||||
return True
|
||||
wc_cap = b.get("wall_clock_minutes_max")
|
||||
if wc_cap and self.graph.run_start_monotonic is not None:
|
||||
elapsed_min = (time.monotonic() - self.graph.run_start_monotonic) / 60.0
|
||||
if elapsed_min >= wc_cap:
|
||||
return True
|
||||
return False
|
||||
|
||||
async def _execute_strategist_lead(self, lead, round_num: int) -> None:
|
||||
"""Dispatch one strategist-proposed lead to its target worker.
|
||||
|
||||
Unlike the legacy bulk dispatcher this runs leads serially so each
|
||||
worker run reads a graph that includes prior leads' findings — the
|
||||
strategist's next round can see the cumulative effect of this round.
|
||||
"""
|
||||
agent_type = AGENT_ALIASES.get(lead.target_agent, lead.target_agent)
|
||||
worker = self.factory.get_or_create_agent(agent_type)
|
||||
if worker is None:
|
||||
logger.warning(
|
||||
"No worker registered for lead %s: target_agent=%s",
|
||||
lead.id, agent_type,
|
||||
)
|
||||
lead.status = "failed"
|
||||
lead.context["failure_reason"] = f"no worker for agent type '{agent_type}'"
|
||||
self.graph._auto_save()
|
||||
return
|
||||
|
||||
source_id = (lead.context or {}).get("source_id", "")
|
||||
if source_id and self.graph.case is not None:
|
||||
src = self.graph.case.get_source(source_id)
|
||||
if src:
|
||||
self.graph.set_active_source(src)
|
||||
|
||||
rationale = (lead.context or {}).get("rationale", "")
|
||||
worker_task = (
|
||||
f"Investigate this specific lead from the strategist:\n\n"
|
||||
f"REQUEST: {lead.description}\n"
|
||||
f"MOTIVATING HYPOTHESIS: {lead.motivating_hypothesis or '(unspecified)'}\n"
|
||||
f"EXPECTED EVIDENCE TYPE: {lead.expected_evidence_type or '(unspecified)'}\n"
|
||||
f"RATIONALE: {rationale or '(unspecified)'}\n\n"
|
||||
f"After investigating, record findings via add_phenomenon AND "
|
||||
f"link relevant phenomena to "
|
||||
f"{lead.motivating_hypothesis or 'the motivating hypothesis'} via the "
|
||||
f"appropriate edge_type. If your investigation produces no relevant "
|
||||
f"finding, record that as a negative phenomenon so the strategist "
|
||||
f"can see the gap was probed."
|
||||
)
|
||||
_log(
|
||||
f"Round {round_num} dispatching: {lead.description[:80]}",
|
||||
event="dispatch", agent=agent_type, lead=lead.id,
|
||||
)
|
||||
lead.status = "assigned"
|
||||
self.graph._auto_save()
|
||||
try:
|
||||
await worker.run(worker_task, lead_id=lead.id)
|
||||
lead.status = "completed"
|
||||
except Exception as e:
|
||||
logger.error("Strategist lead %s failed: %s", lead.id, e, exc_info=True)
|
||||
lead.status = "failed"
|
||||
lead.context["failure_reason"] = str(e)
|
||||
finally:
|
||||
self.graph._auto_save()
|
||||
|
||||
async def _resume_strategist_state(self) -> int:
|
||||
"""Repair any open InvestigationRound after a resume and return the
|
||||
next round number to use.
|
||||
|
||||
An "open" round is one with ``started_at`` set but ``completed_at``
|
||||
empty — interrupted before its complete step. Mark it as completed
|
||||
with action=interrupted_resume so the run history is self-describing,
|
||||
and mark any leads still in the "assigned" state from that round as
|
||||
"failed" so the gap-analysis / retry paths can re-process them.
|
||||
Returns the round number the strategist loop should start from
|
||||
(1 + the highest existing round_number).
|
||||
"""
|
||||
if not self.graph.investigation_rounds:
|
||||
return 1
|
||||
highest = max(r.round_number for r in self.graph.investigation_rounds)
|
||||
last = self.graph.latest_round()
|
||||
if last is not None and not last.completed_at:
|
||||
assigned_in_round = [
|
||||
l for l in self.graph.leads
|
||||
if l.round_number == last.round_number
|
||||
and l.status == "assigned"
|
||||
]
|
||||
for lead in assigned_in_round:
|
||||
lead.status = "failed"
|
||||
lead.context["failure_reason"] = "interrupted before complete"
|
||||
await self.graph.complete_investigation_round(
|
||||
last.id, strategist_action="interrupted_resume",
|
||||
decision_rationale=(
|
||||
f"resume repair: this round was interrupted before "
|
||||
f"completion; {len(assigned_in_round)} assigned leads "
|
||||
f"have been re-marked as failed."
|
||||
),
|
||||
)
|
||||
logger.info(
|
||||
"Strategist resume: repaired open round R%d (closed %d assigned leads)",
|
||||
last.round_number, len(assigned_in_round),
|
||||
)
|
||||
return highest + 1
|
||||
|
||||
async def _phase3_strategist_loop(self) -> None:
|
||||
"""Belief-driven investigation: strategist proposes, workers execute,
|
||||
repeat. Replaces the legacy fixed-round investigation loop.
|
||||
"""
|
||||
_log("Phase 3: Strategist-Driven Investigation", event="phase")
|
||||
strategist_cfg = self.config.get("strategist", {}) or {}
|
||||
max_rounds = int(strategist_cfg.get("max_rounds", 10))
|
||||
zero_yield_cap = int(strategist_cfg.get("hard_stop_marginal_yield_zero_rounds", 3))
|
||||
|
||||
strategist = self.factory.get_or_create_agent("strategist")
|
||||
if strategist is None:
|
||||
logger.error(
|
||||
"InvestigationStrategist agent not registered — falling back "
|
||||
"to legacy Phase 3 loop. Check agent_factory._AGENT_CLASSES."
|
||||
)
|
||||
await self._phase3_legacy_loop()
|
||||
return
|
||||
|
||||
# Resume support: if we're restarting after an interruption, repair
|
||||
# any half-open round and pick up at the next number.
|
||||
start_round = await self._resume_strategist_state()
|
||||
if start_round > 1:
|
||||
_log(
|
||||
f"Resuming strategist loop at round {start_round} "
|
||||
f"(history: {len(self.graph.investigation_rounds)} prior rounds)",
|
||||
event="progress",
|
||||
)
|
||||
|
||||
zero_yield_streak = 0
|
||||
|
||||
for round_num in range(start_round, max_rounds + 1):
|
||||
# Reset per-round flags so a previous round's declare_complete
|
||||
# doesn't leak across iterations (defensive — strategist also
|
||||
# only sets True, never False).
|
||||
self.graph.strategist_complete_requested = False
|
||||
self.graph.current_strategist_round = round_num
|
||||
rid = await self.graph.start_investigation_round(round_num)
|
||||
_log(
|
||||
f"Strategist Round {round_num}/{max_rounds}", event="phase",
|
||||
round=round_num,
|
||||
)
|
||||
t0 = time.monotonic()
|
||||
|
||||
try:
|
||||
await strategist.run(
|
||||
f"Review the current investigation state and decide the "
|
||||
f"next action. This is round {round_num}/{max_rounds}. "
|
||||
f"Use graph_overview / marginal_yield / budget_status / "
|
||||
f"source_coverage to ground your decision, then call "
|
||||
f"propose_lead 1-3 times OR declare_investigation_complete."
|
||||
)
|
||||
except Exception as e:
|
||||
logger.error("Strategist round %d failed: %s", round_num, e, exc_info=True)
|
||||
await self.graph.complete_investigation_round(
|
||||
rid, decision_rationale=f"strategist crashed: {e}",
|
||||
)
|
||||
break
|
||||
|
||||
# Strategist declared complete → no leads execute, exit loop.
|
||||
if self.graph.strategist_complete_requested:
|
||||
_log(
|
||||
f"Strategist declared complete at round {round_num}",
|
||||
event="progress", elapsed=time.monotonic() - t0,
|
||||
)
|
||||
await self.graph.complete_investigation_round(
|
||||
rid, strategist_action="declare_complete",
|
||||
decision_rationale="strategist declare_investigation_complete",
|
||||
)
|
||||
break
|
||||
|
||||
# Collect this round's leads (proposed_by=strategist + matching round).
|
||||
new_leads = [
|
||||
l for l in self.graph.leads
|
||||
if l.round_number == round_num
|
||||
and l.proposed_by == "strategist"
|
||||
and l.status == "pending"
|
||||
]
|
||||
if not new_leads:
|
||||
_log(
|
||||
f"Round {round_num}: strategist proposed no new leads — exiting loop",
|
||||
event="progress", elapsed=time.monotonic() - t0,
|
||||
)
|
||||
await self.graph.complete_investigation_round(
|
||||
rid, strategist_action="no_leads",
|
||||
decision_rationale="strategist proposed no new leads",
|
||||
)
|
||||
break
|
||||
|
||||
# Dispatch each lead to its worker.
|
||||
for lead in new_leads:
|
||||
await self._execute_strategist_lead(lead, round_num)
|
||||
|
||||
# After workers run, judge any new phenomena against existing
|
||||
# hypotheses (so confidence updates happen before the next round
|
||||
# of strategist reasoning).
|
||||
if self.graph.phenomena and self.graph.hypotheses:
|
||||
await self._judge_new_phenomena()
|
||||
|
||||
closed = await self.graph.complete_investigation_round(
|
||||
rid, strategist_action="propose_leads",
|
||||
leads_executed=[l.id for l in new_leads],
|
||||
)
|
||||
|
||||
# Show round outcome.
|
||||
for h in self.graph.hypotheses.values():
|
||||
_log(f" {h.summary()}", event="hypothesis")
|
||||
_log(
|
||||
_progress_summary(self.graph) + f" (yield: +{closed.new_phenomena_count}ph, +{closed.new_edges_count}edges, {closed.status_flips}flips)",
|
||||
event="progress", elapsed=time.monotonic() - t0,
|
||||
)
|
||||
|
||||
# Marginal-yield hard stop. Distinct from strategist self-throttle:
|
||||
# if the strategist insists on continuing through repeated dry
|
||||
# rounds, force-stop. This protects against an over-eager
|
||||
# strategist + a confused worker that produces no edges.
|
||||
yield_total = (
|
||||
closed.new_phenomena_count
|
||||
+ closed.new_edges_count
|
||||
+ closed.status_flips
|
||||
)
|
||||
if yield_total == 0:
|
||||
zero_yield_streak += 1
|
||||
if zero_yield_streak >= zero_yield_cap:
|
||||
_log(
|
||||
f"Hard stop: {zero_yield_streak} consecutive "
|
||||
f"zero-yield rounds (cap {zero_yield_cap})",
|
||||
event="progress",
|
||||
)
|
||||
break
|
||||
else:
|
||||
zero_yield_streak = 0
|
||||
|
||||
if self._budget_exceeded():
|
||||
_log(
|
||||
f"Budget exhausted after round {round_num} — exiting Phase 3",
|
||||
event="progress",
|
||||
)
|
||||
break
|
||||
else:
|
||||
_log(
|
||||
f"Strategist max_rounds={max_rounds} reached", event="progress",
|
||||
)
|
||||
|
||||
# Always reset the round counter on exit so subsequent runs don't
|
||||
# inherit the last value.
|
||||
self.graph.current_strategist_round = 0
|
||||
|
||||
async def _phase3_legacy_loop(self) -> None:
|
||||
"""Legacy fixed-round Phase 3 — preserved for fallback / regression.
|
||||
|
||||
Engaged when config has ``strategist.enabled: false`` or when the
|
||||
strategist agent class is somehow not registered. Behaves identically
|
||||
to the pre-DESIGN_STRATEGIST orchestrator: bounded iteration,
|
||||
hypothesis-derived leads, parallel dispatch, gap analysis.
|
||||
"""
|
||||
max_rounds = self.config.get("max_investigation_rounds", 5)
|
||||
for round_num in range(max_rounds):
|
||||
_log(f"Phase 3: Investigation Round {round_num}", event="phase")
|
||||
t0 = time.monotonic()
|
||||
|
||||
if self.graph.hypotheses_converged():
|
||||
_log("All hypotheses converged — stopping", event="progress")
|
||||
break
|
||||
|
||||
await self._generate_hypothesis_leads()
|
||||
|
||||
pending = await self.graph.get_pending_leads()
|
||||
if not pending:
|
||||
_log("No pending leads — round complete", event="progress")
|
||||
break
|
||||
|
||||
await self._dispatch_leads_parallel(pending)
|
||||
await self._judge_new_phenomena()
|
||||
|
||||
for h in self.graph.hypotheses.values():
|
||||
_log(f" {h.summary()}", event="hypothesis")
|
||||
_log(_progress_summary(self.graph), event="progress", elapsed=time.monotonic() - t0)
|
||||
|
||||
# ---- Hypothesis generation -----------------------------------------------
|
||||
|
||||
async def _generate_hypotheses_manual(self, hypotheses_config: list[dict]) -> None:
|
||||
@@ -881,39 +1178,26 @@ class Orchestrator:
|
||||
event="progress", elapsed=time.monotonic() - t0,
|
||||
)
|
||||
|
||||
# Phase 3: Hypothesis-directed investigation (iterative)
|
||||
# Phase 3: Strategist-driven investigation (DESIGN_STRATEGIST.md)
|
||||
if resume_phase <= 3:
|
||||
max_rounds = self.config.get("max_investigation_rounds", 5)
|
||||
for round_num in range(max_rounds):
|
||||
_log(f"Phase 3: Investigation Round {round_num}", event="phase")
|
||||
t0 = time.monotonic()
|
||||
strategist_cfg = self.config.get("strategist", {}) or {}
|
||||
strategist_enabled = strategist_cfg.get("enabled", True)
|
||||
if strategist_enabled:
|
||||
await self._phase3_strategist_loop()
|
||||
else:
|
||||
# Legacy fallback — keep the old hypothesis-directed
|
||||
# iterative loop available for runs that explicitly
|
||||
# disable the strategist (debugging, regression
|
||||
# comparison, or environments without the strategist
|
||||
# agent registered).
|
||||
await self._phase3_legacy_loop()
|
||||
|
||||
if self.graph.hypotheses_converged():
|
||||
_log("All hypotheses converged — stopping", event="progress")
|
||||
break
|
||||
|
||||
await self._generate_hypothesis_leads()
|
||||
|
||||
pending = await self.graph.get_pending_leads()
|
||||
if not pending:
|
||||
_log("No pending leads — round complete", event="progress")
|
||||
break
|
||||
|
||||
await self._dispatch_leads_parallel(pending)
|
||||
await self._judge_new_phenomena()
|
||||
|
||||
# Show hypothesis status update
|
||||
for h in self.graph.hypotheses.values():
|
||||
_log(f" {h.summary()}", event="hypothesis")
|
||||
_log(_progress_summary(self.graph), event="progress", elapsed=time.monotonic() - t0)
|
||||
|
||||
# Retry failed leads
|
||||
# Retry failed leads + Gap Analysis run regardless of which
|
||||
# Phase 3 variant was used — they operate on the leads/
|
||||
# hypothesis graph the strategist loop leaves behind.
|
||||
await self._retry_failed_leads()
|
||||
|
||||
# Gap analysis
|
||||
_log("Phase 3: Gap Analysis", event="phase")
|
||||
await self._run_gap_analysis()
|
||||
|
||||
self.graph.mark_remaining_inconclusive()
|
||||
|
||||
# Phase 4: Timeline construction
|
||||
|
||||
@@ -9,6 +9,7 @@ import pytest
|
||||
|
||||
from evidence_graph import (
|
||||
EvidenceGraph, Phenomenon, Hypothesis, Lead, GroundingError,
|
||||
InvestigationRound,
|
||||
_compute_quality_score, _jaccard_similarity,
|
||||
prob_to_log_odds, log_odds_to_prob,
|
||||
)
|
||||
@@ -1332,6 +1333,69 @@ class TestInvestigationAreaDerivation:
|
||||
assert edge_added["n"] == 1
|
||||
assert agent._record_call_counts["add_temporal_edge"] == 1
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_forced_retry_uses_higher_cap_and_narrowed_tools(self):
|
||||
"""The forced RECORD retry must (a) get a generous iter cap so that
|
||||
grounding-rejected retries don't blow the budget, and (b) hand the
|
||||
LLM a tool surface restricted to RECORD + read-only graph tools so
|
||||
it can't wander back into investigation.
|
||||
"""
|
||||
from unittest.mock import AsyncMock
|
||||
from base_agent import BaseAgent
|
||||
|
||||
graph = EvidenceGraph()
|
||||
llm = AsyncMock()
|
||||
|
||||
# Capture per-call kwargs of tool_call_loop so we can assert what
|
||||
# the retry round received.
|
||||
call_kwargs: list[dict] = []
|
||||
|
||||
async def real_add_edge(**kw):
|
||||
return None
|
||||
|
||||
class TimelineLike(BaseAgent):
|
||||
mandatory_record_tools = ("add_temporal_edge",)
|
||||
|
||||
agent = TimelineLike(llm, graph)
|
||||
agent.name = "timeline_like"
|
||||
agent._register_graph_tools = lambda: None
|
||||
agent.register_tool("add_temporal_edge", "", {}, real_add_edge)
|
||||
# An investigation-style tool the retry must NOT expose.
|
||||
async def real_inv(**kw): return ""
|
||||
agent.register_tool("list_directory", "", {}, real_inv)
|
||||
# A read-only graph query — should remain available in retry.
|
||||
async def real_ro(**kw): return ""
|
||||
agent.register_tool("list_phenomena", "", {}, real_ro)
|
||||
|
||||
async def fake_tool_call_loop(messages, tools, tool_executor, system, max_iterations=40, terminal_tools=()):
|
||||
call_kwargs.append({
|
||||
"tools": [t["name"] for t in tools],
|
||||
"max_iterations": max_iterations,
|
||||
})
|
||||
already_retrying = any(
|
||||
"STOP." in (m.get("content", "") if isinstance(m, dict) else "")
|
||||
for m in messages
|
||||
)
|
||||
if not already_retrying:
|
||||
return "no record", list(messages)
|
||||
await tool_executor["add_temporal_edge"]()
|
||||
return "recorded.", []
|
||||
|
||||
llm.tool_call_loop = fake_tool_call_loop
|
||||
await agent.run("build timeline")
|
||||
|
||||
assert len(call_kwargs) == 2
|
||||
first_call, retry_call = call_kwargs
|
||||
# First call: full tool surface, default iter cap.
|
||||
assert "list_directory" in first_call["tools"]
|
||||
# Retry call: investigation tool dropped, mandatory + read-only kept.
|
||||
assert "list_directory" not in retry_call["tools"]
|
||||
assert "add_temporal_edge" in retry_call["tools"]
|
||||
assert "list_phenomena" in retry_call["tools"]
|
||||
# Iter cap on the retry is now generous — 10 was empirically too tight
|
||||
# because grounding-rejected calls burn iterations.
|
||||
assert retry_call["max_iterations"] >= 30
|
||||
|
||||
# ---- terminal_tools: real LLMClient.tool_call_loop short-circuit -----
|
||||
|
||||
@pytest.mark.asyncio
|
||||
@@ -1957,8 +2021,9 @@ class TestLogOddsConfidence:
|
||||
# All three orderings must agree exactly.
|
||||
assert confs[0] == pytest.approx(confs[1])
|
||||
assert confs[1] == pytest.approx(confs[2])
|
||||
# And the value should be 1 + 1 − 0.5 = 1.5 → sigmoid ≈ 0.9694
|
||||
assert confs[0] == pytest.approx(0.9694, abs=1e-3)
|
||||
# With harmonic damping: 2 supports → 1.0 + 0.5 = +1.5, 1 weakens
|
||||
# → −0.5, net log_odds = +1.0 → sigmoid 1/(1+10^-1) ≈ 0.9091.
|
||||
assert confs[0] == pytest.approx(0.9091, abs=1e-3)
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_each_edge_type_calibrated(self, graph):
|
||||
@@ -2019,17 +2084,91 @@ class TestLogOddsConfidence:
|
||||
assert sum(1 for e in graph.edges if e.edge_type == "supports") == first_edges
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_independent_evidence_accumulates(self, graph):
|
||||
"""Distinct phenomena with same edge_type DO accumulate (independent)."""
|
||||
async def test_repeated_same_direction_evidence_dampens_harmonically(self, graph):
|
||||
"""Distinct phenomena with same edge_type accumulate WITH harmonic
|
||||
damping (post 2026-05-20 full-case-run fix). The k-th edge of a
|
||||
given (hyp, edge_type) contributes log_lr_base/k; after 3 supports
|
||||
edges the total is 1.0 · (1 + 1/2 + 1/3) ≈ 1.833, not 3.0. This
|
||||
formalises the breakdown of the naive-Bayes independence
|
||||
assumption when multiple agents pile on the same finding.
|
||||
"""
|
||||
hid = await graph.add_hypothesis("h", "d")
|
||||
for i in range(3):
|
||||
pid, _ = await graph.add_phenomenon(
|
||||
"fs", "filesystem", f"ph {i}", f"d {i}", source_tool="t",
|
||||
)
|
||||
await graph.update_hypothesis_confidence(hid, pid, "supports", "")
|
||||
# 3 × +1.0 = +3.0 log_odds → conf ≈ 0.999
|
||||
assert graph.hypotheses[hid].log_odds == pytest.approx(3.0)
|
||||
assert graph.hypotheses[hid].confidence > 0.99
|
||||
# 1.0 · H_3 = 1.0 + 0.5 + 0.3333... = 1.8333
|
||||
expected = 1.0 + 0.5 + (1.0 / 3.0)
|
||||
assert graph.hypotheses[hid].log_odds == pytest.approx(expected, abs=1e-6)
|
||||
# Still above the 0.8 supported threshold (good — 3 independent
|
||||
# observations remain strong evidence) but well below the runaway
|
||||
# confidence pre-fix produced.
|
||||
assert graph.hypotheses[hid].status == "supported"
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_first_edge_undamped(self, graph):
|
||||
"""Damping is rank-1 → /1 → no change. A hypothesis with a single
|
||||
edge must contribute the full calibrated log_lr value (otherwise the
|
||||
whole calibration table would be off by a factor)."""
|
||||
hid = await graph.add_hypothesis("h", "d")
|
||||
pid, _ = await graph.add_phenomenon(
|
||||
"fs", "filesystem", "lone", "d", source_tool="t",
|
||||
)
|
||||
await graph.update_hypothesis_confidence(hid, pid, "direct_evidence", "")
|
||||
assert graph.hypotheses[hid].log_odds == pytest.approx(2.0)
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_damping_independent_per_edge_type(self, graph):
|
||||
"""Damping rank is keyed per (hyp, edge_type) — supports' counter
|
||||
does NOT advance direct_evidence's counter."""
|
||||
hid = await graph.add_hypothesis("h", "d")
|
||||
for i in range(5):
|
||||
pid, _ = await graph.add_phenomenon(
|
||||
"fs", "filesystem", f"s{i}", "d", source_tool="t",
|
||||
)
|
||||
await graph.update_hypothesis_confidence(hid, pid, "supports", "")
|
||||
log_after_supports = graph.hypotheses[hid].log_odds
|
||||
# H_5 = 1 + 1/2 + 1/3 + 1/4 + 1/5 = 2.2833
|
||||
assert log_after_supports == pytest.approx(2.2833, abs=1e-3)
|
||||
|
||||
pid_de, _ = await graph.add_phenomenon(
|
||||
"fs", "filesystem", "de", "d", source_tool="t",
|
||||
)
|
||||
await graph.update_hypothesis_confidence(hid, pid_de, "direct_evidence", "")
|
||||
delta = graph.hypotheses[hid].log_odds - log_after_supports
|
||||
assert delta == pytest.approx(2.0, abs=1e-6)
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_damping_independent_per_hypothesis(self, graph):
|
||||
"""The rank counter is per (hyp, edge_type), so two hypotheses
|
||||
each receiving 'supports' edges accumulate independently."""
|
||||
h1 = await graph.add_hypothesis("h1", "d")
|
||||
h2 = await graph.add_hypothesis("h2", "d")
|
||||
for i in range(3):
|
||||
pid, _ = await graph.add_phenomenon(
|
||||
"fs", "filesystem", f"to-h1 {i}", "d", source_tool="t",
|
||||
)
|
||||
await graph.update_hypothesis_confidence(h1, pid, "supports", "")
|
||||
pid_h2, _ = await graph.add_phenomenon(
|
||||
"fs", "filesystem", "to-h2", "d", source_tool="t",
|
||||
)
|
||||
await graph.update_hypothesis_confidence(h2, pid_h2, "supports", "")
|
||||
assert graph.hypotheses[h2].log_odds == pytest.approx(1.0)
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_damping_rank_persists_in_confidence_log(self, graph):
|
||||
"""The rank used for damping must be recorded so the math is
|
||||
auditable from the persisted graph (no need to recompute)."""
|
||||
hid = await graph.add_hypothesis("h", "d")
|
||||
for i in range(2):
|
||||
pid, _ = await graph.add_phenomenon(
|
||||
"fs", "filesystem", f"p {i}", "d", source_tool="t",
|
||||
)
|
||||
await graph.update_hypothesis_confidence(hid, pid, "supports", "")
|
||||
entries = graph.hypotheses[hid].confidence_log
|
||||
assert entries[0]["rank"] == 1 and entries[0]["log_lr"] == pytest.approx(1.0)
|
||||
assert entries[1]["rank"] == 2 and entries[1]["log_lr"] == pytest.approx(0.5)
|
||||
@pytest.mark.asyncio
|
||||
async def test_prior_prob_shifts_starting_log_odds(self, graph):
|
||||
# prior 0.9 → log_odds ≈ +0.954
|
||||
@@ -2898,3 +3037,641 @@ class TestOrchestratorMultiSource:
|
||||
assert Orchestrator._is_analysable(ok_media)
|
||||
assert not Orchestrator._is_analysable(no_path)
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Strategist loop foundation (DESIGN_STRATEGIST.md §1)
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
class TestLeadExtensionsForStrategist:
|
||||
"""Lead now carries strategist-loop annotations: proposed_by,
|
||||
motivating_hypothesis, expected_evidence_type, round_number. Old state
|
||||
files predate these fields — from_dict must accept them missing.
|
||||
"""
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_add_lead_records_strategist_annotations(self):
|
||||
graph = EvidenceGraph()
|
||||
hid = await graph.add_hypothesis("h", "d")
|
||||
lid = await graph.add_lead(
|
||||
target_agent="filesystem",
|
||||
description="Check Safari bookmarks for device-switching evidence",
|
||||
proposed_by="strategist",
|
||||
motivating_hypothesis=hid,
|
||||
expected_evidence_type="supports",
|
||||
round_number=2,
|
||||
context={"source_id": "src-ios-chan"},
|
||||
)
|
||||
lead = next(l for l in graph.leads if l.id == lid)
|
||||
assert lead.proposed_by == "strategist"
|
||||
assert lead.motivating_hypothesis == hid
|
||||
assert lead.expected_evidence_type == "supports"
|
||||
assert lead.round_number == 2
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_strategist_lead_idempotency(self):
|
||||
"""Same (motivating_hyp, expected_type, target_agent, source_id) from
|
||||
the strategist should NOT create a duplicate while pending.
|
||||
"""
|
||||
graph = EvidenceGraph()
|
||||
hid = await graph.add_hypothesis("h", "d")
|
||||
first = await graph.add_lead(
|
||||
target_agent="filesystem", description="probe X",
|
||||
proposed_by="strategist", motivating_hypothesis=hid,
|
||||
expected_evidence_type="supports", round_number=1,
|
||||
context={"source_id": "src-A"},
|
||||
)
|
||||
again = await graph.add_lead(
|
||||
target_agent="filesystem", description="probe X (rephrased)",
|
||||
proposed_by="strategist", motivating_hypothesis=hid,
|
||||
expected_evidence_type="supports", round_number=2,
|
||||
context={"source_id": "src-A"},
|
||||
)
|
||||
assert first == again
|
||||
assert len(graph.leads) == 1
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_non_strategist_leads_not_deduped(self):
|
||||
"""Phase 1 worker leads (proposed_by != 'strategist') should NOT be
|
||||
deduped — agents can legitimately propose the same kind of lead
|
||||
multiple times from different contexts.
|
||||
"""
|
||||
graph = EvidenceGraph()
|
||||
hid = await graph.add_hypothesis("h", "d")
|
||||
a = await graph.add_lead(
|
||||
target_agent="filesystem", description="x", proposed_by="filesystem",
|
||||
motivating_hypothesis=hid, expected_evidence_type="supports",
|
||||
)
|
||||
b = await graph.add_lead(
|
||||
target_agent="filesystem", description="x", proposed_by="filesystem",
|
||||
motivating_hypothesis=hid, expected_evidence_type="supports",
|
||||
)
|
||||
assert a != b
|
||||
assert len(graph.leads) == 2
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_lead_from_old_state_file_loads_with_defaults(self, tmp_path):
|
||||
"""Forward-compat: a state file written before the strategist fields
|
||||
existed must still load. The new fields take their defaults.
|
||||
"""
|
||||
legacy = {
|
||||
"id": "lead-legacy01",
|
||||
"target_agent": "filesystem",
|
||||
"description": "old-style lead",
|
||||
"priority": 5,
|
||||
"context": {},
|
||||
"status": "pending",
|
||||
"hypothesis_id": None,
|
||||
}
|
||||
lead = Lead.from_dict(legacy)
|
||||
assert lead.proposed_by == ""
|
||||
assert lead.motivating_hypothesis == ""
|
||||
assert lead.round_number == 0
|
||||
|
||||
|
||||
class TestInvestigationRound:
|
||||
"""The InvestigationRound provenance node + start/complete lifecycle."""
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_round_lifecycle_captures_before_and_after(self):
|
||||
graph = EvidenceGraph()
|
||||
h1 = await graph.add_hypothesis("h1", "d")
|
||||
h2 = await graph.add_hypothesis("h2", "d")
|
||||
rid = await graph.start_investigation_round(1)
|
||||
r = graph.get_investigation_round(rid)
|
||||
assert r is not None
|
||||
assert r.round_number == 1
|
||||
assert r.hypothesis_status_snapshot_before == {h1: "active", h2: "active"}
|
||||
assert r.phenomena_count_before == 0
|
||||
assert r.completed_at == ""
|
||||
|
||||
pid, _ = await graph.add_phenomenon(
|
||||
"fs", "filesystem", "found something", "interp", source_tool="t",
|
||||
)
|
||||
await graph.update_hypothesis_confidence(h1, pid, "direct_evidence", "")
|
||||
|
||||
closed = await graph.complete_investigation_round(
|
||||
rid, strategist_action="propose_leads",
|
||||
decision_rationale="found new evidence for h1",
|
||||
)
|
||||
assert closed is not None
|
||||
assert closed.completed_at != ""
|
||||
assert closed.hypothesis_status_snapshot_after[h1] == "supported"
|
||||
assert closed.hypothesis_status_snapshot_after[h2] == "active"
|
||||
assert closed.new_phenomena_count == 1
|
||||
assert closed.new_edges_count == 1
|
||||
assert closed.status_flips == 1
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_complete_round_is_idempotent(self):
|
||||
graph = EvidenceGraph()
|
||||
rid = await graph.start_investigation_round(1)
|
||||
first = await graph.complete_investigation_round(rid)
|
||||
await graph.add_phenomenon("fs", "x", "y", "z", source_tool="t")
|
||||
second = await graph.complete_investigation_round(rid)
|
||||
assert first is second
|
||||
assert first.phenomena_count_after == 0
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_leads_from_round_filters_correctly(self):
|
||||
graph = EvidenceGraph()
|
||||
hid = await graph.add_hypothesis("h", "d")
|
||||
await graph.add_lead(
|
||||
target_agent="filesystem", description="r1 lead",
|
||||
proposed_by="strategist", motivating_hypothesis=hid,
|
||||
expected_evidence_type="supports", round_number=1,
|
||||
)
|
||||
await graph.add_lead(
|
||||
target_agent="filesystem", description="r2 lead",
|
||||
proposed_by="strategist", motivating_hypothesis=hid,
|
||||
expected_evidence_type="supports", round_number=2,
|
||||
context={"source_id": "src-different"},
|
||||
)
|
||||
await graph.add_lead(
|
||||
target_agent="ios_artifact", description="phase 1 finding",
|
||||
proposed_by="filesystem", round_number=0,
|
||||
)
|
||||
r1 = graph.leads_from_round(1)
|
||||
r2 = graph.leads_from_round(2)
|
||||
r0 = graph.leads_from_round(0)
|
||||
assert len(r1) == 1 and r1[0].description == "r1 lead"
|
||||
assert len(r2) == 1 and r2[0].description == "r2 lead"
|
||||
assert len(r0) == 1 and r0[0].proposed_by == "filesystem"
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_round_persistence_round_trip(self, tmp_path):
|
||||
"""Investigation rounds must survive save/load."""
|
||||
path = tmp_path / "state.json"
|
||||
graph = EvidenceGraph(persist_path=path)
|
||||
hid = await graph.add_hypothesis("h", "d")
|
||||
rid = await graph.start_investigation_round(1)
|
||||
await graph.complete_investigation_round(
|
||||
rid, decision_rationale="probe complete",
|
||||
)
|
||||
loaded = EvidenceGraph.load_state(path)
|
||||
assert len(loaded.investigation_rounds) == 1
|
||||
r = loaded.investigation_rounds[0]
|
||||
assert r.id == rid
|
||||
assert r.decision_rationale == "probe complete"
|
||||
assert hid in r.hypothesis_status_snapshot_before
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_strategy_tool_helpers(self):
|
||||
"""Smoke-test the strategy tool renders against a small graph.
|
||||
Renders are markdown strings — we assert structural anchors rather
|
||||
than full text so the test is robust to wording tweaks.
|
||||
"""
|
||||
from tools import strategy
|
||||
from case import Case, EvidenceSource
|
||||
|
||||
graph = EvidenceGraph()
|
||||
src = EvidenceSource(
|
||||
id="src-test", label="test iOS", type="mobile_extraction",
|
||||
access_mode="tree", path="/tmp/x",
|
||||
)
|
||||
graph.case = Case(case_id="c", name="c", sources=[src])
|
||||
graph.set_active_source(src)
|
||||
graph._current_agent = "ios_artifact"
|
||||
graph._current_task_id = "task-1"
|
||||
|
||||
# graph_overview on empty graph still renders.
|
||||
ov = strategy.graph_overview(graph)
|
||||
assert "# Investigation State" in ov
|
||||
assert "_(none yet" in ov # hypotheses section
|
||||
assert "src-test" in ov
|
||||
|
||||
# Source coverage — every entry should be ✗ initially.
|
||||
cov = strategy.source_coverage(graph, "src-test")
|
||||
assert "Coverage: **0/" in cov
|
||||
assert "✗" in cov
|
||||
assert "Coverage hints are heuristics" in cov
|
||||
|
||||
# Record one invocation that matches the AddressBook detector.
|
||||
await graph.record_tool_invocation(
|
||||
tool="sqlite_query",
|
||||
args={"db_path": "/x/var/mobile/Library/AddressBook/AddressBook.sqlitedb"},
|
||||
output="contact list",
|
||||
)
|
||||
cov2 = strategy.source_coverage(graph, "src-test")
|
||||
assert "Coverage: **1/" in cov2 or "Coverage: **2/" in cov2
|
||||
|
||||
# marginal_yield: no rounds → empty render.
|
||||
my = strategy.marginal_yield(graph)
|
||||
assert "no completed investigation rounds" in my
|
||||
|
||||
# budget_status with no budgets shows unbounded.
|
||||
bs = strategy.budget_status(graph, None, None)
|
||||
assert "tool_calls" in bs
|
||||
assert "(unbounded)" in bs
|
||||
|
||||
# budget_status with budgets + pacing hint.
|
||||
bs2 = strategy.budget_status(
|
||||
graph,
|
||||
{"tool_calls_total": 1, "strategist_rounds_max": 1},
|
||||
None,
|
||||
)
|
||||
assert "≥ 90%" in bs2 # already over 90% (1 of 1 tool calls used)
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_strategist_agent_registers_correct_toolset(self):
|
||||
"""Strategist gets read-only graph queries + the 6 strategy tools;
|
||||
crucially NO graph-write tools (no add_phenomenon, observe_identity,
|
||||
link_to_entity, add_hypothesis, add_temporal_edge).
|
||||
"""
|
||||
from tool_registry import register_all_tools
|
||||
from agent_factory import AgentFactory
|
||||
from llm_client import LLMClient
|
||||
|
||||
graph = EvidenceGraph()
|
||||
register_all_tools(graph)
|
||||
llm = LLMClient.__new__(LLMClient)
|
||||
factory = AgentFactory(llm, graph)
|
||||
agent = factory.get_or_create_agent("strategist")
|
||||
agent._register_graph_tools()
|
||||
|
||||
registered = set(agent._tools.keys())
|
||||
assert {
|
||||
"graph_overview", "source_coverage", "marginal_yield",
|
||||
"budget_status", "propose_lead", "declare_investigation_complete",
|
||||
} <= registered
|
||||
assert {"list_phenomena", "get_phenomenon", "search_graph"} <= registered
|
||||
forbidden = {
|
||||
"add_phenomenon", "observe_identity", "link_to_entity",
|
||||
"add_hypothesis", "add_temporal_edge", "add_lead",
|
||||
}
|
||||
leaked = registered & forbidden
|
||||
assert not leaked, f"Strategist must not have write tools: {leaked}"
|
||||
|
||||
def test_strategist_terminal_tool_is_declare_complete(self):
|
||||
"""The strategist class declares declare_investigation_complete as
|
||||
its terminal tool — the tool_call_loop must short-circuit on that
|
||||
call (verified at the LLM client level by an existing test).
|
||||
"""
|
||||
from agents.strategist import InvestigationStrategist
|
||||
assert InvestigationStrategist.terminal_tools == ("declare_investigation_complete",)
|
||||
assert "propose_lead" in InvestigationStrategist.mandatory_record_tools
|
||||
assert "declare_investigation_complete" in InvestigationStrategist.mandatory_record_tools
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_propose_lead_validates_hypothesis_id(self):
|
||||
"""propose_lead must reject leads whose motivating_hypothesis isn't
|
||||
actually a registered hypothesis — that's a strategist hallucination
|
||||
analogous to citing a bogus invocation_id.
|
||||
"""
|
||||
from tool_registry import register_all_tools, TOOL_CATALOG
|
||||
graph = EvidenceGraph()
|
||||
graph._current_agent = "strategist"
|
||||
graph._current_task_id = "task-strat-1"
|
||||
graph.current_strategist_round = 1
|
||||
register_all_tools(graph)
|
||||
td = TOOL_CATALOG["propose_lead"]
|
||||
result = await td.executor(
|
||||
description="probe X",
|
||||
target_agent="filesystem",
|
||||
motivating_hypothesis="hyp-does-not-exist",
|
||||
expected_evidence_type="supports",
|
||||
)
|
||||
assert "not in graph.hypotheses" in result
|
||||
assert not graph.leads
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_propose_lead_creates_strategist_lead(self):
|
||||
"""propose_lead happy path writes a strategist-attributed lead
|
||||
tagged with the current round_number."""
|
||||
from tool_registry import register_all_tools, TOOL_CATALOG
|
||||
graph = EvidenceGraph()
|
||||
graph._current_agent = "strategist"
|
||||
graph._current_task_id = "task-strat-2"
|
||||
graph.current_strategist_round = 3
|
||||
hid = await graph.add_hypothesis("h", "d")
|
||||
register_all_tools(graph)
|
||||
td = TOOL_CATALOG["propose_lead"]
|
||||
result = await td.executor(
|
||||
description="check Safari bookmarks",
|
||||
target_agent="ios_artifact",
|
||||
motivating_hypothesis=hid,
|
||||
expected_evidence_type="supports",
|
||||
rationale="single-source hypothesis needs corroboration",
|
||||
)
|
||||
assert "proposed" in result
|
||||
lead = graph.leads[0]
|
||||
assert lead.proposed_by == "strategist"
|
||||
assert lead.motivating_hypothesis == hid
|
||||
assert lead.round_number == 3
|
||||
assert lead.expected_evidence_type == "supports"
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_propose_lead_rejects_invalid_evidence_type(self):
|
||||
from tool_registry import register_all_tools, TOOL_CATALOG
|
||||
graph = EvidenceGraph()
|
||||
graph._current_agent = "strategist"
|
||||
graph._current_task_id = "task-strat-3"
|
||||
graph.current_strategist_round = 1
|
||||
hid = await graph.add_hypothesis("h", "d")
|
||||
register_all_tools(graph)
|
||||
td = TOOL_CATALOG["propose_lead"]
|
||||
result = await td.executor(
|
||||
description="x", target_agent="filesystem",
|
||||
motivating_hypothesis=hid,
|
||||
expected_evidence_type="bogus_type",
|
||||
)
|
||||
assert "not one of" in result
|
||||
assert not graph.leads
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_declare_complete_flips_request_flag(self):
|
||||
from tool_registry import register_all_tools, TOOL_CATALOG
|
||||
graph = EvidenceGraph()
|
||||
graph._current_agent = "strategist"
|
||||
graph._current_task_id = "task-strat-4"
|
||||
graph.current_strategist_round = 5
|
||||
register_all_tools(graph)
|
||||
td = TOOL_CATALOG["declare_investigation_complete"]
|
||||
assert graph.strategist_complete_requested is False
|
||||
result = await td.executor(
|
||||
reason="marginal_yield_zero",
|
||||
rationale="two rounds with 0 yield",
|
||||
)
|
||||
assert graph.strategist_complete_requested is True
|
||||
assert "round 5" in result
|
||||
assert "marginal_yield_zero" in result
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_declare_complete_rejects_bogus_reason(self):
|
||||
from tool_registry import register_all_tools, TOOL_CATALOG
|
||||
graph = EvidenceGraph()
|
||||
graph._current_agent = "strategist"
|
||||
graph._current_task_id = "task-strat-5"
|
||||
register_all_tools(graph)
|
||||
td = TOOL_CATALOG["declare_investigation_complete"]
|
||||
result = await td.executor(reason="i_just_want_to_quit")
|
||||
assert "not in" in result
|
||||
assert graph.strategist_complete_requested is False
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_resume_repairs_open_round(self, tmp_path):
|
||||
"""Simulate a crash mid-round: half-open InvestigationRound + a
|
||||
lead in 'assigned' state. _resume_strategist_state must close the
|
||||
round and re-mark the lead as failed."""
|
||||
from unittest.mock import AsyncMock
|
||||
from orchestrator import Orchestrator
|
||||
graph = EvidenceGraph()
|
||||
hid = await graph.add_hypothesis("h", "d")
|
||||
rid = await graph.start_investigation_round(1)
|
||||
lid = await graph.add_lead(
|
||||
target_agent="filesystem", description="probe",
|
||||
proposed_by="strategist", motivating_hypothesis=hid,
|
||||
expected_evidence_type="supports", round_number=1,
|
||||
)
|
||||
lead = next(l for l in graph.leads if l.id == lid)
|
||||
lead.status = "assigned"
|
||||
|
||||
llm = AsyncMock()
|
||||
class FakeFactory:
|
||||
def get_or_create_agent(self, name):
|
||||
return AsyncMock()
|
||||
orch = Orchestrator(llm, graph, FakeFactory(), config={})
|
||||
next_round = await orch._resume_strategist_state()
|
||||
|
||||
round_ = graph.get_investigation_round(rid)
|
||||
assert round_.completed_at != ""
|
||||
assert round_.strategist_action == "interrupted_resume"
|
||||
lead2 = next(l for l in graph.leads if l.id == lid)
|
||||
assert lead2.status == "failed"
|
||||
assert "interrupted" in lead2.context.get("failure_reason", "")
|
||||
assert next_round == 2
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_resume_state_is_idempotent_on_clean_graph(self):
|
||||
"""No prior rounds → resume returns 1, no changes."""
|
||||
from unittest.mock import AsyncMock
|
||||
from orchestrator import Orchestrator
|
||||
graph = EvidenceGraph()
|
||||
llm = AsyncMock()
|
||||
class FakeFactory:
|
||||
def get_or_create_agent(self, name):
|
||||
return AsyncMock()
|
||||
orch = Orchestrator(llm, graph, FakeFactory(), config={})
|
||||
result = await orch._resume_strategist_state()
|
||||
assert result == 1
|
||||
assert graph.investigation_rounds == []
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_strategist_loop_exits_on_declare_complete(self):
|
||||
"""Mock strategist that declares complete in round 1 — orchestrator
|
||||
must exit the Phase 3 loop without dispatching any worker."""
|
||||
from unittest.mock import AsyncMock
|
||||
from orchestrator import Orchestrator
|
||||
|
||||
graph = EvidenceGraph()
|
||||
llm = AsyncMock()
|
||||
worker_runs: list[str] = []
|
||||
|
||||
class FakeStrategist:
|
||||
name = "strategist"
|
||||
async def run(self, task, lead_id=None):
|
||||
graph.strategist_complete_requested = True
|
||||
return "complete"
|
||||
|
||||
class FakeFactory:
|
||||
def __init__(self):
|
||||
self._instances = {"strategist": FakeStrategist()}
|
||||
def get_or_create_agent(self, name):
|
||||
return self._instances.get(name)
|
||||
|
||||
orch = Orchestrator(llm, graph, FakeFactory(), config={
|
||||
"strategist": {"enabled": True, "max_rounds": 5},
|
||||
})
|
||||
await orch._phase3_strategist_loop()
|
||||
|
||||
assert len(graph.investigation_rounds) == 1
|
||||
r = graph.investigation_rounds[0]
|
||||
assert r.strategist_action == "declare_complete"
|
||||
assert r.completed_at != ""
|
||||
assert worker_runs == []
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_strategist_loop_dispatches_lead_then_completes(self):
|
||||
"""Strategist proposes 1 lead in round 1, declares complete in round 2.
|
||||
Loop must dispatch the worker for the lead, then exit cleanly.
|
||||
"""
|
||||
from unittest.mock import AsyncMock
|
||||
from orchestrator import Orchestrator
|
||||
from case import Case, EvidenceSource
|
||||
|
||||
graph = EvidenceGraph()
|
||||
src = EvidenceSource(id="src-A", label="A", type="disk_image",
|
||||
access_mode="image", path="/tmp/x")
|
||||
graph.case = Case(case_id="c", name="n", sources=[src])
|
||||
graph.set_active_source(src)
|
||||
hid = await graph.add_hypothesis("h", "d")
|
||||
llm = AsyncMock()
|
||||
worker_calls: list[tuple[str, str]] = []
|
||||
|
||||
class FakeStrategist:
|
||||
name = "strategist"
|
||||
def __init__(self):
|
||||
self.round = 0
|
||||
async def run(self, task, lead_id=None):
|
||||
self.round += 1
|
||||
if self.round == 1:
|
||||
await graph.add_lead(
|
||||
target_agent="filesystem",
|
||||
description="probe X",
|
||||
proposed_by="strategist",
|
||||
motivating_hypothesis=hid,
|
||||
expected_evidence_type="supports",
|
||||
round_number=graph.current_strategist_round,
|
||||
)
|
||||
else:
|
||||
graph.strategist_complete_requested = True
|
||||
return "ok"
|
||||
|
||||
class FakeWorker:
|
||||
name = "filesystem"
|
||||
async def run(self, task, lead_id=None):
|
||||
worker_calls.append((self.name, lead_id))
|
||||
return "did the thing"
|
||||
|
||||
class FakeFactory:
|
||||
def __init__(self):
|
||||
self.s = FakeStrategist()
|
||||
self.w = FakeWorker()
|
||||
def get_or_create_agent(self, name):
|
||||
if name == "strategist": return self.s
|
||||
return self.w
|
||||
|
||||
orch = Orchestrator(llm, graph, FakeFactory(), config={
|
||||
"strategist": {"enabled": True, "max_rounds": 5,
|
||||
"hard_stop_marginal_yield_zero_rounds": 99},
|
||||
})
|
||||
await orch._phase3_strategist_loop()
|
||||
|
||||
assert len(graph.investigation_rounds) == 2
|
||||
assert graph.investigation_rounds[0].strategist_action == "propose_leads"
|
||||
assert graph.investigation_rounds[1].strategist_action == "declare_complete"
|
||||
assert len(worker_calls) == 1
|
||||
assert worker_calls[0][0] == "filesystem"
|
||||
leads = [l for l in graph.leads if l.proposed_by == "strategist"]
|
||||
assert len(leads) == 1
|
||||
assert leads[0].status == "completed"
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_strategist_loop_hard_stop_on_zero_yield(self):
|
||||
"""If the strategist insists on more rounds but yield stays zero for
|
||||
N consecutive rounds, the orchestrator force-stops as a safety net."""
|
||||
from unittest.mock import AsyncMock
|
||||
from orchestrator import Orchestrator
|
||||
|
||||
graph = EvidenceGraph()
|
||||
llm = AsyncMock()
|
||||
|
||||
class FakeStrategist:
|
||||
name = "strategist"
|
||||
async def run(self, task, lead_id=None):
|
||||
hid_local = next(iter(graph.hypotheses)) if graph.hypotheses else None
|
||||
await graph.add_lead(
|
||||
target_agent="filesystem", description="probe",
|
||||
proposed_by="strategist",
|
||||
motivating_hypothesis=hid_local or "",
|
||||
expected_evidence_type="supports",
|
||||
round_number=graph.current_strategist_round,
|
||||
)
|
||||
|
||||
class FakeWorker:
|
||||
name = "filesystem"
|
||||
async def run(self, task, lead_id=None):
|
||||
return ""
|
||||
|
||||
class FakeFactory:
|
||||
def __init__(self):
|
||||
self.s = FakeStrategist()
|
||||
self.w = FakeWorker()
|
||||
def get_or_create_agent(self, name):
|
||||
return self.s if name == "strategist" else self.w
|
||||
|
||||
hid = await graph.add_hypothesis("h", "d")
|
||||
orch = Orchestrator(llm, graph, FakeFactory(), config={
|
||||
"strategist": {
|
||||
"enabled": True, "max_rounds": 20,
|
||||
"hard_stop_marginal_yield_zero_rounds": 2,
|
||||
},
|
||||
})
|
||||
await orch._phase3_strategist_loop()
|
||||
assert len(graph.investigation_rounds) == 2
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_strategist_loop_budget_exhaustion_stops_loop(self):
|
||||
"""Hard budget cap on tool_calls_total kills the loop even when the
|
||||
strategist wants to continue."""
|
||||
from unittest.mock import AsyncMock
|
||||
from orchestrator import Orchestrator
|
||||
|
||||
graph = EvidenceGraph()
|
||||
llm = AsyncMock()
|
||||
# Pre-stuff the invocations log so we're already past the cap.
|
||||
await graph.record_tool_invocation(
|
||||
tool="probe", args={}, output="x",
|
||||
)
|
||||
await graph.record_tool_invocation(
|
||||
tool="probe", args={}, output="y",
|
||||
)
|
||||
|
||||
class FakeStrategist:
|
||||
name = "strategist"
|
||||
async def run(self, task, lead_id=None):
|
||||
hid_local = next(iter(graph.hypotheses)) if graph.hypotheses else ""
|
||||
await graph.add_lead(
|
||||
target_agent="filesystem", description="x",
|
||||
proposed_by="strategist",
|
||||
motivating_hypothesis=hid_local,
|
||||
expected_evidence_type="supports",
|
||||
round_number=graph.current_strategist_round,
|
||||
)
|
||||
|
||||
class FakeWorker:
|
||||
name = "filesystem"
|
||||
async def run(self, task, lead_id=None):
|
||||
await graph.record_tool_invocation(
|
||||
tool="probe", args={}, output="z",
|
||||
)
|
||||
|
||||
class FakeFactory:
|
||||
def __init__(self):
|
||||
self.s = FakeStrategist()
|
||||
self.w = FakeWorker()
|
||||
def get_or_create_agent(self, name):
|
||||
return self.s if name == "strategist" else self.w
|
||||
|
||||
hid = await graph.add_hypothesis("h", "d")
|
||||
orch = Orchestrator(llm, graph, FakeFactory(), config={
|
||||
"strategist": {"enabled": True, "max_rounds": 99,
|
||||
"hard_stop_marginal_yield_zero_rounds": 99},
|
||||
"budgets": {"tool_calls_total": 2},
|
||||
})
|
||||
await orch._phase3_strategist_loop()
|
||||
assert len(graph.investigation_rounds) == 1
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_marginal_yield_after_two_rounds(self):
|
||||
"""Verify marginal_yield captures phenomena/edge/status deltas."""
|
||||
from tools import strategy
|
||||
|
||||
graph = EvidenceGraph()
|
||||
hid = await graph.add_hypothesis("h", "d")
|
||||
|
||||
rid1 = await graph.start_investigation_round(1)
|
||||
pid, _ = await graph.add_phenomenon(
|
||||
"fs", "filesystem", "p1", "interp", source_tool="t",
|
||||
)
|
||||
await graph.update_hypothesis_confidence(hid, pid, "direct_evidence", "")
|
||||
await graph.complete_investigation_round(rid1)
|
||||
|
||||
rid2 = await graph.start_investigation_round(2)
|
||||
await graph.complete_investigation_round(rid2)
|
||||
|
||||
out = strategy.marginal_yield(graph, last_n_rounds=2)
|
||||
assert "R1" in out and "R2" in out
|
||||
assert "Trend" in out
|
||||
assert ("collapsed" in out or "Decelerating" in out
|
||||
or "Diminishing" in out or "diminishing" in out)
|
||||
|
||||
|
||||
266
tool_registry.py
266
tool_registry.py
@@ -24,6 +24,7 @@ from tools import mobile_ios as ios
|
||||
from tools import parsers
|
||||
from tools import registry as reg
|
||||
from tools import sleuthkit as tsk
|
||||
from tools import strategy as strat
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
@@ -985,6 +986,271 @@ def register_all_tools(graph: Any) -> None:
|
||||
tags=["media", "ocr", "image"],
|
||||
)
|
||||
|
||||
# ---- Strategist-loop view tools (DESIGN_STRATEGIST.md §2) ----
|
||||
# Pure read-only renders over graph state. The strategist agent uses
|
||||
# these to decide whether to keep investigating or to declare complete.
|
||||
# They go through invocation logging like every other tool (so the
|
||||
# strategist's reads are auditable) but are NOT cacheable — graph
|
||||
# state changes between calls and a stale snapshot would mislead.
|
||||
|
||||
async def _exec_graph_overview() -> str:
|
||||
return strat.graph_overview(graph)
|
||||
|
||||
TOOL_CATALOG["graph_overview"] = ToolDefinition(
|
||||
name="graph_overview",
|
||||
description=(
|
||||
"Top-level investigation state: hypotheses (with log-odds, "
|
||||
"confidence, edges_in, distinct_sources contributing, recent "
|
||||
"status flips), sources (phenomena/identity counts, last-touched "
|
||||
"round), and pending leads. Always call this first when deciding "
|
||||
"the next strategist action."
|
||||
),
|
||||
input_schema={"type": "object", "properties": {}},
|
||||
executor=_exec_graph_overview,
|
||||
module="strategy",
|
||||
tags=["strategy", "overview", "read-only"],
|
||||
)
|
||||
|
||||
async def _exec_source_coverage(source_id: str) -> str:
|
||||
return strat.source_coverage(graph, source_id)
|
||||
|
||||
TOOL_CATALOG["source_coverage"] = ToolDefinition(
|
||||
name="source_coverage",
|
||||
description=(
|
||||
"Per-source artefact coverage report: which expected categories "
|
||||
"have been touched (✓) vs not (✗) on the given source. Coverage "
|
||||
"items are heuristic hints, not requirements — investigate ✗ "
|
||||
"items only when an active hypothesis depends on them."
|
||||
),
|
||||
input_schema={
|
||||
"type": "object",
|
||||
"properties": {
|
||||
"source_id": {"type": "string", "description": "Source id, e.g. 'src-ios-chan'."},
|
||||
},
|
||||
"required": ["source_id"],
|
||||
},
|
||||
executor=_exec_source_coverage,
|
||||
module="strategy",
|
||||
tags=["strategy", "coverage", "read-only"],
|
||||
)
|
||||
|
||||
async def _exec_marginal_yield(last_n_rounds: int = 2) -> str:
|
||||
return strat.marginal_yield(graph, int(last_n_rounds))
|
||||
|
||||
TOOL_CATALOG["marginal_yield"] = ToolDefinition(
|
||||
name="marginal_yield",
|
||||
description=(
|
||||
"How much information the last N investigation rounds added: "
|
||||
"new phenomena, new edges, and hypothesis status flips per round. "
|
||||
"Two consecutive zero-yield rounds means diminishing returns are "
|
||||
"decisive — declare_investigation_complete with reason "
|
||||
"marginal_yield_zero."
|
||||
),
|
||||
input_schema={
|
||||
"type": "object",
|
||||
"properties": {
|
||||
"last_n_rounds": {"type": "integer", "description": "How many recent rounds to summarise (default 2)."},
|
||||
},
|
||||
},
|
||||
executor=_exec_marginal_yield,
|
||||
module="strategy",
|
||||
tags=["strategy", "yield", "read-only"],
|
||||
)
|
||||
|
||||
async def _exec_budget_status() -> str:
|
||||
return strat.budget_status(
|
||||
graph,
|
||||
getattr(graph, "budgets", None),
|
||||
getattr(graph, "run_start_monotonic", None),
|
||||
)
|
||||
|
||||
TOOL_CATALOG["budget_status"] = ToolDefinition(
|
||||
name="budget_status",
|
||||
description=(
|
||||
"Budget vs caps: tool_calls, strategist_rounds, wall_clock_minutes. "
|
||||
"Includes pacing hints when usage crosses 70% / 90% thresholds. "
|
||||
"Use this to decide whether to keep proposing leads or to wind down."
|
||||
),
|
||||
input_schema={"type": "object", "properties": {}},
|
||||
executor=_exec_budget_status,
|
||||
module="strategy",
|
||||
tags=["strategy", "budget", "read-only"],
|
||||
)
|
||||
|
||||
# ---- Strategist decision actions (DESIGN_STRATEGIST.md §2.5) ----
|
||||
# propose_lead is the strategist's tool for "go deeper here";
|
||||
# declare_investigation_complete is its tool for "we're done".
|
||||
# Both are required to be in BaseAgent.mandatory_record_tools for the
|
||||
# strategist subclass so the agent can't return without taking a
|
||||
# documented decision.
|
||||
|
||||
_ALLOWED_EVIDENCE_EDGE_TYPES = (
|
||||
"direct_evidence", "supports", "contradicts",
|
||||
"weakens", "prerequisite_met", "consequence_observed",
|
||||
)
|
||||
|
||||
async def _exec_propose_lead(
|
||||
description: str,
|
||||
target_agent: str,
|
||||
motivating_hypothesis: str,
|
||||
expected_evidence_type: str,
|
||||
rationale: str = "",
|
||||
source_id: str = "",
|
||||
) -> str:
|
||||
"""Propose a new lead from the strategist. Idempotent on the
|
||||
(motivating_hypothesis, expected_evidence_type, target_agent,
|
||||
source_id) tuple within a single run.
|
||||
"""
|
||||
# Validate refs early so the strategist gets a fast, specific error.
|
||||
if motivating_hypothesis and motivating_hypothesis not in graph.hypotheses:
|
||||
return (
|
||||
f"Error: motivating_hypothesis {motivating_hypothesis!r} is "
|
||||
f"not in graph.hypotheses. Call graph_overview to see the "
|
||||
f"current hypothesis ids."
|
||||
)
|
||||
if expected_evidence_type not in _ALLOWED_EVIDENCE_EDGE_TYPES:
|
||||
return (
|
||||
f"Error: expected_evidence_type {expected_evidence_type!r} is "
|
||||
f"not one of {list(_ALLOWED_EVIDENCE_EDGE_TYPES)}."
|
||||
)
|
||||
if source_id:
|
||||
src_obj = graph.case.get_source(source_id) if graph.case else None
|
||||
if src_obj is None:
|
||||
return (
|
||||
f"Error: source_id {source_id!r} is not in the case. "
|
||||
f"Valid ids: {[s.id for s in (graph.case.sources if graph.case else [])]}"
|
||||
)
|
||||
|
||||
lid = await graph.add_lead(
|
||||
target_agent=target_agent,
|
||||
description=description,
|
||||
proposed_by="strategist",
|
||||
motivating_hypothesis=motivating_hypothesis,
|
||||
expected_evidence_type=expected_evidence_type,
|
||||
round_number=graph.current_strategist_round,
|
||||
hypothesis_id=motivating_hypothesis or None,
|
||||
context={"source_id": source_id, "rationale": rationale} if source_id or rationale else {},
|
||||
)
|
||||
return (
|
||||
f"Lead {lid} proposed: target_agent={target_agent}, "
|
||||
f"motivating_hypothesis={motivating_hypothesis}, "
|
||||
f"expected={expected_evidence_type}, source={source_id or '—'}."
|
||||
)
|
||||
|
||||
TOOL_CATALOG["propose_lead"] = ToolDefinition(
|
||||
name="propose_lead",
|
||||
description=(
|
||||
"Propose a specific investigation lead that will be dispatched "
|
||||
"after this strategist round. Each lead MUST name a motivating "
|
||||
"hypothesis it expects to move and the kind of edge it expects "
|
||||
"to produce. Do NOT propose a lead that just adds more same-"
|
||||
"direction evidence to an already-supported hypothesis — harmonic "
|
||||
"damping makes repeats cheap. DO propose leads when (a) a "
|
||||
"hypothesis is supported only by one source — get cross-source "
|
||||
"corroboration; (b) a hypothesis is in the active band — give it "
|
||||
"the deciding evidence; (c) a high-value artefact is uncovered on "
|
||||
"a source where an active hypothesis suggests it matters. "
|
||||
"Idempotent on (motivating_hypothesis, expected_evidence_type, "
|
||||
"target_agent, source_id) — re-proposing the same triple while "
|
||||
"pending is a no-op that returns the existing lead's id."
|
||||
),
|
||||
input_schema={
|
||||
"type": "object",
|
||||
"properties": {
|
||||
"description": {
|
||||
"type": "string",
|
||||
"description": "1-2 sentence specific investigation request, including target source/artefact.",
|
||||
},
|
||||
"target_agent": {
|
||||
"type": "string",
|
||||
"enum": [
|
||||
"filesystem", "registry", "communication", "network",
|
||||
"ios_artifact", "android_artifact", "media",
|
||||
"hypothesis", "timeline",
|
||||
],
|
||||
"description": "Which worker agent should pick this up.",
|
||||
},
|
||||
"source_id": {
|
||||
"type": "string",
|
||||
"description": "Which evidence source to investigate (e.g. 'src-ios-chan'). Optional for cross-source leads.",
|
||||
},
|
||||
"motivating_hypothesis": {
|
||||
"type": "string",
|
||||
"description": "hyp-id this lead is meant to corroborate or refute.",
|
||||
},
|
||||
"expected_evidence_type": {
|
||||
"type": "string",
|
||||
"enum": list(_ALLOWED_EVIDENCE_EDGE_TYPES),
|
||||
"description": "What kind of P→H edge you expect this lead to produce.",
|
||||
},
|
||||
"rationale": {
|
||||
"type": "string",
|
||||
"description": "Why this fills a real gap — referenced in audit + worker prompt.",
|
||||
},
|
||||
},
|
||||
"required": [
|
||||
"description", "target_agent",
|
||||
"motivating_hypothesis", "expected_evidence_type",
|
||||
],
|
||||
},
|
||||
executor=_exec_propose_lead,
|
||||
module="strategy",
|
||||
tags=["strategy", "lead", "decision"],
|
||||
)
|
||||
|
||||
_COMPLETE_REASONS = (
|
||||
"marginal_yield_zero", "budget_exhausted",
|
||||
"all_hypotheses_resolved", "coverage_saturated", "other",
|
||||
)
|
||||
|
||||
async def _exec_declare_investigation_complete(
|
||||
reason: str, rationale: str = "",
|
||||
) -> str:
|
||||
"""Terminal strategist action: signal "we're done" to the orchestrator."""
|
||||
if reason not in _COMPLETE_REASONS:
|
||||
return (
|
||||
f"Error: reason {reason!r} not in "
|
||||
f"{list(_COMPLETE_REASONS)}."
|
||||
)
|
||||
graph.strategist_complete_requested = True
|
||||
return (
|
||||
f"Investigation marked complete in round "
|
||||
f"{graph.current_strategist_round}. reason={reason}. "
|
||||
f"rationale={rationale or '(none)'}. The orchestrator will exit "
|
||||
f"the strategist loop after this round."
|
||||
)
|
||||
|
||||
TOOL_CATALOG["declare_investigation_complete"] = ToolDefinition(
|
||||
name="declare_investigation_complete",
|
||||
description=(
|
||||
"Terminal strategist action. Call this when (a) marginal_yield "
|
||||
"shows zero across 2+ rounds, (b) budget is exhausted, (c) all "
|
||||
"active hypotheses are resolved, or (d) coverage is saturated "
|
||||
"with respect to the active hypotheses. After this call, the "
|
||||
"orchestrator finishes the strategist loop and proceeds to "
|
||||
"Phase 4 (timeline) and Phase 5 (report). The current round's "
|
||||
"in-flight work still completes."
|
||||
),
|
||||
input_schema={
|
||||
"type": "object",
|
||||
"properties": {
|
||||
"reason": {
|
||||
"type": "string",
|
||||
"enum": list(_COMPLETE_REASONS),
|
||||
"description": "Termination cause — picked from a closed set so the audit log is consistent.",
|
||||
},
|
||||
"rationale": {
|
||||
"type": "string",
|
||||
"description": "Free-text justification — quoted into the InvestigationRound's decision_rationale.",
|
||||
},
|
||||
},
|
||||
"required": ["reason"],
|
||||
},
|
||||
executor=_exec_declare_investigation_complete,
|
||||
module="strategy",
|
||||
tags=["strategy", "terminal", "decision"],
|
||||
)
|
||||
|
||||
# ---- Wrap every executor with invocation logging (+ cache + auto-record) ----
|
||||
# Must run AFTER all tools are registered. Every tool call now produces
|
||||
# a ToolInvocation entry on the graph (provenance for grounding), and
|
||||
|
||||
485
tools/strategy.py
Normal file
485
tools/strategy.py
Normal file
@@ -0,0 +1,485 @@
|
||||
"""Strategist-loop tools — read-only views over graph state that let the
|
||||
InvestigationStrategist agent decide whether to keep investigating or to
|
||||
declare the investigation complete.
|
||||
|
||||
DESIGN_STRATEGIST.md §2. Four read-only views:
|
||||
|
||||
graph_overview() → hypotheses + sources + pending leads snapshot
|
||||
source_coverage(src_id) → which artefact categories on this source have
|
||||
been touched vs are still ✗
|
||||
marginal_yield(n_rounds) → how much information the last N rounds added
|
||||
budget_status() → tool calls / rounds / wall-clock against caps
|
||||
|
||||
These are pure render functions over the graph — they MUST NOT mutate state.
|
||||
The strategist never writes phenomena/edges directly; all graph mutations
|
||||
happen through worker agents that the strategist dispatches via propose_lead
|
||||
(which is registered separately in tool_registry).
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import time
|
||||
from typing import Any
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Expected artefact catalogue (per source type)
|
||||
#
|
||||
# These are SOFT HINTS — items the strategist might want to check on a given
|
||||
# source type if any active hypothesis depends on them. The catalogue is
|
||||
# intentionally compact; expand it in-place when a new forensic specialty
|
||||
# joins the toolset. Each entry:
|
||||
#
|
||||
# name human-readable artefact category
|
||||
# detector how to recognise that this category has been touched — either
|
||||
# a tool name OR a `<tool>@<path-substring>` pattern, joined with
|
||||
# `|` for alternatives. The matcher is substring on the tool name
|
||||
# and on the args' string representation.
|
||||
# value_for one-line description of why this category might matter
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
EXPECTED_ARTEFACTS: dict[str, list[dict[str, str]]] = {
|
||||
"disk_image+windows": [
|
||||
{"name": "partition layout", "detector": "partition_info|mmls",
|
||||
"value_for": "deleted files, hidden partitions"},
|
||||
{"name": "filesystem walk", "detector": "list_directory|fls",
|
||||
"value_for": "directory tree, recoverable deleted entries"},
|
||||
{"name": "registry hives", "detector": "parse_registry_key|list_installed_software|get_user_activity",
|
||||
"value_for": "installed software, user activity, timezone"},
|
||||
{"name": "browser history", "detector": "list_directory@AppData|read_text_file@History|read_text_file@Bookmarks",
|
||||
"value_for": "URL access, downloads, web search terms"},
|
||||
{"name": "prefetch", "detector": "parse_prefetch|extract_file@Prefetch",
|
||||
"value_for": "program execution evidence"},
|
||||
{"name": "email/IM config", "detector": "get_email_config",
|
||||
"value_for": "user accounts, configured mail/IM clients"},
|
||||
{"name": "recycle bin", "detector": "list_directory@$Recycle|count_deleted_files",
|
||||
"value_for": "deleted file metadata and recovery"},
|
||||
],
|
||||
"disk_image+android": [
|
||||
{"name": "partition probe", "detector": "probe_android_partitions",
|
||||
"value_for": "discover EFS / SYSTEM / USERDATA layout"},
|
||||
{"name": "system properties", "detector": "read_text_file@build.prop|read_text_file@default.prop",
|
||||
"value_for": "device model, OS version, CSC region"},
|
||||
{"name": "app inventory", "detector": "list_directory@data/app|list_directory@data/data",
|
||||
"value_for": "installed apps, package names"},
|
||||
{"name": "user data dbs", "detector": "list_directory@data/data|sqlite_query",
|
||||
"value_for": "messages, contacts, app-specific data"},
|
||||
{"name": "device identity", "detector": "search_strings@imei|search_strings@serial|search_strings@DRI",
|
||||
"value_for": "IMEI, serial, device fingerprint"},
|
||||
],
|
||||
"mobile_extraction": [
|
||||
{"name": "device info", "detector": "read_idevice_info|read_text_file@iDevice_info",
|
||||
"value_for": "model, iOS version, IMEI, ICCID, Bluetooth MAC, UDID"},
|
||||
{"name": "AddressBook", "detector": "sqlite_query@AddressBook.sqlitedb",
|
||||
"value_for": "contacts, owner identity"},
|
||||
{"name": "SMS / iMessage", "detector": "sqlite_query@sms.db",
|
||||
"value_for": "messaging content, OTP / verification codes"},
|
||||
{"name": "WhatsApp messages", "detector": "sqlite_query@ChatStorage.sqlite|sqlite_query@WhatsApp",
|
||||
"value_for": "WhatsApp content, group membership, call records"},
|
||||
{"name": "WeChat", "detector": "sqlite_query@MM.sqlite|sqlite_query@wcdb|list_directory@WeChat",
|
||||
"value_for": "WeChat IDs, messages, follow targets"},
|
||||
{"name": "Call history", "detector": "sqlite_query@CallHistory|sqlite_query@call_history",
|
||||
"value_for": "incoming/outgoing call log"},
|
||||
{"name": "Safari history", "detector": "sqlite_query@History.db|read_text_file@Bookmarks.plist|parse_plist@Bookmarks",
|
||||
"value_for": "URL access, bookmarks, search queries"},
|
||||
{"name": "Photos library", "detector": "sqlite_query@Photos.sqlite|parse_plist@Photos",
|
||||
"value_for": "photo metadata, EXIF, geolocation, source app"},
|
||||
{"name": "iCloud accounts", "detector": "parse_plist@Accounts3|parse_ios_keychain",
|
||||
"value_for": "Apple ID, registered services, authentication tokens"},
|
||||
{"name": "app inventory", "detector": "list_directory@Bundle/Application|list_directory@Containers",
|
||||
"value_for": "installed apps, app-specific containers"},
|
||||
{"name": "Wi-Fi history", "detector": "parse_plist@com.apple.wifi|read_text_file@known_networks",
|
||||
"value_for": "connected SSIDs, keys, first/last seen times"},
|
||||
],
|
||||
"media_collection": [
|
||||
{"name": "archive unpack", "detector": "unzip_archive|list_directory",
|
||||
"value_for": "extract images / docs for downstream analysis"},
|
||||
{"name": "OCR text", "detector": "ocr_image",
|
||||
"value_for": "screenshot text content (chat, transaction, IDs)"},
|
||||
{"name": "metadata", "detector": "read_binary_preview|search_strings",
|
||||
"value_for": "EXIF, embedded timestamps, device fingerprints"},
|
||||
],
|
||||
"archive": [
|
||||
{"name": "archive unpack", "detector": "unzip_archive",
|
||||
"value_for": "expose contents for further analysis"},
|
||||
],
|
||||
}
|
||||
|
||||
|
||||
def _key_for_source(src) -> str:
|
||||
"""Return the EXPECTED_ARTEFACTS key for a source: 'disk_image+platform'
|
||||
when platform is set in meta, otherwise just the source type."""
|
||||
src_type = getattr(src, "type", "")
|
||||
if src_type == "disk_image":
|
||||
platform = (getattr(src, "meta", {}) or {}).get("platform", "").lower()
|
||||
if platform:
|
||||
return f"disk_image+{platform}"
|
||||
return src_type
|
||||
|
||||
|
||||
def _detector_matches(detector: str, tool_name: str, args_str: str) -> bool:
|
||||
"""Return True if any '|'-separated branch of `detector` matches.
|
||||
|
||||
A branch like ``sqlite_query@AddressBook.sqlitedb`` requires both the
|
||||
tool name (substring) AND the args (substring) to match. A branch like
|
||||
``parse_prefetch`` is a tool-name-only check.
|
||||
"""
|
||||
for branch in detector.split("|"):
|
||||
branch = branch.strip()
|
||||
if not branch:
|
||||
continue
|
||||
if "@" in branch:
|
||||
t, sub = branch.split("@", 1)
|
||||
if t in tool_name and sub.lower() in args_str.lower():
|
||||
return True
|
||||
else:
|
||||
if branch in tool_name:
|
||||
return True
|
||||
return False
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# graph_overview()
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
def graph_overview(graph) -> str:
|
||||
"""Render hypotheses + sources + pending leads as the strategist's
|
||||
primary decision view.
|
||||
|
||||
Annotates each hypothesis with the count of distinct sources that
|
||||
contribute supporting (positive-LR) edges. A hypothesis with many edges
|
||||
but only one source is a strategist signal to seek cross-source
|
||||
corroboration.
|
||||
"""
|
||||
lines: list[str] = ["# Investigation State", ""]
|
||||
|
||||
# Hypotheses table.
|
||||
if graph.hypotheses:
|
||||
lines.append(f"## Hypotheses ({len(graph.hypotheses)})")
|
||||
lines.append("")
|
||||
lines.append(
|
||||
"| id | title | L | conf | status | edges_in | distinct_sources | recent_flip |"
|
||||
)
|
||||
lines.append("|----|-------|---|------|--------|---------:|-----------------:|--------------|")
|
||||
# Sort by absolute log-odds magnitude descending so the strategist
|
||||
# sees the most decided hypotheses first; active ones float to the
|
||||
# middle of the table where decisions matter most.
|
||||
for hid, h in sorted(
|
||||
graph.hypotheses.items(),
|
||||
key=lambda kv: (kv[1].status != "active", -abs(kv[1].log_odds)),
|
||||
):
|
||||
in_edges = graph._adj_rev.get(hid, [])
|
||||
edges_in = len(in_edges)
|
||||
# Distinct sources contributing edges (looked up via source
|
||||
# phenomenon's source_id; entity→entity edges have no source).
|
||||
distinct_sources: set[str] = set()
|
||||
for e in in_edges:
|
||||
src_node = graph.phenomena.get(e.source_id)
|
||||
if src_node is not None and src_node.source_id:
|
||||
distinct_sources.add(src_node.source_id)
|
||||
# Did this hypothesis's status change in the last 2 rounds?
|
||||
recent = "no"
|
||||
recent_rounds = graph.investigation_rounds[-2:]
|
||||
for r in recent_rounds:
|
||||
before = r.hypothesis_status_snapshot_before.get(hid)
|
||||
after = r.hypothesis_status_snapshot_after.get(hid)
|
||||
if before and after and before != after:
|
||||
recent = f"yes ({before}→{after} in R{r.round_number})"
|
||||
break
|
||||
title = (h.title or "")[:60].replace("|", "/")
|
||||
lines.append(
|
||||
f"| {hid[:14]} | {title} | {h.log_odds:+.2f} | "
|
||||
f"{h.confidence:.2f} | {h.status} | {edges_in} | "
|
||||
f"{len(distinct_sources)} | {recent} |"
|
||||
)
|
||||
lines.append("")
|
||||
else:
|
||||
lines.append("## Hypotheses\n\n_(none yet — Phase 2 has not produced any)_\n")
|
||||
|
||||
# Sources table.
|
||||
if graph.case and graph.case.sources:
|
||||
lines.append(f"## Sources ({len(graph.case.sources)})")
|
||||
lines.append("")
|
||||
lines.append(
|
||||
"| id | type | phenomena | identities | last_touched_in_round |"
|
||||
)
|
||||
lines.append("|----|------|----------:|-----------:|----------------------|")
|
||||
for src in graph.case.sources:
|
||||
ph_count = sum(
|
||||
1 for p in graph.phenomena.values() if p.source_id == src.id
|
||||
)
|
||||
id_count = sum(
|
||||
1 for e in graph.entities.values()
|
||||
for i in e.identifiers
|
||||
if any(
|
||||
p.source_id == src.id
|
||||
for p in graph.phenomena.values()
|
||||
if p.id == i.get("phenomenon_id")
|
||||
)
|
||||
)
|
||||
# Latest round in which a tool invocation was made against this src.
|
||||
last_r = "—"
|
||||
for r in reversed(graph.investigation_rounds):
|
||||
if r.new_phenomena_count > 0:
|
||||
# Heuristic: if any phenomenon created during this round
|
||||
# was on this source, mark this round as the last touch.
|
||||
in_round = [
|
||||
p for p in graph.phenomena.values()
|
||||
if p.source_id == src.id
|
||||
and r.started_at <= p.created_at
|
||||
and (not r.completed_at or p.created_at <= r.completed_at)
|
||||
]
|
||||
if in_round:
|
||||
last_r = f"R{r.round_number}"
|
||||
break
|
||||
lines.append(
|
||||
f"| {src.id} | {src.type} | {ph_count} | {id_count} | {last_r} |"
|
||||
)
|
||||
lines.append("")
|
||||
|
||||
# Pending leads.
|
||||
pending = [l for l in graph.leads if l.status == "pending"]
|
||||
if pending:
|
||||
lines.append(f"## Pending Leads ({len(pending)})")
|
||||
lines.append("")
|
||||
lines.append("| id | from | target_agent | for_hypothesis | description |")
|
||||
lines.append("|----|------|--------------|----------------|-------------|")
|
||||
for l in pending[:20]:
|
||||
desc = (l.description or "")[:80].replace("|", "/")
|
||||
mh = l.motivating_hypothesis or l.hypothesis_id or "—"
|
||||
lines.append(
|
||||
f"| {l.id} | {l.proposed_by or '—'} | {l.target_agent} | "
|
||||
f"{mh[:14] if mh != '—' else '—'} | {desc} |"
|
||||
)
|
||||
if len(pending) > 20:
|
||||
lines.append(f"\n_(+{len(pending) - 20} more pending leads not shown)_")
|
||||
lines.append("")
|
||||
else:
|
||||
lines.append("## Pending Leads\n\n_(none — no investigations queued)_\n")
|
||||
|
||||
# Interpretation hint at the end, plain English.
|
||||
lines.append("---")
|
||||
lines.append(
|
||||
"**Interpretation hints**: A hypothesis with many edges but only one "
|
||||
"distinct_source has fragile cross-source independence — a single "
|
||||
"edge from a *different* source would do more for it than another "
|
||||
"edge from the same source (harmonic damping makes repeats cheap). "
|
||||
"Hypotheses in the active band (0.2 < conf < 0.8) are the ones a "
|
||||
"well-targeted lead can flip. recent_flip = 'yes' means belief is "
|
||||
"still moving on that hypothesis; 'no' across 2 rounds suggests "
|
||||
"stability."
|
||||
)
|
||||
|
||||
return "\n".join(lines)
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# source_coverage(source_id)
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
def source_coverage(graph, source_id: str) -> str:
|
||||
"""Render which expected artefact categories have been touched on
|
||||
*source_id*, and which remain ✗.
|
||||
|
||||
Output is markdown. The closing paragraph reminds the strategist that
|
||||
coverage hints are heuristics — investigate ✗ items only when an active
|
||||
hypothesis depends on them. This is the design's central guardrail
|
||||
against the system devolving into a fixed forensic checklist.
|
||||
"""
|
||||
src = graph.case.get_source(source_id) if graph.case else None
|
||||
if src is None:
|
||||
return f"Error: source_id {source_id!r} not found in case."
|
||||
|
||||
key = _key_for_source(src)
|
||||
expected = EXPECTED_ARTEFACTS.get(key, [])
|
||||
|
||||
# Collect this source's invocation history.
|
||||
invs = [
|
||||
inv for inv in graph.tool_invocations.values()
|
||||
if inv.source_id == source_id
|
||||
]
|
||||
|
||||
# For each expected category, decide ✓ / ✗ + show example invocation if ✓.
|
||||
rows: list[tuple[str, str, str, str]] = []
|
||||
for entry in expected:
|
||||
name = entry["name"]
|
||||
detector = entry["detector"]
|
||||
value_for = entry["value_for"]
|
||||
matched: str | None = None
|
||||
for inv in invs:
|
||||
args_str = ""
|
||||
try:
|
||||
args_str = " ".join(f"{k}={v}" for k, v in (inv.args or {}).items())
|
||||
except Exception:
|
||||
args_str = str(inv.args)
|
||||
if _detector_matches(detector, inv.tool, args_str):
|
||||
matched = f"{inv.tool}({args_str[:60]})"
|
||||
break
|
||||
mark = "✓" if matched else "✗"
|
||||
evidence = matched or "—"
|
||||
rows.append((mark, name, evidence, value_for))
|
||||
|
||||
lines: list[str] = [
|
||||
f"# Coverage of source `{source_id}` ({src.label})",
|
||||
"",
|
||||
f"Source type: `{src.type}` / access_mode: `{src.access_mode}`",
|
||||
f"Invocations made against this source: **{len(invs)}**",
|
||||
"",
|
||||
]
|
||||
if not expected:
|
||||
lines.append(
|
||||
f"_(no expected-artefact catalogue entry for source type `{key}` — "
|
||||
"coverage cannot be assessed against a baseline)_"
|
||||
)
|
||||
else:
|
||||
lines.append(
|
||||
"| ✓/✗ | category | example invocation | what it would tell us |"
|
||||
)
|
||||
lines.append("|-----|----------|---------------------|------------------------|")
|
||||
for mark, name, evidence, value_for in rows:
|
||||
lines.append(
|
||||
f"| {mark} | {name} | {evidence[:70].replace('|','/')} | {value_for} |"
|
||||
)
|
||||
n_covered = sum(1 for r in rows if r[0] == "✓")
|
||||
n_total = len(rows)
|
||||
lines.append("")
|
||||
lines.append(f"Coverage: **{n_covered}/{n_total}** ({n_covered*100//max(n_total,1)}%)")
|
||||
|
||||
# Other invocations on this source that didn't match any expected entry —
|
||||
# could be genuine novel exploration; strategist might want to know.
|
||||
lines.append("")
|
||||
lines.append("---")
|
||||
lines.append(
|
||||
"**Coverage hints are heuristics, not requirements.** Skip an item if "
|
||||
"the case theory makes it irrelevant — a financial-fraud case has no "
|
||||
"reason to OCR every photo. Investigate ✗ items only when they could "
|
||||
"materially affect an active hypothesis. If you propose a lead just "
|
||||
"because something is ✗, the strategist prompt is being misused."
|
||||
)
|
||||
return "\n".join(lines)
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# marginal_yield(last_n_rounds)
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
def marginal_yield(graph, last_n_rounds: int = 2) -> str:
|
||||
"""Render the last N investigation rounds' yield deltas.
|
||||
|
||||
Yield columns:
|
||||
- new_phenomena: phenomena created during the round
|
||||
- new_edges: edges (any direction) added during the round
|
||||
- status_flips: hypotheses whose status changed during the round
|
||||
|
||||
A row of zeros means that round didn't move the graph. Two consecutive
|
||||
such rows is strong evidence of diminishing returns; the strategist
|
||||
should consider declare_investigation_complete with reason
|
||||
marginal_yield_zero.
|
||||
"""
|
||||
rounds = [r for r in graph.investigation_rounds if r.completed_at]
|
||||
if not rounds:
|
||||
return (
|
||||
"# Marginal Yield\n\n"
|
||||
"_(no completed investigation rounds yet — yield not applicable)_"
|
||||
)
|
||||
recent = rounds[-max(1, last_n_rounds):]
|
||||
lines = [f"# Marginal Yield (last {len(recent)} of {len(rounds)} rounds)", ""]
|
||||
lines.append("| round | new_phenomena | new_edges | status_flips |")
|
||||
lines.append("|-------|--------------:|----------:|-------------:|")
|
||||
yields: list[tuple[int, int, int]] = []
|
||||
for r in recent:
|
||||
yields.append((r.new_phenomena_count, r.new_edges_count, r.status_flips))
|
||||
lines.append(
|
||||
f"| R{r.round_number} | {r.new_phenomena_count} | "
|
||||
f"{r.new_edges_count} | {r.status_flips} |"
|
||||
)
|
||||
|
||||
# Trend interpretation aid.
|
||||
lines.append("")
|
||||
if all(y == (0, 0, 0) for y in yields):
|
||||
trend = (
|
||||
"Yield is zero across these rounds — diminishing returns are "
|
||||
"confirmed. Strongly consider declare_investigation_complete "
|
||||
"(reason: marginal_yield_zero)."
|
||||
)
|
||||
elif len(yields) >= 2:
|
||||
first = yields[0][0] + yields[0][1] + yields[0][2]
|
||||
last = yields[-1][0] + yields[-1][1] + yields[-1][2]
|
||||
if last == 0 and first > 0:
|
||||
trend = (
|
||||
"Yield collapsed to zero in the most recent round. One more "
|
||||
"well-targeted probe is reasonable; another zero-yield round "
|
||||
"after that means stop."
|
||||
)
|
||||
elif last < first / 2 and first > 0:
|
||||
trend = (
|
||||
f"Decelerating ({last}/{first} ≈ "
|
||||
f"{int(100*last/first)}% of the earlier round). Diminishing "
|
||||
"returns are accumulating."
|
||||
)
|
||||
else:
|
||||
trend = "Yield is still active — further investigation is paying off."
|
||||
else:
|
||||
trend = (
|
||||
"Only one completed round — too early to call a trend. Run at "
|
||||
"least one more before considering completion."
|
||||
)
|
||||
lines.append(f"**Trend**: {trend}")
|
||||
return "\n".join(lines)
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# budget_status()
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
def budget_status(graph, budgets: dict[str, Any] | None, start_time: float | None) -> str:
|
||||
"""Render budget usage against config.yaml `budgets` block.
|
||||
|
||||
Counters:
|
||||
- tool_calls: len(graph.tool_invocations)
|
||||
- strategist_rounds: len(graph.investigation_rounds)
|
||||
- wall_clock_minutes: now - start_time (when start_time is supplied)
|
||||
"""
|
||||
budgets = budgets or {}
|
||||
tool_calls_used = len(graph.tool_invocations)
|
||||
rounds_used = len(graph.investigation_rounds)
|
||||
minutes_used: float | None = None
|
||||
if start_time is not None:
|
||||
minutes_used = (time.monotonic() - start_time) / 60.0
|
||||
|
||||
def _row(name: str, used: float, cap: Any) -> str:
|
||||
if cap is None:
|
||||
return f"| {name} | {used:g} | — | (unbounded) |"
|
||||
pct = (used / cap) * 100 if cap else 0
|
||||
return f"| {name} | {used:g} | {cap} | {pct:.0f}% |"
|
||||
|
||||
lines = ["# Budget Status", ""]
|
||||
lines.append("| metric | used | cap | pct |")
|
||||
lines.append("|--------|-----:|----:|----:|")
|
||||
lines.append(_row("tool_calls", tool_calls_used, budgets.get("tool_calls_total")))
|
||||
lines.append(_row("strategist_rounds", rounds_used, budgets.get("strategist_rounds_max")))
|
||||
if minutes_used is not None:
|
||||
lines.append(_row(
|
||||
"wall_clock_minutes", round(minutes_used, 1),
|
||||
budgets.get("wall_clock_minutes_max"),
|
||||
))
|
||||
|
||||
# Pacing hint.
|
||||
lines.append("")
|
||||
flags = []
|
||||
cap_calls = budgets.get("tool_calls_total")
|
||||
cap_rounds = budgets.get("strategist_rounds_max")
|
||||
if cap_calls and tool_calls_used / cap_calls >= 0.9:
|
||||
flags.append("tool_calls budget ≥ 90% used — favour declare_complete")
|
||||
if cap_rounds and rounds_used / cap_rounds >= 0.7:
|
||||
flags.append("strategist rounds ≥ 70% used — only propose leads with high expected yield")
|
||||
if flags:
|
||||
lines.append("**Budget warnings**:")
|
||||
for f in flags:
|
||||
lines.append(f"- {f}")
|
||||
else:
|
||||
lines.append(
|
||||
"Budget room remains. Standard rule: each propose_lead should "
|
||||
"name a specific hypothesis it expects to move; otherwise skip it."
|
||||
)
|
||||
return "\n".join(lines)
|
||||
Reference in New Issue
Block a user