MASForensic/DESIGN_STRATEGIST.md
BattleTag 8b964b5dec docs(strategist) S8/9: DESIGN.md updates + DESIGN_STRATEGIST.md spec
DESIGN_STRATEGIST.md §11. The strategist refit is the first sub-design
big enough to need its own document, so it lives as a sibling to
DESIGN.md rather than inline.

DESIGN_STRATEGIST.md (new, 543 lines) covers:
  §0  Scope, non-goals, invariants preserved
  §1  Data model (Lead extension, InvestigationRound)
  §2  Six tools (graph_overview / source_coverage / marginal_yield /
      budget_status / propose_lead / declare_investigation_complete)
      with full input_schema
  §3  InvestigationStrategist agent class
  §4  Orchestrator Phase 3 loop pseudocode
  §5  Persistence + resume strategy
  §6  config schema
  §7  Test plan (8 scenarios)
  §8  9-step build order (matches commit history)
  §9  Risks + mitigations
  §10 Open questions
  §11 Required DESIGN.md updates (applied here)
  §12 What this design does NOT solve (exam-test coverage, vision-
      capable LLM, blockchain explorer, etc.)

DESIGN.md updates per §11:
  §4.5  Note harmonic damping is now landed
  §4.9  Phase 3 table row now points at the strategist loop +
        inline summary
  §5    Lead + InvestigationRound rows added to the data-model
        summary table

This commit closes the strategist refit. All 174 tests pass / 1 skipped.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-21 02:28:06 -10:00


Strategist Loop: Belief-Driven Rework of Phase 3

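This is a supplementary design document to DESIGN.md, covering the concrete rewrite of the orchestrator's Phase 3 (§4.9).

Motivation: the first full six-source live run on 2026-05-20 (runs/2026-05-20T20-15-04/) showed that Phase 3 does not work. None of the 8 pending leads was dispatched, because log-odds inflation made every hypothesis converge immediately. Even after "harmonic damping" fixed the log-odds math (the commit is in evidence_graph.py:update_hypothesis_confidence), Phase 3 under the current architecture is still a mechanical "single-round trigger, rule-based convergence" process in which the LLM has no say at the scheduling layer. This design turns Phase 3 into an LLM-driven exploration loop.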


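0. Scope

In scope

  • Rework orchestrator.py:Phase 3 from "single-round, rule-triggered" to "strategist loop, belief-driven".
  • Add an InvestigationStrategist agent, plus 4 decision-view tools and 2 decision-action tools.
  • Rewrite the orchestrator loop.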

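Out of scope

  • No change to Phase 1 (per-source triage stays as is)
  • No change to Phase 2 (HypothesisAgent is untouched; the strategist may invoke it but does not replace it)
  • No change to Phase 4/5 (timeline / report)
  • No expert-level per-source checklists (only a soft-hint list inside the source_coverage tool)
  • No new graph node types (leads reuse the existing structure)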

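Invariants preserved

  • DESIGN.md §4.3 grounding gateway: every write still goes through it
  • DESIGN.md §4.5 log-odds + harmonic damping
  • DESIGN.md §4.4 boundary between verified_facts and interpretation
  • Disconnect recovery (graph_state.json serialization stays compatible)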

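Design principles

  1. Move "LLM proposes, code adjudicates" up to the scheduling layer: DESIGN.md's first principle is currently honored only at the fact layer (grounding); at the scheduling layer, "whether to dig deeper, where to dig, and when to stop" is still hard-coded. This design gives the LLM authority over scheduling decisions.
  2. Exam capability is available but not binding: the toolset and soft-hint lists cover the artefact categories that exam scenarios require, but whether to examine a given artefact, and how deeply, is decided by the strategist from the nature of the specific case rather than forced by a predefined checklist.
  3. Explainable and auditable: every strategist round's decision, motivation, and resulting yield is recorded in a persisted InvestigationRound, so it can be reviewed after the fact.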

1. Data model changes

1.1 Lead gains 4 fields

evidence_graph.py:Lead currently has (id, title, description, target_agent, source_id, status, …). New fields:

@dataclass
class Lead:
    # ... existing fields
    proposed_by: str = ""            # "strategist" | "filesystem" | ... (the proposing agent)
    motivating_hypothesis: str = ""  # hyp-id this lead is meant to corroborate/refute
    expected_evidence_type: str = "" # one of edge_types (edge type the lead is expected to yield)
    round_number: int = 0            # strategist round that produced this lead

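motivating_hypothesis is the key field: it explicitly ties a lead to a hypothesis, so that afterwards one can compute "did running this lead actually change the hypothesis state?", which is the strategist's marginal-yield metric.

1.2 New InvestigationRound node

Records each round's strategist decision itself, since provenance must be auditable too: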

@dataclass
class InvestigationRound:
    id: str                          # "round-001"
    round_number: int
    started_at: str
    completed_at: str = ""
    strategist_action: str = ""      # "propose_leads" | "declare_complete"
    leads_proposed: list[str] = field(default_factory=list)
    leads_executed: list[str] = field(default_factory=list)
    hypothesis_status_snapshot_before: dict = field(default_factory=dict)  # hyp_id → status
    hypothesis_status_snapshot_after: dict = field(default_factory=dict)
    new_phenomena_count: int = 0
    new_edges_count: int = 0
    decision_rationale: str = ""     # the strategist's own explanation

Serialized together with the graph (added to to_dict/from_dict).
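The round-trip this implies can be sketched as follows. This is a minimal illustration, not the project's actual to_dict/from_dict: the helper names round_to_dict, round_from_dict and rounds_from_state are hypothetical, and the choice to ignore unknown keys is an assumption made here to keep older and newer graph_state.json files loadable.

```python
from dataclasses import dataclass, field, asdict, fields


@dataclass
class InvestigationRound:
    id: str
    round_number: int
    started_at: str
    completed_at: str = ""
    strategist_action: str = ""
    leads_proposed: list[str] = field(default_factory=list)
    leads_executed: list[str] = field(default_factory=list)
    hypothesis_status_snapshot_before: dict = field(default_factory=dict)
    hypothesis_status_snapshot_after: dict = field(default_factory=dict)
    new_phenomena_count: int = 0
    new_edges_count: int = 0
    decision_rationale: str = ""


def round_to_dict(r: InvestigationRound) -> dict:
    """Plain-dict form suitable for json.dump (hypothetical helper)."""
    return asdict(r)


def round_from_dict(data: dict) -> InvestigationRound:
    """Rebuild a round, dropping keys this version does not know about."""
    known = {f.name for f in fields(InvestigationRound)}
    return InvestigationRound(**{k: v for k, v in data.items() if k in known})


def rounds_from_state(state: dict) -> list[InvestigationRound]:
    # A state file written before this change has no "investigation_rounds" key.
    return [round_from_dict(d) for d in state.get("investigation_rounds", [])]
```

Defaulting to an empty list when the key is absent is one way to keep the serialization-compatibility invariant from §0: an old state file simply loads with zero rounds.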


2. New tools

They live in a new file, tools/strategy.py, and are registered following the existing TOOL_CATALOG registration pattern.

2.1 graph_overview(): global situation (read-only)

Signature: graph_overview() -> str

Output (markdown, which is easier for an LLM to read than JSON):

# Investigation State

## Hypotheses (8)
| id | title | L | conf | status | edges_in | distinct_sources | flipped_in_last_2_rounds |
|----|-------|---|------|--------|----------|------------------|---------------------------|
| hyp-83db8748 | Multi-Device Composite | +8.75 | 0.99 | supported | 23 | 1 | no |
| hyp-daa7c704 | Multiple Identity Aliases | +9.21 | 0.99 | supported | 11 | 3 | no |
| hyp-7fa9b13e | Sunny.zip contains timer_a | +2.08 | 0.99 | supported | 4 | 1 | yes (active→supported in R2) |
| ...

## Sources (6)
| id | type | phenomena | identities | last_touched_in_round |
| src-usb-leung | disk_image | 8 | 1 | R1 |
| ...

## Pending Leads (3)
| id | from | targeting | for_hypothesis | reason |
| lead-aaa | filesystem | src-ios-chan/Safari | hyp-83db8748 | Safari history likely contains device-switching evidence |

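Key annotation: the distinct_sources column exposes the fact that "this hypothesis rests on a single source". Seeing that all 23 edges come from the android source, the strategist will conclude on its own that independent evidence from elsewhere is needed.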
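How the edges_in and distinct_sources columns could be derived is sketched below. The edge shape (a dict with hypothesis_id and source_id keys) is an assumption for illustration and not the real graph schema; both function names are hypothetical.

```python
from collections import defaultdict


def distinct_source_counts(edges: list[dict]) -> dict[str, dict[str, int]]:
    """Per hypothesis: number of incoming edges and number of distinct sources.

    Each edge is assumed to be a dict with 'hypothesis_id' and 'source_id'
    keys (hypothetical shape).
    """
    edges_in: dict[str, int] = defaultdict(int)
    sources: dict[str, set] = defaultdict(set)
    for e in edges:
        edges_in[e["hypothesis_id"]] += 1
        sources[e["hypothesis_id"]].add(e["source_id"])
    return {
        h: {"edges_in": edges_in[h], "distinct_sources": len(sources[h])}
        for h in edges_in
    }


def single_source_hypotheses(edges: list[dict]) -> list[str]:
    """Hypotheses whose support comes from exactly one source."""
    stats = distinct_source_counts(edges)
    return sorted(h for h, s in stats.items() if s["distinct_sources"] == 1)
```

A hypothesis returned by single_source_hypotheses is the case the strategist prompt singles out for cross-source corroboration.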

2.2 source_coverage(source_id: str): per-source coverage (read-only)

Signature: source_coverage(source_id: str) -> str

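Implementation: scan graph.tool_invocations, keep the entries whose source_id equals the given source, and group them by tool name + main args. Then compare against EXPECTED_ARTEFACTS[source_type] and mark untouched items with ✗.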

# tools/strategy.py
EXPECTED_ARTEFACTS: dict[str, list[dict]] = {
    "disk_image+windows": [
        {"name": "filesystem layout", "detector": "fls|mmls", "value_for": "deleted files, hidden partitions"},
        {"name": "registry hives", "detector": "parse_registry_key", "value_for": "user activity, installed software"},
        {"name": "browser history", "detector": "list_directory@AppData/.../History", "value_for": "URL access, downloads"},
        {"name": "prefetch", "detector": "extract_file@Windows/Prefetch", "value_for": "program execution evidence"},
        # ...
    ],
    "mobile_extraction": [
        {"name": "AddressBook", "detector": "sqlite_query@AddressBook.sqlitedb", "value_for": "contacts"},
        {"name": "SMS messages", "detector": "sqlite_query@sms.db", "value_for": "messaging content"},
        {"name": "WhatsApp messages", "detector": "sqlite_query@ChatStorage.sqlite", "value_for": "WhatsApp content"},
        {"name": "Call history", "detector": "sqlite_query@CallHistoryDB", "value_for": "call records"},
        {"name": "Safari history", "detector": "sqlite_query@History.db|read_text@Bookmarks.plist", "value_for": "web browsing"},
        {"name": "Photos library", "detector": "sqlite_query@Photos.sqlite", "value_for": "photo metadata, EXIF, geolocation"},
        {"name": "iCloud accounts", "detector": "parse_plist@Accounts3.sqlite|parse_keychain", "value_for": "Apple ID, services"},
        {"name": "App inventory", "detector": "list_directory@var/containers/Bundle/Application", "value_for": "installed apps"},
    ],
    "disk_image+android": [...],
    "media_collection": [
        {"name": "OCR text", "detector": "ocr_image", "value_for": "screenshot text"},
        {"name": "EXIF metadata", "detector": "exif_image", "value_for": "device, timestamps, geolocation"},
    ],
}

Soft-hint semantics: the output must always end with this sentence:

Coverage hints are heuristics, not requirements. Skip an item if the case theory makes it irrelevant. Investigate ✗ items only when they could materially affect an active hypothesis.

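This sentence is the key to "exam capability is available but not binding": an LLM that sees ✗ does not dispatch blindly; it first looks at the hypothesis list and asks "does this artefact matter to any current hypothesis?".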
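The comparison step can be sketched as below. The reading of the detector syntax is an assumption: "tool" matches on the tool name alone, "tool@fragment" also requires the fragment to appear in the invocation's args, and "|" separates alternatives. The invocation shape (dict with source_id, tool, args) and the names detector_matches and render_coverage are hypothetical. Detectors containing an elided path such as ".../" would need a looser match than the plain substring test used here.

```python
COVERAGE_FOOTER = (
    "Coverage hints are heuristics, not requirements. Skip an item if the case "
    "theory makes it irrelevant. Investigate ✗ items only when they could "
    "materially affect an active hypothesis."
)


def detector_matches(detector: str, invocation: dict) -> bool:
    """True if any '|'-separated alternative matches the invocation."""
    for alt in detector.split("|"):
        tool, _, fragment = alt.partition("@")
        # An empty fragment is a substring of any args string, so 'tool' alone matches.
        if invocation["tool"] == tool and fragment in invocation.get("args", ""):
            return True
    return False


def render_coverage(source_id: str, expected: list[dict], invocations: list[dict]) -> str:
    """One ✓/✗ line per expected artefact, always ending with the soft-hint footer."""
    mine = [inv for inv in invocations if inv.get("source_id") == source_id]
    lines = [f"# Coverage for {source_id}"]
    for item in expected:
        touched = any(detector_matches(item["detector"], inv) for inv in mine)
        mark = "✓" if touched else "✗"
        lines.append(f"{mark} {item['name']} ({item['value_for']})")
    lines.append("")
    lines.append(COVERAGE_FOOTER)
    return "\n".join(lines)
```

Appending the footer inside the renderer, rather than leaving it to the prompt, makes it hard for a later change to drop the soft-hint sentence by accident.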

2.3 marginal_yield(last_n_rounds: int = 2): marginal yield (read-only)

Signature: marginal_yield(last_n_rounds: int = 2) -> str

Implementation: scan the most recent N InvestigationRound records and tally:

  • new phenomena per round
  • new P→H edges per round
  • hypothesis status flips per round (active→supported, or the reverse)

Output:

# Marginal Yield (last 2 rounds)

| round | new_phenomena | new_edges | status_flips |
|   R3  |  5            |  7        |  1           |
|   R4  |  2            |  1        |  0           |

Trend: decelerating (R4 yield 23% of R3, counting all three columns).
Recommendation interpretation aid: yield trending to zero suggests diminishing
returns; consider declare_complete after one more probe.

The last line is LLM-friendly heuristic prose, not a mandatory signal.
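A possible shape for the tally and trend line is sketched below. The design does not fix how a round's "yield" is aggregated, so round_yield here assumes the plain sum of the three counters; the round dicts and both function names are hypothetical.

```python
def round_yield(r: dict) -> int:
    # Assumed definition: sum of the three counters. The design leaves this open.
    return r["new_phenomena"] + r["new_edges"] + r["status_flips"]


def marginal_yield_report(rounds: list[dict], last_n_rounds: int = 2) -> str:
    """Render the per-round table plus a one-line trend for the last two rounds."""
    recent = rounds[-last_n_rounds:] if last_n_rounds > 0 else []
    lines = [
        f"# Marginal Yield (last {len(recent)} rounds)",
        "",
        "| round | new_phenomena | new_edges | status_flips |",
    ]
    for r in recent:
        lines.append(
            f"| R{r['round_number']} | {r['new_phenomena']} | "
            f"{r['new_edges']} | {r['status_flips']} |"
        )
    if len(recent) >= 2:
        prev, last = recent[-2], recent[-1]
        if round_yield(prev) > 0:
            pct = round(100 * round_yield(last) / round_yield(prev))
            trend = "decelerating" if round_yield(last) < round_yield(prev) else "steady or accelerating"
            lines.append("")
            lines.append(
                f"Trend: {trend} (R{last['round_number']} yield "
                f"{pct}% of R{prev['round_number']})."
            )
    return "\n".join(lines)
```

The trend line is omitted when the previous round yielded nothing, since a percentage of zero is undefined.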

2.4 budget_status(): budget view (read-only)

Signature: budget_status() -> str

# Budget Status

| metric | used | cap | pct |
| tool_calls | 1248 | 5000 | 25% |
| strategist_rounds | 3 | 10 | 30% |
| wall_clock_minutes | 142 | 360 | 39% |

Phase 1 used 89% of allocated. Phase 2 used 4%. Phase 3 (strategist) so far: 7%.

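Budgets are read from config.yaml; the new fields are listed in §6. With no budget configured, the run enters unbounded mode (relying only on the strategist declaring completion itself, plus a hard safety cap).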
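The table and the hard-stop check could look roughly like this. It is a sketch under assumptions: the used/caps dicts keyed by metric name, the function names, and the safety_cap_rounds default of 50 are all hypothetical; the design only states that some hard safety cap exists in unbounded mode.

```python
def budget_status_report(used: dict, caps: dict) -> str:
    """Render the budget table; metrics without a cap are shown as unbounded."""
    lines = ["# Budget Status", "", "| metric | used | cap | pct |"]
    for metric, value in used.items():
        cap = caps.get(metric)
        if cap:
            lines.append(f"| {metric} | {value} | {cap} | {round(100 * value / cap)}% |")
        else:
            lines.append(f"| {metric} | {value} | unbounded | n/a |")
    return "\n".join(lines)


def budget_exceeded(used: dict, caps: dict, safety_cap_rounds: int = 50) -> bool:
    """Hard stop: any configured cap reached, or the safety cap on rounds.

    safety_cap_rounds is a hypothetical default for unbounded mode.
    """
    if used.get("strategist_rounds", 0) >= safety_cap_rounds:
        return True
    return any(cap and used.get(m, 0) >= cap for m, cap in caps.items())
```

With an empty caps dict the only stop condition left is the safety cap, which matches the unbounded mode described above.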

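2.5 Decision-action tools (write)

Registered in the strategist's mandatory_record_tools. The strategist must call at least one of them each round; otherwise a forced retry is triggered (reusing the existing mechanism).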

propose_lead(...)

{
    "name": "propose_lead",
    "input_schema": {
        "type": "object",
        "required": [
            "description", "target_agent",
            "motivating_hypothesis", "expected_evidence_type",
        ],
        "properties": {
            "description": {
                "type": "string",
                "description": "1-2 sentence specific investigation request, including target source/artefact",
            },
            "target_agent": {
                "type": "string",
                "enum": ["filesystem","registry","communication","network","ios_artifact","android_artifact","media"],
            },
            "source_id": {"type": "string", "description": "which source to investigate"},
            "motivating_hypothesis": {
                "type": "string",
                "description": "hyp-id this lead is meant to corroborate or refute",
            },
            "expected_evidence_type": {
                "type": "string",
                "enum": ["direct_evidence","supports","contradicts","weakens","prerequisite_met","consequence_observed"],
            },
            "rationale": {"type": "string", "description": "why this fills a real gap"},
        }
    }
}

declare_investigation_complete(...)

{
    "name": "declare_investigation_complete",
    "input_schema": {
        "type": "object",
        "required": ["reason"],
        "properties": {
            "reason": {
                "type": "string",
                "enum": [
                    "marginal_yield_zero",
                    "budget_exhausted",
                    "all_hypotheses_resolved",
                    "coverage_saturated",
                    "other",
                ],
            },
            "rationale": {"type": "string"},
        }
    }
}

Terminal tool: calling it ends the loop (reusing the existing terminal_tools mechanism).
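To make the propose_lead constraints concrete, here is a hand-rolled check that mirrors the required list and the two enums from the schema above. It is only an illustration of what the schema accepts; the validate_propose_lead helper is hypothetical, and a real implementation would more likely rely on whatever schema validation the tool framework already performs.

```python
PROPOSE_LEAD_REQUIRED = (
    "description", "target_agent", "motivating_hypothesis", "expected_evidence_type",
)
PROPOSE_LEAD_ENUMS = {
    "target_agent": {
        "filesystem", "registry", "communication", "network",
        "ios_artifact", "android_artifact", "media",
    },
    "expected_evidence_type": {
        "direct_evidence", "supports", "contradicts", "weakens",
        "prerequisite_met", "consequence_observed",
    },
}


def validate_propose_lead(args: dict) -> list[str]:
    """Return a list of problems; an empty list means the call is acceptable."""
    problems = [
        f"missing required field: {k}" for k in PROPOSE_LEAD_REQUIRED if not args.get(k)
    ]
    for key, allowed in PROPOSE_LEAD_ENUMS.items():
        if key in args and args[key] not in allowed:
            problems.append(f"{key} must be one of {sorted(allowed)}")
    return problems
```

Note that source_id and rationale are optional in the schema, and that a call with target_agent="hypothesis" (as suggested in §4.4) would be rejected by the enum as written.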


3. InvestigationStrategist agent

New file agents/strategist.py, about 150 lines.

class InvestigationStrategist(BaseAgent):
    name = "strategist"
    role = (
        "You are the investigation strategist. You do not run forensic tools yourself. "
        "Your job is to read the current evidence graph and decide ONE of:\n"
        "  (a) propose 1-3 new investigation leads that would materially affect an active hypothesis, or\n"
        "  (b) declare the investigation complete.\n"
        "\n"
        "Use graph_overview / source_coverage / marginal_yield / budget_status to ground your judgment. "
        "DO NOT propose a lead that just adds more same-direction evidence to an already-supported hypothesis "
        "(harmonic damping makes it ~useless). DO propose leads when:\n"
        "  - A hypothesis is supported by edges from only ONE source — get cross-source corroboration.\n"
        "  - A hypothesis is in the active band (0.2 < conf < 0.8) — it needs the deciding evidence.\n"
        "  - A specific high-value artefact is uncovered on a source where the active hypotheses suggest it matters.\n"
        "\n"
        "Declare complete when marginal_yield is approaching zero AND no remaining active hypotheses have "
        "obvious investigation paths."
    )

    mandatory_record_tools = ("propose_lead", "declare_investigation_complete")
    terminal_tools = ("declare_investigation_complete",)

    def _register_graph_tools(self):
        # Read-only tools — strategist NEVER writes phenomena/edges directly.
        # All graph writes happen via the workers it dispatches.
        self._register_graph_read_tools()
        # No graph_write_tools.
        # Add strategy-specific tools:
        for tool_name in (
            "graph_overview", "source_coverage", "marginal_yield", "budget_status",
            "propose_lead", "declare_investigation_complete",
        ):
            td = TOOL_CATALOG[tool_name]
            self.register_tool(td.name, td.description, td.input_schema, td.executor)

Registered in agent_factory._AGENT_CLASSES["strategist"].


4. Orchestrator changes

4.1 Delete/replace: the current Phase 3

The current logic in orchestrator.py:Phase 3 (about 150 lines): check leads → dispatch worker → check converged → exit. Delete it.

4.2 New Phase 3: the strategist loop

async def _phase3_strategist_loop(self, run_dir: Path) -> None:
    """Belief-driven investigation: strategist proposes, workers execute, repeat."""
    _log("Phase 3: Strategist-Driven Investigation", event="phase")

    strategist = self.factory.get_or_create_agent("strategist")
    # Key matches the config schema in §6 (strategist.max_rounds).
    max_rounds = self.config.get("strategist", {}).get("max_rounds", 10)

    for round_num in range(1, max_rounds + 1):
        # 1. Record round start + snapshot
        rid = await self.graph.start_investigation_round(round_num)

        # 2. Strategist run
        _log(f"Strategist Round {round_num}", event="phase")
        await strategist.run(
            f"Review the graph and decide the next investigation action. "
            f"This is round {round_num}/{max_rounds}. Budget used so far: see budget_status."
        )

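        # 3. Did strategist declare complete?
        if self.graph.is_round_terminal(rid):
            # Close the round before leaving so that resume logic (§5)
            # does not treat a cleanly finished round as failed.
            await self.graph.complete_investigation_round(rid)
            _log(f"Strategist declared complete at round {round_num}", event="progress")
            break

        # 4. Collect new leads proposed this round
        new_leads = self.graph.leads_from_round(round_num)
        if not new_leads:
            await self.graph.complete_investigation_round(rid)
            _log(f"No leads proposed in round {round_num}; stopping", event="progress")
            break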

        # 5. Dispatch each lead
        for lead in new_leads:
            await self._execute_lead(lead, round_num)

        # 6. Close round + record yield
        await self.graph.complete_investigation_round(rid)

        # 7. Hard budget check
        if self._budget_exceeded():
            _log(f"Budget exhausted at round {round_num}", event="progress")
            break

4.3 _execute_lead reuses the existing worker dispatch logic

async def _execute_lead(self, lead: Lead, round_num: int) -> None:
    agent_type = AGENT_ALIASES.get(lead.target_agent, lead.target_agent)
    worker = self.factory.get_or_create_agent(agent_type)
    if worker is None:
        logger.warning(f"No worker for lead {lead.id}: {agent_type}")
        return

    src = self.graph.case.get_source(lead.source_id) if lead.source_id else None
    if src:
        self.graph.set_active_source(src)

    _log(
        f"Round {round_num} dispatching: {lead.description}",
        event="dispatch", agent=agent_type,
    )
    await worker.run(
        f"Investigate this specific lead from the strategist:\n\n"
        f"REQUEST: {lead.description}\n"
        f"MOTIVATING HYPOTHESIS: {lead.motivating_hypothesis}\n"
        f"EXPECTED EVIDENCE TYPE: {lead.expected_evidence_type}\n"
        f"RATIONALE: {lead.rationale}\n\n"
        f"After investigating, record findings via add_phenomenon AND link relevant phenomena "
        f"to {lead.motivating_hypothesis} via the appropriate edge_type."
    )
    lead.status = "completed"
    self.graph._auto_save()

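4.4 Automatic hypothesis regeneration (optional, recommended)

New phenomena may give rise to new hypotheses (not just update existing ones). Let the strategist trigger this explicitly via propose_lead(target_agent="hypothesis", description="re-examine recent phenomena for new hypotheses"): the strategist decides this itself, and it is not timer-driven. Consistency is preferred over automatic scheduling. Note that "hypothesis" is not in the target_agent enum of §2.5 as written, so the enum would need to be extended for this call to validate.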


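5. State persistence

graph_state.json gains a new top-level key investigation_rounds: list[InvestigationRound], handled by save_state / load_state. On disconnect recovery:

  • Find the most recent round that is not completed → treat that round as failed
  • Restart from the next round
  • Phenomena / edges from completed rounds are kept as they are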
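The recovery rule above can be sketched as a small helper that decides which round number to start from. Rounds are modelled as plain dicts here, and the failed flag is a hypothetical marker; the design does not say how a failed round is recorded.

```python
def resume_round_number(rounds: list[dict]) -> int:
    """Pick the round number to start from after a restart.

    Each round is assumed to be a dict with 'round_number' and 'completed_at'
    keys; 'failed' is a hypothetical flag set on an interrupted round.
    """
    if not rounds:
        return 1
    last = max(rounds, key=lambda r: r["round_number"])
    if not last["completed_at"]:
        # Interrupted mid-round: keep whatever it wrote, but do not re-run it.
        last["failed"] = True
    return last["round_number"] + 1
```

The loop in §4.2 would then iterate from this number instead of 1 when a saved state exists.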

6. Configuration

New in config.yaml:

strategist:
  enabled: true                     # false = use the old Phase 3 logic (safety fallback)
  max_rounds: 10
  hard_stop_marginal_yield_zero_rounds: 3  # force stop after 3 consecutive rounds with yield=0

budgets:
  tool_calls_total: 5000
  wall_clock_minutes_max: 480
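Loading these fields with defaults might look like the sketch below. It operates on an already-parsed dict so that it does not depend on any particular YAML library; load_strategist_config and DEFAULTS are hypothetical names, and the default values simply repeat the ones shown in the fragment above.

```python
import copy

DEFAULTS = {
    "strategist": {
        "enabled": True,
        "max_rounds": 10,
        "hard_stop_marginal_yield_zero_rounds": 3,
    },
    # No defaults under "budgets": an absent cap means unbounded mode (§2.4).
    "budgets": {},
}


def load_strategist_config(raw) -> dict:
    """Merge the parsed config over the defaults, section by section."""
    merged = copy.deepcopy(DEFAULTS)
    for section in ("strategist", "budgets"):
        merged[section].update((raw or {}).get(section) or {})
    if merged["strategist"]["max_rounds"] < 1:
        raise ValueError("strategist.max_rounds must be >= 1")
    return merged
```

Leaving budgets without defaults is a deliberate choice in this sketch: a missing cap then reads as "unbounded" rather than silently inheriting a number.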

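7. Test strategy

New file tests/test_strategist.py, or add to test_optimizations.py. At minimum, test:

  1. When the strategist calls declare_complete, the loop exits immediately
  2. When the strategist calls propose_lead, the lead enters the graph with the correct round_number
  3. The round snapshot correctly captures before/after status
  4. When the budget is exhausted, the loop is force-stopped even if the strategist wants to continue
  5. Disconnect recovery: after a mid-run interruption, a restart begins from the next round
  6. graph_overview output includes the distinct_sources annotation
  7. source_coverage marks untouched items with ✗
  8. marginal_yield numbers agree with confidence_log

No LLM integration tests: strategist behaviour is verified with a mock LLM (this pattern already exists, see test_forced_record_retry_fires_when_zero_phenomena).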


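8. Implementation order

Ordered by dependency (each step is its own commit; this is a structural refit, so single-point rollback matters):

| # | Content | Depends on | Effort estimate |
|---|---------|------------|-----------------|
| 1 | Lead gains 4 fields + InvestigationRound dataclass + serialization | none | 60 lines + tests |
| 2 | graph_overview / source_coverage / marginal_yield / budget_status implementation | 1 | 250 lines + tests |
| 3 | propose_lead / declare_investigation_complete tools | 1 | 80 lines + tests |
| 4 | InvestigationStrategist agent class | 2, 3 | 120 lines + tests |
| 5 | Orchestrator Phase 3 rewrite | 4 | 150 lines (replacing ~50 old lines) + tests |
| 6 | config schema + loading logic | 5 | 30 lines |
| 7 | Disconnect recovery handling | 5 | 40 lines + tests |
| 8 | Real-case smoke run (small scale, USB only) | 7 | 0 code |
| 9 | Docs: DESIGN.md §4.9 rewrite + archiving this file | 8 | docs |

Total: ~800 lines of new code + tests + docs.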


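9. Risks + mitigations

| Risk | Mitigation |
|------|------------|
| Strategist too conservative (always declare_complete) | Add prompt examples showing what a "worth digging deeper" situation looks like; validate on small samples during testing |
| Strategist too aggressive (proposes 7+ leads every round) | The propose_lead tool schema limits each round to at most 3-5; the prompt stresses "quality over quantity" |
| A single worker cannot finish a lead, causing a budget avalanche | The worker call's own max_iter is unchanged; the strategist budget is separate |
| LLM does not grasp hints such as distinct_sources | Append 1-2 plain-English sentences of interpretation at the end of graph_overview: "Hypothesis X has 23 edges but all from one source → cross-source corroboration would strengthen it" |
| Leads produced during Phase 1 are ignored by the strategist | The strategist prompt states explicitly: "handle existing pending leads first, then produce new ones" |
| Infinite loop (strategist keeps producing the same lead) | Deduplicate leads on the triple (motivating_hyp, expected_type, source_id) |
| Maintenance cost of the EXPECTED_ARTEFACTS list | Deliberately kept as "soft hints": an incomplete list does not break anything; some depth just relies more on the LLM's own initiative |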
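The triple-based deduplication named in the table can be sketched as follows. Leads are modelled as plain dicts using the field names from §1.1; lead_key and add_lead_if_new are hypothetical helper names.

```python
def lead_key(lead: dict) -> tuple:
    """The dedup triple: (motivating hypothesis, expected edge type, source)."""
    return (
        lead["motivating_hypothesis"],
        lead["expected_evidence_type"],
        lead.get("source_id", ""),
    )


def add_lead_if_new(existing: list[dict], candidate: dict) -> bool:
    """Append candidate unless a lead with the same triple already exists."""
    seen = {lead_key(lead) for lead in existing}
    if lead_key(candidate) in seen:
        return False
    existing.append(candidate)
    return True
```

Because the description text is not part of the key, a reworded request for the same hypothesis, edge type and source still counts as a duplicate.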

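10. Open questions

  1. Should an InvestigationRound run the hypothesis agent by itself? Leaning towards having the strategist trigger it explicitly through a lead (better consistency), with no timer-driven trigger.

  2. What happens when the budget is overrun: hard stop or soft warning? The current design hard-stops; a schema enforcement such as "when the strategist sees budget < 10% it may only declare_complete" could be added.

  3. Should an "independence bonus" for cross-source edges be folded into log-odds? The last damping change used 1/k and did not distinguish cross-source from same-source edges. If it is to be included, the formula should become 1/k_within_source × bonus_for_distinct_sources. That is separate follow-up work.

  4. Should the strategist's rationale output go through grounding? It does not write phenomena, but the rationale field may contain concrete values ("based on inv-12345..."). Leaning towards not enforcing it: this is a meta-level judgment, not a grounded fact.

  5. Keep or delete the current Phase 3 max_investigation_rounds config? Suggest keeping it as the fallback knob for when strategist.enabled=false.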
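For question 3, the difference between the two damping schemes can be illustrated with a toy calculation. This is a sketch under stated assumptions, not the code in evidence_graph.py: it assumes the current scheme scales the k-th edge by 1/k regardless of source, and that the proposed variant counts k separately per source and multiplies each contribution by a constant bonus_for_distinct_sources. Edges are (source_id, weight) pairs and both function names are hypothetical.

```python
from collections import defaultdict


def damped_sum_flat(edges: list[tuple[str, float]]) -> float:
    """Assumed current scheme: the k-th edge overall is scaled by 1/k."""
    return sum(w / k for k, (_, w) in enumerate(edges, start=1))


def damped_sum_per_source(
    edges: list[tuple[str, float]], bonus_for_distinct_sources: float = 1.0
) -> float:
    """Assumed variant: k is counted within each source, times a bonus factor."""
    seen: dict[str, int] = defaultdict(int)
    total = 0.0
    for source_id, w in edges:
        seen[source_id] += 1
        total += w / seen[source_id] * bonus_for_distinct_sources
    return total
```

Under these assumptions, three unit-weight edges from one source score the same in both schemes (1 + 1/2 + 1/3), while three unit-weight edges from three different sources are undamped in the per-source variant. That is the cross-source reward the question is asking about.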


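11. Relationship to DESIGN.md

Once this document lands, DESIGN.md needs the following corresponding updates:

  • §4.5: add a paragraph: "Also look at the structure behind log_odds: the edges_in count / distinct_sources are the key signals the strategist uses to decide whether to dig deeper, not just the confidence value."
  • §4.9 Phase 3: change the table content from "leads dispatched to source-aware agents" to "strategist loop: read the graph, propose, execute, review, stop / continue".
  • §8 (design trade-offs): add a sixth item: "Trade-off of an LLM-driven scheduling layer: the strategist decides depth, but each round's budget is hard-limited by budgets.*; this is how the "LLM proposes, code adjudicates" principle is honored at the scheduling layer."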

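12. Memo: problems this design does not solve

  • The root cause of the 8% hit rate on exam questions is an incomplete toolset (no vision, no ZIP brute-forcing, no VeraCrypt mounting, no blockchain explorer), not scheduling. The strategist makes the existing tools get used harder, but it does not conjure new tools.
  • LLM-fabricated invocation_id values (already patched, see the feedback_grounding_pending memory) and log-odds inflation (already patched: harmonic damping) are prerequisites of this design and outside its scope.
  • Finer-grained per-edge-type Bayesian modelling (such as a cross-source independence bonus) is separate work.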