Initial commit

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
BattleTag
2026-05-09 17:36:26 +08:00
commit 097d2ce472
25 changed files with 5944 additions and 0 deletions

1
.python-version Normal file

@@ -0,0 +1 @@
3.14

168
README.md Normal file

@@ -0,0 +1,168 @@
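# MASForensics
Multi-Agent System for Digital Forensics: an LLM-based multi-agent digital forensics system.
Six specialized agents cooperate to run automated forensic analysis on a disk image and produce a structured forensic report.
## Architecture
```
main.py                        Entry point: config loading, resume detection, run management
├── Orchestrator               Four-phase pipeline scheduling
│   │
│   ├── FileSystemAgent        Disk layout, file systems, deleted files, Prefetch
│   ├── RegistryAgent          Registry analysis (system/user/network/software)
│   ├── CommunicationAgent     Email, IRC chat logs
│   ├── NetworkAgent           Browser history, PCAP captures
│   ├── TimelineAgent          Cross-category timeline correlation
│   └── ReportAgent            Consolidated report generation
├── Blackboard                 Shared knowledge base (Evidence + Lead)
└── LLMClient                  Claude API calls (ReAct pattern)
```
Agents do not communicate with each other directly; they share findings (Evidence) and leads (Lead) through the **Blackboard**.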
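To make the blackboard pattern concrete, here is a minimal sketch of agents exchanging findings only through a shared store. All names in it (`Blackboard`, `Evidence`, `Lead`, `add_evidence`, `add_lead`, `pending_leads`) are assumptions chosen for illustration; the project's actual data model may differ.

```python
from dataclasses import dataclass, field


@dataclass
class Evidence:
    source_agent: str   # agent that recorded the finding
    summary: str


@dataclass
class Lead:
    target_agent: str   # agent type that should follow up
    description: str
    status: str = "pending"


@dataclass
class Blackboard:
    """Hypothetical shared store; agents never call each other directly."""
    evidence: list[Evidence] = field(default_factory=list)
    leads: list[Lead] = field(default_factory=list)

    def add_evidence(self, source_agent: str, summary: str) -> None:
        self.evidence.append(Evidence(source_agent, summary))

    def add_lead(self, target_agent: str, description: str) -> None:
        self.leads.append(Lead(target_agent, description))

    def pending_leads(self, agent_type: str) -> list[Lead]:
        # Each agent only picks up the leads addressed to its own type.
        return [
            lead for lead in self.leads
            if lead.target_agent == agent_type and lead.status == "pending"
        ]


board = Blackboard()
board.add_evidence("filesystem", "Found NTUSER.DAT under a user profile")
board.add_lead("registry", "Parse the extracted NTUSER.DAT hive")
```

In this sketch the filesystem agent hands work to the registry agent purely by writing a lead; the registry agent discovers it by querying the board.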
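## Investigation workflow
| Phase | Description |
|------|------|
| **Phase 1** | FileSystemAgent surveys the disk image, identifies partitions, directory structure, and key files, and produces the initial Leads |
| **Phase 2** | Multi-round lead tracking: Leads are grouped by agent type and dispatched in parallel, for at most 10 iterations |
| **Phase 2.5** | Coverage gap analysis: checks the 10 investigation areas listed in config.yaml and fills any gaps automatically |
| **Phase 3** | TimelineAgent combines all evidence into an event timeline |
| **Phase 4** | ReportAgent generates the forensic report in Markdown |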
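The Phase 2 loop (group leads by agent type, dispatch the groups in parallel, stop after at most 10 rounds) could look roughly like the sketch below. The `run_agent` stub and the `(agent_type, lead)` tuple representation are assumptions made for this example, not the orchestrator's real interface.

```python
import asyncio
from collections import defaultdict

MAX_ROUNDS = 10  # mirrors the "at most 10 iterations" limit in the table


async def run_agent(agent_type: str, leads: list[str]) -> list[tuple[str, str]]:
    """Hypothetical agent call: returns follow-up (agent_type, lead) pairs."""
    await asyncio.sleep(0)  # stand-in for LLM reasoning and tool calls
    return []


async def follow_leads(initial, run=run_agent) -> int:
    """Dispatch leads until none remain or the round limit is hit; return rounds used."""
    pending = list(initial)
    rounds = 0
    while pending and rounds < MAX_ROUNDS:
        grouped = defaultdict(list)
        for agent_type, lead in pending:
            grouped[agent_type].append(lead)
        # One task per agent type; the groups run concurrently.
        results = await asyncio.gather(
            *(run(agent_type, leads) for agent_type, leads in grouped.items())
        )
        # Leads produced in this round become the work list for the next one.
        pending = [pair for new_leads in results for pair in new_leads]
        rounds += 1
    return rounds
```

The round cap matters because agents can keep generating leads for each other; without it a cyclic pair of leads would never terminate.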
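## Forensic toolchain
### Sleuth Kit (disk forensics)
TSK command-line tools are invoked as asynchronous subprocesses:
| Tool | Purpose |
|------|------|
| `mmls` | Partition table analysis |
| `fsstat` | File system metadata |
| `fls` | Directory listing (including deleted files) |
| `icat` | Extract a file by inode |
| `srch_strings` | String search across the disk |
| `fls -m` | MAC timeline generation |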
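A minimal sketch of invoking a command-line tool as an asynchronous subprocess, using only the standard `asyncio` API. The helper names `run_tool` and `partition_table` and the 60-second default timeout are assumptions for illustration; the project's `tools/sleuthkit.py` wrapper may be structured differently.

```python
import asyncio


async def run_tool(*argv: str, timeout: float = 60.0) -> str:
    """Run a command-line tool without blocking the event loop; return its stdout."""
    proc = await asyncio.create_subprocess_exec(
        *argv,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )
    try:
        stdout, stderr = await asyncio.wait_for(proc.communicate(), timeout)
    except asyncio.TimeoutError:
        # Do not leave a stuck tool running in the background.
        proc.kill()
        await proc.wait()
        raise
    if proc.returncode != 0:
        raise RuntimeError(f"{argv[0]} failed: {stderr.decode(errors='replace')}")
    return stdout.decode(errors="replace")


async def partition_table(image_path: str) -> str:
    # `mmls <image>` prints the partition layout; needs The Sleuth Kit on PATH.
    return await run_tool("mmls", image_path)
```

Passing arguments as a list (rather than a shell string) avoids quoting problems with image paths that contain spaces.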
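### regipy (registry parsing)
Parses Windows registry hive binaries (SYSTEM, SOFTWARE, SAM, NTUSER.DAT) directly, extracting system information, user accounts, network configuration, installed software, email accounts, shutdown time, and more.
### File parsers
- **Prefetch**: binary parsing of Windows XP .pf files (run count, last execution time)
- **PCAP**: extracts HTTP requests, Host, Cookie, and User-Agent from capture files
- **Generic text/binary**: offset-based reads, regex search, hex dump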
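As an illustration of the offset-based read and hex dump mentioned in the last item, the following is a small sketch. The function name `hex_dump` and its parameters are hypothetical and are not taken from `tools/parsers.py`.

```python
def hex_dump(data: bytes, offset: int = 0, length: int = 64, width: int = 16) -> str:
    """Render `length` bytes starting at `offset` as offset / hex / ASCII rows."""
    chunk = data[offset:offset + length]
    pad = width * 3 - 1  # column width of a full row of hex pairs
    rows = []
    for i in range(0, len(chunk), width):
        row = chunk[i:i + width]
        hex_part = " ".join(f"{b:02x}" for b in row)
        # Non-printable bytes are shown as '.' in the ASCII column.
        text = "".join(chr(b) if 32 <= b < 127 else "." for b in row)
        rows.append(f"{offset + i:08x}  {hex_part:<{pad}}  {text}")
    return "\n".join(rows)
```

A bounded preview like this keeps tool output small enough to hand to an LLM while still exposing file signatures and nearby strings.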
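## Disconnect recovery and data archiving
The system has three layers of protection against network interruptions during long runs:
1. **Automatic Blackboard persistence**: every add_evidence / add_lead call writes to disk automatically (atomic write)
2. **Agent-level fault tolerance**: when a single agent fails, its Lead is marked failed and retried once automatically, and other agents are unaffected
3. **Graceful exit**: after 3 consecutive agent failures, the system saves the results so far and exits cleanly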
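The atomic write in the first layer is commonly implemented by writing to a temporary file and renaming it over the target. The sketch below shows that general technique; the function name `atomic_write_json` is an assumption, and the project's own persistence code may differ in detail.

```python
import json
import os
import tempfile


def atomic_write_json(path: str, state: dict) -> None:
    """Write JSON so readers see either the old file or the new one, never a partial."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must live in the same directory so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(state, f, ensure_ascii=False, indent=2)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp_path, path)  # atomic swap of the finished file into place
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

If the process dies mid-write, only the temporary file is damaged and the previous state file remains intact for recovery.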
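Each run automatically creates a timestamped archive directory:
```
runs/
  2026-04-02T14-30-00/
    config.yaml              Config snapshot
    blackboard_state.json    Live state (used for recovery)
    evidence.json            Structured evidence export
    leads.json               Leads and their final status
    report.md                Forensic report
    run_metadata.json        Run metadata (duration, statistics, errors)
    masforensics.log         Run log
```
After an interruption, run `python main.py` again; the system detects the unfinished run automatically and offers to resume it.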
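One way such resume detection could work is sketched below. It rests on an assumption that is not confirmed by the source: a run counts as unfinished when `blackboard_state.json` exists but `report.md` does not. The function name `find_unfinished_runs` is likewise hypothetical.

```python
from pathlib import Path


def find_unfinished_runs(runs_dir: str = "runs") -> list[Path]:
    """Return run directories with saved state but no final report, newest first.

    Assumption: "unfinished" means blackboard_state.json exists and report.md
    does not. The real check in main.py may use different criteria.
    """
    root = Path(runs_dir)
    if not root.is_dir():
        return []
    candidates = [
        d for d in root.iterdir()
        if d.is_dir()
        and (d / "blackboard_state.json").exists()
        and not (d / "report.md").exists()
    ]
    # Names like 2026-04-02T14-30-00 sort chronologically as plain strings.
    return sorted(candidates, key=lambda d: d.name, reverse=True)
```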
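## Quick start
### Requirements
- Python >= 3.14
- The Sleuth Kit (installed system-wide, providing `mmls`, `fls`, `icat`, and related commands)
- A disk image file placed in the `image/` directory
### Installation
```bash
uv sync
```
### Configuration
Edit `config.yaml` and fill in the LLM API URL and key:
```yaml
agent:
  base_url: "https://your-api-proxy.com"
  api_key: "sk-your-key"
  model: "claude-sonnet-4-6"
  max_tokens: 4096
```
The `investigation_areas` section defines the investigation areas that must be covered; add or remove entries as needed.
### Running
```bash
python main.py
```
The report and all structured data are saved under `runs/<timestamp>/`.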
## Project structure
```
MASForensics/
├── main.py                  Entry point
├── orchestrator.py          Pipeline scheduling
├── blackboard.py            Shared knowledge base
├── llm_client.py            LLM API client
├── base_agent.py            Agent base class
├── config.yaml              Configuration file
├── agents/
│   ├── filesystem.py        File system agent
│   ├── registry.py          Registry agent
│   ├── communication.py     Communication agent
│   ├── network.py           Network agent
│   ├── timeline.py          Timeline agent
│   └── report.py            Report agent
├── tools/
│   ├── sleuthkit.py         Sleuth Kit wrapper
│   ├── registry.py          Registry parsing (regipy)
│   └── parsers.py           File format parsers
├── image/                   Disk images
├── extracted/               Extracted files (generated at runtime)
└── runs/                    Run archives
```
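## Dependencies
| Package | Purpose |
|----|------|
| `httpx[socks]` | Asynchronous HTTP client (with SOCKS proxy support) |
| `pyyaml` | Configuration file parsing |
| `regipy` | Windows registry hive parsing |
## Current case
The default configuration analyzes the **CFReDS Hacking Case** (a standard NIST forensic training image):
- Image: SCHARDT.001 (~4.6 GB, IBM hard drive, 8 segments)
- System: Windows XP
- Scenario: forensic analysis of a computer suspected of being used for hacking
## Tests
```bash
python -m pytest tests/ -v
```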

280
agent_factory.py Normal file

@@ -0,0 +1,280 @@
"""Agent Factory — composes agents from tool registry and role templates.

Provides both pre-defined agent templates (filesystem, registry, etc.)
and LLM-driven dynamic agent composition for capability gaps.
"""

from __future__ import annotations

import json
import logging
from dataclasses import dataclass, field

from base_agent import BaseAgent
from evidence_graph import EvidenceGraph
from llm_client import LLMClient
from tool_registry import TOOL_CATALOG, ToolDefinition

# Agent classes with custom tools — keyed by template name
_AGENT_CLASSES: dict[str, type] = {}


def _load_agent_classes() -> None:
    """Lazy-import custom agent classes to avoid circular imports."""
    if _AGENT_CLASSES:
        return
    from agents.hypothesis import HypothesisAgent
    from agents.report import ReportAgent
    _AGENT_CLASSES["hypothesis"] = HypothesisAgent
    _AGENT_CLASSES["report"] = ReportAgent


logger = logging.getLogger(__name__)


@dataclass
class RoleTemplate:
    """Pre-defined agent archetype."""
    name: str
    role: str
    default_tools: list[str]  # tool names from TOOL_CATALOG
    tags: list[str] = field(default_factory=list)


# Pre-defined templates matching the original 6 agents + hypothesis agent.
ROLE_TEMPLATES: dict[str, RoleTemplate] = {
    "filesystem": RoleTemplate(
        name="filesystem",
        role=(
            "File system forensic analyst. You examine disk image partition layouts, "
            "directory structures, file metadata, and recover deleted files. "
            "You identify suspicious files, installed programs, and user data locations. "
            "You also handle Recycle Bin forensics and Prefetch execution evidence."
        ),
        default_tools=[
            "partition_info", "filesystem_info", "list_directory",
            "extract_file", "find_file", "search_strings",
            "parse_prefetch", "count_deleted_files",
            "read_text_file", "search_text_file", "read_binary_preview",
        ],
        tags=["filesystem", "disk", "files", "deleted", "prefetch"],
    ),
    "registry": RoleTemplate(
        name="registry",
        role=(
            "Windows registry forensic analyst. You parse registry hive files "
            "(SYSTEM, SOFTWARE, SAM, NTUSER.DAT) to extract system configuration, "
            "user accounts, installed software, network settings, email accounts, "
            "and other Windows artifacts."
        ),
        default_tools=[
            "extract_file", "list_directory",
            "parse_registry_key", "list_installed_software",
            "get_user_activity", "search_registry",
            "get_system_info", "get_timezone_info", "get_computer_name",
            "get_shutdown_time", "enumerate_users",
            "get_network_interfaces", "get_email_config",
        ],
        tags=["registry", "windows", "system", "user", "software"],
    ),
    "communication": RoleTemplate(
        name="communication",
        role=(
            "Communication forensic analyst. You analyze email files (.dbx, .pst), "
            "IRC/mIRC chat logs, newsgroup data, and other messaging artifacts "
            "to identify communication patterns and contacts."
        ),
        default_tools=[
            "list_directory", "extract_file",
            "read_text_file", "read_binary_preview",
            "list_extracted_dir", "search_strings",
            "search_text_file", "read_text_file_section",
        ],
        tags=["email", "chat", "irc", "messaging", "communication"],
    ),
    "network": RoleTemplate(
        name="network",
        role=(
            "Network forensic analyst. You analyze browser history, cookies, "
            "network captures (PCAP), wireless artifacts, and other network-related "
            "evidence to reconstruct online activities."
        ),
        default_tools=[
            "list_directory", "extract_file",
            "read_text_file", "read_binary_preview",
            "list_extracted_dir", "search_strings",
            "search_text_file", "read_text_file_section",
            "parse_pcap_strings",
        ],
        tags=["network", "browser", "pcap", "http", "internet"],
    ),
    "timeline": RoleTemplate(
        name="timeline",
        role=(
            "Timeline correlation analyst. You build chronological timelines "
            "by combining filesystem MAC times with evidence from other agents. "
            "You identify temporal patterns and correlate events across categories."
        ),
        default_tools=[
            "build_filesystem_timeline",
        ],
        tags=["timeline", "correlation", "temporal"],
    ),
    "report": RoleTemplate(
        name="report",
        role=(
            "Forensic report writer. You synthesize all evidence and hypotheses "
            "into a comprehensive forensic analysis report with executive summary, "
            "detailed findings organized by hypothesis, timeline of events, and conclusions."
        ),
        default_tools=[],  # Report agent uses only graph query tools
        tags=["report", "summary", "writing"],
    ),
    "hypothesis": RoleTemplate(
        name="hypothesis",
        role=(
            "Hypothesis analyst. You review all phenomena discovered so far "
            "and formulate investigative hypotheses about what happened on the system. "
            "For each hypothesis, identify which existing phenomena support or contradict it."
        ),
        default_tools=[],  # Uses only graph query + hypothesis tools
        tags=["hypothesis", "analysis", "reasoning"],
    ),
}


class AgentFactory:
    """Creates agents from templates or dynamically via LLM composition."""

    def __init__(self, llm: LLMClient, graph: EvidenceGraph) -> None:
        self.llm = llm
        self.graph = graph
        self._cache: dict[str, BaseAgent] = {}

    def get_or_create_agent(self, agent_type: str) -> BaseAgent | None:
        """Get a cached agent or create one from a template."""
        if agent_type in self._cache:
            return self._cache[agent_type]
        template = ROLE_TEMPLATES.get(agent_type)
        if template is None:
            logger.warning("No template for agent type: %s", agent_type)
            return None
        # Use custom agent class if one exists, otherwise BaseAgent
        _load_agent_classes()
        agent_cls = _AGENT_CLASSES.get(agent_type)
        if agent_cls is not None:
            agent = agent_cls(self.llm, self.graph)
        else:
            agent = self._instantiate_from_template(template)
        self._cache[agent_type] = agent
        return agent

    def _instantiate_from_template(self, template: RoleTemplate) -> BaseAgent:
        """Create a BaseAgent from a role template, registering tools from the catalog."""
        agent = BaseAgent(self.llm, self.graph)
        agent.name = template.name
        agent.role = template.role
        for tool_name in template.default_tools:
            td = TOOL_CATALOG.get(tool_name)
            if td is None:
                logger.warning("Tool '%s' not in catalog (template: %s)", tool_name, template.name)
                continue
            agent.register_tool(td.name, td.description, td.input_schema, td.executor)
        return agent

    async def create_specialized_agent(
        self,
        hypothesis_title: str,
        hypothesis_desc: str,
        capability_gap: str,
    ) -> BaseAgent:
        """Use LLM to compose a new agent for a specific investigative need.

        1. LLM sees available tools + the capability gap
        2. LLM selects tools and writes role + strategy
        3. Factory instantiates BaseAgent with chosen tools
        """
        available = "\n".join(
            f"- {td.name}: {td.description} [tags: {', '.join(td.tags)}]"
            for td in TOOL_CATALOG.values()
        )
        prompt = (
            f"You are designing a specialized forensic investigation agent.\n\n"
            f"Hypothesis to investigate: {hypothesis_title}\n"
            f"Details: {hypothesis_desc}\n"
            f"Capability gap: {capability_gap}\n\n"
            f"Available tools:\n{available}\n\n"
            f"Select 3-8 tools from the list above that would be most useful.\n"
            f"Write a role description (2-3 sentences) for this agent.\n"
            f"Write an investigation strategy (3-5 numbered steps).\n\n"
            f"Respond ONLY with JSON (no markdown):\n"
            f'{{"agent_name": "short_name", "role": "...", "tools": ["tool1", "tool2"], "strategy": "1. ...\\n2. ..."}}'
        )
        response = await self.llm.chat(
            messages=[{"role": "user", "content": prompt}],
        )
        # Parse response: try the whole text first, then an embedded JSON object
        config = None
        try:
            config = json.loads(response)
        except json.JSONDecodeError:
            import re
            match = re.search(r'\{.*\}', response, re.DOTALL)
            if match:
                try:
                    config = json.loads(match.group())
                except json.JSONDecodeError:
                    # The extracted span was not valid JSON either
                    config = None
        if not isinstance(config, dict):
            logger.error("Failed to parse agent composition response: %s", response[:300])
            # Fallback: create a generic agent with a fixed set of common tools
            return self._create_fallback_agent(capability_gap)
        agent_name = config.get("agent_name", "specialized")
        role_text = config.get("role", f"Specialized agent for: {capability_gap}")
        strategy = config.get("strategy", "")
        tool_names = config.get("tools", [])
        # Validate tool names against catalog
        valid_tools = [t for t in tool_names if t in TOOL_CATALOG]
        if not valid_tools:
            logger.warning("No valid tools selected by LLM, using fallback")
            return self._create_fallback_agent(capability_gap)
        # Build agent
        agent = BaseAgent(self.llm, self.graph)
        agent.name = agent_name
        agent.role = f"{role_text}\n\nInvestigation Strategy:\n{strategy}"
        for tool_name in valid_tools:
            td = TOOL_CATALOG[tool_name]
            agent.register_tool(td.name, td.description, td.input_schema, td.executor)
        self._cache[agent_name] = agent
        logger.info(
            "Dynamic agent created: '%s' with %d tools: %s",
            agent_name, len(valid_tools), valid_tools,
        )
        return agent

    def _create_fallback_agent(self, gap_description: str) -> BaseAgent:
        """Create a generic agent with common tools as fallback."""
        agent = BaseAgent(self.llm, self.graph)
        agent.name = "generic_investigator"
        agent.role = f"General forensic investigator. Focus: {gap_description}"
        fallback_tools = [
            "list_directory", "extract_file", "read_text_file",
            "read_binary_preview", "search_strings", "search_text_file",
        ]
        for tool_name in fallback_tools:
            td = TOOL_CATALOG.get(tool_name)
            if td:
                agent.register_tool(td.name, td.description, td.input_schema, td.executor)
        self._cache["generic_investigator"] = agent
        return agent

0
agents/__init__.py Normal file

33
agents/communication.py Normal file

@@ -0,0 +1,33 @@
"""Communication Agent — analyzes email, chat logs, and messaging artifacts."""

from __future__ import annotations

from base_agent import BaseAgent
from evidence_graph import EvidenceGraph
from llm_client import LLMClient
from tool_registry import TOOL_CATALOG


class CommunicationAgent(BaseAgent):
    name = "communication"
    role = (
        "Communication forensic analyst. You analyze email files (.dbx, .pst), "
        "IRC/mIRC chat logs, newsgroup data, and other messaging artifacts "
        "to identify communication patterns, contacts, and content."
    )

    def __init__(self, llm: LLMClient, graph: EvidenceGraph) -> None:
        super().__init__(llm, graph)
        self._register_tools()

    def _register_tools(self) -> None:
        tool_names = [
            "list_directory", "extract_file",
            "read_text_file", "read_binary_preview",
            "list_extracted_dir", "search_strings",
            "search_text_file", "read_text_file_section",
        ]
        for name in tool_names:
            td = TOOL_CATALOG.get(name)
            if td:
                self.register_tool(td.name, td.description, td.input_schema, td.executor)

34
agents/filesystem.py Normal file

@@ -0,0 +1,34 @@
"""FileSystem Agent — analyzes disk structure, files, and deleted data."""

from __future__ import annotations

from base_agent import BaseAgent
from evidence_graph import EvidenceGraph
from llm_client import LLMClient
from tool_registry import TOOL_CATALOG


class FileSystemAgent(BaseAgent):
    name = "filesystem"
    role = (
        "File system forensic analyst. You examine disk image partition layouts, "
        "directory structures, file metadata, and recover deleted files. "
        "You identify suspicious files, installed programs, and user data locations. "
        "You also handle malware analysis, Recycle Bin forensics, and Prefetch execution evidence."
    )

    def __init__(self, llm: LLMClient, graph: EvidenceGraph) -> None:
        super().__init__(llm, graph)
        self._register_tools()

    def _register_tools(self) -> None:
        tool_names = [
            "partition_info", "filesystem_info", "list_directory",
            "extract_file", "find_file", "search_strings",
            "parse_prefetch", "count_deleted_files",
            "read_text_file", "search_text_file", "read_binary_preview",
        ]
        for name in tool_names:
            td = TOOL_CATALOG.get(name)
            if td:
                self.register_tool(td.name, td.description, td.input_schema, td.executor)

130
agents/hypothesis.py Normal file

@@ -0,0 +1,130 @@
"""Hypothesis Agent — analyzes phenomena and generates investigative hypotheses."""

from __future__ import annotations

import json
import logging

from base_agent import BaseAgent
from evidence_graph import EvidenceGraph, HYPOTHESIS_EDGE_WEIGHTS
from llm_client import LLMClient

logger = logging.getLogger(__name__)


class HypothesisAgent(BaseAgent):
    name = "hypothesis"
    role = (
        "Hypothesis analyst. You review all phenomena discovered so far "
        "and formulate investigative hypotheses about what happened on this system. "
        "Your ultimate goal: build the most complete picture of events that occurred. "
        "For each hypothesis, identify which existing phenomena support or contradict it."
    )

    def __init__(self, llm: LLMClient, graph: EvidenceGraph) -> None:
        super().__init__(llm, graph)
        self._register_hypothesis_tools()

    def _register_hypothesis_tools(self) -> None:
        """Register hypothesis-specific tools."""
        valid_edge_types = list(HYPOTHESIS_EDGE_WEIGHTS.keys())
        self.register_tool(
            name="add_hypothesis",
            description=(
                "Create a new investigative hypothesis about what happened on the system. "
                "Each hypothesis should be a specific, testable claim."
            ),
            input_schema={
                "type": "object",
                "properties": {
                    "title": {
                        "type": "string",
                        "description": "Short title for the hypothesis.",
                    },
                    "description": {
                        "type": "string",
                        "description": "Detailed description of what this hypothesis claims.",
                    },
                },
                "required": ["title", "description"],
            },
            executor=self._add_hypothesis,
        )
        self.register_tool(
            name="link_phenomenon_to_hypothesis",
            description=(
                "Link an existing phenomenon to a hypothesis with a relationship type. "
                f"Valid relationship types: {', '.join(valid_edge_types)}. "
                "direct_evidence = the phenomenon IS the hypothesis. "
                "supports = consistent with the hypothesis. "
                "prerequisite_met = a necessary condition is satisfied. "
                "consequence_observed = an expected result of the hypothesis is found. "
                "contradicts = directly contradicts the hypothesis. "
                "weakens = makes the hypothesis less likely."
            ),
            input_schema={
                "type": "object",
                "properties": {
                    "phenomenon_id": {
                        "type": "string",
                        "description": "ID of the phenomenon (e.g. 'ph-a1b2c3d4').",
                    },
                    "hypothesis_id": {
                        "type": "string",
                        "description": "ID of the hypothesis (e.g. 'hyp-e5f6g7h8').",
                    },
                    "edge_type": {
                        "type": "string",
                        "enum": valid_edge_types,
                        "description": "The edge_type of the relationship.",
                    },
                    "reason": {
                        "type": "string",
                        "description": "The reason this relationship holds (1-2 sentences).",
                    },
                },
                "required": ["phenomenon_id", "hypothesis_id", "edge_type", "reason"],
            },
            executor=self._link_phenomenon_to_hypothesis,
        )

    async def _add_hypothesis(self, title: str, description: str) -> str:
        hid = await self.graph.add_hypothesis(
            title=title,
            description=description,
            created_by=self.name,
        )
        # Separate the ID from the title so the two do not run together in the output.
        return f"Hypothesis created: {hid} - {title} (confidence: 0.50)"

    async def _link_phenomenon_to_hypothesis(
        self,
        phenomenon_id: str,
        hypothesis_id: str,
        edge_type: str = "",
        reason: str = "",
        # Common LLM misnaming — accept as fallbacks
        relationship: str = "",
        note: str = "",
    ) -> str:
        edge_type = edge_type or relationship
        reason = reason or note
        if not edge_type:
            return "Error: edge_type is required."
        try:
            new_conf = await self.graph.update_hypothesis_confidence(
                hyp_id=hypothesis_id,
                phenomenon_id=phenomenon_id,
                edge_type=edge_type,
                reason=reason,
            )
            weight = HYPOTHESIS_EDGE_WEIGHTS[edge_type]
            direction = "+" if weight > 0 else ""
            return (
                f"Linked: {phenomenon_id} —[{edge_type}]→ {hypothesis_id} "
                f"(weight: {direction}{weight}, new confidence: {new_conf:.3f})"
            )
        except ValueError as e:
            return f"Error linking: {e}"

34
agents/network.py Normal file

@@ -0,0 +1,34 @@
"""Network Agent — analyzes browser history, network tool artifacts, and wireless evidence."""

from __future__ import annotations

from base_agent import BaseAgent
from evidence_graph import EvidenceGraph
from llm_client import LLMClient
from tool_registry import TOOL_CATALOG


class NetworkAgent(BaseAgent):
    name = "network"
    role = (
        "Network forensic analyst. You analyze browser history, cookies, "
        "network captures (PCAP), wireless artifacts, and other network-related "
        "evidence to reconstruct online activities and network attacks."
    )

    def __init__(self, llm: LLMClient, graph: EvidenceGraph) -> None:
        super().__init__(llm, graph)
        self._register_tools()

    def _register_tools(self) -> None:
        tool_names = [
            "list_directory", "extract_file",
            "read_text_file", "read_binary_preview",
            "list_extracted_dir", "search_strings",
            "search_text_file", "read_text_file_section",
            "parse_pcap_strings",
        ]
        for name in tool_names:
            td = TOOL_CATALOG.get(name)
            if td:
                self.register_tool(td.name, td.description, td.input_schema, td.executor)

36
agents/registry.py Normal file

@@ -0,0 +1,36 @@
"""Registry Agent — analyzes Windows registry hives."""

from __future__ import annotations

from base_agent import BaseAgent
from evidence_graph import EvidenceGraph
from llm_client import LLMClient
from tool_registry import TOOL_CATALOG


class RegistryAgent(BaseAgent):
    name = "registry"
    role = (
        "Windows registry forensic analyst. You parse registry hive files "
        "(SYSTEM, SOFTWARE, SAM, NTUSER.DAT) to extract system configuration, "
        "user accounts, installed software, network settings, email accounts, "
        "and other Windows artifacts."
    )

    def __init__(self, llm: LLMClient, graph: EvidenceGraph) -> None:
        super().__init__(llm, graph)
        self._register_tools()

    def _register_tools(self) -> None:
        tool_names = [
            "extract_file", "list_directory",
            "parse_registry_key", "list_installed_software",
            "get_user_activity", "search_registry",
            "get_system_info", "get_timezone_info", "get_computer_name",
            "get_shutdown_time", "enumerate_users",
            "get_network_interfaces", "get_email_config",
        ]
        for name in tool_names:
            td = TOOL_CATALOG.get(name)
            if td:
                self.register_tool(td.name, td.description, td.input_schema, td.executor)

191
agents/report.py Normal file

@@ -0,0 +1,191 @@
"""Report Agent — generates forensic analysis reports."""

from __future__ import annotations

import json
import os

from base_agent import BaseAgent
from evidence_graph import EvidenceGraph
from llm_client import LLMClient


class ReportAgent(BaseAgent):
    name = "report"
    role = (
        "Forensic report writer. You synthesize all findings from the investigation "
        "into a structured, professional forensic analysis report organized by hypotheses.\n\n"
        "IMPORTANT: Only include findings that have a source_tool attribution (marked VERIFIED). "
        "If evidence lacks source attribution, mark it as UNVERIFIED. "
        "Do NOT invent or fabricate any data, timestamps, or findings not present in the evidence.\n\n"
        "CRITICAL: You MUST call save_report to write the final report."
    )

    def __init__(self, llm: LLMClient, graph: EvidenceGraph) -> None:
        super().__init__(llm, graph)
        self._register_tools()

    def _build_system_prompt(self, task: str) -> str:
        """Report agent gets a clean prompt — no Phase A/B/C/D workflow."""
        return (
            f"You are a forensic report writer.\n"
            f"Role: {self.role}\n\n"
            f"Investigation state:\n{self.graph.stats_summary()}\n\n"
            f"Your task: {task}\n\n"
            f"WORKFLOW:\n"
            f"1. Call get_hypotheses_with_evidence to get all hypotheses and their linked evidence\n"
            f"2. Call get_all_phenomena to get detailed findings by category\n"
            f"3. Call get_entities to get people, programs, and hosts\n"
            f"4. Call get_case_info for case metadata\n"
            f"5. Write the complete report directly in your <answer> block\n\n"
            f"RULES:\n"
            f"- Write the report DIRECTLY in <answer> — do NOT use save_report tool\n"
            f"- Only include findings present in the evidence graph\n"
            f"- Do NOT invent timestamps, file paths, or data not in the phenomena\n"
            f"- The report must be complete — do not cut off mid-section\n"
        )

    def _register_tools(self) -> None:
        self.register_tool(
            name="get_all_phenomena",
            description="Get all phenomena across all categories with full details.",
            input_schema={"type": "object", "properties": {}},
            executor=self._get_all_phenomena,
        )
        self.register_tool(
            name="get_hypotheses_with_evidence",
            description="Get all hypotheses with their linked phenomena (supporting and contradicting).",
            input_schema={"type": "object", "properties": {}},
            executor=self._get_hypotheses_with_evidence,
        )
        self.register_tool(
            name="get_case_info",
            description="Get case metadata (image info, drive details, etc.).",
            input_schema={"type": "object", "properties": {}},
            executor=self._get_case_info,
        )
        self.register_tool(
            name="get_entities",
            description="Get all entities (people, programs, hosts) and their connections.",
            input_schema={"type": "object", "properties": {}},
            executor=self._get_entities,
        )
        self.register_tool(
            name="save_report",
            description="Save the final report to a file.",
            input_schema={
                "type": "object",
                "properties": {
                    "content": {"type": "string", "description": "Report content in Markdown."},
                    "output_path": {"type": "string", "description": "File path to save the report."},
                },
                "required": ["content", "output_path"],
            },
            executor=self._save_report,
        )
        self.register_tool(
            name="verify_phenomena",
            description="Check phenomena provenance — VERIFIED (has source_tool) vs UNVERIFIED.",
            input_schema={"type": "object", "properties": {}},
            executor=self._verify_phenomena,
        )

    async def _get_all_phenomena(self) -> str:
        phenomena = self.graph.phenomena
        if not phenomena:
            return "No phenomena in the evidence graph."
        categories = sorted(set(ph.category for ph in phenomena.values()))
        lines = [f"=== All Phenomena ({len(phenomena)} entries) ==="]
        for cat in categories:
            items = [ph for ph in phenomena.values() if ph.category == cat]
            lines.append(f"\n--- {cat.upper()} ({len(items)} entries) ---")
            for ph in items:
                verified = "VERIFIED" if ph.source_tool else "UNVERIFIED"
                lines.append(f"\n[{verified}] {ph.title} ({ph.id})")
                lines.append(f" Source: {ph.source_agent} | Tool: {ph.source_tool or 'N/A'}")
                if ph.timestamp:
                    lines.append(f" Timestamp: {ph.timestamp}")
                lines.append(f" {ph.description[:500]}")
        return "\n".join(lines)

    async def _get_hypotheses_with_evidence(self) -> str:
        if not self.graph.hypotheses:
            return "No hypotheses defined."
        lines = [f"=== Hypotheses ({len(self.graph.hypotheses)}) ==="]
        for hyp in self.graph.hypotheses.values():
            lines.append(f"\n### {hyp.title}")
            lines.append(f"Confidence: {hyp.confidence:.2f} | Status: {hyp.status}")
            lines.append(f"Description: {hyp.description}")
            related = self.graph.get_related(hyp.id, direction="in")
            supporting = [r for r in related if r["edge_type"] in ("direct_evidence", "supports", "prerequisite_met", "consequence_observed")]
            contradicting = [r for r in related if r["edge_type"] in ("contradicts", "weakens")]
            if supporting:
                lines.append(f"\nSupporting evidence ({len(supporting)}):")
                for r in supporting:
                    lines.append(f" [{r['edge_type']}] {r['node']}")
            if contradicting:
                lines.append(f"\nContradicting evidence ({len(contradicting)}):")
                for r in contradicting:
                    lines.append(f" [{r['edge_type']}] {r['node']}")
            if not supporting and not contradicting:
                lines.append(" (no evidence linked)")
        return "\n".join(lines)

    async def _get_case_info(self) -> str:
        info = self.graph.case_info
        lines = ["=== Case Information ==="]
        for k, v in info.items():
            lines.append(f" {k}: {v}")
        lines.append(f" Image path: {self.graph.image_path}")
        lines.append(f" Partition offset: {self.graph.partition_offset}")
        return "\n".join(lines)

    async def _get_entities(self) -> str:
        if not self.graph.entities:
            return "No entities recorded."
        lines = [f"=== Entities ({len(self.graph.entities)}) ==="]
        for ent in self.graph.entities.values():
            lines.append(f"\n{ent.name} ({ent.entity_type})")
            if ent.description:
                lines.append(f" {ent.description}")
            related = self.graph.get_related(ent.id, direction="in")
            if related:
                for r in related:
                    lines.append(f" ← [{r['edge_type']}] {r['node']}")
        return "\n".join(lines)

    async def _verify_phenomena(self) -> str:
        verified = []
        unverified = []
        for ph in self.graph.phenomena.values():
            entry = f" [{ph.category}] {ph.title} (agent: {ph.source_agent}, tool: {ph.source_tool or 'N/A'})"
            if ph.source_tool:
                verified.append(entry)
            else:
                unverified.append(entry)
        lines = ["=== Phenomena Verification Report ==="]
        lines.append(f"\nVERIFIED ({len(verified)} — have source_tool):")
        lines.extend(verified)
        lines.append(f"\nUNVERIFIED ({len(unverified)} — no source_tool):")
        lines.extend(unverified)
        return "\n".join(lines)

    async def _save_report(self, content: str, output_path: str) -> str:
        try:
            os.makedirs(os.path.dirname(output_path) or ".", exist_ok=True)
            # Explicit UTF-8 so non-ASCII report text is written the same way on every platform.
            with open(output_path, "w", encoding="utf-8") as f:
                f.write(content)
            return f"Report saved to {output_path} ({len(content)} chars)"
        except Exception as e:
            return f"Error saving report: {e}"

88
agents/timeline.py Normal file

@@ -0,0 +1,88 @@
"""Timeline Agent — correlates evidence across time."""

from __future__ import annotations

import json

from base_agent import BaseAgent
from evidence_graph import EvidenceGraph
from llm_client import LLMClient
from tool_registry import TOOL_CATALOG


class TimelineAgent(BaseAgent):
    name = "timeline"
    role = (
        "Timeline forensic analyst. You build chronological timelines from filesystem "
        "MAC timestamps and correlate events across all phenomena categories in the "
        "evidence graph to reconstruct the sequence of activities on the system."
    )

    def __init__(self, llm: LLMClient, graph: EvidenceGraph) -> None:
        super().__init__(llm, graph)
        self._register_tools()

    def _register_tools(self) -> None:
        # Filesystem timeline tool from catalog
        td = TOOL_CATALOG.get("build_filesystem_timeline")
        if td:
            self.register_tool(td.name, td.description, td.input_schema, td.executor)
        # Custom tool to get all phenomena with timestamps for correlation
        self.register_tool(
            name="get_timestamped_phenomena",
            description="Get all phenomena that have timestamps, sorted chronologically. Use for timeline correlation.",
            input_schema={"type": "object", "properties": {}},
            executor=self._get_timestamped_phenomena,
        )
        # Tool to add temporal edges between phenomena
        self.register_tool(
            name="add_temporal_edge",
            description="Add a temporal relationship between two phenomena (before, after, or concurrent).",
            input_schema={
                "type": "object",
                "properties": {
                    "source_id": {"type": "string", "description": "ID of the earlier/source phenomenon."},
                    "target_id": {"type": "string", "description": "ID of the later/target phenomenon."},
                    "relation": {
                        "type": "string",
                        "enum": ["before", "after", "concurrent"],
                        "description": "Temporal relationship.",
                    },
                },
                "required": ["source_id", "target_id", "relation"],
            },
            executor=self._add_temporal_edge,
        )

    async def _get_timestamped_phenomena(self) -> str:
        items = [
            ph for ph in self.graph.phenomena.values()
            if ph.timestamp
        ]
        items.sort(key=lambda ph: ph.timestamp or "")
        if not items:
            return "No phenomena with timestamps found."
        lines = []
        for ph in items:
            lines.append(f"{ph.timestamp} | [{ph.category}] {ph.title} ({ph.id})")
            lines.append(f" {ph.description[:150]}")
        return "\n".join(lines)

    async def _add_temporal_edge(
        self, source_id: str, target_id: str, relation: str,
    ) -> str:
        try:
            await self.graph.add_edge(
                source_id=source_id,
                target_id=target_id,
                edge_type="temporal",
                metadata={"relation": relation},
                created_by=self.name,
            )
            return f"Temporal edge added: {source_id} —[{relation}]→ {target_id}"
        except ValueError as e:
            return f"Error: {e}"

448
base_agent.py Normal file

@@ -0,0 +1,448 @@
"""Base class for forensic analysis agents."""

from __future__ import annotations

import json
import logging
import time
from typing import Any

from evidence_graph import EvidenceGraph
from llm_client import LLMClient

logger = logging.getLogger(__name__)


def _log(msg: str, **extra) -> None:
    """Emit a structured log message with extra fields."""
    logger.info(msg, extra=extra)


class BaseAgent:
    """Base class for all forensic agents.

    Each agent has:
    - A name and role description
    - A set of tools it can use (registered as methods)
    - Access to the shared EvidenceGraph
    - An LLM client for reasoning and tool-calling
    """

    name: str = "base"
    role: str = "A forensic analysis agent."

    def __init__(self, llm: LLMClient, graph: EvidenceGraph) -> None:
        self.llm = llm
        self.graph = graph
        self._tools: dict[str, dict] = {}  # name -> schema
        self._executors: dict[str, Any] = {}  # name -> async callable
        self._work_log: list[str] = []

    def register_tool(
        self,
        name: str,
        description: str,
        input_schema: dict,
        executor: Any,
    ) -> None:
        """Register a tool that this agent can use."""
        self._tools[name] = {
            "name": name,
            "description": description,
            "input_schema": input_schema,
        }
        self._executors[name] = executor

    def get_tool_definitions(self) -> list[dict]:
        """Get tool definitions in Claude API format."""
        return list(self._tools.values())

    def _build_system_prompt(self, task: str) -> str:
        """Build the system prompt — lightweight stats, no full evidence dump."""
        work_log_section = ""
        if self._work_log:
            entries = self._work_log[-5:]
            log_lines = "\n".join(f" {i+1}. {entry}" for i, entry in enumerate(entries))
            work_log_section = (
                f"\nYour prior work on this investigation:\n{log_lines}\n"
                f"Avoid repeating tools/approaches that already succeeded or failed. Build on prior findings.\n"
            )
        return (
            f"You are {self.name}, a specialized digital forensics agent.\n"
            f"Role: {self.role}\n\n"
            f"You are analyzing a disk image as part of a multi-agent forensic investigation.\n"
            f"Image: {self.graph.image_path}\n\n"
            f"Current investigation state:\n{self.graph.stats_summary()}\n"
            f"{work_log_section}\n"
            f"Your current task: {task}\n\n"
            f"CRITICAL WORKFLOW — you MUST follow these steps IN ORDER, one phase at a time:\n\n"
            f"Phase A — INVESTIGATE:\n"
            f" Use list_phenomena/search_graph to review existing findings.\n"
            f" Call list_assets to see what files are already extracted.\n"
            f" Call investigation tools (list_directory, parse_registry_key, etc.) to gather data.\n"
            f" Only extract_file for forensically relevant files (user data, logs, configs, hives) — NOT system DLLs or OS files.\n"
            f" Create add_lead for anything outside your expertise.\n\n"
            f"Phase B — RECORD PHENOMENA:\n"
            f" For EACH significant finding from Phase A, call add_phenomenon.\n"
            f" Do NOT call link_to_entity yet — just record all phenomena first.\n\n"
            f"Phase C — LINK ENTITIES:\n"
            f" FIRST call list_phenomena to get the current IDs — do NOT rely on memory.\n"
            f" Then call link_to_entity for each relevant phenomenon.\n"
            f" NEVER guess or fabricate a phenomenon ID. If an ID is not in list_phenomena output, it does not exist.\n\n"
            f"Phase D — ANSWER:\n"
            f" Only give your <answer> AFTER completing Phases B and C.\n\n"
            f"IMPORTANT:\n"
            f"- You MUST call add_phenomenon at least once before finishing\n"
            f"- Complete each phase before starting the next\n"
            f"- Other agents can ONLY see what you write to the graph\n"
            f"- If you don't record findings, they are LOST\n"
            f"- Include relevant file paths, inode numbers, timestamps, and raw data\n\n"
            f"ANTI-HALLUCINATION RULES — STRICTLY ENFORCED:\n"
            f"- ONLY record findings that appear VERBATIM in tool results you received\n"
            f"- NEVER invent or guess timestamps, file paths, inode numbers, or program names\n"
            f"- If tool output was truncated, state '[truncated]' — do NOT fill in the missing data\n"
            f"- If you are unsure whether something exists, call a tool to verify or create a lead — do NOT assume\n"
f"- Quote exact strings from tool output when recording evidence descriptions\n"
f"- Do NOT fabricate execution timestamps — only report timestamps returned by tools"
)
async def run(self, task: str) -> str:
"""Run this agent with a specific task."""
_log(task, event="agent_start", agent=self.name)
self.graph.agent_status[self.name] = "running"
self.graph._current_agent = self.name
self._register_graph_tools()
system = self._build_system_prompt(task)
messages = [{"role": "user", "content": task}]
t0 = time.monotonic()
ph_before = len(self.graph.phenomena)
try:
final_text, _ = await self.llm.tool_call_loop(
messages=messages,
tools=self.get_tool_definitions(),
tool_executor=self._executors,
system=system,
)
self._work_log.append(f"[Task: {task[:80]}] -> {final_text[:150]}")
except Exception:
self.graph.agent_status[self.name] = "failed"
logger.error("[%s] Failed during task execution", self.name, exc_info=True)
raise
self.graph.agent_status[self.name] = "completed"
elapsed = time.monotonic() - t0
new_ph = len(self.graph.phenomena) - ph_before
_log(f"+{new_ph} phenomena, {len(final_text)} chars", event="agent_done", agent=self.name, elapsed=elapsed)
return final_text
# ---- Graph interaction tools --------------------------------------------
def _register_graph_tools(self) -> None:
"""Register tools for querying and writing to the evidence graph."""
# --- Read tools ---
self.register_tool(
name="list_phenomena",
description=(
"List all phenomena (evidence artifacts) on the graph. "
"Returns one-line summaries with IDs. Use get_phenomenon(id) for full details."
),
input_schema={
"type": "object",
"properties": {
"category": {
"type": "string",
"description": "Filter by category (filesystem, registry, email, network, timeline). Omit for all.",
},
},
},
executor=self._list_phenomena,
)
self.register_tool(
name="get_phenomenon",
description="Get full details of a specific phenomenon by ID, including raw_data.",
input_schema={
"type": "object",
"properties": {
"id": {"type": "string", "description": "Phenomenon ID (e.g. 'ph-a1b2c3d4')."},
},
"required": ["id"],
},
executor=self._get_phenomenon,
)
self.register_tool(
name="search_graph",
description="Search across phenomena, hypotheses, and entities by keyword. Returns matching summaries.",
input_schema={
"type": "object",
"properties": {
"keyword": {"type": "string", "description": "Search keyword."},
},
"required": ["keyword"],
},
executor=self._search_graph,
)
self.register_tool(
name="get_related",
description="Get all nodes connected to a given node via edges. Returns summaries and edge types.",
input_schema={
"type": "object",
"properties": {
"node_id": {"type": "string", "description": "Any node ID (ph-*, hyp-*, ent-*)."},
},
"required": ["node_id"],
},
executor=self._get_related,
)
self.register_tool(
name="get_hypothesis_status",
description="Get current status and confidence of all hypotheses being investigated.",
input_schema={"type": "object", "properties": {}},
executor=self._get_hypothesis_status,
)
# --- Write tools ---
self.register_tool(
name="add_phenomenon",
description=(
"Record a forensic finding (phenomenon) on the evidence graph. "
"You MUST specify source_tool: the name of the tool call that produced this finding."
),
input_schema={
"type": "object",
"properties": {
"category": {"type": "string", "description": "Category of the finding."},
"title": {"type": "string", "description": "Short title."},
"description": {"type": "string", "description": "Detailed description. Quote exact data from tool output."},
"raw_data": {"type": "object", "description": "Structured raw data supporting this finding."},
"timestamp": {"type": "string", "description": "Timestamp if any. ONLY use timestamps from tool output."},
"source_tool": {"type": "string", "description": "Name of the tool that produced this (e.g. 'list_directory')."},
},
"required": ["category", "title", "description", "source_tool"],
},
executor=self._add_phenomenon,
)
self.register_tool(
name="add_lead",
description="Create an investigative lead for another agent to follow up on.",
input_schema={
"type": "object",
"properties": {
"target_agent": {
"type": "string",
"enum": ["filesystem", "registry", "communication", "network", "timeline"],
"description": "Which agent should handle this lead.",
},
"description": {"type": "string", "description": "What should be investigated."},
"priority": {"type": "integer", "description": "Priority 1 (highest) to 10 (lowest). Default 5."},
},
"required": ["target_agent", "description"],
},
executor=self._add_lead,
)
self.register_tool(
name="link_to_entity",
description=(
"Link a phenomenon to a named entity (person, program, host, etc). "
"Creates the entity if it doesn't exist."
),
input_schema={
"type": "object",
"properties": {
"phenomenon_id": {"type": "string", "description": "Phenomenon ID to link from."},
"entity_name": {"type": "string", "description": "Name of the entity (e.g. 'Mr. Evil', 'mIRC.exe')."},
"entity_type": {
"type": "string",
"enum": ["person", "program", "file", "host", "ip_address"],
"description": "Type of entity.",
},
"edge_type": {
"type": "string",
"enum": ["created_by", "executed_by", "owned_by", "targets", "associated_with", "found_on", "used_by"],
"description": "Relationship type.",
},
},
"required": ["phenomenon_id", "entity_name", "entity_type", "edge_type"],
},
executor=self._link_to_entity,
)
# --- Asset library tools ---
self.register_tool(
name="list_assets",
description=(
"List all files extracted from the disk image. "
"Shows filename, category, size, local path, and inode. "
"Check this before calling extract_file to avoid re-extraction."
),
input_schema={
"type": "object",
"properties": {
"category": {
"type": "string",
"enum": [
"registry_hive", "chat_log", "prefetch", "network_capture",
"config_file", "address_book", "recycle_bin", "executable",
"text_log", "other",
],
"description": "Filter by category. Omit to list all.",
},
},
},
executor=self._list_assets,
)
self.register_tool(
name="find_extracted_file",
description=(
"Find an already-extracted file by inode or filename. "
"Returns the local path so you can use it directly with "
"parse_registry_key, read_text_file, etc. without re-extracting."
),
input_schema={
"type": "object",
"properties": {
"inode": {"type": "string", "description": "Inode to look up."},
"filename": {"type": "string", "description": "Filename or partial name to search."},
},
},
executor=self._find_extracted_file,
)
# ---- Tool executors -----------------------------------------------------
async def _list_phenomena(self, category: str | None = None) -> str:
results = self.graph.list_phenomena(category)
if not results:
return "No phenomena recorded yet." if not category else f"No phenomena in category '{category}'."
return "\n".join(results)
async def _get_phenomenon(self, id: str) -> str:
data = self.graph.get_phenomenon(id)
if data is None:
return f"Phenomenon not found: {id}"
return json.dumps(data, ensure_ascii=False, indent=2)
async def _search_graph(self, keyword: str) -> str:
results = self.graph.search_graph(keyword)
if not results:
return f"No matches for '{keyword}'."
return "\n".join(results)
async def _get_related(self, node_id: str) -> str:
results = self.graph.get_related(node_id)
if not results:
return f"No connections found for {node_id}."
lines = []
for r in results:
lines.append(f" {r['direction']} [{r['edge_type']}] → {r['node']}")
return "\n".join(lines)
async def _get_hypothesis_status(self) -> str:
results = self.graph.get_hypothesis_status()
if not results:
return "No hypotheses defined yet."
return "\n".join(results)
async def _add_phenomenon(
self,
category: str,
title: str,
description: str,
raw_data: dict | None = None,
timestamp: str | None = None,
source_tool: str = "",
) -> str:
pid, merged = await self.graph.add_phenomenon(
source_agent=self.name,
category=category,
title=title,
description=description,
raw_data=raw_data,
timestamp=timestamp,
source_tool=source_tool,
)
        if merged:
            return f"Phenomenon merged into existing: {pid} - {title} (corroboration boost)"
        return f"Phenomenon recorded: {pid} - {title}"
async def _add_lead(
self,
target_agent: str,
description: str,
priority: int = 5,
) -> str:
lid = await self.graph.add_lead(
target_agent=target_agent,
description=description,
priority=priority,
)
return f"Lead created: {lid} — [{target_agent}] {description}"
async def _link_to_entity(
self,
phenomenon_id: str,
entity_name: str,
entity_type: str,
edge_type: str,
) -> str:
        # Validate that the ID refers to a recorded phenomenon before creating the entity
        if phenomenon_id not in self.graph.phenomena:
            return (
                f"Error: phenomenon '{phenomenon_id}' not found. "
                f"Call list_phenomena first to get valid IDs."
            )
        eid, existing = await self.graph.add_entity(entity_name, entity_type)
        try:
            await self.graph.add_edge(
                source_id=phenomenon_id,
                target_id=eid,
                edge_type=edge_type,
                created_by=self.name,
            )
        except ValueError as e:
            # add_edge rejects unknown nodes and edge types; report that to the LLM instead of raising
            return f"Error: {e}"
        status = "linked to existing" if existing else "created and linked"
        return f"Entity {status}: {entity_name} ({entity_type}) ←[{edge_type}]— {phenomenon_id}"
async def _list_assets(self, category: str | None = None) -> str:
results = self.graph.list_assets(category)
if not results:
return "No files extracted yet." if not category else f"No assets in category '{category}'."
return "\n".join(results)
async def _find_extracted_file(
self,
inode: str | None = None,
filename: str | None = None,
) -> str:
if inode:
asset = self.graph.lookup_asset_by_inode(inode)
if asset:
return (
f"Found: {asset.local_path} "
f"({asset.size_bytes} bytes, {asset.category}, inode:{asset.inode})"
)
return f"No extracted file with inode {inode}."
if filename:
results = self.graph.query_assets(filename_pattern=filename)
if not results:
return f"No extracted files matching '{filename}'."
lines = [f"Found {len(results)} matching file(s):"]
for a in results:
lines.append(f" {a.local_path} (inode:{a.inode}, {a.size_bytes} bytes, {a.category})")
return "\n".join(lines)
return "Provide either inode or filename to search."
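
The registration pattern used by `BaseAgent` (a schema dict keyed by tool name, plus a parallel dict of async executors) can be sketched in isolation as follows. This is a minimal illustration, not the project's code: `ToolRegistry`, `dispatch`, `echo`, and `build_registry` are hypothetical names, and the real dispatch happens inside `LLMClient.tool_call_loop`, which is not shown in this file. The error-handling behaviour in `dispatch` is an assumption.

```python
import asyncio
from typing import Any, Awaitable, Callable

Executor = Callable[..., Awaitable[str]]


class ToolRegistry:
    """Hypothetical stand-in for the _tools/_executors pair kept by BaseAgent."""

    def __init__(self) -> None:
        self._tools: dict[str, dict[str, Any]] = {}   # name -> schema
        self._executors: dict[str, Executor] = {}     # name -> async callable

    def register_tool(
        self, name: str, description: str, input_schema: dict, executor: Executor
    ) -> None:
        self._tools[name] = {
            "name": name,
            "description": description,
            "input_schema": input_schema,
        }
        self._executors[name] = executor

    def get_tool_definitions(self) -> list[dict]:
        return list(self._tools.values())

    async def dispatch(self, name: str, arguments: dict[str, Any]) -> str:
        # Assumption: failures are returned as text so the model can react to them.
        executor = self._executors.get(name)
        if executor is None:
            return f"Error: unknown tool '{name}'"
        try:
            return await executor(**arguments)
        except TypeError as e:
            # Raised for missing or unexpected keyword arguments
            # (and for any TypeError raised inside the executor itself).
            return f"Error: bad arguments for '{name}': {e}"


async def echo(text: str) -> str:
    """Trivial example executor."""
    return f"echo: {text}"


def build_registry() -> ToolRegistry:
    registry = ToolRegistry()
    registry.register_tool(
        name="echo",
        description="Return the input text with an 'echo: ' prefix.",
        input_schema={
            "type": "object",
            "properties": {"text": {"type": "string"}},
            "required": ["text"],
        },
        executor=echo,
    )
    return registry


if __name__ == "__main__":
    reg = build_registry()
    print(asyncio.run(reg.dispatch("echo", {"text": "hello"})))
```

Keeping schemas and executors in separate dicts means the schema list can be handed to the model as-is, while the executors stay private to the process.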

evidence_graph.py Normal file

@@ -0,0 +1,799 @@
"""Evidence Knowledge Graph for multi-agent forensic analysis.
Replaces the flat Blackboard with a graph-based evidence store.
Nodes: Phenomenon (observable artifacts), Hypothesis (interpretive claims), Entity (recurring objects).
Edges: typed relationships with predefined weights for hypothesis confidence computation.
"""
from __future__ import annotations
import asyncio
import json
import logging
import uuid
from dataclasses import asdict, dataclass, field
from datetime import datetime
from pathlib import Path
logger = logging.getLogger(__name__)
# ---------------------------------------------------------------------------
# Predefined edge weights for Phenomenon → Hypothesis relationships.
# LLM only picks the edge type (categorical); the weight is looked up here.
# ---------------------------------------------------------------------------
HYPOTHESIS_EDGE_WEIGHTS: dict[str, float] = {
    "direct_evidence": +0.25,
    "supports": +0.15,
    "prerequisite_met": +0.10,
    "consequence_observed": +0.15,
    "contradicts": -0.20,
    "weakens": -0.10,
}
# All valid edge types across the graph.
VALID_EDGE_TYPES: set[str] = {
# Phenomenon → Hypothesis
"direct_evidence", "supports", "prerequisite_met",
"consequence_observed", "contradicts", "weakens",
# Phenomenon → Phenomenon
"temporal", "causal", "input_to", "modifies", "co_located", "corroborates",
# Phenomenon → Entity
"created_by", "executed_by", "owned_by", "targets",
"associated_with", "found_on", "used_by",
# Hypothesis → Hypothesis
"refines", "conflicts", "depends_on",
}
# ---------------------------------------------------------------------------
# Graph node types
# ---------------------------------------------------------------------------
def _compute_quality_score(
    source_tool: str,
    timestamp: str | None,
    raw_data: dict,
    description: str,
    related_ids: list[str],
) -> float:
    """Compute a quality score (0.0-1.0) based on evidence completeness."""
    score = 0.0
    if source_tool:
        score += 0.25
    if timestamp is not None:
        score += 0.20
    if raw_data:
        score += 0.25
    if len(description) >= 50:
        score += 0.15
    if related_ids:
        score += 0.15
    return score


def _jaccard_similarity(a: str, b: str) -> float:
    """Token-level Jaccard similarity between two strings."""
    tokens_a = set(a.lower().split())
    tokens_b = set(b.lower().split())
    if not tokens_a or not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


@dataclass
class Phenomenon:
"""Raw observable artifact found on disk."""
id: str # "ph-{uuid8}"
source_agent: str
category: str # filesystem, registry, email, network, timeline
title: str
description: str
raw_data: dict = field(default_factory=dict)
timestamp: str | None = None
confidence: float = 1.0
source_tool: str = ""
corroborating_agents: list[str] = field(default_factory=list)
created_at: str = ""
def to_dict(self) -> dict:
return asdict(self)
@classmethod
def from_dict(cls, d: dict) -> Phenomenon:
return cls(**d)
def summary(self) -> str:
ts = f" @ {self.timestamp}" if self.timestamp else ""
return f"[{self.id}] [{self.category}] {self.title}{ts} (conf={self.confidence:.2f})"
@dataclass
class Hypothesis:
"""Interpretive claim about what happened on the system."""
id: str # "hyp-{uuid8}"
title: str
description: str
confidence: float = 0.5
status: str = "active" # active, supported, refuted, inconclusive
parent_id: str | None = None
created_by: str = "" # "manual", "hypothesis_agent", agent name
created_at: str = ""
confidence_log: list[dict] = field(default_factory=list)
def to_dict(self) -> dict:
return asdict(self)
@classmethod
def from_dict(cls, d: dict) -> Hypothesis:
return cls(**d)
def summary(self) -> str:
return f"[{self.id}] {self.title} (conf={self.confidence:.2f}, {self.status})"
@dataclass
class Entity:
"""Recurring actor or object across phenomena."""
id: str # "ent-{uuid8}"
name: str
entity_type: str # person, program, file, host, ip_address
description: str = ""
created_at: str = ""
def to_dict(self) -> dict:
return asdict(self)
@classmethod
def from_dict(cls, d: dict) -> Entity:
return cls(**d)
def summary(self) -> str:
return f"[{self.id}] {self.entity_type}: {self.name}"
@dataclass
class Edge:
"""Directed edge in the evidence graph."""
id: str # "edge-{uuid8}"
source_id: str
target_id: str
edge_type: str
metadata: dict = field(default_factory=dict)
created_by: str = ""
created_at: str = ""
def to_dict(self) -> dict:
return asdict(self)
@classmethod
def from_dict(cls, d: dict) -> Edge:
return cls(**d)
@dataclass
class Lead:
"""An investigative lead that should be followed up by an agent."""
id: str
target_agent: str
description: str
priority: int = 5 # 1 (highest) - 10 (lowest)
context: dict = field(default_factory=dict)
status: str = "pending" # pending, assigned, completed, failed
hypothesis_id: str | None = None
def to_dict(self) -> dict:
return asdict(self)
@classmethod
def from_dict(cls, d: dict) -> Lead:
return cls(**d)
@dataclass
class ExtractedAsset:
"""A file extracted from the disk image and tracked in the asset library."""
id: str # "asset-{uuid8}"
inode: str # e.g. "334-128-4"
original_path: str # disk image path from ffind
local_path: str # "extracted/SYSTEM"
category: str # registry_hive, chat_log, prefetch, ...
filename: str # "SYSTEM"
size_bytes: int
extracted_by: str # agent name
extracted_at: str # ISO timestamp
def to_dict(self) -> dict:
return asdict(self)
@classmethod
def from_dict(cls, d: dict) -> ExtractedAsset:
return cls(**d)
def summary(self) -> str:
size_kb = self.size_bytes / 1024
return (
f"[{self.id}] {self.filename} ({self.category}) "
f"{size_kb:.1f}KB @ {self.local_path} [inode:{self.inode}]"
)
# ---------------------------------------------------------------------------
# Evidence Graph
# ---------------------------------------------------------------------------
class EvidenceGraph:
"""Graph-based evidence store for multi-agent forensic analysis.
Agents interact with the graph via query tools (list_phenomena,
get_phenomenon, search_graph, get_related) rather than reading
a full dump in the system prompt.
"""
def __init__(
self,
case_info: dict | None = None,
persist_path: Path | None = None,
) -> None:
self.case_info: dict = case_info or {}
self.image_path: str = ""
self.partition_offset: int = 0
self.extracted_dir: str = "extracted"
# Graph storage
self.phenomena: dict[str, Phenomenon] = {}
self.hypotheses: dict[str, Hypothesis] = {}
self.entities: dict[str, Entity] = {}
self.edges: list[Edge] = []
# Adjacency index for fast traversal
self._adj: dict[str, list[Edge]] = {} # node_id → outgoing edges
self._adj_rev: dict[str, list[Edge]] = {} # node_id → incoming edges
# Lead / status management (carried over from Blackboard)
self.leads: list[Lead] = []
self.agent_status: dict[str, str] = {}
# Asset library — tracks all files extracted from the disk image
self.asset_library: dict[str, ExtractedAsset] = {}
self._inode_index: dict[str, str] = {} # inode → asset_id
# Set by BaseAgent.run() before each agent execution
self._current_agent: str = ""
self._lock = asyncio.Lock()
self._persist_path: Path | None = persist_path
# ---- Persistence -------------------------------------------------------
def _auto_save(self) -> None:
"""Persist full state to disk. Must be called inside _lock."""
if self._persist_path is None:
return
try:
state = {
"case_info": self.case_info,
"image_path": self.image_path,
"partition_offset": self.partition_offset,
"extracted_dir": self.extracted_dir,
"phenomena": {pid: p.to_dict() for pid, p in self.phenomena.items()},
"hypotheses": {hid: h.to_dict() for hid, h in self.hypotheses.items()},
"entities": {eid: e.to_dict() for eid, e in self.entities.items()},
"edges": [e.to_dict() for e in self.edges],
"leads": [l.to_dict() for l in self.leads],
"agent_status": dict(self.agent_status),
"asset_library": {aid: a.to_dict() for aid, a in self.asset_library.items()},
"saved_at": datetime.now().isoformat(),
}
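            tmp = self._persist_path.with_suffix(".tmp")
            # ensure_ascii=False emits raw non-ASCII text, so pin the file encoding explicitly
            tmp.write_text(json.dumps(state, ensure_ascii=False, indent=2), encoding="utf-8")
            tmp.replace(self._persist_path)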
except Exception as e:
logger.error("EvidenceGraph auto-save failed: %s", e)
    def save_state(self, path: Path) -> None:
        """Explicitly save state to the given path."""
        old = self._persist_path
        self._persist_path = path
        try:
            self._auto_save()
        finally:
            # Always restore the configured persist path, even if saving fails
            self._persist_path = old
@classmethod
def load_state(cls, path: Path) -> EvidenceGraph:
"""Restore an EvidenceGraph from a saved JSON state file."""
        data = json.loads(path.read_text(encoding="utf-8"))
graph = cls(
case_info=data.get("case_info", {}),
persist_path=path,
)
graph.image_path = data.get("image_path", "")
graph.partition_offset = data.get("partition_offset", 0)
graph.extracted_dir = data.get("extracted_dir", "extracted")
graph.phenomena = {
pid: Phenomenon.from_dict(p)
for pid, p in data.get("phenomena", {}).items()
}
graph.hypotheses = {
hid: Hypothesis.from_dict(h)
for hid, h in data.get("hypotheses", {}).items()
}
graph.entities = {
eid: Entity.from_dict(e)
for eid, e in data.get("entities", {}).items()
}
graph.edges = [Edge.from_dict(e) for e in data.get("edges", [])]
graph.leads = [Lead.from_dict(l) for l in data.get("leads", [])]
graph.agent_status = data.get("agent_status", {})
for aid, a_data in data.get("asset_library", {}).items():
asset = ExtractedAsset.from_dict(a_data)
graph.asset_library[aid] = asset
graph._inode_index[asset.inode] = aid
graph._rebuild_adjacency()
logger.info(
"EvidenceGraph restored: %d phenomena, %d hypotheses, %d entities, "
"%d edges, %d assets",
len(graph.phenomena), len(graph.hypotheses),
len(graph.entities), len(graph.edges), len(graph.asset_library),
)
return graph
def _rebuild_adjacency(self) -> None:
"""Rebuild adjacency index from edges list."""
self._adj.clear()
self._adj_rev.clear()
for edge in self.edges:
self._adj.setdefault(edge.source_id, []).append(edge)
self._adj_rev.setdefault(edge.target_id, []).append(edge)
# ---- Node helpers -------------------------------------------------------
def _node_exists(self, node_id: str) -> bool:
if node_id.startswith("ph-"):
return node_id in self.phenomena
if node_id.startswith("hyp-"):
return node_id in self.hypotheses
if node_id.startswith("ent-"):
return node_id in self.entities
return False
def get_node(self, node_id: str) -> Phenomenon | Hypothesis | Entity | None:
if node_id.startswith("ph-"):
return self.phenomena.get(node_id)
if node_id.startswith("hyp-"):
return self.hypotheses.get(node_id)
if node_id.startswith("ent-"):
return self.entities.get(node_id)
return None
# ---- Similarity merging (Phenomenon only) --------------------------------
def _find_similar_phenomenon(
self, title: str, description: str, category: str,
) -> Phenomenon | None:
best_match: Phenomenon | None = None
best_score = 0.0
for ph in self.phenomena.values():
if ph.category != category:
continue
title_sim = _jaccard_similarity(ph.title, title)
if title_sim <= 0.6:
continue
desc_sim = _jaccard_similarity(ph.description[:200], description[:200])
if desc_sim <= 0.4:
continue
combined = title_sim * 0.6 + desc_sim * 0.4
if combined > best_score:
best_score = combined
best_match = ph
return best_match
# ---- Mutation methods (async, under lock) --------------------------------
async def add_phenomenon(
self,
source_agent: str,
category: str,
title: str,
description: str,
raw_data: dict | None = None,
timestamp: str | None = None,
source_tool: str = "",
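    ) -> tuple[str, bool]:
        """Add a phenomenon. Returns (id, was_merged).

        If a sufficiently similar phenomenon already exists in the same category,
        the new finding is merged into it: its confidence is raised by 0.15
        (capped at 1.0), the reporting agent is added to corroborating_agents,
        and any new raw_data keys are copied over. Otherwise confidence is
        computed from evidence completeness (source_tool, timestamp, raw_data,
        description length).
        """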
async with self._lock:
similar = self._find_similar_phenomenon(title, description, category)
if similar is not None:
similar.confidence = min(1.0, similar.confidence + 0.15)
if source_agent not in similar.corroborating_agents:
similar.corroborating_agents.append(source_agent)
if raw_data:
for k, v in raw_data.items():
if k not in similar.raw_data:
similar.raw_data[k] = v
self._auto_save()
return similar.id, True
pid = f"ph-{uuid.uuid4().hex[:8]}"
confidence = _compute_quality_score(
source_tool, timestamp, raw_data or {},
description, [],
)
ph = Phenomenon(
id=pid,
source_agent=source_agent,
category=category,
title=title,
description=description,
raw_data=raw_data or {},
timestamp=timestamp,
confidence=confidence,
source_tool=source_tool,
created_at=datetime.now().isoformat(),
)
self.phenomena[pid] = ph
self._auto_save()
return pid, False
async def add_hypothesis(
self,
title: str,
description: str,
created_by: str = "",
parent_id: str | None = None,
) -> str:
"""Add a hypothesis. Returns the hypothesis ID."""
async with self._lock:
hid = f"hyp-{uuid.uuid4().hex[:8]}"
hyp = Hypothesis(
id=hid,
title=title,
description=description,
confidence=0.5,
status="active",
parent_id=parent_id,
created_by=created_by,
created_at=datetime.now().isoformat(),
)
self.hypotheses[hid] = hyp
self._auto_save()
return hid
async def add_entity(
self,
name: str,
entity_type: str,
description: str = "",
) -> tuple[str, bool]:
"""Add an entity. Deduplicates on (name, entity_type). Returns (id, was_existing)."""
async with self._lock:
for ent in self.entities.values():
if ent.name == name and ent.entity_type == entity_type:
return ent.id, True
eid = f"ent-{uuid.uuid4().hex[:8]}"
self.entities[eid] = Entity(
id=eid,
name=name,
entity_type=entity_type,
description=description,
created_at=datetime.now().isoformat(),
)
self._auto_save()
return eid, False
async def add_edge(
self,
source_id: str,
target_id: str,
edge_type: str,
metadata: dict | None = None,
created_by: str = "",
) -> str:
"""Add a directed edge. Validates nodes exist and edge type is valid."""
async with self._lock:
if not self._node_exists(source_id):
raise ValueError(f"Source node not found: {source_id}")
if not self._node_exists(target_id):
raise ValueError(f"Target node not found: {target_id}")
if edge_type not in VALID_EDGE_TYPES:
raise ValueError(f"Invalid edge type: {edge_type}")
eid = f"edge-{uuid.uuid4().hex[:8]}"
edge = Edge(
id=eid,
source_id=source_id,
target_id=target_id,
edge_type=edge_type,
metadata=metadata or {},
created_by=created_by,
created_at=datetime.now().isoformat(),
)
self.edges.append(edge)
self._adj.setdefault(source_id, []).append(edge)
self._adj_rev.setdefault(target_id, []).append(edge)
self._auto_save()
return eid
async def update_hypothesis_confidence(
self,
hyp_id: str,
phenomenon_id: str,
edge_type: str,
reason: str = "",
) -> float:
"""Update hypothesis confidence based on a phenomenon linkage.
The edge_type must be one of HYPOTHESIS_EDGE_WEIGHTS keys.
Weight is looked up from the predefined table, NOT judged by LLM.
Returns the new confidence value.
"""
if edge_type not in HYPOTHESIS_EDGE_WEIGHTS:
raise ValueError(
f"Invalid hypothesis edge type: {edge_type}. "
f"Must be one of: {list(HYPOTHESIS_EDGE_WEIGHTS.keys())}"
)
async with self._lock:
            if phenomenon_id not in self.phenomena:
                raise ValueError(f"Phenomenon not found: {phenomenon_id}")
hyp = self.hypotheses.get(hyp_id)
if hyp is None:
raise ValueError(f"Hypothesis not found: {hyp_id}")
weight = HYPOTHESIS_EDGE_WEIGHTS[edge_type]
old_conf = hyp.confidence
if weight > 0:
delta = weight * (1 - old_conf)
else:
delta = weight * old_conf
new_conf = max(0.0, min(1.0, old_conf + delta))
hyp.confidence = new_conf
if new_conf >= 0.8:
hyp.status = "supported"
elif new_conf <= 0.2:
hyp.status = "refuted"
hyp.confidence_log.append({
"timestamp": datetime.now().isoformat(),
"phenomenon_id": phenomenon_id,
"edge_type": edge_type,
"weight": weight,
"old_confidence": round(old_conf, 4),
"new_confidence": round(new_conf, 4),
"reason": reason,
})
# Also create the edge in the graph
eid = f"edge-{uuid.uuid4().hex[:8]}"
edge = Edge(
id=eid,
source_id=phenomenon_id,
target_id=hyp_id,
edge_type=edge_type,
metadata={"reason": reason},
created_by="hypothesis_engine",
created_at=datetime.now().isoformat(),
)
self.edges.append(edge)
self._adj.setdefault(phenomenon_id, []).append(edge)
self._adj_rev.setdefault(hyp_id, []).append(edge)
self._auto_save()
return new_conf
# ---- Lead management (same as old Blackboard) ----------------------------
async def add_lead(
self,
target_agent: str,
description: str,
priority: int = 5,
context: dict | None = None,
hypothesis_id: str | None = None,
) -> str:
async with self._lock:
lid = f"lead-{uuid.uuid4().hex[:8]}"
self.leads.append(Lead(
id=lid,
target_agent=target_agent,
description=description,
priority=priority,
context=context or {},
hypothesis_id=hypothesis_id,
))
self._auto_save()
return lid
async def get_pending_leads(self, agent_type: str | None = None) -> list[Lead]:
async with self._lock:
leads = [l for l in self.leads if l.status == "pending"]
if agent_type:
leads = [l for l in leads if l.target_agent == agent_type]
return sorted(leads, key=lambda l: l.priority)
async def mark_lead_completed(self, lead_id: str) -> None:
async with self._lock:
for lead in self.leads:
if lead.id == lead_id:
lead.status = "completed"
break
self._auto_save()
async def mark_lead_failed(self, lead_id: str, error: str = "") -> None:
async with self._lock:
for lead in self.leads:
if lead.id == lead_id:
lead.status = "failed"
lead.context["failure_reason"] = error
break
self._auto_save()
# ---- Asset library -------------------------------------------------------
async def register_asset(
self,
inode: str,
original_path: str,
local_path: str,
category: str,
filename: str,
size_bytes: int,
extracted_by: str,
) -> tuple[str, bool]:
"""Register an extracted file. Deduplicates by inode. Returns (id, already_existed)."""
async with self._lock:
if inode in self._inode_index:
return self._inode_index[inode], True
aid = f"asset-{uuid.uuid4().hex[:8]}"
asset = ExtractedAsset(
id=aid,
inode=inode,
original_path=original_path,
local_path=local_path,
category=category,
filename=filename,
size_bytes=size_bytes,
extracted_by=extracted_by,
extracted_at=datetime.now().isoformat(),
)
self.asset_library[aid] = asset
self._inode_index[inode] = aid
self._auto_save()
return aid, False
def lookup_asset_by_inode(self, inode: str) -> ExtractedAsset | None:
"""Look up an extracted asset by inode (synchronous, no lock needed for reads)."""
aid = self._inode_index.get(inode)
return self.asset_library.get(aid) if aid else None
def list_assets(self, category: str | None = None) -> list[str]:
"""Return one-line summaries of all assets, optionally filtered."""
results = []
for asset in self.asset_library.values():
if category and asset.category != category:
continue
results.append(asset.summary())
return results
def query_assets(
self,
category: str | None = None,
filename_pattern: str | None = None,
) -> list[ExtractedAsset]:
"""Query the asset library with optional filters."""
results = []
for asset in self.asset_library.values():
if category and asset.category != category:
continue
if filename_pattern and filename_pattern.lower() not in asset.filename.lower():
continue
results.append(asset)
return results
# ---- Query methods (for agent tools) ------------------------------------
def list_phenomena(self, category: str | None = None) -> list[str]:
"""Return one-line summaries of all phenomena, optionally filtered."""
results = []
for ph in self.phenomena.values():
if category and ph.category != category:
continue
results.append(ph.summary())
return results
def get_phenomenon(self, ph_id: str) -> dict | None:
"""Return full phenomenon details as dict, or None."""
ph = self.phenomena.get(ph_id)
return ph.to_dict() if ph else None
def search_graph(self, keyword: str) -> list[str]:
"""Search across all node types by keyword. Returns summaries."""
kw = keyword.lower()
results = []
for ph in self.phenomena.values():
if kw in ph.title.lower() or kw in ph.description.lower():
results.append(ph.summary())
for hyp in self.hypotheses.values():
if kw in hyp.title.lower() or kw in hyp.description.lower():
results.append(hyp.summary())
for ent in self.entities.values():
if kw in ent.name.lower() or kw in ent.description.lower():
results.append(ent.summary())
return results
def get_related(
self,
node_id: str,
edge_type: str | None = None,
direction: str = "both",
) -> list[dict]:
"""Get nodes connected to the given node. Returns a list of dicts with keys: node, edge_type, direction, metadata."""
results = []
if direction in ("out", "both"):
for edge in self._adj.get(node_id, []):
if edge_type and edge.edge_type != edge_type:
continue
node = self.get_node(edge.target_id)
if node:
results.append({
"node": node.summary(),
"edge_type": edge.edge_type,
"direction": "outgoing",
"metadata": edge.metadata,
})
if direction in ("in", "both"):
for edge in self._adj_rev.get(node_id, []):
if edge_type and edge.edge_type != edge_type:
continue
node = self.get_node(edge.source_id)
if node:
results.append({
"node": node.summary(),
"edge_type": edge.edge_type,
"direction": "incoming",
"metadata": edge.metadata,
})
return results
def get_hypothesis_status(self) -> list[str]:
"""Return summaries of all hypotheses."""
return [h.summary() for h in self.hypotheses.values()]
def get_phenomena_by_category(self, category: str) -> list[Phenomenon]:
return [p for p in self.phenomena.values() if p.category == category]
def hypotheses_converged(self) -> bool:
"""True if no hypotheses are still active (also True when there are no hypotheses at all)."""
return all(h.status != "active" for h in self.hypotheses.values())
def mark_remaining_inconclusive(self) -> None:
"""Mark all still-active hypotheses as inconclusive."""
for h in self.hypotheses.values():
if h.status == "active":
h.status = "inconclusive"
# ---- Summary (lightweight, for system prompt) ----------------------------
def stats_summary(self) -> str:
"""Ultra-compact stats for inclusion in system prompt."""
active_hyp = sum(1 for h in self.hypotheses.values() if h.status == "active")
return (
f"Graph: {len(self.phenomena)} phenomena, "
f"{len(self.hypotheses)} hypotheses ({active_hyp} active), "
f"{len(self.entities)} entities, {len(self.edges)} edges. "
f"Asset library: {len(self.asset_library)} extracted files. "
f"Pending leads: {sum(1 for lead in self.leads if lead.status == 'pending')}."
)
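
The `get_related` lookup above reads from a forward index (`_adj`) and a reverse index (`_adj_rev`). The code that fills them is not shown in this section, so the following is only a minimal sketch of how such a pair of indexes might be maintained. The `Edge` dataclass, the `MiniGraph` class, and the node IDs and edge types are hypothetical stand-ins, not the project's actual definitions.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Edge:
    """Hypothetical edge record; the real class may carry more fields."""
    source_id: str
    target_id: str
    edge_type: str
    metadata: dict = field(default_factory=dict)


class MiniGraph:
    """Illustrative stand-in for the graph's adjacency bookkeeping."""

    def __init__(self) -> None:
        self._adj = defaultdict(list)      # source id -> outgoing edges
        self._adj_rev = defaultdict(list)  # target id -> incoming edges

    def add_edge(self, edge: Edge) -> None:
        # Index the same edge object under both of its endpoints.
        self._adj[edge.source_id].append(edge)
        self._adj_rev[edge.target_id].append(edge)

    def neighbors(self, node_id: str, direction: str = "both") -> list:
        """Return (neighbor_id, direction) pairs, mirroring get_related's direction flag."""
        out = []
        if direction in ("out", "both"):
            out += [(e.target_id, "outgoing") for e in self._adj.get(node_id, [])]
        if direction in ("in", "both"):
            out += [(e.source_id, "incoming") for e in self._adj_rev.get(node_id, [])]
        return out


g = MiniGraph()
g.add_edge(Edge("ph-1", "hyp-1", "supports"))
g.add_edge(Edge("ent-1", "ph-1", "observed_in"))
related = g.neighbors("ph-1")
```

Keeping both indexes doubles the bookkeeping on insert but makes incoming-edge queries as cheap as outgoing ones, which is presumably why the original keeps `_adj_rev` alongside `_adj`.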

llm_client.py Normal file

@@ -0,0 +1,619 @@
"""Custom LLM client using httpx for Claude Messages API via third-party proxy.
The proxy does not support Claude's native tool_use format (it strips the `tools`
field from requests). So we embed tool definitions in the system prompt and parse
structured JSON tool calls from the model's text output (ReAct-style).
"""
from __future__ import annotations
import asyncio
import json
import logging
import os
import re
import time
from collections import Counter
from dataclasses import dataclass, field
from typing import Any
import httpx
logger = logging.getLogger(__name__)
class LLMAPIError(Exception):
"""Raised when the LLM API is unreachable after all retries."""
def __init__(self, message: str, attempts: int) -> None:
super().__init__(message)
self.attempts = attempts
# Markers the model uses to signal tool calls and final answers
TOOL_CALL_TAG = "<tool_call>"
TOOL_CALL_END = "</tool_call>"
TOOL_RESULT_TAG = "<tool_result>"
TOOL_RESULT_END = "</tool_result>"
ANSWER_TAG = "<answer>"
ANSWER_END = "</answer>"
def _build_tools_prompt(tools: list[dict]) -> str:
"""Format tool definitions for inclusion in the system prompt."""
lines = ["You have access to the following tools:\n"]
for t in tools:
schema = t.get("input_schema", {})
props = schema.get("properties", {})
required = schema.get("required", [])
params = []
for pname, pdef in props.items():
req = " (required)" if pname in required else ""
desc = pdef.get("description", "")
ptype = pdef.get("type", "string")
enum_vals = pdef.get("enum")
if enum_vals:
allowed = ", ".join(f'"{v}"' for v in enum_vals)
params.append(f" - {pname}: {ptype}{req}{desc} Allowed values: [{allowed}]")
else:
params.append(f" - {pname}: {ptype}{req}{desc}")
param_block = "\n".join(params) if params else " (no parameters)"
lines.append(f"## {t['name']}\n{t['description']}\nParameters:\n{param_block}\n")
lines.append(
"## How to use tools\n"
"To call a tool, output a JSON block wrapped in XML tags like this:\n"
f"{TOOL_CALL_TAG}\n"
'{"name": "tool_name", "arguments": {"param1": "value1"}}\n'
f"{TOOL_CALL_END}\n\n"
"You can call multiple tools in sequence. After each tool call, you will receive the result in:\n"
f"{TOOL_RESULT_TAG}\n...result...\n{TOOL_RESULT_END}\n\n"
"When you have finished your analysis and have a final answer, wrap it in:\n"
f"{ANSWER_TAG}\nyour final answer here\n{ANSWER_END}\n\n"
"Think step by step. Call tools to gather evidence before drawing conclusions.\n"
"You MUST call at least one tool before giving your final answer."
)
return "\n".join(lines)
def _extract_tool_calls(text: str) -> list[dict]:
"""Extract tool call JSON blocks from model output."""
pattern = re.compile(
re.escape(TOOL_CALL_TAG) + r"\s*(.*?)\s*" + re.escape(TOOL_CALL_END),
re.DOTALL,
)
calls = []
for match in pattern.finditer(text):
raw = match.group(1).strip()
try:
parsed = json.loads(raw)
calls.append(parsed)
except json.JSONDecodeError:
logger.warning("Failed to parse tool call JSON: %s", raw[:200])
return calls
def _extract_answer(text: str) -> str | None:
"""Extract the final answer from model output."""
pattern = re.compile(
re.escape(ANSWER_TAG) + r"\s*(.*?)\s*" + re.escape(ANSWER_END),
re.DOTALL,
)
match = pattern.search(text)
if match:
return match.group(1).strip()
return None
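
The two extractors above share one pattern: a non-greedy match between an opening and a closing tag, with `re.DOTALL` so the payload may span lines. The sketch below replays that pattern on a made-up model reply. The helper name `extract_blocks`, the sample text, and the `inode` argument are illustrative assumptions, not part of the original module.

```python
import json
import re

TOOL_CALL_TAG, TOOL_CALL_END = "<tool_call>", "</tool_call>"
ANSWER_TAG, ANSWER_END = "<answer>", "</answer>"


def extract_blocks(text: str, open_tag: str, close_tag: str) -> list:
    """Return the stripped payload of every open_tag ... close_tag block."""
    pattern = re.compile(
        re.escape(open_tag) + r"\s*(.*?)\s*" + re.escape(close_tag),
        re.DOTALL,
    )
    return [m.group(1).strip() for m in pattern.finditer(text)]


# Made-up model output: one valid call and one malformed call.
sample = (
    "I should list the root directory first.\n"
    "<tool_call>\n"
    '{"name": "list_directory", "arguments": {"inode": "5"}}\n'
    "</tool_call>\n"
    "<tool_call>\n{not valid json}\n</tool_call>\n"
)

calls = []
for raw in extract_blocks(sample, TOOL_CALL_TAG, TOOL_CALL_END):
    try:
        calls.append(json.loads(raw))
    except json.JSONDecodeError:
        pass  # a malformed block is skipped, as _extract_tool_calls does

answers = extract_blocks(
    "<answer>\nNo deleted files found.\n</answer>", ANSWER_TAG, ANSWER_END
)
```

Because malformed blocks are dropped silently apart from a log line, a reply containing only broken JSON would look like "no tool calls" to the caller and be treated as a final answer.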
def _truncate_tool_result(result_text: str, max_chars: int = 3000) -> str:
"""Truncate a tool result if it exceeds max_chars."""
if len(result_text) <= max_chars:
return result_text
return result_text[: max_chars - 200] + f"\n... [truncated, {len(result_text)} total chars]"
# Tools that only read and never mutate state — safe to run concurrently.
READ_ONLY_TOOLS: set[str] = {
# Graph queries
"list_phenomena", "get_phenomenon", "search_graph", "get_related",
"get_hypothesis_status", "list_assets", "find_extracted_file",
# Sleuth Kit reads
"partition_info", "filesystem_info", "list_directory", "find_file",
"search_strings", "count_deleted_files", "build_filesystem_timeline",
# Registry reads (without auto-record wrappers)
"parse_registry_key", "search_registry", "get_user_activity",
# Parser reads
"read_text_file", "read_binary_preview", "search_text_file",
"read_text_file_section", "list_extracted_dir", "parse_pcap_strings",
}
def _fix_tool_args(tool_name: str, tool_args: dict, tools: list[dict]) -> dict:
"""Try to fix misnamed tool arguments from LLM hallucination.
The LLM sometimes confuses parameter names across tools (e.g. passing
`key_path` to search_registry which expects `pattern`). This function
maps unknown kwargs to missing expected params by position/best-effort.
"""
# Find the schema for this tool
schema = None
for t in tools:
if t.get("name") == tool_name:
schema = t.get("input_schema", {})
break
if schema is None:
return tool_args
props = schema.get("properties", {})
required = list(schema.get("required", []))  # keep declared order so the positional mapping below is deterministic
unknown = [k for k in tool_args if k not in props]
if not unknown:
return tool_args # all args are valid, nothing to fix
# Build the fixed args: start with valid args
fixed = {k: v for k, v in tool_args.items() if k in props}
# Find which expected params are still missing
missing = [p for p in (required or props.keys()) if p not in fixed]
# Try to map unknown args to missing params, in order
unknown_values = [(k, tool_args[k]) for k in unknown]
for wrong_name, value in unknown_values:
if not missing:
break
# Pick the best match from missing params
best = missing.pop(0)
logger.warning(
"Auto-fixing tool arg: %s(%s=...) -> %s(%s=...)",
tool_name, wrong_name, tool_name, best,
)
fixed[best] = value
return fixed
def _emit_tool_call_summary(tool_calls: list[dict], elapsed: float) -> None:
"""Emit a folded tool-call summary line for the terminal formatter.
Instead of logging each tool call individually, we group by name:
"list_directory x27, extract_file x3, read_text_file x3"
"""
counts = Counter(tc.get("name", "?") for tc in tool_calls)
parts = []
for name, count in counts.most_common():
if count > 1:
parts.append(f"{name} x{count}")
else:
parts.append(name)
summary = ", ".join(parts)
logger.info(summary, extra={"event": "tool_calls", "elapsed": elapsed})
@dataclass
class _ToolBatch:
"""A batch of tool calls with the same read/write classification."""
is_read_only: bool
calls: list[dict] = field(default_factory=list)
def _partition_tool_calls(
tool_calls: list[dict],
read_only: set[str] | None = None,
) -> list[_ToolBatch]:
"""Partition tool calls into batches: consecutive read-only tools are
grouped together (will run in parallel), write tools are isolated."""
if read_only is None:
read_only = READ_ONLY_TOOLS
batches: list[_ToolBatch] = []
for tc in tool_calls:
is_ro = tc.get("name", "") in read_only
if batches and batches[-1].is_read_only and is_ro:
batches[-1].calls.append(tc)
else:
batches.append(_ToolBatch(is_read_only=is_ro, calls=[tc]))
return batches
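
To make the batching rule concrete: consecutive read-only calls are merged into one batch, while every other call gets a batch of its own, even when two of them are adjacent. The sketch below re-implements that rule on bare tool names. The reduced `READ_ONLY` set and the `partition` helper are simplifications for illustration only.

```python
READ_ONLY = {"list_directory", "read_text_file"}  # small subset, for illustration


def partition(names: list) -> list:
    """Group tool names into (is_read_only, [names]) batches."""
    batches = []
    for name in names:
        is_ro = name in READ_ONLY
        if batches and batches[-1][0] and is_ro:
            # Extend the current read-only batch.
            batches[-1][1].append(name)
        else:
            # Start a new batch; write tools always land here.
            batches.append((is_ro, [name]))
    return batches


plan = partition([
    "list_directory", "read_text_file",  # merged: both read-only
    "extract_file", "extract_file",      # isolated: one batch each
    "list_directory",                    # new read-only batch
])
```

Order is preserved across batches, so a read issued after a write still observes that write's effects.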
# ---------------------------------------------------------------------------
# Context compression — keeps the message list from growing unboundedly.
# ---------------------------------------------------------------------------
# Stage A: Progressive tool result decay thresholds.
# Messages are counted in (assistant, user) pairs from the END of the list.
# "Round" = one pair of (assistant tool-calling msg, user tool-result msg).
_DECAY_TIERS: list[tuple[int, int]] = [
# (rounds_ago_threshold, max_chars_for_tool_results)
(5, 3000), # recent 5 rounds: keep full (3000 chars per tool result)
(15, 500), # 5-15 rounds ago: aggressive truncation
(999, 100), # older than 15 rounds: minimal stub
]
def _apply_progressive_decay(messages: list[dict]) -> list[dict]:
"""Truncate tool results in older messages to save context space.
Returns a list (the input list itself when it is short); message dicts are never mutated. Only touches user messages that
contain <tool_result> blocks (these are the tool-result messages
generated by tool_call_loop).
"""
# Count rounds from the end. A "round" is a (assistant, user) pair.
# messages alternate: [user, assistant, user, assistant, user, ...]
# The initial user message is index 0, then pairs start at index 1.
total = len(messages)
if total <= 10: # not enough messages to bother
return messages
result = []
# Count tool-result user messages from the end
tool_result_indices = [
i for i, m in enumerate(messages)
if m["role"] == "user" and TOOL_RESULT_TAG in m.get("content", "")
]
# Build a set of indices that need decay, mapped to their max_chars
decay_map: dict[int, int] = {}
n_tool_msgs = len(tool_result_indices)
for rank, idx in enumerate(reversed(tool_result_indices)):
rounds_ago = rank # 0 = most recent, 1 = second most recent, ...
for threshold, max_chars in _DECAY_TIERS:
if rounds_ago < threshold:
decay_map[idx] = max_chars
break
for i, msg in enumerate(messages):
if i in decay_map:
max_chars = decay_map[i]
content = msg["content"]
if len(content) > max_chars + 200:
# Truncate by raw character count; a closing </tool_result> tag past the cut is dropped
truncated = content[:max_chars]
# Count how many tool results are in this message
n_results = content.count(TOOL_RESULT_TAG)
truncated += (
f"\n... [context compressed: {len(content)} -> {max_chars} chars, "
f"{n_results} tool result(s)]"
)
result.append({"role": msg["role"], "content": truncated})
else:
result.append(msg)
else:
result.append(msg)
return result
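
The tier lookup used by the decay pass can be checked in isolation. This sketch copies the `_DECAY_TIERS` values and asks which character budget a tool-result message gets for a given age. The helper name `budget_for` is an assumption introduced here.

```python
from __future__ import annotations

DECAY_TIERS = [(5, 3000), (15, 500), (999, 100)]  # same values as _DECAY_TIERS


def budget_for(rounds_ago: int) -> int | None:
    """Return the max_chars budget for a tool-result message of the given age."""
    for threshold, max_chars in DECAY_TIERS:
        if rounds_ago < threshold:
            return max_chars
    return None  # older than the last threshold: no tier matches


budgets = [budget_for(r) for r in (0, 4, 5, 14, 15, 998, 999)]
```

One consequence visible here: a message 999 or more rounds old matches no tier, so the decay pass would leave it untouched. With `max_iterations` at 40 this is unlikely to matter in practice.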
# Stage B: LLM-powered message folding.
# When messages exceed this count, fold the oldest ones into a summary.
_FOLD_THRESHOLD = 24 # trigger folding when messages exceed this count
_FOLD_KEEP_RECENT = 10 # always keep the most recent N messages intact
_FOLD_SUMMARY_SYSTEM = (
"You are a concise summarizer for an ongoing forensic investigation conversation. "
"Summarize the following early conversation between a forensic analysis agent and its "
"tool execution results. Preserve:\n"
"- Key findings and evidence discovered (file paths, inode numbers, timestamps)\n"
"- Tools that were called and their important results\n"
"- Decisions made and current investigation direction\n"
"Keep the summary under 800 words. Use bullet points."
)
class LLMClient:
"""Calls Claude Messages API through a third-party proxy using raw httpx.
Uses prompt-based tool calling (ReAct pattern) since the proxy does not
support Claude's native tool_use format.
"""
def __init__(
self,
base_url: str,
api_key: str,
model: str = "claude-sonnet-4-6",
max_tokens: int = 4096,
proxy: str | None = "auto",
) -> None:
self.base_url = base_url.rstrip("/")
self.api_key = api_key
self.model = model
self.max_tokens = max_tokens
# proxy="auto": read from env; proxy=None/""/"none": no proxy; proxy="http://...": use it
if proxy == "auto":
proxy_url = os.environ.get("https_proxy") or os.environ.get("HTTPS_PROXY")
elif proxy and proxy.lower() != "none":
proxy_url = proxy
else:
proxy_url = None
self._client = httpx.AsyncClient(
base_url=self.base_url,
headers={
"x-api-key": self.api_key,
"anthropic-version": "2023-06-01",
"content-type": "application/json",
},
timeout=300.0,
proxy=proxy_url,
)
async def close(self) -> None:
await self._client.aclose()
async def chat(
self,
messages: list[dict],
system: str | None = None,
max_retries: int = 5,
) -> str:
"""Send a streaming chat request and return the assembled text response.
Uses SSE streaming to keep the connection alive and avoid gateway
timeouts (504/524) on long-running completions.
"""
import asyncio as _asyncio
payload: dict[str, Any] = {
"model": self.model,
"max_tokens": self.max_tokens,
"messages": messages,
"stream": True,
}
if system:
payload["system"] = system
for attempt in range(max_retries):
logger.debug("LLM request (stream): %d messages (attempt %d)", len(messages), attempt + 1)
text_parts: list[str] = []
try:
async with self._client.stream(
"POST", "/v1/messages", json=payload,
) as resp:
# Check for HTTP errors before consuming stream
if resp.status_code >= 400:
body = await resp.aread()
raise httpx.HTTPStatusError(
f"Server error '{resp.status_code}' for url '{resp.url}': "
f"{body[:500].decode('utf-8', errors='replace')}",
request=resp.request,
response=resp,
)
# Parse SSE events
async for line in resp.aiter_lines():
if not line.startswith("data: "):
continue
data_str = line[6:] # strip "data: " prefix
if data_str.strip() == "[DONE]":
break
try:
event = json.loads(data_str)
except json.JSONDecodeError:
continue
event_type = event.get("type", "")
if event_type == "content_block_delta":
delta = event.get("delta", {})
if delta.get("type") == "text_delta":
text_parts.append(delta["text"])
elif event_type == "message_stop":
break
elif event_type == "error":
err_msg = event.get("error", {}).get("message", "Unknown streaming error")
raise httpx.HTTPStatusError(
err_msg, request=resp.request, response=resp,
)
text = "".join(text_parts)
logger.debug("LLM response (stream): %d chars", len(text))
return text
except (httpx.HTTPStatusError, httpx.ConnectError, httpx.ReadTimeout, httpx.RemoteProtocolError) as e:
if attempt < max_retries - 1:
wait = 2 ** attempt * 10
logger.warning("Request failed (%s), retrying in %ds...", e, wait)
await _asyncio.sleep(wait)
else:
raise LLMAPIError(
f"LLM API unreachable after {max_retries} attempts: {e}",
attempts=max_retries,
) from e
# Should not reach here, but just in case
return ""
async def tool_call_loop(
self,
messages: list[dict],
tools: list[dict],
tool_executor: dict[str, Any],
system: str | None = None,
max_iterations: int = 40,
) -> tuple[str, list[dict]]:
"""Run a ReAct-style tool-calling loop.
The model outputs <tool_call> blocks which we parse and execute,
feeding results back as <tool_result> blocks until the model
outputs an <answer> block.
Returns:
(final_text, all_messages)
"""
# Build system prompt with tool definitions
tools_prompt = _build_tools_prompt(tools)
full_system = f"{system}\n\n{tools_prompt}" if system else tools_prompt
messages = list(messages) # don't mutate caller's list
_folded = False # Track whether we've already folded once this loop
for i in range(max_iterations):
# ── Context compression before each API call ──────────────
# Stage A: progressively decay old tool results
messages = _apply_progressive_decay(messages)
# Stage B: fold oldest messages into LLM summary if too long
if not _folded and len(messages) > _FOLD_THRESHOLD:
messages = await self._fold_old_messages(messages, full_system)
_folded = True
elif _folded and len(messages) > _FOLD_THRESHOLD + _FOLD_KEEP_RECENT:
# Fold again whenever messages have grown back significantly
messages = await self._fold_old_messages(messages, full_system)
text = await self.chat(messages, system=full_system)
# Check for final answer
answer = _extract_answer(text)
if answer is not None:
messages.append({"role": "assistant", "content": text})
return answer, messages
# Check for tool calls
tool_calls = _extract_tool_calls(text)
if not tool_calls:
# No tool calls and no answer tag — treat entire text as answer
messages.append({"role": "assistant", "content": text})
return text, messages
# Execute tool calls — read-only tools run in parallel
messages.append({"role": "assistant", "content": text})
result_parts = []
batches = _partition_tool_calls(tool_calls)
t_batch_start = time.monotonic()
for batch in batches:
if batch.is_read_only and len(batch.calls) > 1:
batch_results = await self._execute_tool_batch_parallel(
batch.calls, tool_executor, tools,
)
result_parts.extend(batch_results)
else:
for tc in batch.calls:
result_parts.append(
await self._execute_single_tool(tc, tool_executor, tools)
)
# Emit folded tool-call summary for the terminal
t_batch_elapsed = time.monotonic() - t_batch_start
_emit_tool_call_summary(tool_calls, t_batch_elapsed)
# Feed results back as a user message
result_message = "\n\n".join(result_parts)
messages.append({"role": "user", "content": result_message})
logger.warning("Tool call loop hit max iterations (%d)", max_iterations)
return "[Max tool call iterations reached]", messages
async def _execute_single_tool(
self, tc: dict, tool_executor: dict[str, Any],
tools: list[dict] | None = None,
) -> str:
"""Execute a single tool call and return the formatted result."""
tool_name = tc.get("name", "")
tool_args = tc.get("arguments", {})
if tools:
tool_args = _fix_tool_args(tool_name, tool_args, tools)
logger.info("Calling tool: %s(%s)", tool_name, json.dumps(tool_args, ensure_ascii=False))
executor = tool_executor.get(tool_name)
if executor is None:
result_text = f"Error: unknown tool '{tool_name}'"
else:
try:
result_text = await executor(**tool_args)
except Exception as e:
logger.error("Tool %s failed: %s", tool_name, e)
result_text = f"Error executing {tool_name}: {e}"
return (
f"{TOOL_RESULT_TAG}\n"
f"[{tool_name}] {_truncate_tool_result(result_text)}\n"
f"{TOOL_RESULT_END}"
)
async def _execute_tool_batch_parallel(
self, calls: list[dict], tool_executor: dict[str, Any],
tools: list[dict] | None = None,
) -> list[str]:
"""Execute multiple read-only tool calls concurrently."""
logger.info("Executing %d read-only tools in parallel", len(calls))
async def _run_one(tc: dict) -> str:
tool_name = tc.get("name", "")
tool_args = tc.get("arguments", {})
if tools:
tool_args = _fix_tool_args(tool_name, tool_args, tools)
logger.info("Calling tool (parallel): %s(%s)", tool_name, json.dumps(tool_args, ensure_ascii=False))
executor = tool_executor.get(tool_name)
if executor is None:
result_text = f"Error: unknown tool '{tool_name}'"
else:
try:
result_text = await executor(**tool_args)
except Exception as e:
logger.error("Tool %s failed: %s", tool_name, e)
result_text = f"Error executing {tool_name}: {e}"
return (
f"{TOOL_RESULT_TAG}\n"
f"[{tool_name}] {_truncate_tool_result(result_text)}\n"
f"{TOOL_RESULT_END}"
)
results = await asyncio.gather(*[_run_one(tc) for tc in calls])
return list(results)
async def _fold_old_messages(
self, messages: list[dict], system: str,
) -> list[dict]:
"""Fold old messages into an LLM-generated summary (Stage B).
Keeps the most recent _FOLD_KEEP_RECENT messages intact and
replaces earlier ones with a single summary message.
"""
n_to_fold = len(messages) - _FOLD_KEEP_RECENT
if n_to_fold <= 2:
return messages
old_messages = messages[:n_to_fold]
recent_messages = messages[n_to_fold:]
# Build a text dump of old messages for summarization
old_text_parts = []
for msg in old_messages:
role = msg["role"]
content = msg.get("content", "")
# Truncate each message for the summary prompt to avoid overload
if len(content) > 1000:
content = content[:1000] + "..."
old_text_parts.append(f"[{role}]: {content}")
old_text = "\n\n".join(old_text_parts)
# Cap total size sent to summarizer
if len(old_text) > 15000:
old_text = old_text[:15000] + "\n\n... [further messages omitted for brevity]"
logger.info(
"Context folding: summarizing %d old messages (%d chars) into summary",
n_to_fold, len(old_text),
)
try:
summary = await self.chat(
messages=[{"role": "user", "content": old_text}],
system=_FOLD_SUMMARY_SYSTEM,
)
except Exception as e:
logger.warning("Context folding failed: %s — keeping original messages", e)
return messages
# Replace old messages with a single summary
summary_message = {
"role": "user",
"content": (
f"[Context summary — the following summarizes {n_to_fold} earlier "
f"messages in this conversation]\n\n{summary}"
),
}
return [summary_message] + recent_messages
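
The retry loop in `chat` sleeps for `2 ** attempt * 10` seconds after each failed attempt except the last. A short sketch of the resulting schedule, assuming the default `max_retries=5` (the helper name `backoff_schedule` is introduced here for illustration):

```python
def backoff_schedule(max_retries: int = 5, base: int = 10) -> list:
    """Seconds slept after each failed attempt; none follows the final one."""
    return [2 ** attempt * base for attempt in range(max_retries - 1)]


waits = backoff_schedule()
total = sum(waits)
```

With the defaults, a request that never succeeds spends 150 seconds sleeping, on top of whatever time each attempt takes before it fails (up to the 300-second client timeout per attempt).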

243
log_config.py Normal file

@@ -0,0 +1,243 @@
"""Logging configuration — colored terminal output + detailed log file.
Terminal: compact, colored, hierarchical display with tool-call folding.
File: full-detail timestamped log for post-run analysis.
"""
from __future__ import annotations
import logging
import sys
import time
from pathlib import Path
# ---------------------------------------------------------------------------
# ANSI color codes
# ---------------------------------------------------------------------------
class _C:
"""ANSI escape sequences for terminal colors."""
RESET = "\033[0m"
BOLD = "\033[1m"
DIM = "\033[2m"
# Foreground
RED = "\033[31m"
GREEN = "\033[32m"
YELLOW = "\033[33m"
BLUE = "\033[34m"
MAGENTA = "\033[35m"
CYAN = "\033[36m"
WHITE = "\033[37m"
GREY = "\033[90m"
# Bright foreground
B_RED = "\033[91m"
B_GREEN = "\033[92m"
B_YELLOW = "\033[93m"
B_BLUE = "\033[94m"
B_MAGENTA = "\033[95m"
B_CYAN = "\033[96m"
# Agent name → color mapping
_AGENT_COLORS: dict[str, str] = {
"filesystem": _C.B_CYAN,
"registry": _C.B_MAGENTA,
"communication": _C.B_YELLOW,
"network": _C.B_GREEN,
"timeline": _C.B_BLUE,
"hypothesis": _C.MAGENTA,
"report": _C.CYAN,
}
def _agent_color(name: str) -> str:
return _AGENT_COLORS.get(name, _C.WHITE)
def _format_elapsed(seconds: float) -> str:
"""Format elapsed seconds as human-readable string."""
if seconds < 1:
return f"{seconds * 1000:.0f}ms"
if seconds < 60:
return f"{seconds:.1f}s"
m, s = divmod(int(seconds), 60)
if m < 60:
return f"{m}m {s:02d}s"
h, m = divmod(m, 60)
return f"{h}h {m:02d}m {s:02d}s"
# ---------------------------------------------------------------------------
# Terminal formatter — compact, colored, hierarchical
# ---------------------------------------------------------------------------
class TerminalFormatter(logging.Formatter):
"""Colored, compact formatter for terminal output.
Recognizes structured log messages via extra fields:
- extra['event'] : event type for special formatting
- extra['agent'] : agent name for coloring
- extra['elapsed'] : elapsed seconds for timing display
- extra['tool_name'] : tool name for tool-call lines
- extra['tool_count'] : count for folded tool calls
"""
def format(self, record: logging.LogRecord) -> str:
ts = time.strftime("%H:%M:%S", time.localtime(record.created))
ts_str = f"{_C.GREY}[{ts}]{_C.RESET}"
event = getattr(record, "event", None)
# ── Phase banner ──────────────────────────────────────────
if event == "phase":
phase_name = record.getMessage()
line = f"\n{ts_str} {_C.BOLD}{_C.WHITE}══ {phase_name} {'═' * max(1, 52 - len(phase_name))}{_C.RESET}"
return line
# ── Agent start ───────────────────────────────────────────
if event == "agent_start":
agent = getattr(record, "agent", "?")
task = record.getMessage()
color = _agent_color(agent)
return f"{ts_str} {color}{agent:<14}{_C.RESET} {task}"
# ── Agent done ────────────────────────────────────────────
if event == "agent_done":
agent = getattr(record, "agent", "?")
elapsed = getattr(record, "elapsed", 0)
summary = record.getMessage()
color = _agent_color(agent)
elapsed_str = f"{_C.GREY}({_format_elapsed(elapsed)}){_C.RESET}" if elapsed else ""
return f"{ts_str} {color}└─{_C.RESET} {summary} {elapsed_str}"
# ── Tool calls (folded) ───────────────────────────────────
if event == "tool_calls":
summary = record.getMessage()
elapsed = getattr(record, "elapsed", 0)
elapsed_str = f"{_C.GREY}({_format_elapsed(elapsed)}){_C.RESET}" if elapsed else ""
return f"{ts_str} {_C.DIM}├─ {summary}{_C.RESET} {elapsed_str}"
# ── Lead dispatch ─────────────────────────────────────────
if event == "dispatch":
msg = record.getMessage()
return f"{ts_str} {_C.BLUE}{msg}{_C.RESET}"
# ── Evidence progress ─────────────────────────────────────
if event == "progress":
msg = record.getMessage()
elapsed = getattr(record, "elapsed", 0)
elapsed_str = f" {_C.GREY}({_format_elapsed(elapsed)}){_C.RESET}" if elapsed else ""
return f"{ts_str} {_C.GREEN}{msg}{_C.RESET}{elapsed_str}"
# ── Hypothesis update ─────────────────────────────────────
if event == "hypothesis":
msg = record.getMessage()
return f"{ts_str} {_C.MAGENTA}{msg}{_C.RESET}"
# ── Errors (checked first: ERROR also satisfies >= WARNING) ──
if record.levelno >= logging.ERROR:
msg = record.getMessage()
return f"{ts_str} {_C.B_RED}ERROR{_C.RESET} {msg}"
# ── Warnings ──────────────────────────────────────────────
if record.levelno >= logging.WARNING:
msg = record.getMessage()
return f"{ts_str} {_C.B_YELLOW}WARN{_C.RESET} {msg}"
# ── Default (suppressed unless DEBUG) ─────────────────────
# Most raw INFO messages (httpx, debug-level tool calls) are
# handled by the level filter and never reach here.
msg = record.getMessage()
return f"{ts_str} {_C.DIM}{msg}{_C.RESET}"
class FileFormatter(logging.Formatter):
"""Detailed formatter for log files — full timestamps, all fields."""
def format(self, record: logging.LogRecord) -> str:
ts = self.formatTime(record, "%Y-%m-%d %H:%M:%S")
ms = f"{record.created % 1:.3f}"[1:] # .NNN
level = record.levelname[0] # I/W/E/D
name = record.name
event = getattr(record, "event", "")
agent = getattr(record, "agent", "")
elapsed = getattr(record, "elapsed", "")
# Build context tags
tags = ""
if event:
tags += f" @{event}"
if agent:
tags += f" [{agent}]"
if elapsed:
tags += f" ({_format_elapsed(elapsed)})"
msg = record.getMessage()
return f"{ts}{ms} {level} {name}{tags}: {msg}"
# ---------------------------------------------------------------------------
# Filters
# ---------------------------------------------------------------------------
class TerminalFilter(logging.Filter):
"""Filter for terminal handler — suppress noisy loggers and low-value messages."""
# Logger names to suppress from terminal entirely
_SUPPRESSED = {"httpx", "httpcore"}
def filter(self, record: logging.LogRecord) -> bool:
# Suppress httpx noise
if record.name in self._SUPPRESSED:
return False
# Suppress DEBUG from all loggers in terminal
if record.levelno < logging.INFO:
return False
# Suppress raw llm_client tool-call lines (we show folded summaries instead)
if record.name == "llm_client" and "Calling tool" in record.getMessage():
return False
if record.name == "llm_client" and "(parallel)" in record.getMessage():
return False
# Suppress raw LLM request/response debug lines
if record.name == "llm_client" and record.getMessage().startswith("LLM"):
return False
return True
# ---------------------------------------------------------------------------
# Setup
# ---------------------------------------------------------------------------
def setup_logging(run_dir: Path, verbose: bool = False) -> None:
"""Configure logging with colored terminal + detailed file output.
Args:
run_dir: Directory for the log file.
verbose: If True, show all messages in terminal (no filtering).
"""
root = logging.getLogger()
root.setLevel(logging.DEBUG)
# Remove any existing handlers (e.g., from basicConfig)
root.handlers.clear()
# ── Terminal handler ──────────────────────────────────────
term_handler = logging.StreamHandler(sys.stderr)
term_handler.setLevel(logging.DEBUG if verbose else logging.INFO)
term_handler.setFormatter(TerminalFormatter())
if not verbose:
term_handler.addFilter(TerminalFilter())
root.addHandler(term_handler)
# ── File handler (full detail) ────────────────────────────
file_handler = logging.FileHandler(run_dir / "masforensics.log")
file_handler.setLevel(logging.DEBUG)
file_handler.setFormatter(FileFormatter())
root.addHandler(file_handler)
# Suppress overly chatty third-party loggers even in the file
logging.getLogger("httpcore").setLevel(logging.WARNING)
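
The formatters above rely on callers attaching `event`, `agent`, and `elapsed` through the standard `extra=` argument, which `logging` copies onto the record as attributes. A minimal sketch of that mechanism, using a stripped-down formatter without colors. The `EventFormatter` class and the `demo.events` logger name are hypothetical.

```python
import io
import logging


class EventFormatter(logging.Formatter):
    """Tiny stand-in for TerminalFormatter: only the agent_start branch."""

    def format(self, record: logging.LogRecord) -> str:
        event = getattr(record, "event", None)  # set via extra={"event": ...}
        agent = getattr(record, "agent", "?")
        if event == "agent_start":
            return f"{agent:<14} {record.getMessage()}"
        return record.getMessage()


buf = io.StringIO()
handler = logging.StreamHandler(buf)
handler.setFormatter(EventFormatter())

demo = logging.getLogger("demo.events")
demo.setLevel(logging.INFO)
demo.propagate = False  # keep the demo isolated from the root logger
demo.addHandler(handler)

demo.info("scan partition 2", extra={"event": "agent_start", "agent": "filesystem"})
demo.info("plain message")
lines = buf.getvalue().splitlines()
```

Records logged without `extra` fall through to the default branch, which is why every `getattr` in the formatters supplies a fallback value.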

272
main.py Normal file

@@ -0,0 +1,272 @@
"""MASForensics — Multi-Agent System for Digital Forensics."""
from __future__ import annotations
import asyncio
import json
import logging
import re
import shutil
import subprocess
import sys
from datetime import datetime
from pathlib import Path
import yaml
from agent_factory import AgentFactory
from evidence_graph import EvidenceGraph
from llm_client import LLMClient
from log_config import setup_logging
from orchestrator import AnalysisAborted, Orchestrator
from tool_registry import register_all_tools
RUNS_DIR = Path("runs")
IMAGE_DIR = Path("image")
# Common forensic image extensions (only first segment / single-file formats)
_IMAGE_GLOBS = ["*.001", "*.dd", "*.raw", "*.img", "*.E01", "*.iso"]
def load_config(path: str = "config.yaml") -> dict:
with open(path, encoding="utf-8") as f:
return yaml.safe_load(f)
# ---------------------------------------------------------------------------
# Interactive image & partition selection
# ---------------------------------------------------------------------------
def _discover_images(search_dir: Path = IMAGE_DIR) -> list[Path]:
"""Find forensic disk image files under *search_dir*."""
images: set[Path] = set()
for glob in _IMAGE_GLOBS:
images.update(search_dir.glob(glob))
return sorted(images)
def _parse_mmls(output: str) -> list[dict]:
"""Parse mmls output into a list of partition dicts.
Returns only data partitions (skips Meta / Unallocated rows).
"""
partitions: list[dict] = []
for line in output.splitlines():
# Typical line: "002: 000:000 0000000063 0009510479 0009510417 NTFS / exFAT (0x07)"
m = re.match(
r"\s*\d+:\s+(\S+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(.*)",
line,
)
if not m:
continue
slot, start, end, length, desc = m.groups()
# Skip non-data rows
if slot == "Meta" or slot.startswith("---"):
continue
partitions.append({
"slot": slot,
"start": int(start),
"end": int(end),
"length": int(length),
"description": desc.strip(),
})
return partitions
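
The row filter in `_parse_mmls` can be exercised without The Sleuth Kit installed. The sketch below applies the same regular expression to a hand-written table. The third row is the "typical line" quoted in the comment above; the first two rows are illustrative, written in the usual `mmls` layout rather than captured from a real image.

```python
import re

ROW = re.compile(r"\s*\d+:\s+(\S+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(.*)")

# Hand-written sample in mmls layout (not real tool output).
sample = """\
000:  Meta      0000000000   0000000000   0000000001   Primary Table (#0)
001:  -------   0000000000   0000000062   0000000063   Unallocated
002:  000:000   0000000063   0009510479   0009510417   NTFS / exFAT (0x07)
"""

rows = []
for line in sample.splitlines():
    m = ROW.match(line)
    if not m:
        continue
    slot, start, end, length, desc = m.groups()
    if slot == "Meta" or slot.startswith("---"):
        continue  # skip partition-table and unallocated rows
    rows.append({
        "slot": slot,
        "start": int(start),
        "length": int(length),
        "description": desc.strip(),
    })

# Size in MB, assuming 512-byte sectors as the interactive selector does.
size_mb = rows[0]["length"] * 512 / (1024 * 1024)
```

Note that `int()` parses the zero-padded sector counts as decimal, so `0000000063` becomes 63, the value later passed on as the partition offset.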
def _run_mmls(image_path: str) -> list[dict]:
"""Run mmls and return parsed partition list."""
try:
result = subprocess.run(
["mmls", image_path],
capture_output=True, text=True, timeout=30,
)
except FileNotFoundError:
print("Error: mmls not found. Is The Sleuth Kit installed?")
sys.exit(1)
if result.returncode != 0:
return []
return _parse_mmls(result.stdout)
def select_image_interactive(image_dir: Path | None = None) -> tuple[str, int]:
"""Interactively select a disk image and partition.
If *image_dir* is None, prompts the user for the image folder path.
Returns (image_path, partition_offset_in_sectors).
"""
# --- Image folder selection ---
if image_dir is None:
raw = input("Image folder path: ").strip()
if not raw:
print("No path provided.")
sys.exit(1)
image_dir = Path(raw).expanduser().resolve()
if not image_dir.is_dir():
print(f"Error: {image_dir} is not a directory.")
sys.exit(1)
# --- Image file selection ---
images = _discover_images(image_dir)
if not images:
print(f"No disk images found in {image_dir}/")
print("Supported formats: " + ", ".join(_IMAGE_GLOBS))
sys.exit(1)
if len(images) == 1:
image_path = images[0]
print(f"Found image: {image_path}")
else:
print("Available disk images:")
for i, img in enumerate(images, 1):
size_mb = img.stat().st_size / (1024 * 1024)
print(f" [{i}] {img.name} ({size_mb:.0f} MB)")
while True:
choice = input(f"Select image [1-{len(images)}]: ").strip()
if choice.isdigit() and 1 <= int(choice) <= len(images):
image_path = images[int(choice) - 1]
break
print("Invalid choice.")
# --- Partition detection ---
print(f"Detecting partitions ({image_path}) ...")
partitions = _run_mmls(str(image_path))
if not partitions:
print("Warning: mmls could not detect partitions. Using offset 0.")
return str(image_path), 0
if len(partitions) == 1:
p = partitions[0]
print(f"Found partition: {p['description']} (offset={p['start']})")
return str(image_path), p["start"]
print("Partitions:")
for i, p in enumerate(partitions, 1):
# mmls reports lengths in sectors; 512-byte sectors are assumed here.
size_mb = p["length"] * 512 / (1024 * 1024)
print(f" [{i}] {p['description']} (offset={p['start']}, {size_mb:.0f} MB)")
while True:
choice = input(f"Select partition [1-{len(partitions)}]: ").strip()
if choice.isdigit() and 1 <= int(choice) <= len(partitions):
p = partitions[int(choice) - 1]
return str(image_path), p["start"]
print("Invalid choice.")
def find_resumable_run() -> Path | None:
"""Find the most recent incomplete run with a saved graph state."""
if not RUNS_DIR.exists():
return None
candidates = sorted(RUNS_DIR.glob("*/graph_state.json"), reverse=True)
for state_file in candidates:
# Incomplete = has graph_state.json but no run_metadata.json (final archive)
if not (state_file.parent / "run_metadata.json").exists():
return state_file
return None
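# Illustrative sketch (not part of the commit): the "incomplete run" rule above,
# exercised against a temporary directory. find_incomplete and the two run folder
# names are hypothetical; the rule itself (graph_state.json present, run_metadata.json
# absent, newest first) is taken from find_resumable_run.

```python
from __future__ import annotations

import tempfile
from pathlib import Path


def find_incomplete(runs_dir: Path) -> Path | None:
    """Return the newest graph_state.json whose run has no final metadata."""
    # Timestamped folder names sort lexicographically in chronological order.
    for state_file in sorted(runs_dir.glob("*/graph_state.json"), reverse=True):
        if not (state_file.parent / "run_metadata.json").exists():
            return state_file
    return None


with tempfile.TemporaryDirectory() as tmp:
    runs = Path(tmp)
    finished = runs / "2026-04-01T10-00-00"
    partial = runs / "2026-04-02T14-30-00"
    for run in (finished, partial):
        run.mkdir()
        (run / "graph_state.json").write_text("{}")
    # Only the finished run gets the final archive marker.
    (finished / "run_metadata.json").write_text("{}")

    found = find_incomplete(runs)
    found_name = found.parent.name if found else None
```

# The sketch relies on the run folder names being sortable timestamps, which is
# what setup_run_dir produces.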
def setup_run_dir() -> Path:
"""Create a timestamped run directory."""
run_dir = RUNS_DIR / datetime.now().strftime("%Y-%m-%dT%H-%M-%S")
run_dir.mkdir(parents=True, exist_ok=True)
return run_dir
async def async_main() -> None:
config = load_config()
agent_cfg = config["agent"]
# Check for resumable run
resumable = find_resumable_run()
resume_phase = 1
run_dir: Path | None = None
graph: EvidenceGraph | None = None
if resumable:
print(f"Found incomplete run: {resumable.parent.name}")
try:
data = json.loads(resumable.read_text())
ph_count = len(data.get("phenomena", {}))
hyp_count = len(data.get("hypotheses", {}))
print(f" ({ph_count} phenomena, {hyp_count} hypotheses)")
except Exception:
pass
print("Resume? [y/N] ", end="", flush=True)
choice = input().strip().lower()
if choice == "y":
run_dir = resumable.parent
graph = EvidenceGraph.load_state(resumable)
if graph.hypotheses:
resume_phase = 3 # hypotheses exist, resume investigation
elif graph.phenomena:
resume_phase = 2 # have phenomena, generate hypotheses
print(f"Resuming from Phase {resume_phase}...")
if run_dir is None:
run_dir = setup_run_dir()
# Setup logging — colored terminal + detailed log file
setup_logging(run_dir)
# Save config snapshot
shutil.copy2("config.yaml", run_dir / "config.yaml")
# Initialize LLM client
llm = LLMClient(
base_url=agent_cfg["base_url"],
api_key=agent_cfg["api_key"],
model=agent_cfg["model"],
max_tokens=agent_cfg.get("max_tokens", 4096),
proxy=agent_cfg.get("proxy", "auto"),
)
# Initialize evidence graph
if graph is None:
# CLI arg takes priority, otherwise interactive prompt
cli_dir = Path(sys.argv[1]) if len(sys.argv) > 1 else None
image_path, partition_offset = select_image_interactive(cli_dir)
graph = EvidenceGraph(
case_info=config.get("cfreds_hacking_case", {}),
persist_path=run_dir / "graph_state.json",
)
graph.image_path = image_path
graph.partition_offset = partition_offset
graph.extracted_dir = str(run_dir / "extracted")
else:
graph._persist_path = run_dir / "graph_state.json"
# Register all tools with bound image path
register_all_tools(graph.image_path, graph.partition_offset, graph, graph.extracted_dir)
# Create agent factory
factory = AgentFactory(llm, graph)
# Run orchestrator
orchestrator = Orchestrator(llm, graph, factory, config=config, run_dir=run_dir)
try:
report = await orchestrator.run(resume_phase=resume_phase)
print("\n" + "=" * 60)
print("FORENSIC ANALYSIS COMPLETE")
print("=" * 60)
print(f"Results archived to: {run_dir}")
print(report)
except AnalysisAborted:
print("\n" + "=" * 60)
print("ANALYSIS ABORTED — too many consecutive failures")
print("=" * 60)
print(f"Partial results saved to: {run_dir}")
print("Run again to resume from saved state.")
except KeyboardInterrupt:
print("\nInterrupted. State saved.")
print(f"Partial results in: {run_dir}")
finally:
await llm.close()
def main() -> None:
asyncio.run(async_main())
if __name__ == "__main__":
main()

702
orchestrator.py Normal file

@@ -0,0 +1,702 @@
"""Orchestrator — coordinates hypothesis-driven multi-agent forensic analysis."""
from __future__ import annotations
import asyncio
import json
import logging
import re
import time
from datetime import datetime
from pathlib import Path
from agent_factory import AgentFactory
from evidence_graph import EvidenceGraph, HYPOTHESIS_EDGE_WEIGHTS
from llm_client import LLMClient
logger = logging.getLogger(__name__)
def _log(msg: str, **extra) -> None:
"""Emit a structured log message with extra fields for the terminal formatter."""
logger.info(msg, extra=extra)
def _progress_summary(graph: EvidenceGraph) -> str:
"""One-line progress summary: phenomena, hypotheses, entities counts."""
active = sum(1 for h in graph.hypotheses.values() if h.status == "active")
parts = []
if graph.phenomena:
parts.append(f"{len(graph.phenomena)} phenomena")
if graph.hypotheses:
parts.append(f"{len(graph.hypotheses)} hypotheses ({active} active)")
if graph.entities:
parts.append(f"{len(graph.entities)} entities")
return ", ".join(parts) if parts else "empty graph"
# Map LLM-generated lead target names to actual registered agent names.
AGENT_ALIASES: dict[str, str] = {
"malware": "filesystem",
"deleted_files": "filesystem",
"file_recovery": "filesystem",
"recycle_bin": "filesystem",
"chat_email": "communication",
"email": "communication",
"chat": "communication",
"irc": "communication",
"messaging": "communication",
"browser_internet": "network",
"browser": "network",
"internet": "network",
"web": "network",
"wireless": "network",
"pcap": "network",
"password_analysis": "registry",
"user_activity": "registry",
"event_logs": "registry",
"system_config": "registry",
}
class AnalysisAborted(Exception):
"""Raised when too many failures trigger a graceful shutdown."""
pass
class Orchestrator:
"""Coordinates the hypothesis-driven multi-agent forensic analysis.
Flow (5 phases):
1. FileSystemAgent initial survey → Phenomenon nodes
2. Hypothesis generation (manual or auto via HypothesisAgent)
3. Hypothesis-directed investigation (iterative)
4. Timeline construction
5. Report generation
"""
_LEAD_GEN_SYSTEM = (
"You are the lead investigator coordinating a multi-agent digital forensics analysis. "
"Your job is to identify gaps in the evidence and generate specific, actionable "
"investigation tasks for field agents (filesystem, registry, communication, network). "
"Each task should be concrete enough for an agent to execute immediately — "
"specify file paths, registry keys, or artifact types to examine. "
"Avoid vague tasks like 'investigate further'. "
"Prioritize tasks that would most strongly confirm OR refute the hypothesis."
)
_JUDGE_SYSTEM = (
"You are the lead investigator evaluating new forensic evidence against hypotheses. "
"Judge each phenomenon's relationship to the hypothesis based ONLY on the evidence described. "
"Use 'direct_evidence' only when the phenomenon IS the hypothesized activity. "
"Use 'supports' when it's consistent but not conclusive. "
"Use 'contradicts' when it actively disproves the hypothesis. "
"Omit phenomena that are unrelated. Be conservative — only link genuinely relevant evidence."
)
def __init__(
self,
llm: LLMClient,
graph: EvidenceGraph,
factory: AgentFactory,
config: dict | None = None,
run_dir: Path | None = None,
) -> None:
self.llm = llm
self.graph = graph
self.factory = factory
self.config = config or {}
self.run_dir = run_dir
self._semaphore = asyncio.Semaphore(3)
self._failure_count = 0
self._max_failures = 3
self._start_time = datetime.now()
def _resolve_agent_type(self, agent_type: str) -> str:
return AGENT_ALIASES.get(agent_type, agent_type)
# ---- Lead dispatch (with fault tolerance) --------------------------------
async def _dispatch_leads_parallel(self, pending: list) -> None:
"""Dispatch leads grouped by agent type, running different types concurrently."""
by_agent: dict[str, list] = {}
for lead in pending:
resolved = self._resolve_agent_type(lead.target_agent)
by_agent.setdefault(resolved, []).append(lead)
agents_str = ", ".join(f"{at} x{len(leads)}" for at, leads in by_agent.items())
_log(f"{len(pending)} leads dispatched -> {agents_str}", event="dispatch")
async def run_agent_leads(agent_type: str, leads: list) -> None:
for lead in leads:
resolved = self._resolve_agent_type(lead.target_agent)
agent = self.factory.get_or_create_agent(resolved)
if agent is None:
logger.warning("Skipping lead %s: no agent for '%s'", lead.id, lead.target_agent)
await self.graph.mark_lead_completed(lead.id)
continue
lead.status = "assigned"
try:
async with self._semaphore:
# Build hypothesis context if this lead is linked to one
hyp_line = ""
if lead.hypothesis_id:
hyp = self.graph.hypotheses.get(lead.hypothesis_id)
if hyp:
hyp_line = (
f"Hypothesis under investigation: {hyp.title} "
f"(confidence: {hyp.confidence:.2f})\n"
)
await agent.run(
f"Investigate this lead: {lead.description}\n"
f"{hyp_line}"
f"Focus area: {lead.target_agent}"
)
await self.graph.mark_lead_completed(lead.id)
self._failure_count = 0
except Exception as e:
logger.error("Agent [%s] failed on lead %s: %s", agent_type, lead.id, e)
await self.graph.mark_lead_failed(lead.id, str(e))
self._failure_count += 1
if self._failure_count >= self._max_failures:
raise AnalysisAborted(
f"Too many consecutive agent failures ({self._failure_count})"
)
results = await asyncio.gather(
*(run_agent_leads(at, leads) for at, leads in by_agent.items()),
return_exceptions=True,
)
for result in results:
if isinstance(result, AnalysisAborted):
raise result
for at, result in zip(by_agent.keys(), results):
if isinstance(result, Exception):
logger.error("Agent group [%s] failed: %s", at, result)
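# Illustrative sketch (not part of the commit): the dispatch pattern above reduced
# to its concurrency skeleton. Groups run concurrently, jobs inside a group run in
# order, a shared semaphore caps parallel work, and return_exceptions=True keeps one
# failing group from cancelling the others. run_group, demo and the job names are
# hypothetical stand-ins for agents and leads.

```python
import asyncio


async def run_group(name: str, jobs: list[str], sem: asyncio.Semaphore, log: list) -> None:
    """Process one group's jobs sequentially, holding the semaphore per job."""
    for job in jobs:
        async with sem:
            await asyncio.sleep(0)  # stand-in for real agent work
            if job == "boom":
                raise RuntimeError(f"{name} failed on {job}")
            log.append((name, job))


async def demo() -> tuple[list, list[str]]:
    sem = asyncio.Semaphore(3)
    by_agent = {
        "filesystem": ["a", "b"],
        "registry": ["boom"],
        "network": ["c"],
    }
    log: list = []
    results = await asyncio.gather(
        *(run_group(name, jobs, sem, log) for name, jobs in by_agent.items()),
        return_exceptions=True,
    )
    # gather preserves argument order, so results line up with the dict keys.
    failed = [name for name, res in zip(by_agent, results) if isinstance(res, Exception)]
    return sorted(log), failed


done, failed_groups = asyncio.run(demo())
```

# Because exceptions come back as values, the caller has to inspect the results
# itself, which is what the two loops after gather() do in the method above.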
async def _retry_failed_leads(self) -> None:
"""Retry leads that failed (once only)."""
failed = [l for l in self.graph.leads
if l.status == "failed" and not l.context.get("retry")]
if not failed:
return
logger.info("--- Retrying %d failed leads ---", len(failed))
for lead in failed:
lead.status = "pending"
lead.context["retry"] = True
await self._dispatch_leads_parallel(failed)
# ---- Hypothesis generation -----------------------------------------------
async def _generate_hypotheses_manual(self, hypotheses_config: list[dict]) -> None:
"""Load hypotheses from config."""
for h in hypotheses_config:
hid = await self.graph.add_hypothesis(
title=h["title"],
description=h["description"],
created_by="manual",
)
logger.info("Manual hypothesis loaded: %s (%s)", hid, h["title"])
async def _generate_hypotheses_auto(self) -> None:
"""Use HypothesisAgent to analyze phenomena and generate hypotheses."""
hyp_agent = self.factory.get_or_create_agent("hypothesis")
if hyp_agent is None:
logger.error("Cannot create hypothesis agent")
return
await hyp_agent.run(
"Analyze all phenomena discovered in Phase 1. Generate 3-7 investigative "
"hypotheses about what happened on this system. Each hypothesis should be:\n"
"1. Specific and testable\n"
"2. About a distinct aspect of activity (e.g., hacking tools, communication, "
"network attacks, data theft)\n\n"
"For each hypothesis:\n"
"- Call add_hypothesis to create it\n"
"- Then call link_phenomenon_to_hypothesis to link relevant existing phenomena\n"
"- Choose the relationship type carefully: direct_evidence, supports, "
"prerequisite_met, consequence_observed, contradicts, or weakens\n\n"
"The ultimate goal is to reconstruct a detailed timeline of what happened on this host."
)
# ---- Hypothesis-directed investigation -----------------------------------
async def _generate_hypothesis_leads(self) -> None:
"""Ask LLM what evidence is still needed — all hypotheses in one call."""
active = [h for h in self.graph.hypotheses.values() if h.status == "active"]
if not active:
return
# Build a combined prompt with ALL active hypotheses
hyp_blocks = []
hyp_by_id = {h.id: h for h in active}
for hyp in active:
related = self.graph.get_related(hyp.id, direction="in")
existing = "\n".join(
f" - {r['node']} [{r['edge_type']}]" for r in related
) or " (none yet)"
hyp_blocks.append(
f"Hypothesis [{hyp.id}]: {hyp.title}\n"
f" Description: {hyp.description}\n"
f" Current confidence: {hyp.confidence:.2f}\n"
f" Existing evidence:\n{existing}"
)
prompt = (
f"Active hypotheses under investigation:\n\n"
+ "\n\n".join(hyp_blocks) + "\n\n"
f"For EACH hypothesis, suggest 1-3 specific, actionable investigation tasks.\n"
f"For each task, specify which agent type should handle it: "
f"filesystem, registry, communication, network.\n"
f"Prioritize tasks that would most strongly confirm OR refute the hypothesis.\n"
f"Avoid vague tasks like 'investigate further'.\n\n"
f"Respond ONLY with JSON (no markdown):\n"
f'[{{"hypothesis_id": "hyp-xxx", "agent": "agent_type", "task": "what to investigate", "priority": 1-10}}]'
)
try:
response = await self.llm.chat(
messages=[{"role": "user", "content": prompt}],
system=self._LEAD_GEN_SYSTEM,
)
match = re.search(r'\[.*\]', response, re.DOTALL)
if match:
tasks = json.loads(match.group())
else:
tasks = json.loads(response)
for task in tasks:
hyp_id = task.get("hypothesis_id", "")
# Validate hypothesis_id exists
if hyp_id not in hyp_by_id:
# Try to match by prefix or skip
logger.warning("Unknown hypothesis_id in lead gen response: %s", hyp_id)
hyp_id = None
await self.graph.add_lead(
target_agent=task.get("agent", "filesystem"),
description=task.get("task", ""),
priority=task.get("priority", 5),
hypothesis_id=hyp_id,
)
except Exception as e:
logger.warning("Batched lead generation failed: %s — falling back to per-hypothesis", e)
await self._generate_hypothesis_leads_fallback(active)
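# Illustrative sketch (not part of the commit): pulling a JSON array out of a chatty
# LLM reply. It compares a greedy and a lazy bracket pattern on a made-up reply whose
# task text itself contains brackets. extract_json_array and the reply are hypothetical.

```python
import json
import re


def extract_json_array(response: str) -> list:
    """Parse the outermost [...] span, or the whole reply if none is found."""
    match = re.search(r"\[.*\]", response, re.DOTALL)  # greedy: first '[' to last ']'
    return json.loads(match.group() if match else response)


reply = (
    "Here are the tasks:\n"
    '[{"agent": "registry", "task": "Check Run keys [HKLM]", "priority": 7}]'
)

tasks = extract_json_array(reply)

# A lazy pattern stops at the first ']' and cuts the array short.
lazy_span = re.search(r"\[.*?\]", reply, re.DOTALL).group()
```

# With the lazy pattern the extracted span ends inside the task string, so
# json.loads would raise and the caller would drop to its fallback path.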
async def _generate_hypothesis_leads_fallback(self, active: list) -> None:
"""Per-hypothesis fallback if batched call fails."""
for hyp in active:
related = self.graph.get_related(hyp.id, direction="in")
existing_evidence = "\n".join(
f" - {r['node']} [{r['edge_type']}]" for r in related
) or " (none yet)"
prompt = (
f"Hypothesis: {hyp.title}\n"
f"Description: {hyp.description}\n"
f"Current confidence: {hyp.confidence:.2f}\n\n"
f"Existing evidence linked to this hypothesis:\n{existing_evidence}\n\n"
f"What additional evidence should we look for to CONFIRM or DENY this hypothesis?\n"
f"List 1-3 specific, actionable investigation tasks.\n"
f"For each, specify which agent type should handle it: "
f"filesystem, registry, communication, network.\n\n"
f"Respond ONLY with JSON (no markdown):\n"
f'[{{"agent": "agent_type", "task": "what to investigate", "priority": 1-10}}]'
)
try:
response = await self.llm.chat(
messages=[{"role": "user", "content": prompt}],
system=self._LEAD_GEN_SYSTEM,
)
match = re.search(r'\[.*\]', response, re.DOTALL)
tasks = json.loads(match.group()) if match else json.loads(response)
for task in tasks:
await self.graph.add_lead(
target_agent=task.get("agent", "filesystem"),
description=task.get("task", ""),
priority=task.get("priority", 5),
hypothesis_id=hyp.id,
)
except Exception as e:
logger.warning("Failed to generate leads for hypothesis %s: %s", hyp.id, e)
async def _judge_new_phenomena(self) -> None:
"""Judge new phenomena against active hypotheses — all in one LLM call."""
active = [h for h in self.graph.hypotheses.values() if h.status == "active"]
if not active:
return
# Find phenomena not yet linked to any hypothesis
linked_ph_ids = set()
for edge in self.graph.edges:
if edge.target_id.startswith("hyp-") and edge.source_id.startswith("ph-"):
linked_ph_ids.add(edge.source_id)
unlinked = [ph for ph in self.graph.phenomena.values() if ph.id not in linked_ph_ids]
if not unlinked:
return
valid_types = list(HYPOTHESIS_EDGE_WEIGHTS.keys())
hyp_section = "\n".join(
f" [{h.id}] {h.title}: {h.description}" for h in active
)
ph_section = "\n".join(f" - {ph.summary()}" for ph in unlinked)
prompt = (
f"Hypotheses under investigation:\n{hyp_section}\n\n"
f"New phenomena not yet linked to any hypothesis:\n{ph_section}\n\n"
f"For each phenomenon, judge whether it is related to ANY of the hypotheses above.\n"
f"A phenomenon may be linked to MULTIPLE hypotheses if relevant.\n"
f"Valid relationship types: {', '.join(valid_types)}\n"
f"If a phenomenon is NOT related to a hypothesis, omit that pairing.\n\n"
f"Respond ONLY with JSON (no markdown):\n"
f'[{{"hypothesis_id": "hyp-xxx", "phenomenon_id": "ph-xxx", "edge_type": "supports|contradicts|...", "reason": "brief explanation"}}]'
)
try:
response = await self.llm.chat(
messages=[{"role": "user", "content": prompt}],
system=self._JUDGE_SYSTEM,
)
match = re.search(r'\[.*\]', response, re.DOTALL)
if match:
judgments = json.loads(match.group())
else:
judgments = json.loads(response)
for j in judgments:
hyp_id = j.get("hypothesis_id", "")
ph_id = j.get("phenomenon_id", "")
edge_type = j.get("edge_type", "")
reason = j.get("reason", "")
if (
hyp_id in self.graph.hypotheses
and ph_id in self.graph.phenomena
and edge_type in HYPOTHESIS_EDGE_WEIGHTS
):
await self.graph.update_hypothesis_confidence(
hyp_id=hyp_id,
phenomenon_id=ph_id,
edge_type=edge_type,
reason=reason,
)
logger.info(
"Judged: %s —[%s]→ %s (%s)",
ph_id, edge_type, hyp_id, reason[:60],
)
except Exception as e:
logger.warning("Batched judging failed: %s — falling back to per-hypothesis", e)
await self._judge_new_phenomena_fallback(active, unlinked, valid_types)
async def _judge_new_phenomena_fallback(
self, active: list, unlinked: list, valid_types: list[str],
) -> None:
"""Per-hypothesis fallback if batched judging call fails."""
for hyp in active:
ph_summaries = "\n".join(f" - {ph.summary()}" for ph in unlinked)
prompt = (
f"Hypothesis: {hyp.title}\n"
f"Description: {hyp.description}\n\n"
f"New phenomena not yet linked to this hypothesis:\n{ph_summaries}\n\n"
f"For each phenomenon, judge whether it is related to this hypothesis.\n"
f"Valid relationship types: {', '.join(valid_types)}\n"
f"If a phenomenon is NOT related, omit it.\n\n"
f"Respond ONLY with JSON (no markdown):\n"
f'[{{"phenomenon_id": "ph-xxx", "edge_type": "supports|contradicts|...", "reason": "brief explanation"}}]'
)
try:
response = await self.llm.chat(
messages=[{"role": "user", "content": prompt}],
system=self._JUDGE_SYSTEM,
)
match = re.search(r'\[.*\]', response, re.DOTALL)
judgments = json.loads(match.group()) if match else json.loads(response)
for j in judgments:
ph_id = j.get("phenomenon_id", "")
edge_type = j.get("edge_type", "")
reason = j.get("reason", "")
if ph_id in self.graph.phenomena and edge_type in HYPOTHESIS_EDGE_WEIGHTS:
await self.graph.update_hypothesis_confidence(
hyp_id=hyp.id,
phenomenon_id=ph_id,
edge_type=edge_type,
reason=reason,
)
logger.info(
"Judged: %s —[%s]→ %s (%s)",
ph_id, edge_type, hyp.id, reason[:60],
)
except Exception as e:
logger.warning("Failed to judge phenomena for hypothesis %s: %s", hyp.id, e)
# ---- Gap analysis (coverage check) ---------------------------------------
_AREA_KEYWORDS: dict[str, list[str]] = {
"system_info": ["install date", "registered owner", "product name", "windows xp", "system information"],
"user_accounts": ["user account", "enumerate", "sam hive", "administrator", "mr. evil"],
"shutdown_time": ["shutdown"],
"network_config": ["network interface", "network adapter", "ip address", "dhcp", "mac address", "network config"],
"installed_software": ["installed software", "program files", "installed program"],
"email_config": ["smtp", "pop3", "nntp", "email account", "email config"],
"chat_logs": ["irc", "mirc", "chat log", "channel"],
"network_activity": ["packet capture", "pcap", "interception", "http request", "user-agent"],
"deleted_files": ["deleted file", "recycle", "recycler"],
"execution_evidence": ["prefetch", "execution", "run count", "last execution"],
}
# Deterministic coverage: if the canonical tool was called, the area is covered.
_AREA_TOOLS: dict[str, list[str]] = {
"system_info": ["get_system_info"],
"user_accounts": ["enumerate_users"],
"shutdown_time": ["get_shutdown_time"],
"network_config": ["get_network_interfaces"],
"installed_software": ["list_installed_software"],
"email_config": ["get_email_config"],
"network_activity": ["parse_pcap_strings"],
"deleted_files": ["count_deleted_files"],
"execution_evidence": ["parse_prefetch"],
}
def _check_coverage(self, areas: list[dict]) -> set[str]:
# Layer 1: keyword matching on category + title + description
evidence_text = ""
for ph in self.graph.phenomena.values():
evidence_text += f" {ph.category} {ph.title} {ph.description} ".lower()
# Layer 2: collect all source_tools that produced phenomena
used_tools: set[str] = {ph.source_tool for ph in self.graph.phenomena.values() if ph.source_tool}
covered = set()
for area in areas:
area_name = area["area"]
# Check keywords
keywords = self._AREA_KEYWORDS.get(area_name, [])
if any(kw in evidence_text for kw in keywords):
covered.add(area_name)
continue
# Check source_tool
area_tools = self._AREA_TOOLS.get(area_name, [])
if any(tool in used_tools for tool in area_tools):
covered.add(area_name)
return covered
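# Illustrative sketch (not part of the commit): the two-layer coverage check above
# on plain dicts. An area counts as covered if any keyword appears in the combined
# finding text, or if one of its canonical tools produced a finding. The trimmed
# tables, covered_areas and the findings are hypothetical.

```python
AREA_KEYWORDS = {
    "shutdown_time": ["shutdown"],
    "chat_logs": ["irc", "mirc", "chat log"],
}
AREA_TOOLS = {
    "shutdown_time": ["get_shutdown_time"],
    "execution_evidence": ["parse_prefetch"],
}


def covered_areas(areas: list[str], findings: list[dict]) -> set[str]:
    """Return the areas satisfied by keyword match or by tool provenance."""
    text = " ".join(f"{f['title']} {f['description']}" for f in findings).lower()
    used_tools = {f["source_tool"] for f in findings if f.get("source_tool")}
    covered = set()
    for area in areas:
        by_keyword = any(kw in text for kw in AREA_KEYWORDS.get(area, []))
        by_tool = any(tool in used_tools for tool in AREA_TOOLS.get(area, []))
        if by_keyword or by_tool:
            covered.add(area)
    return covered


findings = [
    {"title": "mIRC logs found", "description": "Channel logs under Program Files",
     "source_tool": "list_directory"},
    {"title": "Prefetch parsed", "description": "12 entries",
     "source_tool": "parse_prefetch"},
]

result = covered_areas(["shutdown_time", "chat_logs", "execution_evidence"], findings)
```

# Keyword matching here is plain substring search, so a short keyword such as "irc"
# would also match unrelated words that contain it; the tool layer does not have
# that ambiguity.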
async def _run_gap_analysis(self) -> None:
areas = self.config.get("investigation_areas", [])
if not areas:
return
covered = self._check_coverage(areas)
uncovered = [a for a in areas if a["area"] not in covered]
if not uncovered:
_log(f"All {len(areas)} investigation areas covered", event="progress")
return
uncovered_names = ", ".join(a["area"] for a in uncovered)
_log(f"{len(uncovered)}/{len(areas)} areas uncovered: {uncovered_names}", event="dispatch")
for area in uncovered:
await self.graph.add_lead(
target_agent=area["agent"],
description=area["task"],
priority=3,
)
for round_num in range(3):
pending = await self.graph.get_pending_leads()
if not pending:
break
_log(f"Gap fill round {round_num + 1}: {len(pending)} leads", event="dispatch")
await self._dispatch_leads_parallel(pending)
# ---- Run archiving -------------------------------------------------------
def _archive_run(self, report: str = "") -> None:
if self.run_dir is None:
return
try:
# Phenomena export
phenomena_data = {pid: ph.to_dict() for pid, ph in self.graph.phenomena.items()}
(self.run_dir / "phenomena.json").write_text(
json.dumps(phenomena_data, ensure_ascii=False, indent=2)
)
# Hypotheses export
hyp_data = {hid: h.to_dict() for hid, h in self.graph.hypotheses.items()}
(self.run_dir / "hypotheses.json").write_text(
json.dumps(hyp_data, ensure_ascii=False, indent=2)
)
# Edges export
edges_data = [e.to_dict() for e in self.graph.edges]
(self.run_dir / "edges.json").write_text(
json.dumps(edges_data, ensure_ascii=False, indent=2)
)
# Entities export
ent_data = {eid: e.to_dict() for eid, e in self.graph.entities.items()}
(self.run_dir / "entities.json").write_text(
json.dumps(ent_data, ensure_ascii=False, indent=2)
)
# Leads export
leads_data = [l.to_dict() for l in self.graph.leads]
(self.run_dir / "leads.json").write_text(
json.dumps(leads_data, ensure_ascii=False, indent=2)
)
# Run metadata
end_time = datetime.now()
metadata = {
"start_time": self._start_time.isoformat(),
"end_time": end_time.isoformat(),
"duration_seconds": (end_time - self._start_time).total_seconds(),
"phenomena_count": len(self.graph.phenomena),
"hypotheses_count": len(self.graph.hypotheses),
"entities_count": len(self.graph.entities),
"edges_count": len(self.graph.edges),
"leads_total": len(self.graph.leads),
"leads_completed": sum(1 for l in self.graph.leads if l.status == "completed"),
"leads_failed": sum(1 for l in self.graph.leads if l.status == "failed"),
"agent_status": dict(self.graph.agent_status),
"hypotheses_summary": {
hid: {"title": h.title, "confidence": h.confidence, "status": h.status}
for hid, h in self.graph.hypotheses.items()
},
}
(self.run_dir / "run_metadata.json").write_text(
json.dumps(metadata, ensure_ascii=False, indent=2)
)
if report:
(self.run_dir / "report.md").write_text(report)
logger.info("Run archived to %s", self.run_dir)
except Exception as e:
logger.error("Failed to archive run: %s", e)
# ---- Main pipeline -------------------------------------------------------
async def run(self, resume_phase: int = 1) -> str:
"""Run the 5-phase hypothesis-driven forensic analysis pipeline."""
report = ""
try:
# Phase 1: Initial filesystem survey
if resume_phase <= 1:
_log(f"Phase 1: Filesystem Survey (image: {Path(self.graph.image_path).name})", event="phase")
t0 = time.monotonic()
ph_before = len(self.graph.phenomena)
fs_agent = self.factory.get_or_create_agent("filesystem")
if fs_agent:
await fs_agent.run(
"Perform an initial survey of this disk image. "
"Examine the partition table, filesystem type, and root directory structure. "
"List key user directories and identify interesting files (documents, emails, "
"chat logs, installed programs, registry hives). "
"Create leads for other agents based on what you find."
)
new_ph = len(self.graph.phenomena) - ph_before
new_leads = sum(1 for l in self.graph.leads if l.status == "pending")
_log(f"+{new_ph} phenomena, +{new_leads} leads", event="progress", elapsed=time.monotonic() - t0)
# Phase 2: Hypothesis generation
if resume_phase <= 2:
_log("Phase 2: Hypothesis Generation", event="phase")
t0 = time.monotonic()
manual_hypotheses = self.config.get("hypotheses", [])
if manual_hypotheses:
await self._generate_hypotheses_manual(manual_hypotheses)
if self.graph.phenomena:
await self._judge_new_phenomena()
else:
await self._generate_hypotheses_auto()
for h in self.graph.hypotheses.values():
_log(f" {h.summary()}", event="hypothesis")
_log(
f"+{len(self.graph.hypotheses)} hypotheses generated",
event="progress", elapsed=time.monotonic() - t0,
)
# Phase 3: Hypothesis-directed investigation (iterative)
if resume_phase <= 3:
max_rounds = self.config.get("max_investigation_rounds", 5)
for round_num in range(max_rounds):
_log(f"Phase 3: Investigation Round {round_num + 1}", event="phase")
t0 = time.monotonic()
if self.graph.hypotheses_converged():
_log("All hypotheses converged — stopping", event="progress")
break
await self._generate_hypothesis_leads()
pending = await self.graph.get_pending_leads()
if not pending:
_log("No pending leads — round complete", event="progress")
break
await self._dispatch_leads_parallel(pending)
await self._judge_new_phenomena()
# Show hypothesis status update
for h in self.graph.hypotheses.values():
_log(f" {h.summary()}", event="hypothesis")
_log(_progress_summary(self.graph), event="progress", elapsed=time.monotonic() - t0)
# Retry failed leads
await self._retry_failed_leads()
# Gap analysis
_log("Phase 3: Gap Analysis", event="phase")
await self._run_gap_analysis()
self.graph.mark_remaining_inconclusive()
# Phase 4: Timeline construction
if resume_phase <= 4:
_log("Phase 4: Timeline Construction", event="phase")
timeline_agent = self.factory.get_or_create_agent("timeline")
if timeline_agent:
await timeline_agent.run(
"Build a detailed chronological timeline from the evidence graph. "
"Use the build_filesystem_timeline tool for MAC times, then correlate "
"with phenomena timestamps. Use temporal edges to connect events."
)
# Phase 5: Report generation
_log("Phase 5: Report Generation", event="phase")
t0 = time.monotonic()
report_agent = self.factory.get_or_create_agent("report")
if report_agent:
report = await report_agent.run(
"Generate a comprehensive forensic report. Structure it as follows:\n"
"1. Executive Summary\n"
"2. Hypotheses and Evidence (for each hypothesis: title, confidence, "
"supporting/contradicting phenomena)\n"
"3. Detailed Findings by Category\n"
"4. Timeline of Events\n"
"5. Entity Summary (key people, programs, hosts involved)\n"
"6. Conclusions and Recommendations"
)
image_stem = Path(self.graph.image_path).stem
report_name = f"{image_stem}_forensic_report.md"
report_path = (self.run_dir / report_name) if self.run_dir else Path(report_name)
try:
report_path.write_text(report)
_log(f"Report saved: {report_path}", event="progress", elapsed=time.monotonic() - t0)
except Exception as e:
logger.error("Failed to save report: %s", e)
except AnalysisAborted:
logger.critical("Analysis aborted due to repeated failures. Partial results preserved.")
finally:
self._archive_run(report)
total = (datetime.now() - self._start_time).total_seconds()
_log(
f"Run complete: {_progress_summary(self.graph)}",
event="progress", elapsed=total,
)
if not report:
return self.graph.stats_summary()
return report

23
pyproject.toml Normal file

@@ -0,0 +1,23 @@
[project]
name = "masforensics"
version = "0.1.0"
description = "Multi-Agent System for Digital Forensics"
requires-python = ">=3.14"
dependencies = [
"httpx[socks]>=0.28.1",
"pyyaml",
"regipy>=6.2.1",
]
[project.scripts]
masforensics = "main:main"
[tool.setuptools]
py-modules = ["main", "llm_client", "evidence_graph", "base_agent", "orchestrator", "tool_registry", "agent_factory"]
packages = ["agents", "tools"]
[dependency-groups]
dev = [
"pytest>=9.0.2",
"pytest-asyncio>=1.3.0",
]

63
regenerate_report.py Normal file

@@ -0,0 +1,63 @@
"""Regenerate the forensic report from a saved graph state without re-running the full pipeline."""
import asyncio
import sys
from pathlib import Path
import yaml
from agent_factory import AgentFactory
from evidence_graph import EvidenceGraph
from llm_client import LLMClient
from tool_registry import register_all_tools
async def main() -> None:
# Find the run to regenerate from
run_dir = Path(sys.argv[1]) if len(sys.argv) > 1 else Path("runs/2026-04-02T15-11-25")
state_path = run_dir / "graph_state.json"
if not state_path.exists():
print(f"No graph_state.json in {run_dir}")
return
config = yaml.safe_load(Path("config.yaml").read_text(encoding="utf-8"))
agent_cfg = config["agent"]
# Load graph
graph = EvidenceGraph.load_state(state_path)
print(f"Loaded: {graph.stats_summary()}")
# LLM client with larger max_tokens for report
llm = LLMClient(
base_url=agent_cfg["base_url"],
api_key=agent_cfg["api_key"],
model=agent_cfg["model"],
max_tokens=16384,
)
register_all_tools(graph.image_path, graph.partition_offset, graph)
factory = AgentFactory(llm, graph)
# Run only the report agent
report_agent = factory.get_or_create_agent("report")
report = await report_agent.run(
"Generate a comprehensive forensic report. Structure it as follows:\n"
"1. Executive Summary\n"
"2. Hypotheses and Evidence (for each hypothesis: title, confidence, "
"supporting/contradicting phenomena)\n"
"3. Detailed Findings by Category\n"
"4. Timeline of Events\n"
"5. Entity Summary (key people, programs, hosts involved)\n"
"6. Conclusions and Recommendations"
)
report_path = run_dir / f"{Path(graph.image_path).stem}_forensic_report.md"
report_path.write_text(report)
print(f"\nReport saved to {report_path} ({len(report)} chars)")
await llm.close()
if __name__ == "__main__":
asyncio.run(main())

615
tool_registry.py Normal file

@@ -0,0 +1,615 @@
"""Central tool registry — catalogs all available forensic tools.
Tools are registered once at startup with bound image_path and offset.
The AgentFactory uses this catalog to compose agents dynamically.
"""
from __future__ import annotations
import hashlib
import json
import logging
import os
import re
from dataclasses import dataclass, field
from typing import Any
from tools import parsers
from tools import registry as reg
from tools import sleuthkit as tsk
logger = logging.getLogger(__name__)
# ---------------------------------------------------------------------------
# Tool result cache — keyed by (tool_name, args_hash).
# Disk image tools are deterministic (image is read-only), so identical
# calls always produce the same output.
# ---------------------------------------------------------------------------
_tool_result_cache: dict[str, str] = {}
# Tools safe to cache: deterministic reads with no side effects.
CACHEABLE_TOOLS: set[str] = {
"partition_info", "filesystem_info", "list_directory", "find_file",
"search_strings", "count_deleted_files", "build_filesystem_timeline",
"parse_registry_key", "search_registry", "get_user_activity",
"read_text_file", "read_binary_preview", "search_text_file",
"read_text_file_section", "list_extracted_dir", "parse_pcap_strings",
}
# Prefixes that mark a failed tool call. The parsers and registry helpers
# report failures as "[Error ...]", so that prefix must be excluded as well;
# otherwise a transient failure would be cached and replayed forever.
_ERROR_PREFIXES = ("Error", "[Error", "[Command failed", "[icat failed")


def _cache_key(tool_name: str, kwargs: dict) -> str:
    """Build a deterministic cache key from tool name + arguments."""
    # sort_keys makes the key independent of keyword-argument order.
    args_str = json.dumps(kwargs, sort_keys=True, ensure_ascii=False)
    # MD5 is used only as a compact fingerprint here, not for security.
    args_hash = hashlib.md5(args_str.encode()).hexdigest()
    return f"{tool_name}:{args_hash}"


def _make_cached(tool_name: str, executor: Any) -> Any:
    """Wrap an executor with an in-memory result cache."""
    async def wrapper(**kwargs) -> str:
        key = _cache_key(tool_name, kwargs)
        cached = _tool_result_cache.get(key)
        if cached is not None:
            logger.debug("Cache hit: %s(%s)", tool_name, kwargs)
            return cached
        result = await executor(**kwargs)
        # Only cache successful results (not errors)
        if not result.startswith(_ERROR_PREFIXES):
            _tool_result_cache[key] = result
        return result
    return wrapper


def get_cache_stats() -> dict[str, int]:
    """Return cache statistics for diagnostics."""
    return {"entries": len(_tool_result_cache)}
# Category auto-detection patterns (filename → category)
_REGISTRY_HIVE_NAMES = {"system", "software", "sam", "ntuser.dat", "security", "default"}
ASSET_CATEGORIES = [
"registry_hive", "chat_log", "prefetch", "network_capture",
"config_file", "address_book", "recycle_bin", "executable",
"text_log", "other",
]
def _auto_categorize(filename: str) -> str:
"""Infer asset category from filename."""
name_lower = filename.lower()
ext = os.path.splitext(name_lower)[1]
# Registry hives are matched on the full lowercased filename (e.g. "system", "ntuser.dat")
if name_lower in _REGISTRY_HIVE_NAMES:
return "registry_hive"
if ext == ".pf":
return "prefetch"
if ext in (".pcap", ".cap") or name_lower == "interception":
return "network_capture"
if ext == ".wab":
return "address_book"
if name_lower == "info2" or re.match(r"dc\d+\.exe", name_lower):
return "recycle_bin"
# Extension-based checks before keyword-based (e.g. mirc.ini → config, not chat)
if ext in (".ini", ".csv", ".dat", ".cfg"):
return "config_file"
if ext in (".log", ".lst"):
if any(kw in name_lower for kw in ("irc", "mirc", "channel", "chat")):
return "chat_log"
return "text_log"
if any(kw in name_lower for kw in ("irc", "mirc", "channel", "chat")):
return "chat_log"
if ext in (".exe", ".dll", ".com"):
return "executable"
return "other"
@dataclass
class ToolDefinition:
"""A registered tool available for agent composition."""
name: str
description: str
input_schema: dict
executor: Any # async callable (or sync for some parsers)
module: str # "sleuthkit", "registry", "parsers"
tags: list[str] = field(default_factory=list)
# Global tool catalog, populated by register_all_tools().
TOOL_CATALOG: dict[str, ToolDefinition] = {}
def _make_auto_record(tool_name: str, category: str, executor: Any, graph: Any) -> Any:
"""Wrap a forensic tool to auto-record its result as a phenomenon."""
async def wrapper(**kwargs) -> str:
result = await executor(**kwargs)
if graph is None or not result or result.startswith("Error") or result.startswith("["):
return result
# Auto-record: the tool produced a forensic fact
agent = getattr(graph, "_current_agent", "") or "unknown"
title = f"{tool_name}: {result.split(chr(10))[0][:80]}"
await graph.add_phenomenon(
source_agent=agent,
category=category,
title=title,
description=result[:2000],
source_tool=tool_name,
)
return result
return wrapper
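# Illustrative sketch (not part of the module): the auto-record wrapper
# exercised against a stand-in graph. `FakeGraph`, `fake_tool`, and the sample
# host name are hypothetical; the real graph class is defined elsewhere and
# may expose more than the single method assumed here.

```python
import asyncio


class FakeGraph:
    """Hypothetical evidence graph exposing only what the wrapper touches."""

    def __init__(self) -> None:
        self._current_agent = "registry"
        self.phenomena: list[dict] = []

    async def add_phenomenon(self, **fields) -> None:
        self.phenomena.append(fields)


def make_auto_record(tool_name, category, executor, graph):
    """Record every successful tool result as a phenomenon on the graph."""
    async def wrapper(**kwargs) -> str:
        result = await executor(**kwargs)
        # Empty results, "Error..." and bracketed messages are not recorded.
        if graph is None or not result or result.startswith(("Error", "[")):
            return result
        agent = getattr(graph, "_current_agent", "") or "unknown"
        first_line = result.split("\n")[0][:80]
        await graph.add_phenomenon(
            source_agent=agent,
            category=category,
            title=f"{tool_name}: {first_line}",
            description=result[:2000],
            source_tool=tool_name,
        )
        return result
    return wrapper


async def fake_tool(hive_path: str) -> str:
    """Stand-in registry tool: one failing path, one successful result."""
    if hive_path == "missing":
        return "[Error: hive not found]"
    return "=== Computer Name (SYSTEM hive) ===\nComputerName: EXAMPLE-PC"


async def demo() -> list[dict]:
    graph = FakeGraph()
    tool = make_auto_record("get_computer_name", "registry", fake_tool, graph)
    await tool(hive_path="extracted/system")  # recorded
    await tool(hive_path="missing")           # skipped: bracketed error
    return graph.phenomena
```

# Only the successful call leaves a phenomenon behind; the error string is
# returned to the caller unchanged and never reaches the graph.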
def register_all_tools(
image_path: str,
partition_offset: int,
graph: Any = None,
extracted_dir: str = "extracted",
) -> None:
"""Populate TOOL_CATALOG with all available tools, pre-bound to image/offset."""
TOOL_CATALOG.clear()
# ---- Sleuth Kit tools ----
TOOL_CATALOG["partition_info"] = ToolDefinition(
name="partition_info",
description="Get the partition table layout of the disk image. Run this first to understand disk structure.",
input_schema={"type": "object", "properties": {}},
executor=lambda: tsk.partition_info(image_path),
module="sleuthkit",
tags=["filesystem", "disk", "partition"],
)
TOOL_CATALOG["filesystem_info"] = ToolDefinition(
name="filesystem_info",
description="Get detailed filesystem information (type, block size, volume name, etc.) for the selected partition.",
input_schema={"type": "object", "properties": {}},
executor=lambda: tsk.filesystem_info(image_path, partition_offset),
module="sleuthkit",
tags=["filesystem", "disk"],
)
TOOL_CATALOG["list_directory"] = ToolDefinition(
name="list_directory",
description="List files and directories. Without inode, lists root. Use recursive=true for all files.",
input_schema={
"type": "object",
"properties": {
"inode": {"type": "string", "description": "Inode of directory. Omit for root."},
"recursive": {"type": "boolean", "description": "List all files recursively."},
},
},
executor=lambda inode=None, recursive=False: tsk.list_directory(
image_path, partition_offset, inode, recursive
),
module="sleuthkit",
tags=["filesystem", "directory", "listing"],
)
async def _extract_with_tracking(inode: str) -> str:
"""Extract a file by inode. Name and category are derived from the real disk path."""
# Dedup
if graph is not None:
existing = graph.lookup_asset_by_inode(inode)
if existing is not None:
return (
f"Already extracted: {existing.local_path} "
f"({existing.size_bytes} bytes, {existing.category}). "
f"Disk path: {existing.original_path}"
)
# Resolve real disk path first
orig_path = (await tsk.find_file(image_path, inode, partition_offset)).strip()
if not orig_path or "not found" in orig_path.lower():
return f"Error: inode {inode} not found on the disk image."
# Derive local filename from real disk path
filename = os.path.basename(orig_path)
local_path = os.path.join(extracted_dir, filename)
# Handle name collisions by appending inode
if os.path.exists(local_path):
base, ext = os.path.splitext(filename)
local_path = os.path.join(extracted_dir, f"{base}_{inode.replace('-', '_')}{ext}")
filename = os.path.basename(local_path)
# Extract
result = await tsk.extract_file(image_path, inode, local_path, partition_offset)
if result.startswith("[icat failed"):
return result
size = os.path.getsize(local_path) if os.path.exists(local_path) else 0
category = _auto_categorize(os.path.basename(orig_path))
# Register
if graph is not None:
agent_name = getattr(graph, "_current_agent", "") or "unknown"
await graph.register_asset(
inode=inode,
original_path=orig_path,
local_path=local_path,
category=category,
filename=filename,
size_bytes=size,
extracted_by=agent_name,
)
logger.info("Asset registered: %s (%s, %d bytes)", local_path, category, size)
return (
f"Extracted to {local_path} ({size} bytes, {category})\n"
f"Disk path: {orig_path}"
)
TOOL_CATALOG["extract_file"] = ToolDefinition(
name="extract_file",
description=(
"Extract a file from the disk image by inode number. "
"The filename is automatically determined from the disk path. "
"Checks if already extracted (returns existing path if so). "
"Returns the local path and the original disk path."
),
input_schema={
"type": "object",
"properties": {
"inode": {"type": "string", "description": "Inode number of the file (e.g. '334-128-4' or '334')."},
},
"required": ["inode"],
},
executor=_extract_with_tracking,
module="sleuthkit",
tags=["filesystem", "extraction"],
)
TOOL_CATALOG["find_file"] = ToolDefinition(
name="find_file",
description="Find the file path for a given inode number.",
input_schema={
"type": "object",
"properties": {
"inode": {"type": "string", "description": "Inode number to look up."},
},
"required": ["inode"],
},
executor=lambda inode: tsk.find_file(image_path, inode, partition_offset),
module="sleuthkit",
tags=["filesystem"],
)
TOOL_CATALOG["search_strings"] = ToolDefinition(
name="search_strings",
description="Search for a string pattern across the entire disk image (slow on first call, fast after). Prefer search_text_file on already-extracted files when possible.",
input_schema={
"type": "object",
"properties": {
"pattern": {"type": "string", "description": "String pattern (case-insensitive grep)."},
},
"required": ["pattern"],
},
executor=lambda pattern: tsk.search_strings(image_path, pattern),
module="sleuthkit",
tags=["filesystem", "search", "strings"],
)
TOOL_CATALOG["count_deleted_files"] = ToolDefinition(
name="count_deleted_files",
description="List and count all deleted files. Shows total count, executables, and extension breakdown.",
input_schema={"type": "object", "properties": {}},
executor=lambda: tsk.count_deleted_files(image_path, partition_offset),
module="sleuthkit",
tags=["filesystem", "deleted", "recovery"],
)
TOOL_CATALOG["build_filesystem_timeline"] = ToolDefinition(
name="build_filesystem_timeline",
description="Build a MAC timeline from the filesystem (Modified/Accessed/Changed times for all files).",
input_schema={"type": "object", "properties": {}},
executor=lambda: tsk.build_timeline(image_path, partition_offset),
module="sleuthkit",
tags=["filesystem", "timeline"],
)
# ---- Registry tools ----
TOOL_CATALOG["parse_registry_key"] = ToolDefinition(
name="parse_registry_key",
description="Parse a registry hive file and list subkeys/values at a given path.",
input_schema={
"type": "object",
"properties": {
"hive_path": {"type": "string", "description": "Path to extracted hive file."},
"key_path": {"type": "string", "description": "Registry key path to inspect."},
},
"required": ["hive_path", "key_path"],
},
executor=lambda hive_path, key_path: reg.parse_registry_key(hive_path, key_path),
module="registry",
tags=["registry", "hive"],
)
TOOL_CATALOG["list_installed_software"] = ToolDefinition(
name="list_installed_software",
description="List installed software from a SOFTWARE registry hive.",
input_schema={
"type": "object",
"properties": {
"hive_path": {"type": "string", "description": "Path to SOFTWARE hive."},
},
"required": ["hive_path"],
},
executor=_make_auto_record("list_installed_software", "registry",
lambda hive_path: reg.list_installed_software(hive_path), graph),
module="registry",
tags=["registry", "software", "installed"],
)
TOOL_CATALOG["get_user_activity"] = ToolDefinition(
name="get_user_activity",
description="Extract user activity from NTUSER.DAT (recent docs, typed URLs, run dialog history).",
input_schema={
"type": "object",
"properties": {
"hive_path": {"type": "string", "description": "Path to NTUSER.DAT."},
},
"required": ["hive_path"],
},
executor=lambda hive_path: reg.get_user_activity(hive_path),
module="registry",
tags=["registry", "user", "activity"],
)
TOOL_CATALOG["search_registry"] = ToolDefinition(
name="search_registry",
description="Search for a pattern in registry key names and values.",
input_schema={
"type": "object",
"properties": {
"hive_path": {"type": "string", "description": "Path to hive file."},
"pattern": {"type": "string", "description": "Search pattern."},
},
"required": ["hive_path", "pattern"],
},
executor=lambda hive_path, pattern: reg.search_registry(hive_path, pattern),
module="registry",
tags=["registry", "search"],
)
# ---- Registry tools (auto-record: results are forensic facts) ----
TOOL_CATALOG["get_system_info"] = ToolDefinition(
name="get_system_info",
description="Extract OS version, install date, and registered owner from a SOFTWARE hive.",
input_schema={
"type": "object",
"properties": {
"hive_path": {"type": "string", "description": "Path to SOFTWARE hive."},
},
"required": ["hive_path"],
},
executor=_make_auto_record("get_system_info", "registry",
lambda hive_path: reg.get_system_info(hive_path), graph),
module="registry",
tags=["registry", "system"],
)
TOOL_CATALOG["get_timezone_info"] = ToolDefinition(
name="get_timezone_info",
description="Extract timezone settings from a SYSTEM hive.",
input_schema={
"type": "object",
"properties": {
"hive_path": {"type": "string", "description": "Path to SYSTEM hive."},
},
"required": ["hive_path"],
},
executor=_make_auto_record("get_timezone_info", "registry",
lambda hive_path: reg.get_timezone_info(hive_path), graph),
module="registry",
tags=["registry", "timezone", "system"],
)
TOOL_CATALOG["get_computer_name"] = ToolDefinition(
name="get_computer_name",
description="Extract computer/host name from a SYSTEM hive.",
input_schema={
"type": "object",
"properties": {
"hive_path": {"type": "string", "description": "Path to SYSTEM hive."},
},
"required": ["hive_path"],
},
executor=_make_auto_record("get_computer_name", "registry",
lambda hive_path: reg.get_computer_name(hive_path), graph),
module="registry",
tags=["registry", "system", "hostname"],
)
TOOL_CATALOG["get_shutdown_time"] = ToolDefinition(
name="get_shutdown_time",
description="Extract last shutdown time from a SYSTEM hive.",
input_schema={
"type": "object",
"properties": {
"hive_path": {"type": "string", "description": "Path to SYSTEM hive."},
},
"required": ["hive_path"],
},
executor=_make_auto_record("get_shutdown_time", "registry",
lambda hive_path: reg.get_shutdown_time(hive_path), graph),
module="registry",
tags=["registry", "system", "shutdown"],
)
TOOL_CATALOG["enumerate_users"] = ToolDefinition(
name="enumerate_users",
description="List all user accounts and RIDs from a SAM hive.",
input_schema={
"type": "object",
"properties": {
"hive_path": {"type": "string", "description": "Path to SAM hive."},
},
"required": ["hive_path"],
},
executor=_make_auto_record("enumerate_users", "registry",
lambda hive_path: reg.enumerate_users(hive_path), graph),
module="registry",
tags=["registry", "user", "accounts", "sam"],
)
TOOL_CATALOG["get_network_interfaces"] = ToolDefinition(
name="get_network_interfaces",
description="Extract network adapter and TCP/IP config from a SYSTEM hive.",
input_schema={
"type": "object",
"properties": {
"hive_path": {"type": "string", "description": "Path to SYSTEM hive."},
},
"required": ["hive_path"],
},
executor=_make_auto_record("get_network_interfaces", "registry",
lambda hive_path: reg.get_network_interfaces(hive_path), graph),
module="registry",
tags=["registry", "network", "adapter", "ip"],
)
TOOL_CATALOG["get_email_config"] = ToolDefinition(
name="get_email_config",
description="Extract email account configuration (SMTP, POP3, NNTP) from NTUSER.DAT.",
input_schema={
"type": "object",
"properties": {
"hive_path": {"type": "string", "description": "Path to NTUSER.DAT."},
},
"required": ["hive_path"],
},
executor=_make_auto_record("get_email_config", "registry",
lambda hive_path: reg.get_email_config(hive_path), graph),
module="registry",
tags=["registry", "email", "account"],
)
# ---- Parser tools ----
TOOL_CATALOG["parse_prefetch"] = ToolDefinition(
name="parse_prefetch",
description="Parse a Windows Prefetch (.pf) file to extract executable name, last execution time, and run count.",
input_schema={
"type": "object",
"properties": {
"file_path": {"type": "string", "description": "Path to extracted .pf file."},
},
"required": ["file_path"],
},
executor=_make_auto_record("parse_prefetch", "filesystem",
lambda file_path: parsers.parse_prefetch(file_path), graph),
module="parsers",
tags=["filesystem", "prefetch", "execution"],
)
TOOL_CATALOG["read_text_file"] = ToolDefinition(
name="read_text_file",
description="Read an extracted text file (configs, logs, chat logs, etc.).",
input_schema={
"type": "object",
"properties": {
"file_path": {"type": "string", "description": "Local path to the file."},
},
"required": ["file_path"],
},
executor=lambda file_path: parsers.read_text_file(file_path),
module="parsers",
tags=["text", "read"],
)
TOOL_CATALOG["read_binary_preview"] = ToolDefinition(
name="read_binary_preview",
description="Preview a binary file in hex+ASCII format.",
input_schema={
"type": "object",
"properties": {
"file_path": {"type": "string", "description": "Local path to the file."},
},
"required": ["file_path"],
},
executor=lambda file_path: parsers.read_binary_preview(file_path),
module="parsers",
tags=["binary", "hex", "preview"],
)
TOOL_CATALOG["search_text_file"] = ToolDefinition(
name="search_text_file",
description="Search for a regex pattern in an extracted text file. Returns matching lines with line numbers.",
input_schema={
"type": "object",
"properties": {
"file_path": {"type": "string", "description": "Path to extracted file."},
"pattern": {"type": "string", "description": "Regex pattern."},
},
"required": ["file_path", "pattern"],
},
executor=lambda file_path, pattern: parsers.search_text_file(file_path, pattern),
module="parsers",
tags=["text", "search", "regex"],
)
TOOL_CATALOG["read_text_file_section"] = ToolDefinition(
name="read_text_file_section",
description="Read a section of a large text file starting at a byte offset.",
input_schema={
"type": "object",
"properties": {
"file_path": {"type": "string", "description": "Path to file."},
"start": {"type": "integer", "description": "Byte offset to start reading."},
"max_bytes": {"type": "integer", "description": "Maximum bytes to read."},
},
"required": ["file_path"],
},
executor=lambda file_path, start=0, max_bytes=8000: parsers.read_text_file_section(
file_path, start, max_bytes
),
module="parsers",
tags=["text", "read", "section"],
)
TOOL_CATALOG["list_extracted_dir"] = ToolDefinition(
name="list_extracted_dir",
description="List files in an extracted directory with sizes.",
input_schema={
"type": "object",
"properties": {
"dir_path": {"type": "string", "description": "Directory path."},
},
"required": ["dir_path"],
},
executor=lambda dir_path: parsers.list_extracted_dir(dir_path),
module="parsers",
tags=["filesystem", "listing", "extracted"],
)
TOOL_CATALOG["parse_pcap_strings"] = ToolDefinition(
name="parse_pcap_strings",
description="Extract HTTP headers, hosts, User-Agent, cookies, and URLs from a PCAP/capture file.",
input_schema={
"type": "object",
"properties": {
"file_path": {"type": "string", "description": "Path to PCAP file."},
},
"required": ["file_path"],
},
executor=lambda file_path: parsers.parse_pcap_strings(file_path),
module="parsers",
tags=["network", "pcap", "http", "capture"],
)
# ---- Apply result caching to deterministic read-only tools ----
# Must come AFTER all tools are registered. Auto-record wrapped tools
# (e.g. get_system_info) are NOT in CACHEABLE_TOOLS since they write
# to the evidence graph as a side effect.
_tool_result_cache.clear()
for tool_name, td in TOOL_CATALOG.items():
if tool_name in CACHEABLE_TOOLS:
td.executor = _make_cached(tool_name, td.executor)

0
tools/__init__.py Normal file

234
tools/parsers.py Normal file

@@ -0,0 +1,234 @@
"""Parsers for various forensic artifact formats."""
from __future__ import annotations
import asyncio
import logging
import os
import re
import struct
from datetime import datetime, timedelta, timezone
logger = logging.getLogger(__name__)
async def read_text_file(file_path: str, max_bytes: int = 8000) -> str:
"""Read a text file, with size limit."""
try:
with open(file_path, "r", errors="replace") as f:
content = f.read(max_bytes)
size = os.path.getsize(file_path)
if size > max_bytes:
content += f"\n\n[Truncated: file is {size} bytes, showing first {max_bytes}]"
return content
except Exception as e:
return f"[Error reading {file_path}: {e}]"
async def read_binary_preview(file_path: str, max_bytes: int = 2000) -> str:
"""Read a binary file and show hex + ASCII preview."""
try:
with open(file_path, "rb") as f:
data = f.read(max_bytes)
lines = []
for i in range(0, len(data), 16):
chunk = data[i:i + 16]
hex_part = " ".join(f"{b:02x}" for b in chunk)
ascii_part = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
lines.append(f"{i:08x} {hex_part:<48} {ascii_part}")
size = os.path.getsize(file_path)
header = f"File: {file_path} ({size} bytes)\n"
return header + "\n".join(lines)
except Exception as e:
return f"[Error reading {file_path}: {e}]"
async def read_text_file_section(file_path: str, start: int = 0, max_bytes: int = 8000) -> str:
    """Read a section of a text file starting at byte offset `start`."""
    try:
        size = os.path.getsize(file_path)
        start = max(start, 0)
        # Read in binary mode so `start` and `max_bytes` are true byte counts.
        # In text mode, read() counts characters and seek() to an arbitrary
        # offset is not well defined.
        with open(file_path, "rb") as f:
            f.seek(start)
            raw = f.read(max_bytes)
        content = raw.decode("utf-8", errors="replace")
        end = start + len(raw)
        remaining = size - end
        header = f"[File: {file_path}, {size} bytes, showing offset {start}-{end}]"
        if remaining > 0:
            content += f"\n\n[{remaining} bytes remaining after this section]"
        return header + "\n" + content
    except Exception as e:
        return f"[Error reading {file_path}: {e}]"
async def search_text_file(file_path: str, pattern: str, max_matches: int = 50) -> str:
"""Search for a pattern in an extracted text file. Returns matching lines with line numbers."""
try:
size = os.path.getsize(file_path)
matches = []
try:
compiled = re.compile(pattern, re.IGNORECASE)
except re.error:
compiled = re.compile(re.escape(pattern), re.IGNORECASE)
with open(file_path, "r", errors="replace") as f:
for lineno, line in enumerate(f, 1):
if compiled.search(line):
matches.append(f" {lineno}: {line.rstrip()[:200]}")
if len(matches) >= max_matches:
matches.append(f" [Truncated: more than {max_matches} matches]")
break
header = f"Search '{pattern}' in {file_path} ({size} bytes): {len(matches)} matches"
if not matches:
return header + "\n (no matches)"
return header + "\n" + "\n".join(matches)
except Exception as e:
return f"[Error searching {file_path}: {e}]"
async def parse_pcap_strings(file_path: str) -> str:
"""Extract HTTP headers and other readable strings from a PCAP/capture file.
Uses The Sleuth Kit's `srch_strings` command to find printable text, then
filters for forensically relevant patterns (HTTP headers, URLs, credentials).
"""
try:
proc = await asyncio.create_subprocess_exec(
"srch_strings", "-a", "-n", "8", file_path,
stdout=asyncio.subprocess.PIPE,
stderr=asyncio.subprocess.PIPE,
)
stdout, _ = await proc.communicate()
all_strings = stdout.decode("utf-8", errors="replace").splitlines()
hosts = set()
user_agents = set()
urls = []
cookies = []
http_methods = []
other_interesting = []
for line in all_strings:
stripped = line.strip()
if stripped.startswith("Host: "):
hosts.add(stripped[6:])
elif stripped.startswith("User-Agent: "):
user_agents.add(stripped[12:])
elif stripped.startswith("Cookie: "):
cookies.append(stripped[:200])
elif re.match(r"^(GET|POST|PUT|DELETE|HEAD) /", stripped):
urls.append(stripped[:200])
elif stripped.startswith("HTTP/"):
http_methods.append(stripped[:200])
elif any(kw in stripped.lower() for kw in ("password", "login", "username", "email", "set-cookie")):
other_interesting.append(stripped[:200])
size = os.path.getsize(file_path)
lines = [f"=== PCAP String Analysis: {file_path} ({size} bytes) ==="]
lines.append(f"Total printable strings (>=8 chars): {len(all_strings)}")
lines.append(f"\nUnique Hosts ({len(hosts)}):")
for h in sorted(hosts):
lines.append(f" {h}")
lines.append(f"\nUser-Agent strings ({len(user_agents)}):")
for ua in sorted(user_agents):
lines.append(f" {ua}")
lines.append(f"\nHTTP Requests ({len(urls)}):")
for u in urls[:30]:
lines.append(f" {u}")
if len(urls) > 30:
lines.append(f" ... ({len(urls) - 30} more)")
lines.append(f"\nHTTP Responses ({len(http_methods)}):")
for m in http_methods[:20]:
lines.append(f" {m}")
if cookies:
lines.append(f"\nCookies ({len(cookies)}):")
for c in cookies[:20]:
lines.append(f" {c}")
if other_interesting:
lines.append(f"\nOther interesting strings ({len(other_interesting)}):")
for o in other_interesting[:30]:
lines.append(f" {o}")
return "\n".join(lines)
except Exception as e:
return f"[Error parsing PCAP strings: {e}]"
async def parse_prefetch(file_path: str) -> str:
"""Parse a Windows XP Prefetch (.pf) file to extract execution info.
Returns: executable name, last execution time, and run count.
"""
try:
with open(file_path, "rb") as f:
data = f.read()
if len(data) < 0x94:
return f"[Error: file too small for Prefetch format ({len(data)} bytes)]"
version = struct.unpack_from("<I", data, 0)[0]
sig = data[4:8]
if sig != b"SCCA":
return f"[Error: not a Prefetch file — signature is {sig!r}, expected b'SCCA']"
# Filename: null-terminated UTF-16LE at offset 0x10
raw_name = data[0x10:0x4C]
name_end = raw_name.find(b"\x00\x00")
if name_end > 0:
if name_end % 2 == 1:
name_end += 1
filename = raw_name[:name_end].decode("utf-16-le")
else:
filename = raw_name.decode("utf-16-le", errors="replace").rstrip("\x00")
# Last execution time: FILETIME at offset 0x78 (Windows XP, version 17)
ft = struct.unpack_from("<Q", data, 0x78)[0]
if ft > 0:
epoch = datetime(1601, 1, 1, tzinfo=timezone.utc)
last_run = epoch + timedelta(microseconds=ft // 10)
last_run_str = last_run.strftime("%Y-%m-%d %H:%M:%S UTC")
else:
last_run_str = "(not available)"
# Run count at offset 0x90
run_count = struct.unpack_from("<I", data, 0x90)[0]
lines = [
f"=== Prefetch Analysis: {file_path} ===",
f"Prefetch Version: {version}",
f"Executable: {filename}",
f"Last Execution: {last_run_str}",
f"Run Count: {run_count}",
f"File Size: {len(data)} bytes",
]
return "\n".join(lines)
except Exception as e:
return f"[Error parsing Prefetch: {e}]"
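# Illustrative sketch (not part of the module): a synthetic version-17
# (Windows XP) Prefetch header built and parsed with the same offsets used
# above (signature at 0x04, name at 0x10, FILETIME at 0x78, run count at
# 0x90). `build_xp_prefetch` and `parse_xp_prefetch` are hypothetical helpers
# and the sample values are made up. The name is decoded first and then cut
# at the first NUL, a simpler variant of the byte search in parse_prefetch.

```python
import struct
from datetime import datetime, timedelta, timezone

FILETIME_EPOCH = datetime(1601, 1, 1, tzinfo=timezone.utc)


def build_xp_prefetch(exe_name: str, last_run: datetime, run_count: int) -> bytes:
    """Build a minimal header; only the fields read by the parser are set."""
    buf = bytearray(0x98)
    struct.pack_into("<I", buf, 0x00, 17)        # format version (XP)
    buf[0x04:0x08] = b"SCCA"                     # signature
    name = exe_name.encode("utf-16-le")          # must fit in 0x10..0x4C
    buf[0x10:0x10 + len(name)] = name
    seconds = int((last_run - FILETIME_EPOCH).total_seconds())
    struct.pack_into("<Q", buf, 0x78, seconds * 10_000_000)  # 100 ns ticks
    struct.pack_into("<I", buf, 0x90, run_count)
    return bytes(buf)


def parse_xp_prefetch(data: bytes) -> dict:
    """Read version, executable name, last run time, and run count."""
    if len(data) < 0x94 or data[4:8] != b"SCCA":
        raise ValueError("not a Prefetch file")
    version = struct.unpack_from("<I", data, 0)[0]
    name = data[0x10:0x4C].decode("utf-16-le", errors="replace").split("\x00", 1)[0]
    ft = struct.unpack_from("<Q", data, 0x78)[0]
    return {
        "version": version,
        "executable": name,
        "last_run": FILETIME_EPOCH + timedelta(microseconds=ft // 10),
        "run_count": struct.unpack_from("<I", data, 0x90)[0],
    }


sample = build_xp_prefetch(
    "NOTEPAD.EXE", datetime(2004, 8, 27, 15, 46, 0, tzinfo=timezone.utc), 3
)
info = parse_xp_prefetch(sample)
```

# These offsets apply to version 17 only; later Prefetch versions place the
# timestamp and run count elsewhere, so a non-17 version value deserves
# caution before trusting the parsed fields.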
async def list_extracted_dir(dir_path: str) -> str:
    """List files in an extracted directory."""
    try:
        entries = []
        truncated = False
        for root, _dirs, files in os.walk(dir_path):
            for name in files:
                if len(entries) >= 200:
                    truncated = True
                    break
                full = os.path.join(root, name)
                rel = os.path.relpath(full, dir_path)
                size = os.path.getsize(full)
                entries.append(f" {rel} ({size} bytes)")
            # The inner break only leaves the current directory, so stop the
            # walk explicitly once the limit is reached.
            if truncated:
                break
        count = f"{len(entries)}+" if truncated else str(len(entries))
        if truncated:
            entries.append(" ... (truncated)")
        return f"Directory: {dir_path}\nFiles ({count}):\n" + "\n".join(entries)
    except Exception as e:
        return f"[Error listing {dir_path}: {e}]"

449
tools/registry.py Normal file

@@ -0,0 +1,449 @@
"""Windows registry parsing tools."""
from __future__ import annotations
import logging
import struct
from datetime import datetime, timedelta, timezone
logger = logging.getLogger(__name__)
# Suppress noisy regipy warnings (hive-type identification + binary encoding fallbacks)
logging.getLogger("regipy.registry").setLevel(logging.WARNING)
logging.getLogger("regipy.utils").setLevel(logging.ERROR)
async def parse_registry_key(hive_path: str, key_path: str = "") -> str:
"""Parse a registry hive and list subkeys/values at the given path.
Uses regipy for pure-Python registry parsing.
"""
try:
from regipy.registry import RegistryHive
except ImportError:
return "[Error: regipy not installed. Run: uv add regipy]"
try:
reg = RegistryHive(hive_path)
if key_path:
key = reg.get_key(key_path)
else:
key = reg.root_key()
lines = [f"Key: {key.path}", f"Timestamp: {key.header.last_modified}", ""]
# Subkeys
subkeys = list(key.iter_subkeys())
if subkeys:
lines.append(f"Subkeys ({len(subkeys)}):")
for sk in subkeys[:50]:
lines.append(f" {sk.name}")
if len(subkeys) > 50:
lines.append(f" ... ({len(subkeys) - 50} more)")
lines.append("")
# Values
values = list(key.iter_values())
if values:
lines.append(f"Values ({len(values)}):")
for v in values[:30]:
val_data = str(v.value)
if len(val_data) > 200:
val_data = val_data[:200] + "..."
lines.append(f" {v.name} ({v.value_type}) = {val_data}")
return "\n".join(lines)
except Exception as e:
return f"[Error parsing registry: {e}]"
async def list_installed_software(hive_path: str) -> str:
"""List installed software from a SOFTWARE registry hive."""
try:
from regipy.registry import RegistryHive
except ImportError:
return "[Error: regipy not installed]"
try:
reg = RegistryHive(hive_path)
uninstall_path = "\\Microsoft\\Windows\\CurrentVersion\\Uninstall"
key = reg.get_key(uninstall_path)
programs = []
for sk in key.iter_subkeys():
name = sk.name
display_name = None
for v in sk.iter_values():
if v.name == "DisplayName":
display_name = v.value
break
programs.append(display_name or name)
lines = [f"Installed Software ({len(programs)} entries):", ""]
for p in sorted(programs):
lines.append(f" - {p}")
return "\n".join(lines)
except Exception as e:
return f"[Error listing software: {e}]"
async def get_user_activity(hive_path: str) -> str:
"""Extract user activity indicators from NTUSER.DAT."""
try:
from regipy.registry import RegistryHive
except ImportError:
return "[Error: regipy not installed]"
try:
reg = RegistryHive(hive_path)
lines = ["=== User Activity from NTUSER.DAT ===", ""]
# Recent documents
try:
key = reg.get_key("\\Software\\Microsoft\\Windows\\CurrentVersion\\Explorer\\RecentDocs")
lines.append("Recent Documents:")
for v in key.iter_values():
if v.name != "MRUListEx":
lines.append(f" {v.name}")
lines.append("")
except Exception:
lines.append("Recent Documents: [not found]")
# Run MRU (commands typed in Run dialog)
try:
key = reg.get_key("\\Software\\Microsoft\\Windows\\CurrentVersion\\Explorer\\RunMRU")
lines.append("Run Dialog MRU:")
for v in key.iter_values():
if v.name not in ("MRUList",):
lines.append(f" {v.name}: {v.value}")
lines.append("")
except Exception:
lines.append("Run MRU: [not found]")
# Typed URLs
try:
key = reg.get_key("\\Software\\Microsoft\\Internet Explorer\\TypedURLs")
lines.append("Typed URLs:")
for v in key.iter_values():
lines.append(f" {v.value}")
lines.append("")
except Exception:
lines.append("Typed URLs: [not found]")
return "\n".join(lines)
except Exception as e:
return f"[Error analyzing user activity: {e}]"
def _filetime_to_datetime(ft: int) -> str:
"""Convert a Windows FILETIME (100-nanosecond intervals since 1601-01-01) to ISO string."""
if ft <= 0:
return "(not set)"
try:
epoch = datetime(1601, 1, 1, tzinfo=timezone.utc)
dt = epoch + timedelta(microseconds=ft // 10)
return dt.strftime("%Y-%m-%d %H:%M:%S UTC")
except (ValueError, OverflowError):
return f"(invalid FILETIME: {ft})"
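# Illustrative sketch (not part of the module): the FILETIME arithmetic as a
# worked example. A FILETIME counts 100-nanosecond ticks since 1601-01-01 UTC,
# so dividing by 10 gives microseconds. The Unix epoch lies 134774 days after
# that date, i.e. 134774 * 86400 * 10**7 = 116444736000000000 ticks.
# `filetime_to_dt` and `dt_to_filetime` are hypothetical helper names.

```python
from datetime import datetime, timedelta, timezone

FILETIME_EPOCH = datetime(1601, 1, 1, tzinfo=timezone.utc)
UNIX_EPOCH_AS_FILETIME = 116_444_736_000_000_000


def filetime_to_dt(ft: int) -> datetime:
    """Convert 100 ns ticks since 1601-01-01 UTC to an aware datetime."""
    return FILETIME_EPOCH + timedelta(microseconds=ft // 10)


def dt_to_filetime(dt: datetime) -> int:
    """Inverse conversion, using integer arithmetic to avoid float rounding."""
    delta = dt - FILETIME_EPOCH
    return (delta.days * 86_400 + delta.seconds) * 10_000_000 + delta.microseconds * 10


unix_epoch = filetime_to_dt(UNIX_EPOCH_AS_FILETIME)
```

# The `ft // 10` step drops the last decimal digit of the tick count, so
# results are accurate to one microsecond, which is all `datetime` can hold.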
async def get_system_info(software_hive_path: str) -> str:
"""Extract OS version, install date, registered owner from SOFTWARE hive."""
try:
from regipy.registry import RegistryHive
except ImportError:
return "[Error: regipy not installed]"
try:
reg = RegistryHive(software_hive_path)
key = reg.get_key("\\Microsoft\\Windows NT\\CurrentVersion")
data = {}
for v in key.iter_values():
data[v.name] = v.value
lines = ["=== System Information (SOFTWARE hive) ==="]
lines.append(f"Product Name: {data.get('ProductName', 'N/A')}")
lines.append(f"Current Version: {data.get('CurrentVersion', 'N/A')}")
lines.append(f"Build Number: {data.get('CurrentBuildNumber', 'N/A')}")
lines.append(f"CSD Version (Service Pack): {data.get('CSDVersion', 'None')}")
lines.append(f"Registered Owner: {data.get('RegisteredOwner', 'N/A')}")
lines.append(f"Registered Organization: {data.get('RegisteredOrganization', 'N/A')}")
lines.append(f"Product ID: {data.get('ProductId', 'N/A')}")
lines.append(f"System Root: {data.get('SystemRoot', 'N/A')}")
install_epoch = data.get("InstallDate")
if install_epoch and isinstance(install_epoch, int):
install_dt = datetime.fromtimestamp(install_epoch, tz=timezone.utc)
lines.append(f"Install Date: {install_dt.strftime('%Y-%m-%d %H:%M:%S UTC')} (epoch: {install_epoch})")
else:
lines.append(f"Install Date: {install_epoch}")
return "\n".join(lines)
except Exception as e:
return f"[Error: {e}]"
async def get_timezone_info(system_hive_path: str) -> str:
    """Extract timezone settings from SYSTEM hive."""
    try:
        from regipy.registry import RegistryHive
    except ImportError:
        return "[Error: regipy not installed]"
    try:
        reg = RegistryHive(system_hive_path)
        key = reg.get_key("\\ControlSet001\\Control\\TimeZoneInformation")
        data = {}
        for v in key.iter_values():
            data[v.name] = v.value
        lines = ["=== Timezone Information (SYSTEM hive) ==="]
        lines.append(f"Standard Name: {data.get('StandardName', 'N/A')}")
        lines.append(f"Daylight Name: {data.get('DaylightName', 'N/A')}")
        bias = data.get("Bias", "N/A")
        if isinstance(bias, int):
            # Bias is a REG_DWORD; if the parser returns it unsigned, values
            # >= 2**31 are really negative (zones east of UTC).
            if bias >= 0x80000000:
                bias -= 0x100000000
            # UTC = local time + bias, so the UTC offset has the opposite sign.
            # divmod on the absolute value keeps half-hour zones correct
            # (e.g. -330 minutes is UTC+5:30, not UTC+6:00).
            sign = "-" if bias > 0 else "+"
            hours, minutes = divmod(abs(bias), 60)
            lines.append(f"Bias: {bias} minutes (UTC{sign}{hours}:{minutes:02d})")
        else:
            lines.append(f"Bias: {bias}")
        lines.append(f"Active Time Bias: {data.get('ActiveTimeBias', 'N/A')}")
        return "\n".join(lines)
    except Exception as e:
        return f"[Error: {e}]"


async def get_computer_name(system_hive_path: str) -> str:
    """Extract computer name from SYSTEM hive."""
    try:
        from regipy.registry import RegistryHive
    except ImportError:
        return "[Error: regipy not installed]"
    try:
        reg = RegistryHive(system_hive_path)
        lines = ["=== Computer Name (SYSTEM hive) ==="]
        for path_label, path in [
            ("ComputerName", "\\ControlSet001\\Control\\ComputerName\\ComputerName"),
            ("ActiveComputerName", "\\ControlSet001\\Control\\ComputerName\\ActiveComputerName"),
        ]:
            try:
                key = reg.get_key(path)
                for v in key.iter_values():
                    if v.name == "ComputerName":
                        lines.append(f"{path_label}: {v.value}")
            except Exception:
                pass
        # Also try Tcpip hostname
        try:
            key = reg.get_key("\\ControlSet001\\Services\\Tcpip\\Parameters")
            for v in key.iter_values():
                if v.name in ("Hostname", "Domain", "NV Hostname"):
                    lines.append(f"TCP/IP {v.name}: {v.value}")
        except Exception:
            pass
        return "\n".join(lines) if len(lines) > 1 else "Computer name not found in SYSTEM hive."
    except Exception as e:
        return f"[Error: {e}]"


async def get_shutdown_time(system_hive_path: str) -> str:
    """Extract last shutdown time from SYSTEM hive."""
    try:
        from regipy.registry import RegistryHive
    except ImportError:
        return "[Error: regipy not installed]"
    try:
        reg = RegistryHive(system_hive_path)
        lines = ["=== Shutdown Time (SYSTEM hive) ==="]
        try:
            key = reg.get_key("\\ControlSet001\\Control\\Windows")
            for v in key.iter_values():
                if v.name == "ShutdownTime":
                    raw = v.value
                    if isinstance(raw, bytes) and len(raw) >= 8:
                        ft = struct.unpack("<Q", raw[:8])[0]
                        lines.append(f"Last Shutdown: {_filetime_to_datetime(ft)}")
                    elif isinstance(raw, int):
                        lines.append(f"Last Shutdown: {_filetime_to_datetime(raw)}")
                    elif isinstance(raw, str):
                        # regipy may return hex-encoded string for REG_BINARY
                        try:
                            raw_bytes = bytes.fromhex(raw)
                            ft = struct.unpack("<Q", raw_bytes[:8])[0]
                            lines.append(f"Last Shutdown: {_filetime_to_datetime(ft)}")
                        except (ValueError, struct.error):
                            lines.append(f"ShutdownTime (raw): {raw!r}")
                    else:
                        lines.append(f"ShutdownTime (raw): {raw!r}")
        except Exception:
            lines.append("ShutdownTime value not found at ControlSet001\\Control\\Windows")
        # Also show all values from the Windows key for context
        try:
            key = reg.get_key("\\ControlSet001\\Control\\Windows")
            lines.append("\nAll values at ControlSet001\\Control\\Windows:")
            for v in key.iter_values():
                lines.append(f"  {v.name} = {v.value}")
        except Exception:
            pass
        return "\n".join(lines)
    except Exception as e:
        return f"[Error: {e}]"


async def enumerate_users(sam_hive_path: str) -> str:
    """Enumerate all user accounts from SAM hive."""
    try:
        from regipy.registry import RegistryHive
    except ImportError:
        return "[Error: regipy not installed]"
    try:
        reg = RegistryHive(sam_hive_path)
        key = reg.get_key("\\SAM\\Domains\\Account\\Users\\Names")
        accounts = []
        for sk in key.iter_subkeys():
            accounts.append(sk.name)
        lines = [f"=== User Accounts (SAM hive) — {len(accounts)} total ==="]
        for acct in accounts:
            lines.append(f"  - {acct}")
        # Try to get RIDs from the Users key
        try:
            users_key = reg.get_key("\\SAM\\Domains\\Account\\Users")
            rid_entries = []
            for sk in users_key.iter_subkeys():
                if sk.name != "Names" and sk.name.startswith("0"):
                    rid = int(sk.name, 16)
                    rid_entries.append(f"  RID {rid} (0x{sk.name})")
            if rid_entries:
                lines.append("\nUser RIDs:")
                lines.extend(rid_entries)
        except Exception:
            pass
        return "\n".join(lines)
    except Exception as e:
        return f"[Error: {e}]"


async def get_network_interfaces(system_hive_path: str) -> str:
    """Extract network adapter and TCP/IP configuration from SYSTEM hive."""
    try:
        from regipy.registry import RegistryHive
    except ImportError:
        return "[Error: regipy not installed]"
    try:
        reg = RegistryHive(system_hive_path)
        lines = ["=== Network Interfaces (SYSTEM hive) ==="]
        # Try TCP/IP interfaces
        try:
            key = reg.get_key("\\ControlSet001\\Services\\Tcpip\\Parameters\\Interfaces")
            for sk in key.iter_subkeys():
                lines.append(f"\nInterface: {sk.name}")
                for v in sk.iter_values():
                    if v.name in (
                        "IPAddress", "SubnetMask", "DefaultGateway",
                        "DhcpIPAddress", "DhcpSubnetMask", "DhcpDefaultGateway",
                        "DhcpServer", "NameServer", "Domain", "EnableDHCP",
                    ):
                        lines.append(f"  {v.name} = {v.value}")
        except Exception as e:
            lines.append(f"TCP/IP Interfaces: {e}")
        # Try network adapter class
        adapter_class = "\\ControlSet001\\Control\\Class\\{4D36E972-E325-11CE-BFC1-08002bE10318}"
        try:
            key = reg.get_key(adapter_class)
            lines.append("\nNetwork Adapters:")
            for sk in key.iter_subkeys():
                if sk.name == "Properties":
                    continue
                desc = None
                for v in sk.iter_values():
                    if v.name == "DriverDesc":
                        desc = v.value
                if desc:
                    lines.append(f"  [{sk.name}] {desc}")
        except Exception as e:
            lines.append(f"Network Adapters: {e}")
        # Try NetworkCards
        try:
            key = reg.get_key("\\ControlSet001\\Control\\NetworkCards")
            for sk in key.iter_subkeys():
                for v in sk.iter_values():
                    if v.name == "Description":
                        lines.append(f"  NetworkCard {sk.name}: {v.value}")
        except Exception:
            pass
        return "\n".join(lines) if len(lines) > 1 else "No network interface data found in SYSTEM hive."
    except Exception as e:
        return f"[Error: {e}]"


async def get_email_config(ntuser_hive_path: str) -> str:
    """Extract email account configuration (SMTP, POP3, NNTP) from NTUSER.DAT."""
    try:
        from regipy.registry import RegistryHive
    except ImportError:
        return "[Error: regipy not installed]"
    try:
        reg = RegistryHive(ntuser_hive_path)
        lines = ["=== Email Account Configuration (NTUSER.DAT) ==="]
        try:
            key = reg.get_key("\\Software\\Microsoft\\Internet Account Manager\\Accounts")
            for sk in key.iter_subkeys():
                lines.append(f"\n--- Account: {sk.name} ---")
                for v in sk.iter_values():
                    # Skip binary password hash fields (but keep "Prompt for Password" flags)
                    if "Password" in v.name and "Prompt" not in v.name:
                        lines.append(f"  {v.name} = [present, redacted]")
                    else:
                        lines.append(f"  {v.name} = {v.value}")
        except Exception as e:
            lines.append(f"Internet Account Manager: {e}")
        return "\n".join(lines)
    except Exception as e:
        return f"[Error: {e}]"


async def search_registry(hive_path: str, pattern: str) -> str:
    """Search for a pattern in registry key names and values."""
    try:
        from regipy.registry import RegistryHive
    except ImportError:
        return "[Error: regipy not installed]"
    try:
        reg = RegistryHive(hive_path)
        pattern_lower = pattern.lower()
        matches = []
        for entry in reg.recurse_subkeys(as_json=True):
            path = entry.path or ""
            if pattern_lower in path.lower():
                matches.append(f"KEY: {path}")
            if hasattr(entry, "values") and entry.values:
                for v in entry.values:
                    name = v.get("name", "")
                    value = str(v.get("value", ""))
                    if pattern_lower in name.lower() or pattern_lower in value.lower():
                        matches.append(f"  {path}\\{name} = {value[:200]}")
            # Stop walking the hive once enough matches have been collected
            if len(matches) >= 50:
                matches.append(f"[Truncated: more than 50 matches for '{pattern}']")
                break
        if not matches:
            return f"No registry entries matching '{pattern}' found."
        return "\n".join(matches)
    except Exception as e:
        return f"[Error searching registry: {e}]"
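
The timezone and shutdown-time readers above both depend on decoding raw registry integers. The following is a minimal standalone sketch of those two conversions, not the project's own helpers: `signed_bias_minutes`, `format_utc_offset`, and `filetime_to_utc` are hypothetical names introduced here for illustration. It assumes `Bias` is a signed 32-bit minute count with UTC = local time + Bias, and that `ShutdownTime` holds an 8-byte little-endian FILETIME (100-nanosecond ticks since 1601-01-01 UTC).

```python
import struct
from datetime import datetime, timedelta, timezone

# FILETIME counts from 1601-01-01 00:00:00 UTC.
FILETIME_EPOCH = datetime(1601, 1, 1, tzinfo=timezone.utc)


def signed_bias_minutes(raw: int) -> int:
    """Reinterpret an unsigned 32-bit DWORD as a signed minute count."""
    return raw - 0x100000000 if raw >= 0x80000000 else raw


def format_utc_offset(bias_minutes: int) -> str:
    """Render a Windows Bias (UTC = local + bias) as a UTC offset string."""
    # A positive bias means local time is behind UTC, hence the flipped sign.
    sign = "-" if bias_minutes > 0 else "+"
    hours, minutes = divmod(abs(bias_minutes), 60)
    return f"UTC{sign}{hours:02d}:{minutes:02d}"


def filetime_to_utc(raw: bytes) -> datetime:
    """Decode the first 8 bytes of a REG_BINARY value as a FILETIME."""
    (ticks,) = struct.unpack("<Q", raw[:8])
    # One tick is 100 ns, so ten ticks make one microsecond.
    return FILETIME_EPOCH + timedelta(microseconds=ticks // 10)
```

Formatting from the magnitude of the bias keeps half-hour zones intact, which floor division on a negative minute count would round in the wrong direction.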

229
tools/sleuthkit.py Normal file

@@ -0,0 +1,229 @@
"""Wrappers around The Sleuth Kit CLI tools for forensic disk image analysis."""
from __future__ import annotations
import asyncio
import logging
import os
import tempfile
logger = logging.getLogger(__name__)
# Cache for srch_strings dump: keyed by image_path -> dump file path.
# srch_strings scans the entire image regardless of partition, so offset is irrelevant.
_strings_cache: dict[str, str] = {}
# Max output bytes to return to the LLM to avoid context overflow
MAX_OUTPUT = 8000


async def _run(cmd: list[str], max_output: int = MAX_OUTPUT) -> str:
    """Run a command asynchronously and return stdout."""
    logger.debug("Running: %s", " ".join(cmd))
    proc = await asyncio.create_subprocess_exec(
        *cmd,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )
    stdout, stderr = await proc.communicate()
    output = stdout.decode("utf-8", errors="replace")
    if proc.returncode != 0:
        err = stderr.decode("utf-8", errors="replace")
        return f"[Command failed (rc={proc.returncode})]\n{err}\n{output}"
    # The limit is applied to the decoded text, so it counts characters, not bytes.
    if len(output) > max_output:
        truncated = output[:max_output]
        return truncated + f"\n\n[Output truncated: {len(output)} characters total, showing first {max_output}]"
    return output


async def partition_info(image_path: str) -> str:
    """Get partition table layout using mmls."""
    return await _run(["mmls", image_path])


async def filesystem_info(image_path: str, offset: int = 0) -> str:
    """Get filesystem details using fsstat."""
    cmd = ["fsstat", "-o", str(offset), image_path]
    return await _run(cmd)


async def list_directory(
    image_path: str,
    offset: int = 0,
    inode: str | None = None,
    recursive: bool = False,
) -> str:
    """List directory contents using fls."""
    cmd = ["fls", "-o", str(offset)]
    if recursive:
        cmd.append("-r")
    cmd.append(image_path)
    if inode:
        cmd.append(inode)
    return await _run(cmd, max_output=16000)


async def extract_file(
    image_path: str,
    inode: str,
    output_path: str,
    offset: int = 0,
) -> str:
    """Extract a file from the image using icat.

    Streams icat stdout directly to the output file to avoid loading
    large files entirely into memory.
    """
    # os is already imported at module level; no local import is needed.
    os.makedirs(os.path.dirname(output_path) or ".", exist_ok=True)
    cmd = ["icat", "-o", str(offset), image_path, inode]
    with open(output_path, "wb") as out_f:
        proc = await asyncio.create_subprocess_exec(
            *cmd,
            stdout=out_f,
            stderr=asyncio.subprocess.PIPE,
        )
        _, stderr = await proc.communicate()
    if proc.returncode != 0:
        err = stderr.decode("utf-8", errors="replace")
        # Clean up empty/partial file on failure
        if os.path.exists(output_path):
            os.unlink(output_path)
        return f"[icat failed (rc={proc.returncode})]: {err}"
    size = os.path.getsize(output_path)
    return f"Extracted {size} bytes to {output_path}"


async def find_file(image_path: str, inode: str, offset: int = 0) -> str:
    """Find the filename for an inode using ffind."""
    cmd = ["ffind", "-o", str(offset), image_path, inode]
    return await _run(cmd)


async def _ensure_strings_dump(image_path: str) -> str:
    """Run srch_strings once and cache the output to a temp file.

    Returns the path to the cached dump file. Subsequent calls with the
    same image_path reuse the existing file. srch_strings scans the entire
    raw image, so the partition offset is irrelevant.
    """
    cached = _strings_cache.get(image_path)
    if cached and os.path.exists(cached):
        return cached
    logger.info("Building strings dump for %s — this is a one-time cost", image_path)
    import shlex
    # Write srch_strings output directly to a temp file to avoid holding
    # the entire dump in memory.
    fd, dump_path = tempfile.mkstemp(prefix="strings_dump_", suffix=".txt")
    os.close(fd)
    # -a = scan entire file, -t d = print decimal byte offset of each string
    cmd_str = (
        f"srch_strings -a -t d {shlex.quote(image_path)} "
        f"> {shlex.quote(dump_path)}"
    )
    proc = await asyncio.create_subprocess_shell(
        cmd_str,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )
    _, stderr = await proc.communicate()
    if proc.returncode != 0:
        err = stderr.decode("utf-8", errors="replace")
        logger.error("srch_strings failed (rc=%d): %s", proc.returncode, err)
        # Fall back: don't cache, let search_strings do a direct pipe
        os.unlink(dump_path)
        return ""
    size_mb = os.path.getsize(dump_path) / (1024 * 1024)
    logger.info("Strings dump ready: %s (%.1f MB)", dump_path, size_mb)
    _strings_cache[image_path] = dump_path
    return dump_path


async def search_strings(
    image_path: str,
    pattern: str,
) -> str:
    """Search for string patterns in the image.

    On first call, builds a strings dump (one-time full scan).
    Subsequent calls grep the cached dump, which is far faster than rescanning.
    """
    import shlex
    dump_path = await _ensure_strings_dump(image_path)
    # -e marks the next argument as the pattern, so a pattern that starts
    # with "-" is not mistaken for a grep option.
    if dump_path:
        # Fast path: grep the cached dump file
        cmd_str = (
            f"grep -i -e {shlex.quote(pattern)} {shlex.quote(dump_path)} | head -100"
        )
    else:
        # Fallback: direct pipe (cache build failed)
        cmd_str = (
            f"srch_strings -a {shlex.quote(image_path)} "
            f"| grep -i -e {shlex.quote(pattern)} | head -100"
        )
    proc = await asyncio.create_subprocess_shell(
        cmd_str,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )
    stdout, _ = await proc.communicate()
    output = stdout.decode("utf-8", errors="replace")
    if not output.strip():
        return f"No strings matching '{pattern}' found."
    return output[:16000]


async def count_deleted_files(image_path: str, offset: int = 0) -> str:
    """List and count deleted files using fls -rd. Returns total count and extension breakdown."""
    cmd = ["fls", "-rd", "-o", str(offset), image_path]
    output = await _run(cmd, max_output=64000)
    # _run reports failures and truncation as bracketed marker text; those
    # markers must not be counted as deleted entries.
    if output.startswith("[Command failed"):
        return output
    truncated = "[Output truncated:" in output
    lines = output.strip().splitlines()
    ext_counts: dict[str, int] = {}
    exe_files = []
    total = 0
    for line in lines:
        if not line.strip() or line.startswith("[Output truncated:"):
            continue
        total += 1
        # Extract filename from fls output like "r/r * 1234: filename.ext"
        parts = line.split(":", 1)
        if len(parts) > 1:
            fname = parts[1].strip()
            ext = fname.rsplit(".", 1)[-1].lower() if "." in fname else "(no ext)"
            ext_counts[ext] = ext_counts.get(ext, 0) + 1
            if ext in ("exe", "dll", "com", "bat", "cmd", "scr", "pif"):
                exe_files.append(fname)
    result = ["=== Deleted Files Summary ===", f"Total deleted entries: {total}"]
    if truncated:
        result.append("(fls output was truncated; counts are a lower bound)")
    result.append(f"\nExecutable files ({len(exe_files)}):")
    for e in exe_files[:50]:
        result.append(f"  {e}")
    if len(exe_files) > 50:
        result.append(f"  ... ({len(exe_files) - 50} more)")
    result.append("\nExtension breakdown:")
    for ext, count in sorted(ext_counts.items(), key=lambda x: -x[1])[:30]:
        result.append(f"  .{ext}: {count}")
    return "\n".join(result)


async def build_timeline(image_path: str, offset: int = 0) -> str:
    """Build a MAC timeline using fls -m."""
    cmd = ["fls", "-m", "/", "-o", str(offset), "-r", image_path]
    return await _run(cmd, max_output=32000)
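
Most wrappers above take an `offset`, which The Sleuth Kit's `-o` option interprets as a sector offset, while `partition_info` returns the `mmls` table only as text. A minimal sketch of bridging the two follows. `partition_offsets` is a hypothetical helper that is not part of this commit, and `SAMPLE_MMLS` is made-up data laid out the way `mmls` prints a DOS partition table.

```python
import re

# Illustrative mmls output (made-up values) for a DOS partition table.
SAMPLE_MMLS = """\
DOS Partition Table
Offset Sector: 0
Units are in 512-byte sectors

      Slot      Start        End          Length       Description
000:  Meta      0000000000   0000000000   0000000001   Primary Table (#0)
001:  -------   0000000000   0000000062   0000000063   Unallocated
002:  000:000   0000000063   0009510479   0009510417   NTFS / exFAT (0x07)
"""

# index: slot, start, end, length, description
ROW = re.compile(r"^\d+:\s+(\S+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(.+)$")


def partition_offsets(mmls_output: str) -> list[tuple[int, str]]:
    """Return (start_sector, description) for each allocated partition row."""
    found = []
    for line in mmls_output.splitlines():
        m = ROW.match(line.strip())
        if not m:
            continue  # header lines and blank lines
        slot, start, _end, _length, desc = m.groups()
        # Skip table metadata rows and unallocated gaps.
        if slot == "Meta" or set(slot) == {"-"}:
            continue
        found.append((int(start), desc.strip()))
    return found
```

With the sample above, the single allocated partition starts at sector 63, which is the value that would be passed as `offset` to `filesystem_info` or `list_directory`.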

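
`_ensure_strings_dump` redirects `srch_strings` output through a shell. The same effect can be had without a shell by handing an open file to `asyncio.create_subprocess_exec`, as `extract_file` already does for `icat`. The sketch below assumes a hypothetical helper named `run_to_file`; its demo uses the current Python interpreter as a stand-in command so that it runs without The Sleuth Kit installed.

```python
import asyncio
import os
import sys
import tempfile


async def run_to_file(cmd: list[str], out_path: str) -> tuple[int, str]:
    """Run cmd without a shell, streaming its stdout straight into out_path.

    Returns (return code, decoded stderr).
    """
    with open(out_path, "wb") as out_f:
        proc = await asyncio.create_subprocess_exec(
            *cmd,
            stdout=out_f,
            stderr=asyncio.subprocess.PIPE,
        )
        _, stderr = await proc.communicate()
    return proc.returncode, stderr.decode("utf-8", errors="replace")


def demo() -> str:
    """Write a stand-in command's output to a temp file and read it back."""
    fd, path = tempfile.mkstemp(prefix="dump_", suffix=".txt")
    os.close(fd)
    try:
        # Stand-in for: ["srch_strings", "-a", "-t", "d", image_path]
        cmd = [sys.executable, "-c", "print('1024 example string')"]
        rc, _ = asyncio.run(run_to_file(cmd, path))
        if rc != 0:
            return ""
        with open(path, encoding="utf-8") as f:
            return f.read().strip()
    finally:
        os.unlink(path)
```

Passing the command as an argument list means the image path never goes through shell parsing, so no quoting step is required.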
253
uv.lock generated Normal file

@@ -0,0 +1,253 @@
version = 1
revision = 3
requires-python = ">=3.14"
[[package]]
name = "anyio"
version = "4.13.0"
source = { registry = "https://pypi.org/simple" }
dependencies = [
{ name = "idna" },
]
sdist = { url = "https://files.pythonhosted.org/packages/19/14/2c5dd9f512b66549ae92767a9c7b330ae88e1932ca57876909410251fe13/anyio-4.13.0.tar.gz", hash = "sha256:334b70e641fd2221c1505b3890c69882fe4a2df910cba14d97019b90b24439dc", size = 231622, upload-time = "2026-03-24T12:59:09.671Z" }
wheels = [
{ url = "https://files.pythonhosted.org/packages/da/42/e921fccf5015463e32a3cf6ee7f980a6ed0f395ceeaa45060b61d86486c2/anyio-4.13.0-py3-none-any.whl", hash = "sha256:08b310f9e24a9594186fd75b4f73f4a4152069e3853f1ed8bfbf58369f4ad708", size = 114353, upload-time = "2026-03-24T12:59:08.246Z" },
]
[[package]]
name = "certifi"
version = "2026.2.25"
source = { registry = "https://pypi.org/simple" }
sdist = { url = "https://files.pythonhosted.org/packages/af/2d/7bf41579a8986e348fa033a31cdd0e4121114f6bce2457e8876010b092dd/certifi-2026.2.25.tar.gz", hash = "sha256:e887ab5cee78ea814d3472169153c2d12cd43b14bd03329a39a9c6e2e80bfba7", size = 155029, upload-time = "2026-02-25T02:54:17.342Z" }
wheels = [
{ url = "https://files.pythonhosted.org/packages/9a/3c/c17fb3ca2d9c3acff52e30b309f538586f9f5b9c9cf454f3845fc9af4881/certifi-2026.2.25-py3-none-any.whl", hash = "sha256:027692e4402ad994f1c42e52a4997a9763c646b73e4096e4d5d6db8af1d6f0fa", size = 153684, upload-time = "2026-02-25T02:54:15.766Z" },
]
[[package]]
name = "colorama"
version = "0.4.6"
source = { registry = "https://pypi.org/simple" }
sdist = { url = "https://files.pythonhosted.org/packages/d8/53/6f443c9a4a8358a93a6792e2acffb9d9d5cb0a5cfd8802644b7b1c9a02e4/colorama-0.4.6.tar.gz", hash = "sha256:08695f5cb7ed6e0531a20572697297273c47b8cae5a63ffc6d6ed5c201be6e44", size = 27697, upload-time = "2022-10-25T02:36:22.414Z" }
wheels = [
{ url = "https://files.pythonhosted.org/packages/d1/d6/3965ed04c63042e047cb6a3e6ed1a63a35087b6a609aa3a15ed8ac56c221/colorama-0.4.6-py2.py3-none-any.whl", hash = "sha256:4f1d9991f5acc0ca119f9d443620b77f9d6b33703e51011c16baf57afb285fc6", size = 25335, upload-time = "2022-10-25T02:36:20.889Z" },
]
[[package]]
name = "construct"
version = "2.10.70"
source = { registry = "https://pypi.org/simple" }
sdist = { url = "https://files.pythonhosted.org/packages/02/77/8c84b98eca70d245a2a956452f21d57930d22ab88cbeed9290ca630cf03f/construct-2.10.70.tar.gz", hash = "sha256:4d2472f9684731e58cc9c56c463be63baa1447d674e0d66aeb5627b22f512c29", size = 86337, upload-time = "2023-11-29T08:44:49.545Z" }
wheels = [
{ url = "https://files.pythonhosted.org/packages/b2/fb/08b3f4bf05da99aba8ffea52a558758def16e8516bc75ca94ff73587e7d3/construct-2.10.70-py3-none-any.whl", hash = "sha256:c80be81ef595a1a821ec69dc16099550ed22197615f4320b57cc9ce2a672cb30", size = 63020, upload-time = "2023-11-29T08:44:46.876Z" },
]
[[package]]
name = "h11"
version = "0.16.0"
source = { registry = "https://pypi.org/simple" }
sdist = { url = "https://files.pythonhosted.org/packages/01/ee/02a2c011bdab74c6fb3c75474d40b3052059d95df7e73351460c8588d963/h11-0.16.0.tar.gz", hash = "sha256:4e35b956cf45792e4caa5885e69fba00bdbc6ffafbfa020300e549b208ee5ff1", size = 101250, upload-time = "2025-04-24T03:35:25.427Z" }
wheels = [
{ url = "https://files.pythonhosted.org/packages/04/4b/29cac41a4d98d144bf5f6d33995617b185d14b22401f75ca86f384e87ff1/h11-0.16.0-py3-none-any.whl", hash = "sha256:63cf8bbe7522de3bf65932fda1d9c2772064ffb3dae62d55932da54b31cb6c86", size = 37515, upload-time = "2025-04-24T03:35:24.344Z" },
]
[[package]]
name = "httpcore"
version = "1.0.9"
source = { registry = "https://pypi.org/simple" }
dependencies = [
{ name = "certifi" },
{ name = "h11" },
]
sdist = { url = "https://files.pythonhosted.org/packages/06/94/82699a10bca87a5556c9c59b5963f2d039dbd239f25bc2a63907a05a14cb/httpcore-1.0.9.tar.gz", hash = "sha256:6e34463af53fd2ab5d807f399a9b45ea31c3dfa2276f15a2c3f00afff6e176e8", size = 85484, upload-time = "2025-04-24T22:06:22.219Z" }
wheels = [
{ url = "https://files.pythonhosted.org/packages/7e/f5/f66802a942d491edb555dd61e3a9961140fd64c90bce1eafd741609d334d/httpcore-1.0.9-py3-none-any.whl", hash = "sha256:2d400746a40668fc9dec9810239072b40b4484b640a8c38fd654a024c7a1bf55", size = 78784, upload-time = "2025-04-24T22:06:20.566Z" },
]
[[package]]
name = "httpx"
version = "0.28.1"
source = { registry = "https://pypi.org/simple" }
dependencies = [
{ name = "anyio" },
{ name = "certifi" },
{ name = "httpcore" },
{ name = "idna" },
]
sdist = { url = "https://files.pythonhosted.org/packages/b1/df/48c586a5fe32a0f01324ee087459e112ebb7224f646c0b5023f5e79e9956/httpx-0.28.1.tar.gz", hash = "sha256:75e98c5f16b0f35b567856f597f06ff2270a374470a5c2392242528e3e3e42fc", size = 141406, upload-time = "2024-12-06T15:37:23.222Z" }
wheels = [
{ url = "https://files.pythonhosted.org/packages/2a/39/e50c7c3a983047577ee07d2a9e53faf5a69493943ec3f6a384bdc792deb2/httpx-0.28.1-py3-none-any.whl", hash = "sha256:d909fcccc110f8c7faf814ca82a9a4d816bc5a6dbfea25d6591d6985b8ba59ad", size = 73517, upload-time = "2024-12-06T15:37:21.509Z" },
]
[package.optional-dependencies]
socks = [
{ name = "socksio" },
]
[[package]]
name = "idna"
version = "3.11"
source = { registry = "https://pypi.org/simple" }
sdist = { url = "https://files.pythonhosted.org/packages/6f/6d/0703ccc57f3a7233505399edb88de3cbd678da106337b9fcde432b65ed60/idna-3.11.tar.gz", hash = "sha256:795dafcc9c04ed0c1fb032c2aa73654d8e8c5023a7df64a53f39190ada629902", size = 194582, upload-time = "2025-10-12T14:55:20.501Z" }
wheels = [
{ url = "https://files.pythonhosted.org/packages/0e/61/66938bbb5fc52dbdf84594873d5b51fb1f7c7794e9c0f5bd885f30bc507b/idna-3.11-py3-none-any.whl", hash = "sha256:771a87f49d9defaf64091e6e6fe9c18d4833f140bd19464795bc32d966ca37ea", size = 71008, upload-time = "2025-10-12T14:55:18.883Z" },
]
[[package]]
name = "inflection"
version = "0.5.1"
source = { registry = "https://pypi.org/simple" }
sdist = { url = "https://files.pythonhosted.org/packages/e1/7e/691d061b7329bc8d54edbf0ec22fbfb2afe61facb681f9aaa9bff7a27d04/inflection-0.5.1.tar.gz", hash = "sha256:1a29730d366e996aaacffb2f1f1cb9593dc38e2ddd30c91250c6dde09ea9b417", size = 15091, upload-time = "2020-08-22T08:16:29.139Z" }
wheels = [
{ url = "https://files.pythonhosted.org/packages/59/91/aa6bde563e0085a02a435aa99b49ef75b0a4b062635e606dab23ce18d720/inflection-0.5.1-py2.py3-none-any.whl", hash = "sha256:f38b2b640938a4f35ade69ac3d053042959b62a0f1076a5bbaa1b9526605a8a2", size = 9454, upload-time = "2020-08-22T08:16:27.816Z" },
]
[[package]]
name = "iniconfig"
version = "2.3.0"
source = { registry = "https://pypi.org/simple" }
sdist = { url = "https://files.pythonhosted.org/packages/72/34/14ca021ce8e5dfedc35312d08ba8bf51fdd999c576889fc2c24cb97f4f10/iniconfig-2.3.0.tar.gz", hash = "sha256:c76315c77db068650d49c5b56314774a7804df16fee4402c1f19d6d15d8c4730", size = 20503, upload-time = "2025-10-18T21:55:43.219Z" }
wheels = [
{ url = "https://files.pythonhosted.org/packages/cb/b1/3846dd7f199d53cb17f49cba7e651e9ce294d8497c8c150530ed11865bb8/iniconfig-2.3.0-py3-none-any.whl", hash = "sha256:f631c04d2c48c52b84d0d0549c99ff3859c98df65b3101406327ecc7d53fbf12", size = 7484, upload-time = "2025-10-18T21:55:41.639Z" },
]
[[package]]
name = "masforensics"
version = "0.1.0"
source = { virtual = "." }
dependencies = [
{ name = "httpx", extra = ["socks"] },
{ name = "pyyaml" },
{ name = "regipy" },
]
[package.dev-dependencies]
dev = [
{ name = "pytest" },
{ name = "pytest-asyncio" },
]
[package.metadata]
requires-dist = [
{ name = "httpx", extras = ["socks"], specifier = ">=0.28.1" },
{ name = "pyyaml" },
{ name = "regipy", specifier = ">=6.2.1" },
]
[package.metadata.requires-dev]
dev = [
{ name = "pytest", specifier = ">=9.0.2" },
{ name = "pytest-asyncio", specifier = ">=1.3.0" },
]
[[package]]
name = "packaging"
version = "26.0"
source = { registry = "https://pypi.org/simple" }
sdist = { url = "https://files.pythonhosted.org/packages/65/ee/299d360cdc32edc7d2cf530f3accf79c4fca01e96ffc950d8a52213bd8e4/packaging-26.0.tar.gz", hash = "sha256:00243ae351a257117b6a241061796684b084ed1c516a08c48a3f7e147a9d80b4", size = 143416, upload-time = "2026-01-21T20:50:39.064Z" }
wheels = [
{ url = "https://files.pythonhosted.org/packages/b7/b9/c538f279a4e237a006a2c98387d081e9eb060d203d8ed34467cc0f0b9b53/packaging-26.0-py3-none-any.whl", hash = "sha256:b36f1fef9334a5588b4166f8bcd26a14e521f2b55e6b9de3aaa80d3ff7a37529", size = 74366, upload-time = "2026-01-21T20:50:37.788Z" },
]
[[package]]
name = "pluggy"
version = "1.6.0"
source = { registry = "https://pypi.org/simple" }
sdist = { url = "https://files.pythonhosted.org/packages/f9/e2/3e91f31a7d2b083fe6ef3fa267035b518369d9511ffab804f839851d2779/pluggy-1.6.0.tar.gz", hash = "sha256:7dcc130b76258d33b90f61b658791dede3486c3e6bfb003ee5c9bfb396dd22f3", size = 69412, upload-time = "2025-05-15T12:30:07.975Z" }
wheels = [
{ url = "https://files.pythonhosted.org/packages/54/20/4d324d65cc6d9205fabedc306948156824eb9f0ee1633355a8f7ec5c66bf/pluggy-1.6.0-py3-none-any.whl", hash = "sha256:e920276dd6813095e9377c0bc5566d94c932c33b27a3e3945d8389c374dd4746", size = 20538, upload-time = "2025-05-15T12:30:06.134Z" },
]
[[package]]
name = "pygments"
version = "2.20.0"
source = { registry = "https://pypi.org/simple" }
sdist = { url = "https://files.pythonhosted.org/packages/c3/b2/bc9c9196916376152d655522fdcebac55e66de6603a76a02bca1b6414f6c/pygments-2.20.0.tar.gz", hash = "sha256:6757cd03768053ff99f3039c1a36d6c0aa0b263438fcab17520b30a303a82b5f", size = 4955991, upload-time = "2026-03-29T13:29:33.898Z" }
wheels = [
{ url = "https://files.pythonhosted.org/packages/f4/7e/a72dd26f3b0f4f2bf1dd8923c85f7ceb43172af56d63c7383eb62b332364/pygments-2.20.0-py3-none-any.whl", hash = "sha256:81a9e26dd42fd28a23a2d169d86d7ac03b46e2f8b59ed4698fb4785f946d0176", size = 1231151, upload-time = "2026-03-29T13:29:30.038Z" },
]
[[package]]
name = "pytest"
version = "9.0.2"
source = { registry = "https://pypi.org/simple" }
dependencies = [
{ name = "colorama", marker = "sys_platform == 'win32'" },
{ name = "iniconfig" },
{ name = "packaging" },
{ name = "pluggy" },
{ name = "pygments" },
]
sdist = { url = "https://files.pythonhosted.org/packages/d1/db/7ef3487e0fb0049ddb5ce41d3a49c235bf9ad299b6a25d5780a89f19230f/pytest-9.0.2.tar.gz", hash = "sha256:75186651a92bd89611d1d9fc20f0b4345fd827c41ccd5c299a868a05d70edf11", size = 1568901, upload-time = "2025-12-06T21:30:51.014Z" }
wheels = [
{ url = "https://files.pythonhosted.org/packages/3b/ab/b3226f0bd7cdcf710fbede2b3548584366da3b19b5021e74f5bde2a8fa3f/pytest-9.0.2-py3-none-any.whl", hash = "sha256:711ffd45bf766d5264d487b917733b453d917afd2b0ad65223959f59089f875b", size = 374801, upload-time = "2025-12-06T21:30:49.154Z" },
]
[[package]]
name = "pytest-asyncio"
version = "1.3.0"
source = { registry = "https://pypi.org/simple" }
dependencies = [
{ name = "pytest" },
]
sdist = { url = "https://files.pythonhosted.org/packages/90/2c/8af215c0f776415f3590cac4f9086ccefd6fd463befeae41cd4d3f193e5a/pytest_asyncio-1.3.0.tar.gz", hash = "sha256:d7f52f36d231b80ee124cd216ffb19369aa168fc10095013c6b014a34d3ee9e5", size = 50087, upload-time = "2025-11-10T16:07:47.256Z" }
wheels = [
{ url = "https://files.pythonhosted.org/packages/e5/35/f8b19922b6a25bc0880171a2f1a003eaeb93657475193ab516fd87cac9da/pytest_asyncio-1.3.0-py3-none-any.whl", hash = "sha256:611e26147c7f77640e6d0a92a38ed17c3e9848063698d5c93d5aa7aa11cebff5", size = 15075, upload-time = "2025-11-10T16:07:45.537Z" },
]
[[package]]
name = "pytz"
version = "2026.1.post1"
source = { registry = "https://pypi.org/simple" }
sdist = { url = "https://files.pythonhosted.org/packages/56/db/b8721d71d945e6a8ac63c0fc900b2067181dbb50805958d4d4661cf7d277/pytz-2026.1.post1.tar.gz", hash = "sha256:3378dde6a0c3d26719182142c56e60c7f9af7e968076f31aae569d72a0358ee1", size = 321088, upload-time = "2026-03-03T07:47:50.683Z" }
wheels = [
{ url = "https://files.pythonhosted.org/packages/10/99/781fe0c827be2742bcc775efefccb3b048a3a9c6ce9aec0cbf4a101677e5/pytz-2026.1.post1-py2.py3-none-any.whl", hash = "sha256:f2fd16142fda348286a75e1a524be810bb05d444e5a081f37f7affc635035f7a", size = 510489, upload-time = "2026-03-03T07:47:49.167Z" },
]
[[package]]
name = "pyyaml"
version = "6.0.3"
source = { registry = "https://pypi.org/simple" }
sdist = { url = "https://files.pythonhosted.org/packages/05/8e/961c0007c59b8dd7729d542c61a4d537767a59645b82a0b521206e1e25c2/pyyaml-6.0.3.tar.gz", hash = "sha256:d76623373421df22fb4cf8817020cbb7ef15c725b9d5e45f17e189bfc384190f", size = 130960, upload-time = "2025-09-25T21:33:16.546Z" }
wheels = [
{ url = "https://files.pythonhosted.org/packages/9d/8c/f4bd7f6465179953d3ac9bc44ac1a8a3e6122cf8ada906b4f96c60172d43/pyyaml-6.0.3-cp314-cp314-macosx_10_13_x86_64.whl", hash = "sha256:8d1fab6bb153a416f9aeb4b8763bc0f22a5586065f86f7664fc23339fc1c1fac", size = 181814, upload-time = "2025-09-25T21:32:35.712Z" },
{ url = "https://files.pythonhosted.org/packages/bd/9c/4d95bb87eb2063d20db7b60faa3840c1b18025517ae857371c4dd55a6b3a/pyyaml-6.0.3-cp314-cp314-macosx_11_0_arm64.whl", hash = "sha256:34d5fcd24b8445fadc33f9cf348c1047101756fd760b4dacb5c3e99755703310", size = 173809, upload-time = "2025-09-25T21:32:36.789Z" },
{ url = "https://files.pythonhosted.org/packages/92/b5/47e807c2623074914e29dabd16cbbdd4bf5e9b2db9f8090fa64411fc5382/pyyaml-6.0.3-cp314-cp314-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:501a031947e3a9025ed4405a168e6ef5ae3126c59f90ce0cd6f2bfc477be31b7", size = 766454, upload-time = "2025-09-25T21:32:37.966Z" },
{ url = "https://files.pythonhosted.org/packages/02/9e/e5e9b168be58564121efb3de6859c452fccde0ab093d8438905899a3a483/pyyaml-6.0.3-cp314-cp314-manylinux2014_s390x.manylinux_2_17_s390x.manylinux_2_28_s390x.whl", hash = "sha256:b3bc83488de33889877a0f2543ade9f70c67d66d9ebb4ac959502e12de895788", size = 836355, upload-time = "2025-09-25T21:32:39.178Z" },
{ url = "https://files.pythonhosted.org/packages/88/f9/16491d7ed2a919954993e48aa941b200f38040928474c9e85ea9e64222c3/pyyaml-6.0.3-cp314-cp314-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:c458b6d084f9b935061bc36216e8a69a7e293a2f1e68bf956dcd9e6cbcd143f5", size = 794175, upload-time = "2025-09-25T21:32:40.865Z" },
{ url = "https://files.pythonhosted.org/packages/dd/3f/5989debef34dc6397317802b527dbbafb2b4760878a53d4166579111411e/pyyaml-6.0.3-cp314-cp314-musllinux_1_2_aarch64.whl", hash = "sha256:7c6610def4f163542a622a73fb39f534f8c101d690126992300bf3207eab9764", size = 755228, upload-time = "2025-09-25T21:32:42.084Z" },
{ url = "https://files.pythonhosted.org/packages/d7/ce/af88a49043cd2e265be63d083fc75b27b6ed062f5f9fd6cdc223ad62f03e/pyyaml-6.0.3-cp314-cp314-musllinux_1_2_x86_64.whl", hash = "sha256:5190d403f121660ce8d1d2c1bb2ef1bd05b5f68533fc5c2ea899bd15f4399b35", size = 789194, upload-time = "2025-09-25T21:32:43.362Z" },
{ url = "https://files.pythonhosted.org/packages/23/20/bb6982b26a40bb43951265ba29d4c246ef0ff59c9fdcdf0ed04e0687de4d/pyyaml-6.0.3-cp314-cp314-win_amd64.whl", hash = "sha256:4a2e8cebe2ff6ab7d1050ecd59c25d4c8bd7e6f400f5f82b96557ac0abafd0ac", size = 156429, upload-time = "2025-09-25T21:32:57.844Z" },
{ url = "https://files.pythonhosted.org/packages/f4/f4/a4541072bb9422c8a883ab55255f918fa378ecf083f5b85e87fc2b4eda1b/pyyaml-6.0.3-cp314-cp314-win_arm64.whl", hash = "sha256:93dda82c9c22deb0a405ea4dc5f2d0cda384168e466364dec6255b293923b2f3", size = 143912, upload-time = "2025-09-25T21:32:59.247Z" },
{ url = "https://files.pythonhosted.org/packages/7c/f9/07dd09ae774e4616edf6cda684ee78f97777bdd15847253637a6f052a62f/pyyaml-6.0.3-cp314-cp314t-macosx_10_13_x86_64.whl", hash = "sha256:02893d100e99e03eda1c8fd5c441d8c60103fd175728e23e431db1b589cf5ab3", size = 189108, upload-time = "2025-09-25T21:32:44.377Z" },
{ url = "https://files.pythonhosted.org/packages/4e/78/8d08c9fb7ce09ad8c38ad533c1191cf27f7ae1effe5bb9400a46d9437fcf/pyyaml-6.0.3-cp314-cp314t-macosx_11_0_arm64.whl", hash = "sha256:c1ff362665ae507275af2853520967820d9124984e0f7466736aea23d8611fba", size = 183641, upload-time = "2025-09-25T21:32:45.407Z" },
{ url = "https://files.pythonhosted.org/packages/7b/5b/3babb19104a46945cf816d047db2788bcaf8c94527a805610b0289a01c6b/pyyaml-6.0.3-cp314-cp314t-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:6adc77889b628398debc7b65c073bcb99c4a0237b248cacaf3fe8a557563ef6c", size = 831901, upload-time = "2025-09-25T21:32:48.83Z" },
{ url = "https://files.pythonhosted.org/packages/8b/cc/dff0684d8dc44da4d22a13f35f073d558c268780ce3c6ba1b87055bb0b87/pyyaml-6.0.3-cp314-cp314t-manylinux2014_s390x.manylinux_2_17_s390x.manylinux_2_28_s390x.whl", hash = "sha256:a80cb027f6b349846a3bf6d73b5e95e782175e52f22108cfa17876aaeff93702", size = 861132, upload-time = "2025-09-25T21:32:50.149Z" },
{ url = "https://files.pythonhosted.org/packages/b1/5e/f77dc6b9036943e285ba76b49e118d9ea929885becb0a29ba8a7c75e29fe/pyyaml-6.0.3-cp314-cp314t-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:00c4bdeba853cc34e7dd471f16b4114f4162dc03e6b7afcc2128711f0eca823c", size = 839261, upload-time = "2025-09-25T21:32:51.808Z" },
{ url = "https://files.pythonhosted.org/packages/ce/88/a9db1376aa2a228197c58b37302f284b5617f56a5d959fd1763fb1675ce6/pyyaml-6.0.3-cp314-cp314t-musllinux_1_2_aarch64.whl", hash = "sha256:66e1674c3ef6f541c35191caae2d429b967b99e02040f5ba928632d9a7f0f065", size = 805272, upload-time = "2025-09-25T21:32:52.941Z" },
{ url = "https://files.pythonhosted.org/packages/da/92/1446574745d74df0c92e6aa4a7b0b3130706a4142b2d1a5869f2eaa423c6/pyyaml-6.0.3-cp314-cp314t-musllinux_1_2_x86_64.whl", hash = "sha256:16249ee61e95f858e83976573de0f5b2893b3677ba71c9dd36b9cf8be9ac6d65", size = 829923, upload-time = "2025-09-25T21:32:54.537Z" },
{ url = "https://files.pythonhosted.org/packages/f0/7a/1c7270340330e575b92f397352af856a8c06f230aa3e76f86b39d01b416a/pyyaml-6.0.3-cp314-cp314t-win_amd64.whl", hash = "sha256:4ad1906908f2f5ae4e5a8ddfce73c320c2a1429ec52eafd27138b7f1cbe341c9", size = 174062, upload-time = "2025-09-25T21:32:55.767Z" },
{ url = "https://files.pythonhosted.org/packages/f1/12/de94a39c2ef588c7e6455cfbe7343d3b2dc9d6b6b2f40c4c6565744c873d/pyyaml-6.0.3-cp314-cp314t-win_arm64.whl", hash = "sha256:ebc55a14a21cb14062aa4162f906cd962b28e2e9ea38f9b4391244cd8de4ae0b", size = 149341, upload-time = "2025-09-25T21:32:56.828Z" },
]
[[package]]
name = "regipy"
version = "6.2.1"
source = { registry = "https://pypi.org/simple" }
dependencies = [
{ name = "construct" },
{ name = "inflection" },
{ name = "pytz" },
]
sdist = { url = "https://files.pythonhosted.org/packages/ba/80/dd0a588b1762c9e1016f96ae59e3c984269cbcc94ca7b63a3d097bb96416/regipy-6.2.1.tar.gz", hash = "sha256:4e09623cdeb23ba4ad9bd73a0f107c9c60aab2fe9a5dea0ba48c71af1e070dfd", size = 101711, upload-time = "2026-01-22T15:26:06.905Z" }
wheels = [
{ url = "https://files.pythonhosted.org/packages/65/eb/db13ab9b8d54e04f42b6619acca417ee37b07eb141a54884d13d20d7459e/regipy-6.2.1-py3-none-any.whl", hash = "sha256:b03110e5c4e12385e1ba53c032ccd120c6dcde1b71afb8c3b7aa4717a5a24e43", size = 134861, upload-time = "2026-01-22T15:26:05.653Z" },
]
[[package]]
name = "socksio"
version = "1.0.0"
source = { registry = "https://pypi.org/simple" }
sdist = { url = "https://files.pythonhosted.org/packages/f8/5c/48a7d9495be3d1c651198fd99dbb6ce190e2274d0f28b9051307bdec6b85/socksio-1.0.0.tar.gz", hash = "sha256:f88beb3da5b5c38b9890469de67d0cb0f9d494b78b106ca1845f96c10b91c4ac", size = 19055, upload-time = "2020-04-17T15:50:34.664Z" }
wheels = [
{ url = "https://files.pythonhosted.org/packages/37/c3/6eeb6034408dac0fa653d126c9204ade96b819c936e136c5e8a6897eee9c/socksio-1.0.0-py3-none-any.whl", hash = "sha256:95dc1f15f9b34e8d7b16f06d74b8ccf48f609af32ab33c608d08761c5dcbb1f3", size = 12763, upload-time = "2020-04-17T15:50:31.878Z" },
]