Initial commit: ER-TP-DGP research prototype

Event-Reified Temporal Provenance Dual-Granularity Prompting for LLM-based APT detection on DARPA provenance datasets. Includes phase 0-14 method spec, IR/graph/metapath/trimming/prompt modules, scripts for THEIA candidate universe, landmark CSG construction, hybrid prompting, and LLM inference. Excludes data/, reports/, and local LLM config from version control.
2026-05-15 16:53:57 +08:00
commit b86ae87b75
88 changed files with 18570 additions and 0 deletions
--- a/README.md
+++ b/README.md
@@ -0,0 +1,95 @@
+# ER-TP-DGP
+
+Event-Reified Temporal Provenance Dual-Granularity Prompting for LLM-based APT
+detection.
+
+This repository is a research prototype for evaluating graph-enhanced LLM
+detection on DARPA provenance datasets. The main method is not raw-log prompting,
+not a GNN classifier, and not a rule-based detector. The main pipeline is:
+
+```text
+DARPA provenance records
+  -> schema-aware provenance IR
+  -> event-reified temporal heterogeneous graph
+  -> time-respecting APT semantic evidence paths
+  -> dual-granularity graph prompt
+  -> LLM classification with evidence path IDs
+```
+
+The current implementation is data-independent scaffolding. It intentionally
+does not assume that every DARPA dataset contains command lines, registry
+objects, hashes, domains, services, tasks, modules, or complete ground truth.
+
+## Core Formula
+
+```text
+Prompt(q) = Fine(q) + Local(q)
+          + sum_P [Summary_P(q) + Stats_P(q) + Evidence_P(q)]
+```
+
+`q` is a process or event target. `P` is an APT semantic metapath such as
+execution chain, file staging, network/C2, exfiltration-like, persistence, or
+lateral movement.
+
+## Current Status
+
+Implemented without real data:
+
+- Phase 0 method specification.
+- Phase 1 dataset schema audit model and report generation.
+- Unified provenance IR dataclasses.
+- IR validation and JSONL serialization.
+- Dataset adapter interface and schema mismatch reporting.
+- Event-view and causal-view graph construction.
+- Time-window, host-filtered, target-context, and ID-based graph views.
+- Time-respecting APT metapath path extraction for core path families.
+- Temporal, structural, semantic, and security-aware trimming scaffold.
+- Dual-granularity prompt construction with evidence IDs.
+- Label-only ground-truth mapping interfaces.
+- LLM strategy, baseline, and ablation method registry.
+- Imbalanced APT detection metrics including AUPRC, AUROC, Macro-F1,
+  Precision@K, Recall@K, FPR at fixed recall, detection delay, token/cost
+  accounting, and evidence-path hit rate.
+- Time, campaign, and host split helpers with leakage checks for raw event IDs,
+  process IDs, IOC-like file paths, duplicated prompts, summaries, campaigns,
+  and same-host time windows.
+- OpenAI-compatible LLM inference client for remote API and local deployments,
+  with first-token `MALICIOUS`/`BENIGN` parsing and raw response retention.
+- THEIA CDM18 action semantics with auditable canonical actions, causal
+  directions, metapath hints, and MEMORY entity support.
+- Common-behavior context annotations such as browser-like process ratio and
+  local IPC flow ratio. These are neutral prompt features, not hard filters or
+  rule-based benign decisions.
+- Synthetic unit tests for interface and invariant checks.
+
+## LLM Inference
+
+Remote OpenAI-compatible API:
+
+```bash
+export OPENAI_COMPAT_API_KEY='...'
+cp configs/llm.example.yaml configs/llm.yaml
+# edit configs/llm.yaml: provider=api, base_url, model, api_key_env
+
+.venv/bin/python scripts/run_llm_inference.py \
+  --config configs/llm.yaml \
+  --prompt-file reports/theia_e3_idea/prompt.txt \
+  --output-jsonl reports/llm_predictions.jsonl
+```
+
+Local OpenAI-compatible deployment:
+
+```bash
+cp configs/llm.example.yaml configs/llm.yaml
+# edit configs/llm.yaml: provider=local, base_url, model
+
+.venv/bin/python scripts/run_llm_inference.py \
+  --config configs/llm.yaml \
+  --prompt-file reports/theia_e3_idea/prompt.txt \
+  --output-jsonl reports/local_llm_predictions.jsonl
+```
+
+The LLM prompt must not include ground-truth reports, IOC narratives, or labels.
+Ground truth is used only for label mapping and evaluation.
+
+Synthetic examples are debugging-only fixtures and are not experimental results.
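
Note on the README above (not part of the diff): the Core Formula and the first-token `MALICIOUS`/`BENIGN` parsing can be illustrated with a minimal sketch. Everything named here (`MetapathBlock`, `build_prompt`, `parse_first_token`, the `##` section headers, and the example strings) is hypothetical and chosen for illustration only; the repository's actual prompt and inference modules may be structured differently.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass
class MetapathBlock:
    """Coarse-grained content for one APT semantic metapath P (hypothetical type)."""
    name: str                          # e.g. "execution_chain"
    summary: str                       # Summary_P(q)
    stats: dict[str, float]            # Stats_P(q)
    evidence: list[tuple[str, str]]    # Evidence_P(q): (evidence path ID, rendered path)


def build_prompt(fine: str, local: str, blocks: list[MetapathBlock]) -> str:
    """Concatenate Fine(q) + Local(q) + sum_P [Summary_P + Stats_P + Evidence_P].

    The section headers are an assumption of this sketch, not the repo's format.
    """
    parts = ["## Fine-grained target", fine, "## Local context", local]
    for block in blocks:
        parts.append(f"## Metapath: {block.name}")
        parts.append(f"Summary: {block.summary}")
        stats = ", ".join(f"{k}={v}" for k, v in sorted(block.stats.items()))
        parts.append(f"Stats: {stats}")
        # Each evidence path keeps its ID so the LLM can cite it in the answer.
        for path_id, text in block.evidence:
            parts.append(f"[{path_id}] {text}")
    return "\n".join(parts)


def parse_first_token(response: str) -> Optional[str]:
    """Return "MALICIOUS" or "BENIGN" if the first token is one of them, else None."""
    tokens = response.strip().split(maxsplit=1)
    if not tokens:
        return None
    # Tolerate trailing punctuation or Markdown emphasis around the verdict.
    first = tokens[0].strip(".,:;!*`\"'").upper()
    return first if first in {"MALICIOUS", "BENIGN"} else None


if __name__ == "__main__":
    # Synthetic, debugging-only input; not an experimental result.
    demo = MetapathBlock(
        name="execution_chain",
        summary="shell spawns a downloader that writes to a temp directory",
        stats={"path_count": 2.0},
        evidence=[("P1", "bash -> wget -> /tmp/x")],
    )
    print(build_prompt("process q", "host h", [demo]))
    print(parse_first_token("MALICIOUS. Evidence: P1"))
```

Keeping the verdict parser strict (only the first token counts) means a response that merely mentions "malicious" later in free text is treated as unparsed rather than silently classified, which is why retaining the raw response alongside the parsed label is useful.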