Initial commit: ER-TP-DGP research prototype

Event-Reified Temporal Provenance Dual-Granularity Prompting for LLM-based APT detection on DARPA provenance datasets. Includes phase 0-14 method spec, IR/graph/metapath/trimming/prompt modules, scripts for THEIA candidate universe, landmark CSG construction, hybrid prompting, and LLM inference. Excludes data/, reports/, and local LLM config from version control.
2026-05-15 16:53:57 +08:00
commit b86ae87b75
88 changed files with 18570 additions and 0 deletions
--- a/docs/phase4_labels.md
+++ b/docs/phase4_labels.md
@@ -0,0 +1,36 @@
+# Phase 4 Ground Truth Mapping and Labels
+
+Ground truth is used only for label mapping and evaluation. It must not enter
+LLM prompts.
+
+## Label Levels
+
+- Event-level: direct matched attack events.
+- Process-level: processes involved in malicious event chains.
+- Subgraph-level: local evidence subgraphs containing key attack-chain events.
+
+## Ambiguous Cases
+
+Ambiguous targets should be assigned `unknown` or `ignore` rather than forced
+into `malicious` or `benign`:
+
+- attack window overlap without explicit evidence;
+- normal child behavior from a compromised process;
+- normal process later abused by an attacker;
+- missing fields that prevent reliable mapping.
+
+## Negative Sampling
+
+Negative sampling must avoid:
+
+- labeling arbitrary targets inside attack windows as benign;
+- train/test leakage through the same attack entity;
+- adjacent attack-chain events split across train and test;
+- using attack-report text as prompt content.
+
+## Checks
+
+- Label records are not prompt-allowed.
+- Each label records its source and confidence.
+- Only high-confidence labels are trainable.
+
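The checks and the entity-level leakage rule in `docs/phase4_labels.md` could be sketched roughly as follows. This is a minimal illustration, not code from the commit: `LabelRecord`, its field names, the numeric `confidence` scale, the `HIGH_CONFIDENCE` threshold, and the `attack_entity` grouping key are all assumptions introduced here. The doc itself only names the `unknown`/`ignore` values and the requirement for a source and a confidence.

```python
from dataclasses import dataclass

# Assumed label vocabulary; the doc names only `unknown` and `ignore` explicitly.
TRAINABLE_LABELS = {"malicious", "benign"}
AMBIGUOUS_LABELS = {"unknown", "ignore"}
HIGH_CONFIDENCE = 0.9  # hypothetical threshold for "high confidence"


@dataclass(frozen=True)
class LabelRecord:
    target_id: str
    level: str            # "event", "process", or "subgraph"
    label: str
    source: str           # where the label came from
    confidence: float     # assumed to be a score in [0, 1]
    attack_entity: str    # hypothetical key used to group records for splitting
    prompt_allowed: bool = False  # label records must never reach a prompt


def check_label(rec):
    """Return the list of violated checks (empty if the record passes)."""
    problems = []
    if rec.prompt_allowed:
        problems.append("label record is prompt-allowed")
    if not rec.source:
        problems.append("missing source")
    if not 0.0 <= rec.confidence <= 1.0:
        problems.append("confidence out of range")
    if rec.label not in TRAINABLE_LABELS | AMBIGUOUS_LABELS:
        problems.append("unexpected label value")
    return problems


def is_trainable(rec):
    """A record is trainable only if it passes all checks, is not ambiguous,
    and meets the (assumed) high-confidence threshold."""
    return (
        not check_label(rec)
        and rec.label in TRAINABLE_LABELS
        and rec.confidence >= HIGH_CONFIDENCE
    )


def split_by_attack_entity(records, test_entities):
    """Send every record of a given entity to the same side of the split,
    so no attack entity appears in both train and test."""
    train, test = [], []
    for rec in records:
        (test if rec.attack_entity in test_entities else train).append(rec)
    return train, test


records = [
    LabelRecord("e1", "event", "malicious", "ground_truth", 0.95, "attack_A"),
    LabelRecord("p1", "process", "unknown", "ground_truth", 0.40, "attack_A"),
    LabelRecord("e2", "event", "benign", "sampler", 0.99, "background_1"),
    LabelRecord("e3", "event", "benign", "", 0.99, "background_2"),
]
train, test = split_by_attack_entity(records, {"attack_A"})
```

Grouping by entity before splitting, rather than shuffling individual records, is what keeps adjacent attack-chain events on one side of the split. How entities and chains are actually keyed in this repository is not shown in the diff.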