Initial commit: ER-TP-DGP research prototype

Event-Reified Temporal Provenance Dual-Granularity Prompting for
LLM-based APT detection on DARPA provenance datasets.

Includes phase 0-14 method spec, IR/graph/metapath/trimming/prompt
modules, scripts for THEIA candidate universe, landmark CSG construction,
hybrid prompting, and LLM inference. Excludes data/, reports/, and
local LLM config from version control.
BattleTag
2026-05-15 16:53:57 +08:00
commit b86ae87b75
88 changed files with 18570 additions and 0 deletions

docs/phase4_labels.md Normal file

@@ -0,0 +1,36 @@
# Phase 4 Ground Truth Mapping and Labels
Ground truth is used only for label mapping and evaluation. It must not enter
LLM prompts.
## Label Levels
- Event-level: direct matched attack events.
- Process-level: processes involved in malicious event chains.
- Subgraph-level: local evidence subgraphs containing key attack-chain events.
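
The three label levels could be represented as follows. This is a minimal sketch, not the repository's actual schema: `LabelLevel`, `LabelRecord`, `target_id`, and `labels_by_level` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class LabelLevel(Enum):
    EVENT = "event"        # direct matched attack events
    PROCESS = "process"    # processes involved in malicious event chains
    SUBGRAPH = "subgraph"  # local evidence subgraphs with key attack-chain events


@dataclass(frozen=True)
class LabelRecord:
    target_id: str     # hypothetical ID of the event, process, or subgraph
    level: LabelLevel
    label: str         # e.g. "malicious", "benign", "unknown", "ignore"


def labels_by_level(records):
    """Group label records by granularity level."""
    grouped = {level: [] for level in LabelLevel}
    for record in records:
        grouped[record.level].append(record)
    return grouped
```

Keeping the level explicit on every record lets evaluation be reported per granularity without mixing event, process, and subgraph targets.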
## Ambiguous Cases
Ambiguous targets should be assigned `unknown` or `ignore` rather than forced
into `malicious` or `benign`. Ambiguous cases include:
- attack window overlap without explicit evidence;
- normal child behavior from a compromised process;
- normal process later abused by an attacker;
- missing fields that prevent reliable mapping.
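
One way to encode these rules is a small decision function. The sketch below is illustrative only: the `MappingEvidence` fields and the choice of `ignore` for missing fields versus `unknown` for the other cases are assumptions, since the text above permits either value.

```python
from dataclasses import dataclass


@dataclass
class MappingEvidence:
    # All fields are hypothetical; they mirror the ambiguous cases listed above.
    in_attack_window: bool
    explicit_match: bool
    parent_compromised: bool = False
    later_abused: bool = False
    missing_fields: bool = False


def assign_label(ev: MappingEvidence) -> str:
    """Return a label without forcing ambiguous targets to a hard class."""
    if ev.missing_fields:
        return "ignore"      # mapping is unreliable (assumed choice)
    if ev.later_abused:
        return "unknown"     # normal process later abused by an attacker
    if ev.explicit_match:
        return "malicious"   # direct ground-truth match
    if ev.in_attack_window or ev.parent_compromised:
        return "unknown"     # overlap or lineage without explicit evidence
    return "benign"
```

For example, `assign_label(MappingEvidence(in_attack_window=True, explicit_match=False))` returns `"unknown"` instead of defaulting to a hard class.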
## Negative Sampling
Negative sampling must avoid:
- arbitrary benign labels inside attack windows;
- train/test leakage through the same attack entity;
- adjacent attack-chain events split across train and test;
- using attack-report text as prompt content.
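
The first three constraints can be approximated by splitting on whole groups and filtering negatives by time. The sketch below assumes each sample is a dict with hypothetical `group_id` (attack entity or attack-chain ID), `timestamp`, and `label` keys. Dropping in-window benign candidates is one possible policy, not necessarily the one this project uses.

```python
import hashlib


def split_of(group_id: str, test_fraction: float = 0.2) -> str:
    """Deterministically map a whole group to 'train' or 'test'."""
    digest = hashlib.sha256(group_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # value in [0, 1)
    return "test" if bucket < test_fraction else "train"


def eligible_negative(sample: dict, attack_windows: list) -> bool:
    """Keep a benign candidate only if it lies outside every attack window."""
    t = sample["timestamp"]
    return not any(start <= t <= end for start, end in attack_windows)


def split_samples(samples, attack_windows, test_fraction=0.2):
    splits = {"train": [], "test": []}
    for s in samples:
        if s["label"] == "benign" and not eligible_negative(s, attack_windows):
            continue  # avoid arbitrary benign labels inside attack windows
        # Every sample sharing a group_id lands in the same split, so an
        # attack entity or chain is never divided across train and test.
        splits[split_of(s["group_id"], test_fraction)].append(s)
    return splits
```

Hashing the group ID keeps the assignment stable across runs, which makes leakage checks reproducible.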
## Checks
- Label records are not prompt-allowed.
- Each label has source and confidence.
- Trainable labels require high confidence.
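
These checks could be automated with a small validator. This is a sketch under stated assumptions: the record keys (`prompt_allowed`, `source`, `confidence`, `trainable`) and the `0.9` threshold are hypothetical, since the text does not define a numeric value for "high confidence".

```python
def check_label_record(record: dict, min_trainable_confidence: float = 0.9) -> list:
    """Return a list of problems found in one label record (empty if none)."""
    problems = []
    if record.get("prompt_allowed", False):
        problems.append("label record is marked prompt-allowed")
    if not record.get("source"):
        problems.append("missing source")
    confidence = record.get("confidence")
    if confidence is None:
        problems.append("missing confidence")
    elif record.get("trainable", False) and confidence < min_trainable_confidence:
        problems.append("trainable label below confidence threshold")
    return problems
```

Returning a list of problems rather than raising on the first one lets a single pass report every violation in a label file.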