Initial commit: code, paper, small artifacts

2026-05-07 20:47:30 +08:00
commit fae2db8cff
322 changed files with 33159 additions and 0 deletions

Unified_CFM/README.md Normal file

@@ -0,0 +1,133 @@
# Unified_CFM
A single multi-scale OT-CFM over one token sequence per flow:
```text
[FLOW_TOKEN, PACKET_1, ..., PACKET_T]
```
This is **not** a Flow-CFM + Packet-CFM ensemble. Flow-level and packet-level
signals interact inside one Transformer velocity field, and a Phase 2
masked-prediction consistency loss explicitly trains the cross-modal
dependency.
This is the **current SOTA model** in the repo (within-dataset SOTA on
ISCXTor2016 / CICIDS2017 / CICDDoS2019; near-SOTA cross-dataset).
## Model
`UnifiedTokenCFM` uses fixed tokenization to avoid latent-collapse shortcuts:
```text
flow token: [type=-1, normalized 20-d canonical flow features, zero pad]
packet token: [type=+1, normalized 9-d packet features, zero pad]
```
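The token layout above can be sketched in a few lines of NumPy. This is an illustration only: the names `make_token`, `build_sequence`, and `TOKEN_DIM` are hypothetical, and the token width (one type slot plus the wider of the two feature sets) is an assumption, not something this README states.

```python
import numpy as np

FLOW_DIM = 20      # canonical flow features (from this README)
PACKET_DIM = 9     # per-packet features (from this README)
# Assumption: token width = 1 type slot + the wider of the two feature sets.
TOKEN_DIM = 1 + max(FLOW_DIM, PACKET_DIM)


def make_token(type_id: float, features: np.ndarray) -> np.ndarray:
    """Lay out one token as [type, features..., zero pad]."""
    token = np.zeros(TOKEN_DIM, dtype=np.float32)
    token[0] = type_id
    token[1:1 + features.shape[0]] = features
    return token


def build_sequence(flow_feats: np.ndarray, packet_feats: np.ndarray) -> np.ndarray:
    """Stack [FLOW_TOKEN, PACKET_1, ..., PACKET_T] into a (T + 1, TOKEN_DIM) array."""
    rows = [make_token(-1.0, flow_feats)]
    rows += [make_token(+1.0, p) for p in packet_feats]
    return np.stack(rows)


rng = np.random.default_rng(0)
seq = build_sequence(rng.normal(size=FLOW_DIM), rng.normal(size=(5, PACKET_DIM)))
print(seq.shape)  # (6, 21)
```

Because the type slot and the padding are fixed rather than learned, the flow token and the packet tokens share one input space without a learned encoder in between.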
Velocity field: 4-layer AdaLN-Zero Transformer (`d_model=128, n_heads=4`),
sinusoidal time embedding (`time_dim=64`). Total ≈ 1.23M parameters.
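For readers unfamiliar with sinusoidal time embeddings, here is a minimal sketch with `time_dim=64`. The frequency schedule (`max_period=10_000`) and the cos-then-sin ordering follow a common convention and are assumptions; the repo's implementation may differ.

```python
import numpy as np


def sinusoidal_time_embedding(t: np.ndarray, time_dim: int = 64,
                              max_period: float = 10_000.0) -> np.ndarray:
    """Map times t of shape (B,) to embeddings of shape (B, time_dim)."""
    half = time_dim // 2
    # Geometric frequency ladder from 1 down to roughly 1 / max_period.
    freqs = np.exp(-np.log(max_period) * np.arange(half) / half)
    angles = t[:, None] * freqs[None, :]
    return np.concatenate([np.cos(angles), np.sin(angles)], axis=-1)


emb = sinusoidal_time_embedding(np.array([0.0, 0.5, 1.0]))
print(emb.shape)  # (3, 64)
```

In an AdaLN-Zero block, an embedding like this typically passes through a small MLP that produces the per-layer scale, shift, and gate values.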
Loss with Phase 2 consistency:
```text
L = L_main + λ_flow · L_mask_flow + λ_packet · L_mask_packet

L_main:        standard OT-CFM velocity regression with σ-band noise +
               Sinkhorn OT coupling.
L_mask_flow:   zero out the flow token's input at x_t; predict v[flow]
               from packet context only.
L_mask_packet: zero out a random 50% of real packet tokens at x_t;
               predict their velocities from flow + remaining packets.
```
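The three terms can be sketched with NumPy and a toy velocity field. Everything here is illustrative: `toy_velocity` is a hypothetical stand-in for the Transformer, σ is modeled as plain Gaussian jitter around the straight path, packets are masked with an independent coin flip per token, and padded packets are excluded from `L_main`. All of these are assumptions. OT pairing is omitted, so `x0` and `x1` are taken as already paired.

```python
import numpy as np

rng = np.random.default_rng(0)


def toy_velocity(x_t: np.ndarray, t: np.ndarray) -> np.ndarray:
    """Hypothetical stand-in for the Transformer velocity field.

    Every token sees itself plus the sequence mean, so a zeroed token can
    still be predicted from its context.
    """
    context = x_t.mean(axis=1, keepdims=True)
    return 0.5 * x_t + 0.5 * context - t[:, None, None]


def token_sq_err(v: np.ndarray, target: np.ndarray) -> np.ndarray:
    """Per-token squared error, shape (B, T + 1)."""
    return ((v - target) ** 2).mean(axis=-1)


def phase2_loss(x0, x1, real_mask, lam_flow=0.3, lam_packet=0.3,
                sigma=0.6, packet_mask_ratio=0.5):
    """x0, x1: (B, T + 1, D) paired noise/data; real_mask: (B, T) bool."""
    batch = x0.shape[0]
    t = rng.uniform(size=batch)
    tt = t[:, None, None]
    # Assumption: sigma adds Gaussian jitter around the straight path.
    x_t = (1.0 - tt) * x0 + tt * x1 + sigma * rng.normal(size=x0.shape)
    target = x1 - x0  # constant velocity of the straight path

    # L_main: regress the velocity on the flow token and all real packets.
    valid = np.concatenate([np.ones((batch, 1), dtype=bool), real_mask], axis=1)
    l_main = token_sq_err(toy_velocity(x_t, t), target)[valid].mean()

    # L_mask_flow: hide the flow token, score only its predicted velocity.
    x_no_flow = x_t.copy()
    x_no_flow[:, 0, :] = 0.0
    l_flow = token_sq_err(toy_velocity(x_no_flow, t), target)[:, 0].mean()

    # L_mask_packet: hide a random subset of real packets, score only those.
    drop = real_mask & (rng.uniform(size=real_mask.shape) < packet_mask_ratio)
    x_no_pkt = x_t.copy()
    x_no_pkt[:, 1:][drop] = 0.0
    pkt_err = token_sq_err(toy_velocity(x_no_pkt, t), target)[:, 1:]
    l_packet = pkt_err[drop].mean() if drop.any() else 0.0

    total = l_main + lam_flow * l_flow + lam_packet * l_packet
    return total, {"main": l_main, "mask_flow": l_flow, "mask_packet": l_packet}


B, T, D = 4, 6, 21
x0 = rng.normal(size=(B, T + 1, D))   # base noise
x1 = rng.normal(size=(B, T + 1, D))   # placeholder "data" tokens
real_mask = np.ones((B, T), dtype=bool)
real_mask[:, -2:] = False             # last two packets are padding
total, parts = phase2_loss(x0, x1, real_mask)
print(total, parts)
```

The masked terms reuse the same regression target as `L_main`; only the input changes, which is what forces the field to recover one modality's velocity from the other.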
Best hyperparameters from the σ × λ sweeps:
```text
lambda_flow = lambda_packet = 0.3
packet_mask_ratio = 0.5
sigma = 0.6 # cross-dataset best; σ=0.1 is marginally better on some within-dataset runs
use_ot = True
```
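With `use_ot = True`, noise and data samples are paired through a Sinkhorn coupling before the velocity regression. The sketch below shows one way to compute an entropic minibatch plan and sample pairs from it. It is a generic Sinkhorn loop, not the repo's code; `sinkhorn_plan`, `sample_pairs`, `reg`, and `n_iters` are hypothetical names and values.

```python
import numpy as np


def sinkhorn_plan(x0: np.ndarray, x1: np.ndarray, reg: float = 0.1,
                  n_iters: int = 500) -> np.ndarray:
    """Entropic OT plan between two equal-weight batches, shape (B0, B1)."""
    a = np.full(x0.shape[0], 1.0 / x0.shape[0])
    b = np.full(x1.shape[0], 1.0 / x1.shape[0])
    diff = x0.reshape(len(x0), 1, -1) - x1.reshape(1, len(x1), -1)
    cost = (diff ** 2).sum(axis=-1)
    cost = cost / max(cost.max(), 1e-12)   # normalise so `reg` is scale-free
    kernel = np.exp(-cost / reg)
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (kernel.T @ u)             # match column marginals
        u = a / (kernel @ v)               # match row marginals
    return u[:, None] * kernel * v[None, :]


def sample_pairs(plan: np.ndarray, rng: np.random.Generator):
    """Draw one (noise index, data index) pair per batch row from the plan."""
    flat = plan.ravel() / plan.sum()
    idx = rng.choice(plan.size, size=plan.shape[0], p=flat)
    return np.unravel_index(idx, plan.shape)


rng = np.random.default_rng(0)
noise = rng.normal(size=(8, 3, 4))         # (B, T + 1, D) toy shapes
data = rng.normal(size=(8, 3, 4)) + 1.0
plan = sinkhorn_plan(noise, data)
rows, cols = sample_pairs(plan, rng)
```

Pairing nearby noise and data samples shortens the straight-line paths the velocity field has to fit, which is the usual motivation for OT coupling in flow matching.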
## Scores
The model exposes four classes of scores at inference:
```text
# primary
terminal_norm
# decomposed (analysis only)
terminal_flow terminal_packet
arc_length kinetic_energy kinetic_flow kinetic_packet
velocity_total velocity_flow velocity_packet
# Phase 1 diagnostics
curvature_total curvature_flow curvature_packet # ∫ ||dv/dt||² dt
kappa2_speed2norm_packet_{mean,median,trimmed10_mean} # packet curvature / speed²
jacobian_total jacobian_flow jacobian_packet # Hutchinson VJP estimate of ||∂v/∂x||_F²
velocity_*_t{01..10} # 18 time-profile scores
# Phase 2 cross-modal consistency
flow_consistency packet_consistency consistency_total
```
`terminal_norm` is the paper's primary score. The decomposed and diagnostic
scores are for **per-attack-family analysis** only; they are NOT competing
SOTA claims. Multi-seed std on `terminal_norm` is ≤ 0.005 across all our
runs.
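To make the trajectory-based scores concrete, the sketch below Euler-integrates a velocity field for one flow and accumulates a few of the statistics named above. The integration direction (from data at t = 1 back toward the base distribution at t = 0), the step count, and the flow/packet split by token row are assumptions; `trajectory_scores` and `velocity_fn` are hypothetical names.

```python
import numpy as np


def trajectory_scores(velocity_fn, x: np.ndarray, n_steps: int = 10) -> dict:
    """Integrate dx/dt = v(x, t) backwards in time and collect path statistics.

    x has shape (T + 1, D); row 0 is the flow token, the rest are packets.
    """
    dt = 1.0 / n_steps
    arc_length = kinetic_energy = curvature = 0.0
    prev_v = None
    for k in range(n_steps):
        t = 1.0 - k * dt
        v = velocity_fn(x, t)
        speed = float(np.linalg.norm(v))
        arc_length += speed * dt                   # ∫ ||v|| dt
        kinetic_energy += speed ** 2 * dt          # ∫ ||v||² dt
        if prev_v is not None:
            # Finite-difference estimate of ∫ ||dv/dt||² dt.
            curvature += float(np.sum(((v - prev_v) / dt) ** 2)) * dt
        prev_v = v
        x = x - dt * v                             # one backward Euler step
    return {
        "terminal_norm": float(np.linalg.norm(x)),
        "terminal_flow": float(np.linalg.norm(x[0])),
        "terminal_packet": float(np.linalg.norm(x[1:])),
        "arc_length": arc_length,
        "kinetic_energy": kinetic_energy,
        "curvature_total": curvature,
    }


rng = np.random.default_rng(0)
x_data = rng.normal(size=(6, 21))
# Toy field v(x, t) = x: each backward step shrinks x by a factor (1 - dt).
scores = trajectory_scores(lambda x, t: x, x_data)
print(scores["terminal_norm"])
```

With the toy field the result has a closed form (`terminal_norm` equals the input norm times `0.9 ** 10`), which makes the loop easy to check before swapping in a trained model.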
The Phase 2 consistency scores are **discriminative only when the model is
trained with the consistency loss**. On a baseline model, `flow_consistency`
is close to chance (AUROC 0.57 on CICIDS2017); after Phase 2 training it
rises to 0.88. On SSH-Patator, where standard density scores struggle
(`terminal_norm` 0.64), Phase 2 `flow_consistency` reaches 0.94.
## Train
```bash
# baseline (no consistency loss)
uv run python Unified_CFM/train.py --config Unified_CFM/configs/cicids2017_baseline.yaml
# Phase 2 with consistency loss (λ=0.1, σ=0.1)
uv run python Unified_CFM/train.py --config Unified_CFM/configs/cicids2017_consistency.yaml
# σ × λ sweeps and multi-seed orchestrators live in
# artifacts/verify_2026_04_24/run_*.sh
```
The intended setup is to use the workspace-canonical 20-d packet-derived
flow feature file:
```yaml
flow_features_path: datasets/cicids2017/processed/flow_features.parquet
flow_features_align: auto
```
`flow_features.parquet` is row-aligned with the Packet_CFM artifacts via
`flow_id`. With `flow_features_align: auto`, the loader uses direct
row/`flow_id` alignment when possible; scan alignment remains only for
legacy full CSV-derived caches.
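The direct row/`flow_id` alignment can be pictured as follows. This is a sketch, not the loader's code: `align_by_flow_id` is a hypothetical helper, and it assumes `flow_id` values are unique within the feature file.

```python
import numpy as np


def align_by_flow_id(packet_ids: np.ndarray, feature_ids: np.ndarray,
                     features: np.ndarray) -> np.ndarray:
    """Return feature rows reordered to match packet_ids."""
    # Fast path: the two files are already row-aligned.
    if np.array_equal(packet_ids, feature_ids):
        return features
    # Otherwise look each packet flow_id up in the sorted feature ids.
    order = np.argsort(feature_ids, kind="stable")
    sorted_ids = feature_ids[order]
    pos = np.clip(np.searchsorted(sorted_ids, packet_ids), 0, len(order) - 1)
    if not np.array_equal(sorted_ids[pos], packet_ids):
        raise KeyError("some flow_id values are missing from the feature file")
    return features[order[pos]]


feature_ids = np.array([10, 20, 30, 40])
features = np.arange(8, dtype=np.float32).reshape(4, 2)
aligned = align_by_flow_id(np.array([30, 10, 40]), feature_ids, features)
print(aligned)  # rows for flow_id 30, 10, 40 in that order
```

Failing loudly on a missing `flow_id` is a deliberate choice in this sketch: a silent mismatch would pair packets with another flow's features.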
For large datasets where a monolithic `packets.npz` would exceed memory,
the loader supports the sharded backend:
```yaml
source_store: datasets/cicddos2019/processed/full_store
val_cap: 20000
attack_cap: 20000
```
If `flow_features_path` is empty, the loader derives compact 16-d flow-level
statistics from the packet sequence. That fallback is for debugging only;
new runs should use the canonical 20-d file generated by
`scripts/generate_flow_features.py`.
## Evaluation
`artifacts/verify_2026_04_24/eval_phase1_unified.py` runs the Phase 1 + Phase 2
score battery on a trained checkpoint, with per-attack-class AUROC.
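Per-attack-class AUROC compares each attack family's scores against the benign scores. The sketch below uses the rank-based (Mann-Whitney) definition and counts ties as one half. It assumes higher scores mean "more anomalous" and that benign rows carry the label `"BENIGN"`; the function names are hypothetical and the eval script may compute this differently.

```python
import numpy as np


def auroc(benign: np.ndarray, attack: np.ndarray) -> float:
    """P(attack score > benign score), counting ties as one half."""
    neg = np.sort(benign)
    below = np.searchsorted(neg, attack, side="left")       # benign strictly lower
    not_above = np.searchsorted(neg, attack, side="right")  # benign lower or tied
    return float((below + not_above).mean() / (2 * len(neg)))


def per_class_auroc(scores: np.ndarray, labels: np.ndarray,
                    benign_label: str = "BENIGN") -> dict:
    """AUROC of every attack class against the shared benign set."""
    benign = scores[labels == benign_label]
    return {
        str(cls): auroc(benign, scores[labels == cls])
        for cls in np.unique(labels) if cls != benign_label
    }


scores = np.array([0.1, 0.2, 0.3, 0.25, 0.4, 0.05, 0.5])
labels = np.array(["BENIGN"] * 3 + ["DoS"] * 2 + ["SSH-Patator"] * 2)
result = per_class_auroc(scores, labels)
print(result)
```

In this toy example `DoS` scores 5/6 and `SSH-Patator` scores 0.5, since one of its two samples falls below every benign score and the other above.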
`artifacts/verify_2026_04_24/eval_phase2_cross_cicddos2019.py` runs
cross-dataset CICIDS2017→CICDDoS2019 evaluation under the standard
10k benign + 10k stratified attack protocol.
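One plausible reading of "stratified" is that the attack cap is split across attack classes in proportion to their frequency. The sketch below implements that reading with largest-remainder rounding; `stratified_indices` is a hypothetical helper and the actual script may allocate samples differently.

```python
import numpy as np


def stratified_indices(labels, cap: int, rng: np.random.Generator) -> np.ndarray:
    """Pick at most `cap` indices, keeping class proportions (assumed protocol)."""
    labels = np.asarray(labels)
    if cap >= len(labels):
        return np.arange(len(labels))
    classes, counts = np.unique(labels, return_counts=True)
    exact = cap * counts / len(labels)
    quota = np.floor(exact).astype(int)
    leftover = cap - quota.sum()
    # Hand the leftover slots to the classes with the largest fractional share.
    for i in np.argsort(exact - quota)[::-1][:leftover]:
        quota[i] += 1
    picks = [
        rng.choice(np.flatnonzero(labels == c), size=q, replace=False)
        for c, q in zip(classes, quota)
    ]
    return np.sort(np.concatenate(picks))


rng = np.random.default_rng(0)
attack_labels = np.array(["A"] * 600 + ["B"] * 300 + ["C"] * 100)  # placeholder classes
idx = stratified_indices(attack_labels, cap=100, rng=rng)
kept, kept_counts = np.unique(attack_labels[idx], return_counts=True)
print(dict(zip(kept, kept_counts)))
```

Benign flows need no stratification, so the same cap can be applied to them with a plain random draw.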