Initial commit: code, paper, small artifacts

2026-05-07 20:47:30 +08:00
commit fae2db8cff
322 changed files with 33159 additions and 0 deletions
--- a/Unified_CFM/README.md
+++ b/Unified_CFM/README.md
@@ -0,0 +1,133 @@
+# Unified_CFM
+
+A single multi-scale OT-CFM over one token sequence per flow:
+
+```text
+[FLOW_TOKEN, PACKET_1, ..., PACKET_T]
+```
+
+This is **not** a Flow-CFM + Packet-CFM ensemble. Flow-level and packet-level
+signals interact inside one Transformer velocity field, and a Phase 2
+masked-prediction consistency loss explicitly trains the cross-modal
+dependency.
+
+This is the **current SOTA model** in the repo (within-dataset SOTA on
+ISCXTor2016 / CICIDS2017 / CICDDoS2019; near-SOTA cross-dataset).
+
+## Model
+
+`UnifiedTokenCFM` uses fixed tokenization to avoid latent-collapse shortcuts:
+
+```text
+flow token:   [type=-1, normalized 20-d canonical flow features, zero pad]
+packet token: [type=+1, normalized 9-d packet features,           zero pad]
+```
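+
+The layout above can be illustrated with a minimal NumPy sketch. The token
+width (`TOKEN_DIM = 21`, one type flag plus the widest feature vector) and
+the helper names `make_token` / `build_sequence` are assumptions made for
+illustration, not the repo's actual API:
+
+```python
+import numpy as np
+
+FLOW_DIM, PACKET_DIM = 20, 9      # feature widths stated above
+TOKEN_DIM = 1 + FLOW_DIM          # assumption: type flag + widest feature vector
+
+def make_token(type_flag, features):
+    """Return [type_flag, features..., zero pad] as a fixed-width vector."""
+    token = np.zeros(TOKEN_DIM, dtype=np.float32)
+    token[0] = type_flag
+    token[1:1 + len(features)] = features
+    return token
+
+def build_sequence(flow_features, packet_features):
+    """Stack one flow token followed by T packet tokens: shape (1 + T, TOKEN_DIM)."""
+    flow_token = make_token(-1.0, flow_features)
+    packet_tokens = [make_token(+1.0, p) for p in packet_features]
+    return np.stack([flow_token, *packet_tokens])
+
+rng = np.random.default_rng(0)
+seq = build_sequence(rng.normal(size=FLOW_DIM), rng.normal(size=(5, PACKET_DIM)))
+print(seq.shape)  # (6, 21)
+```
+
+Because the type flag and padding are fixed rather than learned, the two
+token kinds stay distinguishable regardless of what the model learns.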
+
+Velocity field: 4-layer AdaLN-Zero Transformer (`d_model=128, n_heads=4`),
+sinusoidal time embedding (`time_dim=64`). Total ≈ 1.23M parameters.
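+
+For reference, a sinusoidal time embedding of width 64 can be sketched as
+follows. This is the common sin/cos formulation; the `max_period` value and
+the sin-then-cos ordering are assumptions and may differ from the repo's
+implementation:
+
+```python
+import numpy as np
+
+def sinusoidal_time_embedding(t, time_dim=64, max_period=10000.0):
+    """Map times t of shape (B,) to embeddings of shape (B, time_dim)."""
+    half = time_dim // 2
+    # Geometrically spaced frequencies from 1 down to roughly 1 / max_period.
+    freqs = np.exp(-np.log(max_period) * np.arange(half) / half)
+    angles = np.asarray(t, dtype=np.float64)[:, None] * freqs[None, :]
+    return np.concatenate([np.sin(angles), np.cos(angles)], axis=1)
+
+emb = sinusoidal_time_embedding(np.array([0.0, 0.5, 1.0]))
+print(emb.shape)  # (3, 64)
+```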
+
+Loss with Phase 2 consistency:
+
+```text
+L = L_main + λ_flow · L_mask_flow + λ_packet · L_mask_packet
+
+L_main:        standard OT-CFM velocity regression with σ-band noise +
+               Sinkhorn OT coupling.
+L_mask_flow:   zero out the flow token's input at x_t; predict v[flow]
+               from packet context only.
+L_mask_packet: zero out a random 50% of real packet tokens at x_t;
+               predict their velocities from flow + remaining packets.
+```
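+
+A minimal NumPy sketch of how the three terms could be combined is shown
+below. Several simplifications are assumed: `velocity_fn` is a hypothetical
+stand-in for the Transformer, the noise is plain Gaussian noise scaled by
+`sigma`, the Sinkhorn OT coupling is omitted, every packet token is treated
+as real (no padding), and each term is a mean squared error:
+
+```python
+import numpy as np
+
+def cfm_batch(x0, x1, t, sigma, rng):
+    """Noisy linear interpolant x_t and its regression target v = x1 - x0."""
+    t = t[:, None, None]
+    x_t = (1.0 - t) * x0 + t * x1 + sigma * rng.normal(size=x0.shape)
+    return x_t, x1 - x0
+
+def unified_loss(velocity_fn, x_t, t, v_target, rng,
+                 lambda_flow=0.3, lambda_packet=0.3, packet_mask_ratio=0.5):
+    """L_main + lambda_flow * L_mask_flow + lambda_packet * L_mask_packet."""
+    l_main = np.mean((velocity_fn(x_t, t) - v_target) ** 2)
+
+    # Zero the flow token (index 0) and score only its predicted velocity.
+    x_flow_masked = x_t.copy()
+    x_flow_masked[:, 0] = 0.0
+    l_flow = np.mean((velocity_fn(x_flow_masked, t)[:, 0] - v_target[:, 0]) ** 2)
+
+    # Zero a random subset of packet tokens and score only those positions.
+    mask = rng.random(x_t.shape[:2]) < packet_mask_ratio
+    mask[:, 0] = False
+    x_pkt_masked = np.where(mask[..., None], 0.0, x_t)
+    err = (velocity_fn(x_pkt_masked, t) - v_target) ** 2
+    l_packet = err[mask].mean() if mask.any() else 0.0
+
+    return l_main + lambda_flow * l_flow + lambda_packet * l_packet
+
+rng = np.random.default_rng(0)
+x0 = rng.normal(size=(8, 6, 21))   # noise samples
+x1 = rng.normal(size=(8, 6, 21))   # stand-in for tokenized flows
+t = rng.random(8)
+x_t, v = cfm_batch(x0, x1, t, sigma=0.6, rng=rng)
+loss = unified_loss(lambda x, t: -x, x_t, t, v, rng)  # toy velocity field
+print(float(loss))
+```
+
+The masked terms reuse the same targets as `L_main`; only the input is
+altered, so the model must recover a token's velocity from the other
+modality.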
+
+Best hyperparameters from the σ × λ sweeps:
+
+```text
+lambda_flow = lambda_packet = 0.3
+packet_mask_ratio = 0.5
+sigma = 0.6   # cross-dataset best; σ=0.1 is marginally better for some within-dataset runs
+use_ot = True
+```
+
+## Scores
+
+The model exposes four groups of scores at inference:
+
+```text
+# primary
+terminal_norm
+
+# decomposed (analysis only)
+terminal_flow         terminal_packet
+arc_length            kinetic_energy   kinetic_flow   kinetic_packet
+velocity_total        velocity_flow    velocity_packet
+
+# Phase 1 diagnostics
+curvature_total       curvature_flow   curvature_packet      # ∫ ||dv/dt||² dt
+kappa2_speed2norm_packet_{mean,median,trimmed10_mean}        # packet curvature / speed²
+jacobian_total        jacobian_flow    jacobian_packet       # Hutchinson VJP estimate of ||∂v/∂x||_F²
+velocity_*_t{01..10}                                          # 18 time-profile scores
+
+# Phase 2 cross-modal consistency
+flow_consistency      packet_consistency      consistency_total
+```
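+
+The Hutchinson estimate behind the `jacobian_*` scores relies on the
+identity E[||εᵀJ||²] = ||J||_F² for probes ε with identity covariance. The
+sketch below checks it on a toy linear field `v(x) = A @ x`, whose VJP is
+simply `eps @ A`. In the real model the VJP would come from autograd; the
+function name and probe count here are illustrative assumptions:
+
+```python
+import numpy as np
+
+def hutchinson_frobenius_sq(vjp, dim, n_probes, rng):
+    """Estimate ||J||_F^2 as the mean of ||eps^T J||^2 over Gaussian probes."""
+    total = 0.0
+    for _ in range(n_probes):
+        eps = rng.normal(size=dim)
+        total += float(np.sum(vjp(eps) ** 2))
+    return total / n_probes
+
+rng = np.random.default_rng(0)
+A = rng.normal(size=(16, 16))          # Jacobian of the toy field v(x) = A @ x
+estimate = hutchinson_frobenius_sq(lambda eps: eps @ A, dim=16,
+                                   n_probes=4000, rng=rng)
+exact = float(np.sum(A ** 2))
+print(estimate, exact)
+```
+
+Each probe costs one backward pass, so the estimator avoids materializing
+the full Jacobian over all token dimensions.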
+
+`terminal_norm` is the paper's primary score. The decomposed and diagnostic
+scores serve **per-attack-family analysis**; they are NOT competing
+SOTA claims. The standard deviation of `terminal_norm` across seeds is
+≤ 0.005 in all our runs.
+
+The Phase 2 consistency scores have a notable property: they are
+**discriminative only when the model is trained with the consistency loss**.
+On a baseline model, `flow_consistency` is close to chance (0.57 on
+CICIDS2017); after Phase 2 training it rises to 0.88. On SSH-Patator,
+where standard density scores struggle (`terminal_norm` 0.64), Phase 2
+`flow_consistency` reaches 0.94.
+
+## Train
+
+```bash
+# baseline (no consistency loss)
+uv run python Unified_CFM/train.py --config Unified_CFM/configs/cicids2017_baseline.yaml
+
+# Phase 2 with consistency loss (λ=0.1, σ=0.1)
+uv run python Unified_CFM/train.py --config Unified_CFM/configs/cicids2017_consistency.yaml
+
+# σ × λ sweeps and multi-seed orchestrators live in
+# artifacts/verify_2026_04_24/run_*.sh
+```
+
+The intended setup is to use the workspace-canonical 20-d packet-derived
+flow feature file:
+
+```yaml
+flow_features_path: datasets/cicids2017/processed/flow_features.parquet
+flow_features_align: auto
+```
+
+`flow_features.parquet` is row-aligned with the Packet_CFM artifacts via
+`flow_id`. With `flow_features_align: auto`, the loader uses direct
+row/`flow_id` alignment when possible; scan alignment remains only for
+legacy full CSV-derived caches.
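+
+As a rough illustration of direct `flow_id` alignment (not the loader's
+actual code), the sketch below reorders feature rows to match the packet
+artifact order. The function name `align_by_flow_id` and the assumption of
+unique `flow_id` values are hypothetical:
+
+```python
+import numpy as np
+
+def align_by_flow_id(packet_flow_ids, feature_flow_ids, features):
+    """Reorder feature rows so that row i matches packet_flow_ids[i]."""
+    packet_flow_ids = np.asarray(packet_flow_ids)
+    feature_flow_ids = np.asarray(feature_flow_ids)
+    if np.array_equal(packet_flow_ids, feature_flow_ids):
+        return features                      # already row-aligned
+    order = np.argsort(feature_flow_ids)
+    pos = np.searchsorted(feature_flow_ids[order], packet_flow_ids)
+    pos = np.clip(pos, 0, len(order) - 1)
+    idx = order[pos]
+    if not np.array_equal(feature_flow_ids[idx], packet_flow_ids):
+        raise KeyError("some flow_id values have no feature row")
+    return features[idx]
+
+features = np.array([[10.0], [20.0], [30.0]])
+aligned = align_by_flow_id([3, 1, 2], [1, 2, 3], features)
+print(aligned.ravel())  # [30. 10. 20.]
+```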
+
+For large datasets where a monolithic `packets.npz` would exceed memory,
+the loader supports the sharded backend:
+
+```yaml
+source_store: datasets/cicddos2019/processed/full_store
+val_cap: 20000
+attack_cap: 20000
+```
+
+If `flow_features_path` is empty, the loader derives compact 16-d flow-level
+statistics from the packet sequence. That fallback is for debugging only;
+new runs should use the canonical 20-d file generated by
+`scripts/generate_flow_features.py`.
+
+## Evaluation
+
+`artifacts/verify_2026_04_24/eval_phase1_unified.py` runs the Phase 1 +
+Phase 2 score battery on a trained checkpoint, with per-attack-class AUROC.
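+
+Per-attack-class AUROC can be computed by comparing each attack class
+against the benign flows. The sketch below uses the pairwise
+(Mann-Whitney) form, which is quadratic in memory and only suitable for
+small samples. The `"BENIGN"` label and the function names are assumptions,
+not the script's interface:
+
+```python
+import numpy as np
+
+def auroc(scores_neg, scores_pos):
+    """P(pos > neg) + 0.5 * P(pos == neg), the Mann-Whitney form of AUROC."""
+    pos = np.asarray(scores_pos)[:, None]
+    neg = np.asarray(scores_neg)[None, :]
+    return float(np.mean((pos > neg) + 0.5 * (pos == neg)))
+
+def per_class_auroc(scores, labels, benign_label="BENIGN"):
+    """AUROC of each attack class against the benign flows."""
+    scores, labels = np.asarray(scores), np.asarray(labels)
+    benign = scores[labels == benign_label]
+    return {str(c): auroc(benign, scores[labels == c])
+            for c in np.unique(labels) if c != benign_label}
+
+scores = [0.1, 0.2, 0.3, 0.9, 0.8, 0.25, 0.15]
+labels = ["BENIGN", "BENIGN", "BENIGN", "DDoS", "DDoS",
+          "SSH-Patator", "SSH-Patator"]
+result = per_class_auroc(scores, labels)
+print(result)  # {'DDoS': 1.0, 'SSH-Patator': 0.5}
+```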
+
+`artifacts/verify_2026_04_24/eval_phase2_cross_cicddos2019.py` runs the
+cross-dataset CICIDS2017→CICDDoS2019 evaluation under the standard
+10k benign + 10k stratified attack protocol.