Initial commit: code, paper, small artifacts
Unified_CFM/README.md
@@ -0,0 +1,133 @@

# Unified_CFM

A single multi-scale OT-CFM over one token sequence per flow:

```text
[FLOW_TOKEN, PACKET_1, ..., PACKET_T]
```

This is **not** a Flow-CFM + Packet-CFM ensemble. Flow-level and packet-level
signals interact inside one Transformer velocity field, and a Phase 2
masked-prediction consistency loss explicitly trains the cross-modal
dependency.

This is the **current SOTA model** in the repo (within-dataset SOTA on
ISCXTor2016 / CICIDS2017 / CICDDoS2019; near-SOTA cross-dataset).

## Model

`UnifiedTokenCFM` uses fixed tokenization to avoid latent-collapse shortcuts:

```text
flow token:   [type=-1, normalized 20-d canonical flow features, zero pad]
packet token: [type=+1, normalized 9-d packet features, zero pad]
```
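
As an illustration of the layout above, here is a minimal sketch of how one flow could be turned into a fixed-width token sequence. The shared token width (`1 + 20`), the right-side zero padding, and the helper names `make_token` and `build_sequence` are assumptions made for this example; they are not taken from the `UnifiedTokenCFM` code.

```python
from typing import List, Sequence

FLOW_TYPE = -1.0     # type marker of the flow token (from the layout above)
PACKET_TYPE = 1.0    # type marker of packet tokens (from the layout above)
FLOW_DIM = 20        # canonical flow features
PACKET_DIM = 9       # per-packet features
TOKEN_DIM = 1 + max(FLOW_DIM, PACKET_DIM)  # assumed width: type slot + widest payload


def make_token(type_marker: float, features: Sequence[float]) -> List[float]:
    """Return [type, features..., zero pad] with a fixed width of TOKEN_DIM."""
    if len(features) > TOKEN_DIM - 1:
        raise ValueError("feature vector is wider than the token payload")
    pad = [0.0] * (TOKEN_DIM - 1 - len(features))
    return [type_marker, *features, *pad]


def build_sequence(flow_features: Sequence[float],
                   packets: Sequence[Sequence[float]]) -> List[List[float]]:
    """Assemble [FLOW_TOKEN, PACKET_1, ..., PACKET_T] for one flow."""
    if len(flow_features) != FLOW_DIM:
        raise ValueError(f"expected {FLOW_DIM} flow features")
    if any(len(p) != PACKET_DIM for p in packets):
        raise ValueError(f"expected {PACKET_DIM} features per packet")
    flow_token = make_token(FLOW_TYPE, flow_features)
    packet_tokens = [make_token(PACKET_TYPE, p) for p in packets]
    return [flow_token] + packet_tokens


# Toy flow with three packets; the values stand in for normalized features.
sequence = build_sequence([0.5] * FLOW_DIM, [[0.1] * PACKET_DIM for _ in range(3)])
print(len(sequence), len(sequence[0]))
```

Because the type marker and the padding are fixed rather than learned, every token carries an unambiguous signal of which modality it belongs to.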

The velocity field is a 4-layer AdaLN-Zero Transformer (`d_model=128, n_heads=4`)
with a sinusoidal time embedding (`time_dim=64`), about 1.23M parameters in total.
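
For readers unfamiliar with sinusoidal time embeddings, the sketch below shows one common formulation with `time_dim=64`. The `max_period=10000` base and the sin-then-cos concatenation follow widespread Transformer and diffusion-model practice; both are assumptions here, and the repo's embedding may differ in either detail.

```python
import math
from typing import List


def sinusoidal_time_embedding(t: float, dim: int = 64,
                              max_period: float = 10000.0) -> List[float]:
    """Map a scalar time t to a dim-d vector of sines and cosines.

    Frequencies are geometrically spaced from 1 down to roughly 1 / max_period.
    """
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    half = dim // 2
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    return [math.sin(t * f) for f in freqs] + [math.cos(t * f) for f in freqs]


embedding = sinusoidal_time_embedding(0.25)
print(len(embedding))
```

In an AdaLN-Zero block, an embedding like this is typically passed through a small MLP that produces the per-layer scale, shift, and gate values.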

Loss with Phase 2 consistency:

```
L = L_main + λ_flow · L_mask_flow + λ_packet · L_mask_packet

L_main:        standard OT-CFM velocity regression with σ-band noise +
               Sinkhorn OT coupling.
L_mask_flow:   zero out the flow token's input at x_t; predict v[flow]
               from packet context only.
L_mask_packet: zero out a random 50% of real packet tokens at x_t;
               predict their velocities from flow + remaining packets.
```
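
The structure of this objective can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the training code: it uses the standard straight-line CFM path `x_t = (1 - t) x0 + t x1` with target `x1 - x0`, takes `x0` and `x1` as already paired (the Sinkhorn coupling and σ-band noise are omitted), averages `L_main` over every position, and treats `velocity_fn`, `phase2_loss`, and the helper names as hypothetical.

```python
import random
from typing import Callable, List, Optional

Seq = List[List[float]]                    # one flow: a list of fixed-width tokens
VelocityFn = Callable[[Seq, float], Seq]   # stand-in for the Transformer velocity field


def mse(pred: Seq, target: Seq, positions: List[int]) -> float:
    """Mean squared error restricted to the given token positions."""
    errors = [(p - u) ** 2 for i in positions for p, u in zip(pred[i], target[i])]
    return sum(errors) / len(errors) if errors else 0.0


def zero_tokens(x: Seq, positions: List[int]) -> Seq:
    """Copy x with the selected tokens replaced by all-zero vectors."""
    hidden = set(positions)
    return [[0.0] * len(tok) if i in hidden else list(tok) for i, tok in enumerate(x)]


def phase2_loss(velocity_fn: VelocityFn, x0: Seq, x1: Seq, t: float,
                n_real_packets: int, lambda_flow: float = 0.3,
                lambda_packet: float = 0.3, packet_mask_ratio: float = 0.5,
                rng: Optional[random.Random] = None) -> float:
    rng = rng or random.Random()
    # Assumed straight-line path and regression target.
    x_t = [[(1 - t) * a + t * b for a, b in zip(r0, r1)] for r0, r1 in zip(x0, x1)]
    target = [[b - a for a, b in zip(r0, r1)] for r0, r1 in zip(x0, x1)]

    # L_main: velocity regression on the unmasked input.
    l_main = mse(velocity_fn(x_t, t), target, list(range(len(x_t))))

    # L_mask_flow: hide the flow token (position 0) and score only its velocity.
    l_flow = mse(velocity_fn(zero_tokens(x_t, [0]), t), target, [0])

    # L_mask_packet: hide a random subset of the real (non-padding) packets
    # and score only the hidden positions.
    real = list(range(1, 1 + n_real_packets))
    n_hidden = max(1, int(packet_mask_ratio * len(real))) if real else 0
    hidden = rng.sample(real, n_hidden)
    l_packet = mse(velocity_fn(zero_tokens(x_t, hidden), t), target, hidden)

    return l_main + lambda_flow * l_flow + lambda_packet * l_packet
```

The two masked terms only score the hidden positions, so the model can reduce them only by reading the other modality, which is the cross-modal dependency the loss is meant to train.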

Best hyperparameters from the σ × λ sweeps:

```
lambda_flow = lambda_packet = 0.3
packet_mask_ratio = 0.5
sigma = 0.6   # cross-dataset best; σ=0.1 is marginally better for some within-dataset runs
use_ot = True
```

## Scores

The model exposes three classes of scores at inference:

```text
# primary
terminal_norm

# decomposed (analysis only)
terminal_flow      terminal_packet
arc_length         kinetic_energy   kinetic_flow   kinetic_packet
velocity_total     velocity_flow    velocity_packet

# Phase 1 diagnostics
curvature_total    curvature_flow   curvature_packet        # ∫ ||dv/dt||² dt
kappa2_speed2norm_packet_{mean,median,trimmed10_mean}       # packet curvature / speed²
jacobian_total     jacobian_flow    jacobian_packet         # Hutchinson VJP estimate of ||∂v/∂x||_F²
velocity_*_t{01..10}                                        # 18 time-profile scores

# Phase 2 cross-modal consistency
flow_consistency   packet_consistency   consistency_total
```

`terminal_norm` is the paper's primary score. The decomposed and diagnostic
scores serve **per-attack-family analysis**; they are NOT competing
SOTA claims. Multi-seed std on `terminal_norm` is ≤ 0.005 across all our
runs.
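
The `jacobian_*` diagnostics are listed as a Hutchinson VJP estimate of `||∂v/∂x||_F²`. The idea rests on the identity `E[||εᵀJ||²] = ||J||_F²` for random probes `ε` with identity covariance. The sketch below demonstrates it with Rademacher probes and an explicit matrix standing in for the Jacobian; in a real model the `vjp` callable would come from automatic differentiation. The function names and the probe count are assumptions made for this example.

```python
import random
from typing import Callable, List, Optional

Vector = List[float]


def hutchinson_frobenius_sq(vjp: Callable[[Vector], Vector], out_dim: int,
                            n_probes: int = 8,
                            rng: Optional[random.Random] = None) -> float:
    """Estimate ||J||_F^2 as the mean of ||eps^T J||^2 over Rademacher probes."""
    rng = rng or random.Random()
    total = 0.0
    for _ in range(n_probes):
        eps = [rng.choice((-1.0, 1.0)) for _ in range(out_dim)]
        row = vjp(eps)                      # eps^T J, one vector-Jacobian product
        total += sum(r * r for r in row)
    return total / n_probes


def linear_vjp(matrix: List[Vector]) -> Callable[[Vector], Vector]:
    """VJP of the toy field v(x) = A x, whose Jacobian is A itself."""
    n_rows, n_cols = len(matrix), len(matrix[0])

    def vjp(eps: Vector) -> Vector:
        return [sum(eps[i] * matrix[i][j] for i in range(n_rows)) for j in range(n_cols)]

    return vjp


# For a diagonal Jacobian, Rademacher probes recover the norm exactly.
diag = [[2.0, 0.0], [0.0, 3.0]]
print(hutchinson_frobenius_sq(linear_vjp(diag), out_dim=2, rng=random.Random(0)))
```

For a general Jacobian the estimate is unbiased but noisy, so the number of probes trades compute against variance.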

The Phase 2 consistency scores have a notable property: they are
**discriminative only when the model is trained with the consistency loss**.
On a baseline model `flow_consistency` is roughly random (0.57 on
CICIDS2017); after Phase 2 training it rises to 0.88. On SSH-Patator,
where standard density scores struggle (`terminal_norm` 0.64), Phase 2
`flow_consistency` reaches 0.94.

## Train

```bash
# baseline (no consistency loss)
uv run python Unified_CFM/train.py --config Unified_CFM/configs/cicids2017_baseline.yaml

# Phase 2 with consistency loss (λ=0.1, σ=0.1)
uv run python Unified_CFM/train.py --config Unified_CFM/configs/cicids2017_consistency.yaml

# σ × λ sweeps and multi-seed orchestrators live in
# artifacts/verify_2026_04_24/run_*.sh
```

The intended setup is to use the workspace-canonical 20-d packet-derived
flow feature file:

```yaml
flow_features_path: datasets/cicids2017/processed/flow_features.parquet
flow_features_align: auto
```

`flow_features.parquet` is row-aligned with the Packet_CFM artifacts via
`flow_id`. With `flow_features_align: auto`, the loader uses direct
row/`flow_id` alignment when possible; scan alignment remains only for
legacy full CSV-derived caches.
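
To make the idea of direct row/`flow_id` alignment concrete, here is a small sketch of one way it could work: reuse the rows as-is when both ID lists already match, otherwise reorder the feature rows by `flow_id`. The function `align_flow_features` and its error handling are hypothetical and do not describe the loader's actual logic; the legacy scan path is not covered.

```python
from typing import Hashable, List, Sequence, TypeVar

Row = TypeVar("Row")


def align_flow_features(packet_flow_ids: Sequence[Hashable],
                        feature_flow_ids: Sequence[Hashable],
                        feature_rows: Sequence[Row]) -> List[Row]:
    """Return feature rows in the same order as the packet artifacts."""
    if len(feature_flow_ids) != len(feature_rows):
        raise ValueError("one flow_id is required per feature row")

    # Case 1: already row-aligned, nothing to reorder.
    if list(packet_flow_ids) == list(feature_flow_ids):
        return list(feature_rows)

    # Case 2: same flows in a different order (or a superset); join on flow_id.
    index = {fid: i for i, fid in enumerate(feature_flow_ids)}
    if len(index) != len(feature_flow_ids):
        raise ValueError("duplicate flow_id in the feature file")
    missing = [fid for fid in packet_flow_ids if fid not in index]
    if missing:
        raise KeyError(f"{len(missing)} flow_id(s) have no feature row")
    return [feature_rows[index[fid]] for fid in packet_flow_ids]


rows = align_flow_features(["b", "a"], ["a", "b", "c"], [[1.0], [2.0], [3.0]])
print(rows)
```

Failing loudly on duplicate or missing IDs avoids silently pairing a packet sequence with another flow's features.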

For large datasets where a monolithic `packets.npz` would exceed memory,
the loader supports the sharded backend:

```yaml
source_store: datasets/cicddos2019/processed/full_store
val_cap: 20000
attack_cap: 20000
```

If `flow_features_path` is empty, the loader derives compact 16-d flow-level
statistics from the packet sequence. That fallback is for debugging only;
new runs should use the canonical 20-d file generated by
`scripts/generate_flow_features.py`.

## Evaluation

`artifacts/verify_2026_04_24/eval_phase1_unified.py` runs the Phase 1 + Phase 2
score battery on a trained checkpoint and reports per-attack-class AUROC.
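
For reference, per-attack-class AUROC can be computed as sketched below: each attack class is scored against the benign samples, and AUROC is the probability that a random attack sample receives a higher score than a random benign one, with ties counted as one half. The label name `"BENIGN"`, the convention that higher scores mean more anomalous, and the function names are assumptions made for this example; the evaluation script may use a library routine instead.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence


def auroc(benign_scores: Sequence[float], attack_scores: Sequence[float]) -> float:
    """P(attack score > benign score), counting ties as 0.5 (Mann-Whitney U form).

    Quadratic in the number of samples; fine for a sketch, slow for large sets.
    """
    if not benign_scores or not attack_scores:
        raise ValueError("both groups must be non-empty")
    wins = 0.0
    for a in attack_scores:
        for b in benign_scores:
            if a > b:
                wins += 1.0
            elif a == b:
                wins += 0.5
    return wins / (len(attack_scores) * len(benign_scores))


def per_class_auroc(scores: Sequence[float], labels: Sequence[Hashable],
                    benign_label: Hashable = "BENIGN") -> Dict[Hashable, float]:
    """AUROC of every attack class against the shared benign pool."""
    by_class: Dict[Hashable, List[float]] = defaultdict(list)
    for score, label in zip(scores, labels):
        by_class[label].append(score)
    benign = by_class.pop(benign_label)
    return {name: auroc(benign, values) for name, values in by_class.items()}


result = per_class_auroc(
    [0.10, 0.20, 0.90, 0.80, 0.15],
    ["BENIGN", "BENIGN", "DoS", "DoS", "SSH"],
)
print(result)
```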

`artifacts/verify_2026_04_24/eval_phase2_cross_cicddos2019.py` runs the
cross-dataset CICIDS2017→CICDDoS2019 evaluation under the standard
10k benign + 10k stratified attack protocol.
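
A subset of this shape could be drawn as in the sketch below. It reads "stratified" as proportional allocation across attack classes (largest-remainder rounding); that reading, the `"BENIGN"` label, the seed handling, and the function names are all assumptions made for illustration, not a description of the script.

```python
import random
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence, Tuple


def stratified_indices(labels: Sequence[Hashable], n_total: int,
                       rng: random.Random) -> List[int]:
    """Pick n_total indices with per-class counts proportional to class size."""
    by_class: Dict[Hashable, List[int]] = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    population = len(labels)
    n_total = min(n_total, population)

    # Integer floor of each class quota, then hand out the leftover samples
    # to the classes with the largest remainders.
    counts = {c: n_total * len(idx) // population for c, idx in by_class.items()}
    remainders = {c: n_total * len(idx) % population for c, idx in by_class.items()}
    leftover = n_total - sum(counts.values())
    for c in sorted(remainders, key=remainders.get, reverse=True)[:leftover]:
        counts[c] += 1

    picked: List[int] = []
    for c, idx in by_class.items():
        picked.extend(rng.sample(idx, counts[c]))
    return sorted(picked)


def build_eval_subset(labels: Sequence[Hashable], n_benign: int = 10_000,
                      n_attack: int = 10_000, benign_label: Hashable = "BENIGN",
                      seed: int = 0) -> Tuple[List[int], List[int]]:
    """Return (benign indices, attack indices) into the full label array."""
    rng = random.Random(seed)
    benign = [i for i, label in enumerate(labels) if label == benign_label]
    attack = [i for i, label in enumerate(labels) if label != benign_label]
    benign_pick = sorted(rng.sample(benign, min(n_benign, len(benign))))
    attack_labels = [labels[i] for i in attack]
    attack_pick = [attack[j] for j in stratified_indices(attack_labels, n_attack, rng)]
    return benign_pick, attack_pick


toy_labels = ["BENIGN"] * 50 + ["DoS"] * 30 + ["SSH"] * 10
benign_idx, attack_idx = build_eval_subset(toy_labels, n_benign=10, n_attack=8)
print(len(benign_idx), len(attack_idx))
```

Fixing the seed keeps the evaluation subset identical across checkpoints, so score differences reflect the model rather than the sample.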