BattleTag 6e5f753c01 baselines: add 3x3 cross-dataset runners for IF/OCSVM (path A + B) and Shafir NF

New scripts under scripts/baselines/:
- run_if_ocsvm_cross.py            - 20-d canonical flow features (path A)
- run_if_ocsvm_cross_packets.py    - raw 576-d packet sequence (path B)
- run_shafir_nf_cross.py           - single-NF on 5-d SHAFIR5 subset or 20-d
- *_all.sh                         - 3 sources x 3 targets x 3 seeds sweepers

New aggregator scripts/aggregate/baselines_cross_3x3_table.py builds a
Markdown 3x3 matrix per method from per-cell NPZ outputs.

RESULTS.md gains a "Shallow-baseline 3x3 cross matrices" subsection
pointing at the new artifact directories.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-12 17:41:20 +08:00


Final Results

Main-line model: JANUS

JANUS (Joint Anomaly via Normalizing-flows of Unified States) is the current main-line model. Codebase identifier is Mixed_CFM/; JANUS is the external/published name.

JANUS = a packet-causal Transformer backbone with two output heads:

Continuous Flow Matching head over (size, IAT, win) packet channels
Discrete Flow Matching (DFM) head over the 6 binary protocol-flag / direction channels

trained jointly (σ=0.1, lambda_disc=1.0, use_ot=true, no Phase-2 consistency loss). Downstream uses a single deployable scalar score: the Mahalanobis-OAS distance over the 10-d score vector emitted by JANUS, fit on benign val only (no attack labels).
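The scoring step described above can be sketched as follows. This is a minimal illustration, not the project's implementation: it assumes the 10-d score vectors are already available as NumPy arrays, uses scikit-learn's `OAS` covariance estimator, and substitutes synthetic Gaussian data for real score vectors. The helper names `fit_mahalanobis_oas` and `anomaly_score` are hypothetical.

```python
import numpy as np
from sklearn.covariance import OAS


def fit_mahalanobis_oas(benign_val_scores: np.ndarray) -> OAS:
    """Fit an OAS-shrunk mean/covariance on benign validation vectors only."""
    return OAS().fit(benign_val_scores)


def anomaly_score(estimator: OAS, score_vectors: np.ndarray) -> np.ndarray:
    """Squared Mahalanobis distance d^2 to the fitted benign distribution."""
    return estimator.mahalanobis(score_vectors)


# Synthetic stand-ins for the 10-d score vectors (assumption, not real data).
rng = np.random.default_rng(0)
benign_val = rng.normal(size=(2000, 10))
test_benign = rng.normal(size=(500, 10))
test_attack = rng.normal(loc=1.5, size=(500, 10))  # shifted "attack" vectors

est = fit_mahalanobis_oas(benign_val)   # no attack labels are used here
d2_benign = anomaly_score(est, test_benign)
d2_attack = anomaly_score(est, test_attack)
print(f"mean d2 benign={d2_benign.mean():.2f}  attack={d2_attack.mean():.2f}")
```

The single scalar `d2` is what would be ranked for AUROC or thresholded; only the benign validation split touches the fit.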

JANUS is the first NIDS method to use Flow Matching as the training paradigm in mixed continuous-discrete state spaces over packet sequences.

All numbers reported are 3-seed mean ± std. Two model families are tracked:

Unified_CFM (legacy / our previous internal recipe): single Transformer over [FLOW + packets] with Phase-2 consistency loss; λ=0.3. Strongest single-fixed-score (terminal_norm) within-dataset baseline.
JANUS (current main line, 2026-05-01): see above. New SOTA on cross-dataset transfer under Mahalanobis auto-routing; matches legacy within-dataset under the same protocol. See artifacts/route_comparison/SCORE_ROUTER.md.

Caveats that travel with all external claims

CICIoT2023 vs Shafir is a metric mismatch, not a +SOTA result. Shafir reports F1=0.9951 with threshold tuned by Youden's J (TPR−FPR) on a 1K+1K balanced val set (uses attack labels for threshold selection only) and tested on 10K+10K balanced. We report AUROC=0.9594 (Mahalanobis-OAS). Different metric. CICIoT2023 should be presented as "additional benchmark, no Shafir AUROC published" rather than "+SOTA". To make it directly comparable, either reproduce Shafir's threshold protocol on JANUS's d² to compute F1, or run Shafir's GitHub lshafir/NF-anomaly-detection to extract NF AUROC.
Reverse cross (CICDDoS2019→CICIDS2017) matches Shafir, does not beat. JANUS gets 0.9301 ± 0.0122. Shafir Table IX row 3 reports 0.93. The "+0.31" gain is vs our own legacy terminal_norm (0.62), not vs Shafir.
Cross-dataset is calibrated cross-domain transfer, not zero-shot. The Mahalanobis-OAS aggregator is fit on the target dataset's benign val (unsupervised — no attack labels). Comparison vs Shafir is fair (his NF threshold also calibrated on target benign), but the language must be "calibrated cross-domain transfer" not "zero-shot transfer".
Aggregator selection (OAS over LedoitWolf / plain Mahal / max-z) was post-hoc. OAS was picked because it is consistently top across all cells in SCORE_ROUTER.md; differences vs LedoitWolf are ≤ 0.005. A strict pre-registration framing would say "we evaluated 5 benign-only aggregators and OAS performed best".
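For the first caveat, the Youden's J threshold protocol could be reproduced on any scalar anomaly score roughly as below. This is a sketch under assumptions: higher score means more anomalous, the tiny lists are toy values rather than real d² scores, and `youden_threshold` and `f1_at` are hypothetical helper names. Shafir's actual split sizes (1K+1K val, 10K+10K test) are not reproduced here.

```python
def youden_threshold(benign_scores, attack_scores):
    """Return (tau, J) maximizing J = TPR - FPR over observed score values."""
    candidates = sorted(set(benign_scores) | set(attack_scores))
    best_tau, best_j = candidates[0], -1.0
    for tau in candidates:
        tpr = sum(s >= tau for s in attack_scores) / len(attack_scores)
        fpr = sum(s >= tau for s in benign_scores) / len(benign_scores)
        if tpr - fpr > best_j:  # strict '>' keeps the lowest tau on ties
            best_tau, best_j = tau, tpr - fpr
    return best_tau, best_j


def f1_at(tau, benign_scores, attack_scores):
    """F1 with 'score >= tau' predicted as attack."""
    tp = sum(s >= tau for s in attack_scores)
    fp = sum(s >= tau for s in benign_scores)
    fn = len(attack_scores) - tp
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


# Toy balanced validation and test sets (illustrative values only).
val_benign = [0.1, 0.4, 0.5, 0.9]
val_attack = [0.8, 1.2, 1.5, 2.0]
tau, j = youden_threshold(val_benign, val_attack)
test_f1 = f1_at(tau, [0.2, 0.3, 1.0], [0.7, 1.1, 1.9])
print(tau, j, round(test_f1, 4))
```

Note that this threshold selection uses attack labels on the validation set, which is exactly why the resulting F1 is not comparable to a benign-only AUROC.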

Headline performance

External SOTA baselines (Shafir 2026 NF + Shapley) verified directly from the paper (artifacts/locked_baselines.md). Unified_CFM "legacy" rows are our previous internal recipe (Phase-2 consistency loss + per-task σ); they are reported as internal ablation, NOT as the SOTA-comparison baseline.

A. vs External SOTA — within-dataset, JANUS + Mahalanobis-OAS (no selection bias)

Task	Shafir 2026 SOTA	JANUS + Mahalanobis-OAS	Δ vs Shafir
ISCXTor2016 (NonTor → Tor)	0.8731 (AUROC)	0.9908 ± 0.0012	+0.118 ⭐⭐
CICIDS2017 within	0.9303 (AUROC)	0.9845 ± 0.0030	+0.054 ⭐
CICDDoS2019 within	0.93 (AUROC)	0.9913 ± 0.0009	+0.061 ⭐
CICIoT2023 within	F1=0.9951 (no AUROC)	0.9594 ± 0.0028 (AUROC)	N/A — metric mismatch, see Caveat 1

3/3 directly comparable within-dataset benchmarks: JANUS sets new SOTA vs external Shafir baselines, with margins +0.054 to +0.118 — all far outside seed std. This holds under fully selection-bias-free eval (single Mahalanobis-OAS aggregator on the 10-d score vector, fit on benign val only, no attack labels). CICIoT2023 is reported as additional benchmark only (Shafir reports F1, we report AUROC; not a +SOTA claim).
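The "far outside seed std" statement can be checked directly from the values in table A. The sketch below only uses numbers copied from that table; it compares each margin against the JANUS 3-seed std, since no std for the Shafir numbers is given here.

```python
# (Shafir AUROC, JANUS mean AUROC, JANUS 3-seed std), copied from table A.
table_a = {
    "ISCXTor2016": (0.8731, 0.9908, 0.0012),
    "CICIDS2017": (0.9303, 0.9845, 0.0030),
    "CICDDoS2019": (0.93, 0.9913, 0.0009),
}

margins = {}
for task, (shafir, janus_mean, janus_std) in table_a.items():
    delta = janus_mean - shafir
    # Store the rounded margin and how many JANUS seed stds it spans.
    margins[task] = (round(delta, 3), delta / janus_std)
    print(f"{task}: delta={delta:+.3f}  ({delta / janus_std:.0f}x seed std)")
```

Every margin is more than an order of magnitude larger than the corresponding seed std, which is the sense in which the gaps are not seed noise.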

A'. Reference only — best per-channel fixed score (per-dataset selection-biased; do NOT use as headline SOTA)

⚠️ Selection-biased: the channel chosen per row (terminal_norm vs terminal_packet) requires looking at attack-label AUROC to pick. Use this table as ablation upper bound only, not as the SOTA claim. The honest external SOTA claim is in table A above.

Task	Shafir 2026	JANUS (best fixed channel)	Δ vs Shafir
ISCXTor2016	0.8731	0.9954 ± 0.0007 (`terminal_norm`)	+0.122
CICIDS2017	0.9303	0.9932 ± 0.0013 (`terminal_packet`)	+0.063
CICDDoS2019	0.93	0.9970 ± 0.0005 (`terminal_norm`)	+0.067
CICIoT2023	F1=0.9951 (different metric)	0.9671 ± 0.0002 (`terminal_packet`)	N/A

B. Internal ablation — JANUS vs our previous Unified_CFM legacy

This is for tracking how JANUS does relative to our own previous internal best (not for the SOTA claim — Unified_CFM legacy is also our work). Within-dataset AUROC has saturated above 0.99; differences ≤ 0.005 are seed noise and the regime has no resolving power. The discriminating axis is cross-dataset (next section).

Task	Legacy Unified_CFM	JANUS + Mahalanobis-OAS	JANUS (best fixed)
ISCXTor2016	0.9945 ± 0.0011	0.9908 ± 0.0012	0.9954 ± 0.0007
CICIDS2017	0.9858 ± 0.0021	0.9845 ± 0.0030	0.9932 ± 0.0013
CICDDoS2019	0.9960 ± 0.0010	0.9913 ± 0.0009	0.9970 ± 0.0005
CICIoT2023	0.9612 ± 0.0017	0.9594 ± 0.0028	0.9671 ± 0.0002

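JANUS + Mahalanobis-OAS stays within 0.005 AUROC of the legacy recipe on every within-dataset task (gaps 0.0013 to 0.0047). The ±1-std intervals overlap on CICIDS2017 and CICIoT2023; on ISCXTor2016 and CICDDoS2019 legacy is ahead by about 0.004 to 0.005, which is outside ±1 std but below the 0.005 resolution floor stated above. Best-fixed (per-dataset selection-biased) strictly beats legacy on 4/4 but cannot be cited as a clean SOTA claim. The decisive value-add is on cross-dataset transfer.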

C. Cross-dataset transfer — JANUS + Mahalanobis-OAS

⚠️ Δ columns are vs our own legacy (not vs Shafir). vs Shafir: forward beats (+0.07 over 0.89), reverse matches (0.93 = 0.93). See Caveats above.

Task	Legacy `terminal_norm`	JANUS + Mahalanobis-OAS	Δ vs legacy	Shafir	vs Shafir
CICIoT2023 → CICIDS2017	0.7700 ± 0.0133	0.8983 ± 0.0098	+0.128	(n/a)	(n/a)
CICIoT2023 → CICDDoS2019	0.7473 ± 0.0223	0.8944 ± 0.0068	+0.147	(n/a)	(n/a)
CICIDS2017 → CICDDoS2019 (forward)	0.911 (legacy SOTA)	0.9594 ± 0.0046	+0.048	0.89	+0.07
CICDDoS2019 → CICIDS2017 (reverse)	0.62 (legacy)	0.9301 ± 0.0122	+0.31	0.93	0 (matches)

Full 4×4 cross matrix at artifacts/route_comparison/CROSS_MATRIX.md. All 12 off-diagonal directions tested (3 seeds each = 36 cross evaluations). Average off-diagonal improvement: +0.175 over terminal_norm (0.660 → 0.835). The four "source-likeness collapse" cells where terminal_norm ≤ 0.57 (essentially random) are all recovered to ≥ 0.75.

See artifacts/route_comparison/SCORE_ROUTER.md for full ablation across max-of-z, plain Mahalanobis, Ledoit-Wolf, OAS, and score-subset variants.

Shallow-baseline 3×3 cross matrices (Isolation Forest, OCSVM) — 2026-05-12 add

Two input modalities tested as cross-dataset reference points:

Path A (artifacts/baselines/if_ocsvm_cross_2026_05_11/): IF and OCSVM on the 20-d canonical flow features (StandardScaler). Strong shallow baseline — best off-diagonal AUROC is OCSVM 0.966 on CICIDS17→CICDDoS19. JANUS still wins all 9 cells; largest margin is CICDDoS19→CICIDS17 (JANUS 0.941 vs OCSVM 0.571, +0.370 AUROC).
Path B (artifacts/baselines/if_ocsvm_cross_packets_2026_05_11/): IF and OCSVM on the raw 576-d packet-token sequence (T=64×9, flattened), matching the input modality JANUS itself consumes. Numbers are weaker across the board (avg −0.16 AUROC vs path A); 3 IF cells and 1 OCSVM cell drop below random. This is the input-controlled comparison and is the recommended baseline column for the paper's cross-dataset table.
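A single cross cell of the shallow baselines could look roughly like the sketch below. It is not the contents of `run_if_ocsvm_cross.py`: the data is synthetic, the hyperparameters (`nu`, `gamma`, `random_state`) are assumptions, and fitting the `StandardScaler` on source benign data is one plausible reading of the setup rather than a confirmed detail.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import roc_auc_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

# Synthetic 20-d "flow features" (assumption; stands in for path A inputs).
rng = np.random.default_rng(0)
src_benign = rng.normal(size=(1000, 20))           # source-dataset benign
tgt_benign = rng.normal(loc=0.2, size=(300, 20))   # mildly shifted target benign
tgt_attack = rng.normal(loc=2.0, size=(300, 20))   # target attacks

X_tgt = np.vstack([tgt_benign, tgt_attack])
y_tgt = np.r_[np.zeros(len(tgt_benign)), np.ones(len(tgt_attack))]

models = {
    "IF": make_pipeline(StandardScaler(), IsolationForest(random_state=0)),
    "OCSVM": make_pipeline(
        StandardScaler(), OneClassSVM(kernel="rbf", gamma="scale", nu=0.05)
    ),
}

aurocs = {}
for name, model in models.items():
    model.fit(src_benign)  # benign-only training on the source dataset
    # score_samples is higher for inliers, so negate it to rank anomalies.
    aurocs[name] = roc_auc_score(y_tgt, -model.score_samples(X_tgt))
    print(f"{name}: AUROC={aurocs[name]:.3f}")
```

For path B the same loop would apply to flattened 576-d packet sequences instead of the 20-d features.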

Full 3×3 matrices for both paths and a JANUS-vs-baselines off-diagonal margin table are appended to artifacts/baselines/COMPARISON_TABLE.md.

Reverse cross (CICDDoS2019 → CICIDS2017) — 2026-05-01 update

The reverse direction was the project's "stuck" failure mode (memory note reverse_cross_score_redirection_2026_04_25). Three model variants compared:

Model	`terminal_norm`	best single score (post-hoc)	Mahalanobis-OAS
Legacy Unified + consistency	0.626	`pna_packet_median` 0.882	0.824
Legacy Unified no consistency	0.554	`pna_packet_median` 0.852	0.893
JANUS (new)	0.519	`disc_nll_total` 0.903 ± 0.012	0.930 ± 0.015

terminal_norm collapses (≈ random) across all model variants — this is the source-likeness-classifier failure mode confirmed at the architecture level, not just a single-recipe artifact. The recovery path is:

DFM head gives a disc_nll score that captures protocol-flag distribution, which is genuinely transfer-stable.
Mahalanobis-OAS on the 10-d score vector aggregates disc_nll with the (broken-but-not-useless) terminal scores into a 0.93 ± 0.015 AUROC.
Compared to Shafir's reverse 0.93 on this direction, JANUS + Mahalanobis-OAS matches that benchmark (0.93 = 0.93). Does NOT beat.

This is +0.31 over our own legacy memory baseline of 0.62. The "main attack direction" recorded in reverse_cross_score_redirection_2026_04_25 is now substantially solved.

Thresholded F1 / Precision / Recall / TPR@FPR under the unsupervised threshold protocol are reported in section D below. Headline thresholded numbers: CICDDoS2019 within terminal_norm F1=0.993 ± 0.001 at τ=P95; cross terminal_norm F1=0.632 ± 0.051 at τ=P95 (precision ≈ 0.95, recall ≈ 0.47).

Note on cross-dataset baseline: Shafir's Table IX is asymmetric. The IDS2017→DDoS2019 (forward) direction reads 0.89, not 0.93. The 0.93 number is the reverse direction (DDoS2019→IDS2017), which is evaluated in the reverse-cross section above (JANUS 0.9301 ± 0.0122). See artifacts/locked_baselines.md.

Note on σ choice: headline numbers use per-task best σ (σ=0.1 for ISCXTor2016 and CICDDoS2019; σ=0.6 for CICIDS2017 within and cross). Within-dataset tasks are σ-insensitive within seed noise; cross-dataset requires σ=0.6. Single-policy σ=0.6 also beats Shafir on 4/4. Full 4×2 sensitivity table in artifacts/sigma_validation.md.

D. Thresholded operating-point metrics

⚠️ Numbers in this section are from the Unified_CFM legacy recipe (σ=0.1 within, σ=0.6 cross, λ=0.3, single fixed score). Equivalent thresholded numbers for current JANUS + Mahalanobis-OAS have not been recomputed yet; the AUROC tables (A/B/C above) are the authoritative JANUS comparison.

Protocol: τ is set from a benign-val half (A); F1 / Precision / Recall / FPR are measured on benign-val half B + attack. AUROC / AUPRC use full benign val + attack. TPR@FPR is measured on the test half. Both percentiles τ ∈ {P95, P99} are reported because they correspond to different operating points and F1 is sensitive to that choice.
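The thresholding protocol above can be sketched as follows. The random half split, the strict `>` comparison, and the synthetic Gaussian scores are assumptions for illustration; `thresholded_metrics` is a hypothetical helper, not a function from the repository.

```python
import numpy as np


def thresholded_metrics(benign_val, attack, percentile=95.0, seed=0):
    """Set tau on benign half A, then measure on benign half B + attack."""
    benign_val = np.asarray(benign_val, dtype=float)
    attack = np.asarray(attack, dtype=float)
    perm = np.random.default_rng(seed).permutation(len(benign_val))
    half = len(perm) // 2
    half_a, half_b = benign_val[perm[:half]], benign_val[perm[half:]]

    tau = float(np.percentile(half_a, percentile))  # benign-only threshold
    tp = int((attack > tau).sum())
    fp = int((half_b > tau).sum())
    fn = len(attack) - tp
    return {
        "tau": tau,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / len(attack),
        "f1": 2 * tp / (2 * tp + fp + fn) if tp else 0.0,
        "fpr": fp / len(half_b),
    }


# Synthetic scores: benign ~ N(0, 1), attack ~ N(4, 1) (illustrative only).
rng = np.random.default_rng(0)
benign_scores = rng.normal(0.0, 1.0, size=4000)
attack_scores = rng.normal(4.0, 1.0, size=2000)
m95 = thresholded_metrics(benign_scores, attack_scores, percentile=95.0)
m99 = thresholded_metrics(benign_scores, attack_scores, percentile=99.0)
print({k: round(v, 4) for k, v in m95.items()})
print({k: round(v, 4) for k, v in m99.items()})
```

Moving from P95 to P99 raises τ, which lowers FPR and can only lower recall; that trade-off is why both operating points are reported.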

CICDDoS2019 within (σ=0.1, λ=0.3):

Score	AUROC	AUPRC	F1 (P95)	Prec (P95)	Recall (P95)	FPR (P95)	F1 (P99)	TPR@1%FPR	TPR@5%FPR
`terminal_norm`	0.9960 ± 0.0011	0.9975 ± 0.0008	0.9932 ± 0.0012	0.9881 ± 0.0015	0.9983 ± 0.0008	0.0481 ± 0.0061	0.9112 ± 0.0402	0.9013 ± 0.0540	0.9980 ± 0.0014
`terminal_flow`	0.9885 ± 0.0028	0.9918 ± 0.0017	0.9788 ± 0.0086	0.9868 ± 0.0009	0.9710 ± 0.0163	0.0517 ± 0.0030	0.7752 ± 0.0128	0.6052 ± 0.0347	0.9697 ± 0.0169

CICIDS2017 → CICDDoS2019 cross (σ=0.6, λ=0.3):

Score	AUROC	AUPRC	F1 (P95)	Prec (P95)	Recall (P95)	FPR (P95)	F1 (P99)	TPR@1%FPR	TPR@5%FPR
`terminal_norm`	0.9109 ± 0.0032	0.8974 ± 0.0047	0.6321 ± 0.0513	0.9545 ± 0.0045	0.4745 ± 0.0550	0.0441 ± 0.0011	0.4202 ± 0.0171	0.2685 ± 0.0139	0.4940 ± 0.0399
`terminal_flow`	0.9197 ± 0.0036	0.8957 ± 0.0086	0.6324 ± 0.0585	0.9517 ± 0.0055	0.4762 ± 0.0639	0.0469 ± 0.0019	0.4028 ± 0.0049	0.2534 ± 0.0039	0.4776 ± 0.0636

Reading:

Within-dataset CICDDoS2019 saturates: at τ=P95 F1 ≈ 0.99 with balanced precision and recall ≈ 0.99; at τ=P99 (≈1% FPR) F1 ≈ 0.91 with TPR@1%FPR ≈ 0.90. The model is a working detector at fixed thresholds, not just an AUROC artifact.
Cross-dataset CICIDS2017→CICDDoS2019 keeps AUROC ≈ 0.91, but at fixed τ it shows precision ≈ 0.95 / recall ≈ 0.47 at P95 and TPR ≈ 0.27 at 1% FPR. The cross-dataset domain shift compresses the score gap, so source-calibrated thresholds are conservative on target. AUROC alone overstates cross-dataset deployability; thresholded numbers are the honest figure.

TIPSO-GAN comparability: TIPSO-GAN's CICDDoS2019 F1 ≈ 0.99 is reported under a supervised protocol (model has seen attack examples). Our F1 ≈ 0.99 on CICDDoS2019 within is achieved under the unsupervised protocol (benign-only training, threshold from benign val), which is the strictly harder setting. The F1 values are numerically equal, and the protocol asymmetry is in our favor.

Methodological contribution: `flow_consistency` diagnostic score

Phase 2 masked-prediction consistency loss unlocks a new score that is discriminative only when the model is trained with the consistency loss:

Dataset	baseline (no aux)	Phase 2 (λ=0.3, σ=0.1)
ISCXTor2016	0.6543	0.9011 ± 0.0125 (+0.247)
CICIDS2017	0.5745	0.8770 ± 0.0039 (+0.302)
CICDDoS2019	0.9084	0.9459 ± 0.0188 (+0.038)

On SSH-Patator — the worst class in CICIDS2017 for terminal_norm (0.6407 ± 0.0675) — flow_consistency reaches 0.94, providing a reliable detector where standard density scores fail.

Per-attack-family pattern

terminal_norm dominates on volumetric attacks (DDoS, DoS, Portscan, all DrDoS_*) — saturated 0.97-0.99. Decomposed scores compete only on brute-force / app-layer attacks where flow-level signal is strong but packet-level signal is weak:

Class	n	terminal_norm	best decomposed score	best AUROC
SSH-Patator	168	0.6407 ± 0.0675	`kinetic_flow`	0.9458 ± 0.0080
FTP-Patator	256	0.8963 ± 0.0015	`terminal_flow`	0.9773 ± 0.0049
DoS GoldenEye	448	0.9760 ± 0.0008	`terminal_flow`	0.9868 ± 0.0015

Outside these classes, terminal_norm is the right primary; decomposed scores are diagnostic only.
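Per-class AUROC of the kind tabulated above can be computed by scoring each attack class against the shared benign pool. The sketch below uses the rank (Mann-Whitney) definition of AUROC on toy scores; the class names and values are illustrative, not results from this project.

```python
def auroc(benign, attack):
    """P(attack score > benign score), counting ties as one half."""
    wins = sum((a > b) + 0.5 * (a == b) for a in attack for b in benign)
    return wins / (len(attack) * len(benign))


# Toy scores for one scalar detector (illustrative values only).
benign_scores = [0.1, 0.2, 0.3, 0.4]
attack_scores_by_class = {
    "volumetric": [0.8, 0.9, 1.0],     # well separated from benign
    "brute_force": [0.15, 0.35, 0.5],  # overlaps the benign range
}

per_class = {
    name: auroc(benign_scores, scores)
    for name, scores in attack_scores_by_class.items()
}
for name, value in per_class.items():
    print(f"{name}: AUROC={value:.4f}")
```

Pooling both classes into one attack set would average these two values by sample count, which is how a weak class can disappear inside a strong overall AUROC.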

What the experiments proved

JANUS sets new SOTA vs external Shafir 2026 NF on 3/3 directly comparable within-dataset benchmarks under unbiased Mahalanobis-OAS eval (+0.054 to +0.118, all margins outside 3-seed std). CICIoT2023 is metric-mismatched (F1 vs AUROC) and reported as additional benchmark.
Within-dataset is saturated: JANUS + Mahalanobis-OAS ties our own internal Unified_CFM legacy within ±0.005 AUROC (the seed-noise floor). At AUROC > 0.99 the regime has no resolving power; benchmarks here cannot distinguish models. The right axis is cross-dataset.
JANUS recovers the previously catastrophic reverse cross direction: CICDDoS2019→CICIDS2017 from legacy terminal_norm 0.62 → JANUS Mahalanobis-OAS 0.93. Matches Shafir's 0.93 on the same direction (does not exceed). The "source-likeness collapse" failure mode of terminal_norm is confirmed at the architecture level (≤ 0.63 across 3 distinct backbones) and is broken by the DFM head + Mahalanobis route.
Discrete Flow Matching on flag/direction channels unlocks a new score family (disc_nll_total) that is independent of terminal_norm. It is the single best cross→CICIDS2017 fixed score across all 5 routes (0.9191). Without it the Mahalanobis aggregator has nothing to recover reverse cross with.
Causal-packet attention reduces multi-seed std by ~1.6-8× on every dataset (see Stability), indicating the protocol-causal prior is a stabilizer for CFM training.
Phase-2 consistency loss is no longer the lead mechanism: useful for the flow_consistency diagnostic family, but JANUS's terminal_packet and disc_nll_total heads cover its function without the masked-prediction aux loss.
σ-band noise is a transfer-friendly regularizer — σ=0.6 cross-dataset AUROC is +0.02 over σ=0.1, matching the σ=0.6 sweet spot from Packet_CFM.
Per-attack-family analysis is the right reporting frame — averaged AUROC hides the SSH-Patator-style cases where decomposed scores save the day.

What the experiments disproved

Curvature as primary score: 0.32-0.91 across datasets, much weaker than terminal_norm. Has diagnostic value on SSH-Patator (+0.30) but should not lead reporting.
Jacobian-Hutchinson as primary score: 0.32-0.59 on ISCXTor2016 — below random for some sub-scores. Failed.
Time-profile velocity scores: at best +0.005 over terminal_norm on average. Some per-class wins on brute-force but not enough to lead.

Configuration

# CURRENT SOTA: JANUS (Mixed_CFM + causal-packet attention).
# Configs at: Mixed_CFM/configs/<dataset>_seed{42,43,44}.yaml
model:
  T: 64
  d_model: 128
  n_layers: 4
  n_heads: 4
  mlp_ratio: 4.0
  time_dim: 64
  use_ot: true
  reference_mode: causal_packets    # ← Route A: packet-causal attention

training:
  n_train: 10000
  epochs: 50
  batch_size: 256
  lr: 3.0e-4
  # Mixed CFM packet preprocessing: cont channels z-scored,
  # disc channels (direction + 5 TCP flags) kept as int {0,1}
  sigma: 0.1
  lambda_disc: 1.0                  # ← Route C: DFM cross-entropy weight

scoring (per dataset best):
  ISCXTor2016 / CICDDoS2019:  terminal_norm
  CICIDS2017 / CICIoT2023:    terminal_packet
  cross→CICIDS2017:           disc_nll_total
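As a rough illustration of what the `sigma` key controls, the sketch below builds one training target for the continuous head in the style of a standard conditional flow matching objective: a noisy linear interpolant between a Gaussian sample and the data, with the straight-line velocity as the regression target. This is an assumption about the general technique, not the code in Mixed_CFM/; the minibatch OT pairing implied by `use_ot: true` and the DFM head are omitted, and `cfm_training_pair` and `cfm_loss` are hypothetical names.

```python
import numpy as np


def cfm_training_pair(x1, sigma=0.1, rng=None, t=None):
    """Return (t, x_t, target velocity) for one flow-matching training step."""
    rng = np.random.default_rng() if rng is None else rng
    x0 = rng.standard_normal(x1.shape)  # noise endpoint of the path
    if t is None:
        # One time per sample, broadcast over the remaining axes.
        t = rng.uniform(size=(x1.shape[0],) + (1,) * (x1.ndim - 1))
    # Noisy linear interpolant between x0 (t=0) and x1 (t=1).
    x_t = (1.0 - t) * x0 + t * x1 + sigma * rng.standard_normal(x1.shape)
    target = x1 - x0  # constant velocity along the straight path
    return t, x_t, target


def cfm_loss(predicted_velocity, target):
    """Mean squared error between predicted and target velocity."""
    return float(np.mean((predicted_velocity - target) ** 2))


rng = np.random.default_rng(0)
# (batch_size=256, T=64, 3 continuous channels), random stand-in data.
packets = rng.standard_normal((256, 64, 3))
t, x_t, target = cfm_training_pair(packets, sigma=0.1, rng=rng)
loss_zero_model = cfm_loss(np.zeros_like(target), target)
print(x_t.shape, round(loss_zero_model, 3))
```

In a real training loop the zero prediction would be replaced by the Transformer's velocity output at `(x_t, t)`.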

Legacy config (Unified_CFM with Phase-2 consistency)

Kept for reference; superseded by JANUS on cross-dataset (within-dataset is saturated and JANUS ties legacy in noise):

lambda_flow: 0.3
lambda_packet: 0.3
packet_mask_ratio: 0.5
sigma: 0.1 within / 0.6 cross

Stability

JANUS std vs legacy Unified_CFM std (3 seeds):

Dataset	Legacy std	JANUS std	std reduction
ISCXTor2016	0.0011	0.0007	1.6×
CICIDS2017	0.0021	0.0013	1.6×
CICDDoS2019	0.0010	0.0005	2×
CICIoT2023	0.0017	0.0002	8×

Causal-packet attention is the dominant contributor to std reduction: isolated Route A alone cut terminal_norm std on CICIoT2023 by roughly 2.8× (Route A alone: 0.0006 vs baseline 0.0017).
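The reduction factors in the stability table follow from dividing the legacy std by the JANUS std. The short check below uses only values copied from the table and the Route A figure quoted above.

```python
# (legacy std, JANUS std) per dataset, copied from the stability table.
stds = {
    "ISCXTor2016": (0.0011, 0.0007),
    "CICIDS2017": (0.0021, 0.0013),
    "CICDDoS2019": (0.0010, 0.0005),
    "CICIoT2023": (0.0017, 0.0002),
}

reduction = {name: legacy / janus for name, (legacy, janus) in stds.items()}
for name, factor in reduction.items():
    print(f"{name}: {factor:.1f}x")

# Route A alone on CICIoT2023: baseline 0.0017 vs 0.0006.
route_a_only = 0.0017 / 0.0006
print(f"Route A alone: {route_a_only:.1f}x")
```

With only 3 seeds per cell these std estimates are themselves noisy, so the factors are best read as rough magnitudes.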

Legacy reference (kept for completeness):

terminal_norm ISCXTor2016: ±0.0011 (σ=0.1) / ±0.0019 (σ=0.6)
terminal_norm CICIDS2017: ±0.0021 (σ=0.6)
terminal_norm CICDDoS2019: ±0.0010 (σ=0.1)
cross terminal_norm σ=0.6: ±0.0032
cross terminal_flow σ=0.6: ±0.0036

The legacy margins over Shafir (+0.121 on ISCXTor2016 and +0.055 on CICIDS2017) are far larger than these seed stds, so they are not single-seed artifacts.

Source artifacts

artifacts/locked_baselines.md — verified Shafir baselines (PDF inspection trail).
artifacts/sigma_validation.md — full 4×2 σ-sensitivity table (σ ∈ {0.1, 0.6} × 4 tasks, 3 seeds each) and per-task σ-selection protocol.
artifacts/reverse_cross.md — reverse direction CICDDoS2019 → CICIDS2017 evaluation (3 seeds × 2 σ × 16 scores). Asymmetry finding.
artifacts/phase25_multiseed_2026_04_25/PER_ATTACK_TABLE.md — per-attack multi-seed table (granular terminal_norm vs decomposed scores per class).
artifacts/phase{0,1,25}*/<config_name>_seed*/phase1_summary.json — raw per-seed eval results across all experiments.
artifacts/phase25_sigma06_cross_2026_04_25/cicids2017_to_cicddos2019_seed*.json — 3-seed cross-dataset eval JSONs.
artifacts/baselines/if_ocsvm_cross_2026_05_11/CROSS_MATRIX_3x3.md — IF/OCSVM 3×3 cross matrix on 20-d canonical flow features (path A).
artifacts/baselines/if_ocsvm_cross_packets_2026_05_11/CROSS_MATRIX_3x3.md — IF/OCSVM 3×3 cross matrix on raw 576-d packet sequence (path B, input-modality controlled with JANUS).
Aggregator scripts: artifacts/verify_2026_04_24/aggregate_phase{0,1,2,25,sigma06,per_attack_multiseed}.py.
Orchestrator scripts: artifacts/verify_2026_04_24/run_phase*.sh.

Phase summary markdown reports were superseded by this RESULTS.md and removed during the 2026-04-25 baseline-lock cleanup. The aggregator scripts can regenerate any historical view from the raw JSON results.
