Final Results
Main-line model: JANUS
JANUS (Joint Anomaly via Normalizing-flows of Unified States) is the
current main-line model. Codebase identifier is Mixed_CFM/; JANUS is the
external/published name.
JANUS = a packet-causal Transformer backbone with two output heads:
- Continuous Flow Matching head over (size, IAT, win) packet channels
- Discrete Flow Matching (DFM) head over the 6 binary protocol-flag / direction channels
The two heads are trained jointly (σ=0.1, lambda_disc=1.0, use_ot=true, no Phase-2 consistency loss). Downstream evaluation uses a single deployable scalar score: the Mahalanobis-OAS distance over the 10-d score vector emitted by JANUS, fit on benign val only (no attack labels).
To our knowledge, JANUS is the first NIDS method to use Flow Matching as the training paradigm in mixed continuous-discrete state spaces over packet sequences.
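As a rough illustration of the Mahalanobis-OAS score described above, the sketch below fits a mean and an OAS-shrunk covariance on benign-only score vectors and returns the squared distance d² for new vectors. The function names, the synthetic 10-d data, and the exact form of the shrinkage coefficient (the Chen et al. OAS formula as we recall it) are assumptions for illustration; this is not the project's aggregator code, and a library estimator such as scikit-learn's OAS could be used instead.

```python
import numpy as np

def fit_oas(benign_scores):
    """Fit mean and OAS-shrunk precision matrix on benign-only vectors (n, p)."""
    x = np.asarray(benign_scores, dtype=float)
    n, p = x.shape
    mean = x.mean(axis=0)
    centered = x - mean
    emp = centered.T @ centered / n            # empirical (ML) covariance
    tr = np.trace(emp)
    tr_sq = np.sum(emp * emp)                  # equals trace(S @ S) for symmetric S
    # Assumed OAS shrinkage coefficient (Chen et al. form), clipped to [0, 1].
    num = (1.0 - 2.0 / p) * tr_sq + tr ** 2
    den = (n + 1.0 - 2.0 / p) * (tr_sq - tr ** 2 / p)
    rho = 1.0 if den <= 0 else min(1.0, num / den)
    shrunk = (1.0 - rho) * emp + rho * (tr / p) * np.eye(p)
    return mean, np.linalg.inv(shrunk), rho

def mahalanobis_sq(scores, mean, precision):
    """Squared Mahalanobis distance d^2 of each row to the benign fit."""
    d = np.asarray(scores, dtype=float) - mean
    return np.einsum("ij,jk,ik->i", d, precision, d)

# Synthetic stand-ins for the 10-d score vectors (not real JANUS outputs).
rng = np.random.default_rng(0)
benign_val = rng.normal(size=(500, 10))
benign_test = rng.normal(size=(200, 10))
attack_test = rng.normal(loc=3.0, size=(200, 10))

mean, precision, rho = fit_oas(benign_val)     # benign val only, no attack labels
d2_benign = mahalanobis_sq(benign_test, mean, precision)
d2_attack = mahalanobis_sq(attack_test, mean, precision)
```

Only `benign_val` enters the fit, which is what keeps the score free of attack labels; `d2_attack` and `d2_benign` are the per-flow anomaly scores that an AUROC would then be computed on.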
All numbers reported are 3-seed mean ± std. Two model families are tracked:
- Unified_CFM (legacy / our previous internal recipe): single Transformer
over [FLOW + packets] with Phase-2 consistency loss; λ=0.3. Strongest
single-fixed-score (terminal_norm) within-dataset baseline.
- JANUS = A+C combo (current main line, 2026-05-01): see above.
New SOTA on cross-dataset transfer under Mahalanobis auto-routing;
matches legacy within-dataset under the same protocol. See
artifacts/route_comparison/SCORE_ROUTER.md.
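The "3-seed mean ± std" convention can be reproduced with a few lines. The per-seed values below are made-up placeholders, and the use of the sample standard deviation (n−1 denominator) is an assumption, since the document does not say which estimator the aggregator scripts use.

```python
from statistics import mean, stdev

def summarize(seed_aurocs):
    """Return (mean, sample std) over per-seed AUROC values."""
    return mean(seed_aurocs), stdev(seed_aurocs)

def fmt(m, s):
    """Format in the 'mean ± std' style used in the tables."""
    return f"{m:.4f} ± {s:.4f}"

# Hypothetical per-seed AUROCs for seeds 42, 43, 44 (illustrative only).
example = [0.990, 0.991, 0.992]
m, s = summarize(example)
report = fmt(m, s)
```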
Caveats that travel with all external claims
- CICIoT2023 vs Shafir is a metric mismatch, not a +SOTA result. Shafir
reports F1=0.9951 with threshold tuned by Youden's J (TPR−FPR) on a
1K+1K balanced val set (uses attack labels for threshold selection only)
and tested on 10K+10K balanced. We report AUROC=0.9594 (Mahalanobis-OAS).
Different metric. CICIoT2023 should be presented as "additional
benchmark, no Shafir AUROC published" rather than "+SOTA". To make it
directly comparable, either reproduce Shafir's threshold protocol on
JANUS's d² to compute F1, or run Shafir's GitHub
lshafir/NF-anomaly-detection to extract NF AUROC.
- Reverse cross (CICDDoS2019→CICIDS2017) matches Shafir, does not beat.
JANUS gets 0.9301 ± 0.0122. Shafir Table IX row 3 reports 0.93. The
"+0.31" gain is vs our own legacy terminal_norm (0.62), not vs Shafir.
- Cross-dataset is calibrated cross-domain transfer, not zero-shot. The Mahalanobis-OAS aggregator is fit on the target dataset's benign val (unsupervised: no attack labels). Comparison vs Shafir is fair (his NF threshold also calibrated on target benign), but the language must be "calibrated cross-domain transfer" not "zero-shot transfer".
- Aggregator selection (OAS over LedoitWolf / plain Mahal / max-z) was
post-hoc. OAS picked because consistently top across all cells in
SCORE_ROUTER.md; differences vs LedoitWolf ≤ 0.005. Strict pre-registration would say "we evaluated 5 benign-only aggregators and OAS performed best".
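The F1 comparison suggested in the first caveat (Youden's J threshold on a balanced validation set, then F1 on a balanced test set) could be sketched as follows. This is a minimal reading of the protocol as summarized here, not Shafir's code; the function names and the toy scores are hypothetical, and a higher score is assumed to mean "more anomalous".

```python
def youden_threshold(benign_val, attack_val):
    """Pick the threshold maximizing J = TPR - FPR on labelled validation scores."""
    candidates = sorted(set(benign_val) | set(attack_val))
    best_tau, best_j = None, -1.0
    for tau in candidates:
        tpr = sum(s >= tau for s in attack_val) / len(attack_val)
        fpr = sum(s >= tau for s in benign_val) / len(benign_val)
        j = tpr - fpr
        if j > best_j:                 # ties keep the lowest threshold
            best_tau, best_j = tau, j
    return best_tau, best_j

def f1_at(tau, benign_test, attack_test):
    """F1 of the 'score >= tau means attack' rule on a held-out test set."""
    tp = sum(s >= tau for s in attack_test)
    fp = sum(s >= tau for s in benign_test)
    fn = len(attack_test) - tp
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

# Toy d^2-like scores (illustrative only, not real results).
tau, j = youden_threshold([1, 2, 3, 4], [3.5, 5, 6, 7])
f1 = f1_at(tau, [1.5, 2.5, 3.8, 2.0], [3.6, 4.5, 6.5, 3.0])
```

Unlike the benign-only Mahalanobis fit, this threshold selection uses attack labels on the validation split, so any F1 obtained this way belongs alongside Shafir's F1, not alongside the AUROC rows.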
Headline performance
External SOTA baselines (Shafir 2026 NF + Shapley) verified directly from
the paper (artifacts/locked_baselines.md). Unified_CFM "legacy" rows are
our previous internal recipe (Phase-2 consistency loss + per-task σ); they
are reported as internal ablation, NOT as the SOTA-comparison baseline.
A. vs External SOTA — within-dataset, JANUS + Mahalanobis-OAS (no selection bias)
| Task | Shafir 2026 SOTA | JANUS + Mahalanobis-OAS | Δ vs Shafir |
|---|---|---|---|
| ISCXTor2016 (NonTor → Tor) | 0.8731 (AUROC) | 0.9908 ± 0.0012 | +0.118 ⭐⭐ |
| CICIDS2017 within | 0.9303 (AUROC) | 0.9845 ± 0.0030 | +0.054 ⭐ |
| CICDDoS2019 within | 0.93 (AUROC) | 0.9913 ± 0.0009 | +0.061 ⭐ |
| CICIoT2023 within | F1=0.9951 (no AUROC) | 0.9594 ± 0.0028 (AUROC) | N/A — metric mismatch, see Caveat 1 |
3/3 directly comparable within-dataset benchmarks: JANUS sets new SOTA vs external Shafir baselines, with margins +0.054 to +0.118 — all far outside seed std. This holds under fully selection-bias-free eval (single Mahalanobis-OAS aggregator on the 10-d score vector, fit on benign val only, no attack labels). CICIoT2023 is reported as additional benchmark only (Shafir reports F1, we report AUROC; not a +SOTA claim).
A'. Reference only — best per-channel fixed score (per-dataset selection-biased; do NOT use as headline SOTA)
⚠️ Selection-biased: the channel chosen per row (terminal_norm vs
terminal_packet) requires looking at attack-label AUROC to pick. Use this
table as ablation upper bound only, not as the SOTA claim. The honest
external SOTA claim is in table A above.
| Task | Shafir 2026 | JANUS (best fixed channel) | Δ vs Shafir |
|---|---|---|---|
| ISCXTor2016 | 0.8731 | 0.9954 ± 0.0007 (terminal_norm) | +0.122 |
| CICIDS2017 | 0.9303 | 0.9932 ± 0.0013 (terminal_packet) | +0.063 |
| CICDDoS2019 | 0.93 | 0.9970 ± 0.0005 (terminal_norm) | +0.067 |
| CICIoT2023 | F1=0.9951 (different metric) | 0.9671 ± 0.0002 (terminal_packet) | N/A |
B. Internal ablation — JANUS vs our previous Unified_CFM legacy
This is for tracking how JANUS does relative to our own previous internal best (not for the SOTA claim — Unified_CFM legacy is also our work). Within-dataset AUROC has saturated above 0.99; differences ≤ 0.005 are seed noise and the regime has no resolving power. The discriminating axis is cross-dataset (next section).
| Task | Legacy Unified_CFM | JANUS + Mahalanobis-OAS | JANUS (best fixed) |
|---|---|---|---|
| ISCXTor2016 | 0.9945 ± 0.0011 | 0.9908 ± 0.0012 | 0.9954 ± 0.0007 |
| CICIDS2017 | 0.9858 ± 0.0021 | 0.9845 ± 0.0030 | 0.9932 ± 0.0013 |
| CICDDoS2019 | 0.9960 ± 0.0010 | 0.9913 ± 0.0009 | 0.9970 ± 0.0005 |
| CICIoT2023 | 0.9612 ± 0.0017 | 0.9594 ± 0.0028 | 0.9671 ± 0.0002 |
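JANUS + Mahalanobis-OAS is within 0.005 AUROC of the legacy recipe on every within-dataset task (largest gap 0.0047, on CICDDoS2019). The CICIDS2017 and CICIoT2023 gaps fall inside one seed std; the ISCXTor2016 and CICDDoS2019 gaps do not, but remain below the 0.005 resolution of this saturated regime. Best-fixed (per-dataset selection-biased) beats legacy on the mean for 4/4 tasks but cannot be cited as a clean SOTA claim. The decisive value-add is on cross-dataset transfer.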
C. Cross-dataset transfer — JANUS + Mahalanobis-OAS
⚠️ Δ columns are vs our own legacy (not vs Shafir). vs Shafir: forward beats (+0.07 over 0.89), reverse matches (0.93 = 0.93). See Caveats above.
| Task | Legacy terminal_norm | JANUS + Mahalanobis-OAS | Δ vs legacy | Shafir | vs Shafir |
|---|---|---|---|---|---|
| CICIoT2023 → CICIDS2017 | 0.7700 ± 0.0133 | 0.8983 ± 0.0098 | +0.128 | (n/a) | (n/a) |
| CICIoT2023 → CICDDoS2019 | 0.7473 ± 0.0223 | 0.8944 ± 0.0068 | +0.147 | (n/a) | (n/a) |
| CICIDS2017 → CICDDoS2019 (forward) | 0.911 (legacy SOTA) | 0.9594 ± 0.0046 | +0.048 | 0.89 | +0.07 |
| CICDDoS2019 → CICIDS2017 (reverse) | 0.62 (legacy) | 0.9301 ± 0.0122 | +0.31 | 0.93 | 0 (matches) |
Full 4×4 cross matrix at artifacts/route_comparison/CROSS_MATRIX.md. All
12 off-diagonal directions tested (3 seeds each = 36 cross evaluations).
Average off-diagonal improvement: +0.175 over terminal_norm
(0.660 → 0.835). The four "source-likeness collapse" cells where
terminal_norm ≤ 0.57 (essentially random) are all recovered to ≥ 0.75.
See artifacts/route_comparison/SCORE_ROUTER.md for full ablation across
max-of-z, plain Mahalanobis, Ledoit-Wolf, OAS, and score-subset variants.
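Of the benign-only aggregators listed above, max-of-z is the simplest to write down. The sketch below assumes it means "z-score every score channel against the target dataset's benign val statistics, then take the largest z across channels"; that reading, the function names, and the synthetic data are assumptions, and SCORE_ROUTER.md remains the reference for what was actually run.

```python
import numpy as np

def fit_z(benign_val, eps=1e-12):
    """Per-channel mean and std from target-domain benign validation scores."""
    x = np.asarray(benign_val, dtype=float)
    return x.mean(axis=0), x.std(axis=0) + eps   # eps guards constant channels

def max_of_z(scores, mean, std):
    """Largest per-channel z-score for each row: one scalar anomaly score."""
    z = (np.asarray(scores, dtype=float) - mean) / std
    return z.max(axis=1)

# Synthetic 10-d score vectors; the "attack" rows deviate in one channel only.
rng = np.random.default_rng(1)
target_benign_val = rng.normal(size=(400, 10))
benign_test = rng.normal(size=(200, 10))
attack_test = rng.normal(size=(200, 10))
attack_test[:, 3] += 5.0

mu, sd = fit_z(target_benign_val)     # calibrated on target benign, no attack labels
s_benign = max_of_z(benign_test, mu, sd)
s_attack = max_of_z(attack_test, mu, sd)
```

Because the calibration statistics come from the target dataset's benign split, this is the "calibrated cross-domain transfer" setting, not zero-shot.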
Reverse cross (CICDDoS2019 → CICIDS2017) — 2026-05-01 update
The reverse direction was the project's "stuck" failure mode (memory note
reverse_cross_score_redirection_2026_04_25). Three model variants compared:
| Model | terminal_norm | best single score (post-hoc) | Mahalanobis-OAS |
|---|---|---|---|
| Legacy Unified + consistency | 0.626 | pna_packet_median 0.882 | 0.824 |
| Legacy Unified no consistency | 0.554 | pna_packet_median 0.852 | 0.893 |
| JANUS (new) | 0.519 | disc_nll_total 0.903 ± 0.012 | 0.930 ± 0.015 |
terminal_norm collapses (≈ random) across all model variants — this is
the source-likeness-classifier failure mode confirmed at the architecture
level, not just a single-recipe artifact. The recovery path is:
- DFM head gives a disc_nll score that captures protocol-flag distribution, which is genuinely transfer-stable.
- Mahalanobis-OAS on the 10-d score vector aggregates disc_nll with the (broken-but-not-useless) terminal scores into a 0.93 ± 0.015 AUROC.
- Compared to Shafir's reverse 0.93 on this direction, JANUS + Mahalanobis-OAS matches that benchmark (0.93 = 0.93). Does NOT beat.
This is +0.31 over our own legacy memory baseline of 0.62. The "main
attack direction" recorded in reverse_cross_score_redirection_2026_04_25
is now substantially solved.
Thresholded F1 / Precision / Recall / TPR@FPR (unsupervised protocol, τ from
benign-val percentile) are reported separately in RESULTS_THRESHOLDED.md.
Headline thresholded numbers: CICDDoS2019 within terminal_norm F1=0.993 ± 0.001
at τ=P95; cross terminal_norm F1=0.632 ± 0.051 at τ=P95 (precision ≈ 0.95, recall ≈ 0.47).
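The unsupervised thresholding protocol (τ taken as a percentile of benign-val scores, then precision / recall / F1 on the test split) might look like the sketch below. The percentile interpolation (NumPy's default linear method), the strict `>` comparison, and the toy scores are assumptions; RESULTS_THRESHOLDED.md is the reference for the real numbers.

```python
import numpy as np

def benign_percentile_threshold(benign_val_scores, q=95.0):
    """tau = q-th percentile of benign validation scores (no attack labels used)."""
    return float(np.percentile(benign_val_scores, q))

def precision_recall_f1(tau, benign_test, attack_test):
    """Metrics for the rule 'score > tau means attack'."""
    tp = int(np.sum(np.asarray(attack_test) > tau))
    fp = int(np.sum(np.asarray(benign_test) > tau))
    fn = len(attack_test) - tp
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return precision, recall, f1

# Toy scores (illustrative only): benign 0..99, attacks 90..109.
tau = benign_percentile_threshold(np.arange(100), q=95.0)
precision, recall, f1 = precision_recall_f1(tau, np.arange(100), np.arange(90, 110))
```

With τ fixed from benign data alone, a domain shift that compresses attack scores shows up as lost recall while precision holds, which is the pattern in the cross-dataset line above (precision ≈ 0.95, recall ≈ 0.47).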
Note on cross-dataset baseline: Shafir's Table IX is asymmetric. The IDS2017→DDoS2019 (forward) direction reads 0.89, not 0.93. The 0.93 number is the reverse direction (DDoS2019→IDS2017), which is the baseline for the reverse-cross results above. See
artifacts/locked_baselines.md.
Note on σ choice: headline numbers use per-task best σ (σ=0.1 for ISCXTor2016 and CICDDoS2019; σ=0.6 for CICIDS2017 within and cross). Within-dataset tasks are σ-insensitive within seed noise; cross-dataset requires σ=0.6. Single-policy σ=0.6 also beats Shafir on 4/4. Full 4×2 sensitivity table in
artifacts/sigma_validation.md.
Methodological contribution: flow_consistency diagnostic score
Phase 2 masked-prediction consistency loss unlocks a new score that is discriminative only when the model is trained with the consistency loss:
| Dataset | baseline (no aux) | Phase 2 (λ=0.3, σ=0.1) |
|---|---|---|
| ISCXTor2016 | 0.6543 | 0.9011 ± 0.0125 (+0.247) |
| CICIDS2017 | 0.5745 | 0.8770 ± 0.0039 (+0.302) |
| CICDDoS2019 | 0.9084 | 0.9459 ± 0.0188 (+0.038) |
On SSH-Patator — the worst class in CICIDS2017 for terminal_norm
(0.6407 ± 0.0675) — flow_consistency reaches 0.94, providing a reliable
detector where standard density scores fail.
Per-attack-family pattern
terminal_norm dominates on volumetric attacks (DDoS, DoS, Portscan, all
DrDoS_*) — saturated 0.97-0.99. Decomposed scores compete only on
brute-force / app-layer attacks where flow-level signal is strong but
packet-level signal is weak:
| Class | n | terminal_norm | best decomposed score | best AUROC |
|---|---|---|---|---|
| SSH-Patator | 168 | 0.6407 ± 0.0675 | kinetic_flow | 0.9458 ± 0.0080 |
| FTP-Patator | 256 | 0.8963 ± 0.0015 | terminal_flow | 0.9773 ± 0.0049 |
| DoS GoldenEye | 448 | 0.9760 ± 0.0008 | terminal_flow | 0.9868 ± 0.0015 |
Outside these classes, terminal_norm is the right primary; decomposed
scores are diagnostic only.
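A per-attack-family breakdown like the table above amounts to computing one AUROC per attack class against the shared benign pool. The sketch below uses the pairwise (Mann-Whitney) definition of AUROC; the `BENIGN` label string, the function names, and the toy scores are assumptions for illustration.

```python
def auroc(benign, attack):
    """P(attack score > benign score), counting ties as 0.5."""
    wins = 0.0
    for a in attack:
        for b in benign:
            if a > b:
                wins += 1.0
            elif a == b:
                wins += 0.5
    return wins / (len(attack) * len(benign))

def per_class_auroc(scores, labels, benign_label="BENIGN"):
    """AUROC of each attack class vs. all benign flows, keyed by class name."""
    benign = [s for s, y in zip(scores, labels) if y == benign_label]
    classes = sorted({y for y in labels if y != benign_label})
    return {
        c: auroc(benign, [s for s, y in zip(scores, labels) if y == c])
        for c in classes
    }

# Toy anomaly scores (illustrative only).
scores = [0.1, 0.2, 0.3, 0.4, 0.25, 0.5, 0.9, 0.8]
labels = ["BENIGN"] * 4 + ["SSH-Patator"] * 2 + ["DDoS"] * 2
table = per_class_auroc(scores, labels)
```

Running this once per candidate score (terminal_norm, kinetic_flow, terminal_flow, ...) gives the columns needed to see where a decomposed score overtakes the primary one.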
What the experiments proved
- JANUS sets new SOTA vs external Shafir 2026 NF on 3/3 directly comparable within-dataset benchmarks under unbiased Mahalanobis-OAS eval (+0.054 to +0.118, all margins outside 3-seed std). CICIoT2023 is metric-mismatched (F1 vs AUROC) and reported as additional benchmark.
- Within-dataset is saturated: JANUS + Mahalanobis-OAS stays within ±0.005 of our own
internal Unified_CFM legacy on every task. At AUROC
0.99 the regime has no resolving power; benchmarks here cannot distinguish models. The right axis is cross-dataset.
- JANUS recovers the previously catastrophic reverse cross direction:
CICDDoS2019→CICIDS2017 from legacy terminal_norm 0.62 → JANUS Mahalanobis-OAS 0.93. Matches Shafir's 0.93 on the same direction (does not exceed). The "source-likeness collapse" failure mode of terminal_norm is confirmed at the architecture level (≤ 0.63 across 3 distinct backbones) and is broken by the DFM head + Mahalanobis route.
- Discrete Flow Matching on flag/direction channels unlocks a new score
family (disc_nll_total) that is independent of terminal_norm. It is the single best cross→CICIDS2017 fixed score across all 5 routes (0.9191). Without it the Mahalanobis aggregator has nothing to recover reverse cross with.
- Causal-packet attention reduces multi-seed std by roughly 1.6-8× on every dataset, indicating the protocol-causal prior is a stabilizer for CFM training.
- Phase-2 consistency loss is no longer the lead mechanism: useful for
the flow_consistency diagnostic family, but JANUS's terminal_packet and disc_nll_total heads cover its function without the masked-prediction aux loss.
- σ-band noise is a transfer-friendly regularizer: σ=0.6 cross-dataset AUROC is +0.02 over σ=0.1, matching the σ=0.6 sweet spot from Packet_CFM.
- Per-attack-family analysis is the right reporting frame: averaged AUROC hides the SSH-Patator-style cases where decomposed scores save the day.
What the experiments disproved
- Curvature as primary score: 0.32-0.91 across datasets, much weaker
than terminal_norm. Has diagnostic value on SSH-Patator (+0.30) but should not lead reporting.
- Jacobian-Hutchinson as primary score: 0.32-0.59 on ISCXTor2016, below random for some sub-scores. Failed.
- Time-profile velocity scores: at best +0.005 over terminal_norm on average. Some per-class wins on brute-force but not enough to lead.
Configuration
# CURRENT SOTA: JANUS (Mixed_CFM + causal-packet attention).
# Configs at: Mixed_CFM/configs/<dataset>_ac_combo_seed{42,43,44}.yaml
model:
  T: 64
  d_model: 128
  n_layers: 4
  n_heads: 4
  mlp_ratio: 4.0
  time_dim: 64
  use_ot: true
  reference_mode: causal_packets  # ← Route A: packet-causal attention
training:
  n_train: 10000
  epochs: 50
  batch_size: 256
  lr: 3.0e-4
  # Mixed CFM packet preprocessing: cont channels z-scored,
  # disc channels (direction + 5 TCP flags) kept as int {0,1}
  sigma: 0.1
  lambda_disc: 1.0  # ← Route C: DFM cross-entropy weight
scoring (per dataset best):
  ISCXTor2016 / CICDDoS2019: terminal_norm
  CICIDS2017 / CICIoT2023: terminal_packet
  cross→CICIDS2017: disc_nll_total
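To make the roles of sigma and lambda_disc in the config above concrete, here is a minimal NumPy sketch of a joint objective: a conditional flow-matching regression term on the 3 continuous channels plus a weighted cross-entropy term on the 6 binary channels. It assumes the common linear-interpolant CFM path with constant noise σ and a plain binary cross-entropy for the discrete head; OT pairing (use_ot), the discrete corruption process, and the Transformer itself are omitted, so this is not the Mixed_CFM training code.

```python
import numpy as np

def cfm_pair(x0, x1, t, sigma, rng):
    """Noisy interpolant x_t = (1-t)*x0 + t*x1 + sigma*eps and target velocity x1 - x0."""
    eps = rng.standard_normal(x0.shape)
    tt = t[:, None, None]                      # broadcast one time per sequence
    xt = (1.0 - tt) * x0 + tt * x1 + sigma * eps
    return xt, x1 - x0

def joint_loss(v_pred, u_target, flag_logits, flags, lambda_disc):
    """Continuous MSE term plus lambda_disc times binary cross-entropy on flags."""
    cont = np.mean((v_pred - u_target) ** 2)
    # Numerically stable BCE-with-logits.
    bce = np.mean(
        np.maximum(flag_logits, 0.0)
        - flag_logits * flags
        + np.log1p(np.exp(-np.abs(flag_logits)))
    )
    return cont + lambda_disc * bce, cont, bce

# Shapes follow the config: T=64 packets, 3 continuous and 6 binary channels.
rng = np.random.default_rng(0)
B, T = 4, 64
x0 = rng.standard_normal((B, T, 3))            # noise sample
x1 = rng.standard_normal((B, T, 3))            # stand-in for z-scored packet features
t = rng.uniform(size=B)
xt, u = cfm_pair(x0, x1, t, sigma=0.1, rng=rng)

flags = rng.integers(0, 2, size=(B, T, 6)).astype(float)
v_pred = np.zeros_like(u)                      # placeholder for the model's velocity output
flag_logits = np.zeros_like(flags)             # placeholder for the DFM head's logits
total, cont, bce = joint_loss(v_pred, u, flag_logits, flags, lambda_disc=1.0)
```

With lambda_disc=1.0 the two terms are simply summed, so the relative scale of the discrete loss is set by the cross-entropy itself, not by the weight.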
Legacy config (Unified_CFM with Phase-2 consistency)
Kept for reference; superseded by JANUS on cross-dataset (within-dataset is saturated and JANUS ties legacy in noise):
lambda_flow: 0.3
lambda_packet: 0.3
packet_mask_ratio: 0.5
sigma: 0.1 within / 0.6 cross
Stability
JANUS std (best fixed channel, as in table B) vs legacy Unified_CFM std (3 seeds):
| Dataset | Legacy std | JANUS std | std reduction |
|---|---|---|---|
| ISCXTor2016 | 0.0011 | 0.0007 | 1.6× |
| CICIDS2017 | 0.0021 | 0.0013 | 1.6× |
| CICDDoS2019 | 0.0010 | 0.0005 | 2× |
| CICIoT2023 | 0.0017 | 0.0002 | 8× |
Causal-packet attention is the dominant contributor to std reduction: isolated Route A alone cut terminal_norm std on CICIoT2023 by nearly 3× (Route A alone: 0.0006 vs baseline 0.0017).
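The reduction factors are plain ratios of the stds in the stability table, which the short check below recomputes. Only values quoted in this section are used; the dictionary names are ad hoc.

```python
# Seed stds copied from the stability table above.
legacy_std = {"ISCXTor2016": 0.0011, "CICIDS2017": 0.0021,
              "CICDDoS2019": 0.0010, "CICIoT2023": 0.0017}
janus_std = {"ISCXTor2016": 0.0007, "CICIDS2017": 0.0013,
             "CICDDoS2019": 0.0005, "CICIoT2023": 0.0002}

# Reduction factor = legacy std / JANUS std.
reduction = {k: legacy_std[k] / janus_std[k] for k in legacy_std}

# Route A alone on CICIoT2023 terminal_norm: baseline 0.0017 vs 0.0006.
route_a_only = 0.0017 / 0.0006

for name, r in reduction.items():
    print(f"{name}: {r:.2f}x")
print(f"Route A alone (CICIoT2023): {route_a_only:.2f}x")
```

The unrounded CICIoT2023 ratio is 8.5, which the table reports as 8×.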
Legacy reference (kept for completeness):
- terminal_norm ISCXTor2016: ±0.0011 (σ=0.1) / ±0.0019 (σ=0.6)
- terminal_norm CICIDS2017: ±0.0021 (σ=0.6)
- terminal_norm CICDDoS2019: ±0.0010 (σ=0.1)
- cross terminal_norm σ=0.6: ±0.0032
- cross terminal_flow σ=0.6: ±0.0036
The legacy margins over Shafir (+0.121 on ISCXTor2016 and +0.055 on CICIDS2017) are not single-seed artifacts.
Source artifacts
- RESULTS_THRESHOLDED.md: F1 / Precision / Recall / TPR@FPR under unsupervised threshold protocol (τ = benign-val P95/P99) for CICDDoS2019 within and CICIDS2017→CICDDoS2019 cross.
- artifacts/locked_baselines.md: verified Shafir baselines (PDF inspection trail).
- artifacts/sigma_validation.md: full 4×2 σ-sensitivity table (σ ∈ {0.1, 0.6} × 4 tasks, 3 seeds each) and per-task σ-selection protocol.
- artifacts/reverse_cross.md: reverse direction CICDDoS2019 → CICIDS2017 evaluation (3 seeds × 2 σ × 16 scores). Asymmetry finding.
- artifacts/phase25_multiseed_2026_04_25/PER_ATTACK_TABLE.md: per-attack multi-seed table (granular terminal_norm vs decomposed scores per class).
- artifacts/phase{0,1,25}*/<config_name>_seed*/phase1_summary.json: raw per-seed eval results across all experiments.
- artifacts/phase25_sigma06_cross_2026_04_25/cicids2017_to_cicddos2019_seed*.json: 3-seed cross-dataset eval JSONs.
- Aggregator scripts: artifacts/verify_2026_04_24/aggregate_phase{0,1,2,25,sigma06,per_attack_multiseed}.py.
- Orchestrator scripts: artifacts/verify_2026_04_24/run_phase*.sh.
Phase summary markdown reports were superseded by this RESULTS.md and
removed during the 2026-04-25 baseline-lock cleanup. The aggregator
scripts can regenerate any historical view from the raw JSON results.