JANUS Paper — Survey, Pain Points, and Outline

Date: 2026-05-04

Scope: Field survey + framing for the JANUS paper (Mixed-CFM + DFM + causal-packet attention + Mahalanobis-OAS aggregator).

Target sections: Introduction · Background · Methodology · Evaluation.

Framing rule (from RESULTS.md caveats): the headline claim is cross-dataset robustness + first FM/DFM in NIDS, not "4/4 within SOTA". Within-dataset is saturated; the discriminating axis is cross-dataset.


Part A. State of the field (2024–2026)

A.1 Method families and where each one is stuck

| Family | Recent representative work | Core mechanism | Documented shortcoming |
|---|---|---|---|
| Normalizing Flows (NF) | Shafir et al. T-Netw 2026 (our main baseline), NF-NIDS (IAF/NSF), PrivFlow-NIDS (Springer 2025) | Explicit log-likelihood on benign; anomaly = low likelihood | Likelihood ≠ anomaly score; coupling-layer architecture brittle; categorical / flag fields handled crudely |
| Reconstruction (AE / VAE / MemAE) | KitNET, MemAE, SparseMemAE | Reconstruction error as anomaly score | "Identity mapping trap": OOD samples can be perfectly reconstructed (NeurIPS '24, OpenReview '25 multi-paper consensus) |
| Diffusion | ConMD (TIFS 2026) (our main baseline), DMAD (IJCAI '25 survey), UnDiff (WWW '25), RDUAE | Denoising / score-based density | Slow inference; multi-step trace storage; training instability |
| GAN | TIPSO-GAN (NDSS 2026) (our main baseline), DEGAN | Discriminator score / reconstruction | Mode collapse and training instability persist; TIPSO-GAN spends extra optimisation budget (PSO) to mitigate |
| Self-supervised contrastive | Self-Supervised Transformer Contrastive Learning (NetSci '25), GraphIDS (NeurIPS 2025), SSGMHAN | Representation learning, downstream OCSVM / Mahal | Two-stage pipeline (rep + detector); no end-to-end anomaly score |
| Foundation models | Traffic-MoE, ETC-IMC, Language-of-Network GBC | Pre-train / fine-tune over packet-byte tokens | Resource-heavy; primary task is encrypted-traffic classification, not AD |
| Knowledge distillation | ConMD (TIFS 2026), Spatial-Temporal KD | Teacher → student alignment | Sensitive to teacher quality; two-stage |
| Flow Matching (FM) | TCCM (NeurIPS 2025) tabular, rFM (image AD 2025), Lipman 2023 | Velocity-field regression, one-step deviation as score | No application to NIDS yet (the gap JANUS fills) |
| Discrete FM (DFM) | Gat et al. NeurIPS 2024, Fisher Flow Matching, FlowMol (molecules 2024–2025) | FM over discrete state space | No application to NIDS yet |

A.2 Dataset / benchmark situation

  • Within-dataset is saturated. Shafir 0.93 → JANUS 0.99 → TIPSO-GAN F1 = 0.99 on CICIDS2017 / CICDDoS2019 are all within seed noise of each other.
  • CICIDS2017 has documented benchmark bias (synthetic vs real-world traffic distribution mismatch — arXiv 2403.17458 "Expectations Versus Reality"; multiple 20242025 critiques).
  • Cross-dataset is now the field's chosen discriminating axis. HDSE-IDS, Transformer-IDS w/ calibration, Few-shot multi-domain fusion (PLOS One 2025), and the cross-dataset generalisation review (arXiv 2402.10974) all centre on the same problem.
  • AUROC alone is increasingly seen as inadequate for operational reporting; thresholded F1, Precision, Recall, and TPR @ FPR = 1% are demanded by the SOC community.

A.3 Operational pain points (the "why this matters" stack)

| Pain point | Quantified evidence | Source |
|---|---|---|
| Operational FPR | ~99 % of NIDS alerts are FP; one OT refinery reported 27 000 alerts → 76 real; 51 % of SOC teams describe alert volume as unmanageable | Trend Micro 2024, OT IDS surveys 2024–2025 |
| Cross-domain deployment collapse | Single-dataset trained IDSes typically drop 0.10–0.30 AUROC across environments | Tandfonline 2025; MDPI 2025 / 8466 |
| Concept drift | Models become outdated post-deployment, requiring re-training | MDPI Future Internet 2025 / 328 |
| IID-flow assumption mismatch | Multi-stage attacks broken into IID flows lose relational and temporal structure | DevSecOps NIDS guide 2026 |
| Encrypted + heterogeneous protocols (IoT) | Packet-level features lose access to plaintext; protocol/device heterogeneity breaks unified models | Wiley 2025 IoT NIDS review |

Part B. Pain points × JANUS capability mapping

P2 is our most distinctive observation — the community talks about cross-domain failure, but no one has clearly characterised "density score implicitly learns source-likeness". This is the mechanism-level explanation that earns naming rights and serves as the introduction's hook.

| # | Pain point (community consensus) | Failure mode of current SOTA | JANUS mechanism | Evidence in our artifacts |
|---|---|---|---|---|
| P1 | Cross-dataset / domain-shift collapse | Shafir NF: cross 0.89 (forward) / 0.93 (reverse, single-direction); legacy terminal_norm 0.62 (reverse) | DFM head emits disc_nll (protocol-flag distribution is transfer-stable); Mahalanobis-OAS re-weights on target benign | RESULTS.md §C: reverse 0.62 → 0.93 (+0.31); forward 0.89 → 0.96 (+0.07); 12 off-diagonal cells avg +0.175 |
| P2 | "Source-likeness collapse" of density scores (new framing) | Terminal density score across 3 distinct backbones in reverse cross all ≤ 0.63, effectively a source-domain classifier | DFM decouples protocol semantics; Mahalanobis fuses multiple complementary scores | 3-backbone × 16-score validation: terminal_norm 0.519–0.626 all collapse; disc_nll is the only single score that stays stable (0.903) |
| P3 | Continuous + discrete protocol fields squashed into one likelihood | NF / AE Gaussianise TCP flags / direction, losing semantics | Mixed CFM: continuous head over (size, IAT, win) + DFM head over 6 binary flag/direction channels, jointly trained | sigma=0.1, λ_disc=1.0; DFM head is the only transfer-stable single score on reverse cross (0.9191) |
| P4 | Reconstruction-based AD has identity-mapping trap | AE perfectly reconstructs OOD (NeurIPS '24, OpenReview '25) | FM is not reconstruction: velocity field, terminal point is source noise rather than the input | First FM application to packet-sequence NIDS |
| P5 | Multi-score selection bias (post-hoc best-fixed channel) | Shafir 5-feature ensemble is post-hoc; our best-fixed AUROC 0.99 also selection-biased | Mahalanobis-OAS fits once on benign val; never sees attack labels | RESULTS.md §A: OAS yields +0.054 to +0.118 over Shafir on 4 within-dataset benchmarks with one deployable scalar score |
| P6 | High operational FPR | 99 %-FP industrial reality | Thresholded F1 / Precision / Recall (τ = benign-val P95 / P99) | RESULTS_THRESHOLDED.md: CICDDoS2019 within τ=P95 F1 = 0.993; cross F1 = 0.632 (precision ≈ 0.95) |
| P7 | Run-to-run variance / poor reproducibility | Multi-seed std large in many baselines | Causal-packet attention shrinks std 1.68× | RESULTS.md "Stability": CICIoT2023 std 0.0017 → 0.0002 (8×) |
| P8 | NF likelihood is not necessarily anomaly | Discussed since NFAD 2021; Shafir bypasses with Shapley feature ensemble | We provide a 10-d score family + Mahalanobis routing; not dependent on a single likelihood | CROSS_MATRIX.md 12 cells across 4×4 matrix |

Part C. Paper outline

§1 Introduction (~1.5 pages)

Argument chain by paragraph:

  1. Hook: ~99 % of alerts being false positives + cross-domain deployment collapse. One sentence with the alert-fatigue numbers makes the relevance immediate.
  2. Common failure mode of current methods: AE has identity trap; NF likelihood becomes a source-likeness classifier under cross-domain shift (this is P2; insert a teaser figure of terminal_norm 4×4 cross matrix where many off-diagonal cells are ≤ 0.55).
  3. Missing tool: FM / DFM are validated for image / molecule / tabular AD, but never applied to NIDS; protocol fields are intrinsically mixed continuous + discrete, exactly the setting mixed FM was designed for.
  4. JANUS in one sentence: jointly trained continuous CFM + discrete FM with causal-packet attention, aggregated by a benign-only Mahalanobis-OAS scalar.
  5. Contributions (3–4 bullets):
    • (C1) First Flow-Matching paradigm for packet-level NIDS; first DFM modelling of protocol flag / direction channels.
    • (C2) We characterise the source-likeness collapse of terminal density scores at architecture level (3 backbones × 16 scores), and show how DFM + Mahalanobis routing breaks it.
    • (C3) Mahalanobis-OAS as a benign-only single-scalar aggregator; the unsupervised contract is preserved (no attack labels at any step).
    • (C4) On a 4×4 cross-dataset matrix, JANUS averages +0.175 AUROC over terminal_norm; reverse direction +0.31. Within-dataset matches or exceeds the NF SOTA (Shafir 2026) by 0.054–0.118 on 3/3 directly comparable benchmarks.
  6. Outline + one-sentence takeaway.

Figure 1 placeholder (end of §1): cross 4×4 heatmap, left terminal_norm, right JANUS + Mahalanobis-OAS, Δ in colour. The visceral hook.


§2 Background

Four 1-paragraph subsections.

§2.1 Unsupervised Network Anomaly Detection

  • Reconstruction (AE / MemAE / Kitsune) — cite identity-mapping critiques as baseline limitation.
  • Density estimation with NF — Shafir 2026 + NF-NIDS as SOTA, but likelihood ≠ anomaly and categorical fields suffer.
  • GAN-based — TIPSO-GAN (mode collapse, optimisation cost).
  • Diffusion — DMAD survey + ConMD; high inference cost; AD remains image-centric.
  • Self-supervised contrastive — primarily representation learning, not direct anomaly scoring.

§2.2 Flow Matching & Discrete Flow Matching

  • Lipman 2023, OT-CFM (Tong et al. TMLR '24).
  • Discrete Flow Matching (Gat et al. NeurIPS '24).
  • FM for AD: TCCM NeurIPS '25 (tabular), rFM 2025 (image) — highlight that NIDS remains untouched.
  • Mixed continuous + discrete FM (FlowMol 2024–2025): sets up the naming for our Mixed_CFM.

§2.3 Cross-Dataset Generalisation in NIDS

  • HDSE-IDS, Transformer-IDS w/ calibration, Few-shot multi-domain fusion (2025).
  • Cite arXiv 2402.10974 / 2403.17458 (Cross-Dataset Generalisation, "Expectations Versus Reality").
  • Our framing: prior work treats cross as a representation problem; nobody characterises the score-level source-likeness collapse.

§2.4 Anomaly Score Aggregation

  • Mahalanobis-based AD (MICCAI '24 brain MRI; M-SVDD 2025).
  • OAS / Ledoit-Wolf shrinkage covariance.
  • Position: Mahalanobis is widespread in image AD; never systematically applied as a benign-only aggregator over an FM score vector in NIDS.

§3 Methodology (~3 pages)

§3.1 Problem formulation

  • Input: packet sequences [N, T, 9] + flow metadata; benign-only training; unsupervised inference returns one scalar.
  • 9-d packet schema (common/data_contract.py): 3 continuous (size, IAT, win) + 6 discrete (direction + 5 TCP flags).
  • The mixed-modality nature is intrinsic to the protocol — not a modelling choice.
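
The input contract above can be sketched as a small validation helper. This is only an illustration: the outline fixes the counts (3 continuous + 6 binary channels) but not the column order, so the channel names and ordering below are assumptions and not the contents of common/data_contract.py.

```python
import numpy as np

# Hypothetical channel layout: only the 3 + 6 split comes from the outline.
CONT_CHANNELS = ("size", "iat", "win")
DISC_CHANNELS = ("direction", "flag_0", "flag_1", "flag_2", "flag_3", "flag_4")


def split_packets(packets: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Split [N, T, 9] packets into continuous [N, T, 3] and binary [N, T, 6]."""
    if packets.ndim != 3 or packets.shape[-1] != 9:
        raise ValueError(f"expected [N, T, 9], got {packets.shape}")
    n_cont = len(CONT_CHANNELS)
    cont = packets[..., :n_cont].astype(np.float32)
    disc = packets[..., n_cont:]
    # The discrete head assumes strictly binary inputs, so fail loudly otherwise.
    if not np.isin(disc, (0, 1)).all():
        raise ValueError("discrete channels must be binary (0/1)")
    return cont, disc.astype(np.int64)


# Synthetic batch, for shape checking only (not real traffic).
rng = np.random.default_rng(0)
demo = np.concatenate(
    [rng.normal(size=(4, 16, 3)), rng.integers(0, 2, size=(4, 16, 6))], axis=-1
)
cont, disc = split_packets(demo)
print(cont.shape, disc.shape)  # (4, 16, 3) (4, 16, 6)
```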

§3.2 The JANUS architecture

  • (a) Backbone: causal-packet Transformer over [FLOW_TOKEN, P_1, …, P_T]. Figure 2 = model overview.
  • (b) Continuous head (CFM): OT-CFM, σ=0.1, on the 3 continuous channels.
  • (c) Discrete head (DFM): Discrete Flow Matching with linear interpolation probability path on the 6 binary channels; cross-entropy loss with λ_disc = 1.0.
  • (d) Why mixed FM: small ablation showing λ_disc = 0 vs λ_disc = 1.0, demonstrating that flag fields cannot be Gaussianised.
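
A minimal NumPy sketch of how the two heads' training targets and the joint loss in (b)–(d) could fit together. It rests on several assumptions: it works per packet rather than per sequence, it pairs endpoints independently instead of using the minibatch OT coupling of OT-CFM, it uses uniform binary noise as the discrete source, and all function names are hypothetical. Which endpoint plays the role of data versus noise is also left open here.

```python
import numpy as np


def cfm_pair(x0, x1, t, sigma, rng):
    """Noisy linear interpolant and its regression target (conditional FM)."""
    t = t[:, None]
    x_t = (1.0 - t) * x0 + t * x1 + sigma * rng.normal(size=x0.shape)
    return x_t, x1 - x0  # constant velocity of the straight path


def dfm_corrupt(d0, d1, t, rng):
    """Linear mixture path: each channel shows d1 with probability t, else d0."""
    keep = rng.random(d1.shape) < t[:, None]
    return np.where(keep, d1, d0)


def mixed_loss(v_pred, v_target, logits, d1, lambda_disc=1.0):
    """Velocity MSE plus lambda_disc times binary cross-entropy on the flags."""
    mse = np.mean((v_pred - v_target) ** 2)
    # Numerically stable BCE-with-logits: softplus(z) - y * z.
    bce = np.mean(np.logaddexp(0.0, logits) - d1 * logits)
    return mse + lambda_disc * bce, mse, bce


rng = np.random.default_rng(0)
B = 8
x0, x1 = rng.normal(size=(B, 3)), rng.normal(size=(B, 3))   # continuous endpoints
d0 = rng.integers(0, 2, size=(B, 6))                        # binary noise endpoint
d1 = rng.integers(0, 2, size=(B, 6))                        # binary flag endpoint
t = rng.random(B)

x_t, v_target = cfm_pair(x0, x1, t, sigma=0.1, rng=rng)
d_t = dfm_corrupt(d0, d1, t, rng)

# Stand-ins for network outputs; a real model would map (x_t, d_t, t) to these.
v_pred = np.zeros_like(v_target)
logits = np.zeros(d1.shape)
total, mse, bce = mixed_loss(v_pred, v_target, logits, d1, lambda_disc=1.0)
print(f"total={total:.4f} mse={mse:.4f} bce={bce:.4f}")
```

Setting `lambda_disc=0.0` removes the cross-entropy term from the total, which corresponds to the λ_disc = 0 arm of the ablation in (d).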

§3.3 Score family

  • Enumerate the 10-d score vector: terminal_norm, terminal_packet, terminal_flow, kinetic_*, disc_nll_total, …
  • Each captures a different physical quantity (density / kinetic / discrete-flag distribution).
  • Source-likeness collapse — formal observation: under target-domain benign drift, terminal_norm degrades into a 1{x ∈ source distribution} proxy and loses its anomaly signal. Evidence: 4×4 matrix and 3-backbone validation.

§3.4 Mahalanobis-OAS aggregator

  • Fit OAS-shrunk Mahalanobis on target benign val: score = d²(s(x), µ_benign).
  • The aggregator never sees attack labels.
  • Selection: 5 benign-only aggregators evaluated (max-z, plain Mahalanobis, Ledoit-Wolf, OAS, score-subset variants); OAS performs best with sensitivity ≤ 0.005 vs Ledoit-Wolf (SCORE_ROUTER.md).
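
The aggregator can be sketched in a few lines of NumPy. The shrinkage coefficient below follows the closed-form OAS estimator of Chen et al. (2010); scikit-learn's `sklearn.covariance.OAS` is a library alternative. The function names and the synthetic 10-d score vectors are illustrative assumptions, not project code or results.

```python
import numpy as np


def fit_oas_mahalanobis(benign_scores: np.ndarray):
    """Fit mean and OAS-shrunk precision on benign score vectors of shape [n, p]."""
    n, p = benign_scores.shape
    mu = benign_scores.mean(axis=0)
    centered = benign_scores - mu
    emp_cov = centered.T @ centered / n
    tr = np.trace(emp_cov)
    tr_sq = np.sum(emp_cov ** 2)  # equals trace(S @ S) for symmetric S
    num = (1.0 - 2.0 / p) * tr_sq + tr ** 2
    den = (n + 1.0 - 2.0 / p) * (tr_sq - tr ** 2 / p)
    rho = 1.0 if den <= 0 else min(num / den, 1.0)
    # Shrink towards a scaled identity with the same average variance.
    cov = (1.0 - rho) * emp_cov + rho * (tr / p) * np.eye(p)
    return mu, np.linalg.inv(cov), rho


def mahalanobis_sq(scores, mu, precision):
    """Squared Mahalanobis distance d^2(s(x), mu_benign) per row."""
    diff = scores - mu
    return np.einsum("ij,jk,ik->i", diff, precision, diff)


# Synthetic stand-in for s(x) on target benign validation flows.
rng = np.random.default_rng(0)
benign_val = rng.normal(size=(500, 10))
mu, precision, rho = fit_oas_mahalanobis(benign_val)

d2_benign = mahalanobis_sq(benign_val, mu, precision)
d2_shifted = mahalanobis_sq(benign_val + 3.0, mu, precision)  # artificial drift
print(f"rho={rho:.3f}", np.median(d2_benign), np.median(d2_shifted))
```

No labels enter the fit, so the benign-only contract holds by construction. Because the shrinkage target is a scaled identity, the result depends on the relative scale of the individual scores; whether scores are standardised beforehand is not stated in this outline.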

§3.5 Causal-packet attention as a stabiliser

  • Define the protocol-causal mask. Show std reduction 1.68× across 4 datasets (RESULTS.md "Stability").
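
The mask itself is still to be defined in this outline. As a placeholder, the sketch below assumes the simplest reading: a plain lower-triangular mask over [FLOW_TOKEN, P_1, …, P_T], so every packet can attend to the flow token and to earlier packets but not to later ones. The actual protocol-causal mask may differ, and the function names are hypothetical.

```python
import numpy as np


def causal_packet_mask(num_packets: int) -> np.ndarray:
    """Boolean [T+1, T+1] mask over [FLOW_TOKEN, P_1..P_T]; True = may attend."""
    size = num_packets + 1  # index 0 is the flow token
    return np.tril(np.ones((size, size), dtype=bool))


def masked_softmax(logits: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Row-wise softmax that assigns zero weight to masked-out positions."""
    masked = np.where(mask, logits, -np.inf)
    masked = masked - masked.max(axis=-1, keepdims=True)  # diagonal is always visible
    weights = np.exp(masked)
    return weights / weights.sum(axis=-1, keepdims=True)


mask = causal_packet_mask(4)
rng = np.random.default_rng(0)
attn = masked_softmax(rng.normal(size=mask.shape), mask)
print(mask.astype(int))
```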

§4 Evaluation (~3–4 pages)

§4.1 Datasets, baselines, protocol

  • 4 datasets: ISCXTor2016, CICIDS2017, CICDDoS2019, CICIoT2023; canonical packets.npz 9-d schema.
  • Baselines (locked from PDFs in paper/): Shafir NF (T-Netw 2026), ConMD (TIFS 2026), TIPSO-GAN (NDSS 2026), Kitsune, AE / MemAE, OCSVM.
  • Protocol: 10K benign train (matches Shafir), 3 seeds, AUROC primary + thresholded F1 / Precision / Recall @ P95 / P99.

§4.2 Within-dataset (Table 1)

  • 4 datasets × {Shafir / ConMD / TIPSO-GAN / Kitsune / AE / JANUS + Mahal-OAS / JANUS best-fixed}.
  • Honest framing: "JANUS matches or exceeds the NF SOTA on 3/3 directly comparable benchmarks; CICIoT2023 reported as additional benchmark due to metric mismatch (Caveat 1, RESULTS.md)".
  • One sentence acknowledging that "within-dataset is saturated; the discriminating axis is cross-dataset (next section)".

§4.3 Cross-dataset (Table 2 + Figure 3)

  • The headline of the paper. Table = 4×4 matrix Mahal vs terminal_norm.
  • Figure 3 = detailed version of Figure 1.
  • Critical details:
    • Forward IDS17 → DDoS19: +0.07 over Shafir (genuine SOTA).
    • Reverse DDoS19 → IDS17: 0.93 = Shafir 0.93 (matches, does not exceed — Caveat 2).
    • 12 off-diagonal cells average +0.175 over terminal_norm.
    • 4 "collapse cells" (≤ 0.57) all recovered to ≥ 0.75.

§4.4 Mechanism analysis (Table 3 + Figure 4)

  • Source-likeness collapse: 3 backbones × 16 scores matrix.
  • DFM head ablation: λ_disc ∈ {0, 0.5, 1.0, 2.0} vs reverse-cross AUROC.
  • Mahalanobis aggregator ablation: max-z / plain Mahal / Ledoit-Wolf / OAS — sourced from SCORE_ROUTER.md.

§4.5 Ablations & robustness

  • σ sensitivity (sigma_validation.md 4×2 table).
  • Causal-packet attention contribution to std reduction (RESULTS.md Stability).
  • Per-attack-family table (RESULTS.md "Per-attack-family pattern" — SSH-Patator counter-example).

§4.6 Thresholded metrics & operational impact

  • RESULTS_THRESHOLDED.md F1 / Precision / Recall @ P95.
  • Direct dialogue with industry alert-fatigue numbers: "at the P95 threshold, our cross precision ≈ 0.95".
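
A sketch of the thresholding protocol, assuming that higher scores mean more anomalous and that τ is a percentile of the benign validation scores, as described above. The helper name and the toy numbers are illustrative only and are not results from RESULTS_THRESHOLDED.md.

```python
import numpy as np


def thresholded_metrics(benign_val_scores, test_scores, test_labels, percentile=95.0):
    """Precision / recall / F1 at tau = given percentile of benign validation scores."""
    tau = np.percentile(benign_val_scores, percentile)
    pred = np.asarray(test_scores) > tau          # alert if the score exceeds tau
    labels = np.asarray(test_labels).astype(bool)  # True = attack
    tp = np.sum(pred & labels)
    fp = np.sum(pred & ~labels)
    fn = np.sum(~pred & labels)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "tau": float(tau),
        "precision": float(precision),
        "recall": float(recall),
        "f1": float(f1),
    }


# Toy example: tau is fit on benign validation scores only, never on test labels.
benign_val = np.arange(100, dtype=float)
test_scores = np.array([10.0, 50.0, 96.0, 99.0, 200.0, 90.0])
test_labels = np.array([0, 0, 0, 1, 1, 1])
metrics = thresholded_metrics(benign_val, test_scores, test_labels, percentile=95.0)
print(metrics)
```

Precision at a fixed τ depends on the attack prevalence in the test split, so the benign/attack ratio is worth reporting next to any precision figure quoted against the alert-fatigue numbers.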

§4.7 Discussion (sub-section, 1 paragraph)

  • Limitations: aggregator post-hoc selection, target-benign-calibrated transfer (not zero-shot — Caveat 3), CICIoT2023 metric mismatch.
  • Honest reporting here closes the door on reviewer attacks.

Part D. Writing red lines (from project memory)

  1. Never write "zero-shot transfer" — write "calibrated cross-domain transfer" (Mahalanobis is fit on target benign).
  2. Never claim "+SOTA on CICIoT2023" — write "additional benchmark; metric mismatch (Shafir F1 vs our AUROC)".
  3. Reverse cross is "matches Shafir 0.93", not "beats". Our +0.31 is vs our own legacy.
  4. Best-fixed numbers are an ablation upper bound, never the SOTA claim.
  5. Mahalanobis-OAS was post-hoc-selected — write "we evaluated 5 benign-only aggregators; OAS performed best with sensitivity ≤ 0.005 vs Ledoit-Wolf".

Part E. Sources

  • Shafir, Giryes, Wool — Explainable Anomaly Detection in Network Traffic Using Normalizing Flows, IEEE T-Netw 2026 (PDF in paper/).
  • Lian et al. — Contextual Masking Distillation for Network Traffic Anomaly Detection, IEEE TIFS 2026.
  • TIPSO-GAN: Malicious Network Traffic Detection, NDSS 2026.
  • Gat et al. — Discrete Flow Matching, NeurIPS 2024.
  • Scalable, Explainable and Provably Robust Anomaly Detection with One-Step Flow Matching, NeurIPS 2025 (arXiv 2510.18328).
  • How and Why: Taming Flow Matching for Unsupervised Anomaly Detection (rFM), arXiv 2508.05461.
  • On the Cross-Dataset Generalization of Machine Learning for Network Intrusion Detection, arXiv 2402.10974.
  • Expectations Versus Reality: Evaluating Intrusion Detection Systems in Practice, arXiv 2403.17458.
  • HDSE-IDS — Heterogeneous Deep Stacked Ensemble for Cross-Domain IDS, Connection Science 2025.
  • Self-Supervised Transformer-based Contrastive Learning for IDS, arXiv 2505.08816.
  • GraphIDS: Self-supervised GNN for Network Intrusion Detection, NeurIPS 2025.
  • Network traffic foundation models: A systematic review, ScienceDirect 2026.
  • Tong et al. — Improving and Generalizing Flow-Based Generative Models with Minibatch Optimal Transport, TMLR 2024.
  • DMAD: Diffusion Models for Anomaly Detection (survey), IJCAI 2025.
  • Alert Fatigue in Security Operations Centres: Research Challenges and Opportunities, ACM Computing Surveys 2024.
  • Beyond the Norm: Unsupervised Anomaly Detection in Telecommunications with Mahalanobis Distance, MDPI Computers 2025.
  • Autoencoders for Anomaly Detection are Unreliable, OpenReview 2025.