# JANUS Paper: Survey, Pain Points, and Outline

**Date**: 2026-05-04

**Scope**: Field survey + framing for the JANUS paper (Mixed-CFM + DFM + causal-packet attention + Mahalanobis-OAS aggregator).

**Target sections**: Introduction · Background · Methodology · Evaluation.

> Framing rule (from `RESULTS.md` caveats): the headline claim is **cross-dataset robustness + first FM/DFM in NIDS**, not "4/4 within SOTA". Within-dataset is saturated; the discriminating axis is cross-dataset.

---

## Part A. State of the field (2024–2026)

### A.1 Method families and where each one is stuck

| Family | Recent representative work | Core mechanism | Documented shortcoming |
|---|---|---|---|
| **Normalizing Flows (NF)** | **Shafir et al. T-Netw 2026** (our main baseline), NF-NIDS (IAF/NSF), PrivFlow-NIDS (Springer 2025) | Explicit log-likelihood on benign; anomaly = low likelihood | Likelihood ≠ anomaly score; coupling-layer architecture brittle; categorical / flag fields handled crudely |
| **Reconstruction (AE / VAE / MemAE)** | KitNET, MemAE, SparseMemAE | Reconstruction error as anomaly score | "Identity mapping trap": OOD samples can be perfectly reconstructed (NeurIPS '24, OpenReview '25 multi-paper consensus) |
| **Diffusion** | **ConMD (TIFS 2026)** (our main baseline), DMAD (IJCAI '25 survey), UnDiff (WWW '25), RDUAE | Denoising / score-based density | Slow inference; multi-step trace storage; training instability |
| **GAN** | **TIPSO-GAN (NDSS 2026)** (our main baseline), DEGAN | Discriminator score / reconstruction | Mode collapse and training instability persist; TIPSO-GAN spends extra optimisation budget (PSO) to mitigate |
| **Self-supervised contrastive** | Self-Supervised Transformer Contrastive Learning (NetSci '25), GraphIDS (NeurIPS 2025), SSGMHAN | Representation learning, downstream OCSVM / Mahal | Two-stage pipeline (rep + detector); no end-to-end anomaly score |
| **Foundation models** | Traffic-MoE, ETC-IMC, Language-of-Network GBC | Pre-train / fine-tune over packet-byte tokens | Resource-heavy; primary task is encrypted-traffic classification, not AD |
| **Knowledge distillation** | ConMD (TIFS 2026), Spatial-Temporal KD | Teacher → student alignment | Sensitive to teacher quality; two-stage |
| **Flow Matching (FM)** ✨ | **TCCM (NeurIPS 2025) tabular**, rFM (image AD 2025), Lipman 2023 | Velocity-field regression, one-step deviation as score | **No application to NIDS yet: the gap JANUS fills** |
| **Discrete FM (DFM)** ✨ | Gat et al. NeurIPS 2024, Fisher Flow Matching, FlowMol (molecules 2024–2025) | FM over discrete state space | **No application to NIDS yet** |

### A.2 Dataset / benchmark situation

- Within-dataset is saturated. Shafir 0.93 → JANUS 0.99 → TIPSO-GAN F1 = 0.99 on CICIDS2017 / CICDDoS2019 are all within seed noise of each other.
- CICIDS2017 has documented benchmark bias (synthetic vs real-world traffic distribution mismatch; see arXiv 2403.17458 "Expectations Versus Reality" and multiple 2024–2025 critiques).
- Cross-dataset is now the field's chosen discriminating axis. HDSE-IDS, Transformer-IDS w/ calibration, Few-shot multi-domain fusion (PLOS One 2025), and the cross-dataset generalisation review (arXiv 2402.10974) all centre on the same problem.
- AUROC alone is increasingly seen as inadequate for operational reporting; thresholded F1, Precision, Recall, and TPR @ FPR = 1% are demanded by the SOC community.
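
The operational metrics named in the last bullet can be sketched in a few lines. This is a minimal, standard-library illustration, assuming that higher scores mean "more anomalous", that label 1 marks an attack, and that the threshold τ is a percentile of benign validation scores (the P95 / P99 convention used later in this note). The function names `percentile`, `thresholded_metrics`, and `tpr_at_fpr` are hypothetical helpers, not names from the project code.

```python
def percentile(values, q):
    """Linear-interpolation percentile (q in [0, 100]) of a non-empty sequence."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def thresholded_metrics(scores, labels, tau):
    """Precision, recall and F1 when an alert is raised for score > tau."""
    tp = sum(1 for s, y in zip(scores, labels) if s > tau and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s > tau and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s <= tau and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def tpr_at_fpr(scores, labels, max_fpr=0.01):
    """TPR at the lowest threshold whose false-positive rate stays <= max_fpr."""
    benign = sorted((s for s, y in zip(scores, labels) if y == 0), reverse=True)
    attacks = [s for s, y in zip(scores, labels) if y == 1]
    allowed_fp = int(max_fpr * len(benign))  # benign alerts we may tolerate
    # Alerts are score > tau, so at most `allowed_fp` benign scores exceed tau.
    tau = benign[allowed_fp] if allowed_fp < len(benign) else float("-inf")
    return sum(1 for s in attacks if s > tau) / len(attacks)


if __name__ == "__main__":
    # Toy numbers for illustration only; they are not results from any dataset.
    benign_val = [0.1, 0.2, 0.3, 0.4, 0.5]
    test_scores = [0.2, 0.3, 0.6, 0.9, 0.45, 0.7]
    test_labels = [0, 0, 0, 1, 1, 1]
    tau_p95 = percentile(benign_val, 95)  # threshold fitted on benign data only
    print(thresholded_metrics(test_scores, test_labels, tau_p95))
    print(tpr_at_fpr(test_scores, test_labels, max_fpr=0.01))
```

Because τ is fitted on benign validation scores alone, the thresholding step uses no attack labels; labels enter only when the metrics are computed.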
### A.3 Operational pain points (the "why this matters" stack)

| Pain point | Quantified evidence | Source |
|---|---|---|
| Operational FPR | ~99 % of NIDS alerts are FP; one OT refinery reported 27 000 alerts → 76 real; 51 % of SOC teams describe alert volume as unmanageable | Trend Micro 2024, OT IDS surveys 2024–2025 |
| Cross-domain deployment collapse | Single-dataset trained IDSes typically drop 0.10–0.30 AUROC across environments | Tandfonline 2025; MDPI 2025 / 8466 |
| Concept drift | Models become outdated post-deployment, requiring re-training | MDPI Future Internet 2025 / 328 |
| IID-flow assumption mismatch | Multi-stage attacks broken into IID flows lose relational and temporal structure | DevSecOps NIDS guide 2026 |
| Encrypted + heterogeneous protocols (IoT) | Packet-level features lose access to plaintext; protocol/device heterogeneity breaks unified models | Wiley 2025 IoT NIDS review |

---

## Part B. Pain points × JANUS capability mapping

> **P2 is our most distinctive observation**: the community talks about cross-domain failure, but no one has clearly characterised "density score implicitly learns source-likeness". This is the mechanism-level explanation that earns naming rights and serves as the introduction's hook.

| # | Pain point (community consensus) | Failure mode of current SOTA | JANUS mechanism | Evidence in our artifacts |
|---|---|---|---|---|
| **P1** | Cross-dataset / domain-shift collapse | Shafir NF: cross 0.89 (forward) / 0.93 (reverse, single-direction); legacy `terminal_norm` 0.62 (reverse) | DFM head emits `disc_nll` (protocol-flag distribution is transfer-stable); Mahalanobis-OAS re-weights on target benign | `RESULTS.md` §C: reverse 0.62 → 0.93 (+0.31); forward 0.89 → 0.96 (+0.07); 12 off-diagonal cells avg +0.175 |
| **P2** | "Source-likeness collapse" of density scores (new framing) | Terminal density score across **3** distinct backbones in reverse cross is always ≤ 0.63, effectively a source-domain classifier | DFM decouples protocol semantics; Mahalanobis fuses multiple complementary scores | 3-backbone × 16-score validation: `terminal_norm` 0.519–0.626, all collapse; `disc_nll` is the only single score that stays stable (0.903) |
| **P3** | Continuous + discrete protocol fields squashed into one likelihood | NF / AE Gaussianise TCP flags / direction, losing semantics | **Mixed CFM**: continuous head over (size, IAT, win) + **DFM head** over 6 binary flag/direction channels, jointly trained | `sigma=0.1`, `λ_disc=1.0`; DFM head is the only transfer-stable single score on reverse cross (0.9191) |
| **P4** | Reconstruction-based AD has identity-mapping trap | AE perfectly reconstructs OOD (NeurIPS '24, OpenReview '25) | FM is not reconstruction: it learns a velocity field, and the terminal point is source noise rather than the input | First FM application to packet-sequence NIDS |
| **P5** | Multi-score selection bias (post-hoc best-fixed channel) | Shafir 5-feature ensemble is post-hoc; our best-fixed AUROC 0.99 is also selection-biased | Mahalanobis-OAS fits once on benign val; **never sees attack labels** | `RESULTS.md` §A: OAS yields +0.054 to +0.118 over Shafir on 4 within-dataset benchmarks with one deployable scalar score |
| **P6** | High operational FPR | 99 %-FP industrial reality | Thresholded F1 / Precision / Recall (τ = benign-val P95 / P99) | `RESULTS_THRESHOLDED.md`: CICDDoS2019 within τ=P95 F1 = 0.993; cross F1 = 0.632 (precision ≈ 0.95) |
| **P7** | Run-to-run variance / poor reproducibility | Multi-seed std large in many baselines | Causal-packet attention shrinks std 1.6–8× | `RESULTS.md` "Stability": CICIoT2023 std 0.0017 → 0.0002 (8×) |
| **P8** | NF likelihood is not necessarily anomaly | Discussed since NFAD 2021; Shafir bypasses with Shapley feature ensemble | We provide a 10-d score family + Mahalanobis routing; not dependent on a single likelihood | `CROSS_MATRIX.md` 12 cells across 4×4 matrix |

---

## Part C. Paper outline

### §1 Introduction (~1.5 pages)

Argument chain by paragraph:

1. **Hook**: ~99 % of alerts being false positives + cross-domain deployment collapse. One sentence with the alert-fatigue numbers makes the relevance immediate.
2. **Common failure mode of current methods**: AE has identity trap; NF likelihood becomes a source-likeness classifier under cross-domain shift (this is P2; insert a teaser figure of `terminal_norm` 4×4 cross matrix where many off-diagonal cells are ≤ 0.55).
3. **Missing tool**: FM / DFM are validated for image / molecule / tabular AD, but never applied to NIDS; protocol fields are intrinsically mixed continuous + discrete, exactly the setting mixed FM was designed for.
4. **JANUS in one sentence**: jointly trained continuous CFM + discrete FM with causal-packet attention, aggregated by a benign-only Mahalanobis-OAS scalar.
5. **Contributions** (3–4 bullets):
   - **(C1)** First Flow-Matching paradigm for packet-level NIDS; first DFM modelling of protocol flag / direction channels.
   - **(C2)** We characterise the *source-likeness collapse* of terminal density scores at architecture level (3 backbones × 16 scores), and show how DFM + Mahalanobis routing breaks it.
   - **(C3)** Mahalanobis-OAS as a benign-only single-scalar aggregator; the unsupervised contract is preserved (no attack labels at any step).
   - **(C4)** On a 4×4 cross-dataset matrix, JANUS averages **+0.175** AUROC over `terminal_norm`; reverse direction +0.31. Within-dataset matches or exceeds the NF SOTA (Shafir 2026) by 0.054–0.118 on 3/3 directly comparable benchmarks.
6. **Outline + one-sentence takeaway**.

**Figure 1 placeholder** (end of §1): cross 4×4 heatmap, left `terminal_norm`, right `JANUS + Mahalanobis-OAS`, Δ in colour. The visceral hook.

---

### §2 Background & Related Work (~1.5 pages)

Four 1-paragraph subsections.

**§2.1 Unsupervised Network Anomaly Detection**

- Reconstruction (AE / MemAE / Kitsune): cite identity-mapping critiques as baseline limitation.
- Density estimation with NF: Shafir 2026 + NF-NIDS as SOTA, but likelihood ≠ anomaly and categorical fields suffer.
- GAN-based: TIPSO-GAN (mode collapse, optimisation cost).
- Diffusion: DMAD survey + ConMD; high inference cost; AD remains image-centric.
- Self-supervised contrastive: primarily representation learning, not direct anomaly scoring.

**§2.2 Flow Matching & Discrete Flow Matching**

- Lipman 2023, OT-CFM (Tong et al. TMLR '24).
- Discrete Flow Matching (Gat et al. NeurIPS '24).
- FM for AD: TCCM NeurIPS '25 (tabular), rFM 2025 (image); **highlight that NIDS remains untouched**.
- Mixed continuous + discrete FM (FlowMol 2024–2025): sets up the naming for our Mixed_CFM.

**§2.3 Cross-Dataset Generalisation in NIDS**

- HDSE-IDS, Transformer-IDS w/ calibration, Few-shot multi-domain fusion (2025).
- Cite arXiv 2402.10974 / 2403.17458 (Cross-Dataset Generalisation, "Expectations Versus Reality").
- **Our framing**: prior work treats cross as a representation problem; nobody characterises the score-level source-likeness collapse.

**§2.4 Anomaly Score Aggregation**

- Mahalanobis-based AD (MICCAI '24 brain MRI; M-SVDD 2025).
- OAS / Ledoit-Wolf shrinkage covariance.
- **Position**: Mahalanobis is widespread in image AD; never systematically applied as a benign-only aggregator over an FM score vector in NIDS.
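
For reference when drafting §2.4, the shrinkage estimators named above share one convex-combination form: the sample covariance is pulled towards a scaled identity, and the resulting matrix is plugged into the Mahalanobis distance. The block below states that form together with the OAS coefficient in the closed form usually attributed to Chen et al. (2010); Ledoit-Wolf uses the same form with a different data-driven coefficient. The symbols $S$ (sample covariance of benign score vectors), $n$ (number of benign samples), $p$ (score dimension), $\rho$, and $\mu$ are notation introduced here for illustration, not taken from the project artifacts.

```latex
\[
\hat{\Sigma}_{\rho} = (1-\rho)\,S + \rho\,\frac{\operatorname{tr}(S)}{p}\,I_p,
\qquad
\rho_{\mathrm{OAS}} = \min\!\left(1,\;
\frac{\bigl(1-\tfrac{2}{p}\bigr)\operatorname{tr}(S^{2}) + \operatorname{tr}^{2}(S)}
     {\bigl(n+1-\tfrac{2}{p}\bigr)\Bigl[\operatorname{tr}(S^{2}) - \tfrac{1}{p}\operatorname{tr}^{2}(S)\Bigr]}
\right)
\]
\[
d^{2}(s) = (s-\mu)^{\top}\,\hat{\Sigma}_{\rho}^{-1}\,(s-\mu)
\]
```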
---

### §3 Methodology (~3 pages)

**§3.1 Problem formulation**

- Input: packet sequences `[N, T, 9]` + flow metadata; benign-only training; unsupervised inference returns one scalar.
- 9-d packet schema (`common/data_contract.py`): 3 continuous (size, IAT, win) + 6 discrete (direction + 5 TCP flags).
- The mixed-modality nature is *intrinsic* to the protocol, not a modelling choice.

**§3.2 The JANUS architecture**

- (a) **Backbone**: causal-packet Transformer over `[FLOW_TOKEN, P_1, …, P_T]`. Figure 2 = model overview.
- (b) **Continuous head (CFM)**: OT-CFM, σ=0.1, on the 3 continuous channels.
- (c) **Discrete head (DFM)**: Discrete Flow Matching with linear interpolation probability path on the 6 binary channels; cross-entropy loss with `λ_disc = 1.0`.
- (d) **Why mixed FM**: small ablation showing `λ_disc = 0` vs `λ_disc = 1.0`, demonstrating that flag fields cannot be Gaussianised.

**§3.3 Score family**

- Enumerate the 10-d score vector: `terminal_norm`, `terminal_packet`, `terminal_flow`, `kinetic_*`, `disc_nll_total`, …
- Each captures a different physical quantity (density / kinetic / discrete-flag distribution).
- **Source-likeness collapse (formal observation)**: under target-domain benign drift, `terminal_norm` degrades into a `1{x ∈ source distribution}` proxy and loses its anomaly signal. Evidence: 4×4 matrix and 3-backbone validation.

**§3.4 Mahalanobis-OAS aggregator**

- Fit OAS-shrunk Mahalanobis on **target** benign val: `score = d²(s(x), µ_benign)`.
- The aggregator never sees attack labels.
- Selection: 5 benign-only aggregators evaluated (max-z, plain Mahalanobis, Ledoit-Wolf, OAS, score-subset variants); OAS performs best with sensitivity ≤ 0.005 vs Ledoit-Wolf (`SCORE_ROUTER.md`).

**§3.5 Causal-packet attention as a stabiliser**

- Define the protocol-causal mask. Show std reduction 1.6–8× across 4 datasets (`RESULTS.md` "Stability").
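
The aggregator contract in §3.4 (fit once on benign score vectors, return one scalar, never touch attack labels) can be illustrated with a short NumPy sketch. It is not the project implementation: `fit_oas_mahalanobis` and `mahalanobis_sq` are hypothetical names, the shrinkage coefficient follows the commonly cited OAS closed form, and the synthetic 10-d vectors only stand in for the score family of §3.3. In use, the fit would run on target-domain benign validation scores and `mahalanobis_sq` would supply the single deployable score.

```python
import numpy as np


def fit_oas_mahalanobis(benign_scores: np.ndarray):
    """Fit mean and OAS-shrunk precision matrix on benign score vectors only.

    benign_scores has shape (n_samples, n_scores); no labels are used.
    Assumes the benign scores are not all identical (non-zero variance).
    """
    n, p = benign_scores.shape
    mu = benign_scores.mean(axis=0)
    centered = benign_scores - mu
    emp_cov = centered.T @ centered / n  # maximum-likelihood covariance
    tr = np.trace(emp_cov)
    tr_sq = np.trace(emp_cov @ emp_cov)
    # OAS shrinkage coefficient (closed form), clipped to at most 1.
    num = (1.0 - 2.0 / p) * tr_sq + tr ** 2
    den = (n + 1.0 - 2.0 / p) * (tr_sq - tr ** 2 / p)
    rho = 1.0 if den <= 0 else min(1.0, num / den)
    # Shrink towards a scaled identity with the same average variance.
    shrunk_cov = (1.0 - rho) * emp_cov + rho * (tr / p) * np.eye(p)
    precision = np.linalg.inv(shrunk_cov)
    return mu, precision, rho


def mahalanobis_sq(scores: np.ndarray, mu: np.ndarray, precision: np.ndarray):
    """Squared Mahalanobis distance of each row of `scores` to the benign mean."""
    diff = scores - mu
    return np.einsum("ij,jk,ik->i", diff, precision, diff)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic placeholders for 10-d score vectors; not real JANUS outputs.
    benign_val = rng.normal(size=(500, 10))
    benign_test = rng.normal(size=(200, 10))
    shifted_test = rng.normal(loc=3.0, size=(200, 10))

    mu, precision, rho = fit_oas_mahalanobis(benign_val)
    print("shrinkage:", rho)
    print("mean d^2 benign :", mahalanobis_sq(benign_test, mu, precision).mean())
    print("mean d^2 shifted:", mahalanobis_sq(shifted_test, mu, precision).mean())
```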
---

### §4 Evaluation (~3–4 pages)

**§4.1 Datasets, baselines, protocol**

- 4 datasets: ISCXTor2016, CICIDS2017, CICDDoS2019, CICIoT2023; canonical `packets.npz` 9-d schema.
- Baselines (locked from PDFs in `paper/`): Shafir NF (T-Netw 2026), ConMD (TIFS 2026), TIPSO-GAN (NDSS 2026), Kitsune, AE / MemAE, OCSVM.
- Protocol: 10K benign train (matches Shafir), 3 seeds, AUROC primary + thresholded F1 / Precision / Recall @ P95 / P99.

**§4.2 Within-dataset (Table 1)**

- 4 datasets × {Shafir / ConMD / TIPSO-GAN / Kitsune / AE / **JANUS + Mahal-OAS** / JANUS best-fixed}.
- Honest framing: "JANUS matches or exceeds the NF SOTA on 3/3 directly comparable benchmarks; CICIoT2023 reported as additional benchmark due to metric mismatch (Caveat 1, `RESULTS.md`)".
- One sentence acknowledging that "within-dataset is saturated; the discriminating axis is cross-dataset (next section)".

**§4.3 Cross-dataset (Table 2 + Figure 3)**

- The headline of the paper. Table = 4×4 matrix Mahal vs `terminal_norm`.
- Figure 3 = detailed version of Figure 1.
- Critical details:
  - Forward IDS17 → DDoS19: +0.07 over Shafir (genuine SOTA).
  - Reverse DDoS19 → IDS17: 0.93 = Shafir 0.93 (matches, does not exceed; Caveat 2).
  - 12 off-diagonal cells average +0.175 over `terminal_norm`.
  - 4 "collapse cells" (≤ 0.57) all recovered to ≥ 0.75.

**§4.4 Mechanism analysis (Table 3 + Figure 4)**

- Source-likeness collapse: 3 backbones × 16 scores matrix.
- DFM head ablation: `λ_disc ∈ {0, 0.5, 1.0, 2.0}` vs reverse-cross AUROC.
- Mahalanobis aggregator ablation: max-z / plain Mahal / Ledoit-Wolf / OAS, sourced from `SCORE_ROUTER.md`.

**§4.5 Ablations & robustness**

- σ sensitivity (`sigma_validation.md` 4×2 table).
- Causal-packet attention contribution to std reduction (`RESULTS.md` Stability).
- Per-attack-family table (`RESULTS.md` "Per-attack-family pattern", including the SSH-Patator counter-example).

**§4.6 Thresholded metrics & operational impact**

- `RESULTS_THRESHOLDED.md` F1 / Precision / Recall @ P95.
- Direct dialogue with industry alert-fatigue numbers: "at the P95 threshold, our cross precision ≈ 0.95".

**§4.7 Discussion (sub-section, 1 paragraph)**

- Limitations: aggregator post-hoc selection, target-benign-calibrated transfer (not zero-shot; Caveat 3), CICIoT2023 metric mismatch.
- Honest reporting here closes the door on reviewer attacks.

---

## Part D. Writing red lines (from project memory)

1. **Never** write "zero-shot transfer"; write "calibrated cross-domain transfer" (Mahalanobis is fit on target benign).
2. **Never** claim "+SOTA on CICIoT2023"; write "additional benchmark; metric mismatch (Shafir F1 vs our AUROC)".
3. Reverse cross is "matches Shafir 0.93", not "beats". Our +0.31 is vs our own legacy.
4. Best-fixed numbers are an ablation upper bound, never the SOTA claim.
5. Mahalanobis-OAS was post-hoc-selected; write "we evaluated 5 benign-only aggregators; OAS performed best with sensitivity ≤ 0.005 vs Ledoit-Wolf".

---

## Part E. Sources

- Shafir, Giryes, Wool. *Explainable Anomaly Detection in Network Traffic Using Normalizing Flows*, IEEE T-Netw 2026 (PDF in `paper/`).
- Lian et al. *Contextual Masking Distillation for Network Traffic Anomaly Detection*, IEEE TIFS 2026.
- *TIPSO-GAN: Malicious Network Traffic Detection*, NDSS 2026.
- Gat et al. *Discrete Flow Matching*, NeurIPS 2024.
- *Scalable, Explainable and Provably Robust Anomaly Detection with One-Step Flow Matching*, NeurIPS 2025 (arXiv 2510.18328).
- *How and Why: Taming Flow Matching for Unsupervised Anomaly Detection* (rFM), arXiv 2508.05461.
- *On the Cross-Dataset Generalization of Machine Learning for Network Intrusion Detection*, arXiv 2402.10974.
- *Expectations Versus Reality: Evaluating Intrusion Detection Systems in Practice*, arXiv 2403.17458.
- HDSE-IDS: *Heterogeneous Deep Stacked Ensemble for Cross-Domain IDS*, Connection Science 2025.
- *Self-Supervised Transformer-based Contrastive Learning for IDS*, arXiv 2505.08816.
- *GraphIDS: Self-supervised GNN for Network Intrusion Detection*, NeurIPS 2025.
- *Network traffic foundation models: A systematic review*, ScienceDirect 2026.
- Tong et al. *Improving and Generalizing Flow-Based Generative Models with Minibatch Optimal Transport*, TMLR 2024.
- *DMAD: Diffusion Models for Anomaly Detection (survey)*, IJCAI 2025.
- *Alert Fatigue in Security Operations Centres: Research Challenges and Opportunities*, ACM Computing Surveys 2024.
- *Beyond the Norm: Unsupervised Anomaly Detection in Telecommunications with Mahalanobis Distance*, MDPI Computers 2025.
- *Autoencoders for Anomaly Detection are Unreliable*, OpenReview 2025.