Network intrusion detection systems (NIDS) in production are dogged by two persistent failures. First, alert volume overwhelms downstream triage: industry surveys and recent reviews report false-positive rates that frequently exceed 90% and, at the upper end, approach 99% [Trend2024; ACM-CSur-2024]. Second, detectors that score well in one environment lose a substantial fraction of that performance once evaluated on traffic from a different deployment [Cross2402.10974; Tand2025]. Modern NIDS research has largely converged on unsupervised anomaly detection, but neither failure has a settled answer within that paradigm. Because within-dataset scores on the standard public benchmarks now differ by no more than reporting noise, the substantive evaluation axis has shifted to cross-dataset robustness, where the field is far from converged.
The unsupervised NIDS toolkit has converged on three families of methods, all of which reduce a packet stream to a single anomaly score. Reconstruction-based detectors such as autoencoders, KitNET, and MemAE [Kitsune; MemAE] score by reconstruction error and exhibit a documented identity-mapping failure in which anomalies far from the benign manifold can still be reconstructed near-perfectly, undermining the core assumption [AE-Unreliable-2025; NeurIPS24-Reconstruction]. Density-based detectors built on normalising flows (NF) are the current public SOTA; the strongest recent pipeline reports 0.93 AUROC within-dataset on CIC-DDoS2019, with cross-domain transfer ranging from 0.89 to 0.93 depending on direction [Shafir2026]. The log-likelihood these methods rely on, however, is known to dissociate from anomaly status once the benign distribution drifts [NFAD2021]. Diffusion-based detectors [ConMD2026; DMAD2025] and optimised GAN variants [TIPSO-GAN-NDSS2026] have arrived in 2025–2026 with strong within-dataset numbers but share the same underlying object: a single scalar derived from a homogeneous probabilistic model fit to benign traffic.
Why these density-based scores transfer poorly has gone uncharacterised. We identify a structural failure mode we call source-likeness collapse. Under target-domain drift, the log-likelihood emitted by a benign-fit generative model no longer discriminates "x is benign vs malicious" but rather "x lies in the source benign distribution vs not"; the two coincide only when there is no shift, and diverge as drift grows. Empirically, across three independent Continuous Flow-Matching backbones (CFM; framework introduced below) and 16 candidate score channels, the canonical density-based score for these models (the terminal-norm of the velocity field) drops to AUROC ≤ 0.63 when trained on CIC-DDoS2019 and evaluated on CIC-IDS2017, with four of the twelve off-diagonal cells of our 4×4 cross-dataset matrix falling below 0.57 (near-random). The collapse persists across recipes, ruling out hyperparameter artefacts and indicating a structural property of likelihood as an anomaly proxy.
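
The mechanism can be illustrated with a toy sketch. This is not the experiment reported here: a diagonal Gaussian stands in for the benign-fit generative model, and the synthetic data, the drift magnitude, and the helper names (`fit_gaussian`, `nll`, `auroc`) are assumptions made purely for illustration. Attacks differ from benign traffic along one feature, while the target domain drifts along the remaining features.

```python
import numpy as np

def auroc(neg_scores, pos_scores):
    """P(random positive outscores random negative); ties count as 0.5."""
    pos = np.asarray(pos_scores)[:, None]
    neg = np.asarray(neg_scores)[None, :]
    return float(np.mean((pos > neg) + 0.5 * (pos == neg)))

def fit_gaussian(x):
    """Fit a diagonal Gaussian (toy stand-in for a benign-fit density model)."""
    return x.mean(axis=0), x.var(axis=0) + 1e-6

def nll(x, mean, var):
    """Negative log-likelihood per sample; higher means more anomalous."""
    return 0.5 * np.sum((x - mean) ** 2 / var + np.log(2 * np.pi * var), axis=1)

rng = np.random.default_rng(0)
n, d = 2000, 8

attack_offset = np.zeros(d)
attack_offset[0] = 3.0   # assumption: attacks differ along feature 0 only
drift = np.zeros(d)
drift[1:] = 4.0          # assumption: benign drift along the other features

src_benign = rng.normal(size=(n, d))
src_attack = rng.normal(size=(n, d)) + attack_offset
tgt_benign = rng.normal(size=(n, d)) + drift
tgt_attack = rng.normal(size=(n, d)) + drift + attack_offset

# The density model only ever sees source-domain benign data.
mean, var = fit_gaussian(src_benign)

within = auroc(nll(src_benign, mean, var), nll(src_attack, mean, var))
cross = auroc(nll(tgt_benign, mean, var), nll(tgt_attack, mean, var))
# How well the same score separates source benign from target benign.
membership = auroc(nll(src_benign, mean, var), nll(tgt_benign, mean, var))

print(f"within-domain benign-vs-attack AUROC: {within:.3f}")
print(f"cross-domain  benign-vs-attack AUROC: {cross:.3f}")
print(f"source-vs-target benign AUROC       : {membership:.3f}")
```

In this toy setting the likelihood score separates source from target benign traffic far better than it separates target benign from target attack traffic, which is the pattern the term source-likeness collapse describes. The sketch says nothing about how large the effect is on real traffic.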
Two recent generative frameworks point to a way out. Continuous Flow Matching [Lipman2023; OT-CFM-Tong2024] learns a velocity field rather than a reconstruction, side-stepping the identity-mapping trap of reconstruction-based detectors. Discrete Flow Matching [Gat-NeurIPS2024] extends the same machinery to categorical state spaces. Network packets sit naturally in both regimes: each packet contributes three continuous channels (size, inter-arrival time, TCP window) and six binary channels (direction and five TCP flags). To our knowledge, neither paradigm has been applied to packet-sequence NIDS, although Flow Matching has been validated for image [rFM2025] and tabular [TCCM-NeurIPS2025] anomaly detection. Mixed continuous–discrete modelling emits a family of complementary scores rather than the single homogeneous likelihood under which source-likeness collapse occurs, and provides a structural path to the discrete protocol semantics that prior NF / autoencoder / GAN approaches must either Gaussianise away or ignore.
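
For readers unfamiliar with the objective, the following is a minimal NumPy sketch of the basic conditional flow-matching regression target with independent noise-data coupling, in the form given by [Lipman2023]. It is not the JANUS training code: the function names (`cfm_batch`, `cfm_loss`), the `velocity_fn` callable, and the batch shape of three continuous channels per packet are assumptions for illustration, and conditioning on a backbone representation, the optimal-transport coupling of [OT-CFM-Tong2024], and the discrete head are all omitted.

```python
import numpy as np

def cfm_batch(x1, rng, sigma_min=0.0):
    """Sample (t, x_t, target velocity) for a batch of data points x1.

    x0 is Gaussian noise; x_t lies on the straight path from x0 to x1,
    and the regression target is the time derivative of that path.
    """
    x0 = rng.standard_normal(x1.shape)
    t = rng.uniform(size=(x1.shape[0], 1))
    x_t = (1.0 - (1.0 - sigma_min) * t) * x0 + t * x1
    target = x1 - (1.0 - sigma_min) * x0
    return t, x_t, target

def cfm_loss(velocity_fn, x1, rng, sigma_min=0.0):
    """Mean squared error between predicted and target velocity."""
    t, x_t, target = cfm_batch(x1, rng, sigma_min)
    pred = velocity_fn(t, x_t)
    return float(np.mean(np.sum((pred - target) ** 2, axis=1)))

# Hypothetical batch: 64 packets, 3 continuous channels each.
rng = np.random.default_rng(0)
packets = rng.standard_normal((64, 3))

# A placeholder "model" that always predicts zero velocity.
zero_velocity = lambda t, x: np.zeros_like(x)
print("loss of the zero-velocity placeholder:", cfm_loss(zero_velocity, packets, rng))
```

The model regresses a velocity rather than reproducing its input, so there is no identity map that trivially minimises the loss; this is the property the paragraph above contrasts with reconstruction-based detectors.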
We present JANUS, an unsupervised packet-sequence anomaly detector with three components.
- A causal-packet Transformer backbone that produces a temporally-ordered representation of each flow.
- Two jointly-trained Flow-Matching heads on benign traffic, one over the continuous packet channels and one over the discrete protocol channels. Together they emit a family of complementary scores rather than a single likelihood.
- A benign-only aggregator that compresses the score family into a single deployable scalar, fit on target-domain benign validation data and never on attack labels.

Together, the discrete head supplies a transfer-stable signal that survives source-likeness collapse, and the aggregator combines it with the residual information in the continuous-head scores rather than discarding them. The unsupervised contract holds end-to-end.

We make four contributions:
- (C1) First Flow-Matching detector for NIDS. To the best of our knowledge, JANUS is the first network anomaly detector to use Flow Matching as its training objective. It also combines continuous and discrete FM heads, a configuration not present in prior FM anomaly-detection work on image [rFM2025] or tabular [TCCM-NeurIPS2025] data.
- (C2) Characterisation of source-likeness collapse. We name and analyse a structural failure mode in which density-based anomaly scores degrade into source-domain membership classifiers under cross-dataset shift. The phenomenon persists across three independent CFM backbones and all 16 candidate score channels we evaluate, identifying it as a property of density-based scoring rather than of any specific backbone. This explains a cross-domain failure mode that prior work observed but did not name.
- (C3) A benign-only Mahalanobis aggregator. An aggregator with Oracle-Approximating-Shrinkage (OAS) covariance compresses the score family into a single deployable scalar without consuming attack labels. We compare five aggregators (max-z, plain Mahalanobis, Ledoit–Wolf, OAS, and score-subset variants) and observe that AUROC varies by at most 0.005 across them, which supports the design as robust rather than hyperparameter-tuned.
- (C4) Cross-dataset robustness with within-dataset competitiveness. On a 4×4 cross-dataset matrix (12 off-diagonal directions, three seeds per cell), JANUS averages +0.175 AUROC over the terminal-norm baseline and recovers all four collapse cells (terminal-norm < 0.57) to ≥ 0.75. It exceeds the Shafir NF baseline [Shafir2026] by +0.07 AUROC (0.96 vs 0.89) when trained on CIC-IDS2017 and evaluated on CIC-DDoS2019, and matches it (0.93) when the direction is reversed. Within-dataset, it exceeds the NF SOTA on three benchmarks by margins of +0.054 to +0.118, all exceeding the standard deviation across three seeds.
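
The aggregation step in (C3) can be sketched as follows. This is a minimal illustration under stated assumptions, not the JANUS implementation: the class name `BenignMahalanobisAggregator`, the per-channel standardisation before covariance estimation, and the synthetic 16-channel score matrix are all hypothetical. The shrinkage coefficient follows the standard OAS formula; a maintained estimator such as scikit-learn's `sklearn.covariance.OAS` could be used in its place.

```python
import numpy as np

def oas_covariance(x):
    """OAS-shrunk covariance of the rows of x; returns (covariance, shrinkage)."""
    n, p = x.shape
    centered = x - x.mean(axis=0)
    s = centered.T @ centered / n          # empirical covariance
    tr_s = np.trace(s)
    tr_s2 = np.sum(s * s)                  # equals trace(S @ S) for symmetric S
    num = (1.0 - 2.0 / p) * tr_s2 + tr_s ** 2
    den = (n + 1.0 - 2.0 / p) * (tr_s2 - tr_s ** 2 / p)
    rho = 1.0 if den <= 0 else min(num / den, 1.0)
    # Shrink towards a scaled identity with the same average variance.
    shrunk = (1.0 - rho) * s + rho * (tr_s / p) * np.eye(p)
    return shrunk, rho

class BenignMahalanobisAggregator:
    """Fit on benign score vectors only; emit one squared Mahalanobis distance."""

    def fit(self, benign_scores):
        self.mean_ = benign_scores.mean(axis=0)
        self.std_ = benign_scores.std(axis=0) + 1e-9
        z = (benign_scores - self.mean_) / self.std_   # assumption: z-score first
        self.covariance_, self.shrinkage_ = oas_covariance(z)
        self.precision_ = np.linalg.inv(self.covariance_)
        return self

    def score(self, scores):
        z = (scores - self.mean_) / self.std_
        return np.einsum("ij,jk,ik->i", z, self.precision_, z)

# Synthetic stand-in for a benign validation set with 16 correlated score channels.
rng = np.random.default_rng(0)
mixing = rng.standard_normal((16, 16))
benign_val = rng.standard_normal((500, 16)) @ mixing

aggregator = BenignMahalanobisAggregator().fit(benign_val)
print("OAS shrinkage coefficient:", round(aggregator.shrinkage_, 4))
print("median benign score      :", float(np.median(aggregator.score(benign_val))))
```

No attack labels enter `fit`, which matches the benign-only contract described above. The squared distance is two-sided, so a channel that deviates in either direction from its benign mean raises the aggregate score.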