2 Background

2.1 Unsupervised network anomaly detection

We consider the standard unsupervised setting: a detector is trained only on benign traffic and, at inference time, must assign an anomaly score to each flow without access to attack labels at any stage of training. Public benchmarks (e.g., CIC-IDS2017, CIC-DDoS2019, ISCXTor2016) provide labelled attack traffic for evaluation only. Two granularities dominate the literature: flow-level detectors operate on per-flow aggregate features (byte counts, inter-arrival statistics, flag tallies), while packet-level detectors operate on the ordered sequence of per-packet features inside a flow and retain temporal structure that flow aggregates discard.
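The difference between the two granularities can be made concrete with a small sketch. The `Packet` record, the feature names, and the two helper functions below are hypothetical and chosen only for illustration; they are not the feature schema of any benchmark named above.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class Packet:
    timestamp: float  # seconds since flow start (hypothetical field)
    size: int         # payload size in bytes
    flags: str        # TCP flags, e.g. "S", "SA", "A"


def flow_level_features(packets):
    """Collapse a flow into order-free aggregates."""
    gaps = [b.timestamp - a.timestamp for a, b in zip(packets, packets[1:])]
    return {
        "total_bytes": sum(p.size for p in packets),
        "mean_iat": mean(gaps) if gaps else 0.0,
        "std_iat": pstdev(gaps) if gaps else 0.0,
        "syn_count": sum("S" in p.flags for p in packets),
    }


def packet_level_features(packets):
    """Keep the ordered per-packet sequence (size, inter-arrival time, flags)."""
    seq, prev = [], None
    for p in packets:
        iat = 0.0 if prev is None else p.timestamp - prev.timestamp
        seq.append((p.size, iat, p.flags))
        prev = p
    return seq
```

Two flows whose packets carry the same sizes in a different order produce identical flow-level aggregates under this sketch but different packet-level sequences, which is the temporal structure the paragraph above refers to.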

Within-dataset AUROC on the standard benchmarks has narrowed to within seed noise across recent recipes; the substantive evaluation axis is now cross-dataset transfer, in which a detector is trained on one environment and evaluated on traffic from another. Performance on this axis has not converged.

2.2 Continuous Flow Matching

Continuous Flow Matching (CFM) trains a time-dependent vector field $v_\theta(x, t)$ to transport a tractable source distribution (typically $\mathcal{N}(0, I)$) to the data distribution along an ODE $\mathrm{d}x_t = v_\theta(x_t, t)\,\mathrm{d}t$. The training objective regresses $v_\theta$ onto a target velocity defined along a chosen conditional probability path; for the linear (Gaussian) path this reduces to a simple least-squares loss, side-stepping the score-matching objective and stochastic sampler of diffusion models. OT-CFM straightens trajectories by pairing source and data samples through minibatch optimal transport, which lowers integration error and enables stable few-step inference.
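As a minimal sketch of the regression described above, assume the noise-free linear path, where the interpolant is `x_t = (1 - t) * x0 + t * x1` and the target velocity is `x1 - x0`. The function names are hypothetical, and the exhaustive permutation search merely stands in for a proper minibatch OT solver (practical implementations would use an assignment solver such as SciPy's `linear_sum_assignment`); it is only viable for tiny batches.

```python
import itertools
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def ot_pair(x0s, x1s):
    """Reorder x1s to minimise the total squared transport cost to x0s.

    Brute force over permutations: an illustration, not a scalable solver.
    """
    best = min(
        itertools.permutations(range(len(x1s))),
        key=lambda perm: sum(sq_dist(x0s[i], x1s[j]) for i, j in enumerate(perm)),
    )
    return [x1s[j] for j in best]


def cfm_example(x0, x1, t):
    """Point on the linear path and the velocity the model should regress onto."""
    x_t = [(1 - t) * a + t * b for a, b in zip(x0, x1)]
    target = [b - a for a, b in zip(x0, x1)]
    return x_t, target


def cfm_loss(v, x0s, x1s, rng):
    """Monte Carlo estimate of the least-squares CFM loss for a velocity model v(x, t)."""
    total = 0.0
    for x0, x1 in zip(x0s, x1s):
        t = rng.random()  # t ~ Uniform[0, 1)
        x_t, target = cfm_example(x0, x1, t)
        total += sq_dist(v(x_t, t), target)
    return total / len(x0s)
```

Calling `ot_pair` before `cfm_loss` gives the OT-CFM variant under these assumptions: paired endpoints are closer together, so the straight-line targets cross less often.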

A trained CFM model gives access not only to the learned density but also to a family of geometric quantities along the trajectory: terminal velocity norm, divergence, curvature, and Jacobian-trace estimators. These can be read off the velocity field without retraining.
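Two of these quantities can be sketched directly from a black-box velocity function. The code below assumes a hypothetical callable `v(x, t)` returning a list of floats, takes "terminal" to mean `t = 1`, and uses central finite differences in place of automatic differentiation; the Hutchinson estimator with Rademacher probes is one common Jacobian-trace estimator, not necessarily the one used elsewhere in this paper.

```python
import math
import random


def velocity_norm(v, x, t=1.0):
    """Euclidean norm of the velocity at (x, t); t = 1 is assumed to be the data end."""
    return math.sqrt(sum(c * c for c in v(x, t)))


def divergence_fd(v, x, t, h=1e-5):
    """Divergence (trace of the Jacobian) via one central difference per coordinate."""
    div = 0.0
    for i in range(len(x)):
        xp, xm = list(x), list(x)
        xp[i] += h
        xm[i] -= h
        div += (v(xp, t)[i] - v(xm, t)[i]) / (2 * h)
    return div


def divergence_hutchinson(v, x, t, n_probes=8, h=1e-5, rng=None):
    """Stochastic trace estimate: average of eps^T J eps over Rademacher probes eps."""
    rng = rng or random.Random(0)
    est = 0.0
    for _ in range(n_probes):
        eps = [rng.choice((-1.0, 1.0)) for _ in x]
        xp = [xi + h * e for xi, e in zip(x, eps)]
        xm = [xi - h * e for xi, e in zip(x, eps)]
        # Finite-difference Jacobian-vector product J @ eps.
        jvp = [(a - b) / (2 * h) for a, b in zip(v(xp, t), v(xm, t))]
        est += sum(e * j for e, j in zip(eps, jvp))
    return est / n_probes


def toy_field(x, t):
    """Hypothetical linear field with Jacobian diag(2, -0.5), so divergence 1.5."""
    return [2.0 * x[0], -0.5 * x[1]]
```

The coordinate-wise version costs one pair of evaluations per input dimension, whereas the Hutchinson version costs one pair per probe, which is why the latter is usually preferred for high-dimensional inputs.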

2.3 Discrete Flow Matching

Continuous FM does not apply to categorical state spaces, where adding Gaussian noise is undefined. Discrete Flow Matching (DFM) generalises the framework to finite alphabets through continuous-time Markov chains: the model parameterises token-level transition rates that interpolate between a source distribution (typically uniform) and the data distribution. The training objective remains a simple regression onto target rates derived from a chosen interpolation schedule. DFM has been validated on language and molecular generation; mixed continuous-discrete data, where each observation has both numerical and categorical channels, is the natural composition of CFM and DFM.
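One common instance of such an interpolation is the mixture path with a linear schedule `kappa(t) = t`: at time `t`, each position independently shows its data token with probability `t` and a uniformly drawn source token otherwise. The sketch below assumes that schedule and a uniform source; the function names are hypothetical, and the jump rate shown is the hazard `kappa'(t) / (1 - kappa(t))` implied by that schedule rather than a rate taken from this paper.

```python
import random


def corrupt(x1, t, vocab_size, rng):
    """Sample x_t on the mixture path between a uniform source and the data x1."""
    return [
        tok if rng.random() < t else rng.randrange(vocab_size)
        for tok in x1
    ]


def jump_rate(t):
    """Rate at which a position still holding a source token jumps to its data token.

    For kappa(t) = t this is 1 / (1 - t), which grows without bound as t -> 1.
    """
    if not 0.0 <= t < 1.0:
        raise ValueError("t must lie in [0, 1)")
    return 1.0 / (1.0 - t)
```

Under these assumptions a training example is a triple `(x_t, t, x1)`: the model sees the partially corrupted sequence and is trained so that its predicted rates match the target jumps toward `x1`.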


3 Related Work

3.1 Reconstruction-based detectors

Autoencoder-style detectors learn to reconstruct benign inputs and score anomalies by reconstruction error. Kitsune popularised the design for online NIDS using an ensemble of small autoencoders, and MemAE introduced a learned memory bank to constrain the latent representation to the benign manifold. The family suffers from a documented identity-mapping failure: sufficiently expressive autoencoders reconstruct out-of-distribution inputs near-perfectly, eroding the gap between benign and anomalous reconstruction error. Recent critiques argue that this behaviour is structural rather than a hyperparameter artefact, and that reconstruction error is therefore an unreliable anomaly score in general.
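The identity-mapping failure can be illustrated with a toy linear stand-in for an autoencoder: encoding is projection onto a set of orthonormal directions and decoding is the transpose. The two hand-picked bases below are hypothetical; real detectors such as Kitsune or MemAE are nonlinear and learned, so this is only a sketch of the mechanism.

```python
import math


def reconstruct(x, basis):
    """Encode x as coefficients on an orthonormal basis, then decode."""
    coeffs = [sum(xi * bi for xi, bi in zip(x, b)) for b in basis]
    return [sum(c * b[i] for c, b in zip(coeffs, basis)) for i in range(len(x))]


def recon_error(x, basis):
    """Squared reconstruction error, used as the anomaly score."""
    return sum((xi - ri) ** 2 for xi, ri in zip(x, reconstruct(x, basis)))


S = 1 / math.sqrt(2)
NARROW = [(S, S)]          # 1-D bottleneck along the assumed benign direction y = x
WIDE = [(S, S), (S, -S)]   # bottleneck as wide as the input: an identity map

benign = (2.0, 2.0)        # lies on the benign direction
anomaly = (1.0, -1.0)      # orthogonal to it
```

With `NARROW`, the anomaly scores about 2.0 while the benign point scores near zero; with `WIDE`, both score near zero, so the gap the detector relies on disappears once the model is expressive enough to copy its input.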

3.2 Density-based detectors

Three deep generative families currently hold the public SOTA on NIDS benchmarks. Normalising flows fit an explicit invertible density on benign traffic and score by negative log-likelihood; the strongest recent pipeline reports 0.93 within-dataset AUROC on CIC-DDoS2019 with cross-domain transfer in the 0.89–0.93 range. Diffusion-based detectors include contextual masking distillation schemes that compare a student denoiser against a benign-trained teacher, alongside a broader 2025 survey of diffusion AD variants. GAN-based detectors, exemplified by recent NDSS work that augments the optimisation with particle-swarm search, score by discriminator output or cycle-reconstruction error. All three families reduce a packet stream to a single scalar derived from one homogeneous probabilistic model fit to benign data, and the reported log-likelihood is known to dissociate from anomaly status once the benign distribution drifts.
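The scoring rule shared by these families can be sketched with the simplest explicit density, a diagonal Gaussian fit to benign feature vectors. This is a stand-in for a learned flow, diffusion, or GAN model, and the feature values below are made up for illustration rather than taken from any benchmark.

```python
import math
from statistics import mean, pvariance


def fit_diag_gaussian(rows):
    """Per-feature mean and variance of benign training vectors."""
    cols = list(zip(*rows))
    mu = [mean(c) for c in cols]
    var = [pvariance(c) + 1e-6 for c in cols]  # small floor avoids division by zero
    return mu, var


def nll(x, mu, var):
    """Negative log-likelihood of x under the fitted diagonal Gaussian."""
    return sum(
        0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mu, var)
    )


# Hypothetical benign training data from one environment.
benign_train = [[1.0, 10.0], [1.2, 11.0], [0.8, 9.0], [1.0, 10.0]]
mu, var = fit_diag_gaussian(benign_train)

in_distribution = [1.1, 10.5]   # benign, same environment
drifted_benign = [3.0, 10.0]    # benign, but from a shifted environment
```

Here `nll(drifted_benign, mu, var)` is far larger than `nll(in_distribution, mu, var)` even though both samples are benign by construction, which is the dissociation between likelihood and anomaly status noted above.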

A separate line of work uses self-supervised contrastive representations, graph neural networks, or pre-trained traffic foundation models, with anomaly scoring delegated to a downstream detector such as OCSVM or Mahalanobis distance. These pipelines are typically two-stage, are primarily evaluated on encrypted-traffic classification rather than open-set anomaly detection, and are not the focus of the cross-dataset robustness comparison we pursue.

3.3 Flow Matching for anomaly detection

Outside NIDS, two recent works adopt Flow Matching as the AD objective. A time-reversed FM detector for image anomaly detection combines a worst-transport coupling with a high-dimensional latent, scoring by deviation from the learned velocity field. A tabular detector built on one-step FM offers explainability and provable robustness guarantees on heterogeneous structured data. Both validate FM-based scoring as competitive with reconstruction- and density-based baselines in their respective regimes. Discrete Flow Matching has been validated on language and molecular generation but not, to our knowledge, evaluated as an anomaly-detection objective. No prior work applies either continuous or discrete FM to packet-sequence NIDS.

3.4 Cross-dataset robustness in NIDS

As within-dataset metrics have saturated, cross-dataset evaluation has emerged as the field's discriminating axis. A 2024 systematic study measures the generalisation gap across the standard NIDS benchmarks under matched feature schemas and reports AUROC drops of 0.10–0.30 when detectors trained on one environment are evaluated on another. Subsequent work on heterogeneous deep stacked ensembles, calibrated transformers, and few-shot multi-domain fusion targets the same gap through architectural or training-time interventions. The phenomenon is broadly observed and quantified; what is missing from the literature is a mechanism-level account of why density-based scores in particular degrade under domain shift, as opposed to an accumulation of empirical remedies. The pilot study in §X revisits this gap directly and frames the structural failure mode that the rest of the paper addresses.
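For concreteness, the generalisation gap can be computed as the difference between within-dataset and cross-dataset AUROC for the same trained detector. The sketch below uses the pairwise (Mann-Whitney) form of AUROC; the function names and the toy scores in the usage note are hypothetical and do not reproduce any reported result.

```python
def auroc(scores, labels):
    """AUROC as the fraction of (attack, benign) pairs ranked correctly.

    labels: 1 for attack, 0 for benign; ties count as half a correct pair.
    Quadratic in the number of samples, which is fine for a sketch.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one attack and one benign sample")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def transfer_gap(within, cross):
    """AUROC drop from the training environment to an unseen one.

    Each argument is a (scores, labels) pair from the same trained detector.
    """
    return auroc(*within) - auroc(*cross)
```

For example, `transfer_gap(([0.1, 0.2, 0.8, 0.9], [0, 0, 1, 1]), ([0.1, 0.8, 0.2, 0.9], [0, 0, 1, 1]))` compares a perfectly ranked source test set with a target set in which one benign flow outscores one attack.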