Initial commit: code, paper, small artifacts

2026-05-07 20:47:30 +08:00
commit fae2db8cff
322 changed files with 33159 additions and 0 deletions

paper/background_related.md Normal file

@@ -0,0 +1,125 @@
## 2 Background
### 2.1 Unsupervised network anomaly detection
We consider the standard unsupervised setting: a detector is trained only on
benign traffic, with no access to attack labels at any stage of training,
and at inference time must assign an anomaly score to each flow. Public
benchmarks (e.g., CIC-IDS2017, CIC-DDoS2019, ISCXTor2016) provide labelled
attack traffic for evaluation only. Two granularities dominate the
literature: flow-level detectors operate on per-flow aggregate features
(byte counts, inter-arrival statistics, flag tallies), while packet-level
detectors operate on the ordered sequence of per-packet features inside a
flow and retain temporal structure that flow aggregates discard.
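
To make the two granularities concrete, the sketch below derives both views from the same packet list. The packet tuple layout `(timestamp_seconds, size_bytes, tcp_flags)` and all function names are assumptions made for illustration; they are not the feature schema of any benchmark named above. The two toy flows share identical aggregates but differ in packet order, which only the packet-level view can see.

```python
from statistics import mean

def flow_level_features(packets):
    """Per-flow aggregates; packet order is discarded."""
    times = [p[0] for p in packets]
    sizes = [p[1] for p in packets]
    gaps = [b - a for a, b in zip(times, times[1:])]
    return {
        "total_bytes": sum(sizes),
        "mean_inter_arrival": mean(gaps) if gaps else 0.0,
        "syn_count": sum("S" in p[2] for p in packets),
    }

def packet_level_features(packets):
    """Ordered (size, inter-arrival) pairs; temporal structure is kept."""
    features = []
    previous_time = None
    for timestamp, size, _flags in packets:
        gap = 0.0 if previous_time is None else timestamp - previous_time
        features.append((size, gap))
        previous_time = timestamp
    return features

# Hypothetical flows: same sizes and timing, different ordering of sizes.
flow_a = [(0.0, 100, "S"), (0.5, 200, "A"), (1.0, 50, "A")]
flow_b = [(0.0, 50, "S"), (0.5, 200, "A"), (1.0, 100, "A")]
```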

Within-dataset AUROC differences between recent methods on the standard
benchmarks have narrowed to within seed-to-seed noise; the substantive
evaluation axis is now cross-dataset transfer, in which a detector is
trained on one environment and evaluated on traffic from another.
Performance on this axis has not converged.
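
As a reference point for the metric, AUROC can be computed directly from anomaly scores and evaluation-only labels with the Mann-Whitney formulation. The sketch below is a minimal, quadratic-time version written for clarity; the function names and the `transfer_gap` helper are illustrative assumptions, not part of any benchmark tooling.

```python
def auroc(scores, labels):
    """Probability that a random attack sample outscores a random benign one.

    labels: 1 for attack, 0 for benign. Ties contribute 0.5.
    """
    attack = [s for s, y in zip(scores, labels) if y == 1]
    benign = [s for s, y in zip(scores, labels) if y == 0]
    if not attack or not benign:
        raise ValueError("need at least one attack and one benign sample")
    wins = 0.0
    for a in attack:
        for b in benign:
            if a > b:
                wins += 1.0
            elif a == b:
                wins += 0.5
    return wins / (len(attack) * len(benign))

def transfer_gap(within_auroc, cross_auroc):
    """Drop in AUROC when moving from the training environment to another."""
    return within_auroc - cross_auroc
```

Under this sketch, cross-dataset transfer is measured by scoring a second environment's traffic with the same trained detector and comparing the two AUROC values.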
### 2.2 Continuous Flow Matching
Continuous Flow Matching (CFM) trains a time-dependent vector field
$v_\theta(x, t)$ to transport a tractable source distribution (typically
$\mathcal{N}(0, I)$) to the data distribution along an ODE
$\mathrm{d}x_t = v_\theta(x_t, t)\,\mathrm{d}t$. The training objective
regresses $v_\theta$ onto a target velocity defined along a chosen
conditional probability path; for the linear (Gaussian) path this reduces
to a simple least-squares loss, side-stepping the score-matching objective
and stochastic sampler of diffusion models. OT-CFM straightens
trajectories by pairing source and data samples through minibatch optimal
transport, which lowers integration error and enables stable few-step
inference.
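
The least-squares objective for the linear path can be sketched in a few lines. This is an illustrative estimate under stated assumptions: the path is the straight interpolant $x_t = (1 - t)\,x_0 + t\,x_1$ with target velocity $x_1 - x_0$, source and data samples are paired independently (not through minibatch optimal transport, so this is not OT-CFM), and `v_theta` is any callable standing in for the learned field. All names are hypothetical.

```python
import random

def linear_path_sample(x0, x1, t):
    """Point on the straight path from source x0 to data x1 at time t."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]

def target_velocity(x0, x1):
    """Velocity of the linear path, constant in t: x1 - x0."""
    return [b - a for a, b in zip(x0, x1)]

def cfm_loss(v_theta, data_batch, rng):
    """Monte Carlo estimate of the least-squares CFM loss on one minibatch."""
    total = 0.0
    for x1 in data_batch:
        x0 = [rng.gauss(0.0, 1.0) for _ in x1]  # source sample from N(0, I)
        t = rng.random()                        # t drawn uniformly from [0, 1)
        x_t = linear_path_sample(x0, x1, t)
        u = target_velocity(x0, x1)
        v = v_theta(x_t, t)
        total += sum((vi - ui) ** 2 for vi, ui in zip(v, u))
    return total / len(data_batch)
```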

A trained CFM model gives access not only to the learned density but also
to a family of geometric quantities along the trajectory: terminal velocity
norm, divergence, curvature, and Jacobian-trace estimators. These can be
read off the velocity field without retraining.
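
Two of these quantities can be illustrated on any callable velocity field. The sketch below is a toy under stated assumptions: `v` is a hypothetical function `v(x, t)` returning a list, the divergence is obtained by central finite differences rather than by a stochastic trace estimator, and `toy_field` is an invented linear field with known divergence, not a trained model.

```python
import math

def velocity_norm(v, x, t):
    """Euclidean norm of the velocity at (x, t); use t = 1 for the terminal value."""
    return math.sqrt(sum(c * c for c in v(x, t)))

def divergence_fd(v, x, t, h=1e-5):
    """Trace of the Jacobian dv/dx at (x, t) by central finite differences."""
    div = 0.0
    for i in range(len(x)):
        x_plus = list(x)
        x_minus = list(x)
        x_plus[i] += h
        x_minus[i] -= h
        div += (v(x_plus, t)[i] - v(x_minus, t)[i]) / (2.0 * h)
    return div

def toy_field(x, t):
    """Hypothetical contracting field v(x, t) = -x, whose divergence is -dim."""
    return [-xi for xi in x]
```

Finite differences need two field evaluations per dimension, so for high-dimensional packet features a randomised trace estimator would likely be the practical choice; the finite-difference version is used here only because it is easy to check by hand.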
### 2.3 Discrete Flow Matching
Continuous FM does not apply to categorical state spaces, where adding
Gaussian noise is undefined. Discrete Flow Matching (DFM) generalises
the framework to finite alphabets through continuous-time Markov chains:
the model parameterises token-level transition rates that interpolate
between a source distribution (typically uniform) and the data
distribution. The training objective remains a simple regression onto
target rates derived from a chosen interpolation schedule. DFM has been
validated on language and molecular generation; mixed
continuous-discrete data, where each observation has both numerical and
categorical channels, is the natural setting for composing CFM and DFM.
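
The discrete analogue of the noisy interpolant can be sketched as a per-token mixture between source and data. This is a minimal illustration under assumptions not fixed by the text above: the source is uniform over the alphabet, the schedule is linear (a data token is kept with probability `t`), and tokens are corrupted independently. It shows only how an intermediate state could be sampled for training, not the rate parameterisation itself; the function name is hypothetical.

```python
import random

def corrupt_tokens(x1, t, vocab_size, rng):
    """Sample an intermediate sequence x_t between uniform noise and data x1.

    Each position keeps its data token with probability t (assumed linear
    schedule) and is otherwise replaced by a uniform draw from the alphabet.
    """
    return [
        tok if rng.random() < t else rng.randrange(vocab_size)
        for tok in x1
    ]

# At t = 1 the data sequence is returned unchanged; at t = 0 every token
# is an independent uniform draw from range(vocab_size).
```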

---

## 3 Related Work
### 3.1 Reconstruction-based detectors
Autoencoder-style detectors learn to reconstruct benign inputs and score
anomalies by reconstruction error. Kitsune popularised the design for
online network intrusion detection systems (NIDS) using an ensemble of
small autoencoders, and MemAE introduced
a learned memory bank to constrain the latent representation to the
benign manifold. The family suffers from a documented identity-mapping
failure: sufficiently expressive autoencoders reconstruct out-of-
distribution inputs near-perfectly, eroding the gap between benign and
anomalous reconstruction error. Recent critiques argue that this
behaviour is structural rather than a hyperparameter artefact, and that
reconstruction error is therefore an unreliable anomaly score in
general.
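
The identity-mapping failure can be seen in its limiting case with a toy example. The sketch below is purely illustrative: `identity_autoencoder` stands in for an over-expressive model that copies any input, and `make_constrained_autoencoder` is a hypothetical model that can only emit a fixed benign prototype (loosely in the spirit of constraining reconstructions to the benign manifold, not an implementation of MemAE or Kitsune).

```python
def reconstruction_score(autoencoder, x):
    """Mean squared reconstruction error of x under the given model."""
    x_hat = autoencoder(x)
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)

def identity_autoencoder(x):
    """Limiting case of an over-expressive autoencoder: copies any input."""
    return list(x)

def make_constrained_autoencoder(benign_prototype):
    """Toy constrained model: always reconstructs a fixed benign prototype."""
    def autoencoder(x):
        return list(benign_prototype)
    return autoencoder

# Hypothetical feature vectors.
benign = [1.0, 1.0, 1.0]
out_of_distribution = [9.0, -4.0, 7.0]
constrained = make_constrained_autoencoder(benign)
```

With the identity model both inputs score zero, so the anomaly score carries no signal; the constrained model scores the out-of-distribution input above the benign one.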
### 3.2 Density-based detectors
Three deep generative families currently hold the public state of the art
on NIDS benchmarks. **Normalising flows** fit an explicit invertible
density on benign traffic and score by negative log-likelihood; the
strongest recent pipeline reports 0.93 within-dataset AUROC on
CIC-DDoS2019 with cross-domain transfer in the 0.89–0.93 range.
**Diffusion-based detectors** include contextual masking distillation
schemes that compare a student denoiser against a benign-trained
teacher; a broader 2025 survey covers further diffusion-based
anomaly-detection variants.
**GAN-based detectors**, exemplified by recent NDSS work that augments
the optimisation with particle-swarm search, score by discriminator
output or cycle-reconstruction error. All three families reduce a
packet stream to a single scalar derived from one homogeneous
probabilistic model fit to benign data, and the reported log-likelihood
is known to dissociate from anomaly status once the benign distribution
drifts.
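
A one-dimensional toy makes the dissociation concrete. In the sketch below a single Gaussian stands in for the deep density model, and all numbers are invented for illustration: the model is fit to benign values from one environment, then scores an attack that sits near that environment's benign range and a benign sample from a shifted environment.

```python
import math
from statistics import mean, pstdev

def fit_gaussian(benign_values):
    """Fit a 1-D Gaussian to benign training values (toy density model)."""
    return mean(benign_values), pstdev(benign_values)

def nll_score(x, mu, sigma):
    """Negative log-likelihood under N(mu, sigma^2); higher means more anomalous."""
    return 0.5 * math.log(2.0 * math.pi * sigma ** 2) + (x - mu) ** 2 / (2.0 * sigma ** 2)

# Hypothetical values of one feature in the training environment.
train_benign = [9.0, 10.0, 11.0, 10.0, 10.0]
mu, sigma = fit_gaussian(train_benign)

attack_in_source = 12.0   # attack close to the training benign range
benign_in_target = 20.0   # benign sample after the environment has drifted

# The drifted benign sample receives the higher anomaly score.
drifted_benign_outscores_attack = (
    nll_score(benign_in_target, mu, sigma) > nll_score(attack_in_source, mu, sigma)
)
```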
A separate line of work uses self-supervised contrastive
representations, graph neural networks, or pre-trained traffic
foundation models, with anomaly scoring delegated to a downstream
detector such as OCSVM or Mahalanobis distance. These pipelines are
typically two-stage, are primarily evaluated on encrypted-traffic
classification rather than open-set anomaly detection, and are not the
focus of the cross-dataset robustness comparison we pursue.
### 3.3 Flow Matching for anomaly detection
Outside NIDS, two recent works adopt Flow Matching as the
anomaly-detection objective. A time-reversed FM detector for image anomaly
detection combines worst-transport coupling with a high-dimensional latent,
scoring by deviation from the learned velocity field. A tabular detector built on
one-step FM offers explainability and provable robustness guarantees on
heterogeneous structured data. Both validate FM-based scoring as
competitive with reconstruction- and density-based baselines in their
respective regimes. Discrete Flow Matching has been validated on
language and molecular generation but not, to our knowledge, evaluated
as an anomaly-detection objective. No prior work applies either
continuous or discrete FM to packet-sequence NIDS.
### 3.4 Cross-dataset robustness in NIDS
As within-dataset metrics have saturated, cross-dataset evaluation has
emerged as the field's discriminating axis. A 2024 systematic study
measures the generalisation gap across the standard NIDS benchmarks
under matched feature schemas and reports AUROC drops of 0.10–0.30
when detectors trained on one environment are evaluated on another.
Subsequent work on heterogeneous deep stacked ensembles, calibrated
transformers, and few-shot multi-domain fusion targets the same gap
through architectural or training-time interventions. The phenomenon is
broadly observed and quantified; what is missing from the literature is
a mechanism-level account of why density-based scores in particular
degrade under domain shift, as opposed to an accumulation of empirical
remedies. The pilot study in §X revisits this gap directly and frames
the structural failure mode that the rest of the paper addresses.