Initial commit: code, paper, small artifacts

2026-05-07 20:47:30 +08:00
commit fae2db8cff
322 changed files with 33159 additions and 0 deletions
--- a/paper/background_related.md
+++ b/paper/background_related.md
@@ -0,0 +1,125 @@
+## 2 Background
+
+### 2.1 Unsupervised network anomaly detection
+
+We consider the standard unsupervised setting: a detector is trained only on
+benign traffic, with no access to attack labels at any stage of training, and
+at inference time must assign an anomaly score to each flow. Public
+benchmarks (e.g., CIC-IDS2017, CIC-DDoS2019, ISCXTor2016) provide labelled
+attack traffic for evaluation only. Two granularities dominate the
+literature: flow-level detectors operate on per-flow aggregate features
+(byte counts, inter-arrival statistics, flag tallies), while packet-level
+detectors operate on the ordered sequence of per-packet features inside a
+flow and retain temporal structure that flow aggregates discard.
+
+Within-dataset AUROC differences between recent recipes on the standard
+benchmarks have narrowed to within seed noise; the substantive evaluation axis is now
+cross-dataset transfer, in which a detector is trained on one environment
+and evaluated on traffic from another. Performance on this axis has not
+converged.
+
+### 2.2 Continuous Flow Matching
+
+Continuous Flow Matching (CFM) trains a time-dependent vector field
+$v_\theta(x, t)$ to transport a tractable source distribution (typically
+$\mathcal{N}(0, I)$) to the data distribution along an ODE
+$\mathrm{d}x_t = v_\theta(x_t, t)\,\mathrm{d}t$. The training objective
+regresses $v_\theta$ onto a target velocity defined along a chosen
+conditional probability path; for the linear (Gaussian) path this reduces
+to a simple least-squares loss, side-stepping the score-matching objective
+and stochastic sampler of diffusion models. OT-CFM straightens
+trajectories by pairing source and data samples through minibatch optimal
+transport, which lowers integration error and enables stable few-step
+inference.
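+
+As a sketch of the linear-path case (our notation, assuming the path's terminal noise scale $\sigma_{\min}$ is set to zero and that $x_0$, $x_1$, and $t$ are sampled as stated), the interpolant, target velocity, and loss can be written as:
+
+$$
+\begin{aligned}
+x_t &= (1 - t)\,x_0 + t\,x_1, \qquad x_0 \sim \mathcal{N}(0, I),\; x_1 \sim p_{\mathrm{data}},\; t \sim \mathcal{U}[0, 1], \\
+u_t(x_t \mid x_0, x_1) &= x_1 - x_0, \\
+\mathcal{L}_{\mathrm{CFM}}(\theta) &= \mathbb{E}_{t,\,x_0,\,x_1} \left\| v_\theta(x_t, t) - (x_1 - x_0) \right\|^2 .
+\end{aligned}
+$$
+
+Under OT-CFM the pair $(x_0, x_1)$ would be drawn from a minibatch optimal-transport coupling rather than independently; the regression target keeps the same form.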
+
+A trained CFM model gives access not only to the learned density but also to
+a family of geometric quantities along the trajectory: terminal velocity
+norm, curvature, and divergence (via Jacobian-trace estimators). These can be
+read off the velocity field without retraining.
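+
+Two of these quantities have standard closed forms, sketched here in our own notation ($J_v$ for the Jacobian of the velocity field and $\epsilon$ for a random probe vector are symbols we introduce). The divergence is the trace of the Jacobian and can be estimated with Hutchinson's estimator, and integrating it along the trajectory yields the log-density through the instantaneous change-of-variables formula:
+
+$$
+\begin{aligned}
+\nabla \cdot v_\theta(x, t) &= \operatorname{tr} J_v(x, t) = \mathbb{E}_{\epsilon}\left[ \epsilon^\top J_v(x, t)\, \epsilon \right], \qquad J_v = \frac{\partial v_\theta}{\partial x},\; \mathbb{E}[\epsilon \epsilon^\top] = I, \\
+\log p_1(x_1) &= \log p_0(x_0) - \int_0^1 \nabla \cdot v_\theta(x_t, t)\,\mathrm{d}t .
+\end{aligned}
+$$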
+
+### 2.3 Discrete Flow Matching
+
+Continuous FM does not apply to categorical state spaces, where adding
+Gaussian noise is undefined. Discrete Flow Matching (DFM) generalises
+the framework to finite alphabets through continuous-time Markov chains:
+the model parameterises token-level transition rates that interpolate
+between a source distribution (typically uniform) and the data
+distribution. The training objective remains a simple regression onto
+target rates derived from a chosen interpolation schedule. DFM has been
+validated on language and molecular generation; mixed
+continuous–discrete data, where each observation has both numerical and
+categorical channels, is the natural composition of CFM and DFM.
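+
+The continuous-time Markov chain view can be sketched in our own notation ($u_t$ for the transition rate and $\kappa_t$ for the schedule are symbols we introduce, not fixed by the text above). Over a short step $h$, a token moves from state $x$ to state $y$ with probability
+
+$$
+\Pr(x_{t+h} = y \mid x_t = x) = \delta_{x}(y) + h\,u_t(y, x) + o(h), \qquad u_t(y, x) \ge 0 \;\text{for}\; y \ne x, \quad \sum_{y} u_t(y, x) = 0,
+$$
+
+and one common interpolation, assumed here purely for illustration, is the mixture path $p_t(x \mid x_1) = (1 - \kappa_t)\,p_0(x) + \kappa_t\,\delta_{x_1}(x)$ with $\kappa_0 = 0$ and $\kappa_1 = 1$.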
+
+---
+
+## 3 Related Work
+
+### 3.1 Reconstruction-based detectors
+
+Autoencoder-style detectors learn to reconstruct benign inputs and score
+anomalies by reconstruction error. Kitsune popularised the design for
+online NIDS using an ensemble of small autoencoders, and MemAE introduced
+a learned memory bank to constrain the latent representation to the
+benign manifold. The family suffers from a documented identity-mapping
+failure: sufficiently expressive autoencoders reconstruct
+out-of-distribution inputs near-perfectly, eroding the gap between benign and
+anomalous reconstruction error. Recent critiques argue that this
+behaviour is structural rather than a hyperparameter artefact, and that
+reconstruction error is therefore an unreliable anomaly score in
+general.
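+
+In our notation (an encoder $f_\theta$ and decoder $g_\phi$, both symbols introduced here for illustration), the reconstruction score and the identity-mapping failure can be stated as:
+
+$$
+s_{\mathrm{rec}}(x) = \left\| x - g_\phi\big(f_\theta(x)\big) \right\|_2^2, \qquad g_\phi \circ f_\theta \approx \mathrm{id} \;\Longrightarrow\; s_{\mathrm{rec}}(x) \approx 0 \;\text{for benign and anomalous } x \text{ alike}.
+$$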
+
+### 3.2 Density-based detectors
+
+Three deep generative families currently hold the published state of the art on
+NIDS benchmarks. **Normalising flows** fit an explicit invertible
+density on benign traffic and score by negative log-likelihood; the
+strongest recent pipeline reports 0.93 within-dataset AUROC on
+CIC-DDoS2019, with cross-domain transfer AUROC in the 0.89–0.93 range.
+**Diffusion-based detectors** include contextual masking distillation
+schemes that compare a student denoiser against a benign-trained
+teacher; a 2025 survey catalogues further diffusion AD variants.
+**GAN-based detectors**, exemplified by recent NDSS work that augments
+the optimisation with particle-swarm search, score by discriminator
+output or cycle-reconstruction error. All three families reduce a
+packet stream to a single scalar derived from one homogeneous
+probabilistic model fit to benign data, and the reported log-likelihood
+is known to dissociate from anomaly status once the benign distribution
+drifts.
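+
+For the normalising-flow case, the score is the standard change-of-variables log-likelihood; as a sketch in our notation ($f_\theta$ for the invertible map, $p_Z$ for the base density, and $s_{\mathrm{nll}}$ for the score are symbols we introduce):
+
+$$
+s_{\mathrm{nll}}(x) = -\log p_\theta(x) = -\log p_Z\big(f_\theta(x)\big) - \log \left| \det \frac{\partial f_\theta(x)}{\partial x} \right| .
+$$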
+
+A separate line of work uses self-supervised contrastive
+representations, graph neural networks, or pre-trained traffic
+foundation models, with anomaly scoring delegated to a downstream
+detector such as OCSVM or Mahalanobis distance. These pipelines are
+typically two-stage, are primarily evaluated on encrypted-traffic
+classification rather than open-set anomaly detection, and are not the
+focus of the cross-dataset robustness comparison we pursue.
+
+### 3.3 Flow Matching for anomaly detection
+
+Outside NIDS, two recent works adopt Flow Matching as the AD objective.
+A time-reversed FM detector for image anomaly detection combines a
+worst-transport coupling with a high-dimensional latent, scoring by
+deviation from the learned velocity field. A tabular detector built on
+one-step FM offers explainability and provable robustness guarantees on
+heterogeneous structured data. Both validate FM-based scoring as
+competitive with reconstruction- and density-based baselines in their
+respective regimes. To our knowledge, Discrete Flow Matching, although
+validated on language and molecular generation, has not been evaluated
+as an anomaly-detection objective, and no prior work applies either
+continuous or discrete FM to packet-sequence NIDS.
+
+### 3.4 Cross-dataset robustness in NIDS
+
+As within-dataset metrics have saturated, cross-dataset evaluation has
+emerged as the field's discriminating axis. A 2024 systematic study
+measures the generalisation gap across the standard NIDS benchmarks
+under matched feature schemas and reports AUROC drops of 0.10–0.30
+when detectors trained on one environment are evaluated on another.
+Subsequent work on heterogeneous deep stacked ensembles, calibrated
+transformers, and few-shot multi-domain fusion targets the same gap
+through architectural or training-time interventions. The phenomenon is
+broadly observed and quantified; what is missing from the literature is
+a mechanism-level account of why density-based scores in particular
+degrade under domain shift, as opposed to an accumulation of empirical
+remedies. The pilot study in §X revisits this gap directly and frames
+the structural failure mode that the rest of the paper addresses.