New scripts under scripts/baselines/:
- run_if_ocsvm_cross.py - 20-d canonical flow features (path A)
- run_if_ocsvm_cross_packets.py - raw 576-d packet sequence (path B)
- run_shafir_nf_cross.py - single-NF on 5-d SHAFIR5 subset or 20-d
- *_all.sh - 3 sources x 3 targets x 3 seeds sweepers

New aggregator scripts/aggregate/baselines_cross_3x3_table.py builds a
Markdown 3x3 matrix per method from per-cell NPZ outputs.

RESULTS.md gains a "Shallow-baseline 3x3 cross matrices" subsection
pointing at the new artifact directories.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

# Final Results

## Main-line model: JANUS

**JANUS** (Joint Anomaly via Normalizing-flows of Unified States) is the
current main-line model. Codebase identifier is `Mixed_CFM/`; JANUS is the
external/published name.

JANUS is a packet-causal Transformer backbone with two output heads:

- **Continuous Flow Matching head** over (size, IAT, win) packet channels
- **Discrete Flow Matching (DFM) head** over the 6 binary protocol-flag /
  direction channels

The two heads are trained jointly (σ=0.1, lambda_disc=1.0, use_ot=true, no
Phase-2 consistency loss). Downstream uses a **single deployable scalar
score**: the Mahalanobis-OAS distance over the 10-d score vector emitted by
JANUS, fit on benign val only (no attack labels).

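
The benign-only Mahalanobis-OAS score described above can be sketched with scikit-learn's `OAS` estimator. This is a minimal illustration, not the project's scoring code: the helper names are hypothetical, and the synthetic arrays only stand in for the 10-d score vectors.

```python
import numpy as np
from sklearn.covariance import OAS
from sklearn.metrics import roc_auc_score


def fit_mahalanobis_oas(benign_val_scores):
    """Fit mean and OAS-shrunk covariance on benign validation vectors only."""
    return OAS().fit(np.asarray(benign_val_scores, dtype=float))


def anomaly_score(estimator, score_vectors):
    """Squared Mahalanobis distance d^2 to the benign mean (larger = more anomalous)."""
    return estimator.mahalanobis(np.asarray(score_vectors, dtype=float))


# Synthetic stand-in data (assumption): 10-d vectors, attacks shifted in every dimension.
rng = np.random.default_rng(0)
benign_val = rng.normal(size=(500, 10))
benign_test = rng.normal(size=(500, 10))
attack_test = rng.normal(loc=1.5, size=(500, 10))

est = fit_mahalanobis_oas(benign_val)            # no attack labels involved
d2 = anomaly_score(est, np.vstack([benign_test, attack_test]))
labels = np.r_[np.zeros(500), np.ones(500)]      # labels are used for evaluation only
auroc = roc_auc_score(labels, d2)
print(f"AUROC on synthetic data: {auroc:.3f}")
```

Because the estimator sees only benign validation vectors, swapping in another shrinkage estimator (for example `LedoitWolf`) changes one line and leaves the protocol untouched.
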
To our knowledge, JANUS is the first NIDS method to use Flow Matching as the
training paradigm in mixed continuous-discrete state spaces over packet
sequences.

All numbers reported are 3-seed mean ± std. Two model families are tracked:

- **Unified_CFM** (legacy / our previous internal recipe): single Transformer
  over [FLOW + packets] with Phase-2 consistency loss; λ=0.3. Strongest
  single-fixed-score (`terminal_norm`) within-dataset baseline.
- **JANUS** (current main line, 2026-05-01): see above.
  **New SOTA on cross-dataset transfer** under Mahalanobis auto-routing;
  matches legacy within-dataset under the same protocol. See
  `artifacts/route_comparison/SCORE_ROUTER.md`.

## Caveats that travel with all external claims

1. **CICIoT2023 vs Shafir is a metric mismatch, not a +SOTA result.** Shafir
   reports F1=0.9951 with the threshold tuned by Youden's J (TPR−FPR) on a
   1K+1K balanced val set (attack labels are used for threshold selection
   only) and tested on 10K+10K balanced. We report AUROC=0.9594
   (Mahalanobis-OAS), which is a different metric. CICIoT2023 should be
   presented as "additional benchmark, no Shafir AUROC published" rather
   than "+SOTA". To make it directly comparable, either reproduce Shafir's
   threshold protocol on JANUS's d² to compute F1, or run Shafir's GitHub
   `lshafir/NF-anomaly-detection` to extract NF AUROC.
2. **Reverse cross (CICDDoS2019→CICIDS2017) matches Shafir, does not beat.**
   JANUS gets 0.9301 ± 0.0122. Shafir Table IX row 3 reports 0.93. The
   "+0.31" gain is vs our own legacy `terminal_norm` (0.62), not vs Shafir.
3. **Cross-dataset is calibrated cross-domain transfer, not zero-shot.** The
   Mahalanobis-OAS aggregator is fit on the **target** dataset's benign val
   (unsupervised: no attack labels). Comparison vs Shafir is fair (his NF
   threshold is also calibrated on target benign), but the language must be
   "calibrated cross-domain transfer", not "zero-shot transfer".
4. **Aggregator selection (OAS over LedoitWolf / plain Mahal / max-z) was
   post-hoc.** OAS was picked because it is consistently top across all
   cells in `SCORE_ROUTER.md`; differences vs LedoitWolf are ≤ 0.005. Strict
   pre-registration would say "we evaluated 5 benign-only aggregators and
   OAS performed best".

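
For Caveat 1, the Youden's J threshold protocol could be reproduced on any scalar anomaly score roughly as follows. This is a sketch under assumptions: the function names are hypothetical, the toy arrays are not project data, and the convention "flag when score ≥ τ" is a choice made here rather than something taken from Shafir's code.

```python
import numpy as np


def youden_threshold(scores, labels):
    """Return (tau, J) maximising J = TPR - FPR, flagging samples with score >= tau."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    best_tau, best_j = None, -np.inf
    for tau in np.unique(scores):              # candidate thresholds, ascending
        pred = scores >= tau
        tpr = pred[labels == 1].mean()
        fpr = pred[labels == 0].mean()
        if tpr - fpr > best_j:
            best_tau, best_j = float(tau), float(tpr - fpr)
    return best_tau, best_j


def f1_at(scores, labels, tau):
    """F1 of the rule score >= tau against binary labels (1 = attack)."""
    pred = np.asarray(scores, dtype=float) >= tau
    labels = np.asarray(labels, dtype=int)
    tp = int(np.sum(pred & (labels == 1)))
    fp = int(np.sum(pred & (labels == 0)))
    fn = int(np.sum(~pred & (labels == 1)))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


# Toy balanced validation and test sets (assumption, for illustration only).
val_scores = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
val_labels = [0, 0, 0, 0, 1, 1, 1, 1]
tau, j = youden_threshold(val_scores, val_labels)

test_scores = [0.15, 0.65, 0.55, 0.85]
test_labels = [0, 0, 1, 1]
f1 = f1_at(test_scores, test_labels, tau)
print(tau, j, f1)
```

The threshold search uses attack labels on the validation set, so this path is only for metric comparability with Shafir and is not part of the benign-only protocol.
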
## Headline performance

External SOTA baselines (Shafir 2026 NF + Shapley) were verified directly from
the paper (`artifacts/locked_baselines.md`). Unified_CFM "legacy" rows are
*our* previous internal recipe (Phase-2 consistency loss + per-task σ); they
are reported as internal ablation, NOT as the SOTA-comparison baseline.

### A. vs External SOTA: within-dataset, JANUS + Mahalanobis-OAS (no selection bias)

| Task | **Shafir 2026 SOTA** | **JANUS + Mahalanobis-OAS** | **Δ vs Shafir** |
|---|---|---|---|
| ISCXTor2016 (NonTor → Tor) | 0.8731 (AUROC) | **0.9908 ± 0.0012** | **+0.118** ⭐⭐ |
| CICIDS2017 within | 0.9303 (AUROC) | **0.9845 ± 0.0030** | **+0.054** ⭐ |
| CICDDoS2019 within | 0.93 (AUROC) | **0.9913 ± 0.0009** | **+0.061** ⭐ |
| CICIoT2023 within | F1=0.9951 (no AUROC) | 0.9594 ± 0.0028 (AUROC) | **N/A: metric mismatch, see Caveat 1** |

**3/3 directly comparable within-dataset benchmarks: JANUS sets new SOTA vs
external Shafir baselines, with margins +0.054 to +0.118, all far outside
seed std.** This holds under fully selection-bias-free eval (single
Mahalanobis-OAS aggregator on the 10-d score vector, fit on benign val
only, no attack labels). CICIoT2023 is reported as additional benchmark
only (Shafir reports F1, we report AUROC; not a +SOTA claim).

### A'. Reference only: best per-channel fixed score (per-dataset selection-biased; do NOT use as headline SOTA)

⚠️ **Selection-biased**: the channel chosen per row (`terminal_norm` vs
`terminal_packet`) requires looking at attack-label AUROC to pick. Use this
table as ablation upper bound only, not as the SOTA claim. The honest
external SOTA claim is in table A above.

| Task | Shafir 2026 | JANUS (best fixed channel) | Δ vs Shafir |
|---|---|---|---|
| ISCXTor2016 | 0.8731 | 0.9954 ± 0.0007 (`terminal_norm`) | +0.122 |
| CICIDS2017 | 0.9303 | 0.9932 ± 0.0013 (`terminal_packet`) | +0.063 |
| CICDDoS2019 | 0.93 | 0.9970 ± 0.0005 (`terminal_norm`) | +0.067 |
| CICIoT2023 | F1=0.9951 (different metric) | 0.9671 ± 0.0002 (`terminal_packet`) | N/A |

### B. Internal ablation: JANUS vs our previous Unified_CFM legacy

This is for tracking how JANUS does relative to our own previous internal
best (not for the SOTA claim; Unified_CFM legacy is also our work).
Within-dataset AUROC has saturated above 0.99; differences ≤ 0.005 are seed
noise and the regime has no resolving power. The discriminating axis is
cross-dataset (next section).

| Task | Legacy Unified_CFM | JANUS + Mahalanobis-OAS | JANUS (best fixed) |
|---|---|---|---|
| ISCXTor2016 | 0.9945 ± 0.0011 | 0.9908 ± 0.0012 | 0.9954 ± 0.0007 |
| CICIDS2017 | 0.9858 ± 0.0021 | 0.9845 ± 0.0030 | 0.9932 ± 0.0013 |
| CICDDoS2019 | 0.9960 ± 0.0010 | 0.9913 ± 0.0009 | 0.9970 ± 0.0005 |
| CICIoT2023 | 0.9612 ± 0.0017 | 0.9594 ± 0.0028 | 0.9671 ± 0.0002 |

JANUS + Mahalanobis-OAS stays within 0.005 AUROC of the legacy recipe on
every within-dataset task. The ±1 std intervals overlap on CICIDS2017 and
CICIoT2023; on ISCXTor2016 and CICDDoS2019 the gaps (0.0037 and 0.0047)
exceed the seed std but remain inside the ≤ 0.005 band that this regime
cannot resolve. Best-fixed (per-dataset selection-biased) strictly beats
legacy on 4/4 but cannot be cited as a clean SOTA claim. The decisive
value-add is on cross-dataset transfer.

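
A quick way to audit the legacy-vs-JANUS comparison is to compute each gap and test whether the ±1 std intervals overlap. The sketch below copies the (mean, std) pairs from the table above; the overlap criterion (gap no larger than the sum of the two stds) is an assumption chosen for illustration, not a formal significance test.

```python
# (mean, std) pairs from the section B table: legacy Unified_CFM vs JANUS + Mahalanobis-OAS.
ROWS = {
    "ISCXTor2016": ((0.9945, 0.0011), (0.9908, 0.0012)),
    "CICIDS2017": ((0.9858, 0.0021), (0.9845, 0.0030)),
    "CICDDoS2019": ((0.9960, 0.0010), (0.9913, 0.0009)),
    "CICIoT2023": ((0.9612, 0.0017), (0.9594, 0.0028)),
}


def compare(legacy, janus):
    """Return (gap, overlap): gap = legacy mean - JANUS mean; overlap of +/-1 std intervals."""
    gap = round(legacy[0] - janus[0], 4)
    overlap = abs(gap) <= legacy[1] + janus[1]
    return gap, overlap


results = {task: compare(legacy, janus) for task, (legacy, janus) in ROWS.items()}
for task, (gap, overlap) in results.items():
    print(f"{task:12s} gap={gap:+.4f} within_0.005={abs(gap) <= 0.005} overlap={overlap}")
```

With only three seeds per cell the std estimates are themselves noisy, so the overlap flag is best read as a sanity check rather than a verdict.
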
### C. Cross-dataset transfer: JANUS + Mahalanobis-OAS

⚠️ **Δ columns are vs our own legacy** (not vs Shafir). vs Shafir: forward
beats (+0.07 over 0.89), reverse matches (0.93 = 0.93). See Caveats above.

| Task | Legacy `terminal_norm` | **JANUS + Mahalanobis-OAS** | Δ vs legacy | Shafir | vs Shafir |
|---|---|---|---|---|---|
| **CICIoT2023 → CICIDS2017** | 0.7700 ± 0.0133 | **0.8983 ± 0.0098** | **+0.128** | (n/a) | (n/a) |
| **CICIoT2023 → CICDDoS2019** | 0.7473 ± 0.0223 | **0.8944 ± 0.0068** | **+0.147** | (n/a) | (n/a) |
| **CICIDS2017 → CICDDoS2019** (forward) | 0.911 (legacy SOTA) | **0.9594 ± 0.0046** | +0.048 | 0.89 | **+0.07** |
| **CICDDoS2019 → CICIDS2017** (reverse) | 0.62 (legacy) | **0.9301 ± 0.0122** | **+0.31** | 0.93 | **0 (matches)** |

Full 4×4 cross matrix at `artifacts/route_comparison/CROSS_MATRIX.md`. All
12 off-diagonal directions tested (3 seeds each = 36 cross evaluations).
**Average off-diagonal improvement: +0.175 over `terminal_norm`**
(0.660 → 0.835). The four "source-likeness collapse" cells where
`terminal_norm` ≤ 0.57 (essentially random) are all recovered to ≥ 0.75.

See `artifacts/route_comparison/SCORE_ROUTER.md` for full ablation across
max-of-z, plain Mahalanobis, Ledoit-Wolf, OAS, and score-subset variants.

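
The off-diagonal average quoted above is the mean over all source ≠ target cells of a cross matrix. A minimal sketch of that reduction is shown here; the 3×3 matrix is a placeholder with made-up values (it is not the contents of `CROSS_MATRIX.md`), and the function name is hypothetical.

```python
import numpy as np


def off_diagonal_mean(matrix):
    """Mean of all cells with source != target in a square source x target matrix."""
    m = np.asarray(matrix, dtype=float)
    if m.ndim != 2 or m.shape[0] != m.shape[1]:
        raise ValueError("expected a square matrix")
    mask = ~np.eye(m.shape[0], dtype=bool)     # True everywhere except the diagonal
    return float(m[mask].mean())


# Placeholder AUROC matrix (assumption): rows = source dataset, columns = target dataset.
placeholder = [
    [0.99, 0.80, 0.70],
    [0.60, 0.98, 0.90],
    [0.75, 0.85, 0.97],
]
avg = off_diagonal_mean(placeholder)
print(f"off-diagonal mean: {avg:.4f}")

# Arithmetic behind the headline delta quoted in the text: 0.660 -> 0.835.
delta = round(0.835 - 0.660, 3)
print(f"delta: +{delta}")
```
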
#### Shallow-baseline 3×3 cross matrices (Isolation Forest, OCSVM), added 2026-05-12

Two input modalities tested as cross-dataset reference points:

- **Path A** (`artifacts/baselines/if_ocsvm_cross_2026_05_11/`): IF and OCSVM
  on the 20-d canonical flow features (`StandardScaler`). Strong shallow
  baseline: best off-diagonal AUROC is OCSVM 0.966 on CICIDS17→CICDDoS19.
  JANUS still wins all 9 cells; largest margin is CICDDoS19→CICIDS17
  (JANUS 0.941 vs OCSVM 0.571, **+0.370 AUROC**).
- **Path B** (`artifacts/baselines/if_ocsvm_cross_packets_2026_05_11/`): IF
  and OCSVM on the raw 576-d packet-token sequence (T=64×9, flattened),
  matching the input modality JANUS itself consumes. Numbers are weaker
  across the board (avg −0.16 AUROC vs path A); 3 IF cells and 1 OCSVM cell
  drop **below random**. This is the input-controlled comparison and is the
  recommended baseline column for the paper's cross-dataset table.

Full 3×3 matrices for both paths and a JANUS-vs-baselines off-diagonal
margin table are appended to `artifacts/baselines/COMPARISON_TABLE.md`.

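
One cell of such a shallow-baseline cross matrix can be sketched with scikit-learn as below. This is not the code in `run_if_ocsvm_cross.py`: fitting the `StandardScaler` on source benign traffic, the `nu` value, and the synthetic 20-d arrays are all assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM


def cross_cell_auroc(model, source_benign, target_benign, target_attack):
    """Fit scaler + detector on source benign only, then score target benign vs attack."""
    scaler = StandardScaler().fit(source_benign)
    model.fit(scaler.transform(source_benign))
    x_target = scaler.transform(np.vstack([target_benign, target_attack]))
    # decision_function is higher for inliers, so negate it to get an anomaly score.
    anomaly = -model.decision_function(x_target)
    labels = np.r_[np.zeros(len(target_benign)), np.ones(len(target_attack))]
    return roc_auc_score(labels, anomaly)


# Synthetic stand-in for 20-d flow features (assumption, not project data).
rng = np.random.default_rng(0)
src_benign = rng.normal(size=(400, 20))
tgt_benign = rng.normal(size=(300, 20))
tgt_attack = rng.normal(loc=3.0, size=(300, 20))

auroc_if = cross_cell_auroc(IsolationForest(random_state=0), src_benign, tgt_benign, tgt_attack)
auroc_svm = cross_cell_auroc(OneClassSVM(kernel="rbf", gamma="scale", nu=0.05),
                             src_benign, tgt_benign, tgt_attack)
print(f"IF AUROC={auroc_if:.3f}  OCSVM AUROC={auroc_svm:.3f}")
```

For path B the same function would take the flattened 576-d packet sequences in place of the 20-d feature rows.
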
### Reverse cross (CICDDoS2019 → CICIDS2017), 2026-05-01 update

The reverse direction was the project's "stuck" failure mode (memory note
`reverse_cross_score_redirection_2026_04_25`). Three model variants were
compared:

| Model | `terminal_norm` | best single score (post-hoc) | **Mahalanobis-OAS** |
|---|---|---|---|
| Legacy Unified + consistency | 0.626 | `pna_packet_median` 0.882 | 0.824 |
| Legacy Unified no consistency | 0.554 | `pna_packet_median` 0.852 | 0.893 |
| **JANUS (new)** | 0.519 | `disc_nll_total` **0.903 ± 0.012** | **0.930 ± 0.015** |

`terminal_norm` collapses (≈ random) across **all** model variants. This is
the source-likeness-classifier failure mode confirmed at the architecture
level, not just a single-recipe artifact. The recovery path is:

1. **DFM head** gives a `disc_nll` score that captures protocol-flag
   distribution, which is genuinely transfer-stable.
2. **Mahalanobis-OAS** on the 10-d score vector aggregates `disc_nll` with
   the (broken-but-not-useless) terminal scores into a 0.93 ± 0.015 AUROC.
3. Compared to Shafir's reverse 0.93 on this direction, JANUS +
   Mahalanobis-OAS **matches** that benchmark (0.93 = 0.93). Does NOT beat.

This is **+0.31 over our own legacy memory baseline of 0.62**. The "main
attack direction" recorded in `reverse_cross_score_redirection_2026_04_25`
is now substantially solved.

Thresholded F1 / Precision / Recall / TPR@FPR under the unsupervised threshold
protocol are reported in **section D** below. Headline thresholded numbers:
CICDDoS2019 within `terminal_norm` F1=0.993 ± 0.001 at τ=P95; cross `terminal_norm`
F1=0.632 ± 0.051 at τ=P95 (precision ≈ 0.95, recall ≈ 0.47).

> **Note on cross-dataset baseline**: Shafir's Table IX is asymmetric.
> The IDS2017→DDoS2019 (forward) direction reads **0.89**, not 0.93. The
> 0.93 number is the reverse direction (DDoS2019→IDS2017), which is covered
> in the reverse-cross section above. See `artifacts/locked_baselines.md`.

> **Note on σ choice**: headline numbers use per-task best σ (σ=0.1 for ISCXTor2016
> and CICDDoS2019; σ=0.6 for CICIDS2017 within and cross). Within-dataset
> tasks are σ-insensitive within seed noise; cross-dataset requires σ=0.6.
> Single-policy σ=0.6 also beats Shafir on 4/4. Full 4×2 sensitivity table
> in `artifacts/sigma_validation.md`.

### D. Thresholded operating-point metrics

⚠️ Numbers in this section are from the **Unified_CFM legacy** recipe (σ=0.1
within, σ=0.6 cross, λ=0.3, single fixed score). Equivalent thresholded
numbers for current JANUS + Mahalanobis-OAS have not been recomputed yet;
the AUROC tables (A/B/C above) are the authoritative JANUS comparison.

**Protocol**: τ is set from a benign-val half (A); F1 / Precision / Recall /
FPR are measured on benign-val half B + attack. AUROC / AUPRC use full
benign val + attack. TPR@FPR is measured on the test half. Both percentiles
τ ∈ {P95, P99} are reported because they correspond to different operating
points and F1 is sensitive to that choice.

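
The threshold protocol can be sketched as follows: τ is a percentile of the scores on benign half A, and the operating-point metrics are computed on benign half B plus attacks. The function names and the synthetic Gaussian scores are assumptions for illustration; the project's own split and scoring code may differ in detail.

```python
import numpy as np


def thresholded_metrics(benign_a, benign_b, attack, percentile=95.0):
    """Set tau on benign half A, then measure F1/precision/recall/FPR on half B + attack."""
    tau = float(np.percentile(benign_a, percentile))
    fp = int(np.sum(np.asarray(benign_b) > tau))
    tn = len(benign_b) - fp
    tp = int(np.sum(np.asarray(attack) > tau))
    fn = len(attack) - tp
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"tau": tau, "f1": f1, "precision": precision,
            "recall": recall, "fpr": fp / (fp + tn)}


def tpr_at_fpr(benign, attack, target_fpr):
    """TPR when the threshold is the (1 - target_fpr) quantile of the benign scores."""
    tau = np.quantile(benign, 1.0 - target_fpr)
    return float(np.mean(np.asarray(attack) > tau))


# Synthetic anomaly scores (assumption): benign ~ N(0, 1), attack ~ N(4, 1).
rng = np.random.default_rng(0)
benign = rng.normal(size=2000)
half_a, half_b = benign[:1000], benign[1000:]
attack = rng.normal(loc=4.0, size=1000)

m95 = thresholded_metrics(half_a, half_b, attack, percentile=95.0)
m99 = thresholded_metrics(half_a, half_b, attack, percentile=99.0)
tpr1 = tpr_at_fpr(half_b, attack, target_fpr=0.01)
print(m95, m99, tpr1)
```

Keeping the threshold-setting half disjoint from the evaluation half is what lets the measured FPR deviate from the nominal 5% or 1%, as it does in the tables that follow.
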
**CICDDoS2019 within** (σ=0.1, λ=0.3):

| Score | AUROC | AUPRC | F1 (P95) | Prec (P95) | Recall (P95) | FPR (P95) | F1 (P99) | TPR@1%FPR | TPR@5%FPR |
|---|---|---|---|---|---|---|---|---|---|
| `terminal_norm` | 0.9960 ± 0.0011 | 0.9975 ± 0.0008 | 0.9932 ± 0.0012 | 0.9881 ± 0.0015 | 0.9983 ± 0.0008 | 0.0481 ± 0.0061 | 0.9112 ± 0.0402 | 0.9013 ± 0.0540 | 0.9980 ± 0.0014 |
| `terminal_flow` | 0.9885 ± 0.0028 | 0.9918 ± 0.0017 | 0.9788 ± 0.0086 | 0.9868 ± 0.0009 | 0.9710 ± 0.0163 | 0.0517 ± 0.0030 | 0.7752 ± 0.0128 | 0.6052 ± 0.0347 | 0.9697 ± 0.0169 |

**CICIDS2017 → CICDDoS2019 cross** (σ=0.6, λ=0.3):

| Score | AUROC | AUPRC | F1 (P95) | Prec (P95) | Recall (P95) | FPR (P95) | F1 (P99) | TPR@1%FPR | TPR@5%FPR |
|---|---|---|---|---|---|---|---|---|---|
| `terminal_norm` | 0.9109 ± 0.0032 | 0.8974 ± 0.0047 | 0.6321 ± 0.0513 | 0.9545 ± 0.0045 | 0.4745 ± 0.0550 | 0.0441 ± 0.0011 | 0.4202 ± 0.0171 | 0.2685 ± 0.0139 | 0.4940 ± 0.0399 |
| `terminal_flow` | 0.9197 ± 0.0036 | 0.8957 ± 0.0086 | 0.6324 ± 0.0585 | 0.9517 ± 0.0055 | 0.4762 ± 0.0639 | 0.0469 ± 0.0019 | 0.4028 ± 0.0049 | 0.2534 ± 0.0039 | 0.4776 ± 0.0636 |

**Reading**:

- *Within-dataset CICDDoS2019* saturates: at τ=P95 F1 ≈ 0.99 with balanced
  precision and recall ≈ 0.99; at τ=P99 (≈1% FPR) F1 ≈ 0.91 with TPR@1%FPR
  ≈ 0.90. The model is a working detector at fixed thresholds, not just an
  AUROC artifact.
- *Cross-dataset CICIDS2017→CICDDoS2019* keeps AUROC ≈ 0.91 but at fixed τ
  shows precision ≈ 0.95 / recall ≈ 0.47 at P95 and TPR ≈ 0.27 at 1% FPR.
  The cross-dataset domain shift compresses the score gap, so
  source-calibrated thresholds are conservative on target. **AUROC alone
  overstates deployability cross-dataset; thresholded numbers are the
  honest figure.**

**TIPSO-GAN comparability**: TIPSO-GAN's CICDDoS2019 F1 ≈ 0.99 is reported
under a **supervised** protocol (model has seen attack examples). Our F1
≈ 0.99 on CICDDoS2019 within is achieved under the **unsupervised** protocol
(benign-only training, threshold from benign val), which is the strictly
harder setting. The F1 values are numerically equivalent; the protocol
asymmetry is in our favor.

## Methodological contribution: `flow_consistency` diagnostic score

Phase 2 masked-prediction consistency loss unlocks a new score that is
discriminative **only when the model is trained with the consistency loss**:

| Dataset | baseline (no aux) | Phase 2 (λ=0.3, σ=0.1) |
|---|---|---|
| ISCXTor2016 | 0.6543 | 0.9011 ± 0.0125 (+0.247) |
| CICIDS2017 | 0.5745 | 0.8770 ± 0.0039 (+0.302) |
| CICDDoS2019 | 0.9084 | 0.9459 ± 0.0188 (+0.038) |

On **SSH-Patator**, the worst class in CICIDS2017 for `terminal_norm`
(0.6407 ± 0.0675), `flow_consistency` reaches 0.94, providing a reliable
detector where standard density scores fail.

## Per-attack-family pattern

`terminal_norm` dominates on volumetric attacks (DDoS, DoS, Portscan, all
DrDoS_*), where it saturates at 0.97-0.99. Decomposed scores compete only on
brute-force / app-layer attacks where flow-level signal is strong but
packet-level signal is weak:

| Class | n | `terminal_norm` | best decomposed score | best AUROC |
|---|---|---|---|---|
| SSH-Patator | 168 | 0.6407 ± 0.0675 | `kinetic_flow` | 0.9458 ± 0.0080 |
| FTP-Patator | 256 | 0.8963 ± 0.0015 | `terminal_flow` | 0.9773 ± 0.0049 |
| DoS GoldenEye | 448 | 0.9760 ± 0.0008 | `terminal_flow` | 0.9868 ± 0.0015 |

Outside these classes, `terminal_norm` is the right primary; decomposed
scores are diagnostic only.

## What the experiments proved

1. **JANUS sets new SOTA vs external Shafir 2026 NF on 3/3 directly
   comparable within-dataset benchmarks** under unbiased Mahalanobis-OAS
   eval (+0.054 to +0.118, all margins outside 3-seed std). CICIoT2023 is
   metric-mismatched (F1 vs AUROC) and reported as additional benchmark.
2. **Within-dataset is saturated**: JANUS + Mahalanobis-OAS stays within
   0.005 AUROC of our own internal Unified_CFM legacy on every task. Once
   AUROC exceeds 0.99 the regime has no resolving power; benchmarks here
   cannot distinguish models. The right axis is cross-dataset.
3. **JANUS recovers the previously catastrophic reverse cross direction**:
   CICDDoS2019→CICIDS2017 from legacy `terminal_norm` 0.62 → JANUS
   Mahalanobis-OAS 0.93. Matches Shafir's 0.93 on the same direction
   (does not exceed). The "source-likeness collapse" failure mode of
   `terminal_norm` is confirmed at the architecture level (≤ 0.63 across
   3 distinct backbones) and is broken by the DFM head + Mahalanobis route.
4. **Discrete Flow Matching on flag/direction channels unlocks a new score
   family** (`disc_nll_total`) that is independent of `terminal_norm`. It is
   the single best cross→CICIDS2017 fixed score across all 5 routes
   (0.9191). Without it the Mahalanobis aggregator has nothing to recover
   reverse cross with.
5. **Causal-packet attention reduces multi-seed std** by ~1.6-8× on every
   dataset (see the Stability table), indicating the protocol-causal prior
   is a stabilizer for CFM training.
6. **Phase-2 consistency loss is no longer the lead mechanism**: useful for
   the `flow_consistency` diagnostic family, but JANUS's `terminal_packet`
   and `disc_nll_total` heads cover its function without the
   masked-prediction aux loss.
7. **σ-band noise is a transfer-friendly regularizer**: σ=0.6 cross-dataset
   AUROC is +0.02 over σ=0.1, matching the σ=0.6 sweet spot from Packet_CFM.
8. **Per-attack-family analysis is the right reporting frame**: averaged
   AUROC hides the SSH-Patator-style cases where decomposed scores save
   the day.

## What the experiments disproved

1. **Curvature as primary score**: 0.32-0.91 across datasets, much weaker
   than `terminal_norm`. Has diagnostic value on SSH-Patator (+0.30) but
   should not lead reporting.
2. **Jacobian-Hutchinson as primary score**: 0.32-0.59 on ISCXTor2016,
   below random for some sub-scores. Failed.
3. **Time-profile velocity scores**: at best +0.005 over `terminal_norm`
   on average. Some per-class wins on brute-force but not enough to lead.

## Configuration

```yaml
# CURRENT SOTA: JANUS (Mixed_CFM + causal-packet attention).
# Configs at: Mixed_CFM/configs/<dataset>_seed{42,43,44}.yaml
model:
  T: 64
  d_model: 128
  n_layers: 4
  n_heads: 4
  mlp_ratio: 4.0
  time_dim: 64
  use_ot: true
  reference_mode: causal_packets   # ← Route A: packet-causal attention

training:
  n_train: 10000
  epochs: 50
  batch_size: 256
  lr: 3.0e-4
  # Mixed CFM packet preprocessing: cont channels z-scored,
  # disc channels (direction + 5 TCP flags) kept as int {0,1}
  sigma: 0.1
  lambda_disc: 1.0                 # ← Route C: DFM cross-entropy weight

scoring (per dataset best):
  ISCXTor2016 / CICDDoS2019: terminal_norm
  CICIDS2017 / CICIoT2023: terminal_packet
  cross→CICIDS2017: disc_nll_total
```

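
To make the roles of `sigma` and `lambda_disc` concrete, here is a NumPy sketch of a conditional flow matching target and a joint loss. It is an assumption about the general shape of the objective, not the `Mixed_CFM` training code: it uses the common noisy linear interpolant with target velocity `x1 - x0`, stands in a plain binary cross-entropy for the DFM term, and omits both the discrete corruption process and the minibatch OT pairing that `use_ot: true` implies.

```python
import numpy as np


def cfm_sample(x0, x1, t, sigma, rng):
    """Noisy linear interpolant x_t between x0 (noise) and x1 (data), plus target u = x1 - x0."""
    t = np.asarray(t, dtype=float)[:, None]          # (B, 1), broadcast over features
    mean = (1.0 - t) * x0 + t * x1
    x_t = mean + sigma * rng.standard_normal(x0.shape)
    return x_t, x1 - x0


def joint_loss(v_pred, u_target, disc_logits, disc_bits, lambda_disc, eps=1e-12):
    """Continuous velocity MSE plus lambda_disc times binary cross-entropy on discrete bits."""
    mse = np.mean((v_pred - u_target) ** 2)
    p = 1.0 / (1.0 + np.exp(-disc_logits))           # sigmoid
    bce = -np.mean(disc_bits * np.log(p + eps) + (1.0 - disc_bits) * np.log(1.0 - p + eps))
    return float(mse + lambda_disc * bce)


# Toy batch (assumption): 3 continuous channels and 6 binary channels per row.
rng = np.random.default_rng(0)
x1 = rng.normal(size=(8, 3))                         # z-scored continuous packet channels
x0 = rng.standard_normal(x1.shape)                   # Gaussian source sample
bits = rng.integers(0, 2, size=(8, 6)).astype(float)

x_t, u = cfm_sample(x0, x1, rng.uniform(size=8), sigma=0.1, rng=rng)
# A perfect velocity prediction with uninformative logits leaves only the BCE term (ln 2).
loss = joint_loss(v_pred=u, u_target=u, disc_logits=np.zeros_like(bits),
                  disc_bits=bits, lambda_disc=1.0)
print(f"loss with perfect velocity and zero logits: {loss:.4f}")
```

Under this reading, raising `sigma` widens the band around the interpolation path, while `lambda_disc` trades off the discrete head against the continuous one.
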
### Legacy config (Unified_CFM with Phase-2 consistency)

Kept for reference; superseded by JANUS on cross-dataset (within-dataset is
saturated and JANUS ties legacy in noise):

```yaml
lambda_flow: 0.3
lambda_packet: 0.3
packet_mask_ratio: 0.5
sigma: 0.1 within / 0.6 cross
```

## Stability

JANUS std vs legacy Unified_CFM std (3 seeds):

| Dataset | Legacy std | **JANUS std** | std reduction |
|---|---|---|---|
| ISCXTor2016 | 0.0011 | **0.0007** | 1.6× |
| CICIDS2017 | 0.0021 | **0.0013** | 1.6× |
| CICDDoS2019 | 0.0010 | **0.0005** | 2× |
| CICIoT2023 | 0.0017 | **0.0002** | **8×** |

Causal-packet attention is the dominant contributor to std reduction:
Route A in isolation also cut std on `terminal_norm` in CICIoT2023 by
roughly 3× (Route A alone: 0.0006 vs baseline 0.0017).

Legacy reference (kept for completeness):
- `terminal_norm` ISCXTor2016: ±0.0011 (σ=0.1) / ±0.0019 (σ=0.6)
- `terminal_norm` CICIDS2017: ±0.0021 (σ=0.6)
- `terminal_norm` CICDDoS2019: ±0.0010 (σ=0.1)
- cross `terminal_norm` σ=0.6: ±0.0032
- cross `terminal_flow` σ=0.6: ±0.0036

The legacy margins over Shafir (+0.121 on ISCXTor2016 and +0.055 on
CICIDS2017) are far larger than these stds, so they are not single-seed
artifacts.

## Source artifacts

- `artifacts/locked_baselines.md`: verified Shafir baselines (PDF inspection trail).
- `artifacts/sigma_validation.md`: full 4×2 σ-sensitivity table (σ ∈ {0.1, 0.6} ×
  4 tasks, 3 seeds each) and per-task σ-selection protocol.
- `artifacts/reverse_cross.md`: reverse direction CICDDoS2019 → CICIDS2017
  evaluation (3 seeds × 2 σ × 16 scores). Asymmetry finding.
- `artifacts/phase25_multiseed_2026_04_25/PER_ATTACK_TABLE.md`: per-attack
  multi-seed table (granular `terminal_norm` vs decomposed scores per class).
- `artifacts/phase{0,1,25}*/<config_name>_seed*/phase1_summary.json`: raw
  per-seed eval results across all experiments.
- `artifacts/phase25_sigma06_cross_2026_04_25/cicids2017_to_cicddos2019_seed*.json`:
  3-seed cross-dataset eval JSONs.
- `artifacts/baselines/if_ocsvm_cross_2026_05_11/CROSS_MATRIX_3x3.md`:
  IF/OCSVM 3×3 cross matrix on 20-d canonical flow features (path A).
- `artifacts/baselines/if_ocsvm_cross_packets_2026_05_11/CROSS_MATRIX_3x3.md`:
  IF/OCSVM 3×3 cross matrix on raw 576-d packet sequence (path B,
  input-modality controlled with JANUS).
- Aggregator scripts: `artifacts/verify_2026_04_24/aggregate_phase{0,1,2,25,sigma06,per_attack_multiseed}.py`.
- Orchestrator scripts: `artifacts/verify_2026_04_24/run_phase*.sh`.

Phase summary markdown reports were superseded by this `RESULTS.md` and
removed during the 2026-04-25 baseline-lock cleanup. The aggregator
scripts can regenerate any historical view from the raw JSON results.