JANUS/RESULTS.md
BattleTag 6e5f753c01 baselines: add 3x3 cross-dataset runners for IF/OCSVM (path A + B) and Shafir NF
New scripts under scripts/baselines/:
- run_if_ocsvm_cross.py            - 20-d canonical flow features (path A)
- run_if_ocsvm_cross_packets.py    - raw 576-d packet sequence (path B)
- run_shafir_nf_cross.py           - single-NF on 5-d SHAFIR5 subset or 20-d
- *_all.sh                         - 3 sources x 3 targets x 3 seeds sweepers

New aggregator scripts/aggregate/baselines_cross_3x3_table.py builds a
Markdown 3x3 matrix per method from per-cell NPZ outputs.

RESULTS.md gains a "Shallow-baseline 3x3 cross matrices" subsection
pointing at the new artifact directories.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 17:41:20 +08:00

# Final Results
## Main-line model: JANUS
**JANUS** (Joint Anomaly via Normalizing-flows of Unified States) is the
current main-line model. Codebase identifier is `Mixed_CFM/`; JANUS is the
external/published name.

JANUS is a packet-causal Transformer backbone with two output heads:

- **Continuous Flow Matching head** over (size, IAT, win) packet channels
- **Discrete Flow Matching (DFM) head** over the 6 binary protocol-flag /
  direction channels

The two heads are trained jointly (σ=0.1, lambda_disc=1.0, use_ot=true, no
Phase-2 consistency loss). Downstream uses a **single deployable scalar score**:
the Mahalanobis-OAS distance over the 10-d score vector emitted by JANUS,
fit on benign val only (no attack labels).
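
A minimal sketch of the Mahalanobis-OAS scoring step described above, assuming scikit-learn's `OAS` estimator is an acceptable stand-in for the aggregator. The arrays are synthetic placeholders, not real JANUS score vectors, and the names `benign_val`, `benign_test`, `attack_test` and `anomaly_score` are hypothetical.

```python
import numpy as np
from sklearn.covariance import OAS
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Synthetic stand-ins for 10-d score vectors (NOT real JANUS outputs).
benign_val = rng.normal(size=(500, 10))
benign_test = rng.normal(size=(300, 10))
attack_test = rng.normal(loc=1.5, size=(300, 10))

# Fit mean and OAS-shrunk covariance on benign validation vectors only.
oas = OAS().fit(benign_val)

def anomaly_score(score_vectors):
    """Squared Mahalanobis distance d^2 to the benign-val distribution."""
    return oas.mahalanobis(score_vectors)

d2 = np.concatenate([anomaly_score(benign_test), anomaly_score(attack_test)])
# Attack labels are used for evaluation only, never for fitting.
labels = np.concatenate([np.zeros(300), np.ones(300)])
auroc = roc_auc_score(labels, d2)
print(f"AUROC on synthetic data: {auroc:.3f}")
```

The only fitted quantities are the benign mean and the shrunk covariance, which is what makes the score usable without attack labels.
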
To our knowledge, JANUS is the first NIDS method to use Flow Matching as the
training paradigm in mixed continuous-discrete state spaces over packet sequences.
All numbers reported are 3-seed mean ± std. Two model families are tracked:
- **Unified_CFM** (legacy / our previous internal recipe): single Transformer
over [FLOW + packets] with Phase-2 consistency loss; λ=0.3. Strongest
single-fixed-score (`terminal_norm`) within-dataset baseline.
- **JANUS** (current main line, 2026-05-01): see above.
**New SOTA on cross-dataset transfer** under Mahalanobis auto-routing;
matches legacy within-dataset under the same protocol. See
`artifacts/route_comparison/SCORE_ROUTER.md`.
## Caveats that travel with all external claims
1. **CICIoT2023 vs Shafir is a metric mismatch, not a +SOTA result.** Shafir
reports F1=0.9951 with threshold tuned by Youden's J (TPR − FPR) on a
1K+1K balanced val set (uses attack labels for threshold selection only)
and tested on 10K+10K balanced. We report AUROC=0.9594 (Mahalanobis-OAS).
Different metric. CICIoT2023 should be presented as "additional
benchmark, no Shafir AUROC published" rather than "+SOTA". To make it
directly comparable, either reproduce Shafir's threshold protocol on
JANUS's d² to compute F1, or run Shafir's GitHub
`lshafir/NF-anomaly-detection` to extract NF AUROC.
2. **Reverse cross (CICDDoS2019→CICIDS2017) matches Shafir, does not beat.**
JANUS gets 0.9301 ± 0.0122. Shafir Table IX row 3 reports 0.93. The
"+0.31" gain is vs our own legacy `terminal_norm` (0.62), not vs Shafir.
3. **Cross-dataset is calibrated cross-domain transfer, not zero-shot.** The
Mahalanobis-OAS aggregator is fit on the **target** dataset's benign val
(unsupervised — no attack labels). Comparison vs Shafir is fair (his NF
threshold also calibrated on target benign), but the language must be
"calibrated cross-domain transfer" not "zero-shot transfer".
4. **Aggregator selection (OAS over LedoitWolf / plain Mahal / max-z) was
post-hoc.** OAS was picked because it is consistently top across all cells in
`SCORE_ROUTER.md`; differences vs LedoitWolf are ≤ 0.005. A strict
pre-registration would say "we evaluated 5 benign-only aggregators and OAS
performed best".
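
The threshold protocol referred to in Caveat 1 (pick τ by Youden's J on a labeled validation set, then report F1 on a separate test set) can be sketched as follows. This is an illustrative pure-Python version on toy scores, not Shafir's code; the function names and the `score >= tau` decision rule are assumptions.

```python
def youden_threshold(scores, labels):
    """Return (tau, J) maximizing J = TPR - FPR, predicting attack if score >= tau."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    best_tau, best_j = None, float("-inf")
    for tau in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= tau)
        fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= tau)
        j = tp / n_pos - fp / n_neg
        if j > best_j:  # strict '>' keeps the lowest tau on ties
            best_tau, best_j = tau, j
    return best_tau, best_j

def f1_at(scores, labels, tau):
    """F1 of the rule 'attack if score >= tau'."""
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= tau)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= tau)
    fn = sum(1 for s, y in zip(scores, labels) if y == 1 and s < tau)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

# Toy balanced validation and test sets (label 1 = attack); values are made up.
val_scores = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
val_labels = [0, 0, 0, 1, 0, 1, 1, 1]
test_scores = [0.15, 0.35, 0.5, 0.45, 0.85, 0.95]
test_labels = [0, 0, 0, 1, 1, 1]

tau, j = youden_threshold(val_scores, val_labels)
test_f1 = f1_at(test_scores, test_labels, tau)
print(tau, j, round(test_f1, 3))
```

Applying the same two functions to JANUS's d² would give an F1 under a Shafir-style protocol, at the cost of using attack labels for threshold selection.
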
## Headline performance
External SOTA baselines (Shafir 2026 NF + Shapley) verified directly from
the paper (`artifacts/locked_baselines.md`). Unified_CFM "legacy" rows are
*our* previous internal recipe (Phase-2 consistency loss + per-task σ); they
are reported as internal ablation, NOT as the SOTA-comparison baseline.
### A. vs External SOTA — within-dataset, JANUS + Mahalanobis-OAS (no selection bias)
| Task | **Shafir 2026 SOTA** | **JANUS + Mahalanobis-OAS** | **Δ vs Shafir** |
|---|---|---|---|
| ISCXTor2016 (NonTor → Tor) | 0.8731 (AUROC) | **0.9908 ± 0.0012** | **+0.118** ⭐⭐ |
| CICIDS2017 within | 0.9303 (AUROC) | **0.9845 ± 0.0030** | **+0.054** ⭐ |
| CICDDoS2019 within | 0.93 (AUROC) | **0.9913 ± 0.0009** | **+0.061** ⭐ |
| CICIoT2023 within | F1=0.9951 (no AUROC) | 0.9594 ± 0.0028 (AUROC) | **N/A — metric mismatch, see Caveat 1** |
**3/3 directly comparable within-dataset benchmarks: JANUS sets new SOTA vs
external Shafir baselines, with margins +0.054 to +0.118 — all far outside
seed std.** This holds under fully selection-bias-free eval (single
Mahalanobis-OAS aggregator on the 10-d score vector, fit on benign val
only, no attack labels). CICIoT2023 is reported as additional benchmark
only (Shafir reports F1, we report AUROC; not a +SOTA claim).
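
As a quick arithmetic check of the "far outside seed std" statement, the margins in table A can be expressed in units of the JANUS 3-seed std. The numbers below are copied from table A; Shafir's own variance is not reported here, so only the JANUS std is used.

```python
# (task, Shafir AUROC, JANUS mean AUROC, JANUS 3-seed std), copied from table A.
rows = [
    ("ISCXTor2016", 0.8731, 0.9908, 0.0012),
    ("CICIDS2017", 0.9303, 0.9845, 0.0030),
    ("CICDDoS2019", 0.93, 0.9913, 0.0009),
]

margins = {}
for task, shafir, janus_mean, janus_std in rows:
    margin = janus_mean - shafir
    margins[task] = margin
    # Ratio of the margin to the seed std: how many stds the gap spans.
    print(f"{task}: margin {margin:+.3f}, about {margin / janus_std:.0f}x seed std")
```
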
### A'. Reference only — best per-channel fixed score (per-dataset selection-biased; do NOT use as headline SOTA)
⚠️ **Selection-biased**: the channel chosen per row (`terminal_norm` vs
`terminal_packet`) requires looking at attack-label AUROC to pick. Use this
table as ablation upper bound only, not as the SOTA claim. The honest
external SOTA claim is in table A above.
| Task | Shafir 2026 | JANUS (best fixed channel) | Δ vs Shafir |
|---|---|---|---|
| ISCXTor2016 | 0.8731 | 0.9954 ± 0.0007 (`terminal_norm`) | +0.122 |
| CICIDS2017 | 0.9303 | 0.9932 ± 0.0013 (`terminal_packet`) | +0.063 |
| CICDDoS2019 | 0.93 | 0.9970 ± 0.0005 (`terminal_norm`) | +0.067 |
| CICIoT2023 | F1=0.9951 (different metric) | 0.9671 ± 0.0002 (`terminal_packet`) | N/A |
### B. Internal ablation — JANUS vs our previous Unified_CFM legacy
This is for tracking how JANUS does relative to our own previous internal
best (not for the SOTA claim — Unified_CFM legacy is also our work).
Within-dataset AUROC has saturated above 0.99; differences ≤ 0.005 are seed
noise and the regime has no resolving power. The discriminating axis is
cross-dataset (next section).
| Task | Legacy Unified_CFM | JANUS + Mahalanobis-OAS | JANUS (best fixed) |
|---|---|---|---|
| ISCXTor2016 | 0.9945 ± 0.0011 | 0.9908 ± 0.0012 | 0.9954 ± 0.0007 |
| CICIDS2017 | 0.9858 ± 0.0021 | 0.9845 ± 0.0030 | 0.9932 ± 0.0013 |
| CICDDoS2019 | 0.9960 ± 0.0010 | 0.9913 ± 0.0009 | 0.9970 ± 0.0005 |
| CICIoT2023 | 0.9612 ± 0.0017 | 0.9594 ± 0.0028 | 0.9671 ± 0.0002 |
JANUS + Mahalanobis-OAS stays within 0.005 AUROC of the legacy recipe on every
within-dataset task. The ±1 std intervals overlap on CICIDS2017 and CICIoT2023;
on ISCXTor2016 and CICDDoS2019 JANUS is lower by 0.004 to 0.005, slightly
outside seed std. Best-fixed (per-dataset selection-biased) strictly beats
legacy on 4/4 but cannot be cited as a clean SOTA claim. The decisive
value-add is on cross-dataset transfer.
### C. Cross-dataset transfer — JANUS + Mahalanobis-OAS
⚠️ **Δ columns are vs our own legacy** (not vs Shafir). vs Shafir: forward
beats (+0.07 over 0.89), reverse matches (0.93 = 0.93). See Caveats above.
| Task | Legacy `terminal_norm` | **JANUS + Mahalanobis-OAS** | Δ vs legacy | Shafir | vs Shafir |
|---|---|---|---|---|---|
| **CICIoT2023 → CICIDS2017** | 0.7700 ± 0.0133 | **0.8983 ± 0.0098** | **+0.128** | (n/a) | (n/a) |
| **CICIoT2023 → CICDDoS2019** | 0.7473 ± 0.0223 | **0.8944 ± 0.0068** | **+0.147** | (n/a) | (n/a) |
| **CICIDS2017 → CICDDoS2019** (forward) | 0.911 (legacy SOTA) | **0.9594 ± 0.0046** | +0.048 | 0.89 | **+0.07** |
| **CICDDoS2019 → CICIDS2017** (reverse) | 0.62 (legacy) | **0.9301 ± 0.0122** | **+0.31** | 0.93 | **0 (matches)** |
Full 4×4 cross matrix at `artifacts/route_comparison/CROSS_MATRIX.md`. All
12 off-diagonal directions tested (3 seeds each = 36 cross evaluations).
**Average off-diagonal improvement: +0.175 over `terminal_norm`**
(0.660 → 0.835). The four "source-likeness collapse" cells where
`terminal_norm` ≤ 0.57 (essentially random) are all recovered to ≥ 0.75.
See `artifacts/route_comparison/SCORE_ROUTER.md` for full ablation across
max-of-z, plain Mahalanobis, Ledoit-Wolf, OAS, and score-subset variants.
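
For reference, the "average off-diagonal" figure is the mean over all source ≠ target cells of a cross matrix. A small sketch, assuming rows are source datasets and columns are target datasets; the 3×3 values are placeholders, not the contents of `CROSS_MATRIX.md`.

```python
def off_diagonal_mean(matrix):
    """Mean of all cells where source index != target index."""
    n = len(matrix)
    cells = [matrix[i][j] for i in range(n) for j in range(n) if i != j]
    return sum(cells) / len(cells)

def off_diagonal_improvement(new_matrix, old_matrix):
    """Difference of off-diagonal means between two aggregators."""
    return off_diagonal_mean(new_matrix) - off_diagonal_mean(old_matrix)

# Placeholder matrices (made-up AUROCs), rows = source, columns = target.
old = [[0.99, 0.60, 0.55],
       [0.50, 0.99, 0.70],
       [0.65, 0.60, 0.99]]
new = [[0.99, 0.80, 0.70],
       [0.60, 0.99, 0.90],
       [0.75, 0.85, 0.99]]

print(round(off_diagonal_mean(new), 4))
print(round(off_diagonal_improvement(new, old), 4))
```
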
#### Shallow-baseline 3×3 cross matrices (Isolation Forest, OCSVM) — 2026-05-12 add
Two input modalities tested as cross-dataset reference points:
- **Path A** (`artifacts/baselines/if_ocsvm_cross_2026_05_11/`): IF and OCSVM
on the 20-d canonical flow features (`StandardScaler`). Strong shallow
baseline — best off-diagonal AUROC is OCSVM 0.966 on CICIDS17→CICDDoS19.
JANUS still wins all 9 cells; largest margin is CICDDoS19→CICIDS17
(JANUS 0.941 vs OCSVM 0.571, **+0.370 AUROC**).
- **Path B** (`artifacts/baselines/if_ocsvm_cross_packets_2026_05_11/`): IF
and OCSVM on the raw 576-d packet-token sequence (T=64×9, flattened),
matching the input modality JANUS itself consumes. Numbers are weaker
across the board (on average about 0.16 AUROC below path A); 3 IF cells and
1 OCSVM cell drop **below random**. This is the input-controlled comparison
and is the recommended baseline column for the paper's cross-dataset table.
Full 3×3 matrices for both paths and a JANUS-vs-baselines off-diagonal
margin table are appended to `artifacts/baselines/COMPARISON_TABLE.md`.
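
A hedged sketch of what a path B cell looks like: flatten each 64×9 packet tensor to 576 features, fit on source benign only, and score target flows. The data is synthetic, and the hyperparameters (`nu=0.05`, default `IsolationForest`) and the use of `StandardScaler` on path B are assumptions, not the settings of `run_if_ocsvm_cross_packets.py`.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
T, C = 64, 9  # 64 packets x 9 channels, as in the path B description

# Synthetic placeholders for source-benign training flows and target flows.
source_benign = rng.normal(size=(400, T, C))
target_flows = rng.normal(size=(200, T, C))

def flatten(packets):
    """(n, T, C) -> (n, T*C) feature matrix."""
    return packets.reshape(len(packets), -1)

X_train, X_test = flatten(source_benign), flatten(target_flows)

models = {
    "IF": make_pipeline(StandardScaler(), IsolationForest(random_state=0)),
    "OCSVM": make_pipeline(StandardScaler(), OneClassSVM(kernel="rbf", gamma="scale", nu=0.05)),
}

scores = {}
for name, model in models.items():
    model.fit(X_train)  # benign-only fit, no labels
    # score_samples is higher for inliers, so negate it to get an anomaly score.
    scores[name] = -model.score_samples(X_test)
```
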
### Reverse cross (CICDDoS2019 → CICIDS2017) — 2026-05-01 update
The reverse direction was the project's "stuck" failure mode (memory note
`reverse_cross_score_redirection_2026_04_25`). Three model variants compared:
| Model | `terminal_norm` | best single score (post-hoc) | **Mahalanobis-OAS** |
|---|---|---|---|
| Legacy Unified + consistency | 0.626 | `pna_packet_median` 0.882 | 0.824 |
| Legacy Unified no consistency | 0.554 | `pna_packet_median` 0.852 | 0.893 |
| **JANUS (new)** | 0.519 | `disc_nll_total` **0.903 ± 0.012** | **0.930 ± 0.015** |
`terminal_norm` collapses (≈ random) across **all** model variants — this is
the source-likeness-classifier failure mode confirmed at the architecture
level, not just a single-recipe artifact. The recovery path is:
1. **DFM head** gives a `disc_nll` score that captures protocol-flag
distribution, which is genuinely transfer-stable.
2. **Mahalanobis-OAS** on the 10-d score vector aggregates `disc_nll` with
the (broken-but-not-useless) terminal scores into a 0.93 ± 0.015 AUROC.
3. Compared to Shafir's reverse 0.93 on this direction, JANUS +
Mahalanobis-OAS **matches** that benchmark (0.93 = 0.93). Does NOT beat.
This is **+0.31 over our own legacy memory baseline of 0.62**. The "main
attack direction" recorded in `reverse_cross_score_redirection_2026_04_25`
is now substantially solved.
Thresholded F1 / Precision / Recall / TPR@FPR under the unsupervised threshold
protocol are reported in **section D** below. Headline thresholded numbers:
CICDDoS2019 within `terminal_norm` F1=0.993 ± 0.001 at τ=P95; cross `terminal_norm`
F1=0.632 ± 0.051 at τ=P95 (precision ≈ 0.95, recall ≈ 0.47).
> **Note on cross-dataset baseline**: Shafir's Table IX is asymmetric.
> The IDS2017→DDoS2019 (forward) direction reads **0.89**, not
> 0.93. The 0.93 number is the reverse direction (DDoS2019→IDS2017),
> which is evaluated in the reverse-cross section above. See
> `artifacts/locked_baselines.md`.
> **Note on σ choice**: headline numbers use per-task best σ (σ=0.1 for ISCXTor2016
> and CICDDoS2019; σ=0.6 for CICIDS2017 within and cross). Within-dataset
> tasks are σ-insensitive within seed noise; cross-dataset requires σ=0.6.
> Single-policy σ=0.6 also beats Shafir on 4/4. Full 4×2 sensitivity table
> in `artifacts/sigma_validation.md`.
### D. Thresholded operating-point metrics
⚠️ Numbers in this section are from the **Unified_CFM legacy** recipe (σ=0.1
within, σ=0.6 cross, λ=0.3, single fixed score). Equivalent thresholded
numbers for current JANUS + Mahalanobis-OAS have not been recomputed yet;
the AUROC tables (A/B/C above) are the authoritative JANUS comparison.
**Protocol**: τ is set from a benign-val half (A); F1 / Precision / Recall /
FPR are measured on benign-val half B + attack. AUROC / AUPRC use full
benign val + attack. TPR@FPR is measured on the test half. Both percentiles
τ ∈ {P95, P99} are reported because they correspond to different operating
points and F1 is sensitive to that choice.
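
The protocol above can be sketched in a few lines: τ comes from benign half A only, and the metrics are computed on benign half B plus attacks. The nearest-rank percentile and the `score > tau` decision rule are assumptions; the scores are toy integers chosen to make the arithmetic easy to follow.

```python
import math

def percentile(values, q):
    """Nearest-rank q-th percentile (q is an integer in 1..100)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]

def thresholded_metrics(benign_a, benign_b, attack, q=95):
    tau = percentile(benign_a, q)           # threshold from benign half A only
    fp = sum(s > tau for s in benign_b)     # benign half B flagged as attack
    tp = sum(s > tau for s in attack)
    fn = len(attack) - tp
    return {
        "tau": tau,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / len(attack),
        "f1": 2 * tp / (2 * tp + fp + fn) if tp else 0.0,
        "fpr": fp / len(benign_b),
    }

# Toy scores: benign halves cover 0..99, attacks cover 50..149.
benign_a = list(range(100))
benign_b = [i + 0.5 for i in range(100)]
attack = [i + 50 for i in range(100)]

m95 = thresholded_metrics(benign_a, benign_b, attack, q=95)
m99 = thresholded_metrics(benign_a, benign_b, attack, q=99)
print(m95)
print(m99)
```

Moving from P95 to P99 trades false positives for recall, which is why both operating points are reported.
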
**CICDDoS2019 within** (σ=0.1, λ=0.3):
| Score | AUROC | AUPRC | F1 (P95) | Prec (P95) | Recall (P95) | FPR (P95) | F1 (P99) | TPR@1%FPR | TPR@5%FPR |
|---|---|---|---|---|---|---|---|---|---|
| `terminal_norm` | 0.9960 ± 0.0011 | 0.9975 ± 0.0008 | 0.9932 ± 0.0012 | 0.9881 ± 0.0015 | 0.9983 ± 0.0008 | 0.0481 ± 0.0061 | 0.9112 ± 0.0402 | 0.9013 ± 0.0540 | 0.9980 ± 0.0014 |
| `terminal_flow` | 0.9885 ± 0.0028 | 0.9918 ± 0.0017 | 0.9788 ± 0.0086 | 0.9868 ± 0.0009 | 0.9710 ± 0.0163 | 0.0517 ± 0.0030 | 0.7752 ± 0.0128 | 0.6052 ± 0.0347 | 0.9697 ± 0.0169 |
**CICIDS2017 → CICDDoS2019 cross** (σ=0.6, λ=0.3):
| Score | AUROC | AUPRC | F1 (P95) | Prec (P95) | Recall (P95) | FPR (P95) | F1 (P99) | TPR@1%FPR | TPR@5%FPR |
|---|---|---|---|---|---|---|---|---|---|
| `terminal_norm` | 0.9109 ± 0.0032 | 0.8974 ± 0.0047 | 0.6321 ± 0.0513 | 0.9545 ± 0.0045 | 0.4745 ± 0.0550 | 0.0441 ± 0.0011 | 0.4202 ± 0.0171 | 0.2685 ± 0.0139 | 0.4940 ± 0.0399 |
| `terminal_flow` | 0.9197 ± 0.0036 | 0.8957 ± 0.0086 | 0.6324 ± 0.0585 | 0.9517 ± 0.0055 | 0.4762 ± 0.0639 | 0.0469 ± 0.0019 | 0.4028 ± 0.0049 | 0.2534 ± 0.0039 | 0.4776 ± 0.0636 |
**Reading**:
- *Within-dataset CICDDoS2019* saturates: at τ=P95 F1 ≈ 0.99 with balanced
precision and recall ≈ 0.99; at τ=P99 (≈1% FPR) F1 ≈ 0.91 with TPR@1%FPR
≈ 0.90. The model is a working detector at fixed thresholds, not just an
AUROC artifact.
- *Cross-dataset CICIDS2017→CICDDoS2019* keeps AUROC ≈ 0.91 but at fixed τ
shows precision ≈ 0.95 / recall ≈ 0.47 at P95 and TPR ≈ 0.27 at 1% FPR: the
cross-dataset domain shift compresses the score gap, so source-calibrated
thresholds are conservative on target. **AUROC alone overstates
deployability cross-dataset; thresholded numbers are the honest figure.**
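
For readers unfamiliar with the TPR@FPR columns, here is one way to compute them: choose the tightest threshold that lets at most the target fraction of benign samples through, then measure the attack detection rate at that threshold. The helper name and the toy scores are hypothetical.

```python
import math

def tpr_at_fpr(benign, attack, target_fpr):
    """TPR at the threshold whose benign false-positive rate is <= target_fpr.

    Assumes 0 <= target_fpr < 1 and flags a sample when score > tau.
    """
    ordered = sorted(benign, reverse=True)
    # Number of benign samples allowed above the threshold (epsilon guards float error).
    k = math.floor(target_fpr * len(ordered) + 1e-9)
    tau = ordered[k]  # without ties, exactly k benign scores exceed tau
    return sum(s > tau for s in attack) / len(attack)

# Toy scores: benign 0..99, attacks 50..149.
benign = list(range(100))
attack = [i + 50 for i in range(100)]

tpr_5 = tpr_at_fpr(benign, attack, 0.05)
tpr_1 = tpr_at_fpr(benign, attack, 0.01)
print(tpr_5, tpr_1)
```
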
**TIPSO-GAN comparability**: TIPSO-GAN's CICDDoS2019 F1 ≈ 0.99 is reported
under a **supervised** protocol (model has seen attack examples). Our F1
≈ 0.99 on CICDDoS2019 within is achieved under the **unsupervised** protocol
(benign-only training, threshold from benign val), which is the strictly
harder setting. The two F1 values are numerically on par, and the protocol
asymmetry is in our favor.
## Methodological contribution: `flow_consistency` diagnostic score
Phase 2 masked-prediction consistency loss unlocks a new score that is
discriminative **only when the model is trained with the consistency loss**:
| Dataset | baseline (no aux) | Phase 2 (λ=0.3, σ=0.1) |
|---|---|---|
| ISCXTor2016 | 0.6543 | 0.9011 ± 0.0125 (+0.247) |
| CICIDS2017 | 0.5745 | 0.8770 ± 0.0039 (+0.302) |
| CICDDoS2019 | 0.9084 | 0.9459 ± 0.0188 (+0.038) |
On **SSH-Patator** — the worst class in CICIDS2017 for `terminal_norm`
(0.6407 ± 0.0675) — `flow_consistency` reaches 0.94, providing a reliable
detector where standard density scores fail.
## Per-attack-family pattern
`terminal_norm` dominates on volumetric attacks (DDoS, DoS, Portscan, all
DrDoS_*) — saturated 0.97-0.99. Decomposed scores compete only on
brute-force / app-layer attacks where flow-level signal is strong but
packet-level signal is weak:
| Class | n | terminal_norm | best decomposed score | best AUROC |
|---|---|---|---|---|
| SSH-Patator | 168 | 0.6407 ± 0.0675 | `kinetic_flow` | 0.9458 ± 0.0080 |
| FTP-Patator | 256 | 0.8963 ± 0.0015 | `terminal_flow` | 0.9773 ± 0.0049 |
| DoS GoldenEye | 448 | 0.9760 ± 0.0008 | `terminal_flow` | 0.9868 ± 0.0015 |
Outside these classes, `terminal_norm` is the right primary; decomposed
scores are diagnostic only.
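
The per-family numbers are per-class AUROCs: benign scores against the scores of one attack class at a time. A minimal pure-Python sketch, using the pairwise (Mann-Whitney) definition of AUROC; class names and scores are toy values, not entries from `PER_ATTACK_TABLE.md`.

```python
def auroc(neg_scores, pos_scores):
    """P(random positive scores above random negative); ties count as 0.5."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

def per_class_auroc(benign_scores, attack_scores_by_class):
    """AUROC of benign vs each attack class separately."""
    return {cls: auroc(benign_scores, scores)
            for cls, scores in attack_scores_by_class.items()}

# Toy anomaly scores for one score channel.
benign = [1, 2, 3, 4]
attacks = {"volumetric": [5, 6, 7], "brute_force": [2, 3, 5]}

result = per_class_auroc(benign, attacks)
print(result)
```

Running this once per score channel and taking the per-class maximum reproduces the shape of the table above, with the caveat that picking the best channel per class uses attack labels.
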
## What the experiments proved
1. **JANUS sets new SOTA vs external Shafir 2026 NF on 3/3 directly
comparable within-dataset benchmarks** under unbiased Mahalanobis-OAS
eval (+0.054 to +0.118, all margins outside 3-seed std). CICIoT2023 is
metric-mismatched (F1 vs AUROC) and reported as additional benchmark.
2. **Within-dataset is saturated**: JANUS + Mahalanobis-OAS stays within
±0.005 of our own internal Unified_CFM legacy (inside seed std on CICIDS2017
and CICIoT2023, marginally outside on ISCXTor2016 and CICDDoS2019). At
AUROC > 0.99 the regime has no resolving power; benchmarks here cannot
distinguish models. The right axis is cross-dataset.
3. **JANUS recovers the previously catastrophic reverse cross direction**:
CICDDoS2019→CICIDS2017 from legacy `terminal_norm` 0.62 → JANUS
Mahalanobis-OAS 0.93. Matches Shafir's 0.93 on the same direction
(does not exceed). The "source-likeness collapse" failure mode of
`terminal_norm` is confirmed at the architecture level (≤ 0.63 across
3 distinct backbones) and is broken by the DFM head + Mahalanobis route.
4. **Discrete Flow Matching on flag/direction channels unlocks a new score
family** (`disc_nll_total`) that is independent of `terminal_norm`. It is
the single best cross→CICIDS2017 fixed score across all 5 routes
(0.9191). Without it the Mahalanobis aggregator has nothing to recover
reverse cross with.
5. **Causal-packet attention reduces multi-seed std** by ~1.6-8× on every
dataset (see Stability), indicating the protocol-causal prior is a
stabilizer for CFM training.
6. **Phase-2 consistency loss is no longer the lead mechanism**: useful for
the `flow_consistency` diagnostic family, but JANUS's `terminal_packet`
and `disc_nll_total` heads cover its function without the
masked-prediction aux loss.
7. **σ-band noise is a transfer-friendly regularizer**: σ=0.6 cross-dataset
AUROC is +0.02 over σ=0.1, matching the σ=0.6 sweet spot from Packet_CFM.
8. **Per-attack-family analysis is the right reporting frame** — averaged
AUROC hides the SSH-Patator-style cases where decomposed scores save
the day.
## What the experiments disproved
1. **Curvature as primary score**: 0.32-0.91 across datasets, much weaker
than `terminal_norm`. Has diagnostic value on SSH-Patator (+0.30) but
should not lead reporting.
2. **Jacobian-Hutchinson as primary score**: 0.32-0.59 on ISCXTor2016 —
below random for some sub-scores. Failed.
3. **Time-profile velocity scores**: at best +0.005 over `terminal_norm`
on average. Some per-class wins on brute-force but not enough to lead.
## Configuration
```yaml
# CURRENT SOTA: JANUS (Mixed_CFM + causal-packet attention).
# Configs at: Mixed_CFM/configs/<dataset>_seed{42,43,44}.yaml
model:
  T: 64
  d_model: 128
  n_layers: 4
  n_heads: 4
  mlp_ratio: 4.0
  time_dim: 64
  use_ot: true
  reference_mode: causal_packets   # ← Route A: packet-causal attention
training:
  n_train: 10000
  epochs: 50
  batch_size: 256
  lr: 3.0e-4
  # Mixed CFM packet preprocessing: cont channels z-scored,
  # disc channels (direction + 5 TCP flags) kept as int {0,1}
  sigma: 0.1
  lambda_disc: 1.0                 # ← Route C: DFM cross-entropy weight
scoring (per dataset best):
  ISCXTor2016 / CICDDoS2019: terminal_norm
  CICIDS2017 / CICIoT2023: terminal_packet
  cross→CICIDS2017: disc_nll_total
```
### Legacy config (Unified_CFM with Phase-2 consistency)
Kept for reference; superseded by JANUS on cross-dataset (within-dataset is
saturated and JANUS ties legacy in noise):
```yaml
lambda_flow: 0.3
lambda_packet: 0.3
packet_mask_ratio: 0.5
sigma: 0.1 within / 0.6 cross
```
## Stability
JANUS std vs legacy Unified_CFM std (3 seeds):
| Dataset | Legacy std | **JANUS std** | std reduction |
|---|---|---|---|
| ISCXTor2016 | 0.0011 | **0.0007** | 1.6× |
| CICIDS2017 | 0.0021 | **0.0013** | 1.6× |
| CICDDoS2019 | 0.0010 | **0.0005** | 2× |
| CICIoT2023 | 0.0017 | **0.0002** | **8×** |
Causal-packet attention is the dominant contributor to std reduction:
isolated Route A alone already cut std on `terminal_norm` in CICIoT2023 to
about a third (Route A alone: 0.0006 vs baseline 0.0017).
Legacy reference (kept for completeness):
- `terminal_norm` ISCXTor2016: ±0.0011 (σ=0.1) / ±0.0019 (σ=0.6)
- `terminal_norm` CICIDS2017: ±0.0021 (σ=0.6)
- `terminal_norm` CICDDoS2019: ±0.0010 (σ=0.1)
- cross `terminal_norm` σ=0.6: ±0.0032
- cross `terminal_flow` σ=0.6: ±0.0036
The legacy margins over Shafir (+0.121 on ISCXTor2016 and +0.055 on
CICIDS2017) are far larger than these stds, so they are not single-seed
artifacts.
## Source artifacts
- `artifacts/locked_baselines.md` — verified Shafir baselines (PDF inspection trail).
- `artifacts/sigma_validation.md` — full 4×2 σ-sensitivity table (σ ∈ {0.1, 0.6} ×
4 tasks, 3 seeds each) and per-task σ-selection protocol.
- `artifacts/reverse_cross.md` — reverse direction CICDDoS2019 → CICIDS2017
evaluation (3 seeds × 2 σ × 16 scores). Asymmetry finding.
- `artifacts/phase25_multiseed_2026_04_25/PER_ATTACK_TABLE.md` — per-attack
multi-seed table (granular `terminal_norm` vs decomposed scores per class).
- `artifacts/phase{0,1,25}*/<config_name>_seed*/phase1_summary.json` — raw
per-seed eval results across all experiments.
- `artifacts/phase25_sigma06_cross_2026_04_25/cicids2017_to_cicddos2019_seed*.json`
3-seed cross-dataset eval JSONs.
- `artifacts/baselines/if_ocsvm_cross_2026_05_11/CROSS_MATRIX_3x3.md`
IF/OCSVM 3×3 cross matrix on 20-d canonical flow features (path A).
- `artifacts/baselines/if_ocsvm_cross_packets_2026_05_11/CROSS_MATRIX_3x3.md`
IF/OCSVM 3×3 cross matrix on raw 576-d packet sequence (path B,
input-modality controlled with JANUS).
- Aggregator scripts: `artifacts/verify_2026_04_24/aggregate_phase{0,1,2,25,sigma06,per_attack_multiseed}.py`.
- Orchestrator scripts: `artifacts/verify_2026_04_24/run_phase*.sh`.
Phase summary markdown reports were superseded by this `RESULTS.md` and
removed during the 2026-04-25 baseline-lock cleanup. The aggregator
scripts can regenerate any historical view from the raw JSON results.