ablation: add Group A (aggregator) + Group B (architecture) infrastructure

Extends MixedCFMConfig with 5 backwards-compatible flags (use_flow_token, n_packet_tokens, disc_as_cont, cont_as_disc + cont_n_bins) so existing JANUS-full checkpoints load with 0 missing/unexpected keys.

Adds:
- 60 ablation training configs (5 variants × 4 datasets × 3 seeds)
- scripts/ablation/{generate_configs.py, run_groupB.sh, run_cross_groupB.sh, smoke_test.sh} — config generation + GPU drivers
- scripts/aggregate/aggregate_ablation{,_cross,_cross_B}.py — produces within-dataset and cross-dataset (3×3) ablation tables with 3-seed mean ± 95% t-CI plus optional paired DeLong p-values

README updated with ablation section pointing at artifacts/ablation/ABLATION_SUMMARY.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
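
For readers unfamiliar with the "3-seed mean ± 95% t-CI" reported by the aggregation scripts, the following is a minimal sketch of that statistic. It is not the code in scripts/aggregate/; the function name `mean_t_ci`, the lookup table `T_CRIT_95`, and the sample scores are hypothetical and exist only for illustration. The two-sided 95% Student-t critical values are hardcoded so the example needs only the standard library.

```python
import math
import statistics

# Two-sided 95% Student-t critical values keyed by degrees of freedom (n - 1).
# Only small sample sizes are covered; 3 seeds means df = 2.
T_CRIT_95 = {1: 12.706, 2: 4.303, 3: 3.182, 4: 2.776}


def mean_t_ci(values):
    """Return (mean, half_width) of a 95% t confidence interval."""
    n = len(values)
    if n - 1 not in T_CRIT_95:
        raise ValueError(f"unsupported sample size: {n}")
    mean = statistics.mean(values)
    # Standard error of the mean, using the sample standard deviation (ddof = 1).
    sem = statistics.stdev(values) / math.sqrt(n)
    return mean, T_CRIT_95[n - 1] * sem


if __name__ == "__main__":
    # Hypothetical per-seed scores, not results from this repository.
    mean, half = mean_t_ci([0.95, 0.96, 0.97])
    print(f"{mean:.3f} +/- {half:.3f}")
```

With only 3 seeds the critical value is large (about 4.3 rather than the normal-approximation 1.96), so intervals built this way are wide by construction, and small differences between variants can fall inside them.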
2026-05-08 23:59:27 +08:00
parent 1d8862fbeb
commit a6bcbbd299
72 changed files with 3642 additions and 96 deletions
--- a/README.md
+++ b/README.md
@@ -51,6 +51,28 @@ Source (rows) trained on 10K benign of source dataset; target (columns) tested o

 Forward CICIDS17→CICDDoS19 (0.969) beats Shafir 0.89 by **+0.08**; reverse CICDDoS19→CICIDS17 (0.941) approximately matches Shafir 0.93. CICIoT23 is hardest both as source and target — its IoT-protocol diversity makes the "benign of source ≈ benign of target" assumption brittle. Full table at `artifacts/route_comparison/CROSS_MATRIX_3x3.md`.

+### Ablations (architecture & aggregator)
+
+Two orthogonal ablation axes, each evaluated **within-dataset** (4 datasets × 3 seeds) **and** **cross-dataset** (3×3 transfer × 3 seeds):
+
+- **Group A** — 7 alternative aggregators on the same JANUS-full sub-score vector (post-processing only; no retraining).
+- **Group B** — 5 architecture variants, each retrained on 4 datasets × 3 seeds (5 × 4 × 3 = 60 training runs), plus 90 cross-evals.
+
+Every load-bearing JANUS design choice has the **same shape of ablation curve**: small in-distribution cost, large cross-dataset gain.
+
+| Component (removed in ablation) | Variant | Within Δ | Cross-mean Δ | Cross-worst Δ |
+|---|---|---:|---:|---:|
+| FLOW token (global context) | B1 | **−0.94** | −6.70 | −19.97 |
+| Packet sequence | B2 | +0.15 | **−23.82** | **−36.27** |
+| Cont/disc head split (drop disc head) | B3 | +0.44 | **−13.14** | **−25.03** |
+| CFM head (drop continuous side) | B4 | **−2.37** | −2.03 | −2.86 |
+| Joint training of two heads | B5 | +0.20 | **−18.93** | **−27.54** |
+| OAS Mahalanobis aggregator | A1 vs A5 | +0.37 | **−15.88** | **−27.38** |
+
+Three ablations (B3, B5, and the Group A aggregator ablation) **marginally beat JANUS-full in within-dataset evaluation** but collapse on at least one cross-dataset transfer direction. The disc head, joint training, and OAS aggregator are deliberate trades: their value is exclusively in cross-dataset robustness.
+
+Full headline summary: `artifacts/ablation/ABLATION_SUMMARY.md`. Per-variant 3×3 cross matrices: `artifacts/ablation/ABLATION_CROSS_B_full.md` and `artifacts/ablation/ABLATION_TABLE_CROSS_full.md`.
+
 ## Layout

 ```
@@ -74,6 +96,12 @@ scripts/                   Workspace-level pcap → artifact pipeline,
                           orchestration. aggregate_score_router.py is the
                           deployable score path; run_cross_3x3.sh +
                           cross_3x3_table.py produce the cross matrix.
+                           aggregate_ablation.py / aggregate_ablation_cross.py /
+                           aggregate_ablation_cross_B.py produce the ablation
+                           tables in artifacts/ablation/.
+  ablation/                B-group ablation training/eval drivers
+                           (generate_configs.py, run_groupB.sh,
+                           run_cross_groupB.sh).
 tests/                     Data-contract unit tests.
 ```

@@ -177,7 +205,8 @@ Common gotcha: if CSV timestamps and pcap epochs are in different time zones, `e

 ## Authoritative documents

- `RESULTS.md` — full headline tables, ablations, per-attack analysis, JANUS configuration, thresholded operating-point metrics, what the experiments proved / disproved.
+- `RESULTS.md` — full headline tables, per-attack analysis, JANUS configuration, thresholded operating-point metrics, what the experiments proved / disproved.
+- `artifacts/ablation/ABLATION_SUMMARY.md` — paper-facing ablation summary (Group A aggregator + Group B architecture, both within and cross views).
 - `Mixed_CFM/model.py` and `common/data_contract.py` — model + data-contract source of truth.

 ## Python environment