Thresholded metrics — unsupervised AD protocol
3-seed mean ± std. Threshold τ is set on benign-val half A; F1 / Precision / Recall / FPR are measured on benign-val half B + attack. AUROC/AUPRC use full benign val + attack. TPR@FPR is measured on the test half.
Both percentiles are reported because P95 and P99 give different operating points; F1 numbers are sensitive to that choice.
Primary score: terminal_norm. terminal_flow is reported on cross because RESULTS.md headlines both.
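The protocol above can be sketched in a few lines of NumPy. This is a minimal illustration, not the code in thresholded_metrics.py: the function names, the strict `>` comparison, the convention that a higher score means more anomalous, and the quantile-based TPR@FPR are all assumptions made for the example.

```python
import numpy as np


def thresholded_metrics(benign_a, benign_b, attack, percentile=95.0):
    """Set tau on benign half A, then score benign half B + attack.

    Assumes higher score = more anomalous (an assumption of this sketch).
    """
    benign_a = np.asarray(benign_a, dtype=float)
    benign_b = np.asarray(benign_b, dtype=float)
    attack = np.asarray(attack, dtype=float)

    # Threshold comes from benign data only; no attack sample is used.
    tau = float(np.percentile(benign_a, percentile))

    tp = int(np.sum(attack > tau))      # attacks flagged
    fn = attack.size - tp               # attacks missed
    fp = int(np.sum(benign_b > tau))    # benign flagged
    tn = benign_b.size - fp             # benign passed

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return {"tau": tau, "precision": precision, "recall": recall,
            "f1": f1, "fpr": fpr}


def tpr_at_fpr(benign, attack, target_fpr):
    """TPR when tau is the (1 - target_fpr) quantile of the benign scores."""
    tau = np.quantile(np.asarray(benign, dtype=float), 1.0 - target_fpr)
    return float(np.mean(np.asarray(attack, dtype=float) > tau))
```

Calling `thresholded_metrics(half_a, half_b, attack, percentile=99.0)` gives the P99 operating point under the same assumptions; only the percentile changes, which is why both columns can be produced from one set of saved scores.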
CICDDoS2019 within (σ=0.1, λ=0.3)
| Score | AUROC | AUPRC | F1 (P95) | Prec (P95) | Recall (P95) | FPR (P95) | F1 (P99) | TPR@1%FPR | TPR@5%FPR |
|---|---|---|---|---|---|---|---|---|---|
| terminal_norm | 0.9960 ± 0.0011 | 0.9975 ± 0.0008 | 0.9932 ± 0.0012 | 0.9881 ± 0.0015 | 0.9983 ± 0.0008 | 0.0481 ± 0.0061 | 0.9112 ± 0.0402 | 0.9013 ± 0.0540 | 0.9980 ± 0.0014 |
| terminal_flow | 0.9885 ± 0.0028 | 0.9918 ± 0.0017 | 0.9788 ± 0.0086 | 0.9868 ± 0.0009 | 0.9710 ± 0.0163 | 0.0517 ± 0.0030 | 0.7752 ± 0.0128 | 0.6052 ± 0.0347 | 0.9697 ± 0.0169 |
CICIDS2017 → CICDDoS2019 cross (σ=0.6, λ=0.3)
| Score | AUROC | AUPRC | F1 (P95) | Prec (P95) | Recall (P95) | FPR (P95) | F1 (P99) | TPR@1%FPR | TPR@5%FPR |
|---|---|---|---|---|---|---|---|---|---|
| terminal_norm | 0.9109 ± 0.0032 | 0.8974 ± 0.0047 | 0.6321 ± 0.0513 | 0.9545 ± 0.0045 | 0.4745 ± 0.0550 | 0.0441 ± 0.0011 | 0.4202 ± 0.0171 | 0.2685 ± 0.0139 | 0.4940 ± 0.0399 |
| terminal_flow | 0.9197 ± 0.0036 | 0.8957 ± 0.0086 | 0.6324 ± 0.0585 | 0.9517 ± 0.0055 | 0.4762 ± 0.0639 | 0.0469 ± 0.0019 | 0.4028 ± 0.0049 | 0.2534 ± 0.0039 | 0.4776 ± 0.0636 |
Reading
- Within-dataset (CICDDoS2019): at τ=P95, terminal_norm reaches F1 ≈ 0.99 with precision ≈ 0.99 and recall ≈ 1.00, i.e. saturation. At τ=P99 (≈1% FPR), F1 ≈ 0.91 and TPR@1%FPR ≈ 0.90. The model is a working detector at fixed thresholds, not just an AUROC artifact.
- Cross-dataset (CICIDS2017 → CICDDoS2019): AUROC stays high (≈ 0.91), but at fixed thresholds precision is high (≈ 0.95) while recall drops to ≈ 0.47 at P95 and ≈ 0.27 at 1% FPR. The cross-dataset domain shift compresses the score gap, so a source-calibrated threshold is conservative on target: false positives stay low, but roughly half of the target-domain attacks score below the source benign P95. AUROC alone overstates cross-dataset deployability; the thresholded numbers are the honest figure.
- TIPSO-GAN comparability: TIPSO-GAN's CIC-DDoS2019 F1 ≈ 0.99 is reported under a supervised protocol (the model has seen attack examples). Our F1 ≈ 0.99 on CICDDoS2019 within is achieved under the unsupervised protocol (benign-only training, threshold from benign-val), which is the strictly harder setting. The two F1 values are numerically equal, and the protocol asymmetry is in our favor.
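The gap between a high AUROC and low recall at a source-calibrated threshold can be reproduced on a toy example. The scores below are synthetic and hand-picked for illustration (they are not drawn from either dataset), and the pairwise `auroc` helper is a hypothetical stand-in for whatever the metric tool actually uses.

```python
import numpy as np


def auroc(benign, attack):
    """Pairwise (Mann-Whitney) AUROC: P(attack score > benign score)."""
    diff = attack[:, None] - benign[None, :]
    return float(np.mean((diff > 0) + 0.5 * (diff == 0)))


# Synthetic toy scores, illustrative only.
source_benign = np.linspace(0.0, 1.0, 101)   # calibration domain
target_benign = np.linspace(0.0, 0.5, 101)   # benign traffic on target
target_attack = np.linspace(0.6, 1.2, 101)   # attacks on target

# Threshold is calibrated on source benign scores only.
tau = float(np.percentile(source_benign, 95))

ranking = auroc(target_benign, target_attack)    # threshold-free
fpr = float(np.mean(target_benign > tau))        # at the source tau
recall = float(np.mean(target_attack > tau))     # at the source tau

print(f"tau={tau:.3f} AUROC={ranking:.3f} FPR={fpr:.3f} recall={recall:.3f}")
```

In this toy setup every attack outranks every benign sample, so AUROC is 1.0 and FPR at the source threshold is 0, yet only 42 of the 101 attacks exceed τ. Ranking quality says nothing about where a fixed threshold falls, which is the same pattern, in exaggerated form, as the cross-dataset row.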
Source artifacts
- artifacts/verify_2026_04_24/thresholded_metrics.py: per-file metric tool.
- artifacts/verify_2026_04_24/aggregate_thresholded.py: this aggregator.
- Within: artifacts/phase1_2026_04_25/cicddos2019_lambda0p3_seed*/thresholded_metrics.json (computed from existing phase1_scores.npz).
- Cross: artifacts/phase25_sigma06_cross_2026_04_25/with_scores/thresholded_seed*.json (raw scores re-saved by patched eval_phase2_cross_cicddos2019.py).
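For readers who want to reproduce the "mean ± std" cells from the per-seed JSON files, the aggregation step might look like the sketch below. It is not the contents of aggregate_thresholded.py: the flat `{metric: value}` JSON schema, the helper names, and the use of population standard deviation (`ddof=0`) are assumptions, and the real files may be nested or use sample standard deviation.

```python
import glob
import json

import numpy as np


def load_records(pattern):
    """Load one dict per seed from JSON files matching a glob pattern.

    Assumes each file holds a flat {metric_name: float} mapping
    (hypothetical schema for this sketch).
    """
    records = []
    for path in sorted(glob.glob(pattern)):
        with open(path, "r", encoding="utf-8") as fh:
            records.append(json.load(fh))
    return records


def mean_std(records, ddof=0):
    """Per-metric mean and std across seeds, for metrics present in all."""
    shared = set(records[0])
    for rec in records[1:]:
        shared &= set(rec)
    out = {}
    for key in sorted(shared):
        values = np.array([rec[key] for rec in records], dtype=float)
        out[key] = (float(values.mean()), float(values.std(ddof=ddof)))
    return out


def format_cell(mean, std):
    """Render a table cell in the 'mean ± std' style used above."""
    return f"{mean:.4f} ± {std:.4f}"
```

With three seeds, the choice of `ddof` changes the reported spread noticeably (sample std is about 22% larger than population std), so it is worth stating which one a table uses.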