Thresholded metrics — unsupervised AD protocol

All values are mean ± std over 3 seeds. Threshold τ is set on benign-val half A; F1 / Precision / Recall / FPR are measured on benign-val half B + attack. AUROC/AUPRC use full benign val + attack. TPR@FPR is measured on the test half (half B + attack).

Both percentiles are reported because P95 and P99 give different operating points; F1 numbers are sensitive to that choice.
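The protocol above can be sketched in a few lines of NumPy. This is a minimal illustration, not the code in thresholded_metrics.py: the function names, the random A/B split, the "higher score = more anomalous" convention, and the synthetic Gaussian scores are all assumptions made for the example.

```python
import numpy as np

def thresholded_metrics(benign_val, attack, percentile=95.0, seed=0):
    """Set tau on benign half A, then evaluate on benign half B + attack.

    Assumes higher scores are more anomalous and that the halves come
    from a random split (the actual split rule is not stated in this note).
    """
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(benign_val))
    half = len(benign_val) // 2
    half_a = benign_val[perm[:half]]   # used only to pick the threshold
    half_b = benign_val[perm[half:]]   # held-out benign for evaluation

    tau = np.percentile(half_a, percentile)

    tp = int(np.sum(attack > tau))
    fn = int(np.sum(attack <= tau))
    fp = int(np.sum(half_b > tau))
    tn = int(np.sum(half_b <= tau))

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return {"tau": float(tau), "precision": precision,
            "recall": recall, "f1": f1, "fpr": fpr}

def tpr_at_fpr(benign_test, attack, target_fpr):
    """TPR when the threshold is placed at the target FPR on benign_test."""
    tau = np.quantile(benign_test, 1.0 - target_fpr)
    return float(np.mean(attack > tau))

# Synthetic scores for illustration only; these are not the reported results.
rng = np.random.default_rng(42)
benign_scores = rng.normal(0.0, 1.0, size=10_000)
attack_scores = rng.normal(4.0, 1.0, size=10_000)

m95 = thresholded_metrics(benign_scores, attack_scores, percentile=95.0)
m99 = thresholded_metrics(benign_scores, attack_scores, percentile=99.0)
tpr1 = tpr_at_fpr(benign_scores, attack_scores, target_fpr=0.01)
print(m95, m99, tpr1)
```

Because τ is fixed on half A, the FPR measured on half B only approximates the nominal 5% or 1%, which is consistent with the FPR (P95) column sitting slightly off 0.05.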

Primary score: terminal_norm. terminal_flow is also reported for the cross-dataset setting because RESULTS.md headlines both scores.

CICDDoS2019 within (σ=0.1, λ=0.3)

Score	AUROC	AUPRC	F1 (P95)	Prec (P95)	Recall (P95)	FPR (P95)	F1 (P99)	TPR@1%FPR	TPR@5%FPR
`terminal_norm`	0.9960 ± 0.0011	0.9975 ± 0.0008	0.9932 ± 0.0012	0.9881 ± 0.0015	0.9983 ± 0.0008	0.0481 ± 0.0061	0.9112 ± 0.0402	0.9013 ± 0.0540	0.9980 ± 0.0014
`terminal_flow`	0.9885 ± 0.0028	0.9918 ± 0.0017	0.9788 ± 0.0086	0.9868 ± 0.0009	0.9710 ± 0.0163	0.0517 ± 0.0030	0.7752 ± 0.0128	0.6052 ± 0.0347	0.9697 ± 0.0169

CICIDS2017 → CICDDoS2019 cross (σ=0.6, λ=0.3)

Score	AUROC	AUPRC	F1 (P95)	Prec (P95)	Recall (P95)	FPR (P95)	F1 (P99)	TPR@1%FPR	TPR@5%FPR
`terminal_norm`	0.9109 ± 0.0032	0.8974 ± 0.0047	0.6321 ± 0.0513	0.9545 ± 0.0045	0.4745 ± 0.0550	0.0441 ± 0.0011	0.4202 ± 0.0171	0.2685 ± 0.0139	0.4940 ± 0.0399
`terminal_flow`	0.9197 ± 0.0036	0.8957 ± 0.0086	0.6324 ± 0.0585	0.9517 ± 0.0055	0.4762 ± 0.0639	0.0469 ± 0.0019	0.4028 ± 0.0049	0.2534 ± 0.0039	0.4776 ± 0.0636

Reading

Within-dataset (CICDDoS2019): at τ=P95, terminal_norm reaches F1 ≈ 0.99 with precision ≈ 0.99 and recall ≈ 1.00, i.e. near saturation. At τ=P99 (≈1% FPR), F1 is ≈ 0.91 and TPR@1%FPR is ≈ 0.90. The model is a working detector at fixed thresholds, not just an AUROC artifact.
Cross-dataset (CICIDS2017 → CICDDoS2019): AUROC stays high (≈ 0.91), but at fixed thresholds precision is high (≈ 0.95) while recall drops to ≈ 0.47 at P95 and TPR to ≈ 0.27 at 1% FPR. The cross-dataset domain shift compresses the score gap, so a source-calibrated threshold is conservative on the target: false positives stay low, but a substantial fraction of target-domain attacks score below the source benign P95. AUROC alone overstates cross-dataset deployability; the thresholded numbers are the honest figure.
TIPSO-GAN comparability: TIPSO-GAN's CIC-DDoS2019 F1 ≈ 0.99 is reported under a supervised protocol (the model has seen attack examples). Our F1 ≈ 0.99 on CICDDoS2019 within is achieved under the unsupervised protocol (benign-only training, threshold from benign-val), which is the strictly harder setting. The two F1 values are numerically on par, and the protocol asymmetry is in our favor.
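As a quick consistency check on the tables, F1 can be recomputed from the mean precision and recall of the terminal_norm rows at P95. The helper name `f1_from` is hypothetical, introduced only for this example. The result will not match the reported F1 exactly, because the tables average per-seed F1 values, and the mean of per-seed F1 is not the F1 of the mean precision and recall.

```python
def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Mean precision / recall at P95 for terminal_norm, taken from the tables above.
within_f1 = f1_from(0.9881, 0.9983)  # CICDDoS2019 within
cross_f1 = f1_from(0.9545, 0.4745)   # CICIDS2017 -> CICDDoS2019 cross

print(f"within: {within_f1:.4f}  cross: {cross_f1:.4f}")
```

The cross-dataset value shows how low recall dominates the harmonic mean: precision near 0.95 cannot lift F1 much above 0.63 when fewer than half of the attacks clear the threshold.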

Source artifacts

artifacts/verify_2026_04_24/thresholded_metrics.py — per-file metric tool.
artifacts/verify_2026_04_24/aggregate_thresholded.py — this aggregator.
Within: artifacts/phase1_2026_04_25/cicddos2019_lambda0p3_seed*/thresholded_metrics.json (computed from existing phase1_scores.npz).
Cross: artifacts/phase25_sigma06_cross_2026_04_25/with_scores/thresholded_seed*.json (raw scores re-saved by patched eval_phase2_cross_cicddos2019.py).
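The "mean ± std" cells could be produced from per-seed metric dictionaries roughly as follows. This is a sketch under stated assumptions, not the contents of aggregate_thresholded.py: the key names, the in-memory input format, and the use of the population standard deviation (`ddof=0`) are all hypothetical, and the numbers are placeholders rather than real results.

```python
import numpy as np

def aggregate(per_seed):
    """Return {metric: (mean, std)} across a list of per-seed metric dicts.

    Assumes every dict has the same keys. ddof=0 (population std) is an
    assumption; the note does not say which convention the aggregator uses.
    """
    out = {}
    for key in per_seed[0]:
        vals = np.array([m[key] for m in per_seed], dtype=float)
        out[key] = (float(vals.mean()), float(vals.std(ddof=0)))
    return out

def fmt(mean, std):
    """Format a cell the way the tables do, e.g. '0.6000 ± 0.0816'."""
    return f"{mean:.4f} ± {std:.4f}"

# Placeholder per-seed values for illustration only.
seeds = [{"f1_p95": 0.5}, {"f1_p95": 0.6}, {"f1_p95": 0.7}]
summary = aggregate(seeds)
cell = fmt(*summary["f1_p95"])
print(cell)
```

With only three seeds, the choice between `ddof=0` and `ddof=1` changes the reported std by a factor of about 1.22, so the convention is worth stating alongside the tables.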
