Thresholded metrics — unsupervised AD protocol

All values are mean ± std over 3 seeds. The threshold τ is a percentile (P95 or P99) of the scores on benign-val half A; F1, Precision, Recall, and FPR are measured on benign-val half B + attack. AUROC and AUPRC use the full benign val + attack. TPR@FPR is measured on the test half.

Both percentiles are reported because P95 and P99 give different operating points; F1 numbers are sensitive to that choice.
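The protocol above can be sketched as follows. This is a minimal illustration, not the code in thresholded_metrics.py: the function name, the synthetic Gaussian scores, and the convention that a higher score means more anomalous (a sample is flagged when its score exceeds τ) are all assumptions made for the example.

```python
import numpy as np

def thresholded_metrics(benign_a, benign_b, attack, percentile):
    """Set tau on benign half A, then evaluate on benign half B + attack.

    Assumption: higher score = more anomalous, flagged when score > tau.
    """
    benign_a = np.asarray(benign_a, dtype=float)
    benign_b = np.asarray(benign_b, dtype=float)
    attack = np.asarray(attack, dtype=float)

    # Threshold comes from benign data only (no attack labels are used).
    tau = float(np.percentile(benign_a, percentile))

    tp = int(np.sum(attack > tau))    # attacks flagged
    fn = attack.size - tp             # attacks missed
    fp = int(np.sum(benign_b > tau))  # benign flagged
    tn = benign_b.size - fp           # benign passed

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return {"tau": tau, "precision": precision, "recall": recall,
            "f1": f1, "fpr": fpr}

# Synthetic scores for illustration only (not the JANUS scores).
rng = np.random.default_rng(0)
benign = rng.normal(0.0, 1.0, 10_000)
attack_scores = rng.normal(2.0, 1.0, 5_000)
half_a, half_b = benign[:5_000], benign[5_000:]

m95 = thresholded_metrics(half_a, half_b, attack_scores, 95.0)
m99 = thresholded_metrics(half_a, half_b, attack_scores, 99.0)
print("P95:", m95)
print("P99:", m99)
```

Moving from P95 to P99 raises τ, which lowers FPR and recall together; that is why F1 differs between the two operating points.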

Primary score: terminal_norm. terminal_flow is reported on cross because RESULTS.md headlines both.

CICDDoS2019 within (σ=0.1, λ=0.3)

| Score | AUROC | AUPRC | F1 (P95) | Prec (P95) | Recall (P95) | FPR (P95) | F1 (P99) | TPR@1%FPR | TPR@5%FPR |
|---|---|---|---|---|---|---|---|---|---|
| terminal_norm | 0.9960 ± 0.0011 | 0.9975 ± 0.0008 | 0.9932 ± 0.0012 | 0.9881 ± 0.0015 | 0.9983 ± 0.0008 | 0.0481 ± 0.0061 | 0.9112 ± 0.0402 | 0.9013 ± 0.0540 | 0.9980 ± 0.0014 |
| terminal_flow | 0.9885 ± 0.0028 | 0.9918 ± 0.0017 | 0.9788 ± 0.0086 | 0.9868 ± 0.0009 | 0.9710 ± 0.0163 | 0.0517 ± 0.0030 | 0.7752 ± 0.0128 | 0.6052 ± 0.0347 | 0.9697 ± 0.0169 |

CICIDS2017 → CICDDoS2019 cross (σ=0.6, λ=0.3)

| Score | AUROC | AUPRC | F1 (P95) | Prec (P95) | Recall (P95) | FPR (P95) | F1 (P99) | TPR@1%FPR | TPR@5%FPR |
|---|---|---|---|---|---|---|---|---|---|
| terminal_norm | 0.9109 ± 0.0032 | 0.8974 ± 0.0047 | 0.6321 ± 0.0513 | 0.9545 ± 0.0045 | 0.4745 ± 0.0550 | 0.0441 ± 0.0011 | 0.4202 ± 0.0171 | 0.2685 ± 0.0139 | 0.4940 ± 0.0399 |
| terminal_flow | 0.9197 ± 0.0036 | 0.8957 ± 0.0086 | 0.6324 ± 0.0585 | 0.9517 ± 0.0055 | 0.4762 ± 0.0639 | 0.0469 ± 0.0019 | 0.4028 ± 0.0049 | 0.2534 ± 0.0039 | 0.4776 ± 0.0636 |

Reading

  • Within-dataset (CICDDoS2019): at τ=P95, terminal_norm reaches F1 ≈ 0.99 with precision ≈ 0.99 and recall ≈ 1.00, i.e. saturation. At τ=P99 (≈1% FPR), F1 ≈ 0.91 and TPR@1%FPR ≈ 0.90. The model is a working detector at fixed thresholds, not just an AUROC artifact.
  • Cross-dataset (CICIDS2017 → CICDDoS2019): AUROC stays high (≈ 0.91), but at fixed thresholds precision is high (≈ 0.95) while recall drops to ≈ 0.47 at P95 and TPR to ≈ 0.27 at 1% FPR. The cross-dataset domain shift compresses the score gap, so a source-calibrated threshold is conservative on the target: false positives stay low (FPR ≈ 0.04 at P95), but a substantial fraction of target-domain attacks score below the source benign P95. AUROC alone overstates cross-dataset deployability; the thresholded numbers are the honest figure.
  • TIPSO-GAN comparability: TIPSO-GAN's CIC-DDoS2019 F1 ≈ 0.99 is reported under a supervised protocol (the model has seen attack examples). Our F1 ≈ 0.99 on CICDDoS2019 within is achieved under the unsupervised protocol (benign-only training, threshold from benign-val), which is the strictly harder setting. The two F1 values are numerically equivalent, and the protocol asymmetry is in our favor.
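As a quick consistency check on the figures quoted above, F1 can be recomputed as the harmonic mean of the reported mean precision and recall for terminal_norm at P95. This is only a sanity sketch: the tables presumably average F1 per seed, so F1 computed from mean precision and mean recall need not match the tabulated mean exactly, especially when recall varies across seeds as it does in the cross setting. The helper name `f1_from` is hypothetical.

```python
def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2.0 * precision * recall / (precision + recall)

# Mean precision / recall for terminal_norm at P95, taken from the tables.
f1_within = f1_from(0.9881, 0.9983)  # table reports F1 (P95) = 0.9932
f1_cross = f1_from(0.9545, 0.4745)   # table reports F1 (P95) = 0.6321

print(f"within: {f1_within:.4f}")
print(f"cross:  {f1_cross:.4f}")
```

In the cross setting precision stays near 0.95, so the F1 drop is driven almost entirely by recall, which matches the reading that the source-calibrated threshold is conservative on the target.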

Source artifacts

  • artifacts/verify_2026_04_24/thresholded_metrics.py — per-file metric tool.
  • artifacts/verify_2026_04_24/aggregate_thresholded.py — this aggregator.
  • Within: artifacts/phase1_2026_04_25/cicddos2019_lambda0p3_seed*/thresholded_metrics.json (computed from existing phase1_scores.npz).
  • Cross: artifacts/phase25_sigma06_cross_2026_04_25/with_scores/thresholded_seed*.json (raw scores re-saved by patched eval_phase2_cross_cicddos2019.py).
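The mean ± std aggregation over seeds can be sketched as below. This is not the code in aggregate_thresholded.py: the metric keys, the per-seed values, and the choice of population standard deviation (`ddof=0`) are assumptions for illustration, and the real JSON schema may differ.

```python
import numpy as np

def aggregate_seeds(per_seed, ddof=0):
    """Return {metric: (mean, std)} across a list of per-seed metric dicts.

    Assumption: every dict has the same flat float-valued keys.
    ddof=0 (population std) is a guess; the actual tool may use ddof=1.
    """
    out = {}
    for key in sorted(per_seed[0]):
        vals = np.array([m[key] for m in per_seed], dtype=float)
        out[key] = (float(vals.mean()), float(vals.std(ddof=ddof)))
    return out

def fmt(mean, std):
    """Format a cell the way the tables do: four decimals, mean ± std."""
    return f"{mean:.4f} ± {std:.4f}"

# Made-up per-seed values (hypothetical keys, not the JANUS results).
seeds = [
    {"f1_p95": 0.60, "fpr_p95": 0.044},
    {"f1_p95": 0.65, "fpr_p95": 0.045},
    {"f1_p95": 0.70, "fpr_p95": 0.043},
]
agg = aggregate_seeds(seeds)
for name, (mean, std) in agg.items():
    print(name, fmt(mean, std))
```

With only three seeds the choice of `ddof` matters: the sample std (`ddof=1`) is about 22% larger than the population std, so it is worth stating which one the tables use.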