diff --git a/README.md b/README.md
index d98c2f2..61e1828 100644
--- a/README.md
+++ b/README.md
@@ -29,15 +29,15 @@ JANUS is the first NIDS method to use Flow Matching as the training paradigm in
 | MFAD | — | 86.02 ± 0.8 † | — | — | — |
 | STFPM | BMVC'21 | 86.29 ± 1.7 † | — | — | — |
 | MMR | — | 89.26 ± 1.2 † | — | — | — |
-| Shafir NF + Shapley | arXiv'26 | 93.03 ‡ | 93.00 ‡ | 99.51 §‡ | 87.31 ‡ |
+| Shafir NF + Shapley | arXiv'26 | 93.03 ‡ | 93.00 ‡ | 72.24 ± 6.08 ★ | 87.31 ‡ |
 | ConMD | TIFS'26 | 94.43 ± 0.1 † | — | — | — |
 | **JANUS (ours)** | — | **98.26 ± 0.35** | **99.18 ± 0.05** | **95.90 ± 0.22** | **99.09 ± 0.13** |
 
 † Numbers from ConMD (TIFS'26) Table I; protocol = train 10 K benign / test 5 K + 5 K balanced; 5-seed mean ± std.
-‡ Numbers from Shafir et al. (arXiv'26); protocol = train 10 K benign / SHAP-selected feature subsets per dataset.
-§ Metric mismatch on CIC-IoT2023: Shafir reports F1 = 99.51 (Youden's-J threshold tuned with attack labels), we report AUROC = 95.90 (threshold-free); not directly comparable. Thresholded F1 for JANUS is reported in `RESULTS.md` Section D and `artifacts/route_comparison/THRESHOLDED.md`.
+‡ Numbers from Shafir et al. (arXiv'26) headline tables; protocol = train 10 K benign / SHAP-selected feature subsets per dataset (single NF).
+★ Reproduced by us (3-seed mean ± std, 2-NF ensemble, CSV pipeline, paper-specified 5-feature SHAP subset). Shafir et al. do not report an AUROC for CIC-IoT2023, only F1 = 99.51 with a Youden's-J threshold tuned on attack labels (a thresholded protocol that is not comparable to AUROC). For a threshold-free, head-to-head AUROC comparison on this dataset, we cite our reproduction.
 
-JANUS sets new SOTA on **3/3 directly comparable benchmarks** (CIC-IDS2017 +3.83, CIC-DDoS2019 +6.18, ISCXTor2016 +11.78) — all margins outside seed std. JANUS is fully unsupervised (benign-only training, no attack labels at any stage), and uses the Mahalanobis-OAS aggregator over its 10-d raw score vector with parameters fit on benign val only.
+JANUS sets a new SOTA on **4/4 within-dataset benchmarks** under a matched AUROC protocol: CIC-IDS2017 **+3.83**, CIC-DDoS2019 **+6.18**, CIC-IoT2023 **+23.66** (vs. our reproduction of Shafir), and ISCXTor2016 **+11.78**, with all margins outside seed std. JANUS is fully unsupervised (benign-only training, no attack labels at any stage) and uses the Mahalanobis-OAS aggregator over its 10-d raw score vector, with parameters fit on the benign validation split only. Thresholded F1 metrics for JANUS across all four datasets are in `RESULTS.md` Section D and `artifacts/route_comparison/THRESHOLDED.md`.
 
 ### 3×3 cross-dataset transfer matrix