Compare commits

2 Commits

Author SHA1 Message Date
c5afd8c90f untrack CLAUDE.md (now gitignored)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 08:43:58 +08:00
4263fa8807 README: slim public-facing sections; gitignore CLAUDE.md
Trim README down to results/quickstart by removing Layout, Data contract,
Python environment, and Authoritative documents sections (these now live
in CLAUDE.md). Add CLAUDE.md to .gitignore so it stays as private dev
notes rather than committed docs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 08:42:51 +08:00
3 changed files with 6 additions and 248 deletions

2
.gitignore vendored

@@ -31,3 +31,5 @@ Thumbs.db
/janus_figures_*/
*.tmp
CLAUDE.md

172
CLAUDE.md

@@ -1,172 +0,0 @@
# CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
## Repo shape
This is a **workspace-style repo with three sibling model packages** plus a
shared data contract. The root intentionally keeps only workspace-level
files; all model/training/eval code lives under one of the three packages.
- `common/data_contract.py`: **single source of truth** for the canonical
9-d packet schema (`PACKET_FEATURE_NAMES`) and 20-d packet-derived flow
schema (`CANONICAL_FLOW_FEATURE_NAMES`), label normalization, canonical
5-tuple, packet preprocessing helpers, and `compute_flow_features_from_packets`.
All three packages import from here.
- `Packet_CFM/` — packet-sequence OT-CFM with explicit σ-band benign
distribution learning. Has its own `CLAUDE.md` for internal details.
- `Flow_CFM/` — flow-level CFM on the workspace-canonical 20-d packet-derived
`flow_features.parquet`. Legacy 61-d CICFlowMeter CSV caches are still
available only for reproduction via the `--legacy-csv-features` flag.
- `Unified_CFM/`: **current SOTA model**. Unified token CFM over
`[FLOW_TOKEN, PACKET_1, ..., PACKET_T]` with masked-prediction consistency
loss (Phase 2). All within-dataset SOTAs (ISCXTor2016 / CICIDS2017 /
CICDDoS2019) come from here.
- `scripts/`: **workspace-level** scripts shared across all packages:
- `download/` — UNB/CIC dataset downloaders (Token-cookie + `cic_download.py`
recursive crawler). See `scripts/download/README.md` before touching.
- `extract_<dataset>.py` + `extract_lib.py` — pcap→artifact drivers that write
`datasets/<name>/processed/{packets.npz, flows.parquet, flow_features.parquet}`,
all row-aligned by `flow_id = arange(N)`.
- `generate_flow_features.py` — one-shot tool to upgrade an existing
`packets.npz` + `flows.parquet` pair to a canonical `flow_features.parquet`
without re-extracting pcap. Supports `--source-store` for sharded stores.
- `csv_adapter.py`, `convert_npz_splits_to_store.py`, `eval_cross_dataset_protocol.py`,
`merge_*.py`, `auto_transfer_*.sh` — cross-package tooling.
- `datasets/<name>/raw/` and `datasets/<name>/processed/` — shared dataset store.
- `artifacts/{runs,phase0_*,phase1_*,phase25_*,verify_*}/` — **all outputs go
here**, not `runs/` at root. Phase summary reports live in `artifacts/phase*/`.
- `paper/` — paper PDFs we compare against (Shafir 2026 NF, ConMD 2026,
TIPSO-GAN 2026, Lipman 2210.02747).
There is no `archive_v1/` at root; old flow-stat v1 code has been removed.
`Flow_CFM/checkpoints_archive/` retains historical checkpoints for reproduction.
## Data contract (read this before touching data code)
Every processed dataset under `datasets/<name>/processed/` ships an aligned
triple, all with the same row order (`flow_id = arange(N)`):
```
packets.npz # packet_tokens [N, T_full, 9], packet_lengths [N], flow_id [N]
# OR full_store/ (PacketShardStore directory) for large datasets
flows.parquet # flow_id + label + 5-tuple metadata (src_ip, dst_ip, ports, protocol)
flow_features.parquet # flow_id + label + 20 canonical packet-derived features
```
Optional / legacy:
- `flow_features_csv.parquet` — Flow_CFM's 61-d CICFlowMeter cache (paper
reproduction only; not row-aligned with packets in general)
The 20 canonical flow features are computed by
`common.data_contract.compute_flow_features_from_packets(packet_tokens, lens)`
and cover Shafir 2026's top-SHAP categories (size/IAT/active-idle/rate/flags)
in a packet-derivable way.
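The row-alignment invariant (`flow_id = arange(N)` in every artifact) can be checked mechanically. The following is a minimal sketch, assuming the `flow_id` column of each artifact has already been loaded into a plain sequence; `check_row_alignment` is a hypothetical helper for illustration, not a function from `common/data_contract.py`.

```python
from typing import Iterable, Mapping


def check_row_alignment(flow_ids_by_artifact: Mapping[str, Iterable[int]]) -> int:
    """Return N if every artifact's flow_id column equals arange(N); raise otherwise.

    Hypothetical helper: the keys are artifact names used only in error messages.
    """
    n = None
    for name, ids in flow_ids_by_artifact.items():
        ids = [int(i) for i in ids]
        if n is None:
            n = len(ids)  # the first artifact fixes the expected row count
        if len(ids) != n:
            raise ValueError(f"{name}: {len(ids)} rows, expected {n}")
        if ids != list(range(n)):
            raise ValueError(f"{name}: flow_id is not arange({n})")
    if n is None:
        raise ValueError("no artifacts given")
    return n
```

In practice the three sequences would come from `packets.npz`, `flows.parquet`, and `flow_features.parquet`; how they are loaded is left out here.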
## Python env
- `requires-python = ">=3.14"`; PyTorch pinned to the `pytorch-cu128` index
(`torch>=2.9.1`), plus `mamba-ssm`, `causal-conv1d`, `scapy`, `dpkt`, `pyarrow`.
- Two `pyproject.toml` files: root (`/pyproject.toml`) and `Packet_CFM/pyproject.toml`.
They are **not declared as a uv workspace** — each resolves independently.
Run `uv run ...` from whichever directory owns the entry point you are invoking.
- `Flow_CFM/` and `Unified_CFM/` have no `pyproject.toml`; they use the root
venv (`uv run --no-sync python <script.py>`).
- Scripts under `scripts/download/` are pure stdlib — invoke with `python3`.
## Running things
**Unified_CFM** (SOTA model, run from `Unified_CFM/`):
```bash
cd Unified_CFM
uv run --no-sync python train.py --config configs/cicids2017_baseline.yaml
# Phase 2 with consistency loss:
uv run --no-sync python train.py --config configs/cicids2017_consistency.yaml
```
Best hyperparameters from the σ × λ sweeps:
- `lambda_flow = lambda_packet = 0.3`
- `sigma = 0.6` for cross-dataset transfer
- `sigma = 0.1` is fine for within-dataset (and marginally better on ISCXTor2016)
**Phase 1 / 2 evaluation**:
```bash
# Per-attack-class AUROC over 34 scores (terminal_norm primary, plus curvature,
# Jacobian-Hutchinson, time-profile velocity, flow_consistency diagnostics).
uv run --no-sync python artifacts/verify_2026_04_24/eval_phase1_unified.py \
--model-dir <model_dir> --out-dir <eval_dir> \
--batch-size 256 --jacobian-n-eps 4 \
--n-val-cap 10000 --n-atk-cap 30000
# Cross-dataset CICIDS2017 → CICDDoS2019:
uv run --no-sync python artifacts/verify_2026_04_24/eval_phase2_cross_cicddos2019.py \
--model-dir <model_dir> --out <result.json> \
--n-benign 10000 --n-attack 10000 --seed 42
```
**Packet_CFM entry points** (run from `Packet_CFM/`):
```bash
cd Packet_CFM
uv run python -m train --config configs/n10k.yaml
uv run python -m detect --save-dir ../artifacts/runs/<run>
uv run python -m eval.per_class --save-dir ../artifacts/runs/<run>
uv run python -m run_phase1 --sigmas 0.0 0.1 0.2 0.3
```
**Flow_CFM entry points** (run from `Flow_CFM/`): see `Flow_CFM/README_migration.md`.
**Tests**:
```bash
uv run --no-sync python -m pytest Packet_CFM/tests/ tests/common/ Unified_CFM/tests/
```
(43 passing — common data contract + Unified_CFM Phase 1/2 score functions
+ Packet_CFM existing tests.)
## Adding a new dataset
Write one driver at `scripts/extract_<name>.py` that calls
`extract_lib.extract_dataset(...)` (see `scripts/extract_cicids2017.py` as
the reference template). The driver hardcodes CSV column names, timestamp
formats, benign aliases, and drop patterns as module constants, then feeds
`extract_lib` a per-day `(canonical_key → [(row_idx, ts_epoch)])` mapping
and a per-day pcap file map. No YAML is needed.
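The per-day mapping described above could be assembled along these lines. This is a sketch only: `canonical_key` here is a hypothetical direction-agnostic 5-tuple, the input row layout is assumed, and the real canonicalization in `common/data_contract.py` may differ.

```python
from collections import defaultdict


def canonical_key(src_ip, src_port, dst_ip, dst_port, protocol):
    """Hypothetical canonical 5-tuple: sort the two endpoints so that both
    directions of the same flow map to one key."""
    a, b = sorted([(src_ip, int(src_port)), (dst_ip, int(dst_port))])
    return (a[0], a[1], b[0], b[1], int(protocol))


def build_day_mapping(rows):
    """Group CSV rows of one day by canonical key.

    rows: iterable of (row_idx, ts_epoch, src_ip, src_port, dst_ip, dst_port, protocol)
    returns: {canonical_key: [(row_idx, ts_epoch), ...]} sorted by timestamp
    """
    mapping = defaultdict(list)
    for row_idx, ts_epoch, *five_tuple in rows:
        mapping[canonical_key(*five_tuple)].append((row_idx, float(ts_epoch)))
    for entries in mapping.values():
        entries.sort(key=lambda entry: entry[1])
    return dict(mapping)
```

Sorting each key's entries by timestamp is a design choice for the sketch; it makes matching against time-ordered pcap packets simpler.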
The extract pipeline writes all three artifacts (packets.npz, flows.parquet,
flow_features.parquet) row-aligned. To upgrade an existing artifact pair
that lacks `flow_features.parquet`, run
`scripts/generate_flow_features.py --packets-npz ... --flows-parquet ... --out ...`
(or `--source-store` for sharded stores).
Common gotcha: if CSV timestamps and pcap epochs are in different time zones,
`extract_lib` prints a diagnostic with the recommended `--time-offset`; rerun
with that value.
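To illustrate what such a diagnostic might compute, here is a minimal sketch that estimates a time-zone offset from matched timestamp pairs. It assumes you already have pairs of CSV and pcap timestamps (in seconds) for the same flows; `estimate_time_offset`, its sign convention (pcap minus CSV), and the 30-minute rounding step are all assumptions for illustration, not the actual logic of `extract_lib`.

```python
from statistics import median


def estimate_time_offset(csv_ts, pcap_ts, step=1800):
    """Estimate the offset (seconds) to add to CSV timestamps to match pcap epochs.

    Takes the median pairwise difference and rounds it to the nearest `step`
    seconds (30 minutes by default, since some time zones use half-hour offsets).
    """
    if len(csv_ts) != len(pcap_ts) or not csv_ts:
        raise ValueError("need equally sized, non-empty timestamp sequences")
    raw = median(p - c for c, p in zip(csv_ts, pcap_ts))
    return int(round(raw / step)) * step
```

The median keeps a few mismatched pairs from skewing the estimate, and the rounding removes per-flow jitter of a few seconds.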
## Conventions worth preserving
- Do not create a new `runs/` at repo root — outputs belong under `artifacts/`.
- `scripts/download/` stays at the root (shared by all packages).
- When adding new cross-package tooling, put it in root `scripts/`. Only move
it into `Packet_CFM/scripts/` if it depends on that package's imports.
- Phase reports live in `artifacts/phase*/` — keep the timestamp suffix
(`_2026_04_25`) so future runs don't overwrite history.
- The 9-d packet schema and 20-d canonical flow schema are FIXED in
`common/data_contract.py`. Do not extend them ad-hoc; if you need new
features, propose them with evidence (Shafir-style SHAP analysis or
Phase 1-style per-attack ablation).
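The timestamp-suffix convention from the list above can be enforced with a small helper. This is a sketch under assumptions: `phase_report_dir` is a hypothetical name, and the `<phase>_YYYY_MM_DD` layout is inferred from the `_2026_04_25` suffix mentioned above.

```python
from datetime import date
from pathlib import Path


def phase_report_dir(root, phase, day=None):
    """Build `<root>/<phase>_YYYY_MM_DD` and refuse to reuse an existing directory.

    Hypothetical helper: it only returns the path; creating it is left to the caller.
    """
    day = day or date.today()
    out = Path(root) / f"{phase}_{day:%Y_%m_%d}"
    if out.exists():
        raise FileExistsError(f"{out} already exists; keep history, pick a new suffix")
    return out
```

For example, `phase_report_dir("artifacts", "phase1", date(2026, 4, 25))` yields the path `artifacts/phase1_2026_04_25`, provided nothing exists there yet.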
## Current state of the work (2026-04-25)
- Phase 0 baselines + Shafir-protocol verification: ✓
- Phase 1 (34-score expansion + per-attack-class table): ✓
- Phase 2 (masked-prediction consistency loss): ✓ — multi-seed at λ=0.3
- Phase 2.5 (σ × λ sweep + multi-seed at σ=0.6): ✓
- Cross-dataset multi-seed: ✓ — also SOTA after baseline lock
- Shafir baselines locked from PDF: ✓ — `artifacts/locked_baselines.md`
- 4 of 4 reported tasks beat Shafir SOTA (final table: `RESULTS.md`)
- Architecture is finalized; remaining work is paper writing
(P1 skeleton, P2 thresholded F1/Precision/Recall metrics).


@@ -1,6 +1,6 @@
# JANUS
**JANUS** (Joint Anomaly via Normalizing-flows of Unified States) — flow-matching unsupervised network anomaly detection over packet sequences.
**JANUS** — flow-matching unsupervised network anomaly detection over packet sequences.
JANUS is a packet-causal Transformer with **two output heads on a shared backbone**:
@@ -37,7 +37,9 @@ JANUS is the first NIDS method to use Flow Matching as the training paradigm in
‡ Numbers from Shafir et al. (arXiv'26) headline tables; protocol = train 10 K benign / SHAP-selected feature subsets per dataset (single NF).
★ Reproduced by us (3-seed mean ± std, 2-NF ensemble, CSV pipeline, paper-specified 5-feat SHAP subset). Shafir's paper does not publish an AUROC for CIC-IoT2023 — only F1 = 99.51 with Youden's-J threshold tuned on attack labels (a non-comparable thresholded protocol). For threshold-free head-to-head AUROC on this dataset we cite our reproduction.
JANUS sets new SOTA on **4/4 within-dataset benchmarks** under matched AUROC protocol — CIC-IDS2017 **+3.83**, CIC-DDoS2019 **+6.18**, CIC-IoT2023 **+23.66** (vs reproduced Shafir), ISCXTor2016 **+11.78** — all margins outside seed std. JANUS is fully unsupervised (benign-only training, no attack labels at any stage) and uses the Mahalanobis-OAS aggregator over its 10-d raw score vector with parameters fit on benign val only. Thresholded F1 metrics for JANUS across all four datasets are in `RESULTS.md` Section D and `artifacts/route_comparison/THRESHOLDED.md`.
JANUS is fully unsupervised (benign-only training, no attack labels at any stage) and uses the Mahalanobis-OAS aggregator over its 10-d raw score vector with parameters fit on benign val only.
Thresholded F1 metrics for JANUS across all four datasets are in `RESULTS.md` Section D.
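As an illustration of the aggregation step, the sketch below fits a Mahalanobis scorer on benign validation scores only and applies it to new score vectors. It is not the JANUS implementation: a fixed shrinkage coefficient stands in for the OAS estimator (which picks the coefficient analytically and is available as `sklearn.covariance.OAS`), and the function names are hypothetical.

```python
import numpy as np


def fit_benign_aggregator(benign_scores, shrinkage=0.1):
    """Fit mean and shrunk precision matrix on benign score vectors [N, D], D >= 2.

    Stand-in for OAS: the covariance is shrunk toward a scaled identity with a
    fixed, user-chosen coefficient instead of the analytically derived one.
    """
    x = np.asarray(benign_scores, dtype=float)
    mean = x.mean(axis=0)
    cov = np.cov(x, rowvar=False)
    dim = cov.shape[0]
    target = np.trace(cov) / dim * np.eye(dim)
    shrunk = (1.0 - shrinkage) * cov + shrinkage * target
    return mean, np.linalg.inv(shrunk)


def mahalanobis_score(scores, mean, precision):
    """Return one Mahalanobis distance per row of `scores` [M, D]."""
    diff = np.asarray(scores, dtype=float) - mean
    return np.sqrt(np.einsum("ij,jk,ik->i", diff, precision, diff))
```

Because both the mean and the covariance come from benign data alone, no attack labels enter the aggregator; larger distances indicate score vectors that look less like benign traffic.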
### 3×3 cross-dataset transfer matrix
@@ -49,8 +51,6 @@ Source (rows) trained on 10K benign of source dataset; target (columns) tested o
| **CICDDoS19** | 0.9413 ± 0.0212 | _0.9918 ± 0.0005_ | 0.8767 ± 0.0068 |
| **CICIoT23** | 0.9394 ± 0.0063 | 0.9030 ± 0.0075 | _0.9590 ± 0.0022_ |
Forward CICIDS17→CICDDoS19 (0.969) beats Shafir 0.89 by **+0.08**; reverse CICDDoS19→CICIDS17 (0.941) approximately matches Shafir 0.93. CICIoT23 is hardest both as source and target — its IoT-protocol diversity makes the "benign of source ≈ benign of target" assumption brittle. Full table at `artifacts/route_comparison/CROSS_MATRIX_3x3.md`.
### Ablations (architecture & aggregator)
Two orthogonal ablation axes, each evaluated **within-dataset** (4 datasets × 3 seeds) **and** **cross-dataset** (3×3 transfer × 3 seeds):
@@ -73,65 +73,6 @@ Three ablations (B3 / B5 / A-aggregator) **marginally beat JANUS-full at within-
Full headline summary: `artifacts/ablation/ABLATION_SUMMARY.md`. Per-variant 3×3 cross matrices: `artifacts/ablation/ABLATION_CROSS_B_full.md` and `artifacts/ablation/ABLATION_TABLE_CROSS_full.md`.
## Layout
```
common/ Data contract — single source of truth for the
9-d packet schema, 20-d packet-derived flow schema,
label normalization, and packet preprocessing.
Mixed_CFM/ The JANUS model. Mixed continuous-discrete CFM
with two output heads on a shared causal Transformer.
configs/ Per-(dataset × seed) training configs.
model.py MixedTokenCFM + MixedVelocity.
train.py / eval_phase1.py / eval_cross.py
Unified_CFM/ Legacy unified token CFM. Mixed_CFM imports its
AdaLNBlock + sinusoidal time embedding for backbone
reuse. Kept as internal ablation reference.
scripts/ Workspace-level pcap → artifact pipeline,
CSV adapters, cross-package eval tooling.
download/ UNB/CIC dataset downloaders.
baselines/ Third-party baseline runners (Kitsune, Shafir-NF,
Anomaly-Transformer).
aggregate/ Mahalanobis-OAS score-router + cross-matrix
orchestration. aggregate_score_router.py is the
deployable score path; run_cross_3x3.sh +
cross_3x3_table.py produce the cross matrix.
aggregate_ablation.py / aggregate_ablation_cross.py /
aggregate_ablation_cross_B.py produce the ablation
tables in artifacts/ablation/.
ablation/ B-group ablation training/eval drivers
(generate_configs.py, run_groupB.sh,
run_cross_groupB.sh).
tests/ Data-contract unit tests.
```
The following directories are **gitignored** (live on the dev box, not in the repo):
```
artifacts/ All run outputs (checkpoints, eval JSONs, score
npzs, figures). Per-(dataset × seed) model dirs at
artifacts/route_comparison/janus_<ds>_seed<N>/.
datasets/ Raw + processed datasets (~1 TB).
baselines/ Third-party baseline forks (Kitsune-py,
Anomaly-Transformer, ConMD, ganomaly, TIPSO-GAN, ...).
paper/ Paper sources & external PDFs (Shafir 2026, Lipman
2210.02747, etc.).
.venv/ uv-managed Python 3.14 virtual env.
```
## Data contract
Every processed dataset under `datasets/<name>/processed/` ships an aligned triple, all with the same row order (`flow_id = arange(N)`):
```
packets.npz packet_tokens [N, T_full, 9], packet_lengths [N], flow_id [N]
(or full_store/ — sharded PacketShardStore — for large datasets)
flows.parquet flow_id + label + 5-tuple metadata (src_ip, dst_ip, ports, protocol)
flow_features.parquet flow_id + label + 20 canonical packet-derived features
```
The 9-d packet schema and 20-d flow schema are FIXED in `common/data_contract.py`. Flow features are computed by `compute_flow_features_from_packets(packet_tokens, lens)` so row alignment is guaranteed.
## Quick start
```bash
@@ -202,16 +143,3 @@ Write one driver at `scripts/extract_<name>.py` that calls `extract_lib.extract_
To upgrade an existing artifact pair that lacks `flow_features.parquet`, run `scripts/generate_flow_features.py --packets-npz ... --flows-parquet ... --out ...` (or `--source-store` for sharded stores).
Common gotcha: if CSV timestamps and pcap epochs are in different time zones, `extract_lib` prints a diagnostic with the recommended `--time-offset`; rerun with that value.
## Authoritative documents
- `RESULTS.md` — full headline tables, per-attack analysis, JANUS configuration, thresholded operating-point metrics, what the experiments proved / disproved.
- `artifacts/ablation/ABLATION_SUMMARY.md` — paper-facing ablation summary (Group A aggregator + Group B architecture, both within and cross views).
- `Mixed_CFM/model.py` and `common/data_contract.py` — model + data-contract source of truth.
## Python environment
- `requires-python = ">=3.14"`; PyTorch pinned to the `pytorch-cu128` index, plus `mamba-ssm`, `causal-conv1d`, `scapy`, `dpkt`, `pyarrow`, `sklearn` (for the OAS aggregator).
- Two `pyproject.toml` files exist: root and `Mixed_CFM/`; they are not declared as a uv workspace and resolve independently. Run `uv run ...` from whichever directory owns the entry point.
- `Unified_CFM/` has no `pyproject.toml`; it uses the root venv (`uv run --no-sync python <script.py>`).
- Scripts under `scripts/download/` are pure stdlib — invoke with `python3`.