# CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
## Repo shape
This is a **workspace-style repo with three sibling model packages** plus a
shared data contract. The root intentionally keeps only workspace-level
files; all model/training/eval code lives under one of the three packages.
- `common/data_contract.py` is the **single source of truth** for the canonical
9-d packet schema (`PACKET_FEATURE_NAMES`) and 20-d packet-derived flow
schema (`CANONICAL_FLOW_FEATURE_NAMES`), label normalization, canonical
5-tuple, packet preprocessing helpers, and `compute_flow_features_from_packets`.
All three packages import from here.
- `Packet_CFM/` — packet-sequence OT-CFM with explicit σ-band benign
distribution learning. Has its own `CLAUDE.md` for internal details.
- `Flow_CFM/` — flow-level CFM on the workspace-canonical 20-d packet-derived
`flow_features.parquet`. Legacy 61-d CICFlowMeter CSV caches are still
available only for reproduction via the `--legacy-csv-features` flag.
- `Unified_CFM/` is the **current SOTA model**. Unified token CFM over
`[FLOW_TOKEN, PACKET_1, ..., PACKET_T]` with masked-prediction consistency
loss (Phase 2). All within-dataset SOTAs (ISCXTor2016 / CICIDS2017 /
CICDDoS2019) come from here.
- `scripts/` holds **workspace-level** scripts shared across all packages:
  - `download/`: UNB/CIC dataset downloaders (Token-cookie + `cic_download.py`
    recursive crawler). See `scripts/download/README.md` before touching.
  - `extract_<dataset>.py` + `extract_lib.py`: pcap→artifact drivers that write
    `datasets/<name>/processed/{packets.npz, flows.parquet, flow_features.parquet}`,
    all row-aligned by `flow_id = arange(N)`.
  - `generate_flow_features.py`: one-shot tool to upgrade an existing
    `packets.npz` + `flows.parquet` pair to a canonical `flow_features.parquet`
    without re-extracting pcap. Supports `--source-store` for sharded stores.
  - `csv_adapter.py`, `convert_npz_splits_to_store.py`, `eval_cross_dataset_protocol.py`,
    `merge_*.py`, `auto_transfer_*.sh`: cross-package tooling.
- `datasets/<name>/raw/` and `datasets/<name>/processed/` — shared dataset store.
- `artifacts/{runs,phase0_*,phase1_*,phase25_*,verify_*}/` — **all outputs go
here**, not `runs/` at root. Phase summary reports live in `artifacts/phase*/`.
- `paper/` — paper PDFs we compare against (Shafir 2026 NF, ConMD 2026,
TIPSO-GAN 2026, Lipman 2210.02747).
There is no `archive_v1/` at root; old flow-stat v1 code has been removed.
`Flow_CFM/checkpoints_archive/` retains historical checkpoints for reproduction.
## Data contract (read this before touching data code)
Every processed dataset under `datasets/<name>/processed/` ships an aligned
triple, all with the same row order (`flow_id = arange(N)`):
```
packets.npz # packet_tokens [N, T_full, 9], packet_lengths [N], flow_id [N]
# OR full_store/ (PacketShardStore directory) for large datasets
flows.parquet # flow_id + label + 5-tuple metadata (src_ip, dst_ip, ports, protocol)
flow_features.parquet # flow_id + label + 20 canonical packet-derived features
```
Optional / legacy:
- `flow_features_csv.parquet` — Flow_CFM's 61-d CICFlowMeter cache (paper
reproduction only; not row-aligned with packets in general)
The 20 canonical flow features are computed by
`common.data_contract.compute_flow_features_from_packets(packet_tokens, lens)`
and cover Shafir 2026's top-SHAP categories (size/IAT/active-idle/rate/flags)
in a packet-derivable way.
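The row-alignment guarantee described in this section can be checked mechanically. The following is a minimal sketch, not code from the repository: `check_row_alignment` is a hypothetical helper that takes already-loaded `flow_id` columns as plain sequences (in practice they would be read from `packets.npz` and the two parquet files) and verifies that each one equals `0..N-1`.

```python
from typing import Mapping, Sequence


def check_row_alignment(id_columns: Mapping[str, Sequence[int]]) -> int:
    """Return N if every artifact's flow_id column is exactly 0..N-1.

    `id_columns` maps an artifact name (used only in error messages) to its
    flow_id column. This helper is illustrative and not part of the repo.
    """
    if not id_columns:
        raise ValueError("no artifacts given")

    # All artifacts must have the same number of rows.
    lengths = {name: len(ids) for name, ids in id_columns.items()}
    if len(set(lengths.values())) != 1:
        raise ValueError(f"row counts differ: {lengths}")
    n = next(iter(lengths.values()))

    # Each column must match arange(N) position by position.
    for name, ids in id_columns.items():
        for expected, actual in enumerate(ids):
            if int(actual) != expected:
                raise ValueError(
                    f"{name}: row {expected} has flow_id {actual}"
                )
    return n


n_rows = check_row_alignment({
    "packets.npz": [0, 1, 2],
    "flows.parquet": [0, 1, 2],
    "flow_features.parquet": [0, 1, 2],
})
```

Running a check of this kind after extraction, or after regenerating `flow_features.parquet`, is a cheap way to catch a reordered or truncated artifact before training.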
## Python env
- `requires-python = ">=3.14"`; PyTorch pinned to the `pytorch-cu128` index
(`torch>=2.9.1`), plus `mamba-ssm`, `causal-conv1d`, `scapy`, `dpkt`, `pyarrow`.
- Two `pyproject.toml` files: root (`/pyproject.toml`) and `Packet_CFM/pyproject.toml`.
They are **not declared as a uv workspace** — each resolves independently.
Run `uv run ...` from whichever directory owns the entry point you are invoking.
- `Flow_CFM/` and `Unified_CFM/` have no `pyproject.toml`; they use the root
venv (`uv run --no-sync python <script.py>`).
- Scripts under `scripts/download/` are pure stdlib — invoke with `python3`.
## Running things
**Unified_CFM** (SOTA model, run from `Unified_CFM/`):
```bash
cd Unified_CFM
uv run --no-sync python train.py --config configs/cicids2017_baseline.yaml
# Phase 2 with consistency loss:
uv run --no-sync python train.py --config configs/cicids2017_consistency.yaml
```
Best hyperparameters from the σ × λ sweeps:
- `lambda_flow = lambda_packet = 0.3`
- `sigma = 0.6` for cross-dataset transfer
- `sigma = 0.1` is sufficient for within-dataset evaluation (and marginally better on ISCXTor2016)
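As a compact restatement of the sweep results, the sketch below picks values by evaluation setting. `pick_hyperparameters` is a hypothetical helper, and the assumption that the YAML configs use exactly these key names is not confirmed by this file.

```python
def pick_hyperparameters(cross_dataset: bool) -> dict[str, float]:
    """Return the sweep-selected values for the given evaluation setting.

    Key names mirror the bullets in this section; whether the config files
    use the same keys is an assumption.
    """
    return {
        "lambda_flow": 0.3,
        "lambda_packet": 0.3,
        # sigma = 0.6 for cross-dataset transfer, 0.1 within a dataset.
        "sigma": 0.6 if cross_dataset else 0.1,
    }


within = pick_hyperparameters(cross_dataset=False)
transfer = pick_hyperparameters(cross_dataset=True)
```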
**Phase 1 / 2 evaluation**:
```bash
# Per-attack-class AUROC over 34 scores (terminal_norm primary, plus curvature,
# Jacobian-Hutchinson, time-profile velocity, flow_consistency diagnostics).
uv run --no-sync python artifacts/verify_2026_04_24/eval_phase1_unified.py \
--model-dir <model_dir> --out-dir <eval_dir> \
--batch-size 256 --jacobian-n-eps 4 \
--n-val-cap 10000 --n-atk-cap 30000
# Cross-dataset CICIDS2017 → CICDDoS2019:
uv run --no-sync python artifacts/verify_2026_04_24/eval_phase2_cross_cicddos2019.py \
--model-dir <model_dir> --out <result.json> \
--n-benign 10000 --n-attack 10000 --seed 42
```
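For orientation, per-attack-class AUROC can be computed by comparing each attack class's scores against the benign scores. The sketch below is a plain-Python illustration of that metric, not the logic of `eval_phase1_unified.py`; it assumes that higher scores mean "more anomalous", and the function names and class labels are hypothetical.

```python
def auroc(positive_scores, negative_scores):
    """AUROC as the probability that a positive outscores a negative.

    Ties count as half a win (Mann-Whitney U formulation). O(P*N), which is
    fine for a sketch but not for large evaluation sets.
    """
    if not positive_scores or not negative_scores:
        raise ValueError("need at least one score on each side")
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


def per_class_auroc(benign_scores, attack_scores_by_class):
    """Score each attack class separately against the same benign set."""
    return {
        name: auroc(scores, benign_scores)
        for name, scores in attack_scores_by_class.items()
    }


table = per_class_auroc(
    benign_scores=[0.1, 0.85],
    attack_scores_by_class={"class_a": [0.9, 0.8], "class_b": [0.05, 0.05]},
)
# table == {"class_a": 0.75, "class_b": 0.0}
```

Reporting one AUROC per attack class, rather than one pooled number, keeps a weak class from hiding behind a large, easy one.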
**Packet_CFM entry points** (run from `Packet_CFM/`):
```bash
cd Packet_CFM
uv run python -m train --config configs/n10k.yaml
uv run python -m detect --save-dir ../artifacts/runs/<run>
uv run python -m eval.per_class --save-dir ../artifacts/runs/<run>
uv run python -m run_phase1 --sigmas 0.0 0.1 0.2 0.3
```
**Flow_CFM entry points** (run from `Flow_CFM/`): see `Flow_CFM/README_migration.md`.
**Tests**:
```bash
uv run --no-sync python -m pytest Packet_CFM/tests/ tests/common/ Unified_CFM/tests/
```
(43 passing — common data contract + Unified_CFM Phase 1/2 score functions
+ Packet_CFM existing tests.)
## Adding a new dataset
Write one driver at `scripts/extract_<name>.py` that calls
`extract_lib.extract_dataset(...)` (see `scripts/extract_cicids2017.py` as
the reference template). The driver hardcodes CSV column names, timestamp
formats, benign aliases, and drop patterns as module constants, then feeds
`extract_lib` a per-day `(canonical_key → [(row_idx, ts_epoch)])` mapping
and a per-day pcap file map. No YAML is needed.
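The per-day mapping described above can be sketched as follows. This is an illustration under stated assumptions, not the reference driver: the column names, the timestamp format, the treatment of CSV timestamps as UTC, and the direction-independent form of `canonical_key` are all hypothetical, and a real driver should use the canonical 5-tuple helper from `common/data_contract.py`.

```python
import csv
import io
from collections import defaultdict
from datetime import datetime, timezone

# Hypothetical module constants; a real driver hardcodes its dataset's own.
TIMESTAMP_COLUMN = "Timestamp"
TIMESTAMP_FORMAT = "%d/%m/%Y %H:%M:%S"
TUPLE_COLUMNS = (
    "Source IP", "Source Port", "Destination IP", "Destination Port", "Protocol",
)


def canonical_key(src_ip, src_port, dst_ip, dst_port, protocol):
    """Direction-independent 5-tuple (assumed form, not the repo's helper)."""
    a, b = sorted([(src_ip, int(src_port)), (dst_ip, int(dst_port))])
    return (a[0], a[1], b[0], b[1], int(protocol))


def build_day_mapping(csv_text):
    """Map canonical_key -> [(row_idx, ts_epoch)] for one day's CSV."""
    mapping = defaultdict(list)
    reader = csv.DictReader(io.StringIO(csv_text))
    for row_idx, row in enumerate(reader):
        ts = datetime.strptime(row[TIMESTAMP_COLUMN], TIMESTAMP_FORMAT)
        # Assumption: CSV timestamps are UTC; adjust if the dataset differs.
        ts_epoch = ts.replace(tzinfo=timezone.utc).timestamp()
        key = canonical_key(*(row[c] for c in TUPLE_COLUMNS))
        mapping[key].append((row_idx, ts_epoch))
    return dict(mapping)


sample_csv = (
    "Source IP,Source Port,Destination IP,Destination Port,Protocol,Timestamp\n"
    "10.0.0.1,5000,10.0.0.2,80,6,03/07/2017 08:00:00\n"
    "10.0.0.2,80,10.0.0.1,5000,6,03/07/2017 08:01:00\n"
)
day_mapping = build_day_mapping(sample_csv)
```

Both sample rows land under one key because the two directions of a connection share a canonical 5-tuple in this sketch.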
The extract pipeline writes all three artifacts (packets.npz, flows.parquet,
flow_features.parquet) row-aligned. To upgrade an existing artifact pair
that lacks `flow_features.parquet`, run
`scripts/generate_flow_features.py --packets-npz ... --flows-parquet ... --out ...`
(or `--source-store` for sharded stores).
Common gotcha: if CSV timestamps and pcap epochs are in different time zones,
`extract_lib` prints a diagnostic with the recommended `--time-offset`; rerun
with that value.
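To make the idea behind such a diagnostic concrete, the sketch below estimates a whole-hour offset from matched timestamp pairs. It is not `extract_lib`'s actual logic: `suggest_time_offset` is hypothetical, and the sign convention (seconds to add to CSV timestamps to reach pcap epochs) and the unit expected by `--time-offset` are assumptions to verify against the tool's own output.

```python
from statistics import median


def suggest_time_offset(csv_epochs, pcap_epochs, step=3600):
    """Estimate the CSV -> pcap offset in seconds, snapped to `step`.

    Inputs are paired: csv_epochs[i] and pcap_epochs[i] are assumed to
    describe the same flow. The median keeps a few bad matches from
    skewing the estimate; snapping to whole hours reflects that time-zone
    differences are typically a whole number of hours.
    """
    diffs = [p - c for c, p in zip(csv_epochs, pcap_epochs)]
    if not diffs:
        raise ValueError("no matched timestamp pairs")
    return int(round(median(diffs) / step)) * step


offset = suggest_time_offset(
    csv_epochs=[0.0, 100.0, 200.0],
    pcap_epochs=[10801.2, 10900.5, 11000.9],
)
# offset == 10800, i.e. the CSV clock is three hours behind the pcap clock
```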
## Conventions worth preserving
- Do not create a new `runs/` at repo root — outputs belong under `artifacts/`.
- `scripts/download/` stays at the root (shared by all packages).
- When adding new cross-package tooling, put it in root `scripts/`. Only move
it into `Packet_CFM/scripts/` if it depends on that package's imports.
- Phase reports live in `artifacts/phase*/` — keep the timestamp suffix
(`_2026_04_25`) so future runs don't overwrite history.
- The 9-d packet schema and 20-d canonical flow schema are FIXED in
`common/data_contract.py`. Do not extend them ad-hoc; if you need new
features, propose them with evidence (Shafir-style SHAP analysis or
Phase 1-style per-attack ablation).
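The output-location and timestamp-suffix conventions above can be captured in a small helper. This is a minimal sketch; `phase_report_dir` is a hypothetical name, not an existing function in the repo.

```python
from datetime import date
from pathlib import PurePosixPath


def phase_report_dir(phase: str, day: date, root: str = "artifacts") -> PurePosixPath:
    """Build `<root>/<phase>_<YYYY_MM_DD>` so reruns never overwrite history.

    `phase` is a directory label such as "phase1" or "phase25".
    """
    return PurePosixPath(root) / f"{phase}_{day:%Y_%m_%d}"


report_dir = phase_report_dir("phase1", date(2026, 4, 25))
# str(report_dir) == "artifacts/phase1_2026_04_25"
```

Deriving the suffix from the run date, instead of typing it by hand, makes it harder to write a new report into an older run's directory by accident.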
## Current state of the work (2026-04-25)
- Phase 0 baselines + Shafir-protocol verification: ✓
- Phase 1 (34-score expansion + per-attack-class table): ✓
- Phase 2 (masked-prediction consistency loss): ✓ — multi-seed at λ=0.3
- Phase 2.5 (σ × λ sweep + multi-seed at σ=0.6): ✓
- Cross-dataset multi-seed: ✓ — also SOTA after baseline lock
- Shafir baselines locked from PDF: ✓ — `artifacts/locked_baselines.md`
- 4 of 4 reported tasks beat Shafir SOTA (final table: `RESULTS.md`)
- Architecture is finalized; remaining work is paper writing
(P1 skeleton, P2 thresholded F1/Precision/Recall metrics).