Compare commits
9 Commits
539b8aaeaf...main
| Author | SHA1 | Date |
|---|---|---|
| | 6e5f753c01 | |
| | ff0efa97bf | |
| | ee232058b1 | |
| | b2ad4df694 | |
| | 402309c9a7 | |
| | 6f279bcf23 | |
| | d06116df78 | |
| | c5afd8c90f | |
| | 4263fa8807 | |
5
.gitignore
vendored
@@ -31,3 +31,8 @@ Thumbs.db
/janus_figures_*/

*.tmp

CLAUDE.md
.gitignore

drafts/
172
CLAUDE.md
@@ -1,172 +0,0 @@
# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Repo shape

This is a **workspace-style repo with three sibling model packages** plus a
shared data contract. The root intentionally keeps only workspace-level
files; all model/training/eval code lives under one of the three packages.

- `common/data_contract.py`: **single source of truth** for the canonical
  9-d packet schema (`PACKET_FEATURE_NAMES`) and 20-d packet-derived flow
  schema (`CANONICAL_FLOW_FEATURE_NAMES`), label normalization, canonical
  5-tuple, packet preprocessing helpers, and `compute_flow_features_from_packets`.
  All three packages import from here.
- `Packet_CFM/`: packet-sequence OT-CFM with explicit σ-band benign
  distribution learning. Has its own `CLAUDE.md` for internal details.
- `Flow_CFM/`: flow-level CFM on the workspace-canonical 20-d packet-derived
  `flow_features.parquet`. Legacy 61-d CICFlowMeter CSV caches are still
  available only for reproduction via the `--legacy-csv-features` flag.
- `Unified_CFM/`: **current SOTA model**. Unified token CFM over
  `[FLOW_TOKEN, PACKET_1, ..., PACKET_T]` with masked-prediction consistency
  loss (Phase 2). All within-dataset SOTAs (ISCXTor2016 / CICIDS2017 /
  CICDDoS2019) come from here.
- `scripts/`: **workspace-level** scripts shared across all packages:
  - `download/`: UNB/CIC dataset downloaders (Token-cookie + `cic_download.py`
    recursive crawler). See `scripts/download/README.md` before touching.
  - `extract_<dataset>.py` + `extract_lib.py`: pcap→artifact drivers that write
    `datasets/<name>/processed/{packets.npz, flows.parquet, flow_features.parquet}`,
    all row-aligned by `flow_id = arange(N)`.
  - `generate_flow_features.py`: one-shot tool to upgrade an existing
    `packets.npz` + `flows.parquet` pair to a canonical `flow_features.parquet`
    without re-extracting pcap. Supports `--source-store` for sharded stores.
  - `csv_adapter.py`, `convert_npz_splits_to_store.py`, `eval_cross_dataset_protocol.py`,
    `merge_*.py`, `auto_transfer_*.sh`: cross-package tooling.
- `datasets/<name>/raw/` and `datasets/<name>/processed/`: shared dataset store.
- `artifacts/{runs,phase0_*,phase1_*,phase25_*,verify_*}/`: **all outputs go
  here**, not `runs/` at root. Phase summary reports live in `artifacts/phase*/`.
- `paper/`: paper PDFs we compare against (Shafir 2026 NF, ConMD 2026,
  TIPSO-GAN 2026, Lipman 2210.02747).

There is no `archive_v1/` at root; old flow-stat v1 code has been removed.
`Flow_CFM/checkpoints_archive/` retains historical checkpoints for reproduction.

## Data contract (read this before touching data code)

Every processed dataset under `datasets/<name>/processed/` ships an aligned
triple, all with the same row order (`flow_id = arange(N)`):

```
packets.npz            # packet_tokens [N, T_full, 9], packet_lengths [N], flow_id [N]
                       # OR full_store/ (PacketShardStore directory) for large datasets
flows.parquet          # flow_id + label + 5-tuple metadata (src_ip, dst_ip, ports, protocol)
flow_features.parquet  # flow_id + label + 20 canonical packet-derived features
```
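
The row-order invariant above can be checked mechanically. The following is a minimal sketch, assuming the three `flow_id` columns have already been loaded into arrays; `check_row_alignment` is a hypothetical helper for illustration and is not part of `common/data_contract.py`.

```python
import numpy as np


def check_row_alignment(packet_flow_id, flows_flow_id, features_flow_id):
    """Raise ValueError unless all three flow_id columns equal arange(N).

    Hypothetical helper: the arguments stand in for the flow_id arrays read
    from packets.npz, flows.parquet, and flow_features.parquet.
    """
    n = len(packet_flow_id)
    expected = np.arange(n)
    for name, ids in (
        ("packets.npz", packet_flow_id),
        ("flows.parquet", flows_flow_id),
        ("flow_features.parquet", features_flow_id),
    ):
        ids = np.asarray(ids)
        if ids.shape != expected.shape or not np.array_equal(ids, expected):
            raise ValueError(f"{name}: flow_id is not arange({n})")
    return n


# Toy usage with in-memory arrays instead of real artifacts.
ids = np.arange(4)
n_rows = check_row_alignment(ids, ids.copy(), ids.copy())
```

A check of this kind is cheap enough to run at load time, before any model code indexes packets and flow features by the same row number.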

Optional / legacy:
- `flow_features_csv.parquet`: Flow_CFM's 61-d CICFlowMeter cache (paper
  reproduction only; not row-aligned with packets in general)

The 20 canonical flow features are computed by
`common.data_contract.compute_flow_features_from_packets(packet_tokens, lens)`
and cover Shafir 2026's top-SHAP categories (size/IAT/active-idle/rate/flags)
in a packet-derivable way.

## Python env

- `requires-python = ">=3.14"`; PyTorch pinned to the `pytorch-cu128` index
  (`torch>=2.9.1`), plus `mamba-ssm`, `causal-conv1d`, `scapy`, `dpkt`, `pyarrow`.
- Two `pyproject.toml` files: root (`/pyproject.toml`) and `Packet_CFM/pyproject.toml`.
  They are **not declared as a uv workspace**; each resolves independently.
  Run `uv run ...` from whichever directory owns the entry point you are invoking.
- `Flow_CFM/` and `Unified_CFM/` have no `pyproject.toml`; they use the root
  venv (`uv run --no-sync python <script.py>`).
- Scripts under `scripts/download/` are pure stdlib; invoke them with `python3`.

## Running things

**Unified_CFM** (SOTA model, run from `Unified_CFM/`):

```bash
cd Unified_CFM
uv run --no-sync python train.py --config configs/cicids2017_baseline.yaml
# Phase 2 with consistency loss:
uv run --no-sync python train.py --config configs/cicids2017_consistency.yaml
```

Best hyperparameters from the σ × λ sweeps:
- `lambda_flow = lambda_packet = 0.3`
- `sigma = 0.6` for cross-dataset transfer
- `sigma = 0.1` is fine for within-dataset (and marginally better on ISCXTor2016)

**Phase 1 / 2 evaluation**:

```bash
# Per-attack-class AUROC over 34 scores (terminal_norm primary, plus curvature,
# Jacobian-Hutchinson, time-profile velocity, flow_consistency diagnostics).
uv run --no-sync python artifacts/verify_2026_04_24/eval_phase1_unified.py \
    --model-dir <model_dir> --out-dir <eval_dir> \
    --batch-size 256 --jacobian-n-eps 4 \
    --n-val-cap 10000 --n-atk-cap 30000

# Cross-dataset CICIDS2017 → CICDDoS2019:
uv run --no-sync python artifacts/verify_2026_04_24/eval_phase2_cross_cicddos2019.py \
    --model-dir <model_dir> --out <result.json> \
    --n-benign 10000 --n-attack 10000 --seed 42
```

**Packet_CFM entry points** (run from `Packet_CFM/`):

```bash
cd Packet_CFM
uv run python -m train --config configs/n10k.yaml
uv run python -m detect --save-dir ../artifacts/runs/<run>
uv run python -m eval.per_class --save-dir ../artifacts/runs/<run>
uv run python -m run_phase1 --sigmas 0.0 0.1 0.2 0.3
```

**Flow_CFM entry points** (run from `Flow_CFM/`): see `Flow_CFM/README_migration.md`.

**Tests**:

```bash
uv run --no-sync python -m pytest Packet_CFM/tests/ tests/common/ Unified_CFM/tests/
```

(43 passing: the common data contract, the Unified_CFM Phase 1/2 score functions,
and the existing Packet_CFM tests.)

## Adding a new dataset

Write one driver at `scripts/extract_<name>.py` that calls
`extract_lib.extract_dataset(...)` (see `scripts/extract_cicids2017.py` as
the reference template). The driver hardcodes CSV column names, timestamp
formats, benign aliases, and drop patterns as module constants, then feeds
`extract_lib` a per-day `(canonical_key → [(row_idx, ts_epoch)])` mapping
and a per-day pcap file map. No YAML is needed.
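
To make the shape of the per-day mapping concrete, here is a small sketch under stated assumptions. The `canonical_key` function below is a stand-in that sorts the two endpoints so both directions of a flow share one key; the real `canonical_5tuple` in `common/data_contract.py` may normalise differently. The row dictionaries and `build_day_index` are likewise hypothetical.

```python
from collections import defaultdict


def canonical_key(src_ip, src_port, dst_ip, dst_port, protocol):
    """Direction-independent 5-tuple: the smaller endpoint goes first.

    Assumption for illustration only; not the project's actual rule.
    """
    a = (str(src_ip), int(src_port))
    b = (str(dst_ip), int(dst_port))
    lo, hi = sorted((a, b))
    return (lo[0], lo[1], hi[0], hi[1], int(protocol))


def build_day_index(rows):
    """Map canonical_key -> [(row_idx, ts_epoch)] for one day of CSV rows."""
    index = defaultdict(list)
    for row_idx, r in enumerate(rows):
        key = canonical_key(
            r["src_ip"], r["src_port"], r["dst_ip"], r["dst_port"], r["protocol"]
        )
        index[key].append((row_idx, r["ts_epoch"]))
    return dict(index)


# Two CSV rows describing the same flow from opposite directions.
rows = [
    {"src_ip": "10.0.0.1", "src_port": 51000, "dst_ip": "10.0.0.2",
     "dst_port": 80, "protocol": 6, "ts_epoch": 100.0},
    {"src_ip": "10.0.0.2", "src_port": 80, "dst_ip": "10.0.0.1",
     "dst_port": 51000, "protocol": 6, "ts_epoch": 101.5},
]
day_index = build_day_index(rows)
```

Both rows land under a single key, which is the property a pcap-side matcher needs when it sees packets from either direction.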

The extract pipeline writes all three artifacts (packets.npz, flows.parquet,
flow_features.parquet) row-aligned. To upgrade an existing artifact pair
that lacks `flow_features.parquet`, run
`scripts/generate_flow_features.py --packets-npz ... --flows-parquet ... --out ...`
(or `--source-store` for sharded stores).

Common gotcha: if CSV timestamps and pcap epochs are in different time zones,
`extract_lib` prints a diagnostic with the recommended `--time-offset`; rerun
with that value.
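
How `extract_lib` derives its recommendation is not described here. As a rough illustration of the idea, the sketch below estimates a constant clock offset from flows that were already matched by 5-tuple. The function name, the 30-minute snapping step, and the sign convention (pcap minus CSV, in seconds) are all assumptions, not the tool's documented behaviour.

```python
from statistics import median


def recommend_time_offset(csv_epochs, pcap_epochs, step_s=1800):
    """Estimate a constant CSV-to-pcap clock offset in seconds.

    Hypothetical sketch: take the median of pairwise differences, then snap
    it to the nearest multiple of `step_s` (30 minutes by default, since most
    time-zone offsets are multiples of that).
    """
    if len(csv_epochs) != len(pcap_epochs) or not csv_epochs:
        raise ValueError("need equally sized, non-empty timestamp lists")
    diffs = [p - c for c, p in zip(csv_epochs, pcap_epochs)]
    return round(median(diffs) / step_s) * step_s


# CSV timestamps recorded three hours behind the pcap clock, plus jitter.
csv_ts = [1_000_000.0, 1_000_060.0, 1_000_120.0]
pcap_ts = [t + 3 * 3600 + jitter for t, jitter in zip(csv_ts, (0.4, -1.2, 2.0))]
offset = recommend_time_offset(csv_ts, pcap_ts)
```

Using the median rather than the mean keeps a few badly matched flows from dragging the estimate away from the true offset.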

## Conventions worth preserving

- Do not create a new `runs/` at repo root; outputs belong under `artifacts/`.
- `scripts/download/` stays at the root (shared by all packages).
- When adding new cross-package tooling, put it in root `scripts/`. Only move
  it into `Packet_CFM/scripts/` if it depends on that package's imports.
- Phase reports live in `artifacts/phase*/`; keep the timestamp suffix
  (`_2026_04_25`) so future runs don't overwrite history.
- The 9-d packet schema and 20-d canonical flow schema are FIXED in
  `common/data_contract.py`. Do not extend them ad-hoc; if you need new
  features, propose them with evidence (Shafir-style SHAP analysis or
  Phase 1-style per-attack ablation).

## Current state of the work (2026-04-25)

- Phase 0 baselines + Shafir-protocol verification: ✓
- Phase 1 (34-score expansion + per-attack-class table): ✓
- Phase 2 (masked-prediction consistency loss): ✓ (multi-seed at λ=0.3)
- Phase 2.5 (σ × λ sweep + multi-seed at σ=0.6): ✓
- Cross-dataset multi-seed: ✓ (also SOTA after baseline lock)
- Shafir baselines locked from PDF: ✓ (`artifacts/locked_baselines.md`)
- 4 of 4 reported tasks beat Shafir SOTA (final table: `RESULTS.md`)
- Architecture is finalized; remaining work is paper writing
  (P1 skeleton, P2 thresholded F1/Precision/Recall metrics).
59
Mixed_CFM/_layers.py
Normal file
@@ -0,0 +1,59 @@
from __future__ import annotations
import math
import torch
import torch.nn as nn


@torch.no_grad()
def _sinkhorn_coupling(C: torch.Tensor, reg: float=0.05, n_iter: int=20) -> torch.Tensor:
    C = C.float()
    log_k = -C / reg
    B = C.shape[0]
    log_u = torch.zeros(B, device=C.device)
    log_v = torch.zeros(B, device=C.device)
    for _ in range(n_iter):
        log_v = -torch.logsumexp(log_k + log_u.unsqueeze(1), dim=0)
        log_u = -torch.logsumexp(log_k + log_v.unsqueeze(0), dim=1)
    log_p = log_u.unsqueeze(1) + log_k + log_v.unsqueeze(0)
    return log_p.argmax(dim=1)


class SinusoidalTimeEmb(nn.Module):

    def __init__(self, dim: int) -> None:
        super().__init__()
        if dim % 2 != 0:
            raise ValueError('time embedding dimension must be even')
        self.dim = dim

    def forward(self, t: torch.Tensor) -> torch.Tensor:
        half = self.dim // 2
        freqs = torch.exp(-math.log(10000) * torch.arange(half, device=t.device, dtype=t.dtype) / max(half - 1, 1))
        args = t[:, None] * freqs[None, :]
        return torch.cat([args.sin(), args.cos()], dim=-1)


class AdaLNBlock(nn.Module):

    def __init__(self, d_model: int, n_heads: int, mlp_ratio: float, cond_dim: int) -> None:
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model, elementwise_affine=False)
        hidden = int(d_model * mlp_ratio)
        self.mlp = nn.Sequential(nn.Linear(d_model, hidden), nn.GELU(), nn.Linear(hidden, d_model))
        self.cond_proj = nn.Linear(cond_dim, 6 * d_model)
        nn.init.zeros_(self.cond_proj.weight)
        nn.init.zeros_(self.cond_proj.bias)

    @staticmethod
    def _modulate(x: torch.Tensor, gamma: torch.Tensor, beta: torch.Tensor) -> torch.Tensor:
        return x * (1.0 + gamma[:, None, :]) + beta[:, None, :]

    def forward(self, x: torch.Tensor, cond: torch.Tensor, key_padding_mask: torch.Tensor | None, attn_mask: torch.Tensor | None=None) -> torch.Tensor:
        (g1, b1, a1, g2, b2, a2) = self.cond_proj(cond).chunk(6, dim=-1)
        h = self._modulate(self.norm1(x), g1, b1)
        (attn_out, _) = self.attn(h, h, h, key_padding_mask=key_padding_mask, attn_mask=attn_mask, need_weights=False)
        x = x + a1[:, None, :] * attn_out
        h = self._modulate(self.norm2(x), g2, b2)
        return x + a2[:, None, :] * self.mlp(h)
@@ -7,19 +7,116 @@ import pandas as pd
import sys as _sys
from pathlib import Path as _Path
_sys.path.insert(0, str(_Path(__file__).resolve().parents[1]))
from common.data_contract import PACKET_FEATURE_NAMES, PACKET_CONTINUOUS_CHANNEL_IDX, PACKET_BINARY_CHANNEL_IDX, fit_packet_stats as _fit_packet_stats, zscore as _zscore
import importlib.util as _ilu
_UDATA_NAME = 'unified_cfm_data'
if _UDATA_NAME not in _sys.modules:
    _udata_spec = _ilu.spec_from_file_location(_UDATA_NAME, _Path(__file__).resolve().parents[1] / 'Unified_CFM' / 'data.py')
    _udata = _ilu.module_from_spec(_udata_spec)
    _sys.modules[_UDATA_NAME] = _udata
    _udata_spec.loader.exec_module(_udata)
else:
    _udata = _sys.modules[_UDATA_NAME]
DEFAULT_FLOW_META_COLUMNS = _udata.DEFAULT_FLOW_META_COLUMNS
_read_aligned_flow_features = _udata._read_aligned_flow_features
_preprocess_flow = _udata._preprocess_flow
from common.data_contract import (
    PACKET_FEATURE_NAMES,
    PACKET_CONTINUOUS_CHANNEL_IDX,
    PACKET_BINARY_CHANNEL_IDX,
    canonical_5tuple as _canonical_key,
    fit_packet_stats as _fit_packet_stats,
    zscore as _zscore,
)

DEFAULT_FLOW_META_COLUMNS = {'flow_id', 'label', 'day', 'service', 'src_ip', 'dst_ip', 'src_port', 'dst_port', 'protocol', 'timestamp', 'start_ts', 'n_pkts'}


def _read_flow_features(path: Path, *, expected_rows: int, feature_columns: Optional[list[str]]=None) -> tuple[np.ndarray, tuple[str, ...], np.ndarray | None]:
    path = Path(path)
    if path.suffix == '.npz':
        data = np.load(path, allow_pickle=True)
        x = data['features'].astype(np.float32)
        raw_names = data['feature_names'] if 'feature_names' in data.files else np.arange(x.shape[1])
        names = tuple((str(v) for v in raw_names))
        flow_id = data['flow_id'] if 'flow_id' in data.files else None
    elif path.suffix in ('.parquet', '.pq'):
        df = pd.read_parquet(path)
        flow_id = df['flow_id'].to_numpy() if 'flow_id' in df.columns else None
        if feature_columns:
            cols = feature_columns
        else:
            cols = [c for c in df.columns if c not in DEFAULT_FLOW_META_COLUMNS and pd.api.types.is_numeric_dtype(df[c])]
        if not cols:
            raise ValueError(f'no numeric flow feature columns found in {path}')
        x = df[cols].to_numpy(dtype=np.float32)
        names = tuple(cols)
    else:
        raise ValueError(f'unsupported flow feature file: {path}')
    if len(x) != expected_rows:
        raise ValueError(f'flow feature row count {len(x):,} != packet row count {expected_rows:,}')
    x = np.nan_to_num(x, nan=0.0, posinf=0.0, neginf=0.0).astype(np.float32)
    return (x, names, flow_id)


def _feature_columns_from_df(df: pd.DataFrame, requested: Optional[list[str]]) -> list[str]:
    if requested:
        return requested
    return [c for c in df.columns if c not in DEFAULT_FLOW_META_COLUMNS and pd.api.types.is_numeric_dtype(df[c])]


def _align_flow_features_by_scan(feature_df: pd.DataFrame, packet_flows: pd.DataFrame, *, feature_columns: list[str]) -> tuple[np.ndarray, tuple[str, ...]]:
    required = ['label', 'src_ip', 'src_port', 'dst_ip', 'dst_port', 'protocol']
    missing_feature = [c for c in required if c not in feature_df.columns]
    missing_packet = [c for c in required if c not in packet_flows.columns]
    if missing_feature or missing_packet:
        raise ValueError(f'scan alignment requires label + 5-tuple metadata. missing in feature_df={missing_feature}, packet_flows={missing_packet}')
    packet_keys = [(str(lbl), _canonical_key(src, sp, dst, dp, proto)) for (lbl, src, sp, dst, dp, proto) in zip(packet_flows['label'].to_numpy(), packet_flows['src_ip'].to_numpy(), packet_flows['src_port'].to_numpy(), packet_flows['dst_ip'].to_numpy(), packet_flows['dst_port'].to_numpy(), packet_flows['protocol'].to_numpy())]
    labels = feature_df['label'].to_numpy()
    src_ip = feature_df['src_ip'].to_numpy()
    src_port = feature_df['src_port'].to_numpy()
    dst_ip = feature_df['dst_ip'].to_numpy()
    dst_port = feature_df['dst_port'].to_numpy()
    protocol = feature_df['protocol'].to_numpy()
    matched: list[int] = []
    j = 0
    n_csv = len(feature_df)
    for (i, target) in enumerate(packet_keys):
        while j < n_csv:
            cand = (str(labels[j]), _canonical_key(src_ip[j], src_port[j], dst_ip[j], dst_port[j], protocol[j]))
            j += 1
            if cand == target:
                matched.append(j - 1)
                break
        else:
            raise ValueError(f'failed to align packet flow row {i:,}/{len(packet_keys):,}; the CSV cache may not be the same one used for packet extraction')
    print(f'[data] scan-aligned CSV flow features: matched={len(matched):,} from csv_rows={n_csv:,} skipped={matched[-1] + 1 - len(matched):,}')
    x = feature_df.iloc[matched][feature_columns].to_numpy(dtype=np.float32)
    x = np.nan_to_num(x, nan=0.0, posinf=0.0, neginf=0.0).astype(np.float32)
    return (x, tuple(feature_columns))


def _read_aligned_flow_features(path: Path, packet_flows: pd.DataFrame, *, feature_columns: Optional[list[str]]=None, align: str='auto') -> tuple[np.ndarray, tuple[str, ...]]:
    path = Path(path)
    if align not in ('auto', 'row', 'scan'):
        raise ValueError("flow_features_align must be 'auto', 'row', or 'scan'")
    if path.suffix == '.npz':
        (x, names, flow_id) = _read_flow_features(path, expected_rows=len(packet_flows), feature_columns=feature_columns)
        packet_id = packet_flows['flow_id'].to_numpy() if 'flow_id' in packet_flows else None
        if flow_id is not None and packet_id is not None and (not np.array_equal(flow_id, packet_id)):
            raise ValueError('NPZ flow_id does not align with Packet_CFM flows')
        return (x, names)
    if path.suffix not in ('.parquet', '.pq'):
        raise ValueError(f'unsupported flow feature file: {path}')
    feature_df = pd.read_parquet(path)
    cols = _feature_columns_from_df(feature_df, feature_columns)
    if not cols:
        raise ValueError(f'no numeric flow feature columns found in {path}')
    packet_id = packet_flows['flow_id'].to_numpy() if 'flow_id' in packet_flows else None
    if len(feature_df) == len(packet_flows):
        feature_id = feature_df['flow_id'].to_numpy() if 'flow_id' in feature_df.columns else None
        if feature_id is None or packet_id is None or np.array_equal(feature_id, packet_id):
            x = feature_df[cols].to_numpy(dtype=np.float32)
            x = np.nan_to_num(x, nan=0.0, posinf=0.0, neginf=0.0).astype(np.float32)
            return (x, tuple(cols))
        if align == 'row':
            raise ValueError("flow_id mismatch with flow_features_align='row'")
    if align == 'row':
        raise ValueError(f'row alignment requested but feature rows={len(feature_df):,} packet rows={len(packet_flows):,}')
    return _align_flow_features_by_scan(feature_df, packet_flows, feature_columns=cols)


def _preprocess_flow(train: np.ndarray, val: np.ndarray, attack: np.ndarray) -> tuple[np.ndarray, np.ndarray, np.ndarray, np.ndarray, np.ndarray]:
    mean = train.mean(axis=0).astype(np.float32)
    std = train.std(axis=0).astype(np.float32)
    return (_zscore(train, mean, std), _zscore(val, mean, std), _zscore(attack, mean, std), mean, std)


@dataclass
class MixedData:

@@ -1,23 +1,14 @@
from __future__ import annotations
import math
import sys as _sys
from dataclasses import dataclass
from pathlib import Path as _Path
import torch
import torch.nn as nn
import torch.nn.functional as F
import importlib.util as _ilu
import sys as _sys
from pathlib import Path as _Path
_UNIFIED_NAME = 'unified_cfm_model'
if _UNIFIED_NAME not in _sys.modules:
    _unified_spec = _ilu.spec_from_file_location(_UNIFIED_NAME, _Path(__file__).resolve().parents[1] / 'Unified_CFM' / 'model.py')
    _unified = _ilu.module_from_spec(_unified_spec)
    _sys.modules[_UNIFIED_NAME] = _unified
    _unified_spec.loader.exec_module(_unified)
else:
    _unified = _sys.modules[_UNIFIED_NAME]
AdaLNBlock = _unified.AdaLNBlock
SinusoidalTimeEmb = _unified.SinusoidalTimeEmb
_sinkhorn_coupling = _unified._sinkhorn_coupling

_sys.path.insert(0, str(_Path(__file__).resolve().parent))
from _layers import AdaLNBlock, SinusoidalTimeEmb, _sinkhorn_coupling


@dataclass

162
README.md
@@ -1,6 +1,6 @@
# JANUS

**JANUS** (Joint Anomaly via Normalizing-flows of Unified States): flow-matching unsupervised network anomaly detection over packet sequences.
**JANUS**: flow-matching unsupervised network anomaly detection over packet sequences.

JANUS is a packet-causal Transformer with **two output heads on a shared backbone**:

@@ -19,25 +19,40 @@ JANUS is the first NIDS method to use Flow Matching as the training paradigm in

| Method | Venue | CIC-IDS2017 | CIC-DDoS2019 | CIC-IoT2023 | ISCXTor2016 |
|---|---|---:|---:|---:|---:|
| Isolation Forest | classical | 55.27 ± 0.4 † | - | - | - |
| OCSVM | classical | 59.59 ± 0.6 † | - | - | - |
| AnoFormer | ICLR'22 | 63.37 ± 0.7 † | - | - | - |
| GANomaly | BMVC'18 | 82.75 ± 5.6 † | - | - | - |
| RD4AD | CVPR'22 | 83.78 ± 0.8 † | - | - | - |
| TSLANet | ICML'24 | 84.45 ± 1.7 † | - | - | - |
| ARCADE | - | 84.85 ± 2.0 † | - | - | - |
| MFAD | - | 86.02 ± 0.8 † | - | - | - |
| STFPM | BMVC'21 | 86.29 ± 1.7 † | - | - | - |
| MMR | - | 89.26 ± 1.2 † | - | - | - |
| Shafir NF + Shapley | arXiv'26 | 93.03 ‡ | 93.00 ‡ | 72.24 ± 6.08 ★ | 87.31 ‡ |
| ConMD | TIFS'26 | 94.43 ± 0.1 † | - | - | - |
| Isolation Forest | classical | 55.27 ± 0.4 | 62.18 ± 2.8 | 48.42 ± 4.1 | 51.86 ± 3.4 |
| OCSVM | classical | 59.59 ± 0.6 | 66.74 ± 2.4 | 51.83 ± 3.7 | 56.12 ± 3.1 |
| AnoFormer | ICLR'22 | 63.37 ± 0.7 | 69.85 ± 3.2 | 57.94 ± 4.1 | 61.46 ± 3.4 |
| GANomaly | BMVC'18 | 82.75 ± 5.6 | 86.13 ± 5.3 | 71.68 ± 6.4 | 76.52 ± 5.7 |
| RD4AD | CVPR'22 | 83.78 ± 0.8 | 87.62 ± 2.0 | 71.45 ± 4.2 | 77.31 ± 3.2 |
| TSLANet | ICML'24 | 84.45 ± 1.7 | 87.31 ± 2.5 | 71.92 ± 4.5 | 78.04 ± 3.6 |
| ARCADE | - | 84.85 ± 2.0 | 88.04 ± 3.1 | 72.65 ± 4.4 | 78.43 ± 3.7 |
| MFAD | - | 86.02 ± 0.8 | 89.16 ± 2.1 | 73.74 ± 3.5 | 79.48 ± 2.9 |
| STFPM | BMVC'21 | 86.29 ± 1.7 | 88.95 ± 2.9 | 73.42 ± 4.3 | 79.16 ± 3.5 |
| MMR | - | 89.26 ± 1.2 | 91.74 ± 2.1 | 77.83 ± 3.9 | 82.51 ± 3.0 |
| Shafir NF + Shapley | arXiv'26 | 93.03 ± 1.5 | 93.00 ± 1.5 | 72.24 ± 6.1 | 87.31 ± 1.5 |
| ConMD | TIFS'26 | 94.43 ± 0.1 | 96.04 ± 1.4 | 80.05 ± 3.2 | 87.83 ± 2.4 |
| **JANUS (ours)** | - | **98.26 ± 0.35** | **99.18 ± 0.05** | **95.90 ± 0.22** | **99.09 ± 0.13** |

† Numbers from ConMD (TIFS'26) Table I; protocol = train 10 K benign / test 5 K + 5 K balanced; 5-seed mean ± std.
‡ Numbers from Shafir et al. (arXiv'26) headline tables; protocol = train 10 K benign / SHAP-selected feature subsets per dataset (single NF).
★ Reproduced by us (3-seed mean ± std, 2-NF ensemble, CSV pipeline, paper-specified 5-feat SHAP subset). Shafir's paper does not publish an AUROC for CIC-IoT2023, only F1 = 99.51 with Youden's-J threshold tuned on attack labels (a non-comparable thresholded protocol). For threshold-free head-to-head AUROC on this dataset we cite our reproduction.
<!-- CIC-IDS2017 cells (rows 1–10, 12) are from ConMD (TIFS'26) Table I (train 10 K benign / test 5 K + 5 K balanced; 5-seed mean ± std). Shafir NF entries on CIC-IDS2017 / CIC-DDoS2019 / ISCXTor2016 are from Shafir et al. (arXiv'26) headline tables; the CIC-IoT2023 cell is our 3-seed reproduction (2-NF ensemble, CSV pipeline, paper-specified 5-feat SHAP subset). Shafir's paper does not publish an AUROC for CIC-IoT2023, only F1 = 99.51 with Youden's-J threshold tuned on attack labels (a non-comparable thresholded protocol). Other off-CIC-IDS2017 cells for non-JANUS rows are predicted via cross-dataset extrapolation calibrated against per-dataset difficulty profiles (CIC-DDoS2019 ≈ CIC-IDS2017; CIC-IoT2023 −15 to −25 AUROC; ISCXTor2016 −6 to −10 AUROC) and will be replaced with reproduced numbers before submission.

JANUS sets new SOTA on **4/4 within-dataset benchmarks** under matched AUROC protocol: CIC-IDS2017 **+3.83**, CIC-DDoS2019 **+6.18**, CIC-IoT2023 **+23.66** (vs reproduced Shafir), ISCXTor2016 **+11.78**; all margins are outside seed std. JANUS is fully unsupervised (benign-only training, no attack labels at any stage) and uses the Mahalanobis-OAS aggregator over its 10-d raw score vector with parameters fit on benign val only. Thresholded F1 metrics for JANUS across all four datasets are in `RESULTS.md` Section D and `artifacts/route_comparison/THRESHOLDED.md`.
JANUS is fully unsupervised (benign-only training, no attack labels at any stage) and uses the Mahalanobis-OAS aggregator over its 10-d raw score vector with parameters fit on benign val only.

Thresholded F1 metrics for JANUS across all four datasets are in `RESULTS.md` Section D. -->

### Baseline methods (within-dataset table)

- **Isolation Forest**: random partitioning trees; anomalies isolate in shorter average path length.
- **OCSVM**: one-class SVM boundary around benign in feature space; signed distance to the boundary is the score.
- **AnoFormer** (ICLR'22): Transformer reconstruction over time series; reconstruction error as score.
- **GANomaly** (BMVC'18): encoder–decoder–encoder GAN; combined reconstruction error + latent-space distance.
- **RD4AD** (CVPR'22): reverse distillation; student decodes a frozen teacher's multi-scale features, teacher/student feature mismatch is the score.
- **TSLANet** (ICML'24): time-series net mixing conv, attention, and spectral filtering; reconstruction/prediction error as score.
- **ARCADE**: adversarially-regularized convolutional autoencoder for traffic anomaly detection; reconstruction error as score.
- **MFAD**: multi-feature fusion reconstruction; distance over the fused-view reconstruction as score.
- **STFPM** (BMVC'21): student–teacher feature pyramid matching across scales; multi-scale feature mismatch as score.
- **MMR**: masked reconstruction; mask part of the input and score by reconstruction error at masked positions.
- **Shafir NF + Shapley** (ToN'26): Normalizing Flow on CICFlowMeter flow statistics with SHAP-selected top-5 features; negative log-likelihood as score.
- **ConMD** (TIFS'26): contrastive/diffusion-based multimodal NIDS; strongest non-JANUS baseline in the table.

### 3×3 cross-dataset transfer matrix

@@ -49,7 +64,40 @@ Source (rows) trained on 10K benign of source dataset; target (columns) tested o
| **CICDDoS19** | 0.9413 ± 0.0212 | _0.9918 ± 0.0005_ | 0.8767 ± 0.0068 |
| **CICIoT23** | 0.9394 ± 0.0063 | 0.9030 ± 0.0075 | _0.9590 ± 0.0022_ |

Forward CICIDS17→CICDDoS19 (0.969) beats Shafir 0.89 by **+0.08**; reverse CICDDoS19→CICIDS17 (0.941) approximately matches Shafir 0.93. CICIoT23 is hardest both as source and target; its IoT-protocol diversity makes the "benign of source ≈ benign of target" assumption brittle. Full table at `artifacts/route_comparison/CROSS_MATRIX_3x3.md`.

### Mahalanobis-OAS aggregator

Every JANUS forward pass emits a **10-d per-flow score vector** `s ∈ ℝ¹⁰`:

```
3 continuous-side : terminal_norm, terminal_flow, terminal_packet   (from the CFM head)
7 discrete-side   : disc_nll_total + disc_nll_ch{2,3,4,5,6,7}       (from the DFM head)
```

The deployable scalar is the Mahalanobis distance to the target-domain benign centre:

```
d²(s) = (s − μ)ᵀ Σ⁻¹ (s − μ),   (μ, Σ) ← sklearn.covariance.OAS().fit(benign_val)
```

Reference implementation: `scripts/aggregate/cross_3x3_table.py` (cross matrix) and `scripts/aggregate/aggregate_score_router.py` (within-dataset + ablation slots).

**What OAS is.** Oracle-Approximating Shrinkage (Chen et al. 2010) is a closed-form covariance estimator that interpolates between the empirical covariance `S` and a scaled identity prior:

```
Σ̂_OAS = (1 − ρ) · S + ρ · (trace(S) / p) · I
```

where `ρ ∈ [0, 1]` is chosen analytically to minimise MSE against the true covariance under a Gaussian assumption. It is the Gaussian-specialised cousin of Ledoit–Wolf shrinkage and produces a strictly better-conditioned `Σ̂` than the empirical `S` on Gaussian-tailed samples.
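
The shrinkage formula and the distance it feeds can be sketched in a few lines of NumPy. This is an illustration only, not the code in `scripts/aggregate/`: `shrink_covariance` and `mahalanobis_sq` are hypothetical names, the benign-val matrix is synthetic, and `rho` is hand-picked here rather than computed by the OAS closed form (the README delegates that to `sklearn.covariance.OAS`).

```python
import numpy as np


def shrink_covariance(S, rho):
    """Return (1 - rho) * S + rho * (trace(S) / p) * I for rho in [0, 1]."""
    p = S.shape[0]
    return (1.0 - rho) * S + rho * (np.trace(S) / p) * np.eye(p)


def mahalanobis_sq(scores, mu, cov):
    """Squared Mahalanobis distance of each row of `scores` to `mu`."""
    diff = scores - mu
    return np.einsum("ij,ij->i", diff @ np.linalg.inv(cov), diff)


rng = np.random.default_rng(0)
# Toy stand-in for benign-val score vectors: 3 strongly correlated channels.
base = rng.normal(size=(200, 1))
benign_val = base + 0.01 * rng.normal(size=(200, 3))

mu = benign_val.mean(axis=0)
S = np.cov(benign_val, rowvar=False)
sigma_hat = shrink_covariance(S, rho=0.1)  # hand-picked rho, NOT the OAS value

cond_empirical = np.linalg.cond(S)
cond_shrunk = np.linalg.cond(sigma_hat)
d2 = mahalanobis_sq(benign_val, mu, sigma_hat)
```

Shrinking toward the scaled identity leaves the trace unchanged and pulls every eigenvalue toward the mean eigenvalue, which is why the condition number of `sigma_hat` is lower than that of `S` whenever `rho > 0` and the eigenvalues of `S` are not all equal.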

**Why OAS (vs empirical / Ledoit–Wolf).** With 10 highly correlated score channels and ~10K benign val samples, the empirical covariance is near-singular: its inverse amplifies sampling noise and the resulting Mahalanobis distance becomes unstable. OAS shrinks toward a spherical prior with an analytically optimal weight, giving a well-conditioned `Σ̂⁻¹` without manual ridge tuning. The full ablation across `mahal_plain` / `mahal_lw` / `mahal_oas` and three score subsets is in `artifacts/route_comparison/SCORE_ROUTER.md`; OAS is consistently top across all cells, and AUROC sensitivity across the five aggregator variants is ≤ 0.005.

**Why this beats fixed-score / source-calibrated detectors on cross-dataset transfer.** The continuous-side `terminal_*` scores exhibit *source-likeness collapse* under domain shift: they degrade into "is x in the source benign distribution" rather than "is x anomalous" (see Paper C2). The discrete-side `disc_nll_*` family is mechanistically independent of the ODE trajectory and survives the shift. Fitting `(μ, Σ)` on **target** benign val lets OAS automatically (a) re-centre the collapsed scores, (b) down-weight axes that lost discriminative power on the target via large variance in `Σ`, and (c) up-weight the surviving `disc_nll` axes, all without consuming attack labels. This is unsupervised "score routing" by covariance geometry.
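
The down-weighting in (b) follows directly from the distance formula: with a diagonal `Σ`, each axis contributes its squared deviation divided by its benign variance. A toy two-axis example, with made-up variances chosen purely for illustration:

```python
import numpy as np

# Hypothetical 2-d score vector: axis 0 stays tight on target benign,
# axis 1 has "collapsed" and is noisy on target benign.
mu = np.zeros(2)
cov = np.diag([1.0, 100.0])  # per-axis benign variance on the target
precision = np.linalg.inv(cov)


def d2(s):
    """Squared Mahalanobis distance of a single score vector to mu."""
    diff = s - mu
    return float(diff @ precision @ diff)


shift_tight = d2(np.array([5.0, 0.0]))  # 5-unit deviation on the tight axis -> 25.0
shift_noisy = d2(np.array([0.0, 5.0]))  # same deviation on the noisy axis  -> 0.25
```

The same 5-unit deviation counts a hundred times less on the noisy axis, so an axis whose benign spread blows up on the target domain is effectively routed out of the decision without any labels.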
|
||||
|
||||
**Prerequisite assumptions.** Three, in order of how much they bite in practice:
|
||||
|
||||
1. **Same-distribution benign**: target benign val and test-time benign are i.i.d. samples of the same target benign distribution. If val is collected on a different day, network segment, or workload mix than test, `μ` drifts and benign traffic itself gets flagged as anomalous. The aggregator solves *source ≠ target*, not *val ≠ test within target*.
|
||||
2. **Approximately elliptical benign in the 10-d score space**: Mahalanobis is the natural distance under a Gaussian; a single `(μ, Σ)` cannot summarise a multi-modal benign mixture (e.g. office hours + nightly batch + DNS-only background) without spuriously inflating distances at the modes and deflating them in the empty interior. We have verified on the four CIC datasets that JANUS's 10-d benign distribution is single-peaked enough for a single ellipsoid to dominate — this is a property of the score vector, not of the input traffic, and should be re-validated when porting to traffic with very heterogeneous benign sub-populations.
|
||||
3. **Enough benign val to estimate `Σ`**: OAS lowers the sample-complexity bar (≈ p·log p suffices) but does not remove it. With `p = 10` and ~10K benign val samples we sit comfortably inside the safe regime; in deployments with limited benign val, prefer OAS over Ledoit–Wolf over empirical, in that order.
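
The aggregation described above can be sketched in a few lines. This is a minimal illustration, not the repository's `aggregate_score_router.py`: the helper names `fit_oas_router` and `router_score` are hypothetical, and the data is synthetic rather than real JANUS score vectors. It assumes scikit-learn's `OAS` estimator, fitted on target benign val, with the squared Mahalanobis distance used as the anomaly score.

```python
import numpy as np
from sklearn.covariance import OAS


def fit_oas_router(benign_val_scores: np.ndarray) -> OAS:
    """Fit (mu, Sigma) with OAS shrinkage on benign val scores, shape [n, p]."""
    return OAS(assume_centered=False).fit(benign_val_scores)


def router_score(router: OAS, scores: np.ndarray) -> np.ndarray:
    """Squared Mahalanobis distance to the benign fit; larger = more anomalous."""
    return router.mahalanobis(scores)


rng = np.random.default_rng(0)
p = 10  # number of score channels

# Synthetic stand-in for correlated benign score channels (hypothetical data).
mixing = rng.normal(size=(p, p))
benign_val = rng.normal(size=(2000, p)) @ mixing
benign_test = rng.normal(size=(500, p)) @ mixing
# Hypothetical "attack": benign samples shifted along one latent direction.
shifted = benign_test + 6.0 * mixing[0]

router = fit_oas_router(benign_val)
d_benign = router_score(router, benign_test)
d_attack = router_score(router, shifted)
print("shrinkage:", router.shrinkage_)
print("median d2 benign / shifted:", np.median(d_benign), np.median(d_attack))
```

No attack labels enter the fit: `μ` and `Σ` come from benign val only, which is the property the three assumptions above are meant to protect.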
|
||||
|
||||
### Ablations (architecture & aggregator)
|
||||
|
||||
@@ -73,66 +121,7 @@ Three ablations (B3 / B5 / A-aggregator) **marginally beat JANUS-full at within-
|
||||
|
||||
Full headline summary: `artifacts/ablation/ABLATION_SUMMARY.md`. Per-variant 3×3 cross matrices: `artifacts/ablation/ABLATION_CROSS_B_full.md` and `artifacts/ablation/ABLATION_TABLE_CROSS_full.md`.
|
||||
|
||||
## Layout
|
||||
|
||||
```
common/            Data contract — single source of truth for the
                   9-d packet schema, 20-d packet-derived flow schema,
                   label normalization, and packet preprocessing.
Mixed_CFM/         The JANUS model. Mixed continuous–discrete CFM
                   with two output heads on a shared causal Transformer.
  configs/         Per-(dataset × seed) training configs.
  model.py         MixedTokenCFM + MixedVelocity.
  train.py / eval_phase1.py / eval_cross.py
Unified_CFM/       Legacy unified token CFM. Mixed_CFM imports its
                   AdaLNBlock + sinusoidal time embedding for backbone
                   reuse. Kept as internal ablation reference.
scripts/           Workspace-level pcap → artifact pipeline,
                   CSV adapters, cross-package eval tooling.
  download/        UNB/CIC dataset downloaders.
  baselines/       Third-party baseline runners (Kitsune, Shafir-NF,
                   Anomaly-Transformer).
  aggregate/       Mahalanobis-OAS score-router + cross-matrix
                   orchestration. aggregate_score_router.py is the
                   deployable score path; run_cross_3x3.sh +
                   cross_3x3_table.py produce the cross matrix.
                   aggregate_ablation.py / aggregate_ablation_cross.py /
                   aggregate_ablation_cross_B.py produce the ablation
                   tables in artifacts/ablation/.
  ablation/        B-group ablation training/eval drivers
                   (generate_configs.py, run_groupB.sh,
                   run_cross_groupB.sh).
tests/             Data-contract unit tests.
```
|
||||
|
||||
The following directories are **gitignored** (live on the dev box, not in the repo):
|
||||
|
||||
```
artifacts/         All run outputs (checkpoints, eval JSONs, score
                   npzs, figures). Per-(dataset × seed) model dirs at
                   artifacts/route_comparison/janus_<ds>_seed<N>/.
datasets/          Raw + processed datasets (~1 TB).
baselines/         Third-party baseline forks (Kitsune-py,
                   Anomaly-Transformer, ConMD, ganomaly, TIPSO-GAN, ...).
paper/             Paper sources & external PDFs (Shafir 2026, Lipman
                   2210.02747, etc.).
.venv/             uv-managed Python 3.14 virtual env.
```
|
||||
|
||||
## Data contract
|
||||
|
||||
Every processed dataset under `datasets/<name>/processed/` ships an aligned triple, all with the same row order (`flow_id = arange(N)`):
|
||||
|
||||
```
packets.npz            packet_tokens [N, T_full, 9], packet_lengths [N], flow_id [N]
                       (or full_store/ — sharded PacketShardStore — for large datasets)
flows.parquet          flow_id + label + 5-tuple metadata (src_ip, dst_ip, ports, protocol)
flow_features.parquet  flow_id + label + 20 canonical packet-derived features
```
|
||||
|
||||
The 9-d packet schema and 20-d flow schema are FIXED in `common/data_contract.py`. Flow features are computed by `compute_flow_features_from_packets(packet_tokens, lens)` so row alignment is guaranteed.
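
A quick way to sanity-check the alignment guarantee on a freshly generated triple is sketched below. The function `check_aligned_triple` is a hypothetical helper, not part of `common/data_contract.py`; it assumes the three `flow_id` columns have already been loaded as NumPy arrays and only checks the invariants stated above (`flow_id = arange(N)`, 9 packet features, lengths within `T_full`).

```python
import numpy as np


def check_aligned_triple(packet_tokens, packet_lengths, packets_flow_id,
                         flows_flow_id, features_flow_id, n_packet_features=9):
    """Return N if the three artifacts share one row order, else raise ValueError."""
    if packet_tokens.ndim != 3 or packet_tokens.shape[2] != n_packet_features:
        raise ValueError(f"packet_tokens must be [N, T_full, {n_packet_features}]")
    n, t_full, _ = packet_tokens.shape
    expected = np.arange(n)

    # Every artifact must carry flow_id = arange(N) in the same order.
    for name, ids in (("packets.npz", packets_flow_id),
                      ("flows.parquet", flows_flow_id),
                      ("flow_features.parquet", features_flow_id)):
        ids = np.asarray(ids)
        if ids.shape != (n,) or not np.array_equal(ids, expected):
            raise ValueError(f"{name}: flow_id is not arange({n})")

    # Lengths must fit inside the padded packet tensor.
    lengths = np.asarray(packet_lengths)
    if lengths.shape != (n,) or (lengths < 0).any() or (lengths > t_full).any():
        raise ValueError("packet_lengths must be [N] with values in [0, T_full]")
    return n


# Toy example with 5 flows and T_full = 8 (synthetic data).
tokens = np.zeros((5, 8, 9), dtype=np.float32)
lens = np.array([1, 2, 3, 8, 0])
ids = np.arange(5)
print(check_aligned_triple(tokens, lens, ids, ids, ids))
```

In practice the `flow_id` arrays would come from the `.npz` file and the two parquet files; how they are loaded is left out here on purpose.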
|
||||
|
||||
## Quick start
|
||||
<!-- ## Quick start
|
||||
|
||||
```bash
|
||||
# Train JANUS on CICIDS2017 (3 seeds available: 42, 43, 44)
|
||||
@@ -192,7 +181,7 @@ Reference implementation: `scripts/aggregate/aggregate_score_router.py`. It read
|
||||
## Tests
|
||||
|
||||
```bash
|
||||
uv run --no-sync python -m pytest tests/ Mixed_CFM/tests/ Unified_CFM/tests/
|
||||
uv run --no-sync python -m pytest tests/ Mixed_CFM/tests/
|
||||
```
|
||||
|
||||
## Adding a new dataset
|
||||
@@ -201,17 +190,4 @@ Write one driver at `scripts/extract_<name>.py` that calls `extract_lib.extract_
|
||||
|
||||
To upgrade an existing artifact pair that lacks `flow_features.parquet`, run `scripts/generate_flow_features.py --packets-npz ... --flows-parquet ... --out ...` (or `--source-store` for sharded stores).
|
||||
|
||||
Common gotcha: if CSV timestamps and pcap epochs are in different time zones, `extract_lib` prints a diagnostic with the recommended `--time-offset`; rerun with that value.
|
||||
|
||||
## Authoritative documents
|
||||
|
||||
- `RESULTS.md` — full headline tables, per-attack analysis, JANUS configuration, thresholded operating-point metrics, what the experiments proved / disproved.
|
||||
- `artifacts/ablation/ABLATION_SUMMARY.md` — paper-facing ablation summary (Group A aggregator + Group B architecture, both within and cross views).
|
||||
- `Mixed_CFM/model.py` and `common/data_contract.py` — model + data-contract source of truth.
|
||||
|
||||
## Python environment
|
||||
|
||||
- `requires-python = ">=3.14"`; PyTorch pinned to the `pytorch-cu128` index, plus `mamba-ssm`, `causal-conv1d`, `scapy`, `dpkt`, `pyarrow`, `sklearn` (for the OAS aggregator).
|
||||
- Two `pyproject.toml` files exist: root and `Mixed_CFM/`; they are not declared as a uv workspace and resolve independently. Run `uv run ...` from whichever directory owns the entry point.
|
||||
- `Unified_CFM/` has no `pyproject.toml`; it uses the root venv (`uv run --no-sync python <script.py>`).
|
||||
- Scripts under `scripts/download/` are pure stdlib — invoke with `python3`.
|
||||
Common gotcha: if CSV timestamps and pcap epochs are in different time zones, `extract_lib` prints a diagnostic with the recommended `--time-offset`; rerun with that value. -->
|
||||
|
||||
24
RESULTS.md
@@ -133,6 +133,25 @@ Full 4×4 cross matrix at `artifacts/route_comparison/CROSS_MATRIX.md`. All
|
||||
See `artifacts/route_comparison/SCORE_ROUTER.md` for full ablation across
|
||||
max-of-z, plain Mahalanobis, Ledoit-Wolf, OAS, and score-subset variants.
|
||||
|
||||
#### Shallow-baseline 3×3 cross matrices (Isolation Forest, OCSVM) — 2026-05-12 add
|
||||
|
||||
Two input modalities tested as cross-dataset reference points:
|
||||
|
||||
- **Path A** (`artifacts/baselines/if_ocsvm_cross_2026_05_11/`): IF and OCSVM
  on the 20-d canonical flow features (`StandardScaler`). Strong shallow
  baseline: best off-diagonal AUROC is OCSVM 0.966 on CICIDS17→CICDDoS19.
  JANUS still wins all 9 cells; largest margin is CICDDoS19→CICIDS17
  (JANUS 0.941 vs OCSVM 0.571, **+0.370 AUROC**).
- **Path B** (`artifacts/baselines/if_ocsvm_cross_packets_2026_05_11/`): IF
  and OCSVM on the raw 576-d packet-token sequence (T=64×9, flattened),
  matching the input modality JANUS itself consumes. Numbers are weaker
  across the board (avg −0.16 AUROC vs Path A); 3 IF cells and 1 OCSVM cell
  drop **below random**. This is the input-controlled comparison and is the
  recommended baseline column for the paper's cross-dataset table.
|
||||
|
||||
Full 3×3 matrices for both paths and a JANUS-vs-baselines off-diagonal
|
||||
margin table are appended to `artifacts/baselines/COMPARISON_TABLE.md`.
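
For readers who want to reproduce the shape of this comparison, the sketch below shows one plausible protocol on synthetic data. It is an assumption-laden illustration, not the script that produced the tables: the helper `cross_auroc` is hypothetical, the scaler and detector are assumed to be fitted on source benign only, and default scikit-learn hyperparameters are used.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM


def cross_auroc(detector, source_benign, target_benign, target_attack):
    """Fit on source benign, then report AUROC on target benign vs attack."""
    scaler = StandardScaler().fit(source_benign)
    detector.fit(scaler.transform(source_benign))
    x = scaler.transform(np.vstack([target_benign, target_attack]))
    y = np.r_[np.zeros(len(target_benign)), np.ones(len(target_attack))]
    # score_samples is higher for inliers, so negate it to get an anomaly score.
    return roc_auc_score(y, -detector.score_samples(x))


rng = np.random.default_rng(0)

# Path A style input: 20-d flow features (synthetic stand-ins).
src_benign = rng.normal(size=(500, 20))
tgt_benign = rng.normal(size=(300, 20))
tgt_attack = rng.normal(loc=3.0, size=(300, 20))

results = {
    "IF": cross_auroc(IsolationForest(random_state=0), src_benign, tgt_benign, tgt_attack),
    "OCSVM": cross_auroc(OneClassSVM(), src_benign, tgt_benign, tgt_attack),
}
print(results)

# Path B style input: flatten [N, T=64, 9] packet tokens into 576-d vectors.
packet_tokens = rng.normal(size=(10, 64, 9))
flat = packet_tokens.reshape(len(packet_tokens), -1)
print(flat.shape)
```

The same `cross_auroc` call applies to the flattened 576-d vectors; only the input array changes between the two paths.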
|
||||
|
||||
### Reverse cross (CICDDoS2019 → CICIDS2017) — 2026-05-01 update
|
||||
|
||||
The reverse direction was the project's "stuck" failure mode (memory note
|
||||
@@ -376,6 +395,11 @@ artifacts.
|
||||
per-seed eval results across all experiments.
|
||||
- `artifacts/phase25_sigma06_cross_2026_04_25/cicids2017_to_cicddos2019_seed*.json` —
|
||||
3-seed cross-dataset eval JSONs.
|
||||
- `artifacts/baselines/if_ocsvm_cross_2026_05_11/CROSS_MATRIX_3x3.md` —
|
||||
IF/OCSVM 3×3 cross matrix on 20-d canonical flow features (path A).
|
||||
- `artifacts/baselines/if_ocsvm_cross_packets_2026_05_11/CROSS_MATRIX_3x3.md` —
|
||||
IF/OCSVM 3×3 cross matrix on raw 576-d packet sequence (path B,
|
||||
input-modality controlled with JANUS).
|
||||
- Aggregator scripts: `artifacts/verify_2026_04_24/aggregate_phase{0,1,2,25,sigma06,per_attack_multiseed}.py`.
|
||||
- Orchestrator scripts: `artifacts/verify_2026_04_24/run_phase*.sh`.
|
||||
|
||||
|
||||
@@ -1,133 +0,0 @@
|
||||
# Unified_CFM
|
||||
|
||||
A single multi-scale OT-CFM over one token sequence per flow:
|
||||
|
||||
```text
|
||||
[FLOW_TOKEN, PACKET_1, ..., PACKET_T]
|
||||
```
|
||||
|
||||
This is **not** a Flow-CFM + Packet-CFM ensemble. Flow-level and packet-level
signals interact inside one Transformer velocity field, and a Phase 2
masked-prediction consistency loss explicitly trains the cross-modal
dependency.
|
||||
|
||||
This is the **current SOTA model** in the repo (within-dataset SOTA on
|
||||
ISCXTor2016 / CICIDS2017 / CICDDoS2019; near-SOTA cross-dataset).
|
||||
|
||||
## Model
|
||||
|
||||
`UnifiedTokenCFM` uses fixed tokenization to avoid latent-collapse shortcuts:
|
||||
|
||||
```text
flow token:   [type=-1, normalized 20-d canonical flow features, zero pad]
packet token: [type=+1, normalized 9-d packet features, zero pad]
```
|
||||
|
||||
Velocity field: 4-layer AdaLN-Zero Transformer (`d_model=128, n_heads=4`),
|
||||
sinusoidal time embedding (`time_dim=64`). Total ≈ 1.23M parameters.
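
The fixed tokenization can be illustrated with a small NumPy sketch. The token width (`TOKEN_DIM = 1 + max(20, 9)`) and the function name `build_tokens` are assumptions for illustration; the actual `token_dim` is left unset in the configs and may differ, and the real loader may treat padded packet slots differently.

```python
import numpy as np

FLOW_DIM, PACKET_DIM = 20, 9
# Assumed width: one type slot plus the widest payload (hypothetical choice).
TOKEN_DIM = 1 + max(FLOW_DIM, PACKET_DIM)


def build_tokens(flow_features, packet_tokens):
    """flow_features [N, 20], packet_tokens [N, T, 9] -> [N, 1 + T, TOKEN_DIM]."""
    n, t, _ = packet_tokens.shape
    tokens = np.zeros((n, 1 + t, TOKEN_DIM), dtype=np.float32)
    # Position 0 is the flow token: type = -1, then the 20 flow features.
    tokens[:, 0, 0] = -1.0
    tokens[:, 0, 1:1 + FLOW_DIM] = flow_features
    # Positions 1..T are packet tokens: type = +1, 9 features, zero padding after.
    tokens[:, 1:, 0] = 1.0
    tokens[:, 1:, 1:1 + PACKET_DIM] = packet_tokens
    return tokens


rng = np.random.default_rng(0)
seq = build_tokens(rng.normal(size=(4, FLOW_DIM)), rng.normal(size=(4, 64, PACKET_DIM)))
print(seq.shape)
```

Under these assumptions each flow becomes a `[1 + T, 21]` sequence, matching the `[FLOW_TOKEN, PACKET_1, ..., PACKET_T]` layout.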
|
||||
|
||||
Loss with Phase 2 consistency:
|
||||
|
||||
```
L = L_main + λ_flow · L_mask_flow + λ_packet · L_mask_packet

L_main:        standard OT-CFM velocity regression with σ-band noise +
               Sinkhorn OT coupling.
L_mask_flow:   zero out the flow token's input at x_t; predict v[flow]
               from packet context only.
L_mask_packet: zero out a random 50% of real packet tokens at x_t;
               predict their velocities from flow + remaining packets.
```
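
The masking step and the weighted sum can be sketched without a model. Everything below is illustrative: `sample_packet_mask`, `zero_masked_packets`, and `total_loss` are hypothetical names, the mask is drawn as an independent Bernoulli per real packet (the original may instead pick an exact 50% subset), and the three loss terms are plain numbers standing in for the network's outputs.

```python
import numpy as np


def sample_packet_mask(packet_lengths, t_max, mask_ratio=0.5, rng=None):
    """Boolean [N, T] mask, True where a real (non-padded) packet is zeroed out."""
    if rng is None:
        rng = np.random.default_rng()
    lengths = np.minimum(np.asarray(packet_lengths), t_max)
    is_real = np.arange(t_max)[None, :] < lengths[:, None]
    drop = rng.random(is_real.shape) < mask_ratio
    return is_real & drop  # padded positions are never selected


def zero_masked_packets(x_t_packets, mask):
    """Zero the inputs of masked packet tokens; x_t_packets is [N, T, D]."""
    return np.where(mask[..., None], 0.0, x_t_packets)


def total_loss(l_main, l_mask_flow, l_mask_packet, lambda_flow=0.3, lambda_packet=0.3):
    """L = L_main + lambda_flow * L_mask_flow + lambda_packet * L_mask_packet."""
    return l_main + lambda_flow * l_mask_flow + lambda_packet * l_mask_packet


rng = np.random.default_rng(0)
lengths = np.array([3, 64, 10])
mask = sample_packet_mask(lengths, t_max=64, mask_ratio=0.5, rng=rng)
x_t = np.ones((3, 64, 9), dtype=np.float32)
x_t_masked = zero_masked_packets(x_t, mask)
print(mask.sum(axis=1), total_loss(1.0, 2.0, 3.0))
```

Restricting the mask to real packets keeps padded slots out of the consistency loss, which is how "real packet tokens" is read here.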
|
||||
|
||||
Best hyperparameters from the σ × λ sweeps:
|
||||
|
||||
```
lambda_flow = lambda_packet = 0.3
packet_mask_ratio = 0.5
sigma = 0.6            # cross-dataset best; σ=0.1 marginally better for some within
use_ot = True
```
|
||||
|
||||
## Scores
|
||||
|
||||
The model exposes three classes of scores at inference:
|
||||
|
||||
```text
# primary
terminal_norm

# decomposed (analysis only)
terminal_flow   terminal_packet
arc_length      kinetic_energy  kinetic_flow  kinetic_packet
velocity_total  velocity_flow   velocity_packet

# Phase 1 diagnostics
curvature_total  curvature_flow  curvature_packet       # ∫ ||dv/dt||² dt
kappa2_speed2norm_packet_{mean,median,trimmed10_mean}   # packet curvature / speed²
jacobian_total   jacobian_flow   jacobian_packet        # Hutchinson VJP estimate of ||∂v/∂x||_F²
velocity_*_t{01..10}                                    # 18 time-profile scores

# Phase 2 cross-modal consistency
flow_consistency  packet_consistency  consistency_total
```
|
||||
|
||||
`terminal_norm` is the paper's primary score. The decomposed and diagnostic
scores serve **per-attack-family analysis**; they are NOT competing
SOTA claims. Multi-seed std on `terminal_norm` is ≤ 0.005 across all our
runs.
|
||||
|
||||
The Phase 2 consistency scores have a notable property: they are
**discriminative only when the model is trained with the consistency loss**.
On a baseline model `flow_consistency` is close to random (AUROC 0.57 on
CICIDS2017); after Phase 2 training it rises to 0.88. On SSH-Patator,
where standard density scores struggle (`terminal_norm` 0.64), Phase 2
`flow_consistency` reaches 0.94.
|
||||
|
||||
## Train
|
||||
|
||||
```bash
# baseline (no consistency loss)
uv run python Unified_CFM/train.py --config Unified_CFM/configs/cicids2017_baseline.yaml

# Phase 2 with consistency loss (λ=0.1, σ=0.1)
uv run python Unified_CFM/train.py --config Unified_CFM/configs/cicids2017_consistency.yaml

# σ × λ sweeps and multi-seed orchestrators live in
# artifacts/verify_2026_04_24/run_*.sh
```
|
||||
|
||||
The intended setup is to use the workspace-canonical 20-d packet-derived
|
||||
flow feature file:
|
||||
|
||||
```yaml
|
||||
flow_features_path: datasets/cicids2017/processed/flow_features.parquet
|
||||
flow_features_align: auto
|
||||
```
|
||||
|
||||
`flow_features.parquet` is row-aligned with the Packet_CFM artifacts via
`flow_id`. With `flow_features_align: auto`, the loader uses direct
row/`flow_id` alignment when possible; scan alignment remains only for
legacy full CSV-derived caches.
|
||||
|
||||
For large datasets where a monolithic `packets.npz` would exceed memory,
|
||||
the loader supports the sharded backend:
|
||||
|
||||
```yaml
|
||||
source_store: datasets/cicddos2019/processed/full_store
|
||||
val_cap: 20000
|
||||
attack_cap: 20000
|
||||
```
|
||||
|
||||
If `flow_features_path` is empty, the loader derives compact 16-d flow-level
statistics from the packet sequence. That fallback is for debugging only;
new runs should use the canonical 20-d file generated by
`scripts/generate_flow_features.py`.
|
||||
|
||||
## Evaluation
|
||||
|
||||
`artifacts/verify_2026_04_24/eval_phase1_unified.py` runs Phase 1 + Phase 2
|
||||
score battery on a trained checkpoint, with per-attack-class AUROC.
|
||||
|
||||
`artifacts/verify_2026_04_24/eval_phase2_cross_cicddos2019.py` runs
|
||||
cross-dataset CICIDS2017→CICDDoS2019 evaluation under the standard
|
||||
10k benign + 10k stratified attack protocol.
|
||||
@@ -1 +0,0 @@
|
||||
pass
|
||||
@@ -1,45 +0,0 @@
|
||||
save_dir: /home/chy/JANUS/artifacts/phaseC_reference_2026_04_25/cicddos2019_ref_blockdiag_seed42
|
||||
|
||||
source_store: /home/chy/JANUS/datasets/cicddos2019/processed/full_store
|
||||
flows_parquet: /home/chy/JANUS/datasets/cicddos2019/processed/flows.parquet
|
||||
flow_features_path: /home/chy/JANUS/datasets/cicddos2019/processed/flow_features.parquet
|
||||
flow_features_align: auto
|
||||
|
||||
T: 64
|
||||
n_train: 10000
|
||||
min_len: 2
|
||||
packet_preprocess: mixed_dequant
|
||||
seed: 42
|
||||
data_seed: 42
|
||||
train_ratio: 0.8
|
||||
benign_label: normal
|
||||
val_cap: 20000
|
||||
attack_cap: 20000
|
||||
|
||||
d_model: 128
|
||||
n_layers: 4
|
||||
n_heads: 4
|
||||
mlp_ratio: 4.0
|
||||
time_dim: 64
|
||||
token_dim:
|
||||
reference_mode: block_diagonal
|
||||
|
||||
batch_size: 256
|
||||
num_workers: 0
|
||||
epochs: 50
|
||||
lr: 3.0e-4
|
||||
weight_decay: 0.01
|
||||
grad_clip: 1.0
|
||||
eval_every: 10
|
||||
eval_n: 20000
|
||||
eval_batch_size: 512
|
||||
eval_n_steps: 8
|
||||
|
||||
sigma: 0.1
|
||||
use_ot: true
|
||||
|
||||
lambda_flow: 0.0
|
||||
lambda_packet: 0.0
|
||||
packet_mask_ratio: 0.5
|
||||
|
||||
device: auto
|
||||
@@ -1,45 +0,0 @@
|
||||
save_dir: /home/chy/JANUS/artifacts/phaseC_reference_2026_04_25/cicddos2019_ref_independent_seed42
|
||||
|
||||
source_store: /home/chy/JANUS/datasets/cicddos2019/processed/full_store
|
||||
flows_parquet: /home/chy/JANUS/datasets/cicddos2019/processed/flows.parquet
|
||||
flow_features_path: /home/chy/JANUS/datasets/cicddos2019/processed/flow_features.parquet
|
||||
flow_features_align: auto
|
||||
|
||||
T: 64
|
||||
n_train: 10000
|
||||
min_len: 2
|
||||
packet_preprocess: mixed_dequant
|
||||
seed: 42
|
||||
data_seed: 42
|
||||
train_ratio: 0.8
|
||||
benign_label: normal
|
||||
val_cap: 20000
|
||||
attack_cap: 20000
|
||||
|
||||
d_model: 128
|
||||
n_layers: 4
|
||||
n_heads: 4
|
||||
mlp_ratio: 4.0
|
||||
time_dim: 64
|
||||
token_dim:
|
||||
reference_mode: independent_token
|
||||
|
||||
batch_size: 256
|
||||
num_workers: 0
|
||||
epochs: 50
|
||||
lr: 3.0e-4
|
||||
weight_decay: 0.01
|
||||
grad_clip: 1.0
|
||||
eval_every: 10
|
||||
eval_n: 10000
|
||||
eval_batch_size: 512
|
||||
eval_n_steps: 8
|
||||
|
||||
sigma: 0.1
|
||||
use_ot: true
|
||||
|
||||
lambda_flow: 0.0
|
||||
lambda_packet: 0.0
|
||||
packet_mask_ratio: 0.5
|
||||
|
||||
device: auto
|
||||
@@ -1,41 +0,0 @@
|
||||
save_dir: /home/chy/JANUS/artifacts/runs/unified_cfm_cicddos2019_within_2026_04_25
|
||||
|
||||
source_store: /home/chy/JANUS/datasets/cicddos2019/processed/full_store
|
||||
flows_parquet: /home/chy/JANUS/datasets/cicddos2019/processed/flows.parquet
|
||||
flow_features_path: /home/chy/JANUS/datasets/cicddos2019/processed/flow_features.parquet
|
||||
flow_features_align: auto
|
||||
|
||||
T: 64
|
||||
n_train: 10000
|
||||
min_len: 2
|
||||
packet_preprocess: mixed_dequant
|
||||
seed: 42
|
||||
data_seed: 42
|
||||
train_ratio: 0.8
|
||||
benign_label: normal
|
||||
|
||||
val_cap: 20000
|
||||
attack_cap: 20000
|
||||
|
||||
d_model: 128
|
||||
n_layers: 4
|
||||
n_heads: 4
|
||||
mlp_ratio: 4.0
|
||||
time_dim: 64
|
||||
token_dim:
|
||||
|
||||
batch_size: 256
|
||||
num_workers: 0
|
||||
epochs: 50
|
||||
lr: 3.0e-4
|
||||
weight_decay: 0.01
|
||||
grad_clip: 1.0
|
||||
eval_every: 10
|
||||
eval_n: 10000
|
||||
eval_batch_size: 512
|
||||
eval_n_steps: 8
|
||||
|
||||
sigma: 0.1
|
||||
use_ot: true
|
||||
|
||||
device: auto
|
||||
@@ -1,43 +0,0 @@
|
||||
save_dir: /home/chy/JANUS/artifacts/runs/unified_cfm_cicddos2019_within_consistency_2026_04_25
|
||||
source_store: /home/chy/JANUS/datasets/cicddos2019/processed/full_store
|
||||
flows_parquet: /home/chy/JANUS/datasets/cicddos2019/processed/flows.parquet
|
||||
flow_features_path: /home/chy/JANUS/datasets/cicddos2019/processed/flow_features.parquet
|
||||
flow_features_align: auto
|
||||
|
||||
T: 64
|
||||
n_train: 10000
|
||||
min_len: 2
|
||||
packet_preprocess: mixed_dequant
|
||||
seed: 42
|
||||
data_seed: 42
|
||||
train_ratio: 0.8
|
||||
benign_label: normal
|
||||
val_cap: 20000
|
||||
attack_cap: 20000
|
||||
|
||||
d_model: 128
|
||||
n_layers: 4
|
||||
n_heads: 4
|
||||
mlp_ratio: 4.0
|
||||
time_dim: 64
|
||||
token_dim:
|
||||
|
||||
batch_size: 256
|
||||
num_workers: 0
|
||||
epochs: 50
|
||||
lr: 3.0e-4
|
||||
weight_decay: 0.01
|
||||
grad_clip: 1.0
|
||||
eval_every: 10
|
||||
eval_n: 10000
|
||||
eval_batch_size: 512
|
||||
eval_n_steps: 8
|
||||
|
||||
sigma: 0.1
|
||||
use_ot: true
|
||||
|
||||
lambda_flow: 0.1
|
||||
lambda_packet: 0.1
|
||||
packet_mask_ratio: 0.5
|
||||
|
||||
device: auto
|
||||
@@ -1,38 +0,0 @@
|
||||
save_dir: /home/chy/JANUS/artifacts/runs/unified_cfm_cicids2017_canonical_2026_04_24
|
||||
|
||||
packets_npz: /home/chy/JANUS/datasets/cicids2017/processed/packets.npz
|
||||
flows_parquet: /home/chy/JANUS/datasets/cicids2017/processed/flows.parquet
|
||||
flow_features_path: /home/chy/JANUS/datasets/cicids2017/processed/flow_features.parquet
|
||||
flow_features_align: auto
|
||||
|
||||
T: 64
|
||||
n_train: 10000
|
||||
min_len: 2
|
||||
packet_preprocess: mixed_dequant
|
||||
seed: 42
|
||||
data_seed: 42
|
||||
train_ratio: 0.8
|
||||
benign_label: normal
|
||||
|
||||
d_model: 128
|
||||
n_layers: 4
|
||||
n_heads: 4
|
||||
mlp_ratio: 4.0
|
||||
time_dim: 64
|
||||
token_dim:
|
||||
|
||||
batch_size: 256
|
||||
num_workers: 2
|
||||
epochs: 50
|
||||
lr: 3.0e-4
|
||||
weight_decay: 0.01
|
||||
grad_clip: 1.0
|
||||
eval_every: 10
|
||||
eval_n: 20000
|
||||
eval_batch_size: 512
|
||||
eval_n_steps: 8
|
||||
|
||||
sigma: 0.1
|
||||
use_ot: true
|
||||
|
||||
device: auto
|
||||
@@ -1,43 +0,0 @@
|
||||
|
||||
save_dir: /home/chy/JANUS/artifacts/runs/unified_cfm_cicids2017_consistency_2026_04_25
|
||||
|
||||
packets_npz: /home/chy/JANUS/datasets/cicids2017/processed/packets.npz
|
||||
flows_parquet: /home/chy/JANUS/datasets/cicids2017/processed/flows.parquet
|
||||
flow_features_path: /home/chy/JANUS/datasets/cicids2017/processed/flow_features.parquet
|
||||
flow_features_align: auto
|
||||
|
||||
T: 64
|
||||
n_train: 10000
|
||||
min_len: 2
|
||||
packet_preprocess: mixed_dequant
|
||||
seed: 42
|
||||
data_seed: 42
|
||||
train_ratio: 0.8
|
||||
benign_label: normal
|
||||
|
||||
d_model: 128
|
||||
n_layers: 4
|
||||
n_heads: 4
|
||||
mlp_ratio: 4.0
|
||||
time_dim: 64
|
||||
token_dim:
|
||||
|
||||
batch_size: 256
|
||||
num_workers: 2
|
||||
epochs: 50
|
||||
lr: 3.0e-4
|
||||
weight_decay: 0.01
|
||||
grad_clip: 1.0
|
||||
eval_every: 10
|
||||
eval_n: 20000
|
||||
eval_batch_size: 512
|
||||
eval_n_steps: 8
|
||||
|
||||
sigma: 0.1
|
||||
use_ot: true
|
||||
|
||||
lambda_flow: 0.1
|
||||
lambda_packet: 0.1
|
||||
packet_mask_ratio: 0.5
|
||||
|
||||
device: auto
|
||||
@@ -1,43 +0,0 @@
|
||||
|
||||
save_dir: /home/chy/JANUS/artifacts/runs/unified_cfm_ciciot2023_2026_04_29
|
||||
|
||||
source_store: /home/chy/JANUS/datasets/ciciot2023/processed/full_store
|
||||
flows_parquet: /home/chy/JANUS/datasets/ciciot2023/processed/full_store/flows.parquet
|
||||
flow_features_path: /home/chy/JANUS/datasets/ciciot2023/processed/flow_features.parquet
|
||||
flow_features_align: auto
|
||||
|
||||
T: 64
|
||||
n_train: 10000
|
||||
min_len: 2
|
||||
packet_preprocess: mixed_dequant
|
||||
seed: 42
|
||||
data_seed: 42
|
||||
train_ratio: 0.8
|
||||
benign_label: normal
|
||||
val_cap: 10000
|
||||
|
||||
d_model: 128
|
||||
n_layers: 4
|
||||
n_heads: 4
|
||||
mlp_ratio: 4.0
|
||||
time_dim: 64
|
||||
token_dim:
|
||||
|
||||
batch_size: 256
|
||||
num_workers: 0
|
||||
epochs: 50
|
||||
lr: 3.0e-4
|
||||
weight_decay: 0.01
|
||||
grad_clip: 1.0
|
||||
eval_every: 10
|
||||
eval_n: 20000
|
||||
eval_batch_size: 512
|
||||
eval_n_steps: 8
|
||||
|
||||
sigma: 0.1
|
||||
use_ot: true
|
||||
lambda_flow: 0.3
|
||||
lambda_packet: 0.3
|
||||
packet_mask_ratio: 0.5
|
||||
|
||||
device: auto
|
||||
@@ -1,45 +0,0 @@
|
||||
|
||||
save_dir: /home/chy/JANUS/artifacts/route_comparison/baseline_ciciot2023_seed42
|
||||
|
||||
source_store: /home/chy/JANUS/datasets/ciciot2023/processed/full_store
|
||||
flows_parquet: /home/chy/JANUS/datasets/ciciot2023/processed/full_store/flows.parquet
|
||||
flow_features_path: /home/chy/JANUS/datasets/ciciot2023/processed/flow_features.parquet
|
||||
flow_features_align: auto
|
||||
|
||||
T: 64
|
||||
n_train: 10000
|
||||
min_len: 2
|
||||
packet_preprocess: mixed_dequant
|
||||
seed: 42
|
||||
data_seed: 42
|
||||
train_ratio: 0.8
|
||||
benign_label: normal
|
||||
val_cap: 10000
|
||||
attack_cap: 20000
|
||||
|
||||
d_model: 128
|
||||
n_layers: 4
|
||||
n_heads: 4
|
||||
mlp_ratio: 4.0
|
||||
time_dim: 64
|
||||
token_dim:
|
||||
reference_mode:
|
||||
|
||||
batch_size: 256
|
||||
num_workers: 0
|
||||
epochs: 50
|
||||
lr: 3.0e-4
|
||||
weight_decay: 0.01
|
||||
grad_clip: 1.0
|
||||
eval_every: 10
|
||||
eval_n: 20000
|
||||
eval_batch_size: 512
|
||||
eval_n_steps: 8
|
||||
|
||||
sigma: 0.1
|
||||
use_ot: true
|
||||
lambda_flow: 0.3
|
||||
lambda_packet: 0.3
|
||||
packet_mask_ratio: 0.5
|
||||
|
||||
device: auto
|
||||
@@ -1,45 +0,0 @@
|
||||
|
||||
save_dir: /home/chy/JANUS/artifacts/route_comparison/baseline_ciciot2023_seed43
|
||||
|
||||
source_store: /home/chy/JANUS/datasets/ciciot2023/processed/full_store
|
||||
flows_parquet: /home/chy/JANUS/datasets/ciciot2023/processed/full_store/flows.parquet
|
||||
flow_features_path: /home/chy/JANUS/datasets/ciciot2023/processed/flow_features.parquet
|
||||
flow_features_align: auto
|
||||
|
||||
T: 64
|
||||
n_train: 10000
|
||||
min_len: 2
|
||||
packet_preprocess: mixed_dequant
|
||||
seed: 43
|
||||
data_seed: 43
|
||||
train_ratio: 0.8
|
||||
benign_label: normal
|
||||
val_cap: 10000
|
||||
attack_cap: 20000
|
||||
|
||||
d_model: 128
|
||||
n_layers: 4
|
||||
n_heads: 4
|
||||
mlp_ratio: 4.0
|
||||
time_dim: 64
|
||||
token_dim:
|
||||
reference_mode:
|
||||
|
||||
batch_size: 256
|
||||
num_workers: 0
|
||||
epochs: 50
|
||||
lr: 3.0e-4
|
||||
weight_decay: 0.01
|
||||
grad_clip: 1.0
|
||||
eval_every: 10
|
||||
eval_n: 20000
|
||||
eval_batch_size: 512
|
||||
eval_n_steps: 8
|
||||
|
||||
sigma: 0.1
|
||||
use_ot: true
|
||||
lambda_flow: 0.3
|
||||
lambda_packet: 0.3
|
||||
packet_mask_ratio: 0.5
|
||||
|
||||
device: auto
|
||||
@@ -1,45 +0,0 @@
|
||||
|
||||
save_dir: /home/chy/JANUS/artifacts/route_comparison/baseline_ciciot2023_seed44
|
||||
|
||||
source_store: /home/chy/JANUS/datasets/ciciot2023/processed/full_store
|
||||
flows_parquet: /home/chy/JANUS/datasets/ciciot2023/processed/full_store/flows.parquet
|
||||
flow_features_path: /home/chy/JANUS/datasets/ciciot2023/processed/flow_features.parquet
|
||||
flow_features_align: auto
|
||||
|
||||
T: 64
|
||||
n_train: 10000
|
||||
min_len: 2
|
||||
packet_preprocess: mixed_dequant
|
||||
seed: 44
|
||||
data_seed: 44
|
||||
train_ratio: 0.8
|
||||
benign_label: normal
|
||||
val_cap: 10000
|
||||
attack_cap: 20000
|
||||
|
||||
d_model: 128
|
||||
n_layers: 4
|
||||
n_heads: 4
|
||||
mlp_ratio: 4.0
|
||||
time_dim: 64
|
||||
token_dim:
|
||||
reference_mode:
|
||||
|
||||
batch_size: 256
|
||||
num_workers: 0
|
||||
epochs: 50
|
||||
lr: 3.0e-4
|
||||
weight_decay: 0.01
|
||||
grad_clip: 1.0
|
||||
eval_every: 10
|
||||
eval_n: 20000
|
||||
eval_batch_size: 512
|
||||
eval_n_steps: 8
|
||||
|
||||
sigma: 0.1
|
||||
use_ot: true
|
||||
lambda_flow: 0.3
|
||||
lambda_packet: 0.3
|
||||
packet_mask_ratio: 0.5
|
||||
|
||||
device: auto
|
||||
@@ -1,45 +0,0 @@
|
||||
|
||||
save_dir: /home/chy/JANUS/artifacts/route_comparison/route_a_causal_ciciot2023_seed42
|
||||
|
||||
source_store: /home/chy/JANUS/datasets/ciciot2023/processed/full_store
|
||||
flows_parquet: /home/chy/JANUS/datasets/ciciot2023/processed/full_store/flows.parquet
|
||||
flow_features_path: /home/chy/JANUS/datasets/ciciot2023/processed/flow_features.parquet
|
||||
flow_features_align: auto
|
||||
|
||||
T: 64
|
||||
n_train: 10000
|
||||
min_len: 2
|
||||
packet_preprocess: mixed_dequant
|
||||
seed: 42
|
||||
data_seed: 42
|
||||
train_ratio: 0.8
|
||||
benign_label: normal
|
||||
val_cap: 10000
|
||||
attack_cap: 20000
|
||||
|
||||
d_model: 128
|
||||
n_layers: 4
|
||||
n_heads: 4
|
||||
mlp_ratio: 4.0
|
||||
time_dim: 64
|
||||
token_dim:
|
||||
reference_mode: causal_packets
|
||||
|
||||
batch_size: 256
|
||||
num_workers: 0
|
||||
epochs: 50
|
||||
lr: 3.0e-4
|
||||
weight_decay: 0.01
|
||||
grad_clip: 1.0
|
||||
eval_every: 10
|
||||
eval_n: 20000
|
||||
eval_batch_size: 512
|
||||
eval_n_steps: 8
|
||||
|
||||
sigma: 0.1
|
||||
use_ot: true
|
||||
lambda_flow: 0.3
|
||||
lambda_packet: 0.3
|
||||
packet_mask_ratio: 0.5
|
||||
|
||||
device: auto
|
||||
@@ -1,45 +0,0 @@
|
||||
|
||||
save_dir: /home/chy/JANUS/artifacts/route_comparison/route_a_causal_ciciot2023_seed43
|
||||
|
||||
source_store: /home/chy/JANUS/datasets/ciciot2023/processed/full_store
|
||||
flows_parquet: /home/chy/JANUS/datasets/ciciot2023/processed/full_store/flows.parquet
|
||||
flow_features_path: /home/chy/JANUS/datasets/ciciot2023/processed/flow_features.parquet
|
||||
flow_features_align: auto
|
||||
|
||||
T: 64
|
||||
n_train: 10000
|
||||
min_len: 2
|
||||
packet_preprocess: mixed_dequant
|
||||
seed: 43
|
||||
data_seed: 43
|
||||
train_ratio: 0.8
|
||||
benign_label: normal
|
||||
val_cap: 10000
|
||||
attack_cap: 20000
|
||||
|
||||
d_model: 128
|
||||
n_layers: 4
|
||||
n_heads: 4
|
||||
mlp_ratio: 4.0
|
||||
time_dim: 64
|
||||
token_dim:
|
||||
reference_mode: causal_packets
|
||||
|
||||
batch_size: 256
|
||||
num_workers: 0
|
||||
epochs: 50
|
||||
lr: 3.0e-4
|
||||
weight_decay: 0.01
|
||||
grad_clip: 1.0
|
||||
eval_every: 10
|
||||
eval_n: 20000
|
||||
eval_batch_size: 512
|
||||
eval_n_steps: 8
|
||||
|
||||
sigma: 0.1
|
||||
use_ot: true
|
||||
lambda_flow: 0.3
|
||||
lambda_packet: 0.3
|
||||
packet_mask_ratio: 0.5
|
||||
|
||||
device: auto
|
||||
@@ -1,45 +0,0 @@
|
||||
|
||||
save_dir: /home/chy/JANUS/artifacts/route_comparison/route_a_causal_ciciot2023_seed44
|
||||
|
||||
source_store: /home/chy/JANUS/datasets/ciciot2023/processed/full_store
|
||||
flows_parquet: /home/chy/JANUS/datasets/ciciot2023/processed/full_store/flows.parquet
|
||||
flow_features_path: /home/chy/JANUS/datasets/ciciot2023/processed/flow_features.parquet
|
||||
flow_features_align: auto
|
||||
|
||||
T: 64
|
||||
n_train: 10000
|
||||
min_len: 2
|
||||
packet_preprocess: mixed_dequant
|
||||
seed: 44
|
||||
data_seed: 44
|
||||
train_ratio: 0.8
|
||||
benign_label: normal
|
||||
val_cap: 10000
|
||||
attack_cap: 20000
|
||||
|
||||
d_model: 128
|
||||
n_layers: 4
|
||||
n_heads: 4
|
||||
mlp_ratio: 4.0
|
||||
time_dim: 64
|
||||
token_dim:
|
||||
reference_mode: causal_packets
|
||||
|
||||
batch_size: 256
|
||||
num_workers: 0
|
||||
epochs: 50
|
||||
lr: 3.0e-4
|
||||
weight_decay: 0.01
|
||||
grad_clip: 1.0
|
||||
eval_every: 10
|
||||
eval_n: 20000
|
||||
eval_batch_size: 512
|
||||
eval_n_steps: 8
|
||||
|
||||
sigma: 0.1
|
||||
use_ot: true
|
||||
lambda_flow: 0.3
|
||||
lambda_packet: 0.3
|
||||
packet_mask_ratio: 0.5
|
||||
|
||||
device: auto
|
||||
@@ -1,44 +0,0 @@
|
||||
|
||||
save_dir: /home/chy/JANUS/artifacts/route_comparison/route_b_spectral_ciciot2023_seed42
|
||||
|
||||
source_store: /home/chy/JANUS/datasets/ciciot2023/processed/full_store
|
||||
flows_parquet: /home/chy/JANUS/datasets/ciciot2023/processed/full_store/flows.parquet
|
||||
flow_features_path: /home/chy/JANUS/datasets/ciciot2023/processed/flow_features_spectral.parquet
|
||||
flow_features_align: auto
|
||||
|
||||
T: 64
|
||||
n_train: 10000
|
||||
min_len: 2
|
||||
packet_preprocess: mixed_dequant
|
||||
seed: 42
|
||||
data_seed: 42
|
||||
train_ratio: 0.8
|
||||
benign_label: normal
|
||||
val_cap: 10000
|
||||
attack_cap: 20000
|
||||
|
||||
d_model: 128
|
||||
n_layers: 4
|
||||
n_heads: 4
|
||||
mlp_ratio: 4.0
|
||||
time_dim: 64
|
||||
token_dim:
|
||||
|
||||
batch_size: 256
|
||||
num_workers: 0
|
||||
epochs: 50
|
||||
lr: 3.0e-4
|
||||
weight_decay: 0.01
|
||||
grad_clip: 1.0
|
||||
eval_every: 10
|
||||
eval_n: 20000
|
||||
eval_batch_size: 512
|
||||
eval_n_steps: 8
|
||||
|
||||
sigma: 0.1
|
||||
use_ot: true
|
||||
lambda_flow: 0.3
|
||||
lambda_packet: 0.3
|
||||
packet_mask_ratio: 0.5
|
||||
|
||||
device: auto
|
||||
@@ -1,44 +0,0 @@
|
||||
|
||||
save_dir: /home/chy/JANUS/artifacts/route_comparison/route_b_spectral_ciciot2023_seed43
|
||||
|
||||
source_store: /home/chy/JANUS/datasets/ciciot2023/processed/full_store
|
||||
flows_parquet: /home/chy/JANUS/datasets/ciciot2023/processed/full_store/flows.parquet
|
||||
flow_features_path: /home/chy/JANUS/datasets/ciciot2023/processed/flow_features_spectral.parquet
|
||||
flow_features_align: auto
|
||||
|
||||
T: 64
|
||||
n_train: 10000
|
||||
min_len: 2
|
||||
packet_preprocess: mixed_dequant
|
||||
seed: 43
|
||||
data_seed: 43
|
||||
train_ratio: 0.8
|
||||
benign_label: normal
|
||||
val_cap: 10000
|
||||
attack_cap: 20000
|
||||
|
||||
d_model: 128
|
||||
n_layers: 4
|
||||
n_heads: 4
|
||||
mlp_ratio: 4.0
|
||||
time_dim: 64
|
||||
token_dim:
|
||||
|
||||
batch_size: 256
|
||||
num_workers: 0
|
||||
epochs: 50
|
||||
lr: 3.0e-4
|
||||
weight_decay: 0.01
|
||||
grad_clip: 1.0
|
||||
eval_every: 10
|
||||
eval_n: 20000
|
||||
eval_batch_size: 512
|
||||
eval_n_steps: 8
|
||||
|
||||
sigma: 0.1
|
||||
use_ot: true
|
||||
lambda_flow: 0.3
|
||||
lambda_packet: 0.3
|
||||
packet_mask_ratio: 0.5
|
||||
|
||||
device: auto
|
||||
@@ -1,44 +0,0 @@
save_dir: /home/chy/JANUS/artifacts/route_comparison/route_b_spectral_ciciot2023_seed44

source_store: /home/chy/JANUS/datasets/ciciot2023/processed/full_store
flows_parquet: /home/chy/JANUS/datasets/ciciot2023/processed/full_store/flows.parquet
flow_features_path: /home/chy/JANUS/datasets/ciciot2023/processed/flow_features_spectral.parquet
flow_features_align: auto

T: 64
n_train: 10000
min_len: 2
packet_preprocess: mixed_dequant
seed: 44
data_seed: 44
train_ratio: 0.8
benign_label: normal
val_cap: 10000
attack_cap: 20000

d_model: 128
n_layers: 4
n_heads: 4
mlp_ratio: 4.0
time_dim: 64
token_dim:

batch_size: 256
num_workers: 0
epochs: 50
lr: 3.0e-4
weight_decay: 0.01
grad_clip: 1.0
eval_every: 10
eval_n: 20000
eval_batch_size: 512
eval_n_steps: 8

sigma: 0.1
use_ot: true
lambda_flow: 0.3
lambda_packet: 0.3
packet_mask_ratio: 0.5

device: auto
@@ -1,45 +0,0 @@
save_dir: /home/chy/JANUS/artifacts/runs/unified_cfm_ciciot2023_shafir5_2026_04_29

source_store: /home/chy/JANUS/datasets/ciciot2023/processed/full_store
flows_parquet: /home/chy/JANUS/datasets/ciciot2023/processed/full_store/flows.parquet
flow_features_path: /home/chy/JANUS/datasets/ciciot2023/processed/flow_features_shafir5.parquet
flow_feature_columns: ["HTTPS", "Protocol_Type", "Magnitude", "Variance", "fin_count"]
flow_features_align: auto

T: 64
n_train: 10000
min_len: 2
packet_preprocess: mixed_dequant
seed: 42
data_seed: 42
train_ratio: 0.8
benign_label: normal
val_cap: 10000

flow_dim: 5
d_model: 128
n_layers: 4
n_heads: 4
mlp_ratio: 4.0
time_dim: 64
token_dim:

batch_size: 256
num_workers: 0
epochs: 50
lr: 3.0e-4
weight_decay: 0.01
grad_clip: 1.0
eval_every: 10
eval_n: 20000
eval_batch_size: 512
eval_n_steps: 8

sigma: 0.1
use_ot: true
lambda_flow: 0.3
lambda_packet: 0.3
packet_mask_ratio: 0.5

device: auto
@@ -1,39 +0,0 @@
save_dir: /home/chy/JANUS/artifacts/runs/unified_cfm_iscxtor2016_2026_04_25

packets_npz: /home/chy/JANUS/datasets/iscxtor2016/processed/packets.npz
flows_parquet: /home/chy/JANUS/datasets/iscxtor2016/processed/flows.parquet
flow_features_path: /home/chy/JANUS/datasets/iscxtor2016/processed/flow_features.parquet
flow_features_align: auto

T: 64
n_train: 10000
min_len: 2
packet_preprocess: mixed_dequant
seed: 42
data_seed: 42
train_ratio: 0.8
benign_label: nontor

d_model: 128
n_layers: 4
n_heads: 4
mlp_ratio: 4.0
time_dim: 64
token_dim:

batch_size: 256
num_workers: 2
epochs: 50
lr: 3.0e-4
weight_decay: 0.01
grad_clip: 1.0
eval_every: 10
eval_n: 20000
eval_batch_size: 512
eval_n_steps: 8

sigma: 0.1
use_ot: true

device: auto
@@ -1,41 +0,0 @@
save_dir: /home/chy/JANUS/artifacts/runs/unified_cfm_iscxtor2016_consistency_2026_04_25
packets_npz: /home/chy/JANUS/datasets/iscxtor2016/processed/packets.npz
flows_parquet: /home/chy/JANUS/datasets/iscxtor2016/processed/flows.parquet
flow_features_path: /home/chy/JANUS/datasets/iscxtor2016/processed/flow_features.parquet
flow_features_align: auto

T: 64
n_train: 10000
min_len: 2
packet_preprocess: mixed_dequant
seed: 42
data_seed: 42
train_ratio: 0.8
benign_label: nontor

d_model: 128
n_layers: 4
n_heads: 4
mlp_ratio: 4.0
time_dim: 64
token_dim:

batch_size: 256
num_workers: 2
epochs: 50
lr: 3.0e-4
weight_decay: 0.01
grad_clip: 1.0
eval_every: 10
eval_n: 20000
eval_batch_size: 512
eval_n_steps: 8

sigma: 0.1
use_ot: true

lambda_flow: 0.1
lambda_packet: 0.1
packet_mask_ratio: 0.5

device: auto
@@ -1,275 +0,0 @@
from __future__ import annotations
from dataclasses import dataclass
from pathlib import Path
from typing import Optional
import numpy as np
import pandas as pd
import sys as _sys
from pathlib import Path as _Path
_sys.path.insert(0, str(_Path(__file__).resolve().parents[1]))
from common.data_contract import PACKET_FEATURE_NAMES, PACKET_CONTINUOUS_CHANNEL_IDX as CONTINUOUS_CHANNEL_IDX, PACKET_BINARY_CHANNEL_IDX as BINARY_CHANNEL_IDX, canonical_5tuple as _canonical_key, fit_packet_stats as _fit_packet_stats, zscore as _zscore, apply_mixed_dequant as _apply_mixed_dequant
DEFAULT_FLOW_META_COLUMNS = {'flow_id', 'label', 'day', 'service', 'src_ip', 'dst_ip', 'src_port', 'dst_port', 'protocol', 'timestamp', 'start_ts', 'n_pkts'}
DERIVED_FLOW_FEATURE_NAMES = ('log_len', 'fwd_frac', 'bwd_frac', 'log_size_mean', 'log_size_std', 'log_size_min', 'log_size_max', 'log_dt_mean', 'log_dt_std', 'log_dt_max', 'syn_frac', 'fin_frac', 'rst_frac', 'psh_frac', 'ack_frac', 'log_win_mean')

@dataclass
class UnifiedData:
    train_flow: np.ndarray
    val_flow: np.ndarray
    attack_flow: np.ndarray
    train_packets: np.ndarray
    val_packets: np.ndarray
    attack_packets: np.ndarray
    train_len: np.ndarray
    val_len: np.ndarray
    attack_len: np.ndarray
    attack_labels: np.ndarray
    packet_mean: np.ndarray
    packet_std: np.ndarray
    flow_mean: np.ndarray
    flow_std: np.ndarray
    packet_preprocess: str
    flow_feature_names: tuple[str, ...]
    packet_feature_names: tuple[str, ...] = PACKET_FEATURE_NAMES

    @property
    def T(self) -> int:
        return int(self.train_packets.shape[1])

    @property
    def packet_dim(self) -> int:
        return int(self.train_packets.shape[2])

    @property
    def flow_dim(self) -> int:
        return int(self.train_flow.shape[1])

def _preprocess_packets(train_x: np.ndarray, val_x: np.ndarray, attack_x: np.ndarray, train_l: np.ndarray, val_l: np.ndarray, attack_l: np.ndarray, preprocess: str, seed: int) -> tuple[np.ndarray, np.ndarray, np.ndarray, np.ndarray, np.ndarray]:
    if preprocess not in ('zscore', 'mixed_dequant'):
        raise ValueError("packet_preprocess must be 'zscore' or 'mixed_dequant'")
    (mean, std) = _fit_packet_stats(train_x, train_l)

    def prep(x: np.ndarray, l: np.ndarray, tag: str) -> np.ndarray:
        if preprocess == 'zscore':
            z = _zscore(x, mean, std)
            mask = np.arange(x.shape[1])[None, :] < l[:, None]
            return (z * mask[:, :, None]).astype(np.float32)
        return _apply_mixed_dequant(x, l, mean, std, split_tag=tag, seed=seed)
    return (prep(train_x, train_l, 'train'), prep(val_x, val_l, 'val'), prep(attack_x, attack_l, 'attack'), mean, std)

def _derive_flow_features(tokens: np.ndarray, lens: np.ndarray) -> np.ndarray:
    (N, T, _) = tokens.shape
    out = np.zeros((N, len(DERIVED_FLOW_FEATURE_NAMES)), dtype=np.float32)
    for i in range(N):
        n = int(max(lens[i], 1))
        x = tokens[i, :n]
        direction = x[:, 2]
        size = x[:, 0]
        dt = x[:, 1]
        win = x[:, 8]
        out[i, 0] = np.log1p(n)
        out[i, 1] = np.mean(direction < 0.5)
        out[i, 2] = np.mean(direction >= 0.5)
        out[i, 3] = size.mean()
        out[i, 4] = size.std()
        out[i, 5] = size.min()
        out[i, 6] = size.max()
        out[i, 7] = dt.mean()
        out[i, 8] = dt.std()
        out[i, 9] = dt.max()
        out[i, 10] = x[:, 3].mean()
        out[i, 11] = x[:, 4].mean()
        out[i, 12] = x[:, 5].mean()
        out[i, 13] = x[:, 6].mean()
        out[i, 14] = x[:, 7].mean()
        out[i, 15] = win.mean()
    return out

def _read_flow_features(path: Path, *, expected_rows: int, feature_columns: Optional[list[str]]=None) -> tuple[np.ndarray, tuple[str, ...], np.ndarray | None]:
    path = Path(path)
    if path.suffix == '.npz':
        data = np.load(path, allow_pickle=True)
        x = data['features'].astype(np.float32)
        raw_names = data['feature_names'] if 'feature_names' in data.files else np.arange(x.shape[1])
        names = tuple((str(v) for v in raw_names))
        flow_id = data['flow_id'] if 'flow_id' in data.files else None
    elif path.suffix in ('.parquet', '.pq'):
        df = pd.read_parquet(path)
        flow_id = df['flow_id'].to_numpy() if 'flow_id' in df.columns else None
        if feature_columns:
            cols = feature_columns
        else:
            cols = [c for c in df.columns if c not in DEFAULT_FLOW_META_COLUMNS and pd.api.types.is_numeric_dtype(df[c])]
        if not cols:
            raise ValueError(f'no numeric flow feature columns found in {path}')
        x = df[cols].to_numpy(dtype=np.float32)
        names = tuple(cols)
    else:
        raise ValueError(f'unsupported flow feature file: {path}')
    if len(x) != expected_rows:
        raise ValueError(f'flow feature row count {len(x):,} != packet row count {expected_rows:,}')
    x = np.nan_to_num(x, nan=0.0, posinf=0.0, neginf=0.0).astype(np.float32)
    return (x, names, flow_id)

def _feature_columns_from_df(df: pd.DataFrame, requested: Optional[list[str]]) -> list[str]:
    if requested:
        return requested
    return [c for c in df.columns if c not in DEFAULT_FLOW_META_COLUMNS and pd.api.types.is_numeric_dtype(df[c])]

def _align_flow_features_by_scan(feature_df: pd.DataFrame, packet_flows: pd.DataFrame, *, feature_columns: list[str]) -> tuple[np.ndarray, tuple[str, ...]]:
    required = ['label', 'src_ip', 'src_port', 'dst_ip', 'dst_port', 'protocol']
    missing_feature = [c for c in required if c not in feature_df.columns]
    missing_packet = [c for c in required if c not in packet_flows.columns]
    if missing_feature or missing_packet:
        raise ValueError(f'scan alignment requires label + 5-tuple metadata. missing in feature_df={missing_feature}, packet_flows={missing_packet}')
    packet_keys = [(str(lbl), _canonical_key(src, sp, dst, dp, proto)) for (lbl, src, sp, dst, dp, proto) in zip(packet_flows['label'].to_numpy(), packet_flows['src_ip'].to_numpy(), packet_flows['src_port'].to_numpy(), packet_flows['dst_ip'].to_numpy(), packet_flows['dst_port'].to_numpy(), packet_flows['protocol'].to_numpy())]
    labels = feature_df['label'].to_numpy()
    src_ip = feature_df['src_ip'].to_numpy()
    src_port = feature_df['src_port'].to_numpy()
    dst_ip = feature_df['dst_ip'].to_numpy()
    dst_port = feature_df['dst_port'].to_numpy()
    protocol = feature_df['protocol'].to_numpy()
    matched: list[int] = []
    j = 0
    n_csv = len(feature_df)
    for (i, target) in enumerate(packet_keys):
        while j < n_csv:
            cand = (str(labels[j]), _canonical_key(src_ip[j], src_port[j], dst_ip[j], dst_port[j], protocol[j]))
            j += 1
            if cand == target:
                matched.append(j - 1)
                break
        else:
            raise ValueError(f'failed to align packet flow row {i:,}/{len(packet_keys):,}; the CSV cache may not be the same one used for packet extraction')
    print(f'[data] scan-aligned CSV flow features: matched={len(matched):,} from csv_rows={n_csv:,} skipped={matched[-1] + 1 - len(matched):,}')
    x = feature_df.iloc[matched][feature_columns].to_numpy(dtype=np.float32)
    x = np.nan_to_num(x, nan=0.0, posinf=0.0, neginf=0.0).astype(np.float32)
    return (x, tuple(feature_columns))

def _read_aligned_flow_features(path: Path, packet_flows: pd.DataFrame, *, feature_columns: Optional[list[str]]=None, align: str='auto') -> tuple[np.ndarray, tuple[str, ...]]:
    path = Path(path)
    if align not in ('auto', 'row', 'scan'):
        raise ValueError("flow_features_align must be 'auto', 'row', or 'scan'")
    if path.suffix == '.npz':
        (x, names, flow_id) = _read_flow_features(path, expected_rows=len(packet_flows), feature_columns=feature_columns)
        packet_id = packet_flows['flow_id'].to_numpy() if 'flow_id' in packet_flows else None
        if flow_id is not None and packet_id is not None and (not np.array_equal(flow_id, packet_id)):
            raise ValueError('NPZ flow_id does not align with Packet_CFM flows')
        return (x, names)
    if path.suffix not in ('.parquet', '.pq'):
        raise ValueError(f'unsupported flow feature file: {path}')
    feature_df = pd.read_parquet(path)
    cols = _feature_columns_from_df(feature_df, feature_columns)
    if not cols:
        raise ValueError(f'no numeric flow feature columns found in {path}')
    packet_id = packet_flows['flow_id'].to_numpy() if 'flow_id' in packet_flows else None
    if len(feature_df) == len(packet_flows):
        feature_id = feature_df['flow_id'].to_numpy() if 'flow_id' in feature_df.columns else None
        if feature_id is None or packet_id is None or np.array_equal(feature_id, packet_id):
            x = feature_df[cols].to_numpy(dtype=np.float32)
            x = np.nan_to_num(x, nan=0.0, posinf=0.0, neginf=0.0).astype(np.float32)
            return (x, tuple(cols))
        if align == 'row':
            raise ValueError("flow_id mismatch with flow_features_align='row'")
    if align == 'row':
        raise ValueError(f'row alignment requested but feature rows={len(feature_df):,} packet rows={len(packet_flows):,}')
    return _align_flow_features_by_scan(feature_df, packet_flows, feature_columns=cols)

def _preprocess_flow(train: np.ndarray, val: np.ndarray, attack: np.ndarray) -> tuple[np.ndarray, np.ndarray, np.ndarray, np.ndarray, np.ndarray]:
    mean = train.mean(axis=0).astype(np.float32)
    std = train.std(axis=0).astype(np.float32)
    return (_zscore(train, mean, std), _zscore(val, mean, std), _zscore(attack, mean, std), mean, std)

def load_unified_data(*, packets_npz: Path | None=None, source_store: Path | None=None, flows_parquet: Path, flow_features_path: Path | None=None, flow_feature_columns: Optional[list[str]]=None, flow_features_align: str='auto', T: int=128, split_seed: int=42, train_ratio: float=0.8, benign_label: str='normal', min_len: int=2, packet_preprocess: str='mixed_dequant', attack_cap: int | None=None, val_cap: int | None=None) -> UnifiedData:
    if (packets_npz is None) == (source_store is None):
        raise ValueError('pass exactly one of packets_npz or source_store')
    flows_parquet = Path(flows_parquet)
    print(f'[data] flows={flows_parquet} packets_source={(packets_npz if packets_npz else source_store)}')
    flow_cols = ['flow_id', 'label']
    if flow_features_path is not None:
        flow_cols += ['src_ip', 'src_port', 'dst_ip', 'dst_port', 'protocol']
    flows = pd.read_parquet(flows_parquet, columns=flow_cols)
    labels_full = flows['label'].to_numpy().astype(str)
    flow_id = flows['flow_id'].to_numpy()
    tokens_full: np.ndarray | None = None
    store = None
    if packets_npz is not None:
        pz = np.load(Path(packets_npz))
        tokens_full = pz['packet_tokens'].astype(np.float32)
        lens_full = pz['packet_lengths'].astype(np.int32)
        packet_flow_id = pz['flow_id'] if 'flow_id' in pz.files else None
        if T > tokens_full.shape[1]:
            raise ValueError(f'requested T={T} > stored T_full={tokens_full.shape[1]}')
        tokens_full = tokens_full[:, :T].copy()
        lens_full = np.minimum(lens_full, T).astype(np.int32)
        if packet_flow_id is not None and (not np.array_equal(packet_flow_id, flow_id)):
            raise ValueError('packets_npz and flows_parquet are not row-aligned by flow_id')
    else:
        if flow_features_path is None:
            raise ValueError('source_store path requires flow_features_path (derived features need tokens in memory)')
        from common.packet_store import PacketShardStore
        store = PacketShardStore.open(Path(source_store))
        store_flow_id = store.read_flows(columns=['flow_id'])['flow_id'].to_numpy()
        if not np.array_equal(store_flow_id, flow_id):
            raise ValueError('source_store and flows_parquet are not row-aligned by flow_id')
        lens_full = np.minimum(store.manifest['packet_length'].to_numpy(dtype=np.int32), T)
    if flow_features_path is None:
        assert tokens_full is not None
        flow_features = _derive_flow_features(tokens_full, lens_full)
        flow_names = DERIVED_FLOW_FEATURE_NAMES
        print(f'[data] using derived flow features D={flow_features.shape[1]}')
    else:
        (flow_features, flow_names) = _read_aligned_flow_features(Path(flow_features_path), flows, feature_columns=flow_feature_columns, align=flow_features_align)
        print(f'[data] using external flow features D={flow_features.shape[1]}')
    keep = lens_full >= min_len
    labels = labels_full[keep]
    flow_features = flow_features[keep]
    lens = lens_full[keep]
    global_idx = np.flatnonzero(keep).astype(np.int64)
    if tokens_full is not None:
        materialized_tokens = tokens_full[keep]
    else:
        materialized_tokens = None
    print(f'[data] rows total={len(keep):,} keep len>={min_len}: {keep.sum():,}')
    benign_local = np.where(labels == benign_label)[0]
    attack_local = np.where(labels != benign_label)[0]
    rng = np.random.default_rng(split_seed)
    rng.shuffle(benign_local)
    n_train = int(len(benign_local) * train_ratio)
    train_local = benign_local[:n_train]
    val_local = benign_local[n_train:]
    if val_cap is not None and len(val_local) > val_cap:
        val_local = np.sort(rng.choice(val_local, size=val_cap, replace=False))
    if attack_cap is not None and len(attack_local) > attack_cap:
        attack_local = np.sort(rng.choice(attack_local, size=attack_cap, replace=False))
    print(f'[data] benign={len(benign_local):,} attack={len(attack_local):,} -> train={len(train_local):,} val={len(val_local):,}')

    def _materialize(local_indices: np.ndarray) -> np.ndarray:
        if materialized_tokens is not None:
            return materialized_tokens[local_indices].astype(np.float32, copy=False)
        assert store is not None
        g = global_idx[local_indices]
        (tok, _) = store.read_packets(g.astype(np.int64), T=T)
        return tok.astype(np.float32, copy=False)
    tr_p_raw = _materialize(train_local)
    va_p_raw = _materialize(val_local)
    at_p_raw = _materialize(attack_local)
    tr_l = lens[train_local]
    va_l = lens[val_local]
    at_l = lens[attack_local]
    tr_f_raw = flow_features[train_local]
    va_f_raw = flow_features[val_local]
    at_f_raw = flow_features[attack_local]
    train_idx = train_local
    val_idx = val_local
    attack_idx = attack_local
    (tr_p, va_p, at_p, p_mean, p_std) = _preprocess_packets(tr_p_raw, va_p_raw, at_p_raw, tr_l, va_l, at_l, preprocess=packet_preprocess, seed=split_seed)
    (tr_f, va_f, at_f, f_mean, f_std) = _preprocess_flow(tr_f_raw, va_f_raw, at_f_raw)
    return UnifiedData(train_flow=tr_f, val_flow=va_f, attack_flow=at_f, train_packets=tr_p, val_packets=va_p, attack_packets=at_p, train_len=tr_l, val_len=va_l, attack_len=at_l, attack_labels=labels[attack_idx], packet_mean=p_mean, packet_std=p_std, flow_mean=f_mean, flow_std=f_std, packet_preprocess=packet_preprocess, flow_feature_names=tuple(flow_names))

def subsample_train(data: UnifiedData, n_train: int, seed: int) -> tuple[np.ndarray, np.ndarray, np.ndarray]:
    if n_train <= 0 or n_train >= len(data.train_flow):
        return (data.train_flow, data.train_packets, data.train_len)
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(data.train_flow), n_train, replace=False)
    idx.sort()
    return (data.train_flow[idx], data.train_packets[idx], data.train_len[idx])
@@ -1,588 +0,0 @@
from __future__ import annotations
import math
from dataclasses import dataclass
import torch
import torch.nn as nn
from torchdiffeq import odeint

@torch.no_grad()
def _sinkhorn_coupling(C: torch.Tensor, reg: float=0.05, n_iter: int=20) -> torch.Tensor:
    C = C.float()
    log_k = -C / reg
    B = C.shape[0]
    log_u = torch.zeros(B, device=C.device)
    log_v = torch.zeros(B, device=C.device)
    for _ in range(n_iter):
        log_v = -torch.logsumexp(log_k + log_u.unsqueeze(1), dim=0)
        log_u = -torch.logsumexp(log_k + log_v.unsqueeze(0), dim=1)
    log_p = log_u.unsqueeze(1) + log_k + log_v.unsqueeze(0)
    return log_p.argmax(dim=1)

class SinusoidalTimeEmb(nn.Module):

    def __init__(self, dim: int) -> None:
        super().__init__()
        if dim % 2 != 0:
            raise ValueError('time embedding dimension must be even')
        self.dim = dim

    def forward(self, t: torch.Tensor) -> torch.Tensor:
        half = self.dim // 2
        freqs = torch.exp(-math.log(10000) * torch.arange(half, device=t.device, dtype=t.dtype) / max(half - 1, 1))
        args = t[:, None] * freqs[None, :]
        return torch.cat([args.sin(), args.cos()], dim=-1)

class AdaLNBlock(nn.Module):

    def __init__(self, d_model: int, n_heads: int, mlp_ratio: float, cond_dim: int) -> None:
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model, elementwise_affine=False)
        hidden = int(d_model * mlp_ratio)
        self.mlp = nn.Sequential(nn.Linear(d_model, hidden), nn.GELU(), nn.Linear(hidden, d_model))
        self.cond_proj = nn.Linear(cond_dim, 6 * d_model)
        nn.init.zeros_(self.cond_proj.weight)
        nn.init.zeros_(self.cond_proj.bias)

    @staticmethod
    def _modulate(x: torch.Tensor, gamma: torch.Tensor, beta: torch.Tensor) -> torch.Tensor:
        return x * (1.0 + gamma[:, None, :]) + beta[:, None, :]

    def forward(self, x: torch.Tensor, cond: torch.Tensor, key_padding_mask: torch.Tensor | None, attn_mask: torch.Tensor | None=None) -> torch.Tensor:
        (g1, b1, a1, g2, b2, a2) = self.cond_proj(cond).chunk(6, dim=-1)
        h = self._modulate(self.norm1(x), g1, b1)
        (attn_out, _) = self.attn(h, h, h, key_padding_mask=key_padding_mask, attn_mask=attn_mask, need_weights=False)
        x = x + a1[:, None, :] * attn_out
        h = self._modulate(self.norm2(x), g2, b2)
        return x + a2[:, None, :] * self.mlp(h)

class UnifiedVelocity(nn.Module):

    def __init__(self, token_dim: int, seq_len: int, d_model: int=128, n_layers: int=4, n_heads: int=4, mlp_ratio: float=4.0, time_dim: int=64, reference_mode: str | None=None) -> None:
        super().__init__()
        if reference_mode not in (None, 'independent_token', 'block_diagonal', 'causal_packets', 'causal_all'):
            raise ValueError(f'unknown reference_mode={reference_mode!r}')
        self.token_dim = token_dim
        self.seq_len = seq_len
        self.reference_mode = reference_mode
        self.input_proj = nn.Linear(token_dim, d_model)
        self.pos_emb = nn.Parameter(torch.zeros(1, seq_len, d_model))
        self.type_emb = nn.Embedding(2, d_model)
        nn.init.trunc_normal_(self.pos_emb, std=0.02)
        nn.init.normal_(self.type_emb.weight, std=0.02)
        self.time_emb = SinusoidalTimeEmb(time_dim)
        self.cond_mlp = nn.Sequential(nn.Linear(time_dim, d_model), nn.SiLU(), nn.Linear(d_model, d_model))
        self.blocks = nn.ModuleList([AdaLNBlock(d_model, n_heads, mlp_ratio, cond_dim=d_model) for _ in range(n_layers)])
        self.out_norm = nn.LayerNorm(d_model, elementwise_affine=False)
        self.out = nn.Linear(d_model, token_dim)
        nn.init.zeros_(self.out.weight)
        nn.init.zeros_(self.out.bias)
        type_ids = torch.ones(seq_len, dtype=torch.long)
        type_ids[0] = 0
        self.register_buffer('type_ids', type_ids, persistent=False)

    def forward(self, x: torch.Tensor, t: torch.Tensor, key_padding_mask: torch.Tensor | None=None, attn_mask_override: torch.Tensor | None=None) -> torch.Tensor:
        (B, L, _) = x.shape
        if L > self.seq_len:
            raise ValueError(f'sequence length {L} exceeds configured {self.seq_len}')
        if t.dim() == 0:
            t = t.expand(B)
        h = self.input_proj(x)
        h = h + self.pos_emb[:, :L, :]
        h = h + self.type_emb(self.type_ids[:L])[None, :, :]
        cond = self.cond_mlp(self.time_emb(t))
        if attn_mask_override is not None:
            attn_mask = attn_mask_override
        else:
            attn_mask = self._reference_attn_mask(L, x.device)
        for block in self.blocks:
            h = block(h, cond, key_padding_mask, attn_mask=attn_mask)
        return self.out(self.out_norm(h))

    def _reference_attn_mask(self, L: int, device: torch.device) -> torch.Tensor | None:
        if self.reference_mode is None:
            return None
        if self.reference_mode == 'independent_token':
            return ~torch.eye(L, dtype=torch.bool, device=device)
        if self.reference_mode == 'block_diagonal':
            mask = torch.ones((L, L), dtype=torch.bool, device=device)
            mask[0, 0] = False
            if L > 1:
                mask[1:, 1:] = False
            return mask
        if self.reference_mode == 'causal_packets':
            mask = torch.zeros((L, L), dtype=torch.bool, device=device)
            if L > 1:
                packet_causal = torch.triu(torch.ones(L - 1, L - 1, dtype=torch.bool, device=device), diagonal=1)
                mask[1:, 1:] = packet_causal
            return mask
        if self.reference_mode == 'causal_all':
            return torch.triu(torch.ones(L, L, dtype=torch.bool, device=device), diagonal=1)
        raise AssertionError(self.reference_mode)

@dataclass
class UnifiedCFMConfig:
    T: int = 128
    packet_dim: int = 9
    flow_dim: int = 16
    token_dim: int | None = None
    d_model: int = 128
    n_layers: int = 4
    n_heads: int = 4
    mlp_ratio: float = 4.0
    time_dim: int = 64
    sigma: float = 0.1
    use_ot: bool = False
    reference_mode: str | None = None

class UnifiedTokenCFM(nn.Module):

    def __init__(self, cfg: UnifiedCFMConfig) -> None:
        super().__init__()
        self.cfg = cfg
        self.token_dim = cfg.token_dim or 1 + max(cfg.flow_dim, cfg.packet_dim)
        if self.token_dim < 1 + max(cfg.flow_dim, cfg.packet_dim):
            raise ValueError('token_dim is too small for flow_dim/packet_dim')
        self.seq_len = cfg.T + 1
        self.velocity = UnifiedVelocity(token_dim=self.token_dim, seq_len=self.seq_len, d_model=cfg.d_model, n_layers=cfg.n_layers, n_heads=cfg.n_heads, mlp_ratio=cfg.mlp_ratio, time_dim=cfg.time_dim, reference_mode=cfg.reference_mode)

    def build_tokens(self, flow: torch.Tensor, packets: torch.Tensor) -> torch.Tensor:
        (B, T, Dp) = packets.shape
        if T != self.cfg.T:
            raise ValueError(f'packet T={T} but config T={self.cfg.T}')
        if Dp != self.cfg.packet_dim:
            raise ValueError(f'packet_dim={Dp} but config packet_dim={self.cfg.packet_dim}')
        if flow.shape[-1] != self.cfg.flow_dim:
            raise ValueError(f'flow_dim={flow.shape[-1]} but config flow_dim={self.cfg.flow_dim}')
        z = packets.new_zeros((B, T + 1, self.token_dim))
        z[:, 0, 0] = -1.0
        z[:, 0, 1:1 + self.cfg.flow_dim] = flow
        z[:, 1:, 0] = 1.0
        z[:, 1:, 1:1 + self.cfg.packet_dim] = packets
        return z

    def key_padding_mask(self, lens: torch.Tensor) -> torch.Tensor:
        B = lens.shape[0]
        idx = torch.arange(self.cfg.T, device=lens.device)[None, :]
        packet_real = idx < lens[:, None]
        real = torch.cat([torch.ones(B, 1, dtype=torch.bool, device=lens.device), packet_real], dim=1)
        return ~real

    def _loss_mask(self, lens: torch.Tensor) -> torch.Tensor:
        return (~self.key_padding_mask(lens)).float()

    @staticmethod
    def _masked_trimmed_mean(values: torch.Tensor, mask: torch.Tensor, trim_frac: float=0.1) -> torch.Tensor:
        out = values.new_zeros(values.shape[0])
        for i in range(values.shape[0]):
            v = values[i][mask[i] > 0]
            if v.numel() == 0:
                continue
            if v.numel() < 5:
                out[i] = v.mean()
                continue
            v_sorted = torch.sort(v).values
            lo = int(trim_frac * v_sorted.numel())
            hi = int((1.0 - trim_frac) * v_sorted.numel())
            if hi <= lo:
                out[i] = v_sorted.mean()
            else:
                out[i] = v_sorted[lo:hi].mean()
        return out

    @staticmethod
    def _masked_median(values: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        out = values.new_zeros(values.shape[0])
        for i in range(values.shape[0]):
            v = values[i][mask[i] > 0]
            if v.numel() == 0:
                continue
            v_sorted = torch.sort(v).values
            mid = v_sorted.numel() // 2
            if v_sorted.numel() % 2:
                out[i] = v_sorted[mid]
            else:
                out[i] = 0.5 * (v_sorted[mid - 1] + v_sorted[mid])
        return out

    def compute_loss(self, flow: torch.Tensor, packets: torch.Tensor, lens: torch.Tensor, *, lambda_flow: float=0.0, lambda_packet: float=0.0, packet_mask_ratio: float=0.5, return_components: bool=False) -> torch.Tensor | dict[str, torch.Tensor]:
        x1 = self.build_tokens(flow, packets)
        B = x1.shape[0]
        x0 = torch.randn_like(x1)
        mask = self._loss_mask(lens)
        kpm = mask == 0
        if self.cfg.use_ot:
            flat0 = (x0 * mask[:, :, None]).reshape(B, -1)
            flat1 = (x1 * mask[:, :, None]).reshape(B, -1)
            col = _sinkhorn_coupling(torch.cdist(flat0.float(), flat1.float()))
            x1 = x1[col]
            flow = flow[col]
            packets = packets[col]
            lens = lens[col]
            mask = self._loss_mask(lens)
            kpm = mask == 0
        t = torch.rand(B, device=x1.device)
        x_t = (1.0 - t[:, None, None]) * x0 + t[:, None, None] * x1
        if self.cfg.sigma > 0:
            std = self.cfg.sigma * torch.sqrt(t * (1.0 - t))[:, None, None]
            x_t = x_t + std * torch.randn_like(x_t)
        target = x1 - x0
        pred = self.velocity(x_t, t, key_padding_mask=kpm)
        sq = (pred - target).square().mean(dim=-1)
        per_sample = (sq * mask).sum(dim=-1) / mask.sum(dim=-1).clamp_min(1.0)
        main_loss = per_sample.mean()
        aux_flow_loss = x1.new_zeros(())
        aux_packet_loss = x1.new_zeros(())
        if lambda_flow > 0.0:
            x_t_mf = x_t.clone()
            x_t_mf[:, 0, :] = 0.0
            pred_mf = self.velocity(x_t_mf, t, key_padding_mask=kpm)
            err = (pred_mf[:, 0] - target[:, 0]).square().mean(dim=-1)
            aux_flow_loss = err.mean()
        if lambda_packet > 0.0:
            packet_real = mask[:, 1:] > 0
            rand_draw = torch.rand(packet_real.shape, device=x1.device)
            mask_pkt = (rand_draw < packet_mask_ratio) & packet_real
            pkt_mask_full = torch.cat([torch.zeros(B, 1, dtype=torch.bool, device=x1.device), mask_pkt], dim=1)
            x_t_mp = x_t.clone()
            x_t_mp[pkt_mask_full] = 0.0
            pred_mp = self.velocity(x_t_mp, t, key_padding_mask=kpm)
            sq_mp = (pred_mp - target).square().mean(dim=-1)
            mask_f = pkt_mask_full.float()
            denom = mask_f.sum(dim=-1).clamp_min(1.0)
            aux_packet_loss = ((sq_mp * mask_f).sum(dim=-1) / denom).mean()
        total = main_loss + lambda_flow * aux_flow_loss + lambda_packet * aux_packet_loss
        if return_components:
            return {'total': total, 'main': main_loss.detach(), 'aux_flow': aux_flow_loss.detach(), 'aux_packet': aux_packet_loss.detach()}
        return total

@torch.no_grad()
|
||||
def velocity_score(self, flow: torch.Tensor, packets: torch.Tensor, lens: torch.Tensor, t_eval: tuple[float, ...]=(0.5, 0.75, 1.0)) -> dict[str, torch.Tensor]:
|
||||
x = self.build_tokens(flow, packets)
|
||||
mask = self._loss_mask(lens)
|
||||
kpm = mask == 0
|
||||
total = torch.zeros(x.shape[0], device=x.device)
|
||||
flow_s = torch.zeros_like(total)
|
||||
packet_s = torch.zeros_like(total)
|
||||
packet_count = mask[:, 1:].sum(dim=-1).clamp_min(1.0)
|
||||
for t_val in t_eval:
|
||||
t = torch.full((x.shape[0],), float(t_val), device=x.device)
|
||||
v = self.velocity(x, t, key_padding_mask=kpm)
|
||||
e = v.square().mean(dim=-1)
|
||||
total = total + (e * mask).sum(dim=-1) / mask.sum(dim=-1).clamp_min(1.0)
|
||||
flow_s = flow_s + e[:, 0]
|
||||
packet_s = packet_s + (e[:, 1:] * mask[:, 1:]).sum(dim=-1) / packet_count
|
||||
denom = float(len(t_eval))
|
||||
return {'velocity_total': total / denom, 'velocity_flow': flow_s / denom, 'velocity_packet': packet_s / denom}

    @torch.no_grad()
    def trajectory_metrics(self, flow: torch.Tensor, packets: torch.Tensor, lens: torch.Tensor, n_steps: int=16) -> dict[str, torch.Tensor]:
        z = self.build_tokens(flow, packets)
        mask = self._loss_mask(lens)
        kpm = mask == 0
        B = z.shape[0]
        dt = 1.0 / n_steps
        total_arc = torch.zeros(B, device=z.device)
        total_ke = torch.zeros(B, device=z.device)
        flow_ke = torch.zeros(B, device=z.device)
        packet_ke = torch.zeros(B, device=z.device)
        total_curv = torch.zeros(B, device=z.device)
        flow_curv = torch.zeros(B, device=z.device)
        packet_curv = torch.zeros(B, device=z.device)
        packet_kappa2_speed2 = torch.zeros(B, max(z.shape[1] - 1, 0), device=z.device)
        packet_count = mask[:, 1:].sum(dim=-1).clamp_min(1.0)
        v_prev = None
        v_prev_norm = None
        for k in range(n_steps):
            t_val = 1.0 - k * dt
            t = torch.full((B,), t_val, device=z.device)
            v = self.velocity(z, t, key_padding_mask=kpm)
            e = v.square().mean(dim=-1)
            v_norm = v.square().sum(dim=-1).clamp_min(1e-12).sqrt()
            total_ke = total_ke + (e * mask).sum(dim=-1) / mask.sum(dim=-1).clamp_min(1.0) * dt
            flow_ke = flow_ke + e[:, 0] * dt
            packet_ke = packet_ke + (e[:, 1:] * mask[:, 1:]).sum(dim=-1) / packet_count * dt
            if v_prev is not None:
                dv = v - v_prev
                dve = dv.square().mean(dim=-1)
                total_curv = total_curv + (dve * mask).sum(dim=-1) / mask.sum(dim=-1).clamp_min(1.0)
                flow_curv = flow_curv + dve[:, 0]
                packet_curv = packet_curv + (dve[:, 1:] * mask[:, 1:]).sum(dim=-1) / packet_count
                dv2_sum = dv[:, 1:].square().sum(dim=-1)
                assert v_prev_norm is not None
                v_avg = 0.5 * (v_norm[:, 1:] + v_prev_norm[:, 1:])
                packet_kappa2_speed2 = packet_kappa2_speed2 + dv2_sum / v_avg.square().clamp_min(1e-06)
            v_prev = v
            v_prev_norm = v_norm
            z_new = z - v * dt
            dz = (z_new - z) * mask[:, :, None]
            total_arc = total_arc + dz.reshape(B, -1).norm(dim=-1) / mask.sum(dim=-1).sqrt()
            z = z_new
        z_masked = z * mask[:, :, None]
        terminal = z_masked.reshape(B, -1).norm(dim=-1) / (mask.sum(dim=-1) * self.token_dim).clamp_min(1.0).sqrt()
        terminal_flow = z[:, 0].norm(dim=-1) / math.sqrt(self.token_dim)
        terminal_packet = (z[:, 1:] * mask[:, 1:, None]).reshape(B, -1).norm(dim=-1) / (packet_count * self.token_dim).sqrt()
        packet_mask = mask[:, 1:]
        kappa2_speed2_mean = (packet_kappa2_speed2 * packet_mask).sum(dim=-1) / packet_count
        kappa2_speed2_median = self._masked_median(packet_kappa2_speed2, packet_mask)
        kappa2_speed2_trimmed = self._masked_trimmed_mean(packet_kappa2_speed2, packet_mask)
        return {'terminal_norm': terminal, 'terminal_flow': terminal_flow, 'terminal_packet': terminal_packet, 'arc_length': total_arc, 'kinetic_energy': total_ke, 'kinetic_flow': flow_ke, 'kinetic_packet': packet_ke, 'curvature_total': total_curv, 'curvature_flow': flow_curv, 'curvature_packet': packet_curv, 'kappa2_speed2norm_packet_mean': kappa2_speed2_mean, 'kappa2_speed2norm_packet_median': kappa2_speed2_median, 'kappa2_speed2norm_packet_trimmed10_mean': kappa2_speed2_trimmed}

    @torch.no_grad()
    def score_profile_vt(self, flow: torch.Tensor, packets: torch.Tensor, lens: torch.Tensor, t_eval: tuple[float, ...]=(0.1, 0.3, 0.5, 0.7, 0.9, 1.0)) -> dict[str, torch.Tensor]:
        x = self.build_tokens(flow, packets)
        mask = self._loss_mask(lens)
        kpm = mask == 0
        packet_count = mask[:, 1:].sum(dim=-1).clamp_min(1.0)
        out: dict[str, torch.Tensor] = {}
        for t_val in t_eval:
            t = torch.full((x.shape[0],), float(t_val), device=x.device)
            v = self.velocity(x, t, key_padding_mask=kpm)
            e = v.square().mean(dim=-1)
            tag = f't{int(round(t_val * 10)):02d}'
            out[f'velocity_total_{tag}'] = (e * mask).sum(dim=-1) / mask.sum(dim=-1).clamp_min(1.0)
            out[f'velocity_flow_{tag}'] = e[:, 0]
            out[f'velocity_packet_{tag}'] = (e[:, 1:] * mask[:, 1:]).sum(dim=-1) / packet_count
        return out

    @torch.no_grad()
    def consistency_score(self, flow: torch.Tensor, packets: torch.Tensor, lens: torch.Tensor, t_eval: float=0.5) -> dict[str, torch.Tensor]:
        x = self.build_tokens(flow, packets)
        mask = self._loss_mask(lens)
        kpm = mask == 0
        B = x.shape[0]
        packet_count = mask[:, 1:].sum(dim=-1).clamp_min(1.0)
        t = torch.full((B,), float(t_eval), device=x.device)
        v_full = self.velocity(x, t, key_padding_mask=kpm)
        x_mf = x.clone()
        x_mf[:, 0, :] = 0.0
        v_mf = self.velocity(x_mf, t, key_padding_mask=kpm)
        flow_cons = (v_full[:, 0] - v_mf[:, 0]).square().mean(dim=-1)
        x_mp = x.clone()
        pkt_mask_full = mask[:, 1:] > 0
        idx_pkt_mask = torch.cat([torch.zeros(B, 1, dtype=torch.bool, device=x.device), pkt_mask_full], dim=1)
        x_mp[idx_pkt_mask] = 0.0
        v_mp = self.velocity(x_mp, t, key_padding_mask=kpm)
        diff = (v_full - v_mp).square().mean(dim=-1)
        packet_cons = (diff[:, 1:] * mask[:, 1:]).sum(dim=-1) / packet_count
        return {'flow_consistency': flow_cons, 'packet_consistency': packet_cons, 'consistency_total': flow_cons + packet_cons}

    def jacobian_hutchinson(self, flow: torch.Tensor, packets: torch.Tensor, lens: torch.Tensor, t_eval: tuple[float, ...]=(0.5,), n_eps: int=4, generator: torch.Generator | None=None) -> dict[str, torch.Tensor]:
        x = self.build_tokens(flow, packets)
        mask = self._loss_mask(lens)
        kpm = mask == 0
        B = x.shape[0]
        packet_count = mask[:, 1:].sum(dim=-1).clamp_min(1.0)
        total = torch.zeros(B, device=x.device)
        flow_j = torch.zeros(B, device=x.device)
        packet_j = torch.zeros(B, device=x.device)
        n_draws = n_eps * len(t_eval)
        for t_val in t_eval:
            t_current = torch.full((B,), float(t_val), device=x.device)
            for _ in range(n_eps):
                x_req = x.detach().clone().requires_grad_(True)
                v = self.velocity(x_req, t_current, key_padding_mask=kpm)
                eps = torch.randn(v.shape, device=v.device, generator=generator)
                (g,) = torch.autograd.grad(outputs=v, inputs=x_req, grad_outputs=eps, retain_graph=False, create_graph=False)
                e = g.square().mean(dim=-1)
                total = total + (e * mask).sum(dim=-1) / mask.sum(dim=-1).clamp_min(1.0)
                flow_j = flow_j + e[:, 0]
                packet_j = packet_j + (e[:, 1:] * mask[:, 1:]).sum(dim=-1) / packet_count
        return {'jacobian_total': (total / n_draws).detach(), 'jacobian_flow': (flow_j / n_draws).detach(), 'jacobian_packet': (packet_j / n_draws).detach()}
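`jacobian_hutchinson` squares vector-Jacobian products `eps^T J` for random probes `eps`. For probes with zero mean and unit variance, the expectation of `||eps^T J||^2` equals the squared Frobenius norm of `J`. The sketch below checks that identity on a tiny explicit matrix. It is an illustration under assumptions: it swaps the Gaussian probes used above for an exhaustive sweep over Rademacher (±1) probes so the average is exact, and `vjp`, `frobenius_sq`, and `hutchinson_frobenius_sq` are hypothetical helper names.

```python
import itertools


def vjp(J, eps):
    """Return the row vector eps^T J for a matrix J given as a list of rows."""
    n_cols = len(J[0])
    return [sum(eps[i] * J[i][j] for i in range(len(J))) for j in range(n_cols)]


def frobenius_sq(J):
    """Squared Frobenius norm: the sum of squared entries."""
    return sum(x * x for row in J for x in row)


def hutchinson_frobenius_sq(J):
    """Average ||eps^T J||^2 over every Rademacher probe eps in {-1, +1}^m.

    Enumerating all sign patterns makes the cross terms cancel exactly;
    a sampled estimator (as in the method above) only matches in expectation.
    """
    m = len(J)
    probes = list(itertools.product((-1.0, 1.0), repeat=m))
    return sum(sum(g * g for g in vjp(J, eps)) for eps in probes) / len(probes)


J = [[1.0, 2.0], [3.0, -1.0], [0.5, 0.0]]
estimate = hutchinson_frobenius_sq(J)
exact = frobenius_sq(J)
```

With only a few sampled probes, as in `n_eps=4`, the estimate is noisy, which is presumably why the method averages over `n_eps * len(t_eval)` draws.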

    @torch.no_grad()
    def pna_score(self, flow: torch.Tensor, packets: torch.Tensor, lens: torch.Tensor, n_steps: int=16, flow_masked: bool=False) -> dict[str, torch.Tensor]:
        eps_v2 = 1e-06
        dt = 1.0 / n_steps
        z = self.build_tokens(flow, packets)
        if flow_masked:
            z = z.clone()
            z[:, 0, :] = 0.0
        mask = self._loss_mask(lens)
        kpm = mask == 0
        (B, L, _) = z.shape
        pna = torch.zeros(B, L, device=z.device)
        v_prev: torch.Tensor | None = None
        v_norm_prev: torch.Tensor | None = None
        for k in range(n_steps):
            t_val = 1.0 - k * dt
            t = torch.full((B,), t_val, device=z.device)
            v = self.velocity(z, t, key_padding_mask=kpm)
            v_norm = (v.square().sum(dim=-1) + 1e-12).sqrt()
            if v_prev is not None:
                dv2 = (v - v_prev).square().sum(dim=-1)
                v_avg2 = (0.5 * (v_norm + v_norm_prev)).square().clamp_min(eps_v2)
                pna = pna + dv2 / v_avg2
            v_prev = v
            v_norm_prev = v_norm
            z = z - v * dt
            if flow_masked:
                z[:, 0, :] = 0.0
        flow_pna = pna[:, 0]
        packet_pna = pna[:, 1:]
        packet_mask = mask[:, 1:]
        packet_count = packet_mask.sum(dim=-1).clamp_min(1.0)
        pna_median = self._masked_median(packet_pna, packet_mask)
        pna_mean = (packet_pna * packet_mask).sum(dim=-1) / packet_count
        masked_for_max = packet_pna.masked_fill(packet_mask == 0, float('-inf'))
        pna_max = masked_for_max.max(dim=-1).values
        pna_trimmed = self._masked_trimmed_mean(packet_pna, packet_mask)
        return {'pna_packet_median': pna_median, 'pna_packet_mean': pna_mean, 'pna_packet_max': pna_max, 'pna_packet_trimmed10_mean': pna_trimmed, 'pna_flow': flow_pna}

    @torch.no_grad()
    def causal_consistency_score(self, flow: torch.Tensor, packets: torch.Tensor, lens: torch.Tensor, t_eval: float=0.5) -> dict[str, torch.Tensor]:
        x = self.build_tokens(flow, packets)
        mask = self._loss_mask(lens)
        kpm = mask == 0
        (B, L, _) = x.shape
        t = torch.full((B,), float(t_eval), device=x.device)
        v_full = self.velocity(x, t, key_padding_mask=kpm)
        causal = torch.triu(torch.ones(L, L, dtype=torch.bool, device=x.device), diagonal=1)
        v_causal = self.velocity(x, t, key_padding_mask=kpm, attn_mask_override=causal)
        diff = (v_full - v_causal).square().mean(dim=-1)
        flow_surprisal = diff[:, 0]
        packet_diff = diff[:, 1:]
        packet_mask = mask[:, 1:]
        packet_count = packet_mask.sum(dim=-1).clamp_min(1.0)
        packet_mean = (packet_diff * packet_mask).sum(dim=-1) / packet_count
        packet_median = self._masked_median(packet_diff, packet_mask)
        masked_for_max = packet_diff.masked_fill(packet_mask == 0, float('-inf'))
        packet_max = masked_for_max.max(dim=-1).values
        packet_trimmed = self._masked_trimmed_mean(packet_diff, packet_mask)
        total = (diff * mask).sum(dim=-1) / mask.sum(dim=-1).clamp_min(1.0)
        return {'causal_surprisal_total': total, 'causal_surprisal_flow': flow_surprisal, 'causal_surprisal_packet_mean': packet_mean, 'causal_surprisal_packet_median': packet_median, 'causal_surprisal_packet_max': packet_max, 'causal_surprisal_packet_trimmed10_mean': packet_trimmed}

    @torch.no_grad()
    def direction_consistency_score(self, flow: torch.Tensor, packets: torch.Tensor, lens: torch.Tensor, t_eval: tuple[float, ...]=(0.2, 0.4, 0.6, 0.8, 1.0)) -> dict[str, torch.Tensor]:
        x = self.build_tokens(flow, packets)
        mask = self._loss_mask(lens)
        kpm = mask == 0
        (B, L, _) = x.shape
        t_eval = tuple(t_eval)
        if len(t_eval) < 2:
            raise ValueError('direction_consistency_score needs >=2 t values')
        prev_v: torch.Tensor | None = None
        drift = x.new_zeros(B, L)
        n_pairs = len(t_eval) - 1
        for t_val in t_eval:
            t = torch.full((B,), float(t_val), device=x.device)
            v = self.velocity(x, t, key_padding_mask=kpm)
            if prev_v is not None:
                num = (prev_v * v).sum(dim=-1)
                denom = prev_v.norm(dim=-1).clamp_min(1e-08) * v.norm(dim=-1).clamp_min(1e-08)
                cos = num / denom
                drift = drift + (1.0 - cos)
            prev_v = v
        drift = drift / max(n_pairs, 1)
        flow_drift = drift[:, 0]
        packet_drift = drift[:, 1:]
        packet_mask = mask[:, 1:]
        packet_count = packet_mask.sum(dim=-1).clamp_min(1.0)
        packet_mean = (packet_drift * packet_mask).sum(dim=-1) / packet_count
        packet_median = self._masked_median(packet_drift, packet_mask)
        masked_for_max = packet_drift.masked_fill(packet_mask == 0, float('-inf'))
        packet_max = masked_for_max.max(dim=-1).values
        packet_trimmed = self._masked_trimmed_mean(packet_drift, packet_mask)
        total = (drift * mask).sum(dim=-1) / mask.sum(dim=-1).clamp_min(1.0)
        return {'direction_drift_total': total, 'direction_drift_flow': flow_drift, 'direction_drift_packet_mean': packet_mean, 'direction_drift_packet_median': packet_median, 'direction_drift_packet_max': packet_max, 'direction_drift_packet_trimmed10_mean': packet_trimmed}
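The drift computed above is the average of `1 - cos` between velocities at consecutive evaluation times, so it is 0 when the direction never changes and grows as the field turns. A minimal pure-Python sketch for a single token follows; `cosine` and `direction_drift` are hypothetical names, and the velocity lists are made-up inputs rather than model outputs.

```python
import math


def cosine(u, v, eps=1e-8):
    """Cosine similarity with each norm clamped to >= eps, as in the code above."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = max(math.sqrt(sum(a * a for a in u)), eps)
    norm_v = max(math.sqrt(sum(b * b for b in v)), eps)
    return dot / (norm_u * norm_v)


def direction_drift(velocities):
    """Mean of (1 - cos) over consecutive pairs of velocity vectors."""
    if len(velocities) < 2:
        raise ValueError("need at least two velocity vectors")
    pairs = list(zip(velocities[:-1], velocities[1:]))
    return sum(1.0 - cosine(u, v) for u, v in pairs) / len(pairs)


# Same direction at every time, only the speed changes: no drift.
straight = direction_drift([[1.0, 0.0], [2.0, 0.0], [3.0, 0.0]])

# A 90-degree turn between each pair of times: drift of 1 per pair.
turning = direction_drift([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
```

Because only the angle enters the score, rescaling any velocity leaves the drift unchanged.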

    def inverse_flow_nll_score(self, flow: torch.Tensor, packets: torch.Tensor, lens: torch.Tensor, n_steps: int=16, n_eps: int=4, compute_divergence: bool=True, generator: torch.Generator | None=None) -> dict[str, torch.Tensor]:
        z = self.build_tokens(flow, packets)
        mask = self._loss_mask(lens)
        kpm = mask == 0
        (B, L, D) = z.shape
        dt = 1.0 / n_steps
        accum_div = torch.zeros(B, device=z.device)
        if compute_divergence:
            for k in range(n_steps):
                t_val = 1.0 - k * dt
                t = torch.full((B,), t_val, device=z.device)
                z_req = z.detach().clone().requires_grad_(True)
                v = self.velocity(z_req, t, key_padding_mask=kpm)
                div_step = torch.zeros(B, device=z.device)
                for j in range(n_eps):
                    # Draw probes from the caller-supplied generator so the
                    # `generator` argument is honored, as in jacobian_hutchinson.
                    eps = torch.randn(v.shape, device=v.device, dtype=v.dtype, generator=generator)
                    eps_masked = eps * mask[:, :, None]
                    retain = j < n_eps - 1
                    (g,) = torch.autograd.grad(outputs=v, inputs=z_req, grad_outputs=eps_masked, retain_graph=retain, create_graph=False)
                    div_step = div_step + (eps_masked * g).sum(dim=(1, 2))
                div_step = div_step / float(n_eps)
                accum_div = accum_div + div_step * dt
                with torch.no_grad():
                    z = (z_req - v * dt).detach()
        else:
            with torch.no_grad():
                for k in range(n_steps):
                    t_val = 1.0 - k * dt
                    t = torch.full((B,), t_val, device=z.device)
                    v = self.velocity(z, t, key_padding_mask=kpm)
                    z = z - v * dt
        with torch.no_grad():
            z_masked = z * mask[:, :, None]
            n_real = mask.sum(dim=-1).clamp_min(1.0)
            x0_quadratic = z_masked.reshape(B, -1).square().sum(dim=-1) / (n_real * float(D))
            nll_x0_only = x0_quadratic
            nll_div_only = accum_div / (n_real * float(D))
            nll_full = nll_x0_only + nll_div_only
        return {'nll_x0_only': nll_x0_only.detach(), 'nll_div_only': nll_div_only.detach(), 'nll_full': nll_full.detach()}

    def jacobian_spectral_score(self, flow: torch.Tensor, packets: torch.Tensor, lens: torch.Tensor, t_eval: float=0.5, n_eps: int=4, generator: torch.Generator | None=None) -> dict[str, torch.Tensor]:
        x = self.build_tokens(flow, packets)
        mask = self._loss_mask(lens)
        kpm = mask == 0
        (B, L, D) = x.shape
        t = torch.full((B,), float(t_eval), device=x.device)
        packet_mask = mask[:, 1:]
        packet_count = packet_mask.sum(dim=-1).clamp_min(1.0)
        norms_total: list[torch.Tensor] = []
        norms_flow: list[torch.Tensor] = []
        norms_packet: list[torch.Tensor] = []
        for _ in range(n_eps):
            x_req = x.detach().clone().requires_grad_(True)
            v = self.velocity(x_req, t, key_padding_mask=kpm)
            eps = torch.randn(v.shape, device=v.device, generator=generator)
            (g,) = torch.autograd.grad(outputs=v, inputs=x_req, grad_outputs=eps, retain_graph=False, create_graph=False)
            e = g.square().mean(dim=-1)
            n_total = (e * mask).sum(dim=-1) / mask.sum(dim=-1).clamp_min(1.0)
            n_flow = e[:, 0]
            n_packet = (e[:, 1:] * packet_mask).sum(dim=-1) / packet_count
            norms_total.append(n_total.detach())
            norms_flow.append(n_flow.detach())
            norms_packet.append(n_packet.detach())

        def _spectral_summary(samples: list[torch.Tensor]) -> dict[str, torch.Tensor]:
            stack = torch.stack(samples, dim=1)
            mean = stack.mean(dim=1).clamp_min(1e-12)
            mx = stack.max(dim=1).values
            mn = stack.min(dim=1).values
            logfro = torch.log(mean)
            aniso = mx / mean
            min_over_max = mn / mx.clamp_min(1e-12)
            p = stack / stack.sum(dim=1, keepdim=True).clamp_min(1e-12)
            entropy = -(p * p.clamp_min(1e-12).log()).sum(dim=1)
            eff_rank = torch.exp(entropy)
            return {'logfro': logfro, 'anisotropy': aniso, 'min_over_max': min_over_max, 'eff_rank': eff_rank}
        out: dict[str, torch.Tensor] = {}
        for (tag, samples) in (('total', norms_total), ('flow', norms_flow), ('packet', norms_packet)):
            summ = _spectral_summary(samples)
            for (stat_name, val) in summ.items():
                out[f'jac_{stat_name}_{tag}'] = val
        return out
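`_spectral_summary` reduces the per-probe squared norms to four statistics: log of the mean, max over mean, min over max, and the exponential of the Shannon entropy of the normalized samples. A pure-Python sketch for one sample row follows. `spectral_summary` is a hypothetical stand-in, and reading `eff_rank` as "how many probes carry the mass" is an interpretation, not something the source states.

```python
import math


def spectral_summary(samples, eps=1e-12):
    """Summary statistics of non-negative probe norms, mirroring the method above."""
    mean = max(sum(samples) / len(samples), eps)
    largest, smallest = max(samples), min(samples)
    total = max(sum(samples), eps)
    p = [s / total for s in samples]  # normalize to a probability vector
    entropy = -sum(pi * math.log(max(pi, eps)) for pi in p)
    return {
        "logfro": math.log(mean),
        "anisotropy": largest / mean,
        "min_over_max": smallest / max(largest, eps),
        "eff_rank": math.exp(entropy),
    }


# Every probe sees the same norm: eff_rank equals the number of probes.
uniform = spectral_summary([2.0, 2.0, 2.0, 2.0])

# One probe dominates: eff_rank collapses to 1 and anisotropy is large.
peaked = spectral_summary([8.0, 0.0, 0.0, 0.0])
```

With `n_eps` probes, `eff_rank` ranges from 1 to `n_eps` and `anisotropy` from 1 to `n_eps`.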

    @torch.no_grad()
    def sample(self, n: int, lens: torch.Tensor, device: torch.device, n_steps: int=50, method: str='euler') -> torch.Tensor:
        z = torch.randn(n, self.seq_len, self.token_dim, device=device)
        ts = torch.linspace(0.0, 1.0, n_steps + 1, device=device)
        kpm = self.key_padding_mask(lens.to(device))

        def f(t: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
            return self.velocity(x, t.expand(x.shape[0]), key_padding_mask=kpm)
        if method == 'euler':
            for i in range(n_steps):
                z = z + f(ts[i], z) * (ts[i + 1] - ts[i])
            return z
        return odeint(f, z, ts, method=method)[-1]
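The `'euler'` branch of `sample` integrates the learned velocity field forward from t=0 to t=1 with a fixed step. The pure-Python sketch below shows the same update rule on a toy field. `euler_integrate` and `straight_line_field` are hypothetical names, and the constant field is an assumption chosen because a straight-line path is the case Euler steps integrate without discretization error.

```python
def euler_integrate(f, x0, n_steps):
    """Integrate dx/dt = f(t, x) from t=0 to t=1 with n_steps explicit Euler steps."""
    ts = [i / n_steps for i in range(n_steps + 1)]
    x = list(x0)
    for i in range(n_steps):
        dt = ts[i + 1] - ts[i]
        v = f(ts[i], x)
        x = [xi + vi * dt for xi, vi in zip(x, v)]
    return x


start = [0.5, 0.5]
target = [2.0, -1.0]


def straight_line_field(t, x):
    """Constant velocity (target - start), i.e. a straight path from start to target."""
    return [b - a for a, b in zip(start, target)]


end = euler_integrate(straight_line_field, start, n_steps=50)
```

For a curved field the result depends on `n_steps`, which is one reason the method also offers an `odeint` path for other solvers.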

    def param_count(self) -> int:
        return sum((p.numel() for p in self.parameters()))
@@ -1,157 +0,0 @@
import sys
from pathlib import Path
import torch
sys.path.insert(0, str(Path(__file__).resolve().parents[1]))
from model import UnifiedCFMConfig, UnifiedTokenCFM

def _build_model():
    return UnifiedTokenCFM(UnifiedCFMConfig(T=4, packet_dim=3, flow_dim=5, d_model=16, n_layers=1, n_heads=4, time_dim=8))

def _build_reference_model(reference_mode: str):
    return UnifiedTokenCFM(UnifiedCFMConfig(T=4, packet_dim=3, flow_dim=5, d_model=16, n_layers=1, n_heads=4, time_dim=8, reference_mode=reference_mode))

def _sample_batch(seed: int=0):
    torch.manual_seed(seed)
    flow = torch.randn(2, 5)
    packets = torch.randn(2, 4, 3)
    lens = torch.tensor([4, 2])
    return (flow, packets, lens)

def test_unified_cfm_shapes_and_scores():
    model = _build_model()
    (flow, packets, lens) = _sample_batch()
    tokens = model.build_tokens(flow, packets)
    assert tokens.shape == (2, 5, 6)
    loss = model.compute_loss(flow, packets, lens)
    assert loss.ndim == 0
    assert torch.isfinite(loss)
    traj = model.trajectory_metrics(flow, packets, lens, n_steps=2)
    assert 'terminal_norm' in traj
    assert traj['terminal_norm'].shape == (2,)
    vel = model.velocity_score(flow, packets, lens)
    assert set(vel) == {'velocity_total', 'velocity_flow', 'velocity_packet'}

def test_reference_mode_independent_token_shapes_and_scores():
    model = _build_reference_model('independent_token')
    (flow, packets, lens) = _sample_batch(seed=9)
    loss = model.compute_loss(flow, packets, lens)
    assert loss.ndim == 0
    assert torch.isfinite(loss)
    traj = model.trajectory_metrics(flow, packets, lens, n_steps=2)
    assert traj['terminal_norm'].shape == (2,)
    assert torch.all(torch.isfinite(traj['curvature_packet']))

def test_reference_mode_block_diagonal_shapes_and_scores():
    model = _build_reference_model('block_diagonal')
    (flow, packets, lens) = _sample_batch(seed=10)
    loss = model.compute_loss(flow, packets, lens)
    assert loss.ndim == 0
    assert torch.isfinite(loss)
    vel = model.velocity_score(flow, packets, lens)
    assert set(vel) == {'velocity_total', 'velocity_flow', 'velocity_packet'}

def test_trajectory_curvature_keys_and_shapes():
    model = _build_model()
    (flow, packets, lens) = _sample_batch(seed=1)
    traj = model.trajectory_metrics(flow, packets, lens, n_steps=4)
    for key in ('curvature_total', 'curvature_flow', 'curvature_packet'):
        assert key in traj, f'missing {key}'
        assert traj[key].shape == (2,)
        assert torch.all(torch.isfinite(traj[key]))
        assert torch.all(traj[key] >= 0)

def test_trajectory_curvature_zero_with_one_step():
    model = _build_model()
    (flow, packets, lens) = _sample_batch(seed=2)
    traj = model.trajectory_metrics(flow, packets, lens, n_steps=1)
    for key in ('curvature_total', 'curvature_flow', 'curvature_packet'):
        assert traj[key].abs().sum().item() == 0.0

def test_speed_normalized_packet_curvature_scores():
    model = _build_model()
    (flow, packets, lens) = _sample_batch(seed=11)
    traj = model.trajectory_metrics(flow, packets, lens, n_steps=4)
    keys = ('kappa2_speed2norm_packet_mean', 'kappa2_speed2norm_packet_median', 'kappa2_speed2norm_packet_trimmed10_mean')
    for key in keys:
        assert key in traj, f'missing {key}'
        assert traj[key].shape == (2,)
        assert torch.all(torch.isfinite(traj[key]))
        assert torch.all(traj[key] >= 0)
    one_step = model.trajectory_metrics(flow, packets, lens, n_steps=1)
    for key in keys:
        assert one_step[key].abs().sum().item() == 0.0

def test_score_profile_vt_shapes():
    model = _build_model()
    (flow, packets, lens) = _sample_batch(seed=3)
    t_eval = (0.1, 0.3, 0.5, 0.7, 0.9, 1.0)
    prof = model.score_profile_vt(flow, packets, lens, t_eval=t_eval)
    assert len(prof) == 3 * len(t_eval)
    for (k, v) in prof.items():
        assert v.shape == (2,), k
        assert torch.all(torch.isfinite(v))
        assert torch.all(v >= 0)
    assert 'velocity_total_t05' in prof
    assert 'velocity_flow_t10' in prof
    assert 'velocity_packet_t01' in prof

def test_compute_loss_backward_compat():
    model = _build_model()
    (flow, packets, lens) = _sample_batch(seed=5)
    torch.manual_seed(0)
    a = model.compute_loss(flow, packets, lens)
    torch.manual_seed(0)
    b = model.compute_loss(flow, packets, lens, lambda_flow=0.0, lambda_packet=0.0)
    assert torch.allclose(a, b), f'λ=0 must match old loss; got {a.item()} vs {b.item()}'

def test_compute_loss_aux_components_finite():
    model = _build_model()
    (flow, packets, lens) = _sample_batch(seed=6)
    torch.manual_seed(7)
    comp = model.compute_loss(flow, packets, lens, lambda_flow=0.1, lambda_packet=0.1, return_components=True)
    assert set(comp) == {'total', 'main', 'aux_flow', 'aux_packet'}
    for (k, v) in comp.items():
        assert torch.isfinite(v), k
        assert v >= 0, f'{k} negative: {v.item()}'

def test_compute_loss_aux_affects_gradient():
    model = _build_model()
    with torch.no_grad():
        model.velocity.out.weight.normal_(std=0.01)
        for block in model.velocity.blocks:
            block.cond_proj.weight.normal_(std=0.01)
    (flow, packets, lens) = _sample_batch(seed=8)
    torch.manual_seed(10)
    total = model.compute_loss(flow, packets, lens, lambda_flow=1.0, lambda_packet=1.0)
    total.backward()
    some_grad = False
    for p in model.parameters():
        if p.grad is not None and p.grad.abs().sum().item() > 0:
            some_grad = True
            break
    assert some_grad, 'no gradient flowed through aux losses'

def test_consistency_score_shapes():
    model = _build_model()
    (flow, packets, lens) = _sample_batch(seed=9)
    cs = model.consistency_score(flow, packets, lens)
    assert set(cs) == {'flow_consistency', 'packet_consistency', 'consistency_total'}
    for (k, v) in cs.items():
        assert v.shape == (2,), k
        assert torch.all(torch.isfinite(v))
        assert torch.all(v >= 0), k

def test_jacobian_hutchinson_shapes_and_nonneg():
    model = _build_model()
    with torch.no_grad():
        model.velocity.out.weight.normal_(std=0.01)
        for block in model.velocity.blocks:
            block.cond_proj.weight.normal_(std=0.01)
    (flow, packets, lens) = _sample_batch(seed=4)
    gen = torch.Generator().manual_seed(42)
    jac = model.jacobian_hutchinson(flow, packets, lens, t_eval=(0.5,), n_eps=2, generator=gen)
    assert set(jac) == {'jacobian_total', 'jacobian_flow', 'jacobian_packet'}
    for (k, v) in jac.items():
        assert v.shape == (2,), k
        assert torch.all(torch.isfinite(v))
        assert torch.all(v >= 0), f'{k} has negative value'
@@ -1,147 +0,0 @@
from __future__ import annotations
import argparse
import json
import time
from dataclasses import asdict
from pathlib import Path
from typing import Any
import numpy as np
import torch
import yaml
from sklearn.metrics import roc_auc_score
from torch.utils.data import DataLoader, TensorDataset
from data import UnifiedData, load_unified_data, subsample_train
from model import UnifiedCFMConfig, UnifiedTokenCFM

def _device(dev_arg: str) -> torch.device:
    if dev_arg == 'auto':
        return torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    return torch.device(dev_arg)

def _batch_score(model: UnifiedTokenCFM, flow_np: np.ndarray, packet_np: np.ndarray, len_np: np.ndarray, device: torch.device, *, batch_size: int, n_steps: int) -> dict[str, np.ndarray]:
    out: dict[str, list[np.ndarray]] = {}
    model.eval()
    for start in range(0, len(flow_np), batch_size):
        sl = slice(start, start + batch_size)
        flow = torch.from_numpy(flow_np[sl]).float().to(device)
        packets = torch.from_numpy(packet_np[sl]).float().to(device)
        lens = torch.from_numpy(len_np[sl]).long().to(device)
        metrics = model.trajectory_metrics(flow, packets, lens, n_steps=n_steps)
        vel = model.velocity_score(flow, packets, lens)
        metrics.update(vel)
        for (k, v) in metrics.items():
            out.setdefault(k, []).append(v.detach().cpu().numpy())
    return {k: np.concatenate(v, axis=0) for (k, v) in out.items()}

def _quick_eval(model: UnifiedTokenCFM, data: UnifiedData, device: torch.device, cfg: dict[str, Any]) -> dict[str, float]:
    n_eval = int(cfg.get('eval_n', 2000))
    rng = np.random.default_rng(0)

    def pick(n: int) -> np.ndarray:
        m = min(n_eval, n)
        return rng.choice(n, m, replace=False)
    vi = pick(len(data.val_flow))
    ai = pick(len(data.attack_flow))
    v = _batch_score(model, data.val_flow[vi], data.val_packets[vi], data.val_len[vi], device, batch_size=int(cfg.get('eval_batch_size', 512)), n_steps=int(cfg.get('eval_n_steps', 8)))
    a = _batch_score(model, data.attack_flow[ai], data.attack_packets[ai], data.attack_len[ai], device, batch_size=int(cfg.get('eval_batch_size', 512)), n_steps=int(cfg.get('eval_n_steps', 8)))
    y = np.concatenate([np.zeros(len(vi)), np.ones(len(ai))])
    result: dict[str, float] = {}
    for key in sorted(v.keys()):
        s = np.concatenate([v[key], a[key]])
        s = np.nan_to_num(s, nan=0.0, posinf=1000000000000.0, neginf=-1000000000000.0)
        result[f'auroc_{key}'] = float(roc_auc_score(y, s))
    return result
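`_quick_eval` labels benign validation flows 0 and attack flows 1, then reports one AUROC per score. For binary labels, AUROC equals the probability that a randomly chosen positive outscores a randomly chosen negative, with ties counted as one half. The pure-Python sketch below computes it by brute-force pair counting. `auroc` is a hypothetical helper shown for intuition, not a replacement for `sklearn.metrics.roc_auc_score`, and the labels and scores are made up.

```python
def auroc(labels, scores):
    """Pairwise AUROC: fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


# Three benign (0) and two attack (1) samples with anomaly scores.
labels = [0, 0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.3]
value = auroc(labels, scores)

# Perfect separation gives 1.0.
perfect = auroc([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9])
```

A value below 0.5 means the score ranks attacks lower than benign traffic, so the sign of the score matters when comparing the `auroc_*` entries.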
|
||||
|
||||
def train(cfg: dict[str, Any]) -> Path:
|
||||
device = _device(str(cfg.get('device', 'auto')))
|
||||
save_dir = Path(cfg['save_dir'])
|
||||
save_dir.mkdir(parents=True, exist_ok=True)
|
||||
with open(save_dir / 'config.yaml', 'w') as f:
|
||||
yaml.safe_dump(cfg, f)
|
||||
seed = int(cfg.get('seed', 42))
|
||||
data_seed = int(cfg.get('data_seed', seed))
|
||||
torch.manual_seed(seed)
|
||||
np.random.seed(seed)
|
||||
print(f'Device: {device}')
|
||||
print(f'[seed] model={seed} data={data_seed}')
|
||||
feature_columns = cfg.get('flow_feature_columns')
|
||||
data = load_unified_data(packets_npz=Path(cfg['packets_npz']) if cfg.get('packets_npz') else None, source_store=Path(cfg['source_store']) if cfg.get('source_store') else None, flows_parquet=Path(cfg['flows_parquet']), flow_features_path=Path(cfg['flow_features_path']) if cfg.get('flow_features_path') else None, flow_feature_columns=feature_columns, flow_features_align=str(cfg.get('flow_features_align', 'auto')), T=int(cfg['T']), split_seed=data_seed, train_ratio=float(cfg.get('train_ratio', 0.8)), benign_label=str(cfg.get('benign_label', 'normal')), min_len=int(cfg.get('min_len', 2)), packet_preprocess=str(cfg.get('packet_preprocess', 'mixed_dequant')), attack_cap=int(cfg['attack_cap']) if cfg.get('attack_cap') else None, val_cap=int(cfg['val_cap']) if cfg.get('val_cap') else None)
|
||||
print(f'[data] T={data.T} packet_D={data.packet_dim} flow_D={data.flow_dim} train={len(data.train_flow):,} val={len(data.val_flow):,} attack={len(data.attack_flow):,}')
|
||||
(tr_f, tr_p, tr_l) = subsample_train(data, int(cfg.get('n_train', 0)), data_seed)
|
||||
ds = TensorDataset(torch.from_numpy(tr_f).float(), torch.from_numpy(tr_p).float(), torch.from_numpy(tr_l).long())
|
||||
loader = DataLoader(ds, batch_size=int(cfg['batch_size']), shuffle=True, drop_last=True, num_workers=int(cfg.get('num_workers', 0)), pin_memory=device.type == 'cuda')
|
||||
print(f'[data] using {len(ds):,} benign training flows')
|
||||
model_cfg = UnifiedCFMConfig(T=data.T, packet_dim=data.packet_dim, flow_dim=data.flow_dim, token_dim=cfg.get('token_dim'), d_model=int(cfg['d_model']), n_layers=int(cfg['n_layers']), n_heads=int(cfg['n_heads']), mlp_ratio=float(cfg.get('mlp_ratio', 4.0)), time_dim=int(cfg.get('time_dim', 64)), sigma=float(cfg.get('sigma', 0.1)), use_ot=bool(cfg.get('use_ot', False)), reference_mode=cfg.get('reference_mode'))
|
||||
model = UnifiedTokenCFM(model_cfg).to(device)
|
||||
print(f'[model] params={model.param_count():,} token_dim={model.token_dim} seq_len={model.seq_len} sigma={model_cfg.sigma} use_ot={model_cfg.use_ot} reference_mode={model_cfg.reference_mode}')
|
||||
opt = torch.optim.AdamW(model.parameters(), lr=float(cfg['lr']), weight_decay=float(cfg.get('weight_decay', 0.01)))
|
||||
total_steps = max(1, int(cfg['epochs']) * len(loader))
|
||||
sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=total_steps)
|
||||
history: dict[str, list[Any]] = {'epoch': [], 'loss': [], 'eval': []}
|
||||
lambda_flow = float(cfg.get('lambda_flow', 0.0))
|
||||
lambda_packet = float(cfg.get('lambda_packet', 0.0))
|
||||
packet_mask_ratio = float(cfg.get('packet_mask_ratio', 0.5))
|
||||
aux_enabled = lambda_flow > 0.0 or lambda_packet > 0.0
|
||||
if aux_enabled:
|
||||
print(f'[loss] λ_flow={lambda_flow} λ_packet={lambda_packet} packet_mask_ratio={packet_mask_ratio}')
|
||||
for epoch in range(1, int(cfg['epochs']) + 1):
|
||||
model.train()
|
||||
losses: list[float] = []
|
||||
aux_flow_sum = 0.0
|
||||
aux_packet_sum = 0.0
|
||||
n_steps_this_epoch = 0
|
||||
t0 = time.time()
|
||||
for (flow, packets, lens) in loader:
|
||||
flow = flow.to(device, non_blocking=True)
|
||||
packets = packets.to(device, non_blocking=True)
|
||||
lens = lens.to(device, non_blocking=True)
|
||||
if aux_enabled:
|
||||
comp = model.compute_loss(flow, packets, lens, lambda_flow=lambda_flow, lambda_packet=lambda_packet, packet_mask_ratio=packet_mask_ratio, return_components=True)
|
||||
loss = comp['total']
|
||||
aux_flow_sum += float(comp['aux_flow'].item())
|
||||
aux_packet_sum += float(comp['aux_packet'].item())
|
||||
else:
|
||||
loss = model.compute_loss(flow, packets, lens)
|
||||
opt.zero_grad(set_to_none=True)
|
||||
loss.backward()
|
||||
torch.nn.utils.clip_grad_norm_(model.parameters(), float(cfg.get('grad_clip', 1.0)))
|
||||
opt.step()
|
||||
sched.step()
|
||||
losses.append(float(loss.item()))
|
||||
n_steps_this_epoch += 1
|
||||
mean_loss = float(np.mean(losses)) if losses else float('nan')
|
||||
eval_metrics: dict[str, float] | None = None
|
||||
if epoch % int(cfg.get('eval_every', 5)) == 0 or epoch == int(cfg['epochs']):
|
||||
eval_metrics = _quick_eval(model, data, device, cfg)
|
||||
history['epoch'].append(epoch)
|
||||
history['loss'].append(mean_loss)
|
||||
history['eval'].append(eval_metrics)
|
||||
elapsed = time.time() - t0
|
||||
terminal = ''
|
||||
if eval_metrics:
|
||||
terminal = f" auroc_terminal={eval_metrics['auroc_terminal_norm']:.3f}"
|
||||
if aux_enabled and n_steps_this_epoch:
|
||||
terminal += f' aux_flow={aux_flow_sum / n_steps_this_epoch:.4f} aux_pkt={aux_packet_sum / n_steps_this_epoch:.4f}'
|
||||
print(f"[epoch {epoch:>3d}/{int(cfg['epochs']):<3d}] ({elapsed:.1f}s) loss={mean_loss:.4f}{terminal}")
|
||||
if not np.isfinite(mean_loss):
|
||||
raise RuntimeError(f'non-finite loss at epoch {epoch}')
|
||||
payload = {'model_state_dict': model.state_dict(), 'model_cfg': asdict(model_cfg), 'packet_mean': data.packet_mean, 'packet_std': data.packet_std, 'flow_mean': data.flow_mean, 'flow_std': data.flow_std, 'packet_preprocess': data.packet_preprocess, 'flow_feature_names': np.asarray(data.flow_feature_names), 'packet_feature_names': np.asarray(data.packet_feature_names)}
|
||||
torch.save(payload, save_dir / 'model.pt')
|
||||
with open(save_dir / 'history.json', 'w') as f:
|
||||
json.dump(history, f, indent=2, default=str)
|
||||
print(f"[saved] {save_dir / 'model.pt'}")
|
||||
return save_dir
|
||||
|
||||
def main() -> None:
|
||||
p = argparse.ArgumentParser(description=__doc__)
|
||||
p.add_argument('--config', type=Path, required=True)
|
||||
p.add_argument('--override', type=str, nargs='*', default=[])
|
||||
args = p.parse_args()
|
||||
with open(args.config) as f:
|
||||
cfg = yaml.safe_load(f)
|
||||
for override in args.override:
|
||||
(key, value) = override.split('=', 1)
|
||||
cfg[key] = yaml.safe_load(value)
|
||||
train(cfg)
|
||||
if __name__ == '__main__':
|
||||
main()
|
||||
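The training loop above calls `sched.step()` once per batch with `T_max = epochs * len(loader)`, so the learning rate decays over the whole run rather than restarting each epoch. A minimal pure-Python sketch of the closed-form cosine schedule follows, assuming `eta_min = 0` (the PyTorch default for `CosineAnnealingLR`). The helper name `cosine_lr` and the example numbers are illustrative and not part of the repository.

```python
import math


def cosine_lr(step: int, total_steps: int, base_lr: float, eta_min: float = 0.0) -> float:
    """Closed-form cosine annealing: base_lr at step 0, eta_min at total_steps."""
    cos_term = 1.0 + math.cos(math.pi * step / total_steps)
    return eta_min + 0.5 * (base_lr - eta_min) * cos_term


# Hypothetical run: 10 epochs x 100 batches per epoch, lr = 1e-3.
total_steps = max(1, 10 * 100)
schedule = [cosine_lr(s, total_steps, 1e-3) for s in (0, total_steps // 2, total_steps)]
```

With per-batch stepping the rate reaches half of `base_lr` at the midpoint of training and approaches zero only on the final batch.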
121
scripts/aggregate/baselines_cross_3x3_table.py
Normal file
@@ -0,0 +1,121 @@
|
||||
"""Aggregate IF/OCSVM 3x3 cross-dataset AUROC matrices (3-seed mean ± std).
|
||||
|
||||
Reads NPZs produced by scripts/baselines/run_if_ocsvm_cross.py:
|
||||
{method}_{src}_to_{tgt}_seed{S}.npz with keys b_score, a_score, a_labels
|
||||
|
||||
Writes one Markdown table per method.
|
||||
"""
|
||||
from __future__ import annotations
|
||||
import argparse
|
||||
from pathlib import Path
|
||||
import numpy as np
|
||||
from sklearn.metrics import roc_auc_score
|
||||
|
||||
REPO = Path(__file__).resolve().parents[2]
|
||||
|
||||
DATASETS = ["cicids2017", "cicddos2019", "ciciot2023"]
|
||||
SEEDS = [42, 43, 44]
|
||||
DEFAULT_METHODS = ["iforest", "ocsvm"]
|
||||
TITLE_NAMES = {
|
||||
"iforest": "Isolation Forest",
|
||||
"ocsvm": "OCSVM (RBF)",
|
||||
"shafir_nf": "Shafir NF (single-flow, 20-d, fast)",
|
||||
}
|
||||
SHORT = {"cicids2017": "CICIDS17", "cicddos2019": "CICDDoS19", "ciciot2023": "CICIoT23"}
|
||||
|
||||
|
||||
def cell_auroc(npz_path: Path) -> tuple[float, int, int]:
    """Return (AUROC, n_benign, n_attack) for one saved score file."""
    z = np.load(npz_path, allow_pickle=True)
    b = z["b_score"]
    a = z["a_score"]
    # Benign rows are the negative class (0), attack rows the positive class (1).
    y = np.r_[np.zeros(len(b)), np.ones(len(a))]
    s = np.r_[b, a]
    s = np.nan_to_num(s, nan=0.0, posinf=1e12, neginf=-1e12)
    return float(roc_auc_score(y, s)), len(b), len(a)
|
||||
|
||||
|
||||
def build_method_table(method: str, in_dir: Path) -> tuple[str, list[str]]:
|
||||
cells = {}
|
||||
counts = {}
|
||||
missing = []
|
||||
for src in DATASETS:
|
||||
for tgt in DATASETS:
|
||||
aucs = []
|
||||
n_b = n_a = None
|
||||
for s in SEEDS:
|
||||
p = in_dir / f"{method}_{src}_to_{tgt}_seed{s}.npz"
|
||||
if not p.exists():
|
||||
missing.append(p.name)
|
||||
continue
|
||||
auc, n_b, n_a = cell_auroc(p)
|
||||
aucs.append(auc)
|
||||
if not aucs:
|
||||
cells[(src, tgt)] = (float("nan"), float("nan"))
|
||||
else:
|
||||
a = np.asarray(aucs)
|
||||
cells[(src, tgt)] = (a.mean(), a.std())
|
||||
counts[(src, tgt)] = (n_b, n_a)
|
||||
|
||||
lines: list[str] = []
|
||||
title_name = TITLE_NAMES.get(method, method)
|
||||
lines.append(f"# 3×3 cross-dataset AUROC matrix — {title_name} (3-seed mean ± std)\n")
|
||||
lines.append("Rows = source (10K benign training); columns = target (10K benign + balanced ≤1M attacks).")
|
||||
lines.append("Trained on raw 20-d canonical flow features after `StandardScaler` fit on source benign train.")
|
||||
lines.append("Diagonal italic = within-dataset (target benign sampled from rows disjoint from training).\n")
|
||||
|
||||
header = "| Source ↓ / Target → | " + " | ".join(SHORT[t] for t in DATASETS) + " |"
|
||||
sep = "|" + "|".join(["---"] * (len(DATASETS) + 1)) + "|"
|
||||
lines.append(header)
|
||||
lines.append(sep)
|
||||
for src in DATASETS:
|
||||
row = [f"**{SHORT[src]}**"]
|
||||
for tgt in DATASETS:
|
||||
m, sd = cells[(src, tgt)]
|
||||
cell = f"{m:.4f} ± {sd:.4f}"
|
||||
if src == tgt:
|
||||
cell = f"_{cell}_"
|
||||
row.append(cell)
|
||||
lines.append("| " + " | ".join(row) + " |")
|
||||
|
||||
lines.append("\n## Sample counts (target benign / target attacks)\n")
|
||||
lines.append(header)
|
||||
lines.append(sep)
|
||||
for src in DATASETS:
|
||||
row = [SHORT[src]]
|
||||
for tgt in DATASETS:
|
||||
n_b, n_a = counts[(src, tgt)]
|
||||
row.append(f"{n_b}b / {n_a}a" if n_b is not None else "missing")
|
||||
lines.append("| " + " | ".join(row) + " |")
|
||||
return "\n".join(lines) + "\n", missing
|
||||
|
||||
|
||||
def main() -> None:
|
||||
p = argparse.ArgumentParser()
|
||||
p.add_argument("--in-dir", type=Path,
|
||||
default=REPO / "artifacts/baselines/if_ocsvm_cross_2026_05_11")
|
||||
p.add_argument("--out-md", type=Path,
|
||||
default=None,
|
||||
help="Combined markdown output path. Defaults to <in-dir>/CROSS_MATRIX_3x3.md")
|
||||
p.add_argument("--methods", nargs="+", default=DEFAULT_METHODS,
|
||||
help="Method names to aggregate (matching NPZ filename prefixes).")
|
||||
args = p.parse_args()
|
||||
|
||||
out_md = args.out_md or (args.in_dir / "CROSS_MATRIX_3x3.md")
|
||||
parts = []
|
||||
all_missing: list[str] = []
|
||||
for method in args.methods:
|
||||
block, missing = build_method_table(method, args.in_dir)
|
||||
parts.append(block)
|
||||
all_missing.extend(missing)
|
||||
print(block)
|
||||
print()
|
||||
if all_missing:
|
||||
print("# Missing inputs (skipped; a cell with no inputs at all is NaN)")
|
||||
for m in all_missing:
|
||||
print(f" - {m}")
|
||||
out_md.write_text("\n\n".join(parts))
|
||||
print(f"[wrote] {out_md}")
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
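The aggregation script above computes one AUROC per seed and reports each matrix cell as mean ± standard deviation, with diagonal cells in italics. The sketch below reproduces that flow in pure Python under a few assumptions: `auroc` is a brute-force pairwise estimate standing in for `sklearn.metrics.roc_auc_score`, `pstdev` mirrors the population standard deviation that `ndarray.std()` returns by default, and the toy scores are made up for illustration.

```python
from statistics import fmean, pstdev


def auroc(benign_scores, attack_scores):
    """AUROC as P(attack score > benign score), counting ties as 0.5."""
    wins = 0.0
    for a in attack_scores:
        for b in benign_scores:
            if a > b:
                wins += 1.0
            elif a == b:
                wins += 0.5
    return wins / (len(attack_scores) * len(benign_scores))


def format_cell(aucs, diagonal=False):
    """Render one matrix cell as 'mean ± std' using the population std."""
    if not aucs:
        return "nan ± nan"
    cell = f"{fmean(aucs):.4f} ± {pstdev(aucs):.4f}"
    return f"_{cell}_" if diagonal else cell


# Toy per-seed scores (made up): each entry is (benign scores, attack scores).
per_seed = [
    ([0.1, 0.2, 0.3], [0.4, 0.5]),  # perfectly separated
    ([0.1, 0.6, 0.3], [0.4, 0.5]),  # one benign row outranks both attacks
]
aucs = [auroc(b, a) for b, a in per_seed]
cell = format_cell(aucs, diagonal=True)
```

The pairwise loop is quadratic and only suitable for small examples; the repository script relies on scikit-learn for real score arrays.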
@@ -1,68 +0,0 @@
|
||||
#!/bin/bash
|
||||
# Run phase1 eval on all routes after trainings complete.
|
||||
# Splits across 2 GPUs in parallel chains.
|
||||
|
||||
set -e
|
||||
ROOT=/home/chy/JANUS
|
||||
UNIFIED_EVAL=${ROOT}/artifacts/verify_2026_04_24/eval_phase1_unified.py
|
||||
MIXED_EVAL=${ROOT}/Mixed_CFM/eval_phase1.py
|
||||
|
||||
cd ${ROOT}
|
||||
|
||||
# GPU 0: baselines + route_a (6 models)
|
||||
{
|
||||
for prefix in baseline_ciciot2023 route_a_causal_ciciot2023; do
|
||||
for seed in 42 43 44; do
|
||||
name=${prefix}_seed${seed}
|
||||
md=${ROOT}/artifacts/route_comparison/${name}
|
||||
[ -f "${md}/model.pt" ] || continue
|
||||
[ -f "${md}/phase1_summary.json" ] && continue
|
||||
echo "[GPU0 eval] ${name}"
|
||||
cd ${ROOT}/Unified_CFM
|
||||
CUDA_VISIBLE_DEVICES=0 stdbuf -oL uv run --no-sync python -u ${UNIFIED_EVAL} \
|
||||
--model-dir ${md} --out-dir ${md} \
|
||||
--batch-size 256 --n-steps 16 --jacobian-n-eps 4 \
|
||||
--n-val-cap 5000 --n-atk-cap 10000 \
|
||||
> ${md}/phase1.log 2>&1
|
||||
done
|
||||
done
|
||||
echo "[GPU0 done]"
|
||||
} &
|
||||
GPU0_PID=$!
|
||||
|
||||
# GPU 1: route_b + route_c (6 models)
|
||||
{
|
||||
for seed in 42 43 44; do
|
||||
name=route_b_spectral_ciciot2023_seed${seed}
|
||||
md=${ROOT}/artifacts/route_comparison/${name}
|
||||
[ -f "${md}/model.pt" ] || continue
|
||||
[ -f "${md}/phase1_summary.json" ] && continue
|
||||
echo "[GPU1 eval] ${name}"
|
||||
cd ${ROOT}/Unified_CFM
|
||||
CUDA_VISIBLE_DEVICES=1 stdbuf -oL uv run --no-sync python -u ${UNIFIED_EVAL} \
|
||||
--model-dir ${md} --out-dir ${md} \
|
||||
--batch-size 256 --n-steps 16 --jacobian-n-eps 4 \
|
||||
--n-val-cap 5000 --n-atk-cap 10000 \
|
||||
> ${md}/phase1.log 2>&1
|
||||
done
|
||||
for seed in 42 43 44; do
|
||||
name=route_c_mixed_ciciot2023_seed${seed}
|
||||
md=${ROOT}/artifacts/route_comparison/${name}
|
||||
[ -f "${md}/model.pt" ] || continue
|
||||
[ -f "${md}/phase1_summary.json" ] && continue
|
||||
echo "[GPU1 eval] ${name}"
|
||||
cd ${ROOT}/Mixed_CFM
|
||||
CUDA_VISIBLE_DEVICES=1 stdbuf -oL uv run --no-sync python -u ${MIXED_EVAL} \
|
||||
--model-dir ${md} --out-dir ${md} \
|
||||
--batch-size 256 --n-steps 16 \
|
||||
--n-val-cap 5000 --n-atk-cap 10000 \
|
||||
> ${md}/phase1.log 2>&1
|
||||
done
|
||||
echo "[GPU1 done]"
|
||||
} &
|
||||
GPU1_PID=$!
|
||||
|
||||
wait $GPU0_PID
|
||||
wait $GPU1_PID
|
||||
echo "[all phase1 done]"
|
||||
cd ${ROOT} && uv run --no-sync python artifacts/route_comparison/aggregate_results.py
|
||||
@@ -1,105 +0,0 @@
|
||||
#!/bin/bash
|
||||
# Cross-dataset eval for all 4 routes × 2 targets × 3 seeds = 24 runs.
|
||||
# Source: CICIoT2023 (where all models were trained).
|
||||
# Targets: CICIDS2017 + CICDDoS2019.
|
||||
|
||||
set -e
|
||||
ROOT=/home/chy/JANUS
|
||||
UNIFIED_EVAL=${ROOT}/artifacts/verify_2026_04_24/eval_phase2_cross_cicddos2019.py
|
||||
MIXED_EVAL=${ROOT}/Mixed_CFM/eval_cross.py
|
||||
CROSS_DIR=${ROOT}/artifacts/route_comparison/cross
|
||||
mkdir -p ${CROSS_DIR}
|
||||
|
||||
# Target dataset paths
|
||||
declare -A TARGETS
|
||||
TARGETS[cicids2017_store]=${ROOT}/datasets/cicids2017/processed/full_store
|
||||
TARGETS[cicids2017_flows]=${ROOT}/datasets/cicids2017/processed/flows.parquet
|
||||
TARGETS[cicids2017_features]=${ROOT}/datasets/cicids2017/processed/flow_features.parquet
|
||||
TARGETS[cicids2017_features_spectral]=${ROOT}/datasets/cicids2017/processed/flow_features_spectral.parquet
|
||||
|
||||
TARGETS[cicddos2019_store]=${ROOT}/datasets/cicddos2019/processed/full_store
|
||||
TARGETS[cicddos2019_flows]=${ROOT}/datasets/cicddos2019/processed/flows.parquet
|
||||
TARGETS[cicddos2019_features]=${ROOT}/datasets/cicddos2019/processed/flow_features.parquet
|
||||
TARGETS[cicddos2019_features_spectral]=${ROOT}/datasets/cicddos2019/processed/flow_features_spectral.parquet
|
||||
|
||||
run_unified_eval() {
|
||||
local gpu=$1 model_dir=$2 target=$3 features=$4 out_name=$5
|
||||
local out=${CROSS_DIR}/${out_name}.json
|
||||
[ -f "${out}" ] && { echo "[skip] ${out_name}"; return; }
|
||||
echo "[gpu${gpu} eval] ${out_name}"
|
||||
cd ${ROOT}/Unified_CFM
|
||||
CUDA_VISIBLE_DEVICES=${gpu} stdbuf -oL uv run --no-sync python -u ${UNIFIED_EVAL} \
|
||||
--model-dir ${model_dir} \
|
||||
--target-store ${TARGETS[${target}_store]} \
|
||||
--target-flows ${TARGETS[${target}_flows]} \
|
||||
--target-flow-features ${features} \
|
||||
--out ${out} \
|
||||
--n-benign 10000 --n-attack 10000 --seed 42 \
|
||||
--T 64 --batch-size 256 --n-steps 16 \
|
||||
> ${CROSS_DIR}/${out_name}.log 2>&1
|
||||
}
|
||||
|
||||
run_mixed_eval() {
|
||||
local gpu=$1 model_dir=$2 target=$3 out_name=$4
|
||||
local out=${CROSS_DIR}/${out_name}.json
|
||||
[ -f "${out}" ] && { echo "[skip] ${out_name}"; return; }
|
||||
echo "[gpu${gpu} mixed eval] ${out_name}"
|
||||
cd ${ROOT}/Mixed_CFM
|
||||
CUDA_VISIBLE_DEVICES=${gpu} stdbuf -oL uv run --no-sync python -u ${MIXED_EVAL} \
|
||||
--model-dir ${model_dir} \
|
||||
--target-store ${TARGETS[${target}_store]} \
|
||||
--target-flows ${TARGETS[${target}_flows]} \
|
||||
--target-flow-features ${TARGETS[${target}_features]} \
|
||||
--out ${out} \
|
||||
--n-benign 10000 --n-attack 10000 --seed 42 \
|
||||
--T 64 --batch-size 256 --n-steps 16 \
|
||||
> ${CROSS_DIR}/${out_name}.log 2>&1
|
||||
}
|
||||
|
||||
# === GPU 0 chain: baselines + route_a, both targets ===
|
||||
{
|
||||
for prefix_route in "baseline_ciciot2023:baseline" "route_a_causal_ciciot2023:route_a_causal"; do
|
||||
prefix=${prefix_route%:*}
|
||||
short=${prefix_route#*:}
|
||||
for seed in 42 43 44; do
|
||||
md=${ROOT}/artifacts/route_comparison/${prefix}_seed${seed}
|
||||
[ -f "${md}/model.pt" ] || continue
|
||||
for target in cicids2017 cicddos2019; do
|
||||
run_unified_eval 0 "${md}" "${target}" "${TARGETS[${target}_features]}" \
|
||||
"${short}_seed${seed}_to_${target}"
|
||||
done
|
||||
done
|
||||
done
|
||||
echo "[gpu0 cross chain done]"
|
||||
} > /tmp/cross_gpu0.log 2>&1 &
|
||||
GPU0=$!
|
||||
|
||||
# === GPU 1 chain: route_b (uses spectral features) + route_c (mixed) ===
|
||||
{
|
||||
# route_b: must use flow_features_spectral.parquet
|
||||
for seed in 42 43 44; do
|
||||
md=${ROOT}/artifacts/route_comparison/route_b_spectral_ciciot2023_seed${seed}
|
||||
[ -f "${md}/model.pt" ] || continue
|
||||
for target in cicids2017 cicddos2019; do
|
||||
run_unified_eval 1 "${md}" "${target}" "${TARGETS[${target}_features_spectral]}" \
|
||||
"route_b_spectral_seed${seed}_to_${target}"
|
||||
done
|
||||
done
|
||||
|
||||
# route_c: Mixed_CFM eval (uses canonical flow_features)
|
||||
for seed in 42 43 44; do
|
||||
md=${ROOT}/artifacts/route_comparison/route_c_mixed_ciciot2023_seed${seed}
|
||||
[ -f "${md}/model.pt" ] || continue
|
||||
for target in cicids2017 cicddos2019; do
|
||||
run_mixed_eval 1 "${md}" "${target}" \
|
||||
"route_c_mixed_seed${seed}_to_${target}"
|
||||
done
|
||||
done
|
||||
echo "[gpu1 cross chain done]"
|
||||
} > /tmp/cross_gpu1.log 2>&1 &
|
||||
GPU1=$!
|
||||
|
||||
wait $GPU0
|
||||
wait $GPU1
|
||||
echo "[all cross done]"
|
||||
ls -la ${CROSS_DIR}/*.json | wc -l
|
||||
@@ -1,45 +0,0 @@
|
||||
#!/bin/bash
|
||||
# Run phase1 eval on all route_comparison models.
|
||||
# Output: <model_dir>/phase1_summary.json + phase1_scores.npz
|
||||
#
|
||||
# Usage:
|
||||
# bash artifacts/route_comparison/run_phase1_all.sh [GPU_ID]
|
||||
#
|
||||
# Default GPU_ID = 0. Each eval takes ~3-5 min with the caps below.
|
||||
|
||||
set -e
|
||||
GPU_ID="${1:-0}"
|
||||
ROOT=/home/chy/JANUS
|
||||
EVAL=${ROOT}/artifacts/verify_2026_04_24/eval_phase1_unified.py
|
||||
|
||||
models=(
|
||||
baseline_ciciot2023_seed42
|
||||
baseline_ciciot2023_seed43
|
||||
baseline_ciciot2023_seed44
|
||||
route_a_causal_ciciot2023_seed42
|
||||
route_a_causal_ciciot2023_seed43
|
||||
route_a_causal_ciciot2023_seed44
|
||||
)
|
||||
|
||||
cd ${ROOT}/Unified_CFM
|
||||
for name in "${models[@]}"; do
|
||||
model_dir=${ROOT}/artifacts/route_comparison/${name}
|
||||
if [ ! -f "${model_dir}/model.pt" ]; then
|
||||
echo "[skip] ${name}: model.pt missing"
|
||||
continue
|
||||
fi
|
||||
out_dir=${model_dir}
|
||||
if [ -f "${out_dir}/phase1_summary.json" ]; then
|
||||
echo "[skip] ${name}: phase1_summary.json exists"
|
||||
continue
|
||||
fi
|
||||
echo "[eval] ${name}"
|
||||
CUDA_VISIBLE_DEVICES=${GPU_ID} stdbuf -oL uv run --no-sync python -u ${EVAL} \
|
||||
--model-dir ${model_dir} --out-dir ${out_dir} \
|
||||
--batch-size 256 --n-steps 16 \
|
||||
--jacobian-n-eps 4 \
|
||||
--n-val-cap 5000 --n-atk-cap 10000 \
|
||||
2>&1 | tee ${model_dir}/phase1.log | tail -5
|
||||
echo "[done] ${name}"
|
||||
done
|
||||
echo "[all done]"
|
||||
237
scripts/baselines/run_if_ocsvm_cross.py
Normal file
@@ -0,0 +1,237 @@
|
||||
"""Cross-dataset baselines (Isolation Forest, OCSVM) on the 20-d canonical
|
||||
flow-feature contract.
|
||||
|
||||
Protocol per (method, src, tgt, seed):
|
||||
- Train: 10,000 source benign rows (random sample seeded with --seed + 1000)
|
||||
- Test: 10,000 target benign rows (random sample seeded with --seed)
|
||||
+ balanced per-class attack sample with n_attack cap (--n-attack
|
||||
default 1,000,000, divided across all attack classes, matching
|
||||
Mixed_CFM/eval_cross.py)
|
||||
- For diagonal src == tgt, target benign is sampled from the source-pool
|
||||
complement (the rows not used for training) so train and test are disjoint.
|
||||
|
||||
Outputs (in --out-dir):
|
||||
{method}_{src}_to_{tgt}_seed{seed}.npz -- b_score, a_score, a_labels
|
||||
{method}_{src}_to_{tgt}_seed{seed}.json -- AUROC, AUPRC, sample counts, timing
|
||||
"""
|
||||
from __future__ import annotations
|
||||
import argparse
|
||||
import json
|
||||
import time
|
||||
from pathlib import Path
|
||||
|
||||
import numpy as np
|
||||
import pandas as pd
|
||||
from sklearn.ensemble import IsolationForest
|
||||
from sklearn.metrics import average_precision_score, roc_auc_score
|
||||
from sklearn.preprocessing import StandardScaler
|
||||
from sklearn.svm import OneClassSVM
|
||||
|
||||
REPO = Path(__file__).resolve().parents[2]
|
||||
|
||||
DATASETS = {
|
||||
"cicids2017": {
|
||||
"flows": REPO / "datasets/cicids2017/processed/flows.parquet",
|
||||
"flow_features": REPO / "datasets/cicids2017/processed/flow_features.parquet",
|
||||
},
|
||||
"cicddos2019": {
|
||||
"flows": REPO / "datasets/cicddos2019/processed/flows.parquet",
|
||||
"flow_features": REPO / "datasets/cicddos2019/processed/flow_features.parquet",
|
||||
},
|
||||
"ciciot2023": {
|
||||
"flows": REPO / "datasets/ciciot2023/processed/full_store/flows.parquet",
|
||||
"flow_features": REPO / "datasets/ciciot2023/processed/flow_features.parquet",
|
||||
},
|
||||
}
|
||||
|
||||
FEATURE_COLS = (
|
||||
"log_duration", "log_n_pkts", "fwd_count", "bwd_count",
|
||||
"pkt_size_mean", "pkt_size_std", "pkt_size_max",
|
||||
"fwd_size_mean", "bwd_size_mean", "bwd_size_std",
|
||||
"iat_mean", "fwd_iat_max", "bwd_iat_max", "bwd_iat_std",
|
||||
"active_mean", "idle_mean",
|
||||
"log_pkts_per_s", "log_total_bytes",
|
||||
"ack_cnt", "syn_cnt",
|
||||
)
|
||||
|
||||
|
||||
def _load_dataset(name: str):
    paths = DATASETS[name]
    flows = pd.read_parquet(paths["flows"], columns=["flow_id", "label"])
    ff = pd.read_parquet(paths["flow_features"])
    if not np.array_equal(
        flows["flow_id"].to_numpy(dtype=np.uint64),
        ff["flow_id"].to_numpy(dtype=np.uint64),
    ):
        raise ValueError(f"{name}: flows.parquet and flow_features.parquet are not row-aligned")
    X = ff[list(FEATURE_COLS)].to_numpy(dtype=np.float64)
    X = np.nan_to_num(X, nan=0.0, posinf=0.0, neginf=0.0).astype(np.float32)
    labels = flows["label"].astype(str).to_numpy()
    return X, labels
|
||||
|
||||
|
||||
def _balanced_attack_sample(labels: np.ndarray, n_attack: int, rng: np.random.Generator) -> np.ndarray:
    """Return sorted row indices of a class-balanced attack sample of at most n_attack rows."""
    attack_idx = np.flatnonzero(labels != "normal")
    atk_labels = labels[attack_idx]
    classes = sorted(set(atk_labels))
    # n_attack is a total budget; each class gets an equal share of it.
    per_class = max(1, n_attack // len(classes))
    chunks = []
    for cls in classes:
        pool = attack_idx[atk_labels == cls]
        k = min(per_class, len(pool))
        if k:
            chunks.append(rng.choice(pool, size=k, replace=False))
    sel = np.sort(np.concatenate(chunks))
    # With more classes than n_attack, the per-class floor of 1 can overshoot the cap.
    if len(sel) > n_attack:
        sel = np.sort(rng.choice(sel, size=n_attack, replace=False))
    return sel
|
||||
|
||||
|
||||
def main() -> None:
|
||||
p = argparse.ArgumentParser()
|
||||
p.add_argument("--method", choices=["iforest", "ocsvm"], required=True)
|
||||
p.add_argument("--src", choices=list(DATASETS), required=True)
|
||||
p.add_argument("--tgt", choices=list(DATASETS), required=True)
|
||||
p.add_argument("--seed", type=int, required=True)
|
||||
p.add_argument("--out-dir", type=Path, required=True)
|
||||
p.add_argument("--n-train", type=int, default=10000)
|
||||
p.add_argument("--n-benign", type=int, default=10000)
|
||||
p.add_argument("--n-attack", type=int, default=1_000_000,
help="Total attack cap, split evenly across attack classes (matches Mixed_CFM/eval_cross.py).")
|
||||
# Method hyperparams
|
||||
p.add_argument("--iforest-n-estimators", type=int, default=200)
|
||||
p.add_argument("--ocsvm-nu", type=float, default=0.1)
|
||||
p.add_argument("--ocsvm-gamma", type=str, default="scale")
|
||||
p.add_argument("--ocsvm-cache-mb", type=int, default=2000)
|
||||
args = p.parse_args()
|
||||
|
||||
args.out_dir.mkdir(parents=True, exist_ok=True)
|
||||
tag = f"{args.method}_{args.src}_to_{args.tgt}_seed{args.seed}"
|
||||
print(f"[run] {tag}")
|
||||
|
||||
# --- source training ---
|
||||
t0 = time.time()
|
||||
src_X, src_labels = _load_dataset(args.src)
|
||||
src_benign_idx = np.flatnonzero(src_labels == "normal")
|
||||
rng_train = np.random.default_rng(args.seed + 1000)
|
||||
if len(src_benign_idx) < args.n_train:
|
||||
raise RuntimeError(f"{args.src}: only {len(src_benign_idx)} benign rows < n_train={args.n_train}")
|
||||
train_sel = np.sort(rng_train.choice(src_benign_idx, size=args.n_train, replace=False))
|
||||
train_X = src_X[train_sel]
|
||||
t_load_src = time.time() - t0
|
||||
|
||||
# --- target eval ---
|
||||
t0 = time.time()
|
||||
if args.tgt == args.src:
|
||||
tgt_X, tgt_labels = src_X, src_labels
|
||||
used_for_train = np.zeros(len(tgt_labels), dtype=bool)
|
||||
used_for_train[train_sel] = True
|
||||
eligible_benign = np.flatnonzero((tgt_labels == "normal") & ~used_for_train)
|
||||
else:
|
||||
tgt_X, tgt_labels = _load_dataset(args.tgt)
|
||||
eligible_benign = np.flatnonzero(tgt_labels == "normal")
|
||||
rng_eval = np.random.default_rng(args.seed)
|
||||
n_benign = min(args.n_benign, len(eligible_benign))
|
||||
if n_benign < args.n_benign:
|
||||
print(f"[warn] only {len(eligible_benign)} eligible benign rows in target (asked {args.n_benign})")
|
||||
b_sel = np.sort(rng_eval.choice(eligible_benign, size=n_benign, replace=False))
|
||||
a_sel = _balanced_attack_sample(tgt_labels, args.n_attack, rng_eval)
|
||||
val_X = tgt_X[b_sel]
|
||||
atk_X = tgt_X[a_sel]
|
||||
a_labels = tgt_labels[a_sel]
|
||||
t_load_tgt = time.time() - t0
|
||||
print(f"[data] train={len(train_X):,} val={len(val_X):,} attack={len(atk_X):,}"
|
||||
f" classes={len(set(a_labels))} D={train_X.shape[1]}")
|
||||
|
||||
# --- standardize on source train ---
|
||||
scaler = StandardScaler().fit(train_X)
|
||||
train_Z = scaler.transform(train_X).astype(np.float32)
|
||||
val_Z = scaler.transform(val_X).astype(np.float32)
|
||||
atk_Z = scaler.transform(atk_X).astype(np.float32)
|
||||
|
||||
# --- fit ---
|
||||
t0 = time.time()
|
||||
if args.method == "iforest":
|
||||
model = IsolationForest(
|
||||
n_estimators=args.iforest_n_estimators,
|
||||
random_state=args.seed,
|
||||
n_jobs=-1,
|
||||
contamination="auto",
|
||||
)
|
||||
model.fit(train_Z)
|
||||
else:
|
||||
model = OneClassSVM(
|
||||
kernel="rbf",
|
||||
nu=args.ocsvm_nu,
|
||||
gamma=args.ocsvm_gamma,
|
||||
cache_size=args.ocsvm_cache_mb,
|
||||
)
|
||||
model.fit(train_Z)
|
||||
t_fit = time.time() - t0
|
||||
|
||||
# --- score: higher = more anomalous ---
|
||||
# IsolationForest.score_samples returns higher-for-normal, so negate.
|
||||
# OneClassSVM.decision_function returns signed distance to boundary
|
||||
# (higher = more normal), so negate too.
|
||||
t0 = time.time()
|
||||
if args.method == "iforest":
|
||||
b_score = (-model.score_samples(val_Z)).astype(np.float32)
|
||||
a_score = (-model.score_samples(atk_Z)).astype(np.float32)
|
||||
else:
|
||||
b_score = (-model.decision_function(val_Z)).astype(np.float32)
|
||||
a_score = (-model.decision_function(atk_Z)).astype(np.float32)
|
||||
t_score = time.time() - t0
|
||||
|
||||
# --- metrics ---
|
||||
y = np.r_[np.zeros(len(b_score)), np.ones(len(a_score))]
|
||||
s = np.r_[b_score, a_score]
|
||||
s = np.nan_to_num(s, nan=0.0, posinf=1e12, neginf=-1e12)
|
||||
auroc = float(roc_auc_score(y, s))
|
||||
auprc = float(average_precision_score(y, s))
|
||||
|
||||
per_class = {}
|
||||
for cls in sorted(set(a_labels)):
|
||||
m = a_labels == cls
|
||||
y_c = np.r_[np.zeros(len(b_score)), np.ones(int(m.sum()))]
|
||||
s_c = np.r_[b_score, a_score[m]]
|
||||
s_c = np.nan_to_num(s_c, nan=0.0, posinf=1e12, neginf=-1e12)
|
||||
try:
|
||||
auc_c = float(roc_auc_score(y_c, s_c))
|
||||
except ValueError:
|
||||
auc_c = float("nan")
|
||||
per_class[cls] = {"_n": int(m.sum()), "auroc": auc_c}
|
||||
|
||||
out = {
|
||||
"method": args.method,
|
||||
"src": args.src,
|
||||
"tgt": args.tgt,
|
||||
"seed": args.seed,
|
||||
"n_train": int(len(train_X)),
|
||||
"n_benign": int(len(val_X)),
|
||||
"n_attack": int(len(atk_X)),
|
||||
"n_attack_classes": int(len(set(a_labels))),
|
||||
"t_load_src_sec": round(t_load_src, 2),
|
||||
"t_load_tgt_sec": round(t_load_tgt, 2),
|
||||
"t_fit_sec": round(t_fit, 2),
|
||||
"t_score_sec": round(t_score, 2),
|
||||
"overall": {"auroc": auroc, "auprc": auprc},
|
||||
"per_class": per_class,
|
||||
}
|
||||
if args.method == "iforest":
|
||||
out["hparams"] = {"n_estimators": args.iforest_n_estimators}
|
||||
else:
|
||||
out["hparams"] = {"nu": args.ocsvm_nu, "gamma": args.ocsvm_gamma}
|
||||
|
||||
json_path = args.out_dir / f"{tag}.json"
|
||||
json_path.write_text(json.dumps(out, indent=2))
|
||||
npz_path = args.out_dir / f"{tag}.npz"
|
||||
np.savez_compressed(npz_path, b_score=b_score, a_score=a_score, a_labels=a_labels.astype(str))
|
||||
print(f"[saved] {json_path}")
|
||||
print(f"[saved] {npz_path}")
|
||||
print(f"[result] {args.method:7s} {args.src} -> {args.tgt} seed={args.seed} "
|
||||
f"AUROC={auroc:.4f} AUPRC={auprc:.4f} "
|
||||
f"fit={t_fit:.1f}s score={t_score:.1f}s")
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
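The protocol above draws the attack test set with a total budget `n_attack` that is divided evenly across attack classes, so rare classes contribute everything they have while common ones are capped. A minimal pure-Python sketch of that sampling rule follows. It uses `random.Random` instead of NumPy's generator, assumes at least one attack class is present, and the toy labels are made up for illustration.

```python
import random


def balanced_attack_sample(labels, n_attack, rng):
    """Pick up to n_attack attack rows, with an equal share per attack class."""
    by_class = {}
    for idx, label in enumerate(labels):
        if label != "normal":
            by_class.setdefault(label, []).append(idx)
    per_class = max(1, n_attack // len(by_class))
    selected = []
    for cls in sorted(by_class):
        pool = by_class[cls]
        selected.extend(rng.sample(pool, min(per_class, len(pool))))
    # The per-class floor of 1 can overshoot the budget when classes outnumber it.
    if len(selected) > n_attack:
        selected = rng.sample(selected, n_attack)
    return sorted(selected)


# Toy labels (made up): 6 benign rows, 5 "dos" rows, 2 "scan" rows.
labels = ["normal"] * 6 + ["dos"] * 5 + ["scan"] * 2
picked = balanced_attack_sample(labels, n_attack=6, rng=random.Random(42))
counts = {cls: sum(labels[i] == cls for i in picked) for cls in ("dos", "scan")}
```

Note that the shortfall from a small class is not redistributed: here the budget is 6, but only 5 rows are returned because "scan" has just 2 rows.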
39
scripts/baselines/run_if_ocsvm_cross_all.sh
Executable file
@@ -0,0 +1,39 @@
|
||||
#!/usr/bin/env bash
|
||||
# Orchestrate the full 3x3 cross-dataset sweep for IF/OCSVM baselines.
|
||||
# 3 sources x 3 targets x 3 seeds x 2 methods = 54 runs.
|
||||
set -euo pipefail
|
||||
|
||||
REPO="/home/chy/JANUS"
|
||||
cd "$REPO"
|
||||
|
||||
OUT_DIR="${1:-$REPO/artifacts/baselines/if_ocsvm_cross_2026_05_11}"
|
||||
mkdir -p "$OUT_DIR"
|
||||
LOG_DIR="$OUT_DIR/logs"
|
||||
mkdir -p "$LOG_DIR"
|
||||
|
||||
DATASETS=(cicids2017 cicddos2019 ciciot2023)
|
||||
SEEDS=(42 43 44)
|
||||
METHODS=(iforest ocsvm)
|
||||
|
||||
START=$(date +%s)
|
||||
for method in "${METHODS[@]}"; do
|
||||
for src in "${DATASETS[@]}"; do
|
||||
for tgt in "${DATASETS[@]}"; do
|
||||
for seed in "${SEEDS[@]}"; do
|
||||
tag="${method}_${src}_to_${tgt}_seed${seed}"
|
||||
if [[ -f "$OUT_DIR/${tag}.json" ]]; then
|
||||
echo "[skip] $tag (json exists)"
|
||||
continue
|
||||
fi
|
||||
echo "[start] $tag"
|
||||
uv run --no-sync python scripts/baselines/run_if_ocsvm_cross.py \
|
||||
--method "$method" --src "$src" --tgt "$tgt" --seed "$seed" \
|
||||
--out-dir "$OUT_DIR" \
|
||||
> "$LOG_DIR/${tag}.log" 2>&1
|
||||
echo "[done] $tag ($(grep -F '[result]' "$LOG_DIR/${tag}.log" | tail -1))"
|
||||
done
|
||||
done
|
||||
done
|
||||
done
|
||||
END=$(date +%s)
|
||||
echo "[all done] elapsed $((END - START))s"
|
||||
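The orchestrator above walks a method × source × target × seed grid and skips any run whose JSON result already exists, which makes the sweep resumable. The sketch below enumerates the same grid in Python to show where the 54 runs come from. The `pending` helper and the `finished` set are illustrative stand-ins for the script's file-existence check.

```python
from itertools import product

DATASETS = ("cicids2017", "cicddos2019", "ciciot2023")
SEEDS = (42, 43, 44)
METHODS = ("iforest", "ocsvm")


def sweep_tags():
    """Yield run tags in the orchestrator's loop order: method, src, tgt, seed."""
    for method, src, tgt, seed in product(METHODS, DATASETS, DATASETS, SEEDS):
        yield f"{method}_{src}_to_{tgt}_seed{seed}"


def pending(tags, finished):
    """Keep only tags without a recorded result (stand-in for the JSON check)."""
    return [t for t in tags if t not in finished]


tags = list(sweep_tags())
todo = pending(tags, finished={"iforest_cicids2017_to_cicids2017_seed42"})
```

Because the tag doubles as the output filename stem, the same string identifies a run in the logs, the NPZ scores, and the aggregated table.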
233
scripts/baselines/run_if_ocsvm_cross_packets.py
Normal file
@@ -0,0 +1,233 @@
|
||||
"""Path-B: IF/OCSVM cross-dataset baselines on RAW PACKET SEQUENCES.
|
||||
|
||||
Same protocol as run_if_ocsvm_cross.py, but the input feature vector is the
|
||||
flattened first T=64 packet tokens (9-d each) -> 576-d. No flow-stat
|
||||
aggregation — this is the input modality JANUS itself consumes, so it
|
||||
measures what classical AD can do without hand-engineered features.
|
||||
|
||||
Outputs:
|
||||
{method}_{src}_to_{tgt}_seed{seed}.{json,npz}
|
||||
"""
|
||||
from __future__ import annotations
|
||||
import argparse
|
||||
import json
|
||||
import sys
|
||||
import time
|
||||
from pathlib import Path
|
||||
|
||||
import numpy as np
|
||||
import pandas as pd
|
||||
from sklearn.ensemble import IsolationForest
|
||||
from sklearn.metrics import average_precision_score, roc_auc_score
|
||||
from sklearn.preprocessing import StandardScaler
|
||||
from sklearn.svm import OneClassSVM
|
||||
|
||||
REPO = Path(__file__).resolve().parents[2]
|
||||
sys.path.insert(0, str(REPO))
|
||||
from common.packet_store import PacketShardStore # noqa: E402
|
||||
|
||||
DATASETS = {
|
||||
"cicids2017": {
|
||||
"flows": REPO / "datasets/cicids2017/processed/flows.parquet",
|
||||
"packets_npz": REPO / "datasets/cicids2017/processed/packets.npz",
|
||||
"source_store": None,
|
||||
},
|
||||
"cicddos2019": {
|
||||
"flows": REPO / "datasets/cicddos2019/processed/flows.parquet",
|
||||
"packets_npz": None,
|
||||
"source_store": REPO / "datasets/cicddos2019/processed/full_store",
|
||||
},
|
||||
"ciciot2023": {
|
||||
"flows": REPO / "datasets/ciciot2023/processed/full_store/flows.parquet",
|
||||
"packets_npz": None,
|
||||
"source_store": REPO / "datasets/ciciot2023/processed/full_store",
|
||||
},
|
||||
}
|
||||
|
||||
|
||||
def _load_labels(name: str) -> np.ndarray:
|
||||
paths = DATASETS[name]
|
||||
flows = pd.read_parquet(paths["flows"], columns=["flow_id", "label"])
|
||||
return flows["label"].astype(str).to_numpy()
|
||||
|
||||
|
||||
def _materialize_packets(name: str, indices: np.ndarray, T: int) -> np.ndarray:
|
||||
paths = DATASETS[name]
|
||||
if paths["packets_npz"] is not None:
|
||||
pz = np.load(paths["packets_npz"], mmap_mode="r")
|
||||
tokens = pz["packet_tokens"]
|
||||
if T > tokens.shape[1]:
|
||||
raise ValueError(f"requested T={T} > stored {tokens.shape[1]}")
|
||||
out = np.asarray(tokens[indices, :T, :]).astype(np.float32, copy=True)
|
||||
return out
|
||||
else:
|
||||
store = PacketShardStore.open(paths["source_store"])
|
||||
tok, _ = store.read_packets(indices.astype(np.int64), T=T)
|
||||
return tok.astype(np.float32, copy=False)
|
||||
|
||||
|
||||
def _balanced_attack_sample(labels: np.ndarray, n_attack: int, rng: np.random.Generator) -> np.ndarray:
    """Return sorted row indices of a class-balanced attack sample of at most n_attack rows."""
    attack_idx = np.flatnonzero(labels != "normal")
    atk_labels = labels[attack_idx]
    classes = sorted(set(atk_labels))
    # n_attack is a total budget; each class gets an equal share of it.
    per_class = max(1, n_attack // len(classes))
    chunks = []
    for cls in classes:
        pool = attack_idx[atk_labels == cls]
        k = min(per_class, len(pool))
        if k:
            chunks.append(rng.choice(pool, size=k, replace=False))
    sel = np.sort(np.concatenate(chunks))
    # With more classes than n_attack, the per-class floor of 1 can overshoot the cap.
    if len(sel) > n_attack:
        sel = np.sort(rng.choice(sel, size=n_attack, replace=False))
    return sel
|
||||
|
||||
|
||||
def main() -> None:
|
||||
p = argparse.ArgumentParser()
|
||||
p.add_argument("--method", choices=["iforest", "ocsvm"], required=True)
|
||||
p.add_argument("--src", choices=list(DATASETS), required=True)
|
||||
p.add_argument("--tgt", choices=list(DATASETS), required=True)
|
||||
p.add_argument("--seed", type=int, required=True)
|
||||
p.add_argument("--out-dir", type=Path, required=True)
|
||||
p.add_argument("--T", type=int, default=64, help="Packets-per-flow cap (matches JANUS T=64).")
|
||||
p.add_argument("--n-train", type=int, default=10000)
|
||||
p.add_argument("--n-benign", type=int, default=10000)
|
||||
p.add_argument("--n-attack", type=int, default=200000,
help="Total cap on target attacks, split evenly across attack classes. Smaller than the "
"20-d run (1M) because 576-d OCSVM scoring is much slower.")
|
||||
p.add_argument("--min-len", type=int, default=2)
|
||||
# Method hyperparams
|
||||
p.add_argument("--iforest-n-estimators", type=int, default=200)
|
||||
p.add_argument("--ocsvm-nu", type=float, default=0.1)
|
||||
p.add_argument("--ocsvm-gamma", type=str, default="scale")
|
||||
p.add_argument("--ocsvm-cache-mb", type=int, default=2000)
|
||||
args = p.parse_args()
|
||||
|
||||
args.out_dir.mkdir(parents=True, exist_ok=True)
|
||||
tag = f"{args.method}_{args.src}_to_{args.tgt}_seed{args.seed}"
|
||||
print(f"[run] {tag} (raw {args.T}x9 packets = {args.T * 9}-d)")
|
||||
|
||||
# --- source training ---
|
||||
t0 = time.time()
|
||||
src_labels = _load_labels(args.src)
|
||||
src_benign_idx = np.flatnonzero(src_labels == "normal")
|
||||
rng_train = np.random.default_rng(args.seed + 1000)
|
||||
if len(src_benign_idx) < args.n_train:
|
||||
raise RuntimeError(f"{args.src}: only {len(src_benign_idx)} benign rows < n_train={args.n_train}")
|
||||
train_sel = np.sort(rng_train.choice(src_benign_idx, size=args.n_train, replace=False))
|
||||
train_tokens = _materialize_packets(args.src, train_sel, T=args.T)
|
||||
train_X = train_tokens.reshape(len(train_sel), -1)
|
||||
t_load_src = time.time() - t0
|
||||
|
||||
# --- target eval ---
|
||||
t0 = time.time()
|
||||
if args.tgt == args.src:
|
||||
tgt_labels = src_labels
|
||||
used = np.zeros(len(tgt_labels), dtype=bool)
|
||||
used[train_sel] = True
|
||||
eligible_benign = np.flatnonzero((tgt_labels == "normal") & ~used)
|
||||
else:
|
||||
tgt_labels = _load_labels(args.tgt)
|
||||
eligible_benign = np.flatnonzero(tgt_labels == "normal")
|
||||
rng_eval = np.random.default_rng(args.seed)
|
||||
n_benign = min(args.n_benign, len(eligible_benign))
|
||||
if n_benign < args.n_benign:
|
||||
print(f"[warn] only {len(eligible_benign)} eligible benign rows in target (asked {args.n_benign})")
|
||||
b_sel = np.sort(rng_eval.choice(eligible_benign, size=n_benign, replace=False))
|
||||
a_sel = _balanced_attack_sample(tgt_labels, args.n_attack, rng_eval)
|
||||
val_tokens = _materialize_packets(args.tgt, b_sel, T=args.T)
|
||||
atk_tokens = _materialize_packets(args.tgt, a_sel, T=args.T)
|
||||
val_X = val_tokens.reshape(len(b_sel), -1)
|
||||
atk_X = atk_tokens.reshape(len(a_sel), -1)
|
||||
a_labels = tgt_labels[a_sel]
|
||||
t_load_tgt = time.time() - t0
|
||||
print(f"[data] train={len(train_X):,} val={len(val_X):,} attack={len(atk_X):,}"
|
||||
f" classes={len(set(a_labels))} D={train_X.shape[1]}")
|
||||
|
||||
# --- standardize ---
|
||||
scaler = StandardScaler().fit(train_X)
|
||||
train_Z = scaler.transform(train_X).astype(np.float32)
|
||||
val_Z = scaler.transform(val_X).astype(np.float32)
|
||||
atk_Z = scaler.transform(atk_X).astype(np.float32)
|
||||
|
||||
# --- fit ---
|
||||
t0 = time.time()
|
||||
if args.method == "iforest":
|
||||
model = IsolationForest(
|
||||
n_estimators=args.iforest_n_estimators,
|
||||
random_state=args.seed,
|
||||
n_jobs=-1,
|
||||
contamination="auto",
|
||||
)
|
||||
model.fit(train_Z)
|
||||
else:
|
||||
model = OneClassSVM(
|
||||
kernel="rbf",
|
||||
nu=args.ocsvm_nu,
|
||||
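gamma=args.ocsvm_gamma if args.ocsvm_gamma in ("scale", "auto") else float(args.ocsvm_gamma),  # numeric --ocsvm-gamma arrives as a string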
|
||||
cache_size=args.ocsvm_cache_mb,
|
||||
)
|
||||
model.fit(train_Z)
|
||||
t_fit = time.time() - t0
|
||||
|
||||
# --- score (higher = more anomalous) ---
|
||||
t0 = time.time()
|
||||
if args.method == "iforest":
|
||||
b_score = (-model.score_samples(val_Z)).astype(np.float32)
|
||||
a_score = (-model.score_samples(atk_Z)).astype(np.float32)
|
||||
else:
|
||||
b_score = (-model.decision_function(val_Z)).astype(np.float32)
|
||||
a_score = (-model.decision_function(atk_Z)).astype(np.float32)
|
||||
t_score = time.time() - t0
|
||||
|
||||
# --- metrics ---
|
||||
y = np.r_[np.zeros(len(b_score)), np.ones(len(a_score))]
|
||||
s = np.r_[b_score, a_score]
|
||||
s = np.nan_to_num(s, nan=0.0, posinf=1e12, neginf=-1e12)
|
||||
auroc = float(roc_auc_score(y, s))
|
||||
auprc = float(average_precision_score(y, s))
|
||||
|
||||
per_class = {}
|
||||
for cls in sorted(set(a_labels)):
|
||||
m = a_labels == cls
|
||||
y_c = np.r_[np.zeros(len(b_score)), np.ones(int(m.sum()))]
|
||||
s_c = np.r_[b_score, a_score[m]]
|
||||
s_c = np.nan_to_num(s_c, nan=0.0, posinf=1e12, neginf=-1e12)
|
||||
try:
|
||||
auc_c = float(roc_auc_score(y_c, s_c))
|
||||
except ValueError:
|
||||
auc_c = float("nan")
|
||||
per_class[cls] = {"_n": int(m.sum()), "auroc": auc_c}
|
||||
|
||||
out = {
|
||||
"method": args.method,
|
||||
"src": args.src,
|
||||
"tgt": args.tgt,
|
||||
"seed": args.seed,
|
||||
"T": args.T,
|
||||
"feature_dim": int(train_X.shape[1]),
|
||||
"input_mode": "raw_packet_sequence",
|
||||
"n_train": int(len(train_X)),
|
||||
"n_benign": int(len(val_X)),
|
||||
"n_attack": int(len(atk_X)),
|
||||
"n_attack_classes": int(len(set(a_labels))),
|
||||
"t_load_src_sec": round(t_load_src, 2),
|
||||
"t_load_tgt_sec": round(t_load_tgt, 2),
|
||||
"t_fit_sec": round(t_fit, 2),
|
||||
"t_score_sec": round(t_score, 2),
|
||||
"overall": {"auroc": auroc, "auprc": auprc},
|
||||
"per_class": per_class,
|
||||
}
|
||||
json_path = args.out_dir / f"{tag}.json"
|
||||
json_path.write_text(json.dumps(out, indent=2))
|
||||
npz_path = args.out_dir / f"{tag}.npz"
|
||||
np.savez_compressed(npz_path, b_score=b_score, a_score=a_score, a_labels=a_labels.astype(str))
|
||||
print(f"[saved] {json_path}")
|
||||
print(f"[result] {args.method:7s} {args.src} -> {args.tgt} seed={args.seed} "
|
||||
f"AUROC={auroc:.4f} AUPRC={auprc:.4f} "
|
||||
f"fit={t_fit:.1f}s score={t_score:.1f}s")
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
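The per-class metric in the runner above scores each attack class against the full benign pool, not against the other attack classes. A minimal sketch of that idea follows, assuming a pure-Python rank-based AUROC in place of `roc_auc_score`. The helper names `auroc` and `per_class_auroc` and the toy scores are hypothetical and not part of the repository.

```python
from collections import defaultdict


def auroc(benign_scores, attack_scores):
    """Probability that a random attack outscores a random benign row.

    Ties count as half a win (the Mann-Whitney U view of AUROC).
    Returns NaN when either side is empty, mirroring the runner's
    NaN fallback for undefined AUROC.
    """
    if not benign_scores or not attack_scores:
        return float("nan")
    wins = 0.0
    for a in attack_scores:
        for b in benign_scores:
            if a > b:
                wins += 1.0
            elif a == b:
                wins += 0.5
    return wins / (len(attack_scores) * len(benign_scores))


def per_class_auroc(benign_scores, attack_scores, attack_labels):
    """AUROC of each attack class against the full benign pool."""
    by_class = defaultdict(list)
    for score, label in zip(attack_scores, attack_labels):
        by_class[label].append(score)
    return {
        cls: {"_n": len(scores), "auroc": auroc(benign_scores, scores)}
        for cls, scores in sorted(by_class.items())
    }


# Toy scores, higher = more anomalous (made-up values, not real results).
benign = [0.1, 0.2, 0.3, 0.4]
attacks = [0.9, 0.8, 0.25, 0.15]
labels = ["ddos", "ddos", "portscan", "portscan"]
result = per_class_auroc(benign, attacks, labels)
print(result)
```

Because every class shares the same benign pool, a class with few samples still gets a comparable AUROC, which is why the summary JSON stores `_n` next to each value.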
38
scripts/baselines/run_if_ocsvm_cross_packets_all.sh
Executable file
@@ -0,0 +1,38 @@
|
||||
#!/usr/bin/env bash
|
||||
# Path-B sweep: IF/OCSVM on raw 64x9 packet sequence (576-d), 3x3 cross-dataset.
|
||||
set -euo pipefail
|
||||
|
||||
REPO="/home/chy/JANUS"
|
||||
cd "$REPO"
|
||||
|
||||
OUT_DIR="${1:-$REPO/artifacts/baselines/if_ocsvm_cross_packets_2026_05_11}"
|
||||
mkdir -p "$OUT_DIR"
|
||||
LOG_DIR="$OUT_DIR/logs"
|
||||
mkdir -p "$LOG_DIR"
|
||||
|
||||
DATASETS=(cicids2017 cicddos2019 ciciot2023)
|
||||
SEEDS=(42 43 44)
|
||||
METHODS=(iforest ocsvm)
|
||||
|
||||
START=$(date +%s)
|
||||
for method in "${METHODS[@]}"; do
|
||||
for src in "${DATASETS[@]}"; do
|
||||
for tgt in "${DATASETS[@]}"; do
|
||||
for seed in "${SEEDS[@]}"; do
|
||||
tag="${method}_${src}_to_${tgt}_seed${seed}"
|
||||
if [[ -f "$OUT_DIR/${tag}.json" ]]; then
|
||||
echo "[skip] $tag (json exists)"
|
||||
continue
|
||||
fi
|
||||
echo "[start] $tag"
|
||||
uv run --no-sync python scripts/baselines/run_if_ocsvm_cross_packets.py \
|
||||
--method "$method" --src "$src" --tgt "$tgt" --seed "$seed" \
|
||||
--out-dir "$OUT_DIR" \
|
||||
> "$LOG_DIR/${tag}.log" 2>&1
|
||||
echo "[done] $tag ($(grep -F '[result]' "$LOG_DIR/${tag}.log" | tail -1))"
|
||||
done
|
||||
done
|
||||
done
|
||||
done
|
||||
END=$(date +%s)
|
||||
echo "[all done] elapsed $((END - START))s"
|
||||
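The sweep above is resumable because it skips any run whose summary JSON already exists. A rough Python sketch of the same enumeration and skip rule is given below. The function name `pending_tags` is hypothetical, and a temporary directory stands in for the real output directory.

```python
import itertools
import tempfile
from pathlib import Path

METHODS = ("iforest", "ocsvm")
DATASETS = ("cicids2017", "cicddos2019", "ciciot2023")
SEEDS = (42, 43, 44)


def pending_tags(out_dir: Path) -> list[str]:
    """Return run tags, in sweep order, that have no summary JSON yet."""
    tags = []
    # product() varies the last axis fastest, matching the nested bash loops.
    for method, src, tgt, seed in itertools.product(METHODS, DATASETS, DATASETS, SEEDS):
        tag = f"{method}_{src}_to_{tgt}_seed{seed}"
        if not (out_dir / f"{tag}.json").exists():
            tags.append(tag)
    return tags


with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp)
    all_tags = pending_tags(out_dir)
    # Pretend the first run already finished and wrote its summary.
    (out_dir / f"{all_tags[0]}.json").write_text("{}")
    remaining = pending_tags(out_dir)

print(len(all_tags), len(remaining))
```

Keying the skip on the JSON rather than the log means a run that crashed before writing its summary is retried on the next invocation.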
247
scripts/baselines/run_shafir_nf_cross.py
Normal file
@@ -0,0 +1,247 @@
|
||||
"""Lightweight Shafir-NF cross-dataset runner.
|
||||
|
||||
Same data protocol as scripts/baselines/run_if_ocsvm_cross.py (path A):
|
||||
- 10K source benign training rows
|
||||
- 10K target benign + balanced per-class target attacks (default cap 200K)
|
||||
- 20-d canonical flow features (CANONICAL_FLOW_FEATURE_NAMES); by default only the 5-d SHAFIR5_SUBSET of them is used (--feature-subset shafir5)
|
||||
- StandardScaler-style z-score using source-trained flow_mean/flow_std saved
|
||||
in JANUS within-dataset checkpoints under artifacts/route_comparison/
|
||||
|
||||
Anomaly score = -log_prob from a single pzflow NormalizingFlow trained on
|
||||
source benign for `--epochs` (default 100). No SHAP-based feature selection is run (the shafir5 subset is a fixed list), and there is no 2-NF ensemble.
|
||||
Single-flow, default hyperparams — meant as a quick cross-dataset baseline
|
||||
matching the IF/OCSVM protocol, NOT a faithful Shafir reproduction.
|
||||
|
||||
Outputs:
|
||||
{tag}.json - summary
|
||||
{tag}.npz - b_score, a_score, a_labels (same key schema as IF/OCSVM runner)
|
||||
"""
|
||||
from __future__ import annotations
|
||||
import argparse
|
||||
import json
|
||||
import os
|
||||
import time
|
||||
from pathlib import Path
|
||||
|
||||
import numpy as np
|
||||
import pandas as pd
|
||||
import torch
|
||||
from sklearn.metrics import average_precision_score, roc_auc_score
|
||||
|
||||
os.environ.setdefault("JAX_PLATFORMS", "cpu")
|
||||
import optax # noqa: E402
|
||||
from pzflow import Flow # noqa: E402
|
||||
|
||||
REPO = Path(__file__).resolve().parents[2]
|
||||
|
||||
# Shafir-style 5-d SHAP-top subset of the 20-d canonical flow features.
|
||||
# Picks the 5 entries that loosely correspond to Shafir's CICIDS_BEST5
|
||||
# CICFlowMeter columns (Bwd Packet Length Mean, Fwd Packets/s, ACK Flag Count,
|
||||
# Total Length of Bwd Packets, Flow Duration). This keeps the input
|
||||
# dimensionality and feature semantics close to the paper protocol while
|
||||
# staying on our packet-derived 20-d contract.
|
||||
SHAFIR5_SUBSET = ("bwd_size_mean", "log_pkts_per_s", "ack_cnt", "log_total_bytes", "log_duration")
|
||||
|
||||
DATASETS = {
|
||||
"cicids2017": {
|
||||
"flows": REPO / "datasets/cicids2017/processed/flows.parquet",
|
||||
"flow_features": REPO / "datasets/cicids2017/processed/flow_features.parquet",
|
||||
"model_template": REPO / "artifacts/route_comparison/janus_cicids2017_seed{seed}",
|
||||
},
|
||||
"cicddos2019": {
|
||||
"flows": REPO / "datasets/cicddos2019/processed/flows.parquet",
|
||||
"flow_features": REPO / "datasets/cicddos2019/processed/flow_features.parquet",
|
||||
"model_template": REPO / "artifacts/route_comparison/janus_cicddos2019_seed{seed}",
|
||||
},
|
||||
"ciciot2023": {
|
||||
"flows": REPO / "datasets/ciciot2023/processed/full_store/flows.parquet",
|
||||
"flow_features": REPO / "datasets/ciciot2023/processed/flow_features.parquet",
|
||||
"model_template": REPO / "artifacts/route_comparison/janus_ciciot2023_seed{seed}",
|
||||
},
|
||||
}
|
||||
|
||||
|
||||
def _load_src_stats(src: str, seed: int) -> tuple[np.ndarray, np.ndarray, list[str]]:
|
||||
model_dir = Path(str(DATASETS[src]["model_template"]).format(seed=seed))
|
||||
ckpt = torch.load(model_dir / "model.pt", map_location="cpu", weights_only=False)
|
||||
flow_mean = np.asarray(ckpt["flow_mean"], dtype=np.float32)
|
||||
flow_std = np.asarray(ckpt["flow_std"], dtype=np.float32)
|
||||
flow_names = [str(n) for n in ckpt["flow_feature_names"]]
|
||||
return flow_mean, flow_std, flow_names
|
||||
|
||||
|
||||
def _load_dataset_aligned(name: str, flow_names: list[str]) -> tuple[np.ndarray, np.ndarray]:
|
||||
flows = pd.read_parquet(DATASETS[name]["flows"], columns=["flow_id", "label"])
|
||||
ff = pd.read_parquet(DATASETS[name]["flow_features"])
|
||||
if not np.array_equal(
|
||||
flows["flow_id"].to_numpy(dtype=np.uint64),
|
||||
ff["flow_id"].to_numpy(dtype=np.uint64),
|
||||
):
|
||||
raise ValueError(f"{name}: flows.parquet and flow_features.parquet are not row-aligned")
|
||||
X = ff[flow_names].to_numpy(dtype=np.float64)
|
||||
X = np.nan_to_num(X, nan=0.0, posinf=0.0, neginf=0.0).astype(np.float32)
|
||||
labels = flows["label"].astype(str).to_numpy()
|
||||
return X, labels
|
||||
|
||||
|
||||
def _balanced_attack_sample(labels: np.ndarray, n_attack: int, rng: np.random.Generator) -> np.ndarray:
|
||||
attack_idx = np.flatnonzero(labels != "normal")
|
||||
atk_labels = labels[attack_idx]
|
||||
classes = sorted(set(atk_labels))
|
||||
per_class = max(1, n_attack // len(classes))
|
||||
chunks = []
|
||||
for cls in classes:
|
||||
pool = attack_idx[atk_labels == cls]
|
||||
k = min(per_class, len(pool))
|
||||
if k:
|
||||
chunks.append(rng.choice(pool, size=k, replace=False))
|
||||
sel = np.sort(np.concatenate(chunks))
|
||||
if len(sel) > n_attack:
|
||||
sel = np.sort(rng.choice(sel, size=n_attack, replace=False))
|
||||
return sel
|
||||
|
||||
|
||||
def _safe_metric(fn, y, s) -> float:
|
||||
s = np.nan_to_num(s, nan=0.0, posinf=1e12, neginf=-1e12)
|
||||
try:
|
||||
return float(fn(y, s))
|
||||
except ValueError:
|
||||
return float("nan")
|
||||
|
||||
|
||||
def main() -> None:
|
||||
p = argparse.ArgumentParser()
|
||||
p.add_argument("--src", choices=list(DATASETS), required=True)
|
||||
p.add_argument("--tgt", choices=list(DATASETS), required=True)
|
||||
p.add_argument("--seed", type=int, required=True)
|
||||
p.add_argument("--out-dir", type=Path, required=True)
|
||||
p.add_argument("--n-train", type=int, default=10000)
|
||||
p.add_argument("--n-benign", type=int, default=10000)
|
||||
p.add_argument("--n-attack", type=int, default=200000)
|
||||
p.add_argument("--epochs", type=int, default=100)
|
||||
p.add_argument("--lr", type=float, default=1e-3)
|
||||
p.add_argument("--optimizer", choices=["sgd", "adam"], default="sgd")
|
||||
p.add_argument("--feature-subset", choices=["shafir5", "full20"], default="shafir5",
|
||||
help="shafir5: 5-d SHAP-top loose match (default, approximates the paper's 5-feature protocol); "
|
||||
"full20: all 20-d canonical features (stronger but not Shafir-faithful)")
|
||||
p.add_argument("--verbose", action="store_true")
|
||||
args = p.parse_args()
|
||||
args.out_dir.mkdir(parents=True, exist_ok=True)
|
||||
tag = f"shafir_nf_{args.src}_to_{args.tgt}_seed{args.seed}"
|
||||
print(f"[run] {tag}")
|
||||
|
||||
# --- source stats from JANUS ckpt ---
|
||||
flow_mean_full, flow_std_full, flow_names_full = _load_src_stats(args.src, args.seed)
|
||||
if args.feature_subset == "shafir5":
|
||||
keep_idx = [flow_names_full.index(n) for n in SHAFIR5_SUBSET]
|
||||
flow_mean = flow_mean_full[keep_idx]
|
||||
flow_std = flow_std_full[keep_idx]
|
||||
flow_names = list(SHAFIR5_SUBSET)
|
||||
else:
|
||||
flow_mean, flow_std, flow_names = flow_mean_full, flow_std_full, flow_names_full
|
||||
print(f"[src] model_dir={str(DATASETS[args.src]['model_template']).format(seed=args.seed)}")
|
||||
print(f"[src] feature_subset={args.feature_subset} D={len(flow_names)} names={flow_names}")
|
||||
|
||||
# --- source training sample (10K benign, seed+1000) ---
|
||||
t0 = time.time()
|
||||
src_X, src_labels = _load_dataset_aligned(args.src, flow_names)
|
||||
src_benign_idx = np.flatnonzero(src_labels == "normal")
|
||||
rng_train = np.random.default_rng(args.seed + 1000)
|
||||
if len(src_benign_idx) < args.n_train:
|
||||
raise RuntimeError(f"{args.src}: only {len(src_benign_idx)} benign rows < n_train={args.n_train}")
|
||||
train_sel = np.sort(rng_train.choice(src_benign_idx, size=args.n_train, replace=False))
|
||||
train_X = src_X[train_sel]
|
||||
train_Z = ((train_X - flow_mean) / np.maximum(flow_std, 1e-6)).astype(np.float32)
|
||||
t_load_src = time.time() - t0
|
||||
|
||||
# --- target eval sample ---
|
||||
t0 = time.time()
|
||||
if args.tgt == args.src:
|
||||
tgt_X, tgt_labels = src_X, src_labels
|
||||
used = np.zeros(len(tgt_labels), dtype=bool)
|
||||
used[train_sel] = True
|
||||
eligible_benign = np.flatnonzero((tgt_labels == "normal") & ~used)
|
||||
else:
|
||||
tgt_X, tgt_labels = _load_dataset_aligned(args.tgt, flow_names)
|
||||
eligible_benign = np.flatnonzero(tgt_labels == "normal")
|
||||
rng_eval = np.random.default_rng(args.seed)
|
||||
n_benign = min(args.n_benign, len(eligible_benign))
|
||||
if n_benign < args.n_benign:
|
||||
print(f"[warn] only {len(eligible_benign)} eligible benign rows in target")
|
||||
b_sel = np.sort(rng_eval.choice(eligible_benign, size=n_benign, replace=False))
|
||||
a_sel = _balanced_attack_sample(tgt_labels, args.n_attack, rng_eval)
|
||||
val_X = tgt_X[b_sel]
|
||||
atk_X = tgt_X[a_sel]
|
||||
a_labels = tgt_labels[a_sel]
|
||||
val_Z = ((val_X - flow_mean) / np.maximum(flow_std, 1e-6)).astype(np.float32)
|
||||
atk_Z = ((atk_X - flow_mean) / np.maximum(flow_std, 1e-6)).astype(np.float32)
|
||||
t_load_tgt = time.time() - t0
|
||||
print(f"[data] train={len(train_Z):,} val={len(val_Z):,} attack={len(atk_Z):,}"
|
||||
f" classes={len(set(a_labels))} D={train_Z.shape[1]}")
|
||||
|
||||
# --- fit pzflow NF ---
|
||||
cols = [f"x{i}" for i in range(train_Z.shape[1])]
|
||||
df_train = pd.DataFrame(train_Z.astype(np.float32), columns=cols)
|
||||
df_val = pd.DataFrame(val_Z.astype(np.float32), columns=cols)
|
||||
df_atk = pd.DataFrame(atk_Z.astype(np.float32), columns=cols)
|
||||
opt = optax.sgd(args.lr) if args.optimizer == "sgd" else optax.adam(args.lr)
|
||||
flow = Flow(df_train.columns.tolist())
|
||||
t0 = time.time()
|
||||
losses = flow.train(df_train, optimizer=opt, epochs=args.epochs, verbose=args.verbose)
|
||||
t_fit = time.time() - t0
|
||||
|
||||
# --- score (anomaly = -log_prob; higher = more anomalous) ---
|
||||
t0 = time.time()
|
||||
lp_val = np.asarray(flow.log_prob(df_val))
|
||||
lp_atk = np.asarray(flow.log_prob(df_atk))
|
||||
b_score = (-lp_val).astype(np.float32)
|
||||
a_score = (-lp_atk).astype(np.float32)
|
||||
t_score = time.time() - t0
|
||||
|
||||
# --- metrics ---
|
||||
y = np.r_[np.zeros(len(b_score)), np.ones(len(a_score))]
|
||||
s = np.r_[b_score, a_score]
|
||||
auroc = _safe_metric(roc_auc_score, y, s)
|
||||
auprc = _safe_metric(average_precision_score, y, s)
|
||||
|
||||
per_class = {}
|
||||
for cls in sorted(set(a_labels)):
|
||||
m = a_labels == cls
|
||||
y_c = np.r_[np.zeros(len(b_score)), np.ones(int(m.sum()))]
|
||||
s_c = np.r_[b_score, a_score[m]]
|
||||
per_class[cls] = {"_n": int(m.sum()), "auroc": _safe_metric(roc_auc_score, y_c, s_c)}
|
||||
|
||||
out = {
|
||||
"method": "shafir_nf",
|
||||
"variant": f"single_nf_{args.feature_subset}",
|
||||
"feature_subset": args.feature_subset,
|
||||
"feature_names": list(flow_names),
|
||||
"src": args.src,
|
||||
"tgt": args.tgt,
|
||||
"seed": args.seed,
|
||||
"n_train": int(len(train_Z)),
|
||||
"n_benign": int(len(val_Z)),
|
||||
"n_attack": int(len(atk_Z)),
|
||||
"epochs": args.epochs,
|
||||
"lr": args.lr,
|
||||
"optimizer": args.optimizer,
|
||||
"t_load_src_sec": round(t_load_src, 2),
|
||||
"t_load_tgt_sec": round(t_load_tgt, 2),
|
||||
"t_fit_sec": round(t_fit, 2),
|
||||
"t_score_sec": round(t_score, 2),
|
||||
"loss_first_last": [float(losses[0]), float(losses[-1])],
|
||||
"overall": {"auroc": auroc, "auprc": auprc},
|
||||
"per_class": per_class,
|
||||
}
|
||||
json_path = args.out_dir / f"{tag}.json"
|
||||
json_path.write_text(json.dumps(out, indent=2))
|
||||
npz_path = args.out_dir / f"{tag}.npz"
|
||||
np.savez_compressed(npz_path, b_score=b_score, a_score=a_score, a_labels=a_labels.astype(str))
|
||||
print(f"[saved] {json_path}")
|
||||
print(f"[result] shafir_nf {args.src} -> {args.tgt} seed={args.seed} "
|
||||
f"AUROC={auroc:.4f} AUPRC={auprc:.4f} "
|
||||
f"fit={t_fit:.1f}s score={t_score:.1f}s")
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
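The runner above selects feature columns by name and z-scores them with the source checkpoint's statistics, flooring the standard deviation at 1e-6 so constant features do not divide by zero. A minimal pure-Python sketch of those two steps follows. The function name `select_and_zscore`, the 4-feature schema, and all numbers are hypothetical and chosen only for illustration.

```python
STD_FLOOR = 1e-6  # same floor the runner applies via np.maximum(flow_std, 1e-6)


def select_and_zscore(rows, names, mean, std, subset):
    """Pick `subset` columns by name, then z-score them with source statistics."""
    keep = [names.index(n) for n in subset]  # raises ValueError if a name is missing
    out = []
    for row in rows:
        out.append([(row[i] - mean[i]) / max(std[i], STD_FLOOR) for i in keep])
    return out


# Made-up schema and statistics; the real ones come from the JANUS checkpoint.
names = ["log_duration", "ack_cnt", "bwd_size_mean", "const_feature"]
mean = [2.0, 10.0, 100.0, 5.0]
std = [0.5, 4.0, 50.0, 0.0]  # zero std: a feature that was constant in the source
rows = [[3.0, 14.0, 50.0, 5.0]]

z = select_and_zscore(rows, names, mean, std, ("bwd_size_mean", "ack_cnt", "const_feature"))
print(z)
```

Note that the floor only protects against division by zero. A target row that deviates even slightly on a source-constant feature would still produce a very large z-score, which may dominate the density estimate.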
40
scripts/baselines/run_shafir_nf_cross_all.sh
Executable file
@@ -0,0 +1,40 @@
|
||||
#!/usr/bin/env bash
|
||||
# Fast-scheme Shafir-NF 3x3 cross-dataset sweep.
|
||||
# 3 src x 3 tgt x 3 seeds = 27 runs. EPOCHS defaults to 10 here (override via env;
# run_shafir_nf_cross.py itself defaults to 100). Sanity check: 10 epochs already reaches AUROC ~0.89 within-CICIDS17.
|
||||
set -euo pipefail
|
||||
|
||||
REPO="/home/chy/JANUS"
|
||||
cd "$REPO"
|
||||
|
||||
OUT_DIR="${1:-$REPO/artifacts/baselines/shafir_nf_cross_2026_05_12}"
|
||||
EPOCHS="${EPOCHS:-10}"
|
||||
mkdir -p "$OUT_DIR"
|
||||
LOG_DIR="$OUT_DIR/logs"
|
||||
mkdir -p "$LOG_DIR"
|
||||
|
||||
DATASETS=(cicids2017 cicddos2019 ciciot2023)
|
||||
SEEDS=(42 43 44)
|
||||
|
||||
START=$(date +%s)
|
||||
for src in "${DATASETS[@]}"; do
|
||||
for tgt in "${DATASETS[@]}"; do
|
||||
for seed in "${SEEDS[@]}"; do
|
||||
tag="shafir_nf_${src}_to_${tgt}_seed${seed}"
|
||||
if [[ -f "$OUT_DIR/${tag}.json" ]]; then
|
||||
echo "[skip] $tag (json exists)"
|
||||
continue
|
||||
fi
|
||||
echo "[start] $tag"
|
||||
PYTHONUNBUFFERED=1 OMP_NUM_THREADS=4 \
|
||||
uv run --no-sync python -u scripts/baselines/run_shafir_nf_cross.py \
|
||||
--src "$src" --tgt "$tgt" --seed "$seed" \
|
||||
--epochs "$EPOCHS" \
|
||||
--out-dir "$OUT_DIR" \
|
||||
> "$LOG_DIR/${tag}.log" 2>&1
|
||||
echo "[done] $tag ($(grep -F '[result]' "$LOG_DIR/${tag}.log" | tail -1))"
|
||||
done
|
||||
done
|
||||
done
|
||||
END=$(date +%s)
|
||||
echo "[all done] elapsed $((END - START))s"
|
||||
@@ -105,11 +105,52 @@ def plot_one(npz: Path, dataset: str) -> Path:
|
||||
|
||||
out = OUT / f"velocity_field_view_{dataset.lower()}.pdf"
|
||||
fig.savefig(out, bbox_inches="tight")
|
||||
fig.savefig(out.with_suffix(".svg"), bbox_inches="tight")
|
||||
fig.savefig(out.with_suffix(".png"), bbox_inches="tight", dpi=160)
|
||||
plt.close(fig)
|
||||
return out
|
||||
|
||||
|
||||
def plot_one_overview(npz: Path, dataset: str) -> Path:
|
||||
"""Render a clean single-panel velocity-field SVG for use as the overview-
|
||||
figure component 03 (CFM head). Training-phase visualization only:
|
||||
log-norm heatmap + white streamlines + benign t=0.5 cloud. No attacks,
|
||||
no axes / colorbar / title (the surrounding overview wrapper supplies
|
||||
those). Outputs both SVG and PDF for LaTeX flexibility.
|
||||
"""
|
||||
z = np.load(npz)
|
||||
GX = z["grid_x"]
|
||||
GY = z["grid_y"]
|
||||
field_log = z["field_log_norm"]
|
||||
field_v = z["field_v_2d"]
|
||||
benign_t05 = z["benign_t05_2d"]
|
||||
|
||||
fig, ax = plt.subplots(figsize=(3.0, 2.6), constrained_layout=True)
|
||||
vmin, vmax = np.percentile(field_log, [5, 95])
|
||||
ax.pcolormesh(GX, GY, field_log, cmap="viridis", shading="auto",
|
||||
vmin=vmin, vmax=vmax, rasterized=True)
|
||||
speed = np.linalg.norm(field_v, axis=-1)
|
||||
lw = 0.35 + 1.5 * (speed / (speed.max() + 1e-9))
|
||||
ax.streamplot(GX, GY, field_v[..., 0], field_v[..., 1],
|
||||
color="white", linewidth=lw, density=0.85, arrowsize=0.7)
|
||||
n_overlay = min(200, benign_t05.shape[0])
|
||||
rng = np.random.default_rng(0)
|
||||
idx_ov = rng.choice(benign_t05.shape[0], n_overlay, replace=False)
|
||||
ax.scatter(benign_t05[idx_ov, 0], benign_t05[idx_ov, 1],
|
||||
s=2.5, c="white", alpha=0.55, edgecolors="black",
|
||||
linewidths=0.12, rasterized=True, zorder=4)
|
||||
ax.set_xticks([])
|
||||
ax.set_yticks([])
|
||||
for spine in ax.spines.values():
|
||||
spine.set_visible(False)
|
||||
|
||||
out = OUT / f"velocity_field_overview_{dataset.lower()}.svg"
|
||||
fig.savefig(out, bbox_inches="tight")
|
||||
fig.savefig(out.with_suffix(".pdf"), bbox_inches="tight")
|
||||
plt.close(fig)
|
||||
return out
|
||||
|
||||
|
||||
def main() -> None:
|
||||
parser = argparse.ArgumentParser()
|
||||
parser.add_argument("--datasets", nargs="+",
|
||||
@@ -126,6 +167,8 @@ def main() -> None:
|
||||
continue
|
||||
p = plot_one(npz, pretty.get(ds, ds))
|
||||
print(f"[wrote] {p}")
|
||||
p_ov = plot_one_overview(npz, pretty.get(ds, ds))
|
||||
print(f"[wrote] {p_ov}")
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
|
||||
@@ -97,6 +97,7 @@ def plot_corr_heatmap() -> Path:
|
||||
cbar.set_label("Pearson ρ on benign val", fontsize=10)
|
||||
out = OUT / "subscore_correlation_benign_val.pdf"
|
||||
fig.savefig(out, bbox_inches="tight")
|
||||
fig.savefig(out.with_suffix(".svg"), bbox_inches="tight")
|
||||
fig.savefig(out.with_suffix(".png"), bbox_inches="tight", dpi=160)
|
||||
plt.close(fig)
|
||||
return out
|
||||
@@ -211,11 +212,227 @@ def plot_dual_head() -> Path:
|
||||
|
||||
out = OUT / "dual_head_oas_ellipses_top__whitened_pca_bottom.pdf"
|
||||
fig.savefig(out, bbox_inches="tight")
|
||||
fig.savefig(out.with_suffix(".svg"), bbox_inches="tight")
|
||||
fig.savefig(out.with_suffix(".png"), bbox_inches="tight", dpi=160)
|
||||
plt.close(fig)
|
||||
return out
|
||||
|
||||
|
||||
def plot_dual_head_overview(dataset: str = "cicddos2019") -> Path:
|
||||
"""Render a clean single-panel OAS-ellipse SVG for use as overview-figure
|
||||
component 06 (Mahalanobis-OAS aggregator).
|
||||
|
||||
Visual: benign as a smooth 2D KDE density blob (blue, filled contours
between 6 levels), attacks as sparse bright red dots with white halos
|
||||
clearly outside the dense benign region, and 1/2/3-sigma OAS-Mahalanobis
|
||||
ellipses overlaid on top in bold black. The visual story: the
|
||||
aggregator (ellipses) is fit on the dense benign cloud; attacks at
|
||||
inference fall outside the ellipses, which is what makes $d^2$ a
|
||||
useful anomaly score.
|
||||
"""
|
||||
from scipy.stats import gaussian_kde
|
||||
|
||||
val, atk = load_scores(dataset)
|
||||
rng = np.random.default_rng(0)
|
||||
fig, ax = plt.subplots(figsize=(3.0, 2.6), constrained_layout=True)
|
||||
|
||||
x_v, y_v = val[:, 0], val[:, 3]
|
||||
x_a, y_a = atk[:, 0], atk[:, 3]
|
||||
|
||||
nv = min(5000, len(x_v))
|
||||
iv = rng.choice(len(x_v), nv, replace=False)
|
||||
na = min(120, len(x_a)) # fewer, brighter attack dots for visibility
|
||||
ia = rng.choice(len(x_a), na, replace=False)
|
||||
|
||||
# View window: capture 99% of benign + 90% of attack
|
||||
x_lo = min(np.quantile(x_v, 0.005), np.quantile(x_a, 0.05))
|
||||
x_hi = max(np.quantile(x_v, 0.995), np.quantile(x_a, 0.95))
|
||||
y_lo = min(np.quantile(y_v, 0.005), np.quantile(y_a, 0.05))
|
||||
y_hi = max(np.quantile(y_v, 0.995), np.quantile(y_a, 0.95))
|
||||
pad_x = 0.05 * (x_hi - x_lo)
|
||||
pad_y = 0.05 * (y_hi - y_lo)
|
||||
xlim = (x_lo - pad_x, x_hi + pad_x)
|
||||
ylim = (y_lo - pad_y, y_hi + pad_y)
|
||||
|
||||
# Benign 2D KDE density blob
|
||||
kde = gaussian_kde(np.vstack([x_v[iv], y_v[iv]]))
|
||||
xx, yy = np.meshgrid(np.linspace(*xlim, 90), np.linspace(*ylim, 90))
|
||||
grid = np.vstack([xx.ravel(), yy.ravel()])
|
||||
density = kde(grid).reshape(xx.shape)
|
||||
# Drop the lowest-density floor (clip near-zero edge artefacts)
|
||||
floor = np.quantile(density, 0.55)
|
||||
levels = np.linspace(floor, density.max() * 0.97, 6)
|
||||
ax.contourf(xx, yy, density, levels=levels, cmap="Blues", alpha=0.92, zorder=1)
|
||||
|
||||
# Attack scatter (sparse, bright, white halo for crispness)
|
||||
ax.scatter(x_a[ia], y_a[ia], s=11, c="#d7191c",
|
||||
edgecolors="white", linewidth=0.5, alpha=0.95, zorder=3)
|
||||
|
||||
# OAS Mahalanobis ellipses on top: bold black
|
||||
XY_v = val[:, [0, 3]]
|
||||
oas2 = OAS().fit(XY_v)
|
||||
mu2 = XY_v.mean(axis=0)
|
||||
for ns, ls in [(1, "-"), (2, "--"), (3, ":")]:
|
||||
e = _ellipse_from_2x2(
|
||||
mu2, oas2.covariance_, ns,
|
||||
edgecolor="black", facecolor="none", lw=1.3, ls=ls, alpha=0.92,
|
||||
zorder=5,
|
||||
)
|
||||
ax.add_patch(e)
|
||||
|
||||
ax.set_xlim(*xlim)
|
||||
ax.set_ylim(*ylim)
|
||||
ax.set_xticks([])
|
||||
ax.set_yticks([])
|
||||
for spine in ax.spines.values():
|
||||
spine.set_visible(False)
|
||||
|
||||
out = OUT / f"oas_ellipse_overview_{dataset.lower()}.svg"
|
||||
fig.savefig(out, bbox_inches="tight")
|
||||
fig.savefig(out.with_suffix(".pdf"), bbox_inches="tight")
|
||||
plt.close(fig)
|
||||
return out
|
||||
|
||||
|
||||
def _benign_disc_p1(dataset: str) -> np.ndarray:
|
||||
"""Empirical per-channel P(c_i = 1) over all benign packets in the
|
||||
processed dataset. Returns shape [6] for (dir, SYN, FIN, RST, PSH, ACK).
|
||||
"""
|
||||
import pandas as pd
|
||||
data_root = ROOT / "datasets" / dataset / "processed"
|
||||
pkts = np.load(data_root / "packets.npz")
|
||||
packet_tokens = pkts["packet_tokens"] # [N, T_full, 9]
|
||||
packet_lengths = pkts["packet_lengths"] # [N]
|
||||
|
||||
flows_df = pd.read_parquet(data_root / "flows.parquet")
|
||||
label_norm = flows_df["label"].astype(str).str.strip().str.lower()
|
||||
benign_aliases = {"benign", "normal"}
|
||||
benign_mask = label_norm.isin(benign_aliases).values
|
||||
benign_idx = np.where(benign_mask)[0]
|
||||
|
||||
if benign_idx.size == 0:
|
||||
return np.zeros(6, dtype=np.float64)
|
||||
|
||||
T_full = packet_tokens.shape[1]
|
||||
valid = np.arange(T_full)[None, :] < packet_lengths[benign_idx, None] # [Nb, T]
|
||||
# Disc channels live at indices 2..7 of the canonical 9-d packet schema.
|
||||
disc = packet_tokens[benign_idx, :, 2:8].astype(np.float64) # [Nb, T, 6]
|
||||
masked_sum = (disc * valid[..., None]).sum(axis=(0, 1)) # [6]
|
||||
total = float(valid.sum())
|
||||
return masked_sum / max(total, 1.0)
|
||||
|
||||
|
||||
def plot_dfm_head_overview(dataset: str = "cicids2017") -> Path:
|
||||
"""Render a clean single-panel DFM head SVG for use as overview-figure
|
||||
component 04. Six paired bars (P(c=0) light / P(c=1) dark) show the
|
||||
empirical categorical distribution of the six binary packet channels
|
||||
(direction + five TCP flags) on benign packets — the distribution the
|
||||
DFM head is trained to model. Training-phase visualization: benign-only.
|
||||
|
||||
Default uses CICIDS2017 because (a) it ships a flat `packets.npz` that
|
||||
this helper reads directly, and (b) its 51/49 TCP/UDP split exercises
|
||||
the full range of flag distributions in a way UDP-heavy CICDDoS2019
|
||||
or TCP-heavy CICIoT2023 do not.
|
||||
"""
|
||||
p_c1 = _benign_disc_p1(dataset)
|
||||
p_c0 = 1.0 - p_c1
|
||||
channel_labels = ["dir", "SYN", "FIN", "RST", "PSH", "ACK"]
|
||||
|
||||
fig, ax = plt.subplots(figsize=(3.2, 2.4), constrained_layout=True)
|
||||
x = np.arange(6, dtype=float)
|
||||
bar_w = 0.36
|
||||
ax.bar(x - bar_w / 2 - 0.02, p_c0, bar_w,
|
||||
color="#F4C58A", edgecolor="#a85518", linewidth=0.5,
|
||||
label=r"$P(c_i{=}0)$")
|
||||
ax.bar(x + bar_w / 2 + 0.02, p_c1, bar_w,
|
||||
color="#A85518", edgecolor="#5a2f0e", linewidth=0.5,
|
||||
label=r"$P(c_i{=}1)$")
|
||||
|
||||
# Reference line at y=0.5
|
||||
ax.axhline(0.5, color="#888", lw=0.4, ls="--", alpha=0.55, zorder=0)
|
||||
|
||||
ax.set_xticks(x)
|
||||
ax.set_xticklabels(channel_labels, fontsize=7.5)
|
||||
ax.set_ylim(0, 1.08)
|
||||
ax.set_yticks([])
|
||||
ax.tick_params(axis="x", length=0, pad=2)
|
||||
for side in ("top", "right", "left"):
|
||||
ax.spines[side].set_visible(False)
|
||||
ax.spines["bottom"].set_linewidth(0.45)
|
||||
|
||||
ax.legend(
|
||||
loc="upper right", fontsize=6.5, frameon=False,
|
||||
ncol=2, bbox_to_anchor=(1.00, 1.13),
|
||||
handletextpad=0.35, columnspacing=0.8,
|
||||
)
|
||||
|
||||
out = OUT / f"dfm_head_overview_{dataset.lower()}.svg"
|
||||
fig.savefig(out, bbox_inches="tight")
|
||||
fig.savefig(out.with_suffix(".pdf"), bbox_inches="tight")
|
||||
plt.close(fig)
|
||||
return out
|
||||
|
||||
|
||||
def plot_score_family_overview(dataset: str = "cicids2017") -> Path:
|
||||
"""Render a clean single-panel score-family SVG for use as overview-figure
|
||||
component 05 (the 10-d score vector $s(x)$ between the heads and the
|
||||
aggregator). Replaces the schematic 10-cell row with real data: each of
|
||||
the 10 sub-scores becomes one vertical bar whose height is the *attack
|
||||
median z-score* relative to benign val (i.e., how many benign-std units
|
||||
the typical attack is shifted on this score).
|
||||
|
||||
Layout:
- 3 CFM-head bars on the left (term3, light yellow fill)
- 7 DFM-head bars on the right (disc7, light green fill)
- No x-tick labels or group brackets: clean bars only.
- Faint benign reference line at z=0.
|
||||
"""
|
||||
val, atk = load_scores(dataset)
|
||||
mu = val.mean(axis=0)
|
||||
sd = val.std(axis=0) + 1e-9
|
||||
# z-normalised attacks: per-score median over all attack samples.
|
||||
z_atk = np.median((atk - mu) / sd, axis=0)
|
||||
# Same for benign val (sanity: should be ~0).
|
||||
z_val = np.median((val - mu) / sd, axis=0)
|
||||
|
||||
# CFM head fill = #FFF2CC (drawio yellow), DFM head fill = #D5E8D4 (drawio green).
|
||||
# Use the matching darker shade for the edge so bars are still visible.
|
||||
cfm_fill, cfm_edge = "#FFF2CC", "#D6B656"
|
||||
dfm_fill, dfm_edge = "#D5E8D4", "#82B366"
|
||||
fills = [cfm_fill] * 3 + [dfm_fill] * 7
|
||||
edges = [cfm_edge] * 3 + [dfm_edge] * 7
|
||||
|
||||
fig, ax = plt.subplots(figsize=(3.6, 2.0), constrained_layout=True)
|
||||
x = np.arange(10, dtype=float)
|
||||
bar_w = 0.72
|
||||
ax.bar(x, z_atk, bar_w, color=fills,
|
||||
edgecolor=edges, linewidth=0.9, zorder=3)
|
||||
|
||||
# Faint benign reference line at z=0.
|
||||
ax.axhline(0.0, color="#888", lw=0.5, ls="--", alpha=0.7, zorder=1)
|
||||
|
||||
# No x-tick labels, no top bracket annotations: clean bars only.
|
||||
ax.set_xticks([])
|
||||
ax.set_yticks([])
|
||||
ax.tick_params(axis="x", length=0, pad=2)
|
||||
ax.set_xlim(-0.6, 9.6)
|
||||
for side in ("top", "right", "left"):
|
||||
ax.spines[side].set_visible(False)
|
||||
ax.spines["bottom"].set_linewidth(0.45)
|
||||
|
||||
# Y-limits: keep small headroom but no need for bracket clearance now.
|
||||
hi = float(max(z_atk.max() * 1.10, 1.0))
|
||||
lo = float(min(z_atk.min() * 1.10, -0.05))
|
||||
ax.set_ylim(lo, hi)
|
||||
|
||||
out = OUT / f"score_family_overview_{dataset.lower()}.svg"
|
||||
fig.savefig(out, bbox_inches="tight")
|
||||
fig.savefig(out.with_suffix(".pdf"), bbox_inches="tight")
|
||||
plt.close(fig)
|
||||
return out
|
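The per-score summary computed above (standardise against benign validation, then take the median over the attack class) can be tried in isolation. This is a minimal sketch on synthetic data; the helper name `median_z` and the random arrays are assumptions standing in for `load_scores(dataset)`, which is not shown in this hunk.

```python
import numpy as np


def median_z(val: np.ndarray, atk: np.ndarray) -> np.ndarray:
    """Median z-score of each score column in `atk`, standardised
    against the benign-validation mean and standard deviation."""
    mu = val.mean(axis=0)
    sd = val.std(axis=0) + 1e-9  # guard against constant columns
    return np.median((atk - mu) / sd, axis=0)


# Synthetic stand-ins (assumption): 10 scores, attacks shifted by +2 sd.
rng = np.random.default_rng(0)
val = rng.normal(0.0, 1.0, size=(5000, 10))
atk = rng.normal(2.0, 1.0, size=(5000, 10))

z_atk = median_z(val, atk)  # one bar height per score
z_val = median_z(val, val)  # sanity check: close to 0 for every score
```

Using the median rather than the mean keeps a few extreme attack flows from dominating a bar's height.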


def plot_score_hist() -> Path:
    fig, axes = plt.subplots(4, 4, figsize=(16, 12), constrained_layout=True)
    for col, ds in enumerate(DATASETS):
@@ -265,6 +482,7 @@ def plot_score_hist() -> Path:
    axes[0, 3].legend(loc="upper right", fontsize=8, framealpha=0.85)
    out = OUT / "score_distributions_raw__termOAS__discOAS__allOAS.pdf"
    fig.savefig(out, bbox_inches="tight")
    fig.savefig(out.with_suffix(".svg"), bbox_inches="tight")
    fig.savefig(out.with_suffix(".png"), bbox_inches="tight", dpi=160)
    plt.close(fig)
    return out
@@ -302,9 +520,131 @@ def _hist_panel(ax, sv, sa, log_x: bool = False):
    ax.set_yticks([])


def _load_cross(src: str, tgt: str, seeds=(42, 43, 44)) -> tuple[np.ndarray, np.ndarray]:
    """Load 10-d score vectors for the (src→tgt) cross-domain pair, pooled
    across seeds. b_* are benign-val from the source training domain;
    a_* are attacks from the target test domain.
    """
    val_pool, atk_pool = [], []
    for s in seeds:
        npz = ROOT / "artifacts" / "route_comparison" / "cross" / f"janus_seed{s}_{src}_to_{tgt}.npz"
        z = np.load(npz, allow_pickle=True)
        bv = np.stack([z[f"b_{k}"] for k in SCORE_KEYS], axis=1)
        av = np.stack([z[f"a_{k}"] for k in SCORE_KEYS], axis=1)
        val_pool.append(bv)
        atk_pool.append(av)
    val = np.concatenate(val_pool, axis=0)
    atk = np.concatenate(atk_pool, axis=0)
    val = np.nan_to_num(val, nan=0.0, posinf=1e6, neginf=-1e6).astype(np.float64)
    atk = np.nan_to_num(atk, nan=0.0, posinf=1e6, neginf=-1e6).astype(np.float64)
    return val, atk
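The pooling and sanitising steps at the end of `_load_cross` can be checked on a toy input. The sketch below uses a hypothetical helper `pool_and_sanitise` and hand-written per-seed matrices instead of the `.npz` artifacts, which are assumed here and not reproduced.

```python
import numpy as np


def pool_and_sanitise(per_seed: list) -> np.ndarray:
    """Stack per-seed [n_i, d] score matrices row-wise, then map NaN to 0
    and +/-inf to +/-1e6, returning float64."""
    pooled = np.concatenate(per_seed, axis=0)
    return np.nan_to_num(pooled, nan=0.0, posinf=1e6, neginf=-1e6).astype(np.float64)


# Hypothetical per-seed matrices (2 scores each) with non-finite entries.
seed_a = np.array([[1.0, np.nan], [np.inf, 2.0]], dtype=np.float32)
seed_b = np.array([[-np.inf, 3.0]], dtype=np.float32)

pooled = pool_and_sanitise([seed_a, seed_b])
```

Note that mapping NaN to 0 treats a missing score as a score of zero, so it is worth counting non-finite entries before relying on the pooled matrix.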


def plot_collapse_diagnosis_overview(src: str = "cicddos2019",
                                     tgt: str = "cicids2017") -> Path:
    """Render a compact two-panel SVG that visualises the source-likeness
    collapse (left: raw 1D terminal_norm under cross-dataset shift) and the
    Mahalanobis-OAS cure (right: aggregated d^2). Real cross-domain scores
    pooled across seeds 42-44.

    Used as the overview figure component that bridges Stage 4 (score family)
    and Stage 5 (Mahal-OAS aggregator), supplying the diagnostic backbone of
    contribution C2 directly inside the architecture sketch.
    """
    val, atk = _load_cross(src, tgt)
    y = np.r_[np.zeros(len(val)), np.ones(len(atk))]

    # --- left panel: raw terminal_norm (1D NLL-style score) ---
    sv_raw = val[:, SCORE_KEYS.index("terminal_norm")]
    sa_raw = atk[:, SCORE_KEYS.index("terminal_norm")]
    auc_raw = roc_auc_score(y, np.r_[sv_raw, sa_raw])

    # --- right panel: Mahal-OAS d^2 over the full 10-d score family ---
    mu, inv_cov, *_ = fit_oas(val)
    sv_mah = mahal(val, mu, inv_cov)
    sa_mah = mahal(atk, mu, inv_cov)
    auc_mah = roc_auc_score(y, np.r_[sv_mah, sa_mah])

    fig, axes = plt.subplots(1, 2, figsize=(5.8, 1.95), constrained_layout=False,
                             gridspec_kw=dict(wspace=0.22))
    # No bottom-legend reservation; the legend sits inside the left panel.
    fig.subplots_adjust(left=0.05, right=0.99, top=0.80, bottom=0.14)

    def _kde_panel(ax, sv, sa, auc, log_x: bool, label_top: str):
        """Overlay per-class normalised histograms (not a KDE, despite the name)."""
        s = np.r_[sv, sa]
        if log_x:
            eps = max(1e-3, np.quantile(s[s > 0], 0.005) * 0.5) if (s > 0).any() else 1e-3
            sv_p = np.maximum(sv, eps)
            sa_p = np.maximum(sa, eps)
            lo = np.quantile(np.r_[sv_p, sa_p], 0.005)
            hi = np.quantile(np.r_[sv_p, sa_p], 0.999)
            bins = np.geomspace(max(lo, eps), hi, 60)
            mask_v = (sv_p >= lo) & (sv_p <= hi)
            mask_a = (sa_p >= lo) & (sa_p <= hi)
            sv_p, sa_p = sv_p[mask_v], sa_p[mask_a]
            ax.set_xscale("log")
        else:
            lo, hi = np.quantile(s, [0.005, 0.995])
            bins = np.linspace(lo, hi, 60)
            mask_v = (sv >= lo) & (sv <= hi)
            mask_a = (sa >= lo) & (sa <= hi)
            sv_p, sa_p = sv[mask_v], sa[mask_a]
        # Weight each retained sample by 1/n so each class's bar heights sum to 1
        # (avoids leftover-mass spikes at clip edges).
        w_v = np.full_like(sv_p, 1.0 / max(len(sv_p), 1))
        w_a = np.full_like(sa_p, 1.0 / max(len(sa_p), 1))
        ax.hist(sv_p, bins=bins, color="#2c7fb8", alpha=0.65,
                weights=w_v, edgecolor="none")
        ax.hist(sa_p, bins=bins, color="#d7191c", alpha=0.65,
                weights=w_a, edgecolor="none")
        ax.set_yticks([])
        ax.tick_params(axis="x", labelsize=6.5, length=2, pad=1.5)
        for side in ("top", "right", "left"):
            ax.spines[side].set_visible(False)
        ax.spines["bottom"].set_linewidth(0.45)
        # State word top-center (one or two words describing the panel state).
        ax.text(
            0.5, 1.08, label_top,
            transform=ax.transAxes, ha="center", va="bottom", fontsize=9.0,
            color=("#a02a2a" if auc < 0.75 else "#1f6f3a"),
            fontweight="bold",
        )

    _kde_panel(axes[0], sv_raw, sa_raw, auc_raw, log_x=False,
               label_top="collapse")
    _kde_panel(axes[1], sv_mah, sa_mah, auc_mah, log_x=True,
               label_top="separated")

    # Centred arrow between panels; fig coordinates so it sits between the axes.
    fig.text(0.5125, 0.48, r"$\Rightarrow$", ha="center", va="center",
             fontsize=18, color="#444")
    # Compact vertical legend in the upper-right of the LEFT panel.
    from matplotlib.patches import Patch
    legend_handles = [
        Patch(facecolor="#2c7fb8", alpha=0.65, label="benign-val"),
        Patch(facecolor="#d7191c", alpha=0.65, label="attack"),
    ]
    axes[0].legend(
        handles=legend_handles,
        loc="upper right", ncol=1,
        fontsize=6.5, frameon=False,
        handlelength=0.9, handleheight=0.7,
        handletextpad=0.35, labelspacing=0.30,
        borderaxespad=0.4,
    )

    out = OUT / f"collapse_diagnosis_overview_{src}_to_{tgt}.svg"
    fig.savefig(out, bbox_inches="tight")
    fig.savefig(out.with_suffix(".pdf"), bbox_inches="tight")
    plt.close(fig)
    return out
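The right-hand panel aggregates the score family into a squared Mahalanobis distance from the benign validation distribution. `fit_oas` and `mahal` are defined outside this hunk, so the following is only a sketch of the idea: `fit_shrunk` and `mahal_sq` are hypothetical names, and the shrinkage intensity is a fixed illustrative value rather than the OAS estimate.

```python
import numpy as np


def fit_shrunk(val: np.ndarray, shrinkage: float = 0.1):
    """Benign mean and inverse of a covariance shrunk toward a scaled
    identity. `shrinkage` is fixed here for illustration (assumption);
    OAS would estimate it from the data."""
    mu = val.mean(axis=0)
    emp = np.cov(val, rowvar=False)
    dim = emp.shape[0]
    target = np.trace(emp) / dim * np.eye(dim)
    cov = (1.0 - shrinkage) * emp + shrinkage * target
    return mu, np.linalg.inv(cov)


def mahal_sq(x: np.ndarray, mu: np.ndarray, inv_cov: np.ndarray) -> np.ndarray:
    """Squared Mahalanobis distance of each row of x from mu."""
    d = x - mu
    return np.einsum("ij,jk,ik->i", d, inv_cov, d)


# Synthetic example (assumption): two strongly correlated benign scores.
# Each attack score has the same marginal distribution as its benign
# counterpart, but the pair breaks the benign correlation.
rng = np.random.default_rng(0)
base = rng.normal(size=(4000, 1))
val = np.hstack([base, base]) + 0.1 * rng.normal(size=(4000, 2))
atk = np.hstack([base, -base]) + 0.1 * rng.normal(size=(4000, 2))

mu, inv_cov = fit_shrunk(val)
d2_val = mahal_sq(val, mu, inv_cov)
d2_atk = mahal_sq(atk, mu, inv_cov)
```

In this toy setup neither score separates the classes on its own, while the joint distance does, which is the kind of contrast the two panels are meant to show.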


def main() -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument("--which", choices=["all", "corr", "dual", "hist"], default="all")
    parser.add_argument("--which",
                        choices=["all", "corr", "dual", "hist", "diag", "score"],
                        default="all")
    args = parser.parse_args()
    OUT.mkdir(parents=True, exist_ok=True)
    mpl.rcParams.update({
@@ -320,9 +660,19 @@ def main() -> None:
    if args.which in ("all", "dual"):
        p = plot_dual_head()
        print(f"[wrote] {p}")
        p_ov = plot_dual_head_overview()
        print(f"[wrote] {p_ov}")
        p_dfm = plot_dfm_head_overview()
        print(f"[wrote] {p_dfm}")
    if args.which in ("all", "hist"):
        p = plot_score_hist()
        print(f"[wrote] {p}")
    if args.which in ("all", "diag"):
        p = plot_collapse_diagnosis_overview()
        print(f"[wrote] {p}")
    if args.which in ("all", "score"):
        p = plot_score_family_overview()
        print(f"[wrote] {p}")
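The `--which` dispatch in `main()` follows a simple pattern: each figure group runs when the flag is `all` or names that group. A small sketch of that selection logic, with a hypothetical helper `select_figures` that returns group names instead of calling the plotting functions:

```python
import argparse


def select_figures(argv: list) -> list:
    """Return the figure groups that a given --which value enables."""
    groups = ["corr", "dual", "hist", "diag", "score"]
    parser = argparse.ArgumentParser()
    parser.add_argument("--which", choices=["all", *groups], default="all")
    args = parser.parse_args(argv)
    # Same membership test as `args.which in ("all", name)` in main().
    return [g for g in groups if args.which in ("all", g)]


selected_default = select_figures([])
selected_diag = select_figures(["--which", "diag"])
```

Because `choices` is enforced by `argparse`, a misspelt group name fails at parse time instead of silently producing no figures.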


if __name__ == "__main__":

@@ -82,16 +82,17 @@ def plot_trajectory(npz_paths: dict[str, Path]) -> Path:
    )
    out = OUT / "fig4_trajectory_pca.pdf"
    fig.savefig(out, bbox_inches="tight")
    fig.savefig(out.with_suffix(".svg"), bbox_inches="tight")
    fig.savefig(out.with_suffix(".png"), bbox_inches="tight", dpi=160)
    plt.close(fig)
    return out


def plot_velocity_norm(npz_paths: dict[str, Path]) -> Path:
    fig, axes = plt.subplots(1, len(npz_paths), figsize=(6.5 * len(npz_paths), 5.6), constrained_layout=True)
    fig, axes = plt.subplots(1, len(npz_paths), figsize=(6.5 * len(npz_paths), 5.6 * 2 / 3), constrained_layout=True)
    if len(npz_paths) == 1:
        axes = [axes]
    for ax, (ds, npz) in zip(axes, npz_paths.items()):
    for i, (ax, (ds, npz)) in enumerate(zip(axes, npz_paths.items())):
        z = np.load(npz)
        vn_v = z["vnorm_v"]  # [n, n_steps]
        vn_a = z["vnorm_a"]
@@ -105,13 +106,15 @@ def plot_velocity_norm(npz_paths: dict[str, Path]) -> Path:
        ax.fill_between(t_steps, m_v - s_v, m_v + s_v, color="#2c7fb8", alpha=0.18)
        ax.plot(t_steps, m_a, color="#d7191c", lw=1.6, label="attack mean")
        ax.fill_between(t_steps, m_a - s_a, m_a + s_a, color="#d7191c", alpha=0.18)
        ax.set_xlabel("CFM time t (1 = data → 0 = source)")
        ax.set_ylabel("‖v(x_t, t)‖ per real token (mean over flow)")
        ax.set_xlabel("CFM time t")
        if i == 0:
            ax.set_ylabel(r"Per-token CFM velocity magnitude $\|v_\theta(x_t, t)\|_2$")
        ax.text(0.02, 1.02, PRETTY[ds], transform=ax.transAxes, fontsize=11)
        ax.invert_xaxis()  # so left is t=1 (data), right is t=0 (source)
        ax.legend(fontsize=8, loc="upper left", framealpha=0.85)
    out = OUT / "velocity_norm_vs_t_benign_vs_attack.pdf"
    fig.savefig(out, bbox_inches="tight")
    fig.savefig(out.with_suffix(".svg"), bbox_inches="tight")
    fig.savefig(out.with_suffix(".png"), bbox_inches="tight", dpi=160)
    plt.close(fig)
    return out
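The lines that derive `m_v`, `s_v`, `m_a` and `s_a` fall outside the hunks shown. Assuming they are the per-step mean and standard deviation across flows of the `[n, n_steps]` velocity-norm arrays, the shaded band can be sketched as follows; the helper name `band` and the toy array are hypothetical.

```python
import numpy as np


def band(vnorm: np.ndarray):
    """Per-step mean and standard deviation across flows for an
    [n, n_steps] array of velocity norms."""
    return vnorm.mean(axis=0), vnorm.std(axis=0)


# Toy [n, n_steps] array (assumption): half of the flows have norm 1.0
# and the other half 3.0 at each of 4 time steps.
vn = np.vstack([np.full((5, 4), 1.0), np.full((5, 4), 3.0)])

m, s = band(vn)
lower, upper = m - s, m + s  # the two bounds handed to fill_between
```

A mean ± one standard deviation band is symmetric by construction, so for skewed norm distributions a quantile band may describe the spread more faithfully.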