Initial commit: code, paper, small artifacts

2026-05-07 20:47:30 +08:00
commit fae2db8cff
322 changed files with 33159 additions and 0 deletions

scripts/download/README.md Normal file

@@ -0,0 +1,112 @@
# Dataset download scripts
Target layout (mirrors `datasets/cicids2017/`):
```
datasets/
  ciciot2023/raw/{pcap,csv}
  iscxtor2016/raw/{pcap,csv}
  cicapt_iiot2024/raw/{pcap,csv}
  ustc_tfc2016/raw/pcap
  datacon2020/raw/pcap
```
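The layout above can be created up front so the download scripts have somewhere to write. This is a minimal sketch, not part of the repository: the `LAYOUT` table and the `make_layout` helper are hypothetical names, and the directory list is copied from the tree shown above.

```python
from pathlib import Path

# Dataset name -> raw subdirectories, copied from the target layout above.
# LAYOUT and make_layout are illustrative names, not repository code.
LAYOUT = {
    "ciciot2023": ("pcap", "csv"),
    "iscxtor2016": ("pcap", "csv"),
    "cicapt_iiot2024": ("pcap", "csv"),
    "ustc_tfc2016": ("pcap",),
    "datacon2020": ("pcap",),
}


def make_layout(root="datasets"):
    """Create <root>/<dataset>/raw/<kind> for every entry and return the paths."""
    created = []
    for dataset, kinds in LAYOUT.items():
        for kind in kinds:
            path = Path(root) / dataset / "raw" / kind
            path.mkdir(parents=True, exist_ok=True)  # safe to re-run
            created.append(path)
    return created
```

Calling `make_layout()` from the repository root would create the eight leaf directories; re-running it is harmless because existing directories are left alone.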
## CICIoT2023 / ISCXTor2016 (automated)
UNB/CIC gates downloads behind a consent form. After submission the site issues
a `Token` cookie (domain `.cicresearch.ca`) that unlocks two endpoints:
- `browse.php?p=<path>` — HTML directory listing
- `download.php?file=<path>` — raw file bytes
`cic_download.py` is a stdlib-only recursive crawler that walks `browse.php`
and fetches each leaf via `download.php`. Already-downloaded files are
skipped (presence-based; the PHP endpoint does not advertise sizes).
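The two core pieces of such a crawler, link extraction and the presence-based skip, might look roughly like the sketch below. It is not the code in `cic_download.py`: the names `ListingParser`, `parse_listing`, and `needs_download` are hypothetical, and it assumes the listing page links to subfolders and files through plain `<a href>` tags carrying the `p=` and `file=` query parameters.

```python
from html.parser import HTMLParser
from pathlib import Path
from urllib.parse import parse_qs, urlsplit


class ListingParser(HTMLParser):
    """Collect folder and file paths from one browse.php listing page.

    Assumption: entries are plain <a href="..."> links to browse.php?p=...
    (folders) and download.php?file=... (files).
    """

    def __init__(self):
        super().__init__()
        self.dirs = []   # decoded values of the p= parameter
        self.files = []  # decoded values of the file= parameter

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        url = urlsplit(href)
        query = parse_qs(url.query)
        if url.path.endswith("browse.php") and "p" in query:
            self.dirs.append(query["p"][0])
        elif url.path.endswith("download.php") and "file" in query:
            self.files.append(query["file"][0])


def parse_listing(html):
    """Return (dirs, files) found in the HTML of one listing page."""
    parser = ListingParser()
    parser.feed(html)
    return parser.dirs, parser.files


def needs_download(remote_path, dest_root):
    """Presence-based skip: fetch only if nothing exists at the mirrored path."""
    return not (Path(dest_root) / remote_path).exists()
```

A recursive walk would call `parse_listing` on each folder in `dirs` and keep a set of visited paths, since listings commonly include a link back to the parent folder. Because the skip is presence-based, a partially written file from an interrupted run would be treated as complete and should be deleted by hand before retrying.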
### Workflow
1. Open the dataset page in a browser, then fill in and submit the form:
   - CICIoT2023 : <https://www.unb.ca/cic/datasets/iotdataset-2023.html>
   - ISCXTor2016: <https://www.unb.ca/cic/datasets/tor.html>
2. After submitting, click through to `cicresearch.ca/.../browse.php`. The page
   must load successfully in your browser; this confirms the Token is set.
3. Export the cookie in **Netscape format** (tab-separated). One line is
   sufficient:
   ```
   # Netscape HTTP Cookie File
   .cicresearch.ca TRUE / TRUE <expiry> Token <value>
   ```
   Save as:
   - `scripts/download/cookies_ciciot2023.txt`
   - `scripts/download/cookies_iscxtor2016.txt`

   Tokens are per-dataset: a CICIoT2023 cookie will not work for ISCXTor2016.
4. Run:
   ```bash
   bash scripts/download/download_ciciot2023.sh
   bash scripts/download/download_iscxtor2016.sh
   ```
   Env vars: `WHAT=pcap|csv|both`, `DEST=`, `COOKIES=`, `DRY_RUN=1`, `LIMIT=N`.
   For ISCXTor2016, if the remote subdir names differ from the defaults
   (`Pcaps` / `CSVs`), set `PCAP_ROOT=` / `CSV_ROOT=`.
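Before launching a long download, it can help to confirm that the exported cookie file parses and actually contains the Token. The sketch below uses the standard library's `http.cookiejar.MozillaCookieJar`, which reads the Netscape format; `read_token` is a hypothetical helper, not something the repository provides. Note that the loader requires real tab characters between the seven fields and the `# Netscape HTTP Cookie File` header on the first line.

```python
from http.cookiejar import MozillaCookieJar


def read_token(cookie_file, domain=".cicresearch.ca", name="Token"):
    """Return the value of the named cookie for `domain`, or None if absent.

    read_token is an illustrative helper; the defaults mirror the cookie
    line shown in the workflow above.
    """
    jar = MozillaCookieJar()
    # Also load session and expired cookies: this checks the file's shape only,
    # not whether the server still accepts the token.
    jar.load(cookie_file, ignore_discard=True, ignore_expires=True)
    for cookie in jar:
        if cookie.name == name and cookie.domain == domain:
            return cookie.value
    return None
```

For example, `read_token("scripts/download/cookies_ciciot2023.txt")` returning `None` would suggest the export captured the wrong domain or cookie name. A malformed file (missing header, spaces instead of tabs) raises `http.cookiejar.LoadError` instead.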
### Known remote tree sizes
- **CICIoT2023** — `CSV/` 328 files (includes `CSV.zip`, `MERGED_CSV.zip`,
`MERGED_CSV/`, and per-attack CSVs), `PCAP/` 311 files across 36 attack
categories. Full dataset is ~12 GB.
### Quick commands
```bash
# Dry-run (enumerate only, no downloads)
DRY_RUN=1 bash scripts/download/download_ciciot2023.sh
# Download first 5 files as a smoke test
LIMIT=5 WHAT=csv bash scripts/download/download_ciciot2023.sh
# Full download
bash scripts/download/download_ciciot2023.sh
```
## CICAPT-IIoT2024 (automated)
Same UNB/CIC pipeline as CICIoT2023, but crawled in a single pass — the
entire `CICAPT-IIoT Dataset/` top-level folder is mirrored (pcap, csv, and
anything else) under `datasets/cicapt_iiot2024/raw/`.
Cookie file: `scripts/download/cookies_cicapt_iiot2024.txt` (Token for
`.cicresearch.ca`).
```bash
# Smoke test first
DRY_RUN=1 LIMIT=5 bash scripts/download/download_cicapt_iiot2024.sh
# Full download
bash scripts/download/download_cicapt_iiot2024.sh
# Skip heavy archives if they duplicate a per-file tree
SKIP_EXT=zip,7z bash scripts/download/download_cicapt_iiot2024.sh
```
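How `SKIP_EXT` is interpreted inside the script is not shown here. One plausible reading, a comma-separated list of extensions matched against each remote file name, can be sketched as follows; `parse_skip_ext` and `is_skipped` are hypothetical names and the case-insensitive matching is an assumption.

```python
import os


def parse_skip_ext(value):
    """Turn 'zip,7z' into {'zip', '7z'}, tolerating blanks, dots, and case."""
    exts = (part.strip().lstrip(".").lower() for part in value.split(","))
    return {ext for ext in exts if ext}


def is_skipped(remote_path, skip_ext):
    """True if the file's final extension (without the dot) is in skip_ext."""
    ext = os.path.splitext(remote_path)[1].lstrip(".").lower()
    return ext in skip_ext
```

With `skip = parse_skip_ext(os.environ.get("SKIP_EXT", ""))`, an unset variable yields an empty set and nothing is skipped. Only the final extension is compared, so a name like `capture.pcap.zip` would match `zip` but not `pcap`.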
Reference URL (browser, with Token cookie live):
<https://cicresearch.ca/IOTDataset/CICAPT-IIoT-Dataset/browse.php?p=CICAPT-IIoT+Dataset>
## USTC-TFC2016 (manual)
```bash
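# The target directory must exist and be empty for `git clone ... .` to succeed
mkdir -p datasets/ustc_tfc2016/raw/pcap
cd datasets/ustc_tfc2016/raw/pcap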
git clone --depth=1 https://github.com/yungshenglu/USTC-TFC2016.git .
```
No official CSV — extract features yourself (CICFlowMeter, USTC-TK2016).
## DataCon2020 (manual)
Register at <https://datacon.qianxin.com/opendata/maliciousstream> and place
the `black/`, `white/`, and `test/` pcap bundles under
`datasets/datacon2020/raw/pcap/`. No official CSV.