Initial commit: code, paper, small artifacts

2026-05-07 20:47:30 +08:00
commit fae2db8cff
322 changed files with 33159 additions and 0 deletions
--- a/scripts/download/README.md
+++ b/scripts/download/README.md
@@ -0,0 +1,112 @@
+# Dataset download scripts
+
+Target layout (mirrors `datasets/cicids2017/`):
+
+```
+datasets/
+  ciciot2023/raw/{pcap,csv}
+  iscxtor2016/raw/{pcap,csv}
+  cicapt_iiot2024/raw/{pcap,csv}
+  ustc_tfc2016/raw/pcap
+  datacon2020/raw/pcap
+```
+
+## CICIoT2023 / ISCXTor2016 (automated)
+
+UNB/CIC gates downloads behind a consent form. After submission the site issues
+a `Token` cookie (domain `.cicresearch.ca`) that unlocks two endpoints:
+
+- `browse.php?p=<path>` — HTML directory listing
+- `download.php?file=<path>` — raw file bytes
+
+`cic_download.py` is a stdlib-only recursive crawler that walks `browse.php`
+and fetches each leaf via `download.php`. Already-downloaded files are
+skipped (presence-based; the PHP endpoint does not advertise sizes).
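+
+The crawl-and-skip logic described above can be sketched roughly as follows.
+This is an illustration, not the contents of `cic_download.py`: the helper
+names (`parse_listing`, `crawl`, `needs_download`) and the `fetch_html`
+callable are hypothetical, and the sketch assumes the listing HTML links to
+entries as `browse.php?p=<dir>` and `download.php?file=<leaf>`.
+
+```python
+import os
+from html.parser import HTMLParser
+from urllib.parse import parse_qs, urlparse
+
+
+class ListingParser(HTMLParser):
+    """Collect subdirectory and file paths from one browse.php page."""
+
+    def __init__(self):
+        super().__init__()
+        self.dirs, self.files = [], []
+
+    def handle_starttag(self, tag, attrs):
+        if tag != "a":
+            return
+        parsed = urlparse(dict(attrs).get("href") or "")
+        query = parse_qs(parsed.query)
+        # Assumed link shapes: browse.php?p=<dir> and download.php?file=<leaf>.
+        if parsed.path.endswith("browse.php") and "p" in query:
+            self.dirs.append(query["p"][0])
+        elif parsed.path.endswith("download.php") and "file" in query:
+            self.files.append(query["file"][0])
+
+
+def parse_listing(html):
+    """Return (dirs, files) found in one listing page."""
+    parser = ListingParser()
+    parser.feed(html)
+    return parser.dirs, parser.files
+
+
+def crawl(fetch_html, root):
+    """Yield every leaf path under root; fetch_html(path) returns listing HTML."""
+    seen, stack = set(), [root]
+    while stack:
+        current = stack.pop()
+        if current in seen:
+            continue
+        seen.add(current)
+        dirs, files = parse_listing(fetch_html(current))
+        yield from files
+        # Only descend, so "parent directory" links do not walk back up.
+        stack.extend(d for d in dirs if d.startswith(current + "/"))
+
+
+def needs_download(dest_root, remote_path):
+    """Presence-based skip: fetch only if no local file exists at the target."""
+    return not os.path.exists(os.path.join(dest_root, remote_path))
+```
+
+Passing `fetch_html` in as a callable keeps the traversal testable without
+network access; a real run would back it with an HTTP client that sends the
+`Token` cookie.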
+
+### Workflow
+
+1. Open the dataset page in a browser, then fill in and submit the consent form:
+   - CICIoT2023 : <https://www.unb.ca/cic/datasets/iotdataset-2023.html>
+   - ISCXTor2016: <https://www.unb.ca/cic/datasets/tor.html>
+2. After submitting, click through to `cicresearch.ca/.../browse.php`. The page
+   must load successfully in your browser, which confirms the `Token` cookie is set.
+3. Export the cookie in **Netscape format** (seven tab-separated fields;
+   `<expiry>` is a Unix timestamp in seconds). One cookie line is sufficient:
+
+   ```
+   # Netscape HTTP Cookie File
+   .cicresearch.ca	TRUE	/	TRUE	<expiry>	Token	<value>
+   ```
+
+   Save as:
+   - `scripts/download/cookies_ciciot2023.txt`
+   - `scripts/download/cookies_iscxtor2016.txt`
+
+   Tokens are per-dataset: a CICIoT2023 cookie will not work for ISCXTor2016.
+4. Run:
+
+   ```bash
+   bash scripts/download/download_ciciot2023.sh
+   bash scripts/download/download_iscxtor2016.sh
+   ```
+
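+   Optional environment variables: `WHAT=pcap|csv|both` (which subtree to
+   fetch), `DEST=` (output directory), `COOKIES=` (cookie file path),
+   `DRY_RUN=1` (enumerate only, no downloads), `LIMIT=N` (stop after N files).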
+   For ISCXTor, if the remote subdir names differ from the defaults
+   (`Pcaps` / `CSVs`), set `PCAP_ROOT=` / `CSV_ROOT=`.
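+
+Before starting a long download, the exported cookie file from step 3 can be
+sanity-checked with the standard library. The sketch below is an optional
+helper, not part of the download scripts; `find_token` is a hypothetical name.
+It relies on `http.cookiejar.MozillaCookieJar`, which reads the Netscape
+format and requires the `# Netscape HTTP Cookie File` header line.
+
+```python
+import http.cookiejar
+
+
+def find_token(cookie_path, domain_suffix="cicresearch.ca"):
+    """Return the Token cookie value, or None if it is absent or expired."""
+    jar = http.cookiejar.MozillaCookieJar(cookie_path)
+    # With ignore_expires=False, cookies past their expiry are dropped on load.
+    jar.load(ignore_discard=True, ignore_expires=False)
+    for cookie in jar:
+        domain = cookie.domain.lstrip(".")
+        if cookie.name == "Token" and domain.endswith(domain_suffix):
+            return cookie.value
+    return None
+
+
+# Example use (path taken from step 3 above):
+#   token = find_token("scripts/download/cookies_ciciot2023.txt")
+#   print("Token found" if token else "Token missing or expired")
+```
+
+A missing header line or a malformed row raises `http.cookiejar.LoadError`,
+which usually means the file was not saved with tab separators.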
+
+### Known remote tree sizes
+
+- **CICIoT2023** — `CSV/` 328 files (includes `CSV.zip`, `MERGED_CSV.zip`,
+  `MERGED_CSV/`, and per-attack CSVs), `PCAP/` 311 files across 36 attack
+  categories. Full dataset is ~12 GB.
+
+### Quick commands
+
+```bash
+# Dry-run (enumerate only, no downloads)
+DRY_RUN=1 bash scripts/download/download_ciciot2023.sh
+
+# Download the first 5 CSV files as a smoke test
+LIMIT=5 WHAT=csv bash scripts/download/download_ciciot2023.sh
+
+# Full download
+bash scripts/download/download_ciciot2023.sh
+```
+
+## CICAPT-IIoT2024 (automated)
+
+Same UNB/CIC pipeline as CICIoT2023, but crawled in a single pass — the
+entire `CICAPT-IIoT Dataset/` top-level folder is mirrored (pcap, csv, and
+anything else) under `datasets/cicapt_iiot2024/raw/`.
+
+Cookie file: `scripts/download/cookies_cicapt_iiot2024.txt` (Token for
+`.cicresearch.ca`).
+
+```bash
+# Smoke test first
+DRY_RUN=1 LIMIT=5 bash scripts/download/download_cicapt_iiot2024.sh
+
+# Full download
+bash scripts/download/download_cicapt_iiot2024.sh
+
+# Skip heavy archives if they duplicate a per-file tree
+SKIP_EXT=zip,7z bash scripts/download/download_cicapt_iiot2024.sh
+```
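+
+An extension filter like `SKIP_EXT` could be implemented along these lines.
+This is a sketch under assumptions, not the script's actual code: the helper
+names are hypothetical, and it assumes the value is a comma-separated list of
+extensions matched case-insensitively against the final suffix of each path.
+
+```python
+import os
+
+
+def parse_skip_ext(value):
+    """Turn 'zip,7z' into {'.zip', '.7z'}; an empty string skips nothing."""
+    return {
+        "." + ext.strip().lstrip(".").lower()
+        for ext in value.split(",")
+        if ext.strip()
+    }
+
+
+def should_skip(remote_path, skip_exts):
+    """True if the path's final extension is in the skip set."""
+    return os.path.splitext(remote_path)[1].lower() in skip_exts
+
+
+# Read the filter once from the environment, defaulting to "skip nothing".
+skip_exts = parse_skip_ext(os.environ.get("SKIP_EXT", ""))
+```
+
+Because only the final suffix is compared, a name such as `data.tar.gz` would
+be matched by `gz`, not by `tar.gz`.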
+
+Reference URL (open in a browser while the `Token` cookie is still valid):
+<https://cicresearch.ca/IOTDataset/CICAPT-IIoT-Dataset/browse.php?p=CICAPT-IIoT+Dataset>
+
+## USTC-TFC2016 (manual)
+
+```bash
+mkdir -p datasets/ustc_tfc2016/raw/pcap && cd datasets/ustc_tfc2016/raw/pcap
+git clone --depth=1 https://github.com/yungshenglu/USTC-TFC2016.git .
+```
+
+No official CSV is provided; extract features yourself (e.g., with
+CICFlowMeter or USTC-TK2016).
+
+## DataCon2020 (manual)
+
+Register at <https://datacon.qianxin.com/opendata/maliciousstream> and place
+the `black/`, `white/`, and `test/` pcap bundles under
+`datasets/datacon2020/raw/pcap/`. No official CSV is provided.