# Dataset download scripts

Target layout (mirrors `datasets/cicids2017/`):

```
datasets/
  ciciot2023/raw/{pcap,csv}
  iscxtor2016/raw/{pcap,csv}
  cicapt_iiot2024/raw/{pcap,csv}
  ustc_tfc2016/raw/pcap
  datacon2020/raw/pcap
```

## CICIoT2023 / ISCXTor2016 (automated)

UNB/CIC gates downloads behind a consent form. After submission the site issues a `Token` cookie (domain `.cicresearch.ca`) that unlocks two endpoints:

- `browse.php?p=`: HTML directory listing
- `download.php?file=`: raw file bytes

`cic_download.py` is a stdlib-only recursive crawler that walks `browse.php` and fetches each leaf via `download.php`. Already-downloaded files are skipped (presence-based; the PHP endpoint does not advertise sizes).

### Workflow

1. Open the dataset page in a browser, fill and submit the form:
   - CICIoT2023 :
   - ISCXTor2016:
2. After submit, click through to `cicresearch.ca/.../browse.php`. The page must load successfully in your browser; this proves the Token is set.
3. Export the cookie in **Netscape format** (tab-separated). One line is sufficient:

   ```
   # Netscape HTTP Cookie File
   .cicresearch.ca TRUE / TRUE Token
   ```

   Save as:
   - `scripts/download/cookies_ciciot2023.txt`
   - `scripts/download/cookies_iscxtor2016.txt`

   Tokens are per-dataset: a CICIoT2023 cookie will not work for ISCXTor.
4. Run:

   ```bash
   bash scripts/download/download_ciciot2023.sh
   bash scripts/download/download_iscxtor2016.sh
   ```

Env vars: `WHAT=pcap|csv|both`, `DEST=`, `COOKIES=`, `DRY_RUN=1`, `LIMIT=N`. For ISCXTor, if the remote subdir names differ from the defaults (`Pcaps` / `CSVs`), set `PCAP_ROOT=` / `CSV_ROOT=`.

### Known remote tree sizes

- **CICIoT2023**: `CSV/` 328 files (includes `CSV.zip`, `MERGED_CSV.zip`, `MERGED_CSV/`, and per-attack CSVs), `PCAP/` 311 files across 36 attack categories. Full dataset is ~12 GB.
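
### Checking a cookie file (sketch)

A cookie file that was exported with spaces instead of tabs, or without a value, is an easy way to lose time on a failed crawl. The snippet below is a minimal sketch of a pre-flight check for the file from workflow step 3. It is not part of `cic_download.py`; the function name `find_token` is hypothetical, and it assumes the standard seven-field Netscape layout (domain, subdomain flag, path, secure flag, expiry, name, value).

```python
from pathlib import Path


def find_token(cookie_path, domain="cicresearch.ca"):
    """Return the Token cookie value from a Netscape cookie file, or None.

    Hypothetical helper for illustration; not taken from cic_download.py.
    """
    text = Path(cookie_path).read_text(encoding="utf-8")
    for line in text.splitlines():
        # curl-style exports prefix HttpOnly cookies with "#HttpOnly_".
        if line.startswith("#HttpOnly_"):
            line = line[len("#HttpOnly_"):]
        elif not line.strip() or line.startswith("#"):
            continue  # blank line, header, or comment
        fields = line.split("\t")
        if len(fields) != 7:
            continue  # not tab-separated, or fields are missing
        host, _subdomains, _path, _secure, _expiry, name, value = fields
        host = host.lstrip(".")
        domain_ok = host == domain or host.endswith("." + domain)
        if name == "Token" and value and domain_ok:
            return value
    return None


# Example use (path taken from the workflow above):
#   if find_token("scripts/download/cookies_ciciot2023.txt") is None:
#       raise SystemExit("no usable Token cookie found")
```

A `None` result only says the file is malformed or has no `Token` entry; it cannot tell whether the server still accepts the token, so a `DRY_RUN=1` pass remains the real test.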

### Quick commands

```bash
# Dry-run (enumerate only, no downloads)
DRY_RUN=1 bash scripts/download/download_ciciot2023.sh

# Download first 5 files as a smoke test
LIMIT=5 WHAT=csv bash scripts/download/download_ciciot2023.sh

# Full download
bash scripts/download/download_ciciot2023.sh
```

## CICAPT-IIoT2024 (automated)

Same UNB/CIC pipeline as CICIoT2023, but crawled in a single pass: the entire `CICAPT-IIoT Dataset/` top-level folder is mirrored (pcap, csv, and anything else) under `datasets/cicapt_iiot2024/raw/`.

Cookie file: `scripts/download/cookies_cicapt_iiot2024.txt` (Token for `.cicresearch.ca`).

```bash
# Smoke test first
DRY_RUN=1 LIMIT=5 bash scripts/download/download_cicapt_iiot2024.sh

# Full download
bash scripts/download/download_cicapt_iiot2024.sh

# Skip heavy archives if they duplicate a per-file tree
SKIP_EXT=zip,7z bash scripts/download/download_cicapt_iiot2024.sh
```

Reference URL (browser, with Token cookie live):

## USTC-TFC2016 (manual)

```bash
cd datasets/ustc_tfc2016/raw/pcap
git clone --depth=1 https://github.com/yungshenglu/USTC-TFC2016.git .
```

No official CSV; extract features yourself (CICFlowMeter, USTC-TK2016).

## DataCon2020 (manual)

Register at and place the `black/` `white/` `test/` pcap bundles under `datasets/datacon2020/raw/pcap/`. No official CSV.
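
## Checking the layout (sketch)

Once the automated and manual steps are done, the target layout listed at the top can be verified with a few lines of Python. This is a minimal sketch, not a script shipped in `scripts/download/`; `EXPECTED` and `missing_dirs` are hypothetical names, and the directory list is copied from the target layout. Because the CICAPT-IIoT2024 crawl mirrors the remote folder as-is, its subfolder names may differ from `pcap` / `csv`, so adjust that entry if needed.

```python
from pathlib import Path

# Expected subdirectories of <root>/<dataset>/raw/, copied from the
# target layout at the top of this README.
EXPECTED = {
    "ciciot2023": ("pcap", "csv"),
    "iscxtor2016": ("pcap", "csv"),
    "cicapt_iiot2024": ("pcap", "csv"),
    "ustc_tfc2016": ("pcap",),
    "datacon2020": ("pcap",),
}


def missing_dirs(root="datasets"):
    """Return the expected directories that are absent or empty."""
    root = Path(root)
    missing = []
    for dataset, kinds in EXPECTED.items():
        for kind in kinds:
            target = root / dataset / "raw" / kind
            # An empty directory counts as missing: nothing was downloaded.
            if not target.is_dir() or not any(target.iterdir()):
                missing.append(target.as_posix())
    return missing


# Example use from the repository root:
#   for path in missing_dirs():
#       print("missing or empty:", path)
```

The check is presence-based only, in the same spirit as the crawler's skip logic; it does not validate file counts or sizes.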