Dataset download scripts
Target layout (mirrors datasets/cicids2017/):
datasets/
  ciciot2023/raw/{pcap,csv}
  iscxtor2016/raw/{pcap,csv}
  cicapt_iiot2024/raw/{pcap,csv}
  ustc_tfc2016/raw/pcap
  datacon2020/raw/pcap
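The layout above could be created ahead of any download. The following is a minimal sketch, not part of the repository's scripts: the `LAYOUT` mapping and the `make_layout` helper are illustrative names, and the root directory is passed in by the caller.

```python
from pathlib import Path

# Dataset name -> raw subdirectories, copied from the target layout above.
LAYOUT = {
    "ciciot2023": ("pcap", "csv"),
    "iscxtor2016": ("pcap", "csv"),
    "cicapt_iiot2024": ("pcap", "csv"),
    "ustc_tfc2016": ("pcap",),
    "datacon2020": ("pcap",),
}


def make_layout(root):
    """Create <root>/<dataset>/raw/<kind> for every entry; return the paths."""
    created = []
    for dataset, kinds in LAYOUT.items():
        for kind in kinds:
            path = Path(root) / dataset / "raw" / kind
            # parents=True builds intermediate dirs; exist_ok makes reruns safe.
            path.mkdir(parents=True, exist_ok=True)
            created.append(path)
    return created
```

Calling `make_layout("datasets")` from the repository root would produce the tree shown above; running it twice is harmless because existing directories are left alone.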
CICIoT2023 / ISCXTor2016 (automated)
UNB/CIC gates downloads behind a consent form. After submission the site issues
a Token cookie (domain .cicresearch.ca) that unlocks two endpoints:
- browse.php?p=<path>: HTML directory listing
- download.php?file=<path>: raw file bytes
cic_download.py is a stdlib-only recursive crawler that walks browse.php
and fetches each leaf via download.php. Already-downloaded files are
skipped (presence-based; the PHP endpoint does not advertise sizes).
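The crawl-and-skip logic described above can be sketched roughly as follows. This is not the code of cic_download.py: it assumes the listing page exposes subdirectories as `<a href>` links to browse.php?p=<path> and files as links to download.php?file=<path>, which may not match the real markup. `ListingParser`, `split_listing`, and `needs_download` are hypothetical names.

```python
from html.parser import HTMLParser
from pathlib import Path
from urllib.parse import parse_qs, urlsplit


class ListingParser(HTMLParser):
    """Collect every href found in <a> tags of a listing page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def split_listing(html):
    """Return (subdirs, files) as lists of remote paths.

    Assumption: subdirectories link to browse.php?p=<path> and files link
    to download.php?file=<path>; the real page markup may differ.
    """
    parser = ListingParser()
    parser.feed(html)
    subdirs, files = [], []
    for href in parser.hrefs:
        parts = urlsplit(href)
        query = parse_qs(parts.query)  # also decodes %2F and '+'
        if parts.path.endswith("browse.php") and "p" in query:
            subdirs.append(query["p"][0])
        elif parts.path.endswith("download.php") and "file" in query:
            files.append(query["file"][0])
    return subdirs, files


def needs_download(remote_path, dest_root):
    """Presence-based skip: fetch only if no local file exists yet."""
    return not (Path(dest_root) / remote_path).is_file()
```

A recursive crawl would call `split_listing` on each returned subdirectory and keep a set of visited paths so that parent or self links do not cause loops. Because the skip is presence-based, a partially written file would be treated as complete; writing to a temporary name and renaming on success is one way to avoid that.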
Workflow
- Open the dataset page in a browser, then fill in and submit the form:
  - CICIoT2023 : https://www.unb.ca/cic/datasets/iotdataset-2023.html
  - ISCXTor2016: https://www.unb.ca/cic/datasets/tor.html
- After submitting, click through to cicresearch.ca/.../browse.php. The page must load successfully in your browser; this proves the Token is set.
- Export the cookie in Netscape format (tab-separated). One line is sufficient:
    # Netscape HTTP Cookie File
    .cicresearch.ca TRUE / TRUE <expiry> Token <value>
  Save as:
  - scripts/download/cookies_ciciot2023.txt
  - scripts/download/cookies_iscxtor2016.txt
  Tokens are per-dataset: a CICIoT2023 cookie will not work for ISCXTor.
- Run:
    bash scripts/download/download_ciciot2023.sh
    bash scripts/download/download_iscxtor2016.sh
  Env vars: WHAT=pcap|csv|both, DEST=, COOKIES=, DRY_RUN=1, LIMIT=N. For ISCXTor, if the remote subdir names differ from the defaults (Pcaps/CSVs), set PCAP_ROOT= / CSV_ROOT=.
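Before starting a long download, it can help to confirm that the exported cookie file parses and actually contains a Token. The sketch below uses Python's standard `http.cookiejar.MozillaCookieJar`, which reads the Netscape format; whether cic_download.py loads cookies this way is an assumption, and `load_token_jar` and `make_opener` are illustrative names.

```python
import http.cookiejar
import urllib.request


def load_token_jar(cookie_file):
    """Load a Netscape-format cookie file and check it carries a Token."""
    jar = http.cookiejar.MozillaCookieJar(cookie_file)
    # ignore_discard keeps session cookies (empty expiry field);
    # ignore_expires keeps cookies whose expiry has passed, so the server,
    # not this loader, decides whether the Token is still valid.
    jar.load(ignore_discard=True, ignore_expires=True)
    if not any(c.name == "Token" for c in jar):
        raise ValueError(f"no Token cookie in {cookie_file}")
    return jar


def make_opener(jar):
    """Build a urllib opener that sends the jar's cookies with requests."""
    return urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
```

`jar.load` raises `http.cookiejar.LoadError` when the header line is missing or a row does not have the seven tab-separated fields, which is the most common failure when the file was saved with spaces instead of tabs.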
Known remote tree sizes
- CICIoT2023: CSV/: 328 files (includes CSV.zip, MERGED_CSV.zip, MERGED_CSV/, and per-attack CSVs); PCAP/: 311 files across 36 attack categories. Full dataset is ~12 GB.
Quick commands
# Dry-run (enumerate only, no downloads)
DRY_RUN=1 bash scripts/download/download_ciciot2023.sh
# Download first 5 files as a smoke test
LIMIT=5 WHAT=csv bash scripts/download/download_ciciot2023.sh
# Full download
bash scripts/download/download_ciciot2023.sh
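For readers wiring these variables into their own tooling, here is one way WHAT, DRY_RUN, and LIMIT might be interpreted. The defaults are assumptions (WHAT=both, no limit, dry run only when DRY_RUN is exactly "1") and are not confirmed by the wrapper scripts; `read_settings` is a hypothetical helper.

```python
import os


def read_settings(env=None):
    """Interpret the wrapper's environment variables.

    Assumed defaults (not confirmed by the scripts): WHAT=both, no LIMIT,
    DRY_RUN off unless set to "1".
    """
    env = os.environ if env is None else env
    what = env.get("WHAT", "both")
    if what not in ("pcap", "csv", "both"):
        raise ValueError(f"WHAT must be pcap, csv, or both, got {what!r}")
    limit = env.get("LIMIT")
    return {
        "kinds": ["pcap", "csv"] if what == "both" else [what],
        "dry_run": env.get("DRY_RUN") == "1",
        "limit": int(limit) if limit else None,
    }
```

Passing a plain dict instead of `os.environ` keeps the function easy to test, for example `read_settings({"WHAT": "csv", "LIMIT": "5"})`.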
CICAPT-IIoT2024 (automated)
Same UNB/CIC pipeline as CICIoT2023, but crawled in a single pass — the
entire CICAPT-IIoT Dataset/ top-level folder is mirrored (pcap, csv, and
anything else) under datasets/cicapt_iiot2024/raw/.
Cookie file: scripts/download/cookies_cicapt_iiot2024.txt (Token for
.cicresearch.ca).
# Smoke test first
DRY_RUN=1 LIMIT=5 bash scripts/download/download_cicapt_iiot2024.sh
# Full download
bash scripts/download/download_cicapt_iiot2024.sh
# Skip heavy archives if they duplicate a per-file tree
SKIP_EXT=zip,7z bash scripts/download/download_cicapt_iiot2024.sh
Reference URL (browser, with Token cookie live): https://cicresearch.ca/IOTDataset/CICAPT-IIoT-Dataset/browse.php?p=CICAPT-IIoT+Dataset
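The SKIP_EXT filter used in this section could behave along these lines. This is a sketch under assumptions: that the value is a comma-separated list of extensions, that matching is case-insensitive, and that only the final extension of a filename counts. The actual script may differ, and both function names are illustrative.

```python
def parse_skip_ext(value):
    """Turn 'zip,7z' into {'zip', '7z'} (lowercased, leading dots stripped)."""
    return {e.strip().lstrip(".").lower() for e in value.split(",") if e.strip()}


def is_skipped(filename, skip_ext):
    """True if the filename's final extension is in the skip set."""
    _, dot, ext = filename.rpartition(".")
    # A name without any dot has no extension and is never skipped.
    return bool(dot) and ext.lower() in skip_ext
```

With `skip = parse_skip_ext("zip,7z")`, a name such as `MERGED_CSV.zip` is skipped while `flow.pcap` is kept.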
USTC-TFC2016 (manual)
cd datasets/ustc_tfc2016/raw/pcap
git clone --depth=1 https://github.com/yungshenglu/USTC-TFC2016.git .
No official CSV — extract features yourself (CICFlowMeter, USTC-TK2016).
DataCon2020 (manual)
Register at https://datacon.qianxin.com/opendata/maliciousstream and place
the black/, white/, and test/ pcap bundles under
datasets/datacon2020/raw/pcap/. No official CSV.