Dataset download scripts

Target layout (mirrors datasets/cicids2017/):

datasets/
  ciciot2023/raw/{pcap,csv}
  iscxtor2016/raw/{pcap,csv}
  cicapt_iiot2024/raw/{pcap,csv}
  ustc_tfc2016/raw/pcap
  datacon2020/raw/pcap
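
A minimal sketch of creating this tree up front, assuming the layout above is complete. The `LAYOUT` mapping and the `make_layout` helper are illustrative names, not part of the repository's scripts.

```python
from pathlib import Path

# Dataset name -> raw subdirectories, copied from the layout above.
LAYOUT = {
    "ciciot2023": ("pcap", "csv"),
    "iscxtor2016": ("pcap", "csv"),
    "cicapt_iiot2024": ("pcap", "csv"),
    "ustc_tfc2016": ("pcap",),
    "datacon2020": ("pcap",),
}


def make_layout(root: Path) -> list[Path]:
    """Create <root>/<dataset>/raw/<kind> for every entry and return the paths."""
    created = []
    for dataset, kinds in LAYOUT.items():
        for kind in kinds:
            path = root / dataset / "raw" / kind
            # exist_ok makes the call safe to repeat on an existing tree.
            path.mkdir(parents=True, exist_ok=True)
            created.append(path)
    return created
```

Calling `make_layout(Path("datasets"))` from the repository root would create the eight leaf directories listed above.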

CICIoT2023 / ISCXTor2016 (automated)

UNB/CIC gates downloads behind a consent form. After submission the site issues a Token cookie (domain .cicresearch.ca) that unlocks two endpoints:

  • browse.php?p=<path> — HTML directory listing
  • download.php?file=<path> — raw file bytes

cic_download.py is a stdlib-only recursive crawler that walks browse.php and fetches each leaf via download.php. Already-downloaded files are skipped (presence-based; the PHP endpoint does not advertise sizes).
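
The crawl described above can be sketched roughly as follows. This is not the code in cic_download.py: it assumes each listing page links to subdirectories via `browse.php?p=...` and to files via `download.php?file=...` in plain `<a href>` attributes, which may not match the real HTML. `ListingParser`, `split_listing`, and `needs_download` are hypothetical names.

```python
from html.parser import HTMLParser
from pathlib import Path
from urllib.parse import parse_qs, urlparse


class ListingParser(HTMLParser):
    """Collect every href found in <a> tags of a listing page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def split_listing(html: str):
    """Return (subdirs, files) as remote paths taken from the query strings."""
    parser = ListingParser()
    parser.feed(html)
    subdirs, files = [], []
    for href in parser.hrefs:
        url = urlparse(href)
        query = parse_qs(url.query)  # also decodes %2F and '+'
        if url.path.endswith("browse.php") and "p" in query:
            subdirs.append(query["p"][0])
        elif url.path.endswith("download.php") and "file" in query:
            files.append(query["file"][0])
    return subdirs, files


def needs_download(dest_root: Path, remote_path: str) -> bool:
    """Presence-based skip: fetch only if no local file exists yet."""
    return not (dest_root / remote_path).is_file()
```

A recursive crawler would call `split_listing` on each subdirectory in turn and track visited paths to avoid loops. Under a purely presence-based check like this one, a file truncated by an interrupted run would be treated as complete, so it is prudent to delete partial files before rerunning.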

Workflow

  1. Open the dataset page in a browser, then fill in and submit the consent form.

  2. After submitting, click through to cicresearch.ca/.../browse.php. The page must load successfully in your browser, which confirms that the Token cookie is set.

  3. Export the cookie in Netscape format (tab-separated). One line is sufficient:

    # Netscape HTTP Cookie File
    .cicresearch.ca	TRUE	/	TRUE	<expiry>	Token	<value>
    

    Save as:

    • scripts/download/cookies_ciciot2023.txt
    • scripts/download/cookies_iscxtor2016.txt

    Tokens are per-dataset — a CICIoT2023 cookie will not work for ISCXTor.

  4. Run:

    bash scripts/download/download_ciciot2023.sh
    bash scripts/download/download_iscxtor2016.sh
    

    Env vars: WHAT=pcap|csv|both, DEST=, COOKIES=, DRY_RUN=1, LIMIT=N. For ISCXTor, if the remote subdir names differ from the defaults (Pcaps / CSVs), set PCAP_ROOT= / CSV_ROOT=.
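
Before running step 4, the cookie file from step 3 can be sanity-checked with the standard library. The sketch below is an assumption about how such a check might look, not what cic_download.py does; `load_token` is a hypothetical helper. It relies on `http.cookiejar.MozillaCookieJar`, which expects the `# Netscape HTTP Cookie File` header line and seven tab-separated fields per cookie.

```python
from http.cookiejar import MozillaCookieJar


def load_token(cookie_file: str, domain: str = ".cicresearch.ca") -> str:
    """Return the Token cookie value, or raise ValueError if it is unusable."""
    jar = MozillaCookieJar()
    # Keep session and expired cookies so problems can be reported explicitly
    # instead of the cookie silently disappearing from the jar.
    jar.load(cookie_file, ignore_discard=True, ignore_expires=True)
    for cookie in jar:
        if cookie.name == "Token" and cookie.domain == domain:
            # An expiry of 0 or an empty field usually marks a session
            # cookie in browser exports; treat it as still valid.
            if cookie.expires and cookie.is_expired():
                raise ValueError(f"Token in {cookie_file} has expired")
            return cookie.value
    raise ValueError(f"no Token cookie for {domain} in {cookie_file}")
```

For example, `load_token("scripts/download/cookies_ciciot2023.txt")` returns the token string when the file is well-formed. A file whose fields are separated by spaces rather than tabs fails to load, which is the most common export mistake.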

Known remote tree sizes

  • CICIoT2023: CSV/ has 328 files (including CSV.zip, MERGED_CSV.zip, MERGED_CSV/, and per-attack CSVs); PCAP/ has 311 files across 36 attack categories. The full dataset is ~12 GB.

Quick commands

# Dry-run (enumerate only, no downloads)
DRY_RUN=1 bash scripts/download/download_ciciot2023.sh

# Download first 5 files as a smoke test
LIMIT=5 WHAT=csv bash scripts/download/download_ciciot2023.sh

# Full download
bash scripts/download/download_ciciot2023.sh

CICAPT-IIoT2024 (automated)

Same UNB/CIC pipeline as CICIoT2023, but crawled in a single pass — the entire CICAPT-IIoT Dataset/ top-level folder is mirrored (pcap, csv, and anything else) under datasets/cicapt_iiot2024/raw/.

Cookie file: scripts/download/cookies_cicapt_iiot2024.txt (Token for .cicresearch.ca).

# Smoke test first (dry run: enumerate only, no downloads)
DRY_RUN=1 LIMIT=5 bash scripts/download/download_cicapt_iiot2024.sh

# Full download
bash scripts/download/download_cicapt_iiot2024.sh

# Skip heavy archives if they duplicate a per-file tree
SKIP_EXT=zip,7z bash scripts/download/download_cicapt_iiot2024.sh

Reference URL (browser, with Token cookie live): https://cicresearch.ca/IOTDataset/CICAPT-IIoT-Dataset/browse.php?p=CICAPT-IIoT+Dataset
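
How `SKIP_EXT` and `LIMIT` might combine is sketched below. The actual semantics live in the download scripts and are not documented here, so treat the details as assumptions: matching is case-insensitive on the final extension, and the limit is applied after skipping. `parse_skip_ext`, `select_files`, and `select_from_env` are hypothetical names.

```python
import os
from pathlib import PurePosixPath


def parse_skip_ext(value: str) -> set[str]:
    """Turn 'zip,7z' into {'.zip', '.7z'}; empty input yields an empty set."""
    return {
        "." + ext.strip().lstrip(".").lower()
        for ext in value.split(",")
        if ext.strip()
    }


def select_files(remote_paths, skip_ext="", limit=None):
    """Drop paths whose extension is skipped, then keep at most `limit`."""
    skipped = parse_skip_ext(skip_ext)
    kept = [
        p for p in remote_paths
        if PurePosixPath(p).suffix.lower() not in skipped
    ]
    return kept if limit is None else kept[:limit]


def select_from_env(remote_paths, env=os.environ):
    """Read SKIP_EXT and LIMIT from the environment, as the scripts do."""
    limit = env.get("LIMIT")
    return select_files(
        remote_paths,
        skip_ext=env.get("SKIP_EXT", ""),
        limit=int(limit) if limit else None,
    )
```

With `SKIP_EXT=zip,7z`, a listing containing both `CSV.zip` and the per-attack CSVs would keep only the CSVs, which avoids fetching the same data twice.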

USTC-TFC2016 (manual)

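# The target directory must exist and be empty for `git clone ... .` to succeed
mkdir -p datasets/ustc_tfc2016/raw/pcap
cd datasets/ustc_tfc2016/raw/pcap
git clone --depth=1 https://github.com/yungshenglu/USTC-TFC2016.git .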

No official CSV — extract features yourself (CICFlowMeter, USTC-TK2016).

DataCon2020 (manual)

Register at https://datacon.qianxin.com/opendata/maliciousstream, download the black/, white/, and test/ pcap bundles, and place them under datasets/datacon2020/raw/pcap/. No official CSV.