Initial commit: code, paper, small artifacts
112
scripts/download/README.md
Normal file
@@ -0,0 +1,112 @@
# Dataset download scripts

Target layout (mirrors `datasets/cicids2017/`):

```
datasets/
  ciciot2023/raw/{pcap,csv}
  iscxtor2016/raw/{pcap,csv}
  cicapt_iiot2024/raw/{pcap,csv}
  ustc_tfc2016/raw/pcap
  datacon2020/raw/pcap
```
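
The layout above can be created up front so that the manual datasets have somewhere to land. The following is a minimal sketch, not part of the repository: `LAYOUT` and `make_layout` are hypothetical names, and the directory names are copied from the tree above.

```python
from pathlib import Path

# Dataset -> raw subdirectories, copied from the target layout above.
LAYOUT = {
    "ciciot2023": ("pcap", "csv"),
    "iscxtor2016": ("pcap", "csv"),
    "cicapt_iiot2024": ("pcap", "csv"),
    "ustc_tfc2016": ("pcap",),
    "datacon2020": ("pcap",),
}


def make_layout(root):
    """Create <root>/<dataset>/raw/<kind> for every entry and return the paths."""
    created = []
    for dataset, kinds in LAYOUT.items():
        for kind in kinds:
            path = Path(root) / dataset / "raw" / kind
            # exist_ok keeps the call idempotent on an already-populated tree.
            path.mkdir(parents=True, exist_ok=True)
            created.append(path)
    return created
```

Calling `make_layout("datasets")` from the repository root would produce the tree shown above; existing directories are left untouched.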

## CICIoT2023 / ISCXTor2016 (automated)

UNB/CIC gates downloads behind a consent form. After submission the site issues
a `Token` cookie (domain `.cicresearch.ca`) that unlocks two endpoints:

- `browse.php?p=<path>`: HTML directory listing
- `download.php?file=<path>`: raw file bytes
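
As an illustration of how the two endpoints are addressed, the sketch below builds both URLs with `urllib.parse.urlencode`. The helper names are hypothetical, and the encoding choice (spaces as `+`, slashes percent-encoded) is an assumption about what the server accepts; it matches the `+` seen in the CICAPT-IIoT reference URL in this README, but `cic_download.py` may encode paths differently.

```python
from urllib.parse import urlencode


def browse_url(base, path):
    """Directory-listing URL for a remote path below `base`."""
    return f"{base}/browse.php?{urlencode({'p': path})}"


def download_url(base, path):
    """Raw-file URL for a remote path below `base`."""
    return f"{base}/download.php?{urlencode({'file': path})}"
```

For example, `browse_url("https://cicresearch.ca/IOTDataset/CICAPT-IIoT-Dataset", "CICAPT-IIoT Dataset")` yields a URL ending in `browse.php?p=CICAPT-IIoT+Dataset`.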

`cic_download.py` is a stdlib-only recursive crawler that walks `browse.php`
and fetches each leaf via `download.php`. Already-downloaded files are
skipped (presence-based; the PHP endpoint does not advertise sizes).
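
The walk-and-skip logic described above can be sketched roughly as follows. This is not the code in `cic_download.py`: `list_dir` and `fetch` are hypothetical callables standing in for the `browse.php` and `download.php` requests, an explicit stack replaces recursion, and writing to a `.part` file before renaming is an added precaution so that an interrupted transfer is not mistaken for a finished one by the presence check.

```python
from pathlib import Path


def crawl(list_dir, fetch, remote_root, dest):
    """Mirror remote_root into dest, skipping files that already exist.

    list_dir(path) -> (subdirs, files), both lists of remote paths.
    fetch(path)    -> bytes of one remote file.
    Returns (downloaded, skipped) lists of remote paths.
    """
    dest = Path(dest)
    downloaded, skipped = [], []
    stack = [remote_root]
    while stack:
        subdirs, files = list_dir(stack.pop())
        stack.extend(subdirs)
        for remote in files:
            # Keep only the part below remote_root so dest mirrors the tree.
            local = dest / Path(remote).relative_to(remote_root)
            if local.exists():  # presence-based: no size or hash comparison
                skipped.append(remote)
                continue
            local.parent.mkdir(parents=True, exist_ok=True)
            part = local.with_name(local.name + ".part")
            part.write_bytes(fetch(remote))
            part.replace(local)  # rename only after the write finished
            downloaded.append(remote)
    return downloaded, skipped
```

Because the check is presence-only, a file that changed on the server is not refreshed; deleting the local copy forces a new download.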

### Workflow

1. Open the dataset page in a browser, then fill in and submit the form:
   - CICIoT2023: <https://www.unb.ca/cic/datasets/iotdataset-2023.html>
   - ISCXTor2016: <https://www.unb.ca/cic/datasets/tor.html>
2. After submitting, click through to `cicresearch.ca/.../browse.php`. The page
   must load successfully in your browser; this confirms the Token is set.
3. Export the cookie in **Netscape format** (tab-separated). The header plus
   one cookie line is sufficient:

   ```
   # Netscape HTTP Cookie File
   .cicresearch.ca TRUE / TRUE <expiry> Token <value>
   ```

   Save as:
   - `scripts/download/cookies_ciciot2023.txt`
   - `scripts/download/cookies_iscxtor2016.txt`

   Tokens are per-dataset: a CICIoT2023 cookie will not work for ISCXTor2016.
4. Run:

   ```bash
   bash scripts/download/download_ciciot2023.sh
   bash scripts/download/download_iscxtor2016.sh
   ```

   Env vars: `WHAT=pcap|csv|both`, `DEST=`, `COOKIES=`, `DRY_RUN=1`, `LIMIT=N`.
   For ISCXTor2016, if the remote subdir names differ from the defaults
   (`Pcaps` / `CSVs`), set `PCAP_ROOT=` / `CSV_ROOT=`.
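
Hand-written cookie files often fail because the fields end up separated by spaces instead of tabs. The sketch below writes the one-line file from step 3 with real tab separators and reads it back with the standard library's `http.cookiejar.MozillaCookieJar`. The helper names are hypothetical; the seven fields are domain, include-subdomains flag, path, secure flag, expiry (Unix seconds), name, and value.

```python
from http.cookiejar import MozillaCookieJar


def write_cookie_file(path, token, expiry, domain=".cicresearch.ca"):
    """Write a one-cookie Netscape file using tab-separated fields."""
    fields = [domain, "TRUE", "/", "TRUE", str(int(expiry)), "Token", token]
    with open(path, "w", encoding="utf-8") as fh:
        fh.write("# Netscape HTTP Cookie File\n")
        fh.write("\t".join(fields) + "\n")


def load_token(path):
    """Return the Token value from a Netscape cookie file, or None."""
    jar = MozillaCookieJar()
    # ignore_expires keeps the cookie visible even if <expiry> has passed,
    # which helps tell a malformed file apart from a stale one.
    jar.load(path, ignore_discard=True, ignore_expires=True)
    for cookie in jar:
        if cookie.name == "Token":
            return cookie.value
    return None
```

If `load_token` raises `http.cookiejar.LoadError`, the file is usually missing the header line or has space-separated fields.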

### Known remote tree sizes

- **CICIoT2023**: `CSV/` 328 files (includes `CSV.zip`, `MERGED_CSV.zip`,
  `MERGED_CSV/`, and per-attack CSVs), `PCAP/` 311 files across 36 attack
  categories. Full dataset is ~12 GB.
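
The file counts above can serve as a rough completeness check after a full download. The sketch below assumes that the local `raw/csv` and `raw/pcap` directories mirror the remote `CSV/` and `PCAP/` trees one-to-one, which this README does not state explicitly; `EXPECTED`, `count_files`, and `missing_counts` are hypothetical names.

```python
from pathlib import Path

# File counts quoted in this README for the remote CICIoT2023 tree.
EXPECTED = {"csv": 328, "pcap": 311}


def count_files(root):
    """Number of regular files below root (0 if root does not exist)."""
    root = Path(root)
    if not root.is_dir():
        return 0
    return sum(1 for p in root.rglob("*") if p.is_file())


def missing_counts(raw_dir, expected=EXPECTED):
    """Map each kind to (expected - found); 0 means the counts match."""
    return {kind: want - count_files(Path(raw_dir) / kind)
            for kind, want in expected.items()}
```

A result such as `{"csv": 0, "pcap": 0}` from `missing_counts("datasets/ciciot2023/raw")` would suggest the mirror is complete; positive values indicate files still to fetch.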

### Quick commands

```bash
# Dry-run (enumerate only, no downloads)
DRY_RUN=1 bash scripts/download/download_ciciot2023.sh

# Download first 5 files as a smoke test
LIMIT=5 WHAT=csv bash scripts/download/download_ciciot2023.sh

# Full download
bash scripts/download/download_ciciot2023.sh
```
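
When the scripts are driven from Python (for example in a notebook or a pipeline step), the same variables can be passed through the process environment. The helper below is a hypothetical sketch that only assembles the environment; it sets `WHAT` explicitly because this README does not state the scripts' default.

```python
import os


def script_env(what="both", limit=None, dry_run=False, base=None):
    """Return an environment dict carrying WHAT, LIMIT, and DRY_RUN."""
    if what not in ("pcap", "csv", "both"):
        raise ValueError(f"WHAT must be pcap, csv, or both, got {what!r}")
    # Copy so that the caller's environment is never modified in place.
    env = dict(os.environ if base is None else base)
    env["WHAT"] = what
    if limit is not None:
        env["LIMIT"] = str(int(limit))
    if dry_run:
        env["DRY_RUN"] = "1"
    return env
```

The smoke test above would then correspond to `subprocess.run(["bash", "scripts/download/download_ciciot2023.sh"], env=script_env("csv", limit=5), check=True)`.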

## CICAPT-IIoT2024 (automated)

Same UNB/CIC pipeline as CICIoT2023, but crawled in a single pass: the
entire `CICAPT-IIoT Dataset/` top-level folder is mirrored (pcap, csv, and
anything else) under `datasets/cicapt_iiot2024/raw/`.

Cookie file: `scripts/download/cookies_cicapt_iiot2024.txt` (Token for
`.cicresearch.ca`).

```bash
# Smoke test first
DRY_RUN=1 LIMIT=5 bash scripts/download/download_cicapt_iiot2024.sh

# Full download
bash scripts/download/download_cicapt_iiot2024.sh

# Skip heavy archives if they duplicate a per-file tree
SKIP_EXT=zip,7z bash scripts/download/download_cicapt_iiot2024.sh
```

Reference URL (browser, with Token cookie live):
<https://cicresearch.ca/IOTDataset/CICAPT-IIoT-Dataset/browse.php?p=CICAPT-IIoT+Dataset>
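
To make the effect of `SKIP_EXT=zip,7z` concrete, the sketch below shows one plausible way such a comma-separated list could be turned into a filter. How the actual script interprets the variable is not documented here, so the case-insensitive suffix match and the function names are assumptions.

```python
def parse_skip_ext(value):
    """Turn 'zip,7z' into {'.zip', '.7z'}; empty input gives an empty set."""
    return {"." + ext.strip().lstrip(".").lower()
            for ext in value.split(",") if ext.strip()}


def should_skip(remote_path, skip_ext):
    """True when the file name ends with one of the skipped extensions."""
    name = remote_path.lower()
    return any(name.endswith(ext) for ext in skip_ext)
```

With `skip = parse_skip_ext("zip,7z")`, an archive path such as `CSV/CSV.zip` would be skipped while individual `.pcap` and `.csv` files would still be fetched.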

## USTC-TFC2016 (manual)

```bash
# Clone into the target directory (it must be empty); create it first if needed
mkdir -p datasets/ustc_tfc2016/raw/pcap
cd datasets/ustc_tfc2016/raw/pcap
git clone --depth=1 https://github.com/yungshenglu/USTC-TFC2016.git .
```

No official CSV is provided; extract features yourself (CICFlowMeter, USTC-TK2016).

## DataCon2020 (manual)

Register at <https://datacon.qianxin.com/opendata/maliciousstream> and place
the `black/`, `white/`, and `test/` pcap bundles under
`datasets/datacon2020/raw/pcap/`. No official CSV.
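
Since this dataset is placed by hand, a quick check that all three bundles arrived can save a failed run later. The sketch below assumes the bundles are unpacked as directories named `black`, `white`, and `test`; `missing_bundles` is a hypothetical helper, not part of the repository.

```python
from pathlib import Path

# Directory names taken from the placement instructions above.
BUNDLES = ("black", "white", "test")


def missing_bundles(pcap_dir):
    """Return the bundle directories that are absent or empty."""
    pcap_dir = Path(pcap_dir)
    missing = []
    for name in BUNDLES:
        bundle = pcap_dir / name
        if not bundle.is_dir() or not any(bundle.iterdir()):
            missing.append(name)
    return missing
```

An empty list from `missing_bundles("datasets/datacon2020/raw/pcap")` means every bundle directory exists and contains at least one entry.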