whqxbs/llmiotsafe

Fork 0

Files

History

whqxbs e56494b487 initial commit

2026-05-12 17:01:39 +08:00

__pycache__

initial commit

2026-05-12 17:01:39 +08:00

__init__.py

initial commit

2026-05-12 17:01:39 +08:00

assemble_preference_pairs.py

initial commit

2026-05-12 17:01:39 +08:00

audit_dataset.py

initial commit

2026-05-12 17:01:39 +08:00

build_full_log_dataset.py

initial commit

2026-05-12 17:01:39 +08:00

build_trl_dataset.py

initial commit

2026-05-12 17:01:39 +08:00

collect_api_answers.py

initial commit

2026-05-12 17:01:39 +08:00

export_model_tasks.py

initial commit

2026-05-12 17:01:39 +08:00

prompt_builder.py

initial commit

2026-05-12 17:01:39 +08:00

README.md

initial commit

2026-05-12 17:01:39 +08:00

README.md

DPODataGenFullLog

与 DPODataGen/ 不同，这条数据线不再裁剪日志，而是直接复用 src/evaluation/prompt_builder.py 的 benchmark-style full-log baseline prompt。

核心定位：

复用已生成好的 data_dpo_v2/*/episodes/*.json
保留原有 chosen / rejected / pair metadata
只重建 prompt，使训练分布尽量贴近 benchmark baseline
导出强/弱模型出题任务
收集 strong/weak answers
回填成最终 full-log preference pairs

1. 重建 full-log prompt 数据

python DPODataGenFullLog/build_full_log_dataset.py --input-root data_dpo_v2 --output-root data_dpo_full_log_v1 --split both --mode baseline --overwrite

2. 审计长度

python DPODataGenFullLog/audit_dataset.py --root data_dpo_full_log_v1

3. 导出给 strong / weak model 的任务

python DPODataGenFullLog/export_model_tasks.py --split train --input-root data_dpo_full_log_v1 --output tmp_dpo_full_log/train_tasks.jsonl
python DPODataGenFullLog/export_model_tasks.py --split dev --input-root data_dpo_full_log_v1 --output tmp_dpo_full_log/dev_tasks.jsonl

4. 调 strong / weak model 收答案

默认策略：

默认 temperature=0
默认 no_thinking
默认会带 chat_template_kwargs={"enable_thinking": false}
默认 max_tokens=4096

strong model:

python DPODataGenFullLog/collect_api_answers.py --input tmp_dpo_full_log/train_tasks.jsonl --output tmp_dpo_full_log/strong_train_answers.jsonl --model <strong-model> --api-base <api-base> --api-key <api-key> --workers 4

weak model:

python DPODataGenFullLog/collect_api_answers.py --input tmp_dpo_full_log/train_tasks.jsonl --output tmp_dpo_full_log/weak_train_answers.jsonl --model <weak-model> --api-base <api-base> --api-key <api-key> --workers 4

如果你确实想开 thinking，再显式加 --thinking。

5. 回填成最终 preference pairs

python DPODataGenFullLog/assemble_preference_pairs.py --split train --input-root data_dpo_full_log_v1 --strong-answers tmp_dpo_full_log/strong_train_answers.jsonl --weak-answers tmp_dpo_full_log/weak_train_answers.jsonl --output tmp_dpo_full_log/final_train_pairs.jsonl --report tmp_dpo_full_log/final_train_pairs_report.json

dev 同理，把 train 换成 dev。

说明

这版 prompt 会非常长，长度会显著高于 DPODataGen/ 的压缩版。
它更适合你现在要验证的问题：训练 prompt 是否应尽量对齐 benchmark baseline。
但它也意味着 DPO 显存压力会非常大；2B DPO 比 9B DPO 更现实，9B 更适合后续做 chosen-only SFT。

README.md Unescape Escape

DPODataGenFullLog

1. 重建 full-log prompt 数据

2. 审计长度

3. 导出给 strong / weak model 的任务

4. 调 strong / weak model 收答案

5. 回填成最终 preference pairs

说明

README.md