llmiotsafe/PROGRESS.md

# SafeHome Benchmark Build Progress Tracker

**Project goal**: Build a smart-home safety reasoning benchmark based on the Matter 1.5.1 protocol
**Target venue**: ICLR 2027 (CCF-A)
**Start date**: 2026-04-29

---

## Overall Progress

| Step | Content | Status | Completed | Output files |
|------|------|------|---------|---------|
| 1 | Define Matter device models (JSON Schema) | **Done** | 2026-04-29 | `data/device_schemas/` (13 files) |
| 2 | Design home layout templates | **Done** | 2026-04-29 | `data/home_layouts/` (4 files) |
| 3 | Write the behavior pattern generator | **Done** | 2026-04-29 | `data/behavior_patterns/` (3 files) |
| 4 | Design the anomaly scenario library | **Done** | 2026-04-29 | `data/anomaly_templates/` (8 files) |
| 5 | Implement the episode generation pipeline | **Done** | 2026-04-29 | `src/episode_gen/` (5 modules) |
| 6 | Generation + manual validation | **Generation done** | 2026-04-29 | `data/benchmark/` (900 JSON files, 77 MB) |

---

## Step 1: Matter Device Model Definitions

### Devices to Define

**Core safety devices (6 types)**:
| # | Device | Matter section | Key clusters | Schema file | Status |
|---|------|-----------|-------------|------------|------|
| 1 | Contact Sensor (door/window sensor) | 7.1 | BooleanState | contact_sensor.json | ✅ Done |
| 2 | Occupancy Sensor (motion sensor) | 7.3 | OccupancySensing | occupancy_sensor.json | ✅ Done |
| 3 | Smoke CO Alarm (smoke alarm) | 7.9 | SmokeCoAlarm | smoke_co_alarm.json | ✅ Done |
| 4 | Door Lock (smart door lock) | 8.1 | DoorLock | door_lock.json | ✅ Done |
| 5 | Water Leak Detector (water leak sensor) | 7.12 | BooleanState | water_leak_detector.json | ✅ Done |
| 6 | Window Covering (curtain/window) | 8.3 | WindowCovering | window_covering.json | ✅ Done |

**Behavioral context devices (7 types)**:
| # | Device | Matter section | Key clusters | Schema file | Status |
|---|------|-----------|-------------|------------|------|
| 7 | On/Off Light | 4.1 | OnOff | onoff_light.json | ✅ Done |
| 8 | Dimmable Light | 4.2 | OnOff, LevelControl | dimmable_light.json | ✅ Done |
| 9 | Temperature Sensor | 7.4 | TemperatureMeasurement | temperature_sensor.json | ✅ Done |
| 10 | Room Air Conditioner | 13.3 | OnOff, Thermostat, FanControl | air_conditioner.json | ✅ Done |
| 11 | Laundry Washer (washing machine) | 13.1 | OnOff, OperationalState | laundry_washer.json | ✅ Done |
| 12 | Cook Surface (stove) | 13.7 | OnOff, TemperatureControl | cook_surface.json | ✅ Done |
| 13 | Dishwasher | 13.5 | OnOff, OperationalState | dishwasher.json | ✅ Done |
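
The two tables above list 13 schema files under `data/device_schemas/`. A minimal sketch of a consistency check follows, assuming the files are plain JSON documents. The file names and directory come from this document; the function name `check_schemas` and the script itself are illustrative assumptions, not part of the project's codebase.

```python
import json
from pathlib import Path

# File names copied from the "Schema file" column of the two device tables.
EXPECTED_SCHEMAS = (
    "contact_sensor.json",
    "occupancy_sensor.json",
    "smoke_co_alarm.json",
    "door_lock.json",
    "water_leak_detector.json",
    "window_covering.json",
    "onoff_light.json",
    "dimmable_light.json",
    "temperature_sensor.json",
    "air_conditioner.json",
    "laundry_washer.json",
    "cook_surface.json",
    "dishwasher.json",
)


def check_schemas(schema_dir):
    """Return (missing, invalid): expected files that are absent or not parseable JSON."""
    root = Path(schema_dir)
    missing, invalid = [], []
    for name in EXPECTED_SCHEMAS:
        path = root / name
        if not path.is_file():
            missing.append(name)
            continue
        try:
            json.loads(path.read_text(encoding="utf-8"))
        except json.JSONDecodeError:
            invalid.append(name)
    return missing, invalid


if __name__ == "__main__":
    # Directory name taken from the progress table; adjust if the layout differs.
    missing, invalid = check_schemas("data/device_schemas")
    print(f"missing: {missing}")
    print(f"invalid JSON: {invalid}")
```

This only confirms that the files exist and parse; it does not validate their contents against the Matter specification.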

### Data Sources
- Matter 1.5.1 Device Library: `23-27351-009_Matter-1.5.1-Device-Library-Specification.pdf`
- Matter 1.5.1 Application Cluster: `23-27350-009_Matter-1.5.1-Application-Cluster-Specification.pdf`
- Each device schema is extracted strictly from the PDFs; nothing is invented

---

## Changelog

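| Date | Change |
|------|---------|
| 2026-04-29 | Created the project tracking file; started Step 1 |
| 2026-04-29 | Step 1 done: all 13 Matter device model schemas extracted from the PDF specifications and written to JSON |
| 2026-04-29 | Step 2 done: 4 home layout templates (A: 1 bedroom + 1 living room, ?4 devices; B: 2 bedrooms + 1 living room, ?4 devices; C: 3 bedrooms + 2 living rooms, 48 devices; D: elderly person living alone, 26 devices), verified device by device |
| 2026-04-29 | Step 1 review: checked all 13 device schemas item by item against the Matter 1.5.1 PDFs; fixed errors in ? files (conformance errors, missing clusters, etc.) |
| 2026-04-29 | Step 3 done: behavior pattern data definitions - action_event_mapping (18 action→device-event mappings), daily_routines (3 resident profiles × weekday/weekend schedules), environment_baselines (temperature baselines + sensor fault modes) |
| 2026-04-29 | Step 4 done: anomaly scenario library - 7 major categories, 35 scenarios plus 16 false-positive variants; counts verified against the taxonomy |
| 2026-04-29 | Step 5 done: episode generation pipeline - home_state.py, behavior_engine.py, anomaly_injector.py, episode_builder.py, generate.py |
| 2026-04-29 | Step 6 (v1): first version with 900 episodes; found ?64 issues |
| 2026-04-29 | Bug fixes: all 6 issue classes fixed - layout/profile constraints for EL/CH scenarios, unreplaced query template placeholders, temperature trend value expansion, DF-06 device matching, BA-01 over-allocation (MAX_PER_SCENARIO=50) |
| 2026-04-29 | Step 6 (v2): a full scan over the 6 issue classes found 0 issues, but extending to 15 check classes surfaced 61 new issues (TP_NOT_ANOMALY=35, TP_NO_ANOMALY_EVENTS=26) |
| 2026-04-29 | Bug fixes (v3): BA-02/CH-05 are no longer assigned to TP (is_anomaly!=True); device matching plus absence-scenario events for FG-03/EL-04/FG-02; fallback mechanism to handle scenario exhaustion; UnboundLocalError fixed |
| 2026-04-29 | Step 6 (v3): 15 structural checks passed, but a deeper review found an FP content problem: 139 FP episodes had no injected events |
| 2026-04-29 | Bug fixes (v4): filled in the events lists of 7 FP variants (DF-01, DF-03, EL-02, EL-05, CH-01, BA-03, BA-05); fixed a device-matching overwrite bug |
| 2026-04-29 | **Step 6 (v4-final): 1200 episodes; all 15 structural checks plus the FP content check pass with 0 issues. FP episodes with empty events dropped from 139 to 0** |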
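| 2026-04-29 | Step 7: evaluation framework done - prompt_builder.py, scorer.py, metrics.py, runner.py, judge.py; all imports verified |
| 2026-04-30 | Step 7 update: runner.py gained Anthropic (Claude) API backend support; added eval_local.py and eval_api.py (pure requests) |
| 2026-04-30 | Step 8: full evaluation done - Qwen2.5-7B (1200 ep, 5 s/ep, 0 API errors) + Claude Opus 4.6 (1200 ep, 25 s/ep, 15 API errors); analysis report: results/ANALYSIS_REPORT.md |
| 2026-04-30 | Step 6 (v5): temperature sampling interval changed from ?0 minutes to 5 minutes; cooking_boost changed to a gradual rise and fall (sinusoidal daytime curve + linear warm-up while cooking) |
| 2026-04-30 | Step 6 (v6): motion sensors now report periodically (every 5 min when occupied / 30 min when unoccupied). Event volume avg=1762/episode. 15 checks passed. Prompts are roughly ?2K-70K tokens |
| 2026-05-01 | Step 9: EDRC reasoning framework done - prompt_builder.py supports 3 modes (baseline/edrc/cot); scorer.py parses the 6-step EDRC output; eval_api.py gained a --mode argument |
| 2026-05-02 | Step 10: fixes from domain-expert review - all 5 issues fixed: (1) removed the 500-event truncation (2) SQ3 filters logs by the query time window (3) SQ5 no longer leaks scenario names (4) 22 GT labels unified to standard names (5) INS-05 no longer depends on external information. Benchmark regenerated and validated |
| 2026-05-02 | Step 10 follow-up: fixed SQ-category contamination - fallback no longer pulls scenarios across categories. Before the fix: SQ3 had 112 contaminated episodes (47.7%), SQ4 had 127 (54.0%), SQ5 had 1. After the fix: zero violations |
| 2026-05-02 | Step 11: completed FP variants for all 35 scenarios - from 16/35 to 35/35 |
| 2026-05-02 | Step 12: difficulty quantification - a 5-dimension objective score (evidence_count/signal_directness/cross_device/temporal_span/fp_similarity) replaces subjective labels; all 35 scenarios annotated. Distribution: L1=3, L2=15, L3=17 |
| 2026-05-02 | Step 12: Cohen's kappa annotation tool - annotate_kappa.py supports independent dual-model annotation + kappa computation + stratified sampling |
| 2026-05-02 | Step 12: human annotation report - stratified sample of 30 episodes; inter-annotator κ=0.879, human vs GT κ≈0.44, human detection rate on L3 scenarios only ?0%. Report: results/HUMAN_ANNOTATION_REPORT.md |
| 2026-05-02 | Step 13: Diversity & Hardness report - full statistics over 11 dimensions (scale / SQ distribution / category / layout / profile / event volume / 5-dimension difficulty / injection time / anomaly statistics / temporal features / SimuHome comparison). Report: results/DIVERSITY_REPORT.md |
| 2026-05-04 | Step 14: EGP (Evidence-Guided Pipeline) prototype - created a standalone `EGP/` directory without modifying the existing evaluation code; implemented the API client, resume/episode collection, evidence compression, multi-stage prompts, and the EGP runner; supports `--max-episodes` / `--episode-id` / `--preview-only`, and supports local vLLM setups with an empty API key |
| 2026-05-04 | Step 14 update: EGP compatibility fixes for local Qwen/vLLM - API client handles `content=None` / `reasoning_content` responses; `run_egp.py` accepts both `--no-thinking` and `--no_thinking` as argument aliases and reuses eval_api's no-thinking sampling logic |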
| 2026-05-04 | Step 15: EGPv2 workflow prototype - created `EGPv2/` without exposing `SQ1-SQ5`; added LLM-in-the-loop roles (`Triage` / `Investigator` / `Supervisor` / `Verifier`); supports OpenAI-compatible APIs, local vLLM without API key, resume, `--no-thinking`, and `--max-tokens` |
| 2026-05-04 | Step 15 update: EGPv2 signal fixes - corrected SQ3/SQ4 query-window filtering in `signals.py`; switched room mapping to strict `home_state.room_id`; verified `run_egpv2.py --help`, `preview-only`, and `compileall` |
| 2026-05-04 | Step 15 update: added `EGPv2/diagnosis_subset_60.txt` and `EGPv2/diagnosis_subset_60.md` for module-level workflow debugging; the 60-episode subset emphasizes SQ1 drift, SQ3/SQ4 hard cases, SQ2/SQ5 controls, and a small TN sanity pack |
| 2026-05-04 | Step 15 update: implemented `EGPv2/fix_checklist.md`; made EGPv2 Matter-aware by formatting protocol values in `signals.py`, adding protocol guidance in `prompts.py`, turning the pipeline into a bounded 2-round workflow, and enforcing supervisor decisions in code; verified with `compileall`, `preview-only`, and a 1-episode local smoke test |
| 2026-05-04 | Step 15 update: rebalanced EGPv2 decision policy - prompts now separate direct-evidence requirements for device faults from temporal/cross-device evidence for behavior and safety anomalies; verifier no longer collapses insufficient evidence into `none` by default; supervisor code gate now hard-blocks only explicit `abstain`; verified with `compileall` and a 2-episode TP/FP local smoke pair (`SQ4_TP_B_0721`, `SQ4_FP_B_0885`) |
| 2026-05-04 | Step 15 update: locked EGPv2 request temperature to `0.0` in `api_client.py` for both thinking and non-thinking modes, removing the earlier no-thinking temperature bump; verified with `compileall` and direct client instantiation checks |
| 2026-05-04 | Step 15 update: tightened `EGPv2/prompts.py` for SQ3/SQ4/SQ5 broad-query failures - added scope-coverage requirements for composite/emergency analysis, blocked escalation from single transient `None` or missing logs, and added stricter trigger rules for `sensor_malfunction`, `unattended_cooking`, `fire_risk`, and intrusion-style labels; verified with `compileall` |
| 2026-05-04 | Step 15 update: created `EGPv2_1/` as a prompt-only branch from `EGPv2/` for minimal SQ-targeted tuning; preserved the latest SQ3/SQ5 conservative gains while restoring SQ1 device-health recall and SQ4 composite-safety recall by (1) strengthening device-health chunk selection around retries/follow-up/recovery, (2) restricting `secondary_task_profile=device-health` to explicit fault evidence, and (3) allowing local hazardous sequences with vulnerable context plus delayed/weak mitigation to remain anomalous; verified with `python -m compileall EGPv2_1` and `python EGPv2_1/run_egpv2.py --help` |
| 2026-05-05 | Step 15 update: refined `EGPv2_1` after diag60 analysis - added stricter anti-false-alarm rules for SQ1/SQ2-style device-health cases in `prompts.py` (single transient spike is not enough for `sensor_malfunction`; delayed auto-lock / long unlocked interval is not enough for `lock_malfunction`; device-health verdicts should inspect pre/post outcome chunks rather than a single local snippet); fixed `EGPv2_1/run_egpv2.py` summary/version label from `EGPv2` to `EGPv2.1`, and corrected the already-generated `results/qwen36_35B_egpv2_1diag60/summary.json` pipeline field |
| 2026-05-05 | Step 15 update: applied MISS-driven recall fixes to `EGPv2_1` - added query-intent guardrails in `pipeline.py` so behavior/security/composite queries are no longer silently collapsed into `device-health`, changed second-round refinement to prioritize supervisor-requested chunks plus adjacent context instead of reusing the original safe chunk set, and strengthened `prompts.py` so non-device-health queries do not get derailed by transient telemetry noise while verifier no longer defaults to `none` when a coherent anomaly sequence still directly answers the query; verified with `python -m compileall EGPv2_1`, `python EGPv2_1/run_egpv2.py --help`, and `--preview-only` smoke run |
| 2026-05-05 | Step 15 update: added FP-recovery tuning to `EGPv2_1/prompts.py` after recall overshoot - kept the new query-intent / second-round retrieval logic, but tightened anomaly acceptance for `unattended_cooking`, `fire_risk`, `intrusion`, and `behavioral_anomaly` so absence-based supervision assumptions, sparse lock/contact activity, missing OFF logs near truncation, and one ambiguous telemetry inconsistency are no longer enough to force anomaly; verified with `python -m compileall EGPv2_1` and `python EGPv2_1/run_egpv2.py --help` |
| 2026-05-05 | Step 15 update: built a hybrid `EGPv2_1` prompt variant for full-run testing - preserved the newer `pipeline.py` query-intent and second-round retrieval fixes, but rolled back the last round's global verifier/investigator over-tightening for `single-event-safety` and `behavior-sequence`, keeping the stricter absence-based safeguards only for `composite-safety` / `emergency-response` so SQ4-style FP control is retained without globally suppressing SQ1/SQ2/SQ3 recall; verified with `python -m compileall EGPv2_1` and `python EGPv2_1/run_egpv2.py --help` |
| 2026-05-05 | Step 16: created `EGPv3/` as a new Debate-then-Judge architecture - replaced the serial `Triage -> Investigator -> Supervisor -> Verifier` veto chain with a neutral `Extractor` plus heterogeneous debate (`Prosecutor` for recall, `Defender` for precision) followed by a `Judge` that sees both arguments and the same raw focused evidence; implemented `EGPv3/prompts.py`, `EGPv3/pipeline.py`, `EGPv3/run_egpv3.py`, updated `README.md`, and reused the existing OpenAI-compatible client / benchmark / signal stack; verified with `python -m compileall EGPv3`, `python EGPv3/run_egpv3.py --help`, and `python EGPv3/run_egpv3.py --preview-only --max-episodes 1` |
| 2026-05-05 | Step 16 update: created `EGPv3_1/` as a stronger debate variant focused on SQ3/TN false-alarm recovery - upgraded `Defender` from a parallel benign narrator into a rebuttal role that explicitly sees and attacks the `Prosecutor`'s claims point-by-point, and upgraded the `Judge` prompt to apply a burden-of-proof test (direct evidence vs absence-based inference vs coherent normal routine) before awarding anomaly; implemented `EGPv3_1/prompts.py`, wired defender access to prosecutor output in `EGPv3_1/pipeline.py`, added `EGPv3_1/run_egpv3_1.py`, and verified with `python -m compileall EGPv3_1`, `python EGPv3_1/run_egpv3_1.py --help`, and `python EGPv3_1/run_egpv3_1.py --preview-only --max-episodes 1` |
| 2026-05-05 | Step 16 update: created `EGPv3_2/` to repair the over-conservative `EGPv3_1` burden rules - kept the stronger rebuttal-style `Defender`, but recalibrated `Prosecutor` and `Judge` so anomaly can win on multi-signal convergence even without explicit fault codes, while a benign story only wins if it is positively supported by the logs rather than merely plausible; added query-type alignment pressure to `Prosecutor`, support-quality fields to both debate sides, and a support-vs-speculation burden test in `Judge`; verified with `python -m compileall EGPv3_2`, `python EGPv3_2/run_egpv3_2.py --help`, and `python EGPv3_2/run_egpv3_2.py --preview-only --max-episodes 1` |
| 2026-05-05 | Step 17: implemented `EGPv4/` following the ADISR design - added a deterministic rule-based `Evidence Extractor` (`extractor.py`) that compresses raw device logs into key events / temperature trends / occupancy summaries / alert lines, then feeds a 3-stage adversarial pipeline (`Prosecutor` -> `Defender` -> `Judge`) with asymmetric burden-of-proof prompts in `prompts.py`; created `pipeline.py`, `run_egpv4.py`, `README.md`, and reused the existing OpenAI-compatible client / benchmark / signal stack; verified with `python -m compileall EGPv4`, `python EGPv4/run_egpv4.py --help`, and `python EGPv4/run_egpv4.py --preview-only --max-episodes 1` |
| 2026-05-06 | Step 18: created isolated DPO training-data generator `DPODataGen/` - added a 40-scenario training bank disjoint from the 35 benchmark scenarios, split-level scenario isolation (`train_pref_v1` vs `dev_pref_v1`), synthetic Matter-style episode generation reusing `HomeState` + normal behavior streams, and `pairs.jsonl` assembly with `rule` chosen answers plus `constructed` rejected answers; verified with `python -m compileall DPODataGen`, `python -m DPODataGen.run_generate_dpo --split both --max-episodes 6 --overwrite`, and direct-script execution `python DPODataGen/run_generate_dpo.py --split train --max-episodes 1 --overwrite` |
| 2026-05-06 | Step 18 update: replaced DPO pair prompt construction with a DPO-specific compact prompt builder in `DPODataGen/prompt_builder.py` - prompts now keep focused rooms/devices, local event windows, sparse device history, and multi-day activity summaries instead of full raw logs; added `audit_dataset.py` for prompt-length checks and `export_model_tasks.py` for later strong/weak model answer collection; smoke-verified on `data_dpo_smoke2/` with prompt length reduced from ~200k chars mean to ~8k chars mean (`python -m DPODataGen.run_generate_dpo --split both --max-episodes 20 --output-root data_dpo_smoke2 --overwrite`, `python DPODataGen/audit_dataset.py --root .\\data_dpo_smoke2`) |
| 2026-05-06 | Step 18 update: added model-answer collection and pair assembly utilities for DPO curation - `collect_api_answers.py` calls OpenAI-compatible APIs with resume/no-thinking support, `assemble_preference_pairs.py` scores strong/weak model outputs against each episode and upgrades base pairs to `strong_model` chosen / `weak_model_actual_error` rejected when usable; verified with `python -m compileall DPODataGen`, `python DPODataGen/export_model_tasks.py --split train --input-root data_dpo_v2 --output tmp_dpo/train_tasks_5.jsonl --max-pairs 5`, and fallback assembly smoke `python DPODataGen/assemble_preference_pairs.py --split train --input-root data_dpo_v2 --output tmp_dpo/train_pairs_fallback.jsonl --report tmp_dpo/train_pairs_fallback_report.json` |
| 2026-05-07 | Step 19: packaged a self-contained server-side DPO bundle in `DPO_QWEN35_SERVER_BUNDLE/` - copied raw final preference pairs and reports, converted them into TRL conversational-format datasets (`train_dpo_clean.jsonl`, `dev_dpo_clean.jsonl`), dropped weak-model `parse_fail` rejected pairs for the clean split (train: 2500→?435, dev: 300→?92), and added standalone server scripts (`scripts/train_dpo.py`, `scripts/analyze_token_lengths.py`), `requirements.txt`, and `run_train.sh`; verified with `python -m compileall DPO_QWEN35_SERVER_BUNDLE` and manifest/schema checks |
| 2026-05-07 | Step 19 update: hardened the DPO server bundle for mixed TRL versions and 2xA100 40G memory limits - `scripts/train_dpo.py` now conditionally enables `precompute_ref_log_probs` when supported, and `run_train.sh` now defaults to `max_length=6144` / `max_prompt_length=5632` with reference-logprob precomputation enabled to reduce DPO reference-forward OOM risk |
| 2026-05-07 | Step 19 update: added an aggressive low-memory DPO path in `DPO_QWEN35_SERVER_BUNDLE/` - `scripts/train_dpo.py` now conditionally forwards `precompute_ref_log_probs` / `reference_free` / truncation-related args to whichever TRL API surface is available, and new `run_train_lowmem.sh` cuts the working sequence to `4096/3584/512`, uses `keep_end` truncation, `paged_adamw_8bit`, LoRA rank 16, and `flash_attention_2` for a substantially lower VRAM footprint |
| 2026-05-07 | Step 19 update: added persistent reference-logprob caching for DPO training - `DPO_QWEN35_SERVER_BUNDLE/scripts/train_dpo.py` can now `load_from_disk` cached datasets with `ref_chosen_logps/ref_rejected_logps`, or trigger TRL precompute once and `save_to_disk` for reuse; also added `run_precompute_ref_logps_lowmem.sh` so the costly precompute stage can be finished and cached before actual training resumes |
| 2026-05-07 | Step 19 update: switched the low-memory DPO entrypoint to `reference_free` mode - `run_train_lowmem.sh` now trains without an explicit reference model, and `scripts/train_dpo.py` skips loading or saving `ref_logps` caches when `--reference-free` is enabled so stale reference caches are not mistakenly reused |
| 2026-05-07 | Step 19 update: added an ultra-low-memory `reference_free` DPO path - `scripts/train_dpo.py` now conditionally forwards `padding_free` and `use_logits_to_keep` when supported by the installed TRL version, and new `run_train_ultralowmem_ref_free.sh` drops the sequence budget to `2048/1536/256`, reduces LoRA rank to 8, relaxes eval frequency, and enables both memory-saving flags for a last-pass 9B DPO attempt |
| 2026-05-07 | Step 19 update: rebalanced the ultra-low-memory `reference_free` launch profile toward longer prompts under an `sdpa`-only server constraint - `run_train_ultralowmem_ref_free.sh` now allocates `3584/3072/384` tokens to preserve more episode context, removes the experimental `padding_free` flag, and switches attention implementation from `flash_attention_2` to `sdpa` |
| 2026-05-07 | Step 19 update: rolled the ultra-low-memory `reference_free` profile back to the last known runnable budget after the `3584/3072/384` attempt OOMed under `sdpa` - `run_train_ultralowmem_ref_free.sh` is restored to `2048/1536/256` with `use_logits_to_keep`, LoRA rank 8, and `sdpa` so the server can continue from the previously stable training envelope |
| 2026-05-08 | Step 20: created `DPODataGenFullLog/` to rebuild preference data with benchmark-style full raw logs instead of compressed DPO prompts - added `build_full_log_dataset.py`, `prompt_builder.py`, `export_model_tasks.py`, `collect_api_answers.py`, `assemble_preference_pairs.py`, `audit_dataset.py`, and `README.md`; rebuilt `data_dpo_full_log_v1/` from existing `data_dpo_v2` episodes/pairs (train=2500, dev=300) and audited prompt lengths showing full-log prompts are extremely long (train mean ~197.5k chars, p95 ~454k, max ~609k), while answer lengths remain short |
| 2026-05-10 | Step 20 update: packaged `DPO_FULLLOG_QWEN2B_SERVER_BUNDLE` for direct server-side 2B DPO experiments; copied full-log TRL datasets (clean and full), raw final pairs, and reports; added GPU-selectable `run_train.sh` mapping 0=GPU0, 1=GPU1, 2=dual GPU; default profile uses `reference_free` + QLoRA + bf16 + `sdpa` with longer sequence budgets than the previous 9B ultra-lowmem path |
| 2026-05-10 | Step 21: packaged `SFT_FULLLOG_QWEN2B_SERVER_BUNDLE` for direct server-side Qwen3.5-2B full-log SFT; converted chosen responses from full-log preference pairs into `train_sft`/`dev_sft` conversational datasets; added `train_sft.py`, `analyze_sft_lengths.py`, `compare_eval_results.py`, and GPU-selectable `run_train.sh` with conservative default sequence lengths for memory stability |
| 2026-05-11 | Step 21 update: packaged `SFT_FULLLOG_QWEN2B_V2_SERVER_BUNDLE` for a targeted full-log SFT v2 path; rebuilt chosen-only supervision from final full-log preference pairs, normalized weak rule answers into canonical JSON, preserved 879 strong-model chosen answers, and created a 5010-example focus split that upweights SQ3/SQ4 and high-difficulty TP/FP cases for direct benchmark-oriented repair |
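
The Step 12 entries above report agreement as Cohen's kappa (κ). For reference, the sketch below computes the standard two-rater statistic, κ = (p_o − p_e) / (1 − p_e), where p_o is the observed agreement and p_e the agreement expected by chance from each rater's label frequencies. It is an illustrative standalone version; the function name `cohens_kappa` is an assumption and this is not the code in `annotate_kappa.py`.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters who labelled the same items in the same order."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(labels_a)

    # Observed agreement: fraction of items given the same label by both raters.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: product of the two raters' marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    categories = set(count_a) | set(count_b)
    p_e = sum(count_a[c] * count_b[c] for c in categories) / (n * n)

    if p_e == 1.0:
        # Both raters used a single identical label everywhere; kappa is
        # undefined (0/0), and this sketch treats it as perfect agreement.
        return 1.0
    return (p_o - p_e) / (1.0 - p_e)


if __name__ == "__main__":
    # Toy labels (1 = anomaly, 0 = normal); not data from the annotation report.
    rater_1 = [1, 1, 0, 0]
    rater_2 = [1, 0, 0, 0]
    print(cohens_kappa(rater_1, rater_2))
```

With the toy labels, p_o = 0.75 and p_e = 0.5, so the function returns 0.5.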