llmiotsafe/PROGRESS.md
2026-05-12 17:01:39 +08:00

# SafeHome Benchmark Build Progress Tracker
**Project goal**: Build a smart-home safety reasoning benchmark based on the Matter 1.5.1 protocol
**Target venue**: ICLR 2027 (CCF-A)
**Start date**: 2026-04-29
---
## Overall Progress
| Step | Content | Status | Completed | Output files |
|------|------|------|---------|---------|
| 1 | Define Matter device models (JSON Schema) | **Done** | 2026-04-29 | `data/device_schemas/` (13 files) |
| 2 | Design home layout templates | **Done** | 2026-04-29 | `data/home_layouts/` (4 files) |
| 3 | Write the behavior pattern generator | **Done** | 2026-04-29 | `data/behavior_patterns/` (3 files) |
| 4 | Design the anomaly scenario library | **Done** | 2026-04-29 | `data/anomaly_templates/` (8 files) |
| 5 | Implement the episode generation pipeline | **Done** | 2026-04-29 | `src/episode_gen/` (5 modules) |
| 6 | Generation + manual validation | **Generation done** | 2026-04-29 | `data/benchmark/` (900 JSON files, 77MB) |
---
## Step 1: Matter Device Model Definitions
### Devices to define
**Safety-critical devices (6 types)**:
| # | Device | Matter section | Key clusters | Schema file | Status |
|---|------|-----------|-------------|------------|------|
| 1 | Contact Sensor (door/window sensor) | 7.1 | BooleanState | contact_sensor.json | ✅ Done |
| 2 | Occupancy Sensor (motion sensor) | 7.3 | OccupancySensing | occupancy_sensor.json | ✅ Done |
| 3 | Smoke CO Alarm (smoke alarm) | 7.9 | SmokeCoAlarm | smoke_co_alarm.json | ✅ Done |
| 4 | Door Lock (smart door lock) | 8.1 | DoorLock | door_lock.json | ✅ Done |
| 5 | Water Leak Detector (water leak sensor) | 7.12 | BooleanState | water_leak_detector.json | ✅ Done |
| 6 | Window Covering (curtain/window) | 8.3 | WindowCovering | window_covering.json | ✅ Done |
**Behavior-context devices (7 types)**:
| # | Device | Matter section | Key clusters | Schema file | Status |
|---|------|-----------|-------------|------------|------|
| 7 | On/Off Light (switched light) | 4.1 | OnOff | onoff_light.json | ✅ Done |
| 8 | Dimmable Light | 4.2 | OnOff, LevelControl | dimmable_light.json | ✅ Done |
| 9 | Temperature Sensor | 7.4 | TemperatureMeasurement | temperature_sensor.json | ✅ Done |
| 10 | Room Air Conditioner | 13.3 | OnOff, Thermostat, FanControl | air_conditioner.json | ✅ Done |
| 11 | Laundry Washer (washing machine) | 13.1 | OnOff, OperationalState | laundry_washer.json | ✅ Done |
| 12 | Cook Surface (stove) | 13.7 | OnOff, TemperatureControl | cook_surface.json | ✅ Done |
| 13 | Dishwasher | 13.5 | OnOff, OperationalState | dishwasher.json | ✅ Done |
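
The two tables above pair each schema file with its key clusters. A minimal sketch of how that pairing could be checked automatically is given here. It assumes, hypothetically, that each schema JSON exposes a top-level `clusters` list whose entries carry a `name` field; the real schema layout is not shown in this file. `EXPECTED_CLUSTERS`, `declared_clusters`, `missing_clusters`, and `check_directory` are illustrative names, not part of the project code.

```python
import json
from pathlib import Path

# Key clusters per schema file, copied from the two device tables.
EXPECTED_CLUSTERS = {
    "contact_sensor.json": ["BooleanState"],
    "occupancy_sensor.json": ["OccupancySensing"],
    "smoke_co_alarm.json": ["SmokeCoAlarm"],
    "door_lock.json": ["DoorLock"],
    "water_leak_detector.json": ["BooleanState"],
    "window_covering.json": ["WindowCovering"],
    "onoff_light.json": ["OnOff"],
    "dimmable_light.json": ["OnOff", "LevelControl"],
    "temperature_sensor.json": ["TemperatureMeasurement"],
    "air_conditioner.json": ["OnOff", "Thermostat", "FanControl"],
    "laundry_washer.json": ["OnOff", "OperationalState"],
    "cook_surface.json": ["OnOff", "TemperatureControl"],
    "dishwasher.json": ["OnOff", "OperationalState"],
}


def declared_clusters(schema: dict) -> set:
    """Cluster names declared by a schema.

    ASSUMPTION: layout is {"clusters": [{"name": "OnOff", ...}, ...]}.
    """
    return {entry["name"] for entry in schema.get("clusters", [])}


def missing_clusters(filename: str, schema: dict) -> list:
    """Expected key clusters that the schema does not declare."""
    declared = declared_clusters(schema)
    return [name for name in EXPECTED_CLUSTERS[filename] if name not in declared]


def check_directory(schema_dir: Path) -> dict:
    """Map each problematic schema file to its missing clusters."""
    report = {}
    for filename in EXPECTED_CLUSTERS:
        path = schema_dir / filename
        if not path.exists():
            report[filename] = ["<file missing>"]
            continue
        schema = json.loads(path.read_text(encoding="utf-8"))
        gaps = missing_clusters(filename, schema)
        if gaps:
            report[filename] = gaps
    return report
```

Calling `check_directory(Path("data/device_schemas"))` would return an empty dict when every file declares its key clusters under the assumed layout.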
### Data sources
- Matter 1.5.1 Device Library: `23-27351-009_Matter-1.5.1-Device-Library-Specification.pdf`
- Matter 1.5.1 Application Cluster: `23-27350-009_Matter-1.5.1-Application-Cluster-Specification.pdf`
- Every device schema is extracted strictly from the PDFs; nothing is invented.
---
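## Changelog
| Date | Change |
|------|---------|
| 2026-04-29 | Created the project tracking file; started Step 1 |
| 2026-04-29 | Step 1 done: all 13 Matter device model schemas extracted from the PDF specifications and written to JSON |
| 2026-04-29 | Step 2 done: 4 home layout templates (A: one bedroom + one living room, ?4 devices; B: two bedrooms + one living room, ?4 devices; C: three bedrooms + two living rooms, 48 devices; D: elderly person living alone, 26 devices), verified device by device |
| 2026-04-29 | Step 1 verification: all 13 device schemas checked item by item against the Matter 1.5.1 PDFs; corrected errors in ? files (conformance errors, missing clusters, etc.) |
| 2026-04-29 | Step 3 done: behavior pattern data definitions - action_event_mapping (18 action → device-event mappings), daily_routines (3 resident profiles × weekday/weekend schedules), environment_baselines (temperature baselines + sensor fault modes) |
| 2026-04-29 | Step 4 done: anomaly scenario library - 7 major categories, 35 scenarios, 16 false-positive variants; counts verified against the taxonomy |
| 2026-04-29 | Step 5 done: episode generation pipeline - home_state.py, behavior_engine.py, anomaly_injector.py, episode_builder.py, generate.py |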
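| 2026-04-29 | Step 6 (v1): first version with 900 episodes; found ?64 issues |
| 2026-04-29 | Bug fix: all 6 classes of issues fixed - layout/profile constraints for EL/CH scenarios, unreplaced query templates, temperature trend value expansion, DF-06 device matching, BA-01 over-allocation (MAX_PER_SCENARIO=50) |
| 2026-04-29 | Step 6 (v2): full scan of the 6 issue classes: ?0 issues, but extending to ?5 check categories uncovered 61 new issues (TP_NOT_ANOMALY=35, TP_NO_ANOMALY_EVENTS=26) |
| 2026-04-29 | Bug fix (v3): BA-02/CH-05 are no longer assigned to TP (is_anomaly!=True); FG-03/EL-04/FG-02 device matching + absence-scenario events; fallback mechanism resolves scenario exhaustion; UnboundLocalError fixed |
| 2026-04-29 | Step 6 (v3): 15 structural checks pass, but a deeper review found an FP content problem: 139 FP episodes have no injected events |
| 2026-04-29 | Bug fix (v4): completed the events lists of 7 FP variants (DF-01,DF-03,EL-02,EL-05,CH-01,BA-03,BA-05); fixed the device-matching overwrite bug |
| 2026-04-29 | **Step 6 (v4-final): 1200 episodes; ?5 structural checks + FP content checks all pass = 0 issues. FP episodes with empty events drop from 139 to 0** |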
| 2026-04-29 | Step 7: evaluation framework done - prompt_builder.py, scorer.py, metrics.py, runner.py, judge.py; all imports verified |
| 2026-04-30 | Step 7 update: runner.py gains an Anthropic (Claude) API backend; added eval_local.py and eval_api.py (pure requests) |
| 2026-04-30 | Step 8: full evaluation done - Qwen2.5-7B (1200ep, 5s/ep, 0 API err) + Claude Opus 4.6 (1200ep, 25s/ep, 15 API err); analysis report: results/ANALYSIS_REPORT.md |
| 2026-04-30 | Step 6 (v5): temperature sampling interval changed from ?0 minutes to 5 minutes; cooking_boost now ramps up and down gradually (sinusoidal daytime curve + linear cooking warm-up) |
| 2026-04-30 | Step 6 (v6): motion sensors now report periodically (occupied: every 5 min; unoccupied: every 30 min). Event volume avg=1762/episode. ?5 checks pass. Prompts are roughly ?2K-70K tokens |
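| 2026-05-01 | Step 9: EDRC reasoning framework done - prompt_builder.py supports 3 modes (baseline/edrc/cot); scorer.py parses the 6-step EDRC output; eval_api.py gains a --mode argument |
| 2026-05-02 | Step 10: domain-expert review fixes - all 5 issues fixed: (1) removed the 500-event truncation (2) SQ3 filters logs by the query time window (3) SQ5 no longer leaks scenario names (4) 22 GT labels unified to standard names (5) INS-05 no longer depends on external information. Benchmark regenerated and validated |
| 2026-05-02 | Step 10 follow-up: fixed SQ-category contamination - fallback no longer pulls scenarios across categories. Before the fix: SQ3 had 112 contaminated episodes (47.7%), SQ4 had 127 (54.0%), SQ5 had 1. After the fix: zero violations |
| 2026-05-02 | Step 11: completed FP variants for all 35 scenarios - from 16/35 to 35/35 |
| 2026-05-02 | Step 12: difficulty quantification - 5-dimension objective scoring (evidence_count/signal_directness/cross_device/temporal_span/fp_similarity) replaces subjective labels; all ?5 scenarios annotated. Distribution: L1=3, L2=15, L3=17 |
| 2026-05-02 | Step 12: Cohen's kappa annotation tool - annotate_kappa.py supports independent dual-model annotation, kappa computation, and stratified sampling |
| 2026-05-02 | Step 12: human annotation report - stratified sample of 30 episodes; two-annotator κ of 0.879; human vs GT κ≈?.44; human detection rate on L3 scenarios is only ?0%. Report: results/HUMAN_ANNOTATION_REPORT.md |
| 2026-05-02 | Step 13: Diversity & Hardness report - full statistics across 11 dimensions (scale / SQ distribution / categories / layouts / profiles / event volume / 5-dimension difficulty / injection time / anomaly statistics / temporal features / SimuHome comparison). Report: results/DIVERSITY_REPORT.md |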
| 2026-05-04 | Step 14: EGP (Evidence-Guided Pipeline) prototype - created a standalone `EGP/` directory without modifying the existing evaluation code; implemented the API client, resume/episode collection, evidence compression, multi-stage prompts, and the EGP runner; supports `--max-episodes` / `--episode-id` / `--preview-only`, and supports local vLLM setups with an empty API key |
| 2026-05-04 | Step 14 update: EGP local Qwen/vLLM compatibility fixes - API client handles `content=None` / `reasoning_content` responses; `run_egp.py` accepts `--no-thinking` and `--no_thinking` as argument aliases and reuses eval_api's no-thinking sampling logic |
| 2026-05-04 | Step 15: EGPv2 workflow prototype - created `EGPv2/` without exposing `SQ1-SQ5`; added LLM-in-the-loop roles (`Triage` / `Investigator` / `Supervisor` / `Verifier`); supports OpenAI-compatible APIs, local vLLM without API key, resume, `--no-thinking`, and `--max-tokens` |
| 2026-05-04 | Step 15 update: EGPv2 signal fixes - corrected SQ3/SQ4 query-window filtering in `signals.py`; switched room mapping to strict `home_state.room_id`; verified `run_egpv2.py --help`, `preview-only`, and `compileall` |
| 2026-05-04 | Step 15 update: added `EGPv2/diagnosis_subset_60.txt` and `EGPv2/diagnosis_subset_60.md` for module-level workflow debugging; the 60-episode subset emphasizes SQ1 drift, SQ3/SQ4 hard cases, SQ2/SQ5 controls, and a small TN sanity pack |
| 2026-05-04 | Step 15 update: implemented `EGPv2/fix_checklist.md`; made EGPv2 Matter-aware by formatting protocol values in `signals.py`, adding protocol guidance in `prompts.py`, turning the pipeline into a bounded 2-round workflow, and enforcing supervisor decisions in code; verified with `compileall`, `preview-only`, and a 1-episode local smoke test |
| 2026-05-04 | Step 15 update: rebalanced EGPv2 decision policy - prompts now separate direct-evidence requirements for device faults from temporal/cross-device evidence for behavior and safety anomalies; verifier no longer collapses insufficient evidence into `none` by default; supervisor code gate now hard-blocks only explicit `abstain`; verified with `compileall` and a 2-episode TP/FP local smoke pair (`SQ4_TP_B_0721`, `SQ4_FP_B_0885`) |
| 2026-05-04 | Step 15 update: locked EGPv2 request temperature to `0.0` in `api_client.py` for both thinking and non-thinking modes, removing the earlier no-thinking temperature bump; verified with `compileall` and direct client instantiation checks |
| 2026-05-04 | Step 15 update: tightened `EGPv2/prompts.py` for SQ3/SQ4/SQ5 broad-query failures - added scope-coverage requirements for composite/emergency analysis, blocked escalation from single transient `None` or missing logs, and added stricter trigger rules for `sensor_malfunction`, `unattended_cooking`, `fire_risk`, and intrusion-style labels; verified with `compileall` |
| 2026-05-04 | Step 15 update: created `EGPv2_1/` as a prompt-only branch from `EGPv2/` for minimal SQ-targeted tuning; preserved the latest SQ3/SQ5 conservative gains while restoring SQ1 device-health recall and SQ4 composite-safety recall by (1) strengthening device-health chunk selection around retries/follow-up/recovery, (2) restricting `secondary_task_profile=device-health` to explicit fault evidence, and (3) allowing local hazardous sequences with vulnerable context plus delayed/weak mitigation to remain anomalous; verified with `python -m compileall EGPv2_1` and `python EGPv2_1/run_egpv2.py --help` |
| 2026-05-05 | Step 15 update: refined `EGPv2_1` after diag60 analysis - added stricter anti-false-alarm rules for SQ1/SQ2-style device-health cases in `prompts.py` (single transient spike is not enough for `sensor_malfunction`; delayed auto-lock / long unlocked interval is not enough for `lock_malfunction`; device-health verdicts should inspect pre/post outcome chunks rather than a single local snippet); fixed `EGPv2_1/run_egpv2.py` summary/version label from `EGPv2` to `EGPv2.1`, and corrected the already-generated `results/qwen36_35B_egpv2_1diag60/summary.json` pipeline field |
| 2026-05-05 | Step 15 update: applied MISS-driven recall fixes to `EGPv2_1` - added query-intent guardrails in `pipeline.py` so behavior/security/composite queries are no longer silently collapsed into `device-health`, changed second-round refinement to prioritize supervisor-requested chunks plus adjacent context instead of reusing the original safe chunk set, and strengthened `prompts.py` so non-device-health queries do not get derailed by transient telemetry noise while verifier no longer defaults to `none` when a coherent anomaly sequence still directly answers the query; verified with `python -m compileall EGPv2_1`, `python EGPv2_1/run_egpv2.py --help`, and `--preview-only` smoke run |
| 2026-05-05 | Step 15 update: added FP-recovery tuning to `EGPv2_1/prompts.py` after recall overshoot - kept the new query-intent / second-round retrieval logic, but tightened anomaly acceptance for `unattended_cooking`, `fire_risk`, `intrusion`, and `behavioral_anomaly` so absence-based supervision assumptions, sparse lock/contact activity, missing OFF logs near truncation, and one ambiguous telemetry inconsistency are no longer enough to force anomaly; verified with `python -m compileall EGPv2_1` and `python EGPv2_1/run_egpv2.py --help` |
| 2026-05-05 | Step 15 update: built a hybrid `EGPv2_1` prompt variant for full-run testing - preserved the newer `pipeline.py` query-intent and second-round retrieval fixes, but rolled back the last round's global verifier/investigator over-tightening for `single-event-safety` and `behavior-sequence`, keeping the stricter absence-based safeguards only for `composite-safety` / `emergency-response` so SQ4-style FP control is retained without globally suppressing SQ1/SQ2/SQ3 recall; verified with `python -m compileall EGPv2_1` and `python EGPv2_1/run_egpv2.py --help` |
| 2026-05-05 | Step 16: created `EGPv3/` as a new Debate-then-Judge architecture - replaced the serial `Triage -> Investigator -> Supervisor -> Verifier` veto chain with a neutral `Extractor` plus heterogeneous debate (`Prosecutor` for recall, `Defender` for precision) followed by a `Judge` that sees both arguments and the same raw focused evidence; implemented `EGPv3/prompts.py`, `EGPv3/pipeline.py`, `EGPv3/run_egpv3.py`, updated `README.md`, and reused the existing OpenAI-compatible client / benchmark / signal stack; verified with `python -m compileall EGPv3`, `python EGPv3/run_egpv3.py --help`, and `python EGPv3/run_egpv3.py --preview-only --max-episodes 1` |
| 2026-05-05 | Step 16 update: created `EGPv3_1/` as a stronger debate variant focused on SQ3/TN false-alarm recovery - upgraded `Defender` from a parallel benign narrator into a rebuttal role that explicitly sees and attacks the `Prosecutor`'s claims point-by-point, and upgraded the `Judge` prompt to apply a burden-of-proof test (direct evidence vs absence-based inference vs coherent normal routine) before awarding anomaly; implemented `EGPv3_1/prompts.py`, wired defender access to prosecutor output in `EGPv3_1/pipeline.py`, added `EGPv3_1/run_egpv3_1.py`, and verified with `python -m compileall EGPv3_1`, `python EGPv3_1/run_egpv3_1.py --help`, and `python EGPv3_1/run_egpv3_1.py --preview-only --max-episodes 1` |
| 2026-05-05 | Step 16 update: created `EGPv3_2/` to repair the over-conservative `EGPv3_1` burden rules - kept the stronger rebuttal-style `Defender`, but recalibrated `Prosecutor` and `Judge` so anomaly can win on multi-signal convergence even without explicit fault codes, while a benign story only wins if it is positively supported by the logs rather than merely plausible; added query-type alignment pressure to `Prosecutor`, support-quality fields to both debate sides, and a support-vs-speculation burden test in `Judge`; verified with `python -m compileall EGPv3_2`, `python EGPv3_2/run_egpv3_2.py --help`, and `python EGPv3_2/run_egpv3_2.py --preview-only --max-episodes 1` |
| 2026-05-05 | Step 17: implemented `EGPv4/` following the ADISR design - added a deterministic rule-based `Evidence Extractor` (`extractor.py`) that compresses raw device logs into key events / temperature trends / occupancy summaries / alert lines, then feeds a 3-stage adversarial pipeline (`Prosecutor` -> `Defender` -> `Judge`) with asymmetric burden-of-proof prompts in `prompts.py`; created `pipeline.py`, `run_egpv4.py`, `README.md`, and reused the existing OpenAI-compatible client / benchmark / signal stack; verified with `python -m compileall EGPv4`, `python EGPv4/run_egpv4.py --help`, and `python EGPv4/run_egpv4.py --preview-only --max-episodes 1` |
| 2026-05-06 | Step 18: created isolated DPO training-data generator `DPODataGen/` - added a 40-scenario training bank disjoint from the 35 benchmark scenarios, split-level scenario isolation (`train_pref_v1` vs `dev_pref_v1`), synthetic Matter-style episode generation reusing `HomeState` + normal behavior streams, and `pairs.jsonl` assembly with `rule` chosen answers plus `constructed` rejected answers; verified with `python -m compileall DPODataGen`, `python -m DPODataGen.run_generate_dpo --split both --max-episodes 6 --overwrite`, and direct-script execution `python DPODataGen/run_generate_dpo.py --split train --max-episodes 1 --overwrite` |
| 2026-05-06 | Step 18 update: replaced DPO pair prompt construction with a DPO-specific compact prompt builder in `DPODataGen/prompt_builder.py` - prompts now keep focused rooms/devices, local event windows, sparse device history, and multi-day activity summaries instead of full raw logs; added `audit_dataset.py` for prompt-length checks and `export_model_tasks.py` for later strong/weak model answer collection; smoke-verified on `data_dpo_smoke2/` with prompt length reduced from ~200k chars mean to ~8k chars mean (`python -m DPODataGen.run_generate_dpo --split both --max-episodes 20 --output-root data_dpo_smoke2 --overwrite`, `python DPODataGen/audit_dataset.py --root .\\data_dpo_smoke2`) |
| 2026-05-06 | Step 18 update: added model-answer collection and pair assembly utilities for DPO curation - `collect_api_answers.py` calls OpenAI-compatible APIs with resume/no-thinking support, `assemble_preference_pairs.py` scores strong/weak model outputs against each episode and upgrades base pairs to `strong_model` chosen / `weak_model_actual_error` rejected when usable; verified with `python -m compileall DPODataGen`, `python DPODataGen/export_model_tasks.py --split train --input-root data_dpo_v2 --output tmp_dpo/train_tasks_5.jsonl --max-pairs 5`, and fallback assembly smoke `python DPODataGen/assemble_preference_pairs.py --split train --input-root data_dpo_v2 --output tmp_dpo/train_pairs_fallback.jsonl --report tmp_dpo/train_pairs_fallback_report.json` |
| 2026-05-07 | Step 19: packaged a self-contained server-side DPO bundle in `DPO_QWEN35_SERVER_BUNDLE/` - copied raw final preference pairs and reports, converted them into TRL conversational-format datasets (`train_dpo_clean.jsonl`, `dev_dpo_clean.jsonl`), dropped weak-model `parse_fail` rejected pairs for the clean split (train: 2500→?435, dev: 300→?92), and added standalone server scripts (`scripts/train_dpo.py`, `scripts/analyze_token_lengths.py`), `requirements.txt`, and `run_train.sh`; verified with `python -m compileall DPO_QWEN35_SERVER_BUNDLE` and manifest/schema checks |
| 2026-05-07 | Step 19 update: hardened the DPO server bundle for mixed TRL versions and 2xA100 40G memory limits - `scripts/train_dpo.py` now conditionally enables `precompute_ref_log_probs` when supported, and `run_train.sh` now defaults to `max_length=6144` / `max_prompt_length=5632` with reference-logprob precomputation enabled to reduce DPO reference-forward OOM risk |
| 2026-05-07 | Step 19 update: added an aggressive low-memory DPO path in `DPO_QWEN35_SERVER_BUNDLE/` - `scripts/train_dpo.py` now conditionally forwards `precompute_ref_log_probs` / `reference_free` / truncation-related args to whichever TRL API surface is available, and new `run_train_lowmem.sh` cuts the working sequence to `4096/3584/512`, uses `keep_end` truncation, `paged_adamw_8bit`, LoRA rank 16, and `flash_attention_2` for a substantially lower VRAM footprint |
| 2026-05-07 | Step 19 update: added persistent reference-logprob caching for DPO training - `DPO_QWEN35_SERVER_BUNDLE/scripts/train_dpo.py` can now `load_from_disk` cached datasets with `ref_chosen_logps/ref_rejected_logps`, or trigger TRL precompute once and `save_to_disk` for reuse; also added `run_precompute_ref_logps_lowmem.sh` so the costly precompute stage can be finished and cached before actual training resumes |
| 2026-05-07 | Step 19 update: switched the low-memory DPO entrypoint to `reference_free` mode - `run_train_lowmem.sh` now trains without an explicit reference model, and `scripts/train_dpo.py` skips loading or saving `ref_logps` caches when `--reference-free` is enabled so stale reference caches are not mistakenly reused |
| 2026-05-07 | Step 19 update: added an ultra-low-memory `reference_free` DPO path - `scripts/train_dpo.py` now conditionally forwards `padding_free` and `use_logits_to_keep` when supported by the installed TRL version, and new `run_train_ultralowmem_ref_free.sh` drops the sequence budget to `2048/1536/256`, reduces LoRA rank to 8, relaxes eval frequency, and enables both memory-saving flags for a last-pass 9B DPO attempt |
| 2026-05-07 | Step 19 update: rebalanced the ultra-low-memory `reference_free` launch profile toward longer prompts under an `sdpa`-only server constraint - `run_train_ultralowmem_ref_free.sh` now allocates `3584/3072/384` tokens to preserve more episode context, removes the experimental `padding_free` flag, and switches attention implementation from `flash_attention_2` to `sdpa` |
| 2026-05-07 | Step 19 update: rolled the ultra-low-memory `reference_free` profile back to the last known runnable budget after the `3584/3072/384` attempt OOMed under `sdpa` - `run_train_ultralowmem_ref_free.sh` is restored to `2048/1536/256` with `use_logits_to_keep`, LoRA rank 8, and `sdpa` so the server can continue from the previously stable training envelope |
| 2026-05-08 | Step 20: created `DPODataGenFullLog/` to rebuild preference data with benchmark-style full raw logs instead of compressed DPO prompts - added `build_full_log_dataset.py`, `prompt_builder.py`, `export_model_tasks.py`, `collect_api_answers.py`, `assemble_preference_pairs.py`, `audit_dataset.py`, and `README.md`; rebuilt `data_dpo_full_log_v1/` from existing `data_dpo_v2` episodes/pairs (train=2500, dev=300) and audited prompt lengths showing full-log prompts are extremely long (train mean ~197.5k chars, p95 ~454k, max ~609k), while answer lengths remain short |
| 2026-05-10 | Step 20 update: packaged DPO_FULLLOG_QWEN2B_SERVER_BUNDLE for direct server-side 2B DPO experiments; copied full-log TRL datasets (clean and full), raw final pairs, and reports; added GPU-selectable run_train.sh mapping 0=GPU0, 1=GPU1, 2=dual GPU; default profile uses reference_free + QLoRA + bf16 + sdpa with longer sequence budgets than the previous 9B ultra-lowmem path |
| 2026-05-10 | Step 21: packaged SFT_FULLLOG_QWEN2B_SERVER_BUNDLE for direct server-side Qwen3.5-2B full-log SFT; converted chosen responses from full-log preference pairs into train_sft/dev_sft conversational datasets; added train_sft.py, analyze_sft_lengths.py, compare_eval_results.py, and GPU-selectable run_train.sh with conservative default sequence lengths for memory stability |
| 2026-05-11 | Step 21 update: packaged SFT_FULLLOG_QWEN2B_V2_SERVER_BUNDLE for a targeted full-log SFT v2 path; rebuilt chosen-only supervision from final full-log preference pairs, normalized weak rule answers into canonical JSON, preserved 879 strong-model chosen answers, and created a 5010-example focus split that upweights SQ3/SQ4 and high-difficulty TP/FP cases for direct benchmark-oriented repair |
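
The Step 12 entries report Cohen's kappa for annotation agreement. As a reference for how that statistic is defined, here is a minimal sketch of the standard two-rater formula, kappa = (p_o - p_e) / (1 - p_e). It is not the code in `annotate_kappa.py`, which is not shown in this file; the function name `cohens_kappa` and the example labels are illustrative assumptions.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items.

    p_o is the observed agreement rate; p_e is the agreement expected
    by chance from each rater's own label frequencies.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")

    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    # Counter returns 0 for labels the second rater never used.
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if expected == 1.0:
        # Both raters used a single identical label; agreement is total.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels for four episodes from two annotators.
rater_1 = ["anomaly", "anomaly", "normal", "normal"]
rater_2 = ["anomaly", "normal", "normal", "normal"]
kappa = cohens_kappa(rater_1, rater_2)  # p_o = 0.75, p_e = 0.5, kappa = 0.5
```

With stratified samples the same function can be applied per stratum (for example per difficulty level) to see where agreement drops.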