SafeHome Benchmark Build Progress Tracker

Project goal: build a smart-home safety reasoning benchmark based on the Matter 1.5.1 protocol
Target venue: ICLR 2027 (CCF-A)
Start date: 2026-04-29

Overall Progress

Step	Task	Status	Completed	Output files
1	Define Matter device models (JSON Schema)	Done	2026-04-29	`data/device_schemas/` (13 files)
2	Design home layout templates	Done	2026-04-29	`data/home_layouts/` (4 files)
3	Write behavior pattern generator	Done	2026-04-29	`data/behavior_patterns/` (3 files)
4	Design anomaly scenario library	Done	2026-04-29	`data/anomaly_templates/` (8 files)
5	Implement episode generation pipeline	Done	2026-04-29	`src/episode_gen/` (5 modules)
6	Generation + manual verification	Generation done	2026-04-29	`data/benchmark/` (900 JSON files, 77MB)

Step 1: Matter Device Model Definitions

Devices to Define

Safety-core devices (6 types):

#	Device	Matter section	Key clusters	Schema file	Status
1	Contact Sensor (door/window sensor)	7.1	BooleanState	contact_sensor.json	✅ Done
2	Occupancy Sensor (motion sensor)	7.3	OccupancySensing	occupancy_sensor.json	✅ Done
3	Smoke CO Alarm (smoke alarm)	7.9	SmokeCoAlarm	smoke_co_alarm.json	✅ Done
4	Door Lock (smart door lock)	8.1	DoorLock	door_lock.json	✅ Done
5	Water Leak Detector (water leak sensor)	7.12	BooleanState	water_leak_detector.json	✅ Done
6	Window Covering (curtain/window)	8.3	WindowCovering	window_covering.json	✅ Done

琛屼负涓婁笅鏂囪澶?(7绉?:

#	璁惧	Matter 绔犺妭	鍏抽敭 Cluster	schema 鏂囦欢	鐘舵€?
7	On/Off Light (寮€鍏崇伅)	4.1	OnOff	onoff_light.json	鉁?瀹屾垚
8	Dimmable Light (璋冨厜鐏?	4.2	OnOff, LevelControl	dimmable_light.json	鉁?瀹屾垚
9	Temperature Sensor (娓╁害浼犳劅鍣?	7.4	TemperatureMeasurement	temperature_sensor.json	鉁?瀹屾垚
10	Room Air Conditioner (绌鸿皟)	13.3	OnOff, Thermostat, FanControl	air_conditioner.json	鉁?瀹屾垚
11	Laundry Washer (娲楄。鏈?	13.1	OnOff, OperationalState	laundry_washer.json	鉁?瀹屾垚
12	Cook Surface (鐏跺叿)	13.7	OnOff, TemperatureControl	cook_surface.json	鉁?瀹屾垚
13	Dishwasher (娲楃鏈?	13.5	OnOff, OperationalState	dishwasher.json	鉁?瀹屾垚
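The device-to-cluster mapping in the two tables above lends itself to a simple consistency check against the schema files. The sketch below is only an illustration: the mapping is copied from the tables, but the schema layout (a top-level `clusters` object keyed by cluster name) and the helper name `missing_clusters` are assumptions, not the project's actual format.

```python
from __future__ import annotations

# Schema file stem -> key clusters, copied from the two device tables above.
KEY_CLUSTERS = {
    "contact_sensor": ["BooleanState"],
    "occupancy_sensor": ["OccupancySensing"],
    "smoke_co_alarm": ["SmokeCoAlarm"],
    "door_lock": ["DoorLock"],
    "water_leak_detector": ["BooleanState"],
    "window_covering": ["WindowCovering"],
    "onoff_light": ["OnOff"],
    "dimmable_light": ["OnOff", "LevelControl"],
    "temperature_sensor": ["TemperatureMeasurement"],
    "air_conditioner": ["OnOff", "Thermostat", "FanControl"],
    "laundry_washer": ["OnOff", "OperationalState"],
    "cook_surface": ["OnOff", "TemperatureControl"],
    "dishwasher": ["OnOff", "OperationalState"],
}


def missing_clusters(device: str, schema: dict) -> list[str]:
    """Return the key clusters that a loaded schema does not declare.

    Assumes (hypothetically) that the schema has a top-level "clusters"
    object whose keys are cluster names.
    """
    declared = set(schema.get("clusters", {}))
    return [name for name in KEY_CLUSTERS[device] if name not in declared]


# Hypothetical, deliberately incomplete schema for illustration only.
example_schema = {"device_type": "Dimmable Light", "clusters": {"OnOff": {}}}
print(missing_clusters("dimmable_light", example_schema))  # → ['LevelControl']
```

A check of this kind could be run over every file in `data/device_schemas/` after loading each one with `json.load`, once the key names are adapted to the real schema layout.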

Data Sources

Matter 1.5.1 Device Library: 23-27351-009_Matter-1.5.1-Device-Library-Specification.pdf
Matter 1.5.1 Application Cluster: 23-27350-009_Matter-1.5.1-Application-Cluster-Specification.pdf
Each device schema is extracted strictly from the PDFs; nothing is made up.

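Changelog

Date	Change
2026-04-29	Created the project tracking file; started Step 1
2026-04-29	Step 1 done: all 13 Matter device model schemas extracted from the PDF specifications and written to JSON
2026-04-29	Step 2 done: 4 home layout templates (A: one bedroom and one living room, ?4 devices; B: two bedrooms and one living room, ?4 devices; C: three bedrooms and two living rooms, 48 devices; D: elderly person living alone, 26 devices), checked device by device
2026-04-29	Step 1 review: all 13 device schemas checked item by item against the Matter 1.5.1 PDFs; corrected errors in ? files (conformance errors, missing clusters, etc.)
2026-04-29	Step 3 done: behavior pattern data definitions - action_event_mapping (18 action → device-event mappings), daily_routines (3 resident profiles × weekday/weekend schedules), environment_baselines (temperature baselines + sensor fault modes)
2026-04-29	Step 4 done: anomaly scenario library - 35 scenarios in 7 major categories plus 16 false-positive variants; counts verified against the taxonomy
2026-04-29	Step 5 done: episode generation pipeline - home_state.py, behavior_engine.py, anomaly_injector.py, episode_builder.py, generate.py
2026-04-29	Step 6 (v1): first release of 900 episodes; found ?64 issues
2026-04-29	Bug fixes: all 6 issue classes fixed - layout/profile constraints for EL/CH scenarios, unreplaced query templates, temperature trend value expansion, DF-06 device matching, BA-01 over-allocation (MAX_PER_SCENARIO=50)
2026-04-29	Step 6 (v2): a full scan over the 6 issue classes found 0 issues, but extending to 15 check classes revealed 61 new issues (TP_NOT_ANOMALY=35, TP_NO_ANOMALY_EVENTS=26)
2026-04-29	Bug fixes (v3): BA-02/CH-05 are no longer assigned to TP (is_anomaly!=True); device matching plus absence-scenario events for FG-03/EL-04/FG-02; fallback mechanism to handle scenario exhaustion; fixed an UnboundLocalError
2026-04-29	Step 6 (v3): 15 structural checks pass, but a deeper review found an FP content problem: 139 FP episodes have no injected events
2026-04-29	Bug fixes (v4): filled in the events lists of 7 FP variants (DF-01, DF-03, EL-02, EL-05, CH-01, BA-03, BA-05); fixed a device-matching overwrite bug
2026-04-29	Step 6 (v4-final): 1200 episodes; all 15 structural checks plus the FP content check pass with 0 issues. FP episodes with empty events dropped from 139 to 0
2026-04-29	Step 7: evaluation framework done - prompt_builder.py, scorer.py, metrics.py, runner.py, judge.py; all imports verified
2026-04-30	Step 7 update: added Anthropic (Claude) API backend support to runner.py; added eval_local.py and eval_api.py (pure requests)
2026-04-30	Step 8: full evaluation done - Qwen2.5-7B (1200 ep, 5 s/ep, 0 API errors) + Claude Opus 4.6 (1200 ep, 25 s/ep, 15 API errors); analysis report: results/ANALYSIS_REPORT.md
2026-04-30	Step 6 (v5): temperature sampling interval changed from ?0 minutes to 5 minutes; cooking_boost changed to a gradual ramp up and down (sinusoidal daytime curve + linear cooking temperature rise)
2026-04-30	Step 6 (v6): motion sensors now report periodically (every 5 min when occupied / 30 min when unoccupied). Event volume avg=1762/episode. 15 checks pass. Prompts are roughly ?2K-70K tokens
2026-05-01	Step 9: EDRC reasoning framework done - prompt_builder.py supports 3 modes (baseline/edrc/cot); scorer.py parses the 6-step EDRC output; eval_api.py gains a --mode argument
2026-05-02	Step 10: fixes from domain-expert review - all 5 issues fixed: (1) removed the 500-event truncation (2) SQ3 filters logs by the query time window (3) SQ5 no longer leaks scenario names (4) 22 GT labels unified to standard names (5) INS-05 no longer depends on external information. Benchmark regenerated and verified
2026-05-02	Step 10 addendum: fixed SQ-category contamination - fallback no longer pulls scenarios across categories. Before the fix: SQ3 had 112 contaminated episodes (47.7%), SQ4 had 127 (54.0%), SQ5 had 1. After the fix: zero violations
2026-05-02	Step 11: filled in FP variants for all 35 scenarios - from 16/35 to 35/35
2026-05-02	Step 12: difficulty quantification - a 5-dimension objective score (evidence_count/signal_directness/cross_device/temporal_span/fp_similarity) replaces subjective labels; all 35 scenarios annotated. Distribution: L1=3, L2=15, L3=17
2026-05-02	Step 12: Cohen's kappa annotation tool - annotate_kappa.py supports independent annotation by two models + kappa computation + stratified sampling
2026-05-02	Step 12: human annotation report - stratified sample of 30 episodes; two-annotator agreement κ=0.879, human vs GT κ≈0.44, human detection rate on L3 scenarios only ?0%. Report: results/HUMAN_ANNOTATION_REPORT.md
2026-05-02	Step 13: Diversity & Hardness report - comprehensive statistics over 11 dimensions (scale / SQ distribution / categories / layouts / profiles / event volume / 5-dimension difficulty / injection time / anomaly statistics / temporal features / SimuHome comparison). Report: results/DIVERSITY_REPORT.md
2026-05-04	Step 14: EGP (Evidence-Guided Pipeline) prototype implemented - created a standalone `EGP/` directory without modifying the existing evaluation code; implemented the API client, resume/episode collection, evidence compression, multi-stage prompts, and the EGP runner; supports `--max-episodes` / `--episode-id` / `--preview-only`, and local vLLM setups with an empty API key
2026-05-04	Step 14 update: EGP compatibility fixes for local Qwen/vLLM - the API client handles `content=None` / `reasoning_content` responses; `run_egp.py` accepts `--no-thinking` and `--no_thinking` as argument aliases, reusing eval_api's no-thinking sampling logic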
2026-05-04	Step 15: EGPv2 workflow prototype - created `EGPv2/` without exposing `SQ1-SQ5`; added LLM-in-the-loop roles (`Triage` / `Investigator` / `Supervisor` / `Verifier`); supports OpenAI-compatible APIs, local vLLM without API key, resume, `--no-thinking`, and `--max-tokens`
2026-05-04	Step 15 update: EGPv2 signal fixes - corrected SQ3/SQ4 query-window filtering in `signals.py`; switched room mapping to strict `home_state.room_id`; verified `run_egpv2.py --help`, `preview-only`, and `compileall`
2026-05-04	Step 15 update: added `EGPv2/diagnosis_subset_60.txt` and `EGPv2/diagnosis_subset_60.md` for module-level workflow debugging; the 60-episode subset emphasizes SQ1 drift, SQ3/SQ4 hard cases, SQ2/SQ5 controls, and a small TN sanity pack
2026-05-04	Step 15 update: implemented `EGPv2/fix_checklist.md`; made EGPv2 Matter-aware by formatting protocol values in `signals.py`, adding protocol guidance in `prompts.py`, turning the pipeline into a bounded 2-round workflow, and enforcing supervisor decisions in code; verified with `compileall`, `preview-only`, and a 1-episode local smoke test
2026-05-04	Step 15 update: rebalanced EGPv2 decision policy - prompts now separate direct-evidence requirements for device faults from temporal/cross-device evidence for behavior and safety anomalies; verifier no longer collapses insufficient evidence into `none` by default; supervisor code gate now hard-blocks only explicit `abstain`; verified with `compileall` and a 2-episode TP/FP local smoke pair (`SQ4_TP_B_0721`, `SQ4_FP_B_0885`)
2026-05-04	Step 15 update: locked EGPv2 request temperature to `0.0` in `api_client.py` for both thinking and non-thinking modes, removing the earlier no-thinking temperature bump; verified with `compileall` and direct client instantiation checks
2026-05-04	Step 15 update: tightened `EGPv2/prompts.py` for SQ3/SQ4/SQ5 broad-query failures - added scope-coverage requirements for composite/emergency analysis, blocked escalation from single transient `None` or missing logs, and added stricter trigger rules for `sensor_malfunction`, `unattended_cooking`, `fire_risk`, and intrusion-style labels; verified with `compileall`
2026-05-04	Step 15 update: created `EGPv2_1/` as a prompt-only branch from `EGPv2/` for minimal SQ-targeted tuning; preserved the latest SQ3/SQ5 conservative gains while restoring SQ1 device-health recall and SQ4 composite-safety recall by (1) strengthening device-health chunk selection around retries/follow-up/recovery, (2) restricting `secondary_task_profile=device-health` to explicit fault evidence, and (3) allowing local hazardous sequences with vulnerable context plus delayed/weak mitigation to remain anomalous; verified with `python -m compileall EGPv2_1` and `python EGPv2_1/run_egpv2.py --help`
2026-05-05	Step 15 update: refined `EGPv2_1` after diag60 analysis - added stricter anti-false-alarm rules for SQ1/SQ2-style device-health cases in `prompts.py` (single transient spike is not enough for `sensor_malfunction`; delayed auto-lock / long unlocked interval is not enough for `lock_malfunction`; device-health verdicts should inspect pre/post outcome chunks rather than a single local snippet); fixed `EGPv2_1/run_egpv2.py` summary/version label from `EGPv2` to `EGPv2.1`, and corrected the already-generated `results/qwen36_35B_egpv2_1diag60/summary.json` pipeline field
2026-05-05	Step 15 update: applied MISS-driven recall fixes to `EGPv2_1` - added query-intent guardrails in `pipeline.py` so behavior/security/composite queries are no longer silently collapsed into `device-health`, changed second-round refinement to prioritize supervisor-requested chunks plus adjacent context instead of reusing the original safe chunk set, and strengthened `prompts.py` so non-device-health queries do not get derailed by transient telemetry noise while verifier no longer defaults to `none` when a coherent anomaly sequence still directly answers the query; verified with `python -m compileall EGPv2_1`, `python EGPv2_1/run_egpv2.py --help`, and `--preview-only` smoke run
2026-05-05	Step 15 update: added FP-recovery tuning to `EGPv2_1/prompts.py` after recall overshoot - kept the new query-intent / second-round retrieval logic, but tightened anomaly acceptance for `unattended_cooking`, `fire_risk`, `intrusion`, and `behavioral_anomaly` so absence-based supervision assumptions, sparse lock/contact activity, missing OFF logs near truncation, and one ambiguous telemetry inconsistency are no longer enough to force anomaly; verified with `python -m compileall EGPv2_1` and `python EGPv2_1/run_egpv2.py --help`
2026-05-05	Step 15 update: built a hybrid `EGPv2_1` prompt variant for full-run testing - preserved the newer `pipeline.py` query-intent and second-round retrieval fixes, but rolled back the last round's global verifier/investigator over-tightening for `single-event-safety` and `behavior-sequence`, keeping the stricter absence-based safeguards only for `composite-safety` / `emergency-response` so SQ4-style FP control is retained without globally suppressing SQ1/SQ2/SQ3 recall; verified with `python -m compileall EGPv2_1` and `python EGPv2_1/run_egpv2.py --help`
2026-05-05	Step 16: created `EGPv3/` as a new Debate-then-Judge architecture - replaced the serial `Triage -> Investigator -> Supervisor -> Verifier` veto chain with a neutral `Extractor` plus heterogeneous debate (`Prosecutor` for recall, `Defender` for precision) followed by a `Judge` that sees both arguments and the same raw focused evidence; implemented `EGPv3/prompts.py`, `EGPv3/pipeline.py`, `EGPv3/run_egpv3.py`, updated `README.md`, and reused the existing OpenAI-compatible client / benchmark / signal stack; verified with `python -m compileall EGPv3`, `python EGPv3/run_egpv3.py --help`, and `python EGPv3/run_egpv3.py --preview-only --max-episodes 1`
2026-05-05	Step 16 update: created `EGPv3_1/` as a stronger debate variant focused on SQ3/TN false-alarm recovery - upgraded `Defender` from a parallel benign narrator into a rebuttal role that explicitly sees and attacks the `Prosecutor`'s claims point-by-point, and upgraded the `Judge` prompt to apply a burden-of-proof test (direct evidence vs absence-based inference vs coherent normal routine) before awarding anomaly; implemented `EGPv3_1/prompts.py`, wired defender access to prosecutor output in `EGPv3_1/pipeline.py`, added `EGPv3_1/run_egpv3_1.py`, and verified with `python -m compileall EGPv3_1`, `python EGPv3_1/run_egpv3_1.py --help`, and `python EGPv3_1/run_egpv3_1.py --preview-only --max-episodes 1`
2026-05-05	Step 16 update: created `EGPv3_2/` to repair the over-conservative `EGPv3_1` burden rules - kept the stronger rebuttal-style `Defender`, but recalibrated `Prosecutor` and `Judge` so anomaly can win on multi-signal convergence even without explicit fault codes, while a benign story only wins if it is positively supported by the logs rather than merely plausible; added query-type alignment pressure to `Prosecutor`, support-quality fields to both debate sides, and a support-vs-speculation burden test in `Judge`; verified with `python -m compileall EGPv3_2`, `python EGPv3_2/run_egpv3_2.py --help`, and `python EGPv3_2/run_egpv3_2.py --preview-only --max-episodes 1`
2026-05-05	Step 17: implemented `EGPv4/` following the ADISR design - added a deterministic rule-based `Evidence Extractor` (`extractor.py`) that compresses raw device logs into key events / temperature trends / occupancy summaries / alert lines, then feeds a 3-stage adversarial pipeline (`Prosecutor` -> `Defender` -> `Judge`) with asymmetric burden-of-proof prompts in `prompts.py`; created `pipeline.py`, `run_egpv4.py`, `README.md`, and reused the existing OpenAI-compatible client / benchmark / signal stack; verified with `python -m compileall EGPv4`, `python EGPv4/run_egpv4.py --help`, and `python EGPv4/run_egpv4.py --preview-only --max-episodes 1`
2026-05-06	Step 18: created isolated DPO training-data generator `DPODataGen/` - added a 40-scenario training bank disjoint from the 35 benchmark scenarios, split-level scenario isolation (`train_pref_v1` vs `dev_pref_v1`), synthetic Matter-style episode generation reusing `HomeState` + normal behavior streams, and `pairs.jsonl` assembly with `rule` chosen answers plus `constructed` rejected answers; verified with `python -m compileall DPODataGen`, `python -m DPODataGen.run_generate_dpo --split both --max-episodes 6 --overwrite`, and direct-script execution `python DPODataGen/run_generate_dpo.py --split train --max-episodes 1 --overwrite`
2026-05-06	Step 18 update: replaced DPO pair prompt construction with a DPO-specific compact prompt builder in `DPODataGen/prompt_builder.py` - prompts now keep focused rooms/devices, local event windows, sparse device history, and multi-day activity summaries instead of full raw logs; added `audit_dataset.py` for prompt-length checks and `export_model_tasks.py` for later strong/weak model answer collection; smoke-verified on `data_dpo_smoke2/` with prompt length reduced from ~200k chars mean to ~8k chars mean (`python -m DPODataGen.run_generate_dpo --split both --max-episodes 20 --output-root data_dpo_smoke2 --overwrite`, `python DPODataGen/audit_dataset.py --root .\\data_dpo_smoke2`)
2026-05-06	Step 18 update: added model-answer collection and pair assembly utilities for DPO curation - `collect_api_answers.py` calls OpenAI-compatible APIs with resume/no-thinking support, `assemble_preference_pairs.py` scores strong/weak model outputs against each episode and upgrades base pairs to `strong_model` chosen / `weak_model_actual_error` rejected when usable; verified with `python -m compileall DPODataGen`, `python DPODataGen/export_model_tasks.py --split train --input-root data_dpo_v2 --output tmp_dpo/train_tasks_5.jsonl --max-pairs 5`, and fallback assembly smoke `python DPODataGen/assemble_preference_pairs.py --split train --input-root data_dpo_v2 --output tmp_dpo/train_pairs_fallback.jsonl --report tmp_dpo/train_pairs_fallback_report.json`
2026-05-07	Step 19: packaged a self-contained server-side DPO bundle in `DPO_QWEN35_SERVER_BUNDLE/` - copied raw final preference pairs and reports, converted them into TRL conversational-format datasets (`train_dpo_clean.jsonl`, `dev_dpo_clean.jsonl`), dropped weak-model `parse_fail` rejected pairs for the clean split (train: 2500→?435, dev: 300→?92), and added standalone server scripts (`scripts/train_dpo.py`, `scripts/analyze_token_lengths.py`), `requirements.txt`, and `run_train.sh`; verified with `python -m compileall DPO_QWEN35_SERVER_BUNDLE` and manifest/schema checks
2026-05-07	Step 19 update: hardened the DPO server bundle for mixed TRL versions and 2xA100 40G memory limits - `scripts/train_dpo.py` now conditionally enables `precompute_ref_log_probs` when supported, and `run_train.sh` now defaults to `max_length=6144` / `max_prompt_length=5632` with reference-logprob precomputation enabled to reduce DPO reference-forward OOM risk
2026-05-07	Step 19 update: added an aggressive low-memory DPO path in `DPO_QWEN35_SERVER_BUNDLE/` - `scripts/train_dpo.py` now conditionally forwards `precompute_ref_log_probs` / `reference_free` / truncation-related args to whichever TRL API surface is available, and new `run_train_lowmem.sh` cuts the working sequence to `4096/3584/512`, uses `keep_end` truncation, `paged_adamw_8bit`, LoRA rank 16, and `flash_attention_2` for a substantially lower VRAM footprint
2026-05-07	Step 19 update: added persistent reference-logprob caching for DPO training - `DPO_QWEN35_SERVER_BUNDLE/scripts/train_dpo.py` can now `load_from_disk` cached datasets with `ref_chosen_logps/ref_rejected_logps`, or trigger TRL precompute once and `save_to_disk` for reuse; also added `run_precompute_ref_logps_lowmem.sh` so the costly precompute stage can be finished and cached before actual training resumes
2026-05-07	Step 19 update: switched the low-memory DPO entrypoint to `reference_free` mode - `run_train_lowmem.sh` now trains without an explicit reference model, and `scripts/train_dpo.py` skips loading or saving `ref_logps` caches when `--reference-free` is enabled so stale reference caches are not mistakenly reused
2026-05-07	Step 19 update: added an ultra-low-memory `reference_free` DPO path - `scripts/train_dpo.py` now conditionally forwards `padding_free` and `use_logits_to_keep` when supported by the installed TRL version, and new `run_train_ultralowmem_ref_free.sh` drops the sequence budget to `2048/1536/256`, reduces LoRA rank to 8, relaxes eval frequency, and enables both memory-saving flags for a last-pass 9B DPO attempt
2026-05-07	Step 19 update: rebalanced the ultra-low-memory `reference_free` launch profile toward longer prompts under an `sdpa`-only server constraint - `run_train_ultralowmem_ref_free.sh` now allocates `3584/3072/384` tokens to preserve more episode context, removes the experimental `padding_free` flag, and switches attention implementation from `flash_attention_2` to `sdpa`
2026-05-07	Step 19 update: rolled the ultra-low-memory `reference_free` profile back to the last known runnable budget after the `3584/3072/384` attempt OOMed under `sdpa` - `run_train_ultralowmem_ref_free.sh` is restored to `2048/1536/256` with `use_logits_to_keep`, LoRA rank 8, and `sdpa` so the server can continue from the previously stable training envelope
2026-05-08	Step 20: created `DPODataGenFullLog/` to rebuild preference data with benchmark-style full raw logs instead of compressed DPO prompts - added `build_full_log_dataset.py`, `prompt_builder.py`, `export_model_tasks.py`, `collect_api_answers.py`, `assemble_preference_pairs.py`, `audit_dataset.py`, and `README.md`; rebuilt `data_dpo_full_log_v1/` from existing `data_dpo_v2` episodes/pairs (train=2500, dev=300) and audited prompt lengths showing full-log prompts are extremely long (train mean ~197.5k chars, p95 ~454k, max ~609k), while answer lengths remain short
2026-05-10	Step 20 update: packaged DPO_FULLLOG_QWEN2B_SERVER_BUNDLE for direct server-side 2B DPO experiments; copied full-log TRL datasets (clean and full), raw final pairs, and reports; added GPU-selectable run_train.sh mapping 0=GPU0, 1=GPU1, 2=dual GPU; default profile uses reference_free + QLoRA + bf16 + sdpa with longer sequence budgets than the previous 9B ultra-lowmem path
2026-05-10	Step 21: packaged SFT_FULLLOG_QWEN2B_SERVER_BUNDLE for direct server-side Qwen3.5-2B full-log SFT; converted chosen responses from full-log preference pairs into train_sft/dev_sft conversational datasets; added train_sft.py, analyze_sft_lengths.py, compare_eval_results.py, and GPU-selectable run_train.sh with conservative default sequence lengths for memory stability
2026-05-11	Step 21 update: packaged SFT_FULLLOG_QWEN2B_V2_SERVER_BUNDLE for a targeted full-log SFT v2 path; rebuilt chosen-only supervision from final full-log preference pairs, normalized weak rule answers into canonical JSON, preserved 879 strong-model chosen answers, and created a 5010-example focus split that upweights SQ3/SQ4 and high-difficulty TP/FP cases for direct benchmark-oriented repair
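
The agreement figures in the Step 12 changelog entries are Cohen's kappa values, κ = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e is the agreement expected by chance. The sketch below shows the standard computation for two annotators. It is not taken from `annotate_kappa.py`, whose implementation is not shown here, and the function name and example labels are hypothetical.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)

    # Observed agreement: fraction of items with identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: product of each annotator's marginal label rates.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(count_a[k] * count_b[k] for k in count_a.keys() | count_b.keys()) / (n * n)

    if p_e == 1.0:  # both annotators used one identical label throughout
        return 1.0
    return (p_o - p_e) / (1.0 - p_e)


# Hypothetical labels for four episodes, for illustration only.
annotator_1 = ["anomaly", "normal", "anomaly", "normal"]
annotator_2 = ["anomaly", "normal", "normal", "normal"]
print(cohens_kappa(annotator_1, annotator_2))  # → 0.5
```

In this example the annotators agree on 3 of 4 items (p_o = 0.75) and chance agreement is p_e = 0.5, which gives κ = 0.5.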
