Files
llmiotsafe/PROGRESS.md
2026-05-12 17:01:39 +08:00

24 KiB
Raw Blame History

SafeHome Benchmark 鏋勫缓杩涘害杩借釜

椤圭洰鐩爣: 鏋勫缓鍩轰簬 Matter 1.5.1 鍗忚鐨勬櫤鑳藉灞呭畨鍏ㄦ帹鐞?Benchmark 鐩爣浼氳: ICLR 2027 (CCF-A) **寮€濮嬫棩鏈?*: 2026-04-29


鎬讳綋杩涘害

Step 鍐呭 鐘舵€? 瀹屾垚鏃ユ湡 杈撳嚭鏂囦欢
1 瀹氫箟 Matter 璁惧妯″瀷 (JSON Schema) *宸插畬鎴? 2026-04-29 data/device_schemas/ (13涓枃浠?
2 璁捐瀹跺涵甯冨眬妯℃澘 *宸插畬鎴? 2026-04-29 data/home_layouts/ (4涓枃浠?
3 鍐欒涓烘ā寮忕敓鎴愬櫒 *宸插畬鎴? 2026-04-29 data/behavior_patterns/ (3涓枃浠?
4 璁捐寮傚父鍦烘櫙搴? *宸插畬鎴? 2026-04-29 data/anomaly_templates/ (8涓枃浠?
5 瀹炵幇 episode 鐢熸垚 pipeline *宸插畬鎴? 2026-04-29 src/episode_gen/ (5涓ā鍧?
6 鐢熸垚 + 浜哄伐楠岃瘉 鐢熸垚瀹屾垚 2026-04-29 data/benchmark/ (900涓狫SON, 77MB)

Step 1: Matter 璁惧妯″瀷瀹氫箟

闇€瑕佸畾涔夌殑璁惧鍒楄〃

瀹夊叏鏍稿績璁惧 (6绉?:

# 璁惧 Matter 绔犺妭 鍏抽敭 Cluster schema 鏂囦欢 鐘舵€?
1 Contact Sensor (闂ㄧ獥浼犳劅鍣? 7.1 BooleanState contact_sensor.json 鉁?瀹屾垚
2 Occupancy Sensor (杩愬姩浼犳劅鍣? 7.3 OccupancySensing occupancy_sensor.json 鉁?瀹屾垚
3 Smoke CO Alarm (鐑熼浘鎶ヨ鍣? 7.9 SmokeCoAlarm smoke_co_alarm.json 鉁?瀹屾垚
4 Door Lock (鏅鸿兘闂ㄩ攣) 8.1 DoorLock door_lock.json 鉁?瀹屾垚
5 Water Leak Detector (婕忔按浼犳劅鍣? 7.12 BooleanState water_leak_detector.json 鉁?瀹屾垚
6 Window Covering (绐楀笜/绐楁埛) 8.3 WindowCovering window_covering.json 鉁?瀹屾垚

琛屼负涓婁笅鏂囪澶?(7绉?:

# 璁惧 Matter 绔犺妭 鍏抽敭 Cluster schema 鏂囦欢 鐘舵€?
7 On/Off Light (寮€鍏崇伅) 4.1 OnOff onoff_light.json 鉁?瀹屾垚
8 Dimmable Light (璋冨厜鐏? 4.2 OnOff, LevelControl dimmable_light.json 鉁?瀹屾垚
9 Temperature Sensor (娓╁害浼犳劅鍣? 7.4 TemperatureMeasurement temperature_sensor.json 鉁?瀹屾垚
10 Room Air Conditioner (绌鸿皟) 13.3 OnOff, Thermostat, FanControl air_conditioner.json 鉁?瀹屾垚
11 Laundry Washer (娲楄。鏈? 13.1 OnOff, OperationalState laundry_washer.json 鉁?瀹屾垚
12 Cook Surface (鐏跺叿) 13.7 OnOff, TemperatureControl cook_surface.json 鉁?瀹屾垚
13 Dishwasher (娲楃鏈? 13.5 OnOff, OperationalState dishwasher.json 鉁?瀹屾垚

鏁版嵁鏉ユ簮

  • Matter 1.5.1 Device Library: 23-27351-009_Matter-1.5.1-Device-Library-Specification.pdf
  • Matter 1.5.1 Application Cluster: 23-27350-009_Matter-1.5.1-Application-Cluster-Specification.pdf
  • 姣忎釜璁惧鐨?schema 涓ユ牸浠?PDF 涓彁鍙栵紝涓嶅嚟绌虹紪閫?

鍙樻洿鏃ュ織

鏃ユ湡 鍙樻洿鍐呭
2026-04-29 鍒涘缓椤圭洰杩借釜鏂囦欢锛屽紑濮?Step 1
2026-04-29 Step 1 瀹屾垚: 13涓?Matter 璁惧妯″瀷 schema 鍏ㄩ儴浠?PDF 瑙勮寖涓彁鍙栧苟鍐欏叆 JSON
2026-04-29 Step 2 瀹屾垚: 4绉嶅搴竷灞€妯℃澘 (A:涓€瀹や竴鍘?4璁惧, B:涓ゅ涓€鍘?4璁惧, C:涓夊涓ゅ巺48璁惧, D:鐙眳鑰佷汉26璁惧)锛屽凡閫愯澶囨牳瀵?
2026-04-29 Step 1 鏍稿: 13涓?device schema 閫愰」瀵圭収 Matter 1.5.1 PDF 鏍稿锛屼慨姝?涓枃浠剁殑閿欒锛坈onformance 閿欒銆佺己澶?cluster 绛夛級
2026-04-29 Step 3 瀹屾垚: 琛屼负妯″紡鏁版嵁瀹氫箟 鈥?action_event_mapping(18绉嶅姩浣溾啋璁惧浜嬩欢鏄犲皠), daily_routines(3绉嶅眳浣忚€呯敾鍍徝楀伐浣滄棩/鍛ㄦ湯鏃堕棿琛?, environment_baselines(娓╁害鍩虹嚎+浼犳劅鍣ㄦ晠闅滄ā寮?
2026-04-29 Step 4 瀹屾垚: 寮傚父鍦烘櫙搴?鈥?7澶х被35涓満鏅?16涓鎶ュ彉浣擄紝宸叉牳瀵规暟閲忎笌taxonomy涓€鑷?
2026-04-29 Step 5 瀹屾垚: episode 鐢熸垚 pipeline 鈥?home_state.py, behavior_engine.py, anomaly_injector.py, episode_builder.py, generate.py
2026-04-29 Step 6 (v1): 棣栫増900涓猠pisodes锛屽彂鐜?64涓棶棰?
2026-04-29 Bug淇: 6绫婚棶棰樺叏閮ㄤ慨澶?鈥?EL/CH鍦烘櫙甯冨眬鐢诲儚绾︽潫銆乹uery妯℃澘鏈浛鎹€佹俯搴rend鍊煎睍寮€銆丏F-06璁惧鍖归厤銆丅A-01杩囧害鍒嗛厤(MAX_PER_SCENARIO=50)
2026-04-29 Step 6 (v2): 鍏ㄩ噺6绫绘壂鎻?0闂锛屼絾鎵╁睍鍒?5绫绘鏌ュ悗鍙戠幇61涓柊闂(TP_NOT_ANOMALY=35, TP_NO_ANOMALY_EVENTS=26)
2026-04-29 Bug淇(v3): BA-02/CH-05涓嶅垎閰嶇粰TP(is_anomaly!=True); FG-03/EL-04/FG-02璁惧鍖归厤+absence鍦烘櫙浜嬩欢; fallback鏈哄埗瑙喅鍦烘櫙鑰楀敖; UnboundLocalError淇
2026-04-29 Step 6 (v3): 15椤圭粨鏋勬鏌ラ€氳繃锛屼絾娣卞害瀹℃煡鍙戠幇FP鍐呭闂: 139涓狥P episode鏃犳敞鍏ヤ簨浠?
2026-04-29 Bug淇(v4): 琛ュ叏7涓狥P鍙樹綋鐨別vents鍒楄〃(DF-01,DF-03,EL-02,EL-05,CH-01,BA-03,BA-05); 淇璁惧鍖归厤瑕嗙洊bug
2026-04-29 Step 6 (v4-final): 1200涓猠pisodes锛?5椤圭粨鏋勬鏌?FP鍐呭妫€鏌ュ叏閮ㄩ€氳繃=0闂銆侳P绌轰簨浠朵粠139闄嶅埌0
2026-04-29 Step 7: 璇勪及妗嗘灦瀹屾垚 鈥?prompt_builder.py, scorer.py, metrics.py, runner.py, judge.py锛屽叏閮?import 楠岃瘉閫氳繃
2026-04-30 Step 7 鏇存柊: runner.py 鍔犲叆 Anthropic (Claude) API 鍚庣鏀寔; 鏂板 eval_local.py 鍜?eval_api.py(绾痳equests)
2026-04-30 Step 8: 鍏ㄩ噺璇勬祴瀹屾垚 鈥?Qwen2.5-7B(1200ep,5s/ep,0 API err) + Claude Opus 4.6(1200ep,25s/ep,15 API err); 鍒嗘瀽鎶ュ憡 results/ANALYSIS_REPORT.md
2026-04-30 Step 6 (v5): 娓╁害閲囨牱浠?0鍒嗛挓鏀逛负5鍒嗛挓锛宑ooking_boost鏀逛负娓愯繘寮忓崌闄?姝e鸡鏃ラ棿鏇茬嚎+绾挎€у仛楗崌娓?
2026-04-30 Step 6 (v6): 杩愬姩浼犳劅鍣ㄥ姞鍛ㄦ湡鎬т笂鎶?鏈変汉5min/鏃犱汉30min)銆備簨浠堕噺avg=1762/episode銆?5椤规鏌ラ€氳繃銆俻rompt绾?2K-70K token
2026-05-01 Step 9: EDRC 鎺ㄧ悊妗嗘灦瀹屾垚 鈥?prompt_builder.py 鏀寔 3 绉嶆ā寮?baseline/edrc/cot); scorer.py 鏀寔瑙f瀽 EDRC 6 姝ヨ緭鍑? eval_api.py 鍔?--mode 鍙傛暟
2026-05-02 Step 10: 棰嗗煙涓撳瀹℃煡淇 鈥?5涓棶棰樺叏閮ㄤ慨澶? (1)鍘绘帀500浜嬩欢鎴柇 (2)SQ3鎸塹uery鏃堕棿绐楄繃婊ゆ棩蹇?(3)SQ5涓嶅啀娉勬紡鍦烘櫙鍚?(4)22涓狦T鏍囩缁熶竴涓烘爣鍑嗗悕 (5)INS-05鍒犻櫎澶栭儴淇℃伅渚濊禆銆俠enchmark閲嶆柊鐢熸垚骞堕獙璇侀€氳繃
2026-05-02 Step 10 杩藉姞: 淇 SQ-category 姹℃煋闂 鈥?fallback 涓嶅啀璺ㄧ被鍒崬鍦烘櫙銆備慨澶嶅墠: SQ3姹℃煋112涓?47.7%), SQ4姹℃煋127涓?54.0%), SQ5姹℃煋1涓€備慨澶嶅悗: 闆惰繚瑙?
2026-05-02 Step 11: 琛ュ叏鍏ㄩ儴 35 涓満鏅殑 FP 鍙樹綋 鈥?浠?16/35 琛ュ埌 35/35
2026-05-02 Step 12: 闅惧害閲忓寲 鈥?5缁村瑙傝瘎鍒?evidence_count/signal_directness/cross_device/temporal_span/fp_similarity)鏇挎崲涓昏鏍囩锛?5鍦烘櫙鍏ㄩ儴鏍囨敞銆傚垎甯? L1=3, L2=15, L3=17
2026-05-02 Step 12: Cohen's kappa 鏍囨敞宸ュ叿 鈥?annotate_kappa.py 鏀寔鍙屾ā鍨嬬嫭绔嬫爣娉?kappa璁畻+鍒嗗眰鎶芥牱
2026-05-02 Step 12: 浜哄伐鏍囨敞鎶ュ憡 鈥?30涓猠pisode鍒嗗眰鎶芥牱锛屽弻浜烘爣娉ㄎ?0.879锛屼汉vsGT 魏鈮?.44锛孡3鍦烘櫙浜虹被妫€鍑虹巼浠?0%銆傛姤鍛? results/HUMAN_ANNOTATION_REPORT.md
2026-05-02 Step 13: Diversity & Hardness 鎶ュ憡 鈥?11涓淮搴﹀叏闈㈢粺璁?瑙勬ā/SQ鍒嗗竷/绫诲埆/甯冨眬/鐢诲儚/浜嬩欢閲?闅惧害5缁?娉ㄥ叆鏃堕棿/寮傚父缁熻/鏃跺簭鐗瑰緛/SimuHome瀵规瘮)銆傛姤鍛? results/DIVERSITY_REPORT.md
2026-05-04 Step 14: EGP (Evidence-Guided Pipeline) 鍘熷瀷瀹炵幇 鈥?鏂板缓 EGP/ 鐙珛鐩綍锛屼笉淇敼鏃㈡湁璇勬祴浠g爜锛涘疄鐜?API 瀹㈡埛绔€乺esume/episode 鏀堕泦銆佽瘉鎹帇缂┿€佸闃舵 prompt銆丒GP runner锛涙敮鎸?--max-episodes / --episode-id / --preview-only锛屽苟鏀寔绌?API key 鐨勬湰鍦?vLLM 鍦烘櫙
2026-05-04 Step 14 鏇存柊: EGP 鏈湴 Qwen/vLLM 鍏煎鎬т慨澶?鈥?API client 鍏煎 content=None / reasoning_content 杩斿洖锛沗run_egp.py鏀寔--no-thinking 涓?--no_thinking` 鍙傛暟鍒悕锛屾部鐢?eval_api 鐨?no-thinking 閲囨牱閫昏緫
2026-05-04 Step 15: EGPv2 workflow prototype - created EGPv2/ without exposing SQ1-SQ5; added LLM-in-the-loop roles (Triage / Investigator / Supervisor / Verifier); supports OpenAI-compatible APIs, local vLLM without API key, resume, --no-thinking, and --max-tokens
2026-05-04 Step 15 update: EGPv2 signal fixes - corrected SQ3/SQ4 query-window filtering in signals.py; switched room mapping to strict home_state.room_id; verified run_egpv2.py --help, preview-only, and compileall
2026-05-04 Step 15 update: added EGPv2/diagnosis_subset_60.txt and EGPv2/diagnosis_subset_60.md for module-level workflow debugging; the 60-episode subset emphasizes SQ1 drift, SQ3/SQ4 hard cases, SQ2/SQ5 controls, and a small TN sanity pack
2026-05-04 Step 15 update: implemented EGPv2/fix_checklist.md; made EGPv2 Matter-aware by formatting protocol values in signals.py, adding protocol guidance in prompts.py, turning the pipeline into a bounded 2-round workflow, and enforcing supervisor decisions in code; verified with compileall, preview-only, and a 1-episode local smoke test
2026-05-04 Step 15 update: rebalanced EGPv2 decision policy - prompts now separate direct-evidence requirements for device faults from temporal/cross-device evidence for behavior and safety anomalies; verifier no longer collapses insufficient evidence into none by default; supervisor code gate now hard-blocks only explicit abstain; verified with compileall and a 2-episode TP/FP local smoke pair (SQ4_TP_B_0721, SQ4_FP_B_0885)
2026-05-04 Step 15 update: locked EGPv2 request temperature to 0.0 in api_client.py for both thinking and non-thinking modes, removing the earlier no-thinking temperature bump; verified with compileall and direct client instantiation checks
2026-05-04 Step 15 update: tightened EGPv2/prompts.py for SQ3/SQ4/SQ5 broad-query failures - added scope-coverage requirements for composite/emergency analysis, blocked escalation from single transient None or missing logs, and added stricter trigger rules for sensor_malfunction, unattended_cooking, fire_risk, and intrusion-style labels; verified with compileall
2026-05-04 Step 15 update: created EGPv2_1/ as a prompt-only branch from EGPv2/ for minimal SQ-targeted tuning; preserved the latest SQ3/SQ5 conservative gains while restoring SQ1 device-health recall and SQ4 composite-safety recall by (1) strengthening device-health chunk selection around retries/follow-up/recovery, (2) restricting secondary_task_profile=device-health to explicit fault evidence, and (3) allowing local hazardous sequences with vulnerable context plus delayed/weak mitigation to remain anomalous; verified with python -m compileall EGPv2_1 and python EGPv2_1/run_egpv2.py --help
2026-05-05 Step 15 update: refined EGPv2_1 after diag60 analysis - added stricter anti-false-alarm rules for SQ1/SQ2-style device-health cases in prompts.py (single transient spike is not enough for sensor_malfunction; delayed auto-lock / long unlocked interval is not enough for lock_malfunction; device-health verdicts should inspect pre/post outcome chunks rather than a single local snippet); fixed EGPv2_1/run_egpv2.py summary/version label from EGPv2 to EGPv2.1, and corrected the already-generated results/qwen36_35B_egpv2_1diag60/summary.json pipeline field
2026-05-05 Step 15 update: applied MISS-driven recall fixes to EGPv2_1 - added query-intent guardrails in pipeline.py so behavior/security/composite queries are no longer silently collapsed into device-health, changed second-round refinement to prioritize supervisor-requested chunks plus adjacent context instead of reusing the original safe chunk set, and strengthened prompts.py so non-device-health queries do not get derailed by transient telemetry noise while verifier no longer defaults to none when a coherent anomaly sequence still directly answers the query; verified with python -m compileall EGPv2_1, python EGPv2_1/run_egpv2.py --help, and --preview-only smoke run
2026-05-05 Step 15 update: added FP-recovery tuning to EGPv2_1/prompts.py after recall overshoot - kept the new query-intent / second-round retrieval logic, but tightened anomaly acceptance for unattended_cooking, fire_risk, intrusion, and behavioral_anomaly so absence-based supervision assumptions, sparse lock/contact activity, missing OFF logs near truncation, and one ambiguous telemetry inconsistency are no longer enough to force anomaly; verified with python -m compileall EGPv2_1 and python EGPv2_1/run_egpv2.py --help
2026-05-05 Step 15 update: built a hybrid EGPv2_1 prompt variant for full-run testing - preserved the newer pipeline.py query-intent and second-round retrieval fixes, but rolled back the last round's global verifier/investigator over-tightening for single-event-safety and behavior-sequence, keeping the stricter absence-based safeguards only for composite-safety / emergency-response so SQ4-style FP control is retained without globally suppressing SQ1/SQ2/SQ3 recall; verified with python -m compileall EGPv2_1 and python EGPv2_1/run_egpv2.py --help
2026-05-05 Step 16: created EGPv3/ as a new Debate-then-Judge architecture - replaced the serial Triage -> Investigator -> Supervisor -> Verifier veto chain with a neutral Extractor plus heterogeneous debate (Prosecutor for recall, Defender for precision) followed by a Judge that sees both arguments and the same raw focused evidence; implemented EGPv3/prompts.py, EGPv3/pipeline.py, EGPv3/run_egpv3.py, updated README.md, and reused the existing OpenAI-compatible client / benchmark / signal stack; verified with python -m compileall EGPv3, python EGPv3/run_egpv3.py --help, and python EGPv3/run_egpv3.py --preview-only --max-episodes 1
2026-05-05 Step 16 update: created EGPv3_1/ as a stronger debate variant focused on SQ3/TN false-alarm recovery - upgraded Defender from a parallel benign narrator into a rebuttal role that explicitly sees and attacks the Prosecutor's claims point-by-point, and upgraded the Judge prompt to apply a burden-of-proof test (direct evidence vs absence-based inference vs coherent normal routine) before awarding anomaly; implemented EGPv3_1/prompts.py, wired defender access to prosecutor output in EGPv3_1/pipeline.py, added EGPv3_1/run_egpv3_1.py, and verified with python -m compileall EGPv3_1, python EGPv3_1/run_egpv3_1.py --help, and python EGPv3_1/run_egpv3_1.py --preview-only --max-episodes 1
2026-05-05 Step 16 update: created EGPv3_2/ to repair the over-conservative EGPv3_1 burden rules - kept the stronger rebuttal-style Defender, but recalibrated Prosecutor and Judge so anomaly can win on multi-signal convergence even without explicit fault codes, while a benign story only wins if it is positively supported by the logs rather than merely plausible; added query-type alignment pressure to Prosecutor, support-quality fields to both debate sides, and a support-vs-speculation burden test in Judge; verified with python -m compileall EGPv3_2, python EGPv3_2/run_egpv3_2.py --help, and python EGPv3_2/run_egpv3_2.py --preview-only --max-episodes 1
2026-05-05 Step 17: implemented EGPv4/ following the ADISR design - added a deterministic rule-based Evidence Extractor (extractor.py) that compresses raw device logs into key events / temperature trends / occupancy summaries / alert lines, then feeds a 3-stage adversarial pipeline (Prosecutor -> Defender -> Judge) with asymmetric burden-of-proof prompts in prompts.py; created pipeline.py, run_egpv4.py, README.md, and reused the existing OpenAI-compatible client / benchmark / signal stack; verified with python -m compileall EGPv4, python EGPv4/run_egpv4.py --help, and python EGPv4/run_egpv4.py --preview-only --max-episodes 1
2026-05-06 Step 18: created isolated DPO training-data generator DPODataGen/ 鈥?added a 40-scenario training bank disjoint from the 35 benchmark scenarios, split-level scenario isolation (train_pref_v1 vs dev_pref_v1), synthetic Matter-style episode generation reusing HomeState + normal behavior streams, and pairs.jsonl assembly with rule chosen answers plus constructed rejected answers; verified with python -m compileall DPODataGen, python -m DPODataGen.run_generate_dpo --split both --max-episodes 6 --overwrite, and direct-script execution python DPODataGen/run_generate_dpo.py --split train --max-episodes 1 --overwrite
2026-05-06 Step 18 update: replaced DPO pair prompt construction with a DPO-specific compact prompt builder in DPODataGen/prompt_builder.py 鈥?prompts now keep focused rooms/devices, local event windows, sparse device history, and multi-day activity summaries instead of full raw logs; added audit_dataset.py for prompt-length checks and export_model_tasks.py for later strong/weak model answer collection; smoke-verified on data_dpo_smoke2/ with prompt length reduced from ~200k chars mean to ~8k chars mean (python -m DPODataGen.run_generate_dpo --split both --max-episodes 20 --output-root data_dpo_smoke2 --overwrite, python DPODataGen/audit_dataset.py --root .\\data_dpo_smoke2)
2026-05-06 Step 18 update: added model-answer collection and pair assembly utilities for DPO curation 鈥?collect_api_answers.py calls OpenAI-compatible APIs with resume/no-thinking support, assemble_preference_pairs.py scores strong/weak model outputs against each episode and upgrades base pairs to strong_model chosen / weak_model_actual_error rejected when usable; verified with python -m compileall DPODataGen, python DPODataGen/export_model_tasks.py --split train --input-root data_dpo_v2 --output tmp_dpo/train_tasks_5.jsonl --max-pairs 5, and fallback assembly smoke python DPODataGen/assemble_preference_pairs.py --split train --input-root data_dpo_v2 --output tmp_dpo/train_pairs_fallback.jsonl --report tmp_dpo/train_pairs_fallback_report.json
2026-05-07 Step 19: packaged a self-contained server-side DPO bundle in DPO_QWEN35_SERVER_BUNDLE/ 鈥?copied raw final preference pairs and reports, converted them into TRL conversational-format datasets (train_dpo_clean.jsonl, dev_dpo_clean.jsonl), dropped weak-model parse_fail rejected pairs for the clean split (train: 2500鈫?435, dev: 300鈫?92), and added standalone server scripts (scripts/train_dpo.py, scripts/analyze_token_lengths.py), requirements.txt, and run_train.sh; verified with python -m compileall DPO_QWEN35_SERVER_BUNDLE and manifest/schema checks
2026-05-07 Step 19 update: hardened the DPO server bundle for mixed TRL versions and 2xA100 40G memory limits 閳?scripts/train_dpo.py now conditionally enables precompute_ref_log_probs when supported, and run_train.sh now defaults to max_length=6144 / max_prompt_length=5632 with reference-logprob precomputation enabled to reduce DPO reference-forward OOM risk
2026-05-07 Step 19 update: added an aggressive low-memory DPO path in DPO_QWEN35_SERVER_BUNDLE/ 閳?scripts/train_dpo.py now conditionally forwards precompute_ref_log_probs / reference_free / truncation-related args to whichever TRL API surface is available, and new run_train_lowmem.sh cuts the working sequence to 4096/3584/512, uses keep_end truncation, paged_adamw_8bit, LoRA rank 16, and flash_attention_2 for a substantially lower VRAM footprint
2026-05-07 Step 19 update: added persistent reference-logprob caching for DPO training 閳?DPO_QWEN35_SERVER_BUNDLE/scripts/train_dpo.py can now load_from_disk cached datasets with ref_chosen_logps/ref_rejected_logps, or trigger TRL precompute once and save_to_disk for reuse; also added run_precompute_ref_logps_lowmem.sh so the costly precompute stage can be finished and cached before actual training resumes
2026-05-07 Step 19 update: switched the low-memory DPO entrypoint to reference_free mode 閳?run_train_lowmem.sh now trains without an explicit reference model, and scripts/train_dpo.py skips loading or saving ref_logps caches when --reference-free is enabled so stale reference caches are not mistakenly reused
2026-05-07 Step 19 update: added an ultra-low-memory reference_free DPO path 閳?scripts/train_dpo.py now conditionally forwards padding_free and use_logits_to_keep when supported by the installed TRL version, and new run_train_ultralowmem_ref_free.sh drops the sequence budget to 2048/1536/256, reduces LoRA rank to 8, relaxes eval frequency, and enables both memory-saving flags for a last-pass 9B DPO attempt
2026-05-07 Step 19 update: rebalanced the ultra-low-memory reference_free launch profile toward longer prompts under an sdpa-only server constraint 閳?run_train_ultralowmem_ref_free.sh now allocates 3584/3072/384 tokens to preserve more episode context, removes the experimental padding_free flag, and switches attention implementation from flash_attention_2 to sdpa
2026-05-07 Step 19 update: rolled the ultra-low-memory reference_free profile back to the last known runnable budget after the 3584/3072/384 attempt OOMed under sdpa 閳?run_train_ultralowmem_ref_free.sh is restored to 2048/1536/256 with use_logits_to_keep, LoRA rank 8, and sdpa so the server can continue from the previously stable training envelope
2026-05-08 Step 20: created DPODataGenFullLog/ to rebuild preference data with benchmark-style full raw logs instead of compressed DPO prompts 閳?added build_full_log_dataset.py, prompt_builder.py, export_model_tasks.py, collect_api_answers.py, assemble_preference_pairs.py, audit_dataset.py, and README.md; rebuilt data_dpo_full_log_v1/ from existing data_dpo_v2 episodes/pairs (train=2500, dev=300) and audited prompt lengths showing full-log prompts are extremely long (train mean ~197.5k chars, p95 ~454k, max ~609k), while answer lengths remain short

| 2026-05-10 | Step 20 update: packaged DPO_FULLLOG_QWEN2B_SERVER_BUNDLE for direct server-side 2B DPO experiments; copied full-log TRL datasets (clean and full), raw final pairs, and reports; added GPU-selectable run_train.sh mapping 0=GPU0, 1=GPU1, 2=dual GPU; default profile uses reference_free + QLoRA + bf16 + sdpa with longer sequence budgets than the previous 9B ultra-lowmem path |

| 2026-05-10 | Step 21: packaged SFT_FULLLOG_QWEN2B_SERVER_BUNDLE for direct server-side Qwen3.5-2B full-log SFT; converted chosen responses from full-log preference pairs into train_sft/dev_sft conversational datasets; added train_sft.py, analyze_sft_lengths.py, compare_eval_results.py, and GPU-selectable run_train.sh with conservative default sequence lengths for memory stability |

| 2026-05-11 | Step 21 update: packaged SFT_FULLLOG_QWEN2B_V2_SERVER_BUNDLE for a targeted full-log SFT v2 path; rebuilt chosen-only supervision from final full-log preference pairs, normalized weak rule answers into canonical JSON, preserved 879 strong-model chosen answers, and created a 5010-example focus split that upweights SQ3/SQ4 and high-difficulty TP/FP cases for direct benchmark-oriented repair |