EGPv2 Diagnosis Subset (60 episodes)
This subset is designed for workflow debugging, not headline scoring.
Design goals
- Stress SQ1 to see whether the pipeline drifts from device-health diagnosis into generic occupancy or intrusion narratives.
- Stress SQ3 and SQ4 on L2/L3 cases to expose chunk-selection drift, evidence hallucination, and weak supervisor corrections.
- Add a small SQ2/SQ5 control pack to check whether errors are specific to the new workflow rather than universal.
- Add 4 TN episodes to separate pure-normal false alarms from near-negative FP failures.
Composition
- SQ1: 12 episodes
  - 6 TP + 6 FP
  - one paired TP/FP example for each of DF-01 ... DF-06
- SQ3: 16 episodes
  - 8 TP + 8 FP
  - intrusion / behavior / elderly / child coverage
  - mostly L2/L3
- SQ4: 16 episodes
  - 8 TP + 8 FP
  - fire-gas / behavior / elderly / child coverage
  - mostly L2/L3
- SQ2: 6 episodes
  - intrusion / fire-gas / water-damage controls
- SQ5: 6 episodes
  - emergency-planning controls
- TN sanity: 4 episodes
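Before a run, it can help to confirm that the ID list really matches the composition above. The following is a minimal sketch, not part of the EGPv2 code: the ID pattern is inferred from the episode IDs listed in this document, and `check_composition` is a hypothetical helper name.

```python
import re
from collections import Counter

# Assumed ID layout, inferred from the IDs in this document:
# <SQ group>_<TP|FP|TN>_<letter A-D>_<4-digit index>
EPISODE_ID = re.compile(r"^(SQ[1-5])_(TP|FP|TN)_([A-D])_(\d{4})$")

# Per-bucket totals from the Composition list; TN episodes are counted
# in their own bucket regardless of SQ group.
EXPECTED = {"SQ1": 12, "SQ3": 16, "SQ4": 16, "SQ2": 6, "SQ5": 6, "TN": 4}


def check_composition(episode_ids):
    """Return a list of problems; an empty list means the subset matches."""
    problems = []
    counts = Counter()
    seen = set()
    for eid in episode_ids:
        match = EPISODE_ID.match(eid)
        if match is None:
            problems.append(f"unrecognized id: {eid}")
            continue
        if eid in seen:
            problems.append(f"duplicate id: {eid}")
            continue
        seen.add(eid)
        group, label = match.group(1), match.group(2)
        counts["TN" if label == "TN" else group] += 1
    for bucket, want in EXPECTED.items():
        if counts[bucket] != want:
            problems.append(f"{bucket}: expected {want}, found {counts[bucket]}")
    return problems
```

One way to use it, assuming the list file holds one ID per line, is to pass `Path("EGPv2/diagnosis_subset_60.txt").read_text().split()` to `check_composition` and print whatever it returns.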
Recommended run order
- Run this 60-episode subset first.
- Inspect results.jsonl and focus on:
  - egpv2_trace.triage_parsed
  - egpv2_trace.investigator_parsed
  - egpv2_trace.supervisor_parsed
  - final model_response
- If the same module-level error repeats, fix prompts or signal extraction before any larger run.
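The inspection step above can be partly automated. The sketch below counts how many records have a missing or empty value for each field named above. It assumes that results.jsonl holds one JSON object per line and that the three parsed fields are nested under an `egpv2_trace` object; neither assumption is confirmed here, and `summarize_results` is a hypothetical helper name.

```python
import json
from collections import Counter

# Stage keys named in this document; nesting under "egpv2_trace" is assumed.
STAGES = ("triage_parsed", "investigator_parsed", "supervisor_parsed")


def summarize_results(lines):
    """Return (record count, {field: number of records where it is missing or empty})."""
    missing = Counter()
    total = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        total += 1
        trace = record.get("egpv2_trace")
        if not isinstance(trace, dict):
            trace = {}  # treat an absent or malformed trace as all stages missing
        for stage in STAGES:
            if not trace.get(stage):
                missing[stage] += 1
        if not record.get("model_response"):
            missing["model_response"] += 1
    return total, dict(missing)
```

Calling `summarize_results(open(path, encoding="utf-8"))` on the results file gives a quick view of which module fails to parse most often. A stage that dominates the missing counts is a natural first candidate for a prompt or signal-extraction fix.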
PowerShell command
python EGPv2/run_egpv2.py `
--model Qwen/Qwen3.5-9B `
--api-base http://localhost:8000/v1 `
--no_thinking `
--max-tokens 4096 `
--workers 4 `
--output-dir results/qwen35_egpv2_diag60 `
--episode-id $(Get-Content EGPv2/diagnosis_subset_60.txt)
Group breakdown
SQ1 Drift Pack (12)
- SQ1_TP_C_0005 / SQ1_FP_C_0085: DF-01, L3
- SQ1_TP_A_0006 / SQ1_FP_A_0083: DF-02, L3
- SQ1_TP_B_0000 / SQ1_FP_B_0088: DF-03, L2
- SQ1_TP_A_0036 / SQ1_FP_A_0080: DF-04, L2
- SQ1_TP_B_0011 / SQ1_FP_B_0092: DF-05, L1
- SQ1_TP_A_0004 / SQ1_FP_C_0081: DF-06, L2
SQ3 Hard Pack (16)
- SQ3_TP_B_0457 / SQ3_FP_C_0592: INS-01, L2
- SQ3_TP_A_0433 / SQ3_FP_B_0583: INS-05, L3
- SQ3_TP_A_0478 / SQ3_FP_B_0575: BA-03, L2
- SQ3_TP_B_0452 / SQ3_FP_C_0642: BA-01, L3
- SQ3_TP_D_0464 / SQ3_FP_D_0620: EL-03, L2
- SQ3_TP_D_0443 / SQ3_FP_D_0565: EL-07, L3
- SQ3_TP_C_0447 / SQ3_FP_C_0614: CH-02, L2
- SQ3_TP_C_0444 / SQ3_FP_C_0581: CH-04, L2
SQ4 Hard Pack (16)
- SQ4_TP_B_0721 / SQ4_FP_B_0885: FG-02, L2
- SQ4_TP_A_0720 / SQ4_FP_A_0857: FG-01, L3
- SQ4_TP_B_0768 / SQ4_FP_C_0861: BA-03, L2
- SQ4_TP_B_0722 / SQ4_FP_B_0916: BA-01, L3
- SQ4_TP_D_0745 / SQ4_FP_D_0878: EL-03, L2
- SQ4_TP_D_0752 / SQ4_FP_D_0851: EL-02, L3
- SQ4_TP_C_0737 / SQ4_FP_C_0854: CH-01, L2
- SQ4_TP_C_0727 / SQ4_FP_C_0880: CH-04, L2
SQ2/SQ5 Controls (12)
- SQ2_TP_B_0192 / SQ2_FP_A_0329: INS-02, L2
- SQ2_TP_D_0206 / SQ2_FP_D_0299: FG-03, L1
- SQ2_TP_B_0220 / SQ2_FP_C_0307: WD-03, L2
- SQ5_TP_B_1054 / SQ5_FP_B_1116: INS-04, L3
- SQ5_TP_B_1037 / SQ5_FP_B_1142: FG-02, L2
- SQ5_TP_D_1012 / SQ5_FP_B_1124: WD-01, L1
TN Sanity Pack (4)
- SQ1_TN_A_0135
- SQ3_TN_A_0665
- SQ4_TN_A_0961
- SQ5_TN_A_1173