llmiotsafe/EGPv2/diagnosis_subset_60.md
2026-05-12 17:01:39 +08:00

EGPv2 Diagnosis Subset (60 episodes)

This subset is designed for workflow debugging, not headline scoring.

Design goals

  • Stress SQ1 to see whether the pipeline drifts from device-health diagnosis into generic occupancy or intrusion narratives.
  • Stress SQ3 and SQ4 on L2/L3 cases to expose chunk-selection drift, evidence hallucination, and weak supervisor corrections.
  • Add a small SQ2/SQ5 control pack to check whether errors are specific to the new workflow rather than universal.
  • Add 4 TN episodes to separate pure-normal false alarms from near-negative FP failures.

Composition

  • SQ1: 12 episodes
    • 6 TP + 6 FP
    • one paired TP/FP example for each DF-01 ... DF-06
  • SQ3: 16 episodes
    • 8 TP + 8 FP
    • intrusion / behavior / elderly / child coverage
    • mostly L2/L3
  • SQ4: 16 episodes
    • 8 TP + 8 FP
    • fire-gas / behavior / elderly / child coverage
    • mostly L2/L3
  • SQ2: 6 episodes
    • 3 TP + 3 FP
    • intrusion / fire-gas / water-damage controls
  • SQ5: 6 episodes
    • 3 TP + 3 FP
    • emergency-planning controls
  • TN sanity: 4 episodes
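
The composition above can be checked mechanically against the episode ID list before a run. The following is a minimal sketch, not part of the pipeline: the helper name `count_composition` is hypothetical, and the ID layout (SQ prefix, TP/FP/TN label, one capital letter, four-digit index) is an assumption inferred from the IDs listed in this document.

```python
import re
from collections import Counter

# Assumed ID layout, inferred from IDs such as SQ1_TP_C_0005:
# <SQ prefix>_<TP|FP|TN>_<capital letter>_<4-digit index>
ID_PATTERN = re.compile(r"^(SQ\d)_(TP|FP|TN)_([A-Z])_(\d{4})$")


def count_composition(episode_ids):
    """Count episodes per (SQ prefix, label) pair; reject malformed IDs."""
    counts = Counter()
    for episode_id in episode_ids:
        match = ID_PATTERN.match(episode_id.strip())
        if match is None:
            raise ValueError(f"unexpected episode ID: {episode_id!r}")
        counts[(match.group(1), match.group(2))] += 1
    return counts


# Three IDs taken from this document, used only as a small demonstration.
sample_ids = ["SQ1_TP_C_0005", "SQ1_FP_C_0085", "SQ1_TN_A_0135"]
counts = count_composition(sample_ids)
```

Running the same function over the lines of EGPv2/diagnosis_subset_60.txt should give 6 TP and 6 FP for SQ1, 8 and 8 for SQ3 and SQ4, and so on, with the TN episodes counted separately under their own SQ prefix.
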

Suggested workflow

  1. Run this 60-episode subset first.
  2. Inspect results.jsonl and focus on:
    • egpv2_trace.triage_parsed
    • egpv2_trace.investigator_parsed
    • egpv2_trace.supervisor_parsed
    • final model_response
  3. If the same module-level error repeats, fix prompts or signal extraction before any larger run.
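
As a starting point for step 2, the sketch below tallies how often each trace stage is empty across the result records. It is a hedged example only: it assumes each line of results.jsonl is a JSON object in which `egpv2_trace` is a nested object holding the three `*_parsed` fields, and that a null or empty value means the stage produced nothing usable. The function name `summarize_results` and the `episode_id` key in the sample data are hypothetical.

```python
import json
from collections import Counter

# Trace fields named in this document; their nesting under "egpv2_trace"
# is an assumption based on the dotted names.
TRACE_KEYS = ("triage_parsed", "investigator_parsed", "supervisor_parsed")


def summarize_results(lines):
    """Return (record count, Counter of empty fields) for JSONL lines."""
    missing = Counter()
    total = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        total += 1
        trace = record.get("egpv2_trace") or {}
        for key in TRACE_KEYS:
            if not trace.get(key):
                missing[key] += 1
        if not record.get("model_response"):
            missing["model_response"] += 1
    return total, missing


# Made-up records, for illustration only.
sample_lines = [
    '{"episode_id": "SQ1_TP_C_0005", "egpv2_trace": {"triage_parsed": {"x": 1}, '
    '"investigator_parsed": null, "supervisor_parsed": {"y": 2}}, "model_response": "ok"}',
    '{"episode_id": "SQ1_FP_C_0085", "egpv2_trace": {}, "model_response": ""}',
]
total, missing = summarize_results(sample_lines)
```

To apply it to a real run, pass an open file handle instead of `sample_lines`. A stage that dominates the `missing` counter is a reasonable first candidate for the prompt or signal-extraction fixes mentioned in step 3.
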

PowerShell command

python EGPv2/run_egpv2.py `
  --model Qwen/Qwen3.5-9B `
  --api-base http://localhost:8000/v1 `
  --no_thinking `
  --max-tokens 4096 `
  --workers 4 `
  --output-dir results/qwen35_egpv2_diag60 `
  --episode-id $(Get-Content EGPv2/diagnosis_subset_60.txt)
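
For shells other than PowerShell, the same invocation can be assembled in Python. This is a sketch under one assumption: that `--episode-id` accepts several space-separated IDs, which is what the PowerShell array expansion above implies. All flags are copied from the command above; the helper name `build_command` is hypothetical.

```python
import sys


def build_command(episode_ids, output_dir="results/qwen35_egpv2_diag60"):
    """Build the argv list for run_egpv2.py from an iterable of ID lines."""
    # Drop blank lines and surrounding whitespace from the ID list.
    ids = [line.strip() for line in episode_ids if line.strip()]
    return [
        sys.executable, "EGPv2/run_egpv2.py",
        "--model", "Qwen/Qwen3.5-9B",
        "--api-base", "http://localhost:8000/v1",
        "--no_thinking",
        "--max-tokens", "4096",
        "--workers", "4",
        "--output-dir", output_dir,
        # Assumption: the flag takes multiple IDs as separate arguments.
        "--episode-id", *ids,
    ]


# Demonstration with two IDs from this document.
cmd = build_command(["SQ1_TP_C_0005\n", "", "SQ1_FP_C_0085"])
```

In practice the list would be built from the lines of EGPv2/diagnosis_subset_60.txt and passed to `subprocess.run(cmd, check=True)` from the repository root.
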

Group breakdown

SQ1 Drift Pack (12)

  • SQ1_TP_C_0005 / SQ1_FP_C_0085 : DF-01, L3
  • SQ1_TP_A_0006 / SQ1_FP_A_0083 : DF-02, L3
  • SQ1_TP_B_0000 / SQ1_FP_B_0088 : DF-03, L2
  • SQ1_TP_A_0036 / SQ1_FP_A_0080 : DF-04, L2
  • SQ1_TP_B_0011 / SQ1_FP_B_0092 : DF-05, L1
  • SQ1_TP_A_0004 / SQ1_FP_C_0081 : DF-06, L2

SQ3 Hard Pack (16)

  • SQ3_TP_B_0457 / SQ3_FP_C_0592 : INS-01, L2
  • SQ3_TP_A_0433 / SQ3_FP_B_0583 : INS-05, L3
  • SQ3_TP_A_0478 / SQ3_FP_B_0575 : BA-03, L2
  • SQ3_TP_B_0452 / SQ3_FP_C_0642 : BA-01, L3
  • SQ3_TP_D_0464 / SQ3_FP_D_0620 : EL-03, L2
  • SQ3_TP_D_0443 / SQ3_FP_D_0565 : EL-07, L3
  • SQ3_TP_C_0447 / SQ3_FP_C_0614 : CH-02, L2
  • SQ3_TP_C_0444 / SQ3_FP_C_0581 : CH-04, L2

SQ4 Hard Pack (16)

  • SQ4_TP_B_0721 / SQ4_FP_B_0885 : FG-02, L2
  • SQ4_TP_A_0720 / SQ4_FP_A_0857 : FG-01, L3
  • SQ4_TP_B_0768 / SQ4_FP_C_0861 : BA-03, L2
  • SQ4_TP_B_0722 / SQ4_FP_B_0916 : BA-01, L3
  • SQ4_TP_D_0745 / SQ4_FP_D_0878 : EL-03, L2
  • SQ4_TP_D_0752 / SQ4_FP_D_0851 : EL-02, L3
  • SQ4_TP_C_0737 / SQ4_FP_C_0854 : CH-01, L2
  • SQ4_TP_C_0727 / SQ4_FP_C_0880 : CH-04, L2

SQ2/SQ5 Controls (12)

  • SQ2_TP_B_0192 / SQ2_FP_A_0329 : INS-02, L2
  • SQ2_TP_D_0206 / SQ2_FP_D_0299 : FG-03, L1
  • SQ2_TP_B_0220 / SQ2_FP_C_0307 : WD-03, L2
  • SQ5_TP_B_1054 / SQ5_FP_B_1116 : INS-04, L3
  • SQ5_TP_B_1037 / SQ5_FP_B_1142 : FG-02, L2
  • SQ5_TP_D_1012 / SQ5_FP_B_1124 : WD-01, L1
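
The paired entries above follow a regular "TP / FP : scenario, level" layout, so they can be parsed into records for quick checks, for example confirming that both halves of a pair share the same SQ prefix. The sketch below is illustrative only; `parse_pair` is a hypothetical helper and the pattern is inferred from the lines in this document.

```python
import re

# Pattern inferred from lines such as
# "SQ1_TP_A_0004 / SQ1_FP_C_0081 : DF-06, L2".
PAIR_PATTERN = re.compile(
    r"(?P<tp>SQ\d_TP_[A-Z]_\d{4})\s*/\s*(?P<fp>SQ\d_FP_[A-Z]_\d{4})\s*:\s*"
    r"(?P<scenario>[A-Z]+-\d{2}),\s*(?P<level>L[1-3])"
)


def parse_pair(line):
    """Parse one breakdown line into a dict with tp, fp, scenario, level."""
    match = PAIR_PATTERN.search(line)
    if match is None:
        raise ValueError(f"not a TP/FP pair line: {line!r}")
    pair = match.groupdict()
    # Both episodes of a pair are expected to share the same SQ prefix.
    if pair["tp"][:3] != pair["fp"][:3]:
        raise ValueError(f"SQ prefix mismatch in pair: {line!r}")
    return pair


pair = parse_pair("  • SQ1_TP_A_0004 / SQ1_FP_C_0081 : DF-06, L2")
```

Collecting the `level` field over all parsed pairs gives a direct way to confirm the "mostly L2/L3" claim for the SQ3 and SQ4 packs.
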

TN Sanity Pack (4)

  • SQ1_TN_A_0135
  • SQ3_TN_A_0665
  • SQ4_TN_A_0961
  • SQ5_TN_A_1173