llmiotsafe/EGPv2_1/diagnosis_subset_60.md
2026-05-12 17:01:39 +08:00


# EGPv2 Diagnosis Subset (60 episodes)
This subset is designed for workflow debugging, not headline scoring.
## Design goals
- Stress `SQ1` to see whether the pipeline drifts from device-health diagnosis into generic occupancy or intrusion narratives.
- Stress `SQ3` and `SQ4` on `L2/L3` cases to expose chunk-selection drift, evidence hallucination, and weak supervisor corrections.
- Add a small `SQ2/SQ5` control pack to check whether errors are specific to the new workflow rather than universal.
- Add 4 `TN` episodes to separate pure-normal false alarms from near-negative `FP` failures.
## Composition
- `SQ1`: 12 episodes
  - 6 `TP` + 6 `FP`
  - one paired `TP/FP` example for each `DF-01 ... DF-06`
- `SQ3`: 16 episodes
  - 8 `TP` + 8 `FP`
  - intrusion / behavior / elderly / child coverage
  - mostly `L2/L3`
- `SQ4`: 16 episodes
  - 8 `TP` + 8 `FP`
  - fire-gas / behavior / elderly / child coverage
  - mostly `L2/L3`
- `SQ2`: 6 episodes
  - intrusion / fire-gas / water-damage controls
- `SQ5`: 6 episodes
  - emergency-planning controls
- `TN sanity`: 4 episodes
## Recommended run order
1. Run this 60-episode subset first.
2. Inspect `results.jsonl` and focus on:
   - `egpv2_trace.triage_parsed`
   - `egpv2_trace.investigator_parsed`
   - `egpv2_trace.supervisor_parsed`
   - final `model_response`
3. If the same module-level error repeats, fix prompts or signal extraction before any larger run.
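
The inspection in step 2 can be sketched in Python. This is a minimal sketch, not the project's own tooling: it assumes `results.jsonl` holds one JSON object per line, that the three parsed stages are nested under an `egpv2_trace` key, and that each record carries an `episode_id` key. The key name `episode_id` and the helpers `extract_trace`, `load_traces`, and `missing_stages` are hypothetical.

```python
import json
from pathlib import Path

# Trace stages named in the run-order list above.
TRACE_FIELDS = ("triage_parsed", "investigator_parsed", "supervisor_parsed")


def extract_trace(record):
    """Collect the three parsed trace stages and the final response from one record."""
    trace = record.get("egpv2_trace") or {}
    row = {"episode_id": record.get("episode_id")}  # assumed key name
    for field in TRACE_FIELDS:
        row[field] = trace.get(field)
    row["model_response"] = record.get("model_response")
    return row


def load_traces(path):
    """Read a JSONL file and return one extracted row per non-empty line."""
    rows = []
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                rows.append(extract_trace(json.loads(line)))
    return rows


def missing_stages(row):
    """List the trace stages that are absent or null for this episode."""
    return [field for field in TRACE_FIELDS if row[field] is None]
```

Calling `load_traces("results/qwen35_egpv2_diag60/results.jsonl")` and grouping rows by `missing_stages` is one way to spot a module that fails repeatedly; the exact location of `results.jsonl` inside the output directory is also an assumption.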
## PowerShell command
```powershell
python EGPv2/run_egpv2.py `
  --model Qwen/Qwen3.5-9B `
  --api-base http://localhost:8000/v1 `
  --no_thinking `
  --max-tokens 4096 `
  --workers 4 `
  --output-dir results/qwen35_egpv2_diag60 `
  --episode-id $(Get-Content EGPv2/diagnosis_subset_60.txt)
```
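
Before launching the run, it may be worth checking the ID file that the command reads. The sketch below assumes `EGPv2/diagnosis_subset_60.txt` lists one episode ID per line, which is how `Get-Content` would expand it into separate arguments. The helper name `check_subset_ids` is hypothetical.

```python
from collections import Counter


def check_subset_ids(lines, expected=60):
    """Count non-empty IDs and report any that appear more than once."""
    ids = [line.strip() for line in lines if line.strip()]
    duplicates = sorted(i for i, n in Counter(ids).items() if n > 1)
    return {
        "count": len(ids),
        "count_ok": len(ids) == expected,
        "duplicates": duplicates,
    }


# Example usage (path taken from the command above):
# with open("EGPv2/diagnosis_subset_60.txt", encoding="utf-8") as handle:
#     print(check_subset_ids(handle))
```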
## Group breakdown
### SQ1 Drift Pack (12)
- `SQ1_TP_C_0005` / `SQ1_FP_C_0085` : `DF-01`, `L3`
- `SQ1_TP_A_0006` / `SQ1_FP_A_0083` : `DF-02`, `L3`
- `SQ1_TP_B_0000` / `SQ1_FP_B_0088` : `DF-03`, `L2`
- `SQ1_TP_A_0036` / `SQ1_FP_A_0080` : `DF-04`, `L2`
- `SQ1_TP_B_0011` / `SQ1_FP_B_0092` : `DF-05`, `L1`
- `SQ1_TP_A_0004` / `SQ1_FP_C_0081` : `DF-06`, `L2`
### SQ3 Hard Pack (16)
- `SQ3_TP_B_0457` / `SQ3_FP_C_0592` : `INS-01`, `L2`
- `SQ3_TP_A_0433` / `SQ3_FP_B_0583` : `INS-05`, `L3`
- `SQ3_TP_A_0478` / `SQ3_FP_B_0575` : `BA-03`, `L2`
- `SQ3_TP_B_0452` / `SQ3_FP_C_0642` : `BA-01`, `L3`
- `SQ3_TP_D_0464` / `SQ3_FP_D_0620` : `EL-03`, `L2`
- `SQ3_TP_D_0443` / `SQ3_FP_D_0565` : `EL-07`, `L3`
- `SQ3_TP_C_0447` / `SQ3_FP_C_0614` : `CH-02`, `L2`
- `SQ3_TP_C_0444` / `SQ3_FP_C_0581` : `CH-04`, `L2`
### SQ4 Hard Pack (16)
- `SQ4_TP_B_0721` / `SQ4_FP_B_0885` : `FG-02`, `L2`
- `SQ4_TP_A_0720` / `SQ4_FP_A_0857` : `FG-01`, `L3`
- `SQ4_TP_B_0768` / `SQ4_FP_C_0861` : `BA-03`, `L2`
- `SQ4_TP_B_0722` / `SQ4_FP_B_0916` : `BA-01`, `L3`
- `SQ4_TP_D_0745` / `SQ4_FP_D_0878` : `EL-03`, `L2`
- `SQ4_TP_D_0752` / `SQ4_FP_D_0851` : `EL-02`, `L3`
- `SQ4_TP_C_0737` / `SQ4_FP_C_0854` : `CH-01`, `L2`
- `SQ4_TP_C_0727` / `SQ4_FP_C_0880` : `CH-04`, `L2`
### SQ2/SQ5 Controls (12)
- `SQ2_TP_B_0192` / `SQ2_FP_A_0329` : `INS-02`, `L2`
- `SQ2_TP_D_0206` / `SQ2_FP_D_0299` : `FG-03`, `L1`
- `SQ2_TP_B_0220` / `SQ2_FP_C_0307` : `WD-03`, `L2`
- `SQ5_TP_B_1054` / `SQ5_FP_B_1116` : `INS-04`, `L3`
- `SQ5_TP_B_1037` / `SQ5_FP_B_1142` : `FG-02`, `L2`
- `SQ5_TP_D_1012` / `SQ5_FP_B_1124` : `WD-01`, `L1`
### TN Sanity Pack (4)
- `SQ1_TN_A_0135`
- `SQ3_TN_A_0665`
- `SQ4_TN_A_0961`
- `SQ5_TN_A_1173`
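
The IDs listed above appear to follow the pattern `<group>_<label>_<letter>_<number>`. As a hedged sketch, the Python below parses that pattern and tallies episodes by group and label so the subset can be compared against the Composition section. The meaning of the letter is not stated here, so it is kept as an opaque field; the field name `variant` and the helpers `parse_episode_id` and `tally` are hypothetical.

```python
import re
from collections import Counter

# Pattern inferred from the IDs in this document, e.g. SQ1_TP_C_0005.
ID_PATTERN = re.compile(r"^(SQ\d)_(TP|FP|TN)_([A-Z])_(\d{4})$")


def parse_episode_id(episode_id):
    """Split an episode ID into group, label, letter, and numeric index."""
    match = ID_PATTERN.match(episode_id)
    if match is None:
        raise ValueError(f"unexpected episode ID: {episode_id!r}")
    group, label, variant, index = match.groups()
    return {"group": group, "label": label, "variant": variant, "index": int(index)}


def tally(episode_ids):
    """Count episodes per (group, label) pair."""
    return Counter(
        (parsed["group"], parsed["label"])
        for parsed in map(parse_episode_id, episode_ids)
    )
```

Because the `TN` episodes carry an `SQ` prefix, a tally over the full subset should show one `TN` each under `SQ1`, `SQ3`, `SQ4`, and `SQ5`, alongside 6 `TP` + 6 `FP` for `SQ1`, 8 + 8 for `SQ3` and `SQ4`, and 3 + 3 for `SQ2` and `SQ5`.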