# EGPv2.1 MISS Checklist on `qwen36_35B_egpv2_1_1diag60`

## Snapshot

- Total `MISS`: 16 / 28 TP
- By SQ:
  - `SQ1`: 4
  - `SQ2`: 2
  - `SQ3`: 5
  - `SQ4`: 5
  - `SQ5`: 0
- By workflow behavior:
  - `primary_task_profile=device-health`: 9 / 16
  - `supervisor_action=refine_investigation`: 14 / 16
  - `evidence_sufficient=false`: 14 / 16
- Final prediction: all 16 were collapsed to `is_anomaly=false`, `threat_type=none`

## Cluster 1: SQ1 real device faults are still over-suppressed

Affected episodes:
- `SQ1_TP_A_0004` / `DF-06` / `actuator_stuck`
- `SQ1_TP_A_0006` / `DF-02` / `sensor_drift`
- `SQ1_TP_B_0011` / `DF-05` / `safety_device_failure`
- `SQ1_TP_C_0005` / `DF-01` / `sensor_stuck`

Observed pattern:
- The new anti-false-alarm rules are working, but in these 4 cases they over-regularize real faults.
- The model often explains away the anomaly as:
  - transient communication issue
  - missing logs
  - normal continuous readings in other sensors
  - incomplete evidence, therefore normal
- Focus selection is still too narrow in some cases:
  - `C00 + C13` only
  - `C00 + C20` only

Diagnosis:
- The current prompt now rejects weak fault claims well, but it still lacks a strong positive rule for:
  - persistent stuck behavior
  - slow drift across chunks
  - safety device failing in the presence of correlated hazard context
  - actuator not changing despite command or expected physical consequence

Next patch target:
- Strengthen positive triggers for real `device-health` anomalies.
- Especially add explicit language for:
  - cross-chunk persistence
  - monotonic drift without return
  - commanded actuator state not physically progressing
  - safety-device silence/failure when hazard-side evidence is present

## Cluster 2: SQ2/SQ3 still drift into wrong task understanding

Affected episodes:
- `SQ2_TP_B_0192` / `INS-02` / `intrusion`
- `SQ2_TP_B_0220` / `WD-03` / `water_leak`
- `SQ3_TP_A_0433` / `INS-05` / `credential_theft`
- `SQ3_TP_B_0452` / `BA-01` / `behavioral_anomaly`
- `SQ3_TP_B_0457` / `INS-01` / `intrusion`
- `SQ3_TP_C_0444` / `CH-04` / `child_safety`

Observed pattern:
- These are not being rejected because the model saw no anomaly at all.
- They are being rejected because the pipeline often frames them as:
  - `device-health`
  - normal cooking
  - transient telemetry noise
  - ordinary evening routine

Strong symptom:
- `SQ3` misses are dominated by task-profile drift:
  - 4 / 5 `SQ3` misses have `primary_task_profile=device-health`

Diagnosis:
- The workflow is still too attracted to local sensor irregularities (`None`, spike, dropout, lock flip).
- Once triage drifts there, later stages keep discussing telemetry quality instead of:
  - access-path inconsistency
  - identity / credential mismatch
  - subject-specific risk
  - behavior-sequence abnormality

Next patch target:
- In triage, if the query is about behavior, intrusion, vulnerable groups, or security events:
  - do not let isolated telemetry oddities dominate `suspected_patterns`
  - prioritize entry path, human sequence, device-use context, and room-to-room progression
- In investigator/verifier:
  - explicitly demote “normal cooking” as a default explanation unless it actually explains the queried abnormal event

## Cluster 3: SQ4 composite-safety is now closer, but retrieval/evidence assembly still leaks recall

Affected episodes:
- `SQ4_TP_A_0720` / `FG-01` / `unattended_cooking`
- `SQ4_TP_B_0721` / `FG-02` / `fire_risk`
- `SQ4_TP_B_0722` / `BA-01` / `behavioral_anomaly`
- `SQ4_TP_B_0768` / `BA-03` / `behavioral_anomaly`
- `SQ4_TP_D_0752` / `EL-02` / `possible_fall`

Observed pattern:
- Triage is at least choosing `composite-safety`, so the high-level framing is better than before.
- But the final answer still collapses to:
  - “no definitive safety anomaly”
  - “transient telemetry issues”
  - “incomplete logs, low confidence”

Most important concrete failure:
- `SQ4_TP_A_0720`:
  - supervisor explicitly says the investigator identified the right anomaly region
  - but the focused chunks did not actually include the needed chunks
  - so the final stage had no solid evidence to commit

Diagnosis:
- The current SQ4 weakness is less about task-profile drift and more about:
  - wrong chunk retrieval
  - incomplete hazard-context assembly
  - verifier collapsing `refine_investigation` into `none`

Next patch target:
- When `primary_task_profile=composite-safety` and a hazard candidate exists:
  - enforce inclusion of the hazard chunk itself
  - plus one chunk for human/vulnerability context
  - plus one chunk for mitigation/outcome
- If supervisor says `refine_investigation` and names missing critical chunks, the second round should bias harder toward those chunks instead of reusing safe chunks

## Cluster 4: Workflow-level over-conservatism after supervisor uncertainty

Evidence:
- `supervisor_action=refine_investigation`: 14 / 16 misses
- `evidence_sufficient=false`: 14 / 16 misses
- final verifier still returned `none` for all 16 misses

Diagnosis:
- The intended rule was:
  - `evidence_sufficient=false` does not mean `no anomaly`
- In practice, the verifier still behaves as if:
  - uncertain = reject anomaly

Interpretation:
- The workflow has solved much of the false-alarm problem.
- The remaining recall gap is now largely a “decision policy under uncertainty” problem.

Next patch target:
- Prevent the verifier from collapsing every plausible-but-incomplete TP into `none`.
- For non-device-fault cases especially:
  - if there is a coherent anomaly hypothesis,
  - and supervisor did not `abstain`,
  - and the anomaly directly answers the query,
  - prefer a low/medium-confidence anomaly over a blanket normal verdict.

## Priority Order for Next Fix

1. Fix `SQ3` / `SQ2` task-profile drift away from telemetry-noise narratives.
2. Fix `SQ4` second-round retrieval so supervisor-requested hazard chunks are actually pulled in.
3. Add stronger positive rules for real `SQ1` faults so anti-false-alarm logic does not suppress true persistence/drift/stuck cases.
4. Tighten verifier policy so `refine_investigation != normal`.
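
Priority items 2 and 4 could be prototyped roughly as follows. This is a minimal sketch under stated assumptions, not the pipeline's actual code: `SupervisorOutput`, `select_second_round_chunks`, `verifier_verdict`, the `missing_chunks` field, the `budget` parameter, and the `confidence` labels are all hypothetical names introduced for illustration. Only `refine_investigation`, `abstain`, `evidence_sufficient`, `is_anomaly`, and `threat_type` come from the notes above.

```python
from dataclasses import dataclass, field


@dataclass
class SupervisorOutput:
    # Hypothetical container; the real supervisor schema may differ.
    action: str                       # e.g. "refine_investigation" or "abstain"
    evidence_sufficient: bool
    missing_chunks: list[str] = field(default_factory=list)  # assumed field


def select_second_round_chunks(previous_focus, sup, budget=4):
    """Put supervisor-requested chunks first, then reuse earlier focus chunks."""
    ordered = []
    for chunk in list(sup.missing_chunks) + list(previous_focus):
        if chunk not in ordered:      # drop duplicates, keep first occurrence
            ordered.append(chunk)
    return ordered[:budget]


def verifier_verdict(sup, has_coherent_hypothesis, answers_query, threat_type):
    """Keep a plausible-but-incomplete anomaly instead of defaulting to normal."""
    if sup.evidence_sufficient and has_coherent_hypothesis:
        return {"is_anomaly": True, "threat_type": threat_type, "confidence": "high"}
    # evidence_sufficient=false alone is not treated as "no anomaly".
    if has_coherent_hypothesis and answers_query and sup.action != "abstain":
        return {"is_anomaly": True, "threat_type": threat_type, "confidence": "low"}
    return {"is_anomaly": False, "threat_type": "none", "confidence": "low"}
```

For example, with a supervisor result of `SupervisorOutput("refine_investigation", False, ["C07", "C08"])` (chunk IDs invented for the example) and a previous focus of `["C00", "C13"]`, the second round would start from `C07` and `C08` rather than reusing only `C00 + C13`, and a coherent hypothesis that answers the query would be returned as a low-confidence anomaly instead of `threat_type=none`.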