EGPv2.1 MISS Checklist on qwen36_35B_egpv2_1_1diag60
Snapshot
- Total MISS: 16 / 28 TP
- By SQ:
  - SQ1: 4
  - SQ2: 2
  - SQ3: 5
  - SQ4: 5
  - SQ5: 0
- By workflow behavior:
  - primary_task_profile=device-health: 9 / 16
  - supervisor_action=refine_investigation: 14 / 16
  - evidence_sufficient=false: 14 / 16
- Final prediction: all 16 were collapsed to is_anomaly=false, threat_type=none
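As a quick consistency check on the snapshot, the per-SQ counts can be summed and turned into a recall figure. This is only arithmetic on the numbers listed above; the variable names are illustrative and not part of the pipeline.

```python
# Counts copied from the snapshot; nothing here is measured independently.
total_tp = 28
misses_by_sq = {"SQ1": 4, "SQ2": 2, "SQ3": 5, "SQ4": 5, "SQ5": 0}

# The per-SQ breakdown should add up to the reported total of 16 misses.
total_miss = sum(misses_by_sq.values())

# Recall on true-positive episodes: detected TPs over all TPs.
recall = (total_tp - total_miss) / total_tp

# Share of all misses contributed by each SQ.
share_by_sq = {sq: n / total_miss for sq, n in misses_by_sq.items()}

print(f"total misses: {total_miss}, recall on TP episodes: {recall:.3f}")
for sq, share in share_by_sq.items():
    print(f"{sq}: {share:.1%} of misses")
```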
Cluster 1: SQ1 real device faults are still over-suppressed
Affected episodes:
- SQ1_TP_A_0004 / DF-06 / actuator_stuck
- SQ1_TP_A_0006 / DF-02 / sensor_drift
- SQ1_TP_B_0011 / DF-05 / safety_device_failure
- SQ1_TP_C_0005 / DF-01 / sensor_stuck
Observed pattern:
- The new anti-false-alarm rules are working, but in these 4 cases they over-regularize real faults.
- The model often explains away the anomaly as:
  - transient communication issue
  - missing logs
  - normal continuous readings in other sensors
  - incomplete evidence, therefore normal
- Focus selection is still too narrow in some cases:
  - C00 + C13 only
  - C00 + C20 only
Diagnosis:
- The current prompt now rejects weak fault claims well, but it still lacks a strong positive rule for:
  - persistent stuck behavior
  - slow drift across chunks
  - safety device failing in the presence of correlated hazard context
  - actuator not changing despite command or expected physical consequence
Next patch target:
- Strengthen positive triggers for real device-health anomalies.
- Especially add explicit language for:
  - cross-chunk persistence
  - monotonic drift without return
  - commanded actuator state not physically progressing
  - safety-device silence/failure when hazard-side evidence is present
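The patch itself is prompt language, but the first three triggers can be pinned down as checks over per-chunk readings, which may help when wording the rules. A minimal sketch, assuming each chunk is a non-empty list of numeric readings; the function names, thresholds, and data layout are hypothetical and not taken from the pipeline.

```python
from statistics import mean
from typing import Sequence


def stuck_across_chunks(chunks: Sequence[Sequence[float]],
                        tol: float = 1e-6, min_chunks: int = 3) -> bool:
    """Cross-chunk persistence: the reading stays within `tol` over >= min_chunks chunks."""
    if len(chunks) < min_chunks:
        return False
    values = [v for chunk in chunks for v in chunk]
    return bool(values) and max(values) - min(values) <= tol


def monotonic_drift(chunks: Sequence[Sequence[float]],
                    min_total_shift: float, min_chunks: int = 3) -> bool:
    """Drift without return: per-chunk means move in one direction only."""
    if len(chunks) < min_chunks:
        return False
    means = [mean(chunk) for chunk in chunks]  # assumes non-empty chunks
    steps = [b - a for a, b in zip(means, means[1:])]
    one_direction = all(s > 0 for s in steps) or all(s < 0 for s in steps)
    return one_direction and abs(means[-1] - means[0]) >= min_total_shift


def actuator_not_progressing(commanded_target: float,
                             positions: Sequence[float],
                             min_progress: float) -> bool:
    """Commanded state not physically progressing toward the target."""
    if len(positions) < 2:
        return False
    start_gap = abs(commanded_target - positions[0])
    end_gap = abs(commanded_target - positions[-1])
    return start_gap > min_progress and (start_gap - end_gap) < min_progress
```

A constant reading is also what a healthy sensor in a stable room produces, so a check like `stuck_across_chunks` would only count as a positive trigger alongside an expected physical change. The safety-device trigger is left out because it depends on hazard-side context rather than a single signal.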
Cluster 2: SQ2/SQ3 still drift into wrong task understanding
Affected episodes:
- SQ2_TP_B_0192 / INS-02 / intrusion
- SQ2_TP_B_0220 / WD-03 / water_leak
- SQ3_TP_A_0433 / INS-05 / credential_theft
- SQ3_TP_B_0452 / BA-01 / behavioral_anomaly
- SQ3_TP_B_0457 / INS-01 / intrusion
- SQ3_TP_C_0444 / CH-04 / child_safety
Observed pattern:
- These are not being rejected because the model saw no anomaly at all.
- They are being rejected because the pipeline often frames them as:
  - device-health
  - normal cooking
  - transient telemetry noise
  - ordinary evening routine
Strong symptom:
- SQ3 misses are dominated by task-profile drift:
  - 4 / 5 SQ3 misses have primary_task_profile=device-health
Diagnosis:
- The workflow is still too attracted to local sensor irregularities (None, spike, dropout, lock flip).
- Once triage drifts there, later stages keep discussing telemetry quality instead of:
  - access-path inconsistency
  - identity / credential mismatch
  - subject-specific risk
  - behavior-sequence abnormality
Next patch target:
- In triage, if the query is about behavior, intrusion, vulnerable groups, or security events:
  - do not let isolated telemetry oddities dominate suspected_patterns
  - prioritize entry path, human sequence, device-use context, and room-to-room progression
- In investigator/verifier:
  - explicitly demote “normal cooking” as a default explanation unless it actually explains the queried abnormal event
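One way to read the triage rule is as a reordering of suspected_patterns: telemetry oddities stay in the list but sit behind contextual patterns whenever the query concerns behavior or security. A rough sketch under that reading; the topic labels and pattern names are made up for illustration, and the real triage stage is prompt-driven rather than coded like this.

```python
from typing import List, Set

# Hypothetical labels for the irregularities named in the diagnosis.
TELEMETRY_ODDITIES = {"none_value", "spike", "dropout", "lock_flip"}
# Hypothetical topic tags for behavior / intrusion / vulnerable-group / security queries.
BEHAVIOR_QUERY_TOPICS = {"behavior", "intrusion", "vulnerable_group", "security_event"}


def rank_suspected_patterns(query_topics: Set[str], patterns: List[str]) -> List[str]:
    """Put contextual patterns ahead of isolated telemetry oddities for behavior-type queries."""
    if not (query_topics & BEHAVIOR_QUERY_TOPICS):
        # Device-health style queries keep the original ordering.
        return list(patterns)
    contextual = [p for p in patterns if p not in TELEMETRY_ODDITIES]
    telemetry = [p for p in patterns if p in TELEMETRY_ODDITIES]
    # Oddities are demoted, not dropped, so no evidence is lost.
    return contextual + telemetry


ranked = rank_suspected_patterns(
    {"intrusion"},
    ["spike", "entry_path_inconsistency", "dropout", "room_to_room_progression"],
)
print(ranked)
```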
Cluster 3: SQ4 composite-safety is now closer, but retrieval/evidence assembly still leaks recall
Affected episodes:
- SQ4_TP_A_0720 / FG-01 / unattended_cooking
- SQ4_TP_B_0721 / FG-02 / fire_risk
- SQ4_TP_B_0722 / BA-01 / behavioral_anomaly
- SQ4_TP_B_0768 / BA-03 / behavioral_anomaly
- SQ4_TP_D_0752 / EL-02 / possible_fall
Observed pattern:
- Triage is at least choosing composite-safety, so the high-level framing is better than before.
- But the final answer still collapses to:
  - “no definitive safety anomaly”
  - “transient telemetry issues”
  - “incomplete logs, low confidence”
Most important concrete failure:
- SQ4_TP_A_0720:
  - supervisor explicitly says the investigator identified the right anomaly region
  - but the focused chunks did not actually include the needed chunks
  - so the final stage had no solid evidence to commit
Diagnosis:
- The current SQ4 weakness is less about task-profile drift and more about:
  - wrong chunk retrieval
  - incomplete hazard-context assembly
  - verifier collapsing refine_investigation into none
Next patch target:
- When primary_task_profile=composite-safety and a hazard candidate exists:
  - enforce inclusion of the hazard chunk itself
  - plus one chunk for human/vulnerability context
  - plus one chunk for mitigation/outcome
- If supervisor says refine_investigation and names missing critical chunks, the second round should bias harder toward those chunks instead of reusing safe chunks
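The retrieval rule above could look roughly like the following if focus-chunk selection were done in code. This is a sketch under assumptions: the function, its parameters, and the chunk budget are hypothetical, and apart from C00 and C13 the chunk IDs are invented for the example.

```python
from typing import List, Sequence


def select_focus_chunks(
    hazard_chunk: str,
    human_context: Sequence[str],
    mitigation: Sequence[str],
    supervisor_missing: Sequence[str] = (),
    fallback: Sequence[str] = (),
    budget: int = 4,
) -> List[str]:
    """Mandatory chunks always go in; the budget only limits fallback chunks."""
    selected: List[str] = []

    def add(chunk_id: str) -> None:
        if chunk_id not in selected:
            selected.append(chunk_id)

    add(hazard_chunk)              # the hazard chunk itself
    if human_context:
        add(human_context[0])      # one chunk of human/vulnerability context
    if mitigation:
        add(mitigation[0])         # one chunk of mitigation/outcome
    for chunk_id in supervisor_missing:
        add(chunk_id)              # chunks the supervisor named as missing

    for chunk_id in fallback:      # previously used "safe" chunks fill what is left
        if len(selected) >= budget:
            break
        add(chunk_id)
    return selected


first_round = select_focus_chunks("C07", ["C05", "C06"], ["C09"],
                                  fallback=["C00", "C13"])
second_round = select_focus_chunks("C07", ["C05", "C06"], ["C09"],
                                   supervisor_missing=["C08", "C07"],
                                   fallback=["C00", "C13"])
print(first_round, second_round)
```

In the second round the supervisor-named chunk displaces the fallback chunk instead of competing with it, which is the bias the patch target asks for.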
Cluster 4: Workflow-level over-conservatism after supervisor uncertainty
Evidence:
- supervisor_action=refine_investigation: 14 / 16 misses
- evidence_sufficient=false: 14 / 16 misses
- final verifier still returned none for all 16 misses
Diagnosis:
- The intended rule was:
  - evidence_sufficient=false does not mean no anomaly
- In practice, the verifier still behaves as if:
  - uncertain = reject anomaly
Interpretation:
- The workflow has solved much of the false-alarm problem.
- The remaining recall gap is now largely a “decision policy under uncertainty” problem.
Next patch target:
- Prevent the verifier from collapsing every plausible-but-incomplete TP into none.
- For non-device-fault cases especially:
  - if there is a coherent anomaly hypothesis,
  - and supervisor did not abstain,
  - and the anomaly directly answers the query,
  - prefer a low/medium-confidence anomaly over a blanket normal verdict.
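Written as a decision function, the intended policy might look like the sketch below. The field names mirror the identifiers used in this checklist, but the dataclass, the confidence labels, and the "high" branch for sufficient evidence are assumptions rather than the current verifier contract. The sketch also applies the rule to every profile, whereas the note above stresses non-device-fault cases.

```python
from dataclasses import dataclass
from typing import Dict, Union


@dataclass
class VerifierInput:
    has_coherent_hypothesis: bool
    answers_query: bool
    supervisor_action: str      # e.g. "refine_investigation" or "abstain"
    evidence_sufficient: bool
    hypothesized_threat: str    # e.g. "intrusion"


def decide(v: VerifierInput) -> Dict[str, Union[bool, str]]:
    """Uncertainty lowers confidence; it does not flip the verdict to normal."""
    plausible = (
        v.has_coherent_hypothesis
        and v.answers_query
        and v.supervisor_action != "abstain"
    )
    if not plausible:
        return {"is_anomaly": False, "threat_type": "none", "confidence": "low"}
    # Assumption: incomplete evidence maps to "low", sufficient evidence to "high".
    confidence = "high" if v.evidence_sufficient else "low"
    return {
        "is_anomaly": True,
        "threat_type": v.hypothesized_threat,
        "confidence": confidence,
    }


verdict = decide(VerifierInput(True, True, "refine_investigation", False, "intrusion"))
print(verdict)
```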
Priority Order for Next Fix
- Fix SQ3/SQ2 task-profile drift away from telemetry-noise narratives.
- Fix SQ4 second-round retrieval so supervisor-requested hazard chunks are actually pulled in.
- Add stronger positive rules for real SQ1 faults so anti-false-alarm logic does not suppress true persistence/drift/stuck cases.
- Tighten verifier policy so refine_investigation != normal.