# EGPv2.1 MISS Checklist on `qwen36_35B_egpv2_1_1diag60`
## Snapshot

- Total `MISS`: 16 / 28 TP
- By SQ:
  - `SQ1`: 4
  - `SQ2`: 2
  - `SQ3`: 5
  - `SQ4`: 5
  - `SQ5`: 0
- By workflow behavior:
  - `primary_task_profile=device-health`: 9 / 16
  - `supervisor_action=refine_investigation`: 14 / 16
  - `evidence_sufficient=false`: 14 / 16
- Final prediction: all 16 were collapsed to `is_anomaly=false`, `threat_type=none`

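
The snapshot counts can be sanity-checked with a few lines of arithmetic. This is a minimal sketch: the per-SQ numbers are copied from the list above, `TOTAL_TP` assumes that 28 is the number of true-positive episodes in the run, and the helper name `miss_summary` is made up for illustration.

```python
# Miss counts per SQ group, copied from the snapshot list.
misses_by_sq = {"SQ1": 4, "SQ2": 2, "SQ3": 5, "SQ4": 5, "SQ5": 0}

# Assumption: 28 is the number of true-positive (TP) episodes in this run.
TOTAL_TP = 28


def miss_summary(misses, total_tp):
    """Return total misses, miss rate, and recall over the TP episodes."""
    total_miss = sum(misses.values())
    return {
        "total_miss": total_miss,
        "miss_rate": total_miss / total_tp,
        "recall": (total_tp - total_miss) / total_tp,
    }


summary = miss_summary(misses_by_sq, TOTAL_TP)
print(summary["total_miss"], round(summary["recall"], 3))
```

The per-SQ counts add up to the reported total of 16, which under the stated assumption corresponds to a TP recall of 12 / 28.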
## Cluster 1: SQ1 real device faults are still over-suppressed

Affected episodes:

- `SQ1_TP_A_0004` / `DF-06` / `actuator_stuck`
- `SQ1_TP_A_0006` / `DF-02` / `sensor_drift`
- `SQ1_TP_B_0011` / `DF-05` / `safety_device_failure`
- `SQ1_TP_C_0005` / `DF-01` / `sensor_stuck`

Observed pattern:

- The new anti-false-alarm rules are working, but in these 4 cases they over-suppress real faults.
- The model often explains away the anomaly as:
  - a transient communication issue
  - missing logs
  - normal continuous readings in other sensors
  - incomplete evidence, therefore normal
- Focus selection is still too narrow in some cases:
  - `C00 + C13` only
  - `C00 + C20` only

Diagnosis:

- The current prompt now rejects weak fault claims well, but it still lacks a strong positive rule for:
  - persistent stuck behavior
  - slow drift across chunks
  - a safety device failing in the presence of correlated hazard context
  - an actuator not changing despite a command or an expected physical consequence

Next patch target:

- Strengthen positive triggers for real `device-health` anomalies.
- In particular, add explicit language for:
  - cross-chunk persistence
  - monotonic drift without return
  - commanded actuator state not physically progressing
  - safety-device silence/failure when hazard-side evidence is present

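
To make the positive triggers above concrete, here is a minimal sketch of three of them as plain checks over per-chunk readings. Everything here is an assumption for illustration: the function names, the thresholds (`tol`, `min_total`, `min_progress`, `min_chunks`), and the input shape (a list of chunks, each a list of numeric readings). The actual patch is prompt language, not code, so this only pins down what "persistent", "monotonic without return", and "not progressing" could mean.

```python
def chunk_means(chunks):
    """Mean reading per non-empty chunk."""
    return [sum(c) / len(c) for c in chunks if c]


def is_persistently_stuck(chunks, tol=1e-6, min_chunks=3):
    """Cross-chunk persistence: every reading stays within `tol` of the others
    over at least `min_chunks` non-empty chunks."""
    non_empty = [c for c in chunks if c]
    if len(non_empty) < min_chunks:
        return False
    values = [v for c in non_empty for v in c]
    return max(values) - min(values) <= tol


def is_monotonic_drift(chunks, min_total=1.0, min_chunks=3):
    """Monotonic drift without return: chunk means move in one direction only
    and the total change is at least `min_total`."""
    means = chunk_means(chunks)
    if len(means) < min_chunks:
        return False
    deltas = [b - a for a, b in zip(means, means[1:])]
    one_direction = all(d > 0 for d in deltas) or all(d < 0 for d in deltas)
    return one_direction and abs(means[-1] - means[0]) >= min_total


def actuator_not_progressing(positions, commanded_target, min_progress=1.0):
    """Commanded state not physically progressing: after the command, the gap
    to the target shrinks by less than `min_progress`."""
    if len(positions) < 2:
        return False
    start_gap = abs(commanded_target - positions[0])
    end_gap = abs(commanded_target - positions[-1])
    return start_gap > min_progress and (start_gap - end_gap) < min_progress
```

A check like `is_persistently_stuck` looks across chunks by design, so it would not fire on a focus set that covers only two chunks (such as `C00 + C13`) under the default `min_chunks=3`.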
## Cluster 2: SQ2/SQ3 still drift into wrong task understanding

Affected episodes:

- `SQ2_TP_B_0192` / `INS-02` / `intrusion`
- `SQ2_TP_B_0220` / `WD-03` / `water_leak`
- `SQ3_TP_A_0433` / `INS-05` / `credential_theft`
- `SQ3_TP_B_0452` / `BA-01` / `behavioral_anomaly`
- `SQ3_TP_B_0457` / `INS-01` / `intrusion`
- `SQ3_TP_C_0444` / `CH-04` / `child_safety`

Observed pattern:

- These misses do not come from the model seeing no anomaly at all.
- They are rejected because the pipeline often frames the episode as:
  - `device-health`
  - normal cooking
  - transient telemetry noise
  - ordinary evening routine

Strong symptom:

- `SQ3` misses are dominated by task-profile drift:
  - 4 / 5 `SQ3` misses have `primary_task_profile=device-health`

Diagnosis:

- The workflow is still too attracted to local sensor irregularities (`None` readings, spikes, dropouts, lock flips).
- Once triage drifts there, later stages keep discussing telemetry quality instead of:
  - access-path inconsistency
  - identity / credential mismatch
  - subject-specific risk
  - behavior-sequence abnormality

Next patch target:

- In triage, if the query is about behavior, intrusion, vulnerable groups, or security events:
  - do not let isolated telemetry oddities dominate `suspected_patterns`
  - prioritize entry path, human sequence, device-use context, and room-to-room progression
- In investigator/verifier:
  - explicitly demote “normal cooking” as a default explanation unless it actually explains the queried abnormal event

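
The triage rule above can be sketched as a re-ranking step on `suspected_patterns`. This is an illustration under assumptions, not the pipeline's actual logic: the label sets, the query-type names, and the `(name, score)` input shape are all hypothetical.

```python
# Hypothetical label sets; the real pipeline's names may differ.
TELEMETRY_ODDITIES = {"none_reading", "spike", "dropout", "lock_flip"}
CONTEXT_FIRST_QUERIES = {"behavior", "intrusion", "vulnerable_group", "security_event"}


def rank_suspected_patterns(query_type, scored_patterns):
    """Order pattern names by score, but for behavior/security queries place
    isolated telemetry oddities after every contextual pattern.

    scored_patterns: list of (pattern_name, score) tuples.
    """
    by_score = sorted(scored_patterns, key=lambda p: p[1], reverse=True)
    if query_type not in CONTEXT_FIRST_QUERIES:
        return [name for name, _ in by_score]
    contextual = [name for name, _ in by_score if name not in TELEMETRY_ODDITIES]
    telemetry = [name for name, _ in by_score if name in TELEMETRY_ODDITIES]
    return contextual + telemetry


patterns = [("spike", 0.9), ("entry_path_inconsistency", 0.4), ("dropout", 0.6)]
print(rank_suspected_patterns("intrusion", patterns))
print(rank_suspected_patterns("device_health", patterns))
```

The telemetry patterns are demoted rather than dropped, so a real device fault can still surface when no contextual pattern is present.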
## Cluster 3: SQ4 composite-safety is now closer, but retrieval/evidence assembly still leaks recall

Affected episodes:

- `SQ4_TP_A_0720` / `FG-01` / `unattended_cooking`
- `SQ4_TP_B_0721` / `FG-02` / `fire_risk`
- `SQ4_TP_B_0722` / `BA-01` / `behavioral_anomaly`
- `SQ4_TP_B_0768` / `BA-03` / `behavioral_anomaly`
- `SQ4_TP_D_0752` / `EL-02` / `possible_fall`

Observed pattern:

- Triage is at least choosing `composite-safety`, so the high-level framing is better than before.
- But the final answer still collapses to:
  - “no definitive safety anomaly”
  - “transient telemetry issues”
  - “incomplete logs, low confidence”

Most important concrete failure:

- `SQ4_TP_A_0720`:
  - the supervisor explicitly says the investigator identified the right anomaly region
  - but the focused chunks did not actually include the needed chunks
  - so the final stage had no solid evidence on which to commit to an anomaly

Diagnosis:

- The current SQ4 weakness is less about task-profile drift and more about:
  - wrong chunk retrieval
  - incomplete hazard-context assembly
  - verifier collapsing `refine_investigation` into `none`

Next patch target:

- When `primary_task_profile=composite-safety` and a hazard candidate exists:
  - enforce inclusion of the hazard chunk itself
  - plus one chunk for human/vulnerability context
  - plus one chunk for mitigation/outcome
- If the supervisor says `refine_investigation` and names missing critical chunks, the second round should bias harder toward those chunks instead of reusing safe chunks

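
The retrieval rule above can be sketched as a selection function that fills the focus set in three passes: supervisor-requested chunks first, then one chunk per required role, then the remaining budget by score. The role labels, the `score` field, the chunk ids, and the `budget` parameter are assumptions made for this example; the real retrieval stage may represent chunks differently.

```python
# Hypothetical role labels for a composite-safety episode.
REQUIRED_ROLES = ("hazard", "human_context", "outcome")


def select_focus_chunks(candidates, requested_ids=(), budget=4):
    """Pick up to `budget` chunk ids.

    candidates: list of dicts with keys "id", "role", "score".
    requested_ids: chunk ids the supervisor named as missing (second round).
    """
    by_id = {c["id"]: c for c in candidates}
    chosen = []

    def add(chunk):
        if len(chosen) < budget and chunk["id"] not in chosen:
            chosen.append(chunk["id"])

    # Pass 1: chunks the supervisor explicitly asked for.
    for cid in requested_ids:
        if cid in by_id:
            add(by_id[cid])

    # Pass 2: make sure each required role is covered at least once.
    for role in REQUIRED_ROLES:
        if any(by_id[cid]["role"] == role for cid in chosen):
            continue
        pool = [c for c in candidates if c["role"] == role]
        if pool:
            add(max(pool, key=lambda c: c["score"]))

    # Pass 3: fill the remaining budget with the highest-scoring chunks.
    for chunk in sorted(candidates, key=lambda c: c["score"], reverse=True):
        add(chunk)
    return chosen


candidates = [
    {"id": "C00", "role": "other", "score": 0.9},
    {"id": "C13", "role": "other", "score": 0.8},
    {"id": "C07", "role": "hazard", "score": 0.5},
    {"id": "C21", "role": "hazard", "score": 0.2},
    {"id": "C08", "role": "human_context", "score": 0.4},
    {"id": "C09", "role": "outcome", "score": 0.3},
]
first_round = select_focus_chunks(candidates)
second_round = select_focus_chunks(candidates, requested_ids=["C21"])
```

In this toy input, ranking by score alone would pick `C00` and `C13` first. The role quota pulls in the hazard, context, and outcome chunks, and the second round puts the supervisor-requested `C21` ahead of everything else.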
## Cluster 4: Workflow-level over-conservatism after supervisor uncertainty

Evidence:

- `supervisor_action=refine_investigation`: 14 / 16 misses
- `evidence_sufficient=false`: 14 / 16 misses
- final verifier still returned `none` for all 16 misses

Diagnosis:

- The intended rule was:
  - `evidence_sufficient=false` does not mean `no anomaly`
- In practice, the verifier still behaves as if:
  - uncertain = reject anomaly

Interpretation:

- The workflow has solved much of the false-alarm problem.
- The remaining recall gap is now largely a “decision policy under uncertainty” problem.

Next patch target:

- Prevent the verifier from collapsing every plausible-but-incomplete TP into `none`.
- For non-device-fault cases especially:
  - if there is a coherent anomaly hypothesis,
  - and supervisor did not `abstain`,
  - and the anomaly directly answers the query,
  - prefer a low/medium-confidence anomaly over a blanket normal verdict.

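
The decision policy above can be written down as a small function. This is a sketch under assumptions: `VerifierInput`, its field names, the `confidence` output field, and the choice of `"high"` when evidence is sufficient are invented for the example. Only `is_anomaly`, `threat_type`, `evidence_sufficient`, and the `abstain` / `refine_investigation` actions come from the notes. For brevity it applies the rule to every case rather than only to non-device-fault ones.

```python
from dataclasses import dataclass


@dataclass
class VerifierInput:
    has_coherent_hypothesis: bool   # investigator produced a coherent anomaly story
    answers_query: bool             # the hypothesis directly addresses the query
    supervisor_action: str          # e.g. "refine_investigation" or "abstain"
    evidence_sufficient: bool
    hypothesis_threat_type: str     # threat type proposed by the investigator


def decide(v: VerifierInput) -> dict:
    """Insufficient evidence lowers confidence; it does not flip the verdict."""
    rejects = (
        not v.has_coherent_hypothesis
        or not v.answers_query
        or v.supervisor_action == "abstain"
    )
    if rejects:
        return {"is_anomaly": False, "threat_type": "none", "confidence": "low"}
    # Assumption: two confidence levels are enough for this illustration.
    confidence = "high" if v.evidence_sufficient else "low"
    return {
        "is_anomaly": True,
        "threat_type": v.hypothesis_threat_type,
        "confidence": confidence,
    }


uncertain_case = VerifierInput(
    has_coherent_hypothesis=True,
    answers_query=True,
    supervisor_action="refine_investigation",
    evidence_sufficient=False,
    hypothesis_threat_type="intrusion",
)
verdict = decide(uncertain_case)
```

Under this policy, the common miss signature (`refine_investigation` with `evidence_sufficient=false`) yields a low-confidence anomaly instead of `threat_type=none`.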
## Priority Order for Next Fix

1. Fix `SQ3` / `SQ2` task-profile drift away from telemetry-noise narratives.
2. Fix `SQ4` second-round retrieval so supervisor-requested hazard chunks are actually pulled in.
3. Add stronger positive rules for real `SQ1` faults so anti-false-alarm logic does not suppress true persistence/drift/stuck cases.
4. Tighten verifier policy so `refine_investigation != normal`.