whqxbs/llmiotsafe

whqxbs e56494b487 initial commit

2026-05-12 17:01:39 +08:00

EGPv2.1 MISS Checklist on `qwen36_35B_egpv2_1_1diag60`

Snapshot

Total MISS: 16 / 28 TP
By SQ:
- SQ1: 4
- SQ2: 2
- SQ3: 5
- SQ4: 5
- SQ5: 0
By workflow behavior:
- primary_task_profile=device-health: 9 / 16
- supervisor_action=refine_investigation: 14 / 16
- evidence_sufficient=false: 14 / 16
- Final prediction: all 16 were collapsed to is_anomaly=false, threat_type=none
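
A minimal sketch of how a snapshot like the one above could be tallied from per-episode MISS records. The record layout is an assumption: the field names reuse the labels in this checklist, but the actual log schema of the pipeline is not shown here, and `summarize_misses` is a hypothetical helper.

```python
from collections import Counter


def summarize_misses(misses):
    """Tally MISS records by SQ prefix and by workflow flags.

    Each record is assumed to be a dict with the keys used below;
    this layout is illustrative, not the pipeline's real schema.
    """
    # The SQ group is taken from the episode id prefix, e.g. "SQ3" in "SQ3_TP_B_0452".
    by_sq = Counter(m["episode"].split("_", 1)[0] for m in misses)
    flags = {
        "device_health_profile": sum(
            m["primary_task_profile"] == "device-health" for m in misses
        ),
        "refine_investigation": sum(
            m["supervisor_action"] == "refine_investigation" for m in misses
        ),
        "evidence_insufficient": sum(
            m["evidence_sufficient"] is False for m in misses
        ),
    }
    return {"total": len(misses), "by_sq": dict(by_sq), "flags": flags}
```

Keeping the tally in one function makes it easy to rerun the same breakdown after each prompt patch and compare the counts.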

Cluster 1: SQ1 real device faults are still over-suppressed

Affected episodes:

SQ1_TP_A_0004 / DF-06 / actuator_stuck
SQ1_TP_A_0006 / DF-02 / sensor_drift
SQ1_TP_B_0011 / DF-05 / safety_device_failure
SQ1_TP_C_0005 / DF-01 / sensor_stuck

Observed pattern:

The new anti-false-alarm rules are working, but in these 4 cases they over-regularize real faults.
The model often explains away the anomaly as:
- transient communication issue
- missing logs
- normal continuous readings in other sensors
- incomplete evidence, therefore normal
Focus selection is still too narrow in some cases:
- C00 + C13 only
- C00 + C20 only

Diagnosis:

The current prompt now rejects weak fault claims well, but it still lacks a strong positive rule for:
- persistent stuck behavior
- slow drift across chunks
- safety device failing in the presence of correlated hazard context
- actuator not changing despite command or expected physical consequence

Next patch target:

Strengthen positive triggers for real device-health anomalies.
Especially add explicit language for:
- cross-chunk persistence
- monotonic drift without return
- commanded actuator state not physically progressing
- safety-device silence/failure when hazard-side evidence is present
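
The four triggers above could also be expressed as explicit checks, which may help when wording the prompt rules. This is a sketch under stated assumptions: readings are taken to arrive as one list of numeric values per chunk, and every function name, threshold, and argument here is hypothetical rather than part of the current workflow.

```python
from statistics import mean


def persistent_stuck(chunks, tol=1e-6, min_chunks=3):
    """True if the reading holds one value across `min_chunks` consecutive chunks."""
    run, ref = 0, None
    for readings in chunks:
        flat = bool(readings) and (max(readings) - min(readings) <= tol)
        if flat and ref is not None and abs(readings[0] - ref) <= tol:
            run += 1  # same stuck value continues into this chunk
        elif flat:
            run, ref = 1, readings[0]  # a new flat run starts here
        else:
            run, ref = 0, None  # varying or missing data breaks the run
        if run >= min_chunks:
            return True
    return False


def monotonic_drift(chunks, min_total_change=1.0):
    """True if chunk means move in one direction only and never return."""
    means = [mean(r) for r in chunks if r]
    if len(means) < 3:
        return False
    diffs = [b - a for a, b in zip(means, means[1:])]
    one_direction = all(d > 0 for d in diffs) or all(d < 0 for d in diffs)
    return one_direction and abs(means[-1] - means[0]) >= min_total_change


def actuator_not_progressing(commanded_state, observed_states):
    """True if a command was issued but no later observation matches it."""
    return bool(observed_states) and all(s != commanded_state for s in observed_states)


def safety_device_silent(hazard_evidence_present, safety_device_events):
    """True if hazard-side evidence exists but the safety device reported nothing."""
    return hazard_evidence_present and not safety_device_events
```

All four checks look across chunks rather than inside a single chunk, which is the property the anti-false-alarm rules currently do not reward.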

Cluster 2: SQ2/SQ3 still drift into wrong task understanding

Affected episodes:

SQ2_TP_B_0192 / INS-02 / intrusion
SQ2_TP_B_0220 / WD-03 / water_leak
SQ3_TP_A_0433 / INS-05 / credential_theft
SQ3_TP_B_0452 / BA-01 / behavioral_anomaly
SQ3_TP_B_0457 / INS-01 / intrusion
SQ3_TP_C_0444 / CH-04 / child_safety

Observed pattern:

These are not being rejected because the model saw no anomaly at all.
They are being rejected because the pipeline often frames them as:
- device-health
- normal cooking
- transient telemetry noise
- ordinary evening routine

Strong symptom:

SQ3 misses are dominated by task-profile drift:
- 4 / 5 SQ3 misses have primary_task_profile=device-health

Diagnosis:

The workflow is still too attracted to local sensor irregularities (None, spike, dropout, lock flip).
Once triage drifts there, later stages keep discussing telemetry quality instead of:
- access-path inconsistency
- identity / credential mismatch
- subject-specific risk
- behavior-sequence abnormality

Next patch target:

In triage, if the query is about behavior, intrusion, vulnerable groups, or security events:
- do not let isolated telemetry oddities dominate suspected_patterns
- prioritize entry path, human sequence, device-use context, and room-to-room progression
In investigator/verifier:
- explicitly demote “normal cooking” as a default explanation unless it actually explains the queried abnormal event
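
One way to picture the triage rule is as a ranking step over candidate patterns. The sketch below is illustrative only: the query-type labels, the pattern `kind` values, and the `rank_suspected_patterns` helper are assumptions, and the real triage stage is prompt-driven rather than coded this way.

```python
# Hypothetical labels for queries where telemetry oddities should not lead.
BEHAVIORAL_QUERY_TYPES = {"behavior", "intrusion", "vulnerable_group", "security_event"}

# Hypothetical labels for isolated telemetry irregularities.
TELEMETRY_KINDS = {"none_value", "spike", "dropout", "lock_flip"}


def rank_suspected_patterns(query_type, patterns):
    """Order candidate patterns for triage.

    `patterns` is a list of dicts with a 'kind' and a numeric 'score'.
    For behavioral or security queries, contextual patterns (entry path,
    human sequence, device-use context, room-to-room progression) are
    placed ahead of telemetry oddities regardless of raw score.
    """
    if query_type not in BEHAVIORAL_QUERY_TYPES:
        return sorted(patterns, key=lambda p: -p["score"])
    # False sorts before True, so non-telemetry kinds come first.
    return sorted(patterns, key=lambda p: (p["kind"] in TELEMETRY_KINDS, -p["score"]))
```

Telemetry patterns are demoted rather than dropped, so a genuine sensor issue can still surface if nothing contextual is found.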

Cluster 3: SQ4 composite-safety is now closer, but retrieval/evidence assembly still leaks recall

Affected episodes:

SQ4_TP_A_0720 / FG-01 / unattended_cooking
SQ4_TP_B_0721 / FG-02 / fire_risk
SQ4_TP_B_0722 / BA-01 / behavioral_anomaly
SQ4_TP_B_0768 / BA-03 / behavioral_anomaly
SQ4_TP_D_0752 / EL-02 / possible_fall

Observed pattern:

Triage is at least choosing composite-safety, so the high-level framing is better than before.
But the final answer still collapses to:
- “no definitive safety anomaly”
- “transient telemetry issues”
- “incomplete logs, low confidence”

Most important concrete failure:

SQ4_TP_A_0720:
- supervisor explicitly says the investigator identified the right anomaly region
- but the focused chunks did not actually include the needed chunks
- so the final stage had no solid evidence to commit

Diagnosis:

The current SQ4 weakness is less about task-profile drift and more about:
- wrong chunk retrieval
- incomplete hazard-context assembly
- verifier collapsing refine_investigation into none

Next patch target:

When primary_task_profile=composite-safety and a hazard candidate exists:
- enforce inclusion of the hazard chunk itself
- plus one chunk for human/vulnerability context
- plus one chunk for mitigation/outcome
If supervisor says refine_investigation and names missing critical chunks, the second round should bias harder toward those chunks instead of reusing safe chunks
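
The retrieval constraint above can be sketched as a small assembly step. Chunk ids follow the `C00`-style naming seen earlier, but the specific ids, the `budget` parameter, and the `assemble_focus_chunks` helper are hypothetical; the ordering shown is only one possible priority scheme.

```python
def assemble_focus_chunks(previously_selected, hazard_chunk, context_chunk,
                          outcome_chunk, supervisor_requested=(), budget=5):
    """Build the focus set for a composite-safety investigation round.

    Priority order (an assumption of this sketch):
      1. chunks the supervisor named as missing,
      2. the hazard chunk, then human/vulnerability context, then mitigation/outcome,
      3. chunks reused from the previous round.
    Duplicates and None entries are skipped; the result is cut to `budget`.
    """
    required = [hazard_chunk, context_chunk, outcome_chunk]
    ordered = []
    for chunk_id in [*supervisor_requested, *required, *previously_selected]:
        if chunk_id is not None and chunk_id not in ordered:
            ordered.append(chunk_id)
    return ordered[:budget]
```

Because previously selected chunks come last, a tight budget drops the reused chunks first instead of the ones the supervisor asked for.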

Cluster 4: Workflow-level over-conservatism after supervisor uncertainty

Evidence:

supervisor_action=refine_investigation: 14 / 16 misses
evidence_sufficient=false: 14 / 16 misses
final verifier still returned none for all 16 misses

Diagnosis:

The intended rule was:
- evidence_sufficient=false does not mean no anomaly
In practice, the verifier still behaves as if:
- uncertain = reject anomaly

Interpretation:

The workflow has solved much of the false-alarm problem.
The remaining recall gap is now largely a “decision policy under uncertainty” problem.

Next patch target:

Prevent the verifier from collapsing every plausible-but-incomplete TP into none.
For non-device-fault cases especially:
- if there is a coherent anomaly hypothesis,
- and supervisor did not abstain,
- and the anomaly directly answers the query,
- prefer a low/medium-confidence anomaly over a blanket normal verdict.
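
The target policy can be written down as a small decision function. This is a sketch, not the verifier's actual logic: `"abstain"` as a `supervisor_action` value, the `hypothesis` argument, and the `resolve_uncertain_verdict` helper are assumptions, and restricting the rule to non-device-fault cases is a simplification of "especially" above.

```python
def resolve_uncertain_verdict(hypothesis, supervisor_action, answers_query,
                              device_fault_case=False):
    """Decide the final verdict when evidence_sufficient is false.

    `hypothesis` is the candidate threat type (or None if there is no
    coherent anomaly hypothesis). Uncertainty alone does not force a
    normal verdict: a coherent hypothesis that answers the query, with a
    supervisor that did not abstain, yields a low-confidence anomaly.
    """
    plausible = (
        hypothesis is not None
        and supervisor_action != "abstain"
        and answers_query
    )
    if plausible and not device_fault_case:
        return {"is_anomaly": True, "threat_type": hypothesis, "confidence": "low"}
    # Fall back to the normal verdict only when the conditions above fail.
    return {"is_anomaly": False, "threat_type": "none", "confidence": "low"}
```

With this shape, `refine_investigation` combined with a coherent hypothesis no longer maps to `none` by default.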

Priority Order for Next Fix

1. Fix SQ3 / SQ2 task-profile drift away from telemetry-noise narratives.
2. Fix SQ4 second-round retrieval so supervisor-requested hazard chunks are actually pulled in.
3. Add stronger positive rules for real SQ1 faults so anti-false-alarm logic does not suppress true persistence/drift/stuck cases.
4. Tighten verifier policy so refine_investigation != normal.
