llmiotsafe/results/DIVERSITY_REPORT.md
2026-05-12 17:01:39 +08:00


SafeHome Benchmark: Diversity & Hardness Report

1. Overall Scale

| Metric | Value |
|---|---|
| Total episodes | 1200 |
| True Positive (TP) | 550 |
| False Positive (FP) | 400 |
| True Negative (TN) | 250 |
| Unique anomaly scenarios | 33 |
| Unique FP scenarios | 35 |
| Anomaly categories | 7 |
| Query types (SQ) | 5 |
| Home layouts | 4 |
| Resident profiles | 3 |
| Difficulty levels | 3 (L1/L2/L3) |

2. Distribution by Query Type

| SQ Type | Description | TP | FP | TN | Total |
|---|---|---|---|---|---|
| SQ1 | Device Health Diagnosis | 80 | 55 | 55 | 190 |
| SQ2 | Single-Event Safety Judgment | 105 | 80 | 55 | 240 |
| SQ3 | Behavioral Sequence Analysis | 130 | 105 | 55 | 290 |
| SQ4 | Composite Safety Reasoning | 130 | 105 | 55 | 290 |
| SQ5 | Emergency Response Planning | 105 | 55 | 30 | 190 |

3. Distribution by Anomaly Category

| Category | TP episodes | FP episodes | Unique scenarios |
|---|---|---|---|
| Intrusion | 147 | 93 | 5 |
| Fire/Gas | 119 | 73 | 4 |
| Water Damage | 42 | 51 | 3 |
| Device Fault | 80 | 55 | 6 |
| Elderly-Specific | 45 | 23 | 7 |
| Child-Specific | 29 | 24 | 4 |
| Behavioral Anomaly | 88 | 81 | 4 |

4. Distribution by Home Layout

| Layout | Type | Devices | Episodes | Avg Events/Episode |
|---|---|---|---|---|
| A | Studio/1BR | 24 | 250 | 1210 |
| B | 2-Bedroom | 34 | 290 | 1691 |
| C | 3-Bedroom | 48 | 373 | 2106 |
| D | Elderly Living Alone | 26 | 287 | 1855 |

5. Distribution by Resident Profile

| Profile | Episodes | Applicable Layouts |
|---|---|---|
| Young Professional | 400 | A, B |
| Family with Children | 513 | B, C |
| Elderly Living Alone | 287 | D |

6. Event Count Statistics

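| Metric | All | Layout A | Layout B | Layout C | Layout D |
|---|---|---|---|---|---|
| Min | 1114 | 1114 | 1392 | 1488 | 1844 |
| Max | 2475 | 1354 | 1849 | 2475 | 1894 |
| Mean | 1759 | 1210 | 1691 | 2106 | 1855 |
| Median | 1840 | 1130 | 1660 | 2295 | 1854 |
| Std | 391 | 99 | 142 | 371 | 7 |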
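
The "All" mean can be cross-checked against the per-layout figures, since it should equal the episode-weighted average of the layout means. A minimal sketch, using the episode counts from Section 4 and the means from the table above (the variable names are illustrative and not taken from the benchmark code):

```python
# Episode counts (Section 4) and mean events per episode (Section 6), keyed by layout.
episodes = {"A": 250, "B": 290, "C": 373, "D": 287}
mean_events = {"A": 1210, "B": 1691, "C": 2106, "D": 1855}

total_episodes = sum(episodes.values())
# Episode-weighted average of the per-layout means.
pooled_mean = sum(episodes[k] * mean_events[k] for k in episodes) / total_episodes

print(total_episodes, round(pooled_mean))  # prints: 1200 1759
```

Because the layout means are themselves rounded, the agreement is only expected to hold to the nearest event. The same pooling does not apply to the median or the standard deviation, which need the per-episode counts.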

7. Difficulty Distribution

7.1 By Level

| Level | Label | Score Range | Description | Episodes (TP+FP) | % |
|---|---|---|---|---|---|
| L1 | Basic Detection | 5-7 | Direct alarm signals, single device | 111 | 12% |
| L2 | Inferential Detection | 8-10 | State correlation, cross-device reasoning | 394 | 41% |
| L3 | Composite Reasoning | 11-15 | Temporal analysis, absence reasoning, high FP similarity | 445 | 47% |
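
The score ranges above map directly to levels. The sketch below encodes that mapping and recomputes each level's share of the 950 TP+FP episodes; `level_for_score` is a hypothetical helper name, not a function from the benchmark code.

```python
def level_for_score(score: int) -> str:
    """Map a difficulty score to its level using the ranges in Section 7.1."""
    if not 5 <= score <= 15:
        raise ValueError(f"score out of range: {score}")
    if score <= 7:
        return "L1"
    if score <= 10:
        return "L2"
    return "L3"

# Episode counts per level (TP+FP only), from the table above.
level_counts = {"L1": 111, "L2": 394, "L3": 445}
total = sum(level_counts.values())
shares = {lvl: 100 * n / total for lvl, n in level_counts.items()}

for lvl, pct in shares.items():
    print(f"{lvl}: {pct:.1f}%")
# prints:
# L1: 11.7%
# L2: 41.5%
# L3: 46.8%
```

TN episodes carry no difficulty score, which is why the denominator is 950 rather than 1200.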

7.2 Five-Dimensional Difficulty Breakdown

| Dimension | Description | Mean | Distribution (1/2/3) |
|---|---|---|---|
| D1_evidence_count | Number of evidence pieces needed | 2.19 | 145/479/326 |
| D2_signal_directness | Signal directness (1=alarm, 3=absence) | 1.98 | 214/542/194 |
| D3_cross_device | Cross-device reasoning required | 1.91 | 304/429/217 |
| D4_temporal_span | Temporal span of evidence | 1.71 | 497/235/218 |
| D5_fp_similarity | TP/FP discrimination difficulty | 2.25 | 0/715/235 |
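
Each mean in the table follows from its 1/2/3 distribution as a weighted average. The sketch below recomputes them. It also sums the five means, on the assumption (consistent with the 5-15 score range, but not stated explicitly in this report) that an episode's difficulty score is the sum of its five dimension ratings.

```python
# Counts of episodes rated 1, 2 and 3 on each dimension (Section 7.2).
distributions = {
    "D1_evidence_count": (145, 479, 326),
    "D2_signal_directness": (214, 542, 194),
    "D3_cross_device": (304, 429, 217),
    "D4_temporal_span": (497, 235, 218),
    "D5_fp_similarity": (0, 715, 235),
}

def mean_rating(counts):
    """Weighted mean of ratings 1..3 given the number of episodes at each rating."""
    total = sum(counts)
    return sum(rating * n for rating, n in enumerate(counts, start=1)) / total

means = {name: mean_rating(counts) for name, counts in distributions.items()}
for name, value in means.items():
    print(f"{name}: {value:.2f}")

# Assumption: score = sum of the five ratings, so the mean score is the sum of the means.
print(f"mean score (assumed additive): {sum(means.values()):.2f}")  # 10.03
```

The summed mean of about 10.03 equals the mean of the score counts in Section 7.3 (9530 / 950), which supports the additive reading without proving it.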

7.3 Difficulty Score Distribution (TP+FP)

Score 5: 0
Score 6: 111 ███████████
Score 7: 0
Score 8: 93 █████████
Score 9: 137 ██████████████
Score 10: 164 ████████████████
Score 11: 217 ██████████████████████
Score 12: 104 ██████████
Score 13: 124 ████████████
Score 14: 0
Score 15: 0

8. Anomaly Injection Characteristics (TP only)

8.1 Injection Time-of-Day

Hour Count
00:00 24 ████████████
01:00 41 █████████████████████
02:00 55 ████████████████████████████
03:00 21 ███████████
04:00 14 ███████
05:00 17 █████████
06:00 12 ██████
07:00 30 ███████████████
08:00 45 ███████████████████████
09:00 24 ████████████
10:00 24 ████████████
11:00 14 ███████
12:00 15 ████████
13:00 12 ██████
14:00 19 ██████████
15:00 10 █████
16:00 15 ████████
17:00 24 ████████████
18:00 30 ███████████████
19:00 26 █████████████
20:00 15 ████████
21:00 24 ████████████
22:00 13 ███████
23:00 26 █████████████

8.2 Anomaly Event Statistics

| Metric | Value |
|---|---|
| Anomaly events per TP episode (mean) | 4.7 |
| Anomaly events per TP episode (range) | [1, 11] |
| Anomaly devices per TP episode (mean) | 2.4 |
| Anomaly events as % of total events | 0.26% |

9. Temporal Characteristics

| Metric | Value |
|---|---|
| Event log time span (mean) | 24.7 hours |
| Event log time span (range) | [23.9, 60.0] hours |
| Sampling interval (temperature) | 5 minutes |
| Sampling interval (occupancy, occupied) | 5 minutes |
| Sampling interval (occupancy, unoccupied) | 30 minutes |
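
The sampling intervals above determine how many periodic readings a single sensor contributes to a log. A rough sketch, assuming one reading per full interval and a hypothetical split of the day into occupied and unoccupied hours (the 10/14 split in the usage lines is illustrative, not a benchmark figure):

```python
# Sampling intervals in minutes, from the table above.
TEMPERATURE_INTERVAL = 5
OCCUPANCY_INTERVAL = {"occupied": 5, "unoccupied": 30}

def temperature_samples(span_hours: float) -> int:
    """Number of temperature readings one sensor emits over the span."""
    return int(span_hours * 60 // TEMPERATURE_INTERVAL)

def occupancy_samples(occupied_hours: float, unoccupied_hours: float) -> int:
    """Number of occupancy readings, using the state-dependent interval."""
    occupied = int(occupied_hours * 60 // OCCUPANCY_INTERVAL["occupied"])
    unoccupied = int(unoccupied_hours * 60 // OCCUPANCY_INTERVAL["unoccupied"])
    return occupied + unoccupied

print(temperature_samples(24))    # 288
print(occupancy_samples(10, 14))  # 120 + 28 = 148
```

Under these assumptions, periodic readings alone account for a few hundred events per sensor per day, so most of an episode's event count is routine telemetry rather than state changes.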

10. SQ Type × Difficulty Level Cross-tabulation

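| SQ Type | L1 | L2 | L3 | TN | Total |
|---|---|---|---|---|---|
| SQ1 | 20 | 75 | 40 | 55 | 190 |
| SQ2 | 39 | 77 | 69 | 55 | 240 |
| SQ3 | 0 | 105 | 130 | 55 | 290 |
| SQ4 | 25 | 79 | 131 | 55 | 290 |
| SQ5 | 27 | 58 | 75 | 30 | 190 |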
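
The cross-tabulation can be checked against the earlier sections: each row should sum to the SQ totals in Section 2, and the L1/L2/L3 columns should sum to the level counts in Section 7.1. A minimal sketch with the table values hard-coded:

```python
# Rows of the cross-tabulation: (L1, L2, L3, TN) per SQ type.
crosstab = {
    "SQ1": (20, 75, 40, 55),
    "SQ2": (39, 77, 69, 55),
    "SQ3": (0, 105, 130, 55),
    "SQ4": (25, 79, 131, 55),
    "SQ5": (27, 58, 75, 30),
}
# Totals reported in Section 2 and level counts reported in Section 7.1.
sq_totals = {"SQ1": 190, "SQ2": 240, "SQ3": 290, "SQ4": 290, "SQ5": 190}
level_counts = (111, 394, 445)

# Every row must add up to its SQ total.
row_ok = all(sum(row) == sq_totals[sq] for sq, row in crosstab.items())
# Transpose the rows to get per-column sums (L1, L2, L3, TN).
column_sums = tuple(sum(col) for col in zip(*crosstab.values()))

print(row_ok)                           # True
print(column_sums)                      # (111, 394, 445, 250)
print(column_sums[:3] == level_counts)  # True
```

Both marginals agree with the earlier tables, and the TN column sums to the 250 TN episodes reported in Section 1.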

11. Comparison with SimuHome

| Dimension | SimuHome (ICLR 2026) | SafeHome (Ours) |
|---|---|---|
| Total episodes | 600 | 1,200 |
| Query/task types | 6 (QT1-QT4) | 5 (SQ1-SQ5) |
| Evaluation categories | 12 (6×F/IF) | 15 (5×TP/FP/TN) |
| Anomaly scenarios | N/A (task execution) | 35 |
| FP near-miss scenarios | N/A | 35 (1:1 paired) |
| Anomaly categories | N/A | 7 |
| Home layouts | Random config | 4 fixed archetypes |
| Resident profiles | N/A | 3 |
| Difficulty levels | Implicit (QT complexity) | 3 levels, 5-dim scored |
| Avg events per episode | ~50 (API calls) | ~1,750 (sensor logs) |
| Device protocol | Matter (17 types) | Matter 1.5.1 (13 types) |
| Evaluation focus | Proactive task execution | Reactive anomaly reasoning |