SafeHome Benchmark: Diversity & Hardness Report
1. Overall Scale
| Metric | Value |
|---|---|
| Total episodes | 1200 |
| True Positive (TP) | 550 |
| False Positive (FP) | 400 |
| True Negative (TN) | 250 |
| Unique anomaly scenarios | 33 |
| Unique FP scenarios | 35 |
| Anomaly categories | 7 |
| Query types (SQ) | 5 |
| Home layouts | 4 |
| Resident profiles | 3 |
| Difficulty levels | 3 (L1/L2/L3) |
2. Distribution by Query Type
| SQ Type | Description | TP | FP | TN | Total |
|---|---|---|---|---|---|
| SQ1 | Device Health Diagnosis | 80 | 55 | 55 | 190 |
| SQ2 | Single-Event Safety Judgment | 105 | 80 | 55 | 240 |
| SQ3 | Behavioral Sequence Analysis | 130 | 105 | 55 | 290 |
| SQ4 | Composite Safety Reasoning | 130 | 105 | 55 | 290 |
| SQ5 | Emergency Response Planning | 105 | 55 | 30 | 190 |
3. Distribution by Anomaly Category
| Category | TP episodes | FP episodes | Unique scenarios |
|---|---|---|---|
| Intrusion | 147 | 93 | 5 |
| Fire/Gas | 119 | 73 | 4 |
| Water Damage | 42 | 51 | 3 |
| Device Fault | 80 | 55 | 6 |
| Elderly-Specific | 45 | 23 | 7 |
| Child-Specific | 29 | 24 | 4 |
| Behavioral Anomaly | 88 | 81 | 4 |
4. Distribution by Home Layout
| Layout | Type | Devices | Episodes | Avg Events/Episode |
|---|---|---|---|---|
| A | Studio/1BR | 24 | 250 | 1210 |
| B | 2-Bedroom | 34 | 290 | 1691 |
| C | 3-Bedroom | 48 | 373 | 2106 |
| D | Elderly Living Alone | 26 | 287 | 1855 |
5. Distribution by Resident Profile
| Profile | Episodes | Applicable Layouts |
|---|---|---|
| Young Professional | 400 | A, B |
| Family with Children | 513 | B, C |
| Elderly Living Alone | 287 | D |
6. Event Count Statistics
| Metric | All | Layout A | Layout B | Layout C | Layout D |
|---|---|---|---|---|---|
| Min | 1114 | 1114 | 1392 | 1488 | 1844 |
| Max | 2475 | 1354 | 1849 | 2475 | 1894 |
| Mean | 1759 | 1210 | 1691 | 2106 | 1855 |
| Median | 1840 | 1130 | 1660 | 2295 | 1854 |
| Std | 391 | 99 | 142 | 371 | 7 |
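
The overall mean in the "All" column can be cross-checked against the per-layout figures. The sketch below is a minimal illustration, assuming the overall mean is the episode-weighted average of the per-layout means, with episode counts taken from Section 4. The variable names are illustrative and not part of the benchmark tooling.

```python
# Per-layout (episode count, mean events per episode), from Sections 4 and 6.
layouts = {
    "A": (250, 1210),
    "B": (290, 1691),
    "C": (373, 2106),
    "D": (287, 1855),
}

# Episode-weighted (pooled) mean across all layouts.
total_episodes = sum(n for n, _ in layouts.values())
total_events = sum(n * mean for n, mean in layouts.values())
pooled_mean = total_events / total_episodes

print(f"episodes={total_episodes}, pooled mean={pooled_mean:.0f}")
```

Under that assumption the pooled mean rounds to the reported overall mean of 1759 events per episode.
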
7. Difficulty Distribution
7.1 By Level
| Level | Label | Score Range | Description | Episodes (TP+FP) | % of TP+FP |
|---|---|---|---|---|---|
| L1 | Basic Detection | 5-7 | Direct alarm signals, single device | 111 | 12% |
| L2 | Inference-Based Detection | 8-10 | State correlation, cross-device reasoning | 394 | 41% |
| L3 | Composite Reasoning | 11-15 | Temporal analysis, absence reasoning, high FP similarity | 445 | 47% |
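
The score ranges above amount to a simple thresholding rule. The sketch below assumes the total difficulty score is the unweighted sum of the five dimension scores (D1 to D5 in Section 7.2), each in {1, 2, 3}, which is consistent with the 5-15 range but is not stated explicitly in this report. `difficulty_level` is a hypothetical helper name.

```python
def difficulty_level(dimension_scores):
    """Map five per-dimension scores (each 1, 2, or 3) to (total, level).

    Assumption: the total score is the plain sum of the five dimensions.
    Thresholds follow the Score Range column: 5-7 -> L1, 8-10 -> L2, 11-15 -> L3.
    """
    if len(dimension_scores) != 5:
        raise ValueError("expected exactly five dimension scores")
    if any(score not in (1, 2, 3) for score in dimension_scores):
        raise ValueError("each dimension score must be 1, 2, or 3")

    total = sum(dimension_scores)
    if total <= 7:
        level = "L1"
    elif total <= 10:
        level = "L2"
    else:
        level = "L3"
    return total, level


# Example: D1=2, D2=2, D3=3, D4=1, D5=3 sums to 11, which falls in L3.
print(difficulty_level([2, 2, 3, 1, 3]))
```
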
7.2 Five-Dimensional Difficulty Breakdown
| Dimension | Description | Mean | Distribution (1/2/3) |
|---|---|---|---|
| D1_evidence_count | Number of evidence pieces needed | 2.19 | 145/479/326 |
| D2_signal_directness | Signal directness (1=alarm, 3=absence) | 1.98 | 214/542/194 |
| D3_cross_device | Cross-device reasoning required | 1.91 | 304/429/217 |
| D4_temporal_span | Temporal span of evidence | 1.71 | 497/235/218 |
| D5_fp_similarity | TP/FP discrimination difficulty | 2.25 | 0/715/235 |
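
Each mean in this table follows from its 1/2/3 distribution. The sketch below recomputes them as a count-weighted average, as a worked check rather than the authors' actual script. `mean_score` is an illustrative helper name.

```python
# Episode counts at score 1, 2, and 3 for each dimension (TP+FP episodes).
distributions = {
    "D1_evidence_count": (145, 479, 326),
    "D2_signal_directness": (214, 542, 194),
    "D3_cross_device": (304, 429, 217),
    "D4_temporal_span": (497, 235, 218),
    "D5_fp_similarity": (0, 715, 235),
}


def mean_score(counts):
    """Count-weighted mean of the scores 1..len(counts)."""
    total = sum(counts)
    weighted = sum(score * n for score, n in enumerate(counts, start=1))
    return weighted / total


for name, counts in distributions.items():
    print(f"{name}: n={sum(counts)}, mean={mean_score(counts):.2f}")
```

Every distribution sums to 950 episodes (550 TP + 400 FP), and the recomputed means match the Mean column to two decimals.
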
7.3 Difficulty Score Distribution (TP+FP)
| Score | Count | Histogram (each █ ≈ 10 episodes) |
|---|---|---|
| 5 | 0 | |
| 6 | 111 | ███████████ |
| 7 | 0 | |
| 8 | 93 | █████████ |
| 9 | 137 | ██████████████ |
| 10 | 164 | ████████████████ |
| 11 | 217 | ██████████████████████ |
| 12 | 104 | ██████████ |
| 13 | 124 | ████████████ |
| 14 | 0 | |
| 15 | 0 | |
8. Anomaly Injection Characteristics (TP only)
8.1 Injection Time-of-Day
| Hour | Count | Histogram |
|---|---|---|
| 00:00 | 24 | ████████████ |
| 01:00 | 41 | █████████████████████ |
| 02:00 | 55 | ████████████████████████████ |
| 03:00 | 21 | ███████████ |
| 04:00 | 14 | ███████ |
| 05:00 | 17 | █████████ |
| 06:00 | 12 | ██████ |
| 07:00 | 30 | ███████████████ |
| 08:00 | 45 | ███████████████████████ |
| 09:00 | 24 | ████████████ |
| 10:00 | 24 | ████████████ |
| 11:00 | 14 | ███████ |
| 12:00 | 15 | ████████ |
| 13:00 | 12 | ██████ |
| 14:00 | 19 | ██████████ |
| 15:00 | 10 | █████ |
| 16:00 | 15 | ████████ |
| 17:00 | 24 | ████████████ |
| 18:00 | 30 | ███████████████ |
| 19:00 | 26 | █████████████ |
| 20:00 | 15 | ████████ |
| 21:00 | 24 | ████████████ |
| 22:00 | 13 | ███████ |
| 23:00 | 26 | █████████████ |
8.2 Anomaly Event Statistics
| Metric | Value |
|---|---|
| Anomaly events per TP episode (mean) | 4.7 |
| Anomaly events per TP episode (range) | [1, 11] |
| Anomaly devices per TP episode (mean) | 2.4 |
| Anomaly events as % of total events | 0.26% |
9. Temporal Characteristics
| Metric | Value |
|---|---|
| Event log time span (mean) | 24.7 hours |
| Event log time span (range) | [23.9, 60.0] hours |
| Sampling interval (temperature) | 5 minutes |
| Sampling interval (occupancy, occupied) | 5 minutes |
| Sampling interval (occupancy, unoccupied) | 30 minutes |
10. SQ Type × Difficulty Level Cross-tabulation
| SQ Type | L1 | L2 | L3 | TN | Total |
|---|---|---|---|---|---|
| SQ1 | 20 | 75 | 40 | 55 | 190 |
| SQ2 | 39 | 77 | 69 | 55 | 240 |
| SQ3 | 0 | 105 | 130 | 55 | 290 |
| SQ4 | 25 | 79 | 131 | 55 | 290 |
| SQ5 | 27 | 58 | 75 | 30 | 190 |
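
The margins of this cross-tabulation should agree with the per-SQ totals in Section 2 and the per-level episode counts in Section 7.1. A minimal sketch of that consistency check follows. The dictionary layout and variable names are illustrative assumptions, not the benchmark's data format.

```python
# Cell counts copied from the cross-tabulation (L1-L3 are TP+FP episodes).
crosstab = {
    "SQ1": {"L1": 20, "L2": 75, "L3": 40, "TN": 55},
    "SQ2": {"L1": 39, "L2": 77, "L3": 69, "TN": 55},
    "SQ3": {"L1": 0, "L2": 105, "L3": 130, "TN": 55},
    "SQ4": {"L1": 25, "L2": 79, "L3": 131, "TN": 55},
    "SQ5": {"L1": 27, "L2": 58, "L3": 75, "TN": 30},
}

# Row margins: total episodes per query type.
row_totals = {sq: sum(cells.values()) for sq, cells in crosstab.items()}

# Column margins: total episodes per difficulty level, plus TN.
columns = ("L1", "L2", "L3", "TN")
col_totals = {c: sum(cells[c] for cells in crosstab.values()) for c in columns}

print(row_totals)
print(col_totals)
print("grand total:", sum(row_totals.values()))
```

The row sums reproduce the Total column, and the column sums reproduce the 111/394/445 level counts and the 250 TN episodes, for 1200 episodes overall.
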
11. Comparison with SimuHome
| Dimension | SimuHome (ICLR 2026) | SafeHome (Ours) |
|---|---|---|
| Total episodes | 600 | 1,200 |
| Query/task types | 6 (QT1-QT4) | 5 (SQ1-SQ5) |
| Evaluation categories | 12 (6×F/IF) | 15 (5×TP/FP/TN) |
| Anomaly scenarios | N/A (task execution) | 35 |
| FP near-miss scenarios | N/A | 35 (1:1 paired) |
| Anomaly categories | N/A | 7 |
| Home layouts | Random config | 4 fixed archetypes |
| Resident profiles | N/A | 3 |
| Difficulty levels | Implicit (QT complexity) | 3 levels, 5-dim scored |
| Avg events per episode | ~50 (API calls) | ~1,750 (sensor logs) |
| Device protocol | Matter (17 types) | Matter 1.5.1 (13 types) |
| Evaluation focus | Proactive task execution | Reactive anomaly reasoning |