# SafeHome Benchmark: Diversity & Hardness Report

## 1. Overall Scale

| Metric | Value |
|--------|-------|
| Total episodes | 1200 |
| True Positive (TP) | 550 |
| False Positive (FP) | 400 |
| True Negative (TN) | 250 |
| Unique anomaly scenarios | 33 |
| Unique FP scenarios | 35 |
| Anomaly categories | 7 |
| Query types (SQ) | 5 |
| Home layouts | 4 |
| Resident profiles | 3 |
| Difficulty levels | 3 (L1/L2/L3) |

## 2. Distribution by Query Type

| SQ Type | Description | TP | FP | TN | Total |
|---------|-------------|----|----|-----|-------|
| SQ1 | Device Health Diagnosis | 80 | 55 | 55 | 190 |
| SQ2 | Single-Event Safety Judgment | 105 | 80 | 55 | 240 |
| SQ3 | Behavioral Sequence Analysis | 130 | 105 | 55 | 290 |
| SQ4 | Composite Safety Reasoning | 130 | 105 | 55 | 290 |
| SQ5 | Emergency Response Planning | 105 | 55 | 30 | 190 |

## 3. Distribution by Anomaly Category

| Category | TP episodes | FP episodes | Unique scenarios |
|----------|-------------|-------------|------------------|
| Intrusion | 147 | 93 | 5 |
| Fire/Gas | 119 | 73 | 4 |
| Water Damage | 42 | 51 | 3 |
| Device Fault | 80 | 55 | 6 |
| Elderly-Specific | 45 | 23 | 7 |
| Child-Specific | 29 | 24 | 4 |
| Behavioral Anomaly | 88 | 81 | 4 |

## 4. Distribution by Home Layout

| Layout | Type | Devices | Episodes | Avg Events/Episode |
|--------|------|---------|----------|-------------------|
| A | Studio/1BR | 24 | 250 | 1210 |
| B | 2-Bedroom | 34 | 290 | 1691 |
| C | 3-Bedroom | 48 | 373 | 2106 |
| D | Elderly Living Alone | 26 | 287 | 1855 |

## 5. Distribution by Resident Profile

| Profile | Episodes | Applicable Layouts |
|---------|----------|--------------------|
| Young Professional | 400 | A, B |
| Family with Children | 513 | B, C |
| Elderly Living Alone | 287 | D |

## 6. Event Count Statistics

| Metric | All | Layout A | Layout B | Layout C | Layout D |
|--------|-----|----------|----------|----------|----------|
| Min | 1114 | 1114 | 1392 | 1488 | 1844 |
| Max | 2475 | 1354 | 1849 | 2475 | 1894 |
| Mean | 1759 | 1210 | 1691 | 2106 | 1855 |
| Median | 1840 | 1130 | 1660 | 2295 | 1854 |
| Std | 391 | 99 | 142 | 371 | 7 |

## 7. Difficulty Distribution

### 7.1 By Level

| Level | Label | Score Range | Description | Episodes (TP+FP) | % of TP+FP |
|-------|-------|-------------|-------------|------------------|------------|
| L1 | Basic detection | 5-7 | Direct alarm signals, single device | 111 | 11.7% |
| L2 | Inference-based detection | 8-10 | State correlation, cross-device reasoning | 394 | 41.5% |
| L3 | Compound reasoning | 11-15 | Temporal analysis, absence reasoning, high FP similarity | 445 | 46.8% |

### 7.2 Five-Dimensional Difficulty Breakdown

| Dimension | Description | Mean | Distribution (1/2/3) |
|-----------|-------------|------|---------------------|
| D1_evidence_count | Number of evidence pieces needed | 2.19 | 145/479/326 |
| D2_signal_directness | Signal directness (1=alarm, 3=absence) | 1.98 | 214/542/194 |
| D3_cross_device | Cross-device reasoning required | 1.91 | 304/429/217 |
| D4_temporal_span | Temporal span of evidence | 1.71 | 497/235/218 |
| D5_fp_similarity | TP/FP discrimination difficulty | 2.25 | 0/715/235 |

### 7.3 Difficulty Score Distribution (TP+FP)

| Score | Count | (each █ ≈ 10 episodes) |
|-------|-------|-|
| 5 | 0 | |
| 6 | 111 | ███████████ |
| 7 | 0 | |
| 8 | 93 | █████████ |
| 9 | 137 | ██████████████ |
| 10 | 164 | ████████████████ |
| 11 | 217 | ██████████████████████ |
| 12 | 104 | ██████████ |
| 13 | 124 | ████████████ |
| 14 | 0 | |
| 15 | 0 | |

## 8. Anomaly Injection Characteristics (TP only)

### 8.1 Injection Time-of-Day

| Hour | Count | |
|------|-------|-|
| 00:00 | 24 | ████████████ |
| 01:00 | 41 | █████████████████████ |
| 02:00 | 55 | ████████████████████████████ |
| 03:00 | 21 | ███████████ |
| 04:00 | 14 | ███████ |
| 05:00 | 17 | █████████ |
| 06:00 | 12 | ██████ |
| 07:00 | 30 | ███████████████ |
| 08:00 | 45 | ███████████████████████ |
| 09:00 | 24 | ████████████ |
| 10:00 | 24 | ████████████ |
| 11:00 | 14 | ███████ |
| 12:00 | 15 | ████████ |
| 13:00 | 12 | ██████ |
| 14:00 | 19 | ██████████ |
| 15:00 | 10 | █████ |
| 16:00 | 15 | ████████ |
| 17:00 | 24 | ████████████ |
| 18:00 | 30 | ███████████████ |
| 19:00 | 26 | █████████████ |
| 20:00 | 15 | ████████ |
| 21:00 | 24 | ████████████ |
| 22:00 | 13 | ███████ |
| 23:00 | 26 | █████████████ |

### 8.2 Anomaly Event Statistics

| Metric | Value |
|--------|-------|
| Anomaly events per TP episode (mean) | 4.7 |
| Anomaly events per TP episode (range) | [1, 11] |
| Anomaly devices per TP episode (mean) | 2.4 |
| Anomaly events as % of total events | 0.26% |

## 9. Temporal Characteristics

| Metric | Value |
|--------|-------|
| Event log time span (mean) | 24.7 hours |
| Event log time span (range) | [23.9, 60.0] hours |
| Sampling interval (temperature) | 5 minutes |
| Sampling interval (occupancy, occupied) | 5 minutes |
| Sampling interval (occupancy, unoccupied) | 30 minutes |

## 10. SQ Type × Difficulty Level Cross-tabulation

| | L1 | L2 | L3 | TN | Total |
|---|----|----|----|----|-------|
| SQ1 | 20 | 75 | 40 | 55 | 190 |
| SQ2 | 39 | 77 | 69 | 55 | 240 |
| SQ3 | 0 | 105 | 130 | 55 | 290 |
| SQ4 | 25 | 79 | 131 | 55 | 290 |
| SQ5 | 27 | 58 | 75 | 30 | 190 |

## 11. Comparison with SimuHome

| Dimension | SimuHome (ICLR 2026) | SafeHome (Ours) |
|-----------|---------------------|-----------------|
| Total episodes | 600 | **1,200** |
| Query/task types | 6 (QT1-QT4) | 5 (SQ1-SQ5) |
| Evaluation categories | 12 (6×F/IF) | **15 (5×TP/FP/TN)** |
| Anomaly scenarios | N/A (task execution) | **33** |
| FP near-miss scenarios | N/A | **35 (1:1 paired)** |
| Anomaly categories | N/A | **7** |
| Home layouts | Random config | **4 fixed archetypes** |
| Resident profiles | N/A | **3** |
| Difficulty levels | Implicit (QT complexity) | **3 levels, 5-dim scored** |
| Avg events per episode | ~50 (API calls) | **~1,750 (sensor logs)** |
| Device protocol | Matter (17 types) | **Matter 1.5.1 (13 types)** |
| Evaluation focus | Proactive task execution | **Reactive anomaly reasoning** |
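
### Appendix: Cross-checking the difficulty tables

The difficulty figures in Section 7 can be cross-checked against each other. The sketch below is not the benchmark's own tooling. It assumes that an episode's difficulty score is the unweighted sum of its five dimension values (each 1 to 3), an assumption that matches the 5-15 score range and the fact that the dimension distributions and the score histogram give the same grand total. The names `difficulty_score`, `difficulty_level`, `dimension_mean`, and `level_counts` are hypothetical helpers introduced for illustration; the numbers are copied from Sections 7.1 to 7.3.

```python
from typing import Dict, Tuple

# Score ranges per level, as listed in Section 7.1.
LEVEL_RANGES: Dict[str, Tuple[int, int]] = {
    "L1": (5, 7),
    "L2": (8, 10),
    "L3": (11, 15),
}

# Episode counts at dimension values 1/2/3, copied from Section 7.2.
DIM_DISTRIBUTION: Dict[str, Tuple[int, int, int]] = {
    "D1_evidence_count": (145, 479, 326),
    "D2_signal_directness": (214, 542, 194),
    "D3_cross_device": (304, 429, 217),
    "D4_temporal_span": (497, 235, 218),
    "D5_fp_similarity": (0, 715, 235),
}

# Score histogram copied from Section 7.3 (zero-count scores omitted).
SCORE_HISTOGRAM: Dict[int, int] = {
    6: 111, 8: 93, 9: 137, 10: 164, 11: 217, 12: 104, 13: 124,
}


def difficulty_score(dims: Dict[str, int]) -> int:
    """ASSUMPTION: total score is the unweighted sum of the five dimensions."""
    if len(dims) != 5 or any(v not in (1, 2, 3) for v in dims.values()):
        raise ValueError("expected five dimension values, each in {1, 2, 3}")
    return sum(dims.values())


def difficulty_level(score: int) -> str:
    """Map a total score to L1/L2/L3 using the Section 7.1 ranges."""
    for level, (low, high) in LEVEL_RANGES.items():
        if low <= score <= high:
            return level
    raise ValueError(f"score {score} is outside the 5-15 range")


def dimension_mean(counts: Tuple[int, int, int]) -> float:
    """Mean dimension value given episode counts at values 1, 2, and 3."""
    weighted = sum(value * n for value, n in zip((1, 2, 3), counts))
    return weighted / sum(counts)


def level_counts(histogram: Dict[int, int]) -> Dict[str, int]:
    """Aggregate the score histogram into per-level episode counts."""
    totals = {level: 0 for level in LEVEL_RANGES}
    for score, n in histogram.items():
        totals[difficulty_level(score)] += n
    return totals


if __name__ == "__main__":
    for name, counts in DIM_DISTRIBUTION.items():
        print(f"{name}: mean = {dimension_mean(counts):.2f}")
    counts_by_level = level_counts(SCORE_HISTOGRAM)
    n_scored = sum(SCORE_HISTOGRAM.values())
    for level, n in counts_by_level.items():
        print(f"{level}: {n} episodes ({100 * n / n_scored:.1f}%)")
```

Under this assumption, the per-dimension means reproduce the Mean column of Section 7.2, and the per-level totals reproduce the episode counts of Section 7.1. The same helpers could be pointed at a regenerated dataset to confirm that its tables remain mutually consistent.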