# SafeHome Benchmark: Diversity & Hardness Report

## 1. Overall Scale

| Metric | Value |
|--------|-------|
| Total episodes | 1200 |
| True Positive (TP) | 550 |
| False Positive (FP) | 400 |
| True Negative (TN) | 250 |
| Unique anomaly scenarios | 33 |
| Unique FP scenarios | 35 |
| Anomaly categories | 7 |
| Query types (SQ) | 5 |
| Home layouts | 4 |
| Resident profiles | 3 |
| Difficulty levels | 3 (L1/L2/L3) |

## 2. Distribution by Query Type

| SQ Type | Description | TP | FP | TN | Total |
|---------|-------------|----|----|----|-------|
| SQ1 | Device Health Diagnosis | 80 | 55 | 55 | 190 |
| SQ2 | Single-Event Safety Judgment | 105 | 80 | 55 | 240 |
| SQ3 | Behavioral Sequence Analysis | 130 | 105 | 55 | 290 |
| SQ4 | Composite Safety Reasoning | 130 | 105 | 55 | 290 |
| SQ5 | Emergency Response Planning | 105 | 55 | 30 | 190 |

## 3. Distribution by Anomaly Category

| Category | TP episodes | FP episodes | Unique scenarios |
|----------|-------------|-------------|------------------|
| Intrusion | 147 | 93 | 5 |
| Fire/Gas | 119 | 73 | 4 |
| Water Damage | 42 | 51 | 3 |
| Device Fault | 80 | 55 | 6 |
| Elderly-Specific | 45 | 23 | 7 |
| Child-Specific | 29 | 24 | 4 |
| Behavioral Anomaly | 88 | 81 | 4 |

## 4. Distribution by Home Layout

| Layout | Type | Devices | Episodes | Avg Events/Episode |
|--------|------|---------|----------|--------------------|
| A | Studio/1BR | 24 | 250 | 1210 |
| B | 2-Bedroom | 34 | 290 | 1691 |
| C | 3-Bedroom | 48 | 373 | 2106 |
| D | Elderly Living Alone | 26 | 287 | 1855 |

## 5. Distribution by Resident Profile

| Profile | Episodes | Applicable Layouts |
|---------|----------|--------------------|
| Young Professional | 400 | A, B |
| Family with Children | 513 | B, C |
| Elderly Living Alone | 287 | D |

## 6. Event Count Statistics

Events per episode, overall and by home layout:

| Metric | All | Layout A | Layout B | Layout C | Layout D |
|--------|-----|----------|----------|----------|----------|
| Min | 1114 | 1114 | 1392 | 1488 | 1844 |
| Max | 2475 | 1354 | 1849 | 2475 | 1894 |
| Mean | 1759 | 1210 | 1691 | 2106 | 1855 |
| Median | 1840 | 1130 | 1660 | 2295 | 1854 |
| Std | 391 | 99 | 142 | 371 | 7 |
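
The overall mean in the "All" column is consistent with an episode-weighted average of the per-layout means, using the episode counts from Section 4. A quick check, as a sketch (the variable names are illustrative and not part of any benchmark tooling):

```python
# Episode counts (Section 4) and mean events per episode (Section 6), keyed by layout.
episodes = {"A": 250, "B": 290, "C": 373, "D": 287}
mean_events = {"A": 1210, "B": 1691, "C": 2106, "D": 1855}

total_episodes = sum(episodes.values())

# Weight each layout's mean by its number of episodes.
weighted_mean = sum(episodes[k] * mean_events[k] for k in episodes) / total_episodes

print(total_episodes)        # 1200
print(round(weighted_mean))  # 1759
```

Because Layout C contributes both the most episodes and the longest logs, it pulls the overall mean above the unweighted average of the four layout means.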

## 7. Difficulty Distribution

### 7.1 By Level

| Level | Label | Score Range | Description | Episodes (TP+FP) | % of TP+FP |
|-------|-------|-------------|-------------|------------------|------------|
| L1 | Basic Detection | 5-7 | Direct alarm signals, single device | 111 | 12% |
| L2 | Inference-Based Detection | 8-10 | State correlation, cross-device reasoning | 394 | 41% |
| L3 | Composite Reasoning | 11-15 | Temporal analysis, absence reasoning, high FP similarity | 445 | 47% |

### 7.2 Five-Dimensional Difficulty Breakdown

| Dimension | Description | Mean | Distribution (1/2/3) |
|-----------|-------------|------|----------------------|
| D1_evidence_count | Number of evidence pieces needed | 2.19 | 145/479/326 |
| D2_signal_directness | Signal directness (1=alarm, 3=absence) | 1.98 | 214/542/194 |
| D3_cross_device | Cross-device reasoning required | 1.91 | 304/429/217 |
| D4_temporal_span | Temporal span of evidence | 1.71 | 497/235/218 |
| D5_fp_similarity | TP/FP discrimination difficulty | 2.25 | 0/715/235 |
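
The 5-15 score range in Section 7.1 matches a total obtained by summing the five dimension ratings, each rated 1 to 3. The sketch below assumes that summation rule, which the report does not state explicitly; the function names `difficulty_score` and `difficulty_level` are hypothetical. It also recomputes the Mean column from the Distribution column as a consistency check.

```python
def difficulty_score(d1: int, d2: int, d3: int, d4: int, d5: int) -> int:
    """Sum the five dimension ratings (assumed rule; each rating is 1, 2, or 3)."""
    ratings = (d1, d2, d3, d4, d5)
    if any(r not in (1, 2, 3) for r in ratings):
        raise ValueError("each dimension rating must be 1, 2, or 3")
    return sum(ratings)


def difficulty_level(score: int) -> str:
    """Map a total score to the level bands listed in Section 7.1."""
    if 5 <= score <= 7:
        return "L1"
    if 8 <= score <= 10:
        return "L2"
    if 11 <= score <= 15:
        return "L3"
    raise ValueError(f"score out of range: {score}")


# Counts of episodes rated 1, 2, and 3 on each dimension (table above).
distributions = {
    "D1_evidence_count": (145, 479, 326),
    "D2_signal_directness": (214, 542, 194),
    "D3_cross_device": (304, 429, 217),
    "D4_temporal_span": (497, 235, 218),
    "D5_fp_similarity": (0, 715, 235),
}

# Mean rating per dimension: sum(rating * count) / number of episodes.
means = {
    name: round(sum(r * c for r, c in zip((1, 2, 3), counts)) / sum(counts), 2)
    for name, counts in distributions.items()
}

print(difficulty_level(difficulty_score(1, 1, 1, 1, 2)))  # L1
print(means["D5_fp_similarity"])                          # 2.25
```

Since no episode is rated 1 on `D5_fp_similarity`, the lowest total that can occur under this rule is 6 rather than 5.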

### 7.3 Difficulty Score Distribution (TP+FP)

| Score | Count | Histogram (█ ≈ 10 episodes) |
|-------|-------|-----------------------------|
| 5 | 0 | |
| 6 | 111 | ███████████ |
| 7 | 0 | |
| 8 | 93 | █████████ |
| 9 | 137 | ██████████████ |
| 10 | 164 | ████████████████ |
| 11 | 217 | ██████████████████████ |
| 12 | 104 | ██████████ |
| 13 | 124 | ████████████ |
| 14 | 0 | |
| 15 | 0 | |
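
The level totals in Section 7.1 can be recovered by bucketing this score distribution into the three score bands. A minimal sketch, with illustrative variable names:

```python
# Score distribution from Section 7.3 (TP+FP episodes only).
score_counts = {5: 0, 6: 111, 7: 0, 8: 93, 9: 137, 10: 164,
                11: 217, 12: 104, 13: 124, 14: 0, 15: 0}

# Level bands from Section 7.1 (inclusive score ranges 5-7, 8-10, 11-15).
bands = {"L1": range(5, 8), "L2": range(8, 11), "L3": range(11, 16)}

total = sum(score_counts.values())
level_counts = {
    level: sum(score_counts[s] for s in scores)
    for level, scores in bands.items()
}
level_share = {level: 100 * n / total for level, n in level_counts.items()}

for level in bands:
    print(f"{level}: {level_counts[level]} episodes, {level_share[level]:.1f}%")
# L1: 111 episodes, 11.7%
# L2: 394 episodes, 41.5%
# L3: 445 episodes, 46.8%
```

The shares are taken over the 950 TP+FP episodes; TN episodes carry no difficulty score.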

## 8. Anomaly Injection Characteristics (TP only)

### 8.1 Injection Time-of-Day

| Hour | Count | Histogram |
|------|-------|-----------|
| 00:00 | 24 | ████████████ |
| 01:00 | 41 | █████████████████████ |
| 02:00 | 55 | ████████████████████████████ |
| 03:00 | 21 | ███████████ |
| 04:00 | 14 | ███████ |
| 05:00 | 17 | █████████ |
| 06:00 | 12 | ██████ |
| 07:00 | 30 | ███████████████ |
| 08:00 | 45 | ███████████████████████ |
| 09:00 | 24 | ████████████ |
| 10:00 | 24 | ████████████ |
| 11:00 | 14 | ███████ |
| 12:00 | 15 | ████████ |
| 13:00 | 12 | ██████ |
| 14:00 | 19 | ██████████ |
| 15:00 | 10 | █████ |
| 16:00 | 15 | ████████ |
| 17:00 | 24 | ████████████ |
| 18:00 | 30 | ███████████████ |
| 19:00 | 26 | █████████████ |
| 20:00 | 15 | ████████ |
| 21:00 | 24 | ████████████ |
| 22:00 | 13 | ███████ |
| 23:00 | 26 | █████████████ |

### 8.2 Anomaly Event Statistics

| Metric | Value |
|--------|-------|
| Anomaly events per TP episode (mean) | 4.7 |
| Anomaly events per TP episode (range) | [1, 11] |
| Anomaly devices per TP episode (mean) | 2.4 |
| Anomaly events as % of total events | 0.26% |

## 9. Temporal Characteristics

| Metric | Value |
|--------|-------|
| Event log time span (mean) | 24.7 hours |
| Event log time span (range) | [23.9, 60.0] hours |
| Sampling interval (temperature) | 5 minutes |
| Sampling interval (occupancy, occupied) | 5 minutes |
| Sampling interval (occupancy, unoccupied) | 30 minutes |
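
To give a feel for how these intervals translate into log volume, the sketch below counts the readings a single sensor would produce over a given span. It assumes strictly periodic reporting, which the report does not state. The helper names and the 16-hour/8-hour occupancy split are hypothetical, chosen only for illustration.

```python
def expected_samples(span_hours: float, interval_minutes: float) -> int:
    """Number of fixed-interval readings that fit in a log of the given span."""
    if interval_minutes <= 0:
        raise ValueError("interval_minutes must be positive")
    return int(span_hours * 60 // interval_minutes)


def occupancy_samples(occupied_hours: float, unoccupied_hours: float) -> int:
    """Occupancy readings: every 5 min while occupied, every 30 min otherwise."""
    return expected_samples(occupied_hours, 5) + expected_samples(unoccupied_hours, 30)


# One temperature sensor over a 24-hour log at a 5-minute interval.
print(expected_samples(24, 5))   # 288

# One occupancy sensor, assuming 16 occupied and 8 unoccupied hours.
print(occupancy_samples(16, 8))  # 208
```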

## 10. SQ Type × Difficulty Level Cross-tabulation

| SQ Type | L1 | L2 | L3 | TN | Total |
|---------|----|----|----|----|-------|
| SQ1 | 20 | 75 | 40 | 55 | 190 |
| SQ2 | 39 | 77 | 69 | 55 | 240 |
| SQ3 | 0 | 105 | 130 | 55 | 290 |
| SQ4 | 25 | 79 | 131 | 55 | 290 |
| SQ5 | 27 | 58 | 75 | 30 | 190 |

## 11. Comparison with SimuHome

| Dimension | SimuHome (ICLR 2026) | SafeHome (Ours) |
|-----------|----------------------|-----------------|
| Total episodes | 600 | **1,200** |
| Query/task types | 6 (QT1-QT4) | 5 (SQ1-SQ5) |
| Evaluation categories | 12 (6×F/IF) | **15 (5×TP/FP/TN)** |
| Anomaly scenarios | N/A (task execution) | **35** |
| FP near-miss scenarios | N/A | **35 (1:1 paired)** |
| Anomaly categories | N/A | **7** |
| Home layouts | Random config | **4 fixed archetypes** |
| Resident profiles | N/A | **3** |
| Difficulty levels | Implicit (QT complexity) | **3 levels, 5-dim scored** |
| Avg events per episode | ~50 (API calls) | **~1,750 (sensor logs)** |
| Device protocol | Matter (17 types) | **Matter 1.5.1 (13 types)** |
| Evaluation focus | Proactive task execution | **Reactive anomaly reasoning** |