llmiotsafe/results/DIVERSITY_REPORT.md
2026-05-12 17:01:39 +08:00

# SafeHome Benchmark: Diversity & Hardness Report
## 1. Overall Scale
| Metric | Value |
|--------|-------|
| Total episodes | 1200 |
| True Positive (TP) | 550 |
| False Positive (FP) | 400 |
| True Negative (TN) | 250 |
| Unique anomaly scenarios | 33 |
| Unique FP scenarios | 35 |
| Anomaly categories | 7 |
| Query types (SQ) | 5 |
| Home layouts | 4 |
| Resident profiles | 3 |
| Difficulty levels | 3 (L1/L2/L3) |
## 2. Distribution by Query Type
| SQ Type | Description | TP | FP | TN | Total |
|---------|-------------|----|----|-----|-------|
| SQ1 | Device Health Diagnosis | 80 | 55 | 55 | 190 |
| SQ2 | Single-Event Safety Judgment | 105 | 80 | 55 | 240 |
| SQ3 | Behavioral Sequence Analysis | 130 | 105 | 55 | 290 |
| SQ4 | Composite Safety Reasoning | 130 | 105 | 55 | 290 |
| SQ5 | Emergency Response Planning | 105 | 55 | 30 | 190 |
## 3. Distribution by Anomaly Category
| Category | TP episodes | FP episodes | Unique scenarios |
|----------|-------------|-------------|------------------|
| Intrusion | 147 | 93 | 5 |
| Fire/Gas | 119 | 73 | 4 |
| Water Damage | 42 | 51 | 3 |
| Device Fault | 80 | 55 | 6 |
| Elderly-Specific | 45 | 23 | 7 |
| Child-Specific | 29 | 24 | 4 |
| Behavioral Anomaly | 88 | 81 | 4 |
## 4. Distribution by Home Layout
| Layout | Type | Devices | Episodes | Avg Events/Episode |
|--------|------|---------|----------|-------------------|
| A | Studio/1BR | 24 | 250 | 1210 |
| B | 2-Bedroom | 34 | 290 | 1691 |
| C | 3-Bedroom | 48 | 373 | 2106 |
| D | Elderly Living Alone | 26 | 287 | 1855 |
## 5. Distribution by Resident Profile
| Profile | Episodes | Applicable Layouts |
|---------|----------|--------------------|
| Young Professional | 400 | A, B |
| Family with Children | 513 | B, C |
| Elderly Living Alone | 287 | D |
## 6. Event Count Statistics
| Metric | All | Layout A | Layout B | Layout C | Layout D |
|--------|-----|----------|----------|----------|----------|
| Min | 1114 | 1114 | 1392 | 1488 | 1844 |
| Max | 2475 | 1354 | 1849 | 2475 | 1894 |
| Mean | 1759 | 1210 | 1691 | 2106 | 1855 |
| Median | 1840 | 1130 | 1660 | 2295 | 1854 |
| Std | 391 | 99 | 142 | 371 | 7 |
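The "All" mean can be cross-checked as an episode-weighted average of the per-layout means. The following is a minimal sketch of that check, assuming the integer-rounded per-layout means and the episode counts from Section 4; the variable names are illustrative and not taken from the benchmark code.

```python
# Episode counts and mean events per episode, copied from Sections 4 and 6.
layouts = {
    "A": {"episodes": 250, "mean_events": 1210},
    "B": {"episodes": 290, "mean_events": 1691},
    "C": {"episodes": 373, "mean_events": 2106},
    "D": {"episodes": 287, "mean_events": 1855},
}

total_episodes = sum(v["episodes"] for v in layouts.values())
# Weight each layout mean by its number of episodes.
total_events = sum(v["episodes"] * v["mean_events"] for v in layouts.values())
pooled_mean = total_events / total_episodes

print(total_episodes)      # 1200
print(round(pooled_mean))  # 1759
```

The result matches the reported overall mean of 1759, so the per-layout and overall figures are mutually consistent up to rounding.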
## 7. Difficulty Distribution
### 7.1 By Level
| Level | Label | Score Range | Description | Episodes (TP+FP) | % |
|-------|-------|-------------|-------------|------------------|---|
| L1 | Basic detection | 5-7 | Direct alarm signals, single device | 111 | 12% |
| L2 | Inferential detection | 8-10 | State correlation, cross-device reasoning | 394 | 41% |
| L3 | Composite reasoning | 11-15 | Temporal analysis, absence reasoning, high FP similarity | 445 | 47% |
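The level counts follow from binning the per-score counts of Section 7.3 into the score ranges above. A small sketch of that binning, assuming the total score is an integer between 5 and 15; `difficulty_level` is a hypothetical helper name, not a function from the benchmark code.

```python
# Score counts copied from the score distribution in Section 7.3 (TP+FP episodes).
score_counts = {5: 0, 6: 111, 7: 0, 8: 93, 9: 137, 10: 164,
                11: 217, 12: 104, 13: 124, 14: 0, 15: 0}


def difficulty_level(score: int) -> str:
    """Map a total difficulty score to L1/L2/L3 using the ranges in the table."""
    if not 5 <= score <= 15:
        raise ValueError(f"score out of range: {score}")
    if score <= 7:
        return "L1"
    if score <= 10:
        return "L2"
    return "L3"


level_counts = {"L1": 0, "L2": 0, "L3": 0}
for score, count in score_counts.items():
    level_counts[difficulty_level(score)] += count

total = sum(level_counts.values())
for level, count in level_counts.items():
    # Share of all TP+FP episodes that fall into this level.
    print(level, count, f"{100 * count / total:.1f}%")
```

Running this reproduces the 111 / 394 / 445 split over 950 TP+FP episodes.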
### 7.2 Five-Dimensional Difficulty Breakdown
| Dimension | Description | Mean | Distribution (1/2/3) |
|-----------|-------------|------|---------------------|
| D1_evidence_count | Number of evidence pieces needed | 2.19 | 145/479/326 |
| D2_signal_directness | Signal directness (1=alarm, 3=absence) | 1.98 | 214/542/194 |
| D3_cross_device | Cross-device reasoning required | 1.91 | 304/429/217 |
| D4_temporal_span | Temporal span of evidence | 1.71 | 497/235/218 |
| D5_fp_similarity | TP/FP discrimination difficulty | 2.25 | 0/715/235 |
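Each mean in the table is the average of the 1/2/3 ratings weighted by the episode counts in the last column. A minimal sketch of that arithmetic, with the counts copied from the table; `mean_rating` is an illustrative name.

```python
# Episode counts per rating (1, 2, 3) for each dimension, copied from the table.
distributions = {
    "D1_evidence_count": (145, 479, 326),
    "D2_signal_directness": (214, 542, 194),
    "D3_cross_device": (304, 429, 217),
    "D4_temporal_span": (497, 235, 218),
    "D5_fp_similarity": (0, 715, 235),
}


def mean_rating(counts):
    """Weighted mean of ratings 1..3 given the number of episodes per rating."""
    total = sum(counts)
    return sum(rating * n for rating, n in enumerate(counts, start=1)) / total


means = {name: round(mean_rating(counts), 2) for name, counts in distributions.items()}
for name, value in means.items():
    print(f"{name}: {value:.2f}")
```

Every dimension sums to 950 episodes (TP+FP), and the computed means agree with the reported 2.19, 1.98, 1.91, 1.71, and 2.25.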
### 7.3 Difficulty Score Distribution (TP+FP)
| Score | Count | Bar (1 block ≈ 10 episodes) |
|-------|-------|-----------------------------|
| 5 | 0 | |
| 6 | 111 | ███████████ |
| 7 | 0 | |
| 8 | 93 | █████████ |
| 9 | 137 | ██████████████ |
| 10 | 164 | ████████████████ |
| 11 | 217 | ██████████████████████ |
| 12 | 104 | ██████████ |
| 13 | 124 | ████████████ |
| 14 | 0 | |
| 15 | 0 | |
## 8. Anomaly Injection Characteristics (TP only)
### 8.1 Injection Time-of-Day
| Hour | Count | |
|------|-------|-|
| 00:00 | 24 | ████████████ |
| 01:00 | 41 | █████████████████████ |
| 02:00 | 55 | ████████████████████████████ |
| 03:00 | 21 | ███████████ |
| 04:00 | 14 | ███████ |
| 05:00 | 17 | █████████ |
| 06:00 | 12 | ██████ |
| 07:00 | 30 | ███████████████ |
| 08:00 | 45 | ███████████████████████ |
| 09:00 | 24 | ████████████ |
| 10:00 | 24 | ████████████ |
| 11:00 | 14 | ███████ |
| 12:00 | 15 | ████████ |
| 13:00 | 12 | ██████ |
| 14:00 | 19 | ██████████ |
| 15:00 | 10 | █████ |
| 16:00 | 15 | ████████ |
| 17:00 | 24 | ████████████ |
| 18:00 | 30 | ███████████████ |
| 19:00 | 26 | █████████████ |
| 20:00 | 15 | ████████ |
| 21:00 | 24 | ████████████ |
| 22:00 | 13 | ███████ |
| 23:00 | 26 | █████████████ |
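The hourly counts can be summarized with a few lines of arithmetic. The sketch below is illustrative only: the "night" window of 00:00 to 05:59 is an assumption made for this example, not a bucket defined by the benchmark.

```python
# Injection counts per hour (index 0 = 00:00, ..., 23 = 23:00), copied from the table.
hourly_counts = [24, 41, 55, 21, 14, 17, 12, 30, 45, 24, 24, 14,
                 15, 12, 19, 10, 15, 24, 30, 26, 15, 24, 13, 26]

total = sum(hourly_counts)
# Assumed night window: hours 00:00 through 05:59.
night = sum(hourly_counts[0:6])
night_share = round(100 * night / total, 1)
# Hour with the most injections.
peak_hour = max(range(24), key=lambda h: hourly_counts[h])

print(total)        # 550
print(night)        # 172
print(night_share)  # 31.3
print(peak_hour)    # 2
```

The total of 550 matches the number of TP episodes, and under the assumed window roughly 31% of injections occur at night, with the single busiest hour at 02:00.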
### 8.2 Anomaly Event Statistics
| Metric | Value |
|--------|-------|
| Anomaly events per TP episode (mean) | 4.7 |
| Anomaly events per TP episode (range) | [1, 11] |
| Anomaly devices per TP episode (mean) | 2.4 |
| Anomaly events as % of total events | 0.26% |
## 9. Temporal Characteristics
| Metric | Value |
|--------|-------|
| Event log time span (mean) | 24.7 hours |
| Event log time span (range) | [23.9, 60.0] hours |
| Sampling interval (temperature) | 5 minutes |
| Sampling interval (occupancy, occupied) | 5 minutes |
| Sampling interval (occupancy, unoccupied) | 30 minutes |
## 10. SQ Type × Difficulty Level Cross-tabulation
| | L1 | L2 | L3 | TN | Total |
|---|----|----|----|----|-------|
| SQ1 | 20 | 75 | 40 | 55 | 190 |
| SQ2 | 39 | 77 | 69 | 55 | 240 |
| SQ3 | 0 | 105 | 130 | 55 | 290 |
| SQ4 | 25 | 79 | 131 | 55 | 290 |
| SQ5 | 27 | 58 | 75 | 30 | 190 |
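The cross-tabulation can be checked against the totals in Sections 2 and 7.1 by summing its rows and columns. A short sketch of that consistency check, with the cell values copied from the table above and illustrative variable names:

```python
# Cells of the SQ type x difficulty level table; TN episodes carry no difficulty level.
crosstab = {
    "SQ1": {"L1": 20, "L2": 75, "L3": 40, "TN": 55},
    "SQ2": {"L1": 39, "L2": 77, "L3": 69, "TN": 55},
    "SQ3": {"L1": 0, "L2": 105, "L3": 130, "TN": 55},
    "SQ4": {"L1": 25, "L2": 79, "L3": 131, "TN": 55},
    "SQ5": {"L1": 27, "L2": 58, "L3": 75, "TN": 30},
}

# Row sums should equal the per-SQ totals; column sums the per-level totals.
row_totals = {sq: sum(cells.values()) for sq, cells in crosstab.items()}
col_totals = {
    col: sum(cells[col] for cells in crosstab.values())
    for col in ("L1", "L2", "L3", "TN")
}

print(row_totals)
print(col_totals)
print(sum(row_totals.values()))  # 1200
```

The column sums come to 111, 394, 445, and 250, which agree with the level counts in Section 7.1 and the TN count in Section 1.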
## 11. Comparison with SimuHome
| Dimension | SimuHome (ICLR 2026) | SafeHome (Ours) |
|-----------|---------------------|-----------------|
| Total episodes | 600 | **1,200** |
| Query/task types | 6 (QT1-QT4) | 5 (SQ1-SQ5) |
| Evaluation categories | 12 (6×F/IF) | **15 (5×TP/FP/TN)** |
| Anomaly scenarios | N/A (task execution) | **33** |
| FP near-miss scenarios | N/A | **35 (1:1 paired)** |
| Anomaly categories | N/A | **7** |
| Home layouts | Random config | **4 fixed archetypes** |
| Resident profiles | N/A | **3** |
| Difficulty levels | Implicit (QT complexity) | **3 levels, 5-dim scored** |
| Avg events per episode | ~50 (API calls) | **~1,750 (sensor logs)** |
| Device protocol | Matter (17 types) | **Matter 1.5.1 (13 types)** |
| Evaluation focus | Proactive task execution | **Reactive anomaly reasoning** |