BattleTag
6b485b98f7
fix(grounding): auto-rescue hallucinated invocation_id + list real ids in error
First full-case run (runs/2026-05-20T20-15-04/) produced 83 GroundingError
rejections, almost all from a single failure mode: the LLM cites a
plausible-looking inv-XXXXXXXX that doesn't exist, even though the fact's
value is present verbatim in one of its real tool outputs. The agent knew
which tool it read from; it just mis-typed the citation id.
Two-layer fix in evidence_graph.validate_fact_grounding:
Layer A (silent heal): when the cited inv-id misses, search the same
agent / task's invocations for one whose output contains the value
(strict or normalised substring). If exactly one matches, rewrite
fact.invocation_id in place and accept. Multi-match is NOT auto-
rescued — the candidate ids go back to the LLM so it picks deliberately.
Layer B (informative retry): GroundingError now appends the agent's
recent invocation ids and a brief tool-call summary, so the LLM has
the real ids in front of it for the next attempt rather than
fabricating again from memory.
Both layers preserve the design invariant: the fact's value must still
be present in a real tool output — nothing new can land grounded that
wasn't already verifiable. Cross-agent / cross-task isolation is also
preserved (rescue candidates filtered on agent + task_id).
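The Layer A rescue described above can be sketched roughly as follows. This is a minimal illustration, not the code in evidence_graph.validate_fact_grounding: the Invocation record, the rescue_invocation_id helper and the normalisation rule (lowercase, collapsed whitespace) are all assumptions made for the example.

```python
import re
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Invocation:
    # Hypothetical record shape; the real ToolInvocation fields are not shown here.
    id: str
    agent: str
    task_id: str
    output: str


def _normalise(text: str) -> str:
    # Assumed normalisation: lowercase and collapse runs of whitespace.
    return re.sub(r"\s+", " ", text).strip().lower()


def rescue_invocation_id(
    value: str, agent: str, task_id: str, invocations: List[Invocation]
) -> Tuple[Optional[str], List[str]]:
    """Return (rescued_id, candidate_ids) for a fact whose cited id missed."""
    if not value.strip():
        # An empty value would match every output, so never rescue it.
        return None, []
    candidates = [
        inv.id
        for inv in invocations
        # Isolation: only the same agent and the same task are searched.
        if inv.agent == agent and inv.task_id == task_id
        # Strict substring first, then the normalised comparison.
        and (value in inv.output or _normalise(value) in _normalise(inv.output))
    ]
    if len(candidates) == 1:
        return candidates[0], candidates  # unique match: safe to rewrite the id
    return None, candidates  # zero or several matches: hand back to the LLM
```

A caller would rewrite fact.invocation_id only when the first element is not None. Otherwise the candidate list can be folded into the GroundingError text for the retry.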
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-21 02:14:20 -10:00
BattleTag
81ade8f7ac
feat(refit): complete S1-S6 — case abstraction, grounding, log-odds, plugins, coref, multi-source
Consolidates the long-running refit work (DESIGN.md as authoritative spec)
into a single baseline commit. Six stages landed together:
S1 Case + EvidenceSource abstraction; tools parameterised by source_id
(case.py, main.py multi-source bootstrap, .bin extension support)
S2 Grounding gateway in add_phenomenon: verified_facts cite real
ToolInvocation ids; substring / normalised match enforced; agent +
task scope checked. Phenomenon.description split into verified_facts
(grounded) + interpretation (free text). [invocation: inv-xxx]
prefix on every wrapped tool result so the LLM can cite.
S3 Confidence as additive log-odds: edge_type → log10(LR) calibration
table; commutative updates; supported / refuted thresholds derived
from log_odds; hypothesis × evidence matrix view.
S4 iOS plugin: unzip_archive + parse_plist / sqlite_tables /
sqlite_query / parse_ios_keychain / read_idevice_info;
IOSArtifactAgent; SOURCE_TYPE_AGENTS routing.
S5 Cross-source entity resolution: typed identifiers on Entity,
observe_identity gateway, auto coref hypothesis with shared /
conflicting strong/weak LR edges, reversible same_as edges,
actor_clusters() view.
S6 Android partition probe + AndroidArtifactAgent; MediaAgent with
OCR fallback; orchestrator Phase 1 iterates every analysable
source; platform-aware get_triage_agent_type; ReportAgent renders
actor clusters + per-source breakdown.
142 unit tests / 1 skipped — full coverage of the new gateway, log-odds
math, coref hypothesis fall-out, and orchestrator multi-source dispatch.
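As an illustration of the S3 idea (confidence as additive log-odds), a sketch along these lines may help. The edge-type names, the log10(LR) values and both thresholds are placeholders chosen for the example; the actual calibration table lives in the codebase and is not reproduced in this log.

```python
from typing import Dict, Iterable

# Hypothetical calibration table: edge_type -> log10(likelihood ratio).
LOG10_LR: Dict[str, float] = {
    "strong_support": 1.0,
    "weak_support": 0.5,
    "weak_refute": -0.5,
    "strong_refute": -1.0,
}

# Assumed thresholds on the accumulated log-odds.
SUPPORTED_AT = 1.5
REFUTED_AT = -1.5


def update_log_odds(prior: float, edge_types: Iterable[str]) -> float:
    # Each edge adds its log10(LR); addition makes the update order-independent.
    return prior + sum(LOG10_LR[edge] for edge in edge_types)


def status(log_odds: float) -> str:
    # Status is derived from the log-odds alone, not stored separately.
    if log_odds >= SUPPORTED_AT:
        return "supported"
    if log_odds <= REFUTED_AT:
        return "refuted"
    return "open"


def probability(log_odds: float) -> float:
    # Convert base-10 log-odds back to a probability for display.
    odds = 10.0 ** log_odds
    return odds / (1.0 + odds)
```

Because each update is a plain sum, applying the same set of edges in any order gives the same log-odds. That is the commutativity property the commit message refers to.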
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-21 02:12:10 -10:00
BattleTag
444d58726a
refactor: native tool calling + generic forced-retry + terminal exit
- llm_client: switch tool_call_loop from text-based <tool_call> regex
to OpenAI-native tools=[...] / structured tool_calls field; accumulate
delta.reasoning_content for DeepSeek thinking-mode echo-back; message
folding preserves the system message and aligns the fold boundary so no
role:tool message is orphaned
- base_agent: generic forced-retry via mandatory_record_tools class attr
(filesystem -> add_phenomenon, timeline -> add_temporal_edge,
hypothesis -> add_hypothesis, report -> save_report); count via
executor wrapper
- terminal_tools class attr + loop short-circuit: when a terminal tool
is called, loop exits with its raw return as final_text. ReportAgent
declares save_report as terminal - replaces the <answer>-tag stop
signal that native tool calling broke
- _execute_*: return (raw, formatted) - terminal exit uses untruncated
raw, conversation history uses 3000-char-capped formatted
- evidence_graph + orchestrator: LLM-derived InvestigationArea support
(hypothesis-driven coverage check, replaces hardcoded _AREA_KEYWORDS /
_AREA_TOOLS); manual yaml block kept as optional seed
- strip <answer> references from agent prompts (no longer load-bearing)
Verified on CFReDS image across 4 smoke runs: 0 JSON parse failures
(was 3); 22 temporal edges from Phase 4 (was 0); ReportAgent exits via
save_report (was max_iterations regression). 78/78 unit tests pass.
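The terminal-tool short-circuit and the (raw, formatted) return pair could look roughly like the sketch below. It is a simplified stand-in, not the project's tool_call_loop: the planned_calls list replaces real LLM tool_calls, and execute_tool and run_loop are hypothetical names. Only the 3000-character cap and the save_report terminal tool come from the commit message.

```python
from typing import Any, Callable, Dict, List, Optional, Set, Tuple

MAX_FORMATTED_CHARS = 3000  # cap for conversation history, per the commit message


def execute_tool(
    tools: Dict[str, Callable[..., str]], name: str, args: Dict[str, Any]
) -> Tuple[str, str]:
    # Return both the untruncated result and the capped version for history.
    raw = tools[name](**args)
    return raw, raw[:MAX_FORMATTED_CHARS]


def run_loop(
    planned_calls: List[Tuple[str, Dict[str, Any]]],
    tools: Dict[str, Callable[..., str]],
    terminal_tools: Set[str],
    history: List[Dict[str, str]],
) -> Optional[str]:
    # planned_calls stands in for tool calls an LLM would emit turn by turn.
    for name, args in planned_calls:
        raw, formatted = execute_tool(tools, name, args)
        if name in terminal_tools:
            # Terminal tool: exit at once with the raw return as final_text.
            return raw
        history.append({"role": "tool", "name": name, "content": formatted})
    return None  # no terminal tool was called
```

With terminal_tools set to {"save_report"}, the loop stops at the first save_report call and returns the full report text, while earlier tool results enter the history in capped form.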
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 13:51:19 +08:00
BattleTag
31812a72ee
test: track tests/ directory in version control
tests/test_optimizations.py — 60 pytest cases covering:
- EvidenceGraph: quality scoring, Jaccard merge, async safety,
hypothesis confidence updates, asset library
- llm_client: tool-result truncation, parallel batch execution,
progressive context decay, message folding
- orchestrator: parallel dispatch, batched lead generation,
batched judging
- tool_registry: result cache key derivation
FakeAgent.run signatures updated to BaseAgent.run(task, lead_id=None).
The tests/ directory was previously listed in .gitignore (which is itself
untracked, so the ignore rule lives only locally).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 14:10:31 +08:00