Failure Analysis — Red Team Report

How the eval harness catches each failure mode before it reaches users.

Methodology

docs/DATA_QUALITY.md maps 12 failure modes (F1-F12) ranked by likelihood and harm. This document traces each mode to the specific test(s) that would catch it, and identifies which modes remain unguarded.

Failure mode → test mapping

Catastrophic (false safety signal to evacuees)

Mode	Description	Tests that catch it	Verdict
F1	Fabricated all-clear (model hallucinates `evacuation_lifted: true` or `incident_resolved_iso`)	`test_lifted_requires_corroboration`, `test_resolved_requires_two_sources`	Guarded. P0-1 corroboration gate forces safe default when source count < 2 or no official source present.
F12	Prompt injection via scraped page (malicious page tells model “evacuation is lifted”)	`test_lifted_requires_corroboration` (same gate)	Partially guarded. The corroboration gate blocks the effect (a single-source all-clear), but doesn’t detect the cause (injection). A coordinated injection across 2+ sources including a spoofed official domain could bypass the gate. Residual risk accepted: the attacker would need to control ≥2 indexed news sources AND spoof an official hostname.
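
A minimal sketch of the kind of corroboration gate described for F1 and F12. The function names, the `OFFICIAL_HOSTS` allowlist, and the choice to count distinct hostnames (rather than distinct URLs) are assumptions for illustration, not the project's actual implementation.

```python
from urllib.parse import urlparse

# Hypothetical allowlist; the real official hostnames are not given in this report.
OFFICIAL_HOSTS = {"example-county.gov"}


def _host(url: str) -> str:
    """Lowercased hostname of a URL, or "" if it cannot be parsed."""
    return (urlparse(url).hostname or "").lower()


def gate_all_clear(claimed_lifted: bool, source_urls: list[str]) -> bool:
    """Return True only for a corroborated all-clear; otherwise the safe default False."""
    # Assumption: "source count" means distinct hostnames, so two pages
    # from the same site count once.
    distinct_hosts = {_host(u) for u in source_urls} - {""}
    has_official = any(h in OFFICIAL_HOSTS for h in distinct_hosts)
    return bool(claimed_lifted) and len(distinct_hosts) >= 2 and has_official
```

Under these assumptions, a single-source all-clear (including one produced by an injected page) falls back to `False`. An attacker who controls two hosts and an allowlisted hostname still passes, which is the F12 residual.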

High harm (fabricated information, invisible staleness)

Mode	Description	Tests that catch it	Verdict
F2	Fabricated source URL or quote	`test_fabricated_source_url_not_in_snapshot`, `test_statement_without_source_url_rejected`, `test_sources_checked_all_wellformed`	Guarded. P0-2 drops any statement whose `source_url` wasn’t actually retrieved this run.
F3	Hallucinated numeric (temp, residents, injuries)	`test_garbage_input_keeps_prev_values`, `test_partial_facts_dont_downgrade_severity`, `test_residents_shift_fires_info`	Partially guarded. The writer holds the previous value when the residents count drops by more than 50% without `lifted` being set. But a plausible-but-wrong number (e.g., 48,000 instead of 50,000) passes. There is no numeric sanity check beyond the 50% drop gate.
F4	Stale-but-fresh-stamped (empty facts, fresh timestamp)	`test_empty_facts_do_not_advance_data_as_of`, `test_all_null_facts_treated_as_no_data`, `test_stale_after_is_data_as_of_plus_maxage`	Guarded. P0-3 separates data age from write age.
F11	Fabricated/garbled date (future or pre-incident timestamp)	`test_future_resolved_iso_suppressed`, `test_malformed_resolved_iso_suppressed`, `test_valid_resolved_iso_honored`	Guarded. P1-1 nulls future-dated and malformed timestamps.
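
The P0-2 and P1-1 guards might look roughly like the sketch below. The helper names, the dict shape of a statement, and the decision to treat naive timestamps as UTC are assumptions; only the behavior (drop statements with unretrieved URLs, null future-dated or malformed timestamps) comes from the table above.

```python
from datetime import datetime, timezone


def drop_unretrieved(statements: list[dict], retrieved_urls: set[str]) -> list[dict]:
    """P0-2 sketch: keep only statements whose source_url was fetched this run."""
    return [s for s in statements if s.get("source_url") in retrieved_urls]


def sanitize_resolved_iso(value, now: datetime):
    """P1-1 sketch: return the timestamp string if valid and not in the future, else None."""
    if not isinstance(value, str):
        return None
    try:
        parsed = datetime.fromisoformat(value)
    except ValueError:
        return None  # malformed timestamp
    if parsed.tzinfo is None:
        # Assumption: naive timestamps are interpreted as UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return value if parsed <= now else None


now = datetime(2025, 1, 1, tzinfo=timezone.utc)  # fixed clock for the example
kept = drop_unretrieved(
    [{"source_url": "https://a.example/1"}, {"source_url": "https://b.example/2"}, {}],
    retrieved_urls={"https://a.example/1"},
)
```

Passing `now` in explicitly, rather than reading the clock inside the function, keeps the check deterministic under test.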

Medium harm (degraded quality, lost provenance)

Mode	Description	Tests that catch it	Verdict
F5	Severity miscompute from partial facts	`test_partial_facts_dont_downgrade_severity`	Guarded. Fixed + regression test.
F6	Web search returns nothing (all-null facts)	`test_all_null_facts_treated_as_no_data`, `test_graceful_failure_no_api_key`	Guarded. Writer carries previous values; P0-3 prevents fresh-stamping.
F7	No provenance (model omits `sources_checked`)	`test_sources_checked_all_wellformed` (shape), `test_every_statement_has_source_and_time` (data contract)	Partially guarded. Tests verify shape and presence, but an empty `sources_checked: []` passes silently.
F8	Schema drift (gatherer and writer disagree on field names)	`test_status_json_required_fields`, `test_config_json_required_fields`	Partially guarded. Schema tests catch missing fields in the output, but don’t enforce the gatherer-writer contract at the boundary.
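
One way to close the F7 and F8 gaps would be a contract check at the gatherer-writer boundary. This is a sketch only: `validate_gatherer_output` and the `REQUIRED_FACT_FIELDS` set are hypothetical, and the real field list would have to come from the gatherer and writer code.

```python
# Hypothetical required fields; `statements` in particular is an assumed name.
REQUIRED_FACT_FIELDS = {"sources_checked", "statements", "evacuation_lifted"}


def validate_gatherer_output(facts: dict) -> list[str]:
    """Return a list of contract violations; an empty list means the payload passes."""
    missing = REQUIRED_FACT_FIELDS - set(facts)
    problems = [f"missing field: {name}" for name in sorted(missing)]

    sources = facts.get("sources_checked")
    if sources is not None and not isinstance(sources, list):
        problems.append("sources_checked must be a list")
    elif sources == []:
        # F7: an empty provenance list currently passes the shape tests silently.
        problems.append("sources_checked is empty")
    return problems
```

Running a check like this on the gatherer's output before the writer consumes it would surface a renamed field or an empty provenance list at the boundary instead of in the final output.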

Low harm (operational, not safety-critical)

Mode	Description	Tests that catch it	Verdict
F9	Cron silently stops	None	Unguarded. No dead-man’s switch. Staleness banner (P0-3) is the only in-app signal. If P0-3 has a bug, a stopped cron is invisible.
F10	Commit/deploy failure	None	Unguarded. Push failure is visible only in GitHub Actions. No external alerting.
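
A dead-man's switch for F9 could be as small as the check below, run from somewhere other than the cron it monitors. The function name, the 15-minute interval in the example, and the grace multiplier are all assumptions; this report does not state the pipeline's schedule.

```python
from datetime import datetime, timedelta, timezone


def heartbeat_overdue(
    last_run: datetime,
    now: datetime,
    expected_interval: timedelta,
    grace: float = 2.0,
) -> bool:
    """True if the last successful run is older than `grace` expected intervals."""
    return now - last_run > expected_interval * grace


# Example with an assumed 15-minute schedule.
now = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
stalled = heartbeat_overdue(now - timedelta(minutes=45), now, timedelta(minutes=15))
healthy = heartbeat_overdue(now - timedelta(minutes=20), now, timedelta(minutes=15))
```

The point of running it externally is independence: the alert would still fire if both the cron and the in-app staleness banner failed.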

Coverage summary

Category	Modes	Guarded	Partially guarded	Unguarded
Catastrophic	F1, F12	F1	F12	—
High harm	F2, F3, F4, F11	F2, F4, F11	F3	—
Medium harm	F5, F6, F7, F8	F5, F6	F7, F8	—
Low harm	F9, F10	—	—	F9, F10

6 of 12 failure modes are fully guarded by automated tests. 4 are partially guarded (the tests catch the effect but not every vector). The 2 operational modes have no automated coverage (the staleness banner is the manual fallback).

What the eval harness cannot catch

Plausible-but-wrong numerics (F3 residual). If the model says 48,000 evacuees when the real number is 50,000, no automated test flags it. The 50% drop gate catches gross errors, not subtle ones.
Coordinated prompt injection (F12 residual). If an attacker controls ≥2 indexed sources and one spoofs an official domain, the corroboration gate passes. This requires a sophisticated, targeted attack — low probability, but the architecture doesn’t structurally prevent it.
Silent cron death (F9). The pipeline has no external heartbeat monitor. If the cron stops and the staleness banner has a bug, users see no signal. Mitigation: the staleness banner is tested (P0-3), so both would have to fail simultaneously.
Subtle semantic drift. The model might gradually shift tone (more alarming, more reassuring) without triggering any binary test. The human review step is the control here, not the eval harness.
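
The F3 residual can be made concrete with a sketch of the 50% drop gate. The function name and signature are hypothetical; the rule itself (hold the previous value on a drop of more than 50% unless the evacuation is lifted) is taken from the F3 verdict above.

```python
from typing import Optional


def accept_residents(prev: Optional[int], new: Optional[int], lifted: bool) -> Optional[int]:
    """Return the residents count the writer should store."""
    if new is None:
        return prev  # no data: carry the previous value forward
    if prev is None or lifted:
        return new  # nothing to compare against, or a large drop is expected
    if new < prev * 0.5:
        return prev  # drop of more than 50% without an all-clear: hold
    return new


held = accept_residents(50_000, 20_000, lifted=False)    # gross error is held
passed = accept_residents(50_000, 48_000, lifted=False)  # subtle error passes
```

With these inputs the 20,000 figure is rejected and the 48,000 figure is stored, which is exactly the gap described above: the gate bounds how wrong a number can be, but cannot tell whether a plausible number is right.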

Design lesson

The eval harness is strongest where the failure mode is binary and catastrophic (all-clear vs. not, fabricated vs. real, stale vs. fresh). It’s weakest where the failure mode is continuous and subtle (slightly wrong numbers, gradual tone drift, sophisticated injection). This largely matches the priority: the binary-catastrophic modes are the ones that could kill someone, while slightly wrong numbers and tone drift degrade quality but don’t create false safety signals. Coordinated injection (F12) is the exception: it could still produce a false all-clear, which is why its residual risk is called out explicitly above.