In an emergency dashboard, the failure that can kill someone is a wrong “all-clear.” So the design question that matters most is not how good the model is. It is who is allowed to declare safety, and what stops the model from doing it alone.
This system answers with a hard boundary. The AI is scoped to situational awareness: gathering, corroborating, and surfacing facts. The human keeps the final verdict: the irreversible safety calls, like lifting an evacuation or declaring an incident resolved. The model never owns that decision. Three mechanisms hold the boundary, and the third one is the honest part.
The most dangerous output is a fabricated all-clear: the model emitting evacuation_lifted: true or a resolved timestamp from a single source, a hallucination, or an injected page. The system structurally refuses it. The P0-1 corroboration gate forces a safe default unless at least two sources agree and an official source is present (test_lifted_requires_corroboration, test_resolved_requires_two_sources). A model that “concludes” the danger is over cannot make that conclusion stick on its own. The verdict is gated, not trusted.
A person can only hold the verdict if the screen tells the truth about what is known. So the system separates data age from write age (P0-3): a failed or empty fetch cannot refresh-stamp stale facts as current (test_empty_facts_do_not_advance_data_as_of). Every statement carries the source actually retrieved on that run, and fabricated source_urls are dropped (P0-2). Future-dated and malformed timestamps are nulled (P1-1). These gates all serve one goal: the reviewer sees staleness and uncertainty for what they are, not as a clean, confident, wrong answer.
The eval harness is strongest where failure is binary and catastrophic: all-clear vs. not, fabricated vs. real, stale vs. fresh. Eight of twelve red-team failure modes are fully guarded by automated tests. It is weakest where failure is continuous and subtle, and the method does not pretend otherwise. Three classes rely on the human, not the harness:
For all three, the control is a person reading the output, not a green test suite. Naming them is the method. A system that claimed full automated coverage here would be lying, and a reviewer who trusted that claim would be unguarded exactly where the danger is subtle.
The automated suite verifies that the machinery behaves: 198/198 green (eval summary, red-team). It does not, and cannot, own the safety verdict. Keeping the human as the final authority is what makes the residual risks survivable. Every failure the harness cannot catch is one a human is positioned to catch, because the system surfaces uncertainty to that person instead of papering over it. The AI makes the reviewer faster and better informed. It does not get the last word.