AI Assurance Is Evidence, Not Confidence

Jan 7, 2026 | AI, Military

The first time I saw an AI system make a critical decision in a defence context, nobody could explain how it got there. The answer existed somewhere in a model’s weights, across retrieval databases, through a chain of prompts, buried in logs that hadn’t been properly structured. We had a result. We had confidence. We had no idea what actually happened.

That is the problem with how defence organisations talk about AI right now.

Why Defence Organisations Are Adopting AI

Let me start with what actually works. AI delivers real operational advantage. Faster triage of incoming information. Sense-making at scale when human attention is scarce. Spotting patterns in noise that would take months to find manually. These are not theoretical benefits. They are happening now.

The danger is that these tangible wins create a false sense of security about the technology itself. Because it works in the lab, in the demo, under controlled conditions with cooperative data and unlimited time, people start to assume it will work the same way when it matters. When time is short. When data is messy. When stakes are high.

That assumption is where everything breaks.

What Makes AI Different from Every Other System You Deploy

Suppose you build a traditional software system. You write requirements. You test it against those requirements in controlled ways. You release it. Change happens through deliberate updates that you manage. It is predictable. It is stable. The same code produces the same output, every time.

AI does not work that way.

These systems are probabilistic. Context-sensitive. When they see different data, they behave differently. A model that performs well on one distribution of inputs can fail completely on another without any code changing. The system itself is not one thing but a pipeline: data flows in, retrieval systems fetch context, prompts get constructed, models get selected, tools get called, policies get enforced, humans interact with all of it. Weakness at any point in that chain hides until it matters.

Then there is change velocity. Traditional software updates every few quarters. AI systems update constantly. Models get retrained. Retrieval databases shift. Embeddings drift. Dependencies patch. If your governance cannot keep pace with that, your assurance decays in real time.

But here is what most technical people miss: humans are not separate from the system. They are part of the control loop. People learn workarounds. They develop patterns of reliance that you never intended. They defer to the system when they should override it. Those behaviours are not bugs in your design. They are predictable human responses to how you built the interface. And they create risk that never shows up in your accuracy metrics.

The moment you deploy an AI system in defence, you have added uncertainty that existing assurance frameworks were not designed to handle.

The Problem with Trust as a Framework

Most conversations in this space eventually use the word trust. We hear about trustworthy AI, trusted systems, building trust in automation.

Trust is the wrong problem statement entirely.

You cannot write trust as a requirement. You cannot test for it. You cannot build it into code. Confidence is not a control. What actually matters is assurance. And assurance only comes from evidence. Real evidence, not reassurance.

If you cannot show credible evidence that a system stays within defined bounds under realistic conditions, and you cannot trace everything it did to get there, then you do not have an assured capability. You have hope. Hope wrapped in a dashboard looks good in a demo. It falls apart the moment it matters.

What Evidence Actually Looks Like

Evidence gets thrown around loosely, so it helps to pin down what it means.

Evidence is not a successful demo. It is not an anecdote about one case where the system performed well. It is not a vendor promise or a policy document saying you will do something later. Evidence is material that someone independent could inspect and use to reach the same conclusion you did.

For evidence to function in practice, it needs three things.

First, it has to connect to specific claims. Saying this system reduces analyst workload is vague. Saying it will not disclose classified information across security boundaries is closer to testable. Saying it will never hallucinate is neither credible nor useful. But saying it will label uncertainty when confidence is low, cite sources when it retrieves information, and block actions above defined thresholds without human approval is something you can actually verify. That is a claim you can build evidence around.
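Claims at that level of specificity can be checked mechanically. Here is a minimal sketch of what such checks might look like, assuming hypothetical thresholds and field names (`CONFIDENCE_FLOOR`, `ACTION_THRESHOLD`, `ModelOutput`); none of this comes from a real framework.

```python
from dataclasses import dataclass, field

CONFIDENCE_FLOOR = 0.7   # below this, output must carry an uncertainty label
ACTION_THRESHOLD = 0.5   # above this impact score, a human must approve

@dataclass
class ModelOutput:
    text: str
    confidence: float
    retrieved_sources: list = field(default_factory=list)  # what retrieval fetched
    citations: list = field(default_factory=list)          # what the output cites
    action_impact: float = 0.0   # 0.0 = advisory only, 1.0 = irreversible action
    human_approved: bool = False

def claim_violations(out: ModelOutput) -> list:
    """Return every claim the output violates; an empty list means all held."""
    v = []
    if out.confidence < CONFIDENCE_FLOOR and "[UNCERTAIN]" not in out.text:
        v.append("low-confidence output not labelled as uncertain")
    if out.retrieved_sources and not out.citations:
        v.append("retrieved information returned without citations")
    if out.action_impact > ACTION_THRESHOLD and not out.human_approved:
        v.append("high-impact action not gated on human approval")
    return v
```

The point is not the thresholds themselves but that each claim maps to a check an independent reviewer could rerun.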

Second, evidence must come from realistic conditions. Testing only on clean data, cooperative users, and perfect networks means you are measuring a system that does not exist. Real conditions mean messy data. Users under time pressure. Networks under stress. Edge cases that nobody anticipated. If you only test the happy path, your evidence is fiction.

Third, evidence must be reconstructable. You need to show what happened, who did what, what the system saw, what it produced, which policies it applied, which version was running. If you cannot walk through the entire path from input to output, you do not have evidence. You have a story.
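Reconstructability is ultimately a logging-schema question. A sketch of one audit entry that captures the path from input to output follows; all field names are illustrative assumptions, not a real logging standard.

```python
import hashlib
import json
from datetime import datetime, timezone

def audit_record(user, model_version, prompt, context_ids, policies, output):
    """Build one audit entry for a single input-to-output path."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "user": user,                      # who did what
        "model_version": model_version,    # which version was running
        "input_hash": hashlib.sha256(prompt.encode()).hexdigest(),
        "context_ids": context_ids,        # what the system saw (retrieved docs)
        "policies_applied": policies,      # which policies it applied
        "output_hash": hashlib.sha256(output.encode()).hexdigest(),
    }
    # Hash the whole record so later tampering is detectable.
    record["record_hash"] = hashlib.sha256(
        json.dumps(record, sort_keys=True).encode()
    ).hexdigest()
    return record
```

Hashing input and output rather than storing them verbatim is one way to keep classified content out of the log while still proving what passed through.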

Three Layers: Claims, Arguments, Evidence

A practical structure that keeps thinking grounded separates assurance into three distinct layers.

Claims are what you assert about the system. What it will do. What it will not do. What it will not reveal. How fast it operates. How it maintains control. These need to be specific enough that someone could say yes or no to them.

Arguments are the logical reasons why someone should believe those claims. Why should this evidence support this particular claim? What is the reasoning that connects the two?

Evidence is what you can actually show. Data. Tests. Logs. Failure reports. Incident records. Anything tangible that another person can inspect and verify.

This structure does one concrete thing: it stops assurance conversations from dissolving into instinct, arguments from authority, and gut feelings dressed up as professional judgement. It forces clarity.
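The three layers above can be sketched as data, which makes the forcing function concrete: a claim with no inspectable evidence attached is visibly unsupported. The class and field names here are illustrative, not a standard schema.

```python
from dataclasses import dataclass, field

@dataclass
class Evidence:
    kind: str      # e.g. "test", "log", "failure report", "incident record"
    location: str  # where an independent reviewer can inspect it

@dataclass
class Argument:
    reasoning: str                        # why this evidence supports the claim
    evidence: list = field(default_factory=list)

@dataclass
class Claim:
    statement: str                        # specific enough to answer yes or no
    arguments: list = field(default_factory=list)

    def is_supported(self) -> bool:
        """A claim with no inspectable evidence is hope, not assurance."""
        return any(arg.evidence for arg in self.arguments)
```

Even this toy version makes the gap visible: gut feeling has nowhere to hide, because a bare `Claim` reports itself as unsupported.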

Where the Real Evidence Comes From

Rather than getting lost in infinite variations of evidence, it helps to think in five categories. These are the areas that actually matter to buyers, operators, and the people who have to maintain the system long-term.

Data and provenance evidence starts everything. If you cannot explain what data your system uses, where it came from, what it contains and what it excludes, how access is controlled, then you cannot assure the outcome. This means tracking provenance. Controlling access. Managing how classified or redacted material flows through the system. Testing for information leakage, including through prompt injection. Covering edge cases and adversarial inputs that people might throw at it. This is the foundation. Everything else sits on top.

Model behaviour evidence is where most conversations stop, and that is precisely why it is not enough. You need performance metrics connected to actual mission outcomes, not just accuracy numbers. You need stress testing to see what happens when context degrades, inputs become ambiguous, or someone tries to manipulate it deliberately. You need to understand calibration and uncertainty, not just how it performs on average. You need documented failure modes so people know exactly what the system does when it cannot deliver. This is where you find out what you are actually getting.

System level control evidence is what transforms AI from an experimental capability into something you can actually deploy. This means policies that are enforced across the entire system, boundaries that hold even when you push hard on them, exception handling that actually works, output grounded in sources with proper citations, guardrails on actions not just generated text, and audit logs that let you trace the complete path from input to decision. This is the difference between a system that works and a system that works reliably.

Human factors evidence exists because a system can be technically sound and still fail operationally if people use it wrong. This includes clear documentation about what the system can and cannot do, interfaces that surface confidence appropriately so operators understand uncertainty, training that works with realistic workflows, and evidence that the system speeds up decision-making without creating hidden dependence on it. If your operators do not trust the system correctly, assurance becomes fragile no matter what else you have.

Operational change evidence rounds out the picture because assurance is not something you build once and ship. You need real change control for models, prompts, data sources, and dependencies. Rollback procedures that actually work, not ones that look good on paper. Monitoring tied to what matters operationally, not vanity metrics. Incident playbooks that specifically account for AI-type failures. This is how you maintain assurance over the lifespan of the system.
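The change-control point above can be reduced to a small sketch: pinned versions with a rollback path that is exercised, not just documented. The `DeploymentRegistry` name and its methods are hypothetical, standing in for whatever registry a real programme uses.

```python
class DeploymentRegistry:
    """Minimal sketch of change control for a deployed model version."""

    def __init__(self):
        self._history = []  # ordered list of deployed version tags

    def deploy(self, version: str):
        """Record a deliberate, auditable change to the running system."""
        self._history.append(version)

    def current(self) -> str:
        return self._history[-1]

    def rollback(self) -> str:
        """Revert to the previous known-good version."""
        if len(self._history) < 2:
            raise RuntimeError("no previous version to roll back to")
        self._history.pop()
        return self._history[-1]
```

A rollback procedure that has never been run is exactly the kind of paper control the paragraph above warns against; the test of this sketch is that rollback actually executes.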

Why Procurement Matters

Many AI assurance problems are actually built in at the requirements stage.

Ask a vendor for AI with assurance without specifying what evidence you actually need, and you will get compliance theatre. Thick policy documents. Audits that check whether the policy document exists. Trust us repackaged as rigorous process. You will get reassurance disguised as substance.

Specify evidence, and you get a different result. Suppliers can differentiate by showing what they actually built rather than arguing about what they meant. The entire supply chain faces pressure to produce something real.

If you are buying, stop asking for confidence statements and start asking for evidence packs. Ask what evidence they can produce, under what conditions they tested it, and how you can reconstruct what happened. Ask the hard questions. Demand specificity.

A Baseline for Deployment

If you need something concise enough that a programme board can understand before committing, here is a starting point. Before you deploy, answer yes to every one of these.

- You can state your top five assurance claims on a single page.
- You have evidence for each claim, not narratives or promises.
- You have tested realistic failure modes and adversarial inputs, and you can show what you found.
- You have policy enforcement that you have demonstrated, not just described.
- You can reconstruct what the system did from logs, end to end.
- You control change cadence and can roll back safely.
- You understand what happens when conditions degrade, and you have tested it.
- You have defined what a safe failure looks like, and you have actually observed it occur.
- You know what operators are expected to do, and you have trained them to do it.
- You have incident playbooks that account for AI-specific failure modes.

This is not perfection. This is deployable seriousness.

The Real Advantage

AI will deliver advantage in defence. The speed, the throughput, the ability to find signal in noise. But only if you treat it as a capability that needs to be built and assured, not a gadget you bolt on and hope works.

Assurance is how you keep that speed without betting everything on trust.

Written by Seb Matthews

Military to NASA to boardroom, I bridge operators and engineers to deliver real-world AI outcomes and commercially grounded results, fast.
