Should We Wargame LLM-Enabled Adversaries?

by Seb Matthews | Jan 28, 2026 | AI, Military


Last year I sat in a wargame where the red team had to manually generate fake intelligence reports. Every claim took time. Every variation required human effort. By day three, the red team had essentially run out of bandwidth. They could disrupt, but only in bursts. The blue team adapted. The exercise felt manageable.

Now imagine that same scenario with language models generating variations automatically. The red team doesn’t get tired. They don’t run out of ideas. They just keep flooding the zone, hour after hour, with plausible-sounding noise mixed with genuine fragments of truth. The friction doesn’t come in waves. It comes as a permanent hum in the background. That is a fundamentally different test.

The question is not whether large language models (LLMs) are impressive technology. The question is whether we should deliberately train our leaders and organisations to make decisions in that kind of environment. I think we should. Not because the technology is magic, but because it changes something essential about how pressure works in wartime.

What We Are Actually Talking About

People get confused about this because they assume the threat is perfect AI propaganda replacing human thought. That is not the issue.

The real threat is much simpler. LLMs make it cheap to produce enormous amounts of variation. A human adversary can design a narrative. An LLM can generate a thousand variations on that narrative, adjusted for different audiences, timed to hit at different moments, adapted based on what the other side is saying. The human operator sets the direction. The model handles the volume.
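To make the mechanics concrete, here is a minimal sketch of that workflow, assuming a hypothetical call_llm() helper in place of whatever model API a real red cell would use. The narrative, audiences, and tones are placeholders, not examples from any exercise.

```python
import itertools

def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real model API call."""
    return f"[generated text for: {prompt[:60]}...]"

BASE_NARRATIVE = "Supply convoys on Route X are being rerouted."  # placeholder
AUDIENCES = ["local press", "logistics staff", "allied liaison", "social media"]
TONES = ["urgent eyewitness", "dry official notice", "sceptical insider"]

def generate_variants(narrative: str, limit: int = 1000):
    """One human-set direction; the loop supplies the volume."""
    combos = itertools.product(AUDIENCES, TONES)
    for i, (audience, tone) in enumerate(itertools.cycle(combos)):
        if i >= limit:
            break
        prompt = (
            f"Rewrite this claim for {audience}, in the voice of a {tone}, "
            f"keeping it plausible and hard to disprove quickly: {narrative}"
        )
        yield call_llm(prompt)

for variant in generate_variants(BASE_NARRATIVE, limit=5):
    print(variant)
```

Nothing in that loop is sophisticated. That is the point: the human sets one direction, and a thousand audience-tuned variations fall out almost for free.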

From an adversary’s perspective, this changes everything. They can now scale the things they already do well: spreading confusion, creating administrative chaos, exploiting gaps between organisations, and sowing doubt in institutions. What used to require teams of people writing content now requires a few people managing a system. The cost goes down. The persistence goes up. The targeting becomes tighter.

The centre of gravity shifts. Instead of worrying about whether one piece of misinformation is believed, an adversary cares about attention. They care about tempo. They care about whether your organisation can actually keep up. Do your people have time to verify claims, or are they drowning? Can your decision-makers get clean information, or is every channel noisy? Can your units coordinate, or are they isolated by classification barriers? Those organisational friction points become the real target.

Why Most Wargames Miss This

Current defence wargames tend to model platforms well. They model weapons systems well. They model logistics and manoeuvre well. What they do not model well is cognition under sustained pressure.

Information operations usually show up as discrete events. A false report gets injected. The players deal with it. They move on. But that is not how an LLM-enabled adversary actually works. The pressure is continuous. It adapts. It learns what stresses your system and does more of that. The blue team never gets to reset.
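What “continuous and adaptive” means in practice can be shown in a few lines. This is a hedged sketch: the channel names are invented, and observed_stress() stands in for the blue-team telemetry (triage backlog, response latency, contradictory orders) an exercise would actually collect.

```python
import random

# Hypothetical channels a red cell might target, each with an injection rate.
channels = {"logistics": 1.0, "intel": 1.0, "public_affairs": 1.0}

def observed_stress(channel: str) -> float:
    """Placeholder: in a real exercise this comes from blue-team telemetry."""
    return random.random()

for hour in range(72):  # a three-day exercise, hour by hour
    for channel in channels:
        # Reinforce whatever is already causing friction; ease off elsewhere.
        channels[channel] *= 1.2 if observed_stress(channel) > 0.6 else 0.95

print({name: round(rate, 2) for name, rate in channels.items()})
```

The loop is the point: there is no discrete inject to deal with and move past, only a rate that keeps adjusting toward whatever hurts.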

There is a deeper reason why this matters. The real weapon is not believing false things. The real weapon is time. If I can make you spend three hours investigating something false, I have stolen three hours from you. If I can make you verify information seven different ways because you do not trust any single source, I have multiplied your overhead. If I can create enough alternative narratives that nobody in your organisation agrees on what is actually happening, I have broken your command structure without lying very much at all. The friction itself does the damage.
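The arithmetic is worth running once. The figures below are mine, purely illustrative, but they show how modest per-claim costs compound:

```python
# Illustrative figures only; substitute your own from exercise telemetry.
claims_per_day = 200          # flood volume reaching one analysis cell
false_fraction = 0.7          # share that is noise or half-truth
hours_to_check_one = 0.5      # time to run a single claim to ground
verification_paths = 3        # cross-checks once no single source is trusted

wasted = claims_per_day * false_fraction * hours_to_check_one * verification_paths
print(f"Analyst-hours consumed per day: {wasted:.0f}")  # 210
# Roughly 26 eight-hour shifts of pure overhead, without a single
# deception ever being believed. The friction itself does the damage.
```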

And then there is what wargaming can expose that peacetime analysis simply cannot. Every organisation has seams. Places where different units have conflicting policies. Places where classification rules prevent the people who need information from getting it. Places where one person is responsible for too many decisions. Places where handovers lose critical context. A sophisticated red team does not need to win the public narrative. It just needs to find your weak points and apply pressure there. Only a wargame shows where those seams actually are, and only under stress.

The Real Risks, and They Are Not Trivial

This approach has genuine downsides that deserve serious attention.

The most obvious one is that wargames can become technology demonstrations. The red team spends the exercise showing off what language models can generate rather than actually testing decision-making. That defeats the purpose. You end up learning nothing about how your leaders think and everything about model capabilities.

There is also the threat picture problem. Emphasise synthetic influence too much and you under-prepare for the physical realities that usually matter more: supply lines that break, infrastructure that fails, allied forces that do not show up on schedule, the grinding exhaustion of actual combat. An adversary with an LLM has not solved those problems. They still have to fight a real war. Wargames need balance.

The deepest risk is psychological. If the lesson your staff internalise is that everything is manipulable and truth does not matter, they stop trying. They lose the will to make decisions. They become passive. A good wargame sharpens thinking. A badly executed one can paralyse it.

The answer is not to avoid this entirely. The answer is to include it only when you maintain ruthless focus on what actually matters: decisions, evidence standards, and operational tempo.

What You Are Actually Testing For

Stop thinking about detecting deception. That is almost impossible at scale and it is the wrong metric.

The capability that actually matters is something much more unglamorous: the ability to make sound decisions when the information environment is deliberately corrupted. That is a different skill. It is about triage. What matters enough to investigate? What can you act on even if you are not completely certain? What requires absolute confirmation? How do you maintain decision quality when you are exhausted and uncertain?
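One way to make that triage discipline concrete is an explicit routing rule. The scoring and thresholds below are invented placeholders, not doctrine; a real cell would calibrate them to its own mission:

```python
from dataclasses import dataclass

@dataclass
class Claim:
    operational_impact: float  # 0..1: cost of acting on it if it is false
    corroboration: float       # 0..1: independent support gathered so far

def triage(claim: Claim) -> str:
    """Route a claim: park it, act on it, or demand confirmation."""
    if claim.operational_impact < 0.2:
        return "park"            # not worth the investigation hours
    if claim.corroboration > 0.7:
        return "act"             # good enough to act on under uncertainty
    if claim.operational_impact > 0.8:
        return "verify"          # high stakes: requires absolute confirmation
    return "act-with-hedge"      # proceed, but keep the decision reversible

print(triage(Claim(operational_impact=0.9, corroboration=0.3)))  # verify
```

The value is not the numbers but the forcing function: every claim gets an explicit disposition instead of an open-ended investigation.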

A wargame testing this should not be dramatic. There should be no moment where the red team reveals they were lying and everyone gasps. Instead, the blue team should experience a low, constant pressure. New claims come in. Some are true. Some are false. Some are half-true. The blue team has to keep functioning. The question is whether they function well or whether they break under the strain.

The metrics matter too. Do not measure success by how many deceptions were caught. Measure it by the decisions made. Were they sound? Did the people making them have reasonable evidence? Did they act under uncertainty without either freezing or gambling? Did the process scale, or did it collapse? Which procedures failed open and which failed safely?
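Those measures translate naturally into a per-decision record rather than a detection tally. A sketch of what observers might log follows; the field names are my own, not a doctrinal standard:

```python
from dataclasses import dataclass

@dataclass
class DecisionRecord:
    decision: str                  # what was decided
    evidence_cited: list[str]      # what it actually rested on
    time_to_decide_min: float      # tempo, not just accuracy
    acted_under_uncertainty: bool  # neither frozen nor gambling
    failure_mode: str = "n/a"      # "failed-open" vs "failed-safe", if it broke

exercise_log: list[DecisionRecord] = []
exercise_log.append(DecisionRecord(
    decision="reroute convoy around Route X",
    evidence_cited=["UAV imagery", "liaison confirmation"],
    time_to_decide_min=40,
    acted_under_uncertainty=True,
))
# After the exercise, score soundness, evidence, and scaling over this log;
# a count of deceptions caught never appears in it.
```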

If your wargame cannot produce those kinds of insights, adding an LLM element is just technology for its own sake.

Execution That Actually Works

The red team should be persistent and quiet, not flashy. They generate content continuously. They adjust it based on what the blue team does. They have access to enough computing power that they are not manually writing propaganda. The point is to simulate what an actual adversary would do, not to impress anyone.

The blue team should feel the cost. Not in single moments of being fooled. In the cumulative weight of administrative overhead, in the time spent on triage, in the complexity of coordinating when nobody is completely sure what is happening. That is what it really feels like.

Observation matters as much as the scenario itself. Watch where decision tempo breaks down. Track which claims got acted on without adequate evidence. Notice how people handle disagreement and contested information. Document whether people trusted their analysis more than their intuition, or the reverse. See which procedures created safer failures and which created worse ones. This data is what you are actually after.

What Gets Worse If We Do Not Do This

If wargames stay clean, leaders learn instincts tuned for a cleaner war than the one they will actually fight. They over-invest in perfect information and perfect detection. They under-invest in the mundane work of triage, governance, and evidence standards that actually separate organisations that can function under pressure from organisations that cannot.

More dangerously, you miss the organisational seams. The places where policies conflict between units. The places where classification rules prevent sharing. The places where handovers lose critical information. The people who are single points of failure. Those vulnerabilities are invisible during peacetime. During a real conflict with a sophisticated adversary applying pressure precisely where it causes maximum friction, those seams become catastrophic. A wargame is the only venue where you can force them into the open and do something about them before the real test.

The Actual Question

The answer to “should we include LLM-enabled adversary behaviour in our wargames” is yes, if you do it carefully.

The purpose is not to predict tactics. It is not to frighten people with technology. It is to stress-test an organisation’s ability to make decisions, maintain evidence standards, and operate at tempo when disruption is cheap and relentless.

If the wargame makes your leaders better at triage, more disciplined in how they use evidence, and faster at deciding under genuine uncertainty, then it has done the job. If it becomes a showcase of technology, it has wasted everyone’s time.

The real test is whether your organisation can actually do this kind of wargaming well. That is a different question entirely.

Written by Seb Matthews

Military to NASA to boardroom, I bridge operators and engineers to deliver real-world AI outcomes and commercially grounded results, fast.
