Why AI in the SOC comes down to the data you hold and the accountability you’ll stand behind, not the model you buy.
By Rob Demain, CEO & Founder, e2e-assure
There is a particular pitch every security leader has heard by now. It goes: the model is getting better, the next one will be better still, and soon it will be good enough to trust. Reliability, in this telling, is something that arrives – a feature of a future release, available to anyone willing to wait for it or pay for it.
It is the wrong scoreboard for security operations. And the evidence that it is wrong has become hard to argue with.
After three decades in this field, I have learned to be wary of any technology sold on the promise that it improves on its own. AI is genuinely new ground: for the first time in years, I am working on something that did not exist before, not a rebrand of an old idea. But “genuinely new” and “ready to be trusted blindly” are not the same statement, and the gap between them is exactly where security teams get hurt.
The short version
- Hallucination is structural – not a bug a bigger model grows out of. The latest research says so plainly.
- Grounding AI in good data helps, but never eliminates the problem. Verification and accountability are what make AI safe in a SOC.
- The model is no one’s edge: attackers have frontier models too. Your environment and your team’s knowledge are the advantage they can’t copy.
- We learned this the hard way, in a pilot that went wrong. It shaped everything we’ve built since.
You cannot scale your way out of hallucination
The common assumption is that AI accuracy is a function of model quality: a better, larger model will eventually stop making things up. This year, researchers at OpenAI and Georgia Tech published work, since carried in Nature, arguing the opposite. Hallucination is not a bug that scaling removes; it is structurally incentivised. The way models are trained and evaluated rewards confident guessing over admitting uncertainty: because benchmarks typically mark answers simply right or wrong, a model that says “I don’t know” scores worse than one that guesses and is occasionally right. The field has, in effect, taught models to bluff, and the behaviour persists in the newest and most capable systems.
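To make the incentive concrete, here is a minimal Python sketch of the scoring arithmetic. It assumes a single question where the model has probability `p_correct` of being right if it answers, a correct answer scores 1, “I don’t know” scores 0, and a wrong answer optionally loses `wrong_penalty` points. The penalty parameter is an illustrative assumption, not the scoring rule of any particular benchmark.

```python
def expected_score(p_correct: float, answer: bool, wrong_penalty: float = 0.0) -> float:
    """Expected score on one question.

    p_correct: probability the answer is right if the model answers.
    answer: True to answer (guess), False to abstain with "I don't know".
    wrong_penalty: points lost for a wrong answer; 0.0 means plain right/wrong grading.
    """
    if not answer:
        return 0.0  # abstaining always scores zero
    return p_correct - (1.0 - p_correct) * wrong_penalty


def best_policy(p_correct: float, wrong_penalty: float = 0.0) -> str:
    """Return whichever choice maximises expected score (ties go to abstaining)."""
    guess = expected_score(p_correct, answer=True, wrong_penalty=wrong_penalty)
    abstain = expected_score(p_correct, answer=False, wrong_penalty=wrong_penalty)
    return "guess" if guess > abstain else "abstain"


# Plain right/wrong grading: even a 10% shot is worth taking.
print(best_policy(0.10))                     # guess
# Penalise wrong answers by 3 points and the same long shot is no longer worth it...
print(best_policy(0.10, wrong_penalty=3.0))  # abstain
# ...while a genuinely confident answer still is.
print(best_policy(0.80, wrong_penalty=3.0))  # guess
```

Under plain grading, guessing beats abstaining for any non-zero chance of being right, which is the bluffing incentive in miniature. With a penalty of 3, guessing only pays when p − 3(1 − p) > 0, that is, above 75% confidence.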
That dismantles the “just wait for the better model” strategy. If unreliability is baked into the incentive structure, you do not upgrade your way out of it.
Specialised domains make it worse. The cleanest public evidence comes from law, where errors can be checked against the record. Stanford researchers found general-purpose models hallucinating between 58% and 88% of the time, depending on the model, when asked specific, verifiable questions about federal court cases. More soberingly, purpose-built legal tools, which use retrieval over curated, authoritative data (exactly the architecture vendors point to as the fix), still hallucinated 17% to 33% of the time. And a randomised controlled trial of law students found that a retrieval-grounded tool lifted productivity without pushing error rates below what people produced using no AI at all: grounding kept hallucination in check but did not eliminate it.
The lesson for security is direct: grounding a model in good data helps enormously, but it does not make the problem vanish. And the most dangerous error is not the obviously wrong one. It is the answer that drew on real evidence, reasoned from it incorrectly, and looks exactly right: what researchers call misgrounding. In a SOC that is a confident, plausible, well-cited verdict on an alert that happens to be wrong, and it is far harder to catch than an obvious fabrication.
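One way to see why misgrounding is the harder failure: the mechanical check that catches fabrication does nothing against it. The Python sketch below assumes a verdict that cites event IDs and a hypothetical `evidence_store` mapping those IDs to retrieved telemetry. Checking that every citation exists flags invented references, but a verdict that cites real events and draws the wrong conclusion from them passes cleanly.

```python
def citation_report(cited_ids: list[str], evidence_store: dict[str, str]) -> dict[str, list[str]]:
    """Split a verdict's citations into those found in the evidence store and those not found.

    This only proves that a cited record exists; it cannot tell whether the
    record actually supports the verdict.
    """
    grounded = [cid for cid in cited_ids if cid in evidence_store]
    fabricated = [cid for cid in cited_ids if cid not in evidence_store]
    return {"grounded": grounded, "fabricated": fabricated}


# Hypothetical telemetry for one alert.
evidence_store = {
    "evt-101": "Interactive login for svc-backup from an unfamiliar ASN",
    "evt-102": "svc-backup normally only authenticates from the backup subnet",
}

# An obvious fabrication: the cited event does not exist, so it is caught.
print(citation_report(["evt-101", "evt-999"], evidence_store))

# A misgrounded verdict, "benign, routine backup activity", citing real events.
# Every citation checks out, yet the events themselves point the other way.
print(citation_report(["evt-101", "evt-102"], evidence_store))
```

The second verdict is the dangerous one: it is well cited, so only someone who reads the evidence and understands the reasoning will notice it is wrong.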
We learned this the hard way
We do not write about this from the outside. A few years ago, in an early pilot of AI in the SOC, a single hallucinated verdict went unchallenged and turned into an expensive, time-consuming mess. The model was not malicious; the data was not obviously wrong. What turned a bad answer into real damage was everything around it. No one could explain why the system had reached that conclusion. No one clearly owned the outcome. And, the part that has stayed with me: no one felt confident enough to overrule the machine. A plausible, confident, wrong answer stood, because challenging it required an understanding of the model that nobody in the room had.
The cost was not only wasted time and resources. It was trust, which is far harder to win back than to lose.
That experience shaped everything we have built since. The real failure was not the hallucination itself; as the research shows, hallucinations are unavoidable. The failure was the absence of the three things that should have caught it: a way to explain the verdict, a clear owner of the outcome, and people both equipped and empowered to say “no, that is wrong.” Everything that follows is a direct answer to those three.
So the differentiator was never the model
If you cannot buy reliability and you cannot fully eliminate hallucination, the questions that actually matter shift. Two of them decide whether AI belongs anywhere near a live SOC: what can you ground the model in, and who is accountable when it is wrong?
Both are harder than they look, and neither is solved by a better model.
The first advantage: we already hold the data
Everyone now agrees that AI in security depends on reliable, structured access to the right data – Palo Alto Networks, among others, has made the point plainly. Far fewer providers can actually deliver it, and the reason is structural. You can only ground AI in a customer’s reality if you genuinely hold that reality: clean, complete, well-structured telemetry, under your own control, at the fidelity the work requires.
Most managed providers do not. They run their service on top of a SIEM they license from a third party. They rent the data layer rather than own it, which means they do not control its completeness, its structure or its quality, and grounding AI properly becomes a data project they have not yet started.
We are one of very few managed providers that run on a SIEM they built and own. We have been operating SOCs for thirteen years. The data needed to ground AI in a customer’s environment is not something we have to go and assemble: it is already here, at the right fidelity, in a platform we control. That is a commercial advantage that is genuinely hard to copy, because copying it means building a SOC platform first and then operating it for a decade.
And the data is only half of it. Alongside it sit thirteen years of how our cleared analysts have actually triaged, investigated and concluded: the institutional judgment of a team that has done this work, at scale, in regulated and high-consequence environments, for a very long time. An architecture can be copied in an afternoon. Thirteen years of accumulated analyst judgment cannot be copied at all. Crucially, nor can it be acquired by an adversary or found inside any frontier model.
The second advantage: we will tell you who owns the outcome
Hallucination in the SOC is not, in the end, a technical problem. It is an accountability one, and the question of who is accountable cannot be left until after an incident. If a model invents a conclusion, misses a signal, suggests the wrong triage decision, or produces a weak detection change, who owns that outcome? The analyst? The SOC lead? The vendor? The organisation that deployed it?
With a tool, the honest answer is always the same: you do. The vendor ships the capability and the risk transfers to you on installation. That is a legitimate model, but leaders should be clear-eyed that it is the one they are buying.
We work differently, and we are open about how. We operate as partners rather than as a black box you are asked to trust on faith. We show our working: for any AI-assisted decision, an analyst (and the customer) can see what evidence was used, what was assumed, what was reproduced, and how the conclusion was reached. We carry a real share of the outcome rather than transferring all of it across on day one. And there is an audit trail behind every decision precisely so that accountability can be reconstructed after the fact, not merely asserted before it.
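The elements listed above (evidence, assumptions, reproduced checks, reasoning, an accountable owner) map naturally onto a per-decision record. The following Python sketch is illustrative only: the `VerdictRecord` class, its field names, and the rule that a verdict is not actionable without evidence, reasoning and a named owner are assumptions for this example, not a description of any production system.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class VerdictRecord:
    """Hypothetical audit record for one AI-assisted triage decision."""
    alert_id: str
    verdict: str                                    # e.g. "benign" or "escalate"
    evidence: list[str]                             # IDs of telemetry the model actually used
    assumptions: list[str]                          # what was taken as given
    reproduced: list[str]                           # checks an analyst re-ran independently
    reasoning: str                                  # how the conclusion was reached
    owner: str                                      # named person accountable for the outcome
    notes: list[str] = field(default_factory=list)  # append-only history of changes
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

    def missing(self) -> list[str]:
        """Names of the fields that must be filled before anyone acts on the verdict."""
        required = {"evidence": self.evidence, "reasoning": self.reasoning, "owner": self.owner}
        return [name for name, value in required.items() if not value]

    def actionable(self) -> bool:
        return not self.missing()

    def overrule(self, analyst: str, new_verdict: str, reason: str) -> None:
        """Record a human override without erasing what the model originally said."""
        self.notes.append(f"{analyst} overruled '{self.verdict}' -> '{new_verdict}': {reason}")
        self.verdict = new_verdict


# A verdict with evidence but no reasoning or owner is not yet actionable.
draft = VerdictRecord("ALR-1", "benign", ["evt-101"], [], [], reasoning="", owner="")
print(draft.missing())  # ['reasoning', 'owner']
```

Keeping the override as an appended note, rather than overwriting silently, is what lets accountability be reconstructed after the fact.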
Showing our working is also what makes the human oversight real. An analyst can only overrule a machine they understand, so explainability is not a nicety: it is what turns a human in the loop from a rubber stamp into a genuine check. The point of putting a person in the loop is lost the moment that person cannot see why the system decided what it did, or does not feel able to say no to it.
That transparency is not a courtesy. For the sectors we serve, it is the whole basis of trust, and frankly it is the thing the “best model” vendors find hardest to offer, because you cannot show your working if you do not understand it yourself.
What reliability looks like in practice
The principles that follow from all of this are simple to state, even where the engineering behind them is not. The customer’s data stays in the customer’s environment and under the customer’s sovereignty; that is a precondition for how the reliability works, not a feature bolted on the side. Frontier models still have a real role for research, unfamiliar techniques and analyst exploration, but the day-to-day SOC runs on a model grounded in the customer’s own world. And improvement is something we can prove rather than something we ask you to take on trust: changes are tested before they ship and can be rolled back, so the system improves one tested step at a time rather than being left to drift.
That last point matters more than it sounds. “It learns and gets better over time” is, structurally, the same unfalsifiable claim as “just wait for the bigger model.” Improvement only counts when you can measure it, evidence it, and reverse it.
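The test, promote and roll back cycle can be sketched in a few lines. This is a minimal Python illustration under stated assumptions: model versions are plain functions from alert ID to label, they are scored on a hypothetical held-out set of past alerts with analyst-confirmed labels, and accuracy stands in for whatever metric a real SOC would use. A candidate is promoted only if it measurably beats the active version, and every promotion can be reversed.

```python
from dataclasses import dataclass, field
from typing import Callable

Model = Callable[[str], str]  # alert ID -> predicted label
Case = tuple[str, str]        # (alert ID, analyst-confirmed label)


def accuracy(model: Model, cases: list[Case]) -> float:
    if not cases:
        raise ValueError("need at least one labelled case to measure anything")
    correct = sum(1 for alert_id, label in cases if model(alert_id) == label)
    return correct / len(cases)


@dataclass
class ModelRegistry:
    models: dict[str, Model]
    active: str
    history: list[str] = field(default_factory=list)  # previously active versions, newest last

    def promote_if_better(self, candidate: str, cases: list[Case], margin: float = 0.0) -> bool:
        """Switch to `candidate` only if it beats the active model by more than `margin`."""
        baseline = accuracy(self.models[self.active], cases)
        score = accuracy(self.models[candidate], cases)
        if score > baseline + margin:
            self.history.append(self.active)
            self.active = candidate
            return True
        return False

    def rollback(self) -> str:
        """Restore the previously active version and return its name."""
        if not self.history:
            raise RuntimeError("no earlier version to roll back to")
        self.active = self.history.pop()
        return self.active


# Hypothetical held-out cases and model versions.
cases = [("a1", "malicious"), ("a2", "benign"), ("a3", "malicious")]
registry = ModelRegistry(
    models={
        "v1": lambda alert_id: "benign",  # naive baseline
        # Stand-in for a better model; a real one would not be hard-coded.
        "v2": lambda alert_id: "malicious" if alert_id in {"a1", "a3"} else "benign",
    },
    active="v1",
)
print(registry.promote_if_better("v2", cases))  # True: 3/3 beats 1/3
print(registry.rollback())                      # v1
```

Setting `margin` above zero guards against promoting on noise, and a real evaluation would need far more than three cases; the point is only that “better” is decided by a measurement that is recorded and reversible.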
The question that actually matters
None of this is an argument against AI in the SOC; quite the opposite. Defence is a contest, and the side with the better information wins. The uncomfortable truth the hype skips over is that the adversary has frontier models too: the same general capability now sits on both sides of the fight, so it confers no advantage on either. What the attacker cannot have is your environment (its normal, its topology, its history) and the accumulated judgment of the team that has defended it for years. Ground AI in that, and you hand the defender the one source of edge an adversary can never acquire and will never find in any frontier model. That is what tips a contest.
But augmentation only earns trust if the people and the platform delivering it can stand behind it. The question was never whether AI can produce a useful answer in a demo. It often can. The question is whether you can trust it at 03:40, when a nation-state actor has just gained initial access: whether the process around that answer can show its evidence, ground itself in your environment, name who is accountable, and prove it is right rather than merely sound right. That is the moment every demo is silent about, and the only one that counts.
Reliability is not a model you download. It is an architecture you build, govern, and can be held to.
Frequently asked questions
Does AI hallucinate in cybersecurity and the SOC?
Yes. Hallucination (a confident, plausible answer that is wrong) affects every large language model, and specialised technical domains can fare worse than general ones. In a SOC the most dangerous form is “misgrounding”: a verdict that draws on real evidence but reasons from it incorrectly and looks exactly right, which is far harder to catch than an obvious error.
Can a bigger or better AI model eliminate hallucination?
No. Research by OpenAI and Georgia Tech, published in Nature in 2026, shows hallucination is structurally incentivised by the way models are trained and evaluated, and it persists even in the newest, most capable models. You cannot scale or upgrade your way out of it; reliability has to be engineered around the model.
Does retrieval-augmented generation (RAG) stop AI from hallucinating?
It reduces hallucination but does not eliminate it. In Stanford’s evaluation, purpose-built legal AI tools using retrieval over curated data still hallucinated 17–33% of the time. Grounding is necessary but not sufficient; verification, explainability and human oversight remain essential.
Will AI replace SOC analysts?
No. Gartner has been explicit that there will never be a fully autonomous SOC, and that AI should be used for augmentation rather than replacement. The decisive advantage in security is not the model (attackers have frontier models too) but the defender’s grounded knowledge of their own environment, which an adversary cannot acquire.
How should AI be deployed safely in a security operations centre?
Ground the model in the customer’s own environment, keep that data under the customer’s sovereignty, and wrap it in explainability, a clear owner for every outcome, human oversight able to overrule the system, and improvement that is tested and reversible rather than assumed. The test is not whether AI can produce a useful answer, but whether the process around that answer can be trusted and accounted for.
What makes a managed detection and response (MDR) provider’s AI trustworthy?
Owning the data layer rather than renting a third-party SIEM, so that the AI can be grounded in clean, complete telemetry; showing its working with a full audit trail; carrying a real share of accountability for outcomes; and operating with data sovereignty appropriate to regulated sectors. e2e-assure runs on its own SIEM and has operated SOCs for thirteen years, the foundation that makes grounded, accountable AI possible.
Sources
- Kalai, A. T., Nachum, O., Vempala, S. S. & Zhang, E. “Evaluating large language models for accuracy incentivizes hallucinations.” Nature, 22 April 2026. https://www.nature.com/articles/s41586-026-10549-w
- Dahl, M., Magesh, V., Suzgun, M. & Ho, D. E. “Large Legal Fictions: Profiling Legal Hallucinations in Large Language Models.” Journal of Legal Analysis, 2024. https://dho.stanford.edu/wp-content/uploads/Hallucinations_JLA.pdf
- Magesh, V., Surani, F., Dahl, M., Suzgun, M., Manning, C. D. & Ho, D. E. “Hallucination-Free? Assessing the Reliability of Leading AI Legal Research Tools.” Stanford RegLab & HAI, 2025. https://law.stanford.edu/wp-content/uploads/2024/05/Legal_RAG_Hallucinations.pdf
- Schwarcz, D. et al. “AI-Powered Lawyering.” SSRN, March 2025. https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5162111
- Palo Alto Networks. “Agentic AI vs. AI Agents” (Cyberpedia). https://www.paloaltonetworks.com/cyberpedia/agentic-ai-vs-ai-agents