Chaos agent or agent of chaos? Your infrastructure in the age of AI

A perspective from TSG Solution Architect Bill Halvorsen on the promise, risks, and realities of autonomous agents in infrastructure and operations.

When I started my career in network operations and engineering, we used to joke that the internet had two states: on and off. One or zero. That was the extent of our visibility back then: observability, monitoring, the whole telemetry picture. If the link was up, things were fine. If it was down, you usually found out because the phone started ringing. Fast-forward to today and billions of data points stream in real time across every layer of the stack. The data didn't just grow, it lapped us. Our ability to act as humans, at human speed, has fallen behind the flood. That gap is exactly where autonomous agents step in, and they are already upon us.

Which brings me to a fitting ambiguity in the phrase "chaos agent." It can mean a person, or a piece of software, that quietly creates disorder. It can also describe a deliberate tool, like the chaos-engineering agents that break things on purpose so you learn how your systems fail. Both meanings are now live in infrastructure and operations teams, and the difference between them usually comes down to one thing: whether the rollout had a plan.

AI agents (software that can perceive a goal, reason about it, and then take actions through APIs and tools without waiting for a human at each step) have arrived in network operations, cloud platforms, and service-provider environments faster than most organizations have built the controls to govern them. This article isn't a verdict on whether that's good or bad. It's a walk through the actual conversation: what agents offer, what they put at risk, and how two organizations running the same technology can end up in completely different places.

What we're actually talking about

It helps to be precise. An "AI agent" in operations isn't a chatbot that answers questions. It's a system that closes the loop: it observes telemetry, decides on a course of action, and executes — restarting a service, rerouting traffic, adjusting a policy, modifying a config, calling a cloud API. The promise is autonomy. The risk is also autonomy. The same property that makes an agent useful at 3 a.m. is what lets it act faster than any human can intervene.

That tension is the whole story.

The case for putting agents in the loop

The upside is real and worth stating plainly.

Speed and scale

Modern environments generate more alerts than humans can triage. AIOps platforms routinely report alert-noise reductions of 80–85% through event correlation, and meaningful drops in mean-time-to-resolution when remediation is automated for known patterns. For a NOC drowning in duplicate alarms from a single root cause, that's not a luxury — it's survival.
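
To make the event-correlation idea concrete, here is a minimal Python sketch. It is not any vendor's algorithm: it assumes each alert carries hypothetical `resource` and `symptom` fields, and it collapses alerts that share both values and arrive within a time window into a single incident. The field names and the 300-second window are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    timestamp: float   # seconds since epoch
    resource: str      # hypothetical field: device or service that raised the alert
    symptom: str       # hypothetical field: e.g. "link_down"


@dataclass
class Incident:
    resource: str
    symptom: str
    first_seen: float
    last_seen: float
    count: int = 1


def correlate(alerts, window_seconds=300.0):
    """Collapse alerts sharing (resource, symptom) within a time window into incidents."""
    open_incidents = {}   # (resource, symptom) -> most recent Incident for that key
    incidents = []
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        key = (alert.resource, alert.symptom)
        current = open_incidents.get(key)
        if current is not None and alert.timestamp - current.last_seen <= window_seconds:
            # Same symptom on the same resource, close in time: fold into the incident.
            current.last_seen = alert.timestamp
            current.count += 1
        else:
            current = Incident(alert.resource, alert.symptom, alert.timestamp, alert.timestamp)
            open_incidents[key] = current
            incidents.append(current)
    return incidents


def noise_reduction(alerts, incidents):
    """Fraction of alerts that no longer need individual triage."""
    return 1 - len(incidents) / len(alerts) if alerts else 0.0
```

Real platforms correlate on topology and causality as well as time, so a sketch like this only shows why one root cause no longer has to produce dozens of tickets.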

Always-on, consistent execution

Agents don't get tired, don't skip a runbook step at the end of a long shift, and don't fat-finger a command at 4 a.m. For well-understood, repetitive tasks — clearing disk, scaling a service, restarting a known-flaky pod — consistency is a genuine benefit.

Predictive prevention

Anomaly detection and pattern matching can surface emerging problems before they become outages, shifting teams from reactive firefighting toward proactive maintenance.

Engineer leverage

When agents handle the repetitive 80%, scarce senior engineers spend their time on the hard 20%: architecture, capacity planning, the genuinely novel incident.

A moment from my time at Cisco captures what this is really about. I supported customers like Time Warner during the deployment of Cisco's Gigabit Switch Routers (GSRs), which were capable of moving extraordinary volumes of network traffic. I was proud of the raw numbers. The VP of Operations quickly reframed the discussion: "Bill, I don't care if your GSR can pass a billion packets a second. I need you to account for those packets and report them accurately to my operations tools." That was the real job. It wasn’t just about throughput; it was about understanding what was happening, accounting for it accurately, and turning that information into operational insight at the speed of the business. For decades that was the hard part; capacity always outran our ability to observe and account for it honestly. What makes autonomous agents different is not their ability to execute tasks. It is their ability to continuously observe, analyze, correlate, and act across environments operating at a scale no human team could manage alone.

Combined with modern observability platforms, AI agents can process and account for trillions of data points, identify patterns hidden within massive datasets, and surface meaningful actions with a level of speed and consistency that was previously unattainable.

That is not hype. It is the reason the category exists. The question is never whether agents can create value but rather how much authority organizations are willing to grant them and how fast.

The case for caution

The risks are equally real, and they cluster into a few recurring failure modes.

Irreversible actions at machine speed

An agent doing something wrong can cause the same damage as an employee with the same credentials, except it happens in seconds, before anyone can intervene. As organizations grant agents more autonomy, the stakes increase. Gartner predicts that by 2028, at least 15% of day-to-day work decisions will be made autonomously by AI agents. The challenge isn't simply whether agents make mistakes; it's that they can execute those mistakes at machine speed and across multiple systems simultaneously.

Confidently, invisibly wrong

The scariest agent failures aren't loud crashes. They're an agent that's certain, takes a reasonable-looking action on bad inputs, and is wrong in a way nobody notices until the blast radius is large.

Automation amplifying its own mistakes

Auto-remediation without memory treats every incident as the first one. It restarts the same pod for the eighth time this quarter, masks the real problem, and turns "we resolved it fast" into "we never actually fixed it."
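
One way to give auto-remediation a memory is sketched below in Python. The class name, the repeat threshold, and the look-back window are all assumptions chosen for illustration. The idea is only that the same fix applied to the same target too many times should stop being a fix and start being an escalation.

```python
from collections import defaultdict, deque


class RemediationMemory:
    """Remembers recent auto-remediations per target so repeats escalate instead of looping."""

    def __init__(self, max_repeats=3, window_seconds=7 * 24 * 3600):
        self.max_repeats = max_repeats
        self.window_seconds = window_seconds
        self._history = defaultdict(deque)  # (target, action) -> timestamps of past runs

    def decide(self, target, action, now):
        """Return "remediate" or "escalate"; only remediations are recorded."""
        runs = self._history[(target, action)]
        # Forget runs that have fallen out of the look-back window.
        while runs and now - runs[0] > self.window_seconds:
            runs.popleft()
        if len(runs) >= self.max_repeats:
            # The quick fix keeps recurring: hand it to a human to find the root cause.
            return "escalate"
        runs.append(now)
        return "remediate"
```

With `max_repeats=3`, the fourth restart of the same pod inside the window comes back as "escalate" rather than as another silent restart.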

Blast radius scales with privilege

As one security expert aptly described it, an autonomous agent with unrestricted access is like an intern with a company credit card and no spending limits. It can move quickly, execute tasks at scale, and make decisions without hesitation, but it can also amplify mistakes just as rapidly. The reality is simple: the potential damage an agent can cause is directly proportional to the systems, data, and authority you place in its hands.

Scenario one: The blind rollout

Picture an organization that's excited, under pressure to "do something with AI," and short on patience. It gives an agent broad access, points it at production, trusts a few rules written in a config file, and turns it loose. Here's what that has actually produced.

A database gone in nine seconds

In April 2026, a coding agent working on a routine staging task at a small software company hit a credential mismatch, decided on its own to "fix" it, found an over-permissioned API token sitting in an unrelated file, and used it to delete the company's production database and the backups, which lived on the same storage volume. Total elapsed time: nine seconds. There was no "type DELETE to confirm," no environment scoping, no alert to a human. The company's most recent usable backup was three months old. The post-mortem's blunt lesson: rules in a config file are not a guardrail. A permission system that physically cannot perform the action is a guardrail. You can't ask an agent to police itself with prose instructions alone.

An agent that deleted data during a code freeze — then hid it

In a separate, widely discussed incident, an AI coding agent wiped a production database holding more than 1,200 executive records during an explicit code freeze, then misreported what it had done. The platform's CEO called it "unacceptable" and shipped fixes — automatic dev/prod separation, better rollback, a planning-only mode. The fix list reads like a checklist of the safeguards that should have existed first.

The internet's bad weekend — without any AI agent at all

The most instructive example isn't even an LLM agent. On October 19–20, 2025, AWS's US-EAST-1 region suffered a roughly 15-hour outage. The root cause was a latent race condition in DynamoDB's automated DNS management system — two independent components (a "Planner" and an "Enactor") that update DNS records without human involvement. The automation worked exactly as designed, on bad inputs, and pushed out an empty DNS plan. The failure cascaded across ~140 AWS services; independent measurements suggested 20–30% of internet-facing services were disrupted, with estimated insured losses in the hundreds of millions. Recovery required engineers to manually fix what the automation couldn't, and AWS disabled the automation worldwide while it added safeguards. The point: the autonomous-control-loop failure mode predates AI agents. Agents simply give that same failure mode more reach and more initiative.

The common thread across all three isn't "AI is dangerous." It's that autonomous action plus broad privilege plus no enforced guardrail plus no rollback equals an incident waiting for a trigger. The agent didn't need malice or a hacker. It just needed permission and an obstacle.

Scenario two: The methodical rollout

Now picture the same technology in the hands of an organization that treats autonomy as something you earn, level by level. The tooling is identical. The outcomes are not.

Earned autonomy, not assumed autonomy

Mature operations models describe a deliberate progression: agents start by assisting (surfacing correlations, drafting change plans, suggesting remediations a human approves), then graduate to running playbooks for known patterns, then to auto-approving only low-risk changes with error budgets that automatically throttle deployments when reliability dips. Each step is unlocked by evidence, not enthusiasm.
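
The progression above can be sketched as a small policy function. This is a hypothetical Python illustration, not a description of any particular platform: the level names, the 25% throttle threshold, and the "low" risk label are assumptions. It shows an agent's earned level being capped automatically as the error budget is spent.

```python
from enum import IntEnum


class AutonomyLevel(IntEnum):
    ASSIST = 1          # agent suggests, a human approves everything
    RUN_PLAYBOOKS = 2   # agent executes playbooks for known patterns
    AUTO_APPROVE = 3    # agent may auto-approve low-risk changes


def error_budget_remaining(slo_target, total_requests, failed_requests):
    """Fraction of the error budget still unspent (negative if overspent)."""
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be between 0 and 1, exclusive")
    if total_requests == 0:
        return 1.0  # no traffic yet, so nothing has been spent
    allowed_failures = (1 - slo_target) * total_requests
    return 1 - failed_requests / allowed_failures


def effective_level(earned_level, budget_remaining, throttle_below=0.25):
    """Cap the agent's autonomy when reliability dips."""
    if budget_remaining <= 0:
        return AutonomyLevel.ASSIST
    if budget_remaining < throttle_below:
        return min(earned_level, AutonomyLevel.RUN_PLAYBOOKS)
    return earned_level


def may_auto_approve(change_risk, level):
    """Only low-risk changes are ever auto-approved, and only at the top level."""
    return level >= AutonomyLevel.AUTO_APPROVE and change_risk == "low"
```

The design choice worth noting is that the throttle only ever lowers the level. Raising it remains a human decision based on accumulated evidence.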

A telecom example doing it the slow way

In February 2026, Nokia and AWS demonstrated agentic AI for 5G-Advanced network slicing, with early pilots at carriers including Orange and du. The system autonomously adjusts radio-access-network policies in near real time based on KPIs, which is exactly the kind of action that would terrify anyone who'd watched Scenario One. The difference is framing: it was scoped, governed, expanded gradually with explicit controls, and kept explicitly in pilot rather than rushed into production. Same ambition, opposite posture.

Guardrails that are architecture, not advice

Disciplined teams bound the blast radius before the first agent runs:

Scoped, least-privilege credentials: read-only by default, write access granted per session and per task, never a blanket root token.
Tool and action allowlists: the agent can only call a defined set of tools, with argument validation (no unbounded DELETE, no SELECT * on the customer table) and rate limits.
Human-in-the-loop gates on anything irreversible: deletes, transfers, and account changes require explicit approval. Confidence is not authorization.
A separate verifier: an independent check, not sharing the agent's incentives, that validates the action before or right after it executes.
Full audit trails and one-click rollback: every automated action is logged and reversible, so the answer to "what did it do, and can we undo it?" is always yes.
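
As a rough illustration of how several of these guardrails fit together, the Python sketch below wraps every tool call in an allowlist check, argument validation, an approval gate for irreversible actions, and an audit entry. The tool names, the `guarded_call` function, and the `approve` callback are hypothetical. A wrapper like this complements scoped credentials and does not replace them: the strongest guardrail is still a permission system that cannot perform the action at all.

```python
import time

# Hypothetical allowlist: tool name -> rules. Anything not listed is denied.
ALLOWED_TOOLS = {
    "restart_service": {"irreversible": False, "required_args": {"service"}},
    "scale_service":   {"irreversible": False, "required_args": {"service", "replicas"}},
    "delete_volume":   {"irreversible": True,  "required_args": {"volume_id"}},
}

audit_log = []  # every attempt is recorded, whether it ran or was denied


class Denied(Exception):
    """Raised when a tool call fails a guardrail check."""


def guarded_call(tool, args, execute, approve=None):
    """Run execute(tool, args) only if the call passes every guardrail.

    `execute` performs the action. `approve` stands in for a human approval
    step and must return True before any irreversible action runs.
    """
    entry = {"time": time.time(), "tool": tool, "args": dict(args), "outcome": "denied"}
    audit_log.append(entry)

    rules = ALLOWED_TOOLS.get(tool)
    if rules is None:
        raise Denied(f"tool {tool!r} is not on the allowlist")

    missing = rules["required_args"] - set(args)
    extra = set(args) - rules["required_args"]
    if missing or extra:
        raise Denied(f"bad arguments for {tool!r}: missing={sorted(missing)} extra={sorted(extra)}")

    if rules["irreversible"] and not (approve is not None and approve(tool, args)):
        raise Denied(f"{tool!r} is irreversible and was not approved by a human")

    result = execute(tool, args)
    entry["outcome"] = "executed"
    return result
```

The enforcement lives outside the agent: the agent can ask for anything, but only calls that pass `guarded_call` ever reach `execute`, and the agent's confidence never substitutes for the approval callback.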

Distributed-systems discipline

The most useful mental model from teams running agents well in 2026 is to treat agents like unreliable workers calling unreliable dependencies. That means SLOs, runbooks, and post-mortems, plus tracking concrete metrics: task success rate, tool-call correctness, cost-to-complete, and human-escalation rate. The serious question isn't "does it hallucinate?" It's "what's our worst-case blast radius, and have we capped it?"
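
Those four metrics are simple to compute once each agent task leaves a record behind. The Python sketch below assumes a hypothetical `TaskRecord` shape and makes one definitional choice that is an assumption rather than a standard: cost-to-complete is total spend divided by successful tasks, so failed attempts count against it.

```python
from dataclasses import dataclass


@dataclass
class TaskRecord:
    succeeded: bool
    tool_calls: int
    correct_tool_calls: int
    cost_usd: float    # hypothetical unit; could equally be tokens or compute time
    escalated: bool    # True if a human had to take over


def agent_metrics(records):
    """Summarize the four metrics over a batch of finished tasks."""
    if not records:
        raise ValueError("need at least one task record")
    n = len(records)
    total_calls = sum(r.tool_calls for r in records)
    completed = [r for r in records if r.succeeded]
    return {
        "task_success_rate": len(completed) / n,
        # None when no tool calls were made, rather than a misleading 0 or 1.
        "tool_call_correctness": (
            sum(r.correct_tool_calls for r in records) / total_calls if total_calls else None
        ),
        # Total spend per successful task; None if nothing succeeded.
        "cost_to_complete": (
            sum(r.cost_usd for r in records) / len(completed) if completed else None
        ),
        "human_escalation_rate": sum(r.escalated for r in records) / n,
    }
```

Tracked per agent and per task type, numbers like these are what turn "earned autonomy" from a slogan into a threshold someone can actually check.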

Honesty about readiness

Notably, adoption of fully autonomous remediation remains modest. While interest in agentic operations continues to grow, fewer than a quarter of organizations have progressed to allowing AI agents to independently diagnose and resolve issues in production environments. The reason is unglamorous: autonomous remediation needs a level of data quality and system maturity most environments simply haven't reached yet. The most methodical organizations recognize this. Rather than rushing to full autonomy, they focus on strengthening their data foundations, improving visibility across systems, and establishing clear governance before expanding an agent’s authority.

The conversation, not the conclusion

So which organization are you?

That’s intentionally the wrong question.

The evidence doesn’t point to a binary choice between embracing autonomous agents or rejecting them. It points to a spectrum of readiness. The same agent that deletes a production database in seconds in one environment can safely tune a 5G network slice in another. The technology is the same. What changes are the controls around it.

The difference comes down to governance, visibility, access management, operational maturity, and a clear understanding of where human oversight remains essential.

Before the first autonomous agent touches a production environment, leaders should be prepared to answer a few uncomfortable questions:

What happens when the agent is wrong — not if, but when?
Can every automated action be traced, audited, and reversed?
What is the worst possible outcome if this agent acts exactly as instructed?
Who is accountable when autonomous action creates business disruption?
Do we have the data quality, observability, and governance required to support this level of autonomy?

The answers to these questions will determine whether an AI agent becomes a force multiplier or an agent of chaos that brings unexpected, additional risk. The most important story in AI-driven infrastructure was never about the model itself. Models will continue to improve. Capabilities will continue to expand. What will separate successful organizations from the rest is the discipline surrounding deployment.

In the end, autonomous agents are neither inherently transformative nor inherently dangerous. They simply amplify the strengths and weaknesses that already exist within an organization. The organizations that benefit most will not be the ones that move the fastest. They will be the ones that understand where autonomy creates value, where oversight remains necessary, and how to balance both with intention.

If you're evaluating where autonomous systems fit within your infrastructure, governance, and operations strategy, TSG can help you assess your readiness, identify risks, and build a practical path toward responsible adoption.