The AI SRE: Always-On Monitoring With Humans in the Loop

How an AI SRE works when humans stay in control: AI watches, reasons about incidents, and runs only the safe fixes a team has explicitly opted into.

Topic: AI SRE
Category: Thought Leadership
Read time: 6 min read
Published: February 17, 2026
Reviewed: May 9, 2026
Author: ops0 Engineering

Key Takeaways

  • AI SRE works as always-on monitoring with AI analysis and human-approved action, not as a hands-off operator
  • ops0 auto-remediates only an opt-in safe-tier allowlist with hourly rate limits, controlled by hiveAutoFixSettings
  • Deploys, drift fixes, configuration changes, IAM, and production writes always require explicit human approval
  • Kubernetes incidents are detected by a monitor loop that runs every 60 seconds by default and are deduped by namespace, kind, name, and reason
  • AI analysis populates aiAnalysis and aiRecommendations on each incident as a triage assist, not a fix executor
  • The control model is enforced in tables and execution paths, not in soft policy

Quick Answers

What is an AI SRE?

An AI SRE continuously watches infrastructure, reasons about incidents using AI analysis, suggests fixes, and runs only the narrow set of remediation actions that a team has explicitly approved as safe.

Does ops0 take action on infrastructure on its own?

Only within an opt-in safe-tier allowlist with hourly rate limits, controlled by hiveAutoFixSettings. Deploys, drift remediation, configuration changes, IAM, and any production write require explicit human approval.

How are Kubernetes incidents detected?

A monitor loop runs on a configurable interval, default 60 seconds. It pulls events, pod status, and workload health, dedupes by namespace, kind, name, and reason, and creates or updates an incident with severity, occurrence count, and a raw event window.

What does AI analysis do?

AI analysis populates aiAnalysis and aiRecommendations on each incident with a likely cause, recently changed resources to inspect, and suggested next steps. It is a triage assist, not a fix executor.

What infrastructure signals does ops0 reason against?

Deployments, configuration changes, drift findings, Kubernetes events, Trivy and Kyverno results, OpenCost data, and dependency context from Resource Graph.

An AI SRE watches infrastructure on a schedule, reasons about incidents when something breaks, and runs only the narrow set of fixes the team has explicitly approved as safe. Everything else is surfaced as a recommendation with a clear next action. ops0 does this with a monitoring loop that runs every 60 seconds by default, AI analysis when an incident is created or on demand, an opt-in safe-fix allowlist with hourly rate limits, and explicit approval gates on deploys, drift remediation, configuration changes, and anything that touches IAM or production state. The 3 AM page becomes a morning summary, but humans still own the decisions that matter.

The promise of NoOps was simple: infrastructure so well operated that routine toil disappears. The reality of AI SRE today is more honest: AI absorbs the analysis work, and the team keeps the keys.

NoOps struggled the first time around because teams tried to achieve it with the wrong tools. The automation did exactly what it was told, which meant teams had to tell it everything: every edge case, every failure mode, every recovery procedure.

That is not automation. That is scripting.

A useful AI SRE needs systems that understand what they are doing, plus a control model that bounds what they are allowed to do.

What AI SRE Means In Practice

There are two parts: the analysis, and the action.

Analysis is where AI carries the load. The system collects events, pod state, deployment history, drift signals, policy results, and dependency context. When an incident is created or an operator clicks "Analyze," AI reads that context, reasons about likely root cause, and writes a recommendation. Most of the value happens here.

Action is where the team stays in charge. ops0 does not push deploys, modify IAM, scale clusters, or change configuration on its own. Those flows go through dry runs, plan output, policy gates, and explicit approval before a single resource changes.

Always-On Monitoring

The Kubernetes incident monitor in ops0 runs on a configurable interval, default 60 seconds, controlled by incidentMonitoringInterval. Each tick it pulls events, pod status, and workload health from connected clusters, dedupes by incidentKey (namespace, kind, name, reason), and either creates a new incident or updates an existing one with occurrenceCount, firstSeenAt, lastSeenAt, severity, and a JSONB array of raw events.
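The create-or-update step described above can be sketched in a few lines. This is a minimal illustration, not ops0's implementation: the in-memory `store` dictionary, the `Incident` dataclass, the slash-joined key format, and the `record_event` function are all assumptions made for the example. The field names mirror the ones named above, written in Python's snake_case.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class Incident:
    # Hypothetical in-memory shape; fields mirror the ones named in the article.
    incident_key: str
    severity: str
    first_seen_at: datetime
    last_seen_at: datetime
    occurrence_count: int = 1
    raw_events: list = field(default_factory=list)


def incident_key(event: dict) -> str:
    """Build the dedupe key from namespace, kind, name, and reason.

    The slash-joined format is an assumption for this sketch.
    """
    return "/".join(event[k] for k in ("namespace", "kind", "name", "reason"))


def record_event(store: dict, event: dict, now: datetime) -> Incident:
    """Create a new incident, or fold the event into the matching one."""
    key = incident_key(event)
    incident = store.get(key)
    if incident is None:
        # First sighting: first_seen_at and last_seen_at start out equal.
        incident = Incident(key, event["severity"], now, now, raw_events=[event])
        store[key] = incident
    else:
        # Repeat sighting: bump the counter and extend the raw event window.
        incident.occurrence_count += 1
        incident.last_seen_at = now
        incident.raw_events.append(event)
    return incident
```

Calling `record_event` twice with the same namespace, kind, name, and reason leaves one incident in the store with an occurrence count of two, which is the behavior that keeps a flapping pod from opening a new incident on every tick.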

This is intentionally not a real-time event stream. A scheduled loop is more stable, deduplicates better, and gives AI analysis a coherent window of context to reason about rather than firing on every transient event.

Around that loop, ops0 ingests deployment history, IaC drift findings, policy check results, Trivy vulnerability data, Kyverno admission decisions, and OpenCost allocation when those integrations are enabled. That is the situational awareness layer the AI reasons against.

What AI Does With That Data

AI analysis is a triage assist, not a fix executor. When an incident is created or a user clicks Analyze, ops0 sends the structured incident plus its raw events to the AI backend. The result populates aiAnalysis (a human-readable summary) and aiRecommendations (suggested next steps). aiAnalyzedAt records when the analysis was last refreshed.
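A small sketch makes the "triage assist" boundary concrete: the analysis step only writes fields onto the incident record and executes nothing. The `analyze_incident` function, the `backend` callable, and the `summary` and `recommendations` keys of its return value are hypothetical stand-ins, since the article does not describe the AI backend's interface.

```python
from datetime import datetime
from typing import Callable


def analyze_incident(incident: dict, backend: Callable[[dict], dict], now: datetime) -> dict:
    """Return a copy of the incident with the analysis fields filled in.

    `backend` is a hypothetical stand-in for the AI call. It receives the
    structured incident (including its raw events) and is assumed to return
    a dict with "summary" and "recommendations" keys.
    """
    result = backend(incident)
    updated = dict(incident)  # leave the caller's record untouched
    updated["aiAnalysis"] = result["summary"]
    updated["aiRecommendations"] = list(result["recommendations"])
    updated["aiAnalyzedAt"] = now.isoformat()
    return updated
```

Nothing in this function touches a cluster. The only side effect is a new record with three extra fields, which is the property the control model depends on.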

The output is not a button that runs against production. It is a triage view: likely cause, recently changed resources to inspect, suggested commands or manifest changes, and a short list of next actions. The team decides what to apply.

That separation is what keeps AI useful in production without giving it write access to a live cluster.

What ops0 Will Auto-Remediate

There is a real auto-remediation surface in ops0, but it is small and explicitly opt-in.

Settings live in hiveAutoFixSettings, scoped per organization. To enable auto-fix, an admin sets enabled to true, defines allowedRiskLevels, sets maxAutoFixesPerHour as a rate cap, lists requireApprovalFor risk levels, and optionally excludes specific agents or issue categories. Default behavior is conservative: only safe-tier fixes auto-execute, with a five-per-hour rate limit, and any moderate or advanced fix requires approval regardless of the enabled flag.
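The decision those settings imply can be sketched as a single predicate. This is an illustration under stated assumptions, not ops0's code: the `AutoFixSettings` dataclass and `can_auto_execute` function are hypothetical, the risk level names `safe`, `moderate`, and `advanced` are taken from the paragraph above, and the defaults follow the conservative behavior it describes.

```python
from dataclasses import dataclass, field


@dataclass
class AutoFixSettings:
    # Hypothetical mirror of hiveAutoFixSettings with the conservative defaults.
    enabled: bool = False
    allowed_risk_levels: set = field(default_factory=lambda: {"safe"})
    max_auto_fixes_per_hour: int = 5
    require_approval_for: set = field(default_factory=lambda: {"moderate", "advanced"})
    excluded_agents: set = field(default_factory=set)
    excluded_categories: set = field(default_factory=set)


def can_auto_execute(settings: AutoFixSettings, risk_level: str, category: str,
                     agent: str, fixes_in_last_hour: int) -> bool:
    """Return True only when every gate allows an automatic fix."""
    if not settings.enabled:
        return False  # auto-fix is opt-in
    if risk_level in settings.require_approval_for:
        return False  # approval requirement wins over the allowlist
    if risk_level not in settings.allowed_risk_levels:
        return False
    if agent in settings.excluded_agents or category in settings.excluded_categories:
        return False
    # Hourly rate cap: refuse once the budget for this hour is spent.
    return fixes_in_last_hour < settings.max_auto_fixes_per_hour
```

Anything that returns False here is not dropped. In the model the article describes, it is surfaced as a recommendation and waits for a human.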

Examples of what fits the safe tier: restarting a crash-looping pod, clearing a full ephemeral disk, rotating logs, retrying a failed observability sync, acknowledging a duplicate incident. These are well-understood operations with limited blast radius and easy reversal.

Examples of what does not fit the safe tier and stays gated: scaling a StatefulSet, rolling back a deployment, applying drift remediation, modifying IAM, changing security groups, deleting resources, applying any IaC plan to production.

Notifications fire on every auto-fix execution so the team has a real-time record of what the system did, not a silent action queue.

What Always Requires Approval

The approval gates are the spine of the control model. They live in dedicated tables and execution paths, not in soft policy.

IaC deployments follow plan, dry-run, approval, then apply. The dry run executes with check and diff flags, so no resources change. The execution loop yields awaitingApproval until a human signs off, then runs the apply.
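One way to picture an execution loop that yields awaitingApproval is a generator that refuses to reach the apply step until an approval check returns true. The sketch below is an assumption about shape only: `run_deployment`, its four callables, and the state names other than `awaitingApproval` are hypothetical.

```python
from typing import Callable, Iterator


def run_deployment(plan: Callable[[], str],
                   dry_run: Callable[[str], str],
                   is_approved: Callable[[], bool],
                   apply: Callable[[str], None]) -> Iterator[str]:
    """Yield one state per step; apply runs only after approval."""
    planned = plan()
    yield "planned"
    dry_run(planned)  # check-and-diff style run; must not change resources
    yield "dryRunComplete"
    while not is_approved():
        # The caller polls again later; nothing is applied in the meantime.
        yield "awaitingApproval"
    apply(planned)
    yield "applied"
```

Because the generator is suspended at each yield, a caller that stops iterating while the state is `awaitingApproval` never triggers the apply. The gate is part of the execution path rather than a convention the caller has to remember.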

Configurations have an approvalStatus field. Deployments block until the status reaches approved.

Workflow steps that touch sensitive paths use workflowApprovalRequests with expiry and required admin sign-off before the step proceeds.
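The expiry and admin sign-off rule can be expressed as a short check. This is a sketch under assumptions: the `ApprovalRequest` dataclass, its field names, and `may_proceed` are hypothetical, and the article does not say how ops0 represents approvers or roles.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class ApprovalRequest:
    # Hypothetical row shape for a workflow approval request.
    step_id: str
    expires_at: datetime
    approved_by: Optional[str] = None
    approver_is_admin: bool = False


def may_proceed(request: ApprovalRequest, now: datetime) -> bool:
    """Allow the step only with an unexpired request signed off by an admin."""
    if now >= request.expires_at:
        return False  # an expired request can no longer be used
    return request.approved_by is not None and request.approver_is_admin
```

In this sketch an approval that arrives after expiry does not count, so a stale request cannot be used to push a sensitive step through later.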

IaC policy violations from Checkov and custom Rego rules block deployment at warning severity or higher. The block is a wall, not a warning.
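The "warning severity or higher" rule is a threshold comparison over an ordered severity scale. The sketch below assumes a four-level scale and a `severity` key on each violation; both are illustrative, since the article names only the warning threshold and not the full scale or the violation format.

```python
# Assumed ordering from least to most severe; only "warning" comes from the article.
SEVERITY_ORDER = ["info", "warning", "error", "critical"]


def blocks_deployment(violations: list, threshold: str = "warning") -> bool:
    """Return True if any violation is at or above the threshold severity."""
    floor = SEVERITY_ORDER.index(threshold)
    return any(SEVERITY_ORDER.index(v["severity"]) >= floor for v in violations)
```

With this shape, a single warning-level finding is enough to stop the deploy, while a plan that produces only informational findings passes.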

Drift remediation is detected automatically and analyzed by AI, but the actual reconciliation runs through the same approval-gated deploy path. The system never reconciles drift on its own.

A Real Triage Path

A useful triage with this model looks like this.

3 AM. The monitor loop fires. A CrashLoopBackOff incident is created in the payments namespace with severity P1, occurrenceCount climbing, lastSeenAt two minutes ago.

The incident detail view groups the related raw events, the recent deployment that changed the image tag, a critical Trivy finding on the new image, and the AI analysis: the likely cause is the image rollout, and the recommended next step is a rollback PR.

If the team has hiveAutoFixSettings configured to auto-restart the affected pod as a safe-tier action, the system restarts it once and notifies the on-call channel. The restart buys time. It is not the fix.

The on-call engineer wakes up to a Slack message and a populated incident view: AI analysis, the recent deploy, the vulnerable image, the rollback PR ready to merge. They merge the PR, the rollback runs through the normal approval-gated deploy path, drift detection confirms the rollback, and the incident is resolved with a status_change note that links back to the rollback.

The team did not get a vague page at 3 AM. They got a triage already in progress with the next action obvious.

What Humans Still Do

AI SRE does not mean no people. It means less work spent assembling context and more work spent on the decisions that matter.

Set policy. Define which risk levels are allowed for auto-fix, the rate limits, the excluded categories, the approval requirements, and the escalation paths.

Approve writes. Sign off on deploys, drift reconciliation, configuration changes, and any change that touches IAM, secrets, or production data.

Handle the novel. AI is good at incidents that match known patterns. New failure modes still need human judgment, and the system is built to surface them rather than guess.

Make strategic calls. Cloud migrations, architecture changes, scaling strategy, regional failover, vendor selection. These are strategic decisions, not operational ones, and they stay with people.

Review and improve. Look at what the system did, decide whether the auto-fix policy is still right, tune the allowlist, expand or restrict scope based on what happened.

ops0's Approach

ops0 was built around always-on AI analysis with human-approved action.

The monitoring is continuous and lightweight, designed to feed observability platforms without the overhead of heavier agents. Deployments, configuration changes, resource utilization, drift, dependency health, Kubernetes events, and policy results all enter one structured incident model.

AI reads each incident in context. It knows that latency at 3 AM is unusual, that a deployment just went out, what the service dependencies are, and which resources are not managed by IaC.

Action stays bounded. Safe-tier remediation runs only when explicitly enabled, only within rate limits, only on an allowlist, and always with notifications. Higher-risk actions are surfaced with explanations and recommendations. Humans stay in the loop for any decision that meaningfully changes infrastructure.

This is what an AI SRE looks like. Not no people, less toil. Not no control, sharper control. Not no observability, observability that reasons.

© 2026 OPS0, Inc. All rights reserved.