Drift · Product Deep-Dive · 6 min read · March 31, 2026

What is Infrastructure Drift and How to Fix It

Infrastructure drift is the gap between what your code says and what your cloud actually looks like. Here is why it happens and what to do about it.

ops0 Engineering
Technical Team

Key Takeaways

  • Infrastructure drift is the gap between your IaC declarations and your actual cloud state
  • Emergency fixes, console changes, and multi-team environments are the main causes
  • ops0 Resource Graph detects drift continuously across AWS, GCP, and Azure with blast radius analysis
  • Prevention through policy gates and continuous scanning beats detection and remediation

Infrastructure drift is when your actual cloud environment no longer matches what your Terraform, CloudFormation, or other infrastructure-as-code says it should be. Someone changed a security group in the AWS console. An emergency SSH fix modified a config file on a server. An auto-scaling event created resources your state file doesn't know about. The gap between declared state and real state is drift, and every team running infrastructure at any scale deals with it.

Drift matters because it means your code is lying to you. A terraform plan against a drifted environment produces output you can't trust, and an apply will either clobber changes you meant to keep or miss resources it never knew about. Disaster recovery from code won't recreate what you actually had. Compliance audits fail when reality doesn't match documentation.

Why Drift Happens

Drift isn't caused by bad engineers. It's caused by real situations.

Emergency fixes are the most common source. Production goes down at 2 AM. You SSH in, change a config, restart a service. It works. You go back to bed. The Terraform never gets updated. Three months later someone runs terraform apply and your fix gets reverted.

Console changes are the second biggest cause. The AWS console is right there. Clicking a button is faster than writing a PR. "I'll update the code later." Later doesn't happen.

Multi-team environments make it worse. When multiple teams touch the same infrastructure, one team's change can create drift for another team's code. Nobody's at fault. The tooling just doesn't handle shared ownership well.

Auto-scaling and dynamic resources add another layer. Kubernetes creates pods, load balancers create target groups, auto-scaling groups adjust instance counts. These are expected changes that still surface as drift when you compare your state files against reality.

How to Detect Drift

The basic approach is running terraform plan regularly. If the plan shows changes you didn't make, you have drift. This works but it's manual, slow, and only catches drift in resources Terraform manages. Anything created outside Terraform is invisible.
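One way to automate the basic approach is terraform plan's -detailed-exitcode flag, which exits 0 when state matches reality, 2 when the plan contains changes, and 1 on error. The sketch below is a minimal, hedged example of wrapping that in a scheduled check; the working directory is a placeholder, and note that exit code 2 can also mean un-applied code changes, not only drift.

```python
import subprocess

def plan_exit_status(workdir: str) -> int:
    """Run `terraform plan -detailed-exitcode` in workdir and return the exit code.

    With -detailed-exitcode, terraform exits 0 when the plan is empty,
    2 when the plan contains changes, and 1 on error.
    """
    result = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false"],
        cwd=workdir,
        capture_output=True,
    )
    return result.returncode

def classify(exit_code: int) -> str:
    """Map the -detailed-exitcode result to a verdict a scheduler can alert on."""
    return {0: "clean", 2: "changes-pending"}.get(exit_code, "error")
```

Run this from cron or a CI schedule and alert whenever classify returns anything other than "clean".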

Better tools scan your actual cloud APIs and compare the results against your state files. This catches both changes to managed resources and new resources that were created outside your IaC.
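The core of that comparison can be sketched in a few lines. This is a simplified illustration, not how any particular product implements it: both inputs map a resource ID to its attribute dict, with the declared side coming from your state file and the live side from cloud API responses.

```python
def diff_resources(declared: dict, live: dict) -> dict:
    """Compare IaC-declared resources against what the cloud API reports.

    Returns three buckets:
      - "modified": managed resources whose live attributes differ from state
      - "unmanaged": resources that exist in the cloud but not in state
      - "missing": resources in state that no longer exist in the cloud
    """
    modified = {
        rid: {attr: (declared[rid].get(attr), live[rid].get(attr))
              for attr in set(declared[rid]) | set(live[rid])
              if declared[rid].get(attr) != live[rid].get(attr)}
        for rid in declared.keys() & live.keys()
        if declared[rid] != live[rid]
    }
    return {
        "modified": modified,
        "unmanaged": sorted(live.keys() - declared.keys()),
        "missing": sorted(declared.keys() - live.keys()),
    }
```

The "unmanaged" bucket is what plain terraform plan can never show you, because Terraform only inspects resources it already tracks.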

ops0's Resource Graph does this continuously. It compares your Terraform state files against live cloud state across AWS, GCP, and Azure, and shows you exactly what changed, when it changed, and what depends on the changed resource. That last part matters because a drifted security group might affect dozens of instances, and knowing the blast radius before you fix anything prevents making things worse.

How to Fix Drift

You have three options when you find drift.

Option one: update the code to match reality. The manual change was correct and should be kept. You modify your Terraform to reflect what actually exists. This is the right choice when the drift was an intentional improvement.

Option two: revert reality to match the code. The manual change was wrong or temporary. You run terraform apply to force the infrastructure back to the declared state. This is the right choice when someone made a mistake or a temporary fix that should be removed.

Option three: ignore it. Some drift is expected and harmless. Auto-scaling changes, dynamic pod counts, and temporary resources don't need to be reconciled.
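For option three, Terraform itself offers lifecycle ignore_changes to stop tracking specific attributes. An external scanner needs the same idea: a per-resource-type allowlist of attributes that churn through normal platform activity. The attribute names below are illustrative assumptions to tune per environment, not a definitive list.

```python
# Attributes that change through normal platform activity and usually
# shouldn't be treated as drift. Illustrative only; tune per resource type.
EXPECTED_CHURN = {
    "aws_autoscaling_group": {"desired_capacity"},
    "aws_lb_target_group": {"registered_targets"},
}

def actionable_drift(resource_type: str, changed_attrs: set) -> set:
    """Drop attribute changes that are expected churn (option three: ignore),
    leaving only the changes that need a decision between options one and two."""
    return changed_attrs - EXPECTED_CHURN.get(resource_type, set())
```

Anything left after this filter needs a human (or a policy) to decide whether code follows reality or reality follows code.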

ops0 gives you one-click remediation for options one and two. You see the drift, you see the blast radius, you pick a direction. No manual editing required.

How to Prevent Drift

Detection and remediation are reactive. Prevention is better.

Policy gates stop non-compliant changes before they happen. If someone tries to deploy infrastructure that violates your compliance rules, the deployment fails before anything changes. ops0 enforces this across 27+ compliance frameworks including CIS, SOC 2, HIPAA, and PCI DSS.
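A policy gate is conceptually just a predicate over the planned resources that must pass before apply. Real implementations use engines like Open Policy Agent or Sentinel; the toy single-rule sketch below only illustrates the shape, with a hypothetical resource format.

```python
def violates_open_ssh(resource: dict) -> bool:
    """Example rule: reject security groups that open SSH (port 22) to the world."""
    for rule in resource.get("ingress", []):
        if (rule.get("from_port", 0) <= 22 <= rule.get("to_port", 0)
                and "0.0.0.0/0" in rule.get("cidr_blocks", [])):
            return True
    return False

def gate(plan_resources: list) -> list:
    """Return IDs of violating resources; an empty list means the deploy may proceed."""
    return [r["id"] for r in plan_resources if violates_open_ssh(r)]
```

Wire this into CI so a non-empty result fails the pipeline before anything touches the cloud.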

RBAC and access controls reduce console access. If fewer people can click buttons in the AWS console, fewer manual changes happen. This isn't about trust. It's about removing the temptation to take shortcuts.

Continuous scanning catches drift early. The longer drift goes unnoticed, the harder it is to fix. Daily or hourly scans keep the gap small. ops0's Discovery runs continuously so drift gets flagged in minutes, not months.

GitOps workflows require every change to go through version control. No direct applies, no console changes, no SSH fixes without a matching PR. This is the gold standard but it requires discipline and good tooling to enforce.

Drift in Kubernetes

Kubernetes has its own drift problem. Helm charts declare desired state but operators, controllers, and manual kubectl commands modify actual state constantly. Someone runs kubectl edit to fix a broken deployment. A controller scales a replica set. An admission webhook modifies a pod spec.

ops0 tracks 31 Kubernetes resource types for drift, comparing Helm chart state against live cluster state. This is harder than cloud drift because Kubernetes is designed to be dynamic. The challenge is distinguishing intentional changes (auto-scaling, rolling updates) from actual drift (manual edits, configuration mistakes).
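One signal for making that distinction is that Kubernetes records which client last set each field in metadata.managedFields (part of the server-side apply machinery), so the field manager's name hints at whether a change was a controller reconciling or a human running kubectl edit. The sketch below is a simplified illustration; the manager names in the automated set are examples, not an exhaustive or authoritative list.

```python
# Field managers that represent automated reconciliation rather than human
# intervention. Illustrative set; real clusters will have more.
AUTOMATED_MANAGERS = {"kube-controller-manager", "kube-scheduler", "helm"}

def classify_changes(managed_fields: list) -> dict:
    """Split a resource's managedFields entries into automated churn vs
    likely manual drift (e.g. a manager named after a kubectl subcommand)."""
    verdict = {"automated": [], "manual": []}
    for entry in managed_fields:
        manager = entry.get("manager", "")
        bucket = "automated" if manager in AUTOMATED_MANAGERS else "manual"
        verdict[bucket].append(manager)
    return verdict
```

Entries in the "manual" bucket are the candidates worth reconciling back into the Helm chart.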

The Real Cost of Drift

Drift isn't just a technical problem. It's a trust problem. When your code doesn't match reality, you can't trust your disaster recovery plan. You can't trust your compliance reports. You can't trust that a terraform apply won't break production.

Teams that don't manage drift end up afraid to make changes. They stop running terraform plan because the output is too scary. They stop updating infrastructure because nobody knows what will happen. The infrastructure calcifies and becomes impossible to maintain.

Solving drift isn't about buying a tool. It's about closing the gap between what you declare and what exists, continuously and automatically.

Ready to Experience ops0?

See how AI-powered infrastructure management can transform your DevOps workflow.

Get Started

