Terraform State Management: Lessons Learned the Hard Way
Terraform state is both the most important and most dangerous file in your infrastructure. It's the source of truth for what Terraform manages. Lose it, corrupt it, or let two people modify it simultaneously, and you're in for a very bad day.
Here are the lessons I learned — some of them the hard way.
Lesson 1: Remote State from Day One
Local state files work for tutorials. For anything else, use remote state.
terraform {
backend "azurerm" {
resource_group_name = "rg-terraform-state"
storage_account_name = "stterraformstate"
container_name = "tfstate"
key = "networking/terraform.tfstate"
}
}
Not tomorrow. Not "when we set up the proper backend." Day one.
Lesson 2: State Locking Prevents Disasters
Two people run terraform apply at the same time. Both read the same state. Both try to create the same resource. Result: corrupted state, duplicated resources, and a weekend of cleanup.
Azure Blob Storage with lease-based locking. DynamoDB for AWS. Turn it on and never think about it again.
Lesson 3: Isolate State by Blast Radius
One state file for everything = one mistake destroys everything.
What works: Separate state files by domain AND environment.
states/
├── networking/
│ ├── dev.tfstate
│ ├── staging.tfstate
│ └── prod.tfstate
├── compute/
│ ├── dev.tfstate
│ └── prod.tfstate
└── database/
├── dev.tfstate
└── prod.tfstate
A bad plan in networking can't touch your databases. A dev experiment can't affect production.
Lesson 4: Never Manually Edit State
terraform state mv, terraform state rm, terraform import — these are the only safe ways to modify state. Opening the JSON file and editing it directly is asking for corruption.
The one exception: if state is already corrupted and the CLI tools can't help. Even then, take a backup first.
Lesson 5: State Backups Are Not Optional
Enable versioning on your state storage. Azure Blob versioning, S3 versioning — whatever your backend supports. When (not if) something goes wrong with state, the ability to roll back to a previous version is the difference between a 5-minute fix and a multi-day recovery.
The Hard-Won Summary
| Rule | Why |
|---|---|
| Remote state always | Collaboration + durability |
| State locking always | Prevent concurrent corruption |
| Isolate by blast radius | Limit damage from mistakes |
| Never edit state manually | CLI tools or nothing |
| Version your state storage | Rollback capability |
What's your worst Terraform state horror story? Share it on LinkedIn — misery loves company.
