Why Every Platform Team Needs a Service Catalogue
"Who owns this service?"
If answering that question requires Slack messages, tribal knowledge, or archaeology through Git blame, you need a service catalogue.
We test application code religiously. Unit tests, integration tests, E2E tests. CI won't pass without 80% coverage.
Infrastructure code? "Just run terraform plan and eyeball it."
That inconsistency is where production incidents are born.
Your infrastructure was deployed with Terraform. It's version-controlled. It's reviewed. It's compliant.
Then someone SSH'd into a server and changed a config file. Someone used the Azure portal to add a firewall rule. Someone ran kubectl edit to patch a deployment in production.
Now your Terraform state says one thing. Reality says another. That gap is configuration drift — and it's the silent killer of reliable infrastructure.
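To make that gap concrete, here is a minimal sketch of drift detection as a plain comparison between what was declared and what is actually running. It is not how Terraform works internally. The `detect_drift` function, the `<absent>` marker, and the sample settings (`replicas`, `image`, `firewall_rules`) are all hypothetical names chosen for illustration, and the sketch assumes both sides can be flattened into simple key/value dictionaries.

```python
from typing import Any

# Hypothetical marker for a setting that exists on only one side.
MISSING = "<absent>"


def detect_drift(
    declared: dict[str, Any], actual: dict[str, Any]
) -> dict[str, tuple[Any, Any]]:
    """Return every key whose declared value differs from the actual one.

    Each entry maps the key to a (declared, actual) pair. Keys present on
    only one side are reported with MISSING on the other side.
    """
    drift: dict[str, tuple[Any, Any]] = {}
    for key in sorted(declared.keys() | actual.keys()):
        want = declared.get(key, MISSING)
        have = actual.get(key, MISSING)
        if want != have:
            drift[key] = (want, have)
    return drift


# Illustrative data only: what the code says versus what is really deployed.
declared = {"replicas": 3, "image": "api:1.4.2", "firewall_rules": ["allow-443"]}
actual = {
    "replicas": 5,  # someone patched the deployment by hand
    "image": "api:1.4.2",
    "firewall_rules": ["allow-443", "allow-22"],  # someone added a rule in the portal
}

for key, (want, have) in detect_drift(declared, actual).items():
    print(f"{key}: declared={want!r} actual={have!r}")
```

With the sample data, the report flags `firewall_rules` and `replicas` and stays silent about `image`. The point of the sketch is the shape of the problem: drift is only visible once something compares the two views on a schedule, rather than waiting for the next deployment to trip over it.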
Helm is the de facto package manager for Kubernetes. It's powerful, widely adopted, and the first tool most teams reach for.
It's also overused, over-complicated, and sometimes the wrong tool entirely.
"We're multi-cloud." In most organisations, this really means "we use AWS for some things and Azure for others, and nobody has a unified view of either."
True multi-cloud at enterprise scale is hard. Here's what I learned doing it — and what I'd do differently.
3 AM. PagerDuty fires. You open the runbook. It says:
"Check the logs and restart if necessary."
That's not a runbook. That's a suggestion. And at 3 AM, it's useless.
Here's how to write runbooks that actually help during incidents.