Killing the Saturday 11pm Deploy: A 12-Month Case Study

It was a Friday at 4pm when a Slack message from Jeff popped up: "Hey, we're doing the deploy Saturday at 11pm, can you be online?"

I had been consulting at Aurora Analytics for two days. Aurora is a B2B data platform with about 120 engineers across seven product teams. They had a P0 outage three weeks earlier, a bad deploy that took production down for two and a half hours. The CTO had brought me in to run an honest assessment of their CI/CD practice.

I asked Jeff why Saturday at 11pm. He sent back: "We just do."

That shrug is the whole problem.

What I found

Aurora's leadership had told me, before I arrived, that the team was operating at Bastion Maturity Level 3 across the board: fully automated pipeline, full observability, progressive rollouts on the flagship API. The recent P0, they thought, was bad luck.

Two days of artifact review showed a different picture.

On Define Shippable, the team had a deployment contract written down, but it was an eight-month-old Notion page. Most CI checks were warnings, not blockers. Override rate was not tracked. Self-rated Level 3. Validated Level 1.

On Automate the Path, the pipeline existed but required a designated engineer to manually approve every production deploy. Artifacts were rebuilt per environment instead of promoted. Self-rated Level 3. Validated Level 2.

On See Everything, three of 33 services had SLOs. The other 30 had threshold-based CPU alerts and no runbooks. Self-rated Level 3. Validated Level 2.

On Deploy Progressively, exactly one service used canary deploys. The other 32 were big-bang. The recent P0 was a big-bang deploy gone wrong. Self-rated Level 3. Validated Level 1.

On Detect and Recover, post-mortems existed but produced action items that stayed open for months. There were 23 open items going back over a year. Self-rated Level 2. Validated Level 1.

The honest scorecard was the most uncomfortable conversation of the engagement. The CTO sat with the heatmap for a long minute before saying, "That tracks."

The gap between what the team believed about itself and what was actually true was the most useful finding of the entire two-day assessment.

The twelve months of work

We did not try to fix everything at once. The roadmap was four quarters of focused work, sequenced by the dependency graph in the framework. You cannot legitimately reach Level 3 on Pillar 4 without first reaching Level 3 on Pillars 2 and 3. The order matters.

Q1 was foundation work. We rewrote the deployment contract from scratch and made every check blocking. Coverage warnings became hard failures. Security scans gated PRs. We added override logging. By the end of Q1, the team could see exactly how often the procedure was being bypassed. The answer turned out to be: a lot. We also worked through the 23 open post-mortem action items from the previous year. Most of them became new contract checks. Pillars 1 and 5 reached Level 2, on their way to Level 3.

Q2 was the pipeline. This was the quarter the manual approval step came out. Removing that one button was the hardest political fight of the engagement. The security team had inherited it from a 2022 incident response and had defended it for two years. The negotiation took six weeks. The eventual answer: the security team got real-time dashboards of every scan, every deploy, and every contract violation. They got more visibility than the email approvals ever gave them. The button went away. The pipeline went end-to-end. Pillar 2 reached Level 3.

Q2 also retired Aurora's long-lived develop branch. The team had been working in GitFlow: feature branches off develop, weekly merges from develop to main, a release branch when something urgent shipped. The develop branch had drifted from main by about 400 commits at one point. We migrated to trunk-based development: one mainline, feature branches off main, branch lifetime capped at three days by a CI check. The team also collapsed two of their pre-production environments (UAT and pre-prod merged into a single staging environment), because once the artifact-promotion model held cleanly, the extra environments were just places for state to drift. The same SHA now ran in CI, staging, and production. Configuration was the only difference between environments.

Q3 was observability and progressive rollout. We propagated trace IDs across all 33 services, defined SLOs for the 30 that lacked them, converted infrastructure alerts to user-impact alerts, and wrote runbooks for the 50 most common incident patterns. With observability in place, we rolled out canary deploys for every production service. The first month had three false-positive rollbacks that annoyed the product teams. We loosened the thresholds, then tightened them again gradually. Pillars 3 and 4 reached Level 3.

Q4 was closing the loop. Runbooks became executable scripts. Rollback drills became quarterly. Post-mortem outputs started feeding back into the contract automatically. By the end of Q4, the team had reached Level 3 across all five pillars.

What deploy day looks like now

It was a Tuesday at 14:32 when I last checked in with Jeff, about a month after the engagement ended.

He had merged a PR ten minutes earlier. The change was a routine billing query optimization. The pipeline had run, the contract had passed, the canary had gone out to one percent of users, the metrics had stayed flat, and the change had been promoted to one hundred percent of users about twenty minutes after merge.

The team did not have a Slack channel for that deploy. Nobody had to be on call for it. Nobody had to schedule it.

The Saturday 11pm window had been retired six months earlier.

The numbers

Metric	Before	After 12 months
Deploys per week	~3	~30
Lead time (commit to prod)	4 to 6 days	2 to 4 hours
Change failure rate	22%	7%
Mean time to restore	2.5 hours	22 minutes
Override rate	Not measured	3%
Repeat incident rate	~40%	9%

The last row is the one I find most useful to point at. When forty percent of your incidents share a root cause with one you already had, your procedure is not learning. When that number drops below ten percent, the procedure has become load-bearing. Aurora's bastion was finally defending itself.

What did not go cleanly

The slide-deck version of this story would make the engagement sound smooth. The real version was not.

The Q2 timeline slipped by six weeks because of a platform team transition. The Q3 canary thresholds were misjudged on the first attempt and produced false-positive rollbacks that frustrated the product teams. There was a real political fight in Q1 about who owned the contract. None of it was fatal, but none of it was as clean as a roadmap on a whiteboard.

If you take one thing from Aurora's story, take this: the work that made the Saturday 11pm deploy disappear was not new tooling. It was the procedure. The procedure was a year of small, defended changes that added up to something the team could trust.

That procedure is what I have been calling The Bastion. The full field guide is available, and if you have been quietly stuck in your own version of the Aurora situation, the procedure is most of what you need.

The full procedure is in the book

Five pillars, thirteen principles, ten diagnostic metrics, a thirty-five question assessment scorecard. One defended path from commit to production.

Read the framework →

← Back to blog

Killing the Saturday 11pm deploy