What the Bastion Actually Looks Like in a Repo

I will use a representative stack: GitHub Actions, Datadog and Prometheus for metrics, Argo Rollouts, bash for runbooks. The principle is tool-agnostic. The shape of each artifact is what matters. Translate the stack to whatever your team uses.

These are simplified versions of the real artifacts I have written with consulting clients. They are not theoretical. You can copy them into your repo today.

The flow at a glance

Before the YAML, here is how the seven artifacts connect to each other. Read it left to right for the happy path, then follow the loop on the bottom row for what happens when production breaks.

Figure: How the seven artifacts work together, from commit to production and back through the recovery loop. Every arrow is enforced by code in the repo.

The thing to notice on this diagram is the dashed blue arrow on the bottom: every incident loops back to contract.yaml as a new check. That is what makes the procedure adapt over time. Without that arrow, the bastion is static. With it, the bastion learns.

One question I get asked about this diagram: who actually checks the contract? Nobody, in the sense of a human. The contract is a declaration; the enforcers are two pieces of automation. CI (GitHub Actions, GitLab CI, whatever you use) runs every job listed in contract.yaml: the test suite, the security scan, the SBOM verification, the linter. The exit code of each job is the check result. Branch protection rules on the main branch refuse the merge until every required CI check returns green and the required reviewers have approved. For code review specifically, the reviewer rules in contract.yaml generate the GitHub CODEOWNERS file (or equivalent in your VCS), so even the "right human approves" part is enforced by configuration, not by social convention.

The pattern that actually works in 2026 is hybrid: AI writes the rules, deterministic code runs them. An AI assistant (Claude Code, Gemini, ChatGPT) reads the existing workflow and proposes a contract.yaml. It writes the small validation script. It runs as a cron job once a week to audit the repo against the contract and post drift findings to Slack. It runs in the IDE as a pre-PR compliance helper. What it does not do is run the actual CI gate, for three reasons: non-determinism (the same PR can pass on Monday and fail on Tuesday because model output is sampled and can differ between runs), weaker audit trails ("the AI approved this" is worse than "the test passed and these humans approved"), and cost or latency (an LLM call per PR adds up).

The right shape: AI advises and audits; humans and scripts gate. A team that wants to set this up tomorrow can have an AI assistant draft a 30-line validation script that lives in .bastion/scripts/. The script runs in CI. The AI's work is done until the contract changes. The script is deterministic from there forward.
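As a rough idea of what such a script might contain, here is a minimal sketch. It assumes the contract keeps the flat shape shown in Pillar 1 (each check starts with "- id:" and carries enforced_by and blocking keys). The function name validate_contract and the use of awk instead of a real YAML parser are illustrative assumptions, not part of the Bastion itself.

```shell
# Hypothetical .bastion/scripts/validate-contract.sh (the name is an assumption).
# Verifies that every entry under `checks:` declares enforced_by and blocking.
# Line-oriented on purpose: it only understands the flat layout used in this post.
validate_contract() {
  awk '
    # Report what the check we just finished reading is missing.
    function report() {
      if (id == "") return
      if (!has_enforcer) { print "check " id ": missing enforced_by"; bad = 1 }
      if (!has_blocking) { print "check " id ": missing blocking"; bad = 1 }
    }
    /^[ \t]*-[ \t]*id:/ {
      report()
      id = $0
      sub(/^[ \t]*-[ \t]*id:[ \t]*/, "", id)
      has_enforcer = 0; has_blocking = 0
      next
    }
    /^[ \t]*enforced_by:/ { has_enforcer = 1 }
    /^[ \t]*blocking:/    { has_blocking = 1 }
    END { report(); exit bad }
  ' "$1"
}

# Typical CI usage:
#   validate_contract .bastion/contract.yaml
```

The function exits non-zero when any check is incomplete, so the CI job that calls it fails like any other gate.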

The repo layout

Every Bastion artifact lives under a single directory: .bastion/. Reviewing that directory is reviewing the team's procedure.

.
├── .bastion/
│   ├── contract.yaml          # Pillar 1: the deployment contract
│   ├── slos.yaml              # Pillar 3: SLO definitions
│   ├── canary.yaml            # Pillar 4: rollout strategy
│   ├── runbooks/              # Pillar 5: executable runbooks
│   │   ├── 5xx-spike.sh
│   │   ├── high-latency.sh
│   │   └── db-pool-exhausted.sh
│   ├── overrides.jsonl        # Audit log of every gate bypass
│   └── maturity.yaml          # BMI snapshot, updated quarterly
├── .github/
│   └── workflows/
│       ├── ci.yaml            # Pillar 2: the pipeline
│       └── nightly-bmi.yaml   # Computes the Bastion Maturity Index
└── ... # your service code

Pillar 1: The deployment contract

This is the file that defines what "ready to deploy" means for this service. Every check has an owner, a blocking flag, and an explicit override policy.

# .bastion/contract.yaml
version: 3
last_reviewed: 2026-04-12
owner: platform-team

checks:
  - id: unit-tests
    description: All unit tests pass
    enforced_by: github-actions/test
    blocking: true
    override: not-allowed

  - id: integration-tests
    description: Critical-path integration tests pass
    enforced_by: github-actions/integration
    blocking: true
    override:
      allowed_by: ["platform-team"]
      requires_reason: true
      logged_in: overrides.jsonl

  - id: code-review
    description: One engineer approval; security review for auth/db changes
    enforced_by: github/branch-protection
    blocking: true
    rules:
      - touches: ["src/auth/**", "migrations/**"]
        requires: ["security-team"]
      - default: { requires: ["any-engineer"] }

  - id: security-scan
    description: Snyk scan; no high or critical findings
    enforced_by: github-actions/snyk
    blocking: true
    severity_threshold: high

  - id: sbom-provenance
    description: All deps and AI-suggested code have verifiable origin
    enforced_by: github-actions/sbom
    blocking: true

  - id: changelog
    description: PR includes CHANGELOG.md entry under [Unreleased]
    enforced_by: github-actions/changelog
    blocking: true

Look at what is in this file and what is not. There is no "test coverage above X percent." There is no "build successful" (a build always succeeds if you ignore the tests). There is no "all PR checks green" (vague). Every check is specific, owned, and either blocking or auditable.

The contract is reviewed like code. Changes to contract.yaml go through a PR that requires platform-team approval. The team can see who changed what, when, and why.
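One way the reviewer rules could become a CODEOWNERS file is sketched below. It assumes the rules have already been pulled out of contract.yaml as "pattern team" pairs. The example-org organization, the engineers default team, and the render_codeowners name are all hypothetical. GitHub uses the last matching pattern in CODEOWNERS, so the catch-all line goes first.

```shell
# Sketch: render CODEOWNERS lines from "pattern team" pairs read on stdin.
# ORG and DEFAULT_TEAM are placeholders; substitute your own organization and team.
ORG="${ORG:-example-org}"
DEFAULT_TEAM="${DEFAULT_TEAM:-engineers}"

render_codeowners() {
  # Catch-all first: in CODEOWNERS the last matching pattern takes precedence.
  printf '%s @%s/%s\n' '*' "$ORG" "$DEFAULT_TEAM"
  while read -r pattern team; do
    [ -z "$pattern" ] && continue
    printf '%s @%s/%s\n' "$pattern" "$ORG" "$team"
  done
}

# Rules taken from the contract above, plus the contract file itself.
render_codeowners <<'EOF'
src/auth/** security-team
migrations/** security-team
.bastion/contract.yaml platform-team
EOF
```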

Pillar 2: The pipeline

This is the GitHub Actions workflow that enforces the contract. It is intentionally boring.

# .github/workflows/ci.yaml
name: bastion-pipeline
on:
  push: { branches: [main] }
  pull_request: { branches: [main] }

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with: { node-version: 20 }
      - run: npm ci && npm test

  integration:
    runs-on: ubuntu-latest
    needs: test
    steps:
      - uses: actions/checkout@v4
      - run: ./.bastion/scripts/integration.sh

  security:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4   # Snyk needs the source tree to scan
      - uses: snyk/actions/node@master
        env:
          SNYK_TOKEN: ${{ secrets.SNYK_TOKEN }}
        with:
          args: --severity-threshold=high

  sbom:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: ./.bastion/scripts/verify-provenance.sh

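  build-and-promote:
    runs-on: ubuntu-latest
    needs: [test, integration, security, sbom]
    if: github.ref == 'refs/heads/main'
    permissions:
      id-token: write   # OIDC, for registry and cluster credentials
      contents: read    # needed by actions/checkout once permissions are set explicitly
    steps:
      - uses: actions/checkout@v4
      # Registry login, cluster credentials, and the Argo Rollouts kubectl plugin
      # are omitted here; add the steps for your cloud.
      - name: Build immutable artifact
        run: |
          IMAGE_TAG=$(git rev-parse --short HEAD)
          # Each step runs in its own shell, so export the tag for later steps.
          echo "IMAGE_TAG=${IMAGE_TAG}" >> "$GITHUB_ENV"
          docker build -t "${{ vars.ECR_REPO }}:${IMAGE_TAG}" .
          docker push "${{ vars.ECR_REPO }}:${IMAGE_TAG}"
      - name: Trigger progressive rollout
        run: |
          kubectl argo rollouts set image bastion-app \
            "app=${{ vars.ECR_REPO }}:${IMAGE_TAG}"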

Three things to notice. There is no manual approval step between merge and production. The artifact is built once and promoted by reference (the same image that ran in test runs in production). And the rollout uses Argo Rollouts rather than a direct kubectl apply, which is what makes Pillar 4 possible.

Pillar 3: SLOs and alerts

Two SLOs for this service, both expressed declaratively.

# .bastion/slos.yaml
service: bastion-app
slos:
  - name: api-availability
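    description: Percent of API requests returning 2xx within 800ms
    # Simplified: the query below counts 2xx responses only. Enforcing the
    # 800ms bound would need a latency histogram bucket in the numerator.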
    sli: |
      sum(rate(http_requests_total{status=~"2.."}[5m]))
      /
      sum(rate(http_requests_total[5m]))
    target: 99.5
    window: 30d
    error_budget_burn_alerts:
      - threshold: 2%     # 2% of monthly budget in 1 hour = page
        window: 1h
        page: oncall
        runbook: runbooks/5xx-spike.sh
      - threshold: 10%    # 10% in 24 hours = notify but don't page
        window: 24h
        notify: slack-engineering

  - name: checkout-latency
    description: P95 checkout latency under 1.5s
    sli: |
      histogram_quantile(0.95,
        sum by (le) (rate(checkout_duration_seconds_bucket[5m])))
    target_lt: 1.5  # seconds
    window: 30d
    error_budget_burn_alerts:
      - threshold: 5%
        window: 1h
        runbook: runbooks/high-latency.sh

Every alert points to a runbook. If there is no runbook, there is no alert. That single rule eliminates 80% of the noisy infrastructure-level alerts most teams accumulate.
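The arithmetic behind the thresholds in slos.yaml can be made explicit. The sketch below assumes the usual burn-rate definition: the fraction of budget consumed, scaled by the ratio of the SLO window to the alert window. The function names are illustrative.

```shell
# budget_minutes TARGET_PERCENT WINDOW_DAYS
# Error budget for an availability target, in minutes over the SLO window.
budget_minutes() {
  awk -v target="$1" -v days="$2" \
    'BEGIN { printf "%.0f\n", (100 - target) / 100 * days * 24 * 60 }'
}

# burn_rate BUDGET_PERCENT ALERT_HOURS WINDOW_DAYS
# Burn rate implied by "X% of the budget consumed in H hours".
# A burn rate of 1 spends the budget exactly over the SLO window.
burn_rate() {
  awk -v pct="$1" -v hours="$2" -v days="$3" \
    'BEGIN { printf "%.1f\n", pct / 100 * days * 24 / hours }'
}

budget_minutes 99.5 30   # 216 minutes of budget in 30 days
burn_rate 2 1 30         # 14.4, the paging alert
burn_rate 10 24 30       # 3.0, the Slack notification
```

Read this way, the paging alert fires when the service is failing about fourteen times faster than the budget allows, and the Slack notification fires at three times.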

Pillar 4: The canary rollout

Here is the Argo Rollouts spec that defines progressive deployment for this service.

# .bastion/canary.yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata: { name: bastion-app }
spec:
  strategy:
    canary:
      steps:
        - setWeight: 1          # 1% of traffic
        - pause: { duration: 5m }
        - analysis:
            templates: [{ templateName: slo-burn-check }]
        - setWeight: 10
        - pause: { duration: 10m }
        - analysis:
            templates: [{ templateName: slo-burn-check }]
        - setWeight: 50
        - pause: { duration: 15m }
        - analysis:
            templates: [{ templateName: slo-burn-check }]
        - setWeight: 100
---
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata: { name: slo-burn-check }
spec:
  metrics:
    - name: api-availability-canary
      successCondition: result[0] >= 0.995
      provider:
        prometheus:
          address: http://prometheus:9090
          query: |
            sum(rate(http_requests_total{
              status=~"2..", deployment="canary"}[5m]))
            /
            sum(rate(http_requests_total{
              deployment="canary"}[5m]))

The progression is 1% → 10% → 50% → 100%, with an SLO check between each step. If the canary's availability drops below 99.5% during any analysis window, Argo Rollouts automatically aborts the deploy and rolls back. No human in the loop.

The Pillar 4 / Pillar 3 connection. This canary cannot exist without the SLOs from slos.yaml. The SLO is the gate that decides whether to promote or roll back. Without observability, this YAML is theater.

Pillar 5: The executable runbook

Here is what an actual runbook looks like. It is a bash script with deliberate constraints: every command is idempotent, every step explains itself, and the script can be run blindly at 02:14 by an exhausted on-call engineer.

#!/usr/bin/env bash
# .bastion/runbooks/5xx-spike.sh
# Responds to the SLO burn alert: high rate of 5xx responses.
# Needs curl, jq, bc, and kubectl with the Argo Rollouts plugin.
# Expects DD_API_KEY, DD_APP_KEY, and PD_KEY in the environment.

set -euo pipefail

echo "[$(date)] 5xx-spike runbook started"

# Latest value of the error metric over the last five minutes.
# The Datadog query API needs a from/to window and an application key.
error_rate() {
  local now
  now=$(date +%s)
  curl -sG "https://api.datadoghq.com/api/v1/query" \
    --data-urlencode "from=$((now - 300))" \
    --data-urlencode "to=${now}" \
    --data-urlencode "query=avg:trace.servlet.request.errors{env:prod}" \
    -H "DD-API-KEY: ${DD_API_KEY}" \
    -H "DD-APPLICATION-KEY: ${DD_APP_KEY}" \
    | jq -r '.series[0].pointlist[-1][1]'
}

# Step 1: Confirm the alert is real
ERROR_RATE=$(error_rate)
echo "Current error rate: $ERROR_RATE"

if [[ "$ERROR_RATE" == "null" ]]; then
  echo "No data returned for the error metric. Investigate manually."
  exit 1
fi

if (( $(echo "$ERROR_RATE < 0.01" | bc -l) )); then
  echo "Error rate is below 1%. Alert may be stale. Stopping."
  exit 0
fi

# Step 2: Identify the current deploy
LAST_DEPLOY=$(kubectl get rollout bastion-app -o json \
  | jq -r '.spec.template.spec.containers[0].image')
echo "Currently deployed image: $LAST_DEPLOY"

# Step 3: Initiate rollback
echo "Rolling back to previous stable image..."
kubectl argo rollouts undo bastion-app

# Step 4: Wait for rollback to complete
kubectl argo rollouts status bastion-app --timeout 5m

# Step 5: Verify error rate is dropping
sleep 60
NEW_RATE=$(error_rate)
echo "Error rate after rollback: $NEW_RATE"

if (( $(echo "$NEW_RATE < $ERROR_RATE" | bc -l) )); then
  echo "OK Rollback successful. Error rate dropping."
else
  echo "FAIL Error rate not dropping. Escalating to senior on-call."
  curl -X POST https://events.pagerduty.com/v2/enqueue \
    -H "Content-Type: application/json" \
    -d "{\"routing_key\": \"${PD_KEY}\", \"event_action\": \"trigger\","...
fi

echo "[$(date)] 5xx-spike runbook complete"

This is the difference between a runbook that is documentation and a runbook that is procedure. The script can be invoked from a pager. It is checked into the repo. It is tested in quarterly drills, when one of the senior engineers runs all the runbooks against a clone of production to make sure they still work.

The override audit log

Every time a contract check is bypassed, an entry lands in overrides.jsonl:

{"ts":"2026-04-02T14:32:11Z","check":"integration-tests","actor":"sarah@helmsly.com","reason":"flaky test #4421, hotfix for billing bug","pr":"#892","approved_by":"jeff@helmsly.com"}
{"ts":"2026-04-09T09:14:55Z","check":"changelog","actor":"dependabot","reason":"automated dependency update","pr":"#911","approved_by":"automated"}
{"ts":"2026-04-15T22:08:30Z","check":"security-scan","actor":"sarah@helmsly.com","reason":"false positive on transitive dep, snyk #SNYK-JS-AXIOS-1579-FP","pr":"#925","approved_by":"platform-team"}

JSONL is one record per line, append-only, grep-friendly. A nightly job aggregates the override rate per week. When that rate climbs above 5%, the procedure flags it as a leading indicator that the contract no longer matches reality, and the team has a conversation about whether to tighten the contract or stop the practice of overriding.
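A minimal sketch of that aggregation follows. The post does not say what the override rate is measured against; this version assumes the denominator is the number of PRs merged in the same window, supplied by the caller. The function name and its arguments are hypothetical.

```shell
# override_rate FILE SINCE UNTIL MERGED_PRS
# Counts overrides with SINCE <= ts < UNTIL and prints them as a percentage of
# MERGED_PRS. ISO-8601 UTC timestamps compare correctly as plain strings.
override_rate() {
  awk -v since="$2" -v until="$3" -v merged="$4" '
    match($0, /"ts":"[^"]*"/) {
      # Strip the leading "ts":" (6 chars) and the closing quote.
      ts = substr($0, RSTART + 6, RLENGTH - 7)
      if (ts >= since && ts < until) count++
    }
    END {
      if (merged <= 0) { print "merged PR count must be positive" > "/dev/stderr"; exit 2 }
      printf "%d overrides / %d PRs = %.1f%%\n", count, merged, count / merged * 100
    }
  ' "$1"
}

# Example for the week of 6 April, assuming 40 merged PRs:
#   override_rate .bastion/overrides.jsonl 2026-04-06T00:00:00Z 2026-04-13T00:00:00Z 40
```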

Bonus: the nightly BMI job

Here is the workflow that calculates the team's Bastion Maturity Index every night and posts it to Slack.

# .github/workflows/nightly-bmi.yaml
name: bastion-maturity-index
on:
  schedule:
    - cron: '0 7 * * *'    # 07:00 UTC daily

jobs:
  compute-bmi:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Score each pillar
        run: |
          ./.bastion/scripts/score-p1.sh > p1.txt  # scores 1-5
          ./.bastion/scripts/score-p2.sh > p2.txt
          ./.bastion/scripts/score-p3.sh > p3.txt
          ./.bastion/scripts/score-p4.sh > p4.txt
          ./.bastion/scripts/score-p5.sh > p5.txt
      - name: Aggregate and post to Slack
        run: ./.bastion/scripts/post-bmi.sh

The team sees their BMI in Slack every morning. When it moves, they notice. When it stalls, they have a quarterly conversation about why. The score is the sum of the five pillar scores, ranging from 5 (everything at Level 1) to 25 (everything at Level 5).
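The aggregation itself is small. Here is a sketch of what the summing part of post-bmi.sh might look like, assuming each pillar script wrote a single integer from 1 to 5 into its file. The compute_bmi name is hypothetical, and the Slack call is left out because it depends on your webhook setup.

```shell
# compute_bmi [DIR]
# Reads p1.txt..p5.txt from DIR (default: current directory) and prints the sum.
# Fails if a file is missing or a score is outside 1-5.
compute_bmi() {
  local dir="${1:-.}" total=0 score n
  for n in 1 2 3 4 5; do
    score=$(tr -d '[:space:]' < "$dir/p$n.txt") || return 1
    case "$score" in
      [1-5]) total=$((total + score)) ;;
      *) echo "pillar $n: expected a score from 1 to 5, got '$score'" >&2; return 1 ;;
    esac
  done
  echo "$total"
}
```

Rejecting out-of-range scores keeps a broken scoring script from quietly inflating or deflating the number the team sees each morning.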

How a change actually moves through this

The artifacts are the parts. The workflow is how the parts get used. A few readers have written in asking the practical questions: what branching strategy fits with this, what about dev and QA environments, how do hotfixes work, who owns the release log. The short answers are below.

Branching strategy: trunk-based, short-lived

One mainline: main. Feature branches live one to three days, named feat/short-description or fix/short-description, and merge back to main. No long-running develop branch. No release branches unless a version freeze actually requires one.

This is opinionated and comes straight out of Principle 3 (batch size is the master variable). Long-lived branches accumulate batch size; large batches deploy worse. Branch lifetime is the upstream enforcement of the principle. A CI check rejects branch names that don't match the convention.
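The branch-name check can be a few lines of shell. The exact pattern below is an assumption (lowercase words joined by hyphens after feat/ or fix/); adjust it to your own convention.

```shell
# valid_branch_name NAME
# Succeeds for feat/<slug> or fix/<slug>, where the slug is lowercase
# alphanumeric words joined by single hyphens.
valid_branch_name() {
  printf '%s\n' "$1" | grep -Eq '^(feat|fix)/[a-z0-9]+(-[a-z0-9]+)*$'
}

# check_branch NAME
# Prints a verdict and returns non-zero on a rejected name.
check_branch() {
  if valid_branch_name "$1"; then
    echo "ok: $1"
  else
    echo "rejected: $1 (expected feat/short-description or fix/short-description)" >&2
    return 1
  fi
}

# In a GitHub Actions pull_request job, the source branch is in GITHUB_HEAD_REF:
#   check_branch "$GITHUB_HEAD_REF"
```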

Teams migrating from GitFlow ask whether they can keep their develop branch. The honest answer is no, not if they want the rest of the Bastion to work. The migration is usually a quarter of work and a real conversation about what develop was giving the team in the first place. Usually it was an illusion of safety from larger batches, which is exactly the wrong direction.

Figure: Trunk-based (top) vs GitFlow (bottom). Short branches keep batches small; long-lived branches grow batch size by design.

The lifecycle of a change

End-to-end, eighteen steps, organized in four phases. The artifact is built early (on push to the feature branch) and the same artifact ships through to production. Promotion is metadata; the merge does not trigger a rebuild.

Figure: The single-artifact lifecycle. Eighteen steps in four phases: Branch Work, PR Review, Merge Queue, Shipping. The artifact is built on push (Phase 1), and the same artifact ships in Phase 4. The merge queue rebuilds only if the rebase produces a new SHA.

Phase 1 (Branch work, steps 1-6):

1. Engineer pulls latest main
2. Creates feat/checkout-bug off main
3. Commits, pushes
4. CI runs on every push: tests, security scan, SBOM, linter, and builds the artifact tagged with the branch tip SHA: bastion-app:B
5. Preview environment auto-deploys this artifact
6. Dev, QA, stakeholders test the preview (the exact artifact that will ship)

Phase 2 (PR review, steps 7-9):

7. Engineer opens PR to main
8. Contract checks fire. Required reviewers auto-assigned. Validation runs against the existing artifact, no rebuild required.
9. Reviews land, comments resolve, the engineer adds the PR to the merge queue. This is the only human button press in the entire flow.

Phase 3 (Merge queue, steps 10-13):

10. PR enters the merge queue
11. Queue rebases the branch onto current main. If main has not moved, the rebase is a no-op and B is still the shipping SHA. If main moved, the rebase produces a new SHA, D.
12. CI rebuilds the artifact for the rebased SHA: bastion-app:D. (Skipped if B == D.)
13. Preview environment redeploys with the rebased artifact. QA gets a notification with a window to re-validate if needed.

Phase 4 (Shipping, steps 14-18):

14. Queue fast-forwards main to the rebased branch tip. No merge commit; main now equals D.
15. Pipeline on main detects that bastion-app:D already exists in the registry. No rebuild.
16. Same artifact promoted to staging.
17. Progressive rollout to production: 1% → 10% → 50% → 100%, gated by SLOs at each step (Pillar 4).
18. Release entry appended to releases.jsonl. Slack notification posts the SHA, the PR link, and the one-line rollback command.

The artifact identity holds end to end. What QA tested in the preview is byte-identical to what ships, because both reference the same SHA in the same registry. There is no "build at merge" step that could drift from the build that was tested. This is the load-bearing claim of the Bastion's deploy model.
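The "already exists in the registry" check in Phase 4 could be sketched as follows. It relies on docker manifest inspect, which exits non-zero when the tagged manifest cannot be found. The function names are hypothetical, and a registry outage would look the same as a missing image here, so a real pipeline may want to tell those two cases apart.

```shell
# image_exists REPO TAG
# Succeeds when the registry already holds REPO:TAG.
image_exists() {
  docker manifest inspect "$1:$2" > /dev/null 2>&1
}

# build_or_reuse REPO TAG
# Builds and pushes only when the tagged image is absent; otherwise reuses it.
build_or_reuse() {
  if image_exists "$1" "$2"; then
    echo "reuse $1:$2"
  else
    docker build -t "$1:$2" . && docker push "$1:$2" && echo "built $1:$2"
  fi
}

# Example (repository name is a placeholder):
#   build_or_reuse registry.example/bastion-app "$(git rev-parse HEAD)"
```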

For this to work, four components have to be in place together: linear history (rebase or fast-forward merges only, no squash), reproducible builds (same source produces a byte-identical artifact), build on push (every branch push triggers a build, tagged with the commit SHA), and a merge queue (GitHub Merge Queue, GitLab Merge Trains, Mergify, Bors, or equivalent) to serialize and rebase before fast-forward.

Teams that cannot meet the preconditions yet can use the pragmatic fallback: build a separate preview artifact from the feature branch, build a separate shipping artifact on merge, accept the small drift. The fallback is not the recommendation; it is the bridge to the recommendation while the team invests in the preconditions.

Human review checkpoints in this lifecycle

The 18 steps above hide where each human role does its review. A team transitioning to this model needs to know "when does QA do their pass?" and "where does the product owner sign off?" The mapping is below.

Role	Phase / step	What they do
QA, initial pass	Phase 1, step 6	Exploratory testing on the preview env. Sign-off on the PR before it's queued.
QA, re-validation	Phase 3, step 13	Re-test on the post-rebase preview if the rebase touched code QA already exercised. Required, advisory, or skipped per team policy.
PO / stakeholder demo	Phase 1, step 6	See the actual artifact running. Early product feedback while the PR is open.
PO, acceptance / sign-off	Phase 2, step 9	PO is a required reviewer in the contract for product-facing changes. PR can't be queued without their approval.
Engineering peer review	Phase 2, step 9	Default + special-case reviewers (security, DB team) per the contract. Same gate as PO.
Real-user UAT	Phase 4, step 17	Feature-flag-controlled rollout to a PO-selected cohort. Observability tells the team when to widen. This IS UAT, in production.
Continuous QA	Phase 4 onward	SLOs and dashboards are continuous QA. Pillar 3 owns this.

The first row is the one engineers most often miss. QA's initial pass happens on the preview environment, before the PR enters the merge queue. The team treats this as a gate, not a courtesy. No QA sign-off, no queue.

What happens to QA sign-off if main moves and the branch rebases? Three options:

Strict. Any rebase with a new SHA invalidates the prior QA sign-off. Re-validation required. Safest, slowest. Right for regulated industries.
Configurable per PR. Engineer marks each PR "rebase-safe" or "rebase-sensitive" at queue time. Default in most Bastion engagements.
Trust-based. Rebase re-validation is advisory. Right for teams with mature, fast, comprehensive automated tests. Risky for everyone else.

The choice goes into the contract. The merge queue enforces it.
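How a merge queue hook might apply the three options is sketched below. The policy names (strict, per-pr, trust), the argument order, and the conservative default of treating an unlabeled PR as rebase-sensitive are assumptions made for illustration.

```shell
# qa_signoff_valid POLICY SIGNED_SHA QUEUE_SHA [PR_LABEL]
# Exit status 0 means the existing QA sign-off still stands after the rebase.
# POLICY: strict | per-pr | trust.  PR_LABEL: rebase-safe | rebase-sensitive.
qa_signoff_valid() {
  local policy="$1" signed="$2" queued="$3" label="${4:-rebase-sensitive}"
  # A no-op rebase keeps the SHA that QA tested, so there is nothing to re-validate.
  [ "$signed" = "$queued" ] && return 0
  case "$policy" in
    strict) return 1 ;;                       # any new SHA invalidates sign-off
    per-pr) [ "$label" = "rebase-safe" ] ;;   # the engineer's label decides
    trust)  return 0 ;;                       # re-validation is advisory only
    *) echo "unknown policy: $policy" >&2; return 2 ;;
  esac
}
```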

Environment topology

The Bastion recommends four environments: local, CI (ephemeral, per-PR), staging (production-equivalent), production. That is it.

Most companies have more. Five is common. Seven is not unusual. Dev, QA, UAT, pre-prod, staging, staging-2, production. Each was introduced for a reason that made sense at the time. None has been removed since.

The argument for fewer environments: each environment is state, and state drifts. The team that has Dev separate from QA separate from UAT eventually finds the three environments have diverged in subtle ways. Tests pass in Dev, fail in UAT, and nobody can explain why. Engineers stop trusting any environment except production, which is the one they cannot freely change.

For most readers, the answer is not "delete your QA environment by Friday." It is "apply the principles to whichever environments you have":

The same artifact runs in every environment. Not rebuilt at each stage.
Promotion between environments is automated. No engineer pulls down a build and uploads it to the next stage.
Each environment has its own SLOs and observability. The team can answer at any moment which SHA is in which environment.
The number of environments is not a measure of safety. It is a measure of where state can drift.

If your Pillar 3 (observability) and Pillar 4 (progressive rollouts) are working, you can probably collapse two or three of your pre-production environments without losing anything important. That is a future quarter's work. For now, make the principles hold across the environments you have.

Figure: The Bastion 4-environment topology (top) and the more common 5-7 environment reality (bottom). The same artifact SHA flows through every environment in both cases.

Per-PR preview environments

The CI environment in the four-environment model is per-PR and ephemeral. Modern platforms make this cheap: Vercel, Netlify, Render, Railway, AWS Amplify, Google Cloud Run, and Kubernetes-native solutions like Argo CD preview applications all support spinning up an environment automatically when a pull request opens. The environment lives for the duration of the PR and is destroyed when the PR closes.

What it is for: manual exploratory testing, stakeholder demos, side-by-side comparison with the current main, visual regression checks. What it is not for: long-lived feature work (that violates Principle 3, batch size), replacing staging, or replacing automated tests.

The artifact that runs in the preview is the same artifact that would run in staging and production. It is built once, on push, and not rebuilt for later environments. This is what makes preview environments both cheap and trustworthy: same image, different config.

Where QA and UAT actually happen

The four-environment model is the part of this post that gets the most pushback. "If you delete my QA environment and my UAT environment, where do my QA team and my UAT process go?" Fair question.

The Human Review Checkpoints subsection above mapped each role to a specific lifecycle step. This subsection explains why those roles work this way: the Bastion treats QA and UAT as functions, not environments. Each function still exists; the delivery mechanism changes.

QA lives in three places:

Automated tests in CI. The deterministic part. If a test cannot be automated, the test is not good enough yet; find a way to automate it or remove it.
The per-PR preview environment. Manual exploratory testing, accessibility checks, edge cases automation does not catch. The QA engineer pairs with the author or runs a session against the preview.
Production observability and SLOs. The deepest QA happens in production with real traffic. Pillar 3 (See Everything) is how a team maintains quality after release.

UAT lives inside production, controlled by Pillar 4:

Feature flags let the team ship code to production with the feature off, then turn it on for a small subset of real users. Stakeholders test the actual deployment.
Progressive rollouts (1% to 10% to 50% to 100%) are UAT. Real users, real data, controlled exposure, with observability that catches problems before the blast radius widens.
Beta channels or opt-in cohorts are a more deliberate variant for stakeholders who want explicit early access.

The team that has run UAT for ten years does not want to hear their environment is going away. The honest answer: the function is not going away. The environment is. They will get more useful signal testing the actual system in production behind a flag than testing a copy that drifts every week.

Code review

Code review is a check in the contract, not a separate norm. The contract specifies who reviews what:

Default: one engineering reviewer
Touches src/auth/** or src/billing/**: security team reviewer required
Touches migrations/**: database team reviewer required
Touches contract.yaml itself: platform team approves

Two rules that are easy to skip and worth enforcing: re-request review on substantive change (so the "approve, then push a small fix" bypass doesn't exist), and no self-approval, ever (even for the team lead). AI-assisted review tools post comments before the human reviewer arrives. Their comments are advisory; human approval is required.

Release tracking

Every deploy logs an entry. Minimum schema: artifact SHA, environment, timestamp, trigger, approver, link to PR. Stored either in .bastion/releases.jsonl (append-only file) or in a tool like Argo CD that gives you the equivalent.

The release log lets the team answer in under thirty seconds:

What SHA is in production right now?
When was the last deploy?
Show me everything that went to production today.
What changed between staging and production?

If these questions take longer to answer than thirty seconds, the release log is the missing artifact. The Slack notification per deploy turns the log from archival into behavior-shaping. The team sees deploys go by in real time. Deploys feel normal.
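With a JSONL release log, the first few questions reduce to grep. The sketch below assumes flat one-line records with string fields named sha, env, and ts; those field names are hypothetical, since the schema above is given only in prose. jq would be the more robust choice where it is available.

```shell
# json_field LINE KEY
# Extracts a string value from a flat, one-line JSON record.
json_field() {
  printf '%s\n' "$1" | sed -n 's/.*"'"$2"'":"\([^"]*\)".*/\1/p'
}

# current_sha FILE ENV
# SHA of the most recent release recorded for ENV (the log is append-only,
# so the last matching line is the latest deploy).
current_sha() {
  local line
  line=$(grep "\"env\":\"$2\"" "$1" | tail -n 1)
  [ -n "$line" ] && json_field "$line" sha
}

# deploys_on FILE ENV DATE
# Every release to ENV on DATE (YYYY-MM-DD).
deploys_on() {
  grep "\"env\":\"$2\"" "$1" | grep "\"ts\":\"$3"
}

# Examples:
#   current_sha .bastion/releases.jsonl production
#   deploys_on  .bastion/releases.jsonl production 2026-04-15
```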

Database migrations

This post is about getting code from commit to production. Migrations are code too, and they follow the same procedure. The pattern that makes that possible is expand-contract, in three deploys: expand the schema (add new alongside old), migrate data and switch the code (often behind a feature flag), contract by removing the old schema. Each phase is independently reversible. Destructive changes paired with the code that requires them in a single deploy are exactly what the procedure refuses to ship.

Migrations are a substantial enough topic to warrant their own field guide as a companion to this book. That companion covers expand-contract in depth, common migration types, the migration runner, observability during migrations, and a worked example end to end. The summary above is enough for a Bastion reader to know what to do; the companion is for teams whose migration story is the next thing they have to fix.

Multi-service repos and monorepos

Everything above assumes one service in one repo. Most companies past a certain size have either many repos with one service each, or one monorepo with many services. The principles do not change; the implementation does.

Per-service contracts. Each service has its own contract.yaml, slos.yaml, releases.jsonl, canary.yaml. Two services in the same monorepo can be at different maturity levels. BMI is computed per service.
An org-level meta-contract. Some checks belong to the whole organization, not any single service: license scanning, dependency vulnerability rules, base image policies, secret detection. Every service contract inherits from the org contract.
Intelligent test selection. In a monorepo with thousands of test files, the pipeline detects which files changed and runs only the relevant subset on the PR. The full suite still runs on main.
Independent deploys. Each service deploys when its own contract is satisfied. No waiting across services. Coupled deploys are a smell that usually points at hidden coupling in the code itself.
Cross-service breaking changes use expand-contract the same way database migrations do. Add the new interface on Service A, deploy. Migrate Service B, deploy. Remove the old interface from Service A, deploy. Three small batches.
One BMI per service, one dashboard for the org. The team can see which services are L1 and which are L4. Investment decisions get made per service. The unit of intervention is the service, not the repo.

What does not work is one org-wide contract enforced identically across many services. Different services have different risk profiles, different criticality, different team ownership. Collapsing that into one contract collapses information that should drive different procedural choices.
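For the test-selection and independent-deploy points in the list above, the first step is working out which services a change touches. A minimal sketch, assuming a hypothetical services/<name>/... directory layout and a hypothetical function name:

```shell
# changed_services
# Reads changed file paths on stdin and prints each affected service directory once.
# Paths outside services/<name>/ are ignored here; a real pipeline would probably
# treat shared code as affecting every service.
changed_services() {
  awk -F/ '$1 == "services" && NF >= 3 { print $1 "/" $2 }' | sort -u
}

# Typical CI input (the base branch name is an assumption):
#   git diff --name-only origin/main...HEAD | changed_services
```

Each directory it prints would then select that service's own contract and test subset.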

Hotfixes

A hotfix is a feature branch with a fast pipeline. No bypass path. Same contract checks, same artifact promotion, same progressive rollout. If hotfixes feel slow, the answer is to make the pipeline faster (intelligent test selection cuts 40-70% of CI time in most cases) rather than build a parallel path that skips gates.

The "emergency hotfix" path is the classic procedure-killer. Teams build one for a real emergency, then start using it for non-emergencies, then it becomes the default, and the actual pipeline is vestigial. The Bastion specifically refuses to let this evolve.

What this all adds up to

Seven files. Maybe a thousand lines of YAML and bash, including the comments. That is the entire physical footprint of a Bastion in a service repository.

These artifacts are not the procedure. The procedure is the agreements behind them. But if you do not have the artifacts, you do not have the procedure. They are the load-bearing physical evidence that the team is actually doing what it says it does.

If you copy these files into your repo as-is and change nothing else about your culture, the artifacts will rot within a quarter. Override rate will climb. Runbooks will stop being tested. The contract will drift. If you copy them and pair them with the principles from the framework, they become the structure that holds.

The next post in this series will walk through the transitions: what the artifacts look like as a team moves from Level 1 to Level 5. The contract gets longer. The pipeline gets faster. The runbooks get smaller because more of the response is automated. Follow the blog for that one.
