June 16, 2026 · 11 min read · performance.qa

Error Budget Policy Template: Thresholds + Runbook Gating

Error budget policy template: a copy-paste 3-tier threshold table, burn-rate alert config, and a runbook gate that actually freezes feature releases.

You already have SLOs. You have a dashboard. You can see the error budget ticking down. And yet your team keeps shipping features the week before a service breaches its objective, because nothing actually stops them.

That gap is the difference between measuring reliability and governing it. The missing piece is an error budget policy: a written, pre-agreed document that says exactly what happens when the budget runs low. This is the copy-paste artifact - a three-tier threshold table, burn-rate alert config, and a runbook gate that genuinely freezes releases. If you want the conceptual grounding first, our SLO, SLI and error budget guide covers the math and the why. This post is the operational document you drop into your runbook.

What an error budget policy is (and why a dashboard isn’t one)

An error budget policy is a written, pre-agreed contract for what the team does when the budget runs low. Not a chart. Not a Slack channel where someone occasionally posts a screenshot. A document, agreed in advance, that names the thresholds and the actions tied to each one.

Here is the uncomfortable truth about teams that “have SLOs” and still ship recklessly: without a policy there is no enforcement. The dashboard is informational. Knowing you have spent 80% of your budget changes nothing if there is no rule that says spending 80% means you stop adding risk. People look at the number, shrug, and merge the feature anyway, because the number has no teeth.

A good policy aligns the three parties who otherwise pull in different directions:

  • Product wants features shipped, fast.
  • SRE and engineering want stability and a sustainable on-call load.
  • Leadership breaks ties and owns the consequences of either choice.

The policy is the pre-negotiated tiebreaker. It means nobody is arguing about whether to slow down in the middle of an incident, because the decision was already made in calm times and written down.

An SLO without a policy is a suggestion; a policy is what turns it into a decision rule.

This pattern traces back to the Google SRE workbook, which formalized error budgets as a feedback loop between reliability and release velocity. What has changed by 2026 is that the model has standardized: a clean three-tier threshold structure plus multi-window burn-rate alerting that most mature teams now share.

The copy-paste 3-tier threshold policy

This is the core artifact. Call it the 3-Tier Error Budget Policy. It maps remaining budget to actions, so the decision is mechanical. The standard 2026 thresholds: above 50% remaining you ship normally, 25-50% you exercise caution, below 25% you freeze, and an exhausted budget escalates to incident level.

Here is the exact table to drop into your runbook:

Tier | Remaining budget | Status | Required actions
Tier 1 | >50% remaining | Healthy | Ship normally. No extra gates. Deploy during business hours with the standard review process.
Tier 2 | 25-50% remaining | Caution | Freeze risky changes. High-blast-radius deploys require a second reviewer and a named owner. Prioritize at least one reliability item per sprint.
Tier 3 | <25% remaining | Freeze | Feature freeze. All engineering effort redirected to reliability until budget recovers. Only rollbacks, security patches, and reliability fixes ship.
Tier 0 | Exhausted / negative | Incident | Incident-level response. Exec notification. Mandatory postmortem before any feature ship. Leadership reviews whether the SLO target itself is right.

A few notes on wiring this into reality:

  • Remaining budget is measured against your rolling SLO window (typically 28 or 30 days), not a calendar month. A rolling window means the policy responds to current reliability, not to whatever happened on the first of the month.
  • The tiers are asymmetric on purpose. Tier 1 is permissive so teams ship fast when the service is healthy. The friction ramps only as risk does. This is the whole point: reliability buys you velocity.
  • Tier 0 is not just “Tier 3 but worse.” It pulls in leadership and forces a question Tier 3 does not: is the target wrong? A budget that keeps hitting zero is often a target set too tight, not a team that is failing.
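
The table can be sketched as a small function so the mapping really is mechanical. This is a minimal illustration, not part of any SLO tool: `remaining_budget` and `tier_for` are hypothetical names, and the request counts reuse the 99.5% / 10M-request example from the template later in this post.

```python
def remaining_budget(total_requests: int, failed_requests: int, slo: float) -> float:
    """Fraction of the error budget still unspent (negative once overspent)."""
    allowed_failures = total_requests * (1.0 - slo)
    if allowed_failures <= 0:
        raise ValueError("SLO must be below 100% and traffic must be non-zero")
    return 1.0 - failed_requests / allowed_failures


def tier_for(remaining: float) -> str:
    """Map remaining budget (1.0 = untouched) to the policy tier."""
    if remaining <= 0.0:
        return "0"   # exhausted or negative: incident-level response
    if remaining < 0.25:
        return "3"   # feature freeze
    if remaining <= 0.50:
        return "2"   # caution
    return "1"       # healthy


# A 99.5% SLO at 10M requests allows about 50,000 failures in the window.
print(tier_for(remaining_budget(10_000_000, 20_000, 0.995)))  # 60% left -> 1
print(tier_for(remaining_budget(10_000_000, 40_000, 0.995)))  # 20% left -> 3
```

Exactly 50% and exactly 25% land in Tier 2 here, which is how the table reads; pick the boundary behavior deliberately and write it into the policy.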

Drop this table into your service runbook verbatim, fill in the service name and SLO at the top, and you have a policy. The rest of this post makes it fire automatically.

Burn-rate alerts: gating on speed, not just remaining budget

The three-tier table answers “how much budget is left?” That is necessary but not sufficient, because a severe burn can spend a whole window's budget in a single incident: a total outage against a 99.9% SLO exhausts the full 30-day budget in about 43 minutes. If you only gate on remaining budget, you find out you are in trouble after the damage is done. You need to gate on speed too.

A burn rate is how fast you are consuming budget relative to the window. A burn rate of 1x spends the entire budget exactly over the SLO window. A burn rate of 14.4x spends it 14.4 times faster: sustained, it empties a 30-day budget in roughly two days, and it consumes about 2% of that budget in a single hour.
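
As a rough sketch of that definition: burn rate is the observed error ratio divided by the error ratio the SLO allows. The function names and the 7.2% failure figure below are illustrative assumptions, chosen so the result lands on 14.4x for a 99.5% SLO.

```python
def burn_rate(error_ratio: float, slo: float) -> float:
    """Observed error ratio as a multiple of the ratio the SLO allows."""
    return error_ratio / (1.0 - slo)


def days_to_exhaust(rate: float, window_days: float = 30.0) -> float:
    """Days a full budget lasts if the given burn rate holds."""
    return window_days / rate


rate = burn_rate(error_ratio=0.072, slo=0.995)   # 7.2% of requests failing
print(round(rate, 1))                    # 14.4
print(round(days_to_exhaust(rate), 2))   # 2.08
```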

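The pattern comes from the Google SRE workbook: multi-window, multi-burn-rate alerting. The workbook pages on both the 14.4x and 6x thresholds and reserves tickets for a slower 1x burn over three days; this template simplifies that to two alerts and routes the 6x burn to a ticket: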

  • A fast-burn page for sudden, severe burns. Wakes someone up.
  • A slow-burn ticket for sustained, low-grade burns. Files a ticket, no 3am page.

Each uses two time windows - a long one to detect the burn and a short one to confirm it is still happening, so the alert clears promptly once the burn stops. Here are the standard thresholds:

Alert | Burn rate | Long window | Short window | Severity | Budget consumed
Fast burn | 14.4x | 1h | 5m | Page | ~2% in 1h
Slow burn | 6x | 6h | 30m | Ticket | ~5% in 6h
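
The multipliers in that table are not arbitrary; they fall out of "how much budget am I willing to lose before the alert fires". A minimal sketch of the arithmetic, assuming a 30-day window (the function names are illustrative):

```python
WINDOW_HOURS = 30 * 24  # 720 hours in a 30-day SLO window


def budget_consumed(rate: float, hours: float, window_hours: float = WINDOW_HOURS) -> float:
    """Fraction of the whole budget spent by holding `rate` for `hours`."""
    return rate * hours / window_hours


def threshold_for(budget_fraction: float, hours: float, window_hours: float = WINDOW_HOURS) -> float:
    """Burn rate that spends `budget_fraction` of the budget in `hours`."""
    return budget_fraction * window_hours / hours


print(round(budget_consumed(14.4, 1), 4))   # 0.02
print(round(budget_consumed(6, 6), 4))      # 0.05
print(round(threshold_for(0.02, 1), 1))     # 14.4
```

If your SLO window is 28 days rather than 30, the same 2%-in-1h target works out to 13.44x, so recompute the thresholds instead of copying 14.4 blindly.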

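Copy-paste Prometheus/Grafana-style alert rules. They assume hand-rolled recording rules named error_budget:burnrate<window>, one per window, that already express the burn rate as a multiple of the sustainable rate. If your recording rules store a raw error ratio instead, as Sloth-style ratio_rate rules do, compare against the multiplier times (1 - SLO) rather than the bare multiplier: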

groups:
  - name: error-budget-burn
    rules:
      # Fast burn: page. 14.4x over 1h, confirmed by 5m short window.
      - alert: ErrorBudgetFastBurn
        expr: |
          (
            error_budget:burnrate1h{service="payment-api"} > 14.4
            and
            error_budget:burnrate5m{service="payment-api"} > 14.4
          )
        for: 2m
        labels:
          severity: page
          tier: "3"
        annotations:
          summary: "Fast error-budget burn on {{ $labels.service }} - triggers feature freeze"

      # Slow burn: ticket. 6x over 6h, confirmed by 30m short window.
      - alert: ErrorBudgetSlowBurn
        expr: |
          (
            error_budget:burnrate6h{service="payment-api"} > 6
            and
            error_budget:burnrate30m{service="payment-api"} > 6
          )
        for: 15m
        labels:
          severity: ticket
          tier: "2"
        annotations:
          summary: "Slow error-budget burn on {{ $labels.service }} - move to caution tier"

Notice the tier label on each alert. That is the bridge: a fast-burn page maps to Tier 3 (freeze) and a slow-burn ticket maps to Tier 2 (caution). The alert does not just notify a human - it sets the gate. A fast burn can drop you into a freeze even while remaining budget still reads above 25%, because at 14.4x even a full 30-day budget is gone in about two days, and a half-spent one in about a day. If your instrumentation is not emitting clean enough signals to compute these rates, our OpenTelemetry instrumentation guide covers getting the request and error metrics in place first.
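
One way to combine the two signals is to take the stricter of the remaining-budget tier and the tier carried by any firing burn-rate alert. The sketch below assumes tiers are passed around as the strings "0" to "3", as in the alert labels; `effective_tier` is a hypothetical helper, not something Prometheus or Alertmanager provides.

```python
from typing import Iterable

# Strictness order from most permissive to most restrictive.
STRICTNESS = {"1": 0, "2": 1, "3": 2, "0": 3}


def effective_tier(budget_tier: str, firing_alert_tiers: Iterable[str]) -> str:
    """Return the strictest tier among the budget tier and any firing alerts."""
    candidates = [budget_tier, *firing_alert_tiers]
    return max(candidates, key=STRICTNESS.__getitem__)


# Budget still healthy, but the fast-burn page (tier "3") is firing.
print(effective_tier("1", ["3"]))   # 3
```

An unrecognized tier raises a KeyError rather than being silently treated as healthy, which keeps the gate failing closed.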

The runbook gate: how a feature freeze actually fires

A policy that depends on someone remembering to honor it will be ignored under deadline pressure. The fix is to wire the gate into CI/CD so a feature deploy checks the current tier before it is allowed to proceed.

The mechanism: a deploy step queries the current error budget tier (from Prometheus, your SLO tool’s API, or a small budget service) and decides. Tier 1 passes. Tier 2 passes with a required approval. Tier 3 and Tier 0 block, unless the change is exempt.

What is always exempt from a freeze - these ship regardless of tier:

  • Rollbacks - reverting is how you recover budget.
  • Security patches - a freeze never blocks a CVE fix.
  • Reliability fixes - the work that ends the freeze.

A freeze blocks net-new feature risk, not the work that makes the service healthy again.

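Here is a sample gate as a short Python script you can adapt to GitHub Actions, GitLab CI, or Argo. The environment variable names are placeholders: an earlier pipeline step is assumed to set them after querying Prometheus or your SLO tool's API live.

# deploy-gate.py - run as a required CI step before a feature deploy
import os
import sys

# Placeholder inputs (names are illustrative). An earlier CI step is assumed
# to export them from a live Prometheus / SLO API query and from PR metadata.
tier = os.environ.get("BUDGET_TIER", "unknown")           # "0" | "1" | "2" | "3"
change_type = os.environ.get("CHANGE_TYPE", "feature")    # "feature" | "rollback" | "security" | "reliability"
second_approval = os.environ.get("SECOND_APPROVAL") == "true"
override_actor = os.environ.get("FREEZE_OVERRIDE_BY", "")  # named role that signed the override

EXEMPT = {"rollback", "security", "reliability"}

# Exempt change types ship at every tier.
if change_type in EXEMPT:
    print(f"PASS: {change_type} is exempt from freeze (tier {tier})")
    sys.exit(0)

if tier == "1":
    print("PASS: Tier 1 healthy - ship normally")
    sys.exit(0)

if tier == "2":
    if not second_approval:
        sys.exit("BLOCK: Tier 2 caution - second reviewer approval required")
    print("PASS: Tier 2 with required approval")
    sys.exit(0)

# Tier 3, Tier 0, or an unreadable tier - feature work is frozen (fail closed)
if override_actor:
    # Audit entry: stderr stands in for a real audit log here.
    print(f"AUDIT: override change={change_type} tier={tier} actor={override_actor}", file=sys.stderr)
    print(f"PASS (OVERRIDE): freeze overridden by {override_actor} - logged")
    sys.exit(0)

sys.exit(f"BLOCK: Tier {tier} feature freeze. Feature deploys suspended until budget recovers.")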
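
The gate needs a live budget reading from somewhere. One option is Prometheus's instant-query endpoint, `/api/v1/query`. The sketch below uses only the standard library; the recording rule name `error_budget:remaining_ratio` is a hypothetical one you would define yourself, and the base URL is whatever your Prometheus is reachable at.

```python
import json
import urllib.parse
import urllib.request

# Hypothetical recording rule: fraction of budget left (1.0 = untouched).
REMAINING_QUERY = 'error_budget:remaining_ratio{service="payment-api"}'


def parse_instant_value(payload: dict) -> float:
    """Extract the single sample from a Prometheus instant-query response."""
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload.get('error', 'unknown error')}")
    result = payload["data"]["result"]
    if len(result) != 1:
        raise RuntimeError(f"expected exactly one series, got {len(result)}")
    _timestamp, value = result[0]["value"]   # the sample value arrives as a string
    return float(value)


def fetch_remaining(base_url: str, query: str = REMAINING_QUERY, timeout: float = 5.0) -> float:
    """Run an instant query against /api/v1/query and return the sample."""
    url = f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": query})
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return parse_instant_value(json.load(response))
```

Both functions raise instead of returning a default when the query fails or returns no series. That is deliberate: a gate that cannot read the budget should block, not wave the deploy through.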

The exception process matters as much as the block. Someone, eventually, will need to ship a feature during a freeze. Make that possible but expensive:

  • A named role can override - typically the on-call SRE lead plus a director, not any engineer.
  • Every override is logged: who, when, what, and why.
  • Overrides are reviewed at the next policy retro. A pattern of overrides means either the thresholds are wrong or the team is not taking the policy seriously - both worth surfacing.
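
The "who, when, what, and why" of an override can be captured as one structured log line. A minimal sketch, with hypothetical field names; where the line is stored (CI logs, a ticket, an audit table) is up to you.

```python
import json
from datetime import datetime, timezone


def override_record(actor: str, role: str, change: str, tier: str, reason: str) -> str:
    """Build one JSON line capturing who, when, what, and why."""
    if not reason.strip():
        raise ValueError("an override needs a written reason")
    entry = {
        "when": datetime.now(timezone.utc).isoformat(),
        "who": actor,
        "role": role,
        "what": change,
        "tier": tier,
        "why": reason.strip(),
    }
    return json.dumps(entry, sort_keys=True)


print(override_record("alice", "sre-lead", "feature: new checkout", "3", "contractual launch date"))
```

Rejecting an empty reason is the cheap part of making overrides expensive: the retro has something concrete to review.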

Freeze-gate runbook checklist:

  • Deploy gate queries live budget tier, not a cached or manual value
  • feature changes blocked at Tier 3 and Tier 0
  • rollback, security, reliability always exempt
  • Tier 2 requires a documented second approval
  • Override requires a named role and writes an audit log entry
  • Overrides surfaced in the quarterly policy review

Rollout: getting product and leadership to actually honor it

A policy nobody respects is worse than no policy, because it teaches the team that the rules are theater. Roll it out so it earns trust.

Start advisory, not enforcing. Run the policy in observe mode for one full SLO cycle. The gate computes the tier and posts what it would have done - “this deploy would have been blocked (Tier 3)” - but lets everything through. One cycle of watching the policy make correct calls builds far more buy-in than asserting it is correct on day one. It also surfaces measurement bugs before they block a real deploy.
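
Observe mode can be a thin wrapper around whatever decision the gate already makes. In this sketch the `enforce` flag and the `apply_mode` name are assumptions; the idea is that the gate computes its verdict either way and only the exit behavior changes.

```python
from typing import Tuple


def apply_mode(allowed: bool, message: str, enforce: bool) -> Tuple[bool, str]:
    """In advisory mode, report a block but let the deploy through."""
    if allowed or enforce:
        return allowed, message
    return True, f"ADVISORY: this deploy would have been blocked ({message})"


# During the first SLO cycle, run with enforce=False and read the output.
print(apply_mode(False, "Tier 3 feature freeze", enforce=False))
```

Flipping `enforce` to true after one clean cycle is then a one-line change that product and leadership have already seen the consequences of.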

Use the pre-commitment trick. Agree the thresholds and the freeze rule in calm times, with product and leadership in the room, and write it down. The entire value of a policy is that it is not litigated mid-incident. If the first time product hears “we are freezing features” is during an outage, you have a fight, not a policy.

Watch for the permanent-freeze failure mode. The most common way these policies die: the SLO target is set too tight, the team lives in Tier 3 constantly, and everyone learns to ignore the freeze because it is always on. If you are always frozen, the problem is usually the target, not the team. Loosen the SLO to something the service can actually hit, then tighten gradually. A policy that fires occasionally is respected; one that fires always is wallpaper.

Review quarterly. Once a quarter, tune the thresholds against real burn history. Did Tier 2 ever actually change behavior, or did you skip straight from healthy to frozen? Are the burn-rate multipliers paging on noise? Treat the policy as a living document, adjusted on demonstrated performance - never edited mid-incident to dodge a freeze.
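
Two of those review questions can be answered from a simple daily tier history. The sketch below assumes you record one tier value per day; the 90-day history is made up for illustration.

```python
from collections import Counter


def tier_share(daily_tiers):
    """Fraction of days spent in each tier over the review period."""
    if not daily_tiers:
        raise ValueError("no history to review")
    counts = Counter(daily_tiers)
    return {tier: counts[tier] / len(daily_tiers) for tier in ("1", "2", "3", "0")}


def skipped_caution(daily_tiers):
    """Count transitions from Tier 1 directly to Tier 3 or Tier 0."""
    pairs = zip(daily_tiers, daily_tiers[1:])
    return sum(1 for before, after in pairs if before == "1" and after in ("3", "0"))


history = ["1"] * 60 + ["3"] * 20 + ["1"] * 10   # hypothetical 90-day quarter
print(tier_share(history))
print(skipped_caution(history))   # 1
```

A high Tier 3 share points at the permanent-freeze failure mode; a non-zero skip count with no Tier 2 days suggests the caution band is too narrow to ever change behavior.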

Download the template + when to bring in help

Here is the full error budget policy template to copy into your runbook - the three-tier table, the burn-rate config, and the gate, in one place:

ERROR BUDGET POLICY - [Service Name]
SLO: 99.5% success rate (30-day rolling window)
Error budget per window: 0.5%  (e.g. ~50,000 failed requests @ 10M req per 30 days)

TIER 1 - Healthy   (>50% remaining):  Ship normally. No extra gates.
TIER 2 - Caution   (25-50% remaining): Freeze risky changes. 2nd reviewer
                                        on high-blast-radius deploys.
                                        >=1 reliability item per sprint.
TIER 3 - Freeze    (<25% remaining):   Feature freeze. All effort to
                                        reliability. Only rollback/security/
                                        reliability fixes ship.
TIER 0 - Exhausted (negative budget):  Incident response. Exec notify.
                                        Postmortem before any feature ship.

BURN-RATE ALERTS
  Fast burn (page):   14.4x over 1h, confirmed by 5m  -> Tier 3
  Slow burn (ticket):  6x  over 6h, confirmed by 30m  -> Tier 2

DEPLOY GATE
  feature deploys: blocked at Tier 3 / Tier 0
  always exempt:   rollback, security patch, reliability fix
  override:        named role only, logged, reviewed quarterly

This template is the easy part. The reason teams search for it is usually that they are standing up an SRE practice for the first time - and the policy is the visible tip of a deeper gap.

Signs you need more than a template:

  • You have no agreed SLIs - nobody can say what “good” means for the service.
  • Your targets are guesses - 99.9% because it sounded right, not because you measured the 90-day baseline.
  • The policy keeps getting overridden - which usually means the targets are wrong or the practice was never properly designed.

A reliability practice is more than a policy doc. It is SLI selection measured close to the user, target setting from real baseline data, the error budget policy above, on-call structure, and the CI/CD gating wired end to end. Each piece depends on the others - a policy on top of guessed targets just automates the wrong decision. Before any of this, it is worth running a pre-launch performance checklist so you are gating on a service that can actually hold its SLO.

Stand up your SRE practice with us

A template gets you a document. A working practice gets you a reliability flywheel - velocity when you are healthy, automatic protection when you are not.

We design the whole thing in a fixed-scope reliability sprint: select your SLIs, set targets from your real performance data, write the error budget policy, and wire it into CI/CD so the freeze actually fires. Explore our performance audit to baseline where you stand today, or a performance retainer for ongoing SLO and error-budget governance once the practice is live.

Bring us the dashboard you already have. We will turn it into a policy your team honors.

Frequently Asked Questions

What is an error budget policy?

An error budget policy is a written, pre-agreed contract that says exactly what your team does when reliability runs low. It maps how much of your error budget remains to a set of actions - ship normally, slow down and review, or freeze features entirely. A dashboard tells you the budget is being burned; the policy turns that number into an enforceable decision rule about whether you ship or stop.

What are the standard error budget thresholds?

The emerging 2026 standard is a three-tier model based on remaining budget: above 50% remaining means ship normally, 25-50% remaining means caution (freeze risky changes, require extra review), and below 25% remaining means a feature freeze with all effort going to reliability. A fourth tier covers an exhausted or negative budget: incident-level response, exec notification, and a mandatory postmortem before any feature ships.

When should you freeze feature releases based on error budget?

Freeze feature releases when less than 25% of your error budget remains for the rolling window. At that point the service is close to breaching its SLO, so every new feature deploy adds risk you cannot afford. During a freeze, only rollbacks, security patches, and reliability fixes ship. A fast burn-rate alert can also trigger an immediate freeze even when remaining budget still looks healthy, because a severe burn can drain a month's budget in hours rather than weeks.

What is a burn rate alert and how do you set the thresholds?

A burn rate alert fires on how fast you are spending error budget, not just how much is left. A burn rate of 1x consumes the budget exactly over the window; 14.4x consumes it 14.4 times faster. This policy follows the Google SRE workbook's multi-window, multi-burn-rate pattern: a fast-burn page at 14.4x over 1h (with a 5m short window) and a slow-burn ticket at 6x over 6h (with a 30m short window). The short second window prevents alerts from latching after the burn stops.

What's the difference between an SLO and an error budget policy?

An SLO is the target - for example, 99.5% success over 30 days. The error budget is the allowed unreliability that target implies (0.5%). The error budget policy is the rulebook for what happens as that budget is spent: which tier you are in, what gets frozen, who can override it. An SLO without a policy is a suggestion. The policy is what turns the number into a decision rule your team actually follows.
