February 15, 2026 · 10 min read · performance.qa

SLOs, SLIs, and Error Budgets: The Complete Guide for Engineering Teams

Everything you need to implement SLOs - definitions, choosing SLIs, setting targets, calculating error budgets, and avoiding common mistakes.

Service Level Objectives (SLOs) are the single most impactful reliability practice an engineering team can implement. They create a shared language between product, engineering, and business. They provide an objective framework for prioritizing reliability work versus feature work. They transform vague reliability goals into measurable commitments.

Yet most engineering teams implement SLOs incorrectly, and many that implement them correctly fail to use them effectively. This guide covers everything: the definitions that matter, how to choose meaningful SLIs, how to set targets that create the right incentives, and how to use error budgets to make better engineering decisions.

Definitions That Actually Matter

The three terms are related but distinct. Understanding the relationship is essential before doing anything practical.

Service Level Indicator (SLI)

An SLI is a quantitative measurement of a service behavior that is relevant to users. It is a ratio: the number of “good” events divided by the total number of events, expressed as a percentage.

Examples:

  • HTTP requests that return non-5xx responses / total HTTP requests
  • API calls that complete in under 200ms / total API calls
  • Background jobs that succeed / total background jobs
  • Events processed within 30 seconds / total events

Notice that an SLI is always a ratio, not an absolute number. “1000ms latency” is not an SLI. “Percentage of requests that complete in under 1000ms” is an SLI.
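The ratio framing is trivial to compute. A minimal sketch in Python (the request counts are invented for illustration):

```python
def sli(good_events: int, total_events: int) -> float:
    """An SLI: good events over total events, as a percentage."""
    if total_events == 0:
        return 100.0  # no traffic means no bad events
    return 100.0 * good_events / total_events

# "Percentage of requests that complete in under 1000ms":
fast_requests = 987_500    # requests under 1000ms (illustrative)
all_requests = 1_000_000
print(f"{sli(fast_requests, all_requests):.2f}%")  # 98.75%
```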

Service Level Objective (SLO)

An SLO is a target value for an SLI over a defined time window. It is the answer to: “How good does this metric need to be for us to consider the service acceptable?”

Examples:

  • HTTP success rate SLI >= 99.5% over a 30-day rolling window
  • Latency SLI (requests under 200ms) >= 95% over a 28-day rolling window
  • Background job success rate SLI >= 99.9% over a 7-day rolling window

The time window matters enormously. “99.9% uptime per day” allows only 86 seconds of downtime per day. “99.9% uptime per month” allows 43.8 minutes of downtime per month. Same percentage, very different implications.
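The arithmetic behind those figures: allowed downtime equals the window length times (1 - target). A quick sketch:

```python
def allowed_downtime_seconds(slo_percent: float, window_seconds: float) -> float:
    """Downtime budget for an SLO target over a given time window."""
    return window_seconds * (1 - slo_percent / 100.0)

DAY = 86_400
MONTH = 365.25 / 12 * DAY  # average month, ~30.44 days

print(allowed_downtime_seconds(99.9, DAY))         # ~86.4 seconds per day
print(allowed_downtime_seconds(99.9, MONTH) / 60)  # ~43.8 minutes per month
```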

Service Level Agreement (SLA)

An SLA is a contract with a customer that includes remediation if the service falls below a specified level. It is an external commitment with consequences.

The relationship: SLI measures what is happening. SLO sets the internal target. SLA is the external promise (which should always be less strict than the SLO).

Most engineering teams should focus on SLOs. SLAs are a legal and commercial concern. SLIs and SLOs are engineering concerns.

Choosing SLIs

The right SLIs depend on the type of service. The four golden signals - latency, traffic, errors, and saturation - are a useful starting framework, but not every service needs all four.

SLIs by Service Type

Service Type          | Primary SLI              | Secondary SLI
----------------------|--------------------------|--------------------
Request-driven API    | Success rate (non-5xx %) | Latency percentile
Read-heavy service    | Read success rate        | Read latency
Write-heavy service   | Write success rate       | Write latency
Background processing | Job completion rate      | Processing latency
Streaming/real-time   | Message processing rate  | End-to-end latency
Data pipeline         | Pipeline success rate    | Data freshness
Batch jobs            | Job success rate         | Completion time

What Makes a Good SLI

It correlates with user experience. If users complain, the SLI should be in violation. If the SLI looks fine, users should not be experiencing problems. A database CPU utilization metric fails this test - high CPU may or may not affect users. HTTP success rate passes this test.

It is measurable in the right place. Measure SLIs as close to the user as possible. Measure at the load balancer, not at the application server. Measure end-to-end success, not the success of one component in the chain.

It is actionable. When the SLI degrades, it should point toward the problem. “API success rate” declining tells you something. An extremely granular metric like “database index scan success rate” is less useful as an SLI (though valuable for debugging).

It excludes expected failures. Not all failures are the service’s fault. If clients send malformed requests that return 400 errors, those 400s should not count against your success rate SLI. Pre-production traffic should not count. Bot traffic should not count.
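These filtering rules are straightforward to encode. A sketch, assuming request records with status and bot-flag fields (both field names are invented for illustration):

```python
def availability_sli(requests: list[dict]) -> float:
    """Success-rate SLI that excludes bot traffic entirely and does not
    count client errors (4xx) as failures; only 5xx responses on real
    user traffic count against the SLI."""
    counted = [r for r in requests if not r["is_bot"]]
    if not counted:
        return 100.0
    good = sum(1 for r in counted if r["status"] < 500)
    return 100.0 * good / len(counted)

sample = [
    {"status": 200, "is_bot": False},
    {"status": 400, "is_bot": False},  # malformed client request: not a failure
    {"status": 500, "is_bot": True},   # bot traffic: excluded entirely
    {"status": 503, "is_bot": False},  # server error: counts against us
]
print(round(availability_sli(sample), 1))  # 66.7 (2 good out of 3 counted)
```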

Setting SLO Targets: The 99.9% Trap

The most common mistake in SLO setting is choosing the target before understanding the cost. Teams default to 99.9% because it sounds good.

The problem: 99.9% availability sounds high, but it is actually quite hard to achieve for a complex distributed system with multiple dependencies. And it leaves only 43.8 minutes of downtime per month, which means a single 45-minute deployment incident burns your entire monthly error budget.

The Cost of Reliability

SLO Target | Allowed Downtime (Month) | Cost Implication
-----------|--------------------------|-------------------------------------------------
99.0%      | 7.3 hours                | Very achievable without heroics
99.5%      | 3.6 hours                | Achievable with good practices
99.9%      | 43.8 minutes             | Requires significant investment
99.95%     | 21.9 minutes             | Requires extensive investment
99.99%     | 4.4 minutes              | Requires dedicated SRE team and extensive automation

Higher SLOs require more engineering investment: more redundancy, more automation, more on-call burden, and less flexibility to make changes that could cause brief disruptions. Before setting an SLO, ask: “What would it take to maintain this target? What are we not allowed to do anymore if we commit to this?”

Starting Point Recommendations

New services: Start at 99.0-99.5%. You do not know yet where your failure modes are. A loose SLO lets you discover them without constantly burning error budget.

Established services: Analyze your last 90 days of actual performance. Set your SLO at the 10th percentile of your measured actual performance. This is a target you can reliably hit, which gives you confidence the measurement approach is correct before tightening.

Customer-facing APIs: 99.5% is a reasonable target for most SaaS APIs. Very few customers need 99.99% availability.

Internal services: Internal services can typically have 1-2% looser targets than external services. A 15-minute degradation of an internal analytics API is less impactful than the same degradation of your checkout API.

Error Budgets

An error budget is the complement of your SLO: the amount of unreliability you are allowed. If your SLO is 99.5%, your error budget is 0.5%.

Error budgets express the same concept in a more actionable form. A 0.5% monthly error budget for a service receiving 10 million requests per month is 50,000 allowed failures. If you are consuming 1,000 failures per day, your budget would last 50 days at that rate - comfortably longer than the 20 days left in the month, so you are fine. If you are consuming 3,000 failures per day, your budget lasts only 16.7 days against those 20 remaining days - you are on track to violate your SLO.

Calculating Error Budget

Error budget (in requests) = Total requests x (1 - SLO target)

Example:
- 10M requests/month
- 99.5% SLO
- Error budget = 10,000,000 x 0.005 = 50,000 failed requests/month

Daily budget = 50,000 / 30 = 1,667 failures/day

Current burn rate (3,000 failures/day):
Budget exhaustion date = 50,000 / 3,000 = 16.7 days
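The same arithmetic as reusable functions:

```python
def error_budget(total_requests: int, slo_percent: float) -> float:
    """Allowed failures for the window: total x (1 - SLO)."""
    return total_requests * (1 - slo_percent / 100.0)

def days_until_exhaustion(budget: float, daily_burn: float) -> float:
    """How long the budget lasts at the current failure rate."""
    return budget / daily_burn

budget = error_budget(10_000_000, 99.5)                # 50,000 failures/month
print(round(budget / 30))                              # ~1,667 sustainable failures/day
print(round(days_until_exhaustion(budget, 3_000), 1))  # 16.7 days at 3,000/day
```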

Error Budget Policy Template

An error budget only creates value if the team commits to acting on it. Document your error budget policy:

Error Budget Policy - [Service Name]

SLO: 99.5% request success rate (30-day rolling window)
Monthly error budget: ~50,000 failed requests

When error budget is >50% remaining:
  - Normal feature development velocity
  - Deploy during business hours with standard process

When error budget is 25-50% remaining:
  - Review reliability work items in sprint planning
  - Prioritize at least 1 reliability item per sprint
  - Increase deploy caution (require 2 reviewers)

When error budget is <25% remaining:
  - Freeze non-critical feature work
  - All hands on reliability
  - Reliability work takes priority over all feature work
  - Deploy only for critical fixes with incident-response standby

When error budget is exhausted:
  - Feature deployments suspended until budget recovers
  - Post-mortem on what exhausted the budget
  - Leadership review of SLO target appropriateness

This policy creates a direct link between reliability and feature velocity. When the service is reliable (large error budget), teams can move fast. When the service is struggling (small error budget), the error budget policy enforces a slowdown and prioritization shift automatically.
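A policy like this can even gate a deploy pipeline mechanically. A sketch (the tier strings paraphrase the policy above; the function and thresholds are illustrative):

```python
def budget_policy_tier(remaining_fraction: float) -> str:
    """Map remaining error budget (0.0-1.0 of the window's budget)
    to the corresponding policy tier."""
    if remaining_fraction <= 0:
        return "exhausted: feature deployments suspended"
    if remaining_fraction < 0.25:
        return "critical: freeze non-critical feature work"
    if remaining_fraction < 0.50:
        return "caution: prioritize reliability, require 2 reviewers per deploy"
    return "normal: standard feature velocity"

print(budget_policy_tier(0.80))  # normal tier
print(budget_policy_tier(0.30))  # caution tier
print(budget_policy_tier(0.10))  # critical tier
```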

Implementation with Prometheus and Sloth

Sloth is an open-source tool that generates Prometheus recording rules and alerting rules from a simple SLO specification. It handles the multi-window, multi-burn-rate alerting that is otherwise complex to configure manually.

Install Sloth and define your SLO:

# payment-api-slo.yaml
version: "prometheus/v1"
service: "payment-api"
labels:
  team: "payments"

slos:
  - name: "requests-availability"
    description: "Payment API HTTP request availability"
    objective: 99.5

    sli:
      events:
        error_query: sum(rate(http_requests_total{job="payment-api",code=~"5.."}[{{.window}}]))
        total_query: sum(rate(http_requests_total{job="payment-api"}[{{.window}}]))

    alerting:
      name: PaymentApiAvailability
      labels:
        severity: critical
      annotations:
        summary: "Payment API availability SLO violation"
      page_alert:
        labels:
          severity: critical
      ticket_alert:
        labels:
          severity: warning

Generate Prometheus rules:

sloth generate -i payment-api-slo.yaml -o payment-api-slo-rules.yaml
kubectl apply -f payment-api-slo-rules.yaml

Sloth generates multi-window, multi-burn-rate alerts following the Google SRE workbook pattern: a fast burn alert (for example, a 14.4x burn rate sustained over 1 hour, which consumes 2% of a 30-day budget) and a slow burn alert (a 6x burn rate over 6 hours, which consumes 5% of the budget). This gives you early warning of serious burns while avoiding false positives.
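Burn rate itself is just the observed error ratio divided by the budget ratio (1 - SLO): a burn rate of 1 exhausts the budget exactly at the end of the window. A sketch of the calculation:

```python
def burn_rate(error_ratio: float, slo_percent: float) -> float:
    """Observed error ratio relative to the budget ratio (1 - SLO)."""
    return error_ratio / (1 - slo_percent / 100.0)

# With a 99.5% SLO the budget ratio is 0.5%. A sustained 2.5% error
# rate is a 5x burn: a 30-day budget would last only ~6 days.
print(round(burn_rate(0.025, 99.5), 1))  # 5.0
```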

Datadog SLO Implementation

Datadog provides native SLO tracking. Create a metric-based SLO:

# Using the Datadog API client (datadog-api-client) to create an SLO
from datadog_api_client import ApiClient, Configuration
from datadog_api_client.v1.api.service_level_objectives_api import ServiceLevelObjectivesApi
from datadog_api_client.v1.model.service_level_objective_request import ServiceLevelObjectiveRequest
from datadog_api_client.v1.model.service_level_objective_query import ServiceLevelObjectiveQuery
from datadog_api_client.v1.model.slo_threshold import SLOThreshold
from datadog_api_client.v1.model.slo_timeframe import SLOTimeframe
from datadog_api_client.v1.model.slo_type import SLOType

# For a metric-based SLO, the good/total ratio lives in a query object
slo = ServiceLevelObjectiveRequest(
    name="Payment API Availability",
    type=SLOType.METRIC,
    description="Availability SLO for the payment API service",
    query=ServiceLevelObjectiveQuery(
        numerator="sum:trace.http.request.hits.by_http_status{service:payment-api,!http.status_class:5xx}.as_count()",
        denominator="sum:trace.http.request.hits.by_http_status{service:payment-api}.as_count()",
    ),
    thresholds=[SLOThreshold(target=99.5, timeframe=SLOTimeframe.THIRTY_DAYS, warning=99.7)],
    tags=["team:payments", "env:production"],
)

# Credentials (DD_API_KEY / DD_APP_KEY) are read from the environment
with ApiClient(Configuration()) as api_client:
    ServiceLevelObjectivesApi(api_client).create_slo(body=slo)

Building an SLO-Driven Culture

The technical implementation of SLOs is the easy part. The organizational change is harder.

Weekly SLO review. Every engineering team should spend 15 minutes per week reviewing their SLO status. Is error budget being burned faster than expected? Did a deployment spike the burn rate? What is the trend?

SLOs in incident review. Every incident post-mortem should calculate the error budget impact. “This incident burned 23% of our monthly error budget” is a concrete, comparable way to communicate incident severity.
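That budget-impact number is a one-liner to compute. A sketch with invented incident figures:

```python
def incident_budget_impact(failed_requests: int, monthly_budget: float) -> float:
    """Percentage of the monthly error budget one incident consumed."""
    return 100.0 * failed_requests / monthly_budget

# An incident that failed 11,500 requests, measured against the
# 50,000-request monthly budget from the earlier example:
print(f"{incident_budget_impact(11_500, 50_000):.0f}%")  # 23%
```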

SLOs in planning. When prioritizing work, ask: “If we do not work on this reliability item, what happens to our error budget over the next quarter?” This frames reliability work in business terms rather than as abstract technical debt.

Seven Common SLO Mistakes

1. Starting with SLAs before SLOs. Define your internal SLOs first, understand whether you can reliably hit them, then make external SLA commitments based on demonstrated internal performance.

2. Too many SLOs. Five SLOs per service is too many to track meaningfully. Two to three SLOs per service - one for availability, one for latency, optionally one for a business metric - is manageable.

3. No error budget policy. An SLO without an error budget policy is a number on a dashboard. The policy defines what the team does when the budget is consumed.

4. Measuring at the wrong layer. Measuring success rate inside the application misses failures in the load balancer, service mesh, or network. Measure as close to the user as possible.

5. Not excluding non-user traffic. Health check traffic, internal monitoring calls, and bot traffic should all be excluded from SLI calculations. Including them creates noise.

6. Changing targets after a violation. If your SLO is 99.5% and you are at 99.3%, the correct response is to investigate and fix the reliability issue, not to lower the target to 99.0%. Adjust targets quarterly based on demonstrated performance, not in response to violations.

7. Treating SLOs as SLAs. Violating an SLO should trigger an engineering response. It should not automatically trigger customer compensation or legal consequences - that is what SLAs are for.

SLOs done well transform reliability from a vague aspiration into an engineering discipline with clear metrics, clear policies, and clear accountability. Our performance engineering team helps engineering teams implement SLOs from scratch - including measurement setup, alerting configuration, and the organizational change management that makes them stick.
