June 16, 2026 · 10 min read · performance.qa

Reduce Your Datadog Bill 60-90% With OpenTelemetry

Your Datadog bill exploded? A split-model OpenTelemetry migration cuts observability spend 60-90% by moving logs and custom metrics off the meter.

If you are reading this, you probably just opened a Datadog invoice that made you do a double take. You are not alone. Datadog bill shock is the dominant observability conversation of 2026, and the math behind it is brutal: custom-metric premiums now eat 30-50% of many bills, the new LLM-observability surcharge auto-activates around $120/day, and log indexing quietly scales with every new service you ship.

The good news is that you do not have to rip out Datadog or accept the bill as a cost of doing business. This is a surgical playbook. The approach is a split-model migration: keep Datadog where its UX and compliance actually earn the money, move the high-cost data (logs and custom metrics) onto an OpenTelemetry-fronted pipeline, and let the dollar math speak for itself. Teams that do this consistently report 60-90% savings.

Here are the one-line answers to the questions you came with, before the detail:

  • How much can you save? Usually 60-90%, almost entirely from logs and custom metrics.
  • What do you migrate first? Logs, because they are highest volume and lowest risk.
  • Does OpenTelemetry replace Datadog? No. It is the instrumentation layer that frees you to pick a cheaper backend.
  • What causes the high bill? Custom metrics and log indexing, typically 50-70% of an over-budget invoice.
  • Can you run both? Yes, and dual-shipping through the collector is the safe way to migrate.

Why Datadog bills explode (and which line items actually cause it)

The first thing to understand is that your Datadog bill is not one number; it is four. Misdiagnosing which line item is bleeding you is the most common reason teams either overreact (rip out everything) or underreact (negotiate a discount that gets eaten in a quarter).

The four cost drivers are custom metrics, log ingestion and indexing, per-host APM, and 2026’s new LLM-observability surcharge. Of these, custom metrics and log indexing are almost always the problem: together they typically account for 50-70% of an over-budget Datadog invoice. APM hosts feel expensive because the per-host number is visible, but they are rarely where the money actually goes.

Here is the anatomy of a representative $40K/month bill, so you can see where to aim:

Line item | Share of bill | Why it grows | Migration difficulty
Custom metrics | 30-45% | Cost multiplies as tag cardinality climbs | Medium (aggregate at collector)
Log ingestion + indexing | 25-40% | Every service multiplies volume; indexing is the premium | Low (route elsewhere)
Per-host APM | 15-25% | Linear with host count; the visible number | High (UX-dependent)
LLM-observability surcharge | 5-15% | Auto-activates around $120/day of LLM spans | Low (sample at collector)

The silent killer is the cardinality trap. Custom metrics are priced by the number of unique tag combinations, not by the number of metrics. Add one high-cardinality tag - a user ID, a request ID, a full URL path - to a single metric and you can 10x that metric’s cost overnight without anyone noticing until the invoice lands. Most teams have two or three of these hiding in their dashboards right now.
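
To see why one tag can do this, here is a minimal sketch of the multiplication. The tag names and value counts are made up for illustration, and the product is a worst-case ceiling: only combinations that actually occur are billed.

```python
from math import prod


def series_count(tag_cardinalities: dict[str, int]) -> int:
    """Worst-case number of unique tag combinations for one metric."""
    return prod(tag_cardinalities.values())


# Hypothetical metric: the number of distinct values per tag is invented.
baseline = {"service": 20, "region": 4, "status_code": 5}
with_user_id = {**baseline, "user_id": 10_000}

before = series_count(baseline)
after = series_count(with_user_id)

print(f"before: {before:,} combinations")
print(f"after:  {after:,} combinations ({after // before:,}x)")
```

The point of the sketch is the shape, not the numbers: every tag multiplies the ceiling by its number of distinct values, so a single unbounded tag dominates everything else.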

The takeaway: before you change anything, find out which of the four buckets is actually large. The fix for an out-of-control logs bill is completely different from the fix for a cardinality blowup.

Why OpenTelemetry is the escape hatch

For years, the reason teams stayed on Datadog despite the bill was lock-in. You instrumented your code with Datadog’s agent and libraries, so leaving meant re-instrumenting everything. OpenTelemetry (OTel) breaks that.

OTel gives you OTLP, a vendor-neutral wire protocol for traces, metrics, and logs. You instrument once and route anywhere. Switching backends becomes a config change - you point the export endpoint at a different destination - instead of a re-instrumentation project. That portability is the entire reason migrations that were historically painful are now a fixed-scope sprint.
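
As a rough illustration of "config change, not re-instrumentation": OpenTelemetry SDKs read the export target from the standard `OTEL_EXPORTER_OTLP_ENDPOINT` environment variable. The stdlib-only sketch below imitates that lookup; the helper name and both hostnames are hypothetical, and a real service would let the SDK do this rather than a hand-written function.

```python
import os

# Default OTLP/HTTP endpoint when nothing is configured.
DEFAULT_OTLP_HTTP_ENDPOINT = "http://localhost:4318"


def resolve_otlp_endpoint(env=None) -> str:
    """Return the OTLP target from the environment (hypothetical helper)."""
    env = os.environ if env is None else env
    return env.get("OTEL_EXPORTER_OTLP_ENDPOINT", DEFAULT_OTLP_HTTP_ENDPOINT)


# Same application code, two deployments: only the environment differs.
before = resolve_otlp_endpoint(
    {"OTEL_EXPORTER_OTLP_ENDPOINT": "http://backend-a.example:4318"}
)
after = resolve_otlp_endpoint(
    {"OTEL_EXPORTER_OTLP_ENDPOINT": "http://otel-collector.example:4318"}
)

print(before)
print(after)
```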

The second piece is the OpenTelemetry Collector, and this is where the savings actually happen. The collector sits between your apps and any paid backend, and it gives you cost control before data ever hits the meter:

  • Tail-based sampling - keep the traces that matter (errors, slow requests) and drop the boring 99% that you pay to store and never look at.
  • Attribute dropping - strip the high-cardinality tags that cause the metrics blowup, at the pipeline level, without touching app code.
  • Metric aggregation - roll up raw data points into the aggregates you actually graph, before they become billable custom metrics.
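
The tail-based sampling idea in the first bullet can be sketched in a few lines. This is a simplified model of the decision, not the collector's implementation (in practice this is handled by the `tail_sampling` processor in the collector's contrib distribution); the `Trace` type, the 1000 ms threshold, and the 1% baseline rate are all assumptions for illustration.

```python
import random
from dataclasses import dataclass


@dataclass
class Trace:
    trace_id: str
    duration_ms: float
    has_error: bool


def keep_trace(trace, slow_ms=1000.0, baseline_rate=0.01, rng=random.random):
    """Decide once the whole trace is known: keep errors and slow
    requests, and only a small random share of everything else."""
    if trace.has_error:
        return True
    if trace.duration_ms >= slow_ms:
        return True
    return rng() < baseline_rate


traces = [
    Trace("a1", 42.0, False),    # fast and healthy: usually dropped
    Trace("b2", 2300.0, False),  # slow: always kept
    Trace("c3", 15.0, True),     # error: always kept
]

# A fixed rng makes the example deterministic; 0.5 is above the 1% rate.
kept = [t.trace_id for t in traces if keep_trace(t, rng=lambda: 0.5)]
print(kept)  # ['b2', 'c3']
```

The key property is that the decision happens after the trace completes, which is what lets it depend on the outcome (error or latency) instead of a coin flip at the first span.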

Be honest about what OTel does not solve. It is not a database and not a UI. You still need a storage backend and dashboards. OTel is the plumbing that lets you choose cheaper plumbing fixtures - it does not replace them. Here is the routing map of the open ecosystem versus the Datadog products it displaces:

Telemetry type | Datadog product | OpenTelemetry-fronted alternative
Logs | Log Management | Grafana Loki, ClickHouse, or S3 + query layer
Metrics | Custom Metrics | Grafana Mimir or Prometheus
Traces | APM | Grafana Tempo
Dashboards | Datadog UI | Grafana
All-in-one | Full platform | SigNoz (self-hosted OTel-native)

If running a full Grafana stack sounds like a lot, SigNoz is an OTel-native single-platform option that covers logs, metrics, and traces with one deployment.

The split-model decision tree: what to migrate first

The mistake teams make is treating this as all-or-nothing. The split-model principle is simple: move the data where the savings are huge and the risk is low, keep the data where Datadog earns its keep. That ordering is what makes this safe.

Work through your telemetry in this order:

  1. Logs first. Highest volume, lowest switching risk, fastest savings. Route them through the collector to Loki, ClickHouse, or cheap object storage. Nobody’s incident response depends on Datadog’s specific log UI the way it depends on trace flame graphs, so this is the cleanest first win.
  2. Custom metrics second. Aggregate and pre-filter at the collector, prune the high-cardinality tags that caused the blowup, and ship the result to Mimir or Prometheus. This is where the cardinality-trap savings land.
  3. APM traces last, or never. If your team lives in Datadog’s trace UI, the productivity cost of moving it can exceed the bill. Keep it on Datadog initially, sample aggressively at the collector to shrink the volume, and revisit later.

A quick decision tree by data type and team maturity:

  • High volume, low UX dependence (logs) -> route through OTel collector to cheap storage. Do this first.
  • High cost from cardinality (custom metrics) -> aggregate and drop tags at the collector, then move to Prometheus/Mimir.
  • High UX dependence, moderate cost (APM) -> keep on Datadog, sample to reduce spend, migrate only if you have platform capacity.
  • Compliance-bound or tiny team -> keep on Datadog. The migration will not pay off (more on this at the end).
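
The decision tree above can be written down as a small function, which is handy when you are triaging a long list of telemetry sources. This is one possible encoding: the field names are hypothetical, and the final fallback branch is an assumption that the list above does not cover.

```python
from dataclasses import dataclass


@dataclass
class Signal:
    name: str
    high_volume: bool
    cardinality_driven_cost: bool
    high_ux_dependence: bool


def recommend(signal, compliance_bound=False, has_platform_team=True):
    """Map one telemetry source to a split-model action."""
    if compliance_bound or not has_platform_team:
        return "keep on Datadog"
    if signal.high_ux_dependence:
        return "keep on Datadog, sample at the collector, migrate last or never"
    if signal.cardinality_driven_cost:
        return "aggregate and drop tags at the collector, then move to Prometheus/Mimir"
    if signal.high_volume:
        return "route through the collector to cheap storage first"
    return "leave as-is and revisit"  # assumption: not covered by the list above


signals = [
    Signal("logs", high_volume=True, cardinality_driven_cost=False, high_ux_dependence=False),
    Signal("custom metrics", high_volume=False, cardinality_driven_cost=True, high_ux_dependence=False),
    Signal("APM traces", high_volume=True, cardinality_driven_cost=False, high_ux_dependence=True),
]

for s in signals:
    print(f"{s.name}: {recommend(s)}")
```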

Before/after: a real cost breakdown

Numbers make this concrete. Here is a worked example for a 100-host environment running $38K/month on Datadog, moved to a split model.

Category | Before (Datadog) | After (split model) | What changed
Logs | $13K/mo | $1.2K/mo | Moved to Loki + S3, killed Datadog indexing
Custom metrics | $14K/mo | $1.5K/mo | Aggregated + pruned cardinality, moved to Mimir
APM traces | $8K/mo | $2K/mo | Kept on Datadog, tail-sampled to cut volume 75%
LLM observability | $3K/mo | $0.3K/mo | Sampled spans at collector
OTel collector + storage infra | $0 | $1K/mo | New: collector fleet + Grafana stack
Total | $38K/mo | ~$6K/mo | ~84% reduction

That is a real shape, not a fantasy. Notice that the after-column adds a cost that calculators love to ignore: the collector fleet and storage cost roughly $1K/month here. And there is a cost no line item captures - engineering time to run the migration and the ongoing collector operations afterward. A self-hosted observability pipeline is infrastructure your team now owns.

So does it pay off? Run the payback math. At $38K -> $6K, you are saving ~$32K/month. Even if the migration sprint plus a quarter of ramp-up costs $60-80K of engineering time, dividing that cost by the monthly savings puts payback at roughly 2-3 months at this scale, and the savings continue every month after.
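
The arithmetic is simple enough to rerun with your own numbers. The sketch below uses the figures from the worked example; the payback is plain division of the one-off engineering cost by the full monthly savings, so it lands at about 1.9 to 2.5 months, and savings that ramp in gradually during the migration would push the real figure later.

```python
# Monthly figures in dollars, taken from the worked example.
before = {"logs": 13_000, "custom_metrics": 14_000, "apm": 8_000,
          "llm_observability": 3_000, "otel_infra": 0}
after = {"logs": 1_200, "custom_metrics": 1_500, "apm": 2_000,
         "llm_observability": 300, "otel_infra": 1_000}

total_before = sum(before.values())
total_after = sum(after.values())
monthly_savings = total_before - total_after
reduction = monthly_savings / total_before


def payback_months(one_off_cost: float, savings_per_month: float) -> float:
    """Months of full savings needed to recover a one-off cost."""
    return one_off_cost / savings_per_month


payback_low = payback_months(60_000, monthly_savings)
payback_high = payback_months(80_000, monthly_savings)

print(f"${total_before:,} -> ${total_after:,} per month ({reduction:.0%} reduction)")
print(f"payback: {payback_low:.1f} to {payback_high:.1f} months")
```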

If you want that number for your actual bill rather than a worked example, that is exactly what we do in a performance audit - we map your line items to a split-model plan with a projected savings figure before you commit to anything.

Book an Observability Cost Audit. We map your Datadog bill to a split-model migration plan with projected savings in 3 days. Start with a free bill teardown - no migration commitment required.

Migration plan: 4 weeks to first savings

The whole plan is built around one safety principle: dual-shipping. The collector sends telemetry to Datadog and the new backend simultaneously, so at every step you can compare the two and abort with zero data loss.

Week 1 - Deploy the collector, dual-ship everything. Stand up the OpenTelemetry Collector alongside the existing Datadog agent. Configure it to forward all telemetry to both Datadog and your new backend. Nothing is cut over yet, nothing is at risk, and you immediately have a parity comparison. (Our OpenTelemetry instrumentation guide covers the collector setup in depth.)
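
Conceptually, dual-shipping is a fan-out: every record goes to each destination, and a failure in one must not block the other. In the collector this is configuration (a pipeline that lists more than one exporter), not code; the Python below is only a model of the behavior, with hypothetical class and sink names.

```python
from typing import Callable, Dict, List

Record = Dict[str, object]
Exporter = Callable[[Record], None]


class FanOut:
    """Send each record to every configured destination."""

    def __init__(self, exporters: Dict[str, Exporter]):
        self.exporters = exporters
        self.failures: Dict[str, int] = {name: 0 for name in exporters}

    def export(self, record: Record) -> None:
        for name, send in self.exporters.items():
            try:
                send(dict(record))  # each destination gets its own copy
            except Exception:
                # One failing backend must not stop delivery to the other.
                self.failures[name] += 1


# In-memory lists stand in for the two backends.
datadog_sink: List[Record] = []
new_backend_sink: List[Record] = []

pipeline = FanOut({
    "datadog": datadog_sink.append,
    "new_backend": new_backend_sink.append,
})

for i in range(3):
    pipeline.export({"body": f"log line {i}", "service": "checkout"})

print(len(datadog_sink), len(new_backend_sink), pipeline.failures)
```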

Week 2 - Cut over logs, validate parity, kill Datadog log indexing. Logs go first. Point queries at Loki or ClickHouse, confirm your alerts and searches return the same results, then turn off Datadog log indexing. This single step often delivers the largest line-item drop on its own.
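
A parity check can be as simple as comparing per-service log counts from both backends over the same time window. The sketch below assumes you have already pulled those counts with each backend's own query tools; the function name, the 1% tolerance, and the sample numbers are illustrative.

```python
from collections import Counter


def parity_report(old_counts, new_counts, tolerance=0.01):
    """Return the keys whose counts differ by more than `tolerance`
    (relative to the larger of the two counts)."""
    mismatches = {}
    for key in set(old_counts) | set(new_counts):
        old, new = old_counts.get(key, 0), new_counts.get(key, 0)
        baseline = max(old, new)
        if baseline and abs(old - new) / baseline > tolerance:
            mismatches[key] = (old, new)
    return mismatches


# Hypothetical log counts per service for the same one-hour window.
datadog = Counter({"checkout": 10_000, "search": 4_200, "auth": 900})
loki = Counter({"checkout": 9_995, "search": 4_200, "auth": 610})

print(parity_report(datadog, loki))  # {'auth': (900, 610)}
```

An empty report is the signal that it is safe to turn off indexing; a non-empty one points at the service whose collector config needs attention first.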

Week 3 - Aggregate and migrate custom metrics, prune cardinality. Move metrics through the collector with aggregation and attribute-dropping enabled. Hunt down the high-cardinality tags from the cardinality trap and strip them. Ship the cleaned-up metrics to Mimir or Prometheus.
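
To make the effect of attribute dropping plus aggregation concrete, here is a small model of what happens to counter data points once the high-cardinality tags are removed. The collector does this with processors in its config, not Python; the tag names in `DROP_KEYS` and the sample points are hypothetical.

```python
from collections import defaultdict

# Hypothetical high-cardinality tags to strip before export.
DROP_KEYS = {"user_id", "request_id", "url_path"}


def strip_attributes(attrs, drop_keys=DROP_KEYS):
    """Remove the tags that cause the cardinality blowup."""
    return {k: v for k, v in attrs.items() if k not in drop_keys}


def aggregate(points):
    """Sum counter data points that share a metric name and the
    same remaining attribute set."""
    totals = defaultdict(float)
    for name, attrs, value in points:
        key = (name, tuple(sorted(strip_attributes(attrs).items())))
        totals[key] += value
    return dict(totals)


points = [
    ("http.requests", {"service": "checkout", "user_id": "u1"}, 1),
    ("http.requests", {"service": "checkout", "user_id": "u2"}, 1),
    ("http.requests", {"service": "search", "user_id": "u1"}, 1),
]

rolled_up = aggregate(points)
print(len(points), "series in ->", len(rolled_up), "series out")
```

Summing is only correct for counter-style metrics; gauges and histograms need a different merge rule, which is worth checking per metric before stripping tags.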

Week 4 - Tune retention, build Grafana dashboards, decommission. Set retention windows that match how you actually investigate incidents (you rarely need 90 days of raw debug logs). Rebuild the dashboards your team needs in Grafana. Turn off the redundant Datadog products. First full month of savings starts here.

The reason this works without a war room is the rollback safety of dual-shipping. If log parity looks wrong in week 2, you have not killed anything yet - you just keep using Datadog and fix the collector config. You can abort at any step and you are exactly where you started. For a deeper dive on the collector-and-Claude-Code tooling side of this, see our Datadog APM alternative with OpenTelemetry walkthrough.

When to keep paying Datadog

This is a cost-reduction playbook, but the honest version includes when not to do it. A migration that does not pay off is just a different way to waste money.

Keep paying Datadog when:

  • You have compliance certifications you cannot easily replicate. SOC 2, HIPAA, or PCI on a managed control plane is real value. Recreating that posture on self-hosted infrastructure can cost more than the bill you are trying to cut.
  • You run very high cardinality at genuine scale and the operational cost of running a collector fleet and storage tier exceeds the Datadog premium. Sometimes the managed plane is genuinely cheaper once you price in your own SRE time.
  • You are a small team with no platform engineering capacity. Someone has to operate the collector fleet and the Grafana stack at 3am. If that someone does not exist, Datadog’s managed plane is doing real work for you.

The honest math: below roughly $8K/month, a migration rarely pays off. The engineering time to build and operate the pipeline swamps the savings. Split-model migrations shine in the $15K-$200K/month range, where the savings are large and a dedicated platform function already exists.

If you are above that threshold and the bill is climbing, the split model is the clearest cost-saving move on the table. We run it as a fixed-scope optimization sprint with a guaranteed savings projection up front. For ongoing tuning after the migration, a performance retainer keeps the pipeline lean as you ship new services.

Want to see the comparison landscape first? Our breakdowns of Datadog vs New Relic vs Dynatrace vs AppDynamics and APM tools compared cover the alternatives in detail.

Book an Observability Cost Audit. Three days, a line-by-line teardown of your Datadog bill, and a split-model migration plan with projected savings. The free bill teardown is the fastest way to find out what 60-90% off looks like for your environment.

Frequently Asked Questions

How much can you save by moving from Datadog to OpenTelemetry?

Teams typically cut 60-90% of their Datadog bill with a split-model migration. The savings come almost entirely from moving log indexing and custom metrics off Datadog's per-unit pricing and onto an OpenTelemetry-fronted pipeline. A 100-host environment running ~$38K/month commonly lands near $6K/month after the migration, including new collector and storage costs.

What should you migrate off Datadog first to cut costs?

Migrate logs first. Logs are the highest-volume, lowest-switching-risk data type, so they deliver the fastest savings with the least chance of breaking workflows. Move custom metrics second by aggregating and pruning high-cardinality tags at the collector. Keep APM traces on Datadog initially if your team relies on the UX, and migrate those last or never.

Does OpenTelemetry replace Datadog?

No. OpenTelemetry is an instrumentation and routing layer, not a backend. It standardizes how you collect telemetry and lets you send it anywhere, but you still need a storage backend and dashboards (Grafana with Loki, Mimir, and Tempo, or an all-in-one like SigNoz). OpenTelemetry is what frees you to choose a cheaper backend without re-instrumenting your code.

What causes high Datadog bills?

Four line items drive most over-budget Datadog invoices: custom metrics (priced per unique tag combination, so cost multiplies as cardinality grows), log ingestion plus indexing, per-host APM, and 2026's LLM-observability surcharge. Custom metrics and log indexing alone typically account for 50-70% of an over-budget bill. APM hosts are rarely the main culprit.

Can you use OpenTelemetry and Datadog together?

Yes, and that is the safest way to migrate. The OpenTelemetry Collector can dual-ship telemetry to Datadog and to a new backend at the same time, so you validate parity before cutting anything over. The split-model approach keeps Datadog for the data where it earns its keep and routes the expensive data through the collector to cheaper storage.
