OpenTelemetry Instrumentation: From Zero to Distributed Tracing in 30 Minutes
Step-by-step OpenTelemetry setup guide - auto-instrument Node.js and Python apps, configure the Collector, and send traces to Grafana or Datadog.
OpenTelemetry (OTEL) is the open-source observability framework that has become the industry standard for distributed tracing, metrics, and logs. It provides a vendor-neutral instrumentation layer that lets you collect telemetry data once and send it to any backend - Grafana, Datadog, New Relic, Jaeger, or your own infrastructure.
Before OpenTelemetry, switching APM vendors meant re-instrumenting your entire application. With OTEL, you instrument once and change only the Collector configuration when you switch backends. This guide gets you from zero to working distributed traces in a Node.js or Python application, with the Collector routing data to Grafana Tempo (free) or Datadog.
Why OpenTelemetry Matters
The problem OpenTelemetry solves is vendor lock-in at the instrumentation layer. Pre-OTEL, APM vendors provided proprietary SDK clients. Installing the Datadog SDK meant your instrumentation code was tightly coupled to Datadog. Migrating to New Relic required replacing all instrumentation calls.
OpenTelemetry provides a standard API that works with any OTEL-compatible backend. Your application code calls the OTEL API. The OTEL SDK collects the data. The OTEL Collector routes it to your chosen backend.
Key benefits:
- Instrument once, send anywhere
- Auto-instrumentation for popular frameworks (Express, FastAPI, Django, gRPC, etc.)
- W3C TraceContext standard for trace propagation across service boundaries
- Supported by every major observability vendor
- CNCF incubating project with strong governance and long-term support
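The W3C TraceContext header that carries trace identity across service boundaries is a plain string, so its shape is easy to inspect without the SDK. A rough sketch (`parse_traceparent` is a hypothetical helper for illustration, not part of the OTEL API; the example IDs come from the W3C spec):

```python
def parse_traceparent(header: str) -> dict:
    """Split a W3C traceparent header into its four fields.

    Format: version(2 hex)-trace_id(32 hex)-parent_id(16 hex)-flags(2 hex)
    """
    version, trace_id, parent_id, flags = header.split("-")
    return {
        "version": version,
        "trace_id": trace_id,    # 16-byte ID shared by every span in the trace
        "parent_id": parent_id,  # 8-byte ID of the calling span
        "sampled": int(flags, 16) & 0x01 == 1,  # trace-flags bit 0 = sampled
    }

parts = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
```

Every OTEL SDK reads and writes this header automatically; you only touch it directly when bridging a system that is not yet instrumented.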
OpenTelemetry Architecture
Understanding the three components prevents confusion:
OTEL SDK (in your application): The library your application code links against. Provides the API for creating spans, recording metrics, and emitting logs. Available for 12+ languages.
OTEL Collector (separate process): Receives telemetry from your applications, processes it (batching, sampling, filtering, transforming), and exports it to backends. Runs as a sidecar container or as a cluster-wide deployment.
Backend (observability tool): Receives processed telemetry from the Collector. Jaeger, Grafana Tempo, Zipkin, Datadog, New Relic, Honeycomb - anything that speaks OTLP (OpenTelemetry Protocol).
The typical data flow:
Application (OTEL SDK) --> OTEL Collector --> Backend (Grafana/Datadog/etc.)
Node.js Auto-Instrumentation
Auto-instrumentation patches popular libraries at startup without requiring code changes. Express routes, HTTP calls, database queries, Redis operations, and more are automatically traced.
Install the dependencies:
npm install @opentelemetry/sdk-node \
  @opentelemetry/sdk-metrics \
  @opentelemetry/auto-instrumentations-node \
  @opentelemetry/exporter-trace-otlp-grpc \
  @opentelemetry/exporter-metrics-otlp-grpc \
  @opentelemetry/resources \
  @opentelemetry/semantic-conventions
Create instrumentation.js (must be required before any other module):
// instrumentation.js
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-grpc');
const { OTLPMetricExporter } = require('@opentelemetry/exporter-metrics-otlp-grpc');
const { PeriodicExportingMetricReader } = require('@opentelemetry/sdk-metrics');
const { Resource } = require('@opentelemetry/resources');
const { SEMRESATTRS_SERVICE_NAME, SEMRESATTRS_SERVICE_VERSION } = require('@opentelemetry/semantic-conventions');
const sdk = new NodeSDK({
  resource: new Resource({
    [SEMRESATTRS_SERVICE_NAME]: process.env.SERVICE_NAME || 'my-service',
    [SEMRESATTRS_SERVICE_VERSION]: process.env.SERVICE_VERSION || '1.0.0',
    'deployment.environment': process.env.NODE_ENV || 'production',
  }),
  traceExporter: new OTLPTraceExporter({
    url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT || 'http://otel-collector:4317',
  }),
  metricReader: new PeriodicExportingMetricReader({
    exporter: new OTLPMetricExporter({
      url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT || 'http://otel-collector:4317',
    }),
    exportIntervalMillis: 30000, // Export metrics every 30 seconds
  }),
  instrumentations: [
    getNodeAutoInstrumentations({
      // Instrument HTTP calls (includes Express route handling)
      '@opentelemetry/instrumentation-http': {
        enabled: true,
        // Filter out health check endpoints from traces
        ignoreIncomingRequestHook: (req) => req.url === '/health',
      },
      // Instrument PostgreSQL queries
      '@opentelemetry/instrumentation-pg': { enabled: true },
      // Instrument Redis operations
      '@opentelemetry/instrumentation-redis': { enabled: true },
      // Instrument gRPC calls
      '@opentelemetry/instrumentation-grpc': { enabled: true },
    }),
  ],
});

sdk.start();

// Ensure clean shutdown
process.on('SIGTERM', () => {
  sdk.shutdown()
    .then(() => console.log('OpenTelemetry SDK shut down successfully'))
    .catch((error) => console.error('Error shutting down OpenTelemetry SDK', error))
    .finally(() => process.exit(0));
});
Start your application with the instrumentation:
node -r ./instrumentation.js app.js
# Or set NODE_OPTIONS for automatic loading:
NODE_OPTIONS="--require ./instrumentation.js" node app.js
With this setup, every Express route, PostgreSQL query, and Redis operation will automatically generate traces with timing, status codes, and error information.
Python Auto-Instrumentation
Python auto-instrumentation uses a similar approach but leverages the opentelemetry-instrument command-line wrapper.
Install dependencies:
pip install opentelemetry-distro opentelemetry-exporter-otlp-proto-grpc
opentelemetry-bootstrap -a install # Automatically installs instrumentation packages
Configure via environment variables (no code changes required for basic setup):
export OTEL_SERVICE_NAME="payment-service"
export OTEL_SERVICE_VERSION="2.1.0"
export OTEL_EXPORTER_OTLP_ENDPOINT="http://otel-collector:4317"
export OTEL_TRACES_EXPORTER="otlp"
export OTEL_METRICS_EXPORTER="otlp"
export OTEL_LOGS_EXPORTER="otlp"
export OTEL_PYTHON_LOG_CORRELATION="true" # Inject trace IDs into log records
# Start FastAPI app with auto-instrumentation
opentelemetry-instrument uvicorn main:app --host 0.0.0.0 --port 8000
Or configure via code for more control:
# otel_setup.py
from opentelemetry import trace, metrics
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.resources import Resource
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
import os
def configure_otel():
    resource = Resource.create({
        "service.name": os.getenv("SERVICE_NAME", "payment-service"),
        "service.version": os.getenv("SERVICE_VERSION", "1.0.0"),
        "deployment.environment": os.getenv("ENVIRONMENT", "production"),
    })

    # Configure tracing
    tracer_provider = TracerProvider(resource=resource)
    tracer_provider.add_span_processor(
        BatchSpanProcessor(
            OTLPSpanExporter(endpoint=os.getenv("OTEL_ENDPOINT", "http://otel-collector:4317"))
        )
    )
    trace.set_tracer_provider(tracer_provider)

    # Configure metrics
    reader = PeriodicExportingMetricReader(
        OTLPMetricExporter(endpoint=os.getenv("OTEL_ENDPOINT", "http://otel-collector:4317")),
        export_interval_millis=30000,
    )
    meter_provider = MeterProvider(resource=resource, metric_readers=[reader])
    metrics.set_meter_provider(meter_provider)

configure_otel()
OTEL Collector Configuration
The Collector is the routing layer between your applications and backends. Configure it as a Kubernetes deployment or sidecar.
otel-collector-config.yaml:
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  # Add memory limit to prevent OOM
  memory_limiter:
    check_interval: 1s
    limit_mib: 512
    spike_limit_mib: 128
  # Batch telemetry before sending to reduce API calls
  batch:
    timeout: 5s
    send_batch_size: 1024
  # Add resource attributes to all telemetry
  resource:
    attributes:
      - key: environment
        value: "production"
        action: upsert
  # Drop noisy health-check spans
  filter:
    spans:
      exclude:
        match_type: regexp
        span_names:
          - "^.*health.*$"
          - "^.*readiness.*$"

exporters:
  # Send traces to Grafana Tempo
  otlp/tempo:
    endpoint: tempo:4317
    tls:
      insecure: true
  # Expose a scrape endpoint for Prometheus (which Grafana queries)
  prometheus:
    endpoint: 0.0.0.0:8889
  # Optionally send to Datadog
  datadog:
    api:
      key: "${DD_API_KEY}"
      site: datadoghq.com
  # Debug logging (disable in production)
  debug:
    verbosity: detailed

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch, resource, filter]
      exporters: [otlp/tempo]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [prometheus]
Deploy the Collector in Kubernetes:
# otel-collector-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: otel-collector
  namespace: observability
spec:
  replicas: 2
  selector:
    matchLabels:
      app: otel-collector
  template:
    metadata:
      labels:
        app: otel-collector
    spec:
      containers:
        - name: otel-collector
          image: otel/opentelemetry-collector-contrib:0.95.0
          args: ["--config=/conf/config.yaml"]
          ports:
            - containerPort: 4317 # OTLP gRPC
            - containerPort: 4318 # OTLP HTTP
            - containerPort: 8889 # Prometheus metrics
          volumeMounts:
            - name: collector-config
              mountPath: /conf
          resources:
            requests:
              cpu: 200m
              memory: 256Mi
            limits:
              cpu: 1000m
              memory: 512Mi
      volumes:
        - name: collector-config
          configMap:
            name: otel-collector-config
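The Deployment mounts its configuration from a ConfigMap named otel-collector-config. One way to supply it, sketched here on the assumption that the Collector config above is stored under the data key config.yaml (the key must match the path passed via --config):

```yaml
# otel-collector-configmap.yaml (sketch)
apiVersion: v1
kind: ConfigMap
metadata:
  name: otel-collector-config
  namespace: observability
data:
  config.yaml: |
    # paste the contents of otel-collector-config.yaml from above here
```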
Adding Custom Spans
Auto-instrumentation traces framework operations. For business logic, add custom spans manually.
Node.js custom spans:
const { trace, SpanStatusCode } = require('@opentelemetry/api');

const tracer = trace.getTracer('payment-service', '1.0.0');

async function processPayment(orderId, amount, paymentMethod) {
  // Create a span for the entire payment processing operation
  return tracer.startActiveSpan('process-payment', async (span) => {
    try {
      // Add business context as span attributes (values must be primitives)
      span.setAttributes({
        'order.id': orderId,
        'payment.amount': amount,
        'payment.method': paymentMethod.gateway,
        'payment.currency': 'USD',
      });

      // Child span for fraud check
      const fraudResult = await tracer.startActiveSpan('fraud-check', async (fraudSpan) => {
        try {
          const result = await fraudCheckService.check({ orderId, amount });
          fraudSpan.setAttribute('fraud.score', result.score);
          fraudSpan.setAttribute('fraud.decision', result.decision);
          return result;
        } finally {
          fraudSpan.end();
        }
      });

      if (fraudResult.decision === 'block') {
        span.setStatus({ code: SpanStatusCode.ERROR, message: 'Payment blocked by fraud check' });
        span.setAttribute('payment.outcome', 'blocked');
        throw new Error('Payment blocked');
      }

      // Child span for payment gateway call
      const chargeResult = await tracer.startActiveSpan('charge-gateway', async (chargeSpan) => {
        try {
          chargeSpan.setAttribute('gateway.name', paymentMethod.gateway);
          const result = await paymentGateway.charge({ amount, token: paymentMethod.token });
          chargeSpan.setAttribute('gateway.transaction_id', result.transactionId);
          return result;
        } finally {
          chargeSpan.end();
        }
      });

      span.setAttribute('payment.outcome', 'success');
      span.setAttribute('payment.transaction_id', chargeResult.transactionId);
      return chargeResult;
    } catch (error) {
      span.setStatus({ code: SpanStatusCode.ERROR, message: error.message });
      span.recordException(error);
      throw error;
    } finally {
      span.end();
    }
  });
}
Common Pitfalls
Pitfall 1: Sampling too aggressively. Many teams set a 1% head-based sampling rate and then cannot find traces for rare events (errors, slow outliers). Use tail-based sampling in the Collector: sample 100% of traces that contain errors or exceed a latency threshold, and sample 1-10% of successful fast traces.
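A tail-sampling policy along these lines can be expressed in the Collector's contrib distribution; the thresholds and policy names below are illustrative, not recommendations:

```yaml
processors:
  tail_sampling:
    decision_wait: 10s          # hold spans until the whole trace has likely arrived
    policies:
      - name: keep-errors       # 100% of traces containing an error status
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: keep-slow         # 100% of traces slower than the threshold
        type: latency
        latency: {threshold_ms: 500}
      - name: baseline          # 5% of everything else
        type: probabilistic
        probabilistic: {sampling_percentage: 5}
```

A trace is kept if any policy matches, so errors and slow requests survive even at a low baseline rate. Note that tail sampling requires all spans of a trace to reach the same Collector instance, which constrains load balancing.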
Pitfall 2: Not setting semantic conventions. OTEL defines standard attribute names for common concepts (http.method, db.system, rpc.service). Using standard names means your data works with pre-built dashboards and alerts in Grafana and other tools. Using custom names means you lose compatibility.
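The difference is nothing more than attribute naming, illustrated here with hypothetical attribute dictionaries (values are examples):

```python
# Ad-hoc names: tools cannot recognize these, so pre-built
# dashboards and alerts will not pick them up
nonstandard = {"httpVerb": "GET", "database_type": "postgres"}

# Semantic-convention names: recognized by any OTEL-aware backend
standard = {"http.method": "GET", "db.system": "postgresql"}
```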
Pitfall 3: No resource attributes. Without service.name, service.version, and deployment.environment attributes, traces from different services look identical. Always set these in your SDK configuration.
Pitfall 4: Sending traces directly to the backend. Always use the Collector. It handles batching, retries on backend failures, and sampling - things that are complex and resource-intensive to implement in every application SDK.
Pitfall 5: Ignoring cardinality in metrics. OTEL metrics with high-cardinality attributes (user ID, request ID, IP address) will cause backend cardinality explosions. Use low-cardinality attributes for metrics (service name, endpoint path, status code range). Save high-cardinality data for trace attributes.
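One common mitigation is to collapse high-cardinality values into buckets before attaching them to metrics. A minimal sketch (`status_class` and `route_template` are hypothetical helpers, not OTEL APIs):

```python
def status_class(status_code: int) -> str:
    """Collapse individual HTTP status codes into low-cardinality classes."""
    return f"{status_code // 100}xx"

def route_template(path: str) -> str:
    """Replace numeric path segments with a placeholder so /users/42 and
    /users/97 count toward the same metric series."""
    return "/".join(":id" if seg.isdigit() else seg for seg in path.split("/"))
```

With these helpers, a request counter would carry attributes like `{"http.route": route_template(path), "http.status_class": status_class(code)}` while the raw user ID and full URL stay on the trace, where cardinality is cheap.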
Implemented well, distributed tracing lets engineering teams debug complex microservice performance problems in minutes instead of hours. Our observability setup service takes your team from zero to production-grade distributed tracing with proper sampling, dashboards, and alerting.
Your P99 Deserves Better
Book a free 30-minute performance scope call with our engineers. We review your latency profile, identify the most impactful optimization target, and scope a sprint to fix it.