The OpenTelemetry Collector is often the point where observability strategy becomes operational reality. It is where traces, metrics, and logs are shaped, filtered, enriched, routed, and protected before they reach a backend. In production, collector configuration matters as much as instrumentation because small choices around batching, memory limits, sampling, enrichment, and deployment topology can affect reliability, cost, and incident response. This guide walks through production-ready OpenTelemetry Collector configuration patterns you can use as a repeatable workflow, with practical pipeline examples, tradeoffs, and review points teams can revisit as their observability stack evolves.
Overview
A good OpenTelemetry Collector config is not just a YAML file that happens to work. It is a set of decisions about trust boundaries, data volume, failure modes, ownership, and backend portability.
In simple environments, teams often start with a single collector that receives OTLP data and exports it to one vendor. That is a reasonable beginning, but production setups usually need more structure, built from five component types:
- Receivers define what telemetry the collector accepts.
- Processors shape telemetry in flight, for example by batching, filtering, enriching, or limiting memory usage.
- Exporters send data to one or more backends.
- Extensions support operational needs such as health checks or authentication helpers.
- Service pipelines bind all of the above together for traces, metrics, and logs.
The collector becomes easier to manage when you treat configuration as a product interface between application teams, platform teams, and observability owners. A production pattern should answer a few baseline questions:
- Which telemetry types are accepted?
- What data is required, optional, or prohibited?
- Where does enrichment happen?
- What should be sampled or dropped?
- Which exporters are primary and which are fallback?
- How will the collector itself be observed?
- How will changes be tested and rolled out?
If you are running in Kubernetes, the collector also intersects with platform engineering choices such as DaemonSet versus Deployment, gateway versus agent topology, and config packaging through Helm, Kustomize, or other tooling. If you are comparing approaches for managing Kubernetes configuration at scale, it may help to review Helm vs Kustomize vs Jsonnet.
Step-by-step workflow
The most sustainable way to design a Collector pipeline is to work from requirements to topology to concrete configuration. That helps avoid the common trap of copying an example file and discovering later that it does not match your data volume, security requirements, or ownership model.
1. Define telemetry goals before writing config
Start with the purpose of the collector, not the syntax. Clarify:
- Who sends telemetry: applications, nodes, ingress, service mesh, CI jobs, or managed services?
- Which signals matter first: traces, metrics, logs, or profiling if supported in your stack?
- What decisions should telemetry support: alerting, debugging, cost analysis, SLO review, release validation?
- What attributes must exist on all records: environment, service name, cluster, namespace, team, region?
This step is important because processor choice should reflect operational intent. For example, if cross-team cost allocation matters, consistent resource attributes are more important than adding every possible exporter option.
2. Choose a deployment topology
Most production teams use one of three patterns:
- Agent pattern: a collector runs close to workloads, often as a DaemonSet or sidecar, and forwards data onward.
- Gateway pattern: a centralized collector deployment receives telemetry from agents or applications and handles heavier processing and exporting.
- Hybrid pattern: lightweight agents perform local collection and basic enrichment, while gateway collectors handle sampling, routing, retries, and backend-specific export logic.
As a general rule, hybrid is the most flexible production pattern. It separates local collection concerns from central policy concerns. Agents can gather host or node context, while gateways standardize data and shield applications from backend changes.
3. Start with a minimal but safe baseline
A practical baseline for production usually includes:
- otlp receiver for traces, metrics, and logs where relevant
- memory_limiter processor to reduce the chance of collector instability under pressure
- batch processor to improve network efficiency and exporter behavior
- resource or attributes processor for standard metadata
- One primary exporter
- health_check extension for readiness and liveness integration
Here is a simplified baseline:
    receivers:
      otlp:
        protocols:
          grpc:
          http:

    processors:
      memory_limiter:
        check_interval: 1s
        limit_percentage: 75
        spike_limit_percentage: 20
      batch:
        timeout: 5s
        send_batch_size: 1024
      resource:
        attributes:
          - key: environment
            value: production
            action: upsert

    exporters:
      otlp:
        endpoint: telemetry-backend:4317
        tls:
          insecure: false

    extensions:
      health_check: {}

    service:
      extensions: [health_check]
      pipelines:
        traces:
          receivers: [otlp]
          processors: [memory_limiter, resource, batch]
          exporters: [otlp]
        metrics:
          receivers: [otlp]
          processors: [memory_limiter, resource, batch]
          exporters: [otlp]

This is not complete for every case, but it is a sensible starting point because it addresses stability, metadata consistency, and export efficiency without overfitting early.
4. Add enrichment deliberately
Production collectors often need more context than applications provide. Common enrichment patterns include adding cluster name, namespace, cloud region, business unit, tenant identifier, or deployment ring.
Be careful here: enrichment should improve queryability and routing, not create attribute sprawl. Prefer a documented set of required attributes and reject ad hoc additions unless there is a clear use case.
If your collector runs in Kubernetes, enrichment commonly uses resource detection or Kubernetes-specific metadata processors. Standardize naming early. For example, decide whether environment values are prod or production, and keep that consistent across all pipelines.
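As a minimal sketch of Kubernetes enrichment, the fragment below combines the `resourcedetection` and `k8sattributes` processors with a static `resource` processor. It assumes the contrib distribution of the Collector. The cluster name `prod-eu-1` and the selection of metadata keys are illustrative assumptions, and the RBAC permissions that `k8sattributes` needs to read pod metadata are not shown.

```yaml
processors:
  # Detect host and environment-provided resource attributes.
  resourcedetection:
    detectors: [env, system]
    timeout: 2s
    override: false            # keep attributes the application already set

  # Attach pod metadata by matching the sender to a pod.
  k8sattributes:
    auth_type: serviceAccount
    extract:
      metadata:
        - k8s.namespace.name
        - k8s.pod.name
        - k8s.deployment.name
        - k8s.node.name
    pod_association:
      - sources:
          - from: resource_attribute
            name: k8s.pod.ip
      - sources:
          - from: connection

  # Static attributes taken from the documented required set.
  resource:
    attributes:
      - key: k8s.cluster.name
        value: prod-eu-1       # hypothetical cluster name
        action: insert         # do not overwrite an existing value
      - key: environment
        value: production
        action: upsert
```

In a pipeline these would typically sit after `memory_limiter` and before `batch`, so that every record is enriched before it is grouped for export.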
5. Filter or sample with intent
Filtering and sampling are where observability quality and observability cost often meet. They should be based on use cases, not just data reduction targets.
Useful production patterns include:
- Drop noisy telemetry from health endpoints, low-value internal checks, or duplicate sources.
- Retain error traces at higher rates than successful requests.
- Route security-relevant logs differently from application debug logs.
- Apply tail sampling at a gateway when complete trace context is available.
A common mistake is to sample too early at the edge and later discover that incident investigation needs traces that never arrived. If you need tail-based decisions, keep that logic in a central layer that can see full trace behavior.
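The patterns above might look like the following gateway fragment. It is a sketch that assumes the contrib `filter` and `tail_sampling` processors. The paths `/healthz` and `/readyz`, the `url.path` attribute name (which depends on the semantic-convention version your SDKs emit), and all thresholds are assumptions to adapt.

```yaml
processors:
  # Drop spans for health endpoints; any matching condition drops the span.
  filter/health:
    error_mode: ignore
    traces:
      span:
        - 'attributes["url.path"] == "/healthz"'
        - 'attributes["url.path"] == "/readyz"'

  # Gateway only: decide per trace after its spans have had time to arrive.
  tail_sampling:
    decision_wait: 10s
    num_traces: 50000
    policies:
      - name: keep-errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: keep-slow
        type: latency
        latency:
          threshold_ms: 1000
      - name: sample-the-rest
        type: probabilistic
        probabilistic:
          sampling_percentage: 10
```

A trace is kept if any policy selects it, so errors and slow requests are retained in full while the remainder is sampled at 10 percent. With more than one gateway replica, all spans of a trace need to reach the same instance for the decision to be sound; routing by trace ID with the contrib `loadbalancing` exporter in a tier in front of the samplers is one common way to arrange that.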
6. Separate pipelines by signal and purpose
Do not assume traces, metrics, and logs should all use the same processing chain. In many environments they should not.
- Traces may need sampling and span attribute cleanup.
- Metrics may need aggregation, relabeling, or cardinality controls.
- Logs may need parsing, redaction, or routing based on severity or source.
You may also want separate pipelines for different classes of telemetry, such as application traces versus infrastructure metrics, or customer-facing services versus internal systems.
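One way to express this separation is with named pipelines, using the Collector's `type/name` convention. The sketch below shows only the `service` section. Every component name after a slash (`attributes/span_cleanup`, `filter/high_cardinality`, `attributes/redact`, and the `otlp/...` exporters) is hypothetical and would need its own definition under `receivers`, `processors`, or `exporters`.

```yaml
service:
  pipelines:
    # Traces: sampling and span attribute cleanup.
    traces:
      receivers: [otlp]
      processors: [memory_limiter, tail_sampling, attributes/span_cleanup, batch]
      exporters: [otlp/traces]
    # Application metrics: cardinality controls.
    metrics/app:
      receivers: [otlp]
      processors: [memory_limiter, filter/high_cardinality, batch]
      exporters: [otlp/metrics]
    # Infrastructure metrics: different source, lighter processing.
    metrics/infra:
      receivers: [hostmetrics]
      processors: [memory_limiter, resourcedetection, batch]
      exporters: [otlp/metrics]
    # Logs: redaction before export.
    logs:
      receivers: [otlp]
      processors: [memory_limiter, attributes/redact, batch]
      exporters: [otlp/logs]
```

Keeping `memory_limiter` first and `batch` last in every chain preserves the safety properties of the baseline while letting the middle of each chain vary by signal.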
7. Plan for failure behavior
A collector is part of your production path. That does not mean it should become a single point of failure. Think through:
- What happens if the backend is unavailable?
- What happens if telemetry volume spikes?
- Should low-priority telemetry be dropped before high-priority telemetry?
- How much retry buffering is acceptable?
- What is the blast radius if a bad config is deployed?
In practice, this means using conservative memory settings, sensible batching, and staged rollouts. It also means understanding whether your applications fail open or fail closed if the collector is unavailable. In most cases, you want telemetry failures to degrade observability rather than break user traffic.
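For the backend-unavailable and volume-spike questions, exporter retry and queue settings are the main levers. The following sketch assumes the contrib `file_storage` extension for a persistent queue. The directory path and every number are placeholders to size against your own memory limits and acceptable data loss.

```yaml
extensions:
  file_storage:
    directory: /var/lib/otelcol/queue   # hypothetical path; must exist and be writable

exporters:
  otlp:
    endpoint: telemetry-backend:4317
    timeout: 10s
    retry_on_failure:
      enabled: true
      initial_interval: 5s
      max_interval: 30s
      max_elapsed_time: 300s    # stop retrying a batch after 5 minutes
    sending_queue:
      enabled: true
      num_consumers: 4
      queue_size: 2000          # bounded; when full, new data is rejected
      storage: file_storage     # persist queued data across restarts

service:
  extensions: [file_storage]
```

A bounded queue with a finite retry window makes the failure mode explicit: during a long outage the collector sheds telemetry rather than growing without limit.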
8. Version and test configuration like application code
Treat collector configuration as an artifact with reviews, validation, and environment promotion. Keep config in version control. Test it in lower environments. Use CI to run linting or startup validation where possible. Roll out changes progressively.
If you are already refining your CI/CD path, Best CI/CD Tools for Small Teams and Growing Engineering Orgs can help frame where config validation belongs in your delivery workflow.
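Startup validation in CI can be as small as one job. The sketch below assumes GitHub Actions, a config file at `collector/config.yaml`, and the public `otel/opentelemetry-collector-contrib` image; the workflow path, job name, and file layout are hypothetical. It relies on the Collector's `validate` subcommand, which checks the configuration without starting pipelines.

```yaml
# .github/workflows/collector-config.yml (hypothetical path and names)
name: collector-config
on:
  pull_request:
    paths:
      - "collector/**"
jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Validate collector config
        # Pin the image tag to the Collector version you run in production.
        run: |
          docker run --rm \
            -v "$PWD/collector/config.yaml:/etc/otelcol/config.yaml:ro" \
            otel/opentelemetry-collector-contrib:latest \
            validate --config=/etc/otelcol/config.yaml
```

If the config references environment variables, the job would also need to supply placeholder values so that validation sees a complete file.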
9. Use a gateway config for policy and an agent config for locality
One of the most reusable Collector pipeline patterns is a split between agent and gateway responsibilities.
Agent responsibilities often include:
- Receive local OTLP traffic
- Collect host or node metrics
- Add local metadata
- Forward upstream
Gateway responsibilities often include:
- Central authentication and authorization
- Tail sampling
- Global filtering rules
- Export to multiple backends
- Tenant or team-based routing
This pattern is especially useful when teams want consistent policies across many clusters or environments.
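A minimal agent-side sketch of this split is shown below. It assumes the contrib distribution for `hostmetrics` and `resourcedetection`, and a gateway reachable at the hypothetical address `otel-gateway.observability.svc:4317`. The gateway side would resemble the earlier baseline with sampling, filtering, and backend exporters added.

```yaml
# agent.yaml: runs close to workloads, for example as a DaemonSet
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317     # accept local OTLP traffic
  hostmetrics:
    collection_interval: 30s
    scrapers:
      cpu:
      memory:
      filesystem:
      network:

processors:
  memory_limiter:
    check_interval: 1s
    limit_percentage: 75
    spike_limit_percentage: 20
  resourcedetection:
    detectors: [env, system]       # add local metadata
  batch:

exporters:
  otlp/gateway:
    endpoint: otel-gateway.observability.svc:4317   # hypothetical gateway service
    tls:
      insecure: false              # the gateway certificate must be trusted by the agent

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, resourcedetection, batch]
      exporters: [otlp/gateway]
    metrics:
      receivers: [otlp, hostmetrics]
      processors: [memory_limiter, resourcedetection, batch]
      exporters: [otlp/gateway]
```

The agent holds no backend credentials and no sampling policy, so a backend change or a policy change touches only the gateway configuration.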
Tools and handoffs
Production collector configuration is rarely owned by one person. The handoffs matter.
Platform team
The platform team usually owns deployment topology, scaling, config packaging, and cluster-level integration. Their responsibilities often include:
- Choosing DaemonSet, Deployment, or hybrid rollout patterns
- Defining baseline processors and approved exporters
- Managing secrets and certificates for exporter endpoints
- Providing reusable templates for teams
If exporter credentials or endpoints require secure secret distribution, align collector deployment with your broader secret handling approach. For reference, see Secrets Management Tools Compared.
Application teams
Application teams should not have to become collector experts, but they should understand the metadata and semantic conventions expected by the platform. Their handoff typically includes:
- Instrumenting services correctly
- Setting service names and environment variables consistently
- Avoiding sensitive payloads in spans or logs
- Flagging unusual telemetry needs early, such as large event attributes or custom sampling requirements
Observability or SRE team
This group usually defines backend requirements and data retention expectations. They often decide:
- Which telemetry is critical for incident response
- What minimum attributes are required
- What sampling strategy balances cost with debugging value
- What alerts should exist for the collector itself
Many teams benefit from documenting this as a short interface contract: required resource attributes, approved exporters, redaction rules, and ownership of changes.
Security team
Collectors sit on a sensitive boundary because they may handle application metadata, request context, and potentially logs with regulated data. Security review should cover:
- TLS settings and certificate management
- Authentication between agents, gateways, and backends
- Redaction or filtering of sensitive fields
- Network policies and least-privilege access
This is one reason to keep exporter configuration explicit and reviewed rather than spread across ad hoc per-team files.
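Several of these review points map to concrete settings. The gateway fragment below is a sketch that assumes the contrib `bearertokenauth` extension and `attributes` processor. The certificate paths, the `GATEWAY_TOKEN` variable, and the two attribute keys are hypothetical examples.

```yaml
extensions:
  bearertokenauth:
    token: ${env:GATEWAY_TOKEN}    # injected from a secret, never committed

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
        tls:
          cert_file: /etc/otelcol/tls/tls.crt   # hypothetical mount paths
          key_file: /etc/otelcol/tls/tls.key
        auth:
          authenticator: bearertokenauth        # reject senders without the token

processors:
  attributes/redact:
    actions:
      - key: http.request.header.authorization
        action: delete             # remove the value entirely
      - key: user.email
        action: hash               # keep joinability without the raw value

service:
  extensions: [bearertokenauth]
```

The `attributes` processor acts on span, log, and data point attributes; resource-level attributes would need the `resource` processor instead. Network policies and least-privilege access sit outside the collector config and are not shown.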
Configuration tooling
Whichever templating tool you use, aim for a small number of composable patterns rather than many one-off variants. Common inputs include:
- Environment name
- Cluster or region
- Exporter endpoint
- Sampling mode
- Enabled receivers
Make overrides intentional. Production drift often begins when every team gets unrestricted control over processors and exporters.
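One low-ceremony way to keep a single shared file while varying these inputs is the Collector's `${env:NAME}` substitution. In the sketch below the variable names (`DEPLOY_ENV`, `CLUSTER_NAME`, `OTLP_ENDPOINT`, `OTLP_TOKEN`) are assumptions; they would be supplied by the deployment tooling per environment.

```yaml
processors:
  resource:
    attributes:
      - key: environment
        value: ${env:DEPLOY_ENV}
        action: upsert
      - key: k8s.cluster.name
        value: ${env:CLUSTER_NAME}
        action: upsert

exporters:
  otlp:
    endpoint: ${env:OTLP_ENDPOINT}
    headers:
      Authorization: "Bearer ${env:OTLP_TOKEN}"   # token comes from a secret
```

Only the values listed as inputs vary between environments; the processor chain and exporter set stay identical, which makes drift easier to spot in review.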
Quality checks
Before calling a collector configuration production-ready, validate it from four angles: reliability, observability, security, and cost.
Reliability checks
- Does the collector expose health endpoints and integrate with readiness and liveness probes?
- Are memory limits, batching, and retries configured conservatively?
- Can you roll back quickly if a bad config causes drops or crashes?
- Have you tested volume spikes and backend unavailability?
The collector should be observable itself. Export its own metrics and logs so you can see queue growth, dropped telemetry, exporter failures, and process instability.
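A minimal self-observability fragment might look like this. The health check port shown is the extension's conventional default. How the internal metrics endpoint is exposed has changed across Collector versions, so only the verbosity levels are set here and the endpoint is left at its default.

```yaml
extensions:
  health_check:
    endpoint: 0.0.0.0:13133      # target for readiness and liveness probes

service:
  extensions: [health_check]
  telemetry:
    logs:
      level: info
    metrics:
      level: detailed            # more internal metrics, at some overhead
```

Internal metrics such as `otelcol_exporter_queue_size` and `otelcol_exporter_send_failed_spans` are the usual starting points for alerts on queue growth and exporter failures, though exact metric names can vary by version.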
Data quality checks
- Do all services emit consistent service names?
- Are environment, cluster, namespace, and team attributes present where expected?
- Are duplicate attributes or conflicting labels being introduced?
- Are spans, metrics, and logs reaching the correct backend and tenant?
A very common production issue is not total failure but subtle inconsistency. Queries break because one team uses different names, or alert routing misses a service because metadata is incomplete.
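Where fixing the source is not immediately possible, a gateway can normalize known variants. The sketch below assumes the contrib `transform` processor and uses the `prod` versus `production` case as the example; the attribute key `environment` matches the baseline config, and the processor name is hypothetical.

```yaml
processors:
  transform/normalize_env:
    error_mode: ignore
    trace_statements:
      - context: resource
        statements:
          - set(attributes["environment"], "production") where attributes["environment"] == "prod"
    metric_statements:
      - context: resource
        statements:
          - set(attributes["environment"], "production") where attributes["environment"] == "prod"
    log_statements:
      - context: resource
        statements:
          - set(attributes["environment"], "production") where attributes["environment"] == "prod"
```

Treat this as a stopgap: normalization rules in the collector hide the inconsistency from the team that produces it, so pair each rule with a follow-up to fix the emitting service.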
Security checks
- Is telemetry encrypted in transit?
- Are secrets injected securely rather than hardcoded?
- Are sensitive headers, tokens, or payload fields redacted?
- Is access to collector endpoints restricted appropriately?
Collectors can unintentionally become a path for data leakage. Review processors and exporters with the same care you would apply to any production data pipeline.
Cost checks
- Which telemetry streams drive the most ingestion volume?
- Are debug logs or high-cardinality metrics creating avoidable cost?
- Is sampling preserving useful error and latency signals?
- Can some telemetry be routed to lower-cost storage or shorter retention?
Cost control is not just a backend concern. Thoughtful collector policy can reduce noise earlier and more safely. If you also archive telemetry or related artifacts to object storage, your storage choice may matter operationally; see S3-Compatible Object Storage Comparison for broader storage tradeoffs.
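As an illustration of reducing noise before it reaches the backend, the fragment below uses the contrib `filter` processor to drop low-severity logs and a family of unused metrics. The metric name prefix `internal_debug_` is a hypothetical example; identify real candidates from your own ingestion data first.

```yaml
processors:
  filter/cost:
    error_mode: ignore
    logs:
      log_record:
        # Drop records below INFO, but keep records that carry no severity.
        - 'severity_number != SEVERITY_NUMBER_UNSPECIFIED and severity_number < SEVERITY_NUMBER_INFO'
    metrics:
      metric:
        # Drop whole metrics whose name matches a known-unused prefix.
        - 'IsMatch(name, "^internal_debug_.*")'
```

Dropped data is unrecoverable, so rules like these are best rolled out per environment and reviewed alongside the sampling strategy.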
Operational checklist
Use this short checklist during rollout:
- Baseline config validated in non-production
- Health checks enabled
- Collector self-observability enabled
- Required resource attributes enforced
- Sensitive fields filtered or redacted
- Sampling strategy documented
- Exporter failure behavior tested
- Rollback path documented
- Ownership of config changes defined
When to revisit
Collector configuration should be reviewed as a living operational policy, not a one-time setup. Revisit it when any of the following changes:
- Your backend changes, including vendor migrations, dual-write periods, or new retention models.
- Your traffic shape changes, such as major growth, new regions, or new high-volume services.
- Your architecture changes, such as moving to Kubernetes gateway patterns, adding service mesh telemetry, or splitting monoliths into many services.
- Your compliance needs change, including stricter redaction, new data handling boundaries, or tenant isolation requirements.
- Your incident response needs change, especially if postmortems show missing context, poor trace coverage, or noisy metrics.
- Collector components evolve, including new processors, deprecated exporters, or improved scaling features.
A practical review cadence is to revisit collector policy during:
- Quarterly observability reviews
- Major platform changes
- Backend migrations
- After significant incidents
- Before expanding telemetry collection to new teams or services
If you want this review process to stay lightweight, define three documents and keep them current:
- Baseline config: the approved default pipelines and processors.
- Exception register: teams or services with approved deviations.
- Review checklist: the quality checks from the previous section.
The goal is not to freeze your collector design. It is to make change safer. The most useful Collector practices are the ones your team can sustain: standard defaults, clear ownership, measured rollout, and regular review.
As your stack matures, keep the collector boring in the best sense of the word. Use it to centralize policy, preserve portability, and reduce observability surprises. If you do that, your collector configuration becomes more than syntax. It becomes part of your production operating model.