Kubernetes Requests and Limits Without Waste

A practical guide to setting Kubernetes requests and limits so workloads stay stable without quietly inflating cluster costs.

Setting Kubernetes resource requests and limits is one of those tasks that looks simple in YAML but has outsized impact on cost, reliability, and developer experience. Done well, requests help the scheduler place workloads efficiently and give autoscalers sensible signals. Done poorly, they inflate cluster size, cause throttling, trigger evictions, or hide waste behind a false sense of safety. This guide gives you a practical, repeatable way to right-size Kubernetes resources without chasing perfect precision. You will learn how to estimate starting values, what assumptions matter, how to evaluate tradeoffs for CPU and memory, and when to revisit your numbers as workloads, pricing, and autoscaling behavior change.

Overview

The goal of Kubernetes requests and limits is not to guess the exact amount of compute a container will ever need. The real goal is to express intent clearly enough that the platform can make good scheduling decisions while your team avoids paying for too much idle capacity.

In practice, requests and limits influence different parts of the system:

Requests tell the scheduler how much CPU and memory a pod should reserve on a node.
Limits define the maximum amount a container may use before the kernel or runtime enforces a boundary.
CPU requests often affect placement and autoscaling more than direct runtime behavior.
Memory limits are usually more dangerous to set carelessly because exceeding them can lead to OOM kills.

That difference is why a single rule like “always set equal requests and limits” is rarely good enough. Some workloads benefit from tight control. Others need room to burst. Some are latency sensitive. Others are batch oriented and can tolerate slower completion.

A useful mental model is this:

Set requests based on what the workload typically needs to operate well.
Set limits based on what the workload can safely burst to, if limits are appropriate at all.
Review both values through the lens of cost, performance, and failure mode.
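
To make the two values concrete, here is a minimal sketch of the shape of a container's resources field, written as a Python dict, with two small helpers that convert the quantity strings into plain numbers. The figures are placeholders rather than recommendations, and the helper names are illustrative; the parser only handles millicore CPU values and binary memory suffixes.

```python
# Shape of a container's `resources` field, written as a Python dict.
# The numbers are placeholders, not recommendations.
resources = {
    "requests": {"cpu": "250m", "memory": "256Mi"},  # what the scheduler reserves
    "limits": {"memory": "512Mi"},                   # memory ceiling; no CPU limit set here
}

_BINARY_SUFFIXES = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30}


def cpu_to_cores(quantity: str) -> float:
    """Convert a CPU quantity such as '250m' or '2' to cores."""
    if quantity.endswith("m"):
        return int(quantity[:-1]) / 1000
    return float(quantity)


def memory_to_bytes(quantity: str) -> int:
    """Convert a memory quantity such as '256Mi' to bytes (binary suffixes only)."""
    for suffix, factor in _BINARY_SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)  # plain byte count
```

Converting to plain numbers early makes it easier to compare requests against observed usage in the steps that follow.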

If you are working on broader cluster efficiency, this topic pairs naturally with Cloud Cost Allocation Best Practices for Kubernetes Clusters and Best Observability Tools for Kubernetes, because right-sizing without cost visibility or workload metrics usually leads to guesswork.

As a starting point, avoid three common mistakes:

Copying defaults across every service. A queue worker, API deployment, and stateful cache do not behave the same way.
Treating CPU and memory the same. CPU pressure usually degrades performance; memory pressure often causes restarts.
Optimizing only for peak load. Sizing every pod for worst-case demand can make normal-day costs permanently higher than they need to be.

How to estimate

A practical resource tuning process should be easy to repeat whenever a service changes. The simplest durable method is to estimate from observed usage, add explicit headroom, and validate the result under representative traffic.

Use this five-step approach to right-size Kubernetes resources:

1. Measure steady-state usage

Start with real metrics from a normal operating window, not a single deployment and not an incident period. For each container, capture:

Typical CPU usage over time
Peak CPU during expected traffic bursts
Typical memory working set
Peak memory during normal operation
Startup or warmup spikes
Any periodic behavior such as cron jobs, cache refreshes, or batch pulls

Percentiles are more useful than averages. Averages hide bursts that users feel. For many services, a high percentile of normal usage is a better starting point than the absolute maximum, provided you understand what produced the maximum.
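
A small sketch shows why the choice of statistic matters. The sample values below are invented CPU readings in millicores with one burst in them, and the nearest-rank percentile is just one common definition; your metrics backend may interpolate differently.

```python
import math
import statistics


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical CPU samples (millicores) containing a single burst.
cpu_millicores = [120, 130, 125, 140, 135, 128, 132, 410, 138, 126]

mean = statistics.mean(cpu_millicores)  # pulled upward by the burst
p90 = percentile(cpu_millicores, 90)    # high end of normal usage
peak = max(cpu_millicores)              # the burst itself
```

Here the mean lands above every normal reading yet far below the burst, so it describes neither case well. The 90th percentile tracks the upper end of ordinary usage, and the peak is worth investigating separately before you size for it.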

2. Set CPU requests from typical need, not theoretical peak

CPU is compressible. A container can usually continue running with less CPU, though it may respond more slowly. That makes CPU requests a scheduling and fairness tool more than a strict safety threshold.

A simple starting heuristic:

Set the CPU request near the high end of steady-state usage under expected load.
Add enough headroom for ordinary variation, not rare spikes.
If the workload is bursty and user-facing, consider allowing more burst above the request.

This helps the cluster pack pods efficiently while still leaving room for bursts or horizontal scaling.
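
The heuristic above can be sketched as a small helper. The 15 percent headroom and the rounding step are assumptions for illustration, not recommended values, and the function name is hypothetical.

```python
import math


def suggest_cpu_request(p90_millicores: float, headroom: float = 0.15, step: int = 10) -> str:
    """Add headroom to high-percentile usage and round up to the next `step` millicores."""
    target = p90_millicores * (1 + headroom)
    rounded = math.ceil(target / step) * step
    return f"{rounded}m"


# A service whose p90 CPU usage under expected load is 140 millicores.
request = suggest_cpu_request(140)
```

Note that the input is a high percentile of normal usage, not the observed peak. Rare spikes are left to burst capacity or horizontal scaling.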

3. Set memory requests closer to what you really need

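Memory is not compressible in the same way. If a workload genuinely needs a certain amount of memory to run, under-requesting lets the scheduler place it on nodes that cannot actually hold it, which raises the risk of eviction when the node comes under memory pressure, and under-limiting can lead to OOM kills.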

A simple starting heuristic:

Set the memory request near the upper range of normal working set.
Add headroom for language runtime behavior, caches, and startup growth.
Use extra caution with workloads that have unpredictable heap usage or in-memory queues.

Compared with CPU, memory requests usually deserve a more conservative setting.
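
As a sketch of the more conservative memory rule, the helper below takes a high percentile of the working set, adds headroom, and makes sure the result also covers the startup spike. The 25 percent headroom, the 16 MiB rounding step, and the function name are all assumptions chosen for illustration.

```python
import math

MIB = 2**20


def suggest_memory_request(p99_working_set_bytes: int, startup_peak_bytes: int,
                           headroom: float = 0.25, step_mib: int = 16) -> str:
    """Cover the larger of (working set plus headroom) and the startup peak, rounded up."""
    base = max(p99_working_set_bytes * (1 + headroom), startup_peak_bytes)
    mib = math.ceil(base / MIB / step_mib) * step_mib
    return f"{mib}Mi"


# Steady working set of 400 MiB, brief startup peak of 450 MiB.
request = suggest_memory_request(400 * MIB, 450 * MIB)
```

Unlike the CPU case, the startup peak is an explicit input here, because a container that cannot get through warmup within its memory budget never reaches steady state.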

4. Decide whether limits help or hurt

Limits are not automatically good. They should match the failure mode you want.

CPU limits can prevent a noisy neighbor from consuming too much node capacity, but they may also cause throttling and latency spikes for bursty applications.
Memory limits can stop runaway growth, but if set too close to actual need they can make the application unstable.

For many teams, the more useful question is not “what limit should we use?” but “should this workload have this limit at all?” That answer depends on multi-tenancy, platform guardrails, and the cost of degraded performance versus hard failure.
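
To see why a CPU limit can throttle a bursty application, it helps to do the quota arithmetic. The sketch below assumes the common default of a 100 ms CFS period, in which a limit is enforced as a slice of CPU time per period. The function names and the simple demand model are illustrative, not a description of the kernel's exact accounting.

```python
CFS_PERIOD_MS = 100  # assumed default period; clusters can configure a different value


def cpu_time_per_period_ms(limit_millicores: int, period_ms: int = CFS_PERIOD_MS) -> float:
    """CPU time a container may consume in one period under the given limit."""
    return limit_millicores / 1000 * period_ms


def is_throttled(limit_millicores: int, threads: int, busy_ms_per_thread: float) -> bool:
    """True if the threads together want more CPU time in one period than the quota allows."""
    demand_ms = threads * busy_ms_per_thread
    return demand_ms > cpu_time_per_period_ms(limit_millicores)


quota_ms = cpu_time_per_period_ms(500)        # a 500m limit
burst = is_throttled(500, threads=4, busy_ms_per_thread=20)
steady = is_throttled(500, threads=1, busy_ms_per_thread=30)
```

The burst case is the one that surprises teams: four threads each doing 20 ms of work use up the quota early in the period and then wait for the next one, even though average utilization looks modest.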

5. Validate with load and rollout checks

After choosing values, test under realistic traffic or replayed workload patterns. Then watch:

Pod restarts
CPU throttling indicators
Latency changes
Memory growth over time
Node pressure and evictions
Horizontal Pod Autoscaler behavior if used
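
Part of this checklist can be turned into an automated rollout check. The sketch below is one way to do it, assuming you can already collect throttling counters (for example the nr_throttled and nr_periods values from the cgroup cpu.stat file), restart counts, and latency percentiles. The thresholds and the RolloutSample type are hypothetical and should be tuned per service.

```python
from dataclasses import dataclass


@dataclass
class RolloutSample:
    throttled_periods: int
    total_periods: int
    restarts: int
    p95_latency_ms: float
    baseline_p95_latency_ms: float


def rollout_findings(sample: RolloutSample,
                     max_throttle_ratio: float = 0.05,
                     max_latency_regression: float = 0.10) -> list:
    """Return human-readable findings; an empty list means no check failed."""
    findings = []
    ratio = sample.throttled_periods / sample.total_periods if sample.total_periods else 0.0
    if ratio > max_throttle_ratio:
        findings.append(f"CPU throttled in {ratio:.0%} of periods")
    if sample.restarts > 0:
        findings.append(f"{sample.restarts} container restart(s) since rollout")
    if sample.p95_latency_ms > sample.baseline_p95_latency_ms * (1 + max_latency_regression):
        findings.append("p95 latency regressed beyond the allowed margin")
    return findings


after_change = RolloutSample(120, 1000, 0, 210.0, 200.0)
findings = rollout_findings(after_change)
```

Memory growth, node pressure, and autoscaler behavior need a longer observation window, so they are deliberately left out of this short check.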

If you are using GitOps, this is a good place to standardize the review path. Teams managing manifests with Helm, Kustomize, or Jsonnet often benefit from storing resource assumptions alongside service configuration so changes remain visible in pull requests. If deployment automation is part of your workflow, similar discipline also improves CI reliability, as discussed in Best CI/CD Tools for Small Teams and Growing Engineering Orgs.

Inputs and assumptions

Resource tuning works best when your assumptions are explicit. Otherwise a recommendation that made sense for one environment gets copied into another where it quietly wastes money.

Before changing values, define these inputs:

Workload type

Stateless API: often CPU sensitive, may tolerate CPU burst, usually needs stable memory.
Background worker: throughput matters more than latency; can often run with lower requests if queue depth is acceptable.
Batch job: may justify temporary higher usage instead of permanently high reservations.
Stateful service: memory behavior and eviction risk usually matter more than aggressive packing.

Traffic pattern

Steady all day
Business-hour peaks
Short burst traffic
Seasonal or release-driven spikes

A service with predictable daily peaks can often rely on autoscaling and moderate requests. A service with abrupt spikes may need more local headroom.

Autoscaling model

Your sizing decisions depend heavily on what is allowed to scale:

HPA reacts to utilization or custom metrics at the pod layer.
Cluster autoscaling adds or removes nodes based on unschedulable demand and spare capacity.
Vertical adjustment workflows, whether manual or tool-assisted, may revise requests over time.

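If requests are inflated, the cluster autoscaler may add nodes earlier than needed and keep them around, and because HPA measures resource utilization as a percentage of the request, pods can look underused even when they are busy. If requests are too low, the cluster may look efficient while the application suffers.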
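
The link between requests and HPA is easiest to see with numbers. The sketch below uses the basic scaling rule from the Kubernetes documentation, desired replicas = ceil(current replicas × current metric / target metric), with resource utilization measured as usage divided by request. It leaves out the tolerance band and stabilization behavior that a real HPA applies, and the workload figures are invented.

```python
import math


def utilization(usage_millicores: float, request_millicores: float) -> float:
    """Resource utilization as HPA sees it: usage relative to the request."""
    return usage_millicores / request_millicores


def desired_replicas(current_replicas: int, current_utilization: float,
                     target_utilization: float) -> int:
    """Basic HPA scaling rule, without tolerance or stabilization windows."""
    return math.ceil(current_replicas * current_utilization / target_utilization)


# Same pods, same real usage (340m each), same 60% target; only the request differs.
right_sized = desired_replicas(4, utilization(340, 400), 0.60)
inflated = desired_replicas(4, utilization(340, 1000), 0.60)
```

With a 400m request the autoscaler sees 85 percent utilization and scales out. With a 1000m request the same pods appear to be at 34 percent, so the autoscaler would shrink the deployment while each remaining pod still reserves a full core.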

Cost model

To estimate cost impact, you do not need exact market pricing. You only need a consistent local method:

Determine the effective cost of node CPU and memory in your environment.
Estimate total requested CPU and memory for the workload across replicas.
Compare the old and new request footprints.
Translate the difference into node pressure or reserved capacity saved.

For example, if reducing requests allows more pods to fit per node, you may delay a scale-out event or reduce baseline node count. That is often where Kubernetes cost optimization appears in practice: not in tiny YAML changes by themselves, but in aggregate cluster packing efficiency.
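
The four steps of the local method reduce to a short calculation. In the sketch below the unit prices are made-up placeholders standing in for whatever effective cost you derive for your own nodes, and the function name is illustrative.

```python
def monthly_request_cost(replicas: int, cpu_cores: float, memory_gib: float,
                         cpu_price: float, memory_price: float) -> float:
    """Cost of the reserved footprint: replicas x (CPU cost + memory cost)."""
    return replicas * (cpu_cores * cpu_price + memory_gib * memory_price)


# Hypothetical effective prices per core-month and per GiB-month.
CPU_PRICE, MEMORY_PRICE = 20.0, 3.0

old = monthly_request_cost(10, cpu_cores=1.0, memory_gib=2.0,
                           cpu_price=CPU_PRICE, memory_price=MEMORY_PRICE)
new = monthly_request_cost(10, cpu_cores=0.5, memory_gib=1.5,
                           cpu_price=CPU_PRICE, memory_price=MEMORY_PRICE)
reserved_capacity_saved = old - new
```

The result is reserved capacity freed, not money saved. It only turns into a lower bill once the freed capacity lets the cluster run fewer or smaller nodes.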

Availability expectations

Not every service should be tuned aggressively. If a workload supports a critical user path, preserving response time and restart safety may be worth some overprovisioning. The right target is usually “efficient enough with low operational risk,” not “smallest possible number.”

Platform constraints

Namespace quotas, LimitRanges, admission policies, and multi-tenant fairness rules all shape what is reasonable. If you run a platform team, documenting these defaults inside an internal platform or golden path can reduce repeated mistakes. For broader context, see Platform Engineering Tools Landscape and Backstage Alternatives Compared for Platform Teams.

Worked examples

The easiest way to make resource tuning repeatable is to use simple scenarios. These are not universal values to copy. They show how to think.

Example 1: Stateless API with bursty traffic

Imagine an API deployment with several replicas. Observability shows:

Typical CPU use is moderate
Short bursts happen during traffic peaks
Memory stays relatively stable after startup

A reasonable approach:

Set CPU requests around the upper end of normal operating usage.
Allow room to burst above the request if latency matters.
Set memory requests close to stable working set plus runtime headroom.
Use a memory limit only if you understand safe ceiling behavior.

Why this saves money: if requests were originally sized for worst-case CPU spikes, the scheduler may reserve much more than the service needs most of the day. Lowering CPU requests while preserving burst capacity can improve node utilization without immediately hurting performance.
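
The packing effect can be estimated with integer arithmetic. This sketch assumes a hypothetical node with 3800m CPU and 14000 MiB memory allocatable, and it ignores DaemonSets, other tenants, and per-node pod caps, so treat the result as an upper bound.

```python
def pods_per_node(alloc_cpu_m: int, alloc_mem_mib: int,
                  req_cpu_m: int, req_mem_mib: int) -> int:
    """Pods that fit by requests alone; the scarcer resource decides."""
    return min(alloc_cpu_m // req_cpu_m, alloc_mem_mib // req_mem_mib)


def nodes_needed(replicas: int, per_node: int) -> int:
    """Ceiling division without floats."""
    return -(-replicas // per_node)


NODE_CPU_M, NODE_MEM_MIB = 3800, 14000  # hypothetical allocatable capacity

# CPU request sized for worst-case spikes versus the upper end of normal usage.
before = pods_per_node(NODE_CPU_M, NODE_MEM_MIB, req_cpu_m=1000, req_mem_mib=512)
after = pods_per_node(NODE_CPU_M, NODE_MEM_MIB, req_cpu_m=300, req_mem_mib=512)

nodes_before = nodes_needed(12, before)
nodes_after = nodes_needed(12, after)
```

In this example CPU is the binding resource both times, which is typical when CPU requests were sized for spikes while memory was sized for the working set.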

Example 2: Queue worker with flexible completion time

Now consider a worker that pulls tasks from a queue and is not user-facing. It can process more slowly during contention as long as backlog remains acceptable.

A reasonable approach:

Use lower CPU requests than a latency-sensitive API.
Scale workers based on queue depth or throughput metrics where possible.
Keep memory requests realistic if the worker accumulates in-memory task state.

Why this saves money: these workloads are often over-requested because teams fear slower processing. But if the queue is the true buffer, you can often reserve less per pod and let horizontal scaling handle demand more cheaply.
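
Scaling on queue depth can be sketched as a throughput calculation: keep up with arrivals and drain the current backlog within an acceptable time. The function and its parameters are hypothetical, and the per-worker rate is something you would measure at the CPU request you chose.

```python
import math


def workers_needed(backlog: int, arrival_rate_per_s: float, per_worker_rate_per_s: float,
                   drain_seconds: float, min_workers: int = 1, max_workers: int = 50) -> int:
    """Workers required to absorb new arrivals and clear the backlog within drain_seconds."""
    required_rate = arrival_rate_per_s + backlog / drain_seconds
    wanted = math.ceil(required_rate / per_worker_rate_per_s)
    return max(min_workers, min(max_workers, wanted))


# 6000 queued tasks, 20 new tasks/s, each worker handles 5 tasks/s, drain within 10 minutes.
replicas = workers_needed(6000, 20, 5, drain_seconds=600)
```

Because the drain time is an explicit input, relaxing it is a direct way to trade completion speed for lower reserved capacity.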

Example 3: Memory-heavy service with occasional spikes

Suppose a service keeps sizable caches in memory and occasionally grows during refresh cycles.

A reasonable approach:

Do not set memory requests from average usage.
Include cache growth and refresh patterns in your estimate.
Set memory limits with caution, because an aggressive ceiling may convert a recoverable slowdown into repeated OOM restarts.

Why this saves money: the optimization may be architectural rather than numeric. If the service is expensive because it truly needs memory, shrinking requests too far does not create efficiency. It creates instability. In this case, cost control may come from cache policy, sharding, or workload redesign instead of tighter YAML.

Example 4: CronJob or scheduled batch

Scheduled jobs are often left with copy-pasted requests from interactive services.

A reasonable approach:

Measure the job separately from the main application.
Size for completion goals rather than arbitrary parity with other workloads.
Consider running fewer, larger pods or more, smaller pods depending on node packing and parallelism.

Why this saves money: jobs that run briefly do not always need the same steady reservation strategy as always-on services. Their requests should reflect actual execution behavior.
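
One way to compare job shapes is to express the reservation as core-hours per day rather than as a per-pod request. The sketch below uses invented run counts and durations, and it assumes the reservation only exists while the job's pods are running.

```python
def reserved_core_hours_per_day(cpu_request_cores: float, runs_per_day: int,
                                minutes_per_run: float, parallel_pods: int = 1) -> float:
    """CPU reserved by a scheduled job over a day, in core-hours."""
    return cpu_request_cores * parallel_pods * runs_per_day * minutes_per_run / 60


# One 2-core pod, hourly, finishing in 5 minutes.
single_large = reserved_core_hours_per_day(2, runs_per_day=24, minutes_per_run=5)

# Four 0.5-core pods, hourly, taking 10 minutes because each pod is slower.
many_small = reserved_core_hours_per_day(0.5, runs_per_day=24, minutes_per_run=10,
                                         parallel_pods=4)

# The same 2-core request held by an always-on service.
always_on = 2 * 24
```

In this made-up case the larger pod reserves less in total because it finishes sooner, which is the kind of result you only find by measuring the job on its own.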

Across all four examples, the same principle holds: good CPU and memory requests are less about memorizing fixed ratios and more about matching the resource policy to the workload’s failure mode and cost profile.

When to recalculate

Resource tuning should be revisited whenever the inputs change, because the correct settings move as your application, traffic, autoscaling, and cloud cost structure move.

Recalculate requests and limits when:

Traffic patterns change, such as a new region, customer tier, or product launch.
The application changes runtime behavior, for example after a language upgrade, new cache layer, or dependency update.
Autoscaling configuration changes, including HPA targets, custom metrics, or cluster scaling policy.
Node shapes or pricing assumptions change, because cost efficiency depends on packing against the compute you actually buy.
Incident reviews reveal resource symptoms, such as CPU throttling, OOM kills, eviction pressure, or noisy neighbors.
Platform defaults change, including quotas, guardrails, or namespace policies.

A practical review cadence looks like this:

Monthly: inspect top workloads by requested CPU and memory versus observed usage.
Quarterly: review services that drive the largest share of cluster cost.
After major releases: compare pre-release and post-release runtime profiles.
After incidents: update requests, limits, or autoscaling assumptions as part of the postmortem action list.
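
The monthly inspection can start from a simple ranking of requested versus observed CPU. The sketch below assumes you can export per-workload replica counts, CPU requests, and a high-percentile usage figure; the Workload type and the sample data are hypothetical.

```python
from typing import NamedTuple


class Workload(NamedTuple):
    name: str
    replicas: int
    cpu_request_m: int  # per-pod CPU request in millicores
    cpu_p90_m: int      # per-pod p90 CPU usage in millicores


def idle_cpu_m(w: Workload) -> int:
    """Requested but unused CPU across all replicas, never negative."""
    return max(0, w.cpu_request_m - w.cpu_p90_m) * w.replicas


def rank_by_idle_cpu(workloads):
    """Largest reserved-but-idle CPU first."""
    return sorted(workloads, key=idle_cpu_m, reverse=True)


fleet = [
    Workload("worker", replicas=4, cpu_request_m=500, cpu_p90_m=450),
    Workload("api", replicas=10, cpu_request_m=1000, cpu_p90_m=300),
    Workload("cache", replicas=2, cpu_request_m=2000, cpu_p90_m=1500),
]
ranked = [w.name for w in rank_by_idle_cpu(fleet)]
```

Multiplying by replicas matters: a modest per-pod gap on a widely replicated service can outweigh a large gap on a two-pod deployment. The same ranking works for memory with the working set in place of CPU usage.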

To make this sustainable, build a lightweight checklist for every service owner:

What is the normal usage range?
What is the expected burst pattern?
What happens if CPU is constrained?
What happens if memory is constrained?
Is HPA using the right signal?
Did the change reduce waste or just shift risk?

If you run Kubernetes through GitOps, keep resource changes versioned and reviewed. If your teams are evaluating delivery workflows, Argo CD vs Flux is worth comparing for consistency and policy enforcement. If ingress behavior and service shape are changing, architectural shifts like those covered in Kubernetes Ingress vs Gateway API can also affect traffic patterns and therefore resource assumptions.

The most useful final rule is simple: do not aim for perfect values. Aim for a repeatable tuning loop. Measure, set a conservative starting point, validate under realistic load, and revisit the numbers when workload behavior or cost inputs change. That process is what helps teams right-size Kubernetes resources without turning every deployment into an expensive buffer against uncertainty.

Behind Cloud Editorial
