Kubernetes Requests and Limits Best Practices

A practical guide to Kubernetes CPU and memory requests and limits by workload type, with tuning patterns and a repeatable review cycle.

Getting Kubernetes resource requests and limits roughly right is one of the highest-leverage ways to improve cluster stability, application performance, and cloud efficiency. This guide gives platform teams and service owners a practical, update-friendly framework for sizing CPU and memory by workload type, avoiding the most common tuning mistakes, and building a review cycle that keeps resource settings aligned with real usage as services evolve.

Overview

Kubernetes resource settings are easy to treat as boilerplate, but they shape scheduling, runtime behavior, and cost. Requests determine which nodes a Pod can be scheduled on and how much capacity is reserved for it; when containers compete for CPU, CPU requests also set each container's relative share. Limits cap how much a container can consume, even when the node has spare capacity. CPU and memory behave differently, so a good policy starts by understanding those differences instead of applying one generic rule to every deployment.

A useful mental model is simple:

CPU requests protect baseline performance and help the scheduler place Pods.
CPU limits cap burst usage and can introduce throttling if set too low.
Memory requests reserve capacity for scheduling and reduce eviction risk.
Memory limits are hard boundaries; if a process exceeds them, it may be terminated.

That difference matters. CPU contention usually shows up as slower work. Memory pressure often shows up as restarts, OOM kills, or noisy neighbor problems. Because of that, many teams are more cautious with memory limits than with CPU limits, especially for workloads with spiky heap growth or unpredictable caches.
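To see why a CPU limit that is too low shows up as slower work rather than a crash, the following is a simplified model of Linux CFS bandwidth control, which is how CPU limits are typically enforced: the limit becomes a quota of CPU time per scheduling period (100 ms by default). It assumes the burst starts at a period boundary, the work splits evenly across threads, and nothing else competes for the CPU; the function name is illustrative.

```python
import math

def wall_time_ms(cpu_work_ms: float, cpu_limit_cores: float,
                 period_ms: float = 100.0, threads: int = 1) -> float:
    """Approximate wall-clock time to finish a CPU burst under a CFS quota.

    Simplified model: the burst starts at a period boundary, `threads` threads
    share the work evenly, and once the container has used its quota in a
    period, every thread is throttled until the next period starts.
    """
    quota_ms = cpu_limit_cores * period_ms  # CPU time allowed per period
    # CPU time the container can actually consume per period: bounded by the
    # quota and by how many threads it has available to run in parallel.
    capacity_ms = min(quota_ms, threads * period_ms)
    # Periods fully consumed before the final one (throttled when the quota,
    # rather than the thread count, is the binding bound).
    full_periods = math.ceil(cpu_work_ms / capacity_ms) - 1
    remaining_ms = cpu_work_ms - full_periods * capacity_ms
    # In the final period the remaining work runs across all threads.
    return full_periods * period_ms + remaining_ms / threads

# 80 ms of single-threaded work: a 500m limit stretches it to 130 ms,
# while a 2-core limit leaves it at 80 ms.
print(wall_time_ms(80, cpu_limit_cores=0.5))              # 130.0
print(wall_time_ms(80, cpu_limit_cores=2.0))              # 80.0
# 4 threads doing 200 ms of CPU work under a 1-core limit finish in 125 ms,
# versus 50 ms if the threads could run unthrottled on 4 cores.
print(wall_time_ms(200, cpu_limit_cores=1.0, threads=4))  # 125.0
```

The multi-threaded case is the one that tends to surprise teams: a short parallel burst can exhaust the whole period's quota in a fraction of the period and then sit idle.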

For most teams, the goal is not perfect precision. It is to reach settings that are:

safe enough to protect reliability,
efficient enough to avoid large waste, and
simple enough that engineers can maintain them without guesswork.

The best starting point is to tune by workload pattern rather than by language or team ownership. Common workload types include:

Stateless web APIs: Usually moderate baseline usage with occasional CPU bursts and relatively steady memory.
Background workers: Often CPU-heavy, queue-driven, and bursty, with resource use tied to concurrency.
Batch jobs and CronJobs: Short-lived and variable, often needing higher peaks than long-running services.
Streaming or event consumers: Sensitive to lag and backpressure; sizing affects throughput directly.
Memory-heavy apps: JVM services, analytics components, or cache-like processes that need extra care around headroom.
System daemons and sidecars: Easy to underestimate, especially log shippers, service mesh proxies, and metrics exporters.

As a practical baseline, standardize guidance around patterns such as:

Set requests from observed steady-state usage, not from optimistic assumptions.
Set limits only when they solve a real multi-tenant or runaway-risk problem.
Keep memory requests close enough to the realistic working set to avoid frequent eviction.
Review sidecar overhead separately; do not hide it inside the main application estimate.
Use the same sizing language in templates, Helm values, or platform golden paths so teams are not inventing policy from scratch.
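A minimal sketch of the first and fourth patterns, assuming you have already exported per-container usage samples (CPU in millicores, memory working set in MiB) from your monitoring system. The nearest-rank percentile method, the p50/p95 choices, and the container names are illustrative assumptions rather than recommended values; the sidecar is sized as its own entry instead of being folded into the application estimate.

```python
import math
from typing import Dict, List

def percentile(samples: List[float], pct: float) -> float:
    """Nearest-rank percentile, with pct in (0, 100]."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

def suggest_requests(usage: Dict[str, Dict[str, List[float]]],
                     cpu_pct: float = 50, mem_pct: float = 95) -> Dict[str, Dict[str, int]]:
    """Suggest per-container requests from observed usage samples.

    CPU request follows typical (median) usage; memory request follows a high
    percentile of working set, because under node memory pressure the kubelet
    ranks Pods whose usage exceeds their requests first for eviction.
    The percentiles are illustrative defaults.
    """
    return {
        name: {
            "cpu_m": math.ceil(percentile(series["cpu_m"], cpu_pct)),
            "memory_mi": math.ceil(percentile(series["memory_mi"], mem_pct)),
        }
        for name, series in usage.items()
    }

def pod_totals(requests: Dict[str, Dict[str, int]]) -> Dict[str, int]:
    """Sum container requests: roughly what the scheduler must fit on a node
    (ignoring init containers and Pod overhead)."""
    return {
        "cpu_m": sum(r["cpu_m"] for r in requests.values()),
        "memory_mi": sum(r["memory_mi"] for r in requests.values()),
    }

# Hypothetical samples; real data would have far more points.
usage = {
    "app":        {"cpu_m": [120, 150, 180, 200, 450], "memory_mi": [300, 310, 320, 330, 400]},
    "mesh-proxy": {"cpu_m": [20, 25, 30, 30, 60],      "memory_mi": [60, 62, 64, 64, 70]},
}
requests = suggest_requests(usage)
print(requests)              # {'app': {'cpu_m': 180, 'memory_mi': 400}, 'mesh-proxy': {'cpu_m': 30, 'memory_mi': 70}}
print(pod_totals(requests))  # {'cpu_m': 210, 'memory_mi': 470}
```

Keeping the proxy as a separate line item makes its share of the Pod total visible in reviews, which is easy to lose when one number covers the whole Pod.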

If your organization is building shared platform standards, this is a strong place to define opinionated defaults. Teams adopt resource guidance faster when it is embedded into reusable deployment patterns, similar to the adoption approach discussed in Golden Paths for Developers: Examples, Tradeoffs, and Adoption Metrics.

Best-practice defaults by workload type

These are not universal numbers. They are decision patterns you can adapt:

Stateless HTTP services: Start with a memory request near normal peak working set and a CPU request that covers average traffic. Be careful with low CPU limits if latency matters.
Queue workers: Tie requests to concurrency. If worker count or thread count scales up, revisit resources at the same time.
CronJobs and one-off Jobs: Use separate sizing from the long-running service. Batch work often needs a larger memory ceiling and should not inherit conservative API settings.
JVM or GC-heavy apps: Give memory headroom beyond nominal heap size. Account for metaspace, threads, off-heap buffers, and native libraries.
Data processing services: Prefer testing with realistic payload sizes. These workloads often look fine in staging and fail only with production-shaped inputs.
Sidecars: Assign explicit requests and limits. Service mesh, log forwarders, and security agents can become the hidden reason a Pod is undersized.
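For the JVM case, a rough container memory budget is heap plus the non-heap components listed above. The sketch below sums those components and adds a safety margin; every input is an estimate you would measure for your own service (JVM Native Memory Tracking is one way to see the non-heap breakdown), and the 10% margin, field names, and example figures are illustrative assumptions.

```python
import math
from dataclasses import dataclass

@dataclass
class JvmFootprint:
    heap_mi: float              # maximum heap, e.g. from -Xmx
    metaspace_mi: float         # class metadata
    thread_count: int           # application, GC, and JIT threads
    stack_mi_per_thread: float  # roughly -Xss; 1 MiB is a common 64-bit Linux default
    direct_buffers_mi: float    # off-heap NIO / Netty buffers
    native_mi: float            # code cache, GC data structures, native libraries

def container_memory_mi(fp: JvmFootprint, margin: float = 0.10) -> int:
    """Estimated container memory (request and/or limit) in MiB."""
    total = (fp.heap_mi + fp.metaspace_mi
             + fp.thread_count * fp.stack_mi_per_thread
             + fp.direct_buffers_mi + fp.native_mi)
    return math.ceil(total * (1 + margin))

# Hypothetical service with a 1 GiB heap.
fp = JvmFootprint(heap_mi=1024, metaspace_mi=150, thread_count=200,
                  stack_mi_per_thread=1.0, direct_buffers_mi=128, native_mi=150)
print(container_memory_mi(fp))  # 1818
```

Even with modest assumptions, a 1 GiB heap ends up needing well over 1.5 GiB of container memory, which is why sizing the container to the nominal heap tends to produce OOM kills.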

Teams comparing deployment packaging choices should also align resource policy with their deployment method. If you are deciding where defaults belong, see Helm vs Kustomize vs Terraform for Kubernetes Deployments.

Maintenance cycle

Resource tuning works best as a repeatable operating practice, not a one-time ticket. A lightweight maintenance cycle helps prevent drift as code paths, traffic patterns, and platform behavior change.

A practical cycle has four steps:

Collect recent usage: Review CPU usage, memory working set, restart patterns, throttling, and eviction history over a window long enough to capture normal peaks, such as a full weekly traffic cycle.
Compare against deployed settings: Look for large gaps between requested resources and actual baseline usage, or recurring pressure near limits.
Adjust by workload behavior: Tune differently for latency-sensitive APIs, bursty workers, and batch jobs rather than applying one formula.
Roll changes carefully: Validate impact after deployment and watch for throughput, latency, and stability regressions.
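Step 3 is easier to apply consistently when the per-type rules are written down as data rather than remembered. The following sketch encodes hypothetical policies for three workload types and builds a container resources block (the same requests/limits shape a Pod spec uses) from observed usage; the type names, percentiles, headroom factors, memory limit equal to request, and the choice to omit CPU limits for the API and batch policies are placeholders to adapt, not recommendations.

```python
import math
from dataclasses import dataclass
from typing import Dict, Optional

@dataclass(frozen=True)
class SizingPolicy:
    cpu_percentile: str                # observed CPU percentile that sets the request
    memory_headroom: float             # multiplier on observed peak working set
    cpu_limit_factor: Optional[float]  # CPU limit = request * factor; None = no CPU limit

# Hypothetical policies; the numbers are placeholders to tune per organization.
POLICIES: Dict[str, SizingPolicy] = {
    "http-api":     SizingPolicy("p50", 1.25, None),  # avoid throttling latency-sensitive paths
    "queue-worker": SizingPolicy("p90", 1.25, 2.0),   # cap runaway CPU use on shared nodes
    "batch-job":    SizingPolicy("p90", 1.50, None),  # larger memory headroom than APIs
}

def size_resources(workload_type: str, cpu_m: Dict[str, float],
                   mem_peak_mi: float) -> dict:
    """Build a container resources block from observed usage and a policy."""
    policy = POLICIES[workload_type]
    cpu_request = math.ceil(cpu_m[policy.cpu_percentile])
    memory = math.ceil(mem_peak_mi * policy.memory_headroom)
    resources = {
        "requests": {"cpu": f"{cpu_request}m", "memory": f"{memory}Mi"},
        # In this sketch memory limit == request, so headroom is reserved up front.
        "limits": {"memory": f"{memory}Mi"},
    }
    if policy.cpu_limit_factor is not None:
        resources["limits"]["cpu"] = f"{math.ceil(cpu_request * policy.cpu_limit_factor)}m"
    return resources

print(size_resources("http-api", {"p50": 180, "p90": 320}, mem_peak_mi=400))
# {'requests': {'cpu': '180m', 'memory': '500Mi'}, 'limits': {'memory': '500Mi'}}
print(size_resources("queue-worker", {"p50": 300, "p90": 750}, mem_peak_mi=600))
# {'requests': {'cpu': '750m', 'memory': '750Mi'}, 'limits': {'memory': '750Mi', 'cpu': '1500m'}}
```

Keeping the policy table separate from the sizing function means a platform team can revise defaults in one place while service owners only supply usage data.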

For most teams, a quarterly review is a good default for stable services, with faster review for rapidly changing workloads. New services should usually be checked sooner, often after the first few production releases, because early estimates are rarely accurate.

What should the review include?

CPU usage distribution, not just averages
Memory usage distribution and spike patterns
Container restarts and OOM events
CPU throttling if limits are in use
Node pressure and Pod eviction events
HPA behavior, if autoscaling is based on CPU or memory
Changes in dependencies, sidecars, runtime version, or concurrency settings

This is also where observability matters. Dashboards that combine requests, limits, actual usage, and incidents make tuning far easier than raw point-in-time inspection. If you are refining your stack, Prometheus vs Grafana Cloud vs Datadog: Monitoring Stack Comparison can help frame where these signals live.

A simple review checklist

Is the memory request below normal working set?
Is the CPU request so low that the Pod performs poorly during routine traffic?
Are CPU limits causing throttling during healthy load?
Are memory limits so tight that normal variation leads to OOM kills?
Did a new sidecar, SDK, or runtime version change the footprint?
Are requests far above real usage, reducing node efficiency?
Do HPA targets still make sense after resource changes?
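Several of these questions can be checked automatically from data you already collect. A sketch, assuming a per-container summary record assembled from your monitoring system (for example, the throttled ratio could come from cAdvisor's throttled-periods and total-periods counters); the record fields and thresholds such as "more than 2x p95 usage" or "10% of periods throttled" are hypothetical and should be replaced with your own.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class ContainerSummary:
    name: str
    cpu_request_m: float
    cpu_p95_m: float
    mem_request_mi: float
    mem_working_set_p95_mi: float
    cpu_limit_m: Optional[float]  # None when no CPU limit is set
    throttled_ratio: float        # throttled / total CFS periods in the review window
    oom_kills: int                # OOM kills in the review window

def review(c: ContainerSummary) -> List[str]:
    """Return checklist findings for one container (hypothetical thresholds)."""
    findings = []
    if c.mem_request_mi < c.mem_working_set_p95_mi:
        findings.append("memory request is below normal working set")
    if c.oom_kills > 0:
        findings.append(f"{c.oom_kills} OOM kill(s) in the review window")
    if c.cpu_limit_m is not None and c.throttled_ratio > 0.10:
        findings.append("CPU limit throttles more than 10% of periods")
    if c.cpu_request_m > 2 * c.cpu_p95_m:
        findings.append("CPU request is more than 2x p95 usage")
    if c.mem_request_mi > 2 * c.mem_working_set_p95_mi:
        findings.append("memory request is more than 2x p95 working set")
    return findings

app = ContainerSummary("app", cpu_request_m=1000, cpu_p95_m=300,
                       mem_request_mi=256, mem_working_set_p95_mi=400,
                       cpu_limit_m=None, throttled_ratio=0.0, oom_kills=0)
print(review(app))
# ['memory request is below normal working set', 'CPU request is more than 2x p95 usage']
```

Questions that need context, such as whether a new sidecar or runtime version changed the footprint, still belong in the human part of the review.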

If you want resource tuning to be maintainable at scale, define ownership clearly. Service owners should understand application behavior. Platform teams should supply dashboards, policy templates, admission controls if needed, and a standard review cadence. That division keeps standards consistent without making the platform team a bottleneck for every tuning change.

Signals that require updates

Not every service needs constant attention, but some changes should trigger an immediate review of requests and limits. The key is to define signals before they become incidents.

Common update triggers include:

Repeated OOM kills: Usually a sign that memory limits are too low, memory requests are unrealistic, or the app has a leak or load-shaped spike.
CPU throttling: If latency, throughput, or job completion time degrades while CPU is capped, limits may be too restrictive.
Poor bin-packing efficiency: Large requests with consistently low usage can waste cluster capacity and inflate cost.
Frequent evictions: Requests may be too low relative to real memory needs, especially on busy nodes.
Autoscaling instability: HPA resource utilization is calculated as a percentage of requests, so incorrect requests distort scaling decisions and can trigger unnecessary scale-outs or slow response to load.
Major code or runtime changes: New frameworks, language versions, caches, sidecars, agents, and feature flags can materially change footprint.
Traffic profile changes: Seasonal load, onboarding new tenants, larger payloads, or expanded background processing should prompt revalidation.
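To make the autoscaling trigger concrete, here is the Horizontal Pod Autoscaler's documented scaling rule, desiredReplicas = ceil(currentReplicas * currentMetricValue / desiredMetricValue), applied to a CPU utilization target, where utilization is measured against each Pod's CPU request. The sketch shows the same real load producing different replica counts depending only on the request. It is a simplification: it ignores the HPA's tolerance band (0.1 by default), stabilization windows, min/max replica bounds, and Pods that are not yet ready; the numbers are hypothetical.

```python
import math

def desired_replicas(current_replicas: int, usage_m_per_pod: float,
                     request_m: float, target_utilization_pct: float) -> int:
    """HPA scaling rule for a CPU utilization target (tolerance and
    stabilization windows ignored)."""
    # Utilization is measured against the request, not the node or the limit.
    utilization_pct = 100.0 * usage_m_per_pod / request_m
    return math.ceil(current_replicas * utilization_pct / target_utilization_pct)

# Same real load (4 Pods using 300m each, 70% target), different requests:
print(desired_replicas(4, 300, request_m=500, target_utilization_pct=70))  # 4
print(desired_replicas(4, 300, request_m=200, target_utilization_pct=70))  # 9
```

With the 200m request, the Pods report 150% utilization and the HPA more than doubles the replica count for load that four Pods were already handling.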

The questions teams ask about this topic also shift over time. For example, some teams increasingly ask whether CPU limits should be omitted for certain workloads, or how requests interact with the Horizontal Pod Autoscaler and the Cluster Autoscaler. Those questions are useful prompts to revisit platform guidance rather than freeze old defaults.

Another strong update trigger is a change in platform policy: new node sizes, revised namespace quotas, changes in admission policies, or a Kubernetes version upgrade. Keep an eye on your broader platform lifecycle, and pair resource guidance reviews with release planning. A cluster-wide refresh often fits naturally alongside a version check using a resource such as the Kubernetes Release Calendar and End-of-Life Tracker.

What healthy signals look like

Healthy does not mean perfectly flat usage. It means the service has enough headroom for normal variation, enough consistency for autoscaling to behave sensibly, and no recurring reliability symptoms linked to resource pressure. In practice, that often looks like:

few or no OOM kills under expected load,
limited throttling that does not correlate with user-visible problems,
requests close enough to baseline to support efficient scheduling, and
resource changes that are deliberate, documented, and easy to trace.

Common issues

Most problems with Kubernetes requests and limits come from a small set of repeated mistakes. Fixing these patterns usually produces better results than chasing ultra-fine optimization.

1. Copy-pasting the same values to every service

Uniform defaults are useful as a starting point, but not as a final answer. An API server, a queue worker, and a CronJob rarely have the same runtime profile. Standardize the process, not identical numbers.

2. Setting CPU requests too low

Very low CPU requests can make a service look efficient on paper while harming latency and throughput in practice. This is especially risky for workloads with short bursts of work, request-driven traffic, or startup-sensitive behavior.

3. Setting memory limits too close to observed average

Average memory usage is often misleading. Many services need headroom for spikes, garbage collection cycles, caching effects, or workload-specific bursts. Tight memory limits can turn normal variance into repeated restarts.
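A small illustration of why averages mislead, using made-up working-set samples for a service with periodic GC or cache spikes: a limit set just above the mean is exceeded during the spikes, while one derived from the observed peak plus headroom is not. The samples, the 5% and 20% factors, and the helper name are assumptions for the example only.

```python
import math
from statistics import mean
from typing import List

def breaches(samples_mi: List[float], limit_mi: float) -> int:
    """Count samples above a memory limit; each is a point where the container
    would likely have been OOM-killed."""
    return sum(1 for s in samples_mi if s > limit_mi)

# Hypothetical working-set samples (MiB): steady around 500 with periodic spikes.
samples = [480, 500, 510, 495, 720, 505, 490, 760, 500, 515]

avg_based = math.ceil(mean(samples) * 1.05)   # "just above the average"
peak_based = math.ceil(max(samples) * 1.20)   # observed peak plus 20% headroom

print(avg_based, breaches(samples, avg_based))    # 575 2
print(peak_based, breaches(samples, peak_based))  # 912 0
```

In practice you would derive the peak from a long window of production data rather than the same short sample you evaluate against.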

4. Ignoring sidecars and agents

Teams often size the app container and forget the Pod as a whole. Service mesh proxies, security agents, and log shippers can materially increase CPU and memory demand. If a Pod is unstable, inspect every container, not just the primary process.

5. Using staging traffic as the only benchmark

Staging environments often underrepresent payload size, concurrency, cardinality, and dependency latency. Production-shaped load is a much better basis for resource tuning than synthetic happy-path tests.

6. Tuning in isolation from reliability targets

Resource settings should support service objectives. If a system has strict availability or latency goals, be conservative enough to protect those goals. If you manage services by error budgets, tie resource reviews to them; SLO Error Budget Policy Examples for SaaS Engineering Teams offers a useful framing for that operating model.

7. Treating incidents as one-off anomalies

An OOM kill during a high-traffic event may be an application issue, a configuration issue, or both. Either way, it should feed back into standard resource guidance. Incident reviews are a valuable source of tuning updates, especially when resource pressure contributes to severity. For teams refining incident process, Incident Severity Levels: How to Define Sev 1, Sev 2, Sev 3, and Sev 4 can help standardize response language.

8. Not documenting the reason for a setting

A request of 500m CPU and a 1Gi memory limit may have been a sensible choice when added, but six months later nobody remembers why. Brief annotations in values files, pull requests, or runbooks make future reviews much easier.
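One lightweight way to keep the rationale next to the numbers is to store a short reason with each setting and flag any setting that lacks one during review. The sketch below uses a hypothetical in-house record format (not a Kubernetes or Helm schema) and includes a simplified parser for Kubernetes quantity strings such as 500m and 1Gi; it handles only the suffixes listed in the code, not exponent notation or every edge case of the real quantity grammar.

```python
from decimal import Decimal
from typing import Dict, List

# Subset of Kubernetes quantity suffixes: milli, decimal SI, and binary (IEC).
SUFFIXES = {
    "m": Decimal("0.001"),
    "k": Decimal(10) ** 3, "M": Decimal(10) ** 6, "G": Decimal(10) ** 9,
    "Ki": Decimal(2) ** 10, "Mi": Decimal(2) ** 20, "Gi": Decimal(2) ** 30,
}

def parse_quantity(q: str) -> Decimal:
    """Simplified parser: '500m' -> 0.5 (cores), '1Gi' -> 1073741824 (bytes)."""
    for suffix, factor in SUFFIXES.items():
        if q.endswith(suffix):
            return Decimal(q[: -len(suffix)]) * factor
    return Decimal(q)  # plain number, e.g. '2' cores

def missing_reasons(settings: List[Dict[str, str]]) -> List[str]:
    """Return keys of settings recorded without a reason."""
    return [s["key"] for s in settings if not s.get("reason", "").strip()]

# Hypothetical records kept alongside Helm values or in a PR template.
settings = [
    {"key": "api.resources.requests.cpu", "value": "500m",
     "reason": "covers median CPU at normal weekday traffic"},
    {"key": "api.resources.limits.memory", "value": "1Gi", "reason": ""},
]
for s in settings:
    print(s["key"], parse_quantity(s["value"]))
print(missing_reasons(settings))  # ['api.resources.limits.memory']
```

Parsing quantities into plain numbers also makes it possible to compare settings against usage data without hand-converting units.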

The operational lesson is straightforward: resource policy should be observable, reviewable, and explainable. If the only rationale is “it worked before,” the setting is overdue for a check.

When to revisit

The easiest way to keep this topic current is to define clear revisit points instead of relying on memory. If you own a platform, publish this as part of your workload standards and review it on a recurring schedule. If you own a service, attach resource checks to normal operational rhythms.

Revisit requests and limits when any of the following happens:

on a quarterly review cycle for stable workloads,
after major releases that change traffic patterns or processing logic,
after adding sidecars, language runtime upgrades, or instrumentation agents,
after repeated OOM kills, throttling alerts, or evictions,
after changing HPA targets or autoscaling strategy,
after node pool or cluster version changes, and
after incidents where resource pressure affected impact or recovery.
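The revisit points above can be encoded as a simple "is a review due?" check per workload. This is a sketch with hypothetical event names, a hypothetical 90-day interval standing in for "quarterly", and a hypothetical 30-day interval for new services, in line with the earlier advice to check new services sooner.

```python
from datetime import date, timedelta
from typing import Iterable, List

# Hypothetical event names mirroring the revisit triggers above.
TRIGGER_EVENTS = {
    "major_release", "sidecar_added", "runtime_upgrade", "agent_added",
    "repeated_oom", "throttling_alert", "evictions", "hpa_change",
    "node_pool_change", "cluster_upgrade", "resource_incident",
}

def review_reasons(last_review: date, today: date, events: Iterable[str],
                   new_service: bool = False) -> List[str]:
    """Return why a resource review is due; an empty list means none is due."""
    interval = timedelta(days=30 if new_service else 90)
    reasons = []
    if today - last_review >= interval:
        reasons.append(f"scheduled review overdue ({(today - last_review).days} days)")
    # Only recognized trigger events count; routine events are ignored.
    reasons.extend(sorted(set(events) & TRIGGER_EVENTS))
    return reasons

print(review_reasons(date(2024, 1, 10), date(2024, 4, 15),
                     ["cluster_upgrade", "config_tweak"]))
# ['scheduled review overdue (96 days)', 'cluster_upgrade']
```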

A practical action plan for teams looks like this:

Inventory workloads by type: Separate APIs, workers, jobs, data processors, and sidecar-heavy Pods.
Define baseline templates: Give each workload type a documented starting pattern for requests, limits, and review expectations.
Instrument the right signals: Track usage, throttling, OOM kills, evictions, and autoscaling behavior in one place.
Review on schedule: Make resource tuning part of service ownership, not emergency cleanup.
Update platform guidance: Feed lessons from incidents and production behavior back into shared defaults.

This topic is worth revisiting because resource behavior changes as applications and clusters change. The goal is not to find one permanent setting. The goal is to keep settings close enough to reality that developers can ship safely, operators can manage capacity predictably, and the cluster remains efficient under normal growth.

If you are building broader standards for workload delivery, pair this guidance with your deployment packaging, image tagging, and platform patterns. Related reading includes Docker Image Tagging Strategy: Latest vs Immutable Tags vs Semver and Platform Engineering Toolchain Checklist for Internal Developer Platforms. Together, these practices make resource tuning part of a repeatable cloud-native operating model rather than a series of one-off fixes.

Midways Editorial

Up Next

Kubernetes Cost Optimization Checklist for Small and Mid-Size Clusters

On-Call Handoff Checklist for Distributed Engineering Teams

Runbook Automation Tools Compared for SRE and DevOps Teams