EC2 Spot Instances in Production: Stop Fearing Interruptions

You’ve seen the numbers. Spot instances are 70–90% cheaper than On-Demand. Your compute bill is the single biggest line item in your AWS account. The math is not complicated. But every time the topic comes up in your team, someone raises the same objection: “We can’t use Spot in production. What happens when they get interrupted?”

This guide is the answer to that question: a practical five-step pattern used by engineering teams running Spot in production at scale, with a real-time control loop that makes interruptions operationally indistinguishable from a routine health-check replacement.

Section 1

The Myths Keeping Your Team on On-Demand

✗ "Spot is only for batch workloads."

✓ Spot runs production web services, ML inference, real-time APIs, and Kubernetes worker nodes. The constraint is stateful single-instance workloads, not latency-sensitive ones.

✗ "If a Spot instance gets interrupted, my service goes down."

✓ Only if your fleet has no redundancy and no graceful shutdown handler. With a minimum On-Demand floor and a lifecycle hook, a Spot interruption is operationally identical to an EC2 health-check replacement.

✗ "The savings aren't worth the operational overhead."

✓ At 70% savings, a $30k/month compute bill drops by $21k/month, or $252k/year. The overhead is a one-time setup of 2–4 hours.

Section 2: The Production Spot Playbook

5 steps to Spot-first production fleets

Understand what a Spot interruption actually looks like

AWS sends a 2-minute interruption notice before reclaiming a Spot instance. The notice is delivered two ways: as an EventBridge event (formerly CloudWatch Events) and through the instance metadata endpoint. That's 120 seconds to drain connections, finish in-flight requests, and hand off state. The fear is usually of an instant kill. The reality is a predictable warning with a standard API. Most application frameworks can handle a SIGTERM and finish in-flight work well within that window.

Where to configure

EC2 → Spot Requests → Interruption notices + http://169.254.169.254/latest/meta-data/spot/termination-time
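A minimal sketch of what polling for the notice could look like, using only the Python standard library. The metadata path is the one listed above, which answers 404 until an interruption is scheduled and then returns the termination timestamp. The IMDSv2 token request, the 5-second poll interval, and the names `fetch_termination_time` and `wait_for_interruption` are assumptions made for illustration, not part of any AWS SDK.

```python
import time
import urllib.error
import urllib.request
from typing import Callable, Optional

IMDS = "http://169.254.169.254/latest"
TERMINATION_URL = IMDS + "/meta-data/spot/termination-time"


def fetch_termination_time(timeout: float = 1.0) -> Optional[str]:
    """Return the scheduled termination timestamp, or None if no notice is pending."""
    # Assumes IMDSv2 is enabled: request a short-lived session token first.
    token_request = urllib.request.Request(
        IMDS + "/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "60"},
    )
    try:
        with urllib.request.urlopen(token_request, timeout=timeout) as response:
            token = response.read().decode()
        notice_request = urllib.request.Request(
            TERMINATION_URL, headers={"X-aws-ec2-metadata-token": token}
        )
        with urllib.request.urlopen(notice_request, timeout=timeout) as response:
            return response.read().decode().strip()
    except urllib.error.HTTPError as error:
        if error.code == 404:  # 404 means no interruption is scheduled
            return None
        raise


def wait_for_interruption(
    on_notice: Callable[[str], None],
    fetch: Callable[[], Optional[str]] = fetch_termination_time,
    poll_seconds: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until a notice appears, call on_notice with the timestamp, and return it."""
    while True:
        termination_time = fetch()
        if termination_time is not None:
            on_notice(termination_time)
            return termination_time
        sleep(poll_seconds)
```

On an instance, `wait_for_interruption(start_drain)` would block until the notice arrives, where `start_drain` is a hypothetical callback that begins your shutdown sequence. `fetch` and `sleep` are injectable so the loop can be exercised off EC2.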

⚠ Red flag: Workloads with no interruption handler and no graceful shutdown logic. These are the ones that actually break on eviction.

◆ XamOps logs every Spot interruption event with timestamp, instance ID, and ASG in Actions History, providing a full eviction audit trail across all accounts.

Set a minimum On-Demand floor and never go below it

The safest Spot pattern is not 100% Spot. It's a Spot-first fleet with a guaranteed On-Demand minimum. Set the smallest number of On-Demand instances your service needs to stay functional during a full Spot interruption wave (1–3 for non-critical services, more for tier-1). Everything above that floor is Spot. This is the single most important configuration decision in your Spot setup.

Where to configure

EC2 → Auto Scaling Groups → Edit → On-Demand Base Capacity
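As a sketch of how the floor maps onto an Auto Scaling group's mixed instances policy: the keys below (`OnDemandBaseCapacity`, `OnDemandPercentageAboveBaseCapacity`, `SpotAllocationStrategy`) are fields of the `InstancesDistribution` structure in the EC2 Auto Scaling API. The helper names and the choice of allocation strategy are assumptions for illustration. The split helper covers only the case described here, where everything above the floor is Spot.

```python
from typing import Dict, Tuple


def instances_distribution(on_demand_floor: int) -> Dict[str, object]:
    """Build an InstancesDistribution dict: fixed On-Demand floor, Spot above it."""
    if on_demand_floor < 1:
        # Mirrors the red flag: a customer-facing ASG should not have a floor of 0.
        raise ValueError("on_demand_floor must be at least 1")
    return {
        "OnDemandBaseCapacity": on_demand_floor,
        "OnDemandPercentageAboveBaseCapacity": 0,  # everything above the floor is Spot
        "SpotAllocationStrategy": "price-capacity-optimized",  # assumed choice
    }


def split_capacity(desired: int, on_demand_floor: int) -> Tuple[int, int]:
    """Return (on_demand, spot) counts, assuming 0% On-Demand above the base."""
    on_demand = min(desired, on_demand_floor)
    return on_demand, desired - on_demand
```

For example, `split_capacity(10, 2)` gives 2 On-Demand and 8 Spot instances. The dict would go under `MixedInstancesPolicy["InstancesDistribution"]` when creating or updating the group, for instance through boto3's `update_auto_scaling_group`.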

⚠ Red flag: On-Demand base capacity set to 0 on any customer-facing ASG. This is a ticking clock, not a cost optimization.

◆ XamOps AutoSpotting enforces the minimum On-Demand floor as a hard constraint. The staged replacement loop never drops below your configured minimum, even during simultaneous interruptions across AZs.

Use multiple instance types: capacity diversity as your interruption hedge

The biggest mistake with Spot is requesting a single instance type in a single AZ. When that pool runs out, your entire fleet gets interrupted simultaneously. Request 4–6 instance types of similar size (e.g., m5.xlarge, m5a.xlarge, m4.xlarge) across all available AZs. AWS reclaims Spot capacity pool by pool, where a pool is one instance type in one AZ, so it interrupts instances only when it needs that specific capacity back. If your fleet is spread across many pools, a simultaneous fleet wipeout becomes nearly impossible.

Where to configure

EC2 → Launch Templates → Instance type requirements (flexible mode) → Multiple types + all AZs
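To make the diversification check concrete, here is a small sketch that builds the launch template overrides list and flags the concentration pattern this step warns about. The `Overrides` / `InstanceType` shape follows the mixed instances policy of the EC2 Auto Scaling API. The function names and the default thresholds (4 types and 3 AZs, taken from the guidance in this step) are illustrative assumptions.

```python
from typing import Dict, List, Sequence


def build_overrides(instance_types: Sequence[str]) -> List[Dict[str, str]]:
    """One override entry per instance type, de-duplicated, order preserved."""
    unique = list(dict.fromkeys(instance_types))
    return [{"InstanceType": name} for name in unique]


def pool_count(instance_types: Sequence[str], availability_zones: Sequence[str]) -> int:
    """Number of distinct Spot pools (instance type x AZ) the fleet can draw from."""
    return len(set(instance_types)) * len(set(availability_zones))


def diversification_warnings(
    instance_types: Sequence[str],
    availability_zones: Sequence[str],
    min_types: int = 4,
    min_azs: int = 3,
) -> List[str]:
    """Return human-readable warnings for an over-concentrated fleet."""
    warnings = []
    type_count = len(set(instance_types))
    az_count = len(set(availability_zones))
    if type_count < min_types:
        warnings.append(f"only {type_count} instance type(s); aim for at least {min_types}")
    if az_count < min_azs:
        warnings.append(f"only {az_count} AZ(s); aim for at least {min_azs}")
    return warnings
```

A fleet with one instance type in one AZ has a single pool and produces two warnings. Four types across three AZs gives twelve pools and none.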

⚠ Red flag: An ASG with a single instance type and fewer than 3 AZs. This is a correlated failure waiting to happen.

◆ XamOps shows Spot fleet instance type and AZ distribution on the AutoSpotting dashboard. Flags over-concentrated ASGs before a capacity crunch exposes them.

Add a lifecycle hook for graceful shutdown

Install a lightweight daemon that polls the metadata termination endpoint. On detection: drain from the load balancer, finish in-flight requests, flush local state. For containers on Kubernetes: configure the SIGTERM handler and terminationGracePeriodSeconds. For EC2 workloads: use ASG lifecycle hooks, which hold a terminating instance in the Terminating:Wait state while it drains. This is the difference between a Spot interruption that's invisible to users and one that drops requests hard.

Where to configure

EC2 → Auto Scaling → Lifecycle Hooks (Terminating:Wait) + systemd service polling the metadata endpoint
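The drain sequence described in this step (deregister, finish in-flight work, flush state) might be sketched as below. This is an illustration under stated assumptions, not a drop-in handler: `deregister`, `in_flight`, and `flush` are hypothetical callables you would supply, and the 100-second deadline is an assumed budget chosen to stay inside the 2-minute notice.

```python
import signal
import threading
import time
from typing import Callable


class GracefulDrainer:
    """Coordinates shutdown: deregister, wait for in-flight work, flush state."""

    def __init__(
        self,
        deregister: Callable[[], None],   # hypothetical: remove from the load balancer
        in_flight: Callable[[], int],     # hypothetical: count of active requests
        flush: Callable[[], None],        # hypothetical: persist local state
        deadline_seconds: float = 100.0,  # assumed budget, under the 120 s notice
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.deregister = deregister
        self.in_flight = in_flight
        self.flush = flush
        self.deadline_seconds = deadline_seconds
        self.sleep = sleep
        self.stop_requested = threading.Event()

    def handle_signal(self, signum, frame) -> None:
        """Signal handler: only sets a flag; the main loop performs the drain."""
        self.stop_requested.set()

    def install(self) -> None:
        """Register for SIGTERM. Must be called from the main thread."""
        signal.signal(signal.SIGTERM, self.handle_signal)

    def drain(self) -> bool:
        """Run the drain sequence; return True if in-flight work finished in time."""
        self.deregister()
        deadline = time.monotonic() + self.deadline_seconds
        remaining = self.in_flight()
        while remaining > 0 and time.monotonic() < deadline:
            self.sleep(0.5)
            remaining = self.in_flight()
        self.flush()  # flush even if the deadline was hit
        return remaining == 0
```

A typical main loop would call `install()` once at startup, serve requests until `stop_requested` is set, and then call `drain()`. Keeping the signal handler down to setting a flag avoids doing blocking work inside the handler itself.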

⚠ Red flag: No lifecycle hooks on Spot ASGs and no SIGTERM handler. Interruptions will terminate in-flight requests hard.

◆ XamOps AutoSpotting integrates with ASG lifecycle hooks in the staged replacement loop, health-checking before removing capacity. Same discipline that makes graceful interruption handling work.

Let the replacement loop do the heavy lifting

The hardest part of Spot in production isn't the initial config. It's keeping the fleet in the right state over time. Spot capacity comes and goes. Manually rebalancing On-Demand ratios, handling fallback events, and re-checking AZ distribution is a part-time job. The engineers who run Spot successfully have automated the control loop: detect drift → add Spot → wait for health check → remove excess On-Demand. Continuously.

Where to configure

EC2 → Auto Scaling Groups → Activity History + a custom CloudWatch metric tracking the On-Demand vs Spot capacity ratio
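The control loop in this step (detect drift → add Spot → wait for health check → remove excess On-Demand) can be sketched as a small state machine. This is a simplified model, not the implementation of any particular tool: `FleetState` and the three callables (`launch_spot`, `is_healthy`, `terminate_on_demand`) are hypothetical stand-ins for real AWS calls.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class FleetState:
    on_demand: int
    spot: int
    on_demand_floor: int


def replace_one(
    state: FleetState,
    launch_spot: Callable[[], Optional[str]],  # returns an instance ID, or None if no capacity
    is_healthy: Callable[[str], bool],
    terminate_on_demand: Callable[[], None],
) -> bool:
    """One staged replacement. Returns True if an On-Demand instance was swapped."""
    if state.on_demand <= state.on_demand_floor:  # detect drift: nothing above the floor
        return False
    instance_id = launch_spot()                   # add Spot
    if instance_id is None:
        return False                              # no Spot capacity: keep On-Demand
    if not is_healthy(instance_id):               # wait for health check
        # A real loop would also terminate the unhealthy Spot instance here.
        return False
    state.spot += 1
    terminate_on_demand()                         # remove excess On-Demand last
    state.on_demand -= 1
    return True


def reconcile(
    state: FleetState,
    launch_spot: Callable[[], Optional[str]],
    is_healthy: Callable[[str], bool],
    terminate_on_demand: Callable[[], None],
) -> int:
    """Swap instances one at a time until the floor is reached or a swap fails."""
    swaps = 0
    while replace_one(state, launch_spot, is_healthy, terminate_on_demand):
        swaps += 1
    return swaps
```

The ordering is the point of the design: On-Demand capacity is removed only after a healthy Spot replacement exists, so total capacity never dips and the On-Demand count never falls below the floor.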

⚠ Red flag: ASGs that started on Spot months ago and have never been reviewed. They're probably mostly On-Demand after a silent fallback event.

◆ XamOps AutoSpotting runs the staged replacement loop automatically: validate ASG candidacy → launch Spot → wait for health → terminate On-Demand. Runs continuously across all opted-in ASGs. Opt in with a single tag.

Section 3

What 70% Savings Actually Means

On-Demand: $7,000 per month (50 × m5.xlarge)

Spot (blended): $2,200 per month (same 50 instances)

Saving: $4,800/month = $57,600/year from one afternoon of configuration. At 200 instances → $230k/year. At 500 instances → exceeds the cost of a full platform engineering team.
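These figures can be reproduced with a few lines of arithmetic. The sketch below assumes savings scale linearly with instance count, which is how the 200- and 500-instance numbers follow from the 50-instance baseline; the function names are illustrative.

```python
from typing import Tuple


def scaled_savings(
    on_demand_monthly: float,
    spot_monthly: float,
    baseline_instances: int,
    instances: int,
) -> Tuple[float, float]:
    """Return (monthly, annual) savings, assuming linear scaling per instance."""
    per_instance = (on_demand_monthly - spot_monthly) / baseline_instances
    monthly = per_instance * instances
    return monthly, monthly * 12


def savings_rate(on_demand_monthly: float, spot_monthly: float) -> float:
    """Fraction of the On-Demand bill saved by the blended Spot price."""
    return (on_demand_monthly - spot_monthly) / on_demand_monthly


# 50 instances: $4,800/month and $57,600/year
baseline = scaled_savings(7000, 2200, baseline_instances=50, instances=50)
# 200 instances: $230,400/year; 500 instances: $576,000/year
annual_200 = scaled_savings(7000, 2200, 50, 200)[1]
annual_500 = scaled_savings(7000, 2200, 50, 500)[1]
```

Note that `savings_rate(7000, 2200)` comes to about 68.6%, which this example rounds to the headline 70%.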

Monthly saving: $4,800 (50-instance fleet)

Annual saving: $57,600 (one afternoon of setup)

At 500 instances: $576k/yr (exceeds a platform team)

The staged replacement loop, running continuously

Configure it once. XamOps keeps it running.

The 5-step setup above takes 2–4 hours to configure correctly. The hard part is keeping it in the right state as Spot capacity fluctuates, instance types get deprecated, and new ASGs are added to your account.

XamOps AutoSpotting runs the staged replacement loop on every opted-in ASG: validate candidacy → launch Spot → wait for health checks → terminate On-Demand. The minimum On-Demand floor is a hard constraint the loop never violates. Every event is logged to Actions History with timing, savings attribution, and instance IDs.

Opt in with a single ASG tag. No agents to install, no YAML to manage.


About this article

Published: May 28, 2026

Read time: 10 minutes

Category: Spot

Author: Aditya Mehta
