May 28, 2026

Your AWS Bill Jumped 40% Last Month: Here’s How to Actually Investigate It

A step-by-step checklist for DevOps engineers to diagnose and fix unexpected AWS cost spikes, before finance comes knocking again.

Aditya Mehta

It’s Monday morning. You open Slack and see three messages from the finance team, all in the last hour. Your AWS bill came in 40% above the previous month. No major deployment happened. No launch, no traffic spike you know of. The Cost Explorer alert that was supposed to catch this? It didn’t fire. Now you have until end of day to explain what happened and produce a remediation plan.

This checklist is what we wish we’d had the first time that happened. Seven steps, ordered from most likely cause to most obscure, each with a direct path to the answer inside the AWS console, plus a note on where XamOps would have caught it before finance did.

Why AWS Cost Spikes Are Hard to Diagnose

AWS Cost Explorer data can lag by up to 24 hours. Billing dimensions are spread across dozens of service-specific consoles. A single forgotten resource can hide in the statistical noise of a large account. And unlike application errors, there is no stack trace. Just a larger number at the bottom of a table.

The core problem is dimensionality. AWS bills across 500+ distinct usage types, split across regions, availability zones, account IDs, and resource tags. Finding the spike means narrowing from the entire billing surface down to a single resource or behavior, and doing it fast enough to actually fix something before next month’s bill arrives.

This is not a reflection on your engineering competence. The tooling is genuinely fragmented. The checklist below is a proven sequence for cutting through that noise systematically.

The Investigation Checklist

7 steps, from most likely cause to most obscure

01

Check the service breakdown first

Before you dig into individual resources, start at 30,000 feet. Pull up AWS Cost Explorer, set the grouping to "Service," and look at the delta between last month and the previous month. Which service line moved? EC2, S3, Data Transfer, RDS, or something else entirely? Narrowing the blast radius from "AWS in general" to "EC2 specifically" cuts your investigation time in half. You're not chasing shadows across 30 different dashboards.

Where to look
AWS Cost Explorer → Group by: Service → Date range: last 2 months

Red flag: A single service that jumped 30%+ with no corresponding deployment is the signal. Everything else is noise.

XamOps Cost Analytics surfaces this service-level delta automatically on the Waste & Cost dashboard. No manual report building required.
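
If you would rather script the service-level comparison than click through it, here is a minimal Python sketch. It assumes boto3 is installed and credentials are configured for the fetch step. The helper names (`costs_by_service`, `service_deltas`, `fetch_monthly_by_service`) are ours, not part of any AWS SDK, and the 30% default mirrors the red flag above.

```python
from typing import Dict, List, Tuple


def costs_by_service(period: dict) -> Dict[str, float]:
    """Flatten one ResultsByTime entry into {service name: cost in USD}."""
    return {
        group["Keys"][0]: float(group["Metrics"]["UnblendedCost"]["Amount"])
        for group in period["Groups"]
    }


def service_deltas(previous: dict, current: dict,
                   threshold: float = 0.30) -> List[Tuple[str, float, float]]:
    """Return (service, previous_cost, current_cost) for every service whose
    cost grew by at least `threshold`, largest dollar increase first."""
    before = costs_by_service(previous)
    after = costs_by_service(current)
    flagged = []
    for service, now in after.items():
        was = before.get(service, 0.0)
        # A service with no spend last period counts as a spike if it costs anything now.
        grew = (now > 0) if was == 0 else ((now - was) / was >= threshold)
        if grew:
            flagged.append((service, was, now))
    return sorted(flagged, key=lambda row: row[2] - row[1], reverse=True)


def fetch_monthly_by_service(start: str, end: str) -> List[dict]:
    """Fetch monthly costs grouped by service (needs boto3 and AWS credentials).
    Dates are "YYYY-MM-DD"; the end date is exclusive."""
    import boto3  # imported lazily so the helpers above work without it

    response = boto3.client("ce").get_cost_and_usage(
        TimePeriod={"Start": start, "End": end},
        Granularity="MONTHLY",
        Metrics=["UnblendedCost"],
        GroupBy=[{"Type": "DIMENSION", "Key": "SERVICE"}],
    )
    return response["ResultsByTime"]
```

Given two months of results, `service_deltas(months[0], months[1])` returns the short list of services worth investigating. The sketch ignores response pagination, so treat it as a starting point for accounts with many services.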

02

Look at Data Transfer charges: the silent killer

Data transfer is one of the most misunderstood line items on an AWS bill. It hides under vague labels like "EC2 Other" and splits across three different cost dimensions: cross-AZ traffic, NAT Gateway egress, and internet-bound egress. A poorly placed microservice that talks cross-AZ instead of intra-AZ can rack up thousands of dollars in a single month. Cross-AZ traffic is typically billed at 1¢/GB in each direction, an effective 2¢/GB that sounds harmless until you're moving 200 TB, which works out to roughly $4,000.

Where to look
Cost Explorer → Filter: Service = EC2-Other + DataTransfer → Group by: Usage Type

Red flag: "DataTransfer-Out-Bytes" or "DataTransfer-Regional-Bytes" climbing without a corresponding traffic increase on your CDN.

XamOps surfaces cross-AZ and NAT Gateway transfer anomalies on the Cost dashboard with the exact resource generating the traffic.
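
To see why a rate of 2¢/GB stops sounding harmless, it helps to run the arithmetic. The sketch below is illustrative only: it assumes decimal terabytes (1 TB = 1,000 GB) and uses the 2¢/GB and 200 TB figures from this step. The function names are ours, and you should substitute the rates from your own bill.

```python
GB_PER_TB = 1000  # decimal units; an assumption that keeps the numbers round


def transfer_cost_usd(terabytes: float, rate_per_gb: float) -> float:
    """Cost of moving `terabytes` of data at a flat per-GB rate."""
    return terabytes * GB_PER_TB * rate_per_gb


def monthly_transfer_report(volumes_tb: dict, rates_per_gb: dict) -> dict:
    """Map each traffic category to its estimated monthly cost in USD."""
    return {
        name: transfer_cost_usd(tb, rates_per_gb[name])
        for name, tb in volumes_tb.items()
    }


# Illustrative numbers only: 200 TB of cross-AZ traffic at 2 cents per GB.
report = monthly_transfer_report({"cross_az": 200}, {"cross_az": 0.02})
for name, cost in report.items():
    print(f"{name}: ${cost:,.0f}")
```

At these assumed numbers the cross-AZ line alone comes to about $4,000 a month. Add further categories (NAT Gateway, internet egress) to the two dictionaries with your own volumes and rates.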

03

Audit EC2 instance types and counts

Did anything scale out and forget to scale back? Auto Scaling Groups are powerful, but they don't always scale in as aggressively as they scale out, especially if you tuned them during a traffic spike and never reverted the minimum capacity. Pull a 30-day EC2 inventory and compare instance-hour counts week over week. A fleet that was running 40 instances and quietly ballooned to 65 is your culprit. Also check On-Demand vs Reserved vs Spot ratios. An On-Demand surge while Reserved capacity sits idle is a pricing mismatch that finance will notice.

Where to look
EC2 → Instances (sort by Launch Time) + Cost Explorer → Group by: Instance Type

Red flag: Instance-hours running 20%+ above baseline with no corresponding change in deployment logs.

XamOps tracks fleet size and On-Demand vs Spot ratios over time, alerting when instance count or cost efficiency drifts beyond threshold.
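
The week-over-week comparison is easy to automate once you have instance-hours per week. This is a minimal sketch under our own assumptions: the first two weeks serve as the baseline, the 20% threshold matches the red flag above, and the function name is hypothetical.

```python
from statistics import mean
from typing import List


def weeks_above_baseline(weekly_hours: List[float], baseline_weeks: int = 2,
                         threshold: float = 0.20) -> List[int]:
    """Indexes of weeks whose instance-hours exceed the baseline average
    (the first `baseline_weeks` entries) by more than `threshold`."""
    if len(weekly_hours) <= baseline_weeks:
        return []  # not enough data to compare against a baseline
    baseline = mean(weekly_hours[:baseline_weeks])
    limit = baseline * (1 + threshold)
    return [
        index
        for index, hours in enumerate(weekly_hours)
        if index >= baseline_weeks and hours > limit
    ]


# 40 instances for a full week is 40 * 168 = 6,720 instance-hours;
# 65 instances is 65 * 168 = 10,920.
weekly = [6720, 6720, 7000, 10920]
print(weeks_above_baseline(weekly))
```

With the sample data, only the final week (the jump from 40 to 65 instances) is flagged. The 7,000-hour week stays under the 20% limit.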

04

Check for Spot → On-Demand fallbacks

This is one of the most expensive surprises in the AWS toolkit. If Spot capacity runs out in a given AZ (common during regional demand spikes) and your fleet is configured to fall back to On-Demand, the switch happens silently. You don't get a banner, you don't get an alert. You just get a bill. On-Demand EC2 typically runs 3–5× the cost of equivalent Spot capacity. For a fleet that normally runs entirely on Spot, a single weekend of fallback adds roughly 13–27% to the month's compute bill, and a fallback that stays locked in for a week or two can double it.

Where to look
EC2 → Auto Scaling Groups → Activity History + Cost Explorer → Filter: Service = EC2 → Group by: Purchase Option

Red flag: On-Demand capacity appearing in an ASG that is normally 100% Spot.

XamOps monitors Spot-to-On-Demand fallback events in real time and alerts the moment a fallback starts, not when the bill lands.
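
One way to check for On-Demand instances hiding in a Spot fleet is to inspect the instance records themselves. The sketch below works on dictionaries shaped like the entries returned by EC2's `describe_instances` call. It relies on two details of that response: Spot instances carry `InstanceLifecycle` set to `"spot"` while On-Demand instances omit the field, and Auto Scaling tags its instances with `aws:autoscaling:groupName`. The function names are ours.

```python
from typing import Iterable, List

ASG_TAG = "aws:autoscaling:groupName"  # tag that EC2 Auto Scaling adds to its instances


def asg_name(instance: dict) -> str:
    """Name of the Auto Scaling Group an instance belongs to, or "" if none."""
    for tag in instance.get("Tags", []):
        if tag["Key"] == ASG_TAG:
            return tag["Value"]
    return ""


def on_demand_in_group(instances: Iterable[dict], group: str) -> List[str]:
    """IDs of On-Demand instances in `group`.
    describe_instances sets InstanceLifecycle to "spot" for Spot instances
    and omits the field for On-Demand ones."""
    return [
        inst["InstanceId"]
        for inst in instances
        if asg_name(inst) == group and inst.get("InstanceLifecycle") is None
    ]
```

If `on_demand_in_group(instances, "my-spot-fleet")` returns anything for a group that should be 100% Spot, you have found your fallback. The group name here is a placeholder.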

05

Look at RDS and managed service sprawl

The database tier accumulates waste quietly. Multi-AZ is almost always enabled "just in case" on dev and staging databases, which doubles the instance cost. Read replicas created for a load test and never deleted keep charging hourly. Automated snapshot storage silently grows month over month. Pull a list of every RDS instance, note whether it's Multi-AZ, count the read replicas, and check snapshot retention policies. Then ask: does this environment actually need production-level HA?

Where to look
RDS → Databases (filter by environment tag) + Cost Explorer → Group by: Database Engine

Red flag: Multi-AZ enabled on non-production databases, or read replica count exceeding the number of active applications.

XamOps flags Multi-AZ on non-prod environments and orphaned read replicas automatically on the Waste dashboard.
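
The RDS audit can also be scripted. This sketch operates on dictionaries shaped like the `DBInstances` entries from RDS's `describe_db_instances` call, using its `MultiAZ`, `TagList`, and `ReadReplicaDBInstanceIdentifiers` fields. The `environment` tag key and the production values are assumptions about your tagging scheme, so adjust them to match your own.

```python
from typing import Iterable, List, Tuple

PROD_VALUES = {"prod", "production"}  # assumed tag values that mean "production"


def environment_of(db: dict, tag_key: str = "environment") -> str:
    """Lower-cased value of the environment tag (hypothetical key), or ""."""
    for tag in db.get("TagList", []):
        if tag["Key"].lower() == tag_key:
            return tag["Value"].lower()
    return ""


def multi_az_non_prod(db_instances: Iterable[dict]) -> List[str]:
    """Identifiers of Multi-AZ databases explicitly tagged as non-production.
    Untagged databases are skipped because their environment is unknown."""
    flagged = []
    for db in db_instances:
        env = environment_of(db)
        if db.get("MultiAZ") and env and env not in PROD_VALUES:
            flagged.append(db["DBInstanceIdentifier"])
    return flagged


def replica_counts(db_instances: Iterable[dict]) -> List[Tuple[str, int]]:
    """(identifier, number of read replicas) for each database that has any."""
    return [
        (db["DBInstanceIdentifier"], len(db["ReadReplicaDBInstanceIdentifiers"]))
        for db in db_instances
        if db.get("ReadReplicaDBInstanceIdentifiers")
    ]
```

Skipping untagged databases is a deliberate choice to avoid false alarms. If your account has many untagged instances, listing those separately is probably the first cleanup task.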

06

Hunt for orphaned resources

When engineers spin up infrastructure and don't tear it down cleanly, the orphans accumulate: EBS volumes detached from terminated instances, Application Load Balancers with no registered targets, Elastic IPs not associated with any resource, NAT Gateways in VPCs that are no longer used. None of these are dramatically expensive individually, but combined across a large AWS account, they routinely add 5–15% to the monthly bill.

Where to look
EC2 → Volumes (state=available) + EC2 → Load Balancers (0 targets) + EC2 → Elastic IPs

Red flag: Unattached EBS volumes, load balancers with 0 healthy targets, or Elastic IPs not associated with a running instance.

XamOps scans for orphaned resources continuously and surfaces them in a prioritized Waste list. No manual console crawling needed.
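
A short script can do the first pass of the orphan hunt. The sketch below filters dictionaries shaped like the results of EC2's `describe_volumes` and `describe_addresses` calls: an unattached volume has `State` equal to `"available"`, and an unused Elastic IP has no `AssociationId`. The function names are ours, and the per-GB price is a parameter you supply because it varies by volume type and region.

```python
from typing import Iterable, List, Tuple


def unattached_volumes(volumes: Iterable[dict]) -> List[Tuple[str, int]]:
    """(VolumeId, size in GiB) for every volume in the "available" state."""
    return [(v["VolumeId"], v["Size"]) for v in volumes if v["State"] == "available"]


def unassociated_addresses(addresses: Iterable[dict]) -> List[str]:
    """Public IPs of Elastic IPs that are not associated with anything."""
    return [a["PublicIp"] for a in addresses if not a.get("AssociationId")]


def monthly_volume_waste_usd(volumes: Iterable[dict],
                             price_per_gb_month: float) -> float:
    """Rough monthly storage cost of unattached volumes at a flat price.
    Ignores provisioned IOPS and throughput charges."""
    total_gib = sum(size for _, size in unattached_volumes(volumes))
    return total_gib * price_per_gb_month
```

Load balancers with no registered targets need a separate pass over their target groups, which this sketch does not cover.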

07

Check data processing costs

Athena, Lambda, and CloudWatch are usage-priced and often invisible until they aren't. Athena charges per TB of data scanned; a single unpartitioned query on a large S3 bucket can cost more than a full day's worth of EC2. Lambda invocation counts can spike if a misconfigured event source starts firing at 10× the expected rate. CloudWatch log ingestion bills by the gigabyte; verbose logging from a debugging session that was never turned off keeps charging indefinitely.

Where to look
Cost Explorer → Filter: Service = Athena / Lambda / CloudWatch → Group by: Usage Type

Red flag: Athena queries scanning TB+ without partitioning, Lambda invocations doubling without a deployment, or CloudWatch log ingestion growing without a known debugging session.

XamOps tracks Athena, Lambda, and CloudWatch costs as part of the Cost dashboard, with anomaly alerts when processing costs deviate from the 30-day baseline.
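
Because Athena bills on data scanned, you can rank queries by estimated cost directly from their statistics. This sketch takes dictionaries shaped like the `QueryExecution` object from Athena's `get_query_execution` call, which reports `Statistics.DataScannedInBytes`. The price per TB is a parameter: the $5 used in the usage note below is an assumption, as is the decimal definition of a terabyte, so check both against your region's pricing.

```python
from typing import Iterable, List, Tuple

BYTES_PER_TB = 10 ** 12  # decimal terabyte; an assumption, verify against your pricing


def athena_cost_usd(bytes_scanned: int, price_per_tb: float) -> float:
    """Estimated cost of a query from the bytes it scanned.
    Ignores per-query minimums and rounding applied by the service."""
    return bytes_scanned / BYTES_PER_TB * price_per_tb


def costliest_queries(executions: Iterable[dict], price_per_tb: float,
                      top: int = 5) -> List[Tuple[str, float]]:
    """(QueryExecutionId, estimated cost) for the `top` most expensive queries."""
    costs = [
        (
            query["QueryExecutionId"],
            athena_cost_usd(
                query.get("Statistics", {}).get("DataScannedInBytes", 0),
                price_per_tb,
            ),
        )
        for query in executions
    ]
    return sorted(costs, key=lambda row: row[1], reverse=True)[:top]
```

Calling `costliest_queries(executions, price_per_tb=5.0)` on a day's worth of query executions usually makes an unpartitioned scan obvious, since it sits at the top of the list by a wide margin.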

The Root Cause Pattern

After running this checklist across hundreds of cost spikes, the root cause almost always traces back to one of three patterns:

A forgotten resource

Something was created, used once, and left running. NAT Gateways, EBS volumes, idle load balancers. Low per-unit cost, high aggregate cost over weeks.

A scaling event that didn't reverse

The fleet scaled out for a traffic event and the scale-in never completed correctly. The usual causes are minimum capacity settings, misconfigured cooldown periods, or a Spot fallback that got locked in.

A data transfer surprise

A service was moved between AZs, a VPC peering was misconfigured, or a NAT Gateway started handling traffic it shouldn't. Data transfer is invisible until it isn't.

The checklist above is ordered to hit all three. Start at the service level, narrow to the resource type, and follow the money.

How XamOps automates this

Run this checklist once. Then never again.

The investigation above takes a skilled engineer 2–4 hours the first time. Longer if they’re new to the account. And it has to be repeated every time a spike happens, because AWS doesn’t maintain the institutional memory of what normal looks like.

XamOps monitors all seven of these dimensions continuously, across AWS, GCP, and Azure, in near-real time. When something deviates (a Spot fallback, a new orphaned volume, an unusual data transfer pattern), you get an anomaly alert before the billing period closes. The Waste & Cost dashboard shows the root cause in a single view: which resource, which account, which team, and the projected monthly impact.

No Cost Explorer pivot tables. No manual inventory audits. No Monday morning Slack messages from finance.

