From x86 to ARM in production: the EKS migration story
originally posted on LinkedIn on December 30, 2025
Every engineer on our team builds on ARM. Every test runs on ARM. Every container gets built on ARM.
Production ran on x86.
That mismatch created unnecessary complexity. Two architectures meant maintaining two sets of build configurations, two sets of potential edge cases, and the constant mental overhead of keeping dev and prod aligned across different instruction sets. It worked, but it felt like carrying extra weight for no good reason.
So we decided to close the gap. We migrated our entire EKS production environment from x86 to ARM—not as an experiment, but as a deliberate platform decision with clear goals: simplify our architecture end-to-end and reduce costs without sacrificing reliability.
The best part? We did it with zero downtime.
This is how we made that migration boring instead of risky.

The confidence wasn’t blind optimism
Two things made this feel like the right move.
First, the obvious one: our build world is already ARM-heavy, so a lot of compatibility gets “tested” naturally just by how we work.
Second, the less obvious one: outside of work I run a custom K3s cluster as part of my DevOps Odyssey series, and it’s been a surprisingly practical proving ground. Over time I’d already seen core tooling behave just fine on ARM—things like Argo CD, Sealed Secrets, and the Prometheus + Grafana observability stack.
That doesn’t mean production is the same as a side cluster. But it does mean the foundations weren’t a mystery.
Even better, it created a feedback loop I’ve grown to love:
What I learn at work applies to my side project. What I validate in my side project shows up as confidence at work. Win-win, and it compounds.
The migration plan (four phases, zero drama)
We ran this as a structured rollout. Each phase was designed to reduce uncertainty before we increased the blast radius. No big-bang migrations. No weekend heroics. Just methodical progress, one phase at a time.
Phase 1: Assessment — prove what’s real
Before we touched a single node, we treated ARM compatibility like a question that deserved evidence, not assumptions.
We went through every application in our stack. We checked container images: did ARM versions exist? We scanned dependencies: were there native libraries compiled only for x86? We reviewed code paths: any architecture-specific assumptions hiding in the codebase?
The audit turned unknowns into a concrete list. Some images didn’t have ARM builds yet. A handful of code paths made architecture assumptions we’d need to fix.
But here’s what mattered: we knew exactly what we were dealing with. No surprises waiting in production. We prioritized applications based on how easy they’d be to migrate and how much cost impact they’d have. The biggest “unknowns” became a spreadsheet of very knowable things.
No heroics—just clarity.
Phase 2: Preparation — fix the supply chain first
Most ARM migrations don’t fail in Kubernetes. They fail in the build pipeline.
So we fixed the supply chain first. We rebuilt every container image to support ARM. We updated our CI/CD pipelines to build multi-architecture images—or in some cases, ARM-only images since that’s all we’d need. We tracked down those missing ARM base images, and fixed the architecture-specific code paths we’d found in Phase 1.
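To make that concrete, the CI change for a typical service looked roughly like the workflow below. It's a minimal sketch, not our exact pipeline: the image name, registry, and action versions are placeholders, and the QEMU step is only needed while you're still building the non-native architecture.

```yaml
# Build and push a multi-arch image from a single job (registry login step omitted).
name: build-image
on: [push]
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: docker/setup-qemu-action@v3          # emulation for the non-native architecture
      - uses: docker/setup-buildx-action@v3
      - uses: docker/build-push-action@v6
        with:
          platforms: linux/amd64,linux/arm64       # or linux/arm64 alone, where that's all we need
          push: true
          tags: registry.example.com/my-app:latest # placeholder image name
```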
Then we prepared the infrastructure path. In Terraform, we created ARM node group definitions with zero instances. The node groups existed in code, ready to scale up when we needed them, but no actual nodes were running yet. The path was prepared.
This is also where being on MacBooks really helped: we weren’t introducing ARM into the org—we were aligning production with what engineers already used daily. The build pipeline that compiled ARM images on developer laptops would now deploy those same images to ARM nodes in production. The supply chain finally made sense.
Phase 3: Migration — one app at a time, one environment at a time
Before we moved anything, we locked everything down. We set node selectors to pin all our in-house applications to the existing x86 (amd64) nodes. Nothing would accidentally land on ARM until we explicitly told it to.
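The pin itself is nothing exotic: the standard architecture label on each Deployment's pod spec. A fragment, roughly:

```yaml
# Pod spec fragment: keep this app on x86 until we explicitly flip it.
spec:
  template:
    spec:
      nodeSelector:
        kubernetes.io/arch: amd64    # well-known node label; amd64 here means x86_64
```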
Then we scaled up the ARM node groups from zero. We’d prepared them in Terraform during Phase 2, and now we increased the instance count. ARM nodes joined the cluster and sat there, idle, waiting. We had both architectures running side by side, but all traffic still flowed to x86.
The migration started in dev, one application at a time. We’d flip a node selector, watch the pods reschedule onto ARM nodes, and let it run for a day. If the metrics stayed boring—latency flat, error rates normal, nothing interesting happening—we’d move to the next app.
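The flip really was that small: one value per app, whether you edit the manifest directly or commit the change for something like Argo CD to sync. Roughly:

```yaml
# The entire per-app migration step: one value changes, pods reschedule on rollout.
nodeSelector:
  kubernetes.io/arch: arm64          # was: amd64
```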
Dev first. Then staging. Then production.
It was a canary rollout, but we made it methodical: one app, one environment, one validation cycle. Each success gave us confidence for the next. Each wave was small enough that if something went wrong, we’d know exactly what broke and could roll back in minutes.
As applications moved to ARM, we gradually scaled down x86 nodes while scaling up ARM ones. The cluster shrank and grew in parallel, keeping capacity constant while shifting architectures.
Once all our applications were running on ARM, we faced the infrastructure layer itself: Argo CD, Sealed Secrets, Prometheus, Grafana, and the rest of the platform tooling that keeps everything running. This is where the K3s experience paid off. I'd already seen these tools work reliably on ARM in my side cluster, so we knew they'd be fine. We scaled down the x86 nodes one by one, watching each infrastructure component handle the transition smoothly, until the last x86 node drained and the cluster was fully ARM.
Phase 4: Cleanup — finish what you started
Once everything was running on ARM, the real work began: removing the old architecture entirely.
In Terraform, we deleted the x86 node group definitions. One commit, one apply, and the infrastructure code matched reality. No more maintaining node groups we didn’t need.
In our build pipelines, we stopped building x86 Docker images. The CI/CD jobs got simpler—no more multi-architecture builds, no more pushing images for an architecture we’d never use. The build times dropped, and the pipeline logs got cleaner.
But the biggest win came from our GitHub Actions runners. We’d been using larger x86 runners, and they weren’t cheap. We migrated to native ARM runners, and the cost difference was immediate. Same performance, lower bill. The runners that built our code now matched the architecture we actually ran in production.
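In workflow terms the cleanup is small, which is part of why it's easy to skip. A hedged sketch of the end state (the runner label depends on your plan and setup; ubuntu-24.04-arm is GitHub's hosted ARM label, while larger or self-hosted runners use whatever labels you've assigned):

```yaml
# After cleanup: native ARM runner, single-arch build, no QEMU emulation.
jobs:
  build:
    runs-on: ubuntu-24.04-arm        # or your ARM runner group's label
    steps:
      - uses: actions/checkout@v4
      - uses: docker/setup-buildx-action@v3
      - uses: docker/build-push-action@v6
        with:
          platforms: linux/arm64     # the only architecture we still ship
          push: true
          tags: registry.example.com/my-app:latest   # placeholder image name
```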
This is where the cost story becomes real. Migrating is good; finishing is what actually captures savings. Removing the old node groups, simplifying the build pipeline, and switching to ARM-native runners—that’s where the bill changes.
What made zero downtime achievable
Not luck. Guardrails.
We made sure readiness probes actually reflected real health—not just that a process was running, but that the application was ready to serve traffic. We kept enough replicas running for anything that mattered, so when we drained a node, traffic just shifted to another pod. We set disruption budgets so node drains couldn’t become outages. We used rollout strategies that limited blast radius: if something broke, it broke small, and we’d know exactly what it was.
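None of that requires anything exotic; it's standard Kubernetes objects. A minimal sketch for one service (the name, replica counts, and probe endpoint are illustrative, not our real values):

```yaml
# Guardrails for one service: real readiness, small rollout steps, a disruption budget.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-app                  # placeholder name
spec:
  replicas: 3                        # enough headroom that draining one node never hurts
  selector:
    matchLabels:
      app: example-app
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1              # small blast radius per rollout step
      maxSurge: 1
  template:
    metadata:
      labels:
        app: example-app
    spec:
      containers:
        - name: app
          image: registry.example.com/my-app:latest   # placeholder image
          readinessProbe:            # "ready" means serving traffic, not just "process exists"
            httpGet:
              path: /healthz         # illustrative endpoint
              port: 8080
            periodSeconds: 5
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: example-app
spec:
  minAvailable: 2                    # a node drain can never take the service below this
  selector:
    matchLabels:
      app: example-app
```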
We watched the metrics. Latency, error rates, saturation—the stuff that tells the truth. If anything looked interesting, we’d pause. If it looked wrong, we’d roll back. And rollbacks were boring and fast: flip a node selector, pods reschedule, done.
When those guardrails are in place, the migration stops being scary. It becomes operational work. You’re not hoping nothing breaks. You’re confident that if something breaks, you’ll catch it early and fix it fast.
The takeaway
For us, moving to ARM wasn’t just about cost—though yes, saving money matters.
It was about alignment.
When your dev world is ARM and your prod world is x86, you’re carrying friction you don’t need. Migrating production to ARM let us simplify the story, harden our build pipeline, and validate compatibility in a structured way—without interrupting users.
And personally, it reinforced the best part of running a serious side cluster:
Work makes the side project better. The side project makes work safer. That loop is worth more than any single migration.