The DevOps Odyssey, Part 9: The Day My K3s Cluster Disappeared

Tags: argocd, gitops, terraform, ansible, sealed-secrets

Originally posted on LinkedIn, April 13, 2026

One morning, everything was gone.

There was no SSH access and no network response, and none of the services I depended on - Grafana, Argo CD, my own applications - was reachable. What remained was not a degraded system but a completely unresponsive VM.

Oracle Cloud had scheduled maintenance on the underlying infrastructure. When it came back, my instance didn’t.

The cluster wasn’t slow. It wasn’t partially broken.

It had simply disappeared.

With Mazinger Z broken, Mazinkaiser emerges.

Rebuild, Not Repair

The first instinct in moments like this is to troubleshoot. Maybe the network is misconfigured. Maybe a service failed to restart. Maybe there’s still a way to recover the system.

But after a short while, it became clear that there was nothing left to fix. There was no system to connect to, no logs to inspect, and no meaningful way to debug what no longer existed.

At that point, the decision shifted from recovery to reconstruction.

The only viable path forward was to rebuild everything from scratch.

Starting Over

I began by deleting both the master and worker nodes and returning to my terraform-oci repository to recreate the infrastructure.

Since this was effectively a full rebuild, it was also an opportunity to improve the baseline. I upgraded the base image from Oracle Linux 9 to Oracle Linux 10 as part of the process.

That decision quickly exposed a flaw in my automation. My Ansible scripts still assumed the presence of EPEL, which Oracle Linux 10 no longer required. The playbook failed - not catastrophically, but enough to interrupt the workflow and require manual intervention.

Automation that isn’t regularly exercised tends to drift, and when it does, it fails at the worst possible time.

After patching the playbook, the provisioning flow completed successfully.
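The fix was to gate the EPEL step on the OS version instead of assuming it. A hedged Ansible sketch, assuming a dnf-based setup (the task name and package are illustrative, not my exact playbook):

```yaml
# Install EPEL only where it is still needed; Oracle Linux 10 no longer
# requires this step, so skip it on OL10 and later.
- name: Install EPEL release (Oracle Linux 9 and earlier only)
  ansible.builtin.dnf:
    name: oracle-epel-release-el9
    state: present
  when:
    - ansible_distribution == "OracleLinux"
    - ansible_distribution_major_version | int <= 9
```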

A Cluster Exists Again

With Terraform recreating the infrastructure and Ansible installing K3s, the master node came back online.

Running:

kubectl get nodes

confirmed that the control plane was up and ready.

For the first time since the outage, there was a functioning Kubernetes cluster again. However, it was effectively a blank slate - no workloads, no applications, and no established access from my local environment.

Regaining Access

Access to Kubernetes is not automatically restored just because the server exists. It must be explicitly re-established.

On the server, the kubeconfig file is located at:

/etc/rancher/k3s/k3s.yaml

I copied this file to my local machine, updated the server endpoint to use the reachable IP address, and configured my local kubectl to use it.
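That sequence can be sketched as follows, assuming the default K3s kubeconfig layout; the IP is a TEST-NET placeholder, and the file here is simulated to show the endpoint rewrite (on the real server it would be fetched with scp first):

```shell
# Placeholder address - substitute your instance's reachable IP.
SERVER_IP=203.0.113.10

# On the real server the file would be fetched first, e.g.:
#   scp opc@$SERVER_IP:/etc/rancher/k3s/k3s.yaml k3s.yaml
# Simulate the copied kubeconfig to demonstrate the rewrite:
cat > k3s.yaml <<'EOF'
apiVersion: v1
clusters:
- cluster:
    server: https://127.0.0.1:6443
EOF

# K3s writes the endpoint as localhost; point it at the reachable IP.
sed -i "s/127\.0\.0\.1/$SERVER_IP/" k3s.yaml

# Then direct kubectl at the adjusted file:
#   export KUBECONFIG=$PWD/k3s.yaml
#   kubectl get nodes
grep "server:" k3s.yaml
```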

With that, I regained administrative access to the cluster.

The First Real Problem: Sealed Secrets

Because the cluster had been rebuilt, it generated a new encryption key for Sealed Secrets. As a result, every previously sealed secret became unusable. There was no way to decrypt them with the new controller.

This behavior is intentional. Sealed Secrets are designed to bind encrypted data to a specific cluster. When that cluster is replaced, the associated encryption key is replaced as well.

To proceed, I fetched the new certificate:

kubeseal \
  --controller-name sealed-secrets \
  --controller-namespace kube-system \
  --fetch-cert > sealed-secrets.crt

and began resealing all required secrets from their original values.
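Resealing means re-encrypting each original plaintext value against the new certificate. A hedged sketch of one such round trip (the secret name, namespace, and key are illustrative, not my actual secrets):

```yaml
# Plaintext Secret to be resealed - never committed to Git directly.
apiVersion: v1
kind: Secret
metadata:
  name: grafana-admin
  namespace: monitoring
stringData:
  admin-password: "<original-value>"
# Reseal it offline with the freshly fetched cert, then commit the output:
#   kubeseal --cert sealed-secrets.crt --format yaml \
#     < grafana-admin.yaml > grafana-admin-sealed.yaml
```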

Regenerating the GitHub Client Secret for Dex

As part of rebuilding the cluster, I regenerated the dex.github.clientSecret used by Argo CD for GitHub authentication.

This involved returning to GitHub, locating the application under developer settings, and generating a new client secret to use with the fresh environment.

This approach ensured that the authentication setup was fully aligned with the rebuilt cluster and avoided carrying over any stale configuration.
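For reference, this is roughly where the new value lands. A sketch of the relevant argocd-cm fragment, assuming the standard Dex GitHub connector setup (the client ID is a placeholder):

```yaml
# argocd-cm fragment: the GitHub connector resolves $dex.github.clientSecret
# from the argocd-secret at runtime, which is why the resealed secret must
# exist before Dex can authenticate anyone.
dex.config: |
  connectors:
    - type: github
      id: github
      name: GitHub
      config:
        clientID: <github-oauth-app-client-id>
        clientSecret: $dex.github.clientSecret
```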

Argo CD Returns

With the required secrets resealed and applied, I rebuilt Argo CD.

Once it was operational, it began reconciling the desired state from Git. Applications were redeployed, services stabilized, and the system gradually returned to a functional state.

The Subtle Failure

At this point, the system appeared healthy.

However, the worker node was missing.

The root cause was a token mismatch. The master generated a new token, while the worker still used an old token stored in GitHub Secrets. Because they didn’t match, the worker failed to join.
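The current token always lives on the rebuilt master, so the worker has to join with that value rather than a cached copy. A non-runnable sketch of the join flow (paths and variables from the K3s docs; the master IP and token are placeholders):

```shell
# On the master: K3s regenerates this token on every fresh install.
sudo cat /var/lib/rancher/k3s/server/node-token

# On the worker: join with the *current* token, not the stale one
# stored in GitHub Secrets from the previous cluster.
curl -sfL https://get.k3s.io | \
  K3S_URL=https://<master-ip>:6443 \
  K3S_TOKEN=<current-node-token> sh -
```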

The Real Problem

This wasn’t just a token issue.

The infrastructure could be created, but it could not be reliably recreated.

Terraform, Ansible, GitHub, and Argo CD each managed part of the system, but there was no unified, deterministic workflow tying them together.

Critical dependencies, like token generation and build order, were implicit rather than enforced.
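One way to make such dependencies explicit is to encode them in the provisioning layer itself. A hedged Terraform sketch (resource names are illustrative, not my actual terraform-oci code):

```hcl
# The worker is only provisioned after the master exists, so the join
# token it consumes is always the one the new master generated.
resource "oci_core_instance" "worker" {
  # ... instance configuration elided ...
  depends_on = [oci_core_instance.master]
}
```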

What This Changed

Uptime is not the real goal.

Recoverability is.

A system that cannot be rebuilt quickly and predictably is not production-ready.

The Silver Lining

Along the way, I upgraded:

  • Oracle Linux 9 → 10
  • K3s
  • Argo CD
  • Observability stack

The rebuilt system was cleaner and more current.

Final Thoughts

The failure wasn’t defined by the outage itself.

It was defined by the fact that the system could not be restored automatically and predictably without manual intervention.

That is the real test of any platform.