GKE - Part 4: ExternalDNS, cert-manager, and Real URLs for the GitOps Platform

A heads-up before getting into the details: this is a longer step.
The previous parts built the cluster shape. This part starts turning that cluster into a platform. The goal is no longer just to prove that a pod can run or that a Gateway can receive traffic. The goal is to make the edge of the cluster behave like something that can be reused: DNS records created automatically, certificates issued automatically, and platform tools available at real hostnames instead of temporary IP addresses.
By the end of this step:
- ExternalDNS manages Cloud DNS records from Gateway API resources
- cert-manager issues Let’s Encrypt certificates using Cloud DNS DNS-01
- Argo CD is reachable at a real HTTPS URL
- a sample application is exposed through Gateway API with DNS and TLS
- platform controllers run on the system node pool
- the sample app runs on the spot node pool
This post does not cover installing Argo CD from scratch. For that bootstrap flow, check Homelab Arena Part 6. Here, Argo CD already exists, is self-managed, and is being used to install the next layer of platform components.
Where this starts
In the earlier GKE work, the cluster moved away from direct service exposure. Instead of using a LoadBalancer service per application, Gateway API became the entry point for inbound traffic. That gave the cluster a cleaner edge, but DNS still had a manual step: create a hostname, copy the Gateway IP, and create an A record by hand.
That was fine for proving the Gateway worked. It was not good enough for a GitOps platform.
The previous structure also separated shared resources from environment-specific resources. DNS zones live in the shared layer, while the GKE cluster, node pools, and environment resources live under nonprod or prod. Terraform is applied through CI with environment-specific roots, so this new layer follows the same pattern.
The GitOps repository also already has an important design decision: nonprod and prod each have their own Argo CD instance. Each instance reads from its own folder in the repo, which lets nonprod act as the staging ground for platform changes before prod is touched.
That matters here. external-dns and cert-manager are core platform services. A bad config can break DNS or certificate issuance. Having a nonprod Argo CD means those changes can be tested against the nonprod cluster first, using the same GitOps workflow, without modifying production.
Why two Argo CD instances
The platform uses two Argo CD instances:
- GKE nonprod Argo CD
- GKE prod Argo CD
This is intentional.
At first, one Argo CD might sound simpler. It is one UI, one set of secrets, and one place to manage applications. But once Argo CD starts managing platform components, that simplicity becomes a risk. If the same controller manages both nonprod and prod, then a mistake in a platform app definition can reach both environments from the same control plane.
Keeping them separate gives a few useful properties:
- nonprod issues do not directly affect prod
- platform upgrades can be tested in nonprod first
- repo credentials and SSO settings can differ by environment
- prod can remain slower and more conservative
- nonprod can move faster while the platform is still evolving
The upgrade flow becomes much safer:
- upgrade external-dns in nonprod
- validate DNS behavior
- promote the same change to prod

or:

- upgrade cert-manager in nonprod
- issue a staging certificate
- issue a production certificate
- then repeat in prod
This is the same reason I want Argo CD itself to be self-managed, but not shared across every environment. The controller is part of the platform, and platform changes need a place to prove themselves.
The platform node pool
The cluster has multiple node pool types prepared:
- system
- regular
- spot
That distinction becomes useful in this step.
Components like external-dns and cert-manager are not normal application workloads. They are platform controllers. If external-dns is unavailable, DNS automation stops. If cert-manager is unavailable, certificate issuance and renewal stop. They do not need many resources, but they should not be scheduled onto the most disposable capacity by accident.
So the platform components use the system pool:
nodeSelector:
  janus.io/pool: system
tolerations:
  - key: "janus.io/system"
    operator: "Equal"
    value: "true"
    effect: "NoSchedule"
The sample app, on the other hand, can use the spot pool:
nodeSelector:
  janus.io/pool: spot
tolerations:
  - key: "janus.io/spot"
    operator: "Equal"
    value: "true"
    effect: "NoSchedule"
That gives the platform a useful split:
- system pool: cluster services and controllers
- spot pool: cheap, interruptible workloads
- regular pool: workloads that should not be on spot
This is not just about saving money. It makes the scheduling intent visible in the manifests.
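A quick way to confirm the pools actually carry the labels these selectors expect (this assumes the janus.io/pool label from the manifests above is what the node pools set):

kubectl get nodes -L janus.io/pool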
Shared DNS zone
The shared DNS zone is created by Terraform:
{
  "project_id": "google-project-id",
  "region": "us-central1",
  "zone_name": "g-example-com",
  "dns_name": "g.example.com.",
  "description": "Delegated public zone for GKE-managed subdomains"
}
The zone is shared, but the actors that update it should not be shared.
For nonprod, external-dns gets its own Google service account:
external-dns-nonprod@google-project-id.iam.gserviceaccount.com
cert-manager gets its own identity too:
cert-manager-nonprod@google-project-id.iam.gserviceaccount.com
That separation matters. external-dns and cert-manager both need to write DNS records, but they do it for different reasons.
external-dns creates application records:
argocd.nonprod.g.example.com -> Gateway IP
hello.g.example.com -> Gateway IP
cert-manager creates temporary ACME challenge records:
_acme-challenge.<hostname> -> challenge token
They both touch DNS, but they are separate controllers and should have separate identities.
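During an active DNS-01 challenge, the temporary record is publicly visible, so it can be checked from any machine; the hostname here is just the sample app used later in this post:

dig TXT _acme-challenge.hello.g.example.com +short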
Terraform access for external-dns
The external-dns access module creates a Google service account, grants DNS access, and binds the Kubernetes service account to the Google service account through GKE Workload Identity.
The module looks like this:
locals {
  gsa_email = var.create_gsa ? google_service_account.external_dns[0].email : var.existing_gsa_email
}

resource "google_service_account" "external_dns" {
  count        = var.create_gsa ? 1 : 0
  project      = var.project_id
  account_id   = var.gsa_account_id
  display_name = var.gsa_display_name
  description  = "Google service account used by external-dns via GKE Workload Identity."
}

resource "google_dns_managed_zone_iam_member" "external_dns_zone_access" {
  project      = var.dns_project_id
  managed_zone = var.dns_zone_name
  role         = var.dns_role
  member       = "serviceAccount:${local.gsa_email}"
}

resource "google_project_iam_member" "external_dns_project_access" {
  project = var.dns_project_id
  role    = var.dns_role
  member  = "serviceAccount:${local.gsa_email}"
}

resource "google_service_account_iam_member" "workload_identity_user" {
  service_account_id = "projects/${var.project_id}/serviceAccounts/${local.gsa_email}"
  role               = "roles/iam.workloadIdentityUser"
  member             = "serviceAccount:${var.project_id}.svc.id.goog[${var.k8s_namespace}/${var.k8s_service_account_name}]"
}
The project-level DNS grant is important. At first, it seemed like a managed-zone-level IAM binding should be enough. In practice, the Google provider used by external-dns lists zones first, then matches records to a zone. Without the project-level permission, external-dns can authenticate but still fail with a 403 while listing zones.
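One way to confirm the project-level grant actually landed, using the placeholder project and service account names from this post:

gcloud projects get-iam-policy google-project-id \
  --flatten="bindings[].members" \
  --filter="bindings.members:external-dns-nonprod@google-project-id.iam.gserviceaccount.com" \
  --format="table(bindings.role)"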
The nonprod live stack passes values like this:
{
  "project_id": "google-project-id",
  "region": "us-central1",
  "dns_project_id": "google-project-id",
  "dns_zone_name": "g-example-com"
}
The important distinction is still there even if both values are currently the same project:
- project_id = project where the GKE cluster and GSA live
- dns_project_id = project where the Cloud DNS zone lives
If prod later lives in a different project but still writes to a shared DNS project, the same module can be reused.
Installing external-dns with Argo CD
external-dns is installed as a platform app through Argo CD.
The Argo CD Application points at the GitOps repo path:
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: external-dns
  namespace: argocd
spec:
  project: default
  source:
    repoURL: THE_REPO_URL
    targetRevision: HEAD
    path: gke/nonprod/platform/external-dns
    helm:
      valueFiles:
        - values.yaml
  destination:
    server: https://kubernetes.default.svc
    namespace: kube-system
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
The wrapper chart uses the upstream chart as a dependency:
apiVersion: v2
name: external-dns
version: 0.1.0
dependencies:
  - name: external-dns
    version: 1.20.0
    repository: https://kubernetes-sigs.github.io/external-dns/
Because this is a wrapper chart, the values are nested under the external-dns key:
external-dns:
  provider:
    name: google
  nodeSelector:
    janus.io/pool: system
  tolerations:
    - key: "janus.io/system"
      operator: "Equal"
      value: "true"
      effect: "NoSchedule"
  serviceAccount:
    create: true
    name: external-dns
    annotations:
      iam.gke.io/gcp-service-account: external-dns-nonprod@google-project-id.iam.gserviceaccount.com
  sources:
    - gateway-httproute
  domainFilters:
    - g.example.com
  policy: upsert-only
  registry: txt
  txtOwnerId: nonprod-gke
  interval: 1m
  logLevel: info
There are a few important details here.
First, the source is gateway-httproute. This makes external-dns read hostnames from Gateway API HTTPRoute resources instead of looking for classic Ingress resources.
Second, the domain filter is the actual managed zone:
domainFilters:
  - g.example.com
It is tempting to set this to narrower names like:
qa.g.example.com
stg.g.example.com
nonprod.g.example.com
But those only work as domain filters if they are actual managed zones. In this setup, the managed zone is g.example.com. Later, stricter boundaries can be added by creating delegated child zones with Terraform. For now, the controller needs to match the parent zone.
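When in doubt about what qualifies as a domain filter, listing the managed zones shows exactly which zones exist (placeholder project name from this post):

gcloud dns managed-zones list --project google-project-id --filter="dnsName:g.example.com."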
Third, txtOwnerId should be unique per cluster or environment:
txtOwnerId: nonprod-gke
external-dns uses TXT records to track ownership. Nonprod and prod should not share the same owner ID.
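Because ownership is tracked in ordinary TXT records, it is easy to see which owner ID claims each name directly in the zone:

gcloud dns record-sets list --zone g-example-com --project google-project-id --filter="type=TXT"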
Terraform access for cert-manager
cert-manager also needs DNS access, but for a different reason. Since the goal is to use Let’s Encrypt with DNS-01 validation, cert-manager needs to create and clean up ACME challenge TXT records in Cloud DNS.
A separate access module keeps the identity separate from external-dns:
locals {
  gsa_email = var.create_gsa ? google_service_account.cert_manager[0].email : var.existing_gsa_email
}

resource "google_service_account" "cert_manager" {
  count        = var.create_gsa ? 1 : 0
  project      = var.project_id
  account_id   = var.gsa_account_id
  display_name = var.gsa_display_name
  description  = "Google service account used by cert-manager via GKE Workload Identity."
}

resource "google_project_iam_member" "cert_manager_dns_access" {
  project = var.dns_project_id
  role    = var.dns_role
  member  = "serviceAccount:${local.gsa_email}"
}

resource "google_service_account_iam_member" "workload_identity_user" {
  service_account_id = "projects/${var.project_id}/serviceAccounts/${local.gsa_email}"
  role               = "roles/iam.workloadIdentityUser"
  member             = "serviceAccount:${var.project_id}.svc.id.goog[${var.k8s_namespace}/${var.k8s_service_account_name}]"
}
The nonprod module instance uses:
module "cert_manager_access" {
source = "../../../modules/cert-manager-access"
project_id = var.project_id
dns_project_id = var.dns_project_id
gsa_account_id = "cert-manager-nonprod"
gsa_display_name = "cert-manager nonprod"
k8s_namespace = "cert-manager"
k8s_service_account_name = "cert-manager"
}
That binds this Kubernetes service account:
cert-manager/cert-manager
to this Google service account:
cert-manager-nonprod@google-project-id.iam.gserviceaccount.com
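The Workload Identity side of that binding can be verified on the Google service account itself; the roles/iam.workloadIdentityUser member should be the cert-manager Kubernetes service account:

gcloud iam service-accounts get-iam-policy \
  cert-manager-nonprod@google-project-id.iam.gserviceaccount.com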
Installing cert-manager with Argo CD
cert-manager is also installed as a platform app.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: cert-manager
  namespace: argocd
spec:
  project: default
  source:
    repoURL: THE_REPO_URL
    targetRevision: HEAD
    path: gke/nonprod/platform/cert-manager
    helm:
      valueFiles:
        - values.yaml
  destination:
    server: https://kubernetes.default.svc
    namespace: cert-manager
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true
The wrapper chart:
apiVersion: v2
name: cert-manager
version: 0.1.0
dependencies:
  - name: cert-manager
    version: v1.20.2
    repository: oci://quay.io/jetstack/charts
The values file keeps the controller on the system node pool and annotates the service account for Workload Identity:
cert-manager:
  crds:
    enabled: true
  serviceAccount:
    annotations:
      iam.gke.io/gcp-service-account: cert-manager-nonprod@google-project-id.iam.gserviceaccount.com
  global:
    leaderElection:
      namespace: cert-manager
  nodeSelector:
    janus.io/pool: system
  tolerations:
    - key: "janus.io/system"
      operator: "Equal"
      value: "true"
      effect: "NoSchedule"
ClusterIssuers
cert-manager needs issuers that describe how certificates should be requested.
For nonprod, there are two ClusterIssuer resources:
- letsencrypt-staging
- letsencrypt-prod
The staging issuer is used first to validate the flow without burning production rate limits or debugging against real certificates. Once DNS-01 works, the certificate can be changed to the production issuer.
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-staging
spec:
  acme:
    email: platform@example.com
    server: https://acme-staging-v02.api.letsencrypt.org/directory
    privateKeySecretRef:
      name: letsencrypt-staging-private-key
    solvers:
      - selector:
          dnsZones:
            - g.example.com
        dns01:
          cloudDNS:
            project: google-project-id
---
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    email: platform@example.com
    server: https://acme-v02.api.letsencrypt.org/directory
    privateKeySecretRef:
      name: letsencrypt-prod-private-key
    solvers:
      - selector:
          dnsZones:
            - g.example.com
        dns01:
          cloudDNS:
            project: google-project-id
These are cluster-wide. There is no need to create one issuer per Gateway or per application.
The model is:
- ClusterIssuer = reusable issuing policy
- Certificate = hostname-specific request
- Gateway = consumes the TLS secret
Because Gateway API is being used directly, the certificate is explicit. This is a little different from older Ingress setups, where cert-manager annotations on an Ingress could create a Certificate indirectly through ingress-shim.
With Gateway API, the clearer pattern is:
Certificate -> Secret -> Gateway listener
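Once Argo CD has synced the issuers, a quick check confirms both registered with the ACME servers before any Certificate references them (standard cert-manager resources, nothing specific to this setup):

kubectl get clusterissuer
kubectl describe clusterissuer letsencrypt-staging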
Giving Argo CD a real URL
Once external-dns and cert-manager are working, Argo CD can move from a cluster-internal service or temporary access path to a real hostname:
https://argocd.nonprod.g.example.com
The Argo CD installation itself is not the focus of this post. The important part here is the edge configuration around it.
The Argo CD chart values need to know the external URL and that TLS is terminated before traffic reaches the Argo CD server:
argo-cd:
  global:
    domain: argocd.nonprod.g.example.com
  configs:
    params:
      server.insecure: true
    cm:
      url: https://argocd.nonprod.g.example.com
Then the platform adds a Certificate, Gateway, and HTTPRoute in the argocd namespace.
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: argocd-nonprod-g-example-com
  namespace: argocd
spec:
  secretName: argocd-nonprod-g-example-com-tls
  issuerRef:
    name: letsencrypt-prod
    kind: ClusterIssuer
  dnsNames:
    - argocd.nonprod.g.example.com
---
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: argocd-nonprod-gateway
  namespace: argocd
spec:
  gatewayClassName: gke-l7-global-external-managed
  listeners:
    - name: https
      protocol: HTTPS
      port: 443
      hostname: argocd.nonprod.g.example.com
      tls:
        mode: Terminate
        certificateRefs:
          - name: argocd-nonprod-g-example-com-tls
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: argocd-nonprod-route
  namespace: argocd
spec:
  parentRefs:
    - name: argocd-nonprod-gateway
  hostnames:
    - argocd.nonprod.g.example.com
  rules:
    - backendRefs:
        - name: argocd-server
          port: 80
At this point, the public edge for Argo CD is no longer hand wired.
The HTTPRoute gives external-dns a hostname to publish. The Gateway gives GKE a load balancer and HTTPS listener. The Certificate gives cert-manager a request to issue a Let’s Encrypt certificate. The TLS secret produced by cert-manager is referenced by the Gateway.
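A simple way to watch that chain converge, using the resource names from the manifests above:

kubectl -n argocd get certificate,gateway,httproute
kubectl -n argocd get secret argocd-nonprod-g-example-com-tls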
Sample app: exposing a real URL on the spot pool
The other validation is a simple application scheduled onto the spot pool. This proves that normal workloads can still use the cheaper capacity while platform controllers stay on the system pool.
This is the all.yaml I would use for the public Gateway API version of the sample.
During early testing, set the issuer to letsencrypt-staging. After the flow works, change it to letsencrypt-prod.
apiVersion: v1
kind: Namespace
metadata:
  name: janus-test
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: hello
  namespace: janus-test
spec:
  replicas: 1
  selector:
    matchLabels:
      app: hello
  template:
    metadata:
      labels:
        app: hello
    spec:
      nodeSelector:
        janus.io/pool: spot
      tolerations:
        - key: "janus.io/spot"
          operator: "Equal"
          value: "true"
          effect: "NoSchedule"
      containers:
        - name: hello
          image: us-docker.pkg.dev/google-samples/containers/gke/hello-app:1.0
          ports:
            - containerPort: 8080
---
apiVersion: v1
kind: Service
metadata:
  name: hello
  namespace: janus-test
spec:
  selector:
    app: hello
  ports:
    - name: http
      port: 80
      targetPort: 8080
      protocol: TCP
---
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: hello-g-example-com
  namespace: janus-test
spec:
  secretName: hello-g-example-com-tls
  issuerRef:
    name: letsencrypt-prod
    kind: ClusterIssuer
  dnsNames:
    - hello.g.example.com
---
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: hello-gke-gateway
  namespace: janus-test
spec:
  gatewayClassName: gke-l7-global-external-managed
  listeners:
    - name: https
      protocol: HTTPS
      port: 443
      hostname: hello.g.example.com
      tls:
        mode: Terminate
        certificateRefs:
          - name: hello-g-example-com-tls
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: hello-route
  namespace: janus-test
spec:
  parentRefs:
    - name: hello-gke-gateway
  hostnames:
    - hello.g.example.com
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /
      backendRefs:
        - name: hello
          port: 80
The important part is the relationship between the resources:
- HTTPRoute hostname -> external-dns creates the A record
- Certificate dnsNames -> cert-manager creates the TLS secret
- Gateway listener -> uses the TLS secret and routes to the HTTPRoute
- Deployment -> runs on the spot pool
If the TLS secret does not exist yet, the Gateway may temporarily report an invalid listener. That is expected until cert-manager finishes issuing the certificate.
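While waiting, the issuance can be followed through cert-manager's intermediate ACME resources; these are the standard cert-manager kinds, queried in the sample namespace:

kubectl -n janus-test get certificates,certificaterequests,orders.acme.cert-manager.io,challenges.acme.cert-manager.io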
Validation
The first thing to watch is external-dns:
kubectl -n kube-system logs deploy/external-dns -f
A successful run creates both the ownership TXT record and the A record:
Add records: a-hello.g.example.com. TXT ["heritage=external-dns,external-dns/owner=nonprod-gke,external-dns/resource=httproute/janus-test/hello-route"] 300
Add records: hello.g.example.com. A [33.120.22.122] 300
Then check the Gateway:
kubectl -n janus-test get gateway
Expected:
NAME                CLASS                            ADDRESS         PROGRAMMED
hello-gke-gateway   gke-l7-global-external-managed   33.120.22.122   True
If local DNS has not caught up yet, test directly with --resolve:
curl -vk --resolve hello.g.example.com:443:33.120.22.122 \
https://hello.g.example.com/
The useful part of the response is not just that it returns 200. It also proves TLS and routing are both working:
Server certificate:
subject: CN=hello.g.example.com
issuer: C=US; O=(STAGING) Let's Encrypt; CN=(STAGING) Riddling Rhubarb R12
HTTP/2 200
Hello, world!
Version: 1.0.0
Once this works with staging, switching to production is just changing the issuerRef:
issuerRef:
  name: letsencrypt-prod
  kind: ClusterIssuer
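cert-manager should notice the issuerRef change and re-issue on its own. If it seems stuck, one way to force a fresh issuance (using the sample names above) is to delete the staging secret and watch the Certificate:

kubectl -n janus-test delete secret hello-g-example-com-tls
kubectl -n janus-test get certificate hello-g-example-com -w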
What changed
The previous cluster could run workloads and receive traffic through Gateway API.
This version can manage the public edge through GitOps.
The difference is significant:
Before:
- Gateway gets IP
- manual DNS record
- manual certificate handling
- curl to test

After:
- HTTPRoute declares hostname
- external-dns creates DNS record
- Certificate declares desired TLS cert
- cert-manager issues Let's Encrypt cert
- Gateway terminates HTTPS
Argo CD now has a real URL:
https://argocd.nonprod.g.example.com
A sample app can also be exposed with a real URL and a real certificate, while running on spot capacity.
The platform services themselves run on the system pool, keeping the cluster controllers away from the more disposable workload pool.
Final shape
The platform layer now looks like this:
- kube-system
  - external-dns
- cert-manager
  - cert-manager controller
  - letsencrypt-staging ClusterIssuer
  - letsencrypt-prod ClusterIssuer
- argocd
  - Argo CD
  - Certificate
  - Gateway
  - HTTPRoute
- janus-test
  - sample app on the spot node pool
  - Certificate
  - Gateway
  - HTTPRoute
And the responsibility split is clearer:
Terraform:
- Cloud DNS zone
- Google service accounts
- IAM permissions
- Workload Identity bindings

Argo CD:
- external-dns
- cert-manager
- ClusterIssuers
- Gateway resources
- Certificate resources
- application manifests

GKE Gateway:
- load balancer
- HTTPS listener
- routing into Services

external-dns:
- A records and TXT ownership records

cert-manager:
- ACME account
- DNS-01 challenge records
- TLS secrets
Next
With DNS and certificates in place, the next platform decision is access.
Argo CD can now be reached through a real public hostname with TLS. That makes SSO and RBAC the next immediate concern. After that, I want to add the private access path for internal platform tools, likely with the Tailscale operator.
The important milestone here is that the cluster no longer depends on manually copying Gateway IPs into DNS or manually thinking about certificates. The edge is now declared in Kubernetes and reconciled by platform controllers.
That is the point where the cluster starts to feel less like a demo and more like a platform.