From Local Terraform to CI: Structuring Shared and Nonprod Stacks for GKE

By the end of the previous step, the cluster had a defined shape.
Nodes were private, outbound traffic was routed through NAT, and inbound traffic flowed through a Gateway instead of individual services. A hostname pointed to that Gateway, and a simple workload confirmed that the path from DNS to container was working.
That was enough to validate the cluster itself.
What was still missing was how the rest of the system should be organized so that it could be applied consistently, and not just from a single laptop session.
From local execution to CI
Up to this point, Terraform was being run locally using Application Default Credentials.
That worked, and it was the simplest way to get started. But it also meant that:
- the ability to apply infrastructure depended on a local environment
- permissions were tied to a human user rather than a defined role
- there was no clear boundary between shared resources and environment-specific ones
The cluster was working, but the system around it was not yet structured in a way that could be reused or operated by something other than a local session.
The next step was to move Terraform execution into CI and make the structure explicit.
Structuring shared and nonprod stacks
As more components were added, especially DNS and artifact storage, it became clear that not everything belonged to the same environment.
Some resources are naturally shared:
- DNS zones
- artifact registries
Others are tied to an environment:
- project services
- networking
- the GKE cluster
The separation between shared and non-production stacks was not just about organization. It defined ownership and made it clear which resources could be reused across environments and which ones could change independently.
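In Terraform terms, the shared stacks own exactly those resources. A minimal sketch of what they might contain, with the zone name, domain, and region as illustrative placeholders rather than the actual values:

# shared/dns-zone - one public zone, reused by every environment
resource "google_dns_managed_zone" "public" {
  name     = "example-zone"      # illustrative
  dns_name = "example.com."      # illustrative
}

# shared/artifact-registry - images are pushed once and pulled from any cluster
resource "google_artifact_registry_repository" "images" {
  repository_id = "images"
  location      = "europe-west1" # illustrative
  format        = "DOCKER"
}

Environments then reference these resources rather than creating their own copies.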
The structure reflects that:
Manual bootstrap
  |
  +-- state buckets
  +-- Terraform runner service accounts
  +-- GitHub OIDC / Workload Identity Federation

Live stacks
  |
  +-- shared/dns-zone
  +-- shared/artifact-registry
  |
  +-- nonprod/project-services
         |
         v
      nonprod/foundation
         |
         v
      nonprod/gke
Each Terraform root represents a deployable unit with its own state.
This is an extension of the earlier separation between foundation and GKE, but applied more broadly. Instead of treating the system as a single graph, the dependencies between stacks are made explicit, and the stacks are applied in order.
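Concretely, each root keeps its own state under its own prefix, and a downstream stack reads the outputs of the stack it depends on instead of sharing a module graph. A minimal sketch, assuming a GCS state bucket and an output name that are illustrative rather than the actual ones:

# nonprod/gke/backend.tf - this root keeps its own state under its own prefix
terraform {
  backend "gcs" {
    bucket = "example-terraform-state"   # state bucket created during bootstrap (name is illustrative)
    prefix = "nonprod/gke"
  }
}

# nonprod/gke/dependencies.tf - the dependency on the foundation stack is explicit
data "terraform_remote_state" "foundation" {
  backend = "gcs"
  config = {
    bucket = "example-terraform-state"
    prefix = "nonprod/foundation"
  }
}

# The cluster configuration can then consume outputs such as
# data.terraform_remote_state.foundation.outputs.network_self_link (output name is illustrative)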
Keeping bootstrap separate
Bootstrap remains a separate concern.
Terraform needs a backend, service accounts, and identity configuration before it can run reliably. These pieces are created once and change infrequently, so they are applied manually rather than through CI.
This layer includes:
- remote state buckets
- Terraform runner service accounts
- Workload Identity Federation setup
Keeping bootstrap out of CI reduces complexity and makes the dependency chain clearer.
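A rough sketch of what this layer contains, applied once from a local session; the bucket name, location, and account ID are illustrative:

# Remote state bucket used by every live stack
resource "google_storage_bucket" "tf_state" {
  name                        = "example-terraform-state"
  location                    = "EU"
  uniform_bucket_level_access = true

  versioning {
    enabled = true
  }
}

# Runner identity that CI later impersonates for the nonprod stacks
resource "google_service_account" "tf_nonprod" {
  account_id   = "terraform-nonprod"
  display_name = "Terraform runner (nonprod)"
}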
Running Terraform through GitHub Actions
With the structure in place, Terraform execution was moved into GitHub Actions.
Instead of using a service account key, Workload Identity Federation is used so that GitHub can exchange its OIDC token for short-lived credentials. This allows workflows to impersonate a Terraform runner service account without storing long-lived secrets.
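On the Google Cloud side, this relies on the Workload Identity Federation pieces created during bootstrap: a pool, an OIDC provider for GitHub, and a binding that lets workflows from the repository impersonate the runner. A sketch, with the pool, provider, and repository names as assumptions:

resource "google_iam_workload_identity_pool" "github" {
  workload_identity_pool_id = "github"
}

resource "google_iam_workload_identity_pool_provider" "github" {
  workload_identity_pool_id          = google_iam_workload_identity_pool.github.workload_identity_pool_id
  workload_identity_pool_provider_id = "github-oidc"

  oidc {
    issuer_uri = "https://token.actions.githubusercontent.com"
  }

  attribute_mapping = {
    "google.subject"       = "assertion.sub"
    "attribute.repository" = "assertion.repository"
  }

  # Only tokens issued for this repository can be exchanged (repository is illustrative)
  attribute_condition = "attribute.repository == \"example-org/example-infra\""
}

# Allow workflows from that repository to impersonate the nonprod runner
# (google_service_account.tf_nonprod is the runner account from the bootstrap sketch above)
resource "google_service_account_iam_member" "github_nonprod" {
  service_account_id = google_service_account.tf_nonprod.name
  role               = "roles/iam.workloadIdentityUser"
  member             = "principalSet://iam.googleapis.com/${google_iam_workload_identity_pool.github.name}/attribute.repository/example-org/example-infra"
}

The workflow itself then only needs permission to request an OIDC token.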
A minimal workflow looks like this:
on:
  workflow_dispatch:    # triggered manually at this stage

permissions:
  id-token: write
  contents: read

jobs:
  plan:                 # job name and runner label are illustrative
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: google-github-actions/auth@v2
        with:
          workload_identity_provider: ${{ vars.GCP_WIF_PROVIDER }}
          service_account: ${{ vars.GCP_SA_NONPROD }}
      - uses: hashicorp/setup-terraform@v3
      - run: terraform init -input=false
        working-directory: gcp/live/nonprod/foundation
      - run: terraform plan -input=false
        working-directory: gcp/live/nonprod/foundation
Each stack has its own workflow. The logic is the same, but the working directory and service account differ depending on what is being applied.
This changed how Terraform was executed: from a local user session to a service-account-backed workflow.
At this stage, workflows were triggered manually. This made it possible to validate each stack independently before introducing more automation.
Validation through behavior
Validation followed the same approach used earlier with the cluster.
The goal was to observe behavior rather than assume correctness.
On the infrastructure side:
- Gateway routing was verified by sending requests directly to the external IP with a host header
- DNS issues exposed incorrect zone boundaries rather than problems in Kubernetes
- workloads confirmed that networking and image pulls were functioning as expected
On the CI side, initial failures were useful.
Errors such as missing permissions or inability to generate access tokens pointed directly to gaps in IAM configuration. Fixing these clarified which identity was running Terraform and what permissions it required.
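On the Google Cloud side, the remedy is a small set of bindings: failures to generate access tokens typically point at the impersonation binding shown earlier, while permission errors during plan or apply point at missing project-level roles on the runner service account. The roles and project ID below are plausible examples for a stack that manages networking and GKE, not the actual list:

# Project-level roles for the nonprod Terraform runner (illustrative)
locals {
  tf_nonprod_roles = [
    "roles/compute.networkAdmin",
    "roles/container.admin",
    "roles/iam.serviceAccountUser",
  ]
}

resource "google_project_iam_member" "tf_nonprod" {
  for_each = toset(local.tf_nonprod_roles)

  project = "example-nonprod-project"
  role    = each.value
  member  = "serviceAccount:terraform-nonprod@example-nonprod-project.iam.gserviceaccount.com"
}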
Outcome
The system did not gain new features through this change.
Instead, the way it is operated became more defined.
Terraform is no longer tied to a local environment. Each stack has a clear boundary and can be applied independently. Authentication is handled through short-lived credentials rather than stored keys. Shared resources and environment resources are separated in a way that reflects how they are used.
The cluster remains the same.
What changed is how it is managed, and how that management can be repeated without relying on a specific machine or user session.