
Tenant Cluster Lifecycle

This document describes how Butler provisions and manages tenant clusters.

Overview

Tenant clusters are Kubernetes clusters provisioned by Butler for running workloads. Butler uses Cluster API (CAPI) for infrastructure management and Steward for hosted control planes.

TenantCluster Resource

```yaml
apiVersion: butler.butlerlabs.dev/v1alpha1
kind: TenantCluster
metadata:
  name: my-cluster
  namespace: team-a
spec:
  kubernetesVersion: "v1.30.0"

  controlPlane:
    replicas: 1  # Hosted via Steward

  workers:
    replicas: 3
    machineTemplate:
      cpu: 4
      memory: 8Gi
      diskSize: 100Gi

  providerConfigRef:
    name: harvester-prod

  networking:
    podCIDR: 10.244.0.0/16
    serviceCIDR: 10.96.0.0/12

  addons:
    cni:
      provider: cilium
    loadBalancer:
      provider: metallb
      addressPool: 10.40.1.0-10.40.1.50
    storage:
      provider: longhorn
```

Lifecycle Phases

| Phase | Description |
| --- | --- |
| Pending | CR created, awaiting reconciliation |
| Provisioning | Creating CAPI resources, Steward control plane, and worker VMs |
| Installing | Installing CNI, storage, and other addons |
| Ready | Cluster operational |
| Updating | Processing spec changes |
| Deleting | Cleaning up resources |
| Failed | Error state (check conditions) |
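
The current phase is surfaced in the status subresource; assuming the `status.phase` field shown in the Status and Conditions section of this document, it can be inspected with kubectl:

```shell
# Show the lifecycle phase of one tenant cluster
kubectl get tenantcluster my-cluster -n team-a \
  -o jsonpath='{.status.phase}'

# Or watch all tenant clusters in the namespace as they reconcile
kubectl get tenantcluster -n team-a -w
```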

Provisioning Flow

CAPI Resources Created

For each TenantCluster, butler-controller creates the following CAPI resources:

Resource Details

Cluster (CAPI)

```yaml
apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  name: my-cluster
spec:
  controlPlaneRef:
    apiVersion: controlplane.cluster.x-k8s.io/v1alpha1
    kind: StewardControlPlane
    name: my-cluster
  infrastructureRef:
    apiVersion: infrastructure.cluster.x-k8s.io/v1alpha1
    kind: HarvesterCluster
    name: my-cluster
```

StewardControlPlane

```yaml
apiVersion: controlplane.cluster.x-k8s.io/v1alpha1
kind: StewardControlPlane
metadata:
  name: my-cluster
spec:
  replicas: 1
  version: v1.30.0
  dataStoreName: default
```

MachineDeployment

```yaml
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineDeployment
metadata:
  name: my-cluster-workers
spec:
  clusterName: my-cluster
  replicas: 3
  template:
    spec:
      clusterName: my-cluster
      version: v1.30.0
      bootstrap:
        configRef:
          apiVersion: bootstrap.cluster.x-k8s.io/v1beta1
          kind: KubeadmConfigTemplate
          name: my-cluster-workers
      infrastructureRef:
        apiVersion: infrastructure.cluster.x-k8s.io/v1alpha1
        kind: HarvesterMachineTemplate
        name: my-cluster-workers
```
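
Once these objects exist, standard CAPI tooling can inspect them; `clusterctl describe` renders the ownership tree (assuming `clusterctl` is installed and pointed at the management cluster):

```shell
# Inspect the CAPI object tree and conditions for a tenant cluster
clusterctl describe cluster my-cluster -n team-a --show-conditions all

# List the MachineDeployment and its machines via the standard
# cluster-name label that CAPI applies
kubectl get machinedeployment,machines -n team-a \
  -l cluster.x-k8s.io/cluster-name=my-cluster
```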

Worker Node Bootstrap

Worker nodes built from kubeadm-based images (see Node OS Support below for the alternatives) use kubeadm to join the cluster. The bootstrap process:

  1. VM Created: Provider creates VM with cloud image
  2. Cloud-init: Configures network, SSH, base packages
  3. Container Runtime: Installs containerd
  4. Kubernetes: Installs kubelet, kubeadm, kubectl
  5. Join: Runs kubeadm join to join cluster
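
For a kubeadm-based OS such as Rocky Linux, the tail of this sequence boils down to a few commands run by cloud-init. A simplified sketch; package names and the join parameters are illustrative, since the real script is rendered by the bootstrap provider from the KubeadmConfigTemplate:

```shell
# Steps 3-4: container runtime and Kubernetes packages
# (illustrative; the generated cloud-init script is authoritative)
dnf install -y containerd kubelet kubeadm kubectl
systemctl enable --now containerd kubelet

# Step 5: join the hosted control plane. The endpoint, token, and CA
# hash are placeholders injected at render time by the bootstrap provider.
kubeadm join "${CONTROL_PLANE_ENDPOINT}" \
  --token "${BOOTSTRAP_TOKEN}" \
  --discovery-token-ca-cert-hash "${CA_CERT_HASH}"
```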

Node OS Support

| OS | Bootstrap Method | Status |
| --- | --- | --- |
| Talos Linux | Machine config via dataSecretName | Stable |
| Rocky Linux 9 | CABPK KubeadmConfigTemplate via configRef | Stable |
| Flatcar Container Linux | Ignition JSON via dataSecretName | Stable |
| Bottlerocket | TOML settings via dataSecretName | Beta |
| Kairos | Cloud-config YAML via dataSecretName | Beta |

Addon Installation

After workers join, platform addons are installed:

Installation Order

| Order | Addon | Required | Purpose |
| --- | --- | --- | --- |
| 1 | Cilium | Yes | CNI networking |
| 2 | MetalLB | Yes* | LoadBalancer services |
| 3 | cert-manager | No | TLS certificates |
| 4 | Longhorn | No | Distributed storage |
| 5 | Traefik | No | Ingress controller |

\*MetalLB is required for on-premises deployments.
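
After installation, addon health can be spot-checked from inside the tenant cluster. A quick check, assuming the addons are deployed as Helm releases and that Cilium uses its default DaemonSet name and namespace:

```shell
# List addon Helm releases across all namespaces (requires helm v3 and
# a tenant-cluster kubeconfig, here saved as my-cluster.yaml)
helm list -A --kubeconfig my-cluster.yaml

# Spot-check the CNI: Cilium's agent runs as a DaemonSet named "cilium"
# in kube-system by default
kubectl --kubeconfig my-cluster.yaml -n kube-system \
  rollout status daemonset/cilium
```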

Addon Configuration

Addons can be configured via TenantCluster spec:

```yaml
spec:
  addons:
    cni:
      provider: cilium
      version: "1.17.0"  # Optional, defaults to platform default
    loadBalancer:
      provider: metallb
      addressPool: 10.40.1.0-10.40.1.50
    monitoring:
      provider: prometheus
```

Scaling

Scale Workers

```shell
# Via CLI
butlerctl cluster scale my-cluster --workers 5

# Via kubectl
kubectl patch tenantcluster my-cluster -n team-a \
  --type merge -p '{"spec":{"workers":{"replicas":5}}}'
```

The controller updates the MachineDeployment, and CAPI handles adding/removing nodes.
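
Progress of a scale operation can be followed on the CAPI side of the management cluster, for example:

```shell
# Watch replica counts converge as the MachineDeployment reconciles
kubectl get machinedeployment my-cluster-workers -n team-a -w

# Per-machine detail (provisioning, running, deleting), selected via
# the standard CAPI cluster-name label
kubectl get machines -n team-a \
  -l cluster.x-k8s.io/cluster-name=my-cluster
```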

Scale Down Behavior

When scaling down:

  1. CAPI selects nodes to remove (based on machine health and age)
  2. Nodes are cordoned and drained, evicting their pods
  3. Evicted pods are rescheduled onto the remaining nodes
  4. The backing VMs are deleted

Kubernetes Version Upgrades

```yaml
spec:
  kubernetesVersion: "v1.31.0"  # Change from v1.30.0
```

Upgrade process:

  1. Control plane updated first (Steward handles this)
  2. Worker nodes updated via rolling update
  3. One node at a time to minimize disruption
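
The same `kubectl patch` pattern used for scaling also triggers an upgrade; a sketch:

```shell
# Bump the cluster version; the controller rolls the control plane
# first, then the workers one node at a time
kubectl patch tenantcluster my-cluster -n team-a \
  --type merge -p '{"spec":{"kubernetesVersion":"v1.31.0"}}'

# Follow the rolling update from inside the tenant cluster
kubectl --kubeconfig my-cluster.yaml get nodes -w
```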

Deletion

```shell
butlerctl cluster destroy my-cluster
```

Deletion process:

  1. Finalizer blocks immediate deletion
  2. Delete addons (Helm releases)
  3. Delete CAPI resources
  4. Delete Steward TenantControlPlane
  5. Wait for VM cleanup
  6. Remove finalizer
  7. TenantCluster deleted
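
Because the finalizer holds the object until cleanup completes, deletion can be observed the same way as the other phases; the TenantCluster remains visible with a deletion timestamp until the last step:

```shell
# Watch the TenantCluster until it disappears after finalizer removal
kubectl get tenantcluster my-cluster -n team-a -w

# Confirm deletion is in progress (non-empty once destroy is issued)
kubectl get tenantcluster my-cluster -n team-a \
  -o jsonpath='{.metadata.deletionTimestamp}'
```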

Kubeconfig Access

```shell
# Get kubeconfig
butlerctl cluster kubeconfig my-cluster > my-cluster.yaml

# Use directly
butlerctl cluster kubeconfig my-cluster | kubectl --kubeconfig /dev/stdin get nodes
```

The kubeconfig is stored as a Secret in the management cluster and references the LoadBalancer endpoint for the hosted control plane.
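
If `butlerctl` is unavailable, the kubeconfig can be pulled straight from that Secret. The Secret name and data key below are assumptions based on the common CAPI convention (`<cluster>-kubeconfig` with key `value`); verify them in your management cluster first:

```shell
# Assumed Secret layout: name "my-cluster-kubeconfig", data key "value"
# (the CAPI convention) -- check with `kubectl describe secret` first
kubectl get secret my-cluster-kubeconfig -n team-a \
  -o jsonpath='{.data.value}' | base64 -d > my-cluster.yaml
```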

Status and Conditions

```yaml
status:
  phase: Ready
  controlPlaneEndpoint: "10.40.0.201:6443"
  workerNodesReady: 3
  workerNodesDesired: 3
  observedState:
    kubernetesVersion: "v1.30.0"
  conditions:
    - type: ControlPlaneReady
      status: "True"
    - type: WorkersReady
      status: "True"
    - type: AddonsReady
      status: "True"
```
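
These conditions make it easy for scripts to block until the cluster is usable, for example:

```shell
# Wait until the control plane and addons report ready
# (timeout values are illustrative)
kubectl wait tenantcluster/my-cluster -n team-a \
  --for=condition=ControlPlaneReady --timeout=15m
kubectl wait tenantcluster/my-cluster -n team-a \
  --for=condition=AddonsReady --timeout=30m
```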

See Also