Troubleshoot Cluster Provisioning
Cluster Stuck in "Provisioning"
Diagnosis:
kubectl get tenantcontrolplane -A
kubectl logs -n steward-system deploy/steward-controller --tail=100
kubectl get datastore -A
Solutions:
- DataStore unavailable -- Check the etcd pods backing the DataStore:
  kubectl describe datastore default
  kubectl get pods -n steward-system -l app=etcd
- Certificate issues -- Verify cert-manager is running and Certificate resources are in Ready state.
- Resource constraints -- Ensure the management cluster has enough CPU and memory for control plane pods. Each TenantControlPlane requires ~12 mCPU and ~6 MiB at idle, more under load.
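The certificate check can be narrowed to the resources that are not yet Ready. The following is a minimal sketch, assuming cert-manager's Certificate CRD is installed and that kubectl prints its default columns (NAMESPACE, NAME, READY, SECRET, AGE). The function name not_ready_certs is illustrative and not part of any tool.

```shell
# List cert-manager Certificates whose READY column is not "True".
# Assumes the default column layout: NAMESPACE NAME READY SECRET AGE.
# Prints one "namespace/name" per line; prints nothing if all are Ready.
not_ready_certs() {
  kubectl get certificate -A --no-headers |
    awk '$3 != "True" { print $1 "/" $2 }'
}
```

Calling not_ready_certs against the management cluster should print nothing when every Certificate is Ready. Any name it prints can be inspected further with kubectl describe certificate.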
Workers Not Joining
Diagnosis:
kubectl get cluster,machinedeployment,machine -A
kubectl describe machine -l cluster.x-k8s.io/cluster-name=<cluster-name>
kubectl logs -n butler-system deploy/butler-provider-<provider>
Solutions:
- VMs not starting -- Check VM status on the infrastructure:
  # Harvester
  kubectl --kubeconfig harvester.yaml get virtualmachine -A
- Bootstrap failing on the node -- For kubeadm-based OS types (Rocky), check cloud-init logs on the worker: /var/log/cloud-init-output.log. For Talos, run talosctl dmesg against the node IP.
- Node cannot reach API server -- Verify the control plane LoadBalancer Service has an IP and the worker node can reach it on port 6443.
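The API server reachability check has two parts: confirm the Service has an address, then test the port from the worker. The sketch below covers the first part. The function name lb_ip is hypothetical, and the namespace and Service name are placeholders, since the name of the control plane Service depends on your environment.

```shell
# Print the first LoadBalancer ingress IP of a Service.
# Returns non-zero if kubectl fails or no IP has been assigned yet.
# Arguments: <namespace> <service-name> (placeholders for your environment).
lb_ip() {
  addr="$(kubectl get svc -n "$1" "$2" \
    -o jsonpath='{.status.loadBalancer.ingress[0].ip}')" || return 1
  [ -n "$addr" ] && printf '%s\n' "$addr"
}

# On the management cluster:
#   lb_ip <cluster-namespace> <control-plane-service>
# Then, on the worker node, test the port with the printed IP (requires nc):
#   nc -vz <ip> 6443
```

If lb_ip fails, the Service has not been assigned an address and the problem lies with the load balancer rather than the worker's network path.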
Cluster Stuck in "Installing"
Workers are ready but addons fail to install.
Diagnosis:
kubectl get tenantaddon -n <cluster-namespace>
kubectl describe tenantaddon <name> -n <cluster-namespace>
kubectl logs -n butler-system deploy/butler-controller | grep addon
Solutions:
- Helm chart not found -- Verify the AddonDefinition chart repository URL is accessible from the management cluster.
- Invalid values -- Check TenantAddon events for Helm rendering errors. Validate addon values against the chart's values schema.
- Timeout -- Cilium and Traefik require working LoadBalancer Services. Ensure MetalLB has an address pool allocated. See Networking Troubleshooting.
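A quick way to confirm the timeout cause is to look for LoadBalancer Services that never received an address. This is a sketch only, assuming kubectl's default column layout for Services. The function name pending_lb_services is illustrative, and it should be run against whichever cluster hosts the addon Services.

```shell
# List LoadBalancer Services that have no external address yet.
# Assumes default columns: NAMESPACE NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE.
# Prints one "namespace/name" per line.
pending_lb_services() {
  kubectl get svc -A --no-headers |
    awk '$3 == "LoadBalancer" && $5 == "<pending>" { print $1 "/" $2 }'
}

# If anything is listed, check that an address pool exists.
# Assumes MetalLB is configured through its IPAddressPool CRD:
#   kubectl get ipaddresspool -A
```

An empty result suggests address allocation is working and the timeout has another cause.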
Kubeconfig Not Available
Symptoms: butlerctl cluster kubeconfig fails with "secret not found."
Solutions:
- Cluster not ready -- The kubeconfig Secret is created when the cluster reaches the Ready phase. Wait for provisioning to complete.
- RBAC permissions -- Verify your user has access to the team namespace containing the cluster.
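The RBAC check can be done with kubectl's built-in access review. This is a minimal sketch, assuming your kubectl context uses the same credentials as butlerctl. The function name can_read_secrets is hypothetical and the namespace is a placeholder.

```shell
# Report whether the current kubectl user may read Secrets in a namespace.
# Relies on "kubectl auth can-i" exiting 0 for yes and non-zero for no.
# Argument: <team-namespace> (placeholder).
can_read_secrets() {
  if kubectl auth can-i get secrets -n "$1" >/dev/null 2>&1; then
    echo "allowed"
  else
    echo "denied"
  fi
}
```

A result of "denied" points to a permissions problem rather than a missing Secret. Note that the function also reports "denied" if kubectl cannot reach the cluster at all.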