
Azure Provider Guide

Status: Stable. E2E validated for single-node and HA topologies.

Bootstrap a Butler management cluster on Microsoft Azure.

Overview

Butler uses a thin provider controller (butler-provider-azure) to provision VMs on Azure Compute. For HA topologies, the provider creates a Standard Load Balancer with a public IP, health probe, and load balancing rule on port 6443.

After bootstrap, the Azure Cloud Controller Manager (CCM) runs on the management cluster as an embedded Deployment (not Helm). It handles type: LoadBalancer services by creating Azure load balancers.
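
Once the CCM is running, any Service of type LoadBalancer will be reconciled into an Azure load balancer rule. A minimal hypothetical manifest (the name, selector, and ports are placeholders for illustration, not part of Butler):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: example-app        # hypothetical service, for illustration only
  namespace: default
spec:
  type: LoadBalancer       # the Azure CCM provisions an LB rule and public IP for this
  selector:
    app: example-app
  ports:
    - port: 80
      targetPort: 8080
```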


Prerequisites

Azure Subscription

  • An Azure subscription with sufficient quota
  • Microsoft.Compute and Microsoft.Network resource providers registered

Service Principal

A service principal with Contributor role on the resource group:

az ad sp create-for-rbac \
--name butler-bootstrap \
--role Contributor \
--scopes /subscriptions/SUBSCRIPTION_ID/resourceGroups/RESOURCE_GROUP

Save the appId (clientID), password (clientSecret), and tenant (tenantID) from the output.
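
The mapping from the create-for-rbac output fields to the providerConfig field names can be sketched as follows. The JSON below is placeholder output, not real credentials, and the sed extraction is just one way to pull the values out:

```shell
# Placeholder output from `az ad sp create-for-rbac` (not real credentials)
out='{"appId": "11111111-1111-1111-1111-111111111111", "password": "s3cret", "tenant": "22222222-2222-2222-2222-222222222222"}'

# appId -> clientID, password -> clientSecret, tenant -> tenantID
clientID=$(echo "$out"     | sed -n 's/.*"appId": "\([^"]*\)".*/\1/p')
clientSecret=$(echo "$out" | sed -n 's/.*"password": "\([^"]*\)".*/\1/p')
tenantID=$(echo "$out"     | sed -n 's/.*"tenant": "\([^"]*\)".*/\1/p')

echo "$clientID $tenantID"
```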

Networking

Pre-provision the following resources:

| Resource | Notes |
| --- | --- |
| Resource group | Must exist before bootstrap |
| VNet + subnet | Must exist in the resource group |
| Network Security Group (NSG) | Must exist. CCM manages NSG rules for LB services. |

Talos Image

A Talos Linux image must be available as a managed image or in a Shared Image Gallery.

Step 1: Download from the Talos Image Factory

wget https://factory.talos.dev/image/SCHEMATIC_ID/v1.12.5/azure-amd64.vhd.xz
xz -d azure-amd64.vhd.xz

Step 2: Upload to Azure

Upload as a managed image or add to a Shared Image Gallery. Note the full ARM resource ID for the imageURN config field. Examples:

  • Managed image: /subscriptions/.../resourceGroups/.../providers/Microsoft.Compute/images/talos-v1-12-5
  • Gallery image: /subscriptions/.../resourceGroups/.../providers/Microsoft.Compute/galleries/.../images/.../versions/1.12.5
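
A quick shell sanity check that an imageURN value is a full ARM resource ID rather than just an image name can look like this sketch (the URN below is a placeholder, not a real resource):

```shell
# Placeholder imageURN value to validate
urn="/subscriptions/0000/resourceGroups/butler-rg/providers/Microsoft.Compute/images/talos-v1-12-5"

# Full ARM resource IDs for compute images follow this path shape
case "$urn" in
  /subscriptions/*/resourceGroups/*/providers/Microsoft.Compute/*) result="ok" ;;
  *) result="not a full ARM resource ID" ;;
esac
echo "$result"
```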

Public IP Quota

Standard Public IPs have a quota that varies by subscription type. Restricted and free-tier subscriptions often default to 3 Standard public IPs per region.

| Topology | VMs | LBs | Total Public IPs |
| --- | --- | --- | --- |
| Single-node | 1 | 1 (console) | 2 |
| HA (3 CP + 2 W) | 5 | 1 (console) | 6 |

If your quota is below 6, HA bootstrap will fail with PublicIPCountLimitReached. Request a quota increase through the Azure portal under Subscription > Usage + quotas.
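
The public IP math in the table above (one public IP per VM, plus one for the console LB) can be sketched as a small helper for checking a planned topology against your quota:

```shell
# Required Standard public IPs = control-plane VMs + worker VMs + 1 (console LB),
# per the topology table above.
required_pips() { echo $(( $1 + $2 + 1 )); }

echo "single-node: $(required_pips 1 0)"   # 2
echo "ha:          $(required_pips 3 2)"   # 6
```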


Azure Setup

1. Create Service Principal

az ad sp create-for-rbac \
--name butler-bootstrap \
--role Contributor \
--scopes /subscriptions/SUB_ID/resourceGroups/butler-rg

2. Create Resource Group and VNet

az group create --name butler-rg --location eastus

az network vnet create \
--resource-group butler-rg \
--name butler-vnet \
--address-prefix 10.0.0.0/16 \
--subnet-name default \
--subnet-prefix 10.0.0.0/24

3. Create Network Security Group

az network nsg create \
--resource-group butler-rg \
--name butler-nsg

# Kubernetes API (external + LB health probes)
az network nsg rule create \
--resource-group butler-rg --nsg-name butler-nsg \
--name allow-apiserver --priority 100 \
--access Allow --protocol Tcp \
--destination-port-ranges 6443 \
--source-address-prefixes '*'

# Talos API (inter-node)
az network nsg rule create \
--resource-group butler-rg --nsg-name butler-nsg \
--name allow-talos --priority 200 \
--access Allow --protocol Tcp \
--destination-port-ranges 50000-50001 \
--source-address-prefixes 10.0.0.0/24

# etcd (inter-node)
az network nsg rule create \
--resource-group butler-rg --nsg-name butler-nsg \
--name allow-etcd --priority 300 \
--access Allow --protocol Tcp \
--destination-port-ranges 2379-2380 \
--source-address-prefixes 10.0.0.0/24

# kubelet API (inter-node)
az network nsg rule create \
--resource-group butler-rg --nsg-name butler-nsg \
--name allow-kubelet --priority 400 \
--access Allow --protocol Tcp \
--destination-port-ranges 10250 \
--source-address-prefixes 10.0.0.0/24

# Cilium health checks (inter-node)
az network nsg rule create \
--resource-group butler-rg --nsg-name butler-nsg \
--name allow-cilium-health --priority 500 \
--access Allow --protocol Tcp \
--destination-port-ranges 4240 \
--source-address-prefixes 10.0.0.0/24

# Cilium VXLAN overlay (inter-node)
az network nsg rule create \
--resource-group butler-rg --nsg-name butler-nsg \
--name allow-cilium-vxlan --priority 600 \
--access Allow --protocol Udp \
--destination-port-ranges 8472 \
--source-address-prefixes 10.0.0.0/24

# Associate NSG with subnet
az network vnet subnet update \
--resource-group butler-rg \
--vnet-name butler-vnet \
--name default \
--network-security-group butler-nsg

NSG Rules

| Priority | Protocol/Port | Source | Purpose |
| --- | --- | --- | --- |
| 100 | TCP 6443 | * | Kubernetes API (external + LB health probes from 168.63.129.16) |
| 200 | TCP 50000-50001 | Subnet CIDR | Talos API (apid + trustd) |
| 300 | TCP 2379-2380 | Subnet CIDR | etcd client and peer |
| 400 | TCP 10250 | Subnet CIDR | kubelet API |
| 500 | TCP 4240 | Subnet CIDR | Cilium health checks |
| 600 | UDP 8472 | Subnet CIDR | Cilium VXLAN overlay |

Bootstrap Configuration

Create a config file at ~/.butler/bootstrap-azure.yaml:

Single-Node

This config was used for E2E validation. Replace credentials, resource group, VNet, NSG, and image URN with your values.

provider: azure

cluster:
  name: butler-azure-test
  topology: single-node
  controlPlane:
    replicas: 1
    cpu: 4
    memoryMB: 16384
    diskGB: 100

network:
  podCIDR: "10.244.0.0/16"
  serviceCIDR: "10.96.0.0/12"

talos:
  version: v1.12.5

addons:
  cni:
    type: cilium
  storage:
    type: longhorn

providerConfig:
  azure:
    clientID: "YOUR_SERVICE_PRINCIPAL_APP_ID"
    clientSecret: "YOUR_SERVICE_PRINCIPAL_PASSWORD"
    tenantID: "YOUR_AZURE_AD_TENANT_ID"
    subscriptionID: "YOUR_SUBSCRIPTION_ID"
    resourceGroup: "butler-bootstrap-rg"
    location: "eastus"
    vnetName: "butler-bootstrap-vnet"
    subnetName: "default"
    securityGroupName: "butler-bootstrap-nsg"
    vmSize: "Standard_DC4s_v3"
    imageURN: "/subscriptions/YOUR_SUB_ID/resourceGroups/butler-bootstrap-rg/providers/Microsoft.Compute/galleries/butlerImageGallery/images/talos-linux/versions/1.12.5"

HA

provider: azure

cluster:
  name: butler-azure-ha
  topology: ha
  controlPlane:
    replicas: 3
    cpu: 4
    memoryMB: 16384
    diskGB: 100
  workers:
    replicas: 2
    cpu: 4
    memoryMB: 16384
    diskGB: 100

network:
  podCIDR: "10.244.0.0/16"
  serviceCIDR: "10.96.0.0/12"

talos:
  version: v1.12.5

addons:
  cni:
    type: cilium
  storage:
    type: longhorn

providerConfig:
  azure:
    clientID: "YOUR_SERVICE_PRINCIPAL_APP_ID"
    clientSecret: "YOUR_SERVICE_PRINCIPAL_PASSWORD"
    tenantID: "YOUR_AZURE_AD_TENANT_ID"
    subscriptionID: "YOUR_SUBSCRIPTION_ID"
    resourceGroup: "butler-bootstrap-rg"
    location: "eastus"
    vnetName: "butler-bootstrap-vnet"
    subnetName: "default"
    securityGroupName: "butler-bootstrap-nsg"
    vmSize: "Standard_DC4s_v3"
    imageURN: "/subscriptions/YOUR_SUB_ID/resourceGroups/butler-bootstrap-rg/providers/Microsoft.Compute/galleries/butlerImageGallery/images/talos-linux/versions/1.12.5"

Required fields:

  • securityGroupName -- required; without it the Azure CCM fails with securityGroupName is not configured.
  • imageURN -- full ARM resource ID. Managed image or Shared Image Gallery version.
  • vmSize -- Azure VM SKU. The E2E test used Standard_DC4s_v3.
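
A quick pre-flight check that the required fields above are present in the config can be sketched like this. The sample config below deliberately omits imageURN to show what a miss looks like; in practice you would point the check at your real bootstrap file:

```shell
# Write a sample config (imageURN intentionally missing) to a temp file
cfg=$(mktemp)
cat > "$cfg" <<'EOF'
providerConfig:
  azure:
    securityGroupName: "butler-nsg"
    vmSize: "Standard_DC4s_v3"
EOF

# Collect any required field that does not appear in the file
missing=""
for field in securityGroupName imageURN vmSize; do
  grep -q "$field:" "$cfg" || missing="$missing $field"
done
echo "missing:$missing"   # prints: missing: imageURN
rm -f "$cfg"
```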

Run Bootstrap

butleradm bootstrap azure --config ~/.butler/bootstrap-azure.yaml

Validation

export KUBECONFIG=~/.butler/butler-azure-test-kubeconfig

# All nodes Ready with providerID set
kubectl get nodes -o wide
kubectl get nodes -o jsonpath='{.items[*].spec.providerID}'
# Expected format: azure:///subscriptions/<sub>/resourceGroups/<rg>/providers/Microsoft.Compute/virtualMachines/<vm>

# Azure CCM Deployment running (embedded manifest)
kubectl get deploy -n kube-system | grep cloud

# Cilium running
kubectl get pods -n kube-system -l app.kubernetes.io/name=cilium

# Longhorn running
kubectl get pods -n longhorn-system

# Butler Console exposed via Azure LB
kubectl get svc butler-console-frontend -n butler-system

# Console accessible (use the EXTERNAL-IP from above)
curl http://<LB-IP>
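
The providerID format shown above can be split into its parts with plain parameter expansion, which is handy when scripting checks against the Azure CLI. A sketch using a placeholder providerID:

```shell
# Placeholder providerID in the documented azure:/// format
pid="azure:///subscriptions/0000/resourceGroups/butler-rg/providers/Microsoft.Compute/virtualMachines/butler-azure-test-cp-0"

path=${pid#azure://}               # strip the scheme -> /subscriptions/...
vm=${path##*/}                     # last path segment is the VM name
rg=$(echo "$path" | cut -d/ -f5)   # resourceGroups/<rg> occupies segments 4-5

echo "$rg $vm"   # butler-rg butler-azure-test-cp-0
```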

What You Have Now

A Butler management cluster running on Azure with:

  • Talos Linux VMs with Cilium CNI
  • Azure Standard Load Balancer fronting the Kubernetes API (HA topology)
  • Azure CCM handling LoadBalancer services
  • Longhorn distributed storage
  • Steward for hosted tenant control planes
  • Butler controller, CRDs, and web console exposed via Azure LB

To create your first tenant cluster, see Create Your First Tenant Cluster.


Cleanup

# Delete KIND bootstrap cluster
kind delete cluster --name butler-bootstrap

# Delete VMs
az vm list -g butler-rg \
--query "[?tags.\"butler_butlerlabs_dev_managed-by\"=='butler'].name" -o tsv \
| xargs -I{} az vm delete -g butler-rg -n {} --yes --force-deletion yes

# Wait ~180s for NICs to release, then delete NICs
az network nic list -g butler-rg \
--query "[?contains(name, 'CLUSTER_NAME')].name" -o tsv \
| xargs -I{} az network nic delete -g butler-rg -n {}

# Delete public IPs
az network public-ip list -g butler-rg \
--query "[?tags.\"butler-managed-by\"=='butler'].name" -o tsv \
| xargs -I{} az network public-ip delete -g butler-rg -n {}

# Delete load balancers
az network lb list -g butler-rg \
--query "[?tags.\"butler-managed-by\"=='butler'].name" -o tsv \
| xargs -I{} az network lb delete -g butler-rg -n {}

# Delete orphaned disks
az disk list -g butler-rg \
--query "[?contains(name, 'CLUSTER_NAME')].name" -o tsv \
| xargs -I{} az disk delete -g butler-rg -n {} --yes

# Delete availability set
az vm availability-set delete -g butler-rg -n CLUSTER_NAME-avset

Azure-Specific Details

AvailabilitySet

The provider controller auto-creates an AvailabilitySet named <clusterName>-avset with Aligned SKU (Fault Domains: 2, Update Domains: 5). This is required because the Azure CCM with vmType=standard uses AvailabilitySet membership to determine which VMs belong to the cluster for LB backend pool management.

CCM Backend Pool Bug

Azure CCM v1.31 with vmType=standard does not auto-populate LB backend pools. The initial backend sync runs before any LoadBalancer services exist, and no subsequent sync is triggered after the console service is created.

Butler works around this in bootstrap step 12.5 by directly calling the Azure REST API to:

  1. Get the LB backend pool resource ID
  2. For each cluster node: read the NIC, add the backend pool reference to ipConfigurations[0].properties.loadBalancerBackendAddressPools, and write it back

This is handled automatically during bootstrap.
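
For reference, the per-NIC fix can also be expressed with the Azure CLI's address-pool subcommand. This sketch only builds and prints the commands (all names below are placeholders, and Butler itself uses the REST API, not the CLI):

```shell
# Placeholder resource names for a 3-node control plane
rg="butler-rg"; lb="butler-azure-ha-lb"; pool="butler-azure-ha-pool"

cmds=""
for nic in butler-azure-ha-cp-0-nic butler-azure-ha-cp-1-nic butler-azure-ha-cp-2-nic; do
  # Adds the LB backend pool reference to the NIC's first IP configuration
  cmds="${cmds}az network nic ip-config address-pool add -g $rg --nic-name $nic --ip-config-name ipconfig1 --lb-name $lb --address-pool $pool
"
done
printf '%s' "$cmds"
```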

Tag Key Sanitization

Azure tag keys cannot contain forward slashes (/). Butler labels like butler.butlerlabs.dev/team are sanitized by replacing / with _. Load balancer and public IP resources use simplified tag keys like butler-managed-by instead.
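
The sanitization described above amounts to a single character substitution, sketched here on an example label key:

```shell
# Butler label key containing a slash, which Azure tag keys disallow
key="butler.butlerlabs.dev/team"

# Replace / with _ to produce a valid Azure tag key
sanitized=$(printf '%s' "$key" | tr '/' '_')
echo "$sanitized"   # butler.butlerlabs.dev_team
```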

SSH Key Placeholder

Azure requires an SSH public key for Linux VM creation, even though Talos ignores SSH. The provider generates a placeholder RSA-4096 key at runtime.


Troubleshooting

NSG Rules Blocking Traffic

Symptom: Talos bootstrap times out. Nodes cannot communicate.

az network nsg rule list \
--resource-group butler-rg \
--nsg-name butler-nsg \
--output table

Verify all six rules exist, including Cilium ports (4240, 8472).

Public IP Quota Exceeded

Symptom: PublicIPCountLimitReached error in provider logs.

az vm list-usage --location eastus --output table | grep "Public IP"

Standard Public IPs default to 3 per region on restricted subscriptions. Single-node needs 2 (1 VM + 1 console LB). HA needs 6 (5 VMs + 1 LB). Self-service quota increase (az quota create) may fail. Request an increase through the Azure portal support ticket flow.

CCM LB Backend Pool Empty

Symptom: Console LB exists but has no healthy backends. The service external IP is assigned but curl times out.

Check CCM logs:

kubectl logs -n kube-system deploy/cloud-controller-manager

If you see managed=, ok=false, isNodeManagedByCloudProvider=true without subsequent backend sync, the CCM backend pool bug has occurred. Butler's step 12.5 handles this automatically, but if it fails, check the bootstrap controller logs for Azure REST API errors.

Service Principal Expired

Symptom: AADSTS7000215: Invalid client secret.

az ad sp credential reset --id APP_ID

Update the secret in your bootstrap config.

securityGroupName Missing

Symptom: CCM logs show securityGroupName is not configured.

The securityGroupName field is required in the Azure provider config. Without it, the CCM cannot create NSG rules for LoadBalancer services. Add it to your config:

providerConfig:
  azure:
    securityGroupName: "butler-nsg"

See Also