IPAM Internals

This document covers the implementation of Butler's IP Address Management subsystem: the bitmap allocator, controller interactions, demand-driven elastic allocation, and garbage collection. For a user-facing introduction to IPAM concepts, see Concepts: Networking.

The subsystem consists of three CRDs and four cooperating controllers.

Key design principles:

Single allocator: The NetworkPool controller is the sole writer of IPAllocation status. This eliminates race conditions without distributed locking.
Demand-driven allocation: Elastic growth fires when tenant LB Services are Pending without an IP. Shrink fires when allocated IPs have no matching Service for a sustained grace period. No speculative arithmetic.
Best-fit allocation: The bitmap allocator selects the smallest free block that satisfies each request, reducing fragmentation over the pool's lifetime.
Management authoritative: IPAllocation CRs on the management cluster are the desired state. MetalLB pools on tenants are projections. Drift is corrected on every sync.
Three-layer cleanup: TenantCluster deletion, IPAllocation finalizers, and orphan garbage collection ensure IP addresses are always returned to the pool.
Cloud-native bypass: Cloud providers skip the entire IPAM subsystem. When spec.network.mode is cloud, the TenantCluster controller returns early and the cloud provider's native LoadBalancer handles IP assignment.

CRD Resources

NetworkPool

A NetworkPool defines a block of IP addresses available for allocation to tenant clusters. It is a namespaced resource (typically created in butler-system) that tracks capacity, fragmentation, and allocation count.

API Group: butler.butlerlabs.dev/v1alpha1; Scope: Namespaced; Short Name: np

apiVersion: butler.butlerlabs.dev/v1alpha1
kind: NetworkPool
metadata:
  name: lab-pool
  namespace: butler-system
spec:
  # The full CIDR block owned by this pool
  cidr: "10.40.0.0/21"

  # Ranges excluded from tenant allocation (e.g., management cluster, gateways)
  reserved:
    - cidr: "10.40.0.0/28"
      description: "Management cluster nodes and VIP"
    - cidr: "10.40.0.16/28"
      description: "Management cluster MetalLB pool"

  # Optional: constrain tenant allocations to a subset of the CIDR
  tenantAllocation:
    start: "10.40.1.0"
    end: "10.40.7.254"
    defaults:
      nodesPerTenant: 5    # Default node IPs per tenant (if IPAllocation.spec.count is unset)
      lbPoolPerTenant: 8   # Default LB IPs per tenant (if IPAllocation.spec.count is unset)

NetworkPool Status

The status is computed by the NetworkPool controller on every reconciliation cycle:

status:
  totalIPs: 1774          # Usable IPs (total minus reserved)
  allocatedIPs: 48        # IPs assigned to active IPAllocations
  availableIPs: 1726      # totalIPs - allocatedIPs
  allocationCount: 6      # Number of active IPAllocations
  fragmentationPercent: 12 # 0 = contiguous free space, 100 = maximally fragmented
  largestFreeBlock: 1680  # Largest contiguous block available
  observedGeneration: 2
  conditions:
    - type: Ready
      status: "True"
      reason: Ready
      message: "1726/1774 IPs available (6 allocations)"
    - type: CapacityWarning
      status: "False"
      reason: UtilizationBelowThreshold
      message: "Pool utilization is 3% (48/1774 IPs)"
    - type: CapacityCritical
      status: "False"
      reason: UtilizationBelowThreshold
      message: "Pool utilization is 3% (48/1774 IPs)"
    - type: CapacityExhausted
      status: "False"
      reason: UtilizationBelowThreshold
      message: "Pool utilization is 3% (48/1774 IPs)"

Spec Fields

Field	Type	Description
`spec.cidr`	string	CIDR notation for the pool's address space (e.g., `10.40.0.0/21`)
`spec.reserved[]`	array	Ranges excluded from allocation
`spec.reserved[].cidr`	string	Reserved range in CIDR notation
`spec.reserved[].description`	string	Human-readable reason for the reservation
`spec.tenantAllocation`	object	Optional: constrains tenant allocations to a sub-range
`spec.tenantAllocation.start`	string	First allocatable IP
`spec.tenantAllocation.end`	string	Last allocatable IP
`spec.tenantAllocation.defaults.nodesPerTenant`	int32	Default node IP count per tenant (default: 5)
`spec.tenantAllocation.defaults.lbPoolPerTenant`	int32	Default LB IP count per tenant (default: 8)

Status Fields

Field	Type	Description
`status.totalIPs`	int32	Total usable IPs (excludes reserved)
`status.allocatedIPs`	int32	IPs currently allocated
`status.availableIPs`	int32	IPs available for new allocations
`status.allocationCount`	int32	Number of active IPAllocations
`status.fragmentationPercent`	int32	Free space fragmentation (0-100)
`status.largestFreeBlock`	int32	Size of largest contiguous free block
`status.conditions[]`	[]Condition	Standard Kubernetes conditions (see Capacity Conditions)
`status.observedGeneration`	int64	Last observed generation

IPAllocation

An IPAllocation represents a request for (and eventual assignment of) a contiguous block of IP addresses from a NetworkPool. It is created by the TenantCluster controller and fulfilled by the NetworkPool controller.

API Group: butler.butlerlabs.dev/v1alpha1; Scope: Namespaced; Short Name: ipa

apiVersion: butler.butlerlabs.dev/v1alpha1
kind: IPAllocation
metadata:
  name: team-platform-prod-cluster-lb
  namespace: butler-system
  labels:
    butler.butlerlabs.dev/team: team-platform
    butler.butlerlabs.dev/tenant: prod-cluster
    butler.butlerlabs.dev/network-pool: lab-pool
    butler.butlerlabs.dev/allocation-type: loadbalancer
    butler.butlerlabs.dev/allocation-role: initial
spec:
  poolRef:
    name: lab-pool
  tenantClusterRef:
    name: prod-cluster
    namespace: team-platform
  type: loadbalancer    # "nodes" or "loadbalancer"
  count: 8              # Optional; defaults to pool's tenantAllocation.defaults

IPAllocation with Pinned Range

For cases where a specific IP range is required (for example, to preserve stable addresses across recreation):

apiVersion: butler.butlerlabs.dev/v1alpha1
kind: IPAllocation
metadata:
  name: team-platform-prod-cluster-lb
  namespace: butler-system
spec:
  poolRef:
    name: lab-pool
  tenantClusterRef:
    name: prod-cluster
    namespace: team-platform
  type: loadbalancer
  pinnedRange:
    startAddress: "10.40.2.0"
    endAddress: "10.40.2.7"

IPAllocation Lifecycle

Phase	Description
`Pending`	Created by TenantCluster controller, awaiting fulfillment
`Allocated`	NetworkPool controller assigned an IP range
`Failed`	Allocation could not be fulfilled (pool exhausted, conflict)
`Released`	Deletion in progress; audit timestamp recorded

Spec Fields

Field	Type	Description
`spec.poolRef`	LocalObjectReference	Name of the NetworkPool to allocate from
`spec.tenantClusterRef`	NamespacedObjectReference	The TenantCluster this allocation serves
`spec.type`	string	`nodes` or `loadbalancer`
`spec.count`	*int32	Number of IPs requested (min: 1, optional)
`spec.pinnedRange`	object	Request a specific range instead of best-fit
`spec.pinnedRange.startAddress`	string	First IP of the pinned range
`spec.pinnedRange.endAddress`	string	Last IP of the pinned range

Status Fields

Field	Type	Description
`status.phase`	string	Current lifecycle phase
`status.cidr`	string	Allocated range in CIDR or `start-end` format
`status.startAddress`	string	First IP in the allocated range
`status.endAddress`	string	Last IP in the allocated range
`status.addresses[]`	[]string	All individual IPs in the allocated range
`status.allocatedCount`	int32	Number of IPs allocated
`status.allocatedAt`	*Time	Timestamp of allocation
`status.allocatedBy`	string	Controller that performed the allocation
`status.releasedAt`	*Time	Timestamp of release (audit trail)
`status.conditions[]`	[]Condition	Standard Kubernetes conditions

ProviderConfig Network Configuration

The ProviderConfig's spec.network section configures IPAM behavior for all tenant clusters using that provider.

apiVersion: butler.butlerlabs.dev/v1alpha1
kind: ProviderConfig
metadata:
  name: harvester-prod
  namespace: butler-system
spec:
  provider: harvester
  credentialsRef:
    name: harvester-kubeconfig

  network:
    # Networking mode: "ipam" for Butler-managed, "cloud" for provider-native
    mode: ipam

    # Ordered list of NetworkPools (lower priority value = tried first)
    poolRefs:
      - name: lab-pool-primary
        priority: 0
      - name: lab-pool-secondary
        priority: 10

    # Layer 2/3 network settings for provisioned VMs
    subnet: "10.40.0.0/21"
    gateway: "10.40.0.1"
    dnsServers:
      - "10.40.0.2"
      - "10.40.0.3"

    # LoadBalancer allocation configuration
    loadBalancer:
      defaultPoolSize: 8         # Static mode: IPs per tenant (default: 8)
      allocationMode: static     # "static" or "elastic" (default: static)
      initialPoolSize: 2         # Elastic mode: starting IPs (default: 2)
      growthIncrement: 2         # Elastic mode: IPs added per growth event (default: 2)

    # Per-tenant IP limits
    quotaPerTenant:
      maxNodeIPs: 20
      maxLoadBalancerIPs: 32

Network Field Reference

Field	Type	Default	Description
`mode`	string	`cloud`	`ipam` for Butler-managed IPAM, `cloud` for provider-native
`poolRefs[]`	array	-	Ordered list of NetworkPool references
`poolRefs[].name`	string	-	NetworkPool name
`poolRefs[].priority`	int32	0	Lower value = higher priority
`subnet`	string	-	Network subnet for VM provisioning
`gateway`	string	-	Default gateway
`dnsServers[]`	[]string	-	DNS server addresses
`loadBalancer.defaultPoolSize`	int32	8	IPs allocated per tenant in static mode
`loadBalancer.allocationMode`	string	`static`	`static` (fixed) or `elastic` (auto-scaling)
`loadBalancer.initialPoolSize`	int32	2	Starting IPs per tenant in elastic mode
`loadBalancer.growthIncrement`	int32	2	IPs added per elastic growth event
`quotaPerTenant.maxNodeIPs`	*int32	-	Maximum node IPs per tenant (unset = unlimited)
`quotaPerTenant.maxLoadBalancerIPs`	*int32	-	Maximum LB IPs per tenant (unset = unlimited)

Controllers

Four controllers cooperate to manage IP allocation:

Controller	Package	Responsibility
NetworkPool	`internal/controller/networkpool/`	Sole allocator. Processes Pending IPAllocations using best-fit bitmap. Computes pool status. Runs orphan GC.
IPAllocation	`internal/controller/ipallocation/`	Thin lifecycle. Adds finalizer, sets initial Pending phase. On deletion: sets Released phase with timestamp, removes finalizer.
TenantCluster	`internal/controller/tenantcluster/`	Creates IPAllocations during provisioning. Runs demand-driven elastic IPAM on Ready clusters. Syncs MetalLB pools. Cleans up allocations on deletion.
ProviderConfig	`internal/controller/providerconfig/`	Validates pool availability for IPAM mode. Estimates tenant capacity from available IPs.

Controller Interaction

Watch Relationships

The TenantCluster controller watches IPAllocation resources in addition to its primary resources. When an IPAllocation transitions from Pending to Allocated, the watch triggers a TenantCluster reconcile within seconds, rather than waiting for the timer-based requeue.

Reconciliation Intervals

Controller	Normal Requeue	Special Cases
NetworkPool	60 seconds	5 seconds after processing pending allocations or GC
IPAllocation (Pending)	15 seconds	Backstop; primary fulfillment is event-driven
IPAllocation (Failed)	30 seconds	Backstop retry
IPAllocation (Allocated)	5 minutes	Health check
TenantCluster (non-Ready)	30 seconds	-
TenantCluster (Ready, < 1h old)	1 minute	Elastic IPAM runs on each reconcile
TenantCluster (Ready, 1-24h old)	5 minutes	-
TenantCluster (Ready, > 24h old)	15 minutes	-
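
The TenantCluster rows of the table can be read as a function of readiness and cluster age. The Go sketch below is illustrative only: the function name, the constant names, and the treatment of the exact 1h and 24h boundaries are assumptions, not taken from the controller source.

```go
package main

import (
	"fmt"
	"time"
)

// Age boundaries and requeue intervals from the table above.
// The constant names are hypothetical.
const (
	freshWindow   = time.Hour      // "< 1h old"
	settledWindow = 24 * time.Hour // "> 24h old"

	requeueNotReady = 30 * time.Second
	requeueFresh    = time.Minute
	requeueRecent   = 5 * time.Minute
	requeueSettled  = 15 * time.Minute
)

// tenantRequeueInterval maps a cluster's readiness and age to a requeue
// interval. Boundary handling at exactly 1h and 24h is an assumption.
func tenantRequeueInterval(ready bool, age time.Duration) time.Duration {
	switch {
	case !ready:
		return requeueNotReady
	case age < freshWindow:
		return requeueFresh
	case age <= settledWindow:
		return requeueRecent
	default:
		return requeueSettled
	}
}

func main() {
	// A Ready cluster that is 90 minutes old falls in the 1-24h tier.
	fmt.Println(tenantRequeueInterval(true, 90*time.Minute))
}
```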

Allocation Flow

Static IPAM

Static IPAM allocates a fixed number of LoadBalancer IPs when a TenantCluster is created. The allocation size does not change for the lifetime of the cluster.

Step-by-step:

A TenantCluster CR is created. The TenantCluster controller calls reconcileIPAllocation().
The controller checks ProviderConfig.spec.network.mode. If not ipam, it returns immediately.
getInitialLBPoolSize() determines the allocation size using this precedence:
- TenantCluster spec.networking.lbPoolSize override
- ProviderConfig spec.network.loadBalancer.defaultPoolSize
- Fallback: 8
The count is clamped to quotaPerTenant.maxLoadBalancerIPs if set.
The controller iterates through spec.network.poolRefs in priority order (lower value = higher priority).
For each pool, it checks pool.status.availableIPs >= lbCount. On the first pool with capacity, it creates an IPAllocation with standard labels.
The IPAllocation controller adds a finalizer and sets the phase to Pending.
The IPAllocation creation triggers the NetworkPool controller via a watch. The NetworkPool controller builds a bitmap, sorts pending allocations by creation timestamp (FIFO), and runs the best-fit allocator.
On success, the IPAllocation status is updated with the allocated range. On failure (pool exhausted), the phase is set to Failed.
On the next TenantCluster reconcile, reconcileIPAllocation() sees the Allocated phase and returns (true, nil). The controller then installs MetalLB on the tenant cluster with the allocated address range.
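
The size precedence and the pool selection from the steps above can be sketched in Go. This is a simplified illustration: the type PoolRef and the function names and signatures are assumptions (the real getInitialLBPoolSize() signature is not shown in this document), and nil pointers stand in for unset fields.

```go
package main

import (
	"fmt"
	"sort"
)

// PoolRef is a hypothetical flattened view of one spec.network.poolRefs
// entry together with the referenced pool's status.availableIPs.
type PoolRef struct {
	Name         string
	Priority     int32
	AvailableIPs int32
}

// initialLBPoolSize applies the documented precedence: TenantCluster
// override, then ProviderConfig defaultPoolSize, then the fallback of 8.
// The result is clamped to maxLoadBalancerIPs when a quota is set.
func initialLBPoolSize(tcOverride, providerDefault, maxLBIPs *int32) int32 {
	size := int32(8)
	switch {
	case tcOverride != nil:
		size = *tcOverride
	case providerDefault != nil:
		size = *providerDefault
	}
	if maxLBIPs != nil && size > *maxLBIPs {
		size = *maxLBIPs
	}
	return size
}

// selectPool returns the first pool, in ascending priority order, whose
// available IP count covers the request.
func selectPool(refs []PoolRef, lbCount int32) (string, bool) {
	sorted := append([]PoolRef(nil), refs...)
	sort.SliceStable(sorted, func(i, j int) bool {
		return sorted[i].Priority < sorted[j].Priority
	})
	for _, p := range sorted {
		if p.AvailableIPs >= lbCount {
			return p.Name, true
		}
	}
	return "", false
}

// ptr is a small helper for building optional int32 values.
func ptr(v int32) *int32 { return &v }

func main() {
	size := initialLBPoolSize(nil, ptr(16), ptr(12)) // clamped to 12
	pool, ok := selectPool([]PoolRef{
		{Name: "lab-pool-secondary", Priority: 10, AvailableIPs: 500},
		{Name: "lab-pool-primary", Priority: 0, AvailableIPs: 4},
	}, size)
	fmt.Println(size, pool, ok)
}
```

In this example the primary pool is tried first but has only 4 free IPs, so the request falls through to the secondary pool.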

Elastic IPAM (Demand-Driven)

Elastic IPAM starts with a small initial allocation and grows or shrinks based on observed LoadBalancer Service demand on the tenant cluster. Growth fires when a Service is stuck Pending without an IP. Shrink fires when allocated IPs have no matching Service for a sustained grace period.

Configuration:

spec:
  network:
    mode: ipam
    loadBalancer:
      allocationMode: elastic
      initialPoolSize: 2       # Start with 2 IPs
      growthIncrement: 1       # Add 1 IP per growth event
    quotaPerTenant:
      maxLoadBalancerIPs: 8    # Hard cap

Demand-driven growth

Growth is triggered by observed demand on the tenant cluster, not by arithmetic projections.

On each reconcile of a Ready cluster with elastic IPAM enabled:

reconcileElasticIPAM() lists all LB IPAllocations for the tenant.
It connects to the tenant cluster and builds a Service inventory: all Services of type LoadBalancer, their external IPs, and their age.
If any Service has been Pending without an externalIP for longer than 30 seconds, growth fires — but only after accounting for in-flight supply. Growth allocations that are still Pending (awaiting fulfillment by the NetworkPool controller) or Allocated but not yet consumed by any Service (MetalLB propagation in progress) are subtracted from the demand count. This prevents redundant growth when a watch-triggered reconcile fires shortly after creating a growth allocation.
Batch assessment: The controller counts all Pending Services at once, subtracts in-flight supply, and creates enough growth allocations to cover the remaining demand. If three Services are Pending and one growth allocation is already in flight, growth creates allocations for the remaining two.
Each growth allocation is quota-checked (totalAllocated + growthIncrement <= maxLoadBalancerIPs) and capacity-checked against the pool.
Growth allocations are labeled allocation-role: growth and named {namespace}-{name}-lb-{N}.

The 30-second age threshold prevents the controller from racing with MetalLB. When a Service is first created, MetalLB may take a few seconds to assign an IP from the existing pool. The controller waits to confirm that MetalLB has no free IPs before triggering growth.
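
The batch assessment described above can be sketched as a pure function. This is a minimal illustration under stated assumptions: the type LBService, the function name, and the choice to count in-flight supply in IPs are hypothetical, totalAllocated is assumed to include in-flight allocations, and the pool capacity check is omitted.

```go
package main

import (
	"fmt"
	"time"
)

// pendingThreshold is the 30-second age a Pending Service must exceed
// before it counts as demand.
const pendingThreshold = 30 * time.Second

// LBService is a hypothetical, simplified view of a tenant Service of
// type LoadBalancer.
type LBService struct {
	ExternalIP string        // empty while the Service is Pending
	Age        time.Duration // time since the Service was created
}

// growthAllocationsNeeded returns how many growth allocations to create.
// It counts Pending Services older than the threshold, subtracts IPs that
// are already in flight, and stops as soon as the quota check
// totalAllocated + growthIncrement <= maxLBIPs would fail.
func growthAllocationsNeeded(services []LBService, inFlightIPs, totalAllocated, growthIncrement int32, maxLBIPs *int32) int32 {
	if growthIncrement <= 0 {
		return 0
	}
	var pending int32
	for _, s := range services {
		if s.ExternalIP == "" && s.Age > pendingThreshold {
			pending++
		}
	}
	remaining := pending - inFlightIPs
	var created int32
	for remaining > 0 {
		if maxLBIPs != nil && totalAllocated+growthIncrement > *maxLBIPs {
			break // quota reached: skip further growth
		}
		totalAllocated += growthIncrement
		remaining -= growthIncrement
		created++
	}
	return created
}

func main() {
	quota := int32(8)
	services := []LBService{
		{ExternalIP: "10.40.1.0", Age: time.Hour},
		{Age: 45 * time.Second},
		{Age: 50 * time.Second},
		{Age: 2 * time.Minute},
		{Age: 5 * time.Second}, // too young to count as demand
	}
	// Three Pending Services, one IP already in flight: two more allocations.
	fmt.Println(growthAllocationsNeeded(services, 1, 3, 1, &quota))
}
```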

Measured timing: In production, a Pending Service triggers a growth allocation within one reconcile cycle. The end-to-end latency from Service creation to IP assignment was measured at 37 seconds. That interval covers the controller detecting the Pending Service, creating the growth IPAllocation, the NetworkPool controller fulfilling it, the MetalLB pool being updated with the new range, and MetalLB assigning the IP.

Demand-driven shrink

Shrink releases growth allocations whose IPs are not in use by any tenant LB Service.

For each growth allocation: check whether any IP in the allocation's range matches an externalIP on a tenant LB Service.
If no Service is using any IP from the allocation, and the allocation has been in this unused state for longer than the 10-minute grace period, the allocation is deleted.
updateMetalLBPool() then syncs the MetalLB pool on the tenant to reflect the reduced address set.

Shrink protections:

Initial allocations are never shrunk. The allocation labeled allocation-role: initial is always preserved regardless of usage.
Pinned allocations are never shrunk. Any allocation with a spec.pinnedRange is preserved regardless of role label or Service usage.
Grace period prevents thrashing. A growth allocation must have no matching Service for 10 continuous minutes before it is eligible for release. Transient Service restarts do not trigger shrink.
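
The shrink rules and protections combine into a single eligibility check. The Go sketch below is an illustration, not the controller's code: the Allocation type, the function name, and the way unused time is passed in are assumptions.

```go
package main

import (
	"fmt"
	"time"
)

// shrinkGracePeriod is the 10-minute window an allocation must stay unused.
const shrinkGracePeriod = 10 * time.Minute

// Allocation is a hypothetical, simplified view of an LB IPAllocation.
type Allocation struct {
	Name      string
	Role      string   // value of the allocation-role label
	Pinned    bool     // true when spec.pinnedRange is set
	Addresses []string // status.addresses
}

// eligibleForShrink reports whether an allocation may be released.
// serviceIPs holds the external IPs of tenant LB Services; unusedFor is
// how long the allocation has had no matching Service.
func eligibleForShrink(a Allocation, serviceIPs map[string]bool, unusedFor time.Duration) bool {
	if a.Role != "growth" || a.Pinned {
		return false // initial and pinned allocations are never shrunk
	}
	for _, ip := range a.Addresses {
		if serviceIPs[ip] {
			return false // at least one IP is still in use
		}
	}
	return unusedFor > shrinkGracePeriod
}

func main() {
	growth := Allocation{
		Name:      "team-platform-prod-cluster-lb-1",
		Role:      "growth",
		Addresses: []string{"10.40.1.8"},
	}
	inUse := map[string]bool{"10.40.1.0": true}
	fmt.Println(eligibleForShrink(growth, inUse, 11*time.Minute))
}
```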

Measured timing: After test Services were deleted, growth allocations were released at the 9-minute mark (the allocations were 37 seconds old when the Services were deleted, so the total age crossed the 10-minute threshold at ~9m23s, caught on the next reconcile).

MetalLB multi-range support

When elastic IPAM produces multiple allocations, the tenant cluster's MetalLB IPAddressPool contains multiple entries:

apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: default-pool
  namespace: metallb-system
spec:
  addresses:
    - "10.40.1.0-10.40.1.1"    # Initial allocation
    - "10.40.1.8-10.40.1.8"    # Growth allocation

The MetalLB sync uses server-side apply with the butler-controller/ipam field manager, which overwrites any manual edits to default-pool on the tenant cluster. Operators who need custom MetalLB pools should create additional IPAddressPool resources with different names.
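
The addresses list is a projection of the tenant's IPAllocations. A minimal Go sketch of that projection follows; the Allocation type and the function name are hypothetical, and the ordering (input order) is an assumption.

```go
package main

import "fmt"

// Allocation is a hypothetical, simplified view of an IPAllocation status.
type Allocation struct {
	Phase        string
	StartAddress string
	EndAddress   string
}

// metalLBAddresses builds the spec.addresses entries for the tenant's
// IPAddressPool from fulfilled allocations, one start-end range each.
func metalLBAddresses(allocs []Allocation) []string {
	addresses := make([]string, 0, len(allocs))
	for _, a := range allocs {
		if a.Phase != "Allocated" {
			continue // unfulfilled allocations contribute no addresses
		}
		addresses = append(addresses, a.StartAddress+"-"+a.EndAddress)
	}
	return addresses
}

func main() {
	fmt.Println(metalLBAddresses([]Allocation{
		{Phase: "Allocated", StartAddress: "10.40.1.0", EndAddress: "10.40.1.1"},
		{Phase: "Allocated", StartAddress: "10.40.1.8", EndAddress: "10.40.1.8"},
	}))
}
```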

End-to-end demand-driven sequence

Cloud Provider Bypass

When a ProviderConfig uses mode: cloud (the default), the entire IPAM subsystem is bypassed:

reconcileIPAllocation() returns (true, nil) immediately.
isElasticIPAM() returns false.
No NetworkPool, IPAllocation, or MetalLB resources are created.
The cloud provider's native LoadBalancer implementation handles IP assignment.

This means cloud-hosted Butler deployments (AWS, Azure, GCP) use the cloud's existing LoadBalancer controllers with no additional configuration.

Design Evolution

Butler's IPAM originally used speculative arithmetic to decide when to grow and shrink elastic allocations. The controller computed availableIPs = totalAllocated - platformServiceCount - tenantServiceCount and triggered growth when availableIPs < 1 or shrink when availableIPs >= growthIncrement.

This approach had a stable oscillation bug. With growthIncrement=1 and all allocated IPs in use (e.g., 2 IPs allocated, 1 platform LB + 1 tenant LB), availableIPs = 0 triggered growth. After the growth allocation was fulfilled, availableIPs = 1, which equaled growthIncrement, so shrink triggered. After shrink, availableIPs = 0 again. This cycle repeated every reconcile interval (1-15 minutes depending on cluster age), creating continuous IPAllocation churn on 5 of 8 production tenant clusters.
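
The oscillation can be reproduced with a few lines of arithmetic. The sketch below reimplements only the rule quoted above; the function name legacyStep is hypothetical and the code is not the removed controller logic itself.

```go
package main

import "fmt"

// legacyStep applies the old speculative rule once and returns the new
// total of allocated IPs: grow when availableIPs < 1, shrink when
// availableIPs >= growthIncrement.
func legacyStep(totalAllocated, platformSvcs, tenantSvcs, growthIncrement int) int {
	available := totalAllocated - platformSvcs - tenantSvcs
	switch {
	case available < 1:
		return totalAllocated + growthIncrement
	case available >= growthIncrement:
		return totalAllocated - growthIncrement
	}
	return totalAllocated
}

func main() {
	total := 2 // 2 IPs allocated, 1 platform LB + 1 tenant LB
	for i := 0; i < 4; i++ {
		total = legacyStep(total, 1, 1, 1)
		fmt.Println(total) // alternates between 3 and 2
	}
}
```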

The fix was to replace speculative arithmetic with observed demand. Growth now fires only when a real LB Service is stuck Pending without an IP. Shrink fires only when allocated IPs have no matching Service for a sustained period. The speculative computation was removed entirely.

The demand-driven approach eliminates the oscillation because there is no arithmetic equilibrium to destabilize. Growth requires a concrete signal (Pending Service), and shrink requires sustained absence of demand (no matching Service for 10 minutes). At rest, with no Pending Services and all allocated IPs either in use or within the grace period, the controller takes no action.

For the full design rationale, see ADR-016: Demand-Driven IPAM.

Authority Model

IPAllocation CRs on the management cluster are the desired state for IP ranges. MetalLB IPAddressPools on tenant clusters are projections of that state. If they disagree, the controller corrects the tenant to match management.

Management writes, tenant reads. The management cluster decides which IPs to allocate, from which pool, subject to which quotas. The tenant cluster tells management "I need IPs" (via a Pending Service); management decides which IPs and tells the tenant (via MetalLB pool update).

Drift correction. On every MetalLB sync, the controller computes the expected pool state from IPAllocations and applies it to the tenant via server-side apply with Force: true. If the tenant pool has been manually edited, the edit is overwritten. This happens automatically on every elastic IPAM reconcile.

Operators who need custom MetalLB pools on a tenant should create additional IPAddressPool resources with different names. Do not modify default-pool directly; the controller will revert the change.

Best-Fit Bitmap Allocator

The allocator lives in internal/ipam/allocator.go and is a pure-function library with no Kubernetes dependencies. The NetworkPool controller is the sole caller.

How It Works

BuildBitmap: Creates a boolean array representing the allocatable IP range. Each element corresponds to one IP address. true = used (reserved or allocated), false = free.
findFreeBlocks: Scans the bitmap linearly to find all contiguous runs of false values. Returns a list of FreeBlock{StartOffset, EndOffset, Size}.
AllocateRange (best-fit): Iterates through free blocks and selects the smallest block that can satisfy the requested count. Allocates from the start of the selected block.
AllocatePinnedRange: Validates that the requested start-end range falls within the allocatable range, then checks every bit in the bitmap to confirm no overlap with reserved or existing allocations.
ComputeFragmentation: Calculates 1 - (largestFreeBlock / totalFreeIPs) as a percentage. A single contiguous free block yields 0% fragmentation. Many small scattered blocks approach 100%.
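
The core of the technique fits in a short Go sketch. It is a simplified illustration under stated assumptions, not the contents of internal/ipam/allocator.go: the function signatures are hypothetical, the bitmap is addressed by offset rather than by IP, and the fragmentation percentage is truncated because the rounding rule is not documented here.

```go
package main

import "fmt"

// FreeBlock describes one contiguous run of free addresses as offsets
// into the bitmap.
type FreeBlock struct {
	StartOffset, EndOffset, Size int
}

// findFreeBlocks scans the bitmap linearly and returns every run of
// false (free) entries.
func findFreeBlocks(bitmap []bool) []FreeBlock {
	var blocks []FreeBlock
	start := -1
	for i, used := range bitmap {
		switch {
		case !used && start < 0:
			start = i
		case used && start >= 0:
			blocks = append(blocks, FreeBlock{StartOffset: start, EndOffset: i - 1, Size: i - start})
			start = -1
		}
	}
	if start >= 0 {
		blocks = append(blocks, FreeBlock{StartOffset: start, EndOffset: len(bitmap) - 1, Size: len(bitmap) - start})
	}
	return blocks
}

// allocateBestFit marks count addresses as used at the start of the
// smallest free block that can hold them and returns the start offset.
func allocateBestFit(bitmap []bool, count int) (int, bool) {
	if count <= 0 {
		return 0, false
	}
	best := FreeBlock{Size: -1}
	for _, b := range findFreeBlocks(bitmap) {
		if b.Size >= count && (best.Size < 0 || b.Size < best.Size) {
			best = b
		}
	}
	if best.Size < 0 {
		return 0, false // no block is large enough
	}
	for i := best.StartOffset; i < best.StartOffset+count; i++ {
		bitmap[i] = true
	}
	return best.StartOffset, true
}

// fragmentationPercent computes 1 - largestFreeBlock/totalFreeIPs as a
// truncated integer percentage.
func fragmentationPercent(bitmap []bool) int {
	total, largest := 0, 0
	for _, b := range findFreeBlocks(bitmap) {
		total += b.Size
		if b.Size > largest {
			largest = b.Size
		}
	}
	if total == 0 {
		return 0
	}
	return 100 * (total - largest) / total
}

func main() {
	// Free runs of 3, 2, and 7 addresses separated by used entries.
	bitmap := []bool{true, true, false, false, false, true, false, false,
		true, false, false, false, false, false, false, false}
	offset, ok := allocateBestFit(bitmap, 2) // picks the 2-address run
	fmt.Println(offset, ok, fragmentationPercent(bitmap))
}
```

A request for 2 addresses lands in the run of exactly 2 rather than splitting the run of 3 or 7, which is the property that keeps larger blocks intact.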

Why Best-Fit

Best-fit allocation minimizes fragmentation over time compared to first-fit or next-fit strategies. By selecting the tightest-fitting free block, it preserves larger contiguous blocks for future allocations that may need them. This is important for long-lived pools where clusters are created and deleted repeatedly.

Constraints

IPv4 only: The allocator uses uint32 arithmetic for IP addresses.
Maximum pool size: 1,048,576 IPs (~1M, a /12 CIDR). Pools larger than this are rejected to prevent excessive memory usage.
Maximum enumeration: EnumerateIPs() caps at 65,536 IPs per range to avoid generating oversized status.addresses[] arrays.

Data Structures

// PoolState decouples the allocator from Kubernetes types.
type PoolState struct {
    AllocatableStart string          // First IP available for allocation
    AllocatableEnd   string          // Last IP available for allocation
    ReservedCIDRs    []string        // CIDRs excluded from allocation
    ExistingAllocs   []AllocatedRange // Currently allocated ranges
}

// AllocationResult contains the result of a successful allocation.
type AllocationResult struct {
    Start     string   // First IP in allocated range
    End       string   // Last IP in allocated range
    CIDR      string   // CIDR notation if power-of-2 aligned, otherwise "start-end"
    Addresses []string // All individual IPs
    Count     int32    // Number of IPs allocated
}

CIDR Formatting

The allocator formats the result as CIDR notation when the allocated range is power-of-2 aligned (e.g., 10.40.1.0/29 for 8 IPs starting at a /29 boundary). Otherwise, it uses start-end format (e.g., 10.40.1.3-10.40.1.10). This affects status.cidr on the IPAllocation but does not change the allocated addresses.
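
One way to implement this check is to test that the range size is a power of two and that the start address is a multiple of that size. The Go sketch below does that; the function names are hypothetical and the real allocator may format edge cases (such as a single address) differently.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"math/bits"
	"net/netip"
)

// ipToUint32 parses a dotted-quad IPv4 address into its numeric value.
func ipToUint32(s string) (uint32, error) {
	addr, err := netip.ParseAddr(s)
	if err != nil {
		return 0, err
	}
	if !addr.Is4() {
		return 0, fmt.Errorf("%s is not an IPv4 address", s)
	}
	b := addr.As4()
	return binary.BigEndian.Uint32(b[:]), nil
}

// formatRange returns CIDR notation when the inclusive range start..end
// is a power-of-2 sized block aligned to its own size, and "start-end"
// otherwise.
func formatRange(start, end string) (string, error) {
	s, err := ipToUint32(start)
	if err != nil {
		return "", err
	}
	e, err := ipToUint32(end)
	if err != nil {
		return "", err
	}
	if e < s {
		return "", fmt.Errorf("end %s precedes start %s", end, start)
	}
	size := uint64(e-s) + 1
	if size&(size-1) == 0 && uint64(s)%size == 0 {
		prefix := 32 - bits.TrailingZeros64(size)
		return fmt.Sprintf("%s/%d", start, prefix), nil
	}
	return fmt.Sprintf("%s-%s", start, end), nil
}

func main() {
	aligned, _ := formatRange("10.40.1.0", "10.40.1.7")
	unaligned, _ := formatRange("10.40.1.3", "10.40.1.10")
	fmt.Println(aligned, unaligned)
}
```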

Cleanup and Garbage Collection

Butler uses a three-layer cleanup strategy to ensure IP addresses are always returned to the pool, even under failure conditions.

Layer 1: TenantCluster Deletion

When a TenantCluster is deleted, handleDeletion() calls cleanupIPAllocations(). This uses two strategies to find all associated allocations:

Status references: Deletes the IPAllocations pointed to by tc.Status.LBAllocationRef and tc.Status.IPAllocationRef. This catches the primary allocation.
Label-based discovery: Lists all IPAllocations in butler-system matching butler.butlerlabs.dev/team={namespace} and butler.butlerlabs.dev/tenant={name}. This catches elastic growth allocations that are not tracked in the TenantCluster status.

A deduplication map prevents double-deletion of allocations found by both methods.
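
The two discovery strategies and the deduplication map can be sketched as follows. The function name and the use of plain name strings are assumptions for illustration; the real cleanupIPAllocations() works on Kubernetes objects.

```go
package main

import "fmt"

// allocationsToDelete merges names found through status references with
// names found through label-based discovery, skipping unset references
// and names that were already seen.
func allocationsToDelete(statusRefs, labelMatches []string) []string {
	seen := make(map[string]struct{})
	var names []string
	for _, group := range [][]string{statusRefs, labelMatches} {
		for _, name := range group {
			if name == "" {
				continue // unset status reference
			}
			if _, dup := seen[name]; dup {
				continue // already scheduled for deletion
			}
			seen[name] = struct{}{}
			names = append(names, name)
		}
	}
	return names
}

func main() {
	fmt.Println(allocationsToDelete(
		[]string{"team-platform-prod-cluster-lb", ""},
		[]string{"team-platform-prod-cluster-lb", "team-platform-prod-cluster-lb-1"},
	))
}
```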

Layer 2: IPAllocation Finalizer

Every IPAllocation has a finalizer (butler.butlerlabs.dev/ipallocation). When deletion is initiated:

The IPAllocation controller detects DeletionTimestamp is set.
It records the current time as status.releasedAt for audit purposes.
It sets the phase to Released.
It removes the finalizer, allowing Kubernetes to complete the deletion.

The releasedAt timestamp creates an audit trail: you can see when an IP range was released even after the allocation object is gone (if you capture the Released status update in logs or events).

Layer 3: NetworkPool Orphan GC

The NetworkPool controller runs orphan garbage collection on every reconcile cycle (every 60 seconds). For each Allocated IPAllocation referencing this pool:

It reads spec.tenantClusterRef.{name, namespace}.
It attempts to GET the referenced TenantCluster.
If the TenantCluster returns 404 (Not Found), the allocation is orphaned and is deleted.

This is a safety net for edge cases where:

The TenantCluster was force-deleted (finalizer removed manually).
The TenantCluster's namespace was deleted before cleanup could run.
A bug in the TenantCluster controller skipped cleanupIPAllocations().

Tip: Orphan GC only processes Allocated IPAllocations. Pending and Failed allocations are transient states handled by the normal allocation flow.
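
The orphan check can be sketched without any Kubernetes dependency by injecting the TenantCluster lookup. Everything named here is hypothetical: the Allocation type, the tenantLookup function type (a stand-in for a GET that may return 404), and the choice to keep an allocation when the lookup itself fails.

```go
package main

import "fmt"

// Allocation is a hypothetical, simplified view of an IPAllocation.
type Allocation struct {
	Name            string
	Phase           string
	TenantNamespace string
	TenantName      string
}

// tenantLookup reports whether the referenced TenantCluster exists.
// found == false with a nil error corresponds to a 404 response.
type tenantLookup func(namespace, name string) (found bool, err error)

// findOrphans returns the names of Allocated IPAllocations whose
// TenantCluster no longer exists.
func findOrphans(allocs []Allocation, lookup tenantLookup) []string {
	var orphans []string
	for _, a := range allocs {
		if a.Phase != "Allocated" {
			continue // Pending and Failed are handled by the normal flow
		}
		found, err := lookup(a.TenantNamespace, a.TenantName)
		if err != nil {
			continue // assumption: on lookup errors, keep and retry next cycle
		}
		if !found {
			orphans = append(orphans, a.Name)
		}
	}
	return orphans
}

func main() {
	live := map[string]bool{"team-platform/prod-cluster": true}
	lookup := func(ns, name string) (bool, error) { return live[ns+"/"+name], nil }
	fmt.Println(findOrphans([]Allocation{
		{Name: "team-platform-prod-cluster-lb", Phase: "Allocated", TenantNamespace: "team-platform", TenantName: "prod-cluster"},
		{Name: "team-data-old-cluster-lb", Phase: "Allocated", TenantNamespace: "team-data", TenantName: "old-cluster"},
	}, lookup))
}
```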

Labels and Discovery

All IPAllocations are labeled for efficient querying and cleanup:

Label	Value	Purpose
`butler.butlerlabs.dev/team`	Team namespace (e.g., `team-platform`)	Filter allocations by team
`butler.butlerlabs.dev/tenant`	TenantCluster name (e.g., `prod-cluster`)	Filter allocations by cluster
`butler.butlerlabs.dev/network-pool`	NetworkPool name (e.g., `lab-pool`)	Track which pool an allocation came from
`butler.butlerlabs.dev/allocation-type`	`loadbalancer` or `nodes`	Distinguish allocation purpose
`butler.butlerlabs.dev/allocation-role`	`initial` or `growth`	Distinguish the initial allocation from elastic growth allocations

The allocation-role label determines shrink eligibility. Only allocations labeled growth are candidates for demand-driven shrink. The initial allocation is always preserved. During migration from older controller versions, the controller infers the role from the allocation name: the allocation matching {team}-{tenant}-lb is labeled initial, allocations matching {team}-{tenant}-lb-{N} are labeled growth, and allocations with a spec.pinnedRange or unrecognized names are labeled initial for safety.
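
The migration rule can be expressed as a small name-matching function. The Go sketch below is an illustration of the rule as stated above; the function name and signature are assumptions, and {N} is assumed to be an unsigned decimal number.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// inferRole returns the allocation-role for an unlabeled allocation.
// Pinned allocations and unrecognized names default to "initial" so
// that they are never eligible for shrink.
func inferRole(name, team, tenant string, pinned bool) string {
	base := team + "-" + tenant + "-lb"
	if pinned || name == base {
		return "initial"
	}
	if strings.HasPrefix(name, base+"-") {
		suffix := strings.TrimPrefix(name, base+"-")
		if _, err := strconv.ParseUint(suffix, 10, 32); err == nil {
			return "growth"
		}
	}
	return "initial"
}

func main() {
	fmt.Println(inferRole("team-platform-prod-cluster-lb", "team-platform", "prod-cluster", false))
	fmt.Println(inferRole("team-platform-prod-cluster-lb-2", "team-platform", "prod-cluster", false))
	fmt.Println(inferRole("custom-name", "team-platform", "prod-cluster", false))
}
```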

The NetworkPool controller uses a field indexer on spec.poolRef.name for efficient listing of all IPAllocations referencing a given pool. This avoids full-list scans on every reconciliation.

Quotas and Capacity Planning

Per-Tenant Quotas

ProviderConfig enforces per-tenant IP limits:

spec:
  network:
    quotaPerTenant:
      maxNodeIPs: 20
      maxLoadBalancerIPs: 32

Quota enforcement points:

Initial allocation: reconcileIPAllocation() clamps the requested count to maxLoadBalancerIPs.
Elastic growth: reconcileElasticIPAM() checks totalAllocated + growthIncrement <= maxLoadBalancerIPs before creating a growth allocation.

If the quota is reached, the controller logs a message and skips the growth. The cluster continues to operate with its current allocation.

Pool Capacity Estimation

The ProviderConfig controller estimates how many tenant clusters a provider can support:

estimatedTenants = availableIPs / (nodesPerTenant + lbPerTenant)

This estimate is exposed in the ProviderConfig status, enabling capacity planning.
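
As a worked example of the formula, the Go sketch below applies it to the figures used elsewhere in this document (1726 available IPs, 5 node IPs and 8 LB IPs per tenant). The function name is hypothetical, and integer division (rounding down) is an assumption.

```go
package main

import "fmt"

// estimatedTenants divides the available IPs by the per-tenant footprint,
// rounding down. A non-positive footprint yields 0.
func estimatedTenants(availableIPs, nodesPerTenant, lbPerTenant int32) int32 {
	perTenant := nodesPerTenant + lbPerTenant
	if perTenant <= 0 {
		return 0
	}
	return availableIPs / perTenant
}

func main() {
	// 1726 / (5 + 8) = 132 with 10 IPs left over.
	fmt.Println(estimatedTenants(1726, 5, 8))
}
```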

Capacity Conditions

The NetworkPool controller maintains three always-present capacity conditions on every NetworkPool, alongside the Ready condition. These conditions follow the standard metav1.Condition pattern and are queryable via kubectl, ArgoCD health checks, Flux kstatus, and butler-console.

Condition	Threshold	Meaning
`CapacityWarning`	70% utilization	Pool is filling. Plan expansion.
`CapacityCritical`	85% utilization	Pool is near capacity. Expansion is urgent.
`CapacityExhausted`	95% utilization	Pool is effectively full. New allocations will likely fail.

Each condition is True when utilization is at or above the threshold, False otherwise. The lastTransitionTime records when the condition last changed state, so operators can see how long a pool has been above a threshold.
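
The threshold evaluation can be sketched as a pure function over the pool's counters. The Condition type here is a reduced stand-in for metav1.Condition, the function name is hypothetical, and comparing the exact ratio (rather than a rounded percentage) is an assumption.

```go
package main

import "fmt"

// Condition is a reduced stand-in for metav1.Condition.
type Condition struct {
	Type   string
	Status bool
}

// capacityTiers lists the documented thresholds in ascending order.
var capacityTiers = []struct {
	Type      string
	Threshold int64 // utilization percent
}{
	{"CapacityWarning", 70},
	{"CapacityCritical", 85},
	{"CapacityExhausted", 95},
}

// capacityConditions returns all three conditions, each True when
// utilization is at or above its threshold.
func capacityConditions(allocatedIPs, totalIPs int32) []Condition {
	conds := make([]Condition, 0, len(capacityTiers))
	for _, t := range capacityTiers {
		// Cross-multiply to avoid rounding: allocated/total >= threshold/100.
		reached := totalIPs > 0 && int64(allocatedIPs)*100 >= t.Threshold*int64(totalIPs)
		conds = append(conds, Condition{Type: t.Type, Status: reached})
	}
	return conds
}

func main() {
	for _, c := range capacityConditions(48, 1774) {
		fmt.Println(c.Type, c.Status)
	}
}
```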

# Check capacity conditions on all pools
kubectl get networkpool -n butler-system -o custom-columns=\
'NAME:.metadata.name,WARN:.status.conditions[?(@.type=="CapacityWarning")].status,CRIT:.status.conditions[?(@.type=="CapacityCritical")].status,EXHAUSTED:.status.conditions[?(@.type=="CapacityExhausted")].status'

Capacity Events

The NetworkPool controller emits Kubernetes events at utilization thresholds:

Utilization	Event Reason	Description
>= 70%	`PoolCapacityWarning`	Pool filling up
>= 85%	`PoolCapacityCritical`	Pool near capacity
>= 95%	`PoolCapacityExhausted`	Pool effectively full
Drops below threshold	`PoolCapacityRecovered`	Utilization returned below a threshold

Events are rate-limited to one per 10 minutes per tier per pool. This prevents event spam on pools that sit above a threshold for extended periods.

kubectl get events -n butler-system --field-selector reason=PoolCapacityWarning

Observability

CRD Status Fields and Conditions

The primary observability mechanism for IPAM is CRD status. Every pool's utilization, fragmentation, and capacity tier is available via kubectl:

# Pool capacity overview
kubectl get networkpool -n butler-system

# Detailed pool status
kubectl describe networkpool -n butler-system <pool-name>

# Allocations for a specific tenant
kubectl get ipallocation -n butler-system \
  -l butler.butlerlabs.dev/tenant=my-cluster

# All growth allocations across all tenants
kubectl get ipallocation -n butler-system \
  -l butler.butlerlabs.dev/allocation-role=growth

# Allocation details
kubectl get ipallocation -n butler-system <name> -o yaml

# Which pools a provider uses
kubectl get providerconfig <name> -n butler-system \
  -o jsonpath='{.spec.network.poolRefs[*].name}'

# Pool events (capacity transitions, allocations, GC)
kubectl get events -n butler-system \
  --field-selector involvedObject.kind=NetworkPool

# Check MetalLB pool on a tenant cluster (for drift verification)
kubectl --kubeconfig <tenant-kubeconfig> \
  get ipaddresspool -n metallb-system default-pool -o yaml

Integration with External Monitoring

Butler's IPAM signals through standard Kubernetes mechanisms: CRD status conditions, events, and status fields. Operators connect these to whatever monitoring stack they run:

ArgoCD / Flux: Capacity conditions (CapacityWarning, CapacityCritical, CapacityExhausted) are standard Kubernetes conditions. ArgoCD health checks and Flux kstatus interpret them natively.
Event exporters: Tools like kube-eventer or fluentd can capture capacity events for long-term storage beyond the API server's event TTL (default 1 hour).
Custom scripts: CRD status fields (allocatedIPs, availableIPs, fragmentationPercent) are machine-readable via kubectl -o json or client-go.

Note: Butler core does not ship Prometheus metrics endpoints, PrometheusRules, ServiceMonitors, or Grafana dashboards for IPAM. An optional butler-ipam-metrics addon for operators using prometheus-operator is planned as future work.

Examples

Single Pool, Static IPAM

A simple setup with one pool and static allocation for an on-premises Harvester environment.

---
apiVersion: butler.butlerlabs.dev/v1alpha1
kind: NetworkPool
metadata:
  name: lab-pool
  namespace: butler-system
spec:
  cidr: "10.40.0.0/22"
  reserved:
    - cidr: "10.40.0.0/28"
      description: "Management cluster control plane and VIP"
    - cidr: "10.40.0.16/28"
      description: "Management cluster MetalLB pool"
  tenantAllocation:
    start: "10.40.1.0"
    end: "10.40.3.254"
    defaults:
      nodesPerTenant: 5
      lbPoolPerTenant: 8
---
apiVersion: butler.butlerlabs.dev/v1alpha1
kind: ProviderConfig
metadata:
  name: harvester-lab
  namespace: butler-system
spec:
  provider: harvester
  credentialsRef:
    name: harvester-kubeconfig
  network:
    mode: ipam
    poolRefs:
      - name: lab-pool
        priority: 0
    subnet: "10.40.0.0/22"
    gateway: "10.40.0.1"
    dnsServers:
      - "10.40.0.2"
    loadBalancer:
      defaultPoolSize: 8
      allocationMode: static

With this configuration, each new TenantCluster receives 8 LoadBalancer IPs from lab-pool. The pool has 766 usable IPs in the tenant allocation range (10.40.1.0 - 10.40.3.254). At 13 IPs per tenant (5 nodes + 8 LB), that supports approximately 58 tenants (766 / 13 ≈ 58.9, rounded down), assuming no capacity is lost to fragmentation.
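The capacity estimate above is plain integer division. The helper below is a hypothetical back-of-the-envelope sketch, not part of Butler; it takes the usable IP count as an input and ignores fragmentation and block alignment.

```python
def estimate_tenants(usable_ips: int, nodes_per_tenant: int, lb_pool_per_tenant: int) -> int:
    """Upper bound on tenants a pool can hold, ignoring fragmentation."""
    per_tenant = nodes_per_tenant + lb_pool_per_tenant
    if per_tenant <= 0:
        raise ValueError("per-tenant footprint must be positive")
    return usable_ips // per_tenant


# Values from the lab-pool example: 766 usable IPs, 5 node IPs, 8 LB IPs.
estimate = estimate_tenants(766, 5, 8)
print(estimate)  # 58
```

Treat the result as a ceiling: pinned ranges and uneven release patterns can leave free IPs that no contiguous request can use.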

Multi-Pool with Priority Failover

Two pools with priority-based failover. When the primary pool is exhausted, allocations automatically fall through to the secondary pool.

---
apiVersion: butler.butlerlabs.dev/v1alpha1
kind: NetworkPool
metadata:
  name: prod-pool-primary
  namespace: butler-system
spec:
  cidr: "10.40.0.0/22"
  reserved:
    - cidr: "10.40.0.0/26"
      description: "Infrastructure services"
  tenantAllocation:
    start: "10.40.0.64"
    end: "10.40.3.254"
---
apiVersion: butler.butlerlabs.dev/v1alpha1
kind: NetworkPool
metadata:
  name: prod-pool-secondary
  namespace: butler-system
spec:
  cidr: "10.40.4.0/22"
  tenantAllocation:
    start: "10.40.4.0"
    end: "10.40.7.254"
---
apiVersion: butler.butlerlabs.dev/v1alpha1
kind: ProviderConfig
metadata:
  name: harvester-prod
  namespace: butler-system
spec:
  provider: harvester
  credentialsRef:
    name: harvester-kubeconfig
  network:
    mode: ipam
    poolRefs:
      - name: prod-pool-primary
        priority: 0      # Tried first
      - name: prod-pool-secondary
        priority: 10     # Fallback
    loadBalancer:
      defaultPoolSize: 8
      allocationMode: static
    quotaPerTenant:
      maxNodeIPs: 20
      maxLoadBalancerIPs: 32
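The failover order can be pictured as a sort followed by a first-fit scan over pools. The sketch below is illustrative only: `PoolRef` and `pick_pool` are hypothetical names, lower `priority` values are assumed to be tried first (as the comments in the manifest suggest), and the capacity test against each pool's largest free block is an assumption about how "exhausted" is decided.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class PoolRef:
    name: str
    priority: int


def pick_pool(pool_refs: List[PoolRef],
              largest_free_block: Dict[str, int],
              requested: int) -> Optional[str]:
    """Return the first pool, in ascending priority order, that can fit the request."""
    for ref in sorted(pool_refs, key=lambda r: r.priority):
        if largest_free_block.get(ref.name, 0) >= requested:
            return ref.name
    return None  # every referenced pool is exhausted or too fragmented


refs = [PoolRef("prod-pool-secondary", 10), PoolRef("prod-pool-primary", 0)]

# Primary has room: it wins regardless of list order.
print(pick_pool(refs, {"prod-pool-primary": 64, "prod-pool-secondary": 512}, 8))
# Primary cannot fit 8 contiguous IPs: the request falls through to the secondary.
print(pick_pool(refs, {"prod-pool-primary": 4, "prod-pool-secondary": 512}, 8))
```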

Elastic IPAM with Demand-Driven Scaling

Elastic mode for environments where most tenants need few LoadBalancer IPs but some may need many.

---
apiVersion: butler.butlerlabs.dev/v1alpha1
kind: ProviderConfig
metadata:
  name: nutanix-elastic
  namespace: butler-system
spec:
  provider: nutanix
  credentialsRef:
    name: nutanix-creds
  network:
    mode: ipam
    poolRefs:
      - name: underlay-pool
        priority: 0
    loadBalancer:
      allocationMode: elastic
      initialPoolSize: 2       # Start with 2 IPs
      growthIncrement: 1       # Grow by 1 when needed
    quotaPerTenant:
      maxLoadBalancerIPs: 8    # Hard cap prevents runaway growth

With this configuration, a new TenantCluster starts with 2 LB IPs. In practice, Traefik (the platform ingress) uses 1 IP on bootstrap, leaving 1 IP as headroom. When a workload creates an LB Service that consumes the headroom IP and another Service goes Pending, the controller detects the Pending Service and allocates 1 more IP. If that Service is later deleted and its IP goes unused for 10 minutes, the growth allocation is released.

A production deployment with 8 tenants on this configuration shows that 5 of 8 tenants use both initial IPs at 100% (running both Traefik and a workload LB Service), while 3 tenants run at 50% utilization (Traefik only, second IP idle).
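The growth rule described above can be sketched as a small decision function. This is a simplified model under stated assumptions, not the controller's code: `PendingService` and `growth_request` are hypothetical names, the 30-second settle window is the one mentioned under Troubleshooting, and the model assumes growth is skipped entirely (rather than partially granted) when a full increment would exceed the quota.

```python
from dataclasses import dataclass
from typing import List

SERVICE_SETTLE_SECONDS = 30  # wait before a Pending Service counts as a growth signal


@dataclass
class PendingService:
    name: str
    age_seconds: float


def growth_request(pending: List[PendingService],
                   allocated_lb_ips: int,
                   growth_increment: int,
                   max_lb_ips: int,
                   settle_seconds: float = SERVICE_SETTLE_SECONDS) -> int:
    """Return how many IPs to request; 0 means no growth this reconcile."""
    mature = [s for s in pending if s.age_seconds >= settle_seconds]
    if not mature:
        return 0  # nothing Pending, or MetalLB may still assign from free IPs
    if allocated_lb_ips + growth_increment > max_lb_ips:
        return 0  # quota would be exceeded
    return growth_increment


# nutanix-elastic example: 2 initial IPs, increment 1, cap 8.
print(growth_request([PendingService("web", 45)], 2, 1, 8))
```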

Pinned Range for Stable Addresses

When a tenant cluster requires specific IP addresses (for example, DNS records that cannot be changed):

apiVersion: butler.butlerlabs.dev/v1alpha1
kind: IPAllocation
metadata:
  name: team-platform-api-lb
  namespace: butler-system
  labels:
    butler.butlerlabs.dev/team: team-platform
    butler.butlerlabs.dev/tenant: api-prod
    butler.butlerlabs.dev/network-pool: lab-pool
    butler.butlerlabs.dev/allocation-type: loadbalancer
spec:
  poolRef:
    name: lab-pool
  tenantClusterRef:
    name: api-prod
    namespace: team-platform
  type: loadbalancer
  pinnedRange:
    startAddress: "10.40.2.0"
    endAddress: "10.40.2.7"

The NetworkPool controller validates that the pinned range is within the pool, does not overlap reserved ranges, and does not conflict with existing allocations. If validation passes, the range is allocated exactly as requested.
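The three validation checks amount to interval comparisons. The following sketch uses Python's standard `ipaddress` module to illustrate them; `validate_pinned` is a hypothetical helper, and "within the pool" is interpreted here as within the pool CIDR, which may be looser than what the controller enforces.

```python
import ipaddress
from typing import List, Optional, Tuple


def _ip(addr: str) -> int:
    return int(ipaddress.ip_address(addr))


def validate_pinned(start: str, end: str, pool_cidr: str,
                    reserved_cidrs: List[str],
                    existing: List[Tuple[str, str]]) -> Optional[str]:
    """Return None if the pinned range is acceptable, else a reason string."""
    lo, hi = _ip(start), _ip(end)
    if lo > hi:
        return "start address is after end address"
    pool = ipaddress.ip_network(pool_cidr)
    if lo < int(pool.network_address) or hi > int(pool.broadcast_address):
        return "outside pool"
    for cidr in reserved_cidrs:
        net = ipaddress.ip_network(cidr)
        # Two inclusive ranges overlap when each starts before the other ends.
        if lo <= int(net.broadcast_address) and int(net.network_address) <= hi:
            return f"overlaps reserved {cidr}"
    for s, e in existing:
        if lo <= _ip(e) and _ip(s) <= hi:
            return f"conflicts with existing {s}-{e}"
    return None


reserved = ["10.40.0.0/28", "10.40.0.16/28"]
existing = [("10.40.1.0", "10.40.1.7")]  # hypothetical prior allocation
print(validate_pinned("10.40.2.0", "10.40.2.7", "10.40.0.0/22", reserved, existing))
```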

Pinned allocations are protected from demand-driven shrink regardless of their role label. Even if no tenant LB Service uses the IPs in a pinned range, the controller preserves the allocation.

Warning: Pinned ranges bypass best-fit allocation. If the requested range is in the middle of a large free block, it splits the block into two smaller ones, increasing fragmentation.
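To make the fragmentation cost concrete, the sketch below subtracts a pinned range from a free block and returns what is left. It works on integer offsets within the pool rather than IP addresses, and `split_free_block` is a hypothetical helper for illustration only.

```python
from typing import List, Tuple

Range = Tuple[int, int]  # inclusive (first_offset, last_offset)


def split_free_block(block: Range, pinned: Range) -> List[Range]:
    """Return the free pieces that remain after carving `pinned` out of `block`."""
    (b_lo, b_hi), (p_lo, p_hi) = block, pinned
    if p_lo > p_hi or p_lo < b_lo or p_hi > b_hi:
        raise ValueError("pinned range must lie inside the free block")
    pieces = []
    if p_lo > b_lo:
        pieces.append((b_lo, p_lo - 1))
    if p_hi < b_hi:
        pieces.append((p_hi + 1, b_hi))
    return pieces


# A 256-address free block with an 8-address pin in the middle.
print(split_free_block((0, 255), (120, 127)))  # [(0, 119), (128, 255)]
# Pinning at the edge leaves a single piece and does not fragment the block.
print(split_free_block((0, 255), (0, 7)))      # [(8, 255)]
```

In the first case the largest contiguous free run drops from 256 to 128 addresses even though only 8 were consumed, which is why pinning at the edge of a free block is preferable when the addresses are negotiable.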

Troubleshooting

IPAllocation Stuck in Pending

Symptoms: IPAllocation shows phase: Pending for more than 60 seconds.

Diagnosis:

# Check the IPAllocation status
kubectl get ipallocation -n butler-system <name> -o yaml

# Check if the referenced pool exists and has capacity
kubectl get networkpool -n butler-system <pool-name>

# Check controller logs
kubectl logs -n butler-system -l app.kubernetes.io/name=butler-controller --tail=50

Common causes:

Pool exhausted: availableIPs on the pool is less than the requested count. Expand the pool or add a secondary pool to the ProviderConfig.
Fragmentation: Available IPs exist but no contiguous block is large enough. Check fragmentationPercent and largestFreeBlock in pool status.
Controller not running: Verify the butler-controller pod is healthy.
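The difference between the first two causes is easiest to see on a toy bitmap. The sketch below is a simplified model, not Butler's allocator: it represents each address as a boolean (True means in use), and the dictionary keys merely mirror the status field names mentioned above.

```python
from typing import Dict, List, Tuple


def free_runs(bitmap: List[bool]) -> List[Tuple[int, int]]:
    """Return (start_index, length) for each contiguous run of free slots."""
    runs, start = [], None
    for i, used in enumerate(bitmap):
        if not used and start is None:
            start = i
        elif used and start is not None:
            runs.append((start, i - start))
            start = None
    if start is not None:
        runs.append((start, len(bitmap) - start))
    return runs


def check_request(bitmap: List[bool], count: int) -> Dict[str, object]:
    """Summarize whether `count` contiguous addresses can be allocated."""
    lengths = [length for _, length in free_runs(bitmap)]
    return {
        "availableIPs": sum(lengths),
        "largestFreeBlock": max(lengths, default=0),
        "fits": max(lengths, default=0) >= count,
    }


# 12 addresses are free, but never more than 4 in a row.
bitmap = [False] * 4 + [True] * 2 + [False] * 4 + [True] * 2 + [False] * 4
print(check_request(bitmap, 8))
```

Here a request for 8 addresses stays Pending despite 12 free IPs, which is the fragmentation case; a pool with fewer than 8 free IPs in total would be the exhaustion case.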

IPAllocation in Failed State

Symptoms: IPAllocation shows phase: Failed with a condition message.

Diagnosis:

kubectl get ipallocation -n butler-system <name> \
  -o jsonpath='{.status.conditions[?(@.type=="Ready")].message}'

Common causes:

Pool exhausted: The condition message will contain "no contiguous block available". Add capacity.
Pinned range conflict: A pinned range overlaps with a reserved CIDR or existing allocation. Check kubectl get ipallocation -n butler-system -o wide for overlapping ranges.
Invalid CIDR: The pool's CIDR is malformed. Check pool validation conditions.

Failed allocations are retried by both the NetworkPool controller (event-driven, treats Failed as Pending) and the IPAllocation controller (backstop, every 30 seconds).

Growth Allocation Not Firing

Symptoms: A tenant LB Service is stuck Pending but no growth IPAllocation appears.

Diagnosis:

# Confirm the Service is type LoadBalancer and has no external IP assigned (status.loadBalancer.ingress is empty)
kubectl --kubeconfig <tenant-kubeconfig> get service -n <namespace> <name> -o yaml

# Check the elastic IPAM configuration on the ProviderConfig used by the TenantCluster
kubectl get providerconfig <name> -n butler-system \
  -o jsonpath='{.spec.network.loadBalancer}'

# Check allocation count against quota
kubectl get ipallocation -n butler-system \
  -l butler.butlerlabs.dev/tenant=<cluster-name> \
  -o custom-columns='NAME:.metadata.name,COUNT:.spec.count,PHASE:.status.phase'

kubectl get providerconfig <name> -n butler-system \
  -o jsonpath='{.spec.network.quotaPerTenant.maxLoadBalancerIPs}'

Common causes:

Static allocation mode: Elastic growth only runs when loadBalancer.allocationMode is elastic. Static mode allocates once at creation and does not grow.
Quota reached: Total allocated IPs equal maxLoadBalancerIPs. The controller logs "quota would be exceeded" and skips growth.
Service too new: The controller waits 30 seconds after Service creation before treating it as a growth signal, to avoid racing with MetalLB assignment from existing free IPs.
Tenant API unreachable: If the controller cannot reach the tenant cluster's API server, it skips elastic IPAM for that tenant. Check controller logs for connection errors.
Reconcile interval: For a mature cluster (>24h old), the reconcile interval is 15 minutes. The worst case for detecting a Pending Service is the full interval. The IPAllocation watch accelerates follow-up reconciles after a growth allocation is fulfilled.

Unexpected Shrink

Symptoms: A growth IPAllocation was deleted and the MetalLB pool shrank.

Diagnosis:

# Check controller logs for shrink events
kubectl logs -n butler-system -l app.kubernetes.io/name=butler-controller --tail=100 \
  | grep -i shrink

# Check remaining allocations
kubectl get ipallocation -n butler-system \
  -l butler.butlerlabs.dev/tenant=<cluster-name> \
  -o custom-columns='NAME:.metadata.name,ROLE:.metadata.labels.butler\.butlerlabs\.dev/allocation-role,PHASE:.status.phase'

Common causes:

Service deleted: The LB Service using the growth allocation's IP was deleted. After 10 minutes with no matching Service, the allocation is released. This is normal operation.
MetalLB assigned a different IP: If MetalLB assigned an IP from a different allocation's range (e.g., from the initial allocation's headroom), the growth allocation's IP may appear unused even though a Service exists. The controller checks the specific IP range of each growth allocation against actual Service IPs.

Shrink never touches the initial allocation or allocations with a spec.pinnedRange. If an unexpected shrink occurred, check whether the deleted allocation was labeled growth and lacked a pinned range.
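The shrink rules can be summarized as a single predicate. The sketch below is a hedged model of those rules, not the controller's implementation: the `Allocation` class, its field names, and the `"initial"` role value are assumptions (only the `growth` role value appears in this document), and the 10-minute grace period is taken from the text above.

```python
import ipaddress
from dataclasses import dataclass
from typing import Iterable

SHRINK_GRACE_SECONDS = 600  # 10 minutes with no matching Service


@dataclass
class Allocation:
    name: str
    role: str            # "growth" per the allocation-role label; "initial" is assumed
    start: str
    end: str
    pinned: bool = False
    idle_seconds: float = 0.0


def _in_range(addr: str, start: str, end: str) -> bool:
    a = ipaddress.ip_address(addr)
    return ipaddress.ip_address(start) <= a <= ipaddress.ip_address(end)


def should_release(alloc: Allocation, service_ips: Iterable[str],
                   grace_seconds: float = SHRINK_GRACE_SECONDS) -> bool:
    """True only for an unpinned growth allocation that has been idle long enough."""
    if alloc.role != "growth" or alloc.pinned:
        return False
    if any(_in_range(ip, alloc.start, alloc.end) for ip in service_ips):
        return False  # a tenant Service still holds an IP from this range
    return alloc.idle_seconds >= grace_seconds


growth = Allocation("growth-1", "growth", "10.40.1.16", "10.40.1.16", idle_seconds=700)
print(should_release(growth, {"10.40.1.0"}))   # Service IPs live elsewhere
print(should_release(growth, {"10.40.1.16"}))  # range is still in use
```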

Orphaned Allocations

Symptoms: IPAllocations exist for TenantClusters that no longer exist.

Resolution: The NetworkPool controller's orphan GC runs every 60 seconds and automatically detects and deletes orphaned allocations. If you need to force cleanup:

# List allocations for a deleted cluster
kubectl get ipallocation -n butler-system \
  -l butler.butlerlabs.dev/tenant=deleted-cluster

# Manual deletion (if GC is not running)
kubectl delete ipallocation -n butler-system \
  -l butler.butlerlabs.dev/tenant=deleted-cluster
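Conceptually, orphan detection is a set-membership test between allocations and live TenantClusters. The sketch below illustrates that idea on plain dictionaries shaped like the manifests in this document. `find_orphans` is a hypothetical helper, and matching on `spec.tenantClusterRef` is an assumption about how the GC identifies owners.

```python
from typing import Dict, List, Set, Tuple

TenantKey = Tuple[str, str]  # (namespace, name)


def find_orphans(allocations: List[Dict], live_tenants: Set[TenantKey]) -> List[str]:
    """Return names of allocations whose referenced TenantCluster no longer exists."""
    orphans = []
    for alloc in allocations:
        ref = alloc.get("spec", {}).get("tenantClusterRef")
        if not ref:
            continue  # no owner reference: leave it for a human to inspect
        if (ref.get("namespace", ""), ref.get("name", "")) not in live_tenants:
            orphans.append(alloc["metadata"]["name"])
    return orphans


allocations = [
    {"metadata": {"name": "api-prod-lb"},
     "spec": {"tenantClusterRef": {"name": "api-prod", "namespace": "team-platform"}}},
    {"metadata": {"name": "deleted-cluster-lb"},
     "spec": {"tenantClusterRef": {"name": "deleted-cluster", "namespace": "team-platform"}}},
]
live = {("team-platform", "api-prod")}
print(find_orphans(allocations, live))  # ['deleted-cluster-lb']
```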

NetworkPool Cannot Be Deleted

Symptoms: NetworkPool stuck in terminating state.

Cause: The pool has active IPAllocations. The finalizer blocks deletion until all allocations are Released.

# Check active allocations
kubectl get ipallocation -n butler-system \
  -l butler.butlerlabs.dev/network-pool=<pool-name>

# Delete the TenantClusters using this pool, or wait for their cleanup

MetalLB Pool Drift

Symptoms: The MetalLB default-pool on a tenant cluster does not match the management-side IPAllocations.

The controller detects and corrects drift automatically. On every elastic IPAM reconcile, the controller computes the expected pool state from IPAllocations and applies it via server-side apply. Manual edits to default-pool are overwritten within one reconcile cycle.

If drift persists, check:

# Compare management-side allocations with tenant-side pool
kubectl get ipallocation -n butler-system \
  -l butler.butlerlabs.dev/tenant=<cluster-name>,butler.butlerlabs.dev/allocation-type=loadbalancer

kubectl --kubeconfig <tenant-kubeconfig> \
  get ipaddresspool -n metallb-system default-pool -o jsonpath='{.spec.addresses}'

If the ranges do not match, the controller will correct it on the next reconcile. If the tenant API server is unreachable, the sync is retried with exponential backoff on subsequent reconciles.
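When comparing the two outputs by hand, it can help to think of drift as a set difference between the expected and actual address lists. The sketch below shows that comparison under stated assumptions: the flattened allocation dictionaries, the `Allocated` phase name, and the `start-end` range format are illustrative choices, and the controller may format addresses differently.

```python
from typing import Dict, List


def expected_addresses(allocations: List[Dict]) -> List[str]:
    """Build the address list a tenant's default-pool should contain."""
    ranges = []
    for alloc in allocations:
        # Assumption: only fulfilled load balancer allocations are projected.
        if alloc.get("type") != "loadbalancer" or alloc.get("phase") != "Allocated":
            continue
        ranges.append(f'{alloc["start"]}-{alloc["end"]}')
    return sorted(ranges)


def diff_pool(expected: List[str], actual: List[str]) -> Dict[str, List[str]]:
    """Report ranges missing from, or unexpectedly present in, the tenant pool."""
    return {
        "missing": sorted(set(expected) - set(actual)),
        "unexpected": sorted(set(actual) - set(expected)),
    }


allocations = [
    {"type": "loadbalancer", "phase": "Allocated", "start": "10.40.1.0", "end": "10.40.1.1"},
    {"type": "loadbalancer", "phase": "Allocated", "start": "10.40.1.16", "end": "10.40.1.16"},
    {"type": "loadbalancer", "phase": "Pending", "start": "", "end": ""},
]
actual = ["10.40.1.0-10.40.1.1", "10.40.9.9-10.40.9.9"]  # e.g. after a manual edit
report = diff_pool(expected_addresses(allocations), actual)
print(report)
```

An empty report on both keys means the tenant pool matches the management-side state; anything else should disappear after the next successful reconcile.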
