When One Pod Must Win: Distributed Locking Lessons from Building Two Production Kubernetes Automation Systems
In Kubernetes automation, we often deploy multiple pods for high availability. That is usually a good thing. But there are certain tasks where only one pod should act at a time.
Examples include IP allocation, node reboot orchestration, firewall updates, route management, certificate rotation, and control plane access-list synchronization. If two pods perform the same operation at the same time, the result can be duplicate updates, race conditions, failed automation or even service disruption.
The Problem
While building production automation for Kubernetes networking and control plane access management, I ran into a common challenge:
How do we allow multiple pods to run for availability, but ensure that only one pod performs the critical operation at any given time?
This is where distributed locking becomes important.
Approach 1: ConfigMap-Based Locking
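The simplest approach is to use a Kubernetes ConfigMap as a lock. A pod tries to create the lock entry, or to update it with its own identity. If the lock is already held by another pod, it waits and retries. The check is safe because the API server rejects a create for an object that already exists, and rejects an update that carries a stale resourceVersion.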
This approach is easy to understand and works well for simple coordination tasks. A lock entry can be as small as two fields:
lock_holder: worker-node-1
lock_timestamp: 2026-06-06T12:00:00Z
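The acquire-or-wait behaviour can be sketched in a few lines of Python. This is a minimal illustration, not the production implementation: FakeConfigMapStore is a hypothetical in-memory stand-in for the API server, and try_acquire and the lock name "reboot-lock" are names assumed for this example. The only property it models is that a create fails when the object already exists.

```python
import threading
from datetime import datetime, timezone


class FakeConfigMapStore:
    """Hypothetical in-memory stand-in for the API server (illustration only)."""

    def __init__(self):
        self._objects = {}
        self._mutex = threading.Lock()  # emulates server-side atomicity

    def create(self, name, data):
        # Create fails if the object exists; a real API server would
        # answer such a request with HTTP 409 (AlreadyExists).
        with self._mutex:
            if name in self._objects:
                return False
            self._objects[name] = dict(data)
            return True

    def get(self, name):
        with self._mutex:
            obj = self._objects.get(name)
            return dict(obj) if obj is not None else None

    def delete_if_holder(self, name, holder):
        # Simplification: only the recorded holder may release the lock.
        with self._mutex:
            obj = self._objects.get(name)
            if obj is not None and obj["lock_holder"] == holder:
                del self._objects[name]
                return True
            return False


def try_acquire(store, lock_name, holder, now=None):
    """Return True if this holder created the lock entry, False otherwise."""
    now = now or datetime.now(timezone.utc)
    data = {
        "lock_holder": holder,
        "lock_timestamp": now.strftime("%Y-%m-%dT%H:%M:%SZ"),
    }
    return store.create(lock_name, data)


store = FakeConfigMapStore()
first = try_acquire(store, "reboot-lock", "worker-node-1")   # lock is free
second = try_acquire(store, "reboot-lock", "worker-node-2")  # already held
```

In this sketch the second caller gets False and would sleep and retry. Against a real cluster, the same shape applies, with the create call going to the API server instead of a dictionary.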
Where ConfigMap Locking Works Well
- Simple leader election
- Serialized maintenance tasks
- Reboot coordination
- Small automation workflows
- Low-frequency operations
Limitations
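- Stale locks must be detected and cleared explicitly, because a ConfigMap does not expire on its own
- It is not ideal for high-frequency allocation workloads, since every lock attempt is a round trip to the API server
- Concurrent updates need an optimistic concurrency check (resourceVersion), or two pods can overwrite each other
- Failure scenarios, such as a pod crashing while it holds the lock, must be explicitly designed for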
Approach 2: etcd-Based Locking
For more critical workflows, etcd provides stronger primitives for coordination. Since Kubernetes itself uses etcd as its backing store, etcd-style compare-and-swap logic is a natural fit for distributed locking.
In my IP allocation workflow, etcd was a better choice because IP assignment needs atomic behavior. Two pods should never allocate the same IP address.
Using etcd, the lock or allocation state can be written atomically, with a transaction that creates the key only if it does not already exist. If the key already exists, the operation fails and the system moves to the next available candidate.
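The "create only if absent, otherwise try the next candidate" loop can be sketched as follows. This is an illustration under stated assumptions: FakeKV is a hypothetical in-memory stand-in for etcd, put_if_absent models a transaction guarded by "key does not exist yet", and the /ipam/ key prefix and the allocate_ip name are invented for the example.

```python
import ipaddress
import threading


class FakeKV:
    """Hypothetical in-memory stand-in for an etcd-like store."""

    def __init__(self):
        self._data = {}
        self._mutex = threading.Lock()

    def put_if_absent(self, key, value):
        # Emulates an atomic "if key does not exist, then put" transaction.
        with self._mutex:
            if key in self._data:
                return False
            self._data[key] = value
            return True

    def get(self, key):
        with self._mutex:
            return self._data.get(key)


def allocate_ip(kv, cidr, owner, prefix="/ipam/"):
    """Claim the first free host address in cidr, or return None if none is left."""
    for ip in ipaddress.ip_network(cidr).hosts():
        # The write itself is the claim: whichever pod creates the key owns the IP.
        if kv.put_if_absent(f"{prefix}{ip}", owner):
            return str(ip)
    return None  # pool exhausted
```

The design point is that there is no separate "check, then write" step. The conditional write is the only operation, so two pods racing for the same address cannot both succeed.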
Where etcd Works Better
- IP allocation
- High-concurrency workflows
- Stateful automation
- Atomic read/write operations
- Production-grade coordination
Key Design Lessons
1. Lock Ownership Must Be Clear
Every lock should clearly identify who owns it. This can be a pod name, node name or unique instance ID. Without ownership metadata, debugging lock issues becomes painful.
2. Always Handle Stale Locks
Pods can crash. Nodes can reboot. Network calls can fail. If a lock is created but never released, the automation can stop permanently. Every lock needs a timestamp or lease mechanism so that other pods can detect an abandoned lock and take it over.
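A timestamp-based staleness check might look like the sketch below. It assumes the lock_timestamp format shown in the ConfigMap example and a time-to-live chosen by the operator. The is_stale name and the five-minute value are illustrative, not taken from the original systems.

```python
from datetime import datetime, timedelta, timezone

TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def is_stale(lock_timestamp, now, ttl):
    """Return True if the lock is older than ttl and may be taken over."""
    created = datetime.strptime(lock_timestamp, TIMESTAMP_FORMAT)
    created = created.replace(tzinfo=timezone.utc)  # the trailing Z means UTC
    return now - created > ttl


ttl = timedelta(minutes=5)  # assumed value; tune to the protected operation
now = datetime(2026, 6, 6, 12, 6, 0, tzinfo=timezone.utc)
stale = is_stale("2026-06-06T12:00:00Z", now, ttl)  # six minutes old
```

Two caveats apply to this kind of check. The takeover itself should be a conditional write against the lock version that was just read, otherwise two pods can both decide the lock is stale and both claim it. And the time-to-live should leave room for clock skew between nodes.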
3. Keep the Locked Section Small
The lock should protect only the critical section. Do not hold a lock while performing long-running operations unless absolutely necessary.
4. Make Operations Idempotent
Even with locking, retries will happen. Scripts and APIs should safely handle repeated execution. If a route already exists, do not fail. If a firewall rule already exists, skip it. If an IP is already marked allocated, do not allocate it again.
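As a small illustration of the "if it already exists, do not fail" rule, the sketch below models a route table as a plain dictionary. The ensure_route name and the return values are assumptions made for the example; a real implementation would talk to the routing layer instead.

```python
def ensure_route(routes, destination, next_hop):
    """Make sure destination points at next_hop; safe to call repeatedly."""
    current = routes.get(destination)
    if current == next_hop:
        return "unchanged"  # a retry finds the work already done
    routes[destination] = next_hop
    return "created" if current is None else "updated"


routes = {}
first = ensure_route(routes, "10.1.0.0/16", "192.168.0.1")
retry = ensure_route(routes, "10.1.0.0/16", "192.168.0.1")
```

The function describes the desired end state rather than an action, so running it twice leaves the table exactly as running it once would.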
5. Logs Matter
Distributed locking issues are difficult to debug without good logs. Each lock attempt should log:
- Who requested the lock
- Who currently owns the lock
- When the lock was created
- Whether the lock was acquired, skipped or considered stale
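The four fields above fit naturally into one structured log line per attempt. The sketch below uses Python's standard logging module. The logger name, the field names and the key=value layout are assumptions for illustration, not the format used in the original systems.

```python
import logging

logger = logging.getLogger("lock")
OUTCOMES = ("acquired", "skipped", "stale")


def format_lock_attempt(requester, owner, created_at, outcome):
    """Build one key=value line describing a single lock attempt."""
    if outcome not in OUTCOMES:
        raise ValueError(f"unknown outcome: {outcome}")
    return (
        f"lock_attempt requester={requester} owner={owner} "
        f"created_at={created_at} outcome={outcome}"
    )


def log_lock_attempt(requester, owner, created_at, outcome):
    """Log the attempt at INFO level and return the line that was logged."""
    message = format_lock_attempt(requester, owner, created_at, outcome)
    logger.info(message)
    return message
```

Keeping every attempt on a single line with fixed field names makes it straightforward to search for all attempts against one owner or one outcome when a lock appears stuck.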
ConfigMap vs etcd: Which One Should You Use?
My general rule is simple:
- Use ConfigMap locking for simple orchestration and low-frequency coordination.
- Use etcd for atomic state management, IP allocation and high-reliability workflows.
ConfigMap locking is easier to implement. etcd is more suitable when correctness matters more than simplicity.
Final Thoughts
Distributed locking is not only a backend engineering concept. It appears frequently in Kubernetes automation, especially when infrastructure changes must be serialized.
If you are building operators, automation controllers, networking tools or platform services, assume that multiple pods may try to act at the same time. Design for that from the beginning.
The best distributed lock is not the most complex one. It is the one that is simple enough to operate, safe enough for failure scenarios and reliable enough for the workflow it protects.