When One Pod Must Win: Distributed Locking Lessons from Building Kubernetes Automation Systems

By Sandip Gangdhar • Kubernetes • Platform Engineering • Distributed Systems


In Kubernetes automation, we often deploy multiple pods for high availability. That is usually a good thing. But there are certain tasks where only one pod should act at a time.

Examples include IP allocation, node reboot orchestration, firewall updates, route management, certificate rotation, and control plane access-list synchronization. If two pods perform the same operation at the same time, the result can be duplicate updates, conflicting state, failed automation, or even service disruption.

The Problem

While building production automation for Kubernetes networking and control plane access management, I ran into a common challenge:

How do we allow multiple pods to run for availability, but ensure that only one pod performs the critical operation at any given time?

This is where distributed locking becomes important.

Approach 1: ConfigMap-Based Locking

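The simplest approach is to use a Kubernetes ConfigMap as a lock. A pod tries to create the lock ConfigMap, or update the lock entry in an existing one, recording itself as the holder. The API server rejects the creation of an object that already exists, and rejects an update made against an outdated resourceVersion, so only one pod's write succeeds. If the lock is already held by another pod, the caller waits and retries.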

This approach is easy to understand and works well for simple coordination tasks.

lock_holder: worker-node-1
lock_timestamp: 2026-06-06T12:00:00Z
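
As a rough illustration of the create-or-fail pattern, here is a minimal Python sketch. It does not talk to a real cluster: `FakeConfigMapStore` is a hypothetical in-memory stand-in for the ConfigMap API, and `try_acquire` and `release` are illustrative names, not part of any Kubernetes client. The only behaviour it assumes from the API server is that creating an object that already exists fails.

```python
import threading
from datetime import datetime, timezone


class Conflict(Exception):
    """Stand-in for the API server's HTTP 409 response (hypothetical)."""


class FakeConfigMapStore:
    """In-memory stand-in for the ConfigMap API, for illustration only."""

    def __init__(self):
        self._objects = {}
        self._mutex = threading.Lock()

    def create(self, name, data):
        # Mirrors create semantics: fails if the object already exists.
        with self._mutex:
            if name in self._objects:
                raise Conflict(f"{name} already exists")
            self._objects[name] = dict(data)

    def get(self, name):
        with self._mutex:
            obj = self._objects.get(name)
            return dict(obj) if obj is not None else None

    def delete(self, name, expected_holder):
        # Check-and-delete in one step so a non-owner cannot remove the lock.
        with self._mutex:
            obj = self._objects.get(name)
            if obj is None or obj["lock_holder"] != expected_holder:
                return False
            del self._objects[name]
            return True


def try_acquire(store, lock_name, holder):
    """Return True if `holder` created the lock, False if it is already held."""
    data = {
        "lock_holder": holder,
        "lock_timestamp": datetime.now(timezone.utc).isoformat(),
    }
    try:
        store.create(lock_name, data)
        return True
    except Conflict:
        return False


def release(store, lock_name, holder):
    """Release the lock only if `holder` still owns it."""
    return store.delete(lock_name, holder)
```

A caller would typically loop on `try_acquire` with a sleep between attempts, run the critical section, and call `release` in a `finally` block so the lock is not left behind on an error.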

Where ConfigMap Locking Works Well

Simple leader election
Serialized maintenance tasks
Reboot coordination
Small automation workflows
Low-frequency operations

Limitations

It requires careful stale-lock handling
It is not ideal for high-frequency allocation workloads
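Concurrent updates must rely on the API server's resourceVersion conflict checks, or two pods can both believe they hold the lock
Failure scenarios, such as a holder crashing mid-operation, must be designed for explicitly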

Approach 2: etcd-Based Locking

For more critical workflows, etcd provides stronger primitives for coordination. Since Kubernetes itself uses etcd as its backing store, etcd-style compare-and-swap logic is a natural fit for distributed locking.

In my IP allocation workflow, etcd was a better choice because IP assignment needs atomic behavior. Two pods should never allocate the same IP address.

Using an etcd transaction, the lock or allocation key can be written only if it does not already exist, with the check and the write happening as one atomic step. If the key already exists, the write fails and the system moves on to the next available candidate.
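
The allocation loop can be sketched as follows. This is a simplified model, not the production code: `FakeKV` is a hypothetical in-memory stand-in whose `put_if_absent` method plays the role of an etcd transaction that writes a key only when it does not exist yet. The key prefix `/ipam/allocations/` and the function name `allocate_ip` are assumptions made for the example.

```python
import threading


class FakeKV:
    """In-memory stand-in for an etcd-like key-value store (hypothetical)."""

    def __init__(self):
        self._data = {}
        self._mutex = threading.Lock()

    def put_if_absent(self, key, value):
        # Atomic check-and-write: store the value only if the key is missing.
        with self._mutex:
            if key in self._data:
                return False
            self._data[key] = value
            return True

    def get(self, key):
        with self._mutex:
            return self._data.get(key)


def allocate_ip(kv, candidates, owner, prefix="/ipam/allocations/"):
    """Claim the first free candidate IP for `owner`; return it, or None."""
    for ip in candidates:
        # A failed write means another pod owns this IP, so try the next one.
        if kv.put_if_absent(prefix + ip, owner):
            return ip
    return None
```

Because the existence check and the write are a single operation, there is no window in which two pods can both see an address as free and both claim it.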

Where etcd Works Better

IP allocation
High-concurrency workflows
Stateful automation
Atomic read/write operations
Production-grade coordination

Key Design Lessons

1. Lock Ownership Must Be Clear

Every lock should clearly identify who owns it. This can be a pod name, node name or unique instance ID. Without ownership metadata, debugging lock issues becomes painful.

2. Always Handle Stale Locks

Pods can crash. Nodes can reboot. Network calls can fail. If a lock is created but never released, the automation can stop permanently. Every lock needs a timestamp or lease mechanism so that other pods can recognize an abandoned lock and take it over.
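
A timestamp-based staleness check might look like the sketch below. The function name `is_stale` and the idea of a fixed `ttl` are assumptions for illustration; the timestamp format matches the `lock_timestamp` example shown earlier.

```python
from datetime import datetime, timedelta, timezone


def is_stale(lock_timestamp, ttl, now=None):
    """Return True if a lock written at `lock_timestamp` has outlived `ttl`.

    `lock_timestamp` is an ISO 8601 UTC string such as "2026-06-06T12:00:00Z".
    """
    now = now or datetime.now(timezone.utc)
    # Older Python versions do not parse a trailing "Z", so normalize it.
    created = datetime.fromisoformat(lock_timestamp.replace("Z", "+00:00"))
    return now - created > ttl
```

For example, `is_stale("2026-06-06T12:00:00Z", timedelta(minutes=5))` reports whether a five-minute lock has expired. Two caveats apply: the check depends on reasonably synchronized clocks across pods, and taking over a stale lock should itself be a conditional write, otherwise two pods can both decide the lock is stale and both claim it.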

3. Keep the Locked Section Small

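The lock should protect only the critical section. Do not hold a lock while performing long-running operations unless absolutely necessary: the longer a lock is held, the longer every other pod is blocked and the more likely the holder fails while still owning it.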

4. Make Operations Idempotent

Even with locking, retries will happen. Scripts and APIs should safely handle repeated execution. If a route already exists, do not fail. If a firewall rule already exists, skip it. If an IP is already marked allocated, do not allocate it again.
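
A small sketch of the idea, using a plain dictionary as a hypothetical stand-in for a routing table. The function name `ensure_route` and its return values are illustrative only.

```python
def ensure_route(routes, destination, gateway):
    """Make sure `destination` is routed via `gateway`; safe to call repeatedly.

    Returns "created", "updated" or "unchanged" so callers can log the outcome.
    """
    current = routes.get(destination)
    if current == gateway:
        # Already in the desired state: a retry is a no-op, not an error.
        return "unchanged"
    routes[destination] = gateway
    return "created" if current is None else "updated"
```

The function describes the desired end state rather than an action, so running it twice leaves the table exactly as running it once would.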

5. Logs Matter

Distributed locking issues are difficult to debug without good logs. Each lock attempt should log:

Who requested the lock
Who currently owns the lock
When the lock was created
Whether the lock was acquired, skipped or considered stale
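
One way to capture the fields listed above is a single structured log line per attempt. The sketch below uses only the Python standard library; the logger name `lock`, the function name `log_lock_attempt` and the JSON field names are assumptions for the example.

```python
import json
import logging

logger = logging.getLogger("lock")

VALID_OUTCOMES = ("acquired", "skipped", "stale")


def log_lock_attempt(requester, owner, created_at, outcome):
    """Emit one structured log line for a lock attempt and return it."""
    if outcome not in VALID_OUTCOMES:
        raise ValueError(f"unknown outcome: {outcome}")
    record = {
        "requester": requester,         # who requested the lock
        "owner": owner,                 # who currently owns the lock
        "lock_created_at": created_at,  # when the lock was created
        "outcome": outcome,             # acquired, skipped or stale
    }
    line = json.dumps(record, sort_keys=True)
    logger.info(line)
    return line
```

Keeping every attempt on one line with fixed field names makes it straightforward to filter the logs by owner or outcome when reconstructing what happened.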

ConfigMap vs etcd: Which One Should You Use?

My general rule is simple:

Use ConfigMap locking for simple orchestration and low-frequency coordination.
Use etcd for atomic state management, IP allocation and high-reliability workflows.

ConfigMap locking is easier to implement. etcd is more suitable when correctness matters more than simplicity.

Final Thoughts

Distributed locking is not only a backend engineering concept. It appears frequently in Kubernetes automation, especially when infrastructure changes must be serialized.

If you are building operators, automation controllers, networking tools or platform services, assume that multiple pods may try to act at the same time. Design for that from the beginning.

The best distributed lock is not the most complex one. It is the one that is simple enough to operate, safe enough for failure scenarios and reliable enough for the workflow it protects.
