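The Redlock algorithm, intended for distributed locks across several Redis nodes, doesn’t guarantee mutual exclusion in the way most people assume. Its safety rests on timing assumptions (bounded clock drift, bounded network delay, bounded process pauses), and it can fail when the network or the clocks violate them.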
Let’s see Redlock in action. Imagine you have a critical shared resource, like updating a user’s balance. Without a lock, two processes could read the same balance, both add a credit, and then write back, resulting in a lost credit.
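The lost credit can be reproduced without Redis at all. The sketch below is only an illustration: the dictionary stands in for the shared store, and the two "workers" are interleaved by hand so that both read before either writes. The names `read_balance` and `write_balance` are hypothetical helpers, not part of any library.

```python
# A plain dict stands in for the shared store (assumption for illustration).
balance = {"user:123": 100}


def read_balance(store, key):
    return store[key]


def write_balance(store, key, value):
    store[key] = value


# Both workers read the balance before either one writes it back.
seen_by_a = read_balance(balance, "user:123")
seen_by_b = read_balance(balance, "user:123")

# Each worker adds a credit of 10 to the value it saw.
write_balance(balance, "user:123", seen_by_a + 10)
write_balance(balance, "user:123", seen_by_b + 10)

final_balance = balance["user:123"]
print(final_balance)  # 110, not 120: one credit was overwritten
```

A lock is meant to force the second read to wait until the first write has finished.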
Here’s a simplified scenario using Python and redis-py:
import redis
import time
import uuid
# Connect to Redis instances (ideally 5 or more for Redlock)
# In a real scenario, these would be different machines/clusters
redis_clients = [
    redis.StrictRedis(host='redis1', port=6379, db=0),
    redis.StrictRedis(host='redis2', port=6379, db=0),
    redis.StrictRedis(host='redis3', port=6379, db=0),
    redis.StrictRedis(host='redis4', port=6379, db=0),
    redis.StrictRedis(host='redis5', port=6379, db=0),
]
lock_name = "resource:user:123:balance"
lock_value = str(uuid.uuid4()) # Unique identifier for this lock attempt
lock_timeout = 10 # Seconds the lock is held if not released
def acquire_lock(clients, key, value, ttl):
    """Attempts to acquire a Redlock. ttl is in seconds."""
    n = len(clients)
    # Monotonic clock: unaffected by wall-clock adjustments while we measure
    acquired_at = time.monotonic()
    acquired_count = 0
    for client in clients:
        try:
            # SET key value NX PX ttl (NX: only set if not exists, PX: milliseconds)
            if client.set(key, value, nx=True, px=int(ttl * 1000)):
                acquired_count += 1
        except (redis.exceptions.ConnectionError, redis.exceptions.TimeoutError):
            # Ignore nodes that are down or too slow to answer
            pass
    # The lock only counts if a majority was reached before the TTL ran out.
    # A fuller implementation would also subtract a clock-drift allowance.
    elapsed = time.monotonic() - acquired_at
    if acquired_count >= (n // 2) + 1 and elapsed < ttl:
        print(f"Lock acquired with value: {value}")
        return True, acquired_at
    else:
        # If we failed, release any locks we might have acquired
        # Use a Lua script for atomic check-and-delete
        # This prevents deleting a lock acquired by another client
        release_script = """
        if redis.call("get", KEYS[1]) == ARGV[1] then
            return redis.call("del", KEYS[1])
        else
            return 0
        end
        """
        for client in clients:
            try:
                client.eval(release_script, 1, key, value)
            except (redis.exceptions.ConnectionError, redis.exceptions.TimeoutError):
                pass
        print(f"Failed to acquire lock. Acquired {acquired_count}/{n} nodes.")
        return False, None
def release_lock(clients, key, value):
    """Releases a Redlock."""
    released_count = 0
    # Atomic check-and-delete: only remove the key if we still own it
    release_script = """
    if redis.call("get", KEYS[1]) == ARGV[1] then
        return redis.call("del", KEYS[1])
    else
        return 0
    end
    """
    for client in clients:
        try:
            if client.eval(release_script, 1, key, value):
                released_count += 1
        except redis.exceptions.ConnectionError:
            pass
    print(f"Released lock. Released on {released_count}/{len(clients)} nodes.")
# --- Example Usage ---
print("Attempting to acquire lock...")
locked, acquired_time = acquire_lock(redis_clients, lock_name, lock_value, lock_timeout)
if locked:
    try:
        print("Lock acquired! Performing critical operation...")
        # Simulate critical operation
        time.sleep(5)
        print("Operation complete. Releasing lock.")
    finally:
        # Release even if the critical operation raises
        release_lock(redis_clients, lock_name, lock_value)
else:
    print("Could not acquire lock. Another process is likely holding it.")
The core problem Redlock tries to solve is agreeing on a single lock holder across multiple independent Redis nodes. The algorithm tries to set the same key, with a unique value and a TTL, on each of N Redis instances. If the client succeeds on a majority, that is on at least floor(N/2) + 1 instances, and the time spent acquiring is shorter than the lock’s TTL, it’s considered to hold the lock for whatever time remains. Releasing the lock involves atomically deleting the key from each instance only if the value matches the unique identifier.
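The majority rule and the time check can be written down as plain arithmetic. The following is a minimal sketch, not a reference implementation: the function names are hypothetical, and the drift allowance of 1% of the TTL plus 2 ms is an assumption chosen for illustration rather than something stated above.

```python
def quorum(n: int) -> int:
    """Smallest number of nodes that forms a majority of n."""
    return n // 2 + 1


def remaining_validity_ms(ttl_ms: int, elapsed_ms: int, drift_percent: int = 1) -> int:
    """Time left on the lock once acquisition time and drift are subtracted.

    drift_percent and the extra 2 ms are assumed values for illustration.
    """
    drift_ms = ttl_ms * drift_percent // 100 + 2
    return ttl_ms - elapsed_ms - drift_ms


def lock_is_held(acquired: int, n: int, ttl_ms: int, elapsed_ms: int) -> bool:
    """True only with a majority AND some validity time left."""
    return acquired >= quorum(n) and remaining_validity_ms(ttl_ms, elapsed_ms) > 0


print(quorum(5))                          # 3
print(remaining_validity_ms(10_000, 150))  # 9748
print(lock_is_held(3, 5, 10_000, 150))     # True
print(lock_is_held(3, 5, 10_000, 9_950))   # False: majority reached too late
```

The last call shows why counting nodes is not enough: a majority that took almost the whole TTL to collect leaves no time in which the lock is actually usable.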
The key levers you control are:
redis_clients: The list of independent Redis instances. More instances increase the probability of success but also complexity. A minimum of 5 is generally recommended.
lock_name: The identifier for the resource you are protecting.
lock_value: A unique, random identifier generated by the client trying to acquire the lock. This is crucial for safe release.
lock_timeout (ttl): The time-to-live for the lock in seconds. If the client holding the lock crashes or gets stuck, the lock will eventually expire. This value must be longer than your critical operation’s expected duration.
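One way to keep these levers together and sanity-check them before use is a small settings object. This is a hypothetical sketch: `LockConfig` and its `check` method are invented for illustration, and the thresholds it warns about simply restate the guidance above.

```python
import secrets
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class LockConfig:
    """Hypothetical container for the four levers described above."""
    resource: str                      # lock_name
    ttl_seconds: float                 # lock_timeout
    node_count: int = 5                # len(redis_clients)
    # lock_value: random and unique per acquisition attempt
    token: str = field(default_factory=lambda: secrets.token_hex(16))

    def check(self, expected_work_seconds: float) -> List[str]:
        """Return human-readable warnings; an empty list means no obvious problem."""
        problems = []
        if self.node_count < 3:
            problems.append("fewer than 3 nodes cannot survive a single node failure")
        if self.ttl_seconds <= expected_work_seconds:
            problems.append("TTL is not longer than the expected critical section")
        return problems


config = LockConfig("resource:user:123:balance", ttl_seconds=10)
print(config.check(expected_work_seconds=5))   # []
risky = LockConfig("resource:user:123:balance", ttl_seconds=3, node_count=1)
print(len(risky.check(expected_work_seconds=5)))  # 2
```

Generating the token inside the object makes it harder to reuse a lock value across attempts by accident.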
The most surprising thing about Redlock is that its safety depends on timing and network behavior rather than on anything the algorithm can verify. It assumes that clocks on the Redis nodes advance at roughly the same rate, that network delays are small compared with the TTL, and that the client does not stall for long while holding the lock. When one of these assumptions breaks, two clients can each believe they hold the lock. For example, if the lock holder is paused for longer than the TTL, by a garbage-collection stall or a delayed network request, the keys expire, a second client acquires a majority, and the first client resumes its critical operation unaware that anything changed.
The critical part of Redlock’s safety, and where it often fails, is the time window between acquiring the lock on a majority of nodes and the lock expiring on any one of them. Suppose a client acquires the lock on nodes A, B, and C out of five. If C then loses the key early, because its clock jumped forward or because it restarted without having persisted the key, a second client can acquire C, D, and E and hold a majority at the same time as the first. The algorithm attempts to mitigate this by requiring a majority and using unique lock_values for release, but neither helps once a key disappears ahead of schedule.
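One way this window can be violated is a clock jump on a single node. The sketch below uses in-memory stand-ins rather than real Redis, so every class and function in it (`FakeNode`, `try_acquire`, `run_scenario`) is hypothetical; it only illustrates how two clients can each count a majority.

```python
class FakeNode:
    """In-memory stand-in for one Redis node, with its own clock in seconds."""

    def __init__(self):
        self.clock = 0.0
        self.store = {}  # key -> (value, expires_at on this node's clock)

    def set_nx(self, key, value, ttl):
        """Mimic SET key value NX with a TTL: succeed only if no live key exists."""
        entry = self.store.get(key)
        if entry is not None and entry[1] > self.clock:
            return False
        self.store[key] = (value, self.clock + ttl)
        return True


def try_acquire(nodes, reachable, key, value, ttl):
    """Set the key on each reachable node; succeed on a majority of all nodes."""
    acquired = sum(nodes[name].set_nx(key, value, ttl) for name in reachable)
    return acquired >= len(nodes) // 2 + 1


def run_scenario(clock_jump_on_c):
    nodes = {name: FakeNode() for name in "ABCDE"}
    # Client 1 can only reach A, B and C, which is still a majority of five.
    first = try_acquire(nodes, "ABC", "lock", "client-1", ttl=10)
    # Node C's clock moves forward while client 1 is still working.
    nodes["C"].clock += clock_jump_on_c
    # Client 2 can only reach C, D and E.
    second = try_acquire(nodes, "CDE", "lock", "client-2", ttl=10)
    return first, second


print(run_scenario(clock_jump_on_c=0))   # (True, False)
print(run_scenario(clock_jump_on_c=60))  # (True, True): both count a majority
```

Without the jump, node C refuses client 2 and it is left with only two nodes. With the jump, C treats the key as expired and hands client 2 its third node while A and B still consider client 1 the owner.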
The next concept you’ll run into is how to reliably measure the time elapsed for your critical operation to ensure your lock_timeout is sufficiently generous, and how to handle the case where a lock is acquired, but the client still crashes before releasing it.
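As a starting point for the measurement side, a client can track how much of the TTL is left with a monotonic clock and check it before each step of the critical operation. This is a minimal sketch under stated assumptions: `LockDeadline` is a hypothetical helper, and the one-second safety margin is an arbitrary illustrative value.

```python
import time


class LockDeadline:
    """Tracks how long a lock with the given TTL can still be trusted.

    The clock is injectable so the class can be exercised without sleeping.
    """

    def __init__(self, ttl_seconds, safety_margin_seconds=1.0, clock=time.monotonic):
        self._clock = clock
        # Stop trusting the lock a little before the TTL actually runs out.
        self._expires_at = clock() + ttl_seconds - safety_margin_seconds

    def remaining(self):
        """Seconds left before the lock should be treated as lost."""
        return self._expires_at - self._clock()

    def expired(self):
        return self.remaining() <= 0


# Demonstration with a fake clock instead of real waiting.
fake_now = [100.0]
deadline = LockDeadline(ttl_seconds=10, clock=lambda: fake_now[0])
print(deadline.remaining())  # 9.0
fake_now[0] = 109.5
print(deadline.expired())    # True
```

A check like this narrows the risk but does not remove it: the process can still be paused between calling `expired()` and performing the write. If the client crashes outright, nothing in the client helps, and the TTL on each node is what eventually frees the lock.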