The most surprising thing about latency is that it’s rarely about your code being "slow" in the traditional sense; it’s almost always about waiting.
Let’s see this in action. Imagine a simple web service that needs to fetch user data and then their recent orders.
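# Flask app
import time

import requests
from flask import Flask, jsonify

app = Flask(__name__)

USER_SERVICE_URL = "http://user-service:5001/users/{}"
ORDER_SERVICE_URL = "http://order-service:5002/users/{}/orders"
# Illustrative value. requests applies no timeout by default, so without one
# a hung upstream service would block this handler indefinitely.
REQUEST_TIMEOUT_SECONDS = 5


@app.route("/user-profile/<user_id>")
def get_user_profile(user_id):
    # perf_counter() is monotonic, so it suits duration measurements better than time.time().
    start_time = time.perf_counter()

    # Step 1: Fetch user data
    user_response = requests.get(USER_SERVICE_URL.format(user_id), timeout=REQUEST_TIMEOUT_SECONDS)
    user_response.raise_for_status()  # fail loudly instead of passing on an error body
    user_data = user_response.json()
    user_fetch_time = time.perf_counter()

    # Step 2: Fetch user orders (this does not start until step 1 has finished)
    orders_response = requests.get(ORDER_SERVICE_URL.format(user_id), timeout=REQUEST_TIMEOUT_SECONDS)
    orders_response.raise_for_status()
    orders_data = orders_response.json()
    orders_fetch_time = time.perf_counter()

    total_time = orders_fetch_time - start_time
    return jsonify({
        "user_data": user_data,
        "orders_data": orders_data,
        "timing": {
            "user_fetch_duration": user_fetch_time - start_time,
            "orders_fetch_duration": orders_fetch_time - user_fetch_time,
            "total_duration": total_time
        }
    })


if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)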
This looks straightforward: we make two network calls, one after the other. If user-service takes 100ms and order-service takes 200ms, the total latency for this request will be roughly 300ms (plus some overhead). What dominates the request is not the Python that issues the requests.get calls; it’s the sequence of waits for their responses.
The problem this endpoint solves is presenting a unified view of data that lives in different places. The mental model is simple: a request comes in, the handler orchestrates calls to other services, and then it returns the combined result. The levers you control are primarily how you orchestrate those calls and what data you request.
The orders request needs only the user_id, not anything from the user response, so the two calls are independent. If we want to optimize, we can make them concurrently instead of sequentially.
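# Flask app with concurrency
import concurrent.futures
import time

import requests
from flask import Flask, jsonify

app = Flask(__name__)

USER_SERVICE_URL = "http://user-service:5001/users/{}"
ORDER_SERVICE_URL = "http://order-service:5002/users/{}/orders"
# Illustrative value. requests applies no timeout by default, so without one
# a hung upstream service would block a worker thread indefinitely.
REQUEST_TIMEOUT_SECONDS = 5


def fetch_user_data(user_id):
    response = requests.get(USER_SERVICE_URL.format(user_id), timeout=REQUEST_TIMEOUT_SECONDS)
    response.raise_for_status()  # fail loudly instead of passing on an error body
    return response.json()


def fetch_orders_data(user_id):
    response = requests.get(ORDER_SERVICE_URL.format(user_id), timeout=REQUEST_TIMEOUT_SECONDS)
    response.raise_for_status()
    return response.json()


@app.route("/user-profile/<user_id>")
def get_user_profile(user_id):
    start_time = time.perf_counter()

    with concurrent.futures.ThreadPoolExecutor(max_workers=2) as executor:
        # Both requests are submitted before either result is read, so the waits overlap.
        user_future = executor.submit(fetch_user_data, user_id)
        orders_future = executor.submit(fetch_orders_data, user_id)

        # result() blocks until the call finishes and re-raises any exception it hit.
        user_data = user_future.result()
        orders_data = orders_future.result()

    total_time = time.perf_counter() - start_time
    return jsonify({
        "user_data": user_data,
        "orders_data": orders_data,
        "total_duration": total_time
    })


if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)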
Now, if user-service takes 100ms and order-service takes 200ms, the total latency will be closer to 200ms (the duration of the longest call), not 300ms. The concurrent.futures.ThreadPoolExecutor lets these I/O-bound operations (waiting for network responses) overlap: in CPython, a thread blocked on network I/O releases the GIL, so the other thread can send its request and wait at the same time.
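The sum-versus-max arithmetic can be checked without any real services. The sketch below is a simulation under stated assumptions: fake_fetch_user and fake_fetch_orders are hypothetical stand-ins that use time.sleep to imitate network waits of 100ms and 200ms, and the measured numbers will vary a little from machine to machine.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Assumed upstream latencies, taken from the 100ms / 200ms example in the text.
USER_DELAY_SECONDS = 0.1
ORDERS_DELAY_SECONDS = 0.2


def fake_fetch_user(user_id):
    # Hypothetical stand-in for the user-service call; sleep simulates the wait.
    time.sleep(USER_DELAY_SECONDS)
    return {"id": user_id}


def fake_fetch_orders(user_id):
    # Hypothetical stand-in for the order-service call.
    time.sleep(ORDERS_DELAY_SECONDS)
    return [{"order_id": 1, "user_id": user_id}]


def run_sequential(user_id):
    # The second wait starts only after the first has finished.
    start = time.perf_counter()
    fake_fetch_user(user_id)
    fake_fetch_orders(user_id)
    return time.perf_counter() - start


def run_concurrent(user_id):
    # Both waits are in flight at once, so the total tracks the longer one.
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=2) as executor:
        user_future = executor.submit(fake_fetch_user, user_id)
        orders_future = executor.submit(fake_fetch_orders, user_id)
        user_future.result()
        orders_future.result()
    return time.perf_counter() - start


sequential_seconds = run_sequential("42")
concurrent_seconds = run_concurrent("42")
print(f"sequential: {sequential_seconds:.3f}s")
print(f"concurrent: {concurrent_seconds:.3f}s")
```

The sequential figure should land near the sum of the two delays and the concurrent figure near the larger of the two.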
The most common place latency hides is in database queries or external API calls. Your Python code might execute in microseconds, but if it’s waiting for a slow database transaction or a poorly optimized external API, the overall request will be slow. Profilers (like cProfile in Python) and distributed tracing systems (like Jaeger or Zipkin) are crucial. A profiler shows where a single process spends its time, function by function; tracing shows where a request spends its time across services. In both cases, the answer is almost always the same: waiting for something else.
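As a rough illustration of what a profiler reveals, the sketch below profiles a hypothetical handler with cProfile. The function names are invented for the example, and time.sleep stands in for a slow query or API call. cProfile measures wall-clock time by default, so the wait is included in the report.

```python
import cProfile
import io
import pstats
import time


def simulated_upstream_call():
    # Hypothetical stand-in for a slow database query or external API call.
    time.sleep(0.2)


def build_response():
    # A small amount of real CPU work, for contrast with the wait above.
    return sum(i * i for i in range(10_000))


def handle_request():
    simulated_upstream_call()
    return build_response()


profiler = cProfile.Profile()
profiler.enable()
handle_request()
profiler.disable()

# Render the report into a string, sorted by cumulative time per function.
report = io.StringIO()
stats = pstats.Stats(profiler, stream=report)
stats.sort_stats("cumulative").print_stats(10)
profile_text = report.getvalue()
print(profile_text)
```

In the printed report, the time.sleep entry should account for nearly all of the cumulative time under handle_request, while build_response barely registers.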
The one thing most people don’t realize is that the overhead of the concurrency mechanism itself is often negligible compared to the I/O it’s parallelizing. Spinning up a few threads or asynchronous tasks to wait for network responses is far cheaper than the time spent waiting for those responses sequentially. The key is that these tasks are waiting, not computing, so they don’t consume significant CPU resources while blocked.
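One way to get a feel for that overhead is to time the thread pool with tasks that do nothing, then with tasks that wait. This is a minimal sketch, not a rigorous benchmark: the 100ms wait is an assumed upstream latency, no_op and wait_for_io are hypothetical task names, and the absolute numbers depend on the machine.

```python
import time
from concurrent.futures import ThreadPoolExecutor

SIMULATED_WAIT_SECONDS = 0.1  # assumed upstream latency, for comparison only


def no_op():
    # Does no work, so the measured time is the pool's own overhead.
    return None


def wait_for_io():
    # Simulates a task that is blocked on the network.
    time.sleep(SIMULATED_WAIT_SECONDS)


def time_two_tasks(task):
    # Create a pool, run two tasks, and wait for both, as the handler does.
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=2) as executor:
        futures = [executor.submit(task) for _ in range(2)]
        for future in futures:
            future.result()
    return time.perf_counter() - start


overhead_seconds = time_two_tasks(no_op)
io_seconds = time_two_tasks(wait_for_io)
print(f"pool overhead with no-op tasks: {overhead_seconds * 1000:.2f}ms")
print(f"pool with two 100ms waits:      {io_seconds * 1000:.2f}ms")
```

The no-op run measures only thread creation, scheduling, and teardown, which is typically a small fraction of the time the waiting tasks take.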
The next step is understanding how to trace requests across multiple services.