OpenAI’s async client lets you issue multiple requests concurrently instead of one after another, which can cut total wall-clock time substantially when you have a batch of independent calls.

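Here’s a quick peek at what that looks like in practice. The examples use the legacy openai Python SDK (versions before 1.0), which is where the acreate methods live. Instead of this: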

import openai
import time

openai.api_key = "YOUR_API_KEY"

def call_openai(prompt):
    response = openai.Completion.create(
        model="text-davinci-003",
        prompt=prompt,
        max_tokens=50
    )
    return response.choices[0].text.strip()

start_time = time.time()
prompts = ["Tell me a joke about cats.", "What's the capital of France?", "Write a short poem about the sea."]
results = [call_openai(p) for p in prompts]
end_time = time.time()

print(f"Sequential results: {results}")
print(f"Sequential time: {end_time - start_time:.2f} seconds")

You write this:

import openai
import time
import asyncio

openai.api_key = "YOUR_API_KEY"

async def call_openai_async(prompt):
    response = await openai.Completion.acreate(  # Note the 'a' for async
        model="text-davinci-003",
        prompt=prompt,
        max_tokens=50
    )
    return response.choices[0].text.strip()

async def main():
    start_time = time.time()
    prompts = ["Tell me a joke about cats.", "What's the capital of France?", "Write a short poem about the sea."]
    tasks = [call_openai_async(p) for p in prompts]
    results = await asyncio.gather(*tasks)
    end_time = time.time()

    print(f"Async results: {results}")
    print(f"Async time: {end_time - start_time:.2f} seconds")

if __name__ == "__main__":
    asyncio.run(main())

The core idea is asyncio.gather. It takes awaitable objects (your async API calls) as positional arguments, which is why the list is unpacked with *tasks, and runs them concurrently, returning results in the same order as the inputs. When one call is waiting for OpenAI’s servers to respond, Python doesn’t just sit idle; it switches to another call that’s ready to run. This is cooperative multitasking: your program voluntarily yields control when it’s waiting for I/O.

This isn’t just about making your code look fancy; it directly addresses the latency inherent in network requests. Each call to the OpenAI API involves sending data over the internet, waiting for OpenAI’s servers to process it, and then receiving the response. If you have 100 requests, doing them one after another means you’re waiting for the total time of all 100 requests. With asyncio.gather, you’re primarily waiting for the longest single request to complete, plus a tiny overhead for task switching. This can be a dramatic speedup, often reducing execution time by an order of magnitude or more, depending on the number of requests and their individual latencies.
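To see the "sum of latencies versus longest latency" effect without touching the network, here is a minimal sketch that swaps the real API call for asyncio.sleep. The function fake_api_call and the latency values are hypothetical stand-ins, not part of the OpenAI SDK.

```python
import asyncio
import time

# Hypothetical per-request latencies in seconds (stand-ins for real API calls).
SIMULATED_LATENCIES = [0.05, 0.10, 0.15]

async def fake_api_call(latency):
    # asyncio.sleep yields to the event loop, much like waiting on a network response.
    await asyncio.sleep(latency)
    return f"done after {latency:.2f}s"

async def run_sequential(latencies):
    start = time.perf_counter()
    # Each call is awaited to completion before the next one starts.
    results = [await fake_api_call(latency) for latency in latencies]
    return results, time.perf_counter() - start

async def run_concurrent(latencies):
    start = time.perf_counter()
    # All calls are started together; gather waits for every one of them.
    results = await asyncio.gather(*(fake_api_call(latency) for latency in latencies))
    return list(results), time.perf_counter() - start

async def compare():
    seq_results, seq_time = await run_sequential(SIMULATED_LATENCIES)
    con_results, con_time = await run_concurrent(SIMULATED_LATENCIES)
    return seq_results, seq_time, con_results, con_time

if __name__ == "__main__":
    seq_results, seq_time, con_results, con_time = asyncio.run(compare())
    print(f"Sequential: {seq_time:.2f}s (roughly the sum of the latencies)")
    print(f"Concurrent: {con_time:.2f}s (roughly the longest latency)")
```

Both versions return the same results in the same order; only the elapsed time differs.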

The openai.Completion.acreate (and similar acreate methods for other endpoints like ChatCompletion.acreate) is the asynchronous version of the standard synchronous create method. It returns an awaitable object. asyncio.gather then orchestrates these awaitables. It doesn’t magically make the API calls themselves faster; it makes the waiting time productive by allowing other tasks to run. Think of it like a chef who can chop vegetables while a pot of water is boiling, instead of just staring at the pot.
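The "returns an awaitable object" part is easy to miss: calling an async function does not run its body. The sketch below illustrates this with a hypothetical fake_acreate function (a stand-in, not the SDK method) that records when its body actually starts.

```python
import asyncio
import inspect

async def fake_acreate(prompt, log):
    # Hypothetical stand-in for an async API method; no network is involved.
    log.append(prompt)          # runs only once the coroutine is awaited
    await asyncio.sleep(0)      # yield to the event loop once
    return f"response to {prompt!r}"

async def main():
    log = []
    coro = fake_acreate("hello", log)       # creates a coroutine object only
    is_coroutine = inspect.iscoroutine(coro)
    started_before_await = list(log)        # still empty: the body has not run yet
    result = await coro                     # now the body executes
    return is_coroutine, started_before_await, list(log), result

if __name__ == "__main__":
    print(asyncio.run(main()))
```

Nothing is sent until the coroutine is awaited or handed to something like asyncio.gather, which is why building a list of calls first and gathering them afterwards works.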

The fundamental problem this solves is the inefficiency of sequential I/O-bound operations. If your program spends most of its time waiting for external services (like APIs, databases, or network sockets), a synchronous approach is like having one person do all the waiting. An asynchronous approach, using asyncio, allows one process to manage many concurrent waiting operations efficiently. You’re not truly running multiple requests at the exact same instant on different CPU cores (that’s multiprocessing), but you’re making the most of the time when the CPU would otherwise be idle, waiting for network responses.
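One consequence of everything running in a single thread is that a blocking call inside a coroutine stalls every other task. The following sketch, using hypothetical helper names, contrasts time.sleep (blocking) with asyncio.sleep (cooperative) under the same gather call.

```python
import asyncio
import time

async def blocking_wait(delay):
    # time.sleep never yields, so the event loop cannot run anything else.
    time.sleep(delay)

async def cooperative_wait(delay):
    # asyncio.sleep yields, so the other waits overlap with this one.
    await asyncio.sleep(delay)

async def timed_gather(worker, delay, count):
    start = time.perf_counter()
    await asyncio.gather(*(worker(delay) for _ in range(count)))
    return time.perf_counter() - start

async def main():
    blocking = await timed_gather(blocking_wait, 0.05, 4)
    cooperative = await timed_gather(cooperative_wait, 0.05, 4)
    return blocking, cooperative

if __name__ == "__main__":
    blocking, cooperative = asyncio.run(main())
    print(f"Blocking waits:    {blocking:.2f}s")
    print(f"Cooperative waits: {cooperative:.2f}s")
```

The same applies to synchronous HTTP or SDK calls placed inside an async function: they run, but they give none of the concurrency benefit.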

When you pass multiple tasks to asyncio.gather, it schedules them to run. If task1 makes a request and then hits await, the asyncio event loop can immediately switch to task2 and let it make its request. If task2 also hits await, the loop can switch back to task1 if its response has arrived, or to task3, and so on. This interleaving of execution is what makes the overall process so much faster when dealing with many I/O operations.
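The interleaving described above can be made visible by logging events as they happen. This is a small sketch with simulated latencies; fake_request and the task names are illustrative only.

```python
import asyncio

async def fake_request(name, latency, events):
    events.append(f"{name}: request sent")
    await asyncio.sleep(latency)   # control returns to the event loop here
    events.append(f"{name}: response received")
    return name

async def main():
    events = []
    await asyncio.gather(
        fake_request("task1", 0.06, events),
        fake_request("task2", 0.02, events),
        fake_request("task3", 0.04, events),
    )
    return events

if __name__ == "__main__":
    for event in asyncio.run(main()):
        print(event)
```

All three requests are sent before any response arrives, and the responses are then handled in order of completion (task2, task3, task1) rather than in the order the tasks were started.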

The trick to understanding how this achieves speed is realizing that the bottleneck for many API interactions isn’t CPU processing power, but network latency and server response time. A single API call might take 500ms to 2 seconds. If you have 100 such calls, a synchronous program would take 50 to 200 seconds. An asynchronous program, however, can initiate all 100 calls almost simultaneously. While all 100 are "in flight", the event loop simply waits until the operating system reports that a response has arrived, then resumes the task that was waiting on it. The total time becomes roughly the latency of the slowest single call plus a small overhead for managing the concurrent tasks. Provided rate limits don’t force you to throttle, this can bring the total time down to just a few seconds.
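The arithmetic behind those numbers fits in a few lines. This is an idealized model that assumes every call can be in flight at once and ignores rate limits and connection limits; estimate_seconds is a hypothetical helper written for this illustration.

```python
def estimate_seconds(latencies, overhead=0.0):
    # Sequential: each call waits for the previous one, so latencies add up.
    sequential = sum(latencies)
    # Concurrent (idealized): the waits overlap, so the slowest call dominates.
    concurrent = max(latencies) + overhead
    return sequential, concurrent

fast_case = estimate_seconds([0.5] * 100)   # 100 calls at 500 ms each
slow_case = estimate_seconds([2.0] * 100)   # 100 calls at 2 s each

print(fast_case)  # (50.0, 0.5)
print(slow_case)  # (200.0, 2.0)
```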

The primary levers you control here are:

  1. The number of concurrent tasks: asyncio.gather can handle hundreds or even thousands of tasks. You’ll want to balance this against potential API rate limits and the memory overhead of managing many open connections.
  2. The structure of your async functions: Keeping await calls strategically placed ensures that the event loop has opportunities to switch between tasks.
  3. Error handling: asyncio.gather can be configured to handle exceptions. By default, the first exception raised by any task propagates immediately to the code awaiting gather, while the remaining tasks are not cancelled and keep running in the background. With return_exceptions=True, gather instead returns exceptions as results alongside successful values, so you can process successes and failures together.
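Two of these levers, capping concurrency and collecting failures, can be combined in a short sketch. It assumes a hypothetical fake_api_call that fails for one input on purpose, and an arbitrary cap of 2; in real code you would pick the cap based on your rate limit.

```python
import asyncio

MAX_CONCURRENCY = 2  # hypothetical cap; tune this to your rate limit

async def fake_api_call(prompt):
    # Hypothetical stand-in for a real request; fails for one input on purpose.
    await asyncio.sleep(0.01)
    if prompt == "bad":
        raise ValueError(f"simulated failure for {prompt!r}")
    return prompt.upper()

async def limited_call(semaphore, prompt):
    # At most MAX_CONCURRENCY calls are inside this block at any moment.
    async with semaphore:
        return await fake_api_call(prompt)

async def main(prompts):
    semaphore = asyncio.Semaphore(MAX_CONCURRENCY)
    outcomes = await asyncio.gather(
        *(limited_call(semaphore, p) for p in prompts),
        return_exceptions=True,  # exceptions come back as values, in input order
    )
    successes = [o for o in outcomes if not isinstance(o, Exception)]
    failures = [o for o in outcomes if isinstance(o, Exception)]
    return successes, failures

if __name__ == "__main__":
    successes, failures = asyncio.run(main(["a", "bad", "b"]))
    print("Succeeded:", successes)
    print("Failed:", failures)
```

Because results keep the input order, you can also zip the outcomes with the original prompts to see which request each failure belongs to.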

One subtle point is that asyncio doesn’t bypass the underlying network stack or the OpenAI API’s own processing time. It’s about your program’s efficiency in managing multiple I/O operations. If the OpenAI API itself is slow to respond to all requests, asyncio can’t make those individual responses faster, but it can ensure your program isn’t wasting time idly waiting for one response while another could have already been received. You’re essentially overlapping the waiting periods of many requests.

The next step after mastering concurrent requests is handling rate limiting and implementing robust retry strategies for transient network issues.
