Whisper doesn’t just transcribe audio; it understands it well enough to infer missing punctuation and capitalization, making its output surprisingly human-readable without any post-processing.

Let’s see Whisper in action. Imagine you have a short audio file, audio.mp3, containing a spoken sentence. You can send this directly to the Whisper API.

import openai

# Assuming you have your API key set as an environment variable
# openai.api_key = "YOUR_API_KEY"

audio_file = open("audio.mp3", "rb")
transcript = openai.Audio.transcribe("whisper-1", audio_file)

print(transcript["text"])

If audio.mp3 contains "hello world how are you", the output will be: Hello world, how are you?

Notice how Whisper added the capitalization and the question mark. This isn’t just a simple character-by-character mapping; it’s a complex neural network processing the acoustic features and linguistic context.

The Problem Whisper Solves

Traditional Automatic Speech Recognition (ASR) systems often produce raw, unpunctuated text that requires significant manual cleanup. This is a bottleneck for applications like:

  • Meeting transcription: Turning hours of discussion into searchable, actionable notes.
  • Content creation: Generating captions for videos or podcasts automatically.
  • Accessibility: Providing text alternatives for audio content.
  • Voice assistants: Understanding user commands and queries.

Whisper’s ability to deliver polished, ready-to-use text significantly reduces the effort and cost associated with these tasks.

How Whisper Works Internally

Whisper is a large, transformer-based neural network trained on a massive and diverse dataset of 680,000 hours of multilingual and multitask supervised data collected from the web. This scale allows it to learn robust representations of spoken language.

The core of Whisper’s architecture is an encoder-decoder transformer.

  • Encoder: This part processes the audio input. It converts the raw audio waveform into mel-frequency cepstral coefficients (MFCCs) or similar acoustic features. These features are then fed into the transformer encoder, which learns to represent the temporal and spectral characteristics of the speech.
  • Decoder: This part generates the text output. It takes the encoded audio representation and, conditioned on previously generated tokens (words or sub-word units), predicts the next token. This autoregressive process, common in sequence-to-sequence models, allows Whisper to build up the transcription word by word.

Crucially, Whisper was trained with a multitask objective. This means it wasn’t just trained to transcribe, but also to translate speech to English and to perform language identification. This multitask training contributes to its robustness and its ability to handle various accents, background noise, and even code-switching (mixing languages).

The API endpoint openai.Audio.transcribe is a simple interface to this powerful model. You provide the model name (whisper-1 is the current version) and the audio file. Whisper handles the rest, returning a JSON object containing the transcribed text.

Controlling Whisper’s Output

While the transcribe endpoint is straightforward, you can influence its behavior with a few parameters:

  • prompt: You can provide an optional text prompt. This is particularly useful for guiding the model to transcribe specific jargon, names, or to ensure consistent formatting. For example, if you expect a lot of technical terms, you could provide a prompt with those terms to help Whisper recognize them.
  • response_format: You can request the output in different formats, such as text, srt, verbose_json, or vtt. srt and vtt are common subtitle formats that include timestamps, which are invaluable for aligning text with audio.
  • temperature: This parameter controls the randomness of the output. A temperature of 0 makes the output deterministic (always the same for the same input), while higher values increase randomness, potentially leading to more creative but less accurate transcriptions. For standard transcription, a low temperature (e.g., 0 or 0.2) is usually preferred.
  • language: You can specify the language of the audio. While Whisper is good at auto-detecting languages, explicitly setting it can improve accuracy, especially for shorter audio clips or when dealing with dialects.

Let’s see an example using response_format='srt' to get timestamps:

import openai

audio_file = open("audio.mp3", "rb")
transcript_srt = openai.Audio.transcribe("whisper-1", audio_file, response_format="srt")

print(transcript_srt)

This would output something like:

1
00:00:00,500 --> 00:00:02,100
Hello world,

2
00:00:02,100 --> 00:00:03,800
how are you?

The model’s ability to infer sentence boundaries and even pauses is remarkable. It’s not just predicting the next phoneme or word; it’s understanding the flow of conversation. This is achieved through its extensive training data, which includes diverse audio with corresponding text, allowing the transformer to learn alignments between acoustic signals and linguistic units, including prosodic cues like intonation and rhythm that signal sentence endings or questions.

The next hurdle you’ll likely encounter is handling very long audio files efficiently and cost-effectively.

Want structured learning?

Take the full Openai-api course →