How to Transcribe Audio with Whisper: Python API and Local Setup Guide

Published 2024-10-09

Last updated 2026-06-15

To transcribe audio with Whisper, install the OpenAI Python library, pass your audio file to the client.audio.transcriptions.create() method with model whisper-1, and receive the text output in JSON or plain text format. Whisper handles 50+ languages with a word error rate as low as 7.75%, and the API costs just $0.006 per minute.

What you'll need:

Python 3.8+ installed on your machine

An OpenAI API key (for the API method) or a CUDA-compatible GPU (for local installation)

Audio files in mp3, mp4, m4a, wav, or webm format (under 25 MB per file)

Time estimate: 15-30 minutes for setup, seconds per transcription

Skill level: Beginner-friendly (API method) / Intermediate (local installation)

Quick overview of the process:

Choose your method — OpenAI API ($0.006/min, no GPU) or local GitHub installation (free, needs GPU)
Set up your environment — Install Python dependencies and configure API keys or GPU drivers
Run your first transcription — Send an audio file and receive text output
Optimize accuracy — Use prompts, audio preprocessing, and model selection for better results
Handle large files — Split audio over 25 MB with PyDub and maintain context
Translate multilingual audio — Convert foreign-language audio to English text
Scale for production — Batch processing, error handling, and cost management

What Is OpenAI Whisper and How Has It Evolved?

OpenAI Whisper model evolution timeline from 2022 to 2026 showing key version releases

OpenAI Whisper is an open-source automatic speech recognition (ASR) system trained on 680,000 hours of multilingual audio data collected from the web. Unlike older speech-to-text systems that required language-specific training, Whisper uses a single Transformer encoder-decoder architecture that processes audio across 50+ languages without switching models.

According to OpenAI's Whisper research page, the model splits audio into 30-second chunks, converts each chunk into a log-Mel spectrogram, and uses special tokens for language identification, phrase-level timestamps, and translation. This architecture is why Whisper handles background noise, accents, and technical vocabulary better than most alternatives.

How the model has changed since 2022

The original Whisper v1 launched in September 2022. Since then, OpenAI has released several improved versions:

Model Version	Release	Parameters	Key Improvement
Whisper v1	Sep 2022	Up to 1.55B	Initial release, 680K hours training
Large-v2	Dec 2022	1.55B	Reduced hallucinations, better accuracy
Large-v3	Nov 2023	1.55B	Lower word error rate, expanded language support
Large-v3-turbo	2024	809M	According to CompaniesHistory, achieves 7.75% WER with faster inference

The Whisper model is fully open-source. You can access the code, weights, and documentation on the official Whisper GitHub repository.

Who uses Whisper in 2026

Whisper isn't only used by developers anymore. According to Healthcare Brew, about 30,000 clinicians and 40 health systems now use Whisper-powered tools to document patient interactions. Content creators, podcast producers, and educators use it too, mostly for subtitles, searchable transcripts, and content repurposing.

According to AboutChromebooks, Whisper recorded 4.1 million monthly downloads on Hugging Face in December 2025, making it the most downloaded open-source speech recognition model available.

How to Choose Between Whisper API, Local Installation, and Hosted Tools

Comparison of Whisper deployment options API versus local installation versus hosted tools

Before writing any code, you need to decide how you'll run Whisper. There are three main approaches, each with different tradeoffs in cost, privacy, and flexibility.

Factor	OpenAI API	Local (GitHub)	Hosted Tools
Cost	According to BrassTranscripts, $0.006/minute ($0.36/hour)	Free (electricity + GPU)	$10-50/month subscription
Setup time	5 minutes	30-60 minutes	2 minutes
GPU required	No	Yes (CUDA recommended)	No
File size limit	25 MB per request	Unlimited	Varies
Privacy	Audio sent to OpenAI servers	Stays on your machine	Varies by provider
Best for	Quick projects, low volume	Privacy-sensitive work, heavy usage	Non-technical users

My recommendation, after building TranscribeTube's transcription pipeline: Start with the API for prototyping. The $0.006/minute cost is hard to beat for small-to-medium workloads. Switch to local installation only if you process 100+ hours monthly or can't send audio to external servers. If you want the API simplicity without managing OpenAI keys, rate limits, and retries yourself, a hosted speech-to-text API wraps that pipeline for you. And if you don't want to write code at all, try TranscribeTube's audio to text converter for a browser-based experience.

Where Does TranscribeTube Fit Next to Running Whisper Yourself?

I built TranscribeTube on Whisper-class models, but the whole point is that you never run Whisper. No GPU to rent, nothing to pip install, no chunking your audio around the 25 MB API cap, no retry-and-backoff code when rate limits hit. That plumbing — the same plumbing this guide walks you through — sits inside our hosted pipeline. On top of it we add the parts raw Whisper leaves out: automatic speaker diarization, an edit-first editor so you fix the transcript instead of re-running the model, AI summaries, and SRT/TXT export with timestamps.

The build decision came down to one thing I kept seeing: most people don't actually want to manage a model. They want the transcript. So we made the model a detail you don't have to think about.

The honest tradeoff: if you need real control of the model — custom fine-tuning, on-device or offline processing, hand-tuning word-level timestamps — then running Whisper yourself, the way Step 3 shows, is the right call. A hosted layer like ours trades that control for convenience, and that's a deliberate trade, not an accident. For raw-model control, self-hosting wins. For "I just need clean text with speakers labeled," the hosted route saves you a weekend of setup.

Step 1: Set Up Your Python Environment for the Whisper API

Setting up Python environment for Whisper audio transcription

This step gets your development environment ready to call the OpenAI Whisper API. You'll install the required Python package and configure your API key so transcription calls authenticate properly.

Detailed instructions

Open your terminal and create a new project directory:

mkdir whisper-transcription && cd whisper-transcription

Create and activate a virtual environment:

python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

Install the OpenAI Python library:
```
pip install openai
```
Set your API key as an environment variable:
```
export OPENAI_API_KEY="sk-your-api-key-here"
```
You can get an API key from OpenAI's API keys page.

Verify the installation works:

from openai import OpenAI
client = OpenAI()
print("OpenAI client initialized successfully")

What you should see after completing this step

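Running the verification script prints "OpenAI client initialized successfully" without errors. If you see an OpenAIError saying the api_key option must be set, the OPENAI_API_KEY variable isn't visible to your script. An invalid key passes this check and only raises AuthenticationError on the first real API request. If you see ModuleNotFoundError, the openai package didn't install correctly, or you're running outside the virtual environment.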

Common mistakes and troubleshooting

API key not found: The most common error is forgetting to export the key in your current shell session. Adding it to ~/.bashrc or ~/.zshrc makes it persist across sessions. Don't hardcode the key in your Python files -- it ends up in version control.
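A fail-fast check at the top of a script makes a missing key obvious before any audio is uploaded. This is a minimal sketch using only the standard library; the helper name require_api_key is hypothetical and not part of the OpenAI SDK.

```python
import os


def require_api_key(environ=os.environ, name="OPENAI_API_KEY"):
    """Return the API key from the environment, or raise a readable error.

    require_api_key is an illustrative helper, not an SDK function.
    """
    key = environ.get(name, "").strip()
    if not key:
        raise RuntimeError(
            f"{name} is not set. Export it in this shell or load it from a .env file."
        )
    return key
```

Calling require_api_key() once at startup turns a confusing failure deep inside a batch job into a single clear message.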

Wrong Python version: Whisper's API client requires Python 3.8+. Run python3 --version to confirm. If you're on 3.7 or older, upgrade before continuing.

Pro tip: After setting up hundreds of transcription pipelines, I always create a .env file with python-dotenv instead of exporting environment variables manually. It keeps API keys out of shell history and makes the project portable across machines.

Step 2: Transcribe Your First Audio File with the Whisper API

Whisper training data distribution across languages

This is where you'll actually convert speech to text. The Whisper API accepts audio files up to 25 MB and returns the transcription as JSON or plain text, depending on your response_format setting.

Detailed instructions

Place your audio file in the project directory. Whisper supports mp3, mp4, mpeg, mpga, m4a, wav, and webm formats.
Run this Python script to transcribe:

from openai import OpenAI

client = OpenAI()

# Open in binary mode; the with-block closes the file after the upload
with open("your_audio.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file
    )

print(transcript.text)

The default response comes as a JSON object:

{
  "text": "Your transcribed text appears here, word for word from the audio..."
}

If you prefer plain text without JSON wrapping, add the response_format parameter:

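transcript = client.audio.transcriptions.create(
    model="whisper-1",
    file=audio_file,
    response_format="text"
)
# With response_format="text" the SDK returns a plain string, not an object with .text
print(transcript)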

For timestamped output (useful for subtitle generation), use the verbose_json format:

transcript = client.audio.transcriptions.create(
    model="whisper-1",
    file=audio_file,
    response_format="verbose_json",
    timestamp_granularities=["segment"]
)

This returns segments with start and end timestamps -- exactly what you need if you're building a YouTube subtitle generator or creating SRT files.
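Turning those segments into an SRT file is mostly timestamp formatting. The sketch below assumes each segment is available as a plain dict with start, end, and text keys (how you get dicts from the response object depends on your SDK version); the function names are illustrative.

```python
def srt_timestamp(seconds):
    """Format a time in seconds as an SRT timestamp: HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments):
    """Build SRT text from dicts with 'start', 'end' (seconds) and 'text'."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        start = srt_timestamp(seg["start"])
        end = srt_timestamp(seg["end"])
        # Each cue: running number, time range, text, then a blank line
        blocks.append(f"{index}\n{start} --> {end}\n{seg['text'].strip()}\n")
    return "\n".join(blocks)
```

Writing the returned string to a .srt file gives you subtitles that most video players and editors can load.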

What you should see after completing this step

The script prints the full transcription text to your terminal. For a 5-minute audio file, the API typically responds within 10-30 seconds. If you used verbose_json, you'll see an array of segments with start, end, and text fields.

Common mistakes and troubleshooting

File too large: If you get a 413 Request Entity Too Large error, your file exceeds 25 MB. Jump to Step 4 for splitting instructions. For a detailed breakdown of file limits, read our guide on Whisper API limits.

Wrong file path: The open() function needs the exact path to your file. Use absolute paths if the script runs from a different directory than the audio file.

Pro tip: I've transcribed thousands of files through this API, and one thing catches people off guard: the API doesn't return speaker labels. If you need speaker identification (who said what), you'll need to pair Whisper with a speaker diarization tool. Whisper handles the speech-to-text part; diarization handles the "who" part.

Step 3: Run Whisper Locally from GitHub (Free, No API Costs)

Audio transcription with Whisper API showing code output

Running Whisper locally means no API costs, no file size limits, and no audio data leaving your machine. The tradeoff is that you need a GPU with CUDA support for reasonable speed, though Whisper does run on CPU (just much slower).

Detailed instructions

Install Whisper from the GitHub repository:
```
pip install -U openai-whisper
```

Install FFmpeg (required for audio processing):

# macOS
brew install ffmpeg

# Ubuntu/Debian
sudo apt update && sudo apt install ffmpeg

# Windows (with Chocolatey)
choco install ffmpeg

Transcribe from the command line:

whisper your_audio.mp3 --model base --language en

Or use Whisper in Python:

import whisper

model = whisper.load_model("base")
result = model.transcribe("your_audio.mp3")
print(result["text"])

Choose the right model size for your hardware:

Comparison of Whisper model sizes showing parameters speed and accuracy metrics

Model	Parameters	Required VRAM	Relative Speed	English WER
tiny	39M	~1 GB	~32x	~14.8%
base	74M	~1 GB	~16x	~11.5%
small	244M	~2 GB	~6x	~9.8%
medium	769M	~5 GB	~2x	~8.4%
large-v3	1.55B	~10 GB	1x	~7.75%

According to Northflank's 2026 STT benchmark, the large-v3-turbo model achieves a 7.75% word error rate on mixed benchmarks while running significantly faster than the full large-v3.
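If you deploy to machines with different GPUs, you can encode the table above as a small lookup. This is a sketch that takes the approximate VRAM figures from the table at face value; the helper name is hypothetical, and real memory use also depends on audio length and decoding options.

```python
# (model name, approximate VRAM in GB), ordered smallest to largest,
# copied from the model size table above
MODEL_VRAM_GB = [
    ("tiny", 1),
    ("base", 1),
    ("small", 2),
    ("medium", 5),
    ("large-v3", 10),
]


def largest_model_that_fits(vram_gb):
    """Return the largest model whose approximate VRAM need fits, or None."""
    fitting = [name for name, need in MODEL_VRAM_GB if need <= vram_gb]
    return fitting[-1] if fitting else None
```

The returned name can be passed straight to whisper.load_model(). Leave yourself some headroom in practice, since the figures are approximate.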

What you should see after completing this step

When you use the command-line tool, Whisper creates multiple output files in the current directory: a .txt file with the raw transcript, an .srt file for subtitles, a .vtt file for web subtitles, a .tsv with timestamps, and a .json with full metadata. The Python snippet only prints the text and writes no files. The first run downloads the model weights (ranging from 75 MB for tiny to 3 GB for large-v3).

Common mistakes and troubleshooting

CUDA not available: If Whisper falls back to CPU, transcription speed drops 5-10x. Run python3 -c "import torch; print(torch.cuda.is_available())" to check. On older laptops with AMD GPUs, Whisper still works -- it's just slower.

Out of memory: The large-v3 model needs ~10 GB of VRAM. If your GPU has less, drop down to medium or small. The accuracy difference between small (9.8% WER) and large-v3 (7.75% WER) may not matter for your use case.

Pro tip: I run the small model for most of my daily transcription work. It's 6x faster than large-v3, and the roughly 2-point WER gap rarely matters for interview transcripts or meeting notes. I only switch to large-v3 for content that gets published or when handling heavy-accent audio.

Step 4: Handle Audio Files Larger Than 25 MB

Handling long audio file transcription with Whisper

Whisper's API enforces a 25 MB upload limit. A typical MP3 at 128 kbps hits this limit around 25-30 minutes of audio. Longer recordings (lectures, meetings, podcast episodes) need to be split into smaller chunks before transcription.
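The duration that fits under the limit follows directly from the bitrate. The sketch below assumes constant-bitrate audio and treats 1 MB as 1,000,000 bytes, which is an assumption about how the limit is counted; the function name is illustrative.

```python
def max_minutes(bitrate_kbps, limit_mb=25):
    """Approximate minutes of constant-bitrate audio that fit in limit_mb.

    Assumes 1 MB = 1,000,000 bytes and ignores container overhead.
    """
    bytes_per_second = bitrate_kbps * 1000 / 8
    return limit_mb * 1_000_000 / bytes_per_second / 60


for kbps in (64, 128, 192):
    print(f"{kbps} kbps: about {max_minutes(kbps):.0f} minutes")
```

At 128 kbps this works out to about 26 minutes, and halving the bitrate doubles the duration that fits.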

Detailed instructions

Install PyDub for audio manipulation:
```
pip install pydub
```

Split your audio into 10-minute segments:

from pydub import AudioSegment

audio = AudioSegment.from_mp3("long_recording.mp3")
ten_minutes = 10 * 60 * 1000  # milliseconds

for i in range(0, len(audio), ten_minutes):
    segment = audio[i:i + ten_minutes]
    segment.export(f"segment_{i // ten_minutes}.mp3", format="mp3")
    print(f"Exported segment_{i // ten_minutes}.mp3")

Transcribe each segment and combine the results:

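from openai import OpenAI
import os
import re

client = OpenAI()
full_transcript = []

def segment_index(filename):
    # Sort numerically so segment_10.mp3 comes after segment_2.mp3
    return int(re.search(r"segment_(\d+)\.mp3$", filename).group(1))

segment_files = sorted(
    (f for f in os.listdir(".") if re.fullmatch(r"segment_\d+\.mp3", f)),
    key=segment_index,
)

for filename in segment_files:
    with open(filename, "rb") as f:
        result = client.audio.transcriptions.create(
            model="whisper-1",
            file=f
        )
    full_transcript.append(result.text)

complete_text = " ".join(full_transcript)
print(complete_text)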

Use the previous segment's transcript as a prompt for context continuity:

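# segment_files: list of segment paths in playback order
# (sort numerically, so segment_10.mp3 follows segment_9.mp3, not segment_1.mp3)
full_transcript = []
previous_text = ""
for filename in segment_files:
    with open(filename, "rb") as f:
        result = client.audio.transcriptions.create(
            model="whisper-1",
            file=f,
            prompt=previous_text[-200:]  # Last 200 chars for context
        )
    full_transcript.append(result.text)
    previous_text = result.text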

What you should see after completing this step

The script creates numbered segment files (segment_0.mp3, segment_1.mp3, etc.) in your working directory, then prints the combined transcript. A 2-hour lecture typically produces 12 segments at 10 minutes each.

Common mistakes and troubleshooting

Cutting mid-sentence: The simplest approach (fixed 10-minute chunks) sometimes splits right in the middle of a word. For production use, detect silence points with pydub.silence.detect_silence() and split there instead. This adds a few lines of code but significantly improves transcript coherence.
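One way to pick those split points is to snap each chunk boundary back to the nearest earlier pause. The sketch below is pure Python and assumes you already have a list of [start_ms, end_ms] silence ranges, which is the shape pydub.silence.detect_silence() returns; choose_cut_points is a hypothetical helper, not a PyDub function.

```python
def choose_cut_points(silences, total_ms, target_ms=10 * 60 * 1000):
    """Pick cut points (in ms) so no chunk is longer than target_ms.

    Each cut lands on the midpoint of the last silence that falls inside
    the current chunk's window. If the window has no silence at all, the
    cut falls back to a hard boundary at target_ms.
    """
    midpoints = sorted((start + end) // 2 for start, end in silences)
    cuts = []
    chunk_start = 0
    while total_ms - chunk_start > target_ms:
        limit = chunk_start + target_ms
        candidates = [m for m in midpoints if chunk_start < m <= limit]
        cut = candidates[-1] if candidates else limit
        cuts.append(cut)
        chunk_start = cut
    return cuts
```

With the cut points in hand, slicing works as in the fixed-interval version: pair each cut with the next one and export audio[start:end] for every pair, using 0 and len(audio) as the outer bounds.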

Missing context between segments: Without passing the previous transcript as a prompt, Whisper treats each segment independently. This can cause inconsistencies with names, technical terms, or ongoing topics. The prompt approach shown above (passing the tail of the previous segment's transcript as the prompt) reduces this.

Pro tip: For podcast transcription, I always split at detected silence points rather than fixed intervals. It takes an extra 30 seconds of processing but eliminates those annoying mid-word breaks that require manual cleanup. If you're transcribing podcasts regularly, check out our dedicated podcast transcription tool.

Step 5: Improve Transcription Accuracy with Prompting

Improving transcription accuracy with Whisper prompting techniques

Whisper's prompt parameter is a hint for the model. It helps the model recognize specific terms and keep formatting consistent. This is especially useful for technical content, brand names, and domain-specific jargon that Whisper might otherwise misspell or hallucinate.

Whisper model architecture diagram showing encoder-decoder transformer structure

Detailed instructions

Correct domain-specific terms: Include specialized vocabulary in your prompt so Whisper recognizes them:

transcript = client.audio.transcriptions.create(
    model="whisper-1",
    file=audio_file,
    prompt="TranscribeTube, PyDub, FFmpeg, pipreqs, CUDA, VRAM"
)

Maintain punctuation style: If you want full punctuation including commas and periods, provide a prompt that demonstrates the style:
```
prompt = "Hello, welcome to our weekly product update. Today, we'll cover three major features."
```
Preserve filler words: For research or verbatim transcription, include filler words in the prompt:
```
prompt = "Umm, so like, we were thinking about, uh, how to approach this problem..."
```

Handle multiple languages: When the audio switches between languages, specify the primary language:

transcript = client.audio.transcriptions.create(
    model="whisper-1",
    file=audio_file,
    language="en"
)

According to AboutChromebooks' Whisper statistics, the training data composition is approximately 67% English audio, 20% from other high-resource languages, and 13% from low-resource languages. This means accuracy is strongest for English and progressively lower for less-represented languages.

What you should see after completing this step

Comparing output with and without prompts, you should see correctly spelled technical terms, consistent punctuation, and fewer hallucinated words. The improvement is most noticeable with proper nouns and acronyms.

Common mistakes and troubleshooting

Overly long prompts: Keep prompts under 224 tokens. Whisper only considers the final 224 tokens of a prompt and ignores anything earlier, so extra text is wasted. List only the terms most likely to be misrecognized.

Contradictory prompts: Don't combine filler-word preservation ("umm, uh") with clean punctuation prompts. Choose one style per transcription run.

Pro tip: After transcribing hundreds of technical talks, I've found that listing 5-10 specific terms in the prompt catches 90% of misrecognitions. For spelling fixes you don't need full sentences -- just list the tricky words: "Kubernetes, kubectl, etcd, gRPC, protobuf". The model picks up on the spelling pattern.

Step 6: Translate Foreign-Language Audio to English

Whisper translating multilingual audio content to English text

Whisper can also translate audio from any of its supported languages directly into English text. This happens in a single step: the model hears German (or Spanish, or Japanese) and outputs English. No intermediate transcription needed.

Detailed instructions

Use the translations endpoint instead of transcriptions:

from openai import OpenAI

client = OpenAI()

# The with-block closes the file once the upload finishes
with open("german_interview.mp3", "rb") as audio_file:
    translation = client.audio.translations.create(
        model="whisper-1",
        file=audio_file
    )

print(translation.text)

A German audio clip saying "Hallo, mein Name ist Wolfgang und ich komme aus Deutschland" would output:

"Hello, my name is Wolfgang and I come from Germany."

For transcribing (not translating) non-English audio in its original language, use the transcriptions endpoint with the language parameter:

transcript = client.audio.transcriptions.create(
    model="whisper-1",
    file=audio_file,
    language="de"  # Keeps output in German
)

Whisper supports 50+ languages for transcription. The full list is available in the OpenAI Speech-to-Text documentation. If you need Spanish audio transcripts specifically, our transcribe audio to text tool handles multiple languages through a browser interface.

What you should see after completing this step

The translation endpoint returns English text regardless of the input language. For well-supported languages (German, French, Spanish, Chinese, Japanese), the translation quality is high. For lower-resource languages, expect some accuracy degradation.

Common mistakes and troubleshooting

Expecting non-English output from translations: The translations endpoint only outputs English. If you need to keep the original language, use the transcriptions endpoint with the appropriate language code.

Mixed-language audio: When speakers switch between languages mid-conversation, Whisper sometimes struggles with transitions. Splitting the audio at language switch points and processing each segment separately gives better results.

Pro tip: For multilingual podcast transcription, I transcribe first in the original language, then run a separate translation pass on segments that need English output. The two-step approach gives cleaner results than relying on the translation endpoint alone, especially for code-switching conversations.

Step 7: Optimize Audio Quality Before Transcription

Audio preprocessing pipeline for Whisper showing noise reduction and format conversion steps

Whisper's accuracy depends heavily on the input audio quality. Clean recordings with minimal background noise consistently produce better transcripts than noisy ones, regardless of which model size you choose.

According to DIYAI's Whisper review, Whisper reaches up to 98% accuracy in benchmarks against Google and AWS speech-to-text services -- but only with reasonably clean audio input.

Detailed instructions

Reduce background noise with FFmpeg:
```
ffmpeg -i noisy_audio.mp3 -af "highpass=f=200,lowpass=f=3000,afftdn=nf=-25" clean_audio.mp3
```
This applies a highpass filter (removes rumble below 200 Hz), lowpass filter (removes hiss above 3000 Hz), and noise reduction.

Normalize volume levels:

ffmpeg -i quiet_audio.mp3 -af "loudnorm=I=-16:TP=-1.5:LRA=11" normalized_audio.mp3

Convert stereo to mono (Whisper downmixes to mono anyway, so a second channel adds file size for no accuracy benefit):
```
ffmpeg -i stereo_audio.mp3 -ac 1 mono_audio.mp3
```
Compress to reduce file size without losing intelligibility:
```
ffmpeg -i large_audio.wav -codec:a libmp3lame -b:a 64k compressed_audio.mp3
```
Speech doesn't need high bitrates. 64 kbps MP3 is sufficient for transcription and fits roughly 50 minutes of audio under the 25 MB limit (64 kbps is about 0.48 MB per minute).
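If you preprocess many files, the four commands above can be chained into a single FFmpeg call. The sketch below only builds the argument list, reusing the same filters and settings shown above; build_ffmpeg_command is a hypothetical helper, and combining every step in one pass is an assumption you should check by ear on your own recordings.

```python
import shlex


def build_ffmpeg_command(src, dst, denoise=True, normalize=True, bitrate="64k"):
    """Return an ffmpeg argument list: filter, downmix to mono, encode as MP3."""
    filters = []
    if denoise:
        # Same highpass / lowpass / noise-reduction chain as the first command
        filters += ["highpass=f=200", "lowpass=f=3000", "afftdn=nf=-25"]
    if normalize:
        filters.append("loudnorm=I=-16:TP=-1.5:LRA=11")
    cmd = ["ffmpeg", "-i", src]
    if filters:
        # Filters in one -af chain are comma-separated and run left to right
        cmd += ["-af", ",".join(filters)]
    cmd += ["-ac", "1", "-codec:a", "libmp3lame", "-b:a", bitrate, dst]
    return cmd


# Print the command for inspection before running it
print(shlex.join(build_ffmpeg_command("interview.wav", "interview_clean.mp3")))
```

To run it, pass the list to subprocess.run(cmd, check=True). Keeping the command as a list avoids shell-quoting problems with file names that contain spaces.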

What you should see after completing this step

Processed audio files are smaller, cleaner-sounding, and produce noticeably fewer transcription errors. A file with background music that previously caused Whisper to hallucinate lyrics should now transcribe the speech content correctly.

Common mistakes and troubleshooting

Over-filtering: Aggressive noise reduction can distort speech and make transcription worse. Start with gentle settings and increase only if needed. If the audio sounds robotic after filtering, you've gone too far.

Forgetting mono conversion: Whisper processes mono. Sending stereo doesn't cause errors, but the extra channel inflates file size (roughly double for uncompressed WAV) for zero accuracy benefit.

Pro tip: The single biggest accuracy boost I've seen doesn't come from model selection or prompting. It's pre-processing. Running a simple highpass + noise reduction pass on interview recordings improved my transcription accuracy by roughly 15-20% in informal testing. Ten seconds of FFmpeg commands saves minutes of manual correction.

How to Handle Whisper Hallucinations and Known Limitations

Whisper hallucination risk factors including silent segments and low quality recordings

Whisper can fabricate text that wasn't in the audio. This is a documented issue, not an edge case. According to AP News, Whisper is prone to making up chunks of text or even entire sentences, based on interviews with more than a dozen software engineers, developers, and academic researchers.

Healthcare Brew reported that OpenAI spokesperson Taya Christianson stated: "We take this issue seriously and are continually working to improve the accuracy of our models, including how we can reduce hallucinations."

When hallucinations are most likely

Silent or near-silent audio segments: Whisper fills silence with fabricated text rather than leaving gaps
Very short audio clips under 1-2 seconds
Heavily distorted or low-quality recordings
Audio in low-resource languages with limited training data

How to reduce hallucinations

Remove silent segments before transcription using pydub.silence:

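from pydub import AudioSegment
from pydub.silence import detect_nonsilent

audio = AudioSegment.from_mp3("recording.mp3")
# Ranges (in ms) that contain sound; pauses of 1 s or more below -40 dBFS are dropped
nonsilent_ranges = detect_nonsilent(audio, min_silence_len=1000, silence_thresh=-40)

chunks = [audio[start:end] for start, end in nonsilent_ranges]
# Start from an empty segment so an all-silent file doesn't break sum()
combined = sum(chunks, AudioSegment.empty())
# Note: timestamps from this file refer to the trimmed audio, not the original
combined.export("no_silence.mp3", format="mp3")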

Cross-reference timestamps: Use verbose_json output and flag segments where confidence is low (a low avg_logprob or a high no_speech_prob) or where timing gaps exist between consecutive segments.
Post-process the transcript: Run it through a spell-check, grammar review, or language model pass to catch obviously fabricated content.
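The timestamp cross-check can be sketched as a simple filter over the verbose_json segments. It assumes each segment is a plain dict carrying the start, end, text, avg_logprob, no_speech_prob, and compression_ratio fields. The thresholds are borrowed from the open-source Whisper decoding defaults and the 5-second gap is arbitrary, so treat all of them as starting points; the function name is illustrative.

```python
def flag_suspicious_segments(
    segments,
    max_no_speech=0.6,
    min_avg_logprob=-1.0,
    max_compression_ratio=2.4,
    max_gap_s=5.0,
):
    """Return (start, text, reasons) for segments worth a human look."""
    flagged = []
    previous_end = 0.0
    for seg in segments:
        reasons = []
        if seg.get("no_speech_prob", 0.0) > max_no_speech:
            reasons.append("likely silence")
        if seg.get("avg_logprob", 0.0) < min_avg_logprob:
            reasons.append("low confidence")
        if seg.get("compression_ratio", 0.0) > max_compression_ratio:
            reasons.append("repetitive text")
        # A long jump since the previous segment often marks skipped or silent audio
        if seg["start"] - previous_end > max_gap_s:
            reasons.append("timing gap")
        if reasons:
            flagged.append((seg["start"], seg["text"].strip(), reasons))
        previous_end = seg["end"]
    return flagged
```

Flagged segments are candidates for review, not proof of fabrication. The point is to send a reviewer to the few places most likely to be wrong.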

For applications where accuracy matters most (medical, legal, financial), always include a human review step. Whisper is a strong drafting tool, but it makes mistakes. If you need to compare transcription tools for accuracy in specific domains, automated benchmarking helps identify the right fit.

What Results to Expect from Whisper Transcription

Whisper transcription accuracy benchmarks by audio type from studio recording to noisy environments

For clean English audio with a single speaker, Whisper consistently delivers 92-98% accuracy depending on the model size. Multi-speaker recordings, accented speech, and noisy environments reduce accuracy by 5-15 percentage points.

Realistic accuracy benchmarks

Audio Type	Expected Accuracy	Best Model
Clean studio recording, single speaker	95-98%	base or small
Podcast, 2-3 speakers	90-95%	small or medium
Phone call, moderate noise	85-92%	medium or large-v3
Conference room, echo	80-90%	large-v3
Street interview, heavy noise	70-85%	large-v3 + preprocessing

Cost estimates for common use cases

Use Case	Audio Volume	Monthly Cost (API)
Podcast producer (10 episodes x 1 hour)	10 hours	$3.60
Meeting transcription (20 meetings x 30 min)	10 hours	$3.60
YouTube channel (30 videos x 15 min)	7.5 hours	$2.70
Lecture recording (daily 1-hour lecture)	20 hours	$7.20

At $0.006 per minute, the API is cost-effective for most individual and small-team workflows. For AI transcription accuracy comparisons across multiple providers, our dedicated analysis covers the latest 2026 benchmarks.
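The figures in the cost table are straightforward multiplication, which makes them easy to re-run for your own volume. This is a minimal sketch that assumes the $0.006 per minute price quoted in this guide still applies; the function name is illustrative.

```python
PRICE_PER_MINUTE = 0.006  # USD, the API price quoted in this guide


def monthly_cost(files_per_month, minutes_per_file, price_per_minute=PRICE_PER_MINUTE):
    """Estimated monthly API cost in USD, rounded to cents."""
    return round(files_per_month * minutes_per_file * price_per_minute, 2)


print(monthly_cost(10, 60))  # podcast producer: 10 one-hour episodes
print(monthly_cost(30, 15))  # YouTube channel: 30 fifteen-minute videos
```

Swap in your own counts and durations to see where the local GPU route starts to pay for itself.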

Advanced Tips for Production Whisper Deployments

Five production deployment tips for Whisper including batch processing and cost tracking

If you're moving beyond personal use into production-grade transcription, these optimizations matter:

Batch processing with async calls. Don't process files sequentially. Use Python's asyncio with the SDK's AsyncOpenAI client (or aiohttp) to send 5-10 files in parallel. This cuts wall-clock time roughly in proportion to the concurrency, as long as you stay within your account's rate limits.
Implement retry logic with exponential backoff. The OpenAI API occasionally returns 429 (rate limit) or 500 errors. A simple retry with 1s/2s/4s delays handles these gracefully.
Cache transcription results. Store completed transcripts keyed by a hash of the audio file content. Re-transcribing the same file wastes API credits.
Use webhooks for long-running jobs. If you expose transcription through your own service, process files over 10 minutes asynchronously and send the result to a callback URL rather than holding an HTTP connection open.
Monitor costs with usage tracking. Log every API call with the audio duration. At $0.006/minute, costs are low but can surprise you at scale. Our audio transcription API tool includes built-in usage tracking.
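The first two tips combine naturally: cap the number of requests in flight, and retry each one with exponential backoff. The sketch below uses only asyncio and takes the transcription call as a parameter, so it makes no assumptions about the SDK. The names with_retries, transcribe_all, and transcribe_one are hypothetical.

```python
import asyncio
import random


async def with_retries(make_call, attempts=4, base_delay=1.0, retry_on=(Exception,)):
    """Await make_call(), retrying failures with exponential backoff."""
    for attempt in range(attempts):
        try:
            return await make_call()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            # Waits about 1s, 2s, 4s, ... plus a little jitter so parallel
            # workers don't all retry at the same instant
            delay = base_delay * 2 ** attempt + random.uniform(0, base_delay / 10)
            await asyncio.sleep(delay)


async def transcribe_all(paths, transcribe_one, max_parallel=5, **retry_options):
    """Run transcribe_one(path) for every path, at most max_parallel at once.

    Results come back in the same order as paths.
    """
    semaphore = asyncio.Semaphore(max_parallel)

    async def worker(path):
        async with semaphore:
            return await with_retries(lambda: transcribe_one(path), **retry_options)

    return await asyncio.gather(*(worker(p) for p in paths))
```

In a real pipeline, transcribe_one would be a coroutine that opens the file and awaits the transcription call on an AsyncOpenAI client, and retry_on would be narrowed to the SDK's rate-limit and server-error exceptions rather than every Exception. The SDK client also has its own max_retries setting, so check that you aren't stacking retries unintentionally.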

Tools Mentioned in This Guide

Tool	Purpose	Cost	Best For
OpenAI Whisper API	Cloud-based transcription	$0.006/min	Quick setup, no GPU needed
Whisper GitHub	Local open-source transcription	Free	Privacy, heavy usage
PyDub	Audio splitting and manipulation	Free	Handling large files
FFmpeg	Audio preprocessing and conversion	Free	Noise reduction, format conversion
TranscribeTube	Browser-based AI transcription	Free tier available	No-code transcription

Frequently Asked Questions

How can I transcribe audio with Whisper for free?

Install Whisper locally from the GitHub repository using pip install openai-whisper. The local installation is completely free -- you only need a machine with a GPU for reasonable speed. Without a GPU, Whisper runs on CPU but takes 5-10x longer. For quick one-off transcriptions without installation, the Hugging Face Whisper Space offers a free browser-based demo (currently limited availability).

What is OpenAI Whisper and how does it work?

OpenAI Whisper is an automatic speech recognition system trained on 680,000 hours of multilingual audio data. It uses a Transformer encoder-decoder architecture: the encoder processes 30-second audio chunks converted to log-Mel spectrograms, and the decoder generates the text token by token. The same model handles both transcription and translation without requiring language-specific configurations.

How do I use the Whisper API for audio transcription?

Install the OpenAI Python library (pip install openai), set your API key, then call client.audio.transcriptions.create(model="whisper-1", file=audio_file). The API accepts mp3, mp4, mpeg, mpga, m4a, wav, and webm files up to 25 MB. You can specify response_format as "json", "text", "srt", "verbose_json", or "vtt" depending on your output needs.

Is there an online tool to transcribe audio with Whisper?

Yes. Several hosted platforms run Whisper models through a web interface. TranscribeTube provides browser-based transcription with speaker identification and subtitle generation built in. You can also try the official Hugging Face Space, though availability varies. These options work well for users who don't want to set up Python or manage API keys.

What is the difference between Whisper transcription and translation?

The transcriptions endpoint converts audio to text in the same language as the original recording (e.g., German audio becomes German text). The translations endpoint converts audio from any supported language directly into English text. Both endpoints use the same Whisper model but run different tasks (transcribe versus translate). Currently, translation only outputs English -- other target languages aren't supported.

How accurate is Whisper compared to other speech-to-text tools?

DIYAI's benchmarks show Whisper achieving up to 98% accuracy on clean audio, comparable to Google Cloud Speech-to-Text and AWS Transcribe. The large-v3-turbo model achieves a 7.75% word error rate on mixed benchmarks. However, accuracy drops with background noise, strong accents, and domain-specific terminology. For a detailed comparison across providers, see our AI transcription accuracy analysis.

Can Whisper handle multiple speakers in an audio file?

Whisper transcribes all speech into a single text stream without distinguishing between speakers. It doesn't perform speaker diarization (identifying who said what) natively. To get speaker-labeled transcripts, pair Whisper with a separate diarization model like pyannote.audio or use a service that combines both. Our speaker identification transcription guide covers implementation details.

What audio file formats does Whisper support?

Whisper's API accepts mp3, mp4, mpeg, mpga, m4a, wav, and webm formats. The local installation via GitHub supports any format that FFmpeg can decode, which covers virtually all audio and video containers. For the API, the file must be under 25 MB. For local use, there's no size limit beyond your available RAM and disk space.

Conclusion

Whisper gives you two solid paths to transcribe audio: the API for speed and simplicity, or local installation for privacy and cost control. Start with the API method for your first project -- the client.audio.transcriptions.create() call takes 5 lines of Python and costs fractions of a cent per minute.

The steps that make the biggest difference in real-world accuracy: pre-process your audio with FFmpeg noise reduction, use the prompt parameter for specialized vocabulary, and split long files at silence points rather than fixed intervals.

If you'd rather skip the setup entirely and start transcribing now, TranscribeTube's audio to text converter runs Whisper through a browser interface with speaker identification and subtitle export built in. No Python or API keys required.
