How to Transcribe Audio with Whisper: Python API and Local Setup Guide

To transcribe audio with Whisper, install the OpenAI Python library, pass your audio file to the client.audio.transcriptions.create() method with model whisper-1, and receive the text output in JSON or plain text format. Whisper handles 50+ languages with a word error rate as low as 7.75%, and the API costs just $0.006 per minute.
What you'll need:
- Python 3.8+ installed on your machine
- An OpenAI API key (for the API method) or a CUDA-compatible GPU (for local installation)
- Audio files in mp3, mp4, mpeg, mpga, m4a, wav, or webm format (under 25 MB per file for the API)
- Time estimate: 15-30 minutes for setup, seconds per transcription
- Skill level: Beginner-friendly (API method) / Intermediate (local installation)
Quick overview of the process:
- Choose your method — OpenAI API ($0.006/min, no GPU) or local GitHub installation (free, needs GPU)
- Set up your environment — Install Python dependencies and configure API keys or GPU drivers
- Run your first transcription — Send an audio file and receive text output
- Optimize accuracy — Use prompts, audio preprocessing, and model selection for better results
- Handle large files — Split audio over 25 MB with PyDub and maintain context
- Translate multilingual audio — Convert foreign-language audio to English text
- Scale for production — Batch processing, error handling, and cost management
What Is OpenAI Whisper and How Has It Evolved?
OpenAI Whisper is an open-source automatic speech recognition (ASR) system trained on 680,000 hours of multilingual audio data collected from the web. Unlike older speech-to-text systems that required language-specific training, Whisper uses a single Transformer encoder-decoder architecture that processes audio across 50+ languages without switching models.
According to OpenAI's Whisper research page, the model splits audio into 30-second chunks, converts each chunk into a log-Mel spectrogram, and uses special tokens for language identification, phrase-level timestamps, and translation. This design, together with the scale of the training data, is why Whisper tends to handle background noise, accents, and technical vocabulary better than most alternatives.
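To make the 30-second framing concrete, here is a small arithmetic sketch. The sample rate (16 kHz) and hop length (160 samples) are not stated in this guide; they are the values I understand the open-source Whisper code to use, so treat them as assumptions. The real decoder also shifts its window based on predicted timestamps, so the window count is an approximation.

```python
import math

SAMPLE_RATE = 16_000   # samples per second (assumption, not from this guide)
HOP_LENGTH = 160       # samples between spectrogram frames (assumption)
CHUNK_SECONDS = 30     # window length described above


def chunk_count(duration_seconds: float) -> int:
    """Number of 30-second windows needed to cover a recording."""
    return max(1, math.ceil(duration_seconds / CHUNK_SECONDS))


def frames_per_chunk() -> int:
    """Spectrogram frames in one 30-second window."""
    return CHUNK_SECONDS * SAMPLE_RATE // HOP_LENGTH


print(chunk_count(95))     # a 95-second clip needs 4 windows
print(frames_per_chunk())  # 3000 frames per window
```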
How the model has changed since 2022
The original Whisper v1 launched in September 2022. Since then, OpenAI has released several improved versions:
| Model Version | Release | Parameters | Key Improvement |
|---|---|---|---|
| Whisper v1 | Sep 2022 | Up to 1.55B | Initial release, 680K hours training |
| Large-v2 | Dec 2022 | 1.55B | Reduced hallucinations, better accuracy |
| Large-v3 | Nov 2023 | 1.55B | Lower word error rate, expanded language support |
| Large-v3-turbo | 2024 | 809M | According to CompaniesHistory, achieves 7.75% WER with faster inference |
The Whisper model is fully open-source. You can access the code, weights, and documentation on the official Whisper GitHub repository.
Who uses Whisper in 2026
Whisper isn't only used by developers anymore. According to Healthcare Brew, about 30,000 clinicians and 40 health systems now use Whisper-powered tools to document patient interactions. Content creators, podcast producers, and educators use it too, mostly for subtitles, searchable transcripts, and content repurposing.
According to AboutChromebooks, Whisper recorded 4.1 million monthly downloads on Hugging Face in December 2025, making it the most downloaded open-source speech recognition model available.
How to Choose Between Whisper API, Local Installation, and Hosted Tools
Before writing any code, you need to decide how you'll run Whisper. There are three main approaches, each with different tradeoffs in cost, privacy, and flexibility.
| Factor | OpenAI API | Local (GitHub) | Hosted Tools |
|---|---|---|---|
| Cost | According to BrassTranscripts, $0.006/minute ($0.36/hour) | Free (electricity + GPU) | $10-50/month subscription |
| Setup time | 5 minutes | 30-60 minutes | 2 minutes |
| GPU required | No | Yes (CUDA recommended) | No |
| File size limit | 25 MB per request | Unlimited | Varies |
| Privacy | Audio sent to OpenAI servers | Stays on your machine | Varies by provider |
| Best for | Quick projects, low volume | Privacy-sensitive work, heavy usage | Non-technical users |
My recommendation after 12 years building transcription tools: Start with the API for prototyping. The $0.006/minute cost is hard to beat for small-to-medium workloads. Switch to local installation only if you process 100+ hours monthly or can't send audio to external servers. If you don't want to write code at all, try TranscribeTube's audio to text converter for a browser-based experience.
Step 1: Set Up Your Python Environment for the Whisper API
This step gets your development environment ready to call the OpenAI Whisper API. You'll install the required Python package and configure your API key so transcription calls authenticate properly.
Detailed instructions
- Open your terminal and create a new project directory:
    mkdir whisper-transcription && cd whisper-transcription
- Create and activate a virtual environment:
    python3 -m venv venv
    source venv/bin/activate  # On Windows: venv\Scripts\activate
- Install the OpenAI Python library:
    pip install openai
- Set your API key as an environment variable:
    export OPENAI_API_KEY="sk-your-api-key-here"
  You can get an API key from OpenAI's API keys page.
- Verify the installation works:
    from openai import OpenAI
    client = OpenAI()
    print("OpenAI client initialized successfully")
What you should see after completing this step
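Running the verification script prints "OpenAI client initialized successfully" without errors. If the key is not set at all, creating the client raises an OpenAIError saying the api_key option must be set. Note that this script does not check whether the key is valid: an invalid key only surfaces as an AuthenticationError on the first real API call in Step 2. If you see ModuleNotFoundError, the openai package didn't install correctly.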
Common mistakes and troubleshooting
API key not found: The most common error is forgetting to export the key in your current shell session. Adding it to ~/.bashrc or ~/.zshrc makes it persist across sessions. Don't hardcode the key in your Python files -- it ends up in version control.
Wrong Python version: Whisper's API client requires Python 3.8+. Run python3 --version to confirm. If you're on 3.7 or older, upgrade before continuing.
Pro tip: After setting up hundreds of transcription pipelines, I always create a .env file with python-dotenv instead of exporting environment variables manually. It keeps API keys out of shell history and makes the project portable across machines.
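As a rough illustration of the .env approach, the sketch below parses KEY=VALUE lines with only the standard library. It is a simplified stand-in for what python-dotenv does, not a replacement for it; the helper names `load_env_file` and `require_api_key` are hypothetical and exist only for this example.

```python
import os
from pathlib import Path


def load_env_file(path: str = ".env") -> dict:
    """Parse KEY=VALUE lines into os.environ; already-set variables win."""
    loaded = {}
    env_path = Path(path)
    if not env_path.exists():
        return loaded
    for raw in env_path.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        # Skip blanks, comments, and lines without an assignment
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key = key.strip()
        value = value.strip().strip("'\"")  # drop surrounding quotes
        loaded[key] = value
        os.environ.setdefault(key, value)
    return loaded


def require_api_key() -> str:
    """Fail early with a readable message when the key is missing."""
    key = os.environ.get("OPENAI_API_KEY", "")
    if not key:
        raise RuntimeError("OPENAI_API_KEY is not set; export it or add it to .env")
    return key
```

Calling `load_env_file()` followed by `require_api_key()` at the top of a script gives a clear failure before any audio is uploaded. Remember to add .env to .gitignore so the key stays out of version control.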
Step 2: Transcribe Your First Audio File with the Whisper API
This is where you'll actually convert speech to text. The Whisper API accepts audio files up to 25 MB and returns the transcription as JSON or plain text, depending on your response_format setting.
Detailed instructions
- Place your audio file in the project directory. Whisper supports mp3, mp4, mpeg, mpga, m4a, wav, and webm formats.
- Run this Python script to transcribe:
from openai import OpenAI
client = OpenAI()
audio_file = open("your_audio.mp3", "rb")
transcript = client.audio.transcriptions.create(
model="whisper-1",
file=audio_file
)
print(transcript.text)
The default response comes as a JSON object:
{
"text": "Your transcribed text appears here, word for word from the audio..."
}
- If you prefer plain text without JSON wrapping, add the response_format parameter:
transcript = client.audio.transcriptions.create(
model="whisper-1",
file=audio_file,
response_format="text"
)
- For timestamped output (useful for subtitle generation), use the verbose_json format:
transcript = client.audio.transcriptions.create(
model="whisper-1",
file=audio_file,
response_format="verbose_json",
timestamp_granularities=["segment"]
)
This returns segments with start and end timestamps -- exactly what you need if you're building a YouTube subtitle generator or creating SRT files.
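To show how those segments map onto subtitles, here is a minimal sketch that turns a list of segments into SRT text. It assumes each segment exposes `start`, `end` (in seconds), and `text`, either as dict keys or as attributes; the helper names are hypothetical.

```python
def _field(segment, name):
    """Read a field from either a dict or an object with attributes."""
    return segment[name] if isinstance(segment, dict) else getattr(segment, name)


def srt_timestamp(seconds: float) -> str:
    """Format seconds as HH:MM:SS,mmm, the timestamp style SRT files use."""
    total_ms = int(round(seconds * 1000))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def segments_to_srt(segments) -> str:
    """Build numbered SRT cues, one per segment, separated by blank lines."""
    blocks = []
    for number, seg in enumerate(segments, start=1):
        start = srt_timestamp(_field(seg, "start"))
        end = srt_timestamp(_field(seg, "end"))
        text = _field(seg, "text").strip()
        blocks.append(f"{number}\n{start} --> {end}\n{text}\n")
    return "\n".join(blocks)
```

If you only need the subtitle file itself, requesting response_format="srt" returns it directly. A helper like this is mainly useful when you want to edit or filter segments before writing the file.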
What you should see after completing this step
The script prints the full transcription text to your terminal. For a 5-minute audio file, the API typically responds within 10-30 seconds. If you used verbose_json, you'll see an array of segments with start, end, and text fields.
Common mistakes and troubleshooting
File too large: If you get a 413 Request Entity Too Large error, your file exceeds 25 MB. Jump to Step 4 for splitting instructions. For a detailed breakdown of file limits, read our guide on Whisper API limits.
Wrong file path: The open() function needs the exact path to your file. Use absolute paths if the script runs from a different directory than the audio file.
Pro tip: I've transcribed thousands of files through this API, and one thing catches people off guard: the API doesn't return speaker labels. If you need speaker identification (who said what), you'll need to pair Whisper with a speaker diarization tool. Whisper handles the speech-to-text part; diarization handles the "who" part.
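The file-size and file-path problems from the troubleshooting notes can be caught before any upload. The sketch below is one way to do that with the standard library; `preflight` is a hypothetical helper, and reading "25 MB" as 25,000,000 bytes is a conservative assumption on my part.

```python
import os

MAX_BYTES = 25_000_000  # 25 MB limit from this guide (decimal reading is an assumption)
ALLOWED = {".mp3", ".mp4", ".mpeg", ".mpga", ".m4a", ".wav", ".webm"}


def preflight(path: str) -> list:
    """Return a list of problems; an empty list means the file looks uploadable."""
    if not os.path.isfile(path):
        # Show the absolute path so a wrong working directory is obvious
        return [f"file not found: {os.path.abspath(path)}"]
    problems = []
    ext = os.path.splitext(path)[1].lower()
    if ext not in ALLOWED:
        problems.append(f"unsupported extension: {ext or '(none)'}")
    size = os.path.getsize(path)
    if size > MAX_BYTES:
        problems.append(f"file is {size / 1_000_000:.1f} MB, over the 25 MB limit")
    return problems
```

A script can call `preflight("your_audio.mp3")` and print the returned messages instead of waiting for the API to reject the request.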
Step 3: Run Whisper Locally from GitHub (Free, No API Costs)
Running Whisper locally means no API costs, no file size limits, and no audio data leaving your machine. The tradeoff is that you need a GPU with CUDA support for reasonable speed, though Whisper does run on CPU (just much slower).
Detailed instructions
- Install Whisper with pip (the openai-whisper package on PyPI is the packaged release of the code in the GitHub repository):
    pip install -U openai-whisper
- Install FFmpeg (required for audio processing):
    # macOS
    brew install ffmpeg
    # Ubuntu/Debian
    sudo apt update && sudo apt install ffmpeg
    # Windows (with Chocolatey)
    choco install ffmpeg
- Transcribe from the command line:
    whisper your_audio.mp3 --model base --language en
- Or use Whisper in Python:
    import whisper
    model = whisper.load_model("base")
    result = model.transcribe("your_audio.mp3")
    print(result["text"])
- Choose the right model size for your hardware:
| Model | Parameters | Required VRAM | Relative Speed | English WER |
|---|---|---|---|---|
| tiny | 39M | ~1 GB | ~32x | ~14.8% |
| base | 74M | ~1 GB | ~16x | ~11.5% |
| small | 244M | ~2 GB | ~6x | ~9.8% |
| medium | 769M | ~5 GB | ~2x | ~8.4% |
| large-v3 | 1.55B | ~10 GB | 1x | ~7.75% |
According to Northflank's 2026 STT benchmark, the large-v3-turbo model achieves a 7.75% word error rate on mixed benchmarks while running significantly faster than the full large-v3.
What you should see after completing this step
When run from the command line, Whisper creates multiple output files in the current directory: a .txt file with the raw transcript, an .srt file for subtitles, a .vtt file for web subtitles, a .tsv with timestamps, and a .json with full metadata. The Python call writes no files; it returns a dictionary with the transcript under result["text"]. The first run downloads the model weights (ranging from 75 MB for tiny to 3 GB for large-v3).
Common mistakes and troubleshooting
CUDA not available: If Whisper falls back to CPU, transcription speed drops 5-10x. Run python3 -c "import torch; print(torch.cuda.is_available())" to check. On older laptops with AMD GPUs, Whisper still works -- it's just slower.
Out of memory: The large-v3 model needs ~10 GB of VRAM. If your GPU has less, drop down to medium or small. The accuracy difference between small (9.8% WER) and large-v3 (7.75% WER) may not matter for your use case.
Pro tip: I run the small model for most of my daily transcription work. It's 6x faster than large-v3 and the 2% accuracy gap rarely matters for interview transcripts or meeting notes. I only switch to large-v3 for content that gets published or when handling heavy-accent audio.
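The model choice can be reduced to a lookup against the VRAM column of the table above. This is only a sketch: `pick_model` is a hypothetical helper, and the VRAM figures are the approximate values from that table rather than measured requirements.

```python
# (model name, approximate VRAM in GB), copied from the table above, smallest first
MODELS = [("tiny", 1), ("base", 1), ("small", 2), ("medium", 5), ("large-v3", 10)]


def pick_model(vram_gb: float, headroom_gb: float = 0.0) -> str:
    """Return the largest model whose VRAM estimate fits the available memory."""
    usable = vram_gb - headroom_gb
    best = "tiny"  # fallback when nothing fits; tiny can still run on CPU
    for name, needed in MODELS:
        if needed <= usable:
            best = name
    return best


print(pick_model(6))   # medium
print(pick_model(12))  # large-v3
```

The optional `headroom_gb` argument leaves room for other processes using the GPU. As the pro tip notes, the largest model that fits is not always the one you want, since smaller models are considerably faster.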
Step 4: Handle Audio Files Larger Than 25 MB
Whisper's API enforces a 25 MB upload limit. A typical MP3 at 128 kbps hits this limit around 25-30 minutes of audio. Longer recordings (lectures, meetings, podcast episodes) need to be split into smaller chunks before transcription.
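The 25-30 minute figure follows from simple bitrate arithmetic, sketched below. The calculation assumes a constant bitrate and treats 1 MB as 1,000,000 bytes; file headers and variable-bitrate encoding will shift the result slightly.

```python
def max_minutes(bitrate_kbps: float, limit_mb: float = 25.0) -> float:
    """Minutes of constant-bitrate audio that fit under the upload limit."""
    limit_bits = limit_mb * 1_000_000 * 8
    seconds = limit_bits / (bitrate_kbps * 1000)
    return seconds / 60


for kbps in (64, 128, 192):
    print(f"{kbps} kbps -> about {max_minutes(kbps):.0f} minutes")
```

At 128 kbps this works out to about 26 minutes, and halving the bitrate to 64 kbps doubles the duration that fits in one request.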
Detailed instructions
- Install PyDub for audio manipulation:
    pip install pydub
- Split your audio into 10-minute segments:
    from pydub import AudioSegment
    audio = AudioSegment.from_mp3("long_recording.mp3")
    ten_minutes = 10 * 60 * 1000  # milliseconds
    for i in range(0, len(audio), ten_minutes):
        segment = audio[i:i + ten_minutes]
        # Zero-pad the index so a plain sort keeps playback order (002 before 010)
        name = f"segment_{i // ten_minutes:03d}.mp3"
        segment.export(name, format="mp3")
        print(f"Exported {name}")
- Transcribe each segment and combine the results:
    from openai import OpenAI
    import os
    client = OpenAI()
    segment_files = sorted(
        f for f in os.listdir(".")
        if f.startswith("segment_") and f.endswith(".mp3")
    )
    full_transcript = []
    for filename in segment_files:
        with open(filename, "rb") as f:
            result = client.audio.transcriptions.create(
                model="whisper-1",
                file=f
            )
        full_transcript.append(result.text)
    complete_text = " ".join(full_transcript)
    print(complete_text)
- Use the previous segment's transcript as a prompt for context continuity:
    # Reuses client and segment_files from the previous snippet
    full_transcript = []
    previous_text = ""
    for filename in segment_files:
        with open(filename, "rb") as f:
            result = client.audio.transcriptions.create(
                model="whisper-1",
                file=f,
                prompt=previous_text[-200:]  # Last 200 chars for context
            )
        full_transcript.append(result.text)
        previous_text = result.text
What you should see after completing this step
The script creates numbered segment files (segment_000.mp3, segment_001.mp3, etc.) in your working directory, then prints the combined transcript. A 2-hour lecture typically produces 12 segments at 10 minutes each. The zero-padded names matter once you pass ten segments: with unpadded names, an alphabetical sort would place segment_10 before segment_2 and scramble the transcript.
Common mistakes and troubleshooting
Cutting mid-sentence: The simplest approach (fixed 10-minute chunks) sometimes splits right in the middle of a word. For production use, detect silence points with pydub.silence.detect_silence() and split there instead. This adds a few lines of code but significantly improves transcript coherence.
Missing context between segments: Without passing the previous transcript as a prompt, Whisper treats each segment independently. This can cause inconsistencies with names, technical terms, or ongoing topics. The prompt trick in step 4 above solves this.
Pro tip: For podcast transcription, I always split at detected silence points rather than fixed intervals. It takes an extra 30 seconds of processing but eliminates those annoying mid-word breaks that require manual cleanup. If you're transcribing podcasts regularly, check out our dedicated podcast transcription tool.
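The silence-based splitting described above comes down to choosing cut points. The sketch below shows only that selection step, as a pure function. It assumes the silence ranges arrive as (start_ms, end_ms) pairs, which is the shape pydub.silence.detect_silence() returns; `choose_split_points` and its `tolerance_ms` parameter are hypothetical names for this example.

```python
def choose_split_points(total_ms, silences, target_ms, tolerance_ms):
    """Pick cut positions (ms) so no chunk exceeds target_ms.

    Each cut lands in the middle of the latest silence found within
    tolerance_ms before the ideal cut; otherwise it falls back to the
    fixed interval.
    """
    midpoints = sorted((start + end) // 2 for start, end in silences)
    cuts = []
    last_cut = 0
    while total_ms - last_cut > target_ms:
        ideal = last_cut + target_ms
        candidates = [
            m for m in midpoints
            if last_cut < m <= ideal and ideal - m <= tolerance_ms
        ]
        cut = max(candidates) if candidates else ideal
        cuts.append(cut)
        last_cut = cut
    return cuts


# 25-minute recording, 10-minute target, look back up to 1 minute for a pause
pauses = [(590_000, 592_000), (1_150_000, 1_151_000)]
print(choose_split_points(1_500_000, pauses, 600_000, 60_000))
```

With the returned cut points you can slice the AudioSegment the same way the fixed-interval loop does. Looking only backwards from the ideal cut keeps every chunk at or under the target length, which matters when the target is chosen to stay under the upload limit.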
Step 5: Improve Transcription Accuracy with Prompting
Whisper's prompt parameter is a hint for the model. It helps the model recognize specific terms and keep formatting consistent. This is especially useful for technical content, brand names, and domain-specific jargon that Whisper might otherwise misspell or hallucinate.
Detailed instructions
- Correct domain-specific terms: Include specialized vocabulary in your prompt so Whisper recognizes them:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
        prompt="TranscribeTube, PyDub, FFmpeg, pipreqs, CUDA, VRAM"
    )
- Maintain punctuation style: If you want full punctuation including commas and periods, provide a prompt that demonstrates the style:
    prompt = "Hello, welcome to our weekly product update. Today, we'll cover three major features."
- Preserve filler words: For research or verbatim transcription, include filler words in the prompt:
    prompt = "Umm, so like, we were thinking about, uh, how to approach this problem..."
- Handle multiple languages: When the audio switches between languages, specify the primary language:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
        language="en"
    )
According to AboutChromebooks' Whisper statistics, the training data composition is approximately 67% English audio, 20% from other high-resource languages, and 13% from low-resource languages. This means accuracy is strongest for English and progressively lower for less-represented languages.
What you should see after completing this step
Comparing output with and without prompts, you should see correctly spelled technical terms, consistent punctuation, and fewer hallucinated words. The improvement is most noticeable with proper nouns and acronyms.
Common mistakes and troubleshooting
Overly long prompts: Keep prompts under 224 tokens. The model only considers the final 224 tokens of the prompt and ignores anything earlier, so a long prompt silently loses its beginning. List only the terms most likely to be misrecognized.
Contradictory prompts: Don't combine filler-word preservation ("umm, uh") with clean punctuation prompts. Choose one style per transcription run.
Pro tip: After transcribing hundreds of technical talks, I've found that listing 5-10 specific terms in the prompt catches 90% of misrecognitions. When the goal is vocabulary rather than punctuation style, don't write full sentences as prompts -- just list the tricky words: "Kubernetes, kubectl, etcd, gRPC, protobuf". The model picks up on the spelling pattern.
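A term list like that is easy to assemble programmatically. The sketch below deduplicates terms and stops at a character budget; the budget is only a crude stand-in for the token limit (counting real tokens would need a tokenizer), and `build_vocab_prompt` with its `max_chars` default is a hypothetical helper, not part of any library.

```python
def build_vocab_prompt(terms, max_chars: int = 400) -> str:
    """Join unique terms with commas, stopping before the character budget."""
    seen = set()
    kept = []
    length = 0
    for term in terms:
        term = term.strip()
        key = term.lower()
        if not term or key in seen:
            continue  # skip blanks and case-insensitive duplicates
        extra = len(term) + (2 if kept else 0)  # account for the ", " separator
        if length + extra > max_chars:
            break
        seen.add(key)
        kept.append(term)
        length += extra
    return ", ".join(kept)


print(build_vocab_prompt(["Kubernetes", "kubectl", "etcd", "gRPC", "Kubernetes"]))
```

Put the terms most likely to be misheard first, since anything past the budget is dropped.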
Step 6: Translate Foreign-Language Audio to English
Whisper can also translate audio from any of its supported languages directly into English text. This happens in a single step: the model hears German (or Spanish, or Japanese) and outputs English. No intermediate transcription needed.
Detailed instructions
- Use the translations endpoint instead of transcriptions:
    from openai import OpenAI
    client = OpenAI()
    audio_file = open("german_interview.mp3", "rb")
    translation = client.audio.translations.create(
        model="whisper-1",
        file=audio_file
    )
    print(translation.text)
  A German audio clip saying "Hallo, mein Name ist Wolfgang und ich komme aus Deutschland" would output:
  "Hello, my name is Wolfgang and I come from Germany."
- For transcribing (not translating) non-English audio in its original language, use the transcriptions endpoint with the language parameter:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
        language="de"  # Keeps output in German
    )
- Whisper supports 50+ languages for transcription. The full list is available in the OpenAI Speech-to-Text documentation. If you need Spanish audio transcripts specifically, our transcribe audio to text tool handles multiple languages through a browser interface.
What you should see after completing this step
The translation endpoint returns English text regardless of the input language. For well-supported languages (German, French, Spanish, Chinese, Japanese), the translation quality is high. For lower-resource languages, expect some accuracy degradation.
Common mistakes and troubleshooting
Expecting non-English output from translations: The translations endpoint only outputs English. If you need to keep the original language, use the transcriptions endpoint with the appropriate language code.
Mixed-language audio: When speakers switch between languages mid-conversation, Whisper sometimes struggles with transitions. Splitting the audio at language switch points and processing each segment separately gives better results.
Pro tip: For multilingual podcast transcription, I transcribe first in the original language, then run a separate translation pass on segments that need English output. The two-step approach gives cleaner results than relying on the translation endpoint alone, especially for code-switching conversations.
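The choice between the two endpoints can be wrapped in one function, sketched below. The client is passed in as an argument so the function can be exercised without network access; `speech_to_text` and its parameters are hypothetical names, while the two `create` calls are the ones shown in the instructions above.

```python
from typing import Optional


def speech_to_text(client, path: str, to_english: bool = False,
                   language: Optional[str] = None) -> str:
    """Translate to English, or transcribe in the original language."""
    with open(path, "rb") as audio_file:
        if to_english:
            # The translations endpoint always produces English text
            result = client.audio.translations.create(
                model="whisper-1", file=audio_file
            )
        else:
            kwargs = {"model": "whisper-1", "file": audio_file}
            if language:
                kwargs["language"] = language  # e.g. "de" keeps German output
            result = client.audio.transcriptions.create(**kwargs)
    return result.text
```

With a real client this would be called as `speech_to_text(OpenAI(), "german_interview.mp3", to_english=True)`. Using a `with` block also closes the file handle, which the shorter snippets above leave open.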
Step 7: Optimize Audio Quality Before Transcription
Whisper's accuracy depends heavily on the input audio quality. Clean recordings with minimal background noise consistently produce better transcripts than noisy ones, regardless of which model size you choose.
According to DIYAI's Whisper review, Whisper achieves up to 98% accuracy benchmarks versus Google and AWS speech-to-text services -- but only with reasonably clean audio input.
Detailed instructions
- Reduce background noise with FFmpeg:
    ffmpeg -i noisy_audio.mp3 -af "highpass=f=200,lowpass=f=3000,afftdn=nf=-25" clean_audio.mp3
  This applies a highpass filter (removes rumble below 200 Hz), a lowpass filter (removes hiss above 3000 Hz), and noise reduction.
- Normalize volume levels:
    ffmpeg -i quiet_audio.mp3 -af "loudnorm=I=-16:TP=-1.5:LRA=11" normalized_audio.mp3
- Convert stereo to mono (Whisper processes mono audio, so the second channel adds file size for no benefit):
    ffmpeg -i stereo_audio.mp3 -ac 1 mono_audio.mp3
- Compress to reduce file size without losing intelligibility:
    ffmpeg -i large_audio.wav -codec:a libmp3lame -b:a 64k compressed_audio.mp3
  Speech doesn't need high bitrates. 64 kbps MP3 is sufficient for transcription, and at that bitrate roughly 50 minutes of audio fits under the 25 MB limit.
What you should see after completing this step
Processed audio files are smaller, cleaner-sounding, and produce noticeably fewer transcription errors. A file with background music that previously caused Whisper to hallucinate lyrics should now transcribe the speech content correctly.
Common mistakes and troubleshooting
Over-filtering: Aggressive noise reduction can distort speech and make transcription worse. Start with gentle settings and increase only if needed. If the audio sounds robotic after filtering, you've gone too far.
Forgetting mono conversion: Whisper processes mono. Sending stereo doesn't cause errors, but the extra channel inflates the file size for zero accuracy benefit.
Pro tip: The single biggest accuracy boost I've seen doesn't come from model selection or prompting. It's pre-processing. Running a simple highpass + noise reduction pass on interview recordings improved my transcription accuracy by roughly 15-20% in informal testing. Ten seconds of FFmpeg commands saves minutes of manual correction.
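The four FFmpeg commands above can be chained into a single pass when calling FFmpeg from Python. The sketch below only reuses the filters and flags already shown in this step; `build_clean_command` and `clean_audio` are hypothetical helper names, and running it requires ffmpeg on your PATH.

```python
import subprocess


def build_clean_command(src: str, dst: str, denoise: bool = True) -> list:
    """Assemble one ffmpeg call: filter, normalize, downmix to mono, compress."""
    filters = ["highpass=f=200", "lowpass=f=3000"]
    if denoise:
        filters.append("afftdn=nf=-25")  # skip if the result sounds robotic
    filters.append("loudnorm=I=-16:TP=-1.5:LRA=11")
    return [
        "ffmpeg", "-i", src,
        "-af", ",".join(filters),
        "-ac", "1",
        "-codec:a", "libmp3lame", "-b:a", "64k",
        dst,
    ]


def clean_audio(src: str, dst: str) -> None:
    """Run the command and raise CalledProcessError if ffmpeg fails."""
    subprocess.run(build_clean_command(src, dst), check=True)
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces. The `denoise` switch reflects the over-filtering warning above: start without it on recordings that are already fairly clean.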
How to Handle Whisper Hallucinations and Known Limitations
Whisper can fabricate text that wasn't in the audio. This is a documented issue, not an edge case. According to AP News, Whisper is prone to making up chunks of text or even entire sentences, based on interviews with more than a dozen software engineers, developers, and academic researchers.
Healthcare Brew reported that OpenAI spokesperson Taya Christianson stated: "We take this issue seriously and are continually working to improve the accuracy of our models, including how we can reduce hallucinations."
When hallucinations are most likely
- Silent or near-silent audio segments: Whisper fills silence with fabricated text rather than leaving gaps
- Very short audio clips under 1-2 seconds
- Heavily distorted or low-quality recordings
- Audio in low-resource languages with limited training data
How to reduce hallucinations
- Remove silent segments before transcription using pydub.silence:
    from pydub import AudioSegment
    from pydub.silence import detect_nonsilent
    audio = AudioSegment.from_mp3("recording.mp3")
    nonsilent_ranges = detect_nonsilent(audio, min_silence_len=1000, silence_thresh=-40)
    chunks = [audio[start:end] for start, end in nonsilent_ranges]
    combined = sum(chunks)
    combined.export("no_silence.mp3", format="mp3")
- Cross-reference timestamps: Use verbose_json output and flag segments where confidence is low or timing gaps exist.
- Post-process with a language model: Run the transcript through a basic spell-check and grammar review to catch obviously fabricated content.
For applications where accuracy matters most (medical, legal, financial), always include a human review step. Whisper is a strong drafting tool, but it makes mistakes. If you need to compare transcription tools for accuracy in specific domains, automated benchmarking helps identify the right fit.
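To support that human review step, the segment-flagging idea above can be sketched as follows. It assumes each verbose_json segment has been converted to a plain dict and carries `avg_logprob` and `no_speech_prob` alongside `start`, `end`, and `text`; the thresholds are illustrative assumptions to tune on your own audio, and `flag_segments` is a hypothetical helper.

```python
def flag_segments(segments, max_no_speech=0.6, min_avg_logprob=-1.0, max_gap=2.0):
    """Return (start, text, reasons) for segments a reviewer should check."""
    flagged = []
    previous_end = None
    for seg in segments:
        reasons = []
        if seg.get("no_speech_prob", 0.0) > max_no_speech:
            reasons.append("likely silence")
        if seg.get("avg_logprob", 0.0) < min_avg_logprob:
            reasons.append("low confidence")
        if previous_end is not None and seg["start"] - previous_end > max_gap:
            reasons.append("timing gap")
        if reasons:
            flagged.append((seg["start"], seg["text"].strip(), reasons))
        previous_end = seg["end"]
    return flagged


sample = [
    {"start": 0.0, "end": 2.0, "text": " Good morning.",
     "avg_logprob": -0.2, "no_speech_prob": 0.01},
    {"start": 6.0, "end": 8.0, "text": " Thanks for watching.",
     "avg_logprob": -1.4, "no_speech_prob": 0.9},
]
for start, text, reasons in flag_segments(sample):
    print(f"{start:>6.1f}s  {text}  <- {', '.join(reasons)}")
```

Flagged segments are candidates for review, not proof of fabrication. A quiet but genuine remark can trip the same thresholds, so the output is best used to direct a reviewer's attention.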
What Results to Expect from Whisper Transcription
For clean English audio with a single speaker, Whisper consistently delivers 92-98% accuracy depending on the model size. Multi-speaker recordings, accented speech, and noisy environments reduce accuracy by 5-15 percentage points.
Realistic accuracy benchmarks
| Audio Type | Expected Accuracy | Best Model |
|---|---|---|
| Clean studio recording, single speaker | 95-98% | base or small |
| Podcast, 2-3 speakers | 90-95% | small or medium |
| Phone call, moderate noise | 85-92% | medium or large-v3 |
| Conference room, echo | 80-90% | large-v3 |
| Street interview, heavy noise | 70-85% | large-v3 + preprocessing |
Cost estimates for common use cases
| Use Case | Audio Volume | Monthly Cost (API) |
|---|---|---|
| Podcast producer (10 episodes x 1 hour) | 10 hours | $3.60 |
| Meeting transcription (20 meetings x 30 min) | 10 hours | $3.60 |
| YouTube channel (30 videos x 15 min) | 7.5 hours | $2.70 |
| Lecture recording (daily 1-hour lecture) | 20 hours | $7.20 |
At $0.006 per minute, the API is cost-effective for most individual and small-team workflows. For AI transcription accuracy comparisons across multiple providers, our dedicated analysis covers the latest 2026 benchmarks.
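The figures in the cost table come from a one-line calculation, sketched here so you can plug in your own volume. The price constant is the $0.006 per minute quoted throughout this guide; check current pricing before budgeting.

```python
PRICE_PER_MINUTE = 0.006  # USD, as quoted in this guide


def monthly_cost(hours: float) -> float:
    """API cost in USD for a given number of audio hours per month."""
    return round(hours * 60 * PRICE_PER_MINUTE, 2)


for label, hours in [("Podcast producer", 10), ("YouTube channel", 7.5), ("Daily lecture", 20)]:
    print(f"{label}: {hours} h -> ${monthly_cost(hours):.2f}")
```

At the 100 hours per month mentioned earlier as the point to consider a local installation, the same formula gives $36.00.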
Advanced Tips for Production Whisper Deployments
If you're moving beyond personal use into production-grade transcription, these optimizations matter:
- Batch processing with async calls. Don't process files sequentially. Use Python's asyncio with aiohttp to send 5-10 files in parallel. Wall-clock time drops roughly in proportion to the concurrency, provided you stay within your account's rate limits.
- Implement retry logic with exponential backoff. The OpenAI API occasionally returns 429 (rate limit) or 500 errors. A simple retry with 1s/2s/4s delays handles these gracefully.
- Cache transcription results. Store completed transcripts keyed by a hash of the audio file content. Re-transcribing the same file wastes API credits.
- Use webhooks for long-running jobs. For files over 10 minutes, process asynchronously and send the result to a callback URL rather than holding an HTTP connection open.
- Monitor costs with usage tracking. Log every API call with the audio duration. At $0.006/minute, costs are low but can surprise you at scale. Our audio transcription API tool includes built-in usage tracking.
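Three of the tips above (bounded parallelism, retries with backoff, and content-hash caching) can be sketched with the standard library alone. Everything here is illustrative: the helper names are hypothetical, the transcription call is passed in as a plain function, and this version runs blocking calls in worker threads rather than using aiohttp. In real use you would narrow `retry_on` to the SDK's rate-limit and server-error exceptions instead of the broad default.

```python
import asyncio
import hashlib
import json
import time
from pathlib import Path


def with_retries(call, delays=(1, 2, 4), retry_on=(Exception,), sleep=time.sleep):
    """Run call(); after a retryable error wait 1s, 2s, 4s, then try once more."""
    for delay in delays:
        try:
            return call()
        except retry_on:
            sleep(delay)
    return call()  # final attempt; any error now propagates to the caller


def content_hash(path) -> str:
    """SHA-256 of the file contents, read in 1 MiB blocks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(1 << 20), b""):
            digest.update(block)
    return digest.hexdigest()


def cached_transcribe(path, transcribe, cache_dir="transcript_cache") -> str:
    """Return a stored transcript when this exact audio was seen before."""
    cache = Path(cache_dir)
    cache.mkdir(parents=True, exist_ok=True)
    entry = cache / f"{content_hash(path)}.json"
    if entry.exists():
        return json.loads(entry.read_text(encoding="utf-8"))["text"]
    text = transcribe(path)
    entry.write_text(json.dumps({"text": text}), encoding="utf-8")
    return text


async def run_bounded(paths, worker, limit=5):
    """Run the blocking worker(path) for every path, at most `limit` at once."""
    gate = asyncio.Semaphore(limit)
    loop = asyncio.get_running_loop()

    async def one(path):
        async with gate:
            return await loop.run_in_executor(None, worker, path)

    return await asyncio.gather(*(one(p) for p in paths))
```

The pieces compose: a worker can wrap the API call in `with_retries`, sit behind `cached_transcribe`, and be fanned out with `asyncio.run(run_bounded(files, worker, limit=5))`. Results come back in the same order as the input paths, which keeps segment transcripts aligned.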
Tools Mentioned in This Guide
| Tool | Purpose | Cost | Best For |
|---|---|---|---|
| OpenAI Whisper API | Cloud-based transcription | $0.006/min | Quick setup, no GPU needed |
| Whisper GitHub | Local open-source transcription | Free | Privacy, heavy usage |
| PyDub | Audio splitting and manipulation | Free | Handling large files |
| FFmpeg | Audio preprocessing and conversion | Free | Noise reduction, format conversion |
| TranscribeTube | Browser-based AI transcription | Free tier available | No-code transcription |
Frequently Asked Questions
How can I transcribe audio with Whisper for free?
Install the open-source Whisper package locally using pip install openai-whisper. The local installation is completely free -- you only need a machine with a GPU for reasonable speed. Without a GPU, Whisper runs on CPU but takes 5-10x longer. For quick one-off transcriptions without installation, the Hugging Face Whisper Space offers a free browser-based demo (currently limited availability).
What is OpenAI Whisper and how does it work?
OpenAI Whisper is an automatic speech recognition system trained on 680,000 hours of multilingual audio data. It uses a Transformer encoder-decoder architecture: the encoder processes 30-second audio chunks converted to log-Mel spectrograms, and the decoder generates the text token by token. The same model handles both transcription and translation without requiring language-specific configurations.
How do I use the Whisper API for audio transcription?
Install the OpenAI Python library (pip install openai), set your API key, then call client.audio.transcriptions.create(model="whisper-1", file=audio_file). The API accepts mp3, mp4, m4a, wav, and webm files up to 25 MB. You can specify response_format as "json", "text", "srt", "verbose_json", or "vtt" depending on your output needs.
Is there an online tool to transcribe audio with Whisper?
Yes. Several hosted platforms run Whisper models through a web interface. TranscribeTube provides browser-based transcription with speaker identification and subtitle generation built in. You can also try the official Hugging Face Space, though availability varies. These options work well for users who don't want to set up Python or manage API keys.
What is the difference between Whisper transcription and translation?
The transcriptions endpoint converts audio to text in the same language as the original recording (e.g., German audio becomes German text). The translations endpoint converts audio from any supported language directly into English text. Both endpoints use the same Whisper model; what differs is the task the model is asked to perform. Currently, translation only outputs English -- other target languages aren't supported.
How accurate is Whisper compared to other speech-to-text tools?
DIYAI's benchmarks show Whisper achieving up to 98% accuracy on clean audio, comparable to Google Cloud Speech-to-Text and AWS Transcribe. The large-v3-turbo model achieves a 7.75% word error rate on mixed benchmarks. However, accuracy drops with background noise, strong accents, and domain-specific terminology. For a detailed comparison across providers, see our AI transcription accuracy analysis.
Can Whisper handle multiple speakers in an audio file?
Whisper transcribes all speech into a single text stream without distinguishing between speakers. It doesn't perform speaker diarization (identifying who said what) natively. To get speaker-labeled transcripts, pair Whisper with a separate diarization model like pyannote.audio or use a service that combines both. Our speaker identification transcription guide covers implementation details.
What audio file formats does Whisper support?
Whisper's API accepts mp3, mp4, mpeg, mpga, m4a, wav, and webm formats. The local installation via GitHub supports any format that FFmpeg can decode, which covers virtually all audio and video containers. For the API, the file must be under 25 MB. For local use, there's no size limit beyond your available RAM and disk space.
Conclusion
Whisper gives you two solid paths to transcribe audio: the API for speed and simplicity, or local installation for privacy and cost control. Start with the API method for your first project -- the client.audio.transcriptions.create() call takes 5 lines of Python and costs fractions of a cent per minute.
The steps that make the biggest difference in real-world accuracy: pre-process your audio with FFmpeg noise reduction, use the prompt parameter for specialized vocabulary, and split long files at silence points rather than fixed intervals.
If you'd rather skip the setup entirely and start transcribing now, TranscribeTube's audio to text converter runs Whisper through a browser interface with speaker identification and subtitle export built in. No Python or API keys required.