Transcribing a short clip is trivial. Transcribing a 2-hour recording is where most tools break. OpenAI's Whisper API caps files at 25 MB. Free apps freeze at the 30-minute mark. Online tools ask you to manually slice the audio in Audacity and re-upload it segment by segment. Then you have to paste the chunks together by hand and review the joins.
With VOCAP you upload the whole file — a 1-hour conference, a 2-hour interview, a 3-hour seminar — and the system handles the entire pipeline automatically: compression, silence-aware splitting, parallel transcription and clean stitching. This guide explains why long audio is a problem, how it gets solved, and how much it costs you.
Why Long Audio Breaks Most Tools
Whisper's 25 MB limit
OpenAI Whisper is one of the most accurate AI transcription engines available, but its API has a hard limit: 25 MB per file. In practice that's:
- About 20-25 minutes of MP3 at standard quality (128 kbps).
- Just 4-5 minutes of uncompressed WAV (16-bit, 44.1 kHz mono; about half that in stereo).
- Roughly 50 minutes at 64 kbps mono — but you lose some audio quality.
That means if you record a 1-hour class, a 2-hour meeting or a 3-hour interview and upload them straight into a Whisper-based tool, you'll either get a max-size error or only the first few minutes will be transcribed.
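The figures above follow from simple bitrate arithmetic. Here is a minimal sketch, assuming decimal megabytes (1 MB = 1,000,000 bytes), constant bitrate and no container overhead; the function name is illustrative, not part of any tool:

```python
def minutes_that_fit(limit_mb: float, bitrate_kbps: float) -> float:
    """Minutes of audio that fit in limit_mb at a constant bitrate."""
    bytes_per_second = bitrate_kbps * 1000 / 8
    return limit_mb * 1_000_000 / bytes_per_second / 60

# Uncompressed PCM bitrate = sample rate x bit depth x channels
wav_mono_kbps = 44_100 * 16 * 1 / 1000  # 705.6 kbps

formats = [
    ("MP3 128 kbps", 128),
    ("MP3 64 kbps mono", 64),
    ("WAV 16-bit 44.1 kHz mono", wav_mono_kbps),
]
for label, kbps in formats:
    print(f"{label}: {minutes_that_fit(25, kbps):.1f} min")
# MP3 128 kbps: 26.0 min
# MP3 64 kbps mono: 52.1 min
# WAV 16-bit 44.1 kHz mono: 4.7 min
```

Real files land slightly below these ceilings because of headers and metadata, which is why rounding down is the safe choice.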
Why splitting manually is a pain
The DIY workaround is to open Audacity, cut the audio into 20-minute chunks, export each one, upload them one by one, wait for the transcripts, and paste the texts back together by hand. In practice that means:
- Errors at the joins: if you cut mid-word, you lose context and the AI introduces errors at the start and end of each chunk.
- Lost speakers: speaker diarization breaks across segments — "Speaker 1" in chunk 2 may not be the same as "Speaker 1" in chunk 1.
- Wasted time: 30-45 minutes of manual work to transcribe a 2-hour audio.
- No unified summary: AI analysis (summary, tasks, decisions) gets lost when you fragment the audio.
Key data point: 78% of professional recordings (university lectures, business meetings, conferences, seminars, long-form podcasts) run between 45 minutes and 3 hours. In other words, most of the world's valuable audio content is out of reach for Whisper on its own, without a pipeline around it.
Real-World Use Cases
Who needs to transcribe multi-hour audio
Conferences and keynotes (1-2h)
Professional events and recorded talks you need to turn into an article, LinkedIn post, SEO transcript or subtitles. Upload the whole thing, get text + executive summary in 10 minutes.
University lectures (1-2h)
Recorded lessons to review, take notes from or study. Combine it with converting audio to notes for a structured summary by topic.
Work meetings and committees (1-3h)
Steering committees, project meetings, long kick-offs. Full transcription plus automatic minutes with tasks and decisions — useful alongside automatic meeting minutes.
Research interviews (1-3h)
In-depth interviews for qualitative research, journalism or PhD work. No duration limit, even for life-history interviews running several hours.
Long-form podcasts (1-3h)
Long interview-style episodes (Joe Rogan, Lex Fridman, Tim Ferriss). Generate a full transcript for SEO, show notes and repurposing into 10 pieces of content.
Hearings and legal depositions (1-4h)
Court hearings and statements that require precise verbatim transcription. See transcribing court hearings with AI for legal-specific details.
Try It with a Real Long Audio
Upload your next class, conference or full meeting. 30 free minutes when you sign up.
Try VOCAP Free
How VOCAP Solves the Problem Technically
The three-phase pipeline
VOCAP isn't a Whisper wrapper. It's a pipeline designed specifically for long audio, with three automatic phases:
- Adaptive compression: if the file exceeds 24 MB, it gets re-encoded to 64 kbps mono MP3. For human voice, that bitrate keeps speech almost fully intelligible while cutting file size by up to 4-6x, depending on the source bitrate. A 90-minute conference goes from 130 MB to around 43 MB.
- Silence-aware splitting: if after compression the file still exceeds Whisper's limit, it gets split into 10-minute segments at natural silence points (when the speaker pauses). This avoids cutting mid-word and preserves context at the joins.
- Parallel transcription and stitching: segments are sent to Whisper in parallel (not sequentially), so a 2-hour audio doesn't take 2 hours to transcribe — it takes as long as the slowest segment, typically 8-12 minutes total. Texts are stitched cleanly.
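The silence-aware splitting step can be sketched as follows. This is an illustration under stated assumptions, not VOCAP's actual implementation: it assumes the pauses have already been detected (for example with ffmpeg's `silencedetect` filter) and are passed in as `(start, end)` pairs in seconds. The function name and the `tolerance_s` parameter are hypothetical.

```python
def choose_cut_points(duration_s, silences, target_s=600.0, tolerance_s=60.0):
    """Pick cut points roughly every target_s seconds.

    Each cut snaps to the midpoint of the closest silence within
    tolerance_s of the ideal position; if no silence is close enough,
    it falls back to a hard cut at the ideal position.
    """
    midpoints = [(start + end) / 2 for start, end in silences]
    cuts = []
    last = 0.0
    # Stop once the remaining audio fits in one (slightly long) segment.
    while duration_s - last > target_s + tolerance_s:
        ideal = last + target_s
        nearby = [m for m in midpoints if abs(m - ideal) <= tolerance_s]
        cut = min(nearby, key=lambda m: abs(m - ideal)) if nearby else ideal
        cuts.append(cut)
        last = cut
    return cuts

# A 25-minute file with pauses near the 10- and 20-minute marks
print(choose_cut_points(1500, [(595, 599), (300, 302), (1230, 1234)]))
# [597.0, 1232.0]
```

Cutting at the middle of a pause keeps whole words on both sides of the join, which is the property the pipeline description relies on.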
Post-analysis with Claude
Once you have the full text, Claude (Anthropic) processes it to generate:
- Executive summary: 3-5 paragraphs with the essentials.
- Key points: actionable bullets from the content.
- Tasks and decisions: identifies explicit actions and agreements.
- Tone and topics: useful for content classification.
Default model: gpt-4o-mini-transcribe, the successor to Whisper-1 with better handling of technical jargon and proper nouns. If you need compatibility with older benchmarks, for example in legal or medical cases, you can request a rollback to Whisper-1.
Step by Step: Your First Long Audio in 5 Minutes
1. Sign up for VOCAP: create a free account at vocap.io. You get 30 minutes of transcription to start, no credit card required.
2. Upload the long audio: drag your file (up to 150 MB) onto the interface. MP3, WAV, M4A, OGG, OPUS, FLAC, AAC, MP4, WebM accepted.
3. Enable async mode: for audio longer than 30 minutes we recommend async mode. You can close the tab; you'll get an email when it finishes.
4. VOCAP runs the full pipeline: compression → splitting → parallel transcription → analysis with Claude. You don't do anything.
5. Get transcription + analysis: full text, executive summary, tasks, decisions and key points. Copy, export to Word/PDF or paste wherever you need it.
Tip: if your file is over the 150 MB upload limit, re-encode it first. With ffmpeg -i input.wav -b:a 64k -ac 1 output.mp3 you'll bring a 4-hour recording down to about 115 MB.
Comparison: Manual Splitting vs Automatic VOCAP
2-hour audio: two real workflows
SPLIT MANUALLY + WHISPER ONLINE
1. Open Audacity and load the WAV (3 min)
2. Cut into 6 segments of 20 min (10 min)
3. Export each one to MP3 (5 min)
4. Upload all 6 segments one by one (15 min)
5. Wait for 6 sequential transcriptions (30 min)
6. Paste texts by hand and review joins (15 min)
7. NO unified summary or analysis
TOTAL TIME: ~78 min, about 48 of them active work
JOIN ACCURACY: variable, often loses context
VOCAP AUTOMATIC
1. Upload the 2h file to VOCAP (1 min)
2. Enable async mode and close the tab
3. Receive email with transcript + analysis (10-12 min)
4. Unified text + summary + tasks + decisions
TOTAL TIME: ~1 min of active work
JOIN ACCURACY: silence-aware splitting, no loss
Tips for Multi-Hour Audio
- Record at 44.1 kHz mono when possible: for a single voice, mono is enough and stereo doubles the file size with no benefit. If you're recording with multiple mics (in-person interview), mix down to mono before uploading if the speakers are easy to tell apart, or keep stereo when the channel separation helps diarization.
- Avoid continuous background noise: noise across several hours degrades accuracy cumulatively. If you're recording a conference, place the mic near the speaker or use a lavalier.
- Note unusual proper nouns and acronyms in advance: long audio usually has 5-10 domain-specific terms (product names, people, acronyms). Having a list handy to review the transcript at the end saves time.
- Use async mode: for audio over 30 minutes, don't wait with the tab open. Enable async and get an email.
- Buy the Ultimate plan if you transcribe more than 10h/month: at 30h for €29.99 it works out to about €1/hour, so a 3h audio costs you €3. One-time purchase, no subscription.
Upload your next long audio to VOCAP
Conferences, classes, interviews, podcasts. Up to 150 MB and several hours without splitting anything manually. Executive summary and analysis included.
30 free minutes · No credit card · Automatic compression and splitting
Start Free
Frequently Asked Questions
What is the real limit for transcribing long audio with AI?
OpenAI's Whisper API has a hard limit of 25 MB per file. In practice that's about 20-25 minutes of standard-quality MP3, or barely 4-5 minutes of uncompressed mono WAV. VOCAP works around that limit: it compresses audio to 64 kbps automatically and, if the file is still too large, splits it into 10-minute segments that are transcribed in parallel and stitched back together. You can upload files up to 150 MB and transcribe audio of 3, 5 or more hours without splitting anything yourself.
How long does it take to transcribe 2 or 3 hours of audio?
VOCAP processes segments in parallel, so a 2-hour audio is usually ready in 8-12 minutes and a 3-hour audio in 15-20 minutes. Exact times depend on audio quality, but async mode lets you close the tab and get the result by email when it finishes.
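Why parallel processing makes wall-clock time track the slowest segment rather than the sum can be shown with a small simulation. This is a sketch only: `fake_transcribe` is a hypothetical stand-in for a real speech-to-text request, and the sleep times stand in for per-segment latency.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_transcribe(segment):
    """Hypothetical stand-in for a real speech-to-text request."""
    name, seconds = segment
    time.sleep(seconds)  # simulate network + processing latency
    return f"[text of {name}]"

def transcribe_parallel(segments, transcribe=fake_transcribe, max_workers=8):
    # pool.map yields results in input order, so the stitched text stays
    # in sequence even when later segments finish first.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        texts = list(pool.map(transcribe, segments))
    return " ".join(texts)

segments = [("seg-1", 0.3), ("seg-2", 0.1), ("seg-3", 0.2)]
started = time.perf_counter()
transcript = transcribe_parallel(segments)
elapsed = time.perf_counter() - started

print(transcript)  # [text of seg-1] [text of seg-2] [text of seg-3]
print(f"{elapsed:.1f}s")  # close to the slowest segment (0.3 s), not the 0.6 s sum
```

Threads fit here because the work is I/O-bound waiting on a remote service; a real client would also need retries and rate-limit handling, which this sketch omits.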
Does splitting the audio into segments hurt accuracy?
Not significantly. Splitting happens in 10-minute blocks that respect natural silences and the segments are stitched cleanly. Final accuracy stays at 95% or higher even for multi-hour audio. For talks with very specific jargon (medical, legal, technical), the gpt-4o-mini-transcribe model handles proper nouns notably better than Whisper-1.
How much does it cost to transcribe 1, 2 or 3 hours of audio?
With VOCAP's Ultimate credit plan (30h for €29.99), the cost is €1 per hour of audio. That means: €1 for a 1-hour conference, €2 for a 2-hour course, €3 for a 3-hour seminar. One-time purchase, no subscriptions. Full table at AI transcription pricing: cost comparison.
What long audio formats does VOCAP accept?
VOCAP accepts MP3, WAV, M4A, OGG, OPUS, FLAC, AAC, MP4 and WebM up to 150 MB. If your file exceeds that size, the easiest workaround is to export it to MP3 at 64-128 kbps before uploading: a 4-hour recording at 64 kbps mono comes in around 115 MB and uploads with no issues. For video (MP4 / WebM), VOCAP automatically extracts the audio.
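To check in advance whether a re-encoded file will fit under the upload cap, a rough estimate is enough. The sketch below assumes constant bitrate and decimal megabytes; the helper names are illustrative, and the ffmpeg flags are the ones from the command quoted earlier in this guide.

```python
UPLOAD_CAP_MB = 150  # upload limit stated in this guide

def encoded_size_mb(duration_hours: float, bitrate_kbps: int = 64) -> float:
    """Approximate size of a constant-bitrate file in decimal megabytes."""
    return duration_hours * 3600 * bitrate_kbps * 1000 / 8 / 1_000_000

def ffmpeg_args(src: str, dst: str, bitrate_kbps: int = 64) -> list[str]:
    """Argument list for re-encoding to mono at the given audio bitrate."""
    return ["ffmpeg", "-i", src, "-b:a", f"{bitrate_kbps}k", "-ac", "1", dst]

size = encoded_size_mb(4)
print(f"{size:.1f} MB, fits: {size <= UPLOAD_CAP_MB}")  # 115.2 MB, fits: True
print(" ".join(ffmpeg_args("input.wav", "output.mp3")))
# ffmpeg -i input.wav -b:a 64k -ac 1 output.mp3
```

The argument list can be passed to `subprocess.run(args, check=True)` on a machine where ffmpeg is installed.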
Can I transcribe long audio in any language?
Yes. OpenAI's Whisper recognises more than 90 languages and keeps accuracy high on long audio. It detects the language automatically and handles language switches within the same file (common in international conferences or multilingual interviews). More details at multilingual transcription with AI.