Quick answer: a timestamp is the time code (HH:MM:SS) that marks the exact moment in the audio when something is said. In 2026, engines like Whisper or gpt-4o-mini-transcribe generate them automatically with ±0.5-2 second accuracy at segment level and ±100-300 ms at word level. The most common formats are SRT and VTT for subtitles, JSON for automation, and plain text with markers like [00:01:23] for quoting and human review. VOCAP returns all four from the same audio.
If you've ever had to find a specific sentence in a two-hour recording, you know the problem: text without time codes is hard to work with. You can't jump to the exact minute, you can't quote precisely, and you can't generate subtitles. Timestamps fix all of that at once.
This guide explains what they are, when you need each format, how they're generated in 2026 with AI and which common pitfalls to avoid.
What a timestamp is in a transcription
A timestamp (also called time code) is a value that marks the moment in the audio when a word or sentence is spoken. It is usually expressed in one of these formats:
- HH:MM:SS — hours, minutes, seconds. The most readable for humans.
- HH:MM:SS,mmm or HH:MM:SS.mmm — with milliseconds. Standard in SRT and VTT.
- Seconds as a decimal value (83.42) — common in JSON and APIs.
Each timestamp can be start, end or both. Professional formats always carry both: the subtitle appears at start and disappears at end.
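Converting between the decimal form and the subtitle form is plain millisecond arithmetic. A minimal sketch in Python (the helper name is ours, not from any library):

```python
def seconds_to_srt(t: float) -> str:
    """Convert decimal seconds (e.g. 83.42) to SRT's HH:MM:SS,mmm form."""
    ms = round(t * 1000)
    h, ms = divmod(ms, 3_600_000)  # milliseconds per hour
    m, ms = divmod(ms, 60_000)     # milliseconds per minute
    s, ms = divmod(ms, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

print(seconds_to_srt(83.42))  # 00:01:23,420
```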
What timestamps are for (real cases)
1. Synced subtitles
The most obvious case: subtitling YouTube videos, online courses, webinars, social content, accessibility. No timestamps, no subtitles. Formats: SRT (universal) or VTT (HTML5 web).
2. Video and audio editing
Professional editors (Premiere, DaVinci Resolve, Final Cut) import timestamped transcripts to do text-based editing: delete a word from the transcript and the video clip is cut for you. Descript popularised this workflow and it's now standard.
3. Precise quoting in research, journalism and law
When a journalist quotes "as the minister stated at 14:23 of the press conference…" or a lawyer references "see deposition, witness audio, 00:42:18", that precision is only possible with timestamps. Qualitative researchers use them to anchor verbatims in interview and focus group recordings.
4. Searching and navigating inside audio
A timestamped transcription turns a three-hour recording into a navigable track: search a keyword, see at which minute it was said, jump there. Essential for long podcasts, training sessions, meeting archives.
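What that looks like in practice: a hypothetical keyword search over segment-level output (the segment structure mirrors the JSON format shown later in this guide):

```python
def find_keyword(segments: list[dict], keyword: str) -> list[tuple[float, str]]:
    """Return (start_seconds, text) for every segment that mentions the keyword."""
    kw = keyword.lower()
    return [(s["start"], s["text"]) for s in segments if kw in s["text"].lower()]

segments = [
    {"start": 1.2, "end": 4.8, "text": "Hi, welcome to the podcast."},
    {"start": 5.0, "end": 9.5, "text": "Today we're talking about timestamps."},
]
for start, text in find_keyword(segments, "timestamps"):
    print(f"{start:.1f}s  {text}")  # 5.0s  Today we're talking about timestamps.
```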
5. Automatic chapters for podcasts and YouTube
YouTube allows chapters defined as markers 00:05:30 Topic X in the description. Spotify and Apple Podcasts support chapters in some formats. Generating them by hand is slow; with timestamps + AI content analysis you get them in seconds.
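As an illustration, rendering a chapter list into YouTube's description format takes a few lines; the chapters below are invented, and note that YouTube expects the first marker at 00:00:

```python
chapters = [(0, "Intro"), (330, "Topic X"), (1240, "Q&A")]  # (start in seconds, title)

for start, title in chapters:
    h, rest = divmod(start, 3600)
    m, s = divmod(rest, 60)
    print(f"{h:02d}:{m:02d}:{s:02d} {title}")
# 00:00:00 Intro
# 00:05:30 Topic X
# 00:20:40 Q&A
```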
6. Speaker analysis and participation
If you combine timestamps with diarization (speaker separation) you can compute how much each person spoke in a meeting, an HR interview or a focus group. Useful for sales coaching, meeting balance, research.
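A sketch of that computation, assuming you already have diarized segments with a speaker label and start/end times (the data below is invented):

```python
from collections import defaultdict

segments = [  # invented diarization output: speaker plus start/end in seconds
    {"speaker": "A", "start": 0.0, "end": 12.5},
    {"speaker": "B", "start": 12.5, "end": 20.0},
    {"speaker": "A", "start": 20.0, "end": 31.0},
]

talk_time = defaultdict(float)
for seg in segments:
    talk_time[seg["speaker"]] += seg["end"] - seg["start"]

total = sum(talk_time.values())
for speaker, t in sorted(talk_time.items()):
    print(f"Speaker {speaker}: {t:.1f} s ({t / total:.0%})")
```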
Segment-level vs word-level timestamps
Not every timestamp has the same granularity. There are two levels, and choosing the right one matters.
| Type | Granularity | When to use | Example |
|---|---|---|---|
| Segment-level | 5-15 seconds per block (sentence or short paragraph) | Subtitles, navigable text, human quotes, chapters | [00:01:23] Hi, welcome to the podcast. |
| Word-level | Each word with start/end in milliseconds | Text-based video editing, karaoke, animated captions, quantitative speech analysis | {"word":"Hi","start":1.23,"end":1.45} |
Rule of thumb: if you only need to read the transcript or generate classic subtitles, segment-level timestamps are enough. If you're doing text-based video editing or animating word-by-word captions (TikTok-style), you need word-level.
Output formats with timestamps
SRT (SubRip Subtitle)
The universal subtitle standard. Understood by YouTube, Premiere, Final Cut, VLC, HandBrake, Netflix and basically any player.
1
00:00:01,200 --> 00:00:04,800
Hi, welcome to the podcast.
2
00:00:05,000 --> 00:00:09,500
Today we're talking about timestamps in transcriptions.
VTT (WebVTT)
HTML5 variant (used in the <track> tag). Supports positioning, styles and extra metadata. If your video is embedded on a webpage, VTT is the natural fit.
WEBVTT
00:00:01.200 --> 00:00:04.800
Hi, welcome to the podcast.
00:00:05.000 --> 00:00:09.500
Today we're talking about timestamps in transcriptions.
JSON (structured)
Used by APIs and automation. Whisper returns something like:
{
  "text": "Hi, welcome to the podcast.",
  "segments": [
    {
      "id": 0,
      "start": 1.20,
      "end": 4.80,
      "text": "Hi, welcome to the podcast."
    }
  ]
}
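That structure is what makes automation straightforward. As a sketch, converting those segments to SRT is a short loop (transcript.json is a placeholder file name; the time helper repeats the arithmetic shown earlier):

```python
import json

def srt_time(t: float) -> str:
    """Decimal seconds -> SRT's HH:MM:SS,mmm."""
    ms = round(t * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def segments_to_srt(segments: list[dict]) -> str:
    """Render Whisper-style segments as an SRT document."""
    blocks = []
    for i, seg in enumerate(segments, start=1):
        blocks.append(f"{i}\n{srt_time(seg['start'])} --> {srt_time(seg['end'])}\n{seg['text'].strip()}\n")
    return "\n".join(blocks)

with open("transcript.json") as f:  # placeholder for your engine's JSON output
    data = json.load(f)
print(segments_to_srt(data["segments"]))
```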
Plain text with [HH:MM:SS] markers
The most comfortable for reading, quoting and sharing. Preferred by journalists, researchers and meeting-minutes teams.
[00:00:01] Hi, welcome to the podcast.
[00:00:05] Today we're talking about timestamps in transcriptions.
[00:00:14] First point: difference between segment and word level.
TSV / CSV
Useful when you need to push the transcript to Excel, BigQuery or any tabular analysis. Each row is a segment with start, end, text columns.
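A sketch with Python's standard csv module (file names are placeholders):

```python
import csv
import json

with open("transcript.json") as f:  # placeholder Whisper-style JSON
    segments = json.load(f)["segments"]

with open("transcript.csv", "w", newline="") as f:
    writer = csv.writer(f)  # use csv.writer(f, delimiter="\t") for TSV instead
    writer.writerow(["start", "end", "text"])
    for seg in segments:
        writer.writerow([seg["start"], seg["end"], seg["text"].strip()])
```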
How timestamps are generated in 2026
There are three paths:
- Whisper directly (OpenAI or local). Both the OpenAI API and the open-source variants (whisper.cpp, faster-whisper) return segment-level timestamps by default and word-level when you enable word_timestamps=True. It's the technical foundation behind most modern tools (a minimal sketch follows this list).
- SaaS tools built on Whisper or similar. VOCAP, Otter, Descript, Riverside, etc. They process the audio with Whisper or proprietary engines and expose timestamps in their UI, with SRT/VTT/JSON export and no need to touch code.
- Manual with subtitling software. Aegisub, Subtitle Edit, Kapwing. They let you mark timestamps by hand on top of an existing transcript. Useful for fine corrections, not for volume.
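For the first path, a minimal sketch with the open-source openai-whisper package (model size and file name are placeholders; the words list exists only because word_timestamps=True is set):

```python
import whisper  # pip install openai-whisper

model = whisper.load_model("base")
result = model.transcribe("interview.mp3", word_timestamps=True)

for seg in result["segments"]:
    print(f"[{seg['start']:7.2f} -> {seg['end']:7.2f}] {seg['text'].strip()}")
    for word in seg["words"]:
        print(f"    {word['word']!r}: {word['start']:.2f}-{word['end']:.2f}")
```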
2026 fact: Whisper is still the reference engine for multilingual transcription with timestamps. gpt-4o-mini-transcribe delivers comparable or better results in many languages and is becoming the default in modern tools like VOCAP.
Step by step: transcribe with timestamps in VOCAP
- Upload the file. MP3, WAV, M4A, MP4, OGG or FLAC, up to 150 MB. If it's larger, compress it to 64 kbps mono (that's what the engine processes internally, so you don't lose transcription quality; an ffmpeg sketch follows this list).
- Wait for processing. One hour of audio takes 2-8 minutes depending on language and queue. Long recordings (1-3 h) go through async processing and you get notified when they finish.
- Review the transcription. The web view shows text with [HH:MM:SS] markers at the start of each block, plus an executive summary, key points, tasks and decisions generated by Claude.
- Export in your preferred format. Text with timestamps for quoting, SRT/VTT for subtitles, JSON for automation (Zapier, Make, n8n).
- Fix proper nouns and numbers. That's where models miss most. A 2-3 minute pass per hour of audio gets you to 99%.
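For the compression step, one option is calling ffmpeg from Python (this assumes ffmpeg is installed; file names are placeholders):

```python
import subprocess

# Re-encode to 64 kbps mono MP3 before uploading; -ac 1 = mono, -b:a 64k = audio bitrate
subprocess.run(
    ["ffmpeg", "-i", "interview.wav", "-ac", "1", "-b:a", "64k", "interview_64k.mp3"],
    check=True,
)
```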
Try VOCAP with 30 free minutes
Upload an audio and download the timestamped transcript as SRT, VTT or text with [HH:MM:SS]. No card required.
Try VOCAP Free
Typical accuracy and limits
With clean audio (single speaker, decent mic, no noise) typical Whisper accuracy in 2026 is:
- Text: 95-98% in most major languages (English, Spanish, French, German, Italian, Portuguese).
- Segment-level timestamps: ±0.5 to ±2 seconds.
- Word-level timestamps: ±100 to ±300 ms with good articulation.
Where accuracy drops:
- Audio with echo, background noise or multiple overlapping voices.
- Strong accents or minority dialects.
- Music or sound effects the model tries to interpret as speech.
- Long silences: sometimes the model "hallucinates" text where there is none.
- Sudden speaker switches mid-word.
Common errors to avoid
- Asking for word-level when you only need segments. Triples file size and rarely adds value for classic subtitles.
- Mixing decimal separators. SRT uses a comma (00:00:01,200), VTT uses a dot (00:00:01.200). Mixing them breaks the parser.
- Not verifying sync. Automatic timestamps are good, not perfect. Check 3-4 spots in the audio before publishing subtitles.
- Subtitles that are too long. More than 42 characters per line or more than 7 seconds per block hurts readability. Split them (a quick check is sketched after this list).
- Forgetting the language. Specifying the language (instead of auto-detect) speeds things up and slightly improves accuracy, especially on short recordings.
- Subtitling without reviewing proper nouns. "VOCAP" can come out as "vocap", "Bocap" or "Vokap". Same with brands, cities and acronyms.
- Trusting silences 100%. If the model doesn't detect silences well, start timestamps may shift 200-500 ms early. Eyeball the first 30 seconds manually.
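The length rule above is easy to automate. A quick check, with the 42-character and 7-second limits from this guide hard-coded:

```python
MAX_CHARS = 42    # per line
MAX_SECONDS = 7   # per block

def check_block(start: float, end: float, text: str) -> list[str]:
    """Flag readability problems in a single subtitle block."""
    problems = []
    if end - start > MAX_SECONDS:
        problems.append(f"block lasts {end - start:.1f} s; split it")
    for line in text.splitlines():
        if len(line) > MAX_CHARS:
            problems.append(f"line has {len(line)} chars: {line!r}")
    return problems

print(check_block(0.0, 9.5, "Today we're talking about timestamps in transcriptions."))
```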
Frequently asked questions
What is a timestamp in a transcription?
The marker that indicates the exact moment in the audio (HH:MM:SS) when a word or sentence is spoken. It lets you locate fragments without listening to everything, generate synced subtitles and quote with precision.
Word-level vs segment-level timestamps?
Segment-level marks start and end of each sentence (5-15 s). Word-level marks each word with millisecond precision. Classic subtitles: segments. Text-based editing, karaoke or quantitative analysis: word.
Which timestamped formats exist?
SRT (universal standard), VTT (HTML5 web), JSON (APIs and automation), TSV/CSV (tabular) and plain text with [HH:MM:SS] markers for human reading. VOCAP exports the main ones.
How accurate are automatic timestamps?
With Whisper and clean audio, ±0.5 to ±2 s at segment level and ±100-300 ms at word level. Accuracy drops with noise, overlapping voices or strong accents.
Can I add timestamps to a transcription I already have?
Yes, with software like Aegisub or Subtitle Edit, but it takes 4-6 hours per hour of audio. It's faster to re-process the original with an engine that returns automatic timestamps.
How do I get timestamps in VOCAP?
Upload the audio and VOCAP returns the transcription with [HH:MM:SS] markers at the start of each segment, downloadable as SRT/VTT for subtitles or as text with timestamps. It uses Whisper under the hood.