
Speaker Diarization with AI: How to Know Who Said What in Your Transcriptions

What it is, how it works and how to apply automatic diarization to meetings, interviews and podcasts. Practical 2026 guide.

Quick answer: Speaker diarization is the process by which an AI segments an audio file with multiple voices and labels each fragment with the corresponding speaker, answering "who said what". It is combined with a transcription engine like Whisper to produce text structured by conversational turns. In 2026, the best models (pyannote 3.1, NeMo, WhisperX) reach a 7-12% error rate on clean audio with 2-4 speakers. It is the key piece for useful meeting minutes, readable interviews and publishable podcasts.

A transcription without speaker labels is practically unreadable. A 45-minute wall of text where you can't tell who made the important decision, who raised objections and who took on the task is nearly useless. Speaker diarization is the technique that turns that wall into a structured conversation, with turns labelled per person.

In the past two years this technology has made a huge leap thanks to voice embedding models and their integration with large transcription models like Whisper. This guide explains what it is, how it works, how accurate it is, what it's useful for, and how to apply it without any technical hassle.

What is speaker diarization

Speaker diarization is the process by which an AI system takes an audio file with multiple voices and segments it into fragments, labelling each one with the corresponding speaker. The typical output looks like this:

[00:00:02 - 00:00:18] Speaker 1: Thanks for joining the quarterly review...
[00:00:19 - 00:00:34] Speaker 2: Great. Before we start, I wanted to confirm...
[00:00:35 - 00:01:12] Speaker 1: Yes, we'll cover that point at the end.
[00:01:13 - 00:01:40] Speaker 3: I have a question about the budget...

It's important to understand that diarization does not know who the speakers are. It doesn't identify Mary or Charles. It only knows that "voice A is different from voice B" and groups segments accordingly. Assigning real names is a later step, done manually or via voice biometric recognition (speaker recognition), which requires explicit consent.
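The labelled output above is easy to post-process. As a minimal sketch (the regex assumes the exact [hh:mm:ss - hh:mm:ss] format shown, which may vary between tools), here is how those lines can be parsed into structured records:

```python
import re

# Matches lines like "[00:00:02 - 00:00:18] Speaker 1: Thanks for joining..."
TURN = re.compile(r"\[(\d+):(\d+):(\d+) - (\d+):(\d+):(\d+)\] (Speaker \d+): (.*)")

def parse_turns(text):
    """Parse turn-labelled transcript lines into (start_s, end_s, speaker, text) tuples."""
    turns = []
    for line in text.strip().splitlines():
        m = TURN.match(line.strip())
        if not m:
            continue
        h1, m1, s1, h2, m2, s2, speaker, content = m.groups()
        start = int(h1) * 3600 + int(m1) * 60 + int(s1)
        end = int(h2) * 3600 + int(m2) * 60 + int(s2)
        turns.append((start, end, speaker, content))
    return turns

sample = """\
[00:00:02 - 00:00:18] Speaker 1: Thanks for joining the quarterly review...
[00:00:19 - 00:00:34] Speaker 2: Great. Before we start, I wanted to confirm...
"""
print(parse_turns(sample)[0])
# (2, 18, 'Speaker 1', 'Thanks for joining the quarterly review...')
```

Once the turns are structured like this, renaming "Speaker 1" to a real person is a trivial substitution.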

How it works technically (without unnecessary jargon)

A modern diarization system combines several steps. All happen in seconds and the user doesn't see them, but it's worth understanding them to know where the limits are.

  1. Voice Activity Detection (VAD). The system removes silence and non-voice noise to keep only the stretches where someone is speaking.
  2. Segmentation. It splits the voice stretches into short fragments (typically 1-3 seconds) to analyse them separately.
  3. Voice embeddings. Each fragment is converted into a numeric vector (a "voice fingerprint") representing the unique characteristics of timbre, pitch and prosody of the speaker at that moment.
  4. Clustering. The algorithm groups similar vectors. Each cluster represents a distinct speaker. This is where it decides that fragments X, Y and Z belong to the same person.
  5. Alignment with transcription. Finally the result is combined with the transcribed text (from Whisper or another engine) to produce the turn-labelled text.
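Steps 3 and 4 can be illustrated with a toy example. Real systems use learned embeddings with hundreds of dimensions; the 3-dimensional vectors and the greedy cosine-similarity clustering below are stand-ins chosen only to show the idea:

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def cluster(embeddings, threshold=0.9):
    """Greedy clustering: assign each fragment to the first existing speaker
    whose representative vector is similar enough, else open a new cluster."""
    labels, reps = [], []  # reps holds one representative vector per speaker
    for emb in embeddings:
        for i, rep in enumerate(reps):
            if cosine(emb, rep) >= threshold:
                labels.append(f"Speaker {i + 1}")
                break
        else:
            reps.append(emb)
            labels.append(f"Speaker {len(reps)}")
    return labels

# Toy 3-dim "voice fingerprints" for five short fragments (the step-3 output)
frags = [(1.0, 0.1, 0.0), (0.98, 0.12, 0.02),   # same voice twice
         (0.0, 1.0, 0.1), (1.0, 0.09, 0.01),    # a new voice, then the first again
         (0.02, 0.97, 0.12)]
print(cluster(frags))
# ['Speaker 1', 'Speaker 1', 'Speaker 2', 'Speaker 1', 'Speaker 2']
```

Production systems use more robust methods (agglomerative or spectral clustering), but the principle is the same: similar fingerprints end up under the same anonymous label.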

2026 technical note: the most widely used open models are pyannote 3.1 (Hugging Face), NeMo Speaker Diarization (NVIDIA) and WhisperX (which integrates transcription and diarization in one pipeline). They all run on cloud GPUs and process 1 hour of audio in under 2 minutes.

Diarization vs transcription: the key difference

People often confuse the two concepts. They are distinct tasks that complement each other.

Dimension | Transcription | Diarization
Question answered | What is being said? | Who is speaking at each moment?
Output | Plain text | Time intervals + speaker label
Typical model | Whisper, Google STT, Azure Speech | pyannote, NeMo, UIS-RNN
Quality metric | WER (Word Error Rate) | DER (Diarization Error Rate)
Useful alone? | Yes, but hard to read for meetings | No, needs the transcription to make sense

Combining both tasks is what truly delivers value: a transcription structured by speakers is readable, analysable and publishable. Transcription only = wall of text. Diarization only = timestamps with no content.
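The combination step can be sketched as a maximum-overlap alignment: each transcribed segment receives the label of the diarization interval it overlaps most in time. The segment data below is hypothetical; real engines emit richer structures:

```python
def overlap(a, b):
    """Seconds of overlap between two (start, end) intervals."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))

def assign_speakers(transcript, diarization):
    """Label each transcribed segment with the speaker whose diarized
    interval overlaps it the most (a common alignment heuristic)."""
    labelled = []
    for seg in transcript:                      # seg: (start, end, text)
        best = max(diarization, key=lambda d: overlap(seg[:2], d[:2]))
        labelled.append((seg[0], seg[1], best[2], seg[2]))
    return labelled

# Hypothetical engine outputs: Whisper-style segments and diarization turns
transcript = [(0.0, 4.2, "Thanks for joining."),
              (4.5, 9.0, "Great, one question first.")]
diarization = [(0.0, 4.3, "Speaker 1"), (4.3, 9.2, "Speaker 2")]
print(assign_speakers(transcript, diarization))
```

This is why timestamp precision matters on both sides: if the transcription and diarization clocks drift, segments near a speaker change get attributed to the wrong person.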

Got a 2-hour meeting with 5 people to transcribe?

VOCAP combines Whisper + automatic diarization. Upload the audio and receive text structured by turns, ready to share. 15 minutes free, no card required.

Try VOCAP Free

Real diarization accuracy in 2026

The standard metric is the Diarization Error Rate (DER), which measures what percentage of audio time is attributed to the wrong speaker. A 10% DER means that out of every 60 minutes of conversation, 6 minutes are mislabelled. Current benchmarks: on clean audio with 2-4 speakers, the best models (pyannote 3.1, NeMo, WhisperX) reach a DER of 7-12%; on noisy calls with many speakers and overlapping speech, the error can exceed 20%.
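A simplified DER calculation makes the metric concrete. The sketch below counts only confusion time on a fixed frame grid; the full DER also includes missed-speech and false-alarm terms:

```python
def simple_der(reference, hypothesis, frames_per_s=10):
    """Fraction of speech time whose speaker label disagrees with the
    reference, sampled on a fixed frame grid. Simplified: ignores the
    missed-speech and false-alarm components of the full DER."""
    def who(segments, t):
        for start, stop, spk in segments:
            if start <= t < stop:
                return spk
        return None

    end = max(stop for _, stop, _ in reference)
    total = wrong = 0
    for i in range(int(end * frames_per_s)):
        t = i / frames_per_s
        ref = who(reference, t)
        if ref is None:
            continue                      # no reference speech at this frame
        total += 1
        if who(hypothesis, t) != ref:
            wrong += 1
    return wrong / total

# Hypothetical example: the system detects a speaker change 3 seconds late
ref = [(0.0, 30.0, "A"), (30.0, 60.0, "B")]
hyp = [(0.0, 33.0, "A"), (33.0, 60.0, "B")]
print(f"{simple_der(ref, hyp):.2f}")  # 0.05 -> 3 mislabelled seconds out of 60
```

Even a model that never confuses voices accumulates error this way, purely from imprecise change-point boundaries.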

In contexts where accuracy is critical (legal, medical, journalistic), the recommendation is to use diarization as a first pass and manually review the key turns. The tool saves you 90% of the work but doesn't eliminate human review when content is sensitive.

Use cases where diarization is essential

Not every audio needs diarization. A personal voice note or individual dictation doesn't require it. But there are scenarios where without diarization the transcription loses almost all its value:

Work meetings and minutes

Without diarization you can't tell who took each task or who vetoed each decision. A useful set of minutes needs turn attribution. Tools like VOCAP generate structured minutes using diarization as the base.

Journalistic interviews

A journalist needs to distinguish their questions from the interviewee's answers to quote accurately. A long interview without diarization is nearly impossible to edit.

Multi-host podcasts

Publishing the transcription of a 2-4 voice podcast without identifying hosts and guests leaves the content unreadable. With diarization, each turn is labelled for readers and search engines.

Focus groups and market research

Qualitative analysis requires knowing what each participant said. Without diarization, aggregating responses is impossible without re-listening to the entire audio.

Legal depositions and hearings

In legal contexts, attribution is critical: who made each statement, whether judge, prosecutor, defence counsel or witness. Automatic diarization speeds up minute production, though it requires human validation.

Therapy, coaching and clinical interviews

Separating the professional's turn from the patient's allows pattern analysis, session review and structured notes. Always with prior consent.

How to apply diarization in 4 steps without coding

Most users don't want to assemble a pyannote + Whisper pipeline manually. A tool that does it internally is enough. Here's the typical VOCAP flow:

  1. Record with the best possible quality. For in-person meetings, use a directional mic at the centre of the table or, better, one mic per person. On calls, enable multi-channel recording if the platform allows (Zoom and Google Meet can record each participant on a separate track).
  2. Upload the file. Supported formats: MP3, WAV, M4A, MP4, WebM, OGG, FLAC. Up to 150 MB per file; if bigger, compress first or split.
  3. Let the AI do the work. Whisper transcribes the content and pyannote (or equivalent) segments by speakers. The process takes between 1 and 3 minutes per hour of audio.
  4. Review and rename speakers. The system returns "Speaker 1, 2, 3…". Edit the labels to assign real names (Mary, Charles, Anna). This step dramatically improves final readability.
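Step 4 amounts to a simple label substitution over the structured turns. A minimal sketch, with a hypothetical name mapping (the tuple format here is illustrative, not any tool's actual export):

```python
def rename_speakers(turns, names):
    """Replace generic diarization labels with real names where a mapping
    exists; unknown labels are kept as-is."""
    return [(start, end, names.get(spk, spk), text)
            for start, end, spk, text in turns]

turns = [(2, 18, "Speaker 1", "Thanks for joining the quarterly review..."),
         (19, 34, "Speaker 2", "Great. Before we start, I wanted to confirm...")]
names = {"Speaker 1": "Mary", "Speaker 2": "Charles"}
print(rename_speakers(turns, names)[0][2])  # Mary
```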

Transcriptions with identified speakers in 2 minutes

Upload your audio to VOCAP and receive the transcription already separated by turns, with summary and tasks extracted by Claude. From €1/hour or less with subscription.

Start Free with VOCAP

Common mistakes that ruin diarization

Most diarization failures trace back to the recording, not the model:

  1. Recording everyone with a single distant microphone. Voice fingerprints blur together; use a directional mic at the centre of the table or, better, one mic per person.
  2. Ignoring multi-channel recording on calls. Zoom and Google Meet can record each participant on a separate track, which makes attribution almost trivial.
  3. Too many speakers. Beyond 8 people, accuracy drops noticeably because the voice embeddings overlap.
  4. Heavy crosstalk. When people talk over each other, the error rate can exceed 20%; moderating turns so they don't overlap helps enormously.

Frequently asked questions about speaker diarization

What is speaker diarization?

It is the process by which an AI takes an audio file with multiple voices and labels each fragment with the corresponding speaker. It answers "who said what and when". It doesn't identify by name: it only distinguishes different voices and groups them.

How is it different from transcription?

Transcription converts speech to text; diarization identifies who is speaking at each moment. Combined, they produce a transcription structured by conversational turns, which is what truly adds value in meetings and interviews.

How accurate is AI diarization in 2026?

On clean audio with 2-4 speakers, the best models reach a DER of 7-12%. On noisy calls with multiple speakers and overlaps, error can exceed 20%. Microphone quality and channel separation are determining factors.

Does Whisper do diarization by itself?

No. Whisper transcribes but doesn't identify speakers. To get "who said what" you must combine it with a diarization model such as pyannote, NeMo or WhisperX. VOCAP does it automatically and delivers the text already segmented.

Can AI assign the real names?

By default, no. Diarization distinguishes anonymous voices (Speaker 1, 2, 3…). Names are assigned by you or by a separate voice biometric recognition system, which in Europe requires explicit consent under GDPR.

How many speakers can AI separate without losing accuracy?

In practice, 2 to 6 speakers. Beyond 8 simultaneous people, accuracy drops noticeably because embeddings overlap. For large panels, record in multi-channel mode (one mic per person).
