Awesome Speaker Diarization

Overview

This is a curated list of awesome Speaker Diarization papers, libraries, datasets, and other resources.

The purpose of this repo is to organize the world’s resources for speaker diarization, and make them universally accessible and useful.

To add items to this page, simply send a pull request. (contributing guide)

Publications

Special topics

Review & survey papers

A Review of Speaker Diarization: Recent Advances with Deep Learning, 2021
A review on speaker diarization systems and approaches, 2012
Speaker diarization: A review of recent research, 2010

Large language model (LLM)

DiarizationLM: Speaker Diarization Post-Processing with Large Language Models, 2024
Enhancing Speaker Diarization with Large Language Models: A Contextual Beam Search Approach, 2023
Lexical speaker error correction: Leveraging language models for speaker diarization error correction, 2023

Supervised diarization

DiaPer: End-to-End Neural Diarization with Perceiver-Based Attractors, 2023
TOLD: A Novel Two-Stage Overlap-Aware Framework for Speaker Diarization, 2023
Speaker Overlap-aware Neural Diarization for Multi-party Meeting Analysis, 2022
End-to-End Diarization for Variable Number of Speakers with Local-Global Networks and Discriminative Speaker Embeddings, 2021
Supervised online diarization with sample mean loss for multi-domain data, 2019
Discriminative Neural Clustering for Speaker Diarisation, 2019
End-to-End Neural Speaker Diarization with Permutation-Free Objectives, 2019
End-to-End Neural Speaker Diarization with Self-attention, 2019
Fully Supervised Speaker Diarization, 2018

Joint diarization and ASR

A Comparative Study on Speaker-attributed Automatic Speech Recognition in Multi-party Meetings, 2022
Turn-to-Diarize: Online Speaker Diarization Constrained by Transformer Transducer Speaker Turn Detection, 2021
Transcribe-to-Diarize: Neural Speaker Diarization for Unlimited Number of Speakers using End-to-End Speaker-Attributed ASR, 2021
Joint Speech Recognition and Speaker Diarization via Sequence Transduction, 2019
Says who? Deep learning models for joint speech recognition, segmentation and diarization, 2018

Online speaker diarization

Speaker Diarization as a Fully Online Bandit Learning Problem in MiniVox, 2021
Online Speaker Diarization with Relation Network, 2020
VoiceID on the Fly: A Speaker Recognition System that Learns from Scratch, 2020

Challenges

M2MeT: The ICASSP 2022 Multi-Channel Multi-Party Meeting Transcription Challenge, 2022
The Hitachi-JHU DIHARD III system: Competitive end-to-end neural diarization and x-vector clustering systems combined by DOVER-Lap, 2021
Diarization is Hard: Some Experiences and Lessons Learned for the JHU Team in the Inaugural DIHARD Challenge, 2018
ODESSA at Albayzin Speaker Diarization Challenge 2018, 2018
Joint Discriminative Embedding Learning, Speech Activity and Overlap Detection for the DIHARD Challenge, 2018

Audio-Visual Speaker Diarization

AVA-AVD: Audio-Visual Speaker Diarization in the Wild, 2022
DyViSE: Dynamic Vision-Guided Speaker Embedding for Audio-Visual Speaker Diarization, 2022
End-to-End Audio-Visual Neural Speaker Diarization, 2022
MSDWild: Multi-modal Speaker Diarization Dataset in the Wild, 2022

Other

2021

Overlap-aware low-latency online speaker diarization based on end-to-end local segmentation
End-to-end speaker segmentation for overlap-aware resegmentation
DIVE: End-to-end Speech Diarization via Iterative Speaker Embedding
DOVER-Lap: A method for combining overlap-aware diarization outputs
Bayesian HMM clustering of x-vector sequences (VBx) in speaker diarization: Theory, implementation and analysis on standard tasks
AISHELL-4: An Open Source Dataset for Speech Enhancement, Separation, Recognition and Speaker Diarization in Conference Scenario

2020

An End-to-End Speaker Diarization Service for improving Multimedia Content Access
Spot the conversation: speaker diarisation in the wild
Speaker Diarization with Region Proposal Network
Target-Speaker Voice Activity Detection: a Novel Approach for Multi-Speaker Diarization in a Dinner Party Scenario

2019

Overlap-aware diarization: resegmentation using neural end-to-end overlapped speech detection
Speaker diarization using latent space clustering in generative adversarial network
A study of semi-supervised speaker diarization system using gan mixture model
Learning deep representations by multilayer bootstrap networks for speaker diarization
Enhancements for Audio-only Diarization Systems
LSTM based Similarity Measurement with Spectral Clustering for Speaker Diarization
Meeting Transcription Using Virtual Microphone Arrays
Speaker diarisation using 2D self-attentive combination of embeddings
Speaker Diarization with Lexical Information

2018

Neural speech turn segmentation and affinity propagation for speaker diarization
Multimodal Speaker Segmentation and Diarization using Lexical and Acoustic Cues via Sequence to Sequence Neural Networks
Joint Speaker Diarization and Recognition Using Convolutional and Recurrent Neural Networks

2017

Speaker Diarization with LSTM
Speaker diarization using deep neural network embeddings
Speaker diarization using convolutional neural network for statistics accumulation refinement
pyannote.metrics: a toolkit for reproducible evaluation, diagnostic, and error analysis of speaker diarization systems
Speaker Change Detection in Broadcast TV using Bidirectional Long Short-Term Memory Networks
Speaker Diarization using Deep Recurrent Convolutional Neural Networks for Speaker Embeddings

2016

A Speaker Diarization System for Studying Peer-Led Team Learning Groups

2015

Diarization resegmentation in the factor analysis subspace

2014

A study of the cosine distance-based mean shift for telephone speech diarization
Speaker diarization with PLDA i-vector scoring and unsupervised calibration
Artificial neural network features for speaker diarization

2013

Unsupervised methods for speaker diarization: An integrated and iterative approach

2011

PLDA-based Clustering for Speaker Diarization of Broadcast Streams
Speaker diarization of meetings based on speaker role n-gram models

2009

Speaker Diarization for Meeting Room Audio

2008

Stream-based speaker segmentation using speaker factors and eigenvoices

2006

An overview of automatic speaker diarization systems
A spectral clustering approach to speaker diarization

Software

Framework

Link	Language	Description
FunASR	Python & PyTorch	FunASR is an open-source speech toolkit based on PyTorch, which aims at bridging the gap between academic research and industrial applications.
MiniVox	MATLAB	MiniVox is an open-source evaluation system for the online speaker diarization task.
SpeechBrain	Python & PyTorch	SpeechBrain is an open-source and all-in-one speech toolkit based on PyTorch.
SIDEKIT for diarization (s4d)	Python	An open source package extension of SIDEKIT for Speaker diarization.
pyAudioAnalysis	Python	Python Audio Analysis Library: Feature Extraction, Classification, Segmentation and Applications.
AaltoASR	Python & Perl	Speaker diarization scripts, based on AaltoASR.
LIUM SpkDiarization	Java	LIUM_SpkDiarization is software dedicated to speaker diarization (i.e. speaker segmentation and clustering). It is written in Java, and includes the most recent developments in the domain (as of 2013).
kaldi-asr	Bash	Example scripts for speaker diarization on a portion of CALLHOME used in the 2000 NIST speaker recognition evaluation.
kaldi-speaker-diarization	Bash	Icelandic speaker diarization scripts using kaldi.
Alize LIA_SpkSeg	C++	ALIZÉ is an open-source platform for speaker recognition. LIA_SpkSeg is the set of tools for speaker diarization.
pyannote-audio	Python	Neural building blocks for speaker diarization: speech activity detection, speaker change detection, speaker embedding.
pyBK	Python	Speaker diarization using binary key speaker modelling. Computationally light solution that does not require external training data.
Speaker-Diarization	Python	Speaker diarization using uis-rnn and GhostVLAD. An easier way to support openset speakers.
EEND	Python & Bash & Perl	End-to-End Neural Diarization.
VBx	Python	Variational Bayes HMM over x-vectors diarization. x-vector extractor recipe
RE-VERB	Python & JavaScript	RE: VERB is a speaker diarization system that lets the user send or record audio of a conversation and receive timestamps of who spoke when.
StreamingSpeakerDiarization	Python	Streaming speaker diarization, extends pyannote.audio to online processing
simple_diarizer	Python	Simplified diarization pipeline using some pretrained models. Made to be as simple as possible to go from an input audio file to diarized segments.
Picovoice Falcon	C & Python	A lightweight, accurate, and fast speaker diarization engine written in C and available in Python, running on CPU with minimal overhead.
DiaPer	Python	PyTorch implementation for DiaPer: End-to-End Neural Diarization with Perceiver-Based Attractors, including models pre-trained on free and public data.

Evaluation

Link	Language	Description
pyannote-metrics	Python	A toolkit for reproducible evaluation, diagnostic, and error analysis of speaker diarization systems.
SimpleDER	Python	A lightweight library to compute Diarization Error Rate (DER).
DiarizationLM	Python	Implements Word Error Rate (WER), Word Diarization Error Rate (WDER), and concatenated minimum-permutation Word Error Rate (cpWER).
NIST md-eval	Perl	(1) modified md-eval.pl from Mary Tai Knox; (2) md-eval-v21.pl from jitendra; (3) md-eval-22.pl from nryant
dscore	Python & Perl	Diarization scoring tools.
Sequence Match Accuracy	Python	Compute the match accuracy of two label sequences using the Hungarian algorithm.
spyder	Python & C++	Simple Python package for fast DER computation.
CDER	Python	Conversational DER from The Conversational Short-phrase Speaker Diarization (CSSD) Task: Dataset, Evaluation Metric and Baselines
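
Several of the tools above score a hypothesis against a reference after finding the best one-to-one mapping between speaker labels. The following is a minimal sketch of that idea, not the code of any listed library. It assumes both sequences are frame-aligned and of equal length, that `None` is never used as a label, and it swaps the Hungarian algorithm for a brute-force search over label permutations, which is only practical for a handful of speakers. The function name `sequence_match_accuracy` is chosen for illustration.

```python
from itertools import permutations


def sequence_match_accuracy(reference, hypothesis):
    """Best accuracy over all one-to-one mappings of hypothesis to reference labels."""
    if len(reference) != len(hypothesis):
        raise ValueError("sequences must have the same length")
    if not reference:
        return 1.0
    # Unique labels in order of first appearance.
    ref_labels = list(dict.fromkeys(reference))
    hyp_labels = list(dict.fromkeys(hypothesis))
    # Pad the shorter label set with None so that every permutation is a
    # valid one-to-one mapping even when the speaker counts differ.
    size = max(len(ref_labels), len(hyp_labels))
    ref_padded = ref_labels + [None] * (size - len(ref_labels))
    hyp_padded = hyp_labels + [None] * (size - len(hyp_labels))
    best = 0
    # Brute force: try every assignment of reference labels to hypothesis labels.
    for perm in permutations(ref_padded):
        mapping = dict(zip(hyp_padded, perm))
        correct = sum(1 for r, h in zip(reference, hypothesis) if mapping[h] == r)
        best = max(best, correct)
    return best / len(reference)


reference = ["A", "A", "B", "B", "C"]
hypothesis = [1, 1, 2, 2, 2]
print(sequence_match_accuracy(reference, hypothesis))  # 0.8
```

For more than a few speakers, an assignment solver such as the Hungarian algorithm replaces the permutation loop and yields the same optimum in polynomial time.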

Clustering

Link	Language	Description
uis-rnn	Python & PyTorch	Google's Unbounded Interleaved-State Recurrent Neural Network (UIS-RNN) algorithm, for Fully Supervised Speaker Diarization. This clustering algorithm is supervised.
uis-rnn-sml	Python & PyTorch	A variant of UIS-RNN, for the paper Supervised Online Diarization with Sample Mean Loss for Multi-Domain Data.
DNC	Python & ESPnet	Transformer-based Discriminative Neural Clustering (DNC) for Speaker Diarisation. Like UIS-RNN, it is supervised.
SpectralCluster	Python	Spectral clustering with affinity matrix refinement operations, auto-tune, and speaker turn constraints.
sklearn.cluster	Python	scikit-learn clustering algorithms.
PLDA	Python	Probabilistic Linear Discriminant Analysis & classification, written in Python.
PLDA	C++	Open-source implementation of simplified PLDA (Probabilistic Linear Discriminant Analysis).
Auto-Tuning Spectral Clustering	Python	Auto-tuning Spectral Clustering method that does not need development set or supervised tuning.
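
As a rough illustration of the unsupervised clustering step that these libraries implement, here is a minimal pure-Python sketch of agglomerative clustering with cosine similarity and average linkage. It is not taken from any of the listed projects. The function names and the default `threshold` of 0.5 are assumptions for illustration, and the embeddings are assumed to be non-zero vectors of equal dimension.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def agglomerative_cluster(embeddings, threshold=0.5):
    """Merge segments bottom-up until no pair of clusters is similar enough.

    Returns one integer speaker label per input embedding.
    """
    # Start with every segment embedding in its own cluster.
    clusters = [[i] for i in range(len(embeddings))]

    def average_linkage(a, b):
        total = sum(
            cosine_similarity(embeddings[i], embeddings[j]) for i in a for j in b
        )
        return total / (len(a) * len(b))

    while len(clusters) > 1:
        # Find the most similar pair of clusters.
        best_score, best_pair = -2.0, None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                score = average_linkage(clusters[x], clusters[y])
                if score > best_score:
                    best_score, best_pair = score, (x, y)
        # Stop when even the closest pair falls below the threshold.
        if best_score < threshold:
            break
        x, y = best_pair
        clusters[x].extend(clusters[y])
        del clusters[y]

    labels = [0] * len(embeddings)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
print(agglomerative_cluster(embeddings))  # [0, 0, 1, 1]
```

The stopping threshold determines the number of speakers here. Spectral clustering and the supervised methods listed above estimate it differently.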

Speaker embedding

Link	Method	Language	Description
resemble-ai/Resemblyzer	d-vector	Python & PyTorch	PyTorch implementation of generalized end-to-end loss for speaker verification, which can be used for voice cloning and diarization.
Speaker_Verification	d-vector	Python & TensorFlow	TensorFlow implementation of generalized end-to-end loss for speaker verification.
PyTorch_Speaker_Verification	d-vector	Python & PyTorch	PyTorch implementation of "Generalized End-to-End Loss for Speaker Verification" by Wan, Li et al. With UIS-RNN integration.
Real-Time Voice Cloning	d-vector	Python & PyTorch	Implementation of "Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis" (SV2TTS) with a vocoder that works in real-time.
deep-speaker	d-vector	Python & Keras	Third party implementation of the Baidu paper Deep Speaker: an End-to-End Neural Speaker Embedding System.
x-vector-kaldi-tf	x-vector	Python & TensorFlow & Perl	TensorFlow implementation of x-vector topology on top of Kaldi recipe.
kaldi-ivector	i-vector	C++ & Perl	Extension to Kaldi implementing the standard i-vector hyperparameter estimation and i-vector extraction procedure.
voxceleb-ivector	i-vector	Perl	Voxceleb1 i-vector based speaker recognition system.
pytorch_xvectors	x-vector	Python & PyTorch	PyTorch implementation of Voxceleb x-vectors. Additionally, includes meta-learning architectures for embedding training. Evaluated with speaker diarization and speaker verification.
ASVtorch	i-vector	Python & PyTorch	ASVtorch is a toolkit for automatic speaker recognition.
asv-subtools	i-vector & x-vector	Kaldi & PyTorch	ASV-Subtools is developed based on PyTorch and Kaldi for the task of speaker recognition, language identification, etc. The 'sub' of 'subtools' means that there are many modular tools and the parts constitute the whole.
WeSpeaker	x-vector & r-vector	Python & C++ & PyTorch	WeSpeaker is a research and production oriented speaker verification, recognition and diarization toolkit, which supports very strong recipes with on-the-fly data preparation, model training and evaluation, as well as runtime C++ codes.
ReDimNet	improved ResNet	PyTorch	Neural network architecture presented in the paper Reshape Dimensions Network for Speaker Recognition.

Speaker change detection

Link	Language	Description
change_detection	Python & Keras	Code for Speaker Change Detection in Broadcast TV using Bidirectional Long Short-Term Memory Networks.
tidydiarize	Python	Diarization inside OpenAI Whisper decoder

Audio feature extraction

Link	Language	Description
LibROSA	Python	Python library for audio and music analysis. https://librosa.github.io/
python_speech_features	Python	This library provides common speech features for ASR including MFCCs and filterbank energies. https://python-speech-features.readthedocs.io/en/latest/
pyAudioAnalysis	Python	Python Audio Analysis Library: Feature Extraction, Classification, Segmentation and Applications.

Audio data augmentation

Link	Language	Description
pyroomacoustics	Python	Pyroomacoustics is a package for audio signal processing for indoor applications. It was developed as a fast prototyping platform for beamforming algorithms in indoor scenarios. https://pyroomacoustics.readthedocs.io
gpuRIR	Python	Python library for Room Impulse Response (RIR) simulation with GPU acceleration
rir_simulator_python	Python	Room impulse response simulator using Python.
WavAugment	Python & PyTorch	WavAugment performs data augmentation on audio data. The audio data is represented as PyTorch tensors.
EEND_dataprep	Bash & Python	Recipes for generating simulated conversations used to train end-to-end diarization models.

Other software

Link	Language	Description
VB Diarization	Python	VB Diarization with Eigenvoice and HMM Priors.
DOVER-Lap	Python	Python package for combining diarization system outputs

Datasets

Diarization datasets

Audio	Diarization ground truth	Language	Pricing	Additional information
2000 NIST Speaker Recognition Evaluation	Disk-6 (Switchboard), Disk-8 (CALLHOME)	Multiple	$2400.00	Evaluation Plan
2003 NIST Rich Transcription Evaluation Data	Together with audios	en, ar, zh	$2000.00	telephone speech, broadcast news
CALLHOME American English Speech	CALLHOME American English Transcripts	en	$1500.00 + $1000.00	CH109 whitelist
The ICSI Meeting Corpus	Together with audios	en	Free	License
The AMI Meeting Corpus	Together with audios (need to be processed)	Multiple	Free	License
Fisher English Training Speech Part 1 Speech	Fisher English Training Speech Part 1 Transcripts	en	$7000.00 + $1000.00
Fisher English Training Part 2, Speech	Fisher English Training Part 2, Transcripts	en	$7000.00 + $1000.00
VoxConverse	TBD	TBD	Free	VoxConverse is an audio-visual diarisation dataset consisting of over 50 hours of multispeaker clips of human speech, extracted from YouTube videos
MiniVox Benchmark	MiniVox Benchmark	en	Free	MiniVox is an automatic framework to transform any speaker-labelled dataset into a continuous speech data stream with episodically revealed label feedback.
The AliMeeting Corpus	Together with audios	zh	Free

Speaker embedding training sets

Name	Utterances	Speakers	Language	Pricing	Additional information
TIMIT	6K+	630	en	$250.00	Published in 1993, the TIMIT corpus of read speech is one of the earliest speaker recognition datasets.
VCTK	43K+	109	en	Free	Most were selected from a newspaper plus the Rainbow Passage and an elicitation paragraph intended to identify the speaker's accent.
LibriSpeech	292K	2K+	en	Free	Large-scale (1000 hours) corpus of read English speech.
Multilingual LibriSpeech (MLS)	?	?	en, de, nl, es, fr, it, pt, pl	Free	Multilingual LibriSpeech (MLS) dataset is a large multilingual corpus suitable for speech research. The dataset is derived from read audiobooks from LibriVox and consists of 8 languages - English, German, Dutch, Spanish, French, Italian, Portuguese, Polish.
LibriVox	180K	9K+	Multiple	Free	Free public domain audiobooks. LibriSpeech is a processed subset of LibriVox. Each original unsegmented utterance could be very long.
VoxCeleb 1&2	1M+	7K	Multiple	Free	VoxCeleb is an audio-visual dataset consisting of short clips of human speech, extracted from interview videos uploaded to YouTube.
The Spoken Wikipedia Corpora	5K	879	en, de, nl	Free	Volunteer readers reading Wikipedia articles.
CN-Celeb	130K+	1K	zh	Free	A Free Chinese Speaker Recognition Corpus Released by CSLT@Tsinghua University.
BookTubeSpeech	8K	8K	en	Free	Audio samples extracted from BookTube videos - videos where people share their opinions on books - from YouTube. The dataset can be downloaded using BookTubeSpeech-download.
DeepMine	540K	1850	fa, en	Unknown	A speech database in Persian and English designed to build and evaluate speaker verification, as well as Persian ASR systems.
NISP-Dataset	?	345	hi, kn, ml, ta, te (all Indian languages)	Free	This dataset contains speech recordings along with speaker physical parameters (height, weight, ... ) as well as regional information and linguistic information.
VoxBlink2	10M	100k+	18 languages (en, pt, es, ru, ar, ...)	CC BY-NC-SA 4.0	Multilingual dataset from VoxBlink2: A 100K+ Speaker Recognition Corpus and the Open-Set Speaker-Identification Benchmark

Augmentation noise sources

Name	Utterances	Pricing	Additional information
AudioSet	2M	Free	A large-scale dataset of manually annotated audio events.
MUSAN	N/A	Free	MUSAN is a corpus of music, speech, and noise recordings.
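
Noise corpora like these are typically mixed into clean speech at a chosen signal-to-noise ratio (SNR). The sketch below shows one common way to do that on plain Python lists of samples. The function name `mix_at_snr` and the choice to loop the noise clip are assumptions for illustration, not the procedure of any specific toolkit listed here.

```python
import math


def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech so that the speech-to-noise power ratio is snr_db."""
    if not speech or not noise:
        raise ValueError("speech and noise must be non-empty")
    # Loop the noise clip so that it covers the whole speech signal.
    repeats = -(-len(speech) // len(noise))  # ceiling division
    noise = (list(noise) * repeats)[: len(speech)]
    speech_power = sum(s * s for s in speech) / len(speech)
    noise_power = sum(n * n for n in noise) / len(noise)
    if noise_power == 0.0:
        # Silent noise clip: nothing to add.
        return list(speech)
    # Solve speech_power / (gain**2 * noise_power) = 10 ** (snr_db / 10) for gain.
    gain = math.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return [s + gain * n for s, n in zip(speech, noise)]


speech = [1.0, -1.0] * 4
noise = [0.5, 0.5]
mixed = mix_at_snr(speech, noise, snr_db=0)
print(mixed[:2])  # [2.0, 0.0]
```

In practice the samples would come from audio files, and the SNR is usually drawn at random from a range for each training example.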

Conferences

Conference/Workshop	Frequency	Page Limit	Organization	Blind Review
ICASSP	Annual	4 + 1 (ref)	IEEE	No
InterSpeech	Annual	4 + 1 (ref)	ISCA	No
Speaker Odyssey	Biennial	8 + 2 (ref)	ISCA	No
SLT	Biennial	6 + 2 (ref)	IEEE	Yes
ASRU	Biennial	6 + 2 (ref)	IEEE	Yes
WASPAA	Biennial	4 + 1 (ref)	IEEE	No
IJCB	Annual	8	IEEE & IAPR TC-4	Yes

Other learning materials

Online courses

Course on Udemy: A Tutorial on Speaker Diarization

Books

Voice Identity Techniques: From core algorithms to engineering practice (Chinese) by Quan Wang, 2020

Tech blogs

Literature Review For Speaker Change Detection by Halil Erdoğan
Speaker Diarization: Separation of Multiple Speakers in an Audio File by Jaspreet Singh
Speaker Diarization with Kaldi by Yoav Ramon
Who spoke when! How to Build your own Speaker Diarization Module by Rahul Saxena

Video tutorials

pyannote audio: neural building blocks for speaker diarization by Hervé Bredin
Google's Diarization System: Speaker Diarization with LSTM by Google
Fully Supervised Speaker Diarization: Say Goodbye to clustering by Google
Turn-to-Diarize: Online Speaker Diarization Constrained by Transformer Transducer Speaker Turn Detection by Google
Speaker Diarization: Optimal Clustering and Learning Speaker Embeddings by Microsoft Research
Robust Speaker Diarization for Meetings: the ICSI system by Microsoft Research
【机器之心&博文视点】入门声纹技术｜第二讲：声纹分割聚类与其他应用 by Quan Wang

Products

Company	Product
Google	Recorder app
Google	Google Cloud Speech-to-Text API
Amazon	Amazon Transcribe
IBM	Watson Speech To Text API
DeepAffects	Speaker Diarization API
Alibaba	Tingwu (听悟)
Microsoft	Azure Conversation Transcription API
