What Is Voice Biometrics?

Voice Biometrics - Definition

Voice biometrics (also known as speaker recognition or voice recognition for identity verification) refers to a set of methods used to identify or verify a person's identity based on speech characteristics and voice acoustics. These systems create a speaker profile (typically a feature vector known as an embedding) and compare it against reference templates.

From a legal perspective, voice data qualifies as biometric data if it is processed for the purpose of uniquely identifying a natural person. Under the GDPR, such data is considered a special category of personal data and requires compliance with Article 9, along with enhanced security and protection measures.

In the context of video and image anonymization, voice biometrics concerns the audio track in video files. Even if faces and license plates are blurred, an individual may still be identifiable by their voice. Therefore, risk assessments and anonymization strategies for video materials should account for potential speaker identification and, where necessary, include audio modification, masking, or muting.

The Role of Voice Biometrics in Video and Image Anonymization

In multimedia anonymization workflows, voice biometrics serves as a risk assessment framework. It helps estimate the likelihood of re-identifying individuals based on speech. The goal is not to recognize individuals during anonymization, but to understand which vocal features enable identification and which transformations effectively reduce that risk.

  • Risk assessment and DPIA - A voice can enable identification even when faces are blurred, especially in long recordings or when the speaker has a distinctive tone. A Data Protection Impact Assessment (DPIA) should address this risk and define mitigating measures.
  • Speech segment detection - Identifying where speech occurs in the audio track to selectively apply muting, modulation, or voice transformation.
  • Speaker diarization - Separating speakers allows different levels of modification to be applied to specific individuals, depending on legal grounds or consent.
  • Effectiveness validation - After voice transformation, embedding similarity can be tested against known samples to verify that it falls below a defined threshold, supporting claims of reduced identifiability.
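
The effectiveness check in the last point can be sketched with plain cosine similarity over embeddings. The vectors and the 0.25 threshold below are purely illustrative stand-ins, not values from any specific system:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two speaker embeddings."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Illustrative embeddings: the original voice vs. the same recording
# after voice transformation (here just random vectors for demonstration).
rng = np.random.default_rng(0)
original = rng.normal(size=192)     # e.g., an x-vector-sized embedding
transformed = rng.normal(size=192)  # embedding extracted after transformation

score = cosine_similarity(original, transformed)
THRESHOLD = 0.25  # illustrative linkability threshold from the DPIA

print(f"similarity={score:.3f}, below threshold: {score < THRESHOLD}")
```

In a real workflow, both embeddings would come from the same speaker-recognition model, and the threshold would be calibrated against that model's score distribution.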

Gallio PRO automates face and license plate blurring in offline and on-premise environments. The software does not perform speech recognition or audio anonymization. When voice masking is required, separate tools and processes should be used, and the results documented in the DPIA.

Technologies and Architectures Used in Voice Biometrics

Modern voice biometric systems rely primarily on deep learning techniques that generate compact voice representations robust to noise and channel variation. Below is an overview of key components and their role in risk assessment and audio sanitization.

  • Feature extraction - Traditional MFCCs and deep embeddings, including x-vectors and ECAPA-TDNN, trained on large and diverse speech datasets.
  • Verification and identification - Embedding comparison using cosine similarity measures or PLDA classifiers. In anonymization, these methods help assess speaker linkability before and after voice modification.
  • Speaker diarization - Segmentation into speakers using VAD, embeddings, and clustering (e.g., spectral clustering), enabling selective audio processing.
  • Presentation Attack Detection (PAD) - Mechanisms that detect replay attacks and synthetic speech, critical for assessing misuse risks.
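
As a toy illustration of the speech-detection step, a simple energy-based VAD can flag frames for further processing. Production systems use trained models; the frame length and threshold here are arbitrary:

```python
import numpy as np

def energy_vad(signal: np.ndarray, frame_len: int = 400,
               threshold: float = 0.01) -> np.ndarray:
    """Return a boolean mask per frame: True where frame energy exceeds the threshold."""
    n_frames = len(signal) // frame_len
    frames = signal[: n_frames * frame_len].reshape(n_frames, frame_len)
    energy = np.mean(frames ** 2, axis=1)
    return energy > threshold

# Synthetic signal at 16 kHz: 0.5 s of silence followed by 0.5 s of a tone.
sr = 16000
t = np.arange(sr // 2) / sr
signal = np.concatenate([np.zeros(sr // 2), 0.3 * np.sin(2 * np.pi * 220 * t)])

mask = energy_vad(signal)
print(f"speech frames: {mask.sum()} of {len(mask)}")  # → speech frames: 20 of 40
```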

If a video still contains audio after face blurring, best practice involves detecting speech and modifying it (e.g., via voice conversion or pitch shifting) or fully muting the audio track when required by data minimization principles.
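
Full muting of detected segments, the strictest data-minimization measure, amounts to zeroing the corresponding sample ranges. The segment boundaries below are illustrative:

```python
import numpy as np

def mute_segments(audio: np.ndarray, sr: int,
                  segments: list[tuple[float, float]]) -> np.ndarray:
    """Return a copy of `audio` with the given (start_s, end_s) segments silenced."""
    out = audio.copy()
    for start_s, end_s in segments:
        out[int(start_s * sr): int(end_s * sr)] = 0.0
    return out

sr = 16000
audio = np.ones(sr * 2)                          # 2 s placeholder signal
muted = mute_segments(audio, sr, [(0.5, 1.0)])   # mute speech found at 0.5-1.0 s

print(muted[int(0.75 * sr)], muted[int(1.5 * sr)])  # → 0.0 1.0
```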

Key Parameters and Metrics in Voice Biometrics

The performance and security of voice processing systems are evaluated using standardized metrics. In anonymization contexts, these metrics are primarily used to assess the residual risk of speaker linkability after audio transformation.

| Metric | Definition | Unit | Relevance for Anonymization |
| --- | --- | --- | --- |
| EER | Equal Error Rate - the point where the false acceptance rate equals the false rejection rate | % | A higher EER after modification indicates lower speaker distinguishability |
| FMR / FNMR | False Match Rate and False Non-Match Rate as defined in ISO/IEC 19795-1 | % | Controls embedding similarity thresholds before and after transformation |
| minDCF | Minimum Detection Cost Function according to NIST SRE protocols | Unitless | Aggregated error cost metric useful for comparing modification methods |
| DER | Diarization Error Rate - the sum of missed speech, false alarms, and speaker misattributions divided by total speech time | % | Evaluates speaker separation quality for selective processing |
| Latency | Processing time per minute of audio under a defined configuration | ms or real-time factor (xRT) | Supports planning of batch video anonymization workflows |
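
The EER can be estimated from lists of genuine and impostor comparison scores by sweeping a threshold until the two error rates cross. The score distributions below are synthetic:

```python
import numpy as np

def equal_error_rate(genuine: np.ndarray, impostor: np.ndarray) -> float:
    """Approximate EER: sweep thresholds, return the point where FAR and FRR are closest."""
    thresholds = np.sort(np.concatenate([genuine, impostor]))
    best_eer, best_gap = 1.0, np.inf
    for t in thresholds:
        far = np.mean(impostor >= t)  # impostors wrongly accepted
        frr = np.mean(genuine < t)    # genuine speakers wrongly rejected
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, best_eer = gap, (far + frr) / 2
    return float(best_eer)

# Synthetic, well-separated score distributions.
rng = np.random.default_rng(1)
genuine = rng.normal(0.7, 0.1, 1000)
impostor = rng.normal(0.2, 0.1, 1000)

print(f"EER ~ {equal_error_rate(genuine, impostor):.3%}")
```

After an effective voice transformation, scores between the original and modified recordings should drift toward the impostor distribution, pushing the measured EER upward.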

In practice, telephone channels typically use 8 kHz sampling, while microphone recordings use 16 kHz or higher. This choice affects feature extraction and model selection and should align with the adopted evaluation protocol.
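
When a model expects telephone-band audio, a wideband recording must be decimated first. The two-sample averaging below is a crude stand-in for a proper low-pass anti-aliasing filter, shown only to make the sample-rate step concrete:

```python
import numpy as np

def downsample_2x(signal: np.ndarray) -> np.ndarray:
    """Naive 16 kHz -> 8 kHz decimation: average adjacent sample pairs."""
    n = len(signal) // 2 * 2
    pairs = signal[:n].reshape(-1, 2)
    return pairs.mean(axis=1)

x = np.arange(8, dtype=float)  # stand-in for 16 kHz samples
y = downsample_2x(x)
print(y)  # → [0.5 2.5 4.5 6.5]
```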

Challenges and Limitations

Voice-related implementations involve technical and legal risks. In anonymization processes, these risks must be identified and documented to justify the chosen safeguards.

  • Domain mismatch - Channel variation, acoustic conditions, and noise degrade embedding comparability and must be considered in risk assessments.
  • Presentation attacks - Replay and synthetic speech attacks require PAD mechanisms as described in the ISO/IEC 30107 standards family.
  • Template protection - ISO/IEC 24745 addresses biometric information protection, including linkability prevention and reconstruction risks.
  • Legal basis - Processing voice data for the purpose of uniquely identifying a person may constitute special category data processing under Article 9 GDPR and requires a valid legal basis and, depending on risk level, a DPIA.
  • Documentation and logging - Video processing systems should minimize logs. Gallio PRO does not store logs from face or license plate detection and does not collect sensitive data.

Practical Applications in Anonymization Workflows

Organizations that publish video materials featuring private individuals should include voice identification risk management in their privacy policies. The following steps may be considered:

  • Extract speech-containing audio tracks and classify scenes by identification risk.
  • Select an appropriate measure - full muting, partial masking, or voice transformation - justified by proportionality and data minimization principles.
  • Evaluate effectiveness by comparing embeddings before and after transformation to demonstrate reduced similarity below a defined threshold.
  • Integrate into the workflow - Gallio PRO performs face and license plate blurring in offline and on-premise environments, while audio processing is handled in a separate tool.
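
The steps above can be strung together in a minimal pipeline sketch. Every function here is a hypothetical stub standing in for a real tool (a trained VAD, a voice-conversion model, a speaker-embedding extractor):

```python
import numpy as np

def detect_speech(audio: np.ndarray, sr: int) -> list[tuple[float, float]]:
    """Stub VAD: treat the whole non-silent region as one speech segment."""
    idx = np.flatnonzero(np.abs(audio) > 1e-4)
    return [] if idx.size == 0 else [(idx[0] / sr, (idx[-1] + 1) / sr)]

def apply_measure(audio: np.ndarray, sr: int,
                  segments: list[tuple[float, float]]) -> np.ndarray:
    """Stub measure: full muting, the strictest of the options listed above."""
    out = audio.copy()
    for start_s, end_s in segments:
        out[int(start_s * sr): int(end_s * sr)] = 0.0
    return out

def residual_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Stub validation: cosine similarity over raw waveforms instead of embeddings."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(np.dot(a, b) / denom) if denom else 0.0

sr = 16000
audio = np.sin(2 * np.pi * 200 * np.arange(sr) / sr)  # 1 s synthetic "speech"

segments = detect_speech(audio, sr)
processed = apply_measure(audio, sr, segments)
residual = residual_similarity(audio, processed)
print(f"segments={len(segments)}, residual similarity={residual:.2f}")
```

In practice, each stub would be replaced by a dedicated component, and the residual-similarity result would be recorded in the DPIA as evidence of reduced identifiability.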

Standards and References

The following documents define terminology, metrics, and requirements related to biometric data and speaker recognition system evaluation:

  • Regulation (EU) 2016/679 (GDPR) - Article 4(14), Article 9, and Recital 51. Official text: EUR-Lex.
  • European Data Protection Board, Guidelines 3/2019 on processing personal data through video devices, Version 2.0, 29 January 2020 - references to audio recording in surveillance contexts. EDPB.
  • ISO/IEC 19795-1:2021 - Information technology - Biometric performance testing and reporting - Part 1: Principles and framework.
  • ISO/IEC 24745:2022 - Information security - Biometric information protection.
  • ISO/IEC 30107-3:2017 - Biometric presentation attack detection - Part 3: Testing and reporting.
  • NIST Speaker Recognition Evaluations (SRE) - scope, protocols, minDCF and EER metrics. nist.gov.
  • D. Snyder et al., “X-vectors: Robust DNN embeddings for speaker recognition,” ICASSP 2018.
  • B. Desplanques et al., “ECAPA-TDNN: Emphasized Channel Attention, Propagation and Aggregation,” Interspeech 2020.