CompSpoof V2 Dataset

1. Introduction

CompSpoof V2 is a dataset designed for component-level anti-spoofing detection research, where either the speech or the environmental sound component (or both) may be spoofed.

📊 CompSpoof V2 contains over 250k audio samples, with a total duration of approximately 283 hours. ⏱️ Each audio sample has a fixed length of 4 seconds and is provided at multiple sampling rates, enabling a more faithful simulation of real-world acoustic and system-level variations.

Building upon the CompSpoof dataset, CompSpoof V2 significantly expands the diversity of attack sources, environmental sounds, and mixing strategies. ✨ In addition, newly generated audio samples are included in the eval and test sets, where they serve as detection data under unseen conditions.

🤗 CompSpoof V2 Download Link: https://huggingface.co/datasets/XuepingZhang/ESDD2-CompSpoof-V2/

💻 Baseline code: https://github.com/XuepingZhang/ESDD2-Baseline

2. Download and Setup

Step 1: Visit https://huggingface.co/datasets/XuepingZhang/ESDD2-CompSpoof-V2

Step 2: Read and acknowledge the license (you need to click the ‘acknowledge license’ button)

Step 3: Install huggingface_hub and log in

pip install huggingface_hub[hf_transfer]

huggingface-cli login     # enter your Hugging Face access token

Step 4: Download and extract

hf download XuepingZhang/ESDD2-CompSpoof-V2 --repo-type dataset --local-dir ./CompSpoofV2

cd CompSpoofV2

tar -zxvf eval.tar.gz

cat development.tar.gz.part_* > development.tar.gz

tar -zxvf development.tar.gz
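For environments without `cat` and `tar` (for example, some Windows setups), the reassembly and extraction in Step 4 can be sketched in Python using only the standard library. This is a minimal sketch, not part of the official tooling: the function names `join_parts` and `extract` are hypothetical, and the sketch assumes the part files sort into the correct order by name, which is the same assumption the shell glob `development.tar.gz.part_*` makes.

```python
import shutil
import tarfile
from pathlib import Path


def join_parts(directory, stem="development.tar.gz"):
    """Concatenate <stem>.part_* files, in name order, into <stem>."""
    directory = Path(directory)
    parts = sorted(directory.glob(stem + ".part_*"))
    if not parts:
        raise FileNotFoundError(f"no {stem}.part_* files in {directory}")
    target = directory / stem
    with open(target, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                # Stream each part so large archives are never held in memory.
                shutil.copyfileobj(src, out)
    return target


def extract(archive, dest="."):
    """Extract a gzip-compressed tar archive into dest."""
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(dest)
```

A typical call, run from inside the download directory, would be `extract(join_parts("."))` followed by `extract("eval.tar.gz")`.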

3. Audio Class Description and Samples

Below are audio samples from the CompSpoof V2 dataset. For each class, we provide the mixed/original audio, along with the speech and environment sources.

Class 0 — Original

Label: original

Description: Original bona fide speech and corresponding environment audio without mixing

Original

Class 1 — Bona fide + Bona fide

Label: bonafide_bonafide

Description: Bona fide speech mixed with bona fide environmental audio

Mixed	Speech	Environment

Class 2 — Spoofed Speech + Bona fide Environment

Label: spoof_bonafide

Description: Spoof speech mixed with bona fide environmental audio

Mixed	Speech	Environment

Class 3 — Bona fide Speech + Spoofed Environment

Label: bonafide_spoof

Description: Bona fide speech mixed with spoof environmental audio

Mixed	Speech	Environment

Class 4 — Spoofed Speech + Spoofed Environment

Label: spoof_spoof

Description: Spoof speech mixed with spoof environmental audio

Mixed	Speech	Environment
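The five labels above encode which component is spoofed: in the mixed classes, the part before the underscore describes the speech and the part after it describes the environmental sound. The following sketch decodes a label string into per-component flags. The names `Components` and `decode_label` are hypothetical helpers for illustration and are not part of the dataset or the baseline code.

```python
from typing import NamedTuple

# Index in this tuple equals the class ID used in the headings above.
CLASS_LABELS = (
    "original",
    "bonafide_bonafide",
    "spoof_bonafide",
    "bonafide_spoof",
    "spoof_spoof",
)


class Components(NamedTuple):
    class_id: int
    mixed: bool           # False only for the unmixed `original` class
    speech_spoofed: bool
    env_spoofed: bool


def decode_label(label):
    """Map a label string to its class ID and component-level flags."""
    if label not in CLASS_LABELS:
        raise ValueError(f"unknown label: {label!r}")
    class_id = CLASS_LABELS.index(label)
    if label == "original":
        # Unmixed bona fide audio: neither component is spoofed.
        return Components(class_id, False, False, False)
    speech, env = label.split("_")
    return Components(class_id, True, speech == "spoof", env == "spoof")
```

For example, `decode_label("spoof_bonafide")` reports class 2 with spoofed speech and a bona fide environment.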

4. CompSpoof V2 vs. CompSpoof

The CompSpoof dataset is our previously released dataset designed for component-level spoofing detection. Building upon this foundation, we introduce CompSpoof V2, a substantially upgraded version with an expanded task formulation. The key differences between CompSpoof and CompSpoof V2 are summarized below.

Aspect	CompSpoof	CompSpoof V2
Data volume	2.5k audio clips, about 7 hours	📊 more than 250k audio clips, about 283 hours
Data sources	SSTC, ASV5, VggSound, VcapAV, Common Voice	AudioCaps, CommonVoice, LibriTTS, english-conversation-corpus, ASV5, MLAAD, TUTASC, TUTSED, UrbanSound, VGGSound, EnvSDD, VcapAV
Duration	5 to 21 seconds per clip	⏱️ 4 seconds per clip
Newly generated audio	❌	✅

5. Dataset Structure

The dataset follows a hierarchical directory structure organized by data split:

CompSpoof
├── development                     # training and validation data, including audio sources
│   ├── env_source                  # environmental sound audio used as the environmental sound component in the mixture
│   ├── metadata                    # metadata of the development set
│   │   ├── train.csv
│   │   └── val.csv
│   ├── mixed_audio                 # mixed audio files, which **don't** belong to the `original` class
│   ├── original_audio              # audio files belonging to the `original` class
│   └── speech_sources              # speech audio used as the speech component in the mixture
│
├── eval                            # eval set data, without audio sources
│   ├── audio                       # audio files
│   └── metadata                    # metadata of the eval set, containing file names only
│       └── eval.csv
│
├── eval_source (Released later)    # eval set data, including audio sources
│   ├── env_sources                 # environmental sound audio used as the environmental sound component in the mixture
│   ├── metadata                    # metadata of the eval set, with full annotation
│   │   └── eval.csv
│   ├── mixed_audio                 # mixed audio files, which **don't** belong to the `original` class
│   ├── original_audio              # audio files belonging to the `original` class
│   └── speech_sources              # speech audio used as the speech component in the mixture
│
├── test                            # test set data, without audio sources
│   ├── audio                       # audio files
│   └── metadata                    # metadata of the test set, containing file names only
│       └── test.csv
│
└── test_source (Released later)    # test set data, including audio sources
    ├── env_sources                 # environmental sound audio used as the environmental sound component in the mixture
    ├── metadata                    # metadata of the test set, with full annotation
    │   └── test.csv
    ├── mixed_audio                 # mixed audio files, which **don't** belong to the `original` class
    ├── original_audio              # audio files belonging to the `original` class
    └── speech_sources              # speech audio used as the speech component in the mixture
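After extraction, a quick check that the expected folders and metadata files are present can catch an incomplete download early. The sketch below is illustrative only: `EXPECTED_ENTRIES` and `missing_entries` are hypothetical names, the entries are copied from the `development` and `eval` parts of the tree above (the two archives extracted in the setup steps), and the root folder name on disk may differ from the one shown in the tree.

```python
from pathlib import Path

# Relative paths copied from the directory tree above (development and eval only).
EXPECTED_ENTRIES = (
    "development/env_source",
    "development/metadata/train.csv",
    "development/metadata/val.csv",
    "development/mixed_audio",
    "development/original_audio",
    "development/speech_sources",
    "eval/audio",
    "eval/metadata/eval.csv",
)


def missing_entries(root):
    """Return the expected relative paths that do not exist under root."""
    root = Path(root)
    return [rel for rel in EXPECTED_ENTRIES if not (root / rel).exists()]
```

An empty list from `missing_entries("./CompSpoofV2")` suggests that both archives were extracted where the tree says they should be.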

6. Audio Source

The audio sources for each category are as follows:

🏋️ train & val set

Label	Original Source	Speech Source	Environmental Sound Source
original	AudioCaps, VggSound	-	-
bonafide_bonafide	-	CommonVoice, LibriTTS, english-conversation-corpus	AudioCaps, TUTASC, TUTSED, UrbanSound, VGGSound
bonafide_spoof	-	CommonVoice, LibriTTS	EnvSDD, VcapAV
spoof_bonafide	-	ASV5, MLAAD	AudioCaps, TUTASC, TUTSED, UrbanSound, VGGSound
spoof_spoof	-	ASV5, MLAAD	EnvSDD, VcapAV

🏁 eval & test set

Label	Original Source	Speech Source	Environmental Sound Source
original	AudioCaps, VggSound	-	-
bonafide_bonafide	-	CommonVoice, LibriTTS, english-conversation-corpus	AudioCaps, TUTASC, TUTSED, UrbanSound, VGGSound
bonafide_spoof	-	CommonVoice, LibriTTS	EnvSDD, VcapAV, New Generated
spoof_bonafide	-	ASV5, MLAAD, New Generated	AudioCaps, TUTASC, TUTSED, UrbanSound, VGGSound
spoof_spoof	-	ASV5, MLAAD, New Generated	EnvSDD, VcapAV, New Generated

7. Data Splits

The dataset is divided into four splits:

Training set: used for model training
Validation set: used for validation and hyper-parameter tuning
Eval and test sets: used for final performance reporting

The training and validation sets share the same data sources and have similar class distributions.

The eval and test sets share the same data sources and have similar class distributions. Both also contain newly generated audio that is unseen in the training and validation sets.

The number and proportion of audio clips per class in each split are as follows:

📊 train set (Total: 175361)

Label	Count	Ratio
bonafide_spoof	50361	28.72%
original	48639	27.74%
spoof_spoof	29413	16.77%
bonafide_bonafide	25189	14.36%
spoof_bonafide	21759	12.41%

📊 val set (Total: 24864)

Label	Count	Ratio
bonafide_spoof	8071	32.46%
original	6939	27.91%
spoof_spoof	4657	18.73%
bonafide_bonafide	2784	11.20%
spoof_bonafide	2413	9.70%

📊 eval set (Total: 27605)

Label	Count	Ratio
bonafide_spoof	7655	27.73%
original	7455	27.01%
spoof_spoof	5945	21.54%
bonafide_bonafide	3570	12.93%
spoof_bonafide	2980	10.80%

📊 test set (Total: 27603)

Label	Count	Ratio
bonafide_spoof	7672	27.79%
original	7415	26.86%
spoof_spoof	5894	21.35%
bonafide_bonafide	3635	13.17%
spoof_bonafide	2987	10.82%
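The totals and ratios in the tables above, and the overall duration quoted in the introduction, can be cross-checked with a few lines of arithmetic. The sketch below copies the counts from the tables and uses the fixed 4-second clip length; the helper names are hypothetical.

```python
# Per-class counts copied from the four tables above.
COUNTS = {
    "train": {"bonafide_spoof": 50361, "original": 48639, "spoof_spoof": 29413,
              "bonafide_bonafide": 25189, "spoof_bonafide": 21759},
    "val": {"bonafide_spoof": 8071, "original": 6939, "spoof_spoof": 4657,
            "bonafide_bonafide": 2784, "spoof_bonafide": 2413},
    "eval": {"bonafide_spoof": 7655, "original": 7455, "spoof_spoof": 5945,
             "bonafide_bonafide": 3570, "spoof_bonafide": 2980},
    "test": {"bonafide_spoof": 7672, "original": 7415, "spoof_spoof": 5894,
             "bonafide_bonafide": 3635, "spoof_bonafide": 2987},
}
CLIP_SECONDS = 4  # every clip has a fixed length of 4 seconds


def split_total(split):
    """Number of clips in one split."""
    return sum(COUNTS[split].values())


def ratio_percent(split, label):
    """Share of one class within a split, in percent, rounded to 2 decimals."""
    return round(100 * COUNTS[split][label] / split_total(split), 2)


def total_hours():
    """Total duration of all splits, in hours."""
    clips = sum(split_total(split) for split in COUNTS)
    return clips * CLIP_SECONDS / 3600
```

Summing the four splits gives 255,433 clips, and 255,433 × 4 s ÷ 3600 is roughly 283.8 hours, which matches the "over 250k samples, approximately 283 hours" figure in the introduction.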

8. Metadata

🗂️ Metadata is provided in CSV format, with one row per audio file. Each field describes the source, generation process, and mixing configuration of the corresponding composite spoofing sample.
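As a starting point for working with these files, the sketch below reads one metadata CSV with the standard library and counts samples per class. It assumes the file has a header row and a column named `label`, as in the field list in this section; the function names are hypothetical. Rows without a label (for example in the eval set metadata, which only contains file names) are skipped when counting.

```python
import csv
from collections import Counter


def read_metadata(csv_path):
    """Read a metadata CSV into a list of dicts, one per audio file."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))


def label_counts(rows):
    """Count rows per class label, ignoring rows that carry no label."""
    return Counter(row["label"] for row in rows if row.get("label"))
```

For example, `label_counts(read_metadata("development/metadata/train.csv"))` should reproduce the training set counts listed in the Data Splits section, provided the path matches the local layout.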

The meaning of each field in Metadata is as follows:

audio_path: Relative path to the final mixed audio file used for training or evaluation.
label: Class label of the audio sample. One of: original, bonafide_bonafide, spoof_bonafide, bonafide_spoof, spoof_spoof.
split: Dataset split indicator: train, val, eval, or test.
original_audio_source: Source dataset of the original audio, e.g., AudioCaps.

speech_path: Path to the speech signal used as the speech component in the mixture.
speech_source: Source dataset of the speech signal, e.g., ASV5, CommonVoice.
speech_generation_mothed: Generation method used to produce the spoofed speech signal, e.g., TTS (text-to-speech), VC (voice conversion).
speech_generation_source: Dataset used to generate the spoofed speech. For example, if the speech is generated by TTS, this is the dataset that supplied the input text.
speech_generation_model: Model used to generate the spoofed speech.

env_path: Path to the environmental sound used as the environmental sound component in the mixture.
env_source: Source dataset of the environmental sound, e.g., EnvSDD, VcapAV.
env_generation_mothed: Method used to generate the spoofed environmental sound, e.g., TTA (text-to-audio).
env_generation_source: Dataset used to generate the spoofed environmental sound. For example, if the sound is generated by TTA, this is the dataset that supplied the input text.
env_generation_model: Model used to generate the spoofed environmental sound.

mix_target_snr: Target signal-to-noise ratio (SNR, in dB) used when mixing the speech and environmental sound.
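To illustrate what a target SNR means for a mixture, the sketch below scales the environmental sound so that the speech-to-environment power ratio matches a given value in dB, then adds the two signals. This is an assumption-laden illustration, not the procedure used to build the dataset: it treats speech as the signal and the environmental sound as the noise, rescales only the environmental sound, and applies no normalization or clipping. The function names are hypothetical, and plain Python lists stand in for audio arrays.

```python
import math


def mean_power(samples):
    """Average power of a signal given as a sequence of float samples."""
    return sum(v * v for v in samples) / len(samples)


def env_gain(speech, env, snr_db):
    """Gain for env so that 10*log10(P_speech / P_scaled_env) equals snr_db."""
    p_speech, p_env = mean_power(speech), mean_power(env)
    if p_env == 0:
        return 0.0  # silent environment: nothing to scale
    return math.sqrt(p_speech / (p_env * 10 ** (snr_db / 10)))


def mix_at_snr(speech, env, snr_db):
    """Add the rescaled environmental sound to the speech, sample by sample."""
    if len(speech) != len(env):
        raise ValueError("speech and env must have the same length")
    gain = env_gain(speech, env, snr_db)
    return [s + gain * e for s, e in zip(speech, env)]
```

With this convention, a higher `snr_db` yields a quieter environmental component relative to the speech.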

9. Citation

📚 If you use CompSpoof V2 in your research, please cite the corresponding paper:

@dataset{zhang2025esdd2compspoofv2,
  title     = {ESDD2-CompSpoof-V2: A Composite Spoofing Dataset for Speech Anti-Spoofing},
  author    = {Zhang, Xueping and Li, Ming},
  year      = {2025},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/datasets/XuepingZhang/ESDD2-CompSpoof-V2}
}

10. License 🔏

Part of this dataset is derived by combining and mixing audio samples from multiple publicly available datasets.

The MLAAD and VCapAV datasets are released under the CC BY-NC 4.0 license.
The LibriTTS, EnvSDD and VGGSound datasets are released under the CC BY 4.0 license.
The Common Voice dataset is released under the Creative Commons CC0 1.0 Universal license.
The ASVspoof 5 dataset is released under the ODC-By license.
The english-conversation-corpus dataset is released under the GPLv3 license.
The AudioCaps dataset is released under the MIT license.
The TUTASC, TUTSED and UrbanSound datasets are released under non-commercial licenses.

Users must comply with the license terms of each original dataset. The authors do not claim ownership of the original audio content. Because it includes datasets licensed under CC BY-NC 4.0, this dataset is released under the CC BY-NC 4.0 license.

11. Contact Information

For questions, issues, or collaboration inquiries, please contact:

✉️ Email: xueping.zhang@dukekunshan.edu.cn