CompSpoof V2 is a dataset designed for component-level anti-spoofing detection research, where either the speech or the environmental sound component (or both) may be spoofed.
CompSpoof V2 contains over 250k audio samples, with a total duration of approximately 283 hours. Each audio sample has a fixed length of 4 seconds and is provided at multiple sampling rates, enabling a more faithful simulation of real-world acoustic and system-level variations.
Building upon the CompSpoof dataset, CompSpoof V2 significantly expands the diversity of attack sources, environmental sounds, and mixing strategies. In addition, newly generated audio samples are distributed across the test set and are specifically designed to serve as detection data under unseen conditions.
CompSpoof V2 Download Link: https://huggingface.co/datasets/XuepingZhang/ESDD2-CompSpoof-V2/
Baseline code: https://github.com/XuepingZhang/ESDD2-Baseline
Step 1: Visit https://huggingface.co/datasets/XuepingZhang/ESDD2-CompSpoof-V2
Step 2: Read and acknowledge the license (you need to click the "acknowledge license" button)
Step 3: Install huggingface_hub and log in
pip install "huggingface_hub[hf_transfer]"
huggingface-cli login # input your Hugging Face login token
Step 4: Download and extract
hf download XuepingZhang/ESDD2-CompSpoof-V2 --repo-type dataset --local-dir ./CompSpoofV2
cd CompSpoofV2
tar -zxvf eval.tar.gz
cat development.tar.gz.part_* > development.tar.gz
tar -zxvf development.tar.gz
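The `cat` and `tar` steps above can also be done from Python, which may help on systems without those tools. This is a minimal sketch using only the standard library; the helper names `join_parts` and `extract` are hypothetical, and the part-file naming is taken from the `development.tar.gz.part_*` pattern shown above.

```python
import shutil
import tarfile
from pathlib import Path


def join_parts(parts_dir: Path, stem: str = "development.tar.gz") -> Path:
    """Concatenate `<stem>.part_*` files in sorted name order, like `cat` with a glob."""
    parts = sorted(parts_dir.glob(f"{stem}.part_*"))
    if not parts:
        raise FileNotFoundError(f"no files matching {stem}.part_* in {parts_dir}")
    target = parts_dir / stem
    with target.open("wb") as out:
        for part in parts:
            with part.open("rb") as src:
                shutil.copyfileobj(src, out)  # streams in chunks, no full read into memory
    return target


def extract(archive: Path, dest: Path) -> None:
    """Unpack a gzip-compressed tar archive into `dest`."""
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(dest)
```

With `root = Path("./CompSpoofV2")`, the shell commands above would correspond to `extract(root / "eval.tar.gz", root)` followed by `extract(join_parts(root), root)`.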
Below are audio samples from the CompSpoof V2 dataset. For each class, we provide the mixed/original audio, along with the speech and environment sources.
Label: original
Description: Original bona fide speech and corresponding environment audio without mixing
Label: bonafide_bonafide
Description: Bona fide speech mixed with another bona fide environmental audio
Label: spoof_bonafide
Description: Spoof speech mixed with bona fide environmental audio
Label: bonafide_spoof
Description: Bona fide speech mixed with spoof environmental audio
Label: spoof_spoof
Description: Spoof speech mixed with spoof environmental audio
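The four mixed labels follow a `<speech>_<environment>` naming pattern, while `original` denotes unmixed bona fide audio. A small sketch of decoding a label into per-component flags is given below; the `Components` type and `decode_label` function are hypothetical names introduced for illustration and are not part of the dataset or baseline code.

```python
from typing import NamedTuple


class Components(NamedTuple):
    mixed: bool            # False only for the `original` class
    speech_spoofed: bool   # True if the speech component is spoofed
    env_spoofed: bool      # True if the environmental component is spoofed


def decode_label(label: str) -> Components:
    """Split a class label into component-level spoofing flags."""
    if label == "original":
        # Unmixed audio: both components are bona fide.
        return Components(mixed=False, speech_spoofed=False, env_spoofed=False)
    parts = label.split("_")
    if len(parts) != 2 or any(p not in ("bonafide", "spoof") for p in parts):
        raise ValueError(f"unknown label: {label!r}")
    speech, env = parts
    return Components(mixed=True,
                      speech_spoofed=(speech == "spoof"),
                      env_spoofed=(env == "spoof"))
```

For example, `decode_label("bonafide_spoof")` marks only the environmental component as spoofed.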
The CompSpoof dataset is our previously released dataset for component-level spoofing detection. Building upon this foundation, we introduce CompSpoof V2, a substantially upgraded version with an expanded task formulation. The key differences between CompSpoof and CompSpoof V2 are summarized below.
| Aspect | CompSpoof | CompSpoof V2 |
|---|---|---|
| Data volume | 2.5k audio clips, about 7 hours | More than 250k audio clips, about 283 hours |
| Data sources | SSTC, ASV5, VggSound, VcapAV, Common Voice | AudioCaps, VGGSound, CommonVoice, LibriTTS, english-conversation-corpus, ASV5, MLAAD, TUTASC, TUTSED, UrbanSound, EnvSDD, VcapAV |
| Duration | 5 to 21 seconds per audio clip | 4 seconds per audio clip |
| Newly generated audio | No | Yes |
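The fixed 4-second length and the varying sampling rates can be checked per file. The sketch below assumes the clips are PCM WAV files readable by Python's standard `wave` module; the file format is not stated here, so that is an assumption, and `wav_info` and `is_four_seconds` are hypothetical helper names.

```python
import wave
from typing import Tuple


def wav_info(path: str) -> Tuple[int, float]:
    """Return (sampling rate in Hz, duration in seconds) of a PCM WAV file."""
    with wave.open(str(path), "rb") as wav:
        rate = wav.getframerate()
        return rate, wav.getnframes() / rate


def is_four_seconds(path: str, tolerance: float = 0.01) -> bool:
    """Check that a clip lasts 4 seconds, within a small tolerance."""
    _, duration = wav_info(path)
    return abs(duration - 4.0) <= tolerance
```

Collecting the first element of `wav_info` over a directory would show which sampling rates are present.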
The dataset follows a hierarchical directory structure organized by data split:
CompSpoof
├── development  # training and val data, including audio source
│   ├── env_source  # environmental sound audio used as the environmental sound component in the mixture
│   ├── metadata  # metadata of development set
│   │   ├── train.csv
│   │   └── val.csv
│   ├── mixed_audio  # mixed audio files, which **don't** belong to the `original` class
│   ├── original_audio  # audio files belonging to the `original` class
│   └── speech_sources  # speech audio used as the speech component in the mixture
│
├── eval  # eval set data, without audio source
│   ├── audio  # audio files
│   └── metadata  # metadata of eval set, which only has file names
│       └── eval.csv
│
├── eval_source (Released later)  # eval set data, including audio source
│   ├── env_sources  # environmental sound audio used as the environmental sound component in the mixture
│   ├── metadata  # metadata of eval set, with full annotation
│   │   └── eval.csv
│   ├── mixed_audio  # mixed audio files, which **don't** belong to the `original` class
│   ├── original_audio  # audio files belonging to the `original` class
│   └── speech_sources  # speech audio used as the speech component in the mixture
│
├── test  # test set data, without audio source
│   ├── audio  # audio files
│   └── metadata  # metadata of test set, which only has file names
│       └── test.csv
│
└── test_source (Released later)  # test set audio data, including audio source
    ├── env_sources  # environmental sound audio used as the environmental sound component in the mixture
    ├── metadata  # metadata of test set, with full annotation
    │   └── test.csv
    ├── mixed_audio  # mixed audio files, which **don't** belong to the `original` class
    ├── original_audio  # audio files belonging to the `original` class
    └── speech_sources  # speech audio used as the speech component in the mixture
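Given this layout, each split's metadata file can be located and read with the standard `csv` module. The sketch below assumes the CSV files have a header row and, for the annotated splits, a `label` column; `METADATA_FILES`, `read_metadata`, and `label_counts` are hypothetical names, and the relative paths simply mirror the tree above.

```python
import csv
from collections import Counter
from pathlib import Path

# Assumed mapping from split name to metadata file, following the directory tree.
METADATA_FILES = {
    "train": Path("development/metadata/train.csv"),
    "val": Path("development/metadata/val.csv"),
    "eval": Path("eval/metadata/eval.csv"),
    "test": Path("test/metadata/test.csv"),
}


def read_metadata(root, split):
    """Return the rows of one split's metadata CSV as a list of dicts."""
    path = Path(root) / METADATA_FILES[split]
    with path.open(newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))


def label_counts(rows):
    """Count rows per class label, skipping rows without a label."""
    return Counter(row["label"] for row in rows if row.get("label"))
```

Rows from the `eval` and `test` metadata carry only file names until the annotated versions are released, so `label_counts` would return an empty counter for them.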
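The audio sources for each category are as follows. The first table covers the training and validation sets; the second covers the eval and test sets, which additionally include newly generated audio.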
| Label | Original Source | Speech Source | Environmental Sound Source |
|---|---|---|---|
| original | AudioCaps, VggSound | - | - |
| bonafide_bonafide | - | CommonVoice, LibriTTS, english-conversation-corpus | AudioCaps, TUTASC, TUTSED, UrbanSound, VGGSound |
| bonafide_spoof | - | CommonVoice, LibriTTS | EnvSDD, VcapAV |
| spoof_bonafide | - | ASV5, MLAAD | AudioCaps, TUTASC, TUTSED, UrbanSound, VGGSound |
| spoof_spoof | - | ASV5, MLAAD | EnvSDD, VcapAV |

| Label | Original Source | Speech Source | Environmental Sound Source |
|---|---|---|---|
| original | AudioCaps, VggSound | - | - |
| bonafide_bonafide | - | CommonVoice, LibriTTS, english-conversation-corpus | AudioCaps, TUTASC, TUTSED, UrbanSound, VGGSound |
| bonafide_spoof | - | CommonVoice, LibriTTS | EnvSDD, VcapAV, newly generated |
| spoof_bonafide | - | ASV5, MLAAD, newly generated | AudioCaps, TUTASC, TUTSED, UrbanSound, VGGSound |
| spoof_spoof | - | ASV5, MLAAD, newly generated | EnvSDD, VcapAV, newly generated |
The dataset is divided into three standard splits:
The training and validation sets have the same data sources and class distribution.
The eval and test sets share the same data sources and class distribution. Both also contain newly generated audio that is unseen in the training and validation sets.
The number and proportion of audio samples for each category in each set are as follows:
| Label | Count | Ratio |
|---|---|---|
| bonafide_spoof | 50361 | 28.72% |
| original | 48639 | 27.74% |
| spoof_spoof | 29413 | 16.77% |
| bonafide_bonafide | 25189 | 14.36% |
| spoof_bonafide | 21759 | 12.41% |
| Label | Count | Ratio |
|---|---|---|
| bonafide_spoof | 8071 | 32.46% |
| original | 6939 | 27.91% |
| spoof_spoof | 4657 | 18.73% |
| bonafide_bonafide | 2784 | 11.20% |
| spoof_bonafide | 2413 | 9.70% |
| Label | Count | Ratio |
|---|---|---|
| bonafide_spoof | 7655 | 27.73% |
| original | 7455 | 27.01% |
| spoof_spoof | 5945 | 21.54% |
| bonafide_bonafide | 3570 | 12.93% |
| spoof_bonafide | 2980 | 10.80% |
| Label | Count | Ratio |
|---|---|---|
| bonafide_spoof | 7672 | 27.79% |
| original | 7415 | 26.86% |
| spoof_spoof | 5894 | 21.35% |
| bonafide_bonafide | 3635 | 13.17% |
| spoof_bonafide | 2987 | 10.82% |
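As a quick consistency check, the counts in the four tables above can be summed and converted to hours using the fixed 4-second clip length. The sketch below only restates numbers from those tables; the tables are indexed in the order they appear.

```python
CLIP_SECONDS = 4

# Per-class counts copied from the four tables above, in order of appearance.
table_counts = [
    [50361, 48639, 29413, 25189, 21759],
    [8071, 6939, 4657, 2784, 2413],
    [7655, 7455, 5945, 3570, 2980],
    [7672, 7415, 5894, 3635, 2987],
]

table_totals = [sum(counts) for counts in table_counts]
total_clips = sum(table_totals)
total_hours = total_clips * CLIP_SECONDS / 3600

print(table_totals)           # [175361, 24864, 27605, 27603]
print(total_clips)            # 255433
print(round(total_hours, 1))  # 283.8
```

The result agrees with the stated figures of over 250k clips and approximately 283 hours.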
Metadata is provided in CSV format, with one row per audio file. Each field describes the source, generation process, and mixing configuration of the corresponding composite spoofing sample.
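The mixing configuration is recorded as a target SNR in dB (the `mix_target_snr` field). As a rough illustration of what mixing at a target SNR means, the sketch below scales the environmental signal so that the speech-to-environment power ratio matches the target, then adds the two. This is a generic technique, not necessarily the procedure used to build the dataset; treating speech as the "signal" and the environmental sound as the "noise" is an assumption, and `mean_power` and `mix_at_snr` are hypothetical names.

```python
import math
from typing import List, Sequence


def mean_power(samples: Sequence[float]) -> float:
    """Average squared amplitude of a signal."""
    return sum(v * v for v in samples) / len(samples)


def mix_at_snr(speech: Sequence[float], env: Sequence[float], snr_db: float) -> List[float]:
    """Add `env` to `speech` after scaling `env` to reach the target SNR in dB."""
    if len(speech) != len(env):
        raise ValueError("speech and env must have the same length")
    p_speech, p_env = mean_power(speech), mean_power(env)
    if p_speech == 0 or p_env == 0:
        raise ValueError("both signals must have non-zero power")
    # Solve 10 * log10(p_speech / (gain**2 * p_env)) = snr_db for gain.
    gain = math.sqrt(p_speech / (p_env * 10 ** (snr_db / 10)))
    return [s + gain * e for s, e in zip(speech, env)]
```

A lower `snr_db` yields a louder environmental component relative to the speech.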
The meaning of each field in Metadata is as follows:
audio_path: Relative path to the final mixed audio file used for training or evaluation.
label: Class label of the audio sample. Possible values: original, bonafide_bonafide, spoof_bonafide, bonafide_spoof, spoof_spoof.
split: Dataset split indicator: train, val, eval, test
original_audio_source: Source dataset of the original audio, e.g., AudioCaps.
speech_path: Path to the speech signal used as the speech component in the mixture.
speech_source: Source dataset of the speech signal, e.g., ASV5, CommonVoice.
speech_generation_mothed: Generation method used to produce the speech signal, e.g., TTS (text-to-speech), VC (voice conversion).
speech_generation_source: Source dataset used to generate the spoofed speech, e.g., when spoofed speech is generated by TTS, the dataset that supplied the input text.
speech_generation_model: Model used to generate the spoofed speech.
env_path: Path to the environmental sound used as the environmental sound component in the mixture.
env_source: Source dataset of the environmental sound, e.g., EnvSDD, VcapAV.
env_generation_mothed: Method used to generate the spoofed environmental sound, e.g., TTA (text-to-audio).
env_generation_source: Source dataset used to generate the spoofed environmental sound, e.g., when a spoofed environmental sound is generated by TTA, the dataset that supplied the input text.
env_generation_model: Model used to generate the spoofed environmental sound.
mix_target_snr: Target signal-to-noise ratio (SNR, in dB) used when mixing the speech and environmental sound.

If you use CompSpoof V2 in your research, please cite the corresponding paper:
@dataset{zhang2025esdd2compspoofv2,
title = {ESDD2-CompSpoof-V2: A Composite Spoofing Dataset for Speech Anti-Spoofing},
author = {Zhang, Xueping and Li, Ming},
year = {2025},
publisher = {Hugging Face},
url = {https://huggingface.co/datasets/XuepingZhang/ESDD2-CompSpoof-V2}
}
This dataset is a derived dataset constructed by combining and mixing audio samples from multiple publicly available datasets.
Users must comply with the license terms of each original dataset. The authors do not claim ownership of the original audio content. Because it includes datasets licensed under CC BY-NC 4.0, this dataset is released under the CC BY-NC 4.0 license.
For questions, issues, or collaboration inquiries, please contact: