CompSpoof V2 Dataset

1. Introduction

CompSpoof V2 is a dataset designed for component-level anti-spoofing detection research, where either the speech or the environmental sound component (or both) may be spoofed.

📊 CompSpoof V2 contains over 250k audio samples, with a total duration of approximately 283 hours. ⏱️ Each audio sample has a fixed length of 4 seconds and is provided at multiple sampling rates, enabling a more faithful simulation of real-world acoustic and system-level variations.

Building upon the CompSpoof dataset, CompSpoof V2 significantly expands the diversity of attack sources, environmental sounds, and mixing strategies. ✨ In addition, newly generated audio samples are distributed across the eval and test sets and are specifically designed to serve as detection data under unseen conditions.

🤗 CompSpoof V2 Download Link: https://huggingface.co/datasets/XuepingZhang/ESDD2-CompSpoof-V2/

💻 Baseline code: https://github.com/XuepingZhang/ESDD2-Baseline


2. Download and Setup

Step 1: Visit https://huggingface.co/datasets/XuepingZhang/ESDD2-CompSpoof-V2

Step 2: Read and acknowledge the license (you need to click the 'acknowledge license' button)

Step 3: Install huggingface_hub and log in

pip install "huggingface_hub[hf_transfer]"

huggingface-cli login     # input your huggingface login token

Step 4: Download and extract the archives

hf download XuepingZhang/ESDD2-CompSpoof-V2 --repo-type dataset --local-dir ./CompSpoofV2

cd CompSpoofV2

tar -zxvf eval.tar.gz

cat development.tar.gz.part_* > development.tar.gz

tar -zxvf development.tar.gz
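For environments without `cat` and `tar` (for example, some Windows setups), the reassembly and extraction in Step 4 can be sketched in Python using only the standard library. The helper names `join_parts` and `extract_archive` are hypothetical and are not part of the dataset tooling; the sketch assumes the part files sort correctly by name, which is the same order the shell glob uses.

```python
import glob
import shutil
import tarfile
from pathlib import Path


def join_parts(part_pattern: str, output_path: str) -> int:
    """Concatenate split archive parts in sorted name order.

    Returns the number of parts that were joined.
    """
    parts = sorted(glob.glob(part_pattern))
    if not parts:
        raise FileNotFoundError(f"no files match {part_pattern!r}")
    with open(output_path, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                # Stream each part so large archives are never held in memory.
                shutil.copyfileobj(src, out)
    return len(parts)


def extract_archive(archive_path: str, dest_dir: str) -> None:
    """Extract a gzip-compressed tar archive into dest_dir."""
    Path(dest_dir).mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(dest_dir)
```

Run from inside the download directory, `join_parts("development.tar.gz.part_*", "development.tar.gz")` followed by `extract_archive("development.tar.gz", ".")` would mirror the two shell commands above.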

3. Audio Class Description and Samples

Below are audio samples from the CompSpoof V2 dataset. For each class, we provide the mixed/original audio, along with the speech and environment sources.

Class 0: Original

Label: original

Description: Original bona fide speech with its corresponding environmental audio, without mixing

Samples: Original

Class 1: Bona Fide Speech + Bona Fide Environment

Label: bonafide_bonafide

Description: Bona fide speech mixed with bona fide environmental audio from another source

Samples: Mixed, Speech, Environment

Class 2: Spoofed Speech + Bona Fide Environment

Label: spoof_bonafide

Description: Spoofed speech mixed with bona fide environmental audio

Samples: Mixed, Speech, Environment

Class 3: Bona Fide Speech + Spoofed Environment

Label: bonafide_spoof

Description: Bona fide speech mixed with spoofed environmental audio

Samples: Mixed, Speech, Environment

Class 4: Spoofed Speech + Spoofed Environment

Label: spoof_spoof

Description: Spoofed speech mixed with spoofed environmental audio

Samples: Mixed, Speech, Environment
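Because each label encodes the speech component first and the environment component second, a five-class label can be unpacked into per-component ground truth. The following is a minimal sketch of that mapping; `ComponentTruth` and `decode_label` are hypothetical names, not part of the baseline code.

```python
from typing import NamedTuple

# Class index to label, as listed in this section.
CLASS_LABELS = {
    0: "original",
    1: "bonafide_bonafide",
    2: "spoof_bonafide",
    3: "bonafide_spoof",
    4: "spoof_spoof",
}


class ComponentTruth(NamedTuple):
    is_mixed: bool        # False only for the `original` class
    speech_spoofed: bool  # True when the speech component is spoofed
    env_spoofed: bool     # True when the environmental component is spoofed


def decode_label(label: str) -> ComponentTruth:
    """Split a class label into per-component flags (speech first, then environment)."""
    if label == "original":
        return ComponentTruth(is_mixed=False, speech_spoofed=False, env_spoofed=False)
    speech, env = label.split("_")
    for part in (speech, env):
        if part not in ("bonafide", "spoof"):
            raise ValueError(f"unexpected label: {label!r}")
    return ComponentTruth(True, speech == "spoof", env == "spoof")
```

For example, `decode_label("spoof_bonafide")` marks the speech component as spoofed and the environment component as bona fide.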

4. CompSpoof V2 vs. CompSpoof

CompSpoof is our previously released dataset for component-level spoofing detection. Building upon this foundation, we introduce CompSpoof V2, a substantially upgraded version with an expanded task formulation. The key differences between CompSpoof and CompSpoof V2 are summarized below.

| Aspect | CompSpoof | CompSpoof V2 |
|---|---|---|
| Data volume | 2.5k audio clips, about 7 hours | 📊 More than 250k audio clips, about 283 hours |
| Data sources | SSTC, ASV5, VggSound, VcapAV, Common Voice | AudioCaps, VggSound, CommonVoice, LibriTTS, english-conversation-corpus, ASV5, MLAAD, TUTASC, TUTSED, UrbanSound, VGGSound, EnvSDD, VcapAV |
| Duration | Ranges from 5 to 21 seconds | ⏱️ 4 seconds per audio clip |
| Newly generated audio | ❌ | ✅ |

5. Dataset Structure

The dataset follows a hierarchical directory structure organized by data split:

CompSpoof
├── development                     # training and validation data, including audio sources
│   ├── env_source                  # environmental sound audio used as the environmental sound component in the mixture
│   ├── metadata                    # metadata of the development set
│   │   ├── train.csv
│   │   └── val.csv
│   ├── mixed_audio                 # mixed audio files, which **don't** belong to the `original` class
│   ├── original_audio              # audio files belonging to the `original` class
│   └── speech_sources              # speech audio used as the speech component in the mixture
│
├── eval                            # eval set data, without audio sources
│   ├── audio                       # audio files
│   └── metadata                    # metadata of the eval set, which contains file names only
│       └── eval.csv
│
├── eval_source (Released later)    # eval set data, including audio sources
│   ├── env_sources                 # environmental sound audio used as the environmental sound component in the mixture
│   ├── metadata                    # metadata of the eval set, with full annotation
│   │   └── eval.csv
│   ├── mixed_audio                 # mixed audio files, which **don't** belong to the `original` class
│   ├── original_audio              # audio files belonging to the `original` class
│   └── speech_sources              # speech audio used as the speech component in the mixture
│
├── test                            # test set data, without audio sources
│   ├── audio                       # audio files
│   └── metadata                    # metadata of the test set, which contains file names only
│       └── test.csv
│
└── test_source (Released later)    # test set data, including audio sources
    ├── env_sources                 # environmental sound audio used as the environmental sound component in the mixture
    ├── metadata                    # metadata of the test set, with full annotation
    │   └── test.csv
    ├── mixed_audio                 # mixed audio files, which **don't** belong to the `original` class
    ├── original_audio              # audio files belonging to the `original` class
    └── speech_sources              # speech audio used as the speech component in the mixture
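
After extraction, a quick sanity check of the layout can catch an incomplete download. The sketch below only covers the splits that are not marked "Released later", and it copies the subdirectory names from the tree above exactly as written; `missing_subdirs` is a hypothetical helper, and the actual on-disk names may differ from this listing.

```python
from pathlib import Path

# Subdirectories per released split, copied from the directory tree above.
EXPECTED_SUBDIRS = {
    "development": ["env_source", "metadata", "mixed_audio",
                    "original_audio", "speech_sources"],
    "eval": ["audio", "metadata"],
    "test": ["audio", "metadata"],
}


def missing_subdirs(root: str, split: str) -> list:
    """Return the expected subdirectories of `split` that are absent under `root`."""
    split_dir = Path(root) / split
    return [name for name in EXPECTED_SUBDIRS[split]
            if not (split_dir / name).is_dir()]
```

An empty list from `missing_subdirs("./CompSpoofV2", "development")` would suggest the development archive was extracted completely.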

6. Audio Source

The audio sources for each category are as follows:

πŸ‹οΈ train & val set

Label Original Source Speech Source Environmental Sound Source
original AudioCaps, VggSound - -
bonafide_bonafide - CommonVoice, LibriTTS, english-conversation-corpus AudioCaps, TUTASC, TUTSED, UrbanSound, VGGSound
bonafide_spoof - CommonVoice, LibriTTS EnvSDD, VcapAV
spoof_bonafide - ASV5, MLAAD AudioCaps, TUTASC, TUTSED, UrbanSound, VGGSound
spoof_spoof - ASV5, MLAAD EnvSDD, VcapAV

🏁 eval & test set

| Label | Original Source | Speech Source | Environmental Sound Source |
|---|---|---|---|
| original | AudioCaps, VggSound | - | - |
| bonafide_bonafide | - | CommonVoice, LibriTTS, english-conversation-corpus | AudioCaps, TUTASC, TUTSED, UrbanSound, VGGSound |
| bonafide_spoof | - | CommonVoice, LibriTTS | EnvSDD, VcapAV, New Generated |
| spoof_bonafide | - | ASV5, MLAAD, New Generated | AudioCaps, TUTASC, TUTSED, UrbanSound, VGGSound |
| spoof_spoof | - | ASV5, MLAAD, New Generated | EnvSDD, VcapAV, New Generated |
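
To see which sources appear only in the eval and test sets, the two tables can be compared as sets. This is an illustrative sketch that transcribes the mixed-class rows by hand; the dictionary and function names are hypothetical.

```python
BONAFIDE_ENV = {"AudioCaps", "TUTASC", "TUTSED", "UrbanSound", "VGGSound"}
SPOOF_ENV = {"EnvSDD", "VcapAV"}
SPOOF_SPEECH = {"ASV5", "MLAAD"}

# label -> (speech sources, environmental sound sources)
TRAIN_VAL = {
    "bonafide_bonafide": ({"CommonVoice", "LibriTTS", "english-conversation-corpus"}, BONAFIDE_ENV),
    "bonafide_spoof": ({"CommonVoice", "LibriTTS"}, SPOOF_ENV),
    "spoof_bonafide": (SPOOF_SPEECH, BONAFIDE_ENV),
    "spoof_spoof": (SPOOF_SPEECH, SPOOF_ENV),
}
EVAL_TEST = {
    "bonafide_bonafide": ({"CommonVoice", "LibriTTS", "english-conversation-corpus"}, BONAFIDE_ENV),
    "bonafide_spoof": ({"CommonVoice", "LibriTTS"}, SPOOF_ENV | {"New Generated"}),
    "spoof_bonafide": (SPOOF_SPEECH | {"New Generated"}, BONAFIDE_ENV),
    "spoof_spoof": (SPOOF_SPEECH | {"New Generated"}, SPOOF_ENV | {"New Generated"}),
}


def unseen_sources() -> dict:
    """For each label, return (speech, env) sources present in eval/test but not in train/val."""
    result = {}
    for label, (speech, env) in EVAL_TEST.items():
        train_speech, train_env = TRAIN_VAL[label]
        result[label] = (speech - train_speech, env - train_env)
    return result
```

Under this transcription, the only unseen source is "New Generated", and it affects exactly the spoofed components of the three classes that contain spoofed audio.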

7. Data Splits

The dataset is divided into four splits: train, val, eval, and test.

The training and validation sets share the same data sources and follow similar class distributions.

The eval and test sets likewise share the same data sources and follow similar class distributions. Both also contain newly generated audio that does not appear in the training or validation sets.

The number and proportion of audio clips per class in each split are as follows:

📊 train set (Total: 175361)

| Label | Count | Ratio |
|---|---|---|
| bonafide_spoof | 50361 | 28.72% |
| original | 48639 | 27.74% |
| spoof_spoof | 29413 | 16.77% |
| bonafide_bonafide | 25189 | 14.36% |
| spoof_bonafide | 21759 | 12.41% |

📊 val set (Total: 24864)

| Label | Count | Ratio |
|---|---|---|
| bonafide_spoof | 8071 | 32.46% |
| original | 6939 | 27.91% |
| spoof_spoof | 4657 | 18.73% |
| bonafide_bonafide | 2784 | 11.20% |
| spoof_bonafide | 2413 | 9.70% |

📊 eval set (Total: 27605)

| Label | Count | Ratio |
|---|---|---|
| bonafide_spoof | 7655 | 27.73% |
| original | 7455 | 27.01% |
| spoof_spoof | 5945 | 21.54% |
| bonafide_bonafide | 3570 | 12.93% |
| spoof_bonafide | 2980 | 10.80% |

📊 test set (Total: 27603)

| Label | Count | Ratio |
|---|---|---|
| bonafide_spoof | 7672 | 27.79% |
| original | 7415 | 26.86% |
| spoof_spoof | 5894 | 21.35% |
| bonafide_bonafide | 3635 | 13.17% |
| spoof_bonafide | 2987 | 10.82% |
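
As a cross-check, the per-class counts above can be summed to reproduce the split totals, the ratios, and the overall duration quoted in the introduction. The sketch assumes every clip is exactly 4 seconds long, as stated earlier; the names `SPLIT_COUNTS`, `ratios`, and `total_hours` are illustrative.

```python
CLIP_SECONDS = 4  # fixed clip length stated in the introduction

# Per-class clip counts, copied from the tables above.
SPLIT_COUNTS = {
    "train": {"bonafide_spoof": 50361, "original": 48639, "spoof_spoof": 29413,
              "bonafide_bonafide": 25189, "spoof_bonafide": 21759},
    "val": {"bonafide_spoof": 8071, "original": 6939, "spoof_spoof": 4657,
            "bonafide_bonafide": 2784, "spoof_bonafide": 2413},
    "eval": {"bonafide_spoof": 7655, "original": 7455, "spoof_spoof": 5945,
             "bonafide_bonafide": 3570, "spoof_bonafide": 2980},
    "test": {"bonafide_spoof": 7672, "original": 7415, "spoof_spoof": 5894,
             "bonafide_bonafide": 3635, "spoof_bonafide": 2987},
}


def ratios(split: str) -> dict:
    """Percentage of each class within a split, rounded to two decimals."""
    counts = SPLIT_COUNTS[split]
    total = sum(counts.values())
    return {label: round(100 * n / total, 2) for label, n in counts.items()}


def total_hours() -> float:
    """Total duration of all splits in hours, assuming fixed-length clips."""
    clips = sum(sum(counts.values()) for counts in SPLIT_COUNTS.values())
    return clips * CLIP_SECONDS / 3600
```

The four splits sum to 255,433 clips, which at 4 seconds each comes to roughly 283.8 hours, consistent with the "over 250k" and "approximately 283 hours" figures.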

8. Metadata

πŸ—‚οΈ Metadata is provided in CSV format, with one row per audio file. Each field describes the source, generation process, and mixing configuration of the corresponding composite spoofing sample.

The meaning of each field in Metadata is as follows:





9. Citation

📚 If you use CompSpoof V2 in your research, please cite the corresponding paper:

@dataset{zhang2025esdd2compspoofv2,
  title     = {ESDD2-CompSpoof-V2: A Composite Spoofing Dataset for Speech Anti-Spoofing},
  author    = {Zhang, Xueping and Li, Ming},
  year      = {2025},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/datasets/XuepingZhang/ESDD2-CompSpoof-V2}
}


10. License

This dataset is a derived dataset constructed by combining and mixing audio samples from multiple publicly available datasets.

Users must comply with the license terms of each original dataset. The authors do not claim ownership of the original audio content. Because it includes datasets licensed under CC BY-NC 4.0, this dataset is also released under the CC BY-NC 4.0 license.


11. Contact Information

For questions, issues, or collaboration inquiries, please contact: