CompSpoof Dataset

Introduction

The CompSpoof dataset is designed for studying component-level anti-spoofing, where either the speech or the environmental sound component (or both) may be spoofed.

📄 Paper on arXiv

🖥️ Code on GitHub

📢 NEWS!!

We have expanded the CompSpoof dataset into CompSpoofV2, which significantly broadens the diversity of attack sources, environmental sounds, and mixing strategies. ✨ In addition, newly generated audio samples are distributed across the test set and are specifically designed to serve as detection data under unseen conditions.

🤗 CompSpoofV2 Details & Download Link: https://xuepingzhang.github.io/CompSpoof-V2-Dataset/

Building upon the CompSpoofV2 dataset and the separation-enhanced joint learning framework, we launched the ICME 2026 Environment-Aware Speech and Sound Deepfake Detection Challenge (ESDD2). We warmly invite researchers from both academia and industry to participate in this challenge and explore robust, effective solutions for these critical deepfake detection tasks.

🖥️ ESDD2 Challenge website: https://sites.google.com/view/esdd-challenge/esdd-challenges/esdd-2/description

📥 Download

You can download the dataset from Hugging Face:
🤗 CompSpoof Download Link

🎧 Audio Examples

Below are audio samples from the CompSpoof dataset. For each class, we provide the mixed/original audio, along with the speech and environment sources.

Class 0 — Original

Label: original

Description: Original bona fide speech and corresponding environment audio without mixing

Original

Class 1 — Bona fide + Bona fide

Label: bonafide_bonafide

Description: Bona fide speech mixed with a separate bona fide environmental audio clip

Mixed	Speech	Environment

Class 2 — Spoofed Speech + Bona fide Environment

Label: spoof_bonafide

Description: Spoof speech mixed with bona fide environmental audio

Mixed	Speech	Environment

Class 3 — Bona fide Speech + Spoofed Environment

Label: bonafide_spoof

Description: Bona fide speech mixed with spoof environmental audio

Mixed	Speech	Environment

Class 4 — Spoofed Speech + Spoofed Environment

Label: spoof_spoof

Description: Spoof speech mixed with spoof environmental audio

Mixed	Speech	Environment

📂 Dataset Overview

Total samples: 2,500
Classes: 5 (500 samples per class)
Duration: 5–21 seconds
Sampling rate: 16 kHz
Partitioning: 70% train, 10% dev, 20% eval (stratified to preserve class balance)
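As a sanity check on the figures above, the following is a minimal sketch of a stratified 70/10/20 split. It is not the procedure used to build the released partitions: the function name, the seed, and the placeholder sample IDs are all assumptions made for illustration. If the split is exactly proportional, each class contributes 350 train, 50 dev, and 100 eval samples.

```python
import random

def stratified_split(items_by_class, percents=(70, 10, 20), seed=0):
    """Split each class separately so every partition keeps the class balance.

    Hypothetical helper: the seed and the shuffling strategy are assumptions,
    not the method used to create the official CompSpoof partitions.
    """
    rng = random.Random(seed)
    splits = {"train": [], "dev": [], "eval": []}
    for label, items in items_by_class.items():
        items = list(items)
        rng.shuffle(items)
        # Integer arithmetic avoids floating-point rounding surprises.
        n_train = len(items) * percents[0] // 100
        n_dev = len(items) * percents[1] // 100
        splits["train"] += [(x, label) for x in items[:n_train]]
        splits["dev"] += [(x, label) for x in items[n_train:n_train + n_dev]]
        # Whatever remains goes to eval.
        splits["eval"] += [(x, label) for x in items[n_train + n_dev:]]
    return splits

# Toy stand-in for the dataset: 5 classes with 500 placeholder IDs each.
toy = {c: [f"class{c}_{i:03d}" for i in range(500)] for c in range(5)}
sizes = {name: len(part) for name, part in stratified_split(toy).items()}
print(sizes)
```

Under these assumptions the three partitions hold 1,750, 250, and 500 samples, which sum to the stated total of 2,500.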

ID	Mixed	Speech	Environment	Class Label	Description
0	❌	Bona fide	Bona fide	original	Original bona fide speech and corresponding environment audio without mixing
1	✅	Bona fide	Bona fide	bonafide_bonafide	Bona fide speech mixed with a separate bona fide environmental audio clip
2	✅	Spoofed	Bona fide	spoof_bonafide	Spoof speech mixed with bona fide environmental audio
3	✅	Bona fide	Spoofed	bonafide_spoof	Bona fide speech mixed with spoof environmental audio
4	✅	Spoofed	Spoofed	spoof_spoof	Spoof speech mixed with spoof environmental audio
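For component-level training it can be convenient to turn each class label into separate speech and environment targets. The sketch below encodes the table above as a lookup; the label strings and IDs come from the table, while the names `CLASS_INFO`, `CLASS_IDS`, and `component_targets` are hypothetical and not part of any released code.

```python
# Class labels are copied from the table above; the tuple layout is an
# illustrative choice: (is_mixed, speech_is_spoofed, env_is_spoofed).
CLASS_INFO = {
    "original":          (False, False, False),
    "bonafide_bonafide": (True,  False, False),
    "spoof_bonafide":    (True,  True,  False),
    "bonafide_spoof":    (True,  False, True),
    "spoof_spoof":       (True,  True,  True),
}

# Dicts keep insertion order, so enumeration reproduces IDs 0 to 4.
CLASS_IDS = {label: i for i, label in enumerate(CLASS_INFO)}

def component_targets(label):
    """Return (speech_is_spoofed, env_is_spoofed) for a class label."""
    _, speech_spoofed, env_spoofed = CLASS_INFO[label]
    return speech_spoofed, env_spoofed

print(CLASS_IDS["bonafide_spoof"], component_targets("bonafide_spoof"))
```

Note that classes 0 and 1 share the same component targets and differ only in whether the audio was mixed.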

🗂️ Metadata

The dataset includes three metadata files: CompSpoof_train.txt, CompSpoof_dev.txt, and CompSpoof_eval.txt.

Each line has four fields:

mixed_audio   speech_source   env_source   class_label
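A minimal sketch of reading these files follows. It assumes the four fields are separated by whitespace and that paths contain no spaces; neither is confirmed here, so check a few lines of the actual files first. `Entry` and `parse_metadata` are hypothetical names.

```python
from collections import namedtuple

# Field names follow the four-column layout described above.
Entry = namedtuple("Entry", "mixed_audio speech_source env_source class_label")

def parse_metadata(lines):
    """Parse metadata lines into Entry records.

    Assumption: fields are whitespace-separated and contain no spaces.
    Blank lines are skipped; malformed lines raise ValueError.
    """
    entries = []
    for line_no, line in enumerate(lines, start=1):
        fields = line.split()
        if not fields:
            continue
        if len(fields) != 4:
            raise ValueError(
                f"line {line_no}: expected 4 fields, got {len(fields)}"
            )
        entries.append(Entry(*fields))
    return entries
```

Because the function accepts any iterable of lines, an open file handle can be passed directly, for example `parse_metadata(open("CompSpoof_train.txt", encoding="utf-8"))`.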

🎧 Data Sources

Bona fide speech: ASVspoof5, CommonVoice
Spoofed speech: ASVspoof5, SSTC
Bona fide environmental sounds: VGGSound
Spoofed environmental sounds: VCapAV
Original mixed audio: VGGSound (speech and environment captured simultaneously)

Environmental sounds cover indoor, street, and natural settings, ensuring acoustic diversity.

During processing:

All files are resampled to 16 kHz.
The shorter of the two signals determines the final duration; the longer one is truncated.
The environmental sound is scaled to a predefined SNR relative to the speech.
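The truncation and SNR scaling steps above can be sketched as follows. This is an illustration under stated assumptions, not the authors' mixing script: both inputs are assumed to be already resampled to 16 kHz, samples are plain Python floats to keep the example dependency-free, and `mix_at_snr` and its `snr_db` parameter are hypothetical (the SNR values actually used are not stated here). The gain follows from the definition SNR = 10 log10(P_speech / P_env).

```python
import math

def mix_at_snr(speech, env, snr_db):
    """Mix an environmental signal under speech at a target SNR in dB.

    Assumptions: both inputs are sequences of float samples at the same
    sampling rate, and power is measured over the whole truncated signal.
    """
    # The shorter signal sets the duration; the longer one is truncated.
    n = min(len(speech), len(env))
    if n == 0:
        raise ValueError("both signals must be non-empty")
    speech, env = speech[:n], env[:n]

    p_speech = sum(s * s for s in speech) / n
    p_env = sum(e * e for e in env) / n
    if p_env == 0:
        raise ValueError("environmental signal is silent; cannot scale to an SNR")

    # Choose gain so that p_speech / (gain**2 * p_env) == 10 ** (snr_db / 10).
    gain = math.sqrt(p_speech / (p_env * 10 ** (snr_db / 10)))
    return [s + gain * e for s, e in zip(speech, env)]

# Toy signals: 8 speech samples and 10 environment samples.
speech = [1.0, -1.0] * 4
env = [0.5] * 10
mixed = mix_at_snr(speech, env, snr_db=10.0)
print(len(mixed))
```

A real pipeline would typically operate on arrays and may also guard against clipping after summation; both are left out to keep the sketch short.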

🔖 Citation

If you use this dataset in your research, please cite:

@misc{zhang2025compspoofdatasetjointlearning,
      title={CompSpoof: A Dataset and Joint Learning Framework for Component-Level Audio Anti-spoofing Countermeasures}, 
      author={Xueping Zhang and Liwei Jin and Yechen Wang and Linxi Li and Ming Li},
      year={2025},
      eprint={2509.15804},
      archivePrefix={arXiv},
      primaryClass={cs.SD},
      url={https://arxiv.org/abs/2509.15804}, 
}

License

Part of this dataset is derived by combining and mixing audio samples from multiple publicly available datasets.

The SSTC dataset and the VCapAV dataset are released under the CC BY-NC 4.0 license.
The VGGSound dataset is released under the CC BY 4.0 license.
The Common Voice dataset is released under the Creative Commons CC0 1.0 Universal license.
The ASVspoof 5 dataset is released under the ODC-By License.

Users must comply with the license terms of each original dataset. The authors do not claim ownership of the original audio content. Because it includes data licensed under CC BY-NC 4.0, this dataset is released under the CC BY-NC 4.0 license.