Skip to content

Documenting/releasing the scripts and files used for automatic evaluation of the Santa Barbara Corpus of Spoken American English, as described in the paper "Evaluating the Santa Barbara Corpus: Challenges of the Breadth of Conversational Spoken Language", published in ISCA Interspeech 2024.

License

Notifications You must be signed in to change notification settings

mmaciej2/sbcsae_preparation

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

18 Commits
 
 
 
 
 
 
 
 

Repository files navigation

Data Preparation for the Santa Barbara Corpus of Spoken American English

This respository compiles updated annotations and technical resources to accompany the paper "Evaluating the Santa Barbara Corpus: Challenges of the Breadth of Conversational Spoken Language". For more information about the dataset itself, including audio samples and extensive statistics, please visit our website about the dataset.

If you use our work, please cite our paper as well as the corpus itself:

@inproceedings{maciejewski24_interspeech,
  author={Matthew Maciejewski and Dominik Klement and Ruizhe Huang and Matthew Wiesner and Sanjeev Khudanpur},
  title={Evaluating the {Santa Barbara} Corpus: Challenges of the Breadth of Conversational Spoken Language},
  year=2024,
  booktitle={Proc. Interspeech 2024}
}
@misc{dubois_2005,
  author={Du Bois, John W. and Chafe, Wallace L. and Meyer, Charles and Thompson, Sandra A. and Englebretson, Robert and Martey, Nii},
  year={2000--2005},
  title={{S}anta {B}arbara corpus of spoken {A}merican {E}nglish, {P}arts 1--4},
  address={Philadelphia},
  organization={Linguistic Data Consortium},
}

Annotations for Evaluation

The most comprehensive set of annotations can be generated with the Lhotse recipe, which also includes code to download the corpus from OpenSLR. However, if you would like to merely download the basic annotation files for use with your own systems without using Lhotse, these files can be found in the "ASR" and "Diarization" subdirectories of this respository. The dataset is also available via Hugging Face.

Preparation Scripts

The code that was used to prepare the data is contained in three repositories:

  • Lhotse Recipe - The Lhotse recipe contains the bulk of the processing, normalization, and correction required to prepare the annotations and metadata for automatic processing. It also contains code to download the corpus.
  • Anonymization Detection - This repository contains the code used to detect the regions of the audio that had been low-pass filtered to remove personally identifiable information from the speech.
  • Alignment for VAD - This repository contains the code used to align the transcripts to re-segment single-speaker regions for better voice activity detection (VAD) boundaries. (Note: code currently undergoing cleanup for release)

About

Documenting/releasing the scripts and files used for automatic evaluation of the Santa Barbara Corpus of Spoken American English, as described in the paper "Evaluating the Santa Barbara Corpus: Challenges of the Breadth of Conversational Spoken Language", published in ISCA Interspeech 2024.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published