THCHS30 is an open Chinese speech database published by the Center for Speech and Language Technology (CSLT) at Tsinghua University.
The original recording was conducted in 2002 by Dong Wang, supervised by Prof. Xiaoyan Zhu, at the Key State Lab of Intelligence and System,
Department of Computer Science, Tsinghua University, and the original name was 'TCMSD', standing for 'Tsinghua Continuous Mandarin
Speech Database'. The publication, 13 years later, was initiated by Dr. Dong Wang
and supported by Prof. Xiaoyan Zhu. We hope to provide a toy database for new researchers in the field of speech recognition. Therefore,
the database is totally free to academic users.
The entire package includes the full set of speech and language resources required to establish a Chinese speech recognition system.
If you plan to use Kaldi, it is better to download the openslr version.
wav : signals including the training/cv/test sets.
doc : transcripts, reference results, etc.
lm : language models, including word-based and morpheme-based.
We call for competition on this database. Two challenges are set up,
and researchers are welcome to challenge the current state-of-the-art!
Check the challenge page for details.
LOCAL DOWNLOAD (not recommended)
ATTENTION: We received some feedback and noticed that there are some errors
in the transcription. These problems have been fixed. If you downloaded the data
before 2015/12/27, please re-download. The baseline results have also been
updated; check [here].
The data can be downloaded from our local server at CSLT@Tsinghua.
wav.tgz.tgz : speech signals[5.5GB]
doc.tgz : transcription and reference results[1MB]
lm.tgz : language models[24MB]
standalone.html : this file
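Once the archives are downloaded, they need to be unpacked before use. The following is a minimal Python sketch, not part of the official distribution: the function name `extract_archives` and the directory arguments are hypothetical, and it simply unpacks every `*.tgz` file it finds in a directory, refusing any archive member whose path would escape the destination.

```python
import tarfile
from pathlib import Path


def extract_archives(archive_dir, dest_dir):
    """Extract every *.tgz file in archive_dir into dest_dir.

    Returns the sorted list of archive file names that were extracted.
    Both directory names are chosen by the caller (hypothetical here).
    """
    archive_dir = Path(archive_dir)
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest_root = dest_dir.resolve()

    extracted = []
    for archive in sorted(archive_dir.glob("*.tgz")):
        with tarfile.open(archive, "r:gz") as tar:
            # Refuse members that would be written outside dest_dir.
            for member in tar.getmembers():
                target = (dest_root / member.name).resolve()
                if target != dest_root and dest_root not in target.parents:
                    raise ValueError(
                        f"unsafe path in {archive.name}: {member.name}"
                    )
            tar.extractall(dest_root)
        extracted.append(archive.name)
    return extracted
```

For example, `extract_archives("downloads", "thchs30")` would unpack all archives saved under a `downloads` directory into `thchs30`; both directory names are placeholders.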
PUBLIC DOWNLOAD (recommended)
The above links point to our own web server at Tsinghua University, which may be unstable
and slow for some connections. The mirrors on public cloud disks can be used as a backup:
THUYG-20 RECIPE IS AVAILABLE:
All the resources contained in the database are free for research institutes and individuals. No commercial usage is permitted.
We would appreciate it if you cite the following paper in your publications:
Dong Wang, Xuewei Zhang, CSLT TRP 20150016: THCHS-30 : A Free Chinese Speech Corpus. [pdf][arXiv 1512.01882]
A paper (if it can be called a paper) from 13 years ago regarding the database:
Dong Wang, Dalei Wu, Xiaoyan Zhu, "TCMSD: A new Chinese Continuous Speech Database",
International Conference on Chinese Computing (ICCC'01), 2001, Singapore. [pdf]
Dong Wang, Xuewei Zhang, Zhiyong Zhang @CSLT, Tsinghua Univ.
Dong Wang: email@example.com
XueWei Zhang: firstname.lastname@example.org
Zhiyong Zhang: email@example.com
CSLT, Tsinghua University
ROOM1-303, BLDG FIT
2015/10/14: First release
2015/12/10: Errors in transcriptions were corrected. New baseline released.
2015/12/27: More errors in transcriptions were corrected. New baseline released.
2016/01/22: Training lexicon added to doc.