Kokoro Speech Dataset is a public domain Japanese speech dataset. It contains 43,253 short audio clips of a single speaker reading 14 novel books. The format of the metadata is similar to that of LJ Speech so that the dataset is compatible with modern speech synthesis systems.
The texts are from Aozora Bunko, which is in the public domain. The audio clips are from the LibriVox project, which is also in the public domain. Readings are estimated by MeCab and UniDic Lite from the kanji-kana mixture text. The readings are romanized in a format similar to the one used by Julius.
The audio clips were split and transcripts were aligned automatically by Kokoro-Align.
You can listen in your browser or download 100 randomly sampled clips.
Metadata is provided in metadata.csv. This file consists of one record per line, delimited by the pipe character (0x7c). The fields are:
- ID: this is the name of the corresponding .wav file
- Transcription: Kanji-kana mixture text spoken by the reader (UTF-8)
- Reading: Romanized text spoken by the reader (UTF-8)
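The record layout above can be read with Python's standard csv module. The following is a minimal sketch, not part of the repository: the Record type and the read_metadata function are hypothetical names, and it assumes that the fields contain no quoting or escaped pipe characters.

```python
import csv
from typing import Iterable, List, NamedTuple


class Record(NamedTuple):
    clip_id: str         # name of the corresponding .wav file
    transcription: str   # kanji-kana mixture text
    reading: str         # romanized reading


def read_metadata(lines: Iterable[str]) -> List[Record]:
    """Parse pipe-delimited metadata lines into records."""
    # Assumption: no CSV quoting is used, so quote characters are plain text.
    reader = csv.reader(lines, delimiter="|", quoting=csv.QUOTE_NONE)
    records = []
    for line_no, row in enumerate(reader, start=1):
        if not row:
            continue  # skip blank lines
        if len(row) != 3:
            raise ValueError(f"line {line_no}: expected 3 fields, got {len(row)}")
        records.append(Record(*row))
    return records


# Made-up sample line for illustration; it is not taken from the dataset.
sample = ["example-0001|こんにちは|konnichiwa"]
print(read_metadata(sample))
```

To read the real file, pass an open file object instead of a list, for example one opened with open("metadata.csv", encoding="utf-8", newline="").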
Each audio file is a single-channel 16-bit PCM WAV with a sample rate of 22050 Hz.
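An extracted clip can be checked against this format with the standard wave module. This is a small sketch under the assumption that the clips are plain PCM WAV files; the function names are illustrative and do not come from the repository.

```python
import wave


def matches_expected_format(path: str,
                            channels: int = 1,
                            sample_width_bytes: int = 2,
                            sample_rate: int = 22050) -> bool:
    """Return True if the WAV file is mono, 16-bit, 22050 Hz by default."""
    with wave.open(path, "rb") as wav:
        return (wav.getnchannels() == channels
                and wav.getsampwidth() == sample_width_bytes
                and wav.getframerate() == sample_rate)


def duration_seconds(path: str) -> float:
    """Return the clip length in seconds (frame count / sample rate)."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()
```

A sample width of 2 bytes corresponds to 16-bit samples.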
The dataset is provided in different sizes: xlarge, large, small, and tiny. large, small, and tiny don't share any clips. xlarge contains all available clips, including those in large, small, and tiny.
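The relationship between the sizes can be expressed as a check on sets of clip IDs. The sketch below is illustrative only: check_splits is a hypothetical helper, and how the ID sets are obtained for each size is left open.

```python
from itertools import combinations
from typing import Dict, Set


def check_splits(splits: Dict[str, Set[str]], superset: str = "xlarge") -> None:
    """Raise ValueError unless the smaller sizes are disjoint subsets of the superset."""
    subsets = {name: ids for name, ids in splits.items() if name != superset}
    # large, small, and tiny must not share clips with each other.
    for (name_a, ids_a), (name_b, ids_b) in combinations(subsets.items(), 2):
        overlap = ids_a & ids_b
        if overlap:
            raise ValueError(f"{name_a} and {name_b} share {len(overlap)} clip(s)")
    # xlarge must contain every clip of the smaller sizes.
    for name, ids in subsets.items():
        missing = ids - splits[superset]
        if missing:
            raise ValueError(f"{superset} lacks {len(missing)} clip(s) from {name}")


# Made-up IDs for illustration only.
check_splits({
    "xlarge": {"a", "b", "c", "d"},
    "large": {"a", "b"},
    "small": {"c"},
    "tiny": {"d"},
})
```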
X Large:
Total clips: 44788
Min duration: 3.007 secs
Max duration: 14.861 secs
Mean duration: 4.718 secs
Total duration: 58:41:39
Large:
Total clips: 23461
Min duration: 3.007 secs
Max duration: 14.861 secs
Mean duration: 4.742 secs
Total duration: 30:54:16
Small:
Total clips: 9199
Min duration: 3.007 secs
Max duration: 9.961 secs
Mean duration: 4.687 secs
Total duration: 11:58:31
Tiny:
Total clips: 308
Min duration: 3.030 secs
Max duration: 8.092 secs
Mean duration: 4.695 secs
Total duration: 00:24:05
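As a sanity check on the figures above, the mean duration should be close to the total duration divided by the number of clips. The sketch below copies the numbers from the statistics; the helper names are illustrative, and small differences are expected because the totals are given only to the second.

```python
def hms_to_seconds(hms: str) -> int:
    """Convert an HH:MM:SS string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


# (total clips, listed mean duration in seconds, total duration)
STATS = {
    "xlarge": (44788, 4.718, "58:41:39"),
    "large": (23461, 4.742, "30:54:16"),
    "small": (9199, 4.687, "11:58:31"),
    "tiny": (308, 4.695, "00:24:05"),
}


def implied_mean(clips: int, total: str) -> float:
    """Mean clip duration implied by the total duration and clip count."""
    return hms_to_seconds(total) / clips


for name, (clips, listed_mean, total) in STATS.items():
    print(f"{name}: listed {listed_mean:.3f} s, implied {implied_mean(clips, total):.3f} s")
```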
Because the dataset is large, audio files are not included in this repository, but the metadata is included.
To make .wav files of the dataset, run
$ bash download.sh
to download the metadata from the project page. Then run
$ pip3 install torchaudio
$ python3 extract.py --size tiny
If you have not yet downloaded the audio, this prints an example shell script that downloads the MP3 audio files from archive.org and extracts them.
After doing so, run the command again
$ python3 extract.py --size tiny
to get the files for tiny under the ./output directory.
You can pass another size name to the --size option to get the dataset of that size.
You can specify the audio clip format with the --format option.
A pretrained Tacotron model trained with Kokoro Speech Dataset and audio samples are available.
The model was trained for 21K steps with small.
According to the above repo, "Speech started to become intelligible around 20K steps" with LJ Speech Dataset.
The audio samples read the first few sentences from Gon Gitsune, which is not included in small.
The dataset contains recordings of these books read by ekzemplaro:
- 明暗 (Meian) 16:39:29 Online text
- こころ (Kokoro) 08:46:41 Online text
- 田舎教師 (Inaka Kyoshi) 08:13:26 Online text
- 野分 (Nowaki) 04:40:49 Online text
- 草枕 (Kusamakura) 04:27:35 Online text
- 坊っちゃん (Botchan) 04:26:27 Online text
- 雁 (Gan) 03:41:31 Online text
- 生まれいずる悩み (Umareizuru Nayami) 02:43:12 Online text
- 硝子戸の中 (Garasudono uchi) 02:39:53 Online text
- 永日小品 (Eijitsu Syohin) 02:33:54 Online text
- 蒲団 (Futon) 02:28:58 Online text
- 高野聖 (Kouyahijiri) 02:06:23 Online text
- ごん狐 (Gon gitsune) 00:15:42 Online text
- コーカサスの禿鷹 (Caucasus no Hagetaka) 00:13:04 Online text
This project was also inspired by CSS10, which contains audio clips of various languages from LibriVox.
- v1.3 Keep word separators in transcripts with '_'
- v1.2 New metadata generated with a new align model
- v1.1.1 Added FLAC, MP3, OGG support
- v1.1 Added more books
- v1.0 Initial release
All texts are from Aozora Bunko. Recordings by ekzemplaro from LibriVox. Alignment and annotation by Katsuya Iida.
This dataset is in the public domain in the USA (and most likely other countries as well). There are no restrictions on its use. For more information, please see: librivox.org/pages/public-domain.