
week 13.03 – 19.03.2017

CNN + RNN for language identification (not phonemes :()

From this example implementation, and also (more details) here. They give some tips on training:

  • " Both RNNs and CNNs were trained using adadelta for a few epochs, then by SGD with momentum (0.003 or 0.0003) until overfitting. If SGD with momentum is applied from the very beginning, the convergence is very slow. Adadelta converges faster but usually doesn’t reach high validation accuracy."
  • Cropping out frequencies above 5.5 kHz helped fight overfitting.
  • The general architecture of these combinations is a convolutional feature extractor applied to the input, then a recurrent network on top of the CNN's output, then an optional fully connected layer on the RNN's output, and finally a softmax layer (a Lasagne sketch of this stack, together with the two-stage training schedule, follows the code block below).
  • preprocessing: first convert all filenames to lowercase. Run this command as many times as the directory depth (e.g. 4 for TIMIT), from inside the TIMIT folder (where you find train/, test/, etc.):
    find . -depth -print0 | xargs -0 rename '$_ = lc $_'
    alternative: find . -depth -print -execdir rename -f 'y/A-Z/a-z/' '{}' \; (source)
    lower to upper: find . -depth -print -execdir rename -f 'y/a-z/A-Z/' '{}' \;
  • convert the phonemes from 61 to 39 classes in the .phones files, using dictionaries (a folding sketch follows the code block below)
  • create MFCCs: either using HTK and my custom Preprocessing/prepareWAV_HTK.py script, or in Python itself:
import numpy as np
from scipy.io import wavfile
from python_speech_features import mfcc

# read lines with paths of wav files into list 'wavnames'
with open('wavnames.txt') as f:  # 'wavnames.txt' is a placeholder filename
    wavnames = [line.strip() for line in f]

for audio_filename in wavnames:
    fs, audio = wavfile.read(audio_filename)  # sample rate and raw samples
    inputs = mfcc(audio, samplerate=fs)       # (n_frames, 13) feature matrix

# alternatively, random features and labels for a quick pipeline test; label
# length must be <= the number of timesteps to reach the end of the CTC lattice
num_features, num_labels = 13, 39
timesteps = [100, 150, 120]  # frame counts of the fake utterances
inputs = [np.random.randn(t, num_features).astype(np.float32) for t in timesteps]
labels = [np.random.randint(0, num_labels,
                            np.random.randint(1, inputs[i].shape[0])).astype(np.int64)
          for i, _ in enumerate(timesteps)]
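A minimal sketch of the 61 → 39 folding mentioned above. The dict here is an assumption: only a few entries of the standard Lee & Hon folding table are shown, and the assumed file format is the TIMIT-style "start end phone" per line:

# assumed: partial Lee & Hon 61 -> 39 folding table; fill in the full mapping
fold = {'ux': 'uw', 'axr': 'er', 'em': 'm', 'pcl': 'sil', 'tcl': 'sil',
        'h#': 'sil'}  # ... remaining entries omitted

def fold_phone_file(path_in, path_out):
    with open(path_in) as fin, open(path_out, 'w') as fout:
        for line in fin:
            start, end, phone = line.split()
            if phone == 'q':  # glottal stop: usually deleted in the 39 set
                continue
            fout.write('%s %s %s\n' % (start, end, fold.get(phone, phone)))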
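And a minimal Lasagne sketch of the conv-extractor → RNN → dense → softmax stack and the two-stage adadelta → SGD-with-momentum schedule from the tips above, for utterance-level classification as in the language-ID example. All layer sizes and input shapes are my assumptions, not values from the linked implementation:

import theano
import theano.tensor as T
import lasagne
from lasagne.layers import (InputLayer, Conv2DLayer, MaxPool2DLayer,
                            DimshuffleLayer, ReshapeLayer, LSTMLayer, DenseLayer)

# assumed input: (batch, 1 channel, 64 freq bins, 200 frames) spectrograms
input_var = T.tensor4('inputs')
targets = T.ivector('targets')
net = InputLayer((None, 1, 64, 200), input_var=input_var)
net = Conv2DLayer(net, num_filters=32, filter_size=(3, 3), pad='same')
net = MaxPool2DLayer(net, pool_size=(2, 1))  # pool over frequency only
net = DimshuffleLayer(net, (0, 3, 1, 2))     # -> (batch, time, chan, freq)
net = ReshapeLayer(net, ([0], [1], -1))      # -> (batch, time, features)
net = LSTMLayer(net, num_units=128, only_return_final=True)
net = DenseLayer(net, num_units=256)         # the optional fully connected layer
net = DenseLayer(net, num_units=39, nonlinearity=lasagne.nonlinearities.softmax)

prediction = lasagne.layers.get_output(net)
loss = lasagne.objectives.categorical_crossentropy(prediction, targets).mean()
params = lasagne.layers.get_all_params(net, trainable=True)

# phase 1: adadelta for a few epochs (fast early convergence)
train_ada = theano.function([input_var, targets], loss,
                            updates=lasagne.updates.adadelta(loss, params))
# phase 2: SGD with momentum until overfitting (lr 0.003 or 0.0003, as quoted)
train_sgd = theano.function([input_var, targets], loss,
                            updates=lasagne.updates.momentum(loss, params,
                                                             learning_rate=0.003))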
Recognizing phonemes
  • the example from last week actually recognizes a language, not phonemes :/. It's useful as a source for Lasagne-based LSTMs, though.

  • people did do phoneme recognition on TIMIT using Theano, but I couldn't find Lasagne implementations. There are some using other libraries:

  • There's also this: it uses the CTC cost function in Lasagne, and includes a preprocessing script to generate .pkl files from the TIMIT dataset.

  • CTC explanation: blogpost. CTC doesn't require label boundaries; it only gives the sequence of recognized phonemes (see the collapsing sketch at the end of this list). How can I use this for my purposes?

  • Both Spoken2Phoneme and Phoneme2Word, implemented in Keras: here. Very interesting, though it uses only Dense layers, not LSTMs -> needs some modification (a possible LSTM version is sketched at the end of this list).

  • CTC in Lasagne: https://github.com/justiceamoh/ascii_ctc

  • alignment of MFCC frames to labels: see this reddit thread

  • PHONEME Recognition in Lasagne on TIMIT
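To make the CTC point above concrete, a tiny illustration of the collapsing rule (my own sketch, not taken from the blogpost): merge repeated symbols along the framewise best path, then drop blanks. The output is just the phoneme sequence, with no frame boundaries:

def ctc_collapse(path, blank=0):
    """Collapse a framewise CTC path: merge repeated symbols, drop blanks."""
    out, prev = [], None
    for p in path:
        if p != prev and p != blank:
            out.append(p)
        prev = p
    return out

# e.g. frames [a a _ b b _ b] collapse to the phoneme sequence [a b b]
print(ctc_collapse([1, 1, 0, 2, 2, 0, 2]))  # -> [1, 2, 2]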
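And a hedged sketch of the modification the Keras point above calls for: replace the Dense-only stack with an LSTM over the MFCC frames plus a time-distributed softmax over the 39 phoneme classes. The layer sizes and the 13-coefficient input are my assumptions:

from keras.models import Sequential
from keras.layers import LSTM, Dense, TimeDistributed

model = Sequential()
# input: (timesteps, n_mfcc) sequences; None allows variable-length utterances
model.add(LSTM(128, return_sequences=True, input_shape=(None, 13)))
model.add(TimeDistributed(Dense(39, activation='softmax')))  # per-frame posteriors
model.compile(loss='categorical_crossentropy', optimizer='adam')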