
week 13.03 – 19.03.2017

CNN + RNN for language identification (not phonemes :()

From this example implementation, and also (more details) here. They give some tips on training:

  • " Both RNNs and CNNs were trained using adadelta for a few epochs, then by SGD with momentum (0.003 or 0.0003) until overfitting. If SGD with momentum is applied from the very beginning, the convergence is very slow. Adadelta converges faster but usually doesn’t reach high validation accuracy."
  • Cropping out frequencies above 5.5 kHz helped fight overfitting.
  • The general architecture of these combinations is a convolutional feature extractor applied to the input, then a recurrent network on top of the CNN's output, then an optional fully connected layer on the RNN's output, and finally a softmax layer (a Lasagne sketch of this stack, together with the two-stage training schedule, follows the code block below).
  • preprocessing: first convert all filenames to lowercase. Run this command as many times as the directory depth (e.g. 4 for TIMIT), from inside the TIMIT folder (where you find train/, test/, etc.):
    find . -depth -print0 | xargs -0 rename '$_ = lc $_'
    alternative: find . -depth -print -execdir rename -f 'y/A-Z/a-z/' '{}' \; (source)
    lower to upper: find . -depth -print -execdir rename -f 'y/a-z/A-Z/' '{}' \;
  • convert the phonemes from 61 to 39 classes in the .phones files, using dictionaries (a folding sketch follows the code block below)
  • create MFCCs: either using HTK and my custom Preprocessing/prepareWAV_HTK.py script, or in Python itself:
import numpy as np
from scipy.io import wavfile
from python_speech_features import mfcc

# read lines with paths of wav files into list 'wavnames'
with open('wavnames.txt') as f:  # 'wavnames.txt' is a placeholder filename
    wavnames = [line.strip() for line in f]

for audio_filename in wavnames:
    fs, audio = wavfile.read(audio_filename)  # sample rate and raw samples
    inputs = mfcc(audio, samplerate=fs)       # (n_frames, 13) feature matrix

# alternatively, random features and labels for a quick pipeline test; label
# length must be <= the number of timesteps to reach the end of the CTC lattice
num_features, num_labels = 13, 39
timesteps = [100, 150, 120]  # frame counts of the fake utterances
inputs = [np.random.randn(t, num_features).astype(np.float32) for t in timesteps]
labels = [np.random.randint(0, num_labels,
                            np.random.randint(1, inputs[i].shape[0])).astype(np.int64)
          for i, _ in enumerate(timesteps)]
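A minimal sketch of the 61 → 39 folding mentioned above. The dict here is an assumption: only a few entries of the standard Lee & Hon folding table are shown, and the assumed file format is the TIMIT-style "start end phone" per line:

# assumed: partial Lee & Hon 61 -> 39 folding table; fill in the full mapping
fold = {'ux': 'uw', 'axr': 'er', 'em': 'm', 'pcl': 'sil', 'tcl': 'sil',
        'h#': 'sil'}  # ... remaining entries omitted

def fold_phone_file(path_in, path_out):
    with open(path_in) as fin, open(path_out, 'w') as fout:
        for line in fin:
            start, end, phone = line.split()
            if phone == 'q':  # glottal stop: usually deleted in the 39 set
                continue
            fout.write('%s %s %s\n' % (start, end, fold.get(phone, phone)))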
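And a minimal Lasagne sketch of the conv-extractor → RNN → dense → softmax stack and the two-stage adadelta → SGD-with-momentum schedule from the tips above, for utterance-level classification as in the language-ID example. All layer sizes and input shapes are my assumptions, not values from the linked implementation:

import theano
import theano.tensor as T
import lasagne
from lasagne.layers import (InputLayer, Conv2DLayer, MaxPool2DLayer,
                            DimshuffleLayer, ReshapeLayer, LSTMLayer, DenseLayer)

# assumed input: (batch, 1 channel, 64 freq bins, 200 frames) spectrograms
input_var = T.tensor4('inputs')
targets = T.ivector('targets')
net = InputLayer((None, 1, 64, 200), input_var=input_var)
net = Conv2DLayer(net, num_filters=32, filter_size=(3, 3), pad='same')
net = MaxPool2DLayer(net, pool_size=(2, 1))  # pool over frequency only
net = DimshuffleLayer(net, (0, 3, 1, 2))     # -> (batch, time, chan, freq)
net = ReshapeLayer(net, ([0], [1], -1))      # -> (batch, time, features)
net = LSTMLayer(net, num_units=128, only_return_final=True)
net = DenseLayer(net, num_units=256)         # the optional fully connected layer
net = DenseLayer(net, num_units=39, nonlinearity=lasagne.nonlinearities.softmax)

prediction = lasagne.layers.get_output(net)
loss = lasagne.objectives.categorical_crossentropy(prediction, targets).mean()
params = lasagne.layers.get_all_params(net, trainable=True)

# phase 1: adadelta for a few epochs (fast early convergence)
train_ada = theano.function([input_var, targets], loss,
                            updates=lasagne.updates.adadelta(loss, params))
# phase 2: SGD with momentum until overfitting (lr 0.003 or 0.0003, as quoted)
train_sgd = theano.function([input_var, targets], loss,
                            updates=lasagne.updates.momentum(loss, params,
                                                             learning_rate=0.003))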
Recognizing phonemes
  • the example from last week actually recognizes a language, not phonemes :/. It's useful as a source for Lasagne-based LSTMs, though.

  • people did do phoneme recognition on TIMIT using Theano, but I couldn't find Lasagne implementations. There are some using other libraries:

  • There's also this: it uses the CTC cost function in Lasagne, and includes a preprocessing script to generate .pkl files from the TIMIT dataset.

  • CTC explanation: blogpost. CTC doesn't require label boundaries; it only gives the sequence of recognized phonemes (see the collapsing sketch at the end of this list). How can I use this for my purposes?

  • Both Spoken2Phoneme and Phoneme2Word, implemented in Keras: here. Very interesting, though it uses only Dense layers, not LSTMs -> needs some modification (a possible LSTM version is sketched at the end of this list).

  • CTC in Lasagne: https://github.com/justiceamoh/ascii_ctc

  • alignment of MFCC frames to labels: see this reddit thread

  • PHONEME Recognition in Lasagne on TIMIT
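To make the CTC point above concrete, a tiny illustration of the collapsing rule (my own sketch, not taken from the blogpost): merge repeated symbols along the framewise best path, then drop blanks. The output is just the phoneme sequence, with no frame boundaries:

def ctc_collapse(path, blank=0):
    """Collapse a framewise CTC path: merge repeated symbols, drop blanks."""
    out, prev = [], None
    for p in path:
        if p != prev and p != blank:
            out.append(p)
        prev = p
    return out

# e.g. frames [a a _ b b _ b] collapse to the phoneme sequence [a b b]
print(ctc_collapse([1, 1, 0, 2, 2, 0, 2]))  # -> [1, 2, 2]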
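And a hedged sketch of the modification the Keras point above calls for: replace the Dense-only stack with an LSTM over the MFCC frames plus a time-distributed softmax over the 39 phoneme classes. The layer sizes and the 13-coefficient input are my assumptions:

from keras.models import Sequential
from keras.layers import LSTM, Dense, TimeDistributed

model = Sequential()
# input: (timesteps, n_mfcc) sequences; None allows variable-length utterances
model.add(LSTM(128, return_sequences=True, input_shape=(None, 13)))
model.add(TimeDistributed(Dense(39, activation='softmax')))  # per-frame posteriors
model.compile(loss='categorical_crossentropy', optimizer='adam')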