
Week 21.11 - 27.11


Modifying the converted Caffe model so the network topology fits my 39 classes: change the number of output neurons, then retrain with a learning rate of 0 for everything except the final FC layer.
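
In practice, a learning rate of 0 for all layers except the final FC layer is the same as only passing that layer's parameters to the update rule. A minimal sketch in Lasagne/Theano, assuming build_network() (from resnet50CaffeToLasagne.py, see "Code usage" below) accepts an input variable and returns a dict of layers; the layer names ('pool5', 'fc39') and hyperparameters are placeholders:

```python
import theano
import theano.tensor as T
import lasagne
from resnet50CaffeToLasagne import build_network   # see "Code usage" below

X = T.tensor4('X')
y = T.ivector('y')

# assumed: build_network() returns a dict of layers; 'pool5' is the last layer
# before the original 1000-way classifier
net = build_network(input_var=X)
net['fc39'] = lasagne.layers.DenseLayer(net['pool5'], num_units=39,
                                        nonlinearity=lasagne.nonlinearities.softmax)

prediction = lasagne.layers.get_output(net['fc39'])
loss = lasagne.objectives.categorical_crossentropy(prediction, y).mean()

# only the new FC layer's parameters are updated; all other weights stay fixed
fc_params = net['fc39'].get_params(trainable=True)
updates = lasagne.updates.momentum(loss, fc_params, learning_rate=0.001)
train_fn = theano.function([X, y], loss, updates=updates)
```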

Code usage

There are three important files:

  1. resnet50CaffeToLasagne.py:
  • contains build_network(), which builds the bare network structure (the layer configuration) so it matches that of the Caffe model.
  • contains build_network_fill_from_caffe(), which takes the network structure, builds the Caffe model from the provided model files, and then converts the Caffe weights so they can be copied to the Lasagne model. It outputs both the Caffe model and the converted Lasagne model for comparison (a rough sketch of the weight-copy step follows this list).
  2. resnet50CaffeToLasagne_ImageNet.py:
  • this file gets the Lasagne model from build_network_fill_from_caffe(), gets the output classes from a .txt file, gets the mean value of the training images from a .binaryproto file, and stores those parameters in a full model of the network. It stores that model in a .pkl file that can be used for evaluation of the network.
  • If you want to, you can verify that the Lasagne network works by downloading some images and testing on them.
  3. resnet50_evaluateNetwork.py:
  • this file takes an input .pkl file and an input image, runs the network on the image, and returns the top-5 classifications plus some timing information. Input images are transformed to the right size.
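
A rough sketch of the weight-copy step in build_network_fill_from_caffe(), assuming the Lasagne layers were created in the same order as the Caffe layers; the real conversion also has to handle details such as transposing the dense-layer weight matrix and the BatchNorm parameters:

```python
import numpy as np
import caffe
import lasagne

def fill_from_caffe(lasagne_output_layer, prototxt, caffemodel):
    """Copy the Caffe weights into an already-built Lasagne network (sketch)."""
    caffe_net = caffe.Net(prototxt, caffemodel, caffe.TEST)

    # flatten all Caffe parameter blobs into one list, in network order
    values = []
    for layer_name in caffe_net.params:
        for blob in caffe_net.params[layer_name]:
            values.append(blob.data.astype(np.float32))

    # assumes the Lasagne parameters appear in exactly the same order
    lasagne.layers.set_all_param_values(lasagne_output_layer, values)
    return caffe_net
```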

So to create, populate and then evaluate a Lasagne model, do as follows:

```sh
python resnet50CaffeToLasagne_ImageNet.py  # builds and populates the model, stores it in './resnet50imageNet.pkl'
python resnet50_evaluateNetwork.py -i indianElephant.jpeg -m resnet50imageNet.pkl  # evaluate the network on an image
```
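
Roughly what the evaluation script does (the keys stored in the .pkl file, the layer name 'prob', calling build_network() without arguments, and the preprocessing are assumptions):

```python
import pickle
import numpy as np
import lasagne
from resnet50CaffeToLasagne import build_network

with open('resnet50imageNet.pkl', 'rb') as f:
    model = pickle.load(f)            # assumed keys: 'values', 'classes', 'mean_image'

net = build_network()                 # same topology as used during conversion
lasagne.layers.set_all_param_values(net['prob'], model['values'])

def classify(image):
    """image: float32 array of shape (3, 224, 224), already mean-subtracted."""
    probs = np.array(lasagne.layers.get_output(net['prob'], image[np.newaxis],
                                               deterministic=True).eval())[0]
    top5 = probs.argsort()[-5:][::-1]
    return [(model['classes'][i], float(probs[i])) for i in top5]
```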

Fixing audio/video offsets

There is an issue with synchronization of audio and video. When cutting the whole video of a speaker into smaller clips, ffmpeg cuts the video at I-frames (which may be quite far apart, e.g. 15 frames = 0.5 s). The audio is cut at the exact time.
To create proper videos, ffmpeg pads the video with I-frames at the front, and adds offset information so each video frame is displayed when its corresponding audio packet is played. These offsets are called 'Presentation Time Stamps (PTS)'.
When extracting images from a clip, these PTS are lost, and the padded frames are removed by Matlab. This means that you'll have fewer frames (sometimes more) than expected based on the video length, and they also won't be synchronized to the phoneme labels anymore.
To fix these issues, you can run ffprobe to extract the PTS offsets between audio and images and add these offsets to the audio labels (instead of the video player doing this automatically), so that they correspond to the extracted images. These offset phoneme files are converted to viseme labels (several phonemes are visually indistinguishable, so this is a many-to-one mapping) and stored in a viseme label file.
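
A sketch of this fix, assuming ffprobe is available and the phoneme label files contain one "start end phoneme" triple per line with times in seconds (file names are placeholders):

```python
import subprocess

def stream_start_time(clip, stream):
    """Return the start_time of a stream ('v:0' or 'a:0') in seconds."""
    out = subprocess.check_output([
        'ffprobe', '-v', 'error', '-select_streams', stream,
        '-show_entries', 'stream=start_time',
        '-of', 'default=noprint_wrappers=1:nokey=1', clip])
    return float(out.decode().strip())

def shift_labels(clip, label_file, out_file):
    # offset between the video and audio stream start times
    # (sign convention may need checking against your clips)
    offset = stream_start_time(clip, 'v:0') - stream_start_time(clip, 'a:0')
    with open(label_file) as f, open(out_file, 'w') as g:
        for line in f:
            start, end, phoneme = line.split()
            g.write('%f %f %s\n' % (float(start) + offset,
                                    float(end) + offset, phoneme))
```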

My problem: there are no viseme label files in the database.

Starting to look at audio speech recognition:

https://yerevann.github.io/2016/06/26/combining-cnn-and-rnn-for-spoken-language-identification/

Both RNNs and CNNs were trained using adadelta for a few epochs, then by SGD with momentum (0.003 or 0.0003) until overfitting. If SGD with momentum is applied from the very beginning, the convergence is very slow. Adadelta converges faster but usually doesn’t reach high validation accuracy.
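
A minimal sketch of that two-phase schedule in Lasagne/Theano (the tiny stand-in network, data shapes and exact hyperparameters are placeholders):

```python
import theano
import theano.tensor as T
import lasagne

X = T.matrix('X')
y = T.ivector('y')

# stand-in network; replace with the real CNN/RNN
net = lasagne.layers.InputLayer((None, 100), input_var=X)
net = lasagne.layers.DenseLayer(net, num_units=39,
                                nonlinearity=lasagne.nonlinearities.softmax)

loss = lasagne.objectives.categorical_crossentropy(
    lasagne.layers.get_output(net), y).mean()
params = lasagne.layers.get_all_params(net, trainable=True)

# phase 1: Adadelta for the first few epochs (fast initial convergence)
train_adadelta = theano.function(
    [X, y], loss, updates=lasagne.updates.adadelta(loss, params))

# phase 2: SGD with momentum (learning rate 0.003 or 0.0003) until overfitting
train_sgd = theano.function(
    [X, y], loss,
    updates=lasagne.updates.momentum(loss, params, learning_rate=0.003))
```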

TCD-TIMIT: counting the amount of data in the dataset

Only GRID and VidTIMIT are currently available. GRID is larger than VidTIMIT (1000 sentences vs. 430) and filmed at a higher resolution, but its vocabulary (51 words) is much smaller. TCD-TIMIT was designed to provide:

  • A large number of speakers.
  • Continuous speech with good coverage of phonemes and visemes.
  • Available to other researchers.
  • High-quality recordings.
  • Gender-balanced speaker set.

Data in TCD-TIMIT:

  • 2255 sentences from TIMIT
  • 59 volunteers, 3 professional lipspeakers
  • Volunteers say 98 sentences each, while the lipspeakers say 377 sentences each
For the whole database:
{'iy': 9322, 'aa': 8046, 'ch': 1433, 'ae': 5494, 'eh': 5852, 'ah': 25070, 'ih': 10692, 'ey': 4020, 'aw': 1344, 'ay': 3609, 'uh': 1246, 'er': 6567, 'ng': 2216, 'r': 11153, 'th': 627, 'sil': 19324, 'oy': 777, 'dh': 2889, 'y': 2391, 'hh': 3362, 'jh': 1710, 'b': 4654, 'g': 2881, 'f': 4230, 'k': 8285, 'm': 6831, 'l': 10698, 'n': 14180, 'p': 5392, 's': 11257, 'sh': 2609, 't': 14820, 'w': 4469, 'v': 3831, 'ow': 3224, 'z': 7650, 'uw': 3543}
Ten least often occurring: [('th', 627), ('oy', 777), ('uh', 1246), ('aw', 1344), ('ch', 1433), ('jh', 1710), ('ng', 2216), ('y', 2391), ('sh', 2609), ('g', 2881)]
Ten most often occurring: [('k', 8285), ('iy', 9322), ('ih', 10692), ('l', 10698), ('r', 11153), ('s', 11257), ('n', 14180), ('t', 14820), ('sil', 19324), ('ah', 25070)]
Total number of phonemes in database: 	 235698
Avg number of images per phoneme: 6043
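
The counts above could be produced with a script along these lines, assuming TIMIT-style label files (one "start end phoneme" triple per line) and a hypothetical database layout:

```python
import os
from collections import Counter

counts = Counter()
for root, dirs, files in os.walk('TCDTIMIT'):          # hypothetical root dir
    for name in files:
        if name.endswith('.phn'):                      # assumed label extension
            with open(os.path.join(root, name)) as f:
                for line in f:
                    parts = line.split()
                    if len(parts) == 3:
                        counts[parts[2]] += 1

print('Ten least often occurring:', counts.most_common()[:-11:-1])
print('Ten most often occurring:', counts.most_common(10)[::-1])
print('Total number of phonemes in database:', sum(counts.values()))
print('Avg number of occurrences per phoneme:', sum(counts.values()) // len(counts))
```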

Trying to extract frames in MATLAB myself.

Use ffmpeg directly (a sketch of steps 1-3 follows the list):
  1. extract all frames: mkdir frames; ffmpeg -i videoName.mp4 frames/%d.jpg
  2. calculate needed frames from video labels
  3. throw away all non-needed frames
  4. compress
  5. extract face
  6. extract mouth
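
A rough sketch of steps 1-3 using ffmpeg from Python (frame rate, paths and label format are assumptions; face and mouth extraction are separate steps):

```python
import os
import subprocess

def extract_frames(video, out_dir):
    """Step 1: dump every frame of the clip as numbered JPEGs."""
    os.makedirs(out_dir)
    subprocess.check_call(['ffmpeg', '-i', video, os.path.join(out_dir, '%d.jpg')])

def needed_frames(label_file, fps=29.97):
    """Step 2: keep one frame per phoneme, taken at the middle of its interval."""
    keep = set()
    with open(label_file) as f:
        for line in f:
            start, end, phoneme = line.split()
            mid = (float(start) + float(end)) / 2.0
            keep.add(int(round(mid * fps)) + 1)   # ffmpeg numbers frames from 1
    return keep

def prune_frames(out_dir, keep):
    """Step 3: delete all frames that are not needed."""
    for name in os.listdir(out_dir):
        if int(os.path.splitext(name)[0]) not in keep:
            os.remove(os.path.join(out_dir, name))
```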