
week 07.11 13.11

Matthijs Van keirsbilck edited this page Mar 29, 2017 · 1 revision

The .mat files containing the ROI for each frame of the video don't actually contain all the frames: some frames at the beginning and end are left out. I fixed this by adding some if/else cases to the script. The first phoneme is extracted at the very first frame available in the .mat file; all following phonemes are extracted at time startTime * (1-t) + endTime * t, except for the last one, which is extracted at the last available frame in the .mat file.
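The frame selection described above can be sketched roughly as follows. This is a minimal illustration, not the actual script: the function name, the `fps` value, the default `t = 0.5`, and the clamping of middle phonemes to the available frame range are all assumptions made for the example.

```python
def extraction_frames(phonemes, first_frame, last_frame, fps=25.0, t=0.5):
    """Pick one frame index per phoneme.

    phonemes:    list of (start_time, end_time) tuples in seconds (hypothetical format)
    first_frame: index of the first frame present in the .mat file
    last_frame:  index of the last frame present in the .mat file
    fps:         assumed video frame rate
    t:           assumed interpolation fraction between 0 (start) and 1 (end)
    """
    frames = []
    n = len(phonemes)
    for i, (start, end) in enumerate(phonemes):
        if i == 0:
            # First phoneme: use the first frame available in the .mat file.
            frame = first_frame
        elif i == n - 1:
            # Last phoneme: use the last frame available in the .mat file.
            frame = last_frame
        else:
            # Other phonemes: interpolate between start and end time.
            time = start * (1 - t) + end * t
            frame = int(round(time * fps))
            # Assumption: clamp to the frames that actually exist.
            frame = min(max(frame, first_frame), last_frame)
        frames.append(frame)
    return frames
```

With `t = 0.5` the middle phonemes are sampled at the midpoint of their time label; a smaller `t` moves the sample towards the start of the label.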

There appear to be a few bad phoneme time labels. For example, in the first video, 'sha1.mp4', phonemes 22-24 are labeled (iy, s, iy), but the last iy is actually a 'w', even at the start time of the label.

  • idea: for converting labels to classes, don't use 1 class per label, but a hierarchy (vowels, fricatives, ...)

TODO: convert the preprocessed .mat files to an input matrix for ResNet152, and convert the labels to numbers according to phonemeLabelConversion.txt in the Imagespeech folder.
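The label conversion part of this TODO could look something like the sketch below. The layout of phonemeLabelConversion.txt is not described here, so the example assumes one whitespace-separated `phoneme number` pair per line; both function names are hypothetical.

```python
def parse_label_map(lines):
    """Build a phoneme -> number mapping.

    Assumption: each useful line looks like 'iy 1'; anything else is skipped.
    """
    mapping = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 2 or not parts[1].isdigit():
            continue  # skip blank lines, headers, or unexpected formats
        phoneme, number = parts
        mapping[phoneme] = int(number)
    return mapping


def labels_to_numbers(labels, mapping):
    """Convert a list of phoneme labels to their numeric classes."""
    return [mapping[label] for label in labels]
```

In practice the lines would come from the file, for instance via `open(path).read().splitlines()`, where `path` points to phonemeLabelConversion.txt.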

  • idea: phoneme labels not 1-39, but hierarchical (so first classify on 'vowel', 'fricative', ..., and only then on the actual phoneme). The ConvNet probably does this internally already...
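One possible way to represent such hierarchical labels is a two-level tuple of (broad class, phoneme). The sketch below is only an illustration: the `BROAD_CLASS` table covers a handful of phonemes rather than all 39, and the names are hypothetical.

```python
# Partial, illustrative mapping from phoneme to broad class (not the full set of 39).
BROAD_CLASS = {
    "iy": "vowel",
    "aa": "vowel",
    "s": "fricative",
    "sh": "fricative",
    "w": "semivowel",
}


def hierarchical_label(phoneme):
    """Return (broad_class, phoneme) so a classifier can target either level."""
    return (BROAD_CLASS[phoneme], phoneme)


def group_by_class(phonemes):
    """Group a list of phonemes by their broad class, keeping first-seen order."""
    groups = {}
    for p in phonemes:
        groups.setdefault(BROAD_CLASS[p], []).append(p)
    return groups
```

A network could then be trained on the broad class first and on the phoneme within that class afterwards, or on both targets jointly.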