week 07.11 13.11

The .mat files containing the ROI for each frame of the video don't actually contain all the frames: some frames at the beginning and end are left out. I fixed this by adding some if/else cases. The first phoneme is now extracted at the very first frame available in the .mat file, the last phoneme is extracted at the last available frame, and all phonemes in between are extracted at startTime * (1-t) + endTime * t.
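A minimal sketch of the frame selection described above, in Python. Everything except the interpolation formula is an assumption for illustration: the function name `extraction_frame`, times given in seconds, a known frame rate `fps`, and the first/last frame indices available in the .mat file. The final clamp is an extra safety net that the notes do not mention.

```python
def extraction_frame(start_time, end_time, t, fps, first_avail, last_avail,
                     is_first=False, is_last=False):
    """Return the frame index at which to extract a phoneme's ROI.

    Hypothetical helper: times are assumed to be in seconds and
    first_avail/last_avail are the first and last frame indices
    actually present in the .mat file.
    """
    if is_first:
        # First phoneme: use the first frame available in the .mat file.
        return first_avail
    if is_last:
        # Last phoneme: use the last frame available in the .mat file.
        return last_avail
    # All other phonemes: interpolate between label start and end.
    time = start_time * (1 - t) + end_time * t
    frame = int(round(time * fps))
    # Assumption: clamp to the available range in case a middle phoneme
    # still falls on a missing frame.
    return max(first_avail, min(frame, last_avail))
```

With t = 0.5 the frame is taken at the middle of the label; t = 0 takes it at the start of the label and t = 1 at the end.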

There appear to be a few bad phoneme time labels, e.g. in the first video 'sha1.mp4', phonemes 22-24 (iy, s, iy): the last iy is actually a 'w', even at the start time of the label.

idea: for converting labels to classes, don't do 1 class per label, but a hierarchy (vowels, fricatives,...)

TODO: convert the preprocessed MAT files to an input matrix for ResNet152; take the labels and convert them to numbers according to phonemeLabelConversion.txt in the Imagespeech folder
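A rough sketch of the label-to-number step from the TODO above. The format of phonemeLabelConversion.txt is not described in these notes, so this assumes one whitespace-separated "phoneme number" pair per line; the function names `parse_label_map` and `labels_to_numbers` are made up for illustration.

```python
def parse_label_map(lines):
    """Build a phoneme -> number dict from conversion-table lines.

    Assumption: each useful line looks like "iy 1". Lines that do not
    match this shape (blank lines, headers) are skipped.
    """
    mapping = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 2 or not parts[1].isdigit():
            continue
        phoneme, number = parts
        mapping[phoneme] = int(number)
    return mapping


def labels_to_numbers(labels, mapping):
    """Convert a sequence of phoneme labels to their class numbers."""
    return [mapping[label] for label in labels]
```

The lines can come straight from the file, e.g. `with open(path) as f: mapping = parse_label_map(f)`, where `path` points at the conversion table.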

idea: phoneme labels not 1-39, but hierarchical (so first classify on 'vowel', 'fricative', ..., and only then on the actual phoneme). Probably the ConvNet does this internally already...
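The hierarchical labelling idea could look roughly like this. The table covers only a handful of phonemes as an illustration, not the full set of 39, and the names `PHONEME_TO_CLASS`, `CLASSES` and `hierarchical_target` as well as the numbering scheme are assumptions, not something already in the project.

```python
# Partial, illustrative mapping from phoneme to broad phonetic class.
PHONEME_TO_CLASS = {
    "iy": "vowel", "ih": "vowel", "aa": "vowel",
    "s": "fricative", "z": "fricative", "f": "fricative",
    "p": "stop", "t": "stop", "k": "stop",
    "m": "nasal", "n": "nasal",
    "w": "semivowel", "y": "semivowel",
}

# Broad classes in a fixed (alphabetical) order so indices are stable.
CLASSES = sorted(set(PHONEME_TO_CLASS.values()))


def hierarchical_target(phoneme):
    """Return (broad class index, index of the phoneme within that class)."""
    broad = PHONEME_TO_CLASS[phoneme]
    siblings = sorted(p for p, c in PHONEME_TO_CLASS.items() if c == broad)
    return CLASSES.index(broad), siblings.index(phoneme)
```

A classifier would then predict the first number (vowel, fricative, ...) and, given that, the second number (the actual phoneme).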
