Week 07.11 - 13.11
The .mat files containing the ROI for each frame of the video don't actually contain all the frames: they leave out some frames at the beginning and end. I fixed this issue by adding some if/else cases, so the script extracts the very first frame of the .mat file; then all phonemes are extracted at time startTime * (1-t) + endTime * t, except for the last one. The last phoneme is extracted at the last available frame in the .mat file.
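A minimal sketch of the frame selection described above, not the actual script. The function names, the `(start, end)` tuple format, the `fps` parameter, and the clamping to the available frame range are all assumptions made for illustration.

```python
def extraction_time(start_time, end_time, t=0.5):
    """Linear interpolation between a phoneme's start and end time."""
    return start_time * (1 - t) + end_time * t


def frames_to_extract(phonemes, first_frame, last_frame, fps, t=0.5):
    """Return one frame index per phoneme.

    phonemes: list of (start_time, end_time) in seconds (assumed format).
    first_frame, last_frame: first and last frame indices present in the .mat file.
    """
    frames = []
    for i, (start, end) in enumerate(phonemes):
        if i == len(phonemes) - 1:
            # The last phoneme uses the last available frame.
            frame = last_frame
        else:
            frame = int(round(extraction_time(start, end, t) * fps))
            # Assumption: frames missing from the .mat file are replaced
            # by the nearest available one.
            frame = max(first_frame, min(frame, last_frame))
        frames.append(frame)
    return frames
```

With `t = 0.5` each phoneme is sampled at the midpoint of its label; a phoneme whose midpoint falls before the first stored frame ends up on that first frame.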
There appear to be a few bad phoneme time labels. For example, in the first video 'sha1.mp4', phonemes 22-24 are labeled (iy, s, iy), but the last iy is actually a 'w', even at the start time of the label.
- idea: for converting labels to classes, don't do 1 class per label, but a hierarchy (vowels, fricatives,...)
TODO: convert the preprocessed MAT files to an input matrix for ResNet152; take the labels and convert them to numbers according to phonemeLabelConversion.txt in the Imagespeech folder
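A possible sketch of the label conversion step. The layout of phonemeLabelConversion.txt is not described here, so this assumes one whitespace-separated `phoneme number` pair per line; `parse_label_map` and `labels_to_classes` are hypothetical helper names.

```python
def parse_label_map(lines):
    """Build a phoneme -> class number dict.

    Assumed line format: '<phoneme> <number>'. Other lines are skipped.
    """
    mapping = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # blank or unexpected line
        phoneme, number = parts
        mapping[phoneme] = int(number)
    return mapping


def labels_to_classes(labels, mapping):
    """Convert a list of phoneme labels to their class numbers."""
    return [mapping[label] for label in labels]
```

In practice the lines would come from the file itself, for example `parse_label_map(open("phonemeLabelConversion.txt"))`, with the path adjusted to wherever the file lives.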
- idea: phoneme labels not 1-39, but hierarchical (so first classify on 'vowel', 'fricative', ..., and only then on the actual phoneme). Probably the ConvNet does this internally already...
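One way the hierarchical labels could look, as a sketch only. The `BROAD_CLASS` table is illustrative and covers just a handful of phonemes; the full grouping of the 39 labels is left open, and `hierarchical_label` is a hypothetical helper.

```python
# Illustrative grouping only; extend to the full phoneme set as needed.
BROAD_CLASS = {
    "iy": "vowel",
    "aa": "vowel",
    "s": "fricative",
    "sh": "fricative",
    "w": "semivowel",
}


def hierarchical_label(phoneme):
    """Return (broad class, phoneme), so a classifier can predict the
    broad class first and the actual phoneme second."""
    return (BROAD_CLASS[phoneme], phoneme)
```

With this mapping, confusing 'iy' with 'aa' is a smaller error than confusing 'iy' with 's', which a flat 1-39 labeling cannot express.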