April: combinedSR

Issues with TCDTIMIT database:

some of the labels were wrong (duplicated) -> the same label appeared twice in a row.
sometimes the face/mouth was not detected = missing input image -> use a hierarchy of detectors, plus default variables.
much improved speed: no longer loading data when it is not needed + better multithreading + only doing what needs to be done.
mouth images are cropped a bit more tightly around the mouth.
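
A minimal sketch of the detector hierarchy idea: try each detector in order and fall back to a default box when all of them fail. The detector functions, the box format `(x, y, w, h)` and `default_box` are hypothetical stand-ins, not the actual preprocessing code.

```python
def detect_with_fallback(image, detectors, default_box):
    """Try each detector in order; fall back to default_box if all fail."""
    for name, detect in detectors:
        box = detect(image)
        if box is not None:
            return name, box
    return "default", default_box


# Hypothetical stand-ins for real detectors: each returns (x, y, w, h) or None.
def strict_detector(image):
    return None  # pretend the strict detector misses this frame


def lenient_detector(image):
    return (10, 20, 40, 30)


detectors = [("strict", strict_detector), ("lenient", lenient_detector)]
source, box = detect_with_fallback(
    "frame_0001", detectors, default_box=(0, 0, 120, 120)
)
print(source, box)
```

With this structure every frame yields some crop, so no input image goes missing; which detector produced it is returned as well, so the fallback cases can be counted.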

More dataset preprocessing:

FileDirOps copies the needed files from processed/ to some other dir and rearranges them so they're in the right folder structure. It also copies the label files and renames each image so that its name contains its label. If 2 phonemes were mapped to the same frame, there was an error. Also fixed.
for combinedSR, you need to manually copy the .wav and .phn audio files to the extraction dir of FileDirOps, where the lip images are stored.
then run datasetToPkl. If you don't, the files are generated automatically while training. It's possible to store those files so they don't have to be regenerated each epoch.
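
The "store so it isn't regenerated each epoch" idea could look roughly like this. It is a sketch only: `load_or_build`, `build_dummy_dataset` and the file name are made up for illustration and are not the datasetToPkl code.

```python
import os
import pickle
import tempfile


def load_or_build(pkl_path, build_fn):
    """Load the pickled dataset if it exists, otherwise build and store it."""
    if os.path.exists(pkl_path):
        with open(pkl_path, "rb") as f:
            return pickle.load(f)
    data = build_fn()
    with open(pkl_path, "wb") as f:
        pickle.dump(data, f, protocol=pickle.HIGHEST_PROTOCOL)
    return data


def build_dummy_dataset():
    # Hypothetical placeholder for the real (slow) preprocessing step.
    return {"images": [[0, 1], [2, 3]], "labels": ["aa", "b"]}


cache_dir = tempfile.mkdtemp()
pkl_path = os.path.join(cache_dir, "speaker01.pkl")
first = load_or_build(pkl_path, build_dummy_dataset)   # builds and stores
second = load_or_build(pkl_path, build_dummy_dataset)  # loads from disk
```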

fix evaluateImage. Now can get top-k accuracy score etc. Can also be evaluated on many images.
fix networks. Now possible to use any network.
function to print CNN network structure.
restructure train_lipreading so it's more like audioSR.
add more customization parameters
loadPerSpeaker: reduce RAM usage by only loading data from 1 speaker at a time.
this works for volunteers or combined, but not for lipspeakers only (because there is no separate test set there; the test data is a bit from each speaker).
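
Rough sketch of the per-speaker loading pattern, using a generator so only one speaker's data is held at a time. `iterate_per_speaker`, `fake_load` and the file names are hypothetical, and the real loadPerSpeaker option may work differently.

```python
def iterate_per_speaker(speaker_files, load_fn):
    """Yield (path, data) for one speaker at a time."""
    for path in speaker_files:
        data = load_fn(path)  # loaded only when this speaker is reached
        yield path, data
        # Drop the generator's own reference; the data can be garbage-collected
        # once the caller no longer holds it either.
        del data


def fake_load(path):
    # Hypothetical stand-in for unpickling one speaker's file.
    return list(range(3))


total = 0
for path, data in iterate_per_speaker(["s1.pkl", "s2.pkl"], fake_load):
    total += len(data)
print(total)
```
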
performance not so good on Volunteers because of unknown speakers. Add regularization -> about 20-25% accuracy top1, 40-45% top3.
performance on the lipspeakers is about 35% top1, 55% top3.
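
For reference, top-k accuracy as used above can be computed like this. It is a NumPy sketch with made-up probabilities, not the evaluateImage code.

```python
import numpy as np


def top_k_accuracy(probs, targets, k):
    """Fraction of rows whose target class is among the k highest scores."""
    top_k = np.argsort(probs, axis=1)[:, -k:]        # indices of the k best classes
    hits = (top_k == targets[:, None]).any(axis=1)   # is the target among them?
    return hits.mean()


# Toy example: 4 images, 3 classes (the real network has 39 phonemes).
probs = np.array([
    [0.10, 0.60, 0.30],
    [0.50, 0.20, 0.30],
    [0.20, 0.30, 0.50],
    [0.40, 0.35, 0.25],
])
targets = np.array([1, 2, 0, 1])

top1 = top_k_accuracy(probs, targets, k=1)
top2 = top_k_accuracy(probs, targets, k=2)
print(top1, top2)
```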

audio network -> take valid features before denseLayers. shape: (nbImages, lstmUnits)
lipreading network -> take features before denseLayers. shape: (nbImages, 512x7x7)
concatenate them
add some FC layers
Softmax to 39 phonemes

(for audio: only valid features -> lasagne SliceLayer)
many issues with shape and reshaping...
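
Shape walkthrough of the fusion steps above, as a plain NumPy forward pass with random weights. This only illustrates the shapes; the actual network is built with lasagne layers. The sizes (`lstm_units`, `hidden`), the valid frame indices and the single ReLU FC layer are assumptions for the example.

```python
import numpy as np


def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def fuse(audio_seq, valid_frames, lip_feats, params):
    audio_valid = audio_seq[valid_frames]                  # (nbImages, lstmUnits)
    lip_flat = lip_feats.reshape(lip_feats.shape[0], -1)   # (nbImages, 512*7*7)
    x = np.concatenate([audio_valid, lip_flat], axis=1)    # (nbImages, lstmUnits + 25088)
    h = np.maximum(0.0, np.dot(x, params["W1"]) + params["b1"])  # one FC layer, ReLU
    return softmax(np.dot(h, params["W2"]) + params["b2"])       # (nbImages, 39)


rng = np.random.RandomState(0)
nb_audio_frames, nb_images = 40, 5
lstm_units, hidden, nb_phonemes = 64, 32, 39   # assumed sizes, except 39 phonemes

audio_seq = rng.standard_normal((nb_audio_frames, lstm_units))  # LSTM output per audio frame
valid_frames = np.array([3, 11, 19, 27, 35])                    # one audio frame per image
lip_feats = rng.standard_normal((nb_images, 512, 7, 7))         # CNN features per image

in_dim = lstm_units + 512 * 7 * 7
params = {
    "W1": rng.standard_normal((in_dim, hidden)) * 0.01,
    "b1": np.zeros(hidden),
    "W2": rng.standard_normal((hidden, nb_phonemes)) * 0.01,
    "b2": np.zeros(nb_phonemes),
}

probs = fuse(audio_seq, valid_frames, lip_feats, params)
print(probs.shape)
```

The key constraint is that both feature matrices have the same first dimension (nbImages) before concatenation, which is why the audio output has to be sliced down to the valid frames first.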

features:
- load lipreading, audioSR networks with pretrained nets.
- train only lipreading, train only audio, train everything
- output from lip only, audio only, everything
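
One possible way to organize the training modes is to pick which parameter groups get updated. This is a sketch under assumptions: the group names, the `select_trainable` helper, and the choice that "train only X" leaves the fusion layers untouched are all hypothetical, not taken from the actual code.

```python
def select_trainable(params_by_subnet, mode):
    """Return the parameters to update for a given training mode."""
    groups = {
        "lipreading": ["lipreading"],
        "audio": ["audio"],
        "all": ["lipreading", "audio", "fusion"],
    }
    if mode not in groups:
        raise ValueError("unknown mode: %r" % (mode,))
    selected = []
    for subnet in groups[mode]:
        selected.extend(params_by_subnet[subnet])
    return selected


# Hypothetical parameter names standing in for real shared variables.
params_by_subnet = {
    "lipreading": ["cnn.W", "cnn.b"],
    "audio": ["lstm.W", "lstm.b"],
    "fusion": ["fc.W", "fc.b"],
}

print(select_trainable(params_by_subnet, "audio"))
print(len(select_trainable(params_by_subnet, "all")))
```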
