April: combinedSR
matthijs van keirsbilck edited this page Apr 30, 2017 · 1 revision
Issues with TCDTIMIT database:
- some of the labels were wrong (duplicated): the same label appeared twice in a row.
- sometimes the face/mouth was not recognized, which left a missing input image -> use a hierarchy of detectors, plus default variables
- much improved speed: data is no longer constantly loaded when it isn't needed, multithreading is better, and only the work that needs to be done is done
- mouth images are cropped a bit more tightly around the mouth.
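
The first two fixes above can be sketched roughly as follows. This is a minimal illustration, not the repository's code: `drop_consecutive_duplicates`, `detect_with_fallback`, the `(frame, phoneme)` label format and the `(x, y, w, h)` box format are all hypothetical names and assumptions, and treating the default variables as the last-resort fallback is also an assumption.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Label = Tuple[int, str]              # assumed format: (frame number, phoneme)
Box = Tuple[int, int, int, int]      # assumed format: (x, y, width, height)
Detector = Callable[[object], Optional[Box]]


def drop_consecutive_duplicates(labels: Sequence[Label]) -> List[Label]:
    """Remove a label that is identical to the one right before it."""
    cleaned: List[Label] = []
    for label in labels:
        if not cleaned or cleaned[-1] != label:
            cleaned.append(label)
    return cleaned


def detect_with_fallback(frame, detectors: Sequence[Detector], default_box: Box) -> Box:
    """Try each detector in order; fall back to a default box if all fail."""
    for detect in detectors:
        box = detect(frame)
        if box is not None:
            return box
    # No detector found a mouth: use the default so no input image goes missing.
    return default_box


if __name__ == "__main__":
    print(drop_consecutive_duplicates([(10, "sil"), (10, "sil"), (14, "ah")]))

    # Stand-in detectors; a real hierarchy would wrap actual face/mouth detectors.
    strict = lambda frame: None
    lenient = lambda frame: (40, 60, 32, 24)
    print(detect_with_fallback("frame", [strict, lenient], (0, 0, 120, 120)))
```

Ordering the detectors from most to least precise means the less reliable ones only run when the better ones fail.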
More dataset preprocessing:
- FileDirOps copies the files needed from processed/ to some other dir. It also rearranges them so they're in the right folder structure. It copies the label files, and renames each image so that its name contains its label. If 2 phonemes were mapped to the same frame, there was an error; this is also fixed.
- for combinedSR, you need to manually copy the .wav and .phn audio files to the FileDirOps extraction dir where the lip images are stored.
- then run datasetToPkl. If you don't, the files are generated automatically while training. It's possible to store those files so they don't have to be regenerated each epoch.
- fix evaluateImage. It can now report top-k accuracy scores etc., and can also be evaluated on many images.
- fix networks. It is now possible to use any network.
- function to print CNN network structure.
- restructure train_lipreading so it's more like audioSR.
- add more customization parameters
- loadPerSpeaker: reduce RAM usage by only loading data from 1 speaker at a time.
  this works for volunteers or combined, but not for lipspeakers only (because there's no separate test set there; the test data is a bit from each speaker).
- performance is not so good on Volunteers because the test speakers are unknown. Add regularization -> 20-25% accuracy top1, 40-45% top3.
- performance on the lipspeakers is about 35% top1, 55% top3.
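
For reference, the top1/top3 numbers above are top-k accuracies: a prediction counts as correct if the true phoneme is among the k highest-scoring classes. A minimal NumPy sketch of that metric (the function name `top_k_accuracy` and the toy scores are made up for illustration; this is not the evaluateImage code):

```python
import numpy as np


def top_k_accuracy(scores, targets, k=1):
    """Fraction of rows whose true class is among the k highest scores.

    scores:  (n_samples, n_classes) network outputs, e.g. softmax probabilities
    targets: (n_samples,) integer class labels
    """
    scores = np.asarray(scores)
    targets = np.asarray(targets)
    # Indices of the k largest scores per row (their order inside the top k is irrelevant).
    top_k = np.argpartition(scores, -k, axis=1)[:, -k:]
    hits = (top_k == targets[:, None]).any(axis=1)
    return float(hits.mean())


if __name__ == "__main__":
    scores = [[0.1, 0.7, 0.2],
              [0.5, 0.3, 0.2],
              [0.2, 0.3, 0.5]]
    targets = [1, 1, 0]
    print(top_k_accuracy(scores, targets, k=1))  # only the first row is a top-1 hit
    print(top_k_accuracy(scores, targets, k=2))  # the second row becomes a hit as well
```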
- audio network -> take the valid features before the denseLayers. shape: (nbImages, lstmUnits)
- lipreading network -> take the features before the denseLayers. shape: (nbImages, 512x7x7)
- concatenate them
- add some FC layers
- Softmax to 39 phonemes
(for audio: only valid features -> lasagne SliceLayer)
many issues with shape and reshaping...
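
Since most of the trouble was with shapes, here is a shape-level sketch of the steps above in plain NumPy rather than Lasagne. It is only meant to show the reshaping and concatenation. The random weights, the 64 LSTM units, the 256 hidden units, and the idea that `valid_idx` selects one audio timestep per image are assumptions for the example, and `fuse` is a hypothetical name.

```python
import numpy as np

N_PHONEMES = 39  # number of output classes, from the notes above


def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # subtract the row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def fuse(audio_seq, valid_idx, lip_maps, hidden_units=256, seed=0):
    """Concatenate audio and lip features, then apply one FC layer and a softmax."""
    rng = np.random.default_rng(seed)
    # Keep only the valid audio timesteps (the role SliceLayer plays in the notes).
    audio_feats = audio_seq[valid_idx]                        # (nbImages, lstmUnits)
    # Flatten the conv feature maps so both inputs are 2D.
    lip_feats = lip_maps.reshape(lip_maps.shape[0], -1)       # (nbImages, 512*7*7)
    combined = np.concatenate([audio_feats, lip_feats], axis=1)
    # Untrained stand-in weights: one hidden FC layer with ReLU, then the output layer.
    w1 = rng.standard_normal((combined.shape[1], hidden_units)) * 0.01
    hidden = np.maximum(0.0, combined @ w1)
    w2 = rng.standard_normal((hidden_units, N_PHONEMES)) * 0.01
    return combined, softmax(hidden @ w2)


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    audio_seq = rng.standard_normal((50, 64))       # 50 audio timesteps, 64 LSTM units
    valid_idx = np.array([5, 17, 29, 41])           # one valid timestep per image
    lip_maps = rng.standard_normal((4, 512, 7, 7))  # conv output for 4 mouth images
    combined, probs = fuse(audio_seq, valid_idx, lip_maps)
    print(combined.shape, probs.shape)              # (4, 25152) (4, 39)
```

The key constraint is that both feature arrays must share the same first dimension (nbImages) before concatenating, which is why the audio features have to be reduced to the valid timesteps first.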
Features:
- load lipreading, audioSR networks with pretrained nets.
- train only lipreading, train only audio, train everything
- output from lip only, audio only, everything
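
One way to picture the three training modes is as a choice of which parameter groups get updated. The sketch below is hypothetical: the mode names, the `SUBNETS` grouping, and the choice to update the fusion (FC) layers only when training everything are assumptions, not the actual combinedSR options.

```python
SUBNETS = ("lipreading", "audio", "fusion")  # hypothetical parameter groups

# Which groups are updated in each mode (assumed mapping).
TRAIN_MODES = {
    "lipreading": {"lipreading"},
    "audio": {"audio"},
    "all": set(SUBNETS),
}


def select_params(params_by_subnet, mode):
    """Return the flat list of parameters to update for the given mode."""
    if mode not in TRAIN_MODES:
        raise ValueError("unknown mode %r, expected one of %s" % (mode, sorted(TRAIN_MODES)))
    active = TRAIN_MODES[mode]
    # Iterate over SUBNETS so the parameter order is deterministic.
    return [p for name in SUBNETS if name in active for p in params_by_subnet[name]]


if __name__ == "__main__":
    params = {"lipreading": ["W_conv", "b_conv"], "audio": ["W_lstm"], "fusion": ["W_fc"]}
    print(select_params(params, "audio"))  # ['W_lstm']
    print(select_params(params, "all"))    # ['W_conv', 'b_conv', 'W_lstm', 'W_fc']
```

In a Lasagne setup the selected list would be what gets passed to the update rule, so the pretrained weights of the other subnetworks stay untouched.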