
April: combinedSR


Fixing dataset

Issues with TCDTIMIT database:

  • some labels were wrong (duplicated) -> the same label appeared twice in a row.
  • sometimes the face/mouth was not detected, which meant a missing input image -> use a hierarchy of detectors, plus default values as a fallback (see the sketch after this list).
  • much improved speed: data is no longer loaded when it isn't needed, multithreading is better, and only the necessary work is done.
  • mouth images are cropped a bit tighter around the mouth.
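The detector fallback mentioned above can be sketched as follows. This is a minimal illustration using dlib's HOG face detector with an OpenCV Haar cascade and the previous frame's box as fallbacks; the detector choice, cascade path, and function names are assumptions, not the exact code of this repo.

```python
import cv2
import dlib

dlib_detector = dlib.get_frontal_face_detector()
# path to OpenCV's Haar cascade file; adjust to your installation (assumption)
haar = cv2.CascadeClassifier('haarcascade_frontalface_default.xml')

def detect_face(gray, prev_box=None):
    """Try detectors in order; fall back to the previous frame's box."""
    # 1. dlib HOG detector (usually the most reliable)
    rects = dlib_detector(gray, 1)
    if len(rects) > 0:
        r = rects[0]
        return (r.left(), r.top(), r.right(), r.bottom())
    # 2. OpenCV Haar cascade as a fallback
    faces = haar.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) > 0:
        x, y, w, h = faces[0]
        return (x, y, x + w, y + h)
    # 3. default: reuse the previous frame's box (or a fixed default region)
    return prev_box
```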

More dataset preprocessing:

  • FileDirOps copies the needed files from processed/ to another directory and rearranges them into the right folder structure. It copies the label files and renames the images so their names contain their label. If two phonemes were mapped to the same frame, there was an error; that is also fixed.
  • for combinedSR, you need to manually copy the .wav and .phn audio files to the extraction dir of FileDirOps where the lip images were stored.
  • then run datasetToPkl. If you don't, the files are generated automatically while training. It's possible to store those files so they don't have to be regenerated each epoch (a caching sketch follows below).
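A minimal sketch of that caching idea, assuming plain pickle files; the helper name and file layout are illustrative, not the repo's datasetToPkl code.

```python
import os
import pickle

def load_or_build(pkl_path, build_fn):
    """Return cached data if the pkl exists, otherwise build it once and store it."""
    if os.path.exists(pkl_path):
        with open(pkl_path, 'rb') as f:
            return pickle.load(f)
    data = build_fn()  # e.g. read mouth images + .phn labels into arrays (assumption)
    with open(pkl_path, 'wb') as f:
        pickle.dump(data, f, protocol=pickle.HIGHEST_PROTOCOL)
    return data
```

With this pattern the expensive preprocessing runs only on the first epoch (or first run); later epochs just load the pkl.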

Improving Lipreading

  • fix evaluateImage. It can now report top-k accuracy etc., and can be evaluated on many images at once (see the sketch after this list).
  • fix the networks. It is now possible to use any network.
  • function to print CNN network structure.
  • restructure train_lipreading so it's more like audioSR.
  • add more customization parameters
  • loadPerSpeaker: reduce RAM usage by only loading data from 1 speaker at a time.
    This works for volunteers or combined, but not for lipspeakers only (because there is no separate test set there; the test data is a bit from each speaker).
  • performance is not so good on the volunteers because of unknown speakers. Add regularization -> 20-25% top-1 accuracy, 40-45% top-3.
  • performance on the lipspeakers is about 35% top-1, 55% top-3.
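The top-k evaluation mentioned in the first item can be sketched as below, assuming the network's softmax outputs are available as a numpy array; the names are illustrative, not the actual evaluateImage code.

```python
import numpy as np

def topk_accuracy(probs, labels, k=3):
    """probs: (nbImages, nbClasses) softmax outputs; labels: (nbImages,) int class indices."""
    # indices of the k highest-probability classes per image
    topk = np.argsort(probs, axis=1)[:, -k:]
    # a hit if the true label is among those k predictions
    hits = np.any(topk == labels[:, None], axis=1)
    return hits.mean()

# top1 = topk_accuracy(probs, labels, k=1); top3 = topk_accuracy(probs, labels, k=3)
```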

CombinedSR

  1. audio network -> take valid features before denseLayers. shape: (nbImages, lstmUnits)
  2. lipreading network -> take features before denseLayers. shape: (nbImages, 512x7x7)
  3. concatenate them
  4. add some FC layers
  5. Softmax to 39 phonemes

(for audio: only valid features -> lasagne SliceLayer)
many issues with shape and reshaping...
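A minimal Lasagne sketch of steps 1-5, assuming hypothetical feature sizes and layer names (the actual dimensions and code in this repo may differ):

```python
import lasagne
from lasagne.layers import InputLayer, ConcatLayer, DenseLayer
from lasagne.nonlinearities import rectify, softmax

lstmUnits = 256            # audio LSTM feature size (assumption)
lipFeatures = 512 * 7 * 7  # flattened CNN feature map, as in step 2

# features taken just before the dense layers of each pretrained network
audio_feats = InputLayer(shape=(None, lstmUnits))    # (nbImages, lstmUnits)
lip_feats   = InputLayer(shape=(None, lipFeatures))  # (nbImages, 512*7*7)

# 3. concatenate along the feature axis
combined = ConcatLayer([audio_feats, lip_feats], axis=1)

# 4. some fully connected layers
fc1 = DenseLayer(combined, num_units=1024, nonlinearity=rectify)
fc2 = DenseLayer(fc1, num_units=512, nonlinearity=rectify)

# 5. softmax over the 39 phoneme classes
output = DenseLayer(fc2, num_units=39, nonlinearity=softmax)
```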

  • features:
    • load the lipreading and audioSR networks with pretrained weights.
    • train only lipreading, train only audio, or train everything (see the sketch below)
    • output from lip only, audio only, everything
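Those training modes can be selected by choosing which parameters are passed to the update rule. A minimal Lasagne sketch, continuing the hypothetical fusion layers from the example above (fc1, fc2, output); all names are assumptions, not the repo's actual code.

```python
import theano.tensor as T
import lasagne

# fc1, fc2, output come from the fusion sketch above (assumption)
targets = T.ivector('targets')
predictions = lasagne.layers.get_output(output)
loss = lasagne.objectives.categorical_crossentropy(predictions, targets).mean()

# "train everything": all trainable params of the combined network
all_params = lasagne.layers.get_all_params(output, trainable=True)

# train only the new fusion layers, leaving the pretrained audio/lip subnetworks frozen
fusion_params = (fc1.get_params(trainable=True)
                 + fc2.get_params(trainable=True)
                 + output.get_params(trainable=True))

# pick all_params or fusion_params depending on the desired training mode
updates = lasagne.updates.adam(loss, fusion_params, learning_rate=1e-4)
```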