According to the x-vector paper (referenced below), they trained on audio files with a sample rate of 8 kHz.
The model is a classification model whose final layer has N outputs (N = the number of different speakers).
The input was up to 3 seconds long, with a frame length of 25 ms.

What is the value of N? (I couldn't find it in the paper.)
Given the frame length (i.e. 25 ms), am I right that there is no assumption on the speech length, i.e. that we can get an embedding vector for speech of different lengths?
If the model was trained on speech with a sample rate of 8 kHz, do I need to resample any audio to 8 kHz before getting its embedding vector? (I sketched below what I mean.)
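To make the last question concrete, here is the kind of preprocessing I have in mind. This is just a sketch: `torchaudio` is my choice here, `speech.wav` is a placeholder file, and `get_embedding` stands in for whatever call actually computes the embedding.

```python
import torchaudio

# Load the audio at its native sample rate.
waveform, sr = torchaudio.load("speech.wav")

# If the model was trained on 8 kHz audio, resample before inference.
TARGET_SR = 8000
if sr != TARGET_SR:
    waveform = torchaudio.functional.resample(waveform, orig_freq=sr, new_freq=TARGET_SR)
    sr = TARGET_SR

# Placeholder for the actual embedding extraction:
# embedding = get_embedding(waveform, sr)
```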
For reference, I read that pyannote uses this embedding model:
"X-Vectors: Robust DNN Embeddings for Speaker Recognition"
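This is how I am currently extracting embeddings, in case it matters for the questions above. It is a minimal sketch that assumes pyannote.audio 2.x and its pretrained "pyannote/embedding" model; the file names are placeholders.

```python
from pyannote.audio import Inference
from scipy.spatial.distance import cosine

# window="whole" pools over the entire file, so inputs of different
# lengths each yield a single fixed-size embedding vector.
inference = Inference("pyannote/embedding", window="whole")

emb1 = inference("speaker1.wav")
emb2 = inference("speaker2.wav")

# Cosine distance between the two embeddings (smaller = more similar).
distance = cosine(emb1, emb2)
```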