<!DOCTYPE html>
<html lang="en-US"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<!-- Begin Jekyll SEO tag v2.7.1 -->
<title>Robust Disentangled Variational Speech Representation Learning for Zero-Shot Voice Conversion</title>
<meta name="generator" content="Jekyll v3.9.0">
<meta property="og:title" content="Robust Disentangled Variational Speech Representation Learning for Zero-Shot Voice Conversion">
<meta property="og:locale" content="en_US">
<link rel="canonical" href="https://syang1993.github.io/glow_wavegan/">
<meta property="og:url" content="https://syang1993.github.io/glow_wavegan/">
<meta name="twitter:card" content="summary">
<!-- End Jekyll SEO tag -->
<meta name="viewport" content="width=device-width, initial-scale=1">
<meta name="theme-color" content="#157878">
<link rel="stylesheet" href="style.css">
</head>
<body>
<section class="page-header">
<!-- <h1 class="project-name">Demo PAGE</h1> -->
<!-- <h2 class="project-tagline"></h2> -->
</section>
<section class="main-content">
<h1 id=""><center> Robust Disentangled Variational Speech Representation Learning for Zero-Shot Voice Conversion </center></h1>
<center> Jiachen Lian<sup>1,2</sup>, Chunlei Zhang<sup>2</sup>, Dong Yu<sup>2</sup> </center>
<center> <sup>1</sup> Berkeley EECS, CA </center>
<center> <sup>2</sup> Tencent AI Lab, Bellevue, WA</center>
<h2 id="abstract">Abstract</h2>
<p>This work proposes a novel self-supervised zero-shot voice conversion method that relies on neither speaker labels nor pre-trained speaker embeddings. The method is robust to noisy speech and even outperforms all previous voice conversion methods that rely on supervisory speaker signals.</p>
<h2><p class="toc_title">Contents</p></h2>
<div id="toc_container">
<ul>
<li><a href="#1">Voice Identity Transfer(Zero-Shot)</a></li>
<li><a href="#2">Noise Invariant Voice Style Transfer(Zero-Shot)</a></li>
<li><a href="#3">Unconditional Speech Generation</a></li>
</ul>
</div>
<br>
<h2><p class="toc_title">Comments</p></h2>
<div id="toc_container"
<ul>
<li>This is self-supervised voice conversion: no speaker labels and no pre-trained speaker embeddings are used in training.</li>
<li>Unconditional speech generation was also performed. Due to space limits, these results were not included in the paper, but they can be found in Section 3 below.</li>
<li>Though not stated in the paper, we later observed an improvement in speech quality when applying HiFi-GAN; those results are also attached.</li>
</ul>
</div>
<br>
<br>
<a name="1"><h2>1. Voice Identity Transfer(Zero-Shot)</h2></a>
<a name="1.1"><h3>1.1 Male and Female</h3></a>
<table>
<thead>
<tr>
<th style="text-align: center"><strong>Source</strong></th>
<th style="text-align: center"><strong>Target</strong></th>
<th style="text-align: center"><strong>Conversion (WaveNet)</strong></th>
<th style="text-align: center"><strong>Conversion (HiFi-GAN)</strong></th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: left"><audio src="samples/identity/wavs/p293/p293_016.wav" controls="" preload=""></audio></td>
<td style="text-align: left"><audio src="samples/identity/wavs/p271/p271_015.wav" controls="" preload=""></audio></td>
<td style="text-align: left"><audio src="samples/identity/demo_baseline/F2M/p293_p271.wav" controls="" preload=""></audio></td>
<td style="text-align: left"><audio src="samples/identity/demo_baseline/F2M/p293_p271_hifi.wav" controls="" preload=""></audio></td>
</tr>
<tr>
<td style="text-align: left"><audio src="samples/identity/wavs/p317/p317_134.wav" controls="" preload=""></audio></td>
<td style="text-align: left"><audio src="samples/identity/wavs/p274/p274_134.wav" controls="" preload=""></audio></td>
<td style="text-align: left"><audio src="samples/identity/demo_baseline/F2M/p317_p274.wav" controls="" preload=""></audio></td>
<td style="text-align: left"><audio src="samples/identity/demo_baseline/F2M/p317_p274_hifi.wav" controls="" preload=""></audio></td>
</tr>
<tr>
<td style="text-align: left"><audio src="samples/identity/wavs/p241/p241_050.wav" controls="" preload=""></audio></td>
<td style="text-align: left"><audio src="samples/identity/wavs/p317/p317_134.wav" controls="" preload=""></audio></td>
<td style="text-align: left"><audio src="samples/identity/demo_baseline/M2F/p241_p317.wav" controls="" preload=""></audio></td>
<td style="text-align: left"><audio src="samples/identity/demo_baseline/M2F/p241_p317_hifi.wav" controls="" preload=""></audio></td>
</tr>
<!-- <tr>
<td style="text-align: left"><audio src="samples/identity/wavs/p274/p274_134.wav" controls="" preload=""></audio></td>
<td style="text-align: left"><audio src="samples/identity/wavs/p225/p225_127.wav" controls="" preload=""></audio></td>
<td style="text-align: left"><audio src="samples/identity/demo_baseline/M2F/p274_p225.wav" controls="" preload=""></audio></td>
</tr> -->
</tbody>
</table>
<br>
<a name="1.2"><h3>1.2 Male and Male</h3></a>
<table>
<thead>
<tr>
<th style="text-align: center"><strong>Source</strong></th>
<th style="text-align: center"><strong>Target</strong></th>
<th style="text-align: center"><strong>Conversion (WaveNet)</strong></th>
<th style="text-align: center"><strong>Conversion (HiFi-GAN)</strong></th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: left"><audio src="samples/identity/wavs/p259/p259_021.wav" controls="" preload=""></audio></td>
<td style="text-align: left"><audio src="samples/identity/wavs/p271/p271_015.wav" controls="" preload=""></audio></td>
<td style="text-align: left"><audio src="samples/identity/demo_baseline/M2M/p259_p271.wav" controls="" preload=""></audio></td>
<td style="text-align: left"><audio src="samples/identity/demo_baseline/M2M/p259_p271_hifi.wav" controls="" preload=""></audio></td>
</tr>
<tr>
<td style="text-align: left"><audio src="samples/identity/wavs/p271/p271_015.wav" controls="" preload=""></audio></td>
<td style="text-align: left"><audio src="samples/identity/wavs/p241/p241_050.wav" controls="" preload=""></audio></td>
<td style="text-align: left"><audio src="samples/identity/demo_baseline/M2M/p271_p241.wav" controls="" preload=""></audio></td>
<td style="text-align: left"><audio src="samples/identity/demo_baseline/M2M/p271_p241_hifi.wav" controls="" preload=""></audio></td>
</tr>
</tbody>
</table>
<br>
<a name="1.3"><h3>1.3 Female and Female</h3></a>
<table>
<thead>
<tr>
<th style="text-align: center"><strong>Source</strong></th>
<th style="text-align: center"><strong>Target</strong></th>
<th style="text-align: center"><strong>Conversion (WaveNet)</strong></th>
<th style="text-align: center"><strong>Conversion (HiFi-GAN)</strong></th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: left"><audio src="samples/identity/wavs/p317/p317_134.wav" controls="" preload=""></audio></td>
<td style="text-align: left"><audio src="samples/identity/wavs/p293/p293_212.wav" controls="" preload=""></audio></td>
<td style="text-align: left"><audio src="samples/identity/demo_baseline/F2F/p317_p293_1.wav" controls="" preload=""></audio></td>
<td style="text-align: left"><audio src="samples/identity/demo_baseline/F2F/p317_p293_1_hifi.wav" controls="" preload=""></audio></td>
</tr>
<!-- <tr>
<td style="text-align: left"><audio src="samples/identity/wavs/p293/p293_016.wav" controls="" preload=""></audio></td>
<td style="text-align: left"><audio src="samples/identity/wavs/p225/p225_127.wav" controls="" preload=""></audio></td>
<td style="text-align: left"><audio src="samples/identity/demo_baseline/F2F/p293_p225.wav" controls="" preload=""></audio></td>
</tr> -->
<tr>
<td style="text-align: left"><audio src="samples/identity/wavs/p293/p293_212.wav" controls="" preload=""></audio></td>
<td style="text-align: left"><audio src="samples/identity/wavs/p317/p317_134.wav" controls="" preload=""></audio></td>
<td style="text-align: left"><audio src="samples/identity/demo_baseline/F2F/p293_p317_2.wav" controls="" preload=""></audio></td>
<td style="text-align: left"><audio src="samples/identity/demo_baseline/F2F/p293_p317_2_hifi.wav" controls="" preload=""></audio></td>
</tr>
<tr>
<td style="text-align: left"><audio src="samples/identity/wavs/p293/p293_016.wav" controls="" preload=""></audio></td>
<td style="text-align: left"><audio src="samples/identity/wavs/p317/p317_134.wav" controls="" preload=""></audio></td>
<td style="text-align: left"><audio src="samples/identity/demo_baseline/F2F/p293_p317.wav" controls="" preload=""></audio></td>
<td style="text-align: left"><audio src="samples/identity/demo_baseline/F2F/p293_p317_hifi.wav" controls="" preload=""></audio></td>
</tr>
</tbody>
</table>
<br>
<a name="2"><h2>2. Noise Invariant Voice Style Transfer(Zero-Shot)</h2></a>
<a name="2.1"><h3>2.1 Male to Female</h3></a>
<table>
<thead>
<tr>
<th style="text-align: center"><strong>Source (Clean)</strong></th>
<th style="text-align: center"><strong>Target (Clean)</strong></th>
<th style="text-align: center"><strong>Conversion (WaveNet)</strong></th>
<th style="text-align: center"><strong>Conversion (HiFi-GAN)</strong></th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: left"><audio src="samples/noise_invariant/p274_134_mic1.wav" controls="" preload=""></audio></td>
<td style="text-align: left"><audio src="samples/noise_invariant/p293_034_mic1.wav" controls="" preload=""></audio></td>
<td style="text-align: left"><audio src="samples/noise_invariant/p274_134_to_p293_0340_clean.wav" controls="" preload=""></audio></td>
<td style="text-align: left"><audio src="samples/noise_invariant/p274_134_to_p293_0340_clean_hifi.wav" controls="" preload=""></audio></td>
</tr>
</tbody>
</table>
<br>
<table>
<thead>
<tr>
<th style="text-align: center"><strong>Source (5 dB)</strong></th>
<th style="text-align: center"><strong>Target (Clean)</strong></th>
<th style="text-align: center"><strong>Conversion (WaveNet)</strong></th>
<th style="text-align: center"><strong>Conversion (HiFi-GAN)</strong></th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: left"><audio src="samples/noise_invariant/p274_134_mic1.5.wav" controls="" preload=""></audio></td>
<td style="text-align: left"><audio src="samples/noise_invariant/p293_034_mic1.wav" controls="" preload=""></audio></td>
<td style="text-align: left"><audio src="samples/noise_invariant/p274_134_to_p293_0340_5.wav" controls="" preload=""></audio></td>
<td style="text-align: left"><audio src="samples/noise_invariant/p274_134_to_p293_0340_5_hifi.wav" controls="" preload=""></audio></td>
</tr>
</tbody>
</table>
<br>
<table>
<thead>
<tr>
<th style="text-align: center"><strong>Source (15 dB)</strong></th>
<th style="text-align: center"><strong>Target (Clean)</strong></th>
<th style="text-align: center"><strong>Conversion (WaveNet)</strong></th>
<th style="text-align: center"><strong>Conversion (HiFi-GAN)</strong></th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: left"><audio src="samples/noise_invariant/p274_134_mic1.15.wav" controls="" preload=""></audio></td>
<td style="text-align: left"><audio src="samples/noise_invariant/p293_034_mic1.wav" controls="" preload=""></audio></td>
<td style="text-align: left"><audio src="samples/noise_invariant/p274_134_to_p293_0340_15.wav" controls="" preload=""></audio></td>
<td style="text-align: left"><audio src="samples/noise_invariant/p274_134_to_p293_0340_15_hifi.wav" controls="" preload=""></audio></td>
</tr>
</tbody>
</table>
<br>
<a name="2.2"><h3>2.2 Female to Male</h3></a>
<table>
<thead>
<tr>
<th style="text-align: center"><strong>Source (Clean)</strong></th>
<th style="text-align: center"><strong>Target (Clean)</strong></th>
<th style="text-align: center"><strong>Conversion (WaveNet)</strong></th>
<th style="text-align: center"><strong>Conversion (HiFi-GAN)</strong></th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: left"><audio src="samples/noise_invariant/p317_134_mic1.wav" controls="" preload=""></audio></td>
<td style="text-align: left"><audio src="samples/noise_invariant/p334_172_mic1.wav" controls="" preload=""></audio></td>
<td style="text-align: left"><audio src="samples/noise_invariant/p317_134_to_p334_1720_clean.wav" controls="" preload=""></audio></td>
<td style="text-align: left"><audio src="samples/noise_invariant/p317_134_to_p334_1720_clean_hifi.wav" controls="" preload=""></audio></td>
</tr>
</tbody>
</table>
<br>
<table>
<thead>
<tr>
<th style="text-align: center"><strong>Source (5 dB)</strong></th>
<th style="text-align: center"><strong>Target (Clean)</strong></th>
<th style="text-align: center"><strong>Conversion (WaveNet)</strong></th>
<th style="text-align: center"><strong>Conversion (HiFi-GAN)</strong></th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: left"><audio src="samples/noise_invariant/p317_134_mic1_6.wav" controls="" preload=""></audio></td>
<td style="text-align: left"><audio src="samples/noise_invariant/p334_172_mic1.wav" controls="" preload=""></audio></td>
<td style="text-align: left"><audio src="samples/noise_invariant/p317_134_to_p334_1720_5.wav" controls="" preload=""></audio></td>
<td style="text-align: left"><audio src="samples/noise_invariant/p317_134_to_p334_1720_5_hifi.wav" controls="" preload=""></audio></td>
</tr>
</tbody>
</table>
<br>
<table>
<thead>
<tr>
<th style="text-align: center"><strong>Source (15 dB)</strong></th>
<th style="text-align: center"><strong>Target (Clean)</strong></th>
<th style="text-align: center"><strong>Conversion (WaveNet)</strong></th>
<th style="text-align: center"><strong>Conversion (HiFi-GAN)</strong></th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: left"><audio src="samples/noise_invariant/p317_134_mic1_15.wav" controls="" preload=""></audio></td>
<td style="text-align: left"><audio src="samples/noise_invariant/p334_172_mic1.wav" controls="" preload=""></audio></td>
<td style="text-align: left"><audio src="samples/noise_invariant/p317_134_to_p334_1720_15.wav" controls="" preload=""></audio></td>
<td style="text-align: left"><audio src="samples/noise_invariant/p317_134_to_p334_1720_15_hifi.wav" controls="" preload=""></audio></td>
</tr>
</tbody>
</table>
<br>
<br>
<a name="3"><h2>3. Unconditional Speech Generation</h2></a>
The VAE allows us to generate an unlimited number of new speakers at test time: we simply sample speaker embeddings from the prior distribution with a specified mean vector and standard deviation vector.
We observed that the standard deviation does not influence the speaker identity much, while the mean is closely related to it. In particular, we found that a positive "mean" corresponds to a male voice
and a negative "mean" corresponds to a female voice. The experiment takes two utterances as input. We then set 5 different mean vectors (-1.5, -0.5, 0.0, 0.5, 1.5; top to bottom in the figure below) of length 64 and a standard deviation vector (0.5) of the same length.
We then concatenate the sampled speaker embedding with the extracted content embedding to synthesize the mel-spectrogram. For each mean vector, we sample two speakers using the same random seed. The results are as follows:
<br>
<br>
<!-- <img src="samples/uncondition.png"/width="300px"height="180px" /> -->
<img src="samples/uncondition.png"/>
<br>
As shown in the figure, the first row contains the two re-synthesized mel-spectrograms. For each utterance, we extract only the content embedding and combine it with sampled speaker embeddings using the 5 different means. For the two utterances, as long as the "means" are the same, the identities of the synthesized speech
are also the same, which can be verified by comparing the synthesized mel-spectrograms at the corresponding positions. When the mean is zero, the sampled speaker embedding can be randomly positive (male) or negative (female).
<br>
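<p>For concreteness, the sampling procedure described above can be sketched as follows. This is a minimal NumPy illustration of the description on this page, not the released code: the content encoder, decoder, and vocoder are hypothetical placeholders and appear only in comments.</p>
<pre><code class="language-python">
# Sketch of unconditional speaker generation: sample a speaker embedding from
# the Gaussian prior for each of the 5 means, reusing the same random seed so
# each mean yields the same synthetic "speaker" across both utterances.
import numpy as np

EMB_DIM = 64                        # speaker-embedding length used above
MEANS = [-1.5, -0.5, 0.0, 0.5, 1.5]
STD = 0.5
SEED = 0

for mean in MEANS:
    rng = np.random.default_rng(SEED)
    # Draw a speaker embedding z_speaker ~ N(mean, STD^2), elementwise.
    z_speaker = rng.normal(loc=mean, scale=STD, size=EMB_DIM)

    # For each of the two input utterances (hypothetical calls, shown only as
    # comments because the model interfaces are not part of this page):
    #   z_content = content_encoder(utterance)
    #   mel       = decoder(concatenate(z_speaker, z_content))
    #   wav       = vocoder(mel)            # WaveNet or HiFi-GAN
    print(f"mean={mean:+.1f}  sampled-embedding average={z_speaker.mean():+.2f}")
</code></pre>
<br>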
<footer class="site-footer">
<span class="site-footer-credits">This page was generated by <a href="https://pages.github.com/">GitHub Pages</a>.</span>
</footer>
</section>
</body></html>