Formants and their extraction
A formant is a concentration of acoustic energy around a particular frequency in the speech wave. There are several formants, each at a different frequency, roughly one in each 1000 Hz band for average men. The corresponding range for average women is one formant every 1100 Hz. The true spacing depends on the actual length of the vocal tract. Each formant corresponds to a resonance mode of the vocal tract.
Seen this way, the sound spectra look like mountain landscapes and the formants appear as peaks, a metaphor that is often used for formants. [12]
The frequency of the first formant is mostly determined by the height of the tongue body:
high F1 = low vowel (i.e., high frequency F1 = low tongue body)
low F1 = high vowel (i.e., low frequency F1 = high tongue body)
The frequency of the second formant is mostly determined by the frontness/backness of the tongue body:
high F2 = front vowel
low F2 = back vowel
F3: the lower the formant frequency, the rounder the lip shape (e.g. /ʊ/, /uː/), but F3 is not as frequently used as F1 and F2.
Since the 1950s, and even before, attempts have been made to visualize the vowel space by plotting at least F1 and F2 after collecting large corpora of data. (Keywords to search for: vowel chart, vowel diagram.) One such corpus, with 1520 samples of American English[1], was collected by Peterson & Barney in 1952. This particular dataset can be found in various open-source libraries such as Praat (http://web.archive.org/web/20190831081419/https://raw.githubusercontent.com/praat/praat/master/dwtools/Table_extensions.cpp) or packages on CRAN.
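For a feel of what such a chart looks like, here is a minimal MATLAB sketch (the four vowels and their F1/F2 values are approximate textbook male averages, included only for illustration, not figures taken from the corpus):

```matlab
% Minimal F1/F2 vowel chart: axes are reversed so that high/front vowels
% land at the top-left, matching the conventional articulatory diagram.
vowels = {'i', 'u', 'ae', 'a'};       % illustrative vowels
f1 = [270  300  660  730];            % approximate F1 values, Hz
f2 = [2290 870  1720 1090];           % approximate F2 values, Hz
figure;
scatter(f2, f1, 'filled');
text(f2 + 20, f1, vowels);
set(gca, 'XDir', 'reverse', 'YDir', 'reverse');
xlabel('F2 (Hz)'); ylabel('F1 (Hz)');
title('F1/F2 vowel chart');
```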
..., men, women, and children have vocal tracts of markedly different sizes, so that naturally their formants are different. Yet we identify a child's vowels correctly in spite of this. An /i/ said by a man and an /i/ said by a woman are felt to be "the same sound" and are equated, as far as phonetic quality goes, by the phonetician. On the usual F1/F2 plot they have quite different positions, but on an "articulatory" vowel-diagram they have the same position. Peterson has suggested that a more realistic acoustic diagram is achieved by plotting the ratio of F1 to F3 along the vertical axis and the ratio of F2 to F3 along the horizontal axis, all values being expressed in mels. Then men's, women's and children's "same" vowels are claimed to come out with approximately the same positions.[2]
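As a sketch of Peterson's suggestion (the 2595·log10(1 + f/700) mel conversion is one common approximation among several, and the formant values are illustrative):

```matlab
% Convert F1..F3 to mels, then form the F1/F3 and F2/F3 ratios that
% Peterson proposed as roughly speaker-independent vowel coordinates.
hz2mel = @(f) 2595 * log10(1 + f/700);   % a common mel approximation
F = [270 2290 3010];                     % illustrative F1..F3 of an /i/, Hz
m = hz2mel(F);
x = m(2) / m(3);                         % horizontal axis: F2/F3 in mels
y = m(1) / m(3);                         % vertical axis:   F1/F3 in mels
fprintf('F2/F3 = %.3f, F1/F3 = %.3f\n', x, y);
```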
An interesting study by Cartei et al.[6] tries to gauge the behavioural aspect of formant production by having 32 men and women (undergraduate students from a university) try to masculinize and feminize their voices.
Because formant frequencies are negatively correlated with the length of the vocal tract, male speakers produce lower Fi values and therefore a formant spacing (ΔF) that is about 15%–20% narrower than in female speakers, which results in male voices having a more "baritone" timbre.
However, despite the participants' attempts, one telltale sign was the dynamic range of both the fundamental frequency and the formants; note in Figure 2 how little both the male and female /æ/ varies from neutral, especially in the feminization attempts.
The effect of sex on F0CV was not significant for vowels, but was significant in the other two tasks, indicating that, overall, men spoke with a narrower dynamic range than women.
This appears to be the case for other languages as well; the female vowel space appears to be a "scaled-up" and shifted version of the male vowel space, although the scaling factor is certainly not uniform and not a simple function of the vocal tract length.[7]
Another study by Cartei and Reby[8] suggests that preadolescent children of both sexes tend to speak with the same F0, however,
... boys speak with lower formants and consequently narrower ΔF than girls despite the absence of overall differences in vocal tract length between the two sexes before puberty. This dimorphism has led to the suggestion that pre-pubertal sex differences in ΔF have a behavioural basis (for example boys may round their lips or lower their larynx when they speak to lengthen their vocal tracts).
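Given a set of measured formants, ΔF can be estimated under the uniform-tube model F_i = (2i - 1)·ΔF/2, the same model behind the F_n = (2n - 1)c/4L formula near the end of this page. A minimal least-squares sketch, with made-up formant values:

```matlab
% Estimate the formant spacing dF by least squares through the origin,
% assuming the uniform-tube model Fi = (2i - 1) * dF / 2.
Fi = [500 1480 2520 3510];            % made-up measured formants, Hz
i  = 1:numel(Fi);
xi = (2*i - 1) / 2;                   % model predictors: 0.5, 1.5, 2.5, 3.5
dF = sum(Fi .* xi) / sum(xi .^ 2);    % estimated spacing, ~1000 Hz here
fprintf('Estimated formant spacing: %.1f Hz\n', dF);
```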
Certainly, the vocal tract length already varies from vowel to vowel (as you move your tongue or lips, you constrict the volume of your vocal tract in different ways). Estimating a speaker's VTL from an arbitrary recording is still an open problem, and a good solution would find applications in areas as wide as forensics, because VTL can be used to extrapolate a person's height and weight. http://web.archive.org/web/20190901152202/https://www.researchgate.net/post/How_can_I_estimate_a_persons_vocal_tract_length_using_a_recorded_audio_file
Schiel and Zitzelsberger identified the following three main strategies that existing formant estimators use[9]:
Most algorithms to estimate formant parameters from the recorded speech signal therefore either apply homomorphic analysis of the spectral envelope (e.g. cepstral analysis followed by a peak-picking strategy), or LPC analysis (Markel & Grey, 1982) to estimate the z-transform followed by a pole-picking strategy, or trained pattern recognition techniques (e.g. support vector machine, random forest or deep learning). The latter requires a training set of labelled formant tracks and is in most cases language dependent, while the former two approaches are inherently language independent and do not require any training material.
One such deep learning strategy, implemented by Dissen and Keshet[11], feeds both LPC and cepstral coefficients into a neural network with 3 hidden layers as a regression problem; the output layer contains 3 continuous-valued neurons which predict F1-F3.
Below is a plot of the frequency response of the filter (as in the source-filter model of vocal production) as modeled by LPC coefficients compared with the original spectrum as obtained from a simple FFT transform:
The first three formants are at 700, 1220, and 2600 Hz.
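A sketch of how such a comparison plot can be produced (the filename, window, and model order here are placeholders; the concrete walkthrough with real values follows below):

```matlab
% Overlay the raw FFT magnitude spectrum with the LPC-modelled envelope,
% assuming a mono recording.
[snd, Fs] = audioread('vowel.wav');        % placeholder filename
A = lpc(snd, 10);                          % 10th-order all-pole model
NFFT = 4096;
S = fft(snd .* hamming(length(snd)), NFFT);
f = (0:NFFT/2 - 1) * Fs / NFFT;
H = freqz(1, A, NFFT/2, Fs);               % filter's frequency response
figure;
plot(f, 20*log10(abs(S(1:NFFT/2))), f, 20*log10(abs(H)));
xlabel('Frequency (Hz)'); ylabel('Magnitude (dB)');
legend('FFT spectrum', 'LPC envelope');
```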
Example: Obtain F1, F2, and F3 of this audio sample of me saying the English word "pooling", in particular the excerpt of the vowel /uː/ (oo) from 0.95 s to 1.00 s, titled pooling_transfusion_oo_0_95_1_0.wav, using the LPC root-finding method. The code as a single .m file is provided here.
- Open up Praat and load the sound sample in.
- A general guideline in the literature is "one formant in each 1000 Hz band", thus we must capture frequencies up to 5000 Hz to obtain the first 5 formants. This figure is increased by 10% for female voices; hence one formant every 1100 Hz, and frequencies up to 5500 Hz should be captured. According to the Nyquist theorem, the sample rate should be double that: 10000 Hz and 11000 Hz respectively. The literature also recommends downsampling to these values (and perhaps even higher, such as 16000 Hz, in the case of children); empirically, with the instructions below, certain LPC coefficients are skipped if the algorithm is executed at the full 44100 Hz sampling rate.
- Try the inbuilt formant extractor in Praat first: after selecting View & Edit, choose Formant > Formant Listing. You should see something like:

  Time_s    F1_Hz       F2_Hz       F3_Hz        F4_Hz
  0.025000  352.892934  796.173876  2903.324760  3507.097425
The code behind this is located here.
- Let's try to plot the LPC-modelled spectrogram. Note that extracting formants from the LPC coefficients manually seems to skip a few steps compared to the Formant Listing method, notably LPC_Sound_to_LPC_robust, which applies a Gaussian window and preemphasis.
- Select pooling_transfusion_oo_0_95_1_0 in the Praat Objects window and click Convert > Resample. Set the New sampling frequency as 10000 Hz, then click OK.
- In the Objects window, select the newly downsampled object and click on Analyse spectrum > To LPC (autocorrelation), then click OK.
- The newly created LPC object should be selected. Click on To Formant, then with the Formant object selected, click on Tabulate > List, then click OK. You should see:

  time(s)   nformants  F1(Hz)   B1(Hz)  F2(Hz)   B2(Hz)   F3(Hz)    B3(Hz)   F4(Hz)    B4(Hz)  F5(Hz)    B5(Hz)
  0.025000  5          345.716  85.135  791.642  111.987  2814.110  349.209  3487.712  92.460  4335.240  1658.090
B refers to the bandwidth; thus the first 3 formants are 345.716, 791.642, and 2814.110 Hz, close enough to our benchmark values.
- Head back to the Objects window and select the LPC object. Click on To Spectrum (slice), then click OK. Select the downsampled sound sample object, click on Analyse spectrum > To Spectrum..., then click OK. There should now be two Spectrum objects in the Objects window. Click on one of them, then click on Draw > Draw; repeat for the other one. The resulting image should be as follows:
- To interact with the LPC spectrum alone and find the peaks visually, click on the Spectrum object in the Objects window, then click View and Edit.
Now, let's reproduce the analysis manually in MATLAB:
[snd, originalFs] = audioread('pooling_transfusion_oo_0_95_1_0.wav');
Fs = 10000;
% downsample to 10 kHz
snd = resample(snd, Fs, originalFs);
segmentlen = 100;
noverlap = 90;
NFFT = 4096; % FFT length, used by freqz below
Perform preemphasis on the signal, because lower frequencies often have disproportionately higher energy:
% warning: preEmphasisFrequency must be below the Nyquist frequency!
% http://www.fon.hum.uva.nl/praat/manual/Sound__Filter__pre-emphasis____.html
preEmphasisFrequency = 50; % the default in Praat, Hz
preEmphasized = zeros(size(snd));
alpha = exp(-2.0 * pi * preEmphasisFrequency * (1/Fs));
preEmphasized(1) = snd(1); % the first sample has no predecessor and is left unchanged
for i = length(preEmphasized):-1:2
    preEmphasized(i) = snd(i) - alpha * snd(i-1);
end
snd = preEmphasized;
Apply a Gaussian window to avoid spurious frequencies where the signal begins and terminates:
snd = snd.*gausswin(length(snd));
Linear Prediction is the process of coming up with a linear model which determines the value of the current sample in a discrete time series based on past/observed values.
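As a toy illustration of the idea (separate from the formant pipeline), a pure sinusoid is exactly predictable from its two previous samples, thanks to the identity sin(ωn) = 2cos(ω)·sin(ω(n-1)) - sin(ω(n-2)):

```matlab
% A pure tone needs only a 2nd-order linear predictor.
Fs = 10000;
w  = 2*pi*440/Fs;                  % 440 Hz tone, in radians/sample
x  = sin(w * (0:999))';
n  = 3:length(x);
xhat = 2*cos(w)*x(n-1) - x(n-2);   % predict from the two previous samples
max(abs(x(n) - xhat))              % on the order of machine precision
```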
Sound is produced by a wave travelling through a medium. To see the actual waveform of me saying oo, try plot(snd); the result should be as follows:
Note the various periodic waves; for instance, every other wave has two "humps" in it and the second hump appears to be of higher amplitude. We can predict the value at any arbitrary X position (time) given the previous samples.
A transfer function when dealing with time series in signal processing (like right now) or control engineering is simply the ratio of the output signal to the input signal.
In the literature, it is common to model the vocal tract's transfer function as an all-pole filter, reproduced below. A pole is where the response function goes to infinity, i.e. where the denominator of the transfer function is 0. A zero occurs where the numerator is 0. Thus, a response function with a constant in the numerator is all-pole.
H(z) = \frac{1}{1 - \sum_{k=1}^{p} a_k z^{-k}}

Transfer function of the vocal tract.
The vector a, of length p, contains the coefficients associated with every term of the polynomial, which is also of order p.
Let Output(z) be X(z) and Input(z) be E(z); that is, let X(z) be the current sample in the z-domain and E(z) be the source in the source-filter model. Then X(z) = H(z) · E(z).
Cross-multiplying the fractions in the transfer function, we get:

E(z) = X(z) - X(z) \sum_{k=1}^{p} a_k z^{-k}

X(z) = X(z) \sum_{k=1}^{p} a_k z^{-k} + E(z)
Bringing this expression back into the time domain with the inverse z-transform gives us a concrete expression for the current sample in terms of our past samples. The key fact is the time-shifting property of the z-transform:

z^{-k} X(z) \longleftrightarrow x(n-k)

The summation only makes sense once we have at least k samples to look backwards at; i.e. it is not logical for n-k to be negative.
Substituting back into the expression we got after cross-multiplying:

x(n) = \sum_{k=1}^{p} a_k x(n-k) + e(n)

i.e. each sample is simply a weighted sum of the immediately preceding p samples in this model, plus what used to be our input impulse.
However, we have our actual audio sample, and we now want to estimate a_k, our coefficients (which are the coefficients of the polynomial in the original z-domain); thus e(n) takes on the role of the prediction error when we apply linear regression to find a. We will skip working out the actual linear regression in this guide, since libraries exist to do it for us.
[A, err] = lpc(snd, 10); % fit a 10th-order all-pole model; err is the prediction error variance
[H, F] = freqz(err^0.5, A, NFFT, Fs); % frequency response of the modelled filter
figure;
plot(F, log(abs(H)));
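As an aside, the model order of 10 is a choice; a common rule of thumb (my assumption here, not something this walkthrough depends on) is two poles per expected formant plus two extra for the overall spectral slope:

```matlab
% Rule of thumb: order = 2 * (expected formants) + 2.
expectedFormants = 5;               % up to 5000 Hz at Fs = 10000 Hz
order = 2 * expectedFormants + 2;   % = 12; compare with the 10 used above
[A12, err12] = lpc(snd, order);
```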
According to the MATLAB documentation, the lpc function finds a in such a way that

\hat{x}(n) = -a(2) x(n-1) - a(3) x(n-2) - \dots - a(p+1) x(n-p)

i.e. the returned coefficients after the leading 1 are negated; thus, when plugged into the polynomial above with z^{-k}, they become positive.
Convolution Theorem: Multiplication in the discrete-time domain becomes convolution in the z-domain. Multiplication in the z-domain becomes convolution in the discrete-time domain.
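One quick way to see this in the z-domain: multiplying two transfer-function polynomials (i.e. cascading two filters) is the same as convolving their coefficient sequences:

```matlab
p1 = [1 2];      % 1 + 2*z^-1
p2 = [1 3];      % 1 + 3*z^-1
conv(p1, p2)     % returns [1 5 6], i.e. 1 + 5*z^-1 + 6*z^-2
```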
Finding the roots of the polynomial described by A (you may try this website), which is

x^10 - 0.990957239467850 x^9 - 0.125655790125284 x^8 - 0.838547440244313 x^7 + 1.20775228094351 x^6 + 0.333452550451733 x^5 + 0.0448200046527322 x^4 - 0.493766106144003 x^3 - 0.409998016009997 x^2 + 0.219683972440823 x + 0.230480465418990 = 0

(the first item in A is 1), we should obtain:
-0.570431810344407-i*0.787004105407124
-0.570431810344407+i*0.787004105407124
-0.563770157146396-i*0.196261544293191
-0.563770157146396+i*0.196261544293191
-0.164548102245518-i*0.870100929756742
-0.164548102245518+i*0.870100929756742
0.844894632920703-i*0.457672165731046
0.844894632920703+i*0.457672165731046
0.949334056549543-i*0.21047565099229
0.949334056549543+i*0.21047565099229
Plotted on the complex plane, these are (see the plotting sketch after the code below):
rts = roots(A);
rts = rts(imag(rts)>=0); % retain only the roots with non-negative imaginary component (above the real axis); each conjugate pair contributes one candidate formant
angz = atan2(imag(rts),real(rts));
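To see where these roots sit relative to the unit circle, a minimal plotting sketch:

```matlab
% Visualize the retained LPC roots (upper half-plane) on the unit circle.
theta = linspace(0, 2*pi, 256);
figure; hold on; axis equal;
plot(cos(theta), sin(theta), 'k--');      % the unit circle
plot(real(rts), imag(rts), 'rx');         % retained roots
xlabel('Re'); ylabel('Im');
```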
The angles of said points in degrees are:
>> rad2deg(angz)
ans =
12.5007684107254
28.4440247430048
100.708953320254
125.935136252332
160.805800678623
These angles, informally, are how far each individual frequency component has "travelled around the unit circle" in a single sample. They are printed above in degrees, but angz itself is in radians/sample; since our sampling rate is now 10000 Hz, multiplying by Fs gives radians/second, and dividing by 2π converts that angular frequency into Hz:
[frqs,indices] = sort(angz.*(Fs/(2*pi)));
At this point, the frqs variable should contain our formants and be quite similar to both of our attempts using Praat! They should be:
frqs =
347.243566964593
790.111798416799
2797.4709255626
3498.19822923145
4466.82779662843
The distance of a root from the unit circle relates to the formant bandwidth in a simple way: a root close to the circle (|r| ≈ 1, so log|r| ≈ 0) is a sharp, lightly damped resonance with a narrow bandwidth, while a root deeper inside the circle is more heavily damped and yields a wider bandwidth.
Additional filtering/thresholding may be done on the obtained formants.
The bandwidth is basically how platykurtic (flat) the formant region is; i.e. a 400 Hz-wide swath of sound at roughly the same amplitude level is more likely to be noise than a resonance of the vocal tract.
bw = -1/2*(Fs/(2*pi))*log(abs(rts(indices))); % bandwidths, in the same sort order as frqs
nn = 1;
formants = [];
for kk = 1:length(frqs)
    % keep only plausible formants: above 90 Hz and narrower than 400 Hz
    if (frqs(kk) > 90 && bw(kk) < 400)
        formants(nn) = frqs(kk);
        nn = nn + 1;
    end
end
Previously, before performing lpc, we came up with an expression relating the current sample to the immediately preceding p samples. Let's eyeball this on a couple of arbitrary samples:
>> snd(399-10:399)
ans =
-0.00692307756475205
-0.00698452989019883
-0.00393614172085627
-0.00627761144814771
-0.00867164003550028
-0.00575188812920859
-0.00699546688431311
-0.0068232860115245
-0.00352345532564564
-0.00223809113403957
0.00107157362766187
Let us try to predict the 399th (starting from the 1st) sound sample; the prediction should at least be positive:
>> - A(2:end) * flipud(snd(398-9:398))
ans =
0.000789732959574569
Another one:
>> snd(199-10:199)
ans =
-0.0148137512854369
-0.00909400429482982
-0.0142895693927139
-0.0209940327026022
-0.0126893465686367
-0.0131553582512722
-0.00643197461229436
0.00402941604871497
0.00987810570587273
0.0207191216380541
0.028318966315502
>> - A(2:end) * flipud(snd(198-9:198))
ans =
0.0270627680148722
Note that we apply flipud because the nearest sound sample, the one just before the one we're trying to predict, corresponds to the 1st coefficient in A after the leading 1.
- For consonants, there are also antiresonances in the vocal tract at one or more frequencies due to oral constrictions. An antiresonance is the opposite of a resonance, such that the impedance is relatively high rather than low. Consequently, antiresonances attenuate or eliminate formants at or near these frequencies, so that the formants appear weakened or are missing altogether when you look at spectrograms. That is why, for example, it is difficult to see formants below 3000-4000 Hz for the two instances of [s] in the spectrogram above. [12]
- The literature describes some success detecting the nasal consonants n and m because the nasal cavity has its own resonances.
- Fricatives tend to resemble noise across the spectrum and be much shorter in duration.
- F1 and F2 alone may not be sufficient to represent all the cardinal vowels in certain languages.
- In languages that contrast front vs. mid vowels, or rounded vs. unrounded vowels, F3 plays a critical role. In French, for example, F1 and F2 for /i/ and /y/ can be similar for some speakers.[3]
- Vaissière proposes various ways to aggregate F2 and above into a single dimension.
- This study[13] involving the perception of masculinized and feminized versions of male and female voices leverages the Pitch-Synchronous Overlap Add (PSOLA) algorithm in Praat version 5.2.15:
Pitch was raised (feminized) or lowered (masculinized) by 10% from baseline while holding formants constant. Likewise, formants were raised (feminized) or lowered (masculinized) by 10% from baseline while holding pitch constant. This process created 12 pairs of voices (6 male and 6 female) that differed in pitch and 12 pairs of voices that differed in formants (6 male and 6 female). Following these manipulations, we amplitude normalized the sound pressure level of all voices to 70 decibels using the root mean squared method.
It would be interesting to see how the mapping is done from the male to the female vowel space.
Informally, various commercial music VSTs, such as Little AlterBoy by SoundToys, MAutoPitch by MeldaProduction, and Melodyne by Celemony, claim to manipulate formants.
We can feed the formants back into vocal synthesizers based on the source-filter model.
One such vocal synthesizer by Clo Yun-Hee Dufour can be found here; try it with the frequencies and bandwidths found above!
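As a rough MATLAB sketch of the same idea (the 120 Hz fundamental and the bare impulse-train source are my assumptions; real synthesizers use richer glottal sources), each (frequency, bandwidth) pair from the Formant Listing above becomes a conjugate pole pair of an all-pole filter, through which we push a pulse train:

```matlab
% Source-filter resynthesis from (formant, bandwidth) pairs.
Fs = 10000;
F0 = 120;                                   % assumed fundamental, Hz
f  = [345.716 791.642 2814.110];            % formants from the listing above
bw = [85.135 111.987 349.209];              % matching bandwidths
r     = exp(-pi * bw / Fs);                 % pole radius from bandwidth
theta = 2*pi * f / Fs;                      % pole angle from frequency
poles = [r .* exp(1i*theta), r .* exp(-1i*theta)];
A = real(poly(poles));                      % all-pole denominator
src = zeros(Fs, 1);                         % 1 second of source signal
src(1:round(Fs/F0):end) = 1;                % impulse train at F0
out = filter(1, A, src);                    % push the source through the filter
soundsc(out, Fs);                           % should sound vaguely like /u:/
```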
Assuming a neutral vocal tract,

F_n = \frac{(2n - 1)\,c}{4L}

where n is an integer, L is the length of the tube, and c is the speed of sound (about 35,000 cm/sec).
A worked example can be found here on the 2nd page: http://web.archive.org/web/20190908152547/https://web.stanford.edu/class/linguist205/index_files/Handout%206%20-%20Vowel%20acoustics.pdf
A Praat script implementing this formula can be found here: http://www.praatvocaltoolkit.com/calculate-vocal-tract-length.html
A problematic assumption is that the vocal tract is completely neutral and has resonances in a 1:3:5:... ratio, e.g. 500 Hz, 1500 Hz, 2500 Hz, ...
F_n is the frequency of the nth formant.
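Inverting the formula gives a quick VTL estimate from a single formant. A worked sketch under the neutral-tube assumption, mirroring the handout linked above:

```matlab
% Solve Fn = (2n - 1) * c / (4 * L) for L.
c  = 35000;                      % speed of sound, cm/s
n  = 1;                          % using the first formant
Fn = 500;                        % Hz; a typical neutral-vowel F1
L  = (2*n - 1) * c / (4 * Fn);   % = 17.5 cm
fprintf('Estimated vocal tract length: %.1f cm\n', L);
```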
...a database for manually labeled vocal-tract-resonance (or formant) trajectories... The database contains a representative subset of the TIMIT corpus with respect to speaker, gender, dialect and phonetic context, with a total of 538 sentences.