wai.annotations module for audio processing.
Makes use of the librosa and soundfile libraries.
The manual is available here:
https://ufdl.cms.waikato.ac.nz/wai-annotations-manual/
Collates and outputs information on the audio files.
- Audio classification domain
usage: audio-info-ac [-o OUTPUT_FILE] [-f OUTPUT_FORMAT]
optional arguments:
-o OUTPUT_FILE, --output OUTPUT_FILE
the file to write the information to; uses stdout if omitted (default: )
-f OUTPUT_FORMAT, --format OUTPUT_FORMAT
the format to use for the output, available modes: csv, json (default: text)
Collates and outputs information on the audio files.
- Speech Domain
usage: audio-info-sp [-o OUTPUT_FILE] [-f OUTPUT_FORMAT]
optional arguments:
-o OUTPUT_FILE, --output OUTPUT_FILE
the file to write the information to; uses stdout if omitted (default: )
-f OUTPUT_FORMAT, --format OUTPUT_FORMAT
the format to use for the output, available modes: csv, json (default: text)
Converts audio files to monophonic.
- Speech Domain
- Audio classification domain
usage: convert-to-mono
Converts mp3/flac/ogg to wav.
- Speech Domain
- Audio classification domain
usage: convert-to-wav [-s SAMPLE_RATE]
optional arguments:
-s SAMPLE_RATE, --sample-rate SAMPLE_RATE
the sample rate to use for the audio data, for overriding the native rate.
(default: None)
Generates a plot from a Mel spectrogram.
- Audio classification domain
usage: mel-spectrogram [--center] [--cmap CMAP] [--dpi DPI] [--hop-length HOP_LENGTH]
[--num-fft NUM_FFT] [--pad-mode PAD_MODE] [--power POWER]
[--win-length WIN_LENGTH] [--window WINDOW]
optional arguments:
--center for centering the signal. (default: False)
--cmap CMAP the Matplotlib colormap to use (append _r for reverse), automatically infers
map if not provided; use 'gray_r' for grayscale; for available maps see:
https://matplotlib.org/stable/gallery/color/colormap_reference.html
(default: None)
--dpi DPI the dots per inch (default: 100)
--hop-length HOP_LENGTH
number of audio samples between adjacent STFT columns. (default: 512)
--num-fft NUM_FFT the length of the windowed signal after padding with zeros. should be power
of two. (default: 2048)
--pad-mode PAD_MODE used when 'centering' (default: constant)
--power POWER exponent for the magnitude melspectrogram. e.g., 1 for energy, 2 for power,
etc. (default: 2.0)
--win-length WIN_LENGTH
each frame of audio is windowed by window of length win_length and then
padded with zeros to match num_fft. defaults to win_length = num_fft
(default: None)
--window WINDOW a window function, such as scipy.signal.windows.hann (default: hann)
Generates a plot from Mel-frequency cepstral coefficients.
- Audio classification domain
usage: mfcc-spectrogram [--center] [--cmap CMAP] [--dct-type DCT_TYPE] [--dpi DPI]
[--hop-length HOP_LENGTH] [--lifter LIFTER] [--norm NORM]
[--num-fft NUM_FFT] [--num-mfcc NUM_MFCC] [--pad-mode PAD_MODE]
[--power POWER] [--win-length WIN_LENGTH] [--window WINDOW]
optional arguments:
--center for centering the signal. (default: False)
--cmap CMAP the Matplotlib colormap to use (append _r for reverse), automatically infers
map if not provided; use 'gray_r' for grayscale; for available maps see:
https://matplotlib.org/stable/gallery/color/colormap_reference.html
(default: None)
--dct-type DCT_TYPE the Discrete cosine transform (DCT) type (1|2|3). By default, DCT type-2 is
used. (default: 2)
--dpi DPI the dots per inch (default: 100)
--hop-length HOP_LENGTH
number of audio samples between adjacent STFT columns. (default: 512)
--lifter LIFTER If lifter>0, apply liftering (cepstral filtering) to the MFCC: M[n, :] <-
M[n, :] * (1 + sin(pi * (n + 1) / lifter) * lifter / 2) (default: 0)
--norm NORM If dct_type is 2 or 3, setting norm='ortho' uses an ortho-normal DCT basis.
Normalization is not supported for dct_type=1. (options: none|ortho)
(default: ortho)
--num-fft NUM_FFT the length of the windowed signal after padding with zeros. should be power
of two. (default: 2048)
--num-mfcc NUM_MFCC the number of MFCCs to return. (default: 20)
--pad-mode PAD_MODE used when 'centering' (default: constant)
--power POWER exponent for the magnitude melspectrogram. e.g., 1 for energy, 2 for power,
etc. (default: 2.0)
--win-length WIN_LENGTH
each frame of audio is windowed by window of length win_length and then
padded with zeros to match num_fft. defaults to win_length = num_fft
(default: None)
--window WINDOW a window function, such as scipy.signal.windows.hann (default: hann)
Augmentation method for shifting the pitch of audio files.
- Audio classification domain
- Speech Domain
usage: pitch-shift [-m AUG_MODE] [--suffix AUG_SUFFIX] [--bins-per-octave BINS_PER_OCTAVE]
[--resample-type RESAMPLE_TYPE] [-s SEED] [-a] [-f STEPS_FROM] [-t STEPS_TO]
[-T THRESHOLD] [-v]
optional arguments:
-m AUG_MODE, --mode AUG_MODE
the audio augmentation mode to use, available modes: replace, add (default:
replace)
--suffix AUG_SUFFIX the suffix to use for the file names in case of augmentation mode add
(default: None)
--bins-per-octave BINS_PER_OCTAVE
how many steps per octave (default: 12)
--resample-type RESAMPLE_TYPE
the resampling type to apply (kaiser_best|kaiser_fast|fft|polyphase|linear|z
ero_order_hold|sinc_best|sinc_medium|sinc_fastest|soxr_vhq|soxr_hq|soxr_mq|s
oxr_lq|soxr_qq) (default: kaiser_best)
-s SEED, --seed SEED the seed value to use for the random number generator; randomly seeded if
not provided (default: None)
-a, --seed-augmentation
whether to seed the augmentation; if specified, uses the seeded random
generator to produce a seed value from 0 to 1000 for the augmentation.
(default: False)
-f STEPS_FROM, --from-steps STEPS_FROM
the minimum (fractional) steps to shift (default: None)
-t STEPS_TO, --to-steps STEPS_TO
the maximum (fractional) steps to shift (default: None)
-T THRESHOLD, --threshold THRESHOLD
the threshold to use for Random.rand(): if equal or above, augmentation gets
applied; range: 0-1; default: 0 (= always) (default: None)
-v, --verbose whether to output debugging information (default: False)
Resamples audio files.
For resample types, see: https://librosa.org/doc/latest/generated/librosa.resample.html#librosa.resample
- Audio classification domain
- Speech Domain
usage: resample-audio [-t RESAMPLE_TYPE] [-s SAMPLE_RATE] [-v]
optional arguments:
-t RESAMPLE_TYPE, --resample-type RESAMPLE_TYPE
the resampling type to apply (kaiser_best|kaiser_fast|fft|polyphase|linear|z
ero_order_hold|sinc_best|sinc_medium|sinc_fastest|soxr_vhq|soxr_hq|soxr_mq|s
oxr_lq|soxr_qq) (default: kaiser_best)
-s SAMPLE_RATE, --sample-rate SAMPLE_RATE
the sample rate to use for the audio data. (default: 22050)
-v, --verbose whether to output some debugging output (default: False)
Generates a plot from a short time fourier transform (STFT) spectrogram.
- Audio classification domain
usage: stft-spectrogram [--center] [--cmap CMAP] [--dpi DPI] [--hop-length HOP_LENGTH]
[--num-fft NUM_FFT] [--pad-mode PAD_MODE] [--win-length WIN_LENGTH]
[--window WINDOW]
optional arguments:
--center for centering the signal. (default: False)
--cmap CMAP the Matplotlib colormap to use (append _r for reverse), automatically infers
map if not provided; use 'gray_r' for grayscale; for available maps see:
https://matplotlib.org/stable/gallery/color/colormap_reference.html
(default: None)
--dpi DPI the dots per inch (default: 100)
--hop-length HOP_LENGTH
number of audio samples between adjacent STFT columns. defaults to
win_length // 4 (default: None)
--num-fft NUM_FFT the length of the windowed signal after padding with zeros. should be power
of two. (default: 2048)
--pad-mode PAD_MODE used when 'centering' (default: constant)
--win-length WIN_LENGTH
each frame of audio is windowed by window of length win_length and then
padded with zeros to match num_fft. defaults to win_length = num_fft
(default: None)
--window WINDOW a window function, such as scipy.signal.windows.hann (default: hann)
Augmentation method for stretching the time of audio files (speed up/slow down).
- Speech Domain
- Audio classification domain
usage: time-stretch [-m AUG_MODE] [--suffix AUG_SUFFIX] [-f RATE_FROM] [-t RATE_TO] [-s SEED] [-a]
[-T THRESHOLD] [-v]
optional arguments:
-m AUG_MODE, --mode AUG_MODE
the audio augmentation mode to use, available modes: replace, add (default:
replace)
--suffix AUG_SUFFIX the suffix to use for the file names in case of augmentation mode add
(default: None)
-f RATE_FROM, --from-rate RATE_FROM
the minimum stretch factor (<1: slow down, 1: same, >1: speed up) (default:
None)
-t RATE_TO, --to-rate RATE_TO
the maximum stretch factor (<1: slow down, 1: same, >1: speed up) (default:
None)
-s SEED, --seed SEED the seed value to use for the random number generator; randomly seeded if
not provided (default: None)
-a, --seed-augmentation
whether to seed the augmentation; if specified, uses the seeded random
generator to produce a seed value from 0 to 1000 for the augmentation.
(default: False)
-T THRESHOLD, --threshold THRESHOLD
the threshold to use for Random.rand(): if equal or above, augmentation gets
applied; range: 0-1; default: 0 (= always) (default: None)
-v, --verbose whether to output debugging information (default: False)
Trims silence from audio files.
- Audio classification domain
- Speech Domain
usage: trim-audio [--frame-length FRAME_LENGTH] [--hop-length HOP_LENGTH] [--top-db TOP_DB] [-v]
optional arguments:
--frame-length FRAME_LENGTH
the number of samples per analysis frame. (default: 2048)
--hop-length HOP_LENGTH
the number of samples between analysis frames (default: 512)
--top-db TOP_DB the threshold (in decibels) below reference to consider as silence.
(default: 60)
-v, --verbose whether to output some debugging output (default: False)
The Urban8k class can be used in conjunction
with the generic-source-ac
source from the wai.annotations.generic
module to load the data from the Urban8k dataset.
With the to-subdir-ac
sink from the wai.annotations.subdir
module, you can split the audio files per class.