ReplayGain ffmpeg with true peaks is SLOW #4935

RollingStar · 2023-10-06T06:47:29Z

RollingStar
Oct 6, 2023
Collaborator

Here's what I could find about replaygain implementations.

Parallel: whether the backend analyzes on multiple cores. (Presumably a multicore implementation is faster than a single-core one, although a significantly more efficient single-core implementation could in theory still win.)
R128: EBU R128, a newer loudness normalization standard believed to be more accurate than ReplayGain 1.0. In practice the difference seems negligible; they mainly differ in baseline loudness. As far as I can tell, you want to pick one standard for your entire library rather than mixing them.

Feature	ffmpeg	mp3gain	aacgain	GStreamer	Python Audio Tools
Parallel	Yes	Yes	Yes	No	No
R128	Yes	No	No	?	?
Formats (non exhaustive)	anything?	mp3 and aac?	mp3 and aac?	?	?
R128 Formats	Opus only?	No	No	No	No

Replaygain works in beets by having different backends extend the base class.

https://github.com/beetbox/beets/blob/e10b955a931e4c205b0cadf0860797c0aeee736c/beetsplug/replaygain.py

R128 support appears to be limited to a whitelist that includes only Opus. Does anyone know why this is? Hints here. Is it about writing different tags? Can you analyze with R128 and write to standard RG tags? Should you?

There are other attempted implementations of Replaygain that never made it into beets - there are PRs and discussions about it. #3368 #1203 #3381

FFmpeg looks like the clear winner right now, but maybe other backends will be superior. Anecdotally it's a bit slow on my ARM device, maybe 10-30 seconds per track. It does max out my CPU cores. R128 seems like the way of the future and I'd hate to start with RG1 and have to re-analyze all my tracks.

Since that time, a new industry standard ITU-R BS.1770 [aka R128] has been published. It details a new loudness measurement algorithm that implements RMS averaging and frequency weighting, similar in nature to ReplayGain 1.0, but far less computationally intensive and shown in listening tests to be more accurate. This new algorithm provides the basis for the ReplayGain 2.0 specification upon which rsgain is based.

RollingStar
Nov 1, 2023
Collaborator Author

I tested performance of beets vs other replaygain tools and ~~beets is SLOW~~ the true peak mode used by beets is SLOW.

Results

Multiplier is the column to look at. Each of those rows is comparable with all others. A 1 in this column means real time (60 minutes of audio takes 60 minutes to analyze). A 6 means 6x (60 minutes of audio = 10 minutes to analyze), and so on. Higher is better.
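The multiplier is just audio duration divided by wall-clock analysis time. A minimal sketch of that arithmetic follows; the helper names `parse_duration` and `speed_multiplier` are made up for illustration and are not part of beets or ffmpeg. The example uses the Deerhoof album figures reported later in this post (39:42 of audio analyzed in 4:37, i.e. 277 seconds).

```python
def parse_duration(text: str) -> float:
    """Convert 'HH:MM:SS(.fff)' or 'MM:SS' into seconds."""
    seconds = 0.0
    for part in text.split(":"):
        # Each colon-separated field shifts the running total by a factor of 60.
        seconds = seconds * 60 + float(part)
    return seconds


def speed_multiplier(audio_duration: str, wall_clock_seconds: float) -> float:
    """Seconds of audio analyzed per second of wall-clock time (1.0 = real time)."""
    return parse_duration(audio_duration) / wall_clock_seconds


# 39:42 of audio analyzed in 277 seconds of wall-clock time.
print(round(speed_multiplier("39:42", 277), 1))  # 8.6
```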

The runtime results are only comparable across the same test suite (below).

The "RG version" column records whether the run used "replaygain" analysis or EBU R128 with true peak analysis.

device	RG version	album suite	setup	wall-clock time (seconds)	multiplier (higher is better)
gaming pc	RGv1?	full	fb2k windows	7.516	7000
gaming pc	RGv1?	full	ffmpeg linux WSL	79	666
ARM HC4	RGv1?	full	ffmpeg ARM linux	835.5	63
gaming pc	EBU R128 true peak	full	ffmpeg linux WSL	157	335
ARM HC4	EBU R128 true peak	quick	ffmpeg ARM linux	1949 (NC)	9.6
ARM HC4	EBU R128 true peak	micro	beets ffmpeg album2 deerhoof	277 (NC)	8.6
ARM HC4 (no container)	EBU R128 true peak	quick	ffmpeg ARM linux		13.6

NC = not comparable to other times (different test suite)

In RGv1: Gaming PC is 10.5x faster than my ARM box. (Edit 2024-01-23: Foobar is RGv2 (EBU R128) but does not use truepeaks unless specified otherwise in the options)
In EBU R128 true peak: Gaming PC is 34.9x faster than my ARM box.

For completeness (testing failed hypotheses), here are other irrelevant results:

setup	wall-clock time (seconds)	multiplier vs. album time (approx)
fb2k windows (baseline) (run2)	7.641
fb2k windows (baseline) (run3)	7.594
fb2k hdd run2	7.625
fb2k hdd	7.640
beets album1 jazz	435 (not comparable to above)	8.5

Test suite

My sampling frame was my personal digital music collection - biased in favor of genres I like (new age, jazz) and formats I like (16 bit CD FLAC). I told foobar to randomize the tracks, then picked the top 14 tracks and copied their discs to my test folder. I only copied discs and not the larger compilations that they came from.

6 16 bit FLAC albums
1 24 bit FLAC album
7 MP3 (V0 or 320k) albums

Total audio runtime: 14:37:40.203
Size: 3.13 GB

I also made a quick test which is a subset of the above. I use this on ARM truepeaks because my system is so slow it would take hours to do the full test.

2 16 bit FLAC albums
2 MP3 albums

fb2k windows (baseline)

For a baseline (high water mark for performance) I ran replaygain on my high end gaming PC (Ryzen 5 7600X; 4.7 GHz; 6 cores, 12 threads, no overclock).

The files were stored on an upper-mid SSD - Western Digital WD BLACK SN750 NVMe M.2 2280 1TB.

NB: no observed difference in running on USB3.0 HDD (>100MBps read speeds) and an SSD. No observed difference in having firefox with ~20 tabs opened while the test was running.

FB2k might be using ebur128 (but no true peak?). https://hydrogenaud.io/index.php/topic,120918.0.html

ffmpeg ARM linux (baseline)

This is the same system I run beets on.

Odroid-HC4 (arm)
Armbian (debian-esque)
beets docker image from lsio
alpine linux for the image

The basic test I ginned up in 20 minutes doesn't do per-album tags. So this is again an upper bound on reasonable performance on this system. But it still shows the beets replaygain plugin is slow. Even not running in parallel*, performance is on the order of 60x.

time find ./ -type f \( -iname \*.flac -o -iname \*.mp3 \) -exec ffmpeg -i "{}" -af replaygain -f null - \;

* The command is basically a for-loop run in sequence (not parallel). Within the ffmpeg command it should (and does, based on cpu load) use all CPU cores.
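For anyone who would rather script the same sequential run than depend on a particular `find` implementation, here is a rough Python equivalent. It is only a sketch: the function names are invented for this example, and it adds `-nostats -hide_banner` (used in the later commands in this post) to keep ffmpeg quiet. It assumes `ffmpeg` is on the PATH.

```python
import subprocess
import time
from pathlib import Path

AUDIO_SUFFIXES = {".flac", ".mp3"}


def find_audio(root: str) -> list[Path]:
    """Recursively collect FLAC/MP3 files, matching extensions case-insensitively."""
    return sorted(
        p for p in Path(root).rglob("*")
        if p.is_file() and p.suffix.lower() in AUDIO_SUFFIXES
    )


def analysis_cmd(path: Path, filter_args: list[str]) -> list[str]:
    """Build one ffmpeg call that decodes, analyzes, and discards the audio."""
    return ["ffmpeg", "-nostats", "-hide_banner", "-i", str(path),
            *filter_args, "-f", "null", "-"]


def time_sequential(root: str, filter_args: list[str]) -> float:
    """Run ffmpeg on each file in turn; return total wall-clock seconds."""
    start = time.perf_counter()
    for path in find_audio(root):
        # One file at a time, like find -exec ... \; (no parallelism across files).
        subprocess.run(analysis_cmd(path, filter_args),
                       stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
                       check=False)
    return time.perf_counter() - start


# Example calls (not run here because they need ffmpeg and audio files):
#   time_sequential(".", ["-af", "replaygain"])
#   time_sequential(".", ["-map", "a:0", "-filter", "ebur128=peak=true"])
```

Passing the filter as an argument list makes it easy to time the `replaygain` and `ebur128` variants against the same set of files.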

NB: Not all tracks were scanned without errors. "Incorrect BOM value", "error reading comment frame, skipped". These were on MP3s and I believe from the logs that the tracks were still scanned for RG.

NB: The LSIO beets docker image uses Alpine, which uses busybox. Your find command must be tailored to it.

real 13m55.477s
user 13m28.481s
sys 2m29.431s

Speed=62.7x

beets ffmpeg linux

Perf was atrocious.

The command executed is something like this (taken from the beets log; the path is quoted here so the line can be pasted into a shell):

ffmpeg -nostats -hide_banner -i "/imp/Deerhoof - Mountain Moves (2017) [WEB FLAC]/02 - Con Sordino.flac" -map a:0 -filter ebur128=peak=true -f null -

Album2 - Deerhoof - Mountain Moves

39:42 completed in 4:37 (with beet -vvv so more precise as to when RG stopped and started).

Speed=8.6x

ARM linux ffmpeg EBUR128 (same as beets)

time find ./ -type f \( -iname \*.flac -o -iname \*.mp3 \) -exec ffmpeg -nostats -hide_banner -i "{}" -map a:0 -filter ebur128=peak=true -f null - \;

Again no per-album tags. 9.6x

Looks like it's caused by EBUR128 vs. Replaygain

It's true peak mode

Running the raw ffmpeg command produced by beets showed the same 8x speed. So the problem appears to be that command and, likely, its EBUR128-ness. That feels a little odd to me, because EBUR128 is supposed to be faster than RG 1.0?

https://www.wolframalpha.com/input?i=%283min7s%29%2F23.3+seconds

http://underpop.online.fr/f/ffmpeg/help/ebur128.htm.gz

Where is true peak mode coming from?

I looked through the code, my config, config_default, etc. but I can't find where peak=true is being enabled. I wonder if some other value is being coerced into True with python magic? And then a quirk of ffmpeg where "true" peak has nothing to do with a Truthy or 1 value but ends up that way.

replaygain:
    backend: ffmpeg
    per_disc: yes

Future work

The difference between ffmpeg linux and the beets ffmpeg replaygain backend is very sharp. I wonder why?

EBUR128 true peak mode is very slow http://underpop.online.fr/f/ffmpeg/help/ebur128.htm.gz
EBUR128 is slow
EBUR128 is slow in ffmpeg
all the console writing by ffmpeg?
Incorrect premise - beetsRG is fast and I'm observing slowness elsewhere
many SQLite db operations that are unoptimized
writing to the filesystem is the bottleneck, not reading music and computing RG info
docker bottlenecks
differences in ARM-ffmpeg implementation and ARM-beetsRG-ffmpeg implementation (but beetsRG should just be calling ffmpeg that the system provides?)
null output (PCM16) is not realistic vs. the FLAC and mp3 i commonly write (PCM is effectively real time; FLAC is more like, idk, 10x speed to compress but that shouldn't affect tag writing speed)

TLDR

Need more logging in beets replaygain to see where/if there is a serious performance issue. Need more logging around ebur128 to find how peak=true is enabled in the ffmpeg backend.

1 reply

RollingStar Nov 2, 2023
Collaborator Author

Update: Using raw debian (armbian) with no container bumped the quick test to 13.6x from 9.6x.

wisp3rwind · 2023-11-01T10:55:50Z

wisp3rwind
Nov 1, 2023
Maintainer

Some random thoughts:

The basic test I ginned up in 20 minutes doesn't do per-album tags.

At least for the ffmpeg backend, per-album replaygain shouldn't add any noticeable processing time, since the album gain is computed from the per-task data using a simple calculation.
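To illustrate why this is cheap, here is one way such a calculation could look. This is a rough sketch, not the actual beets code. It assumes each track analysis yields the sum of its gated block powers and the number of gated blocks (hypothetical inputs named here for illustration), pools them across the album, and applies the BS.1770 loudness formula. The -18 LUFS target is the assumed ReplayGain 2.0 reference level, and the sketch does not re-apply the relative gate across the whole album.

```python
import math

TARGET_LUFS = -18.0  # assumed ReplayGain 2.0 reference level


def loudness_lufs(sum_power: float, n_blocks: int) -> float:
    """BS.1770 loudness from the mean power of the gated blocks."""
    return -0.691 + 10 * math.log10(sum_power / n_blocks)


def album_gain(tracks: list[tuple[float, int]], target: float = TARGET_LUFS) -> float:
    """Combine per-track (sum_power, n_blocks) pairs into one album gain in dB."""
    total_power = sum(power for power, _ in tracks)
    total_blocks = sum(blocks for _, blocks in tracks)
    if total_blocks == 0 or total_power <= 0:
        # Nothing measurable (e.g. pure silence): apply no gain.
        return 0.0
    return target - loudness_lufs(total_power, total_blocks)
```

The only per-album work is a couple of sums and one logarithm, so it costs nothing next to decoding and filtering the audio.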

Where is true peak mode coming from?

FfmpegBackend._construct_cmd, which takes peak_method from the peak config key (which has the default value true). The ffmpeg backend seems to be the only one that supports different peak methods, however.
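As an illustration of how a peak setting ends up in the ffmpeg command line, here is a minimal sketch. It is not the real `_construct_cmd`; the function name `peak_filter_args` is hypothetical, and the only values it accepts are the two peak methods that appear in the commands in this thread (`true` and `sample`).

```python
PEAK_METHODS = ("true", "sample")


def peak_filter_args(peak_method: str = "true") -> list[str]:
    """Return the ffmpeg arguments selecting the ebur128 filter and peak method."""
    if peak_method not in PEAK_METHODS:
        raise ValueError(f"unsupported peak method: {peak_method!r}")
    # "true" here names the true-peak (oversampled) method; it is not a boolean.
    return ["-map", "a:0", "-filter", f"ebur128=peak={peak_method}"]


print(" ".join(peak_filter_args("sample")))
# -map a:0 -filter ebur128=peak=sample
```

So `peak=true` in the command is not a truthy value being coerced: it is the name of the method, and switching the setting to `sample` selects the cheaper sample-peak measurement.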

Actually, I'm not sure there's really a problem here. If I understand correctly what you're benchmarking, you're really comparing results on different machines, so it's really hard to conclude anything from the results.

Some results on my laptop (analyzing one album with 15 tracks):

❯ time find ./ -type f \( -iname \*.flac -o -iname \*.mp3 \)  -exec ffmpeg -filter_threads 1 -nostats -hide_banner -i "{}" -map a:0 -filter ebur128=peak=true -f null - \; 2> /dev/null
find ./ -type f \( -iname \*.flac -o -iname \*.mp3 \) -exec ffmpeg  1 -nostat  17,85s user 0,98s system 139% cpu 13,492 total

❯ time find ./ -type f \( -iname \*.flac -o -iname \*.mp3 \)  -exec ffmpeg -filter_threads 1 -nostats -hide_banner -i "{}" -map a:0 -filter ebur128=peak=sample -f null - \; 2> /dev/null
find ./ -type f \( -iname \*.flac -o -iname \*.mp3 \) -exec ffmpeg  1 -nostat  7,83s user 0,53s system 179% cpu 4,643 total

❯ time find ./ -type f \( -iname \*.flac -o -iname \*.mp3 \)  -exec ffmpeg -filter_threads 1 -nostats -hide_banner -i "{}" -map a:0 -af replaygain -f null - \; 2> /dev/null                    
find ./ -type f \( -iname \*.flac -o -iname \*.mp3 \) -exec ffmpeg  1 -nostat  7,91s user 0,45s system 171% cpu 4,866 total

replaygain and ebur128 are basically the same for sample peak, and true peak is ~2-3 times slower, which seems reasonable given that it is oversampling.

I think the first thing to verify here is whether your observations are reproducible using Linux on another computer (such as your gaming PC). If not (which I suspect given the numbers from my own testing), this seems like an issue with ffmpeg on ARM (inefficient implementation, specifics of the machine, or maybe just the way it's built).

EBUR128 true peak mode is very slow http://underpop.online.fr/f/ffmpeg/help/ebur128.htm.gz

EBUR128 is slow

Doesn't seem to be the case in comparison to replaygain in general, see above.

EBUR128 is slow in ffmpeg

Not on x86, but possibly on ARM? Or maybe just on this specific ARM chip: this could depend on cache sizes and available instruction set extensions. For example, I could imagine that oversampling for true peak analysis requires larger FFT window sizes, which might exceed some cache size on one machine but not the other.

all the console writing by ffmpeg?

No (it's not writing that much in fact, and also, disabling logging doesn't change timings).

Incorrect premise - beetsRG is fast and I'm observing slowness elsewhere

Easy to verify --- only test using bare ffmpeg and see whether the issue persists (as I think you've already done with the "linux ffmpeg EBUR128 (same as beets)" case).

many SQLite db operations that are unoptimized

Again, benchmark without beets as for the previous point. Also, even though our database usage is inefficient, it's not that bad for inserting or updating a few items.

writing to the filesystem is the bottleneck, not reading music and computing RG info

Maybe: Just sequential reads after all, but I could imagine that this affects FLAC on HDD. Unlikely for MP3 or on SSD. Should however affect replaygain and ebur in the same way. Maybe get a baseline here using time find ./ -type f \( -iname \*.flac -o -iname \*.mp3 \) -exec dd if="{}" of=/dev/null bs=128K status=progress \;.
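If dd is awkward to use (for example on busybox), the same read-only baseline can be sketched in Python. The function name `read_all` is made up for this example; it reads each file in 128 KiB chunks, like `bs=128K` above, and throws the data away.

```python
import time


def read_all(paths, chunk_size: int = 128 * 1024):
    """Read every file to the end, discarding the data.

    Returns (total_bytes, elapsed_seconds).
    """
    total = 0
    start = time.perf_counter()
    for path in paths:
        with open(path, "rb") as handle:
            # Sequential reads only; nothing is decoded or written.
            while chunk := handle.read(chunk_size):
                total += len(chunk)
    return total, time.perf_counter() - start
```

Dividing the returned byte count by the elapsed seconds gives raw read throughput. Note that a second run over the same files may be served from the OS page cache and look much faster than the disk really is.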

docker bottlenecks

Maybe?

differences in ARM-ffmpeg implementation and ARM-beetsRG-ffmpeg implementation (but beetsRG should just be calling ffmpeg that the system provides?)

But you don't observe such differences, do you?

null output (PCM16) is not realistic vs. the FLAC and mp3 i commonly write (PCM is effectively real time; FLAC is more like, idk, 10x speed to compress but that shouldn't affect tag writing speed)

Not sure I understand the point here. If you already observe slowness using the null output, it can only be worse for compressed output? Also, in beets, RG analysis is decoupled from anything the convert plugin does, so timings for both should just add up?

1 reply

RollingStar Nov 1, 2023
Collaborator Author

Updated the reply with new tests at the top.

In RGv1: Gaming PC is 10.5x faster than my ARM box.
In EBRU true peak: Gaming PC is 34.9x faster than my ARM box.

With the dd benchmark you gave, I get 1163x on the ARM box. For the reader, this reads every FLAC/MP3 file and discards the data by writing it to the null device (/dev/null), so it measures raw read speed.

To clarify, that "future work" section is every idea I had - including ones that don't make much sense given the observations.

My takeaway for my setup is probably:

disable true peak RG on import
set up a cron script to RG non-RG albums overnight (so time doesn't matter)
determine if true peak is all that important, and figure out what the default should be for beets (edit: tldr true peak is important) https://hydrogenaud.io/index.php/topic,124977.0.html

diizzyy · 2023-11-27T18:01:20Z

diizzyy
Nov 27, 2023

If you want some more data points you also have https://github.com/sdroege/ebur128 that can be utilized via complexlogic/rsgain#61 or likely any other utility that supports libebur128
