
Evaluation of brozzler's scalability? #250

Open

goelayu opened this issue Aug 11, 2022 · 4 comments

@goelayu

goelayu commented Aug 11, 2022

I am curious whether there is any data on how well brozzler scales as the number of parallel browsers increases.
In my current (very limited) test bed, brozzler takes an extremely long time to crawl web pages and store the corresponding resources.

Attaching some results from an attempt to crawl 20 random web pages with brozzler using a headless Chrome browser.
[Figure: scalability results]

I also track all system resource usage (CPU, network, disk). I am currently running this experiment on a 32-core Linux server with a 1 Gbps NIC, storing data on an HDD with a read/write throughput of 150-200 MB/s.
[Figures: network, disk, and CPU usage]

As you can see, none of these resources is saturated, yet brozzler takes on average ~40-50 s to crawl and store a single page. Furthermore, the low CPU usage is especially concerning, since in my experience increasing the number of parallel browsers increases the overall CPU usage of the system roughly linearly. Could this be due to the proxy server used by brozzler?
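
For context, the sampling behind the plots above can be reproduced with a loop along the following lines (a hypothetical sketch using psutil; not the exact script used for these numbers):

```python
# Hypothetical resource-sampling sketch (psutil assumed); not the exact
# script behind the plots above.
import csv
import time

import psutil

def sample(interval=1.0, out_path="usage.csv"):
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["ts", "cpu_percent", "net_sent", "net_recv",
                         "disk_read", "disk_write"])
        while True:
            cpu = psutil.cpu_percent(interval=interval)  # blocks for `interval`
            net = psutil.net_io_counters()
            disk = psutil.disk_io_counters()
            writer.writerow([time.time(), cpu, net.bytes_sent, net.bytes_recv,
                             disk.read_bytes, disk.write_bytes])
            f.flush()

if __name__ == "__main__":
    sample()
```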

Also, when I crawl the same corpus of pages with an extremely lightweight custom Node.js crawler (written on top of puppeteer), it finishes about 10x faster than the timings observed above.

@vbanos
Contributor

vbanos commented Aug 12, 2022

  1. You are using warcprox for archiving, right? Have you checked its /status endpoint to see the state of its queues (see the sketch after this list)?
    This is probably the bottleneck.
  2. You need to show us the exact Python code you are running for your experiment. Maybe you are doing something in a sub-optimal way.
  3. Is your "custom Node.js-based crawler" also using warcprox for archiving? If not, the comparison is not fair.
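
For item 1, polling the /status endpoint can look roughly like this (a minimal sketch, assuming warcprox is listening on localhost:8000; the exact fields in the JSON vary by version):

```python
# Rough sketch: fetch warcprox's /status JSON through the proxy itself.
# Assumes warcprox listens on localhost:8000; adjust to your setup.
import json
import urllib.request

WARCPROX = "http://localhost:8000"

opener = urllib.request.build_opener(
    urllib.request.ProxyHandler({"http": WARCPROX}))
with opener.open(WARCPROX + "/status") as resp:
    status = json.load(resp)

# Queue sizes are the interesting numbers when hunting a bottleneck.
print(json.dumps(status, indent=2))
```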

@justquick

Hi, I'm new to brozzler/warcprox and have noticed something that could cause a slowdown at scale. This networked, pipelined system does a lot of reads and writes: HTTP traffic, WARC file I/O, and RethinkDB TCP connections. The setup will eventually become I/O-limited at scale and hit GIL thrashing pretty quickly. Digging into the code, I see some ThreadPoolExecutor-based parallelization, but it probably won't help much when scaling, and in the worst case it could cause race conditions or other unexpected behavior. There is limited use of asyncio and the modern concurrency features of recent Python versions (in fact, only the warcprox benchmark script uses them fully).

IMO this framework needs to step up with modern concurrent Python patterns and replace these blocking I/O touchpoints:

  • urllib requests -> httpx or requests-asyncio
  • builtin file I/O (open) -> aiofiles/anyio
  • sqlite3 -> aiosqlite
  • subprocess -> asyncio.subprocess
  • rethinkdb connections -> ... (I'm sure there's something out there)

I realize this is a big effort across multiple repos. The individual swaps are fairly straightforward, but bringing full async support to a Python codebase is a big lift compared to Node.js, which was built for these use cases. A rough sketch of the pattern is below.
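
To make the idea concrete, here is what one of those swaps might look like, using httpx and aiofiles as stand-ins (neither is currently a brozzler/warcprox dependency; this is illustrative, not a proposed patch):

```python
# Illustrative only: the kind of non-blocking fetch-and-write the list above
# suggests, with httpx replacing blocking urllib and aiofiles replacing open().
import asyncio

import aiofiles
import httpx

async def fetch_and_store(client, url, path):
    resp = await client.get(url)           # async HTTP request
    async with aiofiles.open(path, "wb") as f:
        await f.write(resp.content)        # async file write

async def main():
    # Many fetches share one event loop; blocking waits no longer pin threads.
    async with httpx.AsyncClient() as client:
        await asyncio.gather(
            fetch_and_store(client, "https://example.com/", "page1.bin"),
            fetch_and_store(client, "https://example.org/", "page2.bin"),
        )

asyncio.run(main())
```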

@TheTechRobo

The RethinkDB Python driver supports asyncio out of the box:

https://github.com/rethinkdb/rethinkdb-python#asyncio-mode
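
A minimal sketch of that asyncio mode, per the linked README (the database and table names here are placeholders):

```python
# Sketch of rethinkdb-python's asyncio mode as described in the linked README;
# db/table names are placeholders.
import asyncio

from rethinkdb import r

async def main():
    r.set_loop_type("asyncio")             # switch the driver to asyncio mode
    conn = await r.connect(db="test")      # connect() becomes awaitable
    count = await r.table("sites").count().run(conn)
    print("rows:", count)
    await conn.close()

asyncio.run(main())
```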

@anjackson

@goelayu are you running one Brozzler worker and trying to scale up the browser pool? Or multiple Brozzler workers? I don't work on Brozzler, but my understanding is that to scale things up, you are supposed to run multiple workers.
