
Evaluation of brozzler's scalability? #250

Open

goelayu opened this issue Aug 11, 2022 · 4 comments

@goelayu

goelayu commented Aug 11, 2022

I am curious whether there is any data on how well brozzler scales as the number of parallel browsers increases.
In my current (very limited) test bed, brozzler takes an extremely long time to crawl web pages and store the corresponding resources.

Attaching some results from an attempt to crawl 20 random web pages with brozzler using a headless Chrome browser.
[Figure: scalability results]

I also track all system resource usage (CPU, network, disk). I am currently running this experiment on a 32-core Linux server with a 1 Gbps NIC, storing data on an HDD with a read/write throughput of 150-200 MB/s.
[Figures: network, disk, and CPU usage]

As you can see, none of these resources is saturated, yet brozzler takes on average ~40-50 s to crawl and store a single page. Furthermore, the low CPU usage is especially concerning, since in my experience increasing the number of parallel browsers increases the overall CPU usage of the system roughly linearly. Could this be due to the proxy server used by brozzler?
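
For context, the sampling behind the plots above can be reproduced with a loop along the following lines (a hypothetical sketch using psutil; not the exact script used for these numbers):

```python
# Hypothetical resource-sampling sketch (psutil assumed); not the exact
# script behind the plots above.
import csv
import time

import psutil

def sample(interval=1.0, out_path="usage.csv"):
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["ts", "cpu_percent", "net_sent", "net_recv",
                         "disk_read", "disk_write"])
        while True:
            cpu = psutil.cpu_percent(interval=interval)  # blocks for `interval`
            net = psutil.net_io_counters()
            disk = psutil.disk_io_counters()
            writer.writerow([time.time(), cpu, net.bytes_sent, net.bytes_recv,
                             disk.read_bytes, disk.write_bytes])
            f.flush()

if __name__ == "__main__":
    sample()
```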

Also, when I crawl the same corpus of pages with an extremely lightweight custom Node.js crawler (written on top of puppeteer), it finishes about 10x faster than the timings observed above.

@vbanos
Contributor

vbanos commented Aug 12, 2022

  1. You are using warcprox for archiving, right? Have you checked its /status endpoint to see the state of its queues (see the sketch after this list)?
    This is probably the bottleneck.
  2. You need to show us the exact Python code you are running for your experiment. Maybe you are doing something in a sub-optimal way.
  3. Is your "custom Node.js-based crawler" also using warcprox for archiving? If not, the comparison is not fair.
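
For item 1, polling the /status endpoint can look roughly like this (a minimal sketch, assuming warcprox is listening on localhost:8000; the exact fields in the JSON vary by version):

```python
# Rough sketch: fetch warcprox's /status JSON through the proxy itself.
# Assumes warcprox listens on localhost:8000; adjust to your setup.
import json
import urllib.request

WARCPROX = "http://localhost:8000"

opener = urllib.request.build_opener(
    urllib.request.ProxyHandler({"http": WARCPROX}))
with opener.open(WARCPROX + "/status") as resp:
    status = json.load(resp)

# Queue sizes are the interesting numbers when hunting a bottleneck.
print(json.dumps(status, indent=2))
```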

@justquick

Hi, I'm new to brozzler/warcprox and have noticed something that could cause a slowdown at scale. This networked, pipelined system does a lot of reads and writes: HTTP traffic, WARC file I/O, and RethinkDB TCP connections. The setup will eventually become I/O-limited at scale and hit GIL thrashing pretty quickly. Digging into the code, I see some ThreadPoolExecutor-based parallelization, but it probably won't help much when scaling, and in the worst case it could cause race conditions or other unexpected behavior. There is limited use of asyncio and the modern concurrency features of recent Python versions (in fact, only the warcprox benchmark script uses them fully).

IMO this framework needs to step up with modern concurrent Python patterns and replace these blocking I/O touchpoints:

  • urllib requests -> httpx or requests-asyncio
  • builtin file I/O (open) -> aiofiles/anyio
  • sqlite3 -> aiosqlite
  • subprocess -> asyncio.subprocess
  • rethinkdb connections -> ... (I'm sure there's something out there)

I realize this is a big effort across multiple repos. The individual swaps are fairly straightforward, but bringing full async support to a Python codebase is a big lift compared to Node.js, which was built for these use cases. A rough sketch of the pattern is below.
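
To make the idea concrete, here is what one of those swaps might look like, using httpx and aiofiles as stand-ins (neither is currently a brozzler/warcprox dependency; this is illustrative, not a proposed patch):

```python
# Illustrative only: the kind of non-blocking fetch-and-write the list above
# suggests, with httpx replacing blocking urllib and aiofiles replacing open().
import asyncio

import aiofiles
import httpx

async def fetch_and_store(client, url, path):
    resp = await client.get(url)           # async HTTP request
    async with aiofiles.open(path, "wb") as f:
        await f.write(resp.content)        # async file write

async def main():
    # Many fetches share one event loop; blocking waits no longer pin threads.
    async with httpx.AsyncClient() as client:
        await asyncio.gather(
            fetch_and_store(client, "https://example.com/", "page1.bin"),
            fetch_and_store(client, "https://example.org/", "page2.bin"),
        )

asyncio.run(main())
```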

@TheTechRobo

The RethinkDB Python driver supports asyncio out of the box:

https://github.com/rethinkdb/rethinkdb-python#asyncio-mode
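
A minimal sketch of that asyncio mode, per the linked README (the database and table names here are placeholders):

```python
# Sketch of rethinkdb-python's asyncio mode as described in the linked README;
# db/table names are placeholders.
import asyncio

from rethinkdb import r

async def main():
    r.set_loop_type("asyncio")             # switch the driver to asyncio mode
    conn = await r.connect(db="test")      # connect() becomes awaitable
    count = await r.table("sites").count().run(conn)
    print("rows:", count)
    await conn.close()

asyncio.run(main())
```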

@anjackson

@goelayu are you running one Brozzler worker and trying to scale up the browser pool? Or multiple Brozzler workers? I don't work on Brozzler, but my understanding is that to scale things up, you are supposed to run multiple workers.
