We can easily support multithreading here by having multiple threads run the download method of the xblock_extractor objects. However, we do have videos from youtube_dl which need to go through a separate queue (as that's throttled). So we need to handle that carefully here, since multithreading drastically improves the performance of this scraper. Maybe we can have a main multithreaded process (because it makes many HTTP requests) and handle youtube separately.
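For illustration, here is a minimal sketch of that split, assuming the xblock_extractor objects expose a `download()` method and a hypothetical `download_video()` helper that wraps youtube_dl (both names are assumptions, not the scraper's actual API):

```python
# Sketch only: a wide pool for unthrottled HTTP downloads, plus a single
# dedicated worker so youtube_dl requests stay sequential (throttled).
import queue
import threading
from concurrent.futures import ThreadPoolExecutor

def download_all(xblock_extractors, youtube_video_urls, download_video):
    with ThreadPoolExecutor(max_workers=10) as pool:
        # Main pool: many concurrent HTTP downloads.
        futures = [pool.submit(x.download) for x in xblock_extractors]

        # YouTube queue: one worker enforces sequential, throttled access.
        yt_queue = queue.Queue()
        for url in youtube_video_urls:
            yt_queue.put(url)

        def yt_worker():
            while True:
                try:
                    url = yt_queue.get_nowait()
                except queue.Empty:
                    return
                download_video(url)  # assumed helper wrapping youtube_dl

        yt_thread = threading.Thread(target=yt_worker)
        yt_thread.start()

        for future in futures:
            future.result()  # re-raise any download error
        yt_thread.join()
```

Keeping the youtube queue on a single worker is what enforces the throttling; the main pool size only affects the unthrottled HTTP requests.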
Agreed. Thanks for your experiments with multiprocessing.
This is very similar to other scrapers in that we have several concurrent usages (a rough executor mapping follows the list):

- long CPU-intensive work we don't want to supervise (ffmpeg)
- CPU-intensive work we do want to supervise (image optimization)
- unthrottled downloads
- throttled downloads
- unthrottled uploads
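To make those categories concrete, here is one possible mapping onto Python executors; this is a sketch under assumptions, not how any of our scrapers is currently wired, and the pool sizes are placeholders to be tuned by measurement:

```python
# Sketch: each workload class gets the concurrency primitive that fits it.
import subprocess
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

# 1. long CPU-intensive work we don't supervise: detach an ffmpeg process
def reencode(src, dst):
    subprocess.Popen(["ffmpeg", "-i", src, dst])  # fire and forget

# 2. CPU-intensive work we do supervise: a process pool returns futures
image_pool = ProcessPoolExecutor(max_workers=4)

# 3 & 5. unthrottled downloads/uploads: I/O-bound, many threads are safe
io_pool = ThreadPoolExecutor(max_workers=16)

# 4. throttled downloads (youtube_dl): one worker keeps requests sequential
throttled_pool = ThreadPoolExecutor(max_workers=1)
```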
That's a lot of requirements, which calls for flexibility. Also, we should definitely assess our S3 performance before getting into this: we need to know where the bottlenecks are and which methods deliver best for those download/upload use cases.
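A hedged micro-benchmark along these lines could locate where adding workers stops helping; it assumes boto3 (our actual upload path may differ), and the bucket name and file list are placeholders:

```python
# Sketch: time the same upload batch at several worker counts.
import time
from concurrent.futures import ThreadPoolExecutor

import boto3  # assumption: boto3 is the S3 client in use

s3 = boto3.client("s3")

def upload_all(files, bucket, workers):
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for path in files:
            pool.submit(s3.upload_file, path, bucket, path)
    return time.perf_counter() - start

# Compare e.g. 1, 4 and 16 workers on the same file set to find the knee.
```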
All of this makes it quite complex, which is why I think we should attempt to solve it on a less fragile scraper first (youtube?) and then document and replicate the approach onto the others.
This issue has been automatically marked as stale because it has not had recent activity. It will now be reviewed manually. Thank you for your contributions.