
Don't use Spark's slow wholeTextFiles() for LDA data fetches from S3 #49

Open

nh2 opened this issue Aug 10, 2016 · 7 comments


nh2 commented Aug 10, 2016

sparkle-example-lda seems insanely slow (7 minutes locally, 2 minutes on an EMR cluster with 1 master and 2 m3.xlarge workers); e.g. the first zipWithIndex takes multiple minutes when run locally, and I'm not sure why.

No CPU time is used in htop.


nh2 commented Aug 10, 2016

Example of slowness on EMR:

[screenshot from 2016-08-10 04-45-30]


mboes commented Aug 10, 2016

Not sure what's going on here. The next step is to profile the Scala version of the code on the same dataset, to rule out sparkle being the culprit. zipWithIndex is a trivial wrapper around the Java RDD.zipWithIndex() method, so this could be an upstream issue.
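
For what it's worth, here is a minimal Scala sketch of the pipeline in question (the bucket path is a placeholder, not the example's actual configuration). It shows why the slowness would surface at zipWithIndex: on an RDD with more than one partition, zipWithIndex launches a Spark job to count the elements per partition, so this is the first point at which all the file reads actually happen:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("lda-timing"))

// wholeTextFiles yields one (path, contents) pair per file;
// "s3n://some-bucket/nyt/" stands in for wherever the dataset lives.
val docs = sc.wholeTextFiles("s3n://some-bucket/nyt/").map(_._2)

// zipWithIndex is not free: it runs a job to count the elements of
// every partition but the last, which forces every file to be read.
val indexed = docs.zipWithIndex()
```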

I take it you tried this using the default nyt dataset? It could also be the latency of reading a bunch of files from an S3 bucket.


nh2 commented Aug 10, 2016

> I take it you tried this using the default nyt dataset?

Yes.

> It could also be the latency of reading a bunch of files from an S3 bucket.

Is there a way to disable any potential file fetching laziness and ensure that the S3 files are downloaded at the beginning of the pipeline?
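
One way to check (a sketch, not code from sparkle-example-lda; `sc` and the bucket path are assumed) is to persist the raw documents and force a cheap action up front, so the downloads happen once at a known point and later stages read from the cache:

```scala
import org.apache.spark.storage.StorageLevel

// Persist the raw documents, then force an action so the S3 fetches
// happen here rather than lazily inside a later stage.
val docs = sc.wholeTextFiles("s3n://some-bucket/nyt/")
  .persist(StorageLevel.MEMORY_AND_DISK)

val t0 = System.nanoTime()
docs.count() // triggers the actual downloads
println(f"S3 fetch took ${(System.nanoTime() - t0) / 1e9}%.1fs")
```

If the time is dominated by that first count(), the problem is the fetch itself, not anything downstream.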


mboes commented Aug 10, 2016 via email


mboes commented Aug 10, 2016

OK, confirmed. From S3, the run takes 7:15 minutes on my laptop too. But if I first download the files locally, the run takes exactly 1 minute. The nyt dataset contains 500+ files to process. Looks to me like Spark's S3 client could be much faster at downloading many small files (the aws s3 CLI utility is pretty quick in comparison).


mboes commented Aug 10, 2016

Looks like a known issue: http://tech.kinja.com/how-not-to-pull-from-s3-using-apache-spark-1704509219. Haven't yet found an upstream ticket to track a resolution though.


mboes commented Aug 15, 2016

The above-mentioned link has sample code for fetching data from S3 using the AmazonS3 library directly rather than through Spark, as a workaround: https://gist.githubusercontent.com/pjrt/f1cad93b154ac8958e65/raw/7b0b764408f145f51477dc05ef1a99e8448bce6d/S3Puller.scala. Feel free to submit a PR Haskellizing it.
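
For reference, a sketch of that approach in Scala (bucket and prefix are placeholders, and it assumes the aws-java-sdk is on the classpath); the linked gist remains the authoritative version:

```scala
import com.amazonaws.services.s3.AmazonS3Client
import scala.collection.JavaConverters._
import scala.io.Source

val bucket = "some-bucket"
val prefix = "nyt/"

// List the keys once on the driver. Note: listObjects returns at most
// 1000 keys per call, which covers the ~500-file nyt set; larger
// buckets would need to follow the pagination markers.
val keys = new AmazonS3Client()
  .listObjects(bucket, prefix)
  .getObjectSummaries.asScala
  .map(_.getKey)
  .toSeq

// Fetch the objects on the executors with the AWS SDK directly,
// bypassing wholeTextFiles(). The client is created per partition
// because it is not serializable.
val docs = sc.parallelize(keys).mapPartitions { part =>
  val s3 = new AmazonS3Client()
  part.map { key =>
    val obj = s3.getObject(bucket, key)
    try Source.fromInputStream(obj.getObjectContent, "UTF-8").mkString
    finally obj.close()
  }
}
```

A Haskell version for sparkle would presumably follow the same shape: list the keys once, then do the fetches inside the equivalent of mapPartitions.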

@mboes mboes changed the title sparkle-example-lda is extremely slow Don't use Spark's slow wholeTextFiles() for LDA data fetches from S3 Aug 15, 2016