An extensible toolset for Spark performance benchmarking.
Currently available Spark jobs (including dataset generators):
| Data Type | Algorithm |
|---|---|
| Vector | KMeans |
| Vector | LinearRegression |
| Vector | LogisticRegression |
| Tabular | GroupByCount |
| Tabular | Join |
| Tabular | SelectWhereOrderBy |
| Text | Grep |
| Text | Sort |
| Text | WordCount |
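As an illustration of what one of these jobs looks like, here is a minimal WordCount sketch. The object name, argument layout, and use of the RDD API are assumptions for illustration; the repository's actual implementations may be structured differently.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical sketch of a WordCount job; not the repository's actual code.
object WordCount {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("WordCount").getOrCreate()
    val sc = spark.sparkContext

    // args(0): input text path, args(1): output directory
    val counts = sc.textFile(args(0))
      .flatMap(_.split("\\s+"))   // tokenize on whitespace
      .filter(_.nonEmpty)
      .map(word => (word, 1))
      .reduceByKey(_ + _)         // sum counts per word

    counts.saveAsTextFile(args(1))
    spark.stop()
  }
}
```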
To compile the jobs to a jar file:

```sh
cd spark
sbt package
```
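For `sbt package` to work, the build definition needs a Spark dependency. A minimal `build.sbt` could look like the sketch below; the project name, Scala version, and Spark version are assumptions, so check the repository's actual build file.

```scala
// Hypothetical build.sbt sketch; versions are assumptions.
name := "spark-benchmarks"
scalaVersion := "2.12.18"

// "provided" keeps Spark itself out of the packaged jar,
// since spark-submit supplies it at runtime.
libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.5.0" % "provided"
```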
- Adjust `run_scripts/submit_local_job` to your local setup and execute it.
- Later, you can extend the script to submit jobs to a cluster that is available to you, whether in a public cloud or an on-premise setup.
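A local submission script typically reduces to a `spark-submit` call along these lines. The jar path, main class, and master URL are assumptions about the repository's layout, so adjust them to match your build output.

```sh
#!/bin/sh
# Hypothetical local submission; adjust paths and class names to your setup.
spark-submit \
  --master "local[*]" \
  --class WordCount \
  target/scala-2.12/spark-benchmarks_2.12-0.1.jar \
  input.txt output/
```

Extending this to a cluster usually means swapping the `--master` value (for example to a YARN or standalone master URL) and pointing the input and output paths at shared storage.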