When we first started working with Spark at HackerRank, we realized that the outcome sets in our dataset varied quite a bit in size. This led to inconsistent model cross validation and training. With stratified sampling, we were able to eliminate these inconsistencies and improve overall model predictions. The goal of spark-stratifier is to provide a tool to stratify datasets for cross validation in PySpark. Its StratifiedCrossValidator class extends Spark's existing CrossValidator class.
Currently, the stratified cross validator works with binary classification problems using labels 0 and 1.
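To make the idea of stratification concrete, here is a minimal plain-Python sketch of how rows with labels 0 and 1 can be dealt into folds so that each fold keeps roughly the same class ratio. It illustrates the general technique only and is not spark-stratifier's actual implementation; the `stratified_folds` helper, its parameters, and the sample labels are hypothetical names made up for this example.

```python
import random
from collections import defaultdict


def stratified_folds(labels, num_folds, seed=0):
    """Assign each row index to a fold so every fold keeps roughly the same label ratio."""
    rng = random.Random(seed)

    # Group row indices by their label (0 or 1).
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)

    folds = [[] for _ in range(num_folds)]
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        # Deal the shuffled indices round-robin so each fold gets an even share of this class.
        for position, index in enumerate(indices):
            folds[position % num_folds].append(index)
    return folds


# Hypothetical imbalanced dataset: 80 negatives (0) and 16 positives (1).
labels = [0] * 80 + [1] * 16
folds = stratified_folds(labels, num_folds=8)
for fold in folds:
    positives = sum(labels[i] for i in fold)
    print(len(fold), positives)
```

With a plain random split, some folds of such a dataset could end up with very few positives, which is the kind of inconsistency described above. Dealing each class separately keeps the per-fold ratio stable.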
Read more at engineering.hackerrank.com
This tool is 100% Python, and its only requirements are numpy and pyspark.
$ pip install spark-stratifier
Use this exactly as you would the Spark CrossValidator, except that this time your data will be stratified.
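from spark_stratifier import StratifiedCrossValidator

# pipeline, paramGrid, and evaluator are defined elsewhere, the same way
# you would define them for Spark's CrossValidator.
scv = StratifiedCrossValidator(
    estimator=pipeline,
    estimatorParamMaps=paramGrid,
    evaluator=evaluator,
    numFolds=8
)

# matrix is the training dataset passed to fit.
model = scv.fit(matrix)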
If you want to write some code and contribute to this project, go ahead and start a pull request. We hope this tool is useful for the community and we'd love to hear about how this helps solve your problems!