spark-stratifier

When we first started working with Spark at HackerRank, we realized that the sizes of the outcome sets in our dataset varied considerably. This imbalance led to inconsistent model cross validation and training. With stratified sampling, we were able to eliminate these inconsistencies and improve overall model predictions. The goal of spark-stratifier is to provide a tool to stratify datasets for cross validation in PySpark. Its StratifiedCrossValidator class extends Spark's existing CrossValidator class.

Currently, the stratified cross validator works with binary classification problems using labels 0 and 1.
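To make the idea concrete, here is a minimal pure-Python sketch of stratified fold assignment for 0/1 labels. It illustrates the concept only and is not the library's actual implementation; the function name stratified_folds and the toy data are hypothetical. Each class is shuffled separately and dealt round-robin across the folds, so every fold keeps roughly the same share of each label as the full dataset.

```python
import random
from collections import Counter, defaultdict


def stratified_folds(labels, num_folds, seed=0):
    """Return a fold number for each row so that every fold keeps roughly
    the same share of each label as the full dataset.

    Hypothetical helper for illustration; not part of spark-stratifier.
    """
    rng = random.Random(seed)

    # Group row indices by label (0 or 1).
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)

    fold_of = [None] * len(labels)
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        # Deal the shuffled rows of this class round-robin across the folds.
        for position, index in enumerate(indices):
            fold_of[index] = position % num_folds
    return fold_of


# Imbalanced toy data: 90 negatives and 10 positives.
labels = [0] * 90 + [1] * 10
fold_of = stratified_folds(labels, num_folds=5)

# Count the positives and the total rows that landed in each fold.
positives_per_fold = Counter(f for f, y in zip(fold_of, labels) if y == 1)
rows_per_fold = Counter(fold_of)
```

With plain random splitting, a fold of this toy dataset could end up with no positives at all; dealing each class out separately gives every fold the same number of positive rows.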

Requirements

This tool is pure Python; its only main requirements are numpy and pyspark.

Installation

$ pip install spark-stratifier

Example

Use it exactly as you would the Spark CrossValidator. The only difference is that your data will be stratified.

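from spark_stratifier import StratifiedCrossValidator

# pipeline, paramGrid, and evaluator are assumed to be defined already,
# the same way you would define them for Spark's CrossValidator.
scv = StratifiedCrossValidator(
    estimator=pipeline,
    estimatorParamMaps=paramGrid,
    evaluator=evaluator,
    numFolds=8
)

# matrix is the training dataset, with labels 0 and 1.
model = scv.fit(matrix)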

Contributing

If you want to write some code and contribute to this project, go ahead and open a pull request. We hope this tool is useful for the community, and we'd love to hear how it helps solve your problems!
