Add docs for reproducing sample from BigQuery #700

Merged 5 commits on Feb 9, 2024
16 changes: 13 additions & 3 deletions ratatool-sampling/README.md
@@ -6,7 +6,7 @@ Diffy contains record sampling classes for Avro, Parquet, and BigQuery. Supporte
# BigSampler

BigSampler will run a [Scio](https://github.com/spotify/scio) pipeline sampling either Avro or BigQuery data.
It also allows specifying a hash function (either [FarmHash](https://github.com/google/farmhash) or Murmur) with seed (if applicable for
your hash) and fields to hash for deterministic cohort selection.

For full details see [BigSampler.scala](https://github.com/spotify/ratatool/blob/master/ratatool-sampling/src/main/scala/com/spotify/ratatool/samplers/BigSampler.scala)
@@ -85,6 +85,16 @@ Leveraging `--fields=<field1,field2,...>` BigSampler can produce a hash based on
are in the sample. For example, `--fields=user_id --sample=0.5` will always produce the same sample
of 50% of users. If multiple records contain the same `user_id` they will all be in or out of the
sample.
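
The in-or-out behaviour described above can be illustrated with a minimal Python sketch. It is not BigSampler's implementation: BLAKE2b stands in for FarmHash (so the selected cohort will differ from the real tool), and `hash_fraction` and `in_sample` are hypothetical names used only for illustration.

```python
import hashlib
import struct


def hash_fraction(seed: int, *field_values: str) -> float:
    """Map (seed, field values) to a repeatable fraction in [0.0, 1.0]."""
    # Assumption: BLAKE2b truncated to 8 bytes is a stand-in hash, NOT FarmHash.
    payload = struct.pack("<i", seed) + b"".join(
        value.encode("utf-8") for value in field_values
    )
    digest = hashlib.blake2b(payload, digest_size=8).digest()
    return int.from_bytes(digest, "little") / (2**64 - 1)


def in_sample(record: dict, fields: list, fraction: float, seed: int = 0) -> bool:
    """Decide membership from the hashed fields only, never from other columns."""
    values = [str(record[name]) for name in fields]
    return hash_fraction(seed, *values) < fraction


records = [
    {"user_id": "u1", "event": "play"},
    {"user_id": "u1", "event": "pause"},
    {"user_id": "u2", "event": "play"},
]
decisions = [in_sample(r, ["user_id"], 0.5) for r in records]
```

Because only `user_id` feeds the hash, the first two records always receive the same decision, and rerunning the script reproduces the same cohort.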

### Reproducing within BigQuery
Currently, BigSampler defaults to FarmHash, which is also used in BigQuery. When sampling with a seed and one or more fields,
under the hood the FarmHash hasher creates a byte array, converts all inputs to bytes, and concatenates them together. To recreate this in BigQuery, you
will have to pre-create the seed as a little-endian, hex-encoded byte string, as BigQuery does not currently allow directly converting an integer
to bytes.

Review comments on the first line of this paragraph:

Contributor: Could you link to FarmHash? Also, it looks like the FarmHash repo has been archived. Considering it's used in BigQuery, it shouldn't be much of an issue, but is it possible that we should explore an alternate hashing algorithm in the future?

Contributor Author: I think if we do, it would be for the distribution work in #699, or if FarmHash stops working. At the moment it's still supported through BQ, so I'm not super worried.

Contributor Author: Added a link earlier.

`FARM_FINGERPRINT(CONCAT(b'\x2A\x00\x00\x00', CAST('abc' AS BYTES)))` will produce the equivalent hash of `--seed=42` with a single field in `--fields`, where the given record has the value `abc` for that field.
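
Writing the seed bytes by hand is error-prone, so a small helper can generate them. The Python sketch below assumes the seed is a 32-bit signed integer encoded in 4 little-endian bytes, which matches the `--seed=42` example above, and that every hashed field is a string-valued expression. The function names are hypothetical and not part of ratatool.

```python
import struct


def seed_bytes_literal(seed: int) -> str:
    """Render a seed as a BigQuery bytes literal in little-endian order."""
    # Assumption: 4-byte signed little-endian, as in the --seed=42 example.
    raw = struct.pack("<i", seed)
    return "b'" + "".join(f"\\x{byte:02X}" for byte in raw) + "'"


def farm_fingerprint_expr(seed: int, field_exprs: list) -> str:
    """Build the FARM_FINGERPRINT expression for string-valued field expressions."""
    parts = [seed_bytes_literal(seed)]
    parts += [f"CAST({expr} AS BYTES)" for expr in field_exprs]
    return "FARM_FINGERPRINT(CONCAT(" + ", ".join(parts) + "))"


print(farm_fingerprint_expr(42, ["'abc'"]))
```

Passing a column name such as `user_id` instead of the literal `'abc'` yields an expression that can be pasted into a query. How non-string fields are converted to bytes is not covered here.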

To produce exactly the same sample as BigSampler, the output must also be normalized from the range [Long.MinValue, Long.MaxValue] to the range [0.0, 1.0].

## Sampling a Distribution
BigSampler supports sampling to produce either a Stratified or Uniform distribution.
@@ -111,7 +121,7 @@ Distribution sampling currently assumes all distinct keys or strata can fit into
## Distributions
### Stratified
![Stratified](https://github.com/spotify/ratatool/blob/master/misc/Stratified.png)
Stratified sampling example. Note that only the specified distributionFields are preserved in the sample.

![Uniform](https://github.com/spotify/ratatool/blob/master/misc/Uniform.png)
Uniform sampling example. Adjusts input to produce an even output distribution if possible.