From 434a4fb6830b4d2cae1f12f9eb5f38c82308ccdd Mon Sep 17 00:00:00 2001
From: Idrees Khan
Date: Thu, 1 Feb 2024 11:14:35 -0500
Subject: [PATCH 1/5] Add docs for reproducing sample in BigQuery

---
 ratatool-sampling/README.md | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/ratatool-sampling/README.md b/ratatool-sampling/README.md
index 8071812f..02514e3e 100644
--- a/ratatool-sampling/README.md
+++ b/ratatool-sampling/README.md
@@ -85,6 +85,16 @@ Leveraging `--fields=` BigSampler can produce a hash based on
 are in the sample. For example, `--fields=user_id --sample=0.5` will always produce the same sample of 50% of users. If multiple records contain the same `user_id` they will all be in or out of the sample.
+
+### Reproducing within BigQuery
+Currently, BigSampler defaults to Farmhash, which is also used in BigQuery. When sampling with a seed and one or more fields,
+ under the hood Farmhash will create a byte array, convert all inputs to bytes, and concatenate them together. To recreate this in BigQuery, you
+ will have to pre-create the seed as a little endian hex encoded byte string, as BigQuery does not currently allow directly converting an integer
+ to bytes.
+
+`FARM_FINGERPRINT(CONCAT(b'\x2A\x00\x00\x00', b"abc"))` will produce the equivalent hash of `--seed=42` with one `fields` where the given record has value `abc`.
+
+The output will also need to be normalized to the range [0.0, 1.0] from the range [Long.MinValue, Long.MaxValue] in order to produce the exact equivalent sample as BigSampler.
 
 ## Sampling a Distribution
 BigSampler supports sampling to produce either a Stratified or Uniform distribution.
@@ -114,4 +124,4 @@ Distribution sampling currently assumes all distinct keys or strata can fit into
 Stratified sampling example. Not that only the specified distributionFields are preserved in the sample.
 
 ![Uniform](https://github.com/spotify/ratatool/blob/master/misc/Uniform.png)
-Uniform sampling example.
Adjusts
\ No newline at end of file
+Uniform sampling example. Adjusts

From c631d3c2218a55e9b8b611c46ae333cc3f75d071 Mon Sep 17 00:00:00 2001
From: Idrees Khan
Date: Thu, 1 Feb 2024 11:18:04 -0500
Subject: [PATCH 2/5] Update README.md

---
 ratatool-sampling/README.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/ratatool-sampling/README.md b/ratatool-sampling/README.md
index 02514e3e..b371df26 100644
--- a/ratatool-sampling/README.md
+++ b/ratatool-sampling/README.md
@@ -121,7 +121,7 @@ Distribution sampling currently assumes all distinct keys or strata can fit into
 ## Distributions
 ### Stratified
 ![Stratified](https://github.com/spotify/ratatool/blob/master/misc/Stratified.png)
-Stratified sampling example. Not that only the specified distributionFields are preserved in the sample.
+Stratified sampling example. Note that only the specified distributionFields are preserved in the sample.
 
 ![Uniform](https://github.com/spotify/ratatool/blob/master/misc/Uniform.png)
-Uniform sampling example. Adjusts
+Uniform sampling example. Adjusts input to produce an even output distribution if possible.

From b32d82988ee430760fa37cc303ff3033965609bf Mon Sep 17 00:00:00 2001
From: Idrees Khan
Date: Thu, 1 Feb 2024 11:19:42 -0500
Subject: [PATCH 3/5] Update README.md

---
 ratatool-sampling/README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/ratatool-sampling/README.md b/ratatool-sampling/README.md
index b371df26..455c0e4a 100644
--- a/ratatool-sampling/README.md
+++ b/ratatool-sampling/README.md
@@ -92,7 +92,7 @@ Currently, BigSampler defaults to Farmhash, which is also used in BigQuery. When
  will have to pre-create the seed as a little endian hex encoded byte string, as BigQuery does not currently allow directly converting an integer
  to bytes.
 
-`FARM_FINGERPRINT(CONCAT(b'\x2A\x00\x00\x00', b"abc"))` will produce the equivalent hash of `--seed=42` with one `fields` where the given record has value `abc`.
+`FARM_FINGERPRINT(CONCAT(b'\x2A\x00\x00\x00', b'abc'))` will produce the equivalent hash of `--seed=42` with one `fields` where the given record has value `abc`.
 
 The output will also need to be normalized to the range [0.0, 1.0] from the range [Long.MinValue, Long.MaxValue] in order to produce the exact equivalent sample as BigSampler.
 

From 0edbd3e375cb811edfad198f879a64cfef2a268c Mon Sep 17 00:00:00 2001
From: Idrees Khan
Date: Fri, 2 Feb 2024 09:40:29 -0500
Subject: [PATCH 4/5] Update ratatool-sampling/README.md

Co-authored-by: RickardZwahlen
---
 ratatool-sampling/README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/ratatool-sampling/README.md b/ratatool-sampling/README.md
index 455c0e4a..6541d2df 100644
--- a/ratatool-sampling/README.md
+++ b/ratatool-sampling/README.md
@@ -92,7 +92,7 @@ Currently, BigSampler defaults to Farmhash, which is also used in BigQuery. When
  will have to pre-create the seed as a little endian hex encoded byte string, as BigQuery does not currently allow directly converting an integer
  to bytes.
 
-`FARM_FINGERPRINT(CONCAT(b'\x2A\x00\x00\x00', b'abc'))` will produce the equivalent hash of `--seed=42` with one `fields` where the given record has value `abc`.
+`FARM_FINGERPRINT(CONCAT(b'\x2A\x00\x00\x00', CAST('abc' as BYTES)))` will produce the equivalent hash of `--seed=42` with one `fields` where the given record has value `abc`.
 
 The output will also need to be normalized to the range [0.0, 1.0] from the range [Long.MinValue, Long.MaxValue] in order to produce the exact equivalent sample as BigSampler.
 
From bae8a41dbfd70211bc39cc4ce98ed8884c48e1f4 Mon Sep 17 00:00:00 2001
From: Idrees Khan
Date: Fri, 9 Feb 2024 10:50:30 -0500
Subject: [PATCH 5/5] Update README.md

---
 ratatool-sampling/README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/ratatool-sampling/README.md b/ratatool-sampling/README.md
index 6541d2df..28c5987a 100644
--- a/ratatool-sampling/README.md
+++ b/ratatool-sampling/README.md
@@ -6,7 +6,7 @@ Diffy contains record sampling classes for Avro, Parquet, and BigQuery. Supporte
 # BigSampler
 BigSampler will run a [Scio](https://github.com/spotify/scio) pipeline sampling either Avro or BigQuery data.
- It also allows specifying a hash function (either FarmHash or Murmur) with seed (if applicable for
+ It also allows specifying a hash function (either [FarmHash](https://github.com/google/farmhash) or Murmur) with seed (if applicable for
  your hash) and fields to hash for deterministic cohort selection.
 
 For full details see [BigSample.scala](https://github.com/spotify/ratatool/blob/master/ratatool-sampling/src/main/scala/com/spotify/ratatool/samplers/BigSampler.scala)
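
The "Reproducing within BigQuery" text added in this series describes three steps: encode the seed as little-endian bytes, concatenate it with the field bytes before hashing, and normalize the signed 64-bit result to [0.0, 1.0]. The following is a minimal Python sketch of the seed-encoding and normalization steps only. It rests on assumptions: the helper names are hypothetical, the seed is assumed to be a 32-bit signed integer (the 4-byte literal for `--seed=42` suggests this), and the normalization is assumed to be a plain min-max rescaling, which may not match BigSampler's exact arithmetic.

```python
import struct

# Bounds of a signed 64-bit integer (Long.MinValue / Long.MaxValue on the JVM).
INT64_MIN = -(2**63)
INT64_MAX = 2**63 - 1


def seed_to_le_bytes(seed: int) -> bytes:
    """Encode a seed as 4 little-endian bytes (assumes a 32-bit signed seed)."""
    return struct.pack("<i", seed)


def bigquery_bytes_literal(raw: bytes) -> str:
    """Render raw bytes as a BigQuery bytes literal using hex escapes."""
    return "b'" + "".join(f"\\x{b:02X}" for b in raw) + "'"


def fingerprint_sql(seed: int, field_expr: str) -> str:
    """Build the FARM_FINGERPRINT expression for one field (hypothetical helper)."""
    seed_literal = bigquery_bytes_literal(seed_to_le_bytes(seed))
    return f"FARM_FINGERPRINT(CONCAT({seed_literal}, CAST({field_expr} AS BYTES)))"


def normalize_hash(h: int) -> float:
    """Min-max rescale a signed 64-bit hash into [0.0, 1.0] (assumed formula)."""
    return (h - INT64_MIN) / (INT64_MAX - INT64_MIN)


if __name__ == "__main__":
    print(fingerprint_sql(42, "'abc'"))
    print(normalize_hash(INT64_MIN), normalize_hash(INT64_MAX))
```

For `--seed=42` this generates the same seed literal, `b'\x2A\x00\x00\x00'`, that the README example hard-codes. The hash itself is still computed by BigQuery, so the sketch does not need a FarmHash implementation.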