BigQuery JSON column: encode as Jackson JsonNode #5523

Merged 7 commits on Nov 18, 2024

@turb (Contributor) commented Nov 14, 2024

On write, values for BigQuery JSON columns must not be provided as a String, because the string will be escaped. Apache Beam warns:

Make sure the TableRow value is a parsed JSON to ensure the read as a JSON type. Otherwise it will read as a raw (escaped) string.

Reverse engineering shows that this means "as a JsonNode from Jackson".
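To make the distinction above concrete, here is a minimal sketch that uses Jackson directly. It is not Scio or Beam code, and the object and value names are hypothetical. It only shows how the same text behaves when it is parsed into a JsonNode versus when it is treated as a plain String value and serialized as JSON.

```scala
import com.fasterxml.jackson.databind.{JsonNode, ObjectMapper}

// Hypothetical illustration object; not part of the PR.
object JsonColumnSketch {
  // A single shared mapper is enough for this sketch.
  private val mapper = new ObjectMapper()

  val raw: String = """{"key":"value"}"""

  // Parsed form: a JSON object with a field named "key".
  val parsed: JsonNode = mapper.readTree(raw)

  // String form: serializing a String as a JSON value wraps it in quotes
  // and escapes the inner quotes, so it is no longer a JSON object.
  val escaped: String = mapper.writeValueAsString(raw)
}
```

In this sketch, `parsed.isObject` is true and `parsed.get("key").asText()` yields `value`, while `escaped` is a quoted string literal with backslash-escaped inner quotes. That second shape is what ends up in the column when a raw String is provided.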

codecov bot commented Nov 14, 2024

Codecov Report

Attention: Patch coverage is 50.00000% with 1 line in your changes missing coverage. Please review.

Project coverage is 61.44%. Comparing base (c45685a) to head (d966487).
Report is 2 commits behind head on main.

Files with missing lines Patch % Lines
...cala/com/spotify/scio/bigquery/types/package.scala 50.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #5523      +/-   ##
==========================================
+ Coverage   61.43%   61.44%   +0.01%     
==========================================
  Files         312      312              
  Lines       11103    11105       +2     
  Branches      762      744      -18     
==========================================
+ Hits         6821     6824       +3     
+ Misses       4282     4281       -1     

@turb turb force-pushed the bq-jackson branch 2 times, most recently from 6583e46 to 986438b Compare November 14, 2024 18:23
@turb (Contributor Author) commented Nov 14, 2024

@RustedBones what remains here is binary compatibility: code newly compiled against the updated Json fails when executed with the previous version of the library.

@RustedBones (Contributor) left a comment

We need an integration test for this, to make sure both the Avro and JSON formats are OK with those types.

case t if t =:= typeOf[Float] => q"$tree.toDouble"
case t if t =:= typeOf[Double] => tree
case t if t =:= typeOf[String] => tree
case t if t =:= typeOf[com.fasterxml.jackson.databind.JsonNode] => tree
Contributor

I don't think we should accept JsonNode here

@turb (Contributor Author), Nov 15, 2024

I don't actually need it, but I supposed one might want to provide already-parsed (hence valid) JSON. I will remove it.

@@ -198,7 +199,7 @@ private[types] object ConverterProvider {
       case t if t =:= typeOf[Geography] =>
         q"$tree.wkt"
       case t if t =:= typeOf[Json] =>
-        q"$tree.wkt"
+        q"$tree.asJackson"
Contributor

According to the doc, loading JSON from Avro expects a string type annotated with the sqlType property set to JSON.

Contributor Author

There is the same comment for BigQuery, but after some reverse engineering, it turns out they meant JsonNode. I've opened a PR just to change the warning: apache/beam#33121

However, I can keep asJackson on line 417 and .wkt on line 202.

case t if t =:= typeOf[Float] => q"$s.toFloat"
case t if t =:= typeOf[Double] => q"$s.toDouble"
case t if t =:= typeOf[String] => q"$s"
case t if t =:= typeOf[com.fasterxml.jackson.databind.JsonNode] => q"$s"
Contributor

I don't think we should accept JsonNode here

@@ -412,7 +414,7 @@ private[types] object ConverterProvider {
       case t if t =:= typeOf[Geography] =>
         q"$tree.wkt"
       case t if t =:= typeOf[Json] =>
-        q"$tree.wkt"
+        q"$tree.asJackson"
Contributor

👍

Comment on lines 65 to 70
case class Json(wkt: String) {
  def asJackson: JsonNode = Json.mapper.readTree(wkt)
}
object Json {
  private val mapper = new ObjectMapper()
}
Contributor

I'd prefer to have this in the conversion part and leave the data as dumb as possible

Contributor Author

Where would be the appropriate place to store the ObjectMapper instance, in order to not instantiate it for each row? Directly as a private val in ConverterProvider?
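One possible arrangement, shown only as a sketch of the idea discussed here and not as what the PR ended up doing: keep the data type free of behaviour and hold a single mapper in a separate helper object, so that it is created once rather than per row. The helper name `JsonConversion` and its method are hypothetical, and `Json` below is a stand-in for the type under review.

```scala
import com.fasterxml.jackson.databind.{JsonNode, ObjectMapper}

// Stand-in for the PR's Json type: plain data, no conversion logic.
final case class Json(wkt: String)

// Hypothetical helper; its name and location are assumptions.
object JsonConversion {
  // Created once, on first use, and then reused for every row.
  private lazy val mapper = new ObjectMapper()

  // Parses the wrapped text into a Jackson tree.
  def toJsonNode(json: Json): JsonNode = mapper.readTree(json.wkt)
}
```

With this shape, conversion code would call `JsonConversion.toJsonNode(json)` instead of a method on `Json` itself, and invalid JSON text would surface as an exception from `readTree` at conversion time.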

for {
  key <- Gen.alphaStr
  value <- Gen.alphaStr
} yield Json("{\"" + key + "\":\"" + value + "\"}")
Contributor

Nit: triple quote is nicer when writing json

Suggested change
- } yield Json("{\"" + key + "\":\"" + value + "\"}")
+ } yield Json(s"""{"$key":"$value"}""")

Contributor Author

TIL. I always assumed triple quotes were only for multi-line strings.
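For reference, a small self-contained sketch of the two spellings from the suggestion, using fixed sample values (the object and value names are hypothetical). Inside a triple-quoted string, double quotes need no escaping, and the `s` interpolator still substitutes `$key` and `$value`.

```scala
// Hypothetical illustration object; not part of the PR.
object TripleQuoteSketch {
  val key = "k"
  val value = "v"

  // Concatenation with escaped quotes, as in the original test code.
  val escapedStyle: String = "{\"" + key + "\":\"" + value + "\"}"

  // Triple-quoted, interpolated form from the review suggestion.
  val tripleQuoted: String = s"""{"$key":"$value"}"""
}
```

Both values hold the same text, so the change is purely about readability.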

@turb turb requested a review from RustedBones November 15, 2024 15:57
@RustedBones (Contributor)

I wrote some integration tests on top of your PR. They revealed that BigNumeric also has a bug!
They need some polishing; I will add them on top of this work if that's fine.

Thanks again for the contrib, and nice catch!

@RustedBones RustedBones merged commit 49f43b0 into spotify:main Nov 18, 2024
11 checks passed
@RustedBones RustedBones mentioned this pull request Nov 19, 2024
@RustedBones (Contributor)

This change causes a regression when using the storage API. Seems the Jackson node is not well supported.

@turb (Contributor Author) commented Dec 17, 2024

This change causes a regression when using the storage API. Seems the Jackson node is not well supported.

That's strange, I'm reading entities with JSON columns using sc.typedBigQueryStorage and it works.

@RustedBones (Contributor)

The issue is on the writing side.
It also looks like using TableRow instead of a Jackson JsonNode is better: the coder for TableRow transforms a JsonNode into a TableRow when an element gets serialized.

I'll try to fix those problems in #5529

@turb (Contributor Author) commented Dec 18, 2024

FWIW, I tried a table load with method = beam.BigQueryIO.Write.Method.STORAGE_WRITE_API (is that it?) and it worked.

So would the Jackson JsonNode replace Json, or is that just an internal detail?

The one thing to check, if the column is JSON, is that the value is not stored escaped inside a JS string. AFAIK that can be observed only in BigQuery.
