-
Notifications
You must be signed in to change notification settings - Fork 514
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BigQuery JSON column: encode as Jackson JsonNode #5523
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #5523 +/- ##
==========================================
+ Coverage 61.43% 61.44% +0.01%
==========================================
Files 312 312
Lines 11103 11105 +2
Branches 762 744 -18
==========================================
+ Hits 6821 6824 +3
+ Misses 4282 4281 -1 ☔ View full report in Codecov by Sentry. 🚨 Try these New Features:
|
6583e46
to
986438b
Compare
@RustedBones what remains here is binary compatibility: newly compiled |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need an integration test for that to make sure both avro and json formats are ok with those types
case t if t =:= typeOf[Float] => q"$tree.toDouble" | ||
case t if t =:= typeOf[Double] => tree | ||
case t if t =:= typeOf[String] => tree | ||
case t if t =:= typeOf[com.fasterxml.jackson.databind.JsonNode] => tree |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we should accept JsonNode
here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't actually need it, but I supposed one may want to provide an already parsed (hence valid) JSON. I will remove it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@@ -198,7 +199,7 @@ private[types] object ConverterProvider { | |||
case t if t =:= typeOf[Geography] => | |||
q"$tree.wkt" | |||
case t if t =:= typeOf[Json] => | |||
q"$tree.wkt" | |||
q"$tree.asJackson" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
According to the doc here, loading json in avro expects a string type, annotated with the JSON
sqlType property
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is the same comment for BigQuery, but after a reverse-eng. they meant JsonNode. If opened a PR just to change the warning: apache/beam#33121
However I can keep asJackson
on line 417, and .wkt
on line 202.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
case t if t =:= typeOf[Float] => q"$s.toFloat" | ||
case t if t =:= typeOf[Double] => q"$s.toDouble" | ||
case t if t =:= typeOf[String] => q"$s" | ||
case t if t =:= typeOf[com.fasterxml.jackson.databind.JsonNode] => q"$s" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we should accept JsonNode
here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@@ -412,7 +414,7 @@ private[types] object ConverterProvider { | |||
case t if t =:= typeOf[Geography] => | |||
q"$tree.wkt" | |||
case t if t =:= typeOf[Json] => | |||
q"$tree.wkt" | |||
q"$tree.asJackson" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
case class Json(wkt: String) { | ||
def asJackson: JsonNode = Json.mapper.readTree(wkt) | ||
} | ||
object Json { | ||
private val mapper = new ObjectMapper() | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd prefer to have this in the conversion part and leave the data as dumb as possible
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where would be the appropriate place to store the ObjectMapper
instance, in order to not instantiate it for each row? Directly as a private val
in ConverterProvider
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for { | ||
key <- Gen.alphaStr | ||
value <- Gen.alphaStr | ||
} yield Json("{\"" + key + "\":\"" + value + "\"}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: triple quote is nicer when writing json
} yield Json("{\"" + key + "\":\"" + value + "\"}") | |
} yield Json(s"""{"$key":"$value"}""") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TIL. I always supposed triple quote was only for multi-line.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wrote some integration tests on top of your PR. Revealed that Thanks again for the contrib, and nice catch! |
This change causes a regression when using the storage API. Seems the Jackson node is not well supported. |
That's strange, I'm reading entities with JSON columns using |
The issue is with the writing side. I'd try to fix those problems in #5529 |
FWIW, I tried a table load with So The one thing to check if the column is JSON, is it should not be stored escaped inside a JS string. AFAIK that can be observed only BigQuery. |
On write, BigQuery JSON columns must not be provided as
String
, as it will be escaped. Apache Beam warns:Make sure the TableRow value is a parsed JSON to ensure the read as a JSON type. Otherwise it will read as a raw (escaped) string.
Reverse-eng. shows that is means "as a
JsonNode
from Jackson".