(fix #766) Deprecate AvroCompat, replace with automatic schema detection on read + Configurable write #996
base: main
Conversation
@@ -51,7 +51,7 @@ sealed trait ParquetField[T] extends Serializable {
   protected final def nonEmpty(v: T): Boolean = !isEmpty(v)

   def write(c: RecordConsumer, v: T)(cm: CaseMapper): Unit
-  def newConverter: TypeConverter[T]
+  def newConverter(writerSchema: Type): TypeConverter[T]
Alternately, we could create an overloaded method like:
def newConverter(): TypeConverter = newConverter(false)
def newConverter(avroCompat: Boolean): TypeConverter = ...
This will allow forward binary compatibility. Otherwise we should move this to the v0.8 base branch
hmm. Forward compat would be nice, but long-term I'd like to be able to phase out the whole AvroCompat option from Magnolify and just write grouped arrays by default... so I'd rather not encode it further into the Magnolify API (`def newConverter(avroCompat: Boolean)`). So I think we can keep this as is and I'll rebase onto the v0.8 branch.
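For anyone following along, a rough illustration of the two encodings in question (my own sketch, not code from this PR): Magnolify's default writes a bare repeated field, while AvroCompat wraps the repeated field in a LIST-annotated group the way parquet-avro does.

```scala
import org.apache.parquet.schema.{MessageType, MessageTypeParser}

// Magnolify's default encoding: a bare repeated field, no LIST annotation.
val nonGrouped: MessageType = MessageTypeParser.parseMessageType(
  """message Record {
    |  repeated int32 xs;
    |}""".stripMargin
)

// Avro-style "grouped" encoding (what the AvroCompat import opts into today):
// the repeated field is wrapped in a group annotated as a LIST.
val grouped: MessageType = MessageTypeParser.parseMessageType(
  """message Record {
    |  required group xs (LIST) {
    |    repeated int32 array;
    |  }
    |}""".stripMargin
)
```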
although now that I think about it more, the param here should be readSchema/requestedSchema, not writerSchema
           .asInstanceOf[TypeConverter.Buffered[T]]
           .withRepetition(Repetition.REPEATED)
         val arrayConverter = new TypeConverter.Delegate[T, C[T]](buffered) {
           override def get: C[T] = inner.get(fc.fromSpecific)
         }

-        if (hasAvroArray) {
+        if (Schema.hasGroupedArray(writerSchema)) {
I guess this would throw an error if the schema mixed grouped and non-grouped array types, since it checks for the presence of a grouped array in the entire schema rather than for the specific field... but we don't have access to `CaseMapper` here, so we couldn't easily isolate the specific field schema. The original approach with the `AvroCompat` import was also all-or-nothing, so this shouldn't functionally be a change in behavior.
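To make the all-or-nothing concern concrete (again my own sketch): a writer schema like the one below mixes both encodings, so a whole-schema check like `Schema.hasGroupedArray` would cause every repeated field to be read as grouped, even though only one of them is.

```scala
import org.apache.parquet.schema.MessageTypeParser

// One field uses the Avro-style grouped (LIST) encoding, the other a bare repeated field.
val mixed = MessageTypeParser.parseMessageType(
  """message Record {
    |  required group groupedXs (LIST) {
    |    repeated int32 array;
    |  }
    |  repeated int32 bareYs;
    |}""".stripMargin
)
```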
Codecov Report
Attention: Patch coverage is

Additional details and impacted files

@@            Coverage Diff             @@
##              main     #996      +/-  ##
==========================================
+ Coverage    95.50%   95.64%   +0.13%
==========================================
  Files           56       57       +1
  Lines         1980     1996      +16
  Branches       186      162      -24
==========================================
+ Hits          1891     1909      +18
+ Misses          89       87       -2

☔ View full report in Codecov by Sentry.
Force-pushed from 6c67658 to 81e3f07
@@ -693,6 +693,7 @@ lazy val tools = project
       "com.google.apis" % "google-api-services-bigquery" % bigqueryVersion,
       "org.apache.avro" % "avro" % avroVersion % Provided,
       "org.apache.parquet" % "parquet-hadoop" % parquetVersion,
+      "org.apache.hadoop" % "hadoop-common" % hadoopVersion,
I'm on the fence about relying so heavily on the Hadoop `Configuration` class, since it pulls in the hadoop-common artifact and links us more tightly with Hadoop. Parquet is trying to move away from `Configuration` and onto its own ParquetConfiguration class, which we could use instead. However, it might be confusing for Scio users, since Scio is heavily dependent on `Configuration` and we don't have immediate plans to offboard from it.
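For comparison, a minimal sketch of the two configuration styles (assuming Parquet's `org.apache.parquet.conf.ParquetConfiguration` and `PlainParquetConfiguration`, available in recent Parquet releases; the property key is made up for illustration):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.parquet.conf.{ParquetConfiguration, PlainParquetConfiguration}

// Hadoop-based configuration: requires the hadoop-common artifact on the classpath.
val hadoopConf = new Configuration()
hadoopConf.setBoolean("magnolify.parquet.write-grouped-arrays", true)

// Parquet's own configuration abstraction: no Hadoop dependency needed.
val parquetConf: ParquetConfiguration = new PlainParquetConfiguration()
parquetConf.setBoolean("magnolify.parquet.write-grouped-arrays", true)
```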
Actually, I might pull this out into a separate PR; will update shortly.
    val metadata = new java.util.HashMap[String, String]()
    if (parquetType.avroCompat) {
      // This overrides `WriteSupport#getName`
      metadata.put(ParquetWriter.OBJECT_MODEL_NAME_PROP, "avro")
I did drop the behavior of writing `writer.model.name: avro` if AvroCompat is enabled -- I don't think it makes sense; it should still be Magnolify. I can't think of any reason why this would impact downstream readers -- the model name shouldn't matter at all when comparing schema compatibility across files. If anyone can think of a good reason why this change is breaking, let me know...
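For what it's worth, downstream readers that do care can still inspect the object model name via the file footer's key-value metadata; a hedged sketch (the file path is illustrative):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.ParquetFileReader
import org.apache.parquet.hadoop.util.HadoopInputFile

// Open the footer and read the key-value metadata, where writer.model.name is stored.
val inputFile = HadoopInputFile.fromPath(new Path("/tmp/example.parquet"), new Configuration())
val reader = ParquetFileReader.open(inputFile)
val keyValueMeta = reader.getFileMetaData.getKeyValueMetaData // java.util.Map[String, String]
val modelName = keyValueMeta.get("writer.model.name")         // e.g. "avro", or whatever the writer set
reader.close()
```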
      groupAvroArrays || conf.getBoolean(
        MagnolifyParquetProperties.WriteGroupedArrays,
        MagnolifyParquetProperties.WriteGroupedArraysDefault
      )
I'm afraid this check in `conf` is very costly and has a significant impact on performance. Memoization can help, but IMHO we should give the configuration when creating the `ParquetType` (same place we actually capture the deprecated `ParquetArray`), so we can compute this only once. WDYT?
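A rough sketch of that suggestion (the class name, wiring, and import location are hypothetical; the property constants are the ones from the diff above): resolve the flag once when the `ParquetType` is constructed rather than on every write.

```scala
import org.apache.hadoop.conf.Configuration
import magnolify.parquet.MagnolifyParquetProperties // assumed location of the constants from the diff

// Hypothetical sketch: the Configuration is supplied at construction time,
// so the grouped-array flag is computed exactly once.
class ParquetTypeSketch[T](conf: Configuration) {
  private val writeGroupedArrays: Boolean =
    conf.getBoolean(
      MagnolifyParquetProperties.WriteGroupedArrays,
      MagnolifyParquetProperties.WriteGroupedArraysDefault
    )

  def write(record: T): Unit = {
    // ...uses the precomputed writeGroupedArrays flag; no Configuration lookup per record
  }
}
```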
    // Legacy compat with Magnolify <= 0.7; future versions will remove AvroCompat in favor of
    // Configuration-based approach
    @nowarn("cat=deprecation")
    val groupAvroArrays: Boolean = pa match {
      case ParquetArray.default               => false
      case ParquetArray.AvroCompat.avroCompat => true
    }
This was probably not ideal. I think we should be able to re-use any `ParquetField` and decide on the behavior when instantiating the parquet type. Here we now mix behaviors with the config.
Force-pushed from 3b71138 to aeed545, then from aeed545 to ca535fc
  import MagnolifyBench._

  @Benchmark def parquetWrite(state: ParquetStates.DefaultParquetWriteState): Unit = state.writer.write(nested)
  @Benchmark def parquetRead(state: ParquetStates.DefaultParquetReadState): Nested = state.reader.read()
It's hard to capture a "true" read benchmark for Parquet since there's so much happening under the hood here (reading and caching the row group, for example). But at least this can be used to track positive and negative trends.
ok, this was bothering me, so I re-implemented this benchmark so that instead of reading/writing entire file streams, it's directly writing/reading Pages (the smallest unit of IO granularity in Parquet). This skips a lot of the overhead of the file/rowgroup IO, so we're able to specifically benchmark ParquetType's functionality: converting between parquet Groups and Scala case classes.

Ok, using the benchmarks from #1040, this change looks performance-neutral on read/write: the Configuration check does add some time to WriteContext#init, but that's called only once per task.
   protected def isEmpty(v: T): Boolean
   protected final def nonEmpty(v: T): Boolean = !isEmpty(v)

-  def write(c: RecordConsumer, v: T)(cm: CaseMapper): Unit
   def newConverter: TypeConverter[T]
+  def write(c: RecordConsumer, v: T)(cm: CaseMapper, groupArrayFields: Boolean): Unit
so this solution now works by passing a precomputed `groupArrayFields` flag around to every `write` method.

It works, but I'm thinking that we may want to generalize this from a boolean flag into a `Map[String, _]`-typed configuration field (parsed from the Hadoop `Configuration` object passed to `ParquetType`). This would give us more flexibility if we need to make any more write options configurable in the future. As an example, if a user has any `LocalDate*` fields in their `ParquetType` case class, by default parquet-avro will convert them into a `local-timestamp-{millis, micros}` for Avro 1.11, but `timestamp-{millis, micros}` on Avro 1.8. We might want to make this behavior configurable to preserve schema compatibility across Avro upgrades.
wdyt @RustedBones ?
We can use a custom trait containing all desired write settings.
updated to use a trait!
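Presumably something along these lines (a minimal sketch of the trait idea; the names are mine and may not match what the PR finally uses):

```scala
// Hypothetical write-settings trait: bundles all write-time options in one place,
// so adding a new option doesn't require threading another parameter through `write`.
trait WriteSettings extends Serializable {
  // Write arrays using the Avro-style grouped (LIST) encoding instead of bare repeated fields.
  def groupArrayFields: Boolean
  // Future knobs, e.g. how LocalDate* fields map to timestamp logical types, could live here too.
}

object WriteSettings {
  val default: WriteSettings = new WriteSettings {
    val groupArrayFields: Boolean = false
  }
}
```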
Force-pushed from 26c2bd3 to eb205db
I think we should speak of `list logical-type` instead of `grouped array`.

I'm also wondering if by default we should respect the spec. `parquet-avro` does not respect the default, and falls under the backward-compat rules, case 1. Do you know if `parquet-avro` is able to read arrays when the default spec is used?
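For reference, the spec's canonical 3-level list logical-type looks like this (my sketch based on the Parquet format spec); the 2-level `repeated ... array` shape shown earlier in this thread is the form that falls under backward-compat rule 1:

```scala
import org.apache.parquet.schema.MessageTypeParser

// Canonical 3-level LIST encoding from the Parquet format spec.
val specList = MessageTypeParser.parseMessageType(
  """message Record {
    |  required group xs (LIST) {
    |    repeated group list {
    |      required int32 element;
    |    }
    |  }
    |}""".stripMargin
)
```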
@@ -192,6 +195,59 @@ class ParquetTypeSuite extends MagnolifySuite {
     assertEquals(inner.getFields.asScala.map(_.getName).toSeq, Seq("INNERFIRST"))
   }
 }

+  test(s"AvroCompat") {
test(s"AvroCompat") { | |
test("AvroCompat") { |
 sealed trait ParquetField[T] extends Serializable {

-  @transient private lazy val schemaCache: concurrent.Map[UUID, Type] =
+  @transient private lazy val schemaCache: concurrent.Map[Boolean, concurrent.Map[UUID, Type]] =
Why not use a compound key instead of a nested map?

Suggested change:
-  @transient private lazy val schemaCache: concurrent.Map[Boolean, concurrent.Map[UUID, Type]] =
+  @transient private lazy val schemaCache: concurrent.Map[(Boolean, UUID), Type] =
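A quick sketch of how the compound key would be used (my illustration; `buildSchema` stands in for whatever actually computes the schema):

```scala
import java.util.UUID
import org.apache.parquet.schema.Type
import scala.collection.concurrent
import scala.collection.concurrent.TrieMap

trait SchemaCaching {
  // Stand-in for the real schema computation for a given array-encoding + CaseMapper combination.
  protected def buildSchema(groupedArrays: Boolean, cmUuid: UUID): Type

  // One flat map keyed by (groupedArrays, CaseMapper UUID) instead of a map of maps.
  @transient private lazy val schemaCache: concurrent.Map[(Boolean, UUID), Type] =
    TrieMap.empty

  def cachedSchema(groupedArrays: Boolean, cmUuid: UUID): Type =
    schemaCache.getOrElseUpdate((groupedArrays, cmUuid), buildSchema(groupedArrays, cmUuid))
}
```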
Unfortunately, not. It looks like this has just never been implemented. It's something we could try to get into the next Parquet release.
Magnolify is now on Parquet 1.14, which includes a bugfix for PARQUET-2425 -- `AvroSchemaConverter` no longer throws an exception when parsing non-grouped repeated fields (the magnolify-parquet default). This was an early blocker in our efforts to deprecate AvroCompat (see discussion on #766).

This PR deprecates `AvroCompat` in favor of:

- making `writeSupport`/`schema` operations Configurable, and introduces a Configuration option specifically for writing grouped arrays
- updating `readSupport` to automatically detect any grouped arrays in the Write schema and correct the Read schema as needed.
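As a rough usage sketch of the Configuration-based write option (the property constant comes from the diff; exactly how the Configuration is handed to `ParquetType` may differ from what's shown here):

```scala
import org.apache.hadoop.conf.Configuration
import magnolify.parquet._

case class Record(id: Long, xs: List[Int])

// Derive the ParquetType as usual; no AvroCompat import required.
val pt = ParquetType[Record]

// Opt into Avro-style grouped (LIST-annotated) arrays at write time via Configuration,
// instead of `import magnolify.parquet.ParquetArray.AvroCompat._`.
val conf = new Configuration()
conf.setBoolean(MagnolifyParquetProperties.WriteGroupedArrays, true)
// The Configuration is then supplied to the write path (e.g. the ParquetWriter builder),
// so writeSupport can pick up the grouped-array setting during init.
```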