refactor XML deserialization #1042

aajtodd · 2024-02-26T14:06:46Z

Issue #

Description of changes

The context for this issue is in aws-sdk-kotlin#1220. Essentially, we wrongly assumed that the elements of a flat collection would always be serialized contiguously. In reality, services return flat collections interspersed with other XML elements.

Our original approach to deserialization closely followed kotlinx.serialization: we have common Serializer and Deserializer interfaces, each format we want to support (XML, JSON, etc.) implements them, and codegen is then the same across all of them. The issues are that (1) we end up duplicating information already in the model (field traits) and (2) we have to bend over backwards to make each format work within the interface instead of just creating a runtime type that more closely matches the medium.

We discussed our options as a team and decided to refactor XML deserialization to more closely match what Go and Rust do. This was something we had discussed prior to GA but didn't have time to do. Rather than implement a one-off workaround tailored specifically to this issue, we're going to move toward the desired end state, which is to generate serializers/deserializers specific to each format (starting with just XML deserialization).

This is a large PR, so I'll summarize the important bits for easier review, particularly since a lot of it is net-new test code.

- XmlParserGenerator guts were replaced to no longer use the common struct/union deserializer and instead generate code directly against the lower-level serde-xml types. See Codegen Output below for example output and differences.
  - One of the biggest structural differences is that we generate dedicated functions for list/map deserialization rather than doing it inline. This is partly to help with readability, but also to keep the deserializer state correct by construction, since lists and maps require one or more inner loops themselves. Flattened collections accumulate values into the member, so that each time a flattened member tag is hit we append to the previous collection instead of replacing it (fixing the bug in aws-sdk-kotlin#1220).
- XmlTagReader - new type that sits on top of XmlStreamReader and provides some small conveniences for iterating tokens and maintaining deserializer state.
- Removed all previous XML deserializer unit tests in favor of a new module, tests/codegen/serde-tests. This new module overlaps a bit with the existing protocol tests, but the iteration time is quicker and it is independent of the protocols. It has greater coverage than our previous unit tests; I even found some bugs in how we were generating nested lists/maps inside union types, and also found #1040 (Union members targeting smithy.api#Unit generates extraneous structures). The idea is the same as the protocol tests: test with the generated code rather than hand-writing tests that mimic the structure of generated code. This approach is both easier to write tests for and more accurate, as it tests the real codegen output as opposed to hand-written versions of what we expect codegen to output.
- The serde-benchmarks project contained a companion module, serde-codegen-support, that implemented a custom dummy protocol for JSON + XML. There wasn't anything specific to benchmarks in it, so I refactored it to remove any notion of benchmarks and moved it to tests/serde as a common module that can be reused for both the serde benchmarks and the new codegen integration tests.
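To illustrate the accumulate-versus-replace point for flattened collections, here is a minimal sketch. The `createOrAppend` below is a hypothetical stand-in written for this example; the real runtime helper's signature is not shown in this PR description and may differ. The tag list simulates a payload where `Version` elements are interspersed with a `DeleteMarker`.

```kotlin
// Hypothetical stand-in for the helper called by the generated code;
// the actual runtime implementation may have a different signature.
fun <T> createOrAppend(existing: List<T>?, element: T): List<T> =
    if (existing == null) listOf(element) else existing + element

fun main() {
    // (tag name, value) pairs in document order: the flat "Version" list
    // is interrupted by a "DeleteMarker" element.
    val tags = listOf("Version" to "v1", "DeleteMarker" to "d1", "Version" to "v2")

    var replaced: List<String>? = null     // old behavior: each run of tags replaces the member
    var accumulated: List<String>? = null  // new behavior: each tag appends to the member

    for ((tag, value) in tags) {
        if (tag == "Version") {
            replaced = listOf(value)
            accumulated = createOrAppend(accumulated, value)
        }
    }

    println(replaced)     // [v2]
    println(accumulated)  // [v1, v2]
}
```

With replace semantics the first `Version` is lost as soon as the second run starts; with append semantics both survive regardless of what sits between them.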

Codegen Output

For the S3 ListObjectVersions output type:

Previously:

private fun deserializeListObjectVersionsOperationBody(builder: ListObjectVersionsResponse.Builder, payload: ByteArray) {
    val deserializer = XmlDeserializer(payload)
    val COMMONPREFIXES_DESCRIPTOR = SdkFieldDescriptor(SerialKind.List, XmlSerialName("CommonPrefixes"), Flattened)
    val DELETEMARKERS_DESCRIPTOR = SdkFieldDescriptor(SerialKind.List, XmlSerialName("DeleteMarker"), Flattened)
    val DELIMITER_DESCRIPTOR = SdkFieldDescriptor(SerialKind.String, XmlSerialName("Delimiter"))
    val ENCODINGTYPE_DESCRIPTOR = SdkFieldDescriptor(SerialKind.Enum, XmlSerialName("EncodingType"))
    val ISTRUNCATED_DESCRIPTOR = SdkFieldDescriptor(SerialKind.Boolean, XmlSerialName("IsTruncated"))
    val KEYMARKER_DESCRIPTOR = SdkFieldDescriptor(SerialKind.String, XmlSerialName("KeyMarker"))
    val MAXKEYS_DESCRIPTOR = SdkFieldDescriptor(SerialKind.Integer, XmlSerialName("MaxKeys"))
    val NAME_DESCRIPTOR = SdkFieldDescriptor(SerialKind.String, XmlSerialName("Name"))
    val NEXTKEYMARKER_DESCRIPTOR = SdkFieldDescriptor(SerialKind.String, XmlSerialName("NextKeyMarker"))
    val NEXTVERSIONIDMARKER_DESCRIPTOR = SdkFieldDescriptor(SerialKind.String, XmlSerialName("NextVersionIdMarker"))
    val PREFIX_DESCRIPTOR = SdkFieldDescriptor(SerialKind.String, XmlSerialName("Prefix"))
    val VERSIONIDMARKER_DESCRIPTOR = SdkFieldDescriptor(SerialKind.String, XmlSerialName("VersionIdMarker"))
    val VERSIONS_DESCRIPTOR = SdkFieldDescriptor(SerialKind.List, XmlSerialName("Version"), Flattened)
    val OBJ_DESCRIPTOR = SdkObjectDescriptor.build {
        trait(XmlSerialName("ListVersionsResult"))
        trait(XmlNamespace("http://s3.amazonaws.com/doc/2006-03-01/"))
        field(COMMONPREFIXES_DESCRIPTOR)
        field(DELETEMARKERS_DESCRIPTOR)
        field(DELIMITER_DESCRIPTOR)
        field(ENCODINGTYPE_DESCRIPTOR)
        field(ISTRUNCATED_DESCRIPTOR)
        field(KEYMARKER_DESCRIPTOR)
        field(MAXKEYS_DESCRIPTOR)
        field(NAME_DESCRIPTOR)
        field(NEXTKEYMARKER_DESCRIPTOR)
        field(NEXTVERSIONIDMARKER_DESCRIPTOR)
        field(PREFIX_DESCRIPTOR)
        field(VERSIONIDMARKER_DESCRIPTOR)
        field(VERSIONS_DESCRIPTOR)
    }

    deserializer.deserializeStruct(OBJ_DESCRIPTOR) {
        loop@while (true) {
            when (findNextFieldIndex()) {
                COMMONPREFIXES_DESCRIPTOR.index -> builder.commonPrefixes =
                    deserializer.deserializeList(COMMONPREFIXES_DESCRIPTOR) {
                        val col0 = mutableListOf<CommonPrefix>()
                        while (hasNextElement()) {
                            val el0 = if (nextHasValue()) { deserializeCommonPrefixDocument(deserializer) } else { deserializeNull(); continue }
                            col0.add(el0)
                        }
                        col0
                    }
                DELETEMARKERS_DESCRIPTOR.index -> builder.deleteMarkers =
                    deserializer.deserializeList(DELETEMARKERS_DESCRIPTOR) {
                        val col0 = mutableListOf<DeleteMarkerEntry>()
                        while (hasNextElement()) {
                            val el0 = if (nextHasValue()) { deserializeDeleteMarkerEntryDocument(deserializer) } else { deserializeNull(); continue }
                            col0.add(el0)
                        }
                        col0
                    }
                DELIMITER_DESCRIPTOR.index -> builder.delimiter = deserializeString()
                ENCODINGTYPE_DESCRIPTOR.index -> builder.encodingType = deserializeString().let { EncodingType.fromValue(it) }
                ISTRUNCATED_DESCRIPTOR.index -> builder.isTruncated = deserializeBoolean()
                KEYMARKER_DESCRIPTOR.index -> builder.keyMarker = deserializeString()
                MAXKEYS_DESCRIPTOR.index -> builder.maxKeys = deserializeInt()
                NAME_DESCRIPTOR.index -> builder.name = deserializeString()
                NEXTKEYMARKER_DESCRIPTOR.index -> builder.nextKeyMarker = deserializeString()
                NEXTVERSIONIDMARKER_DESCRIPTOR.index -> builder.nextVersionIdMarker = deserializeString()
                PREFIX_DESCRIPTOR.index -> builder.prefix = deserializeString()
                VERSIONIDMARKER_DESCRIPTOR.index -> builder.versionIdMarker = deserializeString()
                VERSIONS_DESCRIPTOR.index -> builder.versions =
                    deserializer.deserializeList(VERSIONS_DESCRIPTOR) {
                        val col0 = mutableListOf<ObjectVersion>()
                        while (hasNextElement()) {
                            val el0 = if (nextHasValue()) { deserializeObjectVersionDocument(deserializer) } else { deserializeNull(); continue }
                            col0.add(el0)
                        }
                        col0
                    }
                null -> break@loop
                else -> skipValue()
            }
        }
    }
}

After:

private fun deserializeListObjectVersionsOperationBody(builder: ListObjectVersionsResponse.Builder, payload: ByteArray) {
    val root = xmlTagReader(payload)

    loop@while(true) {
        val curr = root.nextTag() ?: break@loop
        when(curr.tag.name) {
            // CommonPrefixes smithy.kotlin.synthetic.s3#ListObjectVersionsResponse$CommonPrefixes
            "CommonPrefixes" -> builder.commonPrefixes = run {
                val el = deserializeCommonPrefixDocument(curr)
                createOrAppend(builder.commonPrefixes, el)
            }
            // DeleteMarkers smithy.kotlin.synthetic.s3#ListObjectVersionsResponse$DeleteMarkers
            "DeleteMarker" -> builder.deleteMarkers = run {
                val el = deserializeDeleteMarkerEntryDocument(curr)
                createOrAppend(builder.deleteMarkers, el)
            }
            // Delimiter smithy.kotlin.synthetic.s3#ListObjectVersionsResponse$Delimiter
            "Delimiter" -> builder.delimiter = curr.tryData()
                .getOrDeserializeErr { "expected (string: `com.amazonaws.s3#Delimiter`)" }
            // EncodingType smithy.kotlin.synthetic.s3#ListObjectVersionsResponse$EncodingType
            "EncodingType" -> builder.encodingType = curr.tryData()
                .parse { EncodingType.fromValue(it) }
                .getOrDeserializeErr { "expected (enum: `com.amazonaws.s3#EncodingType`)" }
            // IsTruncated smithy.kotlin.synthetic.s3#ListObjectVersionsResponse$IsTruncated
            "IsTruncated" -> builder.isTruncated = curr.tryData()
                .parseBoolean()
                .getOrDeserializeErr { "expected (boolean: `com.amazonaws.s3#IsTruncated`)" }
            // KeyMarker smithy.kotlin.synthetic.s3#ListObjectVersionsResponse$KeyMarker
            "KeyMarker" -> builder.keyMarker = curr.tryData()
                .getOrDeserializeErr { "expected (string: `com.amazonaws.s3#KeyMarker`)" }
            // MaxKeys smithy.kotlin.synthetic.s3#ListObjectVersionsResponse$MaxKeys
            "MaxKeys" -> builder.maxKeys = curr.tryData()
                .parseInt()
                .getOrDeserializeErr { "expected (integer: `com.amazonaws.s3#MaxKeys`)" }
            // Name smithy.kotlin.synthetic.s3#ListObjectVersionsResponse$Name
            "Name" -> builder.name = curr.tryData()
                .getOrDeserializeErr { "expected (string: `com.amazonaws.s3#BucketName`)" }
            // NextKeyMarker smithy.kotlin.synthetic.s3#ListObjectVersionsResponse$NextKeyMarker
            "NextKeyMarker" -> builder.nextKeyMarker = curr.tryData()
                .getOrDeserializeErr { "expected (string: `com.amazonaws.s3#NextKeyMarker`)" }
            // NextVersionIdMarker smithy.kotlin.synthetic.s3#ListObjectVersionsResponse$NextVersionIdMarker
            "NextVersionIdMarker" -> builder.nextVersionIdMarker = curr.tryData()
                .getOrDeserializeErr { "expected (string: `com.amazonaws.s3#NextVersionIdMarker`)" }
            // Prefix smithy.kotlin.synthetic.s3#ListObjectVersionsResponse$Prefix
            "Prefix" -> builder.prefix = curr.tryData()
                .getOrDeserializeErr { "expected (string: `com.amazonaws.s3#Prefix`)" }
            // VersionIdMarker smithy.kotlin.synthetic.s3#ListObjectVersionsResponse$VersionIdMarker
            "VersionIdMarker" -> builder.versionIdMarker = curr.tryData()
                .getOrDeserializeErr { "expected (string: `com.amazonaws.s3#VersionIdMarker`)" }
            // Versions smithy.kotlin.synthetic.s3#ListObjectVersionsResponse$Versions
            "Version" -> builder.versions = run {
                val el = deserializeObjectVersionDocument(curr)
                createOrAppend(builder.versions, el)
            }
            else -> {}
        }
        curr.drop()
    }
}
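The `tryData().parseInt().getOrDeserializeErr { ... }` chains above read like a pipeline over a result value. A minimal sketch of that pattern using `kotlin.Result` follows. All three extension functions and the exception class here are illustrative assumptions that merely reuse the names seen in the generated code; the real implementations in the runtime are not shown here and may behave differently.

```kotlin
// Illustrative exception type; an assumption for this sketch only.
class DeserializationException(message: String, cause: Throwable?) :
    RuntimeException(message, cause)

// Each parse step maps a successful string to a typed value and
// turns a parse failure into a failed Result instead of throwing.
fun Result<String>.parseInt(): Result<Int> = mapCatching { it.toInt() }
fun Result<String>.parseBoolean(): Result<Boolean> = mapCatching { it.toBooleanStrict() }

// Terminal step: unwrap the value or throw with a shape-specific message.
fun <T> Result<T>.getOrDeserializeErr(errorMessage: () -> String): T =
    getOrElse { cause -> throw DeserializationException(errorMessage(), cause) }

fun main() {
    val maxKeys = Result.success("1000")
        .parseInt()
        .getOrDeserializeErr { "expected (integer: MaxKeys)" }
    println(maxKeys) // 1000

    val failure = runCatching {
        Result.success("abc")
            .parseInt()
            .getOrDeserializeErr { "expected (integer: MaxKeys)" }
    }
    println(failure.exceptionOrNull()?.message) // expected (integer: MaxKeys)
}
```

The error message lambda is only evaluated on failure, so the happy path pays nothing for building the message.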

Effect on Artifact Sizes

The 1.0.64 S3 release was 5,072,329 bytes. Local builds are coming in at 5,039,276 bytes (~0.6% smaller).

Benchmarks

I've updated the benchmarks. They are included inline here for easy review. The tl;dr is that the generated deserializers add less overhead on top of raw token lexing than before and, as a result, are faster.

jvm summary:
Benchmark                                                         (sourceFilename)  Mode  Cnt   Score   Error  Units
a.s.k.b.s.xml.XmlDeserializerBenchmark.deserializeBenchmark                    N/A  avgt    5  33.566 ± 0.074  ms/op
a.s.k.b.s.xml.XmlLexerBenchmark.deserializeBenchmark          countries-states.xml  avgt    5  25.200 ± 0.079  ms/op
a.s.k.b.s.xml.XmlLexerBenchmark.deserializeBenchmark            kotlin-article.xml  avgt    5   0.846 ± 0.003  ms/op
a.s.k.b.s.xml.XmlSerializerBenchmark.serializeBenchmark                        N/A  avgt    5  21.714 ± 0.385  ms/op

The lexer internals didn't change, so those numbers are nearly the same as the prior baseline. The deserialize benchmark came
in at 33.566 ms/op compared to the prior 90.697 ms/op (about 63% less time per operation, or roughly 2.7x faster).
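The before/after comparison can be checked with a few lines of arithmetic. The function names below are made up for this sketch; the two inputs are the ms/op figures quoted above. The result works out to roughly a 63% reduction in time per operation, or about a 2.7x speedup.

```kotlin
import java.util.Locale

// Percentage of time saved per operation relative to the old baseline.
fun percentReduction(before: Double, after: Double): Double =
    (before - after) / before * 100.0

// How many times faster the new implementation is.
fun speedup(before: Double, after: Double): Double = before / after

fun main() {
    val before = 90.697 // ms/op, prior baseline
    val after = 33.566  // ms/op, this PR

    // Locale.ROOT keeps the decimal separator stable across machines.
    println("%.1f%% less time per op".format(Locale.ROOT, percentReduction(before, after)))
    println("%.2fx faster".format(Locale.ROOT, speedup(before, after)))
}
```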

Binary Compatibility

This change intentionally breaks binary compatibility on a few @InternalApi APIs:

XmlDeserializer - removed completely as it's no longer used
parseRestXmlError/parseEc2QueryError - removed erroneous suspend
XmlToken and XmlToken.QualifiedName - renamed fields to improve readability

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

lauzadis · 2024-02-26T15:19:32Z

.../src/main/kotlin/software/amazon/smithy/kotlin/codegen/rendering/serde/XmlParserGenerator.kt

+        writer.deserializeLoop(serdeCtx) { innerCtx ->
+            members.forEach { member ->
+                val name = member.getTrait<XmlNameTrait>()?.value ?: member.memberName
+                write("// ${member.memberName} ${escape(member.id.toString())}")


opinion: I don't think these member name comments are super useful. Did you mean to include them, or were they just used for debugging?

I meant to include them. I think they are helpful if you have to debug something, since they point you to the exact model shape.

.../src/main/kotlin/software/amazon/smithy/kotlin/codegen/rendering/serde/XmlParserGenerator.kt

runtime/serde/common/src/aws/smithy/kotlin/runtime/serde/Parsers.kt

runtime/serde/serde-xml/common/src/aws/smithy/kotlin/runtime/serde/xml/XmlTagReader.kt

lauzadis · 2024-02-26T15:57:49Z

runtime/serde/serde-xml/common/src/aws/smithy/kotlin/runtime/serde/xml/XmlTagReader.kt

+        return nextTok?.tagReader(reader).also { newScope ->
+            last = newScope
+        }


question: not questioning the correctness but why is last set to the newly created newScope reader rather than the old input reader?

It's the last tag reader we dispensed via nextTag, we use it to ensure that when nextTag is invoked again that we have the correct state.
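A toy model may help show why remembering the last dispensed child matters. This is a sketch under stated assumptions, not the real XmlTagReader: the token types and the `TagReader` class are invented for this example, and all readers share one forward-only token iterator. Before handing out the next child, the parent drops whatever the previous child left unread so the shared stream is positioned correctly.

```kotlin
// Toy token model; not the runtime's XmlToken.
sealed interface Tok
data class Begin(val name: String) : Tok
data class End(val name: String) : Tok
data class Text(val value: String) : Tok

class TagReader(val name: String, private val tokens: Iterator<Tok>) {
    private var last: TagReader? = null // most recent child handed out by nextTag()
    private var closed = false          // true once our own end tag was consumed

    fun nextTag(): TagReader? {
        last?.drop() // skip anything the previous child did not consume
        last = null
        while (!closed && tokens.hasNext()) {
            when (val tok = tokens.next()) {
                is Begin -> return TagReader(tok.name, tokens).also { last = it }
                is End -> closed = true // our own end tag: this scope is exhausted
                is Text -> {}           // ignore text between child tags
            }
        }
        return null
    }

    // Consume the rest of this tag's scope, including nested children.
    fun drop() {
        while (nextTag() != null) { /* discard */ }
    }
}

fun main() {
    // <Root><A>1</A><B><C/></B><A>2</A></Root>
    val tokens = listOf(
        Begin("Root"), Begin("A"), Text("1"), End("A"),
        Begin("B"), Begin("C"), End("C"), End("B"),
        Begin("A"), Text("2"), End("A"), End("Root"),
    ).iterator()
    val root = TagReader((tokens.next() as Begin).name, tokens)

    val names = mutableListOf<String>()
    while (true) {
        val child = root.nextTag() ?: break
        names += child.name // note: we never read the child's contents
    }
    println(names) // [A, B, A]
}
```

Even though the caller never consumes any child, the nested `C` tag is not mistaken for a child of `Root`, because each call to `nextTag` first drains the previous child's scope.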

runtime/serde/serde-xml/common/test/aws/smithy/kotlin/runtime/serde/xml/XmlTagReaderTest.kt

lauzadis · 2024-02-26T16:12:41Z

tests/benchmarks/serde-benchmarks/README.md

@@ -8,20 +8,20 @@ This project contains micro benchmarks for the serialization implementation(s).
 ./gradlew :runtime:serde:serde-benchmarks:jvmBenchmark
 ```

-Baseline `0.7.8-beta` on EC2 **[m5.4xlarge](https://aws.amazon.com/ec2/instance-types/m5/)** in **OpenJK 1.8.0_312**:
+Baseline on EC2 **[m5.4xlarge](https://aws.amazon.com/ec2/instance-types/m5/)** in **Corretto-17.0.10.8.1**:


We probably want to keep the SDK version here, right?

I was going to, but it's kind of a chicken-and-egg problem. In a branch we aren't on a tagged version, so we either guess what that version is going to be, use a commit sha/PR number, or just let the commit history speak for itself. I chose to let the commit history speak for itself.

lauzadis · 2024-02-26T16:15:49Z

tests/codegen/serde-tests/build.gradle.kts

+    // FIXME - this task up-to-date checks are wrong, likely something is not setup right with inputs/outputs somewhere
+    // for now just always run it
+    outputs.upToDateWhen { false }


question: is there a backlog task for this?

lauzadis · 2024-02-26T16:23:54Z

tests/codegen/serde-tests/src/test/kotlin/aws/smithy/kotlin/tests/serde/XmlUnionTest.kt

+    // FIXME - https://github.com/awslabs/smithy-kotlin/issues/1040
+    // @Test
+    // fun testUnitField() { }


Should this be uncommented and filled out? Even if failing, we can add an @Ignore and then ensure it passes once the bug is fixed

The question is what to write as a test? Nothing we put here is going to be right at the moment.

...-codegen/src/main/kotlin/software/amazon/smithy/kotlin/codegen/core/AbstractCodeWriterExt.kt

...in-codegen/src/main/kotlin/software/amazon/smithy/kotlin/codegen/rendering/serde/SerdeExt.kt

ianbotsf · 2024-02-26T15:49:41Z

.../src/main/kotlin/software/amazon/smithy/kotlin/codegen/rendering/serde/XmlParserGenerator.kt

-                        renderDeserializerBody(ctx, shape, members.toList(), writer)
-                        writer.write("return value ?: throw #T(#S)", RuntimeTypes.Serde.DeserializationException, "Deserialized union value unexpectedly null: ${symbol.name}")
+                        renderDeserializerBody(ctx, serdeCtx, shape, members.toList(), writer)
+                        writer.write("return value ?: throw #T(#S)", Serde.DeserializationException, "Deserialized union value unexpectedly null: ${symbol.name}")


Style: I generally find non-top-level imports to be confusing and would rather read RuntimeTypes.Serde.DeserializationException than Serde.DeserializationException, even though the latter is shorter.

runtime/serde/serde-xml/common/src/aws/smithy/kotlin/runtime/serde/xml/XmlStreamReader.kt

runtime/serde/serde-xml/common/src/aws/smithy/kotlin/runtime/serde/xml/XmlTagReader.kt

ianbotsf · 2024-02-26T18:59:23Z

runtime/serde/serde-xml/common/src/aws/smithy/kotlin/runtime/serde/xml/XmlTagReader.kt

+        var cand = nextToken()
+        while (cand != null && cand !is XmlToken.BeginElement) {
+            cand = nextToken()
+        }
+
+        val nextTok = cand as? XmlToken.BeginElement


Style: Might be simpler with a sequence:

val nextTok = generateSequence(::nextToken)
    .filterIsInstance<XmlToken.BeginElement>()
    .firstOrNull()
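For comparison, here is a self-contained sketch showing that the loop and the sequence form pick the same token and consume the same amount of input. The `XmlToken` hierarchy and `TokenSource` below are simplified stand-ins invented for this example, not the runtime types.

```kotlin
// Simplified stand-in token types for this sketch.
sealed interface XmlToken {
    data class BeginElement(val name: String) : XmlToken
    data class Text(val value: String) : XmlToken
    data class EndElement(val name: String) : XmlToken
}

// Hypothetical token source exposing a nullable nextToken().
class TokenSource(tokens: List<XmlToken>) {
    private val iter = tokens.iterator()
    fun nextToken(): XmlToken? = if (iter.hasNext()) iter.next() else null
}

// Loop form, as in the diff.
fun nextBeginLoop(src: TokenSource): XmlToken.BeginElement? {
    var cand = src.nextToken()
    while (cand != null && cand !is XmlToken.BeginElement) {
        cand = src.nextToken()
    }
    return cand as? XmlToken.BeginElement
}

// Sequence form: generateSequence stops when nextToken() returns null,
// and firstOrNull() pulls tokens lazily, only up to the first match.
fun nextBeginSequence(src: TokenSource): XmlToken.BeginElement? =
    generateSequence(src::nextToken)
        .filterIsInstance<XmlToken.BeginElement>()
        .firstOrNull()

fun main() {
    val tokens = listOf(
        XmlToken.Text(" "),
        XmlToken.EndElement("a"),
        XmlToken.BeginElement("b"),
        XmlToken.BeginElement("c"),
    )
    val viaLoop = TokenSource(tokens)
    val viaSeq = TokenSource(tokens)

    println(nextBeginLoop(viaLoop) == nextBeginSequence(viaSeq)) // true
    // Neither form read past the first match, so "c" is still available.
    println(nextBeginLoop(viaLoop) == nextBeginSequence(viaSeq)) // true
}
```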

runtime/serde/serde-xml/common/src/aws/smithy/kotlin/runtime/serde/xml/XmlTagReader.kt

...-xml/common/src/aws/smithy/kotlin/runtime/serde/xml/deserialization/LexingXmlStreamReader.kt

.../src/main/kotlin/software/amazon/smithy/kotlin/codegen/rendering/serde/XmlParserGenerator.kt

sonarcloud · 2024-02-28T01:50:04Z

Quality Gate failed

Failed conditions
3.6% Duplication on New Code (required ≤ 3%)

See analysis details on SonarCloud

lauzadis · 2024-02-28T14:47:56Z

...-protocols/common/src/aws/smithy/kotlin/runtime/awsprotocol/xml/Ec2QueryErrorDeserializer.kt

@@ -14,7 +14,7 @@ internal data class Ec2QueryErrorResponse(val errors: List<Ec2QueryError>, val r
 internal data class Ec2QueryError(val code: String?, val message: String?)

 @InternalApi
-public fun parseEc2QueryErrorResponse(payload: ByteArray): ErrorDetails {
+public suspend fun parseEc2QueryErrorResponse(payload: ByteArray): ErrorDetails {


question: this suspend is now unnecessary but I'm assuming you kept it for backwards-compatibility. should we also deprecate this suspend fun and create a new non-suspending function?

same for parseRestXmlErrorResponse

We could. I chose not to for now, but I don't feel strongly. Yes, I kept it for binary compat. I don't think the suspend has been necessary for some time, since our deserializers haven't been suspend for a very long time.
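For reference, the deprecate-and-replace option raised above might look roughly like the following. This is a sketch only: `ErrorDetails` here is a toy data class, both function names are hypothetical, and the parsing logic is a placeholder rather than anything from the runtime.

```kotlin
// Toy stand-in for the runtime's error details type.
data class ErrorDetails(val code: String?, val message: String?)

// Hypothetical new non-suspending entry point (name invented for this sketch).
// Placeholder parsing: treats the payload as "code:message".
fun parseErrorDetails(payload: ByteArray): ErrorDetails {
    val parts = payload.decodeToString().split(":", limit = 2)
    return ErrorDetails(parts[0], parts.getOrNull(1))
}

// The existing suspend signature stays for binary compatibility and
// simply delegates, steering callers toward the new function.
@Deprecated(
    message = "This function no longer needs to suspend",
    replaceWith = ReplaceWith("parseErrorDetails(payload)"),
)
suspend fun parseErrorResponse(payload: ByteArray): ErrorDetails =
    parseErrorDetails(payload)

fun main() {
    println(parseErrorDetails("Throttling:slow down".encodeToByteArray()))
    // ErrorDetails(code=Throttling, message=slow down)
}
```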

aajtodd added 17 commits February 23, 2024 14:33

add Tag scoped reader abstraction

3578d7f

bootstrap generated serde tests

4dc2106

add xml serde test suite

f30302f

implement map and list deserialize

0f3fe69

refactor to use result

130d215

almost there

ec5736e

fix union deserialization of flat collections

1ce6139

enable interspersed flat tests

d9628fa

enable more tests

29e0823

add hooks for unwrapping operation and error payloads and tracking co…

cdb4706

…rrect tag to decode from

fix attribute lookup

9f71a16

fix serde ctx

71ebe86

drop XmlDeserializer

184439a

remove unused fun

c9214ea

cleanup and renames

d4bd423

reorganize fields for better names

67d5575

api dump + changelog

e912826

aajtodd requested a review from a team as a code owner February 26, 2024 14:06

update benchmark baseline

5b7f8cf

aajtodd mentioned this pull request Feb 26, 2024

refactor XML deserialize awslabs/aws-sdk-kotlin#1233

Merged

fix -warn

b78724a

lauzadis reviewed Feb 26, 2024

fix member names

a1fe7c6

ianbotsf reviewed Feb 26, 2024

0marperez reviewed Feb 26, 2024

.../src/main/kotlin/software/amazon/smithy/kotlin/codegen/rendering/serde/XmlParserGenerator.kt

0marperez reviewed Feb 26, 2024

.../src/main/kotlin/software/amazon/smithy/kotlin/codegen/rendering/serde/XmlParserGenerator.kt

aajtodd added 4 commits February 27, 2024 12:33

feedback

4a98f5e

fix debug comment

822b727

regenerate api dump

feca351

fix map key type

ac7f246

really fix enum key types

1b1d0a5

lauzadis approved these changes Feb 28, 2024

ianbotsf approved these changes Feb 28, 2024

aajtodd merged commit 4a20344 into main Feb 28, 2024
12 of 14 checks passed

aajtodd deleted the fix-xml-deserialize branch February 28, 2024 20:00

aajtodd mentioned this pull request Mar 20, 2024

refactor: decrease generated artifact size #1057

Merged
