Critical Warnings For Users of This Connector Need To Be Documented #308

Open
fuzing opened this issue Nov 11, 2024 · 0 comments

fuzing commented Nov 11, 2024

This connector's documentation should warn users more clearly about some dangerous caveats, namely:

It stores its Kafka topic offsets as ancillary data within the Iceberg snapshot summary. Committing new data to your tables requires each record's Kafka offset to be >= the offset stored for its partition in the connector-created snapshot summary data in Iceberg.

This can be seen in the `Coordinator#commitToTable()` method:

```java
List<Envelope> filteredEnvelopeList =
    envelopeList.stream()
        .filter(
            envelope -> {
              Long minOffset = committedOffsets.get(envelope.partition());
              return minOffset == null || envelope.offset() >= minOffset;
            })
        .collect(toList());
```
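
To see what the connector has recorded for a table, you can dump each snapshot's summary map with the standard Iceberg Java API. A minimal sketch (table-loading boilerplate elided; the exact summary keys the connector writes will show up in the output):

```java
import java.util.Map;
import org.apache.iceberg.Snapshot;
import org.apache.iceberg.Table;

public class SnapshotSummaryDump {

  /** Prints every snapshot's summary map, including any connector-written offset entries. */
  static void dumpSummaries(Table table) {
    for (Snapshot snapshot : table.snapshots()) {
      System.out.printf("snapshot %d @ %d%n", snapshot.snapshotId(), snapshot.timestampMillis());
      for (Map.Entry<String, String> entry : snapshot.summary().entrySet()) {
        System.out.printf("  %s = %s%n", entry.getKey(), entry.getValue());
      }
    }
  }
}
```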

This is very problematic in the following scenarios:

  1. If one does not persist Kafka state across invocations (e.g. Docker/k8s without specifying volumes for Kafka data), then upon restart this connector will not begin committing data until the offsets are >= those stored in the Iceberg snapshot summary. So, if you commit up to, say, offset 10,000 for a given table and then restart Kafka, no new Iceberg commit will be made for that table until you get past offset 10,000 again. Worse still, this filtering happens only at commit time (i.e. after your actual data files have already been written to Iceberg), meaning you'll end up with 10,000+ orphaned files in Iceberg (a cleanup sketch follows this list).
  2. When using this connector, it should probably be the only ingestion source for a given Iceberg table. Other ingestion sources won't be using the same semantics as this connector, and clearly won't be writing this extra snapshot summary data. This is less of an issue because the connector treats the absence of this special snapshot summary data (i.e. `minOffset == null`) as an affirmative signal to commit, but it is definitely a design smell.
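
On point (1), the stranded data files can at least be reclaimed with Iceberg's standard maintenance actions. A minimal sketch using the Spark-based orphan file cleanup (the table identifier is a placeholder, and this assumes the iceberg-spark runtime is on the classpath):

```java
import org.apache.iceberg.Table;
import org.apache.iceberg.actions.DeleteOrphanFiles;
import org.apache.iceberg.spark.Spark3Util;
import org.apache.iceberg.spark.actions.SparkActions;
import org.apache.spark.sql.SparkSession;

public class OrphanCleanup {
  public static void main(String[] args) throws Exception {
    SparkSession spark = SparkSession.builder().appName("orphan-cleanup").getOrCreate();

    // "my_catalog.db.events" is a placeholder table identifier.
    Table table = Spark3Util.loadIcebergTable(spark, "my_catalog.db.events");

    // Remove files not referenced by any snapshot and older than 3 days
    // (the cutoff guards against deleting files from in-flight commits).
    DeleteOrphanFiles.Result result =
        SparkActions.get(spark)
            .deleteOrphanFiles(table)
            .olderThan(System.currentTimeMillis() - 3 * 24 * 60 * 60 * 1000L)
            .execute();

    result.orphanFileLocations().forEach(loc -> System.out.println("removed: " + loc));
  }
}
```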

A fair number of the currently outstanding issues for this repo appear to center on these design anomalies. It would definitely be wise to document them.

One solution for environments that do not persist Kafka state would be a boolean setting that makes the connector disregard the Iceberg-persisted offsets for its initial commit (this would solve (1) above, but not (2)).
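
A sketch of what that might look like in the commit path; the config name `iceberg.coordinator.ignore-committed-offsets` and the surrounding class are hypothetical, not part of the connector:

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

/**
 * Hypothetical variant of the Coordinator filter: when ignoreCommittedOffsets is
 * enabled (e.g. via a made-up config key "iceberg.coordinator.ignore-committed-offsets"),
 * the first commit after startup skips the snapshot-summary offset check.
 */
class OffsetFilter {
  private final boolean ignoreCommittedOffsets;
  private boolean firstCommit = true;

  OffsetFilter(boolean ignoreCommittedOffsets) {
    this.ignoreCommittedOffsets = ignoreCommittedOffsets;
  }

  List<Envelope> filter(List<Envelope> envelopeList, Map<Integer, Long> committedOffsets) {
    boolean skipCheck = ignoreCommittedOffsets && firstCommit;
    firstCommit = false;
    return envelopeList.stream()
        .filter(
            envelope -> {
              if (skipCheck) {
                return true; // trust Kafka's current offsets, not the stale snapshot summary
              }
              Long minOffset = committedOffsets.get(envelope.partition());
              return minOffset == null || envelope.offset() >= minOffset;
            })
        .collect(Collectors.toList());
  }

  /** Minimal stand-in for the connector's Envelope type. */
  record Envelope(int partition, long offset) {}
}
```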

Otherwise, this connector is a great solution!
