
MalformedJsonError due to Databricks identity column #62

Open · Johannes-Vink opened this issue May 23, 2024 · 0 comments

Johannes-Vink commented May 23, 2024

I am running duckdb directly on Azure Databricks, just for the fun of it. My ultimate goal is to pit duckdb against Databricks Photon, to see whether it can compete with Photon while running on Databricks.

Instead of authenticating to an Azure storage account via SAS keys etc., I've used an Azure Databricks Volume, which is addressed as a local path, and that works well:
[screenshot]
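For reference, this is roughly what the query looks like on the duckdb side; the volume and table paths below are placeholders, not the real names:

```sql
INSTALL delta;
LOAD delta;

-- The Unity Catalog volume shows up as a local path, so no SAS key or
-- other Azure credential setup is needed on the duckdb side.
SELECT count(*)
FROM delta_scan('/Volumes/my_catalog/my_schema/my_volume/my_table');
```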

A newly created table works; I've copied the data over 10x and run OPTIMIZE on it:
[screenshot]

This table has very high Delta protocol versions and several table features enabled:
delta.checkpoint.writeStatsAsJson=false
delta.checkpoint.writeStatsAsStruct=true
delta.enableDeletionVectors=true
delta.feature.appendOnly=supported
delta.feature.checkConstraints=supported
delta.feature.deletionVectors=supported
delta.feature.invariants=supported
delta.minReaderVersion=3
delta.minWriterVersion=7
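(These properties come from something like the following on the Databricks side; the table name is a placeholder:)

```sql
SHOW TBLPROPERTIES my_schema.my_table;
```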

An old table, unmodified for some time, is not working:
[screenshot]
But that table has much lower Delta protocol versions:
delta.minReaderVersion=1
delta.minWriterVersion=6

I did upgrade it to higher protocol versions:
delta.feature.appendOnly=supported
delta.feature.changeDataFeed=supported
delta.feature.checkConstraints=supported
delta.feature.generatedColumns=supported
delta.feature.identityColumns=supported
delta.feature.invariants=supported
delta.minReaderVersion=3
delta.minWriterVersion=7
But duckdb still throws the same JSON error.
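The upgrade was done with something along these lines; the table name is a placeholder, and setting the minReader/minWriter properties is the documented way to bump a Delta table's protocol on Databricks:

```sql
ALTER TABLE my_schema.my_table SET TBLPROPERTIES (
  'delta.minReaderVersion' = '3',
  'delta.minWriterVersion' = '7'
);
```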

Then I found out that the original table does have a Databricks identity column (introduced in Databricks Runtime 10.4, but I think this feature is outside the open source Delta spec):
CREATE TABLE IF NOT EXISTS SCHEMA.TABLE (
  ID bigint GENERATED ALWAYS AS IDENTITY (START WITH 1 INCREMENT BY 1)
)
My new table was copied without the identity column, and that one works fine.
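A minimal sketch of the kind of copy that works, assuming the data is simply inserted into a table declared without the identity attribute (table names are placeholders):

```sql
-- Same column, but declared as a plain bigint instead of
-- GENERATED ALWAYS AS IDENTITY, so the identityColumns writer feature
-- never gets enabled on the new table.
CREATE TABLE IF NOT EXISTS my_schema.my_table_copy (
  ID bigint
);

INSERT INTO my_schema.my_table_copy
SELECT ID FROM my_schema.my_table;
```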

Looking into the Delta table structure, the .crc file contains the following entry:
"schemaString": "{\"type\":\"struct\",\"fields\":[{\"name\":\"ID\",\"type\":\"long\",\"nullable\":true,\"metadata\":{\"delta.identity.start\":1,\"delta.identity.step\":1,\"delta.identity.allowExplicitInsert\":false}}]}",
and further down:
"writerFeatures": [ "identityColumns", "deletionVectors" ]

So the good news:

  • duckdb works out of the box on Azure Databricks Volumes
  • duckdb works with very high Delta protocol versions, including deletion vectors (btw, I did not test special characters in column names: spaces, {}, [], (), etc.)

The bad news:

  • Databricks has some features outside the open source spec (?) that make duckdb fail