feat/databricks delta table destination connector #318

rbiseck3 · 2024-12-20T14:26:48Z

Description

Adds support for writing to Databricks Delta tables in two ways:

Adds a Databricks Delta table destination connector, built on the existing SQL foundation.
Adds support for supplemental table migration in the Databricks Volumes connector.

micmarty-deepsense · 2024-12-30T12:33:08Z

It seems that the destination entry in unstructured_ingest/v2/processes/connectors/sql/__init__.py is missing.

micmarty-deepsense · 2025-01-02T10:42:18Z

unstructured_ingest/v2/processes/connectors/sql/sql.py

@@ -406,7 +422,7 @@ def upload_dataframe(self, df: pd.DataFrame, file_data: FileData) -> None:
                cursor.executemany(stmt, values)

    def get_table_columns(self) -> list[str]:
-        with self.connection_config.get_cursor() as cursor:
+        with self.get_cursor() as cursor:
            cursor.execute(f"SELECT * from {self.upload_config.table_name}")


I know this line hasn't changed, but I think it's worth double-checking to ensure it's the solution we want to keep. Using SELECT * doesn't seem optimal for just fetching column descriptions (I might be wrong). Perhaps we could opt for something like this:

cursor.execute("DESCRIBE `some_table_name`")

which produces output like this:

[
Row(col_name='id', data_type='string', comment=None),
Row(col_name='record_id', data_type='string', comment=None),
Row(col_name='element_id', data_type='string', comment=None),
Row(col_name='text', data_type='string', comment=None),
Row(col_name='embeddings', data_type='array<float>', comment=None),
Row(col_name='type', data_type='string', comment=None),
Row(col_name='url', data_type='string', comment=None),
Row(col_name='version', data_type='string', comment=None),
Row(col_name='data_source_date_created', data_type='timestamp', comment=None),
Row(col_name='data_source_date_modified', data_type='timestamp', comment=None),
Row(col_name='data_source_date_processed', data_type='timestamp', comment=None),
Row(col_name='data_source_permissions_data', data_type='string', comment=None),
Row(col_name='data_source_filesize_bytes', data_type='float', comment=None),
Row(col_name='data_source_url', data_type='string', comment=None),
Row(col_name='data_source_version', data_type='string', comment=None),
Row(col_name='data_source_record_locator', data_type='string', comment=None),
Row(col_name='category_depth', data_type='int', comment=None),
Row(col_name='parent_id', data_type='string', comment=None),
Row(col_name='attached_filename', data_type='string', comment=None),
Row(col_name='filetype', data_type='string', comment=None),
Row(col_name='last_modified', data_type='timestamp', comment=None),
Row(col_name='file_directory', data_type='string', comment=None),
Row(col_name='filename', data_type='string', comment=None),
Row(col_name='languages', data_type='array<string>', comment=None),
Row(col_name='page_number', data_type='string', comment=None),
Row(col_name='links', data_type='string', comment=None),
Row(col_name='page_name', data_type='string', comment=None),
Row(col_name='link_urls', data_type='string', comment=None),
Row(col_name='link_texts', data_type='string', comment=None),
Row(col_name='sent_from', data_type='string', comment=None),
Row(col_name='sent_to', data_type='string', comment=None),
Row(col_name='subject', data_type='string', comment=None),
Row(col_name='section', data_type='string', comment=None),
Row(col_name='header_footer_type', data_type='string', comment=None),
Row(col_name='emphasized_text_contents', data_type='string', comment=None),
Row(col_name='emphasized_text_tags', data_type='string', comment=None),
Row(col_name='text_as_html', data_type='string', comment=None),
Row(col_name='regex_metadata', data_type='string', comment=None),
Row(col_name='detection_class_prob', data_type='float', comment=None),
Row(col_name='is_continuation', data_type='boolean', comment=None),
Row(col_name='orig_elements', data_type='string', comment=None),
Row(col_name='coordinates_points', data_type='string', comment=None),
Row(col_name='coordinates_system', data_type='string', comment=None),
Row(col_name='coordinates_layout_width', data_type='float', comment=None),
Row(col_name='coordinates_layout_height', data_type='float', comment=None)
]

I don't actually know when this was introduced and I'm more than happy to refactor to something more optimal to fetch column data.
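
For illustration, the DESCRIBE suggestion above could be sketched roughly as follows. This is not code from the PR: the helper names are hypothetical, and it assumes the rows returned by the cursor are tuple-like with col_name in the first position, as in the output shown above. It also assumes that an empty name or one starting with "#" marks the start of a metadata section (for example partition information) rather than a real column.

```python
from typing import Any, Iterable


def columns_from_describe(rows: Iterable[Any]) -> list[str]:
    """Extract column names from the rows returned by a DESCRIBE statement."""
    columns: list[str] = []
    for row in rows:
        name = row[0]  # assumption: col_name is the first field of each row
        # Assumption: an empty name or a "#"-prefixed name starts a metadata
        # section (e.g. partition info), so stop collecting columns there.
        if not name or name.startswith("#"):
            break
        columns.append(name)
    return columns


def get_table_columns(cursor: Any, table_name: str) -> list[str]:
    """Hypothetical alternative to SELECT * for listing column names."""
    # table_name is interpolated into the statement, so it must come from
    # trusted configuration rather than user input.
    cursor.execute(f"DESCRIBE `{table_name}`")
    return columns_from_describe(cursor.fetchall())
```

Whether this is actually cheaper than the existing SELECT * would need to be measured against a real warehouse.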

micmarty-deepsense · 2025-01-02T12:35:44Z

test/integration/connectors/env_setup/sql/databricks_delta_tables/destination/schema.sql

+    last_modified TIMESTAMP,
+    file_directory STRING,
+    filename STRING,
+    languages ARRAY<STRING>,


Arrays are not inserted correctly; prepare_data converts lists into strings.

[upload] [DATATYPE_MISMATCH.CAST_WITHOUT_SUGGESTION] Cannot resolve "languages" due to data type mismatch: cannot cast "STRING" to "ARRAY<STRING>". SQLSTATE: 42K09; line 1 pos 0

The solution is either to change the SQL insert statement so that we can use ARRAY as a keyword, or to change the column type to STRING (which worked for me).

Actually, there's also an issue with the embeddings column.
https://github.com/databricks/databricks-sql-python/blob/01e998cabe4a4972d5b45f7936219d924706c75c/docs/parameters.md#migrating-to-native-parameters

Unfortunately, this SDK does not seem to support array parameters.

I don't see any other way that would be safe (not prone to SQL injection) and would still allow using cursor.execute with a list of parameters.

Ya, I've reached out to a contact at Databricks. Hopefully we'll hear of a workaround. Otherwise this entire approach might need to be redone.
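
To make the first option above concrete (using ARRAY in the insert statement), here is a minimal sketch that expands each list value into ARRAY(?, ...) with one marker per element, so every element is still bound as a parameter. It is untested against Databricks and is not the approach taken in this PR. The build_insert helper is hypothetical, and the "?" marker style is an assumption about the driver's parameter syntax.

```python
from typing import Any


def build_insert(table_name: str, row: dict[str, Any]) -> tuple[str, list[Any]]:
    """Build an INSERT statement that expands list values into ARRAY(?, ...)."""
    placeholders: list[str] = []
    params: list[Any] = []
    for value in row.values():
        if isinstance(value, list):
            # One marker per element keeps each element a bound parameter.
            markers = ", ".join("?" for _ in value)
            placeholders.append(f"ARRAY({markers})")
            params.extend(value)
        else:
            placeholders.append("?")
            params.append(value)
    columns = ", ".join(row.keys())
    # table_name and column names are interpolated, so they must be trusted.
    stmt = f"INSERT INTO {table_name} ({columns}) VALUES ({', '.join(placeholders)})"
    return stmt, params
```

Two limitations of this sketch: the statement text depends on the length of each list, so a single statement cannot be reused across rows with cursor.executemany, and an empty list produces ARRAY(), whose element type may not match the column type.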

micmarty-deepsense · 2025-01-02T12:39:00Z

unstructured_ingest/v2/processes/connectors/sql/sql.py

@@ -129,8 +129,13 @@ class SQLIndexer(Indexer, ABC):
    connection_config: SQLConnectionConfig


(comment not tied to any particular line, but relevant to this file)
I was getting "column does not exist" type of errors, because the schema expects a bunch of columns to have the data_source_ prefix.

We'd need to adjust column names here and there, I guess
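
One way the column-name adjustment could look, as a rough sketch only: it assumes the uploaded records carry nested dictionaries under keys such as data_source, and that the schema columns are formed by joining the parent key and the child key with an underscore. The flatten_prefixed helper and its default key list are hypothetical.

```python
from typing import Any


def flatten_prefixed(
    record: dict[str, Any],
    nested_keys: tuple[str, ...] = ("data_source",),
) -> dict[str, Any]:
    """Lift selected nested dicts to top-level keys named '<parent>_<child>'."""
    flat: dict[str, Any] = {}
    for key, value in record.items():
        if key in nested_keys and isinstance(value, dict):
            # e.g. {"data_source": {"url": ...}} becomes {"data_source_url": ...}
            for sub_key, sub_value in value.items():
                flat[f"{key}_{sub_key}"] = sub_value
        else:
            flat[key] = value
    return flat
```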

rbiseck3 · 2025-01-02T20:43:52Z

@micmarty-deepsense this PR is still in the works as we wait to hear back from Databricks on a solution for variable binding with list data.

micmarty-deepsense · 2025-01-03T10:08:48Z

I've also run into errors saying that the Timestamp is not JSON serializable. To fix it, change this line to:

json.dump(data, f, indent=2, default=str)

... but I believe it's not a good place to fix this 😛

Here's my temporary dev branch that attempts to naively address some of the obstacles I faced when testing this PR
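
As a small illustration of the default=str fallback mentioned above: json calls the default function for any value it cannot serialize, so timestamps are written as strings instead of raising. The sketch uses datetime from the standard library as a stand-in (pandas Timestamp is a datetime subclass), and json_default is a hypothetical stricter variant that only accepts datetimes.

```python
import json
from datetime import datetime
from typing import Any


def json_default(value: Any) -> str:
    """Serialize datetimes as ISO 8601 and reject anything else unknown."""
    if isinstance(value, datetime):
        return value.isoformat()
    raise TypeError(f"Object of type {type(value).__name__} is not JSON serializable")


data = {"last_modified": datetime(2025, 1, 3, 10, 8, 48)}

# default=str converts every unsupported value with str(), including ones
# that may not be intended to be serialized.
loose = json.dumps(data, default=str)

# The stricter variant only handles datetimes and still fails on other types.
strict = json.dumps(data, default=json_default)
```

The trade-off is that default=str silently stringifies any unsupported object, while a dedicated function keeps unexpected types visible as errors.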

rbiseck3 changed the title ~~feat/databricks delta table~~ feat/databricks delta table destination connector Dec 20, 2024

micmarty-deepsense reviewed Jan 2, 2025

rbiseck3 force-pushed the roman/databricks-delta-table branch from cd07dba to 74a209a Compare January 7, 2025 13:56

rbiseck3 force-pushed the roman/databricks-delta-table branch from d0731f9 to ffd643e Compare January 8, 2025 15:05

rbiseck3 added 8 commits January 10, 2025 11:23

Add requirements file

bded9cc

begin adding databricks delta table connector

ca3c34e

support M2M auth

cb2fc82

finish databricks delta table connector

ad6522b

support migrating to table from volumes connector

cda75c2

tidy

1db9b6d

populate default but empty TableMigrationConfig

f08ae35

Add dest int test

8f363a5

rbiseck3 force-pushed the roman/databricks-delta-table branch from 3056d8d to 8f363a5 Compare January 10, 2025 16:24
