
Appending a local Pandas dataframe to a Delta table in Azure is slow #450

Open
beefupinetree opened this issue Oct 4, 2024 · 1 comment


@beefupinetree

I followed the instructions on this page to create a SQLAlchemy engine and used it with the pandas to_sql() method. Appending a single row to a Delta table in Azure Databricks takes around 2 seconds, and the time seems to scale linearly: a dataframe with 1 column and 10 rows takes ~20 seconds to push to Azure.

Is there a way to make writing back to Databricks from a local machine faster?

Example code:

from sqlalchemy import create_engine
import pandas as pd
import os

server = 'my_server.azuredatabricks.net'
http_path = "/sql/1.0/warehouses/my_warehouse"
access_token = "MY_TOKEN"
catalog = "my_catalog"
schema = "my_schema"

if "NO_PROXY" in os.environ:
    os.environ["NO_PROXY"] = os.environ["NO_PROXY"] + "," + server
else:
    os.environ["NO_PROXY"] = server

if "no_proxy" in os.environ:
    os.environ["no_proxy"] = os.environ["no_proxy"] + "," + server
else:
    os.environ["no_proxy"] = server

# Build the SQLAlchemy engine for the Databricks SQL warehouse.
engine = create_engine(
    f"databricks://token:{access_token}@{server}?"
    f"http_path={http_path}&catalog={catalog}&schema={schema}"
)

df = pd.DataFrame({'name': ['User 1', 'User 2', 'User 3', 'User 4', 'User 5',
                            'User 6', 'User 7', 'User 8', 'User 9', 'User 10']})

# The next command takes around 20 seconds to complete
df.to_sql(name='my_test_table', con=engine, if_exists='append', index=False)

Local specs:
databricks-sql-connector==3.4.0
pandas==2.2.2
Python 3.10.14

Azure specs:
Databricks SQL warehouse cluster
Runtime==13.3 LTS

@susodapop
Contributor

Just giving you some validation that, yes, this is very slow. Some discussion of the reasons behind this can be found here. It won't be possible to improve the import speed automatically without changes to the connector, but if you're willing to write your own INSERT queries you can import with much greater efficiency.
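
For reference, here is a minimal sketch of what a hand-written batched INSERT could look like, reusing the engine and df from the snippet above (the table and column names are taken from that snippet; the single multi-row statement is an assumption about what batches well, not official guidance):

from sqlalchemy import text

# Reuse the `engine` and `df` built in the example above.
# One parameterized multi-row INSERT sends every row in a single
# round trip instead of issuing one statement per row.
params = {f"p{i}": name for i, name in enumerate(df["name"])}
placeholders = ", ".join(f"(:p{i})" for i in range(len(df)))
stmt = text(f"INSERT INTO my_test_table (name) VALUES {placeholders}")

with engine.begin() as conn:
    conn.execute(stmt, params)

pandas' to_sql() also accepts method='multi', which packs multiple rows into one INSERT statement; it may be worth testing whether that helps with the Databricks dialect before hand-rolling SQL.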
