
Support write operation #44

Open
quentingodeau opened this issue Feb 29, 2024 · 25 comments

Comments

@quentingodeau
Contributor

Hi,

It's not really an issue, but more an insight into what I plan to work on.
For the moment I haven't started, but when I do I will post a message here. If someone starts before me, please let me know ;)

@samansmink
Collaborator

That's great to hear :) Let me know if you need any help; feel free to reach out to me on the DuckDB Discord or [email protected]

samansmink mentioned this issue Feb 29, 2024
@csubhodeep

csubhodeep commented Mar 4, 2024

Hello DuckDB team, nice to see that this has already been posted as an issue. My team and I would also love to have this feature. To add some context: at my company we work with ETL pipelines that mostly use pandas, but for performance reasons we have started migrating to DuckDB. We have an Azure-native infrastructure, so while we can already enjoy the Parquet import feature, we would like a Parquet export capability like the one for S3.

Just a small example:

CREATE TABLE weather (
    city    VARCHAR,
    temp_lo INTEGER, -- minimum temperature on a day
    temp_hi INTEGER, -- maximum temperature on a day
    prcp    REAL,
    date    DATE
);

INSERT INTO weather VALUES ('San Francisco', 46, 50, 0.25, '1994-11-27');
INSERT INTO weather VALUES ('San Francisco', 46, 50, 0.25, '1994-11-27');
INSERT INTO weather VALUES ('San Francisco', 46, 50, 0.25, '1994-11-27');

COPY weather TO 'az://⟨my_container⟩/⟨my_file⟩.⟨parquet_or_csv⟩'

Additionally, we would love advice on any temporary workarounds that would let us write from DuckDB directly to Azure Blob (or Data Lake) Storage. Thanks 😃

@samansmink
Collaborator

@csubhodeep you can try using fsspec if you're on Python; they should have Azure support.
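
Something along these lines should work (an untested sketch: it assumes adlfs is installed, that AZURE_STORAGE_CONNECTION_STRING holds your connection string, and the table and container names are placeholders):

import os

import duckdb
from fsspec import filesystem

# Register an fsspec/adlfs-backed Azure filesystem with DuckDB so that
# abfs:// paths are handled by adlfs, which supports writes.
con = duckdb.connect()
con.register_filesystem(filesystem("abfs", connection_string=os.getenv("AZURE_STORAGE_CONNECTION_STRING")))

# A COPY to an abfs:// path now goes through adlfs instead of the azure extension.
con.sql("COPY my_table TO 'abfs://my_container/my_file.parquet' (FORMAT 'parquet')")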

@csubhodeep

> @csubhodeep you can try using fsspec if you're on Python; they should have Azure support.

OK, thanks a lot. I will try it.

@csubhodeep

Thanks again! I tried to use the fsspec library in conjunction with adlfs. TL;DR: no success.

Here is what I tried:

>>> import duckdb
>>> from fsspec import filesystem
>>> storage_account_name = "our_account"
>>> container_name = "our_container"
>>> account_creds = "<our_key>"
>>> duckdb.register_filesystem(filesystem('abfs', connection_string=account_creds))

>>> duckdb.sql("CREATE OR REPLACE TABLE test_table (a INTEGER, b VARCHAR(100))")
>>> duckdb.sql("INSERT INTO test_table VALUES (1, 'a'), (2, 'b'), (3, 'c')")
>>> duckdb.sql("SELECT * FROM test_table")
┌───────┬─────────┐
│   a   │    b    │
│ int32 │ varchar │
├───────┼─────────┤
│     1 │ a       │
│     2 │ b       │
│     3 │ c       │
└───────┴─────────┘
>>> write_query = f"COPY test_table TO 'https://{storage_account_name}.blob.core.windows.net/{container_name}/test.parquet' (FORMAT 'parquet')"
---------------------------------------------------------------------------
IOException                               Traceback (most recent call last)
Cell In[41], line 2
      1 # dump it as parquet
----> 2 duckdb.sql(write_query)

IOException: IO Error: Cannot open file "https://<storage_account_name>.blob.core.windows.net/<container_name>/test.parquet": No such file or directory
>>> write_query = f"COPY test_table TO 'az://{storage_account_name}.blob.core.windows.net/{container_name}/test.parquet' (FORMAT 'parquet')"
---------------------------------------------------------------------------
NotImplementedException                   Traceback (most recent call last)
Cell In[43], line 2
      1 # dump it as parquet
----> 2 duckdb.sql(write_query)

NotImplementedException: Not implemented Error: Writing to Azure containers is currently not supported

Please let me know if I am doing something wrong.

@samansmink
Collaborator

Could you try the abfs:// URLs instead? The az:// ones are triggering our auto-installation, routing the requests through the azure extension. Alternatively, disable autoloading with SET autoinstall_known_extensions = false;
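
For example (a sketch, reusing the container name from the snippet above; both settings are standard DuckDB options):

# Keep DuckDB from auto-installing/loading the azure extension, so that
# abfs:// paths stay with the registered fsspec filesystem.
duckdb.sql("SET autoinstall_known_extensions = false")
duckdb.sql("SET autoload_known_extensions = false")

duckdb.sql(f"COPY test_table TO 'abfs://{container_name}/test.parquet' (FORMAT 'parquet')")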

@csubhodeep

After trying the suggestion above, here are the results:

Exception ignored in: <function AzureBlobFile.__del__ at 0x7feb3d5a4280>
Traceback (most recent call last):
  File "/workspaces/rev_man_sys/venv/lib/python3.8/site-packages/adlfs/spec.py", line 2166, in __del__
    self.close()
  File "/workspaces/rev_man_sys/venv/lib/python3.8/site-packages/adlfs/spec.py", line 1983, in close
    super().close()
  File "/workspaces/rev_man_sys/venv/lib/python3.8/site-packages/fsspec/spec.py", line 1932, in close
    self.flush(force=True)
  File "/workspaces/rev_man_sys/venv/lib/python3.8/site-packages/fsspec/spec.py", line 1803, in flush
    if self._upload_chunk(final=force) is not False:
  File "/workspaces/rev_man_sys/venv/lib/python3.8/site-packages/fsspec/asyn.py", line 118, in wrapper
    return sync(self.loop, func, *args, **kwargs)
  File "/workspaces/rev_man_sys/venv/lib/python3.8/site-packages/fsspec/asyn.py", line 103, in sync
    raise return_result
  File "/workspaces/rev_man_sys/venv/lib/python3.8/site-packages/fsspec/asyn.py", line 56, in _runner
    result[0] = await coro
  File "/workspaces/rev_man_sys/venv/lib/python3.8/site-packages/adlfs/spec.py", line 2147, in _async_upload_chunk
    raise RuntimeError(f"Failed to upload block: {e}!") from e
RuntimeError: Failed to upload block: The specifed resource name contains invalid characters.
RequestId:3545381a-d01e-0083-2346-6efd88000000
Time:2024-03-04T15:11:28.7881722Z
ErrorCode:InvalidResourceName
Content: <?xml version="1.0" encoding="utf-8"?><Error><Code>InvalidResourceName</Code><Message>The specifed resource name contains invalid characters.
RequestId:3545381a-d01e-0083-2346-6efd88000000
Time:2024-03-04T15:11:28.7881722Z</Message></Error>!

Is it more of an adlfs issue?

@samansmink
Collaborator

samansmink commented Mar 4, 2024

I think you're not setting the connection string correctly; it appears you're setting it to your key.

Let's move this discussion elsewhere though, as this is no longer about this issue. Please check whether you're actually using fsspec correctly. If things are still wrong and it appears to be on the DuckDB side, feel free to open an issue in duckdb/duckdb.

@csubhodeep

I have managed to make it work. The issue at my end was this part of the path, {storage_account_name}.blob.core.windows.net, which even the adlfs library does not like for some reason. Leaving that out of the path just works fine. Thanks a lot again for all the guidance.
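
For anyone hitting the same thing, the difference boils down to this (a sketch, using the variables from my snippet above):

# Works: adlfs expects container-relative paths
duckdb.sql(f"COPY test_table TO 'abfs://{container_name}/test.parquet' (FORMAT 'parquet')")

# Fails: the account hostname must not be part of the abfs:// path
# duckdb.sql(f"COPY test_table TO 'abfs://{storage_account_name}.blob.core.windows.net/{container_name}/test.parquet' (FORMAT 'parquet')")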

@patialashahi31

Hi. I guess I am facing the same issue. I am using Kotlin. Is there any workaround for this?

Thanks

@quentingodeau
Contributor Author

Hello,

For Kotlin I'm not aware of a workaround :(
Nevertheless, I have started working on the issue, but I don't think I will be able to make it work until the following PR is merged.

@bayees

bayees commented Apr 6, 2024

+1 for this feature

@IlijaStankovski

Did the write operation feature via duckdb extensions end up being merged in the 1.0.0 release? I am currently using the 1.0.0 release, and write operations via the azure extension still fail when using the DuckDB Node.js libraries. I am using both az:// and abfss:// when trying to write back to a Parquet file hosted in an Azure Storage Account (az://) or Azure Data Lake (abfss://).

Write operation error message:

Error: Not implemented Error: AzureDfsStorageFileSystem: FileExists is not implemented! {errno: -1, code: 'DUCKDB_NODEJS_ERROR', errorType: 'Not implemented', stack: 'Error: Not implemented Error: AzureDfsStorageFile

Any clarification on the state of write-operation support to Azure via Node.js is highly appreciated.

Thank you.

@samansmink
Collaborator

@IlijaStankovski It has not. I would like to pick this up at some point, but I can't give a timeline here, unfortunately.

@IlijaStankovski

@samansmink thanks for the quick update. If you have a branch of code to build from that you want testers for, please let me know. Cheers...

@miquelmartorell

miquelmartorell commented Aug 26, 2024

+1, I support this request. It is really necessary. Thank you!
EDIT: With this solution it works perfectly: #44 (comment)

@djouallah

It would be nice to have :)

@davidsteinar

@samansmink any updates on this?

@samansmink
Collaborator

@davidsteinar not yet, sorry! I agree this is one of the more high-priority features we should look into, but I can't give a timeline yet.

@davidsteinar

@samansmink for sure, it would enable me to throw away our existing data warehouse completely! Do you have any idea of the size/complexity of implementing this?

@nixent

nixent commented Sep 29, 2024

+1 to support the feature

@pantonis

pantonis commented Oct 6, 2024

I don't think DuckDB has Azure at the top of its list; I don't know why, though. Azure Gen2 is absolutely the top storage service right now. I hope somebody implements this someday.

@DrewScatterday

+1 for support for writing Parquet to Azure

@DrewScatterday

DrewScatterday commented Oct 21, 2024

I'm trying to do the workaround @csubhodeep did and I'm having some issues. I have an Azure storage account dscatterday with a blob container main. I have an env var for the connection string in the format:
DefaultEndpointsProtocol=https;AccountName=dscatterday;AccountKey=asdadasdad==;EndpointSuffix=core.windows.net. I installed adlfs with pip install adlfs.

Here is a simple python code example to replicate my issue:

import duckdb
import os
from fsspec import filesystem

duckdb.register_filesystem(filesystem("abfs", connection_string=os.getenv("AZURE_CONN_STR")))
con = duckdb.connect()
con.execute("CREATE TABLE parqdata AS SELECT * FROM read_parquet('C:\\Users\\dre11620\\Downloads\\NY\\smallNYTaxi.parquet\\*.parquet', hive_partitioning = true);")
con.execute(f'''COPY parqdata TO 'abfs://main/test.parquet' (FORMAT PARQUET, PER_THREAD_OUTPUT);''') 

But I keep getting back a duckdb error:

duckdb.duckdb.IOException: IO Error: Failed to create directory "abfs://main/test.parquet": The filename, directory name, or volume label syntax is incorrect.

What am I missing?

@DrewScatterday

DrewScatterday commented Oct 21, 2024

Figured out the issue: I needed to call register_filesystem on the DB connection object. Works now!
con.register_filesystem(filesystem("abfs", connection_string=os.getenv("AZURE_CONN_STR")))
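
For reference, the full working version (same assumptions as my earlier snippet: adlfs installed and AZURE_CONN_STR holding the connection string):

import os

import duckdb
from fsspec import filesystem

con = duckdb.connect()
# Register the filesystem on the connection that runs the COPY,
# not on the duckdb module (which targets the default connection).
con.register_filesystem(filesystem("abfs", connection_string=os.getenv("AZURE_CONN_STR")))
con.execute("CREATE TABLE parqdata AS SELECT * FROM read_parquet('C:\\Users\\dre11620\\Downloads\\NY\\smallNYTaxi.parquet\\*.parquet', hive_partitioning = true);")
con.execute("COPY parqdata TO 'abfs://main/test.parquet' (FORMAT PARQUET, PER_THREAD_OUTPUT);")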
