-
Notifications
You must be signed in to change notification settings - Fork 175
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix(filesystem): UNC paths are not supported #1209
Conversation
✅ Deploy Preview for dlt-hub-docs canceled.
|
@rudolfix, that's a couple of cunning fixes. I'd say I didn't find ideological errors - it's just about a couple of symbols (normal ones, they work find in Python file opening), which make |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@IlyaFaer it is hard to say if this works or not without comments and tests or maybe I'm reviewing too early?
@rudolfix, I added the test into And as I said, there is no mistake in the code. There's just a bunch of Otherwise, UNC works fine with |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK good!
but you can add tests also here. we are testing glob
in test_local_fileststem
. please unit test this
- UNC path
- windows path starting with drive letter
- windows path using backslashes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
more specific on the test, following https://en.wikipedia.org/wiki/File_URI_scheme
please test following windows paths
- C:\a\b\b
- \localhost\c$\a
- file://server/folder/data.xml
- file:////server/folder/data.xml (if the above works, you can skip this one)
with theglob
function
path=posixpath.join(bucket_url_parsed.path, file_name) | ||
).geturl() | ||
|
||
if file.startswith("//"): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
when do we get files starting with //
? is this UNC path? please comment on that
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, that's legacy, but I think we should consider it as equal to \\
. Because... well, dealing with slash direction is very unpleasant for users.
Anyway, I don't think it's needed anymore. glob
seems to be handling fine all the possible cases.
@rudolfix, hm, I don't think it actually works. Surprisingly, when you try to read one file, it goes fine. But if you give a glob,
So, basically, it can see files by UNC path (I logged, it really does), but it filters them out, not returning anything, here: However, if I give a file path (not a glob), it reads the file correctly, not going to this part of code, which does the path translation and pattern comparison. If I give a normal full path, like |
@rudolfix, the only way I see how to fix it, is to write our own filesystem, inheriting the |
heh this is so bad :) could you check first if regular Python glob works correctly? then my take would be to use it for |
@rudolfix, yes, Python works fine, I tried it even before started to look at |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there are things that still does not work
).geturl() | ||
|
||
if is_unc_path: | ||
file_name = file |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
name must contain only name (final path component)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, not exactly. There is a relpath
used right now:
dlt/dlt/common/storages/fsspec_filesystem.py
Line 305 in 77e2499
file_name = posixpath.relpath(file, bucket_url_no_schema) |
So, it includes folder names too (if the bucket is samples
, then it gives names like csv/freshman_kgs.csv
):
dlt/tests/common/storages/utils.py
Lines 40 to 50 in 77e2499
"csv/freshman_kgs.csv", | |
"csv/freshman_lbs.csv", | |
"csv/mlb_players.csv", | |
"csv/mlb_teams_2012.csv", | |
"jsonl/mlb_players.jsonl", | |
"met_csv/A801/A881_20230920.csv", | |
"met_csv/A803/A803_20230919.csv", | |
"met_csv/A803/A803_20230920.csv", | |
"parquet/mlb_players.parquet", | |
"gzip/taxi.csv.gz", | |
"sample.txt", |
Is it supposed to work like this? I didn't ask, because tests expect relative paths, not just names, but now when you said it, I start to suspect.
The problem is: for UNC relpath
doesn't want to work. Not a problem to take a file name, but it'll be not the same as ordinary paths return. Relative paths for UNC seem problematic.
So, with UNC it'll be freshman_kgs.csv
, but with ordinary path it's csv/freshman_kgs.csv
. I suppose it'll be better if they return the same.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good point! although our docs mention that file_name can be a relative path, it is very misleading. so let's fix it:
- rename
file_name
torelative_path
and addfile_name
with just the name part - update all the tests
- update documentation: https://dlthub.com/docs/dlt-ecosystem/verified-sources/filesystem#fileitem-representation
luckily we do not use file_name
in verified sources
|
||
if is_unc_path: | ||
file_url = "file:///" + file | ||
file_name = file.replace(bucket_url_no_schema, "").replace("\\", "/").lstrip("/") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's the most straightforward and stupid solution, but I think it'll work. For UNC we're getting file
from glob
, which is supposed to always return the same result, so str
edits like this should work. This way it produces the same result as relpath
for normal paths.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is file
here? file_url must contain UNC path so we are able to use it to open later. make sure it is like that
assert len(all_file_items) == len(expected_files) | ||
|
||
for file in all_file_items: | ||
assert file["file_name"] in expected_files |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@IlyaFaer what is the problem with using assert_sample_files(all_file_items, filesystem, config, load_content)
here? this glob must work like any other globs. returned file details must be identical
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this will also try to load the content of the file so it will verify if fsspec open
works. I think it is crucial here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rudolfix, opening doesn't work. Hm-m, I didn't notice it, but our code uses fsspec.open()
, which (just like any other fsspec
method) breaks UNC paths. O-o-okay, I added like a separate branch for UNC. Locally, the tests all passed, pushing (there are still several small things I didn't do, but I wanna see if I didn't break anything).
@@ -276,19 +279,30 @@ def glob_files( | |||
""" | |||
import os | |||
|
|||
is_unc_path = "$" in bucket_url |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do you assume that all paths with "$" are UNC? The precondition is that protocol must be "file://"
and there are better ways to recognize this: https://stackoverflow.com/questions/76209122/how-to-detect-a-windows-network-path
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
).geturl() | ||
|
||
if is_unc_path: | ||
file_name = file |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good point! although our docs mention that file_name can be a relative path, it is very misleading. so let's fix it:
- rename
file_name
torelative_path
and addfile_name
with just the name part - update all the tests
- update documentation: https://dlthub.com/docs/dlt-ecosystem/verified-sources/filesystem#fileitem-representation
luckily we do not use file_name
in verified sources
|
||
if is_unc_path: | ||
file_url = "file:///" + file | ||
file_name = file.replace(bucket_url_no_schema, "").replace("\\", "/").lstrip("/") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is file
here? file_url must contain UNC path so we are able to use it to open later. make sure it is like that
…ve paths forms, improves tests
…veral globs to tests
@rudolfix, uf-f, that's quite a big re-writing. But I think I understood it |
@IlyaFaer it was actually impossible to correctly implement all file paths styles on all platforms using fsspec. what this PR does is to use Python as usual tests are way more complicated than code... |
local_path = unquote(uri.path) | ||
if uri.netloc: | ||
# or UNC file://localhost/path | ||
local_path = "//" + unquote(uri.netloc) + local_path |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is it intended to be //
or there is one more /
?
raise ConfigurationValueError( | ||
"File path or netloc missing. Field bucket_url of FilesystemClientConfiguration" | ||
" must contain valid url with a path or host:password component." | ||
"File path and netloc are missing. Field bucket_url of" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think File path and netloc are missing.
will confuse people, to avoid this should we mention that we are using urlparse
or URI
?
return str(pathlib.Path(local_path)) | ||
|
||
@staticmethod | ||
def make_file_uri(local_path: str) -> str: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this could have been in some dlt.common.destination.utils|filesystem
helpers
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you are right that this needs a better place. we have file utils in FileStorage
but they probably need to go somewhere else, then we'll move those as well
return not uri_parsed.scheme or os.path.isabs(uri) | ||
|
||
@staticmethod | ||
def make_local_path(file_uri: str) -> str: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this could have been in some dlt.common.destination.utils|filesystem helpers
Towards #1175