fix(filesystem): UNC paths are not supported #1209

Merged
merged 27 commits into devel from windows_path_fix
Apr 29, 2024

Conversation

IlyaFaer
Contributor

Towards #1175

@IlyaFaer IlyaFaer changed the title from "Windows path fix" to "fix(filesystem): UNC paths are not supported" Apr 11, 2024
@IlyaFaer
Contributor Author

IlyaFaer commented Apr 11, 2024

@rudolfix, that's a couple of cunning fixes. I'd say I didn't find any conceptual errors: it's just a couple of characters (normal ones, they work fine when opening files in Python) that make fsspec (not our code) turn the path into something weird. Can't say I like it, but it worked. Pushed a test in verified-sources: dlt-hub/verified-sources#423

@IlyaFaer IlyaFaer requested a review from rudolfix April 11, 2024 13:31
Collaborator

@rudolfix rudolfix left a comment

@IlyaFaer it is hard to say whether this works without comments and tests. or maybe I'm reviewing too early?

@IlyaFaer
Contributor Author

IlyaFaer commented Apr 12, 2024

@rudolfix, I added the test into verified-sources (dlt-hub/verified-sources#423), to check it with filesystem. It passes locally:

(screenshot: the test passing locally)

And as I said, there is no mistake in the code. There's just a bunch of if checks in fsspec, like if path.startswith("//") and not ":" in path. When these conditions are met, fsspec treats a UNC path incorrectly and tries to make it absolute, adding an incorrect prefix to the path. The path itself is fine, just a bit chaotic with slashes. Python deals with it easily, but fsspec can't digest it. So, for UNC, a small rearrangement of slashes and colons is needed to get the path through all the conditions without changing it significantly.

Otherwise, UNC works fine with fsspec and filesystem. They just sometimes can't recognize that the path is UNC.

Collaborator

@rudolfix rudolfix left a comment

OK good!

but you can add tests here as well. we are testing glob in test_local_filesystem. please unit test this:

  • UNC path
  • windows path starting with drive letter
  • windows path using backslashes

Collaborator

@rudolfix rudolfix left a comment

to be more specific about the tests, following https://en.wikipedia.org/wiki/File_URI_scheme:
please test the following windows paths with the glob function

  • C:\a\b\b
  • \\localhost\c$\a
  • file://server/folder/data.xml
  • file:////server/folder/data.xml (if the above works, you can skip this one)
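
For reference, a minimal sketch of how the Python standard library itself splits the four paths listed above. It uses only pathlib.PureWindowsPath and urllib.parse.urlparse, so it runs on any OS, and it says nothing about how fsspec treats the same inputs. The variable names are illustrative only.

```python
from pathlib import PureWindowsPath
from urllib.parse import urlparse

# Native Windows paths: PureWindowsPath applies Windows rules without touching the disk.
drive_path = PureWindowsPath(r"C:\a\b\b")
unc_path = PureWindowsPath(r"\\localhost\c$\a")
print(repr(drive_path.drive))  # drive letter part
print(repr(unc_path.drive))    # \\server\share part of the UNC path

# File URIs: the server name ends up in netloc only in the two-slash form.
two_slash = urlparse("file://server/folder/data.xml")
four_slash = urlparse("file:////server/folder/data.xml")
print(repr(two_slash.netloc), repr(two_slash.path))
print(repr(four_slash.netloc), repr(four_slash.path))
```

In the four-slash form the netloc is empty and the server name stays inside the path, which is one reason the two URI spellings may need separate handling.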

path=posixpath.join(bucket_url_parsed.path, file_name)
).geturl()

if file.startswith("//"):
Collaborator

when do we get files starting with //? is this UNC path? please comment on that

Contributor Author

@IlyaFaer IlyaFaer Apr 17, 2024

Well, that's legacy, but I think we should consider it as equal to \\. Because... well, dealing with slash direction is very unpleasant for users.

Anyway, I don't think it's needed anymore. glob seems to handle all the possible cases fine.

@IlyaFaer
Contributor Author

IlyaFaer commented Apr 16, 2024

@rudolfix, hm, I don't think it actually works. Surprisingly, when you try to read a single file, it goes fine. But if you give it a glob, fsspec does a path translation of some sort and always adds a disk name at the start, creating a pattern like:

(?s:[\\\\/][\\\\/]localhost[\\\\/][\\\\/]C\\$[\\\\/]git_reps[\\\\/]dlt[\\\\/]tests[\\\\/]common[\\\\/]storages[\\\\/]samples[\\\\/].*)\\Z

Here:
https://github.com/fsspec/filesystem_spec/blob/37c1bc63b9c5a5b2b9a0d5161e89b4233f888b29/fsspec/utils.py#L742

So, basically, it can see files by UNC path (I logged, it really does), but it filters them out, not returning anything, here:
https://github.com/fsspec/filesystem_spec/blob/37c1bc63b9c5a5b2b9a0d5161e89b4233f888b29/fsspec/spec.py#L608

However, if I give a file path (not a glob), it reads the file correctly, not going to this part of code, which does the path translation and pattern comparison.

If I give a normal full path, like C:/a/b/c, glob also works fine. But UNC - no.

@IlyaFaer
Contributor Author

IlyaFaer commented Apr 16, 2024

@rudolfix, the only way I see how to fix it, is to write our own filesystem, inheriting the fsspec.LocalFileSystem, and override the glob method to fix this translation. As there are no conditions in the translation, there is no way to skip it for globs (it's skipped only for precise paths). What do you think?

@rudolfix
Collaborator

@IlyaFaer

@rudolfix, the only way I see how to fix it, is to write our own filesystem, inheriting the fsspec.LocalFileSystem and override the glob method to fix this translation. As there are no conditions in the translation, there is no way to skip it for globs (it's skipped only for precise paths). What do you think?

heh this is so bad :) could you check first if regular Python glob works correctly? then my take would be to use it for file:// and just translate the results to something fsspec understands. but let's check it out.
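
A minimal sketch of that idea, assuming the standard-library glob does the matching and each result is translated into a file:// URI afterwards. The helper name glob_local_files and the temporary sample files are hypothetical, made up for this illustration; this is not the code that was merged.

```python
import glob
import os
import pathlib
import tempfile
from typing import List


def glob_local_files(root: str, pattern: str = "**/*") -> List[str]:
    """Glob under root with the standard library and return file:// URIs."""
    matches = glob.glob(os.path.join(root, pattern), recursive=True)
    # Keep regular files only, then turn each native path into a file URI.
    return sorted(
        pathlib.Path(m).resolve().as_uri() for m in matches if os.path.isfile(m)
    )


# Build a tiny sample tree so the example is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "csv"))
    for name in ("sample.txt", os.path.join("csv", "a.csv")):
        with open(os.path.join(tmp, name), "w") as f:
            f.write("x")
    uris = glob_local_files(tmp)

for uri in uris:
    print(uri)
```

Keeping the matching in the standard library means the native path never passes through a second glob-to-regex translation; only the final results are converted.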

@IlyaFaer
Contributor Author

@rudolfix, yes, Python works fine, I tried it even before I started looking at fsspec

@IlyaFaer IlyaFaer marked this pull request as ready for review April 17, 2024 10:44
Collaborator

@rudolfix rudolfix left a comment

there are things that still do not work

).geturl()

if is_unc_path:
    file_name = file
Collaborator

name must contain only name (final path component)

Contributor Author

@IlyaFaer IlyaFaer Apr 18, 2024

Well, not exactly. There is a relpath used right now:

file_name = posixpath.relpath(file, bucket_url_no_schema)

So, it includes folder names too (if the bucket is samples, then it gives names like csv/freshman_kgs.csv):

"csv/freshman_kgs.csv",
"csv/freshman_lbs.csv",
"csv/mlb_players.csv",
"csv/mlb_teams_2012.csv",
"jsonl/mlb_players.jsonl",
"met_csv/A801/A881_20230920.csv",
"met_csv/A803/A803_20230919.csv",
"met_csv/A803/A803_20230920.csv",
"parquet/mlb_players.parquet",
"gzip/taxi.csv.gz",
"sample.txt",

Is it supposed to work like this? I didn't ask, because the tests expect relative paths, not just names, but now that you've said it, I'm starting to doubt it.

The problem is that relpath doesn't work for UNC. Taking just the file name is not a problem, but then the result won't match what ordinary paths return. Relative paths for UNC seem problematic.

So, with UNC it'll be freshman_kgs.csv, but with an ordinary path it's csv/freshman_kgs.csv. I suppose it would be better if they returned the same thing.

Collaborator

good point! although our docs mention that file_name can be a relative path, it is very misleading. so let's fix it:

  1. rename file_name to relative_path and add file_name with just the name part
  2. update all the tests
  3. update documentation: https://dlthub.com/docs/dlt-ecosystem/verified-sources/filesystem#fileitem-representation

luckily we do not use file_name in verified sources
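
To make the distinction between the two fields concrete, here is a small sketch using posixpath. The bucket and file paths are hypothetical; only the two field names, relative_path and file_name, come from the list above.

```python
import posixpath

# Hypothetical bucket root and one file found under it.
bucket_path = "/data/samples"
file_path = "/data/samples/csv/freshman_kgs.csv"

# relative_path keeps the folders below the bucket root.
relative_path = posixpath.relpath(file_path, bucket_path)
# file_name is only the final path component.
file_name = posixpath.basename(file_path)

print(relative_path)
print(file_name)
```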

if is_unc_path:
    file_url = "file:///" + file
    file_name = file.replace(bucket_url_no_schema, "").replace("\\", "/").lstrip("/")
Contributor Author

It's the most straightforward and stupid solution, but I think it'll work. For UNC we're getting file from glob, which is supposed to always return the same result, so str edits like this should work. This way it produces the same result as relpath for normal paths.
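
A standalone sketch of the string edits described here, wrapped in a hypothetical helper (unc_relative_path) with made-up sample paths, to show what they yield for a UNC path.

```python
def unc_relative_path(file: str, bucket: str) -> str:
    """Strip the bucket prefix, normalize slashes, drop the leading slash."""
    return file.replace(bucket, "").replace("\\", "/").lstrip("/")


# Hypothetical UNC bucket and a file inside it, as glob might return them.
bucket = r"\\localhost\c$\samples"
file = r"\\localhost\c$\samples\csv\freshman_kgs.csv"

print(unc_relative_path(file, bucket))
```

One caveat of this approach: str.replace removes every occurrence of the bucket string and is case-sensitive, so it relies on glob returning paths spelled exactly like the bucket prefix.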

Collaborator

what is file here? file_url must contain UNC path so we are able to use it to open later. make sure it is like that

assert len(all_file_items) == len(expected_files)

for file in all_file_items:
    assert file["file_name"] in expected_files
Collaborator

@IlyaFaer what is the problem with using assert_sample_files(all_file_items, filesystem, config, load_content) here? this glob must work like any other globs. returned file details must be identical

Collaborator

this will also try to load the content of the file so it will verify if fsspec open works. I think it is crucial here

Contributor Author

@IlyaFaer IlyaFaer Apr 22, 2024

@rudolfix, opening doesn't work. Hm-m, I didn't notice it, but our code uses fsspec.open(), which (just like any other fsspec method) breaks UNC paths. O-o-okay, I added like a separate branch for UNC. Locally, the tests all passed, pushing (there are still several small things I didn't do, but I wanna see if I didn't break anything).

@@ -276,19 +279,30 @@ def glob_files(
"""
import os

is_unc_path = "$" in bucket_url
Collaborator

why do you assume that all paths with "$" are UNC? The precondition is that protocol must be "file://"
and there are better ways to recognize this: https://stackoverflow.com/questions/76209122/how-to-detect-a-windows-network-path
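
One possible way to recognize a UNC path without looking for "$" is sketched below. It assumes pathlib.PureWindowsPath, which applies Windows parsing rules on any OS; os.path.splitdrive, by contrast, follows the rules of the platform it runs on, and on POSIX it never splits off a drive. The helper name looks_like_unc is hypothetical.

```python
from pathlib import PureWindowsPath


def looks_like_unc(path: str) -> bool:
    """Return True if the path has a \\\\server\\share style drive."""
    drive = PureWindowsPath(path).drive
    # UNC drives look like \\server\share; drive letters look like C:
    return drive.startswith("\\\\")


for candidate in (r"\\localhost\c$\a", "//localhost/c$/a", r"C:\a\b", "/home/user/c$"):
    print(candidate, looks_like_unc(candidate))
```

This only covers native paths; a file:// URI would have to be converted to a native path first.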

Contributor Author

I don't think this splitdrive works so well:

(screenshot of the splitdrive output)

In both the traditional and non-traditional cases, the first element is '', which means the drive wasn't split. Meanwhile, glob will handle all of them.

@IlyaFaer
Contributor Author

@rudolfix, uf-f, that's quite a big re-writing. But I think I understood it

@rudolfix
Collaborator

@IlyaFaer it was actually impossible to correctly implement all file path styles on all platforms using fsspec. what this PR does is use Python glob for all file:// uris. it also converts all file:// uris to native paths before opening. somehow fsspec is able to open UNC paths, but not when they are in file:// form.

as usual tests are way more complicated than code...
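
A rough sketch of the "convert file:// URIs to native paths before opening" step, loosely modeled on the make_local_path fragment quoted in the review of this PR. It is not the merged implementation: the function name file_uri_to_path is hypothetical, and Windows drive-letter URIs such as file:///C:/a/b would need extra handling that is left out here.

```python
from urllib.parse import unquote, urlparse


def file_uri_to_path(file_uri: str) -> str:
    """Turn a file:// URI into a path string, keeping the UNC server if present."""
    uri = urlparse(file_uri)
    if uri.scheme != "file":
        raise ValueError(f"not a file URI: {file_uri}")
    local_path = unquote(uri.path)
    if uri.netloc:
        # file://server/share/... carries the server in netloc,
        # so rebuild the UNC form //server/share/...
        local_path = "//" + unquote(uri.netloc) + local_path
    return local_path


print(file_uri_to_path("file://localhost/c$/a"))
print(file_uri_to_path("file:///tmp/my%20file.txt"))
```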

@rudolfix rudolfix merged commit 53dfed2 into devel Apr 29, 2024
50 checks passed
@rudolfix rudolfix deleted the windows_path_fix branch April 29, 2024 15:34
local_path = unquote(uri.path)
if uri.netloc:
    # or UNC file://localhost/path
    local_path = "//" + unquote(uri.netloc) + local_path
Contributor

is it intended to be //, or is there one more /?

raise ConfigurationValueError(
"File path or netloc missing. Field bucket_url of FilesystemClientConfiguration"
" must contain valid url with a path or host:password component."
"File path and netloc are missing. Field bucket_url of"
Contributor

I think "File path and netloc are missing." will confuse people. To avoid this, should we mention that we are using urlparse or URI?

return str(pathlib.Path(local_path))

@staticmethod
def make_file_uri(local_path: str) -> str:
Contributor

I think this could have been in some dlt.common.destination.utils|filesystem helpers

Collaborator

you are right that this needs a better place. we have file utils in FileStorage but they probably need to go somewhere else, then we'll move those as well

return not uri_parsed.scheme or os.path.isabs(uri)

@staticmethod
def make_local_path(file_uri: str) -> str:
Contributor

I think this could have been in some dlt.common.destination.utils|filesystem helpers
