Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(filesystem): UNC paths are not supported #1209

Merged
merged 27 commits into from
Apr 29, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
27 commits
Select commit Hold shift + click to select a range
8e111c1
fix(filesystem): UNC paths are not supported
IlyaFaer Apr 10, 2024
45d1d7b
small fix
IlyaFaer Apr 11, 2024
e39b8bc
use glob for UNC paths
IlyaFaer Apr 17, 2024
de82bb5
Merge branch 'devel' into windows_path_fix
IlyaFaer Apr 17, 2024
7c465bc
add the third slash
IlyaFaer Apr 17, 2024
6fee728
fix slahes
IlyaFaer Apr 17, 2024
f43a13a
fix if
IlyaFaer Apr 17, 2024
e813a3c
Merge branch 'devel' into windows_path_fix
rudolfix Apr 17, 2024
95edfaf
less invasive switch to Python glob
rudolfix Apr 17, 2024
50c6f2b
moves unc test to local so it runs on windows
rudolfix Apr 17, 2024
c045539
Merge branch 'devel' of https://github.com/dlt-hub/dlt into windows_p…
IlyaFaer Apr 18, 2024
57a6b47
Merge branch 'windows_path_fix' of https://github.com/dlt-hub/dlt int…
IlyaFaer Apr 18, 2024
2f1346c
review fixes
IlyaFaer Apr 18, 2024
0b8e04e
test UNC path reading
IlyaFaer Apr 22, 2024
7c1bdfb
add docs
IlyaFaer Apr 23, 2024
e7a893f
lint ignores
IlyaFaer Apr 23, 2024
fbcffbb
lint fix
IlyaFaer Apr 23, 2024
f111714
lint fix
IlyaFaer Apr 23, 2024
7c0a918
excess ignore
IlyaFaer Apr 23, 2024
5232ce6
lint fix
IlyaFaer Apr 23, 2024
98b985f
Merge branch 'devel' into windows_path_fix
rudolfix Apr 27, 2024
e0ef3a9
adds utils to convert from file to local path to filesystem config
rudolfix Apr 27, 2024
d983976
offloads globbing to Python glob for locat paths, allows UNC and nati…
rudolfix Apr 27, 2024
3810f80
allows the test file bucket to be relative path, fixes tests, adds se…
rudolfix Apr 27, 2024
44cca36
bumps dev adlfs version up to have working glob
rudolfix Apr 27, 2024
23aee6d
updates filesystem docs
rudolfix Apr 27, 2024
556a256
fixes UNC to file uri
rudolfix Apr 27, 2024
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
83 changes: 64 additions & 19 deletions dlt/common/storages/configuration.py
Original file line number Diff line number Diff line change
@@ -1,6 +1,7 @@
import os
from typing import TYPE_CHECKING, Any, Literal, Optional, Type, get_args, ClassVar, Dict, Union
from urllib.parse import urlparse
import pathlib
from typing import Any, Literal, Optional, Type, get_args, ClassVar, Dict, Union
from urllib.parse import urlparse, unquote

from dlt.common.configuration import configspec, resolve_type
from dlt.common.configuration.exceptions import ConfigurationValueError
Expand Down Expand Up @@ -87,24 +88,22 @@ class FilesystemConfiguration(BaseConfiguration):
@property
def protocol(self) -> str:
"""`bucket_url` protocol"""
url = urlparse(self.bucket_url)
# this prevents windows absolute paths to be recognized as schemas
if not url.scheme or (os.path.isabs(self.bucket_url) and "\\" in self.bucket_url):
if self.is_local_path(self.bucket_url):
return "file"
else:
return url.scheme
return urlparse(self.bucket_url).scheme

def on_resolved(self) -> None:
url = urlparse(self.bucket_url)
if not url.path and not url.netloc:
uri = urlparse(self.bucket_url)
if not uri.path and not uri.netloc:
raise ConfigurationValueError(
"File path or netloc missing. Field bucket_url of FilesystemClientConfiguration"
" must contain valid url with a path or host:password component."
"File path and netloc are missing. Field bucket_url of"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think File path and netloc are missing. will confuse people, to avoid this should we mention that we are using urlparse or URI?

" FilesystemClientConfiguration must contain valid uri with a path or host:password"
" component."
)
# this is just a path in a local file system
if url.path == self.bucket_url:
url = url._replace(scheme="file")
self.bucket_url = url.geturl()
if self.is_local_path(self.bucket_url):
self.bucket_url = self.make_file_uri(self.bucket_url)

@resolve_type("credentials")
def resolve_credentials_type(self) -> Type[CredentialsConfiguration]:
Expand All @@ -117,11 +116,57 @@ def fingerprint(self) -> str:

def __str__(self) -> str:
"""Return displayable destination location"""
url = urlparse(self.bucket_url)
uri = urlparse(self.bucket_url)
# do not show passwords
if url.password:
new_netloc = f"{url.username}:****@{url.hostname}"
if url.port:
new_netloc += f":{url.port}"
return url._replace(netloc=new_netloc).geturl()
if uri.password:
new_netloc = f"{uri.username}:****@{uri.hostname}"
if uri.port:
new_netloc += f":{uri.port}"
return uri._replace(netloc=new_netloc).geturl()
return self.bucket_url

@staticmethod
def is_local_path(uri: str) -> bool:
"""Checks if `uri` is a local path, without a schema"""
uri_parsed = urlparse(uri)
# this prevents windows absolute paths to be recognized as schemas
return not uri_parsed.scheme or os.path.isabs(uri)

@staticmethod
def make_local_path(file_uri: str) -> str:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this could have been in some dlt.common.destination.utils|filesystem helpers

"""Gets a valid local filesystem path from file:// scheme.
Supports POSIX/Windows/UNC paths

Returns:
str: local filesystem path
"""
uri = urlparse(file_uri)
if uri.scheme != "file":
raise ValueError(f"Must be file scheme but is {uri.scheme}")
if not uri.path and not uri.netloc:
raise ConfigurationValueError("File path and netloc are missing.")
local_path = unquote(uri.path)
if uri.netloc:
# or UNC file://localhost/path
local_path = "//" + unquote(uri.netloc) + local_path
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it intended to be // or there is one more /?

else:
# if we are on windows, strip the POSIX root from path which is always absolute
if os.path.sep != local_path[0]:
# filesystem root
if local_path == "/":
return str(pathlib.Path("/").resolve())
# this prevents /C:/ or ///share/ where both POSIX and Windows root are present
if os.path.isabs(local_path[1:]):
local_path = local_path[1:]
return str(pathlib.Path(local_path))

@staticmethod
def make_file_uri(local_path: str) -> str:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this could have been in some dlt.common.destination.utils|filesystem helpers

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you are right that this needs a better place. we have file utils in FileStorage but they probably need to go somewhere else, then we'll move those as well

"""Creates a normalized file:// uri from a local path

netloc is never set. UNC paths are represented as file://host/path
"""
p_ = pathlib.Path(local_path)
p_ = p_.expanduser().resolve()
# return "file:///" + p_.as_posix().lstrip("/")
return p_.as_uri()
96 changes: 59 additions & 37 deletions dlt/common/storages/fsspec_filesystem.py
Original file line number Diff line number Diff line change
@@ -1,9 +1,11 @@
import io
import glob
import gzip
import mimetypes
import pathlib
import posixpath
from io import BytesIO
from gzip import GzipFile
from typing import (
Literal,
cast,
Expand Down Expand Up @@ -41,6 +43,7 @@ class FileItem(TypedDict, total=False):

file_url: str
file_name: str
relative_path: str
mime_type: str
encoding: Optional[str]
modification_date: pendulum.DateTime
Expand Down Expand Up @@ -181,12 +184,22 @@ def fsspec(self) -> AbstractFileSystem:
else:
return fsspec_filesystem(self["file_url"], self.credentials)[0]

@property
def local_file_path(self) -> str:
"""Gets a valid local filesystem path from file:// scheme.
Supports POSIX/Windows/UNC paths

Returns:
str: local filesystem path
"""
return FilesystemConfiguration.make_local_path(self["file_url"])

def open( # noqa: A003
self,
mode: str = "rb",
compression: Literal["auto", "disable", "enable"] = "auto",
**kwargs: Any,
) -> IO[Any]:
) -> Union[GzipFile, IO[Any]]:
"""Open the file as a fsspec file.

This method opens the file represented by this dictionary as a file-like object using
Expand All @@ -213,7 +226,7 @@ def open( # noqa: A003
raise ValueError("""The argument `compression` must have one of the following values:
"auto", "enable", "disable".""")

opened_file: IO[Any]
opened_file: Union[IO[Any], GzipFile]
# if the user has already extracted the content, we use it so there is no need to
# download the file again.
if "file_content" in self:
Expand All @@ -234,8 +247,13 @@ def open( # noqa: A003
**text_kwargs,
)
else:
if "local" in self.fsspec.protocol:
# use native local file path to open file:// uris
file_url = self.local_file_path
else:
file_url = self["file_url"]
opened_file = self.fsspec.open(
self["file_url"], mode=mode, compression=compression_arg, **kwargs
file_url, mode=mode, compression=compression_arg, **kwargs
)
return opened_file

Expand All @@ -245,11 +263,11 @@ def read_bytes(self) -> bytes:
Returns:
bytes: The file content.
"""
return ( # type: ignore
self["file_content"]
if "file_content" in self and self["file_content"] is not None
else self.fsspec.read_bytes(self["file_url"])
)
if "file_content" in self and self["file_content"] is not None:
return self["file_content"] # type: ignore
else:
with self.open(mode="rb", compression="disable") as f:
return f.read()


def guess_mime_type(file_name: str) -> Sequence[str]:
Expand All @@ -274,45 +292,49 @@ def glob_files(
Returns:
Iterable[FileItem]: The list of files.
"""
import os

bucket_url_parsed = urlparse(bucket_url)
# if this is a file path without a scheme
if not bucket_url_parsed.scheme or (os.path.isabs(bucket_url) and "\\" in bucket_url):
# this is a file so create a proper file url
bucket_url = pathlib.Path(bucket_url).absolute().as_uri()
is_local_fs = "local" in fs_client.protocol
if is_local_fs and FilesystemConfiguration.is_local_path(bucket_url):
bucket_url = FilesystemConfiguration.make_file_uri(bucket_url)
bucket_url_parsed = urlparse(bucket_url)
else:
bucket_url_parsed = urlparse(bucket_url)
bucket_url_no_schema = bucket_url_parsed._replace(scheme="", query="").geturl()
bucket_url_no_schema = (
bucket_url_no_schema[2:] if bucket_url_no_schema.startswith("//") else bucket_url_no_schema
)
filter_url = posixpath.join(bucket_url_no_schema, file_glob)

glob_result = fs_client.glob(filter_url, detail=True)
if isinstance(glob_result, list):
raise NotImplementedError(
"Cannot request details when using fsspec.glob. For adlfs (Azure) please use version"
" 2023.9.0 or later"
)
if is_local_fs:
root_dir = FilesystemConfiguration.make_local_path(bucket_url)
# use a Python glob to get files
files = glob.glob(str(pathlib.Path(root_dir).joinpath(file_glob)), recursive=True)
glob_result = {file: fs_client.info(file) for file in files}
else:
root_dir = bucket_url_parsed._replace(scheme="", query="").geturl().lstrip("/")
filter_url = posixpath.join(root_dir, file_glob)
glob_result = fs_client.glob(filter_url, detail=True)
if isinstance(glob_result, list):
raise NotImplementedError(
"Cannot request details when using fsspec.glob. For adlfs (Azure) please use"
" version 2023.9.0 or later"
)

for file, md in glob_result.items():
if md["type"] != "file":
continue
# relative paths are always POSIX
if is_local_fs:
rel_path = pathlib.Path(file).relative_to(root_dir).as_posix()
file_url = FilesystemConfiguration.make_file_uri(file)
else:
rel_path = posixpath.relpath(file, root_dir)
file_url = bucket_url_parsed._replace(
path=posixpath.join(bucket_url_parsed.path, rel_path)
).geturl()

# make that absolute path on a file://
if bucket_url_parsed.scheme == "file" and not file.startswith("/"):
file = f"/{file}"
file_name = posixpath.relpath(file, bucket_url_no_schema)
file_url = bucket_url_parsed._replace(
path=posixpath.join(bucket_url_parsed.path, file_name)
).geturl()

mime_type, encoding = guess_mime_type(file_name)
scheme = bucket_url_parsed.scheme
mime_type, encoding = guess_mime_type(rel_path)
yield FileItem(
file_name=file_name,
file_name=posixpath.basename(rel_path),
relative_path=rel_path,
file_url=file_url,
mime_type=mime_type,
encoding=encoding,
modification_date=MTIME_DISPATCH[bucket_url_parsed.scheme](md),
modification_date=MTIME_DISPATCH[scheme](md),
size_in_bytes=int(md["size"]),
)
2 changes: 2 additions & 0 deletions dlt/sources/helpers/rest_client/paginators.py
Original file line number Diff line number Diff line change
Expand Up @@ -171,6 +171,7 @@ class BaseReferencePaginator(BasePaginator):
2. `update_request` method to update the request object with the next
page reference.
"""

def __init__(self) -> None:
super().__init__()
self.__next_reference: Optional[str] = None
Expand Down Expand Up @@ -208,6 +209,7 @@ class BaseNextUrlPaginator(BaseReferencePaginator):

See `HeaderLinkPaginator` and `JSONResponsePaginator` for examples.
"""

def update_request(self, request: Request) -> None:
# Handle relative URLs
if self._next_reference:
Expand Down
4 changes: 2 additions & 2 deletions docs/tools/fix_grammar_gpt.py
Original file line number Diff line number Diff line change
Expand Up @@ -18,7 +18,7 @@
# constants
BASE_DIR = "../website/docs"
GPT_MODEL = "gpt-3.5-turbo-0125"
MAX_CHUNK_SIZE = 14000 # make sure that this is below the context window size of the model to not have cut off files
MAX_CHUNK_SIZE = 14000 # make sure that this is below the context window size of the model to not have cut off files

SYSTEM_PROMPT = """\
You are a grammar checker. Every message you get will be a document that is to be grammarchecked and returned as such.
Expand Down Expand Up @@ -104,7 +104,7 @@ def get_chunk_length(chunk: List[str]) -> int:
chunks.append(current_chunk)

# sanity test, make sure we still have the full doc
assert doc == functools.reduce(lambda a, b: a+b, chunks)
assert doc == functools.reduce(lambda a, b: a + b, chunks)

fmt.note(f"Created {len(chunks)} chunks")

Expand Down
41 changes: 37 additions & 4 deletions docs/website/docs/dlt-ecosystem/destinations/filesystem.md
Original file line number Diff line number Diff line change
Expand Up @@ -174,7 +174,43 @@ If for any reason you want to have those files in a local folder, set up the `bu
```toml
[destination.filesystem]
bucket_url = "file:///absolute/path" # three / for an absolute path
# bucket_url = "file://relative/path" # two / for a relative path
```

`dlt` correctly handles the native local file paths. Indeed, using the `file://` schema may be not intuitive especially for Windows users.

```toml
[destination.filesystem]
bucket_url = 'C:\a\b\c'
```

In the example above we specify `bucket_url` using **toml's literal strings** that do not require [escaping of backslashes](https://github.com/toml-lang/toml/blob/main/toml.md#string).

```toml
[destination.unc_destination]
bucket_url = '\\localhost\c$\a\b\c' # UNC equivalent of C:\a\b\c

[destination.posix_destination]
bucket_url = '/var/local/data' # absolute POSIX style path

[destination.relative_destination]
bucket_url = '_storage/data' # relative POSIX style path
```

In the examples above we define a few named filesystem destinations:
* **unc_destination** demonstrates Windows UNC path in native form
* **posix_destination** demonstrates native POSIX (Linux/Mac) absolute path
* **relative_destination** demonstrates native POSIX (Linux/Mac) relative path. In this case `filesystem` destination will store files in `$cwd/_storage/data` path
where **$cwd** is your current working directory.

`dlt` supports Windows [UNC paths with file:// scheme](https://en.wikipedia.org/wiki/File_URI_scheme). They can be specified using **host** or purely as **path**
component.

```toml
[destination.unc_with_host]
bucket_url="file://localhost/c$/a/b/c"

[destination.unc_with_path]
bucket_url="file:////localhost/c$/a/b/c"
```

## Write disposition
Expand Down Expand Up @@ -408,7 +444,4 @@ managed in the regular way by the final destination you have configured.
You will also notice `init` files being present in the root folder and the special `dlt` folders. In the absence of the concepts of schemas and tables
in blob storages and directories, `dlt` uses these special files to harmonize the behavior of the `filesystem` destination with the other implemented destinations.




<!--@@@DLT_TUBA filesystem-->
Loading
Loading