-
Notifications
You must be signed in to change notification settings - Fork 174
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add git to filesystem source 301 #954
base: devel
Are you sure you want to change the base?
Conversation
✅ Deploy Preview for dlt-hub-docs canceled.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is quality work! thanks for all the code and tests. also the fs registration is a nice addition.
in the meantime we merged gdrive implementation. you could merge current devel into this branch. conflicts should not be significant
dlt/common/storages/configuration.py
Outdated
"File path or netloc missing. Field bucket_url of FilesystemClientConfiguration" | ||
" must contain valid url with a path or host:password component." | ||
) | ||
if not url.scheme in ("gitpythonfs", "github", "git", "s3"): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- pls use
not in
operator. linter will complain somewhere - why s3 is exempt? AFAIK it must contain bucket name which is a path component
also could you remind me when gitpythonfs
does not need netloc and path
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- pls use
not in
operator. linter will complain somewhere
Accepted
- why s3 is exempt? AFAIK it must contain bucket name which is a path component
You are right. I will take s3
out of the the exception list.
My thinking was (would need a separate github issue):
s3fs already gives nice error messages for invalid bucket name, unauthorised for bucket etc. My guess is that adlfs an , gcsfs do too. I doubt the rule saves users from confusion/harm. I was heading towards proving that, and then removing this rule entirely or only testing for specific known problem protocols (not exclusion list). I was trying to reduce fsspec implementation-specific conditional logic in the dlt codebase.
also could you remind me when
gitpythonfs
does not need netloc and path
gitpythonfs://
returns the root of the repo. Expected use case: bucket_url = "gitpythonfs://", file_glob = "**.md"
.
We decided it's neater and more common to put repo path in the config (kwargs), rather than stuff it into the url netloc. I looked at dropboxdrivefs
and I think it will be same pattern as this - dropbox knows which account based on auth token, then the rest is globbable paths.
At some point we could extend public filesystem dlt source/resource API with methods that just have url
parameter (which can include glob - s3://bucket/csv/**.csv
, gitpythonfs://csv/**.csv
) instead of separate bucket_url
and file_glob
parameters. It would be more fsspec-onic , in line with open_files("s3://bucket/csv/**.csv")
. There could be an optional path_mask
to shorten the path string that ends up in the pipeline metadata.
try: | ||
return ensure_pendulum_datetime(file_metadata[field_name]) | ||
except KeyError: | ||
if protocol not in MTIME_FIELD_NAMES: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah I was thinking a lot about it recently. My take is that for protocols not supporting modification time we would just return pendulum.now(). what do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's good enough. I set to pendulum.now()
...for now 🤣 ...
Always Hard to tell with these unknowns 😓. now()
duplicates loaded_at
etc, which pipelines already have. And it could be misleading.
It would be better to use something that indicates that no info was available:
None
. will it break the pipeline downstream?- Jan 1st 1970. ie, epoch zero. It's ugly but people know what it means.
350113e
to
89b3ca5
Compare
- Intended to be used in filesystem verified Source. -read-only - due to depedency of gitpython on git binaries, gitpythonfs should only be conditionally imported, for example by registering the module using `register_implementation_in_fsspec()`
- See https://www.rfc-editor.org/rfc/rfc1808.html#section-2.1 . So, for example, gitpythonfs:// is valid and works in fsspec. - However in bucket-based systems, the bucket name is where the netloc would be - eg s3://bucket_name. But luckily s3fs (and probably az, gcs etc) already gives helpful error messages if no bucket name given, wrong bucket name given etc. - As a ToDo, this rule could test only those protocols whose fsspec implementation needs netloc, rather than excluding a few protocols as is done here. But we don't know which protocols this rule was initially added for.
-reduces need to modify code every time a new fsspec implementation is added - `mtime` is idiomatic in *nix file systems.
89b3ca5
to
ea0557f
Compare
- clearer naming in FileItemDict
@@ -157,29 +199,31 @@ class FileItemDict(DictStrAny): | |||
def __init__( | |||
self, | |||
mapping: FileItem, | |||
credentials: Optional[Union[FileSystemCredentials, AbstractFileSystem]] = None, | |||
fs_details: Optional[Union[AbstractFileSystem, FilesystemConfiguration, FileSystemCredentials]] = None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have simply added FilesystemConfiguration
as an additional type, and tidied up the naming. Is that far enough for this PR?
Possible fast-follower to this PR:
- Limit the types that can be passed to
FileItemDict
constructor. I get the feeling this is an important internal API. The following calls are currently in use:
AbstractFileSystem
, in tests/common/storages/test_fsspec_filesystem.pyFileSystemCredentials
, in tests/common/storages/utils.pyNone
, in verified-sources sources/inbox/init.py
Could there be other usage in dlt example projects?
2ee3eab
to
e48f641
Compare
Description
modification_date
Related Issues
filesystem
source verified-sources#301Additional Context
gitpythonfs
uses code snippets and ideas from other fsspec implementations. . I have given thanks in the code but I don't ensure this meets any fsspec or dlt licensing requirements.CICD:
make test-common
is passingmake lint
in local dev