feat: add experimental remote HDFS support for native DataFusion reader #1359
base: main
Conversation
@@ -77,6 +77,7 @@ datafusion-comet-proto = { workspace = true }
 object_store = { workspace = true }
 url = { workspace = true }
 chrono = { workspace = true }
+datafusion-objectstore-hdfs = { git = "https://github.com/comphead/datafusion-objectstore-hdfs", branch = "master", optional = true }
@andygrove I'm keeping the updated HDFS object storage in a personal repo for now; let me know if there are any concerns.
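For context, because the dependency is declared with optional = true it is only compiled in when the corresponding cargo feature is enabled, so any use of the crate in Rust code has to be feature-gated as well. A minimal sketch (the module path is assumed from the upstream datafusion-objectstore-hdfs crate and may differ in this fork):

// Only compiled when the `hdfs` cargo feature is enabled; without the feature
// the optional git dependency is not pulled into the build at all.
#[cfg(feature = "hdfs")]
use datafusion_objectstore_hdfs::object_store::hdfs::HadoopFileSystem;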
native/core/src/execution/planner.rs
Outdated
@@ -1220,7 +1217,7 @@ impl PhysicalPlanner {
 // TODO: I think we can remove partition_count in the future, but leave for testing.
 assert_eq!(file_groups.len(), partition_count);

-let object_store_url = ObjectStoreUrl::local_filesystem();
+let object_store_url = ObjectStoreUrl::parse("hdfs://namenode:9000").unwrap();
this will be addressed in #1360
The URL should be available as part of the file path passed in (see line 1178 above).
Thanks @parthchandra, it is already fixed.
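For illustration, one way to derive the object store URL from the fully qualified file path instead of hard-coding it; this is a hypothetical sketch, not the code in this PR (the follow-up is tracked in #1360):

use datafusion::datasource::object_store::ObjectStoreUrl;
use url::Url;

// Hypothetical helper: keep only "scheme://host:port" from a fully qualified
// path such as "hdfs://namenode:9000/user/x/part-0.parquet".
fn object_store_url_for(path: &str) -> ObjectStoreUrl {
    match Url::parse(path) {
        // Slice the URL up to (but not including) its path component.
        Ok(url) => ObjectStoreUrl::parse(&url[..url::Position::BeforePath])
            .unwrap_or_else(|_| ObjectStoreUrl::local_filesystem()),
        // Paths without a scheme fall back to the local filesystem.
        Err(_) => ObjectStoreUrl::local_filesystem(),
    }
}

The planner could then call object_store_url_for(&file_path) in place of the hard-coded ObjectStoreUrl::parse("hdfs://namenode:9000").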
    session_context: Arc<SessionContext>,
) -> Result<(), ExecutionError> {
    // TODO: read the namenode configuration from file schema or from spark.defaultFS
    let url = Url::try_from("hdfs://namenode:9000").unwrap();
this will be addressed in #1360
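A rough sketch of what the feature-gated registration could look like once the namenode address is read from configuration instead of being hard-coded. The default_fs parameter is hypothetical, HadoopFileSystem and its module path are assumed from the datafusion-objectstore-hdfs crate, and ExecutionError is the error type already used in planner.rs:

use std::sync::Arc;
use datafusion::prelude::SessionContext;
use url::Url;
#[cfg(feature = "hdfs")]
use datafusion_objectstore_hdfs::object_store::hdfs::HadoopFileSystem;

#[cfg(feature = "hdfs")]
pub(crate) fn register_object_store(
    session_context: Arc<SessionContext>,
    default_fs: &str, // hypothetical: e.g. "hdfs://namenode:9000", taken from fs.defaultFS / Spark config
) -> Result<(), ExecutionError> {
    // Error handling kept minimal to mirror the unwrap() in the PR.
    let url = Url::try_from(default_fs).unwrap();
    // Assumed API: HadoopFileSystem::new returns None when the store cannot be created.
    let object_store = HadoopFileSystem::new(url.as_str())
        .expect("failed to create HDFS object store");
    session_context
        .runtime_env()
        .register_object_store(&url, Arc::new(object_store));
    Ok(())
}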
@@ -1861,6 +1864,40 @@ fn trim_end(s: &str) -> &str {
 }
 }

+#[cfg(not(feature = "hdfs"))]
The hdfs cargo feature enables conditional compilation, so HDFS support is only built when it is needed.
pub(crate) fn register_object_store(
    session_context: Arc<SessionContext>,
) -> Result<(), ExecutionError> {
    let object_store = object_store::local::LocalFileSystem::new();
It doesn't have to be only a local file system.
It depends on the feature enabled for Comet: LocalFileSystem is the default when no specific feature is selected.
The annotation on this method is #[cfg(not(feature = "hdfs"))].
This allows plugging in other features such as S3.
This particular method handles the case where no remote feature is selected, i.e. the local filesystem.
If a feature is selected, conditional compilation registers the object store that matches the feature, such as HDFS or S3.
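To make that dispatch concrete, here is a condensed sketch of the pattern being described (bodies elided; the real functions perform the actual registration):

// Default build: no remote-storage feature selected, so the local filesystem
// variant is the one that gets compiled.
#[cfg(not(feature = "hdfs"))]
pub(crate) fn register_object_store(
    session_context: Arc<SessionContext>,
) -> Result<(), ExecutionError> {
    // ... register object_store::local::LocalFileSystem under "file://" ...
    Ok(())
}

// Built with `--features hdfs`: the same name resolves to the HDFS variant,
// so the call site in the planner stays identical either way.
#[cfg(feature = "hdfs")]
pub(crate) fn register_object_store(
    session_context: Arc<SessionContext>,
) -> Result<(), ExecutionError> {
    // ... register the HDFS-backed store, as in the sketch above ...
    Ok(())
}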
Codecov Report
All modified and coverable lines are covered by tests ✅
Additional details and impacted files

@@             Coverage Diff              @@
##               main    #1359       +/-   ##
=============================================
- Coverage     56.12%   39.16%   -16.96%
- Complexity      976     2065     +1089
=============================================
  Files           119      262      +143
  Lines         11743    60323    +48580
  Branches       2251    12836    +10585
=============================================
+ Hits           6591    23627    +17036
- Misses         4012    32223    +28211
- Partials       1140     4473     +3333

☔ View full report in Codecov by Sentry.
Which issue does this PR close?
Closes #1337.
Rationale for this change
What changes are included in this PR?
How are these changes tested?
Manually, by starting a remote HDFS cluster and running