Add view support to the Rest Catalog #818

Open
2 of 7 tasks
ndrluis opened this issue Jun 14, 2024 · 9 comments
Labels
good first issue Good for newcomers

Comments

@ndrluis
Collaborator

ndrluis commented Jun 14, 2024

Feature Request / Improvement

Reference: https://github.com/apache/iceberg/blob/main/open-api/rest-catalog-open-api.yaml

@sungwy
Collaborator

sungwy commented Jun 15, 2024

Thank you for raising this @ndrluis 💯 I will add this to the 0.8.0 milestone for now.

sungwy added this to the PyIceberg 0.8.0 release milestone Jun 15, 2024
kevinjqliu added the "good first issue" label Aug 7, 2024
@shiv-io

shiv-io commented Oct 20, 2024

Would love to take a first stab at this @kevinjqliu, could you assign this to me? edit: here's a PR for view_exists: #1242. Thanks!
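For readers following along, here is a minimal sketch of what a view existence check against a REST catalog might involve. It is not the code from the PR. It assumes the catalog exposes a `HEAD` request on a `v1/{prefix}/namespaces/{namespace}/views/{view}` path, and that multi-level namespaces are joined with the unit separator (0x1F) before percent-encoding. The helper names `view_path` and `view_exists_from_status` are hypothetical.

```python
from urllib.parse import quote

# Assumption: multi-level namespaces are joined with the unit separator
# (0x1F) and then percent-encoded when placed in the URL path.
NAMESPACE_SEPARATOR = "\x1f"


def view_path(prefix: str, namespace: tuple[str, ...], view: str) -> str:
    """Build the relative REST path for a single view (hypothetical helper)."""
    encoded_namespace = quote(NAMESPACE_SEPARATOR.join(namespace), safe="")
    encoded_view = quote(view, safe="")
    base = f"v1/{prefix}" if prefix else "v1"
    return f"{base}/namespaces/{encoded_namespace}/views/{encoded_view}"


def view_exists_from_status(status_code: int) -> bool:
    """Translate the status of a HEAD request on the view path into a bool."""
    if status_code in (200, 204):
        return True  # the catalog knows the view
    if status_code == 404:
        return False  # no such view (or namespace)
    # Anything else (auth failure, server error) is not an answer to
    # "does the view exist?", so surface it instead of guessing.
    raise RuntimeError(f"Unexpected status from view check: {status_code}")
```

Keeping the status handling separate from the HTTP call makes it easy to test without a running catalog. An actual implementation would send the request with whatever HTTP session the catalog client already uses.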

@corleyma

corleyma commented Oct 21, 2024

I am really curious about how Load View should work, given that currently only SQL representations of views are supported and I don't think we have an in-process SQL engine that can convert SQL into an iceberg scan plan (yet/at all?).

@shiv-io did you already have some thoughts there?

@ndrluis
Collaborator Author

ndrluis commented Oct 22, 2024

Following what @danielcweeks said in this email, I believe we could discuss and experiment with SQLGlot to add support for other dialects. However, to support loading views, we likely need to rely on a query engine. I'm not sure if there is a query engine in the Python ecosystem that would make sense to support, but I feel that we could use Apache DataFusion through the iceberg-rust implementation or the Python bindings.

@sungwy
Collaborator

sungwy commented Oct 22, 2024

That's an interesting question, @corleyma. The way I see it, PyIceberg is a language library that tries to remain open to any Python-based query engine that wants to use its functions to process Iceberg tables. So I think the first step in introducing view support in PyIceberg would be for us to fetch the view representations from the REST Catalog endpoint and serve them to any query engines that want to integrate with it (like Daft).
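As a rough illustration of "serving the view representations", the sketch below picks the SQL text for one dialect out of a list of representations. It assumes each representation is a dict with `type`, `sql` and `dialect` keys, which is how the view spec describes SQL representations. The function name `pick_sql` and the sample data are made up for the example.

```python
from typing import Optional


def pick_sql(representations: list[dict], dialect: str) -> Optional[str]:
    """Return the SQL text for the requested dialect, or None if absent.

    Hypothetical helper: a query engine would call something like this to
    find a representation it knows how to execute.
    """
    wanted = dialect.lower()
    for rep in representations:
        if rep.get("type") == "sql" and rep.get("dialect", "").lower() == wanted:
            return rep["sql"]
    return None


# Invented sample: one view stored with two dialect-specific definitions.
representations = [
    {"type": "sql", "sql": "SELECT id, name FROM db.users", "dialect": "spark"},
    {"type": "sql", "sql": "SELECT id, name FROM db.users", "dialect": "trino"},
]

print(pick_sql(representations, "trino"))
print(pick_sql(representations, "duckdb"))
```

Returning `None` for an unknown dialect leaves the decision (fail, or try translating another dialect) to the engine rather than to the library.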

I agree with @ndrluis though, that it would be cool to leverage projects like DataFusion to improve the way we load, slice and dice the tables in PyIceberg.

@corleyma

corleyma commented Oct 22, 2024

I agree with @sungwy that the primary goal of PyIceberg should be to make it possible for query engines to interface with Iceberg tables and views.

Nonetheless, it would be really useful to have some out-of-the-box way to get a scan of a view (something PyArrow Dataset-like would be best, but returning a Table/RecordBatchReader like the current table scan functionality is a fine endpoint). This provides an easy path for integrating with other things (like Polars) that currently support PyIceberg tables, and it will benefit use of PyIceberg for more operational concerns, e.g. being able to easily preview view contents.

I think DataFusion (either via Python bindings or via iceberg-rust) would be a great way to accomplish this goal. Since (I think?) PyIceberg is much further along in implementing the Iceberg SDK than iceberg-rust, it would be interesting if it were possible for PyIceberg to use DataFusion directly, but I suspect you need some custom Rust code no matter what?

@shiv-io

shiv-io commented Oct 22, 2024

I'm fairly new to the Iceberg ecosystem -- thanks for the insightful discussion, looks like I have some reading to do before I can weigh in.

load_view aside though, I'd love to work on the other view features if contributions towards this issue are being accepted.

@corleyma

corleyma commented Oct 22, 2024

@shiv-io It should still be possible to do load_view without supporting any scanning functionality yet, and like @sungwy says, that is likely a necessary precursor for other query engines anyway.

Look at how load_table works today: we return a Table model with all the metadata about the table, and this model exposes functionality for data scans, etc. So load_view would start by returning a model with all the metadata about the view (as specified in the spec), and then we can look at adding some DataFusion-based scan functionality in subsequent iterations.
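To make the "metadata-only model" idea concrete, here is a small sketch of what such a model could look like, with no scan functionality at all. The JSON field names (`view-uuid`, `format-version`, `location`, `current-version-id`, `versions`, `version-id`, `schema-id`, `representations`) follow my reading of the Iceberg view spec. The class names, `parse_view_metadata`, and the sample payload are hypothetical and are not PyIceberg's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SQLRepresentation:
    sql: str
    dialect: str


@dataclass(frozen=True)
class ViewVersion:
    version_id: int
    schema_id: int
    representations: tuple[SQLRepresentation, ...]


@dataclass(frozen=True)
class ViewMetadata:
    view_uuid: str
    format_version: int
    location: str
    current_version_id: int
    versions: tuple[ViewVersion, ...]

    def current_version(self) -> ViewVersion:
        """Look up the version that current-version-id points at."""
        for version in self.versions:
            if version.version_id == self.current_version_id:
                return version
        raise ValueError(f"no version with id {self.current_version_id}")


def parse_view_metadata(raw: dict) -> ViewMetadata:
    """Build the model from a decoded metadata JSON object (subset of fields)."""
    versions = tuple(
        ViewVersion(
            version_id=v["version-id"],
            schema_id=v["schema-id"],
            representations=tuple(
                SQLRepresentation(sql=r["sql"], dialect=r["dialect"])
                for r in v["representations"]
                if r["type"] == "sql"  # only SQL representations are modeled here
            ),
        )
        for v in raw["versions"]
    )
    return ViewMetadata(
        view_uuid=raw["view-uuid"],
        format_version=raw["format-version"],
        location=raw["location"],
        current_version_id=raw["current-version-id"],
        versions=versions,
    )


# Invented payload, trimmed to the fields the sketch reads.
raw = {
    "view-uuid": "00000000-0000-0000-0000-000000000000",
    "format-version": 1,
    "location": "s3://bucket/warehouse/db/active_users",
    "current-version-id": 2,
    "versions": [
        {"version-id": 1, "schema-id": 0, "representations": [
            {"type": "sql", "sql": "SELECT * FROM db.users", "dialect": "spark"}]},
        {"version-id": 2, "schema-id": 0, "representations": [
            {"type": "sql", "sql": "SELECT * FROM db.users WHERE active", "dialect": "spark"}]},
    ],
}

view = parse_view_metadata(raw)
for rep in view.current_version().representations:
    print(f"[{rep.dialect}] {rep.sql}")
```

Even without an engine behind it, a model like this is enough to print a view's definition or hand it to something that can execute it.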

@kevinjqliu
Contributor

> look at how load_table works today: we return a Table model with all the metadata about the table, and this model exposes functionality for data scans, etc. So load_view would start with returning a model with all the metadata about the view (as specified in the spec), and then we can look at trying to add some DataFusion-based scan functionality in subsequent iterations.

+1, I think it's a good idea to separate accessing Iceberg views from using them. The ability to read an Iceberg view is great for general view operations. Even printing out the view definition would be a great feature to have.

Connecting the view with an external engine can be a separate story.


5 participants