Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Tidy] Remove component to data mapping from data manager #451

Merged
merged 22 commits into from
May 14, 2024

Conversation

antonymilne
Copy link
Contributor

@antonymilne antonymilne commented May 2, 2024

Description

Small refactor as part of https://github.com/McK-Internal/vizro-internal/issues/714 in preparation for more exciting changes such as https://github.com/McK-Internal/vizro-internal/issues/753. Commit history also contains some POCs for other parts of that ticket such as making parametrised calls to dynamic data. Getting this PR merged first will lay the foundation for that work.

Now that we have a "proper" scheme for caching, there's no need to store links between component ID and data source name in the data manager. Instead, the captured callable itself contains the data source name that is used to fetch data as data_manager[data_source_name].load(). This simplifies things quite a bit.

Tested by running hatch run example features and hatch run example features/yaml_version (after modifying to include some dynamic data examples), and all "figure" components (Graph, AgGrid, Table) work still.

Reviewers

@petar-qb please check the resulting diff and make sure you understand it, but no need to run any examples since not much has really changed here. I'll talk you through the intermediate commits and work with you on https://github.com/McK-Internal/vizro-internal/issues/753 next week.

@huong-li-nguyen or @maxschulz-COL should be easy to skim through and give a ✅ if all looks good.

Notice

  • I acknowledge and agree that, by checking this box and clicking "Submit Pull Request":

    • I submit this contribution under the Apache 2.0 license and represent that I am entitled to do so on behalf of myself, my employer, or relevant third parties, as applicable.
    • I certify that (a) this contribution is my original creation and / or (b) to the extent it is not my original creation, I am authorized to submit this contribution on behalf of the original creator(s) or their licensees.
    • I certify that the use of this contribution as authorized by the Apache 2.0 license does not violate the intellectual property rights of anyone else.
    • I have not referenced individuals, products or companies in any commits, directly or indirectly.
    • I have not added data or restricted code in any commits, directly or indirectly.

@antonymilne antonymilne requested a review from petar-qb May 2, 2024 16:13
@antonymilne antonymilne marked this pull request as ready for review May 2, 2024 16:42
# This default value is not actually used anywhere at the moment since __call__ is always used with data_frame
# specified. It's here to match Table and AgGrid and because we might want to use __call__ more in future.
# If the functionality of process_callable_data_frame moves to CapturedCallable then this would move there too.
kwargs.setdefault("data_frame", data_manager[self["data_frame"]].load())
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This bit of code probably won't last very long. After #439 is merged then I don't think any code does __call__ without specifying a data_frame. Removing or at least rewriting this bit of code will then fix the problem @petar-qb pointed out in #398 (comment).

Copy link
Contributor

@maxschulz-COL maxschulz-COL left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Skimmed through, looks good 💪

@antonymilne antonymilne enabled auto-merge (squash) May 9, 2024 08:16
Copy link
Contributor

@petar-qb petar-qb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding this commit and considering all pros/cons I fully agree that _DynamicData.pd_DataFrameCallable should be a pure function, not a CaptureCallable. 👍

Regardless that CaptureCallable is better for reusing across multiple data sources, and regardless nested function arguments cannot be updated with this approach, does it seem possible to add: _DynamicData.pd_DataFrameCallable as CaptureCallable as an optional approach in the future?
(just to allow advanced users to reuse load data function and nesting capabilities)

@antonymilne
Copy link
Contributor Author

Regarding this commit and considering all pros/cons I fully agree that _DynamicData.pd_DataFrameCallable should be a pure function, not a CaptureCallable. 👍

Regardless that CaptureCallable is better for reusing across multiple data sources, and regardless nested function arguments cannot be updated with this approach, does it seem possible to add: _DynamicData.pd_DataFrameCallable as CaptureCallable as an optional approach in the future? (just to allow advanced users to reuse load data function and nesting capabilities)

Thank you for the review @petar-qb! I was waiting until #479 had been finalised before I answered this, because the answer kept on changing...

As it stands, I think this will definitely be possible (even without adding __name__ and __qualname__ to CapturedCallable, although it might be a good idea to do so anyway) because I always make sure that the cache key includes the data source name. This means that you could do:

# @capture("data") doesn't exist but pretend it does
def load_csv(filename):
     return pd.read_csv(filename)

# These are all dynamic data
data_manager["data_x"] = load_csv("data_x.csv")
data_manager["data_y"] = load_csv("data_y.csv")
data_manager["data_z"] = load_csv("data_x.csv") # even the same value should work indepedently

Probably this will work already actually if you do CapturedCallable(load_data) without the correct @capture decorator existing, because everything just relies on duck typing rather than strict checks.

Let's not promote this as a possible approach until we have a really good reason to do so though. Thinking about it again, even the reusability can now be handled using a partial function ok so the only real reason to use CapturedCallable would be that you prefer the syntax or have the nested keys, which seems pretty unlikely. So you'd already do the above as:

def load_csv(filename):
     return pd.read_csv(filename)

# These are all dynamic data
data_manager["data_x"] = partial(load_csv, "data_x.csv")
data_manager["data_y"] = partial(load_csv, "data_y.csv")
data_manager["data_z"] = partial(load_csv, "data_x.csv")

@antonymilne antonymilne merged commit 5d5d8a8 into main May 14, 2024
34 checks passed
@antonymilne antonymilne deleted the tidy/data-source-mapping branch May 14, 2024 21:06
nadijagraca pushed a commit that referenced this pull request May 15, 2024
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants