Increase sorting scalability via CytoTable metadata columns #204
for further performance in cytomining#175
This is a big PR @d33bs - tough for me (with my limited expertise) to go through! @falquaddoomi can you take a look when you get a chance?
In general, I have one concern. If we end up deciding (or duckdb fixes this) to go back to the old way (where we had thought but didn't end up having potential issues with sorting), it seems this will be super difficult to disentangle. Or maybe I'm not thinking about this correctly? Could it be that this increasing scalability is independent of the previous solution which reduced speed?
(some additional context @falquaddoomi - we need to solve this for an upcoming project that will use cytotable heavily. Thanks!)
I imagine this PR is related to our discussion about ordering by a few columns versus ORDER BY ALL? If so, I think it's an improvement on the previous behavior; kudos.
On @gwaybio's point, I think I'm lacking some context; I recall that not ordering the intermediate joins caused some kind of problem, but perhaps not? If it's the case that it can complete without ordering, perhaps it makes sense to make at least the ordering by the metadata columns an option that the user can disable at runtime. I saw that you rely on the metadata columns for things other than sorting, so I think it's fine to just always add them and just provide the option to disable the ordering.
FYI, the comments I left were just about possibly disabling ordering; otherwise, the PR looks good to me!
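As a rough illustration of the suggestion above (always add the metadata columns, but let the user turn the ordering off at runtime), here is a minimal sketch in plain Python. The function name `finalize_rows`, the flag `apply_sort`, and the column names `meta_source` and `meta_rownum` are hypothetical and are not taken from CytoTable's API.

```python
from typing import Dict, List, Sequence

Row = Dict[str, object]


def finalize_rows(
    rows: Sequence[Row],
    meta_columns: Sequence[str] = ("meta_source", "meta_rownum"),  # hypothetical names
    apply_sort: bool = True,  # hypothetical flag for disabling ordering at runtime
) -> List[Row]:
    """Return rows ordered by the metadata columns, or untouched if sorting is off."""
    if not apply_sort:
        # Metadata columns are still present on every row; only the ordering is skipped.
        return list(rows)
    # Sort on the metadata columns only, not on every column in the row.
    return sorted(rows, key=lambda row: tuple(row[col] for col in meta_columns))


rows = [
    {"meta_source": "b.csv", "meta_rownum": 1, "area": 30.0},
    {"meta_source": "a.csv", "meta_rownum": 2, "area": 12.5},
    {"meta_source": "a.csv", "meta_rownum": 1, "area": 10.0},
]

ordered = finalize_rows(rows)
unordered = finalize_rows(rows, apply_sort=False)
```

With the flag off, the rows come back in their input order, which is what would make a later sorted versus unsorted comparison possible.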
Thanks @gwaybio and @falquaddoomi for the reviews! I like the idea of an optional setting for this sorting mechanism, with a possible backup method which doesn't leverage CytoTable metadata. Generally, I still feel that sorting should be required to guarantee no data loss with While we plan to remove
Note: Initially failing tests for 4ffe9c1 appeared to have something to do with a Poetry (and not CytoTable) dependency failure (maybe fixed through a deploy by the time of a 3rd re-run?). I don't think these are related to CytoTable code as they were at the layer of Poetry installations. Errors were: Update: appears related to tox-dev/filelock#337
Thanks again @gwaybio and @falquaddoomi ! I've added some updates which make sorting optional through the use of parameters called
Nice to see that you added the sort option! I anticipate somewhere down the line we could use that option as a means to compare the performance and correctness of sorting vs. not sorting.
Cheers, thanks @falquaddoomi ! Agreed on comparisons; it will be interesting to see the contrast, excited to learn more!
Description
This PR seeks to refine #175 by improving performance through generated CytoTable metadata columns, which are primarily beneficial during large join operations. Anecdotally, I noticed that ORDER BY ALL memory consumption for joined tables becomes very high when working with a larger dataset. Before this change, large join operations attempted to sort by all columns included in the join. After this change, only CytoTable metadata columns are used for sorting, decreasing the amount of processing required to create deterministic datasets.

I hope to further refine this work through #193 and #176, which I feel would provide additional insights concerning performance and best-practice recommendations. I can also see how these might be required to validate things here, but I didn't want to hold up review comments (as these might also further inform efforts within those issues).
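To make the before/after contrast concrete, here is a small sketch using Python's built-in sqlite3 module rather than DuckDB. SQLite has no ORDER BY ALL, so the "before" query lists every result column by position instead. The table layouts and the `meta_rownum` column are assumptions made for illustration only and are not CytoTable's actual metadata columns.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript(
    """
    -- meta_rownum is a hypothetical per-table metadata column (original row position)
    CREATE TABLE image (ImageNumber INTEGER, intensity REAL, meta_rownum INTEGER);
    CREATE TABLE cells (ImageNumber INTEGER, ObjectNumber INTEGER, area REAL,
                        meta_rownum INTEGER);
    INSERT INTO image VALUES (2, 0.7, 2), (1, 0.5, 1);
    INSERT INTO cells VALUES (2, 1, 30.0, 3), (1, 2, 12.5, 2), (1, 1, 10.0, 1);
    """
)

join_sql = """
    SELECT image.ImageNumber, cells.ObjectNumber, image.intensity, cells.area,
           image.meta_rownum AS image_meta_rownum,
           cells.meta_rownum AS cells_meta_rownum
    FROM image
    JOIN cells ON cells.ImageNumber = image.ImageNumber
"""

# Before: the sort key covers every column in the joined result
# (positional references stand in for DuckDB's ORDER BY ALL).
sorted_by_all = con.execute(join_sql + " ORDER BY 1, 2, 3, 4, 5, 6").fetchall()

# After: the sort key is only the metadata columns, which already identify each row.
sorted_by_meta = con.execute(
    join_sql + " ORDER BY image_meta_rownum, cells_meta_rownum"
).fetchall()

con.close()
```

Both queries give a deterministic row order as long as the metadata columns uniquely identify each joined row. The second query does so with a much narrower sort key, which is the source of the reduced processing described above. The two orderings need not be identical in general.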
Closes #175