Row ordering design choice #356

Open
anmyachev opened this issue May 23, 2024 · 1 comment

Comments

@anmyachev
Contributor

Why do I even think it is necessary to maintain order?

Maintaining order matches the definition of dataframes from the article. If we base our definitions on articles about dataframes and their algebra, we could also look for newer articles and do a more in-depth comparative analysis, since the article I cited is from 2020.

Are there any use cases where this is important?

I think it is clear that there are workloads where the order of the data matters. For example, values may be recorded in some area over time without timestamps, to reduce the size of the dataset. In that case the row position is the only record of time, so any operation that does not preserve order invalidates the trends that could be derived from the data.
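To make this concrete, here is a minimal sketch in plain Python. The readings are made up for illustration, and `sorted` is only a stand-in for any operation that does not preserve row order.

```python
# Hypothetical sensor readings; the position in the list is the only record of time.
readings = [10.0, 12.0, 15.0, 14.0, 18.0]


def step_changes(values):
    """Return the change between each pair of consecutive values."""
    return [later - earlier for earlier, later in zip(values, values[1:])]


# Trend computed on the data in its recorded order.
original_trend = step_changes(readings)

# Stand-in for an operation that does not preserve row order.
reordered = sorted(readings, reverse=True)

# Same values, but the step-by-step changes no longer describe what happened over time.
reordered_trend = step_changes(reordered)

print(original_trend)
print(reordered_trend)
```

The two lists hold exactly the same values, yet the consecutive differences disagree, and nothing in the reordered data allows the original sequence to be recovered.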

Why not come up with a new concept that has characteristics of both dataframes and relational tables?

To ease adoption of the DataFrame API, it seems enough to combine the current concepts more or less successfully so that they coexist conveniently in one interface (at least for the first stable release). With this approach, a library belonging to one of these groups may need to implement the characteristics of the other group. With a new concept, the number of other groups whose characteristics a library must implement may increase to two.

Solution.

These two concepts have existed for a long time without being fully unified, and there are now many hybrids that implement the interface of the opposite group on top of their own set of operations. Given that, I believe the solution does not need to be ideal, just flexible enough.

So let's allow the order to be preserved or not, based on the user's choice, whether that is expressed through additional function parameters, environment variables, or context managers.

This gives enough flexibility to libraries that implement the relational approach (they also stay performant, since there is no need to maintain order using an additional index column or other tricks), and at the same time the standard covers more use cases.
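As a rough sketch of what such opt-in ordering could look like, the snippet below combines the three mechanisms mentioned above. Every name here (`DF_PRESERVE_ORDER`, `order_preserved`, `unique_rows`) is hypothetical and exists only for illustration; none of it is part of the standard or of any existing library.

```python
import contextlib
import contextvars
import os

# Hypothetical environment variable, read once as the process-wide default.
_ENV_DEFAULT = os.environ.get("DF_PRESERVE_ORDER", "1") != "0"

# Current setting; a context manager can override it for a block of code.
_preserve_order = contextvars.ContextVar("preserve_order", default=_ENV_DEFAULT)


@contextlib.contextmanager
def order_preserved(enabled):
    """Temporarily set whether operations must preserve row order."""
    token = _preserve_order.set(enabled)
    try:
        yield
    finally:
        _preserve_order.reset(token)


def unique_rows(rows, preserve_order=None):
    """Drop duplicate rows; keep first-seen order only when requested.

    An explicit argument takes precedence over the context setting,
    which in turn takes precedence over the environment default.
    """
    if preserve_order is None:
        preserve_order = _preserve_order.get()
    if preserve_order:
        # dict keys keep insertion order, so the first occurrence wins.
        return list(dict.fromkeys(rows))
    # Result order is unspecified; the backend is free to skip order tracking.
    return list(set(rows))


rows = [3, 1, 3, 2, 1]

with order_preserved(True):
    ordered = unique_rows(rows)

with order_preserved(False):
    unordered = unique_rows(rows)
```

With ordering requested, `ordered` keeps the first-seen order of the input. Without it, `unordered` contains the same values in an unspecified order, which is the room a relational backend would need to avoid an extra index column.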

@kkraus14
Collaborator

> This matches the definition of dataframes from the article. If we take the approach of defining based on articles about dataframes and their algebra, then we can also look for new articles and do a more in-depth comparative analysis (since the article I cited is from 2020).

This paper was brought up early in this effort, and I believe there was consensus that it is one person's or group's definition of dataframes rather than a universal one, and that we did not want to follow all of the semantics defined in it.


2 participants