Adds @unnest_wider, @unnest_longer, and @nest #77

Merged: 14 commits, Jan 2, 2024

drizk1
Member

@drizk1 drizk1 commented Dec 27, 2023

This pull request got a little bigger than I initially anticipated, but the four added macros all support tidy selection and interpolation.

These two (see #34) support grouped dataframes (ungroup -> regroup):

  • @unnest_wider() supports names_sep
  • @unnest_longer supports indices_include and keep_empty

After the unnests, I thought I would try nest. I was struggling with some syntax issues keeping @nest(df, by = , key = ) in the same macro with @nest(df, nested_col = cols) method, so I ended up splitting them for the sake of simplicity.

  • @nest_by looks slightly different than the tidyr version in that the by and key arguments are not explicitly written, but they are supported.
  • @nest(df, nested_col = cols, nested_col2 = cols2, etc)
  • Neither nest macro supports grouped dataframes yet. I still don't fully understand what group_by -> nest does well enough to reimplement it. I just know it is different than the by argument above, but similar (looks like maybe each group becomes its own df/array?).

Before going further and writing brief documentation for the nests, I thought I would check in. Should I drop the nests from this PR for now, while I try to sort out grouping and continue trying to reduce them back into just 1 macro?

I also added tidy selection to @unite.

@kdpsingh
Member

Thanks so much, @drizk1! Let me take a look at the @nest() syntax and see if I can make it line up with tidyr. Don't remove it -- leave it in for now. I just want to see if I can figure out how to make it match.

This is exciting!

@drizk1
Member Author

drizk1 commented Dec 27, 2023

Sounds good! For context, originally, after separating them into two underlying functions, I thought I could use multiple dispatch with 2 @nest macros but that was throwing errors so I split the macro in 2.

For @nest I tried pairs before settling on the method above. Using pairs made tidy selection tricky.

For @nest_by, I also couldn't figure out how to leverage keyword arguments with tidy selection so that the argument names would be visible, but perhaps there is a way.

I'm also happy to go back to the drawing board and see if I can make one function that performs the different nests to make the macro syntax match more easily.

@kdpsingh
Member

Ah, this may be because while functions can perform dispatch based on types, macros can only perform dispatch based on the number of arguments. This is because macros see all arguments as expressions, so they are all the same type.
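
To illustrate the point about macro dispatch, here is a minimal sketch in plain Julia. The macro names `@demo_nest` and `@arg_kind` are made up for this example and are not part of TidierData. Macro methods are chosen from the parsed arguments (symbols, expressions, literals), so a run-time type such as DataFrame is never visible to them, and argument count is the practical way to tell two call forms apart.

```julia
# Hypothetical macros for illustration only; not TidierData code.

# Two methods of the same macro, told apart by argument count.
macro demo_nest(df, spec)
    return string("two-argument method: ", spec)
end

macro demo_nest(df, by, key)
    return string("three-argument method: ", by, ", ", key)
end

# Reports what the macro actually receives: a piece of syntax, not a value.
macro arg_kind(x)
    return string(typeof(x))
end

# `df`, `b`, `c`, and `x` are never evaluated, so they need not exist.
println(@demo_nest(df, data = b:c))
println(@demo_nest(df, x, "data"))
println(@arg_kind(a + b))   # Expr
println(@arg_kind(a))       # Symbol
```

Because both `@demo_nest(df, data = b:c)` and a keyword-style call would arrive as expressions, a single macro would have to inspect the expressions themselves to decide which behavior was requested.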

@kdpsingh
Member

kdpsingh commented Dec 27, 2023

Looking at the tidyr nest_by() documentation, it looks like this function returns a rowwise data frame. I have mostly figured out how to implement rowwise data frames but I would wait to add @nest_by() until rowwise is properly implemented. We don't have to remove the codebase - I may just comment it out for now and focus on @nest().

@drizk1
Member Author

drizk1 commented Dec 27, 2023

Re the macros: ok this is great to know for the future.

Thank you for clarifying the rowwise aspect. I spent some time reading the cheat sheet and documentation again this morning, and it lines up in my mind more now.

Focusing on nest and commenting out nest_by until tidier has the rowwise dataframe ability sounds good to me. Let me know if there's anything I can do to help.

@drizk1 drizk1 changed the title Adds @unnest_wider, @unnest_longer, @nest, @nest_by Adds @unnest_wider, @unnest_longer, and @nest Dec 28, 2023
@kdpsingh
Member

At this point, the main thing I'd like to see is support for grouped data frames in @nest(). It doesn't need to support by = yet because none of the TidierData macros support that yet.

If you pass it a grouped data frame, it should separately nest the selected columns into a data frame for each group, with one nested data frame per group.

@drizk1
Member Author

drizk1 commented Dec 28, 2023

Sweet. I was tinkering with that last night actually. I think I know how to make it happen now so I'll try to make it official in the next day or two

@drizk1
Member Author

drizk1 commented Dec 29, 2023

Alright I sorted out the grouping. While doing so, I realized tidyr nests into tibbles, which might be more similar to nesting into dataframes, than to the arrays I was nesting into.

Nesting into dataframes was just a few lines to change, so switching is no problem.

The question I now have is which one would you prefer it nests into? My understanding is that arrays may be less memory intensive but also less flexible than a dataframe?

We could theoretically offer an argument so the user can choose?

I'm open to anything, but fully defer the decision to you, and I will implement it.

@kdpsingh
Member

Ooh thanks for catching this. I think we should nest into DataFrames, which is important if you are nesting multiple columns. Let's not implement an option for alternatives for now. Just make sure that the unnesting works correctly if we nest into DataFrames. Once you do that, I'll review and merge. Exciting!

@drizk1
Member Author

drizk1 commented Dec 30, 2023

Alright, so now, unnesting supports dataframes.

  • @unnest_wider can unnest arrays, tuples, dataframes and dicts. (adding tuple support was the only way to make the examples below possible)
  • @unnest_longer can unnest arrays and dataframes

And @nest nests into dataframes.

I tested it against the following tidyr example, which has dataframes of different lengths, and got identical results for the following four examples, so it should be ready to go.

df = DataFrame(
    x = 1:3,
    y = Any[
        DataFrame(),
        DataFrame(a = [1], b = [2]),
        DataFrame(a = 1:3, b = [3, 2, 1], c = [4, 4, 4])
    ]
)

@chain df begin 
  @unnest_wider(y)
  @unnest_longer(a:c, keep_empty = true)
end

@chain df begin 
  @unnest_wider(y)
  @unnest_longer(a:c, keep_empty = false)
end

@chain df begin 
  @unnest_longer(y, keep_empty = true)
  @unnest_wider(y)
end

@chain df begin 
  @unnest_longer(y, keep_empty = false)
  @unnest_wider(y)
end

@kdpsingh
Member

This looks amazing! I will review and merge soon.

@kdpsingh
Member

Great work thus far. Discovered one issue.

@nest() doesn't quite match up with the nest() behavior in R tidyr.

For example, in R, nesting multiple columns produces this:

> df = tibble(a = rep(letters[1:5], each = 3), b = 1:15, c = 16:30)
> df |> nest(data = b:c)
# A tibble: 5 × 2
  a     data            
  <chr> <list>          
1 a     <tibble [3 × 2]>
2 b     <tibble [3 × 2]>
3 c     <tibble [3 × 2]>
4 d     <tibble [3 × 2]>
5 e     <tibble [3 × 2]>

And in this PR, nesting multiple columns produces this:

julia> df = DataFrame(a = repeat('a':'e', inner = 3), b = 1:15, c = 16:30)
julia> @chain df @nest(data = b:c)
15×2 DataFrame
 Row │ a     data          
     │ Char  DataFrame     
─────┼─────────────────────
   1 │ a     1×2 DataFrame 
   2 │ a     1×2 DataFrame 
   3 │ a     1×2 DataFrame 
   4 │ b     1×2 DataFrame 
   5 │ b     1×2 DataFrame 
   6 │ b     1×2 DataFrame 
   7 │ c     1×2 DataFrame 
   8 │ c     1×2 DataFrame 
   9 │ c     1×2 DataFrame 
  10 │ d     1×2 DataFrame 
  11 │ d     1×2 DataFrame 
  12 │ d     1×2 DataFrame 
  13 │ e     1×2 DataFrame 
  14 │ e     1×2 DataFrame 
  15 │ e     1×2 DataFrame 

Any thoughts on how to fix?

@drizk1
Member Author

drizk1 commented Dec 31, 2023

Oh wow. Great catch. It is almost as if it groups based on the remaining columns and then nests. I think using setdiff to find the outer dataframe columns, grouping by them, and then converting the grouped dataframes to dataframes and nesting them might work. I will play around with it.

Edit: it works for 1 nest, now trying to sort out when it's nesting multiple.
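
The setdiff-then-groupby idea described above can be sketched with plain DataFrames.jl calls. This is only an illustration under assumptions: `nest_sketch` and the temporary column `tmp_count` are hypothetical names, and this is not the implementation that was merged in this PR.

```julia
using DataFrames

# Hypothetical helper for illustration; not the code merged in this PR.
function nest_sketch(df::AbstractDataFrame, new_col::Symbol, cols::Vector{Symbol})
    # Columns left outside the nest define the groups.
    outer = setdiff(propertynames(df), cols)
    gdf = groupby(df, outer)
    # One row per group, holding only the key columns.
    out = select!(combine(gdf, nrow => :tmp_count), Not(:tmp_count))
    # Copy each group's selected columns into its own DataFrame.
    out[!, new_col] = [sdf[:, cols] for sdf in gdf]
    return out
end

df = DataFrame(a = repeat('a':'e', inner = 3), b = 1:15, c = 16:30)
nested = nest_sketch(df, :data, [:b, :c])
println(size(nested))        # (5, 2)
println(size.(nested.data))  # five entries of (3, 2)
```

Grouping by the leftover columns is what collapses the 15 rows into 5, one per distinct value of `a`, each carrying a 3×2 DataFrame.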

@kdpsingh
Member

Hmm this gives me an idea. It might be possible to implement @nest() in just a couple lines of code.

@kdpsingh
Member

Actually, go ahead with modifying what you have now and see if you can get it working.

There's one parsing functionality I'd need to add still to get this working with less code.

So I'll revisit this if you can't get it working the way you have it.

@drizk1
Member Author

drizk1 commented Dec 31, 2023

Alright, @nest is now correctly determining the number of rows based on the groups, and it supports multiple columns to group by for the outer df.
From the example above:

5×2 DataFrame
 Row │ a     data          
     │ Char  DataFrame     
─────┼─────────────────────
   1 │ a     3×2 DataFrame 
   2 │ b     3×2 DataFrame 
   3 │ c     3×2 DataFrame 
   4 │ d     3×2 DataFrame 
   5 │ e     3×2 DataFrame 

## this matches R as well
df = DataFrame(x = [1, 1, 1, 2, 2, 3], y = 1:6, z = 13:18, a = 7:12, ab = 12:-1:7);
@nest(df, n2 = starts_with("a"), n3 = (y:z))
3×3 DataFrame
 Row │ x      n3             n2            
     │ Int64  DataFrame      DataFrame     
─────┼─────────────────────────────────────
   1 │     1  3×2 DataFrame  1×2 DataFrame 
   2 │     2  2×2 DataFrame  1×2 DataFrame 
   3 │     3  1×2 DataFrame  1×2 DataFrame 

I will note, though, that when trying to unnest the multiple nested columns from the second example above, I am getting slightly different dimensions than with R. I suspect this might have to do with the slightly different behavior of unnest_wider (illustrated below: in Julia it won't add new rows, but in R it will). Of note, when using only unnest_longer and unnest_wider in R on the 6x5 df above, it does not return to a 6x5. It only does so when unnesting with unnest(), which I have not yet tried building.

Depending on what you think, I may have to go back and rework unnest_longer and unnest_wider given the example below.

In R, to go back to the original df:

df4 <- data.frame(
  x = c("a", "b", "a", "b", "C", "a"),
  y = c("e", "e", "e", "f", "f", "3"),
  yz = 13:18,
  a = 7:12,
  ab = 12:7
)
test4 = nest(df4, n2 = a:ab)

test4 %>% 
  unnest_wider(n2)

In Julia, to go back to the original df:

df4 = DataFrame(x = ["a", "b", "a", "b", "C", "a"], y = ["e", "e", "e", "f", "f", "e"], yz = [13, 14, 15, 16, 17, 13], a = 7:12, ab = 12:-1:7)
test4 =  @nest(df4, n2 = a:ab)
@chain test4 begin
  @unnest_wider(n2)
  @unnest_longer(a:ab)
end

@kdpsingh
Member

kdpsingh commented Jan 1, 2024

Thanks for the update. I'll take a look and see if I can figure out why it's behaving differently.

While I am eager to merge, I want to make sure things behave similarly across the implementations, especially for the use case where we nest and then unnest.

@drizk1
Member Author

drizk1 commented Jan 1, 2024

I totally agree.

Please ignore the two commits below, and frankly most of my most recent comment above. They were my mind playing tricks on me.

#same results as in R 
df44 = DataFrame(a = repeat('a':'e', inner = 3), b = 1:15, c = 16:30)
dfdf = @chain df44 @nest(data = b:c)
@chain dfdf @unnest_wider(data) @unnest_longer(b:c)

df4 = DataFrame(x = ["a", "b", "a", "b", "C", "a"], y = ["e", "e", "e", "f", "f", "e"], yz = [13, 14, 15, 16, 17, 13], a = 7:12, ab = 12:-1:7)
test4 = @nest(df4, n2 = yz:ab)
@chain test4 @unnest_wider(n2) @unnest_longer(yz:ab)

Unnesting multiple columns of nests back to the original dataframe is the last frontier, I think. I still get different dimensions for that.

Edit: I finally figured out where the bug is. The bug is not in either of the unnests, but in the nest when nesting multiple sets of columns at once. The example below illustrates how some of the cells are not properly populated, as it yields differences from R.

test2 = @nest(df4, n2 = a:ab, n3 = y:yz)
@chain test2 @unnest_wider(n2:n3)

@drizk1
Member Author

drizk1 commented Jan 1, 2024

Sorry to have taken you on a journey of excess commits over the last week. I have a deep appreciation for un/nesting now.

Last night, I realized that @nest was nesting multiple sets sequentially, not in parallel. This was causing dimension mismatches, which led to the issues returning to the original df when unnesting.

This is now fixed and it behaves the same as in R, so the behavior of @unnest_wider, @unnest_longer, and @nest now maps to tidyr.
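
As a rough sketch of what nesting several column sets "in parallel" can mean, the groups are computed once from the columns that belong to no set, and every nested column is then built from those same groups. The helper name `nest_many_sketch` and the temporary column `tmp_count` are assumptions for this example; the merged code may be organized differently.

```julia
using DataFrames

# Hypothetical helper for illustration; not the code merged in this PR.
function nest_many_sketch(df::AbstractDataFrame, specs::Pair{Symbol,Vector{Symbol}}...)
    # Every column named in any spec moves inside a nest.
    nested_cols = reduce(vcat, [last(s) for s in specs])
    outer = setdiff(propertynames(df), nested_cols)
    # Group once, so all nested columns share the same row layout.
    gdf = groupby(df, outer)
    out = select!(combine(gdf, nrow => :tmp_count), Not(:tmp_count))
    for (name, cols) in specs
        out[!, name] = [sdf[:, cols] for sdf in gdf]
    end
    return out
end

df4 = DataFrame(x = ["a", "b", "a", "b", "C", "a"], y = ["e", "e", "e", "f", "f", "e"],
                yz = 13:18, a = 7:12, ab = 12:-1:7)
out = nest_many_sketch(df4, :n2 => [:a, :ab], :n3 => [:y, :yz])
println(size(out))
println(nrow.(out.n2) == nrow.(out.n3))
```

Nesting the second set after the first has already collapsed the rows would group on a table that now contains a DataFrame column, which is one way the row counts of the two nests can drift apart.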

This returns to the original dataframe

df4 = DataFrame(x = ["a", "b", "a", "b", "C", "a"], y = ["e", "e", "e", "f", "f", "e"],  yz = 13:18, a = 7:12, ab = 12:-1:7)
nested_df = @nest(df4, n2 = starts_with("a"), n3 = y:yz)

@chain nested_df begin
  @unnest_wider(n3:n2)
  @unnest_longer(y:ab)
end

just like in R

df4 <- data.frame(
  x = c("a", "b", "a", "b", "C", "a"),
  y = c("e", "e", "e", "f", "f", "e"),
  yz = 13:18,
  a = 7:12,
  ab = 12:7)
nested_df = nest(df4, n2 = a:ab, n3 = y:yz)
nested_df %>% unnest_wider(n2:n3) %>% unnest_longer(a:yz)

I checked the intermediate state after unnesting wider and they match each other as well.

I think it is finally ready from my standpoint. Again, sorry for the whirlwind of preemptive commits, and thank you for helping me figure out some of the bugs.

@kdpsingh
Member

kdpsingh commented Jan 1, 2024

Awesome! Will look at this soon. Super excited to see this.

@kdpsingh kdpsingh merged commit 62e1689 into TidierOrg:main Jan 2, 2024
3 checks passed
@drizk1 drizk1 deleted the nesting branch January 3, 2024 00:01