
match merging optimization #363

Open
wants to merge 6 commits into base: dev

Conversation

TimWhiting (Collaborator)

Takes a match that has common superstructure and reworks it to match on that shared structure first.

Note that the transform expects there to be only a single pattern in the match statement, which I believe is a fine assumption to begin with, since tuple types are used for multiple pattern matches; it rewrites the core from the bottom up. However, it should probably be improved to handle multiple patterns as well. Branches can get tricky fast.

So

match e
    Cons(a, Cons(1, Cons(c, Nil))) -> (a + c).show.println
    Cons(_, Cons(b, _)) -> b.show.println
    Nil -> "Nothing".println
    _ -> implicit error

Turns into

match e
    Cons(a, Cons(b, d)) ->
        match (b, d)
            (1, Cons(c, Nil)) -> (a + c).show.println
            (b, _) -> b.show.println
            _ -> implicit error
    Nil -> "Nothing".println
    _ -> implicit error

In the generated C code, the original results in duplicated Cons tag checks for the first two Conses.
In the new version, those duplicate checks are eliminated.
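To make the duplicated checks concrete, here is a hand-written C sketch of the two compilation schemes. The list layout (`node`, `tag`, `head`, `tail`) is hypothetical and purely illustrative; the actual Koka runtime representation and generated code differ.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical cons-list layout, for illustration only. */
typedef struct node { int tag; int head; struct node *tail; } node;
enum { NIL = 0, CONS = 1 };

/* Original scheme (sketch): each arm re-tests the outer Cons cells,
   so the first two Cons tag checks are duplicated across arms.
   Returns the value each arm would print; -1 stands in for "Nothing". */
int match_original(node *e) {
    if (e->tag == CONS && e->tail->tag == CONS && e->tail->head == 1
        && e->tail->tail->tag == CONS && e->tail->tail->tail->tag == NIL)
        return e->head + e->tail->tail->head;  /* Cons(a, Cons(1, Cons(c, Nil))) */
    if (e->tag == CONS && e->tail->tag == CONS)  /* duplicated Cons checks */
        return e->tail->head;                    /* Cons(_, Cons(b, _)) */
    if (e->tag == NIL)
        return -1;                               /* Nil */
    abort();                                     /* implicit error */
}

/* Merged scheme (sketch): the shared Cons(a, Cons(b, d)) prefix is
   tested exactly once, then the residual patterns are matched. */
int match_merged(node *e) {
    if (e->tag == CONS && e->tail->tag == CONS) {
        int a = e->head, b = e->tail->head;
        node *d = e->tail->tail;
        if (b == 1 && d->tag == CONS && d->tail->tag == NIL)
            return a + d->head;                  /* (1, Cons(c, Nil)) */
        return b;                                /* (b, _) */
    }
    if (e->tag == NIL)
        return -1;
    abort();
}
```

Both functions are observationally equivalent; the merged one simply hoists the common prefix tests, which is what removes the redundant checks from the hot path.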

It is smart enough to push the implicit incomplete-match error branches down into subpatterns, but it doesn't do any exhaustiveness checking to see whether the changes have made any of the resulting matches newly exhaustive.
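A small hypothetical sketch of what pushing the error branch down looks like: for a match covering only `Cons(1, _)` and `Nil`, the incomplete-match error ends up as the failure case inside the `Cons` branch rather than as a separate top-level arm that would re-test the constructor. The names and layout below are illustrative, not the actual generated code.

```c
#include <stdlib.h>

typedef struct node { int tag; int head; struct node *tail; } node;
enum { NIL = 0, CONS = 1 };

/* Match covering Cons(1, _) -> 1 and Nil -> 0 only.  The implicit
   incomplete-match error is pushed inside the Cons branch, so the
   Cons tag is tested exactly once. */
int first_is_one(node *e) {
    if (e->tag == CONS) {
        if (e->head == 1)
            return 1;   /* Cons(1, _) */
        abort();        /* implicit error, pushed into the Cons subpattern */
    }
    return 0;           /* Nil */
}
```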

@TimWhiting TimWhiting changed the base branch from master to dev September 22, 2023 04:32
@TimWhiting TimWhiting marked this pull request as draft October 23, 2023 22:38
@TimWhiting (Collaborator, Author)

This is not ready to merge; I'll add some tests, since it does end up moving things around quite a bit.

skip guards
catch all case
@TimWhiting (Collaborator, Author)

TimWhiting commented Jul 10, 2024

TL;DR: Shaves 3-5% off the runtime of some heavily nested matches that interact with reference counting (e.g. some variants of rbtree); otherwise it doesn't affect the runtime.

Old fip benchmarks:

benchmark variant param elapsed relative stddev rss

kk rbtree fip 100000 0.63 1.000 .0148324 4816
kk rbtree fip-icfp 100000 0.62 .984 .00762203 4800
kk rbtree std-reuse 100000 0.65 1.031 .00461077 4800
kk rbtree std 100000 1.38 2.190 .0317361 4800
kk rbtree fip-clrs 100000 0.8 1.269 .00982963 4816
c rbtree clrs 100000 1.095 1.738 .01602356 10432
c rbtree clrs-mi 100000 0.62 .984 .0116428 5952
c rbtree clrs-full 100000 1.105 1.753 .0211089 10944
c rbtree clrs-full-mi 100000 0.615 .976 .0152768 5952
cpp rbtree stl 100000 0.76 1.206 .0357757 11472
cpp rbtree stl-mi 100000 0.33 .523 .00286459 5968

kk ftree fip 100000 0.75 1.000 .0083666 4928
kk ftree std-reuse 100000 0.73 .973 .00532934 4080
kk ftree std 100000 1.16 1.546 .0263274 4112

kk msort fip 100000 1.03 1.000 .0161245 6368
kk msort std-reuse 100000 0.92 .893 .00564783 10576
kk msort std 100000 1.13 1.097 .0120170 10608

kk qsort fip 100000 1.485 1.000 .0261725 12112
kk qsort std-reuse 100000 1.92 1.292 .0255149 12624
kk qsort std 100000 2.365 1.592 .0529204 12656

kk tmap fip 100000 1.24 1.000 .0130384 7968
kk tmap std-reuse 100000 0.78 .629 .00344517 7984
kk tmap std 100000 0.82 .661 .00512008 7984
c tmap fip 100000 4.64 3.741 .1656211 12600
c tmap fip-mi 100000 0.66 .532 .0044510 7520
c tmap std 100000 4.635 3.737 3.88184 10648
c tmap std-mi 100000 0.68 .548 .00387494 7520

New fip benchmarks:

benchmark variant param elapsed relative stddev rss

kk rbtree fip 100000 0.61 1.000 .0109545 4832
kk rbtree fip-icfp 100000 0.6 .983 .00695086 4832
kk rbtree std-reuse 100000 0.65 1.065 .010 4816
kk rbtree std 100000 1.38 2.262 .0189252 4800
kk rbtree fip-clrs 100000 0.785 1.286 .013177 4816
c rbtree clrs 100000 1.095 1.795 .0200686 12712
c rbtree clrs-mi 100000 0.61 1.000 .0114018 5952
c rbtree clrs-full 100000 1.09 1.786 .0532815 9456
c rbtree clrs-full-mi 100000 0.62 1.016 .00556486 5952
cpp rbtree stl 100000 0.725 1.188 .0266969 7552
cpp rbtree stl-mi 100000 0.32 .524 0 5968

kk ftree fip 100000 0.8 1.000 .00894427 4912
kk ftree std-reuse 100000 0.75 .937 .00662559 4112
kk ftree std 100000 1.1 1.375 .0115040 4112

kk msort fip 100000 0.01 0
kk msort std-reuse 100000 0.94 1.000 .0250998 10576
kk msort std 100000 1.15 1.223 .00386746 10624

kk qsort fip 100000 1.55 1.000 .0457165 12064
kk qsort std-reuse 100000 1.96 1.264 .0828860 12592
kk qsort std 100000 2.465 1.590 .0859924 12640

kk tmap fip 100000 1.235 1.000 .0276586 7984
kk tmap std-reuse 100000 0.78 .631 .00399079 7952
kk tmap std 100000 0.83 .672 .00368069 8000
c tmap fip 100000 4.6 3.724 .2286547 9608
c tmap fip-mi 100000 0.66 .534 .00506596 7520
c tmap std 100000 4.61 3.732 .2087919 10808
c tmap std-mi 100000 0.68 .550 .00301247 7520

@TimWhiting TimWhiting marked this pull request as ready for review July 11, 2024 14:12
@TimWhiting TimWhiting requested a review from anfelor July 11, 2024 14:19
@anfelor (Collaborator) left a comment

Hi Tim! This looks pretty cool, I didn't realize we could make the binary search tree benchmarks even faster with a proper pattern match compiler.

I added two nitpicks on the tests, but didn't really read the code. Is there something you wanted me to look at in particular?

Review threads: test/parc/parc16.kk (outdated, resolved), test/parc/parc2.kk.out (resolved)
@TimWhiting (Collaborator, Author)

TimWhiting commented Jul 13, 2024

@anfelor

Hi Tim! This looks pretty cool, I didn't realize we could make the binary search tree benchmarks even faster with a proper pattern match compiler.

Yes, I originally intended this to clean up the generated C code, so there wouldn't be as many duplicated checks for similar structure, but I expected the C compiler to mostly optimize these redundant checks away. It appears, however, that the pattern is obscure enough that the C compiler cannot optimize it.

Looking at it again, when running the benchmark script I realized I was getting some inconsistent results, especially with -n=3. I noticed that the first run was sometimes considerably slower, so I now skip reporting the first run and run with -n=10. The new results aren't quite as impressive, but considering how fast and optimized these benchmarks already are, it still seems like a decent improvement: up to 5%.

Is there something you wanted me to look at in particular?

You are one of the only other people besides Daan who could give this a review, and I'd appreciate you pointing out any obvious bugs before I take up Daan's time, since he is so busy. The file isn't very big (fewer than 600 lines of heavily commented code). I understand if you are busy as well, so I'd appreciate whatever you are willing to look at.
