
Large speedup in countrycode() #341

Merged

Conversation

@etiennebacher (Contributor) commented Sep 4, 2023

This PR refactors how we search for regex matches on each input. Instead of running one grepl() per regex, I collapse all the regexes into a single pattern by putting each one in a capture group. A single run of gregexpr() is then enough, and the matches can be extracted from the capture-group information. I put a few more explanations in the code, but you can also refer to this StackOverflow post.
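As an illustration of the capture-group trick, here is a minimal base-R sketch with made-up regexes and codes (not the actual countrycode dictionary), using regexpr() for brevity where the PR uses gregexpr():

```r
# Minimal sketch of the capture-group trick, with made-up regexes and codes
# (not the actual countrycode dictionary). All regexes are collapsed into one
# alternation; the capture-group start positions tell us which one matched.
regexes <- c("united.*states", "germany", "france")
codes   <- c("USA", "DEU", "FRA")

combined <- paste0("(", regexes, ")", collapse = "|")
# "(united.*states)|(germany)|(france)"

inputs <- c("United States of America", "France", "Atlantis")
m  <- regexpr(combined, tolower(inputs), perl = TRUE)
cs <- attr(m, "capture.start")  # n_inputs x n_groups matrix of group starts

# The group whose start position is > 0 is the regex that matched.
group <- vapply(seq_along(inputs), function(i) {
  if (m[i] == -1L) return(NA_integer_)  # no regex matched this input
  which(cs[i, ] > 0L)[1L]
}, integer(1))

codes[group]
#> [1] "USA" "FRA" NA
```

The regex engine runs once over each input instead of once per regex per input, which is where the speedup comes from.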

Instead of using nested sapply() calls, we can pass the whole vector of inputs (country names, ISO codes, etc.) to grepl() directly, since grepl() is vectorized over its input.
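For instance (a toy illustration, not countrycode's actual code):

```r
# Toy illustration: grepl() is vectorized over its input, so one call over
# the whole vector replaces an element-wise sapply() loop.
inputs <- c("France", "Germany", "French Guiana")

slow <- sapply(inputs, function(x) grepl("^fr", x, ignore.case = TRUE))
fast <- grepl("^fr", inputs, ignore.case = TRUE)

identical(unname(slow), fast)
#> [1] TRUE
fast
#> [1]  TRUE FALSE  TRUE
```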

There was also one ifelse() call that took a lot of time but isn't needed when there is only one destination.

This leads to a very large speedup:

library(bench)

out <- cross::run(
  pkgs = c("vincentarelbundock/countrycode",
  "etiennebacher/countrycode@large-speedup"),
  ~{
    library(countrycode)

    test <- data.frame(
      origin = sample(codelist$country.name.en, 1e7, TRUE),
      destination = sample(codelist$country.name.en, 1e7, TRUE)
    )

    bench::mark(
      countrycode(test$origin, "country.name", "iso3c"),
      iterations = 10
    )
  }
)

tidyr::unnest(out, result) |>
  dplyr::select(pkg, median, mem_alloc) |>
  dplyr::mutate(pkg = ifelse(grepl("vincent", pkg), "main", "fork"))
#> # A tibble: 2 × 3
#>   pkg     median mem_alloc
#>   <chr> <bch:tm> <bch:byt>
#> 1 main     12.1s    1.17GB
#> 2 fork     2.23s    1.32GB

@vincentarelbundock (Owner)

Oooh, this looks awesome, thanks!

Frankly, it's a bit scary too, so I'll want to play with it a bit before merging. Unfortunately, it's the start of the semester, so I may not get to this for a little while.


@vincentarelbundock (Owner)

If @cjyetman has time and interest (no pressure), he could also merge this after some testing.

@etiennebacher (Contributor, author)

> Frankly, it's a bit scary too, so I'll want to play with it a bit before merging

Yup, completely agree. The last thing you want is to realize later that codes and names don't match. I'll see later if I can test more extensively.

@etiennebacher (Contributor, author) commented Sep 4, 2023

FYI, an improvement that makes the code much simpler was suggested to me on SO; I'll modify the PR accordingly.

@etiennebacher (Contributor, author)

New timing after the simplification:

#> # A tibble: 1 × 3
#>   pkg     median mem_alloc
#>   <chr> <bch:tm> <bch:byt>
#> 1 fork     1.18s     742MB

@etiennebacher (Contributor, author)

@vincentarelbundock or @cjyetman can you enable the CI for all commits in the PR so that I know if they pass?

@cjyetman (Collaborator) commented Sep 4, 2023

> @vincentarelbundock or @cjyetman can you enable the CI for all commits in the PR so that I know if they pass?

I cannot (enable CI on commits from forks), but I have approved the CI to run on your most recent push.

@etiennebacher (Contributor, author)

For your information: if you're OK with adding a dependency on data.table (to get access to chmatch()), it's possible to go down to 750ms and 550MB for the example above.
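For context, data.table::chmatch() is a drop-in replacement for match() on character vectors. A hedged sketch with toy data (the 750ms/550MB figures above come from the PR's benchmark, not this snippet):

```r
# Toy sketch: chmatch() behaves like match() on character vectors, but is
# typically faster on large inputs thanks to data.table's internal string cache.
library(data.table)

haystack <- c("DEU", "FRA", "USA")
needles  <- c("FRA", "FRA", "USA", "ATL")

match(needles, haystack)
#> [1]  2  2  3 NA
chmatch(needles, haystack)
#> [1]  2  2  3 NA
```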

@vincentarelbundock (Owner)

Got it.

I think it's best to stay dependency-free, though.

@vincentarelbundock (Owner)

Looks like all the tests pass. I guess that's better than whatever impressionistic sense I could get from playing with different things. What do you think, @cjyetman? Should I just merge this?

@cjyetman (Collaborator) commented Sep 5, 2023

I'll try to take a look at it today.

@etiennebacher (Contributor, author)

I ran revdepcheck on the 31 reverse dependencies (both Imports and Suggests) and found no new errors.

@cjyetman (Collaborator) commented Sep 5, 2023

@etiennebacher can you explain what cross::run() is so that I can replicate your example?

@etiennebacher (Contributor, author)

It lets us run the same code automatically with two (or more) versions of the same package. I specified the main branch and my fork at the beginning, then supplied the code to run. cross::run() automatically installs both versions in a temporary location, runs the code with each version separately, and returns the results in a nested table that I unnest at the end.

https://github.com/DavisVaughan/cross

@cjyetman (Collaborator) commented Sep 5, 2023

This seems to work as expected: tests pass and the build passes. I can't exactly replicate the example because I don't know where cross::run() comes from, but loading different versions of countrycode in different sessions does show a significant speed improvement with this example.

A few suggestions:

  • I would strongly prefer to see minor whitespace changes on unrelated lines removed from this PR to make it easier to review.
  • I would prefer the ifelse() change starting on L207 to be done in a separate PR, because it's conceptually a separate idea/improvement.

@vincentarelbundock (Owner)

Thanks a ton for the review, @cjyetman; I appreciate your time.

I also agree with the comments about white space and separate PRs for different concepts.

That said, it's probably fine to merge this now in order to avoid more busywork for Etienne. So @etiennebacher, if you want to add your name as a contributor (your choice), I can merge the PR.

A 5x speedup is nothing to scoff at. Very nice! Thanks!

@cjyetman (Collaborator) commented Sep 5, 2023

> It lets us run the same code automatically with two (or more) versions of the same package. I specified the main branch and my fork at the beginning, then supplied the code to run. cross::run() automatically installs both versions in a temporary location, runs the code with each version separately, and returns the results in a nested table that I unnest at the end.
>
> https://github.com/DavisVaughan/cross

Thanks for the link, I was not finding {cross} on CRAN.

@cjyetman (Collaborator) commented Sep 5, 2023

Note that if multiple destinations are used, the speed-up is reduced and the memory usage is the same, but both versions still produce the same result:

library(bench)

out <- cross::run(
  pkgs = c("vincentarelbundock/countrycode",
           "etiennebacher/countrycode@large-speedup"),
  ~{
    library(countrycode)

    sourcevar <- sample(codelist$country.name.en, 1e7, TRUE)

    bench::mark(
      countrycode(sourcevar, "country.name", c("iso3c", "iso3n")),
      iterations = 10
    )
  }
)

tidyr::unnest(out, result) |>
  dplyr::select(pkg, median, mem_alloc) |>
  dplyr::mutate(pkg = ifelse(grepl("vincent", pkg), "main", "fork"))

#> # A tibble: 2 × 3
#>   pkg     median mem_alloc
#>   <chr> <bch:tm> <bch:byt>
#> 1 main     5.95s    2.45GB
#> 2 fork     4.56s    2.45GB

@vincentarelbundock vincentarelbundock merged commit 7342335 into vincentarelbundock:main Sep 6, 2023
6 checks passed
@vincentarelbundock (Owner)

Thanks to both of you (and especially Etienne!). Merged!

@etiennebacher etiennebacher deleted the large-speedup branch September 6, 2023 05:07