
Large speedup in countrycode() #341

Merged

Conversation

@etiennebacher (Contributor) commented Sep 4, 2023

This PR refactors how we search for regex matches on each input. Instead of running one grepl() per regex, I collapse all the regexes into a single pattern by putting each one in a capture group. A single run of gregexpr() is then enough, and the matches can be extracted from the capture-group information. I put a few more explanations in the code, but you can also refer to this StackOverflow post.
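As an illustration of the capture-group trick, here is a minimal base-R sketch with made-up regexes and codes (not the actual countrycode dictionary), using regexpr() for brevity where the PR uses gregexpr():

```r
# Minimal sketch of the capture-group trick, with made-up regexes and codes
# (not the actual countrycode dictionary). All regexes are collapsed into one
# alternation; the capture-group start positions tell us which one matched.
regexes <- c("united.*states", "germany", "france")
codes   <- c("USA", "DEU", "FRA")

combined <- paste0("(", regexes, ")", collapse = "|")
# "(united.*states)|(germany)|(france)"

inputs <- c("United States of America", "France", "Atlantis")
m  <- regexpr(combined, tolower(inputs), perl = TRUE)
cs <- attr(m, "capture.start")  # n_inputs x n_groups matrix of group starts

# The group whose start position is > 0 is the regex that matched.
group <- vapply(seq_along(inputs), function(i) {
  if (m[i] == -1L) return(NA_integer_)  # no regex matched this input
  which(cs[i, ] > 0L)[1L]
}, integer(1))

codes[group]
#> [1] "USA" "FRA" NA
```

The regex engine runs once over each input instead of once per regex per input, which is where the speedup comes from.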

Instead of using nested sapply() calls, we can pass the whole vector of inputs (country names, ISO codes, etc.) to grepl() directly, since grepl() is vectorized over its input.
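For instance (a toy illustration, not countrycode's actual code):

```r
# Toy illustration: grepl() is vectorized over its input, so one call over
# the whole vector replaces an element-wise sapply() loop.
inputs <- c("France", "Germany", "French Guiana")

slow <- sapply(inputs, function(x) grepl("^fr", x, ignore.case = TRUE))
fast <- grepl("^fr", inputs, ignore.case = TRUE)

identical(unname(slow), fast)
#> [1] TRUE
fast
#> [1]  TRUE FALSE  TRUE
```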

There was also one ifelse() call that took a lot of time but isn't needed when there is only one destination.

This leads to a very large speedup:

library(bench)

out <- cross::run(
  pkgs = c("vincentarelbundock/countrycode",
  "etiennebacher/countrycode@large-speedup"),
  ~{
    library(countrycode)

    test <- data.frame(
      origin = sample(codelist$country.name.en, 1e7, TRUE),
      destination = sample(codelist$country.name.en, 1e7, TRUE)
    )

    bench::mark(
      countrycode(test$origin, "country.name", "iso3c"),
      iterations = 10
    )
  }
)

tidyr::unnest(out, result) |>
  dplyr::select(pkg, median, mem_alloc) |>
  dplyr::mutate(pkg = ifelse(grepl("vincent", pkg), "main", "fork"))
#> # A tibble: 2 × 3
#>   pkg     median mem_alloc
#>   <chr> <bch:tm> <bch:byt>
#> 1 main     12.1s    1.17GB
#> 2 fork     2.23s    1.32GB

@vincentarelbundock (Owner)

Oooh, this looks awesome, thanks!

Frankly, it's a bit scary too, so I'll want to play with it a bit before merging. Unfortunately, it's the start of the semester, so I may not get to this for a little while.


@vincentarelbundock (Owner)

If @cjyetman has time and interest (no pressure), he could also merge this after some testing.

@etiennebacher (Contributor, author)

> Frankly, it's a bit scary too, so I'll want to play with it a bit before merging

Yup, completely agree. The last thing you want is to realize later that codes and names don't match. I'll see later if I can test more extensively.

@etiennebacher (Contributor, author) commented Sep 4, 2023

FYI, an improvement that makes the code much simpler was suggested to me on SO; I'll modify the PR accordingly.

@etiennebacher (Contributor, author)

New timing after the simplification:

#> # A tibble: 1 × 3
#>   pkg     median mem_alloc
#>   <chr> <bch:tm> <bch:byt>
#> 1 fork     1.18s     742MB

@etiennebacher (Contributor, author)

@vincentarelbundock or @cjyetman can you enable the CI for all commits in the PR so that I know if they pass?

@cjyetman (Collaborator) commented Sep 4, 2023

> @vincentarelbundock or @cjyetman can you enable the CI for all commits in the PR so that I know if they pass?

I cannot (enable CI on commits from forks), but I have approved the CI to run on your most recent push.

@etiennebacher (Contributor, author)

For your information: if you're OK with adding a dependency on data.table (to get access to chmatch()), it's possible to go down to 750ms and 550MB for the example above.
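For context, data.table::chmatch() is a drop-in replacement for match() on character vectors. A hedged sketch with toy data (the 750ms/550MB figures above come from the PR's benchmark, not this snippet):

```r
# Toy sketch: chmatch() behaves like match() on character vectors, but is
# typically faster on large inputs thanks to data.table's internal string cache.
library(data.table)

haystack <- c("DEU", "FRA", "USA")
needles  <- c("FRA", "FRA", "USA", "ATL")

match(needles, haystack)
#> [1]  2  2  3 NA
chmatch(needles, haystack)
#> [1]  2  2  3 NA
```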

@vincentarelbundock (Owner)

Got it.

I think it's best to stay dependency-free, though.

@vincentarelbundock (Owner)

Looks like all the tests pass. I guess that's better than whatever impressionistic sense I could get from playing with different things. What do you think, @cjyetman? Should I just merge this?

@cjyetman (Collaborator) commented Sep 5, 2023

I'll try to take a look at it today.

@etiennebacher (Contributor, author)

I ran revdepcheck on the 31 reverse dependencies (both Imports and Suggests) and found no new errors.

@cjyetman (Collaborator) commented Sep 5, 2023

@etiennebacher can you explain what cross::run() is so that I can replicate your example?

@etiennebacher (Contributor, author)

It lets us run the same code automatically with two (or more) versions of the same package. I specified the main branch and my fork at the beginning, then supplied the code to run. cross::run() automatically installs both versions in a temporary location, runs the code with each version separately, and returns the results in a nested table that I unnest at the end.

https://github.com/DavisVaughan/cross

@cjyetman (Collaborator) commented Sep 5, 2023

This seems to work as expected: tests pass and the build passes. I can't exactly replicate the example because I don't know where cross::run() comes from, but loading different versions of countrycode in different sessions does show a significant speed improvement with this example.

A few suggestions:

  • I would strongly prefer to see minor whitespace changes on unrelated lines removed from this PR to make it easier to review.
  • I would prefer the ifelse() change starting on L207 to be done in a separate PR, because it's conceptually a separate idea/improvement.

@vincentarelbundock (Owner)

Thanks a ton for the review, @cjyetman; I appreciate your time.

I also agree with the comments about white space and separate PRs for different concepts.

That said, it's probably fine to merge this now in order to avoid more busywork for Etienne. So @etiennebacher, if you want to add your name as a contributor (your choice), I can merge the PR.

A 5x speedup is nothing to scoff at. Very nice! Thanks!

@cjyetman (Collaborator) commented Sep 5, 2023

> It lets us run the same code automatically with two (or more) versions of the same package. I specified the main branch and my fork at the beginning, then supplied the code to run. cross::run() automatically installs both versions in a temporary location, runs the code with each version separately, and returns the results in a nested table that I unnest at the end.
>
> https://github.com/DavisVaughan/cross

Thanks for the link, I was not finding {cross} on CRAN.

@cjyetman (Collaborator) commented Sep 5, 2023

Note that if multiple destinations are used, the speed-up is reduced and the memory usage is the same, but both versions still produce the same result:

library(bench)

out <- cross::run(
  pkgs = c("vincentarelbundock/countrycode",
           "etiennebacher/countrycode@large-speedup"),
  ~{
    library(countrycode)

    sourcevar <- sample(codelist$country.name.en, 1e7, TRUE)

    bench::mark(
      countrycode(sourcevar, "country.name", c("iso3c", "iso3n")),
      iterations = 10
    )
  }
)

tidyr::unnest(out, result) |>
  dplyr::select(pkg, median, mem_alloc) |>
  dplyr::mutate(pkg = ifelse(grepl("vincent", pkg), "main", "fork"))

#> # A tibble: 2 × 3
#>   pkg     median mem_alloc
#>   <chr> <bch:tm> <bch:byt>
#> 1 main     5.95s    2.45GB
#> 2 fork     4.56s    2.45GB

@vincentarelbundock vincentarelbundock merged commit 7342335 into vincentarelbundock:main Sep 6, 2023
6 checks passed
@vincentarelbundock (Owner)

Thanks to both of you (and especially Etienne!). Merged!

@etiennebacher etiennebacher deleted the large-speedup branch September 6, 2023 05:07