feat: Add `apply_parallel` function to RunGroupBy #262

lewisjared · 2023-09-21T03:07:30Z

A few speed improvements for large groupby operations

- Adds the `apply_parallel` function to `RunGroupBy`, which parallelises the apply operation
- Enables `run_append` to append DataFrames formatted like the result of `run.timeseries()`
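
To illustrate the first point, here is a minimal sketch of a parallel group-wise apply. It is not the implementation in this PR: it assumes the standard library's `ThreadPoolExecutor` as a stand-in for joblib and plain dictionaries in place of `ScmRun`, and the names `apply_parallel_sketch` and `total_value` are hypothetical.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Hashable, Iterable


def apply_parallel_sketch(
    rows: Iterable[dict],
    key: str,
    func: Callable[[list], float],
    n_jobs: int = 2,
) -> dict:
    """Group ``rows`` by ``key`` and run ``func`` on each group concurrently."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)

    # Each group is independent of the others, so the calls to ``func``
    # can be handed to a pool of workers (threads here, as an assumption).
    with ThreadPoolExecutor(max_workers=n_jobs) as pool:
        # ``pool.map`` yields results in the same order as the inputs.
        results = pool.map(func, groups.values())
        return dict(zip(groups.keys(), results))


def total_value(group: list) -> float:
    """Hypothetical inner function: sum the values within one group."""
    return sum(row["value"] for row in group)


data = [
    {"region": "World", "variable": "CO2", "value": 1.0},
    {"region": "World", "variable": "CH4", "value": 2.0},
    {"region": "AUS", "variable": "CO2", "value": 0.5},
]
totals = apply_parallel_sketch(data, "region", total_value)
```

Threads keep the sketch free of pickling concerns. A process-based backend would be the more likely choice for CPU-bound inner functions, at the cost of serialising each group.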

While the use case of the former is more obvious, being able to append `pd.DataFrame`s often results in fewer calls to `scmdata.ScmRun`, which can be expensive if performed at the bottom of a loop that runs many times. This arises when the inner function of a groupby operation uses native pandas functionality. Without this optimisation, one must always convert back to an `ScmRun` object, e.g.:

def inner(run: scmdata.ScmRun) -> scmdata.ScmRun:
    return run.process_over("variable", "sum", as_run=True)

res: scmdata.ScmRun = my_data.groupby("region").apply(inner)

But this optimisation would allow for:

def inner(run: scmdata.ScmRun) -> pd.DataFrame:
    return run.process_over("variable", "sum", as_run=False)

res: scmdata.ScmRun = my_data.groupby("region").apply(inner)

which doesn't require the extra conversion from DataFrame -> ScmRun before appending.
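
The saving can be sketched with a toy model that counts constructions. `ExpensiveRun` and `append_all` below are hypothetical stand-ins, not scmdata classes or functions, and lists of dictionaries play the role of DataFrames. The sketch only assumes that constructing the wrapper object is the costly step.

```python
class ExpensiveRun:
    """Hypothetical stand-in for an object that is costly to construct."""

    n_constructed = 0

    def __init__(self, rows: list):
        ExpensiveRun.n_constructed += 1
        self.rows = list(rows)


def append_all(results: list) -> ExpensiveRun:
    """Combine results, accepting either wrapped objects or raw row lists."""
    combined = []
    for res in results:
        combined.extend(res.rows if isinstance(res, ExpensiveRun) else res)
    # The only construction that is always needed: the final result.
    return ExpensiveRun(combined)


groups = [
    [{"region": "World", "value": 3.0}],
    [{"region": "AUS", "value": 0.5}],
]

# Converting inside the loop: one construction per group plus the result.
ExpensiveRun.n_constructed = 0
wrapped = append_all([ExpensiveRun(g) for g in groups])
calls_with_conversion = ExpensiveRun.n_constructed

# Passing raw results through: a single construction at the end.
ExpensiveRun.n_constructed = 0
raw = append_all(groups)
calls_without_conversion = ExpensiveRun.n_constructed
```

Both paths produce the same combined rows. The difference is that the number of constructions grows with the number of groups only in the first case.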

The diff is quite busy as I added type-hints at the same time. Let me know if you would like me to split out the type hints.

Pull request

Please confirm that this pull request has done the following:

Tests added
Documentation added (where applicable)
Example added (either to an existing notebook or as a new notebook, where applicable)
Changelog in '/changelog' added

Parallelises the apply call using joblib

src/scmdata/groupby.py

codecov · 2023-09-21T04:56:14Z

Codecov Report

Attention: 9 lines in your changes are missing coverage. Please review.

Comparison is base (500aa22) 95.58% compared to head (5bd6eb3) 95.40%.
Report is 11 commits behind head on master.

❗ Current head 5bd6eb3 differs from pull request most recent head bada923. Consider uploading reports for the commit bada923 to get more accurate results

Additional details and impacted files

@@            Coverage Diff             @@
##           master     #262      +/-   ##
==========================================
- Coverage   95.58%   95.40%   -0.19%     
==========================================
  Files          24       24              
  Lines        2154     2175      +21     
  Branches      400      403       +3     
==========================================
+ Hits         2059     2075      +16     
- Misses         76       81       +5     
  Partials       19       19

| Files | Coverage Δ |
|---|---|
| src/scmdata/database/__init__.py | `100.00% <100.00%> (ø)` |
| src/scmdata/pyam_compat.py | `36.84% <100.00%> (ø)` |
| src/scmdata/run.py | `95.00% <89.18%> (-0.35%)` ⬇️ |
| src/scmdata/groupby.py | `82.19% <87.50%> (-0.07%)` ⬇️ |


znicholls

Nice, lot of work in here!

src/scmdata/groupby.py

LICENSE

src/scmdata/database/__init__.py

src/scmdata/groupby.py

mikapfl

Nice, both the apply_parallel and the typing improvements. Have you tested the typing e.g. on ndcs or should I?

src/scmdata/groupby.py

lewisjared · 2023-10-11T21:30:25Z

I've tested the functionality, but not the typing. It would be great if you could take a look

lewisjared · 2023-10-11T21:30:47Z

@mikapfl forgot to tag you in the previous comment

mikapfl · 2023-10-12T13:46:55Z

> I've tested the functionality, but not the typing. It would be great if you could take a look

ndc-quantification type-checks out of the box with this branch. Amazing, I fully expected some fallout, but everything's just fine.

mikapfl · 2023-10-12T13:52:36Z

You get some warnings of the form `error: Unused "type: ignore" comment [unused-ignore]`, because it can now figure out more types! Yay!

lewisjared · 2023-10-13T09:56:27Z

Great! I'll get this merged then

feat: Add apply_parallel function to RunGroupBy

0cc9bb8

Parallelises the apply call using joblib

lewisjared requested review from mikapfl and znicholls September 21, 2023 03:07

chore: Apply more type hints

08a00b7

This was referenced Sep 21, 2023

Type-checking columns #259

Closed

Type information for RunGroupBy.apply() #258

Open

chore: Missing ""

eca06f7

lewisjared commented Sep 21, 2023

chore: Type adjustments

d1f0564

lewisjared commented Sep 21, 2023

lewisjared added 2 commits September 21, 2023 15:21

chore: testing

f7a176e

chore: fix up coverage

5bd6eb3

lewisjared marked this pull request as ready for review September 27, 2023 01:44

znicholls approved these changes Oct 4, 2023

mikapfl approved these changes Oct 9, 2023

This was referenced Oct 12, 2023

Refactor apply_parallel #264

Closed

Refactor apply_parallel #268

Merged

docs: changelog

bada923

lewisjared merged commit 9adce7d into master Oct 13, 2023
15 checks passed

lewisjared deleted the speed-improvements branch October 13, 2023 10:26
