Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Add apply_parallel function to RunGroupBy #262

Merged
merged 7 commits into from
Oct 13, 2023

Conversation

lewisjared
Copy link
Collaborator

@lewisjared lewisjared commented Sep 21, 2023

A few speed improvements for large groupby operations

  • Adds the apply_parallel function to RunGroupBy which parallelises the apply operation
  • Enable run_append to append DataFrame's formatted like the result of run.timeseries()

While the use-case of the former is more obvious, being able to append pd.Dataframes often result in fewer calls to scmdata.ScmRun which can be expensive if performed at the bottom of a loop that is performed a lot of times. This arises in the case where the inner function of a groupby operation uses native pandas functionality. Without this optimisation, one must always convert back to an ScmRun object e.g.:

def inner(run: scmdata.ScmRun) -> scmdata.ScmRun:
    return run.process_over("variable", "sum", as_run=True)

res: scmdata.ScmRun = my_data.groupby("region", inner)

But his optimisation would allow for:

def inner(run: scmdata.ScmRun) -> pd.DataFrame:
    return run.process_over("variable", "sum", as_run=False)

res: scmdata.ScmRun = my_data.groupby("region", inner)

which doesn't require the extra conversion from DataFRame -> ScmRun before appending

The diff is quite busy as I added type-hints at the same time. Let me know if you would like me to split out the type hints.

Pull request

Please confirm that this pull request has done the following:

  • Tests added
  • Documentation added (where applicable)
  • Example added (either to an existing notebook or as a new notebook, where applicable)
  • Changelog in '/changelog' added

Parallelises the apply call using joblib
@codecov
Copy link

codecov bot commented Sep 21, 2023

Codecov Report

Attention: 9 lines in your changes are missing coverage. Please review.

Comparison is base (500aa22) 95.58% compared to head (5bd6eb3) 95.40%.
Report is 11 commits behind head on master.

❗ Current head 5bd6eb3 differs from pull request most recent head bada923. Consider uploading reports for the commit bada923 to get more accurate results

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #262      +/-   ##
==========================================
- Coverage   95.58%   95.40%   -0.19%     
==========================================
  Files          24       24              
  Lines        2154     2175      +21     
  Branches      400      403       +3     
==========================================
+ Hits         2059     2075      +16     
- Misses         76       81       +5     
  Partials       19       19              
Files Coverage Δ
src/scmdata/database/__init__.py 100.00% <100.00%> (ø)
src/scmdata/pyam_compat.py 36.84% <100.00%> (ø)
src/scmdata/run.py 95.00% <89.18%> (-0.35%) ⬇️
src/scmdata/groupby.py 82.19% <87.50%> (-0.07%) ⬇️

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@lewisjared lewisjared marked this pull request as ready for review September 27, 2023 01:44
Copy link
Collaborator

@znicholls znicholls left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, lot of work in here!

src/scmdata/groupby.py Show resolved Hide resolved
src/scmdata/groupby.py Show resolved Hide resolved
src/scmdata/groupby.py Show resolved Hide resolved
LICENSE Show resolved Hide resolved
src/scmdata/database/__init__.py Show resolved Hide resolved
src/scmdata/groupby.py Show resolved Hide resolved
src/scmdata/groupby.py Show resolved Hide resolved
src/scmdata/groupby.py Show resolved Hide resolved
Copy link
Member

@mikapfl mikapfl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, both the apply_parallel and the typing improvements. Have you tested the typing e.g. on ndcs or should I?

src/scmdata/groupby.py Show resolved Hide resolved
src/scmdata/groupby.py Show resolved Hide resolved
@lewisjared
Copy link
Collaborator Author

I've tested the functionality, but not the typing. It would be great if you could take a look

@lewisjared
Copy link
Collaborator Author

@mikapfl forgot to tag you in the previous comment

@mikapfl
Copy link
Member

mikapfl commented Oct 12, 2023

I've tested the functionality, but not the typing. It would be great if you could take a look

ndc-quantification type-checks out of the box with this branch. Amazing, I fully expected some fallout, but everything's just fine.

@mikapfl
Copy link
Member

mikapfl commented Oct 12, 2023

You get some warnings error: Unused "type: ignore" comment [unused-ignore], because it can now figure out more types! Yay!

This was referenced Oct 12, 2023
@lewisjared
Copy link
Collaborator Author

Great! I'll get this merged then

@lewisjared lewisjared merged commit 9adce7d into master Oct 13, 2023
15 checks passed
@lewisjared lewisjared deleted the speed-improvements branch October 13, 2023 10:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants