feat: Add apply_parallel function to RunGroupBy #262
Parallelises the apply call using joblib
Codecov Report
Additional details and impacted files
@@ Coverage Diff @@
## master #262 +/- ##
==========================================
- Coverage 95.58% 95.40% -0.19%
==========================================
Files 24 24
Lines 2154 2175 +21
Branches 400 403 +3
==========================================
+ Hits 2059 2075 +16
- Misses 76 81 +5
Partials 19 19
Nice, lot of work in here!
Nice, both the apply_parallel and the typing improvements. Have you tested the typing, e.g. on ndcs, or should I?
I've tested the functionality, but not the typing. It would be great if you could take a look.
@mikapfl forgot to tag you in the previous comment
ndc-quantification type-checks out of the box with this branch. Amazing, I fully expected some fallout, but everything's just fine.
You get some warnings
Great! I'll get this merged then
A few speed improvements for large groupby operations:
- apply_parallel function to RunGroupBy, which parallelises the apply operation
- run_append to append DataFrames formatted like the result of run.timeseries()
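To illustrate the idea behind a parallel group-apply, here is a minimal sketch. It is not the code from this PR: the PR uses joblib, whereas this sketch swaps in the standard library's `concurrent.futures.ThreadPoolExecutor` so that it has no dependencies, and it groups plain dictionaries rather than an ScmRun. The names `apply_parallel_sketch` and `total_value` and the `n_jobs` parameter are hypothetical.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def apply_parallel_sketch(rows, key, func, n_jobs=2):
    """Group ``rows`` by ``key`` and apply ``func`` to each group concurrently.

    Hypothetical helper: a stand-in for the general pattern, not scmdata's API.
    """
    # Collect the rows of each group, keeping first-seen group order
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)

    # Executor.map returns results in the same order as its inputs,
    # so results line up with the group keys
    with ThreadPoolExecutor(max_workers=n_jobs) as pool:
        results = list(pool.map(func, groups.values()))

    return dict(zip(groups.keys(), results))


def total_value(group):
    """Example per-group function: sum the ``value`` field."""
    return sum(row["value"] for row in group)


rows = [
    {"scenario": "a", "value": 1.0},
    {"scenario": "b", "value": 2.0},
    {"scenario": "a", "value": 3.0},
]
totals = apply_parallel_sketch(rows, "scenario", total_value)
```

Threads only help when the per-group function releases the GIL or waits on I/O. For CPU-bound Python functions a process-based backend is usually the better fit, which is presumably why the PR uses joblib.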
While the use-case of the former is more obvious, being able to append pd.DataFrames often results in fewer calls to scmdata.ScmRun, which can be expensive if performed at the bottom of a loop that runs many times. This arises when the inner function of a groupby operation uses native pandas functionality. Without this optimisation, one must always convert back to an ScmRun object before appending. This optimisation allows the DataFrame to be appended as-is, which avoids the extra conversion from DataFrame -> ScmRun.
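The saving described above can be sketched without scmdata or pandas. In this illustrative example, the hypothetical `ExpensiveRun` class stands in for an ScmRun-like container whose construction is costly, lists of dictionaries stand in for DataFrames, and `append_runs` is a hypothetical append that accepts either form. None of these names come from the PR. The counter shows how many constructions each approach needs.

```python
class ExpensiveRun:
    """Hypothetical stand-in for a container that is costly to construct."""

    conversions = 0  # counts how many times the constructor has run

    def __init__(self, rows):
        ExpensiveRun.conversions += 1
        self.rows = list(rows)


def append_runs(items):
    """Hypothetical append: accepts ExpensiveRun objects or raw row lists.

    Raw rows are merged directly; only the final result is constructed.
    """
    rows = []
    for item in items:
        rows.extend(item.rows if isinstance(item, ExpensiveRun) else item)
    return ExpensiveRun(rows)


def process(group):
    """Stands in for native (pandas-style) work inside a groupby function."""
    return [{**row, "value": row["value"] * 2} for row in group]


groups = [
    [{"scenario": "a", "value": 1.0}],
    [{"scenario": "b", "value": 2.0}],
]

# Approach 1: wrap every intermediate result before appending
ExpensiveRun.conversions = 0
converted = append_runs([ExpensiveRun(process(g)) for g in groups])
conversions_with_wrapping = ExpensiveRun.conversions

# Approach 2: append the raw results and construct only once at the end
ExpensiveRun.conversions = 0
direct = append_runs([process(g) for g in groups])
conversions_direct = ExpensiveRun.conversions
```

Both approaches produce the same rows. The first constructs one container per group plus one for the result, while the second constructs only the final one, so the gap grows with the number of groups.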
The diff is quite busy as I added type-hints at the same time. Let me know if you would like me to split out the type hints.
Pull request
Please confirm that this pull request has done the following: