Integrate new version of the model that jointly models the competing risk incidences #53

Status: Merged (77 commits, merged Aug 20, 2024)
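For context, here is a minimal sketch of the workflow this PR converges on, reconstructed from the commit messages and the diff below. Treat it as illustrative: the exact defaults and output shapes are assumptions drawn from the changed files, not the verified public API.

# Sketch only: names and parameters are taken from the diff below, but the
# final public API is defined by the merged code, not by this snippet.
import numpy as np

from hazardous import SurvivalBoost
from hazardous.data import make_synthetic_competing_weibull

# Synthetic competing-risks data: y holds an "event" column (0 = censored,
# 1..K = competing event ids) and a "duration" column.
X, y = make_synthetic_competing_weibull(
    n_samples=3_000, n_events=3, return_X_y=True, random_state=0
)

model = SurvivalBoost(
    learning_rate=0.03,
    n_iter=100,
    ipcw_strategy="kaplan-meier",  # censoring weights via Kaplan-Meier
    show_progressbar=False,
    random_state=0,
).fit(X, y)

# Incidence curves for all competing events jointly, on a shared time grid;
# per commit e6f0d90 the per-event curves live on the second axis of a 3D
# array, i.e. cif[:, event_id] selects one event's curves.
time_grid = np.linspace(0, y["duration"].max(), num=100)
cif = model.predict_cumulative_incidence(X, time_grid)

Commits (77)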
99a4480  removing Incidence (juAlberge, Jun 19, 2024)
9b9735c  adding MultiIncidence to hazardosu (juAlberge, Jun 19, 2024)
a63ef45  reinserting example for multiincidence (juAlberge, Jun 19, 2024)
01b2629  Hardcode uniform time sampler to simplify estimator (ogrisel, Jul 1, 2024)
e0f9fd0  Remove any reference to the dataset-specific oracle scorer (ogrisel, Jul 1, 2024)
ab06498  Rename to SurvivalBoost (ogrisel, Jul 1, 2024)
0671d98  Update the .score method to work in the any event setting (ogrisel, Jul 1, 2024)
846c019  Rename test file (ogrisel, Jul 1, 2024)
d40906a  Fix test_gradient_boosting_incidence_parameter_tuning (ogrisel, Jul 1, 2024)
59e7edd  Simplify code (ogrisel, Jul 1, 2024)
c0e02eb  Raise error if hard-zero resampling censors too many events (ogrisel, Jul 1, 2024)
6deca3e  adding test (juAlberge, Jul 1, 2024)
306feb8  Refactor survival boost tests (ogrisel, Jul 1, 2024)
f3f1925  Fix missing assertion (ogrisel, Jul 1, 2024)
7e21186  fix TODO (Jul 1, 2024)
52134bd  fix linter (Jul 1, 2024)
f245f8a  Fix LaTeX (ogrisel, Jul 1, 2024)
fb78d83  Merge pull request #52 from soda-inria/enhance_survival_boost_test (ogrisel, Jul 1, 2024)
ca8e009  Merge remote-tracking branch 'origin/main' into multi-incidence (ogrisel, Jul 1, 2024)
3f47477  More informative error message for bad value for hard_zero_fraction (ogrisel, Jul 1, 2024)
527f88d  Remove untested code related to Cox PH based conditional censoring es… (ogrisel, Jul 2, 2024)
94e7397  Avoid using 'Est' as a short hand for 'Estimator' (ogrisel, Jul 2, 2024)
3c722ae  FIX accept both float and array-like for time_horizon in predict_proba (glemaitre, Jul 2, 2024)
a68a9e9  DOC update predict_proba docstring (glemaitre, Jul 2, 2024)
c14c21a  TST add a test that check the behaviour of predict_proba (glemaitre, Jul 2, 2024)
9f9f750  DOC add docstring to test (glemaitre, Jul 2, 2024)
7595c7f  TST check that probability sum to one for all events (glemaitre, Jul 2, 2024)
071fe63  DOC fix docstring type for time_horizon (glemaitre, Jul 2, 2024)
4376f36  Rename BaseIPCW to KaplanMeierIPCW. Adding some documentation. (juAlberge, Jul 2, 2024)
3b90460  API correct the API of predict_proba (glemaitre, Jul 2, 2024)
c9052c4  Use arturo formulation (glemaitre, Jul 2, 2024)
8ba6ef6  Silence 'DeprecationWarning: invalid escape sequence' by using raw do… (ogrisel, Jul 2, 2024)
b725b01  update docstring (glemaitre, Jul 2, 2024)
4cfb671  Simplify KaplanMeierIPCW and improve its docstring (ogrisel, Jul 2, 2024)
5880b63  adding some documentation to SurvivalBoost (juAlberge, Jul 2, 2024)
fcb5b34  More docstring fixes (ogrisel, Jul 2, 2024)
2323b3f  Apply suggestions from code review (glemaitre, Jul 3, 2024)
cdfca30  TST update test and code to follow scikit-learn convention (glemaitre, Jul 3, 2024)
7b68bba  TST check that we override properly (glemaitre, Jul 3, 2024)
841a9fa  Apply suggestions from code review (glemaitre, Jul 3, 2024)
f3580a6  DOC start documenting parameter (glemaitre, Jul 2, 2024)
2d2545b  DOC documents parameters in SurvivalBoost docstring (glemaitre, Jul 3, 2024)
0f1f3fd  DOC add ipcw estimator docstring (glemaitre, Jul 3, 2024)
642f414  DOC add attributes, reference, and examples (glemaitre, Jul 3, 2024)
c1078cc  DOC document some parameters and attributes of WeightedMultiClassTarg… (glemaitre, Jul 3, 2024)
6cff59d  Merge pull request #54 from glemaitre/bug_predict_proba (ogrisel, Jul 3, 2024)
0bf8542  API change the shape of the incidence curves (glemaitre, Jul 3, 2024)
d262cfe  DOC Use data generator in marginal cumulative incidence example (Jul 3, 2024)
8444471  FIX slice properly the prediction (glemaitre, Jul 3, 2024)
0ef3d78  fix install instructions with flit (judithabk6, Jul 2, 2024)
e786eec  MAINT Remove old python version in black config (#57) (jovan-stojanovic, Jul 3, 2024)
0802842  MAINT activate doctest check with pytest by default (glemaitre, Jul 3, 2024)
b837639  Fix a pandas warning in load_seer (ogrisel, Jul 3, 2024)
f8d3ad8  FIX do not pytest collect outside of the hazardous module folder (ogrisel, Jul 3, 2024)
23736de  FIX update the example to integrate on the right columns (glemaitre, Jul 3, 2024)
1af117b  Apply suggestions from code review (glemaitre, Jul 3, 2024)
40e1774  Merge pull request #56 from glemaitre/improve_doc (ogrisel, Jul 3, 2024)
6cb17cd  Set explicit value for n_events (Jul 3, 2024)
47ceb40  Merge remote-tracking branch 'origin/main' into modify_check_predict_… (glemaitre, Jul 3, 2024)
8a6bcaa  Merge remote-tracking branch 'origin/multi-incidence' into modify_che… (glemaitre, Jul 3, 2024)
f8defea  Improve variable names and comments in WeightedMultiClassTargetSampler (ogrisel, Jul 3, 2024)
dd7990b  Use n_horizons_per_observation=3 by default (ogrisel, Jul 3, 2024)
0a38c24  Update hazardous/_survival_boost.py (glemaitre, Jul 3, 2024)
e6f0d90  Remove reference to "column" to describe the second axis of a 3D array. (ogrisel, Jul 4, 2024)
259aeda  Merge pull request #61 from glemaitre/modify_check_predict_survival_b… (ogrisel, Jul 4, 2024)
c05f601  Improve docstrings for SurvivalBoost (ogrisel, Jul 4, 2024)
e5e7f33  Make sure to use zero-width param ranges in marginal example (ogrisel, Jul 4, 2024)
ee8af69  Merge branch 'multi-incidence' into use_data_generator (ogrisel, Jul 4, 2024)
d90cf4f  Silence AJ warning on tied events (ogrisel, Jul 4, 2024)
d0834c3  Small improvements (ogrisel, Jul 4, 2024)
5efd0ee  Merge pull request #62 from ArturoAmorQ/use_data_generator (ogrisel, Jul 4, 2024)
139ef51  Use a string in the public API of SurvivalBoost to select the censori… (ogrisel, Jul 4, 2024)
6c644a6  Keep IPCW estimator private API for now (ogrisel, Jul 4, 2024)
012cd5e  replacing any occurence of _est to _estimator (juAlberge, Jul 5, 2024)
176c54b  fixing math, harmonising different notations. (juAlberge, Jul 5, 2024)
c591791  Example survival analysis (#70) (juAlberge, Aug 15, 2024)
cc6b90c  enhance documentation (Vincent-Maladiere, Aug 20, 2024)
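Several of the commits above (3c722ae, 7595c7f, 3b90460) pin down the predict_proba contract: time_horizon may be a float or array-like, and the returned probabilities sum to one across the classes. A hedged sketch of the intended behaviour, continuing from the snippet above; the class ordering and output shape are assumptions inferred from the commit messages, not from the merged code:

# Hypothetical check mirroring the tests referenced in the commit list.
# Assumes one column per class: survival to any event plus each of the
# 3 competing events, i.e. shape (n_samples, n_events + 1).
proba = model.predict_proba(X, time_horizon=1_000.0)  # scalar horizon
assert proba.shape == (X.shape[0], 3 + 1)
np.testing.assert_allclose(proba.sum(axis=1), 1.0, rtol=1e-6)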
doc/api.rst (12 changes: 1 addition & 11 deletions)

@@ -15,7 +15,7 @@ Estimators
    :template: class.rst
    :nosignatures:

-   GradientBoostingIncidence
+   SurvivalBoost


 Metrics
@@ -42,13 +42,3 @@ Datasets
    data.make_synthetic_competing_weibull
    data.load_seer

-
-Inverse Probability Censoring Weight
-------------------------------------
-
-.. autosummary::
-   :toctree: generated/
-   :template: class.rst
-   :nosignatures:
-
-   IPCWEstimator
doc/conf.py (1 change: 1 addition & 0 deletions)

@@ -19,6 +19,7 @@
     'sphinx.ext.autosummary',
     "sphinx.ext.intersphinx",
     'numpydoc',
+    'sphinx_design',
 ]
 templates_path = ['_templates']
 exclude_patterns = ['_build', 'Thumbs.db', '.DS_Store']
examples/plot_marginal_cumulative_incidence_estimation.py (192 changes: 82 additions & 110 deletions)

@@ -4,76 +4,57 @@
 ==================================================

 This example demonstrates how to estimate the marginal cumulative incidence
-using :class:`hazardous.GradientBoostingIncidence` and compares the results to
-the Aalen-Johansen estimator and to the theoretical cumulated incidence curves
-on synthetic data.
+using :class:`hazardous.SurvivalBoost` and compares the results to the
+Aalen-Johansen estimator and to the theoretical cumulated incidence curves on
+synthetic data.

 Here the data is generated by taking the minimum time of samples from three
 competing Weibull distributions with fixed parameters and without any
 conditioning covariate. In this case, the Aalen-Johansen estimator is expected
 to be unbiased, and this is empirically confirmed by this example.

-The :class:`hazardous.GradientBoostingIncidence` estimator on the other hand is
-a predictive estimator that expects at least one conditioning covariate. In
-this example, we use a dummy covariate that is constant for all samples. Here
-we are not interested in the discrimination power of the estimator: there is
-none by construction, since we do not have access to informative covariates.
-Instead we empirically study its marginal calibration, that is, its ability to
+The :class:`hazardous.SurvivalBoost` estimator on the other hand is a
+predictive estimator that expects at least one conditioning covariate. In this
+example, we use a dummy covariate that is constant for all samples. Here we are
+not interested in the discrimination power of the estimator: there is none by
+construction, since we do not have access to informative covariates. Instead we
+empirically study its marginal calibration, that is, its ability to
 approximately recover an unbiased estimate of the marginal cumulative incidence
 function for each competing event.

-This example also highlights that :class:`hazardous.GradientBoostingIncidence`
-estimates noisy cumulative incidence functions, which are not smooth and not
-even monotonically increasing. This is a known limitation of the estimator,
-and attempting to enforce monotonicity at training time typically introduces
-severe over-estimation bias for large time horizons.
+This example also highlights that :class:`hazardous.SurvivalBoost` estimates
+noisy cumulative incidence functions, which are not smooth and not even
+monotonically increasing. This is a known limitation of the estimator, and
+attempting to enforce monotonicity at training time typically introduces severe
+over-estimation bias for large time horizons.
 """

 # %%
 from time import perf_counter
 import numpy as np
-from scipy.stats import weibull_min
-from sklearn.base import clone
-import pandas as pd
 import matplotlib.pyplot as plt

-from hazardous import GradientBoostingIncidence
+from hazardous import SurvivalBoost
+from hazardous.data import make_synthetic_competing_weibull
 from lifelines import AalenJohansenFitter

-rng = np.random.default_rng(0)
 n_samples = 3_000

-# Non-informative covariate because scikit-learn estimators expect at least
-# one feature.
-X_dummy = np.zeros(shape=(n_samples, 1), dtype=np.float32)
-
 base_scale = 1_000.0  # some arbitrary time unit

-distributions = [
-    {"event_id": 1, "scale": 10 * base_scale, "shape": 0.5},
-    {"event_id": 2, "scale": 3 * base_scale, "shape": 1},
-    {"event_id": 3, "scale": 3 * base_scale, "shape": 5},
-]
-event_times = np.concatenate(
-    [
-        weibull_min.rvs(
-            dist["shape"],
-            scale=dist["scale"],
-            size=n_samples,
-            random_state=rng,
-        ).reshape(-1, 1)
-        for dist in distributions
-    ],
-    axis=1,
+event_dist_shapes = (0.5, 1.0, 5.0)
+event_dist_scales = (10, 3, 3)
+n_events = len(event_dist_shapes)
+
+_, y_uncensored = make_synthetic_competing_weibull(
+    n_samples=n_samples,
+    n_events=n_events,
+    censoring_relative_scale=0,
+    return_X_y=True,
+    shape_ranges=[(shape, shape) for shape in event_dist_shapes],
+    scale_ranges=[(scale, scale) for scale in event_dist_scales],
+    base_scale=base_scale,
+    random_state=0,
 )
-first_event_idx = np.argmin(event_times, axis=1)
-
-y_uncensored = pd.DataFrame(
-    dict(
-        event=first_event_idx + 1,  # 0 is reserved as the censoring marker
-        duration=event_times[np.arange(n_samples), first_event_idx],
-    )
-)
 y_uncensored["event"].value_counts().sort_index()
 t_max = y_uncensored["duration"].max()

 # %%
@@ -88,7 +69,7 @@
 # distribution <https://en.wikipedia.org/wiki/Weibull_distribution>`_:


-def weibull_hazard(t, shape=1.0, scale=1.0, **ignored_kwargs):
+def weibull_hazard(t, shape=1.0, scale=1.0):
     # Plug an arbitrary finite hazard value at t==0 because fractional powers
     # of 0 are undefined.
     #
@@ -103,19 +84,25 @@ def weibull_hazard(t, shape=1.0, scale=1.0, **ignored_kwargs):
 #
 # Note that true CIFs are independent of the censoring distribution. We can use
 # them as reference to check that the estimators are unbiased by the censoring.
-# Here are the two estimators of interest:
+#
+# We first define the two estimators of interest. The
+# :class:`hazardous.SurvivalBoost` instance uses the Kaplan-Meier estimator on
+# the negated event labels (1 for censoring, 0 for any event) to estimate
+# internal IPCW weights. This is a valid choice in this context because we do
+# not have access to any informative covariate (either for censoring or for the
+# events of interest).

 calculate_variance = n_samples <= 5_000
 aj = AalenJohansenFitter(calculate_variance=calculate_variance, seed=0)

-gb_incidence = GradientBoostingIncidence(
+survival_boost = SurvivalBoost(
     learning_rate=0.03,
     n_iter=100,
     max_leaf_nodes=5,
     hard_zero_fraction=0.1,
     min_samples_leaf=50,
-    loss="ibs",
     show_progressbar=False,
+    ipcw_strategy="kaplan-meier",
     random_state=0,
 )
@@ -128,21 +115,25 @@ def weibull_hazard(t, shape=1.0, scale=1.0, **ignored_kwargs):
 # theoretical CIFs:


-def plot_cumulative_incidence_functions(distributions, y, gb_incidence=None, aj=None):
-    _, axes = plt.subplots(figsize=(12, 4), ncols=len(distributions), sharey=True)
+def plot_cumulative_incidence_functions(y, survival_boost=None, aj=None):
+    """Plot cause-specific cumulative incidence per event using a dummy covariate"""
+    _, axes = plt.subplots(figsize=(12, 4), ncols=n_events, sharey=True)

     # Compute the estimate of the CIFs on a coarse grid.
     coarse_timegrid = np.linspace(0, t_max, num=100)

     # Compute the theoretical CIFs by integrating the hazard functions on a
     # fine-grained time grid. Note that integration errors can accumulate quite
-    # quickly if the time grid is resolution too coarse, especially for the
+    # quickly if the time grid's resolution is too coarse, especially for the
     # Weibull distribution with shape < 1.
     tic = perf_counter()
     fine_time_grid = np.linspace(0, t_max, num=10_000_000)
     dt = np.diff(fine_time_grid)[0]
     all_hazards = np.stack(
-        [weibull_hazard(fine_time_grid, **dist) for dist in distributions],
+        [
+            weibull_hazard(fine_time_grid, shape, scale * base_scale)
+            for shape, scale in zip(event_dist_shapes, event_dist_scales)
+        ],
         axis=0,
     )
     any_event_hazards = all_hazards.sum(axis=0)
@@ -157,6 +148,20 @@ def plot_cumulative_incidence_functions(distributions, y, gb_incidence=None, aj=
         "Cause-specific cumulative incidence functions"
         f" ({censoring_fraction:.1%} censoring)"
     )
+    # Non-informative covariate because scikit-learn estimators expect at least
+    # one feature.
+    X_dummy = np.zeros(shape=(n_samples, 1), dtype=np.float32)
+    if survival_boost is not None:
+        tic = perf_counter()
+        survival_boost.fit(X_dummy, y)
+        duration = perf_counter() - tic
+        print(f"SurvivalBoost fit: {duration:.3f} s")
+        tic = perf_counter()
+        cif_preds = survival_boost.predict_cumulative_incidence(
+            X_dummy, coarse_timegrid
+        )
+        duration = perf_counter() - tic
+        print(f"SurvivalBoost prediction: {duration:.3f} s")

     for event_id, (ax, hazards_i) in enumerate(zip(axes, all_hazards), 1):
         theoretical_cif = (hazards_i * any_event_survival).cumsum(axis=-1) * dt
@@ -172,28 +177,22 @@ def plot_cumulative_incidence_functions(distributions, y, gb_incidence=None, aj=
             label="Theoretical incidence",
         ),

-        if gb_incidence is not None:
-            tic = perf_counter()
-            gb_incidence.set_params(event_of_interest=event_id)
-            gb_incidence.fit(X_dummy, y)
-            duration = perf_counter() - tic
-            print(f"GB Incidence for event {event_id} fit in {duration:.3f} s")
-            tic = perf_counter()
-            cif_pred = gb_incidence.predict_cumulative_incidence(
-                X_dummy[0:1], coarse_timegrid
-            )[0]
-            duration = perf_counter() - tic
-            print(f"GB Incidence for event {event_id} prediction in {duration:.3f} s")
+        if survival_boost is not None:
+            cif_pred = cif_preds[:, event_id][0]
             ax.plot(
                 coarse_timegrid,
                 cif_pred,
-                label="GradientBoostingIncidence",
+                label="SurvivalBoost",
             )
             ax.set(title=f"Event {event_id}")

         if aj is not None:
+            # Randomly break tied durations, to silence a warning raised by the
+            # Aalen-Johansen estimator.
+            rng = np.random.default_rng(0)
+            jitter = rng.normal(scale=1e-3, size=y.shape[0])
             tic = perf_counter()
-            aj.fit(y["duration"], y["event"], event_of_interest=event_id)
+            aj.fit(y["duration"] + jitter, y["event"], event_of_interest=event_id)
             duration = perf_counter() - tic
             print(f"Aalen-Johansen for event {event_id} fit in {duration:.3f} s")
             aj.plot(label="Aalen-Johansen", ax=ax)
@@ -205,7 +204,7 @@ def plot_cumulative_incidence_functions(distributions, y, gb_incidence=None, aj=


 plot_cumulative_incidence_functions(
-    distributions, gb_incidence=gb_incidence, aj=aj, y=y_uncensored
+    survival_boost=survival_boost, aj=aj, y=y_uncensored
 )

 # %%
@@ -216,31 +215,23 @@ def plot_cumulative_incidence_functions(distributions, y, gb_incidence=None, aj=
 # Add some independent censoring with some arbitrary parameters to control the
 # amount of censoring: lowering the expected value bound increases the amount
 # of censoring.
-censoring_times = weibull_min.rvs(
-    1.0,
-    scale=1.5 * base_scale,
-    size=n_samples,
-    random_state=rng,
-)
-y_censored = pd.DataFrame(
-    dict(
-        event=np.where(
-            censoring_times < y_uncensored["duration"], 0, y_uncensored["event"]
-        ),
-        duration=np.minimum(censoring_times, y_uncensored["duration"]),
-    )
+_, y_censored = make_synthetic_competing_weibull(
+    n_samples=n_samples,
+    n_events=n_events,
+    censoring_relative_scale=1.5,
+    return_X_y=True,
+    shape_ranges=[(shape, shape) for shape in event_dist_shapes],
+    scale_ranges=[(scale, scale) for scale in event_dist_scales],
+    base_scale=base_scale,
+    random_state=0,
 )
 y_censored["event"].value_counts().sort_index()

 # %%
-plot_cumulative_incidence_functions(
-    distributions, gb_incidence=gb_incidence, aj=aj, y=y_censored
-)
+plot_cumulative_incidence_functions(survival_boost=survival_boost, aj=aj, y=y_censored)
 # %%
 #
 # Note that the Aalen-Johansen estimator is unbiased and empirically recovers
 # the theoretical curves both with and without censoring. The
-# GradientBoostingIncidence estimator also appears unbiased by censoring, but
+# SurvivalBoost estimator also appears unbiased by censoring, but
 # the predicted curves are not smooth and not even monotonically increasing. By
 # adjusting the hyper-parameters, notably the learning rate, the number of
 # boosting iterations and leaf nodes, it is possible to somewhat control the
@@ -251,22 +242,3 @@ def plot_cumulative_incidence_functions(distributions, y, gb_incidence=None, aj=
 # Alternatively, we could try to enable a monotonicity constraint at training
 # time, however, in practice this often causes a severe over-estimation bias for
 # the large time horizons:
-
-# %%
-#
-# Finally let's try again to fit the GB Incidence models using a monotonicity
-# constraint:
-
-monotonic_gb_incidence = clone(gb_incidence).set_params(
-    monotonic_incidence="at_training_time"
-)
-plot_cumulative_incidence_functions(
-    distributions, gb_incidence=monotonic_gb_incidence, y=y_censored
-)
-# %%
-#
-# The resulting incidence curves are indeed monotonic. However, for smaller
-# training set sizes, the resulting models can be significantly biased, in
-# particular large time horizons, where the CIFs are getting flatter. This
-# effect diminishes with larger training set sizes (lower epistemic
-# uncertainty).
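The new narrative block in the example states that SurvivalBoost derives its internal IPCW weights from a Kaplan-Meier fit on the negated event labels (1 for censoring, 0 for any event). The PR keeps its KaplanMeierIPCW helper private (commit 6c644a6) and appears to have dropped the Cox PH based conditional censoring path (commit 527f88d), so here is a minimal standalone sketch of the weighting idea using lifelines, which the example already depends on. This is an illustration of the scheme, not the PR's implementation:

import numpy as np
from lifelines import KaplanMeierFitter


def kaplan_meier_ipcw(durations, events, eval_times):
    # Fit Kaplan-Meier on the negated labels: censoring (event == 0) plays
    # the role of the "event", and any actual event counts as censoring.
    km = KaplanMeierFitter()
    km.fit(durations, event_observed=(events == 0))
    # G(t): probability of still being uncensored at each evaluation time.
    g = km.survival_function_at_times(eval_times).to_numpy()
    # IPCW weight = 1 / G(t), clipped to avoid division by zero in the
    # tail of the censoring distribution.
    return 1.0 / np.clip(g, 1e-12, None)

These marginal weights are valid here precisely because the example has no informative covariates; with covariate-dependent censoring, a conditional censoring model would be required instead.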