Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Modified Beta Geometric model #1301

Merged
merged 25 commits into from
Dec 24, 2024
Merged
Show file tree
Hide file tree
Changes from 4 commits
Commits
Show all changes
25 commits
Select commit Hold shift + click to select a range
d480c2b
Modified Beta Geometric model
PabloRoque Dec 19, 2024
efa1d04
Run pre-commit, skipping mypy unrelated errors
PabloRoque Dec 19, 2024
645f1fa
Fix convergence tests and add _model_type
PabloRoque Dec 19, 2024
6d3301c
Add ModBetaGeoModel to __all__
PabloRoque Dec 19, 2024
59e7722
Delete L81-82 comment on likelihood refactor
PabloRoque Dec 20, 2024
3cc020c
MBG tests coefficients match those of lifetimes
PabloRoque Dec 20, 2024
a993465
Delete test_numerically_stable_logp
PabloRoque Dec 20, 2024
7e1f364
Parametrize test_missing_cols
PabloRoque Dec 20, 2024
8b47ff3
Rename ModBetaGeoModel -> ModifiedBetaGeoModel
PabloRoque Dec 20, 2024
632ae67
Add MBG/NBD notebook
PabloRoque Dec 20, 2024
291f28d
Update paper reference to MBG/NBD model in docs/source/notebooks/clv/…
PabloRoque Dec 20, 2024
7101f80
Improve ModifiedBetaGeoModel docstring
PabloRoque Dec 20, 2024
a15a8d7
Updated expected_purchases docstring
PabloRoque Dec 20, 2024
2476268
Update expected_probability_alive doctring
PabloRoque Dec 20, 2024
c868668
Update expected_purchases_new_customer docstring
PabloRoque Dec 20, 2024
1ca5a9a
Mode dataframe definition out of with-block in test_customer_id_dupli…
PabloRoque Dec 20, 2024
3a30fc3
Reduce rtol in model convergence test
PabloRoque Dec 20, 2024
1b13a12
Raise NotImplementedError in ModifiedBetaGeoModel.expected_probabilit…
PabloRoque Dec 20, 2024
56e8ac0
Add Weibull Prior to notebook
PabloRoque Dec 20, 2024
208e4f9
Add expected_probability_no_purchase function signature. Remove assoc…
PabloRoque Dec 20, 2024
473616f
Merge branch 'main' into modified-beta-geo
PabloRoque Dec 23, 2024
7a065c6
Add MBG/NBD to clv_quickstart, index.md and README
PabloRoque Dec 23, 2024
2303f82
Remove Weibull priors section in mbg_nbd.ipynb
PabloRoque Dec 23, 2024
5a7ae75
Clear imports cell output
PabloRoque Dec 23, 2024
d494fff
Add MBG/NBD notebook to index.md
PabloRoque Dec 23, 2024
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 2 additions & 0 deletions pymc_marketing/clv/__init__.py
Original file line number Diff line number Diff line change
Expand Up @@ -18,6 +18,7 @@
BetaGeoModel,
GammaGammaModel,
GammaGammaModelIndividual,
ModBetaGeoModel,
ParetoNBDModel,
ShiftedBetaGeoModelIndividual,
)
Expand All @@ -40,6 +41,7 @@
"BetaGeoModel",
"GammaGammaModel",
"GammaGammaModelIndividual",
"ModBetaGeoModel",
"ParetoNBDModel",
"ShiftedBetaGeoModelIndividual",
"customer_lifetime_value",
Expand Down
2 changes: 2 additions & 0 deletions pymc_marketing/clv/models/__init__.py
Original file line number Diff line number Diff line change
Expand Up @@ -21,6 +21,7 @@
GammaGammaModel,
GammaGammaModelIndividual,
)
from pymc_marketing.clv.models.modified_beta_geo import ModBetaGeoModel
from pymc_marketing.clv.models.pareto_nbd import ParetoNBDModel
from pymc_marketing.clv.models.shifted_beta_geo import ShiftedBetaGeoModelIndividual

Expand All @@ -30,6 +31,7 @@
"CLVModel",
"GammaGammaModel",
"GammaGammaModelIndividual",
"ModBetaGeoModel",
"ParetoNBDModel",
"ShiftedBetaGeoModelIndividual",
)
316 changes: 316 additions & 0 deletions pymc_marketing/clv/models/modified_beta_geo.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,316 @@
# Copyright 2024 The PyMC Labs Developers
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Modified Beta-Geometric Negative Binomial Distribution (MBG/NBD) model for a non-contractual customer population across continuous time.""" # noqa: E501

import warnings

import numpy as np
import pandas as pd
import pymc as pm
import pytensor.tensor as pt
import xarray
from pymc.distributions.dist_math import check_parameters
from scipy.special import hyp2f1

from pymc_marketing.clv.models import BetaGeoModel
from pymc_marketing.clv.utils import to_xarray


class ModBetaGeoModel(BetaGeoModel):
ColtAllen marked this conversation as resolved.
Show resolved Hide resolved
r"""Also known as the MBG/NBD model.

Based on [5]_, [6]_, this model has the following assumptions:
ColtAllen marked this conversation as resolved.
Show resolved Hide resolved
1) Each individual, ``i``, has a hidden ``lambda_i`` and ``p_i`` parameter
2) These come from a population wide Gamma and a Beta distribution
respectively.
3) Individuals purchases follow a Poisson process with rate :math:`\lambda_i*t` .
4) At the beginning of their lifetime and after each purchase, an
individual has a p_i probability of dieing (never buying again).

References
----------
.. [5] Batislam, E.P., M. Denizel, A. Filiztekin (2007),
"Empirical validation and comparison of models for customer base
analysis,"
International Journal of Research in Marketing, 24 (3), 201-209.
.. [6] Wagner, U. and Hoppe D. (2008), "Erratum on the MBG/NBD Model,"
International Journal of Research in Marketing, 25 (3), 225-226.
"""

_model_type = "MBG/NBD"

def build_model(self) -> None: # type: ignore[override]
"""Build the model."""
coords = {"customer_id": self.data["customer_id"]}
with pm.Model(coords=coords) as self.model:
# purchase rate priors
alpha = self.model_config["alpha_prior"].create_variable("alpha")
r = self.model_config["r_prior"].create_variable("r")

# dropout priors
if "a_prior" in self.model_config and "b_prior" in self.model_config:
a = self.model_config["a_prior"].create_variable("a")
b = self.model_config["b_prior"].create_variable("b")
else:
# hierarchical pooling of dropout rate priors
phi_dropout = self.model_config["phi_dropout_prior"].create_variable(

Check warning on line 67 in pymc_marketing/clv/models/modified_beta_geo.py

View check run for this annotation

Codecov / codecov/patch

pymc_marketing/clv/models/modified_beta_geo.py#L67

Added line #L67 was not covered by tests
"phi_dropout"
)
kappa_dropout = self.model_config[

Check warning on line 70 in pymc_marketing/clv/models/modified_beta_geo.py

View check run for this annotation

Codecov / codecov/patch

pymc_marketing/clv/models/modified_beta_geo.py#L70

Added line #L70 was not covered by tests
"kappa_dropout_prior"
].create_variable("kappa_dropout")

a = pm.Deterministic("a", phi_dropout * kappa_dropout)
b = pm.Deterministic("b", (1.0 - phi_dropout) * kappa_dropout)

Check warning on line 75 in pymc_marketing/clv/models/modified_beta_geo.py

View check run for this annotation

Codecov / codecov/patch

pymc_marketing/clv/models/modified_beta_geo.py#L74-L75

Added lines #L74 - L75 were not covered by tests

def logp(t_x, x, a, b, r, alpha, T):
"""
Compute the log-likelihood of the BG/NBD model.

The log-likelihood expression here aligns with expression (4) from [3]
due to the possible numerical instability of expression (3).
ColtAllen marked this conversation as resolved.
Show resolved Hide resolved
"""
a1 = pt.gammaln(r + x) - pt.gammaln(r) + r * pt.log(alpha)
a2 = (
pt.gammaln(a + b)
+ pt.gammaln(b + x + 1)
- pt.gammaln(b)
- pt.gammaln(a + b + x + 1)
)
a3 = -(r + x) * pt.log(alpha + T)
a4 = (
pt.log(a)
- pt.log(b + x)
+ (r + x) * (pt.log(alpha + T) - pt.log(alpha + t_x))
)

logp = a1 + a2 + a3 + pt.logaddexp(a4, 0)

return check_parameters(
logp,
a > 0,
b > 0,
alpha > 0,
r > 0,
msg="a, b, alpha, r > 0",
)

pm.Potential(
ColtAllen marked this conversation as resolved.
Show resolved Hide resolved
"likelihood",
logp(
x=self.data["frequency"],
t_x=self.data["recency"],
a=a,
b=b,
alpha=alpha,
r=r,
T=self.data["T"],
),
)

def expected_num_purchases(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need this method just to deprecate?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, this should be deleted.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is to be deleted, then we should as well delete the associated BetaGeoModel.expected_num_purchases.

Notice that we are inheriting, and a user may by mistake call the method.

Copy link
Collaborator

@ColtAllen ColtAllen Dec 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Best to do this in a separate PR prior to merging this one. I don't think it should be added to the release notes because it's not deprecation in the truest sense of the term, but rather renaming a method.

self,
customer_id: np.ndarray | pd.Series,
t: np.ndarray | pd.Series | pt.TensorVariable,
frequency: np.ndarray | pd.Series | pt.TensorVariable,
recency: np.ndarray | pd.Series | pt.TensorVariable,
T: np.ndarray | pd.Series | pt.TensorVariable,
) -> xarray.DataArray:
r"""Compute the expected number of purchases for a customer.

This is a deprecated method and will be removed in a future release.
Please use `BetaGeoModel.expected_purchases` instead.
"""
warnings.warn(

Check warning on line 135 in pymc_marketing/clv/models/modified_beta_geo.py

View check run for this annotation

Codecov / codecov/patch

pymc_marketing/clv/models/modified_beta_geo.py#L135

Added line #L135 was not covered by tests
"Deprecated method. Use 'expected_purchases' instead.",
FutureWarning,
stacklevel=1,
)

t = np.asarray(t)
if t.size != 1:
t = to_xarray(customer_id, t)

Check warning on line 143 in pymc_marketing/clv/models/modified_beta_geo.py

View check run for this annotation

Codecov / codecov/patch

pymc_marketing/clv/models/modified_beta_geo.py#L141-L143

Added lines #L141 - L143 were not covered by tests

T = np.asarray(T)
if T.size != 1:
T = to_xarray(customer_id, T)

Check warning on line 147 in pymc_marketing/clv/models/modified_beta_geo.py

View check run for this annotation

Codecov / codecov/patch

pymc_marketing/clv/models/modified_beta_geo.py#L145-L147

Added lines #L145 - L147 were not covered by tests

# print(customer_id)
x, t_x = to_xarray(customer_id, frequency, recency)

Check warning on line 150 in pymc_marketing/clv/models/modified_beta_geo.py

View check run for this annotation

Codecov / codecov/patch

pymc_marketing/clv/models/modified_beta_geo.py#L150

Added line #L150 was not covered by tests

a, b, alpha, r = self._unload_params()

Check warning on line 152 in pymc_marketing/clv/models/modified_beta_geo.py

View check run for this annotation

Codecov / codecov/patch

pymc_marketing/clv/models/modified_beta_geo.py#L152

Added line #L152 was not covered by tests

hyp_term = hyp2f1(r + x, b + x + 1, a + b + x, t / (alpha + T + t))
first_term = (a + b + x) / (a - 1)
second_term = 1 - hyp_term * ((alpha + T) / (alpha + t + T)) ** (r + x)
numerator = first_term * second_term

Check warning on line 157 in pymc_marketing/clv/models/modified_beta_geo.py

View check run for this annotation

Codecov / codecov/patch

pymc_marketing/clv/models/modified_beta_geo.py#L154-L157

Added lines #L154 - L157 were not covered by tests

denominator = 1 + (a / (b + x)) * ((alpha + T) / (alpha + t_x)) ** (r + x)

Check warning on line 159 in pymc_marketing/clv/models/modified_beta_geo.py

View check run for this annotation

Codecov / codecov/patch

pymc_marketing/clv/models/modified_beta_geo.py#L159

Added line #L159 was not covered by tests

return (numerator / denominator).transpose(

Check warning on line 161 in pymc_marketing/clv/models/modified_beta_geo.py

View check run for this annotation

Codecov / codecov/patch

pymc_marketing/clv/models/modified_beta_geo.py#L161

Added line #L161 was not covered by tests
"chain", "draw", "customer_id", missing_dims="ignore"
)

def expected_purchases(
self,
data: pd.DataFrame | None = None,
*,
future_t: int | np.ndarray | pd.Series | None = None,
) -> xarray.DataArray:
r"""Compute the expected number of future purchases across *future_t* time periods given *recency*, *frequency*, and *T* for each customer.

The *data* parameter is only required for out-of-sample customers.

Adapted from equation (10) in [1]_, and *lifetimes* package:
https://github.com/CamDavidsonPilon/lifetimes/blob/41e394923ad72b17b5da93e88cfabab43f51abe2/lifetimes/fitters/beta_geo_fitter.py#L201

Parameters
----------
future_t : int, array_like
Number of time periods to predict expected purchases.
data : ~pandas.DataFrame
Optional dataframe containing the following columns:

* `customer_id`: Unique customer identifier
* `frequency`: Number of repeat purchases
* `recency`: Time between the first and the last purchase
* `T`: Time between first purchase and end of observation period; model assumptions require T >= recency

References
----------
.. [1] Fader, Peter S., Bruce G.S. Hardie, and Ka Lok Lee (2005a),
"Counting Your Customers the Easy Way: An Alternative to the
Pareto/NBD Model," Marketing Science, 24 (2), 275-84.
https://www.brucehardie.com/papers/bgnbd_2004-04-20.pdf

ColtAllen marked this conversation as resolved.
Show resolved Hide resolved
""" # noqa: E501
if data is None:
data = self.data

Check warning on line 199 in pymc_marketing/clv/models/modified_beta_geo.py

View check run for this annotation

Codecov / codecov/patch

pymc_marketing/clv/models/modified_beta_geo.py#L198-L199

Added lines #L198 - L199 were not covered by tests

if future_t is not None:
data = data.assign(future_t=future_t)

Check warning on line 202 in pymc_marketing/clv/models/modified_beta_geo.py

View check run for this annotation

Codecov / codecov/patch

pymc_marketing/clv/models/modified_beta_geo.py#L201-L202

Added lines #L201 - L202 were not covered by tests

dataset = self._extract_predictive_variables(

Check warning on line 204 in pymc_marketing/clv/models/modified_beta_geo.py

View check run for this annotation

Codecov / codecov/patch

pymc_marketing/clv/models/modified_beta_geo.py#L204

Added line #L204 was not covered by tests
data, customer_varnames=["frequency", "recency", "T", "future_t"]
)
a = dataset["a"]
b = dataset["b"]
alpha = dataset["alpha"]
r = dataset["r"]
x = dataset["frequency"]
t_x = dataset["recency"]
T = dataset["T"]
t = dataset["future_t"]

Check warning on line 214 in pymc_marketing/clv/models/modified_beta_geo.py

View check run for this annotation

Codecov / codecov/patch

pymc_marketing/clv/models/modified_beta_geo.py#L207-L214

Added lines #L207 - L214 were not covered by tests

hyp_term = hyp2f1(r + x, b + x + 1, a + b + x, t / (alpha + T + t))
first_term = (a + b + x) / (a - 1)
second_term = 1 - hyp_term * ((alpha + T) / (alpha + t + T)) ** (r + x)
numerator = first_term * second_term
denominator = 1 + (a / (b + x)) * ((alpha + T) / (alpha + t_x)) ** (r + x)

Check warning on line 220 in pymc_marketing/clv/models/modified_beta_geo.py

View check run for this annotation

Codecov / codecov/patch

pymc_marketing/clv/models/modified_beta_geo.py#L216-L220

Added lines #L216 - L220 were not covered by tests

return (numerator / denominator).transpose(

Check warning on line 222 in pymc_marketing/clv/models/modified_beta_geo.py

View check run for this annotation

Codecov / codecov/patch

pymc_marketing/clv/models/modified_beta_geo.py#L222

Added line #L222 was not covered by tests
"chain", "draw", "customer_id", missing_dims="ignore"
)

def expected_purchases_new_customer(
self,
data: pd.DataFrame | None = None,
*,
t: int | np.ndarray | pd.Series | None = None,
) -> xarray.DataArray:
r"""Compute the expected number of purchases for a new customer across *t* time periods.

Adapted from equation (9) in [1]_, and `lifetimes` library:
https://github.com/CamDavidsonPilon/lifetimes/blob/41e394923ad72b17b5da93e88cfabab43f51abe2/lifetimes/fitters/beta_geo_fitter.py#L328

Parameters
----------
t : array_like
Number of time periods over which to estimate purchases.

References
----------
.. [1] Fader, Peter S., Bruce G.S. Hardie, and Ka Lok Lee (2005a),
"Counting Your Customers the Easy Way: An Alternative to the
Pareto/NBD Model," Marketing Science, 24 (2), 275-84.
http://www.brucehardie.com/notes/021/palive_for_BGNBD.pdf

"""
ColtAllen marked this conversation as resolved.
Show resolved Hide resolved
# TODO: This is extraneous now, but needed for future covariate support.
if data is None:
data = self.data

Check warning on line 252 in pymc_marketing/clv/models/modified_beta_geo.py

View check run for this annotation

Codecov / codecov/patch

pymc_marketing/clv/models/modified_beta_geo.py#L251-L252

Added lines #L251 - L252 were not covered by tests

if t is not None:
data = data.assign(t=t)

Check warning on line 255 in pymc_marketing/clv/models/modified_beta_geo.py

View check run for this annotation

Codecov / codecov/patch

pymc_marketing/clv/models/modified_beta_geo.py#L254-L255

Added lines #L254 - L255 were not covered by tests

dataset = self._extract_predictive_variables(data, customer_varnames=["t"])
a = dataset["a"]
b = dataset["b"]
alpha = dataset["alpha"]
r = dataset["r"]
t = dataset["t"]

Check warning on line 262 in pymc_marketing/clv/models/modified_beta_geo.py

View check run for this annotation

Codecov / codecov/patch

pymc_marketing/clv/models/modified_beta_geo.py#L257-L262

Added lines #L257 - L262 were not covered by tests

hyp_term = hyp2f1(r, b + 1, a + b, t / (alpha + t))
first_term = b / (a - 1)
second_term = 1 - hyp_term * (alpha / (alpha + t)) ** (r)

Check warning on line 266 in pymc_marketing/clv/models/modified_beta_geo.py

View check run for this annotation

Codecov / codecov/patch

pymc_marketing/clv/models/modified_beta_geo.py#L264-L266

Added lines #L264 - L266 were not covered by tests

return (first_term * second_term).transpose(

Check warning on line 268 in pymc_marketing/clv/models/modified_beta_geo.py

View check run for this annotation

Codecov / codecov/patch

pymc_marketing/clv/models/modified_beta_geo.py#L268

Added line #L268 was not covered by tests
"chain", "draw", "customer_id", missing_dims="ignore"
)

def expected_probability_alive(
self,
data: pd.DataFrame | None = None,
) -> xarray.DataArray:
r"""Compute the probability a customer with history *frequency*, *recency*, and *T* is currently active.

The *data* parameter is only required for out-of-sample customers.

Adapted from page (2) in Bruce Hardie's notes [1]_, and *lifetimes* package:
https://github.com/CamDavidsonPilon/lifetimes/blob/41e394923ad72b17b5da93e88cfabab43f51abe2/lifetimes/fitters/beta_geo_fitter.py#L260

Parameters
----------
data : *pandas.DataFrame
Optional dataframe containing the following columns:

* `customer_id`: Unique customer identifier
* `frequency`: Number of repeat purchases
* `recency`: Time between the first and the last purchase
* `T`: Time between first purchase and end of observation period, model assumptions require T >= recency

References
----------
.. [1] Fader, P. S., Hardie, B. G., & Lee, K. L. (2008). Computing
P (alive) using the BG/NBD model. http://www.brucehardie.com/notes/021/palive_for_BGNBD.pdf.

"""
ColtAllen marked this conversation as resolved.
Show resolved Hide resolved
if data is None:
data = self.data

Check warning on line 300 in pymc_marketing/clv/models/modified_beta_geo.py

View check run for this annotation

Codecov / codecov/patch

pymc_marketing/clv/models/modified_beta_geo.py#L299-L300

Added lines #L299 - L300 were not covered by tests

dataset = self._extract_predictive_variables(

Check warning on line 302 in pymc_marketing/clv/models/modified_beta_geo.py

View check run for this annotation

Codecov / codecov/patch

pymc_marketing/clv/models/modified_beta_geo.py#L302

Added line #L302 was not covered by tests
data, customer_varnames=["frequency", "recency", "T"]
)

a = dataset["a"]
b = dataset["b"]
alpha = dataset["alpha"]
r = dataset["r"]
x = dataset["frequency"]
t_x = dataset["recency"]
T = dataset["T"]

Check warning on line 312 in pymc_marketing/clv/models/modified_beta_geo.py

View check run for this annotation

Codecov / codecov/patch

pymc_marketing/clv/models/modified_beta_geo.py#L306-L312

Added lines #L306 - L312 were not covered by tests

proba = 1.0 / (1 + (a / (b + x)) * ((alpha + T) / (alpha + t_x)) ** (r + x))

Check warning on line 314 in pymc_marketing/clv/models/modified_beta_geo.py

View check run for this annotation

Codecov / codecov/patch

pymc_marketing/clv/models/modified_beta_geo.py#L314

Added line #L314 was not covered by tests

return proba.transpose("chain", "draw", "customer_id", missing_dims="ignore")

Check warning on line 316 in pymc_marketing/clv/models/modified_beta_geo.py

View check run for this annotation

Codecov / codecov/patch

pymc_marketing/clv/models/modified_beta_geo.py#L316

Added line #L316 was not covered by tests
Loading
Loading