Why is the Dirichlet distribution the prior for our betas within the WeightedSumFitter class? #326
Hi @jbordon619. Thanks for the question. I'll see what I can do to help and explain. If we consider the synthetic control model in general, we are basically modelling the synthetic control unit as a weighted sum of the control units. For this general model there is no necessity to add a constraint on the sum of the weights; we could just use an unconstrained prior (for example a Normal) on each weight. But constraining the weights to be non-negative and to sum to 1 means the synthetic control is an interpolation of the control units (a weighted average) rather than an extrapolation, and it makes each weight directly interpretable as the proportion that control unit contributes.
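To make that concrete, here is a minimal sketch of what such an unconstrained model could look like in PyMC. The data, shapes, and variable names are invented purely for illustration; this is not the CausalPy implementation.

import numpy as np
import pymc as pm

# Made-up example data: 3 control units observed over 50 pre-treatment time points,
# plus the treated unit's outcome over the same period.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([0.2, 0.5, 0.3]) + rng.normal(0, 0.1, size=50)

with pm.Model() as unconstrained_model:
    # One weight per control unit; nothing forces the weights to be positive
    # or to sum to 1.
    beta = pm.Normal("beta", mu=0, sigma=1, shape=3)
    sigma = pm.HalfNormal("sigma", sigma=1)
    mu = pm.Deterministic("mu", pm.math.dot(X, beta))
    pm.Normal("y_hat", mu=mu, sigma=sigma, observed=y)
    idata = pm.sample()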
Ok, so now I think we've covered some motivation for why the sum of the weights should be constrained to 1. The next question is: what kind of prior could we use in order to achieve this? The Dirichlet distribution is quite handy here because it's a multivariate distribution and any sample drawn from it will always sum to 1. You can run something like this to see:
>>> import pymc as pm
>>> d = pm.Dirichlet.dist([1., 1., 1., 1.])
>>> draws = pm.draw(d, draws=10)
>>> draws
array([[0.15981701, 0.10814745, 0.68524163, 0.04679391],
[0.38037676, 0.48804699, 0.06844366, 0.06313259],
[0.4849404 , 0.25824312, 0.16171177, 0.09510471],
[0.48553828, 0.25301148, 0.02597401, 0.23547624],
[0.2162953 , 0.46682933, 0.14262581, 0.17424957],
[0.06870385, 0.10662605, 0.36808074, 0.45658937],
[0.08015813, 0.25092438, 0.52085491, 0.14806258],
[0.69236722, 0.16837711, 0.13146731, 0.00778836],
[0.50637089, 0.31081917, 0.06371161, 0.11909833],
[0.52838634, 0.13247783, 0.22715454, 0.1119813 ]])
>>> draws.sum(axis=1)
array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])

The Dirichlet distribution is also quite nice because the hyperparameters (the concentration parameters) let you encode any prior knowledge you might have about the relative weights. Right now, the WeightedSumFitter uses a flat Dirichlet prior with all concentration parameters set to 1, which is uniform over all weight vectors that sum to 1.

Briefly on the conjugacy: MCMC methods don't require us to use conjugate distributions. However, in PyMC there is a lot of cool work happening with automatic graph re-writing, so in the future there may be automatic detection of conjugate distributions, which could allow graph re-writes that give significant computational speed-ups. But that specific point is probably best discussed on the PyMC discourse.
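Coming back to the prior itself, here is a rough sketch of a weighted-sum model with a Dirichlet prior on the weights. It mirrors the general structure described above rather than quoting the actual WeightedSumFitter code, and it reuses the same kind of made-up data as the earlier sketch.

import numpy as np
import pymc as pm

# Made-up example data: 3 control units observed over 50 pre-treatment time points.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([0.2, 0.5, 0.3]) + rng.normal(0, 0.1, size=50)

with pm.Model() as weighted_sum_model:
    # Dirichlet prior: the weights are non-negative and sum to 1.
    # a=np.ones(3) is flat over the simplex; something like a=[5, 1, 1] would
    # encode a prior belief that the first control unit deserves more weight.
    beta = pm.Dirichlet("beta", a=np.ones(3))
    sigma = pm.HalfNormal("sigma", sigma=1)
    mu = pm.Deterministic("mu", pm.math.dot(X, beta))
    pm.Normal("y_hat", mu=mu, sigma=sigma, observed=y)
    idata = pm.sample()

# Every posterior draw of beta sums to 1 by construction:
print(np.allclose(idata.posterior["beta"].values.sum(axis=-1), 1.0))  # True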
I've been diving into the code for causalpy to better understand what's going on under the hood, so I can maybe apply it elsewhere in the future. During my dive I found the Dirichlet distribution.
I didn't know about this distribution at all previously, but what I gathered is that it's similar to the beta distribution except it can handle more outcomes than just "successes" and "failures", and its samples are vectors that sum to one. From the outputs I've been getting from WeightedSumFitter, I see that the beta values seem to add to one. Is this correct?
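For example, a two-component Dirichlet behaves like a Beta distribution on its first component, and every draw sums to one. A quick check with PyMC (arbitrary parameter values, just for illustration):

import pymc as pm

# Draws from a 2-component Dirichlet(2, 5): each row is a pair (p, 1 - p).
pairs = pm.draw(pm.Dirichlet.dist([2.0, 5.0]), draws=100_000, random_seed=1)

# The first component should be distributed as Beta(2, 5).
betas = pm.draw(pm.Beta.dist(alpha=2.0, beta=5.0), draws=100_000, random_seed=2)

print(pairs.sum(axis=1)[:5])             # each pair sums to 1
print(pairs[:, 0].mean(), betas.mean())  # both means are close to 2 / (2 + 5)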
If it is correct, how do you feel about this constraint when we set up our equation using the "control" geos to predict the "test" geo?
(Sorry if this second part doesn't make sense; I'm still struggling to understand some core concepts.)
I also found that the Dirichlet distribution is good for Bayesian modeling at scale because of its conjugacy. I thought conjugacy only mattered for Bayesian modeling that isn't based on Markov chains? Does conjugacy help our chains converge faster?