Modeling the president’s popularity

Data analysis

  • TODO Copy data analysis over from notebook


Polls are often released several times a month, and not always by the same pollsters. In this configuration we think it makes sense to aggregate polls and compute the popularity per month. Although polls give respondents the option not to answer, we choose to ignore this option; a poll thus consists of a number $N_{resp}$ of respondents, of whom $N_{+}$ have a positive opinion of the president. We model this with a Binomial response model:

  N_{+} \sim \mathrm{Binomial}(N_{resp},\;p^{+}_m)

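where $p^{+}_m$ is the /popularity/ of the president in month $m$, i.e. the probability that a person picked at random in the population has a positive opinion of the president’s action. The popularity $p^{+}_m$ in a given month is a function of three terms: the latent popularity $β_m$, the pollster bias $α_p$ and the polling-method bias $α_m$ (in $α_m$ the subscript $m$ denotes the method, not the month):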

p^{+}_m = \mathrm{invlogit}(\beta_m + \alpha_p + \alpha_m)
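
To make the link function concrete, here is a minimal sketch of the generative process for a single poll, using only the Python standard library. The function simulate_poll and the values of beta_m, alpha_p, alpha_m and n_resp are made up for illustration; they are not estimates from the data.

```python
import math
import random


def inv_logit(x):
    """Map a real number to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def simulate_poll(beta_m, alpha_p, alpha_m, n_resp, rng):
    """Draw N_+ ~ Binomial(n_resp, p) with p = invlogit(beta_m + alpha_p + alpha_m)."""
    p = inv_logit(beta_m + alpha_p + alpha_m)
    # A Binomial draw is a sum of n_resp independent Bernoulli(p) draws.
    n_plus = sum(rng.random() < p for _ in range(n_resp))
    return p, n_plus


rng = random.Random(0)  # fixed seed so the sketch is reproducible
# Hypothetical values, chosen only for illustration.
p, n_plus = simulate_poll(beta_m=0.2, alpha_p=-0.05, alpha_m=0.1, n_resp=1000, rng=rng)
print(f"p = {p:.3f}, N+ = {n_plus} out of 1000")
```

The three terms are summed on the logit scale, so none of them is a probability on its own; only their sum, passed through the inverse logit, is.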

Pollster and method bias

As we saw earlier, each pollster has its own bias. We model the bias of pollster $p$ as

  \alpha_p \sim \mathrm{Normal}(\mu_p, \sigma_p)

Similarly, we saw that each polling method had its own bias and we model it as:

  \alpha_m \sim \mathrm{Normal}(\mu_m, \sigma_m)

Intrinsic popularity

Except during the first month of a term, the president’s popularity in any given month depends on their popularity the previous month. We assume that the hidden state $β_m$, which represents the president’s intrinsic popularity, depends on the previous month’s state as

  \beta_{m} \sim \mathrm{Normal}(\beta_{m-1}, \sigma_\beta)
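
As an illustration of what this prior implies, the following sketch simulates one trajectory of the random walk and maps it to the probability scale. The starting value beta_0, the step size sigma_beta and the helper simulate_random_walk are assumptions made for the example, not values or code from the analysis.

```python
import math
import random


def inv_logit(x):
    return 1.0 / (1.0 + math.exp(-x))


def simulate_random_walk(beta_0, sigma_beta, num_months, rng):
    """Return [beta_0, beta_1, ...] with beta_m ~ Normal(beta_{m-1}, sigma_beta)."""
    betas = [beta_0]
    for _ in range(num_months - 1):
        # Each month is centred on the previous month's value.
        betas.append(rng.gauss(betas[-1], sigma_beta))
    return betas


rng = random.Random(42)  # fixed seed so the sketch is reproducible
# Hypothetical starting point and step size, for illustration only.
betas = simulate_random_walk(beta_0=0.3, sigma_beta=0.1, num_months=60, rng=rng)
popularity = [inv_logit(b) for b in betas]
print(f"first month: {popularity[0]:.3f}, last month: {popularity[-1]:.3f}")
```

A larger sigma_beta lets the popularity move more from one month to the next; with sigma_beta equal to zero the trajectory is constant.
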
import matplotlib.pyplot as plt
import arviz as az
import numpy as np
import pandas as pd
import pymc3 as pm
data = pd.read_csv('./popularity/plot_data/raw_polls.csv', parse_dates = True, index_col="Unnamed: 0")
data['year'] = data.index.year
data['month'] = data.index.month
data['sondage'] = data['sondage'].replace('Yougov', 'YouGov')
data['method'] = data['method'].replace('face-to-face&internet', 'face to face')
2002-05-15   chirac2    Ifop         924         phone       0.51          0.44  2002      5
2002-05-20   chirac2  Kantar         972  face to face       0.50          0.48  2002      5
2002-05-23   chirac2     BVA        1054         phone       0.52          0.37  2002      5
2002-05-26   chirac2   Ipsos         907         phone       0.48          0.48  2002      5
2002-06-16   chirac2    Ifop         974         phone       0.49          0.43  2002      6

Implementations in PyMC3

Pooled popularity / unpooled biases

We first consider a model whose parameters are completely pooled across presidential terms. For now we do not model respondents who neither approve nor disapprove.

pollster_id = pd.Categorical(data["sondage"]).codes
method_id = pd.Categorical(data["method"]).codes
# Month index within each president's term (0 is the first polled month of the term).
months = np.hstack(
    [pd.Categorical(data[data.president == president].index.to_period('M')).codes for president in data.president.unique()]
)
respondants = data["samplesize"].astype('int').values
approvals = (data['samplesize'] * data['p_approve']).astype('int').values
num_pollsters = len(np.unique(pollster_id))
num_method = len(np.unique(method_id))
num_months = np.max(months) + 1
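
The months index above is computed separately for each president, so index 0 is always the first polled month of a term and all terms share the same random-walk states. The sketch below reproduces the idea with the standard library on a few made-up rows (the president labels and dates are hypothetical). Like the categorical codes, it ranks the months that actually appear in the data, so a calendar month without any poll does not get an index of its own.

```python
from datetime import date


def months_into_term(rows):
    """rows: list of (president, poll_date). Returns one dense month index per row."""
    # Collect the distinct (year, month) pairs observed for each president.
    observed = {}
    for president, day in rows:
        observed.setdefault(president, set()).add((day.year, day.month))
    # Rank the observed months within each president's term, starting at 0.
    rank = {
        president: {ym: i for i, ym in enumerate(sorted(yms))}
        for president, yms in observed.items()
    }
    return [rank[president][(day.year, day.month)] for president, day in rows]


# Hypothetical rows, for illustration only.
rows = [
    ("A", date(2002, 5, 15)),
    ("A", date(2002, 5, 20)),
    ("A", date(2002, 6, 16)),
    ("B", date(2007, 5, 10)),
    ("B", date(2007, 7, 3)),  # no poll in June: July still gets index 1
]
print(months_into_term(rows))  # [0, 0, 1, 0, 1]
```
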

with pm.Model() as pooled_popularity:
    alpha_p = pm.Normal("alpha_p", 0, .15, shape=num_pollsters)
    alpha_m = pm.Normal("alpha_m", 0, .15, shape=num_method)

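    # NOTE: the arguments of the next two calls were lost in the source. The names
    # "mu" and "popularity" and the shape are inferred from later uses
    # (posterior['mu']); sigma=0.25 is an assumed placeholder, not the original value.
    mu = pm.GaussianRandomWalk("mu", sigma=0.25, shape=num_months)

    popularity = pm.Deterministic(
        "popularity",
        pm.math.invlogit(mu[months] + alpha_p[pollster_id] + alpha_m[method_id]),
    )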

    N_approve = pm.Binomial("N_approve", respondants, popularity, observed=approvals)
with pooled_popularity:
    posterior = pm.sample(1000, chains=2)
def inv_logit(x):
    return 1 / (1 + np.exp(-x))

We plot the posterior distributions of the pollsters’ biases $α_p$ and of the methods’ biases $α_m$.

There is a stark difference in terms of biases! Let us focus on pollsters’ and methods’ biases individually to see if the results match what we saw in the data.

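# Posterior mean of each bias, mapped to the probability scale (0.5 means no bias).
avg_pollster_bias = list(np.mean(inv_logit(posterior['alpha_p']), axis=0))
# Use .categories (sorted) rather than .unique() (order of appearance) so that the
# labels line up with the integer codes used to index alpha_p.
pollsters = pd.Categorical(data['sondage']).categories
dict(zip(pollsters, avg_pollster_bias))
avg_method_bias = list(np.mean(inv_logit(posterior['alpha_m']), axis=0))
method = pd.Categorical(data['method']).categories
dict(zip(method, avg_method_bias))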

The values above are the biases mapped through the inverse logit, so 0.5 corresponds to no bias ($α = 0$). If the value for $α_p$ or $α_m$ is smaller than 0.5, the bias is negative and is subtracted from the “real” popularity factor; in other words, the corresponding pollster or method is biased towards reporting lower popularity rates. If, on the other hand, it is greater than 0.5, the pollster or method is biased towards reporting higher popularity rates. The results shown above are thus consistent with the above data analysis.
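
Because the biases are added on the logit scale, their effect in percentage points depends on the baseline popularity. A minimal sketch, using made-up values of beta and alpha rather than posterior estimates (bias_in_points is a hypothetical helper):

```python
import math


def inv_logit(x):
    return 1.0 / (1.0 + math.exp(-x))


def bias_in_points(beta, alpha):
    """Shift in reported approval, in percentage points, caused by a bias alpha at latent level beta."""
    return 100.0 * (inv_logit(beta + alpha) - inv_logit(beta))


# Hypothetical values, for illustration only.
for beta in (-1.0, 0.0, 1.0):
    print(beta, round(bias_in_points(beta, alpha=-0.1), 2))
```

The same bias moves the reported approval most when the latent popularity is close to 50%, and less when it is far from it.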

Let us now plot the posterior curves for the evolution of the latent popularity:

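fig, ax = plt.subplots(figsize=(8,4.5))
# Overlay individual posterior draws of the latent popularity, on the probability scale.
for i in range(1000):
    ax.plot(range(posterior['mu'].shape[1]), inv_logit(posterior['mu'][i,:]), alpha=.005, color="blue")
ax.set_xlabel("Months into term")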

The model does learn the general decrease in popularity as the terms progress.

  • try different values for the prior variance of the random walk $σ_β$
  • learn the variance of the random walk $σ_β$
  • set up a prediction pipeline for the popularity over the next months