---
title: "A roadmap for statistical genetics"
author: Taotao Tan
output:
pdf_document: default
html_document: default
date: "2024-07-26"
urlcolor: blue
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
# A roadmap for statistical genetics
The field of statistical genetics involves many mathematical derivations, which are easily forgotten unless they are well documented. This is my personal documentation of essential results in statistical genetics.
* [Topic 1](#Topic1): Linear regression with scaled genotype and phenotype.
* [Topic 2](#Topic2): LD score regression.
* [Topic 3](#Topic3): Polygenic score.
* [Topic 4](#Topic4): TWAS.
* [Topic 5](#Topic5): Fine-mapping.
\pagebreak
### Topic 1: Linear regression with scaled genotype and phenotype {#Topic1}
Statistical geneticists often scale genotypes and phenotypes before GWAS analysis. This procedure typically simplifies the mathematics and has a few important consequences. They also tend to parameterize the model with heritability. Let's define some notation that we will use throughout this documentation.
$$
\begin{split}
G: &\text{ standardized genotype matrix} \\
y: &\text{ standardized phenotype} \\
\lambda, \hat\lambda: &\text{ true and estimated causal effect sizes} \\
\beta, \hat\beta: &\text{ true and estimated marginal effect sizes} \\
R: &\text{ LD (correlation) matrix} \\
h^2: &\text{ narrow sense heritability} \\
N: &\text{ sample size} \\
M: &\text{ number of variants } \\
\end{split}
$$
The model we are considering is
$$
\begin{split}
y &= G \lambda + \varepsilon \\
\varepsilon &\sim N(0, (1 - h^2) I)
\end{split}
$$
Assuming each variant explains only a tiny fraction of the phenotypic variation, I present a few important results:
1. The GWAS effect size estimate is $\hat \beta = \frac{1}{N}G^Ty$.
2. The GWAS standard error is a constant, $s.e. = 1/\sqrt{N}$.
3. The GWAS z-score is $\hat\beta \sqrt{N}$.
4. The LD matrix is $R = \frac{G^T G}{N}$. It is not always invertible, but we can add a small diagonal matrix for regularization.
5. The multiple-regression estimate is $\hat \lambda = R^{-1} \hat \beta$.
6. The underlying genetic value is $G \lambda$; therefore the heritability is $h^2 = \frac{\lambda^T G^T G \lambda}{N} = \lambda^T R \lambda = \beta^T R^{-1} \beta$.
7. $\hat \lambda$ is an unbiased estimator of $\lambda$. More specifically, $\hat \lambda \sim MVN(\lambda, \frac{1 - h^2}{N} R^{-1} )$.
8. $\hat \beta$ is an unbiased estimator of $\beta$. More specifically, $\hat \beta = R \hat \lambda \sim MVN(R\lambda, \frac{1 - h^2}{N} R)$. The diagonal elements are the squared standard errors of the individual markers, which are usually close to $1/N$ when heritability is small.
Additionally, because GWAS sample sizes are typically large, degree-of-freedom adjustments such as $n - k$ are ignored; in other words, GWAS inference is asymptotic.
Surprisingly, covariates are often ignored in the formulation $y = G \lambda + \varepsilon$. Some analyses regress out covariates first; otherwise, omitted covariates can induce $\mathbb{C}ov[G, \varepsilon] \neq 0$. It is therefore important to think carefully about covariates.
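As a minimal sketch of the residualization route (the covariates `age` and `sex`, and all effect sizes below, are hypothetical choices for illustration only):
```{r}
# a minimal sketch of regressing covariates out of the phenotype before GWAS;
# the covariates and effect sizes are hypothetical illustrations
set.seed(1)
n = 1000
C = cbind(age = rnorm(n), sex = rbinom(n, 1, 0.5)) # hypothetical covariates
g = scale(rbinom(n, 2, 0.3))                       # one standardized variant
y = 0.1 * g + 0.3 * scale(C[, "age"]) + rnorm(n)
y_adj = scale(resid(lm(y ~ C)))                    # residualize, then re-standardize
coef(summary(lm(y_adj ~ g)))["g", ]                # GWAS on the adjusted phenotype
```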
Here is a demonstration using simulated data:
```{r}
path = "https://www.mv.helsinki.fi/home/mjxpirin/GWAS_course/material/APOE_1000G_FIN_74SNPS."
haps = read.table(paste0(path,"txt"))
info = read.table(paste0(path,"legend.txt"),header = T, as.is = T)
n = 1000
G = as.matrix(haps[sample(1:nrow(haps), size = n, repl = T),] +
haps[sample(1:nrow(haps), size = n, repl = T),])[,1:12]
row.names(G) = NULL
colnames(G) = NULL
G_ = scale(G)
# simulate 3 causal SNPs
lambda = rep(0, ncol(G_))
lambda[c(3, 4, 7)] = c(0.3, -0.1, 0.15)
R = t(G_) %*% G_ /n # LD matrix
h2 = t(lambda) %*% R %*% lambda # heritability
# the noise is random; re-scale it so its variance is exactly 1 - h2
err = scale(rnorm(n))
err_ = (err - mean(err)) * c(sqrt(1 - h2))
y = G_ %*% lambda + err_
y_ = scale(y)
### Key results:
beta_hat = 1/n * t(G_) %*% y_ # GWAS effect size estimates
z = sqrt(n) * beta_hat # z-score estimates
se_method1 = 1/sqrt(n) # standard error
pval = pchisq(z^2, df = 1, lower.tail = F) # p value
beta = R %*% lambda
# another way to compute standard error, which is pretty close to 1/sqrt(n)
se_method2 = sqrt(diag(as.numeric((1 - h2)/n) * R))
```
*Observation*: In reality, we don't know which SNPs are causal or their magnitudes. It is possible to use the estimated marginal effect sizes to estimate heritability via $\hat h^2 = \hat \beta^T R^{-1} \hat \beta$, but this relies on inverting the LD matrix, which is often practically infeasible. In the toy dataset, this approach overestimates the true heritability.
Another approach is to use the variant with the most significant p-value to estimate $h^2$. The toy dataset suggests this underestimates the heritability, perhaps because a single variant does not carry all the information in the locus.
```{r}
t(beta_hat) %*% solve(R) %*% beta_hat # use all variants
var(G_[,3] * beta_hat[3]) # use top variants
```
\pagebreak
### Topic 2: LD score regression. {#Topic2}
LDSC was proposed in [this](https://www.nature.com/articles/ng.3211) landmark paper, which describes how LD affects the probability of a variant being significant. Under the infinitesimal model, LDSC states that $\mathbb{E}[\chi_j^2] = \frac{Nh^2}{M} l_j + 1$, where $l_j \equiv \sum_{k = 1}^M r_{jk}^2$ is the LD score. To carry out the derivation, one must treat the effect sizes as random: $\lambda_j \sim N(0, \frac{h^2}{M})$.
In GWAS, the marginal effect size estimate (conditional on the true marginal effect size) is normally distributed: $\hat \beta_j | \beta_j \sim N(\beta_j, \frac{1}{N})$. Equivalently, $\hat \beta_j | \lambda \sim N(\sum_{k = 1}^{M} r_{jk} \lambda_k, \frac{1}{N})$.
I first state some quantities that will be useful for the derivation; these should be easy to verify:
$$
\begin{split}
\mathbb{E}[\lambda_j] &= 0 \\
\mathbb{E}[\lambda_j^2] &= \frac{h^2}{M} \\
\mathbb{E}[\hat \beta_j | \lambda] &= \sum_{k = 1}^{M} r_{jk} \lambda_k \\
\mathbb{V}ar[\hat \beta_j | \lambda] &= \frac{1}{N} \\
\mathbb{E}[\hat \beta_j^2 | \lambda] &= \mathbb{V}ar[\hat \beta_j | \lambda] + \mathbb{E}^2[\hat \beta_j | \lambda] = \frac{1}{N} + (\sum_{k = 1}^{M} r_{jk} \lambda_k )^2
\end{split}
$$
Before we investigate $\mathbb{E}[\chi_j^2]$, let's express $\mathbb{E}[\hat \beta_j^2]$:
$$
\begin{split}
\mathbb{E}[\hat \beta_j^2] &= \mathbb{E}[ \mathbb{E}[\hat \beta_j^2 \mid \lambda]] \\
&= \mathbb{E}[\frac{1}{N} + (\sum_{k = 1}^{M} r_{jk} \lambda_k )^2] \\
&= \frac{1}{N} + \mathbb{E}[ (\sum_{k = 1}^M r_{jk} \lambda_k)^2 ] \\
&= \frac{1}{N} + \mathbb{E}[ (r_{j1} \lambda_1 + r_{j2} \lambda_2 + ...)^2 ] \\
&= \frac{1}{N} + \mathbb{E}[ \sum_{k = 1}^{M} (r_{jk} \lambda_k )^2 + 2 \sum_{p \neq q} r_{jp} r_{jq} \lambda_p \lambda_q ] \\
&= \frac{1}{N} + \sum_{k = 1}^M r_{jk}^2 \, \mathbb{E}[\lambda_k^2] \quad \text{(cross terms vanish: } \mathbb{E}[\lambda_p \lambda_q] = 0 \text{ for } p \neq q \text{)} \\
&= \frac{1}{N} + \sum_{k = 1}^M r_{jk}^2 \cdot \frac{h^2}{M} \\
&= \frac{h^2}{M} l_j + \frac{1}{N}
\end{split}
$$
Further,
$$
\begin{split}
\mathbb{E}[\chi_j^2] & = \mathbb{E}[(\frac{\hat \beta_j}{1/ \sqrt{N}})^2] \\
&= N \mathbb{E}[\hat \beta_j^2] \\
&= \frac{Nh^2}{M} l_j + 1
\end{split}
$$
The derivation builds on the insight that only marginal effect sizes are observed. Therefore, we study the statistical properties of the **marginal distribution of the marginal effect sizes** (i.e., $p(\hat \beta)$, not the conditional distribution $p(\hat \beta \mid \lambda)$). Biologically, if a variant has more LD friends, it is more likely to be significant. LDSC has been further extended to study binary traits, heritability partitioning, and genetic correlation between traits.
Here is a simulation (with some code borrowed from [Matti Pirinen's](https://www.mv.helsinki.fi/home/mjxpirin/GWAS_course/) excellent tutorial).
```{r}
path = "https://www.mv.helsinki.fi/home/mjxpirin/GWAS_course/material/APOE_1000G_FIN_74SNPS."
haps = read.table(paste0(path,"txt"))
info = read.table(paste0(path,"legend.txt"),header = T, as.is = T)
n = 1000
G = as.matrix(haps[sample(1:nrow(haps), size = n, repl = T),] + haps[sample(1:nrow(haps), size = n, repl = T),])
row.names(G) = NULL
colnames(G) = NULL
G_ = scale(G)
h2 = 0.05
R = t(G_) %*% G_ /n # LD matrix
get_chi2 <- function(G_, h2){
  lambda = rnorm(ncol(G_), mean = 0, sd = sqrt(h2/ncol(G_)))
  err_ = scale(rnorm(n)) * c(sqrt(1 - h2))  # noise with variance exactly 1 - h2
  y_ = scale(G_ %*% lambda + err_)          # standardized phenotype
  beta_hat = t(G_) %*% y_ / n               # GWAS marginal effect sizes
  chi2 = n * beta_hat^2                     # chi-square statistics
  return(as.vector(chi2))
}
# repeat 300 times and average to approximate the expectation
chi2_simulations = replicate(n = 300, get_chi2(G_, h2))
exp_chi2 = rowMeans(chi2_simulations)
ldsc = rowSums(R^2)
plot(ldsc, exp_chi2)
abline(lm(exp_chi2 ~ ldsc), col = "red")
abline(a = 1, b = (n * h2 / ncol(G)), col = "blue")
legend(4, 10, legend=c("Fitted line", "Predicted line"),
fill = c("red","blue"))
```
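As a follow-up sketch (reusing `exp_chi2`, `ldsc`, `G`, `h2`, and `n` from the chunk above), the fitted slope can be turned back into an $h^2$ estimate, since the regression slope is $Nh^2/M$:
```{r}
# recover h2 from the LDSC slope: slope = N * h2 / M, so h2 = slope * M / N
slope = coef(lm(exp_chi2 ~ ldsc))["ldsc"]
h2_est = as.numeric(slope * ncol(G) / n)
c(true_h2 = h2, estimated_h2 = h2_est)
```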
Thanks to Arslan Zaidi's comment - the simulation now aligns with the derivation :)
\pagebreak
### Topic 3: Polygenic Score {#Topic3}
Polygenic scores (PRS) quantify the genetic liability to a disease or trait. Given training data, we might compute the polygenic score for a testing cohort as $PRS_i = \sum_{j = 1}^{M} \hat \beta_j G_{ij}$. Most PRS methods papers, such as [PRS-CS](https://www.nature.com/articles/s41467-019-09718-5) and [LDPred](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4596916/), aim to recover the causal effects $\lambda$ from the observed marginal effect size estimates $\hat \beta_j$. Here let's consider an infinitesimal model (LDpred-inf).
We assume the causal effect sizes $\lambda \sim MVN(0, \frac{h^2}{M}I)$ (the infinitesimal model). From Topic 1, we also have $\hat \beta | \lambda \sim MVN(R \lambda, \frac{1 - h^2}{N}R)$. The Bayesian inference recipe with a conjugate normal prior gives (following this [document](https://gregorygundersen.com/blog/2020/11/18/bayesian-mvn/)):
$$
\begin{split}
p(\lambda \mid \hat \beta) &\propto f(\hat \beta \mid \lambda) \cdot f(\lambda) \\
&\propto \exp \{ - \frac{1}{2} (\hat \beta - R\lambda )^T (\frac{1 - h^2}{N}R)^{-1} (\hat \beta - R\lambda ) \} \cdot \exp \{ - \frac{1}{2}\lambda^T (\frac{h^2}{M} I)^{-1} \lambda \} \\
&\propto \exp\{ - \frac{1}{2}[\frac{N}{1 - h^2}\cdot (\hat \beta - R\lambda )^T R^{-1}(\hat \beta - R\lambda ) +\frac{M}{h^2} \lambda^T \lambda ] \} \\
&\propto \exp \{- \frac{1}{2} [\frac{N}{1 - h^2} \cdot (\hat \beta^T R^{-1} \hat \beta - \hat \beta^T R^{-1} R \lambda -\lambda^T R R^{-1}\hat \beta + \lambda^T RR^{-1}R\lambda) + \frac{M}{h^2} \lambda^T \lambda] \} \\
&\propto \exp \{- \frac{1}{2} [\lambda^T(\frac{N}{1 - h^2}R + \frac{M}{h^2} I)\lambda - 2 \frac{N}{1 - h^2} \hat \beta^T \lambda ] \}
\end{split}
$$
Let $K = \frac{N}{1 - h^2}R + \frac{M}{h^2} I$ and $b = \frac{N}{1 - h^2} \hat \beta$; applying the ["completing the square" technique](https://gregorygundersen.com/blog/2019/09/18/completing-the-square/), we have:
$$
\begin{split}
p(\lambda \mid \hat \beta) &\propto f(\hat \beta \mid \lambda) \cdot f(\lambda) \\
&\propto \exp \{ -\frac{1}{2} (\lambda -K^{-1}b)^T K (\lambda -K^{-1}b) \}
\end{split}
$$
Therefore, the posterior distribution of the causal effect size is
$$
\begin{split}
\lambda \mid \hat \beta &\sim MVN(K^{-1}b, K^{-1}) \\
&\sim MVN((R + \frac{M(1 - h^2)}{Nh^2} I)^{-1} \hat\beta, [\frac{N}{1 - h^2}R + \frac{M}{h^2} I]^{-1})
\end{split}
$$
One might argue that the heritability of a single region is small enough that $1 - h^2 \approx 1$. We can then further simplify the expression and obtain the posterior mean and variance of the causal effect sizes:
$$
\begin{split}
\mathbb{E}[\lambda \mid \hat \beta ] &= (R + \frac{M}{Nh^2} I)^{-1} \hat\beta \\
\mathbb{V}ar[\lambda \mid \hat \beta ] &= [NR + \frac{M}{h^2} I]^{-1}
\end{split}
$$
This expression is identical to equation 13 in the [PRS-CS paper](https://www.nature.com/articles/s41467-019-09718-5). A few more remarks about this model:
1. Both the PRS-CS and LDpred manuscripts carry an additional subscript denoting a small region of the genome (in PRS-CS, LD is denoted $D_l$ for the $l$-th region), because in practice the LD panel is pre-computed in blocks.
2. This approach solves a Bayesian inference problem *without* individual-level data.
3. This infinitesimal Bayesian regression is identical to ridge regression (see the check after this list).
4. The heritability $h^2$ is a parameter of the prior distribution and must be specified from domain knowledge before running the analysis, which is not always trivial in realistic PRS analyses. When we lack prior information about a disease, we might grid-search for the best $h^2$; in machine learning lingo, this is "hyper-parameter" tuning.
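To make remark 3 concrete, here is a small self-contained check (on arbitrary simulated data) that the posterior mean equals a ridge estimate on individual-level data with penalty $k = M(1 - h^2)/h^2$:
```{r}
# sanity check: infinitesimal posterior mean (R + k/N I)^{-1} beta_hat equals
# the ridge estimator (G'G + k I)^{-1} G'y with k = M (1 - h2) / h2
set.seed(5)
N = 500; M = 20; h2 = 0.1
G = scale(matrix(rnorm(N * M), N, M) + rnorm(N)) # correlated standardized SNPs
y = scale(G %*% rnorm(M, sd = sqrt(h2 / M)) + rnorm(N, sd = sqrt(1 - h2)))
beta_hat = t(G) %*% y / N
R = t(G) %*% G / N
k = M * (1 - h2) / h2
post_mean = solve(R + (k / N) * diag(M), beta_hat) # posterior mean from sumstats
ridge = solve(t(G) %*% G + k * diag(M), t(G) %*% y) # ridge on individual data
all.equal(c(post_mean), c(ridge)) # TRUE up to numerical error
```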
<br>
<br>
The framework can be further extended to a multi-ancestry setting. Assume the causal effect sizes $\lambda$ are shared across ancestries. For each ancestry $k$ with sample size $N_k$, we have $\hat \beta_{k} | \lambda \sim MVN(R_k \lambda, \frac{1}{N_k}R_k)$. We can then infer the posterior effect sizes $f(\lambda\mid \hat \beta_1, \hat\beta_2, ...)$:
$$
\begin{split}
f(\lambda\mid \hat \beta_1, \hat\beta_2, ...) &\propto f(\hat\beta_1, \hat\beta_2, ... \mid \lambda) \cdot f(\lambda) \\
&\propto \prod_{k} f(\hat \beta_k \mid \lambda) \cdot f(\lambda)
\end{split}
$$
It's possible to expand the equation to find the posterior mean of $\lambda \mid \hat \beta_1, \hat \beta_2, ...$, which is a function of the prior on $\lambda$, the observed effect sizes, and the LD panel of each ancestry (see the sketch after the code chunk below). However, a shared prior on $\lambda$ might not be a reasonable assumption, as causal effects can differ across populations (or heritability might differ, implying different amounts of regularization). [PRS-CSx](https://www.nature.com/articles/s41588-022-01054-7) instead infers the posterior effect sizes for each ancestry.
```{r}
path = "https://www.mv.helsinki.fi/home/mjxpirin/GWAS_course/material/APOE_1000G_FIN_74SNPS."
haps = read.table(paste0(path,"txt"))
info = read.table(paste0(path,"legend.txt"),header = T, as.is = T)
n = 1000
G = as.matrix(haps[sample(1:nrow(haps), size = n, repl = T),] + haps[sample(1:nrow(haps), size = n, repl = T),])
row.names(G) = NULL
colnames(G) = NULL
G_ = scale(G)
h2 = 0.05
M = ncol(G_)
lambda = rnorm(ncol(G_), mean = 0, sd = sqrt(h2/M))
err = scale(rnorm(n))
err_ = (err - mean(err)) * c(sqrt(1 - h2))
y = G_ %*% lambda + err_
y_ = scale(y)
##
beta_hat = t(G_) %*% y / n
R = t(G_) %*% G_ /n # LD matrix
lambda_posterior = solve(R + M/(n *h2) * diag(M)) %*% beta_hat
plot(beta_hat, lambda_posterior, main = "GWAS effect sizes vs posterior effect sizes",
xlim = c(-0.1, 0.1), ylim = c(-0.05, 0.05), xlab = "GWAS effect size", ylab = "Posterior mean of lambda")
abline(a = 0, b = 1)
```
Clearly, there is strong shrinkage of the marginal effect sizes.
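The multi-ancestry posterior above can also be sketched in code. This is a toy illustration under the shared-$\lambda$ assumption (and $1 - h^2 \approx 1$), treating two independent bootstrap samples from the same haplotype data as two "ancestries", so their LD matrices differ only by sampling noise; it reuses `haps`, `lambda`, `h2`, and `M` from the chunk above.
```{r}
# multi-ancestry posterior under shared causal effects:
# precision K = sum_k N_k R_k + (M/h2) I, and b = sum_k N_k beta_hat_k
set.seed(2)
simulate_sumstats = function(n, lambda, h2){
  G = as.matrix(haps[sample(1:nrow(haps), size = n, repl = T),] +
                haps[sample(1:nrow(haps), size = n, repl = T),])
  G_ = scale(G)
  y_ = scale(G_ %*% lambda + scale(rnorm(n)) * c(sqrt(1 - h2)))
  list(beta_hat = t(G_) %*% y_ / n, R = t(G_) %*% G_ / n, n = n)
}
pop1 = simulate_sumstats(1000, lambda, h2)
pop2 = simulate_sumstats(2000, lambda, h2)
K = pop1$n * pop1$R + pop2$n * pop2$R + (M / h2) * diag(M)
b = pop1$n * pop1$beta_hat + pop2$n * pop2$beta_hat
lambda_posterior_multi = solve(K, b)
cor(lambda_posterior_multi, lambda) # pooling ancestries sharpens the estimate
```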
\pagebreak
### Topic 4: TWAS {#Topic4}
Transcriptome-wide association studies (TWAS) aim to identify associations between gene expression and a trait of interest. In an ideal world where we have both RNA-seq and trait data for tens of thousands of individuals, a TWAS analysis would be easy: simply regress the trait on expression. However, GTEx, the largest collection of expression data, has only collected RNA-seq for ~700 individuals, without trait values; this precludes a direct association test between expression and trait. Despite the limitation, GTEx collected genotype data for all participants and trained models to predict expression from variants across multiple tissues. Variants strongly associated with gene expression are called eQTLs.
For a given cohort with only genotype and phenotype data, we can first impute/predict RNA expression from the genotype data, then regress the phenotype on the predicted RNA expression. Let's define some additional notation:
$$
\begin{split}
\hat x_k: &\text{ standardized imputed RNA expression for gene k} \\
\hat w_k: &\text{ pre-trained weights from GTEx to predict gene k}
\end{split}
$$
For a new dataset with genotype and phenotype information, we first impute the expression as $\hat x_k = G \hat w_k$. When predicting expression levels, we typically restrict to a small region of the genome around the gene (the cis region). We then regress the phenotype on the predicted expression to obtain an effect size and p-value for each gene. TWAS employs a two-stage least squares regression and is, in theory, immune to confounding between expression and trait. The model is identical to Mendelian randomization with gene expression as the exposure. With individual-level genotype and phenotype data, we can perform TWAS as:
$$
\begin{split}
\hat \beta^{TWAS}_k &= \frac{\mathbb{C}ov[\hat x_k, y]}{\mathbb{V}ar[\hat x_k]} \\
&= \frac{\hat x_k^T y}{\hat x_k^T \hat x_k} \\
&= \frac{\hat w_k^T G^T y}{\hat w_k^T G^T G \hat w_k}
\end{split}
$$
Noticing that $\hat \beta = G^T y/N$ and $R = G^TG/N$, we can further compress the expression to:
$$
\hat \beta^{TWAS}_k = \frac{\hat w_k^T \hat \beta}{\hat w_k^T R \hat w_k} = \frac{\hat w_k^T z}{\sqrt{N} \, \hat w_k^T R \hat w_k}
$$
As the phenotypic variance explained by a single locus is small, the GWAS marginal effect sizes are distributed as $\hat \beta \sim MVN(R\lambda, \frac{1}{N}R)$. By the [linear transformation](https://statproofbook.github.io/P/mvn-ltt.html) property of multivariate Gaussians (note that $\hat \beta^{TWAS}_k$ is a scalar), we have:
$$
\begin{split}
\hat \beta^{TWAS}_k &\sim N(\frac{\hat w_k^T R \lambda}{\hat w_k^T R \hat w_k}, \frac{1}{N \hat w_k^T R \hat w_k})\\
s.e. &= \frac{1}{\sqrt{N \hat w_k^T R \hat w_k}} \\
z &= \frac{\hat \beta^{TWAS}_k}{s.e.} = \frac{\sqrt{N} \, \hat w_k^T \hat \beta}{(\hat w_k^T R \hat w_k)^\frac{1}{2}}
\end{split}
$$
The above expression is convenient, as it allows TWAS analysis with only GWAS summary statistics and a matched LD panel. The expression seems consistent with [Sasha Gusev's presentation](https://www.youtube.com/watch?v=cfEGf6ezR-c&t=1660s). I also attach a simulation to demonstrate this quantity.
```{r}
path = "https://www.mv.helsinki.fi/home/mjxpirin/GWAS_course/material/APOE_1000G_FIN_74SNPS."
haps = read.table(paste0(path,"txt"))
info = read.table(paste0(path,"legend.txt"),header = T, as.is = T)
### Consider this is GTEx data
n1 = 1000
G1 = as.matrix(haps[sample(1:nrow(haps), size = n1, repl = T),1:10] +
haps[sample(1:nrow(haps), size = n1, repl = T),1:10])
row.names(G1) = NULL
colnames(G1) = NULL
G1_ = scale(G1)
M = ncol(G1_)
lambda = c(0, 0, 0, 0.3, 0, 0, 0, 0, 0, 0) # a very strong eQTL
x1 = G1_ %*% lambda + rnorm(n1, mean = 0, sd = sqrt(1 - var(G1_ %*% lambda)))
x1_ = scale(x1)
# the expression model uses all the SNPs, might need to add regularization term in realistic analysis
w = solve(t(G1_) %*% G1_) %*% t(G1_) %*% x1_
### Consider this is my own data with genotype and trait
n2 = 6000
G2 = as.matrix(haps[sample(1:nrow(haps), size = n2, repl = T),1:10] +
haps[sample(1:nrow(haps), size = n2, repl = T),1:10])
row.names(G2) = NULL
colnames(G2) = NULL
G2_ = scale(G2)
R = cor(G2_)
beta_twas = 0.6
# x2 is unobserved in practice, but we generate the true expression for simulation
x2 = G2_ %*% lambda + rnorm(n2, mean = 0, sd = sqrt(1 - var(G2_ %*% lambda)))
x2_ = scale(x2)
y2 = x2_ %*% beta_twas + rnorm(n2, mean = 0, sd = sqrt(1 - var(x2_ %*% beta_twas)))
y2_ = scale(y2)
# with individual level data
beta_twas_hat_method1 = cov(y2_, G2_ %*% w)/var(G2_ %*% w)
# a GWAS effect size is pre-computed
beta_hat = 1/n2 * (t(G2_) %*% y2_)
# with sumstats level data and LD
beta_twas_hat_method2 = t(w) %*% beta_hat / (t(w) %*% R %*% w)
# z score
z_twas_method2 = sqrt(n2) * t(w) %*% beta_hat / sqrt(t(w) %*% R %*% w)
```
A few remarks about TWAS:
1. TWAS can be interpreted as a special case of Mendelian randomization. One might hope that TWAS identifies gene expression that causally affects the phenotype, as MR does, but due to complications such as co-expression, this is generally not true.
2. Standard TWAS analysis ignores the variability of the imputed expression, which could inflate false positives. [Xue et al.](https://pubmed.ncbi.nlm.nih.gov/31821608/) examined this question and concluded it is mostly fine.
3. The weights for predicting gene expression are often obtained from GTEx, but would these weights be as predictive across ancestries? [Chen et al.](https://www.nature.com/articles/s41588-022-01282-x) presented a method that integrates multiple GWAS results to perform TWAS, but I think there is room for more methodology development.
\pagebreak
### Topic 5: Fine-mapping {#Topic5}
Fine-mapping attempts to compute the posterior probability that a variant is causal. The initial analytical framework was introduced in [Maller et al.](https://www.nature.com/articles/ng.2435), where a single variant in a locus is assumed to be causal. Fine-mapping follows a Bayesian framework:
$$
y = G \lambda + \varepsilon, \text{ where } \lambda = \gamma b
$$
with $\gamma$ an indicator vector that is nonzero only at the single causal variant and $b$ its scalar effect size.
*Incomplete...*
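While the full treatment is unfinished, here is a minimal sketch of the single-causal-variant posterior under the stated assumption. It uses a Wakefield-style approximate Bayes factor (my choice, not necessarily the exact form in Maller et al.), with an arbitrary prior effect-size variance `W`:
```{r}
# single-causal-variant fine-mapping from z-scores via approximate Bayes factors;
# W (prior effect-size variance) is an arbitrary illustrative choice
single_causal_pip = function(z, N, W = 0.04){
  se2 = 1 / N                        # squared s.e. of standardized effects
  log_abf = 0.5 * log(se2 / (se2 + W)) + 0.5 * z^2 * W / (se2 + W)
  bf = exp(log_abf - max(log_abf))   # stabilize before normalizing
  bf / sum(bf)                       # PIPs under a uniform prior on the causal SNP
}
single_causal_pip(z = c(1.2, 5.0, 4.8, 0.3, -1.1), N = 10000) # toy 5-SNP locus
```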
In a [paper](https://www.nature.com/articles/s41588-021-00961-5) by Ding et al., the authors raise an interesting question about the uncertainty of each individual's PRS. The rationale is that the effect size estimates are uncertain due to the finite sample size of the GWAS; when a PRS is computed by aggregating effect sizes and genotypes, this uncertainty propagates into the PRS. Here I attempt to replicate the simulation presented in their Figure 1 and derive some quantities.
Under the infinitesimal model with no LD, we assume the causal effect sizes $\lambda_j = \beta_j \sim N(0, h^2/M)$. Therefore:
1. $\hat \beta_j \mid \beta_j \sim N(\beta_j, \frac{1}{N})$.
2. The posterior distribution is $\beta_j \mid \hat \beta_j \sim N((1 + \frac{M}{h^2 N})^{-1}\hat \beta_j, \frac{1}{N}(1 + \frac{M}{h^2N})^{-1} )$.
*Observation*: this formulation is almost identical to the PRS-CS paper, except that it does not consider LD structure. A sketch of how this posterior uncertainty propagates into individual PRSs follows.
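A minimal sketch with arbitrary simulated, independent SNPs; since the posteriors are independent under no LD, the PRS posterior variance is the sum of per-SNP contributions:
```{r}
# per-individual PRS uncertainty under no LD:
# Var(PRS_i | data) = sum_j G_ij^2 * Var(beta_j | beta_hat_j)
set.seed(3)
N = 5000; M = 200; h2 = 0.3
G = scale(matrix(rbinom(N * M, 2, 0.3), N, M)) # independent standardized SNPs
beta = rnorm(M, sd = sqrt(h2 / M))
y = scale(G %*% beta + rnorm(N, sd = sqrt(1 - h2)))
beta_hat = t(G) %*% y / N
shrink = 1 / (1 + M / (h2 * N))                # posterior shrinkage factor
prs_mean = G %*% (shrink * beta_hat)           # posterior mean of each PRS
prs_sd = sqrt((shrink / N) * rowSums(G^2))     # posterior s.d. per individual
head(cbind(prs_mean, prs_sd))                  # point estimate with uncertainty
```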
**Topics to explore**:
1. Statistical geneticists like to use a reference panel as an alternative to individual-level data. For the quantities above to hold, the LD panel $R$ must be close to the in-sample LD. This does not always work, since realizations of a random matrix can differ substantially. Some simulations reveal that joint effect estimates computed with a reference LD panel can differ by 40%, sometimes with opposite signs; a small sketch of this mismatch follows this list.
2. Gene co-expression networks explained by LD.
3. Apply SKAT to imputed expression, or assess the heritability explained by imputed expression, or genetic correlation.
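As a hedged sketch of point 1 (my own toy construction, reusing the course haplotype data): computing joint effects $R^{-1}\hat\beta$ with in-sample LD versus an independent "reference" LD panel can give noticeably different answers.
```{r}
# joint effects R^{-1} beta_hat with in-sample LD vs an independent reference LD;
# the differences reflect sampling noise in the LD estimate
set.seed(4)
path = "https://www.mv.helsinki.fi/home/mjxpirin/GWAS_course/material/APOE_1000G_FIN_74SNPS."
haps = read.table(paste0(path, "txt"))
draw_G = function(n) scale(as.matrix(haps[sample(1:nrow(haps), n, repl = T), 1:10] +
                                     haps[sample(1:nrow(haps), n, repl = T), 1:10]))
n = 1000
G_ = draw_G(n)        # GWAS sample
G_ref = draw_G(500)   # independent reference panel
R_in = t(G_) %*% G_ / n
R_ref = t(G_ref) %*% G_ref / 500
lambda = c(0, 0, 0.3, 0, 0, 0, 0, 0, 0, 0)
h2_loc = as.numeric(t(lambda) %*% R_in %*% lambda)
y_ = scale(G_ %*% lambda + rnorm(n, sd = sqrt(1 - h2_loc)))
beta_hat = t(G_) %*% y_ / n
# (if R is near-singular, add a small ridge term before solving)
round(cbind(in_sample = c(solve(R_in, beta_hat)),
            reference = c(solve(R_ref, beta_hat))), 3)
```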