-
Notifications
You must be signed in to change notification settings - Fork 3
/
01_population_structure.Rmd
106 lines (80 loc) · 4.42 KB
/
01_population_structure.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
---
title: "Population Structure"
output: github_document
---
```{r echo=FALSE}
knitr::opts_chunk$set(fig.width=12, fig.height=8,
echo=FALSE, warning=FALSE, message=FALSE)
```
```{r}
library(tidyverse)
library(ggrepel)
library(ggpubr)
library(RColorBrewer)
```
```{r}
source("R/load.R")
source("R/constants.R")
sample_table <- load_sample_table()
```
PCAngsd was used to examine population structure. This program is specifically designed to work with low coverage data and calculates a sample covariance matrix (used for PCA plotting) and admixture proportions based on the optimal number of clusters.
As input data PCAngsd we used filtered SNPs in vcf format that were called using Freebayes. The script [01_import_vcf.sh](hpc/NGSAdmix/01_import_vcf.sh) uses ANGSD to convert these Freebayes SNP calls into Beagle format while also filtering for Hardy Weinberg equilibrium (p < 1e-6). The command takes the form
```bash
angsd -vcf-gl <input_vcf> -out <beagle_output> -fai <genome_index> -nind 148 -doMaf 1 -doGlf 2 -doMajorMinor 1 -SNP_pval 1e-6
```
PCAngsd was then run on these Beagle formatted files using the command;
```bash
python pcangsd.py -beagle <beagle_file> -threads 40 -admix -admix_save -admix_auto 10000 -o <output_file>
```
The covariance matrix can be used as the basis for a PCA. Plotting the first two principle components reveals the clear Magnetic Island - North distinction. It also reveals two clear outliers `MI-2-9` and `MI-1-16`.
```{r}
accelerate_covmat <- read_table2("hpc/NGSadmix/accelerate_fb_pcangst_2.cov",col_names = FALSE) %>%
as.matrix()
sample_ids <- read_tsv("hpc/NGSAdmix/accelerate_individuals.txt",col_names = "sample_id") %>%
pull(sample_id)
colnames(accelerate_covmat) <- sample_ids
rownames(accelerate_covmat) <- sample_ids
pop_eigen <- eigen(accelerate_covmat)
pop_pca <- data.frame(x=pop_eigen$vectors[,1],y=pop_eigen$vectors[,2],sample_id = sample_ids) %>%
left_join(sample_table) %>%
mutate(pop_order = site_order()[location_id])
outliers <- c("MI-2-9","MI-1-16") # These don't seem to belong to either of the two main populations
hybrids <- c("DI-2-4","PI-1-16","MI-1-1")
label_pca <- pop_pca %>% mutate(label = ifelse(sample_id %in% c(outliers,hybrids), sample_id,""))
ggplot(pop_pca ,aes(x=x,y=y)) +
geom_point(aes(color=reorder(location_id,pop_order)),size=1) +
theme_pubr() +
scale_color_manual(values = site_colors(), labels=site_labels()[names(site_colors())]) +
theme(legend.position = "bottom") +
geom_label_repel(data=label_pca,aes(label=label), size=2,min.segment.length = 0.1,point.padding = 0.3, nudge_x = 0.01) +
xlab("PC1") + ylab("PC2") +
theme(legend.title = element_blank(), text = element_text(size=10), legend.position = "right")
ggplot(pop_pca ,aes(x=x,y=y)) + geom_point(aes(color=reorder(location_id,pop_order)),size=1) +
theme_pubr() +
scale_color_manual(values = site_colors(), labels=site_labels()[names(site_colors())]) +
theme(legend.position = "bottom") +
geom_label_repel(data=label_pca,aes(label=label), size=2,min.segment.length = 0.1,point.padding = 0.3, nudge_x = 0.01) +
xlab("PC1 (14.4%)") + ylab("PC2 (0.8%)") +
theme(legend.title = element_blank(), text = element_text(size=5), legend.position = "right")
ggsave("figures/pcangst_accelerate.png",width = 12,height=8, units="cm")
```
Admixture proportions are also calculated by PCAngsd (based on optimal K = 2). These can be plotted in the style of a STRUCTURE plot as follows;
```{r}
q2_admix <- read_table2("hpc/NGSAdmix/accelerate_fb_pcangst_2.K2.a625.0.qopt",col_names = c("X2","X1")) %>%
add_column(sample_id = sample_ids) %>%
gather(cluster,proportion,-sample_id) %>%
left_join(sample_table) %>%
mutate(pop_order = site_order()[location_id])
pop_colors <- site_colors()[c('MI',"FI")]
names(pop_colors) <- c("X1","X2")
ggplot(q2_admix ,aes(x=reorder(sample_id,pop_order),y=proportion)) +
geom_bar(aes(fill=cluster,color=cluster),stat="identity") +
facet_wrap(~reorder(location_name,desc(pop_order)), scales = "free_y", ncol = 1, strip.position = "left") +
theme_pubclean() +
xlab("") + ylab("Proportion") +
scale_color_manual(values = pop_colors) +
scale_fill_manual(values = pop_colors,labels = c("X1"="Magnetic Island","X2"="North")) +
theme(axis.text.y = element_blank(), legend.title = element_blank(), legend.position = 'none') +
theme(line = element_blank(), strip.text = element_text(size=12)) +
guides(color=FALSE) + coord_flip()
```