Skip to content

Commit

Permalink
Merge pull request #2 from AnastasiiaInna/master
Browse files Browse the repository at this point in the history
R + Spark
  • Loading branch information
Anastasiia authored Mar 18, 2017
2 parents 0b6dc9a + 702e592 commit 25952ce
Show file tree
Hide file tree
Showing 200 changed files with 15,832 additions and 524 deletions.
173 changes: 0 additions & 173 deletions R/TuningForClassification.R

This file was deleted.

141 changes: 139 additions & 2 deletions README.md
Original file line number Diff line number Diff line change
@@ -1,2 +1,139 @@
# popgen
Population structure analysis using the NGS data
## **Popgen:** a toolset for population structure analysis using the NGS data

The goal of the project is to analyze the influence of various pre-processing methods, such as variant filtering, LD-pruning, and PCA or multi-dimensional scaling transformations of genotype matrix, and their parameters on the final results of the classification/clustering algorithms for population structure analysis.

Popgen icnludes two frameworks:

* *popgenR*, a tool for the analysis of population structure with automated estimation of the optimal QC’s and machine learning’s parameters. Such a framework is capable to fine-tune these parameters for different levels of ethnic populations, such as super- and sub-population groups define in 1000 Genomes dataset.

* *popgenSpark*, a tool that deal with dataset containing large number of samples (e.g. > 10,000 individuals). Both dimensionality reduction and machine learning methods have been implemented using Apache Spark.


# Installation
* **popgenR**

Install the package from the source code:
```sh
R CMD INSTALL popgen_0.1.0_R_x86_64-pc-linux-gnu.tar.gz
```
* **popgenSpark**

In the folder *popgen-spark* run
```sh
sbt assembly
```
## Examples

* **popgenR**
1. Load data:

``` R
genofile <- chr22load() # Load chr 22
panel <- panelLoad() # Load panel file
pop.df <- panel[, c(1, 3)] # Select SampleId and Region
colnames(pop.df) <- c("SampleId", "Region")
```

2. Tune the LD:

```R
min.maf <- 0.005
max.maf <- 0.05
npc <- 5
ld.set <- seq(0.1, 1, by = 0.1)

ld.tuning.result <- tuningLD(genofile, pop.df, min.maf, max.maf, ld.set, n.pc = n.pc,
n.clusters = length(unique((pop.df$Region)))) # ld tuning
ld.tuning.result$crossTbl # tuned parameters
plotTuning(ld.tuning.result$crossTbl) # plot
```

3. Perform clustering and plot the map:

```R
n.clusters <- length(unique((pop.df$Region))) # perform clustering for n = number of population groups

pca.df <- getQCPCAdf(genofile, pop.df, min.maf, max.maf, ld = ld.tuning.result$bestLD, npc = npc,
with.eig.tbl = F) # QC analysis, ld.tuning.result$bestLD is tuned LD

obj.hclust <- HClust( data = pca.df, k = n.clusters) # hierarchical clustering
obj.kmeans <- kmeansClust( data = pca.df, k = n.clusters) # kmeans
obj.emc <- emcClust( data = pca.df, k = n.clusters) # em

summaryClustering(obj.hclust, obj.kmeans, obj.emc) # clustering evaluation metrics

obj.hclust@conf.matr # confusion matrix

prediction.df <- cbind(pca.df, obj.hclust@prediction) # bind with PCs, needed for plot

attach(prediction.df)
plotMap2(prediction.df, PC3, PC2) # or plotMap(prediction.df, PC3, PC2)
detach(prediction.df)

```

More examples can be found in the folder *popgen-0.1.0/tests/*


* **popgenSpark**

1. PCA with Spark Mllib

```scala

val df = sqlContext.read.parquet("path/to/chr22snps.parquet")
val nPC: Int = 5
val labels: String = "Region"
val ignoredColumn = "SampleId"

val colNames = ds.columns.filter(c => c != labels && c != ignoredColumn)

val assembler = new VectorAssembler()
.setInputCols(colNames)
.setOutputCol("pcafeatures")

val pca = new PCA()
.setInputCol("pcafeatures")
.setOutputCol("features")
.setK(nPC)

val pipeline = new Pipeline()
.setStages(Array(assembler, pca))

val model = pipeline.fit(ds)
val pcaFeaturesDF = model.transform(ds).select(labels, ignoredColumn, "features")

```
2. K-means with H2O:

```scala

val numClust = 5

val h2oDF = h2oContext.asH2OFrame(df)
h2oDF.replace(h2oDF.find("Region"), h2oDF.vec("Region").toCategoricalVec()).remove()
h2oDF.update()

val kmeansParameters = new KMeansParameters()
kmeansParameters._train = h2oDF._key
kmeansParameters._response_column = labels
kmeansParameters._ignored_columns = ignoredColumns
kmeansParameters._k = numClust

val kmeans = new KMeans(kmeansParameters)
val kmeansModel = kmeans.trainModel().get()

val predictionDF = kmeansModel.score(h2oDF)
predictionDF.add("prediction", predictionDF.vec("predict").toCategoricalVec)
predictionDF.update()

```


# Run the test

* **popgenSpark**

```sh
spark-submit <spark-options> --class com.zsibio.popgen.ClusteringMllibDF <path/to/jar> <path/to/chr.parquet>
```
12 changes: 0 additions & 12 deletions man/tuningLD.Rd

This file was deleted.

12 changes: 0 additions & 12 deletions man/tuningLDnPC.Rd

This file was deleted.

12 changes: 0 additions & 12 deletions man/tuningLdMaf.Rd

This file was deleted.

12 changes: 0 additions & 12 deletions man/tuningMAF.Rd

This file was deleted.

12 changes: 0 additions & 12 deletions man/tuningMAFPC.Rd

This file was deleted.

12 changes: 0 additions & 12 deletions man/tuningNumPC.Rd

This file was deleted.

2 changes: 2 additions & 0 deletions popgen-R/popgen-0.1.0/.Rbuildignore
Original file line number Diff line number Diff line change
@@ -0,0 +1,2 @@
^.*\.Rproj$
^\.Rproj\.user$
Loading

0 comments on commit 25952ce

Please sign in to comment.