zhouhufeng/FAVORannotator

License: GPL v3

FAVORannotator

FAVORannotator is an R program for performing functional annotation of any genetic study (e.g. Whole-Genome/Whole-Exome Sequencing/Genome-Wide Association Studies) using the FAVOR backend database to create an annotated Genomic Data Structure (aGDS) file by storing the genotype data (in VCF or GDS format) and their functional annotation data in an all-in-one file.

For generating GDS/aGDS from raw VCF files, please refer to the detailed tutorial here.

1. Introduction

FAVORannotator is an open-source pipeline for functionally annotating and efficiently storing the genotype and variant functional annotation data of any genetic study (e.g. GWAS/WES/WGS). It stores functional annotation data alongside genotype data in an all-in-one aGDS file, which then facilitates a wide range of functionally-informed downstream analyses (Figure 1).

FAVORannotator first converts a genotype VCF input file to a GDS file, searches the variants in the GDS file using the FAVOR database for their functional annotations, and then integrates these annotations into the GDS file to create an aGDS file. This aGDS file allows both genotype and functional annotation data to be stored in a single unified file (Figure 1). Furthermore, FAVORannotator can be conveniently integrated into STAARpipeline, a rare variant association analysis tool, to perform association analysis of large-scale WGS/WES studies.

Figure 1. FAVORannotator workflow.

2. FAVORannotator different versions (SQL, CSV and Cloud versions)

There are three main versions of FAVORannotator: SQL, CSV and Cloud.

All versions of FAVORannotator require the same set of R libraries. In addition, the SQL version requires a PostgreSQL installation, and the CSV and Cloud versions require the xsv software.

All versions produce identical results and have similar performance; they differ only in the computing environment where FAVORannotator is deployed. Users can choose a version according to their computing platform and use case.

FAVORannotator accomplishes both high query speed and storage efficiency due to its optimized configurations and indices. Its offline nature avoids the excessive waiting time and file size restrictions of FAVOR online operation.

2.1 FAVORannotator SQL version

Note that the PostgreSQL database used by the FAVORannotator SQL version differs from file-based storage: the database server must be running in order to be accessed. Users must therefore ensure the database is running before running annotations.

Once the FAVORannotator database is up and running, the following connection information must be specified for the FAVORannotator R program to access the database: DBName, Host, Port, User, and Password.

This specialized database setup ensures high query speed. Figure 2 shows the features described above in detail.
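Before starting an annotation run, it can help to confirm that the PostgreSQL instance is accepting connections. The sketch below uses `pg_isready`, a utility shipped with PostgreSQL. The connection values are placeholders (assumptions), not settings taken from this repository, and the password is not needed for this check.

```shell
# Placeholder connection settings (assumptions); use the values from your config.R.
DB_HOST="${DB_HOST:-localhost}"
DB_PORT="${DB_PORT:-5432}"
DB_NAME="${DB_NAME:-favor}"
DB_USER="${DB_USER:-postgres}"

# Print the pg_isready command for the settings above.
db_ready_cmd() {
  echo "pg_isready -h $DB_HOST -p $DB_PORT -d $DB_NAME -U $DB_USER"
}

# Run the check only if pg_isready is installed. Exit status 0 means the
# server is accepting connections.
check_db() {
  if command -v pg_isready >/dev/null 2>&1; then
    pg_isready -h "$DB_HOST" -p "$DB_PORT" -d "$DB_NAME" -U "$DB_USER"
  else
    echo "pg_isready not found; is PostgreSQL installed?" >&2
    return 127
  fi
}

# Show the command that check_db would run.
db_ready_cmd
```

Calling `check_db` before `Rscript` and stopping on a non-zero exit status avoids launching an annotation job against a database that is not reachable.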

Figure 2. FAVORannotator SQL version workflow and differences highlights.

2.2 FAVORannotator CSV version

The FAVORannotator CSV version database adopts a similar strategy: both the database and the query inputs are sliced into smaller pieces, and an index is created for each of the smaller chunks of the database, so as to achieve the same high performance and fast query speed as the SQL version.

Unlike the SQL version, the CSV version database is static and is queried through the xsv software, so there is no database server that must be running before annotation. Accessing the static CSV files through xsv is also simpler than obtaining the connection details of a running PostgreSQL database, which widens the applicability of FAVORannotator to computing platforms that do not support a PostgreSQL installation.

Figure 3. FAVORannotator CSV version workflow and differences highlights.

2.3 FAVORannotator Cloud version

The FAVORannotator Cloud version is developed from the CSV version (no pre-installed database). It adopts the same strategy of slicing both the database and the query inputs into smaller pieces and creating an index for each of the smaller chunks of the database, so as to achieve the same high performance and fast query speed as the SQL and CSV versions. The difference is that the Cloud version downloads the FAVOR database (Full Database or Essential Database) on the fly and requires no pre-installed FAVOR database on the computing platform.

The Cloud version downloads the database from the FAVOR repository on Harvard Dataverse when FAVORannotator is executed, and decompresses it once the download finishes. The downloaded database is the CSV version, which is static and queried through the xsv software, so it requires minimal dependencies and no running database management system.

Figure 4. FAVORannotator Cloud version workflow and differences highlights.

3. Obtain the FAVOR Database

3.1 Obtain the database through direct downloading

  1. Download the FAVORannotator data file from here (download URL, under the "FAVORannotator" tab).
  2. Decompress the downloaded data.
  3. Move the decompressed database to its target location, and update the location info in config.R.

3.2 FAVOR databases hosted on Harvard Dataverse

The FAVOR databases (Essential Database and Full Database) are hosted on Harvard Dataverse.

Figure 5. FAVOR Databases on Harvard Dataverse (both Essential Database and Full Database).

3.3 FAVOR Essential Database

The FAVOR Essential Database contains 20 essential annotation scores for all possible SNVs (8,812,917,339) and observed indels (79,997,898) in Build GRCh38/hg38.

3.4 FAVOR Full Database

The FAVOR Full Database contains 160 annotation scores, the full collection, for all possible SNVs (8,812,917,339) and observed indels (79,997,898) in Build GRCh38/hg38.

4. Resource requirements

The resources utilized by the FAVORannotator R program and PostgreSQL instance are largely dependent upon the size of the input variants.

For both the SQL and CSV versions of FAVORannotator, WGS variant sets from 60,000 samples were tested. The whole functional annotation, run in parallel, finished in 1 hour using 24 computing cores (Intel Cascade Lake, 2.9 GHz). The memory consumed by each instance varies (usually within 18 GB), because the number of variants differs between chromosomes.

6. How to Use FAVORannotator

6.1 SQL/CSV versions

Installing and running FAVORannotator to perform functional annotation requires only 2 major steps:

I. Install software dependencies and prepare the database (process varies between systems).

II. Run FAVORannotator (CSV or SQL versions).

The first step depends on whether the SQL or CSV version is used and on the computing platform. The following sections detail the process for major platforms. The second step (running FAVORannotator) will be detailed first, as it is consistent across platforms.

6.2 No pre-install databases version

In some use cases, downloading the database and editing the configuration can be difficult. The no pre-install database version of FAVORannotator simplifies this by performing the download, the decompression, and the config.R updates (database location and output location) itself. Users only need to put the R scripts into a directory with enough storage and run the program.

I. Install software dependencies.

II. Run FAVORannotator (no pre-install database version).

6.3 Cloud version

Based on the FAVORannotator no pre-install database version, we developed the FAVORannotator cloud-native app for cloud platforms such as Terra and DNAnexus, and for virtual machines on Google Cloud Platform (GCP), Amazon Web Services (AWS), and Microsoft Azure. With the dockerized images and workflow languages, FAVORannotator can be executed through a user-friendly drag-and-drop graphical interface, with no scripting or programming skills required from users.

Figure 6. FAVORannotator Different Versions.

7. SQL version

7.1 Run FAVORannotator SQL version

Once PostgreSQL is running, the database can be imported and FAVORannotator can be executed as follows. Please find the R scripts in the Scripts/SQL/ folder.

Important: Before running the FAVORannotator SQL version, please update the file locations and database info in the config.R file. FAVORannotator relies on the file locations and database info for the annotation.

  1. Create GDS file from the input VCF file:
  • $ Rscript convertVCFtoGDS.r chrnumber
  2. Run FAVORannotator:
  • $ Rscript FAVORannotatorv2aGDS.r chrnumber

chrnumber is a number indicating which chromosome the database is read from; it can be 1, 2, ..., 22.

Scripts for submitting jobs for all chromosomes simultaneously have been provided. They use SLURM, which is supported by many high-performance clusters, and utilize parallel jobs to boost performance.

A SLURM script to simplify the process can be found here: (submission.sh).
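As an illustration of the per-chromosome parallelism described above, here is a minimal SLURM job-array sketch. It is not the repository's submission.sh: the job name, memory, and time limits are assumptions (the memory figure is loosely based on the "usually within 18 GB" note in the resource requirements), and the `annotate_chr` helper and `DRY_RUN` switch are hypothetical names introduced for this example.

```shell
#!/bin/bash
#SBATCH --job-name=favor_annot   # hypothetical job name
#SBATCH --array=1-22             # one array task per chromosome
#SBATCH --cpus-per-task=1
#SBATCH --mem=20G                # assumption; adjust to your data
#SBATCH --time=02:00:00          # assumption; adjust to your cluster

# Run both steps for one chromosome. With DRY_RUN=1 (the default here)
# the commands are only printed, so the script can be checked safely.
annotate_chr() {
  chr="$1"
  case "$chr" in
    ''|*[!0-9]*) echo "chrnumber must be an integer: $chr" >&2; return 1 ;;
  esac
  if [ "$chr" -lt 1 ] || [ "$chr" -gt 22 ]; then
    echo "chrnumber must be between 1 and 22: $chr" >&2
    return 1
  fi
  for script in convertVCFtoGDS.r FAVORannotatorv2aGDS.r; do
    if [ "${DRY_RUN:-1}" = "1" ]; then
      echo "Rscript $script $chr"
    else
      Rscript "$script" "$chr" || return 1
    fi
  done
}

# SLURM sets SLURM_ARRAY_TASK_ID for each array task; fall back to 1 otherwise.
annotate_chr "${SLURM_ARRAY_TASK_ID:-1}"
```

Submitting this file with `sbatch` after setting `DRY_RUN=0` would start 22 independent tasks, one per chromosome. Running it directly in a terminal prints the two commands for chromosome 1 without executing them.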

7.2 Install and prepare the database for SQL version

The FAVORannotator SQL version relies upon the PostgreSQL Database Management System (DBMS). PostgreSQL is a free and open-source application which emphasizes extensibility and SQL compliance. It is a highly stable DBMS, backed by more than 20 years of community development. PostgreSQL is used to manage data for many web, mobile, geospatial, and analytics applications. Its advanced features, including diverse index types and configuration options, have been carefully selected for FAVORannotator so that end users do not need to worry about the implementation.

The following steps explain how to set up the database. PostgreSQL is available on most platforms. Each platform has a different process for installing software, which affects the first step of installing FAVORannotator.

Once PostgreSQL is running, the database can be imported and FAVORannotator can be executed as follows:

  1. Once the server is running, load the database: $ psql -h hostname -p port_number -U username -f your_file.sql databasename

    e.g. $ psql -h c02510 -p 582  -f /n/SQL/ByChr7FAVORDBxO.sql Chr7

  2. The PostgreSQL instance hosting the FAVORannotator backend database is now up and running, and is listening for queries from the FAVORannotator R program.

  3. Update the config.R file with the PostgreSQL instance information (database name, port, host, user, password).

7.3 Install PostgreSQL (FAVORannotator SQL version)

The following steps have been written for major computing environments in order to best account for all possibilities. The following steps are for the widely used operating system (Ubuntu) on a virtual machine.

  1. Install the required software:
  • $ sudo apt install postgresql postgresql-contrib
  2. Start and run PostgreSQL:
  • $ sudo -i -u postgres
  • $ psql
  3. [Optional] To install the database on external storage, edit the configuration file:
  • The file is located at /etc/postgresql/12/main/postgresql.conf
  • In postgresql.conf, change the line data_directory = 'new directory of external storage'
  • Restart PostgreSQL so that the new data directory takes effect: $ sudo systemctl restart postgresql
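The optional configuration edit above can also be scripted. The sketch below is one possible way to do it, using a hypothetical helper named `set_data_directory` that rewrites the `data_directory` line and keeps a backup copy. It assumes the line already exists in the file (possibly commented out) and that the new path contains no `|` or `&` characters. PostgreSQL also expects the new directory to hold an initialized data cluster owned by the postgres user, which this sketch does not handle.

```shell
# Set data_directory in a postgresql.conf file, keeping a .bak copy.
# Usage: set_data_directory CONF_FILE NEW_DIR
set_data_directory() {
  conf="$1"
  new_dir="$2"
  [ -f "$conf" ] || { echo "no such file: $conf" >&2; return 1; }
  cp "$conf" "$conf.bak" || return 1
  # Replace an existing (possibly commented-out) data_directory line.
  # '|' is the sed delimiter so that '/' in paths needs no escaping.
  sed "s|^#\{0,1\}[[:space:]]*data_directory[[:space:]]*=.*|data_directory = '$new_dir'|" \
    "$conf.bak" > "$conf"
}

# Example call (run as root; /mnt/external/pgdata is a hypothetical path):
# set_data_directory /etc/postgresql/12/main/postgresql.conf /mnt/external/pgdata
```

Trying the function on a copy of the configuration file first, and comparing it with the `.bak` file, is a cheap way to confirm that only the intended line changed.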

For more detailed instructions on how to use FAVORannotator (SQL version) on the Harvard FASRC Slurm Cluster, please refer to the detailed tutorial here.

8. CSV version

8.1 Run FAVORannotator CSV version

Once the CSV database is downloaded and decompressed, it is readable by FAVORannotator, which can be executed as follows. Please find the R scripts in the Scripts/CSV/ folder.

Important: Before running the FAVORannotator CSV version, please update the file locations and database info in the config.R file. FAVORannotator relies on the file locations and database info for the annotation.

  1. Create GDS file from the input VCF file:
  • $ Rscript convertVCFtoGDS.r chrnumber
  2. Run FAVORannotator:
  • $ Rscript FAVORannotatorv2aGDS.r chrnumber

Scripts for submitting jobs for all chromosomes simultaneously have been provided. They use SLURM, which is supported by many high-performance clusters, and utilize parallel jobs to boost performance.

A SLURM script to simplify the process can be found here: (submission.sh).

chrnumber is a number indicating which chromosome the database is read from; it can be 1, 2, ..., 22.

8.2 Install and prepare the database for CSV version

FAVORannotator (CSV version) depends on the xsv software and the FAVOR database in CSV format. Please install the xsv software and download the FAVOR database CSV files (under the "FAVORannotator" tab) before using FAVORannotator (CSV version).

8.3 Install xsv (FAVORannotator CSV version)

The following steps have been written for major computing environments in order to best account for all possibilities. The following steps are for the widely used operating system (Ubuntu) on a virtual machine.

  1. Install Rust and Cargo:
  • $ curl https://sh.rustup.rs -sSf | sh
  2. Source the environment:
  • $ source $HOME/.cargo/env
  3. Install xsv using Cargo:
  • $ cargo install xsv
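After the installation steps above, a quick check that the required commands are on the PATH can save a failed run later. The sketch below uses a hypothetical helper named `require_tool`; it relies only on the shell built-in `command -v`.

```shell
# Report whether a command is available on PATH.
require_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "found: $1"
  else
    echo "missing: $1" >&2
    return 1
  fi
}

# xsv is needed by the CSV version; Rscript is needed by every version.
for tool in xsv Rscript; do
  require_tool "$tool" || echo "install $tool before running FAVORannotator"
done
```

If xsv is reported missing right after `cargo install xsv`, the usual cause is that `$HOME/.cargo/env` has not been sourced in the current shell.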

9. No pre-install databases version

9.1 Install xsv (no pre-installed database is needed, but xsv must be installed)

The following steps have been written for major computing environments in order to best account for all possibilities. The following steps are for the widely used operating system (Ubuntu) on a virtual machine.

  1. Install Rust and Cargo:
  • $ curl https://sh.rustup.rs -sSf | sh
  2. Source the environment:
  • $ source $HOME/.cargo/env
  3. Install xsv using Cargo:
  • $ cargo install xsv

9.2 Run FAVORannotator no pre-install databases version

The FAVOR database is downloaded on the fly and decompressed automatically by the scripts, so this version of FAVORannotator removes the burden of downloading the backend database and updating config.R. Once downloaded and decompressed, the database is readable by FAVORannotator, which can be executed as follows.

Please find the R scripts in the Scripts/CSV/ folder.

Important: The no pre-install version of FAVORannotator does not require updating the config.R file. It downloads the FAVOR database (Full or Essential version) directly from Harvard Dataverse to the default file locations used for the annotation. Just put the FAVORannotator script in a directory with ample storage; the database, index, and intermediate files will all be generated in that directory.
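Because the database, index, and intermediate files all land in the working directory, it may be worth checking free space before the first run. The following is a minimal sketch using POSIX `df`; the helper name `free_kb` is hypothetical, and the `REQUIRED_KB` threshold is a made-up figure, not the actual size of the FAVOR databases.

```shell
# Print the free space, in kilobytes, of the filesystem holding a directory.
free_kb() {
  df -Pk "${1:-.}" | awk 'NR == 2 { print $4 }'
}

# REQUIRED_KB is a made-up threshold (about 100 GB); set it to the size of
# the database you plan to download plus room for intermediate files.
REQUIRED_KB=$((100 * 1024 * 1024))

avail="$(free_kb .)"
if [ "$avail" -lt "$REQUIRED_KB" ]; then
  echo "only ${avail} KB free here; consider another directory" >&2
else
  echo "${avail} KB free"
fi
```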

  1. Create GDS file from the input VCF file:
  • $ Rscript convertVCFtoGDS.r input.vcf output.gds
  2. Run FAVORannotator for the FAVOR Essential Database:
  • $ Rscript FAVORannotatorCSVEssentialDB.R output.gds chrnumber
  3. Run FAVORannotator for the FAVOR Full Database:
  • $ Rscript FAVORannotatorCSVFullDB.R output.gds chrnumber

chrnumber is a number indicating which chromosome the database is read from; it can be 1, 2, ..., 22.

Scripts for submitting jobs for all chromosomes simultaneously have been provided. They use SLURM, which is supported by many high-performance clusters, and utilize parallel jobs to boost performance.

A SLURM script to simplify the process can be found here: (submission.sh).

10. Cloud Version

10.1 Run FAVORannotator Cloud Version

For cloud environments, we simplified the database setup and removed the configuration files. The FAVOR database is downloaded on the fly and decompressed automatically by the scripts, so this version of FAVORannotator removes the burden of downloading the backend database and updating config.R, and it integrates seamlessly with the workflow languages of the cloud platform. It currently works on cloud platforms such as Terra and DNAnexus. This tutorial uses Terra as an example to illustrate the functional annotation process.

Please find the R scripts in the Scripts/Cloud/ folder.

Important: This version of FAVORannotator, based on the no pre-install version, does not need a config.R file. It downloads the FAVOR database (Full or Essential version) directly from Harvard Dataverse to the default file locations used for the annotation. Just put the FAVORannotator script in a directory with ample storage; the database, index, and intermediate files will all be generated in that directory. On most cloud platforms, these database and intermediate files in the working directories will be removed.

  1. Create GDS file from the input VCF file:
  • $ Rscript convertVCFtoGDS.r input.vcf output.gds

2.1 Run FAVORannotator for the FAVOR Essential Database:

  • $ Rscript FAVORannotatorTerraEssentialDB.R output.gds chrnumber

2.2 Run the FAVORannotator workflow for the FAVOR Essential Database:

  • $ java -jar cromwell-30.2.jar run FAVORannotatorEssentialDB.wdl --inputs file.json

3.1 Run FAVORannotator for the FAVOR Full Database:

  • $ Rscript FAVORannotatorTerraFullDB.R output.gds chrnumber

chrnumber is a number indicating which chromosome the database is read from; it can be 1, 2, ..., 22.

3.2 Run the FAVORannotator workflow for the FAVOR Full Database:

  • $ java -jar cromwell-30.2.jar run FAVORannotatorFullDB.wdl --inputs file.json

Figure 7. FAVORannotator Cloud Native Workflow on Terra.

11. Other Functions and Utilities

11.1 Convert VCF to aGDS

The following functions have been written for the purpose of converting VCF files to GDS/aGDS files. Please find the R scripts in the Scripts/UTL/ folder.

  1. To convert VCF files that contain only genotype data into GDS files for the subsequent annotation process:
  • $ Rscript convertVCFtoGDS.r input.vcf output.agds
  2. To convert a variant list without genotype data into a GDS file for the subsequent annotation process, first format the variant list as a VCF file; the following R script then generates an empty GDS file that holds only the variant info, with no genotype data:
  • $ Rscript convertVCFtoGDS.r inputVariantList.vcf output.agds
  3. If the VCF files have already been annotated using SnpEff, BCFtools, VarNote, or Vcfanno and you just wish to use aGDS for the subsequent analysis, run the following R script to convert the annotated VCF files into an aGDS file:
  • $ Rscript convertVCFtoGDS.r annotated.vcf output.agds

11.2 Add In Functional Annotations to aGDS

  1. If users have external annotation sources, or annotations in text tables covering their variant sets, this function adds the new functional annotations to a new node of the aGDS file:
  • $ Rscript FAVORannotatorAddIn.R input.agds AnnotationFile.tsv

11.3 Extract Variant Functional Annotation to Text Tables from aGDS

  1. If users prefer to have the variant functional annotation results written to text tables, this R script extracts the functional annotations from the aGDS file and writes them to text tables:
  • $ Rscript FAVORaGDSToText.R annotated.agds AnnotationTextTable.tsv
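Once a text table has been extracted, a quick sanity check of its shape can catch truncated or empty output. The sketch below assumes the table is tab-separated with a single header line (suggested by the .tsv name, but not confirmed here); `tsv_summary` is a hypothetical helper name.

```shell
# Print the number of columns (taken from the header line) and the number
# of data rows of a tab-separated file. Assumes the first line is a header.
tsv_summary() {
  awk -F '\t' '
    NR == 1 { cols = NF }
    END     { printf "columns=%d rows=%d\n", cols, (NR > 0 ? NR - 1 : 0) }
  ' "$1"
}

# Example call:
# tsv_summary AnnotationTextTable.tsv
```

The row count can then be compared against the number of variants in the aGDS file to confirm that every variant was written out.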

12. Demo Using Real Example (1000 Genomes Project Data)

The following steps demonstrate how to run FAVORannotator on real genotype data from the 1000 Genomes Project. The whole process, from obtaining the genotype data to creating the aGDS file, is illustrated step by step below.

12.1 Download the 1000G VCF

Users can use the command lines below to obtain the 1000G VCF files from the 1000 Genomes official FTP site for the following process.

Change the directory:

  • $ cd ../../Data/TestData/1000G/

Download VCF to the directory (chr22):

  • $ wget http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000_genomes_project/release/20190312_biallelic_SNV_and_INDEL/ALL.chr22.shapeit2_integrated_snvindels_v2a_27022019.GRCh38.phased.vcf.gz

Additionally, to download chr1:

  • $ wget http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000_genomes_project/release/20190312_biallelic_SNV_and_INDEL/ALL.chr1.shapeit2_integrated_snvindels_v2a_27022019.GRCh38.phased.vcf.gz
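Large downloads occasionally end up truncated, so it can be useful to verify the compressed VCF before converting it. The sketch below tests the gzip integrity of the file and then counts variant records (lines not starting with `#`). The helper name `count_variants` is hypothetical.

```shell
# Check that a gzip-compressed VCF is intact, then count its variant
# records (lines that do not start with '#').
count_variants() {
  gzip -t "$1" || { echo "corrupt or incomplete download: $1" >&2; return 1; }
  gzip -dc "$1" | grep -vc '^#'
}

# Example call:
# count_variants ALL.chr22.shapeit2_integrated_snvindels_v2a_27022019.GRCh38.phased.vcf.gz
```

A failure of the integrity test usually means the download should be repeated before moving on to the VCF to GDS conversion.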

12.2 Convert VCF to GDS (chr22)

Users can use command line below to convert the VCF to GDS.

Change the directory:

  • $ cd ../../../Scripts/UTL

Run program to create GDS:

  • $ Rscript convertVCFtoGDS.r ../../Data/TestData/Input/ALL.chr22.shapeit2_integrated_snvindels_v2a_27022019.GRCh38.phased.vcf.gz ../../Data/1000G/All.chr22.27022019.GRCh38.phased.gds

And you will get the following output on terminal:

   ALL.chr22.shapeit2_integrated_snvindels_v2a_27022019.GRCh38.phased.vcf.gz (176.9M)
   file format: VCFv4.3
   the number of sets of chromosomes (ploidy): 2
   the number of samples: 2,548
   genotype storage: bit2
   compression method: LZMA_RA
   of samples: 2548
   	...

12.3 Annotate GDS using FAVORannotator to create aGDS (no pre-install version)

Users can use the following commands to annotate the GDS file using FAVORannotator to create an aGDS file.

Change the directory:

  • $ cd ../../Data/1000G/

Copy FAVORannotator program to the current directory:

  • $ cp ../../../Scripts/CSV/FAVORannotatorCSVEssentialDB.R .
  • $ cp ../../../Scripts/CSV/FAVORannotatorCSVFullDB.R .

Run the program to annotate the GDS file with FAVORannotator, reading the FAVOR Essential Database, to create the aGDS file (chr22):

  • $ Rscript FAVORannotatorCSVEssentialDB.R All.chr22.27022019.GRCh38.phased.gds 22

And you will get the following output on terminal:

[1] gds.file:  All.chr22.27022019.GRCh38.phased.gds
[1] chr:  22
[1] use_compression Yes
--2022-09-14 16:42:28--  https://dataverse.harvard.edu/api/access/datafile/6170504

Run the program to annotate the GDS file with FAVORannotator, reading the FAVOR Full Database, to create the aGDS file (chr22):

  • $ Rscript FAVORannotatorCSVFullDB.R All.chr22.27022019.GRCh38.phased.gds 22

And you will get the following output on terminal:

[1] gds.file:  All.chr22.27022019.GRCh38.phased.gds
[1] chr:  22
[1] use_compression: Yes
--2022-09-14 16:39:31--  https://dataverse.harvard.edu/api/access/datafile/6358299


13. Dependencies

FAVORannotator imports R packages: dplyr, SeqArray, gdsfmt, RPostgreSQL, stringr, readr, stringi. These dependencies should be installed before running FAVORannotator.

FAVORannotator (SQL version) depends upon PostgreSQL software.

FAVORannotator (CSV version) depends upon xsv software.

Data Availability

The whole-genome individual functional annotation data assembled from a variety of sources and the computed annotation principal components are available at the Functional Annotation of Variant - Online Resource (FAVOR) site.

Version

The current version is 1.1.1 (August 30th, 2022).

License

This software is licensed under GPLv3.

GNU General Public License, GPLv3
