UK Biobank (UKB) has collected array genotyping, whole exome sequencing (WES), and whole genome sequencing (WGS) data for half a million participants aged 40-69 years. In addition to the genetic data, researchers can access thousands of traits/phenotypes for these participants, making UKB one of the richest data sources for performing GWAS.
Most recent GWAS tools (e.g. BOLT-LMM, SAIGE, regenie) are designed to run in two steps. The first step accounts for population stratification and relatedness by modeling genome-wide genotype markers, usually from whole-genome array genotyping data. The second step uses the fitted model to perform association testing with imputed genotypes, WES, or WGS data.
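For illustration, here is a hedged sketch of the two-step design using regenie; all file names are hypothetical, and only a minimal set of options is shown.

# Step 1: fit the whole-genome model on the array genotypes.
$ regenie --step 1 --bed ukb_array_grch38 \
    --phenoFile pheno.txt --covarFile covar.txt \
    --bsize 1000 --out fit
# Step 2: test the sequencing variants using the step-1 predictions.
$ regenie --step 2 --bed ukb_wes_grch38 \
    --phenoFile pheno.txt --covarFile covar.txt \
    --bsize 400 --pred fit_pred.list --out assoc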
UKB released the array genotyping data on the GRCh37 reference build (link), while the WES and WGS data are being released on GRCh38 (link, link). In order to perform association testing with the sequencing data in step 2, the array genotyping data must be on the same reference coordinates as the sequencing data.
Since the UKB array genotyping data is large (~800K genome-wide variant markers over ~500K individuals), the pre-processing, post-processing, and overall compute resources of this liftover work need careful management. This note describes a processing workflow for converting the UKB array genotyping data from GRCh37 to hs38DH (the primary assembly of GRCh38 + ALT contigs + decoy contigs + HLA genes).
Many tools are available to align and convert coordinates from one build to another, e.g. CrossMap, liftOver, and Remap. These tools determine the coordinate of a feature in a target genome assembly build (target) according to where it was originally determined in another build (source).
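For example, the UCSC liftOver tool maps intervals in a BED file from a source build to a target build through a chain file (file names hypothetical):

# Map GRCh37 intervals to GRCh38; intervals that fail to map go to unmapped.bed.
$ liftOver input_grch37.bed b37ToHg38.over.chain output_grch38.bed unmapped.bed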
I chose Picard LiftoverVcf (link) for converting all ~500K participants' genotypes to GRCh38. It recovers the scenario where an ALT allele equals the target build's REF allele by swapping the REF/ALT alleles. This can happen when the allele representation in the PLINK genotype files does not follow REF/ALT order, or when a variant is considered an ALT allele in the source build but a REF allele in the target build.
The pipeline consists of two parts. First, it scatters the Picard LiftoverVcf jobs across all 26 (autosomal+X+Y+XY+MT) chromosomes. This part also includes the pre-processing that converts the PLINK BED/BIM/FAM file sets into VCFs, as Picard LiftoverVcf requires VCF input, and the post-processing that converts the lifted VCFs back to PLINK. Second, it gathers and merges all lifted PLINK file sets and splits them into per-chromosome PLINK files.
For each chromosome, the intermediate files of the liftover work can take up to 200 GB of disk space (around 30x the original BED file size). Picard LiftoverVcf can be memory intensive at this sample size, so we designated 50 GB of memory for each chromosome's liftover job. To further reduce the memory required by the pipeline, variant sorting is disabled in Picard LiftoverVcf, as it was found to be a memory-intensive operation. Instead, the pipeline sorts the intermediate lifted VCF using bcftools sort before converting it back to PLINK format.
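Below is a minimal sketch of these per-chromosome steps, assuming hypothetical file names (ukb_chr1 for the input PLINK prefix); the actual commands in the WDL tasks may differ.

# Pre-process: convert PLINK BED/BIM/FAM to a block-gzipped VCF.
$ plink --bfile ukb_chr1 --recode vcf-iid bgz --out ukb_chr1_grch37
# Liftover with sorting disabled, recovering REF/ALT-swapped variants.
$ java -Xmx50g -jar picard.jar LiftoverVcf \
    I=ukb_chr1_grch37.vcf.gz \
    O=ukb_chr1_grch38_unsorted.vcf.gz \
    CHAIN=b37ToHg38.over.chain \
    REJECT=ukb_chr1_rejected.vcf.gz \
    R=GRCh38_full_analysis_set_plus_decoy_hla.fa \
    RECOVER_SWAPPED_REF_ALT=true \
    DISABLE_SORT=true
# Sort the lifted VCF with bcftools, which needs far less memory.
$ bcftools sort --max-mem 4G -Oz -o ukb_chr1_grch38.vcf.gz \
    ukb_chr1_grch38_unsorted.vcf.gz
# Post-process: convert the sorted VCF back to PLINK format.
$ plink2 --vcf ukb_chr1_grch38.vcf.gz --make-bed --out ukb_chr1_grch38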
The entire liftover process and its compute resource requirements are summarized in a WDL workflow (liftover_plink_beds.wdl). The workflow uses bcftools, picard, plink, and plink2, which are wrapped into a docker image (link) using this Dockerfile.
Of the 805,426 markers reported on the UKB genome-wide genotyping array, 803,700 (99.79%) were lifted from GRCh37 to GRCh38. For a quick check, I compared the result with an orthogonal data source -- the annotations for the Axiom UKB WCSG array on the hg38 build.
707,082 of the 805,426 UKB markers have records in the hg38 annotation file (NetAffx CSV file). Of these, 706,158 markers were lifted to GRCh38 by the liftover workflow presented here. We found that 99.98% (705,997 / 706,158) of the lifted markers are concordant with the array's hg38 annotation, suggesting that the liftover pipeline works properly. The remaining 161 markers show discrepancies due to chromosome mismatches (52), position mismatches (23), allele mismatches (60), or missing annotations (26).
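A rough sketch of the chromosome/position part of this concordance check, assuming the NetAffx CSV has been reduced to a tab-separated annot_hg38.tsv with three columns (marker ID, chromosome, position) and that lifted.bim is the lifted PLINK BIM file; file names and column layouts are hypothetical.

# Key the lifted BIM by marker ID: ID, chromosome, position.
$ awk 'BEGIN{OFS="\t"} {print $2, $1, $4}' lifted.bim | sort -k1,1 > lifted.tsv
$ sort -k1,1 annot_hg38.tsv > annot.tsv
# Join on marker ID and count markers agreeing on chromosome and position.
$ join -t "$(printf '\t')" lifted.tsv annot.tsv \
    | awk -F'\t' '{n += ($2 == $4 && $3 == $5)} END {printf "%d / %d concordant\n", n, NR}'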
In theory, the WDL workflow can be executed anywhere with any of the execution engines listed at https://github.com/openwdl/wdl#execution-engines. I have only compiled and tested it on the Research Analysis Platform (RAP) and the DNAnexus Platform, which use dxCompiler.
# Compile the workflow
$ java -jar <path_to>/dxCompiler-2.4.3.jar compile liftover_plink_beds.wdl -project <project-xxxx>
workflow-yyyy
Where project-xxxx corresponds to your unique project ID on RAP, and workflow-yyyy is the ID of the compiled liftover workflow.
Once the workflow is compiled, first ensure a copy of each reference file is available in the corresponding RAP project; the liftover analysis can then be run by feeding in the BED/BIM/FAM files along with the corresponding liftover chain file (e.g. b37ToHg38.over.chain) and the target build's fasta or fasta.gz file (e.g. 1000genomes.grch38.fasta-index.tar.gz, or GRCh38_full_analysis_set_plus_decoy_hla.fa under /Bulk/Exome sequences/Exome OQFE CRAM files/helper_files/ on RAP). All input parameters can be provided in JSON format (e.g. liftover_input_template.json).
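For illustration, a hypothetical input JSON might look like the following; the actual parameter names are defined in liftover_plink_beds.wdl (see liftover_input_template.json), and on RAP the file inputs would typically be given as platform paths or file IDs.

$ cat > input.json <<'EOF'
{
  "liftover_plink_beds.bed_files": ["ukb_array_chr1.bed", "ukb_array_chr2.bed"],
  "liftover_plink_beds.bim_files": ["ukb_array_chr1.bim", "ukb_array_chr2.bim"],
  "liftover_plink_beds.fam_files": ["ukb_array_chr1.fam", "ukb_array_chr2.fam"],
  "liftover_plink_beds.chain_file": "b37ToHg38.over.chain",
  "liftover_plink_beds.target_fasta": "GRCh38_full_analysis_set_plus_decoy_hla.fa"
}
EOF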
# Execute the compiled workflow on RAP/DNAnexus
$ dx run workflow-yyyy -f input.json --brief -y
analysis-zzzz
Alternatively, input parameters can also be specified via the web UI of RAP.
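Once launched, the analysis can be monitored from the command line, using the same placeholder IDs as above:

# Check the overall state of the analysis and its stages.
$ dx describe analysis-zzzz
# Stream the logs of an individual stage's job, if needed.
$ dx watch job-xxxx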
Runtime/Cost*: The liftover analysis for all 26 chromosomes' PLINK files (Data-Field 22418) from GRCh37 to GRCh38 finished within 17.5 hours of wall-clock time on RAP. At the RAP compute rate, the cost is ~£56.5 with on-demand instances (priority High); with spot instances (priority Low), the cost can be lowered to ~£10.5.
* The estimates were generated in April 2021; all numbers are subject to change, and compute rates may change per RAP's Terms of Service. Depending on the instance type of choice, the implementation of the pipeline, and more, there is always room for further optimization of runtime, cost, or both.
Thanks to Tony Marcketta (@AMarcketta), Joe Foster, Chiao-Feng Lin (@chiaofenglin), Peter Nguyen, Chai Fungtammasan (@Chai_Arkarachai), and Jason Chin (@infoecho) for the discussions on the liftover process. This is part of our data preparation process for our research project using the UK Biobank Resource under Application Number ‘46926’. We thank all the participants in the UK Biobank study.
This content may contain proprietary information owned by DNAnexus and is for research purposes only. There is no expectation to sell or support the research and the user retains all accountability and liability attached to the use of the content. This content may be used under the MIT license.