Over 30 million people have taken DNA tests to determine their ancestry through computational genetic genealogy. By processing the digitized sequences of DNA bases, sophisticated algorithms can estimate which ethnic groups a person's ancestors came from.
DNA is sensitive personal data because it uniquely identifies an individual.
Fully Homomorphic Encryption (FHE) can make DNA ancestry identification secure for users, as it allows encrypted DNA sequences to be used during model training and prediction.
DNA ancestry identification is a complex process that involves multiple steps. First, DNA phasing assigns alleles (the As, Cs, Ts and Gs in DNA strands) to the paternal and maternal chromosomes. Second, ancestry can be determined by comparing specific segments of the DNA against large databases of DNA of known ancestry. An alternative is to use machine learning to classify each such segment and then aggregate the ancestry of the individual segments into a final classification.
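The segment-by-segment idea described above can be sketched in a few lines. This is a toy illustration only, not the method used by G-Nomix or by this project: the 0/1 allele encoding, the `references` panels, the mismatch-count classifier, and the function names are all assumptions made for the example.

```python
from collections import Counter


def classify_windows(haplotype, references, window_size):
    """Label each fixed-size window with the closest reference ancestry.

    Toy rule (an assumption): the closest reference is the one with the
    fewest mismatching positions inside the window.
    """
    labels = []
    for start in range(0, len(haplotype), window_size):
        segment = haplotype[start:start + window_size]

        def mismatches(name):
            ref_segment = references[name][start:start + window_size]
            return sum(a != b for a, b in zip(segment, ref_segment))

        labels.append(min(references, key=mismatches))
    return labels


def ancestry_proportions(labels):
    """Aggregate per-window labels into overall ancestry fractions."""
    counts = Counter(labels)
    return {name: count / len(labels) for name, count in counts.items()}


# Hypothetical phased haplotype and reference panels (alleles encoded as 0/1).
references = {"A": [0] * 8, "B": [1] * 8}
haplotype = [0, 0, 0, 1, 1, 1, 1, 1]

window_labels = classify_windows(haplotype, references, window_size=4)
proportions = ancestry_proportions(window_labels)
print(window_labels, proportions)
```

A real pipeline replaces the mismatch count with a trained classifier per window, but the flow of windowing, per-window classification, and aggregation stays the same.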
G-Nomix is a fast, scalable, and accurate local ancestry method for DNA ancestry prediction.
We chose it as our non-FHE model because it is simple and very effective.
You can read about the advantages of G-Nomix in the paper.
Our FHE model for this project has two stages: a classifier (base model) that produces an initial estimate of the ancestry probabilities within genomic windows, and a second module (smoother) that learns to combine and refine these estimates, significantly increasing accuracy.
For the classifier (base model), we use Logistic Regression, which is the more accurate option according to the benchmark.
For the smoother we use XGBoost as the most successful smoother, surpassing alternatives like linear convolutional filters and conditional random fields (CRFs).
Since it is an FHE model, we use the Concrete ML versions of the base and smoother models.
Our FHE model is therefore essentially a fork of the Gnomix model, which we named ConcreteGnomix.
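To make the two-stage design more concrete, the sketch below shows one plausible way the smoother's input could be assembled from the base model's per-window probabilities: each window is described by its own probabilities plus those of its neighbours. The `context` size, the edge padding (repeating the first or last window), and the function name are assumptions for illustration and are not taken from the ConcreteGnomix code.

```python
def smoother_features(window_probs, context):
    """Build one feature row per window for a second-stage smoother.

    Each row concatenates the ancestry probability vectors of the window
    itself and `context` neighbours on each side. Indices that fall
    outside the sequence are clamped to the nearest edge window
    (an assumed padding strategy).
    """
    n = len(window_probs)
    features = []
    for i in range(n):
        row = []
        for j in range(i - context, i + context + 1):
            k = min(max(j, 0), n - 1)  # clamp to a valid window index
            row.extend(window_probs[k])
        features.append(row)
    return features


# Hypothetical base-model output: 3 windows, 2 ancestries each.
base_probs = [[0.9, 0.1], [0.4, 0.6], [0.8, 0.2]]
features = smoother_features(base_probs, context=1)
for row in features:
    print(row)
```

Rows built this way could then be passed to a gradient-boosted classifier such as XGBoost, which would let the second stage correct an isolated window (like the middle one here) using its neighbours.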
The training dataset is generated from a query file from the 1000 Genomes Project.
To download it to the /data directory, execute the corresponding cell in main.ipynb.
For comparison, we tried three options:
- Gnomix model (Non-FHE)
- ConcreteGnomix model, which uses FHE simulation at both stages (Sim-FHE)
- ConcreteGnomix model, which uses FHE only at the first stage (Half-FHE)
| | Non-FHE | Sim-FHE | Half-FHE |
|---|---|---|---|
| Accuracy (%) | 97.75 | 97.26 | 97.26 |
| Inference time (s) | 0.912712 | 19.036097 | 826.388379 |
We got almost the same accuracy for both models.
This is an expected outcome, as the models are very similar.
Half-FHE's inference time is about three orders of magnitude greater than the Non-FHE time (roughly 900 times slower).
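The slowdown can be checked directly from the timings in the table above. The snippet below only redoes that arithmetic; the ratios do not depend on the time unit, and the dictionary names are chosen for this example.

```python
import math

# Inference times copied from the results table.
times = {"Non-FHE": 0.912712, "Sim-FHE": 19.036097, "Half-FHE": 826.388379}

baseline = times["Non-FHE"]
slowdown = {name: t / baseline for name, t in times.items()}
orders_of_magnitude = {name: math.log10(r) for name, r in slowdown.items()}

for name in times:
    print(f"{name}: {slowdown[name]:.1f}x slower, "
          f"10^{orders_of_magnitude[name]:.2f}")
```

FHE simulation costs roughly a factor of 20 over the plain model, while running the first stage in real FHE costs a factor of about 900, which is just under three orders of magnitude.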
Similar results were obtained in Season 4 bounty project submissions.
Compiled Concrete models cannot be pickled
This leads to the following problems:
- We can't use multiprocessing for Logistic Regression
- There is no easy way to save/dump model
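Both problems have the same root: Python's `multiprocessing` sends objects to worker processes with `pickle`, so anything that cannot be pickled can be neither shipped to a worker nor dumped to disk. The sketch below reproduces the failure with a stand-in object. A `threading.Lock` plays the role of the compiled circuit, and the `model_state` dictionary and `strip_unpicklable` helper are hypothetical. Whether dropping and later re-compiling the circuit is practical for Concrete ML models is not verified here.

```python
import pickle
import threading


def can_pickle(obj):
    """Return True if `obj` can be serialized with pickle."""
    try:
        pickle.dumps(obj)
        return True
    except (TypeError, AttributeError, pickle.PicklingError):
        return False


def strip_unpicklable(state, keys=("circuit",)):
    """Return a copy of the state without the unpicklable entries."""
    return {k: v for k, v in state.items() if k not in keys}


# Stand-in for a compiled model: plain weights plus an unpicklable handle.
model_state = {"weights": [0.2, -1.3], "circuit": threading.Lock()}

print(can_pickle(model_state))                     # fails because of the lock
print(can_pickle(strip_unpicklable(model_state)))  # plain data pickles fine

# Round-trip the picklable part; the circuit would have to be rebuilt.
restored = pickle.loads(pickle.dumps(strip_unpicklable(model_state)))
```

The trade-off in such a workaround is that every process that loads the stripped state has to pay the compilation cost again.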
Long inference/prediction time
Even in half-FHE mode, it takes around 15 minutes (on our server) to get a prediction for one query.
Because of this:
- We didn't get metrics for the full FHE model (we stopped prediction on one query after 1000 minutes)
- We don't see much value in the client-server deployment approach for this project
Based on the results of our work, we believe that Fully Homomorphic Encryption can make DNA ancestry identification secure for users.
Using Concrete ML models instead of non-FHE ones has little impact on accuracy (97.26 versus 97.75). Even a relatively long prediction time does not matter much, since processing a user's DNA sequence takes days anyway.
- Build docker container using files from /docker
- Clone this repo
- Run main.ipynb
- Change constants and/or config.yaml (if needed)
NOTE: Model training is a very RAM-intensive task. You need at least 100 GB of RAM to run main.ipynb with the default parameters.