Project Summary/Abstract
Population-scale genome sequencing projects such as the 1000 Genomes Project, TOPMed, and the All of Us Research Program
will generate genotype data for millions of individuals. This number increases substantially if the recreational
use of genetic data from genealogy companies, such as 23andMe, is accounted for. Sharing and analyzing
this data create monumental challenges for the privacy of participants. Hackers have recently begun targeting
genealogy databases, as in the 2020 breach of GEDmatch. Due to the large scale and high dimensionality of
genomic data, analysis workflows require large computational resources. This incentivizes companies, hospitals,
and research labs to outsource the analysis and interpretation of genomic data to third parties, with the result that
the genomic data is stored on untrusted third-party servers.
In this proposal, we focus on the secure outsourcing of genotype imputation, which is a computationally intensive
and central task in large-scale genotype analysis. Genotype imputation is the prediction of missing or low-quality
variant genotypes using a small set of variant genotypes that are measured using, for example, genotyping
arrays, low-coverage sequencing, or targeted sequencing. It is a vital step for analyzing raw genomic data for quality control,
predicting missing genotypes, variant phasing, and fine mapping of associations to identify causal variants. When
combined with sparse arrays, imputation can greatly reduce the cost of population-scale and family-based
genotyping. For example, the All of Us Research Program will rely on a custom genotyping array, the Infinium Global Diversity
Panel, to decrease the cost of genotyping millions of individuals. Imputation methods will be of vital importance
for this task. To perform tasks of this size, imputation methods require large computational resources
and are often outsourced to third-party “imputation servers”. These servers will soon process thousands, if not
millions, of genomes and store sensitive genomic data. Unfortunately, these services are not strictly secure
against either unauthorized hackers or curious users who have authorized access to the servers. There is
an urgent need for privacy-aware imputation methods that can be deployed even on untrusted third-party services
such as high-performance cloud platforms so that outsourcing can be safely performed at population scale.
Our proposed methods use state-of-the-art homomorphic encryption, which keeps genomic data encrypted
while in transit, at rest, and even while imputation is being performed. We design new and efficient “encryption-amenable”
methods and frameworks for protecting the study participants and their families, and for protecting
the population panels, such as those drawn from underrepresented populations. Our benchmarks show that the secure methods achieve
high imputation accuracy even on commodity hardware, with running times comparable to those of state-of-the-art non-secure
methods. The proposed methods can provide practical population-scale genomic privacy and security for imputation
and association studies.