Recent years have witnessed the development of large research projects that involve
genotyping hundreds of thousands of individuals for whom detailed medical records are available.
Examples include the All of Us Research Program, the Million Veteran Program, and the UK Biobank
resource. Often, whole-genome sequencing data is also available for a substantial fraction of the
individuals. These large samples, with their precise genotypic and phenotypic information, give
us the opportunity to bring our understanding of the relations between genetic variation and
traits of medical interest to the next level.
While the initially small sample sizes available for genome-wide association studies (GWAS)
motivated analyses that were approximate in nature, we are now in a position to probe more
closely the genetic causal mechanisms underlying medically relevant phenotypes. We can aspire
to distinguish variants that have causal effects from those that are associated because of linkage
disequilibrium or population structure. Indeed, we need to pay even greater attention to the
implications of hidden confounders: even small effects become significant when sample sizes are large.
Increasing the resolution with which we can describe causal mechanisms will result in the
identification of clearer targets for drug development. It will also improve the precision of
personalized risk evaluations based on genotypes: if we can construct risk scores using variants
that are truly causal, their performance will remain solid across ethnicities and environments.
To zoom in on genetic variants with causal effects, this project will leverage a set of new
statistical methodologies that the investigators have recently introduced. These new approaches
are remarkably flexible, in that they do not rely on specific assumptions of how phenotypes
are linked to genetic variants. Indeed, they allow researchers to capitalize on powerful machine
learning algorithms and, crucially, equip their results with precise replicability guarantees.
We have assembled a diverse and complementary team, including experts in statistical
genomics, methodological statistics and computer science, with a strong record in both software
development and genetic data analysis. A postdoctoral scholar and two graduate students will
contribute to the research program, and the interdisciplinary training they will acquire in
statistics, computation and genetics will be a further substantial benefit of the project.