Project Summary
Substance use disorders (SUDs) have heterogeneous clinical manifestations and an etiology in which environmental
and genetic risk factors are intertwined, demanding phenotype refinement and etiology elucidation for precise prevention,
diagnosis, and treatment. Many genome-wide association studies (GWASs) have been carried out in recent years,
aiming to discover the genetic risk factors of various forms of SUDs, such as cocaine and opioid use disorders.
The high level of heterogeneity in both the clinical presentation and the etiology of SUDs undermines efforts to
discover their genetic associations. As a result, the identified associations explain only a very small portion of
the heritability estimated in twin-based studies, implying that most of it remains unexplained. In existing
association studies, a heterogeneous composite trait (e.g., cocaine dependence diagnosis or diagnostic criteria
count) was often used as the outcome variable, so the specific set of phenotypes associated with the genetic variants
remains unclear. Furthermore, the lack of mechanistic understanding of the identified associations hampers the
translation of these discoveries into actionable targets to improve disease management. In response to these
challenges, novel machine learning methods will be developed to enable the integrative analysis of data from
multiple dimensions, including phenotype, environment, genotype, and functional genomics. The developed
methods will be employed to mine a large dataset aggregated for genetic studies of SUDs, together with data from
multiple repositories, such as dbGaP, UK Biobank, Roadmap, ENCODE, and NCBI GEO, aiming at 1) deriving
severity indices of SUDs that have maximal estimated heritability, 2) identifying novel genetic risk factors for
SUDs, 3) unraveling the association between heterogeneous clinical presentations and genetic variations in
candidate genomic regions, and 4) elucidating the functional impact of genetic variants associated with SUDs and
producing actionable findings. In Aim #1, a machine learning method based on heritable component analysis that
takes gene-environment interplay into account will be developed and used to derive severity
indices of SUDs, followed by GWASs. In Aim #2, a multi-view clustering framework that accounts for gene-
environment interplay will be developed and used to elucidate SUD phenotypes associated with genetic variations
in candidate genomic regions, followed by GWASs. In Aim #3, deep neural networks with novel architectures
will be trained under a novel multi-task learning framework to predict functional genomic events in various cell
types from a wide range of brain regions and used to elucidate the functional impact of the genetic variants
discovered by GWASs.