PROJECT SUMMARY
Due to mandates from funding agencies and publishers, high-throughput molecular data from Down syndrome
individuals and controls (mostly humans and mice) are available in public repositories. Researchers can use such
data to corroborate their own findings and pose new research questions. Doing so would help to leverage prior
investments and complement efforts by the INCLUDE Data Coordinating Center (DCC) to generate data for new
cohorts. Our proposal focuses specifically on mRNA expression and DNA methylation data. These data types
shed light on how genes are regulated, how molecular aberrations lead to medical conditions, and how medical
outcomes can be predicted, potentially leading to improved diagnostics, treatments, and insights into human
health and disease. However, many data-generation platforms are used for these data types, and researchers
use a wide range of techniques for normalizing the data, checking data quality (if they check at all), and mapping
to gene annotations. To be reused most effectively, the data must be reprocessed from their original form;
normalized and quality checked consistently; and mapped to current annotations. Agencies that manage public
repositories lack resources and expertise to perform these steps. In our first aim, we will address this problem using
a data-curation approach. We have identified 148 datasets specific to Down syndrome that we believe should be
prioritized for reuse. Using our expertise in molecular-data processing and bioinformatics, we will re-normalize,
quality-check, summarize, and annotate the data using an approach that maximizes consistency for all of the
datasets. Additionally, we will map the metadata to biomedical-ontology terms in collaboration with the INCLUDE
DCC. We expect that these efforts will reduce barriers for researchers in the Down syndrome community to reuse
the data and accelerate progress in the field. Our second aim focuses on interoperability. For many research
questions, a single dataset is insufficient. Sample sizes may be small and/or a single dataset may not represent
the range of phenotypes or other factors necessary to answer a given question. Therefore, it is often crucial to
integrate datasets from multiple sources. However, systematic differences between datasets are inevitable due to
differences in populations, laboratory conditions, and environmental factors. Failing to adjust for these differences
will likely lead to biased conclusions. To adjust for them, we will evaluate the feasibility of using generative neural networks, a type
of algorithm that is highly configurable and is behind many of the most influential artificial-intelligence advances
of the past decade. We will apply these algorithms in the context of studying medical conditions that co-occur
with Down syndrome, such as autoimmune conditions, dementia-related disease, congenital heart defects, and leukemias. Our
algorithms will search for systematic patterns that differ between datasets and generate a modified version of the
data in which those differences have been minimized yet the biologically relevant patterns have been retained.