RESEARCH SUMMARY
One of the key challenges in Alzheimer’s disease (AD) is the early detection of individuals who are at risk of
developing the condition, and subsequently making predictions about how rapidly their condition will progress.
To facilitate this, the parent grant of this proposal (R01-AG058676-01A1) brings together five leading, well-
characterized Alzheimer’s cohorts to clarify risk and protective factors for Alzheimer’s dementia: the Adult Children
Study (ACS), the Alzheimer’s Disease Neuroimaging Initiative (ADNI), the Australian Imaging, Biomarkers and
Lifestyle Flagship Study of Ageing (AIBL), the Dominantly Inherited Alzheimer Network (DIAN) and the National
Alzheimer’s Coordinating Center (NACC). The resulting cohort is critical for understanding the factors which
precipitate or delay dementia diagnosis. The large size of this dataset motivates the application of machine learning
(ML) and artificial intelligence (AI) methodologies to address the critical questions of classifying individuals
as at risk of AD and of predicting future progression in terms of disease, symptoms and
function.
However, there are stark contrasts between the AD space and those domains where ML/AI techniques have
been used to great success, such as computer vision, text mining and audio analytics. In particular, the smaller
sample size, even for the largest datasets collected to date, means that highly expressive models are just as likely
to detect technical artefacts in the data as real biological signal, especially when looking at high-dimensional
information, where AI/ML is most appropriate. Moreover, standard ML/AI approaches do not typically work with
missing values, a common feature of clinical AD cohorts, where certain types of measurements are difficult to
collect. Finally, the areas where AI/ML has fulfilled its promise are those where the barrier to entry for both ML
and domain specialists has been reduced. This democratisation has been achieved largely by providing benchmark
datasets that have been processed to remove technical artefacts, have comprehensive documentation and
adhere to the Findable, Accessible, Interoperable, and Reusable (FAIR) data principles.
In this grant, we seek to conduct high-resolution, multi-modal harmonisation and cross-dataset imputation of
the five leading AD cohorts, aiming to improve their suitability for AI/ML approaches and providing them in a form that is
FAIR, well-documented and easy to use with AI/ML frameworks. While the parent grant performs some harmonisation
of summary statistics (e.g. average amyloid levels from PET imaging, cognitive test summary statistics) and
imputation using classical approaches, this extension seeks to improve the level of granularity (e.g. entire images,
individual or small groups of questions in cognitive tests) at which harmonisation is performed, and incorporate
biological prior knowledge into data imputation. This will be achieved by leveraging recent advancements in
algorithmic bias removal and matrix completion, and will incorporate our understanding of disease processes and
the nature of the modalities being analysed. The latter will put additional constraints on the inference, allowing
it to produce more accurate and sensible estimates of measurements. We will provide FAIR-curated versions
of these datasets, along with software to enable researchers to adjust the level of harmonisation and imputation
as needed, based on the strengths and weaknesses of the underlying algorithms developed and implemented in
this proposal.
The integration of the five largest longitudinal AD cohorts, as per the parent grant of this proposal, provides
an invaluable opportunity to explore the power of AI/ML to improve our ability to detect AD early on and make
accurate forecasts about individuals’ change over time. This proposal seeks to bring modern AI/ML methods
firmly into the AD community by producing curated de-biased datasets that can be seamlessly integrated into
most existing ML pipelines. By improving the quality of the underlying data at a higher resolution than has been
done before, predictive and prognostic models derived from this work are likely to be substantially more powerful
than past approaches.