A critical characteristic of human language is our ability to understand multi-word sequences whose
meaning is greater than the sum of their parts. Recent work from the PIs of this proposal (Toneva and Wehbe,
2019; Jain and Huth, 2018) and others (Schrimpf et al., 2020a; Caucheteux and King, 2020) has shown that
cortical representations of multi-word sequences can be modeled far more accurately than previously possible by
using neural network language models, a machine learning approach that has revolutionized the natural
language processing (NLP) field (Devlin et al., 2019; Radford et al., 2019). However, under the current
paradigm these models must first be trained on separate NLP tasks and only then used to model the brain,
creating a guess-and-check cycle that is not guaranteed to converge on the actual computations that
humans perform. Here we propose to break this cycle by directly training neural network models to
estimate the functions that the brain uses to combine words. To optimally predict fMRI and MEG
responses, these models will need to capture the composition principles governing which words the brain
attends to, and how information is combined across words. These models will help uncover specific
computations underlying language processing in the brain, enable computational testing of neurolinguistic
theories, and inspire or directly improve models used in NLP.
Accomplishing these goals, however, will require overcoming one major obstacle: training neural network
language models typically requires orders of magnitude more data than existing neuroimaging
datasets provide. To address this issue, a central goal of the proposed project is to collect a very large fMRI and
MEG dataset comprising roughly one million words of natural language stimuli. We plan to use this unique
dataset and computational modeling framework to address three scientific aims.
Aim 1: Create brain activity prediction benchmarks to foster interaction between neuroscience and NLP.
Aim 2: Use data-driven models to test existing neurolinguistic theories and develop new accounts of the computations underlying word composition in the brain.
Aim 3: Leverage information in different brain areas to help solve computationally defined language tasks.
Successful completion of the proposed work will provide mechanistic insight into language processing,
with a computational architecture tracing information flow among brain areas and describing the tasks they
perform. Beyond its basic cognitive neuroscience implications, we expect this work will enable better
understanding of language impairments and help identify targeted therapies.
RELEVANCE:
By collecting, analyzing, and disseminating large-scale neuroimaging data recorded while
participants listen to natural, narrative speech, this proposal aims to improve our understanding of the
normal function of the human language system. Specifically, this work seeks to improve and validate
computational models of speech and language processing in the human brain.