Abstract
Meaning in speech is conveyed by time-varying structures, such as phonemes and words, that have highly
variable durations. As a consequence, there is a fundamental difference between integrating across physical
time (e.g., 100 ms) and speech structure (e.g., a phoneme). Auditory neurophysiology models typically assume
that neural integration is yoked to physical time, while many psycholinguistic theories posit that integration in
speech is yoked to abstract structures such as phonemes. At present, very little is known about whether neural
computations in the cortex are yoked to time or structure. As a result, it is unclear whether there is a change
from time- to structure-yoked integration across the cortex, and if so, where this transition occurs and what types
of structures and computations might explain it. Filling this knowledge gap is essential to linking auditory models
and cognitive theories, constructing integrated neurocomputational models of auditory-speech processing, and
understanding how auditory deficits and neurological disorders impact the neural computations that underlie
speech perception. Here, we fill this knowledge gap by systematically testing whether neural integration windows
throughout the human cortex are yoked to time or structure, and by developing unified computational models that
can account for both time- and structure-yoked computation in the brain. Our experimental approach is to rescale
the duration of all speech structures (e.g., via time stretching or compression) and measure the extent to which the
neural integration window rescales with structure duration (the logic of this rescaling test is sketched below). We
measure integration windows using temporally
precise intracranial recordings from human neurosurgical patients, combined with a novel experimental method
that makes it possible to estimate integration windows from highly nonlinear systems like the brain (Aim I). We
also use the dense, whole-brain coverage of functional MRI to spatially map time- and structure-yoked integration
(Aim II), and we leverage statistical decomposition techniques developed by the PI to integrate our intracranial
and fMRI data. Finally, we use encoding models to directly examine the neural integration of specific, theoretically
important acoustic features and speech structures, and we develop new computational models that can
explain time- and structure-yoked integration in a common framework (Aim III). Preliminary data suggest there
is a transition from time- to structure-yoked integration across the putative cortical hierarchy, with weak structure
yoking in the superior temporal gyrus, where selectivity for speech structure first emerges, and strong structure
yoking in higher-order regions of the superior temporal sulcus that integrate over longer multi-second timescales.
We also show that deep neural networks trained to recognize speech structure directly from sound learn to
integrate across speech using short time-yoked windows at early layers and long structure-yoked windows at
later layers, providing a promising model to account for our neural data and helping to bridge the gap between
traditional auditory and psycholinguistic theories.
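
To make the logic of the rescaling test concrete, the following minimal sketch (hypothetical Python, not code from the proposal; the function name, the 100 ms base window, and the stretch factors are all illustrative assumptions) contrasts an idealized time-yoked unit, whose integration window is fixed in physical time, with an idealized structure-yoked unit, whose window rescales with the stimulus stretch factor:

    import numpy as np

    def measured_window(base_window_s, stretch, yoking):
        # Hypothetical integration window (in seconds) after stretching the
        # stimulus by `stretch`.
        #   yoking = 0 -> time-yoked: the window is fixed in physical time.
        #   yoking = 1 -> structure-yoked: the window rescales with structure
        #                 duration.
        return base_window_s * stretch ** yoking

    stretches = np.array([0.5, 1.0, 2.0])  # compressed / original / stretched
    for label, yoke in [("time-yoked", 0.0), ("structure-yoked", 1.0)]:
        windows = measured_window(0.1, stretches, yoke)
        # Yoking index: slope of log(window) vs. log(stretch);
        # ~0 for a time-yoked unit, ~1 for a fully structure-yoked unit.
        slope = np.polyfit(np.log(stretches), np.log(windows), 1)[0]
        print(f"{label}: windows (s) = {windows}, yoking index = {slope:.2f}")

In this framing, the slope of log window size against log stretch factor serves as a graded yoking index, with intermediate values capturing the partial structure yoking that the preliminary data suggest in regions such as the superior temporal gyrus.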