Studying eye movements has been central to understanding active vision, attention, and cognition.
Computational models have advanced the field by identifying the visual features and computations that guide eye movements and have aided the understanding of human visual and cognitive dysfunctions. However, even after 20 years of computational modeling, we are still far from adequately modeling eye movements and decisions in natural tasks with real-world images. Models often fail to incorporate how vision degrades toward the visual periphery, do not represent the observer's intention (task), and, critically, lack a learned understanding of scenes, objects, and language to guide fixations. Our goal is to combine
developments in powerful vision Transformer models with computational models of human vision to create a
Foveated Search Transformer (FST) model that can follow simple linguistic instructions to execute eye movements that gather task-relevant information, informed by an understanding of other objects in the scene. Our work
will focus on visual search for objects in real-world scenes “never seen” by the model. We hypothesize that the
developed FST model will reach human accuracy levels and capture landmark eye movement behaviors, such as the effects of contextual manipulations (the location, size, and semantic relationship of the target object to the surrounding scene). We also hypothesize that the model will predict human behavior and fixations better than baseline models such as saliency, DeepGaze, and a version of the FST with its contextual understanding disabled. To achieve our goal, we propose two specific aims. SA1. To develop a Foveated Search Transformer (FST) model that learns task-optimizing eye movements, understands scene semantics, and
captures landmark contextual effects of human search; SA2. To develop a vision-language Foveated Search
Transformer (FST-L) model that can interpret language and search for specific targets with descriptive details
provided in a sentence. The developed FST models will be compared to human eye movements and search
decisions, as well as to baseline models. If successful, the newly developed model will open many new avenues of research on eye movements using more naturalistic tasks and allow prediction of the functional impact of visual disorders on eye movements and subsequent perceptual decisions. The model will also provide a tool to
expand current investigations of search-related neural activity using computational models.
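For concreteness, the loop the FST implies can be read as: foveate the image around the current fixation, evaluate the foveated view together with the task instruction, and select the next fixation. The Python below is a minimal sketch under that reading; the ring-based foveation scheme, the parameters (e.g., fovea_radius), and the brightest-pixel policy are illustrative placeholders of our own, not the proposed Transformer architecture.

import numpy as np

def foveate(img, fix, fovea_radius=32):
    """Degrade resolution with eccentricity from `fix`: concentric rings
    farther from fixation get progressively coarser (subsampled) copies
    of the image, a crude stand-in for peripheral vision loss."""
    h, w = img.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    ecc = np.hypot(ys - fix[0], xs - fix[1])  # eccentricity of each pixel
    out = img.astype(float).copy()
    for ring, block in [(1, 2), (2, 4), (3, 8)]:
        mask = ecc > ring * fovea_radius
        coarse = img[::block, ::block].repeat(block, 0).repeat(block, 1)[:h, :w]
        out[mask] = coarse[mask]
    return out

def next_fixation(foveated, instruction):
    """Placeholder policy: the proposed FST would score locations with a
    language-conditioned Transformer; here we simply take the brightest
    location in the foveated view (the `instruction` is ignored)."""
    flat = int(np.argmax(foveated))
    return np.unravel_index(flat, foveated.shape[:2])

# Greedy search loop over a random grayscale "scene".
img = np.random.rand(256, 256)
fix = (128, 128)  # start at the image center
for _ in range(5):
    view = foveate(img, fix)
    fix = next_fixation(view, "find the cup")
print("final fixation:", fix)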