Development and validation of diffusion-model enabled computational observers for clinical diagnostic task performance evaluation of deep learning CT algorithms - Objective assessment of image quality (IQ) using task-based performance metrics is important for developing deep-learning (DL) image reconstruction and post-processing algorithms, and for translating the champion algorithms to the clinic. Ideally, task performance evaluation should use human observers. However, human observer studies are time-consuming and expensive, and therefore cannot realistically be applied to evaluating and optimizing algorithms during early-stage development. Computer-based model observers (MOs) that mimic human performance can substitute for human observers and hold great potential as a tool for IQ evaluation. Nevertheless, MO studies are not often used for DL algorithm evaluation. One possible reason is the weaknesses of current MOs for task performance evaluation. Diagnostic tasks are complicated by imperfect data acquisition (e.g., quantum noise) and by large patient anatomical variation. Historically, patient anatomical variation has been quantitatively intractable, while quantum noise can be handled well by the known data acquisition model. In this context, the current paradigm of MOs circumvents our inability to quantify patient anatomical variation and instead relies on multiple realizations of quantum noise to define a patient population. This approach leads to the intertwined weaknesses of (1) simplified task definitions and (2) a burdensome data requirement, which may contribute to the absence of MOs in DL algorithm evaluation. The challenge of characterizing patient anatomical variation can now be overcome, thanks to the advent of score-based probabilistic diffusion models (DMs). Score-based DMs, in addition to generating image samples as all generative models do, have the unique capability of calculating the exact log-density (or log-likelihood) of image samples.
This capability can be exploited to calculate (log-)likelihood-ratio-type test statistics, and it allows us to define a new paradigm of MOs for realistic clinical tasks with free-form, essentially unlimited signal and background variability. Here, “unlimited” means “to the extent of variation that the DM training data are curated for.” At the same time, relying on anatomical variation removes the burdensome requirement for multiple noise realizations, so that our new MOs are also data-efficient and applicable to patient data that do not come with multiple noise realizations. Our specific aims focus on (1) developing the computational framework of such diffusion-model enabled MOs and (2) validating MO performance using clinical detection and localization tasks. Our MOs break free from the historical confines that have hindered the adoption of MOs for routine task-performance evaluation of diagnostic image quality. They hold great potential for identifying promising early-stage algorithms, for offering a performance predictor that correlates well with diagnostic task performance, and for eventually translating the champion algorithms to clinical deployment for improved patient care.
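
As a rough illustration of the (log-)likelihood-ratio test statistic mentioned above, the following Python sketch scores images for a signal-present versus signal-absent detection task and summarizes performance with the area under the ROC curve (AUC). It is a toy under stated assumptions, not the proposed framework: the two isotropic Gaussian log-densities are stand-ins for the log-likelihoods that trained diffusion models would supply, the 16-dimensional vectors stand in for images, and all function names are hypothetical.

```python
import numpy as np


def gaussian_log_density(x, mean, var):
    """Log-density of an isotropic Gaussian.

    Assumption: this is only a placeholder for the log-likelihood that a
    trained score-based diffusion model would compute for an image.
    """
    d = x.shape[-1]
    sq_dist = np.sum((x - mean) ** 2, axis=-1)
    return -0.5 * (sq_dist / var + d * np.log(2.0 * np.pi * var))


def log_likelihood_ratio(x, log_p_present, log_p_absent):
    """Test statistic: log p(x | signal present) - log p(x | signal absent)."""
    return log_p_present(x) - log_p_absent(x)


def auc(scores_present, scores_absent):
    """Mann-Whitney estimate of the area under the ROC curve (ties count half)."""
    diff = scores_present[:, None] - scores_absent[None, :]
    return float(np.mean(diff > 0) + 0.5 * np.mean(diff == 0))


# Toy data: 16-dimensional "images" with unit-variance Gaussian backgrounds.
rng = np.random.default_rng(0)
dim, n_cases = 16, 500
signal = np.full(dim, 0.5)  # hypothetical known signal profile

absent_cases = rng.normal(0.0, 1.0, size=(n_cases, dim))
present_cases = rng.normal(0.0, 1.0, size=(n_cases, dim)) + signal


def log_p_absent(x):
    return gaussian_log_density(x, np.zeros(dim), 1.0)


def log_p_present(x):
    return gaussian_log_density(x, signal, 1.0)


# Score each case with the likelihood-ratio statistic and compute the AUC.
t_present = log_likelihood_ratio(present_cases, log_p_present, log_p_absent)
t_absent = log_likelihood_ratio(absent_cases, log_p_present, log_p_absent)
observer_auc = auc(t_present, t_absent)
print(f"Toy likelihood-ratio observer AUC: {observer_auc:.3f}")
```

In this toy setting each case is a single noisy sample, so no repeated noise realizations of the same background are needed. Replacing the two Gaussian placeholders with model-computed log-likelihoods is the step that the abstract attributes to score-based DMs and that this sketch does not attempt.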