Project Summary
Artificial intelligence (AI) systems are increasingly used in radiation oncology for tasks such as image
reconstruction and registration, autosegmentation, synthetic CT generation, and treatment planning. However,
AI design fundamentally challenges existing quality assurance (QA) paradigms, imperiling the quality and
safety of AI in clinical use. Addressing the unmet need for QA of clinical AI is critical, as the potential for
performance degradation of AI systems in the clinic is high. Domain shift, in which the distribution of data
encountered during deployment differs from the distribution of data used during training, is a critical problem that
can lead to significant errors in AI performance. Domain shift is common in clinical environments, where
scanner performance varies over time due to changes in imaging protocols or sequences, equipment
degradation, or replacement with a different make or model. Monitoring clinical AI system performance for signs
of domain shift is of utmost importance to ensure safe and high-quality use. Development of robust QA tools and
practices to verify and monitor the performance of AI systems is therefore critical as these systems enter the
clinical arena. In this project, we will develop a new type of QA approach suitable for closed-source clinical AI
systems. Our approach, supported by our preliminary data, is to design a series of detectors that monitor the
input imaging data and the AI system output for changes and link these changes to an actionable tolerance through
a prediction model, without requiring access to the AI system's internals. Our overall hypothesis is that the
expected performance of a clinical AI system can be predicted within 5% error by monitoring only the AI system inputs
and outputs. In Specific Aim 1, we will develop a QA framework for AI systems that were trained with a ground-truth
set of labels, using autosegmentation as a model system. We will build compression algorithms to encode
features from the distribution of inputs (images) and, separately, from the distribution of outputs (contours). We
will then build a prediction model that takes these distributions as input and predicts contour accuracy. We will
develop the QA framework on a set of existing AI systems, including two commercial and several in-house
autosegmentation algorithms. In Specific Aim 2, we will focus on AI systems that do not use a ground truth
during training, using synthetic CT generation as a model system. As in SA1, we will use compression to build
distributions of input and output latent features. Instead of predicting accuracy (which requires a ground truth),
we will develop a model to monitor the distribution of outputs. In Specific Aim 3,
we will deploy our quality assurance frameworks in a prospective, multi-institutional clinical study and
evaluate their effectiveness in ensuring the safe and high-quality deployment of clinical AI systems. We will also share
our frameworks and data with the broader community to promote best practices in AI quality assurance. We
expect that our QA framework will significantly improve the safety and effectiveness of clinical AI systems in
radiation oncology by ensuring that these systems are robust to domain shift and other sources of error.
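To make the input-monitoring idea concrete, the following is a minimal, purely illustrative sketch (not the project's actual detector design): deployment image features (here, a generic feature vector; in practice these might be intensity statistics or encoder latents) are compared against the training distribution, and a shift beyond a tolerance is flagged as possible domain shift. The feature choice, z-score test, and threshold are all assumptions for illustration.

```python
import numpy as np

class DriftDetector:
    """Illustrative domain-shift detector: compares deployment feature
    batches to training-set statistics. The z-score test and threshold
    are assumptions for this sketch, not the project's actual method."""

    def __init__(self, train_features: np.ndarray, tolerance: float = 5.0):
        # train_features: (n_samples, n_features) from the training set.
        # tolerance is a deliberately conservative z-score threshold.
        self.mu = train_features.mean(axis=0)
        self.sigma = train_features.std(axis=0) + 1e-8
        self.tolerance = tolerance

    def check(self, batch: np.ndarray) -> bool:
        # Standardized shift of the deployment batch mean, per feature,
        # using the standard error of the mean. True = consistent with
        # the training distribution; False = drift flagged.
        se = self.sigma / np.sqrt(len(batch))
        z = np.abs(batch.mean(axis=0) - self.mu) / se
        return bool(np.all(z < self.tolerance))

rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, size=(500, 8))   # stand-in for training features
detector = DriftDetector(train)

in_dist = rng.normal(0.0, 1.0, size=(50, 8))  # same distribution as training
shifted = rng.normal(1.5, 1.0, size=(50, 8))  # simulated scanner change
print(detector.check(in_dist))   # → True  (no drift flagged)
print(detector.check(shifted))   # → False (drift flagged)
```

In a deployed system, a detector of this kind would run on every incoming case or batch, and its flag would be linked, through a prediction model, to an actionable QA tolerance rather than a simple pass/fail.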