Wednesday, September 17, 2025

From Text to Translation: Using Language Models to Resolve and Classify Variants

Award Number: R21HG014015
ORGANIZATION: NATIONAL HUMAN GENOME RESEARCH INSTITUTE
OPDIV: NIH
AWARD CLASS: DISCRETIONARY
AWARD ACTIVITY TYPE: SCIENTIFIC/HEALTH RESEARCH (INCLUDES SURVEYS)
PERIOD OF PERFORMANCE START DATE: 09/01/2025
PERIOD OF PERFORMANCE END DATE: 08/31/2027


Project Summary: Deep learning methods toward resolving uncertain variant classifications

Genomic sequencing can substantially improve clinical management by optimizing surveillance and treatment options and by improving risk assessment. As clinical interpretation of genetic variants expands, thousands of new variant interpretations enter variant databases each month. Most variants in these databases have insufficient evidence to be classified as pathogenic or benign and are therefore classified as Variants of Uncertain Significance (VUSs). Although these variants may increase risk, information about them cannot be communicated to providers or patients because structured evidence is lacking. This translational gap prevents the many patients who carry such variants from benefiting from genomic medicine.

ClinVar, a large diagnostic variant database, contains a unique abundance of predictive information curated by clinical experts over many years. This includes over 1.1 million plaintext diagnostic reports that often describe case data, literature review, and an analysis of computational predictions or functional assay data. We will use these clinical reports to predict pathogenicity and to identify which specific sources of evidence of pathogenicity each report provides. This project will enhance the value of data in ClinVar, a public resource used by thousands of investigators, clinicians, and bioinformatic pipelines.

We will first optimize a text classification model to make predictions from diagnostic summaries, evaluating and fine-tuning a set of large language models that have been trained on different text corpora. Using clinical reports and known classifications from ClinVar variant submissions, we will evaluate different filtering criteria used in the training process. We will measure performance on high-confidence labeled data previously reviewed by expert panels, as well as on bona fide VUSs, using expert-panel-curated variant interpretations as ground-truth validation data.

Next, we will identify the information in these reports that drives predictions using post-hoc explainability methods (attention mapping, representation probing, and causal mediation analysis), and then map this evidence to biomedical concepts related to variant interpretation and pathogenicity, using a knowledge graph refined to highlight the concepts relevant to diagnostic review criteria.

Finally, we will measure the extent to which these approaches can identify complementary evidence across reports generated by different clinical labs for the same variant, which can be used to reclassify a VUS or resolve a variant with conflicting interpretations. We will manually review a set of clinical reports to evaluate the accuracy of the recovered sources of information. If evidence is sufficient, we will identify up to 100 variants carried by participants in the Mass General Brigham biobank and attempt to update their variant classifications so that these results can be communicated to patients.
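As an illustration only, the final step (pooling evidence across reports from different labs for the same variant) might be sketched as follows. Everything here is an assumption made for the example: the record layout, the variant and lab names, the evidence tags, and the simplified conflict rule are hypothetical and are not the project's method or ClinVar's actual schema.

```python
from collections import defaultdict

# Hypothetical records: (variant_id, submitting_lab, classification, evidence_tags).
# The evidence tags stand in for the sources a model might recover from each report.
submissions = [
    ("var1", "LabA", "Uncertain significance", {"case_data"}),
    ("var1", "LabB", "Uncertain significance", {"functional_assay"}),
    ("var2", "LabA", "Pathogenic", {"literature"}),
    ("var2", "LabC", "Benign", {"computational"}),
]


def pool_by_variant(records):
    """Group submissions by variant and pool their classifications and evidence."""
    pooled = defaultdict(lambda: {"classes": set(), "evidence": set(), "per_lab": {}})
    for variant_id, lab, classification, evidence in records:
        entry = pooled[variant_id]
        entry["classes"].add(classification)
        entry["evidence"] |= set(evidence)
        # Keep each lab's own evidence so it can be compared with the pooled set.
        entry["per_lab"].setdefault(lab, set()).update(evidence)
    return dict(pooled)


def is_conflicting(entry):
    """Simplified assumption: any disagreement between submitters is a conflict."""
    return len(entry["classes"]) > 1


def has_complementary_evidence(entry):
    """True when the pooled evidence is broader than what any single lab reported."""
    largest_single_lab = max(len(ev) for ev in entry["per_lab"].values())
    return len(entry["evidence"]) > largest_single_lab


pooled = pool_by_variant(submissions)
for variant_id, entry in sorted(pooled.items()):
    print(variant_id, is_conflicting(entry), has_complementary_evidence(entry))
```

In this toy data, var1 has consistent classifications but evidence split across two labs, which is the kind of case where combined evidence could support reclassifying a VUS; var2 shows disagreement between submitters.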


Issue Date FY	Funding FY	Legal Entity Name	Legal Entity Address	Legal Entity City	Legal Entity State	Legal Entity Zip Code	Legal Entity COUNTY	Legal Entity COUNTRY	Assistance Listing	Award Code	Budget Year	Action Date	Action Type	Action Amount

Issue Date FY: 2025 ( Subtotal = $240,647 )
2025	2025	BRIGHAM & WOMENS HOSPITAL INC	75 FRANCIS ST	BOSTON	MA	02115	SUFFOLK	USA	Human Genome Research	000	1	8/15/2025	NEW	$240,647
														Subtotal = $240,647

Grand Total All Awards = $240,647
