Some of the largest challenges in connecting a patient’s health status to the genomic data at hand stem from the lack of an aggregated summary of published research data. This area is well suited to AI methods such as Large Language Models (LLMs). Even so, few clinical applications of LLMs are in common use today, both because existing models have limited coverage and because little is known about the sensitivity and specificity of LLMs in a clinical setting.
LLMs that can interpolate from aggregated research data should perform better than LLMs that are extrapolating results from a data corpus. General-purpose LLMs may also suffer from confounding observations because they lack a topical or disease focus. In short, while applying LLMs to the clinical interpretation of a patient’s genomic information holds promise, the field lacks a rigorous and objective framework for assessing the state of the art.
The development of PhenoPackets¹ has provided an opportunity to construct a necessary and objective framework to evaluate different variant ranking and interpretation methods to assess their “clinical” performance. PhenoPackets catalog known genomic mutations that have been associated with disease conditions observed in patients; these mutations can be seeded into virtual genome-wide allelic profiles that are constructed from a healthy cohort such as 1KGP² to create virtual patients with a known disease and the disease’s known genomic variant.
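The seeding step described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the procedure used in any published benchmark: the `Variant` and `VirtualPatient` types, the `make_virtual_patient` helper, and all coordinates and phenotype IDs are hypothetical placeholders introduced here.

```python
from __future__ import annotations

import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    """A single variant call; the fields are illustrative, not a standard schema."""
    chrom: str
    pos: int
    ref: str
    alt: str


@dataclass
class VirtualPatient:
    patient_id: str
    phenotypes: list[str]    # phenotype term IDs taken from the case record
    variants: list[Variant]  # healthy background plus the seeded variant
    causal: Variant          # ground truth, hidden from the ranking method


def make_virtual_patient(patient_id, background, causal, phenotypes, rng):
    """Seed one known disease variant into a healthy background profile."""
    # Drop any identical background call so the causal variant appears once.
    variants = [v for v in background if v != causal]
    # Insert at a random position so list order leaks no information.
    variants.insert(rng.randint(0, len(variants)), causal)
    return VirtualPatient(patient_id, list(phenotypes), variants, causal)


# Hypothetical data: coordinates, alleles, and the phenotype ID are placeholders.
background = [Variant("1", 1000 + i, "A", "G") for i in range(50)]
causal = Variant("2", 5000, "C", "T")
patient = make_virtual_patient(
    "vp-001", background, causal, ["HP:0000001"], random.Random(0)
)
```

Keeping the causal variant as a separate field makes the ground truth explicit, so any ranking method can later be scored against it without inspecting the variant list itself.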
This framework of “diseased” virtual patients can be used to assess a variant ranking method’s ability to identify disease within a patient (sensitivity) and to identify the disease-causing mutation within the genomic data of the patient (specificity). Because many approaches return a rank-ordered list of variants, “specificity” is best evaluated at several cutoffs (the top-ranked variant, the top five variants, and so on) across a cohort of virtual patients large enough to yield statistically meaningful performance comparisons.
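The rank-cutoff evaluation described above can be sketched as follows. This is a minimal illustration, assuming each method returns an ordered list of variant identifiers per patient; the function names and the example ranks are hypothetical and are not taken from any published result.

```python
def rank_of_causal(ranked_variants, causal):
    """Return the 1-based rank of the causal variant, or None if it is absent."""
    for rank, variant in enumerate(ranked_variants, start=1):
        if variant == causal:
            return rank
    return None


def top_k_hit_rate(ranks, k):
    """Fraction of patients whose causal variant is ranked within the top k."""
    if not ranks:
        raise ValueError("at least one patient is required")
    hits = sum(1 for r in ranks if r is not None and r <= k)
    return hits / len(ranks)


# Hypothetical ranks for eight virtual patients; None means the causal
# variant was not returned at all and counts as a miss at every cutoff.
ranks = [1, 3, 7, None, 1, 12, 2, 5]
summary = {k: top_k_hit_rate(ranks, k) for k in (1, 5, 10)}
```

Reporting the same cohort at several cutoffs makes it possible to compare methods both on how often the top-ranked variant is correct and on how quickly the hit rate saturates as the list grows.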
This PhenoPacket-derived evaluation framework was recently used to evaluate the performance of the InheriNext® system from Compass Bioinformatics and four publicly available systems in common use. The results³ reveal that InheriNext’s ranking algorithm led the way with sensitivities for the top variant, top 5 variants, and top 10 variants of 84.6%, 95.0%, and 98.6%, respectively. Importantly, the time required to obtain a result of this quality from each raw exome sequencing dataset was ~5 minutes.
From here, interpretation of the top ranked variants is an area where LLMs can be used to evaluate these variants within the context of a patient’s presentation. The use of LLMs is critical in converting these ranked variants into useful variant interpretations that can help guide disease diagnosis and treatment.
While LLMs are already able to produce useful results for variant interpretation, there are many areas of potential improvement in the application of AI methods broadly, and LLMs specifically, to address the clinical interpretation challenges in genomic medicine. One opportunity for improvement involves the integration of LLMs with ranking methods to maintain very high sensitivity in the top ranked variant (or even the top 3 variants) rather than the top 5 or top 10 variants. A more succinct result would simplify clinical application.
Another advance would be to use patient presentation to select a more focused LLM that considers a patient’s medical context to help diagnose subtle differences in the course of disease that could influence therapy decisions.
An additional improvement would leverage a patient’s genetic ethnicity to focus an LLM on those factors known to be relevant within specific ethnicities and not others. This would help reduce the historic bias in clinical knowledge, which has been derived primarily from European populations. It is fortunate that these are all directions of ongoing research and development in the broader field of AI being applied to improve the many applications of genomic medicine. We should anticipate and encourage the efforts of all groups working to improve AI’s impact on genomic medicine.
¹ Danis, D., et al. (2025). A corpus of GA4GH Phenopackets: Case-level phenotyping for genomic diagnostics and discovery. Human Genetics and Genomics Advances, 6(1), 100371.
² https://en.wikipedia.org/wiki/1000_Genomes_Project
³ Chang, J.-Y., et al. (2025). Evaluating a Standard Benchmark for Gene Prioritization: The InheriNext® Algorithm’s Integration of Genomic and Phenotypic Information. bioRxiv, 2025-02. Preprint.