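Dexter Pratt and Trey Ideker of the University of California, San Diego, and their colleagues have evaluated large language models for their ability to discover the function of gene sets. The study was published online in Nature Methods on November 28, 2024.
According to the authors, gene set enrichment is a mainstay of functional genomics, but it relies on gene function databases that are incomplete.
The researchers evaluated five large language models (LLMs) for their ability to discover the common functions represented by a gene set, with each answer supported by a molecular rationale and a self-confidence assessment. For curated gene sets from Gene Ontology, GPT-4 suggested functions similar to the curated name in 73% of cases, and higher self-confidence predicted higher similarity.
Conversely, random gene sets correctly yielded zero confidence in 87% of cases. The other LLMs (GPT-3.5, Gemini Pro, Mixtral Instruct and Llama2 70b) varied in function recovery but were falsely confident for random sets.
In gene clusters from omics data, GPT-4 identified common functions for 45% of cases, fewer than functional enrichment but with higher specificity and gene coverage. Manual review of the supporting rationale and citations found these functions to be largely verifiable.
Overall, the results position LLMs as valuable omics assistants.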
Appendix: original English text
Title: Evaluation of large language models for discovery of gene set function
Author: Hu, Mengzhou; Alkhairy, Sahar; Lee, Ingoo; Pillich, Rudolf T.; Fong, Dylan; Smith, Kevin; Bachelder, Robin; Ideker, Trey; Pratt, Dexter
Issue&Volume: 2024-11-28
Abstract: Gene set enrichment is a mainstay of functional genomics, but it relies on gene function databases that are incomplete. Here we evaluate five large language models (LLMs) for their ability to discover the common functions represented by a gene set, supported by molecular rationale and a self-confidence assessment. For curated gene sets from Gene Ontology, GPT-4 suggests functions similar to the curated name in 73% of cases, with higher self-confidence predicting higher similarity. Conversely, random gene sets correctly yield zero confidence in 87% of cases. Other LLMs (GPT-3.5, Gemini Pro, Mixtral Instruct and Llama2 70b) vary in function recovery but are falsely confident for random sets. In gene clusters from omics data, GPT-4 identifies common functions for 45% of cases, fewer than functional enrichment but with higher specificity and gene coverage. Manual review of supporting rationale and citations finds these functions are largely verifiable. These results position LLMs as valuable omics assistants.
DOI: 10.1038/s41592-024-02525-x
Source: https://www.nature.com/articles/s41592-024-02525-x
Nature Methods: launched in 2004, part of the Springer Nature publishing group. Latest IF: 47.99
Official website: https://www.nature.com/nmeth/
Submission link: https://mts-nmeth.nature.com/cgi-bin/main.plex