
PubMed GPT: a Domain-Specific Large Language Model for Biomedical Text

Summary

This article discusses a partnership between MosaicML and the Stanford Center for Research on Foundation Models (CRFM) that led to PubMed GPT, a domain-specific large language model (LLM) for biomedicine. The model was trained on the PubMed Abstracts and PubMed Central portions of the Pile dataset, using MosaicML Cloud with 128 NVIDIA A100-40GB GPUs. The results show that models trained on domain-specific data can outperform general-purpose models and compete with expert-designed, domain-specific model architectures. PubMed GPT was evaluated on several question-and-answer (QA) benchmarks and compared against models such as DRAGON, GPT-Neo 2.7B, Galactica, BioLinkBERT, and PubMedBERT. It outperformed DRAGON on MedQA-USMLE, setting a new state of the art, and matched DRAGON’s performance on two other QA tasks, PubMedQA and BioASQ. The article concludes that LLMs are remarkably versatile, that pre-training on domain-specific data beats general-purpose data, and that focused models achieve higher quality with fewer resources.

Q&As

What are the capabilities of industry-specific large language models?
The capabilities of industry-specific large language models include natural language generation, image generation, speech synthesis, and multi-modal combinations of these applications.

How was the PubMed GPT model trained?
The PubMed GPT model was trained by the Stanford Center for Research on Foundation Models (CRFM) on the MosaicML Cloud platform. It was based on a HuggingFace GPT model (a decoder-only transformer) with 2.7B parameters and a maximum context length of 1024 tokens, and used a custom biomedical tokenizer trained on PubMed Abstracts with a vocabulary size of 28,896.
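The article gives the parameter count (2.7B), context length (1024), and vocabulary size (28,896), but not the model's width or depth. Assuming a GPT-Neo-2.7B-style shape (hidden size 2560, 32 layers, which are assumptions, not figures from the article), a back-of-the-envelope parameter count can be sketched as:

```python
# Rough parameter count for a GPT-style decoder-only transformer.
# Hidden size and layer count below are assumptions (GPT-Neo 2.7B uses
# 2560 / 32); the article only states 2.7B parameters, a 1024-token
# context, and a 28,896-token vocabulary.

def gpt_param_estimate(vocab: int, ctx: int, d_model: int, n_layers: int) -> int:
    embeddings = (vocab + ctx) * d_model   # token + learned position embeddings
    attention = 4 * d_model * d_model      # Q, K, V, and output projections
    mlp = 2 * d_model * (4 * d_model)      # two linear layers with 4x expansion
    per_layer = attention + mlp            # ignoring biases and LayerNorms
    return embeddings + n_layers * per_layer

estimate = gpt_param_estimate(vocab=28_896, ctx=1024, d_model=2560, n_layers=32)
print(f"~{estimate / 1e9:.2f}B parameters")  # ~2.59B, consistent with the stated 2.7B
```

The estimate lands slightly under 2.7B because it drops biases, LayerNorm parameters, and any untied output head, which is typical for this kind of quick sizing check.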

What is the performance of the PubMed GPT model compared to other models?
The PubMed GPT model outperforms DRAGON on MedQA-USMLE, setting a new state of the art, and matches DRAGON’s performance on two other QA tasks, PubMedQA and BioASQ. It also outperforms the general-purpose GPT-Neo, a similarly sized 2.7B-parameter model trained on text and code from many domains, by a significant margin (17% on MedQA). Finally, it outperforms the much larger Galactica (120B) on MedQA and is competitive with it on PubMedQA and BioASQ.

What are the advantages of domain-specific data over general-purpose data?
The advantages of domain-specific data over general-purpose data include improved performance, increased quality with less data and compute, and the ability to focus on a single scientific domain to arrive at a much smaller model that can still compete with larger models.

What are the implications of this research for the development of AI systems?
The implications of this research for the development of AI systems are that large language models offer the promise of new capabilities for many companies and researchers, with the potential to deliver increased quality with less data and compute than often assumed. Additionally, pre-training on domain-specific data can beat general-purpose data, and focused models can achieve higher quality with fewer resources.

AI Comments

👍 This article is a great demonstration of the capabilities of industry-specific large language models for the field of biomedicine and shows the potential of domain-specific language generation models in real-world applications.

👎 This article is a bit too technical and hard to understand for those without a background in engineering and research.

AI Discussion

Me: It's about a new large language model for biomedical text developed by MosaicML and the Stanford Center for Research on Foundation Models (CRFM). They trained a 2.7B-parameter GPT on biomedical data from PubMed, and it achieved state-of-the-art results on medical question-answering benchmarks drawn from the US Medical Licensing Exam (USMLE).

Friend: That's really impressive. What would be the implications of this research?

Me: The research reinforces existing research that shows standard LLMs trained on domain-specific data can outperform general-purpose models and compete with expert-designed, domain-specific model architectures. It also shows that the same simple LLM training recipe can be used to train a model for legal or financial domain expertise. Furthermore, it demonstrates that by focusing on a single scientific domain, a model can still compete with larger models that target multiple scientific domains. Finally, it suggests that large language models can deliver increased quality with less data and compute than often assumed.

Technical terms

Large Language Models (LLMs)
A type of artificial intelligence model that uses deep learning to generate natural language.
Domain-Specific
Refers to data or models that are tailored to a specific domain or industry.
GPT
A type of language model developed by OpenAI that uses a transformer architecture to generate natural language.
USMLE
The United States Medical Licensing Examination, a series of exams required for medical licensure in the United States.
PyTorch
An open-source machine learning library for Python.
FSDP
Fully Sharded Data Parallel, a PyTorch distributed-training backend that shards model parameters, gradients, and optimizer state across GPUs.
StreamingDataset
A library for hosting arbitrary data (text, images, etc.) as shards in object storage and then streaming that data to a training job anywhere in the world.
MedQA-USMLE
A question and answer benchmark consisting of questions and answers taken from previous Medical Licensing Exams given to doctors in the United States.
DRAGON
A state-of-the-art biomedical language model released by members of the CRFM team shortly before PubMed GPT.
GPT-Neo
A general-purpose language model with 2.7B parameters trained on text and code from many domains.
Galactica
A 120B parameter LLM, trained on a corpus of over 48 million papers, textbooks, scientific websites, encyclopedias, and other sources of scientific knowledge across multiple domains.
BioLinkBERT
A biomedical model trained by Stanford CRFM that uses the link structure of documents during training.
PubMedBERT
A domain-specific language model for biomedical NLP.
BioMegatron, GatorTron, and BioGPT
Other biomedical systems that use different evaluation tasks or setups than PubMed GPT.
MeQSum
A medical question summarization benchmark.

Similar articles

Introducing BloombergGPT, Bloomberg’s 50-billion parameter large language model, purpose-built from scratch for finance

GastroGPT Outperforms General Models in GI Clinical Tasks

The use of ChatGPT and other large language models in surgical science

Forget 32K of GPT4: LongNet Has a Billion Token Context

Building Domain-Specific Custom LLM Models: Harnessing the Power of Open Source Foundation Models
