
Awesome-Medical-Healthcare-Dataset-For-LLM

A curated list of popular Datasets, Models and Papers for LLMs in Medical/Healthcare.

Datasets

Chinese

| Dataset | Description | Link | Size |
| --- | --- | --- | --- |
| MedDialog | The MedDialog dataset (Chinese) contains conversations between doctors and patients: 1.1 million dialogues and 4 million utterances. The data is still growing, with more dialogues being added. The raw dialogues come from haodf.com. | Download link | 3.3GB |
| Chinese medical dialogue data | Chinese medical dialogue dataset with QA data from six departments: Andriatria (andrology) 94,596 QA pairs; IM (internal medicine) 220,606; OAGD (obstetrics/gynecology) 183,751; Oncology 75,553; Pediatrics 101,602; Surgery 115,991. 792,099 QA pairs in total. | Download link | 800k entries, 330MB |
| Huatuo-26M | Huatuo-26M is the largest Chinese medical QA dataset to date. It contains over 26 million high-quality medical question-answer pairs covering a wide range of topics, including diseases, symptoms, treatments, and drug information. | Download link | 4.54GB |
| huatuo_encyclopedia_qa | 364,420 medical QA entries, some of which contain multiple questions phrased in different ways. The QA pairs are extracted from plain text (e.g., medical encyclopedias and medical articles): 8,699 disease encyclopedia entries and 2,736 drug encyclopedia entries collected from Chinese Wikipedia, plus 226,432 high-quality medical articles crawled from the Qianwen Health website. | Download link | 605MB |
| Chinese medical dialogue dataset (Huatuo project) | 220k Chinese medical dialogues (Huatuo project): FreedomIntelligence/HuatuoGPT-sft-data-v1 | Download link | 333MB |
| Medical LLM dataset (pre-training, instruction fine-tuning, and reward data) | 2.4 million Chinese medical entries, covering pre-training, instruction fine-tuning, and reward datasets. | Download link | 2.1GB |
| Surgical consultation data (BillGPT/Chinese-medical-dialogue-data) | 60.8K surgical consultation records. Example: "Patient: What is Xinhuang Pian for? I'd like to know what it does. Doctor: Analysis: Xinhuang Pian mainly clears heat and toxins, promotes blood circulation, removes stasis, and relieves swelling and pain. It is used for sore throat, toothache, arthralgia, hypochondriac pain, jaundice, and unexplained swellings caused by heat toxins and blood stasis. Advice: It works well for symptoms such as sore throat, but patients with gastritis should avoid it, as it can irritate the stomach." | Download link | 936MB |
| Chinese medical instruction-tuning dataset (Instruct-tuning) | Built on public and self-constructed Chinese medical knowledge bases, mainly referencing cMeKG. The knowledge base is organized around diseases, drugs, and test indicators, with fields such as complications, risk factors, histological examination, clinical symptoms, drug treatment, and adjuvant therapy. QA data was constructed around the knowledge base via the GPT-3.5 API, using a variety of prompt formats to fully exploit the knowledge. | Download link | 7.6K entries |
| MeChat: Chinese mental-health support dialogue model and dataset | Real single-turn mental-health QA rewritten by ChatGPT into multi-turn mental-health support dialogues (single-turn to multi-turn inclusive language expansion via ChatGPT). Contains 56k multi-turn dialogues whose topics, vocabulary, and discourse semantics are richer and more diverse, better matching long multi-turn dialogue scenarios. | Download link | 56k entries |
| CMB (Chinese Medical Benchmark) | CMB is a comprehensive, multi-level evaluation benchmark for Chinese medical models. It contains 280,839 multiple-choice questions and 74 complex clinical case consultations, covering exams across all clinical specialties and professional levels, jointly assessing a model's medical knowledge and clinical consultation ability. | Download link | 30MB |
| ChatMed_Consult_Dataset | The queries (prompts) come from real-world online medical consultation questions (110,113), reflecting the consultation needs of different users/patients. All responses are currently generated by the OpenAI GPT-3.5 engine. The authors plan to screen doctor and patient answers from the internet and select the best ones to build a higher-quality dataset. | Download link | 395MB |
| ChatMed_TCM_Dataset (TCM instruction dataset) | Based on an open-source Traditional Chinese Medicine knowledge graph (https://github.com/ywjawmw/TCM_KG), an entity-centric self-instruct method was used to obtain 110k+ TCM-related instructions from ChatGPT. | Download link | 110MB |
| cMedQA (Chinese community medical QA) | A dataset of Chinese community medical QA (version 1.0); the provider will update and expand it over time. To protect privacy, the data is anonymized and contains no personal information. | Download link | 80MB |
| WebMedQA (online medical QA) | A real-world medical QA dataset collected from online health consultation sites such as Baidu Doctor and 120Ask. Users first fill in a personal information form and then describe their illness and health questions. Questions are open to all registered clinicians and users until the asker selects the most satisfactory answer and closes the question. Doctors and enthusiastic users can post diagnoses and advice under a question, with their titles and specialties shown alongside their answers; the asker can also follow up on an answer of interest. The category of each question is chosen by its asker. | Download link | 75MB |
| ChineseBLUE benchmark | ChineseBLUE consists of diverse biomedical text-mining tasks with corpora. The tasks cover a variety of text genres (biomedical web data and clinical notes), dataset sizes, and difficulties, and, more importantly, highlight common biomedical text-mining challenges. | Download link | 400MB |
| Yidu-S4K | Named entity recognition, entity and attribute extraction. | Download link | 4K entries |
| Yidu-N7K | Clinical terminology standardization. | Download link | 7K entries |
| HealthCareMagic-100k | 100k real conversations between patients and doctors from HealthCareMagic.com. | Download link | 137MB |
| icliniq-10k | 10K real conversations between patients and doctors from icliniq.com. | Download link | 20MB |
| GenMedGPT-5k | 5k conversations between patients and doctors generated by ChatGPT from a disease database. | Download link | 5K entries |
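
Most of the corpora above are distributed as single-turn question-answer pairs; before supervised fine-tuning they are typically converted into a chat-style record. A minimal sketch of that conversion (the `ask`/`answer` field names are hypothetical; each dataset uses its own schema, so adjust accordingly):

```python
import json

def qa_to_chat(example: dict) -> dict:
    """Convert a raw single-turn medical QA pair into a chat-format
    SFT record. The "ask"/"answer" keys are hypothetical placeholders
    for whatever fields the chosen dataset actually uses."""
    return {
        "messages": [
            {"role": "user", "content": example["ask"].strip()},
            {"role": "assistant", "content": example["answer"].strip()},
        ]
    }

# Toy record in the style of the consultation datasets above.
raw = {"ask": "新癀片有什么用?", "answer": "新癀片主要可以清热解毒、消肿止痛。"}
record = qa_to_chat(raw)
print(json.dumps(record, ensure_ascii=False, indent=2))
```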

English

| Dataset | Type | Link | Size |
| --- | --- | --- | --- |
| MIMIC-III | EHR | https://mimic.mit.edu/docs/iii/ | 58,976 hospital admissions for 38,597 patients |
| MIMIC-IV | EHR | https://mimic.mit.edu/docs/iv/ | covers a decade of admissions between 2008 and 2019 |
| CPRD | EHR | https://cprd.com/data | over 2,000 primary care practices, 60 million patients |
| PubMed | Scientific literature | https://ftp.ncbi.nlm.nih.gov/pubmed/baseline/ | 35M citations and abstracts of biomedical literature |
| PMC | Scientific literature | https://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_bulk | 8 million full-text article records |
| RCT | Scientific literature | https://github.com/bwallace/RCT-summarization-data | 4,528 abstracts |
| MS^2 | Scientific literature | https://github.com/allenai/ms2/ | 470,402 abstracts |
| CDSR | Scientific literature | https://github.com/qiuweipku/Plain_language_summarization | 7,805 abstracts |
| SumPubMed | Scientific literature | https://github.com/vgupta123/sumpubmed | 33,772 abstracts |
| The Pile | Scientific literature | https://pile.eleuther.ai/ | 825 GB of English text |
| S2ORC | Scientific literature | https://github.com/jbshp/GenCompareSum | 63,709 abstracts |
| CORD-19 | Scientific literature | https://github.com/allenai/cord19 | 1M papers |
| MeQSum | Medical question summarization | https://github.com/abachaa/MeQSum | 1,000 instances |
| CHQ-Summ | Medical question summarization | https://github.com/shwetanlp/Yahoo-CHQ-Summ | 1,507 instances |
| UMLS | Knowledge base | https://www.nlm.nih.gov/research/umls/index.html | 2M entities for 900K concepts |
| COMETA | Web data (social media) | https://github.com/cambridgeltl/cometa | 800K Reddit posts |
| MedDialog | Dialogue | https://github.com/UCSD-AI4H/COVID-Dialogue | 3.66 million conversations |
| CovidDialog | Dialogue | https://github.com/UCSD-AI4H/COVID-Dialogue | 603 consultations |
| Medical Flashcards | Dialogue | https://github.com/kbressem/medalpaca | 33,955 instances |
| Wikidoc | Dialogue | https://huggingface.co/datasets/medalpaca/medical_meadow_wikidoc | 67,704 instances |
| Wikidoc Patient Information | Dialogue | https://huggingface.co/datasets/medalpaca/medical_meadow_wikidoc_patient_information | 5,942 instances |
| MEDIQA | Dialogue | https://huggingface.co/datasets/medalpaca/medical_meadow_mediqa | 2,208 instances |
| CORD-19 | Dialogue | https://huggingface.co/datasets/medalpaca/medical_meadow_cord19 | 1,056,660 instances |
| MMMLU | Dialogue | https://huggingface.co/datasets/medalpaca/medical_meadow_mmmlu | 3,787 instances |
| PubMed Causal | Dialogue | https://huggingface.co/datasets/medalpaca/medical_meadow_pubmed_causal | 2,446 instances |
| ChatDoctor | Dialogue | https://github.com/Kent0n-Li/ChatDoctor | 215,000 instances |
| Alpaca-EN-AN | English instructions | https://github.com/tatsu-lab/stanford_alpaca/blob/main/alpaca_data.json | 52K instructions |
| Alpaca-CH-AN | Chinese instructions | https://github.com/Instruction-Tuning-with-GPT-4/GPT-4-LLM/tree/main/data | 52K instructions |
| ShareGPT | Conversations | https://huggingface.co/datasets/philschmid/sharegpt-raw | 61,653 long conversations |
| WebText | Web data | https://commoncrawl.org/the-data/get-started/ | 40 GB of text |
| OpenWebText | Web data | https://skylion007.github.io/OpenWebTextCorpus/ | 38 GB of text |
| Colossal Clean Crawled Corpus (C4) | Web data | https://www.tensorflow.org/datasets/catalog/c4 | 806 GB of text |
| OpenI | EHR, multimodal | https://openi.nlm.nih.gov/faq#collection | 3.7 million images from about 1.2 million papers |
| U-Xray | Multimodal | https://openi.nlm.nih.gov/ | 3,955 reports and 7,470 images |
| ROCO | Multimodal | https://github.com/razorx89/roco-dataset | 81,000 radiology images with corresponding captions |
| MedICaT | Multimodal | https://github.com/allenai/medicat | 17,000 images with captions |
| PMC-OA | Multimodal | https://huggingface.co/datasets/axiong/pmc_oa_beta | 1.6M image-caption pairs |
| CheXpert | Multimodal | https://aimi.stanford.edu/chexpert-chest-x-rays | 224,316 chest radiographs with associated reports |
| PadChest | Multimodal | http://bimcv.cipf.es/bimcv-projects/padchest/ | 160,000 images with related text |
| MIMIC-CXR | Multimodal | https://mimic.mit.edu/docs/iv/modules/cxr/ | 227,835 imaging studies for 64,588 patients |
| PMC-15M | Multimodal | https://arxiv.org/abs/2303.00915 | 15 million figure-caption pairs |
| OpenPath | Multimodal | https://laion.ai/blog/laion-5b/ | 208,414 pathology images with related descriptions |
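
Several of the literature corpora above (the PubMed baseline and PMC bulk files) are distributed as MEDLINE-style XML. A minimal, self-contained sketch of extracting PMID, title, and abstract with the standard library; the inline snippet stands in for a real baseline file, which would be gzipped and contain millions of records:

```python
import xml.etree.ElementTree as ET

# A tiny inline snippet mimicking the MEDLINE citation layout; real
# baseline files hold millions of <PubmedArticle> records.
SNIPPET = """
<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>12345678</PMID>
      <Article>
        <ArticleTitle>An example biomedical abstract.</ArticleTitle>
        <Abstract><AbstractText>Body of the abstract.</AbstractText></Abstract>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>
"""

def iter_citations(xml_text: str):
    """Yield (pmid, title, abstract) triples from MEDLINE-style XML."""
    root = ET.fromstring(xml_text)
    for cit in root.iter("MedlineCitation"):
        yield (
            cit.findtext("PMID"),
            cit.findtext("Article/ArticleTitle"),
            cit.findtext("Article/Abstract/AbstractText"),
        )

records = list(iter_citations(SNIPPET))
print(records[0])
```

For real baseline files, the same function can be applied after decompressing each `pubmed*.xml.gz` shard, or swapped for `ET.iterparse` to stream without loading the whole file.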

Models

| Title | Institute | Date | Code |
| --- | --- | --- | --- |
| DoctorGPT | Sirajraval, GPT School | 2023-08 | GitHub |
| CoDoC: Enhancing the reliability and accuracy of AI-enabled diagnosis via complementarity-driven deferral to clinicians | DeepMind, Google | 2023-07 | GitHub |
| Med-PaLM 2: Towards Expert-Level Medical Question Answering with Large Language Models | Google | 2023-05 | - |
| Capabilities of GPT-4 on Medical Challenge Problems | Microsoft, OpenAI | 2023-03 | - |
| BioMedLM (PubMedGPT): A purpose-built AI model trained to interpret biomedical language | Stanford CRFM, MosaicML | 2022-12 | HuggingFace |
| Med-PaLM: Large Language Models Encode Clinical Knowledge | Google | 2022-12 | GitHub |
| ClinicalT5: A Generative Language Model for Clinical Text | University of Oregon, Baidu | 2022-12 | HuggingFace |
| GatorTron: A large language model for electronic health records | University of Florida, NVIDIA | 2022-12 | HuggingFace |
| BioGPT: Generative Pre-trained Transformer for Biomedical Text Generation and Mining | Microsoft Research | 2022-09 | HuggingFace |
| BioBART: Pretraining and Evaluation of A Biomedical Generative Language Model | Tsinghua University | 2022-04 | HuggingFace |
| KeBioLM: Improving Biomedical Pretrained Language Models with Knowledge | Tsinghua, Alibaba | 2021-04 | GitHub |
| Pretrained Language Models for Biomedical and Clinical Tasks: Understanding and Extending the State-of-the-Art | Meta | 2020-11 | GitHub |
| BioMegatron: Larger Biomedical Domain Language Model | NVIDIA | 2020-10 | HuggingFace |
| PubMedBERT: Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing | Microsoft Research | 2020-07 | HuggingFace |
| Publicly Available Clinical BERT Embeddings | MIT CSAIL | 2019-04 | HuggingFace |
| ClinicalBERT: Modeling Clinical Notes and Predicting Hospital Readmission | Harvard, Princeton, NYU | 2019-04 | GitHub |
| BioBERT: a pre-trained biomedical language representation model for biomedical text mining | Korea University | 2019-01 | GitHub |

Popular LLMs

| Model Name | Base | Para. (B) | Features | Date | Link |
| --- | --- | --- | --- | --- | --- |
| GatorTron | Transformer | 0.345, 3.9, 8.9 | Trained from scratch | 06/2022 | https://github.com/uf-hobi-informatics-lab/GatorTron |
| Codex-Med | GPT-3.5 | 175 | CoT, zero-shot | 07/2022 | https://github.com/vlievin/medical-reasoning |
| Galactica | Transformer | 1.3, 6.4, 30, 120 | Reasoning, multidisciplinary | 11/2022 | https://galactica.org |
| Med-PaLM | Flan-PaLM/PaLM | 540 | CoT, self-consistency | 12/2022 | - |
| GPT-4-Med | GPT-4 | - | No specialized prompt crafting | 03/2023 | - |
| DeID-GPT | GPT-4 | - | De-identification | 03/2023 | https://github.com/yhydhx/ChatGPT-API |
| ChatDoctor | LLaMA | 7 | Online retrieval, external knowledge | 03/2023 | https://github.com/Kent0n-Li/ChatDoctor |
| DoctorGLM | ChatGLM | 6 | Extra prompt designer | 04/2023 | https://github.com/xionghonglin/DoctorGLM |
| MedAlpaca | LLaMA | 7, 13 | Adapted to medicine | 04/2023 | https://github.com/kbressem/medAlpaca |
| BenTsao | LLaMA | 7 | Knowledge graph | 04/2023 | https://github.com/SCIR-HI/Huatuo-Llama-Med-Chinese |
| PMC-LLaMA | LLaMA | 7 | Adapted to medicine | 04/2023 | https://github.com/chaoyi-wu/PMC-LLaMA |
| Visual Med-Alpaca | LLaMA | 7 | Multimodal generative model, self-instruct | 04/2023 | https://github.com/cambridgeltl/visual-med-alpaca |
| BianQue | ChatGLM | 6 | Chain of questioning | 04/2023 | https://github.com/scutcyr/BianQue |
| Med-PaLM 2 | PaLM 2 | 340 | Ensemble refinement, CoT, self-consistency | 05/2023 | - |
| GatorTronGPT | GPT-3 | 5, 20 | Trained from scratch for medicine | 05/2023 | https://github.com/uf-hobi-informatics-lab/GatorTronGPT |
| HuatuoGPT | Bloomz | 7 | Reinforcement learning from AI feedback | 05/2023 | https://github.com/FreedomIntelligence/HuatuoGPT |
| ClinicalGPT | BLOOM | 7 | Multi-round dialogue consultations | 06/2023 | - |
| MedAGI | MiniGPT-4 | - | Multimodal, AGI | 06/2023 | https://github.com/JoshuaChou2018/MedAGI |
| LLaVA-Med | LLaVA | 13 | Multimodal, self-instruct, curriculum learning | 06/2023 | https://github.com/microsoft/LLaVA-Med |
| OphGLM | ChatGLM | 6 | Multimodal, ophthalmology LLM | 06/2023 | https://github.com/ML-AILab/OphGLM |
| SoulChat | ChatGLM | 6 | Mental healthcare | 06/2023 | https://github.com/scutcyr/SoulChat |
| Med-Flamingo | Flamingo | 80 | Multimodal, few-shot generative medical VQA | 07/2023 | https://github.com/snap-stanford/med-flamingo |
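
Several entries in the table list "self-consistency" as a feature (Med-PaLM, Med-PaLM 2): sample multiple chain-of-thought completions for the same question, then majority-vote over their final answers. A minimal, model-agnostic sketch, with a canned stub standing in for the actual LLM call:

```python
from collections import Counter
from typing import Callable, List

def self_consistency(sample_answer: Callable[[str], str],
                     question: str, n_samples: int = 5) -> str:
    """Sample several (stochastic) answers to the same question and
    return the most frequent final answer -- a majority vote."""
    answers: List[str] = [sample_answer(question) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]

# Stand-in for a sampled LLM call (temperature > 0); a real system would
# extract the final answer letter from each sampled chain of thought.
_canned = iter(["B", "A", "B", "B", "C"])
def fake_llm(question: str) -> str:
    return next(_canned)

result = self_consistency(fake_llm, "Which option is correct?", n_samples=5)
print(result)  # "B" wins the vote 3-1-1
```

The same voting step works regardless of which model in the table produces the samples; only the `sample_answer` callable changes.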

Papers

  • Med-PaLM 2: Towards Expert-Level Medical Question Answering with Large Language Models [Paper]
  • KeBioLM: Improving Biomedical Pretrained Language Models with Knowledge [Paper]
  • BioELMo: Probing Biomedical Embeddings from Language Models [Paper]
  • BioBART: Pretraining and Evaluation of A Biomedical Generative Language Model [Paper]
  • ClinicalT5: A Generative Language Model for Clinical Text [Paper]
  • GatorTron: A Large Clinical Language Model to Unlock Patient Information from Unstructured Electronic Health Records [Paper]
  • ChatCAD: Interactive Computer-Aided Diagnosis on Medical Image using Large Language Models [Paper] [Code]
  • DeID-GPT: Zero-shot Medical Text De-Identification by GPT-4 [Paper]
  • Capabilities of GPT-4 on Medical Challenge Problems [Paper]
  • BioBERT: a pre-trained biomedical language representation model for biomedical text mining [Paper]
  • Publicly Available Clinical BERT Embeddings [Paper]
  • BioMegatron: Larger Biomedical Domain Language Model [Paper]
  • Don’t Stop Pretraining: Adapt Language Models to Domains and Tasks [Paper]
  • Med-BERT: pretrained contextualized embeddings on large-scale structured electronic health records for disease prediction [Paper]
  • DoctorGLM: Fine-tuning your chinese doctor is not a herculean task [Paper] [Code]
  • HuatuoGPT, Towards Taming Language Models To Be a Doctor [Paper] [Code]
  • BioELECTRA: Pretrained Biomedical text Encoder using Discriminators [Paper]
  • LinkBERT: Pretraining Language Models with Document Links [Paper]
  • BioGPT: Generative Pre-trained Transformer for Biomedical Text Generation and Mining [Paper]
  • Large Language Models Encode Clinical Knowledge [Paper]
  • A large language model for electronic health records [Paper]
  • Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing [Paper]
  • BEHRT: Transformer for Electronic Health Records [Paper]
  • Federated Learning of Medical Concepts Embedding using BEHRT [Paper] [Code]
  • RadBERT: Adapting Transformer-based Language Models to Radiology [Paper] [HuggingFace]
  • Highly accurate protein structure prediction with AlphaFold [Paper] [Code]
  • Accurate prediction of protein structures and interactions using a three-track neural network [Paper]
  • Protein complex prediction with AlphaFold-Multimer [Paper]
  • FastFold: Reducing AlphaFold Training Time from 11 Days to 67 Hours [Paper] [Code]
  • HelixFold: An Efficient Implementation of AlphaFold2 using PaddlePaddle [Paper] [Code]
  • Uni-Fold: An Open-Source Platform for Developing Protein Folding Models beyond AlphaFold [Paper] [Code]
  • OpenFold: Retraining AlphaFold2 yields new insights into its learning mechanisms and capacity for generalization [Paper] [Code]
  • ManyFold: an efficient and flexible library for training and validating protein folding models [Paper] [Code]
  • ColabFold: making protein folding accessible to all [Paper] [Code]
  • Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences [Paper] [Code]
  • ProGen: Language Modeling for Protein Generation [Paper] [Code]
  • ProtTrans: Towards Cracking the Language of Life's Code Through Self-Supervised Deep Learning and High Performance Computing [Paper] [Code]
  • Evolutionary-scale prediction of atomic level protein structure with a language model [Paper]
  • High-resolution de novo structure prediction from primary sequence [Paper] [Code]
  • Single-sequence protein structure prediction using a language model and deep learning [Paper]
  • Improved the Protein Complex Prediction with Protein Language Models [Paper]
  • MSA Transformer [Paper] [Code]
  • Deciphering antibody affinity maturation with language models and weakly supervised learning [Paper]
  • xTrimoABFold: De novo Antibody Structure Prediction without MSA [Paper]
  • scBERT as a large-scale pretrained deep language model for cell type annotation of single-cell RNA-seq data [Paper] [Code]
  • Interpretable RNA Foundation Model from Unannotated Data for Highly Accurate RNA Structure and Function Predictions [Paper] [Code]
  • E2Efold-3D: End-to-End Deep Learning Method for accurate de novo RNA 3D Structure Prediction [Paper] [Code]

Other Awesome Lists

LLM-for-Healthcare - A Survey of Large Language Models for Healthcare: from Data, Technology, and Applications to Accountability and Ethics.

Licenses

MIT license

This project is licensed under the MIT License.

CC BY-NC-SA 4.0

This project is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

Citation

If this project helps you, please cite it as follows.

```
@misc{medllmdata2023,
  author = {Jun Wang and Changyu Hou and Xiaorui Wang and Jingjing Gong and Chen Song and Qi Shen and Guotong Xie},
  title = {Awesome Dataset for Medical LLM: A curated list of popular Datasets, Models and Papers for LLMs in Medical/Healthcare},
  year = {2023},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/onejune2018/Awesome-Medical-Healthcare-Dataset-For-LLM}},
}
```
