DMPKBench (DMPK LLM Evaluation Benchmark)
A multi-modal benchmark for evaluating LLMs and agents in drug metabolism and pharmacokinetics (DMPK). Contains 120,000+ QA pairs covering experimental design, ADMET optimization, PK modeling, and preclinical-to-clinical translation. Inputs include SMILES strings, data tables, and PK curves.
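To make the QA-pair format concrete, here is a minimal sketch of what a DMPKBench-style multiple-choice item and an exact-match scorer might look like. The field names, the example question, and the placeholder SMILES are illustrative assumptions, not the benchmark's actual schema.

```python
# Hypothetical DMPKBench-style QA item; all fields are illustrative assumptions.
qa_item = {
    "competency": "ADMET optimization",          # one of the 5 competency areas
    "modality": "text+smiles",                   # items may also carry tables or PK curves
    "question": "Which modification most likely reduces hepatic clearance?",
    "smiles": "CC(=O)Oc1ccccc1C(=O)O",           # aspirin, used here only as a placeholder
    "choices": ["A", "B", "C", "D"],
    "answer": "B",
}

def score(predictions, items):
    """Exact-match accuracy over multiple-choice QA items."""
    correct = sum(p == item["answer"] for p, item in zip(predictions, items))
    return correct / len(items)

print(score(["B"], [qa_item]))  # 1.0
```

Exact-match accuracy on multiple-choice answers is a common convention for QA benchmarks of this kind; the benchmark's own scoring protocol may differ.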
Composite
83.8
Experimental validation
None
Stages
IND-enabling
Lead ID / ADMET
Modalities
text
tabular
small-molecule
Task types
llm_evaluation
pk_prediction
admet_prediction
Size
QA_pairs: 120,000
competency_areas: 5
License
Academic
First release
2025-09
Last updated
2025-12
Official site
→ project page
Leaderboard
→ leaderboard
Dataset
Code / GitHub
HuggingFace
→ HF
Paper
DMPKBench: A Comprehensive Multi-Modal Benchmark for DMPK LLM Evaluation · 2025 · paper · 4 citations
Flags
llm_benchmark
multi_modal
chinese_benchmark
Experts
—
Groups
—
Hosted by
—
Related benchmarks
Rubric (7 criteria)
rigor
5
coverage
5
maintenance
3
adoption
3
quality
4
accessibility
4
industry_relevance
5
Notes
From GHDDI (Global Health Drug Discovery Institute, a Gates Foundation-backed institute in Beijing). LLM accuracy ranges from 11% to 89% across tasks. Models excel at knowledge-recall tasks but struggle with multi-modal reasoning over PK curves and data tables. Critical for evaluating LLM utility in pharma DMPK departments.