Experimental validation

How rigorously has each benchmark been validated experimentally?
Clinical = sourced from real clinical-trial outcomes · Wet-lab confirmed = top predictions tested in-lab · Prospective = designed as a forward-looking test set · Retrospective = historical hold-out only · None = no experimental grounding.

Clinical: 7

Wet-lab confirmed: 20

Prospective: 7

Retrospective: 56
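
The per-tier counts above can be reproduced from a list of (benchmark, tier) records. The following is a minimal sketch, not the site's actual implementation: the `Validation` enum, the `tally` helper, and the five sample records are hypothetical names introduced for illustration, using the tier labels from the legend.

```python
from collections import Counter
from enum import Enum


class Validation(Enum):
    """Validation tiers, named after the legend (hypothetical enum)."""
    CLINICAL = "Clinical"
    WET_LAB = "Wet-lab confirmed"
    PROSPECTIVE = "Prospective"
    RETROSPECTIVE = "Retrospective"
    NONE = "None"


# Illustrative subset of the catalog, not the full list of 90 entries.
records = [
    ("MIMIC-IV Benchmark Tasks", Validation.CLINICAL),
    ("ChEMBL", Validation.WET_LAB),
    ("Polaris ADMET", Validation.PROSPECTIVE),
    ("PLINDER", Validation.RETROSPECTIVE),
    ("DUD-E", Validation.RETROSPECTIVE),
]


def tally(rows):
    """Count benchmarks per tier, reporting 0 for tiers with no entries."""
    counts = Counter(tier for _, tier in rows)
    return {tier.value: counts.get(tier, 0) for tier in Validation}


summary = tally(records)
for label, n in summary.items():
    print(f"{label}: {n}")
```

Iterating over the enum rather than over the observed tiers keeps empty tiers such as None visible in the summary instead of silently dropping them.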

Clinical (7)

Benchmark	Stages	Score	Notes
MIMIC-IV Benchmark Tasks	Phase III · Clinical Development · Post-market / RWE	89.4	Canonical clinical ML benchmark. Credentialed access limits casual use.
ClinBench Quarterly — Q2 2026	Phase II · Phase III · Clinical Development	87.6	New track in Q2 2026 for endpoint adjudication.
ClinBench Quarterly (Insilico)	Phase II · Phase III · Clinical Development	81.5	Refreshed more often than any academic trial-outcome benchmark. Leaderboards test frontier LLMs against quarterly updated splits.
CPTAC Proteogenomic Benchmarks	Disease Modeling · Target ID · Phase II	80.8	Deep integrative oncology data.
HINT / TrialBench	Phase II · Phase III · Clinical Development	76.5	Limited by ClinicalTrials.gov data quality.
Trial Outcome Prediction (TOP)	Phase III · Clinical Development	76.5	Often reported alongside HINT.
CT-Outcome (TrialBench v2)	Phase II · Phase III	73.4	Temporal splits are the key improvement.

Wet-lab confirmed (20)

Benchmark	Stages	Score	Notes
Open Targets Platform	Disease Modeling · Target ID	100.0	Industry gold standard for target prioritization. Quarterly versioned releases.
DepMap (Cancer Dependency Map)	Target ID · Disease Modeling	100.0	Quarterly release cadence.
Protein Language Model Eval 2026	Virtual Cell · Hit ID	100.0	Meta FAIR + EvolutionaryScale collaboration; includes held-out targets with wet-lab fitness.
ChEMBL	Hit ID · Lead ID / ADMET	97.5	Underlies ~80% of public bioactivity ML benchmarks.
ProteinGym	Target ID · Lead ID / ADMET · IND-enabling	97.5	Field standard. Clinical track enables fair ESM/EVE/AlphaMissense comparison.
Therapeutic Antibody Design Benchmark 2026	Hit ID · Lead ID / ADMET	97.0	Top-ranked submissions had wet-lab binding measured (Kd + aggregation) by independent labs.
Protein Design Benchmark 2026	Hit ID	97.0	All submitted designs characterized in IPD / external wet labs.
RxRx3 Phenomics Benchmark	Hit ID · Lead ID / ADMET	94.9	Real phenomics data from Recursion's lab; public subsets only. Full dataset is proprietary (see private_benchmarks).
LINCS L1000 / CMap	Virtual Cell · Disease Modeling · Target ID	89.9	Foundational pharma resource for MoA work. Batch effects require careful handling.
canSAR	Target ID · Hit ID	89.4	Deep oncology focus; widely used druggability predictor.
PubChem BioAssay	Hit ID	88.6	Broadest HTS repository; quality is heterogeneous.
Cell Line Sensitivity Benchmark (CLSB)	Target ID · Lead ID / ADMET	88.1	DepMap-adjacent but adds new splits and PRISM v4.
TargetBench (Insilico)	Target ID · Disease Modeling	84.6	Disease-organized target ID benchmark, a unique axis. Frontier LLM leaderboard.
ISM Benchmarks: ADMET (Insilico)	Lead ID / ADMET · IND-enabling	84.6	Broader endpoint coverage than TDC ADMET. Side-by-side with TDC mirror on DDB.
Longevity Compound Benchmark	Hit ID · Lead ID / ADMET	84.6	Insilico-hosted; unique in bridging cheminformatics and aging biology.
CRISPR Outcome Prediction Benchmark	Hit ID	79.5	Prospective track added in Q1 2026.
IgLM / AntiBERTa benchmarks	Hit ID · Developmental Candidate	77.5	Moves toward true developability benchmarks.
Geneformer Eval	Virtual Cell	77.0	Author-led eval; still widely re-run on OpenProblems tasks.
TDC DrugSyn (OncoPolyPharm + DrugComb_NCI60)	Developmental Candidate · Lead ID / ADMET	77.0	Important for combination therapy design.
scGPT Evaluation Suite	Virtual Cell	73.7	Evaluation is dominated by the authors' own model and is flagged as self-referential. Pair with OpenProblems for fair comparison.

Prospective (7)

Benchmark	Stages	Score	Notes
Virtual Cell Benchmark Suite 2026	Virtual Cell	97.0	Successor to Open Problems perturbation benchmark. Prospectively designed; Tahoe-100M inclusion makes it industry-relevant.
ASAP Discovery Antiviral 2025	Hit ID · Lead ID / ADMET	93.9	Top predictions are synthesized and tested; a rare prospective public benchmark.
Longevity Benchmark (Insilico)	Disease Modeling · Target ID · Post-market / RWE	90.6	Unique, broad longevity/aging benchmark slice; nothing else in the field covers aging comparably. Leaderboard features frontier LLMs.
Polaris ADMET	Lead ID / ADMET	88.4	Industry splits enforce blinded eval; highest industry relevance among ADMET benchmarks.
CZ Virtual Cell Challenge	Virtual Cell · Target ID	88.1	Gold standard in the making for foundation-model-era perturbation prediction. Hidden test set guards strongly against leakage.
mRNA Design Benchmark (CodonBench 2026)	Hit ID · Lead ID / ADMET	82.0	Designed with Moderna and Deep Genomics; includes held-out wet-lab validation track.
Polaris Biologics (Polyreactivity / SEC / Tm)	Developmental Candidate	79.0	Industry-donated; growing.

Retrospective (56)

Benchmark	Stages	Score	Notes
TDC ADMET Group	Lead ID / ADMET	100.0	Most-adopted ADMET benchmark. 100+ leaderboard submissions.
SAbDab	Hit ID · Lead ID / ADMET · Developmental Candidate	100.0	Canonical antibody structure resource. Weekly updates.
Observed Antibody Space (OAS)	Hit ID · Lead ID / ADMET	97.5	Underlies AbLang, IgLM, AntiBERTa; industry-adopted.
PoseBusters	Hit ID	97.0	Exposed major failure modes in AlphaFold-Multimer/DiffDock/RFAA. Default pharma filter.
PLINDER	Hit ID	97.0	Replaces PDBbind as the modern leakage-controlled docking standard.
PLINDER v2 Protein-Ligand Benchmark	Hit ID	97.0	PLINDER is consistently cited as the go-to replacement for PDBbind in modern docking evaluation.
STRING	Target ID · Disease Modeling	94.9	Workhorse for network-based target ID. Distinguish functional vs physical edges.
CASP15	Hit ID · Target ID	94.9	Biennial. Introduced ligand prediction category.
CASP16	Hit ID	94.4	First full multimer+ligand+RNA joint eval.
CAMEO weekly targets	Hit ID	94.4	Weekly cadence complements biennial CASP.
Boltz-1 Structure Prediction Benchmark	Hit ID	94.4	Open-source companion to commercial structure predictors; benchmark splits audited against AlphaFold 3 leakage.
ORD Reaction Benchmark	Developmental Candidate	93.9	Modern open reaction corpus; industry-scale.
Open Problems: Perturbation Prediction	Virtual Cell	91.9	Best-in-class rigor (Viash workflow, hidden test, NeurIPS track).
PrimeKG	Disease Modeling · Target ID	91.9	Modern, well-engineered KG; strong for GNN drug repurposing.
FAERS (raw)	Post-market / RWE	91.1	Known under-/over-reporting biases.
scPerturb	Virtual Cell · Target ID	88.9	Canonical harmonized resource. Strong Perturb-seq coverage; weaker for chemical perturbations.
PINDER	Hit ID	88.9	Expected PPI docking standard.
Practical Molecular Optimization (PMO)	Lead ID / ADMET · Developmental Candidate	88.9	Sample-efficiency focus exposed shortcomings of reward-maxing methods.
CoV-AbDab	Hit ID	88.9	Narrow modality but critical for pandemic-preparedness ML.
ISM Benchmarks: GPCRs (Insilico)	Hit ID · Lead ID / ADMET	87.6	Largest open GPCR affinity benchmark. Leaderboards test external frontier LLMs, so the evaluation is not self-referential.
CAPRI Rounds	Hit ID	86.3	Oldest PPI prediction benchmark.
ToxCast	Lead ID / ADMET · IND-enabling	85.6	Regulatory-grade broad tox dataset.
GNNBench-Drug 2026	Hit ID · Lead ID / ADMET	85.6	IBM-led; overlaps with MoleculeNet but adds modern splits.
CAFA5	Target ID	84.3	CAFA5 broke attendance records.
MoleculeACE	Lead ID / ADMET	83.3	Critical stress-test for generalization; exposed GNN weaknesses.
MatBench	Developmental Candidate	83.3	Materials-science benchmark; relevant for formulation / co-crystal work.
OffSides / TWOSIDES	Post-market / RWE	83.0	Key benchmark for DDI + adverse event ML.
DrugComb 2.0 Synergy Benchmark	Lead ID / ADMET · Developmental Candidate	83.0	Industry-relevant for combination oncology.
DMPK Integrated Benchmark	Lead ID / ADMET · Developmental Candidate	82.5	AZ/Merck/Pfizer contributed held-out test molecules.
DOCKSTRING	Hit ID	81.3	Vina scores are a proxy; not a replacement for wet assays.
DisGeNET	Disease Modeling · Target ID	81.0	Commercial license required for industry. Text-mining noise limits quality.
LIT-PCBA	Hit ID	80.8	Much fairer than DUD-E; small target count limits coverage.
FLIP	Target ID · Developmental Candidate	80.8	Complements ProteinGym (smaller but carefully designed splits).
GuacaMol	Lead ID / ADMET · Developmental Candidate	80.5	First-generation generative benchmark; largely superseded by PMO for goal-directed tasks.
Open Systems Pharmacology / PK-Sim	Phase I · IND-enabling	80.3	Open alternative to Simcyp.
ADMET-AI	Lead ID / ADMET	79.5	Strong baselines + web tool; builds on TDC.
AMES (mutagenicity)	IND-enabling · Lead ID / ADMET	79.5	Core gentox endpoint.
scImmuneBench	Virtual Cell · Disease Modeling	79.5	Useful for cell-therapy companies evaluating immune foundation models.
MoleculeNet	Lead ID / ADMET · Hit ID	78.0	Widely cited (3600+); aging splits with known scaffold leakage.
USPTO-50K / USPTO-MIT (Retrosynthesis)	Lead ID / ADMET · Developmental Candidate	78.0	Known leakage across canonical splits; use time-split or ORD for fairer eval.
Tox21	Lead ID / ADMET · IND-enabling	77.5	Field-standard tox benchmark; endpoint count small vs modern suites.
Obach PK Dataset	Phase I · IND-enabling · Lead ID / ADMET	77.0	Small but highest-quality human-PK dataset.
CASF-2016	Hit ID	76.2	Authoritative scoring-power eval; update cadence slow.
PDBbind	Hit ID · Lead ID / ADMET	75.9	Scaffold/temporal leakage well-documented. Pair with CASF + LeakyPDB.
SIDER	Post-market / RWE · IND-enabling	74.9	Aging but still widely used. TWOSIDES/OffSides offer newer signals.
TAPE	Target ID · Developmental Candidate	74.9	Historically important; largely superseded by ProteinGym/FLIP for fitness and by PEER for broader tasks.
Simcyp Validation Sets	Phase I · Phase II · IND-enabling	74.4	Industry gold standard but proprietary. Open benchmarks exist via OSP Suite.
PEER	Target ID · Developmental Candidate	74.4	Broader than TAPE, tighter than ProteinGym; good middle ground.
ClawBio Skill Correctness Bench	Disease Modeling · Target ID · Clinical Development	74.2	Independent third-party bench structurally precludes self-reference. Coverage narrow but rigor exemplary.
hERG (cardio-tox) TDC	IND-enabling · Lead ID / ADMET	73.9	Small but widely benchmarked. Industry pairs with SafetyPanel-5.
DILI / LD50 Zhu	IND-enabling · Lead ID / ADMET	73.9	Essential IND-enabling endpoints.
DUD-E	Hit ID	72.9	Well-known analog bias in decoy selection; use LIT-PCBA / PLINDER for fair VS.
MOSES	Lead ID / ADMET · Developmental Candidate	72.4	Distribution-learning metrics known to saturate.
PerturbBench	Virtual Cell	71.4	Pharma-led (Genentech); well-specified eval.
ClinTox	Lead ID / ADMET · IND-enabling	65.6	Small, binary; saturated. Useful only as sanity check.
DEKOIS 2.0	Hit ID	57.5	Historical reference; use LIT-PCBA / PLINDER for modern VS.
