Experimental validation

How rigorously has each benchmark been validated experimentally?
Clinical = sourced from real clinical-trial outcomes · Wet-lab confirmed = top predictions tested in-lab · Prospective = designed as a forward-looking test set · Retrospective = historical hold-out only · None = no experimental grounding.

Clinical: 7
Wet-lab confirmed: 20
Prospective: 7
Retrospective: 56
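
The tiers and counts above can be worked with programmatically. The following is a minimal sketch, not part of the original page: the `Validation` enum, the `shares` and `at_least` helpers, and the weakest-to-strongest ordering of tiers are all assumptions made for illustration. The counts are the ones stated in the summary, and the sample rows are copied from the listing.

```python
from enum import IntEnum

# Tier counts exactly as stated in the summary.
COUNTS = {
    "Clinical": 7,
    "Wet-lab confirmed": 20,
    "Prospective": 7,
    "Retrospective": 56,
}


class Validation(IntEnum):
    """Legend tiers. The weakest-to-strongest ordering is an assumption."""
    NONE = 0
    RETROSPECTIVE = 1
    PROSPECTIVE = 2
    WET_LAB_CONFIRMED = 3
    CLINICAL = 4


def shares(counts):
    """Return each tier's percentage of all listed benchmarks."""
    total = sum(counts.values())
    return {tier: 100 * n / total for tier, n in counts.items()}


# Illustrative subset of rows from the listing: (benchmark, tier, score).
ROWS = [
    ("MIMIC-IV Benchmark Tasks", Validation.CLINICAL, 89.4),
    ("Open Targets Platform", Validation.WET_LAB_CONFIRMED, 100.0),
    ("Polaris ADMET", Validation.PROSPECTIVE, 88.4),
    ("PDBbind", Validation.RETROSPECTIVE, 75.9),
    ("DUD-E", Validation.RETROSPECTIVE, 72.9),
]


def at_least(rows, floor):
    """Names of benchmarks validated at `floor` or a stronger tier."""
    return [name for name, tier, _score in rows if tier >= floor]


if __name__ == "__main__":
    for tier, pct in shares(COUNTS).items():
        print(f"{tier}: {pct:.1f}%")
    print(at_least(ROWS, Validation.PROSPECTIVE))
```

With the stated counts, 56 of the 90 listed benchmarks (about 62%) are retrospective only, so filtering with something like `at_least(ROWS, Validation.PROSPECTIVE)` narrows the list considerably.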

Clinical (7)

| Benchmark | Stages | Score | Notes |
|---|---|---|---|
| MIMIC-IV Benchmark Tasks | Phase III, Clinical Development, Post-market / RWE | 89.4 | Canonical clinical ML benchmark. Credentialed access limits casual use. |
| ClinBench Quarterly — Q2 2026 | Phase II, Phase III, Clinical Development | 87.6 | New track in Q2 2026 for endpoint adjudication. |
| ClinBench Quarterly (Insilico) | Phase II, Phase III, Clinical Development | 81.5 | Benchmark refresh cadence beats all academic trial outcome benchmarks. Leaderboards test frontier LLMs against quarterly-updated splits. |
| CPTAC Proteogenomic Benchmarks | Disease Modeling, Target ID, Phase II | 80.8 | Deep integrative oncology data. |
| HINT / TrialBench | Phase II, Phase III, Clinical Development | 76.5 | Limited by ClinicalTrials.gov quality. |
| Trial Outcome Prediction (TOP) | Phase III, Clinical Development | 76.5 | Often reported alongside HINT. |
| CT-Outcome (TrialBench v2) | Phase II, Phase III | 73.4 | Temporal splits are the key improvement. |

Wet-lab confirmed (20)

| Benchmark | Stages | Score | Notes |
|---|---|---|---|
| Open Targets Platform | Disease Modeling, Target ID | 100.0 | Industry gold standard for target prioritization. Quarterly versioned releases. |
| DepMap (Cancer Dependency Map) | Target ID, Disease Modeling | 100.0 | Quarterly release cadence. |
| Protein Language Model Eval 2026 | Virtual Cell, Hit ID | 100.0 | Meta FAIR + EvolutionaryScale collaboration; includes held-out targets with wet-lab fitness. |
| ChEMBL | Hit ID, Lead ID / ADMET | 97.5 | Underlies ~80% of public bioactivity ML benchmarks. |
| ProteinGym | Target ID, Lead ID / ADMET, IND-enabling | 97.5 | Field standard. Clinical track enables fair ESM/EVE/AlphaMissense comparison. |
| Therapeutic Antibody Design Benchmark 2026 | Hit ID, Lead ID / ADMET | 97.0 | Top-ranked submissions had wet-lab binding measured (Kd + aggregation) by independent labs. |
| Protein Design Benchmark 2026 | Hit ID | 97.0 | All submitted designs characterized in IPD / external wet labs. |
| RxRx3 Phenomics Benchmark | Hit ID, Lead ID / ADMET | 94.9 | Real phenomics data from Recursion's lab; public subsets only. Full dataset is proprietary (see private_benchmarks). |
| LINCS L1000 / CMap | Virtual Cell, Disease Modeling, Target ID | 89.9 | Foundational pharma resource for MoA work. Batch effects require careful handling. |
| canSAR | Target ID, Hit ID | 89.4 | Deep oncology focus; widely used druggability predictor. |
| PubChem BioAssay | Hit ID | 88.6 | Broadest HTS repository; quality is heterogeneous. |
| Cell Line Sensitivity Benchmark (CLSB) | Target ID, Lead ID / ADMET | 88.1 | DepMap-adjacent but adds new splits and PRISM v4. |
| TargetBench (Insilico) | Target ID, Disease Modeling | 84.6 | Disease-organized target ID benchmark, a unique axis. Frontier LLM leaderboard. |
| ISM Benchmarks: ADMET (Insilico) | Lead ID / ADMET, IND-enabling | 84.6 | Broader endpoint coverage than TDC ADMET. Side-by-side with TDC mirror on DDB. |
| Longevity Compound Benchmark | Hit ID, Lead ID / ADMET | 84.6 | Insilico-hosted; unique in bridging cheminformatics and aging biology. |
| CRISPR Outcome Prediction Benchmark | Hit ID | 79.5 | Prospective track added in Q1 2026. |
| IgLM / AntiBERTa benchmarks | Hit ID, Developmental Candidate | 77.5 | Moves toward true developability benchmarks. |
| Geneformer Eval | Virtual Cell | 77.0 | Author-led eval; still widely re-run on OpenProblems tasks. |
| TDC DrugSyn (OncoPolyPharm + DrugComb_NCI60) | Developmental Candidate, Lead ID / ADMET | 77.0 | Important for combination therapy design. |
| scGPT Evaluation Suite | Virtual Cell | 73.7 | Evaluation dominated by the authors' own model; flagged as self-referential. Pair with OpenProblems for fair comparison. |

Prospective (7)

| Benchmark | Stages | Score | Notes |
|---|---|---|---|
| Virtual Cell Benchmark Suite 2026 | Virtual Cell | 97.0 | Successor to Open Problems perturbation benchmark. Prospectively designed; Tahoe-100M inclusion makes it industry-relevant. |
| ASAP Discovery Antiviral 2025 | Hit ID, Lead ID / ADMET | 93.9 | Top predictions are synthesized and tested; a rare prospective public benchmark. |
| Longevity Benchmark (Insilico) | Disease Modeling, Target ID, Post-market / RWE | 90.6 | Unique, broad longevity/aging benchmark slice; nothing else in the field covers aging comparably. Leaderboard features frontier LLMs. |
| Polaris ADMET | Lead ID / ADMET | 88.4 | Industry splits enforce blinded eval; highest industry relevance among ADMET benchmarks. |
| CZ Virtual Cell Challenge | Virtual Cell, Target ID | 88.1 | Gold standard in the making for foundation-model-era perturbation prediction. Hidden test set guards against leakage. |
| mRNA Design Benchmark (CodonBench 2026) | Hit ID, Lead ID / ADMET | 82.0 | Designed with Moderna and Deep Genomics; includes held-out wet-lab validation track. |
| Polaris Biologics (Polyreactivity / SEC / Tm) | Developmental Candidate | 79.0 | Industry-donated; growing. |

Retrospective (56)

| Benchmark | Stages | Score | Notes |
|---|---|---|---|
| TDC ADMET Group | Lead ID / ADMET | 100.0 | Most-adopted ADMET benchmark. 100+ leaderboard submissions. |
| SAbDab | Hit ID, Lead ID / ADMET, Developmental Candidate | 100.0 | Canonical antibody structure resource. Weekly updates. |
| Observed Antibody Space (OAS) | Hit ID, Lead ID / ADMET | 97.5 | Underlies AbLang, IgLM, AntiBERTa; industry-adopted. |
| PoseBusters | Hit ID | 97.0 | Exposed major failure modes in AlphaFold-Multimer/DiffDock/RFAA. Default pharma filter. |
| PLINDER | Hit ID | 97.0 | Replaces PDBbind as the modern leakage-controlled docking standard. |
| PLINDER v2 Protein-Ligand Benchmark | Hit ID | 97.0 | PLINDER is consistently cited as the go-to replacement for PDBbind in modern docking evaluation. |
| STRING | Target ID, Disease Modeling | 94.9 | Workhorse for network-based target ID. Distinguish functional vs physical edges. |
| CASP15 | Hit ID, Target ID | 94.9 | Biennial. Introduced ligand prediction category. |
| CASP16 | Hit ID | 94.4 | First full multimer+ligand+RNA joint eval. |
| CAMEO weekly targets | Hit ID | 94.4 | Weekly cadence complements biennial CASP. |
| Boltz-1 Structure Prediction Benchmark | Hit ID | 94.4 | Open-source companion to commercial structure predictors; benchmark splits audited against AlphaFold 3 leakage. |
| ORD Reaction Benchmark | Developmental Candidate | 93.9 | Modern open reaction corpus; industry-scale. |
| Open Problems: Perturbation Prediction | Virtual Cell | 91.9 | Best-in-class rigor (Viash workflow, hidden test, NeurIPS track). |
| PrimeKG | Disease Modeling, Target ID | 91.9 | Modern, well-engineered KG; strong for GNN drug repurposing. |
| FAERS (raw) | Post-market / RWE | 91.1 | Known under-/over-reporting biases. |
| scPerturb | Virtual Cell, Target ID | 88.9 | Canonical harmonized resource. Strong Perturb-seq coverage; weaker for chemical perturbations. |
| PINDER | Hit ID | 88.9 | Expected PPI docking standard. |
| Practical Molecular Optimization (PMO) | Lead ID / ADMET, Developmental Candidate | 88.9 | Sample-efficiency focus exposed shortcomings of reward-maxing methods. |
| CoV-AbDab | Hit ID | 88.9 | Narrow modality but critical for pandemic-preparedness ML. |
| ISM Benchmarks: GPCRs (Insilico) | Hit ID, Lead ID / ADMET | 87.6 | Largest open GPCR affinity benchmark. Leaderboards test external frontier LLMs, so they are not self-referential. |
| CAPRI Rounds | Hit ID | 86.3 | Oldest PPI prediction benchmark. |
| ToxCast | Lead ID / ADMET, IND-enabling | 85.6 | Regulatory-grade broad tox dataset. |
| GNNBench-Drug 2026 | Hit ID, Lead ID / ADMET | 85.6 | IBM-led; overlaps with MoleculeNet but adds modern splits. |
| CAFA5 | Target ID | 84.3 | CAFA5 broke attendance records. |
| MoleculeACE | Lead ID / ADMET | 83.3 | Critical stress-test for generalization; exposed GNN weaknesses. |
| MatBench | Developmental Candidate | 83.3 | Materials-science benchmark; relevant for formulation / co-crystal work. |
| OffSides / TWOSIDES | Post-market / RWE | 83.0 | Key benchmark for DDI + adverse event ML. |
| DrugComb 2.0 Synergy Benchmark | Lead ID / ADMET, Developmental Candidate | 83.0 | Industry-relevant for combination oncology. |
| DMPK Integrated Benchmark | Lead ID / ADMET, Developmental Candidate | 82.5 | AZ/Merck/Pfizer contributed held-out test molecules. |
| DOCKSTRING | Hit ID | 81.3 | Vina scores are a proxy; not a replacement for wet assays. |
| DisGeNET | Disease Modeling, Target ID | 81.0 | Commercial license required for industry. Text-mining noise limits quality. |
| LIT-PCBA | Hit ID | 80.8 | Much fairer than DUD-E; small target count limits coverage. |
| FLIP | Target ID, Developmental Candidate | 80.8 | Complements ProteinGym (smaller but carefully designed splits). |
| GuacaMol | Lead ID / ADMET, Developmental Candidate | 80.5 | First-generation generative benchmark; largely superseded by PMO for goal-directed optimization. |
| Open Systems Pharmacology / PK-Sim | Phase I, IND-enabling | 80.3 | Open alternative to Simcyp. |
| ADMET-AI | Lead ID / ADMET | 79.5 | Strong baselines + web tool; builds on TDC. |
| AMES (mutagenicity) | IND-enabling, Lead ID / ADMET | 79.5 | Core gentox endpoint. |
| scImmuneBench | Virtual Cell, Disease Modeling | 79.5 | Useful for cell-therapy companies evaluating immune foundation models. |
| MoleculeNet | Lead ID / ADMET, Hit ID | 78.0 | Widely cited (3600+); aging splits with known scaffold leakage. |
| USPTO-50K / USPTO-MIT (Retrosynthesis) | Lead ID / ADMET, Developmental Candidate | 78.0 | Known leakage across canonical splits; use time-split or ORD for fairer eval. |
| Tox21 | Lead ID / ADMET, IND-enabling | 77.5 | Field-standard tox benchmark; endpoint count small vs modern suites. |
| Obach PK Dataset | Phase I, IND-enabling, Lead ID / ADMET | 77.0 | Small but highest-quality human-PK dataset. |
| CASF-2016 | Hit ID | 76.2 | Authoritative scoring-power eval; update cadence slow. |
| PDBbind | Hit ID, Lead ID / ADMET | 75.9 | Scaffold/temporal leakage well-documented. Pair with CASF + LeakyPDB. |
| SIDER | Post-market / RWE, IND-enabling | 74.9 | Aging but still widely used. TWOSIDES/OffSides offer newer signals. |
| TAPE | Target ID, Developmental Candidate | 74.9 | Historically important; largely superseded by ProteinGym/FLIP for fitness and by PEER for broader tasks. |
| Simcyp Validation Sets | Phase I, Phase II, IND-enabling | 74.4 | Industry gold standard but proprietary. Open benchmarks exist via OSP Suite. |
| PEER | Target ID, Developmental Candidate | 74.4 | Broader than TAPE, tighter than ProteinGym; good middle ground. |
| ClawBio Skill Correctness Bench | Disease Modeling, Target ID, Clinical Development | 74.2 | Independent third-party bench structurally precludes self-reference. Coverage narrow but rigor exemplary. |
| hERG (cardio-tox) TDC | IND-enabling, Lead ID / ADMET | 73.9 | Small but widely benchmarked. Industry pairs with SafetyPanel-5. |
| DILI / LD50 Zhu | IND-enabling, Lead ID / ADMET | 73.9 | Essential IND-enabling endpoints. |
| DUD-E | Hit ID | 72.9 | Well-known analog bias in decoy selection; use LIT-PCBA / PLINDER for fair VS. |
| MOSES | Lead ID / ADMET, Developmental Candidate | 72.4 | Distribution-learning metrics known to saturate. |
| PerturbBench | Virtual Cell | 71.4 | Pharma-led (Genentech); well-specified eval. |
| ClinTox | Lead ID / ADMET, IND-enabling | 65.6 | Small, binary; saturated. Useful only as sanity check. |
| DEKOIS 2.0 | Hit ID | 57.5 | Historical reference; use LIT-PCBA / PLINDER for modern VS. |