Initiatives (Meta-platforms, Consortia, Competitions)
31 initiatives collectively track 1,749 benchmarks. Sorted by score (descending); click any header to re-sort.
| # | Initiative | Kind | Benchmarks tracked | As of | Host | Score | Links |
|---|---|---|---|---|---|---|---|
| 1 | Therapeutics Data Commons (TDC) | meta-platform | 83 | 2026-05-12 | Zitnik Lab, Harvard Medical School (+ MIT, Stanford, Georgia Tech collaborators) | 100.0 | site · gh |
| 2 | CASP (Critical Assessment of Structure Prediction) | competition | 16 | 2026-05-12 | Prediction Center, UC Davis | 100.0 | site · gh |
| 3 | ProteinGym | meta-platform | 217 | 2026-05-12 | Marks Lab (Harvard) + OATML (Oxford) + DeepMind | 97.5 | site · gh |
| 4 | ELIXIR Infrastructure | consortium | 18 | 2026-05-12 | EMBL-EBI + 23 EU member nodes | 97.5 | site · gh |
| 5 | CAMEO | competition | 4 | 2026-05-12 | Biozentrum Basel + SIB | 94.4 | site · gh |
| 6 | PoseBusters Evaluation Suite | meta-platform | 3 | 2026-05-12 | Oxford OPIG (Deane Lab) | 93.9 | site · gh |
| 7 | PLINDER / PINDER | meta-platform | 2 | 2026-05-12 | Biozentrum Basel + VantAI + Isomorphic Labs + EPFL | 93.9 | site · gh |
| 8 | Open Problems in Single-Cell Analysis | consortium | 29 | 2026-05-12 | CZI + Helmholtz Munich + Yale + HMS | 91.9 | site · gh |
| 9 | Polaris Hub | meta-platform | 48 | 2026-05-12 | Polaris consortium (Valence Labs, Recursion, Novartis, Pfizer, Merck, AstraZeneca) | 91.4 | site · gh |
| 10 | ScienceAIBench | meta-platform | 227 | 2026-05-12 | Insilico Medicine | 90.6 | site · gh |
| 11 | DREAM Challenges | competition | 74 | 2026-05-12 | Sage Bionetworks + IBM + academic partners | 89.4 | site · gh |
| 12 | MIMIC-IV / eICU | data-platform | 14 | 2026-05-12 | MIT Lab for Computational Physiology | 89.4 | site · gh |
| 13 | CZI Virtual Cell / CellxGene / VCC | consortium | 12 | 2026-05-12 | Chan Zuckerberg Initiative / CZ Biohub | 88.9 | site · gh |
| 14 | Open Reaction Database (ORD) | data-platform | 1 | 2026-05-12 | ORD consortium (Doyle, Coley, Pfizer, Merck, BASF) | 88.9 | site · gh |
| 15 | Drug Discovery Benchmarks (DDB) | meta-platform | 206 | 2026-05-12 | Insilico Medicine | 87.6 | site · gh |
| 16 | CAFA | competition | 6 | 2026-05-12 | Radivojac / Friedberg / Jiang consortium | 86.8 | site · gh |
| 17 | CAPRI | competition | 56 | 2026-05-12 | EBI + CCP4 | 86.3 | site · gh |
| 18 | FAERS / SIDER / OffSides / TWOSIDES | data-platform | 4 | 2026-05-12 | FDA CDER + Tatonetti Lab | 85.6 | site · gh |
| 19 | InsilicoBench | meta-platform | 162 | 2026-05-12 | Insilico Medicine | 84.6 | site · gh |
| 20 | FLIP | meta-platform | 15 | 2026-05-12 | Rostlab TUM + AlQuraishi Lab Columbia | 80.8 | site · gh |
| 21 | CPTAC | consortium | 10 | 2026-05-12 | NCI Office of Cancer Clinical Proteomics Research | 80.8 | site · gh |
| 22 | DeepChem | meta-platform | 40 | 2026-05-12 | DeepChem community | 80.0 | site · gh |
| 23 | MoleculeNet | meta-platform | 17 | 2026-05-12 | DeepChem community (Pande Lab alumni) | 78.0 | site · gh |
| 24 | TrialBench / HINT / TOP | meta-platform | 4 | 2026-05-12 | Fu/Sun Lab, Georgia Tech + HMS | 76.5 | site · gh |
| 25 | ClawBio Benchmarks | meta-platform | 10 | 2026-05-03 | ClawBio (open source, MIT) | 74.2 | site · gh |
| 26 | EU-OPENSCREEN / EUbOPEN | consortium | 5 | 2026-05-12 | EU-OPENSCREEN ERIC + IMI EUbOPEN | 73.9 | site · gh |
| 27 | PKU-AIDD / ChinaDrug Benchmarks | consortium | 7 | 2026-05-12 | PKU + SIMM CAS + Tsinghua + Baidu + Huawei | 73.9 | site · gh |
| 28 | Kaggle – Pharma / Bio Competitions | competition | 23 | 2026-05-12 | Google / Kaggle + sponsoring companies | 71.9 | site · gh |
| 29 | Papers With Code – Drug Discovery | meta-platform | 120 | 2026-05-12 | Meta AI / Papers With Code community | 71.6 | site · gh |
| 30 | PDBbind / CASF | meta-platform | 6 | 2026-05-12 | SIMM, Chinese Academy of Sciences | 70.4 | site · gh |
| 31 | HuggingFace – Bio/Chem Datasets | data-platform | 310 | 2026-05-12 | HuggingFace + community uploaders | 67.8 | site · gh |
Per-initiative detail
Therapeutics Data Commons (TDC) – 83 benchmarks tracked (as of 2026-05-12)
Open-science platform curating ML datasets/tasks across the drug discovery pipeline with a unified API, standard splits, and leaderboards.
Count methodology: Scraped tdcommons.ai single_pred/multi_pred/generation overview pages 2026-05-12: single-pred ~38 datasets (ADME/Tox/HTS/QM/Yields/Epitope/Develop/CRISPROutcome), multi-pred ~32 datasets (DTI/DDI/PPI/GDA/DrugRes/DrugSyn/PeptideMHC/AntibodyAff/MTI/Catalyst/TCREpitope/TrialOutcome/ProteinPeptide/PerturbOutcome/scDTI), generation ~13 (MolGen/RetroSyn/Reaction/SBDD). 8 named leaderboard groups.
Breakdown
- single_prediction: 38
- multi_prediction: 32
- generation: 13
- leaderboard_groups: 8
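As a sanity check, the per-category counts above reconcile with the headline figure (leaderboard groups are a separate grouping axis, not additional benchmarks). A minimal sketch using the counts from this entry:

```python
# Cross-check: the three dataset-category counts above should sum to the
# headline "83 benchmarks tracked" for TDC.
tdc_counts = {
    "single_prediction": 38,
    "multi_prediction": 32,
    "generation": 13,
}
total = sum(tdc_counts.values())
print(total)  # 83
```

For programmatic access, TDC itself ships one-line Python loaders (the `PyTDC` package), per its documentation.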
Individually catalogued benchmarks hosted here
- TDC ADMET Group – 100.0
- PrimeKG – 91.9
- Practical Molecular Optimization (PMO) – 88.9
- ToxCast – 85.6
- OffSides / TWOSIDES – 83.0
- DisGeNET – 81.0
- GuacaMol – 80.5
- AMES (mutagenicity) – 79.5
- USPTO-50K / USPTO-MIT (Retrosynthesis) – 78.0
- Tox21 – 77.5
- TDC DrugSyn (OncoPolyPharm + DrugComb_NCI60) – 77.0
- Obach PK Dataset – 77.0
- HINT / TrialBench – 76.5
- hERG (cardio-tox) TDC – 73.9
- DILI / LD50 Zhu – 73.9
- DUD-E – 72.9
- MOSES – 72.4
- ClinTox – 65.6
Notes
Most comprehensive ML-ready therapeutics benchmark hub. NeurIPS 2021 + Nat Chem Bio 2022.
CASP (Critical Assessment of Structure Prediction) – 16 benchmarks tracked (as of 2026-05-12)
Biennial blind evaluation of protein structure prediction; drove AlphaFold's validation.
Count methodology: predictioncenter.org archives: CASP1 (1994) through CASP16 (2024) = 16 editions; ~100 targets × ~5 categories per edition.
Breakdown
- editions: 16
- categories_per_edition: 5
- targets_avg: 100
Individually catalogued benchmarks hosted here
Notes
Historical gold standard for blind evaluation. CASP15 (2022) added RNA and protein-ligand categories; CASP16 expanded multimer assessment.
ProteinGym – 217 benchmarks tracked (as of 2026-05-12)
Large-scale benchmark for protein fitness prediction from DMS + clinical variant effects.
Count methodology: ProteinGym v1.2 README + NeurIPS 2023 paper: 217 DMS substitution assays + 66 indel assays + 2525 ClinVar clinical variants.
Breakdown
- dms_substitutions: 217
- dms_indels: 66
- mutations: 2700000
- clinical_variants: 2525
Individually catalogued benchmarks hosted here
- ProteinGym – 97.5
Notes
De facto standard for variant effect prediction. The clinical track enables fair comparison of ESM, EVE, and AlphaMissense.
ELIXIR Infrastructure – 18 benchmarks tracked (as of 2026-05-12)
European life-science data infrastructure hosting benchmark-relevant resources (UniProt, Ensembl, ChEMBL, PDBe, IntAct).
Count methodology: ELIXIR Core Data Resources list 2026-05: ~18 resources with benchmark/leaderboard components.
Breakdown
- core_data_resources: 18
- member_nodes: 23
Individually catalogued benchmarks hosted here
- Open Targets Platform – 100.0
- ChEMBL – 97.5
- STRING – 94.9
Notes
Meta-resource of meta-resources.
CAMEO – 4 benchmarks tracked (as of 2026-05-12)
Continuous weekly blind eval of protein 3D / multimer / ligand prediction using pre-release PDB structures.
Count methodology: cameo3d.org 2026-05: 4 active categories (3D monomer, 3D multimer, model quality, ligand pocket); ~1000 targets/year.
Breakdown
- monomer: 1
- multimer: 1
- quality_estimation: 1
- ligand: 1
Individually catalogued benchmarks hosted here
- CAMEO weekly targets – 94.4
Notes
Excellent continuous cadence complementing CASP.
PoseBusters Evaluation Suite – 3 benchmarks tracked (as of 2026-05-12)
Physics-aware validation of docking/co-folding poses; 19 checks + curated test sets.
Count methodology: GitHub README: PoseBusters v1 (308 complexes), v2 (428), Astex Diverse Set (85) = 3 canonical suites.
Breakdown
- test_sets: 3
- validation_checks: 19
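To illustrate the flavor of these validation checks, here is a toy, stdlib-only analogue of a single geometry check (bond-length plausibility). The function name and thresholds are illustrative assumptions, not PoseBusters' actual implementation:

```python
# Toy analogue of a physics-aware pose check: flag bonds whose lengths fall
# outside a plausible covalent range. Real PoseBusters runs 19 such checks
# (geometry, energy, identity) per pose; this is only a sketch.
PLAUSIBLE_BOND_ANGSTROM = (0.9, 2.0)  # illustrative thresholds

def implausible_bonds(bond_lengths):
    """Return indices of bond lengths outside the plausible range."""
    lo, hi = PLAUSIBLE_BOND_ANGSTROM
    return [i for i, d in enumerate(bond_lengths) if not lo <= d <= hi]

print(implausible_bonds([1.09, 1.54, 3.2, 0.4]))  # [2, 3]
```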
Individually catalogued benchmarks hosted here
- PoseBusters – 97.0
Notes
Changed pose-prediction evaluation norms; now a default filter in pharma pipelines.
PLINDER / PINDER – 2 benchmarks tracked (as of 2026-05-12)
Leakage-controlled protein-ligand (PLINDER) and protein-protein (PINDER) docking datasets.
Count methodology: plinder.sh + pinder.sh: 2 major benchmarks (PLINDER 460k systems, PINDER 267k systems).
Breakdown
- plinder_systems: 460000
- pinder_systems: 267498
Individually catalogued benchmarks hosted here
Notes
Replacing PDBbind/CASF for modern docking ML eval.
Open Problems in Single-Cell Analysis – 29 benchmarks tracked (as of 2026-05-12)
Community benchmark suite for single-cell analysis with reproducible Viash/Nextflow pipelines and NeurIPS tracks.
Count methodology: openproblems.bio task registry + Luecken et al. Nat Biotech 2025: 29 benchmark tasks (batch integration, denoising, dim-reduction, label projection, perturbation, spatial, multimodal).
Breakdown
- batch_integration: 3
- perturbation: 4
- multimodal: 6
- label_transfer: 4
- spatial: 5
- other: 7
Individually catalogued benchmarks hosted here
Notes
Gold-standard single-cell benchmarking rigor; Nat Biotech 2025.
Polaris Hub – 48 benchmarks tracked (as of 2026-05-12)
Industry-curated small-molecule benchmarks with working groups on method-comparison standards.
Count methodology: polarishub.io/benchmarks public listing 2026-05: ~48 public benchmarks across Recursion, Valence, Novartis, AstraZeneca, Polaris Small Molecule Steering Committee orgs.
Breakdown
- public_benchmarks: 48
- datasets: 60
- competitions: 4
Individually catalogued benchmarks hosted here
- Polaris ADMET – 88.4
- Polaris Biologics (Polyreactivity / SEC / Tm) – 79.0
Notes
Industry-led counterweight to academic benchmarks. Strong on method-comparison rigor.
ScienceAIBench – 227 benchmarks tracked (as of 2026-05-12)
Insilico Medicine's public scientific-AI benchmark portal. Spans biology (longevity, target ID), affinity/binding, ADMET, clinical trials, biologics, materials; leaderboards benchmark frontier LLMs (GPT-5.x, Claude Opus/Sonnet 4.x, Gemini 3, Grok 4.1, DeepSeek v3.2, Kimi K2.x).
Count methodology: Fetched https://scienceaibench.insilico.com/api/benchmarks on 2026-05-12; meta.totalBenchmarks=227 across 7 taxonomy categories × 17 suites. Leaderboard submitters are external frontier LLMs (top entries: Grok 4.1, GPT 5.1/5.2, Claude Opus 4.5/4.6, Gemini 3 Flash, DeepSeek v3.2, Kimi K2.5). Not self-referential: Insilico's own models are not on the leaderboards.
Breakdown
- Biology (TargetBench + Longevity): 29
- Affinity and Binding: 94
- Chemical Synthesis (Retrosynthesis): 2
- ADMET, PK & Safety: 50
- Clinical Trials (ClinBench Quarterly): 25
- Biologics: 6
- Materials (MatBench + others): 21
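The methodology count can be reproduced by summing per-category counts and comparing against meta.totalBenchmarks. A hedged sketch with dummy data, since the real payload shape is an assumption inferred from the methodology note:

```python
import json

# Assumed payload shape (meta.totalBenchmarks + a categories list); the
# category counts mirror the breakdown above.
payload = json.loads("""{
  "meta": {"totalBenchmarks": 227},
  "categories": [
    {"name": "Biology", "count": 29},
    {"name": "Affinity and Binding", "count": 94},
    {"name": "Chemical Synthesis", "count": 2},
    {"name": "ADMET, PK & Safety", "count": 50},
    {"name": "Clinical Trials", "count": 25},
    {"name": "Biologics", "count": 6},
    {"name": "Materials", "count": 21}
  ]
}""")

by_category = sum(c["count"] for c in payload["categories"])
assert by_category == payload["meta"]["totalBenchmarks"]
print(by_category)  # 227
```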
Individually catalogued benchmarks hosted here
- TDC ADMET Group โ 100.0
- Longevity Benchmark (Insilico) โ 90.6
- ISM Benchmarks: GPCRs (Insilico) โ 87.6
- TargetBench (Insilico) โ 84.6
- ISM Benchmarks: ADMET (Insilico) โ 84.6
- MatBench โ 83.3
- ClinBench Quarterly (Insilico) โ 81.5
- Polaris Biologics (Polyreactivity / SEC / Tm) โ 79.0
- hERG (cardio-tox) TDC โ 73.9
Notes
Largest of the three Insilico portals. Live leaderboards are regenerated against external frontier LLMs, so NOT flagged self-referential. Uniquely strong longevity/aging benchmark slice, which moves it up the aging-relevance ranking.
DREAM Challenges – 74 benchmarks tracked (as of 2026-05-12)
Long-running crowd-sourced biomedical prediction challenges, many pharma-sponsored.
Count methodology: dreamchallenges.org/closed-challenges + /active as of 2026-05: 74 completed/active challenges; ~38 drug-discovery-relevant.
Breakdown
- drug_sensitivity: 9
- target_prediction: 6
- toxicity: 5
- disease_subtyping: 12
- other_biomed: 42
Individually catalogued benchmarks hosted here
- none of the individually catalogued benchmarks are cross-referenced here (typical of count-only initiatives)
Notes
Historical impact on field norms. Cadence has slowed 2022+.
MIMIC-IV / eICU – 14 benchmarks tracked (as of 2026-05-12)
ICU EHR datasets used for clinical outcome, adverse-event, and PK/PD benchmarks.
Count methodology: PhysioNet + BigBio MIMIC-IV benchmarks 2026-05: 14 derived benchmarks (mortality, LOS, readmission, sepsis, AKI, drug dosing, phenotyping).
Breakdown
- outcome_prediction: 6
- drug_dosing: 3
- adverse_event: 3
- phenotyping: 2
Individually catalogued benchmarks hosted here
- MIMIC-IV Benchmark Tasks – 89.4
Notes
Canonical for clinical ML. US-centric.
CZI Virtual Cell / CellxGene / VCC – 12 benchmarks tracked (as of 2026-05-12)
Umbrella for CZI-funded virtual-cell benchmark initiatives: CellxGene, Virtual Cell Challenge, Tabula atlases.
Count methodology: chanzuckerberg.com/science 2026-05: Virtual Cell Challenge (4 tracks), CellxGene Census benchmarks (4), Tabula Sapiens-derived eval suites (4).
Breakdown
- virtual_cell_challenge: 4
- cellxgene: 4
- tabula: 4
Individually catalogued benchmarks hosted here
- Open Problems: Perturbation Prediction – 91.9
- CZ Virtual Cell Challenge – 88.1
Notes
VCC is becoming the canonical virtual-cell benchmark.
Open Reaction Database (ORD) – 1 benchmark tracked (as of 2026-05-12)
Open reaction repository in a schema-validated format; enables reaction / yield / retrosynthesis benchmarks.
Count methodology: open-reaction-database.org 2026-05: ~2.1M reactions as single versioned benchmark corpus.
Breakdown
- reactions: 2100000
- contributing_orgs: 30
Individually catalogued benchmarks hosted here
- ORD Reaction Benchmark – 93.9
Notes
Biggest open reaction corpus; industry donations accelerating.
Drug Discovery Benchmarks (DDB) – 206 benchmarks tracked (as of 2026-05-12)
Insilico's drug-discovery-specific benchmark portal: TargetBench, Longevity Benchmark, GPCR affinity, PDBbind-style tasks, ISM ADMET, TDC ADMET mirror, ClinBench, biologics.
Count methodology: Fetched https://ddb.insilico.com/api/benchmarks on 2026-05-12; meta.totalBenchmarks=206 across 6 categories × 15 suites.
Breakdown
- Biology (TargetBench + Longevity): 29
- Affinity and Binding: 94
- Chemical Synthesis: 2
- ADMET, PK & Safety: 50
- Clinical Trials: 25
- Biologics: 6
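This breakdown is ScienceAIBench's minus its Materials slice (227 - 21 = 206). A quick consistency check, using only counts from this catalog:

```python
# DDB's category breakdown equals ScienceAIBench's without Materials,
# so the totals should differ by exactly the Materials count (21).
scienceaibench = {
    "Biology": 29, "Affinity and Binding": 94, "Chemical Synthesis": 2,
    "ADMET, PK & Safety": 50, "Clinical Trials": 25, "Biologics": 6,
    "Materials": 21,
}
ddb = {k: v for k, v in scienceaibench.items() if k != "Materials"}
print(sum(ddb.values()))  # 206
```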
Individually catalogued benchmarks hosted here
- TDC ADMET Group – 100.0
- Longevity Benchmark (Insilico) – 90.6
- ISM Benchmarks: GPCRs (Insilico) – 87.6
- TargetBench (Insilico) – 84.6
- ISM Benchmarks: ADMET (Insilico) – 84.6
- MatBench – 83.3
- ClinBench Quarterly (Insilico) – 81.5
- AMES (mutagenicity) – 79.5
- Polaris Biologics (Polyreactivity / SEC / Tm) – 79.0
- Obach PK Dataset – 77.0
- hERG (cardio-tox) TDC – 73.9
- DILI / LD50 Zhu – 73.9
Notes
Drug-discovery focused cut. Includes a mirror of TDC ADMET for cross-platform comparability.
CAFA – 6 benchmarks tracked (as of 2026-05-12)
Blind eval of protein function prediction against time-delayed UniProt-GOA.
Count methodology: biofunctionprediction.org archives: CAFA1–5 (2010–2023) + CAFA6 announced 2025 = 6 editions.
Breakdown
- editions: 6
- cafa5_targets: 142000
Individually catalogued benchmarks hosted here
- CAFA5 – 84.3
Notes
CAFA5 (Kaggle, 2023) drew 1625 teams.
CAPRI – 56 benchmarks tracked (as of 2026-05-12)
Blind prediction of protein-protein complexes, protein-peptide, and protein-ligand assemblies.
Count methodology: EBI CAPRI archive: Round 1 (2001) through Round 56 (2024).
Breakdown
- rounds: 56
- targets_total_approx: 300
Individually catalogued benchmarks hosted here
- CAPRI Rounds – 86.3
Notes
Oldest PPI prediction benchmark.
FAERS / SIDER / OffSides / TWOSIDES – 4 benchmarks tracked (as of 2026-05-12)
FDA adverse event reports + SIDER/OffSides/TWOSIDES derivatives for post-market signal detection.
Count methodology: FAERS (19M+ reports) + 3 derived benchmarks (SIDER, OffSides, TWOSIDES) = 4.
Breakdown
- faers_reports: 19000000
- sider_pairs: 139000
- offsides_signals: 438000
- twosides_combo: 870000
Individually catalogued benchmarks hosted here
- FAERS (raw) – 91.1
- OffSides / TWOSIDES – 83.0
- SIDER – 74.9
Notes
Essential for pharmacovigilance ML. Known reporting biases.
InsilicoBench – 162 benchmarks tracked (as of 2026-05-12)
Compact cut of the Insilico benchmark stack focused on biology (longevity), GPCR affinity, retrosynthesis, ADMET, and clinical trials.
Count methodology: Fetched https://insilicobench.insilico.com/api/benchmarks on 2026-05-12; meta.totalBenchmarks=162 across 5 categories (Biology 19, Affinity/Binding 88, Chemical Synthesis 2, ADMET 28, Clinical Trials 25).
Breakdown
- Biology (Longevity): 19
- Affinity and Binding: 88
- Chemical Synthesis: 2
- ADMET, PK & Safety: 28
- Clinical Trials: 25
Individually catalogued benchmarks hosted here
- Longevity Benchmark (Insilico) – 90.6
- ISM Benchmarks: GPCRs (Insilico) – 87.6
- ISM Benchmarks: ADMET (Insilico) – 84.6
- ClinBench Quarterly (Insilico) – 81.5
Notes
Curated subset of ScienceAIBench. Same leaderboard model pool, so likewise NOT self-referential.
FLIP – 15 benchmarks tracked (as of 2026-05-12)
Protein fitness benchmarks focused on realistic train/test splits (AAV, GB1, Meltome, SCL, Bind).
Count methodology: FLIP README: 5 landscapes × 3 splits = 15 benchmarks.
Breakdown
- landscapes: 5
- splits_per_landscape: 3
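The 15 benchmarks form a landscape-by-split grid. A sketch of enumerating it; the split names below are placeholders, since FLIP's actual split names vary per landscape:

```python
from itertools import product

# Enumerate the 5 landscapes x 3 splits = 15 FLIP benchmarks. Landscape
# names come from the summary above; split names are placeholders.
landscapes = ["AAV", "GB1", "Meltome", "SCL", "Bind"]
splits = ["split_1", "split_2", "split_3"]  # placeholder names

benchmarks = [f"{land}/{s}" for land, s in product(landscapes, splits)]
print(len(benchmarks))  # 15
```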
Individually catalogued benchmarks hosted here
- FLIP – 80.8
Notes
Complementary to ProteinGym (smaller but careful splits).
CPTAC – 10 benchmarks tracked (as of 2026-05-12)
Integrated proteogenomic datasets across 10 tumor types; hosts DREAM proteogenomic benchmarks.
Count methodology: CPTAC data portal 2026-05: 10 tumor types with full proteogenomic characterization (BR, CO, EN, GBM, HNSCC, LSCC, LUAD, OV, PDAC, CCRCC).
Breakdown
- tumor_types: 10
- samples: 1600
- omics_layers: 6
Individually catalogued benchmarks hosted here
- CPTAC Proteogenomic Benchmarks – 80.8
Notes
Deep but narrow (oncology).
DeepChem – 40 benchmarks tracked (as of 2026-05-12)
OSS library bundling molecular ML benchmark datasets and baselines; hosts MoleculeNet.
Count methodology: deepchem.molnet module listing: ~40 packaged datasets (MoleculeNet core + extensions).
Breakdown
- moleculenet_core: 17
- adme_tox_extensions: 10
- protein: 5
- materials: 4
- misc: 4
Individually catalogued benchmarks hosted here
- MoleculeNet – 78.0
Notes
Excellent reproducibility: one-liner dataset loaders.
MoleculeNet – 17 benchmarks tracked (as of 2026-05-12)
Benchmark suite covering quantum, physical, biophysical, physiological molecular ML tasks.
Count methodology: Wu et al. 2018 Chem Sci + DeepChem repo enumeration: QM7/QM7b/QM8/QM9, ESOL, FreeSolv, Lipophilicity, PCBA, MUV, HIV, BACE, BBBP, Tox21, ToxCast, SIDER, ClinTox, PDBbind (17).
Breakdown
- quantum: 4
- physical_chem: 3
- biophysics: 4
- physiology: 6
Individually catalogued benchmarks hosted here
- GuacaMol – 80.5
- MoleculeNet – 78.0
- Tox21 – 77.5
- MOSES – 72.4
- ClinTox – 65.6
Notes
Historically foundational; many splits have documented leakage. Community has largely moved to TDC / Polaris for new work.
TrialBench / HINT / TOP – 4 benchmarks tracked (as of 2026-05-12)
Suite of benchmarks for clinical trial outcome prediction.
Count methodology: Fu et al. 2022-2024: HINT (17k trials), TOP (17k), TrialBench (21k trials, 12k drugs), CT-Outcome.
Breakdown
- hint_trials: 17000
- top_trials: 17000
- trialbench_trials: 21000
- task_variants: 12
Individually catalogued benchmarks hosted here
- HINT / TrialBench – 76.5
- Trial Outcome Prediction (TOP) – 76.5
- CT-Outcome (TrialBench v2) – 73.4
Notes
First rigorous ML benchmarks for trial outcome prediction. Limited by ClinicalTrials.gov data quality.
ClawBio Benchmarks – 10 benchmarks tracked (as of 2026-05-03)
Public scientific-correctness leaderboard for bio-analysis skills. An independent third-party benchmark (clawbio_bench, authored by Biostochastics LLC) tests ClawBio skills on safety, correctness, and honesty, with a public failure surface and remediation tasks.
Count methodology: Scraped https://clawbio.ai/benchmarks.html on 2026-05-12; last bench run 2026-05-03 against ClawBio commit 7820473 using clawbio_bench v0.1.5. 10 skills audited: claw-metagenomics, equity-scorer, nutrigx-advisor, bio-orchestrator, pharmgx-reporter, fine-mapping, clinical-variant-reporter, cvr-acmg-correctness, gwas-prs, cvr-variant-identity. 168/182 tests passing (92.3%).
Breakdown
- skills_audited: 10
- tests_total: 182
- tests_passing: 168
- pass_rate_pct: 92.3
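The reported pass rate follows directly from the raw counts above:

```python
# Reproduce the reported pass rate (percent, one decimal) from the
# raw test counts in the breakdown.
tests_total, tests_passing = 182, 168
pass_rate = round(100 * tests_passing / tests_total, 1)
print(pass_rate)  # 92.3
```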
Individually catalogued benchmarks hosted here
- ClawBio Skill Correctness Bench – 74.2
Notes
Independent third-party bench in a separate repo, hence structurally NOT self-referential. Coverage is narrow (bio-analysis skills) but rigor is exemplary (tri-dimensional: safety × correctness × honesty). A model for how skill/agent correctness should be audited.
EU-OPENSCREEN / EUbOPEN – 5 benchmarks tracked (as of 2026-05-12)
EU chemical biology ERIC compound libraries + EUbOPEN chemogenomic probes.
Count methodology: eu-openscreen.eu + eubopen.org 2026-05: ECBD (1), Bioactivity sets (2), EUbOPEN probe set (1), EUOS solubility (1).
Breakdown
- chem_libraries: 2
- bioactivity_benchmarks: 2
- probe_sets: 1
Individually catalogued benchmarks hosted here
- none of the individually catalogued benchmarks are cross-referenced here (typical of count-only initiatives)
Notes
EUOS solubility benchmark (on Polaris) is most ML-ready.
PKU-AIDD / ChinaDrug Benchmarks – 7 benchmarks tracked (as of 2026-05-12)
PKU AI Drug Discovery + SIMM CAS + Tsinghua + Baidu + Huawei benchmark releases.
Count methodology: PKU-AIDD + SIMM CAS + BDBench GitHub 2026-05: 7 public releases (PocketBench, ProteinInvBench, GeoMol-CN, HelixFold-Bench, UniMol-Bench, BDBench, PDBbind-China).
Breakdown
- pocket: 1
- inverse_folding: 1
- molecular_geometry: 1
- foundation_models: 3
- pdbbind: 1
Individually catalogued benchmarks hosted here
- none of the individually catalogued benchmarks are cross-referenced here (typical of count-only initiatives)
Notes
Growing Chinese benchmark ecosystem. Some self-referential flags (HelixFold on its own bench).
Kaggle – Pharma / Bio Competitions – 23 benchmarks tracked (as of 2026-05-12)
Industry-sponsored ML competitions (Merck MAC 2012, Open Problems ×3, NovoZymes, BMS, CAFA 5).
Count methodology: Kaggle search for bio/chem/pharma competitions 2010-2025: 23 distinct drug-discovery-adjacent competitions identified.
Breakdown
- molecule_activity: 5
- single_cell: 4
- protein_function: 4
- histopathology: 6
- other: 4
Individually catalogued benchmarks hosted here
- none of the individually catalogued benchmarks are cross-referenced here (typical of count-only initiatives)
Notes
Impactful one-off events; leaderboards go stale post-close.
Papers With Code – Drug Discovery – 120 benchmarks tracked (as of 2026-05-12)
Aggregates published ML benchmarks with linked code; crowd-curated.
Count methodology: paperswithcode.com/area/medical + /task search 2026-05: ~120 drug-discovery-adjacent benchmarks (DTI, generation, ADMET, structure, drug response, etc.).
Breakdown
- dti: 18
- molecule_generation: 15
- admet: 14
- protein_structure: 22
- drug_response: 11
- other: 40
Individually catalogued benchmarks hosted here
- MoleculeNet – 78.0
- USPTO-50K / USPTO-MIT (Retrosynthesis) – 78.0
- TAPE – 74.9
- PEER – 74.4
Notes
Useful for discovery; curation quality varies sharply.
PDBbind / CASF – 6 benchmarks tracked (as of 2026-05-12)
Curated experimental binding affinities for PDB complexes + CASF scoring power tests.
Count methodology: pdbbind.org.cn: 2 splits (refined + general) × 3 CASF editions (2013, 2016, 2020) = 6 configurations.
Breakdown
- pdbbind_refined_2020: 5316
- pdbbind_general_2020: 19443
- casf_editions: 3
Individually catalogued benchmarks hosted here
Notes
Known leakage; still dominant in published benchmarks. Academic-only licensing limits pharma use.
HuggingFace – Bio/Chem Datasets – 310 benchmarks tracked (as of 2026-05-12)
HuggingFace Datasets hub filtered for bio/chem benchmarks (tdc, bigbio, InstaDeep).
Count methodology: huggingface.co/datasets tag search (biology/chemistry/medical/drug-discovery) + curated orgs tdc/bigbio/InstaDeepAI 2026-05: ~310 entries, with duplication.
Breakdown
- molecular: 90
- protein: 70
- clinical_text: 80
- genomic: 40
- other: 30
Individually catalogued benchmarks hosted here
- none of the individually catalogued benchmarks are cross-referenced here (typical of count-only initiatives)
Notes
High discoverability, low quality floor.