Initiatives (Meta-platforms, Consortia, Competitions)
31 initiatives collectively track 1,749 benchmarks. Sorted by score (descending); click any header to re-sort.
| # | Initiative | Kind | Benchmarks tracked | As of | Host | Score | Links |
|---|---|---|---|---|---|---|---|
| 1 | Therapeutics Data Commons (TDC) | meta-platform | 83 | 2026-05-12 | Zitnik Lab, Harvard Medical School (+ MIT, Stanford, Georgia Tech collaborators) | 100.0 | site · gh |
| 2 | CASP (Critical Assessment of Structure Prediction) | competition | 16 | 2026-05-12 | Prediction Center, UC Davis | 100.0 | site · gh |
| 3 | ProteinGym | meta-platform | 217 | 2026-05-12 | Marks Lab (Harvard) + OATML (Oxford) + DeepMind | 97.5 | site · gh |
| 4 | ELIXIR Infrastructure | consortium | 18 | 2026-05-12 | EMBL-EBI + 23 EU member nodes | 97.5 | site · gh |
| 5 | CAMEO | competition | 4 | 2026-05-12 | Biozentrum Basel + SIB | 94.4 | site · gh |
| 6 | PoseBusters Evaluation Suite | meta-platform | 3 | 2026-05-12 | Oxford OPIG (Deane Lab) | 93.9 | site · gh |
| 7 | PLINDER / PINDER | meta-platform | 2 | 2026-05-12 | Biozentrum Basel + VantAI + Isomorphic Labs + EPFL | 93.9 | site · gh |
| 8 | Open Problems in Single-Cell Analysis | consortium | 29 | 2026-05-12 | CZI + Helmholtz Munich + Yale + HMS | 91.9 | site · gh |
| 9 | Polaris Hub | meta-platform | 48 | 2026-05-12 | Polaris consortium (Valence Labs, Recursion, Novartis, Pfizer, Merck, AstraZeneca) | 91.4 | site · gh |
| 10 | ScienceAIBench | meta-platform | 227 | 2026-05-12 | Insilico Medicine | 90.6 | site · gh |
| 11 | DREAM Challenges | competition | 74 | 2026-05-12 | Sage Bionetworks + IBM + academic partners | 89.4 | site · gh |
| 12 | MIMIC-IV / eICU | data-platform | 14 | 2026-05-12 | MIT Lab for Computational Physiology | 89.4 | site · gh |
| 13 | CZI Virtual Cell / CellxGene / VCC | consortium | 12 | 2026-05-12 | Chan Zuckerberg Initiative / CZ Biohub | 88.9 | site · gh |
| 14 | Open Reaction Database (ORD) | data-platform | 1 | 2026-05-12 | ORD consortium (Doyle, Coley, Pfizer, Merck, BASF) | 88.9 | site · gh |
| 15 | Drug Discovery Benchmarks (DDB) | meta-platform | 206 | 2026-05-12 | Insilico Medicine | 87.6 | site · gh |
| 16 | CAFA | competition | 6 | 2026-05-12 | Radivojac / Friedberg / Jiang consortium | 86.8 | site · gh |
| 17 | CAPRI | competition | 56 | 2026-05-12 | EBI + CCP4 | 86.3 | site · gh |
| 18 | FAERS / SIDER / OffSides / TWOSIDES | data-platform | 4 | 2026-05-12 | FDA CDER + Tatonetti Lab | 85.6 | site · gh |
| 19 | InsilicoBench | meta-platform | 162 | 2026-05-12 | Insilico Medicine | 84.6 | site · gh |
| 20 | FLIP | meta-platform | 15 | 2026-05-12 | Rostlab TUM + AlQuraishi Lab Columbia | 80.8 | site · gh |
| 21 | CPTAC | consortium | 10 | 2026-05-12 | NCI Office of Cancer Clinical Proteomics Research | 80.8 | site · gh |
| 22 | DeepChem | meta-platform | 40 | 2026-05-12 | DeepChem community | 80.0 | site · gh |
| 23 | MoleculeNet | meta-platform | 17 | 2026-05-12 | DeepChem community (Pande Lab alumni) | 78.0 | site · gh |
| 24 | TrialBench / HINT / TOP | meta-platform | 4 | 2026-05-12 | Fu/Sun Lab, Georgia Tech + HMS | 76.5 | site · gh |
| 25 | ClawBio Benchmarks | meta-platform | 10 | 2026-05-03 | ClawBio (open source, MIT) | 74.2 | site · gh |
| 26 | EU-OPENSCREEN / EUbOPEN | consortium | 5 | 2026-05-12 | EU-OPENSCREEN ERIC + IMI EUbOPEN | 73.9 | site · gh |
| 27 | PKU-AIDD / ChinaDrug Benchmarks | consortium | 7 | 2026-05-12 | PKU + SIMM CAS + Tsinghua + Baidu + Huawei | 73.9 | site · gh |
| 28 | Kaggle – Pharma / Bio Competitions | competition | 23 | 2026-05-12 | Google / Kaggle + sponsoring companies | 71.9 | site · gh |
| 29 | Papers With Code – Drug Discovery | meta-platform | 120 | 2026-05-12 | Meta AI / Papers With Code community | 71.6 | site · gh |
| 30 | PDBbind / CASF | meta-platform | 6 | 2026-05-12 | SIMM, Chinese Academy of Sciences | 70.4 | site · gh |
| 31 | HuggingFace – Bio/Chem Datasets | data-platform | 310 | 2026-05-12 | HuggingFace + community uploaders | 67.8 | site · gh |
Per-initiative detail
Therapeutics Data Commons (TDC) – 83 benchmarks tracked (as of 2026-05-12)
Open-science platform curating ML datasets/tasks across the drug discovery pipeline with a unified API, standard splits, and leaderboards.
Count methodology: Scraped tdcommons.ai single_pred/multi_pred/generation overview pages 2026-05-12: single-pred ~38 datasets (ADME/Tox/HTS/QM/Yields/Epitope/Develop/CRISPROutcome), multi-pred ~32 datasets (DTI/DDI/PPI/GDA/DrugRes/DrugSyn/PeptideMHC/AntibodyAff/MTI/Catalyst/TCREpitope/TrialOutcome/ProteinPeptide/PerturbOutcome/scDTI), generation ~13 (MolGen/RetroSyn/Reaction/SBDD). 8 named leaderboard groups.
Breakdown
- single_prediction: 38
- multi_prediction: 32
- generation: 13
- leaderboard_groups: 8
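As a sanity check, the per-category counts above reconcile with the headline figure (leaderboard groups are a separate grouping axis, not additional benchmarks). A minimal sketch using the counts from this entry:

```python
# Cross-check: the three dataset-category counts above should sum to the
# headline "83 benchmarks tracked" for TDC.
tdc_counts = {
    "single_prediction": 38,
    "multi_prediction": 32,
    "generation": 13,
}
total = sum(tdc_counts.values())
print(total)  # 83
```

For programmatic access, TDC itself ships one-line Python loaders (the `PyTDC` package), per its documentation.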
Individually catalogued benchmarks hosted here
- TDC ADMET Group – 100.0
- PrimeKG – 91.9
- Practical Molecular Optimization (PMO) – 88.9
- ToxCast – 85.6
- OffSides / TWOSIDES – 83.0
- DisGeNET – 81.0
- GuacaMol – 80.5
- AMES (mutagenicity) – 79.5
- USPTO-50K / USPTO-MIT (Retrosynthesis) – 78.0
- Tox21 – 77.5
- TDC DrugSyn (OncoPolyPharm + DrugComb_NCI60) – 77.0
- Obach PK Dataset – 77.0
- HINT / TrialBench – 76.5
- hERG (cardio-tox) TDC – 73.9
- DILI / LD50 Zhu – 73.9
- DUD-E – 72.9
- MOSES – 72.4
- ClinTox – 65.6
Notes
Most comprehensive ML-ready therapeutics benchmark hub. NeurIPS 2021 + Nat Chem Bio 2022.
CASP (Critical Assessment of Structure Prediction) – 16 benchmarks tracked (as of 2026-05-12)
Biennial blind evaluation of protein structure prediction; drove AlphaFold's validation.
Count methodology: predictioncenter.org archives: CASP1 (1994) through CASP16 (2024) = 16 editions; ~100 targets × ~5 categories per edition.
Breakdown
- editions: 16
- categories_per_edition: 5
- targets_avg: 100
Individually catalogued benchmarks hosted here
Notes
Historical gold standard for blind evaluation. CASP15 (2022) added RNA and protein-ligand categories; CASP16 expanded multimer assessment.
ProteinGym – 217 benchmarks tracked (as of 2026-05-12)
Large-scale benchmark for protein fitness prediction from DMS + clinical variant effects.
Count methodology: ProteinGym v1.2 README + NeurIPS 2023 paper: 217 DMS substitution assays + 66 indel assays + 2525 ClinVar clinical variants.
Breakdown
- dms_substitutions: 217
- dms_indels: 66
- mutations: 2700000
- clinical_variants: 2525
Individually catalogued benchmarks hosted here
- ProteinGym – 97.5
Notes
De facto standard for variant effect prediction. The clinical track enables fair comparison of ESM, EVE, and AlphaMissense.
ELIXIR Infrastructure – 18 benchmarks tracked (as of 2026-05-12)
European life-science data infrastructure hosting benchmark-relevant resources (UniProt, Ensembl, ChEMBL, PDBe, IntAct).
Count methodology: ELIXIR Core Data Resources list 2026-05: ~18 resources with benchmark/leaderboard components.
Breakdown
- core_data_resources: 18
- member_nodes: 23
Individually catalogued benchmarks hosted here
- Open Targets Platform – 100.0
- ChEMBL – 97.5
- STRING – 94.9
Notes
Meta-resource of meta-resources.
CAMEO – 4 benchmarks tracked (as of 2026-05-12)
Continuous weekly blind eval of protein 3D / multimer / ligand prediction using pre-release PDB structures.
Count methodology: cameo3d.org 2026-05: 4 active categories (3D monomer, 3D multimer, model quality, ligand pocket); ~1000 targets/year.
Breakdown
- monomer: 1
- multimer: 1
- quality_estimation: 1
- ligand: 1
Individually catalogued benchmarks hosted here
- CAMEO weekly targets – 94.4
Notes
Excellent continuous cadence complementing CASP.
PoseBusters Evaluation Suite – 3 benchmarks tracked (as of 2026-05-12)
Physics-aware validation of docking/co-folding poses; 19 checks + curated test sets.
Count methodology: GitHub README: PoseBusters v1 (308 complexes), v2 (428), Astex Diverse Set (85) = 3 canonical suites.
Breakdown
- test_sets: 3
- validation_checks: 19
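To illustrate the flavor of these validation checks, here is a toy, stdlib-only analogue of a single geometry check (bond-length plausibility). The function name and thresholds are illustrative assumptions, not PoseBusters' actual implementation:

```python
# Toy analogue of a physics-aware pose check: flag bonds whose lengths fall
# outside a plausible covalent range. Real PoseBusters runs 19 such checks
# (geometry, energy, identity) per pose; this is only a sketch.
PLAUSIBLE_BOND_ANGSTROM = (0.9, 2.0)  # illustrative thresholds

def implausible_bonds(bond_lengths):
    """Return indices of bond lengths outside the plausible range."""
    lo, hi = PLAUSIBLE_BOND_ANGSTROM
    return [i for i, d in enumerate(bond_lengths) if not lo <= d <= hi]

print(implausible_bonds([1.09, 1.54, 3.2, 0.4]))  # [2, 3]
```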
Individually catalogued benchmarks hosted here
- PoseBusters – 97.0
Notes
Changed pose-prediction evaluation norms; now a default filter in pharma pipelines.
PLINDER / PINDER – 2 benchmarks tracked (as of 2026-05-12)
Leakage-controlled protein-ligand (PLINDER) and protein-protein (PINDER) docking datasets.
Count methodology: plinder.sh + pinder.sh: 2 major benchmarks (PLINDER 460k systems, PINDER 267k systems).
Breakdown
- plinder_systems: 460000
- pinder_systems: 267498
Individually catalogued benchmarks hosted here
Notes
Replacing PDBbind/CASF for modern docking ML eval.
Open Problems in Single-Cell Analysis – 29 benchmarks tracked (as of 2026-05-12)
Community benchmark suite for single-cell analysis with reproducible Viash/Nextflow pipelines and NeurIPS tracks.
Count methodology: openproblems.bio task registry + Luecken et al. Nat Biotech 2025: 29 benchmark tasks (batch integration, denoising, dim-reduction, label projection, perturbation, spatial, multimodal).
Breakdown
- batch_integration: 3
- perturbation: 4
- multimodal: 6
- label_transfer: 4
- spatial: 5
- other: 7
Individually catalogued benchmarks hosted here
Notes
Gold-standard single-cell benchmarking rigor; Nat Biotech 2025.
Polaris Hub – 48 benchmarks tracked (as of 2026-05-12)
Industry-curated small-molecule benchmarks with working groups on method-comparison standards.
Count methodology: polarishub.io/benchmarks public listing 2026-05: ~48 public benchmarks across Recursion, Valence, Novartis, AstraZeneca, Polaris Small Molecule Steering Committee orgs.
Breakdown
- public_benchmarks: 48
- datasets: 60
- competitions: 4
Individually catalogued benchmarks hosted here
- Polaris ADMET – 88.4
- Polaris Biologics (Polyreactivity / SEC / Tm) – 79.0
Notes
Industry-led counterweight to academic benchmarks. Strong on method-comparison rigor.
ScienceAIBench – 227 benchmarks tracked (as of 2026-05-12)
Insilico Medicine's public scientific-AI benchmark portal. Spans biology (longevity, target ID), affinity/binding, ADMET, clinical trials, biologics, materials; leaderboards benchmark frontier LLMs (GPT-5.x, Claude Opus/Sonnet 4.x, Gemini 3, Grok 4.1, DeepSeek v3.2, Kimi K2.x).
Count methodology: Fetched https://scienceaibench.insilico.com/api/benchmarks on 2026-05-12; meta.totalBenchmarks=227 across 7 taxonomy categories × 17 suites. Leaderboard submitters are external frontier LLMs (top entries: Grok 4.1, GPT 5.1/5.2, Claude Opus 4.5/4.6, Gemini 3 Flash, DeepSeek v3.2, Kimi K2.5). Not self-referential: Insilico's own models are not on the leaderboards.
Breakdown
- Biology (TargetBench + Longevity): 29
- Affinity and Binding: 94
- Chemical Synthesis (Retrosynthesis): 2
- ADMET, PK & Safety: 50
- Clinical Trials (ClinBench Quarterly): 25
- Biologics: 6
- Materials (MatBench + others): 21
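The methodology count can be reproduced by summing per-category counts and comparing against meta.totalBenchmarks. A hedged sketch with dummy data, since the real payload shape is an assumption inferred from the methodology note:

```python
import json

# Assumed payload shape (meta.totalBenchmarks + a categories list); the
# category counts mirror the breakdown above.
payload = json.loads("""{
  "meta": {"totalBenchmarks": 227},
  "categories": [
    {"name": "Biology", "count": 29},
    {"name": "Affinity and Binding", "count": 94},
    {"name": "Chemical Synthesis", "count": 2},
    {"name": "ADMET, PK & Safety", "count": 50},
    {"name": "Clinical Trials", "count": 25},
    {"name": "Biologics", "count": 6},
    {"name": "Materials", "count": 21}
  ]
}""")

by_category = sum(c["count"] for c in payload["categories"])
assert by_category == payload["meta"]["totalBenchmarks"]
print(by_category)  # 227
```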
Individually catalogued benchmarks hosted here
- TDC ADMET Group โ 100.0
- Longevity Benchmark (Insilico) โ 90.6
- ISM Benchmarks: GPCRs (Insilico) โ 87.6
- TargetBench (Insilico) โ 84.6
- ISM Benchmarks: ADMET (Insilico) โ 84.6
- MatBench โ 83.3
- ClinBench Quarterly (Insilico) โ 81.5
- Polaris Biologics (Polyreactivity / SEC / Tm) โ 79.0
- hERG (cardio-tox) TDC โ 73.9
Notes
Largest of the three Insilico portals. Live leaderboards are regenerated against external frontier LLMs, so NOT flagged self-referential. Uniquely strong longevity/aging benchmark slice, which moves it up the aging-relevance ranking.
DREAM Challenges – 74 benchmarks tracked (as of 2026-05-12)
Long-running crowd-sourced biomedical prediction challenges, many pharma-sponsored.
Count methodology: dreamchallenges.org/closed-challenges + /active as of 2026-05: 74 completed/active challenges; ~38 drug-discovery-relevant.
Breakdown
- drug_sensitivity: 9
- target_prediction: 6
- toxicity: 5
- disease_subtyping: 12
- other_biomed: 42
Individually catalogued benchmarks hosted here
- none of the individually catalogued benchmarks are cross-referenced here (typical of count-only initiatives)
Notes
Historical impact on field norms. Cadence has slowed 2022+.
MIMIC-IV / eICU – 14 benchmarks tracked (as of 2026-05-12)
ICU EHR datasets used for clinical outcome, adverse-event, and PK/PD benchmarks.
Count methodology: PhysioNet + BigBio MIMIC-IV benchmarks 2026-05: 14 derived benchmarks (mortality, LOS, readmission, sepsis, AKI, drug dosing, phenotyping).
Breakdown
- outcome_prediction: 6
- drug_dosing: 3
- adverse_event: 3
- phenotyping: 2
Individually catalogued benchmarks hosted here
- MIMIC-IV Benchmark Tasks – 89.4
Notes
Canonical for clinical ML. US-centric.
CZI Virtual Cell / CellxGene / VCC – 12 benchmarks tracked (as of 2026-05-12)
Umbrella for CZI-funded virtual-cell benchmark initiatives: CellxGene, Virtual Cell Challenge, Tabula atlases.
Count methodology: chanzuckerberg.com/science 2026-05: Virtual Cell Challenge (4 tracks), CellxGene Census benchmarks (4), Tabula Sapiens-derived eval suites (4).
Breakdown
- virtual_cell_challenge: 4
- cellxgene: 4
- tabula: 4
Individually catalogued benchmarks hosted here
- Open Problems: Perturbation Prediction – 91.9
- CZ Virtual Cell Challenge – 88.1
Notes
VCC is becoming the canonical virtual-cell benchmark.
Open Reaction Database (ORD) – 1 benchmark tracked (as of 2026-05-12)
Open reaction repository in a schema-validated format; enables reaction / yield / retrosynthesis benchmarks.
Count methodology: open-reaction-database.org 2026-05: ~2.1M reactions as single versioned benchmark corpus.
Breakdown
- reactions: 2100000
- contributing_orgs: 30
Individually catalogued benchmarks hosted here
- ORD Reaction Benchmark – 93.9
Notes
Biggest open reaction corpus; industry donations accelerating.
Drug Discovery Benchmarks (DDB) – 206 benchmarks tracked (as of 2026-05-12)
Insilico's drug-discovery-specific benchmark portal: TargetBench, Longevity Benchmark, GPCR affinity, PDBbind-style tasks, ISM ADMET, TDC ADMET mirror, ClinBench, biologics.
Count methodology: Fetched https://ddb.insilico.com/api/benchmarks on 2026-05-12; meta.totalBenchmarks=206 across 6 categories × 15 suites.
Breakdown
- Biology (TargetBench + Longevity): 29
- Affinity and Binding: 94
- Chemical Synthesis: 2
- ADMET, PK & Safety: 50
- Clinical Trials: 25
- Biologics: 6
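This breakdown is ScienceAIBench's minus its Materials slice (227 - 21 = 206). A quick consistency check, using only counts from this catalog:

```python
# DDB's category breakdown equals ScienceAIBench's without Materials,
# so the totals should differ by exactly the Materials count (21).
scienceaibench = {
    "Biology": 29, "Affinity and Binding": 94, "Chemical Synthesis": 2,
    "ADMET, PK & Safety": 50, "Clinical Trials": 25, "Biologics": 6,
    "Materials": 21,
}
ddb = {k: v for k, v in scienceaibench.items() if k != "Materials"}
print(sum(ddb.values()))  # 206
```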
Individually catalogued benchmarks hosted here
- TDC ADMET Group – 100.0
- Longevity Benchmark (Insilico) – 90.6
- ISM Benchmarks: GPCRs (Insilico) – 87.6
- TargetBench (Insilico) – 84.6
- ISM Benchmarks: ADMET (Insilico) – 84.6
- MatBench – 83.3
- ClinBench Quarterly (Insilico) – 81.5
- AMES (mutagenicity) – 79.5
- Polaris Biologics (Polyreactivity / SEC / Tm) – 79.0
- Obach PK Dataset – 77.0
- hERG (cardio-tox) TDC – 73.9
- DILI / LD50 Zhu – 73.9
Notes
Drug-discovery focused cut. Includes a mirror of TDC ADMET for cross-platform comparability.
CAFA – 6 benchmarks tracked (as of 2026-05-12)
Blind eval of protein function prediction against time-delayed UniProt-GOA.
Count methodology: biofunctionprediction.org archives: CAFA1–5 (2010–2023) + CAFA6 announced 2025 = 6 editions.
Breakdown
- editions: 6
- cafa5_targets: 142000
Individually catalogued benchmarks hosted here
- CAFA5 – 84.3
Notes
CAFA5 (Kaggle, 2023) drew 1625 teams.
CAPRI – 56 benchmarks tracked (as of 2026-05-12)
Blind prediction of protein-protein complexes, protein-peptide, and protein-ligand assemblies.
Count methodology: EBI CAPRI archive: Round 1 (2001) through Round 56 (2024).
Breakdown
- rounds: 56
- targets_total_approx: 300
Individually catalogued benchmarks hosted here
- CAPRI Rounds – 86.3
Notes
Oldest PPI prediction benchmark.
FAERS / SIDER / OffSides / TWOSIDES – 4 benchmarks tracked (as of 2026-05-12)
FDA adverse event reports + SIDER/OffSides/TWOSIDES derivatives for post-market signal detection.
Count methodology: FAERS (19M+ reports) + 3 derived benchmarks (SIDER, OffSides, TWOSIDES) = 4.
Breakdown
- faers_reports: 19000000
- sider_pairs: 139000
- offsides_signals: 438000
- twosides_combo: 870000
Individually catalogued benchmarks hosted here
- FAERS (raw) – 91.1
- OffSides / TWOSIDES – 83.0
- SIDER – 74.9
Notes
Essential for pharmacovigilance ML. Known reporting biases.
InsilicoBench – 162 benchmarks tracked (as of 2026-05-12)
Compact cut of the Insilico benchmark stack focused on biology (longevity), GPCR affinity, retrosynthesis, ADMET, and clinical trials.
Count methodology: Fetched https://insilicobench.insilico.com/api/benchmarks on 2026-05-12; meta.totalBenchmarks=162 across 5 categories (Biology 19, Affinity/Binding 88, Chemical Synthesis 2, ADMET 28, Clinical Trials 25).
Breakdown
- Biology (Longevity): 19
- Affinity and Binding: 88
- Chemical Synthesis: 2
- ADMET, PK & Safety: 28
- Clinical Trials: 25
Individually catalogued benchmarks hosted here
- Longevity Benchmark (Insilico) – 90.6
- ISM Benchmarks: GPCRs (Insilico) – 87.6
- ISM Benchmarks: ADMET (Insilico) – 84.6
- ClinBench Quarterly (Insilico) – 81.5
Notes
Curated subset of ScienceAIBench. Same leaderboard model pool, so likewise NOT self-referential.
FLIP – 15 benchmarks tracked (as of 2026-05-12)
Protein fitness benchmarks focused on realistic train/test splits (AAV, GB1, Meltome, SCL, Bind).
Count methodology: FLIP README: 5 landscapes × 3 splits = 15 benchmarks.
Breakdown
- landscapes: 5
- splits_per_landscape: 3
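The 15 benchmarks form a landscape-by-split grid. A sketch of enumerating it; the split names below are placeholders, since FLIP's actual split names vary per landscape:

```python
from itertools import product

# Enumerate the 5 landscapes x 3 splits = 15 FLIP benchmarks. Landscape
# names come from the summary above; split names are placeholders.
landscapes = ["AAV", "GB1", "Meltome", "SCL", "Bind"]
splits = ["split_1", "split_2", "split_3"]  # placeholder names

benchmarks = [f"{land}/{s}" for land, s in product(landscapes, splits)]
print(len(benchmarks))  # 15
```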
Individually catalogued benchmarks hosted here
- FLIP – 80.8
Notes
Complementary to ProteinGym (smaller but careful splits).
CPTAC – 10 benchmarks tracked (as of 2026-05-12)
Integrated proteogenomic datasets across 10 tumor types; hosts DREAM proteogenomic benchmarks.
Count methodology: CPTAC data portal 2026-05: 10 tumor types with full proteogenomic characterization (BR, CO, EN, GBM, HNSCC, LSCC, LUAD, OV, PDAC, CCRCC).
Breakdown
- tumor_types: 10
- samples: 1600
- omics_layers: 6
Individually catalogued benchmarks hosted here
- CPTAC Proteogenomic Benchmarks – 80.8
Notes
Deep but narrow (oncology).
DeepChem – 40 benchmarks tracked (as of 2026-05-12)
OSS library bundling molecular ML benchmark datasets and baselines; hosts MoleculeNet.
Count methodology: deepchem.molnet module listing: ~40 packaged datasets (MoleculeNet core + extensions).
Breakdown
- moleculenet_core: 17
- adme_tox_extensions: 10
- protein: 5
- materials: 4
- misc: 4
Individually catalogued benchmarks hosted here
- MoleculeNet – 78.0
Notes
Excellent reproducibility: one-liner dataset loaders.
MoleculeNet – 17 benchmarks tracked (as of 2026-05-12)
Benchmark suite covering quantum, physical, biophysical, physiological molecular ML tasks.
Count methodology: Wu et al. 2018 Chem Sci + DeepChem repo enumeration: QM7/QM7b/QM8/QM9, ESOL, FreeSolv, Lipophilicity, PCBA, MUV, HIV, BACE, BBBP, Tox21, ToxCast, SIDER, ClinTox, PDBbind (17).
Breakdown
- quantum: 4
- physical_chem: 3
- biophysics: 4
- physiology: 6
Individually catalogued benchmarks hosted here
- GuacaMol – 80.5
- MoleculeNet – 78.0
- Tox21 – 77.5
- MOSES – 72.4
- ClinTox – 65.6
Notes
Historically foundational; many splits have documented leakage. Community has largely moved to TDC / Polaris for new work.
TrialBench / HINT / TOP – 4 benchmarks tracked (as of 2026-05-12)
Suite of benchmarks for clinical trial outcome prediction.
Count methodology: Fu et al. 2022-2024: HINT (17k trials), TOP (17k), TrialBench (21k trials, 12k drugs), CT-Outcome.
Breakdown
- hint_trials: 17000
- top_trials: 17000
- trialbench_trials: 21000
- task_variants: 12
Individually catalogued benchmarks hosted here
- HINT / TrialBench – 76.5
- Trial Outcome Prediction (TOP) – 76.5
- CT-Outcome (TrialBench v2) – 73.4
Notes
First rigorous ML benchmarks for trial outcome prediction. Limited by ClinicalTrials.gov data quality.
ClawBio Benchmarks – 10 benchmarks tracked (as of 2026-05-03)
Public scientific-correctness leaderboard for bio-analysis skills. An independent third-party benchmark (clawbio_bench, authored by Biostochastics LLC) tests ClawBio skills on safety, correctness, and honesty, with a public failure surface and remediation tasks.
Count methodology: Scraped https://clawbio.ai/benchmarks.html on 2026-05-12; last bench run 2026-05-03 against ClawBio commit 7820473 using clawbio_bench v0.1.5. 10 skills audited: claw-metagenomics, equity-scorer, nutrigx-advisor, bio-orchestrator, pharmgx-reporter, fine-mapping, clinical-variant-reporter, cvr-acmg-correctness, gwas-prs, cvr-variant-identity. 168/182 tests passing (92.3%).
Breakdown
- skills_audited: 10
- tests_total: 182
- tests_passing: 168
- pass_rate_pct: 92.3
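The reported pass rate follows directly from the raw counts above:

```python
# Reproduce the reported pass rate (percent, one decimal) from the
# raw test counts in the breakdown.
tests_total, tests_passing = 182, 168
pass_rate = round(100 * tests_passing / tests_total, 1)
print(pass_rate)  # 92.3
```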
Individually catalogued benchmarks hosted here
- ClawBio Skill Correctness Bench – 74.2
Notes
Independent third-party bench in a separate repo, hence structurally NOT self-referential. Coverage is narrow (bio-analysis skills) but rigor is exemplary (tri-dimensional: safety × correctness × honesty). A model for how skill/agent correctness should be audited.
EU-OPENSCREEN / EUbOPEN – 5 benchmarks tracked (as of 2026-05-12)
EU chemical biology ERIC compound libraries + EUbOPEN chemogenomic probes.
Count methodology: eu-openscreen.eu + eubopen.org 2026-05: ECBD (1), Bioactivity sets (2), EUbOPEN probe set (1), EUOS solubility (1).
Breakdown
- chem_libraries: 2
- bioactivity_benchmarks: 2
- probe_sets: 1
Individually catalogued benchmarks hosted here
- none of the individually catalogued benchmarks are cross-referenced here (typical of count-only initiatives)
Notes
EUOS solubility benchmark (on Polaris) is most ML-ready.
PKU-AIDD / ChinaDrug Benchmarks – 7 benchmarks tracked (as of 2026-05-12)
PKU AI Drug Discovery + SIMM CAS + Tsinghua + Baidu + Huawei benchmark releases.
Count methodology: PKU-AIDD + SIMM CAS + BDBench GitHub 2026-05: 7 public releases (PocketBench, ProteinInvBench, GeoMol-CN, HelixFold-Bench, UniMol-Bench, BDBench, PDBbind-China).
Breakdown
- pocket: 1
- inverse_folding: 1
- molecular_geometry: 1
- foundation_models: 3
- pdbbind: 1
Individually catalogued benchmarks hosted here
- none of the individually catalogued benchmarks are cross-referenced here (typical of count-only initiatives)
Notes
Growing Chinese benchmark ecosystem. Some self-referential flags (HelixFold on its own bench).
Kaggle – Pharma / Bio Competitions – 23 benchmarks tracked (as of 2026-05-12)
Industry-sponsored ML competitions (Merck MAC 2012, Open Problems ×3, NovoZymes, BMS, CAFA 5).
Count methodology: Kaggle search for bio/chem/pharma competitions 2010-2025: 23 distinct drug-discovery-adjacent competitions identified.
Breakdown
- molecule_activity: 5
- single_cell: 4
- protein_function: 4
- histopathology: 6
- other: 4
Individually catalogued benchmarks hosted here
- none of the individually catalogued benchmarks are cross-referenced here (typical of count-only initiatives)
Notes
Impactful one-off events; leaderboards go stale post-close.
Papers With Code – Drug Discovery – 120 benchmarks tracked (as of 2026-05-12)
Aggregates published ML benchmarks with linked code; crowd-curated.
Count methodology: paperswithcode.com/area/medical + /task search 2026-05: ~120 drug-discovery-adjacent benchmarks (DTI, generation, ADMET, structure, drug response, etc.).
Breakdown
- dti: 18
- molecule_generation: 15
- admet: 14
- protein_structure: 22
- drug_response: 11
- other: 40
Individually catalogued benchmarks hosted here
- MoleculeNet – 78.0
- USPTO-50K / USPTO-MIT (Retrosynthesis) – 78.0
- TAPE – 74.9
- PEER – 74.4
Notes
Useful for discovery; curation quality varies sharply.
PDBbind / CASF – 6 benchmarks tracked (as of 2026-05-12)
Curated experimental binding affinities for PDB complexes + CASF scoring power tests.
Count methodology: pdbbind.org.cn: 2 splits (refined + general) × 3 CASF editions (2013, 2016, 2020) = 6 configurations.
Breakdown
- pdbbind_refined_2020: 5316
- pdbbind_general_2020: 19443
- casf_editions: 3
Individually catalogued benchmarks hosted here
Notes
Known leakage; still dominant in published benchmarks. Academic-only licensing limits pharma use.
HuggingFace – Bio/Chem Datasets – 310 benchmarks tracked (as of 2026-05-12)
HuggingFace Datasets hub filtered for bio/chem benchmarks (tdc, bigbio, InstaDeep).
Count methodology: huggingface.co/datasets tag search (biology/chemistry/medical/drug-discovery) + curated orgs tdc/bigbio/InstaDeepAI 2026-05: ~310 entries, with duplication.
Breakdown
- molecular: 90
- protein: 70
- clinical_text: 80
- genomic: 40
- other: 30
Individually catalogued benchmarks hosted here
- none of the individually catalogued benchmarks are cross-referenced here (typical of count-only initiatives)
Notes
High discoverability, low quality floor.