Methodology
BioBenchmarks follows the drug-discovery-benchmark-eval skill
(stored at ~/.openclaw/workspace/skills/drug-discovery-benchmark-eval/SKILL.md).
This page mirrors the key elements.
1. Pipeline Taxonomy – 12 canonical stages
- Virtual Cell – cell-state foundation models, perturbation prediction
- Disease Modeling – disease signatures, mechanism maps
- Target ID – target-disease association, essentiality, druggability
- Hit ID – virtual screening, docking, bioactivity
- Lead ID / ADMET – property prediction (absorption, distribution, metabolism, excretion, toxicity)
- Developmental Candidate – multi-parameter optimization, DMPK integration
- IND-enabling – safety, tox, PK projection
- Phase I – human PK/PD, dose prediction
- Phase II – efficacy prediction, biomarker qualification
- Phase III – outcome prediction, endpoint modeling
- Clinical Development (cross-phase) – trial design, patient stratification
- Post-market / RWE – adverse events, signal detection
2. Benchmark Rubric (7+ criteria, scored 1–5 each)
- Scientific rigor – peer review, reproducibility, controls
- Coverage – task breadth + data volume
- Active maintenance – cadence of updates
- Community adoption – citations, stars, leaderboard entries
- Data quality – curation, QC, known-issue tracking
- Accessibility – license + install experience
- Industry relevance – pharma-validated translational signal
Composite = weighted mean with rigor weighted 1.5×, coverage 1.2×, adoption 1.2×, and all other criteria 1.0×, normalized to 0–100.
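The composite above can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: the criterion keys and the exact mapping from the 1–5 scale to 0–100 (all 1s → 0, all 5s → 100) are assumptions.

```python
# Assumed criterion keys and weights per the rubric above.
WEIGHTS = {
    "rigor": 1.5,
    "coverage": 1.2,
    "adoption": 1.2,
    "maintenance": 1.0,
    "data_quality": 1.0,
    "accessibility": 1.0,
    "industry_relevance": 1.0,
}

def composite_score(scores: dict) -> float:
    """Weighted mean of 1-5 criterion scores, rescaled so that
    all-1s maps to 0 and all-5s maps to 100 (assumed normalization)."""
    total_weight = sum(WEIGHTS.values())
    weighted_mean = sum(WEIGHTS[k] * scores[k] for k in WEIGHTS) / total_weight
    return (weighted_mean - 1) / 4 * 100

print(round(composite_score({k: 5 for k in WEIGHTS}), 1))  # 100.0
```

Because rigor carries a 1.5× weight, a benchmark weak on rigor loses more composite points than one equally weak on, say, maintenance.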
3. Expert Rubric
- Benchmarks authored · Benchmark citations · Scope · Community role · Recency · Rigor flags
4. Group Rubric
- Output volume · Quality (median rubric score of their benchmarks) · Breadth · Openness · Industry uptake · Longevity · Translational signal
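The Quality criterion aggregates by median rather than mean, so one outlier benchmark (very strong or very weak) does not skew a group's score. A minimal sketch, with hypothetical inputs:

```python
from statistics import median

def group_quality(benchmark_scores: list) -> float:
    """Group Quality = median composite rubric score across the
    benchmarks the group maintains (illustrative helper)."""
    return median(benchmark_scores)

# Hypothetical composite scores for three benchmarks from one group.
print(group_quality([72.0, 88.0, 65.0]))  # 72.0
```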
5. Anti-gaming rules
- Benchmarks maintained by the same group whose model dominates the leaderboard are flagged self_referential.
- Commercial-only license → license-gated-commercial flag, reduced accessibility score.
- Documented leakage → data-leakage-known flag, reduced data-quality score.
- Deprecated benchmarks are retained with a deprecated-recommend-replace flag – never silently dropped.
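The four rules above amount to a simple flagging pass over each benchmark record. The sketch below is an assumption about how such a pass could look; the field names ("maintainer", "top_model_group", "known_leakage") and the cap-at-2 penalty are illustrative, not the pipeline's actual schema.

```python
def anti_gaming_flags(b: dict) -> list:
    """Apply the anti-gaming rules to one benchmark record (sketch).
    Mutates capped scores in place and returns the flags raised."""
    flags = []
    # Rule 1: same group maintains the benchmark and the dominant model.
    if b.get("maintainer") and b.get("maintainer") == b.get("top_model_group"):
        flags.append("self_referential")
    # Rule 2: commercial-only license reduces accessibility (assumed cap).
    if b.get("license") == "commercial-only":
        flags.append("license-gated-commercial")
        b["accessibility"] = min(b.get("accessibility", 5), 2)
    # Rule 3: documented leakage reduces data quality (assumed cap).
    if b.get("known_leakage"):
        flags.append("data-leakage-known")
        b["data_quality"] = min(b.get("data_quality", 5), 2)
    # Rule 4: deprecated benchmarks are flagged, never silently dropped.
    if b.get("deprecated"):
        flags.append("deprecated-recommend-replace")
    return flags
```

Keeping deprecated entries flagged rather than deleted preserves the historical record and points readers to a replacement.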
6. Anti-patterns avoided
- No grade inflation – not every benchmark scores a 4 or 5; scoring is genuinely differentiated.
- No fabricated numbers → N/A – <reason> rather than guesses.
- No US-only bias → Chinese, European, Indian, and Canadian benchmarks are represented.
- No small-molecule-only bias → biologics, cell therapies, and clinical/RWE benchmarks are tracked equally.
- No academic-only bias → industry-produced benchmarks (Polaris, Insilico portals, Genentech PerturbBench) rank alongside academic ones.
Full skill file
The canonical methodology is the SKILL.md at:
/Users/azhkclaw/.openclaw/workspace/skills/drug-discovery-benchmark-eval/SKILL.md