Methodology
BioBenchmarks follows the drug-discovery-benchmark-eval skill
(stored at ~/.openclaw/workspace/skills/drug-discovery-benchmark-eval/SKILL.md).
This page mirrors the key elements.
1. Pipeline Taxonomy – 12 canonical stages
- Virtual Cell – cell-state foundation models, perturbation prediction
- Disease Modeling – disease signatures, mechanism maps
- Target ID – target-disease association, essentiality, druggability
- Hit ID – virtual screening, docking, bioactivity
- Lead ID / ADMET – property prediction (absorption, distribution, metabolism, excretion, toxicity)
- Developmental Candidate – multi-parameter optimization, DMPK integration
- IND-enabling – safety, tox, PK projection
- Phase I – human PK/PD, dose prediction
- Phase II – efficacy prediction, biomarker qualification
- Phase III – outcome prediction, endpoint modeling
- Clinical Development (cross-phase) – trial design, patient stratification
- Post-market / RWE – adverse events, signal detection
2. Benchmark Rubric (7+ criteria, scored 1–5 each)
- Scientific rigor – peer review, reproducibility, controls
- Coverage – task breadth + data volume
- Active maintenance – cadence of updates
- Community adoption – citations, stars, leaderboard entries
- Data quality – curation, QC, known-issue tracking
- Accessibility – license + install experience
- Industry relevance – pharma-validated translational signal
Composite = weighted mean with rigor weighted 1.5×, coverage 1.2×, adoption 1.2×, and all other criteria 1.0×, normalized to 0–100.
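The composite above can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: the criterion keys and the exact mapping from the 1–5 scale to 0–100 (all 1s → 0, all 5s → 100) are assumptions.

```python
# Assumed criterion keys and weights per the rubric above.
WEIGHTS = {
    "rigor": 1.5,
    "coverage": 1.2,
    "adoption": 1.2,
    "maintenance": 1.0,
    "data_quality": 1.0,
    "accessibility": 1.0,
    "industry_relevance": 1.0,
}

def composite_score(scores: dict) -> float:
    """Weighted mean of 1-5 criterion scores, rescaled so that
    all-1s maps to 0 and all-5s maps to 100 (assumed normalization)."""
    total_weight = sum(WEIGHTS.values())
    weighted_mean = sum(WEIGHTS[k] * scores[k] for k in WEIGHTS) / total_weight
    return (weighted_mean - 1) / 4 * 100

print(round(composite_score({k: 5 for k in WEIGHTS}), 1))  # 100.0
```

Because rigor carries a 1.5× weight, a benchmark weak on rigor loses more composite points than one equally weak on, say, maintenance.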
3. Expert Rubric
- Benchmarks authored · Benchmark citations · Scope · Community role · Recency · Rigor flags
4. Group Rubric
- Output volume · Quality (median rubric score of their benchmarks) · Breadth · Openness · Industry uptake · Longevity · Translational signal
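The Quality criterion aggregates by median rather than mean, so one outlier benchmark (very strong or very weak) does not skew a group's score. A minimal sketch, with hypothetical inputs:

```python
from statistics import median

def group_quality(benchmark_scores: list) -> float:
    """Group Quality = median composite rubric score across the
    benchmarks the group maintains (illustrative helper)."""
    return median(benchmark_scores)

# Hypothetical composite scores for three benchmarks from one group.
print(group_quality([72.0, 88.0, 65.0]))  # 72.0
```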
5. Anti-gaming rules
- Benchmarks maintained by the same group whose model dominates the leaderboard are flagged self_referential.
- Commercial-only license → license-gated-commercial flag, reduced accessibility score.
- Documented leakage → data-leakage-known flag, reduced data-quality score.
- Deprecated benchmarks are retained with a deprecated-recommend-replace flag – never silently dropped.
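The four rules above amount to a simple flagging pass over each benchmark record. The sketch below is an assumption about how such a pass could look; the field names ("maintainer", "top_model_group", "known_leakage") and the cap-at-2 penalty are illustrative, not the pipeline's actual schema.

```python
def anti_gaming_flags(b: dict) -> list:
    """Apply the anti-gaming rules to one benchmark record (sketch).
    Mutates capped scores in place and returns the flags raised."""
    flags = []
    # Rule 1: same group maintains the benchmark and the dominant model.
    if b.get("maintainer") and b.get("maintainer") == b.get("top_model_group"):
        flags.append("self_referential")
    # Rule 2: commercial-only license reduces accessibility (assumed cap).
    if b.get("license") == "commercial-only":
        flags.append("license-gated-commercial")
        b["accessibility"] = min(b.get("accessibility", 5), 2)
    # Rule 3: documented leakage reduces data quality (assumed cap).
    if b.get("known_leakage"):
        flags.append("data-leakage-known")
        b["data_quality"] = min(b.get("data_quality", 5), 2)
    # Rule 4: deprecated benchmarks are flagged, never silently dropped.
    if b.get("deprecated"):
        flags.append("deprecated-recommend-replace")
    return flags
```

Keeping deprecated entries flagged rather than deleted preserves the historical record and points readers to a replacement.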
6. Anti-patterns avoided
- No grade inflation – not every benchmark scores a 4 or 5; scoring is genuinely differentiated.
- No fabricated numbers → N/A – <reason> rather than guesses.
- No US-only bias → Chinese, European, Indian, and Canadian benchmarks are represented.
- No small-molecule-only bias → biologics, cell therapies, and clinical/RWE benchmarks are tracked equally.
- No academic-only bias → industry-produced benchmarks (Polaris, Insilico portals, Genentech PerturbBench) rank alongside academic ones.
Full skill file
The canonical methodology is the SKILL.md at:
/Users/azhkclaw/.openclaw/workspace/skills/drug-discovery-benchmark-eval/SKILL.md