October 8, 2025 MedLM Weekly

MedARC Evals

Reminder that at the end of October we are planning to release a blogpost introducing our new medical LM evaluation benchmark (name pending), with results from both proprietary and open models. The benchmark will use our verifiers environments to run the evaluations, with a new script/minimal library to tie everything together.

This week we want to finish all of our dataset environments so we can begin benchmarking models next week.

Timeline

We’d like to release our blogpost and benchmark at the end of October during the week of the 27th.

Tentative timeline:

This week: finalize our datasets
Third week of October: run the evals, debug any potential issues, and start writing the blogpost
Early last week of October: finish our writeup of the results

Status

Thanks to everyone who submitted PRs, we have the following datasets completed:

M-ARC - Max Kieffer
HealthBench - Sameed Khan
MMLU-Pro Health - Max Kieffer
MedQA - Ahmed Essouaied
MetaMedQA - Ahmed Essouaied
PubMedQA - rscgh
MedBullets - Max Kieffer
MedCaseReasoning - Tanishq