Reminder that at the end of October we are planning to release a blogpost introducing our new medical LM evaluation benchmark (name pending), with results from both proprietary and open models. The benchmark will use our verifiers environments to run the evaluations, with a new script/minimal library to tie everything together.
This week we want to finish all of our dataset environments so we can begin benchmarking models next week.
We’d like to release our blogpost and benchmark at the end of October, during the week of the 27th.
Tentative timeline:
Thanks to everyone who submitted PRs, we have the following datasets completed: