Reminder: Our goal is to train an open source model

Baichuan-M2’s training system
Just a quick reminder before we start: our end goal for this project is to train an open source medical reasoning model, probably using GPT-OSS 20b as the base. We are working our way right to left on this chart, starting with evaluations and RL environments. After that we will perform RL with our environments, work on SFT data, and then work on mid-training data. (Remember this is a research project, so plans might change as we learn along the way.)
MedARC Evals*
This month we want to finalize our evaluation environments and use them to benchmark the medical knowledge of open and proprietary models. We will release our results as a MedARC blogpost, crediting anyone who has made a meaningful contribution.
This will likely be implemented as a helper script or lightweight library that ties all our medical evaluation environments together and reports all the results in one easy-to-manage location.
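As a rough illustration of what such a helper could look like, here is a minimal sketch. It assumes each evaluation environment can be wrapped as a function that takes a model name and returns a score plus the number of examples. The names `EvalResult`, `run_all`, and `to_table` are hypothetical and are not part of any existing MedARC code.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Hypothetical interface: an environment maps a model name to (score, n_examples).
EvalFn = Callable[[str], Tuple[float, int]]


@dataclass
class EvalResult:
    environment: str
    model: str
    score: float
    n_examples: int


def run_all(environments: Dict[str, EvalFn], models: List[str]) -> List[EvalResult]:
    """Run every environment against every model and collect the results."""
    results = []
    for model in models:
        for name, evaluate in environments.items():
            score, n_examples = evaluate(model)
            results.append(EvalResult(name, model, score, n_examples))
    return results


def to_table(results: List[EvalResult]) -> str:
    """Render all results as one CSV string so they live in a single place."""
    header = "environment,model,score,n_examples"
    rows = [
        f"{r.environment},{r.model},{r.score:.3f},{r.n_examples}" for r in results
    ]
    return "\n".join([header] + rows)
```

Calling `to_table(run_all(environments, models))` would then give one table covering every environment and model pair, which could be written to disk or pasted into the writeup.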
Incomplete List of Open Questions
- Which datasets should we include in addition to the datasets we already have issues open for?
  - Incomplete list of possible datasets:
    - Any open MedHELM datasets we don’t already cover
    - MIMIC
    - MedDialog
    - Craft-MD
    - M-ARC
    - MEDIQ?
- Which models are best suited for LLM as a Judge in our evaluations?
  - Many papers use gpt-4o-mini, and some newer papers use o4-mini. What about gpt-5-mini, gemini flash, or an open source judge?
  - How well do the mini models judge the full-sized models? For example, gpt-mini vs gpt, gemini flash vs pro
  - What is the variance between different models as judges? What about repeated runs of the same model?
- How can we efficiently measure variance for evaluated models?
  - For MCQ, we could have a few rollouts: one in the standard order and a couple in seeded random order
  - Can we bootstrap confidence intervals using a subset?
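To make the last two ideas concrete, here is a minimal sketch using only the Python standard library. It assumes that "order" refers to the order of the answer options and that per-question results are stored as 0/1 correctness values. The helper names `shuffle_options` and `bootstrap_ci` are hypothetical, and the percentile bootstrap is one possible choice, not a settled method for this project.

```python
import random
from typing import List, Sequence, Tuple


def shuffle_options(
    options: Sequence[str], answer_idx: int, seed: int
) -> Tuple[List[str], int]:
    """Shuffle answer options reproducibly and return the new answer index."""
    rng = random.Random(seed)  # seeded, so every rerun sees the same order
    order = list(range(len(options)))
    rng.shuffle(order)
    shuffled = [options[i] for i in order]
    return shuffled, order.index(answer_idx)


def bootstrap_ci(
    correct: Sequence[int],
    n_resamples: int = 1000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap interval for accuracy over 0/1 correctness values."""
    rng = random.Random(seed)
    n = len(correct)
    # Resample questions with replacement and recompute accuracy each time.
    means = sorted(sum(rng.choices(correct, k=n)) / n for _ in range(n_resamples))
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lower, upper
```

If only a subset of questions is evaluated, the same `bootstrap_ci` call applies to that subset's correctness values. The interval then reflects the subset size, so a wider interval is expected with fewer questions.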
Timeline
We’d like to release our blogpost and benchmark at the end of October, during the week of the 27th.
Tentative timeline:
- Next two weeks to explore the above questions and finalize our datasets
- Third week of October to run the evals, debug any potential issues, and start writing the blogpost
- Early last week of October: finish our writeup of the results