Project lead: @Tanishq Abraham

Discord channel: https://discord.com/channels/1025299671226265621/1400621466483167334

MedARC Meeting Calendar: public link | iCal link

Project description

The goal of this project is to improve the medical reasoning abilities of open-source language models via post-training, with the aim of producing a state-of-the-art (SOTA) medical language model.

We plan to post-train an open-source 7-8B parameter model (likely Qwen3, unless a better model comes along) on medical text.

This project is divided into three areas: evaluation, data, and training.

Evaluation:

In order to train a useful model, it is important to be able to evaluate it on meaningful benchmarks. Unfortunately, the state of medical LLM benchmarks is quite poor.

You can read a bit about my perspective on this (although it is slightly outdated) over here.

The goal of this part of the project is to construct an automated evaluation suite for medical LLMs. We plan to evaluate on:

Basic multiple-choice Q&A datasets

For the reasons described in my blog post, these datasets now serve more as sanity checks; by themselves, they are not a useful measure of practical model performance.

Open question: Could removing the answer choices and treating these questions as open-ended benchmarks provide a useful signal?
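As a rough illustration of the open question above, here is a minimal sketch of how a multiple-choice item could be turned into an open-ended one. Everything in it is an assumption made for illustration: the `MCQItem` layout, the function names, the example question, and especially the substring-based `matches_reference` check, which is only a crude stand-in for whatever grading method the project ends up using.

```python
from dataclasses import dataclass
import string


@dataclass
class MCQItem:
    """Hypothetical container for one multiple-choice question."""
    question: str
    options: dict[str, str]  # option letter -> option text
    answer: str              # letter of the correct option


def to_multiple_choice_prompt(item: MCQItem) -> str:
    """Standard format: question, lettered options, single-letter answer."""
    lines = [item.question]
    lines += [f"{letter}. {text}" for letter, text in sorted(item.options.items())]
    lines.append("Answer with a single letter.")
    return "\n".join(lines)


def to_open_ended(item: MCQItem) -> tuple[str, str]:
    """Drop the options; the correct option's text becomes the reference answer."""
    return item.question, item.options[item.answer]


def normalize(text: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())


def matches_reference(response: str, reference: str) -> bool:
    """Crude check (assumption): does the response contain the reference text?"""
    return normalize(reference) in normalize(response)


# Illustrative item, not taken from any benchmark.
item = MCQItem(
    question="Which vitamin deficiency causes scurvy?",
    options={"A": "Vitamin A", "B": "Vitamin C", "C": "Vitamin D", "D": "Vitamin K"},
    answer="B",
)

mc_prompt = to_multiple_choice_prompt(item)
open_prompt, reference = to_open_ended(item)
```

With the options removed, the model can no longer rely on elimination or answer-letter priors, but grading gets harder: a substring match like the one above will miss correct paraphrases (for example "ascorbic acid"), so a more robust grader would be needed in practice.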

Harder multiple-choice Q&A datasets