Paul updates

I ran various ablations with dinov3 and, long story short, nothing looks especially promising for improving BACH downstream eval performance with the dinov3 checkpoints. Given the time constraints, I think the best plan is to stick with OpenMidnight as is, and instead innovate on how to connect OpenMidnight to whole-slide context and get better downstream performance as a result.

(1) I tried lowering clip_grad from 30 to 3, since 3 is what dinov2 used and is also what every dinov3 config uses except the 7B pretraining config (and the high-res config). It didn't seem to make much difference, though it may help a tiny bit, so I'm sticking with 3 moving forward.
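To make the clip_grad change concrete, here is a minimal pure-Python sketch of global-norm gradient clipping. It assumes clip_grad is a maximum L2 gradient norm (in the spirit of torch.nn.utils.clip_grad_norm_); the actual dinov2/dinov3 code may apply it per module or per FSDP unit, and the function name and toy values here are made up for illustration.

```python
import math


def clip_grad_norm(grads, max_norm):
    """Scale a flat list of gradient values so their L2 norm is at most max_norm.

    Returns (clipped_grads, total_norm_before_clipping).
    """
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm or total_norm == 0.0:
        # Already within the limit: gradients pass through unchanged.
        return list(grads), total_norm
    scale = max_norm / total_norm
    return [g * scale for g in grads], total_norm


grads = [6.0, 8.0]  # toy gradient with L2 norm 10
loose, norm = clip_grad_norm(grads, max_norm=30.0)  # 10 <= 30, so untouched
tight, _ = clip_grad_norm(grads, max_norm=3.0)      # rescaled down to norm 3
```

With a limit of 30, only very large gradient spikes are rescaled; with 3, moderately large steps are damped as well, which may be why the effect on training is small but not zero.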

(2) I tried lowering the KDE regularization weight from 0.05 to 0.01. The training dip still happens, but it's not as large. One issue is that I'm running these ablations on a full node, and 2 hours only gets through 10,000 steps, so it's hard to balance running many ablations against running each one long enough to be maximally informative. I'm sticking with the original KDE regularization weight of 0.05 moving forward.

(3) I reran the original OpenMidnight training run and found that it also shows an initial dip in BACH downstream accuracy; it's just less drastic and quickly gets corrected.

(4) I checked BACH downstream performance on non-finetuned dino checkpoints:

So the difference in starting-point performance between vitg and vithplus is not as large as I thought, and there is a decent boost for the vit7b.

(5) I discovered a mistake in the original OpenMidnight model: we weren't loading the pretrained positional embeddings correctly. We initially thought that if you keep the vit-giant model size, you don't need to interpolate the positional embeddings when loading Meta's dinov2 checkpoint. But the checkpoint Meta shared already had its positional embeddings interpolated to global crop size 518 (which they do for inference and high-resolution adaptation), so we needed to interpolate them back down to 224 before starting finetuning. I think this basically means the original OpenMidnight had randomly initialized positional embeddings? I'm not sure whether this makes much practical difference. (This issue doesn't apply to dinov3 because it uses RoPE.)
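The size mismatch in (5) can be caught with simple arithmetic before loading a checkpoint. The sketch below assumes a patch size of 14 (as in DINOv2 ViT-g/14) and one class token that carries a positional embedding; the function names are hypothetical and not from either codebase.

```python
def num_pos_tokens(crop_size, patch_size=14, num_prefix_tokens=1):
    """Number of positional-embedding rows a ViT expects for a square crop."""
    if crop_size % patch_size != 0:
        raise ValueError("crop_size must be a multiple of patch_size")
    side = crop_size // patch_size  # patches along one edge of the crop
    return side * side + num_prefix_tokens


def needs_interpolation(ckpt_pos_embed_len, train_crop_size, patch_size=14):
    """True if the checkpoint's pos-embed length does not match the training crop."""
    return ckpt_pos_embed_len != num_pos_tokens(train_crop_size, patch_size)


ckpt_len = num_pos_tokens(518)   # 37 * 37 + 1 rows in the shared checkpoint
train_len = num_pos_tokens(224)  # 16 * 16 + 1 rows expected when training at 224
mismatch = needs_interpolation(ckpt_len, 224)
```

A check like this at load time would flag that the 518-sized embeddings cannot be copied directly into a model built for 224 crops, rather than letting the load silently fall back to freshly initialized values.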

(6) I've been training a vithplus model using the dinov2 codebase to see whether swapping in the checkpoint while keeping the original OpenMidnight code produces the best results. So far it looks unlikely: after vithplus's initial huge dip in downstream accuracy, it doesn't seem able to recover and surpass OpenMidnight's vitg model.

(7) I also adapted the dinov2 and dinov3 path-fm codebases to support loading the vit7b model, but it really isn't able to train well on a single H100 node. I had to enable activation checkpointing and fp8 matmuls and lower the batch size to make it fit, and it still hit OOM when FSDPCheckpointer went to save.

(8) I tried ImageNet normalization vs. TCGA normalization; it didn't seem to make any difference.
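For reference, the two normalization options in (8) differ only in which per-channel mean and standard deviation are subtracted and divided out. The sketch below uses the standard ImageNet statistics and shows how dataset-specific statistics (e.g., for TCGA tiles) could be computed; the helper names and the toy pixel values are illustrative assumptions, not the values used in the actual runs.

```python
import math

# Standard ImageNet per-channel statistics for RGB values scaled to [0, 1].
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


def channel_stats(pixels):
    """Per-channel mean and population std over an iterable of (r, g, b) pixels."""
    pixels = list(pixels)
    n = len(pixels)
    mean = tuple(sum(p[c] for p in pixels) / n for c in range(3))
    std = tuple(
        math.sqrt(sum((p[c] - mean[c]) ** 2 for p in pixels) / n) for c in range(3)
    )
    return mean, std


def normalize(pixel, mean, std):
    """Apply (x - mean) / std to each channel of one pixel."""
    return tuple((pixel[c] - mean[c]) / std[c] for c in range(3))


# Toy stand-in for pixels sampled from a pathology dataset (made-up values).
toy_pixels = [(0.2, 0.4, 0.6), (0.4, 0.8, 0.2)]
toy_mean, toy_std = channel_stats(toy_pixels)
with_dataset_stats = normalize(toy_pixels[0], toy_mean, toy_std)
with_imagenet_stats = normalize(toy_pixels[0], IMAGENET_MEAN, IMAGENET_STD)
```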

Dataset preprocessing comparisons

@ Anis

Nuances surrounding magnification terminology

“Magnification level” → “20x” refers to the magnification of the microscope objective used to image the tissue slide. Not every microscope’s 20x is identical, though, so it’s better to use an objective physical measure: micrometers per pixel (MPP), e.g., 0.25 µm/px

“Microns per pixel”, aka micrometers per pixel, is the actual physical width of tissue that each pixel covers

“Pyramid levels” or downsample levels: level 0 is the highest resolution, and levels 1, 2, 3 are progressively downsampled versions of the WSI. The downsampling factor between levels is not standardized; it is usually 4x or 2x
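The relationship between MPP and pyramid levels can be sketched as follows. This is a minimal illustration, not code from any of the preprocessing pipelines being compared: the function names are hypothetical, the list name level_downsamples mirrors the attribute OpenSlide exposes, and the downsample lists and the 0.25 µm/px base resolution are example values. It assumes the downsample list is sorted ascending with level 0 first.

```python
def level_mpp(base_mpp, downsample):
    """MPP at a pyramid level: each pixel covers `downsample` times more tissue."""
    return base_mpp * downsample


def best_level_for_mpp(base_mpp, level_downsamples, target_mpp):
    """Pick the coarsest level that is still at least as fine as target_mpp.

    Reading from this level means tiles only ever need to be shrunk, never enlarged.
    """
    best = 0
    for index, downsample in enumerate(level_downsamples):
        if level_mpp(base_mpp, downsample) <= target_mpp + 1e-9:
            best = index
    return best


def read_size_for_target(tile_px, target_mpp, source_mpp):
    """Pixels to read at source_mpp so the region matches tile_px at target_mpp."""
    return round(tile_px * target_mpp / source_mpp)


# Slide scanned at 0.25 um/px with 4x steps between levels; we want 0.5 um/px tiles.
lvl_a = best_level_for_mpp(0.25, [1, 4, 16], target_mpp=0.5)
size_a = read_size_for_target(224, 0.5, level_mpp(0.25, [1, 4, 16][lvl_a]))

# Same base resolution but 2x steps: level 1 already sits at 0.5 um/px.
lvl_b = best_level_for_mpp(0.25, [1, 2, 4], target_mpp=0.5)
size_b = read_size_for_target(224, 0.5, level_mpp(0.25, [1, 2, 4][lvl_b]))
```

In the 4x case the tile has to be read at level 0 as 448 px and resized to 224 px, while in the 2x case a 224 px read at level 1 already has the target resolution. This is why comparing pipelines by "level" alone can be misleading, and comparing by MPP is safer.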