CXR-BENCH
[Initial tasks are listed below! We can modify them depending on which capabilities of the model we want to highlight]
[Datasets/Tasks highlighted in green are completely unseen during training]
Evaluation Axis 1: Coarse-Grained Image Understanding
TASKS:
- [Single Disease Classification][CXR-LT or PadChest]: We use the test set of CXR-LT or PadChest (both of which have large, long-tailed label sets) and create multiple-choice questions. We provide four options, each of which contains a single disease, and we can select hard negatives as distractors (a construction sketch appears after this task list).
- (e.g. Which of the following diseases is visible in the CXR? (a) pneumonia, (b) cardiomegaly, (c) no disease, etc.)
- [Single Disease Classification][CheXpert]: We repeat the above procedure for the human-annotated CheXpert test set (a standard benchmark).
- [Multi-Disease Classification][CXR-LT or PadChest]: We use the test set of CXR-LT or PadChest and create multiple-choice questions. We provide four options, each of which includes multiple diseases. We can select hard negatives with high label overlap.
- (e.g. Which of the following diseases are visible in the CXR? (a) pneumonia, pneumothorax, atelectasis, (b) cardiomegaly, atelectasis, etc.).
- [Multi-Disease Classification][CheXpert]: We repeat the above procedure for the human-annotated CheXpert test set.
- [View Classification][CheXpert/MIMIC]: We use the test set of CheXpert and create multiple-choice questions with the various views as options.
- (e.g. Which of the following correctly characterizes the X-ray view? (a) LL (b) Lateral (c) AP (d) PA)
- [View Matching][CheXpert/MIMIC]: We use the test set of CheXpert and provide multiple views to the model.
- (e.g. Do the two CXRs come from the same study? (a) yes (b) no)
- [Image-Text Matching][MIMIC-CXR]: We use the reports in the MIMIC-CXR test set and randomly select a phrase. We then use ChatGPT to subtly alter the meaning of the phrase, such as by changing the disease name or changing “small” to “large”. We can control this across multiple axes, such as abnormality size, abnormality type, negation, etc. (a perturbation sketch appears after this task list). Then, we set it up as a two-option multiple-choice task.
- (e.g. Which of the following phrases accurately describes the content of the image? (a) There is pleural effusion in the left lower lung, (b) There is pleural effusion in the right lower lung)
- We can potentially get this reviewed by radiologists
- [Image-Text Matching][Synthetic CXR]: Same task as above, but we use synthetic CXRs generated using RoentGen.
- [Close-Ended VQA][RadRestruct and SLAKE]: We use the close-ended questions from these standard VQA benchmarks.
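As referenced above, a minimal sketch of how the single-disease MCQs with hard negatives could be constructed (the function name, the cooccurrence map, and its format are illustrative assumptions, not an existing pipeline):

```python
import random

def build_single_disease_mcq(positive_labels, all_labels, cooccurrence, n_options=4, seed=0):
    """Build one multiple-choice question from a study's labels (e.g., CXR-LT/PadChest).

    positive_labels: set of diseases present in the study.
    all_labels: full label vocabulary of the dataset.
    cooccurrence: dict mapping a disease to diseases that frequently co-occur with it
        or are commonly confused with it; used to pick hard-negative distractors.
    Returns the correct answer and a shuffled list of options.
    """
    rng = random.Random(seed)
    answer = rng.choice(sorted(positive_labels)) if positive_labels else "no disease"

    # Prefer hard negatives: labels similar to the answer but absent from the study.
    hard = [l for l in cooccurrence.get(answer, []) if l not in positive_labels]
    # Fall back to random absent labels if there are not enough hard negatives.
    easy = [l for l in all_labels if l not in positive_labels and l not in hard]
    rng.shuffle(easy)
    distractors = (hard + easy)[: n_options - 1]

    options = distractors + [answer]
    rng.shuffle(options)
    return answer, options
```

The multi-disease variant is the same idea, except each option is a set of diseases and distractor sets are sampled to share most (but not all) labels with the true set.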
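Similarly, a rough sketch of the ChatGPT-based phrase perturbation for the image-text matching tasks; the prompt wording, the model name, and the function are placeholders we would still need to tune and validate (ideally with radiologist review, as noted above):

```python
from openai import OpenAI  # assumes the OpenAI Python SDK is installed and an API key is configured

PERTURBATION_PROMPT = """You will be given a phrase from a chest X-ray report.
Rewrite it so that it is no longer accurate, changing ONLY the {axis}
(laterality, abnormality size, abnormality type, or negation).
Keep the wording otherwise identical.
Phrase: "{phrase}"
Return only the rewritten phrase."""

def perturb_phrase(phrase, axis="laterality", model="gpt-4o"):
    """Ask an LLM to subtly alter one controlled aspect of a report phrase.

    The perturbed phrase becomes the incorrect option in the two-way
    image-text matching question; the original phrase is the correct one.
    """
    client = OpenAI()
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PERTURBATION_PROMPT.format(axis=axis, phrase=phrase)}],
        temperature=0.0,
    )
    return response.choices[0].message.content.strip()

# e.g., perturb_phrase("There is pleural effusion in the left lower lung", axis="laterality")
# might return "There is pleural effusion in the right lower lung".
```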
EVALUATIONS:
- We directly evaluate close-ended responses. A reliable way to do this is to compare the log-likelihoods assigned to each option letter and pick the highest-scoring one (see the sketch after this list).
- Automated metrics: overall accuracy
- Slice evaluations: Report performance on long-tail/rare diseases, where we hope our model will show significant improvements over methods like GPT-4V.
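A minimal sketch of the option-letter log-likelihood scoring, assuming a HuggingFace-style causal-LM interface; a multimodal model would additionally take the image tensor, and the letter tokenization (e.g., whether a leading space is needed) has to be checked per model:

```python
import torch

def pick_option_by_loglik(model, tokenizer, prompt, option_letters=("A", "B", "C", "D")):
    """Score each option letter by its log-likelihood as the next token after the
    prompt (question + options), and return the highest-scoring letter."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits                 # (1, seq_len, vocab_size)
    next_token_logprobs = torch.log_softmax(logits[0, -1], dim=-1)

    scores = {}
    for letter in option_letters:
        # Use the first token id of " A", " B", ...; tokenization is model-specific.
        token_id = tokenizer.encode(" " + letter, add_special_tokens=False)[0]
        scores[letter] = next_token_logprobs[token_id].item()
    return max(scores, key=scores.get), scores
```

Overall accuracy is then just the fraction of questions where the returned letter matches the ground-truth option.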
Evaluation Axis 2: Fine-Grained Image Understanding
TASKS:
- [Chest Tube Segmentation][Candid]: Given an image, the model must return bounding box coordinates for chest tubes. We use the Candid test set.
- [Rib Fracture Segmentation][Candid]: Given an image, the model must return bounding box coordinates for rib fractures. We use the Candid test set.
- [Pneumothorax Segmentation][SIIM]: Given an image, the model must return bounding box coordinates for pneumothorax. This is a standard benchmark.
- [Abnormality Grounding][VinDr-CXR]: Given an abnormality that may or may not be present in the image, the model must return bounding boxes for it (a scoring sketch appears after this list).
- [Grounded Diagnosis][VinDr-CXR]: Given a bounding box, generate a diagnosis for the region. This is a multiple-choice task with several possible answers as well as a “no disease detected” option.
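For the localization tasks above, one plausible scoring recipe (an assumption; the exact output format and matching rule are ours to define) is to parse predicted boxes from the model's text output and compare them against ground-truth boxes with IoU:

```python
import re

def parse_boxes(text):
    """Parse predicted boxes from model text output, assuming the model is prompted to
    answer with "[x1, y1, x2, y2]" tuples; the output format itself is an assumption."""
    pattern = r"\[\s*([\d.]+)\s*,\s*([\d.]+)\s*,\s*([\d.]+)\s*,\s*([\d.]+)\s*\]"
    return [tuple(float(v) for v in match) for match in re.findall(pattern, text)]

def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = ((box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
             + (box_b[2] - box_b[0]) * (box_b[3] - box_b[1]) - inter)
    return inter / union if union > 0 else 0.0
```

A prediction could then count as correct when it exceeds an IoU threshold (e.g., 0.5, the common detection convention), with the "no disease detected" / absent-abnormality cases handled as simple classification.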
EVALUATIONS: