“Science-in-the-open” [1] has the potential to redefine the traditional scientific framework and greatly accelerate high-impact research. It usually involves a large team of volunteer researchers asynchronously working together in a public Discord server. This approach allows one to harness crowd-sourced intelligence to tackle ambitious problems collaboratively and attract specialized expertise and creative approaches you might not find in a traditional lab.
Unfortunately, in practice, the benefits are often outweighed by management challenges. With an open invite, you’ll likely have a heterogeneous group of contributors: different backgrounds, skill levels, time zones, and commitment levels. Coordinating a large, decentralized team is hard. There’s a risk of wasting time onboarding people who later disappear, or trying to mentor enthusiastic novices at the expense of research progress. Newcomers may find it impossible to figure out the project status or how to contribute amidst the chatter. Progress can stall as volunteers are under no obligation to stay or finish tasks.
Below, I outline strategies that, in my experience, have worked well for steering open science collaborations. Our MindEye papers [2, 3] followed these strategies and were published at NeurIPS and ICML, and we are now introducing the same structure into all projects conducted in our MedARC Discord server.
Summary
In my opinion, the following strategies promote the best open science collaborations:
- Flat and simple codebase so contributors can easily acquaint themselves with it.
- Strong, reliable project lead who provides consistent, immediate progress updates and handles task delegation.
- Communicate frequently and transparently. Use a central hub like Discord for real-time updates and hold regular group video calls to summarize progress and delegate tasks.
- Simple 3-platform organization: Discord as the core hub for all communication, GitHub for sharing code (using forks and small, minimally changed pull requests), and Notion for summarizing everyone’s updates before the next video call.
- Self-onboarding: interested new contributors should spectate the next group video call and be assigned a low-priority “gatekeeping” task (e.g., fork the repo and reproduce a certain analysis) as a low-effort means to weed out unserious volunteers. 1-on-1 conversations with a volunteer should be avoided unless they have already demonstrated that they are a serious, reliable contributor.
- Provide a free, shared computing workspace so everyone has access to the same resources and file system.
- Keep the project focused by taking only the minimal steps necessary to reach a tangible deliverable (i.e., a research paper).
- Devise a fair and low-barrier system for authorship and credit to reward volunteers, while ensuring that author contributions are transparent.
- Bad actors stealing your ideas/code is not a huge concern in practice, and there are various ways to mitigate this risk.
I’ll dive into each of these points below, then conclude with some optimistic thoughts on MedARC’s future role to support open science collaborations.
Keep the Codebase Simple and Flat
For open collaborations, it’s important to maintain an interpretable and lightweight codebase. Your repository should be as welcoming as possible to a newcomer who might be browsing it to understand the project. A flat code structure [4] with minimal dependencies seems to work best.
- Limit complexity: Minimize the number of folders, files, and external dependencies. Ideally, everything needed to run the project fits in a single directory tree without deep nesting. This makes it easier to navigate and grasp the pipeline.
- Include only what’s used: Do not accumulate dead code or experimental scripts in the main repository. Such code should be in branches outside main.
- Fewer abstractions: Aim for straightforward code that a person can read and follow without digging through many layers of indirection. It’s fine if some code is duplicated a bit for clarity; see Hugging Face’s “Repeat Yourself” philosophy [5] and Dan Abramov’s talk on the “WET” codebase [6].
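One rough way to keep an eye on the "limit complexity" point is to measure how many files a repository holds and how deeply they are nested. The snippet below is only a minimal sketch, not something taken from our projects: the function name `repo_shape` and the choice to ignore `.git` are my own assumptions, and what counts as "too deep" is a judgment call for each project.

```python
import os
import tempfile
from pathlib import Path


def repo_shape(root, ignore=(".git",)):
    """Return (file_count, max_depth) for the tree under root.

    Depth 0 means a file sits directly in root, depth 1 means one
    folder down, and so on. Folders named in `ignore` are skipped.
    """
    root = Path(root)
    file_count = 0
    max_depth = 0
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune ignored folders in place so os.walk does not descend into them.
        dirnames[:] = [d for d in dirnames if d not in ignore]
        if not filenames:
            continue
        depth = len(Path(dirpath).relative_to(root).parts)
        file_count += len(filenames)
        max_depth = max(max_depth, depth)
    return file_count, max_depth


if __name__ == "__main__":
    # Build a tiny throwaway tree (hypothetical file names) and inspect it.
    with tempfile.TemporaryDirectory() as tmp:
        base = Path(tmp)
        (base / "train.py").write_text("")
        (base / "utils.py").write_text("")
        (base / "data").mkdir()
        (base / "data" / "README.md").write_text("")
        files, depth = repo_shape(base)
        print(f"{files} files, deepest file is {depth} folder(s) down")
```

A project lead could run something like this on the main branch from time to time; a steadily growing depth or file count is a hint that dead code or extra layers are creeping in.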