Stability HPC Cluster User Guide: https://stabilityai.notion.site/Stability-HPC-Cluster-User-Guide-226c46436df94d24b682239472e36843
Stability HPC Quickstart Guide: https://stabilityai.notion.site/stabilityai/Getting-Started-with-Stability-AI-HPC-Cluster-dee69433b5d74bdfa6eb52c728c7ec25
SLURM Guide: https://slurm.schedmd.com/quickstart.html
As an external team (i.e., not employed by Stability), our login node is ext1.hpc.stability.ai. This is a shared external research team node with an independent "a40x" partition containing 116 40GB A100 GPUs (8 GPUs per node).
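Jobs on the a40x partition are scheduled through SLURM (see the SLURM Guide above). As a sketch, a batch job requesting a single GPU might look like the following; the partition name comes from this guide, but the job name, time limit, resource counts, and script name are placeholder assumptions you should adjust:

```
#!/bin/bash
#SBATCH --partition=a40x    # the external team partition named above
#SBATCH --gpus=1            # placeholder: request one A100
#SBATCH --time=01:00:00     # placeholder: one-hour time limit
#SBATCH --job-name=example  # placeholder job name

source ~/mindeye/bin/activate
python my_script.py         # hypothetical script
```

Submit with sbatch job.sh and check on your jobs with squeue --me.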
SSHing into ext1 (change yourname to the username you set up when you first created your Stability AI HPC account, and change ~/.ssh/stability to the path to your private SSH key):
ssh -i ~/.ssh/stability [email protected]
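Optionally, you can save yourself some typing with a host alias in ~/.ssh/config; a sketch of such an entry (the alias name "stability" is an arbitrary choice, and yourname is your own username as above):

```
Host stability
    HostName ext1.hpc.stability.ai
    User yourname
    IdentityFile ~/.ssh/stability
```

After adding this, plain `ssh stability` is equivalent to the full command above.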
First, make sure you have configured your Stability AI user account (see the Stability HPC Cluster User Guide linked above).
Assuming you are able to ssh into ext1, you should now make a python environment for yourself in your home directory (/admin/home-yourname).
cd ~
python3.11 -m venv mindeye
source mindeye/bin/activate
nano ~/.bashrc # open your .bashrc file for editing
# scroll to the bottom of the file and add the following line:
source ~/mindeye/bin/activate
# save your edits: Ctrl-X, then Y, then Enter
ln -s /weka/proj-fmri/shared/cache /admin/home-yourname/.cache
Make a folder for yourself at /weka/proj-fmri/yourname; the proj-fmri folder is a shared workspace where we do all our work.
Your home folder should take up less than 200GB, and your /weka files/folders should take up less than 500GB! If you are past this quota limit you will need to remove files or stash them on s3 storage (see the s3 section below). You can also make use of /scratch for temporary storage, which has its own independent quota of several terabytes shared amongst all users (it is only accessible from within a compute node). If you use /scratch, remember to delete your files when you are done.
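To see how you are tracking against these quotas, `du` gives a quick summary; a minimal sketch (the /weka path follows the folder convention above, with yourname substituted):

```shell
# total size of your home directory (200GB quota)
du -sh ~

# total size of your /weka workspace (500GB quota)
du -sh /weka/proj-fmri/yourname
```

Note that `du` on a large directory tree can take a while on a shared filesystem.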
To save on data storage, please use /weka/proj-fmri/shared to store files useful to the whole team, and in particular save cached models (e.g., models downloaded from Hugging Face) to /weka/proj-fmri/shared/cache. For example, we do not all need to download Stable Diffusion!
Example of caching versatile diffusion:
from diffusers import VersatileDiffusionPipeline

# download the weights into (or reuse them from) the shared cache
cache_dir = '/weka/proj-fmri/shared/cache'
pipe = VersatileDiffusionPipeline.from_pretrained(
    "shi-labs/versatile-diffusion", cache_dir=cache_dir
)
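Alternatively, you can point the Hugging Face libraries at the shared cache once per process via the HF_HOME environment variable, instead of passing cache_dir to every from_pretrained call; a minimal sketch (note that with HF_HOME the hub cache lands under a hub/ subdirectory, so the on-disk layout differs slightly from passing cache_dir directly):

```python
import os

# point all Hugging Face downloads at the shared team cache;
# must be set before the huggingface libraries are imported
os.environ["HF_HOME"] = "/weka/proj-fmri/shared/cache"
```

You could also export HF_HOME in your ~/.bashrc alongside the venv activation line above, so it applies to every session.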