Stability HPC Cluster User Guide: https://stabilityai.notion.site/Stability-HPC-Cluster-User-Guide-226c46436df94d24b682239472e36843
Stability HPC Quickstart Guide: https://stabilityai.notion.site/stabilityai/Getting-Started-with-Stability-AI-HPC-Cluster-dee69433b5d74bdfa6eb52c728c7ec25
SLURM Guide: https://slurm.schedmd.com/quickstart.html
As an external team (i.e., not employed by Stability), our login node is ext1.hpc.stability.ai. This is a shared external research team node with an independent "a40x" partition containing 116 40GB A100 GPUs (8 GPUs per node).
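Jobs on the a40x partition are scheduled through SLURM (see the SLURM Guide above). As a sketch, a batch job requesting a single GPU might look like the following; the partition name comes from this guide, but the job name, time limit, resource counts, and script name are placeholder assumptions you should adjust:

```
#!/bin/bash
#SBATCH --partition=a40x    # the external team partition named above
#SBATCH --gpus=1            # placeholder: request one A100
#SBATCH --time=01:00:00     # placeholder: one-hour time limit
#SBATCH --job-name=example  # placeholder job name

source ~/mindeye/bin/activate
python my_script.py         # hypothetical script
```

Submit with sbatch job.sh and check on your jobs with squeue --me.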
SSHing into ext1 (change yourname to the username you set up when you first created your Stability AI HPC account, and change ~/.ssh/stability to the path to your private SSH key):
ssh -i ~/.ssh/stability [email protected]
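Optionally, you can save yourself some typing with a host alias in ~/.ssh/config; a sketch of such an entry (the alias name "stability" is an arbitrary choice, and yourname is your own username as above):

```
Host stability
    HostName ext1.hpc.stability.ai
    User yourname
    IdentityFile ~/.ssh/stability
```

After adding this, plain `ssh stability` is equivalent to the full command above.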
First, make sure you have configured your Stability AI user account (see the Stability HPC Cluster User Guide linked above).
Assuming you are able to ssh into ext1, you should now make a python environment for yourself in your home directory (/admin/home-yourname).
cd ~
python3.11 -m venv mindeye
source mindeye/bin/activate
nano ~/.bashrc # open your .bashrc file for editing
# scroll to the bottom of the file and add the following line:
source ~/mindeye/bin/activate
# save your edits: Ctrl-X, then Y, then Enter
ln -s /weka/proj-fmri/shared/cache /admin/home-yourname/.cache
Make a folder for yourself at /weka/proj-fmri/yourname; the proj-fmri folder is a shared workspace where we do all our work.
Your home folder should take up less than 200GB, and your /weka files/folders should take up less than 500GB! If you are past this quota limit you will need to remove files or stash them on s3 storage (see the s3 section below). You can also make use of /scratch for temporary storage, which has its own independent quota of several terabytes shared amongst all users (it is only accessible from within a compute node). If you use /scratch, remember to delete your files when you are done.
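To see how you are tracking against these quotas, `du` gives a quick summary; a minimal sketch (the /weka path follows the folder convention above, with yourname substituted):

```shell
# total size of your home directory (200GB quota)
du -sh ~

# total size of your /weka workspace (500GB quota)
du -sh /weka/proj-fmri/yourname
```

Note that `du` on a large directory tree can take a while on a shared filesystem.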
To save on data storage, please use /weka/proj-fmri/shared to store files useful to the whole team, and in particular save cached models (e.g., models downloaded from Hugging Face) to /weka/proj-fmri/shared/cache. For example, we do not all need to download Stable Diffusion!
Example of caching versatile diffusion:
from diffusers import VersatileDiffusionPipeline

# download the weights into (or reuse them from) the shared cache
cache_dir = '/weka/proj-fmri/shared/cache'
pipe = VersatileDiffusionPipeline.from_pretrained(
    "shi-labs/versatile-diffusion", cache_dir=cache_dir
)
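Alternatively, you can point the Hugging Face libraries at the shared cache once per process via the HF_HOME environment variable, instead of passing cache_dir to every from_pretrained call; a minimal sketch (note that with HF_HOME the hub cache lands under a hub/ subdirectory, so the on-disk layout differs slightly from passing cache_dir directly):

```python
import os

# point all Hugging Face downloads at the shared team cache;
# must be set before the huggingface libraries are imported
os.environ["HF_HOME"] = "/weka/proj-fmri/shared/cache"
```

You could also export HF_HOME in your ~/.bashrc alongside the venv activation line above, so it applies to every session.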