Friday, September 20, 2024

Running Data Version Control (DVC) Experiments on a SLURM Cluster: A Comprehensive Guide

Introduction
In the world of data science and machine learning, there are few things more frustrating than when your local development setup becomes insufficient to handle the demands of your project. This can happen when your dataset grows too large or your deep learning model requires multiple high-end GPUs. In this blog post, we’ll discuss how to scale up your DVC experiments on a SLURM cluster and provide some code to get you started.

Why DVC on SLURM?
As the DevOps movement has demonstrated, consistent software delivery is hard without well-designed tooling (such as CI/CD) and a supportive developer culture (such as pull requests and working in small batches). The same principles apply in our domain: to move fast while maintaining quality, reliability, and reproducibility, we need to adopt DevOps best practices. Integrating DVC gives our researchers a seamless experience that extends Git to data science and ML, so they can focus on what matters most: experimentation and innovation.

Our Approach
At Exscientia, we aim to change the way the world discovers and develops new medicines. To maintain a frictionless developer experience even as model sizes grow beyond the means of the humble laptop, we surveyed our computational estate to design an effective developer experience. We needed to support multiple teams with on-demand Jupyter or RStudio instances as well as workflow orchestration engines. To run large unsupervised jobs, interactive analyses, and development sessions across many domains and technologies, we selected a cloud-deployed SLURM cluster.

Remote DVC Experiments on SLURM Cluster
To submit a DVC experiment to the cluster, we create a Bash script that describes the exact resources required. SLURM holds the job in its queue until those resources are available and then executes it. Here’s a simplified snippet showing the steps such a script runs:

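#!/bin/bash
# Stop at the first failing command so a broken step is not silently ignored
set -eo pipefail

# Set the job name and working directory
RDVC_JOB_NAME="my_experiment"
RDVC_JOB_DIR="/tmp/${RDVC_JOB_NAME}"

# Clone the Git repository and enter it
# (RDVC_JOB_REPO_URL and RDVC_JOB_REPO_BRANCH must already be set in the job's environment)
git clone --branch "${RDVC_JOB_REPO_BRANCH}" "${RDVC_JOB_REPO_URL}" "${RDVC_JOB_DIR}"
cd "${RDVC_JOB_DIR}"

# Activate the Python environment
# (assumes it has been created at ./.venv; that step is not shown here)
source ./.venv/bin/activate

# Run the DVC experiment, overriding the "fabric" parameter with -S
dvc exp run --pull --allow-missing -S fabric=gpu

# Push the experiment to the Git remote
dvc exp push "${RDVC_JOB_REPO_URL}"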
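
The script above leaves the resource request itself implicit. With SLURM, that request is commonly written as #SBATCH directives at the top of the submission script. The following is a minimal sketch that generates such a script; the helper name write_job_script, the output path, and every resource value (one GPU, 8 CPUs, 32G of memory, a 4-hour limit) are illustrative assumptions, not the settings used by the authors.

```shell
#!/bin/bash
# Sketch: generate a submission script with explicit SLURM resource requests.
# write_job_script and all resource values below are hypothetical examples.
write_job_script() {
    local job_name="$1" out_file="$2"
    # Unquoted heredoc delimiter, so ${job_name} is expanded while writing the file
    cat > "$out_file" <<EOF
#!/bin/bash
#SBATCH --job-name=${job_name}
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=8
#SBATCH --mem=32G
#SBATCH --time=04:00:00
#SBATCH --output=${job_name}-%j.log

set -eo pipefail
dvc exp run --pull --allow-missing -S fabric=gpu
EOF
}

# Write the example job file to a temporary location (path is an assumption)
job_file="${TMPDIR:-/tmp}/my_experiment.sbatch"
write_job_script "my_experiment" "$job_file"
# Submit from a login node with: sbatch "$job_file"
```

In the --output pattern, %j is replaced by SLURM with the job ID, so each run writes to its own log file.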

Conclusion
By integrating DVC with our SLURM cluster, we’ve created a seamless experience that extends Git to data science and ML. Our researchers can focus on what matters most – experimentation and innovation – while we handle the scalability and reproducibility. If you’re interested in customizing our approach to your team’s needs, feel free to fork and modify our tool.

Frequently Asked Questions

How does DVC work on SLURM clusters?

DVC extends Git to data science and ML, allowing you to manage and reproduce your experiments with ease. On a SLURM cluster, you submit a DVC experiment as a batch job; SLURM queues it until the required resources are available and then executes it.

How do I get started with DVC on SLURM clusters?

You can start by creating a Bash script that describes the exact resources required for your DVC experiment. Then, submit the script to the SLURM cluster, which will execute the job. We’ve provided a code snippet to get you started.
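
As a rough sketch of the submission step, the helper below wraps sbatch and returns the job ID so it can be used for follow-up queries. The function name submit_job and the script path in the usage comments are assumptions made for illustration; sbatch --parsable and squeue -j are standard SLURM commands.

```shell
#!/bin/bash
# Sketch: submit a job script and capture the SLURM job ID.
# submit_job is a hypothetical helper, not part of SLURM or DVC.
submit_job() {
    local script="$1"
    local out
    # --parsable makes sbatch print "jobid" or "jobid;cluster" instead of a sentence
    out="$(sbatch --parsable "$script")" || return 1
    # Keep only the part before the first ';' (the job ID)
    printf '%s\n' "${out%%;*}"
}

# Usage on a login node (not executed here):
#   job_id="$(submit_job ./my_experiment.sbatch)"
#   squeue -j "$job_id"    # shows whether the job is pending or running
```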

What are the benefits of using DVC on SLURM clusters?

By using DVC on a SLURM cluster, you get a seamless experience that extends Git to data science and ML. You can focus on what matters most, experimentation and innovation, while the cluster provides the scale and DVC provides the reproducibility.

How do I customize DVC on SLURM clusters to my team’s needs?

You can customize our approach by forking and modifying our tool. We encourage you to experiment and adapt our approach to your team’s needs.

What if my job fails on the SLURM cluster?

If your job fails, consult its log in ~/.rdvc/logs, then try to reproduce the failure by running the steps of the submission script from an interactive session.
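
A small sketch of that debugging loop follows. It assumes that ~/.rdvc/logs is a flat directory with one file per job and that the most recently modified file belongs to the failed job; neither detail is confirmed by this post, and latest_log is a hypothetical helper.

```shell
#!/bin/bash
# Sketch: print the path of the most recently modified file in a log directory.
# latest_log is a hypothetical helper; a flat ~/.rdvc/logs layout is assumed.
latest_log() {
    local log_dir="${1:-$HOME/.rdvc/logs}"
    local newest
    # ls -t sorts by modification time, newest first
    newest="$(ls -t "$log_dir" 2>/dev/null | head -n 1)"
    # Fail if the directory is missing or empty
    [ -n "$newest" ] || return 1
    printf '%s/%s\n' "$log_dir" "$newest"
}

# Usage (not executed here):
#   tail -n 50 "$(latest_log)"        # inspect the end of the newest log
#   srun --gres=gpu:1 --pty bash      # open an interactive shell on a compute node
#   # then rerun the submission script's commands one at a time
```

The srun resource flags are examples only; request the same resources as the failed job so the interactive session matches its environment as closely as possible.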
