The most important thing to take into account when we want to run a parallel job in Slurm is to always use the ‘srun’ command to launch the parallel program.

 

We have two ways to run a parallel job:

 

  1. Using a bash script and the sbatch command.
  2. Using the srun command directly

 

Using a bash script and the sbatch command

First, we have to prepare a bash script in which we indicate how many nodes and how many cores per node we are requesting to run our job:

 

test-mpi.sh

#!/bin/bash
#
#SBATCH -p normal # partition (queue)
#SBATCH -N 2 # number of nodes
#SBATCH --ntasks-per-node=4 # number of tasks (cores) per node
#SBATCH --mem-per-cpu=1000 # memory (in MB) to use for each core
#SBATCH -t 0-00:15 # time (D-HH:MM)
#SBATCH -o slurm.%N.%j.out # STDOUT
#SBATCH -e slurm.%N.%j.err # STDERR
#SBATCH --mail-type=BEGIN,END,FAIL # notifications for job start, end & fail
#SBATCH [email protected] # send-to address

 

# Parallel program to run
srun mpi-example/mpi_mm
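
Once the script is ready, we submit it to the queue with the sbatch command; Slurm answers with the ID assigned to the job:

sbatch test-mpi.sh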

 

 

We can use different flags, for instance:

 

#SBATCH --ntasks=16 : Total number of tasks (cores) we are asking for. These cores may be split across nodes (for example, 10 cores on one node and 6 cores on another).
#SBATCH --mem=16G : Specifies the real memory required per node. A memory size of zero is treated as a special case and grants the job access to all of the memory on each node.
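
As a minimal sketch, the header of the previous script could be rewritten with these flags (the rest of the script, including the srun line, stays the same):

#!/bin/bash
#
#SBATCH -p normal # partition (queue)
#SBATCH --ntasks=16 # total number of tasks (cores), split across nodes as needed
#SBATCH --mem=16G # real memory required per node
#SBATCH -t 0-00:15 # time (D-HH:MM)
#SBATCH -o slurm.%N.%j.out # STDOUT
#SBATCH -e slurm.%N.%j.err # STDERR

srun mpi-example/mpi_mm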

 

Using the srun command directly

 

We can avoid using a job script and indicate all the flags directly on the command line:

 

srun -N 4 --ntasks-per-node=8 --mem-per-cpu=2G -t 00:30:00 mpi-example/mpi_mm

 

This approach can be very useful if we want to write a script that sends several jobs to the cluster.
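
For example, a minimal sketch of such a script (the loop count, resource sizes and output file names are only illustrative, and we assume the same mpi-example/mpi_mm binary as above):

#!/bin/bash
# Launch several independent parallel jobs from one script.
for i in 1 2 3; do
    # Each srun requests its own allocation; '&' lets the jobs run concurrently.
    srun -N 2 --ntasks-per-node=4 --mem-per-cpu=1000 -t 00:15:00 \
         -o "mpi_mm.$i.out" mpi-example/mpi_mm &
done
wait # do not exit until every job has finished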