Slurm
About Slurm
Jobs are managed on luria.mit.edu using Slurm.
Slurm is an advanced job scheduler for a cluster environment.
The main purpose of a job scheduler is to utilize system resources in the most efficient way possible.
The number of tasks (slots) required for each job should be specified with the "-n" flag.
Each node provides 16, 32, or 96 slots.
The process of submitting jobs to Slurm is done using a script.
Creating a simple Slurm script
Jobs are generally submitted to Slurm using a script. The job script allows all options and the programs/commands to be placed in a single file.
It is possible to specify options via command line, but it becomes cumbersome when the number of options is significant.
An example of a script that can be used to submit a job to the cluster is shown below. Start by opening a file and copy and paste the following commands, then save the file as myjob.sh or any other meaningful name. Note: Job names cannot start with a number.
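A minimal sketch of such a script (the task count, email address, and time limit below are placeholders to adapt):

    #!/bin/bash
    #SBATCH -N 1
    #SBATCH -n 1
    #SBATCH --mail-user=[]
    #SBATCH -t 0-12:00:00
    #####################################
    # some simple UNIX commands
    #####################################
    date
    sleep 60
    date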
The first 5 lines specify important information about the job submitted; the rest of the file contains some simple UNIX commands (date, sleep) and comments (lines starting with #####).
The "#SBATCH" prefix is used in the script to indicate a Slurm option.
#SBATCH -N 1: You should always include this line exactly. The number following -N must always be 1 unless you run MPI applications, which is rare for typical bioinformatics software.
#SBATCH -n: This is the number of tasks requested. The recommended maximum is 16 in the normal partition, i.e., don't ask for more than 16 tasks in your script unless you receive special instructions from the system administrator. It is important to request resources as accurately as you can: if possible, do not request more than what you need, and do not request less than what you need. The best way to find out how much you need is through testing. While your job is running, you can ssh to the node and use the top command to see whether it is using the requested resources properly. Note that what you request from the Slurm scheduler with -n is not necessarily the same as the actual CPUs allocated by the OS.
#SBATCH --mail-user=[]: You must replace [] with your email address.
#SBATCH -t [min] OR -t [days-hh:mm:ss]: Specifies the wall clock limit. The maximum is 14 days, i.e., a job cannot run for more than 14 days on Luria.
Submitting a job
Submit your job by executing the command:
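    sbatch myjob.sh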
where myjob.sh is the name of the submit script. After submission, you should see the message: Submitted batch job XXX where XXX is an auto-incremented job number assigned by the scheduler. For example: Submitted batch job 3200863
Submitting jobs to a specific node
To submit your job to a specific node, use the following command:
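    sbatch -w cX myjob.sh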
where X is a number specifying the node you intend to use. For example, the following command will submit myjob.sh to node c5: sbatch -w c5 myjob.sh
To submit your job while excluding certain nodes (for example, excluding c[5-22]), use the following command:
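    sbatch --exclude=c[5-22] myjob.sh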
Interactive Sessions
You should not run interactive jobs on the head node of Luria. The head node is shared by all users. An interactive job may negatively affect how other users interact with the head node or even make the head node inaccessible to all users. Thus, instead of running myjob.sh on the head node, you should run "sbatch myjob.sh". However, you can run an interactive job on a compute node. This can be done using the command "srun --pty bash", which will open a remote shell on a random compute node.
Then you can run programs interactively. This is often useful when you are compiling, debugging, or testing a program, and the program does not take long to finish.
Sometimes your program (such as MATLAB or R) may need an X11 window for its graphical user interface; in that case you can still use the command srun --pty bash, but you will also need to install an X11 client such as Xming or XQuartz on your machine to display the X window and enable X11 forwarding in your SSH client.
Remember to exit cleanly from interactive sessions when done; otherwise the session may be killed without notice.
User job limitations
A user can submit up to 1000 jobs at a time. Jobs are typically scheduled on a first-come, first-served basis. If you submit a lot of jobs at a time and take a lot of resources, others will have to wait until your jobs complete, which is not optimal for cluster usage. If you do need to submit a lot of jobs, please add options such as the following to your job scripts.
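    #SBATCH --nice
    #SBATCH --exclude=c[5-22]   # the exact node range to exclude may differ; c[5-22] follows the example used earlier on this page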
The nice option lowers the job priority and the exclude option excludes half of the nodes from your jobs. This allows others' jobs to get a chance to run while still allowing you to run some of your own jobs.
Monitoring and controlling a job
To monitor the progress of your job use the command:
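    squeue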
To display information for only the jobs you submitted, use the following (where username is your username):
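    squeue -u username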
A useful tip on customizing the output of squeue
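For example, the -o/--format option controls which columns are shown and how wide they are; the format string below is just one possibility, widening the job name column:

    squeue -u username -o "%.18i %.9P %.40j %.8u %.8T %.10M %.6D %R"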
Get more information on a job
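For example (XXX is the job ID):

    scontrol show job XXX
    # accounting information, including for finished jobs, if job accounting is enabled
    sacct -j XXX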
Viewing job results
Any job run on the cluster is output to a slurm output file slurm-XXX.out where XXX is the job ID number (for example: slurm-3200707.out).
After submitting myjob.sh, any output that would normally be printed to the screen is now redirected to slurm-XXX.out.
You can also redirect output within the submission script.
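For example, the following options in the submission script write standard output and standard error to custom files (%j expands to the job ID; the file names are placeholders):

    #SBATCH -o myjob_%j.out
    #SBATCH -e myjob_%j.err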
Deleting a job
To stop and delete a job, use the following command:
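    scancel XXX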
where XXX is the job number assigned by Slurm when you submit the job using sbatch. You can only delete your own jobs.
Checking the host status
To check the status of the cluster and its nodes, you can use the following command:
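    sinfo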
Common node states shown in the STATE column include idle (no jobs running), mix (some CPUs allocated), alloc (all CPUs allocated), drain/drng (not accepting new jobs), and down (unavailable).
Customizing output
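For example, the -N and -o/--format options of sinfo list every node with selected fields; the format string below is just one possibility:

    sinfo -N -o "%N %P %c %m %t"   # node, partition, CPUs, memory, state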
Job arrays
Slurm job arrays provide an easy way to submit a large number of independent processing jobs. For example, job arrays can be used to process the same workflow with different datasets. When a job array script is submitted, a specified number of array tasks are created based on the master sbatch script.
SLURM provides a variable named $SLURM_ARRAY_TASK_ID to each task. It can be used inside the job script to handle input/output for that task. For example, create a file named ex3.sh with the following lines.
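A minimal sketch (the array range and the FASTQ file naming scheme are placeholders):

    #!/bin/bash
    #SBATCH -N 1
    #SBATCH -n 1
    #SBATCH --array=1-2
    # pick the input file for this array task
    FASTQ=sample${SLURM_ARRAY_TASK_ID}.fastq.gz
    # count the reads in the gzipped FASTQ file (4 lines per read)
    echo "Task ${SLURM_ARRAY_TASK_ID}: $(( $(zcat ${FASTQ} | wc -l) / 4 )) reads in ${FASTQ}"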
Example job
Create a Slurm job script named ex1.sh that processes a fastq file. Include the following lines in your job script. Determine the appropriate -n value. Use the top command to watch the CPU and memory while the job is running on a compute node.
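A generic sketch (the input filename and the command used to process it are placeholders; start with -n 1 and adjust after testing):

    #!/bin/bash
    #SBATCH -N 1
    #SBATCH -n 1
    # count the reads in a gzipped FASTQ file (4 lines per read); sample.fastq.gz is a placeholder
    echo $(( $(zcat sample.fastq.gz | wc -l) / 4 ))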
Now you want to process multiple fastq files and run the script in parallel. One way to do this is to make the fastq filename an argument and sbatch the script with the filename as the argument. Create a new script named ex2.sh and include the following lines.
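A sketch of ex2.sh that reads the filename from the first argument:

    #!/bin/bash
    #SBATCH -N 1
    #SBATCH -n 1
    # the FASTQ file to process is passed as the first argument to the script
    FASTQ=$1
    echo $(( $(zcat ${FASTQ} | wc -l) / 4 ))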
You can submit the script twice with different arguments:
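    sbatch ex2.sh sample1.fastq.gz
    sbatch ex2.sh sample2.fastq.gz   # the file names are placeholders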
Running special software through Slurm
Running Jupyter notebook
Get an interactive terminal
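    srun --pty bash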
You will get a node allocated by the slurm scheduler. For example, c2.
Start notebook on the allocated node (e.g. c2).
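For example (the module name is an assumption; check module avail for the Python/Jupyter module available on Luria):

    module load python3
    jupyter notebook --no-browser --port=8888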
Open an ssh connection to luria.mit.edu and the node (e.g. c2) from your local machine, i.e. your local desktop or laptop SSH client, from either the Terminal (Mac) or PowerShell (Windows). Replace username with your own username, and c2 with the actual compute node.
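For example, a single command that opens both hops of the tunnel:

    ssh -t -L 8888:localhost:8888 username@luria.mit.edu ssh -L 8888:localhost:8888 c2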
The above commands are actually running:
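    ssh -t -L $MY_MACHINE_PORT:localhost:$HEAD_NODE_PORT username@luria.mit.edu ssh -L $HEAD_NODE_PORT:localhost:$COMPUTE_NODE_PORT $COMPUTE_NODE_NAME

with $MY_MACHINE_PORT, $HEAD_NODE_PORT, and $COMPUTE_NODE_PORT all set to 8888 and $COMPUTE_NODE_NAME set to c2 in the example above.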
It is likely that some other user has taken port 8888 ($HEAD_NODE_PORT) on the head node. In that case, you will get an error "bind: Address already in use". You should then change $HEAD_NODE_PORT from 8888 to a different port such as 8887 or 8886.
You can also change $MY_MACHINE_PORT and $COMPUTE_NODE_PORT, but that is only needed if you have another process that has taken 8888 on your local machine, or another user happens to take 8888 on the same compute node.
Tunnel from your local machine (either Windows or Mac) to Jupyter notebook running on $COMPUTE_NODE_NAME
Direct your browser on your local machine to http://localhost:8888
Close connection when finished
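For example, stop the notebook with Ctrl-C on the compute node, then exit each shell:

    exit   # leave the interactive srun session on the compute node
    exit   # close the ssh tunnel on your local machine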
Running Rstudio server
There is no Rstudio server installed on Luria. You can run Rstudio server using a singularity image with a wrapper script. Please see an example wrapper. The steps of getting an interactive terminal and opening an ssh tunnel from your local machine to the allocated compute node are similar to the Running Jupyter notebook steps above, but with a different module (module load singularity) and a different default port (for example, http://localhost:8787 on your local machine). Here local machine refers to your Mac or Windows PC. Run the ssh command from the Terminal (Mac) or Windows PowerShell (Windows). For example:
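    # replace username with your own username and c2 with the allocated compute node;
    # 8787 is the default RStudio port assumed here on both the head node and the compute node
    ssh -t -L 8787:localhost:8787 username@luria.mit.edu ssh -L 8787:localhost:8787 c2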
Tip 1: To allow the Rstudio server singularity container to access your data located on your storage server (e.g. rowley or bmc-lab2), you need to edit the rstudio.sh file to bind your data path. At the end of the rstudio.sh script, you will see a singularity exec command with many --bind arguments. In the script, you should add additional --bind arguments, for example, --bind /net/rowley/ifs/data/labname/username or --bind /net/bmc-lab2/data/lab/labname/username, where you need to replace labname and username with actual values.
Tip 2: Ignore the "open an SSH tunnel" command printed in the rstudio.sh standard output; use the ssh command in the example above on your local machine instead.
Tip 3: If someone else has taken port 8787 on the head node of Luria, you will get an error like "bind: Address already in use" when you run the ssh command from your laptop. In that case, choose a different port number, e.g. 8777 (please refer to the previous section on Jupyter notebook for an explanation of port numbers). For example:
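    # head-node port changed to 8777; the local port and the compute-node port stay at 8787
    ssh -t -L 8787:localhost:8777 username@luria.mit.edu ssh -L 8777:localhost:8787 c2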
Alternatively, you can choose a different port number before starting the rstudio.sh script on the compute node, for example
Tip 4: The script uses a custom image with Seurat dependencies pre-installed. You can select your own R version or image based on the documentation of the example wrapper.
Tip 5: If you are using Windows SecureCRT to connect to Luria, you can set up Port Forwarding (Tunneling). Select Options -> Session Options, click on "Port Forwarding", and then click the "Add" button to add a forwarding setup. You will get the Local Port Forwarding Properties dialog. Choose a Name, and then set the port for both Local and Remote, for example 8777. On the head node of Luria, you will also need to run:
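For example (assuming the compute node is c2 and RStudio is listening on port 8787 there):

    ssh -L 8777:localhost:8787 c2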
Running Matlab
An example MATLAB script: matlabcode.m
Note: fname needs to be changed accordingly.
A shell job submission script that submits matlabcode.m to a compute node:
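A minimal sketch (the task count and the module name are assumptions; adjust to the MATLAB installation on Luria):

    #!/bin/bash
    #SBATCH -N 1
    #SBATCH -n 1
    module load matlab
    # run the script non-interactively; see the options described below
    matlab -nodisplay -nosplash -r "matlabcode; exit"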
To automate the MATLAB script, the interactive session can be turned off.
On a UNIX system, MATLAB uses the X-Windows Interface to interact with the user. In a batch execution, this communication is not necessary.
We don't need the initial "splash" which displays the MATLAB logo. We can turn this off with the -nosplash option.
We also don't want the interactive menu window to be set up, which can be suppressed with the -nodisplay option.
Two other options may be useful in suppressing visual output and logs: the -nojvm option ("no Java Virtual Machine") and -nodesktop.