Package Management

Slurm solves problems 1 and 2, but we're still left with problem 3: managing packages on our cluster when the needs of our users vary in different, and sometimes contradictory, ways.

There are three main methods by which this problem is tackled on Luria:

  • Environment modules

  • Conda environments

  • Containerization

We'll discuss the first two methods, then come back to containerization a bit later, as it is a much more complex topic.

Docker

  • Founded in 2010

  • Most popular container engine

  • Available on Linux, Mac, Windows

  • Requires root privileges

Docker manages Dockerhub, a central repository where people can upload Docker images to share with the wider community.

We're going to grab an image from Dockerhub and use it for some examples. A very basic image available on Dockerhub is the Debian image, which provides a bare-bones Debian environment.

High Performance Computing Clusters

High-performance computing clusters have multiple CPUs and massive amounts of RAM all distributed between multiple machines, and all used by multiple people with varying needs. Because of this, multiple common problems arise:

  1. Distributing the processing load from so many people's jobs onto computing nodes.

  2. Running interactive programs on the computing cluster.

  3. Managing packages when users have varying, sometimes contradictory needs.

How do we manage these problems on our computing cluster?

Advanced Utilization of IGB Computational Resources

Docker Installation

The easiest way of installing Docker is by downloading Docker Desktop. Docker Desktop is a nice GUI front-end for Docker, so you can see your containers, images, and active builds, log in to your Docker Hub account, etc. It also comes with the Docker command line tools that you'll need for building Docker images.

Download Docker Desktop for your computer and follow the instructions to install it.

On Mac, you can access the Docker command line tools by calling docker from Terminal.

On Windows, the Docker command line tools will be invoked by calling docker.exe from the Command Prompt.

For brevity's sake, the rest of this material will refer to the Docker command line tools by calling docker. If you're following along on Windows, make sure to replace docker with docker.exe.

Docker Desktop

Slurm

To solve problems 1 and 2, we use a program called Slurm on our cluster. Slurm is a "job scheduler": it receives "jobs" and then sends them to computing nodes in a way that utilizes resources as efficiently as possible.

You never want to run resource-intensive programs on the head node. Always delegate resource-intensive jobs to Slurm so that it can send your job to a compute node. This benefits you and all other users: your job gets more processing power, and the head node's processing power stays free for Slurm to schedule everyone's jobs.

Our Luria cluster has the following nodes:

Nodes    CPU Cores    CPU Model
c1-4     16 cores     Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz
c5-40    8 cores      Intel(R) Xeon(R) CPU E5620 @ 2.40GHz
b1-16    48 cores     Intel(R) Xeon(R) Gold 6240R CPU @ 2.40GHz or 5220R CPU @ 2.20GHz

These nodes are organized into the following partitions:

Partition    Nodes
kellis       b1-12
bcc          b13-17, c1-4
normal       c5-40

The kellis and bcc partitions are reserved for their respective labs. The normal partition is the default partition that can be used by any lab. You should never use the kellis and bcc partitions unless you have been given express permission to do so.

Interactive Sessions

While submitting a single script to Slurm can be useful, sometimes you want to quickly test programs on the command line. How can you do that while still taking advantage of the compute nodes' processing power?

Slurm has a built-in utility for doing just this, srun. To run an interactive session, run:

srun --pty bash

This will assign you to a compute node, then start a bash shell session on that node. You can now run programs interactively on the command line, just as if you were on the head node. This is often useful when you are compiling, debugging, or testing a program, and the program does not take long to finish.

Remember to exit interactive sessions cleanly when you're done; otherwise, the session may be killed without notice.
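srun accepts the same resource flags as sbatch, so you can size an interactive session explicitly. A typical sequence might look like the following (the core count here is illustrative):

```shell
# Request an interactive shell with 4 cores on a compute node
srun -n 4 --pty bash

# Confirm which compute node you landed on
hostname

# Exit cleanly when finished, returning you to the head node
exit
```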

The System Package Manager

Environment Modules

Environment modules are bundles that hold a program and all the shell environment information that the program needs to run. When these modules are loaded by a user, that user's shell environment will be modified to include the environment in the module. When the module is unloaded, the module's environment is cleanly removed and no changes are made to the user's original environment.
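To build intuition: loading a module mostly amounts to editing your shell environment (chiefly PATH), and unloading undoes that edit. The sketch below simulates this with an ordinary shell, using a made-up /tmp/modules directory and tool name; it is an illustration of the mechanism, not the real module command:

```shell
# Simulate a "module" that provides a tool in its own directory
# (the directory and tool name here are hypothetical)
mkdir -p /tmp/modules/mytool/1.0/bin
printf '#!/bin/sh\necho "mytool 1.0"\n' > /tmp/modules/mytool/1.0/bin/mytool
chmod +x /tmp/modules/mytool/1.0/bin/mytool

# Loading the module effectively prepends the tool's bin dir to PATH
export PATH="/tmp/modules/mytool/1.0/bin:$PATH"
mytool                     # prints: mytool 1.0

# Unloading the module cleanly removes that entry again
export PATH="${PATH#/tmp/modules/mytool/1.0/bin:}"
hash -r                    # clear the shell's command cache
command -v mytool >/dev/null || echo "mytool no longer available"
```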

You can see what modules are available on Luria by running:

module avail

You'll notice we have modules for common scientific programs, such as bwa or R. You'll also notice that there are multiple versions of each of these programs.

Since the modules are completely separate from one another, there can be modules for different versions of the same software without any conflict between the versions. This way, the needs of multiple users can be satisfied.

To load a module, you'd run:

module load <module name>/<module version>

So to load R v3.4.3:

module load r/3.4.3

Now that the module is loaded, any programs in the module should be available on the command line. So after loading the R module, you can run R.

Once you no longer need the module, you unload it by running:

module del <module name>

So:

module del r

R will no longer be available on the command line.

Modules in Slurm

If you have a script, for example an R script, that you want to submit to sbatch for processing, you'll have to write a submission script that first loads the appropriate module, then runs the program or script you want to run.

For example, write a script named myRjob.sh:

#!/bin/bash
#SBATCH -N 1
#SBATCH -n 1
#SBATCH --mail-type=END
#SBATCH [email protected]
###################################

module load r/3.4.3

Rscript /path/to/your/script.R

This script makes sure that the R module is loaded before running the R script. Now, when you submit the job to Slurm with sbatch myRjob.sh, it will run correctly.

Example script 1:

#!/bin/bash
#SBATCH -N 1 # Number of nodes. You must always set -N 1 unless you receive special instruction from the system admin
#SBATCH -n 8 # Number of tasks. Don't specify more than 16 unless approved by the system admin

module load fastqc/0.11.5
module load bwa/0.7.17
mkdir -p ~/data/class
cd ~/data/class
fastqc -o ~/data/class /net/rowley/ifs/data/dropbox/test_1.fastq
bwa mem -t8 -o ex1.sam /home/Genomes/bwa_indexes/mm10.fa /net/rowley/ifs/data/dropbox/UNIX/test_1.fastq

Example script 2:

#!/bin/bash
#SBATCH -N 1                      # Number of nodes. You must always set -N 1 unless you receive special instruction from the system admin
#SBATCH -n 16                     # Number of tasks. Don't specify more than 16 unless approved by the system admin

module load fastqc/0.11.5
module load bwa/0.7.17
FILE=$1
WORKDIR=~/data/class
mkdir -p $WORKDIR
cd $WORKDIR
fastqc -o $WORKDIR $FILE
bwa mem -t16 -o $(basename $FILE).sam /home/Genomes/bwa_indexes/mm10.fa $FILE

Example script 3:

#!/bin/bash

#SBATCH -N 1
#SBATCH -n 4
#SBATCH --array=1-2

module load fastqc/0.11.5
module load bwa/0.7.17
FASTQDIR=/net/rowley/ifs/data/dropbox/
WORKDIR=~/data/class
mkdir -p $WORKDIR
cd $WORKDIR
FILE=$(ls $FASTQDIR/*.fastq | sed -n ${SLURM_ARRAY_TASK_ID}p)
fastqc -o $WORKDIR $FILE
bwa mem -t4 -o $(basename $FILE).sam /home/Genomes/bwa_indexes/mm10.fa $FILE
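The key line in example script 3 is the one that selects a file from the array task ID: ls lists the FASTQ files one per line, and sed -n "${SLURM_ARRAY_TASK_ID}p" prints only that line number, so array task 1 gets the first file, task 2 the second, and so on. A standalone sketch you can run without Slurm (the directory and file names are made up):

```shell
# Simulate what one array task sees. Slurm sets SLURM_ARRAY_TASK_ID for
# real array jobs; here we set it by hand to demonstrate the selection.
FASTQDIR=/tmp/fastq_demo
mkdir -p $FASTQDIR
touch $FASTQDIR/sample_a.fastq $FASTQDIR/sample_b.fastq

SLURM_ARRAY_TASK_ID=2
# sed -n "<N>p" prints only line N of its input
FILE=$(ls $FASTQDIR/*.fastq | sed -n ${SLURM_ARRAY_TASK_ID}p)
echo "Task $SLURM_ARRAY_TASK_ID processes: $FILE"
# → Task 2 processes: /tmp/fastq_demo/sample_b.fastq
```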

Containerization

Recall that there is a third way we handle package management on Luria: containerization. Similar to environment modules and Conda environments, containerization is a way of packaging an environment into a container that holds the absolute bare minimum needed to run a specific set of software. Containers can be thought of as very stripped-down operating systems packaged into a little box, although that is not technically what they are.

Containers have their own filesystem in which the software runs: a snapshot of exactly the filesystem that piece of software needs. Therefore, if a container can run a piece of software now, it should be able to do so always.

Since a containerized application contains everything it needs in order to run, it is very easy to share containers with people so that they can easily run software.

Container pros:

  • Portable

  • Shareable

  • Consistent

Popular container engines:

  • Docker

  • Singularity

Checking the Status of Computing Nodes

You can check the general status of the computing nodes by using the sinfo command, which will display the following:

PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
normal*      up 14-00:00:0      1 drain* c8
normal*      up 14-00:00:0      1  drain c39
normal*      up 14-00:00:0     18    mix c[5-7,9-14,16-19,21-22,24-26]
normal*      up 14-00:00:0      1  alloc c40
normal*      up 14-00:00:0     14   idle c[15,20,27-38]
bcc          up 14-00:00:0      5    mix b[13-16],c2
bcc          up 14-00:00:0      4   idle b17,c[1,3-4]
kellis       up 28-00:00:0     12    mix b[1-12]

We've also provided a custom command, nodeInf, that will give more detailed information about each node. For example:

NODELIST PARTITION CPUS(A/I/O/T) CPU_LOAD FREE_MEM   MEMORY    STATE
      b1    kellis    82/14/0/96   145.58   405507   768000    mixed
      b2    kellis    84/12/0/96   302.81   504957   768000    mixed
...
      c1       bcc     0/32/0/32     0.01   127590   128000     idle
      c2       bcc     8/24/0/32    15.34      380   128000    mixed
      c3       bcc     0/32/0/32     0.01   127534   128000     idle
      c4       bcc     0/32/0/32     0.01   127505   128000     idle
      c5   normal*     2/14/0/16     0.01   127561   128000    mixed
      c6   normal*     2/14/0/16    18.90      344   128000    mixed
...
  • NODELIST - The name of the node

  • PARTITION - The partition which the node belongs to

  • CPUS (A/I/O/T) - The number of CPUs that are Allocated/Idle/Other, and the Total number of CPUs

  • CPU_LOAD - The load on the CPU

  • FREE_MEM - The amount of free RAM on the node

  • MEMORY - The total amount of RAM on the node

  • STATE - The state of the node. Idle means the node is not in use; mixed means it is in use but still has resources available; allocated means it is fully in use; drain/drained means the node is not accepting new jobs (usually for maintenance).
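Because these commands emit plain columnar text, you can filter them with standard Unix tools, for example to find an idle node with plenty of free memory before targeting it with sbatch -w. A sketch, run against a saved sample of the nodeInf columns above (the temporary file path is just for the demonstration):

```shell
# Save a sample of nodeInf-style output (columns as described above:
# NODELIST, PARTITION, CPUS, CPU_LOAD, FREE_MEM, MEMORY, STATE)
cat > /tmp/nodeinf.txt <<'EOF'
c1       bcc     0/32/0/32     0.01   127590   128000     idle
c2       bcc     8/24/0/32    15.34      380   128000    mixed
c3       bcc     0/32/0/32     0.01   127534   128000     idle
c5   normal*     2/14/0/16     0.01   127561   128000    mixed
EOF

# List idle nodes (column 7) with their free memory (column 5),
# most free memory first
awk '$7 == "idle" { print $1, $5 }' /tmp/nodeinf.txt | sort -k2 -nr
# → c1 127590
#   c3 127534
```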

SSH Port Forwarding

Many researchers' workflows consist of using special software such as Jupyter Notebooks or an RStudio server. While it's possible to run these pieces of software on your own laptops or desktops, sometimes the workload that's running on these is too resource-intensive for this to be quick or feasible at all. Thus, it would be convenient to be able to run these pieces of software on the computing cluster to take advantage of the raw processing power available on it.

Singularity

  • Started in 2015

  • Linux only, primarily used in HPC

  • Integrates with Slurm

  • Does not require root privileges

Docker isn't available on Luria, but Singularity is, and it supports running Docker images. There are also multiple Singularity registries online, similar to Dockerhub, which host many useful images built for use in Singularity. It's best to look for images in these registries before falling back to a Docker image, as Docker images may not always work exactly as intended in Singularity.

Submitting Jobs / Slurm Scripts

The simplest way of submitting a job to Slurm is to create a script for your job, then submit that script using the program sbatch.

For example, let's say I have the following script, named myjob.sh, which simply prints out the text "Hello, world!":

#!/bin/bash
echo "Hello, world!"

I could then submit this to Slurm by running sbatch myjob.sh. Slurm will give us the ID of the job, for example:

Submitted batch job 8985982

Slurm will receive the script, determine which of the computing nodes is best suited for running it, and send the script to that node to run. When the program finishes, Slurm will write a file named slurm-<ID>.out containing whatever the program output. In this case, I should see a file named slurm-8985982.out with the contents Hello, world!.

In this case, we provide no configuration options to sbatch, so it will submit the job with default options. However, sbatch has options for specifying things like the number of nodes to submit a job to, how many CPU cores to use, and who to email regarding the status of a job. These options are useful, so when submitting a job, it's worth specifying them. sbatch allows us to specify these options in a script by providing comments at the beginning of the script. For example:

#!/bin/bash
#SBATCH -N 1
#SBATCH -n 1
#SBATCH --mail-type=END	
#SBATCH [email protected]
###################################
echo print all system information
uname -a
echo print effective userid
whoami
echo Today date is:
date
echo Your current directory is:
pwd
echo The files in your current directory are:
ls -lt
echo Have a nice day!
sleep 20
  • SBATCH -N 1 - Specifies the number of nodes Slurm should submit the job to. You must always set -N 1 unless you receive special instruction from the system admin.

  • SBATCH -n 1 - Specifies the number of cores Slurm should delegate to the job. Don't specify more than 16 unless approved by the system admin.

  • SBATCH --mail-type=END - Specifies when you should receive an email notification regarding the job. Options are BEGIN,END,FAIL,ALL.

  • SBATCH [email protected] - Specifies the email that should receive notifications about the status of this job.

Submitting Jobs to Specific Nodes

To submit your job to a specific node, use the following command:

sbatch -w [cX] [script_file]

Where X is a number specifying the node you intend to use. For example, the following command will submit myjob.sh to node c5:

sbatch -w c5 myjob.sh

To submit your job while excluding certain nodes (for example, to exclude nodes c5 to c22), use the following command:

sbatch --exclude c[5-22] myjob.sh

The same flags are applicable when running srun.

You can also add these flags to your script as SBATCH comments instead of submitting them as command line flags. For example, to submit a script to sbatch which you'd like to be submitted to node c5, you'd write:

#!/bin/bash
#SBATCH -N 1
#SBATCH -n 1
#SBATCH --mail-type=END	
#SBATCH [email protected]
#SBATCH -w c5
###################################
echo print all system information
uname -a
echo print effective userid
whoami
echo Today date is:
date
echo Your current directory is:
pwd
echo The files in your current directory are:
ls -lt
echo Have a nice day!
sleep 20

Submitting Jobs to Specific Partitions

To submit your jobs to a specific partition, use the following command:

sbatch -p [partition name]

Where [partition name] is one of: normal, bcc, kellis. If no partition is provided, Slurm defaults to normal.

You can also add this flag to your script as an SBATCH comment instead of passing it as a command line flag. For example, to submit a script to sbatch that you'd like to run on the bcc partition, you'd write:

#!/bin/bash
#SBATCH -N 1
#SBATCH -n 1
#SBATCH --mail-type=END	
#SBATCH [email protected]
#SBATCH -p bcc
###################################
echo print all system information
uname -a
echo print effective userid
whoami
echo Today date is:
date
echo Your current directory is:
pwd
echo The files in your current directory are:
ls -lt
echo Have a nice day!
sleep 20

The same flag is applicable when running srun.

Monitoring and Controlling a Job

To monitor the progress of your job, use the command squeue. To display only the jobs you submitted, run the following (where username is your username):

squeue -u username

Cancelling a Job

To cancel a job that you are running, you can use the scancel command and pass it your job ID:

scancel <JOB ID>

For example:

scancel 1234567

Conda Environments

Per their website, Conda is a tool that "provides package, dependency, and environment management for any language." Essentially, Conda lets you create your own environments that contain the necessary pieces of software needed to run whatever program(s) or pipelines you need.

Creating and Activating an Environment

Conda is provided on Luria as an environment module, so to use it you'll first have to load the miniconda3/v4 module:

module load miniconda3/v4

When miniconda3 is loaded, you'll be asked to run:

source /home/software/conda/miniconda3/bin/condainit

Make sure to do so. Now, the conda program will be available to you.

To create a new Conda environment, you'll first have to name it. It's typical to make a new environment for a particular task or pipeline, or for a single tool that needs to be isolated, so name the environment accordingly. Once the environment is created, it can be activated:

conda create --name example_environment

conda activate example_environment

What's happening here? When you create an environment, Conda creates a new directory in ~/.conda/envs with the environment's name. This directory is where any packages and libraries installed via Conda will be placed. Activating a Conda environment adds this directory to your shell's environment, so that you can use any packages and libraries present in it as if they were installed to the system.

While the Conda environment is activated, you can use Conda to install packages to it. Any packages you install will be placed in the active environment's directory:

conda install pigz # install a parallel compression program

pigz --version

which pigz # this will show you that the pigz program is installed in the Conda environment directory

Conda installs packages from what are called "channels". Channels are remote repositories that contain packages. Typical channels include anaconda, conda-forge, and bioconda. Each channel contains its own set of packages, so it's best to know which channel hosts the software you need. You can search a channel for software by running:

conda search <channel name>::<package>

Or by simply looking it up online.

Once you are done using an environment, you can deactivate it. This is similar to unloading an environment module. If you ever need that environment again, you simply activate it and proceed to use the programs you installed to it previously:

conda deactivate

conda env list # Lists what environments you've created

conda activate example_environment # Activate the environment again

The following is a real-world example of a good use case for creating a Conda environment.

Let's say you want to use the program radian, which provides a more modern R console experience than baseline R. However, there is no module available for radian. According to radian's documentation, radian is located on the conda-forge channel. Therefore, to make a Conda environment for radian and install both it and R, you'd do the following:

module load miniconda3/v4

source /home/software/conda/miniconda3/bin/condainit

conda create --name radian_environment

conda activate radian_environment

conda install -c conda-forge radian r-base

Now, whenever you want to run radian, you just load Conda and activate the environment you created for it.

Declaratively Defining a Conda Environment

Instead of imperatively creating an environment, you can write a YAML file that describes the structure of your environment: the environment's name, which channels it will pull packages from, and which packages it needs. Conda can then create an environment from that file.

Defining an environment this way makes it easy to remember what packages you need for your use case if you ever have to recreate the environment in the future. It also makes it easy to share an environment with other researchers so that they can get up and running quickly.

Consider the following example: you need to run a pipeline that requires the following pieces of software at the corresponding versions:

trim_galore/0.6.6
parallel/20200922
bioawk/1.0
perl/5.26.2
cutadapt/1.18
bowtie2/2.2.4

You can create the following YAML file, called pipeline_example.yml:

name: pipeline_example
channels:
- bioconda
- conda-forge
- defaults
dependencies:
- trim-galore=0.6.6
- parallel=20200922
- bioawk=1.0
- perl=5.26.2
- cutadapt=1.18
- bowtie2=2.2.4
prefix: /home/software/conda/miniconda3

This YAML file details the name of the Conda environment, which channels it should install packages from, which packages need to be installed, and the Conda prefix directory.

Now, you can have Conda create the environment and activate it:

module load miniconda3/v4

source /home/software/conda/miniconda3/bin/condainit

conda env create -f pipeline_example.yml

conda activate pipeline_example

This saves you the trouble of creating the environment yourself and then manually installing each package.

If you ever need to make changes to this environment, you can update the YAML file, then run:

conda activate pipeline_example
conda env update --file pipeline_example.yml --prune

Conda Environments in Slurm

Using your Conda environments in a Slurm script is very similar to using environment modules in a Slurm script. You just have to have the script load miniconda3, then activate the appropriate environment, like so:

#!/bin/bash
#SBATCH -N 1
#SBATCH -n 1
#SBATCH --mail-type=END
#SBATCH [email protected]
###################################

module load miniconda3/v4
source /home/software/conda/miniconda3/bin/condainit
conda activate pipeline_example

# <your pipeline commands>

Sharing Conda Environments

If you would like to share a virtual environment that you've created with others, it's important to export the environment first. In so doing, you protect yourself from any modifications the other user might make to your environment, and you make the environment portable, so that they can copy it to their own directory or build on top of it without affecting your work.

# To export a Conda environment to a YAML file:

[user1]~ conda activate myenv # Activate the environment you'd like to export
[user1]~ conda env export > environment.yml # You may now share this file with whomever wishes to use it

# After grabbing the YAML file and copying it to their home directory, another user can create a new environment from environment.yml:

[user2]~ conda env create -f environment.yml # Create the new environment from the file
[user2]~ conda activate myenv # Activate the new environment
[user2]~ conda env list # Verify that the new environment was installed correctly

For more information, see the official Conda documentation and a more detailed guide from The Carpentries.

SSH Port Forwarding Jupyter Notebooks

To install Jupyter Notebooks, it's best to create a Conda environment for it. Jupyter Notebooks is a fairly large piece of software, so Conda will have to solve a large set of dependencies and then download many packages. While this isn't necessarily resource-intensive, it's still a good idea to run it on a compute node.

# Connect to a compute node. Take note of which compute node
srun --pty bash

# Load the Conda module environment
module load miniconda3/v4

source /home/software/conda/miniconda3/bin/condainit

# SKIP CREATING THE CONDA JUPYTER ENVIRONMENT AND INSTALLING JUPYTER
# IF YOU HAVE AN EXISTING ENVIRONMENT.
# IF YOU HAVE AN ENVIRONMENT ALREADY, SIMPLY ACTIVATE IT

# Create Conda environment for Jupyter
conda create --name jupyter_environment

# Activate the Jupyter Conda environment
conda activate jupyter_environment

# Install Jupyter
conda install -c anaconda jupyter

Now that Jupyter has been installed and we're on a compute node, we should be able to start up a Jupyter Notebooks server by running the following:

jupyter notebook --ip=0.0.0.0 --port=12345

# This should output URLs to access Jupyter Notebooks from
# We will use the last URL
http://127.0.0.1:12345/?token=63e593c74248876b14c0d4299f454cb8ccd1f18725538c5e

What is happening here?

When a Jupyter Notebook server runs, it does two important things: it binds itself to an IP address, and it binds itself to a port.

By binding itself to an IP address, the server dictates which IP address it can be accessed at. The address it binds to will be one of the IP addresses available to the computer the server is running on.

However, many different services can run on a single computer, so each service also binds itself to a port. When something tries to connect to the IP address of a computer, it must specify which port on that computer it wants to access. You can see this in the URL that Jupyter Notebooks gave us: it has the IP address 127.0.0.1 and the port 12345.

Jupyter Notebooks is now running and should be accessible at the specified URL + port. However, if we go to this URL in our computer's browser, we'll see that we aren't able to connect to it.

Why is this?

Jupyter Notebooks is running on one of the compute nodes. So when it binds to an IP address and port, it does so on that compute node's internal network. This internal network is not public and can't be accessed from outside that compute node. In fact, the IP address 127.0.0.1 (also known as localhost) is the "loopback address", i.e. the computer referring to itself. So when we type this address into our own web browsers, our computers are actually looking at that port on their OWN network. Of course, our computers aren't running Jupyter Notebooks, so we won't actually connect to anything.

How can we access this Jupyter Notebook server when it's running on a completely different private network?

Using SSH Port Forwarding

We're already using a tool that lets us create secure connections to another computer: SSH. We can leverage a built-in SSH feature, port forwarding, to forward a port on a remote computer's network to our local computer's network.

However, remember the structure of our Luria cluster. We first SSH to the Luria head node. The compute nodes are not available for us to SSH into directly. Slurm has a feature that lets us SSH into a compute node as long as we have a job running on it, which in our case is true since we are running a Jupyter Notebooks instance on a compute node. Therefore, we will have to port forward through two networks in order to complete the connection between our local computer's network and the network on the compute node.

On your local computer, run the following command. Make sure to fill in the appropriate username, compute node, etc. (Do not run this on the VSCode terminal if you are using VSCode. Open Windows Powershell / MacOS Terminal and paste the command in there)

ssh -t <username>@luria.mit.edu -L 12345:localhost:12345 ssh <compute node where your job is running> -L 12345:localhost:12345
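Broken down piece by piece, the command chains two forwards, one per network boundary:

```shell
# ssh -t <username>@luria.mit.edu \   # hop 1: your computer -> head node
#     -L 12345:localhost:12345 \      # your computer:12345 -> head node:12345
#   ssh <compute node> \              # hop 2: head node -> compute node
#     -L 12345:localhost:12345        # head node:12345 -> compute node:12345
```

The result is that a browser pointed at localhost:12345 on your own computer reaches port 12345 on the compute node, where Jupyter Notebooks is listening.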

Now, if we open our browsers and navigate to the URL that Jupyter Notebooks gave us earlier, we should see Jupyter Notebooks!

Full Jupyter Notebooks Example

On Luria:

# Connect to a compute node. Take note of which compute node
srun --pty bash

# Load the Conda module environment
module load miniconda3/v4

source /home/software/conda/miniconda3/bin/condainit

# SKIP CREATING THE CONDA JUPYTER ENVIRONMENT AND INSTALLING JUPYTER
# IF YOU HAVE AN EXISTING ENVIRONMENT.
# IF YOU HAVE AN ENVIRONMENT ALREADY, SIMPLY ACTIVATE IT

# Create Conda environment for Jupyter
conda create --name jupyter_environment

# Activate the Jupyter Conda environment
conda activate jupyter_environment

# Install Jupyter
conda install -c anaconda jupyter

# Start Jupyter Notebooks. Choose a random 5-digit number for your port.
# Take note of the URL this gives you.
jupyter notebook --ip=0.0.0.0 --port=<port>

On your local computer, run the following command. Make sure to fill in the appropriate username, compute node, etc. (Do not run this on the VSCode terminal if you are using VSCode. Open Windows Powershell / MacOS Terminal and paste the command in there).

# SSH port forward
ssh -t <user>@luria.mit.edu -L <port>:localhost:<port> ssh <compute node> -L <port>:localhost:<port>

Open your web browser and navigate to the URL that Jupyter Notebooks gave you.

This process will look very similar for most software of this kind. Just run the software on a compute node and take note of the compute node and port that it is running on and adjust the SSH port forwarding as necessary.

Running Docker Images

To create a basic Docker container from the Debian image, we run the following:

docker run debian echo 'Hello, World!'

# If this is your first time running the debian image, this will pull a lot of data from Dockerhub, then run:

Hello, World!

What's happening here? We invoke the docker command, and tell it to run a command in a container created using the debian image, which it gets by default from Dockerhub. The rest of this line simply tells Docker what command to run in the container, in this case echo 'Hello, World!'.

Important note: Docker images are built for specific CPU architectures (i.e. amd64 vs arm64), and you can only run an image if it's the same architecture as your computer. Many popular Docker images have versions for both amd64 and arm64, but it's up to you to check whether a version compatible with your CPU's architecture exists before trying to run an image.

We can explore this container a bit more by creating an interactive session inside of it. This allows us to see the filesystem present in the container.

To start an interactive session in a container created using the debian image:

docker run -it debian bash
root@dsliajldkajs:/#

Inside the container we can't do very much, but we can see that it has its own filesystem with the usual FHS layout. We can certainly make changes to this filesystem, just as we would on a normal Unix system. However, those changes only live as long as the container: a fresh container starts again from the image's original filesystem.

If data inside the container does not survive container reboots, how does any data persist?

There are two ways that Docker allows us to persist data: bind mounting the host's filesystem or creating a Docker volume.

We'll focus on bind mounting first. Bind mounting essentially lets you poke a hole in a Docker container's filesystem that points to somewhere on your host computer's filesystem.

To bind mount a directory, we pass the -v flag when we run the container, providing <source>:<destination>, where <source> is the host directory you want accessible in the container and <destination> is the path inside the container where it will appear.

For example, to bind mount a local directory in the Debian container:

docker run -v "/home/asoberan:/mnt" -it debian bash
root@mlfmlkma:/# ls /mnt

# Your files should be present in the /mnt directory inside the container

Now, if you create something in /mnt in the container, you'll see those changes made on your local directory as well.

This Debian image is pretty barebones, as you've seen. Images such as this aren't meant to be used outright, but to be built upon to create other, more useful images.

We'll use one of these more useful images to set up an R development environment.

The image we'll be using is rocker/rstudio, an image made by the R community for setting up a barebones R environment, or for building a more robust R environment on top of.

Let's start up an interactive session using the rocker/rstudio image available on Dockerhub.

docker run --rm -it rocker/rstudio bash
root@damldkmsla:/# R
> library("tidyverse")
Error in library("tidyverse") : there is no package called ‘tidyverse’
> install.packages(c("tidyverse"))
# tidyverse installation output
> library(tidyverse)
── Attaching core tidyverse packages ─────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.0     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2
── Conflicts ───────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all
  conflicts to become errors

As you can see, rocker/rstudio does not come with tidyverse built-in. However, the R environment it provides is just like any other R environment, so it's incredibly simple to install it.

Working with R in the command line can be fairly cumbersome. The real power of rocker/rstudio is that it comes built-in with an RStudio server.

By default, RStudio Server binds to port 8787. However, that port lives inside the container's own network. So, as before, we'll need to forward the port from the container to our local network. Thankfully, Docker has a built-in way of doing this: the -p flag, which is supplied with <host port>:<container port>, where the host port is the port on your own computer and the container port is the port inside the container. To keep things simple, we'll use the same port number for both.

docker run --rm -it -p 8787:8787 rocker/rstudio

This should start an RStudio server which you can access on your computer's web browser at http://localhost:8787.

Remember, any files you create in this RStudio Server are created in the Docker container. When the Docker container stops, those files will be gone. If you want to save your files or use R files you have from previous work, it's best to bind mount the directory with your files and make sure to only make changes to the bind-mounted directory in the container.
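Putting the two flags together, a session that publishes the RStudio port and bind mounts a working directory might look like this (the host path is a placeholder for your own directory):

```shell
docker run --rm -it -p 8787:8787 -v "/path/to/your/project:/home/rstudio/project" rocker/rstudio
```

Files saved under /home/rstudio/project inside the container will then survive on the host after the container stops.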

Differences from Docker

Before we use Singularity, we must understand that it works differently from Docker in very subtle ways:

  1. Images are files

  2. Images are read-only

  3. Singularity automatically bind-mounts your home directory

1. Images are files

Singularity can manage images itself so you never have to see where or how they're installed. However, images in Singularity can also be created as a SIF file that you manage just like any other file.

Since Singularity uses SIF files for images, Docker images will need to be converted to SIF. Thankfully, this feature is built into Singularity and will be invoked automatically when we run a Docker image in Singularity.

Managing SIF files ourselves can be useful for having one single image shared between an entire lab, instead of each lab member downloading their own image for the exact same tools.

2. Images are read-only

Just like Docker images, SIF files contain their own filesystem with the environment needed to run whatever programs are packaged in it. However, whereas Docker lets you make temporary changes to this filesystem, the Singularity image's filesystem is read-only. Therefore, you won't be able to create or delete any files inside the image when you start an interactive session in it.

Images being read-only sometimes makes running images built for Docker cumbersome, as we will see later.

3. Singularity automatically bind-mounts your home directory

When you enter a Singularity image, Singularity will automatically bind mount your home directory to the home directory inside the image. Everything inside that home directory will therefore be writable, and you'll be able to use the tools available in the Singularity image to work on files in your own home directory.

Running Nextflow / nf-core Pipelines

Nextflow is a system which allows you to build reproducible pipelines. It chains together simple actions to create a complex data analysis pipeline. People have used Nextflow to create bioinformatics pipelines for many different operations, including RNASeq analysis, Hi-C analysis, etc.

NF-Core is "a community effort to collect a curated set of analysis pipelines built using Nextflow." You can find many popular bioinformatics Nextflow pipelines on the nf-core website.

We can take advantage of nf-core on our cluster by installing it in a Conda environment. Before doing so, however, we must set a couple of environment variables in our ~/.bashrc files that Nextflow and nf-core need to correctly cache the Singularity images they'll be using throughout the pipeline.

Edit your ~/.bashrc file and append these environment variables to the end of the file:

export NXF_SINGULARITY_CACHEDIR="$HOME/.singularity/cache"
export NXF_OFFLINE='TRUE'

To make sure these environment variables are set, you can either log out of Luria and log back in, or run source ~/.bashrc to load the new shell environment.

Installing nf-core / Nextflow

Nextflow and nf-core are installed through Conda, so we'll want to make sure we activate the Conda module before starting:

They also require us to have specific channels configured:

Once these channels have been added, we can go along with the installation:

Currently, Nextflow 24 is the most compatible version with our system. Nextflow will advise you to update, but please do not, as this will break your pipelines.

Using nf-core / Nextflow

You can either check the nf-core website to see what Nextflow pipelines are available, or you can use the command line nf-core tool. The command line tool will also give you information about which pipelines you have installed, the versions installed, the last time you used them, etc.

Nextflow pipelines all require the revision number and different parameters for running. You can see what parameters are available for a particular revision of a pipeline and which are required at the pipeline's corresponding web page, or by running the pipeline without any parameters and reading the Nextflow error log.

Nextflow also requires you to specify a "profile" for running a pipeline. A profile is essentially a set of sensible settings that the pipeline should run with. Each pipeline defines its own profiles, including two test profiles: test, which runs the pipeline with a minimal public dataset, and test_full, which runs the pipeline with a full-size public dataset.
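Schematically, a pipeline invocation names the pipeline, a revision, and one or more profiles; the angle-bracketed parts below are placeholders:

```shell
nextflow run nf-core/<pipeline> -r <revision> -profile test,singularity --outdir <dir>
```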

In addition to these, nf-core provides profiles for common containerization software, such as Docker, Podman, and Singularity.

We're going to run an example pipeline, nf-core's rnaseq pipeline v3.14.0. The parameters for this pipeline are enumerated at https://nf-co.re/rnaseq/3.14.0/parameters. The two required parameters are --input, the "path to comma-separated file containing information about the samples in the experiment," and --outdir, "the output directory where the results will be saved."

We'll use the test profile to ensure the pipeline can install and run correctly. We'll also use the singularity profile, since Luria is set up for use with Singularity. The test profile will give the pipeline its own inputs, so we'll only need to specify --outdir. Make sure you load the singularity module, since the singularity profile instructs Nextflow to use Singularity to set up the pipeline.

Nextflow will begin to download the necessary Singularity images to run the rnaseq pipeline v3.14.0. This should take anywhere between 7-12 minutes. Since we've set the necessary environment variables for Nextflow to see the Singularity image cache, subsequent runs of this revision of the pipeline will start up much faster.

As the Nextflow pipeline runs, it will put metadata into .nextflow/cache and other data into the work/ directory. If the pipeline errors out at any point, you can read the error log, fix the issue, then add the -resume flag to your command to resume from where you left off. Nextflow will read the metadata and data it generated in the previous run to know where in the pipeline to start back up from.

Once the pipeline is finished setting itself up, it will run with a minimal public dataset as input, then output the results into the test/ directory we specified. This directory will have extensive information about multiple points of the run.

Building Docker Images

Docker is a container engine, but it's also an image build tool. You can build Docker images yourself by creating a Dockerfile, essentially a file that outlines each step in creating your image.

Below are the common commands used in a Dockerfile to outline these steps:

  • FROM - Dictates the base image you're building off of.

  • LABEL - A simple label attached to your image as metadata. A common label would be description for writing a description of the image.

  • RUN - Runs the command you specify in the image. For example, if the base image is Ubuntu, then you can run any Ubuntu commands here. A common command would be apt-get install <package> to install an Ubuntu package into your image.

  • CMD - The command that should run when the container is started. This tends to be the major software that is being packaged.
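As a sketch, the four commands above combine into a minimal Dockerfile. This toy example is hypothetical (an ubuntu base image and the cowsay package), not the Seurat image we'll build below:

```dockerfile
# FROM: the base image to build on top of
FROM ubuntu:22.04

# LABEL: metadata attached to the image
LABEL description="Toy image that installs and runs cowsay"

# RUN: commands executed while building the image
RUN apt-get update && apt-get install -y cowsay

# CMD: the command run when a container starts from the image
CMD ["/usr/games/cowsay", "Hello, World!"]
```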

Knowing these is enough to build a simple Docker image. We'll be using this knowledge to build our own Docker image for Seurat.

Seurat is an R package designed for QC, analysis, and exploration of single-cell RNA-seq data. Seurat aims to enable users to identify and interpret sources of heterogeneity from single-cell transcriptomic measurements, and to integrate diverse types of single-cell data.

We'll use rocker/rstudio as a base so that we can have RStudio available to us automatically.

Create a file named "Dockerfile".

First, we must select the base image. We'll use rocker/rstudio version 4.3.2, which comes with R 4.3.2. We'll make sure to label the image with a simple description.

Then, we must outline the steps needed to install Seurat4. rocker/rstudio is built on top of Ubuntu, so any packages we need to install should use Ubuntu's apt-get utility. The following packages are needed for the installation of Seurat and other tools:

Now, we can run R to install Seurat and other useful R tools, including BiocManager, which we'll use in the next step to install useful bioinformatics R libraries.

Installing R libaries using BiocManager:

Installing other tools from GitHub:

All together, the Dockerfile should look like this:

Now that we have the Dockerfile, we can invoke the Docker build commands in the command line. We'll want to tag our Docker image with our name and the name of the image, preferably something descriptive. I'll choose asoberan/abrfseurat for my build.

Of course, each of you could build this yourselves and have a custom local copy of this image. However, the benefit of containerization is that it makes programs and environments portable. I've already created the image and uploaded it to Dockerhub. So instead of everyone needing to create their own image, you can just pull my existing image and use it immediately.

I've created images for both amd64 and arm64. If you're running a PC or an Intel-based Mac, you'll want to use the tag latest-x86_64. If you're running Apple Silicon or another ARM processor, you'll want to use the tag latest-arm64.

Once the Docker image is pulled and runs, you can navigate to http://localhost:8787 and log in to the RStudio instance with user rstudio and the given password. All the libraries needed for Seurat should be available out of the box.

However, we've fallen into the same problem as previously: we are running this instance of RStudio locally on our computers. How can we take advantage of this image on the Luria cluster?

FROM rocker/rstudio:4.3.2
LABEL description="Docker image for Seurat4"
RUN apt-get update && apt-get install -y \
    libhdf5-dev build-essential libxml2-dev \
    libssl-dev libv8-dev libsodium-dev libglpk40 \
    libgdal-dev libboost-dev libomp-dev \
    libbamtools-dev libboost-iostreams-dev \
    libboost-log-dev libboost-system-dev \
    libboost-test-dev libcurl4-openssl-dev libz-dev \
    libarmadillo-dev libhdf5-cpp-103
RUN R -e "install.packages(c('Seurat', 'hdf5r', 'dplyr', 'cowplot', 'knitr', 'slingshot', 'msigdbr', 'remotes', 'metap', 'devtools', 'R.utils', 'ggalt', 'ggpubr', 'BiocManager'), repos='http://cran.rstudio.com/')"
RUN R -e "BiocManager::install(c('SingleR', 'slingshot', 'scRNAseq', 'celldex', 'fgsea', 'multtest', 'scuttle', 'BiocGenerics', 'DelayedArray', 'DelayedMatrixStats', 'limma', 'S4Vectors', 'SingleCellExperiment', 'SummarizedExperiment', 'batchelor', 'org.Mm.eg.db', 'AnnotationHub', 'scater', 'edgeR', 'apeglm', 'DESeq2', 'pcaMethods', 'clusterProfiler'))"
RUN R -e "remotes::install_github(c('satijalab/seurat-wrappers', 'kevinblighe/PCAtools', 'chris-mcginnis-ucsf/DoubletFinder', 'velocyto-team/velocyto.R'))"
FROM rocker/rstudio:4.3.2
LABEL description="Docker image for Seurat4"

RUN apt-get update && apt-get install -y \
    libhdf5-dev build-essential libxml2-dev \
    libssl-dev libv8-dev libsodium-dev libglpk40 \
    libgdal-dev libboost-dev libomp-dev \
    libbamtools-dev libboost-iostreams-dev \
    libboost-log-dev libboost-system-dev \
    libboost-test-dev libcurl4-openssl-dev libz-dev \
    libarmadillo-dev libhdf5-cpp-103

RUN R -e "install.packages(c('Seurat', 'hdf5r', 'dplyr', 'tidyverse', 'cowplot', 'knitr', 'slingshot', 'msigdbr', 'remotes', 'metap', 'devtools', 'R.utils', 'ggalt', 'ggpubr', 'BiocManager'), repos='http://cran.rstudio.com/')"

RUN R -e "BiocManager::install(c('SingleR', 'slingshot', 'scRNAseq', 'celldex', 'fgsea', 'multtest', 'scuttle', 'BiocGenerics', 'DelayedArray', 'DelayedMatrixStats', 'limma', 'S4Vectors', 'SingleCellExperiment', 'SummarizedExperiment', 'batchelor', 'org.Mm.eg.db', 'AnnotationHub', 'scater', 'edgeR', 'apeglm', 'DESeq2', 'pcaMethods', 'clusterProfiler'))"

RUN R -e "remotes::install_github(c('satijalab/seurat-wrappers', 'kevinblighe/PCAtools', 'chris-mcginnis-ucsf/DoubletFinder', 'velocyto-team/velocyto.R'))"
cd /path/to/directory/where/Dockerfile/is/located

docker buildx build -t asoberan/abrfseurat .
docker run --rm -it -p 8787:8787 asoberan/abrfseurat:<tag>
http://localhost:8787
export NXF_SINGULARITY_CACHEDIR="$HOME/.singularity/cache"
export NXF_OFFLINE='TRUE'
source ~/.bashrc
srun --pty bash # Start an interactive session on a compute node

module load miniconda3/v4

source /home/software/conda/miniconda3/bin/condainit
conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge
conda create --name nf-core
conda activate nf-core
conda install python=3.12 nf-core=2.13.1 nextflow=24.10.4
nf-core list

┏━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┓
┃ Pipeline Name             ┃ Stars ┃ Latest Release ┃      Released ┃ Last Pulled ┃ Have latest release? ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━┩
│ riboseq                   │     4 │          1.0.1 │   2 weeks ago │           - │ -                    │
│ sarek                     │   339 │          3.4.1 │   1 weeks ago │           - │ -                    │
│ oncoanalyser              │    14 │            dev │  17 hours ago │           - │ -                    │
│ tfactivity                │     7 │            dev │     yesterday │           - │ -                    │
│ pangenome                 │    47 │          1.1.2 │  1 months ago │           - │ -                    │
│ scnanoseq                 │     2 │            dev │     yesterday │           - │ -                    │
│ fetchngs                  │   123 │         1.12.0 │  2 months ago │           - │ -                    │
│ rnaseq                    │   778 │         3.14.0 │  4 months ago │ 2 hours ago │ No (v3.14.0)         │
...........................................................................................................

│ slamseq                   │     4 │          1.0.0 │   4 years ago │           - │ -                    │
└───────────────────────────┴───────┴────────────────┴───────────────┴─────────────┴──────────────────────┘
module load singularity/3.10.4

nextflow run nf-core/rnaseq -r 3.14.0 -profile test,singularity --outdir test
ls test/
bbsplit  fastqc  multiqc  pipeline_info  salmon  star_salmon  trimgalore

Running Images in Singularity

Singularity Commands

Singularity is packaged on Luria as an environment module, so you'll need to load the module in before invoking any Singularity commands. We'll also run these commands on an interactive session on a compute node so we don't spend the head node's resources.

Now, we can either have Singularity manage the image itself, or create the SIF file in our current directory. We'll do both in this exercise.

Let's run the same basic 'Hello, World!' command we did in Docker, again using the Debian Docker image:

srun --pty bash

module load singularity/3.10.4

singularity exec docker://debian echo 'Hello, World!'
INFO:    Converting OCI blobs to SIF format
INFO:    Starting build...
Getting image source signatures
Copying blob 1468e7ff95fc done
Copying config d5269ef9ec done
Writing manifest to image destination
Storing signatures
2024/05/01 11:02:47  info unpack layer: sha256:1468e7ff95fcb865fbc4dee7094f8b99c4dcddd6eb2180cf044c7396baf6fc2f
INFO:    Creating SIF file...
Hello, World!

Instead of run, Singularity uses exec to execute programs inside of a container. The image we provide is a Docker image, so we tell Singularity this by prepending the image name with docker://. Singularity will look in Dockerhub for this image. Once it finds it, it will download the image, convert it to the SIF file format, place that SIF file in ~/.singularity, then execute the given command in the container. Subsequent uses of this image will reuse the downloaded copy instead of downloading the image every time.

To run an interactive session in Singularity, you could do something similar to what we did in Docker, where we simply execute bash in the container.

singularity exec docker://debian bash
Singularity>

However, Singularity has a built-in command to do this that makes the syntax much nicer.

singularity shell docker://debian
Singularity>

Singularity automatically bind-mounts your user's home directory to the container, so you'll have access to your files like normal. However, your user's ~/data folder is a symbolic link to /net/<storage server>, which is not a directory inside of the Singularity container, so this symbolic link will be broken.

Like Docker, Singularity allows you to mount directories from your computer to inside the container. To get around the symbolic link issue when running the image from your home directory, you could simply mount the /net directory on Luria to the /net directory in your container. Since the /net directory contains your lab's storage server and you're keeping the same name on the container, the symbolic link should work as normal.

singularity shell --bind /net:/net docker://debian
Singularity> ls data
# You should see the files from your storage server

SIF Files

So far, we've been letting Singularity manage images itself. However, we can also instruct Singularity to download the Docker image, create a SIF file from it, and let us handle this SIF file. We do this by using the pull command:

singularity pull docker://debian

ls

# You should see a file named debian_latest.sif

Now, instead of instructing Singularity to use the docker://debian image, we can simply point it to the debian_latest.sif file. This means a lab can keep a directory of common Singularity images, and lab members can simply run those images instead of each pulling their own copy of the exact same tools. This way is also faster than the previous method.

singularity shell debian_latest.sif
Singularity>

singularity exec debian_latest.sif echo 'Hello, World!'
Hello, World!

Running RStudio with Seurat Tools

Let's run the image we created earlier. There's a pre-built version of this image available on Dockerhub at asoberan/abrfseurat.

Singularity has trouble interpreting symbolic links, and since the ~/data/ directory in a user's home directory is a symbolic link, we'll see issues when trying to run Singularity. To remedy this, we'll run the following command once we're in the ~/data/ folder:

cd $(pwd -P)

This will change our directory to the full physical path of our current working directory. So we'll be in the same directory, but without following the symbolic link in our home directories.
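You can see what cd $(pwd -P) does with a quick experiment using a throwaway symbolic link (the paths here are hypothetical, created just for the demo):

```shell
# Create a real directory and a symbolic link that points at it
base=$(mktemp -d)
mkdir "${base}/realdir"
ln -s "${base}/realdir" "${base}/linkdir"

# Enter through the symlink: we are "in" linkdir logically,
# but pwd -P resolves to the physical path ending in /realdir
cd "${base}/linkdir"
pwd -P

# Re-enter via the physical path, dropping the symlink from our cwd
cd $(pwd -P)
pwd
```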

Now, we can begin to run Singularity images. The Singularity program is packaged as an environment module on Luria, so you'll have to load it in first. To start an interactive session in a Singularity container, we use singularity shell <image>. In this case, the image we'll be using is the asoberan/abrfseurat:latest-x86_64 image on Dockerhub, so we'll run the following:

module load singularity/3.10.4

singularity shell docker://asoberan/abrfseurat:latest-x86_64

This will begin pulling the Docker image, convert it to a SIF file, store it in ~/.singularity, which is a symbolic link to <your lab storage server user directory>/singularity, then run an interactive session on that image. Once it's done you'll have a shell session inside the image, and you'll be able to use the tools in the Singularity image. For example, you'll be able to use R:

[asoberan@luria test]$ singularity shell docker://asoberan/abrfseurat:latest-x86_64
Singularity> R

R version 4.2.2 (2022-10-31) -- "Innocent and Trusting"
Copyright (C) 2022 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> 

However, we will not be able to run RStudio on the image as it stands. This is because RStudio needs to create particular settings and database files at locations in the filesystem which are read-only in the Singularity image. To fix this, we'll need to create these directories ourselves. Below is a script that does just that, while also running the RStudio server from the Singularity image:

#!/bin/bash

#SBATCH --job-name=Rstudio       # Assign an short name to your job
#SBATCH --output=slurm.%N.%j.out     # STDOUT output file

module load singularity/3.10.4

workdir=$(python -c 'import tempfile; print(tempfile.mkdtemp())')

mkdir -p -m 700 ${workdir}/run ${workdir}/tmp ${workdir}/var/lib/rstudio-server
cat > ${workdir}/database.conf <<END
provider=sqlite
directory=/var/lib/rstudio-server
END

cat > ${workdir}/rsession.sh <<END
#!/bin/sh
export OMP_NUM_THREADS=${SLURM_JOB_CPUS_PER_NODE}
exec /usr/lib/rstudio-server/bin/rsession "\${@}"
END

chmod +x ${workdir}/rsession.sh

export SINGULARITY_BIND="${workdir}/run:/run,${workdir}/tmp:/tmp,${workdir}/database.conf:/etc/rstudio/database.conf,${workdir}/rsession.sh:/etc/rstudio/rsession.sh,${workdir}/var/lib/rstudio-server:/var/lib/rstudio-server"
export SINGULARITYENV_RSTUDIO_SESSION_TIMEOUT=0
export SINGULARITYENV_USER=$(id -un)
export SINGULARITYENV_PASSWORD=$(echo $RANDOM | base64 | head -c 20)

readonly PORT=$(python -c 'import socket; s=socket.socket(); s.bind(("", 0)); print(s.getsockname()[1]); s.close()')

cat 1>&2 <<END

1. SSH tunnel from your workstation using the following command:

   ssh -t -L 8787:localhost:${PORT} ${SINGULARITYENV_USER}@luria.mit.edu ssh -t ${HOSTNAME} -L ${PORT}:localhost:${PORT}

   and point your web browser to http://localhost:8787

2. log in to RStudio Server using the following credentials:

   user: ${SINGULARITYENV_USER}
   password: ${SINGULARITYENV_PASSWORD}

When done using RStudio Server, terminate the job by:

1. Exit the RStudio Session ("power" button in the top right corner of the RStudio window)
2. Issue the following command on the login node:

      scancel -f ${SLURM_JOB_ID}
END

singularity exec --cleanenv -H ~/data:/home/rstudio docker://asoberan/abrfseurat:latest-x86_64 /usr/lib/rstudio-server/bin/rserver \
            --server-user ${USER} --www-port ${PORT} \
            --auth-none=0 \
            --auth-pam-helper-path=pam-helper \
            --auth-stay-signed-in-days=30 \
            --auth-timeout-minutes=0 \
            --rsession-path=/etc/rstudio/rsession.sh 
printf 'rserver exited' 1>&2

Let's go through the script step-by-step to understand what it's doing.

workdir=$(python -c 'import tempfile; print(tempfile.mkdtemp())')

mkdir -p -m 700 ${workdir}/run ${workdir}/tmp ${workdir}/var/lib/rstudio-server

cat > ${workdir}/database.conf <<END
provider=sqlite
directory=/var/lib/rstudio-server
END

This part of the script uses Python to create a temporary directory, which is then populated with the directories that will be bind-mounted into the Singularity container wherever writable filesystems are necessary.

The latter portion of the script is making a file in the temporary directory, database.conf, with the contents you see. These settings are used by RStudio to configure the database.
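The temporary-directory step can be tried on its own; this sketch assumes python3 is on PATH (the script itself calls python):

```shell
# Create a throwaway working directory, exactly as the script does
workdir=$(python3 -c 'import tempfile; print(tempfile.mkdtemp())')

# Create the subdirectories that will back the container's writable paths
mkdir -p -m 700 ${workdir}/run ${workdir}/tmp ${workdir}/var/lib/rstudio-server

# Write the database.conf settings into the temporary directory
cat > ${workdir}/database.conf <<END
provider=sqlite
directory=/var/lib/rstudio-server
END

ls ${workdir}
```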

cat > ${workdir}/rsession.sh <<END
#!/bin/sh
export OMP_NUM_THREADS=${SLURM_JOB_CPUS_PER_NODE}
exec /usr/lib/rstudio-server/bin/rsession "\${@}"
END

chmod +x ${workdir}/rsession.sh

Here, the script makes another script in the temporary directory, rsession.sh, with the contents you see. The script sets OMP_NUM_THREADS to prevent OpenBLAS (and any other OpenMP-enhanced libraries used by R) from spawning more threads than the number of processors allocated to the job. Then it makes this script executable.
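The heredoc-plus-chmod pattern is easy to try in isolation. This toy version (hypothetical file name, no Singularity involved) writes a small wrapper script and runs it:

```shell
# Write a tiny wrapper script with a heredoc; the backslash in \${@}
# keeps the parameter expansion from happening while the file is written
cat > wrapper.sh <<END
#!/bin/sh
echo "wrapped: \${@}"
END

# Make it executable, then run it with an argument
chmod +x wrapper.sh
./wrapper.sh hello    # prints: wrapped: hello
```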

export SINGULARITY_BIND="${workdir}/run:/run,${workdir}/tmp:/tmp,${workdir}/database.conf:/etc/rstudio/database.conf,${workdir}/rsession.sh:/etc/rstudio/rsession.sh,${workdir}/var/lib/rstudio-server:/var/lib/rstudio-server"
export SINGULARITYENV_RSTUDIO_SESSION_TIMEOUT=0
export SINGULARITYENV_USER=$(id -un)
export SINGULARITYENV_PASSWORD=$(echo $RANDOM | base64 | head -c 20)

readonly PORT=$(python -c 'import socket; s=socket.socket(); s.bind(("", 0)); print(s.getsockname()[1]); s.close()')

This portion sets a couple of environment variables. The environment variables which begin with SINGULARITY_ will be used when we invoke the Singularity program, while the environment variables which begin with SINGULARITYENV_ will be accessible inside of the Singularity image.

SINGULARITY_BIND is outlining the bind-mounts that should be created when we run the Singularity image. The bind-mounts are the temporary directories we made.

SINGULARITYENV_RSTUDIO_SESSION_TIMEOUT is setting the session timeout for RStudio. In this case, it's set not to suspend idle sessions.

SINGULARITYENV_USER is storing the user which will be used in RStudio. In this case it's ourselves.

SINGULARITYENV_PASSWORD is storing the password which will be used later in RStudio. The password is generated from bash's built-in $RANDOM variable.

PORT is finding an unused port number and storing it for later usage.
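Both one-liners can be run outside the script (this assumes python3 and base64 are on PATH; note that $RANDOM is a bash feature):

```shell
# Ask the OS for any free TCP port by binding to port 0
PORT=$(python3 -c 'import socket; s=socket.socket(); s.bind(("", 0)); print(s.getsockname()[1]); s.close()')
echo "free port: ${PORT}"

# Derive a short throwaway password from bash's $RANDOM
PASSWORD=$(echo $RANDOM | base64 | head -c 20)
echo "password: ${PASSWORD}"
```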

cat 1>&2 <<END

1. SSH tunnel from your workstation using the following command on your workstation:

   ssh -t -L 8787:localhost:${PORT} ${SINGULARITYENV_USER}@luria.mit.edu ssh -t ${HOSTNAME} -L ${PORT}:localhost:${PORT}

   and point your web browser to http://localhost:8787

2. log in to RStudio Server using the following credentials:

   user: ${SINGULARITYENV_USER}
   password: ${SINGULARITYENV_PASSWORD}

When done using RStudio Server, terminate the job by:

1. Exit the RStudio Session ("power" button in the top right corner of the RStudio window)
2. Issue the following command on the login node:

      scancel -f ${SLURM_JOB_ID}
END

This part of the script prints out information to the user so they can remember how to port-forward and what the login information for RStudio is.

singularity exec --cleanenv -H ~/data:/home/rstudio docker://asoberan/abrfseurat:latest-x86_64 /usr/lib/rstudio-server/bin/rserver \
            --server-user ${USER} --www-port ${PORT} \
            --auth-none=0 \
            --auth-pam-helper-path=pam-helper \
            --auth-stay-signed-in-days=30 \
            --auth-timeout-minutes=0 \
            --rsession-path=/etc/rstudio/rsession.sh 
printf 'rserver exited' 1>&2

This final piece is where Singularity actually runs the RStudio server program in asoberan/abrfseurat using all of the configuration created earlier in the script.

Save this script somewhere on the cluster. Submit it to a compute node using Slurm. Then read the contents of the Slurm output file, and you'll receive instructions for port forwarding from your workstation in order to access RStudio at http://localhost:8787.

sbatch seurat_script.sh

cat slurm-<id>.out

# Follow instructions
Singularity HPC Library: singularityhub.github.io
Sylabs Cloud: cloud.sylabs.io