The Integrated Genomics and Bioinformatics Core comprises the Genomics Facility and the Barbara K. Ostrom Bioinformatics Facility. Together, these shared institutional resources provide Koch Institute investigators with support throughout all phases of modern genomics experimentation.
The Barbara K. Ostrom (1978) Bioinformatics Facility is a shared institutional resource providing KI investigators with support, assistance and training in a wide range of bioinformatics topics. The core also maintains a high-performance computing resource offering many bioinformatics applications, substantial processing power and secure storage. The team also provides support for desktop computing and poster printing.
Access is available to all members of the MIT community, to the extent permitted by available capacity. In recognition of funding support, priority access is given to members of the Core's sponsoring entities (Koch Institute, MIT Biology, MIT Biological Engineering and the MIT Center for Environmental Health Sciences) and to NCI-funded research projects. In special circumstances, access may be available to non-MIT users (details are available on request from the Facility Scientific Director, Charlie Whittaker).
The Bioinformatics & Computing Core is supported in part by funding provided to the Koch Institute from a National Cancer Institute Cancer Center Support Grant.
Vincent Butty - Bioinformatics Specialist
Huiming Ding - Bioinformatics Specialist
Duanduan Ma - Bioinformatics Specialist
Yann Vanrobaeys - Bioinformatics Specialist
Stuart Levine - BMC Director and Bioinformatics Specialist
Charlie Whittaker - BCC Leader and Bioinformatics Specialist
Charlie Demurjian - Senior Bioinformatics Data Specialist
Taïsha Joseph - Bioinformatics Data Specialist
Allen Soberanes - IT & Systems Administrator
AJ Bhutkar - Sr Bioinformatics Specialist
Stephen Goldman - Computing Support Specialist
Dikshant Pradhan - Research Specialist
Jie Wu - Bioinformatics Specialist
Sebastian Hoersch - Bioinformatics Specialist
Paola Favaretto - Bioinformatics Specialist
Michael Moran - Computing Support Specialist
Jingzhi Zhu - Research Computing Specialist
https://ki-sbc.mit.edu/bioinformatics/contact
TSM archive
Free to archive
Folders will be archived to TSM and then deleted from bmc-lab.
Access to archived data and restores from archive will require assistance from systems administrators and this help will be billed at $90 per hour.
We can also help to transfer the data from bmc-lab to your own hardware.
The purpose of this material is to help the KI-BCC members teach KI researchers the skills they need to effectively use bioinformatics in their projects. We intend the instruction to be practical and modular so that people can quickly learn easily applicable methods. We also hope that the teaching materials will serve as effective and readily accessible reference material.
It is not our intent to create a link-dump that regurgitates internet search results.
The overall organization is difficult to perfect. However, getting it right at the start is not important because the underlying information consists of small modules that can be organized and re-packaged in any way that works.
The material has been organized with general headers followed by more specific sections and subsections. The general headers are basically Theory, Tools, Tasks.
The sections under the headers should highlight general topics and the sub-sections beneath them are the actual modules of instruction. It is perfectly acceptable for the same module of instruction to be placed in different parts of the structure.
There is a lot of room for improvement with this organization and suggestions or comments are welcome (send them to [email protected]).
Open the UCSC Genome Browser home page.
Click the "Genomes" link and select group "Mammal", genome "Human" and assembly "Dec. 2013 (GRCh38/hg38)
Enter the position chr4:133,143,371-133,155,540 and click submit.
The eye icon allows you to view the result, the pencil is for editing attributes and the X allows deletion.
Select the FASTA manipulation toolbox on the left. Then select the "Compute sequence length" function. When finished, view the results.
The Penn State team has implemented a powerful set of genomics tools that interfaces nicely with the UCSC Genome informatics tools and also extends UCSC functionality.
The Galaxy team provides a wide range of excellent training viewable from their home page.
They also maintain a highly informative .
A brief summary of local and campus-wide computing resources.
The MIT Office of Research Computing and Data (ORCD) has a number of resources that are available to researchers campus wide, a list of which can be found here.
If you have storage or compute resources spread across multiple computing systems, please reach out to [email protected] in order to coordinate support.
BLAT stands for "BLAST-Like Alignment Tool". It is very fast and is designed to find nearly identical matches.
A description of BLAT is located at the bottom of this page.
The query is displayed with identities in blue and splice junctions in light blue.
The genomic sequence is colored the same way.
The sequences are shown in pairwise alignment.
Return to the BLAT results page and select the "browser" hyperlink. Note the new "Blat sequence" track being displayed in the "Mapping and Sequencing" group, this track has the same visualization controls as other similar tracks.
Click the "hide all" button, use the pull-down menus to set ENCODEv24 and Blat sequence to "full" and refresh the view. Zoom out so the entire footprint of the PCDH10 is visible. The coordinates are: chr4:133,147,918-133,213,548
A wide range of data formats can be converted to custom tracks that can then be visualized and compared to other genomic data.
Custom tracks are managed through the "My Data --> Custom Tracks" window. Select this page.
Available formats for custom tracks are described here.
Paste the following block of text into the Paste URLs or Data window and click the Submit button. Then select the "view in Genome Browser" option on the next page.
track name="ClassExample" description="Edited" visibility=0 url=
chr4 133149358 133149958 Example1 10 +
chr4 133149754 133149980 Example2 100 +
chr4 133150808 133152008 Example3 500 +
chr4 133152429 133152829 Example4 1000 +
Note the presence of the group "Custom Tracks" with the track "ClassExample". Toggle the display mode to full and click Refresh. Zoom out so that all 4 Examples are visible.
Select "Tools-->Table Browser". The region will usually correspond to the part of the genome visualized in the browser. Select the “summary/statistics” function to display some statistics about the information in the custom track.
Cross-species Alignments and Conservation
Query the March 2006 human genome browser gateway with the gene symbol FOXP2. View the gene on chromosome 7. Adjust the tracks displayed so that base position, chromosome band and refseq are set to full.
Switch the Zebrafish, Dog and Mouse Net Alignments to Full. What portions of the gene are most highly conserved across all of these species? Compare the dog and mouse alignments. Based on what you see which species is more closely related to human? Why? Which species diverged from human more recently?
chr7:113,838,514-113,842,471
Examine the pan-vertebrate conservation upstream from the transcription start site (TSS). How far from the TSS is the conserved block? Can you think of a reason why it is conserved? Can you find any tracks that could support this idea?
Examine the details of the 7X Reg Potential track. Reconfigure the view so that it includes an indicator line at 0.05.
Activate only the ColonMet_Faked experiment and select appropriate thresholds.
From the Start Page, select the One-click Analysis button and pick the Pathway Maps tool
Analysis will consider only the Pathway maps
It can be run on multiple experiments as well.
Activate only the ColonMet_Faked experiment and select appropriate thresholds.
From the Start Page, select the One-click Analysis button and pick the Map Folders tool
Analysis will consider functionally related Pathway Map groups resulting in alteration of scoring and statistics.
It can be run on multiple experiments as well.
Clicking on a result folder leads to a results page displaying data for the composite Pathway Maps
Activate only the ColonMet_Faked experiment and select appropriate thresholds.
From the Start Page, select the One-click Analysis button and pick the Process Networks tool
Analysis will consider enrichment in various annotated process networks.
It can be run on multiple experiments as well.
Clicking on a result leads to a page displaying data for the indicated Process Network.
Run a Compare Experiments workflow for ColonMet_Faked_FC and ColonMet_Faked_FCrand1_edit.
Identify and select the common network objects in this comparison (10 Network objects) by clicking the blue/white histogram.
Click Analyze networks to produce a list of networks associated with the list.
Return to the Network options page and click Add network objects and add ITGA5.
Re-run Analyze Networks, using default settings. NOTE: Adding ITGA5 produces an additional network.
There are 11 different network-building algorithms available.
Clicking on a network name opens the highly interactive network viewer. Explore the extensive control over the visualization.
UNIX is an operating system (a suite of programs) originally developed in 1969 at Bell Labs, and it has been under development ever since. UNIX is a very popular operating system for many reasons, including:
multi-user (multiple users can log in to a computer at the same time and concurrently use the resources of that computer).
multi-tasking (each user can perform many tasks at the same time).
network-ready (built-in TCP/IP networking makes it easy to communicate between computers).
very powerful programming environments (free of the many limits imposed by other operating systems).
robust and stable.
scalable, portable, flexible.
open source.
UNIX systems also have a graphical user interface (GUI), similar to Microsoft Windows, which provides an easy-to-use environment. However, knowledge of UNIX is required for operations that aren't covered by a graphical program, or for when no graphical interface is available (for example, in a telnet session).
UNIX consists of three main components: kernel, shell, and programs.
KERNEL: hub of the operating system.
SHELL: command line interpreter.
PROGRAMS: collections of operations.
Each component has specific roles:
KERNEL: it allocates time and memory to programs and handles the filestore and communications in response to system calls.
SHELL: it interprets the commands the user types in and arranges for them to be carried out.
PROGRAM: execute a set of predefined operations.
As an illustration of the way the shell and the kernel work together, suppose a user types the following command:
rm myfile
Internally,
The shell searches the PATH environment variable for the file containing the program rm.
The shell asks the kernel, through system calls, to execute the program rm on myfile.
When the process has finished running, the shell returns the UNIX prompt to the user, indicating that it is waiting for further commands.
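You can see the first part of this lookup yourself. A minimal illustration using standard commands (the paths printed will differ from system to system):
# show the directories the shell searches for programs
echo $PATH
# show which rm the shell would run
which rm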
cd
This command name stands for "change directory".
It changes your current working directory to the specified location.
The last visited directory is referred to with a hyphen ("-").
# go to root directory of the system and print the working directory
cd /
pwd
# go to the home directory and print the working directory
cd ~
pwd
# change directory using the absolute path and print the working directory
cd /net/bmc-pub14/data/
pwd
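Since the last visited directory is referred to with a hyphen, you can bounce between two locations. A short illustration (the /tmp directory exists on essentially all UNIX systems):
# go to /tmp, then jump back to the previous directory
cd /tmp
cd -
pwd
# move up one level to the parent directory
cd ..
pwd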
A competitor of Ingenuity, it offers manually curated pathways and networks, tools to analyze those groups in the context of various experiments, and powerful network-building methods.
Owned by Thomson Reuters
Major Applications:
Access to manually curated databases
Evaluation of pathways, networks, annotations and associations as they relate to gene lists. These gene lists could be differentially expressed genes or any other list of genes.
Construction of networks based on gene lists using a database of interactions.
GeneGO ontologies
Pathway Maps
Process Networks
GO Processes
Disease (by Biomarkers)
Metabolic Networks
High quality documentation is available by following Help -> Help
Please visit the Data Management Analysis Core's documentation site for more information:
Use the Get Data --> Upload File tool to load the data located at the following paths:
Array Data contains 100 lines of Affymetrix microarray data. Gene Data contains 70 lines of annotation information corresponding to a subset of the array data.
Examine the information in each file using the eye icons on the right side of the page. Note that the first column in both files is "ProbeID". That common column will be the basis for binding the 2 files together. In database terms, ProbeID is the "key" used in the join.
Examine the results. Column c1 (ProbeID) from the array data file was used to examine column c1 of the gene data file. Whenever a match was found, the information in the gene data file was appended to the array data file.
Open the UCSC Genome Browser and navigate to the "Dec. 2013 (GRCh38/hg38)" human assembly.
Enter the position chr4:133,148,410-133,195,705 and click submit.
Click the View-->DNA link to enter the "Get DNA in Window" page
Sequence range and formatting adjustments are available; accepting the defaults and clicking "Get DNA" will retrieve the genomic sequence that was present in the browser view.
Go back to the "Get DNA in Window" page
Click the “extended case/color options” button and configure the display so that known gene exons are in bold, red and upper case letters.
Set default case set to "Lower" with the radio button.
Check the boxes for "Toggle Case" and "Bold" and enter 255 in the "Red" box on the GENCODE v24 line.
Click submit and view the newly formatted results.
Intersection queries are one of the most useful functions implemented by UCSC. They allow users to characterize the positional relationships of 2 different overlapping tracks.
The demonstration of this tool will build on the observations made during examination of the 5' end of the PCDH10 gene in the previous exercise. The region upstream of the 5' end of the gene overlapped with a CpG island. CpG islands correlate with promoters because they tend to be unmethylated. In non-promoter sequence, the C of a CG dinucleotide is usually methylated and thus subject to spontaneous deamination. See the details page of the track for more information. Note: all UCSC tracks are well documented and together contain extensive information about a wide range of genomics methods and data types.
summary/statistics and the browser can be used to confirm the custom track contains the desired data.
This processing creates one interval for each GENCODE transcript, and many intervals will be similar because many start positions are close to one another. This redundancy can be collapsed by sending the results to Galaxy and using the bedtools functions "Sort BED files" followed by "Merge BED files". There are fine controls on this processing and it will not be demonstrated here, but the merged output is available.
Create a custom track using the URL to that merged data and count the results.
Compare the contents of the custom tracks in the browser. The region is chr2:109,000,000-119,000,000
Use an intersection query to identify regions in the merged upstream sequences that overlap with a CpG island.
Open Tools --> Table Browser and set the "group" and "track" options so that they specify the merged bed custom track.
Select the "create" button next to "intersection" button and select intersection with CpG islands according to the following image:
Note the ability to control the degree of intersection (some, none, partial).
After intersection is in place, use summary/statistics to count the results and view some in the genome browser.
The results of the intersection query can be used in the same ways as any other kind of data.
A link to example data is below. The files must be in .xls format; open both and examine the data.
Key structure is:
NOTE: Raw gene list of genes of interest works fine as well.
From the Start Page, there are 3 ways to access the data loading tool:
File-->Upload Data...
Upload Data icon
Upload Page-->Upload Experiments with Gene or Protein IDs
NOTES: Tools emphasize the Activate/Deactivate/Reload requirements.
Choose the file and click Next
Adjust data types as needed. Rename columns and experiments as necessary. Make sure that fold changes and supporting p-value columns are adjacent to each other. Click Next
Specify the species and click Next.
The Background processing status page will indicate when upload is complete. Once processed, the line can be deleted from the table.
Return to the Background processing page from the start page by File-->Background processing item.
How it works
The Unix command vi starts up the visual editor.
Typing vi followed by a file name will automatically open the file.
After issuing the command, the appearance of your screen changes. Rather than seeing the shell prompt, the content of the file filename.txt appears on the screen. If filename.txt hadn't existed before you invoked the vi command, the screen will appear mostly blank.
Modes
One of the fundamental concepts to keep in mind when using vi is that three modes exist: command, insert, and visual.
In command mode, everything you type on the keyboard gets interpreted as a command.
In insert mode, most everything you type on the keyboard gets interpreted as characters and will appear in your file--letters, numbers, punctuation, line returns, etc.
When you are in insert mode, you can switch to command mode by pressing the Esc key on your keyboard (this is often at the left and upper portion of your keyboard).
When you are in command mode, there are many keys you can use to get into insert mode, each one gives you a slightly different way of starting to type your text. For example, if you are in command mode, you can simply type i on your keyboard to enter insert mode (i stands for insert). Then, the characters you type go into the file.
Basic commands
h = move one character to the left
l = move one character to the right
k = move up one line
j = move down one line
[ctrl] b = move back one screen
[ctrl] f = move forward one screen
quit Vi without saving anything (you'll lose any changes you made when using this command) type: :q!
save/write the file you're working on without exiting type: :w followed by a filename
save/write your file and quit the vi editor in one step by typing: :wq
vi filename.txt
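As a minimal worked example (notes.txt is just a placeholder filename), a complete edit session looks like this; only the first line is a shell command, and the comments describe the keystrokes typed inside vi:
vi notes.txt        # open (or create) notes.txt; vi starts in command mode
i                   # press i to switch to insert mode
Hello from vi       # type some text; it is inserted into the file
<Esc>               # press Esc to return to command mode
:wq                 # write the file and quit vi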
Identifier FoldChange pvalue
/net/ostrom/data/dropbox/arrayDat.txt
/net/ostrom/data/dropbox/arrayAnnot.txt
The intersection query reports the results in the first set that have the intersection you specified with the second set you selected. To view the converse, select known genes first, then edit the intersection to seek any overlap with the custom track.
How many results are produced when the tracks are inverted?
Most processes initiated by UNIX commands take their input from the standard input (i.e. the keyboard) and write their output to the standard output (i.e. the terminal screen).
The "<" sign can be used to redirect the input, that is to specify that the input comes from something other than the keyboard.
The ">" sign can be used to redirect the output, that is to specify that the output goes to something other than the terminal screen.
The ">>" sign can be used to append the output to something other than the terminal screen.
# list the current files and redirect the output to a file named "mylist.txt"
ls > mylist.txt
# view content of mylist.txt
cat mylist.txt
# redirect the input to a command
cat < mylist.txt
# redirect the output and append
cat mylist.txt > list1.txt
cat mylist.txt >> list2.txt
# view content
cat list2.txt
A pipe is denoted by "|".
Several pipes can be used in the same command line to create a "pipeline".
A pipe takes the output of a command and immediately sends it as input to another command.
A pipe is often used in conjunction with the command "less" to view the output within the pager.
#view users connected
who
#count the number of users connected
who | wc -l
#display the content of bin
ls -la /usr/local/bin
#display the content of bin within the pager provided by "less"
ls -la /usr/local/bin | less
Unix comes with preloaded documentation, also known as "man pages".
Each page is a self-contained document consisting of the following sections:
NAME: the name and a brief description of the command.
SYNOPSIS: how to use the command, including a listing of the various options and arguments you can use with the command. Square brackets ([ ]) are often used to indicate optional arguments. Any arguments or options that are not in square brackets are required.
DESCRIPTION: a more detailed description of the command including descriptions of each option.
SEE ALSO: References to other man pages that may be helpful in understanding how to use the command in question.
When you are not sure of the exact name of a command, you can use the apropos command to see all the commands with the given keyword on their man page.
apropos keyword
For example, to see which commands are relevant to the task of copying, type the following:
apropos copy
To view a manual page, use the command man.
man command_name
When viewing a man page, use the following keys to navigate the pages:
spacebar- view the next screen;
b (for "back") - view the previous screen;
arrow keys - navigate within the pages;
q (for "quit") - quit and return to the shell prompt.
For example, to view the man page associated with the command cp
, type the following:
man cp
The UNIX file system consists of files and directories organized in a hierarchical structure. When visualized, this hierarchical structure looks like a tree, with roots and many branches.
Below is the output of the tree -L 1 / command, showing the Unix filesystem tree:
/
├── bin
├── boot
├── dev
├── etc
├── home
├── lib
├── lib64
├── mnt
├── nfs
├── opt
├── proc
├── root
├── run
├── srv
├── sys
├── tmp
├── usr
└── var
19 directories, 0 files
Every file or directory in a UNIX operating system is somewhere on this "tree." / is referred to as "root," because it's the root of the tree which every other file or directory is inside.
For example, every UNIX user's home directory is in home, which is in /. In other words, every user's home directory is in /home.
Unix has some shortcuts for referring to directories.
. stands for "my current directory."
.. stands for "my parent directory," a.k.a. the directory one branch higher in the tree.
~ stands for "my home directory."
Example of Unix directory shortcuts:
/
├── home
│ ├── asoberan ( ~ )
│ │ └──Documents ( .. )
│ │ └──unix_class (I am here) ( . )
│ ├── duan
│ └── yourusername
├── etc
├── bin
├── tmp
...
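A short sketch of how these shortcuts are used in practice, assuming you start in /home/asoberan/Documents/unix_class as in the tree above (myfile.txt is a placeholder name):
# list the contents of the current directory
ls .
# list the contents of the parent directory (Documents)
ls ..
# copy a file from the current directory to your home directory
cp myfile.txt ~
# move up to the parent directory, then jump to your home directory
cd ..
cd ~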
The organization of the filesystem is not set in stone. However, there is a standard that many UNIX operating systems follow called the Filesystem Hierarchy Standard (FHS). The FHS defines a standard filesystem layout for greater uniformity and easier documentation across UNIX-like operating systems.
Some of what the FHS dictates includes:
/ must have everything to boot, restore, recover, or repair a system.
/etc holds configuration files for the system and programs present on the system. For example, if I install the SSH server, I can reasonably expect configuration files for it to be found at /etc/ssh/.
/home holds users' home directories.
/tmp holds temporary files that can be deleted on reboot.
/bin holds essential programs that are needed for system recovery.
/usr/bin holds non-essential programs.
Thanks to the FHS, you can expect most UNIX-like operating systems to look like this.
A basic UNIX command has 3 parts. These are:
command name
options (zero or more)
arguments (zero or more)
In general, UNIX commands are written as single strings followed by the return key. Note that UNIX commands are case-sensitive, i.e. it does matter whether you type a letter in uppercase or lowercase.
wc -l myfile.txt
The example above has the following components:
the command name (wc)
one option (-l)
one argument (myfile.txt)
Also called “switches” or “flags”.
Specify the way a given command must work.
Preceded by one or two dashes (one dash if the switch is a single character, two dashes otherwise).
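For example, many commands accept the same option in a short single-dash form and a long double-dash form; a small sketch using GNU ls (the long option names vary from command to command):
# short option: one dash, single character
ls -a
# equivalent long option: two dashes, full word (GNU ls)
ls --all
# several single-character options can be combined after one dash
ls -la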
Tab completion is a useful feature of UNIX whereby the shell automatically fills in partially typed commands or filenames when the Tab key is pressed. This is advantageous in many ways:
Commands or filenames with long or difficult spelling require fewer keystrokes to reach
In the case of multiple possible completions, the shell will list all filenames beginning with those few characters.
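For example, assuming the current directory contains a file with the (hypothetical) name arrayAnnotations_2024.txt:
# type the start of the name and press the Tab key...
cat arrayA<Tab>
# ...and the shell completes it to
cat arrayAnnotations_2024.txt
# if several names share that prefix, press Tab twice to list the candidates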
Another useful feature is the fact that the shell remembers each typed command, so you can easily recall and run previous commands.
By using the up and down arrows at the prompt, you can revisit the command history and recall previously typed commands.
Previous commands can be re-executed as they are or can be edited using the left/right arrow.
Use the history command to specify how many and which previous commands to display.
# display last typed commands
history
#display the last n typed commands
history n
#execute command n
!n
#execute last command in the list
!!
# recall and execute the n-th last command
!-n
Examples:
# recall the last 5 typed commands
history 5
# recall and execute the 3rd last command
!-3
# recall and execute command number 120
!120
Until now, all the examples of using a shell that we've seen have used our keyboard input as STDIN and the terminal screen as STDOUT. However, the shell gives us the ability to change STDIN and STDOUT by using two mechanisms: pipes and redirectors.
A redirector is denoted by a >. A redirector redirects the output of a command into a file, essentially changing the STDOUT to a file rather than the terminal screen.
For example, echo 'Hello, World!' usually outputs the message "Hello, World!" onto our terminal screen. However, if we run
echo 'Hello, World!' > hello.txt
nothing will be printed to the screen. Instead, STDOUT is changed to the file hello.txt, so the output of echo 'Hello, World!' will be written into the file hello.txt.
We can verify that by running cat hello.txt, which should print out the contents of hello.txt to our terminal screen.
A redirector will always replace the contents of a file. If, however, you wish to append content to a file, you can use >>.
Using >> to append contents to a file:
echo 'Hello, again!' >> hello.txt
A pipe is denoted by a |. Pipes allow us to use the output of a command as the input of another command.
For example, let's say you wanted to search through the contents of cities.csv for every line that includes "TX". You can do so by printing the contents of the file using cat, then piping the output into grep "TX".
cat cities.csv | grep "TX"
This will print every line that includes "TX".
Piping is a very powerful feature of the shell, because it lets us compose many small operations into a very complex set of operations. For example:
# List the files in the current directory
# and print extended information
ls -l
# Format the output of ls -l by removing
# cases of multiple spaces and replace
# them with a single space for easier
# parsing
ls -l | tr -s " "
# Use the cut program to separate each line
# of output into colums, delimited by a space
# then print columns 3-4. In the case of this
# output, we're printing the users and groups
# which own the files in the directory
ls -l | tr -s " " | cut -d" " -f3-4
# Alphabetically sort the users and groups which
# own the files in the directory
ls -l | tr -s " " | cut -d" " -f3-4 | sort
# Count the instances of users and groups so we can
# know who owns files in the current directory and
# how many files they own
ls -l | tr -s " " | cut -d" " -f3-4 | sort | uniq -c
The UCSC table browser allows execution of complex database queries without knowledge of database query languages.
Open the UCSC browser and activate the "Dec. 2013 (GRCh38/hg38)" human assembly.
Enter the position chr4:133149103-133195252 and click submit.
Click the "Tools-->Table Browser" link in the upper blue bar to open a Table Browser page.
Note the organism and assembly controls (red box)
Note the data type controls (blue box) that are associated with different kinds of data (browser tracks).
Note the region selection option
Set group to "Genes and Gene Prediction Tracks", track to "GENCODEv24", and table to "knownGene".
Restrict region to chr4:133,149,846-133,195,995
Locate the output format menu and select "BED - browser extensible data", then click the "summary/statistics" button at the bottom of the page.
This provides details about your query results and can be very useful for data summaries and as a guide to ensure your query is behaving as desired.
The count of genes returned should be 3. Note that only 2 are visible in the browser. This is due to the default "Basic" display of the new known genes set, GENCODE v24. For more information see here.
Return to the Table Browser and click the "get output" button to view the bed format version of the data.
Switch the output format to "GTF" and compare the results
Note that all different kinds of data can be accessed and downloaded in ways similar to this although the available output formats may be different.
Zoom to the 5' end of PCDH10 (chr4:133,148,848-133,150,064) and experiment with Comparative Genomics --> Cons 20 Mammals track. Note the base level conservation scores from output format "data points".
Request output in "custom track" format and view the data with "Full" setting in the browser.
Note the high-scoring region just upstream of the 5' end that also overlaps with a DNase I hypersensitivity peak and an area conserved in fish and lamprey.
Turn on the CpG islands track in the Regulation group and observe the overlap with the conserved region and DHS. Open the Encode Regulation track and turn on all the other available tracks by setting them to "pack". In addition to those other features, there is also a valley in the histone marks. Together these data are consistent with this region having a role in the regulation of expression of PCDH10.
Modes of Support
Individual Training
Project Work
$90 per hour
$1200/0.1FTE/mo
Requesting Support
Send email to Charlie Whittaker or Stuart Levine to arrange for a free outreach consultation.
Project Tracking and Billing Handled with ilabs
Core users must have an iLabs account and valid cost objects in place. The BCC/BMC core must provide a cost estimate, and the project has to be approved by the faculty member before work begins.
UNIX filesystems have a feature called symbolic links, which are files that point to another file, called the target. They're similar to shortcuts in Windows.
There are two kinds of symbolic links: hard links and soft links.
Hard links create a file which points to the same space on the hard drive as the file which is being linked to, rather than to that file itself.
/home/me/file /home/you/file
| |
hard link hard link
| ,'
V _.-`
HARD DRIVE <....--``
A file won't be deleted until every hard link to it is deleted.
To create a hard link, you use the ln command with the source file and the target hard-linked file:
ln arrayDat.txt arrayHard.txt
ls
Notice that any changes you make to arrayDat.txt will be reflected in arrayHard.txt.
Soft links create a file which points to the original file or directory.
/home/me/file <---Soft link--- /home/you/file
|
|
V
HARD DRIVE
Soft links break if the original file is deleted.
To create a soft link, you use the ln command with the -s flag.
ln -s arrayDat.txt arraySoft.txt
ls -l
Notice that any changes you make to arrayDat.txt will be reflected in arraySoft.txt. The ls command will show that arraySoft.txt points to the file arrayDat.txt.
Your Luria home folder has a couple of soft links automatically set up pointing to storage servers so that common programs don't take up too much space on the head node.
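You can spot those links with ls -l: a symbolic link shows an l in the first permissions column and an arrow from the link name to its target. The listing below is only an illustration; the actual link names and targets in your home folder will differ:
ls -l ~
# lrwxrwxrwx 1 user mygroup 26 Jan  1 12:00 data -> /net/bmc-pub14/data/mylab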
Luria has multiple versions of R available through the environment modules system. Sometimes, R packages install normally. However, other times there may be incompatibilities between the versions of R installed on Luria and the R package you're trying to install or a dependency that an R package needs is not installed on Luria.
In cases like this, you'll likely need to use R installed some other way. There are two ways: using a Singularity image with R built in, or installing R in a Conda environment. The first way is preferable.
The Rocker project provides container images that come pre-installed with R. Each image is geared toward a specific use case. The r-ver images can be used when you only need the R command line. The rest of the images come pre-installed with RStudio. If you want to run RStudio, not just the R command line, refer to the following page.
To use one of these images, follow these instructions on Luria:
R is available as a package on the conda-forge channel. You can create a Conda environment, then install R into this environment.
Open the UCSC Genome Browser home page.
The horizontal blue menu provides access to the tools.
Particularly useful links are Genomes, Genome Browser, Tools and My Data.
Clicking on the Genomes link will lead to a page that allows selection of organism and genome assembly.
Position or search term queries - examples of how to query the database.
Assembly details - General assembly information
Summary Statistics - Detailed statistics of the active assembly
Proceed to the genome assembly by clicking "Go"
Search for the gene Protocadherin 10 by pasting that name into the "position or search term" box and clicking "submit".
The genome browser is the central visualization tool of the UCSC Genome Informatics Group.
Most everything is hyperlinked and adjustable
Restore the visualization to default by clicking the "default tracks" button under the image. Note that settings from previous sessions with the browser may be retained by your browser.
Positional and scale controls are located above and below the browser image (red boxes)
Zoom and movement buttons
Coordinate window
Coordinate bar in view - click re-centers and zooms in 3x
Dragging across the upper track is a zoom to selection function.
Move start and end functions
Clicking in window and dragging re-positions the view
Clicking the chromosome ideogram re-positions the window to selected part of the chromosome
Data organization and track controls (blue boxes)
Like the search results, the information displayed in the browser window is organized into groups and tracks of data. The kinds of tracks being displayed and the ways they are presented can be controlled.
NOTE: This human assembly is mature, heavily annotated and extensively analyzed. The genomes of other organisms with less mature assemblies have less associated information and therefore fewer tracks.
The buttons "default tracks" and "hide all" will do as their name suggests.
The "configure" button displays a list of the available data with display options.
The bottom of the page consists of a series of pull-down menus. Each is associated with a type of data and the options control display mode.
IMPORTANT: Useful details about each track can be obtained by clicking the hyperlinked track name.
Navigate to chr4:133,101,179-133,262,320 using the position window.
In the "Genes and Gene Predictions Tracks" section, use the pull-down menu to set the "GENCODEv24" track to full.
Click the "refresh" button. Toggle the GENCODEv24 track between dense and full using the pull-down menu and refresh button.
For reference, the effects of these different options are described
Note: After changes are made to the pull-down menus, a refresh is required for them to take effect.
The visual cues that you see in the "Full" display of known genes are:
Arrowheads indicate the strand of the gene.
Thin lines are introns, medium lines are non-coding portions of the mRNA, thick lines are coding region.
Zoom in to display the region around the 5' end of the PCDH10 gene using coordinates chr4:133,148,483-133,162,403
Setup - Connect to your luria account and run the following commands to create a conda environment called jupyter, activate that environment and install the necessary software
module add miniconda3/v4
source /home/software/conda/miniconda3/bin/condainit
conda create --name jupyter
conda activate jupyter
conda install -c anaconda jupyter
Running jupyter notebooks -
On both mac and PC:
login to luria and run:
srun --pty -n 16 bash
to get all of the resources of a single cluster node in the normal queue. You need to note the cluster node being used. It will be something like: c##
then cd into your data directory:
cd /net/host/data/share/users/kerberosID
NOTE: the path will vary by user and PI. The necessary information for host, share and kerberosID should be visible by running the following command from inside your luria home:
ls -lat | grep data
activate the jupyter conda environment with:
module add miniconda3/v4
source /home/software/conda/miniconda3/bin/condainit
conda activate jupyter
From this cluster node, with the jupyter conda environment active, execute the command:
jupyter notebook --ip=0.0.0.0 --port=12123
The 12123 is an arbitrary 5-digit port number; you can change it to almost anything between 10000 and 60000, you just need to keep track of what you use.
You should see a lot of output stream across the screen, leave that window open and running. There will be information in the output that you will need for the next step.
For Macs, you need to open up a new terminal window and run:
ssh -t [email protected] -L 12123:localhost:12123 ssh cXX -L 12123:localhost:12123
the cXX needs to be replaced with the node that you are running on
this will create a 2nd “tunneled” connection to the luria cluster node that is running your jupyter notebook server.
For PCs running SecureCRT, you just need to modify the existing connection. This is done by selecting Options → Session Options and the Port Forwarding item
In there, you need to click Add and add a tunnel according to slide 9 of:
Your port and your socket value will both be the 5-digit number you selected, and you need to get the node correct.
Once those connections are set up, you can open a web browser on your mac and go to one of the URLs provided in the output of the notebook run.
There will be a couple of options for URLs; use the one that starts with something like:
...
If all is well, a jupyter notebook that is actually running on the cluster node will appear in your browser and the directories in your storage will be available.
In addition, I want to point out another computational resource at MIT called Engaging OnDemand:
This is a different computer that you have access to and it also offers jupyter notebooks with more substantial computational resources.
1. Conda is an open-source package management system and environment management system.
2. Package, dependency and environment management for any language---Python, R, Ruby, Lua, Scala, Java, JavaScript, C/C++, FORTRAN
3. Comparison of pip and conda
Pip installs Python software packaged as wheels or source distributions, which may require compilers and libraries. Conda packages are binaries. No need for compilers.
Conda packages are not limited to Python packages.
Conda can create isolated environments. Pip has no built-in support for environments but rather depends on other tools like virtualenv or venv to create isolated environments.
Pip installation does not check dependencies of all packages. Conda installation will verify that all requirements of all packages installed in an environment are met. This check can be slow.
A package may not be available as a conda package but may be available on PyPI, where it can be installed with pip (see the sketch after this list).
4. Miniconda is a free and mini version of Anaconda.
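A minimal sketch of how the two tools are typically combined: conda creates and manages the isolated environment, and pip is used inside it for anything only published on PyPI. The package names below are only examples:
# create and activate an isolated environment with a specific Python
conda create --name mixed-env python=3.10
conda activate mixed-env
# install a binary package from a conda channel
conda install numpy
# fall back to pip for a package that is only on PyPI (placeholder name)
pip install some-pypi-only-package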
1. Load miniconda3 module
Some online conda documentation suggests running conda init to initialize the conda environment. Please do not run this, as it will add conda settings to your .bashrc file, which may pollute your environment. You only want to activate the conda environment when needed, not by default.
2. Managing environments
Conda allows you to create separate environments containing files, packages, and their dependencies that will not interact with other environments. When you begin using conda, you already have a default environment named base. To see available packages in the base environment
You can run the command "python -V" to see what is the python version in the base environment. You can't add your own programs into the base environment because you will not have permissions. Create your own separate environments to keep your programs isolated from each other. To see available conda environments
Create a new environment and install a package in it. We will name the environment snowflakes and install the package BioPython. Type the following:
Run python and then import Bio to confirm biopython can be imported. Then run conda deactivate to leave the snowflakes environment.
To delete an entire environment.
3. Show conda environment info
4. Exercise 1: install samtools
If you run "module avail samtools", you will find the latest module available is 1.10. Now you are going to install a newer version of samtools using conda. First create an environment named samtools
Search for "samtools" in the conda package index; this will lead you to a page with installation instructions. Follow the instructions there. Run the samtools command to confirm it worked.
What if you want to install a different version of samtools? You can change the syntax to conda install -c bioconda samtools=1.11. This will downgrade your samtools from 1.12 to 1.11.
What if you want to keep both versions? HINT: create a separate conda environment named samtools-1.11, as sketched below.
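One way to follow that hint (a sketch; the bioconda channel provides samtools, and the exact version strings available may differ):
# make a dedicated environment for the older version
conda create --name samtools-1.11
conda activate samtools-1.11
# install the specific version from the bioconda channel
conda install -c bioconda samtools=1.11
# confirm which version is now on your PATH
samtools --version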
5. Exercise 2: install umi_tools
Next you want to install a python package named umi_tools. Search for umi_tools in the conda package index to get installation instructions.
You can download the conda cheat sheet from the conda documentation, or search online for "conda cheat sheet".
Multiple versions of various software packages are available on luria.mit.edu. Versions are managed using environment modules.
module commands
List available applications on the system:
Load an application so it can be used:
Specify version number of an application, e.g.
We recommend that you always include the version number when you module add, so that you know which version of the software is used. Otherwise, a running application may generate different results or break when the software is upgraded in the future.
Note: some packages are available only after you load the python (for python packages) or jre (for java packages) modules. For example, compare the output of the following:
with
R and R packages - Various versions of R are installed on luria. Each one has a collection of associated packages. Execute the following commands taking note of the comment lines:
Once inside R, check out the available R packages
NOTE: Those package version numbers may not be consistent from month to month and can change without warning.
When working with R, capture your session info with the command below so you can reproduce the environment if needed:
quit R
Once back in the shell, view the R data that remains
List loaded applications:
Unload an application:
Print module help:
Instead of running a single command at a time, you can combine multiple commands into a file to create a script. Scripts can simplify multi-step processes into a single invocation, saving you time in the long-run.
Writing a script is as simple as writing a file. Usually, we make the first line of the script #!/bin/bash to tell the shell to use bash when running the script.
You can create variables in a bash script using the = operator. So to make a variable named myname, you'd write myname="Allen". To use the variable later in the script, you'd prefix it with a $. For example, $myname.
You can also ask for a user's input and store that into a variable by using the read command. Preface it with echo "question?" to give context for what the user is inputting. For example:
You can run shell commands inside of a script and store their results in a variable. To do so, you wrap the command with $(). For example, to get the size of a file and store it in a variable, you'd do file_size=$(du file) (note that there must be no spaces around the = sign).
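A quick, self-contained illustration of command substitution (file.txt is a placeholder name; du -k reports the size in kilobytes):
# store the size of file.txt, in KB, in a variable
file_size=$(du -k file.txt | cut -f1)
echo "file.txt uses ${file_size} KB"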
Knowing this, let's create a script that looks through the files in a directory, and moves any files above a certain size to a new folder. Name it sizewatcher.sh
Once you've created and saved this file, make sure to modify its permissions to let it be executed. An easy way of doing this is running the following command:
Then run the script:
Let's create another script that asks a user for a directory, then sorts the files in that directory into folders that correspond to the year and month that the file was created. Name it sorter.sh
:
Make it executable and then run it.
echo "What is your name?"
read name
echo "Hello, $name"
#!/bin/bash
# Make variables for directory to loop through
# and directory where files should be moved.
directory="/home/asoberan/unixclass"
new_directory="/home/asoberan/bigfiles"
# If the new directory doesn't exist, then
# create the directory
if [ ! -d "$new_directory" ]; then
mkdir $new_directory;
fi;
# Loop through the files in the directory.
# Store the size of the file (in KB), in
# file_size. If the file size is greater
# than 400000 KB, move the file to the
# new directory
for file in "$directory"/*;
do
file_size=$(du "$file" | cut -f1);
if [ $file_size -gt "400000" ]; then
mv $file $new_directory;
fi;
done;
chmod +x sizewatcher.sh
./sizewatcher.sh
#!/bin/bash
# Ask for directory and store it in a variable
# called "directory"
echo "What directory to check?"
read directory
# Check if the provided argument is a directory
if [ ! -d "$directory" ]; then
echo "Error: $directory is not a directory."
exit 1
fi
# Iterate over files in the directory
for file in "$directory"/*; do
# Check to see if the file is really file and
# not a directory, etc.
if [ -f "$file" ]; then
# Get the year and month of creation for the file
year=$(date -r "$file" +%Y)
month=$(date -r "$file" +%m)
# Create directory for the year and month if it doesn't exist
mkdir -p "$directory/$year-$month"
# Move the file to the corresponding directory
mv "$file" "$directory/$year-$month"
echo "Moved $file to $directory/$year-$month"
fi
done
echo "File sorting complete."
module load miniconda3/v4
source /home/software/conda/miniconda3/bin/condainit
conda list
conda env list
conda create --name snowflakes
conda env list
conda activate snowflakes
conda install biopython
conda list
conda deactivate
conda remove --name snowflakes --all
conda info
conda create --name samtools
conda activate samtools
module avail
module add
module add tophat/2.0.12
module avail
module add jre/1.6.0-29
module add python/2.7.2
module avail
# Load an R version
module add r/2.15.3
# Start R
R
#List R packages and direct the list into an object called "a"
a<-installed.packages()
#Display the dimensions of the object "a"
dim(a)
#List the parts of object "a" that have package name and package version
a[,c(1,3)]
sessionInfo()
q(save="yes")
#enter y at prompt
ls -lat | head
cat .Rhistory
module list
module del
module help
Globus is a robust data management platform designed to facilitate the secure and efficient transfer of large datasets across institutions and collaborators.
In order to make use of the Globus platform, you will need to create an account if you have not done so already. Most users will do so by using their institutional login (e.g., MIT Kerberos) by clicking the 'Log In' button in the top-right hand corner of the Globus website. After providing your consent, your new Globus account will be tied to your full name and email address as it appears at your institution.
If you are not affiliated with any of the organizations or research institutions listed, you can create your own unique Globus login by filling out the Globus ID form. This login will take the form [email protected], and will have its own login credentials associated with it.
There are several public guest collections available in Globus, through which you will be able to access generated core data:
Once you have created a Globus account, you can reach out to the core responsible for your data to request access to one of these collections, and a link will be shared with you via email after approval.
Until you are provided access, you will not be able to browse the collections via the links above.
If you require assistance with moving data to or from an external collaborator and local KI storage, we can help to facilitate that. This is the information that we'll need from you:
The volume (size, number of files) of data being moved.
What is the requested delivery date? This should be at least two weeks in advance in order to accommodate network and filesystem performance.
Where does the incoming data need to go (e.g., /net/bmc-lab2/...)?
Who will be needing access to write to the specified path? This can be a member of your team or your collaborator's team. For information on logging in to the Globus platform, please see above.
Which storage host is the outgoing data located on, and what is the absolute path to it (e.g., /net/bmc-lab2/...)? This will depend on your lab association.
Does the destination institution use Globus? If so, has the destination Globus collection been created yet? If not, the data can be transferred to local storage using Globus Connect Personal, which will need to be configured prior to the transfer.
Who will be needing access to read from the specified path? This can be a member of your team or your collaborator's team. For information on logging in to the Globus platform, please see above.
For additional assistance or questions, please reach out to [email protected].
quota
Luria has a disk usage quota on the head node. You can use the quota command to check what the size of the quota is and how much space you're currently using.
The -s flag will display the quota in a human-readable format (e.g. 18K, 14M, 65G).
quota -s
Disk quotas for user asoberan (uid 247789):
Filesystem space quota limit grace files quota limit grace
/dev/mapper/cl-home
4391M 18432M 100G 24925 0 0
# Space column shows how much space you're currently using on the head node
# Quota column is total amount of space allotted to you
du
Stands for "disk usage". Reports how much disk space a directory is taken up.
Defaults to displaying the disk usage in kilobytes, but passing it the -h
flag will return disk usage in a human-readable format. (e.g. 18K, 14M, 65G).
Will search the entire depth of the Unix tree starting from the directory you give it. You can control the depth which it searches by passing the -d <num>
flag.
# Displays the disk usage of every file under your home directory
# The last entry will be the disk usage of your entire home directory
du -h ~
# Displays the disk usage of every file and directory immediately
# under your home directory
du -h -d 1 ~
# Displays the disk usage of one file, file.txt
du -h file.txt
tar
Combines multiple files or directories into one archive for easy sharing. Similar to "zipping" files; however, tar does not compress by default.
Create an archive by passing the -cf flags.
Can compress multiple files/folders using gzip by passing the -z flag when creating a new archive. Useful when you have files that take up a lot of space and you want to save space.
Extract an archive by passing the -xf flags. Un-compress an archive by passing the -z flag along with those two flags.
# Create an archive of a directory and name it my_directory.tar
tar -cf my_directory.tar <directory>
# Create a compressed archive of a directory and
# name it my_compressed_directory.tar.gz
tar -czf my_compressed_directory.tar.gz <directory>
# Extract archive my_directory.tar
tar -xf my_directory.tar
# Uncompress the archive my_compressed_directory.tar.gz
tar -xzf my_compressed_directory.tar.gz
zip
Zip multiple files and directories into one file. Zipping is similar to archiving with tar, but zipped files are easier to deal with on Windows machines.
# Zip a directory to the file my_directory.zip
zip my_directory.zip <directory>
# Unzip my_directory.zip
unzip my_directory.zip
module load singularity/3.10.4
# Replace 4.2.0 with the version of R you need.
# The first time this runs, Singularity will pull
# the image, which may take around 5 - 10 minutes.
# Subsequent times will be faster.
singularity exec docker://rocker/r-ver:4.2.0 R
# You will now have an R 4.2.0 session running. Run
# install.packages("YOUR PACKAGE") and type "y" if
# R prompts you to create a user library
R>
module load miniconda3/v4
source /home/software/conda/miniconda3/bin/condainit
# Name the environment as you please
conda create --name r_environment
conda activate r_environment
# Replace 4.4.1 with the version of R you
# need. Make sure to check if conda-forge
# has the version packaged.
conda install -c conda-forge r-base=4.4.1
R
# You will now have an R 4.4.1 session running. Run
# install.packages("YOUR PACKAGE") and type "y" if
# R prompts you to create a user library
R>
This tutorial shows how to map Illumina reads to a reference genome with Bowtie and perform basic manipulations of the resulting SAM/BAM alignments.
1. Load fastq file and annotate the uploaded data
On the Tool Panel, click on Get Data → Upload File.
This tool allows you to upload data from a file, a URL, or a text box.
Select Galaxy_GM12878_trimmed.fastq as input file.
Select "fastqsanger" as file format.
Select "hg18" as genome.
Click the "Execute" button.
2. Map to reference genome (hg18) using Bowtie
On the Tool Panel, click on NGS Toolbox Beta → NGS Mapping → Map with Bowtie for Illumina.
This tool allows you to run the aligner Bowtie. The output is a SAM file with all the read alignments.
Select hg18 canonical as reference genome.
Leave other settings as default.
Click the "Execute" button.
3. Filter SAM file on bitwise flag values
On the Tool Panel, click on NGS Toolbox Beta → NGS SAMtools → Filter SAM on bitwise flag values.
This tool filters a SAM file based on bitwise flag values.
Select data 2 as input dataset.
Add new flag with type set to "the read is unmapped" and the value set to "No".
Add new flag with type set to "read strand" and the value set to "Yes".
Click the "Execute" button.
With these parameters, the resulting output consists of those reads that are properly mapped and are on the reverse strand.
4. Find how many reads map to each chromosome
On the Tool Panel, click on Join, Subtract and Group → Group →
This tool groups a dataset by a chosen column and applies aggregate operations to other columns.
Select data 2 (bowtie output) as input dataset.
Group by column 3 (reference name, i.e. chromosome name)
Add new operation: count on column 1 .
Click the "Execute" button.
With these parameters, the resulting output lists the number of reads mapping to each chromosome.
Edit the attributes of this module by clicking on the pencil icon: rename it as "read distribution by chromosome".
5. Find the most represented chromosome
On the Tool Panel, click on Filter and Sort → Sort data →
This tool sorts a dataset by a selected column.
Select the column representing the key to sort on.
Select "Numerical sort" in "descending order" as options.
Click the "Execute" button.
With these parameters, the results show that chr19 is the most represented chromosome.
6. Convert SAM to BAM
On the Tool Panel, click on NGS Toolbox Beta → NGS Samtools → SAM to BAM converter.
This tool converts SAM-formatted files into BAM-formatted files.
Select the SAM file to convert.
Click the "Execute" button.
7. Compute general statistics via Flagstat operation
On the Tool Panel, click on NGS Toolbox Beta → NGS Samtools → flagstat.
This tool provides a simple summary of a BAM-format file.
Select data 6 (BAM file) as input.
This tutorial shows how to perform basic QC on Illumina data, such as basic quality statistics, quality score boxplots, trimming and masking.
1. Load a fastq file and annotate the uploaded data
On the Tool Panel, click on Get Data → Upload File.
Browse to your home folder and select the file Galaxy_GM12878.fastqillumina.
Set the format to "fastqillumina".
Click the "Execute" button.
Once the job is completed, click on the pencil icon to edit the attributes.
Set the Database field to "Human Mar 2006 (NCBI36/hg18)".
Click the "Save" button.
2. Convert to Sanger FASTQ format
On the Tool Panel, click on NGS Toolbox Beta → NGS: QC and Manipulation → FASTQ Groomer.
This tool converts between various FASTQ quality formats.
By default, the quality format output is Sanger FASTQ.
Sanger FASTQ is the required format for downstream analyses in Galaxy.
Set the input type to "Illumina 1.3+"
Click the "Execute" button.
Once the job is completed, click on the pencil icon and edit the name of the job as "GM12878 fastqsanger".
3. Compute Quality Statistics
On the Tool Panel, click on NGS Toolbox Beta → Fastx-Toolkit → Compute Quality Statistics.
This tool computes quality statistics (min, max, mean, median, Q1, Q3, IQR, etc.) of the quality scores.
Select Data 2 as input library.
Click the "Execute" button.
4. Draw Quality Score Boxplot
On the Tool Panel, click on NGS Toolbox Beta → Fastx-Toolkit → Draw Quality Score Boxplot.
This tool creates a box graph of the quality scores in the library.
Select Data 3 as statistic report file.
Click the "Execute" button.
Once the job is completed, click on the eye icon to see the boxplot figure. You can expand and collapse the figure by clicking on the arrows placed on the sides of the main panel.
5. Trim Sequence Reads to length of 60 bases
On the Tool Panel, click on NGS Toolbox Beta → FASTQ Trimmer (by column).
This tool trims the end of the reads.
Select Data 2 as input FASTQ file.
Set the offset from 5' end to 16.
With these parameters, the reads are trimmed to a length of 60 bases.
Click the "Execute" button.
Once the job is completed, click on the pencil icon to edit the attributes of the resulting data and change the name to "GM12878 Trimmed fastqsanger".
6. Apply Quality Masker to bases with quality lower than 20
On the Tool Panel, click on NGS Toolbox Beta → FASTQ Masker.
This tool allows masking base characters in FASTQ files according to quality score value and comparison method.
Select Data 2 as input file to mask.
Set the criterion as "less than" and the threshold to 20.
Click the "Execute" button.
With these parameters, any base with quality less than 20 will be masked with a symbol "N".
7. Apply FASTQ Quality Trimmer
On the Tool Panel, click on NGS Toolbox Beta → FASTQ Quality Trimmer (by sliding window).
This tool allows trimming the ends of reads based upon the aggregate value of quality scores found within a sliding window. Several criteria can be used to determine the aggregate value (min, max, sum, mean) within the sliding window.
Select Data 2 as input file.
Select "Trim 5' end" only from the scroll down menu.
Set window size to 3.
Select "max score" as aggregate action.
Select ">= 2" as criterion for trimming
Click the "Execute" button.
8. Create a Data Subset by selecting the first 2,500 sequence reads
On the Tool Panel, click on Text Manipulation → Select first lines.
This tool selects the first N lines of the input dataset.
Set to 10,000 the number of lines to select.
Select data 5 as input.
Click the "Execute" button.
With these parameters, the first 10,000 lines of the input FASTQ file are selected, corresponding to the first 2,500 sequence reads.
Annotating Genomic Sequence with Argo
Argo is a genome browser and annotation tool written by Reinhard Engles at the Broad Institute. It combines powerful data display functions with the ability to create and edit genomic features.
Start Argo. You can use the web start or download the jar and double-click it.
The data files you need for this exercise are located HERE. Save each file to your working directory.
Load the sequence into Argo by selecting the "File-->Open Sequence File" menu item. The protocadherin 10 gene sequence is in the file "pcdh10.gene". In the sequence file format pull-down menu, select Fasta. In the sequence range window, accept the default of the entire sequence range. In the Feature Map Track Table, click the "Load Tracks From File..." button. Select "known.gff". In the Track File Format window, accept the default of "gff".
A line will appear in the track table representing the gff file you just loaded. If you click on the colored square, you can change the color of the track; pick dark blue. Then click OK.
The resulting feature map shows the 3 known gene isoforms for PCDH10. Select one of the isoforms.
This activates the feature inspector window in the lower left-hand corner of Argo. The Properties tab describes the feature you have selected, the DNA tab contains the DNA sequence (exons in blue, introns in black), the mRNA is found under the mRNA tab, and the protein translation of the mRNA is in the Protein tab. Note: Right-clicking (command-click for Macs) while hovering over the mRNA sequence will allow you to adjust the reading frame.
A right-click while hovering over the Feature map will open a function menu. Select the option "Track Table". This opens the same tool that appeared while we were loading the sequence. Click the "Load Tracks From File..." button.
Select the file "cad.blastx" and hit OK. This results in a Track File Format menu, select the Blast (standard) option. cad.blastx contains standard blast output (look at it here) in text format. The query sequence was the genomic DNA we are displaying, the database was a set of protein sequences related to the gene we are working on. Therefore, after hitting the OK button you will see a "Blast Feature Coordinates" dialog box. The answer to this question is no because of the subject sequences were the protein, not the DNA in argo. This will add the blast results to the Feature map. When you select a blast hit, information about that hit appears in the feature instpector.
Control the way the blast data is displayed by opening the track table and clicking on the area next to the blast track and under the style section. Change it to "Bar Graph".
Using similar methods, load the files cpg.gff and est.gff. All are in gff format.
The resulting display may place the known genes near the top of the page. This is not ideal; open the track table and change known genes to "Segregated".
In the UCSC browser, we identified the EST DA219615 as evidence for an uncharacterized PCDH10 splice variant. Search for this feature using the tool on the lower right-hand side of Argo. The search tool will result in selection of that feature.
You can inspect details about the feature by right-clicking while hovering over the feature map. For example, select the option "Splice Site Profile". This opens a window showing the splice junctions for the selected feature.
Create a gene model based on this evidence using the "Edit --> Insert Compound Feature" menu item. This will open a window showing the coordinates of the feature. Give it the Gene name "PCDH10" and the transcript name "iso3". Next you will be asked for a file to save the data. Create a new gff3 file called iso3 and click OK.
The result is a feature inserted into the map that is identical to EST DA219615. It is located at the very top of the map. Move its display position down using the track table; Segregated is a good choice.
The purple color indicates a pending insert. Save it using "Edit --> Save".
Examine the 3' end of the gene model. While hovering over the feature map, hold down the Shift key. A magnifying glass will appear. Use the magnifying glass to draw a small box around the 3' exon of the inserted model.
Compare the exon to the EST BG201062. The EST contains more sequence and is likely to represent the real 3' end of the gene. Select BG201062 and drag it to the end of your inserted model. You will be given a transcript extension dialog box. Select Merge Right. Now your edited feature should be highlighted in orange. This indicates a pending edit. Save the new model. You will get an error, but that is OK for now.
Zoom out to view the full gene by Shift-Right Clicking several times. Or use the Zoom to 100% function.
Next, add the 5' end of the gene. Select one of the longest known gene models and drag it to your edited feature. This time accept the merge left option. Save the model; you will now get 2 errors, accept them as well.
Repair the gene model using the Feature Inspector window on the lower left. Select your gene model, then click the DNA tab. The letters in DNA will turn red, indicating that the feature has non-canonical (i.e. non-GT-AG) splice junctions. These will be highlighted in red in the sequence pane. Scroll down until you find the first one. It is located at the 5' end of exon 3. Highlight a couple of lines before and after the red letters. Notice the yellow bar in the feature map. Zoom in on the troublesome exon. The exon is different from the known genes because its 5' end hangs over a little. Select the known exon and drag it to the working model. Select the replace option and save. This will repair the first problem.
Next, zoom in on the 3' exon of the working model. We have an exon that is a few bases off because of the EST-based extension we used to create the model. Go to the DNA tab of the Feature Inspector, scroll to the bottom. Note the non-canonical junction highlighted in red. Right next to the red highlight is the sequence AG that we need. Highlight the red letters and sequence up to and including the AG and do a right-click. One of the available options is "Make Highlighted Sequence Intronic". (WARNING: Mac users without a 2+-button mouse may have trouble with this operation - you can use the drag and drop features to get this exon right) Select this option then save the model. At this time, it should save without errors.
Go to the mRNA tab, right click and select the option "Select ORF > Select Longest Start to StopCodon ORF". Go to the Protein tab to see the protein sequence.
Note the green start and red stop codons indicated in the feature map. Also note the consensus poly-A signal analysis available from the right-click menu.
The entire feature map can be exported as an image for use in presentations using the "File --> Export Map as Image..." menu item.
Select the hand-edited isoform, return to the mRNA tab in the feature inspector, do select all and copy, then return to the section on the UCSC browser. We will start by pasting the mRNA sequence from this annotation effort into a Blat search window.
Monday July 8, 2024
2:00pm-4:30pm
Allen Soberanes
Building 76
TBD
Introduction to Unix
Tuesday July 9, 2024
2:00pm-4:30pm
Allen Soberanes
Building 76
TBD
Advanced Utilization of IGB Computational Resources
a large number of life sciences software implemented in R
an extensive collection of experimental and annotation data that relate to the analysis software.
The packages are written, documented and supported according to consistent standards.
Each package has a website. For example see:
Both PDF documentation and example R scripts exist.
Instructions on joining the mailing list
IMPORTANT: Carefully read the posting guide and follow the instructions.
The list is high-traffic and widely read so be careful with postings.
#Connect to rous.mit.edu and start R
R
#check available packages:
library()
#load affydata
library(affydata)
#Run some of the commands
data(Dilution)
ls()
Dilution
class(Dilution)
expressionData<-exprs(Dilution)
class(expressionData)
expressionData[1:3,]
log2(expressionData[1:3,])
round(log2(expressionData[1:3,]), digits=2)
The following series of commands can be used to process array data with gcrma and do differential expression testing with LPE:
library(affy)
library(gcrma)
library(LPE)
library(affyPLM)
#Set the working directory to the location of your CEL files.
setwd("/Path/TO/CEL_Files")
#Import Data
#Order of samples in the resulting matrix can be specified by the order in the list.
Name_Dat<-ReadAffy(
"Condition1a.CEL", "Condition1b.CEL", "Condition1c.CEL",
"Condition2a.CEL", "Condition2b.CEL", "Condition2c.CEL")
#RNA degradation Work
RNAdeg<-AffyRNAdeg(Name_Dat)
png(file="Name_rnaDeg.png", bg="white")
plotAffyRNAdeg(RNAdeg,cols=c(1:16))
dev.off()
#PLM work
pset1<-fitPLM(Name_Dat)
#RLE plot
png(file="rle.png", bg="white")
par(mar=c(3, 10, 3, 3))
RLE(pset1, main = "RLE for Name", horizontal=TRUE, las=2)
dev.off()
#NUSE plot
png(file="nuse.png", bg="white")
par(mar=c(3, 10, 3, 3))
NUSE(pset1, ylim= c(0.95,1.2), main = "NUSE for Name", horizontal=TRUE, las=2)
dev.off()
#Process the Data
Name_Exp<-gcrma(Name_Dat, fast=FALSE)
Name_Tab<-exprs(Name_Exp)
Name_Tab<-round(Name_Tab, digits=2)
write.table(data.frame(Name_Tab), sep="\t", quote=FALSE, file="Name.txt")
#Differential Expression Testing
#In editor, add "ProbeID" to top of first column and delete the affy control rows that start with AFFX
Name<-read.table("Name.txt", header=TRUE)
attach(Name)
names(Name)
set.seed(0)
#testing the columns by printing the first 3 rows for each condition
#var.Cond1
Name[1:3,c(2,3,4)]
#var.Cond2
Name[1:3,c(5,6,7)]
#LPE tests
var.Cond1<-baseOlig.error(Name[,c(2,3,4)],q=0.01)
var.Cond2<-baseOlig.error(Name[,c(5,6,7)],q=0.01)
lpeVal.Cond1.Cond2<-data.frame(lpe(Name[,c(5,6,7)], Name[,c(2,3,4)], var.Cond1, var.Cond2, probe.set.name = Name$ProbeID))
lpeVal.Cond1.Cond2<-round(lpeVal.Cond1.Cond2, digits=2)
fdrBH.Cond1.Cond2<-fdr.adjust(lpeVal.Cond1.Cond2, adjp="BH")
write.table(lpeVal.Cond1.Cond2, quote=FALSE, sep="\t", file="lpeVal.Cond1.Cond2.txt")
write.table(fdrBH.Cond1.Cond2, quote=FALSE, sep="\t", file="fdrBH.Cond1.Cond2.txt")
#clustering with pvalues
library(pvclust)
Name.pv<-pvclust(Name, method.hclust="ward",
method.dist="correlation", use.cor="pairwise.complete.obs",
nboot=1000, r=seq(.5,1.4,by=.1), store=FALSE, weight=FALSE)
png(filename="Name_PV.png", bg="white",width=960, height=500)
plot(Name.pv)
dev.off()
The Analyze Single Experiment workflow allows one experiment to be tested. Activate an experiment with the desired thresholds applied and select Analyze Single Experiment from the Workflows & Reports list.
Confirm Network object count and threshold settings and click Apply.
Results are returned by ontology; statistics are reported as -log10(p-value) (1.3 corresponds to p = 0.05).
Results can be examined by clicking on the name. See Legend for more information about maps and networks. Note: network objects included in filtered sets display expression results in image.
Different results link to different types of data.
Pathway maps are nice interactive maps
Process networks and network statistics produce editable network maps.
GO Processes produce lists that can be used to draw networks.
Diseases link to disease pages.
This workflow requires at least 2 active experiments. Activate all 3 ColonMet experiments and refresh the browser page to confirm.
Select the Enrichment Analysis link to launch the workflow. Ensure the correct experiments are activated and click Next.
Fold change, p-value, and signal directionality can be adjusted on the next page. NOTE: the number of network objects that meet threshold criteria is indicated. Clicking Apply launches the analysis.
Results are returned by ontology; statistics are reported as -log10(p-value) (1.3 corresponds to p = 0.05).
Results can be examined by clicking on the name. See Legend for more information about maps and networks. Note: network objects included in filtered sets display expression results in image.
Different results link to different types of data.
Pathway maps are nice interactive maps
Process networks produce editable network maps.
GO Processes produce lists that can be used to draw networks.
Diseases link to disease pages.
This workflow requires at least 2 active experiments. Activate all 3 ColonMet experiments and refresh the browser page to confirm.
Select the Compare Experiments link to launch the workflow. Ensure the correct experiments are activated and click Next.
Fold change, p-value, and signal directionality can be adjusted on the next page. NOTE: the number of network objects that meet threshold criteria is indicated. Clicking Apply launches the analysis.
Genes passing thresholds in each experiment are compared and grouped in 3 categories:
Unique
Similar - present in more than one list
Common - present in all lists
Results are returned by ontology; statistics are reported as -log10(p-value) (1.3 corresponds to p = 0.05).
Results can be examined by clicking on the name. See Legend for more information about maps and networks. Note: network objects included in filtered sets display expression results in image.
Different results link to different types of data.
Pathway maps are nice interactive maps
Process networks produce editable network maps.
GO Processes produce lists that can be used to draw networks.
Diseases link to disease pages.
Histograms can be sorted according to statistics:
Statistically significant
Differentially affected
Similarity by
Particular
Histograms are interactive; the network statistics panel reports results.
From the Start Page, select the Search and Browse Content link and pick the EZ search button
Click human FN1 from the results list
The results page is organized according to category; available information includes pathway maps and interactions.
On the Interactions page, compare the details for alpha2/beta1 binding and alpha5/beta1 binding.
From the Start Page, select the Search and Browse Content link and pick the Pathway Maps option.
This leads to a hierarchy of GeneGO pathway maps
Select Regulatory maps-->Regulation of cellular processes-->Cell adhesion-->Cell adhesion_Integrin inside-out signaling
NOTE: Data from activated experiments is displayed on the map.
The Overview tab describes details of the map and provides references.
The Processes tab lists significant GO processes in the map.
From the Start Page, select the Search and Browse Content link and pick the Process Networks option.
This leads to a hierarchy of GeneGO process networks
Select Process Networks-->Cell adhesion_ECM remodeling
NOTE: Data from activated experiments is displayed on the network.
The network is highly interactive.
From the Start Page, select the Search and Browse Content link and pick the Protein Groups & Complexes tool.
The tool links to a table of results; select Integrin Receptor Family (Human).
The upper left panel of the Start Page hosts the data management functions.
Click the "Start Page" icon or your browsers "refresh" button to make sure that the displayed data is current.
Double-click the "Experiments" folder. You should see "ColonMet_Faked" inside the folder.
Right-Click on the My Data folder and select Create Folder... Create a folder called May1Training
Drag ColonMet_Faked into the new folder. Right-click on the original folder to delete it. Note: folders can also be renamed.
Data can also be shared. Right-click on May1Training and select Share..., enter charliew and search. Select the username and click Add Selected to List. Note: types of permissions can be controlled.
The lower left panel of the Start Page displays the active data. Active data is available for analysis and can be controlled by the Activate/Deactivate icon, right clicking or drag-and-drop.
cat
Concatenates one or more files and prints the content to STDOUT.
Colloquially, it's used to print out the contents of a single file.
head
By default, prints the first 10 lines of a file to STDOUT.
You can pass the -n flag to specify the number of lines to print.
tail
By default, prints the last 10 lines of a file to STDOUT.
You can pass the -n flag to specify the number of lines to print.
less
Lets you page through a long file or stream of text (i.e. you can scroll).
Exit by pressing q
scp
Allows you to transfer files between a local and a remote host. It does this by leveraging ssh.
Shares the same syntax as the normal cp command.
The UNIX file system consists of files and directories organized in a hierarchical structure.
Avoid file names and directory names with spaces or special characters.
In the figure below, there are 5 subdirectories under the root, i.e. bin, tmp, home, usr, src. The directory home has two subdirectories (i.e. iap and paolaf). Thus the parent directory of iap is home.
The system provides a few short synonyms relative to the working directory:
The root is denoted by a forward slash (/).
The home directory of any given user is specified by a tilde (~).
A single dot (".") indicates the current working directory.
A double dot ("..") indicates the directory above the current directory.
You can find out your current working directory (in other words, where you currently are in the Unix tree) by typing the command pwd (print working directory) at the prompt and then hitting return.
# Start in ~/unixclass
cd ~/unixclass
cat arrayDat.txt
# Start in ~/unixclass
cd ~/unixclass
# Print the first 10 lines of arrayDat.txt
head arrayDat.txt
# Print the first 5 lines of arrayDat.txt
head -n 5 arrayDat.txt
# Start in ~/unixclass
cd ~/unixclass
# Print the last 10 lines of arrayDat.txt
tail arrayDat.txt
# Print the last 5 lines of arrayDat.txt
tail -n 5 arrayDat.txt
# Start in ~/unixclass
cd ~/unixclass
# Scroll through the contents of arrayDat.txt
less arrayDat.txt
# Transfer from local computer to remote computer
scp <source file> <username>@<remote host>:<destination>
scp agenda.txt asoberan@luria.mit.edu:/home/asoberan/unixclass
# Transfer from remote computer to local computer
scp <username>@<remote host>:<source file> <destination>
scp asoberan@luria.mit.edu:/home/asoberan/arrayDat.txt .
Multiple sequence alignment and phylogenetic analysis allow the identification of conserved positions in protein and nucleic acid sequences. This can lead to an appreciation of the evolutionary history of a group of sequences.
The clustal family of programs is commonly used to produce multiple sequence alignments. Other options are available as well:
There are 3 different ways to use the clustal programs.
Web-based clustalw can be used HERE.
ClustalX GUI (and also the command-line clustalw executable) is available for download HERE, along with example data.
Command-line clustalw2 is installed on rous
The major difference between the 3 options is the interface and the calculation of bootstrap values for the tree, which is only available in the command-line and GUI versions. Other details of running the program are the same. For this lesson, we will use the command-line version installed on rous to align these sequences.
Login to rous and copy the clustal training files to your home directory.
cp -r /net/n3/data/Teaching/IAP_2010_day3/clustal .
NOTE: -r means "recursive": copy the folder and everything inside it to the new location ".".
The lack of a trailing "/" on the clustal path ensures that the folder itself is copied, not just its contents.
enter the clustal directory:
cd clustal
view the raw sequence files:
cat *.pep
launch the clustalw application:
clustalw2
The following interactive menu appears, options 1 and 2 will be used in this demonstration. Additional information is available HERE
**************************************************************
******** CLUSTAL 2.0.12 Multiple Sequence Alignments ********
**************************************************************
1. Sequence Input From Disc
2. Multiple Alignments
3. Profile / Structure Alignments
4. Phylogenetic trees
S. Execute a system command
H. HELP
X. EXIT (leave program)
Your choice:
Select option 1 to load sequences and specify the file "tiny.pep". Before being returned to the original menu, you should see:
Sequences should all be in 1 file.
7 formats accepted:
NBRF/PIR, EMBL/SwissProt, Pearson (Fasta), GDE, Clustal, GCG/MSF, RSF.
Enter the name of the sequence file : tiny.pep
Sequence format is Pearson
Sequences assumed to be PROTEIN
Sequence 1: zf_12a1_a 28 aa
Sequence 2: zf_12a1_b 28 aa
Sequence 3: hs_a11 28 aa
Sequence 4: hs_22a1 28 aa
Sequence 5: zf_a11 28 aa
From the original menu, now select option 2. Multiple Alignments. This activates the Alignment menu:
****** MULTIPLE ALIGNMENT MENU ******
1. Do complete multiple alignment now Slow/Accurate
2. Produce guide tree file only
3. Do alignment using old guide tree file
4. Toggle Slow/Fast pairwise alignments = SLOW
5. Pairwise alignment parameters
6. Multiple alignment parameters
7. Reset gaps before alignment? = OFF
8. Toggle screen display = ON
9. Output format options
I. Iteration = NONE
S. Execute a system command
H. HELP
or press [RETURN] to go back to main menu
All default options are correct except option 9; select it to see the output format menu:
********* Format of Alignment Output *********
F. Toggle FASTA format output = OFF
1. Toggle CLUSTAL format output = ON
2. Toggle NBRF/PIR format output = OFF
3. Toggle GCG/MSF format output = OFF
4. Toggle PHYLIP format output = OFF
5. Toggle NEXUS format output = OFF
6. Toggle GDE format output = OFF
7. Toggle GDE output case = LOWER
8. Toggle CLUSTALW sequence numbers = OFF
9. Toggle output order = ALIGNED
0. Create alignment output file(s) now?
T. Toggle parameter output = OFF
R. Toggle sequence range numbers = OFF
H. HELP
Use toggle 4 to activate PHYLIP output, then hit return to go back to the alignment menu.
Select option 1 to do the alignment. Accept the default conditions and hit return until you end up at the main menu. Exit the program with "X".
cat tiny.[ap][lh][ny]
NOTE: Square brackets are regular expression syntax. For example [ap] = either an "a" or a "p" at that position.
This is the clustal format output:
CLUSTAL 2.0.12 multiple sequence alignment
zf_12a1_a VADLVFLVDGSWSVGRENFRFIRSFIGA--
zf_12a1_b KADLVFLIDGSWSIGDDSFAKVRQFVFS--
hs_22a1 HYDLVFLLDTSSSVGKEDFEKVRQWVAN--
hs_a11 YMDIVIVLDGSNSIYP--WVEVQHFLINIL
zf_a11 YMDIVIVLDGSNSIYP--WNEVQDFLINIL
*:*:::* * *: : :: ::
This is the phylip format output:
5 30
zf_12a1_a VADLVFLVDG SWSVGRENFR FIRSFIGA--
zf_12a1_b KADLVFLIDG SWSIGDDSFA KVRQFVFS--
hs_22a1 HYDLVFLLDT SSSVGKEDFE KVRQWVAN--
hs_a11 YMDIVIVLDG SNSIYP--WV EVQHFLINIL
zf_a11 YMDIVIVLDG SNSIYP--WN EVQDFLINIL
UNIX is a multi-user environment; how does it maintain security?
Every file has an owner and permissions.
There are three levels of ownership:
User
Group
Other
Three levels of permissions:
Read
Write
Execute
How is this useful? Imagine a lab: there are files that the entire lab should have access to. Put all of the lab's users into a lab group, and sharing a file within the lab just means making the lab group the group owner of the file. This is already what we do on Luria.
You can view the ownership and permissions of a file by running ls -l. Here's an example of the output of ls -l:
[asoberan@luria unixclass]$ ls -l
total 40
-rwxr-xr-x 1 asoberan ki-bcc 3845 Apr 28 21:48 arrayAnnot.txt
-rwxr-xr-x 2 asoberan ki-bcc 3134 Apr 28 22:11 arrayDat.txt
-rwxr-xr-x 2 asoberan ki-bcc 3134 Apr 28 22:11 arrayHard.txt
-rwxr-xr-x 1 asoberan ki-bcc 1634 Apr 28 21:48 arraylen.txt
lrwxrwxrwx 1 asoberan ki-bcc 12 Apr 28 22:13 arraySoft.txt -> arrayDat.txt
-rwxr-xr-x 1 asoberan ki-bcc 3128 Apr 28 21:48 beep.txt
-rw-r--r-- 1 asoberan ki-bcc 528 Apr 28 21:48 ex1.sh
-rw-r--r-- 1 asoberan ki-bcc 479 Apr 28 21:48 ex2.sh
-rw-r--r-- 1 asoberan ki-bcc 368 Apr 28 21:48 ex3.sh
-rwxr-xr-- 1 asoberan ki-bcc 340 Apr 28 21:48 test_1.fastq
-rwxr-xr-- 1 asoberan ki-bcc 340 Apr 28 21:48 test_2.fastq
Let's focus on the arrayDat.txt file.
-rwxr-xr-x 2 asoberan ki-bcc 3134 Apr 28 22:11 arrayDat.txt
asoberan ki-bcc describes the ownership of a file. In this case, the user asoberan and the group ki-bcc own the file.
-rwxr-xr-x describes the permissions that the owners of the file have.
The permissions can be broken down into three parts:
The user's permissions (-rwx): the user asoberan has read (r), write (w), and execute (x) permissions for this file.
The group's permissions (r-x): the group ki-bcc has read (r) and execute (x) permissions for this file.
Everyone else's permissions (r-x): anyone who isn't asoberan or in the group ki-bcc has read (r) and execute (x) permissions for this file.
To check what group you are in, you can use the groups command:
[asoberan@luria unixclass]$ groups
ki-bcc
# print working directory
pwd
# change directory to root directory
cd /
pwd
# change directory to home directory
cd ~
pwd
# change directory to root directory
cd /
pwd
The ‘read mapping’ problem is the alignment of relatively short reads to a reference genome.
Many short read mappers are currently available (Bowtie, BWA, SOAP, TopHat, etc...). They all try to overcome the following challenges presented by the mapping problem:
how do we align efficiently (in time and memory footprint) billions of short reads to a long reference genome?
how do we decide a specific placement when multiple, equally good alignments are available for the same read?
Some mappers are short-read aligners (they do not allow large gaps in their alignments), while others are spliced-read mappers (they allow large gaps corresponding to introns). The choice of the appropriate mapper depends on the particular dataset and application.
This tutorial uses a subset of the Illumina IDEA Challenge 2011 dataset (more information can be found ). You can copy the relevant fastq files by typing the following command on Luria:
The dataset consists of paired-end RNA-Seq data of breast carcinoma human cell line BT20. Because the reads are paired-end reads, for each DNA fragment we have sequence data from both ends and the sequence reads are stored in two separate files (one for the data from each end). Additional information about the formats used in this tutorial can be found below:
SAM/BAM format.
A complete guide on how to run bowtie can be found .
Bowtie can be run in 2 alignment modes (mutually exclusive):
n alignment mode: this is the default mode (specified with flag -n), and determines which alignments are valid according to the following policy:
alignments may have no more than N mismatches in the first L bases of the reads ("seed"). N and L are integers specified through the flags -n and -l.
the sum of the Phred quality values at all mismatched positions (not just in the seed) may not exceed E (set with -e).
v alignment mode:
alignments may have no more than V mismatches (set using the -v option), and quality values are ignored.
Several reporting modes are available, specified through various combinations of the options, such as -k,-a, -m, etc...
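For example, the two alignment modes described above might be invoked as follows (a rough sketch; the index name, read file, and parameter values are illustrative):
# -n mode: at most 2 mismatches in a 28-bp seed, quality sum at mismatched positions <= 70
bowtie -n 2 -l 28 -e 70 hg18 reads.fastq out_nmode.map
# -v mode: at most 3 mismatches anywhere in the read, quality values ignored
bowtie -v 3 hg18 reads.fastq out_vmode.map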
Bowtie requires an index of the reference sequence against which the reads are mapped. Pre-built indexes for several organisms' assemblies are available on rous or on the official Bowtie website (check the left-pane section "Pre-built Indexes"). If your reference sequence is not among the available pre-built indexes, you can run the following command and create one for it:
When the command has completed, the current directory will contain the index consisting of four new files (ref.1.ebwt, ref.2.ebwt, ref.rev.1.ebwt, and ref.rev.2.ebwt).
How to run Bowtie
If your dataset consists of single-end sequence reads, you can run Bowtie as follows:
If your dataset consists of paired-end sequence reads, you can run Bowtie as follows:
Some additional useful options are:
the flag -k specifies the maximum number of alignments to report if more than one valid alignment is found.
the flag -m suppresses all alignments for a particular read if more than m reportable alignments exist.
the flag -v specifies the maximum number of mismatches allowed in the entire length of the read.
the flag -n specifies the maximum number of mismatches in the high quality seed of the read.
the flag -l specifies the length of the seed to consider for each read.
the flag --best guarantees that reported singleton alignments are the best in terms of stratum (i.e. number of mismatches, or mismatches in the seed, depending on the alignment mode) and in terms of the quality values at the mismatched position(s).
Note that the option --best does not affect which alignments are considered "valid" by bowtie, only which valid alignments are reported. When --best is specified and multiple hits are allowed (via -k or -a), the alignments for a given read are guaranteed to appear in best-to-worst order in bowtie's output. bowtie is somewhat slower when --best is specified.
The last few lines written to the standard error file handle convey summary information about the run (see below). The other lines are written to the standard out file handle and show valid, reportable alignments found by Bowtie.
BWA is another fast aligner based on Burrows-Wheeler Transform (BWT). A complete guide on how to run BWA can be found .
BWA implements two different algorithms, both based on Burrows-Wheeler Transform (BWT).
ALN:
designed for short reads (up to ~200bp) with a low error rate (less than 3%).
performs gapped global alignment.
supports paired-end reads.
BWA-SW:
designed for longer reads with more errors.
performs heuristic Smith-Waterman-like alignment to find high-scoring local hits.
supports only single-end reads.
The database file (in fasta format) must be first indexed with the index command, which typically takes a few hours. Pre-built indexes are available on rous for the most common model organisms.
If your reference sequence is not among the available pre-built indexes, you can run the following command and create one for it:
For short reads (up to ~200bp) with low error rate (<3%), the algorithm to use is invoked with the command aln, which finds the suffix array (SA) coordinates of good hits of each individual read; the commands samse (for single-end reads) or sampe (for paired-end reads) are then used to convert these SA coordinates into chromosomal coordinates. Note that multi-hits are randomly chosen.
If your dataset consists of single-end sequence reads, you can run BWA as follows:
For example, to run BWA on the query BT20_1.fastq against the reference genome hs18, you must type the following commands:
If your dataset consists of paired-end sequence reads, you must run BWA in two steps as follows:
For example, to run BWA on the query BT20_1.fastq and BT20_2.fastq against the reference genome hs18, you must type the following commands:
Note that by default BWA ALN allows up to K mismatches in the seed (the first L positions of each read) and up to N mismatches in the whole read. These parameters can be changed using the appropriate options, as indicated below (a short example follows the list).
-l specifies the length of the seed.
-k specifies the maximum edit distance in the seed.
-n specifies the maximum edit distance in the whole sequence read.
-o specifies the maximum number of gaps to open.
-e specifies the maximum number of gaps to extend.
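A minimal sketch of how these options might be combined (the reference, read file, and values are illustrative):
# 32-bp seed, at most 2 mismatches in the seed and 4 in the whole read,
# allowing 1 gap open and 1 gap extension
bwa aln -l 32 -k 2 -n 4 -o 1 -e 1 ref.fa query.fastq > query.sai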
The BWA output in SAM format has the following specific optional fields:
NM Edit distance
MD Mismatching positions/bases
X0 Number of best hits
X1 Number of suboptimal hits found by BWA
XM Number of mismatches in the alignment
XO Number of gap opens
XG Number of gap extensions
XT Type: Unique/Repeat/N/Mate-sw
XA Alternative hits; format: (chr,pos,CIGAR,NM;)*
Prior to starting this demonstration, you should set up your unix directory according to the instructions in
Emacs is a powerful text editor with extensive functionality.
1) Control mode is activated by pressing and holding the <ctrl> key while pressing the second key. This process is written as:
C-key
This example shows the result after C-s (search):
2) Meta-mode is activated by pressing and releasing the <ESC> key followed by some other key that activates a sub-menu of commands. This process is written as:
ESC option
This example is what appears after entering ESC x (activate execute-extended-command menu)
check to see that you are in the IAP_2010/unix directory by entering the command:
the result should be (where USERNAME is your own unix username)
/home/USERNAME/IAP_2010/unix
If it is not, you can change to that directory by executing:
create a copy of the file replace.txt with:
Begin editing replace2.txt with:
Note: on new accounts a splash screen welcomes you to emacs; use C-l to exit the screen and proceed to editing.
Insert text by regular typing.
Delete letters with C-d
Delete lines with C-k
Find and replace text with ESC x, then type replace-string at the M-x prompt. Enter the search string, then return, then the replacement text.
Move to the end of the document with ESC >
Move back to the start of the document with ESC <
Play tetris with ESC x, tetris
Save and exit with C-x followed by C-c, then answer the prompts if changes were made.
To turn off the splash welcome screen, edit a file called .emacs to have the contents:
The UNIX filesystem does not have to be present on the local hard drive. UNIX supports filesystems that are present over the network, called Network Filesystems (NFS), so that you can mount a directory from another computer and traverse it just as if it were a normal directory on your local computer.
Our Luria cluster uses this technology to mount our many storage servers, which run NFS servers. Every storage server is mounted at /net.
Below is the output of the tree -L 1 /net command, showing the different storage servers mounted at /net:
From our perspective, these look like any other directory on Luria, but they're present on completely different computers.
Another protocol for sharing filesystems over the network is SMB, which is supported on UNIX systems through Samba. In addition to NFS, our storage servers run Samba servers, which allow you to mount them on your local laptop or PC. For instructions on doing so, refer to the page linked below:
# copy from stock directory to current directory
cp /net/ostrom/data/rowleybcc/charliew/old_teaching/Biol2Bioinfo/SAM/BT20_?.fastq .
# build bowtie index for reference ref.fa
bowtie-build ref.fa ref
# run bowtie single-end with default parameters (-S writes SAM output)
bowtie -S /home/Genomes/bowtie_indexes/hg18 BT20_1.fastq BT20_SE_bwt.sam
# run bowtie paired-end with default parameters (-S writes SAM output)
bowtie -S /home/Genomes/bowtie_indexes/hg18 -1 BT20_1.fastq -2 BT20_2.fastq BT20_PE_bwt.sam
# build index for reference sequence ref.fa
bwa index ref.fa
# run BWA single-end with default parameters
bwa aln ref.fa query.fastq > bwa.sai
bwa samse ref.fa bwa.sai query.fastq > bwa.sam
bwa aln /home/Genomes/hs18.fa BT20_1.fastq > BT20_SE_bwa.sai
bwa samse /home/Genomes/hs18.fa BT20_SE_bwa.sai BT20_1.fastq > BT20_SE_bwa.sam
# run BWA paired-end with default parameters
bwa aln ref.fa query_1.fastq > query_1.sai
bwa aln ref.fa query_2.fastq > query_2.sai
bwa sampe ref.fa query_1.sai query_2.sai query_1.fastq query_2.fastq > bwa.sam
bwa aln /home/Genomes/hs18.fa BT20_1.fastq > BT20_1.sai
bwa aln /home/Genomes/hs18.fa BT20_2.fastq > BT20_2.sai
bwa sampe /home/Genomes/hs18.fa BT20_1.sai BT20_2.sai BT20_1.fastq BT20_2.fastq > BT20_PE_bwa.sam
To change the owners of a file, you can use the following commands:
chown
This changes the user who owns a particular file or directory.
chgrp
This changes the group who owns a particular file or directory.
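A minimal sketch of both commands (the user, group, and file names are illustrative, and changing the owning user typically requires administrator privileges):
# make asoberan the owning user of arrayDat.txt
chown asoberan arrayDat.txt
# make ki-bcc the owning group of arrayDat.txt
chgrp ki-bcc arrayDat.txt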
To change the permissions that the owners of a file have, you use the chmod command.
chmod takes two arguments: the permissions to give a file, and the file to change the permissions of. The permissions are represented as a 3-digit number, where each digit represents the permissions to give the user, group, or others, respectively.
Read, write, and execute permissions are represented by the following numbers:
r - 4
w - 2
x - 1
If you want to give someone multiple permissions, you add the numeric representations of those permissions together. For example:
Read, write, execute (rwx) permissions = (4 + 2 + 1) = 7
Write, execute (_wx) permissions = (2 + 1) = 3
So let's say you want to give a file the following permissions:
The user that owns the file should be able to read, write, and execute the file. rwx = (4 + 2 + 1) = 7
The group that owns the file should be able to read and execute the file. r_x = (4 + 1) = 5
Anyone else should have no permissions for the file. ___ = 0
Then you'd run the following command:
chmod 750 arrayDat.txt
Remembering the syntax for this command can be quite cumbersome, so I recommend using a third-party website such as https://quickref.me/chmod.
[asoberan@luria]$ tree -L 1 /net
/net
├── bmc-lab1
├── bmc-lab2
├── bmc-lab2.mit.edu
├── bmc-lab3
├── bmc-lab4
├── BMC-LAB4
├── bmc-lab5
├── bmc-lab6
├── BMC-LAB6
├── bmc-lab7
├── bmc-lab8
├── bmc-pub10
├── bmc-pub14
├── bmc-pub15
├── bmc-pub16
├── bmc-pub17
├── bmc-pub9
├── gelbart
├── ostrom
└── rowley
21 directories, 0 files
Select an experiment, right click and activate Properties...
The page displayed shows information about that experiment. There is a place to enter notes or descriptions; the number of IDs in play is also listed, as are the Threshold and P-Threshold.
Threshold refers to the fold-change or log-ratio cutoff that is in play.
Select all three experiments in the ColonMet_Faked folder and activate Tools-->Set Threshold and Background List. Next to General, enter 1 for the Threshold and 0.1 for the P Threshold. Leave the background option at its default and click OK.
If an array is the source of the data, the background list can be set to the gene content of the array, otherwise the default behavior is to use everything.
Use the right-click, properties function to confirm that threshold settings have been set for each experiment.
Activate the ColonMet_Faked_FC experiment and refresh the page so that it is visible in the Active Data panel.
Select the experiment in the active data panel and click Tools-->Set Threshold and Background List. Click the statistics button. Counts of pass-threshold items in various categories are displayed. The signal distribution button will show the data for these faked values.
pwd
cd /home/USERNAME/IAP_2010/unix
cp replace.txt replace2.txt
emacs replace2.txt
(setq inhibit-splash-screen t)
Luria.mit.edu is a Linux cluster built in 2017, with the intent of providing computing resources for life sciences workloads for the Koch Cancer Research Institute and affiliated labs and DLCs at MIT.
It has 2272 cores, across 57 nodes:
Memory per node:
c1-9: 128GB
c10: 192GB
c11-40: 96GB
b1-16: 768GB or 384GB
Head node: 64GB
Cores and CPUs per node:
c1-4: 32 cores, Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz
c5-40: 16 cores, Intel(R) Xeon(R) CPU E5620 @ 2.40GHz
b1-16: 96 cores, Intel(R) Xeon(R) Gold 6240R CPU @ 2.40GHz or 5220R CPU @ 2.20GHz
Head node: 32 cores, Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz
To request a Luria account, please email [email protected] from your MIT email account with the following information:
Your lab affiliation.
What kind of work you are looking to accomplish on the Luria cluster.
Whether you have previous experience with Linux and cluster computing.
You must be connected to the MIT VPN if you are not connected to MITNet or the MIT Secure Wi-Fi.
Unfortunately, the most recent versions of Visual Studio Code's Remote-SSH plugin no longer support operating systems with older versions of glibc, which includes Luria. There's no easy way around this besides downgrading VSCode to a version from before March 2025, or changing your workflow.
If you still want to use VSCode but do not wish to downgrade it, you can mount your lab's storage over SMB, access your storage files in VSCode through your mounted folder, and manually SSH into Luria instead of using the Remote-SSH plugin by typing ssh <your username>@luria.mit.edu into the VSCode Terminal.
Keep in mind that mounting your storage server over SMB only gives you access to the files in your lab's storage account and files in your home directory's ~/data folder, not files directly inside of your home directory.
To connect to Luria using Visual Studio Code, follow the instructions at this website. When you reach the step "Connect to Host", use <your-kerberos-username>@luria.mit.edu as your host. Enter your Kerberos password when prompted.
You can then open any folder or workspace on Luria by going to File > Open and edit it straight in Visual Studio Code.
After successfully connecting, go to the VS Code top bar, press Terminal, then New Terminal as shown here. This will open a remote shell on the Luria cluster to run commands on.
SecureCRT is available to download from IS&T's website. To create a connection using SecureCRT, click Quick Connect, then fill in the following fields:
Enter your Kerberos password when prompted.
The Windows Terminal is available to download from the Microsoft app store. If you do not wish to download it, you can use the built-in Windows PowerShell to run the following commands. Linux and MacOS come with a terminal and SSH client built-in.
Run: ssh <your-kerberos-username>@luria.mit.edu
Enter your MIT Kerberos password when prompted.
Each user on Luria will have at least two storage areas: a home directory and a project directory. Both are backed up daily.
Each user has a home directory. The home directory is intended for commonly used scripts, environment configuration files, and your own software. A standard user account has a storage quota of 10GB in the home directory. Please do not store large data files under your home directory, as they can fill up the space quickly.
In your home directory, there is usually a symbolic link called data which points to your storage server. A symbolic link (symlink) is a file that points to a different location. Symbolic links are useful for saving space: instead of storing a large directory in a location such as your Luria home, you can use a symbolic link that points to that directory on a resource with more abundant storage.
Programs will automatically follow the symbolic link and behave as if it were actually the directory. Symlinks also serve as shortcuts to different locations in the filesystem. For example, the symbolic link called data in your home points to your directory in your lab's storage share. This is probably one of the following:
/home/<username>/data -> /net/bmc-lab2/data/lab/<lab-name>/<username>
/home/<username>/data -> /net/bmc-pub14/data/<lab-name>/users/<username>
Please do not delete the data symbolic link. The project directory is intended to hold all your data, which resides on either bmc-pubX or bmc-labX storage servers.
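For example, you can inspect where the data symlink points, or create a shortcut of your own (the directory names below are illustrative):
# show the target of the data symlink
ls -l ~/data
# create a symlink named "project" that points to a directory on the storage server
ln -s ~/data/my_project ~/project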
Your home directory also has five hidden directories: .local, .R, .singularity, .conda, and .jupyter. These directories are specifically named and required by their respective programs to store files, configurations, images, and packages. Since these can grow quite large, these directories are linked to your storage server.
Starting in February 2024, these directories point to the following hidden directories:
/home/<username>/.singularity -> /home/<username>/data/.singularity
/home/<username>/.conda -> /home/<username>/data/.conda
/home/<username>/.jupyter -> /home/<username>/data/.jupyter
/home/<username>/.local -> /home/<username>/data/.local
/home/<username>/.R -> /home/<username>/data/.R
For older accounts, these directories may look somewhat different, for example:
/home/<username>/.singularity -> /home/<username>/data/singularity
/home/<username>/.conda -> /home/<username>/data/conda
/home/<username>/.jupyter -> /home/<username>/data/jupyter
/home/<username>/.local -> /home/<username>/data/local
/home/<username>/.R -> /home/<username>/data/R
If your home directory exceeds the 20G quota, you will receive the error "Disk quota exceeded" when writing to your home directory. If this happens, you should examine which directories/files take most of the space using the du command, and then move those large directories/files to your project directory ~/data located on the storage server.
$ du -shc $(ls -A)
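Once a large directory has been identified, it can be moved to the project space (the directory name below is illustrative):
# move a large results directory out of the home quota and onto the storage server
mv ~/large_results ~/data/large_results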
While computing on Luria is free, data storage is charged at $100/TB/year. To reduce operational cost and better utilize the storage servers, we recommend you check your data periodically, removing data that are no longer needed and compressing files that you will not need for a while. Here is an example command to compress all fastq files in the current directory and sub-directories that are older than 180 days. Please remember to run it through either a batch script or an interactive login to a compute node.
$ find . -name \*.fastq -mtime +180 | parallel -j16 gzip -v {}
Each lab has a storage quota in the project directory. When the storage quota is exceeded for your lab, you will not be able to add more data to your storage server. As mentioned above, you will then need to do data cleanup and compression to free up space. In addition, you can send a request to [email protected] to increase your storage quota with a cost object provided and/or do TSM archival of your folder(s).
We don't offer tiered storage. We don't use the term cold storage, as all our storage servers are active. If you have data that you don't need for a while, we can do a TSM archive to MIT storage and then delete the data from our storage server. Archived data is stored in the MIT TSM archive and is no longer accessible from our cluster luria.mit.edu or our storage servers. If you need the data back on Luria and our storage servers, we can retrieve it from the MIT TSM archive. The MIT archive is a free service, but retrieval requires two hours of our labor at a rate of $90/hr.
Secure Copy or SCP is a means of securely transferring computer files between a local and a remote host or between two remote hosts.
It is based on the Secure Shell (SSH) protocol.
The term SCP refers both to the protocol and the program.
The command-line scp is one of the most common SCP programs and implements the SCP protocol to transfer files.
The syntax of the scp command is the same as the syntax of the cp command:
Copying file(s) to luria
# copying file `myfile.txt` from your local machine
# to your home directory in luria
scp myfile.txt <user>@luria.mit.edu:~/myfile.txt
# copying folder `myfolder` from your local machine
# to your home directory in luria
scp -r myfolder <user>@luria.mit.edu:~/
Copying file(s) from luria
# copying file `myfile.txt` from luria to
# your local machine
scp [email protected]:~/myfile.txt myfile.txt
Transferring Files Using Jupyter Notebook
You can upload and download files from the Jupyter notebook interface via your browser. Please follow the instructions at this page.
Luria uses module to manage software and its versions (see Management of Software Packages with module). Please note that not all software/versions available on rous are installed on Luria. To get a list of software packages and versions, run
module avail
While module helps you to locate software that has been loaded into LMOD, many packages installed on Luria are not configured in this way, or may be included with other packages. If a package that you are looking for doesn't show up using module, you can run the locate command to search the filesystem for packages by name.
For example, to find out if tabix is installed, run the command locate tabix. The result should show that the command tabix is available under samtools. You can then run module load samtools/1.3 to have tabix in your environment.
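Put together, the search-then-load pattern looks roughly like this (output will vary by system):
# search the filesystem for the tabix binary
locate tabix
# load the module that provides it
module load samtools/1.3
# confirm that tabix is now on your PATH
which tabix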
See Installation.
Luria uses the Slurm scheduler in order to manage jobs. In order to submit a job to run on the cluster compute nodes, you will need to create a batch script for Slurm and then submit it to the scheduler. For more information, please see our separate article on Slurm.
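A minimal batch script sketch is shown below; the resource values, job name, module, and command are illustrative and should be adjusted to your own workload:
#!/bin/bash
#SBATCH --job-name=example_job     # illustrative job name
#SBATCH --ntasks=1                 # a single task
#SBATCH --cpus-per-task=4          # CPUs for that task
#SBATCH --mem=8G                   # memory for the job
#SBATCH --time=01:00:00            # wall-clock limit
#SBATCH --output=example_%j.out    # %j expands to the job ID

# load the software you need, then run your command
module load samtools/1.3
samtools flagstat alignments.bam
Save this as, for example, myjob.sh and submit it with sbatch myjob.sh; you can check its status with squeue -u $USER.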
The Integrated Genomics and Bioinformatics core at MIT (IGB) offers a hands-on introductory session covering Linux usage and cluster computing using KI/BioMicro computational resources. This training session is targeted at users who are new to Linux and/or cluster computing. It is currently offered every 6 weeks, and the registration fee is managed using iLabs. If you are interested in the next training session or one-on-one training, please email [email protected] for details.
Every user has a unique username and is a member of one or more groups.
Every file and directory has an owner, a group, and a set of permission flags.
Flags specify permissions:
read, write and execute (rwx)
owner, group and others (ugo).
Programs need to have execute (x) permission to run in the shell.
ls -lt
Only the owner can change the permissions of a file.
chmod has 2 different syntaxes
Syntax 1
assign (=), give (+), or take away (-) permissions
who corresponds to --> u (user), g (group), o (other)
permission corresponds to --> read (r), write (w), execute (x)
chmod who=permission
chmod who+permission
chmod who-permission
ls -l catfile.txt
chmod g=rwx catfile.txt
ls -l catfile.txt
ls -l catfile.txt
chmod g-w catfile.txt
ls -l catfile.txt
ls -l catfile.txt
chmod o+x catfile.txt
ls -l catfile.txt
chmod n
numeric shortcuts to change the mode in a compact way.
format: chmod n filename
n is a 3-digit number; each digit takes a value from 0 to 7.
each digit is associated with user, group, and others, respectively.
examples:
ls -l catfile.txt
chmod 644 catfile.txt
ls -l catfile.txt
chmod 700 catfile.txt
ls -l catfile.txt
Galaxy is a framework for integrating computational bioinformatics tools. It provides a common interface that wraps diverse tools, as well as an environment for interactive analysis that transparently tracks the details of analysis steps, a workflow system for convenient reuse, data management, sharing, publishing, and more.
The web-based Galaxy site is freely available to the public. User registration at the site is optional but beneficial: registration is free and allows the user to create custom workflows and save analysis results.
Open this link in a new window or type http://main.g2.bx.psu.edu/ in your url bar. In the main page of Galaxy, you will find some introductory material, including useful screencasts that show how to use the various tools to perform many types of analyses, such as text manipulation, format conversion, NGS quality control, NGS Mapping, interval data manipulations, relational database operations, and many more.
There are three panes in the Galaxy desktop:
Tool pane (left).
Current Analysis pane (centre).
Job/History pane (right).
To analyze the data in Galaxy, you perform an ordered sequence of tasks, each consisting of applying a specific tool (chosen from the left pane) to a specific set of files (chosen in the central pane or among the results from previous steps as reported in the History pane).
The History pane shows the sequence of steps/tasks you are running. In particular, the color of the frame associated with a particular job has the following meaning:
gray : the job is in the queue.
yellow: the job is running.
green: the job is completed.
red: the job produced an error.
Terminal
Common name for the interface that allows you to input commands and see the output from commands.
Prompt
Sequence of one or more characters, such as $ or %, in combination with other customizable elements, such as the username or the current working directory.
It indicates readiness to accept commands.
Shell
Command line interpreter.
Users operate the computer by entering command input as text for a shell to execute or by creating text scripts of one or more such commands.
Unix comes with preloaded documentation, also known as "man pages".
Each page is a self-contained document consisting of the following sections:
NAME: the name and a brief description of the command.
SYNOPSIS: how to use the command, including a listing of the various options and arguments you can use with the command. Square brackets ([ ]) are often used to indicate optional arguments. Any arguments or options that are not in square brackets are required.
DESCRIPTION: a more detailed description of the command including descriptions of each option.
SEE ALSO: References to other man pages that may be helpful in understanding how to use the command in question.
To view a manual page, use the command man.
When viewing a man page, use the following keys to navigate the pages:
spacebar- view the next screen;
b (for "back") - view the previous screen;
arrow keys - navigate within the pages;
q (for "quit") - quit and return to the shell prompt.
For example, to view the man page associated with the command cp, type the following:
When you are not sure of the exact name of a command, you can use the apropos command to see all the commands with the given keyword on their man page.
For example, to see which commands are relevant to the task of copying, type the following:
find
Searches directories to find a directory or file.
grep
Searches a file for a specific pattern.
Can be used similarly to find, but it also searches file contents.
man command_name
man cp
apropos keyword
apropos copy
find ~ -name "arrayDat*"
grep -r "arrayDat*" .
pwd
This command name stands for "print working directory".
It displays on the terminal the absolute path name of the current (working) directory.
It is useful to locate where you are currently in the Unix tree.
# print working directory
pwd
ls
This command stands for "list".
It displays the content of a directory.
By default, the content is listed in lexicographic order.
# list the content of the current directory
ls
ls .
# list content of the parent directory
ls ..
# list the contents of your home directory from anywhere
ls ~
Shell scripts are sequential collections of commands that generally represent a workflow.
Shell scripts can use other commands as building blocks.
Suppose that occasionally you want the system to tell you some simple information, like your user id, the current directory and its content, and the current date. To avoid typing this sequence of commands every time you want this information, you can compose a short script that executes the commands in a specific order.
Open a text editor (e.g. nano).
Type the sequence of commands to execute in the specific order they should be executed.
clear
echo "Today's date is:"
date
echo
echo Your current directory is:
pwd
echo
echo The files in your current directory are:
ls -lt
echo
echo Have a nice day!
Save the file by giving to the script a meaningful name, for example info.sh.
Execute the command by invoking your shell (e.g. bash) followed by the script name. If you don't know what shell you are running, type "echo $SHELL" at the prompt.
bash info.sh
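As an optional alternative (a small sketch, not part of the original example): if the first line of info.sh is the shebang #!/bin/bash, you can make the script executable and run it directly without naming the shell.
# make the script executable, then run it
chmod +x info.sh
./info.sh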
uname
uname prints various pieces of system information. The possible pieces of information are listed in the uname manual page.
The -a flag will print all the information uname can print.
file
This will print out the file type of the given file.
ps
Shows the running processes on the system. Pass it the argument aux to get extensive information about every running process on the system.
htop
An interactive utility to see every running process on the system, as well as extensive information about the CPU, RAM, etc. usage on the system.
This utility is useful to check if you're worried about your own CPU and RAM usage on the head node.
The following sections explain how to view, create, and edit files all through the Unix shell. While it is possible to do these things straight from the shell, if you are using Visual Studio Code set up as described on the page linked below, you can view, create, and edit files through it instead of the shell.
uname -a
Linux luria 3.10.0-514.2.2.el7.x86_64 #1 SMP Tue Dec 6 23:06:41 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
file hello.txt
hello.txt: ASCII text
file my_directory.zip
my_directory.zip: Zip archive data, at least v2.0 to extract
# Prints information on every running process on the system
ps aux
# Pipe the output into grep and search for every process being run by a specific user
ps aux | grep "asoberan"
asoberan 6342 0.0 0.0 115716 2372 pts/124 Ss May02 0:00 /bin/bash
asoberan 7855 0.0 0.0 302216 4668 pts/124 Sl+ 15:25 0:00 srun --pty bash
asoberan 7873 0.0 0.0 96680 768 pts/124 S+ 15:25 0:00 srun --pty bash
Make sure to run all of the copy commands below, as we'll be using files from the Ostrom server later in the course.
mkdir
This command name stands for "make a directory".
It creates a new folder (or directory). If no path is specified, the new directory is created in the current directory.
# start in your home directory
cd ~
# create a directory named "unixclass"
# with a subdirectory named "testdir"
mkdir unixclass
mkdir unixclass/testdir
# change current directory directly to "testdir"
cd unixclass/testdir
# go to the parent directory (i.e. unixclass)
# and print the working directory
cd ..
pwd
touch
This command creates an empty file with the given name.
# Go to /home/<your username>/unixclass
cd ~/unixclass
# Create an empty file named "hello.txt"
touch hello.txt
# List the files in the directory to verify that worked
ls
cp
and mv
These commands stand for "copy" and "move," respectively.
They copy / move files and directories to the specified location.
Wildcards symbols such as "*" or "?" are commonly used to copy multiple files with a single command.
The symbol "*" stands for any number of alphanumeric characters.
The symbol "?" stands for a single alphanumeric character.
# Start in ~/unixclass
cd ~/unixclass
# copy the file named arrayDat.txt into your unix_class directory
cp /net/ostrom/data/dropbox/arrayDat.txt .
ls
# copy all the files whose names start with "array"
# into the current directory
cp /net/ostrom/data/dropbox/array* .
ls
# copy any file whose extension is "txt"
cp /net/ostrom/data/dropbox/*.txt .
ls
# copy all files
cp /net/ostrom/data/dropbox/* .
ls
rmdir
and rm
rmdir
only removes empty directories, rm
removes both directories and files.
rm
needs -r
flag to remove directories.
# Start in ~/unixclass
cd ~/unixclass
# Create a temporary directory
mkdir trash
# create copies of arrayDat.txt in the temporary directory
cp arrayDat.txt trash/arrayDat1.txt
cp arrayDat.txt trash/arrayDat2.txt
cp arrayDat.txt trash/arrayDat3.txt
cp arrayDat.txt trash/arrayDat4.txt
ls trash
# Try to delete the directory with `rmdir`
rmdir trash
# Try to delete the directory with `rm -r`
rm -r trash
Unix is an operating system (suite of programs), originally developed in 1969 at the Computing Science Research Group at Bell Labs. In 1991, Linus Torvalds released a Unix-like kernel called the Linux kernel that he subsequently open-sourced. Linux-based operating systems are the most widely used Unix-like operating systems, and they are what we use in our local high-performance computing cluster.
From here on out, any mention of UNIX should be regarded as meaning both UNIX and UNIX-like operating systems, primarily Linux-based operating systems.
There is a distinction between the computing paradigm popular at the time of UNIX's creation and the one embodied by UNIX: personal computing (one user on one machine) versus a shared system.
Whereas previous systems were meant to be used by one person at a time, UNIX has multi-user and multi-tasking support.
Aside from that feature, UNIX also had these features that made it popular:
network-ready (built-in TCP/IP networking makes it easy to communicate between computers).
very powerful programming environments (free of the many limits imposed by other operating systems).
robust and stable.
scalable, portable, flexible.
open source.
This example is inspired by a screencast published on the Galaxy website. It consists of combining exon information and SNP information, both represented as interval data.
1. Load SNP data from UCSC tables
On the Tool Panel, click on Get Data → UCSC Main Table Browser.
This tool allows you to upload data from the UCSC Tables.
Use the following parameters:
Group: Variation and Repeats
Track: SNP(130)
Region: chr19:1-1,000,000
Output format: BED
Send output to Galaxy: checked
Click "Get Output" button.
Select the radio button so that one BED record is created for the whole gene.
Click the button "send query to Galaxy"
With these parameters, this tool creates a BED file containing all the SNPs for the first 1M bases of chromosome 19.
Once the job is completed, change the name of the dataset to "SNPs chr19".
2. Load exon data from UCSC tables
On the Tool Panel, click on Get Data → UCSC Main Table Browser.
This tool allows you to upload data from the UCSC Tables.
Use the following parameters:
Group: Genes and Gene Prediction
Track: UCSC Genes
Region: chr19:1-1,000,000
Output format: BED
Send output to Galaxy: checked
Click "Get Output" button.
Select the radio button so that one BED record is created per coding exon.
Click the button "send query to Galaxy"
With these parameters, this tool creates a BED file containing all the exon information for the first 1M bases of chromosome 19.
Once the job is completed, change the name of the dataset to "exons chr19".
3. Join exon and SNP information
On the Tool Panel, click on Operate on Genomic Intervals → Join the intervals.
This tool allows you to join the information from two interval files based on the coordinates of each feature.
Select the SNP chr19 and the exons chr19 files as input.
Click on the "Execute" button.
Because some exons might contain multiple SNPs, the resulting output might be larger than either input file.
4. Find the number of SNPs per exon
On the Tool Panel, click on Join, Subtract and Group → Group.
This tool groups the data based on a given column and performs aggregation operations on the other columns.
Select data 3 as input.
Select column 4 (exon ID).
Add operation to count c4.
Click on the "Execute" button.
5. Find the exon with the most SNPs
On the Tool Panel, click on Filter and Sort → Sort.
This tool sorts the data based on the values in a chosen column.
Select data 4 as input.
Select column 2 as sorting key.
Click on the "Execute" button.
6. Find how many exons have a given number of SNPs
On the Tool Panel, click on Join, Subtract, Group → Group.
This tool groups the data based on a given column and performs aggregation operations on the other columns.
Select data 5 as input (sorted).
Select column 2 (the SNP count) as the grouping column.
Set the operation to "count" on column 1.
Click on the "Execute" button.
7. Filter exons with at least 10 SNPs
On the Tool Panel, click on Filter and Sort → Filter.
This tool filters the data, keeping only the lines that satisfy a given condition.
Select data 5 as input (sorted).
Set the condition to a SNP count of at least 10 (i.e. c2 >= 10).
Click on the "Execute" button.
8. Retrieve original information for exons
On the Tool Panel, click on Join, Subtract, Group → Join.
This tool joins two datasets based on the values of a chosen column from each.
This is equivalent to a relational join (not an interval join).
Select the exons with at least 10 SNPs as the first input.
Select the exon data for chr19:1-1,000,000 as second input.
Select column 1 (exonID) for the first file.
Select column 4 (exonID) for the second file.
Click the "Execute" button.
Now repeat this step but invert the order of the files. Note that this time the output is BED-formatted, whereas before it was a tabular file.
9. Display using the UCSC Browser
On the Data Panel on the right-hand side, click on the last job → Display at UCSC.
The User track shows the exons that have at least 10 SNPs in the region of chr19 considered.
Nano is freely available under the GPL for Unix and Unix-like systems.
It is a keyboard-oriented editor, controlled with key combinations, i.e. the "control" key (denoted by "^") PLUS certain letter keys.
^O : saves the current file
^W : goes to the search menu
^G : gets the help screen
^X : exits
Launch the editor by typing "nano" at the shell prompt. An overview of available commands is given at any time at the bottom of the editor screen.
nano
Open an existing file by typing "nano" followed by the file name.
nano file1.txt
If you modify the file, when you exit you will be asked whether to save the changes and the name of the file to save.
While holding down Ctrl key, press \
Enter your search string at prompt and press return
Enter your replacement string at prompt and press return
Respond to "Replace this instance" option menu:
Y to replace current instance
N to skip current instance
A to replace all instances
Note: The search string can also be a regular expression.
To carry out phylogenetic analysis it is important to begin with an alignment in which all gap-containing positions have been removed.
This can be accomplished using the tool.
There is a bug in the process, but the gap-free alignment can be loaded into clustalX, and then the "save sequences as Phylip format" function can be used.
The correct alignment can be created with a text editor. Another version of it, called tiny_nogap.phy, is in your clustal/phylip folder.
enter the phylip folder:
The first step is to create the bootstrap replicates. This is done with the phylip application seqboot.
when prompted, provide the name "tiny_nogap.phy".
At the interactive menu:
change option r to 10 for 10 replicates. NOTE: normally 1000 replicates are used for neighbor-joining analyses
press Y to accept changes
enter 111 as random number seed.
Move outfile to tiny_nogap.seqboot
The next step is to calculate the distances between the sequences in each replicate. This is done with protdist
when prompted, provide the name "tiny_nogap.seqboot".
At the interactive menu:
change option m to d for datasets and 10 for number of replicates.
press Y to accept changes
Rename the outfile
Build the neighbor-joining tree for each distance matrix with neighbor:
when prompted, provide the name "tiny_nogap.protdist".
At the interactive menu:
change option m to 10 for number of replicates.
enter 111 for random number seed
press Y to accept changes
Rename the output:
Create a consensus version of the outtree file with consense:
when prompted, provide the name "tiny_nogap.neightree".
At the interactive menu, default settings are ok so type "Y"
Rename the output:
Draw the tree with drawtree:
when prompted for a file enter tiny_nogap.neigh.constree
when prompted for a fontfile, enter:
At the interactive menu, select option V, then N to turn off preview. Then option Y
This results in a postscript plot of the tree. rename it to:
The contents of this file can be viewed with image-viewing software; it looks like this:
The bootstrap values can be viewed in the file tiny_nogap.neigh.constree, or graphically by opening this file in a tree viewer. With branch lengths selected, the view looks like this:
nano
A basic text editor found on most UNIX operating systems.
Gives instructions on how to use it at the bottom of the screen.
"^" stands for Ctrl.
Example output from running nano arrayDat.txt is shown below. Save any changes by exiting with ^X:
If you recall, you can mount the storage servers to your local machine using SMB. Since mounting the server makes it look like a local directory on your computer, you can use whatever text editing software that you're comfortable with to edit text files on the computing cluster.
GNU nano 2.3.1 File: arrayDat.txt
ProbeID Sample1 Sample2 Sample3 Sample4
1007_s_at 10.93 11.44 11.19 11.64
1053_at 8.28 7.54 8.06 7.32
117_at 3.31 3.41 3.13 3.13
121_at 4.42 4.32 4.46 4.63
1255_g_at 1.8 1.7 1.75 1.81
^G Get Help ^O WriteOut ^R Read File ^Y Prev Page ^K Cut Text ^C Cur Pos
^X Exit ^J Justify ^W Where Is ^V Next Page ^U UnCut Text ^T To Spell
CLUSTAL 2.0.12 multiple sequence alignment
zf_12a1_a VADLVFLVDGSWSVGRENFRFIRSFIGA--
zf_12a1_b KADLVFLIDGSWSIGDDSFAKVRQFVFS--
hs_22a1 HYDLVFLLDTSSSVGKEDFEKVRQWVAN--
hs_a11 YMDIVIVLDGSNSIYP--WVEVQHFLINIL
zf_a11 YMDIVIVLDGSNSIYP--WNEVQDFLINIL
*:*:::* * *: : :: ::
5 26
zf_12a1_a VADLVFLVDGSWSVGRFRFIRSFIGA
zf_12a1_b KADLVFLIDGSWSIGDFAKVRQFVFS
hs_22a1 HYDLVFLLDTSSSVGKFEKVRQWVAN
hs_a11 YMDIVIVLDGSNSIYPWVEVQHFLIN
zf_a11 YMDIVIVLDGSNSIYPWNEVQDFLIN
cd /home/USERNAME/clustal/phylip/
seqboot
cat outfile
mv outfile tiny_nogap.seqboot
protdist
cat outfile
mv outfile tiny_nogap.protdist
neighbor
cat outfile
cat outtree
mv outtree tiny_nogap.neightree
mv outfile tiny_nogap.neighout
consense
cat outfile
cat outtree
mv outtree tiny_nogap.neigh.constree
mv outfile tiny_nogap.neigh.consout
drawtree
/net/n3/data/Teaching/IAP_2010_day3/clustal/font1
NOTE: a series of different fonts are available within the phylip installation directory
mv plotfile tiny_nogap.tree.ps
This demonstration shows how to perform some basic bioinformatics tasks using simple UNIX commands.
The following examples use primarily two files: arrayDat.txt
and arrayAnnot.txt
. You can copy them using the following command on Luria:
# copy from stock directory to current directory
cp /net/ostrom/data/dropbox/arrayDat.txt .
cp /net/ostrom/data/dropbox/arrayAnnot.txt .
arrayDat.txt contains some microarray data, with rows corresponding to probe IDs and columns corresponding to four samples.
arrayAnnot.txt contains the annotation for the probe IDs, specifically the gene description (i.e. geneTitle) and the gene symbol (i.e. geneSymbol).
Note that in general there is a "many-to-one" correspondence between probe IDs and gene symbols; that is, more than one probe ID can be associated with the same gene symbol.
You can use this command to view the content of the top part of a file.
The switch "-n" allows you to specify how many lines to display, starting from the first. If you specify a negative number N, then all the lines except the bottom N will be displayed.
# print first 5 lines
head -n 5 arrayDat.txt
head -n 5 arrayAnnot.txt
# print everything except the last line
head -n -1 arrayDat.txt
head -n -1 arrayAnnot.txt
This command extracts portions of lines of a file.
The switch "-f" allows you to specify which field (by default column) to extract.
Multiple fields can be selected either by specifying a range (e.g. 1-3) or by listing the fields (e.g. 1,2,3).
# get 1st column (i.e. probeID)
cut -f 1 arrayDat.txt
# get 3rd column (i.e. geneSymbol)
cut -f 3 arrayAnnot.txt
# get the first three columns
cut -f 1-3 arrayDat.txt
# get the 1st and 3rd columns and redirect the output into a file (mapping probeID to geneSymbol)
cut -f 1,3 arrayAnnot.txt > probe2gene.txt
This command merges the lines of files and puts them side by side.
# put each column in a separate tmp files
cut -f 1 arrayAnnot.txt > tmp1
cut -f 2 arrayAnnot.txt > tmp2
cut -f 3 arrayAnnot.txt > tmp3
# merge lines in a new column order
paste tmp3 tmp1 tmp2 > arrayAnnotOrdered.txt
This command sorts the lines of a text file.
By default the sort is in lexicographic order according to the first field.
In lexicographic order, letters follow numbers: 0-9, then aA-zZ.
Note that numbers are treated as strings, so 10 comes before 2 because there is no positional weighting (the character 1 sorts before 2).
Other sorting criteria are available:
the switch -k lets you specify which field to use as key for sorting.
the switch -n specifies a numeric sort.
the switch -r specifies a sort in reverse order (either lexicographic or numeric).
the switch -R specifies a random sort.
# sort by 1st field (probeID)
sort probe2gene.txt
# sort by the 2nd field (geneSymbol)
sort -k 2 probe2gene.txt
# sort by geneSymbol in reverse order
sort -r -k 2 probe2gene.txt
# sort by 2nd field (default lexicographic sort)
sort -k 2 arrayDat.txt
# sort by 2nd field (numeric sort)
sort -n -k 2 arrayDat.txt
This command counts the lines, words and bytes in a text file.
It is often used in conjunction to other filtering operations to count the number of items that pass the filter.
# print number of lines, words and bytes
wc arrayDat.txt
# print number of lines only
wc -l arrayDat.txt
# print the number of fastq files in the current directory
ls *.fastq | wc -l
This command reports or filters repeated lines in a file.
It compares adjacent lines and reports those lines that are unique. Because repeated lines in the input will not be detected if they are not adjacent, it might be necessary to sort the input file before invoking this command.
The switch "-f" specifies how many fields (starting from the first one), to skip when performing the comparisons.
The switch "-c" specifies to return a count of how many occurrences for each distinct line.
The switch "-d" specifies to print only duplicates lines.
The switch "-u" specifies to print only unique lines (i.e. occurring only once).
The switch "-i" specifies a case insensitive comparison.
# print all distinct rows then count
uniq arrayAnnot.txt | wc -l
# count distinct rows without considering the 1st field (probeID)
uniq -f 1 arrayAnnot.txt |wc -l
# report distinct gene symbols and count number of occurrences
uniq -f 1 -c arrayAnnot.txt
# report repeated gene symbols
uniq -f 1 -d arrayAnnot.txt
# report list of unique gene symbols
uniq -f 1 -u arrayAnnot.txt
Acronym for "global regular expression print".
This command prints the lines of the input file that match the given pattern(s).
# print lines that match “chr” (lines with the words "chromosome" and/or "cytochrome" are matched).
grep chr arrayAnnot.txt
The switch "-w" specifies that the match has to occur with a whole word, not just part of a word. Thus, in the example input file, no line matches the pattern "chr" as whole word.
# print lines that match “chr” as a whole word
grep -w chr arrayAnnot.txt
The switch "-i" can be used to specify a case-insensitive match (by default this command is case sensitive).
In the example below, when the switch "-i" is used for the pattern "SS", lines containing the words "class", "repressor", "associated", "expressed" as well as the gene symbol "PRSS33" are matched.
# print lines that match “SS” (case sensitive)
grep SS arrayAnnot.txt
# print lines that match “SS” (case insensitive, that is "SS" or “ss” or “Ss” or “sS”)
grep -i SS arrayAnnot.txt
The switch "-c" returns the number of lines where the specified pattern is found.
# print how many lines match the pattern "orf"
grep -c orf arrayAnnot.txt
The switch "-v" performs a "reverse-matching" and prints only the lines that do not match the pattern.
# print lines that do NOT match the pattern "orf"
grep -v orf arrayAnnot.txt
The switch "-n" prints out the matching line with its number.
# precede the matching line with the line number
grep -n orf arrayAnnot.txt
When multiple patterns must be matched, you can use the switch "-f" and pass as argument a file containing the patterns to match (one per line).
# make a list with 5 gene symbols
cut -f 3 arrayAnnot.txt|head -n 5 > tmp
# use list "tmp" to match lines in arrayAnnot.txt
grep -f tmp arrayAnnot.txt
This command provides relational database functionality.
It merges the lines of two files based on the common value of a given field, also known as "key".
By default the first field of each file is considered as key.
To perform a relational join, where you merge all the lines that have identical values in the specified field of two files, the files themselves must be sorted on the key field. Unless otherwise specified in the command (with the option "--nocheck-order"), a warning is given when the files are not sorted on the key field.
# join two files (use default field as key)
join arrayDat.txt arrayAnnot.txt
join arrayAnnot.txt arrayDat.txt
The switch "-t" lets you specify what character to use as field separator in the input and output files. To specify a tab character, use the string '$\t'.
Note that by default the separator is a blank space. It is generally a good idea to specify a tab as the output separator so that if a field consists of multiple words separated by blank spaces, these words remain bundled in the same field and you can use the cut command to access specific fields.
In the examples below, a join with default settings yields an output where the fields are separated by blanks. If we try to extract the description (6th field), even if we specify that the delimiter is a blank space rather than a tab, we cannot extract the entire description.
# join two files (default separator is blank space)
join arrayDat.txt arrayAnnot.txt
# note how the joined fields are separated by blank spaces and the default settings of cut do not yield the desired output
join arrayDat.txt arrayAnnot.txt | cut -f 6
# note how specifying a blank space as delimiter for cut does not yield the desired output because the description field consists of multiple words
join arrayDat.txt arrayAnnot.txt | cut -d ' ' -f 6
# join two files (separator is a tab)
join -t $'\t' arrayDat.txt arrayAnnot.txt
join -t $'\t' arrayDat.txt arrayAnnot.txt | cut -f 6
The key field of each file can be specified using the switch "-1" and "-2" for the first and second file respectively. Fields are numbered starting from 1.
In the example below, we join the files probe2gene.txt and arrayAnnot.txt based on the gene symbol (2nd and 3rd field, respectively). Before performing the join, the files must be sorted according to their keys.
# sort each file by its key, the gene symbol (the 2nd and 3rd field, respectively)
sort -k 2 probe2gene.txt > tmp1
sort -k 3 -t $'\t' arrayAnnot.txt > tmp2
join -1 2 -2 3 -t $'\t' tmp1 tmp2
The switch "-o" lets you specify which fields in each file should be output.
Note that in the example below the output has repeated rows because the join was performed on non-unique keys (the gene symbol).
# specify which field to output for each file
join -1 2 -o '1.1 2.2 2.3 2.4 2.5' arrayAnnotOrdered.txt arrayDat.txt
join -1 2 -2 3 -t $'\t' -o '1.1 1.2 2.2' tmp1 tmp2
This command splits a file into a series of smaller files.
The content of the input file is split into lexically ordered files named with the prefix "x", unless another prefix is provided as argument to the command.
The switch "-l" lets you specify how many lines to include in each file.
# split the content into separate files of 50 lines each
split -l 50 arrayDat.txt
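If you pass a second argument, split uses it as the prefix of the output files instead of "x" (a small optional example; the prefix "chunk_" is arbitrary):
# split into files of 50 lines each, named chunk_aa, chunk_ab, ...
split -l 50 arrayDat.txt chunk_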
This command invokes a stream editor that modifies the input as specified by a list of commands.
The pattern replacement syntax is: s/pattern/replacement/
# substitute the word “chromosome” with the string “chr”
sed s/chromosome/chr/g arrayAnnot.txt
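Note that sed writes its result to standard output and leaves the input file unchanged; to keep the edited text, you can redirect the output to a new file (an optional illustration; the output filename is arbitrary):
# save the substituted text into a new file
sed s/chromosome/chr/g arrayAnnot.txt > arrayAnnotShort.txt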
awk is more than just a command, it is a text processing language.
The input file is treated as sequence of records. By default, each line is a record and is broken into fields.
An awk command is a sequence of condition-action statements, where:
condition is typically an expression
action is a series of commands
awk 'condition {action}' inputfile
Typically, awk reads the input one line at a time; when a line matches the provided condition, the associated action is executed.
There are several built-in variables:
field variables: $1 refers to the first field, $2 refers to the second field, etc.
record variable: $0 refers to the entire record (by default the entire line).
NR represents the current count of records (by default the line number).
NF represents the total number of fields in a record (by default the last column).
FILENAME represents the name of the current input file.
FS represents the field separator character used to parse fields of the input record. By default, FS is the white space (both space and tab). FS can be reassigned to another character to change the field separator.
# print every line of the input file (missing condition)
awk '{print $0}' arrayDat.txt
# print every line of the input file (default argument for print is $0)
awk '{print}' arrayDat.txt
# print 1st field only
awk '{print $1}' arrayDat.txt
# rearrange the fields, separated by a blank space
awk '{print $1, $3, $2}' arrayDat.txt
# rearrange the fields, separated by a tab
awk '{print $1, "\t", $3, "\t", $2}' arrayDat.txt
# print the number of fields for each record
awk '{print NF}' arrayDat.txt
# print the last field of each record
awk '{print $NF}' arrayDat.txt
# print a string
awk 'BEGIN {print "Hello world!"}'
#print the result of an arithmetic operation
awk 'BEGIN {print 2+3}'
awk 'BEGIN {print "2+3=" 2+3}'
# print first 5 lines (default action statement is to print)
awk 'NR < 6' arrayDat.txt
# print first line (i.e. column headers)
awk 'NR == 1' arrayDat.txt
# print first 5 records, excluding headers (NR ==1)
awk 'NR > 1 && NR < 7' arrayDat.txt
# print the total number of records
awk 'END {print NR}' arrayDat.txt
By using loops and conditional statements, awk allows you to perform quite sophisticated actions. For example, we can compute the sum or the mean of the intensity values for each probe across all samples:
Compute the sum of intensity values for each gene: for all records except the column headers, we initialize the variable "s" to zero, then we loop through all the values in a given record and sum them cumulatively. When all the values have been considered, we print the sum.
Compute the mean of intensity values for each gene: we compute the sum of the intensity values as before, but we divide the sum by the number of values considered before printing the result. Note that the number of values considered is the number of fields in each record minus one (the probe ID).
# print sum of values for each gene
awk 'NR > 1 {s=0; for (i=2; i<=NF; i++) s=s+$i; print s}' arrayDat.txt
# print mean of values for each gene
awk 'NR > 1 {s=0; n=NF-1; for (i=2; i<=NF; i++) s=s+$i; s=s/n; print s}' arrayDat.txt
Because the condition can be a pattern matching expression, awk can easily emulate grep.
# print the lines that match the string "orf"
awk '/orf/' arrayAnnot.txt
# print the probe IDs whose annotation contain the string "orf"
awk '/orf/ {print $1}' arrayAnnot.txt
# print number of lines matching “orf” (emulates grep -c)
awk '/orf/ {n++}; END {print n+0}' arrayAnnot.txt
Storage is charged at $100 per TB, per year, billed through iLab.
Each lab gets its own directory, accessible via the Luria cluster head node and through instruments and personal endpoints over SMB.
Storage is generally not accessible via the public internet, so use of the MIT VPN is required for off-campus connections.
To request access, please send an email to [email protected] and indicate your lab affiliation.
On Luria, you can access your lab data at /net/bmc-labX/data/lab/<share> or /net/bmc-pub14/data/<share>, where <share> is your lab name (your PI's last name in lower case, e.g. sharp or jacks). For example:
cd /net/bmc-lab2/data/lab/sharp
or
cd /net/bmc-lab5/data/kellis
Note: the above location is only an example. You should change the PATH to your own lab share.
Click on Finder to get a new Finder window.
Click on the Go menu on the top of the screen.
Choose Connect to Server under the Go menu.
Enter the address of the share, for example: smb://bmc-lab6.mit.edu/share
You should replace "bmc-lab6" with the actual storage server name (e.g. bmc-pub14 or bmc-lab5), and replace "share" with the lab or core name that you are looking to access (e.g., jacks, histology).
Click Connect
Enter your MIT Kerberos username (not email) and Kerberos password. These fields are case sensitive.
If you put in your MIT Kerberos username and password correctly but are still not able to log in, try appending win\ to your username and trying again. For example, change mykerberos to win\mykerberos.
Now your network drive should be connected. Browse the drive to access your files and folders.
Open the Windows File Explorer.
Click "Map network drive" on the toolbar to open a dialog box.
Choose a drive letter not being used for the network folder in the Drive drop-down list.
In the Folder box, enter the network share path name, e.g. \\bmc-lab3.mit.edu\<share>.
Make sure you use backslashes as in the above example, not forward slashes.
You should replace "bmc-lab3" with the actual storage server name (e.g., bmc-pub17), and replace <share>
with your lab name (your PI's last name in lower case, e.g. "sharp" or "jacks").
Check the "Connect using different credentials" box. Then, click Finish.
In the Enter Network Password dialog, enter WIN\ followed by your MIT Kerberos username (not email) and your Kerberos password.
Click OK to connect.
Default permission of new folders and files: owner read/write, lab members read only and no access for others.
Luria users: use the command chmod to change permissions. For example, use chmod -R g+w to add group write access for a folder:
chmod -R g+w /net/bmc-lab2/data/lab/labX/userY/projectZ
If you are collaborating with your lab members on a folder that needs to be group writable, it is recommended that you set umask 002 in your ~/.bashrc file so that new files/directories will be created with owner and group write privileges, but read-only access for others.
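For example, the relevant line in ~/.bashrc might look like this (a minimal sketch; adjust to your lab's needs):
# make new files and directories group-writable by default
umask 002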
Windows users: Right click your folder or file and then select properties. Go to the security tab to change permissions.
Mac users: To change permissions, access the data via the cifs protocol instead of smb, i.e., cifs://bmc-lab2.mit.edu/<share> instead of smb://bmc-lab2.mit.edu/<share>. Right click your folder or file and then select Get Info. Go to the Sharing & Permissions section to change custom access for group members.
Double-check your MIT Kerberos username and password. Remember, it's your username, not your email address.
Confirm that 1-4 hours have passed after you've been added to the appropriate Moira list that controls access to the storage target.
Confirm whether adding the domain prefix (WIN) makes a difference:
For Windows: Use WIN\your_kerberos_username
For MacOS: Use just your Kerberos username
Verify that you're connected to the MIT VPN if you're off-campus.
If you've recently changed your password, make sure you're using the new one.
If you're getting "path not found" or similar errors:
Verify the exact server name and share path. Common mistakes include:
Using forward slashes (/) instead of backslashes (\) on Windows
Mistyping the server name (e.g., "bmc-lab3" instead of "bmc-lab2")
Using the wrong share name (remember, it's usually your PI's last name in lowercase)
Check that you're using the correct protocol:
Windows: \\server_name\share_name
MacOS: smb://server_name/share_name
Ensure you have the necessary permissions to access the path. You may need to contact your lab administrator or [email protected] for access.
Check your network connection. Run a speed test to ensure you have adequate bandwidth.
If you're off-campus, verify that your connection and VPN connection are stable.
Try connecting at a different time. High network traffic during peak hours (10AM to 4PM) can affect performance.
After checking the above, try completely disconnecting your active connections. This means fully disconnecting any mounted network locations, and closing any files or terminal sessions that are accessing said storage target.
If you are still experiencing issues, we recommend trying the open-source SMB client browser Cyberduck, with these instructions. If using the Cyberduck client resolves your issue(s), then the issue lies with your client sessions, not the server.
If you continue to experience problems after trying these solutions, please contact [email protected] for further assistance. Be sure to include:
The exact error message you're seeing.
The steps you've already taken to troubleshoot.
Your operating system and version.
The server and share name you're trying to access.
Whether you are aware of other users in your lab or group experiencing the same issue, and if so, for how long.
pwd
This command name stands for "print working directory".
It displays on the terminal the absolute path name of the current (working) directory.
It is useful to locate where you are currently in the Unix tree.
# print working directory
pwd
cd
This command name stands for "change directory".
It changes your current working directory to the specified location.
Your home directory is referred to as "~" (tilde).
The current directory is referred to with a single dot ( ".").
The directory located one level up the current directory (also known as "parent directory") is referred to with two dots ("..").
The last visited directory is referred to with a hyphen ("-").
# go to root directory of the system and print the working directory
cd /
pwd
# go to the home directory and print the working directory
cd ~
pwd
# change directory using the absolute path and print the working directory
cd /net/bmc-pub14/data/
pwd
ls
This command stands for "list".
It displays the content of a directory.
By default, the content is listed in lexicographic order.
# list the content of the current directory
ls
# list content of the parent directory
ls ../
Command switches allow you to list the directory's content with more or less detail and sorted according to different criteria.
The switch "a" lists all files and directories, even those starting with "." that are normally hidden.
The switch "l" lists the content in "long" format, including the total size (in blocks), mode (permission), the number of links, the owner, the size, the date of last modification, and the file or subdirectory name. Note that the first hyphen is replaced with a “d” if the item is a directory.
The switch "t" sorts the content of the directory by time modified (most recently modified first).
The switch "S" sorts the content of the directory by size (smaller first).
The switch "r" reverses the order of the sort.
The switch "R" recursively lists subdirectories encountered.
The switch "u" specifies the use of time of last access of the file, instead of last modification, for sorting the content.
The switch "U" specifies the use of time of file creation, instead of last modification, for sorting the content.
The switch "F" displays a slash ("/") immediately after a directory, an asterisk ("*") after an executable, an at sign ("@") after a symbolic link.
# list all files
ls -a
# list in long format
ls -l
# list in long format, sorting by date
ls -lt
# list in reverse lexicographic order
ls -r
# list by size
ls -S
# list in long format according to last access
ls -lu
# list in long format according to the time of creation
ls -lU
# display “/” after a directory, “*” after an executable, “@” after a symbolic link
ls -F
To view the commands available on luria, execute the following (note the space-separated list of directories to list):
ls /bin/ /usr/bin/ /usr/local/bin/
mkdir
This command name stands for "make a directory".
It creates a new folder (or directory). If no path is specified, the new directory is created in the current directory.
The switch "-v" specifies a verbose mode: a message with the folder(s) created is printed on the screen.
# create a directory named "testdir1" with a subdirectory named "testdir2"
mkdir testdir1
mkdir testdir1/testdir2
# change current directory directly to "testdir2"
cd testdir1/testdir2
# go to the parent directory (i.e. testdir1) and print the working directory
cd ..
pwd
# create a new directory named "testdir3" with the verbose mode on
mkdir -v testdir3
cp
This command name stands for "copy".
It makes copies of files and directories to the specified location.
The switch "v" enables the verbose mode: messages describing the files copied are displayed on the screen.
The switch "i" enables the interactive mode: confirmation is requested before overwriting files.
Wildcards symbols such as "*" or "?" are commonly used to copy multiple files with a single command.
The symbol "*" stands for any number of alphanumeric characters.
The symbol "?" stands for a single alphanumeric character.
The following examples assume you have a directory named "unix_class" in your home directory:
# create a directory "unix_class" in your home directory and access it
mkdir ~/unix_class
cd ~/unix_class
# copy the file named arrayDat.txt into your unix_class directory
cp /net/ostrom/data/dropbox/arrayDat.txt ~/unix_class/
ls
# the above command is equivalent to the following:
cp /net/ostrom/data/dropbox/arrayDat.txt ./
ls
# use the interactive mode and type "y" to confirm the choice
cp -i /net/ostrom/data/dropbox/arrayDat.txt ~/unix_class/
ls
# use the verbose mode to see the file being copied
cp -v /net/ostrom/data/dropbox/arrayDat.txt ~/unix_class/
# make a local copy of the file call it "arrayDat1.txt"
cp arrayDat.txt arrayDat1.txt
ls
# copy all the files whose names start with "array" into the current directory
cp /net/ostrom/data/dropbox/UNIX/array* ./
ls
# copy any file whose extension is "txt"
cp -i /net/ostrom/data/dropbox/UNIX/*.txt ./
ls
mv
This command name stands for "move".
It moves (renames) files and directories.
Several switches are available to specify the behavior of this command, including "-i" (interactive mode) and "-v" (verbose).
# cp arrayDat.txt into arrayDat1.txt, then rename arrayDat1.txt as arrayDat2.txt
cp arrayDat.txt arrayDat1.txt
mv arrayDat1.txt arrayDat2.txt
ls
# rename in interactive mode, i.e. ask for confirmation before overwriting existing file
cp arrayDat2.txt arrayDat3.txt
ls
mv -i arrayDat2.txt arrayDat3.txt
ls
# rename in verbose mode, i.e. print information on the screen
mv -v arrayDat3.txt arrayDat4.txt
ls
rmdir
This command stands for "remove directory".
It deletes the specified directory.
The switch "-v" specifies a verbose mode: a message with the folder(s) deleted is printed on the screen.
Note that in general you cannot delete a directory that is not empty. The content of such a directory has to be deleted before the directory itself can be deleted.
# remove the folder "testdir2" in the location specified by the path
rmdir ~/testdir1/testdir2
# remove the folder "newdir1"
rmdir testdir1
# remove the folder "newdir3" using the verbose mode
rmdir -v testdir3
rm
This command name stands for "remove".
It deletes files and directories.
Several switches are available to specify the behavior of this command, including "-i" (interactive mode) and "-v" (verbose).
# create copies of arrayDat.txt
cp arrayDat.txt arrayDat1.txt
cp arrayDat.txt arrayDat2.txt
cp arrayDat.txt arrayDat3.txt
cp arrayDat.txt arrayDat4.txt
ls
# delete file
rm arrayDat2.txt
# delete file in interactive mode, i.e. ask for confirmation before removing
rm -i arrayDat3.txt
# delete file in verbose mode, i.e. prints the name of file deleted
rm -v arrayDat4.txt
The "rm" command can be used in conjunction with wildcard symbols to delete multiple files at once. Extreme caution should be used, as action are not often reveresible. It is good practice either to use the interactive/verbose mode or use the command "ls" with the intended pattern, before invoking the command "rm".
# create a copy of arrayDat1.txt, then delete files whose names start with "arrayDat",
# followed by a single digit and the extension "txt"
cp arrayDat1.txt arrayDat2.txt
rm -i arrayDat?.txt
The switch "-r" deletes the content of a directory recursively,
i.e. deletes the directory's files and subdirectories, including their content.
#create a subdirectory and copy *.txt files
mkdir testdir
cp *.txt testdir/
ls
ls testdir/*
# try to remove the directory with rmdir (it fails because the directory is not empty), then remove it with rm -r
rmdir testdir
rm -r testdir
ls
cat
This command name stands for "concatenate".
It concatenates the content of file(s) and prints it out.
The switch "-n" specifies to number each line, starting from 1.
The "cat" command is often used in conjuction with the pipe ("|") to view the content of long files.
The "cat" command can be used to copy and append the content of files using output redirection (">").
# view the content of arrayDat.txt
cat arrayDat.txt
# number each line of arrayDat.txt
cat -n arrayDat.txt
# copy the content of arrayDat.txt using output redirection
cat arrayDat.txt > file1.txt
cat arrayDat.txt > file2.txt
ls
# concatenate the content of the two files
cat file1.txt file2.txt
# append the content of file2.txt to file1.txt
cat file2.txt >> file1.txt
cat file1.txt
When used without argument, the "cat" command takes STDIN as default argument. You can use this behavior to type in brief information, rather than invoking a standard text editor.
# type the following command
cat > testfile.txt
# type any text, then hit Return, then press Ctrl-C
# view the content of the newly created file
cat testfile.txt
less
This command is a "pager" that allows the user to view (but not modify) the contents of a text file one screen at a time.
The space-bar is used to advance to the next page.
The key "q" is used to "quit" the pager.
The switch "-n" allows to view the content of a file starting at the specified line.
# view the content of arrayDat.txt, hit the space bar to advance, type "q" to quit
less arrayDat.txt
# view content of file starting at line 5
less +5 arrayDat.txt
head
This command displays the top part of a file.
By default, the first 10 lines of a file are displayed.
The switch "-n" specifies the number of lines to display.
The switch "-b" specifies the number of bytes to display.
# display the first lines of a file (10 lines by default)
head arrayDat.txt
#display the first 2 lines of a file
head -n 2 arrayDat.txt
tail
This command displays the bottom part of a file.
By default, the last 10 lines of a file are displayed.
The switch "-n" specifies the number of lines to display.
The switch "-b" specifies the number of bytes to display.
# display the last lines of a file (10 lines by default)
tail arrayDat.txt
#display the last 2 lines of a file
tail -n 2 arrayDat.txt
top
This command displays Linux tasks
The top program provides a dynamic real-time view of a running system
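For example (top -u limits the display to a single user's processes; replace username with your own username):
# interactive view of all running tasks (press q to quit)
top
# show only the processes owned by a specific user
top -u username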
du
This command stands for Disk Usage
It estimates file space usage
# print sizes in human readable format
du -h
df
This command stands for Disk Free
It reports file system disk space usage
# print sizes in human readable format
df -h
One very important utility is the shell.
Shell - user interface to the system, lets you communicate with a UNIX computer via the keyboard.
The shell interfaces you with the computer through three channels: standard input (STDIN), standard output (STDOUT), and standard error (STDERR). STDERR is the channel used for communicating program errors. We'll focus on STDIN and STDOUT.
STDIN is the channel, or stream, from which a program reads your input. When you type in your terminal, you're typing into the STDIN.
STDOUT is the channel, or stream, to which a program writes its output. When you run echo 'Hello, World!', the shell prints the output to STDOUT. In this case, STDOUT is the terminal screen.
[asoberan@luria net]$ echo 'Hello, World!'
Hello, World!
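As a small illustration of these channels (the file names here are arbitrary), both STDOUT and STDERR can be redirected to files:
# send STDOUT to a file instead of the terminal
echo 'Hello, World!' > out.txt
# send STDERR to a file (ls prints an error because the file does not exist)
ls no_such_file 2> err.txt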
STDIN and STDOUT are the ways that you and the shell communicate. But what is the shell doing when you tell it to run a command?
Let's say I want to run the echo command, as we did above. We type echo in the terminal, then press Enter. What the shell does is read the PATH environment variable, which is a system variable that lists the different paths that programs are stored in.
Using printenv to print the PATH environment variable. Notice how many paths in this environment variable are ones dictated by the FHS:
[asoberan@luria net]$ printenv PATH
/home/asoberan/.conda/envs/class/bin:/home/software/conda/miniconda3/condabin:/home/software/conda/miniconda3/bin:/home/asoberan/.local/bin:/opt/ohpc/pub/mpi/mvapich2-gnu/2.2/bin:/opt/ohpc/pub/compiler/gcc/5.4.0/bin:/opt/ohpc/pub/prun/1.1:/opt/ohpc/pub/autotools/bin:/opt/ohpc/pub/bin:/home/asoberan/.local/bin:/home/software/google-cloud-sdk/google-cloud-sdk-193.0.0/google-cloud-sdk/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin
Then, it will go through these paths and look for the program you are trying to run. For example, the echo command is stored in /usr/bin, a directory we can see in the PATH environment variable:
Using which to find the location of a program:
[asoberan@luria net]$ which echo
/usr/bin/echo
Once it finds it, it will run the program in a new process, wait for the program to give an exit status, then print any output of the program into STDOUT, in this case our terminal screen. Now, the shell is ready to run another command.
The shell is used to interface with a UNIX computer, but we're not going to go to the server room, plug in a keyboard, and start typing commands for our shell. We're going to use the Secure Shell (SSH) to remotely connect to Luria.
SSH is a program that lets you open a remote shell to your UNIX system. It's secure because your connection to the system is kept encrypted. For instructions on using SSH to connect to Luria, refer to the page below:
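As a quick illustration (username is a placeholder for your MIT Kerberos username; see the linked page for the full instructions):
ssh username@luria.mit.edu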
The following examples use the file isoforms_fpkm_track.txt, which you can copy by typing the following command on Luria:
This is incorrect! Will update this later.
# copy the file from the stock directory to the current directory
cp /net/ostrom/data/rowleybcc/charliew/old_teaching/Biol2Bioinfo/UNIX/isoforms_fpkm_track.txt .
The given file is tab delimited and represents one of the outputs of the transcriptome assembly tool Cufflinks.
A description of the columns is reported here for convenience, but more information can be found in the cufflinks manual.
tracking_id: A unique identifier describing the object (gene, transcript, CDS, primary transcript)
class_code: The class_code attribute for the object, or "-"
nearest_ref_id: The reference transcript to which the class code refers, if any
gene_id: The gene ID(s) associated with the object
gene_short_name: The gene short name(s) associated with the object
tss_id: The tss ID associated with the object, or "-"
locus: Genomic coordinates of the object
length: The number of base pairs in the transcript, or '-' if not a transcript/primary transcript
coverage: Estimate for the absolute depth of read coverage across the object
FPKM: FPKM of the object
FPKM_lo: The lower bound of the 95% confidence interval on the FPKM of the object
FPKM_hi: The upper bound of the 95% confidence interval on the FPKM of the object
status: Quantification status for the object. Can be one of: OK (deconvolution successful), LOWDATA (too complex or shallowly sequenced), HIDATA (too many fragments in locus), or FAIL (when an ill-conditioned covariance matrix or other numerical exception prevents deconvolution).
In this tutorial we will answer the following questions about the given file:
By counting how many distinct entries there are in the 1st and 4th column, we can determine the number of transcripts and genes respectively.
# count the number of isoforms
cut -f 1 isoforms_fpkm_track.txt | sort | uniq | wc -l
# count the number of genes
cut -f 4 isoforms_fpkm_track.txt | sort | uniq | wc -l
The FPKM values are reported in the 10th field. We can ignore the transcripts with zero FPKM value by using a simple awk statement.
# print isoforms with nonzero FPKM value
awk '$10 !=0 {print $0}' isoforms_fpkm_track.txt
Transcripts with status other than "OK" are generally not suitable for downstream analysis. We can filter these out by considering only those records whose last field matches the string "OK".
# print isoforms with quantification status OK
awk '$13 == "OK" {print $0}' isoforms_fpkm_track.txt
Note that using a naive grep statement might not always work because a naive grep match is performed on the entire line, rather than on a specific field.
Sort by increasing FPKM values
Because FPKM values are often reported in a scientific format called E notation, a simple numeric sort (specified with the switch -n) does not work properly because the character "e" in the notation is not interpreted as it should be (e.g. 2.5e+8 is equivalent to 2.5 x 10^8). The switch "-g" takes care of this by performing a general numeric sort that handles E notation.
# sort numerically by increasing FPKM value (E notation IS NOT handled correctly!)
sort -n -k 10 isoforms_fpkm_track.txt
# check the highest FPKM values
sort -n -k 10 isoforms_fpkm_track.txt | cut -f 10 | tail
# sort numerically by increasing FPKM value (E notation IS handled correctly!)
sort -g -k 10 isoforms_fpkm_track.txt
# check the highest FPKM values
sort -g -k 10 isoforms_fpkm_track.txt | cut -f 10 | tail
Imposing in a simple awk statement that the length must be greater than 125 and smaller than 10000 yields the records with transcript lengths meeting the specified criteria.
# print only transcripts with length between 125 and 10000
awk '$8 > 125 && $8 < 10000' isoforms_fpkm_track.txt
# verify the length of the shortest transcript
awk '$8 > 125 && $8 < 10000' isoforms_fpkm_track.txt | cut -f 8 | sort -n | head
# verify the length of the longest transcript
awk '$8 > 125 && $8 < 10000' isoforms_fpkm_track.txt | cut -f 8 | sort -n | tail
Suppose we want to split the locus annotation into chromosome, start position and end position. We can use an awk statement to split the content of the 7th field using the separators ":" and "-".
# split locus into chromosome, start and end
awk 'split($7, a, ":") split(a[2],b,"-") {print $1,a[1],b[1],b[2]}' isoforms_fpkm_track.txt
We can extract the records of those isoforms that are mapped within a specific genomic interval. For simplicity's sake, first we split the locus into chromosome, start and end, then we impose limits on the coordinates using awk.
# split locus into chromosome, start and end
awk 'split($7, a, ":") split(a[2],b,"-") {print $1,a[1],b[1], b[2]}' isoforms_fpkm_track.txt > tmp
# impose condition on start and end position (10,000,000 and 12,000,000)
awk '$3 > 10000000 && $4 < 12000000 {print}' tmp
Jobs are managed on luria.mit.edu using Slurm.
Slurm is an advanced job scheduler for a cluster environment.
The main purpose of a job scheduler is to utilize system resources in the most efficient way possible.
The number of tasks (slots) required for each job should be specified with the "-n" flag
Each node provides 16, 32, or 96 slots.
The process of submitting jobs to Slurm is generally done using a script. The job script allows all options and the programs/commands to be placed in a single file.
It is possible to specify options via command line, but it becomes cumbersome when the number of options is significant.
An example of a script that can be used to submit a job to the cluster is reported below. Start by opening a file and copy and paste the following commands, then save the file as myjob.sh or any other meaningful name. Note: Job names can not start with a number.
The first 5 lines specify important information about the job submitted; the rest of the file contains some simple UNIX commands (date, sleep) and comments (lines starting with #####).
The "#SBATCH" is used in the script to indicate an slurm option.
#SBATCH -N 1: You should always include this line exactly. The number followed by -N must always be 1 unless you run MPI applications which is rare for typical bioinformatics software.
#SBATCH -n: This is the number of tasks requested. The recommended maximum number allowed is 16 in the normal partition, i.e., don't ask for more than 16 tasks in your script unless you receive special instructions from the system administrator. It is important to request resources as accurately as you can. If possible, please do not request more than what you need, and do not request less than what you need. The best way to find out how much you need is through testing. While your job is running, you can ssh to the node and use the top command to see if it is using the requested resources properly. Note that what you requested from the slurm scheduler by -n is not the actual CPUs allocated by the OS.
#SBATCH --mail-user=[]: You must replace [] with your email address
#SBATCH -t [min] OR -t [days-hh:mm:ss]: Specifies the wall clock limit. It has a time limit of 14 days at maximum, i.e. a job cannot run for more than 14 days on Luria.
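Since the example script itself is not reproduced here, a minimal sketch based on the options described above might look like the following (the task count, email address, time limit, and sleep duration are placeholders you should adjust):
#!/bin/bash
#SBATCH -N 1
#SBATCH -n 4
#SBATCH --mail-user=username@mit.edu
#SBATCH -t 0-12:00:00
##### print the date, wait a minute, then print the date again
date
sleep 60
date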
Submit your job by executing the command:
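sbatch myjob.sh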
where myjob.sh is the name of the submit script. After submission, you should see the message: Submitted batch job XXX where XXX is an auto-incremented job number assigned by the scheduler. For example: Submitted batch job 3200863
To submit your job to a specific node, use the following command:
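sbatch -w cX myjob.sh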
where X is a number specifying the node you intend to use. For example, the following command will submit myjob.sh to node c5 : sbatch -w c5 myjob.sh
To submit your job while excluding nodes (for example exclude c[5-22]), use the following command:
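# exclude nodes c5 through c22 (one possible syntax)
sbatch --exclude=c[5-22] myjob.sh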
You should not run interactive jobs on the head node of Luria. The head node is shared by all users. An interactive job may negatively affect how other users interact with the head node or even make the head node inaccessible to all users. Thus, instead of running myjob.sh on the head node, you should run "sbatch myjob.sh". However, you can run an interactive job on a compute node. This can be done using command "srun --pty bash", which will open up a remote shell on a random compute node.
Then you can run program interactively. This is often useful when you are compiling, debugging, or testing a program, and the program does not take long to finish.
Sometimes your program (such as MATLAB or R) may need an X11 window for its graphical user interface; in that case you can still use the command srun --pty bash, but you will also need to install an X11 server such as Xming or XQuartz on your machine to display X windows and to enable X11 forwarding in your ssh client.
Remember to exit cleanly from interactive sessions when done; otherwise the session may be killed without notice.
A user can submit up to 1000 jobs at a time. Jobs are typically scheduled on a first-come, first-served basis. If you submit a lot of jobs at a time and take a lot of resources, others will have to wait until your jobs complete, which is not optimal for cluster usage. If you do need to submit a lot of jobs, please add these options in your job scripts.
The nice option lowers the job priority and the exclude option excludes half of the nodes from your jobs. This gives others' jobs a chance to run while still allowing you to run some of your jobs.
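For example, the two options might be added to a job script like this (a sketch only; the nice value and the node range are placeholders, and the recommended values may differ, so check with the system administrators):
#SBATCH --nice=100000
#SBATCH --exclude=c[5-22]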
To monitor the progress of your job use the command:
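squeue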
To display information relative only to the jobs you submitted, use the following (where username is your username):
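squeue -u username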
A useful tip on customizing the output of squeue
Get more information on a job
Any job run on the cluster is output to a slurm output file slurm-XXX.out where XXX is the job ID number (for example: slurm-3200707.out).
After submitting myjob.sh, any output that would normally be printed to the screen is redirected to slurm-XXX.out.
You can also redirect output within the submission script.
To stop and delete a job, use the following command:
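scancel XXX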
where XXX is the job number assigned by slurm when you submit the job using sbatch. You can only delete your jobs.
To check the status of host and its nodes, you can use the following command:
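sinfo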
states of nodes:
Customizing output
Slurm job arrays provide an easy way to submit a large number of independent processing jobs. For example, job arrays can be used to process the same workflow with different datasets. When a job array script is submitted, a specified number of array tasks are created based on the master sbatch script.
SLURM provides a variable named $SLURM_ARRAY_TASK_ID to each task. It can be used inside the job script to handle input/output for that task. For example, create a file named ex3.sh with the following lines.
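Since the example script is not reproduced here, a minimal sketch of what ex3.sh might contain is shown below (the array size and the input-file naming scheme are assumptions for illustration only):
#!/bin/bash
#SBATCH -N 1
#SBATCH -n 1
#SBATCH --array=1-10
##### use the task ID to pick this task's input file (hypothetical naming scheme)
echo "Task $SLURM_ARRAY_TASK_ID processing data_${SLURM_ARRAY_TASK_ID}.fastq"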
Create a SLURM job script named ex1.sh that processes a fastq file. Include the following lines in your job script. Determine the appropriate -n value. Use the top command to watch CPU and memory usage while the job is running on a compute node.
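A minimal sketch of how you might watch the running job (the node name c5 is an example; use squeue to find the node your job was actually assigned to):
# find which node your job is running on
squeue -u $USER
# open a shell on that node and watch your own processes
ssh c5
top -u $USER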
Now you want to process multiple fastq files and run the script in parallel. One way to do this is to pass the fastq filename as an argument and sbatch the script with that filename as the argument. Create a new script named ex2.sh and include the following lines.
You can then submit the script twice with different arguments.
Get an interactive terminal
You will get a node allocated by the slurm scheduler. For example, c2.
Start the notebook on the allocated node (e.g. c2).
Open an SSH connection to luria.mit.edu and the node (e.g. c2) from your local machine, i.e. from your desktop or laptop SSH client in either the Terminal (Mac) or PowerShell (Windows). Replace username with your own username, and c2 with the actual compute node.
The above commands are actually running
It is likely that some other user has taken port 8888 ($HEAD_NODE_PORT) on the head node. In that case, you will get an error "bind: Address already in use". You should then change $HEAD_NODE_PORT from 8888 to a different port such as 8887 or 8886.
You can also change $MY_MACHINE_PORT and $COMPUTE_NODE_PORT, but that is only needed if you have another process that has taken 8888 on your local machine, or another user happens to take 8888 on the same compute node.
Tunnel from your local machine (either Windows or Mac) to Jupyter notebook running on $COMPUTE_NODE_NAME
Direct the browser on your local machine to http://localhost:8888 (or whichever $MY_MACHINE_PORT you chose).
Close connection when finished
There is no RStudio Server installed on Luria, but you can run RStudio Server from a Singularity image using a wrapper script (rstudio.sh); see the example commands below. The steps of getting an interactive terminal and opening an SSH tunnel from your local machine to the allocated compute node are similar to the Running Jupyter notebook steps above, but with a different module (module load singularity) and a different default port (8787). Here, local machine refers to your Mac or Windows PC; run the ssh command from the Terminal (Mac) or PowerShell (Windows). For example:
Tip 1: To allow the RStudio Server Singularity container to access your data located on your storage server (e.g. rowley or bmc-lab2), you need to edit the rstudio.sh file to bind your data path. At the end of the rstudio.sh script, you will see a singularity exec command with many --bind arguments. In the script, add additional --bind arguments, for example --bind /net/rowley/ifs/data/<labname>/<username> or --bind /net/bmc-lab2/data/lab/<labname>/<username>, where <labname> and <username> are placeholders you need to replace with actual values.
Tip 2: Ignore the "open an SSH tunnel" command printed in the rstudio.sh standard output; instead, use the ssh command from the example above on your local machine.
Tip 3: If someone else has taken the port 8787 on the head node of Luria, you will get an error like "bind: Address already in use" when you run the ssh command from your laptop. In that case, choose a different port number, e.g. 8777 (please refer to the previous section on Jupyter notebook for an explanation of port numbers). For example
Alternatively, you can choose a different port number before starting the rstudio.sh script on the compute node, for example
Tip 4: The script uses a custom image with Seurat dependencies pre-installed. You can select your own R version or image based on the documentation of the
Tip 5: If you are using Windows SecureCRT to connect to Luria, you can set up Port Forwarding (Tunneling). Select Options -> Session Options, click on "Port Forwarding", and then click the "Add" button to add a forwarding setup. You will get the Local Port Forwarding Properties dialog. Choose a Name, and then set the port for both Local and Remote, for example 8777. On the head node of Luria, you will also need to run
An example MATLAB script: matlabcode.m
Note: fname needs to be changed accordingly.
A shell job submission script that submits matlabcode.m to a compute node:
To automate a MATLAB script, the interactive session can be turned off.
On a UNIX system, MATLAB uses the X-Windows Interface to interact with the user. In a batch execution, this communication is not necessary.
We don't need the initial "splash" which displays the MATLAB logo. We can turn this off with the -nosplash option.
We also don't want the interactive menu window to be set up, which can be suppressed with the -nodisplay option.
Two other options may be useful for suppressing visual output and logs: the -nojvm option ("no Java Virtual Machine") and -nodesktop.
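Putting these together, a batch invocation looks like the following (a sketch; the module version matches the submission script shown later on this page, and output is just a log filename):
module load matlab/2011b
matlab -nosplash -nodisplay -nojvm -nodesktop < matlabcode.m > output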
Many Unix systems come preinstalled with crontab, a tool that generates a file where each row is a cron job, i.e. a program or script set to run at a specific time or interval.
Creating cron jobs is useful for periodically running programs. For example, you could create a script that searches your directory for files older than 1 year, and then create a cron job that runs this script once a month.
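A minimal sketch of such a script (the paths and the one-year threshold are only examples):
#!/bin/bash
# report files under your data directory that have not been modified in over a year
find ~/data -type f -mtime +365 > ~/old_files_report.txt
A crontab entry could then run this script once a month, using a schedule expression as described below.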
To specify the schedule on which a cron job should run, you use a cron schedule expression. Cron schedule expressions have the following format:
The first field represents the minute at which the cron job should run, the second the hour, the third the day of the month, the fourth the month, and the fifth the day of the week.
A * indicates "any". So a * in the month position means the job runs in any month.
Prefixing a number with */ means "every". So a */5 in the minute position means the job runs every 5th minute, i.e. every 5 minutes.
The following expression would make the cron job run at 6:00 AM every day, i.e. the 0th minute of the 6th hour, on any day of the month, in any month, and on any day of the week.
The following expression would make the cron job run every 30 minutes on Saturday, i.e. every 30th minute of any hour, on any day of the month, in any month, on the 6th day of the week.
It's easy to get confused when writing cron schedule expressions, so consider using an online cron expression builder that constructs the expression for you and describes it back in plain language.
Once you have a cron schedule expression, you can create a cron job in your crontab using the command crontab -e, which will open an editor for you to add your cron job. Put your cron job on an empty line with the following format:
#!/bin/bash
#SBATCH -N 1 # Number of nodes. You must always set -N 1 unless you receive special instruction from the system admin
#SBATCH -n 1 # Number of tasks. Don't specify more than 16 unless approved by the system admin
#SBATCH --mail-type=END # Type of email notification- BEGIN,END,FAIL,ALL. Equivalent to the -m option in SGE
#SBATCH --mail-user=[] # Email to which notifications will be sent. Equivalent to the -M option in SGE. You must replace [] with your email address.
#############################################
echo print all system information
uname -a
echo print effective userid
whoami
echo Today date is:
date
echo Your current directory is:
pwd
echo The files in your current directory are:
ls -lt
echo Have a nice day!
sleep 20
sbatch myjob.sh
sbatch -p [normal] -w [cX] [script_file]
sbatch --exclude c[5-22] myjob.sh
srun --pty bash
#SBATCH --nice=100000
#SBATCH --exclude=c[5-22]
squeue
squeue -u username
alias qf='squeue --format="%.18i %.16j %.8u %.8T %.10M %.4C %.20V %.8R"'
qf
scontrol show job <jobid>
scancel XXX
sinfo -N
mix : consumable resources partially allocated
idle : available to requests for consumable resources
drain : unavailable for use per system administrator request
drng : currently executing a job, but will not be allocated to additional jobs.
alloc : consumable resources fully allocated
down : unavailable for use. Slurm can automatically place nodes in this state if some failure occurs.
alias qn='sinfo -N --format="%.8N %.9P %.13C %.8O %.8e %.8m %.8T"'
qn
#!/bin/bash
#SBATCH -N 1
#SBATCH -n 4
#SBATCH --array=1-2
module load fastqc/0.11.5
module load bwa/0.7.17
FASTQDIR=/net/rowley/ifs/data/dropbox/UNIX
WORKDIR=~/data/class
mkdir -p $WORKDIR
cd $WORKDIR
FILE=$(ls $FASTQDIR/*.fastq | sed -n ${SLURM_ARRAY_TASK_ID}p)
fastqc -o $WORKDIR $FILE
bwa mem -t4 -o $(basename $FILE).sam /home/Genomes/bwa_indexes/mm10.fa $FILE
module load fastqc/0.11.5
module load bwa/0.7.17
mkdir -p ~/data/class
cd ~/data/class
fastqc -o ~/data/class /net/rowley/ifs/data/dropbox/UNIX/test_1.fastq
bwa mem -t16 -o ex1.sam /home/Genomes/bwa_indexes/mm10.fa /net/rowley/ifs/data/dropbox/UNIX/test_1.fastq
module load fastqc/0.11.5
module load bwa/0.7.17
FILE=$1
WORKDIR=~/data/class
mkdir -p $WORKDIR
cd $WORKDIR
fastqc -o $WORKDIR $FILE
bwa mem -t16 -o $(basename $FILE).sam /home/Genomes/bwa_indexes/mm10.fa $FILE
sbatch ex2.sh /net/rowley/ifs/data/dropbox/UNIX/test_1.fastq
sbatch ex2.sh /net/rowley/ifs/data/dropbox/UNIX/test_2.fastq
$ srun --pty bash
$ module load python3/3.6.4 # skip this if you have your own jupyter notebook installed through conda, pip or any other approach
$ export XDG_RUNTIME_DIR="" # skip this if you have your own jupyter notebook installed through conda, pip or any other approach
$ jupyter notebook --no-browser --port=8888 # 8888 is the port running on the compute node $COMPUTE_NODE_PORT
$ MY_MACHINE_PORT=8888
$ HEAD_NODE_PORT=8888
$ COMPUTE_NODE_PORT=8888
$ COMPUTE_NODE_NAME=c2
$ ssh -t [email protected] -L ${MY_MACHINE_PORT}:localhost:${HEAD_NODE_PORT} ssh ${COMPUTE_NODE_NAME} -L ${HEAD_NODE_PORT}:localhost:${COMPUTE_NODE_PORT}
$ ssh -t [email protected] -L 8888:localhost:8888 ssh c2 -L 8888:localhost:8888
ctrl-c
Run on head node: $ srun --pty bash # Get interactive terminal
Run on compute node (e.g. c7): $ module load singularity/3.5.0 # load singularity module on computing node
Run on compute node (e.g. c7): $ IMAGE=pansapiens/rocker-seurat:4.1.2-4.1.0 ./rstudio.sh # Launch wrapper script on computing node (which downloads the Singularity image only once if run for the first time)
Run on your local machine: % ssh -t [email protected] -L 8787:localhost:8787 ssh c7 -L 8787:localhost:8787 # replace username and c7 with actual values
Run on your local machine: open a browser and point URL to http://localhost:8787
Run on your local machine: % ssh -t [email protected] -L 8787:localhost:8777 ssh c7 -L 8777:localhost:8787 # replace username and c7 with actual values
Run on your local machine: open a browser and point URL to http://localhost:8787
Run on compute node (e.g. c7): $ export RSTUDIO_PORT=8777
Run on compute node (e.g. c7): $ IMAGE=pansapiens/rocker-seurat:4.1.2-4.1.0 ./rstudio.sh
Run on your local machine: % ssh -t [email protected] -L 8777:localhost:8777 ssh c7 -L 8777:localhost:8777 # replace username and c7 with actual values
Run on your local machine: open a browser and point URL to http://localhost:8777
Run on head node: $ ssh c7 -L 8777:localhost:8787 # replace c7 with actual node where the Rstudio server is running
Run on your local machine: open a browser and point URL to http://localhost:8777
fprintf('create a 3 by 3 random matrix X\n')
X=rand(3)
fprintf('create a 3 by 4 random matrix Y\n')
Y=rand(3,4)
fprintf('calculate the product of matrices X and Y\n')
Z=X*Y
fprintf('calculate the inverse of matrix X\n')
A=inv(X)
fprintf('transpose matrix Z\n')
B=Z'
fprintf('find out the toolboxes installed on rous\n')
ver
fprintf('find out the location of matlab toolboxes on rous\n')
matlabroot
fprintf('find out how to use Matlab Bioinformatics Toolbox\n')
help bioinfo
fprintf('A brief Example of using Matlab Bioinformatics Toolbox\n')
fprintf('Load data into the Matlab enviroment\n')
load yeastdata.mat
fprintf('Get the size of the data\n')
numel(genes)
fprintf('Display the 15th gene\n')
genes{15}
fprintf('Display the expression of the 15th gene\n')
yeastvalues(15,:)
fprintf('Display the time series in hours\n')
times
fprintf('Plot expression profile for gene15\n')
figure1=figure
plot(times, yeastvalues(15,:))
xlabel('Time Hours')
ylabel('Log2 Relative Expression Level')
fname='/home/duan/course/2015/June'
saveas(figure1,fullfile(fname,'YAL054C_expression_profile'),'jpg')
#!/bin/sh
#$ -S /bin/sh
#$ -cwd
#$ -V
#$ -m e
#$ -M [email protected]
#$ -pe whole_nodes 1
#############################################
module load matlab/2011b
matlab -nosplash -nodisplay -nojvm -nodesktop <matlabcode.m >output
* * * * *
(min) (hour) (day of the month) (month) (day of the week)
0 6 * * *
*/30 * * * 6
0 6 * * * /home/asoberan/your/script.sh
The following examples use the file myfile.sam, which you can copy by typing the following command on Luria:
# copy the file from the stock directory to the current directory
cp /net/ostrom/data/rowleybcc/charliew/old_teaching/Biol2Bioinfo/SAM/myfile.sam .
As revealed by the extension, the file is in SAM format.
The SAM format stores sequence data in a series of tab delimited columns.
BAM, the sister version of a SAM file, contains the same information but in a compressed, indexed, binary form that is not human readable.
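For reference, converting between the two formats is typically done with samtools (a sketch, assuming samtools is available on your PATH or through a module):
# convert SAM to a sorted BAM and index it
samtools sort -o myfile.bam myfile.sam
samtools index myfile.bam
# convert BAM back to SAM, keeping the header
samtools view -h -o myfile_back.sam myfile.bam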
Basic information on SAM files can be found here.
The official specifications of SAM and BAM format are available here.
In this tutorial we will answer the following questions about the given SAM file:
You can look at the content of the file by using the command more, head, cat, or by opening the file itself with a text editor such as nano.
# display the content of the file on the screen, one page at a time
more myfile.sam
# display the content of the whole file on the screen
cat myfile.sam
# display the first 10 lines
head myfile.sam
# open the file with the text editor nano
nano myfile.sam
Because in a SAM formatted file each line corresponds to one reported alignment, we just need to count the number of lines in the file to get the number of alignments.
# how many alignments?
wc -l myfile.sam
Given that in a generic SAM file any sequence read can have more than one reported alignment, the number of alignments is not necessarily the number of reads in the file. However, each sequence read is associated with a unique header, so we can count the number of distinct headers to determine how many reads are represented in the file.
# how many reads?
cut -f 1 myfile.sam | sort | uniq | wc -l
Because two distinct reads could have identical sequence (by chance due to their short length, or by being PCR duplicates), the number of reads is not necessarily the same as the number of distinct sequences in the file, so we need to count the number of distinct occurrences in the 10th field.
# how many sequences?
cut -f 10 myfile.sam | sort | uniq | wc -l
Uniquely mapped reads are generally understood as reads that have only one reported alignment. To determine the number of reads with only one reported alignment, we can count how many headers occur only once in the file. Note that without the switch "-u", repeated headers would also be reported once, yielding an incorrect count.
# how many reads are uniquely mapped?
cut -f 1 myfile.sam | sort | uniq -u | wc -l
Multi-hits are those reads that have more than one reported alignment. By counting how many distinct headers are repeated in the file, we can determine the number of multi-hits.
# how many reads are multi-hits?
cut -f 1 myfile.sam | sort | uniq -d | wc -l
To count the number of alignments each read has reported in the file, we count how many times each header appears in the file.
# how many alignments are there for each read?
cut -f 1 myfile.sam | sort | uniq -c
To get the top 10 multi-hit reads, we count the number of alignments for each header, then we perform a numerical, reverse sort on the number of occurrences.
# top 10 reads that have the most reported alignments
cut -f 1 myfile.sam | sort | uniq -c | sort -nr | head
Another way to get the top 10 multi-hit reads is to consider the alignment tag of the form "NH:i:" and use the command grep with a regular expression. Note that the sorting occurs on the tag itself because of the "-o" switch in the grep command, and specifically on the number of hits reported in the tag because of the "-k" switch in the sort command (i.e. we ignore the first 5 characters of the tag, which are common to all tags, and sort on the number of hits itself). Another caveat is the "-n" switch in the sort command: if omitted, the sort is lexicographic and the top multi-hit reads reported would be those whose hit count begins with 9.
# top 10 reads that have the most reported alignments
grep -o "NH:i:[[:digit:]]*" myfile.sam | sort -n -k 1.6 | tail
To determine what chromosomes are mapped in the file, we extract the chromosome label (corresponding to the 3rd field), then apply the sort and uniq commands. To print the chromosomes in numerical order, we use the switch "-n" on the labels, skipping the first three characters ("chr").
# what chromosomes are there?
cut -f 3 myfile.sam | sort | uniq
# what chromosomes are there (in numerical chromosome order)?
cut -f 3 myfile.sam | sort | uniq | sort -k 1.4 -n
To list the chromosomes from the most represented to the least represented, we use the switch "-n" on the number of occurrences for each chromosome. Note that without the "-n" switch, the sorting of the chromosome frequencies would be lexicographic rather than numeric, so chrX would appear before chr19 because, in reverse lexicographic order, the string "99" comes before "87937".
# count the occurrences by chromosome
cut -f 3 myfile.sam | sort | uniq -c
# count the occurrences by chromosome, then sort from most to least abundant
cut -f 3 myfile.sam | sort | uniq -c | sort -nr
Using the sorted list obtained in the step above, we consider the first reported chromosome as the most abundant and the last reported chromosome as the least abundant. Note that we can either choose to sort the list in reverse order or use the commands head/tail appropriately.
# count the occurrences by chromosome
cut -f 3 myfile.sam | sort | uniq -c
# count the occurrences by chromosome, then sort from most to least abundant
cut -f 3 myfile.sam | sort | uniq -c | sort -nr
# which chromosome is the most represented?
cut -f 3 myfile.sam | sort | uniq -c | sort -nr | head -n 1
# which chromosome is the least represented?
cut -f 3 myfile.sam | sort | uniq -c | sort -nr | tail -n 1
# which chromosome is the least represented?
cut -f 3 myfile.sam | sort | uniq -c | sort -n | head -n 1
The 2nd field in a SAM formatted file contains the mapping flag, which provides information on the mapping features of the given alignment. For example, a flag of 4 indicates that the read is unmapped, whereas a flag of 16 indicates the read is mapped on the reverse strand. To find what flags are associated with the alignments in the file, we simply report the distinct flags present in the file. The switch "-c" can be used to report the counts.
# list the mapping flags
cut -f 2 myfile.sam | sort | uniq
# what is the distribution of mapping flags?
cut -f 2 myfile.sam | sort | uniq -c
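If samtools is available, its flags helper can decode a numeric flag value into the properties it encodes (optional; nothing in this tutorial depends on it):
# decode flag 16: read mapped on the reverse strand
samtools flags 16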
To grab only the reads that are mapped on the reverse strand, we use a simple awk statement that compares the mapping flag with the value 16 in the condition. Note that the number of alignments mapped on the reverse strand is not necessarily the same as the number of reads mapping to the reverse strand, because of the presence of multi-hits in the file.
# what reads map on the reverse strand?
awk '$2 == 16' myfile.sam | cut -f 1 | sort | uniq
# how many reads map on the reverse strand?
awk '$2 == 16' myfile.sam | cut -f 1 | sort | uniq | wc -l
# which alignments are mapped on the reverse strand?
awk '$2 == 16' myfile.sam
# how many alignments are mapped on the reverse strand?
awk '$2 == 16' myfile.sam | wc -l
# extract alignments mapped to chr1
grep -w chr1 myfile.sam > myfile_chr1.sam
# extract alignments mapped to chr3
grep -w chr3 myfile.sam > myfile_chr3.sam
Note that the "-w" option is necessary in the first case, otherwise other chromosomes with digit 1, such as 10,11,etc... will be considered. Even though these commands yield to the correct output in the considered example, in general is recommended that you explicitly compare the field of interest with the given pattern (i.e. chr1), rather than using the command grep on the entire record, as by chance another field could contain the pattern.
Are there fixed-length reads? What is their length?
To verify that all reads have the same length, we compute the length of each sequence read and then count the number of distinct lengths obtained: if only one distinct length is present, then all reads have the same length.
# what is the length of the reads?
awk '{print length($10)}' myfile.sam | sort | uniq
The mapping quality is reported by the aligner in the 5th field of the SAM file. The valid range for mapping qualities in a SAM file is 0-255. Note that even though the SAM format specification states that a value of 255 is to be interpreted as an unavailable mapping quality, many aligners assign 255 as the maximum possible alignment quality, indicating high confidence in the given read placement. Naturally, as the number of multiple hits reported for a given read increases, the mapping quality of each corresponding alignment decreases, as you can see from the example below.
# distribution of mapping qualities
cut -f 5 myfile.sam | sort | uniq -c
# multi-hits associated with mapping quality 0
awk '$5 == 0' myfile.sam | grep -o "NH:i:[[:digit:]]*" | sort -n -k 1.6 | uniq
# multi-hits associated with mapping quality 1
awk '$5 == 1' myfile.sam | grep -o "NH:i:[[:digit:]]*" | sort -n -k 1.6 | uniq
# multi-hits associated with mapping quality between 2 and 254
awk '$5 >= 2 && $5 <= 254' myfile.sam | grep -o "NH:i:[[:digit:]]*" | sort -n -k 1.6 | uniq
# multi-hits associated with mapping quality equal to 255
awk '$5 == 255' myfile.sam | grep -o "NH:i:[[:digit:]]*" | sort -n -k 1.6 | uniq
Gapless alignments have a CIGAR string consisting of a number (i.e. the length of the read) followed by the letter M. Recall that the reads have a fixed length of 60 bases in the given file.
# count the alignments that have only matches/mismatches
cut -f 6 myfile.sam | grep -c 60M
# this is equivalent to the command above
awk '$6 == "60M"' myfile.sam | wc -l
How many reported alignments are perfect matches?
The SAM tag "NM:i:" reports the edit distance from the reference sequence. An edit distance equal to 0 represents a perfect match.
# number of alignments that are perfect matches (edit distance = 0)
awk '$6 == "60M" ' myfile.sam | grep -o "NM:i:0" | wc -l
Mon May 11, 2015
2pm-3pm
Duanduan Ma
Building 76
76-156a&b
Introduction to UNIX and the KI/BioMicro computational resources
Mon Jun 15, 2015
1:30pm-3pm
Duanduan Ma
Building 68
68-374
Follow Up Course of Introduction to UNIX and the KI/BioMicro computational resources
Mon Jul 13, 2015
10am-11am
Charlie Whittaker
Building 76
76-156a&b
Introduction to Spotfire
Mon Aug 10, 2015
1:30pm-3pm
Duanduan Ma
Building 56
56-602
Introduction to UNIX and the KI/BioMicro computational resources
Mon Sep 14, 2015
1pm-4pm
Duanduan Ma
Building 68
68-374
Introduction to Python
Mon Oct 19, 2015
1pm-4pm
Duanduan Ma
Building 68
68-674
Introduction to Python
Mon Nov 9, 2015
1:30pm-3:30pm
Jie Wu and Vincent Butty
Building 68
68-181
A Traveller's companion to RNA-Seq interpretation
Mon Dec 14, 2015
2pm-4pm
Duanduan Ma
Building 68
68-474
Introduction to UNIX and the KI/BioMicro computational resources
Thur Jan 7, 2016
1:30pm-4:30pm
Duanduan Ma
Building 32
32-124
Quantitative Biology Workshop Outreach-Introduction to Python
Fri Jan 29, 2016
9:00am-11:00am
Charlie Whittaker and Duanduan Ma
Building 14N (DIRC)
14N-132
Visualizing and Accessing Genomic Data Using Publicly Available Genome Browsers and Databases
Tue Feb 16, 2016
2:00am-3:00am
Sara Strandberg-Qlucore
Building 76
76-156B
Introduction to Qlucore software
Mon Mar 7, 2016
2pm-4pm
Duanduan Ma
Building 56
56-602
Introduction to UNIX and the KI/BioMicro computational resources
Mon April 11, 2016
1pm-4pm
Duanduan Ma
Building 68
68-121
Introduction to R
Mon May 2, 2016
1pm-4pm
Duanduan Ma
Building 68
68-374
Introduction to R
Mon May 9, 2016
1pm-3pm
Qlucore
Building 14N(DIRC)
14N-132
Hands on training of Qlucore software
Mon June 6, 2016
2pm-4pm
Duanduan Ma
Building 68
68-374
Introduction to UNIX and the KI/BioMicro computational resources
Mon Jun 13, 2016
1pm-4pm
Duanduan Ma
Building 76
76-156A/B
Introduction to R
Mon July 11, 2016
1pm-4pm
Duanduan Ma
Building 56
56-602
Introduction to R
Mon Aug 8, 2016
1pm-4pm
Duanduan Ma
Building 68
68-374
Introduction to R
Mon Sep 12, 2016
2pm-4pm
Duanduan Ma
Building 76
76-459
Introduction to UNIX and the KI/BioMicro computational resources
Tue Oct 18, 2016
1pm-3pm
Duanduan Ma
Building 68
68-274
Introduction to UNIX and the KI/BioMicro computational resources
Mon Nov 7, 2016
1:00pm-2:00pm
Jie Wu & Vincent Butty
Building 68
68-181
A Traveler's companion to RNA-Seq interpretation session 1
Mon Nov 14, 2016
1:00pm-2:00pm
Jie Wu & Vincent Butty
Building 68
68-181
A Traveler's companion to RNA-Seq interpretation session 2
Mon Dec 12, 2016
2pm-4pm
Duanduan Ma
Building 76
76-259
Introduction to UNIX and the KI/BioMicro computational resources
Thurs Jan 19, 2017
1pm-3pm
Duanduan Ma
Building 68
68-274
Introduction to UNIX and the KI/BioMicro computational resources
Fri Feb 3, 2017
12:00pm-2:00pm
Charlie Whittaker and Duanduan Ma
Building 14N (DIRC)
14N-132
Visualizing and Accessing Genomic Data Using Publicly Available Genome Browsers and Databases
Mon Mar 13, 2017
2pm-4pm
Duanduan Ma
Building 68
68-374
Introduction to UNIX and the KI/BioMicro computational resources
Mon April 24, 2017
1pm-4pm
Duanduan Ma
Building 68
68-574
Introduction to Python
Mon May 1, 2017
1pm-4pm
Duanduan Ma
Building 68
68-374
Introduction to Python
Mon June 5, 2017
1pm-4pm
Duanduan Ma
Building 68
68-121
Introduction to Python
Mon July 10, 2017
1pm-4pm
Duanduan Ma
Building 76
76-559
Introduction to Python
Mon Aug 28, 2017
2pm-4pm
Duanduan Ma
Building 68
68-374
Introduction to UNIX and the KI/BioMicro computational resources
Mon Sep 25, 2017
2:00pm-3:00pm
Vincent Butty
Building 68
68-181
RNA-Seq Experimental design + QC ; Slides
Mon Oct 2, 2017
1:00pm-2:00pm
Vincent Butty
Building 68
68-181
RNA-Seq analysis ; Slides
Mon Nov 13, 2017
2pm-4pm
Duanduan Ma
Building 76
76-259
Introduction to UNIX and the KI/BioMicro computational resources
Mon Dec 11, 2017
1pm-4pm
Duanduan Ma
Building 56
56-602
Introduction to R
Wed Jan 31, 2018
10am-12pm
Charlie Whittaker
Building 14N (DIRC)
14N-132
Gene Set Enrichment Analysis
Mon Feb 12, 2018
1pm-4pm
Duanduan Ma
Building 76
76-459
Introduction to R
Mon Mar 12, 2018
2pm-4pm
Duanduan Ma
Building 56
56-602
Introduction to UNIX and the KI/BioMicro computational resources
Mon Apr 23, 2018
1:30pm-2:30pm
Duanduan Ma
Building 68
68-181
Introduction to Biopython
Mon May 14, 2018
2:30pm-3:30pm
Duanduan Ma
Building 76
76-259
Introduction to Biopython
Mon June 11, 2018
2:00pm-4:00pm
Duanduan Ma
Building 56
56-602
Introduction to UNIX and the KI/BioMicro computational resources
Mon July 16, 2018
2:30pm-4:00pm
Duanduan Ma
Building 68
68-121
Introduction of Python to biologists session 1
Mon August 13, 2018
2:30pm-4:00pm
Duanduan Ma
Building 76
76-559
Introduction of Python to biologists session 2
Mon September 10, 2018
2:00pm-4:00pm
Duanduan Ma
Building 56
56-602
Introduction to UNIX and the KI/BioMicro computational resources
Tue October 23, 2018
1:30pm-2:30pm
Vincent Butty
Building 68
68-181
A Traveler's companion to RNA-Seq interpretation session 1 - Slides
Mon November 5, 2018
1pm-2pm
Vincent Butty
Building 68
68-181
A Traveler's companion to RNA-Seq interpretation session 2 - Slides
Mon December 10, 2018
2:00pm-4:00pm
Duanduan Ma
Building 76
76-259
Introduction to UNIX and the KI/BioMicro computational resources
Thursday Jan 31, 2019
10am-12pm
Charlie Whittaker
Building 14N (DIRC)
14N-132
Gene Set Enrichment Analysis
Mon February 11, 2019
2:30pm-4:00pm
Duanduan Ma
Building 68
68-674
Introduction of Python to biologists session 1
Mon March 11, 2019
2:30pm-4:30pm
Duanduan Ma
Building 76
76-259
Introduction of Python to biologists session 2
Mon April 8, 2019
2:00pm-4:00pm
Duanduan Ma
Building 68
68-180
Introduction to UNIX and the KI/BioMicro computational resources
Mon May 13, 2019
2pm-4pm
Duanduan Ma
Building 68
68-121
Introduction to R session 1
Mon June 10, 2019
2pm-4pm
Duanduan Ma
Building 76
76-659
Introduction to R session 2
Mon July 8, 2019
2:00pm-4:00pm
Duanduan Ma
Building 68
68-574
Introduction to UNIX and the KI/BioMicro computational resources
Tue August 27, 2019
2:30pm-3:30pm
Jingzhi Zhu & Duanduan Ma
Building 76
76-659
How to install software packages yourself on computing clusters
Wed September 25, 2019
1:00pm-2:00pm
Vincent Butty
Building 68
68-181
A Traveler's companion to RNA-Seq interpretation session 1 - Slides
Wed October 2, 2019
1:00pm-2:00pm
Vincent Butty
Building 68
68-181
A Traveler's companion to RNA-Seq interpretation session 2 - Slides
Mon November 18, 2019
2:00pm-4:00pm
Duanduan Ma
Building 76
76-659
Introduction to UNIX and the KI/BioMicro computational resources
Tues December 10, 2019
1:00pm-2:00pm
BMC/IGB core (Vincent Butty)
Building 68
68-181
Single Cell RNA-Seq Slides:
Mon January 31, 2020
2:00pm-3:30pm
Duanduan Ma
Building 76
76-659
Introduction of machine learning and its applications in biology
Mon February 10, 2020
2:00pm-4:30pm
Duanduan Ma
Building 76
76-659
Introduction to UNIX and the KI/BioMicro computational resources
Mon March 9, 2020
2:30pm-4:00pm
Duanduan Ma
Building 76
76-659
Introduction of Python to biologists session 1
Mon April 17, 2020
2:00pm-4:30pm
Duanduan Ma
Zoom
Zoom link
Introduction to UNIX and the KI/BioMicro computational resources
Fri May 8, 2020
1:00pm-3:30pm
Duanduan Ma
Zoom
Zoom link
Introduction to R session 1
Wed May 13, 2020
1:00pm-3:30pm
Duanduan Ma
Zoom
Zoom link
Introduction to R session 2
Fri May 27, 2020
2:00pm-4:30pm
Duanduan Ma
Zoom
Zoom link
Introduction of Python to biologists session 2
Fri June 5, 2020
2:00pm-4:00pm
Duanduan Ma
Zoom
Zoom link
Introduction of Python to biologists session 1
Mon June 8, 2020
2:00pm-4:00pm
Duanduan Ma
Zoom
Zoom link
Introduction of Python to biologists session 2
Mon June 15, 2020
2:00pm-4:00pm
Duanduan Ma
Zoom
Zoom link
Introduction to Biopython
Mon July 20, 2020
2:00pm-4:00pm
Duanduan Ma
Zoom
Zoom link
Introduction to UNIX and the KI/BioMicro computational resources
Mon September 14, 2020
2:00pm-4:30pm
Duanduan Ma
Zoom
Zoom link
Introduction of machine learning and its applications in biology session I
Mon September 21, 2020
2:00pm-4:30pm
Duanduan Ma and Huiming Ding
Zoom
Zoom link
Introduction of machine learning and its applications in biology session II
Mon October 19, 2020
2:00pm-4:30pm
Duanduan Ma
Zoom
Zoom link
Introduction to UNIX and the KI/BioMicro computational resources
Mon October 26, 2020
2:00pm-4:30pm
Jingzhi Zhu
Zoom
Zoom link
How to install software packages yourself on computing clusters
Mon November 9, 2020
2:00pm-4:30pm
Dikshant Pradhan
Zoom
Zoom link
Public Data Repositories
Wed December 9, 2020
2:00pm-4:30pm
Duanduan Ma
Zoom
Zoom link
Introduction to R session 1
Wed December 16, 2020
2:00pm-4:30pm
Duanduan Ma
Zoom
Zoom link
Introduction to R session 2
Tuesday February 16, 2021
2pm-4:30pm
Duanduan Ma
zoom
zoom link
Introduction to UNIX and the KI/BioMicro computational resources session I
Monday February 22, 2021
2pm-4:30pm
Jingzhi Zhu, Duanduan Ma
zoom
zoom link
Introduction to UNIX and the KI/BioMicro computational resources session II
Monday March 15, 2021
2pm-4:30pm
Duanduan Ma
zoom
zoom link
Introduction of Python to biologists session I
Monday March 22, 2021
2pm-4:30pm
Duanduan Ma
zoom
zoom link
Introduction of Python to biologists session II
Monday March 29, 2021
2pm-4:30pm
Duanduan Ma
zoom
zoom link
Introduction to Biopython
Monday April 12, 2021
2pm-3pm
Vincent Butty
zoom
zoom link
A Traveler's companion to RNA-Seq interpretation session I
Monday April 26, 2021
2pm-3pm
Vincent Butty
zoom
zoom link
A Traveler's companion to RNA-Seq interpretation session II
Monday May 3, 2021
2pm-3pm
Vincent Butty
zoom
zoom link
Single Cell RNA-seq
Monday May 10, 2021
2pm-4:30pm
Duanduan Ma
zoom
zoom link
Introduction to UNIX and the KI/BioMicro computational resources session I
Monday May 17, 2021
2pm-4:30pm
Jingzhi Zhu, Duanduan Ma
zoom
zoom link
Introduction to UNIX and the KI/BioMicro computational resources session II
Monday June 7, 2021
2:00pm-4:30pm
Duanduan Ma
Zoom
Zoom link
Introduction to R session 1
Monday June 14, 2021
2:00pm-4:30pm
Duanduan Ma
Zoom
Zoom link
Introduction to R session 2
Monday September 13, 2021
2:00pm-4:30pm
Duanduan Ma
Zoom
Zoom link
Introduction to UNIX and the KI/BioMicro computational resources
Monday September 20, 2021
2:00pm-4:30pm
Jingzhi Zhu
Zoom
Zoom link
Effective utilization of IGB computational resources - SLURM & CONDA
Monday October 11, 2021
2:00pm-4:30pm
Duanduan Ma
Zoom
Zoom link
Introduction to Python session I
Monday October 18, 2021
2:00pm-4:30pm
Duanduan Ma
Zoom
Zoom link
Introduction to Python session II
Monday October 25, 2021
2:00pm-4:30pm
Duanduan Ma
Zoom
Zoom link
Introduction to Biopython
Monday November 1, 2021
2:00pm-4:30pm
Duanduan Ma
Zoom
Zoom link
Data processing and visualization using Python: Introduction to Pandas
Monday November 8, 2021
2:00pm-4:30pm
Duanduan Ma
Zoom
Zoom link
Data processing and visualization using Python: Introduction to Matplotlib
Monday November 15, 2021
2:00pm-4:30pm
Duanduan Ma
Zoom
Zoom link
Data processing and visualization using Python: Introduction to Seaborn
Monday December 13, 2021
2:00pm-4:30pm
Duanduan Ma
Zoom
Zoom link
Introduction to Linux and the IGB computational resources
Tuesday February 8, 2022
2:00pm-4:30pm
Duanduan Ma and Jingzhi Zhu
Zoom
Zoom link
Introduction to Linux and the KI/BioMicro computational resources
Thursday March 3rd, 2022
2:00pm-3:00pm
Dikshant Pradhan
Zoom
Zoom link
Public Data Repositories
Monday March 14th, 2022
1:00pm-3:30pm
Vincent Butty
Zoom
Zoom link
RNA-Seq Design and Analysis Session I
Tuesday March 15th, 2022
1:00pm-3:30pm
Vincent Butty
Zoom
Zoom link
RNA-Seq Design and Analysis Session II
Thursday March 17th, 2022
1:00pm-3:30pm
Vincent Butty
Zoom
Zoom link
RNA-Seq Design and Analysis Session III
Monday March 28, 2022
1:00pm-3:30pm
Duanduan Ma and Jingzhi Zhu
Zoom
Zoom link
Introduction to Linux and the KI/BioMicro computational resources
Monday April 11, 2022
1:00pm-3:30pm
Duanduan Ma
Zoom
Zoom link
Introduction to R session 1
Wednesday April 13, 2022
1:00pm-3:30pm
Duanduan Ma
Zoom
Zoom link
Introduction to R session 2
Monday April 25th, 2022
1:00pm-3:30pm
Charlie Whittaker Dikshant Pradhan
Zoom
Zoom link
RNAseq HandsOn session 1
Wednesday April 27th, 2022
1:00pm-3:30pm
Charlie Whittaker Dikshant Pradhan
Zoom
Zoom link
RNAseq HandsOn session 2
Friday April 27th, 2022
1:00pm-3:30pm
Charlie Whittaker Dikshant Pradhan
Zoom
Zoom link
RNAseq HandsOn session 3
Monday September 19th, 2022
2:00pm-4:30pm
Duanduan Ma Jingzhi Zhu
Zoom
Zoom link
Introduction to Linux and Cluster Computing
Monday October 3rd, 2022
2:00pm-4:30pm
Duanduan Ma
Zoom
Zoom link
Introduction of Python to biologists session I
Tuesday October 4th, 2022
2:00pm-4:30pm
Duanduan Ma
Zoom
Zoom link
Introduction of Python to biologists session II
Friday October 7th, 2022
2:00pm-4:30pm
Duanduan Ma
Zoom
Zoom link
Introduction to Biopython
Monday October 31st, 2022
2:00pm-4:30pm
Duanduan Ma
Zoom
Zoom link
Data processing and visualization using Python: Introduction to Matplotlib
Wednesday November 2nd, 2022
2:00pm-4:30pm
Duanduan Ma
Zoom
Zoom link
Data processing and visualization using Python: Introduction to Pandas
Friday November 4th, 2022
2:00pm-4:30pm
Duanduan Ma
Zoom
Zoom link
Data processing and visualization using Python: Introduction to Seaborn
Monday November 14th, 2022
2:00pm-4:30pm
Duanduan Ma
Zoom
Zoom link
Introduction of machine learning and its applications in biology. session I
Wednesday November 16th, 2022
2:00pm-4:30pm
Duanduan Ma
Zoom
Zoom link
Introduction of machine learning and its applications in biology. session II
Monday December 5th, 2022
2:00pm-4:30pm
Duanduan Ma
Zoom
Zoom link
Introduction to Linux and Cluster Computing
Monday February 6th, 2023
2:00pm-4:30pm
Duanduan Ma
Zoom
Zoom link
Introduction to Linux and Cluster Computing
Tuesday February 21, 2023
2:00pm-4:30pm
Duanduan Ma
Zoom
Zoom link
Introduction to R session 1
Wednesday February 22, 2023
2:00pm-4:30pm
Duanduan Ma
Zoom
Zoom link
Introduction to R session 2
Mon March 13, 2023
2:00pm-3:30pm
Stuart Levine
Zoom
Zoom link
Advanced Excel/Google Sheets for Biologists & Administrators
Monday April 3rd, 2023
2:00pm-4:30pm
Duanduan Ma
Zoom
Zoom link
Introduction of Python to biologists session I
Wednesday April 5th, 2023
2:00pm-4:30pm
Duanduan Ma
Zoom
Zoom link
Introduction of Python to biologists session II
Monday April 10th, 2023
2:00pm-4:30pm
Duanduan Ma
Zoom
Zoom link
Version control with Git session I
Wednesday April 12th, 2023
2:00pm-4:30pm
Duanduan Ma
Zoom
Zoom link
Version control with Git session II
Monday April 24th, 2023
1:00pm-3:30pm
Vincent Butty
Zoom
Zoom link
Bulk RNA-Seq Design and Analysis Session I
Wednesday April 26th, 2023
1:00pm-3:30pm
Vincent Butty
Zoom
Zoom link
Bulk RNA-Seq Design and Analysis Session II
Thursday April 27th, 2023
1:00pm-3:30pm
Vincent Butty
Zoom
Zoom link
Single cell RNA-Seq Design and Analysis Session I
Friday April 28th, 2023
1:00pm-3:30pm
Vincent Butty
Zoom
Zoom link
Single cell RNA-Seq Design and Analysis Session II
Monday May 1st, 2023
2:00pm-4:30pm
Duanduan Ma
Zoom
Zoom link
Introduction to Linux and Cluster Computing
Monday May 15th, 2023
1:00pm-3:00pm
Charlie Whittaker
Zoom
Zoom link
RNAseq HandsOn session 1
Wednesday May 17th, 2023
1:00pm-3:00pm
Charlie Whittaker
Zoom
Zoom link
RNAseq HandsOn session 2
Thursday May 18th, 2023
1:00pm-3:00pm
Charlie Whittaker
Zoom
Zoom link
RNAseq HandsOn session 3
Monday September 18th, 2023
2:00pm-4:30pm
Duanduan Ma
Building 56
56-602
Introduction to Linux and Cluster Computing
Tuesday September 26, 2023
2:00pm-4:30pm
Duanduan Ma
Building 56
56-602
Introduction of Python to biologists session I
Wednesday September 27, 2023
2:00pm-4:30pm
Duanduan Ma
Building 76
76-659
Introduction of Python to biologists session II
Thursday September 28, 2023
2:00pm-4:30pm
Duanduan Ma
Building 76
76-559
Introduction to Biopython
Monday October 16, 2023
2:00pm-4:30pm
Allen Soberanes & Duanduan Ma
Building 76
76-259
Introduction to Linux and Cluster Computing
Monday October 23rd, 2023
2:00pm-4:30pm
Duanduan Ma
Building 76
76-259
Data processing and visualization using Python: Introduction to Pandas
Tuesday October 24th, 2023
2:00pm-4:30pm
Duanduan Ma
Building 76
76-659
Data processing and visualization using Python: Introduction to Matplotlib
Wednesday October 25th, 2023
2:00pm-4:30pm
Duanduan Ma
Building 76
76-359
Data processing and visualization using Python: Introduction to Seaborn
Monday November 13th, 2023
2:00pm-4:30pm
Duanduan Ma
Building 76
76-259
Introduction of machine learning and its applications in biology. session I
Tuesday November 14th, 2023
2:00pm-4:30pm
Duanduan Ma
Building 76
76-659
Introduction of machine learning and its applications in biology. session II
Monday November 27th, 2023
2:00pm-4:30pm
Allen Soberanes & Duanduan Ma
Building 76
76-259
Introduction to Unix and cluster computing
Monday December 12th, 2023
2:00pm-4:30pm
Charlie Demurjian
TBD
TBD
Data Deposition in Public Repositories
Monday February 5, 2024
2:00pm-4:30pm
Allen Soberanes
Building 68
68-352
Introduction to Linux and Cluster Computing
Tuesday February 20, 2024
2:00pm-4:30pm
Duanduan Ma
Building 68
68-156
Introduction to R
Monday March 4, 2024
1:00pm-3:00pm
Vincent Butty
Building 68
68-181
Single-Cell RNA-Seq Session 1
Tuesday March 5, 2024
1:00pm-3:00pm
Vincent Butty
Building 68
68-180
Single-Cell RNA-Seq Session 2
Wednesday March 6, 2024
1:00pm-3:00pm
Vincent Butty
Building 68
68-181
Bulk RNA-Seq Session 1
Thursday March 7, 2024
1:00pm-3:00pm
Vincent Butty
Building 68
68-180
Bulk RNA-Seq Session 2
Monday March 18, 2024
2:00pm-4:30pm
Allen Soberanes
Building 68
68-474
Introduction to Linux and Cluster Computing
Monday April 1, 2024
2:00pm-4:30pm
Duanduan Ma
Building 68
68-156
Introduction to Python
Monday April 8, 2024
2:00pm-4:30pm
Duanduan Ma
Building 68
68-374
Biopython
Monday April 15, 2024
2:00pm-4:30pm
Duanduan Ma
Building 68
68-156
Machine Learning with Python
Monday April 22, 2024
2:00pm-4:30pm
Duanduan Ma
TBD
TBD
Data Processing with Python
Monday April 29, 2024
2:00pm-4:30pm
Allen Soberanes
Building 68
68-156
Introduction to Linux and Cluster Computing
Tuesday May 21, 2024
TBD
Charlie Whittaker
TBD
TBD
Shared workflows and containerized computing for Bulk RNA-Seq Analysis Session 1
Wednesday May 22, 2024
TBD
Charlie Whittaker
TBD
TBD
Shared workflows and containerized computing for Bulk RNA-Seq Analysis Session 2
Thursday May 23, 2024
TBD
Charlie Whittaker
TBD
TBD
Shared workflows and containerized computing for Bulk RNA-Seq Analysis Session 3