The Integrated Genomics and Bioinformatics Core comprises the Genomics Facility and the Barbara K. Ostrom Bioinformatics Facility. Together, these shared institutional resources provide Koch Institute investigators with support throughout all phases of modern genomics experimentation.
The Barbara K. Ostrom (1978) Bioinformatics Facility is a shared institutional resource providing KI investigators with support, assistance and training in a wide range of bioinformatics topics. The core also maintains a high-performance computing resource offering many bioinformatics applications, substantial processing power and secure storage. The team also provides support for desktop computing and poster printing.
Access is available to all members of the MIT community, to the extent permitted by available capacity. In recognition of funding support, priority access is given to members of the Core's sponsoring entities (Koch Institute, MIT Biology, MIT Biological Engineering and the MIT Center for Environmental Health Sciences) and to NCI-funded research projects. In special circumstances, access may be available to non-MIT users (details are available on request from the Facility Scientific Director, Charlie Whittaker).
The Bioinformatics & Computing Core is supported in part by funding provided to the Koch Institute from a National Cancer Institute Cancer Center Support Grant.
Vincent Butty - Bioinformatics Specialist
Huiming Ding - Bioinformatics Specialist
Duanduan Ma - Bioinformatics Specialist
Yann Vanrobaeys - Bioinformatics Specialist
Stuart Levine - BMC Director and Bioinformatics Specialist
Charlie Whittaker - BCC Leader and Bioinformatics Specialist
Charlie Demurjian - Senior Bioinformatics Data Specialist
Taïsha Joseph - Bioinformatics Data Specialist
Allen Soberanes - IT & Systems Administrator
AJ Bhutkar - Sr Bioinformatics Specialist
Stephen Goldman - Computing Support Specialist
Dikshant Pradhan - Research Specialist
Jie Wu - Bioinformatics Specialist
Sebastian Hoersch - Bioinformatics Specialist
Paola Favaretto - Bioinformatics Specialist
Michael Moran - Computing Support Specialist
Jingzhi Zhu - Research Computing Specialist
https://ki-sbc.mit.edu/bioinformatics/contact
TSM archive
Free to archive
Folders will be archived to TSM and then deleted from bmc-lab.
Access to archived data and restores from archive will require assistance from systems administrators and this help will be billed at $90 per hour.
We can also help to transfer the data from bmc-lab to your own hardware.
The purpose of this material is to help the KI-BCC members teach KI researchers the skills they need to effectively use bioinformatics in their projects. We intend the instruction to be practical and modular so that people can quickly learn easily applicable methods. We also hope that the teaching materials will serve as effective and readily accessible reference material.
It is not our intent to create a link-dump that regurgitates internet search results.
The overall organization is difficult to perfect. However, getting it right at the start is not important because the underlying information consists of small modules that can be organized and re-packaged in any way that works.
The material has been organized with general headers followed by more specific sections and subsections. The general headers are basically Theory, Tools, Tasks.
The sections under the headers should highlight general topics and the sub-sections beneath them are the actual modules of instruction. It is perfectly acceptable for the same module of instruction to be placed in different parts of the structure.
There is a lot of room for improvement with this organization and suggestions or comments are welcome (send them to [email protected]).
Open the UCSC Genome Browser home page.
Click the "Genomes" link and select group "Mammal", genome "Human" and assembly "Dec. 2013 (GRCh38/hg38)
Enter the position chr4:133,143,371-133,155,540 and click submit.
The eye icon allows you to view the result, the pencil is for editing attributes and the X allows deletion.
Select the FASTA manipulation toolbox on the left. Then select the "Compute sequence length" function. When finished, view the results.
The Penn State team has implemented a powerful set of genomics tools that interfaces nicely with the UCSC Genome informatics tools and also extends UCSC functionality.
The Galaxy team provides a wide range of excellent training viewable from their home page.
They also maintain a highly informative .
A brief summary of local and campus-wide computing resources.
The MIT Office of Research Computing and Data (ORCD) has a number of resources that are available to researchers campus wide, a list of which can be found here.
If you have storage or compute resources spread across multiple computing systems, please reach out to [email protected] in order to coordinate support.
BLAT stands for "BLAST-Like Alignment Tool". It is very fast and is designed to find nearly identical matches.
A description of BLAT is located at the bottom of this page.
The query is displayed with identities in blue and splice junctions in light blue.
The genomic sequence is colored the same way.
The sequences are shown in pairwise alignment.
Return to the BLAT results page and select the "browser" hyperlink. Note the new "Blat sequence" track being displayed in the "Mapping and Sequencing" group, this track has the same visualization controls as other similar tracks.
Click the "hide all" button, use the pull-down menus to set ENCODEv24 and Blat sequence to "full" and refresh the view. Zoom out so the entire footprint of the PCDH10 is visible. The coordinates are: chr4:133,147,918-133,213,548
A wide range of data formats can be converted to custom tracks that can then be visualized and compared to other genomic data.
Custom tracks are managed through the "My Data --> Custom Tracks" window. Select this page.
Available formats for custom tracks are described here.
Paste the following block of text into the Paste URLs or Data window and click the Submit button. Then select the "view in Genome Browser" option on the next page.
track name="ClassExample" description="Edited" visibility=0 url=
chr4 133149358 133149958 Example1 10 +
chr4 133149754 133149980 Example2 100 +
chr4 133150808 133152008 Example3 500 +
chr4 133152429 133152829 Example4 1000 +
Note the presence of the group "Custom Tracks" with the track "ClassExample". Toggle the display mode to full and click Refresh. Zoom out so that all 4 Examples are visible.
Select "Tools-->Table Browser". The region will usually correspond to the part of the genome visualized in the browser. Select the “summary/statistics” function to display some statistics about the information in the custom track.
Cross-species Alignments and Conservation
Query the March 2006 human genome browser gateway with the gene symbol FOXP2. View the gene on chromosome 7. Adjust the tracks displayed so that base position, chromosome band and refseq are set to full.
Switch the Zebrafish, Dog and Mouse Net Alignments to Full. What portions of the gene are most highly conserved across all of these species? Compare the dog and mouse alignments. Based on what you see which species is more closely related to human? Why? Which species diverged from human more recently?
chr7:113,838,514-113,842,471
Examine the pan-vertebrate conservation upstream from the transcription start site (TSS). How far from the TSS is the conserved block? Can you think of a reason why it is conserved? Can you find any tracks that could support this idea?
Examine the details of the 7X Reg Potential track. Reconfigure the view so that it includes an indicator line at 0.05.
Activate only the ColonMet_Faked experiment and select appropriate thresholds.
From the Start Page, select the One-click Analysis button and pick the Pathway Maps tool
Analysis will consider only the Pathway maps
It can be run on multiple experiments as well.
Activate only the ColonMet_Faked experiment and select appropriate thresholds.
From the Start Page, select the One-click Analysis button and pick the Map Folders tool
Analysis will consider functionally related Pathway Map groups resulting in alteration of scoring and statistics.
It can be run on multiple experiments as well.
Clicking on a result folder leads to a results page displaying data for the composite Pathway Maps
Activate only the ColonMet_Faked experiment and select appropriate thresholds.
From the Start Page, select the One-click Analysis button and pick the Process Networks tool
Analysis will consider enrichment in various annotated process networks.
It can be run on multiple experiments as well.
Clicking on a result leads to a page displaying data for the indicated Process Network.
Run a Compare Experiments workflow for ColonMet_Faked_FC and ColonMet_Faked_FCrand1_edit.
Identify and select the common network objects in this comparison (10 Network objects) by clicking the blue/white histogram.
Click Analyze networks to produce a list of networks associated with the list.
Return to the Network options page and click Add network objects and add ITGA5.
Re-run Analyze Networks, using default settings. NOTE: Adding ITGA5 produces an additional network.
There are 11 different network-building algorithms available.
Clicking on a network name opens the highly interactive network viewer. Explore the extensive control over the visualization.
UNIX is an operating system (a suite of programs) originally developed in 1969 at Bell Labs, and it has been under development ever since. UNIX is a very popular operating system for many reasons, including:
multi-user (multiple users can log in to a computer at the same time and concurrently use the resources of that computer).
multi-tasking (each user can perform many tasks at the same time).
network-ready (built-in TCP/IP networking makes it easy to communicate between computers).
very powerful programming environments (free of the many limits imposed by other operating systems).
robust and stable.
scalable, portable, flexible.
open source.
UNIX systems also have a graphical user interface (GUI), similar to Microsoft Windows, which provides an easy-to-use environment. However, knowledge of UNIX is required for operations that aren't covered by a graphical program, or for when no graphical interface is available (for example, in a telnet session).
UNIX consists of three main components: kernel, shell, and programs.
KERNEL: hub of the operating system.
SHELL: command line interpreter.
PROGRAMS: collections of operations.
Each component has specific roles:
KERNEL: it allocates time and memory to programs and handles the filestore and communications in response to system calls.
SHELL: it interprets the commands the user types in and arranges for them to be carried out.
PROGRAM: execute a set of predefined operations.
As an illustration of the way the shell and the kernel work together, suppose a user types the following command:
rm myfile
Internally,
The shell searches the PATH environment variable for the file containing the program rm.
The shell asks the kernel, through system calls, to execute the program rm on myfile.
When the process has finished running, the shell returns the UNIX prompt to the user, indicating that it is waiting for further commands.
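You can see the first part of this lookup yourself. A minimal illustration using standard commands (the paths printed will differ from system to system):
# show the directories the shell searches for programs
echo $PATH
# show which rm the shell would run
which rm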
cd
This command name stands for "change directory".
It changes your current working directory to the specified location.
The last visited directory is referred to with a hyphen ("-").
# go to root directory of the system and print the working directory
cd /
pwd
# go to the home directory and print the working directory
cd ~
pwd
# change directory using the absolute path and print the working directory
cd /net/bmc-pub14/data/
pwd
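Since the last visited directory is referred to with a hyphen, you can bounce between two locations. A short illustration (the /tmp directory exists on essentially all UNIX systems):
# go to /tmp, then jump back to the previous directory
cd /tmp
cd -
pwd
# move up one level to the parent directory
cd ..
pwd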
A competitor of Ingenuity, it offers manually curated pathways and networks, tools to analyze those groups in the context of various experiments, and powerful network-building methods.
Owned by Thomson Reuters
Major Applications:
Access to manually curated databases
Evaluation of pathways, networks, annotations and associations as they relate to gene lists. These gene lists could be differentially expressed genes or any other list of genes.
Construction of networks based on gene lists using a database of interactions.
GeneGO ontologies
Pathway Maps
Process Networks
GO Processes
Disease (by Biomarkers)
Metabolic Networks
High quality documentation is available by following Help -> Help
Please visit the Data Management Analysis Core's documentation site for more information:
Use the Get Data --> Upload File tool to load the data located at the following paths:
Array Data contains 100 lines of Affymetrix microarray data. Gene Data contains 70 lines of annotation information corresponding to a subset of the array data.
Examine the information in each file using the eye icons on the right side of the page. Note that the first column in both files is "ProbeID". That common column will be the basis for binding the 2 files together. In database terms, ProbeID is the "key" used in the join.
Examine the results. Column c1 (ProbeID) from the array data file was used to examine column c1 of the gene data file. Whenever a match was found, the information in the gene data file was appended to the array data file.
Open the UCSC Genome Browser and navigate to the "Dec. 2013 (GRCh38/hg38)" human assembly.
Enter the position chr4:133,148,410-133,195,705 and click submit.
Click the View-->DNA link to enter the "Get DNA in Window" page
Sequence range and formatting adjustments are available; accepting the defaults and clicking "Get DNA" will retrieve the genomic sequence that was present in the browser view.
Go back to the "Get DNA in Window" page
Click the “extended case/color options” button and configure the display so that known gene exons are in bold, red and upper case letters.
Set default case set to "Lower" with the radio button.
Check the boxes for "Toggle Case" and "Bold" and enter 255 in the "Red" box on the GENCODE v24 line.
Click submit and view the newly formatted results.
Intersection queries are one of the most useful functions implemented by UCSC. They allow users to characterize the positional relationships of 2 different overlapping tracks.
The demonstration of this tool will build on the observations made during examination of the 5' end of the PCDH10 gene in the previous exercise. The region upstream of the 5' end of the gene overlapped with a CpG island. CpG islands correlate with promoters because they tend to be unmethylated. In non-promoter sequence, the C of a CG dinucleotide is usually methylated and thus subject to spontaneous deamination. See the details page of the track for more information. Note: all UCSC tracks are well documented and together contain extensive information about a wide range of genomics methods and data types.
summary/statistics and the browser can be used to confirm the custom track contains the desired data.
This processing creates one interval for each GENCODE transcript, and many intervals will be similar because many start positions are close to one another. This redundancy can be collapsed by sending the results to Galaxy and using the bedtools functions "Sort BED files" followed by "Merge BED files". There are fine controls on this processing and it will not be demonstrated here, but the merged output is available.
Create a custom track using the URL to that merged data and count the results.
Compare the contents of the custom tracks in the browser. The region is chr2:109,000,000-119,000,000
Use an intersection query to identify regions in the merged upstream sequences that overlap with a CpG island.
Open Tools --> Table Browser and set the "group" and "track" options so that they specify the merged bed custom track.
Select the "create" button next to "intersection" button and select intersection with CpG islands according to the following image:
Note the ability to control the degree of intersection (some, none, partial).
After intersection is in place, use summary/statistics to count the results and view some in the genome browser.
The results of the intersection query can be used in the same ways as any other kind of data.
A link to example data is below. The files must be in .xls format; open both and examine the data.
Key structure is:
NOTE: Raw gene list of genes of interest works fine as well.
From the Start Page, there are 3 ways to access the data loading tool:
File-->Upload Data...
Upload Data icon
Upload Page-->Upload Experiments with Gene or Protein IDs
NOTES: Tools emphasize the Activate/Deactivate/Reload requirements.
Choose the file and click Next
Adjust data types as needed. Rename columns and experiments as necessary. Make sure that fold changes and supporting p-value columns are adjacent to each other. Click Next
Specify the species and click Next.
The Background processing status page will indicate when upload is complete. Once processed, the line can be deleted from the table.
Return to the Background processing page from the start page by File-->Background processing item.
How it works
The Unix command vi starts up the visual editor.
Typing vi followed by a file name will automatically open the file.
After issuing the command, the appearance of your screen changes. Rather than seeing the shell prompt, the content of the file filename.txt appears on the screen. If filename.txt hadn't existed before you invoked the vi command, the screen will appear mostly blank.
Modes
One of the fundamental concepts to keep in mind when using vi is that three modes exist: command, insert, and visual.
In command mode, everything you type on the keyboard gets interpreted as a command.
In insert mode, most everything you type on the keyboard gets interpreted as characters and will appear in your file--letters, numbers, punctuation, line returns, etc.
When you are in insert mode, you can switch to command mode by pressing the Esc key on your keyboard (this is often at the left and upper portion of your keyboard).
When you are in command mode, there are many keys you can use to get into insert mode, each one gives you a slightly different way of starting to type your text. For example, if you are in command mode, you can simply type i on your keyboard to enter insert mode (i stands for insert). Then, the characters you type go into the file.
Basic commands
h = move one character to the left
l = move one character to the right
k = move up one line
j = move down one line
[ctrl] b = move back one screen
[ctrl] f = move forward one screen
quit Vi without saving anything (you'll lose any changes you made when using this command) type: :q!
save/write the file you're working on without exiting type: :w followed by a filename
save/write your file and quit the vi editor in one step by typing: :wq
vi filename.txt
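As a minimal worked example (notes.txt is just a placeholder filename), a complete edit session looks like this; only the first line is a shell command, and the comments describe the keystrokes typed inside vi:
vi notes.txt        # open (or create) notes.txt; vi starts in command mode
i                   # press i to switch to insert mode
Hello from vi       # type some text; it is inserted into the file
<Esc>               # press Esc to return to command mode
:wq                 # write the file and quit vi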
Identifier FoldChange pvalue
/net/ostrom/data/dropbox/arrayDat.txt
/net/ostrom/data/dropbox/arrayAnnot.txt
The intersection query reports the results in the first set that have the intersection you specified with the second set you selected. To view the converse, select known genes first, then edit the intersection to seek any overlap with the custom track.
How many results are produced when the tracks are inverted?
Most processes initiated by UNIX commands take their input from the standard input (i.e. the keyboard) and write their output to the standard output (i.e. the terminal screen).
The "<" sign can be used to redirect the input, that is to specify that the input comes from something other than the keyboard.
The ">" sign can be used to redirect the output, that is to specify that the output goes to something other than the terminal screen.
The ">>" sign can be used to append the output to something other than the terminal screen.
# list the current files and redirect the output to a file named "mylist.txt"
ls > mylist.txt
# view content of mylist.txt
cat mylist.txt
# redirect the input to a command
cat < mylist.txt
# redirect the output and append
cat mylist.txt > list1.txt
cat mylist.txt >> list2.txt
# view content
cat list2.txt
A pipe is denoted by "|".
Several pipes can be used in the same command line to create a "pipeline".
A pipe takes the output of a command and immediately sends it as input to another command.
A pipe is often used in conjunction with the command "less" to view the output within the pager.
#view users connected
who
#count the number of users connected
who | wc -l
#display the content of bin
ls -la /usr/local/bin
#display the content of bin within the pager provided by "less"
ls -la /usr/local/bin | less
Unix comes with preloaded documentation, also known as "man pages".
Each page is a self-contained document consisting of the following sections:
NAME: the name and a brief description of the command.
SYNOPSIS: how to use the command, including a listing of the various options and arguments you can use with the command. Square brackets ([ ]) are often used to indicate optional arguments. Any arguments or options that are not in square brackets are required.
DESCRIPTION: a more detailed description of the command including descriptions of each option.
SEE ALSO: References to other man pages that may be helpful in understanding how to use the command in question.
When you are not sure of the exact name of a command, you can use the apropos command to see all the commands with the given keyword on their man page.
apropos keyword
For example, to see which commands are relevant to the task of copying, type the following:
apropos copy
To view a manual page, use the command man.
man command_name
When viewing a man page, use the following keys to navigate the pages:
spacebar- view the next screen;
b (for "back") - view the previous screen;
arrow keys - navigate within the pages;
q (for "quit") - quit and return to the shell prompt.
For example, to view the man page associated with the command cp
, type the following:
man cp
The UNIX file system consists of files and directories organized in a hierarchical structure. When visualized, this hierarchical structure looks like a tree, with roots and many branches.
Below is the output of the tree -L 1 / command, showing the Unix filesystem tree:
/
├── bin
├── boot
├── dev
├── etc
├── home
├── lib
├── lib64
├── mnt
├── nfs
├── opt
├── proc
├── root
├── run
├── srv
├── sys
├── tmp
├── usr
└── var
19 directories, 0 files
Every file or directory in a UNIX operating system is somewhere on this "tree." / is referred to as "root," because it's the root of the tree which every other file or directory is inside.
For example, every UNIX user's home directory is in home, which is in /. In other words, every user's home directory is in /home.
Unix has some shortcuts for referring to directories.
. stands for "my current directory."
.. stands for "my parent directory," a.k.a. the directory one branch higher in the tree.
~ stands for "my home directory."
Example of Unix directory shortcuts:
/
├── home
│ ├── asoberan ( ~ )
│ │ └──Documents ( .. )
│ │ └──unix_class (I am here) ( . )
│ ├── duan
│ └── yourusername
├── etc
├── bin
├── tmp
...
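A short sketch of how these shortcuts are used in practice, assuming you start in /home/asoberan/Documents/unix_class as in the tree above (myfile.txt is a placeholder name):
# list the contents of the current directory
ls .
# list the contents of the parent directory (Documents)
ls ..
# copy a file from the current directory to your home directory
cp myfile.txt ~
# move up to the parent directory, then jump to your home directory
cd ..
cd ~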
The organization of the filesystem is not set in stone. However, there is a standard that many UNIX operating systems follow called the Filesystem Hierarchy Standard (FHS). The FHS defines a standard filesystem layout for greater uniformity and easier documentation across UNIX-like operating systems.
Some of what the FHS dictates includes:
/ must have everything to boot, restore, recover, or repair a system.
/etc holds configuration files for the system and programs present on the system. For example, if I install the SSH server, I can reasonably expect configuration files for it to be found at /etc/ssh/.
/home holds users' home directories.
/tmp holds temporary files that can be deleted on reboot.
/bin holds essential programs that are needed for system recovery.
/usr/bin holds non-essential programs.
Thanks to the FHS, you can expect most UNIX-like operating systems to look like this.
A basic UNIX command has 3 parts. These are:
command name
options (zero or more)
arguments (zero or more)
In general, UNIX commands are written as single strings followed by the return key. Note that UNIX commands are case-sensitive, i.e. it does matter whether you type a letter in uppercase or lowercase.
wc -l myfile.txt
The example above has the following components:
the command name (wc)
one option (-l)
one argument (myfile.txt)
Also called “switches” or “flags”.
Specify the way a given command must work.
Preceded by one or two dashes (one dash if the switch is a single character, two dashes otherwise).
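For example, many commands accept the same option in a short single-dash form and a long double-dash form; a small sketch using GNU ls (the long option names vary from command to command):
# short option: one dash, single character
ls -a
# equivalent long option: two dashes, full word (GNU ls)
ls --all
# several single-character options can be combined after one dash
ls -la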
Tab completion is a useful feature of UNIX whereby the shell automatically fills in partially typed commands or filenames when the Tab key is pressed. This is advantageous in many ways:
Commands or filenames with long or difficult spelling require fewer keystrokes to reach
In the case of multiple possible completions, the shell will list all filenames beginning with those few characters.
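For example, assuming the current directory contains a file with the (hypothetical) name arrayAnnotations_2024.txt:
# type the start of the name and press the Tab key...
cat arrayA<Tab>
# ...and the shell completes it to
cat arrayAnnotations_2024.txt
# if several names share that prefix, press Tab twice to list the candidates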
Another useful feature is the fact that the shell remembers each typed command, so you can easily recall and run previous commands.
By using the up and down arrows at the prompt, you can revisit the command history and recall previously typed commands.
Previous commands can be re-executed as they are or can be edited using the left/right arrow.
Use the history command to specify how many and which previous commands to display.
# display last typed commands
history
#display the last n typed commands
history n
#execute command n
!n
#execute last command in the list
!!
# recall and execute the n-th last command
!-n
Examples:
# recall the last 5 typed commands
history 5
# recall and execute the 3rd last command
!-3
# recall and execute command number 120
!120
Until now, all the examples of using a shell that we've seen have used our keyboard input as STDIN and the terminal screen as STDOUT. However, the shell gives us the ability to change STDIN and STDOUT by using two mechanisms: pipes and redirectors.
A redirector is denoted by a >. A redirector redirects the output of a command into a file, essentially changing the STDOUT to a file rather than the terminal screen.
For example, echo 'Hello, World!' usually outputs the message "Hello, World!" onto our terminal screen. However, if we run
echo 'Hello, World!' > hello.txt
nothing will be printed to the screen. Instead, STDOUT is changed to the file hello.txt, so the output of echo 'Hello, World!' will be written into the file hello.txt.
We can verify that by running cat hello.txt, which should print out the contents of hello.txt to our terminal screen.
A redirector will always replace the contents of a file. If, however, you wish to append content to a file, you can use >>.
Using >> to append contents to a file:
echo 'Hello, again!' >> hello.txt
A pipe is denoted by a |. Pipes allow us to use the output of a command as the input of another command.
For example, let's say you wanted to search through the contents of cities.csv for every line that includes "TX". You can do so by printing the contents of the file using cat, then piping the output into grep "TX".
cat cities.csv | grep "TX"
This will print every line that includes "TX".
Piping is a very powerful feature of the shell, because it lets us compose many small operations into a very complex set of operations. For example:
# List the files in the current directory
# and print extended information
ls -l
# Format the output of ls -l by removing
# cases of multiple spaces and replace
# them with a single space for easier
# parsing
ls -l | tr -s " "
# Use the cut program to separate each line
# of output into colums, delimited by a space
# then print columns 3-4. In the case of this
# output, we're printing the users and groups
# which own the files in the directory
ls -l | tr -s " " | cut -d" " -f3-4
# Alphabetically sort the users and groups which
# own the files in the directory
ls -l | tr -s " " | cut -d" " -f3-4 | sort
# Count the instances of users and groups so we can
# know who owns files in the current directory and
# how many files they own
ls -l | tr -s " " | cut -d" " -f3-4 | sort | uniq -c
The UCSC table browser allows execution of complex database queries without knowledge of database query languages.
Open the UCSC browser and activate the "Dec. 2013 (GRCh38/hg38)" human assembly.
Enter the position chr4:133149103-133195252 and click submit.
Click the "Tools-->Table Browser" link in the upper blue bar to open a Table Browser page.
Note the organism and assembly controls (red box)
Note the data type controls (blue box) that are associated with different kinds of data (browser tracks).
Note the region selection option
Set group to "Genes and Gene Prediction Tracks", track to "GENCODEv24", and table to "knownGene".
Restrict region to chr4:133,149,846-133,195,995
Locate the output format menu and select "BED - browser extensible data", then click the "summary/statistics" button at the bottom of the page.
This provides details about your query results and can be very useful for data summaries and as a guide to ensure your query is behaving as desired.
The count of genes returned should be 3. Note that only 2 are visible in the browser. This is due to the default "Basic" display of the new known genes set, GENCODE v24. For more information see here.
Return to the Table Browser and click the "get output" button to view the bed format version of the data.
Switch the output format to "GTF" and compare the results
Note that all different kinds of data can be accessed and downloaded in ways similar to this although the available output formats may be different.
Zoom to the 5' end of PCDH10 (chr4:133,148,848-133,150,064) and experiment with Comparative Genomics --> Cons 20 Mammals track. Note the base level conservation scores from output format "data points".
Request output in "custom track" format and view the data with "Full" setting in the browser.
Note the high-scoring region just upstream of the 5' end that also overlaps with a DNase I hypersensitivity peak and an area conserved in fish and lamprey.
Turn on the CpG islands track in the Regulation group and observe the overlap with the conserved region and DHS. Open the Encode Regulation track and turn on all the other available tracks by setting them to "pack". In addition to those other features, there is also a valley in the histone marks. Together these data are consistent with this region having a role in the regulation of expression of PCDH10.
Modes of Support
Individual Training
Project Work
$90 per hour
$1200/0.1FTE/mo
Requesting Support
Send email to Charlie Whittaker or Stuart Levine to arrange for a free outreach consultation.
Project Tracking and Billing Handled with ilabs
Core users must have an iLabs account and valid cost objects in place. The BCC/BMC core must provide a cost estimate, and the project has to be approved by the faculty member before work begins.
UNIX filesystems have a feature called symbolic links, which are files that point to another file, called the target. They're similar to shortcuts in Windows.
There are two kinds of symbolic links: hard links and soft links.
Hard links create a file which points to the same space on the hard drive as the file which is being linked to, rather than to that file itself.
/home/me/file /home/you/file
| |
hard link hard link
| ,'
V _.-`
HARD DRIVE <....--``
A file won't be deleted until every hard link to it is deleted.
To create a hard link, you use the ln command with the source file and the target hard-linked file:
ln arrayDat.txt arrayHard.txt
ls
Notice that any changes you make to arrayDat.txt will be reflected in arrayHard.txt.
Soft links create a file which points to the original file or directory.
/home/me/file <---Soft link--- /home/you/file
|
|
V
HARD DRIVE
Soft links break if the original file is deleted.
To create a soft link, you use the ln command with the -s flag.
ln -s arrayDat.txt arraySoft.txt
ls -l
Notice that any changes you make to arrayDat.txt will be reflected in arraySoft.txt. The ls command will show that arraySoft.txt points to the file arrayDat.txt.
Your Luria home folder has a couple of soft links automatically set up pointing to storage servers so that common programs don't take up too much space on the head node.
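You can spot those links with ls -l: a symbolic link shows an l in the first permissions column and an arrow from the link name to its target. The listing below is only an illustration; the actual link names and targets in your home folder will differ:
ls -l ~
# lrwxrwxrwx 1 user mygroup 26 Jan  1 12:00 data -> /net/bmc-pub14/data/mylab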
Luria has multiple versions of R available through the environment modules system. Sometimes, R packages install normally. However, other times there may be incompatibilities between the versions of R installed on Luria and the R package you're trying to install or a dependency that an R package needs is not installed on Luria.
In cases like this, you'll likely need to use R installed some other way. There are two ways: using a Singularity image with R built in, or installing R in a Conda environment. The first way is preferable.
The Rocker project provides container images that come pre-installed with R. Each image is geared toward a specific use case. The r-ver images can be used when you only need the R command line. The rest of the images come pre-installed with RStudio. If you want to run RStudio, not just the R command line, refer to the following page.
To use one of these images, follow these instructions on Luria:
R is available as a package on the conda-forge channel. You can create a Conda environment, then install R into this environment.
Open the UCSC Genome Browser home page.
The horizontal blue menu provides access to the tools.
Particularly useful links are Genomes, Genome Browser, Tools and My Data.
Clicking on the Genomes link will lead to a page that allows selection of organism and genome assembly.
Position or search term queries - examples of how to query the database.
Assembly details - General assembly information
Summary Statistics - Detailed statistics of the active assembly
Proceed to the genome assembly by clicking "Go"
Search for the gene Protocadherin 10 by pasting that name into the "position or search term" box and clicking "submit".
The genome browser is the central visualization tool of the UCSC Genome Informatics Group.
Most everything is hyperlinked and adjustable
Restore the visualization to default by clicking the "default tracks" button under the image. Note that settings from previous sessions with the browser may be retained by your browser.
Positional and scale controls are located above and below the browser image (red boxes)
Zoom and movement buttons
Coordinate window
Coordinate bar in view - click re-centers and zooms in 3x
Dragging across the upper track is a zoom to selection function.
Move start and end functions
Clicking in window and dragging re-positions the view
Clicking the chromosome ideogram re-positions the window to selected part of the chromosome
Data organization and track controls (blue boxes)
Like the search results, the information displayed in the browser window is organized into groups and tracks of data. The kinds of tracks being displayed and the ways they are presented can be controlled.
NOTE: This human assembly is mature, heavily annotated and extensively analyzed. The genomes of other organisms with less mature assemblies have less associated information and therefore fewer tracks.
The buttons "default tracks" and "hide all" will do as their name suggests.
The "configure" button displays a list of the available data with display options.
The bottom of the page consists of a series of pull-down menus. Each is associated with a type of data and the options control display mode.
IMPORTANT: Useful details about each track can be obtained by clicking the hyperlinked track name.
Navigate to chr4:133,101,179-133,262,320 using the position window.
In the "Genes and Gene Predictions Tracks" section, use the pull-down menu to set the "GENCODEv24" track to full.
Click the "refresh" button. Toggle the GENCODEv24 track between dense and full using the pull-down menu and refresh button.
For reference, the effects of these different options are described
Note: After changes are made to the pull-down menus, a refresh is required for them to take effect.
The visual cues that you see in the "Full" display of known genes are:
Arrowheads indicate the strand of the gene.
Thin lines are introns, medium lines are non-coding portions of the mRNA, thick lines are coding region.
Zoom in to display the region around the 5' end of the PCDH10 gene using coordinates chr4:133,148,483-133,162,403
Setup - Connect to your luria account and run the following commands to create a conda environment called jupyter, activate that environment and install the necessary software
module add miniconda3/v4
source /home/software/conda/miniconda3/bin/condainit
conda create --name jupyter
conda activate jupyter
conda install -c anaconda jupyter
Running jupyter notebooks -
On both mac and PC:
login to luria and run:
srun --pty -n 16 bash
to get all of the resources of a single cluster node in the normal queue. You need to note the cluster node being used. It will be something like: c##
then cd into your data directory:
cd /net/host/data/share/users/kerberosID
NOTE: the path will vary by user and PI. The necessary information for host, share and kerberosID should be visible by running the following command from inside your luria home:
ls -lat | grep data
activate the jupyter conda environment with:
module add miniconda3/v4
source /home/software/conda/miniconda3/bin/condainit
conda activate jupyter
From this cluster node, with the jupyter conda environment active, execute the command:
jupyter notebook --ip=0.0.0.0 --port=12123
The 12123 is an arbitrary 5-digit port number; you can change it to almost anything between 10000 and 60000, you just need to keep track of what you use.
You should see a lot of output stream across the screen, leave that window open and running. There will be information in the output that you will need for the next step.
For Macs, you need to open up a new terminal window and run:
ssh -t [email protected] -L 12123:localhost:12123 ssh cXX -L 12123:localhost:12123
the cXX needs to be replaced with the node that you are running on
this will create a 2nd “tunneled” connection to the luria cluster node that is running your jupyter notebook server.
For PCs running SecureCRT, you just need to modify the existing connection. This is done by selecting Options → Session Options and the Port Forwarding item
In there, you need to click Add and add a tunnel according to slide 9 of:
Your port and your socket value will both be the 5-digit number you selected, and you need to get the node correct.
Once those connections are set up, you can open a web browser on your mac and go to one of the URLs provided in the output of the notebook run.
There will be a couple of options for URLs; use the one that starts with something like:
...
If all is well, a jupyter notebook that is actually running on the cluster node will appear in your browser and the directories in your storage will be available.
In addition, I want to point out another computational resource at MIT called Engaging OnDemand:
This is a different computer that you have access to and it also offers jupyter notebooks with more substantial computational resources.
1. Conda is an open-source package management system and environment management system.
2. Package, dependency and environment management for any language---Python, R, Ruby, Lua, Scala, Java, JavaScript, C/C++, FORTRAN
3. Comparison of pip and conda
Pip installs Python software packaged as wheels or source distributions, which may require compilers and libraries. Conda packages are binaries. No need for compilers.
Conda packages are not limited to Python packages.
Conda can create isolated environments. Pip has no built-in support for environments but rather depends on other tools like virtualenv or venv to create isolated environments.
Pip installation does not check dependencies of all packages. Conda installation will verify that all requirements of all packages installed in an environment are met. This check can be slow.
A package may not be available as a conda package but may be available on PyPI, where it can be installed with pip (see the sketch after this list).
4. Miniconda is a free and mini version of Anaconda.
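A minimal sketch of how the two tools are typically combined: conda creates and manages the isolated environment, and pip is used inside it for anything only published on PyPI. The package names below are only examples:
# create and activate an isolated environment with a specific Python
conda create --name mixed-env python=3.10
conda activate mixed-env
# install a binary package from a conda channel
conda install numpy
# fall back to pip for a package that is only on PyPI (placeholder name)
pip install some-pypi-only-package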
1. Load miniconda3 module
Some online conda documentation suggests running conda init to initialize the conda environment. Please do not run this, as it will add conda settings to your .bashrc file, which may pollute your environment. You only want to activate the conda environment when needed, not by default.
2. Managing environments
Conda allows you to create separate environments containing files, packages, and their dependencies that will not interact with other environments. When you begin using conda, you already have a default environment named base. To see available packages in the base environment
You can run the command "python -V" to see what is the python version in the base environment. You can't add your own programs into the base environment because you will not have permissions. Create your own separate environments to keep your programs isolated from each other. To see available conda environments
Create a new environment and install a package in it. We will name the environment snowflakes and install the package BioPython. Type the following:
Run python and then import Bio to confirm biopython can be imported. Then run conda deactivate to leave the snowflakes environment.
To delete an entire environment.
3. Show conda environment info
4. Exercise 1: install samtools
If you run "module avail samtools", you will find the latest module available is 1.10. Now you are going to install a newer version of samtools using conda. First create an environment named samtools
Search for "samtools" in the conda package index; this will lead you to a page with installation instructions. Follow the instructions there. Run the samtools command to confirm it worked.
What if you want to install a different version of samtools? You can change the syntax to conda install -c bioconda samtools=1.11. This will downgrade your samtools from 1.12 to 1.11.
What if you want to keep both versions? HINT: create a separate conda environment named samtools-1.11, as sketched below.
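One way to follow that hint (a sketch; the bioconda channel provides samtools, and the exact version strings available may differ):
# make a dedicated environment for the older version
conda create --name samtools-1.11
conda activate samtools-1.11
# install the specific version from the bioconda channel
conda install -c bioconda samtools=1.11
# confirm which version is now on your PATH
samtools --version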
5. Exercise 2: install umi_tools
Next you want to install a python package named umi_tools. Search for umi_tools in the conda package index to get installation instructions.
You can download the conda cheat sheet from the conda documentation, or search online for "conda cheat sheet".
Multiple versions of various software packages are available on luria.mit.edu. Versions are managed using environment modules.
module commands
List available applications on the system:
Load an application so it can be used:
Specify version number of an application, e.g.
We recommend that you always include the version number when you module add, so that you know which version of the software is used. Otherwise, a running application may generate different results or break when the software is upgraded in the future.
Note: some packages are available only after you load the python (for python packages) or jre (for java packages) modules. For example, compare the output of the following:
with
R and R packages - Various versions of R are installed on luria. Each one has a collection of associated packages. Execute the following commands taking note of the comment lines:
Once inside R, check out the available R packages
NOTE: Those package version numbers may not be consistent from month to month and can change without warning.
When working with R, capture your session info with the command below so you can reproduce the environment if needed:
quit R
Once back in the shell, view the R data that remains
List loaded applications:
Unload an application:
Print module help:
Instead of running a single command at a time, you can combine multiple commands into a file to create a script. Scripts can simplify multi-step processes into a single invocation, saving you time in the long-run.
Writing a script is as simple as writing a file. Usually, we make the first line of the script #!/bin/bash to tell the shell to use bash when running the script.
You can create variables in a bash script using the = operator. So to make a variable named myname, you'd write myname="Allen". To use the variable later in the script, you'd prefix it with a $. For example, $myname.
You can also ask for a user's input and store that into a variable by using the read command. Preface it with echo "question?" to give context for what the user is inputting. For example:
You can run shell commands inside of a script and store their results in a variable. To do so, you wrap the command with $(). For example, to get the size of a file and store it in a variable, you'd do file_size=$(du file) (note that there must be no spaces around the = sign).
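A quick, self-contained illustration of command substitution (file.txt is a placeholder name; du -k reports the size in kilobytes):
# store the size of file.txt, in KB, in a variable
file_size=$(du -k file.txt | cut -f1)
echo "file.txt uses ${file_size} KB"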
Knowing this, let's create a script that looks through the files in a directory, and moves any files above a certain size to a new folder. Name it sizewatcher.sh
Once you've created and saved this file, make sure to modify its permissions to let it be executed. An easy way of doing this is running the following command:
Then run the script:
Let's create another script that asks a user for a directory, then sorts the files in that directory into folders that correspond to the year and month that the file was created. Name it sorter.sh
:
Make it executable and then run it.
echo "What is your name?"
read name
echo "Hello, $name"
#!/bin/bash
# Make variables for directory to loop through
# and directory where files should be moved.
directory="/home/asoberan/unixclass"
new_directory="/home/asoberan/bigfiles"
# If the new directory doesn't exist, then
# create the directory
if [ ! -d "$new_directory" ]; then
mkdir $new_directory;
fi;
# Loop through the files in the directory.
# Store the size of the file (in KB), in
# file_size. If the file size is greater
# than 400000 KB, move the file to the
# new directory
for file in "$directory"/*;
do
file_size=$(du "$file" | cut -f1);
if [ $file_size -gt "400000" ]; then
mv $file $new_directory;
fi;
done;
chmod +x sizewatcher.sh
./sizewatcher.sh
#!/bin/bash
# Ask for directory and store it in a variable
# called "directory"
echo "What directory to check?"
read directory
# Check if the provided argument is a directory
if [ ! -d "$directory" ]; then
echo "Error: $directory is not a directory."
exit 1
fi
# Iterate over files in the directory
for file in "$directory"/*; do
# Check to see if the file is really file and
# not a directory, etc.
if [ -f "$file" ]; then
# Get the year and month of creation for the file
year=$(date -r "$file" +%Y)
month=$(date -r "$file" +%m)
# Create directory for the year and month if it doesn't exist
mkdir -p "$directory/$year-$month"
# Move the file to the corresponding directory
mv "$file" "$directory/$year-$month"
echo "Moved $file to $directory/$year-$month"
fi
done
echo "File sorting complete."
module load miniconda3/v4
source /home/software/conda/miniconda3/bin/condainit
conda list
conda env list
conda create --name snowflakes
conda env list
conda activate snowflakes
conda install biopython
conda list
conda deactivate
conda remove --name snowflakes --all
conda info
conda create --name samtools
conda activate samtools
module avail
module add
module add tophat/2.0.12
module avail
module add jre/1.6.0-29
module add python/2.7.2
module avail
# Load an R version
module add r/2.15.3
# Start R
R
#List R packages and direct the list into an object called "a"
a<-installed.packages()
#Display the dimensions of the object "a"
dim(a)
#List the parts of object "a" that have package name and package version
a[,c(1,3)]
sessionInfo()
q(save="yes")
#enter y at prompt
ls -lat | head
cat .Rhistory
module list
module del
module help
Globus is a robust data management platform designed to facilitate the secure and efficient transfer of large datasets across institutions and collaborators.
In order to make use of the Globus platform, you will need to create an account if you have not done so already. Most users will do so by using their institutional login (e.g., MIT Kerberos) by clicking the 'Log In' button in the top-right hand corner of the Globus website. After providing your consent, your new Globus account will be tied to your full name and email address as it appears at your institution.
If you are not affiliated with any of the organizations or research institutions listed, you can create your own unique Globus login by filling out the Globus ID form. This login will take the form [email protected], and will have its own login credentials associated with it.
There are several public guest collections available in Globus, through which you will be able to access generated core data:
Once you have created a Globus account, you can reach out to the core responsible for your data to request access to one of these collections, and a link will be shared with you via email after approval.
Until you are provided access, you will not be able to browse the collections via the links above.
If you require assistance with moving data to or from an external collaborator and local KI storage, we can help to facilitate that. This is the information that we'll need from you:
The volume (size, number of files) of data being moved.
What is the requested delivery date? This should be at least two weeks in advance in order to accommodate network and filesystem performance.
Where does the incoming data need to go (e.g., /net/bmc-lab2/...)?
Who will be needing access to write to the specified path? This can be a member of your team or your collaborator's team. For information on logging in to the Globus platform, please see above.
Which storage host is the outgoing data located on, and what is the absolute path to it (e.g., /net/bmc-lab2/...)? This will depend on your lab association.
Does the destination institution use Globus? If so, has the destination Globus collection been created yet? If not, the data can be transferred to local storage using Globus Connect Personal, which will need to be configured prior to the transfer.
Who will be needing access to read from the specified path? This can be a member of your team or your collaborator's team. For information on logging in to the Globus platform, please see above.
For additional assistance or questions, please reach out to [email protected].
quota
Luria has a disk usage quota on the head node. You can use the quota command to check what the size of the quota is and how much space you're currently using.
The -s flag will display the quota in a human-readable format (e.g. 18K, 14M, 65G).
quota -s
Disk quotas for user asoberan (uid 247789):
Filesystem space quota limit grace files quota limit grace
/dev/mapper/cl-home
4391M 18432M 100G 24925 0 0
# Space column shows how much space you're currently using on the head node
# Quota column is total amount of space allotted to you
du
Stands for "disk usage". Reports how much disk space a directory is taken up.
Defaults to displaying the disk usage in kilobytes, but passing it the -h
flag will return disk usage in a human-readable format. (e.g. 18K, 14M, 65G).
Will search the entire depth of the Unix tree starting from the directory you give it. You can control the depth which it searches by passing the -d <num>
flag.
# Displays the disk usage of every file under your home directory
# The last entry will be the disk usage of your entire home directory
du -h ~
# Displays the disk usage of every file and directory immediately
# under your home directory
du -h -d 1 ~
# Displays the disk usage of one file, file.txt
du -h file.txt
tar
Combines multiple files or directories into one archive for easy sharing. Similar to "zipping" files; however, tar does not compress by default.
Create an archive by passing the -cf flags.
Can compress multiple files/folders using gzip by passing the -z flag when creating a new archive. Useful when you have files that take up a lot of space and you want to save space.
Extract an archive by passing the -xf flags. Un-compress an archive by passing the -z flag along with those two flags.
# Create an archive of a directory and name it my_directory.tar
tar -cf my_directory.tar <directory>
# Create a compressed archive of a directory and
# name it my_compressed_directory.tar.gz
tar -czf my_compressed_directory.tar.gz <directory>
# Extract archive my_directory.tar
tar -xf my_directory.tar
# Uncompress the archive my_compressed_directory.tar.gz
tar -xzf my_compressed_directory.tar.gz
zip
Zip multiple files and directories into one file. Zipping is similar to archiving with tar, but zipped files are easier to deal with on Windows machines.
# Zip a directory to the file my_directory.zip
zip my_directory.zip <directory>
# Unzip my_directory.zip
unzip my_directory.zip
module load singularity/3.10.4
# Replace 4.2.0 with the version of R you need.
# The first time this runs, Singularity will pull
# the image, which may take around 5 - 10 minutes.
# Subsequent times will be faster.
singularity exec docker://rocker/r-ver:4.2.0 R
# You will now have an R 4.2.0 session running. Run
# install.packages("YOUR PACKAGE") and type "y" if
# R prompts you to create a user library
R>
module load miniconda3/v4
source /home/software/conda/miniconda3/bin/condainit
# Name the environment as you please
conda create --name r_environment
conda activate r_environment
# Replace 4.4.1 with the version of R you
# need. Make sure to check if conda-forge
# has the version packaged.
conda install -c conda-forge r-base=4.4.1
R
# You will now have an R 4.4.1 session running. Run
# install.packages("YOUR PACKAGE") and type "y" if
# R prompts you to create a user library
R>
This tutorial shows how to map Illumina reads to a reference genome with Bowtie and perform basic manipulations of the resulting SAM/BAM alignments.
1. Load fastq file and annotate the uploaded data
On the Tool Panel, click on Get Data → Upload File.
This tool allows you to upload data from a file, a URL, or a text box.
Select Galaxy_GM12878_trimmed.fastq as input file.
Select "fastqsanger" as file format.
Select "hg18" as genome.
Click the "Execute" button.
2. Map to reference genome (hg18) using Bowtie
On the Tool Panel, click on NGS Toolbox Beta → NGS Mapping → Map with Bowtie for Illumina.
This tool allows you to run the aligner Bowtie. The output is a SAM file with all the read alignments.
Select hg18 canonical as reference genome.
Leave other settings as default.
Click the "Execute" button.
3. Filter SAM file on bitwise flag values
On the Tool Panel, click on NGS Toolbox Beta → NGS SAMtools → Filter SAM on bitwise flag values.
This tool filters a SAM file based on bitwise flag values.
Select data 2 as input dataset.
Add new flag with type set to "the read is unmapped" and the value set to "No".
Add new flag with type set to "read strand" and the value set to "Yes".
Click the "Execute" button.
With these parameters, the resulting output consists of those reads that are properly mapped and are on the reverse strand.
4. Find how many reads map to each chromosome
On the Tool Panel, click on Join, Subtract and Group → Group →
This tool groups a dataset by a chosen column and applies aggregate operations to other columns.
Select data 2 (bowtie output) as input dataset.
Group by column 3 (reference name, i.e. chromosome name)
Add new operation: count on column 1 .
Click the "Execute" button.
With these parameters, the resulting output lists the number of reads mapping to each chromosome.
Edit the attributes of this module by clicking on the pencil icon: rename it as "read distribution by chromosome".
5. Find the most represented chromosome
On the Tool Panel, click on Filter and Sort → Sort data →
This tool sorts a dataset by a selected column.
Select the column representing the key to sort on.
Select "Numerical sort" in "descending order" as options.
Click the "Execute" button.
With these parameters, the results show that chr19 is the most represented chromosome.
6. Convert SAM to BAM
On the Tool Panel, click on NGS Toolbox Beta → NGS Samtools → SAM to BAM converter.
This tool converts SAM-formatted files into BAM-formatted files.
Select the SAM file to convert.
Click the "Execute" button.
7. Compute general statistics via Flagstat operation
On the Tool Panel, click on NGS Toolbox Beta → NGS Samtools → flagstat.
This tool provides a simple summary of a BAM-format file.
Select data 6 (BAM file) as input.
This tutorial shows how to perform basic QC on Illumina data, such as basic quality statistics, quality score boxplots, trimming and masking.
1. Load a fastq file and annotate the uploaded data
On the Tool Panel, click on Get Data → Upload File.
Browse to your home folder and select the file Galaxy_GM12878.fastqillumina.
Set the format to "fastqillumina".
Click the "Execute" button.
Once the job is completed, click on the pencil icon to edit the attributes.
Set the Database field to "Human Mar 2006 (NCBI36/hg18)".
Click the "Save" button.
2. Convert to Sanger FASTQ format
On the Tool Panel, click on NGS Toolbox Beta → NGS: QC and Manipulation → FASTQ Groomer.
This tool converts between various FASTQ quality formats.
By default, the quality format output is Sanger FASTQ.
Sanger FASTQ is the required format for downstream analyses in Galaxy.
Set the input type to "Illumina 1.3+"
Click the "Execute" button.
Once the job is completed, click on the pencil icon and edit the name of the job as "GM12878 fastqsanger".
3. Compute Quality Statistics
On the Tool Panel, click on NGS Toolbox Beta → Fastx-Toolkit → Compute Quality Statistics.
This tool computes quality statistics (min, max, mean, median, Q1, Q3, IQR, etc.) of the quality scores.
Select Data 2 as input library.
Click the "Execute" button.
4. Draw Quality Score Boxplot
On the Tool Panel, click on NGS Toolbox Beta → Fastx-Toolkit → Draw Quality Score Boxplot.
This tool creates a box graph of the quality scores in the library.
Select Data 3 as statistic report file.
Click the "Execute" button.
Once the job is completed, click on the eye icon to see the boxplot figure. You can expand and collapse the figure by clicking on the arrows placed on the sides of the main panel.
5. Trim Sequence Reads to length of 60 bases
On the Tool Panel, click on NGS Toolbox Beta → FASTQ Trimmer (by column).
This tool trims the end of the reads.
Select Data 2 as input FASTQ file.
Set the offset from 5' end to 16.
With these parameters, the reads are trimmed to a length of 60 bases.
Click the "Execute" button.
Once the job is completed, click on the pencil icon to edit the attributes of the resulting data and change the name to "GM12878 Trimmed fastqsanger".
6. Apply Quality Masker to bases with quality lower than 20
On the Tool Panel, click on NGS Toolbox Beta → FASTQ Masker.
This tool allows masking base characters in FASTQ files according to quality score value and comparison method.
Select Data 2 as input file to mask.
Set the criterion as "less than" and the threshold to 20.
Click the "Execute" button.
With these parameters, any base with quality less than 20 will be masked with a symbol "N".
7. Apply FASTQ Quality Trimmer
On the Tool Panel, click on NGS Toolbox Beta → FASTQ Quality Trimmer (by sliding window).
This tool allows trimming the ends of reads based upon the aggregate value of quality scores found within a sliding window. Several criteria can be used to determine the aggregate value (min, max, sum, mean) within the sliding window.
Select Data 2 as input file.
Select "Trim 5' end" only from the scroll down menu.
Set window size to 3.
Select "max score" as aggregate action.
Select ">= 2" as criterion for trimming
Click the "Execute" button.
8. Create a Data Subset by selecting the first 2,500 sequence reads
On the Tool Panel, click on Text Manipulation → Select first lines.
This tool selects the first N lines of the input dataset.
Set to 10,000 the number of lines to select.
Select data 5 as input.
Click the "Execute" button.
With these parameters, the first 10,000 lines of the input FASTQ file are selected, corresponding to the first 2,500 sequence reads.
Annotating Genomic Sequence with Argo
Argo is a genome browser and annotation tool written by Reinhard Engles at the Broad Institute. It combines powerful data display functions with the ability to create and edit genomic features.
Start Argo. You can use the web start or download the jar and double-click it.
The data files you need for this exercise are located HERE. Save each file to your working directory.
Load the sequence into Argo by selecting the "File-->Open Sequence File" menu item. The protocadherin 10 gene sequence is in the file "pcdh10.gene". In the sequence file format pull-down menu, select Fasta. In the sequence range window, accept the default of the entire sequence range. In the Feature Map Track Table, click the "Load Tracks From File..." button. Select "known.gff". In the Track File Format window, accept the default of "gff".
A line will appear in the track table representing the gff file you just loaded. If you click on the colored square, you can change the color of the track; pick dark blue. Then click OK.
The resulting feature map shows the 3 known gene isoforms for PCDH10. Select one of the isoforms.
This activates the feature inspector window in the lower left-hand corner of Argo. The Properties tab describes the feature you have selected, the DNA tab contains the DNA sequence (exons in blue, introns in black), the mRNA is found under the mRNA tab, and the protein translation of the mRNA is in the Protein tab. Note: Right-clicking (command-click for Macs) while hovering over the mRNA sequence will allow you to adjust the reading frame.
A right-click while hovering over the Feature map will open a function menu. Select the option "Track Table". This opens the same tool that appeared while we were loading the sequence. Click the "Load Tracks From File..." button.
Select the file "cad.blastx" and hit OK. This results in a Track File Format menu, select the Blast (standard) option. cad.blastx contains standard blast output (look at it here) in text format. The query sequence was the genomic DNA we are displaying, the database was a set of protein sequences related to the gene we are working on. Therefore, after hitting the OK button you will see a "Blast Feature Coordinates" dialog box. The answer to this question is no because of the subject sequences were the protein, not the DNA in argo. This will add the blast results to the Feature map. When you select a blast hit, information about that hit appears in the feature instpector.
Control the way the blast data is displayed by opening the track table and clicking on the area next to the blast track and under the style section. Change it to "Bar Graph".
Using similar methods, load the files cpg.gff and est.gff. All are in gff format.
The resulting display may place the known genes near the top of the page. This is not ideal; open the track table and change known genes to "Segregated".
In the UCSC browser, we identified the EST DA219615 as evidence for an uncharacterized PCDH10 splice variant. Search for this feature using the tool on the lower right-hand side of Argo. The search tool will result in selection of that feature.
You can inspect details about the feature by right-clicking while hovering over the feature map. For example, select the option "Splice Site Profile". This opens a window showing the splice junctions for the selected feature.
Create a gene model based on this evidence using the "Edit --> Insert Compound Feature" menu item. This will open a window showing the coordinates of the feature. Give it the Gene name "PCDH10" and the transcript name "iso3". Next you will be asked for a file to save the data. Create a new gff3 file called iso3 and click OK.
The result is a feature inserted into the map that is identical to EST DA219615. It is located at the very top of the map. Move its display position down using the track table; Segregated is a good choice.
The purple color indicates a pending insert. Save it using "Edit --> Save".
Examine the 3' end of the gene model. While hovering over the feature map, hold down the Shift key. A magnifying glass will appear. Use the magnifying glass to draw a small box around the 3' exon of the inserted model.
Compare the exon to the EST BG201062. The EST contains more sequence and is likely to represent the real 3' end of the gene. Select BG201062 and drag it to the end of your inserted model. You will be given a transcript extension dialog box. Select Merge Right. Now your edited feature should be highlighted in orange. This indicates a pending edit. Save the new model. You will get an error, but that is OK for now.
Zoom out to view the full gene by Shift-Right Clicking several times. Or use the Zoom to 100% function.
Next, add the 5' end of the gene. Select one of the longest known gene models and drag it to your edited feature. This time accept the merge left option. Save the model; you will now get 2 errors, accept them as well.
Repair the gene model using the Feature Inspector window on the lower left. Select your gene model, then click the DNA tab. The letters in DNA will turn red, indicating that the feature has non-canonical (i.e. non-GT-AG) splice junctions. These will be highlighted in red in the sequence pane. Scroll down until you find the first one. It is located at the 5' end of exon 3. Highlight a couple of lines before and after the red letters. Notice the yellow bar in the feature map. Zoom in on the troublesome exon. The exon is different from the known genes because its 5' end hangs over a little. Select the known exon and drag it to the working model. Select the replace option and save. This will repair the first problem.
Next, zoom in on the 3' exon of the working model. We have an exon that is a few bases off because of the EST-based extension we used to create the model. Go to the DNA tab of the Feature Inspector, scroll to the bottom. Note the non-canonical junction highlighted in red. Right next to the red highlight is the sequence AG that we need. Highlight the red letters and sequence up to and including the AG and do a right-click. One of the available options is "Make Highlighted Sequence Intronic". (WARNING: Mac users without a 2+-button mouse may have trouble with this operation - you can use the drag and drop features to get this exon right) Select this option then save the model. At this time, it should save without errors.
Go to the mRNA tab, right click and select the option "Select ORF > Select Longest Start to StopCodon ORF". Go to the Protein tab to see the protein sequence.
Note the green start and red stop codons indicated in the feature map. Also note the consensus poly-A signal analysis available from the right-click menu.
The entire feature map can be exported as an image for use in presentations using the "File --> Export Map as Image..." menu item.
Select the hand-edited isoform, return to the mRNA tab in the feature inspector, do select all and copy, then return to the section on the UCSC browser. We will start by pasting the mRNA sequence from this annotation effort into a Blat search window.
Monday July 8, 2024
2:00pm-4:30pm
Allen Soberanes
Building 76
TBD
Introduction to Unix
Tuesday July 9, 2024
2:00pm-4:30pm
Allen Soberanes
Building 76
TBD
Advanced Utilization of IGB Computational Resources
a large number of life sciences software implemented in R
an extensive collection of experimental and annotation data that relate to the analysis software.
The packages are written, documented and supported according to consistent standards.
Each package has a website. For example see:
Both PDF documentation and example R scripts exist.
Instructions on joining the mailing list
IMPORTANT: Carefully read the posting guide and follow the instructions.
The list is high-traffic and widely read so be careful with postings.
#Connect to rous.mit.edu and start R
R
#check available packages:
library()
#load affydata
library(affydata)
#Run some of the commands
data(Dilution)
ls()
Dilution
class(Dilution)
expressionData<-exprs(Dilution)
class(expressionData)
expressionData[1:3,]
log2(expressionData[1:3,])
round(log2(expressionData[1:3,]), digits=2)
The following series of commands can be used to process array data with gcrma and do differential expression testing with LPE:
library(affy)
library(gcrma)
library(LPE)
library(affyPLM)
#Set the working directory to the location of your CEL files.
setwd("/Path/TO/CEL_Files")
#Import Data
#Order of samples in the resulting matrix can be specified by the order in the list.
Name_Dat<-ReadAffy(
"Condition1a.CEL", "Condition1b.CEL", "Condition1c.CEL",
"Condition2a.CEL", "Condition2b.CEL", "Condition2c.CEL")
#RNA degradation Work
RNAdeg<-AffyRNAdeg(Name_Dat)
png(file="Name_rnaDeg.png", bg="white")
plotAffyRNAdeg(RNAdeg,cols=c(1:16))
dev.off()
#PLM work
pset1<-fitPLM(Name_Dat)
#RLE plot
png(file="rle.png", bg="white")
par(mar=c(3, 10, 3, 3))
RLE(pset1, main = "RLE for Name", horizontal=TRUE, las=2)
dev.off()
#NUSE plot
png(file="nuse.png", bg="white")
par(mar=c(3, 10, 3, 3))
NUSE(pset1, ylim= c(0.95,1.2), main = "NUSE for Name", horizontal=TRUE, las=2)
dev.off()
#Process the Data
Name_Exp<-gcrma(Name_Dat, fast=FALSE)
Name_Tab<-exprs(Name_Exp)
Name_Tab<-round(Name_Tab, digits=2)
write.table(data.frame(Name_Tab), sep="\t", quote=FALSE, file="Name.txt")
#Differential Expression Testing
#In editor, add "ProbeID" to top of first column and delete the affy control rows that start with AFFX
Name<-read.table("Name.txt", header=TRUE)
attach(Name)
names(Name)
set.seed(0)
#testing the columns by printing the first 3 rows for each condition
#var.Cond1
Name[1:3,c(2,3,4)]
#var.Cond2
Name[1:3,c(5,6,7)]
#LPE tests
var.Cond1<-baseOlig.error(Name[,c(2,3,4)],q=0.01)
var.Cond2<-baseOlig.error(Name[,c(5,6,7)],q=0.01)
lpeVal.Cond1.Cond2<-data.frame(lpe(Name[,c(5,6,7)], Name[,c(2,3,4)], var.Cond1, var.Cond2, probe.set.name = Name$ProbeID))
lpeVal.Cond1.Cond2<-round(lpeVal.Cond1.Cond2, digits=2)
fdrBH.Cond1.Cond2<-fdr.adjust(lpeVal.Cond1.Cond2, adjp="BH")
write.table(lpeVal.Cond1.Cond2, quote=FALSE, sep="\t", file="lpeVal.Cond1.Cond2.txt")
write.table(fdrBH.Cond1.Cond2, quote=FALSE, sep="\t", file="fdrBH.Cond1.Cond2.txt")
#clustering with pvalues
library(pvclust)
Name.pv<-pvclust(Name, method.hclust="ward",
method.dist="correlation", use.cor="pairwise.complete.obs",
nboot=1000, r=seq(.5,1.4,by=.1), store=FALSE, weight=FALSE)
png(filename="Name_PV.png", bg="white",width=960, height=500)
plot(Name.pv)
dev.off()
The Analyze Single Experiment workflow allows one experiment to be tested. Activate an experiment with the desired thresholds applied and select Analyze Single Experiment from the Workflows & Reports list.
Confirm Network object count and threshold settings and click Apply.
Results are returned by ontology; statistics are reported as -log10(p-value) (1.3 corresponds to p = 0.05).
Results can be examined by clicking on the name. See Legend for more information about maps and networks. Note: network objects included in filtered sets display expression results in image.
Different results link to different types of data.
Pathway maps are nice interactive maps
Process networks and network statistics produce editable network maps.
GO Processes produce lists that can be used to draw networks.
Diseases link to disease pages.
This workflow requires at least 2 active experiments. Activate all 3 ColonMet experiments and refresh the browser page to confirm.
Select the Enrichment Analysis link to launch the workflow. Ensure the correct experiments are activated and click Next.
Fold change, p-value, and signal directionality can be adjusted on the next page. NOTE: the number of network objects that meet threshold criteria is indicated. Clicking Apply launches the analysis.
Results are returned by ontology; statistics are reported as -log10(p-value) (1.3 corresponds to p = 0.05).
Results can be examined by clicking on the name. See Legend for more information about maps and networks. Note: network objects included in filtered sets display expression results in image.
Different results link to different types of data.
Pathway maps are nice interactive maps
Process networks produce editable network maps.
GO Processes produce lists that can be used to draw networks.
Diseases link to disease pages.
This workflow requires at least 2 active experiments. Activate all 3 ColonMet experiments and refresh the browser page to confirm.
Select the Compare Experiments link to launch the workflow. Ensure the correct experiments are activated and click Next.
Fold change, p-value, and signal directionality can be adjusted on the next page. NOTE: the number of network objects that meet threshold criteria is indicated. Clicking Apply launches the analysis.
Genes passing thresholds in each experiment are compared and grouped in 3 categories:
Unique
Similar - present in more than one list
Common - present in all lists
Results are returned by ontology; statistics are reported as -log10(p-value) (1.3 corresponds to p = 0.05).
Results can be examined by clicking on the name. See Legend for more information about maps and networks. Note: network objects included in filtered sets display expression results in image.
Different results link to different types of data.
Pathway maps are nice interactive maps
Process networks produce editable network maps.
GO Processes produce lists that can be used to draw networks.
Diseases link to disease pages.
Histograms can be sorted according to statistics:
Statistically significant
Differentially affected
Similarity by
Particular
Histograms are interactive; the network statistics panel reports results.
From the Start Page, select the Search and Browse Content link and pick the EZ search button
Click human FN1 from the results list
The results page is organized according to category; available information includes pathway maps and interactions.
On the Interactions page, compare the details for alpha2/beta1 binding and alpha5/beta1 binding.
From the Start Page, select the Search and Browse Content link and pick the Pathway Maps option.
This leads to a hierarchy of GeneGO pathway maps
Select Regulatory maps-->Regulation of cellular processes-->Cell adhesion-->Cell adhesion_Integrin inside-out signaling
NOTE: Data from activated experiments is displayed on the map.
The Overview tab describes details of the map and provides references.
The Processes tab lists significant GO processes in the map.
From the Start Page, select the Search and Browse Content link and pick the Process Networks option.
This leads to a hierarchy of GeneGO process networks
Select Process Networks-->Cell adhesion_ECM remodeling
NOTE: Data from activated experiments is displayed on the network.
The network is highly interactive.
From the Start Page, select the Search and Browse Content link and pick the Protein Groups & Complexes tool.
The tool links to a table of results; select Integrin Receptor Family (Human).
The upper left panel of the Start Page hosts the data management functions.
Click the "Start Page" icon or your browsers "refresh" button to make sure that the displayed data is current.
Double-click the "Experiments" folder. You should see "ColonMet_Faked" inside the folder.
Right-Click on the My Data folder and select Create Folder... Create a folder called May1Training
Drag ColonMet_Faked into the new folder. Right-click on the original folder to delete it. Note: folders can also be renamed.
Data can also be shared. Right-click on May1Training and select Share..., enter charliew and search. Select the username and click Add Selected to List. Note: types of permissions can be controlled.
The lower left panel of the Start Page displays the active data. Active data is available for analysis and can be controlled by the Activate/Deactivate icon, right clicking or drag-and-drop.
cat
Concatenates one or more files and prints the content to STDOUT.
Colloquially, it's used to print out the contents of a single file.
head
By default, prints the first 10 lines of a file to STDOUT.
You can pass the -n flag to specify the number of lines to print.
tail
By default, prints the last 10 lines of a file to STDOUT.
You can pass the -n flag to specify the number of lines to print.
less
Lets you page through a long file or stream of text (i.e. you can scroll).
Exit by pressing q
scp
Allows you to transfer files between a local and a remote host. It does this by leveraging ssh.
Shares the same syntax as the normal cp command.
The UNIX file system consists of files and directories organized in a hierarchical structure.
Avoid file names and directory names with spaces or special characters.
In the figure below, there are 5 subdirectories under the root, i.e. bin, tmp, home, usr, src. The directory home has two subdirectories (i.e. iap and paolaf). Thus the parent directory of iap is home.
The system provides a few short synonyms relative to the working directory:
The root is denoted by a forward slash (/).
The home directory of any given user is specified by a tilde (~).
A single dot (".") indicates the current working directory.
A double dot ("..") indicates the directory above the current directory.
You can find out your current working directory (in other words, where you currently are in the Unix tree) by typing the command pwd (print working directory) at the prompt and then hitting return.
# Start in ~/unixclass
cd ~/unixclass
cat arrayDat.txt
# Start in ~/unixclass
cd ~/unixclass
# Print the first 10 lines of arrayDat.txt
head arrayDat.txt
# Print the first 5 lines of arrayDat.txt
head -n 5 arrayDat.txt
# Start in ~/unixclass
cd ~/unixclass
# Print the last 10 lines of arrayDat.txt
tail arrayDat.txt
# Print the last 5 lines of arrayDat.txt
tail -n 5 arrayDat.txt
# Start in ~/unixclass
cd ~/unixclass
# Scroll through the contents of arrayDat.txt
less arrayDat.txt
# Transfer from local computer to remote computer
scp <source file> <username>@<remote host>:<destination>
scp agenda.txt asoberan@luria.mit.edu:/home/asoberan/unixclass
# Transfer from remote computer to local computer
scp <username>@<remote host>:<source file> <destination>
scp asoberan@luria.mit.edu:/home/asoberan/arrayDat.txt .
Multiple sequence alignment and phylogenetic analysis allow the identification of conserved positions in protein and nucleic acid sequences. This can lead to an appreciation of the evolutionary history of a group of sequences.
The clustal family of programs is commonly used to produce multiple sequence alignments. Other options are available as well:
There are 3 different ways to use the clustal programs.
Web-based clustalw can be used HERE.
ClustalX GUI (and also the command-line clustalw executable) is available for download HERE, along with example data.
Command-line clustalw2 is installed on rous
The major difference between the 3 options is the interface and the calculation of bootstrap values for the tree, which is only available in the command-line and GUI versions. Other details of running the program are the same. For this lesson, we will use the command-line version installed on rous to align these sequences.
Login to rous and copy the clustal training files to your home directory.
cp -r /net/n3/data/Teaching/IAP_2010_day3/clustal .
NOTE: -r means "recursive": copy the folder and everything inside it to the new location ".".
The lack of a trailing "/" on the clustal path ensures that the folder itself is copied, not just its contents.
enter the clustal directory:
cd clustal
view the raw sequence files:
cat *.pep
launch the clustalw application:
clustalw2
The following interactive menu appears, options 1 and 2 will be used in this demonstration. Additional information is available HERE
**************************************************************
******** CLUSTAL 2.0.12 Multiple Sequence Alignments ********
**************************************************************
1. Sequence Input From Disc
2. Multiple Alignments
3. Profile / Structure Alignments
4. Phylogenetic trees
S. Execute a system command
H. HELP
X. EXIT (leave program)
Your choice:
Select option 1 to load sequences and specify the file "tiny.pep". Before being returned to the original menu, you should see:
Sequences should all be in 1 file.
7 formats accepted:
NBRF/PIR, EMBL/SwissProt, Pearson (Fasta), GDE, Clustal, GCG/MSF, RSF.
Enter the name of the sequence file : tiny.pep
Sequence format is Pearson
Sequences assumed to be PROTEIN
Sequence 1: zf_12a1_a 28 aa
Sequence 2: zf_12a1_b 28 aa
Sequence 3: hs_a11 28 aa
Sequence 4: hs_22a1 28 aa
Sequence 5: zf_a11 28 aa
From the original menu, now select option 2. Multiple Alignments. This activates the Alignment menu:
****** MULTIPLE ALIGNMENT MENU ******
1. Do complete multiple alignment now Slow/Accurate
2. Produce guide tree file only
3. Do alignment using old guide tree file
4. Toggle Slow/Fast pairwise alignments = SLOW
5. Pairwise alignment parameters
6. Multiple alignment parameters
7. Reset gaps before alignment? = OFF
8. Toggle screen display = ON
9. Output format options
I. Iteration = NONE
S. Execute a system command
H. HELP
or press [RETURN] to go back to main menu
All default options are correct except option 9; select it to see the output format menu:
********* Format of Alignment Output *********
F. Toggle FASTA format output = OFF
1. Toggle CLUSTAL format output = ON
2. Toggle NBRF/PIR format output = OFF
3. Toggle GCG/MSF format output = OFF
4. Toggle PHYLIP format output = OFF
5. Toggle NEXUS format output = OFF
6. Toggle GDE format output = OFF
7. Toggle GDE output case = LOWER
8. Toggle CLUSTALW sequence numbers = OFF
9. Toggle output order = ALIGNED
0. Create alignment output file(s) now?
T. Toggle parameter output = OFF
R. Toggle sequence range numbers = OFF
H. HELP
Use toggle 4 to activate PHYLIP output, then hit return to go back to the alignment menu.
Select option 1 to do the alignment. Accept the default conditions and hit return until you end up at the main menu. Exit the program with "X".
cat tiny.[ap][lh][ny]
NOTE: Square brackets are regular expression syntax. For example [ap] = either an "a" or a "p" at that position.
This is the clustal format output:
CLUSTAL 2.0.12 multiple sequence alignment
zf_12a1_a VADLVFLVDGSWSVGRENFRFIRSFIGA--
zf_12a1_b KADLVFLIDGSWSIGDDSFAKVRQFVFS--
hs_22a1 HYDLVFLLDTSSSVGKEDFEKVRQWVAN--
hs_a11 YMDIVIVLDGSNSIYP--WVEVQHFLINIL
zf_a11 YMDIVIVLDGSNSIYP--WNEVQDFLINIL
*:*:::* * *: : :: ::
This is the phylip format output:
5 30
zf_12a1_a VADLVFLVDG SWSVGRENFR FIRSFIGA--
zf_12a1_b KADLVFLIDG SWSIGDDSFA KVRQFVFS--
hs_22a1 HYDLVFLLDT SSSVGKEDFE KVRQWVAN--
hs_a11 YMDIVIVLDG SNSIYP--WV EVQHFLINIL
zf_a11 YMDIVIVLDG SNSIYP--WN EVQDFLINIL
UNIX is a multi-user environment; how does it maintain security?
Every file has an owner and permissions.
There are three levels of ownership:
User
Group
Other
Three levels of permissions:
Read
Write
Execute
How is this useful? Imagine a lab: there are files that the entire lab should have access to. Put all of the lab's users into a lab group, and sharing a file within the lab just means making the lab group the group owner of the file. This is already what we do on Luria.
You can view the ownership and permissions of a file by running ls -l. Here's an example of the output of ls -l:
[asoberan@luria unixclass]$ ls -l
total 40
-rwxr-xr-x 1 asoberan ki-bcc 3845 Apr 28 21:48 arrayAnnot.txt
-rwxr-xr-x 2 asoberan ki-bcc 3134 Apr 28 22:11 arrayDat.txt
-rwxr-xr-x 2 asoberan ki-bcc 3134 Apr 28 22:11 arrayHard.txt
-rwxr-xr-x 1 asoberan ki-bcc 1634 Apr 28 21:48 arraylen.txt
lrwxrwxrwx 1 asoberan ki-bcc 12 Apr 28 22:13 arraySoft.txt -> arrayDat.txt
-rwxr-xr-x 1 asoberan ki-bcc 3128 Apr 28 21:48 beep.txt
-rw-r--r-- 1 asoberan ki-bcc 528 Apr 28 21:48 ex1.sh
-rw-r--r-- 1 asoberan ki-bcc 479 Apr 28 21:48 ex2.sh
-rw-r--r-- 1 asoberan ki-bcc 368 Apr 28 21:48 ex3.sh
-rwxr-xr-- 1 asoberan ki-bcc 340 Apr 28 21:48 test_1.fastq
-rwxr-xr-- 1 asoberan ki-bcc 340 Apr 28 21:48 test_2.fastq
Let's focus on the arrayDat.txt file.
-rwxr-xr-x 2 asoberan ki-bcc 3134 Apr 28 22:11 arrayDat.txt
asoberan ki-bcc describes the ownership of a file. In this case, the user asoberan and the group ki-bcc own the file.
-rwxr-xr-x describes the permissions that the owners of the file have.
The permissions can be broken down into three parts:
The user's permissions (-rwx): the user asoberan has read (r), write (w), and execute (x) permissions for this file.
The group's permissions (r-x): the group ki-bcc has read (r) and execute (x) permissions for this file.
Everyone else's permissions (r-x): anyone who isn't asoberan or in the group ki-bcc has read (r) and execute (x) permissions for this file.
To check what group you are in, you can use the groups command:
[asoberan@luria unixclass]$ groups
ki-bcc
# print working directory
pwd
# change directory to root directory
cd /
pwd
# change directory to home directory
cd ~
pwd
# change directory to root directory
cd /
pwd
The ‘read mapping’ problem is the alignment of relatively short reads to a reference genome.
Many short read mappers are currently available (Bowtie, BWA, SOAP, TopHat, etc...). They all try to overcome the following challenges presented by the mapping problem:
how do we align efficiently (in time and memory footprint) billions of short reads to a long reference genome?
how do we decide a specific placement when multiple, equally good alignments are available for the same read?
Some mappers are short-read aligners (they do not allow large gaps in their alignments), while others are spliced-read mappers (they allow large gaps corresponding to introns). The choice of the appropriate mapper depends on the particular dataset and application.
This tutorial uses a subset of the Illumina IDEA Challenge 2011 dataset (more information can be found ). You can copy the relevant fastq files by typing the following command on Luria:
The dataset consists of paired-end RNA-Seq data of breast carcinoma human cell line BT20. Because the reads are paired-end reads, for each DNA fragment we have sequence data from both ends and the sequence reads are stored in two separate files (one for the data from each end). Additional information about the formats used in this tutorial can be found below:
SAM/BAM format.
A complete guide on how to run bowtie can be found .
Bowtie can be run in 2 alignment modes (mutually exclusive):
n alignment mode: this is the default mode (specified with flag -n), and determines which alignments are valid according to the following policy:
alignments may have no more than N mismatches in the first L bases of the reads ("seed"). N and L are integers specified through the flags -n and -l.
the sum of the Phred quality values at all mismatched positions (not just in the seed) may not exceed E (set with -e).
v alignment mode:
alignments may have no more than V mismatches (set using the -v option), and quality values are ignored.
Several reporting modes are available, specified through various combinations of the options, such as -k,-a, -m, etc...
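For example, the two alignment modes described above might be invoked as follows (a rough sketch; the index name, read file, and parameter values are illustrative):
# -n mode: at most 2 mismatches in a 28-bp seed, quality sum at mismatched positions <= 70
bowtie -n 2 -l 28 -e 70 hg18 reads.fastq out_nmode.map
# -v mode: at most 3 mismatches anywhere in the read, quality values ignored
bowtie -v 3 hg18 reads.fastq out_vmode.map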
Bowtie requires an index of the reference sequence against which the reads are mapped. Pre-built indexes for several organisms' assemblies are available on rous or on the official Bowtie website (check the left-pane section "Pre-built Indexes"). If your reference sequence is not among the available pre-built indexes, you can run the following command and create one for it:
When the command has completed, the current directory will contain the index consisting of four new files (ref.1.ebwt, ref.2.ebwt, ref.rev.1.ebwt, and ref.rev.2.ebwt).
How to run Bowtie
If your dataset consists of single-end sequence reads, you can run Bowtie as follows:
If your dataset consists of paired-end sequence reads, you can run Bowtie as follows:
Some additional useful options are:
the flag -k specifies the maximum number of alignments to report if more than one valid alignment is found.
the flag -m suppresses all alignments for a particular read if more than m reportable alignments exist.
the flag -v specifies the maximum number of mismatches allowed in the entire length of the read.
the flag -n specifies the maximum number of mismatches in the high quality seed of the read.
the flag -l specifies the length of the seed to consider for each read.
the flag --best guarantees that reported singleton alignments are the best in terms of stratum (i.e. number of mismatches, or mismatches in the seed, depending on the alignment mode) and in terms of the quality values at the mismatched position(s).
Note that the option --best does not affect which alignments are considered "valid" by bowtie, only which valid alignments are reported. When --best is specified and multiple hits are allowed (via -k or -a), the alignments for a given read are guaranteed to appear in best-to-worst order in bowtie's output. bowtie is somewhat slower when --best is specified.
The last few lines written to the standard error file handle convey summary information about the run (see below). The other lines are written to the standard out file handle and show valid, reportable alignments found by Bowtie.
BWA is another fast aligner based on Burrows-Wheeler Transform (BWT). A complete guide on how to run BWA can be found .
BWA implements two different algorithms, both based on Burrows-Wheeler Transform (BWT).
ALN:
designed for short reads (up to ~200bp) with a low error rate (less than 3%).
performs gapped global alignment.
supports paired-end reads.
BWA-SW:
designed for longer reads with more errors.
performs heuristic Smith-Waterman-like alignment to find high-scoring local hits.
supports only single-end reads.
The database file (in fasta format) must be first indexed with the index command, which typically takes a few hours. Pre-built indexes are available on rous for the most common model organisms.
If your reference sequence is not among the available pre-built indexes, you can run the following command and create one for it:
For short reads (up to ~200bp) with low error rate (<3%), the algorithm to use is invoked with the command aln, which finds the suffix array (SA) coordinates of good hits of each individual read; the commands samse (for single-end reads) or sampe (for paired-end reads) are then used to convert these SA coordinates into chromosomal coordinates. Note that multi-hits are randomly chosen.
If your dataset consists of single-end sequence reads, you can run BWA as follows:
For example, to run BWA on the query BT20_1.fastq against the reference genome hs18, you must type the following commands:
If your dataset consists of paired-end sequence reads, you must run BWA in two steps as follows:
For example, to run BWA on the query BT20_1.fastq and BT20_2.fastq against the reference genome hs18, you must type the following commands:
Note that by default BWA ALN allows up to K mismatches in the seed (the first L positions of each read) and up to N mismatches in the whole read. These parameters can be changed using the appropriate options, as indicated below (a short example follows the list).
-l specifies the length of the seed.
-k specifies the maximum edit distance in the seed.
-n specifies the maximum edit distance in the whole sequence read.
-o specifies the maximum number of gaps to open.
-e specifies the maximum number of gaps to extend.
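A minimal sketch of how these options might be combined (the reference, read file, and values are illustrative):
# 32-bp seed, at most 2 mismatches in the seed and 4 in the whole read,
# allowing 1 gap open and 1 gap extension
bwa aln -l 32 -k 2 -n 4 -o 1 -e 1 ref.fa query.fastq > query.sai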
The BWA output in SAM format has the following specific optional fields:
NM Edit distance
MD Mismatching positions/bases
X0 Number of best hits
X1 Number of suboptimal hits found by BWA
XM Number of mismatches in the alignment
XO Number of gap opens
XG Number of gap extensions
XT Type: Unique/Repeat/N/Mate-sw
XA Alternative hits; format: (chr,pos,CIGAR,NM;)*
Prior to starting this demonstration, you should set up your unix directory according to the instructions in
Emacs is a powerful text editor with extensive functionality.
1) Control mode is activated by pressing and holding the <ctrl> key while pressing the second key. This process is written as:
C-key
This example shows the result after C-s (search):
2) Meta-mode is activated by pressing and releasing the <ESC> key followed by some other key that activates a sub-menu of commands. This process is written as:
ESC option
This example is what appears after entering ESC x (activate execute-extended-command menu)
check to see that you are in the IAP_2010/unix directory by entering the command:
the result should be (where USERNAME is your own unix username)
/home/USERNAME/IAP_2010/unix
If it is not, you can change to that directory by executing:
create a copy of the file replace.txt with:
Begin editing replace2.txt with:
Note: on new accounts a splash screen welcomes you to emacs; use C-l to exit the screen and proceed to editing.
Insert text by regular typing.
Delete letters with C-d
Delete lines with C-k
Find and replace text with ESC x, then type replace-string at the M-x prompt. Enter the search string, then return, then the replacement text.
Move to the end of the document with ESC >
Move back to the start of the document with ESC <
Play tetris with ESC x, tetris
Save and exit with C-x followed by C-c, then answer the prompts if changes were made.
To turn off the splash welcome screen, edit a file called .emacs to have the contents:
The UNIX filesystem does not have to be present on the local hard drive. UNIX supports filesystems that are present over the network, called Network Filesystems (NFS), so that you can mount a directory from another computer and traverse it just as if it were a normal directory on your local computer.
Our Luria cluster uses this technology to mount our many storage servers, which run NFS servers. Every storage server is mounted at /net.
Below is the output of the tree -L 1 /net command, showing the different storage servers mounted at /net:
From our perspective, these look like any other directory on Luria, but they're present on completely different computers.
Another protocol for sharing filesystems over the network is SMB, which is supported on UNIX systems through Samba. In addition to NFS, our storage servers run Samba servers, which allow you to mount them on your local laptop or PC. For instructions on doing so, refer to the page linked below:
# copy from stock directory to current directory
cp /net/ostrom/data/rowleybcc/charliew/old_teaching/Biol2Bioinfo/SAM/BT20_?.fastq .
# build bowtie index for reference ref.fa
bowtie-build ref.fa ref
# run bowtie single-end with default parameters (-S writes SAM output)
bowtie -S /home/Genomes/bowtie_indexes/hg18 BT20_1.fastq BT20_SE_bwt.sam
# run bowtie paired-end with default parameters (-S writes SAM output)
bowtie -S /home/Genomes/bowtie_indexes/hg18 -1 BT20_1.fastq -2 BT20_2.fastq BT20_PE_bwt.sam
# build index for reference sequence ref.fa
bwa index ref.fa
# run BWA single-end with default parameters
bwa aln ref.fa query.fastq > bwa.sai
bwa samse ref.fa bwa.sai query.fastq > bwa.sam
bwa aln /home/Genomes/hs18.fa BT20_1.fastq > BT20_SE_bwa.sai
bwa samse /home/Genomes/hs18.fa BT20_SE_bwa.sai BT20_1.fastq > BT20_SE_bwa.sam
# run BWA paired-end with default parameters
bwa aln ref.fa query_1.fastq > query_1.sai
bwa aln ref.fa query_2.fastq > query_2.sai
bwa sampe ref.fa query_1.sai query_2.sai query_1.fastq query_2.fastq > bwa.sam
bwa aln /home/Genomes/hs18.fa BT20_1.fastq > BT20_1.sai
bwa aln /home/Genomes/hs18.fa BT20_2.fastq > BT20_2.sai
bwa sampe /home/Genomes/hs18.fa BT20_1.sai BT20_2.sai BT20_1.fastq BT20_2.fastq > BT20_PE_bwa.sam
To change the owners of a file, you can use the following commands:
chown
This changes the user who owns a particular file or directory.
chgrp
This changes the group who owns a particular file or directory.
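A minimal sketch of both commands (the user, group, and file names are illustrative, and changing the owning user typically requires administrator privileges):
# make asoberan the owning user of arrayDat.txt
chown asoberan arrayDat.txt
# make ki-bcc the owning group of arrayDat.txt
chgrp ki-bcc arrayDat.txt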
To change the permissions that the owners of a file have, you use the chmod command.
chmod takes two arguments: the permissions to give a file, and the file to change the permissions of. The permissions are represented as a 3-digit number, where each digit represents the permissions to give the user, group, or others, respectively.
Read, write, and execute permissions are represented by the following numbers:
r - 4
w - 2
x - 1
If you want to give someone multiple permissions, you add the numeric representations of those permissions together. For example:
Read, write, execute (rwx) permissions = (4 + 2 + 1) = 7
Write, execute (_wx) permissions = (2 + 1) = 3
So let's say you want to give a file the following permissions:
The user that owns the file should be able to read, write, and execute the file. rwx = (4 + 2 + 1) = 7
The group that owns the file should be able to read and execute the file. r_x = (4 + 1) = 5
Anyone else should have no permissions for the file. ___ = 0
Then you'd run the following command:
chmod 750 arrayDat.txt
Remembering the syntax for this command can be quite cumbersome, so I recommend using a third-party website such as https://quickref.me/chmod.
[asoberan@luria]$ tree -L 1 /net
/net
├── bmc-lab1
├── bmc-lab2
├── bmc-lab2.mit.edu
├── bmc-lab3
├── bmc-lab4
├── BMC-LAB4
├── bmc-lab5
├── bmc-lab6
├── BMC-LAB6
├── bmc-lab7
├── bmc-lab8
├── bmc-pub10
├── bmc-pub14
├── bmc-pub15
├── bmc-pub16
├── bmc-pub17
├── bmc-pub9
├── gelbart
├── ostrom
└── rowley
21 directories, 0 files
Select an experiment, right click and activate Properties...
The page displayed shows information about that experiment. There is a place to enter notes or descriptions; the number of IDs in play is also listed, as are the Threshold and P-Threshold.
Threshold refers to the fold-change or log-ratio cutoff that is in play.
Select all three experiments in the ColonMet_Faked folder and activate Tools-->Set Threshold and Background List. Next to General, enter 1 for the Threshold and 0.1 for the P Threshold. Leave the background option at its default and click OK.
If an array is the source of the data, the background list can be set to the gene content of the array, otherwise the default behavior is to use everything.
Use the right-click, properties function to confirm that threshold settings have been set for each experiment.
Activate the ColonMet_Faked_FC experiment and refresh the page so that it is visible in the Active Data panel.
Select the experiment in the active data panel and click Tools-->Set Threshold and Background List. Click the statistics button. Counts of pass-threshold items in various categories are displayed. The signal distribution button will show the data for these faked values.
pwd
cd /home/USERNAME/IAP_2010/unix
cp replace.txt replace2.txt
emacs replace2.txt
(setq inhibit-splash-screen t)
Luria.mit.edu is a Linux cluster built in 2017, with the intent of providing computing resources for life sciences workloads for the Koch Cancer Research Institute and affiliated labs and DLCs at MIT.
It has 2272 cores, across 57 nodes:
Memory per node:
c1-9: 128GB
c10: 192GB
c11-40: 96GB
b1-16: 768GB or 384GB
Head node: 64GB
Cores and CPUs per node:
c1-4: 32 cores, Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz
c5-40: 16 cores, Intel(R) Xeon(R) CPU E5620 @ 2.40GHz
b1-16: 96 cores, Intel(R) Xeon(R) Gold 6240R CPU @ 2.40GHz or 5220R CPU @ 2.20GHz
Head node: 32 cores, Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz
To request a Luria account, please email [email protected] from your MIT email account with the following information:
Your lab affiliation.
What kind of work you are looking to accomplish on the Luria cluster.
Whether you have previous experience with Linux and cluster computing.
You must be connected to the MIT VPN if you are not connected to MITNet or the MIT Secure Wi-Fi.
Unfortunately, the most recent versions of Visual Studio Code's Remote-SSH plugin no longer support operating systems with older versions of glibc, which includes Luria. There's no easy way around this besides downgrading VSCode to a version from before March 2025, or changing your workflow.
If you still want to use VSCode but do not wish to downgrade it, you can mount your lab's storage over SMB, access your storage files in VSCode through your mounted folder, and manually SSH into Luria instead of using the Remote-SSH plugin by typing ssh <your username>@luria.mit.edu into the VSCode Terminal.
Keep in mind that mounting your storage server over SMB only gives you access to the files in your lab's storage account and files in your home directory's ~/data folder, not files directly inside of your home directory.
To connect to Luria using Visual Studio Code, follow the instructions at this website. When you reach the step "Connect to Host", use <your-kerberos-username>@luria.mit.edu as your host. Enter your Kerberos password when prompted.
You can then open any folder or workspace on Luria by going to File > Open and edit it straight in Visual Studio Code.
After successfully connecting, go to the VS Code top bar, press Terminal, then New Terminal as shown here. This will open a remote shell on the Luria cluster to run commands on.
SecureCRT is available to download from IS&T's website. To create a connection using SecureCRT, click Quick Connect, then fill in the following fields:
Enter your Kerberos password when prompted.
The Windows Terminal is available to download from the Microsoft app store. If you do not wish to download it, you can use the built-in Windows PowerShell to run the following commands. Linux and MacOS come with a terminal and SSH client built-in.
Run: ssh <your-kerberos-username>@luria.mit.edu
Enter your MIT Kerberos password when prompted.
Each user on Luria will have at least two storage areas: a home directory and a project directory. Both are backed up daily.
Each user has a home directory. The home directory is intended for commonly used scripts, environment configuration files, and your own software. A standard user account has a storage quota of 10GB in the home directory. Please do not store large data files under your home directory, as they can fill up the space quickly.
In your home directory, there is usually a symbolic link called data which points to your storage server. A symbolic link (symlink) is a file that points to a different location. Symbolic links are useful for saving space: instead of storing a large directory in a location such as your Luria home, you can use a symbolic link that points to that directory on a resource with more abundant storage.
Programs will automatically follow the symbolic link and behave as if it were actually the directory. Symlinks also serve as shortcuts to different locations in the filesystem. For example, the symbolic link called data in your home points to your directory in your lab's storage share. This is probably one of the following:
/home/<username>/data -> /net/bmc-lab2/data/lab/<lab-name>/<username>
/home/<username>/data -> /net/bmc-pub14/data/<lab-name>/users/<username>
Please do not delete the data symbolic link. The project directory is intended to hold all your data, which resides on either bmc-pubX or bmc-labX storage servers.
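For example, you can inspect where the data symlink points, or create a shortcut of your own (the directory names below are illustrative):
# show the target of the data symlink
ls -l ~/data
# create a symlink named "project" that points to a directory on the storage server
ln -s ~/data/my_project ~/project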
Your home directory also has five hidden directories: .local, .R, .singularity, .conda, and .jupyter. These directories are specifically named and required by their respective programs to store files, configurations, images, and packages. Since these can grow quite large, these directories are linked to your storage server.
Starting in February 2024, these directories point to the following hidden directories:
/home/<username>/.singularity -> /home/<username>/data/.singularity
/home/<username>/.conda -> /home/<username>/data/.conda
/home/<username>/.jupyter -> /home/<username>/data/.jupyter
/home/<username>/.local -> /home/<username>/data/.local
/home/<username>/.R -> /home/<username>/data/.R
For older accounts, these directories may look somewhat different, for example:
/home/<username>/.singularity -> /home/<username>/data/singularity
/home/<username>/.conda -> /home/<username>/data/conda
/home/<username>/.jupyter -> /home/<username>/data/jupyter
/home/<username>/.local -> /home/<username>/data/local
/home/<username>/.R -> /home/<username>/data/R
If your home directory exceeds the 20G quota, you will receive the error "Disk quota exceeded" when writing to your home directory. If this happens, you should examine which directories/files take most of the space using the du command, and then move those large directories/files to your project directory ~/data located on the storage server.
$ du -shc $(ls -A)
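Once a large directory has been identified, it can be moved to the project space (the directory name below is illustrative):
# move a large results directory out of the home quota and onto the storage server
mv ~/large_results ~/data/large_results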
While computing on Luria is free, data storage is charged at $100/TB/year. To reduce operational cost and better utilize the storage servers, we recommend you check your data periodically, removing data that are no longer needed and compressing files that you will not need for a while. Here is an example command to compress all fastq files in the current directory and sub-directories that are older than 180 days. Please remember to run it through either a batch script or an interactive login to a compute node.
$ find . -name \*.fastq -mtime +180 | parallel -j16 gzip -v {}
Each lab has a storage quota in the project directory. When the storage quota is exceeded for your lab, you will not be able to add more data to your storage server. As mentioned above, you will then need to do data cleanup and compression to free up space. In addition, you can send a request to [email protected] to increase your storage quota with a cost object provided and/or do TSM archival of your folder(s).
We don't offer tiered storage. We don't use the term cold storage, as all our storage servers are active. If you have data that you don't need for a while, we can do a TSM archive to MIT storage and then delete the data from our storage server. Archived data is stored in the MIT TSM archive and is no longer accessible from our cluster luria.mit.edu or our storage servers. If you need the data back on Luria and our storage servers, we can retrieve it from the MIT TSM archive. The MIT archive is a free service, but retrieval requires two hours of our labor at a rate of $90/hr.
Secure Copy or SCP is a means of securely transferring computer files between a local and a remote host or between two remote hosts.
It is based on the Secure Shell (SSH) protocol.
The term SCP refers both to the protocol and the program.
The command-line scp is one of the most common SCP programs and implements the SCP protocol to transfer files.
The syntax of the scp command is the same as the syntax of the cp command:
Copying file(s) to luria
# copying file `myfile.txt` from your local machine
# to your home directory in luria
scp myfile.txt <user>@luria.mit.edu:~/myfile.txt
# copying folder `myfolder` from your local machine
# to your home directory in luria
scp -r myfolder <user>@luria.mit.edu:~/
Copying file(s) from luria
# copying file `myfile.txt` from luria to
# your local machine
scp [email protected]:~/myfile.txt myfile.txt
Transferring Files Using Jupyter Notebook
You can upload and download files from the Jupyter notebook interface via your browser. Please follow the instructions at this page.
Luria uses module to manage software and its versions (see Management of Software Packages with module). Please note that not all software/versions available on rous are installed on Luria. To get a list of software packages and versions, run
module avail
While module helps you to locate software that has been loaded into LMOD, many packages installed on Luria are not configured in this way, or may be included with other packages. If a package that you are looking for doesn't show up using module, you can run the locate command to search the filesystem for packages by name.
For example, to find out if tabix is installed, run the command locate tabix. The result should show that the command tabix is available under samtools. You can then run module load samtools/1.3 to have tabix in your environment.
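Put together, the search-then-load pattern looks roughly like this (output will vary by system):
# search the filesystem for the tabix binary
locate tabix
# load the module that provides it
module load samtools/1.3
# confirm that tabix is now on your PATH
which tabix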
See Installation.
Luria uses the Slurm scheduler in order to manage jobs. In order to submit a job to run on the cluster compute nodes, you will need to create a batch script for Slurm and then submit it to the scheduler. For more information, please see our separate article on Slurm.
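A minimal batch script sketch is shown below; the resource values, job name, module, and command are illustrative and should be adjusted to your own workload:
#!/bin/bash
#SBATCH --job-name=example_job     # illustrative job name
#SBATCH --ntasks=1                 # a single task
#SBATCH --cpus-per-task=4          # CPUs for that task
#SBATCH --mem=8G                   # memory for the job
#SBATCH --time=01:00:00            # wall-clock limit
#SBATCH --output=example_%j.out    # %j expands to the job ID

# load the software you need, then run your command
module load samtools/1.3
samtools flagstat alignments.bam
Save this as, for example, myjob.sh and submit it with sbatch myjob.sh; you can check its status with squeue -u $USER.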
The Integrated Genomics and Bioinformatics core at MIT (IGB) offers a hands-on introductory session covering Linux usage and cluster computing using KI/BioMicro computational resources. This training session is targeted at users who are new to Linux and/or cluster computing. It is currently offered every 6 weeks, and the registration fee is managed using iLabs. If you are interested in the next training session or one-on-one training, please email [email protected] for details.
Every user has a unique username and is a member of one or more groups.
Every file and directory has an owner, a group, and a set of permission flags.
Flags specify permissions:
read, write and execute (rwx)
owner, group and others (ugo).
Programs need to have execute (x) permission to run in the shell.
ls -lt
Only the owner can change the permissions of a file.
chmod has 2 different syntaxes
Syntax 1
assign (=), give (+), or take away (-) permissions
who corresponds to --> u (user), g (group), o (other)
permission corresponds to --> read (r), write (w), execute (x)
chmod who=permission
chmod who+permission
chmod who-permission
ls -l catfile.txt
chmod g=rwx catfile.txt
ls -l catfile.txt
ls -l catfile.txt
chmod g-w catfile.txt
ls -l catfile.txt
ls -l catfile.txt
chmod o+x catfile.txt
ls -l catfile.txt
chmod n
numeric shortcuts to change the mode in a compact way.
format: chmod n filename
n is a 3-digit number; each digit takes a value from 0 to 7.
each digit is associated with user, group, and others, respectively.
examples:
ls -l catfile.txt
chmod 644 catfile.txt
ls -l catfile.txt
chmod 700 catfile.txt
ls -l catfile.txt
Galaxy is a framework for integrating computational bioinformatics tools. It provides a common interface that wraps diverse tools, as well as an environment for interactive analysis that transparently tracks the details of analysis steps, a workflow system for convenient reuse, data management, sharing, publishing, and more.
The web-based Galaxy site is freely available to the public. User registration at the site is optional but beneficial: registration is free and allows the user to create custom workflows and save analysis results.
Open this link in a new window or type http://main.g2.bx.psu.edu/ in your url bar. In the main page of Galaxy, you will find some introductory material, including useful screencasts that show how to use the various tools to perform many types of analyses, such as text manipulation, format conversion, NGS quality control, NGS Mapping, interval data manipulations, relational database operations, and many more.
There are three panes in the Galaxy desktop:
Tool pane (left).
Current Analysis pane (centre).
Job/History pane (right).
To analyze the data in Galaxy, you perform an ordered sequence of tasks, each consisting of applying a specific tool (chosen from the left pane) to a specific set of files (chosen in the central pane or among the results from previous steps as reported in the History pane).
The History pane shows the sequence of steps/tasks you are running. In particular, the color of the frame associated with a particular job has the following meaning:
gray : the job is in the queue.
yellow: the job is running.
green: the job is completed.
red: the job produced an error.
Terminal
Common name for the interface that allows you to input commands and see the output from commands.
Prompt
Sequence of one or more characters, such as $ or %, in combination with other customizable elements, such as the username or the current working directory.
It indicates readiness to accept commands.
Shell
Command line interpreter.
Users operate the computer by entering command input as text for a shell to execute or by creating text scripts of one or more such commands.
Unix comes with preloaded documentation, also known as "man pages".
Each page is a self-contained document consisting of the following sections:
NAME: the name and a brief description of the command.
SYNOPSIS: how to use the command, including a listing of the various options and arguments you can use with the command. Square brackets ([ ]) are often used to indicate optional arguments. Any arguments or options that are not in square brackets are required.
DESCRIPTION: a more detailed description of the command including descriptions of each option.
SEE ALSO: References to other man pages that may be helpful in understanding how to use the command in question.
To view a manual page, use the command man.
When viewing a man page, use the following keys to navigate the pages:
spacebar- view the next screen;
b (for "back") - view the previous screen;
arrow keys - navigate within the pages;
q (for "quit") - quit and return to the shell prompt.
For example, to view the man page associated with the command cp, type the following:
When you are not sure of the exact name of a command, you can use the apropos command to see all the commands with the given keyword on their man page.
For example, to see which commands are relevant to the task of copying, type the following:
find
Searches directories to find a directory or file.
grep
Searches a file for a specific pattern.
Can be used similarly to find, but it also searches file contents.
man command_name
man cp
apropos keyword
apropos copy
find ~ -name "arrayDat*"
grep -r "arrayDat*" .
pwd
This command name stands for "print working directory".
It displays on the terminal the absolute path name of the current (working) directory.
It is useful to locate where you are currently in the Unix tree.
# print working directory
pwd
ls
This command stands for "list".
It displays the content of a directory.
By default, the content is listed in lexicographic order.
# list the content of the current directory
ls
ls .
# list content of the parent directory
ls ..
# list the contents of your home directory from anywhere
ls ~
Shell scripts are sequential collections of commands that generally represent a workflow.
Shell scripts can use other commands as building blocks.
Suppose that occasionally you want the system to tell you some simple information, like your user id, the current directory and its content, and the current date. To avoid typing this sequence of commands every time you want this information, you can compose a short script that executes the commands in a specific order.
Open a text editor (e.g. nano).
Type the sequence of commands to execute in the specific order they should be executed.
clear
echo "Today's date is:"
date
echo
echo Your current directory is:
pwd
echo
echo The files in your current directory are:
ls -lt
echo
echo Have a nice day!
Save the file by giving to the script a meaningful name, for example info.sh.
Execute the command by invoking your shell (e.g. bash) followed by the script name. If you don't know what shell you are running, type "echo $SHELL" at the prompt.
bash info.sh
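As an optional alternative (a small sketch, not part of the original example): if the first line of info.sh is the shebang #!/bin/bash, you can make the script executable and run it directly without naming the shell.
# make the script executable, then run it
chmod +x info.sh
./info.sh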
uname
uname prints various pieces of system information. The possible pieces of information are listed in the uname manual page.
The -a flag will print all the information uname can print.
file
This will print out the file type of the given file.
ps
Shows the running processes on the system. Pass it the argument aux to get extensive information about every running process on the system.
htop
An interactive utility to see every running process on the system, as well as extensive information about the CPU, RAM, etc. usage on the system.
This utility is useful to check if you're worried about your own CPU and RAM usage on the head node.
The following sections explain how to view, create, and edit files all through the Unix shell. While it is possible to do these things straight from the shell, if you are using Visual Studio Code set up as described on the page linked below, you can view, create, and edit files through it instead of the shell.
uname -a
Linux luria 3.10.0-514.2.2.el7.x86_64 #1 SMP Tue Dec 6 23:06:41 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
file hello.txt
hello.txt: ASCII text
file my_directory.zip
my_directory.zip: Zip archive data, at least v2.0 to extract
# Prints information on every running process on the system
ps aux
# Pipe the output into grep and search for every process being run by a specific user
ps aux | grep "asoberan"
asoberan 6342 0.0 0.0 115716 2372 pts/124 Ss May02 0:00 /bin/bash
asoberan 7855 0.0 0.0 302216 4668 pts/124 Sl+ 15:25 0:00 srun --pty bash
asoberan 7873 0.0 0.0 96680 768 pts/124 S+ 15:25 0:00 srun --pty bash
Make sure to run all of the copy commands below, as we'll be using files from the Ostrom server later in the course.
mkdir
This command name stands for "make a directory".
It creates a new folder (or directory). If no path is specified, the new directory is created in the current directory.
# start in your home directory
cd ~
# create a directory named "unixclass"
# with a subdirectory named "testdir"
mkdir unixclass
mkdir unixclass/testdir
# change current directory directly to "testdir"
cd unixclass/testdir
# go to the parent directory (i.e. unixclass)
# and print the working directory
cd ..
pwd
touch
This command creates an empty file with the given name.
# Go to /home/<your username>/unixclass
cd ~/unixclass
# Create an empty file named "hello.txt"
touch hello.txt
# List the files in the directory to verify that worked
ls
cp
and mv
These commands stand for "copy" and "move," respectively.
They copy / move files and directories to the specified location.
Wildcards symbols such as "*" or "?" are commonly used to copy multiple files with a single command.
The symbol "*" stands for any number of alphanumeric characters.
The symbol "?" stands for a single alphanumeric character.
# Start in ~/unixclass
cd ~/unixclass
# copy the file named arrayDat.txt into your unix_class directory
cp /net/ostrom/data/dropbox/arrayDat.txt .
ls
# copy all the files whose names start with "array"
# into the current directory
cp /net/ostrom/data/dropbox/array* .
ls
# copy any file whose extension is "txt"
cp /net/ostrom/data/dropbox/*.txt .
ls
# copy all files
cp /net/ostrom/data/dropbox/* .
ls
rmdir
and rm
rmdir
only removes empty directories, rm
removes both directories and files.
rm
needs -r
flag to remove directories.
# Start in ~/unixclass
cd ~/unixclass
# Create a temporary directory
mkdir trash
# create copies of arrayDat.txt in the temporary directory
cp arrayDat.txt trash/arrayDat1.txt
cp arrayDat.txt trash/arrayDat2.txt
cp arrayDat.txt trash/arrayDat3.txt
cp arrayDat.txt trash/arrayDat4.txt
ls trash
# Try to delete the directory with `rmdir`
rmdir trash
# Try to delete the directory with `rm -r`
rm -r trash
Unix is an operating system (suite of programs), originally developed in 1969 at the Computing Science Research Group at Bell Labs. In 1991, Linus Torvalds released a Unix-like kernel called the Linux kernel that he subsequently open-sourced. Linux-based operating systems are the most widely used Unix-like operating systems, and they are what we use in our local high-performance computing cluster.
From here on out, any mention of UNIX should be regarded as meaning both UNIX and UNIX-like operating systems, primarily Linux-based operating systems.
There is a distinction between the computing paradigm popular at the time of UNIX's creation and the one embodied by UNIX: personal computing (one user on one machine) versus a shared system.
Whereas previous systems were meant to be used by one person at a time, UNIX has multi-user and multi-tasking support.
Aside from that feature, UNIX also had these features that made it popular:
network-ready (built-in TCP/IP networking makes it easy to communicate between computers).
very powerful programming environments (free of the many limits imposed by other operating systems).
robust and stable.
scalable, portable, flexible.
open source.
This example is inspired by a screencast published on the Galaxy website. It consists of combining exon information and SNP information, both represented as interval data.
1. Load SNP data from UCSC tables
On the Tool Panel, click on Get Data → UCSC Main Table Browser.
This tool allows you to upload data from the UCSC Tables.
Use the following parameters:
Group: Variation and Repeats
Track: SNP(130)
Region: chr19:1-1,000,000
Output format: BED
Send output to Galaxy: checked
Click "Get Output" button.
Select the radio button so that one BED record is created for the whole gene.
Click the button "send query to Galaxy"
With these parameters, this tool creates a BED file containing all the SNPs for the first 1M bases of chromosome 19.
Once the job is completed, change the name of the dataset to "SNPs chr19".
2. Load exon data from UCSC tables
On the Tool Panel, click on Get Data → UCSC Main Table Browser.
This tool allows you to upload data from the UCSC Tables.
Use the following parameters:
Group: Genes and Gene Prediction
Track: UCSC Genes
Region: chr19:1-1,000,000
Output format: BED
Send output to Galaxy: checked
Click "Get Output" button.
Select the radio button so that one BED record is created per coding exon.
Click the button "send query to Galaxy"
With these parameters, this tool creates a BED file containing all the exon information for the first 1M bases of chromosome 19.
Once the job is completed, change the name of the dataset to "exons chr19".
3. Join exon and SNP information
On the Tool Panel, click on Operate on Genomic Intervals → Join the intervals.
This tool allows you to join the information from two interval files based on the coordinates of each feature.
Select the SNP chr19 and the exons chr19 files as input.
Click on the "Execute" button.
Because some exons might contain multiple SNPs, the resulting output might be larger than either input file.
4. Find the number of SNPs per exon
On the Tool Panel, click on Join, Subtract and Group → Group.
This tool groups the data based on a given column and performs aggregation operations on the other columns.
Select data 3 as input.
Select column 4 (exon ID).
Add operation to count c4.
Click on the "Execute" button.
5. Find the exon with the most SNPs
On the Tool Panel, click on Filter and Sort → Sort.
This tool sorts the data based on the values in a chosen column.
Select data 4 as input.
Select column 2 as sorting key.
Click on the "Execute" button.
6. Find how many exons have a given number of SNPs
On the Tool Panel, click on Join, Subtract, Group → Group.
This tool groups the data based on a given column and performs aggregation operations on the other columns.
Select data 5 as input (sorted).
Select column 2 (the SNP count) as the grouping column.
Set the operation to "count" on column 1.
Click on the "Execute" button.
7. Filter exons with at least 10 SNPs
On the Tool Panel, click on Filter and Sort → Filter.
This tool filters the data, keeping only the lines that satisfy a given condition.
Select data 5 as input (sorted).
Set the condition to a SNP count of at least 10 (i.e. c2 >= 10).
Click on the "Execute" button.
8. Retrieve original information for exons
On the Tool Panel, click on Join, Subtract, Group → Join.
This tool joins two datasets based on the values of a chosen column from each.
This is equivalent to a relational join (not an interval join).
Select the exons with at least 10 SNPs as the first input.
Select the exon data for chr19:1-1,000,000 as second input.
Select column 1 (exonID) for the first file.
Select column 4 (exonID) for the second file.
Click the "Execute" button.
Now repeat this step but invert the order of the files. Note that this time the output is BED-formatted, whereas before it was a tabular file.
9. Display using the UCSC Browser
On the Data Panel on the right-hand side, click on the last job → Display at UCSC.
The User track shows the exons that have at least 10 SNPs in the region of chr19 considered.
Nano is freely available under the GPL for Unix and Unix-like systems.
It is a keyboard-oriented editor, controlled with key combinations, i.e. the "control" key (denoted by "^") PLUS certain letter keys.
^O : saves the current file
^W : goes to the search menu
^G : gets the help screen
^X : exits
Launch the editor by typing "nano" at the shell prompt. An overview of available commands is given at any time at the bottom of the editor screen.
nano
Open an existing file by typing "nano" followed by the file name.
nano file1.txt
If you modify the file, when you exit you will be asked whether to save the changes and the name of the file to save.
While holding down Ctrl key, press \
Enter your search string at prompt and press return
Enter your replacement string at prompt and press return
Respond to "Replace this instance" option menu:
Y to replace current instance
N to skip current instance
A to replace all instances
Note: The search string can also be a regular expression.
To carry out phylogenetic analysis it is important to begin with an alignment in which all gap-containing positions have been removed.
This can be accomplished using the tool.
There is a bug in the process, but the gap-free alignment can be loaded into clustalX, and then the "save sequences as Phylip format" function can be used.
The correct alignment can be created with a text editor. Another version of it, called tiny_nogap.phy, is in your clustal/phylip folder.
enter the phylip folder:
The first step is to create the bootstrap replicates. This is done with the phylip application seqboot.
when prompted, provide the name "tiny_nogap.phy".
At the interactive menu:
change option r to 10 for 10 replicates. NOTE: normally 1000 replicates are used for neighbor-joining analyses
press Y to accept changes
enter 111 as random number seed.
Move outfile to tiny_nogap.seqboot
The next step is to calculate the distances between the sequences in each replicate. This is done with protdist
when prompted, provide the name "tiny_nogap.seqboot".
At the interactive menu:
change option m to d for datasets and 10 for number of replicates.
press Y to accept changes
Rename the outfile
Build the neighbor-joining tree for each distance matrix with neighbor:
when prompted, provide the name "tiny_nogap.protdist".
At the interactive menu:
change option m to 10 for number of replicates.
enter 111 for random number seed
press Y to accept changes
Rename the output:
Create a consensus version of the outtree file with consense:
when prompted, provide the name "tiny_nogap.neightree".
At the interactive menu, default settings are ok so type "Y"
Rename the output:
Draw the tree with drawtree:
when prompted for a file enter tiny_nogap.neigh.constree
when prompted for a fontfile, enter:
At the interactive menu, select option V, then N to turn off preview. Then option Y
This results in a postscript plot of the tree. rename it to:
The contents of this file can be viewed with image-viewing software; it looks like this:
The bootstrap values can be viewed in the file tiny_nogap.neigh.constree, or graphically by opening this file in a tree viewer. With branch lengths selected, the view looks like this:
nano
A basic text editor found on most UNIX operating systems.
Gives instructions on how to use it at the bottom of the screen.
"^" stands for Ctrl.
Example output from running nano arrayDat.txt is shown below. Save any changes by exiting with ^X:
If you recall, you can mount the storage servers to your local machine using SMB. Since mounting the server makes it look like a local directory on your computer, you can use whatever text editing software that you're comfortable with to edit text files on the computing cluster.
GNU nano 2.3.1 File: arrayDat.txt
ProbeID Sample1 Sample2 Sample3 Sample4
1007_s_at 10.93 11.44 11.19 11.64
1053_at 8.28 7.54 8.06 7.32
117_at 3.31 3.41 3.13 3.13
121_at 4.42 4.32 4.46 4.63
1255_g_at 1.8 1.7 1.75 1.81
^G Get Help ^O WriteOut ^R Read File ^Y Prev Page ^K Cut Text ^C Cur Pos
^X Exit ^J Justify ^W Where Is ^V Next Page ^U UnCut Text ^T To Spell
CLUSTAL 2.0.12 multiple sequence alignment
zf_12a1_a VADLVFLVDGSWSVGRENFRFIRSFIGA--
zf_12a1_b KADLVFLIDGSWSIGDDSFAKVRQFVFS--
hs_22a1 HYDLVFLLDTSSSVGKEDFEKVRQWVAN--
hs_a11 YMDIVIVLDGSNSIYP--WVEVQHFLINIL
zf_a11 YMDIVIVLDGSNSIYP--WNEVQDFLINIL
*:*:::* * *: : :: ::
5 26
zf_12a1_a VADLVFLVDGSWSVGRFRFIRSFIGA
zf_12a1_b KADLVFLIDGSWSIGDFAKVRQFVFS
hs_22a1 HYDLVFLLDTSSSVGKFEKVRQWVAN
hs_a11 YMDIVIVLDGSNSIYPWVEVQHFLIN
zf_a11 YMDIVIVLDGSNSIYPWNEVQDFLIN
cd /home/USERNAME/clustal/phylip/
seqboot
cat outfile
mv outfile tiny_nogap.seqboot
protdist
cat outfile
mv outfile tiny_nogap.protdist
neighbor
cat outfile
cat outtree
mv outtree tiny_nogap.neightree
mv outfile tiny_nogap.neighout
consense
cat outfile
cat outtree
mv outtree tiny_nogap.neigh.constree
mv outfile tiny_nogap.neigh.consout
drawtree
/net/n3/data/Teaching/IAP_2010_day3/clustal/font1
NOTE: a series of different fonts are available within the phylip installation directory
mv plotfile tiny_nogap.tree.ps
This demonstration shows how to perform some basic bioinformatics tasks using simple UNIX commands.
The following examples use primarily two files: arrayDat.txt
and arrayAnnot.txt
. You can copy them using the following command on Luria:
# copy from stock directory to current directory
cp /net/ostrom/data/dropbox/arrayDat.txt .
cp /net/ostrom/data/dropbox/arrayAnnot.txt .
arrayDat.txt contains some microarray data, with rows corresponding to probe IDs and columns corresponding to four samples.
arrayAnnot.txt contains the annotation for the probe IDs, specifically the gene description (i.e. geneTitle) and the gene symbol (i.e. geneSymbol).
Note that in general there is a "many-to-one" correspondence between probe IDs and gene symbols; that is, more than one probe ID can be associated with the same gene symbol.
You can use this command to view the content of the top part of a file.
The switch "-n" allows you to specify how many lines to display, starting from the first. If you specify a negative number N, then all the lines except the bottom N will be displayed.
# print first 5 lines
head -n 5 arrayDat.txt
head -n 5 arrayAnnot.txt
# print everything except the last line
head -n -1 arrayDat.txt
head -n -1 arrayAnnot.txt
This command extracts portions of lines of a file.
The switch "-f" allows you to specify which field (by default column) to extract.
Multiple fields can be selected either by specifying a range (e.g. 1-3) or by listing the fields (e.g. 1,2,3).
# get 1st column (i.e. probeID)
cut -f 1 arrayDat.txt
# get 3rd column (i.e. geneSymbol)
cut -f 3 arrayAnnot.txt
# get the first three columns
cut -f 1-3 arrayDat.txt
# get the 1st and 3rd columns and redirect the output into a file (mapping probeID to geneSymbol)
cut -f 1,3 arrayAnnot.txt > probe2gene.txt
This command merges the lines of files and puts them side by side.
# put each column in a separate tmp files
cut -f 1 arrayAnnot.txt > tmp1
cut -f 2 arrayAnnot.txt > tmp2
cut -f 3 arrayAnnot.txt > tmp3
# merge lines in a new column order
paste tmp3 tmp1 tmp2 > arrayAnnotOrdered.txt
This command sorts the lines of a text file.
By default the sort is in lexicographic order according to the first field.
In lexicographic order, letters follow numbers: 0-9, then aA-zZ.
Note that numbers are treated as strings, so 10 comes before 2 because there is no positional weighting (the character 1 sorts before 2).
Other sorting criteria are available:
the switch -k lets you specify which field to use as key for sorting.
the switch -n specifies a numeric sort.
the switch -r specifies a sort in reverse order (either lexicographic or numeric).
the switch -R specifies a random sort.
# sort by 1st field (probeID)
sort probe2gene.txt
# sort by the 2nd field (geneSymbol)
sort -k 2 probe2gene.txt
# sort by geneSymbol in reverse order
sort -r -k 2 probe2gene.txt
# sort by 2nd field (default lexicographic sort)
sort -k 2 arrayDat.txt
# sort by 2nd field (numeric sort)
sort -n -k 2 arrayDat.txt
This command counts the lines, words and bytes in a text file.
It is often used in conjunction to other filtering operations to count the number of items that pass the filter.
# print number of lines, words and bytes
wc arrayDat.txt
# print number of lines only
wc -l arrayDat.txt
# print the number of fastq files in the current directory
ls *.fastq | wc -l
This command reports or filters repeated lines in a file.
It compares adjacent lines and reports those lines that are unique. Because repeated lines in the input will not be detected if they are not adjacent, it might be necessary to sort the input file before invoking this command.
The switch "-f" specifies how many fields (starting from the first one), to skip when performing the comparisons.
The switch "-c" specifies to return a count of how many occurrences for each distinct line.
The switch "-d" specifies to print only duplicates lines.
The switch "-u" specifies to print only unique lines (i.e. occurring only once).
The switch "-i" specifies a case insensitive comparison.
# print all distinct rows then count
uniq arrayAnnot.txt | wc -l
# count distinct rows without considering the 1st field (probeID)
uniq -f 1 arrayAnnot.txt |wc -l
# report distinct gene symbols and count number of occurrences
uniq -f 1 -c arrayAnnot.txt
# report repeated gene symbols
uniq -f 1 -d arrayAnnot.txt
# report list of unique gene symbols
uniq -f 1 -u arrayAnnot.txt
Acronym for "global regular expression print".
This command prints the lines of the input file that match the given pattern(s).
# print lines that match “chr” (lines with the words "chromosome" and/or "cytochrome" are matched).
grep chr arrayAnnot.txt
The switch "-w" specifies that the match has to occur with a whole word, not just part of a word. Thus, in the example input file, no line matches the pattern "chr" as whole word.
# print lines that match “chr” as a whole word
grep -w chr arrayAnnot.txt
The switch "-i" can be used to specify a case-insensitive match (by default this command is case sensitive).
In the example below, when the switch "-i" is used for the pattern "SS", lines containing the words "class", "repressor", "associated", "expressed" as well as the gene symbol "PRSS33" are matched.
# print lines that match “SS” (case sensitive)
grep SS arrayAnnot.txt
# print lines that match “SS” (case insensitive, that is "SS" or “ss” or “Ss” or “sS”)
grep -i SS arrayAnnot.txt
The switch "-c" returns the number of lines where the specified pattern is found.
# print how many lines match the pattern "orf"
grep -c orf arrayAnnot.txt
The switch "-v" performs a "reverse-matching" and prints only the lines that do not match the pattern.
# print lines that do NOT match the pattern "orf"
grep -v orf arrayAnnot.txt
The switch "-n" prints out the matching line with its number.
# precede the matching line with the line number
grep -n orf arrayAnnot.txt
When multiple patterns must be matched, you can use the switch "-f" and pass as argument a file containing the patterns to match (one per line).
# make a list with 5 gene symbols
cut -f 3 arrayAnnot.txt|head -n 5 > tmp
# use list "tmp" to match lines in arrayAnnot.txt
grep -f tmp arrayAnnot.txt
This command provides relational database functionality.
It merges the lines of two files based on the common value of a given field, also known as "key".
By default the first field of each file is considered as key.
To perform a relational join, where you merge all the lines that have identical values in the specified field of two files, the files themselves must be sorted on the key field. Unless otherwise specified in the command (with the option "--nocheck-order"), a warning is given when the files are not sorted on the key field.
# join two files (use default field as key)
join arrayDat.txt arrayAnnot.txt
join arrayAnnot.txt arrayDat.txt
The switch "-t" lets you specify what character to use as field separator in the input and output files. To specify a tab character, use the string '$\t'.
Note that by default the separator is a blank space. It is generally a good idea to specify a tab as the output separator so that if a field consists of multiple words separated by blank spaces, these words remain bundled in the same field and you can use the cut command to access specific fields.
In the examples below, a join with default settings yields an output where the fields are separated by blanks. If we try to extract the description (6th field), even if we specify that the delimiter is a blank space rather than a tab, we cannot extract the entire description.
# join two files (default separator is blank space)
join arrayDat.txt arrayAnnot.txt
# note how the joined fields are separated by blank spaces and the default settings of cut do not yield the desired output
join arrayDat.txt arrayAnnot.txt | cut -f 6
# note how specifying a blank space as delimiter for cut does not yield the desired output because the description field consists of multiple words
join arrayDat.txt arrayAnnot.txt | cut -d ' ' -f 6
# join two files (separator is a tab)
join -t $'\t' arrayDat.txt arrayAnnot.txt
join -t $'\t' arrayDat.txt arrayAnnot.txt | cut -f 6
The key field of each file can be specified using the switch "-1" and "-2" for the first and second file respectively. Fields are numbered starting from 1.
In the example below, we join the files probe2gene.txt and arrayAnnot.txt based on the gene symbol (2nd and 3rd field, respectively). Before performing the join, the files must be sorted according to their keys.
# sort each file by its key, the gene symbol (the 2nd and 3rd field, respectively)
sort -k 2 probe2gene.txt > tmp1
sort -k 3 -t $'\t' arrayAnnot.txt > tmp2
join -1 2 -2 3 -t $'\t' tmp1 tmp2
The switch "-o" lets you specify which fields in each file should be output.
Note that in the example below the output has repeated rows because the join was performed on non-unique keys (the gene symbol).
# specify which field to output for each file
join -1 2 -o '1.1 2.2 2.3 2.4 2.5' arrayAnnotOrdered.txt arrayDat.txt
join -1 2 -2 3 -t $'\t' -o '1.1 1.2 2.2' tmp1 tmp2
This command splits a file into a series of smaller files.
The content of the input file is split into lexically ordered files named with the prefix "x", unless another prefix is provided as argument to the command.
The switch "-l" lets you specify how many lines to include in each file.
# split the content into separate files of 50 lines each
split -l 50 arrayDat.txt
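If you pass a second argument, split uses it as the prefix of the output files instead of "x" (a small optional example; the prefix "chunk_" is arbitrary):
# split into files of 50 lines each, named chunk_aa, chunk_ab, ...
split -l 50 arrayDat.txt chunk_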
This command invokes a stream editor that modifies the input as specified by a list of commands.
The pattern replacement syntax is: s/pattern/replacement/
# substitute the word “chromosome” with the string “chr”
sed s/chromosome/chr/g arrayAnnot.txt
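Note that sed writes its result to standard output and leaves the input file unchanged; to keep the edited text, you can redirect the output to a new file (an optional illustration; the output filename is arbitrary):
# save the substituted text into a new file
sed s/chromosome/chr/g arrayAnnot.txt > arrayAnnotShort.txt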
awk is more than just a command, it is a text processing language.
The input file is treated as sequence of records. By default, each line is a record and is broken into fields.
An awk command is a sequence of condition-action statements, where:
condition is typically an expression
action is a series of commands
awk 'condition {action}' inputfile
Typically, awk reads the input one line at a time; when a line matches the provided condition, the associated action is executed.
There are several built-in variables:
field variables: $1 refers to the first field, $2 refers to the second field, etc.
record variable: $0 refers to the entire record (by default the entire line).
NR represents the current count of records (by default the line number).
NF represents the total number of fields in a record (by default the last column).
FILENAME represents the name of the current input file.
FS represents the field separator character used to parse fields of the input record. By default, FS is the white space (both space and tab). FS can be reassigned to another character to change the field separator.
# print every line of the input file (missing condition)
awk '{print $0}' arrayDat.txt
# print every line of the input file (default argument for print is $0)
awk '{print}' arrayDat.txt
# print 1st field only
awk '{print $1}' arrayDat.txt
# rearrange the fields, separated by a blank space
awk '{print $1, $3, $2}' arrayDat.txt
# rearrange the fields, separated by a tab
awk '{print $1, "\t", $3, "\t", $2}' arrayDat.txt
# print the number of fields for each record
awk '{print NF}' arrayDat.txt
# print the last field of each record
awk '{print $NF}' arrayDat.txt
# print a string
awk 'BEGIN {print "Hello world!"}'
#print the result of an arithmetic operation
awk 'BEGIN {print 2+3}'
awk 'BEGIN {print "2+3=" 2+3}'
# print first 5 lines (default action statement is to print)
awk 'NR < 6' arrayDat.txt
# print first line (i.e. column headers)
awk 'NR == 1' arrayDat.txt
# print first 5 records, excluding headers (NR ==1)
awk 'NR > 1 && NR < 7' arrayDat.txt
# print the total number of records
awk 'END {print NR}' arrayDat.txt
By using loops and conditional statements, awk allows you to perform quite sophisticated actions. For example, we can compute the sum or the mean of the intensity values for each probe across all samples:
Compute the sum of intensity values for each gene: for all records except the column headers, we initialize the variable "s" to zero, then we loop through all the values in a given record and sum them cumulatively. When all the values have been considered, we print the sum.
Compute the mean of intensity values for each gene: we compute the sum of the intensity values as before, but we divide the sum by the number of values considered before printing the result. Note that the number of values considered is the number of fields in each record minus one (the probe ID).
# print sum of values for each gene
awk 'NR > 1 {s=0; for (i=2; i<=NF; i++) s=s+$i; print s}' arrayDat.txt
# print mean of values for each gene
awk 'NR > 1 {s=0; n=NF-1; for (i=2; i<=NF; i++) s=s+$i; s=s/n; print s}' arrayDat.txt
Because the condition can be a pattern matching expression, awk can easily emulate grep.
# print the lines that match the string "orf"
awk '/orf/' arrayAnnot.txt
# print the probe IDs whose annotation contain the string "orf"
awk '/orf/ {print $1}' arrayAnnot.txt
# print number of lines matching “orf” (emulates grep -c)
awk '/orf/ {n++}; END {print n+0}' arrayAnnot.txt
Storage is charged at $100 per TB, per year, billed through iLab.
Each lab gets its own directory, accessible via the Luria cluster head node and through instruments and personal endpoints over SMB.
Storage is generally not accessible via the public internet, so use of the MIT VPN is required for off-campus connections.
To request access, please send an email to [email protected] and indicate your lab affiliation.
On Luria, you can access your lab data at /net/bmc-labX/data/lab/<share> or /net/bmc-pub14/data/<share>, where <share> is your lab name (your PI's last name in lower case, e.g. sharp or jacks). For example:
cd /net/bmc-lab2/data/lab/sharp
or
cd /net/bmc-lab5/data/kellis
Note: the above location is only an example. You should change the PATH to your own lab share.
Click on Finder to get a new Finder window.
Click on the Go menu on the top of the screen.
Choose Connect to Server under the Go menu.
Enter the address of the share, for example: smb://bmc-lab6.mit.edu/share
You should replace "bmc-lab6" with the actual storage server name (e.g. bmc-pub14 or bmc-lab5), and replace "share" with the lab or core name that you are looking to access (e.g., jacks, histology).
Click Connect
Enter your MIT Kerberos username (not email) and Kerberos password. These fields are case sensitive.
If you put in your MIT Kerberos username and password correctly but are still not able to log in, try appending win\ to your username and trying again. For example, change mykerberos to win\mykerberos.
Now your network drive should be connected. Browse the drive to access your files and folders.
Open the Windows File Explorer.
Click "Map network drive" on the toolbar to open a dialog box.
Choose a drive letter not being used for the network folder in the Drive drop-down list.
In the Folder box, enter the network share path name, e.g. \\bmc-lab3.mit.edu\<share>.
Make sure you use backslashes as in the above example, not forward slashes.
You should replace "bmc-lab3" with the actual storage server name (e.g., bmc-pub17), and replace <share>
with your lab name (your PI's last name in lower case, e.g. "sharp" or "jacks").
Check the "Connect using different credentials" box. Then, click Finish.
In the Enter Network Password dialog, enter WIN\ followed by your MIT Kerberos username (not email) and your Kerberos password.
Click OK to connect.
Default permission of new folders and files: owner read/write, lab members read only and no access for others.
Luria users: use the command chmod to change permissions. For example, use chmod -R g+w to add group write access for a folder:
chmod -R g+w /net/bmc-lab2/data/lab/labX/userY/projectZ
If you are collaborating with your lab members on a folder that needs to be group writable, it is recommended that you set umask 002 in your ~/.bashrc file so that new files/directories will be created with owner and group write privileges, but read-only access for others.
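For example, the relevant line in ~/.bashrc might look like this (a minimal sketch; adjust to your lab's needs):
# make new files and directories group-writable by default
umask 002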
Windows users: Right click your folder or file and then select properties. Go to the security tab to change permissions.
Mac users: To change permissions, access the data via the cifs protocol instead of smb, i.e., cifs://bmc-lab2.mit.edu/<share> instead of smb://bmc-lab2.mit.edu/<share>. Right click your folder or file and then select Get Info. Go to the Sharing & Permissions section to change custom access for group members.
Double-check your MIT Kerberos username and password. Remember, it's your username, not your email address.
Confirm that 1-4 hours have passed after you've been added to the appropriate Moira list that controls access to the storage target.
Confirm whether adding the domain prefix (WIN) makes a difference:
For Windows: Use WIN\your_kerberos_username
For MacOS: Use just your Kerberos username
Verify that you're connected to the MIT VPN if you're off-campus.
If you've recently changed your password, make sure you're using the new one.
If you're getting "path not found" or similar errors:
Verify the exact server name and share path. Common mistakes include:
Using forward slashes (/) instead of backslashes (\) on Windows
Mistyping the server name (e.g., "bmc-lab3" instead of "bmc-lab2")
Using the wrong share name (remember, it's usually your PI's last name in lowercase)
Check that you're using the correct protocol:
Windows: \\server_name\share_name
MacOS: smb://server_name/share_name
Ensure you have the necessary permissions to access the path. You may need to contact your lab administrator or [email protected] for access.
Check your network connection. Run a speed test to ensure you have adequate bandwidth.
If you're off-campus, verify that your connection and VPN connection are stable.
Try connecting at a different time. High network traffic during peak hours (10AM to 4PM) can affect performance.
After checking the above, try completely disconnecting your active connections. This means fully disconnecting any mounted network locations, and closing any files or terminal sessions that are accessing said storage target.
If you are still experiencing issues, we recommend trying the open-source SMB client browser Cyberduck, with these instructions. If using the Cyberduck client resolves your issue(s), then the issue lies with your client sessions, not the server.
If you continue to experience problems after trying these solutions, please contact [email protected] for further assistance. Be sure to include:
The exact error message you're seeing.
The steps you've already taken to troubleshoot.
Your operating system and version.
The server and share name you're trying to access.
Whether you are aware of other users in your lab or group experiencing the same issue, and if so, for how long.
pwd
This command name stands for "print working directory".
It displays on the terminal the absolute path name of the current (working) directory.
It is useful to locate where you are currently in the Unix tree.
# print working directory
pwd
cd
This command name stands for "change directory".
It changes your current working directory to the specified location.
Your home directory is referred to as "~" (tilde).
The current directory is referred to with a single dot ( ".").
The directory located one level up the current directory (also known as "parent directory") is referred to with two dots ("..").
The last visited directory is referred to with a hyphen ("-").
# go to root directory of the system and print the working directory
cd /
pwd
# go to the home directory and print the working directory
cd ~
pwd
# change directory using the absolute path and print the working directory
cd /net/bmc-pub14/data/
pwd
ls
This command stands for "list".
It displays the content of a directory.
By default, the content is listed in lexicographic order.
# list the content of the current directory
ls
# list content of the parent directory
ls ../
Command switches allow you to list the directory's content with more or less detail and sorted according to different criteria.
The switch "a" lists all files and directories, even those starting with "." that are normally hidden.
The switch "l" lists the content in "long" format, including the total size (in blocks), mode (permission), the number of links, the owner, the size, the date of last modification, and the file or subdirectory name. Note that the first hyphen is replaced with a “d” if the item is a directory.
The switch "t" sorts the content of the directory by time modified (most recently modified first).
The switch "S" sorts the content of the directory by size (smaller first).
The switch "r" reverses the order of the sort.
The switch "R" recursively lists subdirectories encountered.
The switch "u" specifies the use of time of last access of the file, instead of last modification, for sorting the content.
The switch "U" specifies the use of time of file creation, instead of last modification, for sorting the content.
The switch "F" displays a slash ("/") immediately after a directory, an asterisk ("*") after an executable, an at sign ("@") after a symbolic link.
# list all files
ls -a
# list in long format
ls -l
# list in long format, sorting by date
ls -lt
# list in reverse lexicographic order
ls -r
# list by size
ls -S
# list in long format according to last access
ls -lu
# list in long format according to the time of creation
ls -lU
# display “/” after a directory, “*” after an executable, “@” after a symbolic link
ls -F
To view the commands available on luria, execute the following (note the space-separated list of directories to list):
ls /bin/ /usr/bin/ /usr/local/bin/
mkdir
This command name stands for "make a directory".
It creates a new folder (or directory). If no path is specified, the new directory is created in the current directory.
The switch "-v" specifies a verbose mode: a message with the folder(s) created is printed on the screen.
# create a directory named "testdir1" with a subdirectory named "testdir2"
mkdir testdir1
mkdir testdir1/testdir2
# change current directory directly to "testdir2"
cd testdir1/testdir2
# go to the parent directory (i.e. testdir1) and print the working directory
cd ..
pwd
# create a new directory named "testdir3" with the verbose mode on
mkdir -v testdir3
cp
This command name stands for "copy".
It makes copies of files and directories to the specified location.
The switch "v" enables the verbose mode: messages describing the files copied are displayed on the screen.
The switch "i" enables the interactive mode: confirmation is requested before overwriting files.
Wildcards symbols such as "*" or "?" are commonly used to copy multiple files with a single command.
The symbol "*" stands for any number of alphanumeric characters.
The symbol "?" stands for a single alphanumeric character.
The following examples assume you have a directory named "unix_class" in your home directory:
# create a directory "unix_class" in your home directory and access it
mkdir ~/unix_class
cd ~/unix_class
# copy the file named arrayDat.txt into your unix_class directory
cp /net/ostrom/data/dropbox/arrayDat.txt ~/unix_class/
ls
# the above command is equivalent to the following:
cp /net/ostrom/data/dropbox/arrayDat.txt ./
ls
# use the interactive mode and type "y" to confirm the choice
cp -i /net/ostrom/data/dropbox/arrayDat.txt ~/unix_class/
ls
# use the verbose mode to see the file being copied
cp -v /net/ostrom/data/dropbox/arrayDat.txt ~/unix_class/
# make a local copy of the file call it "arrayDat1.txt"
cp arrayDat.txt arrayDat1.txt
ls
# copy all the files whose names start with "array" into the current directory
cp /net/ostrom/data/dropbox/UNIX/array* ./
ls
# copy any file whose extension is "txt"
cp -i /net/ostrom/data/dropbox/UNIX/*.txt ./
ls
mv
This command name stands for "move".
It moves (renames) files and directories.
Several switches are available to specify the behavior of this command, including "-i" (interactive mode) and "-v" (verbose).
# cp arrayDat.txt into arrayDat1.txt, then rename arrayDat1.txt as arrayDat2.txt
cp arrayDat.txt arrayDat1.txt
mv arrayDat1.txt arrayDat2.txt
ls
# rename in interactive mode, i.e. ask for confirmation before overwriting existing file
cp arrayDat2.txt arrayDat3.txt
ls
mv -i arrayDat2.txt arrayDat3.txt
ls
# rename in verbose mode, i.e. print information on the screen
mv -v arrayDat3.txt arrayDat4.txt
ls
rmdir
This command stands for "remove directory".
It deletes the specified directory.
The switch "-v" specifies a verbose mode: a message with the folder(s) deleted is printed on the screen.
Note that in general you cannot delete a directory that is not empty. The content of such a directory has to be deleted before the directory itself can be deleted.
# remove the folder "testdir2" in the location specified by the path
rmdir ~/testdir1/testdir2
# remove the folder "newdir1"
rmdir testdir1
# remove the folder "newdir3" using the verbose mode
rmdir -v testdir3
rm
This command name stands for "remove".
It deletes files and directories.
Several switches are available to specify the behavior of this command, including "-i" (interactive mode) and "-v" (verbose).
# create copies of arrayDat.txt
cp arrayDat.txt arrayDat1.txt
cp arrayDat.txt arrayDat2.txt
cp arrayDat.txt arrayDat3.txt
cp arrayDat.txt arrayDat4.txt
ls
# delete file
rm arrayDat2.txt
# delete file in interactive mode, i.e. ask for confirmation before removing
rm -i arrayDat3.txt
# delete file in verbose mode, i.e. prints the name of file deleted
rm -v arrayDat4.txt
The "rm" command can be used in conjunction with wildcard symbols to delete multiple files at once. Extreme caution should be used, as action are not often reveresible. It is good practice either to use the interactive/verbose mode or use the command "ls" with the intended pattern, before invoking the command "rm".
# create a copy of arrayDat1.txt, then delete files whose names start with "arrayDat",
# followed by a single digit and the extension "txt"
cp arrayDat1.txt arrayDat2.txt
rm -i arrayDat?.txt
The switch "-r" deletes the content of a directory recursively,
i.e. deletes the directory's files and subdirectories, including their content.
#create a subdirectory and copy *.txt files
mkdir testdir
cp *.txt testdir/
ls
ls testdir/*
# try to remove the directory with rmdir (it fails because the directory is not empty), then remove it with rm -r
rmdir testdir
rm -r testdir
ls
cat
This command name stands for "concatenate".
It concatenates the content of file(s) and prints it out.
The switch "-n" specifies to number each line, starting from 1.
The "cat" command is often used in conjuction with the pipe ("|") to view the content of long files.
The "cat" command can be used to copy and append the content of files using output redirection (">").
# view the content of arrayDat.txt
cat arrayDat.txt
# number each line of arrayDat.txt
cat -n arrayDat.txt
# copy the content of arrayDat.txt using output redirection
cat arrayDat.txt > file1.txt
cat arrayDat.txt > file2.txt
ls
# concatenate the content of the two files
cat file1.txt file2.txt
# append the content of file2.txt to file1.txt
cat file2.txt >> file1.txt
cat file1.txt
When used without argument, the "cat" command takes STDIN as default argument. You can use this behavior to type in brief information, rather than invoking a standard text editor.
# type the following command
cat > testfile.txt
# type any text, then hit Return, then press Ctrl-C
# view the content of the newly created file
cat testfile.txt
less
This command is a "pager" that allows the user to view (but not modify) the contents of a text file one screen at a time.
The space-bar is used to advance to the next page.
The key "q" is used to "quit" the pager.
The switch "-n" allows to view the content of a file starting at the specified line.
# view the content of arrayDat.txt, hit the space bar to advance, type "q" to quit
less arrayDat.txt
# view content of file starting at line 5
less +5 arrayDat.txt
head
This command displays the top part of a file.
By default, the first 10 lines of a file are displayed.
The switch "-n" specifies the number of lines to display.
The switch "-b" specifies the number of bytes to display.
# display the first lines of a file (10 lines by default)
head arrayDat.txt
#display the first 2 lines of a file
head -n 2 arrayDat.txt
tail
This command displays the bottom part of a file.
By default, the last 10 lines of a file are displayed.
The switch "-n" specifies the number of lines to display.
The switch "-b" specifies the number of bytes to display.
# display the last lines of a file (10 lines by default)
tail arrayDat.txt
#display the last 2 lines of a file
tail -n 2 arrayDat.txt
top
This command displays Linux tasks
The top program provides a dynamic real-time view of a running system
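For example (top -u limits the display to a single user's processes; replace username with your own username):
# interactive view of all running tasks (press q to quit)
top
# show only the processes owned by a specific user
top -u username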
du
This command stands for Disk Usage
It estimates file space usage
# print sizes in human readable format
du -h
df
This command stands for Disk Free
It reports file system disk space usage
# print sizes in human readable format
df -h
One very important utility is the shell.
Shell - user interface to the system, lets you communicate with a UNIX computer via the keyboard.
The shell interfaces you with the computer through three channels: standard input (STDIN), standard output (STDOUT), and standard error (STDERR). STDERR is the channel used for communicating program errors. We'll focus on STDIN and STDOUT.
STDIN is the channel, or stream, from which a program reads your input. When you type in your terminal, you're typing into the STDIN.
STDOUT is the channel, or stream, to which a program writes its output. When you run echo 'Hello, World!', the shell prints the output to STDOUT. In this case, STDOUT is the terminal screen.
[asoberan@luria net]$ echo 'Hello, World!'
Hello, World!
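As a small illustration of these channels (the file names here are arbitrary), both STDOUT and STDERR can be redirected to files:
# send STDOUT to a file instead of the terminal
echo 'Hello, World!' > out.txt
# send STDERR to a file (ls prints an error because the file does not exist)
ls no_such_file 2> err.txt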
STDIN and STDOUT are the ways that you and the shell communicate. But what is the shell doing when you tell it to run a command?
Let's say I want to run the echo command, as we did above. We type echo in the terminal, then press Enter. What the shell does is read the PATH environment variable, which is a system variable that lists the different paths that programs are stored in.
Using printenv to print the PATH environment variable. Notice how many paths in this environment variable are ones dictated by the FHS:
[asoberan@luria net]$ printenv PATH
/home/asoberan/.conda/envs/class/bin:/home/software/conda/miniconda3/condabin:/home/software/conda/miniconda3/bin:/home/asoberan/.local/bin:/opt/ohpc/pub/mpi/mvapich2-gnu/2.2/bin:/opt/ohpc/pub/compiler/gcc/5.4.0/bin:/opt/ohpc/pub/prun/1.1:/opt/ohpc/pub/autotools/bin:/opt/ohpc/pub/bin:/home/asoberan/.local/bin:/home/software/google-cloud-sdk/google-cloud-sdk-193.0.0/google-cloud-sdk/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin
Then, it will go through these paths and look for the program you are trying to run. For example, the echo command is stored in /usr/bin, a directory we can see in the PATH environment variable:
Using which to find the location of a program:
[asoberan@luria net]$ which echo
/usr/bin/echo
Once it finds it, it will run the program in a new process, wait for the program to give an exit status, then print any output of the program into STDOUT, in this case our terminal screen. Now, the shell is ready to run another command.
The shell is used to interface with a UNIX computer, but we're not going to go to the server room, plug in a keyboard, and start typing commands for our shell. We're going to use the Secure Shell (SSH) to remotely connect to Luria.
SSH is a program that lets you open a remote shell to your UNIX system. It's secure because your connection to the system is kept encrypted. For instructions on using SSH to connect to Luria, refer to the page below:
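As a quick illustration (username is a placeholder for your MIT Kerberos username; see the linked page for the full instructions):
ssh username@luria.mit.edu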
The following examples use the file isoforms_fpkm_track.txt, which you can copy by typing the following command on Luria:
This is incorrect! Will update this later.
# copy the file from the stock directory to the current directory
cp /net/ostrom/data/rowleybcc/charliew/old_teaching/Biol2Bioinfo/UNIX/isoforms_fpkm_track.txt .
The given file is tab delimited and represents one of the outputs of the transcriptome assembly tool Cufflinks.
A description of the columns is reported here for convenience, but more information can be found in the cufflinks manual.
tracking_id: A unique identifier describing the object (gene, transcript, CDS, primary transcript)
class_code: The class_code attribute for the object, or "-"
nearest_ref_id: The reference transcript to which the class code refers, if any
gene_id: The gene ID(s) associated with the object
gene_short_name: The gene short name(s) associated with the object
tss_id: The tss ID associated with the object, or "-"
locus: Genomic coordinates of the object
length: The number of base pairs in the transcript, or '-' if not a transcript/primary transcript
coverage: Estimate for the absolute depth of read coverage across the object
FPKM: FPKM of the object
FPKM_lo: The lower bound of the 95% confidence interval on the FPKM of the object
FPKM_hi: The upper bound of the 95% confidence interval on the FPKM of the object
status: Quantification status for the object. Can be one of: OK (deconvolution successful), LOWDATA (too complex or shallowly sequenced), HIDATA (too many fragments in locus), or FAIL (when an ill-conditioned covariance matrix or other numerical exception prevents deconvolution).
In this tutorial we will answer the following questions about the given file:
By counting how many distinct entries there are in the 1st and 4th column, we can determine the number of transcripts and genes respectively.
# count the number of isoforms
cut -f 1 isoforms_fpkm_track.txt | sort | uniq | wc -l
# count the number of genes
cut -f 4 isoforms_fpkm_track.txt | sort | uniq | wc -l
The FPKM values are reported in the 10th field. We can ignore the transcripts with zero FPKM value by using a simple awk statement.
# print isoforms with nonzero FPKM value
awk '$10 !=0 {print $0}' isoforms_fpkm_track.txt
Transcripts with status other than "OK" are generally not suitable for downstream analysis. We can filter these out by considering only those records whose last field matches the string "OK".
# print isoforms with quantification status OK
awk '$13 == "OK" {print $0}' isoforms_fpkm_track.txt
Note that using a naive grep statement might not always work because a naive grep match is performed on the entire line, rather than on a specific field.
Sort by increasing FPKM values
Because FPKM values are often reported in a scientific format called E notation, a simple numeric sort (specified with the switch -n) does not work properly because the character "e" in the notation is not interpreted as it should be (e.g. 2.5e+8 is equivalent to 2.5 x 10^8). The switch "-g" takes care of this by performing a general numeric sort that handles E notation.
# sort numerically by increasing FPKM value (E notation IS NOT handled correctly!)
sort -n -k 10 isoforms_fpkm_track.txt
# check the highest FPKM values
sort -n -k 10 isoforms_fpkm_track.txt | cut -f 10 | tail
# sort numerically by increasing FPKM value (E notation IS handled correctly!)
sort -g -k 10 isoforms_fpkm_track.txt
# check the highest FPKM values
sort -g -k 10 isoforms_fpkm_track.txt | cut -f 10 | tail
Imposing in a simple awk statement that the length must be greater than 125 and smaller than 10000 yields the records with transcript lengths meeting the specified criteria.
# print only transcripts with length between 125 and 10000
awk '$8 > 125 && $8 < 10000' isoforms_fpkm_track.txt
# verify the length of the shortest transcript
awk '$8 > 125 && $8 < 10000' isoforms_fpkm_track.txt | cut -f 8 | sort -n | head
# verify the length of the longest transcript
awk '$8 > 125 && $8 < 10000' isoforms_fpkm_track.txt | cut -f 8 | sort -n | tail
Suppose we want to split the locus annotation into chromosome, start position and end position. We can use an awk statement to split the content of the 7th field using the separators ":" and "-".
# split locus into chromosome, start and end
awk 'split($7, a, ":") split(a[2],b,"-") {print $1,a[1],b[1],b[2]}' isoforms_fpkm_track.txt
We can extract the records of those isoforms that are mapped within a specific genomic interval. For simplicity's sake, first we split the locus into chromosome, start and end, then we impose limits on the coordinates using awk.
# split locus into chromosome, start and end
awk 'split($7, a, ":") split(a[2],b,"-") {print $1,a[1],b[1], b[2]}' isoforms_fpkm_track.txt > tmp
# impose condition on start and end position (10,000,000 and 12,000,000)
awk '$3 > 10000000 && $4 < 12000000 {print}' tmp
Jobs are managed on luria.mit.edu using Slurm.
Slurm is an advanced job scheduler for a cluster environment.
The main purpose of a job scheduler is to utilize system resources in the most efficient way possible.
The number of tasks (slots) required for each job should be specified with the "-n" flag
Each node provides 16, 32, or 96 slots.
The process of submitting jobs to Slurm is generally done using a script. The job script allows all options and the programs/commands to be placed in a single file.
It is possible to specify options via command line, but it becomes cumbersome when the number of options is significant.
An example of a script that can be used to submit a job to the cluster is reported below. Start by opening a file and copy and paste the following commands, then save the file as myjob.sh or any other meaningful name. Note: Job names can not start with a number.
The first 5 lines specify important information about the job submitted; the rest of the file contains some simple UNIX commands (date, sleep) and comments (lines starting with #####).
The "#SBATCH" is used in the script to indicate an slurm option.
#SBATCH -N 1: You should always include this line exactly. The number followed by -N must always be 1 unless you run MPI applications which is rare for typical bioinformatics software.
#SBATCH -n: This is the number of tasks requested. The recommended maximum number allowed is 16 in the normal partition, i.e., don't ask for more than 16 tasks in your script unless you receive special instructions from the system administrator. It is important to request resources as accurately as you can. If possible, please do not request more than what you need, and do not request less than what you need. The best way to find out how much you need is through testing. While your job is running, you can ssh to the node and use the top command to see if it is using the requested resources properly. Note that what you requested from the slurm scheduler by -n is not the actual CPUs allocated by the OS.
#SBATCH --mail-user=[]: You must replace [] with your email address
#SBATCH -t [min] OR -t [days-hh:mm:ss]: Specifies the wall clock limit. It has a time limit of 14 days at maximum, i.e. a job cannot run for more than 14 days on Luria.
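Since the example script itself is not reproduced here, a minimal sketch based on the options described above might look like the following (the task count, email address, time limit, and sleep duration are placeholders you should adjust):
#!/bin/bash
#SBATCH -N 1
#SBATCH -n 4
#SBATCH --mail-user=username@mit.edu
#SBATCH -t 0-12:00:00
##### print the date, wait a minute, then print the date again
date
sleep 60
date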
Submit your job by executing the command:
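sbatch myjob.sh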
where myjob.sh is the name of the submit script. After submission, you should see the message: Submitted batch job XXX where XXX is an auto-incremented job number assigned by the scheduler. For example: Submitted batch job 3200863
To submit your job to a specific node, use the following command:
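sbatch -w cX myjob.sh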
where X is a number specifying the node you intend to use. For example, the following command will submit myjob.sh to node c5 : sbatch -w c5 myjob.sh
To submit your job while excluding nodes (for example exclude c[5-22]), use the following command:
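# exclude nodes c5 through c22 (one possible syntax)
sbatch --exclude=c[5-22] myjob.sh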
You should not run interactive jobs on the head node of Luria. The head node is shared by all users. An interactive job may negatively affect how other users interact with the head node or even make the head node inaccessible to all users. Thus, instead of running myjob.sh on the head node, you should run "sbatch myjob.sh". However, you can run an interactive job on a compute node. This can be done using command "srun --pty bash", which will open up a remote shell on a random compute node.
Then you can run program interactively. This is often useful when you are compiling, debugging, or testing a program, and the program does not take long to finish.
Sometimes your program (such as MATLAB or R) may need an X11 window for its graphical user interface; in that case you can still use the command srun --pty bash, but you will also need to install an X11 server such as Xming or XQuartz on your machine to display X windows and to enable X11 forwarding in your ssh client.
Remember to exit cleanly from interactive sessions when done; otherwise the session may be killed without notice.
A user can submit up to 1000 jobs at a time. Jobs are typically scheduled on a first-come, first-served basis. If you submit a lot of jobs at a time and take a lot of resources, others will have to wait until your jobs complete, which is not optimal for cluster usage. If you do need to submit a lot of jobs, please add these options in your job scripts.
The nice option lowers the job priority and the exclude option excludes half of the nodes from your jobs. This gives others' jobs a chance to run while still allowing you to run some of your jobs.
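For example, the two options might be added to a job script like this (a sketch only; the nice value and the node range are placeholders, and the recommended values may differ, so check with the system administrators):
#SBATCH --nice=100000
#SBATCH --exclude=c[5-22]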
To monitor the progress of your job use the command:
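squeue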
To display information relative only to the jobs you submitted, use the following (where username is your username):
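squeue -u username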
A useful tip on customizing the output of squeue
Get more information on a job
Any job run on the cluster is output to a slurm output file slurm-XXX.out where XXX is the job ID number (for example: slurm-3200707.out).
After submitting myjob.sh, any output that would normally be printed to the screen is redirected to slurm-XXX.out.
You can also redirect output within the submission script.
To stop and delete a job, use the following command:
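scancel XXX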
where XXX is the job number assigned by slurm when you submit the job using sbatch. You can only delete your jobs.
To check the status of host and its nodes, you can use the following command:
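sinfo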
states of nodes:
Customizing output
Slurm job arrays provide an easy way to submit a large number of independent processing jobs. For example, job arrays can be used to process the same workflow with different datasets. When a job array script is submitted, a specified number of array tasks are created based on the master sbatch script.
SLURM provides a variable named $SLURM_ARRAY_TASK_ID to each task. It can be used inside the job script to handle input/output for that task. For example, create a file named ex3.sh with the following lines.
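Since the example script is not reproduced here, a minimal sketch of what ex3.sh might contain is shown below (the array size and the input-file naming scheme are assumptions for illustration only):
#!/bin/bash
#SBATCH -N 1
#SBATCH -n 1
#SBATCH --array=1-10
##### use the task ID to pick this task's input file (hypothetical naming scheme)
echo "Task $SLURM_ARRAY_TASK_ID processing data_${SLURM_ARRAY_TASK_ID}.fastq"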
Create a SLURM job script named ex1.sh that processes a fastq file. Include the following lines in your job script. Determine the appropriate -n value. Use the top command to watch CPU and memory usage while the job is running on a compute node.
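A minimal sketch of how you might watch the running job (the node name c5 is an example; use squeue to find the node your job was actually assigned to):
# find which node your job is running on
squeue -u $USER
# open a shell on that node and watch your own processes
ssh c5
top -u $USER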
Now you want to process multiple fastq files and run the script in parallel. One way to do this is to pass the fastq filename as an argument and sbatch the script with that filename as the argument. Create a new script named ex2.sh and include the following lines.
You can then submit the script twice with different arguments.
Get an interactive terminal
You will get a node allocated by the slurm scheduler. For example, c2.
Start the notebook on the allocated node (e.g. c2).
Open an SSH connection to luria.mit.edu and the node (e.g. c2) from your local machine, i.e. from your desktop or laptop SSH client in either the Terminal (Mac) or PowerShell (Windows). Replace username with your own username, and c2 with the actual compute node.
The above commands are actually running
It is likely that some other user has taken port 8888 ($HEAD_NODE_PORT) on the head node. In that case, you will get an error "bind: Address already in use". You should then change $HEAD_NODE_PORT from 8888 to a different port such as 8887 or 8886.
You can also change $MY_MACHINE_PORT and $COMPUTE_NODE_PORT, but that is only needed if you have another process that has taken 8888 on your local machine, or another user happens to take 8888 on the same compute node.
Tunnel from your local machine (either Windows or Mac) to Jupyter notebook running on $COMPUTE_NODE_NAME
Direct the browser on your local machine to http://localhost:8888 (or whichever $MY_MACHINE_PORT you chose).
Close connection when finished
There is no RStudio Server installed on Luria, but you can run RStudio Server from a Singularity image using a wrapper script (rstudio.sh); see the example commands below. The steps of getting an interactive terminal and opening an SSH tunnel from your local machine to the allocated compute node are similar to the Running Jupyter notebook steps above, but with a different module (module load singularity) and a different default port (8787). Here, local machine refers to your Mac or Windows PC; run the ssh command from the Terminal (Mac) or PowerShell (Windows). For example:
Tip 1: To allow the RStudio Server Singularity container to access your data located on your storage server (e.g. rowley or bmc-lab2), you need to edit the rstudio.sh file to bind your data path. At the end of the rstudio.sh script, you will see a singularity exec command with many --bind arguments. In the script, add additional --bind arguments, for example --bind /net/rowley/ifs/data/<labname>/<username> or --bind /net/bmc-lab2/data/lab/<labname>/<username>, where <labname> and <username> are placeholders you need to replace with actual values.
Tip 2: Ignore the "open an SSH tunnel" command printed in the rstudio.sh standard output; instead, use the ssh command from the example above on your local machine.
Tip 3: If someone else has taken the port 8787 on the head node of Luria, you will get an error like "bind: Address already in use" when you run the ssh command from your laptop. In that case, choose a different port number, e.g. 8777 (please refer to the previous section on Jupyter notebook for an explanation of port numbers). For example
Alternatively, you can choose a different port number before starting the rstudio.sh script on the compute node, for example
Tip 4: The script uses a custom image with Seurat dependencies pre-installed. You can select your own R version or image based on the documentation of the
Tip 5: If you are using Windows SecureCRT to connect to Luria, you can set up Port Forwarding (Tunneling). Select Options -> Session Options, click on "Port Forwarding", and then click the "Add" button to add a forwarding setup. You will get the Local Port Forwarding Properties dialog. Choose a Name, and then set the port for both Local and Remote, for example 8777. On the head node of Luria, you will also need to run
An example MATLAB script: matlabcode.m
Note: fname needs to be changed accordingly.
A shell job submission script that submits matlabcode.m to a compute node:
To automate a MATLAB script, the interactive session can be turned off.
On a UNIX system, MATLAB uses the X-Windows Interface to interact with the user. In a batch execution, this communication is not necessary.
We don't need the initial "splash" which displays the MATLAB logo. We can turn this off with the -nosplash option.
We also don't want the interactive menu window to be set up, which can be suppressed with the -nodisplay option.
Two other options may be useful for suppressing visual output and logs: the -nojvm option ("no Java Virtual Machine") and -nodesktop.
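Putting these together, a batch invocation looks like the following (a sketch; the module version matches the submission script shown later on this page, and output is just a log filename):
module load matlab/2011b
matlab -nosplash -nodisplay -nojvm -nodesktop < matlabcode.m > output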
Many Unix systems come preinstalled with crontab, a tool that generates a file where each row is a cron job, i.e. a program or script set to run at a specific time or interval.
Creating cron jobs is useful for periodically running programs. For example, you could create a script that searches your directory for files older than 1 year, and then create a cron job that runs this script once a month.
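A minimal sketch of such a script (the paths and the one-year threshold are only examples):
#!/bin/bash
# report files under your data directory that have not been modified in over a year
find ~/data -type f -mtime +365 > ~/old_files_report.txt
A crontab entry could then run this script once a month, using a schedule expression as described below.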
To specify the schedule on which a cron job should run, you use a cron schedule expression. Cron schedule expressions have the following format:
The first field represents the minute at which the cron job should run, the second the hour, the third the day of the month, the fourth the month, and the fifth the day of the week.
A * indicates "any". So a * in the month position means the job runs in any month.
Prefixing a number with */ means "every". So a */5 in the minute position means the job runs every 5th minute, i.e. every 5 minutes.
The following expression would make the cron job run at 6:00 AM every day, i.e. the 0th minute of the 6th hour, on any day of the month, in any month, and on any day of the week.
The following expression would make the cron job run every 30 minutes on Saturday, i.e. every 30th minute of any hour, on any day of the month, in any month, on the 6th day of the week.
It's easy to get confused when writing cron schedule expressions, so consider using an online cron expression builder that constructs the expression for you and describes it back in plain language.
Once you have a cron schedule expression, you can create a cron job in your crontab using the command crontab -e, which will open an editor for you to add your cron job. Put your cron job on an empty line with the following format:
#!/bin/bash
#SBATCH -N 1 # Number of nodes. You must always set -N 1 unless you receive special instruction from the system admin
#SBATCH -n 1 # Number of tasks. Don't specify more than 16 unless approved by the system admin
#SBATCH --mail-type=END # Type of email notification- BEGIN,END,FAIL,ALL. Equivalent to the -m option in SGE
#SBATCH --mail-user=[] # Email to which notifications will be sent. Equivalent to the -M option in SGE. You must replace [] with your email address.
#############################################
echo print all system information
uname -a
echo print effective userid
whoami
echo Today date is:
date
echo Your current directory is:
pwd
echo The files in your current directory are:
ls -lt
echo Have a nice day!
sleep 20
sbatch myjob.sh
sbatch -p [normal] -w [cX] [script_file]
sbatch --exclude c[5-22] myjob.sh
srun --pty bash
#SBATCH --nice=100000
#SBATCH --exclude=c[5-22]
squeue
squeue -u username
alias qf='squeue --format="%.18i %.16j %.8u %.8T %.10M %.4C %.20V %.8R"'
qf
scontrol show job <jobid>
scancel XXX
sinfo -N
mix : consumable resources partially allocated
idle : available to requests for consumable resources
drain : unavailable for use per system administrator request
drng : currently executing a job, but will not be allocated to additional jobs.
alloc : consumable resources fully allocated
down : unavailable for use. Slurm can automatically place nodes in this state if some failure occurs.
alias qn='sinfo -N --format="%.8N %.9P %.13C %.8O %.8e %.8m %.8T"'
qn
#!/bin/bash
#SBATCH -N 1
#SBATCH -n 4
#SBATCH --array=1-2
module load fastqc/0.11.5
module load bwa/0.7.17
FASTQDIR=/net/rowley/ifs/data/dropbox/UNIX
WORKDIR=~/data/class
mkdir -p $WORKDIR
cd $WORKDIR
FILE=$(ls $FASTQDIR/*.fastq | sed -n ${SLURM_ARRAY_TASK_ID}p)
fastqc -o $WORKDIR $FILE
bwa mem -t4 -o $(basename $FILE).sam /home/Genomes/bwa_indexes/mm10.fa $FILE
module load fastqc/0.11.5
module load bwa/0.7.17
mkdir -p ~/data/class
cd ~/data/class
fastqc -o ~/data/class /net/rowley/ifs/data/dropbox/UNIX/test_1.fastq
bwa mem -t16 -o ex1.sam /home/Genomes/bwa_indexes/mm10.fa /net/rowley/ifs/data/dropbox/UNIX/test_1.fastq
module load fastqc/0.11.5
module load bwa/0.7.17
FILE=$1
WORKDIR=~/data/class
mkdir -p $WORKDIR
cd $WORKDIR
fastqc -o $WORKDIR $FILE
bwa mem -t16 -o $(basename $FILE).sam /home/Genomes/bwa_indexes/mm10.fa $FILE
sbatch ex2.sh /net/rowley/ifs/data/dropbox/UNIX/test_1.fastq
sbatch ex2.sh /net/rowley/ifs/data/dropbox/UNIX/test_2.fastq
$ srun --pty bash
$ module load python3/3.6.4 # skip this if you have your own jupyter notebook installed through conda, pip or any other approach
$ export XDG_RUNTIME_DIR="" # skip this if you have your own jupyter notebook installed through conda, pip or any other approach
$ jupyter notebook --no-browser --port=8888 # 8888 is the port running on the compute node $COMPUTE_NODE_PORT
$ MY_MACHINE_PORT=8888
$ HEAD_NODE_PORT=8888
$ COMPUTE_NODE_PORT=8888
$ COMPUTE_NODE_NAME=c2
$ ssh -t [email protected] -L ${MY_MACHINE_PORT}:localhost:${HEAD_NODE_PORT} ssh ${COMPUTE_NODE_NAME} -L ${HEAD_NODE_PORT}:localhost:${COMPUTE_NODE_PORT}
$ ssh -t [email protected] -L 8888:localhost:8888 ssh c2 -L 8888:localhost:8888
ctrl-c
Run on head node: $ srun --pty bash # Get interactive terminal
Run on compute node (e.g. c7): $ module load singularity/3.5.0 # load singularity module on computing node
Run on compute node (e.g. c7): $ IMAGE=pansapiens/rocker-seurat:4.1.2-4.1.0 ./rstudio.sh # Launch wrapper script on computing node (which downloads the Singularity image only once if run for the first time)
Run on your local machine: % ssh -t [email protected] -L 8787:localhost:8787 ssh c7 -L 8787:localhost:8787 # replace username and c7 with actual values
Run on your local machine: open a browser and point URL to http://localhost:8787
Run on your local machine: % ssh -t [email protected] -L 8787:localhost:8777 ssh c7 -L 8777:localhost:8787 # replace username and c7 with actual values
Run on your local machine: open a browser and point URL to http://localhost:8787
Run on compute node (e.g. c7): $ export RSTUDIO_PORT=8777
Run on compute node (e.g. c7): $ IMAGE=pansapiens/rocker-seurat:4.1.2-4.1.0 ./rstudio.sh
Run on your local machine: % ssh -t [email protected] -L 8777:localhost:8777 ssh c7 -L 8777:localhost:8777 # replace username and c7 with actual values
Run on your local machine: open a browser and point URL to http://localhost:8777
Run on head node: $ ssh c7 -L 8777:localhost:8787 # replace c7 with actual node where the Rstudio server is running
Run on your local machine: open a browser and point URL to http://localhost:8777
fprintf('create a 3 by 3 random matrix X\n')
X=rand(3)
fprintf('create a 3 by 4 random matrix Y\n')
Y=rand(3,4)
fprintf('calculate the product of matrices X and Y\n')
Z=X*Y
fprintf('calculate the inverse of matrix X\n')
A=inv(X)
fprintf('transpose matrix Z\n')
B=Z'
fprintf('find out the toolboxes installed on rous\n')
ver
fprintf('find out the location of matlab toolboxes on rous\n')
matlabroot
fprintf('find out how to use Matlab Bioinformatics Toolbox\n')
help bioinfo
fprintf('A brief Example of using Matlab Bioinformatics Toolbox\n')
fprintf('Load data into the Matlab enviroment\n')
load yeastdata.mat
fprintf('Get the size of the data\n')
numel(genes)
fprintf('Display the 15th gene\n')
genes{15}
fprintf('Display the expression of the 15th gene\n')
yeastvalues(15,:)
fprintf('Display the time series in hours\n')
times
fprintf('Plot expression profile for gene15\n')
figure1=figure
plot(times, yeastvalues(15,:))
xlabel('Time Hours')
ylabel('Log2 Relative Expression Level')
fname='/home/duan/course/2015/June'
saveas(figure1,fullfile(fname,'YAL054C_expression_profile'),'jpg')
#!/bin/sh
#$ -S /bin/sh
#$ -cwd
#$ -V
#$ -m e
#$ -M [email protected]
#$ -pe whole_nodes 1
#############################################
module load matlab/2011b
matlab -nosplash -nodisplay -nojvm -nodesktop <matlabcode.m >output
* * * * *
(min) (hour) (day of the month) (month) (day of the week)
0 6 * * *
*/30 * * * 6
0 6 * * * /home/asoberan/your/script.sh
The following examples use the file myfile.sam, which you can copy by typing the following command on Luria:
# copy the file from the stock directory to the current directory
cp /net/ostrom/data/rowleybcc/charliew/old_teaching/Biol2Bioinfo/SAM/myfile.sam .
As revealed by the extension, the file is in SAM format.
The SAM format stores sequence data in a series of tab delimited columns.
BAM, the sister version of a SAM file, contains the same information but in a compressed, indexed, binary form that is not human readable.
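For reference, converting between the two formats is typically done with samtools (a sketch, assuming samtools is available on your PATH or through a module):
# convert SAM to a sorted BAM and index it
samtools sort -o myfile.bam myfile.sam
samtools index myfile.bam
# convert BAM back to SAM, keeping the header
samtools view -h -o myfile_back.sam myfile.bam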
Basic information on SAM files can be found here.
The official specifications of SAM and BAM format are available here.
In this tutorial we will answer the following questions about the given SAM file:
You can look at the content of the file by using the command more, head, cat, or by opening the file itself with a text editor such as nano.
# display the content of the file on the screen, one page at a time
more myfile.sam
# display the content of the whole file on the screen
cat myfile.sam
# display the first 10 lines
head myfile.sam
# open the file with the text editor nano
nano myfile.sam
Because in a SAM formatted file each line corresponds to one reported alignment, we just need to count the number of lines in the file to get the number of alignments.
# how many alignments?
wc -l myfile.sam
Given that in a generic SAM file any sequence read can have more than one reported alignment, the number of alignments is not necessarily the number of reads in the file. However, each sequence read is associated with a unique header, so we can count the number of distinct headers to determine how many reads are represented in the file.
# how many reads?
cut -f 1 myfile.sam | sort | uniq | wc -l
Because two distinct reads could have identical sequence (by chance due to their short length, or by being PCR duplicates), the number of reads is not necessarily the same as the number of distinct sequences in the file, so we need to count the number of distinct occurrences in the 10th field.
# how many sequences?
cut -f 10 myfile.sam | sort | uniq | wc -l
Uniquely mapped reads are generally understood as reads that have only one reported alignment. To determine the number of reads with only one reported alignment, we can count how many headers occur only once in the file. Note that without the switch "-u", repeated headers would also be reported once, yielding an incorrect count.
# how many reads are uniquely mapped?
cut -f 1 myfile.sam | sort | uniq -u | wc -l
Multi-hits are those reads that have more than one reported alignment. By counting how many distinct headers are repeated in the file, we can determine the number of multi-hits.
# how many reads are multi-hits?
cut -f 1 myfile.sam | sort | uniq -d | wc -l
To count the number of alignments each read has reported in the file, we count how many times each header appears in the file.
# how many alignments are there for each read?
cut -f 1 myfile.sam | sort | uniq -c
To get the top 10 multi-hit reads, we count the number of alignments for each header, then we perform a numerical, reverse sort on the number of occurrences.
# top 10 reads that have the most reported alignments
cut -f 1 myfile.sam | sort | uniq -c | sort -nr | head
Another way to get the top 10 multi-hit reads is to consider the alignment tag of the form "NH:i:" and use the command grep with a regular expression. Note that the sorting occurs on the tag itself because of the "-o" switch in the grep command, and specifically on the number of hits reported in the tag because of the "-k" switch in the sort command (i.e. we ignore the first 5 characters of the tag, which are common to all tags, and sort on the number of hits itself). Another caveat is the "-n" switch in the sort command: if omitted, the sort is lexicographic and the top multi-hit reads reported would be those whose hit count begins with 9.
# top 10 reads that have the most reported alignments
grep -o "NH:i:[[:digit:]]*" myfile.sam | sort -n -k 1.6 | tail
To determine what chromosomes are mapped in the file, we extract the chromosome label (corresponding to the 3rd field), then apply the sort and uniq commands. To print the chromosomes in numerical order, we use the switch "-n" on the labels, skipping the first three characters ("chr").
# what chromosomes are there?
cut -f 3 myfile.sam | sort | uniq
# what chromosomes are there (in numerical chromosome order)?
cut -f 3 myfile.sam | sort | uniq | sort -k 1.4 -n
To list the chromosomes from the most represented to the least represented, we use the switch "-n" on the number of occurrences for each chromosome. Note that without the "-n" switch, the sorting of the chromosome frequencies would be lexicographic rather than numeric, so chrX would appear before chr19 because, in reverse lexicographic order, the string "99" comes before "87937".
# count the occurrences by chromosome
cut -f 3 myfile.sam | sort | uniq -c
# count the occurrences by chromosome, then sort from most to least abundant
cut -f 3 myfile.sam | sort | uniq -c | sort -nr
Using the sorted list obtained in the step above, we consider the first reported chromosome as the most abundant and the last reported chromosome as the least abundant. Note that we can either choose to sort the list in reverse order or use the commands head/tail appropriately.
# count the occurrences by chromosome
cut -f 3 myfile.sam | sort | uniq -c
# count the occurrences by chromosome, then sort from most to least abundant
cut -f 3 myfile.sam | sort | uniq -c | sort -nr
# which chromosome is the most represented?
cut -f 3 myfile.sam | sort | uniq -c | sort -nr | head -n 1
# which chromosome is the least represented?
cut -f 3 myfile.sam | sort | uniq -c | sort -nr | tail -n 1
# which chromosome is the least represented?
cut -f 3 myfile.sam | sort | uniq -c | sort -n | head -n 1
The 2nd field in a SAM formatted file contains the mapping flag, which provides information on the mapping features of the given alignment. For example, a flag of 4 indicates that the read is unmapped, whereas a flag of 16 indicates the read is mapped on the reverse strand. To find what flags are associated with the alignments in the file, we simply report the distinct flags present in the file. The switch "-c" can be used to report the counts.
# list the mapping flags
cut -f 2 myfile.sam | sort | uniq
# what is the distribution of mapping flags?
cut -f 2 myfile.sam | sort | uniq -c
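If samtools is available, its flags helper can decode a numeric flag value into the properties it encodes (optional; nothing in this tutorial depends on it):
# decode flag 16: read mapped on the reverse strand
samtools flags 16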
To grab only the reads that are mapped on the reverse strand, we use a simple awk statement that compares the mapping flag with the value 16 in the condition. Note that the number of alignments mapped on the reverse strand is not necessarily the same as the number of reads mapping to the reverse strand, because of the presence of multi-hits in the file.
# what reads map on the reverse strand?
awk '$2 == 16' myfile.sam | cut -f 1 | sort | uniq
# how many reads map on the reverse strand?
awk '$2 == 16' myfile.sam | cut -f 1 | sort | uniq | wc -l
# which alignments are mapped on the reverse strand?
awk '$2 == 16' myfile.sam
# how many alignments are mapped on the reverse strand?
awk '$2 == 16' myfile.sam | wc -l
# extract alignments mapped to chr1
grep -w chr1 myfile.sam > myfile_chr1.sam
# extract alignments mapped to chr3
grep -w chr3 myfile.sam > myfile_chr3.sam
Note that the "-w" option is necessary in the first case, otherwise other chromosomes with digit 1, such as 10,11,etc... will be considered. Even though these commands yield to the correct output in the considered example, in general is recommended that you explicitly compare the field of interest with the given pattern (i.e. chr1), rather than using the command grep on the entire record, as by chance another field could contain the pattern.
Are there fixed-length reads? What is their length?
To verify that all reads have the same length, we compute the length of each sequence read and then count the number of distinct lengths obtained: if only one distinct length is present, then all reads have the same length.
# what is the length of the reads?
awk '{print length($10)}' myfile.sam | sort | uniq
The mapping quality is reported by the aligner in the 5th field of the SAM file. The valid range for mapping qualities in a SAM file is 0-255. Note that even though the SAM format specification states that a value of 255 is to be interpreted as an unavailable mapping quality, many aligners assign 255 as the maximum possible alignment quality, indicating high confidence in the given read placement. Naturally, as the number of multiple hits reported for a given read increases, the mapping quality of each corresponding alignment decreases, as you can see from the example below.
# distribution of mapping qualities
cut -f 5 myfile.sam | sort | uniq -c
# multi-hits associated with mapping quality 0
awk '$5 == 0' myfile.sam | grep -o "NH:i:[[:digit:]]*" | sort -n -k 1.6 | uniq
# multi-hits associated with mapping quality 1
awk '$5 == 1' myfile.sam | grep -o "NH:i:[[:digit:]]*" | sort -n -k 1.6 | uniq
# multi-hits associated with mapping quality between 2 and 254
awk '$5 >= 2 && $5 <= 254' myfile.sam | grep -o "NH:i:[[:digit:]]*" | sort -n -k 1.6 | uniq
# multi-hits associated with mapping quality equal to 255
awk '$5 == 255' myfile.sam | grep -o "NH:i:[[:digit:]]*" | sort -n -k 1.6 | uniq
Gapless alignments have a CIGAR string consisting of a number (i.e. the length of the read) followed by the letter M. Recall that the reads have a fixed length of 60 bases in the given file.
# count the alignments that have only matches/mismatches
cut -f 6 myfile.sam | grep -c 60M
# this is equivalent to the command above
awk '$6 == "60M"' myfile.sam | wc -l
How many reported alignments are perfect matches?
The SAM tag "NM:i:" reports the edit distance from the reference sequence. An edit distance equal to 0 represents a perfect match.
# number of alignments that are perfect matches (edit distance = 0)
awk '$6 == "60M" ' myfile.sam | grep -o "NM:i:0" | wc -l
Mon May 11, 2015
2pm-3pm
Duanduan Ma
Building 76
76-156a&b
Introduction to UNIX and the KI/BioMicro computational resources
Mon Jun 15, 2015
1:30pm-3pm
Duanduan Ma
Building 68
68-374
Follow Up Course of Introduction to UNIX and the KI/BioMicro computational resources
Mon Jul 13, 2015
10am-11am
Charlie Whittaker
Building 76
76-156a&b
Introduction to Spotfire
Mon Aug 10, 2015
1:30pm-3pm
Duanduan Ma
Building 56
56-602
Introduction to UNIX and the KI/BioMicro computational resources
Mon Sep 14, 2015
1pm-4pm
Duanduan Ma
Building 68
68-374
Introduction to Python
Mon Oct 19, 2015
1pm-4pm
Duanduan Ma
Building 68
68-674
Introduction to Python
Mon Nov 9, 2015
1:30pm-3:30pm
Jie Wu and Vincent Butty
Building 68
68-181
A Traveller's companion to RNA-Seq interpretation
Mon Dec 14, 2015
2pm-4pm
Duanduan Ma
Building 68
68-474
Introduction to UNIX and the KI/BioMicro computational resources
Thur Jan 7, 2016
1:30pm-4:30pm
Duanduan Ma
Building 32
32-124
Quantitative Biology Workshop Outreach-Introduction to Python
Fri Jan 29, 2016
9:00am-11:00am
Charlie Whittaker and Duanduan Ma
Building 14N (DIRC)
14N-132
Visualizing and Accessing Genomic Data Using Publicly Available Genome Browsers and Databases
Tue Feb 16, 2016
2:00am-3:00am
Sara Strandberg-Qlucore
Building 76
76-156B
Introduction to Qlucore software
Mon Mar 7, 2016
2pm-4pm
Duanduan Ma
Building 56
56-602
Introduction to UNIX and the KI/BioMicro computational resources
Mon April 11, 2016
1pm-4pm
Duanduan Ma
Building 68
68-121
Introduction to R
Mon May 2, 2016
1pm-4pm
Duanduan Ma
Building 68
68-374
Introduction to R
Mon May 9, 2016
1pm-3pm
Qlucore
Building 14N(DIRC)
14N-132
Hands on training of Qlucore software
Mon June 6, 2016
2pm-4pm
Duanduan Ma
Building 68
68-374
Introduction to UNIX and the KI/BioMicro computational resources
Mon Jun 13, 2016
1pm-4pm
Duanduan Ma
Building 76
76-156A/B
Introduction to R
Mon July 11, 2016
1pm-4pm
Duanduan Ma
Building 56
56-602
Introduction to R
Mon Aug 8, 2016
1pm-4pm
Duanduan Ma
Building 68
68-374
Introduction to R
Mon Sep 12, 2016
2pm-4pm
Duanduan Ma
Building 76
76-459
Introduction to UNIX and the KI/BioMicro computational resources
Tue Oct 18, 2016
1pm-3pm
Duanduan Ma
Building 68
68-274
Introduction to UNIX and the KI/BioMicro computational resources
Mon Nov 7, 2016
1:00pm-2:00pm
Jie Wu & Vincent Butty
Building 68
68-181
A Traveler's companion to RNA-Seq interpretation session 1
Mon Nov 14, 2016
1:00pm-2:00pm
Jie Wu & Vincent Butty
Building 68
68-181
A Traveler's companion to RNA-Seq interpretation session 2
Mon Dec 12, 2016
2pm-4pm
Duanduan Ma
Building 76
76-259
Introduction to UNIX and the KI/BioMicro computational resources
Thurs Jan 19, 2017
1pm-3pm
Duanduan Ma
Building 68
68-274
Introduction to UNIX and the KI/BioMicro computational resources
Fri Feb 3, 2017
12:00pm-2:00pm
Charlie Whittaker and Duanduan Ma
Building 14N (DIRC)
14N-132
Visualizing and Accessing Genomic Data Using Publicly Available Genome Browsers and Databases
Mon Mar 13, 2017
2pm-4pm
Duanduan Ma
Building 68
68-374
Introduction to UNIX and the KI/BioMicro computational resources
Mon April 24, 2017
1pm-4pm
Duanduan Ma
Building 68
68-574
Introduction to Python
Mon May 1, 2017
1pm-4pm
Duanduan Ma
Building 68
68-374
Introduction to Python
Mon June 5, 2017
1pm-4pm
Duanduan Ma
Building 68
68-121
Introduction to Python
Mon July 10, 2017
1pm-4pm
Duanduan Ma
Building 76
76-559
Introduction to Python
Mon Aug 28, 2017
2pm-4pm
Duanduan Ma
Building 68
68-374
Introduction to UNIX and the KI/BioMicro computational resources
Mon Sep 25, 2017
2:00pm-3:00pm
Vincent Butty
Building 68
68-181
RNA-Seq Experimental design + QC ; Slides
Mon Oct 2, 2017
1:00pm-2:00pm
Vincent Butty
Building 68
68-181
RNA-Seq analysis ; Slides
Mon Nov 13, 2017
2pm-4pm
Duanduan Ma
Building 76
76-259
Introduction to UNIX and the KI/BioMicro computational resources
Mon Dec 11, 2017
1pm-4pm
Duanduan Ma
Building 56
56-602
Introduction to R
Wed Jan 31, 2018
10am-12pm
Charlie Whittaker
Building 14N (DIRC)
14N-132
Gene Set Enrichment Analysis
Mon Feb 12, 2018
1pm-4pm
Duanduan Ma
Building 76
76-459
Introduction to R
Mon Mar 12, 2018
2pm-4pm
Duanduan Ma
Building 56
56-602
Introduction to UNIX and the KI/BioMicro computational resources
Mon Apr 23, 2018
1:30pm-2:30pm
Duanduan Ma
Building 68
68-181
Introduction to Biopython
Mon May 14, 2018
2:30pm-3:30pm
Duanduan Ma
Building 76
76-259
Introduction to Biopython
Mon June 11, 2018
2:00pm-4:00pm
Duanduan Ma
Building 56
56-602
Introduction to UNIX and the KI/BioMicro computational resources
Mon July 16, 2018
2:30pm-4:00pm
Duanduan Ma
Building 68
68-121
Introduction of Python to biologists session 1
Mon August 13, 2018
2:30pm-4:00pm
Duanduan Ma
Building 76
76-559
Introduction of Python to biologists session 2
Mon September 10, 2018
2:00pm-4:00pm
Duanduan Ma
Building 56
56-602
Introduction to UNIX and the KI/BioMicro computational resources
Tue October 23, 2018
1:30pm-2:30pm
Vincent Butty
Building 68
68-181
A Traveler's companion to RNA-Seq interpretation session 1 - Slides
Mon November 5, 2018
1pm-2pm
Vincent Butty
Building 68
68-181
A Traveler's companion to RNA-Seq interpretation session 2 - Slides
Mon December 10, 2018
2:00pm-4:00pm
Duanduan Ma
Building 76
76-259
Introduction to UNIX and the KI/BioMicro computational resources
Thursday Jan 31, 2019
10am-12pm
Charlie Whittaker
Building 14N (DIRC)
14N-132
Gene Set Enrichment Analysis
Mon February 11, 2019
2:30pm-4:00pm
Duanduan Ma
Building 68
68-674
Introduction of Python to biologists session 1
Mon March 11, 2019
2:30pm-4:30pm
Duanduan Ma
Building 76
76-259
Introduction of Python to biologists session 2
Mon April 8, 2019
2:00pm-4:00pm
Duanduan Ma
Building 68
68-180
Introduction to UNIX and the KI/BioMicro computational resources
Mon May 13, 2019
2pm-4pm
Duanduan Ma
Building 68
68-121
Introduction to R session 1
Mon June 10, 2019
2pm-4pm
Duanduan Ma
Building 76
76-659
Introduction to R session 2
Mon July 8, 2019
2:00pm-4:00pm
Duanduan Ma
Building 68
68-574
Introduction to UNIX and the KI/BioMicro computational resources
Tue August 27, 2019
2:30pm-3:30pm
Jingzhi Zhu & Duanduan Ma
Building 76
76-659
How to install software packages yourself on computing clusters
Wed September 25, 2019
1:00pm-2:00pm
Vincent Butty
Building 68
68-181
A Traveler's companion to RNA-Seq interpretation session 1 - Slides
Wed October 2, 2019
1:00pm-2:00pm
Vincent Butty
Building 68
68-181
A Traveler's companion to RNA-Seq interpretation session 2 - Slides
Mon November 18, 2019
2:00pm-4:00pm
Duanduan Ma
Building 76
76-659
Introduction to UNIX and the KI/BioMicro computational resources
Tues December 10, 2019
1:00pm-2:00pm
BMC/IGB core (Vincent Butty)
Building 68
68-181
Single Cell RNA-Seq Slides:
Mon January 31, 2020
2:00pm-3:30pm
Duanduan Ma
Building 76
76-659
Introduction of machine learning and its applications in biology
Mon February 10, 2020
2:00pm-4:30pm
Duanduan Ma
Building 76
76-659
Introduction to UNIX and the KI/BioMicro computational resources
Mon March 9, 2020
2:30pm-4:00pm
Duanduan Ma
Building 76
76-659
Introduction of Python to biologists session 1
Mon April 17, 2020
2:00pm-4:30pm
Duanduan Ma
Zoom
Zoom link
Introduction to UNIX and the KI/BioMicro computational resources
Fri May 8, 2020
1:00pm-3:30pm
Duanduan Ma
Zoom
Zoom link
Introduction to R session 1
Wed May 13, 2020
1:00pm-3:30pm
Duanduan Ma
Zoom
Zoom link
Introduction to R session 2
Fri May 27, 2020
2:00pm-4:30pm
Duanduan Ma
Zoom
Zoom link
Introduction of Python to biologists session 2
Fri June 5, 2020
2:00pm-4:00pm
Duanduan Ma
Zoom
Zoom link
Introduction of Python to biologists session 1
Mon June 8, 2020
2:00pm-4:00pm
Duanduan Ma
Zoom
Zoom link
Introduction of Python to biologists session 2
Mon June 15, 2020
2:00pm-4:00pm
Duanduan Ma
Zoom
Zoom link
Introduction to Biopython
Mon July 20, 2020
2:00pm-4:00pm
Duanduan Ma
Zoom
Zoom link
Introduction to UNIX and the KI/BioMicro computational resources
Mon September 14, 2020
2:00pm-4:30pm
Duanduan Ma
Zoom
Zoom link
Introduction of machine learning and its applications in biology session I
Mon September 21, 2020
2:00pm-4:30pm
Duanduan Ma and Huiming Ding
Zoom
Zoom link
Introduction of machine learning and its applications in biology session II
Mon October 19, 2020
2:00pm-4:30pm
Duanduan Ma
Zoom
Zoom link
Introduction to UNIX and the KI/BioMicro computational resources
Mon October 26, 2020
2:00pm-4:30pm
Jingzhi Zhu
Zoom
Zoom link
How to install software packages yourself on computing clusters
Mon November 9, 2020
2:00pm-4:30pm
Dikshant Pradhan
Zoom
Zoom link
Public Data Repositories
Wed December 9, 2020
2:00pm-4:30pm
Duanduan Ma
Zoom
Zoom link
Introduction to R session 1
Wed December 16, 2020
2:00pm-4:30pm
Duanduan Ma
Zoom
Zoom link
Introduction to R session 2
Tuesday February 16, 2021
2pm-4:30pm
Duanduan Ma
zoom
zoom link
Introduction to UNIX and the KI/BioMicro computational resources session I
Monday February 22, 2021
2pm-4:30pm
Jingzhi Zhu, Duanduan Ma
zoom
zoom link
Introduction to UNIX and the KI/BioMicro computational resources session II
Monday March 15, 2021
2pm-4:30pm
Duanduan Ma
zoom
zoom link
Introduction of Python to biologists session I
Monday March 22, 2021
2pm-4:30pm
Duanduan Ma
zoom
zoom link
Introduction of Python to biologists session II
Monday March 29, 2021
2pm-4:30pm
Duanduan Ma
zoom
zoom link
Introduction to Biopython
Monday April 12, 2021
2pm-3pm
Vincent Butty
zoom
zoom link
A Traveler's companion to RNA-Seq interpretation session I
Monday April 26, 2021
2pm-3pm
Vincent Butty
zoom
zoom link
A Traveler's companion to RNA-Seq interpretation session II
Monday May 3, 2021
2pm-3pm
Vincent Butty
zoom
zoom link
Single Cell RNA-seq
Monday May 10, 2021
2pm-4:30pm
Duanduan Ma
zoom
zoom link
Introduction to UNIX and the KI/BioMicro computational resources session I
Monday May 17, 2021
2pm-4:30pm
Jingzhi Zhu, Duanduan Ma
zoom
zoom link
Introduction to UNIX and the KI/BioMicro computational resources session II
Monday June 7, 2021
2:00pm-4:30pm
Duanduan Ma
Zoom
Zoom link
Introduction to R session 1
Monday June 14, 2021
2:00pm-4:30pm
Duanduan Ma
Zoom
Zoom link
Introduction to R session 2
Monday September 13, 2021
2:00pm-4:30pm
Duanduan Ma
Zoom
Zoom link
Introduction to UNIX and the KI/BioMicro computational resources
Monday September 20, 2021
2:00pm-4:30pm
Jingzhi Zhu
Zoom
Zoom link
Effective utilization of IGB computational resources - SLURM & CONDA
Monday October 11, 2021
2:00pm-4:30pm
Duanduan Ma
Zoom
Zoom link
Introduction to Python session I
Monday October 18, 2021
2:00pm-4:30pm
Duanduan Ma
Zoom
Zoom link
Introduction to Python session II
Monday October 25, 2021
2:00pm-4:30pm
Duanduan Ma
Zoom
Zoom link
Introduction to Biopython
Monday November 1, 2021
2:00pm-4:30pm
Duanduan Ma
Zoom
Zoom link
Data processing and visualization using Python: Introduction to Pandas
Monday November 8, 2021
2:00pm-4:30pm
Duanduan Ma
Zoom
Zoom link
Data processing and visualization using Python: Introduction to Matplotlib
Monday November 15, 2021
2:00pm-4:30pm
Duanduan Ma
Zoom
Zoom link
Data processing and visualization using Python: Introduction to Seaborn
Monday December 13, 2021
2:00pm-4:30pm
Duanduan Ma
Zoom
Zoom link
Introduction to Linux and the IGB computational resources
Tuesday February 8, 2022
2:00pm-4:30pm
Duanduan Ma and Jingzhi Zhu
Zoom
Zoom link
Introduction to Linux and the KI/BioMicro computational resources
Thursday March 3rd, 2022
2:00pm-3:00pm
Dikshant Pradhan
Zoom
Zoom link
Public Data Repositories
Monday March 14th, 2022
1:00pm-3:30pm
Vincent Butty
Zoom
Zoom link
RNA-Seq Design and Analysis Session I
Tuesday March 15th, 2022
1:00pm-3:30pm
Vincent Butty
Zoom
Zoom link
RNA-Seq Design and Analysis Session II
Thursday March 17th, 2022
1:00pm-3:30pm
Vincent Butty
Zoom
Zoom link
RNA-Seq Design and Analysis Session III
Monday March 28, 2022
1:00pm-3:30pm
Duanduan Ma and Jingzhi Zhu
Zoom
Zoom link
Introduction to Linux and the KI/BioMicro computational resources
Monday April 11, 2022
1:00pm-3:30pm
Duanduan Ma
Zoom
Zoom link
Introduction to R session 1
Wednesday April 13, 2022
1:00pm-3:30pm
Duanduan Ma
Zoom
Zoom link
Introduction to R session 2
Monday April 25th, 2022
1:00pm-3:30pm
Charlie Whittaker Dikshant Pradhan
Zoom
Zoom link
RNAseq HandsOn session 1
Wednesday April 27th, 2022
1:00pm-3:30pm
Charlie Whittaker Dikshant Pradhan
Zoom
Zoom link
RNAseq HandsOn session 2
Friday April 27th, 2022
1:00pm-3:30pm
Charlie Whittaker Dikshant Pradhan
Zoom
Zoom link
RNAseq HandsOn session 3
Monday September 19th, 2022
2:00pm-4:30pm
Duanduan Ma Jingzhi Zhu
Zoom
Zoom link
Introduction to Linux and Cluster Computing
Monday October 3rd, 2022
2:00pm-4:30pm
Duanduan Ma
Zoom
Zoom link
Introduction of Python to biologists session I
Tuesday October 4th, 2022
2:00pm-4:30pm
Duanduan Ma
Zoom
Zoom link
Introduction of Python to biologists session II
Friday October 7th, 2022
2:00pm-4:30pm
Duanduan Ma
Zoom
Zoom link
Introduction to Biopython
Monday October 31st, 2022
2:00pm-4:30pm
Duanduan Ma
Zoom
Zoom link
Data processing and visualization using Python: Introduction to Matplotlib
Wednesday November 2nd, 2022
2:00pm-4:30pm
Duanduan Ma
Zoom
Zoom link
Data processing and visualization using Python: Introduction to Pandas
Friday November 4th, 2022
2:00pm-4:30pm
Duanduan Ma
Zoom
Zoom link
Data processing and visualization using Python: Introduction to Seaborn
Monday November 14th, 2022
2:00pm-4:30pm
Duanduan Ma
Zoom
Zoom link
Introduction of machine learning and its applications in biology. session I
Wednesday November 16th, 2022
2:00pm-4:30pm
Duanduan Ma
Zoom
Zoom link
Introduction of machine learning and its applications in biology. session II
Monday December 5th, 2022
2:00pm-4:30pm
Duanduan Ma
Zoom
Zoom link
Introduction to Linux and Cluster Computing
Monday February 6th, 2023
2:00pm-4:30pm
Duanduan Ma
Zoom
Zoom link
Introduction to Linux and Cluster Computing
Tuesday February 21, 2023
2:00pm-4:30pm
Duanduan Ma
Zoom
Zoom link
Introduction to R session 1
Wednesday February 22, 2023
2:00pm-4:30pm
Duanduan Ma
Zoom
Zoom link
Introduction to R session 2
Mon March 13, 2023
2:00pm-3:30pm
Stuart Levine
Zoom
Zoom link
Advanced Excel/Google Sheets for Biologists & Administrators
Monday April 3rd, 2023
2:00pm-4:30pm
Duanduan Ma
Zoom
Zoom link
Introduction of Python to biologists session I
Wednesday April 5th, 2023
2:00pm-4:30pm
Duanduan Ma
Zoom
Zoom link
Introduction of Python to biologists session II
Monday April 10th, 2023
2:00pm-4:30pm
Duanduan Ma
Zoom
Zoom link
Version control with Git session I
Wednesday April 12th, 2023
2:00pm-4:30pm
Duanduan Ma
Zoom
Zoom link
Version control with Git session II
Monday April 24th, 2023
1:00pm-3:30pm
Vincent Butty
Zoom
Zoom link
Bulk RNA-Seq Design and Analysis Session I
Wednesday April 26th, 2023
1:00pm-3:30pm
Vincent Butty
Zoom
Zoom link
Bulk RNA-Seq Design and Analysis Session II
Thursday April 27th, 2023
1:00pm-3:30pm
Vincent Butty
Zoom
Zoom link
Single cell RNA-Seq Design and Analysis Session I
Friday April 28th, 2023
1:00pm-3:30pm
Vincent Butty
Zoom
Zoom link
Single cell RNA-Seq Design and Analysis Session II
Monday May 1st, 2023
2:00pm-4:30pm
Duanduan Ma
Zoom
Zoom link
Introduction to Linux and Cluster Computing
Monday May 15th, 2023
1:00pm-3:00pm
Charlie Whittaker
Zoom
Zoom link
RNAseq HandsOn session 1
Wednesday May 17th, 2023
1:00pm-3:00pm
Charlie Whittaker
Zoom
Zoom link
RNAseq HandsOn session 2
Thursday May 18th, 2023
1:00pm-3:00pm
Charlie Whittaker
Zoom
Zoom link
RNAseq HandsOn session 3
Monday September 18th, 2023
2:00pm-4:30pm
Duanduan Ma
Building 56
56-602
Introduction to Linux and Cluster Computing
Tuesday September 26, 2023
2:00pm-4:30pm
Duanduan Ma
Building 56
56-602
Introduction of Python to biologists session I
Wednesday September 27, 2023
2:00pm-4:30pm
Duanduan Ma
Building 76
76-659
Introduction of Python to biologists session II
Thursday September 28, 2023
2:00pm-4:30pm
Duanduan Ma
Building 76
76-559
Introduction to Biopython
Monday October 16, 2023
2:00pm-4:30pm
Allen Soberanes & Duanduan Ma
Building 76
76-259
Introduction to Linux and Cluster Computing
Monday October 23rd, 2023
2:00pm-4:30pm
Duanduan Ma
Building 76
76-259
Data processing and visualization using Python: Introduction to Pandas
Tuesday October 24th, 2023
2:00pm-4:30pm
Duanduan Ma
Building 76
76-659
Data processing and visualization using Python: Introduction to Matplotlib
Wednesday October 25th, 2023
2:00pm-4:30pm
Duanduan Ma
Building 76
76-359
Data processing and visualization using Python: Introduction to Seaborn
Monday November 13th, 2023
2:00pm-4:30pm
Duanduan Ma
Building 76
76-259
Introduction of machine learning and its applications in biology. session I
Tuesday November 14th, 2023
2:00pm-4:30pm
Duanduan Ma
Building 76
76-659
Introduction of machine learning and its applications in biology. session II
Monday November 27th, 2023
2:00pm-4:30pm
Allen Soberanes & Duanduan Ma
Building 76
76-259
Introduction to Unix and cluster computing
Monday December 12th, 2023
2:00pm-4:30pm
Charlie Demurjian
TBD
TBD
Data Deposition in Public Repositories
Monday February 5, 2024
2:00pm-4:30pm
Allen Soberanes
Building 68
68-352
Introduction to Linux and Cluster Computing
Tuesday February 20, 2024
2:00pm-4:30pm
Duanduan Ma
Building 68
68-156
Introduction to R
Monday March 4, 2024
1:00pm-3:00pm
Vincent Butty
Building 68
68-181
Single-Cell RNA-Seq Session 1
Tuesday March 5, 2024
1:00pm-3:00pm
Vincent Butty
Building 68
68-180
Single-Cell RNA-Seq Session 2
Wednesday March 6, 2024
1:00pm-3:00pm
Vincent Butty
Building 68
68-181
Bulk RNA-Seq Session 1
Thursday March 7, 2024
1:00pm-3:00pm
Vincent Butty
Building 68
68-180
Bulk RNA-Seq Session 2
Monday March 18, 2024
2:00pm-4:30pm
Allen Soberanes
Building 68
68-474
Introduction to Linux and Cluster Computing
Monday April 1, 2024
2:00pm-4:30pm
Duanduan Ma
Building 68
68-156
Introduction to Python
Monday April 8, 2024
2:00pm-4:30pm
Duanduan Ma
Building 68
68-374
Biopython
Monday April 15, 2024
2:00pm-4:30pm
Duanduan Ma
Building 68
68-156
Machine Learning with Python
Monday April 22, 2024
2:00pm-4:30pm
Duanduan Ma
TBD
TBD
Data Processing with Python
Monday April 29, 2024
2:00pm-4:30pm
Allen Soberanes
Building 68
68-156
Introduction to Linux and Cluster Computing
Tuesday May 21, 2024
TBD
Charlie Whittaker
TBD
TBD
Shared workflows and containerized computing for Bulk RNA-Seq Analysis Session 1
Wednesday May 22, 2024
TBD
Charlie Whittaker
TBD
TBD
Shared workflows and containerized computing for Bulk RNA-Seq Analysis Session 2
Thursday May 23, 2024
TBD
Charlie Whittaker
TBD
TBD
Shared workflows and containerized computing for Bulk RNA-Seq Analysis Session 3