LogoLogo
LogoLogo
  • The Barbara K. Ostrom (1978) Bioinformatics and Computing Facility
  • Computing Resources
    • Active Data Storage
    • Archive Data Storage
    • Luria Cluster
      • FAQs
    • Other Resources
  • Bioinformatics Topics
    • Tools - A Basic Bioinformatics Toolkit
      • Getting more out of Microsoft Excel
      • Bioinformatics Applications of Unix
        • Unix commands applied to bioinformatics
        • Manipulate NGS files using UNIX commands
        • Manipulate alignment files using UNIX commands
      • Alignments and Mappers
      • Relational databases
        • Running Joins on Galaxy
      • Spotfire
    • Tasks - Bioinformatics Methods
      • UCSC Genome Bioinformatics
        • Interacting with the UCSC Genome Browser
        • Obtaining DNA sequence from the UCSC Database
        • Obtaining genomic data from the UCSC database using table browser queries
        • Filtering table browser queries
        • Performing a BLAT search
        • Creating Custom Tracks
        • UCSC Intersection Queries
        • Viewing cross-species alignments
        • Galaxy
          • Intro to Galaxy
          • Galaxy NGS Illumina QC
          • Galaxy NGS Illumina SE Mapping
          • Galaxy SNP Interval Data
        • Editing and annotation gene structures with Argo
      • GeneGO MetaCore
        • GeneGo Introduction
        • Loading Data Into GeneGO
        • Data Management in GeneGO
        • Setting Thresholds and Background Sets
        • Search And Browse Content Tab
        • Workflows and Reports Tab
        • One-click Analysis Tab
        • Building Network for Your Experimental Data
      • Functional Annotation of Gene Lists
      • Multiple Sequence Alignment
        • Clustalw2
      • Phylogenetic analysis
        • Neighbor Joining method in Phylip
      • Microarray data processing with R/Bioconductor
    • Running Jupyter notebooks on luria cluster nodes
  • Data Management
    • Globus
  • Mini Courses
    • Schedule
      • Previous Teaching
    • Introduction to Unix and KI Computational Resources
      • Basic Unix
        • Why Unix?
        • The Unix Tree
        • The Unix Terminal and Shell
        • Anatomy of a Unix Command
        • Basic Unix Commands
        • Output Redirection and Piping
        • Manual Pages
        • Access Rights
        • Unix Text Editors
          • nano
          • vi / vim
          • emacs
        • Shell Scripts
      • Software Installation
        • Module
        • Conda Environment
      • Slurm
    • Introduction to Unix
      • Why Unix?
      • The Unix Filesystem
        • The Unix Tree
        • Network Filesystems
      • The Unix Shell
        • About the Unix Shell
        • Unix Shell Manual Pages
        • Using the Unix Shell
          • Viewing the Unix Tree
          • Traversing the Unix Tree
          • Editing the Unix Tree
          • Searching the Unix Tree
      • Files
        • Viewing File Contents
        • Creating and Editing Files
        • Manipulating Files
        • Symbolic Links
        • File Ownership
          • How Unix File Ownership Works
          • Change File Ownership and Permissions
        • File Transfer (in-progress)
        • File Storage and Compression
      • Getting System Information
      • Writing Scripts
      • Schedule Scripts Using Crontab
    • Advanced Utilization of IGB Computational Resources
      • High Performance Computing Clusters
      • Slurm
        • Checking the Status of Computing Nodes
        • Submitting Jobs / Slurm Scripts
        • Interactive Sessions
      • Package Management
        • The System Package Manager
        • Environment Modules
        • Conda Environments
      • SSH Port Forwarding
        • SSH Port Forwarding Jupyter Notebooks
      • Containerization
        • Docker
          • Docker Installation
          • Running Docker Images
          • Building Docker Images
        • Singularity
          • Differences from Docker
          • Running Images in Singularity
      • Running Nextflow / nf-core Pipelines
    • Python
      • Introduction to Python for Biologists
        • Interactive Python
        • Types
          • Strings
          • Lists
          • Tuples
          • Dictionaries
        • Control Flow
        • Loops
          • For Loops
          • While Loops
        • Control Flows and Loops
        • Storing Programs for Re-use
        • Reading and Writing Files
        • Functions
      • Biopython
        • About Biopython
        • Quick Start
          • Basic Sequence Analyses
          • SeqRecord
          • Sequence IO
          • Exploration of Entrez Databases
        • Example Projects
          • Coronavirus Exploration
          • Translating a eukaryotic FASTA file of CDS entries
        • Further Resources
      • Machine Learning with Python
        • About Machine Learning
        • Hands-On
          • Project Introduction
          • Supervised Approaches
            • The Logistic Regression Model
            • K-Nearest Neighbors
          • Unsupervised Approaches
            • K-Means Clustering
          • Further Resources
      • Data Processing with Python
        • Pandas
          • About Pandas
          • Making DataFrames
          • Inspecting DataFrames
          • Slicing DataFrames
          • Selecting from DataFrames
          • Editing DataFrames
        • Matplotlib
          • About Matplotlib
          • Basic Plotting
          • Advanced Plotting
        • Seaborn
          • About Seaborn
          • Basic Plotting
          • Visualizing Statistics
          • Visualizing Proteomics Data
          • Visualizing RNAseq Data
    • R
      • Intro to R
        • Before We Start
        • Getting to Know R
        • Variables in R
        • Functions in R
        • Data Manipulation
        • Simple Statistics in R
        • Basic Plotting in R
        • Advanced Plotting in R
        • Writing Figures to a File
        • Further Resources
    • Version Control with Git
      • About Version Control
      • Setting up Git
      • Creating a Repository
      • Tracking Changes
        • Exercises
      • Exploring History
        • Exercises
      • Ignoring Things
      • Remotes in Github
      • Collaborating
      • Conflicts
      • Open Science
      • Licensing
      • Citation
      • Hosting
      • Supplemental
Powered by GitBook

MIT Resources

  • https://accessibility.mit.edu

Massachusetts Institute of Technology

On this page
  • Dataset
  • Bowtie
  • BWA
  • How to run BWA

Was this helpful?

Export as PDF
  1. Bioinformatics Topics
  2. Tools - A Basic Bioinformatics Toolkit

Alignments and Mappers

PreviousManipulate alignment files using UNIX commandsNextRelational databases

Last updated 1 year ago

Was this helpful?

The ‘read mapping’ problem is the alignment of relatively short reads to a reference genome.

Many short read mappers are currently available (Bowtie, BWA, SOAP, TopHat, etc...). They all try to overcome the following challenges presented by the mapping problem:

  • how do we align efficiently (in time and memory footprint) billions of short reads to a long reference genome?

  • how do we decide a specific placement when multiple, equally good alignment are available for the same read?

Some mappers are short-read aligners (they do not allow large gaps in their alignments), while others are spliced-read mappers (they allow large gaps corresponding to introns). The choice of the appropriate mapper depends on the particular dataset and application

Dataset

This tutorial uses a subset of the Illumina IDEA Challenge 2011 dataset (more information can be found ). You can copy the relevant fastq files by typing the following command on Luria:

# copy from stock directory to current directory  
cp /net/ostrom/data/rowleybcc/charliew/old_teaching/Biol2Bioinfo/SAM/BT20_?.fastq .

The dataset consists of paired-end RNA-Seq data of breast carcinoma human cell line BT20. Because the reads are paired-end reads, for each DNA fragment we have sequence data from both ends and the sequence reads are stored in two separate files (one for the data from each end). Additional information about the formats used in this tutorial can be found below:

  • SAM/BAM format: and .

  • .

Bowtie

A complete guide on how to run bowtie can be found .

Bowtie can be run in 2 alignment modes (mutually exclusive):

  • n alignment mode: this is the default mode (specified with flag -n), and determines which alignments are valid according to the following policy:

    • alignments may have no more than N mismatches in the first L bases of the reads ("seed"). N and L are integers specified through the flags -n and -l.

    • the sum of the Phred quality values at all mismatched positions (not just in the seed) may not exceed E (set with -e).

  • v alignment mode:

    • alignments may have no more than V mismatches (set using the -v option), and quality values are ignored.

Several reporting modes are available, specified through various combinations of the options, such as -k,-a, -m, etc...

# build bowtie index for reference ref.fa 
bowtie-build ref.fa ref

When the command has completed, the current directory will contain the index consisting of four new files (ref.1.ebwt, ref.2.ebwt, ref.rev.1.ebwt, and ref.rev.2.ebwt).

How to run Bowtie

If your dataset consists of single-end sequence reads, you can run Bowtie as follows:

# run bowtie single-end with default parameters
bowtie /home/Genomes/bowtie_indexes/hg18 BT20_1.fastq BT20_SE_bwt.sam

If your dataset consists of paired-end sequence reads, you can run Bowtie as follows:

# run bowtie paired-end with default parameters
bowtie /home/Genomes/bowtie_indexes/hg18 -1 BT20_1.fastq -2 BT20_2.fastq  BT20_PE_bwt.sam

Some additional useful options are:

  • the flag -k specifies the maximum number of alignments to report if more than one valid alignment is found.

  • the flag -m suppresses all alignments for a particular read if more than m reportable alignments exist.

  • the flag -v specifies the maximum number of mismatches allowed in the entire length of the read.

  • the flag -n specifies the maximum number of mismatches in the high quality seed of the read.

  • the flag -l specifies the length of the seed to consider for each read.

  • the flag –-best guarantees that reported singleton alignments are the best in terms of stratum (i.e. number of mismatches or mismatches in the seed, depending on the alignment mode) and in terms of the quality values at the mismatched position(s).

Note that the option --best does not affect which alignments are considered "valid" by bowtie, only which valid alignments are reported. When --best is specified and multiple hits are allowed (via -k or -a), the alignments for a given read are guaranteed to appear in best-to-worst order in bowtie's output. bowtie is somewhat slower when --best is specified.

The last few lines written to the standard error file handle convey summary information about the run (see below). The other lines are written to the standard out file handle and show valid, reportable alignments found by Bowtie.

BWA

BWA implements two different algorithms, both based on Burrows-Wheeler Transform (BWT).

  • ALN:

    • designed for short reads (up to ~200bp) with low error rate ( less than 3%).

    • performs gapped global alignment.

    • supports paired-end reads.

  • BWA-SW:

    • designed for longer reads with more errors.

    • performs heuristic Smith-Waterman-like alignment to find high-scoring local hits.

    • supports only single-end reads.

How to run BWA

  • The database file (in fasta format) must be first indexed with the index command, which typically takes a few hours. Pre-built indexes are available on rous for the most common model organisms.

  • If your reference sequence is not among the available pre-built indexes, you can run the following command and create one for it:

# build index for reference sequence ref.fa
bwa index ref.fa
  • For short reads (up to ~200bp) with low error rate (<3%), the algorithm to use is invoked with the command aln, which finds the suffix array (SA) coordinates of good hits of each individual read; the commands samse (for single-end reads) or sampe (for paired-end reads) are then used to convert these SA coordinates into chromosomal coordinates. Note that multi-hits are randomly chosen.

  • If your dataset consists of single-end sequence reads, you can run BWA as follows:

# run BWA single-end with default parameters
bwa aln ref.fa query.fastq > bwa.sai
bwa samse ref.fa bwa.sai query.fastq > bwa.sam

For example, to run BWA on the query BT20_1.fastq against the reference genome hs18, you must type the following commands:

bwa aln /home/Genomes/hs18.fa BT20_1.fastq > BT20_SE_bwa.sai
bwa samse /home/Genomes/hs18.fa BT20_SE_bwa.sai BT20_1.fastq > BT20_SE_bwa.sam
  • If your dataset consists of paired-end sequence reads, you must run BWA in two steps as follows:

# run BWA paired-end with default parameters
bwa aln ref.fa query_1.fastq > query_1.sai
bwa aln ref.fa query_2.fastq > query_2.sai
bwa sampe ref.fa bwa_1.sai bwa_2.sai query_1.fastq query_2.fastq > bwa.sam

For example, to run BWA on the query BT20_1.fastq and BT20_2.fastq against the reference genome hs18, you must type the following commands:

bwa aln /home/Genomes/hs18.fa BT20_1.fastq > BT20_1.sai
bwa aln /home/Genomes/hs18.fa BT20_2.fastq > BT20_2.sai
bwa sampe /home/Genomes/hs18.fa BT20_1.sai BT20_2.sai BT20_1.fastq BT20_2.fastq > BT20_PE_bwa.sam
  • Note that by default BWA ALN allows up to K mismatches in the seed (the first L positions of each read) and up to N mismatches in the whole sequence reads. These parameters can be changed using the appropriate options as indicates below.

    • -l specifies the length of the seed.

    • -k specifies the maximum edit distance in the seed.

    • -n specifies the maximum edit distance in the whole sequence read.

    • -o specifies the maximum number of gaps to open.

    • -e specifies the maximum number of gaps to extend.

  • The BWA output in SAM format has the following specific optional fields:

    • NM Edit distance

    • MD Mismatching positions/bases

    • X0 Number of best hits

    • X1 Number of suboptimal hits found by BWA

    • XM Number of mismatches in the alignment

    • XO Number of gap opens

    • XG Number of gap extensions

    • XT Type: Unique/Repeat/N/Mate-sw

    • XA Alternative hits; format: (chr,pos,CIGAR,NM;)*

Bowtie requires an index of the reference sequence against which the reads are mapped. Pre-built indexes for several organisms' assemblies are available on rous or on the Bowtie official (check under left pane section "Pre-built Indexes" ). If your reference sequence is not among the available pre-built indexes, you can run the following command and create one for it:

BWA is another fast aligner based on Burrows-Wheeler Transform (BWT). A complete guide on how to run BWA can be found .

here
FASTQ format
basic info
official specs
Manipulation of SAM format using UNIX commands
here
page
here