LogoLogo
LogoLogo
  • The Barbara K. Ostrom (1978) Bioinformatics and Computing Facility
  • Computing Resources
    • Active Data Storage
    • Archive Data Storage
    • Luria Cluster
      • FAQs
    • Other Resources
  • Bioinformatics Topics
    • Tools - A Basic Bioinformatics Toolkit
      • Getting more out of Microsoft Excel
      • Bioinformatics Applications of Unix
        • Unix commands applied to bioinformatics
        • Manipulate NGS files using UNIX commands
        • Manipulate alignment files using UNIX commands
      • Alignments and Mappers
      • Relational databases
        • Running Joins on Galaxy
      • Spotfire
    • Tasks - Bioinformatics Methods
      • UCSC Genome Bioinformatics
        • Interacting with the UCSC Genome Browser
        • Obtaining DNA sequence from the UCSC Database
        • Obtaining genomic data from the UCSC database using table browser queries
        • Filtering table browser queries
        • Performing a BLAT search
        • Creating Custom Tracks
        • UCSC Intersection Queries
        • Viewing cross-species alignments
        • Galaxy
          • Intro to Galaxy
          • Galaxy NGS Illumina QC
          • Galaxy NGS Illumina SE Mapping
          • Galaxy SNP Interval Data
        • Editing and annotation gene structures with Argo
      • GeneGO MetaCore
        • GeneGo Introduction
        • Loading Data Into GeneGO
        • Data Management in GeneGO
        • Setting Thresholds and Background Sets
        • Search And Browse Content Tab
        • Workflows and Reports Tab
        • One-click Analysis Tab
        • Building Network for Your Experimental Data
      • Functional Annotation of Gene Lists
      • Multiple Sequence Alignment
        • Clustalw2
      • Phylogenetic analysis
        • Neighbor Joining method in Phylip
      • Microarray data processing with R/Bioconductor
    • Running Jupyter notebooks on luria cluster nodes
  • Data Management
    • Globus
  • Mini Courses
    • Schedule
      • Previous Teaching
    • Introduction to Unix and KI Computational Resources
      • Basic Unix
        • Why Unix?
        • The Unix Tree
        • The Unix Terminal and Shell
        • Anatomy of a Unix Command
        • Basic Unix Commands
        • Output Redirection and Piping
        • Manual Pages
        • Access Rights
        • Unix Text Editors
          • nano
          • vi / vim
          • emacs
        • Shell Scripts
      • Software Installation
        • Module
        • Conda Environment
      • Slurm
    • Introduction to Unix
      • Why Unix?
      • The Unix Filesystem
        • The Unix Tree
        • Network Filesystems
      • The Unix Shell
        • About the Unix Shell
        • Unix Shell Manual Pages
        • Using the Unix Shell
          • Viewing the Unix Tree
          • Traversing the Unix Tree
          • Editing the Unix Tree
          • Searching the Unix Tree
      • Files
        • Viewing File Contents
        • Creating and Editing Files
        • Manipulating Files
        • Symbolic Links
        • File Ownership
          • How Unix File Ownership Works
          • Change File Ownership and Permissions
        • File Transfer (in-progress)
        • File Storage and Compression
      • Getting System Information
      • Writing Scripts
      • Schedule Scripts Using Crontab
    • Advanced Utilization of IGB Computational Resources
      • High Performance Computing Clusters
      • Slurm
        • Checking the Status of Computing Nodes
        • Submitting Jobs / Slurm Scripts
        • Interactive Sessions
      • Package Management
        • The System Package Manager
        • Environment Modules
        • Conda Environments
      • SSH Port Forwarding
        • SSH Port Forwarding Jupyter Notebooks
      • Containerization
        • Docker
          • Docker Installation
          • Running Docker Images
          • Building Docker Images
        • Singularity
          • Differences from Docker
          • Running Images in Singularity
      • Running Nextflow / nf-core Pipelines
    • Python
      • Introduction to Python for Biologists
        • Interactive Python
        • Types
          • Strings
          • Lists
          • Tuples
          • Dictionaries
        • Control Flow
        • Loops
          • For Loops
          • While Loops
        • Control Flows and Loops
        • Storing Programs for Re-use
        • Reading and Writing Files
        • Functions
      • Biopython
        • About Biopython
        • Quick Start
          • Basic Sequence Analyses
          • SeqRecord
          • Sequence IO
          • Exploration of Entrez Databases
        • Example Projects
          • Coronavirus Exploration
          • Translating a eukaryotic FASTA file of CDS entries
        • Further Resources
      • Machine Learning with Python
        • About Machine Learning
        • Hands-On
          • Project Introduction
          • Supervised Approaches
            • The Logistic Regression Model
            • K-Nearest Neighbors
          • Unsupervised Approaches
            • K-Means Clustering
          • Further Resources
      • Data Processing with Python
        • Pandas
          • About Pandas
          • Making DataFrames
          • Inspecting DataFrames
          • Slicing DataFrames
          • Selecting from DataFrames
          • Editing DataFrames
        • Matplotlib
          • About Matplotlib
          • Basic Plotting
          • Advanced Plotting
        • Seaborn
          • About Seaborn
          • Basic Plotting
          • Visualizing Statistics
          • Visualizing Proteomics Data
          • Visualizing RNAseq Data
    • R
      • Intro to R
        • Before We Start
        • Getting to Know R
        • Variables in R
        • Functions in R
        • Data Manipulation
        • Simple Statistics in R
        • Basic Plotting in R
        • Advanced Plotting in R
        • Writing Figures to a File
        • Further Resources
    • Version Control with Git
      • About Version Control
      • Setting up Git
      • Creating a Repository
      • Tracking Changes
        • Exercises
      • Exploring History
        • Exercises
      • Ignoring Things
      • Remotes in Github
      • Collaborating
      • Conflicts
      • Open Science
      • Licensing
      • Citation
      • Hosting
      • Supplemental
Powered by GitBook

MIT Resources

  • https://accessibility.mit.edu

Massachusetts Institute of Technology

On this page
  • What is in the file?
  • How many alignments are in the file?
  • How many reads are in the file?
  • How many distinct sequences are in the file?
  • How many reads are uniquely mapped?
  • How many reads are multi-hits?
  • How many alignments are reported for each read?
  • What are the top 10 multi-hit reads?
  • What chromosomes are represented in the file?
  • List the chromosomes according to the read distribution
  • Which are the most and least represented chromosomes?
  • What mapping flags are in the file?
  • What reads and alignments are mapped on the reverse strand?
  • Split the alignments to chromosome 1 and chromosome 3 into separate files
  • What is the relationship between mapping qualities and number of multiple hits in the file?
  • How many reported alignments are gapless?

Was this helpful?

Export as PDF
  1. Bioinformatics Topics
  2. Tools - A Basic Bioinformatics Toolkit
  3. Bioinformatics Applications of Unix

Manipulate alignment files using UNIX commands

PreviousManipulate NGS files using UNIX commandsNextAlignments and Mappers

Last updated 1 year ago

Was this helpful?

  • The following examples use the file myfile.sam that you can copy by typing the following command on Luria:

# copy the file from the stock directory to the current directory 
cp /net/ostrom/data/rowleybcc/charliew/old_teaching/Biol2Bioinfo/SAM/myfile.sam .
  • As revealed by the extension, the file is in SAM format.

    • The SAM format stores sequence data in a series of tab delimited columns.

    • BAM, the sister version of a SAM file, contains the same information but in a compressed, indexed, binary form that is not human readable.

    • Basic information on SAM files can be found .

    • The official specifications of SAM and BAM format are available .

In this tutorial we will answer the following questions about the given SAM file:

What is in the file?

You can look at the content of the file by using the command , , , or by opening the file itself with a text editor such as .

# display the content of the file on the screen, one page at a time 
more myfile.sam
# display the content of the whole file on the screen 
cat myfile.sam
# display the first 10 lines 
head myfile.sam
# open the file with the text editor nano 
nano myfile.sam

How many alignments are in the file?

Because in a SAM formatted file each line corresponds to one reported alignment, we just need to count the number of lines in the file to get the number of alignments.

# how many alignments? 
wc -l myfile.sam

How many reads are in the file?

Given that in a generic SAM flle any sequence read can have more than one reported alignment, the number of alignments is not necessarily the number of reads in the file. However, the sequence reads are associated with a unique header, so we can count the number of distinct headers to determine how many reads are represented in the file.

# how many reads? 
cut -f 1 myfile.sam | sort | uniq | wc -l

How many distinct sequences are in the file?

Because two distinct reads could have identical sequence (by chance due to their short length, or by being PCR duplicates), the number of reads is not necessarily the same as the number of distinct sequences in the file, so we need to count the number of distinct occurrences in the 10th field.

# how many sequences? 
cut -f 10 myfile.sam | sort | uniq | wc -l

How many reads are uniquely mapped?

Uniquely mapped reads are generally intended as reads that have only one reported alignment. To determine the number of reads with only one reported alignment, we can count how many headers occur only once in the file. Note that without the switch "-u", repeated headers will also be reported once, yielding to incorrect output.

# how many reads are uniquely mapped? 
cut -f 1 myfile.sam | sort | uniq -u | wc -l

How many reads are multi-hits?

Multi-hits are those reads that have more than one reported alignment. By counting how many distinct headers are repeated in the file, we can determine the number of multi-hits.

# how many reads are multi-hits? 
cut -f 1 myfile.sam | sort | uniq -d | wc -l

How many alignments are reported for each read?

To count the number of alignments each read has reported in the file, we count how many times each header appears in the file.

# how many alignments are there for each read? 
cut -f 1 myfile.sam | sort | uniq -c

What are the top 10 multi-hit reads?

To get the top 10 multi-hit reads, we count the number of alignments for each header, then we perform a numerical, reverse sort on the number of occurrences.

# top 10 reads that have the most reported alignments 
cut -f 1 myfile.sam | sort | uniq -c | sort -nr | head

Another way to get the top 10 multi-hit reads is to consider the alignment tag of the form "NH:i:" and use the command grep with a regular expression. Note that the sorting occurs on the tag itself because of the "-o" switch in the grep command, and precisely on the number of hits reported in the tag because of the "-k" switch in the sort command (i.e. we ignore the first 5 positions of the tag, which are common to all the tags, and we sort on the number of multi-hits itself). Another caveat is using "-n" switch with the sort command (if omitted, the sort occurs in a lexicographic order and the top multi-hit reads reported would be those with 9 multi-hits).

# top 10 reads that have the most reported alignments 
grep -o "NH:i:[[:digit:]]*" myfile.sam | sort -n -k 1.6 | tail

What chromosomes are represented in the file?

To determine what chromosomes are mapped in the file, we extract the chromosome label (corresponding to the 3rd field), then apply the sort and uniq commands. To print the chromosomes in numerical order, we use the switch "-n" on the labels, skipping the first three characters ("chr").

# what chromosome are there? 
cut -f 3 myfile.sam | sort | uniq
# what chromosome are there (in numerical chromosome order)?
cut -f 3 myfile.sam | sort | uniq | sort -k 1.4 -n

List the chromosomes according to the read distribution

To list the chromosome from the most represented to the least represented, we use the switch "-n" on the number of occurrences for each chromosome. Note that without the "-n" switch, the sorting of the chromosome frequency would be lexicographic rather than numeric, so chrX would appear before chr19 because the string "99" occurs before "87937" in a lexicographic reverse order.

# count the occurrences by chromosome 
cut -f 3 myfile.sam | sort | uniq -c  
# count the occurrences by chromosome, then sort from most to least abundant 
cut -f 3 myfile.sam | sort | uniq -c | sort -nr

Which are the most and least represented chromosomes?

Using the sorted list obtained in the step above, we consider the first reported chromosome as the most abundant and the last reported chromosome as the least abundant. Note that we can either choose to sort the list in reverse order or use the commands head/tail appropriately.

# count the occurrences by chromosome 
cut -f 3 myfile.sam | sort | uniq -c 
# count the occurrences by chromosome, then sort from most to least abundant 
cut -f 3 myfile.sam | sort | uniq -c | sort -nr
# which chromosome is the most represented? 
cut -f 3 myfile.sam | sort | uniq -c | sort -nr | head -n 1  
# which chromosome is the least represented? 
cut -f 3 myfile.sam | sort | uniq -c | sort -nr | tail -n 1
# which chromosome is the least represented? 
cut -f 3 myfile.sam | sort | uniq -c | sort -n | head -n 1

What mapping flags are in the file?

The 2nd field in a SAM formatted file contains the mapping flag, which provides information on the mapping features of the given alignment. For example, a flag "4" indicates that the read is unmapped, whereas a flag "16" indicates the read is mapped on reverse strand. To find what flags are associated with the alignments in the file, we simply report the distinct flags present in the file. The switch "-c" can be use to report the counts.

# list the mapping flags
cut -f 2 myfile.sam | sort | uniq
 # what is the distribution of mapping flags? 
cut -f 2 myfile.sam | sort | uniq -c

What reads and alignments are mapped on the reverse strand?

To grab only the reads that are mapped on the reverse strand, we use a simple awk statement that compares the mapping flag with the value 16 in the condition. Note that the number of alignments mapped on the reverse strand is not necessarily the same as the number of reads mapping to the reverse strand, because of the presence of multi-hits in the file.

# what reads map on the reverse strand? 
awk '$2 == 16' myfile.sam | cut -f 1 | uniq
# how many reads map on the reverse strand? 
awk '$2 == 16' myfile.sam | cut -f 1 | uniq | wc -l
# which alignments are mapped on the reverse strand? 
awk '$2 == 16' myfile.sam
# how many alignments are mapped on the reverse strand? 
awk '$2 == 16' myfile.sam | wc -l

Split the alignments to chromosome 1 and chromosome 3 into separate files

# extract alignments mapped to chr1 
grep -w chr1 myfile.sam > myfile_chr1.sam
# extract alignments mapped to chr3
grep -w chr3 myfile.sam > myfile_chr3.sam

Note that the "-w" option is necessary in the first case, otherwise other chromosomes with digit 1, such as 10,11,etc... will be considered. Even though these commands yield to the correct output in the considered example, in general is recommended that you explicitly compare the field of interest with the given pattern (i.e. chr1), rather than using the command grep on the entire record, as by chance another field could contain the pattern.

Are there fixed-length reads? What is their length?

To verify that all reads have same length, we compute the length of each sequence read and then consider the number of distinct lengths obtained: if only one distinct length is present, then all reads have same length.

# what is the length of the reads?
awk '{print length($10)}' myfile.sam | sort | uniq

What is the relationship between mapping qualities and number of multiple hits in the file?

The mapping quality is reported by the aligner on the 5th field of the SAM file. The valid range for mapping qualities in a SAM file is 0-255. Note that even though the SAM format specifications state that a value 255 is to be interpreted as unavailable mapping quality, many aligners assign the value 255 as the maximum possible alignment quality, indicating a high confidence on the given read placement. Naturally, as the number of multiple hits reported for a given read increases, the mapping quality for each corresponding alignment decreases, as you can see from the example below.

# distribution of mapping qualities 
cut -f 5 myfile.sam | sort | uniq -c
# multi-hits associated with mapping quality 0 
awk '$5 == 0' myfile.sam | grep -o "NH:i:[[:digit:]]*" | sort -n -k 1.6 | uniq

# multi-hits associated with mapping quality 1 
awk '$5 == 1' myfile.sam | grep -o "NH:i:[[:digit:]]*" | sort -n -k 1.6 | uniq
# multi-hits associated with mapping quality between 2 and 254 
awk '$5 > 2 && $5 < 254' myfile.sam | grep -o "NH:i:[[:digit:]]*" | sort -n -k 1.6 | uniq
# multi-hits associated with mapping quality equal to 255 
awk '$5 == 255' myfile.sam | grep -o "NH:i:[[:digit:]]*" | sort -n -k 1.6 | uniq

How many reported alignments are gapless?

Gapless alignments have a CIGAR string consisting of a number (i.e. the length of the read) followed by the letter M. Recall that the reads have a fixed length of 60 bases in the given file.

# count the alignments that have only matches/mismatches 
 cut -f 6 myfile.sam  | grep -c 60M
# this is equivalent to the command above 
 awk '$6 == "60M"' myfile.sam | wc -l

How many reported alignments are perfect matches?

The SAM tag "NM:i:" reports the edit distance from the reference sequence. An edit distance equal to 0 represents a perfect match.

# number of alignments that are perfect matches (edit distance = 0) 
awk '$6 == "60M" ' myfile.sam | grep -o "NM:i:0" | wc -l
here
here
more
head
cat
nano