Manipulate alignment files using UNIX commands

  • The following examples use the file myfile.sam that you can copy by typing the following command on Luria:

# copy the file from the stock directory to the current directory 
cp /net/ostrom/data/rowleybcc/charliew/old_teaching/Biol2Bioinfo/SAM/myfile.sam .
  • As revealed by the extension, the file is in SAM format.

    • The SAM format stores sequence data in a series of tab delimited columns.

    • BAM, the sister version of a SAM file, contains the same information but in a compressed, indexed, binary form that is not human readable.

    • Basic information on SAM files can be found here.

    • The official specifications of SAM and BAM format are available here.

In this tutorial we will answer the following questions about the given SAM file:

What is in the file?

You can look at the content of the file by using the command more, head, cat, or by opening the file itself with a text editor such as nano.

# display the content of the file on the screen, one page at a time 
more myfile.sam
# display the content of the whole file on the screen 
cat myfile.sam
# display the first 10 lines 
head myfile.sam
# open the file with the text editor nano 
nano myfile.sam

How many alignments are in the file?

Because in a SAM formatted file each line corresponds to one reported alignment, we just need to count the number of lines in the file to get the number of alignments.

# how many alignments? 
wc -l myfile.sam

How many reads are in the file?

Given that in a generic SAM flle any sequence read can have more than one reported alignment, the number of alignments is not necessarily the number of reads in the file. However, the sequence reads are associated with a unique header, so we can count the number of distinct headers to determine how many reads are represented in the file.

# how many reads? 
cut -f 1 myfile.sam | sort | uniq | wc -l

How many distinct sequences are in the file?

Because two distinct reads could have identical sequence (by chance due to their short length, or by being PCR duplicates), the number of reads is not necessarily the same as the number of distinct sequences in the file, so we need to count the number of distinct occurrences in the 10th field.

# how many sequences? 
cut -f 10 myfile.sam | sort | uniq | wc -l

How many reads are uniquely mapped?

Uniquely mapped reads are generally intended as reads that have only one reported alignment. To determine the number of reads with only one reported alignment, we can count how many headers occur only once in the file. Note that without the switch "-u", repeated headers will also be reported once, yielding to incorrect output.

# how many reads are uniquely mapped? 
cut -f 1 myfile.sam | sort | uniq -u | wc -l

How many reads are multi-hits?

Multi-hits are those reads that have more than one reported alignment. By counting how many distinct headers are repeated in the file, we can determine the number of multi-hits.

# how many reads are multi-hits? 
cut -f 1 myfile.sam | sort | uniq -d | wc -l

How many alignments are reported for each read?

To count the number of alignments each read has reported in the file, we count how many times each header appears in the file.

# how many alignments are there for each read? 
cut -f 1 myfile.sam | sort | uniq -c

What are the top 10 multi-hit reads?

To get the top 10 multi-hit reads, we count the number of alignments for each header, then we perform a numerical, reverse sort on the number of occurrences.

# top 10 reads that have the most reported alignments 
cut -f 1 myfile.sam | sort | uniq -c | sort -nr | head

Another way to get the top 10 multi-hit reads is to consider the alignment tag of the form "NH:i:" and use the command grep with a regular expression. Note that the sorting occurs on the tag itself because of the "-o" switch in the grep command, and precisely on the number of hits reported in the tag because of the "-k" switch in the sort command (i.e. we ignore the first 5 positions of the tag, which are common to all the tags, and we sort on the number of multi-hits itself). Another caveat is using "-n" switch with the sort command (if omitted, the sort occurs in a lexicographic order and the top multi-hit reads reported would be those with 9 multi-hits).

# top 10 reads that have the most reported alignments 
grep -o "NH:i:[[:digit:]]*" myfile.sam | sort -n -k 1.6 | tail

What chromosomes are represented in the file?

To determine what chromosomes are mapped in the file, we extract the chromosome label (corresponding to the 3rd field), then apply the sort and uniq commands. To print the chromosomes in numerical order, we use the switch "-n" on the labels, skipping the first three characters ("chr").

# what chromosome are there? 
cut -f 3 myfile.sam | sort | uniq
# what chromosome are there (in numerical chromosome order)?
cut -f 3 myfile.sam | sort | uniq | sort -k 1.4 -n

List the chromosomes according to the read distribution

To list the chromosome from the most represented to the least represented, we use the switch "-n" on the number of occurrences for each chromosome. Note that without the "-n" switch, the sorting of the chromosome frequency would be lexicographic rather than numeric, so chrX would appear before chr19 because the string "99" occurs before "87937" in a lexicographic reverse order.

# count the occurrences by chromosome 
cut -f 3 myfile.sam | sort | uniq -c  
# count the occurrences by chromosome, then sort from most to least abundant 
cut -f 3 myfile.sam | sort | uniq -c | sort -nr

Which are the most and least represented chromosomes?

Using the sorted list obtained in the step above, we consider the first reported chromosome as the most abundant and the last reported chromosome as the least abundant. Note that we can either choose to sort the list in reverse order or use the commands head/tail appropriately.

# count the occurrences by chromosome 
cut -f 3 myfile.sam | sort | uniq -c 
# count the occurrences by chromosome, then sort from most to least abundant 
cut -f 3 myfile.sam | sort | uniq -c | sort -nr
# which chromosome is the most represented? 
cut -f 3 myfile.sam | sort | uniq -c | sort -nr | head -n 1  
# which chromosome is the least represented? 
cut -f 3 myfile.sam | sort | uniq -c | sort -nr | tail -n 1
# which chromosome is the least represented? 
cut -f 3 myfile.sam | sort | uniq -c | sort -n | head -n 1

What mapping flags are in the file?

The 2nd field in a SAM formatted file contains the mapping flag, which provides information on the mapping features of the given alignment. For example, a flag "4" indicates that the read is unmapped, whereas a flag "16" indicates the read is mapped on reverse strand. To find what flags are associated with the alignments in the file, we simply report the distinct flags present in the file. The switch "-c" can be use to report the counts.

# list the mapping flags
cut -f 2 myfile.sam | sort | uniq
 # what is the distribution of mapping flags? 
cut -f 2 myfile.sam | sort | uniq -c

What reads and alignments are mapped on the reverse strand?

To grab only the reads that are mapped on the reverse strand, we use a simple awk statement that compares the mapping flag with the value 16 in the condition. Note that the number of alignments mapped on the reverse strand is not necessarily the same as the number of reads mapping to the reverse strand, because of the presence of multi-hits in the file.

# what reads map on the reverse strand? 
awk '$2 == 16' myfile.sam | cut -f 1 | uniq
# how many reads map on the reverse strand? 
awk '$2 == 16' myfile.sam | cut -f 1 | uniq | wc -l
# which alignments are mapped on the reverse strand? 
awk '$2 == 16' myfile.sam
# how many alignments are mapped on the reverse strand? 
awk '$2 == 16' myfile.sam | wc -l

Split the alignments to chromosome 1 and chromosome 3 into separate files

# extract alignments mapped to chr1 
grep -w chr1 myfile.sam > myfile_chr1.sam
# extract alignments mapped to chr3
grep -w chr3 myfile.sam > myfile_chr3.sam

Note that the "-w" option is necessary in the first case, otherwise other chromosomes with digit 1, such as 10,11,etc... will be considered. Even though these commands yield to the correct output in the considered example, in general is recommended that you explicitly compare the field of interest with the given pattern (i.e. chr1), rather than using the command grep on the entire record, as by chance another field could contain the pattern.

Are there fixed-length reads? What is their length?

To verify that all reads have same length, we compute the length of each sequence read and then consider the number of distinct lengths obtained: if only one distinct length is present, then all reads have same length.

# what is the length of the reads?
awk '{print length($10)}' myfile.sam | sort | uniq

What is the relationship between mapping qualities and number of multiple hits in the file?

The mapping quality is reported by the aligner on the 5th field of the SAM file. The valid range for mapping qualities in a SAM file is 0-255. Note that even though the SAM format specifications state that a value 255 is to be interpreted as unavailable mapping quality, many aligners assign the value 255 as the maximum possible alignment quality, indicating a high confidence on the given read placement. Naturally, as the number of multiple hits reported for a given read increases, the mapping quality for each corresponding alignment decreases, as you can see from the example below.

# distribution of mapping qualities 
cut -f 5 myfile.sam | sort | uniq -c
# multi-hits associated with mapping quality 0 
awk '$5 == 0' myfile.sam | grep -o "NH:i:[[:digit:]]*" | sort -n -k 1.6 | uniq

# multi-hits associated with mapping quality 1 
awk '$5 == 1' myfile.sam | grep -o "NH:i:[[:digit:]]*" | sort -n -k 1.6 | uniq
# multi-hits associated with mapping quality between 2 and 254 
awk '$5 > 2 && $5 < 254' myfile.sam | grep -o "NH:i:[[:digit:]]*" | sort -n -k 1.6 | uniq
# multi-hits associated with mapping quality equal to 255 
awk '$5 == 255' myfile.sam | grep -o "NH:i:[[:digit:]]*" | sort -n -k 1.6 | uniq

How many reported alignments are gapless?

Gapless alignments have a CIGAR string consisting of a number (i.e. the length of the read) followed by the letter M. Recall that the reads have a fixed length of 60 bases in the given file.

# count the alignments that have only matches/mismatches 
 cut -f 6 myfile.sam  | grep -c 60M
# this is equivalent to the command above 
 awk '$6 == "60M"' myfile.sam | wc -l

How many reported alignments are perfect matches?

The SAM tag "NM:i:" reports the edit distance from the reference sequence. An edit distance equal to 0 represents a perfect match.

# number of alignments that are perfect matches (edit distance = 0) 
awk '$6 == "60M" ' myfile.sam | grep -o "NM:i:0" | wc -l

Last updated

Massachusetts Institute of Technology