Manipulate alignment files using UNIX commands
The following examples use the file
myfile.sam
that you can copy by typing the following command on Luria:
As revealed by the extension, the file is in SAM format.
The SAM format stores sequence data in a series of tab delimited columns.
BAM, the sister version of a SAM file, contains the same information but in a compressed, indexed, binary form that is not human readable.
Basic information on SAM files can be found here.
The official specifications of SAM and BAM format are available here.
In this tutorial we will answer the following questions about the given SAM file:
What is in the file?
You can look at the content of the file by using the command more, head, cat, or by opening the file itself with a text editor such as nano.
How many alignments are in the file?
Because in a SAM formatted file each line corresponds to one reported alignment, we just need to count the number of lines in the file to get the number of alignments.
How many reads are in the file?
Given that in a generic SAM flle any sequence read can have more than one reported alignment, the number of alignments is not necessarily the number of reads in the file. However, the sequence reads are associated with a unique header, so we can count the number of distinct headers to determine how many reads are represented in the file.
How many distinct sequences are in the file?
Because two distinct reads could have identical sequence (by chance due to their short length, or by being PCR duplicates), the number of reads is not necessarily the same as the number of distinct sequences in the file, so we need to count the number of distinct occurrences in the 10th field.
How many reads are uniquely mapped?
Uniquely mapped reads are generally intended as reads that have only one reported alignment. To determine the number of reads with only one reported alignment, we can count how many headers occur only once in the file. Note that without the switch "-u", repeated headers will also be reported once, yielding to incorrect output.
How many reads are multi-hits?
Multi-hits are those reads that have more than one reported alignment. By counting how many distinct headers are repeated in the file, we can determine the number of multi-hits.
How many alignments are reported for each read?
To count the number of alignments each read has reported in the file, we count how many times each header appears in the file.
What are the top 10 multi-hit reads?
To get the top 10 multi-hit reads, we count the number of alignments for each header, then we perform a numerical, reverse sort on the number of occurrences.
Another way to get the top 10 multi-hit reads is to consider the alignment tag of the form "NH:i:" and use the command grep with a regular expression. Note that the sorting occurs on the tag itself because of the "-o" switch in the grep command, and precisely on the number of hits reported in the tag because of the "-k" switch in the sort command (i.e. we ignore the first 5 positions of the tag, which are common to all the tags, and we sort on the number of multi-hits itself). Another caveat is using "-n" switch with the sort command (if omitted, the sort occurs in a lexicographic order and the top multi-hit reads reported would be those with 9 multi-hits).
What chromosomes are represented in the file?
To determine what chromosomes are mapped in the file, we extract the chromosome label (corresponding to the 3rd field), then apply the sort and uniq commands. To print the chromosomes in numerical order, we use the switch "-n" on the labels, skipping the first three characters ("chr").
List the chromosomes according to the read distribution
To list the chromosome from the most represented to the least represented, we use the switch "-n" on the number of occurrences for each chromosome. Note that without the "-n" switch, the sorting of the chromosome frequency would be lexicographic rather than numeric, so chrX would appear before chr19 because the string "99" occurs before "87937" in a lexicographic reverse order.
Which are the most and least represented chromosomes?
Using the sorted list obtained in the step above, we consider the first reported chromosome as the most abundant and the last reported chromosome as the least abundant. Note that we can either choose to sort the list in reverse order or use the commands head/tail appropriately.
What mapping flags are in the file?
The 2nd field in a SAM formatted file contains the mapping flag, which provides information on the mapping features of the given alignment. For example, a flag "4" indicates that the read is unmapped, whereas a flag "16" indicates the read is mapped on reverse strand. To find what flags are associated with the alignments in the file, we simply report the distinct flags present in the file. The switch "-c" can be use to report the counts.
What reads and alignments are mapped on the reverse strand?
To grab only the reads that are mapped on the reverse strand, we use a simple awk statement that compares the mapping flag with the value 16 in the condition. Note that the number of alignments mapped on the reverse strand is not necessarily the same as the number of reads mapping to the reverse strand, because of the presence of multi-hits in the file.
Split the alignments to chromosome 1 and chromosome 3 into separate files
Note that the "-w" option is necessary in the first case, otherwise other chromosomes with digit 1, such as 10,11,etc... will be considered. Even though these commands yield to the correct output in the considered example, in general is recommended that you explicitly compare the field of interest with the given pattern (i.e. chr1), rather than using the command grep on the entire record, as by chance another field could contain the pattern.
Are there fixed-length reads? What is their length?
To verify that all reads have same length, we compute the length of each sequence read and then consider the number of distinct lengths obtained: if only one distinct length is present, then all reads have same length.
What is the relationship between mapping qualities and number of multiple hits in the file?
The mapping quality is reported by the aligner on the 5th field of the SAM file. The valid range for mapping qualities in a SAM file is 0-255. Note that even though the SAM format specifications state that a value 255 is to be interpreted as unavailable mapping quality, many aligners assign the value 255 as the maximum possible alignment quality, indicating a high confidence on the given read placement. Naturally, as the number of multiple hits reported for a given read increases, the mapping quality for each corresponding alignment decreases, as you can see from the example below.
How many reported alignments are gapless?
Gapless alignments have a CIGAR string consisting of a number (i.e. the length of the read) followed by the letter M. Recall that the reads have a fixed length of 60 bases in the given file.
How many reported alignments are perfect matches?
The SAM tag "NM:i:" reports the edit distance from the reference sequence. An edit distance equal to 0 represents a perfect match.
Last updated