Manipulate NGS files using UNIX commands

The following examples use the file isoforms_fpkm_track.txt that you can copy by typing the following command on Luria:

This is incorrect! Will update this later.

# copy the file from the stock directory to the current directory 
cp /net/ostrom/data/rowleybcc/charliew/old_teaching/Biol2Bioinfo/UNIX/isoforms_fpkm_track.txt .

The given file is tab delimited and represents one of the outputs of the transcriptome assembly tool Cufflinks.

A description of the columns is reported here for convenience, but more information can be found in the cufflinks manual.

  • tracking_id: A unique identifier describing the object (gene, transcript, CDS, primary transcript)

  • class_code: The class_code attribute for the object, or "-"

  • nearest_ref_id: The reference transcript to which the class code refers, if any

  • gene_id: The gene ID(s) associated with the object

  • gene_short_name: The gene short name(s) associated with the object

  • tss_id: The tss ID associated with the object, or "-"

  • locus: Genomic coordinates of the object

  • length: The number of base pairs in the transcript, or '-' if not a transcript/primary transcript

  • coverage: Estimate for the absolute depth of read coverage across the object

  • FPKM: FPKM of the object

  • FPKM_lo: The lower bound of the 95% confidence interval on the FPKM of the object

  • FPKM_hi: The upper bound of the 95% confidence interval on the FPKM of the object

  • status: Quantification status for the object. Can be one of: OK (deconvolution successful), LOWDATA (too complex or shallowly sequenced), HIDATA (too many fragments in locus), or FAIL (when an ill-conditioned covariance matrix or other numerical exception prevents deconvolution).

In this tutorial we will answer the following questions about the given file:

How many genes and transcripts are in the file?

By counting how many distinct entries there are in the 1st and 4th column, we can determine the number of transcripts and genes respectively.

# count the number of isoforms 
cut -f 1 isoforms_fpkm_track.txt | sort | uniq | wc -l
# count the number of genes 
cut -f 4 isoforms_fpkm_track.txt | sort | uniq | wc -l

Filter out the transcripts with FPKM value equal to 0

The FPKM values are reported in the 10th field. We can ignore the transcripts with zero FPKM value by using simple awk statement.

# print isoforms with nonzero FPKM value 
awk '$10 !=0  {print $0}' isoforms_fpkm_track.txt

Filter out the transcripts with status other than "OK"

Transcripts with status other than "OK" are generally not suitable for downstream analysis. We can filter these out by considering only those records whose last field matches the string "OK".

# print isoforms with nonzero FPKM value 
awk '$13 == "OK" {print $0}' isoforms_fpkm_track.txt 

Note that using a naive grep statement might not always work because a naive grep match is performed on the entire line, rather than on a specific field.

Sort by increasing FPKM values

Because FPKM values are often reported in a scientific format called E notation, a simple numeric sort (specified with the switch -n) does not work properly because the character "e" in the notation is not interpreted as it should (e.g. 2.5e+8 is equivalent to 2.5 x 10 ^ (+8)). The switch "-g" takes care of this by performing a true numeric sort.

# sort numerically by increasing FPKM value (E notation IS NOT handled correctly!) 
sort -n -k 10 isoforms_fpkm_track.txt
# check the highest FPKM values
sort -n -k 10 isoforms_fpkm_track.txt | cut -f 10 | tail
# sort numerically by increasing FPKM value (E notation IS handled correctly!) 
sort -g -k 10 isoforms_fpkm_track.txt 
# check the highest FPKM values
sort -g -k 10 isoforms_fpkm_track.txt | cut -f 10 | tail

Filter out the transcripts with length < 125 and length > 10000

Imposing that the length must be greater than 125 and smaller than 10000 in a simple awk statement, yields to records with transcript length meeting the specified criteria.

# print only transcripts with length between 125 and 10000 
awk '$8 > 125 && $8 < 10000' isoforms_fpkm_track.txt
# verify the length of the shortest transcript
awk '$8 > 125 && $8 < 10000' isoforms_fpkm_track.txt | cut -f 8 | sort -n | head 
# verify the length of the longest transcript
awk '$8 > 125 && $8 < 10000' isoforms_fpkm_track.txt | cut -f 8 | sort -n | tail

Separate the locus annotation in 3 columns: chromosome, start, end

Suppose we want to split the locus annotation into chromosome, start position and end position. We can use an awk statement to split the content of the 7th field using the separators ":" and "-".

# split locus into chromosome, start and end 
awk 'split($7, a, ":") split(a[2],b,"-") {print $1,a[1],b[1],b[2]}' isoforms_fpkm_track.txt

Consider only the transcripts in a specific genomic interval

We can extract the records of those isoforms that are mapped within a specific genomic interval. For simplicity's sake, first we split the locus into chromosome, start and end, then we impose limits on the coordinates using awk.

# split locus into chromosome, start and end 
awk 'split($7, a, ":") split(a[2],b,"-") {print $1,a[1],b[1], b[2]}' isoforms_fpkm_track.txt > tmp
# impose condition on start and end position (10,000,000 and 12,000,000)
awk '$3 > 10000000 && $4 < 12000000 {print}' tmp

Last updated

Was this helpful?