Manipulate NGS files using UNIX commands
Last updated
Last updated
MIT Resources
https://accessibility.mit.eduMassachusetts Institute of Technology
The following examples use the file isoforms_fpkm_track.txt
that you can copy by typing the following command on Luria:
This is incorrect! Will update this later.
The given file is tab delimited and represents one of the outputs of the transcriptome assembly tool Cufflinks.
A description of the columns is reported here for convenience, but more information can be found in the cufflinks manual.
tracking_id: A unique identifier describing the object (gene, transcript, CDS, primary transcript)
class_code: The class_code attribute for the object, or "-"
nearest_ref_id: The reference transcript to which the class code refers, if any
gene_id: The gene ID(s) associated with the object
gene_short_name: The gene short name(s) associated with the object
tss_id: The tss ID associated with the object, or "-"
locus: Genomic coordinates of the object
length: The number of base pairs in the transcript, or '-' if not a transcript/primary transcript
coverage: Estimate for the absolute depth of read coverage across the object
FPKM: FPKM of the object
FPKM_lo: The lower bound of the 95% confidence interval on the FPKM of the object
FPKM_hi: The upper bound of the 95% confidence interval on the FPKM of the object
status: Quantification status for the object. Can be one of: OK (deconvolution successful), LOWDATA (too complex or shallowly sequenced), HIDATA (too many fragments in locus), or FAIL (when an ill-conditioned covariance matrix or other numerical exception prevents deconvolution).
In this tutorial we will answer the following questions about the given file:
By counting how many distinct entries there are in the 1st and 4th column, we can determine the number of transcripts and genes respectively.
The FPKM values are reported in the 10th field. We can ignore the transcripts with zero FPKM value by using simple awk statement.
Transcripts with status other than "OK" are generally not suitable for downstream analysis. We can filter these out by considering only those records whose last field matches the string "OK".
Note that using a naive grep statement might not always work because a naive grep match is performed on the entire line, rather than on a specific field.
Sort by increasing FPKM values
Because FPKM values are often reported in a scientific format called E notation, a simple numeric sort (specified with the switch -n) does not work properly because the character "e" in the notation is not interpreted as it should (e.g. 2.5e+8 is equivalent to 2.5 x 10 ^ (+8)). The switch "-g" takes care of this by performing a true numeric sort.
Imposing that the length must be greater than 125 and smaller than 10000 in a simple awk statement, yields to records with transcript length meeting the specified criteria.
Suppose we want to split the locus annotation into chromosome, start position and end position. We can use an awk statement to split the content of the 7th field using the separators ":" and "-".
We can extract the records of those isoforms that are mapped within a specific genomic interval. For simplicity's sake, first we split the locus into chromosome, start and end, then we impose limits on the coordinates using awk.