LogoLogo
LogoLogo
  • The Barbara K. Ostrom (1978) Bioinformatics and Computing Facility
  • Computing Resources
    • Active Data Storage
    • Archive Data Storage
    • Luria Cluster
      • FAQs
    • Other Resources
  • Bioinformatics Topics
    • Tools - A Basic Bioinformatics Toolkit
      • Getting more out of Microsoft Excel
      • Bioinformatics Applications of Unix
        • Unix commands applied to bioinformatics
        • Manipulate NGS files using UNIX commands
        • Manipulate alignment files using UNIX commands
      • Alignments and Mappers
      • Relational databases
        • Running Joins on Galaxy
      • Spotfire
    • Tasks - Bioinformatics Methods
      • UCSC Genome Bioinformatics
        • Interacting with the UCSC Genome Browser
        • Obtaining DNA sequence from the UCSC Database
        • Obtaining genomic data from the UCSC database using table browser queries
        • Filtering table browser queries
        • Performing a BLAT search
        • Creating Custom Tracks
        • UCSC Intersection Queries
        • Viewing cross-species alignments
        • Galaxy
          • Intro to Galaxy
          • Galaxy NGS Illumina QC
          • Galaxy NGS Illumina SE Mapping
          • Galaxy SNP Interval Data
        • Editing and annotation gene structures with Argo
      • GeneGO MetaCore
        • GeneGo Introduction
        • Loading Data Into GeneGO
        • Data Management in GeneGO
        • Setting Thresholds and Background Sets
        • Search And Browse Content Tab
        • Workflows and Reports Tab
        • One-click Analysis Tab
        • Building Network for Your Experimental Data
      • Functional Annotation of Gene Lists
      • Multiple Sequence Alignment
        • Clustalw2
      • Phylogenetic analysis
        • Neighbor Joining method in Phylip
      • Microarray data processing with R/Bioconductor
    • Running Jupyter notebooks on luria cluster nodes
  • Data Management
    • Globus
  • Mini Courses
    • Schedule
      • Previous Teaching
    • Introduction to Unix and KI Computational Resources
      • Basic Unix
        • Why Unix?
        • The Unix Tree
        • The Unix Terminal and Shell
        • Anatomy of a Unix Command
        • Basic Unix Commands
        • Output Redirection and Piping
        • Manual Pages
        • Access Rights
        • Unix Text Editors
          • nano
          • vi / vim
          • emacs
        • Shell Scripts
      • Software Installation
        • Module
        • Conda Environment
      • Slurm
    • Introduction to Unix
      • Why Unix?
      • The Unix Filesystem
        • The Unix Tree
        • Network Filesystems
      • The Unix Shell
        • About the Unix Shell
        • Unix Shell Manual Pages
        • Using the Unix Shell
          • Viewing the Unix Tree
          • Traversing the Unix Tree
          • Editing the Unix Tree
          • Searching the Unix Tree
      • Files
        • Viewing File Contents
        • Creating and Editing Files
        • Manipulating Files
        • Symbolic Links
        • File Ownership
          • How Unix File Ownership Works
          • Change File Ownership and Permissions
        • File Transfer (in-progress)
        • File Storage and Compression
      • Getting System Information
      • Writing Scripts
      • Schedule Scripts Using Crontab
    • Advanced Utilization of IGB Computational Resources
      • High Performance Computing Clusters
      • Slurm
        • Checking the Status of Computing Nodes
        • Submitting Jobs / Slurm Scripts
        • Interactive Sessions
      • Package Management
        • The System Package Manager
        • Environment Modules
        • Conda Environments
      • SSH Port Forwarding
        • SSH Port Forwarding Jupyter Notebooks
      • Containerization
        • Docker
          • Docker Installation
          • Running Docker Images
          • Building Docker Images
        • Singularity
          • Differences from Docker
          • Running Images in Singularity
      • Running Nextflow / nf-core Pipelines
    • Python
      • Introduction to Python for Biologists
        • Interactive Python
        • Types
          • Strings
          • Lists
          • Tuples
          • Dictionaries
        • Control Flow
        • Loops
          • For Loops
          • While Loops
        • Control Flows and Loops
        • Storing Programs for Re-use
        • Reading and Writing Files
        • Functions
      • Biopython
        • About Biopython
        • Quick Start
          • Basic Sequence Analyses
          • SeqRecord
          • Sequence IO
          • Exploration of Entrez Databases
        • Example Projects
          • Coronavirus Exploration
          • Translating a eukaryotic FASTA file of CDS entries
        • Further Resources
      • Machine Learning with Python
        • About Machine Learning
        • Hands-On
          • Project Introduction
          • Supervised Approaches
            • The Logistic Regression Model
            • K-Nearest Neighbors
          • Unsupervised Approaches
            • K-Means Clustering
          • Further Resources
      • Data Processing with Python
        • Pandas
          • About Pandas
          • Making DataFrames
          • Inspecting DataFrames
          • Slicing DataFrames
          • Selecting from DataFrames
          • Editing DataFrames
        • Matplotlib
          • About Matplotlib
          • Basic Plotting
          • Advanced Plotting
        • Seaborn
          • About Seaborn
          • Basic Plotting
          • Visualizing Statistics
          • Visualizing Proteomics Data
          • Visualizing RNAseq Data
    • R
      • Intro to R
        • Before We Start
        • Getting to Know R
        • Variables in R
        • Functions in R
        • Data Manipulation
        • Simple Statistics in R
        • Basic Plotting in R
        • Advanced Plotting in R
        • Writing Figures to a File
        • Further Resources
    • Version Control with Git
      • About Version Control
      • Setting up Git
      • Creating a Repository
      • Tracking Changes
        • Exercises
      • Exploring History
        • Exercises
      • Ignoring Things
      • Remotes in Github
      • Collaborating
      • Conflicts
      • Open Science
      • Licensing
      • Citation
      • Hosting
      • Supplemental
Powered by GitBook

MIT Resources

  • https://accessibility.mit.edu

Massachusetts Institute of Technology

On this page
  • How many genes and transcripts are in the file?
  • Filter out the transcripts with FPKM value equal to 0
  • Filter out the transcripts with status other than "OK"
  • Filter out the transcripts with length < 125 and length > 10000
  • Separate the locus annotation in 3 columns: chromosome, start, end
  • Consider only the transcripts in a specific genomic interval

Was this helpful?

Export as PDF
  1. Bioinformatics Topics
  2. Tools - A Basic Bioinformatics Toolkit
  3. Bioinformatics Applications of Unix

Manipulate NGS files using UNIX commands

PreviousUnix commands applied to bioinformaticsNextManipulate alignment files using UNIX commands

Last updated 1 year ago

Was this helpful?

The following examples use the file isoforms_fpkm_track.txt that you can copy by typing the following command on Luria:

This is incorrect! Will update this later.

# copy the file from the stock directory to the current directory 
cp /net/ostrom/data/rowleybcc/charliew/old_teaching/Biol2Bioinfo/UNIX/isoforms_fpkm_track.txt .

The given file is tab delimited and represents one of the outputs of the transcriptome assembly tool .

  • tracking_id: A unique identifier describing the object (gene, transcript, CDS, primary transcript)

  • class_code: The class_code attribute for the object, or "-"

  • nearest_ref_id: The reference transcript to which the class code refers, if any

  • gene_id: The gene ID(s) associated with the object

  • gene_short_name: The gene short name(s) associated with the object

  • tss_id: The tss ID associated with the object, or "-"

  • locus: Genomic coordinates of the object

  • length: The number of base pairs in the transcript, or '-' if not a transcript/primary transcript

  • coverage: Estimate for the absolute depth of read coverage across the object

  • FPKM: FPKM of the object

  • FPKM_lo: The lower bound of the 95% confidence interval on the FPKM of the object

  • FPKM_hi: The upper bound of the 95% confidence interval on the FPKM of the object

  • status: Quantification status for the object. Can be one of: OK (deconvolution successful), LOWDATA (too complex or shallowly sequenced), HIDATA (too many fragments in locus), or FAIL (when an ill-conditioned covariance matrix or other numerical exception prevents deconvolution).

In this tutorial we will answer the following questions about the given file:

How many genes and transcripts are in the file?

By counting how many distinct entries there are in the 1st and 4th column, we can determine the number of transcripts and genes respectively.

# count the number of isoforms 
cut -f 1 isoforms_fpkm_track.txt | sort | uniq | wc -l
# count the number of genes 
cut -f 4 isoforms_fpkm_track.txt | sort | uniq | wc -l

Filter out the transcripts with FPKM value equal to 0

The FPKM values are reported in the 10th field. We can ignore the transcripts with zero FPKM value by using simple awk statement.

# print isoforms with nonzero FPKM value 
awk '$10 !=0  {print $0}' isoforms_fpkm_track.txt

Filter out the transcripts with status other than "OK"

Transcripts with status other than "OK" are generally not suitable for downstream analysis. We can filter these out by considering only those records whose last field matches the string "OK".

# print isoforms with nonzero FPKM value 
awk '$13 == "OK" {print $0}' isoforms_fpkm_track.txt 

Note that using a naive grep statement might not always work because a naive grep match is performed on the entire line, rather than on a specific field.

Sort by increasing FPKM values

Because FPKM values are often reported in a scientific format called E notation, a simple numeric sort (specified with the switch -n) does not work properly because the character "e" in the notation is not interpreted as it should (e.g. 2.5e+8 is equivalent to 2.5 x 10 ^ (+8)). The switch "-g" takes care of this by performing a true numeric sort.

# sort numerically by increasing FPKM value (E notation IS NOT handled correctly!) 
sort -n -k 10 isoforms_fpkm_track.txt
# check the highest FPKM values
sort -n -k 10 isoforms_fpkm_track.txt | cut -f 10 | tail
# sort numerically by increasing FPKM value (E notation IS handled correctly!) 
sort -g -k 10 isoforms_fpkm_track.txt 
# check the highest FPKM values
sort -g -k 10 isoforms_fpkm_track.txt | cut -f 10 | tail

Filter out the transcripts with length < 125 and length > 10000

Imposing that the length must be greater than 125 and smaller than 10000 in a simple awk statement, yields to records with transcript length meeting the specified criteria.

# print only transcripts with length between 125 and 10000 
awk '$8 > 125 && $8 < 10000' isoforms_fpkm_track.txt
# verify the length of the shortest transcript
awk '$8 > 125 && $8 < 10000' isoforms_fpkm_track.txt | cut -f 8 | sort -n | head 
# verify the length of the longest transcript
awk '$8 > 125 && $8 < 10000' isoforms_fpkm_track.txt | cut -f 8 | sort -n | tail

Separate the locus annotation in 3 columns: chromosome, start, end

Suppose we want to split the locus annotation into chromosome, start position and end position. We can use an awk statement to split the content of the 7th field using the separators ":" and "-".

# split locus into chromosome, start and end 
awk 'split($7, a, ":") split(a[2],b,"-") {print $1,a[1],b[1],b[2]}' isoforms_fpkm_track.txt

Consider only the transcripts in a specific genomic interval

We can extract the records of those isoforms that are mapped within a specific genomic interval. For simplicity's sake, first we split the locus into chromosome, start and end, then we impose limits on the coordinates using awk.

# split locus into chromosome, start and end 
awk 'split($7, a, ":") split(a[2],b,"-") {print $1,a[1],b[1], b[2]}' isoforms_fpkm_track.txt > tmp
# impose condition on start and end position (10,000,000 and 12,000,000)
awk '$3 > 10000000 && $4 < 12000000 {print}' tmp

A description of the columns is reported here for convenience, but more information can be found in the .

cufflinks manual
Cufflinks