LogoLogo
LogoLogo
  • The Barbara K. Ostrom (1978) Bioinformatics and Computing Facility
  • Computing Resources
    • Active Data Storage
    • Archive Data Storage
    • Luria Cluster
      • FAQs
    • Other Resources
  • Bioinformatics Topics
    • Tools - A Basic Bioinformatics Toolkit
      • Getting more out of Microsoft Excel
      • Bioinformatics Applications of Unix
        • Unix commands applied to bioinformatics
        • Manipulate NGS files using UNIX commands
        • Manipulate alignment files using UNIX commands
      • Alignments and Mappers
      • Relational databases
        • Running Joins on Galaxy
      • Spotfire
    • Tasks - Bioinformatics Methods
      • UCSC Genome Bioinformatics
        • Interacting with the UCSC Genome Browser
        • Obtaining DNA sequence from the UCSC Database
        • Obtaining genomic data from the UCSC database using table browser queries
        • Filtering table browser queries
        • Performing a BLAT search
        • Creating Custom Tracks
        • UCSC Intersection Queries
        • Viewing cross-species alignments
        • Galaxy
          • Intro to Galaxy
          • Galaxy NGS Illumina QC
          • Galaxy NGS Illumina SE Mapping
          • Galaxy SNP Interval Data
        • Editing and annotation gene structures with Argo
      • GeneGO MetaCore
        • GeneGo Introduction
        • Loading Data Into GeneGO
        • Data Management in GeneGO
        • Setting Thresholds and Background Sets
        • Search And Browse Content Tab
        • Workflows and Reports Tab
        • One-click Analysis Tab
        • Building Network for Your Experimental Data
      • Functional Annotation of Gene Lists
      • Multiple Sequence Alignment
        • Clustalw2
      • Phylogenetic analysis
        • Neighbor Joining method in Phylip
      • Microarray data processing with R/Bioconductor
    • Running Jupyter notebooks on luria cluster nodes
  • Data Management
    • Globus
  • Mini Courses
    • Schedule
      • Previous Teaching
    • Introduction to Unix and KI Computational Resources
      • Basic Unix
        • Why Unix?
        • The Unix Tree
        • The Unix Terminal and Shell
        • Anatomy of a Unix Command
        • Basic Unix Commands
        • Output Redirection and Piping
        • Manual Pages
        • Access Rights
        • Unix Text Editors
          • nano
          • vi / vim
          • emacs
        • Shell Scripts
      • Software Installation
        • Module
        • Conda Environment
      • Slurm
    • Introduction to Unix
      • Why Unix?
      • The Unix Filesystem
        • The Unix Tree
        • Network Filesystems
      • The Unix Shell
        • About the Unix Shell
        • Unix Shell Manual Pages
        • Using the Unix Shell
          • Viewing the Unix Tree
          • Traversing the Unix Tree
          • Editing the Unix Tree
          • Searching the Unix Tree
      • Files
        • Viewing File Contents
        • Creating and Editing Files
        • Manipulating Files
        • Symbolic Links
        • File Ownership
          • How Unix File Ownership Works
          • Change File Ownership and Permissions
        • File Transfer (in-progress)
        • File Storage and Compression
      • Getting System Information
      • Writing Scripts
      • Schedule Scripts Using Crontab
    • Advanced Utilization of IGB Computational Resources
      • High Performance Computing Clusters
      • Slurm
        • Checking the Status of Computing Nodes
        • Submitting Jobs / Slurm Scripts
        • Interactive Sessions
      • Package Management
        • The System Package Manager
        • Environment Modules
        • Conda Environments
      • SSH Port Forwarding
        • SSH Port Forwarding Jupyter Notebooks
      • Containerization
        • Docker
          • Docker Installation
          • Running Docker Images
          • Building Docker Images
        • Singularity
          • Differences from Docker
          • Running Images in Singularity
      • Running Nextflow / nf-core Pipelines
    • Python
      • Introduction to Python for Biologists
        • Interactive Python
        • Types
          • Strings
          • Lists
          • Tuples
          • Dictionaries
        • Control Flow
        • Loops
          • For Loops
          • While Loops
        • Control Flows and Loops
        • Storing Programs for Re-use
        • Reading and Writing Files
        • Functions
      • Biopython
        • About Biopython
        • Quick Start
          • Basic Sequence Analyses
          • SeqRecord
          • Sequence IO
          • Exploration of Entrez Databases
        • Example Projects
          • Coronavirus Exploration
          • Translating a eukaryotic FASTA file of CDS entries
        • Further Resources
      • Machine Learning with Python
        • About Machine Learning
        • Hands-On
          • Project Introduction
          • Supervised Approaches
            • The Logistic Regression Model
            • K-Nearest Neighbors
          • Unsupervised Approaches
            • K-Means Clustering
          • Further Resources
      • Data Processing with Python
        • Pandas
          • About Pandas
          • Making DataFrames
          • Inspecting DataFrames
          • Slicing DataFrames
          • Selecting from DataFrames
          • Editing DataFrames
        • Matplotlib
          • About Matplotlib
          • Basic Plotting
          • Advanced Plotting
        • Seaborn
          • About Seaborn
          • Basic Plotting
          • Visualizing Statistics
          • Visualizing Proteomics Data
          • Visualizing RNAseq Data
    • R
      • Intro to R
        • Before We Start
        • Getting to Know R
        • Variables in R
        • Functions in R
        • Data Manipulation
        • Simple Statistics in R
        • Basic Plotting in R
        • Advanced Plotting in R
        • Writing Figures to a File
        • Further Resources
    • Version Control with Git
      • About Version Control
      • Setting up Git
      • Creating a Repository
      • Tracking Changes
        • Exercises
      • Exploring History
        • Exercises
      • Ignoring Things
      • Remotes in Github
      • Collaborating
      • Conflicts
      • Open Science
      • Licensing
      • Citation
      • Hosting
      • Supplemental
Powered by GitBook

MIT Resources

  • https://accessibility.mit.edu

Massachusetts Institute of Technology

On this page
  • Getting Started
  • Volcano Plot
  • Heatmap
  • Another Example

Was this helpful?

Export as PDF
  1. Mini Courses
  2. Python
  3. Data Processing with Python
  4. Seaborn

Visualizing RNAseq Data

PreviousVisualizing Proteomics DataNextR

Last updated 1 year ago

Was this helpful?

Getting Started

import pandas as pd
import numpy as np
import seaborn as sns
import glob
import matplotlib.pyplot as plt
sns.set_context('paper')
sns.set_style("whitegrid")

Volcano Plot

This is what we aim to reproduce basing on the file volcano_data.tsv. Let's read the volcano_data.tsv file into a pandas dataframe. glob.glob('C:\\Users\duan\Desktop\PythonDataProcessingVisualization\*.tsv') # get a list of files in your directory ending in .tsv

vol = pd.read_csv('C:\\Users\\duan\\Desktop\\PythonDataProcessingVisualization\\KOvsWTdiffExp.tsv', sep='\t')

inspect the 'head' of the file

vol.head()
vol.shape

(4900, 6)

Notice that the qvalues in the file need to be log-transformed to match the figure Create a new column in the vol dataframe where 'log10_q' = -log_base_10(qval) for each gene...the numpy function np.log10 is helpful

vol['log10_q'] = -np.log10(vol['padj']) #calculate the -log2 of qvalues

output a summary of the data (use the .describe() function)

vol.describe()

We want to plot and color the genes that increase in knockout, decrease with knockout, and show no significant change (q-val > 0.05). Let's categorize our data

create a new column called 'data_category' with entires 'increases in Knockout', 'decreases in Knockout', and 'not significant' set these values appropriately for each gene hint - inspect the 'log2FoldChange' value or the 'padj' fields to determine each case

sns.scatterplot(data=vol, y='log10_q', x='log2FoldChange', hue='data_category', legend='brief',
                palette={'not significant':'grey', 'increases in Knockout':"red", 'decreases in Knockout':"blue"}, 
                edgecolor='grey',s=80, linewidth=0.25)

save a list of the significant genes and their qvalues (any gene_ids with qval<0.05), and output this list to a .csv file

sig_genes = vol.loc[vol['padj']<.05, ['geneID', 'padj', 'log2FoldChange']]
sig_genes.shape

(2582, 3)

sig_genes.head(8)

Save significant genes to a csv file

sig_genes.to_csv('C:\\Users\duan\Desktop\PythonDataProcessingVisualization\significant_hits.csv')

Heatmap

Next, we are going to reproduce a heatmap below. This clustered heat map aims to show groups of genes whose transcript levels are coordinated across age or mutant background

The plotting will be based on rpkm.tsv which contains ~600 significant genes we want to inspect.

read in the 'rpkm.tsv' file as a pandas dataframe, save it as dataset

dataset = pd.read_csv('C:\\Users\duan\Desktop\PythonDataProcessingVisualization\\rpkm.tsv', sep='\t')

briefly inspect the dataframe for shape, general entries, and summary statistics

dataset.shape

(600,7)

dataset.head()
dataset.describe()

People often plot using 'row median centered' each gene This means they divided each row by the median value across that entire row. They also log-transformed that result

Calculate the median value for each gene (row)

row_medians = dataset.median(axis=1,numeric_only=True)

Now create a copy of the dataset, and save it as dataset_row_norm Row median center and log2-transform each gene in dataset_row_norm

dataset_row_norm = dataset.copy() #make a copy of the dataset so we can manipulate it
for col in dataset.columns[1:]:
    dataset_row_norm[col] = np.log2((dataset[col]+0.1)/(row_medians+0.1))

look at the summary statistics now Also, look at a few random gene rows and make sure the result in sensible Finally, look at the .head() to see how it is indexed

dataset_row_norm.describe()
dataset_row_norm.iloc[200,1:].median() #test some random rows, make sure the median value is 0

-0.04326176484815142

dataset_row_norm.head()

change the indexing to use the gene name instead of the row number

dataset_row_norm = dataset_row_norm.set_index('geneID')

Now let's plot the full dataset. Use seaborn clustermap to generate a heat map and cluster each row. Look at the seaborn clustermap documentation to figure out what arguments to pass

import fastcluster
sns.clustermap(dataset_row_norm, row_cluster=True, col_cluster=False, cmap="RdBu_r", figsize=(10,10))

Another Example

sometimes you need additional work to make a nice heatmap. See the example below:

Read in the example data

dataset = pd.read_csv('C:\\Users\duan\Desktop\PythonDataProcessingVisualization\\TPM_reads_raw.tsv', sep='\t')

Examine the data

dataset.shape

(5823, 36)

dataset.head()
dataset.describe()

Prepare the data for heatmap plotting

row_medians = dataset.median(axis=1,numeric_only=True) #calculate the median of each row
dataset_row_norm = dataset.copy() #make a copy of the dataset so we can manipulate it
for col in dataset.columns[1:]:
    dataset_row_norm[col] = np.log2((dataset[col]+0.1)/(row_medians+0.1))
dataset_row_norm = dataset_row_norm.set_index('gene')

Heatmap plotting

sns.clustermap(dataset_row_norm, row_cluster=True, col_cluster=False, cmap="RdBu_r", figsize=(10,10))

It is necessary to 'zoom' in on the genes that showed significant changes. Sometimes people 'capped' their fold changes at -1.5 and +1.5, we'll do the same.

Start by copying our row-normalized dataframe to a new dataframe and setting all values greater than 1.5 to 1.5, and all less than -1.5 to -1.5 Save this as capped_row_norm_dataset

capped_row_norm_dataset = dataset_row_norm.copy()
capped_row_norm_dataset[capped_row_norm_dataset>1.5] = 1.5
capped_row_norm_dataset[capped_row_norm_dataset<-1.5] = -1.5

Look at the summary statistics to see if this worked

Look at the summary statistics to see if this worked

Read volcano plot file

vol = pd.read_csv('C:\\Users\\duan\\Desktop\\PythonDataProcessingVisualization\\volcano_data.tsv', sep='\t')

Inspect volcano plot file

vol.head()
vol.shape

(13547, 4)

Prepare for volcano plotting

vol['log10_q'] = -np.log10(vol['qval']) #calculate the -log2 of qvalues
vol.describe()
vol.loc[vol['log2foldchange']>0,'data_category'] = 'increases with age'
vol.loc[vol['log2foldchange']<0,'data_category'] = 'decreases with age'
vol.loc[vol['qval']>0.05,'data_category'] = 'not significant'

Volcano plotting

sns.scatterplot(data=vol, y='log10_q', x='log2foldchange', hue='data_category', legend='brief',
                palette={'not significant':'grey', 'increases with age':"red", 'decreases with age':"blue"}, 
                edgecolor='grey',s=80, linewidth=0.25)

Identify significant genes

sig_genes = vol.loc[vol['qval']<.05, ['gene_id', 'qval', 'log2foldchange']]

Now we need to just pull out the genes that show significant age-dependence Make a list of all the gene names that are in both dataset_row_norm and the list of significant genes (sig_genes) from above. Save this list as genes_to_cluster

#get the genenames that are the same between the datasets
genes_to_cluster = [gene for gene in capped_row_norm_dataset.index if gene in sig_genes.values] 

use the .loc function to pull out just the rows of genes we want to cluster, and see how many genes that is

capped_row_norm_dataset.loc[genes_to_cluster, :].shape

(1731, 35)

Almost there - now make your heatmap!

sns.clustermap(capped_row_norm_dataset.loc[genes_to_cluster, :], row_cluster=True, col_cluster=False, cmap="RdBu_r", figsize=(8,8))