LogoLogo
LogoLogo
  • The Barbara K. Ostrom (1978) Bioinformatics and Computing Facility
  • Computing Resources
    • Active Data Storage
    • Archive Data Storage
    • Luria Cluster
      • FAQs
    • Other Resources
  • Bioinformatics Topics
    • Tools - A Basic Bioinformatics Toolkit
      • Getting more out of Microsoft Excel
      • Bioinformatics Applications of Unix
        • Unix commands applied to bioinformatics
        • Manipulate NGS files using UNIX commands
        • Manipulate alignment files using UNIX commands
      • Alignments and Mappers
      • Relational databases
        • Running Joins on Galaxy
      • Spotfire
    • Tasks - Bioinformatics Methods
      • UCSC Genome Bioinformatics
        • Interacting with the UCSC Genome Browser
        • Obtaining DNA sequence from the UCSC Database
        • Obtaining genomic data from the UCSC database using table browser queries
        • Filtering table browser queries
        • Performing a BLAT search
        • Creating Custom Tracks
        • UCSC Intersection Queries
        • Viewing cross-species alignments
        • Galaxy
          • Intro to Galaxy
          • Galaxy NGS Illumina QC
          • Galaxy NGS Illumina SE Mapping
          • Galaxy SNP Interval Data
        • Editing and annotation gene structures with Argo
      • GeneGO MetaCore
        • GeneGo Introduction
        • Loading Data Into GeneGO
        • Data Management in GeneGO
        • Setting Thresholds and Background Sets
        • Search And Browse Content Tab
        • Workflows and Reports Tab
        • One-click Analysis Tab
        • Building Network for Your Experimental Data
      • Functional Annotation of Gene Lists
      • Multiple Sequence Alignment
        • Clustalw2
      • Phylogenetic analysis
        • Neighbor Joining method in Phylip
      • Microarray data processing with R/Bioconductor
    • Running Jupyter notebooks on luria cluster nodes
  • Data Management
    • Globus
  • Mini Courses
    • Schedule
      • Previous Teaching
    • Introduction to Unix and KI Computational Resources
      • Basic Unix
        • Why Unix?
        • The Unix Tree
        • The Unix Terminal and Shell
        • Anatomy of a Unix Command
        • Basic Unix Commands
        • Output Redirection and Piping
        • Manual Pages
        • Access Rights
        • Unix Text Editors
          • nano
          • vi / vim
          • emacs
        • Shell Scripts
      • Software Installation
        • Module
        • Conda Environment
      • Slurm
    • Introduction to Unix
      • Why Unix?
      • The Unix Filesystem
        • The Unix Tree
        • Network Filesystems
      • The Unix Shell
        • About the Unix Shell
        • Unix Shell Manual Pages
        • Using the Unix Shell
          • Viewing the Unix Tree
          • Traversing the Unix Tree
          • Editing the Unix Tree
          • Searching the Unix Tree
      • Files
        • Viewing File Contents
        • Creating and Editing Files
        • Manipulating Files
        • Symbolic Links
        • File Ownership
          • How Unix File Ownership Works
          • Change File Ownership and Permissions
        • File Transfer (in-progress)
        • File Storage and Compression
      • Getting System Information
      • Writing Scripts
      • Schedule Scripts Using Crontab
    • Advanced Utilization of IGB Computational Resources
      • High Performance Computing Clusters
      • Slurm
        • Checking the Status of Computing Nodes
        • Submitting Jobs / Slurm Scripts
        • Interactive Sessions
      • Package Management
        • The System Package Manager
        • Environment Modules
        • Conda Environments
      • SSH Port Forwarding
        • SSH Port Forwarding Jupyter Notebooks
      • Containerization
        • Docker
          • Docker Installation
          • Running Docker Images
          • Building Docker Images
        • Singularity
          • Differences from Docker
          • Running Images in Singularity
      • Running Nextflow / nf-core Pipelines
    • Python
      • Introduction to Python for Biologists
        • Interactive Python
        • Types
          • Strings
          • Lists
          • Tuples
          • Dictionaries
        • Control Flow
        • Loops
          • For Loops
          • While Loops
        • Control Flows and Loops
        • Storing Programs for Re-use
        • Reading and Writing Files
        • Functions
      • Biopython
        • About Biopython
        • Quick Start
          • Basic Sequence Analyses
          • SeqRecord
          • Sequence IO
          • Exploration of Entrez Databases
        • Example Projects
          • Coronavirus Exploration
          • Translating a eukaryotic FASTA file of CDS entries
        • Further Resources
      • Machine Learning with Python
        • About Machine Learning
        • Hands-On
          • Project Introduction
          • Supervised Approaches
            • The Logistic Regression Model
            • K-Nearest Neighbors
          • Unsupervised Approaches
            • K-Means Clustering
          • Further Resources
      • Data Processing with Python
        • Pandas
          • About Pandas
          • Making DataFrames
          • Inspecting DataFrames
          • Slicing DataFrames
          • Selecting from DataFrames
          • Editing DataFrames
        • Matplotlib
          • About Matplotlib
          • Basic Plotting
          • Advanced Plotting
        • Seaborn
          • About Seaborn
          • Basic Plotting
          • Visualizing Statistics
          • Visualizing Proteomics Data
          • Visualizing RNAseq Data
    • R
      • Intro to R
        • Before We Start
        • Getting to Know R
        • Variables in R
        • Functions in R
        • Data Manipulation
        • Simple Statistics in R
        • Basic Plotting in R
        • Advanced Plotting in R
        • Writing Figures to a File
        • Further Resources
    • Version Control with Git
      • About Version Control
      • Setting up Git
      • Creating a Repository
      • Tracking Changes
        • Exercises
      • Exploring History
        • Exercises
      • Ignoring Things
      • Remotes in Github
      • Collaborating
      • Conflicts
      • Open Science
      • Licensing
      • Citation
      • Hosting
      • Supplemental
Powered by GitBook

MIT Resources

  • https://accessibility.mit.edu

Massachusetts Institute of Technology

On this page
  • Prepare data to practice ggplot2
  • Advanced Boxplot
  • Advanced Bar Plot
  • PCA
  • Heatmaps

Was this helpful?

Export as PDF
  1. Mini Courses
  2. R
  3. Intro to R

Advanced Plotting in R

PreviousBasic Plotting in RNextWriting Figures to a File

Last updated 5 months ago

Was this helpful?

There’s also a plotting package called that adds a lot of functionality to the basic plots seen above. The syntax takes some getting used to but it’s extremely powerful and flexible. We can start by re-creating some of the above plots but using ggplot functions to get a feel for the syntax.

ggplot is best used on data in the data.frame form, so we will work with our combined df for the following figures. Let’s start by loading the ggplot2 library.

library(ggplot2)

Prepare data to practice ggplot2

baseDir<-getwd()
dataDir<-file.path(baseDir,"data")
metadata <- read.table(file.path(dataDir, 'mouse_exp_design.csv'), header=T, sep=",", row.names=1)
rpkm_data <- read.table(file.path(dataDir, 'counts.rpkm'), header=T, sep=",", row.names=1)
m <- match(row.names(metadata), colnames(rpkm_data))
data_ordered  <- rpkm_data[,m]
samplemeans <- apply(data_ordered, 2, mean)
# Create a combined data frame
all(rownames(metadata) == names(samplemeans)) # sanity check for sample order
df <- cbind(metadata, samplemeans) 

Advanced Boxplot

The ggplot() command creates a plot object. In it we assign our data frame to the data argument, and aes() creates what Hadley Wickham calls an aesthetic: a mapping of variables to various parts of the plot. Note that ggplot functions can be chained with + signs to adding layers to the final plot. The next in chain is geom_boxplot(). The geom function specifies the geometric objects that define the graph type. The geom option is expressed as a character vector with one or more entries. Values include geom_point, geom_boxplot, geom_line etc

ggplot(data=df, aes(x= genotype, y=samplemeans))+geom_boxplot() 

Unlike base R graphs, the ggplot graphs are not effected by many of the options set in the par() function (e.g. adjusting relative size of axis labels using cex). They can be modified using the theme() function, and by adding graphic parameters. Here, we will increase the size of the axis labels and the main title. We can also change the fill variable to celltype

ggplot(data=df, aes(x= genotype, y=samplemeans, fill=celltype)) + 
  geom_boxplot() + 
  ggtitle('Genotype differences in average gene expression') +
  xlab('Genotype') +
  ylab('Mean expression') +
  theme(plot.title = element_text(size = rel(2.0)),
        axis.title = element_text(size = rel(1.5)),
        axis.text = element_text(size = rel(1.25)))

Advanced Bar Plot

For the bar plot, we need to define the graph type to geom_bar. Since we don’t have an x variable, we need to specify the row names as our index so each sample is plotted on its own.

ggplot(data=df, aes(x=row.names(df), y=samplemeans, fill=genotype)) +
  geom_bar(colour="black", stat="identity") +
  ggtitle('Average expression for each sample') +
  xlab('') +
  ylab('Mean expression') +
  theme(plot.title = element_text(size = rel(2.0)),
        axis.title = element_text(size = rel(1.5)),
        axis.text = element_text(size = rel(1.25)),
        axis.text.x = element_text(angle=45, vjust=0.5, hjust=0.6, size = rel(1.25)))

PCA

A figure that is often used in exploratory analsyis of data is PCA plot. PCA (principal components analysis) is a multivariate technique that allows us to summarize the systematic patterns of variations in the data. PCA takes the expresson levels for all probes and transforms it in principal component space, reducing each sample into one point (as coordinates within that space). This allows us to separate samples according to expression variation, and identify potential outliers.

Data preparation for PCA

library(DESeq)
library(DESeq2)
library(pasilla)
data("pasillaGenes")
countData<-counts(pasillaGenes)
colData<-pData(pasillaGenes)[,c("condition","type")]

Let's first take a look at the gene expression count matrix countData.

head(countData)
            treated1fb treated2fb treated3fb untreated1fb untreated2fb
FBgn0000003          0          0          1            0            0
FBgn0000008         78         46         43           47           89
FBgn0000014          2          0          0            0            0
FBgn0000015          1          0          1            0            1
FBgn0000017       3187       1672       1859         2445         4615
FBgn0000018        369        150        176          288          383
            untreated3fb untreated4fb
FBgn0000003            0            0
FBgn0000008           53           27
FBgn0000014            1            0
FBgn0000015            1            2
FBgn0000017         2063         1711
FBgn0000018          135          174

Then let's take a look at the sample description file colData

colData
             condition        type
treated1fb     treated single-read
treated2fb     treated  paired-end
treated3fb     treated  paired-end
untreated1fb untreated single-read
untreated2fb untreated single-read
untreated3fb untreated  paired-end
untreated4fb untreated  paired-end

One thing we notice is that the count Matrix is genes are rows while samples are columns. But prcomp required to have samples as our rows and genes as our columns.That is, we need a transposed version of what we currently have. R has a built-in function to transpose which is denoted by t(). Let's take a look at how t() works.

> transposed_data<-t(countData)
> transposed_data[,1:5]
             FBgn0000003 FBgn0000008 FBgn0000014 FBgn0000015 FBgn0000017
treated1fb             0          78           2           1        3187
treated2fb             0          46           0           0        1672
treated3fb             1          43           0           1        1859
untreated1fb           0          47           0           0        2445
untreated2fb           0          89           0           1        4615
untreated3fb           0          53           1           1        2063
untreated4fb           0          27           0           2        1711

Running PCA

Now samples are rows and genes are columns after t() function is applied, and we are ready to run prcomp().

pca_results <- prcomp(t(countData))

Use the str function to take a quick peek at what is returned to us from the prcomp function. You can cross-reference with the help pages to see that is corresponds with what you are expected to be returned (?prcomp). There should be a list of five objects; the one we are interested in is x which is a matrix of the principal component vectors. Let's save that data matrix by assigning it to a new variable.

pc_mat <- pca_results$x

Plotting PCA

We are going to take a look at the first two principal components by plotting them against each other. Since we will want to include information from our sample description object colData, we can concatenate the PCA results to our colData into a data frame for input to ggplot. The graphic type that we are using is a scatter plot denoted by geom_point(), and we have specified to color by condition.

df <- cbind(colData, pc_mat[,c('PC1', 'PC2')])
library(ggplot2)
ggplot(df, aes(PC1, PC2, color = condition)) + geom_point(size=3)

Heatmaps

Another useful plot used to identify patterns in your data and potential outliers is to use heatmaps. A heatmap is a graphical representation of data where the individual values contained in a matrix are represented as colors. Heat maps are well-suited for visualizing large amounts of multi-dimensional data and can be used to identify clusters of rows or columns with similar values, as these are displayed as areas of similar color.

Our data matrix is quite large, and a heatmap would be rather informative not having selected a subset of genes. Instead, we will generate a sample-to-sample correlation matrix by taking the correlation of count values for all pairwise combinations of samples. To compute the correlations R has a built-in function for that, cor which can take in either two vectors or an entire matrix. We will give it the same input we used for PCA above.

cor_mat <- cor(countData)

Check the dimensions of the matrix that is returned, and the range of values. Take a quick peek inside cor_mat.

cor_mat
             treated1fb treated2fb treated3fb untreated1fb untreated2fb
treated1fb    1.0000000  0.9769470  0.9699387    0.9748449    0.9659681
treated2fb    0.9769470  1.0000000  0.9960757    0.9747916    0.9565659
treated3fb    0.9699387  0.9960757  1.0000000    0.9705672    0.9538853
untreated1fb  0.9748449  0.9747916  0.9705672    1.0000000    0.9758301
untreated2fb  0.9659681  0.9565659  0.9538853    0.9758301    1.0000000
untreated3fb  0.9547849  0.9738014  0.9746987    0.9772582    0.9783300
untreated4fb  0.9412804  0.9661953  0.9725255    0.9723502    0.9678800
             untreated3fb untreated4fb
treated1fb      0.9547849    0.9412804
treated2fb      0.9738014    0.9661953
treated3fb      0.9746987    0.9725255
untreated1fb    0.9772582    0.9723502
untreated2fb    0.9783300    0.9678800
untreated3fb    1.0000000    0.9897768
untreated4fb    0.9897768    1.0000000

cor_mat will be the input to our heatmap function. To generate a heatmap we will use heatmap.2 which is part of the gplots package. Let's load the library:

library(gplots)

To plot the heatmap, we simply call the function and pass in our correlation matrix:

heatmap.2(cor_mat)

We notice that the sample names are not fully printed. Thus we can increase margin size by margins = c(10, 10). We can also get rid of traces by using arguments trace='none' and denscol=tracecol

heatmap.2(cor_mat,margins = c(10, 10),trace='none',denscol=tracecol)

It is often useful to combine heatmaps with hierarchical clustering, which is a way of arranging items in a hierarchy based on the distance or similarity between them. The result of a hierarchical clustering calculation is displayed in a heat map as a dendrogram, which is a tree-structure of the hierarchy. In our heatmap both rows and columns have been clustered, but we can change that to remove column clustering (Colv=NULL, and dendrogram="row") since we have a symmetric matrix. We can also remove the trace by setting trace="none" and get rid of the legend key=FALSE.

heatmap.2(cor_mat, margines=c(10,10), Colv=NULL, dendrogram="row", trace="none",key=FALSE)

As with any function in R, there are many way in which we can tweak arguments to customize the heatmap. We encourage you take time to read through the reference manual and explore other ways of generating heatmaps in R

We have only scratched the surface here. To learn more, see the site, and Winston Chang’s excellent site. Though slightly out of date, is still the definative book on this subject.

To plot a PCA plot we will be using ggplot, but first we will need to take the data matrix and generate the principal component vectors using prcomp. We will use RNA-Seq data as input to create a gene expression count matrix and a sample description object.

We have only scratched the surface here. To learn more, see the site, and Winston Chang’s excellent site. Though slightly out of date, is still the definative book on this subject.

This generates a plot using the default settings. The default color gradient sets the highest value in the heat map to white, and the lowest value to a bright red, with a corresponding transition (or gradient) between these extremes. This color scheme can be changed by adding col= and specifying either a different built-in by name, or .

ggplot reference
Cookbook for R
ggplot2: Elegant Graphics for Data Anaysis
pasilla
ggplot reference
Cookbook for R
ggplot2: Elegant Graphics for Data Anaysis
color palette
creating your own palette
ggplot2