Advanced Plotting in R

There’s also a plotting package called ggplot2 that adds a lot of functionality to the basic plots seen above. The syntax takes some getting used to but it’s extremely powerful and flexible. We can start by re-creating some of the above plots but using ggplot functions to get a feel for the syntax.

ggplot is best used on data in the data.frame form, so we will work with our combined df for the following figures. Let’s start by loading the ggplot2 library.

library(ggplot2)

Prepare data to practice ggplot2

baseDir<-getwd()
dataDir<-file.path(baseDir,"data")
metadata <- read.table(file.path(dataDir, 'mouse_exp_design.csv'), header=T, sep=",", row.names=1)
rpkm_data <- read.table(file.path(dataDir, 'counts.rpkm'), header=T, sep=",", row.names=1)
m <- match(row.names(metadata), colnames(rpkm_data))
data_ordered  <- rpkm_data[,m]
samplemeans <- apply(data_ordered, 2, mean)

# Create a combined data frame
all(rownames(metadata) == names(samplemeans)) # sanity check for sample order
df <- cbind(metadata, samplemeans)

Advanced Boxplot

The ggplot() command creates a plot object. In it we assign our data frame to the data argument, and aes() creates what Hadley Wickham calls an aesthetic: a mapping of variables to various parts of the plot. Note that ggplot functions can be chained with + signs to adding layers to the final plot. The next in chain is geom_boxplot(). The geom function specifies the geometric objects that define the graph type. The geom option is expressed as a character vector with one or more entries. Values include geom_point, geom_boxplot, geom_line etc

ggplot(data=df, aes(x= genotype, y=samplemeans))+geom_boxplot()

Unlike base R graphs, the ggplot graphs are not effected by many of the options set in the par() function (e.g. adjusting relative size of axis labels using cex). They can be modified using the theme() function, and by adding graphic parameters. Here, we will increase the size of the axis labels and the main title. We can also change the fill variable to celltype

ggplot(data=df, aes(x= genotype, y=samplemeans, fill=celltype)) + 
  geom_boxplot() + 
  ggtitle('Genotype differences in average gene expression') +
  xlab('Genotype') +
  ylab('Mean expression') +
  theme(plot.title = element_text(size = rel(2.0)),
        axis.title = element_text(size = rel(1.5)),
        axis.text = element_text(size = rel(1.25)))

Advanced Bar Plot

For the bar plot, we need to define the graph type to geom_bar. Since we don’t have an x variable, we need to specify the row names as our index so each sample is plotted on its own.

ggplot(data=df, aes(x=row.names(df), y=samplemeans, fill=genotype)) +
  geom_bar(colour="black", stat="identity") +
  ggtitle('Average expression for each sample') +
  xlab('') +
  ylab('Mean expression') +
  theme(plot.title = element_text(size = rel(2.0)),
        axis.title = element_text(size = rel(1.5)),
        axis.text = element_text(size = rel(1.25)),
        axis.text.x = element_text(angle=45, vjust=0.5, hjust=0.6, size = rel(1.25)))

We have only scratched the surface here. To learn more, see the ggplot reference site, and Winston Chang’s excellent Cookbook for R site. Though slightly out of date, ggplot2: Elegant Graphics for Data Anaysis is still the definative book on this subject.

PCA

A figure that is often used in exploratory analsyis of data is PCA plot. PCA (principal components analysis) is a multivariate technique that allows us to summarize the systematic patterns of variations in the data. PCA takes the expresson levels for all probes and transforms it in principal component space, reducing each sample into one point (as coordinates within that space). This allows us to separate samples according to expression variation, and identify potential outliers.

Data preparation for PCA

To plot a PCA plot we will be using ggplot, but first we will need to take the data matrix and generate the principal component vectors using prcomp. We will use pasilla RNA-Seq data as input to create a gene expression count matrix and a sample description object.

library(DESeq)
library(DESeq2)
library(pasilla)
data("pasillaGenes")
countData<-counts(pasillaGenes)
colData<-pData(pasillaGenes)[,c("condition","type")]

Let's first take a look at the gene expression count matrix countData.

head(countData)
            treated1fb treated2fb treated3fb untreated1fb untreated2fb
FBgn0000003          0          0          1            0            0
FBgn0000008         78         46         43           47           89
FBgn0000014          2          0          0            0            0
FBgn0000015          1          0          1            0            1
FBgn0000017       3187       1672       1859         2445         4615
FBgn0000018        369        150        176          288          383
            untreated3fb untreated4fb
FBgn0000003            0            0
FBgn0000008           53           27
FBgn0000014            1            0
FBgn0000015            1            2
FBgn0000017         2063         1711
FBgn0000018          135          174

Then let's take a look at the sample description file colData

colData
             condition        type
treated1fb     treated single-read
treated2fb     treated  paired-end
treated3fb     treated  paired-end
untreated1fb untreated single-read
untreated2fb untreated single-read
untreated3fb untreated  paired-end
untreated4fb untreated  paired-end

One thing we notice is that the count Matrix is genes are rows while samples are columns. But prcomp required to have samples as our rows and genes as our columns.That is, we need a transposed version of what we currently have. R has a built-in function to transpose which is denoted by t(). Let's take a look at how t() works.

> transposed_data<-t(countData)
> transposed_data[,1:5]
             FBgn0000003 FBgn0000008 FBgn0000014 FBgn0000015 FBgn0000017
treated1fb             0          78           2           1        3187
treated2fb             0          46           0           0        1672
treated3fb             1          43           0           1        1859
untreated1fb           0          47           0           0        2445
untreated2fb           0          89           0           1        4615
untreated3fb           0          53           1           1        2063
untreated4fb           0          27           0           2        1711

Running PCA

Now samples are rows and genes are columns after t() function is applied, and we are ready to run prcomp().

pca_results <- prcomp(t(countData))

Use the str function to take a quick peek at what is returned to us from the prcomp function. You can cross-reference with the help pages to see that is corresponds with what you are expected to be returned (?prcomp). There should be a list of five objects; the one we are interested in is x which is a matrix of the principal component vectors. Let's save that data matrix by assigning it to a new variable.

pc_mat <- pca_results$x

Plotting PCA

We are going to take a look at the first two principal components by plotting them against each other. Since we will want to include information from our sample description object colData, we can concatenate the PCA results to our colData into a data frame for input to ggplot. The graphic type that we are using is a scatter plot denoted by geom_point(), and we have specified to color by condition.

df <- cbind(colData, pc_mat[,c('PC1', 'PC2')])
library(ggplot2)
ggplot(df, aes(PC1, PC2, color = condition)) + geom_point(size=3)

Heatmaps

Another useful plot used to identify patterns in your data and potential outliers is to use heatmaps. A heatmap is a graphical representation of data where the individual values contained in a matrix are represented as colors. Heat maps are well-suited for visualizing large amounts of multi-dimensional data and can be used to identify clusters of rows or columns with similar values, as these are displayed as areas of similar color.

Our data matrix is quite large, and a heatmap would be rather informative not having selected a subset of genes. Instead, we will generate a sample-to-sample correlation matrix by taking the correlation of count values for all pairwise combinations of samples. To compute the correlations R has a built-in function for that, cor which can take in either two vectors or an entire matrix. We will give it the same input we used for PCA above.

cor_mat <- cor(countData)

Check the dimensions of the matrix that is returned, and the range of values. Take a quick peek inside cor_mat.

cor_mat
             treated1fb treated2fb treated3fb untreated1fb untreated2fb
treated1fb    1.0000000  0.9769470  0.9699387    0.9748449    0.9659681
treated2fb    0.9769470  1.0000000  0.9960757    0.9747916    0.9565659
treated3fb    0.9699387  0.9960757  1.0000000    0.9705672    0.9538853
untreated1fb  0.9748449  0.9747916  0.9705672    1.0000000    0.9758301
untreated2fb  0.9659681  0.9565659  0.9538853    0.9758301    1.0000000
untreated3fb  0.9547849  0.9738014  0.9746987    0.9772582    0.9783300
untreated4fb  0.9412804  0.9661953  0.9725255    0.9723502    0.9678800
             untreated3fb untreated4fb
treated1fb      0.9547849    0.9412804
treated2fb      0.9738014    0.9661953
treated3fb      0.9746987    0.9725255
untreated1fb    0.9772582    0.9723502
untreated2fb    0.9783300    0.9678800
untreated3fb    1.0000000    0.9897768
untreated4fb    0.9897768    1.0000000

cor_mat will be the input to our heatmap function. To generate a heatmap we will use heatmap.2 which is part of the gplots package. Let's load the library:

library(gplots)

To plot the heatmap, we simply call the function and pass in our correlation matrix:

heatmap.2(cor_mat)

This generates a plot using the default settings. The default color gradient sets the highest value in the heat map to white, and the lowest value to a bright red, with a corresponding transition (or gradient) between these extremes. This color scheme can be changed by adding col= and specifying either a different built-in color palette by name, or creating your own palette.

We notice that the sample names are not fully printed. Thus we can increase margin size by margins = c(10, 10). We can also get rid of traces by using arguments trace='none' and denscol=tracecol

heatmap.2(cor_mat,margins = c(10, 10),trace='none',denscol=tracecol)

It is often useful to combine heatmaps with hierarchical clustering, which is a way of arranging items in a hierarchy based on the distance or similarity between them. The result of a hierarchical clustering calculation is displayed in a heat map as a dendrogram, which is a tree-structure of the hierarchy. In our heatmap both rows and columns have been clustered, but we can change that to remove column clustering (Colv=NULL, and dendrogram="row") since we have a symmetric matrix. We can also remove the trace by setting trace="none" and get rid of the legend key=FALSE.

heatmap.2(cor_mat, margines=c(10,10), Colv=NULL, dendrogram="row", trace="none",key=FALSE)

As with any function in R, there are many way in which we can tweak arguments to customize the heatmap. We encourage you take time to read through the reference manual and explore other ways of generating heatmaps in R

PreviousBasic Plotting in R NextWriting Figures to a File

Last updated 8 months ago

Was this helpful?