Advanced Plotting in R
There’s also a plotting package called ggplot2 that adds a lot of functionality to the basic plots seen above. The syntax takes some getting used to but it’s extremely powerful and flexible. We can start by re-creating some of the above plots but using ggplot functions to get a feel for the syntax.
ggplot is best used on data in the data.frame form, so we will work with our combined df for the following figures. Let’s start by loading the ggplot2 library.
The ggplot() command creates a plot object. In it we assign our data frame to the data argument, and aes() creates what Hadley Wickham calls an aesthetic: a mapping of variables to various parts of the plot. Note that ggplot functions can be chained with + signs to add layers to the final plot. The next in the chain is geom_boxplot(). The geom function specifies the geometric objects that define the graph type. Each geom is its own function that is added as a layer; common ones include geom_point(), geom_boxplot(), and geom_line().
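As a minimal sketch of this syntax, the example below builds a small made-up data frame and draws a boxplot from it. The data frame contents and the column names genotype and samplemeans are assumptions for illustration only; they stand in for the combined df used in this lesson.

```r
library(ggplot2)

# Hypothetical stand-in for the combined data frame `df` (values are made up)
df <- data.frame(
  genotype    = rep(c("Wt", "KO"), each = 6),
  celltype    = rep(c("typeA", "typeB"), times = 6),
  samplemeans = c(10.2, 9.8, 11.1, 10.5, 9.9, 10.7,
                  12.3, 13.1, 12.8, 11.9, 13.4, 12.6),
  row.names   = paste0("sample", 1:12)
)

# ggplot() creates the plot object; aes() maps columns to x and y;
# geom_boxplot() is added as a layer with +
p <- ggplot(data = df, aes(x = genotype, y = samplemeans)) +
  geom_boxplot()
print(p)
```

Swapping geom_boxplot() for another geom, such as geom_point(), changes the graph type without touching the data or the aesthetic mapping.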
Unlike base R graphs, the ggplot graphs are not affected by many of the options set in the par() function (e.g. adjusting relative size of axis labels using cex). They can be modified using the theme() function, and by adding graphic parameters. Here, we will increase the size of the axis labels and the main title. We can also change the fill variable to celltype.
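One way this could look is sketched below. The data frame is a made-up stand-in, and the title text, axis labels, and rel() sizes are arbitrary choices for illustration rather than the values used in the original lesson.

```r
library(ggplot2)

# Hypothetical stand-in for the combined data frame `df` (values are made up)
df <- data.frame(
  genotype    = rep(c("Wt", "KO"), each = 6),
  celltype    = rep(c("typeA", "typeB"), times = 6),
  samplemeans = c(10.2, 9.8, 11.1, 10.5, 9.9, 10.7,
                  12.3, 13.1, 12.8, 11.9, 13.4, 12.6)
)

p <- ggplot(df, aes(x = genotype, y = samplemeans, fill = celltype)) +
  geom_boxplot() +
  ggtitle("Average expression by genotype") +   # illustrative title
  xlab("Genotype") +
  ylab("Mean expression") +
  # rel() scales text relative to the theme's base font size
  theme(plot.title = element_text(size = rel(2)),
        axis.title = element_text(size = rel(1.5)),
        axis.text  = element_text(size = rel(1.25)))
print(p)
```

Because fill is mapped inside aes(), ggplot draws one box per genotype and celltype combination and adds a legend automatically.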
For the bar plot, we need to set the graph type to geom_bar. Since we don’t have an x variable, we need to specify the row names as our index so each sample is plotted on its own.
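A sketch of such a bar plot follows, again on a made-up stand-in for df. Since the bar heights come from a column rather than from counts, the sketch passes stat = "identity" to geom_bar(); the fill variable, outline colour, and rotated tick labels are illustrative assumptions.

```r
library(ggplot2)

# Hypothetical stand-in for the combined data frame `df` (values are made up)
df <- data.frame(
  genotype    = rep(c("Wt", "KO"), each = 3),
  samplemeans = c(10.2, 9.8, 11.1, 12.3, 13.1, 12.8),
  row.names   = paste0("sample", 1:6)
)

# Use the row names as the x variable so each sample gets its own bar;
# stat = "identity" plots the y values as given instead of counting rows
p <- ggplot(df, aes(x = row.names(df), y = samplemeans, fill = genotype)) +
  geom_bar(stat = "identity", colour = "black") +
  xlab("Sample") +
  ylab("Mean expression") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
print(p)
```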
We have only scratched the surface here. To learn more, see the ggplot reference site, and Winston Chang’s excellent Cookbook for R site. Though slightly out of date, ggplot2: Elegant Graphics for Data Analysis is still the definitive book on this subject.
A figure that is often used in exploratory analysis of data is the PCA plot. PCA (principal components analysis) is a multivariate technique that allows us to summarize the systematic patterns of variation in the data. PCA takes the expression levels for all probes and transforms them into principal component space, reducing each sample to one point (as coordinates within that space). This allows us to separate samples according to expression variation, and identify potential outliers.
To draw a PCA plot we will use ggplot, but first we need to take the data matrix and generate the principal component vectors using prcomp. We will use pasilla RNA-Seq data as input to create a gene expression count matrix and a sample description object.
Let's first take a look at the gene expression count matrix countData.
Then let's take a look at the sample description object colData.
One thing we notice is that in the count matrix genes are rows and samples are columns. But prcomp requires samples as rows and genes as columns. That is, we need a transposed version of what we currently have. R has a built-in function for transposing, t(). Let's take a look at how t() works.
After t() is applied, samples are rows and genes are columns, and we are ready to run prcomp().
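To see what t() does, here is a toy example on a tiny made-up matrix (not the pasilla data) with three genes as rows and two samples as columns.

```r
# Toy matrix: 3 genes (rows) x 2 samples (columns); values are arbitrary
m <- matrix(1:6, nrow = 3,
            dimnames = list(c("geneA", "geneB", "geneC"),
                            c("sample1", "sample2")))
m
dim(m)     # 3 rows, 2 columns

# t() swaps rows and columns, carrying the dimnames along
tm <- t(m)
tm
dim(tm)    # 2 rows, 3 columns
```

The value for geneA in sample2 is the same before and after; only its position changes from m["geneA", "sample2"] to tm["sample2", "geneA"].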
Use the str function to take a quick peek at what the prcomp function returns. You can cross-reference with the help page (?prcomp) to check that it corresponds with what you expect to be returned. There should be a list of five objects; the one we are interested in is x, a matrix holding the coordinates of each sample along the principal components. Let's save that data matrix by assigning it to a new variable.
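The sketch below runs these steps on a small simulated count matrix that stands in for the pasilla countData. The simulated values, the sample names, and the variable name pca_x are assumptions for illustration.

```r
set.seed(1)

# Simulated stand-in for countData: 100 genes (rows) x 6 samples (columns)
samples    <- c(paste0("treated", 1:3), paste0("untreated", 1:3))
gene_means <- rgamma(100, shape = 2, scale = 50)
countData  <- matrix(rpois(600, lambda = rep(gene_means, times = 6)),
                     nrow = 100,
                     dimnames = list(paste0("gene", 1:100), samples))

# prcomp() expects samples as rows, so transpose first
pca <- prcomp(t(countData))

str(pca)      # a list with sdev, rotation, center, scale and x
names(pca)

# x holds one row per sample and one column per principal component
pca_x <- pca$x
dim(pca_x)
head(pca_x[, 1:2])
```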
We are going to take a look at the first two principal components by plotting them against each other. Since we will want to include information from our sample description object colData, we can concatenate the PCA results to our colData into a data frame for input to ggplot. The graphic type that we are using is a scatter plot denoted by geom_point(), and we have specified to color by condition.
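A minimal sketch of this plot is shown below, using simulated stand-ins for countData and colData. The simulated values, the sample names, the name plot_df, and the point size are assumptions; only the condition column and the use of geom_point() come from the text.

```r
library(ggplot2)
set.seed(1)

# Simulated stand-ins for countData (genes x samples) and colData
samples    <- c(paste0("treated", 1:3), paste0("untreated", 1:3))
gene_means <- rgamma(100, shape = 2, scale = 50)
countData  <- matrix(rpois(600, lambda = rep(gene_means, times = 6)),
                     nrow = 100,
                     dimnames = list(paste0("gene", 1:100), samples))
colData    <- data.frame(condition = rep(c("treated", "untreated"), each = 3),
                         row.names = samples)

pca <- prcomp(t(countData))

# Combine the sample descriptions with the PCA coordinates;
# rows of pca$x are in the same sample order as the rows of colData
plot_df <- cbind(colData, pca$x)

p <- ggplot(plot_df, aes(x = PC1, y = PC2, color = condition)) +
  geom_point(size = 3)
print(p)
```

cbind() pairs rows by position, so it is worth checking that the row names of colData and pca$x are in the same order before combining them.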
Another useful plot for identifying patterns in your data and potential outliers is the heatmap. A heatmap is a graphical representation of data where the individual values contained in a matrix are represented as colors. Heatmaps are well-suited for visualizing large amounts of multi-dimensional data and can be used to identify clusters of rows or columns with similar values, as these are displayed as areas of similar color.
Our data matrix is quite large, and a heatmap would be rather uninformative without first selecting a subset of genes. Instead, we will generate a sample-to-sample correlation matrix by taking the correlation of count values for all pairwise combinations of samples. R has a built-in function for computing correlations, cor, which can take in either two vectors or an entire matrix. We will give it the same count data we used for PCA above. Note that cor computes correlations between the columns of a matrix, so the samples need to be in the columns.
Check the dimensions of the matrix that is returned, and the range of values. Take a quick peek inside cor_mat.
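The sketch below computes and inspects such a correlation matrix on a simulated stand-in for countData. Because cor() correlates columns, it is given the matrix with samples as columns; that orientation, along with the simulated values and sample names, is an assumption of this sketch.

```r
set.seed(1)

# Simulated stand-in for countData: 100 genes (rows) x 6 samples (columns)
samples    <- c(paste0("treated", 1:3), paste0("untreated", 1:3))
gene_means <- rgamma(100, shape = 2, scale = 50)
countData  <- matrix(rpois(600, lambda = rep(gene_means, times = 6)),
                     nrow = 100,
                     dimnames = list(paste0("gene", 1:100), samples))

# Pairwise correlations between columns, i.e. between samples
cor_mat <- cor(countData)

dim(cor_mat)          # one row and one column per sample
range(cor_mat)        # correlations lie between -1 and 1
cor_mat[1:3, 1:3]     # quick peek at the top-left corner
```

The diagonal is all ones, since each sample is perfectly correlated with itself, and the matrix is symmetric.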
cor_mat will be the input to our heatmap function. To generate a heatmap we will use heatmap.2 which is part of the gplots package. Let's load the library:
To plot the heatmap, we simply call the function and pass in our correlation matrix:
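A self-contained sketch of these two steps follows. The correlation matrix here is built from simulated counts standing in for the lesson's data, and the result is stored in hm only so that it can be inspected afterwards.

```r
library(gplots)
set.seed(1)

# Simulated stand-in for countData and its sample-to-sample correlations
samples    <- c(paste0("treated", 1:3), paste0("untreated", 1:3))
gene_means <- rgamma(100, shape = 2, scale = 50)
countData  <- matrix(rpois(600, lambda = rep(gene_means, times = 6)),
                     nrow = 100,
                     dimnames = list(paste0("gene", 1:100), samples))
cor_mat    <- cor(countData)

# Draw the heatmap with default settings; the returned list records,
# among other things, the row and column order after clustering
hm <- heatmap.2(cor_mat)
hm$rowInd
```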
This generates a plot using the default settings. The default color gradient sets the highest value in the heat map to white, and the lowest value to a bright red, with a corresponding transition (or gradient) between these extremes. This color scheme can be changed by adding col= and specifying either a different built-in color palette by name, or creating your own palette.
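As an illustration of a custom palette, the sketch below builds a blue-white-red gradient with colorRampPalette() and passes it through col=. The color choice, the number of steps, and the name my_palette are arbitrary assumptions, and the correlation matrix comes from simulated data.

```r
library(gplots)
set.seed(1)

# Simulated stand-in for the sample-to-sample correlation matrix
samples    <- c(paste0("treated", 1:3), paste0("untreated", 1:3))
gene_means <- rgamma(100, shape = 2, scale = 50)
countData  <- matrix(rpois(600, lambda = rep(gene_means, times = 6)),
                     nrow = 100,
                     dimnames = list(paste0("gene", 1:100), samples))
cor_mat    <- cor(countData)

# colorRampPalette() returns a function; calling it with 50 gives 50 colors
my_palette <- colorRampPalette(c("blue", "white", "red"))(50)

hm <- heatmap.2(cor_mat, col = my_palette)
```

A vector from a built-in palette function, for example topo.colors(50), can be passed to col= in the same way.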
We notice that the sample names are not fully printed. We can fix this by increasing the margin size with margins = c(10, 10). We can also get rid of the trace lines by using the arguments trace='none' and denscol=tracecol.
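The sketch below applies the wider margins and turns off the trace lines on a correlation matrix built from simulated data. It sets only margins and trace; the simulated values and sample names are assumptions.

```r
library(gplots)
set.seed(1)

# Simulated stand-in for the sample-to-sample correlation matrix
samples    <- c(paste0("treated", 1:3), paste0("untreated", 1:3))
gene_means <- rgamma(100, shape = 2, scale = 50)
countData  <- matrix(rpois(600, lambda = rep(gene_means, times = 6)),
                     nrow = 100,
                     dimnames = list(paste0("gene", 1:100), samples))
cor_mat    <- cor(countData)

# margins = c(bottom, right) leaves room for the column and row labels;
# trace = "none" removes the lines drawn through the cells
hm <- heatmap.2(cor_mat, margins = c(10, 10), trace = "none")
```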
It is often useful to combine heatmaps with hierarchical clustering, which is a way of arranging items in a hierarchy based on the distance or similarity between them. The result of a hierarchical clustering calculation is displayed in a heatmap as a dendrogram, which is a tree structure of the hierarchy. In our heatmap both rows and columns have been clustered, but we can change that to remove column clustering (Colv=NULL and dendrogram="row") since we have a symmetric matrix. We can also remove the trace by setting trace="none" and get rid of the legend with key=FALSE.
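Putting these arguments together might look like the sketch below, which again uses a correlation matrix built from simulated data. The margins value is carried over as an illustrative choice.

```r
library(gplots)
set.seed(1)

# Simulated stand-in for the sample-to-sample correlation matrix
samples    <- c(paste0("treated", 1:3), paste0("untreated", 1:3))
gene_means <- rgamma(100, shape = 2, scale = 50)
countData  <- matrix(rpois(600, lambda = rep(gene_means, times = 6)),
                     nrow = 100,
                     dimnames = list(paste0("gene", 1:100), samples))
cor_mat    <- cor(countData)

hm <- heatmap.2(cor_mat,
                Colv = NULL,          # no column clustering or reordering
                dendrogram = "row",   # draw only the row dendrogram
                trace = "none",       # no trace lines
                key = FALSE,          # no color key (legend)
                margins = c(10, 10))
```

With column clustering turned off, the columns stay in their original order while the rows are rearranged by the clustering.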
As with any function in R, there are many ways in which we can tweak arguments to customize the heatmap. We encourage you to take time to read through the reference manual and explore other ways of generating heatmaps in R.