Download Intro_to_R.tar from https://ki-data.mit.edu/bcc/teaching/Intro_to_R/ onto your Desktop.
Un-compress Intro_to_R.tar to a folder named "Intro_to_R".
Open RStudio.
Under the File menu, click New Project and choose Existing Directory.
Use Browse to locate the "Intro_to_R" folder, then click Create Project. This will be your working directory for the rest of the day (e.g., ~/Desktop/Intro_to_R).
Create a new R script (File > New File > R script) and save it in your working directory (e.g. intro_to_R.R). Here, you can type all the commands we run during the course, and save it for later reference.
Using R as a calculator
> 1+2
[1] 3
> 1+2*3+4*5
[1] 27
R is good at statistics
> group1<-c(4.5,4.7,4.2,4.8,3.9)
> group2<-c(6.0,5.9,5.8,5.5,6.2)
> t.test(group1,group2)
Welch Two Sample t-test
data: group1 and group2
t = -7.2281, df = 7.1573, p-value = 0.0001556
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-1.9355112 -0.9844888
sample estimates:
mean of x mean of y
4.42 5.88
The R syntax from an example script
# Load libraries
library(Biobase)
library(limma)
library(ggplot2)
# Setup directory variables
baseDir <- getwd()
dataDir <- file.path(baseDir, "data")
# Load data
design_dat <- read.table(file.path(dataDir, 'mouse_exp_design.csv'), header=T, sep=",", row.names=1)
The above snippet of R code shows many different "parts of speech" of R syntax:
comments (#) and how they are used to document a script and its contents
the assignment operator <-
variables and functions
the = for function arguments
The backbone of the lesson is based on the teaching material developed by members of the teaching team at the Harvard Chan Bioinformatics Core (HBC). Many sections are directly adapted from the above teaching material. These are open access materials distributed under the terms of the Creative Commons Attribution license (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
The materials used in this lesson were derived from work that is Copyright © Data Carpentry (http://datacarpentry.org/). All Data Carpentry instructional material is made available under the Creative Commons Attribution license (CC BY 4.0).
Adapted from the lesson by Tracy Teal. Original contributors: Paul Wilson, Milad Fatenejad, Sasha Wood and Radhika Khetani for Software Carpentry (http://software-carpentry.org/)
R is a versatile, open source programming/scripting language available on all platforms that is useful not only for statistics, but also data science. It is inspired by the programming language S.
R is open-source software licensed under the GNU General Public License (GPL). It is comparable, and in many respects superior, to commercial alternatives, and is widely used in both academia and industry.
RStudio is a freely available open-source Integrated Development Environment (IDE). It can be downloaded from the RStudio website. It is a great alternative to working on R in the terminal for many reasons:
automatic syntax highlighting/formatting in the editor
direct code execution from editor to console
real-time access to environment, plotting, and history
good tool for workspace management
The RStudio interface has four main panels:
Console: where you can type commands and see output
Editor: where you can type out commands and save them to file. You can also run code in the console with Ctrl+Enter
Workspace/History: workspace shows all active objects and history keeps track of all commands run in console
Files/Plots/Packages/Help
Code and workflow are more reproducible if we can document everything that we do.
Our end goal is not just to “do stuff”, but to do it in a way that anyone can easily and exactly replicate our workflow and results.
All code should be written in the editor and saved to file, rather than working in the console. The R console should be used to inspect objects, test a function or get help.
Use # signs to comment. Comment liberally in your R scripts. This will help future you and other collaborators know what each line of code (or code block) was meant to do. Anything to the right of a # is ignored by R. A shortcut for this is Ctrl + Shift + C if you want to comment an entire chunk of text.
Organizing your working directory: You should separate the original data (raw data) from intermediate datasets that you may create for the needs of a particular analysis. For instance, you may want to create a data/ directory within your working directory that stores the data, a results/ directory for intermediate datasets, and a figures/ directory for the plots you will generate. The current working directory has a data folder already. Create additional directories called "results" and "figures". You can do this by navigating to the lower right panel and using the New Folder button in the Files tab.
There are two main ways of interacting with R: using the console or by using script files (plain text files that contain your code).
The console window (in RStudio, the bottom left panel) is the place where R is waiting for you to tell it what to do, and where it will show the results of a command. You can type commands directly into the console, but they will be forgotten when you close the session. It is better to enter the commands in the script editor and save the script. This way, you have a complete record of what you did, you can easily show others how you did it, and you can do it again later on if needed. You can copy-paste the code into the R console, but the RStudio script editor allows you to 'send' the current line or the currently selected text to the R console using the Ctrl-Enter shortcut.
If R is ready to accept commands, the R console shows a > prompt. If it receives a command (by typing, copy-pasting or sending from the script editor using Ctrl-Enter), R will try to execute it, and when ready, show the results and come back with a new > prompt to wait for new commands.
If R is still waiting for you to enter more input, the console will show a + prompt. It means that you haven't finished entering a complete command, usually because you have not 'closed' a parenthesis or quotation. If you're in RStudio and this happens, click inside the console window and press Esc (or Ctrl+C in a terminal R session); this should help you out of trouble.
Genomic data can be too big to process on your laptop. The BCC/BMC cluster has large amounts of memory and many CPUs, which enables processing big data in parallel. However, you need to be trained in this course before running R scripts on our cluster, so that you run them in appropriate ways.
Initialize a plot that will be written directly to a file using pdf(), png(), etc. Within the function you will need to specify a name for your image, and the width and height (optional). Then create a plot using the usual functions in R. Finally, close the file using the dev.off() function. There are also bmp(), tiff(), and jpeg() functions, though the jpeg function has proven less stable than the others.
pdf("boxplot.pdf")
ggplot(data=df, aes(x= genotype, y=samplemeans, fill=celltype)) +
geom_boxplot() +
ggtitle('Genotype differences in average gene expression') +
xlab('Genotype') +
ylab('Mean expression') +
theme(plot.title = element_text(size = rel(1.5)),
axis.title = element_text(size = rel(1.5)),
axis.text = element_text(size = rel(1.25)))
dev.off()
Let’s get a closer look at our data. Each column represents a sample in our experiment, and each sample has ~38K values corresponding to the expression of different transcripts. Suppose we wanted to compute the average value for a sample, or the minimum and maximum values? The R base package provides many built-in functions such as mean(), median(), min(), max(), and range(). Try computing the mean for "sample1":
baseDir<-getwd()
dataDir<-file.path(baseDir,"data")
metadata <- read.table(file.path(dataDir, 'mouse_exp_design.csv'), header=T, sep=",", row.names=1)
rpkm_data <- read.table(file.path(dataDir, 'counts.rpkm'), header=T, sep=",", row.names=1)
m <- match(row.names(metadata), colnames(rpkm_data))
data_ordered <- rpkm_data[,m]
mean(data_ordered[,'sample1'])
max(data_ordered[,'sample1'])
min(data_ordered[,'sample1'])
By default, all R functions operating on vectors that contain missing data will return NA. This is a way to make sure that users know they have missing data, and make a conscious decision on how to deal with it. When dealing with simple statistics like the mean, the easiest way to ignore NA (the missing data) is to use na.rm=TRUE (rm stands for remove). In some cases, it might be useful to remove the missing data from the vector altogether. For this purpose, R comes with the function na.omit, which generates a vector with the NAs removed. For some applications it's useful to keep all observations, while for others it might be best to remove all observations that contain missing data. The function complete.cases() returns a logical vector indicating which rows have no missing values.
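For example, a minimal sketch with a small made-up vector (not part of our dataset):
v <- c(2, 4, NA, 8)  # hypothetical vector with one missing value
mean(v)              # returns NA because of the missing value
mean(v, na.rm=TRUE)  # ignores the NA and returns 4.666667
na.omit(v)           # returns 2 4 8 (plus an attribute recording what was removed)
complete.cases(v)    # TRUE TRUE FALSE TRUE; use v[complete.cases(v)] to filter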
To obtain mean values for all samples we can use mean on each column individually, but there is also an easier way to go about it. The apply family of functions keeps you from having to write explicit loops (which can be slow in R) to perform some sort of operation on every row or column of a data matrix or a data frame. The family includes several functions, each differing slightly in its inputs or outputs.
base::apply Apply Functions Over Array Margins
base::by Apply a Function to a Data Frame Split by Factors
base::eapply Apply a Function Over Values in an Environment
base::lapply Apply a Function over a List or Vector
base::mapply Apply a Function to Multiple List or Vector Arguments
base::rapply Recursively Apply a Function to a List
base::tapply Apply a Function Over a Ragged Array
We will be using apply in our examples today, but do take a moment on your own to explore the many options that are available. The apply function returns a vector or array or list of values obtained by applying a function to margins of an array or matrix. We know about vectors/arrays and functions, but what are these "margins"? Margins refer to either the rows (denoted by 1), the columns (denoted by 2) or both (1:2). By "both", we mean apply the function to each individual value. Let's try this with the mean function on our data:
samplemeans <- apply(data_ordered, 2, mean)
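For contrast, a quick sketch on the same data (the variable name genemeans is just illustrative): changing the margin to 1 applies the function across rows instead of columns.
genemeans <- apply(data_ordered, 1, mean) # margin 1: mean of each row (i.e., each gene)
head(genemeans)                           # peek at the first few per-gene means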
The mathematician Richard Hamming once said, “The purpose of computing is insight, not numbers”, and the best way to develop insight is often to visualize data. Visualization deserves an entire lecture (or course) of its own, but we can explore a few features of R’s base plotting package.
When we are working with large sets of numbers it can be useful to display that information graphically. R has a number of built-in tools for basic graph types such as histograms, scatter plots, bar charts, boxplots and much more. We'll test a few of these out here on our samplemeans vector, but first we will create a combined data frame that maps our metadata to the sample mean values.
baseDir<-getwd()
dataDir<-file.path(baseDir,"data")
metadata <- read.table(file.path(dataDir, 'mouse_exp_design.csv'), header=T, sep=",", row.names=1)
rpkm_data <- read.table(file.path(dataDir, 'counts.rpkm'), header=T, sep=",", row.names=1)
m <- match(row.names(metadata), colnames(rpkm_data))
data_ordered <- rpkm_data[,m]
samplemeans <- apply(data_ordered, 2, mean)
# Create a combined data frame
all(rownames(metadata) == names(samplemeans)) # sanity check for sample order
df <- cbind(metadata, samplemeans)
Let’s start with a scatter plot. A scatter plot provides a graphical view of the relationship between two sets of numbers. We don't have a continuous variable in our metadata to plot the sample means against, so we will plot the values against their index values just to demonstrate the function.
par(mar = rep(5, 4))
plot(samplemeans)
Each point represents a sample: the value on the x-axis is the sample number, and the value on the y-axis corresponds to the average expression for that sample. For any plot you can customize many features of your graphs (fonts, colors, axes, titles) through graphic options. We can change the shape of the data points using pch.
plot(samplemeans, pch=8)
We can add a title to the plot by assigning a string to main
plot(samplemeans, pch=8, main="Scatter plot of mean values")
In the case of our data, a barplot would be much more useful. We can use barplot() to draw a single bar representing each sample, where the height indicates the average expression level.
barplot(samplemeans)
The sample names appear to be too large for the plot; we can change that with the cex.names value.
barplot(samplemeans, cex.names=0.5)
Now the names are too small to read. Alternatively, we can just change the names to numeric values and keep the default size.
barplot(samplemeans, names.arg=c(1:12)) # supply numbers as labels
We can also flip the axes so that the plot is projected horizontally.
barplot(samplemeans, names.arg=c(1:12), horiz=TRUE)
If we are interested in the overall distribution of values, a histogram is a very commonly used plot. It plots the frequencies with which data appear within certain ranges. To plot a histogram of the data, use the hist command:
hist(samplemeans)
The range of values for sample means is 9 to 16. As you can see R will automatically calculate the intervals to use. There are many options to determine how to break up the intervals. Let’s increase the number of breaks to see how that changes the plot:
hist(samplemeans, xlab="Mean expression level", main="", breaks=20)
Similar to the other plots we can tweak the aesthetics. Let’s color in the bar and remove the borders:
hist(samplemeans, xlab="Mean expression level", main="", col="darkgrey", border=FALSE)
Using additional sample information from our metadata, we can use plots to compare values between the two different celltypes 'typeA' and 'typeB' using a boxplot. A boxplot provides a graphical view of the median, quartiles, maximum, and minimum of a data set.
boxplot(samplemeans~celltype, df)
Similar to the plots above, we can pass in arguments to add in extras like plot title, axis labels and colors.
boxplot(samplemeans~celltype, df, col=c("blue","red"), main="Average expression differences between celltypes", ylab="Expression")
Variables are buckets that hold information.
A variable is a symbolic name for (or reference to) information. Variables in computer programming are analogous to “buckets”, where information can be maintained and referenced. On the outside of the bucket is a name. When referring to the bucket, we use the name of the bucket, not the data stored in the bucket.
An example:
> x<-3
In the example above, we created a variable, or 'bucket', called x, and put a value inside it. Let's create another variable called y and give it a value of 5. When assigning a value to a variable, R does not print anything to the console. You can force R to print the value by wrapping the assignment in parentheses or by typing the variable name.
Other examples:
> y<-5
> x+y
[1] 8
> s<-x+y
Decimal values are called numerics in R. It is the default computational data type. If we assign a decimal value to a variable x as follows, x will be of numeric type.
> x = 10.5 # assign a decimal value
> x # print the value of x
[1] 10.5
> class(x) # print the class name of x
[1] "numeric"
Furthermore, even if we assign an integer to a variable k, it is still being saved as a numeric value.
> k = 1
> k # print the value of k
[1] 1
> class(k) # print the class name of k
[1] "numeric"
The fact that k is not an integer can be further confirmed with the is.integer function. We will see how to create an integer next.
> is.integer(k) # is k an integer?
[1] FALSE
In order to create an integer variable in R, we invoke the as.integer function. We can be assured that y is indeed an integer by applying the is.integer function.
> y = as.integer(3)
> y # print the value of y
[1] 3
> class(y) # print the class name of y
[1] "integer"
> is.integer(y) # is y an integer?
[1] TRUE
We can coerce a numeric value into an integer with the same as.integer function.
> as.integer(3.14) # coerce a numeric value
[1] 3
And we can parse a string for decimal values in much the same way.
> as.integer("5.27") # coerce a decimal string
[1] 5
On the other hand, parsing a non-numeric string fails and produces NA.
> as.integer("Joe") # coerce a non-numeric string
[1] NA
Warning message:
NAs introduced by coercion
Often, it is useful to perform arithmetic on logical values. Like the C language, TRUE has the value 1, while FALSE has value 0.
> as.integer(TRUE) # the numeric value of TRUE
[1] 1
> as.integer(FALSE) # the numeric value of FALSE
[1] 0
A character object is used to represent string values in R. We convert objects into character values with the as.character() function:
> x = as.character(3.14)
> x # print the character string
[1] "3.14"
> class(x) # print the class name of x
[1] "character"
Two character values can be concatenated with the paste function.
> fname = "Joe"; lname ="Smith"
> paste(fname, lname)
[1] "Joe Smith"
However, it is often more convenient to create a readable string with the sprintf function, which has a C language syntax.
> sprintf("%s has %d dollars", "Sam", 100)
[1] "Sam has 100 dollars"
To extract a substring, we apply the substr function.
Here is an example showing how to extract the substring between the third and twelfth positions in a string.
> substr("Mary has a little lamb.", start=3, stop=12)
[1] "ry has a l"
And to replace the first occurrence of the word "little" by another word "big" in the string, we apply the sub function.
> sub("little", "big", "Mary has a little lamb.")
[1] "Mary has a big lamb."
More functions for string manipulation can be found in the R documentation.
> help("sub")
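A few more base string functions worth knowing (a minimal sketch; all are part of base R):
> nchar("Mary has a little lamb.") # number of characters in the string
[1] 23
> toupper("little lamb") # convert to upper case
[1] "LITTLE LAMB"
> strsplit("Mary has a little lamb.", " ") # split on spaces; returns a list
[[1]]
[1] "Mary"   "has"    "a"      "little" "lamb."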
Logical values are TRUE and FALSE.
Complex values represent numbers with real and imaginary parts (e.g., 1+4i).
> z = 1 + 2i # create a complex number
> z # print the value of z
[1] 1+2i
> class(z) # print the class name of z
[1] "complex"
The following returns NaN with a warning, as -1 is not treated as a complex value.
> sqrt(-1) # square root of -1
[1] NaN
Warning message:
In sqrt(-1) : NaNs produced
Instead, we have to use the complex value -1+0i.
> sqrt(-1+0i) # square root of -1+0i
[1] 0+1i
An alternative is to coerce -1 into a complex value.
> sqrt(as.complex(-1))
[1] 0+1i
Vectors are a collection of numbers or characters or both
Vectors are the most common and basic data structure in R, and they are the workhorse of R
The analogy is a bucket with different compartments; Each compartment is called an element
Each element contains a single value
There is no limit to the number of elements
The vector is assigned to a single variable, because regardless of how many elements it contains it is still a single bucket
We create a vector named V:
> V<-c(1,2,3)
> V
[1] 1 2 3
Each element of the vector contains a single numeric value, and the three values are combined together using c() (the combine function). All of the values are put within the parentheses and separated with commas.
Create a vector called glengths with three elements, where each element corresponds to the genome size (in Mb) of a certain species.
glengths <- c(4.6, 3000, 50000)
glengths
A vector can also contain characters.
Create another vector called species with three elements, where each element is a species corresponding to the genome sizes in the glengths vector.
species <- c("ecoli", "human", "corn")
species
Factors are used to represent categorical data
Factors can be ordered or unordered
Factors are an important class for statistical analysis and for plotting
Factors are stored as integers, and have labels associated with these unique integers
While factors look (and often behave) like character vectors, they are actually integers under the hood
You need to be careful when treating them like strings
To create a factor vector we use the factor() function:
> F<-factor(c("F","M","F"))
> levels(F)
[1] "F" "M"
expression <- factor(c("low", "high", "medium", "high", "low", "medium", "high"))
levels(expression)
Sometimes, the order of the factors does not matter, other times you might want to specify the order
because it is meaningful (e.g., “low”, “medium”, “high”) or it is required by particular type of
analysis.
Additionally, specifying the order of the levels allows one to compare levels:
expression <- factor(expression, levels=c("low", "medium", "high"))
levels(expression)
min(expression) ## doesn't work
expression <- factor(expression, levels=c("low", "medium", "high"), ordered=TRUE)
levels(expression)
min(expression) ## works!
In R's memory, these factors are represented by integers (1, 2, 3). They are better than using simple integer labels because factors are self-describing: "low", "medium", and "high" is more descriptive than 1, 2, 3. Which is low? You wouldn't be able to tell with just integer data. Factors have this information built in.
It is particularly helpful when there are many levels.
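A common pitfall, shown as a minimal sketch (the vector below is made up): converting a factor directly to numeric returns the underlying integer codes, not the labels.
f <- factor(c("10", "3", "10"))
as.numeric(f)               # 1 2 1 -- the integer codes, not the original values!
as.numeric(as.character(f)) # 10 3 10 -- convert to character first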
A matrix in R is a collection of vectors of same length and identical datatype
Vectors can be combined as columns in the matrix or by row
Usually matrices are numeric and are used in various computational algorithms
If the input data are not all of identical type (numeric, character, etc.), the values will be silently coerced to a single common type rather than raising an error, which can cause surprises downstream (demonstrated after the example below)
An example of creating a matrix M using matrix() function:
> M = matrix(
+ c(1, 2, 3, 4, 5, 6,7,8,9), # the data elements
+ nrow=3, # number of rows
+ ncol=3, # number of columns
+ byrow = TRUE) # fill matrix by rows
> M
[,1] [,2] [,3]
[1,] 1 2 3
[2,] 4 5 6
[3,] 7 8 9
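To see the coercion in action, a minimal sketch with made-up values:
> M2 <- matrix(c(1, 2, "a", "b"), nrow=2) # mixing numbers and characters
> M2 # the numbers have been coerced to the character strings "1" and "2"
     [,1] [,2]
[1,] "1"  "a"
[2,] "2"  "b"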
A data.frame is a collection of vectors of identical lengths
each vector can be of a different data type (e.g., characters, integers, factors)
data.frame is the de facto data structure for most tabular data, and each vector represents a column
data.frame is commonly used for statistics and plotting
An example of creating a data frame DF using data.frame() function:
> v1<-c(1,2,3)
> f2<-factor(c("F","M","F"))
> v3<-c("a","b","c")
> DF<-data.frame(v1,f2,v3)
> DF
v1 f2 v3
1 1 F a
2 2 M b
3 3 F c
A list is a collection of data structures
There is no particular restriction for the components to be of the same mode or type
For example, a list could consist of a numeric vector, a logical value, a matrix, a complex vector, a character array, a function, and so on
An example of creating a list L using list() function:
> L<-list(gender=factor(c("F","M","F")), letter="a", number=c(1,2))
> L
$gender
[1] F M F
Levels: F M
$letter
[1] "a"
$number
[1] 1 2
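To pull components back out of a list, use $ with the component name or double square brackets (a quick sketch using the list above):
> L$gender      # extract the "gender" component by name
[1] F M F
Levels: F M
> L[["number"]] # double brackets also return the component itself
[1] 1 2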
The other key feature of R is functions. Functions are “self contained” modules of code that accomplish a specific task. Functions usually take in some sort of data structure (value, vector, dataframe, etc.), process it, and return a result.
The input(s) are called arguments and can be anything, not only numbers or characters, but also other data structures. Exactly what each argument means differs per function, and must be looked up in the documentation (we will discuss help options at the end of the Functions session). If an argument alters the way the function operates, such as whether to ignore 'bad values', it is sometimes called an option.
Most functions can take several arguments, but many have so-called defaults. If you don’t specify such an argument when calling the function, the function itself will fall back on using the default. This is a standard value that the author of the function specified as being “good enough in standard cases”. An example would be what symbol to use in a plot. However, if you want something specific, simply change the argument yourself with a value of your choice.
We have already used a few examples of basic functions in the previous lessons, e.g., c() and factor(). These functions are available as part of R's built-in capabilities, and we will explore a few more of these base functions below. You can also get functions from libraries (which we'll talk about in a bit), or even write your own.
Let's revisit a function that we have used previously to combine data, c(). It takes any number of numbers, characters or strings as arguments and combines them into a single vector. You can also use it to add elements to an existing vector:
glengths <- c(glengths, 90) # adding at the end
glengths <- c(30, glengths) # adding at the beginning
What happens here is that we take the original vector glengths and add another item, first at the end and then at the beginning. We can do this over and over again to build a vector or a dataset.
Since R is used for statistical computing, many of the base functions involve mathematical operations. One example is the function sqrt(). The input (argument) must be a number, and the output is the square root of that number. Let's try finding the square root of 81:
sqrt(81)
Executing a function (or ‘running it’) is referred to as calling the function.
Now what would happen if we called the function on a vector of values instead of a single value?
sqrt(glengths)
In this case the task was performed on each individual value of the vector, and the respective results were displayed.
Let's try a function where we can change some of the options, for example round:
round(3.14159)
We can see that we get 3. That’s because the default is to round to the nearest whole number. If we want a different number of digits, we can type digits=2 or however many we may want.
round(3.14159, digits=2)
If you provide the arguments in the exact same order as they are defined (in the help manual) you don’t have to name them:
round(3.14159, 2)
However, this is usually not recommended practice, because it's a lot of remembering to do, and if your shared code includes less well-known functions it becomes difficult to read (it's OK, however, to omit argument names for basic functions like mean, min, etc.). Another advantage of naming arguments is that the order doesn't matter. This is useful as functions start to have more arguments.
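A quick sketch of that advantage: with named arguments, order no longer matters.
round(3.14159, digits=2)   # conventional order
round(digits=2, x=3.14159) # same result (3.14); named arguments can come in any order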
Packages are collections of R functions, data, and compiled code in a well-defined format. The directory where packages are stored is called the library. The two terms are sometimes used synonymously, and there has been discussion amongst the community to resolve this. It is somewhat counter-intuitive to load a package using the library() function, and so you can see how confusion can arise.
There are a set of standard (or base) packages which are considered part of the R source code and automatically available as part of your R installation. Base packages contain the basic functions that allow R to work, and enable standard statistical and graphical functions on datasets; for example all of the functions that we have been using so far in our examples.
You can check what base packages are loaded by typing into the console:
sessionInfo()
In this course we will mostly be using functions from the standard base packages. However, the more you work with R, the more you will come to realize that there is a cornucopia of R packages offering a wide variety of functionality. Using additional packages requires installation.
Packages for R can be installed from the CRAN package repository using the install.packages function. An example is given below for the ggplot2 package, which will be required for some images we will create later on. If you do not have access to the internet, do not run this code; instead we will install from source (see below).
install.packages('ggplot2')
Alternatively, packages can also be installed from Bioconductor, another repository of packages but mostly pertaining to genomic data analysis. There are many packages that are available in CRAN and Bioconductor, but there are also packages that are specific to one repository. Generally, you can find out this information with a Google search or by trial and error. To install from Bioconductor, you will first need to install Bioconductor and all the standard packages. This only needs to be done once ever for your R installation. For older versions of R (R < 3.5.0):
source("http://bioconductor.org/biocLite.R")
biocLite()
Once you have the standard packages installed, you can install additional packages using the biocLite.R script. If it's a new R session you will also have to source the script again. Here we show that the same package ggplot2 is available through Bioconductor:
biocLite('ggplot2')
The current release of Bioconductor is version 3.16; it works with R version 4.2.2. To get the latest version of Bioconductor, enter the commands:
if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install(version = "3.16")
To install core packages, type the following in an R command window:
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install()
To install specific packages, e.g., DESeq2:
BiocManager::install("DESeq2")
Finally, R packages can also be installed from source. This is useful when you do not have an internet connection (and have the source files locally), since the other two methods retrieve the source files from remote sites. For this class, we can install ggplot2 from source, because we have provided a compressed file containing all the required information to build and install the package into your environment. First locate the file ggplot2_3.1.1.tar.gz in your directory. To install it, we use the same install.packages function, but with additional arguments changed from their defaults:
install.packages('ggplot2_3.1.1.tar.gz',type="source",repos=NULL)
Suppose we didn’t know how to use the round function and wanted more digits; the best way of finding out this information is to use the ? followed by the name of the function. Doing this will open up the help manual in the bottom right panel of RStudio:
?round
If you know the function, but just need to remind yourself of the names of the arguments, you can use:
args(round)
If you are looking for a function to do a particular task, you can use help.search() (but it only looks through the installed packages):
help.search("scatter")
If you can't find what you are looking for, you can use the RDocumentation website, which searches through the help files across all available packages.
Start by googling the error message. However, this doesn’t always work very well because often, package developers rely on the error catching provided by R. You end up with general error messages that might not be very helpful to diagnose a problem (e.g. “subscript out of bounds”).
You should also check Stack Overflow. Search using the [r] tag. Most questions have already been answered, but the challenge is to use the right words in the search to find the answers.
The Introduction to R can also be dense for people with little programming experience but it is a good place to understand the underpinnings of the R language.
The R FAQ is dense and technical but it is full of useful information.
The key to get help from someone is for them to grasp your problem rapidly. You should make it as easy as possible to pinpoint where the issue might be.
Try to use the correct words to describe your problem. For instance, a package is not the same thing as a library. Most people will understand what you meant, but others have really strong feelings about the difference in meaning. The key point is that it can make things confusing for people trying to help you. Be as precise as possible when describing your problem.
If possible, try to reduce what doesn't work to a simple reproducible example. If you can reproduce the problem using a very small data.frame instead of your 50,000-row, 10,000-column one, provide the small one with the description of your problem. When appropriate, try to generalize what you are doing so even people who are not in your field can understand the question.
To share an object with someone else, if it's relatively small, you can use the function dput(). It will output R code that can be used to recreate the exact same object as the one in memory:
dput(head(iris)) # iris is an example data.frame that comes with R
If the object is larger, provide the raw file (i.e., your CSV file) with your script up to the point of the error (and after removing everything that is not relevant to your issue). Alternatively, in particular if your question is not related to a data.frame, you can save any other R data structure in your environment to a file:
save(iris, file="/tmp/iris.RData")
The content of this file is not human readable and cannot be posted directly on Stack Overflow. It can, however, be emailed to someone who can read it with this command:
some_data <- load(file="~/Downloads/iris.RData")
Last, but certainly not least, always include the output of sessionInfo(), as it provides critical information about your platform, the versions of R and the packages you are using, and other information that can be very helpful in understanding your problem.
sessionInfo()
Your friendly colleagues: if you know someone with more experience than you, they might be able and willing to help you.
stackoverflow: if your question hasn’t been answered before and is well crafted, chances are you will get an answer in less than 5 min.
The R-help mailing list: it is read by a lot of people (including most of the R core team), and a lot of people post to it, but the tone can be pretty dry and it is not always very welcoming to new users. If your question is valid, you are likely to get an answer very fast, but don't expect it to come with smiley faces. Also, here more than anywhere else, be sure to use correct vocabulary (otherwise you might get an answer pointing to the misuse of your words rather than answering your question). You will also have more success if your question is about a base function rather than a specific package.
If your question is about a specific package, see if there is a mailing list for it. Usually it's included in the DESCRIPTION file of the package, which can be accessed using packageDescription("name-of-package"). You may also want to try to email the author of the package directly.
There are also some topic-specific mailing lists (GIS, phylogenetics, etc.); the complete list is available on the R Project website.
The Posting Guide for the R mailing lists.
The "How to ask for R help" site provides useful guidelines.
There are also selections of base functions in R that are useful for inspecting your data and summarizing it. Let's start with a simple data structure such as vectors. A commonly used function is length(), which tells you how many elements are in a particular vector:
length(glengths)
length(species)
The class() function is useful in indicating the datatype or data structure of a variable. So for example if we were interested in knowing what was inside glengths:
class(glengths)
We could also use class on a data frame or any other type of object. Let's load in a data frame to test out some more functions. We will use read.csv to read in data from a csv (comma separated values) file. There are numerous other functions to load in data depending on your filetype, but read.csv is one of the more commonly used ones.
baseDir<-getwd()
dataDir<-file.path(baseDir,"data")
metadata <- read.csv(file.path(dataDir, 'mouse_exp_design.csv'), header=T, sep=",", row.names=1)
The function has one required argument, and several options that can be changed. The mandatory argument is the path to the file (including the filename), which in our case is the mouse_exp_design.csv file. We put the function to the right of the assignment operator, meaning that any output will be saved as the variable name provided on the left.
Take a look at the file by typing out the variable name metadata and pressing return. The file contains information describing the samples in our study. Each row holds information for a single sample, and the columns represent genotype (Wt or KO), celltype (typeA or typeB), and replicate number.
genotype celltype replicate
sample1 Wt typeA 1
sample2 Wt typeA 2
sample3 Wt typeA 3
sample4 KO typeA 1
sample5 KO typeA 2
sample6 KO typeA 3
sample7 Wt typeB 1
sample8 Wt typeB 2
sample9 Wt typeB 3
sample10 KO typeB 1
sample11 KO typeB 2
sample12 KO typeB 3
Note: If you are using older versions of R (before 4.0.0):
By default, data.frame converts (= coerces) columns that contain characters (i.e., text) into the factor data type. Depending on what you want to do with the data, you may want to keep these columns as character. To do so, read.csv() and read.table() have an argument called stringsAsFactors which can be set to FALSE.
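For example (a sketch; only needed on R versions before 4.0.0):
# keep character columns as character rather than converting them to factor
metadata <- read.csv(file.path(dataDir, 'mouse_exp_design.csv'),
                     header=TRUE, row.names=1, stringsAsFactors=FALSE)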
Suppose we had a larger file; we might not want to display all the contents in the console. Instead we could check the top (the first 6 lines) of this data.frame using the function head():
head(metadata)
Let's now check the structure of this data.frame in more detail with the function str():
str(metadata)
If you are using older versions of R:
'data.frame': 12 obs. of 3 variables:
$ genotype : Factor w/ 2 levels "KO","Wt": 2 2 2 1 1 1 2 2 2 1 ...
$ celltype : Factor w/ 2 levels "typeA","typeB": 1 1 1 1 1 1 2 2 2 2 ...
$ replicate: int 1 2 3 1 2 3 1 2 3 1 ...
As you can see, the columns genotype
and celltype
are of a special class called factor whereas the replicate column has been interpreted as integer data type.
If you are using R 4.0.0 or newer:
'data.frame': 12 obs. of 3 variables:
$ genotype : chr "Wt" "Wt" "Wt" "KO" ...
$ celltype : chr "typeA" "typeA" "typeA" "typeA" ...
$ replicate: int 1 2 3 1 2 3 1 2 3 1 ...
We already saw how the functions head() and str() can be useful to check the content and the structure of a data.frame. Here is a non-exhaustive list of functions to get a sense of the content/structure of the data; a few of these are demonstrated after the list.
Size:
dim() - returns a vector with the number of rows as the first element, and the number of columns as the second element (the dimensions of the object)
nrow() - returns the number of rows
ncol() - returns the number of columns
Content:
head() - shows the first 6 rows
tail() - shows the last 6 rows
Names:
names() - returns the column names (a synonym of colnames() for data.frame objects)
rownames() - returns the row names
Summary:
str() - structure of the object and information about the class, length and content of each column
summary() - summary statistics for each column
Note: most of these functions are “generic”, they can be used on other types of objects besides data.frame.
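A quick sketch applying a few of these to our metadata data frame:
dim(metadata)      # 12  3 (rows, columns)
nrow(metadata)     # 12
colnames(metadata) # "genotype"  "celltype"  "replicate"
summary(metadata)  # per-column summary statistics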
When analyzing data, we often want to partition the data so that we are only working with selected columns or rows. A data frame or data matrix is simply a collection of vectors combined together. So let’s begin with vectors, then apply those concepts to dataframes.
If we want to extract one or several values from a vector, we must provide one or several indexes in square brackets. The index represents the element number within a vector (or the compartment number, if you think of the bucket analogy). R indexes start at 1. Programming languages like Fortran, MATLAB, and R start counting at 1, because that’s what human beings typically do. Languages in the C family (including C++, Java, Perl, and Python) count from 0 because that’s simpler for computers to do.
Let’s start by creating a vector called age
:
age <- c(15, 18, 22, 45, 52, 56, 67, 73, 81)
Suppose we only wanted the fifth value of this vector, we would use the following syntax:
age[5]
If we wanted to index more than one element we would still use the square bracket syntax, but rather than using a single value we would pass in a vector of the index values:
idx <- c(3,5,7)
age[idx]
To access a sequence of continuous values from a vector, we use the : operator, which creates numeric vectors of integers in increasing or decreasing order. Let's select the first five values from age:
age[1:5]
Alternatively, if you wanted the reverse, you could try 5:1 and see what is returned. The function seq() (for sequence) can also be used to create sequences, but allows for more complex patterns. Passing in the by argument will allow you to generate a sequence based on the specified interval:
seq(1, 10, by=2)
Additionally, the length.out parameter sets the length of the resulting vector. A combination of parameters can also be used:
seq(5, 10, length.out=3) # equal breaks of sequence into vector length = length.out
seq(50, by=5, length.out=10) # sequence 50 by 5 until you hit vector length = length.out
seq(1, 8, by=3) # sequence by 3 until you hit 8
Dataframes have 2 dimensions (rows and columns), so if we want to extract some specific data from it we need to specify the “coordinates” we want from it. We use the same square bracket syntax but rather than providing a single index, there are two inputs required. Within the square bracket, row numbers come first followed by column numbers (and the two are separated by a comma). For example:
metadata[1, 1] # first element in the first column of the data frame
metadata[1, 3] # first element in the 3rd column
Now if you only wanted to select based on rows, you would provide the indexes for the rows and leave the columns blank. The key here is to include the comma, to let R know that you are accessing a 2 dimensional data structure:
metadata[3, ] # the 3rd row for all columns
metadata[1:3, ] # first three rows
metadata[c(1,3,7), ] # first, third and seventh rows
Similarly, if you were selecting specific columns from the data frame - the rows are left blank:
metadata[ ,3] # the entire 3rd column
For larger datasets, it can be tricky to remember the column number that corresponds to a particular variable. (Is celltype in column 1 or 3? oh, right… they are in column 2). In some cases, the column number for a variable can change if the script you are using adds or removes columns. It’s therefore often better to use column names to refer to a particular variable, and it makes your code easier to read and your intentions clearer.
You can do operations on a particular column by selecting it using the $ sign. In this case, the entire column is a vector. For instance, to extract all the genotypes from our dataset, we can use metadata$genotype. You can use names(metadata) or colnames(metadata) to remind yourself of the column names.
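For example (combining $ with the vector indexing we saw earlier):
metadata$genotype      # the genotype column as a vector
metadata$genotype[1:5] # the first five genotypes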
To select multiple columns by name the square bracket syntax is used by concatenating a vector of strings that correspond to column names:
metadata[, c("genotype", "celltype")]
genotype celltype
sample1 Wt typeA
sample2 Wt typeA
sample3 Wt typeA
sample4 KO typeA
sample5 KO typeA
sample6 KO typeA
sample7 Wt typeB
sample8 Wt typeB
sample9 Wt typeB
sample10 KO typeB
sample11 KO typeB
sample12 KO typeB
Another way of partitioning your data is by filtering based on the content of your dataframe, using the subset() function. For example, we can look at the samples of the celltype "typeA":
subset(metadata, celltype == "typeA")
genotype celltype replicate
sample1 Wt typeA 1
sample2 Wt typeA 2
sample3 Wt typeA 3
sample4 KO typeA 1
sample5 KO typeA 2
sample6 KO typeA 3
We can also subset using other logical operators in R. For example, suppose we wanted to keep only the Wt samples of the typeA celltype.
subset(metadata, celltype == "typeA" & genotype == "Wt")
genotype celltype replicate
sample1 Wt typeA 1
sample2 Wt typeA 2
sample3 Wt typeA 3
Alternatively, we could try looking at only the first two replicates of each sample set. Here, we can use the less than operator since replicate is currently a numeric vector. Adding in the argument select allows us to specify which columns to keep. Which columns are left?
subset(metadata, replicate < 3, select = c('genotype', 'celltype'))
Often when working with genomic data, we have a data file that corresponds with our metadata file. The data file contains measurements from the biological assay for each individual sample. In our case, the biological assay is gene expression and data was generated using RNA-Seq. Let’s bring in the data matrix of RPKM values:
baseDir<-getwd()
dataDir<-file.path(baseDir,"data")
metadata <- read.table(file.path(dataDir, 'mouse_exp_design.csv'), header=T, sep=",", row.names=1)
rpkm_data <- read.table(file.path(dataDir, 'counts.rpkm'), header=T, sep=",", row.names=1)
Take a look at the first few lines of the data matrix to see what’s in there.
head(rpkm_data)
It looks as if the sample names (header) in our data matrix are similar to the row names of our metadata file, but it’s hard to tell since they are not in the same order. We can do a quick check of the dimensions and at least see if the numbers match up.
ncol(rpkm_data)
What we want to know is, do we have data for every sample that we have for metadata?
There are many ways to answer this using R. We’ll be using the match function, which takes at least 2 arguments: 1) a vector of values to be matched, and 2) a vector of values to be matched against. The function returns the position of the matches in the second vector. Take a look at the example below where vector B is the reverse of vector A:
A <- c(1,3,5,7,9,11)
B <- c(11,9,7,5,3,1)
A
B
Now if we use the match function with A as our first input and B as our second, you will be returned a vector of size length(A). Each number returned represents the index of vector B where the matching value was observed.
match(A,B)
Let’s change vector B so that only a subset are retained:
A <- c(1,3,5,7,9,11)
B <- c(9,5,1,1)
A
B
match(A,B)
Note: for values that don't match, you can specify which value to assign using the nomatch argument (by default this is set to NA). Also, as illustrated in the example, if there is more than one matching value found, only the first is reported.
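A quick sketch of nomatch, continuing with the vectors above:
match(A, B, nomatch=0) # returns 3 0 2 0 1 0; unmatched values get 0 instead of NA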
Subset data using matching
We are trying to match the row names of our metadata with the column names of our expression data, so these will be the arguments for match. There are base functions in R which allow you to extract the row and column names as a vector:
row.names(metadata)
colnames(rpkm_data)
Using these two arguments we will retrieve a vector of match indices. This vector represents the re-ordering of the column names in our data matrix to be identical to the rows in metadata:
m <- match(row.names(metadata), colnames(rpkm_data))
m
Now we can create a new data matrix in which columns are re-ordered based on the match indices:
data_ordered <- rpkm_data[,m]
Check and see what happened by using head. You can also verify that the column names of this new data matrix match the metadata row names by using the all function:
head(data_ordered)
all(row.names(metadata) == colnames(data_ordered))
The package dplyr provides easy tools for the most common data manipulation tasks, and is built to work directly with data frames. The thinking behind it was largely inspired by the package plyr, which has been in use for some time but suffered from being slow in some cases. dplyr addresses this by porting much of the computation to C++. An additional feature is the ability to work with data stored directly in an external database: the data can be managed natively in a relational database, queries can be conducted on that database, and only the results of the query are returned.
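As a minimal sketch (assuming dplyr is installed; this mirrors the subset() calls from earlier):
library(dplyr)
metadata %>%                      # start from the metadata data frame
  filter(celltype == "typeA") %>% # keep only typeA rows
  select(genotype, replicate)     # keep only two columns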
There’s also a plotting package called ggplot2 that adds a lot of functionality to the basic plots seen above. The syntax takes some getting used to but it’s extremely powerful and flexible. We can start by re-creating some of the above plots but using ggplot functions to get a feel for the syntax.
ggplot is best used on data in the data.frame form, so we will work with our combined df for the following figures. Let’s start by loading the ggplot2 library.
library(ggplot2)
baseDir<-getwd()
dataDir<-file.path(baseDir,"data")
metadata <- read.table(file.path(dataDir, 'mouse_exp_design.csv'), header=T, sep=",", row.names=1)
rpkm_data <- read.table(file.path(dataDir, 'counts.rpkm'), header=T, sep=",", row.names=1)
m <- match(row.names(metadata), colnames(rpkm_data))
data_ordered <- rpkm_data[,m]
samplemeans <- apply(data_ordered, 2, mean)
# Create a combined data frame
all(rownames(metadata) == names(samplemeans)) # sanity check for sample order
df <- cbind(metadata, samplemeans)
The ggplot() command creates a plot object. In it we assign our data frame to the data argument, and aes() creates what Hadley Wickham calls an aesthetic: a mapping of variables to various parts of the plot. Note that ggplot functions can be chained with + signs to add layers to the final plot. The next in the chain is geom_boxplot(). The geom functions specify the geometric objects that define the graph type; examples include geom_point(), geom_boxplot(), geom_line(), etc.
ggplot(data=df, aes(x= genotype, y=samplemeans))+geom_boxplot()
Unlike base R graphs, ggplot graphs are not affected by many of the options set in the par() function (e.g., adjusting relative sizes of axis labels using cex). They can be modified using the theme() function and by adding graphic parameters. Here, we will increase the size of the axis labels and the main title. We can also map the fill aesthetic to celltype.
ggplot(data=df, aes(x= genotype, y=samplemeans, fill=celltype)) +
geom_boxplot() +
ggtitle('Genotype differences in average gene expression') +
xlab('Genotype') +
ylab('Mean expression') +
theme(plot.title = element_text(size = rel(2.0)),
axis.title = element_text(size = rel(1.5)),
axis.text = element_text(size = rel(1.25)))
For the bar plot, we need to define the graph type to geom_bar. Since we don’t have an x variable, we need to specify the row names as our index so each sample is plotted on its own.
ggplot(data=df, aes(x=row.names(df), y=samplemeans, fill=genotype)) +
geom_bar(colour="black", stat="identity") +
ggtitle('Average expression for each sample') +
xlab('') +
ylab('Mean expression') +
theme(plot.title = element_text(size = rel(2.0)),
axis.title = element_text(size = rel(1.5)),
axis.text = element_text(size = rel(1.25)),
axis.text.x = element_text(angle=45, vjust=0.5, hjust=0.6, size = rel(1.25)))
We have only scratched the surface here. To learn more, see the ggplot reference site, and Winston Chang's excellent Cookbook for R site. Though slightly out of date, ggplot2: Elegant Graphics for Data Analysis is still the definitive book on this subject.
A figure that is often used in exploratory analysis of data is the PCA plot. PCA (principal components analysis) is a multivariate technique that allows us to summarize the systematic patterns of variation in the data. PCA takes the expression levels for all probes and transforms them into principal component space, reducing each sample to one point (as coordinates within that space). This allows us to separate samples according to expression variation, and identify potential outliers.
To plot a PCA plot we will be using ggplot, but first we will need to take the data matrix and generate the principal component vectors using prcomp. We will use pasilla RNA-Seq data as input to create a gene expression count matrix and a sample description object.
library(DESeq)
library(DESeq2)
library(pasilla)
data("pasillaGenes")
countData<-counts(pasillaGenes)
colData<-pData(pasillaGenes)[,c("condition","type")]
Let's first take a look at the gene expression count matrix countData.
head(countData)
treated1fb treated2fb treated3fb untreated1fb untreated2fb
FBgn0000003 0 0 1 0 0
FBgn0000008 78 46 43 47 89
FBgn0000014 2 0 0 0 0
FBgn0000015 1 0 1 0 1
FBgn0000017 3187 1672 1859 2445 4615
FBgn0000018 369 150 176 288 383
untreated3fb untreated4fb
FBgn0000003 0 0
FBgn0000008 53 27
FBgn0000014 1 0
FBgn0000015 1 2
FBgn0000017 2063 1711
FBgn0000018 135 174
Then let's take a look at the sample description file colData
colData
condition type
treated1fb treated single-read
treated2fb treated paired-end
treated3fb treated paired-end
untreated1fb untreated single-read
untreated2fb untreated single-read
untreated3fb untreated paired-end
untreated4fb untreated paired-end
One thing we notice is that in the count matrix, genes are rows while samples are columns. But prcomp() requires samples as rows and genes as columns; that is, we need a transposed version of what we currently have. R has a built-in function for transposing, denoted by t(). Let's take a look at how t() works.
> transposed_data<-t(countData)
> transposed_data[,1:5]
FBgn0000003 FBgn0000008 FBgn0000014 FBgn0000015 FBgn0000017
treated1fb 0 78 2 1 3187
treated2fb 0 46 0 0 1672
treated3fb 1 43 0 1 1859
untreated1fb 0 47 0 0 2445
untreated2fb 0 89 0 1 4615
untreated3fb 0 53 1 1 2063
untreated4fb 0 27 0 2 1711
Now samples are rows and genes are columns after the t() function is applied, and we are ready to run prcomp().
pca_results <- prcomp(t(countData))
Use the str function to take a quick peek at what is returned to us by the prcomp function. You can cross-reference with the help page (?prcomp) to see that it corresponds with what you would expect to be returned. There should be a list of five objects; the one we are interested in is x, which is a matrix of the principal component vectors. Let's save that data matrix by assigning it to a new variable.
pc_mat <- pca_results$x
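Optionally, a sketch (not required for the plot below): the sdev component gives each component's standard deviation, from which we can compute the percentage of variance explained, often shown on PCA axis labels.
pct_var <- round(100 * pca_results$sdev^2 / sum(pca_results$sdev^2), 1)
pct_var[1:2] # percent of total variance captured by PC1 and PC2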
We are going to take a look at the first two principal components by plotting them against each other. Since we will want to include information from our sample description object colData, we combine the first two columns of the PCA results with colData into a data frame for input to ggplot. The graphic type we are using is a scatter plot, denoted by geom_point(), and we have specified to color by condition.
df <- cbind(colData, pc_mat[,c('PC1', 'PC2')])
library(ggplot2)
ggplot(df, aes(PC1, PC2, color = condition)) + geom_point(size=3)
Another useful plot used to identify patterns in your data and potential outliers is to use heatmaps. A heatmap is a graphical representation of data where the individual values contained in a matrix are represented as colors. Heat maps are well-suited for visualizing large amounts of multi-dimensional data and can be used to identify clusters of rows or columns with similar values, as these are displayed as areas of similar color.
Our data matrix is quite large, and a heatmap of all genes would not be very informative without first selecting a subset of genes. Instead, we will generate a sample-to-sample correlation matrix by taking the correlation of count values for all pairwise combinations of samples. R has a built-in function to compute correlations, cor, which can take either two vectors or an entire matrix. We will give it our count matrix as input.
cor_mat <- cor(countData)
Check the dimensions of the matrix that is returned, and the range of values. Take a quick peek inside cor_mat.
cor_mat
treated1fb treated2fb treated3fb untreated1fb untreated2fb
treated1fb 1.0000000 0.9769470 0.9699387 0.9748449 0.9659681
treated2fb 0.9769470 1.0000000 0.9960757 0.9747916 0.9565659
treated3fb 0.9699387 0.9960757 1.0000000 0.9705672 0.9538853
untreated1fb 0.9748449 0.9747916 0.9705672 1.0000000 0.9758301
untreated2fb 0.9659681 0.9565659 0.9538853 0.9758301 1.0000000
untreated3fb 0.9547849 0.9738014 0.9746987 0.9772582 0.9783300
untreated4fb 0.9412804 0.9661953 0.9725255 0.9723502 0.9678800
untreated3fb untreated4fb
treated1fb 0.9547849 0.9412804
treated2fb 0.9738014 0.9661953
treated3fb 0.9746987 0.9725255
untreated1fb 0.9772582 0.9723502
untreated2fb 0.9783300 0.9678800
untreated3fb 1.0000000 0.9897768
untreated4fb 0.9897768 1.0000000
cor_mat will be the input to our heatmap function. To generate a heatmap we will use heatmap.2 which is part of the gplots package. Let's load the library:
library(gplots)
To plot the heatmap, we simply call the function and pass in our correlation matrix:
heatmap.2(cor_mat)
This generates a plot using the default settings. The default color gradient sets the highest value in the heat map to white, and the lowest value to a bright red, with a corresponding transition (or gradient) between these extremes. This color scheme can be changed by adding col= and specifying either a different built-in color palette by name, or creating your own palette.
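For instance, a sketch using colorRampPalette() from base grDevices to build a custom gradient (the palette choice here is arbitrary):
my_palette <- colorRampPalette(c("blue", "white", "red"))(75) # 75-step gradient
heatmap.2(cor_mat, col=my_palette)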
We notice that the sample names are not fully printed, so we can increase the margin size with margins=c(10,10). We can also remove the trace lines drawn through each cell with trace='none'.
heatmap.2(cor_mat, margins=c(10,10), trace='none')
It is often useful to combine heatmaps with hierarchical clustering, which is a way of arranging items in a hierarchy based on the distance or similarity between them. The result of a hierarchical clustering calculation is displayed in a heat map as a dendrogram, a tree-structure of the hierarchy. In our heatmap both rows and columns have been clustered, but since we have a symmetric matrix we can remove the column clustering (Colv=NULL and dendrogram="row"). We can also remove the trace by setting trace="none" and get rid of the legend with key=FALSE.
heatmap.2(cor_mat, margins=c(10,10), Colv=NULL, dendrogram="row", trace="none", key=FALSE)
As with any function in R, there are many ways to tweak the arguments to customize the heatmap. We encourage you to take time to read through the reference manual and explore other ways of generating heatmaps in R.