Download Intro_to_R.tar from https://ki-data.mit.edu/bcc/teaching/Intro_to_R/ onto your Desktop.
Un-compress Intro_to_R.tar to a folder named "Intro_to_R".
Open RStudio.
Under the File menu, click New Project and choose Existing Directory.
Use Browse to locate the "Intro_to_R" folder, then click Create Project. This will be your working directory for the rest of the day (e.g., ~/Desktop/Intro_to_R).
Create a new R script (File > New File > R script) and save it in your working directory (e.g. intro_to_R.R). Here, you can type all the commands we run during the course, and save it for later reference.
Using R as a calculator
> 1+2
[1] 3
> 1+2*3+4*5
[1] 27
R is good at statistics
> group1<-c(4.5,4.7,4.2,4.8,3.9)
> group2<-c(6.0,5.9,5.8,5.5,6.2)
> t.test(group1,group2)
Welch Two Sample t-test
data: group1 and group2
t = -7.2281, df = 7.1573, p-value = 0.0001556
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-1.9355112 -0.9844888
sample estimates:
mean of x mean of y
4.42 5.88
The R syntax from an example script
# Load libraries
library(Biobase)
library(limma)
library(ggplot2)
# Setup directory variables
baseDir <- getwd()
dataDir <- file.path(baseDir, "data")
# Load data
design_dat <- read.table(file.path(dataDir, 'mouse_exp_design.csv'), header=T, sep=",", row.names=1)
The above snippet of R code shows many different "parts of speech" of R syntax:
comments (#) and how they are used to document a script and its contents
the assignment operator <-
variables and functions
the = for function arguments
The backbone of the lesson is based on the teaching material developed by members of the teaching team at the Harvard Chan Bioinformatics Core (HBC). Many sections are directly adapted from the above teaching material. These are open access materials distributed under the terms of the Creative Commons Attribution license (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
The materials used in this lesson were derived from work that is Copyright © Data Carpentry (http://datacarpentry.org/). All Data Carpentry instructional material is made available under the Creative Commons Attribution license (CC BY 4.0).
Adapted from the lesson by Tracy Teal. Original contributors: Paul Wilson, Milad Fatenejad, Sasha Wood and Radhika Khetani for Software Carpentry (http://software-carpentry.org/)
R is a versatile, open source programming/scripting language available on all platforms that is useful not only for statistics, but also data science. It is inspired by the programming language S.
R is open-source software licensed under the GNU General Public License (GPL). It is comparable, and in many respects superior, to commercial alternatives, and is widely used in both academia and industry.
RStudio is a freely available open-source Integrated Development Environment (IDE). It can be downloaded from the RStudio website. It is a great alternative to working on R in the terminal for many reasons:
automatic syntax highlighting/formatting in the editor
direct code execution from editor to console
real-time access to environment, plotting, and history
good tool for workspace management
The RStudio interface has four main panels:
Console: where you can type commands and see output
Editor: where you can type out commands and save them to file. You can also run code in the console with Ctrl+Enter
Workspace/History: workspace shows all active objects and history keeps track of all commands run in console
Files/Plots/Packages/Help
Code and workflow are more reproducible if we can document everything that we do.
Our end goal is not just to “do stuff”, but to do it in a way that anyone can easily and exactly replicate our workflow and results.
All code should be written in the editor and saved to file, rather than working in the console. The R console should be used to inspect objects, test a function or get help.
Use # signs to comment. Comment liberally in your R scripts. This will help future you and other collaborators know what each line of code (or code block) was meant to do. Anything to the right of a # is ignored by R. A shortcut for this is Ctrl + Shift + C if you want to comment an entire chunk of text.
Organizing your working directory: You should separate the original data (raw data) from intermediate datasets that you may create for the needs of a particular analysis. For instance, you may want to create a data/ directory within your working directory that stores the data, a results/ directory for intermediate datasets, and a figures/ directory for the plots you will generate. The current working directory has a data folder already. Create additional directories called "results" and "figures". You can do this by navigating to the lower right panel and using the New Folder button in the Files tab.
There are two main ways of interacting with R: using the console or by using script files (plain text files that contain your code).
The console window (in RStudio, the bottom left panel) is the place where R is waiting for you to tell it what to do, and where it will show the results of a command. You can type commands directly into the console, but they will be forgotten when you close the session. It is better to enter the commands in the script editor and save the script. This way, you have a complete record of what you did, you can easily show others how you did it, and you can do it again later on if needed. You can copy-paste the code into the R console, but the RStudio script editor allows you to 'send' the current line or the currently selected text to the R console using the Ctrl-Enter shortcut.
If R is ready to accept commands, the R console shows a > prompt. If it receives a command (by typing, copy-pasting or sending from the script editor using Ctrl-Enter), R will try to execute it, and when ready, show the results and come back with a new > prompt to wait for new commands.
If R is still waiting for you to enter more input, the console will show a + prompt. It means that you haven't finished entering a complete command, usually because you have not 'closed' a parenthesis or quotation. If you're in RStudio and this happens, click inside the console window and press Esc (or Ctrl+C in a terminal R session); this should help you out of trouble.
Genomic data can be too big to process on your laptop. The BCC/BMC cluster has large amounts of memory and many CPUs, which enables processing big data in parallel. However, you need to be trained in this course before running R scripts on our cluster, so that you run them in appropriate ways.
Initialize a plot that will be written directly to a file using pdf(), png(), etc. Within the function you will need to specify a name for your image, and the width and height (optional). Then create a plot using the usual functions in R. Finally, close the file using the dev.off() function. There are also bmp(), tiff(), and jpeg() functions, though the jpeg function has proven less stable than the others.
pdf("boxplot.pdf")
ggplot(data=df, aes(x= genotype, y=samplemeans, fill=celltype)) +
geom_boxplot() +
ggtitle('Genotype differences in average gene expression') +
xlab('Genotype') +
ylab('Mean expression') +
theme(plot.title = element_text(size = rel(1.5)),
axis.title = element_text(size = rel(1.5)),
axis.text = element_text(size = rel(1.25)))
dev.off()
Let’s get a closer look at our data. Each column represents a sample in our experiment, and each sample has ~38K values corresponding to the expression of different transcripts. Suppose we wanted to compute the average value for a sample, or the minimum and maximum values? The R base package provides many built-in functions such as mean(), median(), min(), max(), and range(). Try computing the mean for "sample1":
baseDir<-getwd()
dataDir<-file.path(baseDir,"data")
metadata <- read.table(file.path(dataDir, 'mouse_exp_design.csv'), header=T, sep=",", row.names=1)
rpkm_data <- read.table(file.path(dataDir, 'counts.rpkm'), header=T, sep=",", row.names=1)
m <- match(row.names(metadata), colnames(rpkm_data))
data_ordered <- rpkm_data[,m]
mean(data_ordered[,'sample1'])
max(data_ordered[,'sample1'])
min(data_ordered[,'sample1'])
By default, all R functions operating on vectors that contain missing data will return NA. This is a way to make sure that users know they have missing data, and make a conscious decision on how to deal with it. When dealing with simple statistics like the mean, the easiest way to ignore NA (the missing data) is to use na.rm=TRUE (rm stands for remove). In some cases, it might be useful to remove the missing data from the vector altogether. For this purpose, R comes with the function na.omit, which generates a vector with the NAs removed. For some applications it's useful to keep all observations, while for others it might be best to remove all observations that contain missing data. The function complete.cases() returns a logical vector indicating which rows have no missing values.
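For example, a minimal sketch with a small made-up vector (not part of our dataset):
v <- c(2, 4, NA, 8)  # hypothetical vector with one missing value
mean(v)              # returns NA because of the missing value
mean(v, na.rm=TRUE)  # ignores the NA and returns 4.666667
na.omit(v)           # returns 2 4 8 (plus an attribute recording what was removed)
complete.cases(v)    # TRUE TRUE FALSE TRUE; use v[complete.cases(v)] to filter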
To obtain mean values for all samples we can use mean on each column individually, but there is also an easier way to go about it. The apply family of functions keeps you from having to write explicit loops (which can be slow in R) to perform some sort of operation on every row or column of a data matrix or a data frame. The family includes several functions, each differing slightly in its inputs or outputs.
base::apply Apply Functions Over Array Margins
base::by Apply a Function to a Data Frame Split by Factors
base::eapply Apply a Function Over Values in an Environment
base::lapply Apply a Function over a List or Vector
base::mapply Apply a Function to Multiple List or Vector Arguments
base::rapply Recursively Apply a Function to a List
base::tapply Apply a Function Over a Ragged Array
We will be using apply in our examples today, but do take a moment on your own to explore the many options that are available. The apply function returns a vector or array or list of values obtained by applying a function to margins of an array or matrix. We know about vectors/arrays and functions, but what are these "margins"? Margins refer to either the rows (denoted by 1), the columns (denoted by 2) or both (1:2). By "both", we mean apply the function to each individual value. Let's try this with the mean function on our data:
samplemeans <- apply(data_ordered, 2, mean)
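For contrast, a quick sketch on the same data (the variable name genemeans is just illustrative): changing the margin to 1 applies the function across rows instead of columns.
genemeans <- apply(data_ordered, 1, mean) # margin 1: mean of each row (i.e., each gene)
head(genemeans)                           # peek at the first few per-gene means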
The mathematician Richard Hamming once said, “The purpose of computing is insight, not numbers”, and the best way to develop insight is often to visualize data. Visualization deserves an entire lecture (or course) of its own, but we can explore a few features of R’s base plotting package.
When we are working with large sets of numbers it can be useful to display that information graphically. R has a number of built-in tools for basic graph types such as histograms, scatter plots, bar charts, boxplots and much more. We'll test a few of these out here on our samplemeans vector, but first we will create a combined data frame that maps our metadata to the sample mean values.
baseDir<-getwd()
dataDir<-file.path(baseDir,"data")
metadata <- read.table(file.path(dataDir, 'mouse_exp_design.csv'), header=T, sep=",", row.names=1)
rpkm_data <- read.table(file.path(dataDir, 'counts.rpkm'), header=T, sep=",", row.names=1)
m <- match(row.names(metadata), colnames(rpkm_data))
data_ordered <- rpkm_data[,m]
samplemeans <- apply(data_ordered, 2, mean)
# Create a combined data frame
all(rownames(metadata) == names(samplemeans)) # sanity check for sample order
df <- cbind(metadata, samplemeans)
Let’s start with a scatter plot. A scatter plot provides a graphical view of the relationship between two sets of numbers. We don't have a continuous variable in our metadata to plot the sample means against, so we will plot the values against their index values just to demonstrate the function.
par(mar = rep(5, 4))
plot(samplemeans)
Each point represents a sample: the value on the x-axis is the sample number, and the value on the y-axis corresponds to the average expression for that sample. For any plot you can customize many features of your graphs (fonts, colors, axes, titles) through graphic options. We can change the shape of the data points using pch.
plot(samplemeans, pch=8)
We can add a title to the plot by assigning a string to main
plot(samplemeans, pch=8, main="Scatter plot of mean values")
In the case of our data, a barplot would be much more useful. We can use barplot() to draw a single bar representing each sample, where the height indicates the average expression level.
barplot(samplemeans)
The sample names appear to be too large for the plot; we can change that with the cex.names value.
barplot(samplemeans, cex.names=0.5)
Now the names are too small to read. Alternatively, we can just change the names to numeric values and keep the default size.
barplot(samplemeans, names.arg=c(1:12)) # supply numbers as labels
We can also flip the axes so that the plot is projected horizontally.
barplot(samplemeans, names.arg=c(1:12), horiz=TRUE)
If we are interested in the overall distribution of values, a histogram is a very commonly used plot. It plots the frequencies with which data appear within certain ranges. To plot a histogram of the data, use the hist command:
hist(samplemeans)
The range of values for sample means is 9 to 16. As you can see R will automatically calculate the intervals to use. There are many options to determine how to break up the intervals. Let’s increase the number of breaks to see how that changes the plot:
hist(samplemeans, xlab="Mean expression level", main="", breaks=20)
Similar to the other plots we can tweak the aesthetics. Let’s color in the bar and remove the borders:
hist(samplemeans, xlab="Mean expression level", main="", col="darkgrey", border=FALSE)
Using additional sample information from our metadata, we can use plots to compare values between the two different celltypes 'typeA' and 'typeB' using a boxplot. A boxplot provides a graphical view of the median, quartiles, maximum, and minimum of a data set.
boxplot(samplemeans~celltype, df)
Similar to the plots above, we can pass in arguments to add in extras like plot title, axis labels and colors.
boxplot(samplemeans~celltype, df, col=c("blue","red"), main="Average expression differences between celltypes", ylab="Expression")
Variables are buckets that hold information.
A variable is a symbolic name for (or reference to) information. Variables in computer programming are analogous to “buckets”, where information can be maintained and referenced. On the outside of the bucket is a name. When referring to the bucket, we use the name of the bucket, not the data stored in the bucket.
An example:
> x<-3
In the example above, we created a variable, or 'bucket', called x, and put a value inside it. Let's create another variable called y and give it a value of 5. When assigning a value to a variable, R does not print anything to the console. You can force R to print the value by wrapping the assignment in parentheses or by typing the variable name.
Other examples:
> y<-5
> x+y
[1] 8
> s<-x+y
Decimal values are called numerics in R. It is the default computational data type. If we assign a decimal value to a variable x as follows, x will be of numeric type.
> x = 10.5 # assign a decimal value
> x # print the value of x
[1] 10.5
> class(x) # print the class name of x
[1] "numeric"
Furthermore, even if we assign an integer to a variable k, it is still being saved as a numeric value.
> k = 1
> k # print the value of k
[1] 1
> class(k) # print the class name of k
[1] "numeric"
The fact that k is not an integer can be further confirmed with the is.integer function. We will see how to create an integer next.
> is.integer(k) # is k an integer?
[1] FALSE
In order to create an integer variable in R, we invoke the as.integer function. We can be assured that y is indeed an integer by applying the is.integer function.
> y = as.integer(3)
> y # print the value of y
[1] 3
> class(y) # print the class name of y
[1] "integer"
> is.integer(y) # is y an integer?
[1] TRUE
We can coerce a numeric value into an integer with the same as.integer function.
> as.integer(3.14) # coerce a numeric value
[1] 3
And we can parse a string for decimal values in much the same way.
> as.integer("5.27") # coerce a decimal string
[1] 5
On the other hand, parsing a non-numeric string fails and produces NA.
> as.integer("Joe") # coerce a non-numeric string
[1] NA
Warning message:
NAs introduced by coercion
Often, it is useful to perform arithmetic on logical values. Like the C language, TRUE has the value 1, while FALSE has value 0.
> as.integer(TRUE) # the numeric value of TRUE
[1] 1
> as.integer(FALSE) # the numeric value of FALSE
[1] 0
A character object is used to represent string values in R. We convert objects into character values with the as.character() function:
> x = as.character(3.14)
> x # print the character string
[1] "3.14"
> class(x) # print the class name of x
[1] "character"
Two character values can be concatenated with the paste function.
> fname = "Joe"; lname ="Smith"
> paste(fname, lname)
[1] "Joe Smith"
However, it is often more convenient to create a readable string with the sprintf function, which has a C language syntax.
> sprintf("%s has %d dollars", "Sam", 100)
[1] "Sam has 100 dollars"
To extract a substring, we apply the substr function.
Here is an example showing how to extract the substring between the third and twelfth positions in a string.
> substr("Mary has a little lamb.", start=3, stop=12)
[1] "ry has a l"
And to replace the first occurrence of the word "little" by another word "big" in the string, we apply the sub function.
> sub("little", "big", "Mary has a little lamb.")
[1] "Mary has a big lamb."
More functions for string manipulation can be found in the R documentation.
> help("sub")
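A few more base string functions worth knowing (a minimal sketch; all are part of base R):
> nchar("Mary has a little lamb.") # number of characters in the string
[1] 23
> toupper("little lamb") # convert to upper case
[1] "LITTLE LAMB"
> strsplit("Mary has a little lamb.", " ") # split on spaces; returns a list
[[1]]
[1] "Mary"   "has"    "a"      "little" "lamb."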
Logical values are TRUE and FALSE.
Complex values represent numbers with real and imaginary parts (e.g., 1+4i).
> z = 1 + 2i # create a complex number
> z # print the value of z
[1] 1+2i
> class(z) # print the class name of z
[1] "complex"
The following returns NaN with a warning, as -1 is not treated as a complex value.
> sqrt(-1) # square root of -1
[1] NaN
Warning message:
In sqrt(-1) : NaNs produced
Instead, we have to use the complex value -1+0i.
> sqrt(-1+0i) # square root of -1+0i
[1] 0+1i
An alternative is to coerce -1 into a complex value.
> sqrt(as.complex(-1))
[1] 0+1i
Vectors are a collection of numbers or characters or both
Vectors are the most common and basic data structure in R, and they are the workhorse of R
The analogy is a bucket with different compartments; Each compartment is called an element
Each element contains a single value
There is no limit to the number of elements
The vector is assigned to a single variable, because regardless of how many elements it contains it is still a single bucket
We create a vector named V:
> V<-c(1,2,3)
> V
[1] 1 2 3
Each element of the vector contains a single numeric value, and the three values are combined together using c() (the combine function). All of the values are put within the parentheses and separated with commas.
Create a vector called glengths with three elements, where each element corresponds to the genome size (in Mb) of a certain species.
glengths <- c(4.6, 3000, 50000)
glengths
A vector can also contain characters.
Create another vector called species with three elements, where each element is a species corresponding to the genome sizes in the glengths vector.
species <- c("ecoli", "human", "corn")
species
Factors are used to represent categorical data
Factors can be ordered or unordered
Factors are an important class for statistical analysis and for plotting
Factors are stored as integers, and have labels associated with these unique integers
While factors look (and often behave) like character vectors, they are actually integers under the hood
You need to be careful when treating them like strings
To create a factor vector we use the factor() function:
> F<-factor(c("F","M","F"))
> levels(F)
[1] "F" "M"
expression <- factor(c("low", "high", "medium", "high", "low", "medium", "high"))
levels(expression)
Sometimes, the order of the factors does not matter, other times you might want to specify the order
because it is meaningful (e.g., “low”, “medium”, “high”) or it is required by particular type of
analysis.
Additionally, specifying the order of the levels allows one to compare levels:
expression <- factor(expression, levels=c("low", "medium", "high"))
levels(expression)
min(expression) ## doesn't work
expression <- factor(expression, levels=c("low", "medium", "high"), ordered=TRUE)
levels(expression)
min(expression) ## works!
In R's memory, these factors are represented by integers (1, 2, 3). They are better than using simple integer labels because factors are self-describing: "low", "medium", and "high" is more descriptive than 1, 2, 3. Which is low? You wouldn't be able to tell with just integer data. Factors have this information built in.
It is particularly helpful when there are many levels.
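A common pitfall, shown as a minimal sketch (the vector below is made up): converting a factor directly to numeric returns the underlying integer codes, not the labels.
f <- factor(c("10", "3", "10"))
as.numeric(f)               # 1 2 1 -- the integer codes, not the original values!
as.numeric(as.character(f)) # 10 3 10 -- convert to character first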
A matrix in R is a collection of vectors of same length and identical datatype
Vectors can be combined as columns in the matrix or by row
Usually matrices are numeric and are used in various computational algorithms
If the input data are not all of identical type (numeric, character, etc.), the values will be silently coerced to a single common type rather than raising an error, which can cause surprises downstream (demonstrated after the example below)
An example of creating a matrix M using matrix() function:
> M = matrix(
+ c(1, 2, 3, 4, 5, 6,7,8,9), # the data elements
+ nrow=3, # number of rows
+ ncol=3, # number of columns
+ byrow = TRUE) # fill matrix by rows
> M
[,1] [,2] [,3]
[1,] 1 2 3
[2,] 4 5 6
[3,] 7 8 9
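To see the coercion in action, a minimal sketch with made-up values:
> M2 <- matrix(c(1, 2, "a", "b"), nrow=2) # mixing numbers and characters
> M2 # the numbers have been coerced to the character strings "1" and "2"
     [,1] [,2]
[1,] "1"  "a"
[2,] "2"  "b"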
A data.frame is a collection of vectors of identical lengths
each vector can be of a different data type (e.g., characters, integers, factors)
data.frame is the de facto data structure for most tabular data, and each vector represents a column
data.frame is commonly used for statistics and plotting
An example of creating a data frame DF using data.frame() function:
> v1<-c(1,2,3)
> f2<-factor(c("F","M","F"))
> v3<-c("a","b","c")
> DF<-data.frame(v1,f2,v3)
> DF
v1 f2 v3
1 1 F a
2 2 M b
3 3 F c
A list is a collection of data structures
There is no particular restriction for the components to be of the same mode or type
For example, a list could consist of a numeric vector, a logical value, a matrix, a complex vector, a character array, a function, and so on
An example of creating a list L using list() function:
> L<-list(gender=factor(c("F","M","F")), letter="a", number=c(1,2))
> L
$gender
[1] F M F
Levels: F M
$letter
[1] "a"
$number
[1] 1 2
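To pull components back out of a list, use $ with the component name or double square brackets (a quick sketch using the list above):
> L$gender      # extract the "gender" component by name
[1] F M F
Levels: F M
> L[["number"]] # double brackets also return the component itself
[1] 1 2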
The other key feature of R is functions. Functions are “self contained” modules of code that accomplish a specific task. Functions usually take in some sort of data structure (value, vector, dataframe, etc.), process it, and return a result.
The input(s) are called arguments and can be anything, not only numbers or characters, but also other data structures. Exactly what each argument means differs per function, and must be looked up in the documentation (we will discuss help options at the end of the Functions session). If an argument alters the way the function operates, such as whether to ignore 'bad values', it is sometimes called an option.
Most functions can take several arguments, but many have so-called defaults. If you don’t specify such an argument when calling the function, the function itself will fall back on using the default. This is a standard value that the author of the function specified as being “good enough in standard cases”. An example would be what symbol to use in a plot. However, if you want something specific, simply change the argument yourself with a value of your choice.
We have already used a few examples of basic functions in the previous lessons, e.g., c() and factor(). These functions are available as part of R's built-in capabilities, and we will explore a few more of these base functions below. You can also get functions from libraries (which we'll talk about in a bit), or even write your own.
Let's revisit a function that we have used previously to combine data, c(). It takes any number of numbers, characters or strings as arguments and combines them into a single vector. You can also use it to add elements to an existing vector:
glengths <- c(glengths, 90) # adding at the end
glengths <- c(30, glengths) # adding at the beginning
What happens here is that we take the original vector glengths and add another item, first at the end and then at the beginning. We can do this over and over again to build a vector or a dataset.
Since R is used for statistical computing, many of the base functions involve mathematical operations. One example is the function sqrt(). The input (argument) must be a number, and the output is the square root of that number. Let's try finding the square root of 81:
sqrt(81)
Executing a function (or ‘running it’) is referred to as calling the function.
Now what would happen if we called the function on a vector of values instead of a single value?
sqrt(glengths)
In this case the task was performed on each individual value of the vector, and the respective results were displayed.
Let's try a function where we can change some of the options, for example round:
round(3.14159)
We can see that we get 3. That’s because the default is to round to the nearest whole number. If we want a different number of digits, we can type digits=2 or however many we may want.
round(3.14159, digits=2)
If you provide the arguments in the exact same order as they are defined (in the help manual) you don’t have to name them:
round(3.14159, 2)
However, this is usually not recommended practice, because it's a lot of remembering to do, and if your shared code includes less well-known functions it becomes difficult to read (it's OK, however, to omit argument names for basic functions like mean, min, etc.). Another advantage of naming arguments is that the order doesn't matter. This is useful as functions start to have more arguments.
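A quick sketch of that advantage: with named arguments, order no longer matters.
round(3.14159, digits=2)   # conventional order
round(digits=2, x=3.14159) # same result (3.14); named arguments can come in any order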
Packages are collections of R functions, data, and compiled code in a well-defined format. The directory where packages are stored is called the library. The two terms are sometimes used synonymously, and there has been discussion amongst the community to resolve this. It is somewhat counter-intuitive to load a package using the library() function, and so you can see how confusion can arise.
There are a set of standard (or base) packages which are considered part of the R source code and automatically available as part of your R installation. Base packages contain the basic functions that allow R to work, and enable standard statistical and graphical functions on datasets; for example all of the functions that we have been using so far in our examples.
You can check what base packages are loaded by typing into the console:
sessionInfo()
In this course we will mostly be using functions from the standard base packages. However, the more you work with R, the more you will come to realize that there is a cornucopia of R packages offering a wide variety of functionality. Using additional packages requires installation.
Packages for R can be installed from the CRAN package repository using the install.packages function. An example is given below for the ggplot2 package, which will be required for some images we will create later on. If you do not have access to the internet, do not run this code; instead we will install from source (see below).
install.packages('ggplot2')
Alternatively, packages can also be installed from Bioconductor, another repository of packages but mostly pertaining to genomic data analysis. There are many packages that are available in CRAN and Bioconductor, but there are also packages that are specific to one repository. Generally, you can find out this information with a Google search or by trial and error. To install from Bioconductor, you will first need to install Bioconductor and all the standard packages. This only needs to be done once ever for your R installation. For older versions of R (R < 3.5.0):
source("http://bioconductor.org/biocLite.R")
biocLite()
Once you have the standard packages installed, you can install additional packages using the biocLite.R script. If it's a new R session you will also have to source the script again. Here we show that the same package ggplot2 is available through Bioconductor:
biocLite('ggplot2')
The current release of Bioconductor is version 3.16; it works with R version 4.2.2. To get the latest version of Bioconductor, enter the commands:
if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install(version = "3.16")
To install core packages, type the following in an R command window:
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install()
To install specific packages, e.g., DESeq2:
BiocManager::install("DESeq2")
Finally, R packages can also be installed from source. This is useful when you do not have an internet connection (and have the source files locally), since the other two methods retrieve the source files from remote sites. For this class, we can install ggplot2 from source, because we have provided a compressed file containing all the required information to build and install the package into your environment. First locate the file ggplot2_3.1.1.tar.gz in your directory. To install it, we use the same install.packages function, but with additional arguments changed from their defaults:
install.packages('ggplot2_3.1.1.tar.gz',type="source",repos=NULL)
Suppose we didn’t know how to use the round function and wanted more digits; the best way of finding out this information is to use the ? followed by the name of the function. Doing this will open up the help manual in the bottom right panel of RStudio:
?round
If you know the function, but just need to remind yourself of the names of the arguments, you can use:
args(round)
If you are looking for a function to do a particular task, you can use help.search() (but it only looks through the installed packages):
help.search("scatter")
If you can't find what you are looking for, you can use the RDocumentation website, which searches through the help files across all available packages.
Start by googling the error message. However, this doesn’t always work very well because often, package developers rely on the error catching provided by R. You end up with general error messages that might not be very helpful to diagnose a problem (e.g. “subscript out of bounds”).
You should also check Stack Overflow. Search using the [r] tag. Most questions have already been answered, but the challenge is to use the right words in the search to find the answers.
The Introduction to R can also be dense for people with little programming experience but it is a good place to understand the underpinnings of the R language.
The R FAQ is dense and technical but it is full of useful information.
The key to get help from someone is for them to grasp your problem rapidly. You should make it as easy as possible to pinpoint where the issue might be.
Try to use the correct words to describe your problem. For instance, a package is not the same thing as a library. Most people will understand what you meant, but others have really strong feelings about the difference in meaning. The key point is that it can make things confusing for people trying to help you. Be as precise as possible when describing your problem.
If possible, try to reduce what doesn't work to a simple reproducible example. If you can reproduce the problem using a very small data.frame instead of your 50,000-row, 10,000-column one, provide the small one with the description of your problem. When appropriate, try to generalize what you are doing so even people who are not in your field can understand the question.
To share an object with someone else, if it's relatively small, you can use the function dput(). It will output R code that can be used to recreate the exact same object as the one in memory:
dput(head(iris)) # iris is an example data.frame that comes with R
If the object is larger, provide the raw file (i.e., your CSV file) with your script up to the point of the error (and after removing everything that is not relevant to your issue). Alternatively, in particular if your question is not related to a data.frame, you can save any other R data structure in your environment to a file:
save(iris, file="/tmp/iris.RData")
The content of this file is not human readable and cannot be posted directly on Stack Overflow. It can, however, be emailed to someone who can read it with this command:
some_data <- load(file="~/Downloads/iris.RData")
Last, but certainly not least, always include the output of sessionInfo(), as it provides critical information about your platform, the versions of R and the packages you are using, and other information that can be very helpful in understanding your problem.
sessionInfo()
Your friendly colleagues: if you know someone with more experience than you, they might be able and willing to help you.
stackoverflow: if your question hasn’t been answered before and is well crafted, chances are you will get an answer in less than 5 min.
The R-help mailing list: it is read by a lot of people (including most of the R core team), and a lot of people post to it, but the tone can be pretty dry and it is not always very welcoming to new users. If your question is valid, you are likely to get an answer very fast, but don't expect it to come with smiley faces. Also, here more than anywhere else, be sure to use correct vocabulary (otherwise you might get an answer pointing to the misuse of your words rather than answering your question). You will also have more success if your question is about a base function rather than a specific package.
If your question is about a specific package, see if there is a mailing list for it. Usually it's included in the DESCRIPTION file of the package, which can be accessed using packageDescription("name-of-package"). You may also want to try to email the author of the package directly.
There are also some topic-specific mailing lists (GIS, phylogenetics, etc.); the complete list is available on the R Project website.
The Posting Guide for the R mailing lists.
The "How to ask for R help" site provides useful guidelines.
There are also selections of base functions in R that are useful for inspecting your data and summarizing it. Let's start with a simple data structure such as vectors. A commonly used function is length(), which tells you how many elements are in a particular vector:
length(glengths)
length(species)
The class() function is useful in indicating the datatype or data structure of a variable. So for example if we were interested in knowing what was inside glengths:
class(glengths)
We could also use class on a data frame or any other type of object. Let's load in a data frame to test out some more functions. We will use read.csv to read in data from a csv (comma separated values) file. There are numerous other functions to load in data depending on your filetype, but read.csv is one of the more commonly used ones.
baseDir<-getwd()
dataDir<-file.path(baseDir,"data")
metadata <- read.csv(file.path(dataDir, 'mouse_exp_design.csv'), header=T, sep=",", row.names=1)
The function has one required argument, and several options that can be changed. The mandatory argument is the path to the file (including the filename), which in our case is the mouse_exp_design.csv file. We put the function to the right of the assignment operator, meaning that any output will be saved as the variable name provided on the left.
Take a look at the file by typing out the variable name metadata and pressing return. The file contains information describing the samples in our study. Each row holds information for a single sample, and the columns represent genotype (Wt or KO), celltype (typeA or typeB), and replicate number.
genotype celltype replicate
sample1 Wt typeA 1
sample2 Wt typeA 2
sample3 Wt typeA 3
sample4 KO typeA 1
sample5 KO typeA 2
sample6 KO typeA 3
sample7 Wt typeB 1
sample8 Wt typeB 2
sample9 Wt typeB 3
sample10 KO typeB 1
sample11 KO typeB 2
sample12 KO typeB 3
Note: If you are using older versions of R (before 4.0.0):
By default, data.frame converts (= coerces) columns that contain characters (i.e., text) into the factor data type. Depending on what you want to do with the data, you may want to keep these columns as character. To do so, read.csv() and read.table() have an argument called stringsAsFactors which can be set to FALSE.
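For example (a sketch; only needed on R versions before 4.0.0):
# keep character columns as character rather than converting them to factor
metadata <- read.csv(file.path(dataDir, 'mouse_exp_design.csv'),
                     header=TRUE, row.names=1, stringsAsFactors=FALSE)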
Suppose we had a larger file; we might not want to display all the contents in the console. Instead we could check the top (the first 6 lines) of this data.frame using the function head():
head(metadata)
Let's now check the structure of this data.frame in more detail with the function str():
str(metadata)
If you are using older versions of R:
'data.frame': 12 obs. of 3 variables:
$ genotype : Factor w/ 2 levels "KO","Wt": 2 2 2 1 1 1 2 2 2 1 ...
$ celltype : Factor w/ 2 levels "typeA","typeB": 1 1 1 1 1 1 2 2 2 2 ...
$ replicate: int 1 2 3 1 2 3 1 2 3 1 ...
As you can see, the columns genotype
and celltype
are of a special class called factor whereas the replicate column has been interpreted as integer data type.
If you are using R 4.0.0 or newer:
'data.frame': 12 obs. of 3 variables:
$ genotype : chr "Wt" "Wt" "Wt" "KO" ...
$ celltype : chr "typeA" "typeA" "typeA" "typeA" ...
$ replicate: int 1 2 3 1 2 3 1 2 3 1 ...
We already saw how the functions head() and str() can be useful to check the content and the structure of a data.frame. Here is a non-exhaustive list of functions to get a sense of the content/structure of the data; a few of these are demonstrated after the list.
Size:
dim() - returns a vector with the number of rows as the first element, and the number of columns as the second element (the dimensions of the object)
nrow() - returns the number of rows
ncol() - returns the number of columns
Content:
head() - shows the first 6 rows
tail() - shows the last 6 rows
Names:
names() - returns the column names (a synonym of colnames() for data.frame objects)
rownames() - returns the row names
Summary:
str() - structure of the object and information about the class, length and content of each column
summary() - summary statistics for each column
Note: most of these functions are “generic”, they can be used on other types of objects besides data.frame.
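A quick sketch applying a few of these to our metadata data frame:
dim(metadata)      # 12  3 (rows, columns)
nrow(metadata)     # 12
colnames(metadata) # "genotype"  "celltype"  "replicate"
summary(metadata)  # per-column summary statistics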
When analyzing data, we often want to partition the data so that we are only working with selected columns or rows. A data frame or data matrix is simply a collection of vectors combined together. So let’s begin with vectors, then apply those concepts to dataframes.
If we want to extract one or several values from a vector, we must provide one or several indexes in square brackets. The index represents the element number within a vector (or the compartment number, if you think of the bucket analogy). R indexes start at 1. Programming languages like Fortran, MATLAB, and R start counting at 1, because that’s what human beings typically do. Languages in the C family (including C++, Java, Perl, and Python) count from 0 because that’s simpler for computers to do.
Let’s start by creating a vector called age
:
age <- c(15, 18, 22, 45, 52, 56, 67, 73, 81)
Suppose we only wanted the fifth value of this vector, we would use the following syntax:
age[5]
If we wanted to index more than one element we would still use the square bracket syntax, but rather than using a single value we would pass in a vector of the index values:
idx <- c(3,5,7)
age[idx]
To access a sequence of continuous values from a vector, we use the : operator, which creates numeric vectors of integers in increasing or decreasing order. Let's select the first five values from age:
age[1:5]
Alternatively, if you wanted the reverse, you could try 5:1 and see what is returned. The function seq() (for sequence) can also be used to create sequences, but allows for more complex patterns. Passing in the by argument will allow you to generate a sequence based on the specified interval:
seq(1, 10, by=2)
Additionally, the length.out parameter sets the length of the resulting vector. A combination of parameters can also be used:
seq(5, 10, length.out=3) # equal breaks of sequence into vector length = length.out
seq(50, by=5, length.out=10) # sequence 50 by 5 until you hit vector length = length.out
seq(1, 8, by=3) # sequence by 3 until you hit 8
Dataframes have 2 dimensions (rows and columns), so if we want to extract some specific data from it we need to specify the “coordinates” we want from it. We use the same square bracket syntax but rather than providing a single index, there are two inputs required. Within the square bracket, row numbers come first followed by column numbers (and the two are separated by a comma). For example:
metadata[1, 1] # first element in the first column of the data frame
metadata[1, 3] # first element in the 3rd column
Now if you only wanted to select based on rows, you would provide the indexes for the rows and leave the columns blank. The key here is to include the comma, to let R know that you are accessing a 2 dimensional data structure:
metadata[3, ] # the 3rd row for all columns
metadata[1:3, ] # first three rows
metadata[c(1,3,7), ] # first, third and seventh rows
Similarly, if you were selecting specific columns from the data frame - the rows are left blank:
metadata[ ,3] # the entire 3rd column
For larger datasets, it can be tricky to remember the column number that corresponds to a particular variable. (Is celltype in column 1 or 3? oh, right… they are in column 2). In some cases, the column number for a variable can change if the script you are using adds or removes columns. It’s therefore often better to use column names to refer to a particular variable, and it makes your code easier to read and your intentions clearer.
You can do operations on a particular column by selecting it using the $ sign. In this case, the entire column is a vector. For instance, to extract all the genotypes from our dataset, we can use metadata$genotype. You can use names(metadata) or colnames(metadata) to remind yourself of the column names.
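For example (combining $ with the vector indexing we saw earlier):
metadata$genotype      # the genotype column as a vector
metadata$genotype[1:5] # the first five genotypes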
To select multiple columns by name the square bracket syntax is used by concatenating a vector of strings that correspond to column names:
metadata[, c("genotype", "celltype")]
genotype celltype
sample1 Wt typeA
sample2 Wt typeA
sample3 Wt typeA
sample4 KO typeA
sample5 KO typeA
sample6 KO typeA
sample7 Wt typeB
sample8 Wt typeB
sample9 Wt typeB
sample10 KO typeB
sample11 KO typeB
sample12 KO typeB
Another way of partitioning your data is by filtering based on the content of your dataframe, using the subset() function. For example, we can look at the samples of the celltype "typeA":
subset(metadata, celltype == "typeA")
genotype celltype replicate
sample1 Wt typeA 1
sample2 Wt typeA 2
sample3 Wt typeA 3
sample4 KO typeA 1
sample5 KO typeA 2
sample6 KO typeA 3
We can also subset using other logical operators in R. For example, suppose we wanted to keep only the Wt samples of the typeA celltype.
subset(metadata, celltype == "typeA" & genotype == "Wt")
genotype celltype replicate
sample1 Wt typeA 1
sample2 Wt typeA 2
sample3 Wt typeA 3
Alternatively, we could try looking at only the first two replicates of each sample set. Here, we can use the less than operator since replicate is currently a numeric vector. Adding in the argument select allows us to specify which columns to keep. Which columns are left?
subset(metadata, replicate < 3, select = c('genotype', 'celltype'))
Often when working with genomic data, we have a data file that corresponds with our metadata file. The data file contains measurements from the biological assay for each individual sample. In our case, the biological assay is gene expression and data was generated using RNA-Seq. Let’s bring in the data matrix of RPKM values:
baseDir<-getwd()
dataDir<-file.path(baseDir,"data")
metadata <- read.table(file.path(dataDir, 'mouse_exp_design.csv'), header=T, sep=",", row.names=1)
rpkm_data <- read.table(file.path(dataDir, 'counts.rpkm'), header=T, sep=",", row.names=1)
Take a look at the first few lines of the data matrix to see what’s in there.
head(rpkm_data)
It looks as if the sample names (header) in our data matrix are similar to the row names of our metadata file, but it’s hard to tell since they are not in the same order. We can do a quick check of the dimensions and at least see if the numbers match up.
ncol(rpkm_data)
What we want to know is, do we have data for every sample that we have for metadata?
There are many ways to answer this using R. We’ll be using the match function, which takes at least 2 arguments: 1) a vector of values to be matched, and 2) a vector of values to be matched against. The function returns the position of the matches in the second vector. Take a look at the example below where vector B is the reverse of vector A:
A <- c(1,3,5,7,9,11)
B <- c(11,9,7,5,3,1)
A
B
Now if we use the match function with A as our first input and B as our second, you will be returned a vector of size length(A). Each number returned represents the index of vector B where the matching value was observed.
match(A,B)
Let’s change vector B so that only a subset are retained:
A <- c(1,3,5,7,9,11)
B <- c(9,5,1,1)
A
B
match(A,B)
Note: for values that don't match, you can specify which value to assign using the nomatch argument (by default this is set to NA). Also, as illustrated in the example, if there is more than one matching value found, only the first is reported.
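A quick sketch of nomatch, continuing with the vectors above:
match(A, B, nomatch=0) # returns 3 0 2 0 1 0; unmatched values get 0 instead of NA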
Subset data using matching
We are trying to match the row names of our metadata with the column names of our expression data, so these will be the arguments for match. There are base functions in R which allow you to extract the row and column names as a vector:
row.names(metadata)
colnames(rpkm_data)
Using these two arguments we will retrieve a vector of match indices. This vector represents the re-ordering of the column names in our data matrix to be identical to the rows in metadata:
m <- match(row.names(metadata), colnames(rpkm_data))
m
Now we can create a new data matrix in which columns are re-ordered based on the match indices:
data_ordered <- rpkm_data[,m]
Check and see what happened by using head. You can also verify that the column names of this new data matrix match the metadata row names by using the all function:
head(data_ordered)
all(row.names(metadata) == colnames(data_ordered))
The package dplyr provides easy tools for the most common data manipulation tasks, and is built to work directly with data frames. The thinking behind it was largely inspired by the package plyr, which has been in use for some time but suffered from being slow in some cases. dplyr addresses this by porting much of the computation to C++. An additional feature is the ability to work with data stored directly in an external database: the data can be managed natively in a relational database, queries can be conducted on that database, and only the results of the query are returned.
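As a minimal sketch (assuming dplyr is installed; this mirrors the subset() calls from earlier):
library(dplyr)
metadata %>%                      # start from the metadata data frame
  filter(celltype == "typeA") %>% # keep only typeA rows
  select(genotype, replicate)     # keep only two columns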
There’s also a plotting package called ggplot2 that adds a lot of functionality to the basic plots seen above. The syntax takes some getting used to but it’s extremely powerful and flexible. We can start by re-creating some of the above plots but using ggplot functions to get a feel for the syntax.
ggplot is best used on data in the data.frame form, so we will work with our combined df for the following figures. Let’s start by loading the ggplot2 library.
library(ggplot2)
baseDir<-getwd()
dataDir<-file.path(baseDir,"data")
metadata <- read.table(file.path(dataDir, 'mouse_exp_design.csv'), header=T, sep=",", row.names=1)
rpkm_data <- read.table(file.path(dataDir, 'counts.rpkm'), header=T, sep=",", row.names=1)
m <- match(row.names(metadata), colnames(rpkm_data))
data_ordered <- rpkm_data[,m]
samplemeans <- apply(data_ordered, 2, mean)
# Create a combined data frame
all(rownames(metadata) == names(samplemeans)) # sanity check for sample order
df <- cbind(metadata, samplemeans)
The ggplot() command creates a plot object. In it we assign our data frame to the data argument, and aes() creates what Hadley Wickham calls an aesthetic: a mapping of variables to various parts of the plot. Note that ggplot functions can be chained with + signs to add layers to the final plot. The next in the chain is geom_boxplot(). The geom functions specify the geometric objects that define the graph type; examples include geom_point(), geom_boxplot(), geom_line(), etc.
ggplot(data=df, aes(x= genotype, y=samplemeans))+geom_boxplot()
Unlike base R graphs, ggplot graphs are not affected by many of the options set in the par() function (e.g., adjusting relative sizes of axis labels using cex). They can be modified using the theme() function and by adding graphic parameters. Here, we will increase the size of the axis labels and the main title. We can also map the fill aesthetic to celltype.
ggplot(data=df, aes(x= genotype, y=samplemeans, fill=celltype)) +
geom_boxplot() +
ggtitle('Genotype differences in average gene expression') +
xlab('Genotype') +
ylab('Mean expression') +
theme(plot.title = element_text(size = rel(2.0)),
axis.title = element_text(size = rel(1.5)),
axis.text = element_text(size = rel(1.25)))
For the bar plot, we need to define the graph type to geom_bar. Since we don’t have an x variable, we need to specify the row names as our index so each sample is plotted on its own.
ggplot(data=df, aes(x=row.names(df), y=samplemeans, fill=genotype)) +
geom_bar(colour="black", stat="identity") +
ggtitle('Average expression for each sample') +
xlab('') +
ylab('Mean expression') +
theme(plot.title = element_text(size = rel(2.0)),
axis.title = element_text(size = rel(1.5)),
axis.text = element_text(size = rel(1.25)),
axis.text.x = element_text(angle=45, vjust=0.5, hjust=0.6, size = rel(1.25)))
We have only scratched the surface here. To learn more, see the ggplot reference site, and Winston Chang's excellent Cookbook for R site. Though slightly out of date, ggplot2: Elegant Graphics for Data Analysis is still the definitive book on this subject.
A figure that is often used in exploratory analysis of data is the PCA plot. PCA (principal components analysis) is a multivariate technique that allows us to summarize the systematic patterns of variation in the data. PCA takes the expression levels for all probes and transforms them into principal component space, reducing each sample to one point (as coordinates within that space). This allows us to separate samples according to expression variation, and identify potential outliers.
To plot a PCA plot we will be using ggplot, but first we will need to take the data matrix and generate the principal component vectors using prcomp. We will use pasilla RNA-Seq data as input to create a gene expression count matrix and a sample description object.
library(DESeq)
library(DESeq2)
library(pasilla)
data("pasillaGenes")
countData<-counts(pasillaGenes)
colData<-pData(pasillaGenes)[,c("condition","type")]
Let's first take a look at the gene expression count matrix countData.
head(countData)
treated1fb treated2fb treated3fb untreated1fb untreated2fb
FBgn0000003 0 0 1 0 0
FBgn0000008 78 46 43 47 89
FBgn0000014 2 0 0 0 0
FBgn0000015 1 0 1 0 1
FBgn0000017 3187 1672 1859 2445 4615
FBgn0000018 369 150 176 288 383
untreated3fb untreated4fb
FBgn0000003 0 0
FBgn0000008 53 27
FBgn0000014 1 0
FBgn0000015 1 2
FBgn0000017 2063 1711
FBgn0000018 135 174
Then let's take a look at the sample description file colData
colData
condition type
treated1fb treated single-read
treated2fb treated paired-end
treated3fb treated paired-end
untreated1fb untreated single-read
untreated2fb untreated single-read
untreated3fb untreated paired-end
untreated4fb untreated paired-end
One thing we notice is that in the count matrix, genes are rows while samples are columns. But prcomp() requires samples as rows and genes as columns; that is, we need a transposed version of what we currently have. R has a built-in function for transposing, denoted by t(). Let's take a look at how t() works.
> transposed_data<-t(countData)
> transposed_data[,1:5]
FBgn0000003 FBgn0000008 FBgn0000014 FBgn0000015 FBgn0000017
treated1fb 0 78 2 1 3187
treated2fb 0 46 0 0 1672
treated3fb 1 43 0 1 1859
untreated1fb 0 47 0 0 2445
untreated2fb 0 89 0 1 4615
untreated3fb 0 53 1 1 2063
untreated4fb 0 27 0 2 1711
Now samples are rows and genes are columns after the t() function is applied, and we are ready to run prcomp().
pca_results <- prcomp(t(countData))
Use the str function to take a quick peek at what is returned to us by the prcomp function. You can cross-reference with the help page (?prcomp) to see that it corresponds with what you would expect to be returned. There should be a list of five objects; the one we are interested in is x, which is a matrix of the principal component vectors. Let's save that data matrix by assigning it to a new variable.
pc_mat <- pca_results$x
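Optionally, a sketch (not required for the plot below): the sdev component gives each component's standard deviation, from which we can compute the percentage of variance explained, often shown on PCA axis labels.
pct_var <- round(100 * pca_results$sdev^2 / sum(pca_results$sdev^2), 1)
pct_var[1:2] # percent of total variance captured by PC1 and PC2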
We are going to take a look at the first two principal components by plotting them against each other. Since we will want to include information from our sample description object colData, we combine the first two columns of the PCA results with colData into a data frame for input to ggplot. The graphic type we are using is a scatter plot, denoted by geom_point(), and we have specified to color by condition.
df <- cbind(colData, pc_mat[,c('PC1', 'PC2')])
library(ggplot2)
ggplot(df, aes(PC1, PC2, color = condition)) + geom_point(size=3)
Another useful plot used to identify patterns in your data and potential outliers is to use heatmaps. A heatmap is a graphical representation of data where the individual values contained in a matrix are represented as colors. Heat maps are well-suited for visualizing large amounts of multi-dimensional data and can be used to identify clusters of rows or columns with similar values, as these are displayed as areas of similar color.
Our data matrix is quite large, and a heatmap of all genes would not be very informative without first selecting a subset of genes. Instead, we will generate a sample-to-sample correlation matrix by taking the correlation of count values for all pairwise combinations of samples. R has a built-in function to compute correlations, cor, which can take either two vectors or an entire matrix. We will give it our count matrix as input.
cor_mat <- cor(countData)
Check the dimensions of the matrix that is returned, and the range of values. Take a quick peek inside cor_mat.
cor_mat
treated1fb treated2fb treated3fb untreated1fb untreated2fb
treated1fb 1.0000000 0.9769470 0.9699387 0.9748449 0.9659681
treated2fb 0.9769470 1.0000000 0.9960757 0.9747916 0.9565659
treated3fb 0.9699387 0.9960757 1.0000000 0.9705672 0.9538853
untreated1fb 0.9748449 0.9747916 0.9705672 1.0000000 0.9758301
untreated2fb 0.9659681 0.9565659 0.9538853 0.9758301 1.0000000
untreated3fb 0.9547849 0.9738014 0.9746987 0.9772582 0.9783300
untreated4fb 0.9412804 0.9661953 0.9725255 0.9723502 0.9678800
untreated3fb untreated4fb
treated1fb 0.9547849 0.9412804
treated2fb 0.9738014 0.9661953
treated3fb 0.9746987 0.9725255
untreated1fb 0.9772582 0.9723502
untreated2fb 0.9783300 0.9678800
untreated3fb 1.0000000 0.9897768
untreated4fb 0.9897768 1.0000000
cor_mat will be the input to our heatmap function. To generate a heatmap we will use heatmap.2 which is part of the gplots package. Let's load the library:
library(gplots)
To plot the heatmap, we simply call the function and pass in our correlation matrix:
heatmap.2(cor_mat)
This generates a plot using the default settings. The default color gradient sets the highest value in the heat map to white, and the lowest value to a bright red, with a corresponding transition (or gradient) between these extremes. This color scheme can be changed by adding col= and specifying either a different built-in color palette by name, or creating your own palette.
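For instance, a sketch using colorRampPalette() from base grDevices to build a custom gradient (the palette choice here is arbitrary):
my_palette <- colorRampPalette(c("blue", "white", "red"))(75) # 75-step gradient
heatmap.2(cor_mat, col=my_palette)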
We notice that the sample names are not fully printed, so we can increase the margin size with margins=c(10,10). We can also remove the trace lines drawn through each cell with trace='none'.
heatmap.2(cor_mat, margins=c(10,10), trace='none')
It is often useful to combine heatmaps with hierarchical clustering, which is a way of arranging items in a hierarchy based on the distance or similarity between them. The result of a hierarchical clustering calculation is displayed in a heat map as a dendrogram, a tree-structure of the hierarchy. In our heatmap both rows and columns have been clustered, but since we have a symmetric matrix we can remove the column clustering (Colv=NULL and dendrogram="row"). We can also remove the trace by setting trace="none" and get rid of the legend with key=FALSE.
heatmap.2(cor_mat, margins=c(10,10), Colv=NULL, dendrogram="row", trace="none", key=FALSE)
As with any function in R, there are many ways to tweak the arguments to customize the heatmap. We encourage you to take time to read through the reference manual and explore other ways of generating heatmaps in R.