Simple Statistics in R

Simple statistics

Let’s take a closer look at our data. Each column represents a sample in our experiment, and each sample has ~38K values corresponding to the expression of different transcripts. Suppose we wanted to compute the average value for a sample, or its minimum and maximum values. The R base package provides many built-in functions for this, such as mean, median, min, max, and range. Try computing the mean for “sample1”:

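# Locate the data directory relative to the current working directory
baseDir <- getwd()
dataDir <- file.path(baseDir, "data")
# Read the experimental design and the RPKM expression table
metadata <- read.table(file.path(dataDir, 'mouse_exp_design.csv'), header=TRUE, sep=",", row.names=1)
rpkm_data <- read.table(file.path(dataDir, 'counts.rpkm'), header=TRUE, sep=",", row.names=1)
# Reorder the expression columns to match the row order of the metadata
m <- match(row.names(metadata), colnames(rpkm_data))
data_ordered <- rpkm_data[,m]
# Summary statistics for a single sample
mean(data_ordered[,'sample1'])
max(data_ordered[,'sample1'])
min(data_ordered[,'sample1'])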
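
The other built-in functions mentioned above work the same way. As a minimal sketch, assuming a small made-up vector (expr is a hypothetical stand-in for one sample column, not part of the course data):

```r
# Hypothetical expression values standing in for one sample column
expr <- c(0.0, 2.5, 10.2, 0.7, 31.4)

med <- median(expr)   # middle value once the vector is sorted
rng <- range(expr)    # minimum and maximum as a length-2 vector
summary(expr)         # min, quartiles, median, mean, and max together
```

Note that range returns both extremes in one call, so rng[1] is the minimum and rng[2] is the maximum.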

Missing values

By default, most R functions that summarize a vector (mean, max, min, and so on) return NA when the vector contains missing data. This makes sure that users know they have missing data and make a conscious decision about how to deal with it. For simple statistics like the mean, the easiest way to ignore NA (the missing data) is to pass na.rm=TRUE (rm stands for remove). In some cases it is more useful to remove the missing data from the vector itself. For this purpose, R provides the function na.omit, which returns a vector with the NAs removed. For some applications it is useful to keep all observations; for others, it is best to remove every observation that contains missing data. The function complete.cases() returns a logical vector indicating which rows have no missing values.
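
The options described above can be sketched on toy data. The vector x and the data frame df below are hypothetical and only illustrate the behavior; they are not part of the course dataset:

```r
# Hypothetical values; the NA marks a missing measurement
x <- c(4.2, NA, 7.1, 0.5)

m_default <- mean(x)                # NA, because one value is missing
m_observed <- mean(x, na.rm = TRUE) # mean of the three observed values
x_clean <- na.omit(x)               # copy of x with the NA dropped

# Hypothetical data frame with a missing value in the second row
df <- data.frame(sample1 = c(1.0, NA, 3.0),
                 sample2 = c(2.0, 5.0, 6.0))
keep <- complete.cases(df)   # TRUE for rows without any NA
df_complete <- df[keep, ]    # keep only the complete rows
```

Using na.rm=TRUE leaves the original vector untouched, whereas na.omit and complete.cases produce a reduced copy that later steps can use.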

The apply Function

To obtain mean values for all samples we could call mean on each column individually, but there is an easier way to go about it. The apply family of functions keeps you from having to write explicit loops (which in R are more verbose and can be slow if written carelessly) to perform an operation on every row or column of a matrix or data frame. The family includes several functions, each differing slightly in its inputs or outputs.

base::apply             Apply Functions Over Array Margins
base::by                Apply a Function to a Data Frame Split by Factors
base::eapply            Apply a Function Over Values in an Environment
base::lapply            Apply a Function over a List or Vector
base::mapply            Apply a Function to Multiple List or Vector Arguments
base::rapply            Recursively Apply a Function to a List
base::tapply            Apply a Function Over a Ragged Array
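
To see what these functions replace, here is a minimal sketch comparing an explicit loop with lapply. The list samples is a hypothetical example, not the course data:

```r
# Hypothetical list of expression vectors, one per sample
samples <- list(sample1 = c(1, 2, 3), sample2 = c(4, 5, 6, 7))

# Explicit loop: preallocate the result, then fill it in element by element
loop_means <- numeric(length(samples))
names(loop_means) <- names(samples)
for (i in seq_along(samples)) {
  loop_means[i] <- mean(samples[[i]])
}

# lapply returns a list; unlist() flattens it to a named vector
list_means <- unlist(lapply(samples, mean))
```

Both approaches give the same named vector of means; the lapply version simply states the operation in one line.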

We will be using apply in our examples today, but do take a moment on your own to explore the many options that are available. The apply function returns a vector, array, or list of values obtained by applying a function to the margins of an array or matrix. We know about vectors/arrays and functions, but what are these “margins”? Margins refer to the rows (denoted by 1), the columns (denoted by 2), or both (1:2). By “both”, we mean applying the function to each individual value. Let’s try this with the mean function on our data:

samplemeans <- apply(data_ordered, 2, mean) 
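
To make the margin argument concrete, here is a small sketch on a made-up matrix. The matrix mat and its row and column names are hypothetical and only illustrate the three margin settings:

```r
# Hypothetical 3-transcript by 2-sample matrix (filled column by column)
mat <- matrix(c(1, 2, 3,
                4, 5, 6),
              nrow = 3,
              dimnames = list(c("t1", "t2", "t3"), c("sample1", "sample2")))

row_means <- apply(mat, 1, mean)                   # one mean per row
col_means <- apply(mat, 2, mean)                   # one mean per column
doubled <- apply(mat, 1:2, function(v) v * 2)      # applied to each value

# Extra arguments after the function are passed on to it
mat_na <- mat
mat_na["t2", "sample1"] <- NA
col_means_na <- apply(mat_na, 2, mean, na.rm = TRUE)
```

With margin 2 the result has one entry per sample, which is the shape of samplemeans above. Passing na.rm=TRUE through apply is one way to handle columns that contain missing values.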

Massachusetts Institute of Technology