Simple Statistics in R
Simple statistics
Let’s take a closer look at our data. Each column represents a sample in our experiment, and each sample has ~38K values corresponding to the expression of different transcripts. Suppose we wanted to compute the average value for a sample, or its minimum and maximum values. The R base package provides many built-in functions for this, such as mean, median, min, max, and range. Try computing the mean for “sample1”.
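As a minimal sketch of these calls, the block below builds a tiny stand-in data frame. The object name expr_toy and its values are assumptions made for illustration; in the lesson you would use your own expression data and its "sample1" column instead.

```r
# Toy stand-in for the expression data (hypothetical name and values)
expr_toy <- data.frame(
  sample1 = c(2, 4, 6, 8),
  sample2 = c(1, 3, 5, 7)
)

# Summary statistics for a single column, selected with $
s1_mean   <- mean(expr_toy$sample1)    # average value
s1_median <- median(expr_toy$sample1)  # middle value
s1_min    <- min(expr_toy$sample1)     # smallest value
s1_max    <- max(expr_toy$sample1)     # largest value
s1_range  <- range(expr_toy$sample1)   # minimum and maximum together

print(s1_mean)
print(s1_range)
```

Note that range returns a vector of length two (the minimum followed by the maximum), not the difference between them.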
Missing values
By default, most R summary functions (mean, median, min, max, and so on) return NA when the vector they operate on contains missing data. This makes sure that users know they have missing data and make a conscious decision about how to deal with it. When computing simple statistics like the mean, the easiest way to ignore NA (the missing data) is to use na.rm=TRUE (rm stands for remove). In some cases, it might be useful to remove the missing data from the vector itself. For this purpose, R comes with the function na.omit, which generates a vector that has the NAs removed. For some applications it is useful to keep all observations; for others, it might be best to remove every observation that contains missing data. The function complete.cases() returns a logical vector indicating which rows have no missing values.
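The options described above can be sketched on a small example. The vector x and the data frame toy are hypothetical and exist only to show the behavior; they are not part of the lesson data.

```r
# Hypothetical vector with one missing value
x <- c(2, NA, 7, 9)

mean_with_na <- mean(x)                # NA, because x contains a missing value
mean_no_na   <- mean(x, na.rm = TRUE)  # ignores the NA when computing the mean

# na.omit() returns the vector with the NA dropped
x_clean <- na.omit(x)

# Hypothetical data frame with missing values in different rows
toy <- data.frame(a = c(1, NA, 3), b = c("x", "y", NA))

keep <- complete.cases(toy)   # TRUE only for rows without any NA
toy_complete <- toy[keep, ]   # keep just the complete rows

print(mean_no_na)
print(keep)
```

Using na.rm=TRUE leaves the original vector untouched, whereas na.omit and complete.cases give you a reduced copy of the data to carry into later steps.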
The apply Function
To obtain mean values for all samples we could call mean on each column individually, but there is an easier way to go about it. The apply family of functions keeps you from having to write explicit loops (which in R tend to be more verbose and easier to get wrong) when you want to perform some operation on every row or column of a matrix or data frame. The family includes several functions (for example lapply, sapply, and tapply), each differing slightly in its inputs or outputs.
We will be using apply in our examples today, but do take a moment on your own to explore the many options that are available. The apply function returns a vector, array, or list of values obtained by applying a function to the margins of an array or matrix. We know about vectors/arrays and functions, but what are these “margins”? Margins refer to either the rows (denoted by 1), the columns (denoted by 2), or both (1:2). By “both”, we mean applying the function to each individual value. Let’s try this with the mean function on our data:
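A minimal sketch of apply with the mean function, assuming a small numeric matrix in place of the real expression data. The name expr_toy, the gene labels, and the values are hypothetical.

```r
# Toy expression matrix: 4 transcripts (rows) by 2 samples (columns)
expr_toy <- matrix(
  c(2, 4, 6, 8,
    1, 3, 5, 7),
  nrow = 4,
  dimnames = list(paste0("gene", 1:4), c("sample1", "sample2"))
)

# MARGIN = 2: apply mean to each column, giving one value per sample
col_means <- apply(expr_toy, 2, mean)

# MARGIN = 1: apply mean to each row, giving one value per transcript
row_means <- apply(expr_toy, 1, mean)

# MARGIN = 1:2: apply mean to each individual cell; the mean of a single
# value is the value itself, so the result has the same shape as the input
cell_means <- apply(expr_toy, 1:2, mean)

print(col_means)
print(row_means)
```

The result keeps the row or column names of the input, so each mean stays labeled with its sample or transcript. For the specific case of row and column means, base R also provides rowMeans and colMeans, which give the same values.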