Variables in R

Variables are Buckets

Variables are buckets that hold information.

A variable is a symbolic name for (or reference to) information. Variables in computer programming are analogous to “buckets”, where information can be maintained and referenced. On the outside of the bucket is a name. When referring to the bucket, we use the name of the bucket, not the data stored in the bucket.

An example:
> x<-3

In the example above, we created a variable, or ‘bucket’, called x and put the value 3 inside it. Let’s create another variable called y and give it a value of 5. When assigning a value to a variable, R does not print anything to the console. You can force R to print the value by wrapping the assignment in parentheses or by typing the variable's name.

Other examples:
> y<-5
> x+y
[1] 8
> s<-x+y
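
As a minimal sketch of the two ways to display a value mentioned above, the snippet below wraps one assignment in parentheses and prints another variable by typing its name. The variable names follow the examples in this section.

```r
x <- 3          # assign silently; nothing is printed
(y <- 5)        # parentheses make R print the assigned value: [1] 5
s <- x + y      # store the sum in a new bucket called s
s               # typing the name prints the value: [1] 8
```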

Basic Data Types

Numeric

Decimal values are called numerics in R. It is the default computational data type. If we assign a decimal value to a variable x as follows, x will be of numeric type.

> x = 10.5 # assign a decimal value
> x # print the value of x
[1] 10.5
> class(x) # print the class name of x
[1] "numeric"

Furthermore, even if we assign an integer to a variable k, it is still being saved as a numeric value.

> k = 1
> k # print the value of k
[1] 1
> class(k) # print the class name of k
[1] "numeric"

The fact that k is not an integer can be further confirmed with the is.integer function. We discuss how to create an integer in the next section, on the integer type.

> is.integer(k) # is k an integer?
[1] FALSE
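
To see why k is not an integer, it can help to look at how R stores it internally. The sketch below uses only base R functions (typeof, is.numeric, is.integer) and the same variable k as above.

```r
k <- 1
typeof(k)       # "double": whole numbers are stored as double precision by default
is.numeric(k)   # TRUE
is.integer(k)   # FALSE
```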

Integer

In order to create an integer variable in R, we invoke the as.integer function. We can be assured that y is indeed an integer by applying the is.integer function.

> y = as.integer(3)
> y # print the value of y
[1] 3
> class(y) # print the class name of y
[1] "integer"
> is.integer(y) # is y an integer?
[1] TRUE
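
Besides as.integer, base R also accepts an L suffix on a numeric literal to create an integer directly. The following is a small sketch; the variable name n is only illustrative.

```r
n <- 3L            # the L suffix marks the literal as an integer
class(n)           # "integer"
is.integer(n)      # TRUE
class(n + 1L)      # "integer": integer plus integer stays integer
class(n + 1)       # "numeric": mixing with a double gives a double
```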

We can coerce a numeric value into an integer with the same as.integer function.

> as.integer(3.14) # coerce a numeric value
[1] 3

And we can parse a string for decimal values in much the same way.

> as.integer("5.27") # coerce a decimal string
[1] 5

On the other hand, trying to parse a non-decimal string produces NA and a warning.

> as.integer("Joe") # coerce a non-decimal string
[1] NA
Warning message:
NAs introduced by coercion

Often, it is useful to perform arithmetic on logical values. Like the C language, TRUE has the value 1, while FALSE has value 0.

> as.integer(TRUE) # the numeric value of TRUE
[1] 1
> as.integer(FALSE) # the numeric value of FALSE
[1] 0
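
Because TRUE counts as 1 and FALSE as 0, arithmetic on a logical vector gives counts and proportions. The sketch below uses a made-up vector called passed (hypothetical data, for illustration only).

```r
# hypothetical quality flags for five samples (illustrative data only)
passed <- c(TRUE, FALSE, TRUE, TRUE, FALSE)
sum(passed)     # 3: number of TRUE values
mean(passed)    # 0.6: proportion of TRUE values
```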

Character

A character object is used to represent string values in R. We convert objects into character values with the as.character() function:
> x = as.character(3.14) 
> x              # print the character string 
[1] "3.14" 
> class(x)       # print the class name of x 
[1] "character"

Two character values can be concatenated with the paste function.
> fname = "Joe"; lname ="Smith" 
> paste(fname, lname) 
[1] "Joe Smith"

However, it is often more convenient to create a readable string with the sprintf function, which uses C-style format specifiers (such as %s for a string and %d for an integer).
> sprintf("%s has %d dollars", "Sam", 100) 
[1] "Sam has 100 dollars"

To extract a substring, we apply the substr function. 
Here is an example showing how to extract the substring between the third and twelfth positions in a string.
> substr("Mary has a little lamb.", start=3, stop=12) 
[1] "ry has a l"

And to replace the first occurrence of the word "little" by another word "big" in the string, we apply the sub function.
> sub("little", "big", "Mary has a little lamb.") 
[1] "Mary has a big lamb."

More functions for string manipulation can be found in the R documentation.
> help("sub")
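
A few other base R string functions are often useful alongside the ones above. The sketch below is only a sample, not a complete list; it reuses the example sentence from this section and shows gsub, which replaces every occurrence rather than only the first.

```r
s <- "Mary has a little lamb."
nchar(s)                          # 23: number of characters
toupper(s)                        # "MARY HAS A LITTLE LAMB."
paste("Joe", "Smith", sep = "_")  # "Joe_Smith": choose the separator
paste0("chr", 1:3)                # "chr1" "chr2" "chr3": paste with no separator
gsub("a", "A", s)                 # "MAry hAs A little lAmb.": replace all matches
```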

Logical

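A logical value is either TRUE or FALSE (written in capital letters in R). Logical values are typically produced by comparisons between values.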
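
As a brief sketch, logical values usually arise from comparisons and can be combined with the operators ! (not), & (and), and | (or). The variables x and y are the same illustrative values used earlier on this page.

```r
x <- 3
y <- 5
x < y            # TRUE
x == y           # FALSE
!(x < y)         # FALSE: ! negates a logical value
x < y & y > 10   # FALSE: & is TRUE only when both sides are TRUE
x < y | y > 10   # TRUE: | is TRUE when at least one side is TRUE
class(x < y)     # "logical"
```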

Complex

A complex value in R represents a number with real and imaginary parts (e.g., 1+4i).
> z = 1 + 2i     # create a complex number 
> z              # print the value of z 
[1] 1+2i 
> class(z)       # print the class name of z 
[1] "complex"

The following returns NaN with a warning, because -1 is treated as a real number, not a complex value.
> sqrt(-1)       # square root of -1 
[1] NaN 
Warning message: 
In sqrt(-1) : NaNs produced

Instead, we have to use the complex value -1 + 0i.
> sqrt(-1+0i)    # square root of -1+0i 
[1] 0+1i

An alternative is to coerce -1 into a complex value.
> sqrt(as.complex(-1)) 
[1] 0+1i
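
The parts of a complex value can be extracted with base R helper functions. This short sketch uses the same z as above.

```r
z <- 1 + 2i
Re(z)      # 1: real part
Im(z)      # 2: imaginary part
Mod(z)     # 2.236068: modulus, sqrt(1^2 + 2^2)
Conj(z)    # 1-2i: complex conjugate
```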

Basic Data Structures

Vectors

  • A vector is a collection of values of a single type, such as all numbers or all characters; if types are mixed, R coerces them to one common type

  • Vectors are the most common and basic data structure in R, and they are the workhorse of R

  • The analogy is a bucket with different compartments; each compartment is called an element

  • Each element contains a single value

  • There is no practical limit to the number of elements

  • The vector is assigned to a single variable, because regardless of how many elements it contains it is still a single bucket

  • We create a vector named V with the c() function: V<-c(1,2,3)

> V<-c(1,2,3)
> V
[1] 1 2 3

Each element of the vector contains a single numeric value, and the three values are combined using c() (the combine function). All of the values are placed within the parentheses and separated by commas.

Create a vector called glengths with three elements, where each element corresponds to the genome size (in Mb) of a certain species.
glengths <- c(4.6, 3000, 50000)
glengths

A vector can also contain characters. 
Create another vector called species with three elements, where each element is the species corresponding to the genome size at the same position in the glengths vector.

species <- c("ecoli", "human", "corn")
species
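
Once vectors exist, a few common operations are worth trying. The following is a minimal sketch that reuses glengths and species from above; the conversion from Mb to bases is just an illustration of element-wise arithmetic.

```r
glengths <- c(4.6, 3000, 50000)
species <- c("ecoli", "human", "corn")

length(glengths)            # 3: number of elements
glengths[2]                 # 3000: indexing starts at 1
glengths * 1e6              # vectorized: converts every element from Mb to bases
names(glengths) <- species  # label each element with its species
glengths["human"]           # look up an element by name
class(c(1, "a"))            # "character": mixed types are coerced to one type
```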

Factors

  • Factors are used to represent categorical data

  • Factors can be ordered or unordered

  • Factors are an important class for statistical analysis and for plotting

  • Factors are stored as integers, and have labels associated with these unique integers

  • While factors look (and often behave) like character vectors, they are actually integers under the hood

  • You need to be careful when treating them like strings

  • To create a factor vector we use the factor() function:

> F<-factor(c("F","M","F"))
> levels(F)
[1] "F" "M"

expression <- factor(c("low", "high", "medium", "high", "low", "medium", "high"))
levels(expression)   # by default, levels are sorted alphabetically: "high" "low" "medium"

Sometimes the order of the levels does not matter; other times you might want to specify the order because it is meaningful (e.g., “low”, “medium”, “high”) or because it is required by a particular type of analysis. Additionally, specifying the order of the levels and marking the factor as ordered allows one to compare levels:

expression <- factor(expression, levels=c("low", "medium", "high"))
levels(expression)
min(expression) ## doesn't work

expression <- factor(expression, levels=c("low", "medium", "high"), ordered=TRUE)
levels(expression)
min(expression) ## works!

In R’s memory, these factors are represented by numbers (1, 2, 3). They are better than simple integer labels because factors are self-describing: "low", "medium", and "high" are more descriptive than 1, 2, 3. Which is low? You wouldn’t be able to tell with just integer data. Factors have this information built in. It is particularly helpful when there are many levels.
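
To make the "integers under the hood" point concrete, the sketch below inspects the integer codes of the expression factor and shows a common pitfall with factors whose labels look like numbers. The factor f is a hypothetical example added for illustration.

```r
expression <- factor(c("low", "high", "medium", "high", "low", "medium", "high"),
                     levels = c("low", "medium", "high"))
as.integer(expression)   # 1 3 2 3 1 2 3: the underlying integer codes
table(expression)        # counts per level: low 2, medium 2, high 3

# A common pitfall: a factor whose labels look like numbers
f <- factor(c("10", "20", "10"))
as.numeric(f)                 # 1 2 1: the integer codes, not the labels
as.numeric(as.character(f))   # 10 20 10: convert via character to get the labels
```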

Matrix

  • A matrix in R is a collection of vectors of the same length and identical data type

  • Vectors can be combined as columns in the matrix or by row

  • Matrices are usually numeric and are used in many computational algorithms

  • If the input data are not of an identical data type (numeric, character, etc.), matrix() does not stop with an error; R silently coerces every element to a common type (for example, mixing numbers and strings yields a character matrix), so it is worth checking the result with typeof() or is.numeric() before running downstream code

An example of creating a matrix M using the matrix() function:
> M = matrix( 
+   c(1, 2, 3, 4, 5, 6,7,8,9), # the data elements 
+   nrow=3,              # number of rows 
+   ncol=3,              # number of columns 
+   byrow = TRUE)        # fill matrix by rows 

> M
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9 
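
Elements, rows, and columns of a matrix are accessed with [row, column] indexing. The sketch below rebuilds M with 1:9 as shorthand for the nine values (an assumption made for brevity; 1:9 produces integers rather than the numeric values used above).

```r
M <- matrix(1:9, nrow = 3, ncol = 3, byrow = TRUE)
dim(M)       # 3 3: number of rows and columns
M[2, 3]      # 6: element in row 2, column 3
M[1, ]       # 1 2 3: the first row
M[, 2]       # 2 5 8: the second column
t(M)         # transpose: rows become columns
```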

DataFrames

  • A data.frame is a collection of vectors of identical lengths

  • Each vector can be of a different data type (e.g., characters, integers, factors)

  • data.frame is the de facto data structure for most tabular data, and each vector represents a column

  • data.frame is commonly used for statistics and plotting

An example of creating a data frame DF using the data.frame() function:
> v1<-c(1,2,3)
> f2<-factor(c("F","M","F"))
> v3<-c("a","b","c")
> DF<-data.frame(v1,f2,v3)
> DF
  v1 f2 v3
1  1  F  a
2  2  M  b
3  3  F  c
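
A data frame can be inspected and subset by column name or by row and column position. The following is a minimal sketch using the same DF as above.

```r
v1 <- c(1, 2, 3)
f2 <- factor(c("F", "M", "F"))
v3 <- c("a", "b", "c")
DF <- data.frame(v1, f2, v3)

dim(DF)               # 3 3: rows, columns
names(DF)             # column names, taken from the variable names
DF$v1                 # 1 2 3: extract a column by name
DF[2, ]               # the second row
DF[DF$f2 == "F", ]    # only the rows where f2 is "F"
str(DF)               # compact summary of each column's type
```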

Lists

  • A list is a collection of data structures

  • There is no particular restriction for the components to be of the same mode or type

  • For example, a list could consist of a numeric vector, a logical value, a matrix, a complex vector, a character array, a function, and so on

An example of creating a list L using the list() function:
> L<-list(gender=factor(c("F","M","F")), letter="a", number=c(1,2))
> L
$gender
[1] F M F
Levels: F M

$letter
[1] "a"

$number
[1] 1 2
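
Components of a list can be extracted by name or position. The sketch below, using the same L as above, also shows the difference between single and double brackets.

```r
L <- list(gender = factor(c("F", "M", "F")), letter = "a", number = c(1, 2))
names(L)            # "gender" "letter" "number"
L$number            # 1 2: extract a component by name
L[["letter"]]       # "a": double brackets return the component itself
class(L["letter"])  # "list": single brackets return a smaller list
length(L)           # 3: number of components
```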

Last updated 5 months ago
