Unix commands applied to bioinformatics
This demonstration shows how to perform some basic bioinformatics tasks using simple UNIX commands.
The following examples primarily use two files: arrayDat.txt and arrayAnnot.txt. You can copy them using the following commands on Luria:
# copy from stock directory to current directory
cp /net/ostrom/data/dropbox/arrayDat.txt .
cp /net/ostrom/data/dropbox/arrayAnnot.txt .
arrayDat.txt contains some microarray data, with rows corresponding to probe IDs and columns corresponding to four samples.
arrayAnnot.txt contains the annotation for the probe IDs, specifically the gene description (i.e. geneTitle) and the gene symbol (i.e. geneSymbol).


Note that in general there is a "many-to-one" correspondence between probe IDs and gene symbols; that is, more than one probe ID can be associated with the same gene symbol.
head
You can use this command to view the content of the top part of a file.
The switch "-n" allows you to specify how many lines to display, starting from the first. If you specify a negative number N, then all the lines except the bottom N will be displayed.
# print first 5 lines
head -n 5 arrayDat.txt
head -n 5 arrayAnnot.txt
# print everything except the last line
head -n -1 arrayDat.txt
head -n -1 arrayAnnot.txt
cut
This command extracts portions of lines of a file.
The switch "-f" allows you to specify which field (by default column) to extract.
Multiple fields can be selected either by specifying a range (e.g. 1-3) or by listing the fields (e.g. 1,2,3).
# get 1st column (i.e. probeID)
cut -f 1 arrayDat.txt
# get 3rd column (i.e. geneSymbol)
cut -f 3 arrayAnnot.txt
# get the first three columns
cut -f 1-3 arrayDat.txt
# get the 1st and 3rd columns and redirect the output into a file (mapping probeID to geneSymbol)
cut -f 1,3 arrayAnnot.txt > probe2gene.txt
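cut assumes tab-separated fields by default; the switch "-d" lets you specify a different delimiter (this is used again later with join). A quick sketch on an inline comma-separated record, not one of the course files:

```shell
# extract the 2nd comma-separated field from an inline record
echo "TP53,tumor protein p53,7157" | cut -d ',' -f 2
# prints: tumor protein p53
```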
paste
This command merges the lines of files and puts them side by side.
# put each column in a separate tmp file
cut -f 1 arrayAnnot.txt > tmp1
cut -f 2 arrayAnnot.txt > tmp2
cut -f 3 arrayAnnot.txt > tmp3
# merge lines in a new column order
paste tmp3 tmp1 tmp2 > arrayAnnotOrdered.txt
sort
This command sorts the lines of a text file.
By default the sort is in lexicographic order according to the first field.
In lexicographic order, letters follow numbers: 0-9, then aA-zZ.
Note that numbers are treated as strings, so 10 comes before 2: there is no positional weighting, and the character 1 sorts before 2.
Other sorting criteria are available:
the switch -k lets you specify which field to use as key for sorting.
the switch -n specifies a numeric sort.
the switch -r specifies a sort in reverse order (either lexicographic or numeric).
the switch -R specifies a random sort.
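The lexicographic-versus-numeric behavior described above can be verified on a small inline example (piped input rather than the course files):

```shell
# lexicographic sort: "10" comes before "2"
printf '2\n10\n1\n' | sort
# numeric sort: values are compared by magnitude
printf '2\n10\n1\n' | sort -n
```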
# sort by 1st field (probeID)
sort probe2gene.txt
# sort by the 2nd field (geneSymbol)
sort -k 2 probe2gene.txt
# sort by geneSymbol in reverse order
sort -r -k 2 probe2gene.txt
# sort by 2nd field (default lexicographic sort)
sort -k 2 arrayDat.txt
# sort by 2nd field (numeric sort)
sort -n -k 2 arrayDat.txt
wc
This command counts the lines, words and bytes in a text file.
It is often used in conjunction with other filtering operations to count the number of items that pass the filter.
# print number of lines, words and bytes
wc arrayDat.txt
# print number of lines only
wc -l arrayDat.txt
# print the number of fastq files in the current directory
ls *.fastq | wc -l
uniq
This command reports or filters repeated lines in a file.
It compares adjacent lines and reports those lines that are unique. Because repeated lines in the input will not be detected if they are not adjacent, it might be necessary to sort the input file before invoking this command.
The switch "-f" specifies how many fields, counting from the first, to skip when performing the comparisons.
The switch "-c" specifies to return a count of how many occurrences for each distinct line.
The switch "-d" specifies to print only duplicated lines.
The switch "-u" specifies to print only unique lines (i.e. occurring only once).
The switch "-i" specifies a case insensitive comparison.
# print all distinct rows then count
uniq arrayAnnot.txt | wc -l
# count distinct rows without considering the 1st field (probeID)
uniq -f 1 arrayAnnot.txt | wc -l
# report distinct gene symbols and count number of occurrences
uniq -f 1 -c arrayAnnot.txt
# report repeated gene symbols
uniq -f 1 -d arrayAnnot.txt
# report list of unique gene symbols
uniq -f 1 -u arrayAnnot.txt
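Because only adjacent repeats are detected, sorting first usually matters. A self-contained sketch on an inline list of gene symbols (not the course files):

```shell
# scattered duplicates are NOT collapsed without sorting
printf 'TP53\nBRCA1\nTP53\n' | uniq | wc -l
# sorting makes the duplicates adjacent, so uniq collapses them
printf 'TP53\nBRCA1\nTP53\n' | sort | uniq | wc -l
```

The first pipeline counts 3 lines, the second 2.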
grep
Acronym for "global regular expression print".
This command prints the lines of the input file that match the given pattern(s).
# print lines that match “chr” (lines with the words "chromosome" and/or "cytochrome" are matched).
grep chr arrayAnnot.txt
The switch "-w" specifies that the match has to occur with a whole word, not just part of a word. Thus, in the example input file, no line matches the pattern "chr" as a whole word.
# print lines that match “chr” as a whole word
grep -w chr arrayAnnot.txt
The switch "-i" can be used to specify a case-insensitive match (by default this command is case sensitive).
In the example below, when the switch "-i" is used for the pattern "SS", lines containing the words "class", "repressor", "associated", "expressed" as well as the gene symbol "PRSS33" are matched.
# print lines that match “SS” (case sensitive)
grep SS arrayAnnot.txt
# print lines that match “SS” (case insensitive, that is "SS" or “ss” or “Ss” or “sS”)
grep -i SS arrayAnnot.txt
The switch "-c" returns the number of lines where the specified pattern is found.
# print how many lines match the pattern "orf"
grep -c orf arrayAnnot.txt
The switch "-v" performs a "reverse-matching" and prints only the lines that do not match the pattern.
# print lines that do NOT match the pattern "orf"
grep -v orf arrayAnnot.txt
The switch "-n" prints out the matching line with its number.
# precede the matching line with its line number
grep -n orf arrayAnnot.txt
When multiple patterns must be matched, you can use the switch "-f" and pass as argument a file containing the patterns to match (one per line).
# make a list with 5 gene symbols
cut -f 3 arrayAnnot.txt | head -n 5 > tmp
# use list "tmp" to match lines in arrayAnnot.txt
grep -f tmp arrayAnnot.txt
join
This command provides relational database functionality.
It merges the lines of two files based on the common value of a given field, also known as "key".
By default the first field of each file is considered as key.
To perform a relational join, where you merge all the lines that have identical values in the specified field of the two files, the files themselves must be sorted on the key field. A warning is given when they are not, unless the option "--nocheck-order" is specified.
# join two files (use default field as key)
join arrayDat.txt arrayAnnot.txt
join arrayAnnot.txt arrayDat.txt
The switch "-t" lets you specify what character to use as the field separator in the input and output files. To specify a tab character, use the string $'\t'.
Note that by default the separator is a blank space. It is generally a good idea to specify a tab as the output separator, so that a field consisting of multiple words separated by blank spaces remains bundled in a single field, and you can use the cut command to access specific fields.
In the examples below, a join with default settings yields an output where the fields are separated by blanks. If we try to extract the description (6th field), even when we specify that the delimiter is a blank space rather than a tab, we cannot extract the entire description.
# join two files (default separator is blank space)
join arrayDat.txt arrayAnnot.txt
# note how joined fields are separated by blank spaces and the default settings of cut do not yield the desired output
join arrayDat.txt arrayAnnot.txt | cut -f 6
# note how specifying a blank space as the delimiter for cut does not yield the desired output because the description field consists of multiple words
join arrayDat.txt arrayAnnot.txt | cut -d ' ' -f 6
# join two files (separator is a tab)
join -t $'\t' arrayDat.txt arrayAnnot.txt
join -t $'\t' arrayDat.txt arrayAnnot.txt | cut -f 6
The key field of each file can be specified using the switch "-1" and "-2" for the first and second file respectively. Fields are numbered starting from 1.
In the example below, we join the files probe2gene.txt and arrayAnnot.txt based on the gene symbol (2nd and 3rd field respectively). Before performing the join, the files must be sorted according to their keys.
# sort each file by its key (geneSymbol is the 2nd and 3rd field respectively), then join
sort -k 2 probe2gene.txt > tmp1
sort -k 3 -t $'\t' arrayAnnot.txt > tmp2
join -1 2 -2 3 -t $'\t' tmp1 tmp2
The switch "-o" lets you specify which fields in each file should be output.
Note that in the example below the output has repeated rows because the join was performed on non-unique keys (the gene symbol).
# specify which field to output for each file
join -1 2 -o '1.1 2.2 2.3 2.4 2.5' arrayAnnotOrdered.txt arrayDat.txt
join -1 2 -2 3 -t $'\t' -o '1.1 1.2 2.2' tmp1 tmp2
split
This command splits a file into a series of smaller files.
The content of the input file is split into lexically ordered files named with the prefix "x" (xaa, xab, and so on), unless another prefix is provided as an argument to the command.
The switch "-l" lets you specify how many lines to include in each file.
# split the content into separate files of 50 lines each
split -l 50 arrayDat.txt
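To use a custom prefix, pass it after the file name. A self-contained sketch on a throwaway file (the names tmpLines and part_ are arbitrary choices for this demo):

```shell
# create a small test file, then split it into 2-line chunks
printf 'a\nb\nc\nd\ne\n' > tmpLines
# produces part_aa (a,b), part_ab (c,d), part_ac (e)
split -l 2 tmpLines part_
ls part_*
```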
sed
This command invokes a stream editor that modifies the input as specified by a list of commands.
The pattern replacement syntax is s/pattern/replacement/; a trailing g flag makes the substitution global, replacing every occurrence on each line rather than just the first.
# substitute the word “chromosome” with the string “chr”
sed s/chromosome/chr/g arrayAnnot.txt
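sed also works on piped input, which makes it easy to test a substitution without touching a file:

```shell
# replace every occurrence of "chromosome" on the line
echo "chromosome 7 and chromosome 17" | sed 's/chromosome/chr/g'
# prints: chr 7 and chr 17
```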
awk
awk is more than just a command: it is a text-processing language.
The input file is treated as sequence of records. By default, each line is a record and is broken into fields.
An awk command is a sequence of condition-action statements, where:
condition is typically an expression
action is a series of commands
awk 'condition {action}' inputfile
Typically, awk reads the input one line at a time; when a line matches the provided condition, the associated action is executed.
There are several built-in variables:
field variables: $1 refers to the first field, $2 refers to the second field, etc.
record variable: $0 refers to the entire record (by default the entire line).
NR represents the current count of records (by default the line number).
NF represents the total number of fields in a record (by default the last column).
FILENAME represents the name of the current input file.
FS represents the field separator character used to parse fields of the input record. By default, FS is the white space (both space and tab). FS can be reassigned to another character to change the field separator.
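For example, FS can be reassigned in a BEGIN block to parse comma-separated input (inline demo, not one of the course files):

```shell
# set FS to a comma, then print the 2nd field
echo "TP53,tumor protein p53,7157" | awk 'BEGIN {FS=","} {print $2}'
# prints: tumor protein p53
```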
# print every line of the input file (missing condition)
awk '{print $0}' arrayDat.txt
# print every line of the input file (default argument for print is $0)
awk '{print}' arrayDat.txt
# print 1st field only
awk '{print $1}' arrayDat.txt
# rearrange the fields, separated by a blank space
awk '{print $1, $3, $2}' arrayDat.txt
# rearrange the fields, separated by a tab
awk '{print $1 "\t" $3 "\t" $2}' arrayDat.txt
# print the number of fields for each record
awk '{print NF}' arrayDat.txt
# print the last field of each record
awk '{print $NF}' arrayDat.txt
# print a string
awk 'BEGIN {print "Hello world!"}'
#print the result of an arithmetic operation
awk 'BEGIN {print 2+3}'
awk 'BEGIN {print "2+3=" 2+3}'
# print first 5 lines (default action statement is to print)
awk 'NR < 6' arrayDat.txt
# print first line (i.e. column headers)
awk 'NR == 1' arrayDat.txt
# print first 5 records, excluding headers (NR ==1)
awk 'NR > 1 && NR < 7' arrayDat.txt
# print the total number of records
awk 'END {print NR}' arrayDat.txt
By using loops and conditional statements, awk allows you to perform quite sophisticated actions. For example, we can compute the sum or the mean of the intensity values for each probe across all samples:
Compute the sum of intensity values for each gene: for all records except the column headers, we initialize the variable "s" to zero, then loop through all the values in a given record, summing cumulatively. When all the values have been considered, we print the sum.
Compute the mean of intensity values for each gene: we compute the sum of the intensity values as before, but we divide the sum by the number of values considered before printing the result. Note that the number of values considered is the number of fields in each record minus one (the probe ID).
# print sum of values for each gene
awk 'NR > 1 {s=0; for (i=2; i<=NF; i++) s=s+$i; print s}' arrayDat.txt
# print mean of values for each gene
awk 'NR > 1 {s=0; n=NF-1; for (i=2; i<=NF; i++) s=s+$i; s=s/n; print s}' arrayDat.txt
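The same loop pattern extends to other per-probe statistics. For example, a sketch computing the maximum intensity per probe, demonstrated here on inline data shaped like arrayDat.txt (a header line, then a probe ID followed by numeric values):

```shell
# print the maximum value for each record, skipping the header line
printf 'probeID\ts1\ts2\ts3\np1\t5\t9\t2\np2\t7\t3\t8\n' | \
  awk 'NR > 1 {m=$2; for (i=3; i<=NF; i++) if ($i > m) m=$i; print m}'
# prints: 9 then 8
```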
Because the condition can be a pattern matching expression, awk can easily emulate grep.
# print the lines that match the string "orf"
awk '/orf/' arrayAnnot.txt
# print the probe IDs whose annotation contain the string "orf"
awk '/orf/ {print $1}' arrayAnnot.txt
# print number of lines matching “orf” (emulates grep -c)
awk '/orf/ {n++}; END {print n+0}' arrayAnnot.txt
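Negating the pattern with "!" emulates grep -v as well (inline demo with made-up gene symbols):

```shell
# print lines that do NOT match "orf" (emulates grep -v orf)
printf 'C9orf72\nTP53\nBRCA1\n' | awk '!/orf/'
# prints: TP53 then BRCA1
```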