Unix commands applied to bioinformatics

This demonstration shows how to perform some basic bioinformatics tasks using simple UNIX commands.

The following examples use primarily two files: arrayDat.txt and arrayAnnot.txt. You can copy them using the following command on Luria:

# copy from stock directory to current directory 
cp /net/ostrom/data/dropbox/arrayDat.txt .
cp /net/ostrom/data/dropbox/arrayAnnot.txt .
  • arrayDat.txt contains some microarray data, with rows corresponding to probe IDs and columns corresponding to four samples.

  • arrayAnnot.txt contains the annotation for the probe IDs, specifically the gene description (i.e. geneTitle) and the gene symbol (i.e. geneSymbol).

Note that in general there is a "many-to-one" correspondence between probe IDs and gene symbols; that is, more than one probe ID may be associated with the same gene symbol.

head

  • This command displays the top part of a file.

  • The switch "-n" lets you specify how many lines to display, starting from the first. With GNU head, specifying a negative number N displays all the lines except the last N.

# print first 5 lines
head -n 5 arrayDat.txt 
head -n 5 arrayAnnot.txt
# print everything except the last line
head -n -1 arrayDat.txt 
head -n -1 arrayAnnot.txt

cut

  • This command extracts portions of lines of a file.

  • The switch "-f" allows you to specify which field to extract; by default, fields are tab-delimited columns.

  • Multiple fields can be selected either by specifying a range (e.g. 1-3) or by listing the fields (e.g. 1,2,3).

# get 1st column (i.e. probeID)
cut -f 1 arrayDat.txt
# get 3rd column (i.e. geneSymbol)
cut -f 3 arrayAnnot.txt
# get the first three columns
cut -f 1-3 arrayDat.txt
# get the 1st and 3rd columns and redirect the output into a file (mapping probeID to geneSymbol)
cut -f 1,3 arrayAnnot.txt > probe2gene.txt

paste

  • This command merges the lines of files and puts them side by side.

# put each column in a separate tmp file 
cut -f 1 arrayAnnot.txt > tmp1 
cut -f 2 arrayAnnot.txt > tmp2 
cut -f 3 arrayAnnot.txt > tmp3 
# merge lines in a new column order 
paste tmp3 tmp1 tmp2 > arrayAnnotOrdered.txt
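
The behavior of paste can be seen on a tiny, self-contained sketch (the two input files below are invented for illustration):

```shell
# create two small single-column files
printf 'p1\np2\n' > ids.txt
printf 'TP53\nBRCA1\n' > syms.txt
# paste glues the files line by line, separating columns with a tab by default
paste ids.txt syms.txt
```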

sort

  • This command sorts the lines of a text file.

  • By default the sort is in lexicographic order, using the entire line as the key.

    • In lexicographic order, letters follow numbers: 0-9, then aA-zZ.

    • Note that numbers are treated as strings and compared character by character, so 10 comes before 2: there is no positional weighting, and the character 1 sorts before 2.

  • Other sorting criteria are available:

    • the switch -k lets you specify which field to use as key for sorting.

    • the switch -n specifies a numeric sort.

    • the switch -r specifies a sort in reverse order (either lexicographic or numeric).

    • the switch -R specifies a random sort.

# sort by 1st field (probeID)
sort probe2gene.txt
# sort by the 2nd field (geneSymbol)
sort -k 2 probe2gene.txt
# sort by geneSymbol in reverse order
sort -r -k 2 probe2gene.txt
# sort by 2nd field (default lexicographic sort)
sort -k 2 arrayDat.txt
# sort by 2nd field (numeric sort)
sort -n -k 2 arrayDat.txt
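
The difference between lexicographic and numeric order can be checked with a tiny, self-contained sketch (input invented here):

```shell
# three numbers, one per line
printf '10\n2\n1\n' > n.txt
# lexicographic: compared character by character, so 10 comes before 2
sort n.txt
# numeric: compared by value
sort -n n.txt
```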

wc

  • This command counts the lines, words and bytes in a text file.

  • It is often used in conjunction with other filtering operations to count the number of items that pass the filter.

# print number of lines, words and bytes 
wc arrayDat.txt
# print number of lines only
wc -l arrayDat.txt
# print the number of fastq files in the current directory
ls *.fastq | wc -l

uniq

  • This command reports or filters repeated lines in a file.

  • It compares adjacent lines and reports those lines that are unique. Because repeated lines in the input will not be detected if they are not adjacent, it might be necessary to sort the input file before invoking this command.

  • The switch "-f" specifies how many fields (starting from the first one) to skip when performing the comparisons.

  • The switch "-c" prefixes each distinct line with a count of how many times it occurred.

  • The switch "-d" specifies to print only duplicated lines (i.e. occurring more than once).

  • The switch "-u" specifies to print only unique lines (i.e. occurring only once).

  • The switch "-i" specifies a case insensitive comparison.

# print all distinct rows, then count 
uniq arrayAnnot.txt | wc -l
# count distinct rows without considering the 1st field (probeID)
uniq -f 1 arrayAnnot.txt | wc -l
# report distinct gene symbols and count number of occurrences
uniq -f 1 -c arrayAnnot.txt
# report repeated gene symbols
uniq -f 1 -d arrayAnnot.txt
# report list of unique gene symbols 
uniq -f 1 -u arrayAnnot.txt
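
Because uniq only collapses adjacent duplicates, the effect of sorting first can be demonstrated with a small, self-contained sketch (gene symbols invented here):

```shell
# a file with repeated, non-adjacent gene symbols
printf 'TP53\nBRCA1\nTP53\nBRCA1\nTP53\n' > genes.txt
# no duplicates are adjacent, so every line is reported
uniq -c genes.txt
# sorting makes identical lines adjacent, so they are collapsed and counted
sort genes.txt | uniq -c
```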

grep

  • The name comes from the ed command g/re/p, i.e. "globally search for a regular expression and print".

  • This command prints the lines of the input file that match the given pattern(s).

# print lines that match “chr” (lines with the words "chromosome" and/or  "cytochrome" are matched). 
grep chr arrayAnnot.txt
  • The switch "-w" specifies that the match has to occur with a whole word, not just part of a word. Thus, in the example input file, no line matches the pattern "chr" as whole word.

# print lines that match “chr” as a whole word 
grep -w chr arrayAnnot.txt
  • The switch "-i" can be used to specify a case-insensitive match (by default this command is case sensitive).

  • In the example below, when the switch "-i" is used for the pattern "SS", lines containing the words "class", "repressor", "associated", "expressed" as well as the gene symbol "PRSS33" are matched.

# print lines that match “SS” (case sensitive) 
grep SS arrayAnnot.txt
# print lines that match “SS”  (case insensitive, that is "SS" or “ss” or “Ss” or “sS”) 
grep -i SS arrayAnnot.txt
  • The switch "-c" returns the number of lines where the specified pattern is found.

# print how many lines match the pattern "orf" 
grep -c orf arrayAnnot.txt
  • The switch "-v" performs a "reverse-matching" and prints only the lines that do not match the pattern.

# print lines that do NOT match the pattern "orf" 
grep -v orf arrayAnnot.txt
  • The switch "-n" prints out the matching line with its number.

# precede the matching line with its line number
grep -n orf arrayAnnot.txt
  • When multiple patterns must be matched, you can use the switch "-f" and pass as argument a file containing the patterns to match (one per line).

# make a list with 5 gene symbols
cut -f 3 arrayAnnot.txt|head -n 5 > tmp 
# use list "tmp" to match lines in arrayAnnot.txt
grep -f tmp arrayAnnot.txt

join

  • This command provides relational database functionality.

  • It merges the lines of two files based on the common value of a given field, also known as "key".

  • By default the first field of each file is considered as key.

  • To perform a relational join, where you merge all the lines that have identical values in the specified field of the two files, the files themselves must be sorted on the key field. A warning is given when they are not, unless the option "--nocheck-order" is specified.

# join two files (use default field as key)
join arrayDat.txt arrayAnnot.txt 
join arrayAnnot.txt arrayDat.txt
  • The switch "-t" lets you specify what character to use as field separator in the input and output files. To specify a tab character, use the string $'\t'.

    • Note that by default the separator is a blank space. It is generally a good idea to specify a tab as separator, so that if a field consists of multiple words separated by blank spaces, those words remain bundled in the same field and you can use the cut command to access specific fields.

    • In the examples below, a join with default settings yields an output whose fields are separated by blanks. If we then try to extract the description (6th field), we cannot recover the entire description, even if we tell cut that the delimiter is a blank space rather than a tab.

# join two files (default separator is blank space) 
join arrayDat.txt arrayAnnot.txt 
# note how joined fields are separated by blank spaces and the default settings of cut do not yield the desired output 
join arrayDat.txt arrayAnnot.txt | cut -f 6
# note how specifying a blank space as delimiter for cut does not yield the desired output because the description field consists of multiple words 
join arrayDat.txt arrayAnnot.txt | cut -d ' ' -f 6
# join two files (separator is a tab)
join -t $'\t' arrayDat.txt arrayAnnot.txt 
join -t $'\t' arrayDat.txt arrayAnnot.txt | cut -f 6
  • The key field of each file can be specified using the switch "-1" and "-2" for the first and second file respectively. Fields are numbered starting from 1.

    • In the example below, we join the files probe2gene.txt and arrayAnnot.txt based on the gene symbol (2nd and 3rd field, respectively). Before performing the join, each file must be sorted on its key.

# sort each file by its key, the gene symbol (2nd and 3rd field, respectively)
sort -k 2 probe2gene.txt > tmp1
sort -k 3 -t $'\t' arrayAnnot.txt > tmp2
join -1 2 -2 3  -t $'\t' tmp1 tmp2 
  • The switch "-o" lets you specify which fields in each file should be output.

    • Note that in the example below the output has repeated rows because the join was performed on non-unique keys (the gene symbol).

# specify which field to output for each file 
join -1 2 -o '1.1 2.2 2.3 2.4 2.5' arrayAnnotOrdered.txt arrayDat.txt
join -1 2 -2 3  -t $'\t'  -o '1.1 1.2 2.2' tmp1 tmp2
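
The combination of sorting on the key and the switches -1/-2 and -t can be tried on a small, self-contained sketch (file contents invented here):

```shell
# probeID -> geneSymbol, and geneSymbol -> description
printf 'p1\tTP53\np2\tBRCA1\n' > map.txt
printf 'BRCA1\tBRCA1 DNA repair associated\nTP53\ttumor protein p53\n' > desc.txt
# sort each file on its key field
sort -t $'\t' -k 2 map.txt  > map.sorted
sort -t $'\t' -k 1 desc.txt > desc.sorted
# join on field 2 of the first file and field 1 of the second
join -t $'\t' -1 2 -2 1 map.sorted desc.sorted
```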

split

  • This command splits a file into a series of smaller files.

  • The content of the input file is split into lexically ordered files (xaa, xab, ...) named with the prefix "x", unless another prefix is provided as argument to the command.

  • The switch "-l" lets you specify how many lines to include in each file.

# split the content into separate files of 50 lines each 
split -l 50 arrayDat.txt
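
A self-contained sketch (input generated on the fly) shows the default naming scheme and that the pieces concatenate back to the original:

```shell
# a 10-line input file
seq 1 10 > nums.txt
# split into chunks of 4 lines: xaa and xab get 4 lines, xac the remaining 2
split -l 4 nums.txt
wc -l xaa xab xac
# concatenating the pieces restores the original
cat xaa xab xac
```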

sed

  • This command invokes a stream editor that modifies the input as specified by a list of commands.

  • The pattern replacement syntax is: s/pattern/replacement/ where appending the flag g replaces every occurrence on each line, rather than just the first.

# substitute the word "chromosome" with the string "chr" 
sed 's/chromosome/chr/g' arrayAnnot.txt
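
Quoting the sed expression is good practice; a small, self-contained sketch (sample line invented here) also shows the effect of the trailing g flag:

```shell
# a line with two occurrences of the word "chromosome"
echo 'chromosome 7 and chromosome 17' > note.txt
# without g, only the first occurrence on each line is replaced
sed 's/chromosome/chr/' note.txt
# with g, every occurrence is replaced
sed 's/chromosome/chr/g' note.txt
```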

awk

  • awk is more than just a command: it is a text processing language.

  • The input file is treated as a sequence of records. By default, each line is a record and is broken into fields.

  • An awk command is a sequence of condition-action statements, where:

    • condition is typically an expression

    • action is a series of commands

awk 'condition {action}' inputfile
  • Typically, awk reads the input one line at a time; when a line matches the provided condition, the associated action is executed.

  • There are several built-in variables:

    • field variables: $1 refers to the first field, $2 refers to the second field, etc.

    • record variable: $0 refers to the entire record (by default the entire line).

    • NR represents the current count of records (by default the line number).

    • NF represents the total number of fields in a record; hence $NF refers to the last field (by default the last column).

    • FILENAME represents the name of the current input file.

    • FS represents the field separator character used to parse fields of the input record. By default, FS is the white space (both space and tab). FS can be reassigned to another character to change the field separator.

# print every line of the input file (missing condition) 
awk '{print $0}' arrayDat.txt 
# print every line of the input file (default argument for print is $0)
awk '{print}' arrayDat.txt 
# print 1st field only
awk '{print $1}' arrayDat.txt 
# rearrange the fields, separated by a blank space 
awk '{print $1, $3, $2}' arrayDat.txt
# rearrange the fields, separated by a tab (string concatenation avoids the extra spaces that the comma, i.e. the output field separator, would add) 
awk '{print $1 "\t" $3 "\t" $2}' arrayDat.txt
# print the number of fields for each record 
awk '{print NF}' arrayDat.txt
# print the last field of each record 
awk '{print $NF}' arrayDat.txt
# print a string 
awk 'BEGIN {print "Hello world!"}'
#print the result of an arithmetic operation 
awk 'BEGIN {print 2+3}'
awk 'BEGIN {print "2+3=" 2+3}'
# print first 5 lines (default action statement is to print)
awk 'NR < 6' arrayDat.txt
# print first line (i.e. column headers)
awk 'NR == 1' arrayDat.txt
# print records 2 through 6, i.e. the first 5 data lines, excluding the header (NR == 1) 
awk 'NR > 1 && NR < 7' arrayDat.txt
# print the total number of records 
awk 'END {print NR}' arrayDat.txt
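
The FS variable mentioned above can be reassigned either with the switch -F on the command line or inside a BEGIN block; a self-contained sketch (comma-separated data invented here):

```shell
# a small comma-separated file
printf 'p1,TP53,9.2\np2,BRCA1,7.5\n' > csv.txt
# -F sets the field separator before the first record is read
awk -F ',' '{print $2}' csv.txt
# equivalent: reassign FS in a BEGIN block
awk 'BEGIN {FS=","} {print $2}' csv.txt
```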
  • By using loops and conditional statements, awk lets you perform quite sophisticated actions. For example, we can compute the sum or the mean of the intensity values for each probe across all samples:

    • Compute the sum of intensity values for each gene: for every record except the column headers, we initialize the variable "s" to zero, then loop through all the values in the record, summing cumulatively. When all the values have been considered, we print the sum.

    • Compute the mean of intensity values for each gene: we compute the sum of the intensity values as before, but we divide the sum by the number of values considered before printing the result. Note that the number of values considered is the number of fields in each record minus one (the probe ID).

# print sum of values for each gene 
awk 'NR > 1 {s=0; for (i=2; i<=NF; i++) s=s+$i; print s}' arrayDat.txt
# print mean of values for each gene 
awk 'NR > 1 {s=0; n=NF-1; for (i=2; i<=NF; i++) s=s+$i; s=s/n; print s}' arrayDat.txt
  • Because the condition can be a pattern matching expression, awk can easily emulate grep.

# print the lines that match the string "orf" 
awk '/orf/' arrayAnnot.txt
# print the probe IDs whose annotation contain the string "orf" 
awk '/orf/ {print $1}' arrayAnnot.txt
# print number of lines matching “orf” (emulates grep -c) 
awk '/orf/ {n++}; END {print n+0}' arrayAnnot.txt
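
Other grep switches can be emulated too; a self-contained sketch (annotation lines invented here) mimics "-v" with the negated condition !/pattern/ and "-i" by lower-casing the record before matching:

```shell
# two invented annotation lines
printf 'C9orf72 open reading frame 72\nTP53 tumor protein\n' > ann.txt
# emulate grep -v: print lines that do NOT match "orf"
awk '!/orf/' ann.txt
# emulate grep -i: lower-case the whole record before matching
awk 'tolower($0) ~ /c9orf/' ann.txt
```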
