Unix commands applied to bioinformatics
This demonstration shows how to perform some basic bioinformatics tasks using simple UNIX commands.
The following examples use primarily two files: arrayDat.txt
and arrayAnnot.txt
. You can copy them using the following command on Luria:
arrayDat.txt contains some microarray data, with rows corresponding to probe IDs and columns corresponding to four samples.
arrayAnnot.txt contains the annotation for the probe IDs, specifically the gene description (i.e. geneTitle) and the gene symbol (i.e. geneSymbol).
Note that in general there is a "many-to-one" correspondence between probe IDs and gene symbols; that is, more than one probe ID can be associated with the same gene symbol.
You can use this command to view the content of the top part of a file.
The switch "-n" allows you to specify how many lines to display, starting from the first. With GNU head, if you specify a negative number N, all the lines except the last N will be displayed.
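A minimal sketch of the head command described above; the miniature file below is a hypothetical stand-in for arrayDat.txt:

```shell
# Hypothetical miniature version of arrayDat.txt (tab-separated)
printf 'probeID\tS1\tS2\tS3\tS4\n201_at\t5.1\t6.2\t5.8\t6.0\n202_at\t7.3\t7.1\t7.4\t7.2\n203_at\t4.9\t5.0\t5.2\t5.1\n' > arrayDat_mini.txt

# Display the header line plus the first data row
head -n 2 arrayDat_mini.txt
```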
This command extracts portions of lines of a file.
The switch "-f" allows you to specify which field (by default column) to extract.
Multiple fields can be selected either by specifying a range (e.g. 1-3) or by listing the fields (e.g. 1,2,3).
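A minimal sketch of the cut command, again on a hypothetical miniature data file (cut splits on tabs by default):

```shell
# Hypothetical miniature version of arrayDat.txt (tab-separated)
printf 'probeID\tS1\tS2\tS3\tS4\n201_at\t5.1\t6.2\t5.8\t6.0\n' > arrayDat_mini.txt

# Extract the probe ID and the first two samples, as a range of fields
cut -f 1-3 arrayDat_mini.txt
# Equivalent, listing the fields explicitly
cut -f 1,2,3 arrayDat_mini.txt
```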
This command merges the lines of files and puts them side by side.
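A minimal sketch of paste with two hypothetical one-column files; corresponding lines are joined side by side, separated by a tab by default:

```shell
printf '201_at\n202_at\n' > ids.txt
printf 'ACTB\nGAPDH\n' > symbols.txt
# Merge line 1 of ids.txt with line 1 of symbols.txt, and so on
paste ids.txt symbols.txt
```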
This command sorts the lines of a text file.
By default the sort is in lexicographic order according to the first field.
In a lexicographic order, letters follow numbers: 0-9, then aA-zZ.
Note that numbers are treated as strings, so 10 comes before 2: the comparison is character by character, with no positional weighting, and the character 1 sorts before 2.
Other sorting criteria are available:
the switch -k lets you specify which field to use as key for sorting.
the switch -n specifies a numeric sort.
the switch -r specifies a sort in reverse order (either lexicographic or numeric).
the switch -R specifies a random sort.
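The lexicographic-versus-numeric distinction above can be sketched with a small hypothetical file of numbers:

```shell
printf '10\n2\n1\n' > nums.txt

sort nums.txt      # lexicographic: 1, 10, 2
sort -n nums.txt   # numeric: 1, 2, 10
sort -nr nums.txt  # numeric, reversed: 10, 2, 1
```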
This command counts the lines, words and bytes in a text file.
It is often used in conjunction with other filtering operations to count the number of items that pass the filter.
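A minimal sketch of wc, including the common "count what passes a filter" idiom, on a hypothetical probe list:

```shell
# Hypothetical file: one probe ID per line
printf '201_at\n202_at\n203_at\n' > probes.txt

# Lines, words, and bytes
wc probes.txt
# Count only the lines that pass a filter (here, probes starting with "20")
grep '^20' probes.txt | wc -l
```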
This command reports or filters repeated lines in a file.
It compares adjacent lines and reports each distinct line once. Because repeated lines in the input will not be detected unless they are adjacent, it is usually necessary to sort the input file before invoking this command.
The switch "-f" specifies how many fields (starting from the first one), to skip when performing the comparisons.
The switch "-c" specifies to return a count of how many occurrences for each distinct line.
The switch "-d" specifies to print only duplicated lines.
The switch "-u" specifies to print only unique lines (i.e. occurring only once).
The switch "-i" specifies a case insensitive comparison.
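A minimal sketch of uniq on a small hypothetical gene list, sorting first so that duplicates become adjacent:

```shell
printf 'ACTB\nGAPDH\nACTB\nACTB\n' > genes.txt

sort genes.txt | uniq -c   # count occurrences of each distinct line
sort genes.txt | uniq -d   # only the duplicated lines
sort genes.txt | uniq -u   # only the lines occurring once
```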
Acronym for "global regular expression print" (from the ed command g/re/p).
This command prints the lines of the input file that match the given pattern(s).
The switch "-w" specifies that the match has to occur with a whole word, not just part of a word. Thus, in the example input file, no line matches the pattern "chr" as whole word.
The switch "-i" can be used to specify a case-insensitive match (by default this command is case sensitive).
In the example below, when the switch "-i" is used for the pattern "SS", lines containing the words "class", "repressor", "associated", "expressed" as well as the gene symbol "PRSS33" are matched.
The switch "-c" returns the number of lines where the specified pattern is found.
The switch "-v" performs a "reverse-matching" and prints only the lines that do not match the pattern.
The switch "-n" prints out the matching line with its number.
When multiple patterns must be matched, you can use the switch "-f" and pass as argument a file containing the patterns to match (one per line).
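The switches above can be sketched on a small hypothetical annotation file (the lines below only imitate the kind of matches described for arrayAnnot.txt):

```shell
printf 'PRSS33 serine protease\nACTB actin related\nclass II gene\n' > annot_mini.txt

grep -i 'ss' annot_mini.txt    # case-insensitive: matches PRSS33 and class
grep -w 'actin' annot_mini.txt # whole-word match only
grep -c 'actin' annot_mini.txt # count of matching lines
grep -v 'gene' annot_mini.txt  # lines that do NOT contain "gene"
grep -n 'actin' annot_mini.txt # matching lines with their line numbers
```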
This command provides relational database functionality.
It merges the lines of two files based on the common value of a given field, also known as "key".
By default the first field of each file is considered as key.
To perform a relational join, where you merge all the lines that have identical values in the specified field of two files, the files themselves must be sorted on the key field. Unless the option "--nocheck-order" is given, a warning is issued when the files are not sorted on the key.
The switch "-t" lets you specify what character to use as field separator in the input and output files. To specify a tab character, use the string $'\t'.
Note that by default the separator is a blank space. It is generally a good idea to specify a tab as separator so that if a field consists of multiple words separated by blank spaces, these words remain bundled in the same field and you can use the cut command to access specific fields.
In the examples below, a join with default settings yields an output where the fields are separated by blanks. If we try to extract the description (6th field), even if we specify that the delimiter is a blank space rather than a tab, we cannot extract the entire description.
The key field of each file can be specified using the switches "-1" and "-2" for the first and second file respectively. Fields are numbered starting from 1.
In the example below, we join the files probe2gene.txt and arrayDat.txt based on the gene symbol (2nd and 3rd field respectively). Before performing the join, the files must be sorted according to their keys.
The switch "-o" lets you specify which fields in each file should be output.
Note that in the example below the output has repeated rows because the join was performed on non-unique keys (the gene symbol).
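A minimal sketch of join; the two hypothetical files below (gene2title.txt and expr.txt) stand in for the sorted, key-sharing files discussed above, with the key in field 1 of each:

```shell
# Both files are already sorted on the key (gene symbol, field 1)
printf 'ACTB\tbeta actin\nGAPDH\tdehydrogenase\n' > gene2title.txt
printf 'ACTB\t5.1\nGAPDH\t7.3\n' > expr.txt

# Join on the common key, using a tab as input/output separator
join -t $'\t' gene2title.txt expr.txt
# -o selects the output fields: the key (0), field 2 of file 1, field 2 of file 2
join -t $'\t' -o 0,1.2,2.2 gene2title.txt expr.txt
```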
This command splits a file into a series of smaller files.
The content of the input file is split into lexically ordered files named with the prefix "x", unless another prefix is provided as argument to the command.
The switch "-l" lets you specify how many lines to include in each file.
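A minimal sketch of split on a small hypothetical file, using a custom prefix instead of the default "x":

```shell
printf 'line1\nline2\nline3\nline4\nline5\n' > big.txt

# Split into files of at most 2 lines each, named part_aa, part_ab, part_ac
split -l 2 big.txt part_
wc -l part_aa
cat part_ac
```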
This command invokes a stream editor that modifies the input as specified by a list of commands.
The pattern replacement syntax is: s/pattern/replacement/
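A minimal sketch of the s/pattern/replacement/ syntax on a hypothetical two-line file (by default, sed replaces the first occurrence on each line):

```shell
printf 'chr1\tgeneA\nchr2\tgeneB\n' > loci.txt

# Replace "chr" with "chromosome" on every line
sed 's/chr/chromosome/' loci.txt
```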
awk is more than just a command, it is a text processing language.
The input file is treated as a sequence of records. By default, each line is a record and is broken into fields.
An awk command is a sequence of condition-action statements, where:
condition is typically an expression
action is a series of commands
Typically, awk reads the input one line at a time; when a line matches the provided condition, the associated action is executed.
There are several built-in variables:
field variables: $1 refers to the first field, $2 refers to the second field, etc.
record variable: $0 refers to the entire record (by default the entire line).
NR represents the current count of records (by default the line number).
NF represents the total number of fields in a record (by default the last column).
FILENAME represents the name of the current input file.
FS represents the field separator character used to parse fields of the input record. By default, FS is the white space (both space and tab). FS can be reassigned to another character to change the field separator.
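The built-in variables above can be sketched on a hypothetical miniature data file; each output line shows the record number, the number of fields, and the first field:

```shell
printf 'probeID\tS1\tS2\n201_at\t5.0\t6.0\n' > arrayDat_mini.txt

# NR = record (line) number, NF = number of fields, $1 = first field
awk '{ print NR, NF, $1 }' arrayDat_mini.txt
```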
By using loops and conditional statements, awk allows you to perform quite sophisticated actions. For example, we can compute the sum or the mean of the intensity values for each probe across all samples:
Compute the sum of intensity values for each gene: for all records except the column headers, we initialize the variable "s" to zero, then we loop through all the values in a given record and sum them cumulatively. When all the values have been considered, we print the sum.
Compute the mean of intensity values for each gene: we compute the sum of the intensity values as before, but we divide the sum by the number of values considered before printing the result. Note that the number of values considered is the number of fields in each record minus one (the probe ID).
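The two computations above can be sketched as follows, on a hypothetical miniature version of arrayDat.txt:

```shell
printf 'probeID\tS1\tS2\tS3\tS4\n201_at\t1\t2\t3\t4\n202_at\t2\t2\t2\t2\n' > arrayDat_mini.txt

# Sum per probe: skip the header (NR > 1), loop over the sample fields
awk 'NR > 1 { s = 0; for (i = 2; i <= NF; i++) s += $i; print $1, s }' arrayDat_mini.txt
# Mean per probe: divide by the number of values, NF - 1 (excluding the probe ID)
awk 'NR > 1 { s = 0; for (i = 2; i <= NF; i++) s += $i; print $1, s / (NF - 1) }' arrayDat_mini.txt
```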
Because the condition can be a pattern matching expression, awk can easily emulate grep.
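A minimal sketch of awk emulating grep: a /pattern/ condition with no action uses the default action, which is to print the record:

```shell
printf 'PRSS33 protease\nACTB actin\n' > annot_mini.txt

# Equivalent to: grep 'actin' annot_mini.txt
awk '/actin/' annot_mini.txt
```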