1 of 31

Tasks - Bioinformatics Methods

UCSC Genome Bioinformatics

Interacting with the UCSC Genome Browser

Open the UCSC genome informatics page.
The horizontal blue menu provides access to the tools.
Particularly useful links are Genomes, Genome Browser, Tools and My Data.
Clicking on the Genomes link will lead to a page that allows selection of organism and genome assembly.

Position or search term queries - examples of how to query the database.
Assembly details - General assembly information
Summary Statistics - Detailed statistics of the active assembly

Proceed to the genome assembly by clicking "Go"
Search for the gene Protocadherin 10 by pasting that name into the "position or search term" box and clicking "submit".

The Genome Browser

The genome browser is the central visualization tool of the UCSC Genome Informatics Group.
Most everything is hyperlinked and adjustable

Restore the visualization to default by clicking the "default tracks" button under the image. Note that settings from previous sessions with the browser may be retained by your browser.

Positional and scale controls are located above and below the browser image (red boxes)

Zoom and movement buttons
Coordinate window
Coordinate bar in view - click re-centers and zooms in 3x
Dragging across the upper track is a zoom to selection function.
Move start and end functions
Clicking in window and dragging re-positions the view
Clicking the chromosome ideogram re-positions the window to selected part of the chromosome

Data organization and track controls (blue boxes)

Like the search results, the information displayed in the browser window is organized into groups and tracks of data. The kinds of tracks being displayed and the ways they are presented can be controlled.
NOTE: This human assembly is mature, heavily annotated and extensively analyzed. The genomes of other organisms with less mature assemblies have less associated information and therefore fewer tracks.
The buttons "default tracks" and "hide all" will do as their name suggests.
The "configure" button displays a list of the available data with display options.
The bottom of the page consists of a series of pull-down menus. Each is associated with a type of data and the options control display mode.
IMPORTANT: Useful details about each track can be obtained by clicking the hyperlinked track name.

Navigate to chr4:133,101,179-133,262,320 using the position window.
In the "Genes and Gene Predictions Tracks" section, use the pull-down menu to set the "GENCODEv24" track to full.
Click the "refresh" button. Toggle the GENCODEv24 track between dense and full using the pull-down menu and refresh button.
For reference, the effects of these different options are described HERE

Note:After changes are made to the pull-down menus, refresh is required for them to take effect.

The visual cues that you see in the "Full" display of known genes are:
- Arrowheads indicate the strand of the gene.
- Thin lines are introns, medium lines are non-coding portions of the mRNA, thick lines are coding region.
Zoom in to display the region around the 5' end of the PCDH10 gene using coordinates chr4:133,148,483-133,162,403

Obtaining DNA sequence from the UCSC Database

Obtaining genomic data from the UCSC database.

Open the UCSC browser and navigate to the "Dec. 2013 (GRCh38/hg38)" human assembly.
Enter the position chr4:133,148,410-133,195,705 and click submit.
The appropriate view is here.
Click the View-->DNA link to enter the "Get DNA in Window" page
Sequence range and formatting adjustments are available, accepting the defaults and clicking "Get DNA" will retrieve the genomic sequence that was present in the browser view.
Go back to the "Get DNA in Window" page
Click the “extended case/color options” button and configure the display so that known gene exons are in bold, red and upper case letters.
- Set default case set to "Lower" with the radio button.
- Check the boxes for "Toggle Case" and "Bold" and enter 255 in the "Red" box on the GENCODE v24 line.
- Click submit and view the newly formatted results.

Obtaining genomic data from the UCSC database using table browser queries

The UCSC table browser allows execution of complex database queries without knowledge of database query languages.
Open the and activate the "Dec. 2013 (GRCh38/hg38) human assembly.
Enter the position chr4:133149103-133195252 and click submit.
Click the "Tools-->Table Browser" link in the upper blue bar to open a Table Browser page.

The Table Browser

Note the organism and assembly controls (red box)
Note the data type controls (blue box) that are associated with different kinds of data (browser tracks).
Note the region selection option
Set group to "Genes and Gene Prediction Tracks", track to "GENCODEv24", and table to "knownGene".
Restrict region to chr4:133,149,846-133,195,995
locate the output format menu and select "BED - browser extensible data" and click the "summary/statistics" button at the bottom of the page.
- This provides details about your query results and can be very useful for data summaries and as a guide to ensure your query is behaving as desired.
The count of genes returned should be 3. Note that only 2 are visible in the browser. This is do to the default "Basic" display of the new known genes set ENCODEv24. For more information see .
Return to the Table Browser and click the "get output" button to view the bed format version of the data.
Switch the output format to "GTF" and compare the results
Switch the output to "sequence" and select mRNA on the next page, save the resulting output file to your desktop as kg.fa, this file will be used to demonstrate searching. In case of problems, the file is available
Note that all different kinds of data can be accessed and downloaded in ways similar to this although the available output formats may be different.
Zoom to the 5' end of PCDH10 (chr4:133,148,848-133,150,064) and experiment with Comparative Genomics --> Cons 20 Mammals track. Note the base level conservation scores from output format "data points".
Request output in "custom track" format and view the data with "Full" setting in the browser.
Note the high scoring region just upstream of the 5'end that also overlaps with a DNase I hypersensitivity peak and an area conserved in fish and lamprey.
Turn on the CpG islands track in the Regulation group and observe the overlap with conserves region and DHS. Open the Encode Regulation track and turn on all the other available tracks by setting them to "pack". In addition to those other features, there is also a valley in the histone marks. Together these data are consistent with this region having a role in the regulation of expression of PCDH10.

Filtering table browser queries

Open the UCSC browser
Click the "Genomes" link and select group "Mammal", genome "Human" and assembly "Dec. 2013 (GRCh38/hg38)
Enter the position chr4:133,143,371-133,155,540 and click submit.

The eye icon allows you to view the result, the pencil is for editing attributes and the X allows deletion.
Select the FASTA manipulation toolbox on the left. Then select the "Compute sequence length" function. When finished, view the results.

Performing a BLAT search

BLAT stands for "Blast Like Alignment Tool". It is very fast and is designed to find nearly identical matches.
A description of blat is located at the bottom of this .

The query is displayed with identities in blue and splice junctions in light blue.
The genomic sequence colored the same way.
The sequences in pairwise alignment.

Return to the BLAT results page and select the "browser" hyperlink. Note the new "Blat sequence" track being displayed in the "Mapping and Sequencing" group, this track has the same visualization controls as other similar tracks.
Click the "hide all" button, use the pull-down menus to set ENCODEv24 and Blat sequence to "full" and refresh the view. Zoom out so the entire footprint of the PCDH10 is visible. The coordinates are: chr4:133,147,918-133,213,548

Creating Custom Tracks

Data Formats

A wide range of data formats can be converted to custom tracks than can then be visualized and compared to other genomic data.
Custom tracks are managed through the "My Data --> Custom Tracks" window. Select this page.
Available formats for custom tracks are described .

Creating a custom track

Paste the following block of text into the Paste URLs or Data window and click the Submit button. Then select the "view in Genome Browser" option on the next page.

Note the presence of the group "Custom Tracks" with the track "ClassExample". Toggle the display mode to full and click Refresh. Zoom out do that all 4 Examples are visible.
Select "Tools-->Table Browser". The region will usually correspond to the part of the genome visualized in the browser. Select the “summary/statistics” function to display some statistics about the information in the custom track.

UCSC Intersection Queries

Intersection queries are one of the most useful functions implemented by UCSC. They allow users to characterize the positional relationships of 2 different overlapping tracks.

The demonstration of this tool will build on the observations made during examination of the 5' end of the PCDH10 in . The region upstream of the 5' end of the gene overlapped with a CpG island. CpG islands correlate with promoters because they tend to be unmethylated. In non-promoter sequence, the C of a CG dinucleotide is usually methylated and thus subject to spontaneous deaminiation. See the details page of the track for more information. Note: all UCSC tracks are well documented and together contain extensive information about a wide range of genomics methods and data types.

summary/statistics and the browser can be used to confirm the custom track contains the desired data.
This processing creates one interval for each GENCODE transcript and many intervals will be similar because many start positions are close to one another. This redundancy can be collapsed by sending the results to galaxy and using the bedtools functions "Sort BED files" followed by "Merge BED files". There are fine controls on this processing and it will not be demonstrated here but the output is available .
Create a custom track using the URL to that merged data and count the results.
Compare the contents of the custom tracks in the browser. The region is chr2:109,000,000-119,000,000
Use an intersection query to identify regions in the merged upstream sequences that overlap with a CpG island.
- Open Tools --> Table Browser and set the "group" and "track" options so that they specify the merged bed custom track.
- Select the "create" button next to "intersection" button and select intersection with CpG islands according to the following image:

Note the ability to control the degree of intersection (some, none, partial).
After intersection is in place, use summary/statistics to count the results and view some in the genome browser.

Note about intersection direction

The results of the intersection query can be used in the same ways as any other kind of data.

Viewing cross-species alignments

Cross-species Alignments and Conservation

Query the March 2006 human genome browser gateway with the gene symbol FOXP2. View the gene on chromosome 7. Adjust the tracks displayed so that base position, chromosome band and refseq are set to full.
Switch the Zebrafish, Dog and Mouse Net Alignments to Full. What portions of the gene are most highly conserved across all of these species? Compare the dog and mouse alignments. Based on what you see which species is more closely related to human? Why? Which species diverged from human more recently?

Examine the pan-vertebrate conservation upstream from the transcription start site (TSS). How far from the TSS is the conserved block? Can you think of a reason why it is conserved? Can you find any tracks that could support this idea?
Examine the details of the 7X Reg Potential track. Reconfigure the view so that it includes an indicator line at 0.05.

Galaxy

The Penn State Galaxy team has implemented a powerful set of genomics tools that interfaces nicely with the UCSC Genome informatics tools and also extends UCSC functionality.

The Galaxy team provides a wide range of excellent training videos viewable from there home page.

They also maintain a highly informative wiki.

Intro to Galaxy

Background

Galaxy is a framework for integrating computational bioinformatics tools. It provides a common interface that wraps diverse tools, as well as an environment for interactive analysis that transparently tracks the details of analysis steps, a workflow system for convenient reuse, data management, sharing, publishing, and more.

The web-based Galaxy site is freely available to the public. User registration at the site is optional, but can be beneficial to increase its efficiency for the user. The registration is free and allows the user to create custom workflows and save analysis results.

Open in a new window or type in your url bar. In the main page of Galaxy, you will find some introductory material, including useful screencasts that show how to use the various tools to perform many types of analyses, such as text manipulation, format conversion, NGS quality control, NGS Mapping, interval data manipulations, relational database operations, and many more.

General Information

There are three panes in the Galaxy desktop:
- Tool pane (left).
- Current Analysis pane (centre).
- Job/History pane (right).

To analyze the data in Galaxy, you perform an ordered sequence of tasks, each consisting in applying a specific tool (chosen from the left pane) to a specific set of files (chosen in the central pane or among the results from previous steps as reported in the History pane).
The History pane shows the sequence of steps/tasks you are running. In particular, the color of the frame associated with a particular job has the following meaning:
- gray : the job is in the queue.
- yellow: the job is running.
- green: the job is completed.
- red: the job produced an error.

Galaxy NGS Illumina QC

This tutorial shows how to perform basic QC on Illumina data, such as basic quality statistics, quality score boxplots, trimming and masking.

1. Load a fastq file and annotate the uploaded data

On the Tool Panel, click on Get Data → Upload File.
- Browse to your home folder and select the file Galaxy_GM12878.fastqillumina'.
- Set the format to "fastqillumina".
Click the "Execute" button.
Once the job is completed, click on the pencil icon to edit the attributes.
- Set the Database field to "Human Mar 2006 (NCBI36/hg18).
- Click the "Save" button.

2. Convert to Sanger FASTQ format

On the Tool Panel, click on NGS Toolbox Beta → NGS: QC and Manipulation → FASTQ Groomer.
This tool converts between various FASTQ quality formats.
- By default, the quality format output is Sanger FASTQ.
- Sanger FASTQ is the required format for downstream analyses in Galaxy.
- Set the input type to "Illumina 1.3+"
- Click the "Execute" button.
Once the job is completed, click on the pencil icon and edit the name of the job as "GM12878 fastqsanger".

3. Compute Quality Statistics

On the Tool Panel, click on NGS Toolbox Beta → Fastx-Toolkit → Compute Quality Statistics.
This tool compute quality statistics such as min, max, mean, median, Q1, Q3, IQR, etc. of quality scores.
- Select Data 2 as input library.
- Click the "Execute" button.

4. Draw Quality Score Boxplot

On the Tool Panel, click on NGS Toolbox Beta → Fastx-Toolkit → Draw Quality Score Boxplot.
This tool creates a box graph of the quality scores in the library.
- Select Data 3 as statistic report file.
- Click the "Execute" button.
Once the job is completed, click on the eye icon to see the boxplot figure. You can expand and collapse the figure by clicking on the arrows placed on the sides of the main panel.

5. Trim Sequence Reads to length of 60 bases

On the Tool Panel, click on NGS Toolbox Beta → FASTQ Trimmer (by column).
This tool trims the end of the reads.
- Select Data 2 as input FASTQ file.
- Set the offset from 5' end to 16.
- With these parameters, all reads are trimmed after the 60th base.
- Click the "Execute" button.
Once the job is completed, click on the eye icon to edit the attributes of the resulting data and change the name to "GM12878 Trimmed fastqsanger".

6. Apply Quality Masker to bases with quality lower than 20

On the Tool Panel, click on NGS Toolbox Beta → FASTQ Masker.
This tool allows masking base characters in FASTQ files according to quality score value and comparison method.
- Select Data 2 as input file to mask.
- Set the criterion as "less than" and the threshold to 20.
- Click the "Execute" button.
- With these parameters, any base with quality less than 20 will be masked with a symbol "N".

7. Apply FASTQ Quality Trimmer

On the Tool Panel, click on NGS Toolbox Beta → FASTQ Quality Trimmer (by sliding window).
This tool allows trimming the ends of reads based upon the aggregate value of quality scores found within a sliding window. Several criteria can be used to determine the aggregate value (min, max, sum, mean) within the sliding window.
- Select Data 2 as input file.
- Select "Trim 5' end" only from the scroll down menu.
- Set window size to 3.
- Select "max score" as aggregate action.
- Select ">= 2" as criterion for trimming
- Click the "Execute" button.

8. Create a Data Subset by selecting the first 2,500 sequence reads

On the Tool Panel, click on Text Manipulation → Select first lines.
This tool select the first N lines of the input dataset.
- Set to 10,000 the number of lines to select.
- Select data 5 as input.
- Click the "Execute" button.
- With these parameters, the first 10,000 lines of the input FASTQ file are selected, corresponding to the first 2,500 sequence reads.

Galaxy NGS Illumina SE Mapping

This tutorial shows how to perform basic QC on Illumina data, such as basic quality statistics, quality score boxplots, trimming and masking.

Single-End Mapping of Illumina Data

1. Load fastq file and annotate the uploaded data

On the Tool Panel, click on Get Data → Upload File.
This tools allows you to upload a data from a file, url or a textbox.
- Select Galaxy_GM12878_trimmed.fastq as input file.
- Select "fastqsanger" as file format.
- Select "hg18" as genome.
Click the "Execute" button.

2. Map to reference genome (hg18) using Bowtie

On the Tool Panel, click on NGS Toolbox Beta → NGS Mapping → Map with Bowtie for Illmina.
This tools allows you to run the aligner Bowtie. The output is a SAM files with all the read alignments.
- Select hg canonical as reference genome.
- Leave other settings as default.
Click the "Execute" button.

3. Filter SAM file on bitwise flag values

On the Tool Panel, click on NGS Toolbox Beta → NGS SAMtools → Filter SAM on bitwise flag values.
This tools
- Select data 2 as input dataset.
- Add new flag with type set to "the read is unmapped" and the value set to "No".
- Add new flag with type set to "read strand" and the value set to "Yes".
Click the "Execute" button.
With these parameters, the resulting output consists of those reads that are properly mapped and are on the reverse strand.

4. Find how many reads map to each chromosome

On the Tool Panel, click on Join, Subtract and Group → Group →
This tools
- Select data 2 (bowtie output) as input dataset.
- Group by column 3 (reference name, i.e. chromosome name)
- Add new operation: count on column 1 .
Click the "Execute" button.
With these parameters, the resulting output consists of those reads that are properly mapped and are on the reverse strand.
Edit the attributes of this module by clicking on the eye icon: rename as "read distribution by chromosome".

5. Find the most represented chromosome

On the Tool Panel, click on Filter and Sort → Sort data →
This tools
- Select the column representing the key to sort
- Select "Numerical sort" in "descending order" as options.
Click the "Execute" button.
With these parameters, the results show that chr19 is the most represented chromosome.

6. Convert SAM to BAM

On the Tool Panel, click on NGS Toolbox Beta→ NGS Samtools &rarr SAM to BAM converter.
This tools converts SAM-formatted files into BAM-formatted files.
- Select the SAM file to convert.
- Click the "Execute" button.

7. Compute general statistics via Flagstat operation

On the Tool Panel, click on NGS Toolbox Beta→ NGS Samtools &rarr flagstat.
This tools provides a simple summary based on BAM-format.
- Select data 6 (BAM file) as input.

Galaxy SNP Interval Data

his example is inspired by a screencast published on the Galaxy website. It consists in combining exon information and SNP information, both represented as interval data.

1. Load exon data from UCSC tables

On the Tool Panel, click on Get Data → UCSC Main Table Browser.
This tools allows you to upload data from the UCSC Tables.
- Use the following parameters:
  - Group: Variation and Repeats
  - Track: SNP(130)
  - Region: chr19:1-100,000
  - Output format: BED
  - Send output to Galaxy: checked
- Click "Get Output" button.
  - Select the radiobox so that one BED record is created for the whole gene.
  - Click the button "send query to Galaxy"
With these parameters, this tool creates a BED file containing all the SNPs for the first 1M bases of chromosome 19.
Once the job is completed, change the name of the dataset to "SNPs chr19".

2. Load SNP data from UCSC tables

On the Tool Panel, click on Get Data → UCSC Main Table Browser.
This tools allows you to upload data from the UCSC Tables.
- Use the following parameters:
  - Group: Genes and Gene Prediction
  - Track: UCSC Genes
  - Region: chr19:1-100,000
  - Output format: BED
  - Send output to Galaxy: checked
- Click "Get Output" button.
  - Select the radiobox so that one BED record is created per coding exon.
  - Click the button "send query to Galaxy"
With these parameters, this tool creates a BED file containing all the exon information for the first 1M bases of chromosome 19.
Once the job is completed, change the name of the dataset to "exons chr19".

3. Join exon and SNP information

On the Tool Panel, click on Operate on Genomic Intervals → Join the intervals.
This tools allows you to join the information from two interval files based on the coordinates of each feature.
- Select the SNP chr19 and the exons chr19 files as input.
- Click on the "Execute" button.
Because some exons might contain multiple SNPs, the resulting output might have size greater than the two input files.

4. Find the number of SNPs per exon

On the Tool Panel, click on Join, Subtract and Group → Group.
This tools groups the information based on a given column and performs the aggregation operations on the other columns.
- Select data 3 as input.
- Select column 4 (exon ID).
- Add operation to count c4.
- Click on the "Execute" button.

5. Find the exon with the most SNPs

On the Tool Panel, click on Filter and Sort → Sort.
This tools ...
- Select data 4 as input.
- Select column 2 as sorting key.
- Click on the "Execute" button.

6. Find how many chromosomes have a given number of exons

On the Tool Panel, click on Join, Subtract, Group → Group.
This tools ...
- Select data 5 as input (sorted).
- Select column 2 as sorting key.
- Set the operation to "count" on column 1.
- Click on the "Execute" button.

7. Filter exons with at least 10 SNPs

On the Tool Panel, click on Filter and Sort → Filter.
This tools ...
- Select data 5 as input (sorted).
- Set the condition to SNP count greater than 10 (i.e. c2 >= 10).
- Click on the "Execute" button.

8. Retrieve original information for exons

On the Tool Panel, click on Join, Subtract, Group → Join.
This tools ...
This is equivalent to a relational join (not an interval join).
- Select the exons with more than 10 SNPs as first input.
- Select the exon data for chr19:1-1,000,000 as second input.
- Select column 1 (exonID) for the first file.
- Select column 4 (exonID) for the second file.
- Click the "Execute" button.
Now repeat this step but invert the order of the file. Note that this time the output is a BED-formatted output, wherease before it was a tabular file.

9. Display using the UCSC Browser

On the Data Panel on the right-hand size, click on the last job → Display at UCSC.
- The User track show the exons that have more than 10 SNPs in the region of chr19 considered.

Editing and annotation gene structures with Argo

Annotating Genomic Sequence with Argo

Argo is a genome browser and annotation tool written by Reinhard Engles at the Broad Institute. It combines powerful data display functions with the ability to create and edit genomic features.

Start Argo. You can use the web start or download the jar and double-click.
The data files your need for this exercise are located HERE. Save each file to your working directory.
Load Sequence into argo by selecting the "File-->Open Sequence File" menu item. The protocadherin 10 gene sequence is in the file "pcdh10.gene". In the sequence file format pull-down menu, select Fasta. In the sequence range window, accept the default of the entrire sequence range. In the Feature Map Track Table, click the "Load Tracks From File..." button. Select "known.gff". In the Track File Format window, accept the default of "gff".
A line will appear in the track table representing the gff file you just loaded. If you click on the colored square, you can change the color of the track, pick dark blue. The click OK.
The resulting feature map shows the 3 known gene isoforms for PCDH10. Select one of the isoforms.
This activates the feature inspector window in the lower left hand corner of argo. The properties tab describes the feature you have selected, the DNA tab contains the DNA sequence(exons in blue, introns in black), the mRNAis found inder the mRNA tab and the Protein translation of the mRNA is in the Protein tab. Note: Right-clicking (command click for macs) while hovering over the mRNA sequence will allow you to adjust the reading frame.
A right-click while hovering over the Feature map will open a function menu. Select the option "Track Table". This opens the same tool that appeared while we were loading the sequence. Click the "Load Tracks From File..." button.
Select the file "cad.blastx" and hit OK. This results in a Track File Format menu, select the Blast (standard) option. cad.blastx contains standard blast output (look at it here) in text format. The query sequence was the genomic DNA we are displaying, the database was a set of protein sequences related to the gene we are working on. Therefore, after hitting the OK button you will see a "Blast Feature Coordinates" dialog box. The answer to this question is no because of the subject sequences were the protein, not the DNA in argo. This will add the blast results to the Feature map. When you select a blast hit, information about that hit appears in the feature instpector.
Control the way the blast data is displayed by opening the track table and clicking on the area next to the blast track and under the style section. Change it to "Bar Graph".
Using similar methods, load the files cpg.gff and est.gff. All are in gff format.
The resulting display may place the known genes near the top of the page. This is not ideal, open the track table and change know genes to "Segregated".
In the UCSC browser, we identified the EST DA219615 as evidence for an uncharacterized PCDH10 splice variant. Search for this feature using the tool on the lower right hand side of argo. The search tool will result in selection of that feature.
You can inspect details about the feature by right clicking while hovering over the feature map. For example, select the option "Splice Site Profile". This results in window showing the splice junction for the selected feature.
Create a gene model based on this evidence using the "Edit --> Insert Compound Feature" menu item. This will open a window showing the coordinates of the feature. Give it the Gene name "PCDH10" and the transcript name "iso3". Next you will be asked for a file to save the data. Create a new gff3 file called iso3 and click OK.
The result will be a feature is inserted into the map that is identical to EST DA219615. It is located at the very top of the map. Move it's display position down using the track table. Segregated is a good choice.
The purple color indicates a pending insert. Save it using "Edit --> Save".
Examine the 3' End of the gene model. While hovering over the feature map, hold down the Shift Key. A magnifying glass will appear. Use the magnifying glass to draw and small box around the 3' exon of the inserted model.
Compare the exon to the est BG201062. The EST contains more sequence and is likely to represent the real 3' end of the gene. Select the BG201062 and drag it to the end of your inserted model. You will be given a transcript extension dialog box. Select Merge Right. Now, your edited feature should be highlighted in orange. This indicates a pending edit. Save the new model. You will get an error, but that is OK for now.
Zoom out to view the full gene by Shift-Right Clicking several times. Or use the Zoom to 100% function.
Next Add the 5' end of the gene. Select one of the longest known gene models and drag it to your edited feature. This time accept the merge left option. Save the model, you will now get 2 errors, accept them as well.
Repair the gene model using the Feature Inspector window on the lower left. Select your gene model, then click the DNA tab. The letters in DNA will turn red indicating that the feature has non-canonical (ie. non-GT-AG) splice junctions. These will be highlighted in red in the sequence pane. Scroll down until you find the first one. It is located at the 5' end of exon 3. Highlight a couple of lines before and after the red letters. Notice the yellow bar in the feature map. Zoom in on the troublesome exon. The exon is different from the know genes because it's 5'end hangs over a little. Select the known exon and drag it to the working model. select the replace option and save. This will repair the first problem.
Next, zoom in on the 3' exon of the working model. We have an exon that is a few bases off because of the EST-based extension we used to create the model. Go to the DNA tab of the Feature Inspector, scroll to the bottom. Note the non-canonical junction highlighted in red. Right next to the red highlight is the sequence AG that we need. Highlight the red letters and sequence up to and including the AG and do a right-click. One of the available options is "Make Highlighted Sequence Intronic". (WARNING: Mac users without a 2+-button mouse may have trouble with this operation - you can use the drag and drop features to get this exon right) Select this option then save the model. At this time, it should save without errors.
Go to the mRNA tab, right click and select the option "Select ORF > Select Longest Start to StopCodon ORF". Go to the Protein tab to see the protein sequence.
Note the green start and red stop codons indicated in the feature map. Also note the consensus poly-A signal analysis available from the right-click menu.
The entire feature map can be exported as an image for use in presentations using the "File --> Export Map as Image..." menu item.
Select the hand-edited isoform, return to the mRNA tab in the feature inspector, do select all and copy, then return to the section on the UCSC browser. We will start by pasting the mRNA sequence from this annotation effort into a Blat search window.

GeneGO MetaCore

GeneGo Introduction

Competitor of Ingenuity, offers manually curated pathways and networks, the tools to analyze those groups in the context of various experiments and powerful network building methods.
Owned by Thomson Reuters
Major Applications:

Access to manually curated databases
Evaluation of pathways, networks, annotations and associations as they relate gene lists. These gene lists could be deferentially expressed genes or any other list of genes.
Construction of networks based on gene lists using a database of interactions.
1. GeneGO ontologies
2. Pathway Maps
3. Process Networks
4. GO Processes
5. Disease (by Biomarkers)
6. Metabolic Networks

High quality documentation is available by following Help -> Help

Loading Data Into GeneGO

A link to example data is below. xls files are required, open both and examine the data.
Key structure is:

NOTE: Raw gene list of genes of interest works fine as well.
From the , there are 3 ways to access data loading tool:

File-->Upload Data...
Upload Data icon
Upload Page-->Upload Experiments with Gene or Protein IDs

NOTES: Tools emphasize the Activate/Deactivate/Reload requirements.
Choose the file and click Next
Adjust data types as needed. Rename columns and experiments as necessary. Make sure that fold changes and supporting p-value columns are adjacent to each other. Click Next
Specify the species and click Next.
The Background processing status page will indicate when upload is complete. Once processed, the line can be deleted from the table.
Return to the Background processing page from the start page by File-->Background processing item.

Data Management in GeneGO

The upper left panel of the hosts the data management functions.
Click the "Start Page" icon or your browsers "refresh" button to make sure that the displayed data is current.
Double-click the "Experiments" folder. You should see "ColonMet_Faked" inside the folder.
Right-Click on the My Data folder and select Create Folder... Create a folder called May1Training
Drag ColonMet_Faked into the new folder. Right-click on the original folder to delete. Note: folders can also be renamed.
Data can also be shared. Right-click on May1Training and select Share..., enter charliew and search. select username and click Add Selected to List. Note: Types of permissions can be controlled.
The lower left panel of the displays the active data. Active data is available for analysis and can be controlled by the Activate/Deactivate icon, right clicking or drag-and-drop.

Setting Thresholds and Background Sets

Select an experiment, right click and activate Properties...
- The page displayed indicates information about that experiment. There is a place to enter notes or descriptions, the number of IDs in play are also listed as is the Threshold and P-Threshold.
Threshold refers to the fold-change or log ration cutoff that is in play.
Select all three experiments in the ColonMet_Faked folder and activate Tools-->Set Threshold and Background List. Next to General, Enter 1 in the Threshold and 0.1 in the P Threshold. Leave background option at default and click OK.
- If an array is the source of the data, the background list can be set to the gene content of the array, otherwise the default behavior is to use everything.
Use the right-click, properties function to confirm that threshold settings have been set for each experiment.
Activate the ColonMet_Faked_FC experiment and refresh the page so that it is visible in the Active Data panel.
Select the experiment in the active data panel and click Tools-->Set Threshold and Background List. Click the statistics button. Counts of pass-threshold items in various categories are displayed. The signal distribution button will show the data for these faked values.

Search And Browse Content Tab

EZ Search

From the Start Page, select the Search and Browse Content link and pick the EZ search button
Click human FN1 from the results list
Results page is outlined according to category, available information includes pathway maps and interactions.
Under interactions page, compare the details for alpha2/beta1 binding and alpha5/beta1 binding.

Ontologies - Pathway Maps

From the Start Page, select the Search and Browse Content link and pick the Pathway Maps option.
This leads to a hierarchy of GeneGO pathway maps
Select Regulatory maps-->Regulation of cellular processes-->Cell adhesion-->Cell adhesion_Integrin inside-out signaling
NOTE: Data from activated experiments is displayed on the map.
The Overview tab describes details of the map and provides references.
The Processes tab lists significant GO processes in the map.

Ontologies - Process Networks

From the Start Page, select the Search and Browse Content link and pick the Process Networks option.
This leads to a hierarchy of GeneGO process networks
Select Process Networks-->Cell adhesion_ECM remodeling
NOTE: Data from activated experiments is displayed on the network.
The network is highly interactive.

Other - Protein Groups and Complexes

From the Start Page, select the Search and Browse Content link and pick the Protein Groups & Complexes tool.
The tool links to a page of table of results, select Integrin Receptor Family (Human)

Workflows and Reports Tab

Analyze Single Experiment Workflow

The analyze single experiment workflow allows one experiment to be tested. Activate the an experiment with desired thresholds applied and select Analyze Single Experiment from the Workflows & Reports list.
Confirm Network object count and threshold settings and click Apply.
Results are returned by ontology, statistics are -log(pValue) (1.3 = 0.05)
Results can be examined by clicking on the name. See Legend for more information about maps and networks. Note: network objects included in filtered sets display expression results in image.
Different results link to different types of data.
- Pathway maps are nice interactive maps
- Process networks and network statistics produce editable network maps.
- GO Processes produce lists that can be used to draw networks.
- Diseases link to disease pages.

Enrichment Analysis Workflow

This workflow requires at least 2 active experiments. Activate all 3 ColonMet experiments and refresh the browser page to confirm.
Select the Enrichment Analysis link to launch the workflow. Ensure the correct experiments are activated and click Next.
Fold change, p-value signal directionality can be adjusted on the next page. NOTE: the number of Network objects that meet threshold criteria are indicated. Clicking Apply launches the analysis.
Results are returned by ontology, statistics are -log(pValue) (1.3 = 0.05)
Results can be examined by clicking on the name. See Legend for more information about maps and networks. Note: network objects included in filtered sets display expression results in image.
Different results link to different types of data.
- Pathway maps are nice interactive maps
- Process networks produce editable network maps.
- GO Processes produce lists that can be used to draw networks.
- Diseases link to disease pages.

Compare Experiments Workflow

This workflow requires at least 2 active experiments. Activate all 3 ColonMet experiments and refresh the browser page to confirm.
Select the Compare Experiments link to launch the workflow. Ensure the correct experiments are activated and click Next.
Fold change, p-value signal directionality can be adjusted on the next page. NOTE: the number of Network objects that meet threshold criteria are indicated. Clicking Apply launches the analysis.
Genes passing thresholds in each experiment are compared and grouped in 3 categories:

Unique
Similar - present in more than one list
Common - present in all lists

Results are returned by ontology, statistics are -log(pValue) (1.3 = 0.05)
Results can be examined by clicking on the name. See Legend for more information about maps and networks. Note: network objects included in filtered sets display expression results in image.
Different results link to different types of data.
- Pathway maps are nice interactive maps
- Process networks and produces editable network maps.
- GO Processes produce lists that can be used to draw networks.
- Diseases link to disease pages.
Histograms can be sorted according to statistics

Statistically significant
Deferentially effected
Similarity by
Particular

Histogram are interactive, network statistics panel reports results.

One-click Analysis Tab

Pathway Maps

Activate only the ColonMet_Faked experiment and select appropriate thresholds.
From the Start Page, select the One-click Analysis button and pick the Pathway Maps tool
Analysis will consider only the Pathway maps
It can be run on multiple experiments as well.

Map Folders

Activate only the ColonMet_Faked experiment and select appropriate thresholds.
From the Start Page, select the One-click Analysis button and pick the Map Folders tool
Analysis will consider functionally related Pathway Map groups resulting in alteration of scoring and statistics.
It can be run on multiple experiments as well.
Clicking on a result folder leads to a results page displaying data for the composite Pathway Maps

Process Networks

Activate only the ColonMet_Faked experiment and select appropriate thresholds.
From the Start Page, select the One-click Analysis button and pick the Process Networks tool
Analysis will consider enrichment in various annotated process networks.
It can be run on multiple experiments as well.
Clicking on a result leads to a page displaying data for the indicated Process Network.

Building Network for Your Experimental Data

Run a Compare Experiments workflow for ColonMet_Faked_FC and ColonMet_Faked_FCrand1_edit.
Identify and select the common network objects in this comparison (10 Network objects) by clicked the blue/white histogram.
Click Analyze networks to produce a list of networks associated with the list.
Return to the Network options page and click Add network objects and add ITGA5.
Re-run Analyze Networks, using default settings. NOTE: Adding ITGA5 produces and additional network.
There 11 different network building algorithms available.
Clicking on network name opens the highly interactive network viewer. Explore the extensive control over the visualization.

Functional Annotation of Gene Lists

Multiple Sequence Alignment

Clustalw2

Multiple sequence alignment and phylogenetic analysis allow the identification of conserved positions in protein and nucleic acid sequences. This can lead to an appreciation of the evolutionary history of a group of sequences.

The clustal family of programs is commonly used to produce multiple sequence alignments. Other options are available as well:

There are 3 different ways to use the clustal programs.

Web-based clustalw can be used HERE.
ClustalX GUI (and also the command-line clustalw executable) is available for download HERE. exampledata
Command-line clustalw2 is installed on rous

The major difference between the 3 options is the interface and the calculation of bootstrap values for the tree, which is only available in the command-line and GUI versions. Other details of running the program are the same. For this lesson, we will use the web-based version to align these sequences.
Login to rous and copy the clustal training files to your home directory.

cp -r /net/n3/data/Teaching/IAP_2010_day3/clustal .

NOTE: -r means "recursive", copy the folder and everything inside it to the new location "."
The lack of a trailing "/" on the clustal path ensures that the folder is moved, not just it's contents.

enter the clustal directory:

cd clustal

view the raw sequence files:

cat *.pep

launch the clustalw application:

clustalw2

The following interactive menu appears, options 1 and 2 will be used in this demonstration. Additional information is available HERE

 **************************************************************
 ******** CLUSTAL 2.0.12 Multiple Sequence Alignments  ********
 **************************************************************


     1. Sequence Input From Disc
     2. Multiple Alignments
     3. Profile / Structure Alignments
     4. Phylogenetic trees

     S. Execute a system command
     H. HELP
     X. EXIT (leave program)


Your choice:

Select option 1 to load sequence and specify the file "tiny.pep". Before being returned to the original meny, you should see:

Sequences should all be in 1 file.

7 formats accepted: 
NBRF/PIR, EMBL/SwissProt, Pearson (Fasta), GDE, Clustal, GCG/MSF,                  RSF.


Enter the name of the sequence file : tiny.pep
Sequence format is Pearson
Sequences assumed to be PROTEIN


Sequence 1: zf_12a1_a     28 aa
Sequence 2: zf_12a1_b     28 aa
Sequence 3: hs_a11        28 aa
Sequence 4: hs_22a1       28 aa
Sequence 5: zf_a11        28 aa

From the original menu, now select option 2. Multiple Alignments. This activates the Alignment menu:

****** MULTIPLE ALIGNMENT MENU ******


    1.  Do complete multiple alignment now Slow/Accurate
    2.  Produce guide tree file only
    3.  Do alignment using old guide tree file

    4.  Toggle Slow/Fast pairwise alignments = SLOW

    5.  Pairwise alignment parameters
    6.  Multiple alignment parameters

    7.  Reset gaps before alignment? = OFF
    8.  Toggle screen display          = ON
    9.  Output format options
    I. Iteration = NONE

    S.  Execute a system command
    H.  HELP
    or press [RETURN] to go back to main menu

All default options are correct except option 9,select this to see the output format menu:

 ********* Format of Alignment Output *********


     F. Toggle FASTA format output       =  OFF

     1. Toggle CLUSTAL format output     =  ON
     2. Toggle NBRF/PIR format output    =  OFF
     3. Toggle GCG/MSF format output     =  OFF
     4. Toggle PHYLIP format output      =  OFF
     5. Toggle NEXUS format output       =  OFF
     6. Toggle GDE format output         =  OFF

     7. Toggle GDE output case           =  LOWER
     8. Toggle CLUSTALW sequence numbers =  OFF
     9. Toggle output order              =  ALIGNED

     0. Create alignment output file(s) now?

     T. Toggle parameter output          = OFF
     R. Toggle sequence range numbers =  OFF

     H. HELP

User toggle #4 to activate PHYLIP output, hit return to revisit the alignment menu
Select option 1 to do the alignment. Accept the default conditions and hit return until you end up at the main menu. Exit the program with "X".

cat tiny.[ap][lh][ny]

NOTE: Square brackets are regular expression syntax. For example [ap] = either an "a" or a "p" at that position.

This is the clustal format output:

CLUSTAL 2.0.12 multiple sequence alignment


zf_12a1_a       VADLVFLVDGSWSVGRENFRFIRSFIGA--
zf_12a1_b       KADLVFLIDGSWSIGDDSFAKVRQFVFS--
hs_22a1         HYDLVFLLDTSSSVGKEDFEKVRQWVAN--
hs_a11          YMDIVIVLDGSNSIYP--WVEVQHFLINIL
zf_a11          YMDIVIVLDGSNSIYP--WNEVQDFLINIL
                  *:*:::* * *:    :  :: ::

This is the phylip format output:

     5     30
zf_12a1_a  VADLVFLVDG SWSVGRENFR FIRSFIGA-- 
zf_12a1_b  KADLVFLIDG SWSIGDDSFA KVRQFVFS-- 
hs_22a1    HYDLVFLLDT SSSVGKEDFE KVRQWVAN-- 
hs_a11     YMDIVIVLDG SNSIYP--WV EVQHFLINIL 
zf_a11     YMDIVIVLDG SNSIYP--WN EVQDFLINIL

Phylogenetic analysis

Neighbor Joining method in Phylip

To carry out phylogenetic analysis it is important to begin with an alignment in which all gap-containing positions have been removed.

This can be accomplished using the tool.
There is a bug in the process but the gap-free alignment can be loaded into clustalX, the the save sequences as "Phylip" format function can be used.

The correct alignment can be created with a text editor. Another version of it is in your clustal/phylip folder called tiny_nogap.phy
enter the phylip folder:

The first step to to create the bootstrap replicates. This is done with the phylip application seqboot.

when prompted, provide the name "tiny_nogap.phy".
At the interactive menu:
- change option r to 10 for 10 replicates. NOTE: normally the number used is 1000 for neighbor
- press Y to accept changes
- enter 111 as random number seed.

Move outfile to tiny_nogap.seqboot

The next step is to calculate the distances between the sequences in each replicate. This is done with protdist

when prompted, provide the name "tiny_nogap.seqboot".
At the interactive menu:
- change option m to d for datasets and 10 for number of replicates.
- press Y to accept changes

Rename the outfile

Build the neighbor-joining tree for each distance matrix wth neighbor:

when prompted, provide the name "tiny_nogap.protdist".
At the interactive menu:
- change option m to 10 for number of replicates.
- enter 111 for random number seed
- press Y to accept changes

Rename the output:

Create a consensus version of the outtree file with consense:

when prompted, provide the name "tiny_nogap.neightree".
At the interactive menu, default settings are ok so type "Y"

Rename the output:

Draw the tree with drawtree:

when prompted for a file enter tiny_nogap.neigh.constree
when prompted for a fontfile, enter:

At the interactive menu, select option V, then N to turn off preview. Then option Y
This results in a postscript plot of the tree. rename it to:

The contents of this file can be viewed with photo software like but it looks like this:

The bootstrap values can be viewed in the file tiny_nogap.neigh.constree. Or graphicall by opening this file with . With branch lengths selected, the view looks like this:

Microarray data processing with R/Bioconductor

Bioconductor is:

a large number of life sciences software implemented in R
an extensive collection of experimental and annotation data that relate to the analysis software.

Documentation

The packages are written, documented and supported according to consistent standards.
Each package has a website. For example see:

affy

affydata

both pdf documentation and example R scripts exits.

Mailing List

Instructions on joining the mailing list

IMPORTANT:Carefully read the posting guide and follow the instructions.

The list is high-traffic and widely read so be careful with postings.

UseExample

#Connect to rous.mit.edu and start R
R
#check available packages:
library()
#load affdata
library(affydata)
#Run some of the commands
data(Dilution)
ls()
Dilution
class(Dilution)
expressionData<-exprs(Dilution)
class(expressionData)
expressionData[1:3,]
log2(expressionData[1:3,])
round(log2(expressionData[1:3,]), digits=2)

Processing Microarray Data with r/bioconductor

The following series of commands can be used to process array data with gcrma and do differential expression testing with LPE:

library(affy)
library(gcrma)
library(LPE)
library(affyPLM)

#Set the working directory to the location of your CEL files.

setwd("/Path/TO/CEL_Files")

#Import Data
#Order of samples in the resulting matrix can be specified by the order in the list.

Name_Dat<-ReadAffy(
"Condition1a.CEL", "Condition1b.CEL", "Condition1c.CEL",
"Condition2a.CEL", "Condition2b.CEL", "Condition2c.CEL")

#RNA degradation Work
RNAdeg<-AffyRNAdeg(Name_Dat)
png(file="Name_rnaDeg.png", bg="white")
plotAffyRNAdeg(RNAdeg,cols=c(1:16))
dev.off()

#PLM work
pset1<-fitPLM(Name_Dat)

#RLE plot
png(file="rle.png", bg="white")
par(mar=c(3, 10, 3, 3))
RLE(pset1, main = "RLE for Name", horizontal=TRUE, las=2)
dev.off()

#NUSE plot
png(file="nuse.png", bg="white")
par(mar=c(3, 10, 3, 3))
NUSE(pset1, ylim= c(0.95,1.2), main = "NUSE for Name", horizontal=TRUE, las=2)
dev.off()

#Process the Data
Name_Exp<-gcrma(Name_Dat, fast=FALSE)

Name_Tab<-exprs(Name_Exp)
Name_Tab<-round(Name_Tab, digits=2)
write.table(data.frame(Name_Tab), sep="\t", quote=FALSE, file="Name.txt")

#Differential Expression Testing
#In editor, add "ProbeID" to top of first column and delete the affy control rows that start with AFFX

Name<-read.table("Name.txt", header=TRUE)
attach(Name)
names(Name)

set.seed(0)

#testing the columns by printing the first 3 rows for each condition
#var.Cond1
Name[1:3,c(2,3,4)]
#var.Cond2
Name[1:3,c(5,6,7)]

#LPE tests
var.Cond1<-baseOlig.error(Name[,c(2,3,4)],q=0.01)
var.Cond2<-baseOlig.error(Name[,c(5,6,7)],q=0.01)

lpeVal.Cond1.Cond2<-data.frame(lpe(Name[,c(5,6,7)], Name[,c(2,3,4)], var.Cond1, var.Cond2, probe.set.name = Name$ProbeID))

lpeVal.Cond1.Cond2<-round(lpeVal.Cond1.Cond2, digits=2)

fdrBH.Cond1.Cond2<-fdr.adjust(lpeVal.Cond1.Cond2, adjp="BH")

write.table(lpeVal.Cond1.Cond2, quote=FALSE, sep="\t", file="lpeVal.Cond1.Cond2.txt")
write.table(fdrBH.Cond1.Cond2, quote=FALSE, sep="\t", file="fdrBH.Cond1.Cond2.txt")

#clustering with pvalues

library(pvclust)

Name.pv<-pvclust(Name, method.hclust="ward",
method.dist="correlation", use.cor="pairwise.complete.obs",
nboot=1000, r=seq(.5,1.4,by=.1), store=FALSE, weight=FALSE)

png(filename="Name_PV.png", bg="white",width=960, height=500)
plot(TAM.pv)
dev.off()