Editing and annotation gene structures with Argo

Annotating Genomic Sequence with Argo

Argo is a genome browser and annotation tool written by Reinhard Engles at the Broad Institute. It combines powerful data display functions with the ability to create and edit genomic features.

  • Start Argo. You can use the web start or download the jar and double-click.

  • The data files your need for this exercise are located HERE. Save each file to your working directory.

  • Load Sequence into argo by selecting the "File-->Open Sequence File" menu item. The protocadherin 10 gene sequence is in the file "pcdh10.gene". In the sequence file format pull-down menu, select Fasta. In the sequence range window, accept the default of the entrire sequence range. In the Feature Map Track Table, click the "Load Tracks From File..." button. Select "known.gff". In the Track File Format window, accept the default of "gff".

  • A line will appear in the track table representing the gff file you just loaded. If you click on the colored square, you can change the color of the track, pick dark blue. The click OK.

  • The resulting feature map shows the 3 known gene isoforms for PCDH10. Select one of the isoforms.

  • This activates the feature inspector window in the lower left hand corner of argo. The properties tab describes the feature you have selected, the DNA tab contains the DNA sequence(exons in blue, introns in black), the mRNAis found inder the mRNA tab and the Protein translation of the mRNA is in the Protein tab. Note: Right-clicking (command click for macs) while hovering over the mRNA sequence will allow you to adjust the reading frame.

  • A right-click while hovering over the Feature map will open a function menu. Select the option "Track Table". This opens the same tool that appeared while we were loading the sequence. Click the "Load Tracks From File..." button.

  • Select the file "cad.blastx" and hit OK. This results in a Track File Format menu, select the Blast (standard) option. cad.blastx contains standard blast output (look at it here) in text format. The query sequence was the genomic DNA we are displaying, the database was a set of protein sequences related to the gene we are working on. Therefore, after hitting the OK button you will see a "Blast Feature Coordinates" dialog box. The answer to this question is no because of the subject sequences were the protein, not the DNA in argo. This will add the blast results to the Feature map. When you select a blast hit, information about that hit appears in the feature instpector.

  • Control the way the blast data is displayed by opening the track table and clicking on the area next to the blast track and under the style section. Change it to "Bar Graph".

  • Using similar methods, load the files cpg.gff and est.gff. All are in gff format.

  • The resulting display may place the known genes near the top of the page. This is not ideal, open the track table and change know genes to "Segregated".

  • In the UCSC browser, we identified the EST DA219615 as evidence for an uncharacterized PCDH10 splice variant. Search for this feature using the tool on the lower right hand side of argo. The search tool will result in selection of that feature.

  • You can inspect details about the feature by right clicking while hovering over the feature map. For example, select the option "Splice Site Profile". This results in window showing the splice junction for the selected feature.

  • Create a gene model based on this evidence using the "Edit --> Insert Compound Feature" menu item. This will open a window showing the coordinates of the feature. Give it the Gene name "PCDH10" and the transcript name "iso3". Next you will be asked for a file to save the data. Create a new gff3 file called iso3 and click OK.

  • The result will be a feature is inserted into the map that is identical to EST DA219615. It is located at the very top of the map. Move it's display position down using the track table. Segregated is a good choice.

  • The purple color indicates a pending insert. Save it using "Edit --> Save".

  • Examine the 3' End of the gene model. While hovering over the feature map, hold down the Shift Key. A magnifying glass will appear. Use the magnifying glass to draw and small box around the 3' exon of the inserted model.

  • Compare the exon to the est BG201062. The EST contains more sequence and is likely to represent the real 3' end of the gene. Select the BG201062 and drag it to the end of your inserted model. You will be given a transcript extension dialog box. Select Merge Right. Now, your edited feature should be highlighted in orange. This indicates a pending edit. Save the new model. You will get an error, but that is OK for now.

  • Zoom out to view the full gene by Shift-Right Clicking several times. Or use the Zoom to 100% function.

  • Next Add the 5' end of the gene. Select one of the longest known gene models and drag it to your edited feature. This time accept the merge left option. Save the model, you will now get 2 errors, accept them as well.

  • Repair the gene model using the Feature Inspector window on the lower left. Select your gene model, then click the DNA tab. The letters in DNA will turn red indicating that the feature has non-canonical (ie. non-GT-AG) splice junctions. These will be highlighted in red in the sequence pane. Scroll down until you find the first one. It is located at the 5' end of exon 3. Highlight a couple of lines before and after the red letters. Notice the yellow bar in the feature map. Zoom in on the troublesome exon. The exon is different from the know genes because it's 5'end hangs over a little. Select the known exon and drag it to the working model. select the replace option and save. This will repair the first problem.

  • Next, zoom in on the 3' exon of the working model. We have an exon that is a few bases off because of the EST-based extension we used to create the model. Go to the DNA tab of the Feature Inspector, scroll to the bottom. Note the non-canonical junction highlighted in red. Right next to the red highlight is the sequence AG that we need. Highlight the red letters and sequence up to and including the AG and do a right-click. One of the available options is "Make Highlighted Sequence Intronic". (WARNING: Mac users without a 2+-button mouse may have trouble with this operation - you can use the drag and drop features to get this exon right) Select this option then save the model. At this time, it should save without errors.

  • Go to the mRNA tab, right click and select the option "Select ORF > Select Longest Start to StopCodon ORF". Go to the Protein tab to see the protein sequence.

  • Note the green start and red stop codons indicated in the feature map. Also note the consensus poly-A signal analysis available from the right-click menu.

  • The entire feature map can be exported as an image for use in presentations using the "File --> Export Map as Image..." menu item.

  • Select the hand-edited isoform, return to the mRNA tab in the feature inspector, do select all and copy, then return to the section on the UCSC browser. We will start by pasting the mRNA sequence from this annotation effort into a Blat search window.

Last updated

Massachusetts Institute of Technology