Editing and annotating gene structures with Argo

Annotating Genomic Sequence with Argo

Argo is a genome browser and annotation tool written by Reinhard Engels at the Broad Institute. It combines powerful data display functions with the ability to create and edit genomic features.

  • Start Argo. You can use the web start version, or download the jar file and double-click it.

  • The data files you need for this exercise are located HERE. Save each file to your working directory.

  • Load the sequence into Argo by selecting the "File-->Open Sequence File" menu item. The protocadherin 10 gene sequence is in the file "pcdh10.gene". In the sequence file format pull-down menu, select Fasta. In the sequence range window, accept the default of the entire sequence range. In the Feature Map Track Table, click the "Load Tracks From File..." button. Select "known.gff". In the Track File Format window, accept the default of "gff".

  • A line will appear in the track table representing the gff file you just loaded. If you click on the colored square, you can change the color of the track; pick dark blue. Then click OK.

  • The resulting feature map shows the 3 known gene isoforms for PCDH10. Select one of the isoforms.

  • This activates the feature inspector window in the lower left-hand corner of Argo. The Properties tab describes the feature you have selected, the DNA tab contains the DNA sequence (exons in blue, introns in black), the mRNA is found under the mRNA tab, and the protein translation of the mRNA is in the Protein tab. Note: Right-clicking (command-click for Macs) while hovering over the mRNA sequence will allow you to adjust the reading frame.

  • A right-click while hovering over the Feature map will open a function menu. Select the option "Track Table". This opens the same tool that appeared while we were loading the sequence. Click the "Load Tracks From File..." button.

  • Select the file "cad.blastx" and hit OK. This brings up a Track File Format menu; select the Blast (standard) option. cad.blastx contains standard blast output (look at it here) in text format. The query sequence was the genomic DNA we are displaying, and the database was a set of protein sequences related to the gene we are working on. After hitting the OK button you will see a "Blast Feature Coordinates" dialog box. The answer to this question is no, because the subject sequences were the proteins, not the DNA in Argo. This will add the blast results to the feature map. When you select a blast hit, information about that hit appears in the feature inspector.

  • Control the way the blast data is displayed by opening the track table and clicking on the area next to the blast track and under the style section. Change it to "Bar Graph".

  • Using similar methods, load the files cpg.gff and est.gff. All are in gff format.

  • The resulting display may place the known genes near the top of the page. This is not ideal. Open the track table and change the known genes track to "Segregated".

  • In the UCSC browser, we identified the EST DA219615 as evidence for an uncharacterized PCDH10 splice variant. Search for this feature using the tool on the lower right-hand side of Argo. The search tool will select that feature.

  • You can inspect details about the feature by right-clicking while hovering over the feature map. For example, select the option "Splice Site Profile". This opens a window showing the splice junction for the selected feature.

  • Create a gene model based on this evidence using the "Edit --> Insert Compound Feature" menu item. This will open a window showing the coordinates of the feature. Give it the Gene name "PCDH10" and the transcript name "iso3". Next you will be asked for a file to save the data. Create a new gff3 file called iso3 and click OK.

  • The result is a new feature inserted into the map that is identical to EST DA219615. It is located at the very top of the map. Move its display position down using the track table. Segregated is a good choice.

  • The purple color indicates a pending insert. Save it using "Edit --> Save".

  • Examine the 3' end of the gene model. While hovering over the feature map, hold down the Shift key. A magnifying glass will appear. Use the magnifying glass to draw a small box around the 3' exon of the inserted model.

  • Compare the exon to the EST BG201062. The EST contains more sequence and is likely to represent the real 3' end of the gene. Select BG201062 and drag it to the end of your inserted model. You will be given a transcript extension dialog box. Select Merge Right. Now your edited feature should be highlighted in orange. This indicates a pending edit. Save the new model. You will get an error, but that is OK for now.

  • Zoom out to view the full gene by Shift-right-clicking several times, or use the Zoom to 100% function.

  • Next, add the 5' end of the gene. Select one of the longest known gene models and drag it to your edited feature. This time accept the Merge Left option. Save the model. You will now get 2 errors; accept them as well.

  • Repair the gene model using the Feature Inspector window on the lower left. Select your gene model, then click the DNA tab. The letters in DNA will turn red, indicating that the feature has non-canonical (i.e., non-GT-AG) splice junctions. These will be highlighted in red in the sequence pane. Scroll down until you find the first one. It is located at the 5' end of exon 3. Highlight a couple of lines before and after the red letters. Notice the yellow bar in the feature map. Zoom in on the troublesome exon. The exon is different from the known genes because its 5' end hangs over a little. Select the known exon and drag it to the working model. Select the replace option and save. This will repair the first problem.

  • Next, zoom in on the 3' exon of the working model. We have an exon that is a few bases off because of the EST-based extension we used to create the model. Go to the DNA tab of the Feature Inspector and scroll to the bottom. Note the non-canonical junction highlighted in red. Right next to the red highlight is the sequence AG that we need. Highlight the red letters and the sequence up to and including the AG, then right-click. One of the available options is "Make Highlighted Sequence Intronic". (WARNING: Mac users without a 2+-button mouse may have trouble with this operation - you can use the drag and drop features to get this exon right.) Select this option, then save the model. This time, it should save without errors.

  • Go to the mRNA tab, right click and select the option "Select ORF > Select Longest Start to StopCodon ORF". Go to the Protein tab to see the protein sequence.

  • Note the green start and red stop codons indicated in the feature map. Also note the consensus poly-A signal analysis available from the right-click menu.

  • The entire feature map can be exported as an image for use in presentations using the "File --> Export Map as Image..." menu item.

  • Select the hand-edited isoform, return to the mRNA tab in the feature inspector, select all and copy, then return to the section on the UCSC browser. We will start by pasting the mRNA sequence from this annotation effort into a Blat search window.
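
The repair steps above depend on Argo flagging introns that do not begin with GT and end with AG. As a rough illustration of that kind of check (not Argo's own implementation), the sketch below takes a plus-strand sequence and a list of 1-based, inclusive exon coordinates and reports introns lacking the canonical GT...AG ends. The function names and the toy sequence are hypothetical, invented for this example.

```python
def intron_junctions(sequence, exons):
    """Yield (intron_start, intron_end, donor, acceptor) for each intron.

    `exons` holds 1-based, inclusive (start, end) pairs on the plus strand,
    the coordinate convention used by GFF files.
    """
    ordered = sorted(exons)
    for (_, left_end), (right_start, _) in zip(ordered, ordered[1:]):
        intron_start = left_end + 1      # first intronic base (1-based)
        intron_end = right_start - 1     # last intronic base (1-based)
        intron = sequence[intron_start - 1:intron_end].upper()
        yield intron_start, intron_end, intron[:2], intron[-2:]


def non_canonical_junctions(sequence, exons):
    """Return the introns whose ends are not GT...AG."""
    return [
        (start, end, donor, acceptor)
        for start, end, donor, acceptor in intron_junctions(sequence, exons)
        if (donor, acceptor) != ("GT", "AG")
    ]


# Toy data, invented for illustration only (not the PCDH10 sequence).
toy_sequence = "ATGGCC" "GTAAAG" "GGGCCC" "TTAAAG" "TGA"
toy_exons = [(1, 6), (13, 18), (25, 27)]
problems = non_canonical_junctions(toy_sequence, toy_exons)
print(problems)
```

In this toy case the first intron is canonical and the second is reported because its donor site is TT rather than GT. A minus-strand feature would need the sequence reverse-complemented first, which this sketch does not handle.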

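
The "Select Longest Start to StopCodon ORF" option used in the mRNA tab picks an open reading frame from the transcript. The sketch below shows one plausible reading of that idea: scan the three forward frames and keep the longest stretch that begins at an ATG and ends at the next in-frame stop codon. This is an assumption about what the menu item does, not Argo's code, and the function name and test sequence are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf(mrna):
    """Return (start, end, orf) for the longest ATG-to-stop ORF.

    Coordinates are 0-based and end-exclusive, and the ORF includes the
    stop codon. Returns None when no complete ORF exists.
    """
    mrna = mrna.upper().replace("U", "T")
    best = None
    for frame in range(3):
        start = None  # position of the first ATG seen since the last stop
        for pos in range(frame, len(mrna) - 2, 3):
            codon = mrna[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos
            elif start is not None and codon in STOP_CODONS:
                end = pos + 3
                if best is None or end - start > best[1] - best[0]:
                    best = (start, end, mrna[start:end])
                start = None
    return best


# Toy transcript, invented for illustration only.
result = longest_orf("CCATGAAATGGTTTTAGGATGCCCTGA")
print(result)
```

Only complete ORFs are considered here; a reading frame that starts with ATG but runs off the end of the sequence without a stop codon is ignored.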