Galaxy NGS Illumina QC
This tutorial shows how to perform basic QC on Illumina data, including computing basic quality statistics, drawing quality score boxplots, trimming and masking.
1. Load a fastq file and annotate the uploaded data
On the Tool Panel, click on Get Data → Upload File.
Browse to your home folder and select the file "Galaxy_GM12878.fastqillumina".
Set the format to "fastqillumina".
Click the "Execute" button.
Once the job is completed, click on the pencil icon to edit the attributes.
Set the Database field to "Human Mar 2006 (NCBI36/hg18)".
Click the "Save" button.
2. Convert to Sanger FASTQ format
On the Tool Panel, click on NGS Toolbox Beta → NGS: QC and Manipulation → FASTQ Groomer.
This tool converts between various FASTQ quality formats.
By default, the quality format output is Sanger FASTQ.
Sanger FASTQ is the required format for downstream analyses in Galaxy.
Set the input type to "Illumina 1.3+".
Click the "Execute" button.
Once the job is completed, click on the pencil icon and rename the resulting dataset to "GM12878 fastqsanger".
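The conversion performed in this step can be illustrated outside Galaxy. The following is a minimal Python sketch, not the FASTQ Groomer's actual code: it assumes the standard encodings, in which Illumina 1.3+ stores each Phred score as ASCII value minus 64 and Sanger stores it as ASCII value minus 33. The function name illumina13_to_sanger is hypothetical.

```python
def illumina13_to_sanger(quality: str) -> str:
    """Re-encode a Phred+64 quality string as Phred+33 (illustrative only)."""
    converted = []
    for char in quality:
        phred = ord(char) - 64  # Illumina 1.3+ stores Phred score + 64
        if phred < 0:
            raise ValueError(f"{char!r} is not a valid Phred+64 character")
        converted.append(chr(phred + 33))  # Sanger stores Phred score + 33
    return "".join(converted)


# Scores 40, 40, 40, 20, 2 in Illumina 1.3+ encoding
print(illumina13_to_sanger("hhhTB"))  # III5#
```

Only the quality line of each record changes; the sequence and identifier lines are left untouched.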
3. Compute Quality Statistics
On the Tool Panel, click on NGS Toolbox Beta → Fastx-Toolkit → Compute Quality Statistics.
This tool computes quality statistics (min, max, mean, median, Q1, Q3, IQR, etc.) of the quality scores.
Select Data 2 as input library.
Click the "Execute" button.
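To make the reported statistics concrete, here is a rough Python sketch of a per-position summary. It is an approximation under stated assumptions, not the Fastx-Toolkit implementation: quality strings are assumed to be Sanger-encoded (offset 33), quartiles use Python's "inclusive" method (the tool may interpolate differently), and at least two reads are required. The function name column_statistics is hypothetical.

```python
import statistics


def column_statistics(quality_strings, offset=33):
    """Summarise quality scores per read position (column)."""
    # Decode each quality character to its Phred score (Sanger: ASCII - 33).
    decoded = [[ord(char) - offset for char in q] for q in quality_strings]
    report = []
    # zip(*...) groups scores by column; it stops at the shortest read.
    for position, scores in enumerate(zip(*decoded), start=1):
        q1, median, q3 = statistics.quantiles(scores, n=4, method="inclusive")
        report.append({
            "column": position,
            "min": min(scores),
            "max": max(scores),
            "mean": statistics.mean(scores),
            "median": median,
            "Q1": q1,
            "Q3": q3,
            "IQR": q3 - q1,
        })
    return report


# Four toy reads of two bases each, with scores 40, 30, 20 and 10
for row in column_statistics(["II", "??", "55", "++"]):
    print(row)
```

Each row corresponds to one base position, which is the information the boxplot in the next step draws.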
4. Draw Quality Score Boxplot
On the Tool Panel, click on NGS Toolbox Beta → Fastx-Toolkit → Draw Quality Score Boxplot.
This tool creates a box graph of the quality scores in the library.
Select Data 3 as statistic report file.
Click the "Execute" button.
Once the job is completed, click on the eye icon to see the boxplot figure. You can expand and collapse the figure by clicking on the arrows placed on the sides of the main panel.
5. Trim Sequence Reads to length of 60 bases
On the Tool Panel, click on NGS Toolbox Beta → FASTQ Trimmer (by column).
This tool trims the end of the reads.
Select Data 2 as input FASTQ file.
Set the offset from the 3' end to 16.
With these parameters, the last 16 bases of each read are removed, so reads that are 76 bases long are trimmed after the 60th base.
Click the "Execute" button.
Once the job is completed, click on the pencil icon to edit the attributes of the resulting data and change the name to "GM12878 Trimmed fastqsanger".
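Column-based trimming amounts to removing a fixed number of bases from one or both ends of every read. The Python sketch below illustrates the idea; it is not the Galaxy tool's code. The function name trim_by_column and its parameters are hypothetical, and the 76-base read length is an assumption chosen so that removing 16 bases leaves 60.

```python
def trim_by_column(sequence, quality, offset_5=0, offset_3=0):
    """Drop offset_5 bases from the 5' end and offset_3 bases from the 3' end."""
    # Guard against offsets larger than the read: the result is then empty.
    end = max(offset_5, len(sequence) - offset_3)
    return sequence[offset_5:end], quality[offset_5:end]


read = "ACGT" * 19  # hypothetical 76-base read
qual = "I" * 76     # matching quality string
trimmed_seq, trimmed_qual = trim_by_column(read, qual, offset_3=16)
print(len(trimmed_seq), len(trimmed_qual))  # 60 60
```

The sequence and quality strings are always cut identically so that they stay the same length.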
6. Apply Quality Masker to bases with quality lower than 20
On the Tool Panel, click on NGS Toolbox Beta → FASTQ Masker.
This tool allows masking base characters in FASTQ files according to quality score value and comparison method.
Select Data 2 as input file to mask.
Set the criterion as "less than" and the threshold to 20.
Click the "Execute" button.
With these parameters, any base with a quality score lower than 20 is masked with the symbol "N".
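The masking rule can be sketched in a few lines of Python. This is an illustration of the "less than 20" criterion only, assuming Sanger-encoded qualities (offset 33); the function name mask_low_quality is hypothetical and does not correspond to the tool's internals.

```python
def mask_low_quality(sequence, quality, threshold=20, offset=33, mask_char="N"):
    """Replace every base whose Phred score is below threshold with mask_char."""
    masked = []
    for base, qchar in zip(sequence, quality):
        score = ord(qchar) - offset  # decode Sanger quality character
        masked.append(mask_char if score < threshold else base)
    return "".join(masked)


# Scores 40, 20, 10, 2: a score of exactly 20 is kept, lower scores are masked
print(mask_low_quality("ACGT", "I5+#"))  # ACNN
```

The quality string itself is not modified, so the masked read keeps its original length.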
7. Apply FASTQ Quality Trimmer
On the Tool Panel, click on NGS Toolbox Beta → FASTQ Quality Trimmer (by sliding window).
This tool allows trimming the ends of reads based upon the aggregate value of quality scores found within a sliding window. Several criteria can be used to determine the aggregate value (min, max, sum, mean) within the sliding window.
Select Data 2 as input file.
Select "Trim 5' end" only from the drop-down menu.
Set window size to 3.
Select "max score" as aggregate action.
Select ">= 2" as the criterion for trimming.
Click the "Execute" button.
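The sliding-window idea can be approximated as follows. This Python sketch rests on an assumption about the tool's behaviour, namely that bases are removed from the 5' end one at a time until the aggregate score of the window starting at the current position satisfies the criterion; the actual Galaxy tool has further options (for example a step size) that are not modelled here. The function name trim_5prime is hypothetical, and Sanger encoding (offset 33) is assumed.

```python
def trim_5prime(sequence, quality, window=3, aggregate=max, threshold=2, offset=33):
    """Trim from the 5' end until the window aggregate is >= threshold."""
    scores = [ord(char) - offset for char in quality]
    start = 0
    while start < len(scores):
        window_scores = scores[start:start + window]
        if aggregate(window_scores) >= threshold:
            break  # criterion met: keep everything from this position on
        start += 1  # criterion not met: drop one more 5' base
    return sequence[start:], quality[start:]


# Scores 0, 0, 0, 0, 40, 40: the first window whose maximum is >= 2 starts at index 2
print(trim_5prime("ACGTAC", "!!!!II"))  # ('GTAC', '!!II')
```

With "max score" as the aggregate, low-quality bases can remain inside the first accepted window; passing min as the aggregate would make the trimming stricter.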
8. Create a Data Subset by selecting the first 2,500 sequence reads
On the Tool Panel, click on Text Manipulation → Select first lines.
This tool selects the first N lines of the input dataset.
Set the number of lines to select to 10,000.
Select Data 5 as input.
Click the "Execute" button.
With these parameters, the first 10,000 lines of the input FASTQ file are selected. Because each FASTQ record spans four lines, this corresponds to the first 2,500 sequence reads.
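The relationship between lines and reads can be checked with a small Python sketch. It assumes the common four-line FASTQ layout (identifier, sequence, "+" separator, quality) with no wrapped sequence lines; the function name first_reads and the toy records are hypothetical.

```python
import io
from itertools import islice

LINES_PER_RECORD = 4  # identifier, sequence, "+" separator, quality


def first_reads(lines, n_reads):
    """Return the lines that make up the first n_reads FASTQ records."""
    return list(islice(lines, LINES_PER_RECORD * n_reads))


# Three toy records; selecting two reads means taking the first eight lines
toy = io.StringIO(
    "@r1\nACGT\n+\nIIII\n"
    "@r2\nTTGA\n+\nIIII\n"
    "@r3\nGGCC\n+\nIIII\n"
)
subset = first_reads(toy, 2)
print(len(subset))               # 8
print(LINES_PER_RECORD * 2500)   # 10000
```

Choosing a line count that is not a multiple of four would cut the last record short and produce an invalid FASTQ file.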