FASTQ Quality Check

Introduction

The "FASTQ Quality Check" tool provides an easy way to perform a quality control check on sequence data coming from high throughput sequencing pipelines. The analysis is performed by nine modules which provide a quick overview of whether the data looks good and there are no problems or biases which may affect downstream analysis. Results and evaluations are returned in the form of charts and tables.

This tool is based on the popular FastQC software. Please cite FastQC as:
Andrews S (2010). "FastQC: a quality control tool for high throughput sequence data". Available online at:
http://www.bioinformatics.babraham.ac.uk/projects/fastqc

Figure 1: FASTQ Quality Check Interface

Run FASTQ Quality Check

This functionality can be found under Tools → FASTQ Tools → FASTQ Quality Check. The wizard lets you select input files and adjust analysis parameters (Figure 2). 

  • Raw Sequence Data: Select the files containing the sequence data. These files are assumed to be in FASTQ format, optionally compressed with gzip. 
  • Additional Adapter Sequences: Specify a file containing the list of adapter sequences that will be explicitly searched for in the library. The file must contain sets of named adapters in the form of "Name <Tab> Sequence". If this option is not set, Blast2GO searches for the following adapter sequences:

    • Illumina Universal Adapter: AGATCGGAAGAG

    • Illumina Small RNA 3' Adapter: TGGAATTCTCGG

    • Illumina Small RNA 5' Adapter: GATCGTCGGACT

    • Nextera Transposase Sequence: CTGTCTCTTATA

    • SOLID Small RNA Adapter: CGCCTTGGCCGT

  • Additional Contaminant Sequences: Specify a file containing the list of contaminants against which over-represented sequences are screened. The file must contain sets of named contaminants in the form of "Name <Tab> Sequence". If this option is not set, Blast2GO searches for a list of common contaminant sequences.

  • Chart Read Length Binning: Groups adjacent base positions into bins in the charts. If this option is disabled, reports show data for every base position in the read. 

    Disabling this option for long reads (> 50 bp) can make the plots look very small. 
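
As an illustration of the "Name <Tab> Sequence" layout expected for the adapter and contaminant files, the following minimal Python sketch parses such lines into a dictionary. The function name parse_named_sequences and the skipping of blank lines and lines starting with "#" are assumptions made for this example; they are not taken from the tool itself.

```python
from typing import Dict, Iterable


def parse_named_sequences(lines: Iterable[str]) -> Dict[str, str]:
    """Parse 'Name<Tab>Sequence' lines into a {name: sequence} mapping."""
    entries: Dict[str, str] = {}
    for raw in lines:
        line = raw.rstrip("\r\n")
        # Assumption: blank lines and '#' comment lines are ignored.
        if not line.strip() or line.startswith("#"):
            continue
        # Split at the first tab; everything before it is the name.
        name, sep, sequence = line.partition("\t")
        if not sep or not name.strip() or not sequence.strip():
            raise ValueError(f"expected 'Name<Tab>Sequence', got: {line!r}")
        entries[name.strip()] = sequence.strip().upper()
    return entries


# Two of the default adapters listed above, written in the file layout.
example_lines = [
    "# custom adapters",
    "Illumina Universal Adapter\tAGATCGGAAGAG",
    "Nextera Transposase Sequence\tCTGTCTCTTATA",
]
adapters = parse_named_sequences(example_lines)
```

Running such a check on a custom file before starting the wizard can catch lines where the tab separator was replaced by spaces.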


Figure 2: FASTQ Quality Check Wizard Page

Results

Once the analysis has finished, a new tab opens showing simple composition statistics for each analyzed file (Figure 3). Each row corresponds to an input file, and the columns show the following information:

  • Name: The name of the file which was analyzed.
  • File type: Shows whether the file appeared to contain actual base calls or colorspace data which had to be converted to base calls.
  • Encoding: Shows which ASCII encoding of quality values was detected in this file.
  • Total Sequences: The total number of read sequences processed. 
  • Poor quality reads: Sequences flagged as poor quality reads.
  • Sequence Length: Provides the length of the shortest and longest sequence in the set. If all sequences are the same length only one value is reported.
  • %GC: The overall %GC of all bases in all sequences.

Figure 3: FASTQ Quality Check Project 


Furthermore, a result page shows a summary of the "FASTQ Quality Check" results (Figure 4). This page provides a quick evaluation of whether the results of each module seem entirely normal (pass), slightly abnormal (warning) or very unusual (fail).

Note that these evaluations must be taken in the context of what is expected from each library. For example, some experiments may be expected to produce libraries which are biased in particular ways. Therefore, the summary evaluations should be treated as pointers that guide the preprocessing of the libraries. 

The result summary can be generated via Side Panel → Summary Report. Additionally, the report of each file can be opened by clicking the button in the "Report" column. 

Figure 4: FASTQ Quality Check Report


The results of each module for each file can be accessed as follows:

  • To open the summary report of each file, right-click on a row and click on Show report. A new report is opened containing a summary of the statistics and results for the selected file (Figure 5).
  • To open the result of each module for a file, right-click on a row and go to the Show Statistics submenu. These results can also be accessed by clicking the buttons in the "Details" column of the results table (Figure 5). 

Figure 5: Report of a FASTQ file


Per Base Sequence Quality

This chart shows an overview of the range of quality values across all bases at each position in the FASTQ file (Figure 6).

For each position (x-axis), a box and whisker type plot is drawn:

  • The central black line is the median value. 
  • The yellow box represents the interquartile range (25-75%).
  • The lower and upper whiskers represent the 10% and 90% points, respectively. 
  • The blue line represents the mean quality. 

The y-axis shows the quality scores. The background of the graph divides the y-axis into very good quality calls (green), calls of reasonable quality (orange), and calls of poor quality (red). 

The title of the graph will describe the encoding that the input files used. 

A WARNING is issued if the lower quartile for any base is less than 10, or if the median for any base is less than 25. This module raises a FAIL if the lower quartile for any base is less than 5 or if the median for any base is less than 20. 
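
The thresholds above can be expressed as a small decision rule. The following Python sketch assumes that the lower quartile and the median have already been computed for every base position; the function name per_base_quality_status and its inputs are hypothetical and only illustrate the rule.

```python
from typing import Sequence


def per_base_quality_status(lower_quartiles: Sequence[float],
                            medians: Sequence[float]) -> str:
    """Return 'FAIL', 'WARNING' or 'PASS' from per-position statistics."""
    # FAIL: lower quartile below 5 or median below 20 at any position.
    if any(q < 5 for q in lower_quartiles) or any(m < 20 for m in medians):
        return "FAIL"
    # WARNING: lower quartile below 10 or median below 25 at any position.
    if any(q < 10 for q in lower_quartiles) or any(m < 25 for m in medians):
        return "WARNING"
    return "PASS"


# Quality dropping towards the end of the read triggers a warning.
status = per_base_quality_status([30, 28, 8], [36, 34, 26])
```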

The most common reason for warnings and failures is a general degradation of quality over the duration of long runs. If the quality of the library falls to a low level then the most common procedure is to perform a quality trimming to truncate reads based on their average quality. 


Figure 6: Per Base Sequence Quality Chart


Per Sequence Quality Scores

This chart displays the number of read sequences that have the same mean sequence quality (Figure 7). It shows whether a subset of your sequences has universally low quality values. 

A WARNING is raised if the most frequently observed mean quality is below 27 (0.2% error rate). A FAIL is raised if the most frequently observed mean quality is below 20 (1% error rate). 
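
The error rates quoted with these thresholds follow from the standard Phred relationship, error probability = 10^(-Q/10). The Python sketch below shows this conversion together with the decision rule. The helper names are hypothetical, and the offset of 33 used to decode the quality string is an assumption that only holds for Phred+33 encoded files (see the "Encoding" column of the results table).

```python
def phred_to_error_probability(quality: float) -> float:
    """Convert a Phred quality score Q to the error probability 10^(-Q/10)."""
    return 10 ** (-quality / 10)


def mean_read_quality(quality_string: str, offset: int = 33) -> float:
    """Mean Phred score of one read; offset 33 is an assumption (Phred+33)."""
    if not quality_string:
        raise ValueError("quality string is empty")
    scores = [ord(ch) - offset for ch in quality_string]
    return sum(scores) / len(scores)


def per_sequence_quality_status(modal_mean_quality: float) -> str:
    """Apply the thresholds to the most frequently observed mean quality."""
    if modal_mean_quality < 20:
        return "FAIL"
    if modal_mean_quality < 27:
        return "WARNING"
    return "PASS"
```

For example, a mean quality of 27 corresponds to an error probability of roughly 0.002 (0.2%), and a mean quality of 20 to 0.01 (1%).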

If a significant proportion of the reads in a run have overall low quality then this indicates some kind of systematic problem. This may be alleviated through quality trimming. 


Figure 7: Per Sequence Quality Scores Chart


Per Base Sequence Content

This chart plots out the proportion of each base position in a FASTQ file for which each of the four normal DNA bases has been called (Figure 8). In a random library, it is expected that there would be little to no difference between the different bases of the sequence reads, so the lines in this plot should run parallel with each other. 

A WARNING is issued if the difference between A and T, or G and C is greater than 10% in any position. A FAIL is raised if the difference between A and T, or G and C is greater than 20% in any position. 
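
As a minimal sketch of this rule, the following Python function takes the percentage of each base at every position and returns the evaluation. The function name and the input layout (one dictionary per position with the keys "A", "C", "G" and "T") are assumptions chosen for the example.

```python
from typing import Mapping, Sequence


def per_base_content_status(positions: Sequence[Mapping[str, float]]) -> str:
    """Evaluate per-position base percentages against the 10% / 20% limits."""
    worst = 0.0
    for percent in positions:
        # Largest A/T or G/C imbalance seen at any position so far.
        worst = max(worst,
                    abs(percent["A"] - percent["T"]),
                    abs(percent["G"] - percent["C"]))
    if worst > 20:
        return "FAIL"
    if worst > 10:
        return "WARNING"
    return "PASS"


# The second position has an A/T difference of 11 percentage points.
status = per_base_content_status([
    {"A": 25, "C": 25, "G": 25, "T": 25},
    {"A": 32, "C": 20, "G": 27, "T": 21},
])
```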

The common reasons for warnings and failures are: 

  • Overrepresented sequences (such as adapter dimers or rRNA in a sample). 
  • Biased fragmentation (nearly all RNA-Seq libraries will fail this module because of this bias). 
  • Biased composition libraries.
  • Libraries which have been adapter trimmed. 

Figure 8: Per Base Sequence Content Chart


Per Sequence GC Content

This module measures the GC content across the whole length of each sequence read in a file and compares it to a modeled normal distribution of GC content (Figure 9). Since the GC content of the genome is not known, the modal GC content is calculated from the observed data and used to build a reference distribution. 
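
The first part of this calculation, the GC content of each read and the modal value over all reads, can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names are hypothetical, GC content is taken as (G + C) divided by the read length, and values are rounded to whole percent before the mode is determined. Fitting the reference distribution is not shown.

```python
from collections import Counter
from typing import Iterable


def gc_percent(read: str) -> float:
    """GC content of one read in percent (assumption: (G + C) / length)."""
    if not read:
        return 0.0
    gc = sum(1 for base in read.upper() if base in "GC")
    return 100.0 * gc / len(read)


def gc_histogram(reads: Iterable[str]) -> Counter:
    """Count reads per GC percentage, rounded to whole percent."""
    return Counter(round(gc_percent(read)) for read in reads)


def modal_gc(reads: Iterable[str]) -> int:
    """Most frequently observed GC percentage among the reads."""
    histogram = gc_histogram(reads)
    if not histogram:
        raise ValueError("no reads given")
    return histogram.most_common(1)[0][0]


mode = modal_gc(["ATGC", "AGCT", "GGCC", "AATT"])
```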

A WARNING is raised if the sum of the deviations from the normal distribution represents more than 15% of the reads. A FAIL is raised if the sum of the deviations from the normal distribution represents more than 30% of the reads. 

Warnings and failures indicate a problem with the library (e.g. specific contaminant). An unusually shaped distribution could indicate a contaminated library. A normal distribution which is shifted indicates some systematic bias which is independent of base position. 

If there is a systematic bias which creates a shifted normal distribution then this won't be flagged as an error by the module since it doesn't know what the genome's GC content should be. 

Figure 9: Per Sequence GC Content Chart


Per Base N Content

This module plots the percentage of base calls at each position for which an N was called (Figure 10). N replaces a conventional base call when the sequencer is unable to make a base call with sufficient confidence. 

A WARNING is raised if any position shows an N content of >5%. A FAIL is raised if any position shows an N content of >20%. 
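
A minimal Python sketch of this calculation is shown below. The function names are hypothetical, and the percentage at each position is computed relative to the reads that are long enough to reach that position, which is an assumption of this example.

```python
from typing import List, Sequence


def n_content_per_position(reads: Sequence[str]) -> List[float]:
    """Percentage of N calls at each position across all reads."""
    max_len = max((len(read) for read in reads), default=0)
    n_counts = [0] * max_len
    totals = [0] * max_len
    for read in reads:
        for i, base in enumerate(read):
            totals[i] += 1
            if base in "Nn":
                n_counts[i] += 1
    # Every position below max_len is covered by at least one read.
    return [100.0 * n / t for n, t in zip(n_counts, totals)]


def n_content_status(percentages: Sequence[float]) -> str:
    """Apply the 5% (WARNING) and 20% (FAIL) limits."""
    worst = max(percentages, default=0.0)
    if worst > 20:
        return "FAIL"
    if worst > 5:
        return "WARNING"
    return "PASS"


percentages = n_content_per_position(["ACGT", "ACNT", "ACNT", "ANGT"])
```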

It is not unusual to see a very low proportion of Ns appearing in a sequence (especially near the end of a sequence). However, if this proportion rises above a few percent, it suggests that the analysis pipeline was unable to interpret the data well enough to make valid base calls. 

Figure 10: Per Base N Content Chart


Sequence Length Distribution

This chart shows the distribution of fragment sizes in the file which was analyzed (Figure 11). In many cases this will produce a simple graph showing a peak only at one size, but for variable length FASTQ files this will show the relative amounts of each different size of sequence fragment. 

A WARNING is raised if not all sequences are the same length. A FAIL is raised if any of the sequences have zero length. 
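
The rule can be sketched in a few lines of Python. The function names are hypothetical and the reads are assumed to be available as plain strings.

```python
from collections import Counter
from typing import Iterable


def length_distribution(reads: Iterable[str]) -> Counter:
    """Number of reads observed for each read length."""
    return Counter(len(read) for read in reads)


def sequence_length_status(reads: Iterable[str]) -> str:
    """FAIL on zero-length reads, WARNING on mixed lengths, else PASS."""
    lengths = length_distribution(reads)
    if 0 in lengths:
        return "FAIL"
    if len(lengths) > 1:
        return "WARNING"
    return "PASS"


distribution = length_distribution(["ACGT", "ACG", "ACGT"])
```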

For some sequencing platforms it is entirely normal to have different read lengths so warnings here can be ignored. 

Figure 11: Sequence Length Distribution Chart


Adapter Content

This chart shows, for each position, the cumulative percentage of the library in which each adapter sequence has been detected (Figure 12). Once a sequence has been detected in a read, it is counted as being present right through to the end of the read, so the percentage increases with the position in the read.

A WARNING is issued if any sequence is present in more than 5% of all reads. A FAIL is issued if any sequence is present in more than 10% of all reads. 
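
The cumulative counting can be illustrated with the following Python sketch for a single adapter. It is a simplified illustration, not the tool's implementation: the function names are hypothetical, only exact matches are considered, and the percentages are computed relative to the total number of reads.

```python
from typing import List, Sequence


def cumulative_adapter_percent(reads: Sequence[str], adapter: str) -> List[float]:
    """Percentage of reads in which the adapter starts at or before each position."""
    if not adapter:
        raise ValueError("adapter sequence is empty")
    if not reads:
        return []
    max_len = max(len(read) for read in reads)
    first_seen = [0] * max_len
    for read in reads:
        position = read.find(adapter)  # first exact match, -1 if absent
        if position != -1:
            first_seen[position] += 1
    percentages = []
    running = 0
    for count in first_seen:
        # Once detected, a read stays counted for all later positions.
        running += count
        percentages.append(100.0 * running / len(reads))
    return percentages


def adapter_content_status(percentages: Sequence[float]) -> str:
    """Apply the 5% (WARNING) and 10% (FAIL) limits."""
    worst = max(percentages, default=0.0)
    if worst > 10:
        return "FAIL"
    if worst > 5:
        return "WARNING"
    return "PASS"


reads = [
    "ACGTAGATCGGAAGAG",  # adapter starts at position 4
    "AGATCGGAAGAGACGT",  # adapter starts at position 0
    "ACGTACGTACGTACGT",
    "ACGTACGTACGTACGT",
]
curve = cumulative_adapter_percent(reads, "AGATCGGAAGAG")
```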

This module indicates if the sequences will need to be trimmed for adapters before proceeding with any downstream analysis. 

Figure 12: Adapter Content Chart


Overrepresented Sequences

This module lists all of the sequences which make up more than 0.1% of the total (Figure 13). To conserve memory only sequences which appear in the first 100,000 sequences are tracked to the end of the file. Therefore, it is possible that a sequence which is overrepresented but doesn't appear at the start of the file for some reason could be missed by this module.

For each overrepresented sequence, the program will look for matches in a database of common contaminants and will report the best hit that it finds. Hits must be at least 20 bp in length and have no more than 1 mismatch.

A WARNING is issued if any sequence is found to represent more than 0.1% of the total. A FAIL is issued if any sequence is found to represent more than 1% of the total. 
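
The following Python sketch shows the thresholding step only. It is a simplification under stated assumptions: the function names are hypothetical, all reads are counted in memory instead of tracking only the sequences seen in the first 100,000 reads, and the contaminant lookup is not shown.

```python
from collections import Counter
from typing import Dict, Sequence


def overrepresented_sequences(reads: Sequence[str],
                              threshold_percent: float = 0.1) -> Dict[str, float]:
    """Sequences making up more than threshold_percent of all reads."""
    total = len(reads)
    if total == 0:
        return {}
    counts = Counter(reads)
    return {
        sequence: 100.0 * count / total
        for sequence, count in counts.items()
        if 100.0 * count / total > threshold_percent
    }


def overrepresented_status(hits: Dict[str, float]) -> str:
    """FAIL above 1%, WARNING if anything exceeds 0.1%, else PASS."""
    if any(percent > 1 for percent in hits.values()):
        return "FAIL"
    if hits:
        return "WARNING"
    return "PASS"
```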

This module will often be triggered when used to analyze small RNA libraries where sequences are not subjected to random fragmentation, and the same sequence may naturally be present in a significant proportion of the library.

Figure 13: Overrepresented Sequences Table


Sequence Duplication Levels

This module counts the degree of duplication for every sequence in a library and creates a graph showing the relative number of sequences with different degrees of duplication (Figure 14). The chart shows the proportion of the library which is made up of sequences in each of the different duplication level bins.

There are two lines on the plot:

  • The blue line takes the full sequence set and shows how its duplication levels are distributed.
  • The red line displays the proportions of the sequences that are deduplicated which come from different duplication levels in the original data.

The module also calculates an expected overall loss of sequences when the library is deduplicated. This is shown at the top of the plot and gives a reasonable impression of the potential overall level of loss. 

A WARNING is raised if non-unique sequences make up more than 20% of the total. A FAIL is raised if non-unique sequences make up more than 50% of the total. 
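
A rough Python sketch of this evaluation is given below. It rests on several assumptions that are not confirmed by this page: the function names are hypothetical, the percentage of non-unique sequences is taken as the share of reads that would be removed by deduplication, sequences are compared by exact match, and only the first 100,000 reads are considered.

```python
from itertools import islice
from typing import Iterable


def duplication_percent(reads: Iterable[str], limit: int = 100_000) -> float:
    """Share of reads (in percent) removed by exact-match deduplication."""
    sample = list(islice(reads, limit))  # assumption: only the first reads count
    total = len(sample)
    if total == 0:
        return 0.0
    distinct = len(set(sample))
    return 100.0 * (total - distinct) / total


def duplication_status(percent: float) -> str:
    """Apply the 20% (WARNING) and 50% (FAIL) limits."""
    if percent > 50:
        return "FAIL"
    if percent > 20:
        return "WARNING"
    return "PASS"


# Five reads, three distinct sequences: two reads would be removed.
percent = duplication_percent(["ACGT", "ACGT", "ACGT", "TTTT", "GGGG"])
```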

In general, there are two potential types of duplicates in a library: technical duplicates arising from PCR artifacts, and biological duplicates, which are natural collisions where different copies of exactly the same sequence are randomly selected.

In RNA-Seq libraries, sequences from different transcripts will be present at wildly different levels in the starting population. In order to observe lowly expressed transcripts, it is therefore common to greatly over-sequence highly expressed transcripts, and this will potentially create large sets of duplicates. This will result in high overall duplication in this test, and will often produce peaks in the higher duplication bins.

To reduce the memory requirements, only the first 100,000 sequences of each file are analyzed. 

Figure 14: Sequence Duplication Levels Chart