FASTQ Preprocessing

Introduction

As Next Generation Sequencing (NGS) technology is used more broadly in scientific applications and research, sequencing data quality control is becoming more important. Experiments and sequencing processes inevitably introduce errors and biases, so downstream sequence analyses can be compromised by low-quality sequences, sequence artifacts, and sequence contamination. These problems can lead to erroneous conclusions in processes such as assembly and alignment, so a preprocessing step is necessary to produce better analysis results.

Preprocessing FASTQ files in Blast2GO consists of removing adapters and contamination sequences, trimming low quality bases, and filtering out short and low quality reads. Before proceeding, it is advisable to carry out a quality control check of the sequencing data within Blast2GO (FASTQ Quality Check). This helps to detect problems and biases, so that the preprocessing procedure can be configured accordingly.

The FASTQ Preprocessing tool uses the well-known preprocessing software Trimmomatic. Trimmomatic is a fast, multithreaded command line tool that can be used to trim and crop sequencing data as well as to remove adapters. For further information, visit the Trimmomatic web page.

Please cite Trimmomatic as: Bolger AM, Lohse M, Usadel B (2014). "Trimmomatic: A flexible trimmer for Illumina Sequence Data". Bioinformatics, btu170.


Adapter Removal

This step is used to find and remove adapters and contaminant sequences. The application uses two approaches to detect technical sequences within the reads:

  • Simple mode: The simple mode approach works by finding an approximate match between the read and the supplied technical sequences. These sequences can be detected in any location or orientation within the reads, but a minimum overlap between the read and the technical sequence is required to prevent false positives. Short partial adapter sequences cannot meet this minimum overlap requirement, so they are not detectable in this mode.
  • Palindrome mode: The palindrome mode approach is specifically aimed at detecting the common "adapter read-through" scenario whereby the sequenced DNA fragment is shorter than the read length. When "read-through" happens, both reads in a pair will consist of an equal number of valid bases, followed by contaminating sequence from the "opposite" adapters. Furthermore, the valid sequence within the two reads will be reverse complements. This mode can only be used with paired-end data but has considerable advantages in sensitivity and specificity over "simple" mode. 
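The read-through property described for palindrome mode can be illustrated with a minimal sketch. It is not Trimmomatic's algorithm: it uses exact matching only (a real tool tolerates mismatches and scores alignments), and the function names and the `min_fragment` parameter are hypothetical, chosen for this example.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(complement[base] for base in reversed(seq))


def detect_read_through(forward: str, reverse: str, min_fragment: int = 8):
    """Return the fragment length if the pair looks like a read-through, else None.

    In a read-through pair, the first n bases of the forward read are the
    reverse complement of the first n bases of the reverse read, and
    everything after position n is adapter sequence.
    """
    max_len = min(len(forward), len(reverse))
    # Try the longest candidate fragment first; n must be shorter than the read.
    for n in range(max_len - 1, min_fragment - 1, -1):
        if forward[:n] == reverse_complement(reverse[:n]):
            return n
    return None


# Example: a 14-base fragment sequenced with reads that run into the adapter.
fragment = "ACGTTGCAAGGCTA"
adapter = "AGATCGGAAGAGC"
forward_read = fragment + adapter
reverse_read = reverse_complement(fragment) + adapter
fragment_length = detect_read_through(forward_read, reverse_read)
```

Once the fragment length is known, both reads can be cut at that position, which is why this approach can remove even very short adapter remnants that simple mode would miss.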

Trimming

This step is used to remove low quality bases from the reads. The application offers four trimming alternatives:

  • Sliding window trimming: The sliding window approach scans the read from the 5' end and removes the remaining 3' end of the read once the average quality within a window of bases drops below a specified threshold.
  • Adaptive quality trimming: The adaptive quality trim approach, also known as "Maximum information quality trimming", balances the benefits of retaining longer reads against the cost of retaining bases with errors.
  • Quality trimming: The quality trimming approach removes low quality bases from the beginning or the end of the read. As long as a base has a quality value below the specified threshold, it is removed and the next base is examined.
  • Length trimming: The length trimming approach removes a specified number of bases regardless of quality from the beginning or the end of the read. 
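The sliding window and quality trimming rules above can be sketched as follows. This is a simplified illustration that works on a list of Phred quality scores, not Trimmomatic's implementation. The function names and default values are chosen for the example, and cutting exactly at the start of the first failing window is an assumption.

```python
def sliding_window_cut(qualities, window_size=4, required_quality=15):
    """Return how many bases to keep from the 5' end of the read.

    The window moves one base at a time; the read is cut at the start of
    the first window whose average quality is below the required quality.
    """
    for start in range(0, len(qualities) - window_size + 1):
        window = qualities[start:start + window_size]
        if sum(window) / window_size < required_quality:
            return start
    return len(qualities)  # no failing window: keep the whole read


def trim_low_quality_ends(qualities, threshold=3, from_start=True, from_end=True):
    """Return (start, end) slice bounds after removing low quality end bases.

    Bases are removed one by one from the chosen end(s) while their
    quality is below the threshold.
    """
    start, end = 0, len(qualities)
    if from_start:
        while start < end and qualities[start] < threshold:
            start += 1
    if from_end:
        while end > start and qualities[end - 1] < threshold:
            end -= 1
    return start, end


# Example read whose quality decays towards the 3' end.
example_qualities = [30, 32, 31, 30, 28, 12, 10, 8, 5, 2]
keep = sliding_window_cut(example_qualities, window_size=4, required_quality=15)
bounds = trim_low_quality_ends([2, 2, 30, 30, 30, 2], threshold=3)
```

A larger window smooths over isolated low quality bases, while a window of one base behaves like a strict per-base cutoff.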

Filtering

This step is used to filter out reads:

  • Filter by quality: Remove reads that fall below the specified average quality.
  • Filter by length: Remove reads that fall below the specified minimum length. 
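As a minimal sketch of the two filters, the following function decides whether a read is kept. It assumes that quality strings use the Phred+33 encoding; the function names and default thresholds are hypothetical and do not reflect the tool's defaults.

```python
def phred33_scores(quality_string):
    """Decode a FASTQ quality string, assuming Phred+33 encoding."""
    return [ord(char) - 33 for char in quality_string]


def passes_filters(quality_string, min_avg_quality=20.0, min_length=36):
    """Return True if the read survives both the length and quality filters."""
    scores = phred33_scores(quality_string)
    if not scores or len(scores) < min_length:
        return False  # too short (or empty after trimming)
    return sum(scores) / len(scores) >= min_avg_quality


# 'I' encodes quality 40 and '#' encodes quality 2 in Phred+33.
good_read = passes_filters("I" * 40)
short_read = passes_filters("I" * 10)
poor_read = passes_filters("#" * 40)
```

Filtering is normally applied after adapter removal and trimming, because those steps shorten reads and change their average quality.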


Run FASTQ Preprocessing

This functionality can be found under Tools → FASTQ Tools → FASTQ Preprocessing. The input data and the different preprocessing steps can be configured using the wizard.

Input Data Page

  • Sequencing Data: Choose the type of data to be preprocessed: single-end or paired-end reads. Note that if paired-end is selected, two files per sample are required. 
  • Input Reads: Provide the files containing sequencing reads. These files are assumed to be in FASTQ format. 
  • Paired-end configuration: For paired-end reads, a pattern is required to distinguish upstream files from downstream files. The provided patterns are searched for immediately before the file extension, and the start of the name must be the same for both files of each sample.
    • Upstream Files Pattern: Establish the pattern to recognize upstream FASTQ files. 
    • Downstream Files Pattern: Establish the pattern to recognize downstream FASTQ files. 

      For example, if the upstream file is named SRR037717_1.fastq and the downstream one SRR037717_2.fastq, you should establish "_1" as the upstream pattern and "_2" as the downstream pattern.
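The pairing rule described above can be sketched in a few lines. This is an illustration only, not the code used by Blast2GO: the list of recognized extensions and the function names are assumptions made for the example, and the patterns are assumed to be non-empty.

```python
# Extensions recognized in this sketch (an assumption, longest first).
EXTENSIONS = (".fastq.gz", ".fq.gz", ".fastq", ".fq")


def split_extension(filename):
    """Split a file name into (stem, extension) for known FASTQ extensions."""
    for ext in EXTENSIONS:
        if filename.endswith(ext):
            return filename[: -len(ext)], ext
    return filename, ""


def pair_fastq_files(filenames, upstream="_1", downstream="_2"):
    """Group files into {sample: (upstream_file, downstream_file)}.

    The pattern is looked for right before the extension, and the rest of
    the name identifies the sample. Files without a mate are ignored.
    """
    samples = {}
    for filename in filenames:
        stem, _ = split_extension(filename)
        if stem.endswith(upstream):
            samples.setdefault(stem[: -len(upstream)], {})["upstream"] = filename
        elif stem.endswith(downstream):
            samples.setdefault(stem[: -len(downstream)], {})["downstream"] = filename
    return {
        sample: (files["upstream"], files["downstream"])
        for sample, files in samples.items()
        if len(files) == 2
    }


pairs = pair_fastq_files(["SRR037717_1.fastq", "SRR037717_2.fastq", "other.fastq"])
```

With the file names from the example, the two files are grouped under the sample name SRR037717, and the file without a pattern is left out.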

Figure 1: Input Data Page

Adapter Removal Page

  • Remove Adapters: Enable the adapter removal step. 
  • Use Adapters From: Choose between using the default adapter sequences provided by Trimmomatic, or providing custom adapter sequences.
    • Default Adapter Sequences: By default, the application provides adapter sequences for TruSeq2 (GAII machines), TruSeq3 (HiSeq and MiSeq machines) and Nextera, for both single-end and paired-end data.

      If you use the FASTQ Quality Check tool, the "Adapter Content" and "Overrepresented Sequences" modules can help you choose which default adapter sequences are best suited to your data. "Illumina Single-End" or "Illumina Paired-End" sequences indicate single-end or paired-end TruSeq2 libraries. "TruSeq Universal Adapter" or "TruSeq Adapter, Index..." sequences indicate TruSeq3 libraries. Note that these sequences have not been extensively tested, so other sequences may work better for a given dataset.

    • Custom Adapter Sequences: Specifies a FASTA file containing all the adapters and contaminant sequences to be removed. The names of sequences determine how they are used, especially for paired-end data.
      • For "palindrome" mode, matched pairs of adapter sequences must be supplied. The sequence names should start with "Prefix", and end in "/1" for the forward adapter and "/2" for the reverse adapter. The part of the name between "Prefix" and "/1" or "/2" must match exactly within each pair. 

        >PrefixPE/1
        TACACTCTTTCCCTACACGACGCTCTTCCGATCT
        >PrefixPE/2
        GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT

      • For "simple" mode, sequences with names ending in "/1" or "/2" will be searched only in the forward or reverse read respectively. Otherwise, sequences will be searched in both the forward and reverse read. 

        >Adapter_a
        AGATCGGAAGAGCTCGTATGCCGTCTTCTGCTTG
        >Adapter_b
        AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT

  • Seed Mismatches: Set the maximum number of mismatches that still allows a full match to be performed.
  • Simple Clip Threshold: Establish how accurate the match between the adapter sequence must be against a read. This option is only considered for the simple mode. 
  • Palindrome Clip Threshold: Establish how accurate the match between the two "adapter ligated" reads must be. This option is only considered for the palindrome mode.
  • Minimum Adapter Length: Set a minimum length for adapters to be detected. This option is only considered for the palindrome mode. 
  • Keep Both Reads: Deleting adapters after read-through detection (palindrome mode) causes the reverse read to contain the same information as the forward read, although in reverse complement. This option allows retaining the reverse read. Otherwise, the reverse read will be discarded. 
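The naming rules for custom adapter sequences can be checked before running the tool. The sketch below classifies FASTA sequence names according to the rules stated above; it is a hypothetical helper written for illustration, not part of Blast2GO or Trimmomatic.

```python
def read_fasta_names(fasta_text):
    """Return the sequence names (header lines without '>') of a FASTA text."""
    return [
        line.strip()[1:].strip()
        for line in fasta_text.splitlines()
        if line.strip().startswith(">")
    ]


def classify_adapter(name):
    """Return (mode, read) for an adapter sequence name.

    Names ending in "/1" or "/2" apply to the forward or reverse read.
    Names that also start with "Prefix" are used in palindrome mode;
    everything else is used in simple mode.
    """
    if name.endswith("/1"):
        read = "forward"
    elif name.endswith("/2"):
        read = "reverse"
    else:
        read = "both"
    if name.startswith("Prefix") and read != "both":
        return "palindrome", read
    return "simple", read


example_fasta = """>PrefixPE/1
TACACTCTTTCCCTACACGACGCTCTTCCGATCT
>PrefixPE/2
GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT
>Adapter_a
AGATCGGAAGAGCTCGTATGCCGTCTTCTGCTTG
"""
classified = {name: classify_adapter(name) for name in read_fasta_names(example_fasta)}
```

A quick check like this helps catch a palindrome pair whose names do not match, for example a "/1" entry without its "/2" partner.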

Figure 2: Adapter Removal Page


Trimming Page

  • Trimming: Enable the trimming step.
  • Trimming option: Select the strategy to perform the trimming step.
  • Sliding Window Trimming:
    • Window Size: Set the number of bases that the window has to span to average the quality.
    • Required Quality: Set the average quality required to retain bases.
  • Adaptive Quality Trimming:
    • Target Length: Set the minimum read length that is likely to allow the read to be located within the target sequence.
    • Strictness: This value establishes the balance between preserving as much read length as possible versus removal of incorrect bases. It should be set between 0 and 1. A low value favors longer reads, while a high value favors read correctness. 
  • Quality Trimming:
    • Trimming from: Choose between removing bases from the beginning or the end of the sequence.
    • Trimming threshold: Establish a minimum quality required to keep bases. 
  • Length Trimming:
    • Trimming from: Choose between removing bases from the beginning or the end of the sequence.
    • Trimming threshold: When removing bases from the end, this specifies the number of bases to keep from the start of the read, so that the read is at most this long after the step. When removing bases from the start, this specifies the number of bases to remove from the start of the read.
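The two meanings of the length trimming threshold can be illustrated with a short sketch. The function and parameter names are hypothetical and only mirror the behavior described above.

```python
def length_trim(sequence, threshold, from_end=True):
    """Trim a read by length, regardless of base quality.

    from_end=True:  keep at most `threshold` bases from the start of the read.
    from_end=False: remove the first `threshold` bases of the read.
    """
    if from_end:
        return sequence[:threshold]
    return sequence[threshold:]


read = "ACGTACGT"
kept_from_start = length_trim(read, 5, from_end=True)      # read capped at 5 bases
removed_from_start = length_trim(read, 3, from_end=False)  # first 3 bases dropped
```

Note that the same number therefore produces reads of different lengths depending on the selected end: a maximum length in one case, a number of removed bases in the other.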

Figure 3: Trimming Page

Filtering Page

  • Filter By Quality: Enable the filtering by quality step.
  • Average Quality: Minimum average quality of reads to be kept. 
  • Filter By Length: Enable the filtering by length step.
  • Minimum Length: Minimum length of reads to be kept. 

Figure 4: Filtering Page

Save Results Page

  • Output Prefix: Define a prefix to establish the name of output files. The prefix will be added before each original file name.
  • Output Reads: Select a destination folder to save the preprocessed FASTQ files. 
  • Unpaired Reads: When preprocessing paired-end data, some read pairs can lose a member as a result of trimming and filtering. Select a destination folder to save the FASTQ files containing unpaired reads. These files contain the word "unpaired" in their file names. 

Figure 5: Save Results Page

Results

Once finished, output files containing preprocessed reads are stored in the "Output Reads" folder set in the wizard. Files are generated in compressed format (fastq.gz).

For single-end data, one output file per input file is generated. For paired-end data, four output files are generated per input sample (two FASTQ files): two containing the upstream and downstream paired reads, and two containing the upstream and downstream unpaired reads. The name of each output file begins with the provided prefix, followed by the original file name. Files with unpaired reads contain the word "unpaired" in their name to distinguish them from files with paired reads, and are placed in the "Unpaired Reads" folder.

Furthermore, a result page will show a summary of the "FASTQ Preprocessing" results (Figure 6). This page provides a table that shows how many reads have survived and how many have been dropped during the analysis. 

Figure 6: FASTQ Preprocessing Report