Transcript-level Quantification - PRO Feature

Introduction

The transcript-level quantification tool estimates gene and isoform expression levels from RNA-Seq data. It expects the sequencing reads in FASTQ format (so a prior alignment is not necessary) and supports both single-end and paired-end data. In addition, a set of transcript sequences in FASTA format is required, such as one produced by a de novo transcriptome assembler. A reference genome is therefore not required. The result is a Count Table that can be used to perform a differential expression analysis within Blast2GO.

The application is based on RSEM, a software package that quantifies expression from transcriptome data. RSEM handles both the alignment of reads against the reference transcript sequences and the calculation of relative abundances. It uses the Bowtie2 aligner to align reads, with parameters specifically chosen for RNA-Seq quantification. Since RNA-Seq reads do not always map uniquely to a single gene or isoform, RSEM allocates multi-mapping reads among transcripts using an expectation-maximization approach.
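
The idea behind allocating multi-mapping reads by expectation maximization can be illustrated with a toy sketch. This is not RSEM's actual model, which also accounts for transcript length, alignment quality and other factors; the function name `em_allocate` and its inputs are hypothetical and chosen only for illustration.

```python
def em_allocate(read_hits, n_transcripts, n_iter=200):
    """Toy EM: split each read among the transcripts it aligns to.

    read_hits: one list of transcript indices per read (assumed input format).
    Returns the expected (possibly fractional) read count per transcript.
    """
    # Start from equal relative abundances.
    theta = [1.0 / n_transcripts] * n_transcripts
    counts = [0.0] * n_transcripts
    for _ in range(n_iter):
        # E-step: share each read in proportion to the current abundances.
        counts = [0.0] * n_transcripts
        for hits in read_hits:
            total = sum(theta[t] for t in hits)
            for t in hits:
                counts[t] += theta[t] / total
        # M-step: re-estimate abundances from the expected counts.
        assigned = sum(counts)
        if assigned == 0:
            break  # no aligned reads, nothing to estimate
        theta = [c / assigned for c in counts]
    return counts


# Three reads unique to transcript 0, one unique to transcript 1,
# and two reads that align to both.
reads = [[0], [0], [0], [0, 1], [0, 1], [1]]
expected_counts = em_allocate(reads, n_transcripts=2)
```

In this example the two ambiguous reads end up shared 3:1 in favour of transcript 0, because the uniquely mapped reads suggest it is the more abundant one.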

This feature uses RSEM and Bowtie2. Please cite RSEM and Bowtie2 as:

  • Li B and Dewey CN (2011). "RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome." BMC Bioinformatics, 12:323
  • Langmead B, Salzberg S (2012). "Fast gapped-read alignment with Bowtie 2." Nature Methods, 9:357-359


Create Count Table Interface

Figure 1: Create Count Table Interface

Run Create Count Table

This functionality can be found under rna-seq → Create Count Table, Transcript-level Quantification option. The wizard lets you select the input files and adjust the analysis parameters (Figure 2 and Figure 3).

Input Data

  • Sequencing Data: Choose the type of data to be analyzed: single-end or paired-end reads. Note that if paired-end is selected, two files per sample are required.
  • Input Reads: Provide the files containing sequencing reads. These files are assumed to be in FASTQ format. 
  • Paired-end configuration: For paired-end reads, a pattern is required to distinguish upstream files from downstream files. The pattern is searched for immediately before the file extension, and the rest of the name must be the same for both files of each sample.
    • Upstream Files Pattern: Set the pattern that identifies upstream FASTQ files.
    • Downstream Files Pattern: Set the pattern that identifies downstream FASTQ files.

      For example, if the upstream file is named SRR037717_1.fastq and the downstream one SRR037717_2.fastq, set "_1" as the upstream pattern and "_2" as the downstream pattern.

  • Transcript References: This tool works with a set of transcript sequences instead of a genome. Such a file can be obtained from a reference genome database or from a de novo transcriptome assembler. Provide a FASTA file containing the sequences of the reference transcripts.
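
The paired-end naming rule described above can be sketched in a few lines of Python. This is only an illustration of the rule, not the code Blast2GO runs; the function name `pair_fastq_files` is hypothetical, and the sketch assumes a single extension such as ".fastq" (a name like "sample_1.fastq.gz" would need extra handling).

```python
from pathlib import Path


def pair_fastq_files(filenames, upstream="_1", downstream="_2"):
    """Group FASTQ file names into (upstream, downstream) pairs per sample."""
    samples = {}
    for name in filenames:
        stem = Path(name).stem  # file name without its last extension
        # The pattern must appear right before the extension.
        if stem.endswith(upstream):
            samples.setdefault(stem[: -len(upstream)], {})["up"] = name
        elif stem.endswith(downstream):
            samples.setdefault(stem[: -len(downstream)], {})["down"] = name
        # Files matching neither pattern are ignored in this sketch.
    incomplete = sorted(s for s, files in samples.items() if len(files) != 2)
    if incomplete:
        raise ValueError(f"samples missing a mate file: {incomplete}")
    return {s: (f["up"], f["down"]) for s, f in sorted(samples.items())}


pairs = pair_fastq_files(["SRR037717_1.fastq", "SRR037717_2.fastq"])
# pairs == {"SRR037717": ("SRR037717_1.fastq", "SRR037717_2.fastq")}
```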

Figure 2: Input Data Page

Advanced Configuration

  • Gene-level Estimations: Enable this option to estimate expression at both the gene level and the isoform level. A gene's expression estimate is then the sum of its transcripts' expression estimates, and the two sets of results are provided separately. Otherwise, the program assumes that each transcript provided as a reference sequence is a separate gene.
  • Transcript to Gene Map File: Provide a file that maps transcript (isoform) identifiers to gene identifiers. Each line should contain a gene identifier followed by a transcript identifier, with the two columns separated by a tab character (Figure 4).

Figure 4: Transcript to gene map file example


  • Append Poly(A) Tails: For poly(A) mRNA analysis, the program appends a poly(A) tail sequence to each reference transcript to allow more accurate read alignment.
  • Poly(A) Tails Length: Set the length of the poly(A) tails to be added.
  • Estimate RSPD: Enable this option to estimate a read start position distribution (RSPD), which can increase the accuracy of the expression estimates. It is highly recommended if the protocol produces read position distributions that are strongly 5' or 3' biased. Otherwise, the program uses a uniform RSPD.
  • Strand Specificity: This option defines the strandedness of the RNA-Seq reads:
    • Non Strand Specific: Refers to non-strand-specific protocols. 
    • Strand Specific Forward: Means all (upstream) reads are derived from the forward strand. 
    • Strand Specific Reverse: Means all (upstream) reads are derived from the reverse strand.
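
To make the map file format and the gene-level summation concrete, here is a minimal Python sketch. The identifiers (geneA, tx1 and so on), the function names and the count values are hypothetical. Treating a transcript that is missing from the map as its own gene is an assumption of this sketch, not documented Blast2GO behaviour.

```python
import csv
import io
from collections import defaultdict


def read_gene_map(handle):
    """Parse 'gene_id<TAB>transcript_id' lines into {transcript_id: gene_id}."""
    mapping = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue  # skip blank lines
        if len(row) != 2:
            raise ValueError(f"expected 2 tab-separated columns, got {row}")
        gene_id, transcript_id = row
        mapping[transcript_id] = gene_id
    return mapping


def gene_level_counts(transcript_counts, transcript_to_gene):
    """Sum transcript-level estimates into gene-level estimates."""
    genes = defaultdict(float)
    for transcript_id, count in transcript_counts.items():
        # Assumption: an unmapped transcript is treated as its own gene.
        gene_id = transcript_to_gene.get(transcript_id, transcript_id)
        genes[gene_id] += count
    return dict(genes)


# Hypothetical map file content: gene identifier first, then transcript.
map_text = "geneA\ttx1\ngeneA\ttx2\ngeneB\ttx3\n"
tx_to_gene = read_gene_map(io.StringIO(map_text))
gene_counts = gene_level_counts({"tx1": 10, "tx2": 5.5, "tx3": 2}, tx_to_gene)
# gene_counts == {"geneA": 15.5, "geneB": 2.0}
```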

Figure 3: Advanced Configuration Page
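
For readers familiar with the standalone RSEM command-line tools, the options above roughly correspond to arguments of rsem-prepare-reference and rsem-calculate-expression. How Blast2GO invokes RSEM internally is not described on this page, so the following Python sketch is an assumption: it only builds the argument lists and does not run anything. The flag names are taken from recent RSEM releases and should be checked against the --help output of the installed version. The function name and all file names are hypothetical.

```python
def rsem_commands(transcripts_fasta, reference_name, sample_name, reads,
                  gene_map=None, poly_a_length=None, estimate_rspd=False,
                  strandedness="none"):
    """Build argument lists for the two RSEM steps (assumed equivalents)."""
    if strandedness not in ("none", "forward", "reverse"):
        raise ValueError("strandedness must be none, forward or reverse")
    if len(reads) not in (1, 2):
        raise ValueError("give one file (single-end) or two (paired-end)")

    # Step 1: build the reference from the transcript FASTA file.
    prepare = ["rsem-prepare-reference", "--bowtie2"]
    if gene_map:  # Transcript to Gene Map File
        prepare += ["--transcript-to-gene-map", gene_map]
    if poly_a_length is not None:  # Append Poly(A) Tails + length
        prepare += ["--polyA", "--polyA-length", str(poly_a_length)]
    prepare += [transcripts_fasta, reference_name]

    # Step 2: align the reads and estimate expression.
    quantify = ["rsem-calculate-expression", "--bowtie2",
                "--strandedness", strandedness]
    if estimate_rspd:  # Estimate RSPD
        quantify.append("--estimate-rspd")
    if len(reads) == 2:  # paired-end: upstream file, then downstream file
        quantify.append("--paired-end")
    quantify += list(reads) + [reference_name, sample_name]
    return prepare, quantify


prepare_cmd, quantify_cmd = rsem_commands(
    "transcripts.fasta", "ref", "SRR037717",
    ["SRR037717_1.fastq", "SRR037717_2.fastq"],
    gene_map="gene_map.txt", estimate_rspd=True, strandedness="forward",
)
```

Keeping the commands as lists makes them easy to inspect or to pass to subprocess.run later, once the flags have been verified for the RSEM version in use.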

Results

Once the analysis has finished, the results are returned in one of two ways, depending on the option chosen for the "Gene-level Estimations" parameter:

  • Isoform-level and gene-level estimations: Two Count Tables are returned. One shows the expression level of each transcript or isoform (input sequences) and the other shows the expression level of each gene (Figure 4 and Figure 5). Each table has an additional column: the isoform table shows the parent gene of each isoform, and the gene table shows the isoforms associated with each gene.
  • Transcript-level estimations only: One Count Table is returned that shows the expression level of each transcript sequence provided as input (Figure 6). 

Figure 4: Count table of isoform-level estimations. The Gene column shows the identifier of the parent gene of each isoform

Figure 5: Count table of gene-level estimations. The Isoform column shows the ids of all isoforms associated with each gene

Figure 6: Count table of transcript-level estimations.


Furthermore, a result page shows a summary of the "Create Count Table" results (Figure 7). This page contains information about the reference transcript sequences, the input FASTQ files and the results obtained. The results summary can be generated via Side Panel → Result Summary and can be exported as a PDF.

Figure 7: Result Summary

Charts and Statistics

Different statistical charts can be generated from the results. These provide additional information about the process of quantifying expression, as well as a quality assessment of the resulting counts. All these charts can be found under the Side Panel of the Count Table Viewer. 

  • Library Size per Sample: Bar chart showing the total number of read counts assigned to features in each sample (Figure 8(a)).
  • Distribution of Counts: Box plot showing how counts are distributed across all transcripts within each sample (Figure 8(b)). Features with 0 counts in all samples are excluded from this chart. The binary logarithm of the raw counts is represented.
  • Counts per Category: Bar chart showing the number of reads of each input file, broken down by category (Figure 8(c)). This chart and the next one are only available for count tables created by the "Create Count Table" tool within Blast2GO.
    • Aligned Concordantly Exactly 1 Time: Reads that have been assigned once to a reference transcript. 
    • Aligned Concordantly > 1 Time: Reads that have been assigned to more than one reference transcript.
    • Not Aligned: Reads that have not been assigned to any reference transcript.
  • Counts per Category (%): The same chart as the one above, expressed in percentages (Figure 8(d)).
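
The quantities behind these charts are simple to compute from a count table. The Python sketch below is a rough illustration only, not Blast2GO's implementation. The table layout, function names and numbers are hypothetical. How zero counts in individual samples are handled before taking the logarithm is not stated on this page, so the sketch simply omits them, which is an assumption.

```python
import math


def library_sizes(count_table):
    """Total counts per sample; count_table is {feature_id: [count per sample]}."""
    return [sum(column) for column in zip(*count_table.values())]


def log2_counts_per_sample(count_table):
    """log2 of raw counts per sample, for a box plot."""
    # Features with 0 counts in all samples are discarded.
    kept = [row for row in count_table.values() if any(c > 0 for c in row)]
    # Assumption: remaining zero counts are omitted, since log2(0) is undefined.
    return [[math.log2(c) for c in column if c > 0] for column in zip(*kept)]


def category_percentages(category_counts):
    """Convert read counts per category into percentages of the total."""
    total = sum(category_counts.values())
    return {name: 100.0 * n / total for name, n in category_counts.items()}


# Hypothetical count table with two samples.
table = {"tx1": [8, 0], "tx2": [0, 0], "tx3": [2, 4]}
sizes = library_sizes(table)              # [10, 4]
log_counts = log2_counts_per_sample(table)  # [[3.0, 1.0], [2.0]]
percentages = category_percentages({
    "Aligned Concordantly Exactly 1 Time": 50,
    "Aligned Concordantly > 1 Time": 30,
    "Not Aligned": 20,
})
```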


Library Size

Figure 8 (a): Library Size per Sample

Distribution of Counts

Figure 8 (b): Distribution of Counts

Counts per Category

Figure 8 (c): Counts per Category

Figure 8 (d): Counts per Category (%)