Pairwise Differential Expression Analysis - PRO Feature

Introduction

This tool performs differential expression analysis of count data from RNA-seq experiments. It is based on the edgeR program and identifies differentially expressed genomic features (e.g. genes) in a pairwise comparison of two experimental conditions. edgeR (empirical analysis of DGE in R) is a software package of the Bioconductor project that implements quantitative statistical methods to evaluate the significance of expression differences for individual genes between two experimental conditions.

Please cite edgeR as:
Robinson MD, McCarthy DJ and Smyth GK (2010). "edgeR: a Bioconductor package for differential expression analysis of digital gene expression data." Bioinformatics, 26(1), pp. 139-140.

General Workflow

The workflow for the analysis of differential expression is described in the following scheme (Figure 2).

Figure 1: Differential Expression Analysis Interface

Figure 2: General Workflow


Load Data

Go to File → Load → Load Count Table and select the .txt file containing the count table in tab-delimited format (Figure 3). It is also possible to create a Count Table within Blast2GO through the "Create Count Table" functionality (see the Quantify Expression section).


  Figure 3: Count Table File


The Count Table can be saved as a 'Count Table' object (File → Save).

Notes:

  • This application only accepts raw counts without any type of normalization.
  • Replicates for each experimental condition are necessary.

Run Pairwise Differential Expression Analysis

Go to rna-seq → Run Differential Expression Analysis and choose the "Pairwise Differential Analysis" option. Here you can specify the following parameters, which are divided into three sections: Preprocessing Data (Figure 4), Experimental Design (Figure 5) and Comparison and Test (Figure 6).

Preprocessing Data Page

  • Filter low count genes:
    • CPM Filter: Establish a filter to exclude genes with low counts across libraries, as those genes may interfere with the subsequent statistical approximations. Filtering is performed on a count-per-million (CPM) basis to account for differences in library size between samples (e.g. a CPM of 1 corresponds to a count of 6 in a sample with 6 million reads).
    • Samples reaching CPM Filter: Set the minimum number of samples in which a gene's CPM must be above the filter level for the gene to be considered expressed. For example, if this value is set to 5, the gene's CPM has to be above the given CPM in at least 5 samples. The number of samples in the smallest group is usually used (e.g. in an experiment with two replicates per condition (or group), a gene should be expressed in at least two samples). Set the value to 0 if no filter is desired.
  • Calculate normalization factors to scale the raw library sizes:
    • Normalization Method: Normalization takes the form of scaling factors for the library sizes that enter into the statistical model. These correction factors are used to compute the effective library sizes. For further details please refer to the edgeR User's Guide. You can select the normalization method to be used:
      • TMM: Weighted trimmed mean of M-values. The weights are obtained from the delta method on binomial data. This method is recommended.
      • RLE: Relative log expression. The scale factor of each sample is the median ratio of that sample to the median library (the geometric mean of all samples).
      • Upper-quartile: The 75% quantile of the counts in each library is used to calculate the scale factors.
      • None: Normalization factors are set to 1.
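
The two filter parameters above can be illustrated with a minimal Python sketch. It is not the tool's implementation: the function names, the example counts and the strict "above" comparison are assumptions made for illustration.

```python
def cpm(count, library_size):
    """Counts per million for one gene in one sample."""
    return count * 1_000_000 / library_size


def passes_cpm_filter(counts, library_sizes, min_cpm=1.0, min_samples=2):
    """True if the gene's CPM is above min_cpm in at least min_samples samples."""
    if min_samples == 0:  # a value of 0 means that no filter is applied
        return True
    reached = sum(
        1 for c, n in zip(counts, library_sizes) if cpm(c, n) > min_cpm
    )
    return reached >= min_samples


# Hypothetical data: four samples, two genes (raw counts per sample).
library_sizes = [6_000_000, 6_000_000, 12_000_000, 12_000_000]
table = {
    "geneA": [12, 9, 3, 2],  # above 1 CPM in the first two samples
    "geneB": [0, 1, 2, 40],  # above 1 CPM in the last sample only
}

print(cpm(6, 6_000_000))  # 1.0, the example given in the text
kept = [g for g, row in table.items() if passes_cpm_filter(row, library_sizes)]
print(kept)  # ['geneA']
```

With two replicates per condition, setting the sample threshold to 2 keeps a gene that is expressed in only one of the two conditions, which is why the size of the smallest group is the usual choice.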

Figure 4: Preprocessing Data Page
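
To make the idea of scaling factors and effective library sizes more concrete, the following Python sketch computes RLE-style factors as described in the list above. It is a simplified illustration, not edgeR's code: it assumes that genes with a zero count in any sample are skipped and that the factors are rescaled so that their product is 1. The function name and the example counts are hypothetical.

```python
import math
from statistics import median


def rle_norm_factors(counts):
    """counts: list of gene rows, each a list of raw counts (one per sample)."""
    n_samples = len(counts[0])
    lib_sizes = [sum(row[j] for row in counts) for j in range(n_samples)]

    # Reference "median library": per-gene geometric mean across samples.
    # Genes with a zero count are skipped (assumption of this sketch).
    usable = [row for row in counts if all(c > 0 for c in row)]
    geo_means = [
        math.exp(sum(math.log(c) for c in row) / n_samples) for row in usable
    ]

    # Median ratio of each sample to the reference.
    size_factors = [
        median(row[j] / gm for row, gm in zip(usable, geo_means))
        for j in range(n_samples)
    ]

    # Express the factors relative to the raw library sizes and rescale
    # them so that they multiply to 1.
    factors = [s / n for s, n in zip(size_factors, lib_sizes)]
    scale = math.exp(sum(math.log(f) for f in factors) / n_samples)
    return [f / scale for f in factors], lib_sizes


# Hypothetical counts: sample 2 was simply sequenced twice as deeply.
counts = [[10, 20], [5, 10], [100, 200]]
norm_factors, lib_sizes = rle_norm_factors(counts)

# Effective library size = raw library size * normalization factor.
effective = [n * f for n, f in zip(lib_sizes, norm_factors)]
print(norm_factors, effective)
```

In this example the two samples differ only in sequencing depth, so both factors come out close to 1 and the effective library sizes stay close to the raw ones.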

Experimental Design Page

  • Experimental design file: Select the tab-delimited .txt file that contains the experimental factors, with the experimental conditions associated with each sample. As shown in Figure 7, rows correspond to samples and columns to experimental factors. Make sure that the names in the first column of the experimental design table are exactly the same as the sample names in the count table header. If the experimental design file contains fewer samples than the count table, only the samples contained in this file will be analyzed.
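
The name-matching rule can be checked before loading the files. The Python sketch below is only an illustration under assumed file layouts: a count table whose first column holds feature IDs and whose header holds sample names, and a design file with one header row. The file contents and function names are hypothetical.

```python
import csv
import io

# Hypothetical tab-delimited file contents.
count_table_text = "Gene\tS1\tS2\tS3\tS4\ng1\t10\t12\t3\t4\ng2\t0\t5\t7\t1\n"
design_text = "Sample\tCondition\nS1\tcontrol\nS2\tcontrol\nS3\ttreated\n"


def read_tsv(text):
    """Parse tab-delimited text into a list of rows."""
    return list(csv.reader(io.StringIO(text), delimiter="\t"))


def samples_to_analyze(count_rows, design_rows):
    """Return the design-file samples after checking them against the count table header."""
    count_samples = count_rows[0][1:]
    design_samples = [row[0] for row in design_rows[1:]]
    missing = [s for s in design_samples if s not in count_samples]
    if missing:
        raise ValueError(f"Samples not found in count table header: {missing}")
    return design_samples


def subset_counts(count_rows, samples):
    """Keep the ID column and only the columns of the listed samples."""
    header = count_rows[0]
    idx = [header.index(s) for s in samples]
    return [[row[0]] + [row[i] for i in idx] for row in count_rows]


count_rows = read_tsv(count_table_text)
design_rows = read_tsv(design_text)
samples = samples_to_analyze(count_rows, design_rows)
print(samples)  # ['S1', 'S2', 'S3']; S4 is not in the design file
print(subset_counts(count_rows, samples)[0])
```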

Figure 7: Experimental Design File


Figure 5: Experimental Design Page

Comparison and Test Page

  • Design Type: Choose the design type to adjust the analysis
    • Simple design: Makes a pairwise comparison between samples belonging to two experimental conditions. You only have to select the experimental factor of interest and establish the comparison by selecting the reference and contrast conditions in "Primary Target".
    • Paired design: Makes a pairwise comparison between samples belonging to two experimental conditions, adjusting for baseline differences of other experimental factors. In this design, you have to establish the conditions for the comparison in "Primary Target" and the experimental factor for the baseline difference in "Secondary Target". This design type is appropriate for paired or blocking designs, or for experiments with batch effects.
    • Multifactorial Design: Makes a pairwise comparison between samples belonging to two experimental conditions with two experimental factors. For this design, you have to select the two experimental factors of interest and establish the reference and contrast group for each in "Primary Target" and "Secondary Target". This design type is appropriate if you want to analyze the effects of combined experimental conditions on gene expression.
  • Statistical Test:
    • Select a Statistical Test:
      • Exact Test: Based on the quantile-adjusted conditional maximum likelihood (qCML) method (similar to Fisher's exact test). It is only applicable to datasets with a single-factor design (simple design).
      • GLM (Likelihood Ratio Test): Based on fitting negative binomial Generalized Linear Models (GLMs) with Cox-Reid dispersion estimates. It is a good choice for inference with GLMs.
      • GLM (Quasi Likelihood F-Test): The empirical Bayes quasi-likelihood F-test is an alternative to the Likelihood Ratio Test and provides more robust and reliable error rate control when the number of replicates is small.
    • Robust: Estimation is strengthened against potential outlier genes.

Figure 6: Comparison and Test Page

Results

Once the input counts have been processed and analyzed with the "Pairwise Differential Expression Analysis" tool, a new tab containing the results is opened (Figure 8):

  • logFC: A measure that describes how much the expression changes between conditions (log2-fold-changes are shown).
  • logCPM: The average log2 counts per million.
  • LR: Likelihood ratio statistic for the GLM (Likelihood Ratio Test).
  • F: Quasi-likelihood F-statistic for the GLM (Quasi Likelihood F-test).
  • FDR: False Discovery Rate calculated by the Benjamini-Hochberg method (multiple hypothesis testing corrections).
  • Tags: Indicate whether a gene is upregulated (FDR ≤ 0.05, logFC ≥ 0) or downregulated (FDR ≤ 0.05, logFC < 0).
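
For readers unfamiliar with the Benjamini-Hochberg correction behind the FDR column, here is a minimal Python sketch of the procedure. The raw p-values are hypothetical input and are not among the columns listed above. The function is a generic illustration, not the code used by the tool.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that the adjusted
    # values never increase as the raw p-values decrease.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values for four genes.
p_values = [0.01, 0.04, 0.03, 0.20]
fdr = benjamini_hochberg(p_values)
print([round(v, 4) for v in fdr])  # [0.04, 0.0533, 0.0533, 0.2]
```

Note that the second and third genes receive the same adjusted value: the adjustment is forced to be monotone in the ranking of the raw p-values.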

Genes that have not passed the filtering step are not shown in the new tab.


Figure 8: Table Viewer


Results can be saved as a Pairwise Results object. Note that it is not possible to perform the analysis on this object. For this purpose, you have to open the Count Table object. If you want to see both count table and results, go to the File Manager and open the two .b2g files together.

A result page will show a summary of the pairwise differential expression analysis results (Figure 9).


Figure 9: Results Summary


If you want to filter for differential expression based on other FDR and/or logFC cutoffs, go to Side Panel → Set Up/Down Tags and set new values for both cutoffs. The tags will be updated, and the results section of the Result Summary and the statistical charts will change according to the new cutoffs. To view the updated summary, go to Side Panel → Result Summary; the summary can be exported as a PDF.
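
The effect of changing the two cutoffs can be sketched in Python as follows. How the tool applies the logFC cutoff internally is not documented here, so the symmetric treatment (the same absolute threshold for up and down), the empty tag for untagged genes and the function name are assumptions of this illustration.

```python
def tag_gene(log_fc, fdr, fdr_cutoff=0.05, log_fc_cutoff=0.0):
    """Return 'UP', 'DOWN' or '' for one gene (illustrative rule only)."""
    if fdr > fdr_cutoff:
        return ""  # not significant at the chosen FDR
    if log_fc >= log_fc_cutoff:
        return "UP"
    if log_fc <= -log_fc_cutoff:
        return "DOWN"
    return ""  # significant, but the fold change is too small


# Hypothetical results: (gene, logFC, FDR).
results = [("g1", 1.5, 0.01), ("g2", -2.0, 0.01), ("g3", 0.5, 0.01), ("g4", 3.0, 0.2)]

default_tags = {g: tag_gene(lfc, fdr) for g, lfc, fdr in results}
strict_tags = {g: tag_gene(lfc, fdr, log_fc_cutoff=1.0) for g, lfc, fdr in results}
print(default_tags)  # {'g1': 'UP', 'g2': 'DOWN', 'g3': 'UP', 'g4': ''}
print(strict_tags)   # {'g1': 'UP', 'g2': 'DOWN', 'g3': '', 'g4': ''}
```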

Charts and Statistics

Different statistics charts can be generated for a global visualization of the results. These charts can be found under the Side Panel of the Pairwise Results viewer. 

MDS Plot

Generates a two-dimensional scatterplot in which the distances represent the typical log2 fold changes between samples. You can select an experimental factor by which you want to color the MDS graphic (Figure 10(a)).


Figure 10(a): MDS Plot


Result Summary

Bar chart that shows the number of total features, kept features (those that have passed the filtering step), differentially expressed features, up-regulated features and down-regulated features (Figure 10(b)).


Figure 10(b): Result Summary


Volcano Plot

Scatter chart that is constructed by plotting the negative log of the adjusted p-values (FDR) on the y-axis versus the log of the fold changes on the x-axis (Figure 10(c)). Upregulated and downregulated genes are shown in green and red respectively.

Figure 10(c): Volcano Plot
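
As a rough illustration of how each point of a volcano plot is placed, the Python sketch below converts one result row into plot coordinates. The text does not state the base of the logarithm used for the y-axis; base 10 is assumed here, and the small floor that avoids taking the log of zero is also an assumption.

```python
import math


def volcano_point(log_fc, fdr, floor=1e-300):
    """Return (x, y) for one gene: x is the log fold change, y is -log10(FDR)."""
    return log_fc, -math.log10(max(fdr, floor))


# Hypothetical results: (gene, logFC, FDR).
results = [("g1", 2.0, 0.01), ("g2", -1.0, 0.5), ("g3", 0.2, 1.0)]
points = {gene: volcano_point(lfc, fdr) for gene, lfc, fdr in results}
for gene, (x, y) in points.items():
    print(gene, x, round(y, 3))
```

Genes with a small FDR end up high on the y-axis, and genes with a large absolute fold change end up far from the center of the x-axis, so the most interesting genes sit in the upper left and upper right corners.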


MA Plot

A smear plot showing the log of the fold changes on the y-axis versus the average of the log of the CPM on the x-axis. DE genes are marked in red (Figure 10(d)).

Figure 10(d): MA Plot


Heatmap

A heatmap is a two-dimensional visual representation of data in which numerical values are represented by a range of colors (Figure 10(e)). The dendrograms added to the left and top sides are produced by a hierarchical clustering method that takes as input the Euclidean distance computed between genes (left) and samples (top).
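
The distances that feed the two dendrograms can be illustrated with a short Python sketch. The expression values and gene names are hypothetical, and the clustering step itself is not shown.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equally long vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical expression matrix: rows are genes, columns are three samples.
expr = {
    "g1": [1.0, 2.0, 5.0],
    "g2": [1.0, 2.0, 6.0],
    "g3": [4.0, 6.0, 5.0],
}

# Distance between genes (rows), used for the left dendrogram.
gene_dist = euclidean(expr["g1"], expr["g3"])

# Distance between samples (columns), used for the top dendrogram.
columns = list(zip(*expr.values()))
sample_dist = euclidean(columns[0], columns[1])

print(gene_dist, sample_dist)
```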

The heatmap supports zooming by clicking and holding a node of either of the two dendrograms. The first bars show the experimental design of the data, that is, the association between samples and experimental covariates.

Genes that will be displayed can be selected in the wizard. There are three options:

  • The Top 50 differentially expressed genes (ranked by FDR). 
  • All differentially expressed genes.
  • Provide an ID list containing the genes to represent.


Differentially expressed genes are those that are labelled as UP or DOWN in the table project ("Tags" column). The criteria for considering a gene as differentially expressed can be adjusted using the option "Set Up/Down Tags". 

Furthermore, the wizard allows adjusting the type of expression data that will be represented, as well as the transformation that can be applied to this data.

Figure 10(e): Heatmap

Enrichment Analysis

It is possible to perform a functional enrichment analysis from the pairwise differential expression project. Both options, Fisher's Exact Test and Gene Set Enrichment Analysis, can be found under the Side Panel of the Pairwise Results viewer. 

Fisher's Exact Test

Choose the subset of genes that will be considered as Test-set. Up-regulated and down-regulated genes are those that are tagged according to the criteria established by the option "Set Up/Down Tags". 

The project containing the functionally annotated sequences that will be used as a reference background set should be provided. 

The rest of the parameters are explained in the Fisher's Exact Test section.

Gene Set Enrichment Analysis

The "FDR Filter for Ranked List" parameter sets a filter that excludes genes whose FDR is above the given value. The ranked gene list will be created using the logFC statistic.
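
The construction of the ranked list can be pictured with the following Python sketch. Sorting from the highest to the lowest logFC is an assumption, as are the function name and the example values.

```python
def ranked_list(results, fdr_filter=1.0):
    """Drop genes whose FDR is above fdr_filter, then rank the rest by logFC."""
    kept = [(gene, log_fc) for gene, log_fc, fdr in results if fdr <= fdr_filter]
    return sorted(kept, key=lambda item: item[1], reverse=True)


# Hypothetical results: (gene, logFC, FDR).
results = [
    ("g1", 2.0, 0.01),
    ("g2", -1.5, 0.03),
    ("g3", 0.4, 0.60),
    ("g4", 3.1, 0.04),
]

print(ranked_list(results, fdr_filter=0.05))
# [('g4', 3.1), ('g1', 2.0), ('g2', -1.5)]
```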

The project containing the functionally annotated sequences that will be used as a reference background set should be provided. 

The rest of the parameters are explained in the Gene Set Enrichment Analysis section.