RNA-Seq de novo Assembly
De novo transcriptome assembly is one of the most frequent analyses performed in bioinformatics. It consists of reconstructing the transcriptome from RNA sequencing data by assembling short nucleotide sequences into longer ones without the use of a reference genome. This functionality is based on Trinity, a well-known de novo sequence assembler developed at the Broad Institute and the Hebrew University of Jerusalem.
Trinity combines three independent software modules (Inchworm, Chrysalis, and Butterfly), applied sequentially to process large volumes of RNA-seq reads. Trinity partitions the sequence data into many individual de Bruijn graphs, each representing the transcriptional complexity at a given gene or locus, and then processes each graph independently to extract full-length splicing isoforms and to tease apart transcripts derived from paralogous genes.
Please cite Trinity as: Grabherr MG, Haas BJ, Yassour M, Levin JZ, Thompson DA, Amit I, Adiconis X, Fan L, Raychowdhury R, Zeng Q, Chen Z, Mauceli E, Hacohen N, Gnirke A, Rhind N, di Palma F, Birren BW, Nusbaum C, Lindblad-Toh K, Friedman N, Regev A (2011). "Full-length transcriptome assembly from RNA-Seq data without a reference genome." Nature Biotechnology, 29(7):644-52.
Run RNA-Seq de novo Assembly
- Sequencing Data: Choose the type of input data: single-end or paired-end reads. Note that if paired-end is selected, two files per sample are required.
- Input Reads: Provide the files containing sequencing reads. These files are assumed to be in FASTQ format.
- Paired-end configuration: In case of paired-end reads, the pattern to distinguish upstream files from downstream files is required. The provided patterns are searched right before the extension, and the start of the name should be the same for both files of each sample.
- Upstream Files Pattern: Establish the pattern to recognize upstream FASTQ files.
- Downstream Files Pattern: Establish the pattern to recognize downstream FASTQ files.
For example, if the upstream file is named SRR037717_1.fastq and the downstream one SRR037717_2.fastq, you should establish "_1" as the upstream pattern and "_2" as the downstream pattern.
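The pairing rule described above can be illustrated with a minimal Python sketch. It is not the tool's own implementation: the function name `pair_fastq_files` is hypothetical, and the sketch assumes non-empty patterns and files with a single extension (for instance `.fastq`, not `.fastq.gz`).

```python
from collections import defaultdict
from pathlib import Path


def pair_fastq_files(filenames, upstream="_1", downstream="_2"):
    """Group FASTQ file names into (upstream, downstream) pairs per sample.

    Hypothetical helper: the pattern is searched right before the extension,
    and the part of the name preceding it identifies the sample.
    """
    samples = defaultdict(dict)
    for name in filenames:
        stem = Path(name).stem  # file name without its last extension
        if stem.endswith(upstream):
            samples[stem[: -len(upstream)]]["upstream"] = name
        elif stem.endswith(downstream):
            samples[stem[: -len(downstream)]]["downstream"] = name

    pairs = {}
    for sample, files in samples.items():
        if "upstream" not in files or "downstream" not in files:
            raise ValueError(f"Sample {sample!r} is missing one file of its pair")
        pairs[sample] = (files["upstream"], files["downstream"])
    return pairs


if __name__ == "__main__":
    print(pair_fastq_files(["SRR037717_1.fastq", "SRR037717_2.fastq"]))
```

With the file names from the example, the sample name resolves to `SRR037717` and both files are assigned to it.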
Figure 1: Input Data Page
- Strand Specificity: This option defines the strandedness of the RNA-seq reads:
- Non-Strand Specific: Refers to non-strand-specific protocols.
- Strand Specific Forward: For single-end data, the single read is in the sense (forward) orientation. In case of paired-end data, the first read of fragment pair is sequenced as sense (forward), and the second is in the antisense strand (reverse).
- Strand Specific Reverse: For single-end data, the single read is in the antisense (reverse) orientation. In case of paired-end data, the first read of fragment pair is sequenced as anti-sense (reverse), and the second read is in the sense strand (forward). Typical of the dUTP/UDG sequencing method.
- Minimizing Falsely Fused Transcripts: If the RNA-seq data under study are derived from a gene-dense, compact genome, enable this option to minimize falsely fused transcripts. It is highly recommended for compact fungal genomes and is only available for paired-end data.
Note that it is an expensive operation, so avoid using it unless necessary.
- Pair Distance: Maximum length expected between fragment pairs (500 nucleotides by default). Reads outside this distance are treated as single-end.
- Transcript to Gene Mapping: Select a location to save the transcript-to-gene mapping file. It is a tab-delimited file that maps transcript (isoform) identifiers to gene identifiers. It can be used in downstream analyses such as Transcript-level Quantification.
Figure 2: Advanced Configuration Page
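For readers familiar with the Trinity command line, the options above resemble some of Trinity's own flags (`--SS_lib_type`, `--jaccard_clip`, `--group_pairs_distance`). The following Python sketch shows one way the settings could be translated into a Trinity argument list. How this tool actually invokes Trinity is not documented here, so the mapping, the function name `build_trinity_args`, and its parameters are assumptions; required Trinity options such as `--max_memory` are omitted for brevity.

```python
def build_trinity_args(paired, strand="none", left=None, right=None,
                       single=None, jaccard_clip=False, pair_distance=500):
    """Translate wizard-style options into a Trinity argument list (sketch).

    strand is one of "none", "forward", "reverse" (hypothetical names).
    """
    # Strand-specific library types as used by Trinity's --SS_lib_type.
    ss_lib = {
        ("forward", True): "FR", ("forward", False): "F",
        ("reverse", True): "RF", ("reverse", False): "R",
    }
    args = ["Trinity", "--seqType", "fq"]
    if paired:
        args += ["--left", ",".join(left), "--right", ",".join(right)]
        args += ["--group_pairs_distance", str(pair_distance)]
        if jaccard_clip:  # only meaningful for paired-end data
            args.append("--jaccard_clip")
    else:
        args += ["--single", ",".join(single)]
    if strand != "none":
        args += ["--SS_lib_type", ss_lib[(strand, paired)]]
    return args


if __name__ == "__main__":
    print(" ".join(build_trinity_args(
        True, "reverse",
        left=["SRR037717_1.fastq"], right=["SRR037717_2.fastq"],
        jaccard_clip=True)))
```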
To recognize the input reads, the algorithm requires that the read names of the upstream FASTQ files end in "/1" and the names of the downstream FASTQ files in "/2". In this way, the algorithm can recognize reads from the same read pair and distinguish the upstream read from the downstream one. For example:
- Upstream read: @D15C0ACXX120910:4:1101:10000:3558/1
- Downstream read: @D15C0ACXX120910:4:1101:10000:3558/2
If the read names do not follow this format, the analysis stops with an error. This usually happens when the FASTQ files are downloaded from the SRA. If this is the case, you will have to download your data again using the command suggested in the error message:
- Linux/Mac --> SRA_TOOLKIT/fastq-dump --define-seq '@$sn[_$rn]/$ri' --split-files SRA_ACCESSION
- Windows --> SRA_TOOLKIT/fastq-dump --define-seq @$sn[_$rn]/$ri --split-files SRA_ACCESSION
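The read-name requirement can be checked before launching the assembly. The sketch below is a minimal Python illustration, not the tool's own validation: the function name `read_names_have_suffix` is hypothetical, and it assumes uncompressed FASTQ with exactly four lines per record.

```python
import io


def read_names_have_suffix(handle, suffix, max_reads=1000):
    """Return True if the first max_reads FASTQ read names end with suffix.

    Assumes four-line FASTQ records; only the first whitespace-delimited
    token of each header line is treated as the read name.
    """
    checked = 0
    for line_number, line in enumerate(handle):
        if line_number % 4 != 0:  # the header is the first of every 4 lines
            continue
        parts = line.split()
        if not parts:
            return False
        name = parts[0]
        if not name.startswith("@") or not name.endswith(suffix):
            return False
        checked += 1
        if checked >= max_reads:
            break
    return checked > 0


if __name__ == "__main__":
    upstream = io.StringIO("@D15C0ACXX120910:4:1101:10000:3558/1\nACGT\n+\nIIII\n")
    print(read_names_have_suffix(upstream, "/1"))
```

For real data, open the upstream file and check for "/1", then the downstream file and check for "/2".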
When the RNA-seq de novo assembly completes, it creates a sequence table containing the assembled transcript sequences (Figure 3). Trinity groups transcripts into clusters based on shared sequence content. Such a transcript cluster can be considered a 'gene'.
Figure 3: Sequence table project containing the sequences of the assembled transcripts
This information is encoded in the Trinity FASTA accession. For example, two isoforms of the same gene have accessions like these:
- Isoform 1: TRINITY_DN869_c0_g1_i1
- Isoform 2: TRINITY_DN869_c0_g1_i2
The accession encodes the Trinity 'gene' and 'isoform' information. In the example above, the accession 'TRINITY_DN869_c0_g1_i1' indicates Trinity read cluster 'TRINITY_DN869_c0', gene 'g1', and isoform 'i1'; the second accession differs only in the isoform, 'i2'. Because a given run of Trinity involves many clusters of reads, each of which is assembled separately, and because the 'gene' numbering is unique only within a given processed read cluster, the 'gene' identifier should be considered an aggregate of the read cluster and the corresponding gene identifier, which in this case would be 'TRINITY_DN869_c0_g1'.
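As an illustration of this naming scheme, the following Python sketch splits an accession into its read cluster, aggregate gene identifier, and isoform. The regular expression and the function name `parse_trinity_accession` are assumptions based only on the accession format shown above.

```python
import re

# Pattern derived from the example accessions, e.g. TRINITY_DN869_c0_g1_i1.
ACCESSION = re.compile(
    r"^(?P<gene>(?P<cluster>TRINITY_DN\d+_c\d+)_g\d+)_(?P<isoform>i\d+)$"
)


def parse_trinity_accession(accession):
    """Return (read cluster, aggregate gene identifier, isoform)."""
    match = ACCESSION.match(accession)
    if match is None:
        raise ValueError(f"Unexpected accession format: {accession!r}")
    return match.group("cluster"), match.group("gene"), match.group("isoform")


if __name__ == "__main__":
    for acc in ("TRINITY_DN869_c0_g1_i1", "TRINITY_DN869_c0_g1_i2"):
        print(parse_trinity_accession(acc))
```

Both example accessions yield the same gene identifier, 'TRINITY_DN869_c0_g1', which is how isoforms can be grouped by gene.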
Furthermore, a result page will show a summary of the RNA-seq de novo assembly results (Figure 4). It contains the following information:
- Details of input FASTQ files.
- A results overview reporting the total number of transcripts and genes detected, the GC percentage, and the total assembled bases.
- Statistics based on the lengths of the assembled transcriptome contigs. The conventional Nx length statistic means that at least x% of the assembled transcript nucleotides are found in contigs of at least Nx length. For example, the N50 means that at least half of all assembled bases are in transcript contigs of at least the N50 length value.
- The RNA-Seq Read Representation, which allows assessing the read composition of the assembly. It shows the number of reads that map to the assembled transcripts, including those that are properly paired and those that are not (details below).
Figure 4: Summary report
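The Nx length statistic described in the list above can be computed from the contig lengths alone. The following Python sketch is a minimal illustration (the function name `nx_length` is hypothetical): it sorts contigs from longest to shortest and returns the length at which the running total first reaches x% of all assembled bases.

```python
def nx_length(lengths, x=50):
    """Return the Nx value for a collection of contig lengths."""
    threshold = sum(lengths) * x / 100
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= threshold:
            return length
    raise ValueError("no contig lengths given")


if __name__ == "__main__":
    contigs = [100, 200, 300, 400]  # 1000 assembled bases in total
    print(nx_length(contigs, 50))   # 400 + 300 = 700 >= 500, so N50 is 300
```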
Finally, two charts showing the read representation of the assembly are generated (Figure 5 and Figure 6). These charts display the number of reads of each input file sorted by category; the second chart shows the same information as percentages. Bowtie2 is used to align the reads to the assembled transcriptome. For single-end data, the aligned reads are counted; for paired-end data, proper pairs, improper pairs, and orphan read alignments are counted separately.
Figure 5: Read Representation Chart
Figure 6: Read Representation (%) Chart
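To make the read categories more concrete, the sketch below classifies a single alignment from its SAM flag, using the standard flag bits for paired, proper pair, unmapped, and mate unmapped. It is only an illustration: the category names and the function `classify_alignment` are hypothetical, and the exact counting procedure used to build the charts may differ.

```python
# Standard SAM flag bits.
PAIRED, PROPER_PAIR, UNMAPPED, MATE_UNMAPPED = 0x1, 0x2, 0x4, 0x8


def classify_alignment(flag):
    """Assign a read alignment to a category based on its SAM flag."""
    if flag & UNMAPPED:
        return "unmapped"
    if not flag & PAIRED:
        return "single"
    if flag & PROPER_PAIR:
        return "proper_pair"
    if flag & MATE_UNMAPPED:
        return "orphan"  # this read aligned but its mate did not
    return "improper_pair"


if __name__ == "__main__":
    for flag in (0, 99, 65, 73, 77):
        print(flag, classify_alignment(flag))
```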