Tools (PRO Feature)
Menu Items
- Set-to-Sense (Based on Best-Blast-Hit): Convert all selected sequences whose Best-Blast-Hit has a negative reading frame to anti-sense, i.e. the query sequences are converted to their reverse complement (e.g. ATTG -> CAAT). The tag "_antisense" is appended to the sequence names. Use the batch rename function to undo the name change.
- Translate Longest ORF: Convert all selected sequences to their longest ORF protein sequence. The tag "_ORF" is added to the sequence names. Use the batch rename function to undo the name change. The user may select the reading frame and, depending on the species, the genetic code to be used for the prediction.
- Search Loaded Annotations in Another Annotation Set: Compare a set of annotations for a given group of sequences against the annotations already loaded in Blast2GO.
- Find Duplicated Sequences: Mark as selected or directly remove all sequences in the dataset which have the exact same sequence string.
- Find Similar Sequences: Detect, select and/or remove similar sequences within one project.
- Batch Rename: Perform a batch rename of all selected sequences by converting, replacing or adding text to the current sequence name.
- Retrieve Blast Top-Hit
- Retrieve Blat Top-Hit
- Create NCBI GenBank Genome Submission Files (online) (see NCBI Submission)
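The sequence conversions behind Set-to-Sense and Translate Longest ORF can be illustrated with a minimal Python sketch. It is not Blast2GO's implementation: the function names are hypothetical, the ORF is assumed to run from an ATG start codon to the first in-frame stop codon on the forward strand, open ORFs without a stop codon are ignored, and only the standard genetic code (NCBI translation table 1) is used.

```python
# Standard genetic code (NCBI translation table 1), codons ordered T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a nucleotide sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def longest_orf(seq, frame=0):
    """Return the longest ATG..stop protein in one forward frame (0, 1 or 2).

    Assumption: an ORF starts at ATG and ends at the first in-frame stop.
    Codons containing ambiguous bases are translated as 'X'.
    """
    seq = seq.upper()
    best = ""
    current = None  # protein being built, or None while outside an ORF
    for pos in range(frame, len(seq) - 2, 3):
        aa = CODON_TABLE.get(seq[pos:pos + 3], "X")
        if current is None:
            if aa == "M":
                current = "M"
        elif aa == "*":
            if len(current) > len(best):
                best = current
            current = None
        else:
            current += aa
    return best
```

For example, `reverse_complement("ATTG")` gives `"CAAT"`, matching the example in the list above. A different genetic code would only require a different `AMINO_ACIDS` string.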
Find Similar Sequences
This function searches for similar sequences within a dataset. The search is done via BLAT alignments. The function searches a list of sequences against itself and reports all alignments above a certain similarity percentage. It is possible to remove similar sequences from the project or to extract a less redundant result dataset into a new project.
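The idea of a self-search followed by redundancy reduction can be sketched in Python as follows. This is only an illustration under stated assumptions: the alignments are taken as already computed (query, target, percent identity) tuples, the function names are hypothetical, and the greedy rule for choosing which sequences to keep is an assumption, since the manual does not describe how Blast2GO selects them.

```python
def similar_pairs(hits, min_identity):
    """Return unique pairs of distinct sequences aligned at or above the threshold.

    hits: iterable of (query, target, percent_identity) from a self-search.
    """
    pairs = set()
    for query, target, identity in hits:
        if query != target and identity >= min_identity:
            pairs.add(tuple(sorted((query, target))))
    return sorted(pairs)


def less_redundant(names, pairs):
    """Greedy selection: keep a sequence unless it is similar to one already kept."""
    partners = {}
    for a, b in pairs:
        partners.setdefault(a, set()).add(b)
        partners.setdefault(b, set()).add(a)
    kept = []
    for name in names:
        if not partners.get(name, set()) & set(kept):
            kept.append(name)
    return kept
```

With this rule the result depends on the order of `names`: the first member of each group of similar sequences is the one that survives.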
Find Duplicated Sequences
This function quickly identifies and removes redundant sequences (sequences with exactly the same sequence string) within a dataset.
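Exact-duplicate detection amounts to grouping sequences by their sequence string. A minimal Python sketch, assuming the dataset is available as (name, sequence) pairs and that the comparison is case-sensitive (both are assumptions, as are the function names):

```python
from collections import defaultdict


def find_duplicates(records):
    """Group sequence names that share exactly the same sequence string."""
    groups = defaultdict(list)
    for name, seq in records:
        groups[seq].append(name)
    return [names for names in groups.values() if len(names) > 1]


def remove_duplicates(records):
    """Keep only the first record seen for each distinct sequence string."""
    seen = set()
    unique = []
    for name, seq in records:
        if seq not in seen:
            seen.add(seq)
            unique.append((name, seq))
    return unique
```

`find_duplicates` corresponds to marking duplicates as selected, `remove_duplicates` to removing them directly.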
Retrieve Blast Top-Hit
This feature retrieves the sequence information of the Top Blast Hits in a Blast2GO project. Data can be obtained from the NCBI, Ensembl or UniProt web services and either stored in a new project or used to replace the existing IDs/sequences (see Figure 1). A possible use case is a so-called "Double-Blast": the blast results of a first run are used to replace the sequence data, so that a second run is performed with a different set of query sequences. Imagine an RNA-seq data-set with a high percentage of sequences without any alignments against a protein database (e.g. blastx against NR). This feature could be used to select and extract the sequences without hits (red ones) into a new project. These sequences could first be blasted against a set of EST sequences. The initially unaligned sequences are then replaced with the ESTs, and the initial blastx search is repeated against the protein database.
For each Top-Hit (the first significant alignment from an already performed BLAST), the filters (bottom part of the dialog) are applied and the hit is searched in the corresponding online database.
It is possible either to replace the sequences in your data-set or to extract them into a new data-set (Action option). You can also decide whether to keep the original sequence names or to rename the sequences to the downloaded sequence names. The latter adds a small note to the sequence description stating the original name.
The last remaining option allows you to decide whether you want to replace your sequences with the downloaded ones or if you just want to retrieve their name. This option is activated by default.
Figure 1: Retrieve Blast Top-Hit Dialog.
Retrieve Blat Top-Hit
This tool is very similar to "Retrieve Blast Top-Hit" explained above, but it employs BLAT instead (see Figure 2). The dialog is therefore quite similar and the first three options are identical. BLAT needs a reference FASTA file, which it searches for similar sequences. The last two options allow you to filter by similarity and to choose whether BLAT should consider the reverse strand.
Figure 2: Retrieve Blat Top-Hit Dialog.
Data Import and Export (PRO Feature)
The File and Tools menus contain several useful PRO features for manipulating sequence data.
Load
- Extract and import sequences from a FASTA and a GFF/GTF file (Figure 3).
- Load GTF/GFF2/GFF3
- Load Accession List: Load Gene Ontology annotations via an Accession list.
- Load GeneSymbol List: Load Gene Ontology annotations via a GeneSymbol list.
- Load GI-List: Load Gene Ontology annotations via a GenInfo Identifier (gi) list. Each identifier must be enclosed in vertical bars, e.g. gi|356569257|.
- Load Data from BioMart: Load Gene Ontology annotations from BioMart.
The Accession List and GeneSymbol List files should contain up to two tab-separated columns per line. The first column contains the accession ID or gene symbol; the second column, which is optional, may contain the corresponding taxonomy.
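The two-column layout described above can be read with a few lines of Python. This is a sketch of the file format only, not of how Blast2GO loads it; the function name is hypothetical and the identifiers in the usage example are purely illustrative. Whether the taxonomy column holds a numeric ID or a species name is not specified here, so it is kept as plain text.

```python
def parse_accession_list(text):
    """Parse lines of 'identifier<TAB>taxonomy', where the taxonomy is optional.

    Returns a list of (identifier, taxonomy) tuples; taxonomy is None if absent.
    """
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        fields = line.split("\t")
        identifier = fields[0].strip()
        taxonomy = None
        if len(fields) > 1 and fields[1].strip():
            taxonomy = fields[1].strip()
        entries.append((identifier, taxonomy))
    return entries
```

For instance, `parse_accession_list("P12345\t9606\nBRCA1\n")` returns `[("P12345", "9606"), ("BRCA1", None)]`.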
Figure 3: Extract and import sequences from a FASTA and a GFF/GTF file.
Export
- Generic Export: This option allows you to export all the desired information to a text file.
- Export Selected Sequences as Project: Export only the selected sequences and save them in a .dat file.
- Export Sequence Table: Export the current Main Sequence Table for the selected sequences.
- Export TopBlast data: Export the best-blast-hit for each sequence, i.e. the hit with the lowest e-value.
- Export GO Propagation: Exports the GO parents up to the root for the annotated sequences.
- Export Sequences per GO (Gene Set)
- Export GFF2/GTF
- Export GFF3: This option is only visible if a GFF file is loaded in Blast2GO or if a GFF has been generated from Gene Finding.
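The selection rule behind Export TopBlast data (the hit with the lowest e-value per sequence) can be sketched in Python. The input layout, a list of (query, subject, e-value) tuples, and the function name are assumptions made for illustration; ties are resolved here by keeping the first hit encountered.

```python
def top_hits(hits):
    """Return {query: (subject, evalue)} with the lowest e-value hit per query."""
    best = {}
    for query, subject, evalue in hits:
        if query not in best or evalue < best[query][1]:
            best[query] = (subject, evalue)
    return best
```

Sequences without any hit simply do not appear in the returned dictionary.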