MG-RAST (Metagenomic Rapid Annotations using Subsystems Technology) is an open-source web application server for automatic phylogenetic and functional analysis of metagenomes. It is also one of the largest repositories of metagenomic data. Its pipeline automatically assigns functions to metagenomic sequences by running sequence comparisons at both the nucleotide and amino acid levels. Users receive phylogenetic and functional summaries of the analyzed metagenomes, along with tools for comparing datasets. MG-RAST also offers a RESTful API for programmatic access.
Original author(s) | Argonne National Laboratory, University of Chicago, San Diego State University |
---|---|
Developer(s) | F. Meyer, D. Paarmann, M. D'Souza, R. Olson, E.M. Glass, M. Kubal, T. Paczian, A. Rodriguez, R. Stevens, A. Wilke, J. Wilkening, R.A. Edwards |
Initial release | 2008 |
Stable release | 4.0 / 15 November 2016 |
Type | Bioinformatics |
Website | http://metagenomics.anl.gov/ |
The server was created and is maintained by researchers at Argonne National Laboratory and the University of Chicago. As of December 29, 2016, MG-RAST had analyzed 60 terabase-pairs of data from more than 150,000 datasets, of which more than 23,000 are publicly available. Computational resources are provided by the DOE Magellan cloud at Argonne National Laboratory, Amazon EC2 Web services, and a number of traditional clusters.
Background
MG-RAST was developed as a free, public resource for the analysis and storage of metagenome sequence data. It addresses a key bottleneck in metagenome analysis by removing the need for users to have their own high-performance computing resources to annotate data.
Metagenomic and metatranscriptomic studies process large datasets and often require computationally intensive analyses. Because sequencing costs have fallen substantially in recent years, scientists can generate vast amounts of data, and the limiting factor has shifted to the cost of computing. For example, a University of Maryland study estimated a cost exceeding $5 million per terabase using their CLOVR metagenome analysis pipeline. As the size and number of sequence datasets continue to grow, the associated analysis costs are expected to rise.
Beyond analysis, MG-RAST functions as a repository for metagenomic data. Metadata collection and interpretation are crucial for genomic and metagenomic studies, and MG-RAST addresses challenges in the exchange, curation, and distribution of this information. The system has adopted the minimal checklist standards and biome-specific environmental packages established by the Genomic Standards Consortium. MG-RAST also provides an uploader that captures metadata at the time of data submission.
Pipeline for metagenomic data analysis
The MG-RAST application provides automated quality control, annotation, comparative analysis, and archiving for metagenomic and amplicon sequences, using a combination of bioinformatics tools. Originally designed for metagenomic data analysis, MG-RAST also supports processing of amplicon sequences (16S, 18S, and ITS) and metatranscriptome (RNA-seq) sequences. However, MG-RAST currently cannot predict coding regions from eukaryotes, which limits its utility for eukaryotic metagenome analysis.
The MG-RAST pipeline can be segmented into five distinct stages:
Data hygiene
The MG-RAST pipeline begins with a series of quality control and artifact removal steps for metagenomic and metatranscriptome datasets. The first step trims low-quality regions using SolexaQA and removes reads with inappropriate lengths. For metagenome and metatranscriptome datasets, a dereplication step is added to make downstream processing more efficient.
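The trimming, length filtering, and dereplication steps can be illustrated with a minimal Python sketch. This is not SolexaQA or the MG-RAST code: the function names, the quality threshold of 15, and the choice to treat reads sharing the same first `k` bases as duplicates are all assumptions made for illustration.

```python
def longest_high_quality_segment(seq, quals, min_q=15):
    """Return the longest contiguous stretch of seq whose per-base
    quality score is >= min_q (hypothetical threshold)."""
    best_start, best_len = 0, 0
    run_start = None
    for i, q in enumerate(quals):
        if q >= min_q:
            if run_start is None:
                run_start = i
            run_len = i - run_start + 1
            if run_len > best_len:
                best_start, best_len = run_start, run_len
        else:
            run_start = None  # a low-quality base ends the current run
    return seq[best_start:best_start + best_len]


def length_filter(reads, min_len, max_len):
    """Keep only reads whose length lies within [min_len, max_len]."""
    return [r for r in reads if min_len <= len(r) <= max_len]


def dereplicate_by_prefix(reads, k=50):
    """Keep the first read seen for each distinct k-base prefix
    (an assumed, simplified notion of a duplicate)."""
    seen = set()
    kept = []
    for r in reads:
        key = r[:k]
        if key not in seen:
            seen.add(key)
            kept.append(r)
    return kept
```

In practice each function would be applied in turn to the reads parsed from a FASTQ file; parsing is omitted here to keep the sketch short.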
The subsequent step employs DRISEE (Duplicate Read Inferred Sequencing Error Estimation) to evaluate sample sequencing errors by measuring Artificial Duplicate Reads (ADRs). This assessment contributes to enhancing the accuracy of downstream analyses.
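The idea of estimating error from duplicate reads can be sketched as follows. This is a much-simplified illustration and not the DRISEE implementation: it assumes that reads sharing an identical prefix are artificial duplicates of one template, builds a per-position consensus for each group, and reports the fraction of bases that disagree with it. The function name and the `prefix_len` and `min_bin_size` parameters are hypothetical.

```python
from collections import Counter, defaultdict


def duplicate_based_error_rate(reads, prefix_len=20, min_bin_size=3):
    """Estimate a per-base error rate from groups of reads that share
    an identical prefix (assumed here to be artificial duplicates)."""
    bins = defaultdict(list)
    for r in reads:
        if len(r) > prefix_len:
            # Group by prefix; only the remainder of the read is compared.
            bins[r[:prefix_len]].append(r[prefix_len:])

    mismatches = 0
    compared = 0
    for tails in bins.values():
        if len(tails) < min_bin_size:
            continue  # too few duplicates to form a reliable consensus
        width = min(len(t) for t in tails)
        for pos in range(width):
            column = [t[pos] for t in tails]
            consensus, _ = Counter(column).most_common(1)[0]
            mismatches += sum(1 for base in column if base != consensus)
            compared += len(column)

    return mismatches / compared if compared else 0.0
```

With three reads that share a four-base prefix and differ at a single position in the remaining three bases, the estimate is one mismatch out of nine compared bases.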
Finally, the pipeline offers the option to screen reads using the Bowtie aligner. Reads that are near-exact matches to the genomes of model organisms, including fly, mouse, cow, and human, are identified and removed. This step filters out reads from likely contaminants or unintended sequences.
Feature extraction
For gene identification, MG-RAST uses FragGeneScan, a machine learning approach that predicts coding regions in the metagenomic or metatranscriptomic sequences.
To identify ribosomal RNA sequences, MG-RAST runs a BLAT search against a reduced version of the SILVA database. This step detects and categorizes the ribosomal RNA sequences in the dataset, giving a more detailed picture of the biological composition of the analyzed metagenomes or metatranscriptomes.
Feature annotation
To assign putative functions and annotations to the predicted genes, MG-RAST follows a multi-step process. It first builds clusters of proteins at a 90% identity level using the UCLUST implementation in QIIME. The longest sequence within each cluster is then selected for further analysis.
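The clustering step described above can be sketched as a greedy, length-sorted procedure in which the first (longest) member of each cluster serves as its representative. This is an illustration only and not UCLUST: the `identity` function uses `difflib.SequenceMatcher.ratio()` as a crude stand-in for alignment-based percent identity, and all names are hypothetical.

```python
from difflib import SequenceMatcher


def identity(a, b):
    """Crude similarity in [0, 1]; a stand-in for alignment identity."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def greedy_cluster(seqs, threshold=0.9):
    """Cluster sequences greedily, longest first.

    Each sequence joins the first cluster whose representative it
    matches at >= threshold; otherwise it starts a new cluster.
    Because input is sorted by length, each representative is the
    longest sequence in its cluster.
    """
    clusters = []  # list of (representative, members)
    for s in sorted(seqs, key=len, reverse=True):
        for rep, members in clusters:
            if identity(rep, s) >= threshold:
                members.append(s)
                break
        else:
            clusters.append((s, [s]))
    return clusters
```

Only the representatives, `[rep for rep, _ in greedy_cluster(proteins)]`, would then be passed to the similarity search, which reduces the number of comparisons needed.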
For the similarity analysis, MG-RAST uses sBLAT, a version of the BLAT algorithm parallelized with OpenMP. The search is run against a protein database derived from the M5nr, which integrates nonredundant sequences from databases such as GenBank, SEED, IMG, UniProt, KEGG, and eggNOG.
In the case of reads associated with rRNA sequences, a clustering step is performed at a 97% identity level. The longest sequence from each cluster is chosen as the representative and is used for a BLAT search against the M5rna database. This database integrates sequences from SILVA, Greengenes, and RDP, providing a comprehensive reference for the analysis of ribosomal RNA sequences.
Profile generation
The annotation data feeds several key products, primarily abundance profiles. These profiles summarize and reorganize the information in the similarity files into a more easily digestible format.
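As a rough illustration of what an abundance profile is, the sketch below counts how many hits fall under each function or organism. The input format, a list of dictionaries with `read`, `function`, and `organism` keys, is a hypothetical simplification and not the format of MG-RAST's similarity files.

```python
from collections import Counter


def abundance_profile(best_hits, field):
    """Count hits per value of `field` (e.g. 'function' or 'organism')."""
    return Counter(hit[field] for hit in best_hits)


# Hypothetical best hits, one per read.
hits = [
    {"read": "r1", "function": "DNA polymerase", "organism": "Escherichia coli"},
    {"read": "r2", "function": "DNA polymerase", "organism": "Bacillus subtilis"},
    {"read": "r3", "function": "ATP synthase", "organism": "Escherichia coli"},
]

functional_profile = abundance_profile(hits, "function")
taxonomic_profile = abundance_profile(hits, "organism")
```

Here `functional_profile` maps "DNA polymerase" to 2 and "ATP synthase" to 1, and `taxonomic_profile` maps "Escherichia coli" to 2 and "Bacillus subtilis" to 1.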
Data loading
Finally, the resulting abundance profiles are loaded into the respective databases.
Detailed steps of the MG-RAST pipeline
MG-RAST pipeline stage | Description |
---|---|
qc_stats | Generate quality control statistics |
preprocess | Preprocessing to trim low-quality regions from FASTQ data |
dereplication | Dereplication of shotgun metagenome data using a k-mer approach |
screen | Removal of reads that are near-exact matches to the genomes of model organisms (fly, mouse, cow and human) |
rna detection | BLAT search against a reduced RNA database to identify ribosomal RNA |
rna clustering | rRNA-similar reads are then clustered at 97% identity |
rna sims blat | BLAT similarity search for the longest cluster representative against the M5rna database |
genecalling | A machine learning approach, FragGeneScan, to predict coding regions in DNA sequences |
aa filtering | Filter proteins |
aa clustering | Cluster proteins at 90% identity level using uclust |
aa sims blat | BLAT similarity analysis to identify proteins |
aa sims annotation | Sequence similarity against protein database from the M5nr |
rna sims annotation | Sequence similarity against RNA database from the M5rna |
index sim seq | Index sequence similarity to data sources |
md5 annotation summary | Generate summary reports for md5, function, organism, LCA, ontology, and source annotations |
function annotation summary | Generate summary reports for md5, function, organism, LCA, ontology, and source annotations |
organism annotation summary | Generate summary reports for md5, function, organism, LCA, ontology, and source annotations |
lca annotation summary | Generate summary reports for md5, function, organism, LCA, ontology, and source annotations |
ontology annotation summary | Generate summary reports for md5, function, organism, LCA, ontology, and source annotations |
source annotation summary | Generate summary reports for md5, function, organism, LCA, ontology, and source annotations |
md5 summary load | Load summary report to the project |
function summary load | Load summary report to the project |
organism summary load | Load summary report to the project |
lca summary load | Load summary report to the project |
ontology summary load | Load summary report to the project |
done stage | |
notify job completion | Send notification to user via email |
MG-RAST utilities
In addition to metagenome analysis, MG-RAST supports data exploration. It provides tools for visualizing and comparing metagenome profiles across datasets. Datasets can be filtered by criteria such as composition, quality, functionality, or sample type, and statistical inference and ecological analyses can be performed through the web interface.
References
- Field, Dawn; Amaral-Zettler, Linda; Cochrane, Guy; Cole, James R.; Dawyndt, Peter; Garrity, George M.; Gilbert, Jack; Glöckner, Frank Oliver; Hirschman, Lynette (2011-06-21). "The Genomic Standards Consortium". PLOS Biology. 9 (6): e1001088. doi:10.1371/journal.pbio.1001088. ISSN 1545-7885. PMC 3119656. PMID 21713030.