Workflows
This page documents the workflows available in the CAMERA Portal. Additional information is available by logging into the CAMERA Portal.
Data Preparation
QC Filter
Each base in a read has an associated quality score, Q, defined as Q = -10*log10(p), where p is the probability that the base call is an error. The average score of a read gives a sense of its overall quality. The "Quality Control Filter" workflow takes FASTA and qual files, or a FASTQ file, as input and calculates the average score for each read. It then keeps the high-quality reads, filters out reads shorter than the minimum read length, and generates a statistical analysis of the input reads.
Note: This workflow does not have a graphical output but the results can be downloaded to your machine to view.
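The filtering steps described above can be sketched in Python. This is only an illustration of the idea, not the portal's implementation: the function names, the in-memory read format, and the default thresholds (min_mean_q=20, min_length=50) are assumptions chosen for the example.

```python
import math

def phred_to_error_prob(q):
    """Convert a quality score Q to an error probability: p = 10^(-Q/10)."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Convert an error probability p to a quality score: Q = -10*log10(p)."""
    return -10 * math.log10(p)

def mean_quality(quals):
    """Average quality score of one read."""
    return sum(quals) / len(quals)

def qc_filter(reads, min_mean_q=20, min_length=50):
    """Keep reads that are long enough and have a high enough average score.

    reads: iterable of (name, sequence, list_of_quality_scores).
    The thresholds are hypothetical defaults, not CAMERA's settings.
    """
    kept = []
    for name, seq, quals in reads:
        if not quals or len(seq) < min_length:
            continue  # empty or too short
        if mean_quality(quals) < min_mean_q:
            continue  # average quality too low
        kept.append((name, seq, quals))
    return kept

reads = [
    ("good", "A" * 60, [30] * 60),
    ("short", "A" * 10, [30] * 10),
    ("low_q", "A" * 60, [10] * 60),
]
print([name for name, _, _ in qc_filter(reads)])
```

With these example reads, only the read that passes both the length and the average-score checks is kept.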
454 Duplicate Clustering
This workflow identifies the duplicates from 454 reads, including exact duplicates and near identical duplicates. These duplicates are mostly sequencing artifacts in metagenomic samples, and therefore should be removed. However, most duplicates in transcriptomic reads are not artificial, so it is not suggested to run this workflow for transcriptomic datasets.
Note: This workflow does not have a graphical output but the results can be downloaded to your machine to view.
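As a rough illustration of the simplest case, the sketch below groups and removes exact duplicates only. It does not attempt the near-identical detection that the workflow also performs, and the function names and read format are assumptions made for this example.

```python
from collections import defaultdict

def group_exact_duplicates(reads):
    """Map each distinct sequence (case-insensitive) to the read names carrying it.

    reads: iterable of (name, sequence).
    """
    groups = defaultdict(list)
    for name, seq in reads:
        groups[seq.upper()].append(name)
    return dict(groups)

def remove_exact_duplicates(reads):
    """Keep the first read seen for each distinct sequence."""
    seen = set()
    unique = []
    for name, seq in reads:
        key = seq.upper()
        if key not in seen:
            seen.add(key)
            unique.append((name, seq))
    return unique

reads = [("r1", "ACGTACGT"), ("r2", "acgtacgt"), ("r3", "ACGTTTTT")]
print(group_exact_duplicates(reads))
print(remove_exact_duplicates(reads))
```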
BLAST
BLASTn
BLASTN searches nucleotide databases using a nucleotide query.
CAMERA uses the NCBI default blastall parameters. These, however, can be changed to better suit the nature of your query and the purpose of your search.
Note: Using the following advanced parameters with BLAST on the NCBI web site will yield identical results.
Match reward: 2
Mismatch penalty: -3
Gap open cost: 5
Gap extend cost: 2
Only the top alignment per hit will be kept for blast jobs when (CAMERA_REF)NCBI Refseq Genomes (N) is used as the reference data set.
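To show how the four scoring parameters listed above combine, here is a small Python sketch that computes the raw score of an already-aligned pair of sequences. It assumes the usual affine convention in which a gap of length k costs gap_open + k * gap_extend. The function name and the aligned-string input format are inventions for this example, and the result is a raw score, not a bit score or E-value.

```python
def raw_alignment_score(query, subject, reward=2, penalty=-3,
                        gap_open=5, gap_extend=2):
    """Raw score of a gapped nucleotide alignment.

    query, subject: equal-length aligned strings, with '-' marking gaps.
    Assumption: a gap of length k costs gap_open + k * gap_extend.
    """
    if len(query) != len(subject):
        raise ValueError("aligned sequences must have the same length")
    score = 0
    gap_side = None  # which sequence the currently open gap is in, if any
    for q, s in zip(query.upper(), subject.upper()):
        if q == "-" or s == "-":
            side = "query" if q == "-" else "subject"
            if side != gap_side:
                score -= gap_open  # a new gap starts here
            score -= gap_extend    # every gap position pays the extend cost
            gap_side = side
        else:
            gap_side = None
            score += reward if q == s else penalty
    return score

print(raw_alignment_score("ACGT", "ACGT"))    # four matches
print(raw_alignment_score("ACGT", "ACCT"))    # three matches, one mismatch
print(raw_alignment_score("ACGGT", "AC--T"))  # three matches, one gap of length 2
```

Under these assumptions, raising the gap costs makes gapped alignments less likely to be reported, while a larger mismatch penalty relative to the match reward favors more nearly identical hits.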
BLASTp
BLASTP searches protein databases using a protein query. CAMERA uses the NCBI default blastall parameters. These, however, can be changed to better suit the nature of your query and the purpose of your search. Note: Using the following advanced parameters will yield results that match BLAST on the NCBI web site.
Gap open cost: 11
Gap extend cost: 1
BLASTx
BLASTX searches protein databases using a translated nucleotide query. CAMERA uses the NCBI default blastall parameters. These, however, can be changed to better suit the nature of your query and the purpose of your search. Note: Using the following advanced parameters will yield results that match BLAST on the NCBI web site.
Gap open cost: 11
Gap extend cost: 1
MEGA Blast
Megablast is intended for comparing a query to closely related sequences. It is very fast and works best when the target percent identity is 95% or more. CAMERA uses the NCBI default blastall parameters. These, however, can be changed to better suit the nature of your query and the purpose of your search.
Note: Using the following advanced parameters will yield results that match BLAST on the NCBI web site.
Mismatch penalty: -2
Only the top alignment per hit will be kept for blast jobs when (CAMERA_REF)NCBI Refseq Genomes (N) is used as the reference data set.
TBLASTn
TBLASTN searches translated nucleotide databases using a protein query. CAMERA uses the NCBI default blastall parameters. These, however, can be changed to better suit the nature of your query and the purpose of your search. Note: Using the following advanced parameters will yield results that match BLAST on the NCBI web site.
Gap open cost: 11
Gap extend cost: 1
TBLASTx
TBLASTX searches translated nucleotide databases using a translated nucleotide query. CAMERA uses the NCBI default blastall parameters. These, however, can be changed to better suit the nature of your query and the purpose of your search. Note: Using the following advanced parameters will yield results that match BLAST on the NCBI web site.
Gap open cost: 11
Gap extend cost: 1
BLAST Kegg
This workflow uses BLAST to search protein sequences against the KEGG protein database. The KEGG number and its pathway/functions will be returned. Note: This workflow does not have a graphical output, but the results can be downloaded to your computer for viewing and analysis.
DNA Clustering
This workflow uses the cd-hit-est program to cluster DNA sequences. The non-redundant sequences, cluster file, cluster distribution, and cluster table are output.
Note: This workflow does not have a graphical output interface but the results can be downloaded to your machine to view.
RNA Prediction
rRNA prediction by hmmer
This workflow uses the hmmer 3.0 program to predict rRNA sequences from input DNA reads. The predicted rRNAs, masked input sequences, and predicted rRNA coordinate table are output.
Note: This workflow does not have a graphical output interface but the results can be downloaded to your machine to view.
rRNA prediction by blastn
This workflow uses the blastn program to predict rRNA sequences from input DNA reads. The predicted rRNAs, masked input sequences, and predicted rRNA coordinate table are output.
Note: This workflow does not have a graphical output interface but the results can be downloaded to your machine to view.
tRNA prediction
This workflow uses the tRNAscan-SE program to predict tRNA sequences from input DNA reads. The predicted tRNAs, masked input sequences, and predicted tRNA coordinate table are output.
Note: This workflow does not have a graphical output interface but the results can be downloaded to your machine to view.
Clustering
DNA Clustering
This workflow uses the cd-hit-est program to cluster DNA sequences. The non-redundant sequences, cluster file, cluster distribution, and cluster table are output.
Protein clustering
This workflow uses the cd-hit program (default sequence identity cutoff of 0.9) to cluster protein sequences in a single step. The non-redundant sequences, cluster file, cluster distribution, and cluster table are output.
Note: This workflow does not have a graphical output interface but the results can be downloaded to your machine to view.
Hierarchical protein clustering
This workflow uses the cd-hit program to cluster protein sequences in two steps. The first step clusters the sequences with a default sequence identity cutoff of 0.9. The second step runs cd-hit again on the results of the first step, with a default sequence identity cutoff of 0.6. The non-redundant sequences, cluster file, cluster distribution, and cluster table are output.
Note: This workflow does not have a graphical output interface but the results can be downloaded to your machine to view.
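The two-step procedure can be pictured as two chained cd-hit runs, where the representative sequences written by the first run become the input of the second. The Python sketch below only builds the two command lines and does not run them. The file names, function names, and choice of word length (-n) per cutoff are assumptions based on the general cd-hit usage guidelines, not the portal's actual invocation.

```python
def word_length_for(cutoff):
    """Pick a cd-hit word length (-n) for a protein identity cutoff.

    Assumption: ranges follow the commonly documented cd-hit guidelines.
    """
    if cutoff >= 0.7:
        return 5
    if cutoff >= 0.6:
        return 4
    if cutoff >= 0.5:
        return 3
    if cutoff >= 0.4:
        return 2
    raise ValueError("cutoff below 0.4 is not supported by this sketch")

def cd_hit_command(fasta_in, fasta_out, cutoff):
    """Build one cd-hit command line as a list of arguments."""
    return ["cd-hit", "-i", fasta_in, "-o", fasta_out,
            "-c", str(cutoff), "-n", str(word_length_for(cutoff))]

def hierarchical_commands(fasta_in, prefix, first_cutoff=0.9, second_cutoff=0.6):
    """Commands for the two steps; step 2 reads the output of step 1."""
    step1_out = prefix + ".step1.fasta"  # hypothetical file names
    step2_out = prefix + ".step2.fasta"
    return [
        cd_hit_command(fasta_in, step1_out, first_cutoff),
        cd_hit_command(step1_out, step2_out, second_cutoff),
    ]

for cmd in hierarchical_commands("proteins.fasta", "clusters"):
    print(" ".join(cmd))
```

Each command list could be passed to subprocess.run on a machine where cd-hit is installed.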
454 Duplicate Clustering
This workflow identifies the duplicates from 454 reads, including exact duplicates and near identical duplicates. These duplicates are mostly sequencing artifacts in metagenomic samples, and therefore should be removed. However, most duplicates in transcriptomic reads are not artificial, so it is not suggested to run this workflow for transcriptomic datasets.
Note: This workflow does not have a graphical output but the results can be downloaded to your machine to view.
Sequence Assembly
Assembly
This workflow assembles 454 reads using a meta-assembler developed by CAMERA. The meta-assembler first runs a list of assembly programs to generate a pool of contigs, then re-assembles the contigs into the final results. Our analysis showed that the meta-assembler performs better than any of its component assembly programs.
Note: This workflow does not have a graphical output but the results can be downloaded to your machine to view.
Orf Prediction
Orf finder by fraggene_scan
This workflow uses the fraggene_scan program to predict ORFs from input DNA reads. The ORF sequences and predicted ORF coordinate table are output.
Note: This workflow does not have a graphical output interface but the results can be downloaded to your machine to view.
Orf finder by metagene
This workflow uses the metagene program to predict ORFs from input reads. The ORF sequences and predicted ORF coordinate table are output.
Note: This workflow does not have a graphical output interface but the results can be downloaded to your machine to view.
Orf finder by six-reading-frame
This workflow uses the six-reading-frame translation technique to predict ORFs from input DNA reads. The ORF sequences and predicted ORF coordinate table are output.
Note: This workflow does not have a graphical output interface but the results can be downloaded to your machine to view.
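The idea behind six-reading-frame translation can be sketched as follows: translate the read in three frames on the forward strand and three on the reverse complement, then treat each stretch between stop codons as a candidate ORF. This is a minimal illustration using the standard genetic code. The function names, the stop-to-stop ORF definition, and the min_aa threshold are assumptions for the example and may differ from what the workflow does.

```python
from itertools import product

# Standard genetic code, codons enumerated with bases in T, C, A, G order.
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Reverse complement of a DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]

def translate(seq):
    """Translate from the first base; unknown codons become 'X'."""
    seq = seq.upper()
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))

def six_frame_translations(seq):
    """Return translations keyed by frame label: +1, +2, +3, -1, -2, -3."""
    rc = reverse_complement(seq)
    frames = {}
    for offset in range(3):
        frames["+%d" % (offset + 1)] = translate(seq[offset:])
        frames["-%d" % (offset + 1)] = translate(rc[offset:])
    return frames

def find_orfs(seq, min_aa=30):
    """Candidate ORFs as (frame, peptide): stop-free stretches of min_aa or more."""
    orfs = []
    for frame, protein in six_frame_translations(seq).items():
        for segment in protein.split("*"):
            if len(segment) >= min_aa:
                orfs.append((frame, segment))
    return orfs

print(six_frame_translations("ATGGCCTAA"))
print(find_orfs("ATGGCCTAA", min_aa=2))
```

A coordinate table like the one the workflow produces would additionally need the start and end positions of each segment, which this sketch leaves out.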
Functional Annotation
Metagenomic data annotation and clustering
This is the full RAMMCAP pipeline for analysis of metagenomic sequences. It accepts a FASTA file of raw reads. The pipeline identifies the tRNA, rRNA, and ORFs from the reads. It then performs clustering analysis on the reads and the ORFs. The ORFs are annotated against PFAM, TIGRFAM, and COG.
Function annotation by PFAM
This workflow uses the hmmer 3.0 program to assign functional annotation to protein sequences. It is based on the PFAM database.
Note: This workflow does not have a graphical output interface but the results can be downloaded to your machine to view.
Function annotation by COG
This workflow uses the rpsblast program to assign functional annotation to protein sequences. It is based on the COG database for prokaryotic proteins.
Note: This workflow does not have a graphical output interface but the results can be downloaded to your machine to view.
Function annotation by KOG
This workflow uses the rpsblast program to assign functional annotation to protein sequences. It is based on the KOG database for eukaryotic proteins.
Note: This workflow does not have a graphical output interface but the results can be downloaded to your machine to view.
Function annotation by TIGRFAM
This workflow uses the hmmer 3.0 program to assign functional annotation to protein sequences. It is based on the TIGRFAM database.
Note: This workflow does not have a graphical output interface but the results can be downloaded to your machine to view.
Function annotation by NCBI PRK
This workflow uses the rpsblast program to assign functional annotation to protein sequences. It is based on the PRK database for Reference Sequence proteins.
Note: This workflow does not have a graphical output interface but the results can be downloaded to your machine to view.
Diversity
Alpha Diversity (Rohwer)
This workflow employs PHACCS, developed by the Rohwer lab, to estimate the structure and diversity of a viral community based on the contig spectrum calculated from metagenomic sequences obtained from that community. It accepts one FASTA file of viral nucleotide sequences per community.
Gamma Diversity (Rohwer)
This workflow employs PHACCS, developed by the Rohwer lab, to estimate the overall structure and diversity of combined viral communities based on the contig spectrum calculated from metagenomic sequences obtained from multiple viral communities. It accepts two to five FASTA files of viral nucleotide sequences.