include "mmetsp_header.inc.php"; ?>

General instructions for evaluating species 18S diversity in your data

A BWA [1] index, or database, is made for a set of 18S sequences from the expected species and potential non-target organisms. The 18S FASTA file from the MMETSP samples is a good place to start. You may want to add in another small subunit database (e.g. SILVA [2] or Greengenes [3]) if other bacterial or eukaryotic contaminants are suspected. If additional sequences are added to the MMETSP 18S FASTA file, we recommend using CD-HIT-EST [4] to remove redundant sequences because redundancy in the database complicates interpretation of results. Sequence reads are aligned in paired-end mode to the 18S database using BWA, allowing one mismatch per 50 bases. PCR duplicates are removed at this point using SAMtools [5] rmdup to remove identical read pairs, preventing a single molecule from being counted multiple times.

There are two metrics we recommend calculating from the 18S alignments: all counts and unique pairs. “All counts” is simply how many individual reads (not considering pairs) are aligned to each 18S sequence. Typically, you would expect the 18S from the sample sequenced to have the highest “all counts”, but similar 18S sequences can result in high numbers of reads aligned. “Unique pairs” is a count of how many read pairs are aligned to a particular 18S sequence, but not to any other 18S sequence. This is as close to a diagnostic of purity as is possible. In a perfectly pure sample, ideally, only the 18S of the species sequenced will have uniquely aligned read pairs. Significant numbers of unique pairs in a species other than the intended target may indicate presence of another organism in the sample. In practice, such unique pairs can occur due to random sequencing errors, errors introduced during PCR amplification, the non-axenic nature of many aquatic samples, cryptic contaminants in culture, laboratory handling, and the highly sensitive nature of deep sequencing. An example is shown below.

Sample-A Alignments (Reads from Sample-A aligned to an 18S database)

#18S	ALL-COUNTS	UNIQUE-PAIRS
Species-x	84772	16478
Species-y	56678	10
Species-z	4165	830

Interpreting counts can be challenging, but we can gain some insight about what may have unexpectedly been present in a culture, at the time of RNA extraction, or during the sequencing process. In the example above, sample-A was intended to be a culture of species-x. One might conclude that sample-A was correctly identified and cultured because it is largely the intended species due to the comparatively high counts in both “all counts” and “unique pairs” aligned to the 18S from species-x. Species-y had many read alignments indicating regional similarity in the 18S sequences between species x and y, but the low number of uniquely aligning read pairs should not raise concern—it could represent a very minor contaminant. On the other hand, the significantly higher number of uniquely aligning pairs to the SSU from species-z may indicate that species-z was a contaminant in sample-A, perhaps representing an intended bacterial food source or some form of unintended contaminant.

Software and databases referenced:
1. http://bio-bwa.sourceforge.net
2. http://www.arb-silva.de
3. http://greengenes.secondgenome.com/downloads/database/13_5
4. http://bioinformatics.org/cd-hit/
5. http://samtools.sourceforge.net