After working on this for a while, I stumbled across a blog maintained by Amanda. I borrowed fairly heavily from some of her experiences and examples. I recommend you check it out; lots of nice, clear posts!
I am planning on generating fairly large metagenomic and metatranscriptomic datasets within the next few months. Since I have a little bit of down time while some 16S datasets run through my pipeline, I thought I’d play around with some simulated and publicly available datasets.
One of my good friends recommended starting with IDBA-UD. I thought I’d try this medium to record my experiences as I’m in the middle of updating my personal website as well.
Using the author’s simulated datasets to test the install
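Roughly what I did (a sketch from memory; the release version, URL, and input file name are assumptions, so check the project page for the current ones):

```bash
# Build IDBA from a release tarball (version/URL assumed)
wget https://github.com/loneknightpy/idba/releases/download/1.1.3/idba-1.1.3.tar.gz
tar -xzf idba-1.1.3.tar.gz && cd idba-1.1.3
./configure && make

# Assemble the simulated reads; -r takes a single interleaved fasta
bin/idba_ud -r simulated_reads.fa -o idba_test --num_threads 8
```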
Testing on a different dataset
Using only sample BP101
Using several different sizes:
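I don't remember the exact commands I used here, but seqtk is one way to do it (a sketch; the seed, fractions, and file names are illustrative). Sampling R1 and R2 separately with the same seed keeps mates paired:

```bash
# 1% subset: the identical seed (-s 100) makes seqtk pick the same
# records from both files, so pairs stay intact
seqtk sample -s 100 BP101_R1.fastq 0.01 > BP101_1pct_R1.fastq
seqtk sample -s 100 BP101_R2.fastq 0.01 > BP101_1pct_R2.fastq

# idba_ud wants a single interleaved fasta, which its fq2fa helper produces
fq2fa --merge BP101_1pct_R1.fastq BP101_1pct_R2.fastq BP101_1pct.fasta
```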
My idba_ud PBS script and calc_contig_stats.py are both sketched below. calc_contig_stats.py is a little script I wrote that calculates a few of the assembly values I'm interested in, namely:
- Number of contigs
- N50 value
- Length of longest contig
- Mean contig size
- Number of contigs > 1kb
- Percent of reads used to make assembly
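The PBS script was along these lines (a sketch; the resource requests are placeholders for our queue):

```bash
#!/bin/bash
#PBS -N idba_ud_BP101
#PBS -l nodes=1:ppn=8
#PBS -l mem=64gb
#PBS -l walltime=48:00:00

cd $PBS_O_WORKDIR
idba_ud -r BP101_1pct.fasta -o idba_BP101_1pct --num_threads 8
```

And here is a minimal reconstruction of what calc_contig_stats.py does (my sketch, not the original file). The percent-of-reads value comes from mapping reads back to the assembly, so it isn't computed from the contigs here:

```python
#!/usr/bin/env python
# calc_contig_stats.py (sketch): basic assembly stats from a fasta of contigs
import sys

def read_lengths(fasta_path):
    """Yield the length of each sequence in a (possibly wrapped) fasta file."""
    length = 0
    with open(fasta_path) as handle:
        for line in handle:
            if line.startswith(">"):
                if length:
                    yield length
                length = 0
            else:
                length += len(line.strip())
    if length:
        yield length

def n50(lengths):
    """Smallest contig length among the contigs covering half the assembly."""
    lengths = sorted(lengths, reverse=True)
    half = sum(lengths) / 2.0
    running = 0
    for contig_len in lengths:
        running += contig_len
        if running >= half:
            return contig_len
    return 0

if __name__ == "__main__":
    lens = list(read_lengths(sys.argv[1]))
    print("Number of contigs: %d" % len(lens))
    print("N50: %d" % n50(lens))
    print("Longest contig: %d" % max(lens))
    print("Mean contig size: %.1f" % (sum(lens) / float(len(lens))))
    print("Contigs > 1kb: %d" % sum(1 for l in lens if l > 1000))
```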
SPAdes is another fairly recent assembler that has options for handling metagenomic datasets. It comes pre-packaged with Linux binaries that I was able to use out of the box, which is always nice.
Installing using the guidelines provided:
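Roughly (the version and URL here are assumptions; grab whatever the current Linux binaries are):

```bash
# Download and unpack the prebuilt Linux binaries
wget https://github.com/ablab/spades/releases/download/v3.15.5/SPAdes-3.15.5-Linux.tar.gz
tar -xzf SPAdes-3.15.5-Linux.tar.gz
```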
Adding the scripts to my path:
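Just a one-liner in my ~/.bashrc (the path matches wherever you unpacked the tarball):

```bash
# Make spades.py and friends available from anywhere
export PATH=$PATH:$HOME/SPAdes-3.15.5-Linux/bin
```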
Running the provided test scripts:
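SPAdes ships a built-in smoke test:

```bash
# Assembles a tiny bundled test dataset and checks the output
# in ./spades_test/
spades.py --test
```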
Running my 1% dataset:
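The call was something like this (file names are placeholders; --12 takes interleaved paired-end reads, and since fasta input carries no quality scores, SPAdes' read error correction has to be skipped):

```bash
spades.py --meta --only-assembler --12 BP101_1pct.fasta \
    -o spades_BP101_1pct -t 8 -m 250   # -m caps RAM in GB
```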
Megahit is another de Bruijn graph assembler, created specifically for large, complex metagenomic datasets. It has the option to use a GPU to speed up the assembly, but I haven't tried this out yet.
Testing Megahit on the 1% dataset
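The run itself was along these lines (file names are placeholders):

```bash
# --12 takes the interleaved paired-end file; contigs end up in
# megahit_BP101_1pct/final.contigs.fa
megahit --12 BP101_1pct.fasta -t 8 -o megahit_BP101_1pct
```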
Evaluating Metagenomic Assemblies
Results to date
| | IDBA-UD 1% | IDBA-UD 10% | SPAdes 1% | SPAdes 10% | Megahit 1% | Megahit 10% | Megahit full |
|---|---|---|---|---|---|---|---|
| Contigs > 500bp | 363 | 16,005 | 273 | 12,043 | 250 | 12,211 | 166,899 |
| Contigs > 1kb | 68 | 3,023 | 43 | 1,845 | 44 | 1,955 | 35,348 |
| Total length (>500bp) | 306,037 | 13,467,620 | 213,926 | 9,558,868 | 200,843 | 9,691,657 | 152,384,054 |
| Percent reads used | 2.7 | 9.7 | 4.8 | 14.7 | 2.2 | 9.6 | 23.7 |
IDBA-UD and SPAdes keep running out of RAM on the full dataset. We only have two high-memory nodes on our cluster, so those jobs have been in the queue for a while now waiting for enough RAM. I'll update the table with the results when they finish.
However, I'm pretty happy with Megahit. Its memory usage is much more efficient, its CPU times are very similar to IDBA-UD's, and the stats are comparable (slightly lower percent of reads used, slightly shorter contigs on average, the longest contigs tend to be shorter, etc.). That said, I am not benchmarking these assemblers against mock datasets, so I don't know the fidelity of the contigs.
I will probably end up moving this to a new blog post, but since this one is currently open I’ll put the text here for now.
I'm installing CONCOCT on our cluster using Anaconda.
I’m following the full example provided by the CONCOCT developers.
Setting up the environment:
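Something like the following (a sketch; the package list and Python version follow the CONCOCT docs of the time, so treat them as assumptions):

```bash
# Dedicated environment for CONCOCT's Python dependencies
conda create -n concoct_env python=2.7 numpy scipy pandas scikit-learn
source activate concoct_env

# Install CONCOCT itself from source
git clone https://github.com/BinPro/CONCOCT.git
cd CONCOCT && python setup.py install
```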
Cut up the contigs (using the megahit assembly of the 10% dataset)
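CONCOCT ships a helper script for this. As I recall, the invocation is roughly as follows (10 kb chunks with no overlap, per the developers' example, with -m merging the final short chunk into the previous one; flags from memory, so double-check against the script's --help):

```bash
python CONCOCT/scripts/cut_up_fasta.py -c 10000 -o 0 -m \
    megahit_BP101_10pct/final.contigs.fa > contigs_c10K.fa
```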
Mapping the raw reads back onto the contigs to estimate coverage of each contig.
I need to feed the CONCOCT script non-interleaved fasta files. Unfortunately, I don't have the raw files that were used to make the interleaved ones, so here is a quick perl-based one-liner that splits the file into R1 and R2 depending on the /1 or /2 at the end of each sequence header.
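(Reconstructed from memory, so treat it as a sketch; output file names are placeholders.)

```bash
perl -ne 'BEGIN { open(R1, ">", "reads_R1.fasta"); open(R2, ">", "reads_R2.fasta") }
          if ($fh)              { print {$fh} $_; $fh = 0 }   # line after a header: the sequence
          elsif (m{^>.*/1\s*$}) { print R1 $_; $fh = \*R1 }   # /1 header -> R1
          elsif (m{^>.*/2\s*$}) { print R2 $_; $fh = \*R2 }   # /2 header -> R2
         ' interleaved.fasta
```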
Basically, check whether the header ends in /1 or /2; if so, print it and route the following sequence line to the same file. I don't bother moving ahead two lines after each record since I don't need it optimized.
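With the R1 and R2 files in hand, the mapping itself is a standard bowtie2-plus-samtools pipeline (a sketch, not the exact script from the CONCOCT example; file names are placeholders):

```bash
# Index the cut-up contigs, map the reads (-f = fasta input),
# and produce a sorted, indexed BAM for the coverage table
bowtie2-build contigs_c10K.fa contigs_c10K
bowtie2 -x contigs_c10K -f -1 reads_R1.fasta -2 reads_R2.fasta -p 8 \
    | samtools view -bS - \
    | samtools sort -o BP101_sorted.bam -
samtools index BP101_sorted.bam
```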