Testing Metagenomic Assemblers

IDBA-UD
SPAdes
Megahit
QUAST
Results to date
CONCOCT binning

After working on this for awhile, I stumbled across a blog maintained by Amanda. I borrowed fairly heavily from some of her experiences and examples. I recommend you check it out, lots of nice, clear posts!

IDBA-UD

I am planning on generating fairly large metagenomic and metatranscriptomic datasets within the next few months. Since I have a little bit of down time while some 16S datasets run through my pipeline, I thought I’d play around with some simulated and publicly available datasets.

One of my good friends recommended starting with IDBA-UD. I thought I’d try this medium to record my experiences as I’m in the middle of updating my personal website as well.

Installing IDBA:

git clone https://github.com/loneknightpy/idba
cd idba
./build.sh

Using the author’s simulated datasets to test the install

#Copy the provided data files, un-tar them, and enter the directory
wget http://hku-idba.googlecode.com/files/lacto-genus.tar.gz
tar -zxvf lacto-genus.tar.gz
cd lacto-genus

#Run the sim_reads command as provided by the authors
sim_reads sim_reads 220668.fa 220668.reads-10 --paired --depth 10 & sim_reads 321956.fa 321956.reads-100 --paired --depth 100 & sim_reads 557433.fa 557433.reads-1000 --paired --depth 1000 

#concatenate the simulated files into "meta.fa"
ls | grep "reads" | xargs -I file cat file >> meta.fa

#Test the idba_ud command
idba_ud -r meta.fa -o idba_meta_assembly

Testing on a different dataset Using only sample BP101

Using multiple different sizes:

head -n 51855188 BP101.CoupledReads.fa > BP101_10.fa &
head -n 5185518 BP101.CoupledReads.fa > BP101_1.fa &

#$INPUT=BP101.CoupledReads.fa, BP101_10.fa, BP101_1.fa
#$OUTPUT=idba_full_mason_seqs, idba_bp101_1, idba_bp101_10
time idba_ud -r $INPUT --pre_correction -o $OUTPUT --num_threads $PROCS

calc_contig_stats.py -i contig.fa -r ../BP101_CoupledReads.fa

idba_ud pbs script calc_contig_stats.py is a little script I wrote that calculates a few of the assembly values I’m interested in, names: - Number of Contigs - N50 value - Length of longest contig - Mean contig size - Number of contigs > 1kb - Percent of reads used to make assembly

SPAdes

SPAdes is another fairly recent assembler that has options to handle metagenomic datasets. It comes pre-packaged with linux binaries that I was able to use out of the box, always nice.

Installing using the guidelines provided:

wget http://spades.bioinf.spbau.ru/release3.7.1/SPAdes-3.7.1.tar.gz
tar -zxvf SPAdes-3.7.1.tar.gz
cd SPAdes-3.7.1

Adding the scripts to my path:

emacs ~/.bashrc
export PATH=$PATH:$HOME/data/program_files/SPAdes-3.7.1/bin
source ~/.bashrc

Running the provided test scripts:

spades.py --test

Running my 1% dataset:

spades.py -o spades_bp101_1 --meta --12 BP101_1.fa --only-assembler

spades pbs script

Megahit

Another de Bruijn graph assembler that was created specifically for large complex metagenomic datasets. It has the option to use a GPU to increase the speed of the assembly, but I haven’t tried this out yet.

Installing Megahit

git clone https://github.com/voutcn/megahit.git
cd megahit
make

Testing Megahit on the 1% dataset

$HOME/data/program_files/megahit/megahit -12 $INPUT -o $OUTPUT -t $PROCS --presets meta-large

megahit pbs script

QUAST

Evaluating Metagenomic Assemblies

wget https://downloads.sourceforge.net/project/quast/quast-3.2.tar.gz
tar -xzf quast-3.2.tar.gz
cd quast-3.2

python metaquast.py --test-no-ref

metaquast.py --max-ref-num 0 --threads 10 contig.fa

cat quast_results/latest/report.txt

Results to date

	IDBA_UD_1%	IDBA_UD_10%	IDBA_UD_100%	SPAdes_1%	SPAdes_10%	SPAdes_100%	Megahit_1%	Megahit_10%	Megahit_100%
Time	13m34	161m24	FAILED	59m32	997m25	FAILED	11m40	167m34	701m21
Memory (Gb)			192			102			48
Contigs	1,834	62,749		15,074	304,115		1,430	64,320	771,306
Contigs > 500bp	363	16,005		273	12,043		250	12,211	166,899
Contigs > 1kb	68	3,023		43	1,845		44	1,955	35,348
N50	414	459		237	257		387	405	442
N50 (>500bp)	818	809		736	744		771	759	883
Max Contig	5,485	24,394		5,441	22,375		5,473	14,971	71,426
Mean Contig	453	487		224	267		409	422	460
Total Length (>500bp)	306,037	13,467,620		213,926	9,558,868		200,843	9,691,657	152,384,054
Percent Reads Used	2.7	9.7		4.8	14.7		2.2	9.6	23.7

IDBA_UD and SPAdes keep running out of RAM on the full dataset. We only have 2 high memory nodes on our cluster, so they’ve bee in queue for awhile now waiting for enough ram. I’ll update the table with the results.

However, I’m pretty happy with Megahit. Much more efficient memory usage, CPU times are very similar to IDBA_UD, and the stats are similar (slightly lower % reads used, slightly shorter contigs on average, the longest contigs tend to be shorter etc…) However, I am not benchmarking these aligners with mock datasets, so I don’t know the fidelity of the contigs.

Also, I stumbled across Dr. Titus Brown’s blog again, and his group recommends using Megahit, with a review found here

CONCOCT binning

I will probably end up moving this to a new blog post, but since this one is currently open I’ll put the text here for now.

I’m installing CONCOCT on our cluster using anaconda

# Installing dependencies

#bedtools2
git clone https://github.com/arq5x/bedtools2
cd bedtools2
make

#gnu parallel - availabe on our cluster already
module load gnuparallel/20150422

#samtools - already installed

#bowtie2 - already installed

#blast - already installed

#picard - available on cluster
module load picardtools/1.93

#Installing CONCOCT
module load anaconda2/2.1.0
conda create -n concoct_env python=2.7.6

source activate concoct_env

conda install cython numpy scipy biopython pandas pip scikit-learn

git clone http://github.com/BinPro/CONCOCT

python setup.py install

Running CONCOCT

I’m following the full example provided by the CONCOCT developers.

Setting up the environment:

module load gnuparallel/20150422; module load picardtools/1.93; module load samtools; module load anaconda2/2.1.0
source activate concoct_env

CONCOCT=$HOME/data/program_files/CONCOCT/
export MRKDUP=/usr/local/packages/picardtools/1.93/lib/MarkDuplicates.jar
export PATH=$PATH:$HOME/data/program_files/bedtools2/bin

Cut up the contigs (using the megahit assembly of the 10% dataset)

cd ~/scratch/megahit_bp101_10/
mkdir concoct
cd concoct

#-c = chuck size, -o = overlap size, -m = flag, concatenate final part to last contig
$CONCOCT/scripts/cut_up_fasta.py -c 10000 -o 0 -m ../final.contigs.fa > final.contigs_c10k.fa

Mapping the raw reads back onto the contigs to estimate coverage of each contig.

I need to feed the CONCOCT script non-interleaved fasta files. Unfortunately, I don’t have the raw files that were used to make the interleaved ones. Here is a quick perl based one-liner that splits your file into R1 and R2 depending on /1 or /2 at the end of the sequence head.

Basically, check if the header has a /1 or /2, if so, print the next line. I don’t bother moving ahead 2 lines since I don’t need it optimized.

perl -ne 'BEGIN {$next = 0;} if ($next > 0) {print $_; $next = 0;} else {if ($_ =~ m/\/2/) {print $_; $next = 1;} }'

bowtie2-build final.contigs_c10k.fa final.contigs_c10k.fa

/nv/hp10/woverholt3/data/program_files/CONCOCT/scripts/map-bowtie2-markduplicates.sh -ct 1 -p '-f' ../../BP101_10_R1.fa ../../BP101_10_R2.fa pair final.contigs_c10k.fa asm bowtie2
#pair = name for the sample used (name refering to R1 & R2)
#asm = name of the assembly used
#results in a file: "assembly_name"_"pair_name".*

python /nv/hp10/woverholt3/data/program_files/CONCOCT/scripts/gen_input_tab
le.py --isbedfiles --samplenames samplename.txt ../final.contigs_c10k.fa asm_pair-smds.coverage > concoct_inputtable.tsv

cut -f1,3- concoct_inputtable.tsv > concoct_inputtableR.tsv

concoct -c 40 --coverage_file bowtie2/concoct_inputtableR.tsv --composition_file ../final.contigs.fa -b concoct_output/

Will A. Overholt

Will's personal website