Testing Megahit metagenomic assembly & binning approaches on subsampled metagenomes

By Will Overholt

Subsampling the metagenomes

The first thing I need to do is subsample each of the individual metagenomes. I opted for a 1% subset so the following processes can run fairly quickly (within a day), making the debugging and testing process that much faster.

As usual I’m using the Enveomics script FastA.subsample.pl to accomplish this. I used our cluster’s multiple job submission format with these two scripts (pbs script, [submission script](/assets/internal_files/submit_multiple_qsub_subsample.sh)).
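For reference, the subsampling amounts to something like the loop below. This is a sketch: the -f flag (sample fraction as a percentage) and the default output naming, which gave files like BP101.CoupledReads.fa.1.0000-1.fa, are assumptions from memory of the Enveomics script, so check its help text; on the cluster each library ran as its own job instead of a loop.

#subsample each library to 1% (flags assumed; verify with the script's usage)
for LIB in ind_libs/*.CoupledReads.fa; do FastA.subsample.pl -f 1 "$LIB"; done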

I then concatenate all the subsampled libraries into one fasta file.

for FILE in ind_libs/*; do cat "$FILE" >> all_1perc.fa; done

I re-used my previous megahit command, trying to get it done in 12 hours with 40 GB of RAM.
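That command isn’t reproduced in my notes, but it was along these lines (a sketch, not the exact call; -r treats the pooled reads as single-end, and --12 would be the choice if megahit should treat the file as interleaved pairs):

#-t sets threads; -m caps SdBG construction memory in bytes (~40 GB here)
megahit -r all_1perc.fa -o megahit -t 12 -m 40e9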

Building the bowtie2 index from the final megahit contigs

bowtie2-build ../../megahit/final.contigs.fa all_deepc_1perc

Mapping individual sample libraries to the index

bowtie2 -x all_deepc_1perc -U ../../ind_libs/BP101.CoupledReads.fa.1.0000-1.fa -S BP101.sam -f

Converting the produced SAM file to its corresponding binary BAM file

samtools view -bS BP101.sam > BP101.bam

Running in a loop

find ../../ind_libs/ -type f | xargs -I{} bowtie2 -x all_deepc_1perc -U {} -S {}.sam -f

for FILE in $(find ./ -name "*.sam" -type f); do ~/data/program_files/metagenomic_binning/berkeleylab-metabat-cbdca756993e/samtools/bin/samtools view -bS "$FILE" > "$FILE.bam"; done

#old-style samtools sort: the second argument is an output prefix, writing $FILE.sorted.bam
for FILE in $(find ./ -name "*.bam" -type f); do ~/data/program_files/metagenomic_binning/berkeleylab-metabat-cbdca756993e/samtools/bin/samtools sort "$FILE" "$FILE.sorted"; done

Running metaBAT with the default command

~/data/program_files/metagenomic_binning/berkeleylab-metabat-cbdca756993e/runMetaBat.sh ../megahit/final.contigs.fa bowtie_mapping/sam_files/*.bam
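Under the hood runMetaBat.sh wraps two steps, roughly equivalent to the calls below (a sketch; the depth file name and bin prefix are illustrative):

#1. summarize per-contig, per-sample coverage from the sorted BAMs
jgi_summarize_bam_contig_depths --outputDepth depth.txt bowtie_mapping/sam_files/*.sorted.bam
#2. bin the contigs using that depth table
metabat -i ../megahit/final.contigs.fa -a depth.txt -o metabat_bins/bin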

## Metagenomic binning on the uncontaminated oil samples

Looks like I went ahead and did a lot of work without documenting everything, shame! Running megahit on the full dataset failed to finish, so I separated the dataset into clean samples and oiled samples. Since I only care about the clean samples, I pooled all of them into one fasta file and ran megahit on that using the above command.
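A sketch of that pooling and re-assembly step, assuming the clean libraries sit in a clean_libs/ directory (the directory name is illustrative):

#pool the clean-sample reads and re-assemble with the same megahit settings
cat clean_libs/*.CoupledReads.fa > all_clean.fa
megahit -r all_clean.fa -o megahit_full_clean -t 12 -m 40e9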

From there the process is exactly as described above, except I used multiple job submissions instead of loops to compute each sample’s coverage.

#building the bowtie index from the megahit assembled contigs
bowtie2-build final.contigs.fa clean_samps

#Mapping each individual sample's sequences back to this index in parallel using our cluster's multiple job submission format
~/job_scripts/metagenome_seqs/submit_multiple_qsub_subsample.sh

[pbs script](/assets/internal_files/multiple_qsub_bowtie.pbs)
[submission script](/assets/internal_files/submit_multiple_qsub_bowtie.sh).
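The pbs script linked above boils down to one mapping job per sample, roughly like this (a sketch with assumed resource requests; $SAMPLE is substituted by the submission script):

#PBS -l nodes=1:ppn=4
#PBS -l walltime=12:00:00
cd $PBS_O_WORKDIR
#map one subsampled library against the pooled-assembly index
bowtie2 -f -x clean_samps -U ind_libs/$SAMPLE -S sam_files/$SAMPLE.sam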

#Converting SAM files to BAM files with samtools
[pbs script](/assets/internal_files/multiple_qsub_sam2bam.pbs)
[submission script](/assets/internal_files/multiple_qsub_sam2bam.pbs)

#Running metabat with default parameters
"$HOME/data/program_files/metagenomic_binning/berkeleylab-metabat-cbdca756993e/runMetaBat.sh -t $PROCS path/megahit_full_clean/final.contigs.fa path/../bam_files/*.bam"
[pbs script](/assets/internal_files/metahit.pbs)