Preprocessing Steps - Poorly Annotated at the moment
Done to Date:
I do not have copies of the preprocessing commands. Further down I will be providing example commands.
All libraries used are from the Michigan State sequencing facility. I ended up discarding our ANL reads.
All raw reads were merged using PEAR.
Assembled reads were quality filtered at q30 using QIIME split_libraries_fastq.py.
Mothur was used to trim the sequences and remove sequences <250bp and >255bp. The primers had been removed by the sequencing facility.
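The original Mothur command is not recorded. As a rough illustration of the length filter only, here is a minimal Python sketch that keeps reads of 250 to 255 bp inclusive. `read_fasta` and `filter_by_length` are hypothetical helper names, not part of Mothur or QIIME.

```python
def read_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA file handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_by_length(records, min_len=250, max_len=255):
    """Keep records whose sequence length is within [min_len, max_len].

    The 250/255 defaults mirror the cutoffs described above; reads
    shorter than 250 bp or longer than 255 bp are dropped.
    """
    for header, seq in records:
        if min_len <= len(seq) <= max_len:
            yield header, seq
```

Both functions are generators, so a large library can be streamed through the filter without loading it into memory.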
## Chimera Detection and Removal with usearch7
Due to the memory constraints on the free version of usearch7, I split the merged sample library back into individual sequence files.
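A minimal Python sketch of that split, not the command that was actually used. It assumes the merged library carries QIIME-style sequence IDs of the form `SampleID_N` as the first token of each header (an assumption about this dataset); `sample_id_from_header` and `split_fasta_by_sample` are hypothetical names.

```python
from contextlib import ExitStack
from pathlib import Path


def sample_id_from_header(header):
    """Return the sample part of a 'SampleID_N extra text' header.

    Assumption: the sample ID is everything before the last underscore
    in the first whitespace-delimited token.
    """
    seq_id = header.split()[0]
    return seq_id.rsplit("_", 1)[0]


def split_fasta_by_sample(records, out_dir):
    """Write (header, sequence) records to one FASTA file per sample.

    Returns a dict mapping sample ID to the number of reads written.
    One file handle stays open per sample, so a library with thousands
    of samples could hit the OS open-file limit.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    counts = {}
    with ExitStack() as stack:
        handles = {}
        for header, seq in records:
            sample = sample_id_from_header(header)
            if sample not in handles:
                path = out_dir / f"{sample}.fasta"
                handles[sample] = stack.enter_context(open(path, "w"))
                counts[sample] = 0
            handles[sample].write(f">{header}\n{seq}\n")
            counts[sample] += 1
    return counts
```

The returned counts give a quick sanity check that the per-sample totals add up to the size of the merged library.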
Next, using the GA Tech biocluster environment, I ran the following series of commands in parallel on each individual library to remove chimeras:
1) Dereplicate the files (the && waits for the command to finish with an exit status of 0 before moving to the next command)
2) Identify chimeras using de novo detection and export the nonchimeras
3) From the remaining sequences, identify reference-based chimeras against SILVA’s gold database
4) Convert the dereplication UC file (from step 1) to a QIIME mapping file
I need this to go back and “rereplicate” the sequences before OTU picking
5) Identify dereplicated sequence headers representing nonchimeras (each representing a cluster of identical sequences). I grab only those with a size greater than 1, since the QIIME mapping file does not include singletons (these are handled next).
6) Identify all the nonchimeric singletons (missing from the otu mapping file)
7) Merge the two sequence ID lists together as input for qiime’s filter_fasta
Note the double escapes in the perl command. I never quite figured out why this was necessary, but it took me WAAY too long to get it to work.
8) Use QIIME to get all the nonchimeric sequences from the original fasta file using the ID’d “good” reads.
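The exact commands for steps 4 to 7 are not recorded, so the following is only a minimal Python sketch of the "rereplication" idea, not the usearch/perl/QIIME commands that were run. It assumes the standard tab-separated UC layout (record type in column 1, query label in column 9, target label in column 10), that nonchimera labels may carry a trailing `;size=N;` annotation, and that the sequence ID is the first whitespace-delimited token of a label. All function names are hypothetical.

```python
import re

_SIZE_RE = re.compile(r";size=\d+;?$")


def seq_id(label):
    """Strip a trailing ';size=N;' annotation and keep the first token."""
    return _SIZE_RE.sub("", label.strip()).split()[0]


def clusters_from_uc(lines):
    """Map each dereplicated seed ID to the IDs of all identical reads.

    'S' records define a seed; 'H' records are duplicates whose target
    label (column 10) names their seed. Other record types are ignored.
    """
    clusters = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 10 or line.startswith("#"):
            continue
        rec_type, query, target = fields[0], fields[8], fields[9]
        if rec_type == "S":
            seed = seq_id(query)
            clusters.setdefault(seed, [seed])
        elif rec_type == "H":
            seed = seq_id(target)
            clusters.setdefault(seed, [seed]).append(seq_id(query))
    return clusters


def rereplicate(nonchimeric_labels, clusters):
    """Expand nonchimeric seed labels back to every original read ID.

    Singletons are covered too: a seed with no duplicates expands to
    itself, so no separate singleton pass is needed in this sketch.
    """
    keep = []
    for label in nonchimeric_labels:
        seed = seq_id(label)
        keep.extend(clusters.get(seed, [seed]))
    return keep
```

The resulting ID list plays the role of the merged list from step 7, i.e. the set of "good" read IDs used to filter the original fasta file in step 8.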
Following the pipeline, there were 43 fasta files present as input but missing from the chimera removal output.
Writing a quick bash script to figure out whether those fasta files were run but exited with an error.
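The bash script itself is not recorded. The same check can be sketched in Python under two assumptions: each input `NAME.fasta` should produce an output named `NAME.final.fasta` (the exact naming is a guess based on the ".final." files mentioned in these notes), and an empty output counts as a failed job. `missing_outputs` is a hypothetical helper.

```python
from pathlib import Path


def missing_outputs(input_dir, output_dir,
                    in_suffix=".fasta", out_suffix=".final.fasta"):
    """List input files with no (or an empty) chimera-checked output."""
    input_dir, output_dir = Path(input_dir), Path(output_dir)
    missing = []
    for fasta in sorted(input_dir.glob(f"*{in_suffix}")):
        # Skip outputs if inputs and outputs share one directory.
        if fasta.name.endswith(out_suffix):
            continue
        stem = fasta.name[: -len(in_suffix)]
        expected = output_dir / f"{stem}{out_suffix}"
        # A zero-byte file usually means the job died partway through.
        if not expected.is_file() or expected.stat().st_size == 0:
            missing.append(fasta.name)
    return missing
```

Running it after each batch gives the list of libraries to resubmit, and `len(missing) / total` gives the failure rate for that round.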
Dammit, something weird is going on where not all the jobs are getting executed. First time around ~5% of the jobs failed. This time around 11% failed (5 files).
Next I want to move all the .final. files to the same directory and delete everything else to clean up my directory (plus it takes forever to loop over!).
Move all the final chimera checked files into the same directory & ensure each original fasta file is accounted for
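A minimal Python sketch of the clean-up, assuming the finished files can be recognised by ".final." in their names (the exact file names are not recorded). `collect_final_files` is a hypothetical helper; it only moves files and deliberately deletes nothing, so intermediates can be removed by hand once the counts match.

```python
import shutil
from pathlib import Path


def collect_final_files(work_dir, dest_dir, marker=".final."):
    """Move every file whose name contains `marker` into dest_dir.

    Returns the names of the files that were moved. Raises if two
    files would collide on the same name rather than overwriting.
    """
    work_dir = Path(work_dir).resolve()
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest_dir = dest_dir.resolve()
    moved = []
    # sorted() materialises the listing before any file is moved.
    for path in sorted(work_dir.rglob(f"*{marker}*")):
        if not path.is_file() or dest_dir in path.parents:
            continue  # skip directories and files already collected
        target = dest_dir / path.name
        if target.exists():
            raise FileExistsError(f"{target} already exists")
        shutil.move(str(path), str(target))
        moved.append(path.name)
    return moved
```

Comparing `len(moved)` with the number of original fasta files is a quick way to confirm every library is accounted for.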
Although I’m well aware the following is a suboptimal approach after the recent publications from the Schloss lab, I’m stuck with trying to
Using the QIIME open reference (take 20) pipeline. I’ve upped the percent subsampling to 10% to try and see if I can reduce the size of failures.failures that get passed to step4.
Trying to fix a previous run that used 0.1% subsampling (the QIIME default). At the end of step3 I had 16 million sequences left (a lot, but it didn’t seem too bad). However, after 13 days of de novo clustering the job still had not completed, so I thought I’d try a different strategy.
It seems like something happened to a subset of these sequences that is preventing them from clustering well. I’m going to re-subsample at 10% and see if I can figure out what is going on.
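To make the effect of the subsampling fraction concrete, here is a minimal Python sketch of random subsampling by fraction. It is an illustration only and not QIIME's own implementation; `subsample_ids` and its `seed` parameter are hypothetical.

```python
import random


def subsample_ids(seq_ids, fraction, seed=0):
    """Keep each sequence ID independently with probability `fraction`.

    fraction=0.001 corresponds to 0.1% subsampling and fraction=0.1 to
    10%. The result has about fraction * len(seq_ids) entries, in the
    original order; a fixed seed makes the draw repeatable.
    """
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    rng = random.Random(seed)
    return [s for s in seq_ids if rng.random() < fraction]
```

Going from 0.1% to 10% sends roughly 100 times more of the failures into the earlier clustering step, which is the lever being used here to shrink what is left over for step4.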