# Preprocessing Steps - Poorly Annotated at the Moment
Done to date: I do not have copies of the original preprocessing commands; example commands are provided further down.
All libraries used are from the Michigan State sequencing facility. I ended up discarding our ANL reads.
All raw reads were merged using PEAR.
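Since the exact command wasn't saved, here is a minimal sketch of a per-sample PEAR run (the file names are placeholders, not the real libraries); the merged reads end up in sample_merged.assembled.fastq:
#Merge each sample's forward/reverse read pair
pear -f sample_R1.fastq -r sample_R2.fastq -o sample_merged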
Assembled reads were quality filtered at q30 using QIIME's split_libraries_fastq.py.
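This command is likewise reconstructed rather than copied (QIIME's -q flag is the maximum unacceptable Phred score, so -q 29 implements a q30 filter; the sample ID is a placeholder since the reads were already demultiplexed):
split_libraries_fastq.py -i sample_merged.assembled.fastq -o slout_q30/ --sample_ids Sample1 --barcode_type 'not-barcoded' -q 29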
Mothur was used to screen the sequences, removing any shorter than 250 bp, longer than 255 bp, or containing homopolymer runs longer than 7 bp. The primers had already been removed by the sequencing facility.
mothur > trim.seqs(fasta=all_kostka_seqs.fna, maxhomop=7, minlength=250, maxlength=255)
mothur > summary.seqs(fasta=all_kostka_seqs.trim.fasta)
| | Start | End | NBases | Ambigs | Polymer | NumSeqs |
|---|---|---|---|---|---|---|
| Minimum: | 1 | 250 | 250 | 0 | 3 | 1 |
| 2.5%-tile: | 1 | 253 | 253 | 0 | 3 | 1837913 |
| 25%-tile: | 1 | 253 | 253 | 0 | 4 | 18379129 |
| Median: | 1 | 253 | 253 | 0 | 4 | 36758257 |
| 75%-tile: | 1 | 253 | 253 | 0 | 5 | 55137385 |
| 97.5%-tile: | 1 | 254 | 254 | 0 | 6 | 71678601 |
| Maximum: | 1 | 255 | 255 | 0 | 7 | 73516513 |
| Mean: | 1 | 253.071 | 253.071 | 0 | 4.4815 | |
| # of Seqs: | | | | | | 73516513 |
## Chimera Detection and Removal with usearch7
Due to the memory constraints on the free version of usearch7, I split the merged sample library back into individual per-sample sequence files:
split_sequence_file_on_sample_ids.py -i $HOME/data/qiime_files/all_gom_seqs/all_kostka_seqs.trim.fasta -o ind_samp_seqs
Next, using the GA Tech biocluster environment, I ran the following series of commands in parallel on each individual library to remove chimeras:
1) Dereplicate the files (the && only runs the next command if the previous one finished with exit status 0)
usearch -derep_fulllength ${INFILE} -output ${OUTFILE} -uc ${UCFILE} -sizeout &&
2) Identify chimeras using de novo detection and export the nonchimeras
usearch -uchime_denovo ${OUTFILE} -nonchimeras ${NONCHIMFILE} &&
3) From the remaining sequences, identify reference-based chimeras against SILVA's gold database
usearch1.7.0 -uchime_ref ${NONCHIMFILE} -db /nv/hp10/woverholt3/data/program_files/Silva_ref_dbs/silva.gold.notalign.fasta -nonchimeras ${NONCHIMNONREF} -strand plus &&
4) Convert the dereplication UC file (from step 1) to a QIIME mapping file. I need this to go back and “rereplicate” the sequences before OTU picking (toy format examples follow step 8).
convert_uc2map.py ${UCFILE} > ${MAPFILE} &&
5) Identify dereplicated sequence headers representing nonchimeras (each representing a cluster of identical sequences). I grab only those greater than size 1, since the QIIME mapping file does not have singletons present (those are handled next).
perl -ne 'if ($_ =~ m/>/ && $_ !~ m/size=1;/) {($ID = $_) =~ s/>(.*);size.*/$1/; print $ID;}' ${NONCHIMNONREF} > ${GOODOTUFILE} &&
6) Identify all the nonchimeric singletons (missing from the otu mapping file)
perl -ne 'if ($_ =~ m/>/ && $_ =~ m/size=1;/) { ($ID = $_) =~ s/>(.*);size.*/$1/; print $ID }' ${NONCHIMNONREF} > ${GOODSEQIDS} &&
7) Merge the two sequence ID lists together as input for QIIME's filter_fasta. Note the double escapes in the perl command; I never quite figured out why they were necessary (presumably an extra layer of shell quoting eats one backslash), but it took me WAY too long to get it to work.
join -j 1 <(sort -k 1 ${GOODOTUFILE}) <(sort -k 1 ${MAPFILE}) | perl -pe "s/\\s/\\n/g" >> ${GOODSEQIDS} &&
8) Use QIIME to get all the nonchimeric sequences from the original fasta file using the ID’d “good” reads.
filter_fasta.py -f ${INFILE} -o ${NOCHIM} -s ${GOODSEQIDS}
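For orientation, two toy format examples (illustrative only, not real data): the perl one-liners in steps 5-6 key off the ;size=N; annotations that usearch's -sizeout adds to the dereplicated headers, and the join in step 7 assumes the mapping file puts the dereplicated representative ID in column 1 followed by its member sequence IDs (the exact layout written by convert_uc2map.py may differ):
#Dereplicated fasta header written with -sizeout (a cluster of 17 identical reads):
>Sample1_1234;size=17;
#QIIME-style OTU map line: rep ID, then the IDs of the identical reads it represents (tab-separated)
Sample1_1234	Sample1_2201	Sample1_5407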
Links to the pipeline PBS script and the multiple submission shell script that contains the variable names needed
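Those scripts aren't reproduced in this notebook; as a rough sketch, the PBS script body just chains steps 1-8 above (the PBS directives here are placeholders, and the variables arrive via qsub -v as shown in the resubmission script further down):
#!/bin/bash
#PBS -N chimera_detection
#PBS -l nodes=1:ppn=1
cd $PBS_O_WORKDIR
usearch -derep_fulllength ${INFILE} -output ${OUTFILE} -uc ${UCFILE} -sizeout &&
usearch -uchime_denovo ${OUTFILE} -nonchimeras ${NONCHIMFILE} &&
usearch1.7.0 -uchime_ref ${NONCHIMFILE} -db /nv/hp10/woverholt3/data/program_files/Silva_ref_dbs/silva.gold.notalign.fasta -nonchimeras ${NONCHIMNONREF} -strand plus &&
convert_uc2map.py ${UCFILE} > ${MAPFILE} &&
perl -ne 'if ($_ =~ m/>/ && $_ !~ m/size=1;/) {($ID = $_) =~ s/>(.*);size.*/$1/; print $ID;}' ${NONCHIMNONREF} > ${GOODOTUFILE} &&
perl -ne 'if ($_ =~ m/>/ && $_ =~ m/size=1;/) { ($ID = $_) =~ s/>(.*);size.*/$1/; print $ID }' ${NONCHIMNONREF} > ${GOODSEQIDS} &&
join -j 1 <(sort -k 1 ${GOODOTUFILE}) <(sort -k 1 ${MAPFILE}) | perl -pe "s/\\s/\\n/g" >> ${GOODSEQIDS} &&
filter_fasta.py -f ${INFILE} -o ${NOCHIM} -s ${GOODSEQIDS}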
## Testing results and troubleshooting
After running the pipeline, 43 of the input fasta files were missing from the chimera-removal output.
#To count the number of samples used in the input
#path was $HOME/data/qiime_files/all_gom_seqs/ind_fasta_files
find ./ -maxdepth 1 -type f > ../list_of_ind_fasta_files.txt
#Strip the leading ./ so only the file names remain
perl -i.bak -pe 's/.*\/(.*)/$1/g' list_of_ind_fasta_files.txt
#To count the output files
find denovo_chimera2/ -maxdepth 1 -regex ".*final.*" > ../list_of_final_chim.txt
#Strip the leading path and the .nochim.final tag so the names match the input list
perl -i.bak -pe 's/.*\/(.*)\.nochim\.final(\..*)/$1$2/g' list_of_final_chim.txt
#Find samples that don't exist in the final_chim file
comm -23 <(sort list_of_ind_fasta_files.txt) <(sort list_of_final_chim.txt)
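A quick toy example of the comm -23 behavior relied on here (it prints the lines unique to the first sorted input):
comm -23 <(printf 'a\nb\nc\n') <(printf 'b\n')
#a
#c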
I wrote a quick bash script to check whether the missing fasta files had actually been run but exited with an error:
#!/bin/bash
WORK_DIR=$HOME/job_output_files/
F_LIST=$HOME/samples_missing_chim_detect.txt
#Grep every chimera job output file for each missing sample name
for FILE in $(find "$WORK_DIR" -type f -regex ".*chimera.*"); do
    for LINE in $(cat "$F_LIST"); do
        if grep --quiet "$LINE" "$FILE"; then
            echo "$FILE"
        fi
    done
done
#No files detected; looks like they just weren't run for some reason? Going to resubmit just them!
#Trying to find the raw fasta files that were skipped
find $HOME/data/qiime_files/all_gom_seqs/ind_samp_seqs/ -maxdepth 1 -type f > full_path_list_all_ind_fasta_files.txt
##################
#!/bin/bash
#Base directory for the chimera outputs (matches the output path noted below)
WORK_DIR=$HOME/data/qiime_files/all_gom_seqs/ind_samp_seqs
mkdir -p "$WORK_DIR/denovo_chimera_fix_failures"
#List of all the individual fasta files
FILE_LIST=$HOME/data/qiime_files/all_gom_seqs/full_path_list_all_ind_fasta_files.txt
#List of those missing from the first round
FAILED_LIST=$HOME/data/qiime_files/all_gom_seqs/samples_missing_chim_detect.txt
#Loop over each individual file
for FILE in $(cat "$FILE_LIST"); do
    #Loop over each line in the failed list (only 43) & if they match then execute the pipeline
    for LINE in $(cat "$FAILED_LIST"); do
        #echo "current file is $FILE";
        #echo "the matching pattern from failures is: $LINE";
        if [[ $FILE =~ $LINE ]]; then
            echo "$FILE"
            BASE_NAME=$(basename "${FILE%.fasta}")
            BASE="$WORK_DIR/denovo_chimera_fix_failures/$BASE_NAME"
            qsub -v INFILE=$FILE,OUTFILE="$BASE.derep.fasta",UCFILE="$BASE.uc",NONCHIMFILE="$BASE.nonchimeras.fasta",NONCHIMNONREF="$BASE.nonchim.nonchimref.fasta",MAPFILE="$BASE.uc.map",GOODSEQIDS="$BASE.goodseqids.txt",GOODOTUFILE="$BASE.goodotuids.txt",NOCHIM="$BASE.nochim.final.fasta" job_scripts/multiple_qsub_chimera_detection.pbs
        fi
    done
done
######################
Dammit, something weird is going on where not all the jobs are getting executed. The first time around, ~5% of the jobs failed; this time, 11% (5 of the 43 files) failed.
#$HOME/data/qiime_files/all_gom_seqs/ind_samp_seqs/denovo_chimera_fix_failures
find ./ -name "*.final.*" -type f | perl -pe 's/\.\/(.*)\.nochim\.final(.*)/$1$2/g' > ../../samples_missing_2nd_try.txt
comm -23 <(sort ../../samples_missing_chim_detect.txt) <(sort ../../samples_missing_2nd_try.txt) > ../../samples_to_rerun_2nd.txt
#rerun the above bash script modifying the FAILED_LIST to point to "samples_to_rerun_2nd.txt"
Next, I want to move all the .final. files into the same directory and delete everything else to clean up my directory (plus it takes forever to loop over!).
pwd
#$HOME/data/qiime_files/all_gom_seqs/ind_samp_seqs/denovo_chimera2
find ./ -not -name "*.final.*" -type f -delete
pwd
#$HOME/data/qiime_files/all_gom_seqs/ind_samp_seqs/denovo_chimera_fix_failures/
find ./ -not -name "*.final.*" -type f -delete
Move all the final chimera-checked files into the same directory (denovo_chimera2/) and ensure each original fasta file is accounted for.
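The move itself wasn't recorded; a hypothetical command based on the paths noted above:
#hypothetical; run from $HOME/data/qiime_files/all_gom_seqs
mv ind_samp_seqs/denovo_chimera_fix_failures/*.final.* ind_samp_seqs/denovo_chimera2/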
find ind_samp_seqs/denovo_chimera2/ -type f > list_chim_checked_fasta.txt
perl -i.bak -pe 's/.*\/.*\/(.*)\.nochim\.final(.*)/$1$2/g' list_chim_checked_fasta.txt
#all files that appear in both (result = 890)
comm -12 <(sort list_of_ind_fasta_files.txt) <(sort list_chim_checked_fasta.txt) | wc -l
## OTU Picking
Although I’m well aware the following is a suboptimal approach after the recent publications from the Schloss lab, I’m stuck with it for this dataset. I’m using the QIIME open-reference pipeline (take 20), and I’ve upped the percent subsampling to 10% to try to reduce the number of failures (the failures_failures.fasta file) that get passed to step 4.
#PBS -N qiime_open_ref
#PBS -l nodes=1:ppn=1
#PBS -l mem=5gb
#PBS -l walltime=10:00:00:00
#PBS -q biocluster-6
#PBS -j oe
#PBS -o $HOME/job_output_files/new_deepc_otus.$PBS_JOBID
#PBS -m abe
#PBS -M waoverholt@gmail.com
export WORKON_HOME=$HOME/data/program_files/VirtualEnvs
source $HOME/.local/bin/virtualenvwrapper.sh
export QIIME_CONFIG_FP=/nv/hp10/woverholt3/data/program_files/VirtualEnvs/qiime1.9.1/qiime_config_cluster
workon qiime1.9.1
module load R/3.2.2
INPUT=$HOME/data/qiime_files/all_gom_seqs/all_kostka_seqs.trim_nochim.fasta
OUTPUT=$HOME/data/qiime_files/all_gom_seqs/qiime_open_ref_20160502
PARAMS=$HOME/data/program_files/qiime1.8/qiime_params
pick_open_reference_otus.py -i $INPUT -o $OUTPUT -p $PARAMS -a -O 100 --suppress_align_and_tree -s 0.1
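The params file isn't reproduced here. QIIME parameter files are plain script:option pairs; hypothetical example contents (not the real file):
#hypothetical qiime_params contents
pick_otus:enable_rev_strand_match True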
Trying to fix a previous run that used 0.1% subsampling (the QIIME default). At the end of step 3 I had 16 million sequences left (a lot, but it didn’t seem too bad). However, after trying to de novo cluster them for 13 days with the job still not complete, I thought I’d try a different strategy.
It seems like something happened to a subset of these sequences that is preventing them from clustering well. I’m going to re-subsample at 10% and see if I can figure out what is going on.
Subsampling using an Enveomics script (-f is the percentage of sequences to keep; -r, I believe, is the number of replicate subsamples):
FastA.subsample.pl -f 10 -r 1 failures_failures.fasta