Snakemake pipeline for 16S rRNA amplicon analysis

Will Overholt bio photo By Will Overholt

Installing the pipeline

I have now hosted all files on a public github project snakemake_16S_pipeline. The README file has information about installing everything, but essentially you need python3, miniconda, and snakemake.

First clone the github repository.

mkdir path/to/new/directory
cd path/to/new/directory
git clone https://github.com/waoverholt/snakemake_16S_pipeline
cd snakemake_16S_pipeline
conda env create -n snakemake_16S python=3.5 --file environment.yaml

If everything worked, you should have a virtual environment named “snakemake_16S” that contains all necessary dependencies. To start it:

source activate snakemake_16S
#if conda isn't in your path you need to specify it
source /path/to/conda/install/bin/activate snakemake_16S

Preformating sequence library names

The pipeline assumes you have paired end libraries, which each sample pair in a different file. E.G. “sample1_R1_001.fastq.gz” The extensions can be changed in the config.yaml file.

You may wish to do some initial name cleaning before running the pipeline. E.G. I changed my sample names from WAO_T0C1_S112_L001_R1_001.fastq.gz to WAO_T0C1_R1_001_fastq.gz

Initial configuration & parameters

Check out the config.yaml file. Here you can specify specifics for your sample set. Change the paths for: read_directory: chimera_db:

You may want to change the threads to match your system.

The oligos file should be in the “additional_files” directory, but you can change this path if you need to.

Running snakefile

To start the pipeline simply type:

snakemake --configfile config.yaml --snakefile Snakefile

If you’d like to run the pipeline with some steps in parallel, specify the number of threads available with:

snakemake --configfile config.yaml --snakefile Snakefile -j 7