# Processing R1, R2, and Index Files to be used with DADA2

I was contacted by a collaborator about some sequence analysis I had done years ago. I wanted to update that analysis with current best practices and to leaverage the latest classification databases to hopefully improve some of the results. However, I ran into some fun issues where the sequencing facility gave us the reads in 3 files, the forward reads (R1), the reverse reads (R2), and the index reads (I1). This is a quick summary of how I formatted these files for input into the DADA2 pipeline. I would’ve had to do this anyways to submit the raw sequences to one of the repositories.

##Using QIIME1 to split libraries and generate the fastq files OK, so I found out the hardway (QIIME error) that my barcodes were non-golay, 12bp, and were the reverse complement to what was in the mapping file. I ran the split_libraries_fastq.py command this twice, on each direction to keep the read directions separate for DADA2. I also found out the hardway (oops) that you need to disable the error checking in the script to make sure that no sequences are removed. Essentially, if a sequence from R1 is removed and is retained in R2 it’ll be difficult to match them back up (at least using the approach listed below, you could always go through and keep only shared sequence identifiers if necessary).

Next we want to split the seqs.fastq file into single fastq files per sample using QIIME1. Nothing fancy here.

## Some sanity checks to make sure we didn’t mess up the pair-end orders

Check to make sure they all have the same number of sequences: I did some really ugly piped bash one-liners to do this. Here is the command in all its unholy glory

Basically I’m checking for differences in the number of sequence headers between each file. I tested all the parts individually used head, so something like:

Because the file names are the same in each directory, the default sort puts them in the same order. If this wasn’t true, just add a sort command to your pipe. To furtehr break it down, I’m running two grep -c (count) on each set of folders. I do a super ugly substitution to remove the path of the files in the different directory with sed (backslash to escape the period and forwardslash special characters).

## Check to make sure that the paired-end sequences are in the same order

I modified the above diff to compare the fastq sequence identifiers. My sequence headers look like this: @BW-001_2174333 M00176:78:000000000-A3M8L:1:2114:15243:27709 1:N:0:0 orig_bc=CAGCTCATCAGC new_bc=CAGCTCATCAGC bc_diffs=0 So I grab only this part: M00176:78:000000000-A3M8L:1:2114:15243:27709

## Bonus work on sorting paired-end reads based of their illumina identifier

OK, so the first time I did this, I didn’t catch that error checking would mess up the paired-sequences. I stupidly thought one of the scripts ended up shuffling the sequences (maybe read them into a dictionary or something?), but regardless the sequences looked completely shuffled. I had to work out a way to resort the files based on the illumina-based sequence identifier.

The key step here is this command:

Then I wrap it in a loop to process every fastq file separately and save the results in a new directory called “sorted”.

Fix the sequence numbering. I actually don’t think this matters? But it is SUPER confusing since the QIIME assigned sequence IDs don’t match up (but the sequence identifiers do, which is all that matters). Definitely worth convincing yourself this is true, because it would be a disaster is the reads randomly were joined. Regardless I’m going to use a perl loop to quickly renumber my sequences. Note, the regex I’m using is very specific to this sample set.

The m/^\@BW-[0-9]{3}/ matches my sample names for every sample. Basically says, if the line starts with @BW- and then has a number with 3 digits (000-999). My samples are named BW-001 through BW-199. You would need to change the s/(BW-[0-9]{3})[0-9]+/$1$i/ code as well. Here I place the BW-[0-9]{3} within parentheses to “save” this, then I replace the next number (any number of digits) with my counter, \$i.

You can pipe this to a new file, or use the perl -i operator to do it in place. You want to only do this when you’re sure that you’re loop is working! Here’s how I loop it through my files.