Pathogenseq: FastQ to consensus sequence

Prerequisites

This tutorial requires you to have installed the pathogenseq analysis pipeline. To do so go to https://github.com/jodyphelan/pathogenseq and look for the installation instructions.

Initial setup

You will need the following:

A reference sequence
Your sequencing data in fastQ format

We will be using Mtuberculosis for this but if you have your own data already just perform all the steps with your own fastQ files. The following code downloads all the data you need:

#Getting the raw fastQ data
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR166/009/ERR1664619/ERR1664619_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR166/009/ERR1664619/ERR1664619_2.fastq.gz

# Getting the reference genome
wget ftp://ftp.ensemblgenomes.org/pub/release-32/bacteria//fasta/bacteria_0_collection/mycobacterium_tuberculosis_h37rv/dna/Mycobacterium_tuberculosis_h37rv.ASM19595v2.dna.toplevel.fa.gz
gunzip Mycobacterium_tuberculosis_h37rv.ASM19595v2.dna.toplevel.fa.gz
mv Mycobacterium_tuberculosis_h37rv.ASM19595v2.dna.toplevel.fa H37Rv.fa

Processing steps

Many of these steps make use of multiple cores on your computer and will speed up analysis if more than 1 is used. The -t flag is often used in these scripts followed by the number of cores you want to use. In this example we use -t 40 but you can set the number of the 75% of the number of cores available.

Mapping

First we will align the data using the mapping.py wrapper script:

mapping.py  -1 ERR1664619_1.fastq.gz -2 ERR1664619_2.fastq.gz -p ERR1664619 -r H37Rv.fa -t 40

This will first trim the reads using trimmomatic and then align the trimmed reads to create a BAM file.

Variant calling

Now we can call varaints. We will use the bam2vcf.py wrapper script:

bam2vcf.py ERR1664619.bam H37Rv.fa ERR1664619 -t 40

This will use bcftools to call SNPs. BCFtools does not have a multithreading mode of calling, so this scripts breaks the genome up into chunks and runs the calling algorithm on these in parallel. It also allows the user mark regions as missing accoring to a specified depth cutoff using the --min_dp flag. It also chooses the correct parameters required for Illumina vs minION variant calling. You can also mask primer sites with the --primers flag and a supplied FASTA file with the primers.

Consensus sequence generation

Finally we are ready to generate a consensus using the bcf2consensus.py script:

bcf2consensus.py ERR1664619.gbcf H37Rv.fa

This will create a consensus file for your sample (if a multi-sample bcf is provided a consensus will be generated for each sample). The low-coverage regions and primer regions defined in the previous step will be masked with Ns.

FastQ to consensus sequence