Pathogenseq: FastQ to phylogenetic tree

Prerequisites

This tutorial requires you to have installed the pathogenseq analysis pipeline. To do so go to https://github.com/jodyphelan/pathogenseq and look for the installation instructions.

Initial setup

You need the following:

A reference genome
Your sequencing data in fastQ format

We will be using Mtuberculosis for this but if you have your own data already just perform all the steps with your own fastQ files. The following code downloads all the data you need:

# Getting the raw fastQ data
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR166/009/ERR1664619/ERR1664619_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR166/009/ERR1664619/ERR1664619_2.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR166/000/ERR1664620/ERR1664620_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR166/000/ERR1664620/ERR1664620_2.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR166/002/ERR1664662/ERR1664662_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR166/002/ERR1664662/ERR1664662_2.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR166/009/ERR1664669/ERR1664669_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR166/009/ERR1664669/ERR1664669_2.fastq.gz

# Getting the reference genome
wget ftp://ftp.ensemblgenomes.org/pub/release-32/bacteria//fasta/bacteria_0_collection/mycobacterium_tuberculosis_h37rv/dna/Mycobacterium_tuberculosis_h37rv.ASM19595v2.dna.toplevel.fa.gz
gunzip Mycobacterium_tuberculosis_h37rv.ASM19595v2.dna.toplevel.fa.gz
mv Mycobacterium_tuberculosis_h37rv.ASM19595v2.dna.toplevel.fa H37Rv.fa

# Get a BED file of the PE PPE genes (used for filtering)
wget http://pathogenseq.lshtm.ac.uk/downloads/pe_ppe.bed

Processing steps

Many of these steps make use of multiple cores on your computer and will speed up analysis if more than 1 is used. The -t flag is often used in these scripts followed by the number of cores you want to use. In this example we use -t 40 but you can set the number of the 75% of the number of cores available.

Mapping

We will generate the BAM files using the mapping.py script:

mapping.py  -1 ERR1664619_1.fastq.gz -2 ERR1664619_2.fastq.gz -p ERR1664619 -r H37Rv.fa -t 40
mapping.py  -1 ERR1664620_1.fastq.gz -2 ERR1664620_2.fastq.gz -p ERR1664620 -r H37Rv.fa -t 40
mapping.py  -1 ERR1664662_1.fastq.gz -2 ERR1664662_2.fastq.gz -p ERR1664662 -r H37Rv.fa -t 40
mapping.py  -1 ERR1664669_1.fastq.gz -2 ERR1664669_2.fastq.gz -p ERR1664669 -r H37Rv.fa -t 40

This will first trim the reads using trimmomatic and then align the trimmed reads to the reference to create a BAM file.

Variant calling

bam2vcf.py ERR1664619.bam H37Rv.fa ERR1664619 -t 40 
bam2vcf.py ERR1664620.bam H37Rv.fa ERR1664620 -t 40 
bam2vcf.py ERR1664662.bam H37Rv.fa ERR1664662 -t 40
bam2vcf.py ERR1664669.bam H37Rv.fa ERR1664669 -t 40

This will use bcftools to call SNPs. BCFtools does not have a multithreading mode of calling, so this scripts breaks the genome up into chunks and runs the calling algorithm on these in parallel. It also allows the user mark regions as missing accoring to a specified depth cutoff using the --min_dp flag. It also chooses the correct parameters required for Illumina vs minION variant calling. You can also mask primer sites with the --primers flag and a supplied FASTA file with the primers.

Creating a merged multi-sample BCF file

We will now merge the individual BCF files together using the merge_vcfs.py script. We first need to create a file containing all the sample names and then we can merge. Use nano to create the samples.txt file. Here is a bief summery how to:

Type nano samples.txt and hit enter
Write the sample names (e.g. ERR1664619) with each one a new line
When finished hit control+x
It will ask you if you want to save. Hit y and then enter to save and leave nano

Now we can use the file we just created with the script. We also need to remove repetitive regions from the analysis. We downloaded the pe_ppe.bed file. This contains the coordinates of all repetitive genes in Mtb. We can supply this to the script with the --bed_exlude flag:

merge_vcfs.py samples.txt H37Rv.fa merged -t 40 --bed_exclude pe_ppe.bed

This will have created many output files, but importantly for us we will use the merged.snps.fa file which contains an alignment of all the SNPs in our samples.

We will now use raxml-ng to build a phylogenetic tree:

raxml-ng --msa merged.snps.fa --model GTR+G --search

The resulting tree will be stored in merged.snps.fa.raxml.bestTree.

FastQ to phylogenetic tree