Phylogenetics basics
Requirements:
- Files from the previous tutorial here
- Premade VCF files from more Mtb clinical isolates
- FastTree software for reconstructing phylogenies
- vcf2fasta.py script to extract variants from VCF files and convert to fasta format
Please follow the following commands to set up:
wget http://pathogenseq.lshtm.ac.uk/downloads/vcf_files.tgz
tar -xvf vcf_files.tgz
wget http://www.microbesonline.org/fasttree/FastTree
chmod 755 FastTree
git clone https://github.com/jodyphelan/variant_calling.git
cp variant_calling/vcf2fasta.py .
The basic steps required to create a multi-fasta file from individual VCF files are:
- Index the VCF files
- Merge the VCF files together
- Extract only SNPs and remove any variant sites which have a lot of missing data
- Calculate fraction of missing data over all SNP positions and identify low quality samples
- Remove low quality samples
- Convert to a fasta format
These steps are automatically handled by the vcf2fasta.py script and it can be run like this:
vcf2fasta.py <sample_file> <prefix> <ref.fa> <threads>
The inputs to the script should be the following:
- A text file containing the prefix to all the vcf files
- The prefix for the result files
- The reference genome used to perform the mapping and variant calling
- The number of threads available on the system
A list of the samples can be generated using this command:
ls *.vcf.gz | sed 's/.vcf.gz//' > samples.txt
We can now run the script using the following command:
./vcf2fasta.py samples.txt Mtb H37Rv.fa 4
The script will perform the 6 steps stated above and prints out each command it is running.
All SNPs with > 10% missing data and all samples with >15% missing data have been excluded from the final variant file.
The high quality samples have been writen to a text file calle Mtb.HQ.samples
. Similarly the low quality samples have been writeen to a file called Mtb.LQ.samples
. Let's take a look at the samples that have been excluded:
cat Mtb.LQ.samples
We can see that three samples have been removed from the analysis. We can examine the calls from these samples using:
bcftools view -s ERR1035229 Mtb.raw.bcf
As we can see there is a lot of missing data and a number of very low coverage positions for this sample. The numbers are summarised in the ERR1035229.calls
file.
We now have a file named Mtb.snps.fa
.This file contains the high quality SNPs from the high quality samples and can be used to reconstruct a phylogeny. We will use FastTree to quickly create a tree:
./FastTree -gtr -nt Mtb.snps.fa > Mtb.tree
This can be visualised using a website such as Itol or a stand along programs such as FigTree. The tree is saved in newick format in Mtb.tree
. Print it using cat Mtb.tree
and visualise using Itol. Have the isolates from the previous excersise clustered with any other strains?