Phylogenetics basics

Requirements:

  1. Files from the previous tutorial here
  2. Premade VCF files from more Mtb clinical isolates
  3. FastTree software for reconstructing phylogenies
  4. vcf2fasta.py script to extract variants from VCF files and convert to fasta format

Please follow the following commands to set up:

wget http://pathogenseq.lshtm.ac.uk/downloads/vcf_files.tgz
tar -xvf vcf_files.tgz
wget http://www.microbesonline.org/fasttree/FastTree
chmod 755 FastTree
git clone https://github.com/jodyphelan/variant_calling.git
cp variant_calling/vcf2fasta.py .

The basic steps required to create a multi-fasta file from individual VCF files are:

  1. Index the VCF files
  2. Merge the VCF files together
  3. Extract only SNPs and remove any variant sites which have a lot of missing data
  4. Calculate fraction of missing data over all SNP positions and identify low quality samples
  5. Remove low quality samples
  6. Convert to a fasta format

These steps are automatically handled by the vcf2fasta.py script and it can be run like this:

vcf2fasta.py <sample_file> <prefix> <ref.fa> <threads>

The inputs to the script should be the following:

  1. A text file containing the prefix to all the vcf files
  2. The prefix for the result files
  3. The reference genome used to perform the mapping and variant calling
  4. The number of threads available on the system

A list of the samples can be generated using this command:

ls *.vcf.gz | sed 's/.vcf.gz//' > samples.txt

We can now run the script using the following command:

./vcf2fasta.py samples.txt Mtb H37Rv.fa 4

The script will perform the 6 steps stated above and prints out each command it is running.

All SNPs with > 10% missing data and all samples with >15% missing data have been excluded from the final variant file.

The high quality samples have been writen to a text file calle Mtb.HQ.samples. Similarly the low quality samples have been writeen to a file called Mtb.LQ.samples. Let's take a look at the samples that have been excluded:

cat Mtb.LQ.samples

We can see that three samples have been removed from the analysis. We can examine the calls from these samples using:

bcftools view -s ERR1035229 Mtb.raw.bcf

As we can see there is a lot of missing data and a number of very low coverage positions for this sample. The numbers are summarised in the ERR1035229.calls file.

We now have a file named Mtb.snps.fa.This file contains the high quality SNPs from the high quality samples and can be used to reconstruct a phylogeny. We will use FastTree to quickly create a tree:

./FastTree -gtr -nt Mtb.snps.fa > Mtb.tree

This can be visualised using a website such as Itol or a stand along programs such as FigTree. The tree is saved in newick format in Mtb.tree. Print it using cat Mtb.tree and visualise using Itol. Have the isolates from the previous excersise clustered with any other strains?

results matching ""

    No results matching ""