Fastq to variants - A rough Pipeline

Initial Setup

To demonstrate the pipeline I am using three samples from this paper.

Here are the ENA accesssions with some meta data:

ENA accession RIFAMPICIN ISONIAZID STREPTOMYCIN ETHAMBUTOL PYRAZINAMIDE
ERR1664619 R R R R R
ERR1664620 R R S S S
ERR1664662 R R R R R

First we should download the data and the H37Rv reference:

wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR166/009/ERR1664619/ERR1664619_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR166/009/ERR1664619/ERR1664619_2.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR166/000/ERR1664620/ERR1664620_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR166/000/ERR1664620/ERR1664620_2.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR166/002/ERR1664662/ERR1664662_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR166/002/ERR1664662/ERR1664662_2.fastq.gz

wget ftp://ftp.ensemblgenomes.org/pub/release-32/bacteria//fasta/bacteria_0_collection/mycobacterium_tuberculosis_h37rv/dna/Mycobacterium_tuberculosis_h37rv.ASM19595v2.dna.toplevel.fa.gz
gunzip Mycobacterium_tuberculosis_h37rv.ASM19595v2.dna.toplevel.fa.gz
mv Mycobacterium_tuberculosis_h37rv.ASM19595v2.dna.toplevel.fa H37Rv.fa
wget http://pathogenseq.lshtm.ac.uk/downloads/H37Rv.gff

We will use BWA, SAMtools and BCFtools for this analysis. Please make sure you have the latest versions. These programs can make use of multiple cores so we will create a variable which is used to assign the number of threads in each program call.

export THREADS=20

You can change the number according to your own system.

Mapping

First we will index the reference:

bwa index H37Rv.fa

Now we can perform mapping:

bwa mem -t $THREADS -R '@RG\tID:ERR1664619\tSM:ERR1664619\tPL:Illumina' H37Rv.fa ERR1664619_1.fastq.gz ERR1664619_2.fastq.gz | samtools view -b -@ $THREADS - | samtools sort -@ $THREADS -o ERR1664619.bam -
bwa mem -t $THREADS -R '@RG\tID:ERR1664620\tSM:ERR1664620\tPL:Illumina' H37Rv.fa ERR1664620_1.fastq.gz ERR1664620_2.fastq.gz | samtools view -b -@ $THREADS - | samtools sort -@ $THREADS -o ERR1664620.bam -
bwa mem -t $THREADS -R '@RG\tID:ERR1664662\tSM:ERR1664662\tPL:Illumina' H37Rv.fa ERR1664662_1.fastq.gz ERR1664662_2.fastq.gz | samtools view -b -@ $THREADS - | samtools sort -@ $THREADS -o ERR1664662.bam -

The | (pipe) symbol passes information from one program to another. There are three program calls seperated by two pipes which perform the following:

  1. bwamaps the reads to the reference and outputs in SAM format
  2. samtools view converts from SAM format to BAM format
  3. samtools sort sorts the BAM file

We could perform these steps individually but using a pipe (known as data streaming) ensures the data is processed as is generated and we do not have to wait for the individual programs to finish. Adiitionally the data remains in RAM and is accessed faster than if we would write to disk and read in to another program

Variant Calling

Now that we have an aligment file for each sample we can perform variant calling:

samtools mpileup -ugf H37Rv.fa ERR1664619.bam -t DP | bcftools call --threads $THREADS -g 10 -m -O z -o ERR1664619.vcf.gz
samtools mpileup -ugf H37Rv.fa ERR1664620.bam -t DP | bcftools call --threads $THREADS -g 10 -m -O z -o ERR1664620.vcf.gz
samtools mpileup -ugf H37Rv.fa ERR1664662.bam -t DP | bcftools call --threads $THREADS -g 10 -m -O z -o ERR1664662.vcf.gz

This creates a bcf file for each sample which described the variation within each sample. The options are explained here.

Samtools:

Option Explanation
mpileup Runs through the BAM file per position and finds all overlapping reads
-u Output uncompressed format
-g Output BCF format
-f <reference.fa> Specifies the reference to use
-t DP Passes Depth (DP) information to the output

Bcftools:

Option Explanation
call Function used to call variants
--ploidy 1 Allows the user to set the ploidy using a specifed integer. (1 in our case)
-g <int> Allows the user to specify a minimum depth to allow for reference allele stretches to be merged into blocks. If the depth falls below the cutoff a single record is created for the position.
-m Allows for multisample and rare variant calling
-O z Outputs in compressed VCF format
-o file.vcf.gz Output is written to file.vcf.gz. We add the ".gz" extension to represent gzip compression

We will now collate the VCF files into one VCF file. We first need to index the files and then we can merge:

bcftools index ERR1664619.vcf.gz
bcftools index ERR1664620.vcf.gz
bcftools index ERR1664662.vcf.gz

bcftools merge --threads $THREADS -O v -o merged.vcf -g H37Rv.fa ERR1664619.vcf.gz ERR1664620.vcf.gz ERR1664662.vcf.gz

Take a look at the merged.vcf file. This contains information about all variants in the dataset:

Chromosome      1977    .       A       G       225     .       VDB=0.0263822;SGB=-0.693147;MQSB=1;MQ0F=0;MQ=60;DP=249;DP4=0,0,125,43;AN=3;AC=3 GT:PL:DP        1:255,0:48      1:255,0:63     1:255,0:57

Additionally it merged stretches of reference alleles into blocks:

Chromosome      3       .       G       .       .       .       END=1976;MinDP=10;AN=3  GT:DP   0:10    0:11    0:18

This line tells us that between position 3 and 1976 we only have reference alleles with minimim depth of 10 (i.e. we can be pretty confident that they are reference and not just a lack of data).

Filtering and annotating variants

Now we have a large file containing a list of reference and alternate alleles. Most likely we are interested in the variation in our dataset. We fill filter the VCF file to extract only the variants:

bcftools view -c 1 merged.vcf  > filtered.vcf

The -c option tells bcftools to extract all records with at lease one alternate allele.

We can then annotate the VCF file using the GFF annotation file and extract in a user friendly format:

bcftools csq -f H37Rv.fa -g test.gff filtered.vcf  > filtered.annotated.vcf
bcftools query -f'[%CHROM\t%POS\t%SAMPLE\t%TBCSQ\n]' filtered.annotated.vcf > variants.txt

The variants.txt file contains a line per variant for each sample and the corresponding annotation such as gene and amino acid change. We can see from the first table that all samples are Rifampicin resistant. Rifampicin resistance is cause by mutations in the rpoB gene (ID:Rv0667). Lets see if we can identify the causal variant:

grep Rv0667 variants.txt

From the output we can see all isolates have the S450L mutation in rpoB. Google this mutation and see if you think it is involved with resistance.

Try to find mutations for the other drug resistance types by looking at associated targets/activators:

Drug Gene Locus Tag
Rifampicin rpoB Rv0667
Isoniazid katG, inhA Rv1908c, Rv1484c
Streptomycin rpsL, rrs Rv0682, rrs
Ethambutol embB Rv3795
Pyrazinamide pncA Rv2043c

results matching ""

    No results matching ""