Fastq to variants - A rough Pipeline
Initial Setup
To demonstrate the pipeline I am using three samples from this paper.
Here are the ENA accesssions with some meta data:
ENA accession | RIFAMPICIN | ISONIAZID | STREPTOMYCIN | ETHAMBUTOL | PYRAZINAMIDE |
---|---|---|---|---|---|
ERR1664619 | R | R | R | R | R |
ERR1664620 | R | R | S | S | S |
ERR1664662 | R | R | R | R | R |
First we should download the data and the H37Rv reference:
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR166/009/ERR1664619/ERR1664619_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR166/009/ERR1664619/ERR1664619_2.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR166/000/ERR1664620/ERR1664620_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR166/000/ERR1664620/ERR1664620_2.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR166/002/ERR1664662/ERR1664662_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR166/002/ERR1664662/ERR1664662_2.fastq.gz
wget ftp://ftp.ensemblgenomes.org/pub/release-32/bacteria//fasta/bacteria_0_collection/mycobacterium_tuberculosis_h37rv/dna/Mycobacterium_tuberculosis_h37rv.ASM19595v2.dna.toplevel.fa.gz
gunzip Mycobacterium_tuberculosis_h37rv.ASM19595v2.dna.toplevel.fa.gz
mv Mycobacterium_tuberculosis_h37rv.ASM19595v2.dna.toplevel.fa H37Rv.fa
wget http://pathogenseq.lshtm.ac.uk/downloads/H37Rv.gff
We will use BWA, SAMtools and BCFtools for this analysis. Please make sure you have the latest versions. These programs can make use of multiple cores so we will create a variable which is used to assign the number of threads in each program call.
export THREADS=20
You can change the number according to your own system.
Mapping
First we will index the reference:
bwa index H37Rv.fa
Now we can perform mapping:
bwa mem -t $THREADS -R '@RG\tID:ERR1664619\tSM:ERR1664619\tPL:Illumina' H37Rv.fa ERR1664619_1.fastq.gz ERR1664619_2.fastq.gz | samtools view -b -@ $THREADS - | samtools sort -@ $THREADS -o ERR1664619.bam -
bwa mem -t $THREADS -R '@RG\tID:ERR1664620\tSM:ERR1664620\tPL:Illumina' H37Rv.fa ERR1664620_1.fastq.gz ERR1664620_2.fastq.gz | samtools view -b -@ $THREADS - | samtools sort -@ $THREADS -o ERR1664620.bam -
bwa mem -t $THREADS -R '@RG\tID:ERR1664662\tSM:ERR1664662\tPL:Illumina' H37Rv.fa ERR1664662_1.fastq.gz ERR1664662_2.fastq.gz | samtools view -b -@ $THREADS - | samtools sort -@ $THREADS -o ERR1664662.bam -
The |
(pipe) symbol passes information from one program to another. There are three program calls seperated by two pipes which perform the following:
bwa
maps the reads to the reference and outputs in SAM formatsamtools view
converts from SAM format to BAM formatsamtools sort
sorts the BAM file
We could perform these steps individually but using a pipe (known as data streaming) ensures the data is processed as is generated and we do not have to wait for the individual programs to finish. Adiitionally the data remains in RAM and is accessed faster than if we would write to disk and read in to another program
Variant Calling
Now that we have an aligment file for each sample we can perform variant calling:
samtools mpileup -ugf H37Rv.fa ERR1664619.bam -t DP | bcftools call --threads $THREADS -g 10 -m -O z -o ERR1664619.vcf.gz
samtools mpileup -ugf H37Rv.fa ERR1664620.bam -t DP | bcftools call --threads $THREADS -g 10 -m -O z -o ERR1664620.vcf.gz
samtools mpileup -ugf H37Rv.fa ERR1664662.bam -t DP | bcftools call --threads $THREADS -g 10 -m -O z -o ERR1664662.vcf.gz
This creates a bcf file for each sample which described the variation within each sample. The options are explained here.
Samtools:
Option | Explanation |
---|---|
mpileup | Runs through the BAM file per position and finds all overlapping reads |
-u | Output uncompressed format |
-g | Output BCF format |
-f <reference.fa> | Specifies the reference to use |
-t DP | Passes Depth (DP) information to the output |
Bcftools:
Option | Explanation |
---|---|
call | Function used to call variants |
--ploidy 1 | Allows the user to set the ploidy using a specifed integer. (1 in our case) |
-g <int> | Allows the user to specify a minimum depth to allow for reference allele stretches to be merged into blocks. If the depth falls below the cutoff a single record is created for the position. |
-m | Allows for multisample and rare variant calling |
-O z | Outputs in compressed VCF format |
-o file.vcf.gz | Output is written to file.vcf.gz. We add the ".gz" extension to represent gzip compression |
We will now collate the VCF files into one VCF file. We first need to index the files and then we can merge:
bcftools index ERR1664619.vcf.gz
bcftools index ERR1664620.vcf.gz
bcftools index ERR1664662.vcf.gz
bcftools merge --threads $THREADS -O v -o merged.vcf -g H37Rv.fa ERR1664619.vcf.gz ERR1664620.vcf.gz ERR1664662.vcf.gz
Take a look at the merged.vcf
file. This contains information about all variants in the dataset:
Chromosome 1977 . A G 225 . VDB=0.0263822;SGB=-0.693147;MQSB=1;MQ0F=0;MQ=60;DP=249;DP4=0,0,125,43;AN=3;AC=3 GT:PL:DP 1:255,0:48 1:255,0:63 1:255,0:57
Additionally it merged stretches of reference alleles into blocks:
Chromosome 3 . G . . . END=1976;MinDP=10;AN=3 GT:DP 0:10 0:11 0:18
This line tells us that between position 3 and 1976 we only have reference alleles with minimim depth of 10 (i.e. we can be pretty confident that they are reference and not just a lack of data).
Filtering and annotating variants
Now we have a large file containing a list of reference and alternate alleles. Most likely we are interested in the variation in our dataset. We fill filter the VCF file to extract only the variants:
bcftools view -c 1 merged.vcf > filtered.vcf
The -c
option tells bcftools to extract all records with at lease one alternate allele.
We can then annotate the VCF file using the GFF annotation file and extract in a user friendly format:
bcftools csq -f H37Rv.fa -g test.gff filtered.vcf > filtered.annotated.vcf
bcftools query -f'[%CHROM\t%POS\t%SAMPLE\t%TBCSQ\n]' filtered.annotated.vcf > variants.txt
The variants.txt
file contains a line per variant for each sample and the corresponding annotation such as gene and amino acid change. We can see from the first table that all samples are Rifampicin resistant. Rifampicin resistance is cause by mutations in the rpoB gene (ID:Rv0667). Lets see if we can identify the causal variant:
grep Rv0667 variants.txt
From the output we can see all isolates have the S450L mutation in rpoB. Google this mutation and see if you think it is involved with resistance.
Try to find mutations for the other drug resistance types by looking at associated targets/activators:
Drug | Gene | Locus Tag |
---|---|---|
Rifampicin | rpoB | Rv0667 |
Isoniazid | katG, inhA | Rv1908c, Rv1484c |
Streptomycin | rpsL, rrs | Rv0682, rrs |
Ethambutol | embB | Rv3795 |
Pyrazinamide | pncA | Rv2043c |