BCF/VCF analysis
For the following analysis we are _Mtb _bcf file. You can just substitute in your own bcf/vcf file.
To get the file, run the following:
wget http://pathogenseq.lshtm.ac.uk/downloads/mtb.bcf
wget http://pathogenseq.lshtm.ac.uk/downloads/lineages.txt
bcftools index mtb.bcf
We should index the genome to allow for random access of specific locations without streaming throuhgh the whole file
Getting allele frequencies
If you are interested in the allele frequency for a specific variant you can compute this using the BCFtools. Let's say we are interested in the allele frequency of variants at position 7582 (a variant involved in fluoroquinolone resistance). We can use the following command:
bcftools view mtb.bcf Chromosome:7582 | bcftools query -f '%CHROM\t%POS\t%REF\t%ALT\t%AC\t%AN\n'
Because Mtb only has one chromosome named "Chromosome" the locus is
Chromosome:7570
. Make sure you specify the right chromosome name for position of interest.
This will outupt the following column:
- Chromosome
- Position
- Reference
- Alternate alleles (if more than one they are delimited by comma)
- Number of alternates (again comma delimited)
- Total number of calls
We can get allele frequencies by dividing the number of alternate calls (either individually or summed up) by the total number.
You can stratify this by supplying a file with samples you are interested in using the -S
parameter. For example if you are interested in only Lineage4.3.4.2 you can do the following:
awk '$2=="Lineage4.3.4.2" {print $1}' lineages.txt > lineage4.3.4.2.txt
bcftools view -S lineage4.3.4.2.txt mtb.bcf Chromosome:7582 | bcftools query -f '%CHROM\t%POS\t%REF\t%ALT\t%AC\t%AN\n'