Kmer abundance
Here we will go throuhg the process of finding the abundance of kmers in a genome.
Requirements:
- DSK software
- Input genome
- Kmer(s)
If you do not have access to dsk
we will need to install it:
if [ ! -d ~/bin ]; then
mkdir -p ~/bin;
fi
cd ~/bin
git clone --recursive https://github.com/GATB/dsk.git
cd dsk
sh INSTALL
cp build/bin/dsk ../
cp build/bin/dsk2ascii ../
echo 'export PATH="~/bin:$PATH"' >> ~/.bashrc
bash
Running analysis
As an example we will be looking for the kmer AGCGCCGAGATT in the Mtb genome.
First we will download the genome:
wget ftp://ftp.ensemblgenomes.org/pub/release-32/bacteria//fasta/bacteria_0_collection/mycobacterium_tuberculosis_h37rv/dna/Mycobacterium_tuberculosis_h37rv.ASM19595v2.dna.toplevel.fa.gz
Now we will count the kmers and extract it into a human-readable format:
dsk -nb-cores 20 -file Mycobacterium_tuberculosis_h37rv.ASM19595v2.dna.toplevel.fa.gz -kmer-size 12 -abundance-min 1
dsk2ascii -file Mycobacterium_tuberculosis_h37rv.ASM19595v2.dna.toplevel.fa.h5 -out kmers.txt
Note: The number of cores can be changes with the -nb-cores
parameter. Set this appropriately for your system.
Now all we have to do is look into the kmers.txt
file to find the number of kmers. We can use grep to do this quickly. To save space, dsk
both the kmers and its reverse complement in the same line. We will use grep to look for both:
grep "AGCGCCGAGATT\|AATCTCGGCGCT" kmers.txt
Note: We use \|
to seperate queries with grep
Automated analysis with multiple kmers
If you have many kmers you can use the script available here to run the analysis automatically.
Run these commands to install the script:
git clone --recursive https://github.com/jodyphelan/kmer.git
cd kmer
bash install.sh
bash
Store your kmers in fasta format:
>kmer1
ACGTGGACGATG
>kmer2
GCGGAGGTT
>kmer3
CCAGTTGAGGTTGA
Run the script as follows:
find_kmer_matches.py <kmer.fa> <genome.fa> <outfile.txt> <threads>
The arguments should be the following:
- A fasta formatted file containing your kmers/primers
- The fasta for the genome you are interested in
- A file name for the output
- The number of threads you would like to run your analysis with (usually 4 for low power machines)