Kmer abundance

Here we will go throuhg the process of finding the abundance of kmers in a genome.

Requirements:

  1. DSK software
  2. Input genome
  3. Kmer(s)

If you do not have access to dsk we will need to install it:

if [ ! -d ~/bin ]; then
  mkdir -p ~/bin;
fi
cd ~/bin
git clone --recursive https://github.com/GATB/dsk.git
cd dsk
sh INSTALL
cp build/bin/dsk ../
cp build/bin/dsk2ascii ../
echo 'export PATH="~/bin:$PATH"' >> ~/.bashrc
bash

Running analysis

As an example we will be looking for the kmer AGCGCCGAGATT in the Mtb genome.

First we will download the genome:

wget ftp://ftp.ensemblgenomes.org/pub/release-32/bacteria//fasta/bacteria_0_collection/mycobacterium_tuberculosis_h37rv/dna/Mycobacterium_tuberculosis_h37rv.ASM19595v2.dna.toplevel.fa.gz

Now we will count the kmers and extract it into a human-readable format:

dsk -nb-cores 20 -file Mycobacterium_tuberculosis_h37rv.ASM19595v2.dna.toplevel.fa.gz -kmer-size 12 -abundance-min 1
dsk2ascii -file  Mycobacterium_tuberculosis_h37rv.ASM19595v2.dna.toplevel.fa.h5 -out kmers.txt

Note: The number of cores can be changes with the -nb-cores parameter. Set this appropriately for your system.

Now all we have to do is look into the kmers.txt file to find the number of kmers. We can use grep to do this quickly. To save space, dsk both the kmers and its reverse complement in the same line. We will use grep to look for both:

 grep "AGCGCCGAGATT\|AATCTCGGCGCT" kmers.txt

Note: We use \| to seperate queries with grep

Automated analysis with multiple kmers

If you have many kmers you can use the script available here to run the analysis automatically.

Run these commands to install the script:

git clone --recursive https://github.com/jodyphelan/kmer.git
cd kmer
bash install.sh
bash

Store your kmers in fasta format:

>kmer1
ACGTGGACGATG
>kmer2
GCGGAGGTT
>kmer3
CCAGTTGAGGTTGA

Run the script as follows:

find_kmer_matches.py <kmer.fa> <genome.fa> <outfile.txt> <threads>

The arguments should be the following:

  1. A fasta formatted file containing your kmers/primers
  2. The fasta for the genome you are interested in
  3. A file name for the output
  4. The number of threads you would like to run your analysis with (usually 4 for low power machines)

results matching ""

    No results matching ""