Test dataset

Objectives

There are many ways to perform basic QC, mapping, variant calling and filtering, and each choice could change the final variant dataset. I would like to quantify these differences using a benchmark dataset.

The main objective is to measure the effect of pipeline differences on the final variant dataset. Using this information, we can establish a pipeline that we can all use for the basic analyses.

The data

The data can be found at /opt/storage/test_data.

There is a Mycobacterium tuberculosis dataset and a Plasmodium knowlesi dataset. Each directory contains the raw fastqs and the reference.

Required outputs

To look at the differences we will need:

  1. Basic QC file (download template here)
  2. BAM files
  3. Single sample VCF files
  4. Variant Matrix
  5. Phylogenetic tree

The first three columns in the variant matrix are:

  1. Chromosome
  2. Position
  3. Reference

The remaining columns are the calls for the samples, one column per sample. Missing data can be represented with N. Mixed calls can be represented by putting both calls together (for instance, TG for a mix of T and G). For example:

chr pos ref sample1 sample2 sample3 sample4
PKNH_01_v2 100 A A G A A
PKNH_04_v2 5021 T T T N A
PKNH_09_v2 4234 T TG C T N
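
Once two pipelines have each produced a variant matrix in this layout, their calls can be compared site by site. The following is a minimal sketch of one way to do that, not an agreed method. It assumes the matrix is whitespace-delimited with a header row, that N calls are skipped rather than counted as disagreements, and that mixed calls are order-independent (TG equals GT). The function names and the second matrix (PIPELINE_B) are hypothetical and exist only for illustration.

```python
import io

MISSING = "N"  # missing-data symbol used in the variant matrix


def read_variant_matrix(handle):
    """Parse a variant matrix into {(chrom, pos): {sample: call}}.

    Assumes whitespace-delimited columns: chr, pos, ref, then one
    column per sample, with sample names taken from the header row.
    """
    samples = handle.readline().split()[3:]
    calls = {}
    for line in handle:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        chrom, pos = fields[0], int(fields[1])
        calls[(chrom, pos)] = dict(zip(samples, fields[3:]))
    return calls


def normalise(call):
    """Sort the bases so mixed calls compare equal regardless of order.

    Assumption: 'TG' and 'GT' describe the same mixed call.
    """
    return "".join(sorted(call))


def count_discordant(matrix_a, matrix_b):
    """Return (discordant, compared) over sites and samples present in both.

    Calls where either pipeline reports missing data are not compared.
    """
    discordant = 0
    compared = 0
    for site in matrix_a.keys() & matrix_b.keys():
        for sample in matrix_a[site].keys() & matrix_b[site].keys():
            call_a = matrix_a[site][sample]
            call_b = matrix_b[site][sample]
            if MISSING in (call_a, call_b):
                continue
            compared += 1
            if normalise(call_a) != normalise(call_b):
                discordant += 1
    return discordant, compared


# Matrix from the example above.
PIPELINE_A = """chr pos ref sample1 sample2 sample3 sample4
PKNH_01_v2 100 A A G A A
PKNH_04_v2 5021 T T T N A
PKNH_09_v2 4234 T TG C T N
"""

# Hypothetical output of a second pipeline, made up for illustration.
PIPELINE_B = """chr pos ref sample1 sample2 sample3 sample4
PKNH_01_v2 100 A A G A A
PKNH_04_v2 5021 T T T T A
PKNH_09_v2 4234 T GT T T N
"""

matrix_a = read_variant_matrix(io.StringIO(PIPELINE_A))
matrix_b = read_variant_matrix(io.StringIO(PIPELINE_B))
discordant, compared = count_discordant(matrix_a, matrix_b)
print(f"{discordant} discordant calls out of {compared} compared")
```

Sites called by only one pipeline are ignored here; counting them separately would be a natural extension if we want to capture differences in which sites are reported at all.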
