Test dataset
Objectives
There are many ways to perform the basic QC, mapping, variant calling and filtering. Each choice can lead to differences in the final variant dataset. I would like to quantify these differences using a benchmark dataset.
The main objective is to measure the effect of pipeline differences on the final variant dataset. Using this information, we can establish a shared pipeline for the basic analyses.
The data
The dataset can be found at /opt/storage/test_data
There is a Mycobacterium tuberculosis dataset and a Plasmodium knowlesi dataset. Each directory contains the raw fastqs and the reference.
Required outputs
To look at the differences we will need:
- Basic QC file (download template here)
- BAM files
- Single-sample VCF files
- Variant matrix
- Phylogenetic tree
The first three columns in the variant matrix are:
- Chromosome
- Position
- Reference
The remaining columns hold the calls for each sample, one column per sample. Missing data can be represented with N. Mixed calls can be represented by putting both calls together (TG for a mix of T and G). For example:
chr | pos | ref | sample1 | sample2 | sample3 | sample4 |
---|---|---|---|---|---|---|
PKNH_01_v2 | 100 | A | A | G | A | A |
PKNH_04_v2 | 5021 | T | T | T | N | A |
PKNH_09_v2 | 4234 | T | TG | C | T | N |
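
As a rough illustration of how two pipelines' variant matrices could be compared, here is a minimal Python sketch. It assumes the matrix is saved as a tab-separated file with the header `chr`, `pos`, `ref` followed by one column per sample, and that the matrix holds SNP calls only, so any call longer than one character is treated as mixed. The function names (`classify_call`, `read_matrix`, `count_discordant`) and the second matrix are made up for this example and are not part of any agreed pipeline.

```python
import csv
import io


def classify_call(call, ref):
    """Classify one matrix cell: missing (N), mixed (more than one base), ref or alt."""
    if call == "N":
        return "missing"
    if len(call) > 1:  # assumption: SNP-only matrix, so two bases means a mixed call
        return "mixed"
    return "ref" if call == ref else "alt"


def read_matrix(handle):
    """Read a tab-separated variant matrix into {(chr, pos): (ref, {sample: call})}."""
    reader = csv.DictReader(handle, delimiter="\t")
    matrix = {}
    for row in reader:
        key = (row["chr"], int(row["pos"]))
        calls = {s: c for s, c in row.items() if s not in ("chr", "pos", "ref")}
        matrix[key] = (row["ref"], calls)
    return matrix


def count_discordant(matrix_a, matrix_b):
    """Return (discordant, compared) over sites and samples present in both matrices."""
    discordant = 0
    compared = 0
    for key in matrix_a.keys() & matrix_b.keys():
        calls_a = matrix_a[key][1]
        calls_b = matrix_b[key][1]
        for sample in calls_a.keys() & calls_b.keys():
            compared += 1
            if calls_a[sample] != calls_b[sample]:
                discordant += 1
    return discordant, compared


# Matrix from the example table above, written as tab-separated text.
PIPELINE_A = (
    "chr\tpos\tref\tsample1\tsample2\tsample3\tsample4\n"
    "PKNH_01_v2\t100\tA\tA\tG\tA\tA\n"
    "PKNH_04_v2\t5021\tT\tT\tT\tN\tA\n"
    "PKNH_09_v2\t4234\tT\tTG\tC\tT\tN\n"
)
# Hypothetical second pipeline: identical except sample1 at PKNH_09_v2:4234.
PIPELINE_B = PIPELINE_A.replace("T\tTG\tC", "T\tT\tC")

matrix_a = read_matrix(io.StringIO(PIPELINE_A))
matrix_b = read_matrix(io.StringIO(PIPELINE_B))

discordant, compared = count_discordant(matrix_a, matrix_b)
print(f"{discordant} of {compared} calls differ")  # prints "1 of 12 calls differ"
```

Sites or samples that appear in only one matrix are ignored here; counting them separately would probably be useful too, since a site called by one pipeline and filtered by another is itself a pipeline difference. Mixed calls are compared as plain strings, so TG and GT would count as different unless the order is normalised first.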