Test dataset

Objectives

There are many ways to perform basic QC, mapping, variant calling and filtering, and each choice can change the final variant dataset. I would like to quantify those differences using a benchmark dataset.

The main objective is to look at the effect of pipeline differences on the final variant dataset. Using this information, we can establish a pipeline that we can all use for the basic analyses.

The data

The dataset can be found at /opt/storage/test_data

There is a Mycobacterium tuberculosis dataset and a Plasmodium knowlesi dataset. Each directory contains the raw FASTQ files and the reference.

Required outputs

To look at the differences we will need:

Basic QC file (download template here)
BAM files
Single sample VCF files
Variant Matrix
Phylogenetic tree

The first columns in the variant matrix are:

Chromosome
Position
Reference

The following columns are the calls for all the samples. Missing data can be represented with N. Mixed calls can be represented by putting both calls together. For example:

chr	pos	ref	sample1	sample2	sample3	sample4
PKNH_01_v2	100	A	A	G	A	A
PKNH_04_v2	5021	T	T	T	N	A
PKNH_09_v2	4234	T	TG	C	T	N
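
As a rough illustration of how two variant matrices in this format could be compared, here is a minimal Python sketch. It assumes the matrices are tab-separated with the header shown above (chr, pos, ref, then one column per sample). The function names and the PIPELINE_B data are hypothetical, made up for the example. Calls are compared as exact strings, so a mixed call written as TG is treated as different from GT, and any comparison involving N is counted as missing rather than as a disagreement.

```python
import csv
import io

# Example matrix from the text above.
PIPELINE_A = (
    "chr\tpos\tref\tsample1\tsample2\tsample3\tsample4\n"
    "PKNH_01_v2\t100\tA\tA\tG\tA\tA\n"
    "PKNH_04_v2\t5021\tT\tT\tT\tN\tA\n"
    "PKNH_09_v2\t4234\tT\tTG\tC\tT\tN\n"
)

# Hypothetical output of a second pipeline: sample2 at PKNH_01_v2:100 differs.
PIPELINE_B = (
    "chr\tpos\tref\tsample1\tsample2\tsample3\tsample4\n"
    "PKNH_01_v2\t100\tA\tA\tA\tA\tA\n"
    "PKNH_04_v2\t5021\tT\tT\tT\tN\tA\n"
    "PKNH_09_v2\t4234\tT\tTG\tC\tT\tN\n"
)

FIXED_COLUMNS = ("chr", "pos", "ref")


def read_variant_matrix(handle):
    """Return {(chrom, pos): {sample: call}} from a tab-separated matrix."""
    reader = csv.DictReader(handle, delimiter="\t")
    matrix = {}
    for row in reader:
        key = (row["chr"], int(row["pos"]))
        matrix[key] = {
            sample: call
            for sample, call in row.items()
            if sample not in FIXED_COLUMNS
        }
    return matrix


def compare_matrices(a, b):
    """Count per-sample calls that are the same, different, or missing."""
    counts = {"same": 0, "different": 0, "missing": 0}
    for key in set(a) | set(b):
        calls_a = a.get(key, {})
        calls_b = b.get(key, {})
        for sample in set(calls_a) | set(calls_b):
            # A position or sample absent from one matrix is treated as N.
            call_a = calls_a.get(sample, "N")
            call_b = calls_b.get(sample, "N")
            if call_a == "N" or call_b == "N":
                counts["missing"] += 1
            elif call_a == call_b:
                counts["same"] += 1
            else:
                counts["different"] += 1
    return counts


matrix_a = read_variant_matrix(io.StringIO(PIPELINE_A))
matrix_b = read_variant_matrix(io.StringIO(PIPELINE_B))
counts = compare_matrices(matrix_a, matrix_b)
print(counts)
```

For real files, the io.StringIO objects can be replaced with handles from open(path, newline=""). Treating absent positions as N is a design choice for this sketch; a stricter comparison might report positions called by only one pipeline separately.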
