Population-Based Structural Variant Analysis
Structural variation (i.e. anything bigger than ~10 bp) plays a key role in the meaningful variation of any genome. For example, forms of drug resistance in Plasmodium falciparum have been linked to duplication of mdr1 and gch1. However, in stark contrast to smaller variants such as SNPs, analysis of structural variation is usually very low resolution, restricted to at most 5 samples and/or to candidate regions. That's rubbish and we can do better. Here's a pipeline for doing so.
SV Discovery
First, we need to identify structural variants. There are a heap of programs out there, which are not necessarily consistent with one another, but we're going to use DELLY. Once SVs have been predicted for each sample individually, our primary variants of interest will be those high-quality SVs found in multiple samples or overlapping specific regions. This second step requires a different approach.
Population Analysis
Population-based support of a structural variant allows us to reliably identify true SVs and therefore speeds up novel variant detection. For this, we have developed SV-Pop. Once provided with a collection of per-sample VCFs containing predicted SVs, it will return population-based statistics (such as frequency or Fst). Population-based filtering can then be applied and specific regional subsets produced.
- Merge VCF files
./SVPop.py --inFile=<list_of_vcfs.txt> --model=<INS,INV,DUP,DEL> --outFile=<output_prefix> --subPops='popA,popB,popC,popD' --mergeChr=True --refFile='/path/to/annotation.txt' --doFiltering=True --suppressWarnings=True --writeSamples=True
- Merge across models
./SVPop.py --MERGE-MODEL --variantFile=<INS_variants.csv>,<INV_variants.csv>,<DEL_variants.csv>,<DUP_variants.csv>
./SVPop.py --MERGE-MODEL --variantFile=<INS_windows.csv>,<INV_windows.csv>,<DEL_windows.csv>,<DUP_windows.csv>
- Subset to candidate regions
# Get overlaps with a known feature (e.g. a gene name)
./SVPop.py --SUBSET --variantFile=<file> --feature=<feature_name>
# Get overlaps with a specific genomic region
./SVPop.py --SUBSET --variantFile=<file> --region=<chr:bpA-bpB>
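Conceptually, the region subset is an interval-overlap test. Here's a minimal Python sketch of that idea. It is not SV-Pop's actual implementation: `Region`, `parse_region` and `overlaps` are hypothetical names, coordinates are assumed to be 1-based and inclusive, and the example coordinates are made up.

```python
from typing import NamedTuple


class Region(NamedTuple):
    chrom: str
    start: int  # assumed 1-based, inclusive
    end: int    # assumed 1-based, inclusive


def parse_region(text: str) -> Region:
    """Parse a 'chr:bpA-bpB' string into a Region."""
    # Split on the last colon so chromosome names containing ':' still work.
    chrom, _, span = text.rpartition(":")
    start, _, end = span.partition("-")
    region = Region(chrom, int(start), int(end))
    if not chrom or region.start > region.end:
        raise ValueError(f"malformed region: {text!r}")
    return region


def overlaps(sv: Region, target: Region) -> bool:
    """True if both intervals share at least one base on the same chromosome."""
    return (
        sv.chrom == target.chrom
        and sv.start <= target.end
        and target.start <= sv.end
    )


# Illustrative coordinates only.
target = parse_region("chr5:1000-2000")
predicted_sv = Region("chr5", 1500, 3000)
print(overlaps(predicted_sv, target))
```

Any SV sharing at least one base with the target region is kept here; a stricter rule (for example, a minimum overlap fraction) would need an extra check.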
<Stats Exploration, Fst etc.>
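To make the frequency and Fst statistics concrete, here's a rough Python sketch for a single SV. It assumes each sample is scored as carrying (1) or not carrying (0) the variant, and it uses a simple Wright-style estimator, Fst = (H_T - H_S) / H_T, weighted by sample size. SV-Pop's own estimator may differ, and the function names and toy data are illustrative.

```python
import math
from typing import Dict, Sequence


def sv_frequency(calls: Sequence[int]) -> float:
    """Fraction of samples carrying the SV (calls are 0 or 1)."""
    return sum(calls) / len(calls)


def wright_fst(pop_calls: Dict[str, Sequence[int]]) -> float:
    """Sample-size-weighted (H_T - H_S) / H_T for one presence/absence variant.

    Returns NaN when the variant is absent or fixed across all samples,
    because H_T is zero and Fst is undefined.
    """
    sizes = {pop: len(calls) for pop, calls in pop_calls.items()}
    freqs = {pop: sv_frequency(calls) for pop, calls in pop_calls.items()}
    total = sum(sizes.values())

    # Pooled frequency and expected heterozygosity across all samples.
    p_bar = sum(sizes[pop] * freqs[pop] for pop in pop_calls) / total
    h_total = 2 * p_bar * (1 - p_bar)
    if h_total == 0:
        return math.nan

    # Weighted mean of within-population expected heterozygosity.
    h_within = sum(
        sizes[pop] * 2 * freqs[pop] * (1 - freqs[pop]) for pop in pop_calls
    ) / total
    return (h_total - h_within) / h_total


# Toy data: the SV is common in popA and rare in popB.
calls = {"popA": [1, 1, 1, 0], "popB": [0, 0, 0, 1]}
print({pop: sv_frequency(c) for pop, c in calls.items()})
print(wright_fst(calls))
```

Values near 0 suggest the SV is similarly frequent in every population, while values near 1 suggest it is largely private to one of them.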
Visualisation
<With SV-Pop R Shiny app>
Verification
The gold standard for SV verification is PCR-based wet-lab analysis, but if you don't have the samples or want to avoid the lab, it's best to consider concordance with other evidence. For example, true duplications should show a clear spike in read coverage across the predicted region, whilst deletions should show a significant drop in read coverage (ideally to zero). Alternatively, PacBio sequencing may support an SV predicted by a short-read DELLY approach.
Here's a pipeline for generating coverage plots from base coverage files (for example as produced by delly cov): <Link to coveragePipeline.py>
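Alongside the plots, the coverage check can be reduced to a single number: mean depth inside the predicted SV divided by mean depth in the flanking sequence. The sketch below is separate from coveragePipeline.py. It assumes per-base depths for one chromosome are already loaded into a list indexed by 0-based position, and the function name and flank size are illustrative.

```python
from statistics import mean
from typing import Sequence


def coverage_ratio(
    depth: Sequence[float], start: int, end: int, flank: int = 1000
) -> float:
    """Mean depth in [start, end) divided by mean depth in the two flanks.

    `depth` holds per-base coverage for one chromosome, 0-based.
    `flank` is the number of bases taken on each side of the region.
    """
    if not 0 <= start < end <= len(depth):
        raise ValueError("region falls outside the coverage track")

    # Flanks are truncated automatically at the chromosome ends.
    left = depth[max(0, start - flank):start]
    right = depth[end:end + flank]
    flanks = list(left) + list(right)
    if not flanks or mean(flanks) == 0:
        raise ValueError("no usable flanking coverage")

    return mean(depth[start:end]) / mean(flanks)


# Toy track: 30x background with a 50 bp stretch at 60x.
toy_depth = [30] * 100 + [60] * 50 + [30] * 100
print(coverage_ratio(toy_depth, 100, 150, flank=50))
```

A ratio well above 1 is consistent with a duplication and a ratio close to 0 with a deletion. Where to draw the line depends on how noisy your coverage is, so treat any cut-off as something to tune against your own data.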