Germline WGS

This case study considers whole-genome germline variant calling in an individual. We will use the syntheic-diploid sample CHM1-CHM13.

This tutorial is included as a Snakemake workflow. Run it with

$ snakemake --snakefile germline.smk -j 16 --use-singularity

Prerequisites#

samtools
bcftools
BWA
RTG Tools
Octopus (with random forests installed).

This tutorial assumes a directory structure like

.
└───data
│   └───references
│   └───reads
│   │   └───raw
│   │   └───mapped
│   └───truth
└───results
    └───calls
    └───eval

We'll go ahead and create this upfront:

$ mkdir -p data/{references,truth} data/reads/{raw,mapped}
$ mkdir -p results/{calls,eval}

Download data#

First, download WGS CHM1-CHM13 PCR-free Illumina HiSeq-X10 reads (PRJEB13208) from EBI:

$ curl ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR134/003/ERR1341793/ERR1341793_1.fastq.gz | gzip > data/reads/raw/CHM1-CHM13_1.fastq.gz
$ curl ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR134/003/ERR1341793/ERR1341793_2.fastq.gz | gzip > data/reads/raw/CHM1-CHM13_2.fastq.gz

Next, we need a reference genome. In this example we'll use GRCh38 with ALT and decoys contigs (hs38DH). The simplest way to get this is with bwakit:

$ run-gen-ref hs38DH
$ mv hs38DH.fa* data/references

We'll also need the truth variants to evaluate our calls, which we can get from the CHM-eval GitHub:

$ curl -L https://github.com/lh3/CHM-eval/releases/download/v0.5/CHM-evalkit-20180222.tar | tar xf - > data/truth/syndip

Map reads#

First, we need to index the reference genome for alignment:

$ samtools faidx data/references/hs38DH.fa
$ bwa index data/references/hs38DH.fa

Next, we map the raw reads to the reference genome with bwa mem:

$ bwa mem -t 16 \
     -R "@RG\tID:0\tSM:CHM1-CHM1378\tLB:Broad\tPU:Illumina" \
     data/reference/hs38DH.fa \
     data/reads/raw/CHM1-CHM13_1.fastq.gz data/reads/raw/CHM1-CHM13_2.fastq.gz | \
     samtools view -bh | \
     samtools sort -@ 4 -o data/reads/mapped/CHM1-CHM13.hs38DH.bwa-mem.bam -
$ samtools index data/reads/mapped/CHM1-CHM13.hs38DH.bwa-mem.bam

This should take a few hours.

Call variants#

Now we can call variants with octopus. Since this is single-sample germline calling, we'll be using the individual calling model. We'll also set the sequence error model to reflect the PCR-free library design of this sample and Illlumina X10 sequencer, and use random forest filtering. Finally, we restrict calling to the primary chromosomes and use the built-in multithreading:

$ octopus \
     -R data/reference/hs38DH.fa \
     -I data/reads/mapped/CHM1-CHM13.hs38DH.bwa-mem.bam \
     -T chr1 to chrM \
     --sequence-error-model PCRF.X10 \
     --forest resources/forests/germline.v0.8.0.forest \
     -o results/calls/CHM1-CHM13.hs38DH.bwa-mem.octopus.vcf.gz \
     --threads 16

This should complete in a few hours.

Evaluate variants#

Finally, we will evaluate our calls with RTG Tools vcfeval. This tool requires the reference sequence to be preprocessed:

$ rtg format -o ~/reference/hs37d5.sdf ~/reference/hs37d5.fa

Then evaluate the calls with vcfeval:

$ rtg vcfeval \
     -t data/reference/hs38DH.sdf \
     -b data/truth/syndip/full.38.vcf.gz \
     --evaluation-regions data/truth/syndip/full.38.bed.gz \
     -c results/calls/CHM1-CHM13.hs38DH.bwa-mem.octopus.vcf.gz \
     -o results/eval/CHM1-CHM13.hs38DH.bwa-mem.octopus.pass.vcfeval \
     -f RFGQ \
     -m annotate \
     --ref-overlap

We should see the following results:

Threshold  True-pos-baseline  True-pos-call  False-pos  False-neg  Precision  Sensitivity  F-measure
----------------------------------------------------------------------------------------------------