VCF Format
Octopus reports variants in VCF 4.3 format.
Haplotypes
Octopus always reports phased genotypes (GT alleles separated with | rather than /). The extent of phasing is given in the PS FORMAT field, which refers to the POS of a previous record. For example, if the records at positions 100 and 200 both have a PS of 100, the variants at positions 100 and 200 are phased, whereas if the record at position 200 instead has a PS of 200, the variants are unphased. Note that phase sets may not be contiguous.
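As a rough illustration of how PS ties records together, the sketch below groups the records of one sample by phase set. The record tuples and the helper name `group_by_phase_set` are hypothetical and exist only for this example; a real pipeline would read GT and PS with a VCF parser.

```python
from collections import defaultdict

def group_by_phase_set(records):
    """Group (chrom, pos, gt, ps) tuples by (chrom, ps).

    Records that share a PS value on the same contig are phased
    relative to each other.
    """
    phase_sets = defaultdict(list)
    for chrom, pos, gt, ps in records:
        phase_sets[(chrom, ps)].append(pos)
    return dict(phase_sets)

# Hypothetical records for a single sample (not real Octopus output).
records = [
    ("1", 100, "0|1", 100),
    ("1", 200, "0|1", 100),  # phased with the record at POS 100
    ("1", 300, "1|0", 300),  # starts its own phase set
    ("1", 400, "0|1", 100),  # joins the first set again: not contiguous
]

phase_sets = group_by_phase_set(records)
for (chrom, ps), positions in phase_sets.items():
    print(f"{chrom}:{ps} -> {positions}")
```

Grouping by PS value rather than by adjacency is what lets the record at position 400 rejoin the first phase set even though position 300 sits between them.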
Spanning alleles
To represent complex multi-allelic loci, Octopus prefers to decompose alleles into multiple records and use the * allele to resolve conflicts. In particular, Octopus always splits variants that have unique REF alleles into multiple VCF records. For example, two overlapping deletions are represented like:
In contrast, some other tools may choose to represent these variants with
which is inconsistent as the reference is deduced in each record, or
which is consistent, but rapidly becomes unmanageable as the length and number of overlapping variants increases.
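To make the role of the * allele more concrete, here is a minimal sketch that, for a given record position, finds the upstream records whose REF allele spans that position. These are the records a * allele at that position would stand for. The record tuples and the function name `upstream_spanning_records` are assumptions made for this example, not part of Octopus.

```python
def upstream_spanning_records(records, pos):
    """Return the POS of records that start before `pos` and whose
    REF allele covers `pos`.

    Each record is a (pos, ref, alts) tuple on a single contig.
    """
    return [
        rec_pos
        for rec_pos, ref, alts in records
        if rec_pos < pos <= rec_pos + len(ref) - 1
    ]

# Hypothetical overlapping deletions (not real Octopus output).
records = [
    (102, "GTA", ["G"]),      # deletes positions 103-104
    (103, "TA", ["T", "*"]),  # deletes position 104; * marks the upstream deletion
]

spanning = upstream_spanning_records(records, 103)
print(spanning)
```

In this hypothetical case the only record spanning position 103 is the deletion at 102, so the single * allele in the second record refers to it.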
Multiple * records
In some cases, a site may overlap with multiple distinct alleles. Octopus represents these sites using multiple * alleles in the ALT field.
Consider the realignments from the Ashkenazim trio:
The variants called by Octopus in this region are:
Notably, Octopus reports two * alleles at the chr4:19232732 record. This indicates that there are two distinct alleles overlapping this site (at chr4:19232728 and chr4:19232736). 
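When post-processing calls, it can be useful to flag records that carry more than one * allele. The following sketch counts * entries in a comma-separated ALT field and returns their genotype indices. The function names and the ALT strings are hypothetical and used for illustration only.

```python
def spanning_allele_indices(alt_field):
    """Return the 1-based genotype indices of '*' alleles in a
    comma-separated VCF ALT field (index 0 is the REF allele)."""
    return [
        index
        for index, allele in enumerate(alt_field.split(","), start=1)
        if allele == "*"
    ]

def count_spanning_alleles(alt_field):
    """Return how many '*' alleles appear in the ALT field."""
    return len(spanning_allele_indices(alt_field))

# Hypothetical ALT fields (not taken from real Octopus output).
for alt in ["C", "T,*", "*,*"]:
    print(alt, count_spanning_alleles(alt), spanning_allele_indices(alt))
```

A record whose count is two or more is one where, as described above, more than one distinct allele overlaps the site.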
Caution: In some rare cases, Octopus may report double * records in a single sample, resulting in calls like
This generally indicates a homozygous deletion directly upstream of a heterozygous deletion, like: