VCF Format
Octopus reports variants in VCF 4.3 format.
Haplotypes
Octopus always reports phased genotypes (GT alleles separated with | rather than /). The extent of phasing is provided in the PS FORMAT field, which refers to the POS of a previous record. For example, if the records at positions 100 and 200 both have a PS of 100, the variants at those positions are phased, while if the record at position 200 has a different PS value, the variants are unphased. Note that phase sets may not be contiguous.
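The PS convention described above can be sketched in a few lines of Python. This is only an illustration: the three records are made up, and the helper name `phase_sets` is hypothetical rather than part of Octopus. It groups record positions by their PS value for one sample column, so records sharing a PS end up in the same phase set.

```python
from collections import defaultdict

# Hypothetical VCF body lines for a single sample (all values are made up).
VCF_LINES = [
    "1\t100\t.\tA\tC\t.\t.\t.\tGT:PS\t0|1:100",
    "1\t200\t.\tG\tT\t.\t.\t.\tGT:PS\t1|0:100",
    "1\t300\t.\tC\tG\t.\t.\t.\tGT:PS\t0|1:300",
]

def phase_sets(lines, sample_index=0):
    """Group record positions by (CHROM, PS) for one sample column."""
    groups = defaultdict(list)
    for line in lines:
        if line.startswith("#"):
            continue  # skip header lines
        fields = line.rstrip("\n").split("\t")
        chrom, pos = fields[0], int(fields[1])
        keys = fields[8].split(":")                   # FORMAT keys, e.g. GT:PS
        values = fields[9 + sample_index].split(":")  # values for this sample
        ps = dict(zip(keys, values)).get("PS", ".")
        if ps == ".":
            continue  # no phase set reported for this record
        groups[(chrom, int(ps))].append(pos)
    return dict(groups)

print(phase_sets(VCF_LINES))
```

Here the records at positions 100 and 200 share PS 100 and so fall in one phase set, while the record at 300 forms its own. Because the grouping is by PS value rather than by adjacency, it also copes with phase sets that are not contiguous.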
Spanning alleles
To represent complex multi-allelic loci, Octopus prefers to decompose alleles into multiple records and use the * allele to resolve conflicts. In particular, Octopus always splits variants that have unique REF alleles into multiple VCF records, so two overlapping deletions are reported as two records, each using the * allele for the other deletion. In contrast, some other tools may represent these variants either as separate records without the * allele, which is inconsistent as the reference is deduced in each record, or as a single record with a shared REF allele, which is consistent, but rapidly becomes unmanageable as the length and number of overlapping variants increase.
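As a rough illustration of the decomposition idea (not Octopus's actual implementation), the sketch below takes called ALT alleles tagged with the haplotype they sit on, emits one record per distinct (POS, REF) pair, and uses a `*` allele for any haplotype covered by an overlapping allele from a different record. The function names, input tuple layout, and sequences are all assumptions made for the example, and the sketch only ever adds a single `*` per record.

```python
from collections import defaultdict

def overlaps(pos_a, ref_a, pos_b, ref_b):
    """True if the reference spans [pos, pos + len(ref)) of two alleles intersect."""
    return pos_a < pos_b + len(ref_b) and pos_b < pos_a + len(ref_a)

def decompose(calls, ploidy=2):
    """calls: (pos, ref, alt, haplotype) tuples, one ALT allele per haplotype.
    Returns one (pos, ref, alts, gt) record per distinct (pos, ref)."""
    by_ref = defaultdict(list)
    for pos, ref, alt, hap in calls:
        by_ref[(pos, ref)].append((alt, hap))
    records = []
    for (pos, ref), alleles in sorted(by_ref.items()):
        alts = []
        gt = [0] * ploidy  # default to the reference allele
        for alt, hap in alleles:
            if alt not in alts:
                alts.append(alt)
            gt[hap] = alts.index(alt) + 1
        # Haplotypes carrying an overlapping allele from another record get '*'.
        for o_pos, o_ref, _, o_hap in calls:
            other_record = (o_pos, o_ref) != (pos, ref)
            if other_record and gt[o_hap] == 0 and overlaps(pos, ref, o_pos, o_ref):
                if "*" not in alts:
                    alts.append("*")
                gt[o_hap] = alts.index("*") + 1
        records.append((pos, ref, alts, "|".join(map(str, gt))))
    return records

# Two hypothetical overlapping deletions on different haplotypes.
calls = [(100, "ACGTACGT", "A", 0), (100, "ACGTACGTACGT", "A", 1)]
for record in decompose(calls):
    print(record)
```

Each output record keeps its own REF, and neither record claims the reference allele on a haplotype that actually carries the other deletion.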
Multiple * records
In some cases, a site may overlap with multiple distinct alleles. Octopus represents these sites using multiple * alleles in the ALT field.
Consider the realignments from the Ashkenazim trio:
The variants called by Octopus in this region are:
Notably, Octopus reports two * alleles at the chr4:19232732 record. This indicates that there are two distinct alleles overlapping this site (at chr4:19232728 and chr4:19232736).
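When post-processing such output, it can help to detect records with more than one * allele and to count how many of a sample's genotype alleles point at them. The following is a minimal sketch using plain string handling; the record is hypothetical (made-up contig, position, and alleles) and the helper names are not part of Octopus.

```python
def star_alleles(vcf_line):
    """Return the 1-based ALT indices whose allele is '*'."""
    alts = vcf_line.rstrip("\n").split("\t")[4].split(",")
    return [i for i, allele in enumerate(alts, start=1) if allele == "*"]

def sample_star_count(vcf_line, sample_index=0):
    """Count the genotype alleles of one sample that refer to a '*' ALT."""
    fields = vcf_line.rstrip("\n").split("\t")
    keys = fields[8].split(":")
    values = fields[9 + sample_index].split(":")
    gt = dict(zip(keys, values))["GT"]
    stars = set(star_alleles(vcf_line))
    called = gt.replace("/", "|").split("|")
    return sum(1 for a in called if a.isdigit() and int(a) in stars)

# Hypothetical record with two '*' alleles in ALT.
line = "chrN\t500\t.\tT\tC,*,*\t.\t.\t.\tGT\t2|3"
print(star_alleles(line), sample_star_count(line))
```

A record for which `star_alleles` returns more than one index overlaps several distinct alleles, and a `sample_star_count` equal to the sample's ploidy corresponds to the double * case within a single sample.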
caution
In some rare cases, Octopus may report double * records in a single sample, resulting in calls like
This generally indicates a homozygous deletion directly upstream of a heterozygous deletion, like: