Realigned BAMs
Octopus can generate realigned BAMs that provide visual evidence for why a call was made. Realigned BAMs are particularly helpful for confirming complex variation where the mapper alignments are incorrect, as can be seen in the IGV pileups below.
Realigned BAMs are requested using the `--bamout` option. The argument to `--bamout` depends on whether you are calling one sample or several. If you are calling a single sample, the argument is a file path to write the BAM to, e.g.:
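A minimal sketch of the single-sample case. All file names are hypothetical placeholders, and the command only runs if `octopus` is found on the `PATH`:

```shell
# Hypothetical paths; replace them with your own files.
reference=reference.fa
reads=sample.bam
calls=calls.vcf
realigned=sample.realigned.bam   # single sample: --bamout takes a file path

# Run the caller only if octopus is installed on this machine.
if command -v octopus >/dev/null 2>&1; then
    octopus -R "$reference" -I "$reads" -o "$calls" --bamout "$realigned"
fi
```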
For multiple samples, the argument to `--bamout` is a directory path, e.g.:
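A sketch of the multi-sample case under the same assumptions: the file and directory names are hypothetical, and the output directory is kept separate from the one holding the input BAMs. Creating the directory up front is a precaution, not something the text above requires.

```shell
# Hypothetical paths; replace them with your own files.
reference=reference.fa
reads_a=sampleA.bam
reads_b=sampleB.bam
calls=calls.vcf
bamout_dir=realigned   # multiple samples: --bamout takes a directory path

# Run the caller only if octopus is installed on this machine.
if command -v octopus >/dev/null 2>&1; then
    mkdir -p "$bamout_dir"
    octopus -R "$reference" -I "$reads_a" "$reads_b" -o "$calls" --bamout "$bamout_dir"
fi
```

With these names, the realigned BAMs would be expected at `realigned/sampleA.bam` and `realigned/sampleB.bam`.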
Realigned BAMs with the same names as the input BAMs will be written to this directory, so this cannot be a directory where any of the input BAMs are located.
important
Realigned BAMs are only available for single-sample BAMs and when `--output` is specified (i.e. no stdout output).
Octopus adds several useful annotations to realigned reads:
Name | Description |
---|---|
HP | A comma-separated list of haplotype IDs that the read is inferred to originate from. A haplotype ID, which is zero-indexed, corresponds to the column in the GT field of the associated phased VCF. A single haplotype ID indicates that the read was unambiguously assigned to that haplotype, while multiple values indicate that the read could equally well be assigned to any of the listed haplotypes. |
MD | Reference-free alignment, as defined in the SAM specification. |
md | Like MD but alignment is relative to the inferred haplotype rather than the reference (i.e. mismatches are inferred sequencing errors). |
hc | The CIGAR alignment to the inferred haplotype. |
PS | The phase set the read was assigned to. |
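
As a rough illustration of how the `HP` annotation might be summarised, the sketch below counts reads per `HP` value in SAM text. The three toy records are invented, and the `Z` (string) type used for `HP` is an assumption rather than something stated above. On a real file the same `awk` program could be fed from `samtools view`.

```shell
# Toy SAM records (hypothetical data; the Z type for HP is an assumption).
sam_records=$(printf 'r1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII\tHP:Z:0\nr2\t0\tchr1\t101\t60\t4M\t*\t0\t0\tCGTA\tIIII\tHP:Z:0,1\nr3\t0\tchr1\t102\t60\t4M\t*\t0\t0\tGTAC\tIIII\tHP:Z:1\nr4\t0\tchr1\t103\t60\t4M\t*\t0\t0\tTACG\tIIII\tHP:Z:0\n')

# Scan the optional fields (column 12 onwards) for the HP tag and
# count how many reads carry each distinct value.
hp_counts=$(printf '%s\n' "$sam_records" | awk -F'\t' '{
    for (i = 12; i <= NF; i++)
        if ($i ~ /^HP:/) { split($i, f, ":"); n[f[3]]++ }
} END { for (h in n) print h "\t" n[h] }')

printf '%s\n' "$hp_counts"
```

Reads with a single value (here `0` or `1`) were assigned unambiguously, while a value such as `0,1` marks a read that fits either haplotype equally well.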
tip
The `HP` tag is useful for colouring and grouping alignments in IGV.
By default, only reads supporting regions containing called variation are realigned. However, Octopus can also copy reads overlapping regions where no variation was called using the `--bamout-type FULL` option. Only primary reads are used for BAM realignment.
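
A sketch of requesting the full output for a single sample. The file names are hypothetical, and the command only runs if `octopus` is installed:

```shell
# Hypothetical paths; replace them with your own files.
reference=reference.fa
reads=sample.bam
calls=calls.vcf
realigned=sample.realigned.bam
bamout_type=FULL   # also copy reads from regions with no called variation

# Run the caller only if octopus is installed on this machine.
if command -v octopus >/dev/null 2>&1; then
    octopus -R "$reference" -I "$reads" -o "$calls" \
        --bamout "$realigned" --bamout-type "$bamout_type"
fi
```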
caution
Reads are assigned and realigned to haplotypes called in the `--output` VCF. This means that read-pairs in different phase sets can appear discordant, and reads that are not completely spanned by a phase set (or overlap multiple phase sets) may have poor alignments. Consider trying to increase haplotype lengths if this occurs.