Command Line Reference
General options#
--help#
Command --help (short -h) prints the list of available command line options.
--version#
Command --version prints the version of the current binary, including the git branch and commit. The command also prints some system information about the binary.
--config#
Option --config is used to specify configuration files.
--trace#
Option --trace results in a log file containing detailed debug information being created.
Warning Trace files can get very large, so only use this option on small inputs.
--working-directory#
Option --working-directory (short -w) is used to set the working directory of the run. All output and temporary files will be relative to the working directory, unless absolute file paths are provided.
For example, if the working directory is ~/vcf and the output is given as octopus.vcf, then the output will be in ~/vcf/octopus.vcf.
Notes
- If no working directory is specified then the current directory is used.
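The path rule described for --working-directory can be illustrated with a minimal Python sketch. It is not Octopus code; the function name and the example paths are hypothetical and only show how a relative path is joined onto the working directory while an absolute path is left alone.

```python
from pathlib import PurePosixPath


def resolve_against_working_directory(path: str, working_directory: str) -> str:
    """Return `path` unchanged if it is absolute, else join it onto the working directory."""
    candidate = PurePosixPath(path)
    if candidate.is_absolute():
        return str(candidate)
    return str(PurePosixPath(working_directory) / candidate)


# Hypothetical paths, for illustration only.
print(resolve_against_working_directory("octopus.vcf", "/home/user/vcf"))
print(resolve_against_working_directory("/data/out.vcf", "/home/user/vcf"))
```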
 
--resolve-symlinks#
Command --resolve-symlinks forces all symbolic link input paths to be replaced with their resolved targets during program initialisation.
--threads#
Option --threads is used to enable multithreaded execution. The option accepts a positive integer argument that specifies the maximum number of threads to use.
If no argument is provided then automatic thread handling is used. The number of threads chosen automatically may be greater than the number of system threads.
--max-reference-cache-memory#
Option --max-reference-cache-memory (short -X) controls the size of the buffer used for reference caching, and is therefore one way to control memory use. The option accepts a non-negative integer argument in bytes, and an optional unit specifier.
Notes
- Capitalisation of the units is ignored.
- An argument of 0 disables reference caching.
- The buffer size never exceeds the size of the reference.
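The notes above can be sketched in Python as follows. This is an illustration rather than Octopus's parser: the set of unit specifiers and their binary multipliers in `UNIT_MULTIPLIERS` are assumptions, as are the function names.

```python
import re

# Assumed unit table (binary multiples); the specifiers Octopus accepts may differ.
UNIT_MULTIPLIERS = {"": 1, "B": 1, "KB": 1024, "MB": 1024**2, "GB": 1024**3}


def parse_memory(argument: str) -> int:
    """Parse an integer with an optional unit, ignoring the case of the unit."""
    match = re.fullmatch(r"\s*(\d+)\s*([A-Za-z]*)\s*", argument)
    if match is None:
        raise ValueError(f"not a memory size: {argument!r}")
    value, unit = match.groups()
    unit = unit.upper()
    if unit not in UNIT_MULTIPLIERS:
        raise ValueError(f"unknown unit: {unit!r}")
    return int(value) * UNIT_MULTIPLIERS[unit]


def effective_reference_cache(argument: str, reference_size: int) -> int:
    """Cache size in bytes: 0 means caching is disabled, and the cache
    never exceeds the size of the reference."""
    return min(parse_memory(argument), reference_size)
```

For instance, under these assumptions `parse_memory("500mb")` and `parse_memory("500MB")` give the same number of bytes.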
 
--target-read-buffer-memory#
Option --target-read-buffer-memory (short -B) controls the size of the memory buffer used for input sequencing reads, and is therefore one way to control memory use. The option accepts a positive integer argument in bytes, and an optional unit specifier.
Notes
- Capitalisation of the units is ignored.
- The minimum read buffer size is 50Mb; arguments less than this are ignored.
--target-working-memory#
Option --target-working-memory sets the target amount of working memory for computation, and is therefore one way to control memory use. The option accepts a positive integer argument in bytes, and an optional unit specifier. The option is not strictly enforced, but is sometimes used to decide whether to switch to lower-memory versions of some methods (possibly at the cost of additional runtime).
Notes
- Capitalisation of the units is ignored.
 
--temp-directory-prefix#
Option --temp-directory-prefix sets the Octopus working temporary directory prefix.
Notes
- Like all path arguments in Octopus, the argument is assumed to be relative to the --working-directory, unless the path is absolute and exists.
- If a directory with the prefix already exists, then -N is appended to the prefix where N is the smallest integer such that the resulting path does not exist.
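The suffix rule in the last note can be sketched as below. This is a hedged illustration, not Octopus's implementation: the prefix name `octopus-temp` is hypothetical, and starting the count at 1 is an assumption (the text only says "the smallest integer").

```python
import os
import tempfile


def unique_temp_directory(prefix: str) -> str:
    """Return `prefix` if unused, else `prefix-N` for the smallest unused N."""
    if not os.path.exists(prefix):
        return prefix
    n = 1  # assumption: counting starts at 1
    while os.path.exists(f"{prefix}-{n}"):
        n += 1
    return f"{prefix}-{n}"


# Create three directories in a scratch location to show the naming sequence.
with tempfile.TemporaryDirectory() as root:
    prefix = os.path.join(root, "octopus-temp")  # hypothetical prefix
    names = []
    for _ in range(3):
        path = unique_temp_directory(prefix)
        os.mkdir(path)
        names.append(os.path.basename(path))
print(names)
```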
--reference#
Option --reference (short -R) sets the reference FASTA file used for variant calling. This is a required option.
Notes
- In addition to the FASTA file, a reference FASTA index file (extension .fai) is also required. This must be present in the same directory as the FASTA file.
- Like all path arguments in Octopus, the argument is assumed to be relative to the --working-directory, unless the path is absolute and exists.
--reads#
Option --reads (short -I) specifies the input read files (extension .bam or .cram) used for variant calling. The argument is a list of file paths. This is a required option (unless --reads-file is specified).
Notes
- Each read file must have a paired index file in the same directory.
 - The option can be specified multiple times on the command line (arguments are concatenated).
- Can be used in conjunction with --reads-file (arguments are concatenated).
- Like all path arguments in Octopus, the argument is assumed to be relative to the --working-directory, unless the path is absolute and exists.
--reads-file#
Option --reads-file (short -i) specifies a file containing a list of input read files (one per line).
--regions#
Option --regions (short -T) specifies a list of genomic regions to call.
Notes
- If this option is not specified (and neither is --regions-file), then all contigs present in the reference genome index are used.
- Commas in the position tokens are ignored.
 - The option can be specified multiple times on the command line (arguments are concatenated).
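As a rough illustration of the comma rule, the sketch below parses a region token in Python. The `contig:begin-end` token layout is an assumption made for this example (this section does not spell out the accepted syntax), and `Region` and `parse_region` are hypothetical names.

```python
from typing import NamedTuple, Optional


class Region(NamedTuple):
    contig: str
    begin: Optional[int]  # None means "from the start of the contig"
    end: Optional[int]    # None means "to the end of the contig"


def parse_region(token: str) -> Region:
    """Parse an assumed `contig[:begin[-end]]` token, ignoring commas in positions."""
    contig, separator, positions = token.partition(":")
    if not separator:
        return Region(contig, None, None)  # whole contig
    positions = positions.replace(",", "")  # commas in position tokens are ignored
    begin, _, end = positions.partition("-")
    return Region(contig, int(begin), int(end) if end else None)


print(parse_region("chr1:1,000,000-2,000,000"))
```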
 
--regions-file#
Option --regions-file (short -t) specifies a file containing a list of genomic regions to call (one per line). The file format can either be plain text (using the same input format as the --regions option), or BED format (i.e. tab-separated). GZipped files are also accepted.
--skip-regions#
Option --skip-regions (short -K) specifies a list of genomic regions to ignore during calling.
Notes
- The option can be specified multiple times on the command line (arguments are concatenated).
 
--skip-regions-file#
Option --skip-regions-file (short -k) specifies a file containing a list of genomic regions to skip (one per line). The file format can either be plain text (using the same input format as the --regions option), or BED format (i.e. tab-separated). GZipped files are also accepted.
--one-based-indexing#
Command --one-based-indexing directs the program to read all input regions (i.e. those specified in options --regions, --regions-file, --skip-regions, and --skip-regions-file) using 1-based indexing rather than 0-based indexing.
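The difference between the two conventions can be shown with a small conversion sketch. It assumes the common pairing of 1-based closed intervals with 0-based half-open intervals; whether Octopus uses exactly these interval conventions is an assumption, and the function name is hypothetical.

```python
from typing import Tuple


def one_based_to_zero_based(begin: int, end: int) -> Tuple[int, int]:
    """Convert a 1-based closed interval [begin, end] to 0-based half-open [begin-1, end)."""
    if begin < 1 or end < begin:
        raise ValueError("expected 1 <= begin <= end")
    return begin - 1, end


# The same 101 bases, written in both conventions.
print(one_based_to_zero_based(100, 200))
```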
--ignore-unmapped-contigs#
Command --ignore-unmapped-contigs can be used to force execution if there is a mismatch between the input reference genome and the one used to map input reads (as specified in the SAM header), in particular if there are contigs present in the reference genome but not in the reference used for read mapping.
--samples#
Option --samples (short -S) specifies a list of samples to use for variant calling. Each sample must be present in the input read files (using the SM SAM tag).
Notes
- If no samples are specified, all samples present in the input read files are used.
 - The option can be specified multiple times on the command line (arguments are concatenated).
 
--samples-file#
Option --samples-file (short -s) specifies a file containing a list (one per line) of samples to use for variant calling.
--output#
Option --output (short -O) sets the output destination. The file extension is used to determine the output file type (valid types are .vcf, .vcf.gz, and .bcf). If the file type is not recognised then uncompressed VCF is used.
Notes
- If no output is specified, stdout is used.
- Like all path arguments in Octopus, the argument is assumed to be relative to the --working-directory, unless the path is absolute and exists.
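The extension rule for --output can be sketched in a few lines of Python. This is only an illustration of the mapping described above; the function name and the returned labels are hypothetical.

```python
def output_type(path: str) -> str:
    """Map an output file name to a file type, defaulting to uncompressed VCF."""
    # Check the longest extension first so ".vcf.gz" is not mistaken for ".gz".
    for extension, kind in ((".vcf.gz", "compressed VCF"), (".vcf", "VCF"), (".bcf", "BCF")):
        if path.endswith(extension):
            return kind
    return "VCF"  # unrecognised extensions fall back to uncompressed VCF


for name in ("calls.vcf", "calls.vcf.gz", "calls.bcf", "calls.txt"):
    print(name, "->", output_type(name))
```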
--contig-output-order#
Option --contig-output-order specifies the order in which records will be processed and written to the output. Possible options are: lexicographicalAscending, lexicographicalDescending, contigSizeAscending, contigSizeDescending, asInReferenceIndex, asInReferenceIndexReversed, unspecified.
--sites-only#
Command --sites-only removes genotype information from the final output (i.e. drops the VCF FORMAT and sample columns).
--bamout#
Option --bamout is used to produce realigned evidence BAMs. The option input is a file prefix where the BAMs should be written.
Notes
- The --bamout option is only allowed if the final output is written to file (i.e. with --output).
- The default behaviour of --bamout is to realign only those reads supporting variant haplotypes (see also --bamout-type).
--bamout-type#
Option --bamout-type is used to select the type of realigned evidence BAM to produce when using the --bamout option. If FULL is chosen then all reads in the input BAM are written to the evidence BAM. If MINI is chosen (default) then only reads supporting variant sites are written.
--pedigree#
Option --pedigree is used to input a pedigree file (.ped). Some calling models may use pedigree information to improve calling accuracy.
where family.ped might contain, for example
Notes
- Specifying this option with a .ped file defining a trio relationship between three input samples will automatically activate the trio calling model.
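To make the trio note concrete, here is a minimal Python sketch that reads pedigree text in the standard six-column PED layout (family, individual, father, mother, sex, phenotype) and reports any child whose parents are both listed. It does not reproduce how Octopus detects trios; the sample names and the function name are hypothetical.

```python
from typing import List, Tuple


def find_trios(ped_text: str) -> List[Tuple[str, str, str]]:
    """Return (child, father, mother) for each individual whose parents are both listed."""
    parents = {}
    for line in ped_text.splitlines():
        fields = line.split()
        if len(fields) < 4 or fields[0].startswith("#"):
            continue  # skip blank, short, or comment lines
        _family, individual, father, mother = fields[:4]
        parents[individual] = (father, mother)
    return [
        (child, father, mother)
        for child, (father, mother) in parents.items()
        if father in parents and mother in parents
    ]


# Hypothetical pedigree: "0" marks an unknown parent.
example = """\
fam1 DAD 0 0 1 0
fam1 MUM 0 0 2 0
fam1 KID DAD MUM 2 0
"""
print(find_trios(example))
```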
--fast#
Command --fast sets up Octopus for fast variant calling.
Warning This command may reduce calling accuracy.
--very-fast#
Command --very-fast sets up Octopus for very fast variant calling.
Warning This command may reduce calling accuracy.
--data-profile#
Option --data-profile is used to generate a profile of the input read data, which may be used to make new sequence error models.
--bad-region-tolerance#
Option --bad-region-tolerance specifies the user tolerance for regions that may be 'uncallable' (e.g. due to mapping errors) and slow down calling. The possible arguments are:
- LOW: Low tolerance to bad regions.
- NORMAL: Default tolerance to bad regions.
- HIGH: High tolerance to bad regions.
- UNLIMITED: Turn off bad region detection.
Read pre-processing options#
--disable-read-preprocessing#
Command --disable-read-preprocessing can be used to disable all optional read preprocessing - all viable raw input alignments will be used.
--max-base-quality#
Option --max-base-quality caps the base quality of all input reads.
--mask-low-quality-tails#
Option --mask-low-quality-tails is used to mask (assign base quality zero) read tail bases that have low base quality scores. The value provided to the option is the threshold used to define a 'low' quality score.
--mask-tails#
Option --mask-tails is used to unconditionally mask (assign base quality zero) read tail bases. The value provided to the option is the number of bases to mask.
--mask-soft-clipped-bases#
Command --mask-soft-clipped-bases is used to mask (assign base quality zero) all read bases that are soft clipped in the input alignments.
--soft-clip-mask-threshold#
Option --soft-clip-mask-threshold makes the option --mask-soft-clipped-bases conditional on the base quality score; only bases below the value given to the option will be masked.
--mask-soft-clipped-boundary-bases#
Option --mask-soft-clipped-boundary-bases makes the option --mask-soft-clipped-bases mask additional bases adjacent to soft clipped bases. The value provided to the option is the number of additional bases to mask.
--mask-inverted-soft-clipping#
Command --mask-inverted-soft-clipping is used to mask (assign base quality zero) all read bases that are soft clipped and are inverted copies of nearby sequence.
--mask-3prime-shifted-soft-clipped-heads#
Command --mask-3prime-shifted-soft-clipped-heads is used to mask (assign base quality zero) all read bases that are soft clipped and are shifted copies of nearby 3' sequence.
--disable-adapter-masking#
Command --disable-adapter-masking disables the masking (assigning base quality zero) of read bases that are considered to be adapter contamination.
Notes
- The algorithm used to detect adapter contamination depends only on the input alignment mapping information; no library of adapter sequences is used.
 
--disable-overlap-masking#
Command --disable-overlap-masking disables the masking (assigning base quality zero) of read bases of read templates that contain overlapping segments, which is otherwise applied in order to remove non-independent base observations.
Notes
- All but one of the overlapping read bases are masked, leaving one base untouched. If two segments are overlapping, then half of the 5' bases of each segment are masked.
 
--split-long-reads#
Command --split-long-reads stipulates that reads longer than --max-read-length should be split into smaller linked reads.
--consider-unmapped-reads#
Command --consider-unmapped-reads turns off the read filter that removes reads marked as unmapped in the input alignments.
--min-mapping-quality#
Option --min-mapping-quality specifies the minimum mapping quality that reads must have to be considered; reads with mapping quality below this will be filtered and not used for analysis.
--good-base-quality#
Option --good-base-quality defines the minimum quality of a 'good' base for the options --min-good-base-fraction and --min-good-bases.
--min-good-base-fraction#
Option --min-good-base-fraction specifies the fraction of 'good' (see --good-base-quality) base qualities a read must have in order to be considered; reads with a fraction of 'good' base qualities less than this will be filtered and not considered.
--min-good-bases#
Option --min-good-bases specifies the number of 'good' (see --good-base-quality) base qualities a read must have in order to be considered; reads with fewer 'good' base qualities than this will be filtered and not considered.
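The interaction of --good-base-quality, --min-good-base-fraction, and --min-good-bases can be sketched as a single predicate. This is an illustrative reading of the descriptions above, not Octopus's filter code; the function name and the threshold values used in the example are hypothetical.

```python
from typing import Sequence


def passes_good_base_filters(
    qualities: Sequence[int],
    good_base_quality: int,
    min_good_base_fraction: float,
    min_good_bases: int,
) -> bool:
    """Keep a read only if it has enough 'good' bases, both as a fraction and as a count."""
    good = sum(q >= good_base_quality for q in qualities)
    fraction = good / len(qualities) if qualities else 0.0
    return fraction >= min_good_base_fraction and good >= min_good_bases


# A read with two 'good' bases out of four (hypothetical thresholds).
read_qualities = [30, 30, 10, 10]
print(passes_good_base_filters(read_qualities, 20, 0.5, 2))
print(passes_good_base_filters(read_qualities, 20, 0.5, 3))
```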
--allow-qc-fails#
Command --allow-qc-fails permits reads marked as QC fail in the input alignments to be considered, otherwise they will be filtered.
--min-read-length#
Option --min-read-length specifies the minimum length (number of sequence bases) of reads to be considered; reads with length less than this will be filtered and not considered.
--max-read-length#
Option --max-read-length specifies the maximum length (number of sequence bases) of reads to be considered; reads with length greater than this will be filtered and not considered.
--allow-marked-duplicates#
Command --allow-marked-duplicates disables the filter that removes reads marked as duplicate in the input alignments. The default behaviour is to remove reads marked duplicate.
--allow-octopus-duplicates#
Command --allow-octopus-duplicates disables the filter that removes reads that Octopus considers to be duplicates. The default behaviour is to remove duplicate reads.
--duplicate-read-detection-policy#
Option --duplicate-read-detection-policy specifies the approach to use for detecting duplicate reads. Possible arguments are:
- RELAXED: Require 5' mapping co-ordinate matches and identical cigar strings.
- AGGRESSIVE: Require 5' mapping co-ordinate matches only.
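The two policies differ only in what has to match, which the following Python sketch makes explicit by building a comparison key per read. The `Read` record and its fields are hypothetical simplifications (for example, mate information is left out), so this is not how Octopus represents reads.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class Read:  # hypothetical, simplified read record
    name: str
    contig: str
    five_prime_position: int
    is_reverse: bool
    cigar: str


def duplicate_key(read: Read, policy: str) -> Tuple:
    """Reads with equal keys are treated as duplicates under the given policy."""
    key = (read.contig, read.five_prime_position, read.is_reverse)
    if policy == "RELAXED":
        return key + (read.cigar,)  # cigar strings must also be identical
    if policy == "AGGRESSIVE":
        return key  # 5' co-ordinates only
    raise ValueError(f"unknown policy: {policy}")


def count_duplicates(reads: Iterable[Read], policy: str) -> int:
    """Count reads whose key has already been seen."""
    seen, duplicates = set(), 0
    for read in reads:
        key = duplicate_key(read, policy)
        if key in seen:
            duplicates += 1
        else:
            seen.add(key)
    return duplicates


reads = [
    Read("a", "chr1", 1000, False, "100M"),
    Read("b", "chr1", 1000, False, "90M10S"),
    Read("c", "chr1", 1000, False, "100M"),
]
print(count_duplicates(reads, "RELAXED"), count_duplicates(reads, "AGGRESSIVE"))
```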
--allow-secondary-alignments#
Command --allow-secondary-alignments disables the filter that removes reads that are marked as secondary alignments in the input alignments. The default behaviour is to remove reads marked secondary.
--allow-supplementary-alignments#
Command --allow-supplementary-alignments disables the filter that removes reads that are marked as supplementary alignments in the input alignments. The default behaviour is to remove reads marked supplementary.
--max-decoy-supplementary-alignment-mapping-quality#
Option --max-decoy-supplementary-alignment-mapping-quality removes any reads with supplementary alignments (i.e. SA SAM tag) to decoy contigs with mapping quality greater than the specified value.
--max-unplaced-supplementary-alignment-mapping-quality#
Option --max-unplaced-supplementary-alignment-mapping-quality removes reads with supplementary alignments (i.e. SA SAM tag) to unplaced contigs with mapping quality greater than the specified value.
--max-unlocalized-supplementary-alignment-mapping-quality#
Option --max-unlocalized-supplementary-alignment-mapping-quality removes reads with supplementary alignments (i.e. SA SAM tag) to unlocalized contigs with mapping quality greater than the specified value.
--no-reads-with-unmapped-segments#
Command --no-reads-with-unmapped-segments removes reads that are marked as having unmapped segments in the input alignments. The default behaviour is to allow reads with unmapped segments.
--no-reads-with-distant-segments#
Command --no-reads-with-distant-segments removes reads that have segments mapped to different contigs. The default behaviour is to allow reads with distant segments.
--no-adapter-contaminated-reads#
Command --no-adapter-contaminated-reads removes reads that are considered to have adapter contamination (i.e. sequence bases from the adapter). The default behaviour is to allow reads with adapter contamination.
--disable-downsampling#
Command --disable-downsampling turns off downsampling.
--downsample-above#
Option --downsample-above specifies the read depth required to mark a position as a candidate for the downsampler.
--downsample-target#
Option --downsample-target specifies the target read depth for the downsampler for all candidate sites. Reads will be removed from the input alignments until all positions have read depth not greater than this.
--use-same-read-profile-for-all-samples#
Command --use-same-read-profile-for-all-samples specifies that the same input read profile should be used for all samples, rather than generating one for each sample. This essentially means that the same read distribution is assumed for all samples. 
Variant discovery options#
--variant-discovery-mode#
Option --variant-discovery-mode specifies the thresholds used for candidate variant discovery, which affects the overall sensitivity of the generators. Possible values are:
- ILLUMINA: The default mode, for Illumina quality reads.
- PACBIO: For PacBio quality reads; requires more observations to propose a candidate, particularly indel candidates.
--disable-denovo-variant-discovery#
Command --disable-denovo-variant-discovery disables the pileup candidate variant generator.
--disable-repeat-candidate-generator#
Command --disable-repeat-candidate-generator disables the tandem repeat candidate variant generator.
--disable-assembly-candidate-generator#
Command --disable-assembly-candidate-generator disables the local de novo assembly candidate variant generator.
--source-candidates#
Option --source-candidates (short -c) accepts a list of VCF files, the contents of which will be added to the candidate variant set.
Notes
- The option accepts files in the .vcf, .vcf.gz, and .bcf formats. Files in the .vcf.gz and .bcf formats must be indexed. For larger
--source-candidates-file#
Option --source-candidates-file can be used to provide one or more files containing lists (one per line) of files in the .vcf, .vcf.gz, and .bcf formats (see notes on --source-candidates).
where vcfs.txt may contain, for example
--min-source-candidate-quality#
Option --min-source-candidate-quality specifies the minimum QUAL score for a user provided source variant (see --source-candidates and --source-candidates-file) to be added to the final candidate variant list; any records with QUAL less than this will not be considered.
--use-filtered-source-candidates#
Command --use-filtered-source-candidates allows variants in the user-provided candidate variants (see --source-candidates and --source-candidates-file) that are marked as filtered (according to the FILTER column) to be added to the final candidate variant list. The default behaviour is to remove filtered variants.
--min-pileup-base-quality#
Option --min-pileup-base-quality specifies the minimum base quality that a SNV in the input alignments must have in order to be considered by the pileup candidate variant generator.
--min-supporting-reads#
Option --min-supporting-reads specifies the minimum number of supporting reads a variant must have in the input alignments to be included in the candidate variant list from the pileup candidate variant generator.
--allow-pileup-candidates-from-likely-misaligned-reads#
Command --allow-pileup-candidates-from-likely-misaligned-reads stops the pileup candidate generator filtering out candidates that are considered likely to originate from read misalignment, which can otherwise result in many false candidates.
--max-variant-size#
Option --max-variant-size specifies the maximum size (w.r.t reference region) a candidate variant can have to be considered by the calling algorithm; candidate variants with size greater than this will be removed and not considered.
--kmer-sizes#
Option --kmer-sizes specifies the default kmer sizes to try for local de novo assembly. Assembly graphs will be constructed for each kmer size in all active assembler regions, and the union of candidate variants from each assembler graph used for the final candidate set.
--num-fallback-kmers#
Option --num-fallback-kmers specifies the number of additional kmer sizes to try in the case that none of the default (i.e. kmer sizes specified in --kmer-sizes) is able to construct a valid assembly graph (often the case for smaller kmer sizes).
--fallback-kmer-gap#
Option --fallback-kmer-gap specifies the increment between fallback kmer sizes (see --num-fallback-kmers).
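One way to picture how --kmer-sizes, --num-fallback-kmers, and --fallback-kmer-gap combine is the sketch below. It assumes that fallback sizes continue upwards from the largest default kmer size in steps of the gap; that starting point is an assumption, and the function name and example values are hypothetical.

```python
from typing import List, Sequence


def fallback_kmer_sizes(
    default_kmers: Sequence[int], num_fallback_kmers: int, fallback_kmer_gap: int
) -> List[int]:
    """List fallback kmer sizes, assumed to step up from the largest default size."""
    largest = max(default_kmers)
    return [largest + fallback_kmer_gap * i for i in range(1, num_fallback_kmers + 1)]


# Hypothetical settings: three default sizes, three fallbacks, a gap of 10.
print(fallback_kmer_sizes([10, 15, 20], 3, 10))
```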
--max-region-to-assemble#
Option --max-region-to-assemble specifies the maximum reference region size that can be used to construct a local de novo assembly graph. Larger values enable detection of larger variants (e.g. large deletions), but increase the likelihood that small kmer sizes will result in an invalid assembly graph, and therefore decrease sensitivity for smaller variation.
--max-assemble-region-overlap#
Option --max-assemble-region-overlap specifies the maximum overlap between reference regions used to build local de novo assembly graphs. Larger overlaps result in more assembly graphs being constructed and may increase sensitivity for variation, at the expense of compute time. 
--assemble-all#
Command --assemble-all forces all reference regions to be used for local de novo assembly, rather than only regions considered likely to contain variation.
Warning This command may result in substantially longer runtimes.
--assembler-mask-base-quality#
Option --assembler-mask-base-quality specifies the minimum base quality an aligned base should have to avoid being 'masked' (i.e. converted to reference) before being inserted into the local de novo assembly graph. Higher values reduce sensitivity to noise and increase the likelihood of finding bubbles in graphs with larger kmer sizes, at the cost of decreased sensitivity at lower kmer sizes.
--allow-cycles#
Command --allow-cycles forces the assembler to consider assembly graphs containing non-reference cycles, which usually result in false candidates.
--min-kmer-prune#
Option --min-kmer-prune specifies the minimum number of kmer observations that must be present in the local de novo assembly graph for the kmer to stay in the graph before variant extraction. Lower values increase sensitivity but lower specificity.
--max-bubbles#
Option --max-bubbles specifies the maximum number of bubbles in the final local de novo assembly graph that may be explored for candidate variant generation. Higher values increase sensitivity but reduce specificity.
--min-bubble-score#
Option --min-bubble-score specifies the minimum score a bubble explored in a local de novo assembly graph must have for the bubble to be extracted as a candidate variant. The 'bubble score' is the average of all kmer observations along the bubble edge, scaled by the probability the kmer observations along the bubble edge originate from a single read strand. The bubble score can be viewed as a proxy for the number of 'good' read observations for the bubble (i.e. variant), and therefore higher values increase specificity but reduce sensitivity.
--min-candidate-credible-vaf-probability#
Option --min-candidate-credible-vaf-probability sets the minimum probability mass above --min-credible-somatic-frequency required to 'discover' a variant when using the cancer calling model. Smaller values increase the number of candidate variants generated, potentially improving sensitivity but also increasing computational complexity.
Haplotype generation options#
--max-haplotypes#
Option --max-haplotypes specifies the maximum number of haplotypes that are considered by the calling model. It also specifies the target number of haplotypes that the haplotype generator should produce on each iteration of the algorithm. If the haplotype generator is unable to satisfy the request (i.e. produces a greater number of haplotypes), then the set of haplotypes is reduced to this size by removing haplotypes considered unlikely using several likelihood based statistics.
Increasing --max-haplotypes reduces the chance that a true haplotype is incorrectly filtered before evaluation by the calling model. It also increases the number of candidate variants that may be considered on each iteration of the algorithm, potentially improving calling accuracy. However, increasing this value will usually increase runtimes - sometimes substantially. 
--haplotype-holdout-threshold#
Option --haplotype-holdout-threshold specifies the number of haplotypes that the haplotype generator can produce before some active candidate variants are added to the holdout stack. The value must not be less than --max-haplotypes.
--haplotype-overflow#
Option --haplotype-overflow specifies the number of haplotypes that the haplotype generator can produce before the current active region must be skipped. The value must not be less than --haplotype-holdout-threshold.
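The ordering constraints between --max-haplotypes, --haplotype-holdout-threshold, and --haplotype-overflow can be checked with a short sketch like the one below. The function name, messages, and example values are hypothetical; Octopus's own validation and error text may differ.

```python
from typing import List


def validate_haplotype_limits(
    max_haplotypes: int, holdout_threshold: int, overflow: int
) -> List[str]:
    """Return a list of violated constraints (empty if the settings are consistent)."""
    errors = []
    if holdout_threshold < max_haplotypes:
        errors.append("--haplotype-holdout-threshold must not be less than --max-haplotypes")
    if overflow < holdout_threshold:
        errors.append("--haplotype-overflow must not be less than --haplotype-holdout-threshold")
    return errors


# Hypothetical values: a consistent configuration and an inconsistent one.
print(validate_haplotype_limits(200, 2500, 200000))
print(validate_haplotype_limits(200, 100, 50))
```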
--max-holdout-depth#
Option --max-holdout-depth specifies the maximum size of the holdout stack.
--extension-level#
Option --extension-level specifies the condition for extending the active haplotype tree with novel alleles. The possible values are:
- MINIMAL: Only include novel alleles overlapping with overlapping reads.
- NORMAL: Include novel alleles with reads overlapping the rightmost included allele.
- AGGRESSIVE: No conditions on extension other than the number of alleles.
More aggressive extension levels result in larger haplotype blocks, which may improve phase lengths and accuracy. However, aggressive extension increases the possibility of including novel alleles that cannot be phased with active alleles, which will increase compute time.
--lagging-level#
Option --lagging-level specifies the extent to which active alleles remain active in the next algorithm iteration, increasing the length of haplotype blocks.
- NONE: Disable lagging; each iteration of the algorithm will evaluate a novel set of alleles.
- NORMAL: Consider previous active alleles if there are reads overlapping them with the next active allele set.
- AGGRESSIVE: Consider previous active alleles if there are overlapping reads that span them, and the next active alleles.
--backtrack-level#
Option --backtrack-level specifies the extent to which backtracking is used. Possible values are:
- NONE: Disable all backtracking.
- NORMAL: Enables backtracking.
- AGGRESSIVE: Currently the same as NORMAL.
--min-protected-haplotype-posterior#
Option --min-protected-haplotype-posterior specifies the minimum posterior probability that a haplotype is present in the samples (according to the calling model) for the haplotype to avoid being removed from consideration; haplotypes with posterior probability less than this may be filtered. Increasing the value of this option results in a greater number of haplotypes being filtered, allowing the haplotype tree to grow to include more candidate alleles, and reducing computational complexity. However, larger values also increase the chance that a true haplotype is incorrectly discarded. 
--dont-protect-reference-haplotype#
Command --dont-protect-reference-haplotype disables protection of the reference haplotype during haplotype filtering, and therefore means that the reference haplotype is no longer guaranteed to be considered by the calling model.
--bad-region-tolerance#
Option --bad-region-tolerance specifies the 'tolerance' for spending time calling variants in regions that are unlikely to be callable (e.g. due to low complexity sequence). Such regions tend to be computationally difficult and skipping them can save a lot of wasted computation time. However, identification of such regions is not perfect and could result in skipping regions with real variation. Possible argument values are:
- LOW: Skip regions that show any signs of being uncallable.
- NORMAL: Skip regions that show reasonable signs of being uncallable.
- HIGH: Skip regions that show strong signs of being uncallable.
Common variant calling options#
--caller#
Option --caller (short -C) specifies the calling model to be used. The option must only be set if the calling model is not automatically determined from other options.
--organism-ploidy#
Option --organism-ploidy (short -P) specifies the default ploidy of all input samples. All contigs will be assumed to have this ploidy unless specified otherwise in --contig-ploidies.
--contig-ploidies#
Option --contig-ploidies (short -p) can be used to specify the ploidy of contigs (format contig=ploidy) that are not the same as the --organism-ploidy, or the ploidy of individual sample contigs (format sample:contig=ploidy).
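The two argument formats (contig=ploidy and sample:contig=ploidy) can be parsed and applied as in the following sketch. It is illustrative only: the function names and the sample and contig names are hypothetical, and letting a sample-specific entry take precedence over a contig-wide one is an assumption.

```python
from typing import Dict, Iterable, Optional, Tuple

PloidyMap = Dict[Tuple[Optional[str], str], int]


def parse_contig_ploidies(arguments: Iterable[str]) -> PloidyMap:
    """Parse `contig=ploidy` and `sample:contig=ploidy` arguments."""
    ploidies: PloidyMap = {}
    for argument in arguments:
        target, _, ploidy = argument.rpartition("=")
        sample, _, contig = target.rpartition(":")
        ploidies[(sample or None, contig)] = int(ploidy)
    return ploidies


def lookup_ploidy(ploidies: PloidyMap, sample: str, contig: str, organism_ploidy: int) -> int:
    """Sample-specific entry first (assumed precedence), then contig-wide, then the default."""
    for key in ((sample, contig), (None, contig)):
        if key in ploidies:
            return ploidies[key]
    return organism_ploidy


# Hypothetical arguments and sample names.
ploidies = parse_contig_ploidies(["X=1", "NA12878:X=2"])
print(lookup_ploidy(ploidies, "NA12878", "X", 2))
print(lookup_ploidy(ploidies, "OTHER", "X", 2))
print(lookup_ploidy(ploidies, "OTHER", "1", 2))
```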
--contig-ploidies-file#
Option --contig-ploidies-file can be used to provide a file specifying contig or sample contig ploidies (see --contig-ploidies), one per line.  
where ploidies.txt could contain
--min-variant-posterior#
Option --min-variant-posterior specifies the minimum posterior probability (Phred scale) required to call a variant. Candidate variants with posterior probability less than this will not be reported in the final call set.
Notes
- For calling models with more than one class of variation, this option refers to the calling model's default variant class. For example, in the cancer calling model there are both germline and somatic variants, and this option refers to germline variants (see the option --min-somatic-posterior for somatic variants).
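Because the threshold is given on the Phred scale, it can help to convert between Phred values and probabilities. The sketch below assumes the usual convention that a Phred-scaled posterior Q corresponds to a probability of 1 - 10^(-Q/10); the function names are hypothetical.

```python
import math


def phred_to_probability(phred: float) -> float:
    """Posterior probability for a Phred-scaled value, assuming Q = -10 * log10(1 - p)."""
    return 1.0 - 10.0 ** (-phred / 10.0)


def probability_to_phred(probability: float) -> float:
    """Phred-scaled value for a posterior probability strictly below 1."""
    return -10.0 * math.log10(1.0 - probability)


# Under this convention, Phred 20 corresponds to a posterior of about 0.99.
print(phred_to_probability(20))
```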
--refcall#
Option --refcall can be used to enable reference calling, meaning Octopus will generate a 'gVCF' format output. There are two possible arguments for this option:
- BLOCKED: Merge adjacent called reference positions with similar quality (see --refcall-block-merge-quality) into a single gVCF record.
- POSITIONAL: Emit a gVCF record for all positions.
If the option is specified but no argument is provided, BLOCKED is assumed.
--refcall-block-merge-quality#
Option --refcall-block-merge-quality specifies the quality (Phred scale) threshold for merging adjacent called reference positions when BLOCKED refcalls are requested; adjacent reference positions with an absolute quality difference less than or equal to this will be merged into a block.
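The merging rule can be illustrated with a small Python sketch that groups consecutive positions into half-open blocks. Comparing each position with its immediate predecessor is an assumption about how "adjacent" is applied; the function name and the example qualities are hypothetical, and this is not Octopus's implementation.

```python
from typing import List, Sequence, Tuple


def merge_refcall_blocks(qualities: Sequence[float], merge_quality: float) -> List[Tuple[int, int]]:
    """Group consecutive positions into [begin, end) blocks while the absolute
    quality difference between neighbouring positions stays within the threshold."""
    blocks: List[Tuple[int, int]] = []
    for position, quality in enumerate(qualities):
        if blocks and abs(quality - qualities[position - 1]) <= merge_quality:
            blocks[-1] = (blocks[-1][0], position + 1)  # extend the current block
        else:
            blocks.append((position, position + 1))  # start a new block
    return blocks


# Hypothetical per-position qualities with a merge threshold of 5.
print(merge_refcall_blocks([50, 52, 49, 20, 22, 60], 5))
```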
--min-refcall-posterior#
Option --min-refcall-posterior specifies the minimum posterior probability required to call a position as homozygous reference; positions with posterior probability less than this will not be reported in the output gVCF.
--max-refcall-posterior#
Option --max-refcall-posterior caps the QUAL of all reference calls, which may result in larger reference blocks and smaller gVCF file sizes.
--snp-heterozygosity#
Option --snp-heterozygosity specifies the SNV heterozygosity parameter for the coalescent mutation model, used to assign prior probabilities.
--snp-heterozygosity-stdev#
Option --snp-heterozygosity-stdev specifies the standard deviation of the SNV heterozygosity parameter (see --snp-heterozygosity).
--indel-heterozygosity#
Option --indel-heterozygosity specifies the INDEL heterozygosity parameter for the coalescent mutation model, used to assign prior probabilities.
--max-genotypes#
Option --max-genotypes specifies the maximum number of candidate genotypes that must be evaluated by the calling model. If there are more possible candidate genotypes than this value then the algorithm may decide to remove some candidate genotypes using heuristics. The number of candidate genotypes to evaluate can have a substantial impact on runtime for some calling models (e.g. cancer).
--max-genotype-combinations#
Option --max-genotype-combinations specifies the maximum number of candidate joint genotype vectors that must be evaluated by the calling model. If there are more possible candidate joint genotype vectors than this value the algorithm may decide to remove some candidate genotype vectors using heuristics. The number of candidate joint genotype vectors to evaluate can have a substantial impact on runtime. 
Notes
- This option is only used by calling models that consider joint genotypes (i.e. population and trio).
--use-uniform-genotype-priors#
Command --use-uniform-genotype-priors indicates that the uniform genotype prior model should be used for calling genotypes and variants.
--use-independent-genotype-priors#
Command --use-independent-genotype-priors indicates that an independent genotype prior model should be used for evaluating joint genotype vectors.
--model-posterior#
Option --model-posterior enables or disables model posterior evaluation at each called variant site.
--disable-inactive-flank-scoring#
Command --disable-inactive-flank-scoring disables an additional step during haplotype likelihood calculation that attempts to correct low likelihoods caused by inactive variation in the flanking regions of the haplotype under evaluation.
--dont-model-mapping-quality#
Command --dont-model-mapping-quality disables consideration of mapping quality in the haplotype likelihood calculation. This can improve calling accuracy if read mapping qualities are well calibrated.
--sequence-error-model#
Option --sequence-error-model specifies the sequence error model to use.
Notes
- The same error model is used for all input reads.
 
--max-indel-errors#
Option --max-indel-errors specifies the maximum number of indel errors in an individual read fragment that can be accurately modelled by the haplotype likelihood model. Larger values usually require greater computational resources (determined by your system's available SIMD instructions).
--use-wide-hmm-scores#
Command --use-wide-hmm-scores sets the score variable computed by the pair HMM for haplotype likelihoods to 32 bits (the default is 16 bits). This can avoid score overflow in long noisy reads, but will slow down the computation.
--max-vb-seeds#
Option --max-vb-seeds specifies the maximum number of seeds that Variational Bayes models can use for posterior evaluation. Increasing the number of seeds increases the likelihood that a posterior mode will be identified, but results in more computation time.
--read-linkage#
Option --read-linkage specifies how reads are linked in the input alignments. Read linkage information is used by the haplotype likelihood calculation and can improve calling accuracy and increase phase lengths.
- NONE: Reads are not linked in any way.
- PAIRED: Reads may be paired, with pairs having identical read names.
- LINKED: Reads may be linked or paired, with linked reads having identical BX tags.
--min-phase-score#
Option --min-phase-score specifies the minimum phase score (PQ in VCF) required to emit adjacent variant calls in the same phase set. Increasing this value results in fewer sites being phased, but reduces the phase false positive rate.
--disable-early-phase-detection#
Command --disable-early-phase-detection prevents the phasing algorithm being applied to partially resolved haplotype blocks, which can lead to removal of complete phased segments from the head of the current haplotype block. This heuristic can prevent discontiguous phase blocks being resolved, which are more likely in some data (e.g. linked reads).
Cancer variant calling options#
--normal-samples#
Option --normal-samples specifies which of the input samples are normal samples for tumour-normal paired analysis.
Notes
- Specifying this option will automatically activate the cancer calling model.
--max-somatic-haplotypes#
Option --max-somatic-haplotypes specifies the maximum number of unique haplotypes containing somatic variation that can be modelled by the somatic genotype model. If there are more true somatic haplotypes present in the input data than this value then the model will not accurately fit the data and true somatic variants may not be called. However, larger values substantially increase the computational complexity of the model and potentially increase the false positive rate.
--somatic-snv-prior#
Option --somatic-snv-prior specifies the somatic SNV mutation prior probability for the samples under consideration.
--somatic-indel-prior#
Option --somatic-indel-prior specifies the somatic INDEL mutation prior probability for the samples under consideration.
--min-expected-somatic-frequency#
Option --min-expected-somatic-frequency specifies the minimum expected Variant Allele Frequency (VAF) for somatic mutations in the samples. This value is used by the cancer calling model as the lower bound for the VAF posterior marginalisation used to compute the posterior probability that a candidate variant is a somatic mutation. Decreasing this value increases sensitivity for somatic mutations with smaller VAFs, but also increases sensitivity to noise.
--min-credible-somatic-frequency#
Option --min-credible-somatic-frequency specifies the minimum credible Variant Allele Frequency (VAF) for somatic mutations in the samples. This value is used for candidate variant discovery, and also by the cancer calling model as the lower bound on candidate somatic mutation VAF credible regions (i.e. if the credible region computed for a mutation contains VAFs less than this value then the mutation will not be called as somatic). Decreasing this value increases sensitivity for somatic mutations with lower VAFs, but also increases sensitivity to noise and increases computational complexity as more candidate variants will be generated.
Notes
- If the value provided to this option is greater than the value specified by --min-expected-somatic-frequency, then this value is used for both options.
--tumour-germline-concentration#
Option --tumour-germline-concentration sets the Dirichlet concentration parameter for the germline haplotypes of tumour samples. Larger values concentrate more prior probability mass on equal frequencies of germline haplotypes and also increase prior mass on low VAFs for somatic haplotypes. This can help the model correctly classify somatic variation when the normal sample is not informative (or not present), but also reduces sensitivity to somatic variation with larger VAFs.
--somatic-credible-mass#
Option --somatic-credible-mass specifies the probability mass used to compute the credible interval for determining whether to call a variant as somatic (see also --min-credible-somatic-frequency). Larger values result in wider credible regions, increasing sensitivity for lower VAF somatic mutations, but also increase sensitivity to noise.
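The relationship between credible mass and interval width can be sketched with a generic Bayesian example. The code below assumes a Beta posterior over VAF and an equal-tailed interval computed by simple numerical integration; both are assumptions made for illustration, and the posterior and interval type that Octopus actually uses may differ. The read counts are hypothetical.

```python
def beta_credible_interval(alpha: float, beta: float, mass: float,
                           grid_size: int = 10_000) -> tuple[float, float]:
    """Equal-tailed credible interval of Beta(alpha, beta) containing `mass`
    probability, found by midpoint-rule integration on a uniform grid."""
    xs = [(i + 0.5) / grid_size for i in range(grid_size)]
    # Unnormalised Beta density at each grid midpoint.
    weights = [x ** (alpha - 1) * (1 - x) ** (beta - 1) for x in xs]
    total = sum(weights)
    tail = (1 - mass) / 2
    lower, upper = None, xs[-1]
    cumulative = 0.0
    for x, w in zip(xs, weights):
        cumulative += w / total
        if lower is None and cumulative >= tail:
            lower = x
        if cumulative >= 1 - tail:
            upper = x
            break
    return lower, upper


# Hypothetical posterior: 6 variant reads out of 100 with a uniform prior.
alpha, beta = 1 + 6, 1 + 94
for mass in (0.9, 0.99):
    print(mass, beta_credible_interval(alpha, beta, mass))
```

The interval for the larger mass encloses the interval for the smaller one, so its lower bound reaches lower VAFs.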
--min-somatic-posterior#
Option --min-somatic-posterior specifies the minimum posterior probability (Phred scale) required to call a candidate variant as a somatic mutation.
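Several thresholds in this reference are given on the Phred scale. Under the standard Phred definition, a score Q corresponds to an error probability of 10^(-Q/10), so a minimum posterior of Q requires a posterior probability of at least 1 - 10^(-Q/10). The helper names and the scores in the loop are illustrative and are not Octopus defaults.

```python
import math


def phred_to_probability(phred: float) -> float:
    """Error probability for a Phred-scaled score: 10^(-Q/10)."""
    return 10 ** (-phred / 10)


def probability_to_phred(error_probability: float) -> float:
    """Phred-scaled score for an error probability: -10 * log10(p)."""
    return -10 * math.log10(error_probability)


for q in (2, 10, 20, 30):
    # Posterior probability implied by a Phred-scaled threshold of q.
    print(q, 1 - phred_to_probability(q))
```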
--normal-contamination-risk#
Option --normal-contamination-risk indicates the risk that the normal sample contains contamination from the tumour samples. There are two possible values:
- LOW: The algorithm will not consider normal contamination when generating candidate genotypes.
- HIGH: The algorithm will consider normal contamination when generating candidate genotypes.
--somatics-only#
Command --somatics-only indicates that only variant sites called as somatic mutations (i.e. tagged SOMATIC) should appear in the final output.
Warning Using this command will produce a VCF file that cannot be re-filtered using Octopus.
Trio variant calling options#
--maternal-sample#
Option --maternal-sample (short -M) indicates which of the input samples is the mother of the proband. If this option is specified then --paternal-sample must also be specified.
Notes
- Specifying this option will automatically activate the trio calling model.
--paternal-sample#
Option --paternal-sample (short -F) indicates which of the input samples is the father of the proband. If this option is specified then --maternal-sample must also be specified.
Notes
- Specifying this option will automatically activate the trio calling model.
--denovo-snv-prior#
Option --denovo-snv-prior specifies the de novo SNV mutation prior probability for the samples under consideration.
--denovo-indel-prior#
Option --denovo-indel-prior specifies the de novo INDEL mutation prior probability for the samples under consideration.
--min-denovo-posterior#
Option --min-denovo-posterior specifies the minimum posterior probability (Phred scale) required to call a candidate variant as a de novo mutation.
--denovos-only#
Command --denovos-only indicates that only variant sites called as de novo mutations (i.e. tagged DENOVO) should appear in the final output.
Warning Using this command will produce a VCF file that cannot be re-filtered using Octopus.
Polyclone variant calling options#
--max-clones#
Option --max-clones specifies the maximum number of clones that can be modelled by the polyclone calling model. If there are more clones than this present in the input data then the model will not fit the data well and some true variation may not be called. However, larger values increase the computational complexity of the model and also increase sensitivity to noise.
--min-clone-frequency#
Option --min-clone-frequency specifies the lower-bound Variant Allele Frequency (VAF) to use when computing the posterior probability for a variant. Smaller values increase sensitivity and the false positive rate.
--clone-prior#
Option --clone-prior sets the prior probability for each new clone proposed by the model.
--clone-concentration#
Option --clone-concentration sets the concentration parameter for the symmetric Dirichlet distribution used to model clone frequencies.
Cell variant calling options#
--max-copy-loss#
Option --max-copy-loss specifies the maximum number of germline haplotype losses that can be considered by the model.
--max-copy-gains#
Option --max-copy-gains specifies the maximum number of haplotype gains that can be considered by the model.
--somatic-cnv-prior#
Option --somatic-cnv-prior sets the prior probability of a locus having a copy-number change.
--dropout-concentration#
Option --dropout-concentration sets the default Dirichlet concentration prior on haplotype frequencies. The higher the concentration parameter, the more probability mass around the centre of the distribution (0.5 for diploid). Lowering the concentration parameter means there’s more mass on extreme frequencies, which in turn means the model is less sensitive to dropout, but also less sensitive to real somatic variation. Setting to unity would put a uniform prior on frequencies.
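The effect of the concentration parameter can be illustrated for the diploid case, where a symmetric Dirichlet over two haplotype frequencies reduces to a Beta(a, a) distribution. The sketch below assumes the option value plays the role of the per-haplotype parameter a, which is consistent with unity giving a uniform prior; the function name and the values tried are illustrative.

```python
from math import exp, lgamma, log


def symmetric_beta_pdf(x: float, concentration: float) -> float:
    """Density of Beta(a, a) at x, for 0 < x < 1."""
    a = concentration
    # log of the normalising constant Gamma(2a) / Gamma(a)^2
    log_norm = lgamma(2 * a) - 2 * lgamma(a)
    return exp(log_norm + (a - 1) * (log(x) + log(1 - x)))


# Compare the prior density at a balanced frequency (0.5)
# and at an extreme frequency (0.05).
for a in (0.5, 1.0, 5.0, 50.0):
    print(a, symmetric_beta_pdf(0.5, a), symmetric_beta_pdf(0.05, a))
```

Values above one pile density up around 0.5, values below one favour extreme frequencies, and a value of one gives the same density everywhere.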
--sample-dropout-concentration#
Option --sample-dropout-concentration sets the Dirichlet concentration prior on haplotype frequencies for a specific sample, otherwise --dropout-concentration is used.
--phylogeny-concentration#
Option --phylogeny-concentration sets the symmetric Dirichlet concentration prior on group mixture proportions in each phylogeny. A larger concentration parameter implies a more even distribution of samples across the tree.
Variant filtering options#
--disable-call-filtering#
Command --disable-call-filtering disables variant call filtering.
--filter-expression#
Option --filter-expression sets the threshold filter expression to use for filtering variants not tagged with SOMATIC or DENOVO.
--somatic-filter-expression#
Option --somatic-filter-expression sets the threshold filter expression to use for filtering SOMATIC variants.
--denovo-filter-expression#
Option --denovo-filter-expression sets the threshold filter expression to use for filtering DENOVO variants.
--refcall-filter-expression#
Option --refcall-filter-expression sets the threshold filter expression to use for filtering sites called homozygous reference.
--use-preprocessed-reads-for-filtering#
Command --use-preprocessed-reads-for-filtering forces use of the same read pre-processing steps used for calling variants for filtering variants; otherwise all well-formed reads are used for filtering.
--keep-unfiltered-calls#
Command --keep-unfiltered-calls requests that Octopus keep a copy of the VCF file produced before filtering is applied. The copy has unfiltered appended to the final output name.
--annotations#
Option --annotations requests that the values of a sub-set of the measures used for filtering are reported in the final VCF output.
--filter-vcf#
Option --filter-vcf specifies an Octopus VCF file to filter. No calling is performed.
--forest-model#
Option --forest-model enables random forest variant filtering. The argument to the option is a ranger forest file.
--somatic-forest-model#
Option --somatic-forest-model enables random forest variant filtering for somatic variants. The argument to the option is a ranger forest file.
--min-forest-quality#
Option --min-forest-quality specifies the random forest minimum quality score (Phred scale) required to PASS a variant call (RFGQ_ALL), and each sample's genotype calls (RFGQ).
--use-germline-forest-for-somatic-normals#
Command --use-germline-forest-for-somatic-normals specifies that the forest model given to --forest-model should be used to score normal sample genotypes in somatic records, rather than the forest model given to --somatic-forest-model.