Genbank API

This section documents the Genbank API for all builtin specifications of DNA chisel. See also the Genbank usage notes section for tips.

Caution

Use either “=” or “:” in sequence annotation parameters

For historical reasons, DNA Chisel allows to write keyword parameters with either : or = in genbank annotations, for instance both @no(ATG, strand:both) and @no(ATG, strand=both) are valid.

While we find = more readable, we have also seen that some tools are confused by the presence of an = sign in a Genbank label, so using : may be safer.

Pattern removal

Caution

The annotation’s direction matters!

Since v3.0, the orientation of the annotation of pattern-based specifications such as no() or insert() indicates where to look for the pattern. Annotations in the +1 sense indicate that the pattern should only be looked for in the +1 (5’3’) strand. Patterns in the -1 strand are only looked at in the -1 (3’5’) sense.

Caution

Use strand=both in Genbank files to target both strands.

Standard Genbank does not allow you to create strands without direction (unlike the Biopython Record format, or the Snapgene format. So keep in mind that by default, only the feature’s strand is considered. In @no() and @insert(), you can use strand=both to indicate that the pattern should be avoided in both strands, for instance @no(BsaI_site, strand=both).

To remove a pattern in a given region, use the @no() label (alias for @AvoidPattern()). For instance to remove any GC pattern:

This method also supports IUPAC nucleotide mutation for degenerate sequences (NKY etc.) For instance to remove any ATC and ATT (which could be written together as ATH):

It is also possible to provide an enzyme restriction site by suffixing the enzyme name with _site:

Other pattern shorthands can be used for instance to find sequence repeats (here we look for dimers repeated four times in a row):

When no pattern shorthand cuts it, use a regular expression! The annotation below forbids any TAA sequence following an ATG sequence.

Finally, note that you can also use no() as an objective by prefixing it with a tilde ~no(), at which case the pattern may not be completely eliminated, but its number of occurences will be minimized by the algorithm:

Sequence protection

To prevent sections of the sequence to be modified, use @keep (alias for @AvoidChanges):

You can also use keep as an optimization objective, at which case sequence modifications will not be strictly forbidden, but they will be minimized:

Forcing changes

In DNA Chisel, the opposite of specification AvoidChanges is EnforceChanges (shorthand notation change). For instance, if you want the solver to change a region as much as possible (for instance, to obfuscate a sequence), use a ~change annotation:

You can also ask for a certain amount of change, with for instance ~change(40%) which will aim at changing 40% (no more, no less) of the nucleotides under the annotation. You can also use the specification as a constraint, for instance @change (the sequence will be totally changed) or @change(minimum=50%) which will ensures that the final sequence will be 50% different from the original.

Pattern insertion

You can control how many times a pattern should appear in a sequence region with the @insert() specification (short form of @EnforcePatternOccurence):

By default @insert() ensures that exactly one occurence of the pattern is present in the given region, but it can also be used to create more occurences:

This specification can be used both to create new patterns in a region that contains too few, or to decrease the pattern occurences i a region that contains too many. Note that with the current algorithm, new occurences of the pattern will be be preferentially placed towards the center of the selected region.

You can also enforce a sequence (or degenerate sequence) at an exact location with @sequence (short for @EnforceSequence):

To enforce several same-length but quite different sequences, use @choice (short for @EnforceChoice):

Indicating coding sequences

To indicate that a region is a CDS and the protein sequence should be conserved (i.e. only synonymous codon mutations are allowed), use @cds (short for @EnforceTranslation) on a region whose span is a multiple of 3:

Caution

Non-ATG start codons

If the coding region has a start codon in an organisms using non-ATG start codons (for instance, E. coli sometimes use GTG), make sure to define a policy for the start codon, for instance start_codon=keep to keep the sequence of the original start codon, or start_codon=ATG to use ATG as a start codon. See the documentation of EnforceTranslation for more details and options.

Codon Optimization

Caution

Always use with @cds

If the CodonOptimize specification is used without a @cds constraint covering the same region, then the protein sequence is not guaranteed!

To codon-optimize a gene you can use the ~CodonOptimize() specification but it is faster and clearer to refer directly to one of the different methods available.

To replace each codon by its most common synonym (which is equivalent to maximizing the CAI index):

To optimize the gene sequence so that the final codon representation matches the codons frequencies of the target organism, use ~match_codon_usage:

To harmonize codons so that a rare codon in the original host will be replaced by a rare codon in the target host (and common codons are replaced by common codons), use ~harmonize_rca:

In the example above (~harmonize_rca(e_coli -> h_sapiens)), the original host is e_coli and the target host is h_sapiens, separated by an arrow (->). See HarmonizeRCA for more details. As you noticed we used species names in these examples. See the Codon Usage Tables package webpage for species that can be referred to by name. This includes b_subtilis, c_elegans, d_melanogaster, e_coli, g_gallus, h_sapiens, m_musculus, s_cerevisiae. You can also use a TaxID to refer to a species, e.g. species=1423 at which case the codon frequencies will be downloaded from the Kazusa codon usage database (assuming it isn’t down!)

GC content

Use @gc to ensure that a given region’s GC content is between a certain range

For large regions, you can use a windowed evaluation, e.g. with the parameter window=100 to ensure that the GC content will remain in the desired range over every 100bp subsegments of the sequence.

The specification can also be used as an optimization objective, at which case it is preferable to provide a target rather than a range:

Removing homologies

To ensure that a sequence region has no homologies anywhere else in the sequence you can use all_unique_kmers:

The annotation above ensures that each 20-mer in the annotated segment does not appear anywhere else in the sequence or in its reverse-complement.

To ensure that a sequence region has no homologies within itself, use the “here” parameter:

You can also avoid matches with a given organism you can also use avoid_matches .. raw:: html

<img class=’annotation-example’ src=’../_static/images/genbank_annotations/avoid_matches.png’></img>

Caution

Only works if supported by the server

This specification, originally meant to be used via scripts, will only work if the server associates the organism name to a Bowtie index path.

Melting temperature

The annotation tm can be used to ensure that a given subsequence has the desired melting temperature. It can be used either as a constraint or as an optimization objective:

As a suggestion, you can use this specification with all_unique_kmers to ensure that the resulting region is primer-friendly:

Specifications not yet supported as Genbank annotations

  • AvoidHeterodimerization

  • EnforceRegionsCompatibility