Core Classes ¶

DnaOptimizationProblem ¶

class dnachisel.DnaOptimizationProblem(sequence, constraints=None, objectives=None, logger='bar', mutation_space=None)[source]¶

Problem specifications: sequence, constraints, optimization objectives.

The original constraints, objectives, and original sequence of the problem are stored in the DNA problem. This class also has methods to display reports on the constraints and objectives, as well as solving the constraints and objectives.

Examples

>>> from dnachisel import *
>>> problem = DnaOptimizationProblem(
>>>     sequence = "ATGCGTGTGTGC...",
>>>     constraints = [constraint1, constraint2, ...],
>>>     objectives = [objective1, objective2, ...]
>>> )
>>> problem.resolve_constraints()
>>> problem.optimize()
>>> print(problem.constraints_text_summary())
>>> print(problem.objectives_text_summary())

Parameters:

sequence – A string of ATGC characters (they must be upper case!), e.g. “ATTGTGTA”
constraints – A list of objects of type Specification.
objectives – A list of objects of type Specification specifying what must be optimized in the problem. Note that each objective has a float boost parameter. The larger the boost, the more the objective is taken into account during the optimization.
logger – Either None for no logger, ‘bar’ for a tqdm progress bar logger, or any ProgLog progress bar logger.
mutations_space – A MutationSpace indicating the possible mutations. In most case the mutation space will be left to None and computed at problem initialization (which can be slightly compute-intensive), however some core DNA Chisel methods will create optimization problems with a provided mutation_space to save computing time.

randomization_threshold¶: The algorithm will use an exhaustive search when the size of the mutation space (=the number of possible variants) is above this threshold, and a (guided) random search when it is above.

max_random_iters¶: When using a random search, stop after this many iterations

mutations_per_iteration¶: When using a random search, produce this many sequence mutations each iteration.

optimization_stagnation_tolerance¶: When using a random search, stop if the score hasn’t improved in the last “this many” iterations

local_extensions¶: Try local resolution several times if it fails, increasing the mutable zone by [N1, N2…] nucleotides on each side, until resolution works. (by default, an extension of 0bp is tried, then 5bp.

Notes

The dictionary self.possible_mutations is of the form {location1 : list1, location2: list2...} where location is either a single index (e.g. 10) indicating the position of a nucleotide to be muted, or a couple (start, end) indicating a whole segment whose sub-sequence should be replaced. The list s are lists of possible sequences to replace each location, e.g. for the mutation of a whole codon (3,6): ["ATT", "ACT", "AGT"].

initialize()[source]¶: Precompute specification sets, evaluations, and mutation space.

number_of_edits()[source]¶: Return the number of nucleotide differences between the original and current sequence.

optimize_with_report(target, project_name='My project', file_path=None, file_content=None)[source]¶

Resolve constraints, optimize objectives, write a multi-file report.

The report’s content may vary depending on the optimization’s success.

Parameters:

target – Either a path to a folder that will contain the report, or a path to a zip archive, or “@memory” to return raw data of a zip archive containing the report.
project_name – Project name to write on PDF reports

Returns:

Triplet where success is True/False, message is a one-line string summary indication whether some clash was found, or some solution, or maybe no solution was found because the random searches were too short

Return type:

(success, message, zip_data)

sequence_edits_as_array()[source]¶: Return an array [False, False, True…] where True indicates an edit (i.e. a change at this position between the original problem sequence and the current one).

CircularDnaOptimizationProblem ¶

class dnachisel.CircularDnaOptimizationProblem(sequence, constraints=None, objectives=None, logger='bar', mutation_space=None)[source]¶

Class for solving circular DNA optimization problems.

The high-level interface is the same as for DnaOptimizationProblem: initialization, resolve_constraints(), and optimize() work the same way.

all_constraints_pass(autopass=True)[source]¶: Return whether the current problem sequence passes all constraints.

constraints_evaluations(autopass=True)[source]¶

Return the evaluation of constraints.

The “autopass_constraints” enables to just assume that constraints enforced by the mutation space are verified.

objectives_evaluations()[source]¶: Return a list of the evaluation of each objective of the problem

optimize()[source]¶: Maximize the total score by optimizing each objective in turn.

optimize_with_report(target, project_name='My project', file_path=None, file_content=None)[source]¶

Resolve constraints, optimize objectives, write a multi-file report.

WARNING: in case of optimization failure, the report generated will show a “pseudo-circular” sequence formed by concatenating the sequence with itself three times.

TODO: fix the warning above, at some point?

The report’s content may vary depending on the optimization’s success.

Parameters:

target – Either a path to a folder that will contain the report, or a path to a zip archive, or “@memory” to return raw data of a zip archive containing the report.
project_name – Project name to write on PDF reports

Returns:

Return type:

(success, message, zip_data)

resolve_constraints(final_check=True)[source]¶

Solve a particular constraint using local, targeted searches.

Parameters:

constraint – The Specification object for which the sequence should be solved
final_check – If True, a final check of that all constraints pass will be run at the end of the process, when constraints have been resolved one by one, to check that the solving of one constraint didn’t undo the solving of another.
cst_filter – An optional filter to only resolve a subset function (constraint => True/False)

Specification ¶

class dnachisel.Specification(evaluate=None, boost=1.0)[source]¶

General class to define specifications to optimize.

Note that all specifications have a boost attribute that is a multiplicator that will be used when computing the global specification score of a problem with problem.all_objectives_score().

New types of specifications are defined by subclassing Specification and providing a custom evaluate and localized methods.

Parameters:

evaluate – function (sequence) => SpecEvaluation
boost – Relative importance of the Specification’s score in a multi-specification problem.

best_possible_score¶: Best score that the specification can achieve. Used by the optimization algorithm to understand when no more optimization is required.

optimize_passively(boolean)¶: Indicates that there should not be a pass of the optimization algorithm to optimize this objective. Instead, this objective is simply taken into account when optimizing other objectives.

enforced_by_nucleotide_restrictions(boolean)¶: When the spec is used as a constraints, this indicates that the constraint will initially restrict the mutation space in a way that ensures that the constraint will always be valid. The constraint does not need to be evaluated again, which speeds up the resolution algorithm.

priority¶: Value used to sort the specifications and solve/optimize them in order, with highest priority first.

shorthand_name¶: Shorter name for the specification class that will be recognized when parsing annotations from genbanks.

as_passive_objective()[source]¶

Return a copy with optimize_passively set to true.

“Optimize passively” means that when the specification is used as an objective, the solver will not do a specific pass to optimize this specification, however this specification’s score will be taken into account in the global score when optimizing other objectives, and may therefore influence the final sequence.

breach_label()[source]¶

Shorter, less precise label to be used in tables, reports, etc.

This is meant for specifications such as EnforceGCContent(0.4, 0.6) to be represented as ‘40-60% GC’ in reports tables etc..

copy_with_changes(**kwargs)[source]¶

Return a copy of the Specification with modified properties.

For instance new_spec = spec.copy_with_changes(boost=10).

initialized_on_problem(problem, role='constraint')[source]¶

Complete specification initialization when the sequence gets known.

Some specifications like to know what their role is and on which sequence they are employed before they complete some values.

label(role=None, with_location=True, assignment_symbol=':', use_short_form=False, use_breach_form=False)[source]¶

Return a string label for this specification.

Parameters:

role – Either ‘constraint’ or ‘objective’ (for prefixing the label with @ or ~), or None.
with_location – If true, the location will appear in the label.
assignment_symbol – Indicates whether to use “:” or “=” or anything else when indicating parameters values.
use_short_form – If True, the label will use self.short_label(), so for instance AvoidPattern(BsmBI_site) will become “no BsmBI”. How this is handled is dependent on the specification.

label_parameters()[source]¶

In subclasses, returns a list of the creation parameters.

For instance [(‘pattern’, ‘ATT’), (‘occurences’, 2)]

localized(location, problem=None)[source]¶

Return a modified version of the specification for the case where sequence modifications are only performed inside the provided location.

For instance if an specification concerns local GC content, and we are only making local mutations to destroy a restriction site, then we only need to check the local GC content around the restriction site after each mutation (and not compute it for the whole sequence), so EnforceGCContent.localized(location) will return an specification that only looks for GC content around the provided location.

If an specification concerns a DNA segment that is completely disjoint from the provided location, this must return None.

Must return an object of class Constraint.

restrict_nucleotides(sequence, location=None)[source]¶

Restrict the mutation space to speed up optimization.

This method only kicks in when this specification is used as a constraint. By default it does nothing, but subclasses such as EnforceTranslation, AvoidChanges, EnforceSequence, etc. have custom methods.

In the code, this method is run during the initialize() step of DNAOptimizationProblem, when the MutationSpace is created for each constraint

shifted(shift)[source]¶

Shift the location of the specification.

Some specification classes may have a special method to do side effects when shifting the location.

Location shifting is used in particular when solving circular DNA optimization problems.

short_label()[source]¶

Shorter, less precise label to be used in tables, reports, etc.

This is meant for specifications such as EnforceGCContent(0.4, 0.6) to be represented as ‘40-60% GC’ in reports tables etc..

SpecEvaluation ¶

Store relevant infos about the evaluation of an objective on a problem.

Examples

>>> evaluation_result = ConstraintEvaluation(
>>>     specification=specification,
>>>     problem = problem,
>>>     score= evaluation_score, # float
>>>     locations=[Locations list...],
>>>     message = "Score: 42 (found 42 sites)"
>>> )

param objective:: The Objective that was evaluated.
param problem:: The problem that the objective was evaluated on.
param score:: The score associated to the evaluation.
param locations:: A list of couples (start, end) indicating the locations on which the the optimization shoul be localized to improve the objective.
param message:: A message that will be returned by str(evaluation). It will notably be displayed by problem.print_objectives_summaries.

Location ¶

Implements the Location class.

The class has useful methods for finding overlaps between locations, extract a subsequence from a sequence, etc.

class dnachisel.Location.Location(start, end, strand=0)[source]¶

Represent a segment of a sequence, with a start, end, and strand.

Warning: we use Python’s splicing notation, so Location(5, 10) represents sequence[5, 6, 7, 8, 9] which corresponds to nucleotides number 6, 7, 8, 9, 10.

The data structure is similar to a Biopython’s FeatureLocation, but with different methods for different purposes.

Parameters:

start – Lowest position index of the segment.
end – Highest position index of the segment.
strand – Either 1 or -1 for sense or anti-sense orientation.

extended(extension_length, lower_limit=0, upper_limit=None, left=True, right=True)[source]¶: Extend the location of a few basepairs on each side.

extract_sequence(sequence)[source]¶: Return the subsequence read at the given location.

static from_biopython_location(location)[source]¶: Return a DnaChisel Location from a Biopython location.

static from_data(location_data)[source]¶

Return a location, from varied data formats.

This method is used in particular in every built-in specification to quickly standardize the input location.

location_data can be a tuple (start, end) or (start, end, strand), or a Biopython FeatureLocation, or a Location instance. In any case, a new Location object will be returned.

static merge_overlapping_locations(locations)[source]¶: Return a list of locations obtained by merging all overlapping.

overlap_region(other_location)[source]¶: Return the overlap span between two locations (None if None).

to_biopython_feature(feature_type='misc_feature', **qualifiers)[source]¶: Return a Biopython SeqFeature with same location and custom qualifiers.

to_biopython_location()[source]¶: Return a Biopython FeatureLocation equivalent to the location.

to_tuple()[source]¶: Return (start, end, strand).

MutationSpace ¶

class dnachisel.MutationSpace.MutationChoice(segment, variants, is_any_nucleotide=False)[source]¶

Represent a segment of a sequence with several possible variants.

Parameters:

segment – A pair (start, end) indicating the range of nucleotides concerned. We are applying Python range, so
variants – A set of sequence variants, at the given position

Examples

>>> choice = MutationChoice((70, 73), {})

extract_varying_region()[source]¶

Return MutationChoices for the central varying region and 2 flanks.

For instance:

>>> choice = MutationChoice((5, 12), [
>>>     'ATGCGTG',
>>>     'AAAAATG',
>>>     'AAATGTG',
>>>     'ATGAATG',
>>> ])
>>> choice.extract_varying_region()

Result :

>>> [
>>>     MutChoice(5-6 A),
>>>     MutChoice(6-10 TGCG-AATG-TGAA-AAAA),
>>>     MutChoice(10-12 TG)
>>> ]

merge_with(others)[source]¶

Merge this mutation choice with others to form a single choice.

This method returns the only choices on the full interval which are compatible with at least one choice in each of the MutationChoices.

Examples:¶

Consider the following:

>>> ((2, 5), {'ATT', 'ATA'})

To be merged with:

>>> [
>>>     ((0, 3), {'GTA', 'GCT', 'GTT'}),
>>>     ((3, 4), {'A', 'T', 'G', 'C'}),
>>>     ((4, 7), {'ATG', 'ACC', 'CTG'})
>>> ]

The result will be:

>>> (0, 7), {'GTATACC', 'GTATATG'}

random_variant(sequence)[source]¶: Return one of the variants, randomly.

class dnachisel.MutationSpace.MutationSpace(choices_index, left_padding=0)[source]¶

Class for mutation space (set of sequence segments and their variants).

Parameters:: choices_index – A list L such that L[i] gives the MutationChoice governing the mutations allowed at position i (and possibly around i).

Examples

>>> # BEWARE: below, similar mutation choices are actually the SAME OBJECT
>>> space = MutationSpace([
        MutationChoice((0, 2), {'AT', 'TG'}),
        MutationChoice((0, 2), {'AT', 'TG'}),
        MutationChoice((2, 5), {'TTC', 'TTA', 'TTT'}), # same
        MutationChoice((2, 5), {'TTC', 'TTA', 'TTT'}), #
        MutationChoice((2, 5), {'TTC', 'TTA', 'TTT'}),
    ])

all_variants(sequence)[source]¶: Iterate through all sequence variants in this mutation space.

apply_random_mutations(n_mutations, sequence)[source]¶: Return a sequence with n random mutations applied.

property choices_span¶: Return (start, end), segment where multiple choices are possible.

constrain_sequence(sequence)[source]¶

Return a version of the sequence compatible with the mutation space.

All nucleotides of the sequence that are incompatible with the mutation space are replaced by nucleotides compatible with the space.

static from_optimization_problem(problem, new_constraints=None)[source]¶

Create a mutation space from a DNA optimization problem.

This can be used to initialize mutation spaces for new problems.

localized(location)[source]¶: Return a new version with only mutations overlapping the location.

pick_random_mutations(n_mutations, sequence)[source]¶: Draw N random mutations.

property space_size¶: Return the number of possible mutations.

string_representation(separator='|')[source]¶

Generates a string of the mutation space.

Examples

>>> print (mutation_space.string_representation())
>>> A|G|ATG|CA|A|CA|T|GTC|AC|T|A|C|T|T|T
>>> T|C|   |  |G|  |C|   |  |C|T|A|C|A|C
>>> G|A|   |  | |  | |   |  |G|G|G|G|G|G

SequencePattern ¶

Specifications AvoidPattern, EnforcePatternOccurence accept a Pattern as argument. The pattern can be provided either as a string or as a class:

Name	Pattern Class	As string
Restriction site	`EnzymeSitePattern('BsmBI')`	`'BsmBI_site'`
Homopolymer	`HomopolymerPattern("A", 6)`	`'6xA'`
Direct repeats	`RepeatedKmerPattern(3, 2)`	`'3x2mer'`
DNA notation	`DnaNotationPattern('ANNKA')`	`'ANNKA'`
Regular Expression	`SequencePattern('A[CG]*A')`	`'A[CG]*A'`

SequencePattern class ¶

class dnachisel.SequencePattern(expression, size=None, name=None, lookahead='loop', is_palyndromic=False)[source]¶

Pattern that will be looked for in a DNA sequence.

Use this class for matching regular expression patterns, and DnaNotationPattern for matching explicit sequences or sequences using Ns etc.

Examples

>>> expression = "A[ATGC]{3,}"
>>> pattern = SequencePattern(expression)
>>> constraint = AvoidPattern(pattern)

Parameters:

expression – Any string or regular expression (regex) for matching ATGC nucleotides. Note that multi-nucleotide symbols such as “N” (for A-T-G-C), or “K” are not supported by this class, see DnaNotationPattern instead.
size – Size of the pattern, in number of characters. The size is used to determine the size of windows when performing local optimization and constraint solving. It can be important to provide the size when the pattern string provided represents a complex regular expression whose maximal matching size cannot be easily evaluated. For example, if a regex is used to actively remove sites, then a size should be provided to inform DNA Chisel during optimization.
name – Name of the pattern (will be displayed e.g. when the pattern is printed).

find_matches(sequence, location=None, forced_strand=None)[source]¶

Return the locations where the sequence matches the expression.

Parameters:

sequence – A string of “ATGC…”
location – Location indicating a segment to which to restrict the search. Only patterns entirely included in the segment will be returned

Returns:

A list of the locations of matches, of the form [(start1, end1), (start2, end2),...].

Return type:

matches

DnaNotationPattern ¶

class dnachisel.DnaNotationPattern(sequence, name=None)[source]¶

Class for patterns in plain DNA notation: ATTGCCA, GCNNKTA, etc.

If the sequence is not palindromic, the pattern will be looked for in both strands of sequences.

all_variants()[source]¶: Return all ATGC sequence variants of a sequence

static dna_sequence_to_regexpr(sequence)[source]¶: Return a regular expression to find the pattern in a sequence.

MotifPssmPattern ¶

class dnachisel.MotifPssmPattern(pssm, threshold=None, relative_threshold=None)[source]¶

Motif pattern represented by a Position-Specific Scoring Matrix (PSSM).

This class is better instantiated by creating a pattern from a set of sequence strings with MotifPssmPattern.from_sequences(), or reading pattern(s) for a file with MotifPssmPattern.from_file().

Parameters:

pssm – A Bio.motifs.Motif object.
threshold – locations of the sequence with a PSSM score above this value will be considered matches. For convenience, a relative_threshold can be given instead.
relative_threshold – Value between 0 and 1 from which the threshold will be auto-computed. 0 means “match everything”, 1 means “only match the one (or several) sequence(s) with the absolute highest possible score”.

classmethod apply_pseudocounts(motif, pseudocounts)[source]¶

Add pseudocounts to the motif’s pssm matrix.

Add nothing if pseudocounts is None, auto-compute pseudocounts if pseudocounts=”jaspar”, else just attribute the pseudocounts as-is to the motif.

find_matches_in_string(sequence)[source]¶: Find all positions where the PSSM score is above threshold.

classmethod from_sequences(sequences, name='unnamed', pseudocounts='jaspar', threshold=None, relative_threshold=None)[source]¶

Return a PSSM pattern computed from same-length sequences.

Parameters:

sequences – A list of same-length sequences.
name – Name to give to the pattern (will appear in reports etc.).
pseudocounts – Either a dict {“A”: 0.01, “T”: …} or “jaspar” for automatic pseudocounts from the Biopython.motifs.jaspar module (recommended), or None for no pseudocounts at all (not recommended!).
threshold – locations of the sequence with a PSSM score above this value will be considered matches. For convenience, a relative_threshold can be given instead.
relative_threshold – Value between 0 and 1 from which the threshold will be auto-computed. 0 means “match everything”, 1 means “only match the one (or several) sequence(s) with the absolute highest possible score”.

classmethod list_from_file(motifs_file, file_format, threshold=None, pseudocounts='jaspar', relative_threshold=None)[source]¶

Return a list of PSSM patterns from a file in JASPAR, MEME, etc.

Parameters:

motifs_file – Path to a motifs file, or file handle.
file_format – File format. one of “jaspar”, “meme”, “TRANSFAC”.
pseudocounts – Either a dict {“A”: 0.01, “T”: …} or “jaspar” for automatic pseudocounts from the Biopython.motifs.jaspar module (recommended), or None for no pseudocounts at all (not recommended!).
threshold – locations of the sequence with a PSSM score above this value will be considered matches. For convenience, a relative_threshold can be given instead.
relative_threshold – Value between 0 and 1 from which the threshold will be auto-computed. 0 means “match everything”, 1 means “only match the one (or several) sequence(s) with the absolute highest possible score”.

EnzymeSitePattern ¶

class dnachisel.EnzymeSitePattern(enzyme_name)[source]¶

Class to represent Enzyme site patterns

Examples

>>> enzyme_pattern = EnzymeSitePattern("BsaI")
>>> constraint = AvoidPattern(enzyme_pattern)

static from_string(string)[source]¶: Convert BsmBI_site to EnzymeSitePattern(BsmBI)

RepeatedKmerPattern ¶

class dnachisel.RepeatedKmerPattern(n_repeats, kmer_size)[source]¶

Direct repeats like ATT-ATT, ATGC-ATGC-ATGC, etc.

Shorthand string version: “3x4mer”, “5x2mer”, etc.

Examples

>>> RepeatedKmerPattern(3, 2) # dimers repeated 3 times

HomopolymerPattern ¶

class dnachisel.HomopolymerPattern(nucleotide, number)[source]¶

Homopolymer of the form AAAAAAA, TTTTT, etc.

Shorthand string version: “7xA”, “9xC”, etc.

Examples

>>> pattern = HomopolymerPattern("A", 6)
>>> constraint = AvoidPattern(pattern)