Core Classes¶
DnaOptimizationProblem¶
- class dnachisel.DnaOptimizationProblem(sequence, constraints=None, objectives=None, logger='bar', mutation_space=None)[source]¶
Problem specifications: sequence, constraints, optimization objectives.
The original constraints, objectives, and original sequence of the problem are stored in the DNA problem. This class also has methods to display reports on the constraints and objectives, as well as solving the constraints and objectives.
Examples
>>> from dnachisel import * >>> problem = DnaOptimizationProblem( >>> sequence = "ATGCGTGTGTGC...", >>> constraints = [constraint1, constraint2, ...], >>> objectives = [objective1, objective2, ...] >>> ) >>> problem.resolve_constraints() >>> problem.optimize() >>> print(problem.constraints_text_summary()) >>> print(problem.objectives_text_summary())
- Parameters:
sequence – A string of ATGC characters (they must be upper case!), e.g. “ATTGTGTA”
constraints – A list of objects of type
Specification
.objectives – A list of objects of type
Specification
specifying what must be optimized in the problem. Note that each objective has a floatboost
parameter. The larger the boost, the more the objective is taken into account during the optimization.logger – Either None for no logger, ‘bar’ for a tqdm progress bar logger, or any ProgLog progress bar logger.
mutations_space – A MutationSpace indicating the possible mutations. In most case the mutation space will be left to None and computed at problem initialization (which can be slightly compute-intensive), however some core DNA Chisel methods will create optimization problems with a provided mutation_space to save computing time.
- randomization_threshold¶
The algorithm will use an exhaustive search when the size of the mutation space (=the number of possible variants) is above this threshold, and a (guided) random search when it is above.
- max_random_iters¶
When using a random search, stop after this many iterations
- mutations_per_iteration¶
When using a random search, produce this many sequence mutations each iteration.
- optimization_stagnation_tolerance¶
When using a random search, stop if the score hasn’t improved in the last “this many” iterations
- local_extensions¶
Try local resolution several times if it fails, increasing the mutable zone by [N1, N2…] nucleotides on each side, until resolution works. (by default, an extension of 0bp is tried, then 5bp.
Notes
The dictionary
self.possible_mutations
is of the form{location1 : list1, location2: list2...}
wherelocation
is either a single index (e.g. 10) indicating the position of a nucleotide to be muted, or a couple(start, end)
indicating a whole segment whose sub-sequence should be replaced. Thelist
s are lists of possible sequences to replace each location, e.g. for the mutation of a whole codon(3,6): ["ATT", "ACT", "AGT"]
.- number_of_edits()[source]¶
Return the number of nucleotide differences between the original and current sequence.
- optimize_with_report(target, project_name='My project', file_path=None, file_content=None)[source]¶
Resolve constraints, optimize objectives, write a multi-file report.
The report’s content may vary depending on the optimization’s success.
- Parameters:
target – Either a path to a folder that will contain the report, or a path to a zip archive, or “@memory” to return raw data of a zip archive containing the report.
project_name – Project name to write on PDF reports
- Returns:
Triplet where success is True/False, message is a one-line string summary indication whether some clash was found, or some solution, or maybe no solution was found because the random searches were too short
- Return type:
(success, message, zip_data)
CircularDnaOptimizationProblem¶
- class dnachisel.CircularDnaOptimizationProblem(sequence, constraints=None, objectives=None, logger='bar', mutation_space=None)[source]¶
Class for solving circular DNA optimization problems.
The high-level interface is the same as for DnaOptimizationProblem: initialization, resolve_constraints(), and optimize() work the same way.
- all_constraints_pass(autopass=True)[source]¶
Return whether the current problem sequence passes all constraints.
- constraints_evaluations(autopass=True)[source]¶
Return the evaluation of constraints.
The “autopass_constraints” enables to just assume that constraints enforced by the mutation space are verified.
- optimize_with_report(target, project_name='My project', file_path=None, file_content=None)[source]¶
Resolve constraints, optimize objectives, write a multi-file report.
WARNING: in case of optimization failure, the report generated will show a “pseudo-circular” sequence formed by concatenating the sequence with itself three times.
TODO: fix the warning above, at some point?
The report’s content may vary depending on the optimization’s success.
- Parameters:
target – Either a path to a folder that will contain the report, or a path to a zip archive, or “@memory” to return raw data of a zip archive containing the report.
project_name – Project name to write on PDF reports
- Returns:
Triplet where success is True/False, message is a one-line string summary indication whether some clash was found, or some solution, or maybe no solution was found because the random searches were too short
- Return type:
(success, message, zip_data)
- resolve_constraints(final_check=True)[source]¶
Solve a particular constraint using local, targeted searches.
- Parameters:
constraint – The
Specification
object for which the sequence should be solvedfinal_check – If True, a final check of that all constraints pass will be run at the end of the process, when constraints have been resolved one by one, to check that the solving of one constraint didn’t undo the solving of another.
cst_filter – An optional filter to only resolve a subset function (constraint => True/False)
Specification¶
- class dnachisel.Specification(evaluate=None, boost=1.0)[source]¶
General class to define specifications to optimize.
Note that all specifications have a
boost
attribute that is a multiplicator that will be used when computing the global specification score of a problem withproblem.all_objectives_score()
.New types of specifications are defined by subclassing
Specification
and providing a customevaluate
andlocalized
methods.- Parameters:
evaluate – function (sequence) => SpecEvaluation
boost – Relative importance of the Specification’s score in a multi-specification problem.
- best_possible_score¶
Best score that the specification can achieve. Used by the optimization algorithm to understand when no more optimization is required.
- optimize_passively(boolean)¶
Indicates that there should not be a pass of the optimization algorithm to optimize this objective. Instead, this objective is simply taken into account when optimizing other objectives.
- enforced_by_nucleotide_restrictions(boolean)¶
When the spec is used as a constraints, this indicates that the constraint will initially restrict the mutation space in a way that ensures that the constraint will always be valid. The constraint does not need to be evaluated again, which speeds up the resolution algorithm.
- priority¶
Value used to sort the specifications and solve/optimize them in order, with highest priority first.
- shorthand_name¶
Shorter name for the specification class that will be recognized when parsing annotations from genbanks.
- as_passive_objective()[source]¶
Return a copy with optimize_passively set to true.
“Optimize passively” means that when the specification is used as an objective, the solver will not do a specific pass to optimize this specification, however this specification’s score will be taken into account in the global score when optimizing other objectives, and may therefore influence the final sequence.
- breach_label()[source]¶
Shorter, less precise label to be used in tables, reports, etc.
This is meant for specifications such as EnforceGCContent(0.4, 0.6) to be represented as ‘40-60% GC’ in reports tables etc..
- copy_with_changes(**kwargs)[source]¶
Return a copy of the Specification with modified properties.
For instance
new_spec = spec.copy_with_changes(boost=10)
.
- initialized_on_problem(problem, role='constraint')[source]¶
Complete specification initialization when the sequence gets known.
Some specifications like to know what their role is and on which sequence they are employed before they complete some values.
- label(role=None, with_location=True, assignment_symbol=':', use_short_form=False, use_breach_form=False)[source]¶
Return a string label for this specification.
- Parameters:
role – Either ‘constraint’ or ‘objective’ (for prefixing the label with @ or ~), or None.
with_location – If true, the location will appear in the label.
assignment_symbol – Indicates whether to use “:” or “=” or anything else when indicating parameters values.
use_short_form – If True, the label will use self.short_label(), so for instance AvoidPattern(BsmBI_site) will become “no BsmBI”. How this is handled is dependent on the specification.
- label_parameters()[source]¶
In subclasses, returns a list of the creation parameters.
For instance [(‘pattern’, ‘ATT’), (‘occurences’, 2)]
- localized(location, problem=None)[source]¶
Return a modified version of the specification for the case where sequence modifications are only performed inside the provided location.
For instance if an specification concerns local GC content, and we are only making local mutations to destroy a restriction site, then we only need to check the local GC content around the restriction site after each mutation (and not compute it for the whole sequence), so
EnforceGCContent.localized(location)
will return an specification that only looks for GC content around the provided location.If an specification concerns a DNA segment that is completely disjoint from the provided location, this must return None.
Must return an object of class
Constraint
.
- restrict_nucleotides(sequence, location=None)[source]¶
Restrict the mutation space to speed up optimization.
This method only kicks in when this specification is used as a constraint. By default it does nothing, but subclasses such as EnforceTranslation, AvoidChanges, EnforceSequence, etc. have custom methods.
In the code, this method is run during the initialize() step of DNAOptimizationProblem, when the MutationSpace is created for each constraint
SpecEvaluation¶
Store relevant infos about the evaluation of an objective on a problem.
Examples
>>> evaluation_result = ConstraintEvaluation(
>>> specification=specification,
>>> problem = problem,
>>> score= evaluation_score, # float
>>> locations=[Locations list...],
>>> message = "Score: 42 (found 42 sites)"
>>> )
- param objective:
The Objective that was evaluated.
- param problem:
The problem that the objective was evaluated on.
- param score:
The score associated to the evaluation.
- param locations:
A list of couples (start, end) indicating the locations on which the the optimization shoul be localized to improve the objective.
- param message:
A message that will be returned by
str(evaluation)
. It will notably be displayed byproblem.print_objectives_summaries
.
Location¶
Implements the Location class.
The class has useful methods for finding overlaps between locations, extract a subsequence from a sequence, etc.
- class dnachisel.Location.Location(start, end, strand=0)[source]¶
Represent a segment of a sequence, with a start, end, and strand.
Warning: we use Python’s splicing notation, so Location(5, 10) represents sequence[5, 6, 7, 8, 9] which corresponds to nucleotides number 6, 7, 8, 9, 10.
The data structure is similar to a Biopython’s FeatureLocation, but with different methods for different purposes.
- Parameters:
start – Lowest position index of the segment.
end – Highest position index of the segment.
strand – Either 1 or -1 for sense or anti-sense orientation.
- extended(extension_length, lower_limit=0, upper_limit=None, left=True, right=True)[source]¶
Extend the location of a few basepairs on each side.
- static from_biopython_location(location)[source]¶
Return a DnaChisel Location from a Biopython location.
- static from_data(location_data)[source]¶
Return a location, from varied data formats.
This method is used in particular in every built-in specification to quickly standardize the input location.
location_data
can be a tuple (start, end) or (start, end, strand), or a Biopython FeatureLocation, or a Location instance. In any case, a new Location object will be returned.
- static merge_overlapping_locations(locations)[source]¶
Return a list of locations obtained by merging all overlapping.
- overlap_region(other_location)[source]¶
Return the overlap span between two locations (None if None).
MutationSpace¶
- class dnachisel.MutationSpace.MutationChoice(segment, variants, is_any_nucleotide=False)[source]¶
Represent a segment of a sequence with several possible variants.
- Parameters:
segment – A pair (start, end) indicating the range of nucleotides concerned. We are applying Python range, so
variants – A set of sequence variants, at the given position
Examples
>>> choice = MutationChoice((70, 73), {})
- extract_varying_region()[source]¶
Return MutationChoices for the central varying region and 2 flanks.
For instance:
>>> choice = MutationChoice((5, 12), [ >>> 'ATGCGTG', >>> 'AAAAATG', >>> 'AAATGTG', >>> 'ATGAATG', >>> ]) >>> choice.extract_varying_region()
Result :
>>> [ >>> MutChoice(5-6 A), >>> MutChoice(6-10 TGCG-AATG-TGAA-AAAA), >>> MutChoice(10-12 TG) >>> ]
- merge_with(others)[source]¶
Merge this mutation choice with others to form a single choice.
This method returns the only choices on the full interval which are compatible with at least one choice in each of the MutationChoices.
Examples:¶
Consider the following:
>>> ((2, 5), {'ATT', 'ATA'})
To be merged with:
>>> [ >>> ((0, 3), {'GTA', 'GCT', 'GTT'}), >>> ((3, 4), {'A', 'T', 'G', 'C'}), >>> ((4, 7), {'ATG', 'ACC', 'CTG'}) >>> ]
The result will be:
>>> (0, 7), {'GTATACC', 'GTATATG'}
- class dnachisel.MutationSpace.MutationSpace(choices_index, left_padding=0)[source]¶
Class for mutation space (set of sequence segments and their variants).
- Parameters:
choices_index – A list L such that L[i] gives the MutationChoice governing the mutations allowed at position i (and possibly around i).
Examples
>>> # BEWARE: below, similar mutation choices are actually the SAME OBJECT >>> space = MutationSpace([ MutationChoice((0, 2), {'AT', 'TG'}), MutationChoice((0, 2), {'AT', 'TG'}), MutationChoice((2, 5), {'TTC', 'TTA', 'TTT'}), # same MutationChoice((2, 5), {'TTC', 'TTA', 'TTT'}), # MutationChoice((2, 5), {'TTC', 'TTA', 'TTT'}), ])
- apply_random_mutations(n_mutations, sequence)[source]¶
Return a sequence with n random mutations applied.
- property choices_span¶
Return (start, end), segment where multiple choices are possible.
- constrain_sequence(sequence)[source]¶
Return a version of the sequence compatible with the mutation space.
All nucleotides of the sequence that are incompatible with the mutation space are replaced by nucleotides compatible with the space.
- static from_optimization_problem(problem, new_constraints=None)[source]¶
Create a mutation space from a DNA optimization problem.
This can be used to initialize mutation spaces for new problems.
- property space_size¶
Return the number of possible mutations.
SequencePattern¶
Specifications AvoidPattern
, EnforcePatternOccurence
accept a Pattern
as argument. The pattern can be provided either as a string or as a class:
Name |
Pattern Class |
As string |
---|---|---|
Restriction site |
|
|
Homopolymer |
|
|
Direct repeats |
|
|
DNA notation |
|
|
Regular Expression |
|
|
SequencePattern class¶
- class dnachisel.SequencePattern(expression, size=None, name=None, lookahead='loop', is_palyndromic=False)[source]¶
Pattern that will be looked for in a DNA sequence.
Use this class for matching regular expression patterns, and DnaNotationPattern for matching explicit sequences or sequences using Ns etc.
Examples
>>> expression = "A[ATGC]{3,}" >>> pattern = SequencePattern(expression) >>> constraint = AvoidPattern(pattern)
- Parameters:
expression – Any string or regular expression (regex) for matching ATGC nucleotides. Note that multi-nucleotide symbols such as “N” (for A-T-G-C), or “K” are not supported by this class, see DnaNotationPattern instead.
size – Size of the pattern, in number of characters. The
size
is used to determine the size of windows when performing local optimization and constraint solving. It can be important to provide the size when thepattern
string provided represents a complex regular expression whose maximal matching size cannot be easily evaluated. For example, if a regex is used to actively remove sites, then a size should be provided to inform DNA Chisel during optimization.name – Name of the pattern (will be displayed e.g. when the pattern is printed).
- find_matches(sequence, location=None, forced_strand=None)[source]¶
Return the locations where the sequence matches the expression.
- Parameters:
sequence – A string of “ATGC…”
location – Location indicating a segment to which to restrict the search. Only patterns entirely included in the segment will be returned
- Returns:
A list of the locations of matches, of the form
[(start1, end1), (start2, end2),...]
.- Return type:
matches
DnaNotationPattern¶
MotifPssmPattern¶
- class dnachisel.MotifPssmPattern(pssm, threshold=None, relative_threshold=None)[source]¶
Motif pattern represented by a Position-Specific Scoring Matrix (PSSM).
This class is better instantiated by creating a pattern from a set of sequence strings with
MotifPssmPattern.from_sequences()
, or reading pattern(s) for a file withMotifPssmPattern.from_file()
.- Parameters:
pssm – A Bio.motifs.Motif object.
threshold – locations of the sequence with a PSSM score above this value will be considered matches. For convenience, a relative_threshold can be given instead.
relative_threshold – Value between 0 and 1 from which the threshold will be auto-computed. 0 means “match everything”, 1 means “only match the one (or several) sequence(s) with the absolute highest possible score”.
- classmethod apply_pseudocounts(motif, pseudocounts)[source]¶
Add pseudocounts to the motif’s pssm matrix.
Add nothing if pseudocounts is None, auto-compute pseudocounts if pseudocounts=”jaspar”, else just attribute the pseudocounts as-is to the motif.
- find_matches_in_string(sequence)[source]¶
Find all positions where the PSSM score is above threshold.
- classmethod from_sequences(sequences, name='unnamed', pseudocounts='jaspar', threshold=None, relative_threshold=None)[source]¶
Return a PSSM pattern computed from same-length sequences.
- Parameters:
sequences – A list of same-length sequences.
name – Name to give to the pattern (will appear in reports etc.).
pseudocounts – Either a dict {“A”: 0.01, “T”: …} or “jaspar” for automatic pseudocounts from the Biopython.motifs.jaspar module (recommended), or None for no pseudocounts at all (not recommended!).
threshold – locations of the sequence with a PSSM score above this value will be considered matches. For convenience, a relative_threshold can be given instead.
relative_threshold – Value between 0 and 1 from which the threshold will be auto-computed. 0 means “match everything”, 1 means “only match the one (or several) sequence(s) with the absolute highest possible score”.
- classmethod list_from_file(motifs_file, file_format, threshold=None, pseudocounts='jaspar', relative_threshold=None)[source]¶
Return a list of PSSM patterns from a file in JASPAR, MEME, etc.
- Parameters:
motifs_file – Path to a motifs file, or file handle.
file_format – File format. one of “jaspar”, “meme”, “TRANSFAC”.
pseudocounts – Either a dict {“A”: 0.01, “T”: …} or “jaspar” for automatic pseudocounts from the Biopython.motifs.jaspar module (recommended), or None for no pseudocounts at all (not recommended!).
threshold – locations of the sequence with a PSSM score above this value will be considered matches. For convenience, a relative_threshold can be given instead.
relative_threshold – Value between 0 and 1 from which the threshold will be auto-computed. 0 means “match everything”, 1 means “only match the one (or several) sequence(s) with the absolute highest possible score”.