Golden Hinges Reference Manual
Golden Hinges has one main class, OverhangsSelector. An instance of this class determines which overhangs are accepted and what “intercompatibility” of overhangs means. The class's methods then solve different scenarios.
OverhangsSelector
- class goldenhinges.OverhangsSelector(gc_min=0, gc_max=1, differences=1, overhangs_size=4, forbidden_overhangs=(), forbidden_pairs=(), possible_overhangs=None, time_limit=None, external_overhangs=(), progress_logger='bar')
A selector of compatible overhangs for Golden Gate assembly and others.
- Parameters
- gc_min
Minimal proportion of GC allowed in the overhangs, e.g. 0.25 for 25% of GC or more in the overhang.
- gc_max
Maximal proportion of GC allowed in the overhangs, e.g. 0.75 for 75% of GC or less in the overhang.
- differences
Number of nucleotides by which any two selected overhangs should differ, both from each other and from each other's reverse complement.
- overhangs_size
Number of nucleotides for the overhangs, e.g. 4 for golden gate assembly.
- forbidden_overhangs
List of all forbidden overhangs.
- possible_overhangs
List of a few overhangs the collection should be chosen from.
- time_limit
Time in seconds after which the solvers should stop if no solution was found yet.
- external_overhangs
List of overhangs that all selected overhangs should be compatible with.
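The acceptance and compatibility rules described by these parameters can be sketched in plain Python. This is a minimal illustration, not the library's code: the helper functions are re-implemented locally so the block is standalone, and the inclusive GC bounds and the exact pairwise rule are assumptions drawn from the parameter descriptions above.

```python
# Standalone sketch of the rules described above (not the library's code).
_COMPLEMENT = str.maketrans("ATGC", "TACG")


def gc_content(sequence):
    """Proportion of G and C in an ATGC string (between 0 and 1)."""
    return sum(base in "GC" for base in sequence) / len(sequence)


def reverse_complement(sequence):
    """Reverse-complement of an ATGC string."""
    return sequence.translate(_COMPLEMENT)[::-1]


def sequences_differences(seq1, seq2):
    """Number of positions at which two equal-length sequences differ."""
    return sum(a != b for a, b in zip(seq1, seq2))


def is_acceptable(overhang, gc_min=0.0, gc_max=1.0, forbidden_overhangs=()):
    """Assumed per-overhang rule: GC within bounds (inclusive), not forbidden."""
    in_gc_range = gc_min <= gc_content(overhang) <= gc_max
    return in_gc_range and overhang not in forbidden_overhangs


def are_compatible(o1, o2, differences=1):
    """Assumed pairwise rule: o1 differs enough from o2 and from revcomp(o2)."""
    return (
        sequences_differences(o1, o2) >= differences
        and sequences_differences(o1, reverse_complement(o2)) >= differences
    )


print(are_compatible("ATTC", "ATCG"))  # True
print(are_compatible("ATTC", "GAAT"))  # False: GAAT is the reverse complement of ATTC
print(is_acceptable("AAAA", gc_min=0.25))  # False: no G or C
```

The second call shows why the reverse complement matters: "ATTC" and "GAAT" share no position, yet the two overhangs would anneal to each other.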
Methods
- cut_sequence(sequence, intervals=None, solutions=1, allow_edits=False, include_extremities=True, optimize_score=True, edit_penalty=10, equal_segments=None, max_radius=10, target_indices=None)
Select compatible-overhang cut locations, one in each interval.
- Parameters
- sequence
An ATGC string or a Biopython record.
- intervals
List of the form [(start1, end1), …] indicating intervals in which to cut the sequence. Note that equal_segments or target_indices can be provided instead.
- solutions
If equal to 1, one solution is returned (i.e. a list of cuts). If larger than 1, a list of solutions is returned. If equal to “iter”, an iterator over solutions is returned.
- allow_edits
Keep to False to forbid any sequence change.
- include_extremities
Whether the sequence’s extremities should be considered as overhangs and be compatible with overhangs generated by the cuts.
- optimize_score
If False, the algorithm will return any solution that satisfies all constraints. If True, the algorithm will go through all possible solutions and find the best one, i.e. the one whose overhangs' total score is maximal, which generally means overhangs as near as possible to the center of the cut interval, and ‘native’ in the sequence, i.e. not requiring a sequence edit.
- equal_segments
Number indicating that the sequence should be cut in N segments with lengths as similar as possible.
- target_indices
If provided, the sequence will be cut in regions around these target indices.
- max_radius
Maximal radius around the target indices for the search of a solution
- Returns
- solution
A list of dictionaries, each representing one overhang with properties o['location'] (coordinate of the overhang in the sequence) and o['sequence'] (sequence of the overhang).
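To make the problem concrete, the sketch below lists the candidate overhangs available in each interval, using the same dictionary shape as the documented return value. It is an illustration only: the helper name candidate_overhangs is hypothetical, and treating 'location' as the 0-based start of the overhang and intervals as half-open (start included, end excluded) are assumptions, not documented behaviour.

```python
def candidate_overhangs(sequence, intervals, overhang_size=4):
    """List, for each (start, end) interval, every overhang starting inside it.

    Assumptions: 0-based coordinates, half-open intervals, and 'location'
    is the index of the first nucleotide of the overhang.
    """
    candidates = []
    for start, end in intervals:
        # Do not run past the end of the sequence.
        last_start = min(end, len(sequence) - overhang_size + 1)
        candidates.append(
            [
                {"location": i, "sequence": sequence[i:i + overhang_size]}
                for i in range(start, last_start)
            ]
        )
    return candidates


sequence = "ATGCTTGACCGTAAGCTAGG"
for group in candidate_overhangs(sequence, [(2, 5), (12, 15)]):
    print([(o["location"], o["sequence"]) for o in group])
```

A solution to the cutting problem then picks one dictionary from each group such that the chosen overhangs are all mutually compatible.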
- generate_overhangs_set(n_overhangs=None, mandatory_overhangs=(), start_at=2, step=2, n_cliques=None)
Generate a set of compatible overhangs, e.g. {"ATTC", "ATCG", ...}.
- Parameters
- n_overhangs
Size of the desired overhang set. If left to None, the algorithm will return the largest set it can find.
- mandatory_overhangs
Overhangs which must be in the final set.
- step
Increment to use for the set size when looking for the largest possible set (case n_overhangs=None). Note that this should not change the final result, but a well-chosen step can speed up the computation several-fold.
- start_at
Number of overhangs to start from (before increasing) when auto-selecting the number of overhangs.
- n_cliques
If provided, the algorithm will look for maximal sets of compatible overhangs using a graph-clique-based method.
- select_from_sets(sets_list, solutions=1, optimize_score=True)
Find compatible overhangs, picking one from each provided set.
This is the central solver for methods cut_sequence, cut_sequence_into_similar_lengths,
- Parameters
- sets_list
A list of either sets or lists of overhangs.
- solutions
Either 1 for a unique solution, a number k for a list of solutions, or “iter” which returns an iterator over all solutions.
- optimize_score
If True, the total score of all overhang choices will be maximized.
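The "one overhang from each set" problem can be illustrated with a brute-force search. This is a sketch under stated assumptions, not the solver the library uses: the function names are hypothetical, scoring (optimize_score) is not modelled, and the compatibility rule is the assumed one of differing from both the other overhang and its reverse complement.

```python
from itertools import product

_COMPLEMENT = str.maketrans("ATGC", "TACG")


def reverse_complement(sequence):
    return sequence.translate(_COMPLEMENT)[::-1]


def n_differences(seq1, seq2):
    return sum(a != b for a, b in zip(seq1, seq2))


def compatible(o1, o2, min_diff=1):
    """Assumed rule: o1 differs from o2 and from the reverse complement of o2."""
    return (
        n_differences(o1, o2) >= min_diff
        and n_differences(o1, reverse_complement(o2)) >= min_diff
    )


def iter_selections(sets_list, min_diff=1):
    """Yield every tuple with one overhang per set, all pairwise compatible.

    Exhaustive enumeration: fine for small inputs, exponential in general.
    """
    for choice in product(*[sorted(s) for s in sets_list]):
        pairs_ok = all(
            compatible(a, b, min_diff)
            for i, a in enumerate(choice)
            for b in choice[i + 1:]
        )
        if pairs_ok:
            yield choice


sets_list = [{"ATTC", "GGCA"}, {"GAAT", "ATCG"}, {"TTTT"}]
print(next(iter_selections(sets_list)))  # ('ATTC', 'ATCG', 'TTTT')
```

Returning a generator loosely mirrors the solutions="iter" option: the caller can take the first valid selection or keep iterating.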
Clique methods
Methods based on graph cliques make it possible to quickly find large, albeit not necessarily optimal, sets of compatible overhangs.
- goldenhinges.clique_methods.find_compatible_overhangs(overhangs_size=4, mandatory_overhangs=(), forbidden_overhangs=(), min_gc_content=0, max_gc_content=1, min_overhangs_differences=2, min_reverse_overhangs_differences=2, elements_filters=(), compatibility_conditions=(), solution_validity_conditions=(), n_solutions_considered=5000, score='subset_size', progress_bar=False, randomize=False)
Return a list of compatible overhangs for (Golden Gate) assembly, satisfying all the specified conditions.
- Parameters
- overhangs_size
The size of the overhangs. Four is the default and the most common case.
- mandatory_overhangs
A list [“ATGC”, “TTGC”…] of the overhangs that must be part of the final solution. If these do not respect the other conditions or if they are not compatible between themselves, an error will be raised.
- forbidden_overhangs
A list [“ATGC”, “TTGC”…] of overhangs that should NOT be part of the final solution.
- min_gc_content
Float between 0.0 and 1.0 indicating the minimum proportion of G and C that valid overhangs should contain.
- max_gc_content
Float between 0.0 and 1.0 indicating the maximum proportion of G and C that valid overhangs should contain.
- min_overhangs_differences
Minimal number of different basepairs between two overhangs for them to be compatible (1 is an acceptable value but 2 is advised to really ensure the specificity of the assembly).
- min_reverse_overhangs_differences
Minimal number of different basepairs between an overhang and the reverse-complement of a second overhang for these two overhangs to be compatible (1 is an acceptable value but 2 is advised to really ensure the specificity of the assembly).
- elements_filters
Additional filters to narrow down the possible overhangs. Must be a list or tuple of functions fun(element)->True/False. Only overhangs such that fun(element) is True for all filters are kept.
- compatibility_conditions
Additional conditions to determine whether two overhangs are compatible, i.e. whether a solution can feature these two elements at the same time. Must be a list or tuple of functions fun(e1, e2)->True/False. Two overhangs are compatible when fun(e1, e2) is True for all functions in compatibility_conditions.
- solution_validity_conditions
Additional validity conditions used to filter out some solutions. Must be a list or tuple of functions fun(solution)->True/False, where the solution is a list of elements of all_elements. A solution is considered valid when fun(solution) is True for all functions in solution_validity_conditions.
- score
Of all the solutions explored, the one which scores highest is returned. Can be either a function fun(solution)->float where solution is a list of overhangs, or it can be the default “subset_size” which means the score will be the length of the solution found (the solution returned will have as many different overhangs as could be found).
- n_solutions_considered
Number of solutions considered (set to None if you want to consider all possible solutions, which may take a very long time).
- progress_bar
If True, progress bars are displayed as the edges are computed and the graph cliques are explored. (see find_best_compatible_subset)
- goldenhinges.clique_methods.find_large_compatible_subset(all_elements, mandatory_elements=(), compatibility_conditions=(), elements_filters=(), solution_validity_conditions=(), score='subset_size', n_solutions_considered=5000, progress_bar=False, randomize=False)
Return a maximal subset of all_elements where all elements are valid and inter-compatible.
The algorithm takes in a set of elements all_elements and filters out some elements using the filters in elements_filters. Then it creates a graph whose nodes are the remaining elements. Edges are added between all pairs of elements which are “compatible” as defined by the compatibility_conditions. Finally we look for the “cliques”, i.e. subsets of the graph made of elements that are all inter-compatible. We consider a number n_solutions_considered of these, and return the one which scores highest as defined by the score function.
- Parameters
- all_elements
A tuple or list of all possible elements.
- mandatory_elements
A tuple or list of elements contained in all_elements that must be included in the final solution.
- elements_filters
Functions used to pre-filter the list of all_elements. Must be a list or tuple of functions fun(element)->True/False. Only elements of all_elements such that fun(element) is True for all filters are considered.
- compatibility_conditions
Functions used to determine whether two elements are compatible, i.e. whether a solution can feature these two elements at the same time. Must be a list or tuple of functions fun(e1, e2)->True/False. Two elements of all_elements are compatible when fun(e1, e2) is True for all functions in compatibility_conditions.
- solution_validity_conditions
Additional validity conditions used to filter out some solutions. Must be a list or tuple of functions fun(solution)->True/False, where the solution is a list of elements of all_elements. A solution is considered valid when fun(solution) is True for all functions in solution_validity_conditions.
- score
Of all the solutions explored, the one which scores highest is returned. Can be either a function fun(solution)->float where solution is a list of elements, or it can be the default “subset_size” which means the score will be the length of the solution found (the solution returned will have as many elements as could be found).
- n_solutions_considered
Number of cliques of the graph that are iterated through (set to None if you want to consider all cliques in the graph, which may take a very long time).
- progress_bar
If True, progress bars are displayed as the edges are computed and the graph cliques are explored.
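The filter, graph, and clique steps described for find_large_compatible_subset can be sketched in pure Python as follows. This is an illustrative re-implementation under assumptions, not the library's code: it uses a basic Bron-Kerbosch enumeration without pivoting, assumes the compatibility functions are symmetric, and omits mandatory_elements, solution_validity_conditions, and randomize. The function names are hypothetical.

```python
from itertools import combinations, islice


def iter_maximal_cliques(neighbors, clique=(), candidates=None, excluded=None):
    """Yield the maximal cliques of a graph (Bron-Kerbosch, no pivoting).

    `neighbors` maps each node to the set of nodes it is compatible with.
    """
    if candidates is None:
        candidates, excluded = set(neighbors), set()
    if not candidates and not excluded:
        yield list(clique)
        return
    for node in sorted(candidates):
        yield from iter_maximal_cliques(
            neighbors,
            clique + (node,),
            candidates & neighbors[node],
            excluded & neighbors[node],
        )
        # The node has been fully explored: move it to the excluded set.
        candidates = candidates - {node}
        excluded = excluded | {node}


def large_compatible_subset(all_elements, elements_filters=(),
                            compatibility_conditions=(), score=len,
                            n_solutions_considered=5000):
    # 1. Keep only the elements accepted by every filter.
    elements = [e for e in all_elements if all(f(e) for f in elements_filters)]
    # 2. Build the graph: one edge per compatible pair.
    neighbors = {e: set() for e in elements}
    for e1, e2 in combinations(elements, 2):
        if all(cond(e1, e2) for cond in compatibility_conditions):
            neighbors[e1].add(e2)
            neighbors[e2].add(e1)
    # 3. Explore a bounded number of cliques and keep the best-scoring one.
    cliques = islice(iter_maximal_cliques(neighbors), n_solutions_considered)
    return max(cliques, key=score)


# Toy example with integers: keep odd numbers, compatible if at least 4 apart.
best = large_compatible_subset(
    range(1, 11),
    elements_filters=[lambda n: n % 2 == 1],
    compatibility_conditions=[lambda a, b: abs(a - b) >= 4],
)
print(best)  # [1, 5, 9]
```

Bounding the number of cliques explored is what makes the approach fast but not necessarily optimal: the largest clique may lie beyond the first n_solutions_considered candidates.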
Biotools
- goldenhinges.biotools.annotate_record(seqrecord, location='full', feature_type='misc_feature', margin=0, **qualifiers)
Add a feature to a Biopython SeqRecord.
- Parameters
- seqrecord
The Biopython SeqRecord to be annotated.
- location
Either (start, end) or (start, end, strand). (strand defaults to +1)
- feature_type
The type associated with the feature
- margin
Number of extra bases added on each side of the given location.
- qualifiers
Dictionary that will be the Biopython feature’s qualifiers attribute.
- goldenhinges.biotools.crop_record(record, crop_start, crop_end, features_suffix=' (part)')
Return the cropped record with possibly cropped features.
Note that this differs from record[start:end] in that in the latter expression, cropped features are discarded.
- Parameters
- record
A Biopython record
- crop_start, crop_end
Start and end of the segment to be cropped.
- features_suffix
All cropped features will have their label appended with this suffix.
- goldenhinges.biotools.gc_content(sequence)
Return the proportion of G and C in the sequence (between 0 and 1).
The sequence must be an ATGC string.
- goldenhinges.biotools.list_overhangs(overhang_size=4, filters=())
Return the list of all possible ATGC overhangs of the given size, such that fl(overhang) is true for every function fl in filters.
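The enumeration described for list_overhangs can be sketched with itertools.product. This is a standalone illustration rather than the library's implementation; the name list_overhangs_sketch and the two example filters are hypothetical, and no claim is made about the order in which the real function returns overhangs.

```python
from itertools import product

_COMPLEMENT = str.maketrans("ATGC", "TACG")


def list_overhangs_sketch(overhang_size=4, filters=()):
    """Return every ATGC string of the given size accepted by all filters."""
    overhangs = ("".join(bases) for bases in product("ATGC", repeat=overhang_size))
    return [o for o in overhangs if all(fl(o) for fl in filters)]


def is_not_palindromic(overhang):
    """Hypothetical filter: reject overhangs equal to their reverse complement."""
    return overhang != overhang.translate(_COMPLEMENT)[::-1]


def has_balanced_gc(overhang):
    """Hypothetical filter: keep overhangs with 25% to 75% of G and C."""
    gc = sum(base in "GC" for base in overhang) / len(overhang)
    return 0.25 <= gc <= 0.75


print(len(list_overhangs_sketch()))  # 256
print(len(list_overhangs_sketch(filters=[is_not_palindromic])))  # 240
print(len(list_overhangs_sketch(filters=[has_balanced_gc])))  # 224
```

Each filter is a plain function of one overhang returning True or False, which is the shape the filters argument expects according to the description above.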
- goldenhinges.biotools.load_record(filename, linear=True, name='unnamed', fmt='auto')
Load a FASTA/Genbank/… record.
- goldenhinges.biotools.reverse_complement(sequence)
Return the reverse-complement of the DNA sequence. For instance reverse_complement("ATGC") returns "GCAT".
The sequence must be an ATGC string.
- goldenhinges.biotools.sequences_differences(seq1, seq2)
Return the number of nucleotides that differ in the two sequences.
seq1, seq2 should be strings of DNA sequences, e.g. “ATGCTGTGC”.