Biotools

Biologically-related useful methods.

dnachisel.biotools.annotate_differences(record, reference, feature_type='misc_feature', prefix='#')[source]

Annotate differences between two records in a new record.

Returns a version of SeqRecord record where differences with the references are annotated as new features.

Parameters
record

The SeqRecord to be compared to the reference.

reference

The reference SeqRecord. Must be the same size as reference.

feature_type

The type of the features added to mark differences.

prefix

Each new feature will be labeled “po” where p is the prefix and o the original sequence at the feature’s location. For instance “#A” or “#TT”.

dnachisel.biotools.annotate_pattern_occurrences(record, pattern, feature_type='misc_feature', prefix='!')[source]

Return a new record annotated w. all occurences of pattern in sequence.

Parameters
record

A Biopython record.

pattern

A DnaChisel SequencePattern object (such as DnaPAttern).

feature_type

Type of the annotations in the returned record.

dnachisel.biotools.annotate_record(seqrecord, location='full', feature_type='misc_feature', margin=0, **qualifiers)[source]

Add a feature to a Biopython SeqRecord.

Parameters
seqrecord

The biopython seqrecord to be annotated.

location

Either (start, end) or (start, end, strand). (strand defaults to +1).

feature_type

The type associated with the feature.

margin

Number of extra bases added on each side of the given location.

qualifiers

Dictionary that will be the Biopython feature’s qualifiers attribute.

dnachisel.biotools.blast_sequence(sequence, blast_db=None, subject_sequences=None, subject=None, word_size=4, perc_identity=80, num_alignments=1000, ungapped=False, num_threads=3, culling_limit=None, e_value=None, task=None, dust='no')[source]

Return a Biopython BLAST record of the given sequence BLASTed against the provided database.

Parameters
sequence

An ATGC sequence

Examples

>>> blast_record = blast_sequence("ATTGTGCGTGTGTGCGT", "blastdb/ecoli")
>>> for alignment in blast_record.alignments:
>>>     for hit in alignment.hsps:
>>>         print (hit.identities)
dnachisel.biotools.change_biopython_record_sequence(record, new_seq)[source]

Return a version of the record with the sequence set to new_seq.

dnachisel.biotools.complement(dna_sequence)[source]

Return the complement of the DNA sequence.

For instance complement("ATGCCG") returns "TACGGC".

Uses Biopython for speed.

dnachisel.biotools.dict_to_pretty_string(d, rounding_digits=2, indent=2)[source]

Return a nicely JSON-like formatted string to print a dict.

dnachisel.biotools.dna_pattern_to_regexpr(dna_pattern)[source]

Return a regular expression pattern for the provided DNA pattern

For instance dna_pattern_to_regexpr('ATTNN') returns "ATT[A|T|G|C][A|T|G|C]".

dnachisel.biotools.find_all_bowtie_matches(sequence, bowtie_index_path, match_length, max_mismatches=0)[source]

Return (short) matches between a sequence and a Bowtie index.

The result is of the form [(start, end), n_mismatches)] where (start, end) indicates the position of the match in the sequence, and n_mismatches is the number of mismatches with the closest homology in the index.

dnachisel.biotools.find_specification_label_in_feature(feature)[source]

Analyse a Biopython feature to find a DnaChisel Specification in it.

The specification should start with either “@” or “~”, in the feature’s field “label” or “note”.

dnachisel.biotools.gc_content(sequence, window_size=None)[source]

Compute global or local GC content.

Parameters
sequence

An ATGC DNA sequence (upper case!)

window_size

If provided, the local GC content for the different sliding windows of this size is returned, else the global GC content is returned.

Returns
A number between 0 and 1 indication the proportion

of GC content. If window_size is provided, returns a list of len(sequence)-window_size values indicating the local GC contents (sliding-window method). The i-th value indicates the GC content in the window [i, i+window_size]

dnachisel.biotools.group_nearby_indices(indices, max_gap=None, max_group_spread=None)[source]

Return a list of groups of the different indices.

Indices are considered from smaller to larger and placed into groups

Parameters
max_gap

Maximal allowed difference between two consecutive numbers of a group

max_group_spread

Maximal allowed difference between the smallest and largest elements of a group.

dnachisel.biotools.group_nearby_segments(segments, max_start_gap=None, max_start_spread=None)[source]

Return a list of groups of the different indices.

Indices are considered from smaller to larger and placed into groups

Parameters
max_gap

Maximal allowed difference between two consecutive numbers of a group

max_group_spread

Maximal allowed difference between the smallest and largest elements of a group.

dnachisel.biotools.list_common_enzymes(site_length=6, opt_temp=37, min_suppliers=1, site_unlike=())[source]

Return a list of enzyme names with the given constraints.

Parameters
site_length

List of accepted site lengths (6, 4, …)

opt_temp

List of accepted optimal temperatures for the enzyme

min_suppliers

Minimal number registered suppliers in the Biopython data. A minimum of 3 known suppliers returns the most common enzymes.

site_unlike

List of (ambiguous or unambiguous) DNA sequences that should NOT be recognized by the selected enzymes.

dnachisel.biotools.load_record(filepath, linear=True, name='unnamed', file_format='auto')[source]

Load a FASTA/Genbank/Snapgene record.

Note that reading Snapgene records requires the library snapgene_reader installed.

dnachisel.biotools.random_dna_sequence(length, gc_share=None, probas=None, seed=None)[source]

Return a random DNA sequence (“ATGGCGT…”) with the specified length.

Parameters
length

Length of the DNA sequence.

proba

Frequencies for the different nucleotides, for instance probas={"A":0.2, "T":0.3, "G":0.3, "C":0.2}. If not specified, all nucleotides are equiprobable (p=0.25).

seed

The seed to feed to the random number generator. When a seed is provided the random results depend deterministically on the seed, thus enabling reproducibility

dnachisel.biotools.random_protein_sequence(length, seed=None)[source]

Return a random protein sequence “MNQTW…YL*” of the specified length.

Parameters
length

Length of the protein sequence (in number of amino-acids). Note that the sequence will always start with "M" and end with a stop codon "*" with (length-2) random amino-acids in the middle

seed

The seed to feed to the random number generator. When a seed is provided the random results depend deterministically on the seed, thus enabling reproducibility

dnachisel.biotools.reverse_complement(sequence)[source]

Return the reverse-complement of the DNA sequence.

For instance reverse_complement("ATGCCG") returns "CGGCAT".

Uses Biopython for speed.

dnachisel.biotools.reverse_translate(protein_sequence, randomize_codons=False, table='Standard')[source]

Return a DNA sequence which translates to the provided protein sequence.

Parameters
protein_sequence

A sequence string of aminoacids, e.g. “MVKK…”

table

Genetic code table to use (e.g. ‘Standard’, ‘Bacterial’, etc.). See dnachisel.biotools.CODON_TABLE_NAMES for a list of available genetic code tables.

randomize_codons

If False, the first valid codon found is used for each, which can create biases (GC content, etc.), if True, each amino acid gets replaced by a randomly selected codon for this amino acid.

dnachisel.biotools.round_all_numbers_in_dict(d, rounding_digits=2, outplace=True)[source]

Return a new version of dict d with all floats rounded to N digits.

dnachisel.biotools.score_to_formatted_string(score, characters=9)[source]

Transform a number (score) into a best-format string.

The format will be either int (2234), float (10.234) or engineering (1.20E5), whichever is shorter. The score is then padded with left whitespaces to obtained the desired number of characters.

dnachisel.biotools.sequence_to_biopython_record(sequence, id='<unknown id>', name='<unknown name>', features=())[source]

Return a SeqRecord of the sequence, ready to be Genbanked.

dnachisel.biotools.sequences_differences(seq1, seq2)[source]

Return the number of nucleotides that differ in the two sequences.

seq1, seq2 should be strings of DNA sequences e.g. “ATGCTGTGC”

dnachisel.biotools.sequences_differences_array(seq1, seq2)[source]

Return an array [0, 0, 1, 0, …] with 1s for sequence differences.

seq1, seq2 should both be ATGC strings.

dnachisel.biotools.sequences_differences_segments(seq1, seq2)[source]

Return the list of segments on which sequence seq1 differs from seq2.

The list is of the form [(start1, end1), (start2, end2), etc.]

Parameters
seq1, seq2

ATGC sequences to be compared

dnachisel.biotools.subdivide_window(window, max_span)[source]

Subdivide a window (start, end) into windows of size < max_span (start, i_1), (i_1, i_2), … (i_n, end)

dnachisel.biotools.translate(dna_sequence, table='Standard', assume_start_codon=False)[source]

Translate the DNA sequence into an amino-acids sequence “MLKYQT…”. If translation_table is the name or number of a NCBI genetic table, Biopython will be used. See here for options:

http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc25

translation_table can also be a dictionary of the form {"ATT": "M", "CTC": "X", etc.} for more exotic translation tables

If assume_start_codon is True and the first codon is a start codon in the given genetic table, then the first amino acid will be M (methionine).

dnachisel.biotools.windows_overlap(window1, window2)[source]

Return the overlap span between two windows.

Parameters
window1, window2

Each window is a couple of the form (start, end) indicating the range of a segment of integers.

Returns
None

In case the two windows do not overlap.

[start, end]

The coordinates of the overlap segment if there is one.

dnachisel.biotools.write_record(record, target, file_format='genbank', remove_locationless_features=True, max_name_length=20)[source]

Write a record as genbank, fasta, etc. via Biopython, with fixes.

Parameters
record

A biopython record.

target

Path to a file or filelike object.

file_format

Format, either Genbank or fasta.

remove_locationless_features

If True, will remove all features whose location is None, to avoid a Biopython bug

max_name_length

The record’s name will be truncated if longer than this (also here to avoid a Biopython bug).