easy_dna¶

Easy_dna is a Python library implementing useful routines for manipulating DNA sequences, either as “ATGC” strings or Biopython records. It aims to provide a simpler interface than Biopython for common operations related to DNA sequence design and Genbank file generation.

Easy_dna was originally created to gather useful methods repeatedly used in the different software projects of the Edinburgh Genome Foundry for DNA design and manufacturing.

See the API reference in the Reference section below.

Installation¶

You can install easy_dna with pip:

pip install easy_dna

Alternatively, you can unzip the sources in a folder and type:

python setup.py install

License = MIT¶

Easy_dna is open-source software originally written at the Edinburgh Genome Foundry by Zulko and released on GitHub under the MIT licence (Copyright 2019 Edinburgh Genome Foundry). Everyone is welcome to contribute!

More biology software¶


Easy_dna is part of the EGF Codons synthetic biology software suite for DNA design, manufacturing and validation.

Reference¶

easy_dna.all_iupac_variants(iupac_sequence)[source]¶

Return all unambiguous possible versions of the given sequence.

Examples

>>> all_iupac_variants('ATN')
['ATA', 'ATC', 'ATG', 'ATT']
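
The expansion can be pictured as a Cartesian product over the bases that each IUPAC letter stands for. The following is a minimal standard-library sketch of that idea, not the library's implementation; the names `IUPAC_BASES` and `iupac_variants_sketch` are hypothetical.

```python
from itertools import product

# Standard IUPAC nucleotide codes mapped to the bases they can represent.
IUPAC_BASES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def iupac_variants_sketch(iupac_sequence):
    """Return every unambiguous sequence matching the IUPAC sequence."""
    choices = [IUPAC_BASES[letter] for letter in iupac_sequence]
    return ["".join(bases) for bases in product(*choices)]


print(iupac_variants_sketch("ATN"))  # ['ATA', 'ATC', 'ATG', 'ATT']
```

The number of variants grows multiplicatively with each ambiguous letter, so long, highly ambiguous sequences can produce very large lists.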

easy_dna.annotate_record(seqrecord, location='full', feature_type='misc_feature', margin=0, **qualifiers)[source]¶

Add a feature to a Biopython SeqRecord.

Parameters:

seqrecord (SeqRecord) – The Biopython SeqRecord to be annotated.
location (FeatureLocation) – Either “full” (the default, which annotates the whole record), or a tuple (start, end) or (start, end, strand). The strand defaults to +1.
feature_type (str) – The type associated with the feature.
margin (int) – Number of extra bases added on each side of the given location.
qualifiers (dict) – Dictionary that will be the Biopython feature’s qualifiers attribute.

easy_dna.anonymized_record(record, record_id='anonymized', label_generator='feature_%d')[source]¶

Return a record with removed annotations/keywords/features/etc.

Warning: this does not change the record sequence!

Parameters:

record (SeqRecord) – The record to be anonymized.
record_id (str) – ID of the new record.
label_generator (str or function) – Recipe to change feature labels. Either “feature_%d” or “None” (no label) or a function (i, feature) => label.

easy_dna.censor_genbank(filename, target, **censor_params)[source]¶

Load Genbank file and write censored version.

Parameters:

filename (str) – Path to the file containing the record.
target (str) – Path to output Genbank file.
censor_params (dict, optional) – Optional parameters. See censor_record() for details.

easy_dna.censor_record(record, record_id='censored', label_generator='feature_%d', keep_topology=False, anonymise_features=True, preserve_sites=None)[source]¶

Return a record with random sequence and censored annotations/features.

Useful for creating example files or anonymizing sequences for bug reports.

Parameters:

record (SeqRecord) – The record to be anonymized.
record_id (str) – ID of the new record.
label_generator (str or function) – Recipe to change feature labels. Either “feature_%d” or “None” (no label) or a function (i, feature) => label.
keep_topology (bool) – Whether to keep the record topology or not.
anonymise_features (bool) – Whether to replace feature labels and ID/name, or not.
preserve_sites (list of str) – List of enzyme sites to keep. Example: [“BsmBI”, “BsaI”]. Preserves the sequence around cut sites of the specified enzymes.

easy_dna.complement(dna_sequence)[source]¶

Return the complement of the DNA sequence.

For instance complement("ATGCCG") returns "TACGGC".

Uses Biopython for speed.

easy_dna.copy_and_paste_segment(seq, start, end, new_start)[source]¶: Return the sequence with segment [start, end] also copied elsewhere, starting at new_start.

easy_dna.cut_and_paste_segment(seq, start, end, new_start)[source]¶: Move a subsequence by “diff” nucleotides to the left or the right.

easy_dna.delete_nucleotides(seq, start, n)[source]¶: Return the sequence with n deletions from position start.

easy_dna.delete_segment(seq, start, end)[source]¶: Return the sequence with deleted segment from start to end.
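
On plain “ATGC” strings, the two deletion routines can be thought of as slicing operations. The sketch below assumes 0-based indices and an excluded end position (Python slice semantics), which the descriptions above do not state explicitly; the function names are hypothetical stand-ins, not the library's code.

```python
def delete_nucleotides_sketch(seq, start, n):
    """Drop n nucleotides beginning at index start."""
    return seq[:start] + seq[start + n:]


def delete_segment_sketch(seq, start, end):
    """Drop seq[start:end]; treating end as excluded is an assumption."""
    return seq[:start] + seq[end:]


print(delete_nucleotides_sketch("AAACCCGGG", 3, 3))  # AAAGGG
print(delete_segment_sketch("AAACCCGGG", 3, 6))      # AAAGGG
```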

easy_dna.dna_pattern_to_regexpr(dna_pattern)[source]¶

Return a regular expression pattern for the provided DNA pattern.

For instance dna_pattern_to_regexpr('ATTNN') returns "ATT[A|T|G|C][A|T|G|C]".
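
To illustrate how such a pattern can be used for matching, here is a small standard-library sketch that builds an equivalent expression by hand. It is not the library's code: `pattern_to_regex_sketch` and `IUPAC_BASES` are hypothetical names, and the sketch writes character classes as `[ACGT]` rather than the `[A|T|G|C]` form shown above. Both forms match the same nucleotides.

```python
import re

# Standard IUPAC nucleotide codes mapped to the bases they can represent.
IUPAC_BASES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def pattern_to_regex_sketch(dna_pattern):
    """Turn an IUPAC DNA pattern into a regular expression string."""
    parts = []
    for letter in dna_pattern:
        bases = IUPAC_BASES[letter]
        # Unambiguous letters stay as they are; others become a character class.
        parts.append(bases if len(bases) == 1 else "[" + bases + "]")
    return "".join(parts)


regex = pattern_to_regex_sketch("ATTNN")
print(regex)                                # ATT[ACGT][ACGT]
print(bool(re.fullmatch(regex, "ATTGC")))   # True
```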

easy_dna.extract_from_input(filename=None, directory=None, construct_list=None, direct_sense=True, output_path=None, min_sequence_length=20)[source]¶

Extract features from input and return in a dictionary.

Optionally save the features in separate files.

Parameters:

filename (str) – Input sequence file path (e.g. Genbank).
directory (str) – Directory name containing input sequence files.
construct_list (list of SeqRecord) – A list of SeqRecords.
direct_sense (bool) – If True, make antisense features into direct-sense in the exported files.
output_path (str) – Path for the exported feature and report files.
min_sequence_length (int) – Discard sequences with length less than this integer.

easy_dna.insert_segment(seq, pos, inserted)[source]¶: Return the sequence with the segment inserted added, starting at index pos.

easy_dna.is_genbank_standard(filepath)[source]¶: Check the LOCUS line of a Genbank file.

easy_dna.list_common_enzymes(site_length=(6,), opt_temp=(37,), min_suppliers=1, site_unlike=())[source]¶

Return a list of enzyme names with the given constraints.

Parameters:

site_length (list of int) – List of accepted site lengths (6, 4, …).
opt_temp (list of int) – List of accepted optimal temperatures for the enzyme.
min_suppliers (int) – Minimal number of registered suppliers in the Biopython data. A minimum of 3 known suppliers returns the most common enzymes.
site_unlike (list of str) – List of (ambiguous or unambiguous) DNA sequences that should NOT be recognized by the selected enzymes.

easy_dna.load_record(filename, record_id='filename', adapt_id=True, upperize=False, id_cutoff=20)[source]¶

Load a FASTA/Genbank/Snapgene file as a Biopython record.

Parameters:

filename (str) – Path to the sequence file.
record_id (str) – ID of the record (“filename”: use the file name; “original”: keep the record’s original ID, which defaults to the file name if the record has no ID).
adapt_id (bool) – If True, convert ID to alphanumeric and underscore.
upperize (bool) – If True, the record’s sequence will be converted to uppercase.
id_cutoff (int, optional) – If the ID is longer than this value, it will get truncated at this cutoff to conform to guidelines and Genbank name limit. Use None for no cutoff.
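
The combined effect of adapt_id and id_cutoff can be pictured with the sketch below. It is only an illustration under stated assumptions: the exact replacement rule used by the library is not documented here, so the sketch assumes that every character other than a letter, digit, or underscore becomes an underscore. The function name `adapt_id_sketch` is hypothetical.

```python
import re


def adapt_id_sketch(record_id, id_cutoff=20):
    """Sanitize an ID, then truncate it to id_cutoff characters."""
    # Assumption: any character outside [A-Za-z0-9_] is replaced by "_".
    cleaned = re.sub(r"[^A-Za-z0-9_]", "_", record_id)
    # None means "no cutoff", mirroring the parameter description above.
    return cleaned if id_cutoff is None else cleaned[:id_cutoff]


print(adapt_id_sketch("my plasmid (v2) final-construct.gb"))
# my_plasmid__v2__fina
```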

easy_dna.random_dna_sequence(length, gc_share=None, probas=None, seed=None)[source]¶

Return a random DNA sequence (“ATGGCGT…”) with the specified length.

Parameters:

length (int) – Length of the DNA sequence.
gc_share (float, optional) – The GC content of the random sequence, as a fraction (for example, 0.3 for 30%). Overwrites probas.
probas (dict, optional) – Frequencies for the different nucleotides, for instance probas={“A”:0.2, “T”:0.3, “G”:0.3, “C”:0.2}. If not specified, all nucleotides are equiprobable (p=0.25).
seed (int, optional) – The seed to feed to the random number generator. When a seed is provided, the random results depend deterministically on the seed, thus enabling reproducibility.
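
The way these parameters interact can be sketched with the standard library alone. This is a hedged illustration, not the library's implementation: `random_dna_sketch` is a hypothetical name, and splitting gc_share evenly between G and C (and the remainder evenly between A and T) is an assumption.

```python
import random


def random_dna_sketch(length, gc_share=None, probas=None, seed=None):
    """Return a random DNA string of the given length."""
    rng = random.Random(seed)  # local generator; the global state is untouched
    if gc_share is not None:
        # gc_share takes precedence over probas, as described above.
        g_or_c = gc_share / 2
        a_or_t = (1 - gc_share) / 2
        probas = {"G": g_or_c, "C": g_or_c, "A": a_or_t, "T": a_or_t}
    if probas is None:
        probas = {"A": 0.25, "T": 0.25, "G": 0.25, "C": 0.25}
    bases = list(probas)
    weights = [probas[base] for base in bases]
    return "".join(rng.choices(bases, weights=weights, k=length))


# The same seed yields the same sequence, which makes results reproducible.
print(random_dna_sketch(12, gc_share=0.3, seed=123) ==
      random_dna_sketch(12, gc_share=0.3, seed=123))  # True
```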

easy_dna.random_protein_sequence(length, seed=None)[source]¶

Return a random protein sequence “MNQTW…YL*” of the specified length.

Parameters:

length (int) – Length of the protein sequence (in number of amino acids). Note that the sequence will always start with “M” and end with a stop codon “*” with (length-2) random amino acids in the middle.
seed (int, optional) – The seed to feed to the random number generator. When a seed is provided, the random results depend deterministically on the seed, thus enabling reproducibility.
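
The “M”, random middle, stop layout described above can be sketched as follows. This is an illustration with hypothetical names, assuming the middle residues are drawn uniformly from the 20 standard amino acids, which the description does not specify.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard one-letter codes


def random_protein_sketch(length, seed=None):
    """Return "M" + (length - 2) random residues + "*"; assumes length >= 2."""
    rng = random.Random(seed)
    middle = "".join(rng.choices(AMINO_ACIDS, k=length - 2))
    return "M" + middle + "*"


protein = random_protein_sketch(10, seed=42)
print(len(protein), protein[0], protein[-1])  # 10 M *
```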

easy_dna.record_with_different_sequence(record, new_seq)[source]¶: Return a version of the record with the sequence set to new_seq.

easy_dna.records_from_data_files(filepaths=None, folder=None)[source]¶: Automatically convert files or a folder’s content to Biopython records.

easy_dna.replace_segment(seq, start, end, replacement)[source]¶: Return the sequence with seq[start:end] replaced by replacement.

easy_dna.reverse_complement(dna_sequence)[source]¶

Return the reverse-complement of the DNA sequence.

For instance reverse_complement("ATGCCG") returns "CGGCAT".

Uses Biopython for speed.
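
For readers who want to see what complement and reverse_complement compute, here is a pure-Python sketch using `str.translate`. It is not how the library does it (the library delegates to Biopython, as noted above), it handles only uppercase A, T, G and C, and the function names are hypothetical.

```python
# Translation table swapping each base with its Watson-Crick partner.
COMPLEMENT_TABLE = str.maketrans("ATGC", "TACG")


def complement_sketch(dna_sequence):
    """Replace every base by its complement, keeping the order."""
    return dna_sequence.translate(COMPLEMENT_TABLE)


def reverse_complement_sketch(dna_sequence):
    """Complement the sequence, then read it backwards."""
    return complement_sketch(dna_sequence)[::-1]


print(complement_sketch("ATGCCG"))          # TACGGC
print(reverse_complement_sketch("ATGCCG"))  # CGGCAT
```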

easy_dna.reverse_segment(seq, start, end)[source]¶: Return the sequence with segment seq[start:end] reverse-complemented.

easy_dna.reverse_translate(protein_sequence, randomize_codons=False)[source]¶

Return a DNA sequence which translates to the provided protein sequence.

Note: at the moment, the first valid codon found is used for each amino-acid (so it is deterministic but no codon-optimization is done).
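
The “first valid codon” strategy can be sketched as below. The sketch assumes the standard genetic code and a T, C, A, G codon ordering, so the codon it picks for a given amino acid may differ from the one the library picks; all names are hypothetical.

```python
from itertools import product

# Standard genetic code, listed in the conventional TCAG codon order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Keep only the first codon encountered for each amino acid.
FIRST_CODON = {}
for codon, aa in CODON_TO_AA.items():
    FIRST_CODON.setdefault(aa, codon)


def reverse_translate_sketch(protein_sequence):
    """Deterministically pick one codon per amino acid (no optimization)."""
    return "".join(FIRST_CODON[aa] for aa in protein_sequence)


def translate_sketch(dna_sequence):
    """Translate complete codons back, to check the round trip."""
    usable = len(dna_sequence) - len(dna_sequence) % 3
    return "".join(CODON_TO_AA[dna_sequence[i:i + 3]] for i in range(0, usable, 3))


dna = reverse_translate_sketch("MK*")
print(dna)                    # ATGAAATAA
print(translate_sketch(dna))  # MK*
```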

easy_dna.sequence_to_biopython_record(sequence, id='<unknown id>', name='Exported', features=())[source]¶: Return a SeqRecord of a DNA sequence, ready to be Genbanked.

easy_dna.swap_segments(seq, pos1, pos2)[source]¶

Return a new sequence with segments at position pos1 and pos2 swapped.

pos1 and pos2 are both (start, end) tuples: pos1 = (start1, end1) and pos2 = (start2, end2).
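
A slicing-based sketch of the swap is shown below. It assumes 0-based, end-excluded positions and that the first segment lies entirely before the second without overlap; none of these details are stated above, and `swap_segments_sketch` is a hypothetical name.

```python
def swap_segments_sketch(seq, pos1, pos2):
    """Exchange seq[start1:end1] and seq[start2:end2] (non-overlapping)."""
    (start1, end1), (start2, end2) = sorted([pos1, pos2])
    return (
        seq[:start1]
        + seq[start2:end2]   # second segment moves to the first location
        + seq[end1:start2]   # the stretch between the segments is unchanged
        + seq[start1:end1]   # first segment moves to the second location
        + seq[end2:]
    )


print(swap_segments_sketch("AAACCCGGGTTT", (0, 3), (9, 12)))  # TTTCCCGGGAAA
```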

easy_dna.translate(dna_sequence, translation_table='Bacterial')[source]¶

Translate the DNA sequence into an amino-acid sequence “MLKYQT…”.

If translation_table is the name or number of an NCBI genetic table, Biopython will be used. See here for options:

http://biopython.org/DIST/docs/tutorial/Tutorial.html#sec26

translation_table can also be a dictionary of the form {"ATT": "M", "CTC": "X", etc.} for more exotic translation tables.
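
The dictionary form amounts to a codon-by-codon lookup. The sketch below illustrates that idea with a tiny made-up table; it is not the library's code, `translate_with_table_sketch` and `toy_table` are hypothetical, and dropping an incomplete trailing codon is an assumption.

```python
def translate_with_table_sketch(dna_sequence, translation_table):
    """Look up each complete codon in a {codon: amino acid} dictionary."""
    # Step through the sequence three bases at a time, ignoring any
    # incomplete codon at the end (an assumption of this sketch).
    codons = [dna_sequence[i:i + 3] for i in range(0, len(dna_sequence) - 2, 3)]
    return "".join(translation_table[codon] for codon in codons)


# A deliberately tiny table, just enough for the example below.
toy_table = {"ATG": "M", "AAA": "K", "TAA": "*"}
print(translate_with_table_sketch("ATGAAATAA", toy_table))  # MK*
```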

easy_dna.write_record(record, target, id_cutoff=20, adapt_id=True)[source]¶

Write a DNA record as Genbank or FASTA via Biopython, with fixes.

Parameters:

record (SeqRecord) – Biopython SeqRecord.
target (str or StringIO) – Filepath, string, or StringIO instance. Desired sequence format is inferred from the ending. If it’s a directory, it uses the ID as filename and exports Genbank.
id_cutoff (int, optional) – If the ID is longer than this value, it will get truncated at this cutoff to conform to guidelines and Genbank name limit. Use None for no cutoff.
adapt_id (bool) – If True, convert ID to alphanumeric and underscore.