easy_dna

Easy_dna is a Python library implementing useful routines for manipulating DNA sequences, either as “ATGC” strings or Biopython records. It aims to provide a simpler interface than Biopython for common operations related to DNA sequence design and Genbank file generation.

Easy_dna was originally created to gather useful methods repeatedly used in the different software projects of the Edinburgh Genome Foundry for DNA design and manufacturing.

See the API reference here.

Installation

You can install easy_dna with pip:

pip install easy_dna

Alternatively, you can unzip the sources in a folder and type:

python setup.py install

License = MIT

Easy_dna is open-source software originally written at the Edinburgh Genome Foundry by Zulko and released on GitHub under the MIT licence (Copyright 2019 Edinburgh Genome Foundry). Everyone is welcome to contribute!

More biology software

Easy_dna is part of the EGF Codons synthetic biology software suite for DNA design, manufacturing and validation.

Reference

easy_dna.all_iupac_variants(iupac_sequence)[source]

Return all unambiguous possible versions of the given sequence.

Examples

>>> all_iupac_variants('ATN')
['ATA', 'ATC', 'ATG', 'ATT']

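
As a rough illustration of the expansion (not the library's actual implementation), the sketch below maps each IUPAC code to the bases it stands for and takes the Cartesian product of the choices. The names IUPAC_BASES and iupac_variants_sketch are hypothetical.

```python
from itertools import product

# Standard IUPAC nucleotide codes and the bases each one stands for.
IUPAC_BASES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT",
    "K": "GT", "M": "AC", "B": "CGT", "D": "AGT",
    "H": "ACT", "V": "ACG", "N": "ACGT",
}


def iupac_variants_sketch(iupac_sequence):
    """Hypothetical helper: list every unambiguous version of the sequence."""
    choices = [IUPAC_BASES[base] for base in iupac_sequence]
    return ["".join(combo) for combo in product(*choices)]


print(iupac_variants_sketch("ATN"))  # ['ATA', 'ATC', 'ATG', 'ATT']
```

The number of variants grows multiplicatively with each ambiguous position (for example, 16 variants for 'NN'), so long, highly ambiguous sequences can produce very large lists.
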
easy_dna.annotate_record(seqrecord, location='full', feature_type='misc_feature', margin=0, **qualifiers)[source]

Add a feature to a Biopython SeqRecord.

Parameters
seqrecord

The Biopython SeqRecord to be annotated.

location

Either (start, end) or (start, end, strand). (strand defaults to +1).

feature_type

The type associated with the feature.

margin

Number of extra bases added on each side of the given location.

qualifiers

Dictionary that will be the Biopython feature’s qualifiers attribute.

easy_dna.anonymized_record(record, record_id='anonymized', label_generator='feature_%d')[source]

Return a record with removed annotations/keywords/features/etc.

Warning: this does not change the record sequence!

Parameters
record

The record to be anonymized.

record_id

ID of the new record.

label_generator

Recipe to change feature labels. Either "feature_%d", None (no label), or a function (i, feature)=>label.

easy_dna.censor_genbank(filename, target, **censor_params)[source]

Load Genbank file and write censored version.

Parameters
filename

Path to the file containing the record.

target

Path to output genbank file.

censor_params

Optional parameters. See censor_record() for details.

easy_dna.censor_record(record, record_id='censored', label_generator='feature_%d', keep_topology=False, anonymise_features=True, preserve_sites=None)[source]

Return a record with random sequence and censored annotations/features.

Useful for creating example files or anonymising sequences for bug reports.

Parameters
record

The record to be anonymized.

record_id

ID of the new record.

label_generator

Recipe to change feature labels. Either "feature_%d", None (no label), or a function (i, feature)=>label.

keep_topology

Whether to keep the record topology or not.

anonymise_features

Whether to replace feature labels and ID/name, or not.

preserve_sites

List of enzyme sites to keep. Example: ["BsmBI", "BsaI"]. Preserves the sequence around cut sites of the specified enzymes.

easy_dna.complement(dna_sequence)[source]

Return the complement of the DNA sequence.

For instance complement("ATGCCG") returns "TACGGC".

Uses Biopython for speed.
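
For readers who want to see the operation without Biopython, here is a minimal pure-Python sketch. It is not the library's code (which delegates to Biopython), it only handles uppercase A, T, G and C, and the name complement_sketch is hypothetical.

```python
# Translation table for the four unambiguous bases (assumption: uppercase input).
_COMPLEMENT = str.maketrans("ATGC", "TACG")


def complement_sketch(dna_sequence):
    """Hypothetical stand-in: swap each base for its complement, keeping order."""
    return dna_sequence.translate(_COMPLEMENT)


print(complement_sketch("ATGCCG"))  # TACGGC
```
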

easy_dna.copy_and_paste_segment(seq, start, end, new_start)[source]

Return the sequence with segment [start, end] also copied elsewhere, starting at new_start.

easy_dna.cut_and_paste_segment(seq, start, end, new_start)[source]

Move a subsequence by “diff” nucleotides to the left or the right.
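
The description does not say how “diff” relates to the arguments. One plausible reading, assumed here, is diff = new_start - start with Python-style half-open slices, so that the moved segment ends up starting at new_start. The sketch below follows that assumption; the name cut_and_paste_sketch is hypothetical.

```python
def cut_and_paste_sketch(seq, start, end, new_start):
    """Hypothetical sketch: move seq[start:end] so that it begins at new_start."""
    segment = seq[start:end]
    diff = new_start - start  # assumed meaning of "diff": >0 moves right, <0 left
    if diff > 0:
        # The bases that followed the segment slide left to fill the gap.
        return seq[:start] + seq[end:end + diff] + segment + seq[end + diff:]
    # The bases that preceded the segment slide right to fill the gap.
    return seq[:start + diff] + segment + seq[start + diff:start] + seq[end:]


print(cut_and_paste_sketch("AAACCCGGGTTT", 3, 6, 6))  # AAAGGGCCCTTT
print(cut_and_paste_sketch("AAACCCGGGTTT", 3, 6, 0))  # CCCAAAGGGTTT
```
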

easy_dna.delete_nucleotides(seq, start, n)[source]

Return the sequence with n deletions from position start.

easy_dna.delete_segment(seq, start, end)[source]

Return the sequence with the segment from start to end deleted.
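
Both deletion helpers can be pictured as plain string slicing. The sketch below assumes 0-based positions and a half-open [start, end) segment, which the reference does not state explicitly; the function names are hypothetical.

```python
def delete_nucleotides_sketch(seq, start, n):
    """Hypothetical sketch: drop n nucleotides beginning at index start."""
    return seq[:start] + seq[start + n:]


def delete_segment_sketch(seq, start, end):
    """Hypothetical sketch: drop seq[start:end] (end is assumed exclusive)."""
    return seq[:start] + seq[end:]


print(delete_nucleotides_sketch("ATGCCGTA", 2, 3))  # ATGTA
print(delete_segment_sketch("ATGCCGTA", 2, 5))      # ATGTA
```
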

easy_dna.dna_pattern_to_regexpr(dna_pattern)[source]

Return a regular expression pattern for the provided DNA pattern.

For instance dna_pattern_to_regexpr('ATTNN') returns "ATT[A|T|G|C][A|T|G|C]".
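
To show how such a pattern can be used with Python's re module, here is a small sketch. It builds character classes without the | separators: inside square brackets | is a literal character, so the form shown above behaves the same on plain DNA strings. The mapping and the name dna_pattern_to_regex_sketch are hypothetical and the output string is not identical to the library's.

```python
import re

# Standard IUPAC nucleotide codes and the bases each one stands for.
IUPAC_BASES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT",
    "K": "GT", "M": "AC", "B": "CGT", "D": "AGT",
    "H": "ACT", "V": "ACG", "N": "ACGT",
}


def dna_pattern_to_regex_sketch(dna_pattern):
    """Hypothetical sketch: turn each ambiguous code into a character class."""
    return "".join(
        base if len(IUPAC_BASES[base]) == 1 else "[" + IUPAC_BASES[base] + "]"
        for base in dna_pattern
    )


regex = dna_pattern_to_regex_sketch("ATTNN")
print(regex)                                # ATT[ACGT][ACGT]
print(re.findall(regex, "GGATTCAGGATTTT"))  # ['ATTCA', 'ATTTT']
```
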

easy_dna.extract_from_input(filename=None, directory=None, construct_list=None, direct_sense=True, output_path=None, min_sequence_length=20)[source]

Extract features from input and return in a dictionary.

Optionally save the features in separate files.

Parameters
filename

Input sequence file (Genbank).

directory

Directory name containing input sequence files.

construct_list

A list of SeqRecords.

direct_sense

If True: make antisense features into direct-sense in the exported files.

output_path

Path for the exported feature and report files.

min_sequence_length

Discard sequences with length less than this integer.

easy_dna.insert_segment(seq, pos, inserted)[source]

Return the sequence with the segment inserted added, starting at index pos.

easy_dna.list_common_enzymes(site_length=6, opt_temp=37, min_suppliers=1, site_unlike=())[source]

Return a list of enzyme names with the given constraints.

Parameters
site_length

List of accepted site lengths (6, 4, …).

opt_temp

List of accepted optimal temperatures for the enzyme.

min_suppliers

Minimum number of registered suppliers in the Biopython data. A minimum of 3 known suppliers returns the most common enzymes.

site_unlike

List of (ambiguous or unambiguous) DNA sequences that should NOT be recognized by the selected enzymes.

easy_dna.load_record(filename, record_id='auto', upperize=False, id_cutoff=20)[source]

Load a Fasta/Genbank/Snapgene file as a Biopython record.

Parameters
filename

Path to the file containing the record.

record_id

Id of the record (leave to “auto” to keep the record’s original Id, which will default to the file name if the record has no Id).

upperize

If true, the record’s sequence will be upperized.

id_cutoff

If the Id is read from a filename, it will get truncated at this cutoff to avoid errors at report write time.

easy_dna.random_dna_sequence(length, gc_share=None, probas=None, seed=None)[source]

Return a random DNA sequence (“ATGGCGT…”) with the specified length.

Parameters
length

Length of the DNA sequence.

gc_share

The GC content of the random sequence, as a fraction (for example, 0.3 for 30%). Overwrites probas.

probas

Frequencies for the different nucleotides, for instance probas={"A":0.2, "T":0.3, "G":0.3, "C":0.2}. If not specified, all nucleotides are equiprobable (p=0.25).

seed

The seed to feed to the random number generator. When a seed is provided the random results depend deterministically on the seed, thus enabling reproducibility.
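
The interplay between gc_share, probas and seed can be sketched with the standard library alone. This is an illustration, not the library's implementation: it assumes gc_share is split evenly between G and C (and the remainder evenly between A and T), and because it uses Python's random module its output for a given seed will generally differ from easy_dna's. The name random_dna_sketch is hypothetical.

```python
import random


def random_dna_sketch(length, gc_share=None, probas=None, seed=None):
    """Hypothetical sketch of a seeded, weighted random DNA generator."""
    rng = random.Random(seed)  # a private generator, so the seed is reproducible
    if gc_share is not None:
        # gc_share takes precedence over probas (assumed even G/C and A/T split).
        probas = {
            "G": gc_share / 2, "C": gc_share / 2,
            "A": (1 - gc_share) / 2, "T": (1 - gc_share) / 2,
        }
    if probas is None:
        probas = {base: 0.25 for base in "ATGC"}
    bases = list(probas)
    weights = [probas[base] for base in bases]
    return "".join(rng.choices(bases, weights=weights, k=length))


same_a = random_dna_sketch(30, gc_share=0.3, seed=123)
same_b = random_dna_sketch(30, gc_share=0.3, seed=123)
print(same_a == same_b)  # True: identical seeds give identical sequences
```
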

easy_dna.random_protein_sequence(length, seed=None)[source]

Return a random protein sequence “MNQTW…YL*” of the specified length.

Parameters
length

Length of the protein sequence (in number of amino-acids). Note that the sequence will always start with "M" and end with a stop codon "*" with (length-2) random amino-acids in the middle.

seed

The seed to feed to the random number generator. When a seed is provided the random results depend deterministically on the seed, thus enabling reproducibility.

easy_dna.record_with_different_sequence(record, new_seq)[source]

Return a version of the record with the sequence set to new_seq.

easy_dna.records_from_data_files(filepaths=None, folder=None)[source]

Automatically convert files or a folder’s content to Biopython records.

easy_dna.replace_segment(seq, start, end, replacement)[source]

Return the sequence with seq[start:end] replaced by replacement.
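
Since the segment is described with slice notation, the operation can be sketched as a single slicing expression. The name replace_segment_sketch is hypothetical. Insertion and deletion fall out as special cases (an empty segment and an empty replacement, respectively), which is one way to see how the segment helpers in this reference relate to each other.

```python
def replace_segment_sketch(seq, start, end, replacement):
    """Hypothetical sketch: substitute seq[start:end] with replacement."""
    return seq[:start] + replacement + seq[end:]


print(replace_segment_sketch("ATGCCGTA", 2, 5, "TTTT"))  # ATTTTTGTA
print(replace_segment_sketch("ATGCCGTA", 2, 2, "NN"))    # ATNNGCCGTA (insertion)
print(replace_segment_sketch("ATGCCGTA", 2, 5, ""))      # ATGTA (deletion)
```
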

easy_dna.reverse_complement(dna_sequence)[source]

Return the reverse-complement of the DNA sequence.

For instance reverse_complement("ATGCCG") returns "CGGCAT".

Uses Biopython for speed.

easy_dna.reverse_segment(seq, start, end)[source]

Return the sequence with segment seq[start:end] reverse-complemented.
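
A minimal pure-Python sketch of this in-place inversion, assuming uppercase A/T/G/C input. The helper names are hypothetical and the library itself relies on Biopython for the reverse-complement step.

```python
# Translation table for the four unambiguous bases (assumption: uppercase input).
_COMPLEMENT = str.maketrans("ATGC", "TACG")


def reverse_complement_sketch(dna_sequence):
    """Hypothetical stand-in: complement each base, then reverse the string."""
    return dna_sequence.translate(_COMPLEMENT)[::-1]


def reverse_segment_sketch(seq, start, end):
    """Hypothetical sketch: reverse-complement only seq[start:end]."""
    return seq[:start] + reverse_complement_sketch(seq[start:end]) + seq[end:]


print(reverse_segment_sketch("TTATGCCGAA", 2, 8))  # TTCGGCATAA
```

Applying the operation twice to the same segment restores the original sequence.
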

easy_dna.reverse_translate(protein_sequence, randomize_codons=False)[source]

Return a DNA sequence which translates to the provided protein sequence.

Note: at the moment, the first valid codon found is used for each amino-acid (so it is deterministic but no codon-optimization is done).
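
The "first valid codon found" strategy can be sketched as follows. This illustration assumes the standard genetic code and enumerates codons in T, C, A, G order, so the codon it picks for a given amino acid may differ from the one easy_dna picks. All names are hypothetical.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Keep only the first codon encountered for each amino acid.
FIRST_CODON = {}
for codon, aa in CODON_TABLE.items():
    FIRST_CODON.setdefault(aa, codon)


def reverse_translate_sketch(protein_sequence):
    """Hypothetical sketch: deterministic, with no codon optimization."""
    return "".join(FIRST_CODON[aa] for aa in protein_sequence)


print(reverse_translate_sketch("MK*"))  # ATGAAATAA
```
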

easy_dna.sequence_to_biopython_record(sequence, id='<unknown id>', name='<unknown name>', features=())[source]

Return a SeqRecord of the sequence, ready to be Genbanked.

easy_dna.swap_segments(seq, pos1, pos2)[source]

Return a new sequence with segments at position pos1 and pos2 swapped.

pos1 and pos2 are both (start, end) tuples, i.e. (start1, end1) and (start2, end2).
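
A slicing-based sketch of the swap, assuming the two segments do not overlap and that each tuple is a half-open (start, end) range. Sorting the tuples lets the caller pass them in either order. The name swap_segments_sketch is hypothetical.

```python
def swap_segments_sketch(seq, pos1, pos2):
    """Hypothetical sketch: exchange two non-overlapping segments."""
    (start1, end1), (start2, end2) = sorted([pos1, pos2])
    return (
        seq[:start1]
        + seq[start2:end2]  # second segment takes the first segment's place
        + seq[end1:start2]  # the stretch between the segments is unchanged
        + seq[start1:end1]  # first segment takes the second segment's place
        + seq[end2:]
    )


print(swap_segments_sketch("AAACCCGGGTTT", (6, 9), (0, 3)))  # GGGCCCAAATTT
```
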

easy_dna.translate(dna_sequence, translation_table='Bacterial')[source]

Translate the DNA sequence into an amino-acid sequence “MLKYQT…”.

If translation_table is the name or number of an NCBI genetic table, Biopython will be used. See here for options:

http://biopython.org/DIST/docs/tutorial/Tutorial.html#sec26

translation_table can also be a dictionary of the form {"ATT": "M", "CTC": "X", etc.} for more exotic translation tables.
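
The dictionary form can be pictured as a codon-by-codon lookup. The sketch below is not the library's code: it builds the standard genetic code as a dictionary, assumes any trailing incomplete codon is dropped, and shows a hypothetical "exotic" table in which TGA encodes tryptophan. All names are hypothetical.

```python
from itertools import product

# Standard genetic code as a {codon: amino acid} dictionary (T, C, A, G order).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_sketch(dna_sequence, translation_table):
    """Hypothetical sketch: look up each complete codon in the given dict."""
    usable = len(dna_sequence) - len(dna_sequence) % 3  # assumption: drop leftovers
    return "".join(
        translation_table[dna_sequence[i:i + 3]] for i in range(0, usable, 3)
    )


# A custom table that reads TGA as tryptophan instead of stop.
custom_table = dict(STANDARD_TABLE, TGA="W")

print(translate_sketch("ATGAAATAA", STANDARD_TABLE))  # MK*
print(translate_sketch("ATGTGA", STANDARD_TABLE))     # M*
print(translate_sketch("ATGTGA", custom_table))       # MW
```
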

easy_dna.write_record(record, target, fmt='genbank')[source]

Write a record as genbank, fasta, etc. via Biopython, with fixes.