Genome Collector API Reference

class genome_collector.GenomeCollection(data_dir='default', logger='bar')

Collection of local data files including genomes and BLAST databases.

Parameters

data_dir

Path to the directory where all genome information, sequences, and BLAST databases will be kept. The default value “local_default” means that the system-specific user data directory will be used, for instance ~/.local/share/genome_collector on Linux or ~/Library/Application Support/genome_collector on macOS.

logger

Logger to which the messages will be sent when downloading files or building blast databases. Use “bar” for a default bar logger, None for no logging, or any Proglog logger.

Attributes

autodownload

When an absent local data file is requested and autodownload is True, the file is automatically downloaded over the Internet. Otherwise a FileNotFoundError is raised. The autodownload attribute can be set to False on a single collection instance, or at class level with GenomeCollection.autodownload = False to globally prevent Genome Collector from downloading data files.
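
When autodownload is disabled, calling code may want to handle the missing-file case itself. The following is a minimal sketch of that pattern; the helper name local_genome_path_or_none is hypothetical and not part of Genome Collector, and the collection argument is assumed to be a GenomeCollection with autodownload set to False.

```python
def local_genome_path_or_none(collection, taxid, data_type="genomic_fasta"):
    """Return the local path of a TaxID's data file, or None if it is absent.

    `collection` is assumed to be a GenomeCollection whose autodownload
    attribute is False, so that a missing file raises FileNotFoundError
    instead of triggering a download. Illustrative helper, not library code.
    """
    try:
        return collection.get_taxid_genome_data_path(taxid, data_type=data_type)
    except FileNotFoundError:
        return None
```

A caller can then decide whether to skip the TaxID, report it, or re-enable autodownload for that one request.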

default_dir

Default directory used when the data_dir parameter is 'default'. Setting this attribute at the beginning of a script, with GenomeCollection.default_dir = '/my/new/dir/', defines a global default directory.

messages_prefix

Prefix appearing as “[prefix] ” in all logging messages.

datafiles_extensions

Dictionary linking data file types to standardized file extensions.

blast_against_taxid(taxid, db_type, blast_args)

Run a BLAST, using a genome_collector database.

Parameters

taxid

TaxID (int or str) of the reference genome against which the query will be blasted. If no database for this TaxID is available locally, it will be downloaded.

db_type

Either “nucl” (nucleotides database for blastn, blastx) or “prot” (protein database, untested).

blast_args

List of NCBI-BLAST arguments, for instance ['blastn', '-query', 'my_sequences.fa', '-out', 'myresults.xml'].
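
As an illustration of the blast_args parameter, the sketch below builds such a list for a blastn search. The helper name make_blastn_args is hypothetical. The "-outfmt 5" option is the NCBI BLAST+ flag for XML output. No database argument is included, on the assumption (not confirmed here) that blast_against_taxid supplies the database for the given TaxID.

```python
def make_blastn_args(query_fasta, output_xml):
    """Build a blast_args list for a blastn search with XML output.

    Hypothetical helper. "-outfmt 5" selects XML output in NCBI BLAST+.
    The database is deliberately left out: this sketch assumes that
    blast_against_taxid adds it for the requested TaxID.
    """
    return ["blastn", "-query", query_fasta, "-out", output_xml, "-outfmt", "5"]


blast_args = make_blastn_args("my_sequences.fa", "myresults.xml")
```

The resulting list can be passed as the blast_args parameter together with a taxid and a db_type of "nucl".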

Examples

>>> collection.blast_against_taxid(taxid, db_type, blast_args)

datafile_path(taxid, data_type)

Return a standardized datafile path for the given TaxID.

Unlike get methods such as self.get_taxid_genome_data_path(), this method only returns the path, and does not check whether the files exist locally or not.

Parameter data_type should be one of genomic_fasta, protein_fasta, blast_nucl, blast_prot, genomic_gz, protein_gz, infos.
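
Because datafile_path only returns a path, checking for the file is left to the caller. A minimal sketch of such a check follows; the helper name local_file_status is hypothetical, and the default selection of data types is an arbitrary choice restricted to types that are assumed to map to a single file.

```python
import os


def local_file_status(
    collection, taxid, data_types=("genomic_fasta", "protein_fasta", "infos")
):
    """Map each data type to True if its standardized path exists locally.

    Hypothetical helper: it only calls collection.datafile_path(), which
    never downloads anything, then tests the returned path on disk.
    """
    return {
        data_type: os.path.exists(collection.datafile_path(taxid, data_type))
        for data_type in data_types
    }
```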

download_taxid_genome_data_from_ncbi(taxid, data_type)

Download a gz archive from NCBI and uncompress it.

data_type is either genomic_fasta, genomic_genbank, genomic_gff, or protein_fasta.

download_taxid_genome_infos_from_ncbi(taxid, assembly_id=None)

Download information on the TaxID and store it in ‘[taxid].json’.

For taxIDs with several genomes listed on NCBI, you can provide an assembly_id, which can also be of the form “#1” to select the first available NCBI Assembly ID (first in numerical order).
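
To make the “#1” form concrete, here is a sketch of how such a selector could be resolved against a list of assembly IDs. This is an illustration of the numerical-order rule only, not the library's implementation; the function name resolve_assembly_id is hypothetical, and support for “#2”, “#3”, and so on is an assumption extrapolated from the “#1” case.

```python
def resolve_assembly_id(available_ids, assembly_id):
    """Resolve an assembly_id that may use the "#N" form.

    `available_ids` is a list of NCBI Assembly IDs (strings or ints).
    "#1" selects the smallest ID in numerical order; higher ranks are an
    assumed extension of that rule. Any other value is returned unchanged
    as a string.
    """
    assembly_id = str(assembly_id)
    if not assembly_id.startswith("#"):
        return assembly_id
    rank = int(assembly_id[1:])  # 1-based position in numerical order
    if rank < 1:
        raise ValueError("rank must be 1 or greater, got %d" % rank)
    ordered = sorted((str(i) for i in available_ids), key=int)
    return ordered[rank - 1]
```

Note that the ordering is numerical rather than alphabetical, so "79781" sorts before "1755381".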

generate_blast_db_for_taxid(taxid, db_type='nucl')

Generate a BLAST database for the TaxID, downloading the FASTA file first if needed.

db_type is either “nucl” (nucleotides database for blastn, blastx) or “prot” (protein database, untested).

generate_bowtie_index_for_taxid(taxid, version='1')

Generate a Bowtie (1 or 2) index for the given TaxID.

get_taxid_biopython_records(taxid, source_type='genomic_genbank', as_iterator=False)

Return a list of Biopython records for the genome’s chromosomes.

Even if there is a single record (for genomes with a single chromosome), a list is returned.

For huge genomes, use the as_iterator option to return a Python iterator, which avoids loading all chromosomes into memory at once.
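
The as_iterator option suits computations that only need one record at a time. A minimal sketch, assuming the returned records support len() as Biopython SeqRecord objects do; the helper name total_genome_length is hypothetical.

```python
def total_genome_length(collection, taxid):
    """Sum the lengths of all records of a genome, one record at a time.

    Hypothetical helper. With as_iterator=True the records are consumed
    lazily, so only one chromosome needs to be held in memory at once.
    """
    records = collection.get_taxid_biopython_records(taxid, as_iterator=True)
    return sum(len(record) for record in records)
```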

get_taxid_blastdb_path(taxid, db_type)

Get the path to a local BLAST database, downloading the data and creating the database if needed.

db_type is either “nucl” (nucleotides database for blastn, blastx) or “prot” (protein database, untested).

get_taxid_bowtie_index_path(taxid, version='1')

Get a path to the Bowtie (1 or 2) index for the given TaxID.

This will download data and generate the index if necessary. It requires Bowtie (1 or 2) to be installed.
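
The returned path can then be handed to the aligner. The sketch below builds a Bowtie 2 command line using its standard -x (index), -U (unpaired reads), and -S (SAM output) options. The helper name make_bowtie2_args is hypothetical, and it is an assumption that the returned path can be used directly as the -x index base name and that the version is passed as the string '2'.

```python
def make_bowtie2_args(collection, taxid, reads_fastq, output_sam):
    """Build a bowtie2 argument list for aligning unpaired reads.

    Hypothetical helper. Assumes the path returned by
    get_taxid_bowtie_index_path() is usable as the bowtie2 index base name.
    """
    index_path = collection.get_taxid_bowtie_index_path(taxid, version="2")
    return ["bowtie2", "-x", index_path, "-U", reads_fastq, "-S", output_sam]
```

The list can be run with subprocess.run(args, check=True) once Bowtie 2 is installed.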

get_taxid_genome_data_path(taxid, data_type='genomic_fasta')

Return a path to the taxid’s genome sequence. Download if needed.

data_type is either genomic_fasta, genomic_genbank, genomic_gff, or protein_fasta.

get_taxid_infos(taxid)

Return a dict with data about the taxid.

Examples

>>> collection.get_taxid_infos(511145)
{
     'Organism Name': 'Escherichia Coli',
     'DefLine': 'A well-studied enteric bacterium',
     'Organism_Kingdom': 'Bacteria',
     'AssemblyID': '1755381',
     ...
}

list_locally_available_taxids(data_type='infos')

Return all taxIDs for which there is a local data file of this type.

Parameter data_type should be one of genomic_fasta, protein_fasta, blast_nucl, blast_prot, genomic_gz, protein_gz, infos.
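
Comparing the lists returned for two data types shows which TaxIDs still lack a given file. A minimal sketch, with the hypothetical helper name taxids_missing; it assumes the returned taxIDs are hashable and mutually comparable (for example, all strings).

```python
def taxids_missing(collection, data_type, reference_type="infos"):
    """List TaxIDs that have a `reference_type` file but no `data_type` file.

    Hypothetical helper built only on list_locally_available_taxids().
    """
    known = set(collection.list_locally_available_taxids(data_type=reference_type))
    present = set(collection.list_locally_available_taxids(data_type=data_type))
    return sorted(known - present)
```

For instance, taxids_missing(collection, "blast_nucl") would list the genomes whose information is stored locally but for which no nucleotide BLAST database has been built yet.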

list_locally_available_taxids_names(print_mode=False)

Return a dictionary {taxid: scientific_name} of all local taxIDs.

For convenience, when print_mode is set to True, the table is printed in alphabetical order instead of being returned as a dict.

remove_all_local_data_files()

Remove all the locally stored data files.

remove_all_taxid_files(taxid)

Remove all local data files for this TaxID. Return a list of names.

Examples

To remove all local data files:

>>> from genome_collector import GenomeCollection
>>> collection = GenomeCollection()
>>> for taxid in collection.list_locally_available_taxids():
...     collection.remove_all_taxid_files(taxid)