Genome Collector API Reference
-
class genome_collector.GenomeCollection(data_dir='default', logger='bar')
Collection of local data files including genomes and BLAST databases.
- Parameters
data_dir
Path to the directory where all genome information, sequences, and BLAST databases will be kept. The default value “default” means that the system-specific user data directory will be used, for instance ~/.local/share/genome_collector on Linux or ~/Library/Application Support/genome_collector on macOS.
logger
Logger to which messages are sent when downloading files or building BLAST databases. Use “bar” for a default bar logger, None for no logging, or any Proglog logger.
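As a rough illustration of the platform-specific locations quoted for data_dir, the sketch below reproduces the two documented paths using only the standard library. The helper name `guess_default_data_dir` and its arguments are hypothetical, and this reference does not show how Genome Collector itself resolves the directory; platforms other than Linux and macOS are deliberately left unhandled.

```python
import sys
from pathlib import Path


def guess_default_data_dir(app_name="genome_collector", platform=None, home=None):
    """Return the documented default data directory, or None if unknown.

    Hypothetical helper: it only mirrors the two paths quoted in the
    reference and is not part of the genome_collector API.
    """
    platform = sys.platform if platform is None else platform
    home = Path.home() if home is None else Path(home)
    if platform == "darwin":  # macOS
        return home / "Library" / "Application Support" / app_name
    if platform.startswith("linux"):
        return home / ".local" / "share" / app_name
    return None  # other systems are not covered by this sketch
```

Calling `guess_default_data_dir()` with no arguments uses the current platform and home directory; the optional arguments exist only to make the function easy to try out.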
Attributes
autodownload
When an absent local data file is requested and autodownload is True, the file is automatically downloaded over the Internet. Otherwise a FileNotFoundError is raised. The autodownload attribute can be set to False on a single collection instance, or at class level with GenomeCollection.autodownload = False to globally prevent Genome Collector from downloading data files.
default_dir
Default directory to use when parameter data_dir='default'. This attribute makes it possible to set a global default_dir at the beginning of a script with GenomeCollection.default_dir = '/my/new/dir/'
messages_prefix
Prefix appearing as “[prefix] ” in all logging messages.
datafiles_extensions
Dictionary linking data file types to standardized file extensions.
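The difference between instance-level and class-level settings follows ordinary Python attribute lookup. The sketch below shows it with a stand-in class; `StandInCollection` is hypothetical and copies only the two documented attributes, so it says nothing about how the real class is implemented.

```python
class StandInCollection:
    """Hypothetical stand-in exposing only the two documented class attributes."""

    autodownload = True
    default_dir = None

    def __init__(self, data_dir="default"):
        # Fall back to the class-wide default_dir when 'default' is requested.
        self.data_dir = self.default_dir if data_dir == "default" else data_dir


# Class-level settings, typically made once at the top of a script.
StandInCollection.default_dir = "/my/new/dir/"
StandInCollection.autodownload = False

offline = StandInCollection()
online = StandInCollection(data_dir="/some/other/dir/")
online.autodownload = True  # instance-level override, affects only `online`

print(offline.data_dir, offline.autodownload)
print(online.data_dir, online.autodownload)
```

Instances without their own override keep following the class value, which is why a single class-level assignment acts as a global switch.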
-
blast_against_taxid(taxid, db_type, blast_args)
Run a BLAST, using a genome_collector database.
- Parameters
taxid
TaxID (int or str) of the reference genome against which the query will be BLASTed. If no database for this TaxID is available locally, it will be downloaded.
db_type
Either “nucl” (nucleotides database for blastn, blastx) or “prot” (protein database, untested).
blast_args
List of NCBI-BLAST arguments, for instance ['blastn', '-query', 'my_sequences.fa', '-out', 'myresults.xml'].
Examples
>>> blast_against_taxid(taxid, db_type, blast_args)
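A slightly fuller sketch of the same call is given below. The helper names are hypothetical, `collection` is assumed to be a GenomeCollection instance, and the sketch assumes (as the argument list in the reference suggests) that the method supplies the database path itself, so no -db argument is passed. The return value of blast_against_taxid is not documented here and is not used.

```python
def build_blastn_args(query_path, output_path, extra_args=()):
    """Assemble a blastn argument list in the form shown in the reference."""
    return ["blastn", "-query", str(query_path), "-out", str(output_path), *extra_args]


def blast_sequences_against_taxid(collection, taxid, query_path, output_path):
    """Run blastn against the genome of `taxid` and return the output path.

    `collection` is expected to be a GenomeCollection-like object.
    "-outfmt 5" asks BLAST+ to write its results as XML.
    """
    blast_args = build_blastn_args(query_path, output_path, extra_args=("-outfmt", "5"))
    collection.blast_against_taxid(taxid, "nucl", blast_args)
    return output_path
```

Keeping the argument list in its own function makes it easy to inspect or log the exact command-line options before a long BLAST run.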
-
datafile_path(taxid, data_type)
Return a standardized datafile path for the given TaxID.
Unlike get methods such as self.get_taxid_genome_data_path(), this method only returns the path and does not check whether the files exist locally.
Parameter data_type should be one of genomic_fasta, protein_fasta, blast_nucl, blast_prot, genomic_gz, protein_gz, infos.
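Because datafile_path never triggers a download, it can be combined with a plain existence check to see whether a file is already stored. The helper below is a hypothetical sketch; it defaults to genomic_fasta because the path returned for BLAST database types may be a prefix rather than a single file, which is an assumption not confirmed by this reference.

```python
import os


def is_stored_locally(collection, taxid, data_type="genomic_fasta"):
    """Return True if the standardized path for this TaxID exists on disk.

    `collection` is expected to be a GenomeCollection-like object; nothing
    is downloaded by this check.
    """
    path = collection.datafile_path(taxid, data_type)
    return os.path.exists(path)
```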
-
download_taxid_genome_data_from_ncbi(taxid, data_type)
Download and uncompress a gz file from the NCBI archives.
data_type is either genomic_fasta, genomic_genbank, genomic_gff, or protein_fasta.
-
download_taxid_genome_infos_from_ncbi(taxid, assembly_id=None)
Download infos on the TaxID and store them in ‘[taxid].json’.
For taxIDs with several genomes listed on NCBI, you can provide an assembly_id, which can also be of the form “#1” to select the first available NCBI Assembly ID (first in numerical order).
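The “#1” shorthand can be pictured with the small sketch below. `resolve_assembly_id` is a hypothetical helper, not the library's code, and extending the rule to “#2”, “#3” and so on is an assumption; the reference only describes “#1”. The IDs in the usage line are illustrative values.

```python
def resolve_assembly_id(assembly_id, available_ids):
    """Resolve '#N' to the N-th available ID in numerical order.

    Hypothetical helper illustrating the selection rule; explicit IDs are
    returned unchanged (as strings).
    """
    assembly_id = str(assembly_id)
    if not assembly_id.startswith("#"):
        return assembly_id
    rank = int(assembly_id[1:])  # "#1" means the first ID
    if rank < 1:
        raise ValueError("rank must be 1 or greater")
    # Sort numerically, not alphabetically: "79781" comes before "1755381".
    ordered = sorted((str(i) for i in available_ids), key=int)
    return ordered[rank - 1]


print(resolve_assembly_id("#1", ["1755381", "79781"]))
```

Sorting with `key=int` matters here: plain string sorting would place "1755381" first, which is not numerical order.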
-
generate_blast_db_for_taxid(taxid, db_type='nucl')
Generate a BLAST database for the TaxID, automatically downloading the FASTA file if needed.
db_type is either “nucl” (nucleotide database for blastn, blastx) or “prot” (protein database, untested).
-
generate_bowtie_index_for_taxid(taxid, version='1')
Generate a Bowtie (1 or 2) index for the given TaxID.
-
get_taxid_biopython_records(taxid, source_type='genomic_genbank', as_iterator=False)
Return a list of Biopython records for the genome’s chromosomes.
A list is returned even if there is a single record (for genomes with a single chromosome).
For huge genomes, use the as_iterator option to return a Python iterator, which avoids loading all chromosomes into memory at once.
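The as_iterator option is most useful when each record can be processed and then discarded. The hypothetical helper below sums record lengths one at a time; it relies only on `len(record)`, which Biopython sequence records support, and assumes `collection` is a GenomeCollection instance.

```python
def total_genome_length(collection, taxid):
    """Sum the lengths of all records without holding them all in memory.

    `collection` is expected to be a GenomeCollection-like object.
    """
    records = collection.get_taxid_biopython_records(taxid, as_iterator=True)
    # The generator expression consumes one record at a time.
    return sum(len(record) for record in records)
```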
-
get_taxid_blastdb_path(taxid, db_type)
Get the path to a local BLAST database, downloading the data and creating the database if needed.
db_type is either “nucl” (nucleotide database for blastn, blastx) or “prot” (protein database, untested).
-
get_taxid_bowtie_index_path(taxid, version='1')
Get a path to the Bowtie (1 or 2) index for the given TaxID.
This will download data and generate the index if necessary. This requires Bowtie (1 or 2) to be installed.
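Since index generation depends on an external Bowtie installation, a script may want to check for the indexing executable first. The sketch below is hypothetical: it assumes the standard executable names `bowtie-build` and `bowtie2-build` are what must be on the PATH, which this reference does not state, and the `which` argument exists only so the lookup can be swapped out.

```python
import shutil


def bowtie_build_executable(version="1"):
    """Return the name of the index builder for Bowtie 1 or Bowtie 2."""
    return "bowtie-build" if str(version) == "1" else "bowtie2-build"


def get_bowtie_index_if_possible(collection, taxid, version="1", which=shutil.which):
    """Return the index path, or None when the builder is not on the PATH.

    `collection` is expected to be a GenomeCollection-like object.
    """
    if which(bowtie_build_executable(version)) is None:
        return None  # Bowtie does not appear to be installed
    return collection.get_taxid_bowtie_index_path(taxid, version=version)
```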
-
get_taxid_genome_data_path(taxid, data_type='genomic_fasta')
Return a path to the taxid’s genome sequence. Download if needed.
data_type is either genomic_fasta, genomic_genbank, genomic_gff, or protein_fasta.
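When autodownload is disabled, a missing file raises FileNotFoundError, as described for the autodownload attribute. A hypothetical wrapper such as the one below turns that into a None result, assuming `collection` is a GenomeCollection instance.

```python
def genome_path_or_none(collection, taxid, data_type="genomic_fasta"):
    """Return the local data path, or None if the file is absent.

    Only meaningful when autodownload is False; otherwise the library is
    documented to download the missing file instead of raising.
    """
    try:
        return collection.get_taxid_genome_data_path(taxid, data_type=data_type)
    except FileNotFoundError:
        return None
```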
-
get_taxid_infos(taxid)
Return a dict with data about the taxid.
Examples
>>> collection.get_taxid_infos(511145)
{
    'Organism Name': 'Escherichia coli',
    'DefLine': 'A well-studied enteric bacterium',
    'Organism_Kingdom': 'Bacteria',
    'AssemblyID': '1755381',
    ...
}
-
list_locally_available_taxids(data_type='infos')
Return all taxIDs for which there is a local data file of this type.
Parameter data_type should be one of genomic_fasta, protein_fasta, blast_nucl, blast_prot, genomic_gz, protein_gz, infos.
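Calling the method once per data type gives an overview of what is stored for each TaxID. The helper below is a hypothetical sketch that assumes `collection` is a GenomeCollection instance and that the method returns an iterable of taxIDs; taxIDs are converted to strings so int and str forms are grouped together.

```python
DATA_TYPES = (
    "genomic_fasta",
    "protein_fasta",
    "blast_nucl",
    "blast_prot",
    "genomic_gz",
    "protein_gz",
    "infos",
)


def local_data_types_by_taxid(collection, data_types=DATA_TYPES):
    """Return {taxid: [data types stored locally]} for all local taxIDs."""
    summary = {}
    for data_type in data_types:
        for taxid in collection.list_locally_available_taxids(data_type=data_type):
            summary.setdefault(str(taxid), []).append(data_type)
    return summary
```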
-
list_locally_available_taxids_names(print_mode=False)
Return a dictionary {taxid: scientific_name} of all local taxIDs.
For convenience, when print_mode is set to True, the table is printed in alphabetical order instead of being returned as a dict.
-
remove_all_local_data_files()
Remove all the locally stored data files.
-
remove_all_taxid_files(taxid)
Remove all local data files for this TaxID. Return a list of the names of the removed files.
Examples
To remove all local data files:
>>> from genome_collector import GenomeCollection
>>> collection = GenomeCollection()
>>> for taxid in collection.list_locally_available_taxids():
...     collection.remove_all_taxid_files(taxid)