A simple Python interface to query the biological databases kept at the NCBI.
It uses the Entrez Programming Utilities (E-utilities), nine server-side programs that access the Entrez query and database system at the National Center for Biotechnology Information (NCBI). They provide a structured interface to the Entrez system, which currently includes 38 databases covering a variety of biomedical data, including nucleotide and protein sequences, gene records, three-dimensional molecular structures, and the biomedical literature.
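For orientation, each E-utility is reached through an ordinary HTTP request. The sketch below only builds such a URL and sends nothing over the network. The helper `eutils_url` is a hypothetical name used for illustration, not a function of this module, and the way entrez.py assembles its requests internally may differ.

```python
from urllib.parse import urlencode

# Public base URL of the E-utilities.
BASE_URL = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/'


def eutils_url(program, **params):
    """Return the URL for an E-utilities program such as 'efetch'.

    Hypothetical helper for illustration; it is not part of entrez.py.
    """
    return f'{BASE_URL}{program}.fcgi?{urlencode(params)}'


# Build (but do not send) a request for SNP 3000.
url = eutils_url('efetch', db='snp', id='3000')
print(url)
```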
The main function is:
query(tool[, ...])
- yields the response of a query with the given tool
There are a few more functions for convenience:
select(tool, db[, ...])
- returns a dict that references the elements selected with tool over database db
apply(tool, db, selections[, retmax, ...])
- yields the response of applying a tool on db for the selected elements
on_search(term, db, tool[, dbfrom, ...])
- yields the response of applying a tool over the results of a search query (of the given term in database db)
If we want to select many elements and do further queries on them, we could get a long list of ids that we would have to upload in the next query. Instead, we can use the function select(...) to make a selection of elements on the server that can be referenced in later queries. It returns a dictionary with the information needed to refer to the selection on the server.
The function apply(...) accepts that dictionary in its selections argument. It then runs a tool on the selected elements.
Finally, on_search(...) is a convenience function that feeds the result of a select into an apply, which is a very common case.
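To make the idea of a server-side selection more concrete: the E-utilities identify a stored selection by a WebEnv string and a query key. The sketch below extracts both from a made-up response and reuses them as parameters for a follow-up request. The sample values and the helper `selection_from_xml` are hypothetical, and the keys of the dictionary that select(...) actually returns are an assumption here.

```python
import xml.etree.ElementTree as ET

# Made-up response in the shape that ESearch returns when called with
# usehistory=y (the values are invented for illustration).
SAMPLE_RESPONSE = """\
<eSearchResult>
  <Count>2</Count>
  <QueryKey>1</QueryKey>
  <WebEnv>MCID_example_0123456789</WebEnv>
</eSearchResult>
"""


def selection_from_xml(xml_text):
    """Return the two values that identify a server-side selection.

    Hypothetical helper; the dict returned by entrez.select may differ.
    """
    root = ET.fromstring(xml_text)
    return {'WebEnv': root.findtext('WebEnv'),
            'query_key': root.findtext('QueryKey')}


selection = selection_from_xml(SAMPLE_RESPONSE)

# A follow-up request refers to the selection instead of listing ids.
followup_params = {'db': 'nucleotide', **selection}
print(followup_params)
```

The point of the design is that the ids never travel back and forth: only these two short values do.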
The data often comes as XML. For convenience, there is also the function read_xml(...) that converts it to a Python object closely resembling the original structure of the data.
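As a rough illustration of what such a conversion involves, here is a minimal sketch that turns XML into nested dicts, lists and strings using only the standard library. It is not the implementation of read_xml(...), whose output may be structured differently. `xml_to_obj` is a hypothetical name, and this version ignores XML attributes.

```python
import xml.etree.ElementTree as ET


def xml_to_obj(element):
    """Convert an XML element into nested dicts, lists and strings.

    Illustrative only: attributes are ignored, and this is not the
    code behind entrez.read_xml.
    """
    children = list(element)
    if not children:
        # Leaf element: keep just its text.
        return (element.text or '').strip()
    result = {}
    for child in children:
        value = xml_to_obj(child)
        if child.tag in result:
            # Repeated tags are collected into a list.
            if not isinstance(result[child.tag], list):
                result[child.tag] = [result[child.tag]]
            result[child.tag].append(value)
        else:
            result[child.tag] = value
    return result


sample = ('<eSearchResult><Count>2</Count>'
          '<IdList><Id>11</Id><Id>22</Id></IdList></eSearchResult>')
parsed = xml_to_obj(ET.fromstring(sample))
print(parsed)
```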
You can download this repository and run from there without installing anything. Or simply put entrez.py in a place where your Python interpreter can find it (for example, you can add its directory to your PYTHONPATH).
It is that easy, really. There is no need to pip install anything.
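If you would rather not touch PYTHONPATH, a similar effect can be had from inside a script by extending the module search path before importing. This is a small sketch. The directory below is a hypothetical placeholder for wherever you saved entrez.py.

```python
import sys

# Hypothetical location of the downloaded repository; adjust as needed.
repo_dir = '/path/to/entrez'

# Prepend it to the module search path for this process only.
if repo_dir not in sys.path:
    sys.path.insert(0, repo_dir)

# After this, "import entrez" succeeds if entrez.py lives in repo_dir.
```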
Fetch information for SNP with id 3000, as in the example at https://www.ncbi.nlm.nih.gov/projects/SNP/SNPeutils.htm:
import entrez as ez
for line in ez.query(tool='fetch', db='snp', id='3000'):
    print(line)  # or: print(ez.read_xml(line)) for nicer output
Get a summary of nucleotides related to accession numbers NC_010611.1 and EU477409.1:
import entrez as ez
for line in ez.on_search(term='NC_010611.1[accn] OR EU477409.1[accn]',
                         db='nucleotide', tool='summary'):
    print(line)
Download all chimpanzee mRNA sequences in FASTA format to the file chimp.fna (our version of sample application 3):
import entrez as ez
with open('chimp.fna', 'w') as fout:
    for line in ez.on_search(term='chimpanzee[orgn] AND biomol mrna[prop]',
                             db='nucleotide', tool='fetch', rettype='fasta'):
        fout.write(line + '\n')
In the examples directory, there is a program sample_applications.py that shows what the sample applications of the E-utilities would look like with this interface.
There are also some little programs: acc2gi.py uses the library to convert accession numbers into GIs, and sra2runacc.py uses entrez to get all the run accession numbers for a given SRA study.
The NCBI now asks for all requests to include email as a parameter, with the email address of the user making the request (see their "General Usage Guidelines" for example).
You can pass it to any of the functions in this module as an argument (for example, query(..., email='[email protected]')), or more comfortably it can be initialized at the module level with:
import entrez as ez
ez.EMAIL = '[email protected]'
and from that point on, all the queries will have the email automatically incorporated.
Similarly, an API key can be passed to any of the functions as an argument (query(..., api_key='ABCD123')), or initialized and incorporated automatically from that moment with:
import entrez as ez
ez.API_KEY = 'ABCD123'
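The pattern behind such module-level settings can be sketched as follows: defaults stored in globals are merged into the parameters of each request, and values passed explicitly win. This is only an illustration of the idea, not the code in entrez.py. The function `with_defaults` and the example addresses are hypothetical.

```python
# Module-level defaults, in the spirit of ez.EMAIL and ez.API_KEY.
EMAIL = None
API_KEY = None


def with_defaults(**params):
    """Return params with the global email and API key filled in.

    Values passed explicitly take precedence over the globals.
    Hypothetical helper; entrez.py may handle this differently.
    """
    defaults = {'email': EMAIL, 'api_key': API_KEY}
    merged = {k: v for k, v in defaults.items() if v is not None}
    merged.update(params)
    return merged


EMAIL = 'someone@example.org'  # set once, used by every later call
first = with_defaults(db='snp', id='3000')
second = with_defaults(db='snp', email='other@example.org')
print(first)
print(second)
```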
There is a script to run the queries directly from the command line, called etool.py.
For the examples above, the equivalent calls would be:
./etool.py fetch --db snp --id 3000
and
./etool.py summary --on-search --db nucleotide \
    --term 'NC_010611.1[accn] OR EU477409.1[accn]'
For XML outputs, the --parse-xml argument is particularly useful. 😉
There is also a small tool to perform web searches with BLAST (Basic Local Alignment Search Tool) at the NCBI, called web_blast.py.
For example, if you want to perform a blast search on the "non-redundant" database for the protein sequences that you have in a file named sequences.fasta, you can write:
./web_blast.py --program blastp --database nr --format Tabular sequences.fasta
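If you save the tabular results to a file, they can be read back with a few lines of Python. The sketch below assumes the standard 12-column BLAST tabular layout with optional `#` comment lines. Whether web_blast.py prints exactly this layout is an assumption, and the sample hits are made up.

```python
# Column names of the standard 12-column BLAST tabular layout.
FIELDS = ['qseqid', 'sseqid', 'pident', 'length', 'mismatch', 'gapopen',
          'qstart', 'qend', 'sstart', 'send', 'evalue', 'bitscore']


def read_hits(lines):
    """Yield one dict per hit, skipping blank and '#' comment lines.

    All values are kept as strings. Hypothetical helper, not part of
    web_blast.py.
    """
    for line in lines:
        line = line.rstrip('\n')
        if not line or line.startswith('#'):
            continue
        yield dict(zip(FIELDS, line.split('\t')))


# Made-up sample output, for illustration only.
sample_lines = [
    '# Fields: query acc.ver, subject acc.ver, % identity, ...\n',
    'seq1\tsubjA\t98.5\t200\t3\t0\t1\t200\t5\t204\t1e-100\t370\n',
    '\n',
    'seq1\tsubjB\t75.0\t180\t40\t2\t10\t189\t1\t178\t2e-30\t120\n',
]
hits = list(read_hits(sample_lines))
print(len(hits), hits[0]['sseqid'], hits[0]['evalue'])
```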
You can run the tests in the tests directory with:
pytest
which will run all the functions that start with test_ in the test_*.py files.
There is some more information in the wiki.
This program is licensed under the GPL v3. See the project license for further details.
When I initially wrote this module (circa 2016) there were no Python alternatives (that I could find). That also explains why I chose to name it simply "entrez". Thanks to a more recent module, easy-entrez, here is a collection of alternatives: