
EUtils practical demo

Meg Staton edited this page Aug 31, 2016 · 3 revisions

##EUtils

The NCBI provides a toolkit called the Entrez Programming Utilities, or eutils for short. You can read all about it in the documentation. It lets you interface with all of the different NCBI databases in many ways, including:

  • ask for the total number of records in a database
  • search with a text query
  • search with a text query and get back only the number of matches
  • provide a list of IDs and ask for information back in a certain format
  • and more
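
As a rough sketch of how the first few items map onto request URLs, the snippet below only builds and prints the strings; it does not contact NCBI. The utility names einfo.fcgi and esearch.fcgi and the term and rettype=count parameters are taken from the E-utilities documentation rather than from this page, the search term is just a made-up example, and the https scheme is an assumption about what the NCBI servers currently expect.

```shell
# Base URL from this page, written with https (assumption: NCBI now expects HTTPS).
base="https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"

# einfo: database statistics, including the total number of records.
einfo_url() {
  printf '%seinfo.fcgi?db=%s\n' "$base" "$1"
}

# esearch: text search; spaces in the query are written as '+'.
esearch_url() {
  term=$(printf '%s' "$2" | tr ' ' '+')
  printf '%sesearch.fcgi?db=%s&term=%s\n' "$base" "$1" "$term"
}

# Same search, but asking only for the number of matches.
esearch_count_url() {
  printf '%s&rettype=count\n' "$(esearch_url "$1" "$2")"
}

einfo_url genome
esearch_url genome "Clostridium botulinum"
esearch_count_url genome "Clostridium botulinum"
```

Pasting any of the printed URLs into a browser, or passing one to curl in quotes, sends the actual request.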

Eutils will work to query these databases:

  • BioProject
  • BioSample
  • Biosystems
  • Books
  • Conserved Domains
  • dbGaP
  • dbVar
  • Epigenomics
  • EST
  • Gene
  • Genome
  • GEO Datasets
  • GEO Profiles
  • GSS
  • HomoloGene
  • MeSH
  • NCBI C++ Toolkit
  • NCBI Web Site
  • NLM Catalog
  • Nucleotide
  • OMIA
  • PopSet
  • Probe
  • Protein
  • Protein Clusters
  • PubChem BioAssay
  • PubChem Compound
  • PubChem Substance
  • PubMed
  • PubMed Central
  • SNP
  • SRA
  • Structure
  • Taxonomy
  • UniGene
  • UniSTS

API = application programming interface

In this case, we'll talk about an API in terms of the internet. A web API is a programmatic interface to a request-response message system. A user sends a request over the internet (via HTTP), and the server responds with structured data.

Another definition: “an interface through which you access someone else’s code or through which someone else’s code accesses yours – in effect the public methods and properties.”

Let's focus on the genome db for now.

An EUtils request is basically a web URL that you build piece by piece. The base URL is always:

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/

Assuming you know the unique identifier of your genome of interest, you can start a download with this URL:

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=CP000962&rettype=fasta&retmode=text

Let’s break down the command here:

| Part | Explanation |
| --- | --- |
| http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi? | This part tells your computer program (or your browser) to talk to the NCBI API tool efetch. |
| db=nuccore | This parameter tells the NCBI API which database you’d like it to look in for the data. Other databases that the NCBI has available can be found here. |
| id=CP000962 | This parameter gives efetch the ID of the genome you want to find. |
| rettype=fasta&retmode=text | These two parameters tell the NCBI how to return the data. We asked for the FASTA sequence as plain text. |
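
To make the structure of the URL concrete, here is a minimal sketch that assembles it from the four parts described above. The function name efetch_url is hypothetical, and the base URL is written with https on the assumption that the NCBI servers now expect it. Nothing is downloaded; the function only prints the URL.

```shell
# Base URL from this page, written with https (assumption: NCBI now expects HTTPS).
base="https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"

# efetch_url DB ID RETTYPE RETMODE
# Prints the efetch URL for one record in the requested format.
efetch_url() {
  printf '%sefetch.fcgi?db=%s&id=%s&rettype=%s&retmode=%s\n' \
    "$base" "$1" "$2" "$3" "$4"
}

# The FASTA request discussed above.
efetch_url nuccore CP000962 fasta text
```

Changing a single argument changes a single part of the URL, which is all a different request amounts to.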

Let's try asking for the GenBank file instead:

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=CP000962&rettype=gb&retmode=text

What changed?

Here’s some elusive documentation on where to find these “return” objects, i.e. which formats are available for each database (such as FASTA format and GenBank format for genome database entries).

How could we get this data into our newton account? With curl! (This is one of those instances where wget will work, but it will save your data under a weird file name, so curl is better.)

	curl "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=CP000962&rettype=gb&retmode=text" > CP000962.gb
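
If you want the download to fail loudly instead of quietly saving an error page, one option is to wrap curl in a small function. This is only a sketch: the function name fetch_genbank is hypothetical, the https scheme is an assumption about the current NCBI servers, and the flags used are standard curl options (-f fails on HTTP errors, -sS hides the progress meter but still prints errors, -o names the output file).

```shell
# Base URL from this page, written with https (assumption: NCBI now expects HTTPS).
base="https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"

# fetch_genbank ID: save the GenBank record for ID as ID.gb.
fetch_genbank() {
  url="${base}efetch.fcgi?db=nuccore&id=$1&rettype=gb&retmode=text"
  if curl -fsS -o "$1.gb" "$url"; then
    echo "saved $1.gb"
  else
    # Report the failure and remove any partial output file.
    echo "download failed for $1" >&2
    rm -f "$1.gb"
    return 1
  fi
}

# Usage (contacts NCBI, so it is left commented out here):
# fetch_genbank CP000962
```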

##Many nucleotide records

What if you want a whole bunch of records? Let's try a large set: go get a list of GI accessions from NCBI and scp them to your Amazon instance.

We know how to:

  • Use the web portal and look up each FASTA
  • Use the FTP site, find each genome, and download manually or with wget
  • Build an eutils URL and use wget

These aren't great options if you have 300 records. However, building eutils URLs turns out to be very useful once you know how to write a bash script, or some Python, R, or Perl code, or anything else that does loops. Here's a bash script:

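	# Pass the file of IDs (one per line) as the first argument.
	begin="http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=protein&id="
	end="&rettype=fasta&retmode=text"
	# Read the ID file line by line; -r keeps backslashes literal.
	while read -r line
	do
	  echo "$line"
	  url="$begin$line$end"
	  echo "${url}"
	  # Quote the URL so the ? in it is never treated as a glob pattern,
	  # and append each record to <input file>.fasta.
	  curl "$url" >> "$1.fasta"
	done < "$1"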

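This script fetches records from the protein database of NCBI. It accepts a file name on the command line. This file should contain the GI numbers of the records to retrieve, one per line. Each FASTA record is appended to a file named after the input file, with .fasta added to the end.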

How would you alter this script to get genbank format instead of fasta format?
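
One possible refinement of the loop, sketched here under a few assumptions: efetch accepts several IDs in one request when they are separated by commas, so a long list can be fetched in batches rather than one request per ID. The function name batch_urls and the default batch size of 100 are arbitrary choices, and the https scheme is an assumption about the current NCBI servers. The function only prints URLs; the download loop is left as a comment because it contacts NCBI.

```shell
# Base URL from this page, written with https (assumption: NCBI now expects HTTPS).
base="https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"

# batch_urls FILE [SIZE]
# Prints one efetch URL per group of SIZE IDs (default 100),
# joining the IDs in each group with commas. Blank lines are skipped.
batch_urls() {
  awk -v n="${2:-100}" \
      -v pre="${base}efetch.fcgi?db=protein&id=" \
      -v post="&rettype=fasta&retmode=text" '
    NF {
      ids = (ids == "" ? $1 : ids "," $1)
      if (++count % n == 0) { print pre ids post; ids = "" }
    }
    END { if (ids != "") print pre ids post }
  ' "$1"
}

# Usage (contacts NCBI, so it is left commented out here):
# batch_urls ids.txt 100 | while read -r url
# do
#   curl -fsS "$url" >> ids.txt.fasta
#   sleep 1   # pause between requests to stay under the NCBI rate limit
# done
```

Fewer, larger requests are gentler on the NCBI servers, which ask users to keep to a few requests per second.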
