EUtils practical demo

##EUtils

The NCBI provides a toolkit called the Entrez Programming Utilities, or eutils for short. You can read all about it in the documentation. It lets you interface with all of the different NCBI databases in many ways, including:

- ask for the total number of records in a database
- search with a text query
- search with a text query and respond with the number of matches
- provide a list of IDs and ask for information back in a certain format
- and more

Eutils can query these databases:

- BioProject
- BioSample
- Biosystems
- Books
- Conserved Domains
- dbGaP
- dbVar
- Epigenomics
- EST
- Gene
- Genome
- GEO Datasets
- GEO Profiles
- GSS
- HomoloGene
- MeSH
- NCBI C++ Toolkit
- NCBI Web Site
- NLM Catalog
- Nucleotide
- OMIA
- PopSet
- Probe
- Protein
- Protein Clusters
- PubChem BioAssay
- PubChem Compound
- PubChem Substance
- PubMed
- PubMed Central
- SNP
- SRA
- Structure
- Taxonomy
- UniGene
- UniSTS

API = application programming interface

In this case, we'll talk about an API in terms of the internet. A web API is a programmatic interface to a request-response message system. A user sends a request over the internet (via HTTP), and the server responds with structured data.

Another definition: “an interface through which you access someone else’s code or through which someone else’s code accesses yours – in effect the public methods and properties.”

Let's focus on the genome database for now.

An EUtils request is basically a web URL that you build piece by piece. The base URL is always:

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/
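
The operations listed at the top of this page map onto different utilities under this base URL. Here is a minimal sketch, assuming the einfo, esearch, and efetch utilities described in the NCBI documentation. It uses the https form of the base URL, on the assumption that NCBI now expects https. The helper name `eutils_url` and the search term are made up for illustration.

```shell
#!/bin/bash
# Base URL from the text, in its https form (assumption: NCBI now expects https).
BASE="https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"

# eutils_url is a hypothetical helper: it glues a utility name and a
# query string onto the base URL and prints the resulting URL.
eutils_url() {
  local tool="$1" query="$2"
  printf '%s%s.fcgi?%s\n' "$BASE" "$tool" "$query"
}

# Database statistics, including the total number of records
eutils_url einfo "db=nuccore"
# Search with a text query (spaces are written as +)
eutils_url esearch "db=nuccore&term=Clostridium+botulinum"
# Same search, but only report the number of matches
eutils_url esearch "db=nuccore&term=Clostridium+botulinum&rettype=count"
# Give an ID and ask for the record back in a certain format
eutils_url efetch "db=nuccore&id=CP000962&rettype=fasta&retmode=text"
```

The script only prints the four URLs. Paste any of them into a browser, or hand them to curl, to send the actual request.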

Assuming you know the unique identifier of your genome of interest, you can start a download with this URL:

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=CP000962&rettype=fasta&retmode=text

Let’s break down the URL:

| Part | Explanation |
| --- | --- |
| http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi? | This tells your program (or your browser) to talk to the NCBI API tool efetch. |
| db=nuccore | This parameter tells the NCBI API which database to look in for the data. Other databases that the NCBI has available are listed in the EUtils documentation. |
| id=CP000962 | This parameter gives efetch the ID of the genome you want to retrieve. |
| rettype=fasta&retmode=text | These two parameters tell the NCBI how to return the data. We asked for the FASTA sequence as a text file. |
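
The same breakdown can be written as a small function that assembles the URL from its parts. This is only a sketch: `efetch_url` is a hypothetical helper, not something EUtils provides, and it uses the https form of the address on the assumption that NCBI now expects https.

```shell
#!/bin/bash
# efetch_url is a hypothetical helper, not part of EUtils itself.
# Arguments: database, record ID, rettype, retmode (the parts in the table).
efetch_url() {
  local db="$1" id="$2" rettype="$3" retmode="$4"
  local base="https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?"
  printf '%sdb=%s&id=%s&rettype=%s&retmode=%s\n' \
    "$base" "$db" "$id" "$rettype" "$retmode"
}

# Rebuild the example request from its parts
efetch_url nuccore CP000962 fasta text
```

Keeping each part as a separate argument makes it easy to change one of them, such as the ID, without retyping the rest of the URL.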

Let's try asking for the GenBank file instead:

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=CP000962&rettype=gb&retmode=text

What changed?

Here’s some elusive documentation on where to find these “return” objects, i.e. which formats are available for each database (such as FASTA format and GenBank format for genome database entries).

How could we get this data into our newton account? With curl! (This is one of those instances where wget will work, but it will save your data under a weird file name, so curl is better.)

	curl "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=CP000962&rettype=gb&retmode=text" > CP000962.gb
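
Redirecting with `>` works, but curl can also name the output file itself and report HTTP errors. Here is a sketch, assuming curl's standard `-f` (fail on HTTP errors), `-sS` (quiet, but still show errors), and `-o` (output file) options. The helper names `record_file` and `fetch_record` are hypothetical, and the URL uses the https form on the assumption that NCBI now expects it.

```shell
#!/bin/bash
# record_file: build a local file name such as CP000962.gb from an ID and a rettype.
record_file() {
  printf '%s.%s\n' "$1" "$2"
}

# fetch_record: download one nuccore record into <id>.<rettype>.
# -f makes curl exit non-zero on an HTTP error instead of saving the error page.
fetch_record() {
  local id="$1" rettype="$2"
  local url="https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=${id}&rettype=${rettype}&retmode=text"
  curl -fsS -o "$(record_file "$id" "$rettype")" "$url"
}
```

Calling `fetch_record CP000962 gb` would then write the GenBank record to `CP000962.gb` in the current directory.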

##Many nucleotide records

What if you want a whole bunch of records? Let's try a large set: go get a list of GI accessions from NCBI and scp them to your Amazon instance.

We know how to:

- Use the web portal and look up each FASTA
- Use the FTP site, find each genome, and download manually or with wget
- Build an eutils URL and use wget

None of these is a great option by hand if you have 300 records. However, building an eutils URL turns out to be very useful once you know how to write a bash script, or some Python, R, or Perl code, or anything else that can loop. Here's a bash script:

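	# Read one GI number per line from the file given as $1
	# and append each FASTA record to $1.fasta.
	begin="http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=protein&id="
	end="&rettype=fasta&retmode=text"
	while read -r line
	do
	  echo "$line"
	  url="$begin$line$end"
	  echo "$url"
	  # Quote the URL so the shell does not treat & as a control operator
	  curl "$url" >> "$1.fasta"
	done < "$1"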

This script queries the protein database of NCBI. It accepts a file name on the command line. This file should contain the GI numbers of the records to retrieve, one per line. The records are appended to a file named after the input file, with .fasta added to the end.
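
With hundreds of IDs, one request per record is slow and can run into NCBI's request limits. efetch also accepts a comma-separated list of IDs, so one possible variation is to send the IDs in batches and pause between requests. The sketch below assumes one ID per line in the input file and the https form of the URL. The function names, the batch size of 100, and the one-second pause are illustrative choices, not NCBI recommendations.

```shell
#!/bin/bash
# join_ids: read IDs (one per line) on stdin and print them comma-separated.
join_ids() {
  paste -sd, -
}

# fetch_batches: fetch FASTA records for every ID in a file, 100 IDs per request.
fetch_batches() {
  local idfile="$1" out="$1.fasta"
  local base="https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"
  local tmp chunk ids
  tmp=$(mktemp -d)
  # Split the ID list into chunk files of 100 lines each
  split -l 100 "$idfile" "$tmp/chunk."
  for chunk in "$tmp"/chunk.*
  do
    [ -e "$chunk" ] || continue   # nothing to do if the ID file was empty
    ids=$(join_ids < "$chunk")
    curl -fsS "${base}?db=protein&id=${ids}&rettype=fasta&retmode=text" >> "$out"
    sleep 1   # pause between requests to stay under NCBI's rate limit
  done
  rm -r "$tmp"
}
```

Calling `fetch_batches ids.txt` (with `ids.txt` standing in for your own ID file) would append all records to `ids.txt.fasta`, the same output name the loop above uses.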

How would you alter this script to get genbank format instead of fasta format?
