Exploration of Entrez Databases

  • Entrez databases

    • NCBI's Guidelines

      • Before using Biopython to access the NCBI's online resources (via Bio.Entrez or some of the other modules), please read the NCBI's Entrez User Requirements. If the NCBI finds you are abusing their systems, they can and will ban your access!

      • To paraphrase:

        • For any series of more than 100 requests, run them at weekends or outside USA peak times. This is up to you to obey.

        • Use the http://eutils.ncbi.nlm.nih.gov address, not the standard NCBI Web address. Biopython uses this web address.

        • Make no more than 10 queries per second if you use an API key, otherwise at most 3 queries per second (relaxed from at most one request every three seconds in early 2009). Biopython enforces this automatically.

        • Use the optional email parameter so the NCBI can contact you if there is a problem. You can either set this explicitly as a parameter with each call to Entrez (e.g. include email="A.N.Other@example.com" in the argument list), or set a global email address as follows:

from Bio import Entrez
Entrez.email = "A.N.Other@example.com"
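
The per-second limits quoted above translate into a minimum gap between consecutive requests. The sketch below only illustrates that arithmetic and is not Biopython's own throttling code; `min_delay` and `Throttle` are hypothetical names introduced here, and Biopython already applies the limit for you.

```python
import time


def min_delay(has_api_key):
    """Smallest gap in seconds between requests under the limits quoted above."""
    queries_per_second = 10 if has_api_key else 3
    return 1.0 / queries_per_second


class Throttle:
    """Hypothetical helper: sleeps so that calls are at least `delay` seconds apart."""

    def __init__(self, delay):
        self.delay = delay
        self._last = None  # time of the previous call, None before the first one

    def wait(self):
        now = time.monotonic()
        if self._last is not None:
            remaining = self.delay - (now - self._last)
            if remaining > 0:
                time.sleep(remaining)
        self._last = time.monotonic()


throttle = Throttle(min_delay(has_api_key=False))
start = time.monotonic()
for _ in range(3):
    throttle.wait()  # a request to NCBI would go here
elapsed = time.monotonic() - start
print(f"3 simulated requests took {elapsed:.2f} s")
```

Only the second and third calls have to wait, so three simulated requests without an API key take roughly two thirds of a second.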
  • Which databases do I have access to?

In [1]: import Bio

In [2]: from Bio import Entrez

In [3]: Entrez.email="duan@mit.edu"

In [4]: handle=Entrez.einfo()

In [5]: record=Entrez.read(handle)

In [6]: record["DbList"]
Out[6]: ['pubmed', 'protein', 'nuccore', 'nucleotide', 'nucgss', 'nucest', 'structure', 
'genome', 'gpipe', 'annotinfo', 'assembly', 'bioproject', 'biosample', 'blastdbinfo', 
'books', 'cdd', 'clinvar', 'clone', 'gap', 'gapplus', 'grasp', 'dbvar', 'epigenomics', 
'gene', 'gds', 'geoprofiles', 'homologene', 'medgen', 'mesh', 'ncbisearch', 'nlmcatalog', 
'omim', 'orgtrack', 'pmc', 'popset', 'probe', 'proteinclusters', 'pcassay', 
'biosystems', 'pccompound', 'pcsubstance', 'pubmedhealth', 'seqannot', 'snp', 'sra', 
'taxonomy', 'unigene', 'gencoll', 'gtr']
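
The value of record["DbList"] behaves like an ordinary Python list of strings, so the usual list operations apply. A minimal sketch, using a hand-copied subset of the output above instead of a live call to NCBI; `db_list` and `is_available` are names made up for this example.

```python
# Hand-copied subset of the DbList output above (no network call is made here).
db_list = ["pubmed", "protein", "nuccore", "nucleotide", "gene", "taxonomy", "sra"]


def is_available(name, databases):
    """Return True if `name` is one of the database names (case-insensitive)."""
    return name.lower() in databases


print(is_available("PubMed", db_list))
print(is_available("uniprot", db_list))
print(sorted(db_list)[:3])
```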
  • What if I want info about a database?

In [1]: import Bio

In [2]: from Bio import Entrez

In [3]: handle=Entrez.einfo(db="pubmed")

In [4]: record=Entrez.read(handle)

In [5]: record["DbInfo"]["Description"]
Out[5]: 'PubMed bibliographic record'

In [6]: record["DbInfo"]["Count"]
Out[6]: '36234233'
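
Note that the count comes back as a string, not a number. The sketch below converts it and also walks the "FieldList" entry, which lists the searchable fields of the database. The dictionary is a hand-written stand-in for record["DbInfo"] so that nothing is fetched from NCBI: Count and Description are copied from the session above, while the two FieldList entries are illustrative only.

```python
# Hand-written stand-in for record["DbInfo"]; Count and Description are copied
# from the session above, the FieldList entries are illustrative only.
db_info = {
    "Description": "PubMed bibliographic record",
    "Count": "36234233",
    "FieldList": [
        {"Name": "ALL", "FullName": "All Fields"},
        {"Name": "TITL", "FullName": "Title"},
    ],
}

# Count is a string, so convert it before doing arithmetic or comparisons.
total_records = int(db_info["Count"])

# Map each short field name to its full name.
field_names = {field["Name"]: field["FullName"] for field in db_info["FieldList"]}

print(f"{total_records:,} records")
print(field_names)
```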
  • How do I search for a given term?


Example 1:
In [1]: import Bio

In [2]: from Bio import Entrez

In [3]: handle=Entrez.esearch(db="pubmed",term="biopython")

In [4]: record=Entrez.read(handle)

In [5]: record["IdList"]
Out[5]: ['29641230', '28011774', '24929426', '24497503', '24267035', '24194598', '23842806', '23157543', 
'22909249', '22399473', '21666252', '21210977', '20015970', '19811691', '19773334', '19304878', 
'18606172', '21585724', '16403221', '16377612']


Example 2:
In [1]: import Bio

In [2]: from Bio import Entrez

In [3]: handle = Entrez.esearch(db="nucleotide", retmax=10, term="human[ORGN] tp53", idtype="acc")

In [4]: record=Entrez.read(handle)

In [5]: record["Count"]
Out[5]: '4253'
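
In Example 2 the search matches 4253 records, but retmax=10 caps how many identifiers come back in one response. E-utilities also accepts a retstart offset, so a full result set can be collected in batches. The helper below only computes the offsets; `batch_starts` is a hypothetical name, the batch size of 1000 is an arbitrary choice, and the commented-out loop shows where an Entrez.esearch call could go.

```python
def batch_starts(count, batch_size):
    """Return the retstart offset of each batch needed to cover `count` hits."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return range(0, count, batch_size)


total = int("4253")  # record["Count"] from Example 2 is a string
starts = list(batch_starts(total, 1000))
print(starts)

# With Biopython, each offset could be passed on like this (not run here):
# for start in starts:
#     handle = Entrez.esearch(db="nucleotide", term="human[ORGN] tp53",
#                             retstart=start, retmax=1000, idtype="acc")
```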
  • How do I retrieve a specific record?

Example 1: retrieve a previously identified Biopython article (id=29641230) from PubMed
In [1]: import Bio

In [2]: from Bio import Entrez

In [3]: handle=Entrez.efetch(db='pubmed',id='29641230')

In [4]: print(handle.read())


Example 2: retrieve gene information from GenBank
In [1]: import Bio

In [2]: from Bio import Entrez,SeqIO

In [3]: handle=Entrez.efetch(db='nucleotide',id='AF307851',rettype='gb',retmode='text')

In [4]: record=SeqIO.read(handle,'genbank')

In [5]: handle.close()

In [6]: print(record)
ID: AF307851.1
Name: AF307851
Description: Homo sapiens p53 protein mRNA, complete cds
Number of features: 2
/taxonomy=['Eukaryota', 'Metazoa', 'Chordata', 'Craniata', 'Vertebrata', 'Euteleostomi', 'Mammalia', 'Eutheria', 'Euarchontoglires', 'Primates', 'Haplorrhini', 'Catarrhini', 'Hominidae', 'Homo']
/keywords=['']
/data_file_division=PRI
/organism=Homo sapiens
/sequence_version=1
/molecule_type=mRNA
/source=Homo sapiens (human)
/topology=linear
/date=29-JAN-2001
/references=[Reference(title='Hyaluronidase induction of a WW domain-containing oxidoreductase that enhances tumor necrosis factor cytotoxicity', ...), Reference(title='Direct Submission', ...)]
/accessions=['AF307851']
Seq('GGCACGAGCCACCGTCCAGGGAGCAGGTAGCTGCTGGGCTCCGGGGACACTTTG...AAA', IUPACAmbiguousDNA())
  • How do I write my search result to a file?

import os
# Build the path portably and let SeqIO open and close the file itself.
outpath = os.path.join(os.getcwd(), "Tp53GeneBank.gb")
SeqIO.write(record, outpath, "gb")

Massachusetts Institute of Technology