1 of 5

Quick Start

Basic Sequence Analyses

The Seq object deals with sequences. There are different methods between Seq objects and standard Python strings.

In [1]: import Bio

In [2]: Bio.__version__
Out[2]: '1.84'

In [3]: from Bio.Seq import Seq

In [4]: my_seq=Seq("TCATGTGTCCATACTTGATCATAG")

In [5]: my_seq.reverse_complement()
Out[5]: Seq('CTATGATCAAGTATGGACACATGA')

In [6]: my_seq.transcribe()
Out[6]: Seq('UCAUGUGUCCAUACUUGAUCAUAG')

In [7]: my_seq.translate()
Out[7]: Seq('SCVHT*S*')

In [8]: my_seq.translate(to_stop=True)
Out[8]: Seq('SCVHT')

GC functions in Bio.SeqUtils module to calculate a GC%,

In [1]: import Bio

In [2]: from Bio.Seq import Seq

In [3]: from Bio.SeqUtils import gc_fraction

In [4]: my_seq=Seq("ACTG")

In [5]: gc_fraction(my_seq)
Out[5]: 50.0

SeqRecord

The SeqRecord object has a few additional attributes beyond the Seq object:
- seq - The sequence itself
- id - The primary ID used to identify the sequence - a string. In most cases this is something like an accession number
- name - A 'common' name/id for the sequence - a string. In most cases this will be the same as the accession number, but it could also be a clone name. Analogous to the LOCUS id in a GenBank record
- description - A human readable description or expressive name for the sequence - a string

In [1]: import Bio

In [2]: from Bio.Seq import Seq

In [3]: from Bio.SeqRecord import SeqRecord

In [4]: ##create a simple SeqRecord object

In [5]: simple_seq=Seq("GATCAGGATTAGGCC")

In [6]: simple_seq_r=SeqRecord(simple_seq)

In [7]: simple_seq_r.id="AC12345"

In [8]: simple_seq_r.name = "alienClone"

In [9]: simple_seq_r.description="I am not a real sequence"

In [10]: ##print summary

In [11]: print(simple_seq_r.id)
AC12345

In [12]: print(simple_seq_r.description)
I am not a real sequence

In [13]: print(simple_seq_r.seq)
GATCAGGATTAGGCC

In [14]: print(simple_seq_r.seq.translate())
DQD*A

Sequence IO

The Sequence IO object is like a container for multiple SeqRecord objects
test.fa can be downloaded from https://ki-data.mit.edu/bcc/teaching/Biopython/
- Login and password will be distributed during class

Using Sequence IO object, we can analyze many sequences within this file
Using your text editor, write the following into a file seqio.py

import Bio
from Bio import SeqIO
import os
allSequences=[]
allSeqIDs=[]

os.getcwd()
'C:\\Users\\XXX\\Desktop\\BioPython'
pathToFile=os.path.join("C:\\Users\\XXX\\Desktop\\BioPython","test.fa")

for seq in SeqIO.parse(pathToFile,"fasta"):
        allSequences.append(seq)
        allSeqIDs.append(seq.id)
        print(seq.id)
        print(str(seq.seq))
        print(len(seq))  

If test.fa is saved under the current directory,
for seq in SeqIO.parse("test.fa","fasta"):
        allSequences.append(seq)
        allSeqIDs.append(seq.id)
        print(seq.id)
        print(str(seq.seq))
        print(len(seq))  

Then run seqio.py:
python seqio.py

NS500496_22_H2GV2BGXX:1:11101:7399:1076#TAAGGC/1
AGCCAGGCAATGGTGGTGCATGCCTTTAATCCCAGGTCTTGAGAGGAAGAGGCAGGCAGATATCTGTAAGTTTGATGCCAGCCTAGTCAATAGAGTTCCACCAAAACCAGAACTACACCAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
150
NS500496_22_H2GV2BGXX:1:11101:20246:1131#TAAGGC/1
ATGGGACCACAGTGCCAAATGGCAGGAGAAACCTGATAATCAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGA
80
NS500496_22_H2GV2BGXX:1:11101:6476:1147#TAAGGC/1
ACTTTATGATATTCATCTTACATCAATAGAAAACCTTTCTGGTATTTTTTTATTTAATTAATTTTTTTAATAAAATTTTATTTAATAAATTTGGGGGGTGTGACATTAAAAATATCATGATTTAAATTAAACAAAAAAACTAACCAACAA
150
NS500496_22_H2GV2BGXX:1:11101:22871:1170#TAAGGC/1
ATGTAGTAAAACCCTGTTTGATTTATTTATTTTGAAACGAACTTGGCTGGTCTGGAACTCCTCCTGGCTCTGTCTCCTGAGAGGTGGGAATAAAGGTGTGTAGTGCCTGGCTAGAACAAAATTACAAAATTACAAAAAAAAACAAAAAAA
150
.
.
.
.

Exploration of Entrez Databases

Entrez databases
- NCBI's Guidelines
  - Before using Biopython to access the NCBI's online resources(via Bio.Entrez or some of the other modules), please read the NCBI's Entrez User Requirements. If NCBI finds you are abusing their systems, they can and will ban your access!
  - To paraphrase: For any series of more than 100 requests, do this at weekends or outside USA peak times. This is up to you to obey. Use the http://eutils.ncbi.nlm.nih.gov address, not the standard NCBI Web address. Biopython uses this web address. You can make no more than 10 queries per second if using a API key, otherwise at most 3 queries per second (relaxed form at most one request every three seconds in early 2009). This is automatically enforced by Biopython. Use the optional email parameter so the NCBI can contact you if there is a problem. You can either explicitly set this as a paraemter with each call to Entrez(e.g. include Entrez.email = "[email protected]" in the argument list), or you can set a global email address as follow:

from Bio import Entrez
Entrez.email = "[email protected]"

What database do I have access to?

In [1]: import Bio

In [2]: from Bio import Entrez

In [3]: Entrez.email="[email protected]"

In [4]: handle=Entrez.einfo()

In [5]: record=Entrez.read(handle)

In [6]: record["DbList"]
Out[6]: ['pubmed', 'protein', 'nuccore', 'nucleotide', 'nucgss', 'nucest', 'structure', 
'genome', 'gpipe', 'annotinfo', 'assembly', 'bioproject', 'biosample', 'blastdbinfo', 
'books', 'cdd', 'clinvar', 'clone', 'gap', 'gapplus', 'grasp', 'dbvar', 'epigenomics', 
'gene', 'gds', 'geoprofiles', 'homologene', 'medgen', 'mesh', 'ncbisearch', 'nlmcatalog', 
'omim', 'orgtrack', 'pmc', 'popset', 'probe', 'proteinclusters', 'pcassay', 
'biosystems', 'pccompound', 'pcsubstance', 'pubmedhealth', 'seqannot', 'snp', 'sra', 
'taxonomy', 'unigene', 'gencoll', 'gtr']

What if I want info about a database?

In [1]: import Bio

In [2]: from Bio import Entrez

In [3]: handle=Entrez.einfo(db="pubmed")

In [4]: record=Entrez.read(handle)

In [5]: record["DbInfo"]["Description"]
Out[5]: 'PubMed bibliographic record'

In [6]: record["DbInfo"]["Count"]
Out[6]: '36234233'

How do I search for a given term?


Example 1:
In [1]: import Bio

In [2]: from Bio import Entrez

In [3]: handle=Entrez.esearch(db="pubmed",term="biopython")

In [4]: record=Entrez.read(handle)

In [5]: record["IdList"]
Out[5]: ['29641230', '28011774', '24929426', '24497503', '24267035', '24194598', '23842806', '23157543', 
'22909249', '22399473', '21666252', '21210977', '20015970', '19811691', '19773334', '19304878', 
'18606172', '21585724', '16403221', '16377612']


Example 2:
In [1]: import Bio

In [2]: from Bio import Entrez

In [3]: handle = Entrez.esearch(db="nucleotide", retmax=10, term="human[ORGN] tp53", idtype="acc")

In [4]: record=Entrez.read(handle)

In [5]: record["Count"]
Out[5]: '4253'

How do I retrieve a specific term?

Example 1: retrieve a previously identified biopython article (id=24929426) from pubmed
In [1]: import Bio

In [2]: from Bio import Entrez

In [3]: handle=Entrez.efetch(db='pubmed',id='29641230')

In [4]: print(handle.read())


Example 2: retrieve gene information from genbank
In [1]: import Bio

In [2]: from Bio import Entrez,SeqIO

In [3]: handle=Entrez.efetch(db='nucleotide',id='AF307851',rettype='gb',retmode='text')

In [4]: record=SeqIO.read(handle,'genbank')

In [5]: handle.close()

In [6]: print(record)
ID: AF307851.1
Name: AF307851
Description: Homo sapiens p53 protein mRNA, complete cds
Number of features: 2
/taxonomy=['Eukaryota', 'Metazoa', 'Chordata', 'Craniata', 'Vertebrata', 'Euteleostomi', 'Mammalia', 'Eutheria', 'Euarchontoglires', 'Primates', 'Haplorrhini', 'Catarrhini', 'Hominidae', 'Homo']
/keywords=['']
/data_file_division=PRI
/organism=Homo sapiens
/sequence_version=1
/molecule_type=mRNA
/source=Homo sapiens (human)
/topology=linear
/date=29-JAN-2001
/references=[Reference(title='Hyaluronidase induction of a WW domain-containing oxidoreductase that enhances tumor necrosis factor cytotoxicity', ...), Reference(title='Direct Submission', ...)]
/accessions=['AF307851']
Seq('GGCACGAGCCACCGTCCAGGGAGCAGGTAGCTGCTGGGCTCCGGGGACACTTTG...AAA', IUPACAmbiguousDNA())

How do I write my searching result to a file?

outpath=os.getcwd()+"\\Tp53GeneBank.gb"
SeqIO.write(record,open(outpath,"w"),"gb")

Exploration of Entrez Databases

Entrez databases
- NCBI's Guidelines
  - Before using Biopython to access the NCBI's online resources(via Bio.Entrez or some of the other modules), please read the NCBI's Entrez User Requirements. If NCBI finds you are abusing their systems, they can and will ban your access!
  - To paraphrase: For any series of more than 100 requests, do this at weekends or outside USA peak times. This is up to you to obey. Use the http://eutils.ncbi.nlm.nih.gov address, not the standard NCBI Web address. Biopython uses this web address. You can make no more than 10 queries per second if using a API key, otherwise at most 3 queries per second (relaxed form at most one request every three seconds in early 2009). This is automatically enforced by Biopython. Use the optional email parameter so the NCBI can contact you if there is a problem. You can either explicitly set this as a paraemter with each call to Entrez(e.g. include Entrez.email = "[email protected]" in the argument list), or you can set a global email address as follow:

from Bio import Entrez
Entrez.email = "[email protected]"

What database do I have access to?

In [1]: import Bio

In [2]: from Bio import Entrez

In [3]: Entrez.email="[email protected]"

In [4]: handle=Entrez.einfo()

In [5]: record=Entrez.read(handle)

In [6]: record["DbList"]
Out[6]: ['pubmed', 'protein', 'nuccore', 'nucleotide', 'nucgss', 'nucest', 'structure', 
'genome', 'gpipe', 'annotinfo', 'assembly', 'bioproject', 'biosample', 'blastdbinfo', 
'books', 'cdd', 'clinvar', 'clone', 'gap', 'gapplus', 'grasp', 'dbvar', 'epigenomics', 
'gene', 'gds', 'geoprofiles', 'homologene', 'medgen', 'mesh', 'ncbisearch', 'nlmcatalog', 
'omim', 'orgtrack', 'pmc', 'popset', 'probe', 'proteinclusters', 'pcassay', 
'biosystems', 'pccompound', 'pcsubstance', 'pubmedhealth', 'seqannot', 'snp', 'sra', 
'taxonomy', 'unigene', 'gencoll', 'gtr']

What if I want info about a database?

In [1]: import Bio

In [2]: from Bio import Entrez

In [3]: handle=Entrez.einfo(db="pubmed")

In [4]: record=Entrez.read(handle)

In [5]: record["DbInfo"]["Description"]
Out[5]: 'PubMed bibliographic record'

In [6]: record["DbInfo"]["Count"]
Out[6]: '36234233'

How do I search for a given term?


Example 1:
In [1]: import Bio

In [2]: from Bio import Entrez

In [3]: handle=Entrez.esearch(db="pubmed",term="biopython")

In [4]: record=Entrez.read(handle)

In [5]: record["IdList"]
Out[5]: ['29641230', '28011774', '24929426', '24497503', '24267035', '24194598', '23842806', '23157543', 
'22909249', '22399473', '21666252', '21210977', '20015970', '19811691', '19773334', '19304878', 
'18606172', '21585724', '16403221', '16377612']


Example 2:
In [1]: import Bio

In [2]: from Bio import Entrez

In [3]: handle = Entrez.esearch(db="nucleotide", retmax=10, term="human[ORGN] tp53", idtype="acc")

In [4]: record=Entrez.read(handle)

In [5]: record["Count"]
Out[5]: '4253'

How do I retrieve a specific term?

Example 1: retrieve a previously identified biopython article (id=24929426) from pubmed
In [1]: import Bio

In [2]: from Bio import Entrez

In [3]: handle=Entrez.efetch(db='pubmed',id='29641230')

In [4]: print(handle.read())


Example 2: retrieve gene information from genbank
In [1]: import Bio

In [2]: from Bio import Entrez,SeqIO

In [3]: handle=Entrez.efetch(db='nucleotide',id='AF307851',rettype='gb',retmode='text')

In [4]: record=SeqIO.read(handle,'genbank')

In [5]: handle.close()

In [6]: print(record)
ID: AF307851.1
Name: AF307851
Description: Homo sapiens p53 protein mRNA, complete cds
Number of features: 2
/taxonomy=['Eukaryota', 'Metazoa', 'Chordata', 'Craniata', 'Vertebrata', 'Euteleostomi', 'Mammalia', 'Eutheria', 'Euarchontoglires', 'Primates', 'Haplorrhini', 'Catarrhini', 'Hominidae', 'Homo']
/keywords=['']
/data_file_division=PRI
/organism=Homo sapiens
/sequence_version=1
/molecule_type=mRNA
/source=Homo sapiens (human)
/topology=linear
/date=29-JAN-2001
/references=[Reference(title='Hyaluronidase induction of a WW domain-containing oxidoreductase that enhances tumor necrosis factor cytotoxicity', ...), Reference(title='Direct Submission', ...)]
/accessions=['AF307851']
Seq('GGCACGAGCCACCGTCCAGGGAGCAGGTAGCTGCTGGGCTCCGGGGACACTTTG...AAA', IUPACAmbiguousDNA())

How do I write my searching result to a file?

outpath=os.getcwd()+"\\Tp53GeneBank.gb"
SeqIO.write(record,open(outpath,"w"),"gb")