1 of 11

Biopython

About Biopython

Biopython is a set of freely available tools for biological computation written in python by an international team of developers. The tool is to "make it as easy as possible to use Python for bioinformatics by creating high-quality, reusable module and scripts." The source code is made available under the Biopython License, which is extremely liberal and compatible with almost every license in the world. Details refer Biopython Tutorial.

Some of the pricipal functions of biopython
- A standard sequence class that deals with sequences, ids on sequences, and sequence features
- Tools for performing common operations on sequences, such as translation, transcription and weight calculations
- Code to perform classification of data using k Nearest Neighbors, Naive Bayes or Support Vector Machines
- Code for dealing with alignments, including a standard way to create and deal with substitution matrices
- Code making it easy to split up parallelizable tasks into separate processes
- GUI-based programs to do basic sequence manipulations, translations, BLASTing,etc.

Quick Start

Basic Sequence Analyses

The Seq object deals with sequences. There are different methods between Seq objects and standard Python strings.

GC functions in Bio.SeqUtils module to calculate a GC%,

SeqRecord

The SeqRecord object has a few additional attributes beyond the Seq object:
- seq - The sequence itself
- id - The primary ID used to identify the sequence - a string. In most cases this is something like an accession number
- name - A 'common' name/id for the sequence - a string. In most cases this will be the same as the accession number, but it could also be a clone name. Analogous to the LOCUS id in a GenBank record
- description - A human readable description or expressive name for the sequence - a string

In [1]: import Bio

In [2]: from Bio.Seq import Seq

In [3]: from Bio.SeqRecord import SeqRecord

In [4]: ##create a simple SeqRecord object

In [5]: simple_seq=Seq("GATCAGGATTAGGCC")

In [6]: simple_seq_r=SeqRecord(simple_seq)

In [7]: simple_seq_r.id="AC12345"

In [8]: simple_seq_r.name = "alienClone"

In [9]: simple_seq_r.description="I am not a real sequence"

In [10]: ##print summary

In [11]: print(simple_seq_r.id)
AC12345

In [12]: print(simple_seq_r.description)
I am not a real sequence

In [13]: print(simple_seq_r.seq)
GATCAGGATTAGGCC

In [14]: print(simple_seq_r.seq.translate())
DQD*A

Sequence IO

The Sequence IO object is like a container for multiple SeqRecord objects
test.fa can be downloaded from https://ki-data.mit.edu/bcc/teaching/Biopython/
- Login and password will be distributed during class

Using Sequence IO object, we can analyze many sequences within this file
Using your text editor, write the following into a file seqio.py

import Bio
from Bio import SeqIO
import os
allSequences=[]
allSeqIDs=[]

os.getcwd()
'C:\\Users\\XXX\\Desktop\\BioPython'
pathToFile=os.path.join("C:\\Users\\XXX\\Desktop\\BioPython","test.fa")

for seq in SeqIO.parse(pathToFile,"fasta"):
        allSequences.append(seq)
        allSeqIDs.append(seq.id)
        print(seq.id)
        print(str(seq.seq))
        print(len(seq))  

If test.fa is saved under the current directory,
for seq in SeqIO.parse("test.fa","fasta"):
        allSequences.append(seq)
        allSeqIDs.append(seq.id)
        print(seq.id)
        print(str(seq.seq))
        print(len(seq))  

Then run seqio.py:
python seqio.py

NS500496_22_H2GV2BGXX:1:11101:7399:1076#TAAGGC/1
AGCCAGGCAATGGTGGTGCATGCCTTTAATCCCAGGTCTTGAGAGGAAGAGGCAGGCAGATATCTGTAAGTTTGATGCCAGCCTAGTCAATAGAGTTCCACCAAAACCAGAACTACACCAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
150
NS500496_22_H2GV2BGXX:1:11101:20246:1131#TAAGGC/1
ATGGGACCACAGTGCCAAATGGCAGGAGAAACCTGATAATCAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGA
80
NS500496_22_H2GV2BGXX:1:11101:6476:1147#TAAGGC/1
ACTTTATGATATTCATCTTACATCAATAGAAAACCTTTCTGGTATTTTTTTATTTAATTAATTTTTTTAATAAAATTTTATTTAATAAATTTGGGGGGTGTGACATTAAAAATATCATGATTTAAATTAAACAAAAAAACTAACCAACAA
150
NS500496_22_H2GV2BGXX:1:11101:22871:1170#TAAGGC/1
ATGTAGTAAAACCCTGTTTGATTTATTTATTTTGAAACGAACTTGGCTGGTCTGGAACTCCTCCTGGCTCTGTCTCCTGAGAGGTGGGAATAAAGGTGTGTAGTGCCTGGCTAGAACAAAATTACAAAATTACAAAAAAAAACAAAAAAA
150
.
.
.
.

Exploration of Entrez Databases

Entrez databases
- NCBI's Guidelines
  - Before using Biopython to access the NCBI's online resources(via Bio.Entrez or some of the other modules), please read the NCBI's Entrez User Requirements. If NCBI finds you are abusing their systems, they can and will ban your access!
  - To paraphrase: For any series of more than 100 requests, do this at weekends or outside USA peak times. This is up to you to obey. Use the http://eutils.ncbi.nlm.nih.gov address, not the standard NCBI Web address. Biopython uses this web address. You can make no more than 10 queries per second if using a API key, otherwise at most 3 queries per second (relaxed form at most one request every three seconds in early 2009). This is automatically enforced by Biopython. Use the optional email parameter so the NCBI can contact you if there is a problem. You can either explicitly set this as a paraemter with each call to Entrez(e.g. include Entrez.email = "[email protected]" in the argument list), or you can set a global email address as follow:

from Bio import Entrez
Entrez.email = "[email protected]"

What database do I have access to?

In [1]: import Bio

In [2]: from Bio import Entrez

In [3]: Entrez.email="[email protected]"

In [4]: handle=Entrez.einfo()

In [5]: record=Entrez.read(handle)

In [6]: record["DbList"]
Out[6]: ['pubmed', 'protein', 'nuccore', 'nucleotide', 'nucgss', 'nucest', 'structure', 
'genome', 'gpipe', 'annotinfo', 'assembly', 'bioproject', 'biosample', 'blastdbinfo', 
'books', 'cdd', 'clinvar', 'clone', 'gap', 'gapplus', 'grasp', 'dbvar', 'epigenomics', 
'gene', 'gds', 'geoprofiles', 'homologene', 'medgen', 'mesh', 'ncbisearch', 'nlmcatalog', 
'omim', 'orgtrack', 'pmc', 'popset', 'probe', 'proteinclusters', 'pcassay', 
'biosystems', 'pccompound', 'pcsubstance', 'pubmedhealth', 'seqannot', 'snp', 'sra', 
'taxonomy', 'unigene', 'gencoll', 'gtr']

What if I want info about a database?

In [1]: import Bio

In [2]: from Bio import Entrez

In [3]: handle=Entrez.einfo(db="pubmed")

In [4]: record=Entrez.read(handle)

In [5]: record["DbInfo"]["Description"]
Out[5]: 'PubMed bibliographic record'

In [6]: record["DbInfo"]["Count"]
Out[6]: '36234233'

How do I search for a given term?


Example 1:
In [1]: import Bio

In [2]: from Bio import Entrez

In [3]: handle=Entrez.esearch(db="pubmed",term="biopython")

In [4]: record=Entrez.read(handle)

In [5]: record["IdList"]
Out[5]: ['29641230', '28011774', '24929426', '24497503', '24267035', '24194598', '23842806', '23157543', 
'22909249', '22399473', '21666252', '21210977', '20015970', '19811691', '19773334', '19304878', 
'18606172', '21585724', '16403221', '16377612']


Example 2:
In [1]: import Bio

In [2]: from Bio import Entrez

In [3]: handle = Entrez.esearch(db="nucleotide", retmax=10, term="human[ORGN] tp53", idtype="acc")

In [4]: record=Entrez.read(handle)

In [5]: record["Count"]
Out[5]: '4253'

How do I retrieve a specific term?

Example 1: retrieve a previously identified biopython article (id=24929426) from pubmed
In [1]: import Bio

In [2]: from Bio import Entrez

In [3]: handle=Entrez.efetch(db='pubmed',id='29641230')

In [4]: print(handle.read())


Example 2: retrieve gene information from genbank
In [1]: import Bio

In [2]: from Bio import Entrez,SeqIO

In [3]: handle=Entrez.efetch(db='nucleotide',id='AF307851',rettype='gb',retmode='text')

In [4]: record=SeqIO.read(handle,'genbank')

In [5]: handle.close()

In [6]: print(record)
ID: AF307851.1
Name: AF307851
Description: Homo sapiens p53 protein mRNA, complete cds
Number of features: 2
/taxonomy=['Eukaryota', 'Metazoa', 'Chordata', 'Craniata', 'Vertebrata', 'Euteleostomi', 'Mammalia', 'Eutheria', 'Euarchontoglires', 'Primates', 'Haplorrhini', 'Catarrhini', 'Hominidae', 'Homo']
/keywords=['']
/data_file_division=PRI
/organism=Homo sapiens
/sequence_version=1
/molecule_type=mRNA
/source=Homo sapiens (human)
/topology=linear
/date=29-JAN-2001
/references=[Reference(title='Hyaluronidase induction of a WW domain-containing oxidoreductase that enhances tumor necrosis factor cytotoxicity', ...), Reference(title='Direct Submission', ...)]
/accessions=['AF307851']
Seq('GGCACGAGCCACCGTCCAGGGAGCAGGTAGCTGCTGGGCTCCGGGGACACTTTG...AAA', IUPACAmbiguousDNA())

How do I write my searching result to a file?

outpath=os.getcwd()+"\\Tp53GeneBank.gb"
SeqIO.write(record,open(outpath,"w"),"gb")

Example Projects

Coronavirus Exploration

We are still impacted by Covid19. Let's see how Biopython can help us explore and understand these coronaviruses.

1. we search PubMed for papers in order to access the timely coronavirus findings. In this way, we will have a good idea what people already know and still do not know about coronaviruses.

search PubMed for coronavirus information:
In [1]: import Bio
In [2]: Bio.__version__
In [3]: from Bio import Entrez
In [4]: handle=Entrez.esearch(db="pubmed",retmax=10,term="coronavirus")
In [5]: record=Entrez.read(handle)
In [6]: record["IdList"]
Out[6]: ['32526774', '32526763', '32526759', '32526746', '32526627', 
'32526560', '32526559', '32526545', '32526541', '32526530']

2. we look into specific terms. We will use '32526774' as an example

In [1]: import Bio
In [2]: Bio.__version__
In [3]: from Bio import Entrez
In [4]: handle=Entrez.efetch(db='pubmed',id='32526774')
In [5]: print(handle.read())

3. we mine nucleotide database to start bioinformatics study

In [1]: import Bio
In [2]: Bio.__version__
In [3]: from Bio import Entrez,SeqIO
In [4]: handle=Entrez.esearch(db="nucleotide",retmax=10,term="coronavirus")
In [5]: record=Entrez.read(handle)
In [6]: record["IdList"]

4. we retrieve individual coronavirus genome sequences. Here, we use '1850952228' as an example

In [1]: import Bio
In [2]: Bio.__version__
In [3]: from Bio import Entrez,SeqIO
In [4]: handle=Entrez.efetch(db='nucleotide',id='1850952228',rettype='gb',retmode='text')
In [5]: record=SeqIO.read(handle,'genbank')
In [6]: handle.close()
##writing genBank file
In [7]: SeqIO.write(record,open("MT578017.gb","w"),"gb")
##converting GenBank to FASTA
In [8]: SeqIO.convert("MT578017.gb", "genbank", "MT578017.fna", "fasta")

5. Identifying open reading frames from genomic sequence. Here, we use '1850952228' as an example

The code is from "Cookbook: Translating a FASTA file of CDS entries"

from Bio import SeqIO
record = SeqIO.read("MT578017.fna", "fasta")
table = 1
min_pro_len = 100
for strand, nuc in [(+1, record.seq), (-1, record.seq.reverse_complement())]:
    for frame in range(3):
        length = 3 * ((len(record)-frame) // 3) #Multiple of three
        for pro in nuc[frame:frame+length].translate(table).split("*"):
            if len(pro) >= min_pro_len:
                print("%s...%s - length %i, strand %i, frame %i" \
                      % (pro[:30], pro[-3:], len(pro), strand, frame))

6. build up a list of the candidate proteins and keep track of where the proteins are. Here, we use '1850952228' as an example

The code is from "Cookbook 20.1.13 Identifying open reading frames"

from Bio import SeqIO

record = SeqIO.read("MT578017.gb", "genbank")
table = 1
min_pro_len = 100

def find_orfs_with_trans(seq, trans_table, min_protein_length):
    answer = []
    seq_len = len(seq)
    for strand, nuc in [(+1, seq), (-1, seq.reverse_complement())]:
        for frame in range(3):
            trans = str(nuc[frame:].translate(trans_table))
            trans_len = len(trans)
            aa_start = 0
            aa_end = 0
            while aa_start < trans_len:
                aa_end = trans.find("*", aa_start)
                if aa_end == -1:
                    aa_end = trans_len
                if aa_end - aa_start >= min_protein_length:
                    if strand == 1:
                        start = frame + aa_start * 3
                        end = min(seq_len, frame + aa_end * 3 + 3)
                    else:
                        start = seq_len - frame - aa_end * 3 - 3
                        end = seq_len - frame - aa_start * 3
                    answer.append((start, end, strand, trans[aa_start:aa_end]))
                aa_start = aa_end + 1
    answer.sort()
    return answer


orf_list = find_orfs_with_trans(record.seq, table, min_pro_len)
for start, end, strand, pro in orf_list:
    print(
        "%s...%s - length %i, strand %i, %i:%i"
        % (pro[:30], pro[-3:], len(pro), strand, start, end)
    )

Now we have coronavirus protein sequences. There is so much more that we can do. I will leave it to you to do great jobs using Biopython.

Translating a eukaryotic FASTA file of CDS entries

From DNA sequence to predicted protein is an example in Biopython Cookbook "From gene sequence to predicted protein with the GFF module". Basing on DNA sequence and GlimmerHMM output in GFF3 format, the project script outputs protein coding sequences in fasta format.

The project script and example input files are available from: https://ki-data.mit.edu/bcc/teaching/Biopython

** install GFF module: pip install bcbio-gff

** Download reference DNA sequence file: ref.fa

** Download GlimmerHMM GFF3 output file: glimmer.gff

** Download project script: glimmergff_to_proteins.py 

** run glimmergff_to_proteins.py with GFF3 file as the 1st argument and reference DNA file as the second argument

** The output protein sequence is a fasta file with the same suffix as the gff file and with the ending as proteins.fa

Further Resources