The Seq object deals with sequences. There are different methods between Seq objects and standard Python strings.
In [1]: import Bio
In [2]: Bio.__version__
Out[2]: '1.84'
In [3]: from Bio.Seq import Seq
In [4]: my_seq=Seq("TCATGTGTCCATACTTGATCATAG")
In [5]: my_seq.reverse_complement()
Out[5]: Seq('CTATGATCAAGTATGGACACATGA')
In [6]: my_seq.transcribe()
Out[6]: Seq('UCAUGUGUCCAUACUUGAUCAUAG')
In [7]: my_seq.translate()
Out[7]: Seq('SCVHT*S*')
In [8]: my_seq.translate(to_stop=True)
Out[8]: Seq('SCVHT')GC functions in Bio.SeqUtils module to calculate a GC%,
In [1]: import Bio
In [2]: from Bio.Seq import Seq
In [3]: from Bio.SeqUtils import gc_fraction
In [4]: my_seq=Seq("ACTG")
In [5]: gc_fraction(my_seq)
Out[5]: 50.0The Sequence IO object is like a container for multiple SeqRecord objects
test.fa can be downloaded from https://ki-data.mit.edu/bcc/teaching/Biopython/
Login and password will be distributed during class
Using Sequence IO object, we can analyze many sequences within this file
Using your text editor, write the following into a file seqio.py
import Bio
from Bio import SeqIO
import os
allSequences=[]
allSeqIDs=[]
os.getcwd()
'C:\\Users\\XXX\\Desktop\\BioPython'
pathToFile=os.path.join("C:\\Users\\XXX\\Desktop\\BioPython","test.fa")
for seq in SeqIO.parse(pathToFile,"fasta"):
allSequences.append(seq)
allSeqIDs.append(seq.id)
print(seq.id)
print(str(seq.seq))
print(len(seq))
If test.fa is saved under the current directory,
for seq in SeqIO.parse("test.fa","fasta"):
allSequences.append(seq)
allSeqIDs.append(seq.id)
print(seq.id)
print(str(seq.seq))
print(len(seq))
Then run seqio.py:
python seqio.py
NS500496_22_H2GV2BGXX:1:11101:7399:1076#TAAGGC/1
AGCCAGGCAATGGTGGTGCATGCCTTTAATCCCAGGTCTTGAGAGGAAGAGGCAGGCAGATATCTGTAAGTTTGATGCCAGCCTAGTCAATAGAGTTCCACCAAAACCAGAACTACACCAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
150
NS500496_22_H2GV2BGXX:1:11101:20246:1131#TAAGGC/1
ATGGGACCACAGTGCCAAATGGCAGGAGAAACCTGATAATCAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGA
80
NS500496_22_H2GV2BGXX:1:11101:6476:1147#TAAGGC/1
ACTTTATGATATTCATCTTACATCAATAGAAAACCTTTCTGGTATTTTTTTATTTAATTAATTTTTTTAATAAAATTTTATTTAATAAATTTGGGGGGTGTGACATTAAAAATATCATGATTTAAATTAAACAAAAAAACTAACCAACAA
150
NS500496_22_H2GV2BGXX:1:11101:22871:1170#TAAGGC/1
ATGTAGTAAAACCCTGTTTGATTTATTTATTTTGAAACGAACTTGGCTGGTCTGGAACTCCTCCTGGCTCTGTCTCCTGAGAGGTGGGAATAAAGGTGTGTAGTGCCTGGCTAGAACAAAATTACAAAATTACAAAAAAAAACAAAAAAA
150
.
.
.
.The SeqRecord object has a few additional attributes beyond the Seq object:
seq - The sequence itself
id - The primary ID used to identify the sequence - a string. In most cases this is something like an accession number
name - A 'common' name/id for the sequence - a string. In most cases this will be the same as the accession number, but it could also be a clone name. Analogous to the LOCUS id in a GenBank record
description - A human readable description or expressive name for the sequence - a string
In [1]: import Bio
In [2]: from Bio.Seq import Seq
In [3]: from Bio.SeqRecord import SeqRecord
In [4]: ##create a simple SeqRecord object
In [5]: simple_seq=Seq("GATCAGGATTAGGCC")
In [6]: simple_seq_r=SeqRecord(simple_seq)
In [7]: simple_seq_r.id="AC12345"
In [8]: simple_seq_r.name = "alienClone"
In [9]: simple_seq_r.description="I am not a real sequence"
In [10]: ##print summary
In [11]: print(simple_seq_r.id)
AC12345
In [12]: print(simple_seq_r.description)
I am not a real sequence
In [13]: print(simple_seq_r.seq)
GATCAGGATTAGGCC
In [14]: print(simple_seq_r.seq.translate())
DQD*A
Entrez databases
NCBI's Guidelines
Before using Biopython to access the NCBI's online resources(via Bio.Entrez or some of the other modules), please read the NCBI's Entrez User Requirements. If NCBI finds you are abusing their systems, they can and will ban your access!
To paraphrase: For any series of more than 100 requests, do this at weekends or outside USA peak times. This is up to you to obey. Use the http://eutils.ncbi.nlm.nih.gov address, not the standard NCBI Web address. Biopython uses this web address. You can make no more than 10 queries per second if using a API key, otherwise at most 3 queries per second (relaxed form at most one request every three seconds in early 2009). This is automatically enforced by Biopython. Use the optional email parameter so the NCBI can contact you if there is a problem. You can either explicitly set this as a paraemter with each call to Entrez(e.g. include Entrez.email = "[email protected]" in the argument list), or you can set a global email address as follow:
from Bio import Entrez
Entrez.email = "[email protected]"What database do I have access to?
In [1]: import Bio
In [2]: from Bio import Entrez
In [3]: Entrez.email="[email protected]"
In [4]: handle=Entrez.einfo()
In [5]: record=Entrez.read(handle)
In [6]: record["DbList"]
Out[6]: ['pubmed', 'protein', 'nuccore', 'nucleotide', 'nucgss', 'nucest', 'structure',
'genome', 'gpipe', 'annotinfo', 'assembly', 'bioproject', 'biosample', 'blastdbinfo',
'books', 'cdd', 'clinvar', 'clone', 'gap', 'gapplus', 'grasp', 'dbvar', 'epigenomics',
'gene', 'gds', 'geoprofiles', 'homologene', 'medgen', 'mesh', 'ncbisearch', 'nlmcatalog',
'omim', 'orgtrack', 'pmc', 'popset', 'probe', 'proteinclusters', 'pcassay',
'biosystems', 'pccompound', 'pcsubstance', 'pubmedhealth', 'seqannot', 'snp', 'sra',
'taxonomy', 'unigene', 'gencoll', 'gtr']What if I want info about a database?
In [1]: import Bio
In [2]: from Bio import Entrez
In [3]: handle=Entrez.einfo(db="pubmed")
In [4]: record=Entrez.read(handle)
In [5]: record["DbInfo"]["Description"]
Out[5]: 'PubMed bibliographic record'
In [6]: record["DbInfo"]["Count"]
Out[6]: '36234233'How do I search for a given term?
Example 1:
In [1]: import Bio
In [2]: from Bio import Entrez
In [3]: handle=Entrez.esearch(db="pubmed",term="biopython")
In [4]: record=Entrez.read(handle)
In [5]: record["IdList"]
Out[5]: ['29641230', '28011774', '24929426', '24497503', '24267035', '24194598', '23842806', '23157543',
'22909249', '22399473', '21666252', '21210977', '20015970', '19811691', '19773334', '19304878',
'18606172', '21585724', '16403221', '16377612']
Example 2:
In [1]: import Bio
In [2]: from Bio import Entrez
In [3]: handle = Entrez.esearch(db="nucleotide", retmax=10, term="human[ORGN] tp53", idtype="acc")
In [4]: record=Entrez.read(handle)
In [5]: record["Count"]
Out[5]: '4253'How do I retrieve a specific term?
Example 1: retrieve a previously identified biopython article (id=24929426) from pubmed
In [1]: import Bio
In [2]: from Bio import Entrez
In [3]: handle=Entrez.efetch(db='pubmed',id='29641230')
In [4]: print(handle.read())
Example 2: retrieve gene information from genbank
In [1]: import Bio
In [2]: from Bio import Entrez,SeqIO
In [3]: handle=Entrez.efetch(db='nucleotide',id='AF307851',rettype='gb',retmode='text')
In [4]: record=SeqIO.read(handle,'genbank')
In [5]: handle.close()
In [6]: print(record)
ID: AF307851.1
Name: AF307851
Description: Homo sapiens p53 protein mRNA, complete cds
Number of features: 2
/taxonomy=['Eukaryota', 'Metazoa', 'Chordata', 'Craniata', 'Vertebrata', 'Euteleostomi', 'Mammalia', 'Eutheria', 'Euarchontoglires', 'Primates', 'Haplorrhini', 'Catarrhini', 'Hominidae', 'Homo']
/keywords=['']
/data_file_division=PRI
/organism=Homo sapiens
/sequence_version=1
/molecule_type=mRNA
/source=Homo sapiens (human)
/topology=linear
/date=29-JAN-2001
/references=[Reference(title='Hyaluronidase induction of a WW domain-containing oxidoreductase that enhances tumor necrosis factor cytotoxicity', ...), Reference(title='Direct Submission', ...)]
/accessions=['AF307851']
Seq('GGCACGAGCCACCGTCCAGGGAGCAGGTAGCTGCTGGGCTCCGGGGACACTTTG...AAA', IUPACAmbiguousDNA())How do I write my searching result to a file?
outpath=os.getcwd()+"\\Tp53GeneBank.gb"
SeqIO.write(record,open(outpath,"w"),"gb")