We are still impacted by Covid19. Let's see how Biopython can help us explore and understand these coronaviruses.
1. we search PubMed for papers in order to access the timely coronavirus findings. In this way, we will have a good idea what people already know and still do not know about coronaviruses.
search PubMed for coronavirus information:
In [1]: import Bio
In [2]: Bio.__version__
In [3]: from Bio import Entrez
In [4]: handle=Entrez.esearch(db="pubmed",retmax=10,term="coronavirus")
In [5]: record=Entrez.read(handle)
In [6]: record["IdList"]
Out[6]: ['32526774', '32526763', '32526759', '32526746', '32526627',
'32526560', '32526559', '32526545', '32526541', '32526530']
2. we look into specific terms. We will use '32526774' as an example
In [1]: import Bio
In [2]: Bio.__version__
In [3]: from Bio import Entrez
In [4]: handle=Entrez.efetch(db='pubmed',id='32526774')
In [5]: print(handle.read())
3. we mine nucleotide database to start bioinformatics study
In [1]: import Bio
In [2]: Bio.__version__
In [3]: from Bio import Entrez,SeqIO
In [4]: handle=Entrez.esearch(db="nucleotide",retmax=10,term="coronavirus")
In [5]: record=Entrez.read(handle)
In [6]: record["IdList"]
4. we retrieve individual coronavirus genome sequences. Here, we use '1850952228' as an example
In [1]: import Bio
In [2]: Bio.__version__
In [3]: from Bio import Entrez,SeqIO
In [4]: handle=Entrez.efetch(db='nucleotide',id='1850952228',rettype='gb',retmode='text')
In [5]: record=SeqIO.read(handle,'genbank')
In [6]: handle.close()
##writing genBank file
In [7]: SeqIO.write(record,open("MT578017.gb","w"),"gb")
##converting GenBank to FASTA
In [8]: SeqIO.convert("MT578017.gb", "genbank", "MT578017.fna", "fasta")
5. Identifying open reading frames from genomic sequence. Here, we use '1850952228' as an example
from Bio import SeqIO
record = SeqIO.read("MT578017.fna", "fasta")
table = 1
min_pro_len = 100
for strand, nuc in [(+1, record.seq), (-1, record.seq.reverse_complement())]:
for frame in range(3):
length = 3 * ((len(record)-frame) // 3) #Multiple of three
for pro in nuc[frame:frame+length].translate(table).split("*"):
if len(pro) >= min_pro_len:
print("%s...%s - length %i, strand %i, frame %i" \
% (pro[:30], pro[-3:], len(pro), strand, frame))
6. build up a list of the candidate proteins and keep track of where the proteins are. Here, we use '1850952228' as an example