We are still impacted by Covid19. Let's see how Biopython can help us explore and understand these coronaviruses.
1. we search PubMed for papers in order to access the timely coronavirus findings. In this way, we will have a good idea what people already know and still do not know about coronaviruses.
search PubMed for coronavirus information:
In [1]: import Bio
In [2]: Bio.__version__
In [3]: from Bio import Entrez
In [4]: handle=Entrez.esearch(db="pubmed",retmax=10,term="coronavirus")
In [5]: record=Entrez.read(handle)
In [6]: record["IdList"]
Out[6]: ['32526774', '32526763', '32526759', '32526746', '32526627',
'32526560', '32526559', '32526545', '32526541', '32526530']
2. we look into specific terms. We will use '32526774' as an example
In [1]: import Bio
In [2]: Bio.__version__
In [3]: from Bio import Entrez
In [4]: handle=Entrez.efetch(db='pubmed',id='32526774')
In [5]: print(handle.read())
3. we mine nucleotide database to start bioinformatics study
In [1]: import Bio
In [2]: Bio.__version__
In [3]: from Bio import Entrez,SeqIO
In [4]: handle=Entrez.esearch(db="nucleotide",retmax=10,term="coronavirus")
In [5]: record=Entrez.read(handle)
In [6]: record["IdList"]
4. we retrieve individual coronavirus genome sequences. Here, we use '1850952228' as an example
In [1]: import Bio
In [2]: Bio.__version__
In [3]: from Bio import Entrez,SeqIO
In [4]: handle=Entrez.efetch(db='nucleotide',id='1850952228',rettype='gb',retmode='text')
In [5]: record=SeqIO.read(handle,'genbank')
In [6]: handle.close()
##writing genBank file
In [7]: SeqIO.write(record,open("MT578017.gb","w"),"gb")
##converting GenBank to FASTA
In [8]: SeqIO.convert("MT578017.gb", "genbank", "MT578017.fna", "fasta")
5. Identifying open reading frames from genomic sequence. Here, we use '1850952228' as an example
from Bio import SeqIO
record = SeqIO.read("MT578017.fna", "fasta")
table = 1
min_pro_len = 100
for strand, nuc in [(+1, record.seq), (-1, record.seq.reverse_complement())]:
for frame in range(3):
length = 3 * ((len(record)-frame) // 3) #Multiple of three
for pro in nuc[frame:frame+length].translate(table).split("*"):
if len(pro) >= min_pro_len:
print("%s...%s - length %i, strand %i, frame %i" \
% (pro[:30], pro[-3:], len(pro), strand, frame))
6. build up a list of the candidate proteins and keep track of where the proteins are. Here, we use '1850952228' as an example
from Bio import SeqIO
record = SeqIO.read("MT578017.gb", "genbank")
table = 1
min_pro_len = 100
def find_orfs_with_trans(seq, trans_table, min_protein_length):
answer = []
seq_len = len(seq)
for strand, nuc in [(+1, seq), (-1, seq.reverse_complement())]:
for frame in range(3):
trans = str(nuc[frame:].translate(trans_table))
trans_len = len(trans)
aa_start = 0
aa_end = 0
while aa_start < trans_len:
aa_end = trans.find("*", aa_start)
if aa_end == -1:
aa_end = trans_len
if aa_end - aa_start >= min_protein_length:
if strand == 1:
start = frame + aa_start * 3
end = min(seq_len, frame + aa_end * 3 + 3)
else:
start = seq_len - frame - aa_end * 3 - 3
end = seq_len - frame - aa_start * 3
answer.append((start, end, strand, trans[aa_start:aa_end]))
aa_start = aa_end + 1
answer.sort()
return answer
orf_list = find_orfs_with_trans(record.seq, table, min_pro_len)
for start, end, strand, pro in orf_list:
print(
"%s...%s - length %i, strand %i, %i:%i"
% (pro[:30], pro[-3:], len(pro), strand, start, end)
)
Now we have coronavirus protein sequences. There is so much more that we can do. I will leave it to you to do great jobs using Biopython.