Coronavirus Exploration

We are still impacted by Covid19. Let's see how Biopython can help us explore and understand these coronaviruses.

1. we search PubMed for papers in order to access the timely coronavirus findings. In this way, we will have a good idea what people already know and still do not know about coronaviruses.

search PubMed for coronavirus information:
In [1]: import Bio
In [2]: Bio.__version__
In [3]: from Bio import Entrez
In [4]: handle=Entrez.esearch(db="pubmed",retmax=10,term="coronavirus")
In [5]: record=Entrez.read(handle)
In [6]: record["IdList"]
Out[6]: ['32526774', '32526763', '32526759', '32526746', '32526627', 
'32526560', '32526559', '32526545', '32526541', '32526530']

2. we look into specific terms. We will use '32526774' as an example

In [1]: import Bio
In [2]: Bio.__version__
In [3]: from Bio import Entrez
In [4]: handle=Entrez.efetch(db='pubmed',id='32526774')
In [5]: print(handle.read())

3. we mine nucleotide database to start bioinformatics study

In [1]: import Bio
In [2]: Bio.__version__
In [3]: from Bio import Entrez,SeqIO
In [4]: handle=Entrez.esearch(db="nucleotide",retmax=10,term="coronavirus")
In [5]: record=Entrez.read(handle)
In [6]: record["IdList"]

4. we retrieve individual coronavirus genome sequences. Here, we use '1850952228' as an example

In [1]: import Bio
In [2]: Bio.__version__
In [3]: from Bio import Entrez,SeqIO
In [4]: handle=Entrez.efetch(db='nucleotide',id='1850952228',rettype='gb',retmode='text')
In [5]: record=SeqIO.read(handle,'genbank')
In [6]: handle.close()
##writing genBank file
In [7]: SeqIO.write(record,open("MT578017.gb","w"),"gb")
##converting GenBank to FASTA
In [8]: SeqIO.convert("MT578017.gb", "genbank", "MT578017.fna", "fasta")

5. Identifying open reading frames from genomic sequence. Here, we use '1850952228' as an example

The code is from "Cookbook 22.1.13 Identifying open reading frames"

from Bio import SeqIO
record = SeqIO.read("MT578017.fna", "fasta")
table = 1
min_pro_len = 100
for strand, nuc in [(+1, record.seq), (-1, record.seq.reverse_complement())]:
    for frame in range(3):
        length = 3 * ((len(record)-frame) // 3) #Multiple of three
        for pro in nuc[frame:frame+length].translate(table).split("*"):
            if len(pro) >= min_pro_len:
                print("%s...%s - length %i, strand %i, frame %i" \
                      % (pro[:30], pro[-3:], len(pro), strand, frame))

6. build up a list of the candidate proteins and keep track of where the proteins are. Here, we use '1850952228' as an example

The code is from "Cookbook 20.1.13 Identifying open reading frames"

from Bio import SeqIO

record = SeqIO.read("MT578017.gb", "genbank")
table = 1
min_pro_len = 100

def find_orfs_with_trans(seq, trans_table, min_protein_length):
    answer = []
    seq_len = len(seq)
    for strand, nuc in [(+1, seq), (-1, seq.reverse_complement())]:
        for frame in range(3):
            trans = str(nuc[frame:].translate(trans_table))
            trans_len = len(trans)
            aa_start = 0
            aa_end = 0
            while aa_start < trans_len:
                aa_end = trans.find("*", aa_start)
                if aa_end == -1:
                    aa_end = trans_len
                if aa_end - aa_start >= min_protein_length:
                    if strand == 1:
                        start = frame + aa_start * 3
                        end = min(seq_len, frame + aa_end * 3 + 3)
                    else:
                        start = seq_len - frame - aa_end * 3 - 3
                        end = seq_len - frame - aa_start * 3
                    answer.append((start, end, strand, trans[aa_start:aa_end]))
                aa_start = aa_end + 1
    answer.sort()
    return answer


orf_list = find_orfs_with_trans(record.seq, table, min_pro_len)
for start, end, strand, pro in orf_list:
    print(
        "%s...%s - length %i, strand %i, %i:%i"
        % (pro[:30], pro[-3:], len(pro), strand, start, end)
    )

Now we have coronavirus protein sequences. There is so much more that we can do. I will leave it to you to do great jobs using Biopython.

PreviousExample Projects NextTranslating a eukaryotic FASTA file of CDS entries

Last updated 1 year ago

Was this helpful?