# Coronavirus Exploration

We are still impacted by Covid19. Let's see how Biopython can help us explore and understand these coronaviruses.

* 1\. we search PubMed for papers in order to access the timely coronavirus findings. In this way, we will have a good idea what people already know and still do not know about coronaviruses.

```
search PubMed for coronavirus information:
In [1]: import Bio
In [2]: Bio.__version__
In [3]: from Bio import Entrez
In [4]: handle=Entrez.esearch(db="pubmed",retmax=10,term="coronavirus")
In [5]: record=Entrez.read(handle)
In [6]: record["IdList"]
Out[6]: ['32526774', '32526763', '32526759', '32526746', '32526627', 
'32526560', '32526559', '32526545', '32526541', '32526530']
```

* 2\. we look into specific terms. We will use '32526774' as an example

```
In [1]: import Bio
In [2]: Bio.__version__
In [3]: from Bio import Entrez
In [4]: handle=Entrez.efetch(db='pubmed',id='32526774')
In [5]: print(handle.read())
```

* 3\. we mine nucleotide database to start bioinformatics study

```
In [1]: import Bio
In [2]: Bio.__version__
In [3]: from Bio import Entrez,SeqIO
In [4]: handle=Entrez.esearch(db="nucleotide",retmax=10,term="coronavirus")
In [5]: record=Entrez.read(handle)
In [6]: record["IdList"]
```

* 4\. we retrieve individual coronavirus genome sequences. Here, we use '1850952228' as an example

```
In [1]: import Bio
In [2]: Bio.__version__
In [3]: from Bio import Entrez,SeqIO
In [4]: handle=Entrez.efetch(db='nucleotide',id='1850952228',rettype='gb',retmode='text')
In [5]: record=SeqIO.read(handle,'genbank')
In [6]: handle.close()
##writing genBank file
In [7]: SeqIO.write(record,open("MT578017.gb","w"),"gb")
##converting GenBank to FASTA
In [8]: SeqIO.convert("MT578017.gb", "genbank", "MT578017.fna", "fasta")
```

* 5\. Identifying open reading frames from genomic sequence. Here, we use '1850952228' as an example

The code is from ["Cookbook: Translating a FASTA file of CDS entries"](https://biopython.org/docs/latest/Tutorial/chapter_cookbook.html)

```
from Bio import SeqIO
record = SeqIO.read("MT578017.fna", "fasta")
table = 1
min_pro_len = 100
for strand, nuc in [(+1, record.seq), (-1, record.seq.reverse_complement())]:
    for frame in range(3):
        length = 3 * ((len(record)-frame) // 3) #Multiple of three
        for pro in nuc[frame:frame+length].translate(table).split("*"):
            if len(pro) >= min_pro_len:
                print("%s...%s - length %i, strand %i, frame %i" \
                      % (pro[:30], pro[-3:], len(pro), strand, frame))
```

* 6\. build up a list of the candidate proteins and keep track of where the proteins are. Here, we use '1850952228' as an example

The code is from ["Cookbook 20.1.13 Identifying open reading frames"](http://biopython.org/DIST/docs/tutorial/Tutorial.html#sec379)

```
from Bio import SeqIO

record = SeqIO.read("MT578017.gb", "genbank")
table = 1
min_pro_len = 100

def find_orfs_with_trans(seq, trans_table, min_protein_length):
    answer = []
    seq_len = len(seq)
    for strand, nuc in [(+1, seq), (-1, seq.reverse_complement())]:
        for frame in range(3):
            trans = str(nuc[frame:].translate(trans_table))
            trans_len = len(trans)
            aa_start = 0
            aa_end = 0
            while aa_start < trans_len:
                aa_end = trans.find("*", aa_start)
                if aa_end == -1:
                    aa_end = trans_len
                if aa_end - aa_start >= min_protein_length:
                    if strand == 1:
                        start = frame + aa_start * 3
                        end = min(seq_len, frame + aa_end * 3 + 3)
                    else:
                        start = seq_len - frame - aa_end * 3 - 3
                        end = seq_len - frame - aa_start * 3
                    answer.append((start, end, strand, trans[aa_start:aa_end]))
                aa_start = aa_end + 1
    answer.sort()
    return answer


orf_list = find_orfs_with_trans(record.seq, table, min_pro_len)
for start, end, strand, pro in orf_list:
    print(
        "%s...%s - length %i, strand %i, %i:%i"
        % (pro[:30], pro[-3:], len(pro), strand, start, end)
    )                    
```

Now we have coronavirus protein sequences. There is so much more that we can do. I will leave it to you to do great jobs using Biopython.


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://igb.mit.edu/mini-courses/python/biopython/example-projects/coronavirus-exploration.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
