# The Logistic Regression Model

### Quick Review

<figure><img src="https://498238201-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FWuHhstIreJ3jFvE4gQ3y%2Fuploads%2FwldmF64gKFF6NTKVZcer%2Fimage.png?alt=media&#x26;token=944c56ec-c66f-44f8-ba7a-d501d9c97e65" alt=""><figcaption></figcaption></figure>

<figure><img src="https://498238201-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FWuHhstIreJ3jFvE4gQ3y%2Fuploads%2FnnczcecvVOdM6CQ8h15s%2Fimage.png?alt=media&#x26;token=1844245d-3a13-4e63-8124-952662751e1c" alt=""><figcaption></figcaption></figure>

<figure><img src="https://498238201-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FWuHhstIreJ3jFvE4gQ3y%2Fuploads%2FcCa6n59lQGQjGNX120CG%2Fimage.png?alt=media&#x26;token=b31643f0-ed11-4316-8fc5-1b03b58c385c" alt=""><figcaption></figcaption></figure>

<figure><img src="https://498238201-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FWuHhstIreJ3jFvE4gQ3y%2Fuploads%2FhOMvNeo4ywq4vWis7dAr%2Fimage.png?alt=media&#x26;token=b49713cf-aaf4-4102-982e-3a2ee064e3b3" alt=""><figcaption></figcaption></figure>

<figure><img src="https://498238201-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FWuHhstIreJ3jFvE4gQ3y%2Fuploads%2FUFcjKstzhrVvUcGUQZIh%2Fimage.png?alt=media&#x26;token=f8fabb2e-de64-4562-a151-18a443c73b10" alt=""><figcaption></figcaption></figure>

### Basic Flow

Import the LogisticRegression module from Biopython:

```
from Bio import LogisticRegression
```

Define a list xs containing, for each gene pair, the distance between the two genes and the similarity score of their expression profiles:

```
xs = [[-53, -200.78],
      [117, -267.14],
      [57, -163.47],
      [16, -190.30],
      [11, -220.94],
      [85, -193.94],
      [16, -182.71],
      [15, -180.41],
      [-26, -181.73],
      [58, -259.87],
      [126, -414.53],
      [191, -249.57],
      [113, -265.28],
      [145, -312.99],
      [154, -213.83],
      [147, -380.85],
      [93, -291.13]]
```

Define a list ys specifying whether each gene pair belongs to the same operon (1) or to different operons (0):

```
ys = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
```

Run logistic regression on the training data to estimate the coefficients:

```
model = LogisticRegression.train(xs, ys)
```

Output the weights estimated by the trained logistic regression model:

```
model.beta
```

```
Out[1]: [8.983029015714463, -0.03596896044485087, 0.02181395662983518]
```
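To see what these three weights mean, here is a minimal hand-rolled sketch (not Bio.LogisticRegression's own code) that reproduces the classification from model.beta via the logistic function; manual\_classify is a hypothetical helper introduced only for illustration:

```
import math

def manual_classify(beta, x):
    # Linear predictor: intercept plus the weighted predictor variables
    z = beta[0] + beta[1] * x[0] + beta[2] * x[1]
    # Logistic function gives the probability of class 1 (same operon)
    p = 1.0 / (1.0 + math.exp(-z))
    return 1 if p >= 0.5 else 0
```

For the yxcE, yxcD pair below, z comes out to roughly 5, so the OP probability is about 0.99 and the predicted class is 1.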

Now we can use the logistic regression model for classification.

Let's predict the gene pair yxcE, yxcD:

```
LogisticRegression.classify(model, [6, -173.143442352])
```

```
Out[2]: 1
```

Let's predict the gene pair yxiB, yxiA:

```
LogisticRegression.classify(model, [309, -271.005880394])
```

```
Out[3]: 0
```

These predictions agree with the biological literature.

Hooray! Let's celebrate!!!

### Fancier Analyses

To run the code in this section, the following key steps must be completed ahead of time:

1. Import the LogisticRegression module from Biopython
2. Define the list xs containing, for each gene pair, the distance between the two genes and the similarity score of their expression profiles
3. Define the list ys specifying whether each gene pair belongs to the same operon (1) or to different operons (0)

#### Using arguments in the train function

We can use the update\_fn argument of the train function to track the progress of the model calculation.

First, define a function show\_progress that prints the desired output:

```
def show_progress(iteration, loglikelihood):
    print("Log-likelihood function:", loglikelihood)
```

Then call the train function with the update\_fn argument to output the log-likelihood at each iteration:

```
model = LogisticRegression.train(xs, ys, update_fn=show_progress)
```

The iteration stops once the increase in the log-likelihood function is less than 0.01. If no convergence is reached after 500 iterations, the train function returns with an AssertionError.

We can also use the typecode argument of the train function: for gigantic data sets, it may be necessary to use single-precision floats rather than doubles to save memory:

```
model = LogisticRegression.train(xs, ys, typecode="float")
```

Take a look at the estimated weights

```
model.beta
```

```
Out[1]: [8.983029015714463, -0.03596896044485087, 0.02181395662983518]
```

#### Print the classification output in more intuitive ways

We can label each prediction with the names of the gene pair:

```
print("yxcE, yxcD:", LogisticRegression.classify(model, [6, -173.143442352]))
```

Output

```
yxcE, yxcD: 1
```

```
print("yxiB, yxiA:", LogisticRegression.classify(model, [309, -271.005880394]))
```

Output

```
yxiB, yxiA: 0
```

#### Obtain the probabilities

To find out how confident we can be in these predictions, we can obtain the class probabilities with the calculate function:

```
q, p = LogisticRegression.calculate(model, [309, -271.005880394])
print("class OP: probability =", p, "class NOP: probability =", q)
```

Output

```
class OP: probability = 0.00032121125181733316 class NOP: probability = 0.9996787887481826
```
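For comparison, the same call can be made for the yxcE, yxcD pair; since classify returned class 1 (OP) for it, the OP probability should come out close to one (the exact value depends on the fitted weights):

```
q, p = LogisticRegression.calculate(model, [6, -173.143442352])
print("class OP: probability =", p, "class NOP: probability =", q)
```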

#### Prediction Accuracy

To get some idea of the prediction accuracy of the logistic regression model, we can apply it to the training data:

```
for i in range(len(ys)):
    print("True:", ys[i], "Predicted:", LogisticRegression.classify(model, xs[i]))
```

Output

```
True: 1 Predicted: 1
True: 1 Predicted: 0
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 0 Predicted: 0
True: 0 Predicted: 0
True: 0 Predicted: 0
True: 0 Predicted: 0
True: 0 Predicted: 0
True: 0 Predicted: 0
True: 0 Predicted: 0
```
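Rather than scanning the printout by eye, we can count the matches directly; a small convenience sketch (the zip-based tally is the only new code here):

```
correct = sum(LogisticRegression.classify(model, x) == y for x, y in zip(xs, ys))
print("Training accuracy: %d/%d" % (correct, len(ys)))
```

On the training data above this reports 16 out of 17.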

A more reliable estimate of the prediction accuracy can be found from a leave-one-out analysis, in which the model is recalculated from the training data after removing the gene pair to be predicted:

```
for i in range(len(ys)):
    model = LogisticRegression.train(xs[:i]+xs[i+1:], ys[:i]+ys[i+1:])
    print("True:", ys[i], "Predicted:", LogisticRegression.classify(model, xs[i]))
```

Output

```
True: 1 Predicted: 1
True: 1 Predicted: 0
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 0 Predicted: 0
True: 0 Predicted: 0
True: 0 Predicted: 0
True: 0 Predicted: 0
True: 0 Predicted: 1
True: 0 Predicted: 0
True: 0 Predicted: 0
```
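The same tally works for the leave-one-out loop; a sketch, where loo_model is a hypothetical name chosen so that the model trained on the full data is not overwritten:

```
correct = 0
for i in range(len(ys)):
    loo_model = LogisticRegression.train(xs[:i] + xs[i+1:], ys[:i] + ys[i+1:])
    correct += LogisticRegression.classify(loo_model, xs[i]) == ys[i]
print("Leave-one-out accuracy: %d/%d" % (correct, len(ys)))
```

This matches the 15 out of 17 correct predictions in the printout above.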

#### Using optional arguments in the classify function to specify distance and weight
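The distance\_fn and weight\_fn arguments shown below belong to the classify function of Biopython's kNN module, not to LogisticRegression.classify, so the rest of this section assumes a k-nearest neighbors model trained on the same data. A minimal setup, using k = 3 to match the leave-one-out code further down:

```
from Bio import kNN

model = kNN.train(xs, ys, 3)
```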

By default, the Euclidean distance is used.

<figure><img src="https://498238201-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FWuHhstIreJ3jFvE4gQ3y%2Fuploads%2FR09rNCOqZxIH8w0hucml%2Fimage.png?alt=media&#x26;token=23de98bf-3ec6-4abb-a102-dd08fff16056" alt=""><figcaption></figcaption></figure>

Instead, we could use the city-block (Manhattan) distance:

```
def cityblock(x1, x2):
    # City-block distance: sum of absolute coordinate differences
    assert len(x1) == 2
    assert len(x2) == 2
    distance = abs(x1[0] - x2[0]) + abs(x1[1] - x2[1])
    return distance

pair1 = [6, -173.143442352]
print("yxcE, yxcD:", kNN.classify(model, pair1, distance_fn=cityblock))
```

Output:

```
yxcE, yxcD: 1
```

We can also pass a weight function via the weight\_fn argument to give closer pairs a higher weight:

<figure><img src="https://498238201-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FWuHhstIreJ3jFvE4gQ3y%2Fuploads%2Fb8jHIOVKB1QCgInSFRRL%2Fimage.png?alt=media&#x26;token=a879d4a1-4266-4dc7-86fc-ea100125b3e3" alt=""><figcaption></figcaption></figure>

```
import math

def weight(x1, x2):
    # Exponentially decaying weight based on the city-block distance
    assert len(x1) == 2
    assert len(x2) == 2
    return math.exp(-abs(x1[0] - x2[0]) - abs(x1[1] - x2[1]))

x = [6, -173.143442352]
print("yxcE, yxcD:", kNN.classify(model, x, weight_fn=weight))
```

Output:

```
yxcE, yxcD: 1
```

#### Call the calculate function to calculate the confidence in predictions

This will calculate the total weight assigned to the classes OP and NOP:

```
pair1 = [6, -173.143442352]
weight = kNN.calculate(model, pair1)
print("class NOP: weight =", weight[0], "class OP: weight =", weight[1])
```

Output:

```
class NOP: weight = 0.0 class OP: weight = 3.0
```

```
pair2 = [117, -267.14]
weight = kNN.calculate(model, pair2)
print("class NOP: weight =", weight[0], "class OP: weight =", weight[1])
```

Output:

```
class NOP: weight = 2.0 class OP: weight = 1.0
```

With the default equal weights and k = 3, these values are simply neighbor counts: all 3 nearest neighbors of the first pair belong to class OP, while 2 of the 3 nearest neighbors of the second belong to class NOP.

#### Prediction Accuracy

As before, we can get a rough idea of the accuracy by applying the kNN model to the training data:

```
for i in range(len(ys)):
    print("True:", ys[i], "Predicted:", kNN.classify(model, xs[i]))
```

Output:

```
True: 1 Predicted: 1
True: 1 Predicted: 0
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 1 Predicted: 0
True: 0 Predicted: 0
True: 0 Predicted: 0
True: 0 Predicted: 0
True: 0 Predicted: 0
True: 0 Predicted: 0
True: 0 Predicted: 0
True: 0 Predicted: 0
```

The prediction is correct for all but two of the gene pairs.

A more reliable estimate of the prediction accuracy can be found from a leave-one-out analysis, in which the model is recalculated from the training data after removing the gene pair to be predicted:

```
k = 3
for i in range(len(ys)):
    model = kNN.train(xs[:i]+xs[i+1:], ys[:i]+ys[i+1:], k)
    print("True:", ys[i], "Predicted:", kNN.classify(model, xs[i]))
```

Output:

```
True: 1 Predicted: 1
True: 1 Predicted: 0
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 1 Predicted: 0
True: 0 Predicted: 0
True: 0 Predicted: 0
True: 0 Predicted: 1
True: 0 Predicted: 0
True: 0 Predicted: 0
True: 0 Predicted: 0
True: 0 Predicted: 1
```

The prediction is correct for 13 out of 17 gene pairs (76%).
