The Logistic Regression Model
Quick Review





Basic Flow
Import the LogisticRegression module from Biopython
from Bio import LogisticRegression
Define a list containing, for each gene pair, the intergenic distance and the similarity score of the expression profiles of the two genes
xs = [[-53, -200.78],
[117, -267.14],
[57, -163.47],
[16, -190.30],
[11, -220.94],
[85, -193.94],
[16, -182.71],
[15, -180.41],
[-26, -181.73],
[58, -259.87],
[126, -414.53],
[191, -249.57],
[113, -265.28],
[145, -312.99],
[154, -213.83],
[147, -380.85],
[93, -291.13]]
Define a list specifying whether each gene pair belongs to the same operon (1) or to different operons (0)
ys = [1,
1,
1,
1,
1,
1,
1,
1,
1,
1,
0,
0,
0,
0,
0,
0,
0]
Run logistic regression on the training data to estimate the coefficients
model = LogisticRegression.train(xs, ys)
Output the weights estimated by the trained logistic regression model
model.beta
Out[1]: [8.983029015714463, -0.03596896044485087, 0.02181395662983518]
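The three weights are the intercept followed by the coefficients for the two predictors (intergenic distance and expression-profile similarity). As a rough sketch of how the model uses them, the OP probability for a gene pair can be reconstructed by hand with the standard logistic formula; the helper below is our own illustration, not part of Biopython, and its result should agree with the class-1 probability returned by LogisticRegression.calculate (used later in this section):
import math

def op_probability(beta, x):
    # Standard logistic formula: P(OP) = 1 / (1 + exp(-(b0 + b1*x1 + b2*x2)))
    logit = beta[0] + beta[1] * x[0] + beta[2] * x[1]
    return 1.0 / (1.0 + math.exp(-logit))

# Compare against LogisticRegression.calculate(model, xs[0])
print(op_probability(model.beta, xs[0]))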
Using the logistic regression model for classification
Let's predict gene pair yxcE — yxcD
LogisticRegression.classify(model, [6, -173.143442352])
Out[2]: 1
Let's predict gene pair yxiB — yxiA
LogisticRegression.classify(model, [309, -271.005880394])
Out[3]: 0
The above prediction agrees with the biological literature
Hooray! Let's celebrate!!!
Fancier Analyses
To run the code in this section, the following key steps must be run beforehand
Import the LogisticRegression module from Biopython
Define a list containing, for each gene pair, the intergenic distance and the similarity score of the expression profiles of the two genes
Define a list specifying whether each gene pair belongs to the same operon (1) or to different operons (0)
Using arguments in the train function
We can use the update_fn argument in the train function to track the progress of the model calculation
First define a function show_progress to print the desired output
def show_progress(iteration, loglikelihood):
    print("Log-likelihood function:", loglikelihood)
Then call the train function with the update_fn argument to output the log-likelihood at each iteration
model = LogisticRegression.train(xs, ys, update_fn=show_progress)
The iteration stops once the increase in the log-likelihood function is less than 0.01. If no convergence is reached after 500 iterations, the train function returns with an AssertionError.
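If you would rather keep the values than print them, the same callback mechanism can collect them in a list. This is a small sketch using our own helper record_progress, with the same (iteration, loglikelihood) signature as show_progress above:
loglikelihoods = []

def record_progress(iteration, loglikelihood):
    # Store the log-likelihood reached at each iteration
    loglikelihoods.append(loglikelihood)

model = LogisticRegression.train(xs, ys, update_fn=record_progress)
print("Number of iterations:", len(loglikelihoods))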
We can also use the typecode argument in the train function to save memory: for very large data sets it may be necessary to use single-precision floats rather than doubles
model = LogisticRegression.train(xs, ys, typecode="float")
Take a look at the estimated weights
model.beta
Out[1]: [8.983029015714463, -0.03596896044485087, 0.02181395662983518]
Print the classification results in more intuitive ways
print("yxcE, yxcD:", LogisticRegression.classify(model, [6, -173.143442352]))
Output
yxcE, yxcD: 1
print("yxiB, yxiA:", LogisticRegression.classify(model, [309, -271.005880394]))
Output
yxiB, yxiA: 0
Obtain the probabilities
Obtain the probabilities to find out how confident we can be in these predictions
q, p = LogisticRegression.calculate(model, [309, -271.005880394])
print("class OP: probability =", p, "class NOP: probability =", q)
Output
class OP: probability = 0.00032121125181733316 class NOP: probability = 0.9996787887481826
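The same call can be made for the yxcE, yxcD pair, which the model classified as OP; plugging the reported beta values into the logistic formula suggests the OP probability should come out close to 1:
q, p = LogisticRegression.calculate(model, [6, -173.143442352])
print("class OP: probability =", p, "class NOP: probability =", q)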
Prediction Accuracy
To get some idea of the prediction accuracy of the logistic regression model, we can apply it to the training data:
for i in range(len(ys)):
    print("True:", ys[i], "Predicted:", LogisticRegression.classify(model, xs[i]))
Output
True: 1 Predicted: 1
True: 1 Predicted: 0
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 0 Predicted: 0
True: 0 Predicted: 0
True: 0 Predicted: 0
True: 0 Predicted: 0
True: 0 Predicted: 0
True: 0 Predicted: 0
True: 0 Predicted: 0
A more reliable estimate of the prediction accuracy can be obtained from a leave-one-out analysis, in which the model is recalculated from the training data after removing the gene pair to be predicted:
for i in range(len(ys)):
    model = LogisticRegression.train(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
    print("True:", ys[i], "Predicted:", LogisticRegression.classify(model, xs[i]))
Output
True: 1 Predicted: 1
True: 1 Predicted: 0
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 0 Predicted: 0
True: 0 Predicted: 0
True: 0 Predicted: 0
True: 0 Predicted: 0
True: 0 Predicted: 1
True: 0 Predicted: 0
True: 0 Predicted: 0
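Rather than eyeballing the printed lines, the leave-one-out loop can also tally how many predictions match the true labels. This is a small sketch; with the output shown above it would report 15 correct out of 17:
correct = 0
for i in range(len(ys)):
    model = LogisticRegression.train(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
    if LogisticRegression.classify(model, xs[i]) == ys[i]:
        correct += 1
print("Leave-one-out accuracy: %d/%d" % (correct, len(ys)))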
Using optional arguments in the classify function to specify distance and weight functions
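Note that the examples below call the k-nearest neighbors classifier (Bio.kNN) rather than Bio.LogisticRegression, so they assume a kNN model has already been trained on the same xs and ys, for example with k = 3 as in the leave-one-out analysis at the end of this section:
from Bio import kNN

k = 3
model = kNN.train(xs, ys, k)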
By default, the Euclidean distance is used.

Instead, we could use the city-block (Manhattan) distance:
def cityblock(x1, x2):
    assert len(x1) == 2
    assert len(x2) == 2
    distance = abs(x1[0] - x2[0]) + abs(x1[1] - x2[1])
    return distance

pair1 = [6, -173.143442352]
print("yxcE, yxcD:", kNN.classify(model, pair1, distance_fn=cityblock))
Output:
yxcE, yxcD: 1
Using a weight function to assign higher weights to closer pairs

import math

def weight(x1, x2):
    assert len(x1) == 2
    assert len(x2) == 2
    return math.exp(-abs(x1[0] - x2[0]) - abs(x1[1] - x2[1]))

x = [6, -173.143442352]
print("yxcE, yxcD:", kNN.classify(model, x, weight_fn=weight))
Output:
yxcE, yxcD: 1
Call the calculate function to assess how confident we can be in the predictions
This will calculate the total weight assigned to the classes OP and NOP
pair1 = [6, -173.143442352]
weight = kNN.calculate(model, pair1)
print("class NOP: weight =", weight[0], "class OP: weight =", weight[1])
Output:
class NOP: weight = 0.0 class OP: weight = 3.0
pair2 = [117, -267.14]
weight = kNN.calculate(model, pair2)
print("class NOP: weight =", weight[0], "class OP: weight =", weight[1])
Output:
class NOP: weight = 2.0 class OP: weight = 1.0
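These weights can be turned into rough confidence fractions by dividing each by their total; the default weight function appears to give each of the k neighbors a weight of 1 (the totals above sum to 3 with k = 3), so for pair2 this works out to 2/3 for NOP versus 1/3 for OP. A small sketch, not a built-in Bio.kNN feature:
weight = kNN.calculate(model, pair2)
total = weight[0] + weight[1]
print("class NOP: fraction =", weight[0] / total,
      "class OP: fraction =", weight[1] / total)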
Prediction Accuracy
To get some idea of the prediction accuracy of the kNN model, we can apply it to the training data:
for i in range(len(ys)):
    print("True:", ys[i], "Predicted:", kNN.classify(model, xs[i]))
Output:
True: 1 Predicted: 1
True: 1 Predicted: 0
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 1 Predicted: 0
True: 0 Predicted: 0
True: 0 Predicted: 0
True: 0 Predicted: 0
True: 0 Predicted: 0
True: 0 Predicted: 0
True: 0 Predicted: 0
True: 0 Predicted: 0
The prediction is correct for all but two of the gene pairs
A more reliable estimate of the prediction accuracy can be obtained from a leave-one-out analysis, in which the model is recalculated from the training data after removing the gene pair to be predicted:
k = 3
for i in range(len(ys)):
    model = kNN.train(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:], k)
    print("True:", ys[i], "Predicted:", kNN.classify(model, xs[i]))
Output:
True: 1 Predicted: 1
True: 1 Predicted: 0
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 1 Predicted: 0
True: 0 Predicted: 0
True: 0 Predicted: 0
True: 0 Predicted: 1
True: 0 Predicted: 0
True: 0 Predicted: 0
True: 0 Predicted: 0
True: 0 Predicted: 1
The prediction is correct for 13 out of 17 gene pairs (76%).