The Logistic Regression Model


Quick Review

Basic Flow

Import the LogisticRegression module from Biopython:

from Bio import LogisticRegression

Define a list where each entry contains the intergenic distance and the expression-profile similarity score for a pair of genes:

xs = [[-53, -200.78],
      [117, -267.14],
      [57, -163.47],
      [16, -190.30],
      [11, -220.94],
      [85, -193.94],
      [16, -182.71],
      [15, -180.41],
      [-26, -181.73],
      [58, -259.87],
      [126, -414.53],
      [191, -249.57],
      [113, -265.28],
      [145, -312.99],
      [154, -213.83],
      [147, -380.85],
      [93, -291.13]]

Define a list that specifies whether each gene pair belongs to the same operon (1) or to different operons (0):

ys = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]

Run logistic regression on the training data to estimate the coefficients:

model = LogisticRegression.train(xs, ys)

Inspect the weights estimated by the trained logistic regression model:

model.beta
Out[1]: [8.983029015714463, -0.03596896044485087, 0.02181395662983518]
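
These three numbers are the intercept and the two feature weights of the logistic model: the probability that a pair belongs to the same operon is exp(b0 + b1*x1 + b2*x2) / (1 + exp(b0 + b1*x1 + b2*x2)). A minimal sketch that reproduces the classifier's confidence by hand, assuming this standard logistic form and using the yxcE, yxcD features from the prediction below:

import math

# beta[0] is the intercept; beta[1] and beta[2] weight the intergenic
# distance and the expression-similarity score, respectively
beta = model.beta

x = [6, -173.143442352]  # features for the yxcE, yxcD pair
z = beta[0] + beta[1] * x[0] + beta[2] * x[1]
p_op = math.exp(z) / (1 + math.exp(z))  # probability of class 1 (OP)
print(p_op)  # ~0.993, i.e. confidently classified as one operon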

Using the logistic regression model for classification

Let's predict the gene pair yxcE, yxcD:

LogisticRegression.classify(model, [6, -173.143442352])
Out[2]: 1
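
A result of 1 means the pair is predicted to belong to the same operon; 0 means different operons.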

Let's predict the gene pair yxiB, yxiA:

LogisticRegression.classify(model, [309, -271.005880394])
Out[3]: 0

Both predictions agree with the biological literature.

Hooray! Let's celebrate!!!

Fancier Analyses

To run the code in this section, the following key steps must be completed ahead of time:

  1. Import the LogisticRegression module from Biopython

  2. Define the list xs containing the intergenic distance and the expression-profile similarity score for each gene pair

  3. Define the list ys that specifies whether each gene pair belongs to the same operon (1) or to different operons (0)

Using optional arguments in the train function

We can use the update_fn argument of the train function to track the progress of the model calculation.

First, define a function show_progress that prints the desired output:

def show_progress(iteration, loglikelihood):
    print("Iteration:", iteration, "Log-likelihood function:", loglikelihood)

Then call the train function with the update_fn argument to output the log-likelihood at each iteration:

model = LogisticRegression.train(xs, ys, update_fn=show_progress)

The iterations stop once the increase in the log-likelihood function falls below 0.01. If no convergence is reached after 500 iterations, the train function exits with an AssertionError.
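
If there is a chance that training will not converge, the call can be guarded; a minimal sketch (the fallback handling is illustrative):

try:
    model = LogisticRegression.train(xs, ys, update_fn=show_progress)
except AssertionError:
    # Training failed to converge within 500 iterations; consider
    # inspecting or rescaling the features before retrying.
    print("Logistic regression training did not converge")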

We can also pass the typecode argument to the train function to save memory: for very large data sets, it may be necessary to use single-precision floats rather than doubles.

model = LogisticRegression.train(xs, ys, typecode="float")

Take a look at the estimated weights

model.beta
Out[1]: [8.983029015714463, -0.03596896044485087, 0.02181395662983518]

Print the classification results in a more readable way:

print("yxcE, yxcD:", LogisticRegression.classify(model, [6, -173.143442352]))

Output

yxcE, yxcD: 1

print("yxiB, yxiA:", LogisticRegression.classify(model, [309, -271.005880394]))

Output

yxiB, yxiA: 0

Obtain the probabilities

Obtain the probabilities to find out how confident we can be in these predictions:

# calculate returns the probability of each class; here class 1 = OP
q, p = LogisticRegression.calculate(model, [309, -271.005880394])
print("class OP: probability =", p, "class NOP: probability =", q)

Output

class OP: probability = 0.00032121125181733316 class NOP: probability = 0.9996787887481826
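
In other words, the model is nearly certain that yxiB and yxiA belong to different operons.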

Prediction Accuracy

To get some idea of the prediction accuracy of the logistic regression model, we can apply it to the training data:

for i in range(len(ys)):
    print("True:", ys[i], "Predicted:", LogisticRegression.classify(model, xs[i]))

Output

True: 1 Predicted: 1
True: 1 Predicted: 0
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 0 Predicted: 0
True: 0 Predicted: 0
True: 0 Predicted: 0
True: 0 Predicted: 0
True: 0 Predicted: 0
True: 0 Predicted: 0
True: 0 Predicted: 0
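
The prediction is correct for all but one of the gene pairs.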

A more reliable estimate of the prediction accuracy can be obtained from a leave-one-out analysis, in which the model is recalculated from the training data after removing the gene pair to be predicted:

for i in range(len(ys)):
    model = LogisticRegression.train(xs[:i]+xs[i+1:], ys[:i]+ys[i+1:])
    print("True:", ys[i], "Predicted:", LogisticRegression.classify(model, xs[i]))

Output

True: 1 Predicted: 1
True: 1 Predicted: 0
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 0 Predicted: 0
True: 0 Predicted: 0
True: 0 Predicted: 0
True: 0 Predicted: 0
True: 0 Predicted: 1
True: 0 Predicted: 0
True: 0 Predicted: 0
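
Here the prediction is correct for 15 out of 17 gene pairs (88%). Rather than tallying the printed list by eye, the accuracy can also be computed directly; a minimal sketch:

correct = 0
for i in range(len(ys)):
    model = LogisticRegression.train(xs[:i] + xs[i+1:], ys[:i] + ys[i+1:])
    if LogisticRegression.classify(model, xs[i]) == ys[i]:
        correct += 1
print("Leave-one-out accuracy: %d/%d" % (correct, len(ys)))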

Using optional arguments in the classify function to specify distance and weight
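
Note: the examples below use the classify function from the Bio.kNN module (the subject of the next section), which accepts the optional distance_fn and weight_fn arguments; LogisticRegression.classify does not. A k-nearest-neighbors model must therefore be trained first. A minimal setup sketch, reusing xs and ys from above with k = 3 (the same value used in the leave-one-out analysis at the end of this section); math is imported here for the weight function defined later:

import math

from Bio import kNN

k = 3
model = kNN.train(xs, ys, k)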

By default, the Euclidean distance is used.

Instead, we could use the city-block (Manhattan) distance:

def cityblock(x1, x2):
    # City-block (Manhattan) distance between two 2-dimensional points
    assert len(x1) == 2
    assert len(x2) == 2
    distance = abs(x1[0] - x2[0]) + abs(x1[1] - x2[1])
    return distance

pair1 = [6, -173.143442352]
print("yxcE, yxcD:", kNN.classify(model, pair1, distance_fn=cityblock))

Output:

yxcE, yxcD: 1

Using a weight function to assign higher weights to closer pairs:

def weight(x1, x2):
    # Exponentially down-weight neighbors by their city-block distance
    assert len(x1) == 2
    assert len(x2) == 2
    return math.exp(-abs(x1[0] - x2[0]) - abs(x1[1] - x2[1]))

x = [6, -173.143442352]
print("yxcE, yxcD:", kNN.classify(model, x, weight_fn = weight))

Output:

yxcE, yxcD: 1

Call the calculate function to assess the confidence in predictions

This will calculate the total weight assigned to the classes OP and NOP:

pair1 = [6, -173.143442352]
weight = kNN.calculate(model, pair1)
print("class NOP: weight =", weight[0], "class OP: weight =", weight[1])

Output:

class NOP: weight = 0.0 class OP: weight = 3.0

pair2 = [117, -267.14]
weight = kNN.calculate(model, pair2)
print("class NOP: weight =", weight[0], "class OP: weight =", weight[1])

Output:

class NOP: weight = 2.0 class OP: weight = 1.0
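
With the default equal weighting and k = 3, dividing these totals by k gives rough class probabilities: 2/3 for NOP versus 1/3 for OP, so the model is noticeably less confident about this second pair.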

Prediction Accuracy

To get some idea of the prediction accuracy of the kNN model, we can apply it to the training data:

for i in range(len(ys)):
    print("True:", ys[i], "Predicted:", kNN.classify(model, xs[i]))

Output:

True: 1 Predicted: 1
True: 1 Predicted: 0
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 1 Predicted: 0
True: 0 Predicted: 0
True: 0 Predicted: 0
True: 0 Predicted: 0
True: 0 Predicted: 0
True: 0 Predicted: 0
True: 0 Predicted: 0
True: 0 Predicted: 0

The prediction is correct for all but two of the gene pairs.

A more reliable estimate of the prediction accuracy can be obtained from a leave-one-out analysis, in which the model is recalculated from the training data after removing the gene pair to be predicted:

k = 3
for i in range(len(ys)):
    model = kNN.train(xs[:i]+xs[i+1:], ys[:i]+ys[i+1:], k)
    print("True:", ys[i], "Predicted:", kNN.classify(model, xs[i]))

Output:

True: 1 Predicted: 1
True: 1 Predicted: 0
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 1 Predicted: 0
True: 0 Predicted: 0
True: 0 Predicted: 0
True: 0 Predicted: 1
True: 0 Predicted: 0
True: 0 Predicted: 0
True: 0 Predicted: 0
True: 0 Predicted: 1

The prediction is correct for 13 out of 17 gene pairs (76%).