K-Nearest Neighbors

A Quick Review

  1. The k-nearest neighbors method is a supervised learning approach that does not need to fit a model to the data.

  2. Data points are classified based on the categories of the k nearest neighbors in the training data set.

  3. In Biopython, the k-nearest neighbors method is available in Bio.kNN.

  4. k is the number of neighbors that will be considered in the classification.

  5. For classification into two classes, choosing an odd number for k lets you avoid tied votes.

  6. There are no exact physical or biological rules for defining the best value of k.

  7. Apply different values of k to one part of the training data, then evaluate the performance on the other part.

  8. Low values of k, such as 1 or 2, can be noisy and sensitive to outliers.

  9. High values of k smooth over noise, but k should not be too large, or small clusters will be outweighed by larger ones.
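Points 7–9 above can be sketched as a simple hold-out experiment: classify a held-out slice of the training data with several values of k and keep the one that performs best. The helper function and toy data below are illustrative only, not part of Bio.kNN.

```python
import math
from collections import Counter

def knn_classify(xs, ys, query, k):
    """Classify `query` by majority vote among its k nearest neighbors."""
    nearest = sorted(range(len(xs)), key=lambda i: math.dist(xs[i], query))
    votes = Counter(ys[i] for i in nearest[:k])
    return votes.most_common(1)[0][0]

# Toy data: class 1 near the origin, class 0 further out.
train_xs = [[0, 0], [1, 0], [0, 1], [5, 5], [6, 5], [5, 6]]
train_ys = [1, 1, 1, 0, 0, 0]
held_xs = [[1, 1], [6, 6]]   # held-out part of the training data
held_ys = [1, 0]

for k in (1, 3, 5):
    correct = sum(knn_classify(train_xs, train_ys, x, k) == y
                  for x, y in zip(held_xs, held_ys))
    print(f"k={k}: {correct}/{len(held_ys)} correct")
```

On real data you would repeat this for a range of k values and pick the k with the best held-out accuracy.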

Basic Flow of KNN

Import the kNN module from Biopython

 
from Bio import kNN

Define a list in which each entry contains the distance between the two genes and the similarity score of their expression profiles:

xs =      [[-53, -200.78],
          [117, -267.14],
          [57, -163.47],
          [16, -190.30],
          [11, -220.94],
          [85, -193.94],
          [16, -182.71],
          [15, -180.41],
          [-26, -181.73],
          [58, -259.87],
          [126, -414.53],
          [191, -249.57],
          [113, -265.28],
          [145, -312.99],
          [154, -213.83],
          [147, -380.85],
          [93, -291.13]]

Define a list that specifies whether each gene pair belongs to the same operon (1) or different operons (0):

ys =     [1,
          1,
          1,
          1,
          1,
          1,
          1,
          1,
          1,
          1,
          0,
          0,
          0,
          0,
          0,
          0,
          0]

Define the number of nearest neighbors. Choosing an odd number for k lets you avoid tied votes:

k = 3

Create and initialize a k-nearest neighbors model.

The function name train is a bit misleading, since no actual model training is done: this function simply stores xs, ys, and k in model.

model = kNN.train(xs, ys, k)
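Conceptually, then, kNN.train just records the data for later use. The class below is a hypothetical stand-in to make that point, not the real Bio.kNN object (the attribute names are illustrative):

```python
# A minimal stand-in sketch of what a kNN "model" needs to hold;
# no fitting happens at construction time.
class SimpleKNNModel:
    def __init__(self, xs, ys, k):
        self.classes = set(ys)  # distinct class labels, e.g. {0, 1}
        self.xs = xs            # training points
        self.ys = ys            # class label of each training point
        self.k = k              # number of neighbors to consult

toy = SimpleKNNModel([[1, 2], [3, 4]], [0, 1], 3)
print(toy.classes)  # {0, 1}
```

All the real work happens later, at classification time, when distances to the stored points are computed.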

Use the k-nearest neighbors model for classification. Classify the gene pair yxcE and yxcD:

pair1 = [6, -173.143442352]
kNN.classify(model, pair1)
Out[1]: 1

Classify yxiB and yxiA:

pair2 = [309, -271.005880394]
kNN.classify(model, pair2)
Out[2]: 0

This is consistent with the results from logistic regression.

Hooray! Let's celebrate again!!!
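As a sanity check, the two classifications above can be reproduced with a plain-Python majority vote over Euclidean distances (to my understanding, Euclidean distance is kNN.classify's default), using the same xs and ys as above:

```python
import math
from collections import Counter

# Same training data as above: 10 same-operon pairs (1), 7 different-operon pairs (0).
xs = [[-53, -200.78], [117, -267.14], [57, -163.47], [16, -190.30],
      [11, -220.94], [85, -193.94], [16, -182.71], [15, -180.41],
      [-26, -181.73], [58, -259.87], [126, -414.53], [191, -249.57],
      [113, -265.28], [145, -312.99], [154, -213.83], [147, -380.85],
      [93, -291.13]]
ys = [1] * 10 + [0] * 7
k = 3

def classify(query):
    # Majority vote among the k training points closest to query.
    nearest = sorted(range(len(xs)), key=lambda i: math.dist(xs[i], query))[:k]
    return Counter(ys[i] for i in nearest).most_common(1)[0][0]

print(classify([6, -173.143442352]))    # yxcE, yxcD -> 1
print(classify([309, -271.005880394]))  # yxiB, yxiA -> 0
```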

Fancier Analyses

To run the code in this section, the following key steps must be run ahead of time:

  1. Import the kNN module from Biopython.

  2. Define the list xs containing the distance and the expression-profile similarity score for each gene pair.

  3. Define the list ys that specifies whether each gene pair belongs to the same operon (1) or different operons (0).

  4. Create the model with kNN.train(xs, ys, k).

Classify yxcE and yxcD:

pair1 = [6, -173.143442352]
print("yxcE, yxcD:", kNN.classify(model, pair1))

Output:

yxcE, yxcD: 1

Classify yxiB and yxiA:

pair2 = [309, -271.005880394]
print("yxiB, yxiA:", kNN.classify(model, pair2))

Output:

yxiB, yxiA: 0
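Beyond a hard class label, it can help to look at how the neighbors actually voted. Bio.kNN also offers kNN.calculate(model, x), which returns the vote weight for each class; the sketch below is a from-scratch analogue reporting the fraction of the k votes each class received (the exact scaling used by Bio.kNN may differ):

```python
import math
from collections import Counter

def vote_fractions(xs, ys, query, k):
    """Fraction of the k nearest neighbors that voted for each class."""
    nearest = sorted(range(len(xs)), key=lambda i: math.dist(xs[i], query))[:k]
    counts = Counter(ys[i] for i in nearest)
    return {cls: counts.get(cls, 0) / k for cls in set(ys)}

# Toy data: class 1 near the origin, class 0 further away.
xs = [[0, 0], [1, 1], [10, 10], [11, 11]]
ys = [1, 1, 0, 0]
print(vote_fractions(xs, ys, [2, 2], 3))  # class 1 gets 2 of the 3 votes
```

A lopsided vote (e.g. 3 of 3 neighbors agreeing) suggests a more confident call than a narrow 2-to-1 split.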
