Unsupervised learning applies when the data points have no external labels. Let's pretend we do not know whether the gene pairs are in the same operon; all we know is the distance between the genes in each pair and their expression similarity score. We will practice an unsupervised method, the k-means algorithm, to classify the gene pairs.
The k-means algorithm takes a set of unlabeled points and tries to group them into “k” clusters.
The “k” in k-means is the number of clusters you want to end up with. If k = 5, the data set will be partitioned into 5 clusters.
How does it work?
1. Determine the value of K using the elbow method and specify the number of clusters K
2. Randomly assign each data point to a cluster
3. Compute the coordinates of each cluster centroid
4. Compute the distance from each data point to every centroid and re-assign each point to the cluster with the closest centroid
5. Recalculate the cluster centroids
6. Repeat steps 4 and 5 until no data point switches from one cluster to another. (Note that k-means converges to a local optimum; it is not guaranteed to find the global optimum.)
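The steps above can be sketched directly in NumPy. This is an illustrative toy implementation, not the scikit-learn one used below; the function name kmeans_sketch and its parameters are made up for this example:

```python
import numpy as np

def kmeans_sketch(points, k, n_iter=100, seed=0):
    """Toy k-means following the numbered steps above."""
    points = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)
    # Initialization: use k randomly chosen points as the first centroids
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 4: assign each point to its nearest centroid
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 5: recompute each centroid as the mean of its cluster
        new_centroids = np.array([points[labels == j].mean(axis=0)
                                  for j in range(k)])
        # Step 6: stop when the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```

In practice you would use sklearn.cluster.KMeans, as in the scripts below, which adds smarter initialization and multiple restarts.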
Import necessary Python packages
Define a list containing the distance and the score of similarity in expression profile between the 2 genes
Implementation of K-Means Clustering
Accuracy estimate: define a list with the known answer for whether each gene pair belongs to the same operon (1) or to different operons (0)
Compare the cluster labels with the known answers
Import necessary Python packages
Define a list containing the distance and the score of similarity in expression profile between the 2 genes
Find out the optimal number of clusters using the elbow method
Implementation of K-Means Clustering
Accuracy estimate: define a list with the known answer for whether each gene pair belongs to the same operon (1) or to different operons (0)
Compare the cluster labels with the known answers
import numpy
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score
from sklearn.cluster import KMeans
xs = [[-53, -200.78],
[117, -267.14],
[57, -163.47],
[16, -190.30],
[11, -220.94],
[85, -193.94],
[16, -182.71],
[15, -180.41],
[-26, -181.73],
[58, -259.87],
[126, -414.53],
[191, -249.57],
[113, -265.28],
[145, -312.99],
[154, -213.83],
[147, -380.85],
[93, -291.13]]
model = KMeans(n_clusters = 2)
model.fit(xs)
model.labels_
colormap = numpy.array(['Red', 'Blue'])
z = plt.scatter([i[0] for i in xs], [i[1] for i in xs], c = colormap[model.labels_])
ys = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
accuracy_score(ys,model.labels_)
Out[1]: 0.9411764705882353
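One caveat when scoring against the known answers: k-means numbers its clusters arbitrarily, so on another run the 0/1 labels may come out flipped, and accuracy_score would then report 1 − accuracy. For two clusters, a simple guard (the helper name two_cluster_accuracy is ours, not scikit-learn's) is to score both labelings and keep the larger value:

```python
from sklearn.metrics import accuracy_score

def two_cluster_accuracy(true_labels, cluster_labels):
    # k-means cluster numbering is arbitrary: try the labels as-is and
    # flipped (0 <-> 1), and report whichever agrees better with the truth.
    flipped = [1 - c for c in cluster_labels]
    return max(accuracy_score(true_labels, cluster_labels),
               accuracy_score(true_labels, flipped))
```

With this guard, an accuracy of 0.94 means 16 of the 17 gene pairs were clustered consistently with the known operon assignments, regardless of which cluster happened to be numbered 0.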
import numpy
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score
from sklearn.cluster import KMeans
xs = [[-53, -200.78],
[117, -267.14],
[57, -163.47],
[16, -190.30],
[11, -220.94],
[85, -193.94],
[16, -182.71],
[15, -180.41],
[-26, -181.73],
[58, -259.87],
[126, -414.53],
[191, -249.57],
[113, -265.28],
[145, -312.99],
[154, -213.83],
[147, -380.85],
[93, -291.13]]
Nc = range(1, 10)
kmeans = [KMeans(n_clusters=i) for i in Nc]
kmeans
score = [kmeans[i].fit(xs).score(xs) for i in range(len(kmeans))]
score
plt.plot(Nc,score)
plt.xlabel('Number of Clusters')
plt.ylabel('Score')
plt.title('Elbow Curve')
plt.show()
model = KMeans(n_clusters = 2)
model.fit(xs)
model.labels_
colormap = numpy.array(['Red', 'Blue'])
z = plt.scatter([i[0] for i in xs], [i[1] for i in xs], c = colormap[model.labels_])
ys = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
accuracy_score(ys,model.labels_)
Out[1]: 0.9411764705882353
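As a side note, the score() used in the elbow loop above is simply the negative of KMeans.inertia_, the within-cluster sum of squared distances. An equivalent sketch using inertia_ directly (with n_init and random_state set explicitly for reproducibility):

```python
from sklearn.cluster import KMeans

xs = [[-53, -200.78], [117, -267.14], [57, -163.47], [16, -190.30],
      [11, -220.94], [85, -193.94], [16, -182.71], [15, -180.41],
      [-26, -181.73], [58, -259.87], [126, -414.53], [191, -249.57],
      [113, -265.28], [145, -312.99], [154, -213.83], [147, -380.85],
      [93, -291.13]]

# Within-cluster sum of squares for k = 1..9. Inertia always decreases as
# k grows; the "elbow" is the k where the decrease slows sharply.
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(xs).inertia_
            for k in range(1, 10)]
```

Plotting inertias against k reproduces the elbow curve above, just mirrored about the x-axis.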