# K-Means Clustering

The k-means algorithm takes a set of unlabeled points and groups them into k clusters.

The “k” in k-means is the number of clusters you ask for. With k = 5, the algorithm partitions the data set into 5 clusters.

How does it work?

1. Choose the number of clusters K, for example with the elbow method
2. Randomly assign each data point to a cluster
3. Compute the centroid (the mean of the member points) of each cluster
4. Compute the distance from each data point to every centroid and re-assign each point to the cluster whose centroid is closest
5. Recompute the cluster centroids
6. Repeat steps 4 and 5 until no data point switches cluster. The result is a local optimum, not necessarily the global one, so different random starts can give different clusterings.
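
The steps above can be sketched in plain Python. This is a minimal illustration, not scikit-learn's implementation: the function name `kmeans_sketch`, the toy `points`, and the handling of empty clusters (re-seeding from a random point) are assumptions made for the example.

```python
import math
import random


def kmeans_sketch(points, k, seed=0, max_iter=100):
    """Toy k-means following the numbered steps; returns (labels, centroids)."""
    rng = random.Random(seed)
    # Step 2: randomly assign each point to one of the k clusters.
    labels = [rng.randrange(k) for _ in points]
    centroids = [None] * k

    for _ in range(max_iter):
        # Steps 3 and 5: centroid = coordinate-wise mean of the cluster members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if not members:
                # Assumption: an empty cluster is re-seeded from a random point.
                members = [rng.choice(points)]
            centroids[c] = [sum(col) / len(members) for col in zip(*members)]

        # Step 4: re-assign each point to the nearest centroid (Euclidean distance).
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]

        # Step 6: stop once no point switches cluster.
        if new_labels == labels:
            break
        labels = new_labels

    return labels, centroids


# Hypothetical toy data: two well-separated groups of three points each.
points = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
          [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
labels, centroids = kmeans_sketch(points, k=2)
print(labels)
print(centroids)
```

Which group ends up as cluster 0 and which as cluster 1 depends on the random start; only the grouping itself is meaningful.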

#### Quick Run

Import necessary Python packages

```
import numpy
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score
from sklearn.cluster import KMeans
```

Define a list in which each entry holds, for one pair of genes, the distance between them and the similarity score of their expression profiles

```
xs =      [[-53, -200.78],
          [117, -267.14],
          [57, -163.47],
          [16, -190.30],
          [11, -220.94],
          [85, -193.94],
          [16, -182.71],
          [15, -180.41],
          [-26, -181.73],
          [58, -259.87],
          [126, -414.53],
          [191, -249.57],
          [113, -265.28],
          [145, -312.99],
          [154, -213.83],
          [147, -380.85],
          [93, -291.13]]
```

Implementation of K-Means Clustering

```
model = KMeans(n_clusters = 2)
model.fit(xs)
model.labels_  # cluster index (0 or 1) assigned to each gene pair
# Color each point by its cluster
colormap = numpy.array(['Red', 'Blue'])
z = plt.scatter([i[0] for i in xs], [i[1] for i in xs], c = colormap[model.labels_])
```

<figure><img src="https://498238201-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FWuHhstIreJ3jFvE4gQ3y%2Fuploads%2FbRw8jECWm3IWo6UD5huD%2Fimage.png?alt=media&#x26;token=b18fc7ab-0ec7-45e6-8f8e-2272fdabeaa3" alt=""><figcaption></figcaption></figure>

Accuracy estimates

Define a list with the known answer for each gene pair: whether it belongs to the same operon (1) or to different operons (0)

```
ys =     [1,
          1,
          1,
          1,
          1,
          1,
          1,
          1,
          1,
          1,
          0,
          0,
          0,
          0,
          0,
          0,
          0]
```

Compare cluster labels with the known answer

```
accuracy_score(ys,model.labels_)
```

```
Out[1]: 0.9411764705882353
```
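
K-means numbers its clusters arbitrarily, so the same grouping can come back with the IDs 0 and 1 swapped. In that case `accuracy_score` reports the complement of the value above (about 0.06 instead of 0.94). One way to guard against this, sketched below, is to score every renaming of the cluster IDs and keep the best. The helper `best_match_accuracy` and the `swapped_labels` list are hypothetical names and data for illustration, not part of scikit-learn or of the run above, and the sketch assumes there are no more clusters than classes.

```python
from itertools import permutations


def best_match_accuracy(y_true, cluster_labels):
    """Highest accuracy over all ways of renaming the cluster IDs."""
    ids = sorted(set(cluster_labels))
    classes = sorted(set(y_true))
    best = 0.0
    # Try every mapping of cluster IDs onto class labels.
    for perm in permutations(classes, len(ids)):
        mapping = dict(zip(ids, perm))
        hits = sum(mapping[c] == t for c, t in zip(cluster_labels, y_true))
        best = max(best, hits / len(y_true))
    return best


ys = [1] * 10 + [0] * 7
# Hypothetical clustering: a perfect grouping whose IDs are swapped relative to ys.
swapped_labels = [0] * 10 + [1] * 7

naive = sum(a == b for a, b in zip(ys, swapped_labels)) / len(ys)
print(naive)
print(best_match_accuracy(ys, swapped_labels))
```

The naive comparison scores this perfect grouping as entirely wrong, while the renaming-aware score recognizes it as a full match.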

#### Traditional Approach

Import necessary Python packages

```
import numpy
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score
from sklearn.cluster import KMeans
```

Define a list in which each entry holds, for one pair of genes, the distance between them and the similarity score of their expression profiles

```
xs =      [[-53, -200.78],
          [117, -267.14],
          [57, -163.47],
          [16, -190.30],
          [11, -220.94],
          [85, -193.94],
          [16, -182.71],
          [15, -180.41],
          [-26, -181.73],
          [58, -259.87],
          [126, -414.53],
          [191, -249.57],
          [113, -265.28],
          [145, -312.99],
          [154, -213.83],
          [147, -380.85],
          [93, -291.13]]
```

Find the optimal number of clusters using the elbow method

```
Nc = range(1, 10)
# One model per candidate number of clusters
kmeans = [KMeans(n_clusters=i) for i in Nc]
# score() returns the negative k-means objective, so values closer to 0 mean tighter clusters
score = [model.fit(xs).score(xs) for model in kmeans]
plt.plot(Nc, score)
plt.xlabel('Number of Clusters')
plt.ylabel('Score')
plt.title('Elbow Curve')
plt.show()
```

<figure><img src="https://498238201-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FWuHhstIreJ3jFvE4gQ3y%2Fuploads%2FgwyNXoO1gFQT3790GrXX%2Fimage.png?alt=media&#x26;token=3990093b-43c9-45ca-b26f-4cc2c4df742a" alt=""><figcaption></figcaption></figure>
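
The `score` method returns the negative of the k-means objective, which is why the curve rises toward zero. A fitted model exposes the same quantity with a positive sign as `inertia_`, the sum of squared distances from each sample to its closest cluster center. The sketch below lists the inertia for each K and the drop gained by each extra cluster; the elbow is where those drops become small. Fixing `n_init` and `random_state` is an assumption made here only to keep the run repeatable.

```python
from sklearn.cluster import KMeans

# Same 17 gene pairs as in xs, repeated so the block runs on its own.
xs = [[-53, -200.78], [117, -267.14], [57, -163.47], [16, -190.30],
      [11, -220.94], [85, -193.94], [16, -182.71], [15, -180.41],
      [-26, -181.73], [58, -259.87], [126, -414.53], [191, -249.57],
      [113, -265.28], [145, -312.99], [154, -213.83], [147, -380.85],
      [93, -291.13]]

ks = list(range(1, 10))
inertias = []
for k in ks:
    # n_init and random_state are fixed only to make the run repeatable.
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(xs)
    inertias.append(model.inertia_)

# Reduction in inertia gained by going from k to k + 1 clusters.
drops = [inertias[i] - inertias[i + 1] for i in range(len(inertias) - 1)]

for k, value in zip(ks, inertias):
    print(k, round(value, 1))
print([round(d, 1) for d in drops])
```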

Implementation of K-Means Clustering

```
model = KMeans(n_clusters = 2)
model.fit(xs)
model.labels_  # cluster index (0 or 1) assigned to each gene pair
# Color each point by its cluster
colormap = numpy.array(['Red', 'Blue'])
z = plt.scatter([i[0] for i in xs], [i[1] for i in xs], c = colormap[model.labels_])
```

<figure><img src="https://498238201-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FWuHhstIreJ3jFvE4gQ3y%2Fuploads%2Fc3Ic5x8laX2YtELj1PWt%2Fimage.png?alt=media&#x26;token=e99a9b5a-ce41-4fdc-93d8-7230984325c0" alt=""><figcaption></figcaption></figure>
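
Besides `labels_`, a fitted `KMeans` model exposes the centroid coordinates as `cluster_centers_` and can assign unseen points to the nearest centroid with `predict`. The sketch below shows both. The point `new_pair` is a made-up gene pair used only for illustration, and `n_init` and `random_state` are fixed here only to keep the run repeatable.

```python
from sklearn.cluster import KMeans

# Same 17 gene pairs as in xs, repeated so the block runs on its own.
xs = [[-53, -200.78], [117, -267.14], [57, -163.47], [16, -190.30],
      [11, -220.94], [85, -193.94], [16, -182.71], [15, -180.41],
      [-26, -181.73], [58, -259.87], [126, -414.53], [191, -249.57],
      [113, -265.28], [145, -312.99], [154, -213.83], [147, -380.85],
      [93, -291.13]]

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(xs)

# One row per cluster: [distance, expression similarity score] of the centroid.
print(model.cluster_centers_)

# Hypothetical new gene pair, not part of the data set.
new_pair = [[100, -300.0]]
predicted = model.predict(new_pair)
print(predicted)
```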

Accuracy estimates

Define a list with the known answer for each gene pair: whether it belongs to the same operon (1) or to different operons (0)

```
ys =     [1,
          1,
          1,
          1,
          1,
          1,
          1,
          1,
          1,
          1,
          0,
          0,
          0,
          0,
          0,
          0,
          0]
```

Compare cluster labels with the known answer

```
# Cluster IDs are arbitrary: if 0 and 1 come back swapped, this reports 1 minus the value below
accuracy_score(ys, model.labels_)
```

```
Out[1]: 0.9411764705882353
```
