# K-Means Clustering

The k-means algorithm takes a set of unlabeled points and partitions them into k clusters.

The “k” in k-means is the number of clusters you ask for. If k = 5, the algorithm divides the data set into 5 clusters.

How does it work?

1. Choose the number of clusters K, for example with the elbow method
2. Randomly assign each data point to a cluster
3. Compute the coordinates of each cluster centroid
4. Compute the distance from each data point to every centroid and re-assign each point to the cluster whose centroid is closest
5. Recompute the cluster centroids
6. Repeat steps 4 and 5 until no data point switches from one cluster to another. At that point the algorithm has converged to a local optimum, which is not guaranteed to be the global one and can depend on the initial assignment.
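
The steps above can be sketched in plain Python. This is a minimal illustration rather than how scikit-learn implements `KMeans`; the function name `kmeans`, its `seed` and `max_iter` parameters, and the toy `points` are assumptions made for the example.

```python
import random

def kmeans(points, k, seed=0, max_iter=100):
    """Cluster `points` (a list of equal-length numeric sequences) into k groups."""
    rng = random.Random(seed)
    # Step 2: random initial assignment; i % k keeps every cluster non-empty.
    labels = [i % k for i in range(len(points))]
    rng.shuffle(labels)
    centroids = [None] * k
    for _ in range(max_iter):
        # Steps 3 and 5: each centroid is the mean of its current members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # a cluster that lost all its points keeps its old centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
        # Step 4: re-assign each point to the nearest centroid
        # (squared Euclidean distance gives the same ordering as Euclidean).
        new_labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Step 6: stop once no point changes cluster.
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centroids

# Hypothetical toy data: two well-separated groups of three points each.
points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
labels, centroids = kmeans(points, k=2)
print(labels)
```

Which group ends up as cluster 0 and which as cluster 1 depends on the random initial assignment; only the grouping itself is meaningful.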

#### Quick Run

Import necessary Python packages

```
import numpy
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score
from sklearn.cluster import KMeans
```

Define a list in which each entry describes one gene pair: the distance between the two genes and the similarity score of their expression profiles

```
xs =      [[-53, -200.78],
          [117, -267.14],
          [57, -163.47],
          [16, -190.30],
          [11, -220.94],
          [85, -193.94],
          [16, -182.71],
          [15, -180.41],
          [-26, -181.73],
          [58, -259.87],
          [126, -414.53],
          [191, -249.57],
          [113, -265.28],
          [145, -312.99],
          [154, -213.83],
          [147, -380.85],
          [93, -291.13]]
```

Implementation of K-Means Clustering

```
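# Fit k-means with two clusters
model = KMeans(n_clusters = 2)
model.fit(xs)
# Cluster index assigned to each gene pair
model.labels_
# Color each point by its assigned cluster
colormap = numpy.array(['Red', 'Blue'])
z = plt.scatter([i[0] for i in xs], [i[1] for i in xs], c = colormap[model.labels_])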
```

<figure><img src="/files/OGANy3MSkdhVX6WvPgNg" alt=""><figcaption></figcaption></figure>

Accuracy estimates

Define a list with the known answers: 1 if the gene pair belongs to the same operon, 0 if it belongs to different operons

```
ys =     [1,
          1,
          1,
          1,
          1,
          1,
          1,
          1,
          1,
          1,
          0,
          0,
          0,
          0,
          0,
          0,
          0]
```

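Compare the cluster labels with the known answers. Cluster numbering is arbitrary, so with two clusters the same grouping can also score 1 minus the value below if the labels come out swapped.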

```
accuracy_score(ys,model.labels_)
```

```
Out[1]: 0.9411764705882353
```
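
Because k-means numbers its clusters arbitrarily, a direct call to `accuracy_score` can penalize a correct grouping whose labels happen to be swapped. One way around this, sketched here under the assumption that there are no more clusters than known classes, is to try every cluster-to-class relabeling and keep the best agreement. The helper name `best_match_accuracy` and the small example lists are hypothetical.

```python
from itertools import permutations

def best_match_accuracy(y_true, cluster_labels):
    """Accuracy under the cluster-to-class relabeling that agrees best with y_true.

    Assumes the number of distinct clusters does not exceed the number of classes.
    """
    classes = sorted(set(y_true))
    clusters = sorted(set(cluster_labels))
    best = 0.0
    for perm in permutations(classes, len(clusters)):
        mapping = dict(zip(clusters, perm))  # one candidate relabeling
        hits = sum(mapping[c] == y for c, y in zip(cluster_labels, y_true))
        best = max(best, hits / len(y_true))
    return best

# Hypothetical example: the grouping is mostly right, but the labels are swapped.
y_true = [1, 1, 1, 0, 0]
swapped = [0, 0, 0, 1, 0]
print(best_match_accuracy(y_true, swapped))  # 0.8, whereas a direct comparison gives 0.2
```

Trying all permutations is only practical for a small number of clusters, which is the case here.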

#### Traditional Approach

Import necessary Python packages

```
import numpy
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score
from sklearn.cluster import KMeans
```

Define a list in which each entry describes one gene pair: the distance between the two genes and the similarity score of their expression profiles

```
xs =      [[-53, -200.78],
          [117, -267.14],
          [57, -163.47],
          [16, -190.30],
          [11, -220.94],
          [85, -193.94],
          [16, -182.71],
          [15, -180.41],
          [-26, -181.73],
          [58, -259.87],
          [126, -414.53],
          [191, -249.57],
          [113, -265.28],
          [145, -312.99],
          [154, -213.83],
          [147, -380.85],
          [93, -291.13]]
```

Find out the optimal number of clusters using the elbow method

```
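# Candidate numbers of clusters, with one model per candidate
Nc = range(1, 10)
kmeans = [KMeans(n_clusters=i) for i in Nc]
# score() returns the negative within-cluster sum of squares, so values closer to 0 are better
score = [m.fit(xs).score(xs) for m in kmeans]
# The elbow is the point after which adding clusters improves the score only slightly
plt.plot(Nc,score)
plt.xlabel('Number of Clusters')
plt.ylabel('Score')
plt.title('Elbow Curve')
plt.show()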
```

<figure><img src="/files/31Impwb8hjiASNzkkb8g" alt=""><figcaption></figcaption></figure>
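
The quantity behind the elbow curve is the within-cluster sum of squared distances; `KMeans.score` reports its negative. The sketch below computes it by hand for a tiny data set to show why it shrinks as clusters are added. The function name `within_cluster_ss` and the four example points are assumptions for illustration.

```python
def within_cluster_ss(points, labels, centroids):
    """Sum of squared distances from each point to the centroid of its cluster."""
    total = 0.0
    for p, lab in zip(points, labels):
        total += sum((a - b) ** 2 for a, b in zip(p, centroids[lab]))
    return total

# Hypothetical data: two pairs of points, far apart from each other.
points = [[0, 0], [0, 2], [10, 0], [10, 2]]

# One cluster: every point is measured against the overall mean (5, 1).
one_cluster = within_cluster_ss(points, [0, 0, 0, 0], [[5, 1]])

# Two clusters: each pair gets its own centroid.
two_clusters = within_cluster_ss(points, [0, 0, 1, 1], [[0, 1], [10, 1]])

print(one_cluster, two_clusters)  # 104.0 4.0
```

The large drop from one cluster to two, followed by much smaller gains for further clusters, is what produces the bend in the curve.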

Implementation of K-Means Clustering

```
model = KMeans(n_clusters = 2)
model.fit(xs)
model.labels_
colormap = numpy.array(['Red', 'Blue'])
z = plt.scatter([i[0] for i in xs], [i[1] for i in xs], c = colormap[model.labels_])
```

<figure><img src="/files/WJyqSLj4GlRbKfFudFow" alt=""><figcaption></figcaption></figure>

Accuracy estimates

Define a list with the known answers: 1 if the gene pair belongs to the same operon, 0 if it belongs to different operons

```
ys =     [1,
          1,
          1,
          1,
          1,
          1,
          1,
          1,
          1,
          1,
          0,
          0,
          0,
          0,
          0,
          0,
          0]
```

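Compare the cluster labels with the known answers. Cluster numbering is arbitrary, so with two clusters the same grouping can also score 1 minus the value below if the labels come out swapped.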

```
accuracy_score(ys,model.labels_)
```

```
Out[1]: 0.9411764705882353
```


