## K Means Clustering | Step-by-Step Tutorials for Clustering in Data Analysis

Introduction

K-means is one of the most popular unsupervised machine learning algorithms, used for solving clustering problems in data science, and is very important if you are aiming for a data scientist role. K-means segregates unlabeled data into groups, called clusters, based on similar features and common patterns. This tutorial covers the definition and applications of clustering, focusing on the K-means clustering algorithm and its implementation in Python. It also shows how to choose the optimal number of clusters for a dataset.

Learning Objectives

- Understand what the K-means clustering algorithm is.
- Develop a good understanding of the steps involved in implementing the K-Means algorithm and finding the optimal number of clusters.
- Implement K-means clustering in Python with the scikit-learn library.

This article was published as a part of the Data Science Blogathon.


## Implementation of the K-Means Algorithm

The implementation and working of the K-Means algorithm are explained in the steps below:

Step 1: Select the value of K to decide the number of clusters (n_clusters) to be formed.

Step 2: Select K random points that will act as cluster centroids (cluster_centers).

Step 3: Assign each data point to its nearest centroid, forming K clusters.

Step 4: Place a new centroid at the mean of each cluster.

Step 5: Reassign each data point to its new closest centroid.

Step 6: If any reassignment occurred, return to step 4; otherwise, the clusters are final.

The final clusters are formed like this:
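The steps above can be condensed into a short from-scratch sketch. This is a minimal NumPy illustration, not production code: it assumes every cluster keeps at least one point.

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Minimal K-means on data X of shape (n_points, n_features)."""
    rng = np.random.default_rng(seed)
    # Steps 1-2: choose K and pick K random data points as centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 3: assign every point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: move each centroid to the mean of its assigned points
        # (assumes no cluster ends up empty)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Steps 5-6: stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels
```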

## Elbow Method

Here are the steps to follow in order to find the optimal number of clusters using the elbow method:

Step 1: Run the K-means algorithm for a range of K values (for example, 1 to 10).

Step 2: For each value of K, calculate the WCSS (within-cluster sum of squares) value.

Step 3: Plot a curve of the WCSS values against the corresponding number of clusters K.

Step 4: The point where the curve bends sharply, like an elbow, indicates the optimal value of K.

## WCSS and Elbow Method

This method shows that 3 is a good number of clusters.
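As a sketch of how this works with scikit-learn: the `inertia_` attribute of a fitted `KMeans` model is exactly the WCSS. The synthetic-data parameters below are illustrative, not from the tutorial.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Illustrative synthetic data with three true clusters
X, _ = make_blobs(n_samples=200, centers=3, random_state=42)

# WCSS for K = 1..10; scikit-learn exposes it as inertia_
wcss = []
for k in range(1, 11):
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    wcss.append(model.inertia_)

# WCSS keeps shrinking as K grows; the "elbow" where the drop levels
# off suggests the number of clusters to use.
```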

- K-means is a widely used unsupervised machine learning algorithm for clustering data into groups (also known as clusters) of similar objects.
- The objective is to minimize the sum of squared distances between the objects and their respective cluster centroids.
- The k-means clustering algorithm is limited in that it cannot handle complex, non-linearly separable data.

## Frequently Asked Questions

## Q1. What is meant by n_init in K-means clustering?

## Q2. What are the advantages and disadvantages of K-means Clustering?

## Q3. What is meant by random_state in k-means clustering?

## About the Author

## Pranshu Sharma




## Create a K-Means Clustering Algorithm from Scratch in Python

Cement your knowledge of k-means clustering by implementing it yourself.

## Introduction

Suppose you have a dataset of 2-dimensional scalar attributes:

Figure 1: A dataset of points with groups of distinct attributes.

Figure 2: The data points are segmented into groups denoted with differing colors.

- The distance from each point to each centroid is calculated.
- Points are assigned to their nearest centroid.
- Centroids are shifted to the average position of the points assigned to them. If the centroids did not move, the algorithm is finished; otherwise, repeat.
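The assignment and update steps in the list above can be written compactly with NumPy broadcasting. The points and centroids in this one-iteration example are arbitrary toy values.

```python
import numpy as np

# Arbitrary toy points and current centroids
X = np.array([[1.0, 2.0], [1.5, 1.8], [8.0, 8.0], [9.0, 9.5]])
centroids = np.array([[1.0, 1.0], [9.0, 9.0]])

# Distance from each point to each centroid: shape (n_points, n_centroids)
dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)

# Each point is assigned to its nearest centroid: here [0, 0, 1, 1]
labels = dists.argmin(axis=1)

# Each centroid shifts to the average of the points assigned to it
centroids = np.array([X[labels == j].mean(axis=0) for j in range(len(centroids))])
```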

Figure 3: The dataset we will use to evaluate our k means clustering model.

## Model Creation

## Implementation

Now, the bulk of the algorithm is performed when fitting the model to a training dataset.

Next, we perform the iterative process of optimizing the centroid locations.

Before beginning the while loop, we’ll initialize the variables used in the exit conditions.

## First Model Evaluation

Figure 4: A failed example where one centroid has no points, and one contains two clusters.

## Re-evaluating Centroid Initialization

- If a centroid is initialized far from any groups, it is unlikely to move. (Example: the bottom right centroid in Figure 4 .)
- If centroids are initialized too close, they’re unlikely to diverge from one another. (Example: the two centroids in the green group in Figure 6 .)

- Initialize the first centroid as a random selection of one of the data points.
- Calculate the sum of the distances between each data point and all the centroids.
- Select the next centroid randomly, with a probability proportional to the total distance to the centroids.
- Return to step 2. Repeat until all centroids have been initialized.
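A sketch of this initialization scheme as described above. Note that standard k-means++ weights by the squared distance to the *nearest* centroid; the variant described here uses the total distance to all centroids chosen so far.

```python
import numpy as np

def init_centroids(X, k, seed=0):
    """Sample k initial centroids from the rows of X, following the
    steps above: each new centroid is chosen with probability
    proportional to its total distance from the centroids so far."""
    rng = np.random.default_rng(seed)
    # Step 1: the first centroid is a randomly chosen data point
    centroids = [X[rng.integers(len(X))]]
    while len(centroids) < k:
        # Step 2: total distance from each point to all current centroids
        d = np.array([np.linalg.norm(X - c, axis=1) for c in centroids]).sum(axis=0)
        # Step 3: sample the next centroid proportionally to that distance
        # (already-chosen points have distance 0, so they can't repeat)
        centroids.append(X[rng.choice(len(X), p=d / d.sum())])
    return np.array(centroids)
```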

Figure 7: An ideal convergence, after implementing the k-means++ initialization method.


## Turner Luke

## K-Means Clustering in Python: A Practical Guide

## Overview of Clustering Techniques

In this tutorial, you’ll learn:

- What k-means clustering is
- When to use k-means clustering to analyze your data
- How to implement k-means clustering in Python with scikit-learn
- How to select a meaningful number of clusters

## What Is Clustering?

Partitional clustering methods have several strengths:

- They work well when clusters have a spherical shape.
- They're scalable with respect to algorithm complexity.

They also have several weaknesses:

- They're not well suited for clusters with complex shapes and different sizes.
- They break down when used with clusters of different densities.

The strengths of hierarchical clustering methods include the following:

- They often reveal the finer details about the relationships between data objects.
- They provide an interpretable dendrogram.

The weaknesses of hierarchical clustering methods include the following:

- They're computationally expensive with respect to algorithm complexity.
- They're sensitive to noise and outliers.

The strengths of density-based clustering methods include the following:

- They excel at identifying clusters of nonspherical shapes.
- They're resistant to outliers.

The weaknesses of density-based clustering methods include the following:

- They aren't well suited for clustering in high-dimensional spaces.
- They have trouble identifying clusters of varying densities.

## How to Perform K-Means Clustering in Python

Otherwise, you can begin by installing the required packages:

This step will import the modules needed for all the code in this section:

- n_samples is the total number of samples to generate.
- centers is the number of centers to generate.
- cluster_std is the standard deviation of the clusters.

make_blobs() returns a tuple of two values:

- A two-dimensional NumPy array with the x- and y-values for each of the samples
- A one-dimensional NumPy array containing the cluster labels for each sample

Generate the synthetic data and labels:

Here’s a look at the first five elements for each of the variables returned by make_blobs() :

Take a look at how the values have been scaled in scaled_features:
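Putting the generation and scaling steps together, here is a sketch. The `make_blobs` arguments mirror the parameter descriptions above, but treat the exact values as illustrative.

```python
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Synthetic data: 200 samples around 3 centers with a chosen spread
features, true_labels = make_blobs(
    n_samples=200, centers=3, cluster_std=2.75, random_state=42
)

# Scale each feature to mean 0 and standard deviation 1, so no single
# feature dominates the distance computations
scaler = StandardScaler()
scaled_features = scaler.fit_transform(features)

print(features[:5])         # first five raw samples
print(scaled_features[:5])  # the same samples after scaling
```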

Here are the parameters used in this example:

n_clusters sets k for the clustering step. This is the most important parameter for k-means.

max_iter sets the maximum number of iterations for each initialization of the k-means algorithm.

Instantiate the KMeans class with the following arguments:
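The instantiation itself is not reproduced in this excerpt; a sketch consistent with the parameters described above might look like this (the argument values are an illustrative configuration, not the original code):

```python
from sklearn.cluster import KMeans

kmeans = KMeans(
    init="random",    # centroid initialization strategy
    n_clusters=3,     # k, the most important k-means parameter
    n_init=10,        # independent runs; the best result is kept
    max_iter=300,     # iteration cap for each run
    random_state=42,  # make the initialization reproducible
)
```

After calling `kmeans.fit(...)`, the fitted attributes `inertia_`, `cluster_centers_`, and `n_iter_` describe the best run.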

The above code produces the following plot:

- How close the data point is to other points in the cluster
- How far away the data point is from points in other clusters
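The silhouette coefficient combines both notions above into one number. A minimal sketch on illustrative synthetic data (the tutorial's own variables are not reproduced here):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Illustrative data with three well-separated clusters
X, _ = make_blobs(n_samples=200, centers=3, random_state=42)
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

# The silhouette coefficient averages cohesion (closeness within the
# cluster) and separation (distance to other clusters).
# It ranges from -1 to 1; higher is better.
score = silhouette_score(X, labels)
```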

Note: In practice, it’s rare to encounter datasets that have ground truth labels.

This time, use make_moons() to generate synthetic data in the shape of crescents:

If you’re interested, you can find the code for the above plot by expanding the box below.

Compare the clustering results of DBSCAN and k-means using ARI as the performance metric:
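A sketch of that comparison (the `eps` value and moon parameters are illustrative assumptions): ARI is near 0 for random assignments and 1.0 for a perfect match, and DBSCAN typically recovers the crescents while k-means cannot.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

# Two interleaved crescents: a shape k-means handles poorly
X, true_labels = make_moons(n_samples=250, noise=0.05, random_state=42)
X = StandardScaler().fit_transform(X)

kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X)
dbscan_labels = DBSCAN(eps=0.3).fit_predict(X)

kmeans_ari = adjusted_rand_score(true_labels, kmeans_labels)
dbscan_ari = adjusted_rand_score(true_labels, dbscan_labels)
```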

## How to Build a K-Means Clustering Pipeline in Python

Download and extract the TCGA dataset from UCI:

After the download and extraction are complete, you should have a directory that looks like this:

The labels are strings containing abbreviations of cancer types:

- BRCA : Breast invasive carcinoma
- COAD : Colon adenocarcinoma
- KIRC : Kidney renal clear cell carcinoma
- LUAD : Lung adenocarcinoma
- PRAD : Prostate adenocarcinoma

n_init: You’ll increase the number of initializations to ensure you find a stable solution.

Build the k-means clustering pipeline with user-defined arguments in the KMeans constructor:

Calling .fit() with data as the argument performs all the pipeline steps on the data :

Evaluate the performance by calculating the silhouette coefficient:

Calculate ARI, too, since the ground truth cluster labels are available:
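Since the original code blocks are not reproduced in this excerpt, the pipeline steps above can be sketched as follows. The TCGA-specific loading is omitted and replaced with a synthetic stand-in, so the data, step names, and parameter values here are assumptions for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.metrics import adjusted_rand_score, silhouette_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the high-dimensional gene-expression matrix
data, true_labels = make_blobs(
    n_samples=300, n_features=50, centers=5, random_state=42
)

pipe = Pipeline([
    ("scaler", MinMaxScaler()),                     # squash features to [0, 1]
    ("pca", PCA(n_components=2, random_state=42)),  # reduce to 2 components
    ("kmeans", KMeans(n_clusters=5, n_init=50, max_iter=500, random_state=42)),
])

# .fit() runs every pipeline step on the data
pipe.fit(data)

predicted = pipe["kmeans"].labels_
reduced = pipe[:-1].transform(data)  # data after scaling + PCA

silhouette = silhouette_score(reduced, predicted)
ari = adjusted_rand_score(true_labels, predicted)
```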

Here’s what the plot looks like:

Iterate over a range of n_components and record evaluation metrics for each iteration:

The above code generates a plot showing performance metrics as a function of n_components:

There are two takeaways from this figure:

In this tutorial, you learned:

- What the popular clustering techniques are and when to use them
- What the k-means algorithm is
- How to implement k-means clustering in Python
- How to evaluate the performance of clustering algorithms
- How to build and tune a robust k-means clustering pipeline in Python
- How to analyze and present clustering results from the k-means algorithm

The original dataset is maintained by The Cancer Genome Atlas Pan-Cancer analysis project.

Kevin is a data scientist for a clinical genomics company, a Pythonista, and an NBA fan.



Written by Chris Piech. Based on a handout by Andrew Ng.

## The Basic Idea

## The Algorithm

## Implementation

To calculate the distance between x and y we can use: `np.sqrt(sum((x - y) ** 2))`

## Expectation Maximization

Figure 2: The K-Means algorithm is the EM algorithm applied to this Bayes Net.

© Stanford 2013 | Designed by Chris. Inspired by Niels and Percy.

- Initialize K centroids (See "Progress steps and testing" below). In our case, a centroid is just the average of all genes in a cluster, and it can be represented just like a normal gene.
- Assign each gene to the cluster that has the closest centroid.
- After all genes have been assigned to clusters, recalculate the centroids for each cluster (as averages of all genes in the cluster).
- Repeat the gene assignments and centroid calculations until no change in gene assignment occurs between iterations.

- The Gene object will contain a gene name and an array of microarray expression levels. Gene objects will also have the ability to calculate the distance to another Gene. Gene also contains two member functions, hashCode() and equals(Object o), for defining how to compare two objects of type Gene.
- The Cluster object will contain a HashSet (more on this later) of Gene objects. Cluster objects will have the ability to print out the names of all the genes in the cluster, to calculate the centroid of the cluster, and to create an image that represents the cluster (this code is provided for you).
- The KMeans class will contain an array of all the Gene objects to be clustered and an array of the Cluster objects created; it will also locally create arrays containing the centroids of the clusters. KMeans will also hold the main() function and all of the code to perform the clustering.

```java
for (Gene gene : geneSet) {
    System.out.println(gene.getName());
}
```

```
java KMeans <input_data_filename> <K> <metric> <output_filename>
```

- <input_data_filename> is the file containing the data to cluster
- <K> is an integer indicating the number of clusters to create
- <metric> is a string indicating which distance metric to use ("euclid" for Euclidean distance and "spearman" for Spearman correlation)
- <output_filename> is the base name desired for the pictures generated (without a "." or extension)

We have marked the places in the code where you have to fill in with `// TODO`.

- Randomly choose K genes as the initial centroids
- Use the first K genes in the dataset as the initial centroids

## Required files to submit:

- Cluster.java
- KMeans.java
- human_euclidean_5.txt
- human_spearman_5.txt
- yeast_euclidean_7.txt
- yeast_spearman_7.txt
- human_euclidean_5_1.jpg
- human_euclidean_5_2.jpg
- human_euclidean_5_3.jpg
- human_euclidean_5_4.jpg
- human_euclidean_5_5.jpg
- human_spearman_5_1.jpg
- human_spearman_5_2.jpg
- human_spearman_5_3.jpg
- human_spearman_5_4.jpg
- human_spearman_5_5.jpg
- yeast_euclidean_7_1.jpg
- yeast_euclidean_7_2.jpg
- yeast_euclidean_7_3.jpg
- yeast_euclidean_7_4.jpg
- yeast_euclidean_7_5.jpg
- yeast_euclidean_7_6.jpg
- yeast_euclidean_7_7.jpg
- yeast_spearman_7_1.jpg
- yeast_spearman_7_2.jpg
- yeast_spearman_7_3.jpg
- yeast_spearman_7_4.jpg
- yeast_spearman_7_5.jpg
- yeast_spearman_7_6.jpg
- yeast_spearman_7_7.jpg


Lesson 17 of 33 By Mayank Banoula


## Types of Clustering

The various types of clustering are:

Hierarchical clustering is further subdivided into:

Partitioning clustering is further subdivided into:


## Hierarchical Clustering

Hierarchical clustering uses a tree-like structure, like so:

## Partitioning Clustering

Partitioning clustering is split into two subtypes - K-Means clustering and Fuzzy C-Means.


## What is Meant by the K-Means Clustering Algorithm?

Let's take a look at the steps to create these clusters.

If we plot the data, this is how it would look:


## Perform Clustering

We need to create the clusters, as shown below:

Considering the same data set, let us solve the problem using K-Means clustering (taking K = 2).


## Applications of K-Means Clustering

K-Means clustering is used in a variety of examples or business cases in real life, like:

## Academic Performance

Based on the scores, students are categorized into grades like A, B, or C.

## Distance Measure

K-Means clustering supports various kinds of distance measures, such as:

- Euclidean distance measure
- Manhattan distance measure
- Squared Euclidean distance measure
- Cosine distance measure
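These four measures can be sketched directly in NumPy; the example vectors are arbitrary.

```python
import numpy as np

x = np.array([3.0, 4.0])
y = np.array([0.0, 3.0])

euclidean = np.sqrt(np.sum((x - y) ** 2))  # sqrt(9 + 1)
squared_euclidean = np.sum((x - y) ** 2)   # 10.0
manhattan = np.sum(np.abs(x - y))          # |3| + |1| = 4.0
# Cosine distance = 1 - cosine similarity of the two vectors
cosine = 1 - np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))
```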


## Euclidean Distance Measure

The formula for distance between two points is shown below:

## Squared Euclidean Distance Measure

## Manhattan Distance Measure

Note that we are taking the absolute value so that the negative values don't come into play.

## Cosine Distance Measure


## How Does K-Means Clustering Work?

The flowchart below shows how k-means clustering works:

Let's use a visualization example to understand this better.

Now, we draw a curve between WSS and the number of clusters.

Here, WSS is on the y-axis and number of clusters on the x-axis.

You can see that there is a very gradual change in the value of WSS as the K value increases from 2.


Let's assume that these are our delivery points:

We can randomly initialize two points called the cluster centroids.

Here, C1 and C2 are the centroids assigned randomly.

This is how the initial grouping is done:

Compute the actual centroid of data points for the first group.

Reposition the random centroid to the actual centroid.

Compute the actual centroid of data points for the second group.

Once the cluster becomes static, the k-means algorithm is said to be converged.

The final cluster with centroids c1 and c2 is as shown below:

## K-Means Clustering Algorithm

Let's say we have x1, x2, …, xn as our inputs, and we want to split them into K clusters.

The steps to form clusters are:

Step 1: Choose K random points as cluster centers, called centroids.

Step 2: Assign each data point to the closest centroid, forming K clusters.

Step 3: Identify new centroids by taking the average of the points assigned to each cluster.

Step 4: Keep repeating steps 2 and 3 until convergence is achieved.

Let's take a detailed look at each of these steps.

We randomly pick K centroids. We name them c1, c2, …, ck, and we can say that

$$C = \{c_1, c_2, \ldots, c_k\}$$

Where C is the set of all centroids.


Each point is assigned to the cluster whose centroid is the minimum distance away:

$$\underset{c_i \in C}{\arg\min}\ dist(c_i, x)^2$$

Where dist() is the Euclidean distance.

Similarly, we find the minimum distance for x2, x3, etc.

We identify the new centroid of each cluster by taking the average of all the points assigned to that cluster:

$$c_i = \frac{1}{|S_i|}\sum_{x_i \in S_i} x_i$$

Where Si is the set of all points assigned to the ith cluster.

Keep repeating step 2 and step 3 until convergence is achieved.

## How to Choose the Value of "K number of clusters" in K-Means Clustering?

## Python Implementation of the K-Means Clustering Algorithm

- Data pre-processing
- Finding the optimal number of clusters using the elbow method
- Training the K-Means algorithm on the training data set
- Visualizing the clusters

1. Data Pre-Processing. Import the libraries and the dataset, and extract the independent variables.

```python
import pandas as pd
import matplotlib.pyplot as mtp

dataset = pd.read_csv('Mall_Customers_data.csv')
x = dataset.iloc[:, [3, 4]].values
```

2. Find the optimal number of clusters using the elbow method. Here's the code you use:

```python
# Finding the optimal number of clusters using the elbow method
from sklearn.cluster import KMeans

wcss_list = []  # Initializing the list for the values of WCSS

# Using a for loop for iterations from 1 to 10
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', random_state=42)
    kmeans.fit(x)
    wcss_list.append(kmeans.inertia_)

mtp.plot(range(1, 11), wcss_list)
mtp.title('The Elbow Method Graph')
mtp.xlabel('Number of clusters (k)')
mtp.ylabel('WCSS')
mtp.show()
```

3. Train the K-means model on the dataset, using the number of clusters found with the elbow method:

```python
# Training the K-means model on the dataset
kmeans = KMeans(n_clusters=5, init='k-means++', random_state=42)
y_predict = kmeans.fit_predict(x)
```

4. Visualize the Clusters. Since this model has five clusters, we need to visualize each one.

```python
# Plotting each of the five clusters in a different color,
# then the centroids on top
for i, color in enumerate(['blue', 'green', 'red', 'cyan', 'magenta']):
    mtp.scatter(x[y_predict == i, 0], x[y_predict == i, 1],
                s=100, c=color, label=f'Cluster {i + 1}')
mtp.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            s=300, c='yellow', label='Centroids')
mtp.title('Clusters of customers')
mtp.xlabel('Annual Income (k$)')
mtp.ylabel('Spending Score (1-100)')
mtp.legend()
mtp.show()
```

Code provided by Javatpoint.

## Demo: K-Means Clustering


## Conclusion


## Getting started with k-means clustering in Python


## Getting Started

## K-Means - what does it mean?

The distance \(\overline{p_1 p_2 }\) is given by:

$$\overline{p_1 p_2 }=\sqrt{(a_2-a_1 )^2+(b_2-b_1 )^2 }$$

## A recipe for k-means

The recipe for \(k\)-means is quite straightforward.

- Decide how many clusters you want, i.e., choose k
- Randomly assign a centroid to each of the k clusters
- Calculate the distance of all observations to each of the k centroids
- Assign observations to the closest centroid
- Find the new location of each centroid by taking the mean of all the observations in its cluster
- Repeat steps 3-5 until the centroids do not change position

- ID: A unique identifier for the observation
- x: Attribute corresponding to an x coordinate
- y: Attribute corresponding to a y coordinate
- Cluster: An identifier for the cluster the observation belongs to
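One iteration of steps 3-5 on a DataFrame with this schema might be sketched as follows; the column names match the description above, while the toy values and centroid positions are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy observations using the schema described above
df = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "x": [1.0, 1.5, 8.0, 9.0],
    "y": [2.0, 1.8, 8.0, 9.5],
})
centroids = pd.DataFrame({"x": [0.0, 10.0], "y": [0.0, 10.0]})

# Step 3: distance of every observation to each centroid
dists = np.sqrt(
    (df["x"].values[:, None] - centroids["x"].values) ** 2
    + (df["y"].values[:, None] - centroids["y"].values) ** 2
)

# Step 4: assign each observation to the closest centroid
df["Cluster"] = dists.argmin(axis=1)

# Step 5: the new centroid locations are the per-cluster means
centroids = df.groupby("Cluster")[["x", "y"]].mean().reset_index(drop=True)
```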

## Steps 1 and 2 - Define \(k\) and initiate the centroids

## Step 3 - Calculate distance

## Step 4 - Assign centroids

## Step 5 - Update centroid location

We can verify that the position has been updated. Let us look again at our scatter plot:

## Step 6 - Repeat steps 3-5

Let us see the location of the final centroids:

And in a graphical way, let us see our clusters:

That is it! We can look at the labels that the algorithm has provided as follows:

In order to make a comparison, let us reorder the labels:

Let us look at the location of the final clusters:

And we can look at some 3D graphics. In this case we will plot the following features:

- Petal width
- Sepal length
- Petal length

As you can see, there are a few observations that differ in colour between the two plots.


## Dr J Rogel-Salazar

