K Means Clustering | Step-by-Step Tutorials for Clustering in Data Analysis

Introduction

K-means is one of the most popular unsupervised machine learning algorithms for solving clustering problems in data science, and it is very important if you are aiming for a data scientist role. K-means segregates unlabeled data into groups, called clusters, based on similar features and common patterns. This tutorial will teach you the definition and applications of clustering, focusing on the k-means clustering algorithm and its implementation in Python. It will also show you how to choose the optimal number of clusters for a dataset.


Learning Objectives

  • Understand what the K-means clustering algorithm is.
  • Develop a good understanding of the steps involved in implementing the K-Means algorithm and finding the optimal number of clusters.
  • Implement K means Clustering in Python with scikit-learn library.


Table of Contents

  • What Is Clustering?
  • What Is the K-Means Clustering Algorithm?
  • Diagrammatic Implementation of K-Means Clustering
  • Choosing the Optimal Number of Clusters

What Is Clustering?

Suppose we have N unlabeled, multivariate datasets of various animals like dogs, cats, birds, etc. The technique of segregating these datasets into groups on the basis of similar features and characteristics is called clustering.

The groups being formed are known as clusters. Clustering techniques are used in various fields, such as image recognition, spam filtering, etc. They are also used in unsupervised learning algorithms in machine learning, as they can segregate multivariate data into various groups, without any supervisor, on the basis of common patterns hidden inside the datasets.

What Is the K-Means Clustering Algorithm?

The k-means clustering algorithm is an iterative algorithm that divides a group of n datasets into k different clusters based on similarity, using each point's mean distance from the centroid of the cluster it is assigned to.

K here is the pre-defined number of clusters to be formed by the algorithm. If K = 3, the number of clusters to be formed from the dataset is 3.

Implementation of the K-Means Algorithm

The implementation and working of the K-Means algorithm are explained in the steps below:

Step 1: Select the value of K to decide the number of clusters (n_clusters) to be formed.

Step 2: Select random K points that will act as cluster centroids (cluster_centers).

Step 3: Assign each data point, based on its distance from the randomly selected points (centroids), to the nearest centroid; this forms the predefined clusters.

Step 4: Place a new centroid of each cluster.

Step 5: Repeat step 3, reassigning each data point to the new closest centroid of each cluster.

Step 6: If any reassignment occurs, then go to step 4; else, go to step 7.

Step 7: Finish
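The steps above can be sketched directly with NumPy. This is a minimal illustration under assumed array names and a small K, not the scikit-learn implementation used later in this tutorial:

    import numpy as np

    def kmeans(X, k, max_iter=100, seed=0):
        rng = np.random.default_rng(seed)
        # Steps 1-2: pick K random data points as the initial centroids
        centroids = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(max_iter):
            # Step 3: assign every point to its closest centroid
            distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
            labels = distances.argmin(axis=1)
            # Step 4: move each centroid to the mean of its assigned points
            # (empty clusters are not handled in this sketch)
            new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
            # Steps 5-7: stop once no centroid moves any more
            if np.allclose(new_centroids, centroids):
                break
            centroids = new_centroids
        return centroids, labels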

Diagrammatic Implementation of K-Means Clustering

Step 1: Let's choose the number of clusters, i.e., K = 2, to segregate the dataset into two clusters. We will choose two random points to act as the initial centroids of the clusters.

Step 2: Now, we will assign each data point on the scatter plot to its closest centroid. We do this by drawing a median line between the two centroids.

Step 3: Points on the left side of the line are closer to the blue centroid, and points to the right of the line are closer to the yellow centroid. The left points form a cluster with the blue centroid, and the right points with the yellow centroid.

Step 4: Repeat the process by choosing new centroids. To choose the new centroids, we compute the center of gravity (mean) of the data points currently assigned to each cluster.

Step 5:  Next, we will reassign each data point to the new centroid. We will repeat the same process as above (using a median line). The yellow data point on the blue side of the median line will be included in the blue cluster.


Step 6:  As reassignment has occurred, we will repeat the above step of finding new k centroids.


Step 7:  We will repeat the above process of finding the center of gravity of k centroids, as depicted below.


Step 8:  After finding the new k centroids, we will again draw the median line and reassign the data points, like the above steps.


Step 9: We will finally segregate the points based on the median line, so that two groups are formed and no point ends up in a group of dissimilar points.


This gives us the two final clusters.

Choosing the Optimal Number of Clusters

The number of clusters that we choose for the algorithm shouldn't be picked at random. Each cluster is formed by calculating and comparing the distances of the data points within a cluster to its centroid.

We can choose the right number of clusters with the help of the Within-Cluster Sum of Squares (WCSS), the sum of the squared distances between the data points in each cluster and that cluster's centroid.

The main idea is to minimize the distance (e.g., Euclidean distance) between the data points and the centroid of the clusters. The process is iterated until we reach a minimum value for the sum of distances.

Elbow Method

Here are the steps to follow in order to find the optimal number of clusters using the elbow method:

Step 1: Execute the K-means clustering on a given dataset for different K values (ranging from 1-10).

Step 2: For each value of K, calculate the WCSS value.

Step 3: Plot a graph/curve between WCSS values and the respective number of clusters K.

Step 4: The sharp bend in the plot, the point that looks like the elbow joint of an arm, is taken as the best/optimal value of K.

Python Implementation:

Importing relevant libraries
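A minimal set of imports for this walkthrough might look like the following; the original notebook's exact import list isn't shown, so treat this as an assumption:

    import pandas as pd
    import matplotlib.pyplot as plt
    from sklearn.cluster import KMeans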

Loading the data

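The original data file isn't included here, so the file name and column names below are placeholders; any CSV with two numeric feature columns will do:

    # Hypothetical file name -- replace with your own dataset
    data = pd.read_csv('clustering_data.csv')
    print(data.head())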

Plotting the data

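Assuming two numeric columns named Feature_1 and Feature_2 (placeholder names), a quick scatter plot shows the raw, unlabeled points:

    plt.scatter(data['Feature_1'], data['Feature_2'])
    plt.xlabel('Feature_1')
    plt.ylabel('Feature_2')
    plt.show()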

Selecting the feature

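Next, keep only the columns you want to cluster on; here both hypothetical feature columns are selected:

    x = data[['Feature_1', 'Feature_2']].copy()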

Clustering results

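A sketch of fitting k-means with an assumed K of 3 and plotting the resulting clusters along with their centroids:

    kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
    clusters = kmeans.fit_predict(x)

    plt.scatter(x['Feature_1'], x['Feature_2'], c=clusters, cmap='rainbow')
    plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
                c='black', marker='x', s=100)
    plt.show()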

WCSS and Elbow Method

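To apply the elbow method, run k-means for K = 1 through 10 and record the WCSS, which scikit-learn exposes as the inertia_ attribute:

    wcss = []
    for k in range(1, 11):
        km = KMeans(n_clusters=k, n_init=10, random_state=42)
        km.fit(x)
        wcss.append(km.inertia_)

    plt.plot(range(1, 11), wcss, marker='o')
    plt.xlabel('Number of clusters K')
    plt.ylabel('WCSS')
    plt.show()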

This method shows that 3 is a good number of clusters.

To summarize everything that has been stated so far, k-means clustering is a widely used unsupervised machine learning technique that enables the grouping of data into clusters based on similarity. It is a simple algorithm that can be applied to various domains and data types, including image and text data. k-means can be used for a variety of purposes. We can use it to perform dimensionality reduction also, where each transformed feature is the distance of the point from a cluster center.


Frequently Asked Questions

Q1. What is meant by n_init in k-means clustering?

A. n_init is an integer that sets how many times the k-means algorithm is run independently with different random centroid seeds; the run with the lowest within-cluster sum of squares is kept.

Q2. What are the advantages and disadvantages of K-means Clustering?

A. Advantages of K-means Clustering include its simplicity, scalability, and versatility, as it can be applied to a wide range of data types. Disadvantages include its sensitivity to the initial placement of centroids and its limitations in handling complex, non-linear data. k-means is also sensitive to outliers.

Q3. What is meant by random_state in k-means clustering?

A. In k-means, random_state controls the random number generation used for centroid initialization. Setting it to an integer makes the randomness fixed, which helps when we want to produce the same clusters every time.



Towards Data Science

Turner Luke

Apr 11, 2022

Create a K-Means Clustering Algorithm from Scratch in Python

Cement your knowledge of k-means clustering by implementing it yourself.

Introduction

k-means clustering is an unsupervised machine learning algorithm that seeks to segment a dataset into groups based on the similarity of data points. An unsupervised model has independent variables and no dependent variables.

Suppose you have a dataset of 2-dimensional scalar attributes:

If the points in this dataset belong to distinct groups with attributes significantly varying between groups but not within, the points should form clusters when plotted.

Figure 1: A dataset of points with groups of distinct attributes.

This dataset clearly displays 3 distinct classes of data. If we seek to assign a new data point to one of these three groups, it can be done by finding the midpoint of each group (centroid) and selecting the nearest centroid as the group of the unassigned data point.

Figure 2: The data points are segmented into groups denoted with differing colors.

For a given dataset, k is specified to be the number of distinct groups the points belong to. These k centroids are first randomly initialized, then iterations are performed to optimize their locations as follows: each data point is assigned to its nearest centroid, each centroid is then moved to the mean of the points assigned to it, and the two steps repeat until the centroids stop moving.

To evaluate our algorithm, we’ll first generate a dataset of groups in 2-dimensional space. The sklearn.datasets function make_blobs creates groupings of 2-dimensional normal distributions, and assigns a label corresponding to the group said point belongs to.
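A data-generation call along these lines produces the kind of dataset shown in Figure 3; the exact arguments used in the original are not shown, so the values here are assumptions:

    from sklearn.datasets import make_blobs

    X_train, true_labels = make_blobs(n_samples=100, centers=5, random_state=42)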

Figure 3: The dataset we will use to evaluate our k means clustering model.

This dataset provides a unique demonstration of the k-means algorithm. Observe the orange point uncharacteristically far from its center, sitting directly in the cluster of purple data points. This point cannot be accurately assigned to its true group, so even if our algorithm works well, it should incorrectly characterize it as a member of the purple group.

Model Creation

Helper Functions

We'll need to calculate the distances between a point and a dataset of points multiple times in this algorithm. To do so, let's define a function that calculates Euclidean distances.
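A helper along these lines computes the Euclidean distance between one point and every row of an array:

    import numpy as np

    def euclidean(point, data):
        # Distance between a single point and each row of `data`
        return np.sqrt(np.sum((point - data) ** 2, axis=1))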

Implementation

First, the k-means clustering algorithm is initialized with a value for k and a maximum number of iterations for finding the optimal centroid locations. If a maximum number of iterations is not considered when optimizing centroid locations, there is a risk of running an infinite loop.
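A minimal constructor, assuming the class is called KMeans and holds only these two hyperparameters:

    class KMeans:
        def __init__(self, n_clusters=8, max_iter=300):
            self.n_clusters = n_clusters
            self.max_iter = max_iter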

Now, the bulk of the algorithm is performed when fitting the model to a training dataset.

First we’ll initialize the centroids randomly in the domain of the test dataset, with a uniform distribution.

Next, we perform the iterative process of optimizing the centroid locations.

The optimization process is to readjust the centroid locations to be the means of the points belonging to it. This process is to repeat until the centroids stop moving, or the maximum number of iterations is passed. We’ll use a while loop to account for the fact that this process does not have a fixed number of iterations. Additionally, you could also use a for loop that repeats max_iter times and breaks when the centroids stop changing.

Before beginning the while loop, we’ll initialize the variables used in the exit conditions.

Now, we begin the loop. We’ll iterate through the data points in the training set, assigning them to an initialized empty list of lists. The sorted_points list contains one empty list for each centroid, where data points are appended once they’ve been assigned.

Now that we’ve assigned the whole training dataset to their closest centroids, we can update the location of the centroids and finish the iteration.

After the completion of the iteration, the while conditions are checked again, and the algorithm will continue until the centroids are optimized or the max iterations are passed. The full fit method is included below.
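The article's full method isn't reproduced here; the sketch below follows the description above (uniform random initialization, assignment via the euclidean helper, mean update, and the two exit conditions) and is one reasonable way to write it:

    # This method belongs inside the KMeans class sketched above.
    def fit(self, X_train):
        # Initialize centroids uniformly at random within the data's bounding box
        min_, max_ = np.min(X_train, axis=0), np.max(X_train, axis=0)
        self.centroids = np.random.uniform(min_, max_,
                                           size=(self.n_clusters, X_train.shape[1]))

        iteration = 0
        centroids_moved = True
        while centroids_moved and iteration < self.max_iter:
            # Assign each point to the list belonging to its nearest centroid
            sorted_points = [[] for _ in range(self.n_clusters)]
            for x in X_train:
                nearest = np.argmin(euclidean(x, self.centroids))
                sorted_points[nearest].append(x)

            # Move each centroid to the mean of its points (leave it if it has none)
            prev_centroids = self.centroids
            self.centroids = np.array([
                np.mean(points, axis=0) if points else prev_centroids[i]
                for i, points in enumerate(sorted_points)
            ])
            centroids_moved = not np.allclose(self.centroids, prev_centroids)
            iteration += 1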

Lastly, let's make a method to evaluate a set of points against the centroids we've optimized to our training set. This method returns the nearest centroid and the index of that centroid for each point.
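A matching evaluation method, returning the nearest centroid and its index for each point:

    # This method belongs inside the KMeans class sketched above.
    def evaluate(self, X):
        centroids, centroid_idxs = [], []
        for x in X:
            idx = int(np.argmin(euclidean(x, self.centroids)))
            centroids.append(self.centroids[idx])
            centroid_idxs.append(idx)
        return centroids, centroid_idxs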

First Model Evaluation

Now we can finally deploy our model. Let's train and test it on our original dataset and see the results. We'll keep our original method of plotting our data, separating the true labels by color, but now we'll additionally separate the predicted labels by marker style, to see how the model performs.

Figure 4: A failed example where one centroid has no points, and one contains two clusters.

Figure 5: A failed example where one centroid has no points, two contain two clusters, and two split one cluster.

Figure 6: A failed example where two centroids contain one and a half clusters, and two centroids split a cluster.

Re-evaluating Centroid Initialization

Looks like our model isn't performing very well. We can infer two primary problems from these three failed examples: a randomly initialized centroid can end up with no points assigned to it, and several centroids can be initialized inside the same true cluster, splitting it while other clusters get merged.

We’ll begin to remedy these problems with a new process of initializing the centroid locations. This new method is referred to as the k-means++ algorithm.

This code is included below.
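The original snippet isn't shown; the following is a common way to write a k-means++ style initialization, where each new centroid is sampled with probability proportional to its squared distance from the nearest centroid chosen so far (the method name is hypothetical):

    # This method belongs inside the KMeans class sketched above; call it from fit()
    # in place of the uniform random initialization.
    def _init_centroids(self, X_train):
        rng = np.random.default_rng()
        self.centroids = [X_train[rng.integers(len(X_train))]]
        for _ in range(self.n_clusters - 1):
            # Squared distance of every point to its nearest chosen centroid
            dists = np.min([euclidean(c, X_train) for c in self.centroids], axis=0) ** 2
            probs = dists / dists.sum()
            next_idx = rng.choice(len(X_train), p=probs)
            self.centroids.append(X_train[next_idx])
        self.centroids = np.array(self.centroids)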

If we run this new model a few times we’ll see it performs much better, but still not always perfect.

Figure 7: An ideal convergence, after implementing the k-means++ initialization method.

And with that, we’re finished. We learned a simple, yet elegant implementation of an unsupervised machine learning model. The complete project code is included below.

Thanks for reading! Connect with me on LinkedIn See this project in GitHub


K-Means Clustering in Python: A Practical Guide

Table of Contents

  • Overview of Clustering Techniques
  • Partitional Clustering
  • Hierarchical Clustering
  • Density-Based Clustering
  • Understanding the K-Means Algorithm
  • Writing Your First K-Means Clustering Code in Python
  • Choosing the Appropriate Number of Clusters
  • Evaluating Clustering Performance Using Advanced Techniques
  • Building a K-Means Clustering Pipeline
  • Tuning a K-Means Clustering Pipeline

The k -means clustering method is an unsupervised machine learning technique used to identify clusters of data objects in a dataset. There are many different types of clustering methods, but k -means is one of the oldest and most approachable. These traits make implementing k -means clustering in Python reasonably straightforward, even for novice programmers and data scientists.

If you’re interested in learning how and when to implement k -means clustering in Python, then this is the right place. You’ll walk through an end-to-end example of k -means clustering using Python, from preprocessing the data to evaluating results.

In this tutorial, you’ll learn:


What Is Clustering?

Clustering is a set of techniques used to partition data into groups, or clusters. Clusters are loosely defined as groups of data objects that are more similar to other objects in their cluster than they are to data objects in other clusters. In practice, clustering helps identify two qualities of data:

Meaningful clusters expand domain knowledge. For example, in the medical field, researchers applied clustering to gene expression experiments. The clustering results identified groups of patients who respond differently to medical treatments.

Useful clusters, on the other hand, serve as an intermediate step in a data pipeline . For example, businesses use clustering for customer segmentation. The clustering results segment customers into groups with similar purchase histories, which businesses can then use to create targeted advertising campaigns.

Note: You’ll learn about unsupervised machine learning techniques in this tutorial. If you’re interested in learning more about supervised machine learning techniques, then check out Logistic Regression in Python .

There are many other applications of clustering , such as document clustering and social network analysis. These applications are relevant in nearly every industry, making clustering a valuable skill for professionals working with data in any field.

You can perform clustering using many different approaches—so many, in fact, that there are entire categories of clustering algorithms. Each of these categories has its own unique strengths and weaknesses. This means that certain clustering algorithms will result in more natural cluster assignments depending on the input data.

Note: If you’re interested in learning about clustering algorithms not mentioned in this section, then check out A Comprehensive Survey of Clustering Algorithms for an excellent review of popular techniques.

Selecting an appropriate clustering algorithm for your dataset is often difficult due to the number of choices available. Some important factors that affect this decision include the characteristics of the clusters, the features of the dataset, the number of outliers, and the number of data objects.

You’ll explore how these factors help determine which approach is most appropriate by looking at three popular categories of clustering algorithms:

It’s worth reviewing these categories at a high level before jumping right into k -means. You’ll learn the strengths and weaknesses of each category to provide context for how k -means fits into the landscape of clustering algorithms.

Partitional clustering divides data objects into nonoverlapping groups. In other words, no object can be a member of more than one cluster, and every cluster must have at least one object.

These techniques require the user to specify the number of clusters, indicated by the variable k . Many partitional clustering algorithms work through an iterative process to assign subsets of data points into k clusters. Two examples of partitional clustering algorithms are k -means and k -medoids.

These algorithms are both nondeterministic , meaning they could produce different results from two separate runs even if the runs were based on the same input.

Partitional clustering methods have several strengths :

They also have several weaknesses :

Hierarchical clustering determines cluster assignments by building a hierarchy. This is implemented by either a bottom-up or a top-down approach:

Agglomerative clustering is the bottom-up approach. It merges the two points that are the most similar until all points have been merged into a single cluster.

Divisive clustering is the top-down approach. It starts with all points as one cluster and splits the least similar clusters at each step until only single data points remain.

These methods produce a tree-based hierarchy of points called a dendrogram . Similar to partitional clustering, in hierarchical clustering the number of clusters ( k ) is often predetermined by the user. Clusters are assigned by cutting the dendrogram at a specified depth that results in k groups of smaller dendrograms.

Unlike many partitional clustering techniques, hierarchical clustering is a deterministic process, meaning cluster assignments won’t change when you run an algorithm twice on the same input data.

The strengths of hierarchical clustering methods include the following:

The weaknesses of hierarchical clustering methods include the following:

Density-based clustering determines cluster assignments based on the density of data points in a region. Clusters are assigned where there are high densities of data points separated by low-density regions.

Unlike the other clustering categories, this approach doesn’t require the user to specify the number of clusters. Instead, there is a distance-based parameter that acts as a tunable threshold. This threshold determines how close points must be to be considered a cluster member.

Examples of density-based clustering algorithms include Density-Based Spatial Clustering of Applications with Noise, or DBSCAN , and Ordering Points To Identify the Clustering Structure, or OPTICS .

The strengths of density-based clustering methods include the following:

The weaknesses of density-based clustering methods include the following:

How to Perform K-Means Clustering in Python

In this section, you’ll take a step-by-step tour of the conventional version of the k -means algorithm. Understanding the details of the algorithm is a fundamental step in the process of writing your k -means clustering pipeline in Python. What you learn in this section will help you decide if k -means is the right choice to solve your clustering problem.

Conventional k -means requires only a few steps. The first step is to randomly select k centroids, where k is equal to the number of clusters you choose. Centroids are data points representing the center of a cluster.

The main element of the algorithm works by a two-step process called expectation-maximization . The expectation step assigns each data point to its nearest centroid. Then, the maximization step computes the mean of all the points for each cluster and sets the new centroid. Here’s what the conventional version of the k -means algorithm looks like:


The quality of the cluster assignments is determined by computing the sum of the squared error (SSE) after the centroids converge , or match the previous iteration’s assignment. The SSE is defined as the sum of the squared Euclidean distances of each point to its closest centroid. Since this is a measure of error, the objective of k -means is to try to minimize this value.
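In symbols, with data points $x^{(i)}$ and centroids $\mu_j$ (notation introduced here for convenience), this can be written as:

$\text{SSE} = \sum_{i=1}^{n} \min_{j} \lVert x^{(i)} - \mu_j \rVert^2$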

The figure below shows the centroids and SSE updating through the first five iterations from two different runs of the k -means algorithm on the same dataset:

The purpose of this figure is to show that the initialization of the centroids is an important step. It also highlights the use of SSE as a measure of clustering performance. After choosing a number of clusters and the initial centroids, the expectation-maximization step is repeated until the centroid positions reach convergence and are unchanged.

The random initialization step causes the k -means algorithm to be nondeterministic , meaning that cluster assignments will vary if you run the same algorithm twice on the same dataset. Researchers commonly run several initializations of the entire k -means algorithm and choose the cluster assignments from the initialization with the lowest SSE.

Thankfully, there’s a robust implementation of k -means clustering in Python from the popular machine learning package scikit-learn . You’ll learn how to write a practical implementation of the k -means algorithm using the scikit-learn version of the algorithm .

Note: If you’re interested in gaining a deeper understanding of how to write your own k -means algorithm in Python, then check out the Python Data Science Handbook .

The code in this tutorial requires some popular external Python packages and assumes that you’ve installed Python with Anaconda. For more information on setting up your Python environment for machine learning in Windows, read through Setting Up Python for Machine Learning on Windows .

Otherwise, you can begin by installing the required packages:

The code is presented so that you can follow along in an ipython console or Jupyter Notebook.

This step will import the modules needed for all the code in this section:
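The import list below is an assumption that covers everything used in this section: plotting, synthetic data, k-means, feature scaling, the silhouette score, and the kneed package used later for the elbow point:

    import matplotlib.pyplot as plt
    from kneed import KneeLocator
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs
    from sklearn.metrics import silhouette_score
    from sklearn.preprocessing import StandardScaler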

You can generate the data from the above GIF using make_blobs() , a convenience function in scikit-learn used to generate synthetic clusters. Key make_blobs() parameters include n_samples (the total number of samples to generate), centers (the number of centers, or groups, to generate), and cluster_std (the standard deviation of each group).

make_blobs() returns a tuple of two values: a two-dimensional NumPy array with the x- and y-values for each sample, and a one-dimensional NumPy array containing the cluster label for each sample.

Note: Many scikit-learn algorithms rely heavily on NumPy in their implementations. If you want to learn more about NumPy arrays, check out Look Ma, No For-Loops: Array Programming With NumPy .

Generate the synthetic data and labels:
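A call along these lines generates the synthetic data; the specific argument values are assumptions:

    features, true_labels = make_blobs(
        n_samples=200,
        centers=3,
        cluster_std=2.75,
        random_state=42,
    )

    # Peek at the first five samples and their labels
    print(features[:5])
    print(true_labels[:5])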

Nondeterministic machine learning algorithms like k -means are difficult to reproduce. The random_state parameter is set to an integer value so you can follow the data presented in the tutorial. In practice, it’s best to leave random_state as the default value, None .

Here’s a look at the first five elements for each of the variables returned by make_blobs() :

Data sets usually contain numerical features that have been measured in different units, such as height (in inches) and weight (in pounds). A machine learning algorithm would consider weight more important than height only because the values for weight are larger and have higher variability from person to person.

Machine learning algorithms need to consider all features on an even playing field. That means the values for all features must be transformed to the same scale.

The process of transforming numerical features to use the same scale is known as feature scaling . It’s an important data preprocessing step for most distance-based machine learning algorithms because it can have a significant impact on the performance of your algorithm.

There are several approaches to implementing feature scaling. A great way to determine which technique is appropriate for your dataset is to read scikit-learn’s preprocessing documentation .

In this example, you’ll use the StandardScaler class. This class implements a type of feature scaling called standardization . Standardization scales, or shifts, the values for each numerical feature in your dataset so that the features have a mean of 0 and standard deviation of 1:
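Standardizing the features is then a two-liner with StandardScaler:

    scaler = StandardScaler()
    scaled_features = scaler.fit_transform(features)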

Take a look at how the values have been scaled in scaled_features :

Now the data are ready to be clustered. The KMeans estimator class in scikit-learn is where you set the algorithm parameters before fitting the estimator to the data. The scikit-learn implementation is flexible, providing several parameters that can be tuned.

Here are the parameters used in this example:

init controls the initialization technique. The standard version of the k -means algorithm is implemented by setting init to "random" . Setting this to "k-means++" employs an advanced trick to speed up convergence, which you’ll use later.

n_clusters sets k for the clustering step. This is the most important parameter for k -means.

n_init sets the number of initializations to perform. This is important because two runs can converge on different cluster assignments. The default behavior for the scikit-learn algorithm is to perform ten k -means runs and return the results of the one with the lowest SSE.

max_iter sets the number of maximum iterations for each initialization of the k -means algorithm.

Instantiate the KMeans class with the following arguments:

The parameter names match the language that was used to describe the k -means algorithm earlier in the tutorial. Now that the k -means class is ready, the next step is to fit it to the data in scaled_features . This will perform ten runs of the k -means algorithm on your data with a maximum of 300 iterations per run:
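Instantiation and fitting might look like this; the argument values simply mirror the description above:

    kmeans = KMeans(
        init="random",
        n_clusters=3,
        n_init=10,
        max_iter=300,
        random_state=42,
    )
    kmeans.fit(scaled_features)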

Statistics from the initialization run with the lowest SSE are available as attributes of kmeans after calling .fit() :
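For example, the lowest SSE, the final centroid locations, and the number of iterations are exposed as:

    print(kmeans.inertia_)          # lowest SSE across the ten initializations
    print(kmeans.cluster_centers_)  # final centroid locations
    print(kmeans.n_iter_)           # iterations needed to converge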

Finally, the cluster assignments are stored as a one-dimensional NumPy array in kmeans.labels_ . Here’s a look at the first five predicted labels:

Note that the order of the cluster labels for the first two data objects was flipped. The order was [1, 0] in true_labels but [0, 1] in kmeans.labels_ even though those data objects are still members of their original clusters in kmeans.labels_ .

This behavior is normal, as the ordering of cluster labels is dependent on the initialization. Cluster 0 from the first run could be labeled cluster 1 in the second run and vice versa. This doesn’t affect clustering evaluation metrics.

In this section, you’ll look at two methods that are commonly used to evaluate the appropriate number of clusters: the elbow method and the silhouette coefficient.

These are often used as complementary evaluation techniques rather than one being preferred over the other. To perform the elbow method , run several k -means, increment k with each iteration, and record the SSE:
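A sketch of that loop, using dictionary unpacking for the keyword arguments shared by every run (the next paragraph refers to this operator):

    kmeans_kwargs = {
        "init": "random",
        "n_init": 10,
        "max_iter": 300,
        "random_state": 42,
    }

    sse = []
    for k in range(1, 11):
        kmeans = KMeans(n_clusters=k, **kmeans_kwargs)
        kmeans.fit(scaled_features)
        sse.append(kmeans.inertia_)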

The previous code block made use of Python’s dictionary unpacking operator ( ** ). To learn more about this powerful Python operator, check out How to Iterate Through a Dictionary in Python .

When you plot SSE as a function of the number of clusters, notice that SSE continues to decrease as you increase k . As more centroids are added, the distance from each point to its closest centroid will decrease.

There’s a sweet spot where the SSE curve starts to bend known as the elbow point . The x-value of this point is thought to be a reasonable trade-off between error and number of clusters. In this example, the elbow is located at x=3 :
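Plotting SSE against k with Matplotlib:

    plt.plot(range(1, 11), sse, marker="o")
    plt.xticks(range(1, 11))
    plt.xlabel("Number of Clusters")
    plt.ylabel("SSE")
    plt.show()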

The above code produces the following plot:


Determining the elbow point in the SSE curve isn’t always straightforward. If you’re having trouble choosing the elbow point of the curve, then you could use a Python package, kneed , to identify the elbow point programmatically:
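kneed provides a KneeLocator class for exactly this; with the SSE values computed above, it should report the same elbow:

    kl = KneeLocator(range(1, 11), sse, curve="convex", direction="decreasing")
    print(kl.elbow)  # expected to be 3 for this data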

The silhouette coefficient is a measure of cluster cohesion and separation. It quantifies how well a data point fits into its assigned cluster based on two factors: how close the data point is to other points in its own cluster, and how far away it is from points in the other clusters.

Silhouette coefficient values range between -1 and 1 . Larger numbers indicate that samples are closer to their clusters than they are to other clusters.

In the scikit-learn implementation of the silhouette coefficient , the average silhouette coefficient of all the samples is summarized into one score. The silhouette_score() function needs a minimum of two clusters, or it will raise an exception.

Loop through values of k again. This time, instead of computing SSE, compute the silhouette coefficient:
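Note that the loop now starts at k = 2, since silhouette_score() needs at least two clusters:

    silhouette_coefficients = []
    for k in range(2, 11):
        kmeans = KMeans(n_clusters=k, **kmeans_kwargs)
        kmeans.fit(scaled_features)
        score = silhouette_score(scaled_features, kmeans.labels_)
        silhouette_coefficients.append(score)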

Plotting the average silhouette scores for each k shows that the best choice for k is 3 since it has the maximum score:


Ultimately, your decision on the number of clusters to use should be guided by a combination of domain knowledge and clustering evaluation metrics.

The elbow method and silhouette coefficient evaluate clustering performance without the use of ground truth labels . Ground truth labels categorize data points into groups based on assignment by a human or an existing algorithm. These types of metrics do their best to suggest the correct number of clusters but can be deceiving when used without context.

Note: In practice, it’s rare to encounter datasets that have ground truth labels.

When comparing k -means against a density-based approach on nonspherical clusters, the results from the elbow method and silhouette coefficient rarely match human intuition. This scenario highlights why advanced clustering evaluation techniques are necessary. To visualize an example, import these additional modules:
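The additional modules assumed here are DBSCAN, make_moons(), and the adjusted rand score:

    from sklearn.cluster import DBSCAN
    from sklearn.datasets import make_moons
    from sklearn.metrics import adjusted_rand_score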

This time, use make_moons() to generate synthetic data in the shape of crescents:
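Generate and scale the crescents; the noise level and sample count are assumptions:

    features, true_labels = make_moons(n_samples=250, noise=0.05, random_state=42)
    scaled_features = scaler.fit_transform(features)  # reuse the StandardScaler from earlier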

Fit both a k -means and a DBSCAN algorithm to the new data and visually assess the performance by plotting the cluster assignments with Matplotlib :
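Fit both models on the scaled crescent data; the eps value for DBSCAN is an assumption:

    kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
    kmeans.fit(scaled_features)

    dbscan = DBSCAN(eps=0.3)
    dbscan.fit(scaled_features)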

Print the silhouette coefficient for each of the two algorithms and compare them. A higher silhouette coefficient suggests better clusters, which is misleading in this scenario:
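Compute and compare the two silhouette coefficients:

    kmeans_silhouette = silhouette_score(scaled_features, kmeans.labels_).round(2)
    dbscan_silhouette = silhouette_score(scaled_features, dbscan.labels_).round(2)
    print(kmeans_silhouette, dbscan_silhouette)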

The silhouette coefficient is higher for the k -means algorithm. The DBSCAN algorithm appears to find more natural clusters according to the shape of the data:


This suggests that you need a better method to compare the performance of these two clustering algorithms.

If you’re interested, the code for the above plot is shown below.

To learn more about plotting with Matplotlib and Python, check out Python Plotting with Matplotlib (Guide) . Here’s how you can plot the comparison of the two algorithms in the crescent moons example:
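A compact version of that comparison plot; the figure size and styling choices here are arbitrary:

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 4), sharex=True, sharey=True)
    ax1.scatter(scaled_features[:, 0], scaled_features[:, 1], c=kmeans.labels_)
    ax1.set_title(f"k-means\nSilhouette: {kmeans_silhouette}")
    ax2.scatter(scaled_features[:, 0], scaled_features[:, 1], c=dbscan.labels_)
    ax2.set_title(f"DBSCAN\nSilhouette: {dbscan_silhouette}")
    plt.show()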

Since the ground truth labels are known, it’s possible to use a clustering metric that considers labels in its evaluation. You can use the scikit-learn implementation of a common metric called the adjusted rand index (ARI) . Unlike the silhouette coefficient, the ARI uses true cluster assignments to measure the similarity between true and predicted labels.

Compare the clustering results of DBSCAN and k -means using ARI as the performance metric:
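Since true_labels is available from make_moons(), ARI directly compares true and predicted assignments:

    ari_kmeans = adjusted_rand_score(true_labels, kmeans.labels_)
    ari_dbscan = adjusted_rand_score(true_labels, dbscan.labels_)
    print(round(ari_kmeans, 2), round(ari_dbscan, 2))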

The ARI output values range between -1 and 1 . A score close to 0.0 indicates random assignments, and a score close to 1 indicates perfectly labeled clusters.

Based on the above output, you can see that the silhouette coefficient was misleading. ARI shows that DBSCAN is the best choice for the synthetic crescents example as compared to k -means.

There are several metrics that evaluate the quality of clustering algorithms. Reading through the implementations in scikit-learn will help you select an appropriate clustering evaluation metric.

How to Build a K-Means Clustering Pipeline in Python

Now that you have a basic understanding of k -means clustering in Python, it’s time to perform k -means clustering on a real-world dataset. These data contain gene expression values from a manuscript authored by The Cancer Genome Atlas (TCGA) Pan-Cancer analysis project investigators.

There are 881 samples (rows) representing five distinct cancer subtypes. Each sample has gene expression values for 20,531 genes (columns). The dataset is available from the UC Irvine Machine Learning Repository , but you can use the Python code below to obtain the data programmatically.


In this section, you’ll build a robust k -means clustering pipeline. Since you’ll perform multiple transformations of the original input data, your pipeline will also serve as a practical clustering framework.

Assuming you want to start with a fresh namespace , import all the modules needed to build and evaluate the pipeline, including pandas and seaborn for more advanced visualizations:
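An import list that covers the pipeline section below; again, treat it as an assumption about the original:

    import tarfile
    import urllib.request

    import matplotlib.pyplot as plt
    import numpy as np
    import pandas as pd
    import seaborn as sns

    from sklearn.cluster import KMeans
    from sklearn.decomposition import PCA
    from sklearn.metrics import adjusted_rand_score, silhouette_score
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import LabelEncoder, MinMaxScaler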

Download and extract the TCGA dataset from UCI:
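The archive's exact address isn't reproduced here; substituting the URL listed on the UCI dataset page, the download-and-extract step could look like this:

    # Placeholder: replace with the archive URL from the UCI repository page
    uci_tcga_url = "https://archive.ics.uci.edu/.../TCGA-PANCAN-HiSeq.tar.gz"
    archive_name = "TCGA-PANCAN-HiSeq.tar.gz"

    urllib.request.urlretrieve(uci_tcga_url, archive_name)
    with tarfile.open(archive_name, "r:gz") as tar:
        tar.extractall()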

After the download and extraction is completed, you should have a directory that looks like this:

The KMeans class in scikit-learn requires a NumPy array as an argument. The NumPy package has a helper function to load the data from the text file into memory as NumPy arrays:
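Assuming the extracted directory contains a data.csv and a labels.csv (the paths below are placeholders), np.genfromtxt() loads them as arrays; the usecols range skips the sample-ID column and keeps the 20,531 gene columns:

    datafile = "tcga-pancan-hiseq/data.csv"       # placeholder path
    labels_file = "tcga-pancan-hiseq/labels.csv"  # placeholder path

    data = np.genfromtxt(datafile, delimiter=",", usecols=range(1, 20532), skip_header=1)
    true_label_names = np.genfromtxt(labels_file, delimiter=",", usecols=(1,),
                                     skip_header=1, dtype="str")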

Check out the first three columns of data for the first five samples as well as the labels for the first five samples:

The data variable contains all the gene expression values from 20,531 genes. The true_label_names are the cancer types for each of the 881 samples. The first record in data corresponds with the first label in true_label_names .

The labels are strings containing abbreviations of cancer types:

To use these labels in the evaluation methods, you first need to convert the abbreviations to integers with LabelEncoder :
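LabelEncoder maps each abbreviation to an integer, and the number of unique classes gives n_clusters:

    label_encoder = LabelEncoder()
    true_labels = label_encoder.fit_transform(true_label_names)

    n_clusters = len(label_encoder.classes_)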

Since the label_encoder has been fitted to the data, you can see the unique classes represented using .classes_ . Store the length of the array to the variable n_clusters for later use:

In practical machine learning pipelines, it’s common for the data to undergo multiple sequences of transformations before it feeds into a clustering algorithm. You learned about the importance of one of these transformation steps, feature scaling, earlier in this tutorial. An equally important data transformation technique is dimensionality reduction , which reduces the number of features in the dataset by either removing or combining them.

Dimensionality reduction techniques help to address a problem with machine learning algorithms known as the curse of dimensionality . In short, as the number of features increases, the feature space becomes sparse. This sparsity makes it difficult for algorithms to find data objects near one another in higher-dimensional space. Since the gene expression dataset has over 20,000 features, it qualifies as a great candidate for dimensionality reduction.

Principal Component Analysis (PCA) is one of many dimensionality reduction techniques. PCA transforms the input data by projecting it into a lower number of dimensions called components . The components capture the variability of the input data through a linear combination of the input data’s features.

Note: A full description of PCA is out of scope for this tutorial, but you can learn more about it in the scikit-learn user guide .

The next code block introduces you to the concept of scikit-learn pipelines . The scikit-learn Pipeline class is a concrete implementation of the abstract idea of a machine learning pipeline.

Your gene expression data aren’t in the optimal format for the KMeans class, so you’ll need to build a preprocessing pipeline . The pipeline will implement an alternative to the StandardScaler class called MinMaxScaler for feature scaling. You use MinMaxScaler when you do not assume that the shape of all your features follows a normal distribution.

The next step in your preprocessing pipeline will implement the PCA class to perform dimensionality reduction:
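A preprocessing pipeline with those two steps; the step names are just labels:

    preprocessor = Pipeline([
        ("scaler", MinMaxScaler()),
        ("pca", PCA(n_components=2, random_state=42)),
    ])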

Now that you’ve built a pipeline to process the data, you’ll build a separate pipeline to perform k -means clustering. You’ll override the following default arguments of the KMeans class:

init: You’ll use "k-means++" instead of "random" to ensure centroids are initialized with some distance between them. In most cases, this will be an improvement over "random" .

n_init: You’ll increase the number of initializations to ensure you find a stable solution.

max_iter: You’ll increase the number of iterations per initialization to ensure that k -means will converge.

Build the k -means clustering pipeline with user-defined arguments in the KMeans constructor:
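A clustering pipeline with the overridden arguments described above; the specific n_init and max_iter values are assumptions:

    clusterer = Pipeline([
        ("kmeans", KMeans(
            n_clusters=n_clusters,
            init="k-means++",
            n_init=50,
            max_iter=500,
            random_state=42,
        )),
    ])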

The Pipeline class can be chained to form a larger pipeline. Build an end-to-end k -means clustering pipeline by passing the "preprocessor" and "clusterer" pipelines to Pipeline :

Calling .fit() with data as the argument performs all the pipeline steps on the data :
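Chaining the two pipelines and fitting on the gene expression data:

    pipe = Pipeline([
        ("preprocessor", preprocessor),
        ("clusterer", clusterer),
    ])

    pipe.fit(data)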

The pipeline performs all the necessary steps to execute k -means clustering on the gene expression data! Depending on your Python REPL , .fit() may print a summary of the pipeline. Objects defined inside pipelines are accessible using their step name.

Evaluate the performance by calculating the silhouette coefficient:

Calculate ARI, too, since the ground truth cluster labels are available:
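Both metrics need the preprocessed data and the predicted labels, which can be pulled back out of the fitted pipeline by step name:

    preprocessed_data = pipe["preprocessor"].transform(data)
    predicted_labels = pipe["clusterer"]["kmeans"].labels_

    print(silhouette_score(preprocessed_data, predicted_labels))
    print(adjusted_rand_score(true_labels, predicted_labels))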

As mentioned earlier, the scale for each of these clustering performance metrics ranges from -1 to 1. A silhouette coefficient of 0 indicates that clusters are significantly overlapping one another, and a silhouette coefficient of 1 indicates clusters are well-separated. An ARI score of 0 indicates that cluster labels are randomly assigned, and an ARI score of 1 means that the true labels and predicted labels form identical clusters.

Since you specified n_components=2 in the PCA step of the k -means clustering pipeline, you can also visualize the data in the context of the true labels and predicted labels. Plot the results using a pandas DataFrame and the seaborn plotting library:
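One way to draw it, with the two PCA components as axes, predicted clusters as hue, and true labels as marker style:

    pcadf = pd.DataFrame(preprocessed_data, columns=["component_1", "component_2"])
    pcadf["predicted_cluster"] = predicted_labels
    pcadf["true_label"] = label_encoder.inverse_transform(true_labels)

    plt.figure(figsize=(8, 8))
    sns.scatterplot(data=pcadf, x="component_1", y="component_2",
                    hue="predicted_cluster", style="true_label", palette="Set2")
    plt.show()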

Here’s what the plot looks like:


The visual representation of the clusters confirms the results of the two clustering evaluation metrics. The performance of your pipeline was pretty good. The clusters only slightly overlapped, and cluster assignments were much better than random.

Your first k -means clustering pipeline performed well, but there’s still room to improve. That’s why you went through the trouble of building the pipeline: you can tune the parameters to get the most desirable clustering results.

The process of parameter tuning consists of sequentially altering one of the input values of the algorithm’s parameters and recording the results. At the end of the parameter tuning process, you’ll have a set of performance scores, one for each new value of a given parameter. Parameter tuning is a powerful method to maximize performance from your clustering pipeline.

By setting the PCA parameter n_components=2 , you squished all the features into two components, or dimensions. This value was convenient for visualization on a two-dimensional plot. But using only two components means that the PCA step won’t capture all of the explained variance of the input data.

Explained variance measures the discrepancy between the PCA-transformed data and the actual input data. The relationship between n_components and explained variance can be visualized in a plot to show you how many components you need in your PCA to capture a certain percentage of the variance in the input data. You can also use clustering performance metrics to evaluate how many components are necessary to achieve satisfactory clustering results.

In this example, you’ll use clustering performance metrics to identify the appropriate number of components in the PCA step. The Pipeline class is powerful in this situation. It allows you to perform basic parameter tuning using a for loop .

Iterate over a range of n_components and record evaluation metrics for each iteration:
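A sketch of the tuning loop: set the PCA step's n_components, refit the whole pipeline, and record both metrics for each value (the range of components tried is an assumption):

    silhouette_scores = []
    ari_scores = []
    for n in range(2, 11):
        # Update the n_components of the PCA step, then refit everything
        pipe["preprocessor"]["pca"].n_components = n
        pipe.fit(data)

        reduced_data = pipe["preprocessor"].transform(data)
        labels = pipe["clusterer"]["kmeans"].labels_
        silhouette_scores.append(silhouette_score(reduced_data, labels))
        ari_scores.append(adjusted_rand_score(true_labels, labels))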

Plot the evaluation metrics as a function of n_components to visualize the relationship between adding components and the performance of the k -means clustering results:
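Plot both metric curves against the number of components:

    plt.plot(range(2, 11), silhouette_scores, marker="o", label="Silhouette Coefficient")
    plt.plot(range(2, 11), ari_scores, marker="o", label="ARI")
    plt.xlabel("n_components")
    plt.ylabel("Clustering Performance")
    plt.legend()
    plt.show()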

The above code generates a plot showing performance metrics as a function of n_components :


There are two takeaways from this figure:

The silhouette coefficient decreases linearly. The silhouette coefficient depends on the distance between points, so as the number of dimensions increases, the sparsity increases.

The ARI improves significantly as you add components. It appears to start tapering off after n_components=7 , so that would be the value to use for presenting the best clustering results from this pipeline.

Like most machine learning decisions, you must balance optimizing clustering evaluation metrics with the goal of the clustering task. In situations when cluster labels are available, as is the case with the cancer dataset used in this tutorial, ARI is a reasonable choice. ARI quantifies how accurately your pipeline was able to reassign the cluster labels.

The silhouette coefficient, on the other hand, is a good choice for exploratory clustering because it helps to identify subclusters. These subclusters warrant additional investigation, which can lead to new and important insights.

You now know how to perform k -means clustering in Python. Your final k -means clustering pipeline was able to cluster patients with different cancer types using real-world gene expression data. You can use the techniques you learned here to cluster your own data, understand how to get the best clustering results, and share insights with others.

In this tutorial, you learned:

You also took a whirlwind tour of scikit-learn, an accessible and extensible tool for implementing k -means clustering in Python.

You’re now ready to perform k -means clustering on datasets you find interesting.

Note: The dataset used in this tutorial was obtained from the UCI Machine Learning Repository. Dua, D. and Graff, C. (2019). UCI Machine Learning Repository . Irvine, CA: University of California, School of Information and Computer Science.

The original dataset is maintained by The Cancer Genome Atlas Pan-Cancer analysis project.



Written by Chris Piech. Based on a handout by Andrew Ng.

The Basic Idea

Say you are given a data set where each observed example has a set of features, but has no labels. Labels are an essential ingredient to a supervised algorithm like Support Vector Machines, which learns a hypothesis function to predict labels given features. So we can't run supervised learning. What can we do?

One of the most straightforward tasks we can perform on a data set without labels is to find groups of data in our dataset which are similar to one another -- what we call clusters.

K-Means is one of the most popular "clustering" algorithms. K-means stores $k$ centroids that it uses to define clusters. A point is considered to be in a particular cluster if it is closer to that cluster's centroid than any other centroid.

K-Means finds the best centroids by alternating between (1) assigning data points to clusters based on the current centroids and (2) choosing centroids (points which are the center of a cluster) based on the current assignment of data points to clusters.


Figure 1: K-means algorithm. Training examples are shown as dots, and cluster centroids are shown as crosses. (a) Original dataset. (b) Random initial cluster centroids. (c-f) Illustration of running two iterations of k-means. In each iteration, we assign each training example to the closest cluster centroid (shown by "painting" the training examples the same color as the cluster centroid to which it is assigned); then we move each cluster centroid to the mean of the points assigned to it. Images courtesy of Michael Jordan.

The Algorithm

In the clustering problem, we are given a training set ${x^{(1)}, ... , x^{(m)}}$, and want to group the data into a few cohesive "clusters." Here, we are given feature vectors for each data point $x^{(i)} \in \mathbb{R}^n$ as usual; but no labels $y^{(i)}$ (making this an unsupervised learning problem). Our goal is to predict $k$ centroids and a label $c^{(i)}$ for each datapoint. The k-means clustering algorithm is as follows:

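The rendered algorithm did not survive extraction; using the notation above, the standard statement of the updates is:

1. Initialize cluster centroids $\mu_1, \mu_2, \dots, \mu_k \in \mathbb{R}^n$ randomly.

2. Repeat until convergence: for every $i$, set $c^{(i)} := \arg\min_j \lVert x^{(i)} - \mu_j \rVert^2$; then, for each $j$, set $\mu_j := \frac{\sum_{i=1}^{m} 1\{c^{(i)} = j\}\, x^{(i)}}{\sum_{i=1}^{m} 1\{c^{(i)} = j\}}$.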

Implementation

Here is pseudo-python code which runs k-means on a dataset. It is a short algorithm made longer by verbose commenting.

    # Function: K Means
    # -------------
    # K-Means is an algorithm that takes in a dataset and a constant
    # k and returns k centroids (which define clusters of data in the
    # dataset which are similar to one another).
    def kmeans(dataSet, k):
        # Initialize centroids randomly
        numFeatures = dataSet.getNumFeatures()
        centroids = getRandomCentroids(numFeatures, k)

        # Initialize book keeping vars.
        iterations = 0
        oldCentroids = None

        # Run the main k-means algorithm
        while not shouldStop(oldCentroids, centroids, iterations):
            # Save old centroids for convergence test. Book keeping.
            oldCentroids = centroids
            iterations += 1

            # Assign labels to each datapoint based on centroids
            labels = getLabels(dataSet, centroids)

            # Assign centroids based on datapoint labels
            centroids = getCentroids(dataSet, labels, k)

        # We can get the labels too by calling getLabels(dataSet, centroids)
        return centroids

    # Function: Should Stop
    # -------------
    # Returns True or False if k-means is done. K-means terminates either
    # because it has run a maximum number of iterations OR the centroids
    # stop changing.
    def shouldStop(oldCentroids, centroids, iterations):
        if iterations > MAX_ITERATIONS:
            return True
        return oldCentroids == centroids

    # Function: Get Random Centroids
    # -------------
    # Returns k random centroids, each of dimension n.
    def getRandomCentroids(n, k):
        # return some reasonable randomization.
        ...

    # Function: Get Labels
    # -------------
    # Returns a label for each piece of data in the dataset.
    def getLabels(dataSet, centroids):
        # For each element in the dataset, choose the closest centroid.
        # Make that centroid the element's label.
        ...

    # Function: Get Centroids
    # -------------
    # Returns k centroids, each of dimension n.
    def getCentroids(dataSet, labels, k):
        # Each centroid is the geometric mean of the points that
        # have that centroid's label. Important: If a centroid is empty
        # (no points have that centroid's label) you should randomly
        # re-initialize it.
        ...

Important note: You might be tempted to calculate the distance between two points manually, by looping over values. This will work, but it will lead to a slow k-means! And a slow k-means will mean that you have to wait longer to test and debug your solution.

Let's define three vectors:

To calculate the distance between x and y we can use: np.sqrt(sum((x - y) ** 2))

To calculate the distance between all the length 5 vectors in z and x we can use: np.sqrt(((z-x)**2).sum(axis=0))

Expectation Maximization

K-Means is really just the EM (Expectation Maximization) algorithm applied to a particular naive bayes model.

To demonstrate this remarkable claim, consider the classic naive bayes model with a class variable which can take on discrete values (with domain size $k$) and a set of feature variables, each of which can take on a continuous value (see figure 2). The conditional probability distribution for $P(f_i = x | C= c)$ is going to be slightly different than usual. Instead of storing this conditional probability as a table, we are going to store it as a single normal (gaussian) distribution, with its own mean and a standard deviation of 1. Specifically, this means that: $P(f_i = x | C= c) \sim \mathcal{N}(\mu_{c,i}, 1)$

Learning the values of $\mu_{c, i}$ given a dataset with assigned values to the features but not the class variables is provably identical to running k-means on that dataset.


Figure 2: The K-Means algorithm is the EM algorithm applied to this Bayes Net.

If we know that this is the structure of our bayes net, but we don't know any of the conditional probability distributions, then we have to run Parameter Learning before we can run Inference.

In the dataset we are given, all the feature variables are observed (for each data point) but the class variable is hidden. Since we are running Parameter Learning on a bayes net where some variables are unobserved, we should use EM.

Let's review EM. In EM, you randomly initialize your model parameters, then you alternate between (E) assigning values to hidden variables, based on the parameters, and (M) computing parameters based on fully observed data.

E-Step: Come up with values for the hidden variables, based on the parameters. If you work out the math of choosing the best values for the class variable based on the features of a given piece of data in your data set, it comes out to "for each data-point, choose the centroid that it is closest to, by Euclidean distance, and assign that centroid's label." The proof of this is within your grasp! See lecture.

M-Step: Come up with parameters, based on the full assignments. If you work out the math of choosing the best parameter values based on the features of a given piece of data in your data set, it comes out to "take the mean of all the data-points that were labeled as c."

So what? Well this gives you an idea of the qualities of k-means. Like EM, it is provably going to find a local optimum. Like EM, it is not necessarily going to find a global optimum. It turns out those random initial values do matter.

Figure 1 shows k-means with a 2-dimensional feature vector (each point has two dimensions, an x and a y). In your applications, you will probably be working with data that has many features. In fact, each data-point may have hundreds of dimensions. We can visualize clusters in up to 3 dimensions (see figure 3), but beyond that you have to rely on a more mathematical understanding.


Figure 3: KMeans in other dimensions. (left) K-means in 2d. (right) K-means in 3d. You have to imagine k-means in 4d.


Pair programming. On this assignment, you are encouraged (not required) to work with a partner provided you practice pair programming. Pair programming "is a practice in which two programmers work side-by-side at one computer, continuously collaborating on the same design, algorithm, code, or test." One partner is driving (designing and typing the code) while the other is navigating (reviewing the work, identifying bugs, and asking questions). The two partners switch roles every 30-40 minutes and brainstorm on demand.

Before pair programming, you must read the article All I really need to know about pair programming I learned in kindergarten . You should choose a partner of similar ability.

Only one partner submits the code and readme.txt ; the other partner submits only an abbreviated readme.txt that contains both partners' names, logins, and information about extra credit. The names and logins of both partners MUST appear at the top of every submitted file. Both partners receive the same grade.

Writing code with a partner without following the pair programming instructions listed above is a serious violation of the course collaboration policy.


Distance Measures. You will use two distance measures for clustering: Euclidean distance and Spearman rank correlation. Euclidean distance measures differences in the absolute levels of gene expression, whereas correlation compares the shapes of expression patterns. Furthermore, Spearman rank correlation uses ranks in place of absolute values, which makes it less sensitive to outliers (extremely high or low values) in the data.


For example, suppose we have two vectors x = [4.6, 1.9, 2.4] and y = [2.2, 6.6, 5.5]. Then the rank transforms of these two vectors would be r = [3, 1, 2] and s = [1, 3, 2], making d = [3-1, 1-3, 2-2] = [2, -2, 0].

Note that these functions should return distance measures; that is, the returned value should be high if the two vectors are dissimilar, low if they are similar, and zero if they are completely identical. This requirement is already met for Euclidean distance, but Spearman rank correlation varies between -1 and 1, and high values indicate similarity. Therefore, you must transform the Spearman rank correlation so that the returned value is always greater than or equal to zero, with high values indicating dissimilarity.
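For concreteness, here is one way the two measures could be sketched in Python (the assignment itself is written in Java, and the function names below are illustrative, not prescribed by the starter code). It uses 1 minus the correlation as the Spearman-based distance, which satisfies the requirement above:

import numpy as np
from scipy.stats import spearmanr

def euclidean_distance(x, y):
    # Zero when the two expression vectors are identical, larger the more they differ.
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return float(np.sqrt(np.sum((x - y) ** 2)))

def spearman_distance(x, y):
    # Spearman correlation lies in [-1, 1], with 1 meaning identical rank orderings,
    # so 1 - correlation is a dissimilarity that is >= 0 and zero for identical ranks.
    rho, _ = spearmanr(x, y)
    return float(1.0 - rho)

print(euclidean_distance([4.6, 1.9, 2.4], [2.2, 6.6, 5.5]))  # ~6.12
print(spearman_distance([4.6, 1.9, 2.4], [2.2, 6.6, 5.5]))   # 2.0 (the ranks are perfectly anti-correlated)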

Overview of the code. The code is divided into several classes to represent our real-world data. These classes are Gene.java , Cluster.java , and KMeans.java . We have provided you with a start on each of these classes. As you read the following descriptions, refer to the relevant code files to get a sense of the structure of the program and to figure out the code we provided.

We're keeping the Gene objects in each Cluster in a HashSet<Gene>. The size of our clusters will change as we add genes to the cluster with the nearest centroid. The generic class HashSet is part of the Java standard libraries (java.util). A HashSet is an unordered set that stores only unique elements, which means no duplicate elements can be present in the set. To enforce this uniqueness property, HashSet requires the element type to define hashCode() and equals(Object o) for checking equivalence of elements in the set. Once these are defined, inserting an element that is already present in the HashSet has no effect.

All standard Java data types (such as Integer, String) already have equals() and hashCode() defined, but user-defined classes such as Gene do not. For this assignment, you will need to define hashCode() and equals(Object o) for the Gene type. Two Gene objects are considered equal if their gene names are equal. hashCode() returns an int that is the index of the "bin" a Gene object should be hashed to. You can assume hashCode() for Gene should work the same way as hashCode() for the gene name (of type String).

Another method of HashSet that you want to be familiar with is add(Gene o) . The add(Gene o) method will take a Gene as an argument and add it to the hash set.

for (Gene gene : geneSet) {
    System.out.println(gene.getName());
}

We use the HashSet data structure to represent the genes in a cluster, rather than an array, because it makes equality checking between clusters efficient: two clusters are equal exactly when each contains every element of the other. At each iteration, you need to check for equality between the cluster before the iteration and the cluster after the iteration, for every cluster in the list, and terminate the iterative algorithm only if no cluster has changed. You can check the equality of two sets with: s1.containsAll(s2) && s2.containsAll(s1).

java KMeans <input_data_filename> <K> <metric> <output_filename>

Progress Steps and Testing. Once you know that the Genes are loading correctly, you should start to cluster. Initialize each of the cluster centroids to a random Gene as a starting value and proceed with the algorithm described above. After each iteration of creating new clusters, you need to compare the cluster content before and after the iteration to determine if you need to continue with another iteration. In order to test and debug this process, it may be useful to take one of the provided data files and considerably reduce the number of genes and/or experiments. Then you can use this smaller data file and print out information at each step of the iterative process to verify that the proper behavior is occurring.

Once you know that your program is clustering properly, have it write to the screen the names of each gene in each cluster, and create a .jpg for each cluster by calling createJPG() , passing in the <output_filename> and the cluster number as the arguments. In the end, when your code is invoked, it should read in a data file, perform the clustering, list the gene names on the screen, and create pictures of the expression levels of the genes in each cluster.

We have marked the places in the code where you have to fill in with // TODO .

Using your code. Now you are ready to use your program to cluster some real data. We have provided two microarray data files. One ( human_cancer.pcl ) is a human lung cancer data set, and the other ( yeast_stress.pcl ) is a yeast stress response data set. Run your code on both data sets with varying Ks using each distance metric. Inspect the results, and see what values of K and which distance metric are performing the best. We want you to turn in results (a text file with the gene names in each cluster and .jpgs of each cluster) with K = 5 for the cancer dataset and K = 7 for the yeast dataset using both distance metrics. Name the text files dataset_metric_K.txt , e.g., human_euclidean_5.txt . Name the cluster image files dataset_metric_K_clusternumber.jpg , e.g., yeast_spearman_7_1.jpg .

Extra credit. If you'd like to get some extra credit (and see more of how microarray data is analyzed in research labs today), read on. One problem with the clusters of genes you identified is that it is unclear which of the clusters is "better" (i.e., more biologically correct). Is Spearman- or Euclidean-based clustering better?

We can try to evaluate the clusters by looking at the biological functions of known genes in each cluster and checking for statistical enrichment of specific biological functions. If a cluster has high statistical enrichment for genes with particular function, does that make you more or less confident in the fact that this cluster is biologically relevant?

For extra credit, examine one of your clusters for functional enrichment. Pick a cluster from the yeast dataset that changed a lot between the two distance metrics. Put the genes from the two versions of that cluster into the GoTermFinder tool at http://go.princeton.edu/cgi-bin/GOTermFinder . Make sure to check the Process option. You can ignore the other fields. This tool identifies significantly enriched functions in a group of genes. It uses the Gene Ontology , a vocabulary for known biological functions assigned to genes. Based on the results from GoTermFinder, fill out the appropriate section of the readme.txt template.

Required files to submit:


K-Means Clustering Algorithm: Applications, Types, Demos and Use Cases


Every Machine Learning engineer wants to achieve accurate predictions with their algorithms. Such learning algorithms are generally broken down into two types: supervised and unsupervised. K-Means clustering is one of the unsupervised algorithms, used when the available input data does not have a labeled response.

Types of Clustering

Clustering is a type of unsupervised learning wherein data points are grouped into different sets based on their degree of similarity.

The various types of clustering are hierarchical clustering and partitioning clustering.

Hierarchical clustering is further subdivided into agglomerative clustering and divisive clustering.

Partitioning clustering is further subdivided into K-Means clustering and Fuzzy C-Means clustering.


Hierarchical Clustering

Hierarchical clustering arranges the data in a tree-like structure.

In agglomerative clustering, we take a bottom-up approach: we begin with each element as a separate cluster and merge them into successively larger clusters.

Divisive clustering is a top-down approach: we begin with the whole set and proceed to divide it into successively smaller clusters.

Partitioning Clustering 

Partitioning clustering is split into two subtypes - K-Means clustering and Fuzzy C-Means.

In K-means clustering, the objects are divided into the number of clusters specified by 'K.' So if we say K = 2, the objects are divided into two clusters, c1 and c2.


Here, the features or characteristics are compared, and all objects having similar characteristics are clustered together. 

Fuzzy C-means is very similar to K-means in the sense that it clusters objects with similar characteristics together. In K-means clustering, a single object cannot belong to two different clusters; in C-means, an object can belong to more than one cluster, with a degree of membership in each.

What is Meant by the K-Means Clustering Algorithm?

K-Means clustering is an unsupervised learning algorithm. There is no labeled data for this clustering, unlike in supervised learning. K-Means performs the division of objects into clusters that share similarities and are dissimilar to the objects belonging to another cluster. 

The term 'K' is a number. You need to tell the system how many clusters you need to create. For example, K = 2 refers to two clusters. There is a way of finding the best or optimum value of K for a given dataset, discussed later in this article.

For a better understanding of k-means, let's take an example from cricket. Imagine you received data on a lot of cricket players from all over the world, giving the runs scored and the wickets taken by each player in their last ten matches. Based on this information, we need to group the data into two clusters, namely batsmen and bowlers.

Let's take a look at the steps to create these clusters.

Assign data points

Here, we have our data set plotted on ‘x’ and ‘y’ coordinates. The information on the y-axis is about the runs scored, and on the x-axis about the wickets taken by the players. 

If we plot the data, this is how it would look:

[Figure: players plotted with wickets taken on the x-axis and runs scored on the y-axis]


Perform Clustering

We need to create the clusters.

Considering the same data set, let us solve the problem using K-Means clustering (taking K = 2).

The first step in k-means clustering is the allocation of two centroids randomly (as K=2). Two points are assigned as centroids. Note that the points can be anywhere, as they are random points. They are called centroids, but initially, they are not the central point of a given data set.


The next step is to determine the distance of each data point from each of the randomly assigned centroids. For every point, the distance to both centroids is measured, and the point is assigned to the centroid it is closer to. The data points attached to the blue and yellow centroids form the two initial groups.

The next step is to determine the actual centroid for these two clusters. The original randomly allocated centroid is to be repositioned to the actual centroid of the clusters.


This process of calculating the distance and repositioning the centroid continues until we obtain our final cluster. Then the centroid repositioning stops.


As seen above, the centroids don't need any more repositioning, which means the algorithm has converged and we have our two clusters, each with its own centroid.


Applications of K-Means Clustering

K-Means clustering is used in a variety of examples or business cases in real life, like:

Wireless sensor networks

In wireless sensor networks, the clustering algorithm plays the role of finding the cluster heads, which collect all the data in their respective clusters.

Academic Performance

Based on the scores, students are categorized into grades like A, B, or C. 

Diagnostic systems

The medical profession uses k-means in creating smarter medical decision support systems, especially in the treatment of liver ailments.

Search engines

Clustering forms a backbone of search engines. When a search is performed, the search results need to be grouped, and the search engines very often use clustering to do this. 


Distance Measure 

Distance measure determines the similarity between two elements and influences the shape of clusters.

K-Means clustering supports various kinds of distance measures, such as the Euclidean distance, the squared Euclidean distance, the Manhattan distance, and the cosine distance.


Euclidean Distance Measure 

The most common case is determining the distance between two points. If we have a point P and a point Q, the Euclidean distance is the length of the ordinary straight line connecting them in Euclidean space.

The formula for distance between two points is shown below:

$$d(P,Q)=\sqrt{\sum_{i=1}^{n}(q_i-p_i)^2}$$

Squared Euclidean Distance Measure

This is identical to the Euclidean distance measurement but does not take the square root at the end. The formula is shown below:

$$d(P,Q)=\sum_{i=1}^{n}(q_i-p_i)^2$$

Manhattan Distance Measure

The Manhattan distance is the simple sum of the horizontal and vertical components or the distance between two points measured along axes at right angles.

Note that we are taking the absolute value so that the negative values don't come into play. 

The formula is shown below:

$$d(P,Q)=\sum_{i=1}^{n}|q_i-p_i|$$

Cosine Distance Measure

In this case, we take the angle between the two vectors formed by joining each point to the origin. The formula is shown below:

$$d(P,Q)=1-\frac{\sum_{i=1}^{n}p_i\,q_i}{\sqrt{\sum_{i=1}^{n}p_i^{2}}\;\sqrt{\sum_{i=1}^{n}q_i^{2}}}$$
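To make these measures concrete, the following Python sketch computes each of them for a pair of points (the function names are illustrative and not taken from any particular library):

import numpy as np

def euclidean(p, q):
    return np.sqrt(np.sum((np.asarray(p) - np.asarray(q)) ** 2))

def squared_euclidean(p, q):
    return np.sum((np.asarray(p) - np.asarray(q)) ** 2)

def manhattan(p, q):
    return np.sum(np.abs(np.asarray(p) - np.asarray(q)))

def cosine_distance(p, q):
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    # 1 - cosine similarity: zero when the two vectors point in the same direction.
    return 1.0 - np.dot(p, q) / (np.linalg.norm(p) * np.linalg.norm(q))

p, q = [1.0, 2.0], [4.0, 6.0]
print(euclidean(p, q))          # 5.0
print(squared_euclidean(p, q))  # 25.0
print(manhattan(p, q))          # 7.0
print(cosine_distance(p, q))    # ~0.008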


How Does K-Means Clustering Work?

In outline, k-means clustering works as follows.

The goal of the K-Means algorithm is to find clusters in the given input data. There are a couple of ways to accomplish this. We can use the trial-and-error method by specifying the value of K (e.g., 3, 4, 5). As we progress, we keep changing the value until we get the best clusters.

Another method is to use the Elbow technique to determine the value of K. Once we have the value of K, the system assigns that many centroids randomly and measures the distance of each data point from these centroids. Each point is then assigned to the centroid from which its distance is minimum, i.e., the centroid closest to it. Thereby we have K initial clusters.

For the newly formed clusters, it calculates the new centroid position. The position of the centroid moves compared to the randomly allocated one.

Once again, the distance of each point is measured from this new centroid point. If required, the data points are relocated to the new centroids, and the mean position or the new centroid is calculated once again. 

If the centroids move, the iteration continues, indicating that convergence has not yet been reached. Once the centroids stop moving (which means the clustering process has converged), the result is final.

Let's use a visualization example to understand this better. 

We have a dataset for a grocery shop, and we want to find out how many clusters the data should be spread across. To find the optimum number of clusters, we break the problem down into the following steps:

The Elbow method is the best way to find the number of clusters. It involves running K-means clustering on the dataset for a range of values of K.

Next, we use the within-cluster sum of squares (WSS) as a measure to find the optimum number of clusters for a given dataset. WSS is defined as the sum of the squared distances between each member of a cluster and its centroid.


The WSS is measured for each value of K. Since WSS keeps decreasing as K grows, the optimum value is not the K with the smallest WSS but the point at which the decrease levels off.

Now, we draw a curve between WSS and the number of clusters. 


Here, WSS is on the y-axis and number of clusters on the x-axis.

You can see that there is a very gradual change in the value of WSS as the K value increases from 2. 

So, you can take the elbow point as the optimal value of K: here it should be either two, three, or at most four. Beyond that, increasing the number of clusters does not dramatically change the WSS; it stabilizes.


Let's assume that these are our delivery points:

[Figure: scatter of the delivery points]

We can randomly initialize two points called the cluster centroids.

Here, C1 and C2 are the centroids assigned randomly. 

Now the distance of each location from the centroid is measured, and each data point is assigned to the centroid, which is closest to it.

This is how the initial grouping is done:

[Figure: initial grouping of the delivery points around centroids C1 and C2]

Compute the actual centroid of data points for the first group.

Reposition the random centroid to the actual centroid.


Compute the actual centroid of data points for the second group.


Once the cluster becomes static, the k-means algorithm is said to be converged. 

The final cluster with centroids c1 and c2 is as shown below:

[Figure: final clusters with centroids C1 and C2]

K-Means Clustering Algorithm 

Let's say we have x1, x2, x3, ..., xn as our inputs, and we want to split them into K clusters.

The steps to form clusters are:

Step 1: Choose K random points as cluster centers called centroids. 

Step 2: Assign each x(i) to the closest cluster using the Euclidean distance (i.e., calculate its distance to each centroid).

Step 3: Identify new centroids by taking the average of the assigned points.

Step 4: Keep repeating step 2 and step 3 until convergence is achieved

Let's take a detailed look at each of these steps.

We randomly pick K points as centroids. We name them c1, c2, ..., ck, and we can say that

$$C=\{c_1,c_2,\ldots,c_k\}$$

Where C is the set of all centroids.


We assign each data point to its nearest center, which is accomplished by calculating the Euclidean distance.

$$\underset{c_i \in C}{\arg\min}\; dist(c_i,x)^2$$

Where dist() is the Euclidean distance.

Here, we calculate each x value's distance from each c value, i.e. the distance between x1-c1, x1-c2, x1-c3, and so on. Then we find which is the lowest value and assign x1 to that particular centroid.

Similarly, we find the minimum distance for x2, x3, etc. 

We identify the actual centroid by taking the average of all the points assigned to that cluster. 

$$c_i=\frac{1}{|S_i|}\sum_{x_i \in S_i} x_i$$

Where Si is the set of all points assigned to the ith cluster.     

It means the original point, which we thought was the centroid, will shift to the new position, which is the actual centroid for each of these groups. 

Keep repeating step 2 and step 3 until convergence is achieved.
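Putting the four steps together, a compact from-scratch sketch in NumPy might look like the following (the names are mine and this is only an illustration of the steps above, not production code; it assumes no cluster ever becomes empty):

import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: choose K random data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign each point to its closest centroid (Euclidean distance).
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 3: new centroids are the averages of the points assigned to them.
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 4: stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

# Example: two well-separated blobs should be recovered as two clusters.
X = np.vstack([np.random.randn(60, 2), np.random.randn(60, 2) + 5.0])
centroids, labels = kmeans(X, k=2)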

How to Choose the Value of "K number of clusters" in K-Means Clustering?

Although there are many methods available for choosing the optimal number of clusters, the Elbow Method is one of the most popular and appropriate. The Elbow Method uses the WCSS value, short for Within-Cluster Sum of Squares, which measures the total variation within the clusters. This is the formula used to calculate WCSS (for three clusters), provided courtesy of Javatpoint:

$$WCSS=\sum_{P_i \in Cluster_1} dist(P_i,C_1)^2+\sum_{P_i \in Cluster_2} dist(P_i,C_2)^2+\sum_{P_i \in Cluster_3} dist(P_i,C_3)^2$$
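As a rough illustration, once every point has been assigned to a cluster, WCSS can be computed directly. A minimal sketch, assuming X holds the data, labels holds the cluster assignments, and centroids holds the cluster centres (with scikit-learn, used in the code below, the same quantity is exposed as kmeans.inertia_):

import numpy as np

def wcss(X, labels, centroids):
    # Sum of squared distances from each point to the centre of its own cluster.
    return sum(np.sum((X[labels == j] - c) ** 2) for j, c in enumerate(centroids))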

Python Implementation of the K-Means Clustering Algorithm

Here’s how to use Python to implement the K-Means Clustering Algorithm. These are the steps you need to take:

1. Data Pre-Processing. Import the libraries, load the dataset, and extract the independent variables.

# Importing the libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd

# Importing the dataset
dataset = pd.read_csv('Mall_Customers_data.csv')
x = dataset.iloc[:, [3, 4]].values

2. Find the optimal number of clusters using the elbow method. Here’s the code you use:

# Finding the optimal number of clusters using the elbow method
from sklearn.cluster import KMeans

wcss_list = []  # Initializing the list for the values of WCSS

# Using a for loop for iterations from 1 to 10
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', random_state=42)
    kmeans.fit(x)
    wcss_list.append(kmeans.inertia_)

mtp.plot(range(1, 11), wcss_list)
mtp.title('The Elbow Method Graph')
mtp.xlabel('Number of clusters (k)')
mtp.ylabel('wcss_list')
mtp.show()

3. Train the K-means algorithm on the training dataset. Use the same two lines of code used in the previous section. However, instead of using i, use 5, because there are 5 clusters that need to be formed. Here’s the code:

# Training the K-means model on the dataset
kmeans = KMeans(n_clusters=5, init='k-means++', random_state=42)
y_predict = kmeans.fit_predict(x)

4. Visualize the Clusters. Since this model has five clusters, we need to visualize each one.

# Visualizing the clusters
mtp.scatter(x[y_predict == 0, 0], x[y_predict == 0, 1], s=100, c='blue', label='Cluster 1')     # first cluster
mtp.scatter(x[y_predict == 1, 0], x[y_predict == 1, 1], s=100, c='green', label='Cluster 2')    # second cluster
mtp.scatter(x[y_predict == 2, 0], x[y_predict == 2, 1], s=100, c='red', label='Cluster 3')      # third cluster
mtp.scatter(x[y_predict == 3, 0], x[y_predict == 3, 1], s=100, c='cyan', label='Cluster 4')     # fourth cluster
mtp.scatter(x[y_predict == 4, 0], x[y_predict == 4, 1], s=100, c='magenta', label='Cluster 5')  # fifth cluster
mtp.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s=300, c='yellow', label='Centroids')
mtp.title('Clusters of customers')
mtp.xlabel('Annual Income (k$)')
mtp.ylabel('Spending Score (1-100)')
mtp.legend()
mtp.show()

Coding provided by Javatpoint .

Demo: K-Means Clustering

Problem Statement - Walmart wants to open a chain of stores across the state of Florida, and it wants to find the optimal store locations to maximize revenue.

The issue here is if they open too many stores close to each other, they will not make a profit. But, if the stores are too far apart, they do not have enough sales coverage. 

Solution - An organization like Walmart already has the addresses of its customers in its database, so it can use this information and perform K-Means clustering to find the optimal store locations.


Conclusion 

Considered the job of the future, machine learning engineers are in demand as well as highly paid. A report by MarketWatch predicts a machine learning growth rate of over 45% for the period between 2017 and 2025. So, why restrict your learning to merely K-means clustering? Enroll in Simplilearn's Machine Learning Course and expand your knowledge of the broader concepts of Machine Learning. Get certified and become part of the Artificial Intelligence talent that companies constantly look for.


About the author.

Mayank Banoula

Mayank is a Research Analyst at Simplilearn. He is proficient in machine learning and artificial intelligence with Python.



Getting started with k-means clustering in Python



Imagine you are an accomplished marketeer establishing a new campaign for a product and want to find appropriate segments to target, or you are a lawyer interested in grouping together different documents depending on their content, or you are analysing credit card transactions to identify similar patterns. In all those cases, and many more, data science can be used to help cluster your data. Clustering analysis is an important area of unsupervised learning that helps us group data together. We have discussed the difference between supervised and unsupervised learning in this blog in the past. As a reminder, we use unsupervised learning when labelled data is not available for our purposes but we want to explore common features in the data. In the examples above, as a marketeer we may find common demographic characteristics in our target audience, as a lawyer we may establish common themes in the documents in question, or as a fraud analyst we may establish common transactions that highlight outliers in someone's account.

In all those cases, clustering offers a hand at finding those common traces, and there are a variety of clustering algorithms out there. In a previous post, we talked about density-based clustering, where we discussed its use in anomaly detection, similar to the credit card transactions use case above. In that post we argued that other algorithms may be easier to understand and implement, for example k-means, and the aim of this post is to do exactly that.

We will first establish the notion of a cluster and determine an important part of the implementation of k-means: centroids. We will see how k-means approaches the issue of similarity and how the groups are updated on every iteration until a stopping condition is met. We will illustrate this with a Python implementation and finish by looking at how to use this algorithm via the Scikit-learn library.

You can run the code in this post on your own machine, provided you have Python 3.x installed. Alternatively, you can sign up for free trial access to the Domino MLOps platform and access the code used in this article directly (use the Access Project button at the end of the article). Let's get started.

K-Means - what does it mean?

We mentioned that we are interested in finding out commonalities among our data observations. One way to determine that commonality or similarity is through a measure of distance between the data points. The shorter the distance, the more similar the observations are. There are different ways in which we can measure that distance, and one that is very familiar to a lot of people is the Euclidean distance. That's right! The same one we are taught when learning the Pythagorean theorem. Let us take a look and consider two data observations over two attributes \(a\) and \(b\). Point \(p_1\) has coordinates \((a_1,b_1)\) and point \(p_2=(a_2,b_2)\).


The distance \(\overline{p_1 p_2 }\) is given by:

$$\overline{p_1 p_2 }=\sqrt{(a_2-a_1 )^2+(b_2-b_1 )^2 }$$

The expression above can be extended to more than two attributes, and the distance can be measured between any two points. For a dataset with \(n\) observations, we assume there are \(k\) groups or clusters, and our aim is to determine which observation corresponds to which of those \(k\) groups. This is an important point to emphasise: the algorithm will not give us the number of clusters; instead, we need to define the number \(k\) in advance. We may be able to run the algorithm with different values for \(k\) and determine the best possible solution.

In a nutshell, \(k\)-means clustering tries to minimise the distances between the observations that belong to a cluster and maximise the distance between the different clusters. In that way, we have cohesion between the observations that belong to a group, while observations that belong to a different group are kept further apart. Please note that as we explained in this post , \(k\)-means is exhaustive in the sense that every single observation in the dataset will be forced to be part of one of the \(k\) clusters assumed.

It should now be clear where the \(k\) in \(k\)-means comes from, but what about the “means” part? Well, it turns out that as part of the algorithm we are also looking to identify the centre for each cluster. We call this a centroid , and as we assign observations to one cluster or the other, we update the position of the cluster centroid . This is done by taking the mean (average if you will) of all the data points that have been included in that cluster. Easy!

A recipe for k-means

The recipe for \(k\)-means is quite straightforward: (1) define \(k\); (2) initiate the centroids; (3) calculate the distance from every observation to each centroid; (4) assign each observation to its closest centroid; (5) update the centroid locations; and (6) repeat steps 3-5 until a stopping condition is met.

Take a look at the representation below where the steps are depicted in a schematic way for a 2-dimensional space. The same steps can be applied to more dimensions (i.e. more features or attributes). For simplicity, in the schematic we only show the distance measured to the closest centroid, but in practice all distances need to be considered.

For the purposes of our implementation, we will take a look at some data with 2 attributes for simplicity. We will then look at an example with more dimensions. To get us started, we will use a dataset we have prepared, available here under the name kmeans_blobs.csv. The dataset contains 4 columns.

We will discard column 4 for our analysis, but it may be useful to check the results of the application of \(k\)-means. We will do this in our second example later on. Let us start by reading the dataset:

[Figure: first 5 rows of the dataset]

Let us look at the observations in the dataset. We will use the Cluster column to show the different groups that are present in the dataset. Our aim will be to see if the application of the algorithm reproduces closely the groupings.

[Figure: scatter plot of the dataset showing the original groupings]

Let's now look at our recipe.

Steps 1 and 2 - Define \(k\) and initiate the centroids

First we need 1) to decide how many groups we have and 2) assign the initial centroids randomly. In this case let us consider \(k=3\), and as for the centroids, well, they have to be in the same range as the dataset itself. So one option is to randomly pick \(k\) observations and use their coordinates to initialise the centroids:

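The post's original snippet is not reproduced here, but a minimal sketch of this initialisation, assuming the data has been read into a pandas DataFrame df whose two attribute columns are named 'x' and 'y' (column names assumed for illustration), could be:

import pandas as pd

df = pd.read_csv('kmeans_blobs.csv')
k = 3
# Randomly pick k observations and use their coordinates as the initial centroids.
centroids = df[['x', 'y']].sample(n=k, random_state=42).reset_index(drop=True)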

Step 3 - Calculate distance

We now need to calculate the distance between each of the centroids and the data points. We will assign the data point to the centroid that gives us the minimum error. Let us create a function to calculate the root of square errors:
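A sketch of such an error function (the name rsq_error is mine, not the post's):

import numpy as np

def rsq_error(a, b):
    # Root of the sum of squared differences between two coordinate vectors.
    return np.sqrt(np.sum((np.asarray(a, dtype=float) - np.asarray(b, dtype=float)) ** 2))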

Let us pick a data point and calculate the error so we can see how this works in practice. We will use a point which is, in fact, one of the centroids we picked above. As such, we expect the error between that point and that centroid to be zero, and we would therefore assign the data point to that centroid. Let's take a look:

Step 4 - Assign centroids

We can use the idea from Step 3 to create a function that helps us assign the data points to corresponding centroids. We will calculate all the errors associated to each centroid, and then pick the one with the lowest value for assignation:
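One way to sketch that assignment function, reusing the rsq_error helper from above (names assumed for illustration):

import numpy as np

def assign_centroid(point, centroids):
    # Compute the error to every centroid and keep the index of the smallest one.
    errors = [rsq_error(point, c) for c in centroids]
    closest = int(np.argmin(errors))
    return closest, errors[closest]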

Let us add some columns to our data containing the centroid assignments and the error incurred. Furthermore, we can use this to update our scatter plot, showing the centroids (denoted with squares) and colouring the observations according to the centroid they have been assigned to:

[Figure: centroid assignments and errors added to the dataframe, with the updated scatter plot]

Let us see the total error by adding all the contributions. We will take a look at this error as a measure of convergence. In other words, if the error does not change, we can assume that the centroids have stabilised their location and we can terminate our iterations. In practice, we need to be mindful of having found a local minimum (outside the scope of this post).

Step 5 - Update centroid location

Now that we have a first attempt at defining our clusters, we need to update the position of the k centroids. We do this by calculating the mean of the positions of the observations assigned to each centroid. Let's take a look:

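A sketch of that update, again assuming attribute columns 'x' and 'y' and a 'centroid' column holding the current assignments:

def update_centroids(df, k):
    # New centroid = mean position of the observations assigned to it
    # (reindex keeps k rows even if a cluster happens to be empty).
    return df.groupby('centroid')[['x', 'y']].mean().reindex(range(k))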

We can verify that the position has been updated. Let us look again at our scatter plot:

Step 6 - Repeat steps 3-5

Now we go back to calculate the distance to each centroid, assign observations and update the centroid location. This calls for a function to encapsulate the loop:
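One possible shape for that loop, built from the sketches above (it stops when the total error stabilises and assumes no cluster ever empties out):

def kmeans_from_scratch(df, k, max_iter=100, tol=1e-6):
    centroids = df[['x', 'y']].sample(n=k).values
    prev_error = None
    for _ in range(max_iter):
        # Assign every observation to its closest centroid and record the error.
        assignments = [assign_centroid(p, centroids) for p in df[['x', 'y']].values]
        df['centroid'], df['error'] = zip(*assignments)
        total_error = df['error'].sum()
        # Move each centroid to the mean of its assigned observations.
        centroids = df.groupby('centroid')[['x', 'y']].mean().values
        # Terminate once the total error no longer changes.
        if prev_error is not None and abs(prev_error - total_error) < tol:
            break
        prev_error = total_error
    return df, centroids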

OK, we are now ready to apply our function. We will clean our dataset first and let the algorithm run:


Let us see the location of the final centroids:

[Figure: final centroid locations in the dataframe]

And in a graphical way, let us see our clusters:

[Figure: final clusters with the updated centroids]

We have seen how to make an initial implementation of the algorithm, but in many cases you may want to stand on the shoulders of giants and use other tried and tested modules to help you with your machine learning work. In this case, Scikit-learn is a good choice and it has a very nice implementation for \(k\)-means. If you want to know more about the algorithm and its evaluation you can take a look at Chapter 5 of Data Science and Analytics with Python where I use a wine dataset for the discussion. 

In this case we will show how k-means can be implemented in a couple of lines of code using the well-known Iris dataset. We can load it directly from Scikit-learn and we will shuffle the data to ensure the points are not listed in any particular order.

We can call the KMeans implementation to instantiate a model and fit it. The parameter n_clusters is the number of clusters \(k\). In the example below we request 3 clusters:
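A couple of lines along these lines would do it (a sketch; the post's exact code may differ slightly):

from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.utils import shuffle

X, y = load_iris(return_X_y=True)
X, y = shuffle(X, y, random_state=42)  # shuffle so the points are not in any particular order

model = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
# model.labels_ holds the cluster assigned to each observation, and
# model.cluster_centers_ holds the coordinates of the final centroids.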

That is it! We can look at the labels that the algorithm has provided as follows:

In order to make a comparison, let us reorder the labels:

We can check how many observations were correctly assigned. We do this with the help of a confusion matrix:

[Figure: confusion matrix comparing the cluster labels with the original labels]

Let us look at the location of the final clusters:

And we can look at some 3D graphics. In this case we will plot the following features:

• Petal width
• Sepal length
• Petal length

As you can see, there are a few observations that differ in colour between the two plots.

[Figure: k-means clusters for the Iris dataset versus the actual labels]

In this post we have explained the ideas behind the \(k\)-means algorithm and provided a simple implementation of these ideas in Python. I hope you agree that it is a very straightforward algorithm to understand and in case you want to use a more robust implementation, Scikit-learn has us covered. Given its simplicity, \(k\)-means is a very popular choice for starting up with clustering analysis. 

To get access to the code/data from this article, please use the Access Project button below.


Dr J Rogel-Salazar

Dr Jesus Rogel-Salazar is a Research Associate in the Photonics Group in the Department of Physics at Imperial College London. He obtained his PhD in quantum atom optics at Imperial College in the group of Professor Geoff New and in collaboration with the Bose-Einstein Condensation Group in Oxford with Professor Keith Burnett. After completing his doctorate in 2003, he took a postdoc in the Centre for Cold Matter at Imperial and moved on to the Department of Mathematics in the Applied Analysis and Computation Group with Professor Jeff Cash.


