K-means

Principles and extensions of the K-means clustering algorithm. Algorithm idea: given a set of data objects, divide it into K clusters according to the distances between the objects, so that the points within each cluster are as close together as possible and the distance between clusters is as large as possible.

Edited at 2023-12-23 14:03:33

PlotWizard

K-means

Introduction

Algorithm idea: Given a set of data objects, divide it into K clusters according to the distances between the objects, so that the points within each cluster are as close together as possible and the distance between clusters is as large as possible.

Algorithm steps

Step 1: Select the initial centers of K clusters

Step 2: Calculate the distance between each sample and each of the K centers, and assign the sample to the cluster whose center is nearest.

Step 3: Recalculate the center of the cluster (the mean of the samples in the cluster)

Step 4: Repeat steps 2 and 3 until the cluster assignment of every sample no longer changes.
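The four steps above can be sketched in plain Python. This is a minimal illustration, not a tuned implementation: the function name `kmeans`, the `init` and `max_iter` parameters, and the toy data are assumptions made for the example, and Euclidean distance is assumed as the distance measure.

```python
import math
import random


def kmeans(points, k, init=None, max_iter=100, seed=0):
    """Plain K-means on a list of equal-length tuples; returns (centers, labels)."""
    rng = random.Random(seed)
    # Step 1: choose the initial centers (given explicitly, or K random samples).
    centers = list(init) if init is not None else rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Step 2: assign every sample to the cluster with the nearest center.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points
        ]
        # Step 4: stop once no sample changes its cluster.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 3: move each center to the mean of the samples assigned to it.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous center
                centers[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return centers, labels


# Hypothetical toy data: two small groups of 2-D points.
data = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.1)]
centers, labels = kmeans(data, k=2, init=[(0.0, 0.0), (9.0, 9.0)])
print(centers)
print(labels)
```

Passing `init` corresponds to specifying the starting locations by hand; leaving it out falls back to random selection, in which case the result can depend on the seed.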

Several issues to consider with K-means

How is the number of clusters determined?

Method 1: Elbow method (calculate the SSE, the sum of squared errors, of the model at each K value and select the K at the "elbow" of the curve, beyond which increasing K reduces the SSE only slightly)
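As a rough sketch of the elbow method, the code below computes the SSE for K = 1 to 6 and prints how much the SSE drops at each step; the elbow is the K after which the drops become small. The helper `kmeans_sse`, the use of several random restarts, and the three-blob toy data are assumptions made for this illustration.

```python
import math
import random


def kmeans_sse(points, k, n_init=10, max_iter=100, seed=0):
    """Lowest SSE found over n_init random restarts of plain K-means."""
    rng = random.Random(seed)
    best = math.inf
    for _ in range(n_init):
        centers = rng.sample(points, k)
        for _ in range(max_iter):
            # Assign each point to its nearest center.
            groups = [[] for _ in range(k)]
            for p in points:
                nearest = min(range(k), key=lambda j: math.dist(p, centers[j]))
                groups[nearest].append(p)
            # Recompute each center as the mean of its group.
            new_centers = [
                tuple(sum(col) / len(g) for col in zip(*g)) if g else centers[j]
                for j, g in enumerate(groups)
            ]
            if new_centers == centers:
                break
            centers = new_centers
        # SSE: squared distance from every point to its nearest center.
        sse = sum(min(math.dist(p, c) ** 2 for c in centers) for p in points)
        best = min(best, sse)
    return best


# Hypothetical toy data: three well-separated blobs of four points each.
blobs = [
    (0, 0), (0, 1), (1, 0), (1, 1),
    (10, 0), (10, 1), (11, 0), (11, 1),
    (5, 9), (5, 10), (6, 9), (6, 10),
]
sse = {k: kmeans_sse(blobs, k) for k in range(1, 7)}
for k in range(2, 7):
    print(k, round(sse[k], 2), "drop from previous K:", round(sse[k - 1] - sse[k], 2))
```

On data like this one would expect the drop to shrink sharply once K exceeds the number of real groups, but reading the elbow off the curve remains a judgment call.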

Method 2: Silhouette coefficient (calculate the silhouette coefficient of the model under each K value, and select the K value with the largest silhouette coefficient)

Idea: Evaluate a clustering by examining how well separated the clusters are from each other and how compact each cluster is
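A minimal sketch of the silhouette coefficient follows. For each sample, a is the mean distance to the other members of its own cluster (compactness) and b is the smallest mean distance to the members of any other cluster (separation); the sample's score is (b - a) / max(a, b), and the coefficient is the average over all samples. The function name `silhouette_score` and the two candidate labelings are assumptions made for the example; the labelings stand in for what K-means might return at K=3 and K=2.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette over all samples; needs at least two clusters."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for a single-member cluster
            continue
        # a: mean distance to the other members of the same cluster.
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of another cluster.
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


# Hypothetical toy data: three blobs, scored under two candidate labelings.
blobs = [
    (0, 0), (0, 1), (1, 0), (1, 1),
    (10, 0), (10, 1), (11, 0), (11, 1),
    (5, 9), (5, 10), (6, 9), (6, 10),
]
labels_k3 = [0] * 4 + [1] * 4 + [2] * 4   # one cluster per blob
labels_k2 = [0] * 4 + [1] * 8             # two of the blobs merged
s3 = silhouette_score(blobs, labels_k3)
s2 = silhouette_score(blobs, labels_k2)
print("K=3:", round(s3, 3), "K=2:", round(s2, 3))
```

The labeling with the higher coefficient is preferred, which here is the one that keeps the three blobs apart.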

How is the initial center determined?

Method 1: Random selection

Method 2: Specify the location

Method 3: K-means++

Idea: When selecting the initial centers, try to place them as far apart from each other as possible
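One way to spread the initial centers out is K-means++-style seeding, sketched below: the first center is a random sample, and each further center is drawn with probability proportional to its squared distance from the nearest center already chosen, so far-away samples are favored. The function name `kmeans_pp_init` and the toy data are assumptions made for the example, and the sketch assumes the data contains at least K distinct points.

```python
import math
import random


def kmeans_pp_init(points, k, seed=0):
    """Choose k initial centers that tend to lie far apart from each other."""
    rng = random.Random(seed)
    # The first center is picked uniformly at random.
    centers = [rng.choice(points)]
    while len(centers) < k:
        # Squared distance from each sample to its nearest chosen center.
        d2 = [min(math.dist(p, c) ** 2 for c in centers) for p in points]
        # Draw the next center with probability proportional to that distance,
        # so samples far from every existing center are the most likely picks.
        centers.append(rng.choices(points, weights=d2, k=1)[0])
    return centers


# Hypothetical toy data: three blobs of four points each.
blobs = [
    (0, 0), (0, 1), (1, 0), (1, 1),
    (10, 0), (10, 1), (11, 0), (11, 1),
    (5, 9), (5, 10), (6, 9), (6, 10),
]
initial_centers = kmeans_pp_init(blobs, k=3)
print(initial_centers)
```

The selection is random rather than strictly "farthest first", which makes it less likely that an outlier is always picked as a center.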

Advantages and Disadvantages of K-means

Advantages

Simple and efficient even on fairly large data sets: each iteration costs time roughly proportional to the number of samples, and memory use is low.

The algorithm is easy to interpret.

Disadvantages

On very large data sets the repeated distance calculations become slow, and the result can easily get stuck in a local optimum.

K-means is sensitive to the choice of K and to the locations of the initial centers.

K-means is very sensitive to noise and outliers.

The mean cannot be calculated for data sets containing categorical attributes, so the algorithm cannot be applied to them directly.

K-means works well only for roughly spherical clusters.

Optimization of K-means

To solve the problem of slow calculation speed when the data set is too large

Method: Randomly sample the data set multiple times, and cluster each sampled subset using K-means until the cluster center becomes stable (MiniBatchKMeans)

MiniBatchKMeans algorithm steps

Step 1: Random sampling of the sample set

Step 2: Run K-means on the sampled subset to update the cluster centers

Step 3: Repeat steps 1 and 2 until the cluster center becomes stable.
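The mini-batch idea can be sketched as follows. The code uses a per-center running-mean update, which is one common way to fold each mini-batch into the centers; whether it matches the exact rule the outline has in mind is an assumption. For simplicity it runs a fixed number of iterations instead of testing whether the centers have become stable. The names `minibatch_kmeans`, `batch_size`, `n_iter`, and `init` are hypothetical.

```python
import math
import random


def minibatch_kmeans(points, k, batch_size=4, n_iter=200, seed=0, init=None):
    """Mini-batch K-means sketch; returns the list of cluster centers."""
    rng = random.Random(seed)
    centers = [list(c) for c in (init or rng.sample(points, k))]
    counts = [0] * k  # how many samples each center has absorbed so far
    for _ in range(n_iter):
        # Step 1: draw a random mini-batch from the full sample set.
        batch = rng.sample(points, batch_size)
        # Step 2: assign the batch to the current centers, then update them.
        nearest = [
            min(range(k), key=lambda j: math.dist(p, centers[j])) for p in batch
        ]
        for p, j in zip(batch, nearest):
            counts[j] += 1
            eta = 1.0 / counts[j]  # step size shrinks as the center matures
            centers[j] = [(1 - eta) * c + eta * x for c, x in zip(centers[j], p)]
        # Step 3 is the loop itself; a stability check could replace n_iter.
    return [tuple(c) for c in centers]


# Hypothetical toy data: two small groups of 2-D points.
data = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.1)]
mb_centers = minibatch_kmeans(data, k=2, init=[(0.0, 0.0), (9.0, 9.0)])
print(mb_centers)
```

Because each iteration touches only `batch_size` samples, the cost per update no longer grows with the size of the full data set, at the price of slightly noisier centers.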

For the problem that the mean cannot be calculated when the attribute is of categorical type

Method: Replace the mean with the mode of each attribute when updating the cluster centers (K-modes)
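A small sketch of the mode-based variant: the distance between two records is taken to be the number of attributes on which they differ (simple matching), and each cluster center is updated to the most frequent value of every attribute among its members. The function names, the `init_modes` parameter, and the toy records are assumptions made for the example.

```python
from collections import Counter


def mismatches(a, b):
    """Number of attributes on which two categorical records differ."""
    return sum(x != y for x, y in zip(a, b))


def kmodes(records, init_modes, max_iter=100):
    """K-means-like loop that uses per-attribute modes instead of means."""
    modes = [tuple(m) for m in init_modes]
    labels = None
    for _ in range(max_iter):
        # Assign each record to the mode it disagrees with least.
        new_labels = [
            min(range(len(modes)), key=lambda j: mismatches(r, modes[j]))
            for r in records
        ]
        if new_labels == labels:
            break
        labels = new_labels
        # Update each mode attribute by attribute with the most common value.
        for j in range(len(modes)):
            members = [r for r, lab in zip(records, labels) if lab == j]
            if members:
                modes[j] = tuple(
                    Counter(col).most_common(1)[0][0] for col in zip(*members)
                )
    return modes, labels


# Hypothetical categorical records: (color, size, shape).
records = [
    ("red", "small", "round"),
    ("red", "small", "oval"),
    ("red", "medium", "round"),
    ("blue", "large", "square"),
    ("blue", "large", "round"),
    ("green", "large", "square"),
]
modes, mode_labels = kmodes(records, init_modes=[records[0], records[3]])
print(modes)
print(mode_labels)
```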

For data sets where it is difficult to determine the number of clusters K

Method: Calculate the cluster center through the mean value of the samples in a given area, and continuously update the cluster center until the cluster center becomes stable (Mean-Shift)

Mean-Shift algorithm steps

Step 1: Randomly select a sample point and calculate the mean of the offset vectors from it to the other sample points within the given area (the mean vector).

Step 2: Move the point by the mean vector, then calculate the mean vector again at its new position; repeat until the norm of the mean vector is small enough or the point can no longer be moved.

Step 3: Repeat steps 1 and 2 until all sample points are traversed
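The steps above can be sketched as follows, assuming the "given area" is a ball of radius `bandwidth` around the current position and that every neighbor counts equally. Points whose final positions land close together are treated as one cluster, so K never has to be chosen. The names `shift_point`, `mean_shift`, `bandwidth`, and `tol`, and the merge threshold of half the bandwidth, are assumptions made for the example.

```python
import math


def shift_point(x, points, bandwidth, tol=1e-6, max_iter=300):
    """Move x along the mean vector until it (almost) stops moving."""
    x = tuple(x)
    for _ in range(max_iter):
        neighbors = [p for p in points if math.dist(p, x) <= bandwidth]
        if not neighbors:
            break
        # Mean vector: average offset from x to the neighbors in the ball.
        shift = tuple(
            sum(p[d] - x[d] for p in neighbors) / len(neighbors)
            for d in range(len(x))
        )
        x = tuple(xi + si for xi, si in zip(x, shift))
        if math.hypot(*shift) < tol:
            break
    return x


def mean_shift(points, bandwidth):
    """Shift every sample, then merge end positions that nearly coincide."""
    centers, labels = [], []
    for p in points:
        end = shift_point(p, points, bandwidth)
        for j, c in enumerate(centers):
            if math.dist(end, c) < bandwidth / 2:
                labels.append(j)
                break
        else:
            centers.append(end)
            labels.append(len(centers) - 1)
    return centers, labels


# Hypothetical toy data: two small groups of 2-D points.
data = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.1)]
ms_centers, ms_labels = mean_shift(data, bandwidth=2.0)
print(ms_centers)
print(ms_labels)
```

The number of clusters found depends on `bandwidth`: a larger radius merges groups, a smaller one splits them.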

Mean-Shift optimization

In the basic calculation of the mean vector, every sample point counts equally, so the different contributions that points at different distances should make to the current sample point are not considered.

Use a Gaussian kernel function to weight the contribution of each sample point to the current sample point, so that nearer points count more.
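A sketch of the kernel-weighted mean vector, assuming the usual Gaussian weight exp(-d^2 / (2 h^2)) for a sample at distance d from the current position, with h as the bandwidth. The names `gaussian_shift`, `gaussian_mode`, and `bandwidth` are assumptions made for the example.

```python
import math


def gaussian_shift(x, points, bandwidth):
    """Mean vector at x with every sample weighted by a Gaussian kernel."""
    weights = [
        math.exp(-math.dist(p, x) ** 2 / (2 * bandwidth ** 2)) for p in points
    ]
    total = sum(weights)  # assumed positive, i.e. x is not far from all samples
    return tuple(
        sum(w * (p[d] - x[d]) for w, p in zip(weights, points)) / total
        for d in range(len(x))
    )


def gaussian_mode(x, points, bandwidth, tol=1e-8, max_iter=500):
    """Follow the weighted mean vector from x until it becomes negligible."""
    x = tuple(x)
    for _ in range(max_iter):
        shift = gaussian_shift(x, points, bandwidth)
        x = tuple(xi + si for xi, si in zip(x, shift))
        if math.hypot(*shift) < tol:
            break
    return x


# Hypothetical toy data: two small groups of 2-D points.
data = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.1)]
first_shift = gaussian_shift((1.0, 1.0), data, bandwidth=1.0)
mode = gaussian_mode((1.0, 1.0), data, bandwidth=1.0)
print(first_shift)
print(mode)
```

Unlike the equal-weight version, no hard cutoff radius is needed here: distant samples still enter the sum, but their weights are close to zero.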