Hierarchical Clustering
Hierarchical clustering is a clustering algorithm. Its basic idea is to treat all observations (samples) to be classified as one initial cluster, and then split this cluster level by level according to a chosen clustering criterion, decomposing it into successively smaller subgroups until a termination condition is met.
Edited at 2023-12-23 14:06:33
hierarchical clustering
Introduction
Algorithm idea: partition the data into a hierarchy of levels by some method until a termination condition is met.
Illustration:
Two hierarchical clustering methods
Agglomerative method
Algorithm idea: bottom-up. Start by treating each object as its own cluster, then repeatedly merge clusters into larger and larger ones until all objects are in a single cluster or a termination condition is met.
Algorithm steps
Step 1: Calculate the pairwise distances between all samples
Step 2: Merge the two samples with the smallest distance into one cluster, C1
Step 3: Calculate the distance from every other sample (or cluster) to C1
Distance measurement method between clusters
Method 1: Shortest-distance (single-linkage) method: the minimum distance between samples in cluster Ci and cluster Cj is used as the inter-cluster distance
Method 2: Longest-distance (complete-linkage) method: the maximum distance between samples in cluster Ci and cluster Cj is used as the inter-cluster distance
Method 3: Class-average (average-linkage) method: the mean of the distances between every sample in cluster Ci and every sample in cluster Cj is used as the inter-cluster distance
Method 4: Center (centroid) method: the distance between the center points of cluster Ci and cluster Cj (each center being the mean of the samples in that cluster) is used as the inter-cluster distance
Step 4: Repeat steps 2 and 3 until all objects are in one cluster or a termination condition is met
Illustration:
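The agglomerative steps above can be sketched with SciPy's hierarchical-clustering routines. The sample points below are made up for illustration; the `method` argument selects among the four inter-cluster distance measures listed above.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy 2-D samples (hypothetical data for illustration)
X = np.array([[1.0, 1.0], [1.2, 0.9],
              [5.0, 5.1], [5.2, 4.8],
              [9.0, 9.0]])

# method maps to the four inter-cluster distances above:
# 'single' = shortest distance, 'complete' = longest distance,
# 'average' = class average, 'centroid' = center method
Z = linkage(X, method='single')

# Termination condition: stop merging once the merge distance exceeds 2.0
labels = fcluster(Z, t=2.0, criterion='distance')
print(labels)
```

With this cutoff, the two pairs of nearby points merge first and the isolated point at (9, 9) stays in its own cluster, giving three clusters in total.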
Divisive method
Algorithm idea: top-down. Start by placing all objects in the same cluster, then gradually split it into smaller and smaller clusters until each object forms a cluster of its own or a termination condition is met.
Algorithm steps
Step 1: Place all samples in one cluster, calculate the pairwise distances between samples, and select the two samples that are farthest apart.
Step 2: Split the two farthest samples into two clusters and calculate the distances from the other samples to the two clusters.
The inter-cluster distance measures are exactly the same as in the agglomerative method
Step 3: Assign each remaining sample to the nearer of the two clusters.
Step 4: Repeat steps 2 and 3 until every object forms its own cluster or a termination condition is met.
Illustration:
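A minimal sketch of one divisive split, assuming Euclidean distance. The points are made up for illustration, and `divisive_split` is a hypothetical helper (not a library function) implementing steps 1-3 above for a single split.

```python
import numpy as np

def divisive_split(X, idx):
    """Split one cluster (sample indices idx) around its two farthest samples."""
    pts = X[idx]
    # Step 1: pairwise distances inside the cluster; pick the farthest pair
    d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    i, j = np.unravel_index(np.argmax(d), d.shape)
    seed_a, seed_b = pts[i], pts[j]          # Step 2: seeds of the two new clusters
    # Step 3: assign every sample to the nearer seed
    to_a = (np.linalg.norm(pts - seed_a, axis=1)
            <= np.linalg.norm(pts - seed_b, axis=1))
    return ([k for k, f in zip(idx, to_a) if f],
            [k for k, f in zip(idx, to_a) if not f])

X = np.array([[0.0, 0.0], [0.1, 0.2], [4.0, 4.0], [4.1, 3.9]])
left, right = divisive_split(X, list(range(len(X))))
print(left, right)
```

Repeating this split on each resulting cluster (step 4) yields the full top-down hierarchy.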
Advantages and Disadvantages of Hierarchical Clustering
Advantages
Distance and similarity rules are easy to define
No need to specify the number of clusters in advance
Can reveal the hierarchical relationships between classes
Disadvantages
High computational complexity; not applicable when the amount of data is large
Sensitive to outliers
Prone to producing chain-shaped clusters (the chaining effect)
Optimization
Addresses the problem that hierarchical clustering cannot handle large data sets
Method: use a multi-stage clustering technique that clusters incrementally, greatly reducing clustering time: the BIRCH algorithm
Incremental: each clustering decision is based only on the data points processed so far, not on the full data set.
BIRCH algorithm
Algorithm principle: a clustering feature, a 3-tuple, summarizes the information of a cluster. Clustering is performed by building a clustering feature tree that satisfies the branching-factor and cluster-diameter constraints; each leaf node is a cluster.
several concepts
Clustering Features (CF)
Definition: CF is a triplet, which can be represented by (N, LS, SS). Among them, N represents the number of samples in this CF; LS represents the sum vector of each feature dimension of the sample points in this CF, and SS represents the sum of squares of each feature dimension of the sample points in this CF.
Properties: CFs are additive (linear), that is, CF1 + CF2 = (N1 + N2, LS1 + LS2, SS1 + SS2)
Example: Suppose a certain CF contains 5 two-dimensional feature samples (3,4), (2,6), (4,5), (4,7), (3,8)
CF's N=5
LS of CF = (3+2+4+4+3, 4+6+5+7+8) = (16, 30)
SS of CF = (3^2+2^2+4^2+4^2+3^2) + (4^2+6^2+5^2+7^2+8^2) = 54 + 190 = 244
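The CF triplet for this example can be checked numerically with NumPy, using the five sample points given above:

```python
import numpy as np

# The five 2-D samples from the example above
pts = np.array([[3, 4], [2, 6], [4, 5], [4, 7], [3, 8]])

N = len(pts)            # number of samples in the CF
LS = pts.sum(axis=0)    # linear sum per feature dimension
SS = (pts ** 2).sum()   # scalar sum of squares over all dimensions

print(N, LS, SS)        # 5 [16 30] 244
```

The additivity property also follows directly: summing the (N, LS, SS) triplets of two disjoint subsets of these points component-wise gives the triplet of their union.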
Cluster feature tree (CF-tree)
Definition: Leaf nodes are clusters, and non-leaf nodes store the CF sum of their descendants.
Parameters of CF Tree
Maximum number of CF entries (children) in each non-leaf node: B (branching factor)
Maximum number of CFs contained in each leaf node: L
Maximum radius threshold for each CF in a leaf node: T
CF-tree creation process
Step 1: Read the first sample and place it in a new CF triplet in leaf node LN1
Illustration:
Step 2: Read the second sample. If it lies within the radius-T hypersphere of the previous sample's CF, add it to the same triplet; otherwise, generate a new triplet, LN2.
Illustration:
Step 3: If a new sample is closest to node LN1 but lies outside the radius-T hyperspheres of sc1, sc2, and sc3, a new CF is needed; since L = 3, LN1 is already full and must be split.
Illustration:
Step 4: Among all CF tuples in LN1, find the two farthest apart and use them as the seed CFs of two new leaf nodes; then distribute all the CFs in LN1 (sc1, sc2, sc3), together with the new sample's tuple sc6, between the two new leaf nodes.
Illustration:
Step 5: Repeat steps 2, 3, and 4 until the termination condition is met
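As a usage sketch, scikit-learn ships a `Birch` estimator whose `threshold` and `branching_factor` parameters correspond roughly to the T and B parameters above. The three Gaussian blobs here are synthetic illustration data, not from the text.

```python
import numpy as np
from sklearn.cluster import Birch

rng = np.random.default_rng(0)
# Three well-separated synthetic blobs of 50 points each
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2))
               for c in ([0, 0], [5, 5], [0, 5])])

# threshold ~ radius limit T; branching_factor ~ B
model = Birch(threshold=0.5, branching_factor=50, n_clusters=3)
labels = model.fit_predict(X)
print(len(set(labels.tolist())))
```

Because BIRCH processes samples incrementally through the CF-tree, `Birch` also supports `partial_fit` for streaming data too large to hold in memory at once.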
Advantages and Disadvantages
Advantages
Clustering speed is fast and noise points can be identified
Linear scalability, good clustering quality
Disadvantages
Can only handle numerical data
Sensitive to data input order
Does not work well when clusters are non-spherical