KNN algorithm

If most of the K samples nearest to a given sample in the feature space belong to one category, the sample is assigned to that category too. This outline covers how to choose K and how kd trees speed up the neighbor search.

Edited at 2023-03-30 20:41:06

PlotWizard

KNN algorithm

definition

If most of the K samples nearest to a given sample in the feature space belong to one category, the sample is assigned to that category too.
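The definition can be sketched in a few lines of plain Python. This is a minimal illustration, not the scikit-learn implementation; the function name `knn_predict` and the toy points are made up for the example, and Euclidean distance is assumed.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Label `query` by majority vote among its k nearest training points."""
    # Sort training indices by Euclidean distance to the query.
    order = sorted(range(len(train_points)),
                   key=lambda i: math.dist(train_points[i], query))
    # Count the labels of the k closest points and return the most common one.
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# Hypothetical toy data: two well-separated clusters.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
labels = ["a", "a", "a", "b", "b", "b"]

prediction_near_a = knn_predict(points, labels, (1.1, 1.0), k=3)
prediction_near_b = knn_predict(points, labels, (5.1, 5.0), k=3)
```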

API

KNeighborsClassifier(n_neighbors=5,algorithm="auto")

n_neighbors sets the value of K (default 5)

algorithm selects how neighbors are searched: "ball_tree", "kd_tree", "brute" (brute-force search), or "auto" (let the library choose)
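A minimal usage sketch of this API, assuming scikit-learn is installed. The one-dimensional toy data is invented for the illustration.

```python
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical toy data: one feature, two classes far apart.
X = [[0], [1], [2], [10], [11], [12]]
y = [0, 0, 0, 1, 1, 1]

clf = KNeighborsClassifier(n_neighbors=3, algorithm="auto")
clf.fit(X, y)                      # KNN only stores the training data here
predictions = clf.predict([[1.5], [10.5]])
```

Swapping `algorithm="auto"` for `"kd_tree"` or `"brute"` changes how neighbors are found, not which neighbors are found.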

distance

Euclidean distance

Manhattan distance

Chebyshev distance

Minkowski distance

Standardized Euclidean distance

cosine distance

Hamming distance

Jaccard distance

Mahalanobis distance
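Several of the distances listed above can be written directly from their usual textbook definitions. The sketch below uses only the standard library; the function names are chosen for this example. Standardized Euclidean and Mahalanobis distance are left out because they also need per-feature variances or a covariance matrix estimated from data.

```python
import math

def minkowski(a, b, p):
    """Minkowski distance; p=1 gives Manhattan, p=2 gives Euclidean."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def chebyshev(a, b):
    # Limit of the Minkowski distance as p grows: the largest coordinate gap.
    return max(abs(x - y) for x, y in zip(a, b))

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def hamming(a, b):
    # Number of positions at which two equal-length sequences differ.
    return sum(x != y for x, y in zip(a, b))

def jaccard_distance(set_a, set_b):
    return 1.0 - len(set_a & set_b) / len(set_a | set_b)
```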

Selection of K value

K too small: the prediction is easily swayed by outliers (overfitting)

K too large: the prediction is dominated by the majority class when samples are imbalanced (underfitting)

Approximation error: the error on the training set. Driving it very low tends to overfit: the model performs well on the training set but poorly on the test set

Estimation error: the error on the test set. A smaller value indicates better prediction on unseen data
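The effect of K can be seen on a tiny example. The one-dimensional data below is invented: class "a" sits near 0 to 2, class "b" near 9 to 11, and a single mislabeled-looking "b" outlier sits at 1.0 inside the "a" region. The helper `knn_predict_1d` is a hypothetical name for this sketch.

```python
from collections import Counter

def knn_predict_1d(values, labels, query, k):
    """Majority vote among the k training values closest to `query`."""
    order = sorted(range(len(values)), key=lambda i: abs(values[i] - query))
    return Counter(labels[i] for i in order[:k]).most_common(1)[0][0]

values = [0.0, 0.8, 1.5, 2.0, 1.0, 9.0, 9.5, 10.0, 10.5, 11.0]
labels = ["a", "a", "a", "a", "b", "b", "b", "b", "b", "b"]
query = 1.1  # lies in the middle of the "a" region

pred_k1 = knn_predict_1d(values, labels, query, k=1)    # only the outlier votes
pred_k3 = knn_predict_1d(values, labels, query, k=3)    # outlier is outvoted
pred_k10 = knn_predict_1d(values, labels, query, k=10)  # every sample votes
```

With k=1 the outlier decides the answer, with k=3 it is outvoted, and with k equal to the number of samples the larger class wins no matter where the query lies.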

kd tree

Construction of kd tree

nearest neighbor search

Search within the region that contains the query point

Search across neighboring regions when they could hold a closer point
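A compact sketch of kd tree construction and nearest neighbor search, using only the standard library. It assumes median splits on alternating axes and Euclidean distance; the dictionary-based node layout and the function names are choices made for this example, not the layout any particular library uses.

```python
import math

def build_kdtree(points, depth=0):
    """Recursively split the points at the median along alternating axes."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {
        "point": points[mid],
        "axis": axis,
        "left": build_kdtree(points[:mid], depth + 1),
        "right": build_kdtree(points[mid + 1:], depth + 1),
    }

def nearest(node, query, best=None):
    """Return the stored point closest to `query`."""
    if node is None:
        return best
    if best is None or math.dist(node["point"], query) < math.dist(best, query):
        best = node["point"]
    axis = node["axis"]
    diff = query[axis] - node["point"][axis]
    near, far = ((node["left"], node["right"]) if diff < 0
                 else (node["right"], node["left"]))
    # Step 1: search the side of the splitting plane that contains the query.
    best = nearest(near, query, best)
    # Step 2: cross to the other side only if the plane is closer than the
    # best point found so far, since only then can that side hold a closer one.
    if abs(diff) < math.dist(best, query):
        best = nearest(far, query, best)
    return best

points = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
tree = build_kdtree(points)
closest = nearest(tree, (3, 4.5))
```

The two commented steps correspond to searching within the query's own region first and then checking other regions only when necessary.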

Case: Prediction of flower species

Obtaining data: sklearn.datasets

Small datasets: sklearn.datasets.load_*() (bundled locally)

Large datasets: sklearn.datasets.fetch_*() (downloaded over the network)

Introduction to data set return values: The return value is a Bunch (a dictionary-like object)

data: feature data array

target: label (target) array

DESCR: data description

feature_names: feature names

target_names: label names
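The fields above can be inspected as follows. This sketch assumes the flower data is scikit-learn's bundled iris dataset loaded with `load_iris`, which the outline itself does not name.

```python
from sklearn.datasets import load_iris

iris = load_iris()  # assumption: iris stands in for the "flower species" data

feature_shape = iris.data.shape          # (n_samples, n_features)
label_shape = iris.target.shape          # one integer label per sample
feature_names = list(iris.feature_names)
label_names = list(iris.target_names)
description_start = iris.DESCR[:60]      # DESCR is a long text description

# A Bunch supports both attribute access and dictionary-style access.
same_object = iris["data"] is iris.data
```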

Data visualization: seaborn.lmplot(x=, y=, data=, hue="target column", fit_reg=False); fit_reg=False turns off the regression line fit

Data set division: sklearn.model_selection.train_test_split(x, y, test_size=0.2, random_state=22), where x is the feature values, y the target values, test_size the fraction of samples held out as the test set, and random_state the random seed

Return values: x_train, x_test, y_train, y_test
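A short sketch of the split, again assuming the iris dataset as a stand-in for the flower data. The values `test_size=0.2` and `random_state=22` are taken from the outline.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()  # assumption: iris stands in for the "flower species" data

# Features and labels are passed positionally; the return order is
# train features, test features, train labels, test labels.
x_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=22
)
```

Fixing `random_state` makes the split reproducible, so results from different runs can be compared.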

feature engineering

Feature preprocessing

Normalization

Definition: Map the original data into a given range, by default [0, 1]

API

sklearn.preprocessing.MinMaxScaler(feature_range=(0,1))

fit_transform(X) returns an array of the same shape as X

Summary: The maximum and minimum values are easily distorted by outliers, so this method is not robust and suits only small, precise datasets.
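A small sketch of min-max scaling, assuming NumPy and scikit-learn are available. The array is invented; the second column contains one deliberately large value to show the outlier sensitivity mentioned above.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical data: the value 1000.0 in the second column acts as an outlier.
X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 1000.0]])

scaled = MinMaxScaler(feature_range=(0, 1)).fit_transform(X)

# The same result by hand: (x - min) / (max - min), column by column.
manual = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
```

In the second column the outlier takes the value 1 and squeezes the two ordinary values close to 0, which is why the method is described as not robust.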

Standardization

Definition: Transform the original data so that each feature has a mean of 0 and a standard deviation of 1

API

StandardScaler()

fit_transform(X)

Summary: Outliers have less impact on the results, so this method suits modern, noisy data.
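A matching sketch for standardization, assuming NumPy and scikit-learn are available. The array is invented for the illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]])  # hypothetical data

standardized = StandardScaler().fit_transform(X)

# The same result by hand: (x - mean) / std, column by column.
# NumPy's default std (population form) is what StandardScaler uses.
manual = (X - X.mean(axis=0)) / X.std(axis=0)

column_means = standardized.mean(axis=0)
column_stds = standardized.std(axis=0)
```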

Summary

Advantages: simple and effective, cheap to retrain, suitable for samples whose class domains overlap, and suitable for automatic classification of large sample sets

Disadvantages: lazy learning, class scores are not normalized, weak on imbalanced samples, and computationally expensive at prediction time

machine learning

Get dataset

Basic data processing

Data splitting

feature engineering

Feature preprocessing

Normalization

Standardization

Machine learning (model training)

Model evaluation

Cross-validation

Definition: Split the available data into a training set and a test set, and leave the test set untouched. Then divide the training set into several equal parts; each part takes a turn as the validation set while the rest are used for training. With k parts this is called k-fold cross-validation.

Purpose: It does not by itself improve test accuracy; it makes the estimate of model performance more reliable
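The fold rotation can be sketched without any library. The generator name `k_fold_indices` is hypothetical, and this version does not shuffle the samples, which real implementations often offer as an option.

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train_indices, validation_indices) once per fold."""
    indices = list(range(n_samples))
    fold_size = n_samples // n_folds
    for fold in range(n_folds):
        start = fold * fold_size
        # The last fold absorbs any remainder when the split is not exact.
        stop = start + fold_size if fold < n_folds - 1 else n_samples
        validation = indices[start:stop]
        train = indices[:start] + indices[stop:]
        yield train, validation

folds = list(k_fold_indices(n_samples=10, n_folds=5))
```

Each sample lands in the validation part exactly once, so the averaged score uses all of the training data for validation without ever touching the test set.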

grid search

Some hyperparameters have no obvious best value in advance. Grid search trains and evaluates the model for each candidate hyperparameter combination, usually with cross-validation, and returns the combination that performs best.

API

GridSearchCV(estimator=estimator object, param_grid=dictionary mapping hyperparameter names to candidate values, cv=number of cross-validation folds)

return value

best_estimator_: the best model

best_params_: the parameter combination that gave the best result

best_score_: mean cross-validated score of the best combination

cv_results_: cross-validation results for every parameter combination
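The whole workflow in this outline can be tied together in one sketch. It assumes the iris dataset as the flower data, standardization as the preprocessing step, and an arbitrary candidate list for K with 3-fold cross-validation; none of those specific choices come from the outline.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# 1. Get the dataset and split it.
iris = load_iris()
x_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=22
)

# 2. Feature preprocessing: fit the scaler on the training set only,
#    then apply the same transformation to the test set.
scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)

# 3. Grid search over K with cross-validation.
candidate_k = [1, 3, 5, 7, 9]  # assumed candidates for illustration
search = GridSearchCV(
    estimator=KNeighborsClassifier(),
    param_grid={"n_neighbors": candidate_k},
    cv=3,
)
search.fit(x_train, y_train)

# 4. Model evaluation.
best_k = search.best_params_["n_neighbors"]
cv_score = search.best_score_
test_accuracy = search.score(x_test, y_test)
n_combinations = len(search.cv_results_["params"])
```

`search.score` evaluates `best_estimator_` on the held-out test set, which was never used while choosing K.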