Explanation of big data algorithm development plan
An explanation of the big data algorithm development plan, covering the reservoir sampling algorithm, the difference between the leaky bucket (funnel) and token bucket rate-limiting algorithms, the consistent hash algorithm, and more.
Edited at 2022-11-16 10:59:21
Explanation of big data algorithm development plan
LRU
Least Recently Used: a time-based algorithm that evicts the entry that has gone the longest without being accessed.
Three implementations
1. Use an array to store the data and mark each item with an access timestamp. Each time a new item is inserted, first increment the timestamps of the items already in the array, then set the new item's timestamp to 0 and insert it. Each time an item is accessed, its timestamp is reset to 0. When the array is full, the item with the largest timestamp is evicted.
2. Use a linked list. Each time new data is inserted, it is placed at the head of the list; on a cache hit (i.e., the data is accessed), the node is moved to the head; when the list is full, the data at the tail of the list is discarded.
3. Use a linked list plus a HashMap, maintaining both at once; the map is used for index lookup. When a new item needs to be inserted, if it already exists in the linked list (generally called a hit), its node is moved to the head of the list; if not, a new node is created and placed at the head. If the cache is full, the last node of the list is deleted. When data is accessed, if it exists in the list its node is moved to the head, otherwise -1 is returned. This way the node at the tail of the list is always the least recently accessed item. (A sketch of this approach follows below.)
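A minimal sketch of implementation 3, assuming a fixed capacity passed to the constructor; the class and method names here are illustrative, not taken from the source:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of LRU implementation 3: a HashMap index plus a doubly linked list ordered from
// most recently used (just after the head sentinel) to least recently used (just before the tail).
class LinkedLruCache<K, V> {
    private class Node { K key; V value; Node prev, next; }

    private final int capacity;
    private final Map<K, Node> index = new HashMap<>();
    private final Node head = new Node();   // sentinel on the most-recently-used side
    private final Node tail = new Node();   // sentinel on the least-recently-used side

    LinkedLruCache(int capacity) {
        this.capacity = capacity;
        head.next = tail;
        tail.prev = head;
    }

    V get(K key) {
        Node n = index.get(key);
        if (n == null) return null;         // miss (the text's "-1" case)
        moveToHead(n);                      // hit: refresh recency
        return n.value;
    }

    void put(K key, V value) {
        Node n = index.get(key);
        if (n != null) {                    // hit: update value and move to the head
            n.value = value;
            moveToHead(n);
            return;
        }
        if (index.size() >= capacity) {     // full: evict the node just before the tail sentinel
            Node lru = tail.prev;
            unlink(lru);
            index.remove(lru.key);
        }
        n = new Node();
        n.key = key;
        n.value = value;
        index.put(key, n);
        linkAfterHead(n);
    }

    private void moveToHead(Node n) { unlink(n); linkAfterHead(n); }

    private void unlink(Node n) { n.prev.next = n.next; n.next.prev = n.prev; }

    private void linkAfterHead(Node n) {
        n.next = head.next;
        n.prev = head;
        head.next.prev = n;
        head.next = n;
    }
}
```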
Java implementation
LinkedHashMap
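A minimal sketch of the LinkedHashMap route; the class name and the capacity parameter are illustrative:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// LRU cache built on LinkedHashMap's access-order mode: get() moves an entry to the end of the
// iteration order, and removeEldestEntry() is consulted after each insertion.
class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int capacity;

    LruCache(int capacity) {
        super(16, 0.75f, true);   // accessOrder = true
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > capacity; // evict the least recently used entry when over capacity
    }
}
```

This is the shortest idiomatic Java LRU; it is not thread-safe, so wrap it with Collections.synchronizedMap or add locking for concurrent use.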
Shortcomings
When there is hot data, LRU is very efficient, but occasional or periodic batch operations cause the LRU hit rate to drop sharply and lead to serious cache pollution.
LRU-K algorithm
The main purpose of LRU-K is to solve the "cache pollution" problem of the LRU algorithm. Its core idea is to extend the admission criterion from "used once recently" to "used K times recently".
Implementation
LRU-K needs to maintain an additional queue to record the history of all cached data being accessed.
Only when the number of accesses to the data reaches K is the data put into the cache. When data needs to be evicted, LRU-K evicts the entry whose K-th most recent access is furthest in the past.
When data is accessed for the first time, it is added to the history list. If the data does not reach K accesses while in the history list, it is evicted from that list according to some policy (FIFO or LRU). (A simplified sketch follows below.)
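A simplified sketch of LRU-K under these assumptions: the access-history queue is evicted FIFO, and, to keep the sketch short, the admitted cache is evicted by plain LRU rather than by true K-th-most-recent-access distance; K and both capacities are illustrative constructor parameters.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Simplified LRU-K: keys are only admitted to the cache once they have been seen K times.
class LruKCache<K, V> {
    private final int k;
    private final Map<K, Integer> history;  // key -> access count, FIFO eviction
    private final Map<K, V> cache;          // admitted entries, plain LRU eviction (simplification)

    LruKCache(int k, int cacheCapacity, int historyCapacity) {
        this.k = k;
        this.history = new LinkedHashMap<K, Integer>(16, 0.75f, false) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, Integer> e) { return size() > historyCapacity; }
        };
        this.cache = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> e) { return size() > cacheCapacity; }
        };
    }

    V get(K key) {
        V v = cache.get(key);                                // refreshes recency when present
        if (v == null) history.merge(key, 1, Integer::sum);  // record the access in the history queue
        return v;
    }

    void put(K key, V value) {
        if (cache.containsKey(key)) { cache.put(key, value); return; }
        int seen = history.merge(key, 1, Integer::sum);
        if (seen >= k) {                                     // reached K accesses: admit to the cache
            history.remove(key);
            cache.put(key, value);
        }
    }
}
```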
2Q (an optimization of LRU-2)
2Q changes the access history queue of the LRU-2 algorithm (note that this queue is not used to cache data) into a FIFO cache queue; that is, the 2Q algorithm has two cache queues, one FIFO queue and one LRU queue.
When data is accessed for the first time, the 2Q algorithm caches it in the FIFO queue. When the data is accessed a second time, it is moved from the FIFO queue to the LRU queue. The two queues each evict data according to their own policy.
Avoids the cache pollution caused by batch queries. (A minimal sketch follows below.)
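A minimal 2Q sketch under these assumptions: the two queue capacities are constructor parameters, and the "ghost" history of evicted first-time entries kept by the full 2Q algorithm is omitted for brevity.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// 2Q sketch: a FIFO queue for data seen once and an LRU queue for data seen at least twice.
class TwoQueueCache<K, V> {
    private final LinkedHashMap<K, V> fifo;  // first-time entries, evicted in insertion order
    private final LinkedHashMap<K, V> lru;   // promoted entries, evicted by least recent use

    TwoQueueCache(int fifoCapacity, int lruCapacity) {
        this.fifo = new LinkedHashMap<K, V>(16, 0.75f, false) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> e) { return size() > fifoCapacity; }
        };
        this.lru = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> e) { return size() > lruCapacity; }
        };
    }

    V get(K key) {
        V v = lru.get(key);                  // hit in the LRU queue refreshes its recency
        if (v != null) return v;
        v = fifo.remove(key);                // second access: promote from the FIFO queue
        if (v != null) lru.put(key, v);
        return v;
    }

    void put(K key, V value) {
        if (lru.containsKey(key)) { lru.put(key, value); return; }
        fifo.put(key, value);                // first access: data only enters the FIFO queue
    }
}
```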
Multi Queue
The MQ algorithm divides data into multiple queues based on access frequency, and different queues have different priorities. The core idea is to preferentially cache data that has been accessed more often. In the algorithm's structure, Q0, Q1, ..., Qk represent queues of different priorities, and Q-history is a queue that holds data evicted from the cache, recording only the data's index and access count.
Newly inserted data is put into Q0, and each queue is managed by LRU. When a data item's access count reaches the threshold for a higher priority, it is removed from its current queue and added to the head of the next higher queue. To keep stale high-priority data from never being evicted, data that is not accessed within a specified time has its priority lowered: it is removed from its current queue and added to the head of the next lower queue. When eviction is needed, it starts from the lowest-level queue and proceeds by LRU; the evicted data is removed from the cache and its index is added to the head of Q-history. If data recorded in Q-history is accessed again, its priority is recalculated and it is moved to the head of the target queue. Q-history itself evicts indexes by LRU. (A simplified sketch follows below.)
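A simplified MQ sketch. The number of levels, the promotion threshold and the Q-history capacity are illustrative assumptions; time-based demotion is omitted, and eviction happens within the level that overflows rather than strictly from the lowest queue.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Simplified Multi Queue (MQ) cache with per-level LRU queues and a Q-history of evicted keys.
class MQCache<K, V> {
    private static final int LEVELS = 4;          // Q0..Q3, higher index = higher priority
    private static final int PROMOTE_EVERY = 2;   // promote after this many accesses at a level
    private final int capacityPerLevel;
    private final List<LinkedHashMap<K, V>> queues = new ArrayList<>();
    private final Map<K, Integer> level = new HashMap<>();   // which queue each cached key lives in
    private final Map<K, Integer> hits = new HashMap<>();    // access count per cached key
    private final LinkedHashMap<K, Integer> qHistory;        // evicted key -> access count (LRU)

    MQCache(int capacityPerLevel, int historyCapacity) {
        this.capacityPerLevel = capacityPerLevel;
        for (int i = 0; i < LEVELS; i++) {
            queues.add(new LinkedHashMap<K, V>(16, 0.75f, true));   // access order = LRU per level
        }
        this.qHistory = new LinkedHashMap<K, Integer>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, Integer> e) { return size() > historyCapacity; }
        };
    }

    V get(K key) {
        Integer lvl = level.get(key);
        if (lvl == null) return null;
        V value = queues.get(lvl).get(key);               // refreshes recency within its queue
        int count = hits.merge(key, 1, Integer::sum);
        if (count % PROMOTE_EVERY == 0 && lvl < LEVELS - 1) {
            queues.get(lvl).remove(key);                  // promote: move to the next higher queue
            insertAt(lvl + 1, key, value);
        }
        return value;
    }

    void put(K key, V value) {
        Integer lvl = level.get(key);
        if (lvl != null) { queues.get(lvl).put(key, value); return; }   // already cached: update
        Integer remembered = qHistory.remove(key);        // re-seen after eviction: restore priority
        int start = remembered == null ? 0 : Math.min(remembered / PROMOTE_EVERY, LEVELS - 1);
        hits.put(key, remembered == null ? 0 : remembered);
        insertAt(start, key, value);
    }

    private void insertAt(int lvl, K key, V value) {
        LinkedHashMap<K, V> q = queues.get(lvl);
        q.put(key, value);
        level.put(key, lvl);
        if (q.size() > capacityPerLevel) {                // this level is full: evict its LRU entry
            Iterator<Map.Entry<K, V>> it = q.entrySet().iterator();
            Map.Entry<K, V> eldest = it.next();
            it.remove();
            level.remove(eldest.getKey());
            Integer count = hits.remove(eldest.getKey());
            qHistory.put(eldest.getKey(), count == null ? 0 : count);   // keep only index + count
        }
    }
}
```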
FIFO
First-in-first-out page replacement algorithm
Maintain a linked list of all pages currently in memory, with the most recently loaded page at the tail and the earliest loaded page at the head. When a page fault occurs, the page at the head of the list is evicted and the newly loaded page is appended to the tail.
This algorithm does not track how often entries are used; it only considers when they first entered the cache, and whichever entry entered earliest is discarded first.
This algorithm may therefore discard elements that are used frequently. LRU is in fact an optimization of FIFO: whenever an element is accessed again, it is moved back to the "most recently used" end of the list. (A minimal sketch follows below.)
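A minimal FIFO cache sketch, reusing LinkedHashMap but in insertion-order mode; the class name and capacity parameter are illustrative:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// FIFO cache: the entry that entered earliest is evicted first, regardless of how often it is used.
class FifoCache<K, V> extends LinkedHashMap<K, V> {
    private final int capacity;

    FifoCache(int capacity) {
        super(16, 0.75f, false);  // accessOrder = false: iteration order = insertion order
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > capacity; // evict the oldest inserted entry when over capacity
    }
}
```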
NRU
Not Recently Used page replacement algorithm
This algorithm relies on a status (reference) bit for each element; each cleanup pass evicts only the elements that have not been accessed.
A timer mechanism is also needed to periodically clear the elements' mark bits, resetting them to "not accessed". (A minimal sketch follows below.)
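A minimal NRU sketch, assuming some external timer calls resetReferenceBits() periodically; all names are illustrative:

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// NRU sketch: a reference bit per key, set on access, cleared periodically by a timer;
// eviction removes only the entries whose bit is not set.
class NRUCache<K, V> {
    private final Map<K, V> data = new ConcurrentHashMap<>();
    private final Set<K> referenced = ConcurrentHashMap.newKeySet(); // "recently used" mark bits

    V get(K key) {
        V v = data.get(key);
        if (v != null) referenced.add(key);   // set the reference bit on access
        return v;
    }

    void put(K key, V value) {
        data.put(key, value);
        referenced.add(key);
    }

    // Evict only entries whose reference bit is not set (i.e. not recently used).
    void evictNotRecentlyUsed() {
        data.keySet().removeIf(k -> !referenced.contains(k));
    }

    // Called by a timer: clear all reference bits so usage is measured per period.
    void resetReferenceBits() {
        referenced.clear();
    }
}
```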
reservoir sampling algorithm
Given a data stream whose length N is very large and unknown until all the data has been processed, how can we randomly select m non-duplicate items while traversing the data only once (O(N))?
Features
The data stream length N is very large and unknown, so it cannot be stored in memory at once.
The time complexity is O(N).
m numbers are randomly selected, and the probability of each number being selected is m/N.
Ideas
If the amount of received data is less than m, it is put into the reservoir in sequence.
When the i-th item is received (items numbered from 0) and i >= m, draw a random integer d in the range [0, i]. If d falls in [0, m-1], the newly received i-th item replaces the d-th item in the reservoir. (A sketch appears after these steps.)
Repeat step 2.
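A minimal sketch of the single-machine reservoir algorithm described above; the class and method names are illustrative:

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

// Reservoir sampling over a stream of unknown length: each item ends up selected with probability m/N.
public class ReservoirSampler {
    // Returns m items sampled uniformly from the stream (fewer if the stream is shorter than m).
    static <T> List<T> sample(Iterator<T> stream, int m) {
        List<T> reservoir = new ArrayList<>(m);
        int i = 0; // number of items seen so far
        while (stream.hasNext()) {
            T item = stream.next();
            if (i < m) {
                reservoir.add(item);                                 // first m items go straight in
            } else {
                int d = ThreadLocalRandom.current().nextInt(i + 1);  // uniform in [0, i]
                if (d < m) reservoir.set(d, item);                   // keep new item with probability m/(i+1)
            }
            i++;
        }
        return reservoir;
    }
}
```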
Derivation
The probability that the i-th received item finally stays in the reservoir = the probability that the i-th item enters the reservoir * the probability that it is not replaced afterwards (i.e., not replaced during the processing of any later item, up to the N-th).
Idea: prove separately, for i <= m and for i > m, the probability of entering the reservoir and the probability of not being replaced afterwards. The key case is proved below.
When i <= m, the item enters the reservoir with probability 1, and replacement only begins when the (m+1)-th item is received. The probability that the (m+1)-th processing replaces some item in the pool is m/(m+1), and the probability that the replaced item is the i-th one is 1/m, so the probability that the (m+1)-th processing replaces the i-th item is (m/(m+1)) * (1/m) = 1/(m+1), and the probability that it is not replaced is 1 - 1/(m+1) = m/(m+1). Likewise, the probability that the (m+2)-th processing does not replace the i-th item is (m+1)/(m+2), ..., and the probability that the N-th processing does not replace it is (N-1)/N. Therefore, the probability that the i-th item is never replaced afterwards = m/(m+1) * (m+1)/(m+2) * ... * (N-1)/N = m/N.
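A completing note for the case i > m, which the outline mentions but does not expand: the i-th item enters the reservoir with probability m/i, and the probability that it is never replaced afterwards is i/(i+1) * (i+1)/(i+2) * ... * (N-1)/N = i/N, so its overall probability of staying is (m/i) * (i/N) = m/N.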
distributed reservoir
Idea: 1. Split the N items across k machines, for example by hash. 2. Each machine runs the single-machine reservoir and keeps m values. 3. Randomly pick a number from [1, N]; if it falls on machine 1's share (when hashing is used, the owning machine can be computed directly), pick one value from that machine's reservoir with probability 1/m.
Assume there are K machines and the large data set is divided into K data streams. Each machine uses the stand-alone reservoir algorithm to sample its data stream, keeping m items, and finally records the amount of data it processed as N1, N2, ..., Nk, ..., NK (assuming m < Nk), with N1 + N2 + ... + NK = N.
Draw a random number d from [1, N]. If d <= N1, select one item from the first machine's reservoir with equal probability (1/m), without replacement; if N1 < d <= (N1 + N2), select one item from the second machine's reservoir in the same way; and so on. Repeat this m times to finally select m items from the full data set of N. (A sketch of this merge step appears after the derivation below.)
Derivation
The probability that a given item on the k-th machine ends up in that machine's reservoir is m/Nk.
The probability that one selection draws from the k-th machine's reservoir is Nk/N.
Given that the k-th machine's reservoir is chosen, the probability that a particular item in it is selected is 1/m (equal probability when selecting without replacement).
Repeating the selection m times, the probability that any given item is ultimately selected is m * (m/Nk * Nk/N * 1/m) = m/N.
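A minimal sketch of the merge step described above, assuming each machine has already produced a reservoir of m samples and reported how many items it processed; the names are illustrative and m <= Nk is assumed:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

// Merge step of distributed reservoir sampling: pick a machine in proportion to how much data it
// processed, then take one of its reservoir samples uniformly, without replacement; repeat m times.
class DistributedReservoirMerge {
    // reservoirs.get(k) holds machine k's m samples; counts[k] = Nk, with N1 + ... + NK = N.
    static <T> List<T> merge(List<List<T>> reservoirs, long[] counts, int m) {
        long total = 0;
        for (long c : counts) total += c;                        // N
        ThreadLocalRandom rnd = ThreadLocalRandom.current();
        List<T> result = new ArrayList<>(m);
        List<List<T>> pools = new ArrayList<>();
        for (List<T> r : reservoirs) pools.add(new ArrayList<>(r)); // copies: we draw without replacement

        for (int pick = 0; pick < m; pick++) {
            long d = rnd.nextLong(total);                        // uniform in [0, N)
            int machine = 0;
            long cumulative = counts[0];
            while (d >= cumulative) cumulative += counts[++machine]; // find which machine d falls on
            List<T> pool = pools.get(machine);
            result.add(pool.remove(rnd.nextInt(pool.size())));   // equal probability, without replacement
        }
        return result;
    }
}
```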
The difference between the leaky bucket (funnel) and token bucket rate-limiting algorithms
Leaky bucket (funnel) algorithm
Set the total capacity of the funnel to the maximum concurrency the service can handle.
Set the funnel's consumption (drain) rate, which is fixed.
The funnel starts empty, and incoming requests enter the bucket.
The funnel algorithm is essentially pessimistic because it strictly limits the system's throughput; in that respect its effect is very similar to a concurrency limit. It can be used in most scenarios, but it imposes a strict, fixed cap on service throughput.
It guarantees a ceiling on the service's concurrency, but it performs poorly when traffic spikes instantaneously. (A minimal sketch follows below.)
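A minimal leaky-bucket sketch; the capacity and drain rate are constructor parameters, and the names are illustrative:

```java
// Leaky bucket: requests fill the bucket, and the bucket drains at a fixed rate.
class LeakyBucketLimiter {
    private final long capacity;        // maximum number of queued requests
    private final double leakPerMillis; // fixed consumption (drain) rate
    private double water = 0;           // current bucket level
    private long lastLeakMillis = System.currentTimeMillis();

    LeakyBucketLimiter(long capacity, double leakPerSecond) {
        this.capacity = capacity;
        this.leakPerMillis = leakPerSecond / 1000.0;
    }

    synchronized boolean tryAcquire() {
        long now = System.currentTimeMillis();
        // Drain the bucket according to how much time has passed since the last check.
        water = Math.max(0, water - (now - lastLeakMillis) * leakPerMillis);
        lastLeakMillis = now;
        if (water + 1 <= capacity) {    // room left: accept the request into the bucket
            water += 1;
            return true;
        }
        return false;                   // bucket full: reject (or queue) the request
    }
}
```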
Token Bucket Algorithm
The token bucket algorithm can limit the average transmission rate of data while also allowing a certain degree of burst transmission.
When the token bucket is empty, requests are discarded. Meanwhile, tokens are added to the bucket at a fixed rate (one token every 1/r time units). (A minimal sketch follows below.)
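A minimal token-bucket sketch along the same lines; starting with a full bucket is an assumption that allows an initial burst:

```java
// Token bucket: tokens are refilled at a fixed rate, each request consumes one token,
// and a full bucket allows a burst up to its capacity.
class TokenBucketLimiter {
    private final long capacity;          // maximum number of stored tokens (burst size)
    private final double tokensPerMillis; // refill rate
    private double tokens;
    private long lastRefillMillis = System.currentTimeMillis();

    TokenBucketLimiter(long capacity, double tokensPerSecond) {
        this.capacity = capacity;
        this.tokensPerMillis = tokensPerSecond / 1000.0;
        this.tokens = capacity;           // start full so an initial burst is allowed
    }

    synchronized boolean tryAcquire() {
        long now = System.currentTimeMillis();
        tokens = Math.min(capacity, tokens + (now - lastRefillMillis) * tokensPerMillis);
        lastRefillMillis = now;
        if (tokens >= 1) {                // a token is available: consume it and pass
            tokens -= 1;
            return true;
        }
        return false;                     // bucket empty: the request is rejected
    }
}
```

Compared with the leaky bucket above, a full bucket of tokens lets a burst of up to the bucket's capacity pass immediately while the long-run rate is still bounded.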
Consistent Hash Algorithm
Purpose
load balancing
Handling data sharding without a distributed-lock architecture
Ordinary hash algorithms cannot handle adding or removing nodes without causing a large portion of cached keys to drift and become invalid.
How to deal with drift situations
Use Hash Ring
Map requests onto the ring; when drift occurs, a request drifts clockwise to the neighboring node.
How to avoid uneven post-drift processing
Use virtual nodes.
Use a physical-node-to-virtual-node mapping: map one physical node to multiple virtual nodes.
The key of a virtual node carries the physical node's name as a prefix plus an index (e.g., physicalNode#index), so the physical node can be determined from the virtual node's key.
The mapping from physical nodes to virtual nodes is not a strict partition of the ring; the virtual nodes are placed essentially at random.
When drift occurs, which physical node a request lands on is therefore essentially random, which prevents all of one physical node's load from drifting onto a single other physical node and causing load imbalance.
actual implementation
Includes adding and removing nodes on the hash ring and locating the node for a given hash value.
Use two Maps
A TreeMap stores the mapping from virtual-node hash values to physical nodes. Because its keys are sorted, a given hash value can be used to locate the specific virtual node that should handle it, and the physical node is then determined from the prefix of the virtual node's name (key: hash value, value: physicalNode#index).
The second map stores the mapping from physical nodes to virtual nodes: the key is the physical node and the value is the list of its virtual-node hash values (these could also be found by querying the TreeMap).
When a physical node is added, several (a configurable number of) virtual nodes are generated and stored in both maps.
When a physical node is deleted, the virtual nodes corresponding to it are looked up and the drift logic is executed, drifting its requests to the neighboring node on the ring: the TreeMap entry of each such virtual node is changed to point at the value stored under the adjacent key (or simply removed).
Use MurmurHash so that even highly similar keys are spread out evenly; the better the randomness, the more random the drift, and the more even the actual load across physical nodes.
The specific drift logic is implemented by the client itself
Drift is also required when a node is added: some requests previously handled by the neighboring node are routed to the new node. When a node is deleted, the requests it handled are forwarded to its neighbor. (A minimal sketch of the ring follows below.)
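A minimal sketch of the ring described above, using a TreeMap of virtual nodes. The virtual-node count is a constructor parameter, the placeholder hash stands in for MurmurHash, and lookups take the first virtual node clockwise (ceilingEntry), which is one common convention:

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Consistent hash ring: virtual nodes named physicalNode#index are hashed onto a sorted TreeMap.
class ConsistentHashRing {
    private final TreeMap<Long, String> ring = new TreeMap<>();              // virtual hash -> physical node
    private final Map<String, List<Long>> nodeToVirtual = new HashMap<>();   // physical node -> its virtual hashes
    private final int virtualPerNode;

    ConsistentHashRing(int virtualPerNode) { this.virtualPerNode = virtualPerNode; }

    void addNode(String node) {
        List<Long> hashes = new ArrayList<>();
        for (int i = 0; i < virtualPerNode; i++) {
            long h = hash(node + "#" + i);          // virtual node key: physicalNode#index
            ring.put(h, node);
            hashes.add(h);
        }
        nodeToVirtual.put(node, hashes);
    }

    void removeNode(String node) {
        List<Long> hashes = nodeToVirtual.remove(node);
        if (hashes != null) hashes.forEach(ring::remove);   // its keys drift to neighboring virtual nodes
    }

    // Locate the physical node for a key: the first virtual node clockwise from the key's hash.
    String locate(String key) {
        if (ring.isEmpty()) return null;
        Map.Entry<Long, String> e = ring.ceilingEntry(hash(key));
        return e != null ? e.getValue() : ring.firstEntry().getValue();   // wrap around the ring
    }

    private long hash(String s) {
        // Placeholder hash for the sketch; a real implementation would use MurmurHash for uniformity.
        long h = 1125899906842597L;
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) h = 31 * h + b;
        return h;
    }
}
```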
murmur
Non-cryptographic hash algorithm