Spark design and operating principles

An outline of Spark's design and operating principles. Spark's main features are fast execution, ease of use, versatility, modularity, diverse operating modes, and support for a variety of data sources.

Edited at 2024-11-28 12:52:29

ReesyA

Overview

Spark main features

Runs fast

Easy to use

Versatility

Modular

Various operating modes

Support various data sources

Spark ecosystem

Core components

Spark Core

Basic functions: mainly for batch data processing

RDD operations

Spark SQL

Process structured data

Spark Streaming

Real-time streaming data processing

Structured Streaming

It is a scalable and fault-tolerant stream processing engine built on the Spark SQL engine.

MLlib

Machine learning library

Algorithms and tools

GraphX

Graph computing framework

Graph parallel computing

Spark running architecture

Basic concepts

RDD

A Resilient Distributed Dataset (RDD) is an abstraction of distributed memory that provides a highly restricted shared-memory model.

DAG

A directed acyclic graph (DAG) that reflects the dependencies between RDDs

Executor

A process running on the worker node, responsible for running tasks and storing data for the application

Application

A user-written Spark program

Task

A unit of work running on an executor

Job

A job contains multiple RDDs and various operations that act on the corresponding RDDs.

Stage

The basic scheduling unit of a job. A job is divided into multiple groups of tasks. Each group of tasks is called a stage, also called a task set.

Architecture design

Cluster Manager

Worker node

Driver

Executor

Uses a master-slave architecture with one Driver and several Workers

Basic Spark execution flow

RDD design and operation principles

RDD design background

To optimize iterative algorithms and interactive data mining

RDD concept

A Resilient Distributed Dataset is a distributed collection of objects. In essence, it is a read-only, partitioned collection of records.

Action operations

Used to perform calculations and specify the form of the output

Transformation operations

Used to specify interdependencies between RDDs

Transformations are coarse-grained data operations

RDD is suitable for batch-style applications that perform the same operation on elements in the data set, but is not suitable for applications that require asynchronous, fine-grained state.

RDD execution process

Read external data sources for RDD creation

The RDD then goes through a series of transformations. Each transformation produces a new RDD, which becomes the input of the next transformation.

The last RDD is processed by the action operation and output to an external data source

Lazy evaluation

Real computation happens only when an action is invoked on an RDD. For transformations, Spark only records some basic metadata and the lineage by which each RDD is generated, and does not trigger any computation.
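The lazy behaviour described above can be sketched in plain Python. This is a toy model, not the Spark API: the class `ToyRDD`, its `lineage` method, and the helper `traced_square` are hypothetical names introduced only for illustration. Transformations just link a new object to its parent, and nothing is computed until the `collect` action runs.

```python
class ToyRDD:
    """Toy stand-in for an RDD (hypothetical, not Spark's class)."""

    def __init__(self, source=None, parent=None, fn=None, name="source"):
        self._source = source   # data, set only on the root of the lineage
        self._parent = parent   # the RDD this one was derived from
        self._fn = fn           # how to derive this RDD from its parent
        self.name = name

    # Transformations: return a new ToyRDD and record lineage, no work is done.
    def map(self, f):
        return ToyRDD(parent=self, fn=lambda it: (f(x) for x in it), name="map")

    def filter(self, pred):
        return ToyRDD(parent=self, fn=lambda it: (x for x in it if pred(x)),
                      name="filter")

    def lineage(self):
        """Names of the recorded steps, from the source to this RDD."""
        node, names = self, []
        while node is not None:
            names.append(node.name)
            node = node._parent
        return list(reversed(names))

    def _compute(self):
        if self._parent is None:
            return iter(self._source)
        return self._fn(self._parent._compute())

    # Action: the only place where the recorded chain is actually evaluated.
    def collect(self):
        return list(self._compute())


calls = []  # records every element that traced_square actually processes


def traced_square(x):
    calls.append(x)
    return x * x


rdd = ToyRDD([1, 2, 3, 4]).map(traced_square).filter(lambda x: x % 2 == 0)
calls_before_action = len(calls)  # still 0: only lineage has been recorded
result = rdd.collect()            # the action triggers the real computation
```

In this sketch `calls` stays empty until `collect` is called, which mirrors the point that transformations only record how an RDD is derived.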

RDD characteristics

Efficient fault tolerance

Fault tolerance does not rely on data redundancy. Lost partitions are simply recomputed through the parent-child dependencies between RDDs, without rolling back the entire system.

The transformations provided by RDDs are coarse-grained operations.
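As a rough illustration of recomputing a lost partition from its dependencies, the following plain-Python sketch keeps computed partitions in a dictionary and rebuilds only the one that goes missing. It is a toy model under stated assumptions: `source_partitions`, `lineage`, `compute_partition`, and `get_partition` are hypothetical names, and the source data is assumed to remain available.

```python
# Assumption: the source data survives the failure (e.g. it lives in stable storage).
source_partitions = [[1, 2], [3, 4], [5, 6]]


def double(part):
    return [x * 2 for x in part]


def add_one(part):
    return [x + 1 for x in part]


# Coarse-grained transformations, each applied to a whole partition.
lineage = [double, add_one]

recomputed = []  # indexes of partitions that had to be (re)computed


def compute_partition(i):
    """Rebuild partition i by replaying the lineage on its source partition."""
    recomputed.append(i)
    part = source_partitions[i]
    for step in lineage:
        part = step(part)
    return part


# Initial run: every partition is computed once and kept in memory.
cache = {i: compute_partition(i) for i in range(len(source_partitions))}
recomputed.clear()

# Simulate losing partition 1, for example because a worker failed.
del cache[1]


def get_partition(i):
    """Return a cached partition, recomputing it from lineage if it is missing."""
    if i not in cache:
        cache[i] = compute_partition(i)
    return cache[i]


restored = [get_partition(i) for i in range(len(source_partitions))]
```

Only partition 1 is replayed here. The other partitions are served from the cache, so nothing resembling a global rollback is needed.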

Intermediate results are persisted to memory

Data passes between RDDs in memory and does not need to be written ("landed") to disk, which avoids unnecessary disk read and write overhead.

The stored data can be Java objects

Avoid unnecessary object serialization and deserialization overhead

These characteristics are the main reasons RDDs let Spark compute efficiently

Dependencies between RDDs

Narrow dependency

Each parent partition is used by at most one child partition (for example, a one-to-one dependency)

Does not involve a Shuffle

Pipeline optimization can be achieved

Wide dependency

A parent partition is used by multiple child partitions

Involves a Shuffle

Pipeline optimization cannot be achieved

The main difference between the two is whether a Shuffle is involved

Data retransmission
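The difference between narrow and wide dependencies can be made concrete with a small plain-Python sketch over lists of partitions. This is a toy model, not Spark code: `map_values`, `shuffle_by_key`, and the byte-sum partitioner are hypothetical and chosen only so that the result is deterministic.

```python
# Two parent partitions of (key, value) records.
parent = [[("a", 1), ("b", 2)], [("a", 3), ("c", 4)]]


def map_values(partitions, f):
    """Narrow dependency: child partition i reads only parent partition i."""
    return [[(k, f(v)) for k, v in part] for part in partitions]


def shuffle_by_key(partitions, num_out):
    """Wide dependency: records are redistributed by key across partitions.

    Returns the child partitions and, for each child partition, the set of
    parent partition indexes it received data from.
    """
    out = [[] for _ in range(num_out)]
    sources = [set() for _ in range(num_out)]
    for i, part in enumerate(partitions):
        for k, v in part:
            target = sum(k.encode()) % num_out  # deterministic toy partitioner
            out[target].append((k, v))
            sources[target].add(i)
    return out, sources


mapped = map_values(parent, lambda v: v * 10)
shuffled, sources = shuffle_by_key(parent, num_out=2)
```

With this input, `sources[1]` contains both parent indexes, which is why a wide dependency cannot be pipelined inside a single partition the way `map_values` can.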

Division of stages

Application → Job → Stage → Task

Specific division method

Traverse the DAG backward from the final RDD. At a wide dependency, cut and start a new stage. At a narrow dependency, add the RDD to the current stage.
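The division method above can be sketched as a backward traversal over a small dependency graph. This is a simplified illustration rather than the DAGScheduler's actual code: the graph `deps`, the RDD names A to G, and the function `split_stages` are hypothetical, and each RDD is assumed to belong to exactly one stage.

```python
# For each RDD: the list of (parent RDD, dependency type) pairs.
deps = {
    "A": [],
    "B": [("A", "wide")],
    "C": [],
    "D": [("C", "narrow")],
    "E": [],
    "F": [("D", "narrow"), ("E", "narrow")],
    "G": [("B", "narrow"), ("F", "wide")],
}


def split_stages(deps, final_rdd):
    """Walk backward from final_rdd and group RDDs into stages."""
    stages = []
    pending = [final_rdd]  # RDDs that close a stage and still need processing
    seen = set()
    while pending:
        last = pending.pop()
        if last in seen:
            continue
        stage, stack = set(), [last]
        while stack:
            rdd = stack.pop()
            if rdd in stage:
                continue
            stage.add(rdd)
            seen.add(rdd)
            for parent, kind in deps[rdd]:
                if kind == "narrow":
                    stack.append(parent)    # same stage: can be pipelined
                else:
                    pending.append(parent)  # wide: cut, parent ends another stage
        stages.append(stage)
    return stages


stages = split_stages(deps, "G")
```

For this graph the traversal yields three stages: `{A}`, `{C, D, E, F}`, and `{B, G}`. The cuts fall exactly on the two wide dependencies.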

RDD running process

Create RDD object

SparkContext is responsible for computing the dependencies between RDDs and building the DAG

DAGScheduler is responsible for decomposing the DAG into multiple Stages. Each Stage contains multiple Tasks. Each Task is distributed by the TaskScheduler to an Executor on a Worker node for execution.
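The last step, handing the tasks of a stage to executors, can be pictured with the sketch below. It assumes one task per partition and a plain round-robin assignment, which is a deliberate simplification and not how Spark's TaskScheduler decides placement. The names `make_tasks`, `assign`, and the executor labels are hypothetical.

```python
from itertools import cycle


def make_tasks(stage_id, num_partitions):
    """One task per partition of the stage, identified by (stage, partition)."""
    return [(stage_id, p) for p in range(num_partitions)]


def assign(tasks, executors):
    """Round-robin assignment of tasks to executors (toy policy, assumption)."""
    plan = {e: [] for e in executors}
    for task, executor in zip(tasks, cycle(executors)):
        plan[executor].append(task)
    return plan


tasks = make_tasks(stage_id=0, num_partitions=5)
plan = assign(tasks, ["executor-1", "executor-2"])
```

Here the five tasks of stage 0 are spread over two executors, three on the first and two on the second.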

Spark deployment mode

Local mode

Standalone

Spark on Mesos

Spark on YARN

Spark on Kubernetes