Data Science Theory and Practice Chapter 4

This chapter covers the data science and big data technology system. Analysis tools in the data science and big data industry chain provide technical support for big data analysis, including data analysis platforms, data science platforms, social analytics, and machine learning. Data resources represent the institutions that generate data, including incubators, schools, and research institutions.

Edited at 2023-10-21 15:49:41

PlotWizard

Technology & Tools

Data science technology system

infrastructure

Provide data calculation, data management and monitoring, etc.

Analysis tools

Provide technical support for big data analysis within the data science and big data industry chain, including data analysis platforms, data science platforms, social analytics, machine learning, etc.

Enterprise applications

Organizations provide enterprise-level application technologies or tools, including sales and marketing, customer service, human capital and other specific services

Industry application

Solve common industry problems and provide a technology platform for enterprise applications

Cross-platform infrastructure and analytics tools

Provide cross-platform infrastructure and cross-platform analysis tools, such as Microsoft, etc.

Open source tools

Technical design frameworks, query and data flow, data access, coordination, stream processing, statistical tools, artificial intelligence, machine learning, deep learning, search, log analysis, visualization, collaboration, and security

Data source and APP

Health, Internet of Things, finance and economics, etc.

Data resources

Data resources represent the institutions that generate the data, including incubators, schools and research institutions.

MapReduce

A distributed computing model

map function

The user-defined map function receives the key-value pairs in the input data, and after calculation by the map function, a set of intermediate key-value pairs is obtained.

reduce function

The user-defined reduce function receives an intermediate key and the set of values associated with that key.
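The map and reduce functions described above can be sketched in a single process. This is a minimal illustration using word counting as the example task, not Google's or Hadoop's implementation; the names `map_fn`, `reduce_fn`, and `run_mapreduce` are hypothetical, and the grouping step stands in for the distributed shuffle.

```python
from collections import defaultdict


def map_fn(key, value):
    """Map: (document name, text) -> list of intermediate (word, 1) pairs."""
    return [(word.lower(), 1) for word in value.split()]


def reduce_fn(key, values):
    """Reduce: (word, list of counts) -> (word, total count)."""
    return key, sum(values)


def run_mapreduce(inputs, map_fn, reduce_fn):
    # Map phase: apply the user-defined map function to every input pair.
    intermediate = []
    for key, value in inputs:
        intermediate.extend(map_fn(key, value))

    # Shuffle (simulated): group all intermediate values by their key.
    groups = defaultdict(list)
    for k, v in intermediate:
        groups[k].append(v)

    # Reduce phase: apply the user-defined reduce function to each group.
    return dict(reduce_fn(k, vs) for k, vs in sorted(groups.items()))


docs = [("doc1", "big data big ideas"), ("doc2", "data science")]
word_counts = run_mapreduce(docs, map_fn, reduce_fn)
print(word_counts)  # {'big': 2, 'data': 2, 'ideas': 1, 'science': 1}
```

In a real cluster the map and reduce calls run on different workers and the grouping happens over the network, but the contract of the two user-defined functions is the same.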

Google's three major papers

Implementation process

Main features

Run as a master-slave structure

Data processing between map function and reduce function

Shuffle processing

combiner processing

partition function

Key-value typed input and output

The complexity of fault tolerance mechanisms

Worker failure

Master failure

Diversity of data storage locations

Source files: GFS

Map processing results: local storage

Reduce processing results: GFS

Logs: GFS

The importance of task granularity

The necessity of task backup mechanism

Key technologies

partition function

combiner function

Skip corrupted records

local execution

status information

counter
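Two of the key technologies listed above, the partition function and the combiner function, can be illustrated with a small single-process sketch. It assumes a word-count job with two reducers; `NUM_REDUCERS`, `partition`, `combine`, and `map_task` are hypothetical names, and the CRC32-based hash is only one possible partitioning choice.

```python
import zlib
from collections import Counter, defaultdict

NUM_REDUCERS = 2  # assumed number of reduce tasks


def partition(key, num_reducers=NUM_REDUCERS):
    """Partition function: decide which reducer receives a given key.

    crc32 is used because it is deterministic across runs, unlike hash()
    on strings.
    """
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def combine(pairs):
    """Combiner: pre-aggregate (word, count) pairs on the map side."""
    local = Counter()
    for word, count in pairs:
        local[word] += count
    return list(local.items())


def map_task(text):
    return [(word, 1) for word in text.split()]


splits = ["a b a a", "b b c"]
buckets = defaultdict(list)  # reducer id -> pairs sent to that reducer
pairs_before = 0
pairs_after = 0

for split in splits:
    raw = map_task(split)
    combined = combine(raw)  # shrinks the data before it is shuffled
    pairs_before += len(raw)
    pairs_after += len(combined)
    for word, count in combined:
        buckets[partition(word)].append((word, count))

# Each reducer sums the pairs in its own bucket.
totals = Counter()
for reducer_id, pairs in buckets.items():
    for word, count in pairs:
        totals[word] += count

print(pairs_before, pairs_after, dict(totals))
```

The combiner reduces the number of pairs crossing the network, while the partition function guarantees that all pairs for one key reach the same reducer.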

Implementation and improvement of MapReduce

MRv1

programming model

data processing engine

runtime environment

Poor scalability

Poor reliability

Low resource utilization

Unable to support multiple computing frameworks

Hadoop

An Apache project that provides a complete set of open-source software libraries for reliable, scalable, distributed computing

Hadoop MapReduce

Operation

Job submission

Job initialization

Progress and status updates

Job completion

Task

Task assignment

Task execution

JobTracker and TaskTracker

Input splits

Data locality optimization

Client submits MapReduce task

JobTracker coordinates the running of jobs

TaskTracker runs the divided tasks

HDFS is used to share job files among the other entities

HDFS

Support very large files

Runs on commodity hardware

Streaming data access

High throughput

Hive

It can map structured data files into a database table, provide simple HiveQL query functions, and convert HiveQL statements into MapReduce tasks for running.

Pig

Pig Latin, a language for describing data analysis

Easy to program

Easy to optimize

flexibility

Pig execution environment

Mahout

Provide scalable machine learning algorithms and their implementation

HBase

Scalable, highly reliable, high-performance, distributed and column-oriented dynamic schema database for structured data

HBase logical model

HBase physical model

ZooKeeper

simplicity

replication

ordered access

fast reads

Flume

High reliability

Scalability

Support convenient management

Support user customization

Sqoop

Spark

A brief history with Hadoop

Main features

high speed

Versatility

Ease of use

Technical structure

resource management

Spark core layer

service layer

Basic process

Cluster management

Key technologies

RDD

a set of partitions

A function that calculates each partition

Dependencies on parent RDDs

Preferred locations

Partitioner

Transformation

Action
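The difference between transformations and actions can be shown with a toy, single-process stand-in for an RDD. This is not the Spark API: `MiniRDD` and its methods are hypothetical names, and the sketch only models partitions, a per-partition compute function, a parent dependency, and lazy evaluation.

```python
class MiniRDD:
    """Toy RDD: a set of partitions, a per-partition function, and a parent."""

    def __init__(self, partitions=None, parent=None, fn=None):
        self.partitions = partitions  # list of lists (source RDD only)
        self.parent = parent          # dependency on the parent RDD
        self.fn = fn                  # function computing one partition

    # Transformations: return a new MiniRDD and do no work yet.
    def map(self, f):
        return MiniRDD(parent=self, fn=lambda part: (f(x) for x in part))

    def filter(self, pred):
        return MiniRDD(parent=self, fn=lambda part: (x for x in part if pred(x)))

    def _compute(self):
        # Walk the dependency chain back to the source partitions.
        if self.parent is None:
            return [iter(p) for p in self.partitions]
        return [self.fn(part) for part in self.parent._compute()]

    # Actions: trigger the computation and return a result to the caller.
    def collect(self):
        return [x for part in self._compute() for x in part]

    def count(self):
        return sum(1 for part in self._compute() for _ in part)


calls = []


def square(x):
    calls.append(x)  # record when the function actually runs
    return x * x


source = MiniRDD(partitions=[[1, 2, 3], [4, 5, 6]])
evens = source.map(square).filter(lambda x: x % 2 == 0)  # transformations only
calls_before_action = len(calls)  # still 0: nothing has been computed

result = evens.collect()  # action: the chain is evaluated now
print(calls_before_action, result)  # 0 [4, 16, 36]
```

Because transformations only record the lineage, the work is deferred until an action such as `collect` or `count` asks for a result.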

Scheduler

DAGScheduler is responsible for creating execution plans

TaskScheduler is responsible for allocating tasks and scheduling the running of Workers

Shuffle

SparkR

Data type mapping

Redefinition of session process

Provide multiple APIs

Support custom distributed running functions

Supports a variety of R code editing and running environments

Lambda architecture

NoSQL and NewSQL

Advantages and Disadvantages of Relational Databases

High data consistency

Low data redundancy

Strong complex query capabilities and high product maturity

NoSQL technology

Easy to decentralize storage and processing of data

The cost of frequent data operations is low and the simple processing of data is highly efficient.

Suitable for application scenarios where data models are constantly changing

Relational Cloud

data model

Data distribution

Sharding

BigTable

master-slave replication

Peer-to-peer replication

data consistency

weak consistency

eventual consistency

update consistency

Read-your-writes consistency

session consistency
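Master-slave replication and eventual consistency can be illustrated together with a small sketch. It assumes asynchronous replication that is triggered manually; the classes `Replica` and `Master` and the method `replicate` are hypothetical names, and no real NoSQL product is modeled.

```python
class Replica:
    """A node holding a copy of the data; slaves serve reads only."""

    def __init__(self):
        self.data = {}

    def read(self, key):
        return self.data.get(key)


class Master(Replica):
    """The master accepts writes and propagates them to its slaves later."""

    def __init__(self, slaves):
        super().__init__()
        self.slaves = slaves
        self.pending = []  # writes not yet applied to the slaves

    def write(self, key, value):
        self.data[key] = value
        self.pending.append((key, value))

    def replicate(self):
        # Asynchronous propagation, modeled here as an explicit step.
        for key, value in self.pending:
            for slave in self.slaves:
                slave.data[key] = value
        self.pending.clear()


slave_a, slave_b = Replica(), Replica()
master = Master([slave_a, slave_b])

master.write("cart", ["book"])
stale_read = slave_a.read("cart")  # replication has not happened yet
master.replicate()
fresh_read = slave_a.read("cart")  # replicas have converged

print(stale_read, fresh_read)  # None ['book']
```

Between the write and the replication step the slaves return stale data, and once propagation finishes every replica returns the same value, which is the guarantee eventual consistency gives.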

CAP theory and BASE principles

application

A distributed system cannot simultaneously guarantee consistency, availability, and partition tolerance. It can satisfy at most two of these three properties at the same time.

BASE principle

In practical applications of NoSQL, consistency and availability must be traded off against each other

Views and materialized views

materialized view

event triggered

time triggered

Materialized view in Map stage

Materialized view of the Reduce phase

Transaction and version stamp

Conditional update

version stamp
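Conditional updates based on a version stamp can be sketched as a compare-and-set on an in-memory store. This is an illustration under assumptions: `VersionedStore`, `conditional_update`, and `VersionConflict` are hypothetical names, and the version stamp is modeled as a simple integer counter.

```python
class VersionConflict(Exception):
    """Raised when the stored version differs from the one the client read."""


class VersionedStore:
    def __init__(self):
        self._data = {}  # key -> (version stamp, value)

    def read(self, key):
        # Missing keys are reported as version 0 with no value.
        return self._data.get(key, (0, None))

    def conditional_update(self, key, expected_version, new_value):
        current_version, _ = self.read(key)
        if current_version != expected_version:
            raise VersionConflict(
                f"expected version {expected_version}, found {current_version}"
            )
        self._data[key] = (current_version + 1, new_value)
        return current_version + 1


store = VersionedStore()
store.conditional_update("stock", 0, 10)

# Two clients read the same version, then both try to write.
version_a, value_a = store.read("stock")
version_b, value_b = store.read("stock")

store.conditional_update("stock", version_a, value_a - 1)  # succeeds

try:
    store.conditional_update("stock", version_b, value_b - 1)
    conflict_detected = False
except VersionConflict:
    conflict_detected = True
    # The losing client re-reads and retries with the fresh version stamp.
    version_b, value_b = store.read("stock")
    store.conditional_update("stock", version_b, value_b - 1)

final_version, final_value = store.read("stock")
print(conflict_detected, final_version, final_value)  # True 3 8
```

The second write is rejected instead of silently overwriting the first, so no update is lost.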

Typical products

R and Python

R language supports vectorized calculations

Call professional-grade services for data science tasks through R packages

The developers of mainstream R packages are all experts in statistics, machine learning and other data fields.

Data lakes and the lakehouse

Data lake is an approach that emphasizes storing data in a natural format and supports configuring data in various schemas and structures.

database

data lake

Data lakehouse

development trend

Development trends of data computing layer

Selling software, hardware products or information resources to users

Responsible for managing and maintaining their software and hardware equipment or information resources on behalf of users

Development trends in data management

From Data Management Perfectionist to Realist

From Schema First alone to the coexistence of Schema First, Schema Later, and Schemaless

From a focus on complex processing to an emphasis on simple processing

From the pursuit of strong consistency to the diversified understanding of data consistency

From emphasizing the negative effects of data redundancy to emphasizing the positive effects of data redundancy

From the pursuit of recall and precision to the emphasis on query response speed

From database management systems as a product to database management systems as a service

From standardization of data management technology to diversification of data management technology

From relying solely on a single technology to integrating multiple technologies

Data Science Platform

What is cloud computing

Economy

Strong computation

On-demand services

Virtualization