MindMap Gallery Data Science Theory and Practice Chapter 4
This mind map covers the data science and big data technology system. In the data science and big data industry chain, analysis tools provide technical support for big data analysis, including data analysis platforms, data science platforms, social analytics, and machine learning; data resources represent the institutions that generate data, including incubators, schools, and research institutions.
Edited at 2023-10-21 15:49:41
Technology & Tools
Data science technology system
Infrastructure
Provides data computing, data management, monitoring, etc.
Analysis tools
Provide technical support for big data analysis, including data analysis platforms, data science platforms, social analytics, machine learning, etc.
Enterprise applications
Provide organizations with enterprise-level application technologies or tools, including sales and marketing, customer service, human capital, and other specific services
Industry applications
Solve common industry problems and provide a technology platform for enterprise applications
Cross-platform infrastructure and analytics tools
Provide cross-platform infrastructure and cross-platform analysis tools, for example those from Microsoft
Open source tools
Technical frameworks, query and data flow, data access, coordination, stream processing, statistical tools, artificial intelligence, machine learning, deep learning, search, log analysis, visualization, collaboration, and security
Data sources and apps
Health, Internet of Things, finance and economics, etc.
Data resources
Data resources represent the institutions that generate the data, including incubators, schools and research institutions.
MapReduce
A distributed computing model
map function
The user-defined map function receives key-value pairs from the input data and computes a set of intermediate key-value pairs from them.
reduce function
The user-defined reduce function receives an intermediate key together with the set of values associated with that key, and merges those values into the output.
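The map and reduce functions described above can be illustrated with a word count. This is a minimal single-process sketch, not Hadoop or Google code: the names map_fn, reduce_fn, and run_mapreduce are hypothetical, and the in-memory grouping step stands in for what a real framework does across many machines.

```python
from collections import defaultdict
from typing import Iterable, Iterator, Tuple


def map_fn(key: str, value: str) -> Iterator[Tuple[str, int]]:
    """Map: take one input pair (document name, line) and emit (word, 1) pairs."""
    for word in value.split():
        yield word.lower(), 1


def reduce_fn(key: str, values: Iterable[int]) -> Tuple[str, int]:
    """Reduce: take one intermediate key and all of its values, emit the total."""
    return key, sum(values)


def run_mapreduce(records, map_fn, reduce_fn):
    """Hypothetical driver: run map, group intermediate pairs by key, run reduce."""
    groups = defaultdict(list)
    for key, value in records:
        for inter_key, inter_value in map_fn(key, value):
            groups[inter_key].append(inter_value)
    return dict(reduce_fn(k, vs) for k, vs in sorted(groups.items()))


records = [("doc1", "big data big ideas"), ("doc2", "data science")]
counts = run_mapreduce(records, map_fn, reduce_fn)
print(counts)
```

The user writes only map_fn and reduce_fn; everything in run_mapreduce is the part a framework would take over.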
Google's three major papers
Implementation process
Main features
Runs in a master-slave structure
Data processing between map function and reduce function
Shuffle processing
combiner processing
partition function
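The three steps between map and reduce listed above can be sketched as follows. This is an illustrative single-process sketch under assumptions: the function names are hypothetical, the combiner is a simple local sum, and the partition function is a hash of the key modulo the number of reducers. zlib.crc32 is used instead of Python's built-in hash() because string hashes are randomized per process.

```python
import zlib
from collections import Counter, defaultdict


def partition(key: str, num_reducers: int) -> int:
    """Partition function: the same key is always routed to the same reducer."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def combine(pairs):
    """Combiner: pre-aggregate one mapper's output locally to cut shuffle traffic.

    This is only valid because addition is associative and commutative.
    """
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return list(totals.items())


def shuffle(map_outputs, num_reducers):
    """Shuffle: send combined pairs to their reducer and group values by key."""
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for pairs in map_outputs:
        for key, value in combine(pairs):
            buckets[partition(key, num_reducers)][key].append(value)
    return buckets


# Output of two hypothetical map tasks.
map_outputs = [
    [("big", 1), ("data", 1), ("big", 1)],
    [("data", 1), ("science", 1)],
]
buckets = shuffle(map_outputs, num_reducers=2)
totals = {key: sum(values) for bucket in buckets for key, values in bucket.items()}
print(totals)
```

Because of the combiner, the first map task ships a single pair ("big", 2) instead of two ("big", 1) pairs, while the final totals are unchanged.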
Key-value input and output
The complexity of fault tolerance mechanisms
Worker failure
Master failure
Diversity of data storage locations
Source files: GFS
Map output: local storage
Reduce output: GFS
Logs: GFS
The importance of task granularity
The necessity of task backup mechanism
Key technologies
partition function
combiner function
Skip corrupted records
local execution
status information
counter
Implementation and improvement of MapReduce
MRv1
programming model
data processing engine
runtime environment
Poor scalability
Poor reliability
Low resource utilization
Unable to support multiple computing frameworks
Hadoop
An Apache project that provides a set of open-source software libraries for reliable, scalable, distributed computing
Hadoop MapReduce
Job
Job submission
Job initialization
Progress and status updates
Job completion
Task
Task assignment
Task execution
JobTracker and TaskTracker
Input split
Data localization optimization
The client submits the MapReduce job
The JobTracker coordinates the running of jobs
TaskTrackers run the tasks that the job has been split into
HDFS is used to share job files among the other entities
HDFS
Support very large files
Runs on commodity hardware
Streaming data access
High throughput
Hive
It can map structured data files into a database table, provide simple HiveQL query functions, and convert HiveQL statements into MapReduce tasks for running.
Pig
Pig Latin, a language for describing data analysis flows
Easy to program
Easy to optimize
flexibility
Pig execution environment
Mahout
Provide scalable machine learning algorithms and their implementation
HBase
Scalable, highly reliable, high-performance, distributed and column-oriented dynamic schema database for structured data
HBase logical model
HBase physical model
ZooKeeper
simplicity
Replication
Ordered access
Fast reads
Flume
High reliability
Scalability
Support convenient management
Support user customization
Sqoop
Spark
A brief history with Hadoop
Main features
high speed
Versatility
Ease of use
Technical structure
resource management
Spark core layer
service layer
Basic process
Cluster management
Key technologies
RDD
A set of partitions
A function that computes each partition
Dependencies on parent RDDs
PreferredLocation
Partitioner
Transformation
Action
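The RDD components and the Transformation/Action split listed above can be illustrated with a toy class. This is a pure-Python sketch under assumptions, not the PySpark API: MiniRDD and its methods are hypothetical names, there is no scheduler or cluster, and each object only records its partitions, its per-partition function, and its dependency on a parent.

```python
class MiniRDD:
    """Toy stand-in for an RDD: partitions, a per-partition function, a parent."""

    def __init__(self, partitions, parent=None, fn=None):
        self.partitions = partitions  # source data; only set on the root RDD
        self.parent = parent          # dependency on the parent RDD
        self.fn = fn                  # function that computes one partition

    def _compute(self, index):
        """Compute one partition by walking the lineage back to the source."""
        if self.parent is None:
            return list(self.partitions[index])
        return self.fn(self.parent._compute(index))

    def _num_partitions(self):
        node = self
        while node.parent is not None:
            node = node.parent
        return len(node.partitions)

    # Transformations: return a new MiniRDD, compute nothing yet.
    def map(self, f):
        return MiniRDD(None, parent=self, fn=lambda part: [f(x) for x in part])

    def filter(self, pred):
        return MiniRDD(None, parent=self, fn=lambda part: [x for x in part if pred(x)])

    # Action: triggers the actual computation of every partition.
    def collect(self):
        result = []
        for index in range(self._num_partitions()):
            result.extend(self._compute(index))
        return result


calls = []


def square(x):
    calls.append(x)  # record when the function really runs
    return x * x


rdd = MiniRDD([[1, 2, 3], [4, 5, 6]])
squares_of_evens = rdd.filter(lambda x: x % 2 == 0).map(square)
calls_before_action = len(calls)  # still 0: transformations are lazy
result = squares_of_evens.collect()
print(calls_before_action, result)
```

Chaining filter and map only builds up the lineage; square is not called until collect runs. The same recorded lineage is what would allow a lost partition to be recomputed from its parent.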
Scheduler
DAGScheduler is responsible for creating execution plans
TaskScheduler is responsible for assigning tasks and scheduling their execution on Workers
Shuffle
SparkR
Data type mapping
Redefinition of session process
Provide multiple APIs
Support custom distributed running functions
Supports a variety of R code editing and running environments
Lambda architecture
NoSQL and NewSQL
Advantages and Disadvantages of Relational Databases
High data consistency
Low data redundancy
Strong complex query capabilities and high product maturity
NoSQL technology
Easy to distribute data storage and processing
The cost of frequent data operations is low and the simple processing of data is highly efficient.
Suitable for application scenarios where data models are constantly changing
Relational cloud
data model
Data distribution
Sharding
BigTable
master-slave replication
Peer-to-peer replication
data consistency
weak consistency
eventual consistency
update consistency
Read and write consistency
session consistency
CAP theory and BASE principles
CAP theorem
A distributed system cannot simultaneously guarantee consistency, availability, and partition tolerance; it can satisfy at most two of the three.
BASE principle
In practical NoSQL applications, consistency and availability must be traded off against each other
Views and materialized views
materialized view
event triggered
time triggered
Materialized views in the Map phase
Materialized views in the Reduce phase
Transaction and version stamp
Conditional update
version stamp
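A conditional update guarded by a version stamp can be sketched as below. This is a minimal in-memory sketch under assumptions: VersionedStore, VersionConflict, and conditional_update are hypothetical names, the version stamp is a simple integer counter, and the code is single-process (a real store would have to make the check and the write atomic).

```python
class VersionConflict(Exception):
    """Raised when the stored version no longer matches the one the client read."""


class VersionedStore:
    def __init__(self):
        self._data = {}  # key -> (version stamp, value)

    def read(self, key):
        """Return (version, value); a missing key has version 0."""
        return self._data.get(key, (0, None))

    def conditional_update(self, key, expected_version, new_value):
        """Write only if nobody has updated the key since it was read."""
        current_version, _ = self.read(key)
        if current_version != expected_version:
            raise VersionConflict(
                f"{key}: expected version {expected_version}, found {current_version}"
            )
        self._data[key] = (current_version + 1, new_value)
        return current_version + 1


store = VersionedStore()
store.conditional_update("cart", 0, ["book"])

# Two clients read the same version of the cart.
version_a, cart_a = store.read("cart")
version_b, cart_b = store.read("cart")

# Client A writes first and succeeds.
store.conditional_update("cart", version_a, cart_a + ["pen"])

# Client B's write is based on a stale version stamp and is rejected.
try:
    store.conditional_update("cart", version_b, cart_b + ["mug"])
    conflict = False
except VersionConflict:
    conflict = True

print(conflict, store.read("cart"))
```

The rejected client has to re-read the value and retry, which prevents its stale write from silently overwriting the earlier update.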
Typical products
R and Python
R language supports vectorized calculations
R packages give access to professional-grade functionality for data science tasks
The developers of mainstream R packages are largely experts in statistics, machine learning, and other data-related fields.
Data lakes and lake-warehouse integration (the lakehouse)
A data lake is an approach that emphasizes storing data in its natural format and supports data in a variety of schemas and structures.
database
data lake
Data lakehouse
development trend
Development trends of data computing layer
From selling software, hardware products, or information resources to users
To managing and maintaining software, hardware, or information resources on behalf of users
Development trends in data management
From Data Management Perfectionist to Realist
From schema-first only to the coexistence of schema-first, schema-later, and schemaless approaches
From a focus on complex processing to an emphasis on simple processing
From the pursuit of strong consistency to the diversified understanding of data consistency
From emphasizing the negative effects of data redundancy to emphasizing the positive effects of data redundancy
From the pursuit of recall and precision to an emphasis on query response speed
The transition from database management systems as a product to database management systems as a service
From standardization of data management technology to diversification of data management technology
From relying solely on a single technology to integrating multiple technologies
Data Science Platform
What is cloud computing
Economy
Powerful computing
On-demand services
Virtualization