DAMA-CDGA Data Governance Engineer - 14. Big Data and Data Science
Big Data and Data Science: The desire to seize business opportunities from data sets generated by multiple processes is the biggest business driver for improving an organization's big data and data science capabilities.
14. Big Data and Data Science
Introduction
Big data refers not only to the volume of data, but also to its variety and the speed at which it is generated.
Traditional business intelligence (BI) provides "rearview mirror" reporting, showing past trends by analyzing structured data.
In some cases BI models are used to predict future behavior, but such predictions carry low confidence.
To take advantage of big data, organizations must change the way they manage data.
Most data warehouses are based on the relational model, while big data is generally not organized using a relational model.
Most data warehouses rely on the concept of ETL (Extract, Transform, Load).
Big data solutions, such as data lakes, rely on the concept of ELT: load first, then transform (see the sketch below).
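To make the ETL/ELT contrast concrete, here is a minimal Python sketch; the record layout and the clean() transformation are illustrative assumptions, not part of the DMBOK text.

```python
# Hypothetical records and a hypothetical clean() transform, for illustration only.
raw_records = [
    {"id": 1, "amount": "12.50", "ts": "2024-03-01"},
    {"id": 2, "amount": "bad-value", "ts": "2024-03-02"},
]

def clean(record):
    """Coerce amount to float; return None for rows that fail the rule."""
    try:
        return {**record, "amount": float(record["amount"])}
    except ValueError:
        return None

# ETL: transform first, then load only conformed rows into the warehouse.
warehouse = [r for r in (clean(r) for r in raw_records) if r is not None]

# ELT: load everything unchanged into the lake; transform later, at read time.
data_lake = list(raw_records)
conformed_view = [r for r in (clean(r) for r in data_lake) if r is not None]

print(len(warehouse), len(data_lake), len(conformed_view))  # 1 2 1
```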
Business drivers
The desire to seize business opportunities from data sets generated by multiple processes is the biggest business driver for improving an organization's big data and data science capabilities.
Principles
Principles for big data management have yet to be fully formulated, but one thing is clear: organizations should carefully manage the metadata associated with big data sources so that data files, their origin, and their value can be accurately inventoried.
Basic concepts
Data science
Data scientists formulate a hypothesis about behavior, i.e., that a specific behavior can be observed in the data before a particular action is taken.
Data scientists then analyze large amounts of historical data to determine how often the hypothesis held true in the past and to statistically verify the probable accuracy of the model.
If the hypothesis holds often enough, and the behavior it predicts is useful, the model may become the basis for an operational intelligence process that predicts future behavior, perhaps even in real time (see the sketch below).
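As a rough illustration of this workflow, the sketch below (with made-up historical data and column names) measures how often the predicted action followed the observed behavior and compares that rate to the base rate; a real project would add proper statistical testing.

```python
import pandas as pd

# Hypothetical history: was the behavior observed, and did the action follow?
history = pd.DataFrame({
    "behavior_observed": [1, 1, 1, 0, 0, 1, 0, 0, 1, 0],
    "action_taken":      [1, 1, 0, 0, 0, 1, 1, 0, 1, 0],
})

# Frequency of the action given the behavior, versus the overall base rate.
rates = history.groupby("behavior_observed")["action_taken"].mean()
base_rate = history["action_taken"].mean()

print(rates)      # conditional frequency for behavior = 0 and behavior = 1
print(base_rate)  # overall frequency of the action
# If the rate given behavior = 1 is clearly above the base rate on a large
# enough sample, the hypothesis may be worth turning into a predictive model.
```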
Depends on
Rich data sources
Information organization and analysis
Information delivery
Presentation of findings and data insights
Data science process
Define big data strategy and business needs
Select data source
Collect and extract data
Develop data hypotheses and methods
Integrate and align data for analysis
Explore data using models
Deploy and monitor
Big Data
Large data volume (Volume)
Big data sets often contain thousands of entities or elements and billions of records.
High data velocity (Velocity)
Refers to the speed at which data is captured, generated, or shared.
Wide data variety (Variety)
Refers to the forms in which data is captured or delivered.
High data viscosity (Viscosity)
Refers to how difficult the data is to use or integrate.
High data volatility (Volatility)
Refers to how frequently the data changes and, as a result, how short its useful life is.
Low data veracity (Veracity)
Refers to the relatively low reliability of the data.
Big data architecture components
The biggest difference between DW/BI and big data processing is when integration occurs:
In a traditional data warehouse, data is integrated as it enters the warehouse (extract, transform, load).
In a big data environment, data is first received and loaded, and integrated afterwards (extract, load, transform).
Big data sources
Structured data
Unstructured data
Data lake
A data lake is an environment in which massive amounts of data of varying types and structures can be ingested, stored, assessed, and analyzed, supporting many kinds of applications.
For example, a data lake can provide:
An environment in which data scientists can mine and analyze data
A central storage area for raw data, with minimal (if any) transformation
The limited transformation reflects the ELT approach
An alternate storage area for detailed historical data from the data warehouse
An online archive of records
An environment in which ingested data can be identified through automated models
A data lake can be implemented as a composite configuration of data processing tools such as Hadoop or other data storage systems, cluster services, and data transformation or data integration tools.
Risks
The risk of a data lake is that it can quickly turn into a data swamp: messy, unclean, and inconsistent.
To build an inventory of the content in a data lake, it is critical to manage metadata as the data is ingested (see the sketch below).
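A minimal sketch of capturing metadata at ingestion time, assuming a toy file-based lake; the directory layout, catalog format, and field names are illustrative assumptions.

```python
import datetime
import hashlib
import json
import pathlib

LAKE = pathlib.Path("lake/raw")               # hypothetical raw zone
CATALOG = pathlib.Path("lake/catalog.jsonl")  # hypothetical metadata inventory

def ingest(payload: bytes, source: str, description: str) -> None:
    """Land the data unchanged (ELT) and record metadata so the lake stays searchable."""
    LAKE.mkdir(parents=True, exist_ok=True)
    digest = hashlib.sha256(payload).hexdigest()[:16]
    target = LAKE / f"{source}_{digest}.bin"
    target.write_bytes(payload)               # raw data, no transformation yet
    entry = {
        "file": str(target),
        "source": source,                     # origin / provenance
        "description": description,           # business meaning and value
        "ingested_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "size_bytes": len(payload),
        "sha256_prefix": digest,
    }
    with CATALOG.open("a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")

ingest(b'{"order": 42}', source="web_orders", description="raw order events")
```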
Service-based architecture (SBA)
Service-based architecture is emerging as a way to provide data immediately while using the same source to update a complete and accurate historical data set.
The SBA approach is somewhat similar to a data warehouse:
Data is sent to an operational data store (ODS) for immediate access,
and, at the same time, to the data warehouse for historical accumulation.
Layers
Batch layer
The data lake serves as the batch layer, holding both recent and historical data.
Acceleration layer
Contains only real-time data.
Serving layer
Provides an interface that joins data from the batch and acceleration layers.
Data is loaded into both the batch and acceleration layers.
All analytical computation is performed on the data in the batch and acceleration layers; this design may need to be implemented across two separate systems.
The batch layer is usually structured as a time-variant store in which every transaction is an insert, while in the acceleration layer (often an operational data store, ODS) all transactions are updates.
This architecture avoids synchronization issues by populating the current-state and history layers at the same time (see the sketch below).
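A toy sketch of the three layers using in-memory stores; in practice the batch and acceleration layers usually live in separate systems (e.g., a data lake and an ODS), so the structures and names here are assumptions for illustration.

```python
batch_layer = []          # append-only history: every change is an insert
acceleration_layer = {}   # current state only: every change is an update

def ingest(event: dict) -> None:
    """Write each event to both layers at once, avoiding later synchronization."""
    batch_layer.append(dict(event))                 # accumulate full history
    acceleration_layer[event["id"]] = dict(event)   # keep only the latest state

def serving_query(entity_id: int) -> dict:
    """Serving layer: join current state with history for one entity."""
    history = [e for e in batch_layer if e["id"] == entity_id]
    return {"current": acceleration_layer.get(entity_id), "history": history}

ingest({"id": 7, "balance": 100})
ingest({"id": 7, "balance": 80})
print(serving_query(7)["current"]["balance"])  # 80 (current state)
print(len(serving_query(7)["history"]))        # 2 (full history)
```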
Machine learning
Machine learning is based on complex mathematical theory, especially statistics, combinatorics, and operations research.
Supervised learning
Based on generalized rules, e.g., separating SPAM from non-SPAM email (the first two modes are contrasted in the sketch after this block)
Unsupervised learning
Data mining
Based on finding hidden patterns
Reinforcement learning
Based on goal achievement without a teacher's involvement
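A small contrast of the first two modes on toy data, assuming scikit-learn is available; the data and labels are invented purely for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

X = np.array([[0.1, 0.2], [0.2, 0.1], [0.9, 0.8], [0.8, 0.9]])

# Supervised learning: labels (e.g., SPAM = 1, non-SPAM = 0) guide the rule.
y = np.array([0, 0, 1, 1])
clf = LogisticRegression().fit(X, y)
print(clf.predict([[0.85, 0.75]]))  # -> [1]

# Unsupervised learning: no labels; the algorithm finds hidden groupings.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)                   # two discovered clusters
```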
Semantic analysis
Media monitoring and text analysis are automated methods for retrieving insights from large amounts of unstructured or semi-structured data, in order to sense how people feel and think about a brand, product, service, or other kind of topic.
Natural language processing (NLP) is used to analyze phrases or sentences, detect sentiment, and reveal changes in sentiment in order to predict possible scenarios (see the sketch below).
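One common way to do this in Python is a lexicon-based scorer such as NLTK's VADER; this is an assumed tool choice (not prescribed by the DMBOK) and requires a one-time lexicon download.

```python
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)  # one-time lexicon download

sia = SentimentIntensityAnalyzer()
posts = [
    "I love the new release, support was fantastic!",
    "Terrible update, the app keeps crashing.",
]
for text in posts:
    scores = sia.polarity_scores(text)      # neg / neu / pos / compound scores
    print(round(scores["compound"], 2), text)
# Tracking the compound score over time gives a simple view of sentiment shifts.
```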
Data and text mining
Data mining is a particular kind of analysis that uses various algorithms to reveal patterns in data.
It originated as a branch of machine learning, a subfield of artificial intelligence.
Standard query and reporting tools answer specific questions, while data mining tools help discover unknown relationships by revealing patterns.
Text mining applies text analysis and data mining techniques to documents, automatically classifying content into workflow-oriented and subject-expert-oriented knowledge ontologies.
Electronic text media can thus be analyzed without being restructured or reformatted (see the sketch below).
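A minimal text-mining sketch, assuming scikit-learn and three made-up documents: TF-IDF turns raw text into numeric features, and cosine similarity then groups related content without any manual restructuring.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "invoice payment overdue account",
    "payment received thank you for your order",
    "server outage incident root cause analysis",
]

tfidf = TfidfVectorizer().fit_transform(docs)  # text -> numeric feature vectors
sims = cosine_similarity(tfidf)                # document-to-document similarity

# Documents 0 and 1 (both about payments) score closer to each other than to 2,
# which is a crude basis for automatically grouping or classifying content.
print(sims.round(2))
```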
Techniques
Profiling
Attempts to characterize the typical behavior of an individual, group, or population; used to establish behavioral norms for anomaly-detection applications.
Profiling results are inputs to many unsupervised learning components.
Data reduction
Replaces a large data set with a smaller one.
The smaller data set contains most of the important information of the larger one.
The smaller data set is easier to analyze or manipulate.
Association
Association is an unsupervised learning process that examines the elements involved in transactions and finds which elements tend to occur together.
For example, Internet recommendation engines
Clustering
Groups data elements into clusters based on their shared characteristics.
For example, customer segmentation (sketched after this list)
Self-organizing maps
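A sketch combining two of these techniques on synthetic data (assuming scikit-learn): PCA as a simple form of data reduction, followed by k-means clustering for customer segmentation. The "customer" features are invented for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic customers: spend, visits, basket size, returns (illustrative only).
customers = np.vstack([
    rng.normal([20, 2, 5, 0.5], 1.0, size=(50, 4)),    # low-value segment
    rng.normal([90, 10, 15, 1.5], 1.0, size=(50, 4)),   # high-value segment
])

# Data reduction: keep two components that retain most of the variance.
reduced = PCA(n_components=2).fit_transform(customers)

# Clustering: group customers by shared characteristics (segmentation).
segments = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(reduced)
print(np.bincount(segments))  # roughly 50 customers per discovered segment
```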
Predictive analytics
Predictive analytics is developed from probabilistic models of possible events; when additional information is received, the model triggers a response from the organization.
The simplest form of a predictive model is an estimate (see the sketch below).
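As a sketch of that simplest case, the snippet below fits a straight-line trend to hypothetical monthly sales and extrapolates one month ahead; the numbers are invented.

```python
import numpy as np

months = np.arange(1, 7)
sales = np.array([100, 108, 115, 121, 130, 138])  # hypothetical history

slope, intercept = np.polyfit(months, sales, deg=1)  # least-squares trend line
next_month_estimate = slope * 7 + intercept
print(round(next_month_estimate, 1))  # estimated sales for month 7
```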
Prescriptive analytics
Goes a step further than predictive analytics: it defines actions that will affect outcomes, rather than just predicting outcomes from actions that have already occurred.
Prescriptive analytics anticipates what will happen, when it will happen, and why it will happen.
Because prescriptive analytics can show the implications of each decision option, it can suggest how to exploit an opportunity or avoid a risk (see the sketch below).
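One way to act on this idea is a small optimization model; the sketch below uses SciPy's linear programming solver on an entirely hypothetical budget-allocation decision, as an example of recommending an action rather than just producing a forecast.

```python
from scipy.optimize import linprog

# Hypothetical decision: split a 100-unit budget between two campaigns whose
# expected returns per unit are 3.0 and 2.0, with campaign 1 capped at 60 units.
# linprog minimizes, so the returns are negated to maximize them.
result = linprog(
    c=[-3.0, -2.0],             # maximize 3*x1 + 2*x2
    A_ub=[[1, 1], [1, 0]],      # x1 + x2 <= 100, x1 <= 60
    b_ub=[100, 60],
    bounds=[(0, None), (0, None)],
)
print(result.x)     # recommended spend per campaign, e.g. [60, 40]
print(-result.fun)  # expected outcome of following the recommendation
```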
Unstructured data analysis
Unstructured data analysis is becoming increasingly important as more and more unstructured data is generated.
Some analyses cannot be performed without incorporating unstructured data into the analytical model.
However, analyzing unstructured data can be very difficult without some way of isolating the elements of interest from the irrelevant ones.
Scanning and tagging is one way to add "hooks" to unstructured data so that related structured (schema) data can be linked and filtered (see the sketch below).
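A minimal scanning-and-tagging sketch: regular expressions attach hypothetical customer and order identifiers as "hooks", so the free text can later be joined to structured records. The patterns and identifier formats are assumptions.

```python
import re

# Illustrative patterns; the tags act as hooks into structured (schema) data.
PATTERNS = {
    "customer_id": re.compile(r"\bCUST-\d{4}\b"),
    "order_id": re.compile(r"\bORD-\d{6}\b"),
}

def scan_and_tag(text: str) -> dict:
    """Scan free text and attach structured tags that can join to schema data."""
    tags = {name: pat.findall(text) for name, pat in PATTERNS.items()}
    return {"text": text, "tags": {k: v for k, v in tags.items() if v}}

note = "Customer CUST-0042 called about order ORD-001337, requesting a refund."
print(scan_and_tag(note)["tags"])
# {'customer_id': ['CUST-0042'], 'order_id': ['ORD-001337']}
```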
Operational analytics
Also known as operational BI or streaming analytics, the concept arises from integrating real-time analytics into operational processes.
Operational analytics includes tracking and integrating real-time information flows, drawing conclusions based on behavioral prediction models, and triggering automated responses and alerts.
Operational analytics solutions include preparing the historical data needed to populate the behavioral models (see the sketch below).
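A toy streaming-analytics loop: a baseline is populated from hypothetical historical readings, and each "real-time" value is scored against it to trigger an alert. The thresholds and data are illustrative assumptions.

```python
from statistics import mean, stdev

# Behavioral model populated from historical data (illustrative numbers).
history = [102, 98, 101, 97, 103, 99, 100, 98]
baseline, spread = mean(history), stdev(history)

def on_event(value: float) -> None:
    """Score each real-time reading against the model and alert on anomalies."""
    if abs(value - baseline) > 3 * spread:
        print(f"ALERT: {value} deviates from baseline {baseline:.1f}")
    else:
        print(f"ok: {value}")

for reading in [101, 99, 140]:  # simulated real-time stream
    on_event(reading)
```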
Data visualization
Visualization is the process of conveying concepts, ideas, and facts through pictures or graphical representations.
Visualization condenses and encapsulates the characteristics of data, making them easier to see.
Visualizations may be delivered in a static format (such as a published report) or in more interactive, adaptable formats.
Data mashups
Combine data and services to visually present insights or analysis results (see the sketch below).
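A small mashup-and-visualization sketch, assuming matplotlib: two hypothetical sources (internal sales and external temperature) are combined in one static chart of the kind that might appear in a published report.

```python
import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr"]
sales = [120, 135, 150, 170]   # hypothetical internal source
avg_temp = [3, 5, 10, 15]      # hypothetical external source

fig, ax1 = plt.subplots()
ax1.bar(months, sales, color="steelblue", label="Sales")
ax1.set_ylabel("Sales (units)")

ax2 = ax1.twinx()  # overlay the second source on its own axis
ax2.plot(months, avg_temp, color="darkorange", marker="o", label="Avg. temp")
ax2.set_ylabel("Average temperature")

ax1.set_title("Sales vs. temperature (illustrative mashup)")
fig.tight_layout()
plt.savefig("mashup.png")  # static output, e.g., for a published report
```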
Activities
Define big data strategy and business needs
Strategy assessment criteria
What problem is the organization trying to solve, and what needs to be analyzed?
What data sources will be used or acquired?
The timeliness and scope of the data to be provided
The impact on, and dependencies with, other data structures
The impact on existing modeled data
Select data source
Acquire and ingest data sources
Develop data hypotheses and methods
Integrate and align data for analysis
Explore data using models
Populate the predictive model
Train the model
Evaluate the model
Create data visualizations
Deploy and monitor
Reveal insights and discoveries
Iterate using additional data sources
Tools
MPP shared-nothing technologies and architecture
Distributed file-based databases
In-database algorithms
Big data cloud solutions
Statistical computing and graphical languages
Data visualization toolsets
Methods
Analytical modeling
Big data modeling
Implementation guidelines
Strategy alignment
Readiness assessment / risk assessment
Organizational and cultural change
Big data and data science governance
Visualization channel management
Data science and visualization standards
Data security
Metadata
Data quality
Metrics
Technical usage metrics
Loading and scanning metrics
Learnings and stories