DAMA-CDGA Data Governance Engineer - 13. Data Quality
Data quality management means that all data management disciplines should help improve data quality; supporting the organization's use of high-quality data should be the shared goal of every data management discipline.
13. Data quality
Introduction
Key points
1. Start with the most important data first
2. PDCA (Plan-Do-Check-Act)
3. Assess data quality dimensions
4. Root cause analysis
5. Data quality reporting
Overview
The prerequisite for realizing the value of data is that the data itself is reliable and trustworthy. In other words, the data should be of high quality.
All data management disciplines should help improve data quality, and supporting the organization's use of high-quality data should be their shared goal
Like data governance and overall data management, data quality management is not a project but an ongoing effort
Business drivers
Include:
Opportunities to increase organizational data value and data utilization
Reduce risks and costs caused by low-quality data
Improve organizational efficiency and productivity
Protect and enhance the organization's reputation
Organizations looking to derive value from their data recognize that high-quality data is more valuable than low-quality data
Using poor quality data is fraught with risks
High-quality data is not an end in itself; it is a means to organizational success.
Goals
Develop a governed approach to make data fit for purpose, based on the requirements of data consumers
Define standards and specifications for data quality control as part of the entire data life cycle
Define and implement processes for measuring, monitoring and reporting data quality levels
Principles
Importance
Data quality management focuses on the data that is most important to the business and its customers, and improvements should be prioritized based on the importance of the data and the level of risk if the data is incorrect.
Full life cycle management
Data quality management should cover the entire data life cycle, from creation or procurement through disposal.
Every link in the data chain should ensure that the data has high-quality output
Prevention
The focus of a data quality program should be on preventing errors and conditions that reduce the usability of data, not simply on correcting records after the fact
Root cause remediation
Improving data quality involves more than correcting errors: data quality problems are often rooted in process or system design, so lasting improvement requires understanding and remediating root causes and changing the processes and systems involved, not just fixing individual records
Governance
Data governance activities must support the development of high-quality data, and data quality planning activities must support and sustain a governed data environment.
Standards-driven
Quantifiable data quality requirements should be defined in the form of measurable standards and expectations
Objective measurement and transparency
Data quality levels need to be measured objectively and consistently
Embedded in business processes
Business process owners are responsible for the quality of data generated through their processes and they must implement data quality standards in their processes
System enforcement
System owners must enforce data quality standards within their systems
Connected to service levels
Data quality reporting and issue management should be incorporated into service level agreements (SLAs)
Basic concepts
Data quality
Refers to the characteristics associated with high-quality data
Also refers to the processes used to measure and improve the quality of data
High quality
Data that meets the application needs of data consumers
Low quality
Data that does not meet the application needs of data consumers
Data quality depends on the data scenario and the needs of data consumers
Critical data
Most organizations have large amounts of data, but not all data is equally important
A principle of data quality management is to focus improvements on the data that is most important to the organization and customers
Doing so clarifies the scope of the project and enables it to have a direct, measurable impact on business needs
Criteria for evaluating critical data
Regulatory reporting
Financial reporting
Business policy
Ongoing operations
Business strategy
Data quality dimensions
Dimension frameworks
Strong-Wang
Thomas Redman
Larry English
A data quality dimension is a measurable characteristic of the data
Data quality dimensions provide a set of vocabulary that defines data quality requirements
These dimension definitions enable the evaluation of initial data quality and the effectiveness of continuous improvements.
Dimensions are the basis for measurement rules
DAMA core dimensions
Completeness
The proportion of stored data against the potential for 100% completeness
Uniqueness
No entity instance is recorded more than once, based on how that entity is identified
Timeliness
The degree to which data represents reality as of the required point in time
Validity
Data is valid if it conforms to the syntax (format, type, range) of its definition
Accuracy
The degree to which data correctly describes the "real world" object or event it represents
Consistency
The absence of difference when comparing two or more representations of a thing against a definition
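To make these dimensions concrete, the following is a minimal sketch (not from DMBOK) of how a few of them could be measured with pandas; the DataFrame, column names, and rules are hypothetical.

```python
# Minimal sketch: measuring a few DAMA dimensions with pandas.
# Column names ("customer_id", "email") and the data are hypothetical.
import re
import pandas as pd

df = pd.DataFrame({
    "customer_id": [1, 2, 2, 4],
    "email": ["a@example.com", None, "not-an-email", "d@example.com"],
})

# Completeness: stored (non-null) values as a share of potential values.
completeness = df["email"].notna().mean()

# Uniqueness: no entity instance recorded more than once for its identifier.
uniqueness = df["customer_id"].nunique() / len(df)

# Validity: values conform to a defined syntax (here, a simple email pattern).
email_pattern = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
validity = df["email"].dropna().apply(lambda v: bool(email_pattern.match(v))).mean()

print(f"completeness={completeness:.2f} uniqueness={uniqueness:.2f} validity={validity:.2f}")
```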
Data quality and metadata
Metadata is critical to managing data quality
Data quality depends on how well it meets the needs of data consumers
Data quality is about meeting expectations, and metadata is the primary means of clarifying expectations
Well-managed metadata can also support improved data quality efforts
Data quality ISO standard (ISO 8000)
Data quality improvement life cycle
Improving data quality requires improving the ability to evaluate the relationship between inputs and outputs to ensure that inputs meet the requirements of the process and that outputs are as expected
Plan phase (P)
The data quality team assesses the scope, impact, and priority of known issues and evaluates options for resolving them
This phase should be based on a solid foundation of analyzing the root causes of the problems, understanding the costs/benefits in terms of their causes and impacts, establishing priorities, and developing a basic plan to address them
Do phase (D)
The data quality team is responsible for working to resolve the root cause of the problem and making plans for ongoing monitoring of the data
Check phase (C)
This phase includes active monitoring of data quality measured as required
As long as the defined quality threshold is met, no additional action is required
If the data falls below the acceptable quality threshold, additional steps must be taken to bring it up to an acceptable level
Act phase (A)
This phase refers to activities that address and resolve emerging data quality issues
The cycle begins again as the cause of the problem is assessed and a solution proposed
Continuous improvement is achieved by starting new cycles
A new cycle begins when:
Existing measurements fall below thresholds
New datasets are under investigation
New data quality requirements for existing data sets
Changes in business, standards or expectations
The cost of getting the data right the first time is far less than the cost of getting the wrong data and fixing it
The cost of introducing quality into a data management process from the beginning is less than the cost of transforming it
Data quality business rule types
Data quality business rules describe how data should exist in order to be useful and usable within the organization
These rules align with the quality dimensions and are used to express data quality requirements
Common causes of data quality issues
Problems caused by lack of leadership
Many data quality problems are caused by a lack of organizational commitment to high-quality data, which itself stems from a lack of leadership, in the form of both governance and management
Barriers to effectively managing data quality include
Lack of awareness among leaders and employees
Lack of governance
Lack of leadership and management skills
Difficulty justifying improvements
Tools for measuring value are inappropriate or do not work
Problems caused by data entry process
Problems caused by data processing functions
Problems caused by system design
Problems caused by fixing problems
Data profiling
Data profiling is a form of data analysis used to examine data and assess quality
Data profiling uses statistical techniques to discover the true structure, content, and quality of data collections
The profiling engine generates statistics that analysts can use to identify patterns in the content and structure of the data
For example
Number of null values
Max/Min
Max/Min length
Frequency distribution of individual column values
Data types and formats
While profiling is an effective way to understand data, it is only a first step toward improving data quality: it enables organizations to identify potential issues
Solving problems also requires other forms of analysis, including business process analysis, data lineage analysis and deeper data analysis that can help isolate the root cause of the problem
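As an illustration of the profiling statistics listed above, here is a minimal sketch using pandas; the helper name `profile` and the sample DataFrame are hypothetical.

```python
# Minimal column-profiling sketch using pandas; the input DataFrame is hypothetical.
import pandas as pd

def profile(df: pd.DataFrame) -> pd.DataFrame:
    """Return per-column statistics of the kind a profiling engine reports."""
    rows = []
    for col in df.columns:
        s = df[col]
        lengths = s.dropna().astype(str).str.len()
        rows.append({
            "column": col,
            "dtype": str(s.dtype),                      # data type / format
            "null_count": int(s.isna().sum()),          # number of null values
            "distinct": int(s.nunique(dropna=True)),    # cardinality
            "min": s.min(skipna=True) if s.notna().any() else None,
            "max": s.max(skipna=True) if s.notna().any() else None,
            "min_len": int(lengths.min()) if not lengths.empty else 0,
            "max_len": int(lengths.max()) if not lengths.empty else 0,
            "top_value": s.mode(dropna=True).iloc[0] if s.notna().any() else None,
        })
    return pd.DataFrame(rows)

df = pd.DataFrame({"id": [1, 2, 2, None], "city": ["Berlin", "berlin", None, "Paris"]})
print(profile(df))
```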
Data quality and data processing
While data quality improvement efforts focus on preventing errors, data quality can also be improved through certain forms of data processing
Data cleansing
Data cleansing, or data scrubbing, transforms data so that it conforms to data standards and domain rules
Cleansing involves detecting and correcting data errors to bring the quality of the data to an acceptable level
Continuously correcting data through cleansing is a costly and risky process
Ideally, as the root causes of data problems are addressed over time, the need for data cleansing should decrease
In some cases, ongoing modifications via midstream systems are also necessary because reprocessing data in midstream systems is less expensive than any other alternative
Ways to reduce the need for cleansing (see the sketch after this list)
Implement controls to prevent data entry errors
Correct data in source system
Improve business processes for data entry
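The sketch below illustrates rule-based cleansing as described in this subsection: raw values are trimmed and mapped onto standard domain values before being stored. The rule table, field names, and records are hypothetical.

```python
# Sketch of rule-based data cleansing: map raw values onto standard domain values.
# The domain rules and records below are hypothetical.
COUNTRY_STANDARDS = {"usa": "US", "u.s.a.": "US", "united states": "US", "deutschland": "DE"}

def cleanse_record(record: dict) -> dict:
    """Return a copy of the record with trimmed, standardized values."""
    cleaned = dict(record)
    # Trim stray whitespace, a common entry error.
    cleaned["name"] = (record.get("name") or "").strip()
    # Standardize the country value against the domain rule table.
    raw_country = (record.get("country") or "").strip().lower()
    cleaned["country"] = COUNTRY_STANDARDS.get(raw_country, record.get("country"))
    return cleaned

print(cleanse_record({"name": "  Ada Lovelace ", "country": "United States"}))
# {'name': 'Ada Lovelace', 'country': 'US'}
```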
Data enhancement
Data enhancement, or enrichment, is the process of adding attributes to a data set to improve its quality and usability
Example
Timestamp
Recording the date and time when a data item is created, modified or deactivated helps track historical data events and enables analysts to locate the time range of the problem.
Audit data
Auditing can record data lineage, which is important for historical tracking and verification
Reference vocabularies
Increase understanding and control of data
Contextual information
Add context and tags to data for review and analysis
Geographic information
Geographic information can be enhanced through address standardization and geocoding, such as area codes, municipalities, neighborhoods, and latitude/longitude
Demographic information
Customer data can be enhanced with demographic information such as age, marital status, gender, and income
Psychographic information
Used to segment target groups by specific behaviors, habits, and preferences
Valuation information
Use this kind of enhancement for asset valuation, inventory, sales data, and more
Use this enhancement for asset valuations, inventory, sales data, and more
Data parsing and formatting
Data parsing is the analytical process of interpreting the contents or values of an object using predetermined rules
First, data analysts define a set of patterns. Then, these patterns are recorded in a rules engine that is used to distinguish valid and invalid data values. The rules engine matches specific patterns to trigger actions.
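A minimal sketch of the pattern-based rules-engine idea described above: predefined patterns classify values as valid or invalid and trigger an action on a mismatch. The patterns, fields, and action are hypothetical.

```python
# Sketch of a tiny parsing rules engine: predefined patterns classify values
# as valid or invalid and trigger an action. Patterns and fields are hypothetical.
import re

RULES = {
    "phone": re.compile(r"^\+?\d{7,15}$"),
    "postal_code": re.compile(r"^\d{5}$"),
}

def parse_value(field: str, value: str) -> bool:
    """Return True if the value matches the pattern registered for the field."""
    pattern = RULES[field]
    if pattern.match(value):
        return True
    # Action triggered when a value fails to match its registered pattern.
    print(f"invalid {field}: {value!r}")
    return False

parse_value("phone", "+4915112345678")   # True
parse_value("postal_code", "1234A")      # False, triggers the logging action
```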
Data transformation and standardization
During normal processing, data rules can be used to convert data into a format readable by the target architecture
Activities
Define high-quality data
Define data quality strategy
Identify key data and business rules
Perform an initial data quality assessment
Identify and prioritize areas for improvement
Define data quality improvement goals
Develop and deploy data quality operations
Manage data quality rules
Measure and monitor data quality
Develop operational procedures for managing data issues
Develop data quality service level agreement
Write data quality reports
Tools
Data profiling tools
Data profiling tools generate high-level statistics that allow analysts to identify patterns in the data and make initial assessments of quality characteristics
Profiling tools are particularly important for data discovery efforts, enabling the evaluation of large data sets
Profiling tools, enhanced with data visualization capabilities, will aid the discovery process
Data query tools
Data profiling is only the first step in data analysis and helps identify potential problems
Data quality team members also need to query the data more deeply to answer questions raised by the analysis results and find patterns that can provide insight into the root causes of data problems.
Modeling and ETL tools
The tools used to model data and create ETL processes have a direct impact on data quality
If they are used with data quality in mind, these tools can produce higher-quality data
If they are used blindly without understanding the data, they can have harmful effects
Data quality team members should collaborate with development teams to address data quality risks and leverage effective modeling and data processing tools to ensure the organization has access to higher quality data
Data quality rule template
Rule templates give analysts the opportunity to capture customer expectations for data and help bridge the communication gap between business and technical teams
Continuously developing consistent rules simplifies the process of translating business requirements into code.
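One possible way to capture a rule template (an illustrative sketch, not a DMBOK-prescribed format) is as a small declarative structure that records the business expectation, the dimension it measures, and the acceptability threshold; all names below are hypothetical.

```python
# Illustrative data quality rule template: a declarative record that both
# business and technical staff can read. Field names are hypothetical.
from dataclasses import dataclass
from typing import Callable

@dataclass
class QualityRule:
    rule_id: str
    dimension: str                   # e.g. completeness, validity, uniqueness
    business_statement: str          # the expectation in business language
    check: Callable[[list], float]   # returns a quality score between 0 and 1
    threshold: float                 # acceptability threshold for the score

email_completeness = QualityRule(
    rule_id="DQ-001",
    dimension="completeness",
    business_statement="Every customer record must have an email address.",
    check=lambda values: sum(v is not None for v in values) / max(len(values), 1),
    threshold=0.98,
)

score = email_completeness.check(["a@example.com", None, "c@example.com"])
print(score >= email_completeness.threshold)  # False: 0.67 is below the 0.98 threshold
```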
Metadata repositories
Defining data quality requires metadata, and definitions of high-quality data are themselves a valuable kind of metadata
Methods
Preventive actions
The best way to create high-quality data is to prevent low-quality data from entering the organization
Preventive actions stop known errors from occurring; inspecting data after the fact does not improve its quality
Preventive approaches include:
Establish data entry controls
Training data producers
Define and enforce rules
Require data providers to provide high-quality data
Implement data governance and management systems
Develop formal change control
Corrective actions
After a problem occurs and is detected, corrective actions are implemented
Data quality problems should be solved systematically and fundamentally to minimize the cost and risk of corrective measures.
Methods for performing data corrections
Automated correction
Automated correction techniques include rule-based standardization, normalization, and correction
The modified value is obtained or automatically generated and submitted without manual intervention.
Autocorrect requires an environment with good standards, generally accepted rules, and known error patterns
Manual inspection and correction
Correct data using automated tools, but have a person review the corrections before they are committed to persistent storage
Corrections that score above a defined confidence level may be committed without review; corrections that score below it are submitted to the data steward for review and approval
Manual correction
Manual correction is the only option when there is a lack of tools, insufficient automation, or when it is determined that changes can be better handled through human oversight.
Making changes and committing updates directly in the production environment without a documented process is very risky and should be avoided
Quality check and audit code modules
Create shareable, linkable, reusable code modules that developers can pull from the repository to repeat data quality checks and auditing processes
Well-designed code modules can prevent many data quality issues, and at the same time, they ensure consistent execution of the process
If reporting of specific quality results is required by law or policy, it is often necessary to describe the lineage of the results, and the Quality Inspection module can provide this functionality.
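A sketch of a reusable quality-check module along these lines: each check returns an audit record of what was checked, when, and with what result, so the lineage of quality results can be reported. The check name, dataset, and predicate are hypothetical.

```python
# Sketch of a reusable quality-check module: each check returns an audit record
# (what was checked, when, and the result) that can be stored for lineage reporting.
# The check name and dataset are hypothetical.
from datetime import datetime, timezone

def run_check(check_name: str, dataset_name: str, records: list, predicate) -> dict:
    """Apply a row-level predicate to every record and return an audit entry."""
    failures = [r for r in records if not predicate(r)]
    return {
        "check": check_name,
        "dataset": dataset_name,
        "run_at": datetime.now(timezone.utc).isoformat(),
        "records_checked": len(records),
        "records_failed": len(failures),
        "passed": not failures,
    }

orders = [{"order_id": 1, "amount": 10.0}, {"order_id": 2, "amount": -5.0}]
audit = run_check("non_negative_amount", "orders", orders, lambda r: r["amount"] >= 0)
print(audit)
```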
Characteristics of effective data quality metrics
Measurability
Data quality metrics must be measurable – it must be something that can be quantified
Business relevance
While many things are measurable, not all can be converted into useful metrics
If a metric cannot be related to some aspect of business operations or performance, it has limited value
Each data quality metric should be tied to the impact of the data on key business expectations
Acceptability
Determine whether data meets business expectations based on specified acceptability thresholds
If the score equals or exceeds the threshold, the data quality meets business expectations
If the score is below the threshold, it is not satisfied
Accountability/Stewardship
Key stakeholders are notified when a metric's measurement results indicate that quality does not meet expectations
The business data owner is accountable, and the data steward takes appropriate corrective action
Controllability
Metrics should reflect controllable aspects of the business
In other words, if a metric moves outside its acceptable range, it should trigger action to improve the data
Trend analysis
Metrics enable organizations to measure data quality improvements over time
Tracking helps data quality team members monitor activities within the scope of data quality SLAs and data sharing agreements and demonstrate the effectiveness of improvement activities
Once the measurement process is stable, statistical process control techniques can be used to detect whether changes in the measurement results and in the underlying processes are predictable
Statistical process control
Statistical Process Control (SPC) is a method of managing processes by analyzing changes in measured values of process inputs, outputs, or steps.
SPC is based on the assumption that when a process with consistent inputs is executed consistently, it will produce consistent outputs. It uses measures of central tendency (how values cluster around a central value, such as the mean, median, or mode) and of variability around that central value (such as range, variance, and standard deviation) to establish tolerances for variation within a process
The main tool used in SPC is the control chart, which is a time series graph that includes a center line for the mean (a measure of central tendency) and upper and lower control limits that describe the measurement (the variability around the central value)
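A minimal sketch of the control-chart calculation: compute the center line (mean) and three-sigma upper and lower control limits from a series of measurements, then flag any points outside the limits. The measurement series is hypothetical.

```python
# Sketch of a control-chart calculation: center line (mean) plus upper/lower
# control limits at three standard deviations. The measurements are hypothetical.
from statistics import mean, stdev

measurements = [98.2, 97.9, 98.5, 98.1, 97.8, 98.3, 95.0, 98.0]  # e.g. daily completeness %

center_line = mean(measurements)
sigma = stdev(measurements)
ucl = center_line + 3 * sigma  # upper control limit
lcl = center_line - 3 * sigma  # lower control limit

out_of_control = [m for m in measurements if m > ucl or m < lcl]
print(f"center={center_line:.2f} UCL={ucl:.2f} LCL={lcl:.2f} out_of_control={out_of_control}")
```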
Root Cause Analysis
A root cause is a factor that, once eliminated, makes the problem itself disappear
Root cause analysis is the process of understanding the factors that contribute to a problem and how they work
The purpose is to identify underlying conditions that, once removed, will cause the problem to disappear
Common root cause analysis techniques include Pareto analysis (the 80/20 rule), fishbone diagram analysis, track and trace, process analysis, and the 5 Whys
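A sketch of Pareto (80/20) analysis applied to a data quality issue log: count issues by root cause and identify the few causes that account for roughly 80% of all issues. The issue log below is hypothetical.

```python
# Sketch of Pareto (80/20) analysis on a data quality issue log: find the few
# root causes that account for most of the issues. The issue log is hypothetical.
from collections import Counter

issue_causes = (
    ["missing validation at entry"] * 42
    + ["duplicate customer load"] * 23
    + ["stale reference table"] * 18
    + ["manual spreadsheet upload"] * 9
    + ["timezone conversion bug"] * 8
)

counts = Counter(issue_causes).most_common()
total = sum(n for _, n in counts)

cumulative = 0
vital_few = []
for cause, n in counts:
    cumulative += n
    vital_few.append((cause, n, cumulative / total))
    if cumulative / total >= 0.8:
        break  # these causes cover ~80% of all issues

for cause, n, cum_share in vital_few:
    print(f"{cause}: {n} issues, cumulative share {cum_share:.0%}")
```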
Implementation Guide
Readiness Assessment/Risk Assessment
Organizational and cultural change
Data quality and data governance
Data quality policy
Metrics
Return on investment
Levels of quality
Data quality trends
Data Issue Management Metrics
Service level consistency
Data quality plan diagram