MindMap Gallery Introduction to Data Science
This mind map introduces data science, covering problems and goals, data acquisition, exploratory data analysis, modeling and performance evaluation, and results display.
Edited at 2023-11-13 09:42:58
Introduction to Data Science
introduction
basic concept
Data: Any real-world thing, or relationship between things, that is represented and recorded quantitatively or qualitatively can become data. Data is the carrier of information.
Big data: large volume, high velocity (fast generation and high timeliness), wide variety of types, high veracity, and low value density
Data Science: An interdisciplinary field that applies scientific methods, processes, algorithms, and systems to extract knowledge and insights from structured and unstructured data in its various forms.
Data science project process
Define the problem (pin it down precisely), set goals (clear, specific, verifiable, and quantifiable), acquire data, perform exploratory data analysis, build models, evaluate performance (against the metrics of a null model as a baseline), present results, and deploy models
Problems and goals
User-level problems and goals
A user problem is generally a specific problem in the real world
Data Science Problems and Goals
The key is to abstract real-world problems into data science problems
From the perspective of data science, real-world problems can be abstracted into: classification, prediction, ranking and scoring, correlation, feature extraction, clustering, etc. (the focus here is on distinguishing classification, which assigns items to predefined labeled categories, from clustering, which groups unlabeled items by similarity)
data collection
Prerequisite design and data plan design
Premise
Data plan design
Feasibility analysis of data acquisition
Determine data composition
Population and Sampling
Population and individual
Sample
Unbiased sampling
Sampling bias
Confounding factors and A/B Testing
Confounding factors and Simpson's paradox
To obtain reliable results from data, unbiased sampling alone is not enough; special attention must also be paid to the impact of confounding factors.
Confounding factors: factors that are not the object of the investigation but may still affect the results
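As a small illustration of how a confounding factor can produce Simpson's paradox, the sketch below uses hypothetical counts invented for this example (they are not from any study). Treatment A has the higher success rate within each severity group, yet treatment B looks better once the groups are pooled, because A was mostly given to severe cases.

```python
# Hypothetical counts, invented for illustration only: (successes, trials)
# for two treatments, split by a confounding factor (case severity).
counts = {
    "mild":   {"A": (90, 100),   "B": (800, 1000)},
    "severe": {"A": (300, 1000), "B": (20, 100)},
}

def rate(successes, trials):
    """Success rate as a fraction between 0 and 1."""
    return successes / trials

# Success rate within each stratum of the confounder.
per_group = {
    group: {t: rate(*st) for t, st in treatments.items()}
    for group, treatments in counts.items()
}

# Success rate after pooling the strata, i.e. ignoring the confounder.
overall = {}
for t in ("A", "B"):
    s = sum(counts[g][t][0] for g in counts)
    n = sum(counts[g][t][1] for g in counts)
    overall[t] = rate(s, n)

for group, rates in per_group.items():
    print(group, rates)
print("overall", overall)
```

Comparing within strata, as done in `per_group`, is one simple way to control for a known confounder.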
Double-blind experiment and A/B Testing
A/B testing is a specially designed comparative experiment that observes how different values of one variable (usually only two options) affect the outcome while all other features are matched (held consistent).
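One common way to analyze the outcome of such an experiment is a two-proportion z-test. The mind map does not say which test to use, so treat this as one possible choice; the function name and the conversion counts below are hypothetical.

```python
import math

def two_proportion_z_test(success_a, n_a, success_b, n_b):
    """Two-sided z-test for the difference between two conversion rates."""
    p_a = success_a / n_a
    p_b = success_b / n_b
    # Pooled rate under the null hypothesis that A and B perform the same.
    p_pool = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal CDF.
    normal_cdf = 0.5 * (1 + math.erf(abs(z) / math.sqrt(2)))
    p_value = 2 * (1 - normal_cdf)
    return z, p_value

# Hypothetical experiment: variant A converts 100/1000, variant B 150/1000.
z, p_value = two_proportion_z_test(100, 1000, 150, 1000)
print(f"z = {z:.2f}, p = {p_value:.4f}")
```

The normal approximation assumes reasonably large samples in both groups; for small counts an exact test would be a safer choice.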
Python basic syntax
exploratory data analysis
Data check
Meaning and scale of the data
Feature data types and meanings
Preliminary exclusion of data leakage
Data preprocessing
Missing value handling
Outlier handling
Redundancy handling
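The three preprocessing steps above (missing values, anomalous values, redundant records) can be sketched with the standard library alone. The records are hypothetical, and the specific choices here (dropping rows rather than imputing, the 1.5 × IQR rule for outliers) are assumptions made for illustration, not the only reasonable options.

```python
import statistics

# Hypothetical (id, measurement) records with one missing value,
# one exact duplicate, and one extreme value.
records = [
    ("a", 10.0), ("b", 12.0), ("c", None), ("d", 11.0),
    ("d", 11.0),
    ("e", 13.0), ("f", 500.0), ("g", 12.0), ("h", 11.0),
]

# 1. Missing values: drop incomplete records (imputation is an alternative).
complete = [r for r in records if r[1] is not None]

# 2. Redundancy: remove exact duplicates while preserving order.
deduplicated = list(dict.fromkeys(complete))

# 3. Outliers: keep values within 1.5 * IQR of the quartiles.
values = [v for _, v in deduplicated]
q1, _, q3 = statistics.quantiles(values, n=4)
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
cleaned = [r for r in deduplicated if low <= r[1] <= high]

print(cleaned)
```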
Descriptive statistics
Measures of location
Measures of dispersion
Graphical descriptive statistics
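A minimal sketch of the descriptive statistics listed above, using Python's `statistics` module on a small made-up sample. The text histogram at the end is only a stand-in for a real plot.

```python
import statistics
from collections import Counter

# Made-up sample, for illustration only.
values = [2, 4, 4, 4, 5, 5, 7, 9]

# Measures of location: where the data is centered.
location = {
    "mean": statistics.mean(values),
    "median": statistics.median(values),
    "mode": statistics.mode(values),
}

# Measures of dispersion: how spread out the data is.
dispersion = {
    "range": max(values) - min(values),
    "population_variance": statistics.pvariance(values),
    "population_std": statistics.pstdev(values),
}

print(location)
print(dispersion)

# Crude text histogram as a graphical summary of the value counts.
for value, count in sorted(Counter(values).items()):
    print(f"{value:>2} | {'#' * count}")
```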
Modeling and performance evaluation
statistical modeling
Common probability density functions
Parameter Estimation
Hypothesis testing
p-hacking
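One way to see why p-hacking is a problem: if many independent hypotheses are tested at the same significance level and all null hypotheses are actually true, the chance of at least one spurious "significant" result grows quickly. The sketch below assumes independent tests; the function name is hypothetical.

```python
def family_wise_error_rate(alpha, n_tests):
    """Probability of at least one false positive among n independent tests,
    assuming every null hypothesis is true."""
    return 1 - (1 - alpha) ** n_tests

for n_tests in (1, 5, 20, 100):
    print(n_tests, round(family_wise_error_rate(0.05, n_tests), 3))
```

Corrections such as Bonferroni (testing each hypothesis at `alpha / n_tests`) are one standard response to this effect.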
regression model
linear regression model
Linear regression model performance evaluation
Linear regression and linear correlation
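The three linear regression topics above can be tied together in a short sketch: fit a simple (one-feature) least-squares line, evaluate it with R², and compare R² with the Pearson correlation coefficient r. The data points are made up, and the sketch covers only the single-feature case, where R² equals r².

```python
import math

# Made-up data, roughly following y = 2x.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

mean_x = sum(xs) / len(xs)
mean_y = sum(ys) / len(ys)
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
s_xx = sum((x - mean_x) ** 2 for x in xs)
s_yy = sum((y - mean_y) ** 2 for y in ys)

# Ordinary least squares for y = intercept + slope * x.
slope = s_xy / s_xx
intercept = mean_y - slope * mean_x

# Performance evaluation: R^2 = 1 - SS_residual / SS_total.
ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
r_squared = 1 - ss_res / s_yy

# Linear correlation: Pearson's r.
r = s_xy / math.sqrt(s_xx * s_yy)

print(f"slope={slope:.3f} intercept={intercept:.3f}")
print(f"R^2={r_squared:.4f} r^2={r ** 2:.4f}")
```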
logistic regression
Training set/test set split
One-hot encoding for non-numeric input features
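The two preparation steps above can be sketched without any third-party library. The helper names, the 25% test ratio, and the fixed seed are assumptions made for illustration; in practice a library such as scikit-learn provides equivalents.

```python
import random

def one_hot_encode(values):
    """Map each category to a 0/1 vector with a single 1."""
    categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}
    encoded = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1
        encoded.append(row)
    return categories, encoded

def train_test_split(rows, test_ratio=0.25, seed=0):
    """Shuffle a copy of the rows and split off a held-out test set."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_ratio))
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical non-numeric feature.
colors = ["red", "green", "red", "blue"]
categories, encoded = one_hot_encode(colors)
print(categories)
print(encoded)

train, test = train_test_split(list(range(8)))
print(len(train), len(test))
```

The category list should be learned from the training set only and then reused for the test set, so that both are encoded with the same columns.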
Naive Bayes model
Bayes theorem
Gaussian model
Multinomial model
Bernoulli model
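Bayes' theorem, on which the models above are built, can be illustrated with a single binary feature. The probabilities below are hypothetical. A naive Bayes classifier extends this by multiplying the per-feature likelihoods, under the assumption that the features are conditionally independent given the class.

```python
def posterior(prior, likelihood, false_positive_rate):
    """P(class | evidence) via Bayes' theorem for a binary class.

    prior:               P(class)
    likelihood:          P(evidence | class)
    false_positive_rate: P(evidence | not class)
    """
    evidence = likelihood * prior + false_positive_rate * (1 - prior)
    return likelihood * prior / evidence

# Hypothetical numbers: 1% of messages are spam, a keyword appears in
# 95% of spam and in 5% of non-spam messages.
p_spam_given_keyword = posterior(prior=0.01, likelihood=0.95,
                                 false_positive_rate=0.05)
print(round(p_spam_given_keyword, 3))
```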
Performance evaluation of classification models
confusion matrix
Metric trade-offs
Application examples
Parameter discrimination performance evaluation
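As a sketch of how a confusion matrix and the usual metrics derived from it are computed for a binary classifier, the example below uses made-up labels and predictions. The choice of accuracy, precision, recall, and F1 is an assumption about which metrics the mind map has in mind.

```python
# Made-up ground truth and predictions (1 = positive, 0 = negative).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives

# Rows are actual classes, columns are predicted classes.
confusion_matrix = [[tn, fp],
                    [fn, tp]]

accuracy = (tp + tn) / len(pairs)
precision = tp / (tp + fp)   # of predicted positives, how many are right
recall = tp / (tp + fn)      # of actual positives, how many are found
f1 = 2 * precision * recall / (precision + recall)

print(confusion_matrix)
print(accuracy, precision, recall, f1)
```

Precision and recall usually pull against each other as the decision threshold moves, which is the trade-off between metrics mentioned above.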
decision tree
How decision trees work
Modeling process of decision tree for classification tasks
Classification decision tree application examples
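A classification tree is grown by repeatedly choosing the split that makes the child nodes purest. The mind map does not state which purity criterion it uses, so the sketch below assumes entropy and information gain (Gini impurity is a common alternative). The labels are made up.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((count / total) * math.log2(count / total)
                for count in Counter(labels).values())

def information_gain(parent, children):
    """Entropy reduction achieved by splitting parent into children."""
    total = len(parent)
    weighted = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted

parent = ["yes", "yes", "no", "no"]
perfect_split = [["yes", "yes"], ["no", "no"]]
useless_split = [["yes", "no"], ["yes", "no"]]

gain_perfect = information_gain(parent, perfect_split)
gain_useless = information_gain(parent, useless_split)
print(gain_perfect, gain_useless)
```

At each node the tree builder would evaluate every candidate split this way and keep the one with the highest gain.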
Supervised learning model and unsupervised learning model
K-means model
Two basic concepts
K-means iterative algorithm
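The K-means iteration alternates two steps: assign each point to its nearest centroid, then move each centroid to the mean of its assigned points. Below is a minimal sketch on made-up 2-D points. Using the first k points as initial centroids is a simplification chosen here to keep the example deterministic; real implementations usually use random or k-means++ initialization.

```python
import math

def k_means(points, k, max_iterations=100):
    # Simplifying assumption: take the first k points as initial centroids.
    centroids = list(points[:k])
    clusters = []
    for _ in range(max_iterations):
        # Assignment step: each point joins its nearest centroid's cluster.
        clusters = [[] for _ in range(k)]
        for point in points:
            nearest = min(range(k),
                          key=lambda i: math.dist(point, centroids[i]))
            clusters[nearest].append(point)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = []
        for i, cluster in enumerate(clusters):
            if cluster:
                new_centroids.append(
                    tuple(sum(axis) / len(cluster) for axis in zip(*cluster)))
            else:
                new_centroids.append(centroids[i])  # keep an empty cluster's centroid
        if new_centroids == centroids:  # converged: nothing moved
            break
        centroids = new_centroids
    return centroids, clusters

# Made-up points forming two well-separated groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = k_means(points, k=2)
print(centroids)
print(clusters)
```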
Bias-variance trade-off
Bias-variance dilemma
Overfitting and underfitting
K-fold cross validation
Grid search for parameters
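K-fold cross validation and grid search are often combined: each candidate parameter value is scored by its average validation accuracy over the k folds, and the best-scoring value is kept. The sketch below uses a tiny 1-D nearest-neighbour classifier as a stand-in model; the model, the data, and all function names are hypothetical choices made for illustration.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) for k contiguous, roughly equal folds."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size

def knn_predict(train_x, train_y, x, n_neighbors):
    """Majority vote among the n_neighbors closest training points."""
    order = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    votes = [train_y[i] for i in order[:n_neighbors]]
    return max(set(votes), key=votes.count)

def cross_val_accuracy(xs, ys, n_neighbors, k=5):
    scores = []
    for train_idx, val_idx in k_fold_indices(len(xs), k):
        train_x = [xs[i] for i in train_idx]
        train_y = [ys[i] for i in train_idx]
        correct = sum(knn_predict(train_x, train_y, xs[i], n_neighbors) == ys[i]
                      for i in val_idx)
        scores.append(correct / len(val_idx))
    return sum(scores) / len(scores)

def grid_search(xs, ys, grid, k=5):
    """Score every candidate value and return the best one with all scores."""
    results = {value: cross_val_accuracy(xs, ys, value, k) for value in grid}
    best = max(results, key=results.get)
    return best, results

# Made-up, already shuffled data: class 0 near 1-3, class 1 near 7-9.
xs = [1, 9, 2, 8, 3, 7, 1.5, 8.5, 2.5, 7.5]
ys = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]
grid = [1, 3, 5]  # odd neighbour counts avoid tied votes
best, results = grid_search(xs, ys, grid)
print(best, results)
```

The folds here are contiguous slices, so the data must be shuffled beforehand; otherwise a fold could contain only one class.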
Ensemble learning
Condorcet Jury Theorem
Decision tree ensembles
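The Condorcet Jury Theorem gives the intuition behind ensembles: if each of n independent voters is correct with probability p > 0.5, the majority vote is correct more often than any single voter, and increasingly so as n grows. The sketch below computes that probability directly from the binomial distribution; independence between voters is the key assumption, and real ensemble members only approximate it.

```python
import math

def majority_correct_probability(n_voters, p_correct):
    """Probability that a strict majority of n independent voters is correct.

    n_voters is assumed to be odd so that ties cannot occur.
    """
    majority = n_voters // 2 + 1
    return sum(
        math.comb(n_voters, k) * p_correct ** k * (1 - p_correct) ** (n_voters - k)
        for k in range(majority, n_voters + 1)
    )

for n_voters in (1, 3, 11, 101):
    print(n_voters, round(majority_correct_probability(n_voters, 0.6), 4))
```

With p below 0.5 the effect reverses and the majority becomes less reliable than a single voter, which is why the individual models in an ensemble need to be better than chance.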
Results display
Tailor the result display to the target audience
Visualization in the presentation process