Quantitative Methods, CFA Level 2
CFA Level 2 Quantitative Methods (5%-10% exam weight) mind map, covering introduction to linear regression, multiple linear regression, time series analysis, machine learning, and big data.
Edited at 2023-09-13 19:57:14
Quantitative Methods 5%-10%
Introduction to Linear Regression
Basic assumptions
x and y have a linear relationship
x is uncorrelated with the residual term
The expected value of the residual is 0
The variance of the residual term is constant for all observations (homoskedasticity)
Residual terms are independently distributed
Residuals are normally distributed
Residual assumptions
regression model
"^" indicates predicted value
Intercept: represents risk-adjusted return (ex-post alpha)
Slope coefficient: market risk (beta)
SSE: sum of squared errors, the sum of squared residuals (actual value − predicted value); linear regression finds the line that minimizes SSE
The regression line passes through the point of sample means (x̄, ȳ)
Parametric test
index
Standard error of estimate (SEE)
Standard error of estimate: measures the dispersion of y around the regression line, i.e., the goodness of fit; the smaller, the better
(By contrast, the standard error of a sample mean measures the dispersion of sample means across repeated samples, reflecting how well the sample mean represents the population mean)
Coefficient of determination: The percentage of changes in y that can be explained by x
For simple linear regression, it equals the square of the correlation coefficient
This does not hold for multiple regression
ANOVA, analysis of variance
SST, total sum of squares: measures the total variation of the actual values around the mean; Σ(actual value − mean)²
RSS, regression sum of squares: measures the variation in y explained by x (the part captured by the regression); Σ(predicted value − mean)²
SSE, sum of squared errors: measures the unexplained variation; Σ(actual value − predicted value)²; the gap between actual and predicted values is not explained by the regression equation
SST = RSS + SSE
Residual standard deviation, the degree to which the actual observed values deviate from the regression line
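The ANOVA decomposition above can be sketched with a simple OLS fit; the data values below are hypothetical and only illustrate that SST = RSS + SSE.

```python
# Sketch of the ANOVA decomposition SST = RSS + SSE for a simple
# linear regression, using small hypothetical data.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.1, 9.8]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# OLS slope and intercept minimize SSE
b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
      / sum((xi - x_bar) ** 2 for xi in x))
b0 = y_bar - b1 * x_bar
y_hat = [b0 + b1 * xi for xi in x]

sst = sum((yi - y_bar) ** 2 for yi in y)                # total variation
rss = sum((yh - y_bar) ** 2 for yh in y_hat)            # explained by x
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))   # unexplained

r_squared = rss / sst  # coefficient of determination
```

With these numbers the decomposition holds to floating-point precision, and R² is the share of variation in y explained by x.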
Disadvantages of regression analysis
Parameters are unstable and linear relationships may change over time
Other market participants using the same model limits the effectiveness of the model
The assumptions of regression analysis must hold; otherwise there will be heteroskedasticity (non-constant residual variance) and autocorrelation (residual terms not independent)
multiple linear regression
Model
Intercept: the value of y when all x equal 0
Slope: holding the other x constant, the change in y caused by a one-unit change in that x
Parameter significance test
Test statistics
Hypothesis test: the test statistic follows t(n − k − 1)
n = number of observations; k = number of independent variables; 1 = the intercept
Compare the calculated test statistic with the critical value obtained by looking up the table to draw a conclusion
p-value
Compare the p-value with the significance level α: if p-value < α, reject the null hypothesis. If the exam gives a p-value, use it first.
confidence interval
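The coefficient t-test above can be sketched numerically; the coefficient, standard error, and sample size below are hypothetical, and 2.045 is the two-tailed 5% critical value for 29 degrees of freedom.

```python
# Sketch of the coefficient t-test: t = (b_hat - 0) / se(b_hat),
# compared against the t(n - k - 1) critical value.
# All numbers are hypothetical illustrations.
b_hat, se = 0.24, 0.10   # estimated slope and its standard error
n, k = 32, 2             # observations and independent variables

t_stat = (b_hat - 0) / se   # null hypothesis: b = 0
df = n - k - 1              # 32 - 2 - 1 = 29 degrees of freedom
t_crit = 2.045              # two-tailed 5% critical value, df = 29

reject_null = abs(t_stat) > t_crit  # 2.4 > 2.045 -> reject
```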
F(k, n-k-1) test
Mainly used in multiple linear regression to test whether at least one x significantly explains Y
One-tailed test
In multiple linear regression, R² increases as the number of x in the regression equation increases
dummy variables
Takes discrete values such as "yes"/"no" (coded 1/0)
Dummy variable trap: for a category with n values, only n − 1 dummy variables are needed
The intercept represents the value of the omitted category
The slope represents the change in the dependent variable y for that category relative to the omitted category
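The n − 1 dummy coding can be sketched as follows; the quarterly categories are a hypothetical example, with Q1 as the omitted category captured by the intercept.

```python
# Dummy-variable trap sketch: a categorical variable with n = 4
# values needs only n - 1 = 3 dummies; the omitted category (Q1)
# corresponds to all dummies being zero. Hypothetical categories.
quarters = ["Q1", "Q2", "Q3", "Q4"]
dummy_categories = quarters[1:]  # omit Q1

def encode(q):
    # Return the dummy vector (d_Q2, d_Q3, d_Q4) for observation q.
    return tuple(1 if q == c else 0 for c in dummy_categories)
```

For example, `encode("Q1")` is all zeros (the omitted category), while `encode("Q3")` sets only the Q3 dummy.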
violation of assumptions
Heteroskedasticity
Definition: Residual variances are different between sample points
type
Unconditional heteroskedasticity: unrelated to changes in x; no significant impact on the regression
Conditional heteroskedasticity: the residual variance changes as x changes; significant impact on statistical inference
Influence
Detection
Method 1: Scatter Plot
Method 2: Breusch–Pagan chi-square test
correct
Method 1: compute White-corrected standard errors, also called robust or heteroskedasticity-consistent standard errors
Method 2: use generalized least squares (GLS)
Serial correlation (i.e., autocorrelation)
Definition: Correlation between residuals, common in time series
type
Positive serial correlation: Positive regression error in the current period increases the probability of positive regression error in the next period
Negative serial correlation: Positive regression error in the current period increases the probability of negative regression error in the next period
Influence
Detection
Residual scatter plot
DW (Durbin–Watson) statistic: DW ≈ 2(1 − r)
r is the correlation coefficient between current- and previous-period residuals
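The DW statistic and its relation to the lag-1 residual correlation can be sketched as follows; the residual series is hypothetical.

```python
# Sketch of the Durbin-Watson statistic and the approximation
# DW ~ 2(1 - r), using hypothetical residuals.
resid = [0.5, 0.3, 0.4, -0.2, -0.1, 0.2, -0.3, 0.1]

# DW = sum of squared successive differences / sum of squared residuals
dw = (sum((resid[t] - resid[t - 1]) ** 2 for t in range(1, len(resid)))
      / sum(e ** 2 for e in resid))

# lag-1 autocorrelation of the residuals (simplified estimator)
r = (sum(resid[t] * resid[t - 1] for t in range(1, len(resid)))
     / sum(e ** 2 for e in resid))

# DW near 2 -> no serial correlation; toward 0 -> positive serial
# correlation; toward 4 -> negative serial correlation
```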
correct
Method 1: Adjust standard errors: if there is only heteroskedasticity, use white-corrected standard errors; if there is autocorrelation or both, use the Hansen method
Method 2: Improve the model, such as adding time characteristics, such as seasons
Multicollinearity
Definition: Correlation between independent variables or combinations of independent variables
type
perfect multicollinearity
A variable can be expressed by a linear combination of other explanatory variables
Unable to estimate coefficients using OLS method
Imperfect multicollinearity
There is a high degree of correlation between two or more independent variables
OLS can still be used, but the coefficient estimates become unreliable (their standard errors are inflated)
Influence
Does not affect the unbiasedness of the coefficient estimate β̂1, but inflates var(β̂1)
Produces type II errors, common in economic models
Detection
The t-tests find no coefficient significantly different from 0, yet the F-test is significant and R² is high
A high correlation between the x variables suggests multicollinearity is likely; but a low correlation does not rule it out, since a linear combination of the x variables may still be correlated
correct
Drop one or more of the correlated independent variables, or use stepwise regression
model misspecification
Influence
Statistical inference of estimated coefficients is wrong
The estimated coefficients are not consistent
type
Functional form errors
Missing important variables
Wrong functional form
Improperly pooling different samples of data
The independent variable is related to the residual term
The independent variable contains the lagged term of the dependent variable
The independent variable is some functional form of the dependent variable
There is bias in the measurement of independent variables
Time series setting error
Model setting principles
There must be an economic rationale for the model, to avoid data-mining bias
The variable function form must conform to the actual characteristics of the variable data
Parsimonious: effective and simple
Meets 6 major assumptions
Passes monitoring on out-of-sample data
qualitative dependent variable
dummy variable
regression method
Probit model
Logit model
Estimates the probability that the dependent variable takes the value 1
Discriminant analysis (discriminant models)
e.g., Z-score
time series analysis
trend model
Linear Trend Model (Inflation)
The variable grows by a constant amount each period; uses a linear model
Logarithmic linear trend model (stock price & stock index)
The variable grows at a constant rate; uses a log-linear model
limitation
Log-linear models are not suitable for autocorrelated data
autoregressive model, AR
definition
Predict the current y using one or more past y's
covariance stationary
Conditions for establishment
Expectations are constant and finite
Variance is constant and finite
The covariance between leading and lagging values is constant and finite
cyclical
serial correlation test
The regression assumption needs to be met: there is no serial correlation in the residual terms
autocorrelation coefficient autocorrelation
k-order autocorrelation coefficient: the correlation coefficient between time series y at time t and time t-k
Test whether the autocorrelation coefficient of each order between the residual terms is significantly different from 0
Build and estimate AR(1) models
Calculate the correlation coefficient between residual terms
Test whether the correlation coefficients of each order of the residuals are significantly different from 0
T is the period number-1
Mean reversion
Below the mean, the series rises toward the mean; above the mean, it falls toward the mean
Mean-reverting level: for AR(1), x̄ = b0 / (1 − b1)
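The mean-reverting level of an AR(1) model x_t = b0 + b1·x_{t−1} is b0 / (1 − b1), defined only when |b1| < 1; the coefficients below are hypothetical.

```python
# Mean-reverting level sketch for an AR(1) model
# x_t = b0 + b1 * x_{t-1}; the level is b0 / (1 - b1),
# defined only when |b1| < 1. Hypothetical coefficients.
b0, b1 = 1.5, 0.7
level = b0 / (1 - b1)  # 1.5 / 0.3 = 5.0

# Iterating the model (with zero shocks) from any starting point
# converges to the mean-reverting level
x = 0.0
for _ in range(100):
    x = b0 + b1 * x
```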
Model prediction
RMSE (root mean squared error): the lower, the better
Coefficients estimated over different sample periods differ; the model is unstable
random walk
Does not have mean reversion properties
definition
Random walk with drift
nature
The mean-reverting level b0 / (1 − b1) is undefined (tends to infinity as b1 → 1)
unit root
Not covariance stationary
Detect covariance stationarity
solve
First differencing
Then apply an AR(1) autoregressive model to the differenced series y
unit root
Determine whether the time series is stationary
In the AR(1) model, if the absolute value of b1 is greater than or equal to 1, the time series is not stationary
Dickey Fuller test
If the differenced time series is stationary, then the statistical inference conclusion obtained through the AR(1) model is reliable
Null hypothesis: There is a unit root
seasonal factors
A pattern that repeats every year; a seasonal lag must be added to the AR model
If the lag-4 t-statistic is significantly different from 0, lag 4 exhibits seasonality and must be added to the model
The model is still AR(1) with a seasonal lag, not AR(2)
Autoregressive conditional heteroskedasticity, ARCH model
The variance of the current period's residual depends on the variance of the previous period's residual; the standard errors of the AR model coefficients and the hypothesis tests are then inaccurate
To address this, introduce the ARCH model
ARCH(1) regression model: use the squared residual at t − 1 to predict the residual variance at t: σ²ₜ = a0 + a1·ε²ₜ₋₁
Null hypothesis: a1 = 0 (no ARCH effect)
cointegrated
Two time series are related to common macro variables and have the same and unchanged trends
long term relationship
Use one time series to predict another time series
Use the DF-EG (Engle–Granger) test for cointegration. Null hypothesis: the residual has a unit root. Rejecting the null means the error term is covariance stationary and the series are cointegrated; linear regression can then be used to model the relationship between the two time series.
machine learning
Classification
Supervised learning
Penalized regression
Regularization
LASSO regression
Support vector machine, SVM
Suitable for regression and classification problems
Idea: maximize the margin between classes, forming a separating hyperplane
K-nearest neighbor, KNN
Idea: target x belongs to the most common category among its nearest neighbors
classification and regression tree, CART
Branches bifurcate at each split
Ensemble learning and random forest
Voting Classification
Bootstrap aggregation, Bagging
Sample n times (with replacement) to train n models
Helps prevent overfitting; averaging over the n samples reduces the influence of low-probability events
random forest
Multiple CART voting
Unsupervised learning
Principal component analysis, PCA
Dimensionality reduction, orthogonal decomposition
hierarchical clustering
Divisive clustering: top-down clustering
Agglomerative clustering: bottom-up clustering
The distance between similar samples should be as small as possible, and the distance between different categories should be as large as possible
K-means
Partitions the data into k non-overlapping clusters
step
Select k centroids
Calculate the distance between each data point and the centroid and classify it into the closest class
Update the centroid, defined as the mean point of different classes in the previous step
Stop updating if the changes are small
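The steps above can be sketched as a minimal one-dimensional k-means; the data points and initial centroids are hypothetical, with k = 2.

```python
# Minimal sketch of the k-means steps above, in one dimension
# with hypothetical data and k = 2.
data = [1.0, 1.2, 0.8, 8.0, 8.3, 7.9]
centroids = [0.0, 10.0]  # step 1: choose k initial centroids

for _ in range(20):  # iterate; stop early once centroids stop moving
    # step 2: assign each point to the nearest centroid
    clusters = [[] for _ in centroids]
    for x in data:
        nearest = min(range(len(centroids)),
                      key=lambda j: abs(x - centroids[j]))
        clusters[nearest].append(x)
    # step 3: update each centroid to the mean of its cluster
    updated = [sum(c) / len(c) if c else centroids[j]
               for j, c in enumerate(clusters)]
    if updated == centroids:  # step 4: stop when changes vanish
        break
    centroids = updated
```

On this data the centroids converge to the means of the two obvious groups.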
Deep learning
layered
input layer
output layer
Hidden layer
feature
Activation function
Weight value of each layer
hyperparameters
Reinforcement learning: learning from one's own mistakes
A reward-and-punishment system for action outcomes trains the model
e.g., AlphaGo
Model evaluation
Overfitting
Underfitting
Evaluate error rate
data set
Training set (training model)
within sample
Validation set (validation and debugging model)
Test set (evaluating model on new data)
out of sample
Errors
Bias error
In-sample (training set); associated with underfitting
Variance error
Out-of-sample (validation set); associated with overfitting
Model complexity ↑ → variance error ↑, bias error ↓
Base error
Error from random noise in the data
Big Data
feature
3V: high volume, wide variety of sources, and high velocity of data generation; sometimes a fourth V, veracity (accuracy)
Structured data modeling
Conceptualize the modeling task
Data collection
Data preparation and wrangling
Prepare
Incompleteness
Missing values
Inaccuracy
Inconsistency
Conflicting data
Non-uniformity
Formats are not standardized
Duplicate data
Wrangling (transformation)
Extraction
Construct new variables from existing ones
Aggregation
Combine variables into a new variable
Filtration
Remove unneeded data rows
Selection
Remove unneeded data columns
Conversion
Convert to the appropriate data type
Outlier handling
More than 3 standard deviations from the mean
Beyond 3 × IQR
IQR: the difference between the 75th and 25th percentiles
identify
deal with
Trimming: remove outliers
Winsorization: Replace outliers with the maximum and minimum values of non-outliers
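The two treatments above can be sketched side by side; the sample below is hypothetical, with 100 as the outlier and 1-5 as the non-outlier minimum and maximum.

```python
# Sketch of trimming vs. winsorization (hypothetical sample in
# which 100 is the outlier).
data = [1, 2, 3, 4, 5, 100]
lo, hi = 1, 5  # non-outlier minimum and maximum, assumed identified

trimmed = [x for x in data if lo <= x <= hi]      # remove outliers
winsorized = [min(max(x, lo), hi) for x in data]  # cap outliers at lo/hi
```

Trimming shrinks the sample; winsorization keeps its size but caps extreme values.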
Data normalization
Normalization: min-max rescaling to the range [0, 1]
Standardization: z-score using the mean and standard deviation
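The two scaling methods can be contrasted on hypothetical data:

```python
# Sketch contrasting normalization (min-max rescaling to [0, 1])
# with standardization (z-scores), using hypothetical data.
data = [2.0, 4.0, 6.0, 8.0]

lo, hi = min(data), max(data)
normalized = [(x - lo) / (hi - lo) for x in data]  # rescale to [0, 1]

mean = sum(data) / len(data)
std = (sum((x - mean) ** 2 for x in data) / len(data)) ** 0.5
standardized = [(x - mean) / std for x in data]    # z-scores, mean 0
```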
Data exploration
Exploratory data analysis, EDA
data visualization
mean, variance, etc.
Feature selection
Iteratively select the most influential features
A trade-off between model explanatory power and algorithm speed
feature engineering
Build features
One-hot encoding: categorical data is converted into binary (0/1) columns, like dummy variables
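One-hot encoding can be sketched as follows; the color categories are a hypothetical example.

```python
# One-hot encoding sketch: each category becomes its own 0/1
# column (hypothetical categorical data).
colors = ["red", "green", "red", "blue"]
vocab = sorted(set(colors))  # the category columns, alphabetical

one_hot = [[1 if c == v else 0 for v in vocab] for c in colors]
```

Each row has exactly one 1, marking that observation's category.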
Training model
Model selection
Consider supervised/unsupervised learning, the data type, and the data size
Numerical data: CART; text data: generalized linear models (GLMs) or SVM; image data: deep learning models
Performance evaluation
Tuning
For an imbalanced data set, use oversampling or undersampling
Unstructured data modeling
Text Analysis: Determining Input and Output
Data curation
Text data preparation and organization
Prepare
Remove HTML tags, punctuation, numbers, and whitespace characters
Wrangling (tidying up)
Convert text to lowercase
Remove stop words
Stemming
Reduce words to their root
Lemmatization
e.g., doing → do
Bag-of-words, BOW: an unordered collection of words
Text feature analysis
Document term matrix: rows are documents, columns are words, and each cell holds the number of times the word appears in the document
N-gram: n adjacent words grouped as one token; a 2-gram takes two words at a time, so a 3-word sentence yields 2 bigrams
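The n-gram counting rule can be sketched as follows; the sentence is a hypothetical example showing that 3 tokens yield 2 bigrams.

```python
# N-gram sketch: a sequence of L tokens yields L - n + 1 n-grams,
# so a 3-word sentence produces 2 bigrams. Hypothetical sentence.
def ngrams(tokens, n):
    # slide a window of length n across the token list
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["stock", "prices", "rise"]
bigrams = ngrams(tokens, 2)
```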
text exploration
EDA
term frequency; word cloud, etc.
Feature selection
feature engineering
Training model
Model evaluation
error analysis
Confusion matrix
ROC, receiver operating characteristic
RMSE, root mean square error
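The error-analysis tools above can be illustrated with the standard metrics derived from a confusion matrix; the counts below are hypothetical, and precision/recall are standard additions not spelled out in the outline.

```python
# Standard metrics derived from a confusion matrix
# (hypothetical true/false positive and negative counts).
tp, fp, fn, tn = 8, 2, 2, 8

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)   # of predicted positives, share correct
recall = tp / (tp + fn)      # of actual positives, share found
```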
Model tuning
Balancing variance/bias, regularization, grid search, ceiling analysis (ceiling analysis evaluates each step of the modeling pipeline to find where improvement helps most)