Quantitative Methods, CFA Level 2
CFA Level 2 Quantitative Methods (5%-10% exam weight) mind map, covering introduction to linear regression, multiple linear regression, time series analysis, machine learning, and big data.
Edited at 2023-09-13 19:57:14
Quantitative Methods 5%-10%
Introduction to Linear Regression
Basic assumptions
x and y have a linear relationship
x is uncorrelated with the residual term
The expected value of the residual is 0
The variance of the residual term is constant for all observations (homoskedasticity)
Residual terms are independently distributed
Residuals are normally distributed
Residual assumptions
regression model
"^" indicates predicted value
Intercept: represents risk-adjusted return (ex-post alpha)
Slope coefficient: market risk (beta)
SSE: sum of squared errors, the sum of squared residuals (actual value − predicted value); linear regression finds the line that minimizes SSE
The regression line passes through the point of sample means (x̄, ȳ)
Parametric test
index
Standard error of estimate (SEE)
Standard error of estimate: measures the dispersion of y around the regression line, i.e., the goodness of fit; the smaller, the better
(By contrast, the standard error of a sample mean measures the dispersion of sample means across repeated samples, reflecting how well the sample mean represents the population mean)
Coefficient of determination: The percentage of changes in y that can be explained by x
For simple linear regression, it equals the square of the correlation coefficient
This does not hold for multiple regression
ANOVA, analysis of variance
SST, total sum of squares: measures the total variation of the actual values around the mean; Σ(actual value − mean)²
RSS, regression sum of squares: measures the variation in y explained by x (the part captured by the regression); Σ(predicted value − mean)²
SSE, sum of squared errors: measures the unexplained variation; Σ(actual value − predicted value)²; the gap between actual and predicted values is not explained by the regression equation
SST = RSS + SSE
Residual standard deviation, the degree to which the actual observed values deviate from the regression line
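The ANOVA decomposition above can be sketched with a simple OLS fit; the data values below are hypothetical and only illustrate that SST = RSS + SSE.

```python
# Sketch of the ANOVA decomposition SST = RSS + SSE for a simple
# linear regression, using small hypothetical data.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.1, 9.8]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# OLS slope and intercept minimize SSE
b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
      / sum((xi - x_bar) ** 2 for xi in x))
b0 = y_bar - b1 * x_bar
y_hat = [b0 + b1 * xi for xi in x]

sst = sum((yi - y_bar) ** 2 for yi in y)                # total variation
rss = sum((yh - y_bar) ** 2 for yh in y_hat)            # explained by x
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))   # unexplained

r_squared = rss / sst  # coefficient of determination
```

With these numbers the decomposition holds to floating-point precision, and R² is the share of variation in y explained by x.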
Disadvantages of regression analysis
Parameters are unstable and linear relationships may change over time
Other market participants using the same model limits the effectiveness of the model
The assumptions of regression analysis must hold; otherwise there will be heteroskedasticity (non-constant residual variance) and autocorrelation (residual terms not independent)
multiple linear regression
Model
Intercept: the value of y when all x equal 0
Slope: holding the other x constant, the change in y caused by a one-unit change in that x
Parameter significance test
Test statistics
Hypothesis test: the test statistic follows t(n − k − 1)
n = number of observations; k = number of independent variables; 1 = the intercept
Compare the calculated test statistic with the critical value obtained by looking up the table to draw a conclusion
p-value
Compare the p-value with the significance level α: if p-value < α, reject the null hypothesis. If the exam gives a p-value, use it first.
confidence interval
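The coefficient t-test above can be sketched numerically; the coefficient, standard error, and sample size below are hypothetical, and 2.045 is the two-tailed 5% critical value for 29 degrees of freedom.

```python
# Sketch of the coefficient t-test: t = (b_hat - 0) / se(b_hat),
# compared against the t(n - k - 1) critical value.
# All numbers are hypothetical illustrations.
b_hat, se = 0.24, 0.10   # estimated slope and its standard error
n, k = 32, 2             # observations and independent variables

t_stat = (b_hat - 0) / se   # null hypothesis: b = 0
df = n - k - 1              # 32 - 2 - 1 = 29 degrees of freedom
t_crit = 2.045              # two-tailed 5% critical value, df = 29

reject_null = abs(t_stat) > t_crit  # 2.4 > 2.045 -> reject
```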
F(k, n-k-1) test
Mainly used in multiple linear regression to test whether at least one x significantly explains Y
One-tailed test
In multiple linear regression, R² increases as the number of x in the regression equation increases
dummy variables
Takes discrete values such as "yes"/"no" (coded 1/0)
Dummy variable trap: for a category with n values, only n − 1 dummy variables are needed
The intercept represents the value of the omitted category
The slope represents the change in the dependent variable y for that category relative to the omitted category
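The n − 1 dummy coding can be sketched as follows; the quarterly categories are a hypothetical example, with Q1 as the omitted category captured by the intercept.

```python
# Dummy-variable trap sketch: a categorical variable with n = 4
# values needs only n - 1 = 3 dummies; the omitted category (Q1)
# corresponds to all dummies being zero. Hypothetical categories.
quarters = ["Q1", "Q2", "Q3", "Q4"]
dummy_categories = quarters[1:]  # omit Q1

def encode(q):
    # Return the dummy vector (d_Q2, d_Q3, d_Q4) for observation q.
    return tuple(1 if q == c else 0 for c in dummy_categories)
```

For example, `encode("Q1")` is all zeros (the omitted category), while `encode("Q3")` sets only the Q3 dummy.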
violation of assumptions
Heteroskedasticity
Definition: Residual variances are different between sample points
type
Unconditional heteroskedasticity: unrelated to changes in x; no significant impact on the regression
Conditional heteroskedasticity: the residual variance changes as x changes; significant impact on statistical inference
Influence
Detection
Method 1: Scatter Plot
Method 2: Breusch–Pagan chi-square test
correct
Method 1: compute White-corrected standard errors, also called robust or heteroskedasticity-consistent standard errors
Method 2: use generalized least squares (GLS)
Serial correlation (i.e., autocorrelation)
Definition: Correlation between residuals, common in time series
type
Positive serial correlation: Positive regression error in the current period increases the probability of positive regression error in the next period
Negative serial correlation: Positive regression error in the current period increases the probability of negative regression error in the next period
Influence
Detection
Residual scatter plot
DW (Durbin–Watson) statistic: DW ≈ 2(1 − r)
r is the correlation coefficient between current- and previous-period residuals
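The DW statistic and its relation to the lag-1 residual correlation can be sketched as follows; the residual series is hypothetical.

```python
# Sketch of the Durbin-Watson statistic and the approximation
# DW ~ 2(1 - r), using hypothetical residuals.
resid = [0.5, 0.3, 0.4, -0.2, -0.1, 0.2, -0.3, 0.1]

# DW = sum of squared successive differences / sum of squared residuals
dw = (sum((resid[t] - resid[t - 1]) ** 2 for t in range(1, len(resid)))
      / sum(e ** 2 for e in resid))

# lag-1 autocorrelation of the residuals (simplified estimator)
r = (sum(resid[t] * resid[t - 1] for t in range(1, len(resid)))
     / sum(e ** 2 for e in resid))

# DW near 2 -> no serial correlation; toward 0 -> positive serial
# correlation; toward 4 -> negative serial correlation
```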
correct
Method 1: Adjust standard errors: if there is only heteroskedasticity, use white-corrected standard errors; if there is autocorrelation or both, use the Hansen method
Method 2: Improve the model, such as adding time characteristics, such as seasons
Multicollinearity
Definition: Correlation between independent variables or combinations of independent variables
type
perfect multicollinearity
A variable can be expressed by a linear combination of other explanatory variables
Unable to estimate coefficients using OLS method
Imperfect multicollinearity
There is a high degree of correlation between two or more independent variables
OLS can still be used, but the coefficient estimates become unreliable (their standard errors are inflated)
Influence
Does not affect the unbiasedness of the coefficient estimate β̂1, but inflates var(β̂1)
Produces type II errors, common in economic models
Detection
The t-tests find no coefficient significantly different from 0, yet the F-test is significant and R² is high
A high correlation between the x variables suggests multicollinearity is likely; but a low correlation does not rule it out, since a linear combination of the x variables may still be correlated
correct
Drop one or more of the correlated independent variables, or use stepwise regression
model misspecification
Influence
Statistical inference of estimated coefficients is wrong
The estimated coefficients are not consistent
type
Functional form errors
Missing important variables
Wrong functional form
Improperly pooling different samples of data
The independent variable is related to the residual term
The independent variable contains the lagged term of the dependent variable
The independent variable is some functional form of the dependent variable
There is bias in the measurement of independent variables
Time series setting error
Model setting principles
There must be an economic rationale for the model, to avoid data-mining bias
The variable function form must conform to the actual characteristics of the variable data
Parsimonious: effective and simple
Meets 6 major assumptions
Passes monitoring on out-of-sample data
qualitative dependent variable
dummy variable
regression method
Probit model
Logit model
Estimates the probability that the dependent variable takes the value 1
Discriminant analysis (discriminant models)
e.g., Z-score
time series analysis
trend model
Linear Trend Model (Inflation)
The variable grows by a constant amount each period; uses a linear model
Logarithmic linear trend model (stock price & stock index)
The variable grows at a constant rate; uses a log-linear model
limitation
Log-linear models are not suitable for autocorrelated data
autoregressive model, AR
definition
Predict the current y using one or more past y's
covariance stationary
Conditions for establishment
Expectations are constant and finite
Variance is constant and finite
The covariance between leading and lagging values is constant and finite
cyclical
serial correlation test
The regression assumption needs to be met: there is no serial correlation in the residual terms
autocorrelation coefficient autocorrelation
k-order autocorrelation coefficient: the correlation coefficient between time series y at time t and time t-k
Test whether the autocorrelation coefficient of each order between the residual terms is significantly different from 0
Build and estimate AR(1) models
Calculate the correlation coefficient between residual terms
Test whether the correlation coefficients of each order of the residuals are significantly different from 0
T is the period number-1
Mean reversion
Below the mean, the series rises toward the mean; above the mean, it falls toward the mean
Mean-reverting level: for AR(1), x̄ = b0 / (1 − b1)
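The mean-reverting level of an AR(1) model x_t = b0 + b1·x_{t−1} is b0 / (1 − b1), defined only when |b1| < 1; the coefficients below are hypothetical.

```python
# Mean-reverting level sketch for an AR(1) model
# x_t = b0 + b1 * x_{t-1}; the level is b0 / (1 - b1),
# defined only when |b1| < 1. Hypothetical coefficients.
b0, b1 = 1.5, 0.7
level = b0 / (1 - b1)  # 1.5 / 0.3 = 5.0

# Iterating the model (with zero shocks) from any starting point
# converges to the mean-reverting level
x = 0.0
for _ in range(100):
    x = b0 + b1 * x
```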
Model prediction
RMSE (root mean squared error): the lower, the better
Coefficients estimated over different sample periods differ; the model is unstable
random walk
Does not have mean reversion properties
definition
Random walk with drift
nature
The mean-reverting level b0 / (1 − b1) is undefined (tends to infinity as b1 → 1)
unit root
Not covariance stationary
Detect covariance stationarity
solve
First differencing
Then apply an AR(1) autoregressive model to the differenced series y
unit root
Determine whether the time series is stationary
In the AR(1) model, if the absolute value of b1 is greater than or equal to 1, the time series is not stationary
Dickey Fuller test
If the differenced time series is stationary, then the statistical inference conclusion obtained through the AR(1) model is reliable
Null hypothesis: There is a unit root
seasonal factors
A pattern that repeats every year; a seasonal lag must be added to the AR model
If the lag-4 t-statistic is significantly different from 0, lag 4 exhibits seasonality and must be added to the model
The model is still AR(1) with a seasonal lag, not AR(2)
Autoregressive conditional heteroskedasticity, ARCH model
The variance of the current period's residual depends on the variance of the previous period's residual; the standard errors of the AR model coefficients and the hypothesis tests are then inaccurate
To address this, introduce the ARCH model
ARCH(1) regression model: use the squared residual at t − 1 to predict the residual variance at t: σ²ₜ = a0 + a1·ε²ₜ₋₁
Null hypothesis: a1 = 0 (no ARCH effect)
cointegrated
Two time series are related to common macro variables and have the same and unchanged trends
long term relationship
Use one time series to predict another time series
Use the DF-EG (Engle–Granger) test for cointegration. Null hypothesis: the residual has a unit root. Rejecting the null means the error term is covariance stationary and the series are cointegrated; linear regression can then be used to model the relationship between the two time series.
machine learning
Classification
Supervised learning
Penalized regression
Regularization
LASSO regression
Support vector machine, SVM
Suitable for regression and classification problems
Idea: maximize the margin between classes, forming a separating hyperplane
K-nearest neighbor, KNN
Idea: target x belongs to the most common category among its nearest neighbors
classification and regression tree, CART
Branches bifurcate at each split
Ensemble learning and random forest
Voting Classification
Bootstrap aggregation, Bagging
Sample n times (with replacement) to train n models
Helps prevent overfitting; averaging over the n samples reduces the influence of low-probability events
random forest
Multiple CART voting
Unsupervised learning
Principal component analysis, PCA
Dimensionality reduction, orthogonal decomposition
hierarchical clustering
Divisive clustering: top-down clustering
Agglomerative clustering: bottom-up clustering
The distance between similar samples should be as small as possible, and the distance between different categories should be as large as possible
K-means
Partitions the data into k non-overlapping clusters
step
Select k centroids
Calculate the distance between each data point and the centroid and classify it into the closest class
Update the centroid, defined as the mean point of different classes in the previous step
Stop updating if the changes are small
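The steps above can be sketched as a minimal one-dimensional k-means; the data points and initial centroids are hypothetical, with k = 2.

```python
# Minimal sketch of the k-means steps above, in one dimension
# with hypothetical data and k = 2.
data = [1.0, 1.2, 0.8, 8.0, 8.3, 7.9]
centroids = [0.0, 10.0]  # step 1: choose k initial centroids

for _ in range(20):  # iterate; stop early once centroids stop moving
    # step 2: assign each point to the nearest centroid
    clusters = [[] for _ in centroids]
    for x in data:
        nearest = min(range(len(centroids)),
                      key=lambda j: abs(x - centroids[j]))
        clusters[nearest].append(x)
    # step 3: update each centroid to the mean of its cluster
    updated = [sum(c) / len(c) if c else centroids[j]
               for j, c in enumerate(clusters)]
    if updated == centroids:  # step 4: stop when changes vanish
        break
    centroids = updated
```

On this data the centroids converge to the means of the two obvious groups.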
Deep learning
layered
input layer
output layer
Hidden layer
feature
Activation function
Weight value of each layer
hyperparameters
Reinforcement learning: learning from one's own mistakes
A reward-and-punishment system for action outcomes trains the model
e.g., AlphaGo
Model evaluation
Overfitting
Underfitting
Evaluate error rate
data set
Training set (training model)
within sample
Validation set (validation and debugging model)
Test set (evaluating model on new data)
out of sample
Errors
Bias error
In-sample (training set); associated with underfitting
Variance error
Out-of-sample (validation set); associated with overfitting
Model complexity ↑ → variance error ↑, bias error ↓
Base error
Error from random noise in the data
Big Data
feature
3V: high volume, wide variety of sources, and high velocity of data generation; sometimes a fourth V, veracity (accuracy)
Structured data modeling
Conceptualize the modeling task
Data collection
Data preparation and wrangling
Prepare
Incompleteness
Missing values
Inaccuracy
Inconsistency
Conflicting data
Non-uniformity
Formats are not standardized
Duplicate data
Wrangling (transformation)
Extraction
Construct new variables from existing ones
Aggregation
Combine variables into a new variable
Filtration
Remove unneeded data rows
Selection
Remove unneeded data columns
Conversion
Convert to the appropriate data type
Outlier handling
More than 3 standard deviations from the mean
Beyond 3 × IQR
IQR: the difference between the 75th and 25th percentiles
identify
deal with
Trimming: remove outliers
Winsorization: Replace outliers with the maximum and minimum values of non-outliers
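The two treatments above can be sketched side by side; the sample below is hypothetical, with 100 as the outlier and 1-5 as the non-outlier minimum and maximum.

```python
# Sketch of trimming vs. winsorization (hypothetical sample in
# which 100 is the outlier).
data = [1, 2, 3, 4, 5, 100]
lo, hi = 1, 5  # non-outlier minimum and maximum, assumed identified

trimmed = [x for x in data if lo <= x <= hi]      # remove outliers
winsorized = [min(max(x, lo), hi) for x in data]  # cap outliers at lo/hi
```

Trimming shrinks the sample; winsorization keeps its size but caps extreme values.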
Data normalization
Normalization: min-max rescaling to the range [0, 1]
Standardization: z-score using the mean and standard deviation
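The two scaling methods can be contrasted on hypothetical data:

```python
# Sketch contrasting normalization (min-max rescaling to [0, 1])
# with standardization (z-scores), using hypothetical data.
data = [2.0, 4.0, 6.0, 8.0]

lo, hi = min(data), max(data)
normalized = [(x - lo) / (hi - lo) for x in data]  # rescale to [0, 1]

mean = sum(data) / len(data)
std = (sum((x - mean) ** 2 for x in data) / len(data)) ** 0.5
standardized = [(x - mean) / std for x in data]    # z-scores, mean 0
```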
Data exploration
Exploratory data analysis, EDA
data visualization
mean, variance, etc.
Feature selection
Iteratively select the most influential features
A trade-off between model explanatory power and algorithm speed
feature engineering
Build features
One-hot encoding: categorical data is converted into binary (0/1) columns, like dummy variables
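One-hot encoding can be sketched as follows; the color categories are a hypothetical example.

```python
# One-hot encoding sketch: each category becomes its own 0/1
# column (hypothetical categorical data).
colors = ["red", "green", "red", "blue"]
vocab = sorted(set(colors))  # the category columns, alphabetical

one_hot = [[1 if c == v else 0 for v in vocab] for c in colors]
```

Each row has exactly one 1, marking that observation's category.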
Training model
Model selection
Consider supervised/unsupervised learning, the data type, and the data size
Numerical data: CART; text data: generalized linear models (GLMs) or SVM; image data: deep learning models
Performance evaluation
Tuning
For an imbalanced data set, use oversampling or undersampling
Unstructured data modeling
Text Analysis: Determining Input and Output
Data curation
Text data preparation and organization
Prepare
Remove HTML tags, punctuation, numbers, and whitespace characters
Wrangling (tidying up)
Convert text to lowercase
Remove stop words
Stemming
Reduce words to their root
Lemmatization
e.g., doing → do
Bag-of-words, BOW: an unordered collection of words
Text feature analysis
Document term matrix: rows are documents, columns are words, and each cell holds the number of times the word appears in the document
N-gram: n adjacent words grouped as one token; a 2-gram takes two words at a time, so a 3-word sentence yields 2 bigrams
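The n-gram counting rule can be sketched as follows; the sentence is a hypothetical example showing that 3 tokens yield 2 bigrams.

```python
# N-gram sketch: a sequence of L tokens yields L - n + 1 n-grams,
# so a 3-word sentence produces 2 bigrams. Hypothetical sentence.
def ngrams(tokens, n):
    # slide a window of length n across the token list
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["stock", "prices", "rise"]
bigrams = ngrams(tokens, 2)
```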
text exploration
EDA
term frequency; word cloud, etc.
Feature selection
feature engineering
Training model
Model evaluation
error analysis
Confusion matrix
ROC, receiver operating characteristic
RMSE, root mean square error
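The error-analysis tools above can be illustrated with the standard metrics derived from a confusion matrix; the counts below are hypothetical, and precision/recall are standard additions not spelled out in the outline.

```python
# Standard metrics derived from a confusion matrix
# (hypothetical true/false positive and negative counts).
tp, fp, fn, tn = 8, 2, 2, 8

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)   # of predicted positives, share correct
recall = tp / (tp + fn)      # of actual positives, share found
```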
Model tuning
Balancing variance/bias, regularization, grid search, ceiling analysis (ceiling analysis evaluates each step of the modeling pipeline to find where improvement helps most)