MindMap Gallery: Econometrics
Econometrics is a branch of economics that takes mathematical economics and mathematical statistics as its methodological basis. It attempts to synthesize the theoretical and empirical quantitative approaches to economic problems.
Edited at 2024-10-15 00:35:43
econometrics
Satisfies the classical assumptions
Econometrics Overview
definition
Econometrics is a science that is based on economic theory, uses mathematical statistical methods and computer technology as tools, and conducts quantitative analysis of economic phenomena, economic relationships and economic laws based on actual observational data.
The science of studying the laws of economic activities and their applications using quantitative methods.
Study with variance
China Statistical Yearbook
It is the product of the combination of economics, statistics, and mathematics.
Macroeconomics and microeconomics are largely qualitative; there, research need only establish, for example, that a negative correlation exists
Require
Master the methodology of econometric analysis
Able to establish and apply simple econometric models to analyze real economic problems
Initial mastery of EViews 6
effect
Estimating economic relationships
Not only qualitative, but also quantitative
What are the characteristics of the demand for enterprise products? Is price elastic? Is it inelastic? How effective is the price strategy?
Garment factory pricing
The econometric model must establish the relationship between demand and price and must be subject to profit maximization.
What is the impact of advertising on sales revenue and profits?
What is the marginal propensity to consume of urban residents in China?
What factors affect the salary of newly graduated college students? Major? School reputation?
Hypothesis testing
Is there really gender discrimination in wages?
Have government policies to encourage fuel-efficient vehicles significantly reduced fuel consumption across the country?
Does the taxi price increase really improve the situation of taxi drivers?
predict
Manufacturers need to make product sales forecasts
The government needs to make economic development forecasts
Cities need to forecast transportation development and energy demand
Stock market players need to make stock price predictions and risk estimates
Origin and development
produce
In a market economy, there are intricate relationships between market entities. To survive and develop in fierce competition, companies must have reliable market forecasts.
If the government wants to intervene in the operation of the national economy, it needs to analyze economic trends in a timely manner.
develop
●Computer applications
●Model variables and equations
From few to many: models have grown in scale, and there is a trend toward merging separate models into an overall model
●New breakthroughs in theories and methods
In addition to the classical linear econometric model, research fields now include nonlinear models, rational expectations models, variable-parameter models, nonparametric and semiparametric models, dynamic models, time series models, cointegration theory, Bayesian methods, small-sample theory, and more.
●Expansion of application fields
Applications in macroeconomic and microeconomic fields have shifted from prediction to more testing of economic theoretical assumptions and policy assumptions.
Research steps
a statement of an economic theory or hypothesis
Keynes believed that, generally speaking, people tend to increase their consumption as their income increases, but consumption does not increase as much as income. In economic terms: the marginal propensity to consume, that is, the change in consumption for every unit change in income, is greater than 0 and less than 1.
Build mathematical (mathematical economics) models
deterministic relationship
Y=β0+β1*X, X: income, Y: consumption, 0<β1<1
Build statistical or econometric models
Related relationships
Y = β0 + β1X + μ
The functional form of the model should be chosen based on the specific data pattern
μ: the random error term, capturing the statistical (rather than exact) relationship between household consumption expenditure and disposable income. Its presence means that even at the same disposable income, consumption expenditure retains some randomness. μ also stands in for all other explanatory variables. β0, β1 and the variance of μ are the parameters to be estimated.
Collect and process sample data;
Personal consumption expenditures and gross domestic product data of the United States from 1980 to 1991
Parameter estimation of econometric models;
Obtained with EViews software
The estimates of β0 and β1 are -231.8 and 0.719 respectively: Ŷ = -231.8 + 0.719X, se: (94.53) (0.022); R² = 0.991; F = 1094.1
Why estimate?
Generally speaking, the parameters (β0, β1) are unknown and cannot be directly observed.
Due to the existence of random terms, the parameters cannot be calculated accurately and need to be estimated by selecting an appropriate method through sample observation values.
(How to estimate the parameters of the overall model through sample observations is the core content of econometrics)
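As a concrete illustration of estimating population parameters from a sample, here is a minimal Python sketch. The income figures and true coefficients are invented for illustration; they are not the US 1980-1991 data used in the text.

```python
import numpy as np

# Hypothetical sample: disposable income X and consumption Y for 12 households
rng = np.random.default_rng(0)
X = np.linspace(800, 4100, 12)                  # disposable income
Y = 200 + 0.7 * X + rng.normal(0, 50, X.size)   # consumption; true beta1 = 0.7

# Least-squares fit of Y = b0 + b1*X
b1, b0 = np.polyfit(X, Y, 1)
print(f"Y_hat = {b0:.1f} + {b1:.3f} X")
```

The estimated slope should land near the true marginal propensity to consume of 0.7, and in particular inside the (0, 1) interval Keynes' theory requires.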
Parameter estimation method
Econometrics has developed a wealth of parameter estimation methods.
Mainly composed of Two major categories
Classical estimation method
least squares estimate
maximum likelihood estimation
Simultaneous equation estimation method, etc.
instrumental variable method
Bayesian estimation method (not covered in this course)
You may take the postgraduate entrance examination
Test hypotheses from models
Purpose
Evaluate the model and estimated parameters to determine whether they are theoretically meaningful and statistically significant.
The conclusion may merely be an accidental result of sampling
May violate basic assumptions of econometric estimation
economic significance test
Check whether each parameter is consistent with economic theory and actual experience
In the consumption function example, Ŷ = -231.8 + 0.719X: is 0 < β1 < 1 satisfied?
ln(per capita food demand) = -2.0 + 0.5·ln(per capita income) - 4.5·ln(food price) + 0.8·ln(other commodity prices)
Statistical test
Determined by mathematical statistics theory
include
Goodness of fit test: R2
Overall significance test: F
Variable significance test: t
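These three statistics can all be computed directly from an OLS fit. A minimal Python sketch on synthetic data (all numbers invented for illustration):

```python
import numpy as np

# Synthetic sample; true slope is 1.5
rng = np.random.default_rng(1)
n = 30
X = rng.uniform(0, 10, n)
Y = 2.0 + 1.5 * X + rng.normal(0, 1.0, n)

# Simple OLS fit
b1 = ((X - X.mean()) * (Y - Y.mean())).sum() / ((X - X.mean()) ** 2).sum()
b0 = Y.mean() - b1 * X.mean()
Yhat = b0 + b1 * X
e = Y - Yhat

TSS = ((Y - Y.mean()) ** 2).sum()     # total sum of squares
ESS = ((Yhat - Y.mean()) ** 2).sum()  # explained (regression) sum of squares
RSS = (e ** 2).sum()                  # residual sum of squares

R2 = ESS / TSS                        # goodness of fit
sigma2 = RSS / (n - 2)                # unbiased estimate of the error variance
se_b1 = np.sqrt(sigma2 / ((X - X.mean()) ** 2).sum())
t_b1 = b1 / se_b1                     # t statistic for H0: beta1 = 0
F = ESS / (RSS / (n - 2))             # overall F statistic
print(R2, t_b1, F)
```

In simple (one-variable) regression the overall F statistic equals the square of the slope's t statistic, which the code above reproduces numerically.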
econometric test
Determined by econometric theory
include
Heteroskedasticity test
serial correlation test
Collinearity test
Econometric research is a dynamic process. Only after the model passes the above-mentioned tests can it be practically applied. If it fails the test, the model needs to be revised, re-designed, re-estimated and re-tested.
Test the correctness of the model—hypothesis testing of the model;
Model prediction test
Determined by the application requirements of the model
include
Stability test: re-estimation with expanded sample
Forecasting performance test: actual forecasting at an out-of-sample point
Model use/application
structural analysis
Econometric models are used to quantitatively analyze the relationship between variables, with the purpose of clarifying and explaining relevant economic phenomena. Such as elasticity analysis, multiplier analysis
Policy simulation and evaluation
Use econometric models to simulate the implementation effects of various economic policies in order to compare various policy options and make choices
The "economic policy laboratory" function of econometric models
Economic forecasting: using econometric models to predict the future values of economic variables
Theory verification
Data and variables
data type
Time series data (same space, different times)
Data for the same individual at different times
A student's weight recorded over ten years yields a time series of sample size 10;
stability problem
Differences from cross-sectional data
Cross-sectional data (same time, different spaces)
Data for a group of individuals at the same time.
Heteroskedasticity
It is a batch of survey data that occurred at the same time section. Study changes at a certain point in time.
For example, industrial census data, population census data, household means survey data, etc.
At a given point in time, record the weights of all 30 students in a class to obtain cross-sectional data of sample size 30.
Mixed data (panel data): It is a combination of time series data and cross-sectional data.
GDP of each province in China from 1978 to 1998
Dummy variables (either-or variables represented by 0-1)
A dummy variable is a variable that only takes one of 1 or 0, and is generally used to represent qualitative variables, such as policy variables, condition variables, etc.
Dummy variables can be combined to represent multiple states.
Can represent qualitative data
Number of dummy variables used = number of states to represent - 1. Three states need only 2 dummy variables; using 3 dummies for 3 states produces perfect multicollinearity (the dummy-variable trap).
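The k-1 rule can be sketched with pandas, whose `get_dummies` drops the first category precisely to avoid the multicollinearity trap (the `region` variable and its categories here are hypothetical):

```python
import pandas as pd

# Hypothetical qualitative variable with 3 states
df = pd.DataFrame({"region": ["east", "central", "west", "east", "west"]})

# 3 states need only 2 dummies; drop_first avoids the dummy-variable trap
dummies = pd.get_dummies(df["region"], drop_first=True)
print(dummies.columns.tolist())  # two dummy columns for three states
```

The dropped category ("central", the alphabetically first) becomes the baseline against which the coefficients on the remaining dummies are interpreted.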
Wage = β0 + β1·exp + … + β5·Dum + μ (random disturbance term). Whether discrimination exists is an empirical question that theory alone cannot settle. With a gender dummy (0 = male, 1 = female), a significant β5 (e.g. β5 = 0.8 > 0) indicates that gender affects wages, i.e. gender discrimination. If work experience is unavailable, proxy it with other variables (internship experience, club experience). Typical regressors: school, major, work experience.
Policy variables: one can study whether a given policy is effective, e.g. whether it affects fuel consumption. dum = 0: policy not implemented; dum = 1: policy implemented. How should the dummy series be set?
Data source
Collection by government agencies, specialized agencies, international agencies or individuals
variable
Distinguish from the causal relationship of variables (single equation model)
Explained variable (response variable)—the variable to be analyzed and studied
The variables as the research object are the "fruit" in the causal relationship, such as the output in the production function. They are the explained variables in the model. In the single equation model, they are at the left end.
Explanatory variable (independent variable) - the variable that explains the main reason for the change in the response variable (Non-main reasons are classified as random items)
Variables as "causes", such as capital and labor technology in the production function, are explanatory variables in the model. In the single equation model, they are at the right end.
Distinguish from the nature of variables
Endogenous variables—variables whose values are determined by the model and are the result of solving the model
belongs to the strain variable
It is determined by the model of the economic system under study itself.
Exogenous variables—variables whose values are determined outside the model
The values of exogenous variables are determined outside the model under study and are not affected by factors internal to the model.
It is a "given" or "known" value that is specified in advance before the model is solved.
Belongs to independent variable
Classification
Policy variables (variables that policymakers can control, such as government spending, interest rates, money supply, etc.)
Non-policy variables (variables that are difficult or uncontrollable, such as climate, natural disasters, agricultural harvests, exchange rates, etc.)
Lag variables: lagged endogenous variables, lagged exogenous variables
Predetermined variables: predetermined endogenous variables and exogenous variables
relation
Changes in the value of exogenous variables can affect changes in endogenous variables
Endogenous variables cannot in turn affect exogenous variables
Model
Regression model: reveal the causal relationship between variables
According to the model form
linear model
Y = β1 + β2X2 + β3X3 + … + βkXk + μ. This is the most commonly used model form and can be estimated using the linear regression methods of mathematical statistics.
linear regression
The "order" counts the explanatory variables; the model above is called a (k-1)-variable linear regression model.
Unary regression
When there is only one explanatory variable, it is called a simple linear regression model, also called a bivariate regression model;
multiple regression
When there is more than one explanatory variable, the model is called a multiple linear regression model. The "order" counts the explanatory variables, so the model above is a (k-1)-variable linear regression model.
nonlinear regression
log-log model
The basic form is Y = e^(β1)·X^(β2)·e^(u), where u is the random disturbance term.
Expressed in natural logarithms: lnY = β1 + β2·lnX + u
β2 is the elasticity of Y with respect to X
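That β2 is the elasticity can be checked numerically: simulate data from a constant-elasticity relationship and recover the exponent by regressing lnY on lnX. All numbers below are invented for illustration.

```python
import numpy as np

# Simulate Y = exp(beta1) * X^beta2 * exp(u) with assumed true values
rng = np.random.default_rng(3)
beta1, beta2 = 0.5, 1.2
X = rng.uniform(1, 100, 500)
u = rng.normal(0, 0.05, X.size)
Y = np.exp(beta1 + beta2 * np.log(X) + u)

# OLS in logs: the slope estimates the constant elasticity beta2
slope, intercept = np.polyfit(np.log(X), np.log(Y), 1)
print(slope)  # close to the assumed elasticity 1.2
```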
semi-log model
Basic forms: lnY = α1 + α2X + u (log-lin) and Y = β1 + β2·lnX + u (lin-log)
This model is called the constant percentage growth model.
α2 equals the constant relative (percentage) change in Y per unit absolute change in X.
β2 equals the absolute change in the mean or expected value of Y per unit relative change in X.
reciprocal model
The reciprocal model Y = β1 + β2(1/X) + u: as X increases, Y decreases nonlinearly toward the intercept β1 when β2 > 0 (and increases toward it when β2 < 0); the intercept is the asymptote.
Classic single-equation econometric model ——Simple/univariate linear regression model
Regression analysis overview and regression function
1. The relationship between variables and Basic concepts of regression analysis
economic variables The relationship can be Divided into two categories
(1) Deterministic relationship or functional relationship
It studies deterministic relationships between non-random variables, e.g. Y = C + I + G.
(2) Statistical dependence or related relationship
It studies statistical relationships between random variables in non-deterministic phenomena, e.g. C = βY + u.
The investigation of statistical dependence between variables is mainly through
regression analysis
There is a causal relationship
Related analysis
no causal relationship
Correlation refers to the degree or strength of the relationship between two or more variables. Correlation analysis is the most basic method of studying the relationship between variables. The correlation coefficient derived from correlation analysis is a basic statistic of regression analysis. Mastering it will help you analyze and understand economic issues and econometric models.
①According to intensity
Complete correlation: a functional relationship between variables, e.g. a circle's circumference L = 2πR.
High correlation (strong correlation): an approximately functional relationship between variables, e.g. household income and expenditure in China.
Weak correlation: a relationship exists between variables but is not obvious, e.g. China's cultivated area and output in recent years.
Zero correlation: There is no relationship between variables. For example, the academic performance and age of students in a certain class.
Use scatter plots/correlation plots to depict
linear correlation
Positive correlation
Not relevant
negative correlation
Correlation coefficient: -1≤ρxy≤1
nonlinear correlation
Positive correlation
negative correlation
Not relevant
simple linear Related metrics
Use the simple linear correlation coefficient (the correlation coefficient for short), denoted ρ, to measure the strength of linear correlation between two variables.
In terms of random variables, ρ = Cov(X, Y) / √(Var(X)·Var(Y)).
Note: the correlation coefficient ρ refers to the population. In practice the data at hand are usually a sample; the sample correlation coefficient is written r, i.e. r is the estimate of the population correlation coefficient ρ.
The sample correlation coefficient is r = Σ(xi − x̄)(yi − ȳ) / √(Σ(xi − x̄)²·Σ(yi − ȳ)²), where n is the sample size, xi and yi are the sample observations, and x̄, ȳ are their sample means.
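The sample formula can be computed directly and cross-checked against numpy's built-in routine. The data below are invented for illustration.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.9])

# Sample correlation coefficient from the definition
xd, yd = x - x.mean(), y - y.mean()
r = (xd * yd).sum() / np.sqrt((xd ** 2).sum() * (yd ** 2).sum())

# Cross-check against numpy's built-in
assert abs(r - np.corrcoef(x, y)[0, 1]) < 1e-12
print(r)  # near 1: strong positive linear correlation
```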
Note
(1) Non-linear correlation does not mean irrelevance;
(2) Correlation analysis treats any (two) variables symmetrically, and both variables are regarded as random.
X has an impact on Y, and Y has an impact on X
There is an asymmetry in the way regression analysis handles variables, that is, it distinguishes between dependent variables (explained variables) and independent variables (explanatory variables): the former is a random variable, and the latter is not.
Only X has an effect on Y
regression analysis
Regression analysis is the theory and method of studying uncertain statistical relationships between variables.
Its basic idea
It consists in estimating and/or predicting the (overall) mean of the explained variable through the known or set value of the explanatory variable.
Here: the former variable is called the explained variable or the response variable, and the latter variable(s) is called the explanatory variable or the independent variable.
Regression analysis forms the methodological basis of econometrics; its main contents include
(1) Choose an appropriate method to estimate the parameters of the econometric model based on the sample observations and obtain the regression equation;
(2) Conduct significance tests on regression equations and parameter estimates;
(3) Use regression equations for analysis, evaluation and prediction.
Yt = β̂0 + β̂1·Xt + et (sample regression function). Taking the conditional expectation of both sides: E(Y|Xt) = β0 + β1Xt (population regression function)
Expected mean, overall regression line
Consider the linear regression (statistical) model Yi = β0 + β1Xi + μi, which represents the true relationship between Y and X. Here Y is the explained (dependent) variable, X the explanatory (independent) variable, μ the random error term, β0 the constant term, and β1 the regression coefficient (marginal effect). The model splits into two parts: (1) the regression function part E(Yi) = β0 + β1Xi, and (2) the random part μi.
2. Overall regression function
In Example 2.1, the consumption expenditure of an individual household is Yi = E(Y|Xi) + μi = β0 + β1Xi + μi (*)
That is, given the income level Xi, the individual household’s Expenditure can be expressed as the sum of two parts
(1) The average consumption expenditure E(Y|Xi) of all households at this income level, called the systematic or deterministic part
(2) Other random or non-deterministic μi.
The formula (*) is called the random setting form of the overall regression function (equation) PRF.
Since random terms are introduced into the equation, it becomes an econometric model, so it is also called an overall regression model.
It shows that in addition to the systematic influence of the explanatory variable, the explained variable is also affected by the randomness of other factors.
overall regression function
It depicts the change pattern between the explanatory variable X and the average/expected value of the explained variable Y.
The corresponding function E(Y|Xi) = f(Xi) is called the (bivariate) overall regression function (PRF)
The regression function (PRF) explains how the average state of the explained variable Y (overall conditional expectation) changes with the explanatory variable X.
Functional form: Can be linear or nonlinear.
In Example 2.1, when residents' consumption expenditure is regarded as a linear function of disposable income, E(Y|Xi) = β0 + β1Xi is linear. Here β0 and β1 are unknown parameters, called regression coefficients.
overall regression line
Drawing the scatter plot, we find that as income increases, consumption also increases "on average", and the conditional mean of Y all falls on a straight line with a positive slope. This straight line is called the overall regression line.
The expected trajectory of the explained variable Y for given values of the explanatory variable X is called the overall regression line/curve.
3. Random disturbance term
The overall regression function explains the average consumption expenditure level of households in the community at a given income level Xi. However, for an individual household, its consumption expenditure may deviate from this average level.
Write μi = Yi − E(Y|Xi).
μi is the deviation of the observed value Yi from its expected value E(Y|Xi). It is an unobservable random variable, also called the random disturbance term or random error term.
Random error terms mainly include The influence of the following factors (consumption)
1) The influence of neglected factors that are not important in the explanatory variables; (preference)
In addition to income affecting consumption, there are also commodity prices, consumer preferences and age, etc.
2) As a comprehensive representative of many small influencing factors;
Consumers’ mood, emotion, personality
Secondary influencing factors are automatically absorbed into the random disturbance term
3) The impact of observation errors on variable observation values;
There are inherent errors when collecting data, such as measurement errors.
4) The impact of the setting error of the model relationship;
5) The influence of other random factors.
Random human behavior, natural disasters, environmental factors
The mathematical model is poorly formed.
When building the model, if the functional form is chosen wrongly or the model is mis-designed, important factors may be omitted and end up in the random disturbance term.
Merger error (merger of two equations)
Generate and design random The main reasons for the error term:
1) Theoretical ambiguity
The consumption function has been studied thoroughly, but the distance between asteroids is relatively vague.
2) Lack of data;
Some data are hard to find, such as data on official corruption.
3) The principle of saving.
The simpler the model, the better
4. Sample regression function (SRF) /sample regression model
The sample regression function also has the random form Yi = β̂0 + β̂1Xi + ei
Here ei is the (sample) residual term, representing the combined effect of the other random factors influencing Yi; it can be regarded as the estimate μ̂i of μi.
Since random terms are introduced into the equation, it becomes an econometric model, so it is also called a sample regression model.
sample regression line
The sample scatter plot is approximately a straight line. Draw a straight line to best fit the scatter plot. Since the sample is taken from the population, the line can approximately represent the population regression line. This line is called the sample regression line. The functional form of the sample regression line is called the sample regression function (SRF)
The functional form of the sample regression line is called the sample regression function (SRF)
PPT 90-92 Case Analysis
The main purpose of regression analysis
Estimate the population regression function PRF based on the sample regression function SRF
Univariate linear regression model parameter estimates of
Assumptions about the random disturbance term μ (the classical assumptions). The first three chapters take these assumptions as given.
Zero mean assumption
Homoskedasticity assumption
No autocorrelation assumption
It is assumed that the random disturbance term is uncorrelated with the explanatory variables
Normality assumption
Assumptions about the explained variable y
Assumption 1: The conditional expectation of the random disturbance term is 0
Find the expectation on both sides of the equation at the same time. The expectation of the constant is itself
E(Y|Xi) = β1 + β2Xi
Assumption 2: Homoskedasticity assumption
Var(Yi|Xi)=σ²
Assumption 3: No autocorrelation assumption
Cov(Yi,Yj)=0
Assumption 4: Normality assumption
Yi ~ N(β1 + β2Xi, σ²)
If these assumptions are met, parameter estimation can be performed.
Model parameter estimation: ordinary least squares (OLS). 80% of exam questions use this method.
To make the sample regression function estimate the population regression function as closely as possible, the error between the fitted value Ŷi = β̂1 + β̂2Xi and the actual Yi must be minimized; that is, the residuals ei = Yi − β̂1 − β̂2Xi should be made as small as possible. In practice, the sum of squared residuals Σei² is minimized.
The principle of the least squares method is to determine the straight line position by "minimizing the sum of squares of the residuals". In addition to being more convenient in calculation, the least squares method also has excellent properties in estimators.
The principle of the least squares method: Find a straight line that minimizes the sum (sum of squares) of the longitudinal distances from all these points to the straight line minΣei²
Σei² = e1² + e2² + … + en²: minimize the residual sum of squares. Some households have positive residuals (e.g. +10) and others negative (e.g. −10); the signs could cancel to 0, so the squared terms must be used.
Dispersion form of ordinary least squares parameter estimators
Let xi = Xi − X̄ and yi = Yi − Ȳ; then β̂2 = Σxiyi / Σxi² and β̂1 = Ȳ − β̂2X̄
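The deviation form can be verified numerically against a library least-squares fit. The data below are invented for illustration.

```python
import numpy as np

X = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
Y = np.array([3.1, 6.8, 9.2, 13.1, 15.8])

# Deviation (dispersion) form: x_i = X_i - Xbar, y_i = Y_i - Ybar
x = X - X.mean()
y = Y - Y.mean()
b2 = (x * y).sum() / (x * x).sum()   # slope
b1 = Y.mean() - b2 * X.mean()        # intercept

# Same answer as numpy's least-squares polynomial fit
slope, intercept = np.polyfit(X, Y, 1)
assert abs(b2 - slope) < 1e-8 and abs(b1 - intercept) < 1e-8
print(b1, b2)
```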
Properties of least squares estimators: three desirable properties of point estimators are required.
Once the model parameters are estimated, the accuracy of the estimates must be considered: can they represent the true values of the population parameters? In other words, the statistical properties of the estimators must be examined. The quality of an estimator can be judged from the following aspects:
(1) Unbiasedness
That is, whether the mean of the estimator's sampling distribution equals the true value of the population parameter;
The mean of the sample statistic's sampling distribution equals the population parameter being estimated.
E(θ̂) = θ; θ̂ is then called an unbiased estimator of θ, indicating that on average it hits the target.
(2)Consistency
That is, when the sample size approaches infinity, does it converge to the true value of the population according to probability?
An asymptotic property: as n tends to infinity (the larger the sample), the closer the sample estimate is to the population parameter and the smaller the error.
If as the sample size n increases, the sample estimator becomes closer and closer to the true value of the population in a probability sense, then the estimator is said to be a consistent estimator of the parameter to be estimated.
(3) Effectiveness
That is, among all unbiased estimators, whether it has the smallest variance;
A comparison of variances: lower variance means smaller fluctuation around the true value, higher accuracy, and estimates closer to the population parameter.
Smaller variance and more efficient
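Unbiasedness can be illustrated by simulation: over many repeated samples, the average of the OLS slope estimates should approach the true slope. The true coefficients and sample design below are invented for illustration.

```python
import numpy as np

# Monte Carlo sketch of unbiasedness of the OLS slope estimator
rng = np.random.default_rng(42)
true_b0, true_b1 = 1.0, 0.5
X = np.linspace(0, 10, 20)

estimates = []
for _ in range(2000):
    Y = true_b0 + true_b1 * X + rng.normal(0, 1, X.size)
    b1, b0 = np.polyfit(X, Y, 1)
    estimates.append(b1)

print(np.mean(estimates))  # close to the true value 0.5
```

Individual estimates scatter around 0.5 (each sample gives a different answer), but their average converges to the true value, which is exactly what E(β̂) = β asserts.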
rational economic man
Univariate linear regression model Hypothesis testing and interval estimation
Goodness of fit test
Test the degree of fit between the sample regression line and the sample observation values, and evaluate the effect of model estimation.
Index to measure the goodness of fit: coefficient of determination (determination coefficient) R2
Question: Using the ordinary least squares estimation method has ensured that the model best fits the sample observations, why do we need to test the degree of fit?
Decomposition of sum of squares of total deviation/decomposition of variance: formula derivation method
The total dispersion of the observed Y values around their mean can be decomposed into two parts: one attributable to the regression line (ESS) and the other to the random disturbances (RSS).
For a given sample, TSS is fixed. The closer the actual observations lie to the sample regression line, the larger the share of ESS in TSS and the better the regression equation explains Y. Hence goodness of fit = regression sum of squares ESS / total dispersion TSS of Y.
The coefficient of determination R² of the sample
Value range: [0,1]
It is generally better when >0.7
The closer R² is to 1, the closer the actual observations lie to the sample regression line, the higher the goodness of fit, and the better the regression equation explains Y.
Features
①The determined coefficient is a non-negative statistic;
②The value range of the determination coefficient is 0≤R2≤1:
③The determination coefficient R2 is a function of the sample observation value and a random variable that changes with sampling.
The statistical reliability of the coefficient of determination should also be tested.
Ŷi − Ȳ is the difference between the fitted value from the sample regression and the mean of the observations, which can be regarded as the part explained by the regression line; ei = Yi − Ŷi is the difference between the actual observation and the fitted value, the part the regression line cannot explain. If Yi = Ŷi, i.e. the observation falls exactly on the sample regression line, the fit is perfect: all of the "dispersion" comes from the regression line and none from the residuals.
The proportion of the total variation in Y explained by variation in X, i.e. the explanatory power of the regression equation. Example: 98.69% of the sample variation in household consumption expenditure Y can be explained by changes in household income X.
The relationship between the coefficient of determination R² and the correlation coefficient r
The coefficient of determination is numerically equal to the square of the simple linear correlation coefficient.
But conceptually the two are clearly different.
First, R² is defined in terms of the estimated regression model: it measures how well the model fits the sample observations, i.e., the degree to which the explanatory variables explain the variation of the explained variable. The correlation coefficient r is defined for a pair of variables: it measures the degree of linear dependence between them.
Second, the coefficient of determination reflects the asymmetric, regression-based relationship between the explanatory and explained variables: it gives the proportion of the variation of Y explained by X, but says nothing about Y explaining X. By contrast, r measures the symmetric correlation between X and Y and involves no causal direction.
Moreover, the coefficient of determination is non-negative, with range 0 ≤ R² ≤ 1, while the correlation coefficient can be positive or negative, with range −1 ≤ r ≤ 1.
In econometrics, the estimation, testing and application of regression models are mainly studied, so from the practical application of regression analysis, the determination coefficient is more meaningful than the correlation coefficient.
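The numerical identity R² = r² in the one-variable case can be verified directly; a short sketch on simulated data (the data-generating process is made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=50)
Y = 2.0 + 1.5 * X + rng.normal(scale=0.5, size=50)

# simple correlation coefficient r between X and Y
r = np.corrcoef(X, Y)[0, 1]

# R² from the fitted simple linear regression
b2 = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean())**2)
b1 = Y.mean() - b2 * X.mean()
Y_hat = b1 + b2 * X
R2 = np.sum((Y_hat - Y.mean())**2) / np.sum((Y - Y.mean())**2)

print(R2, r**2)   # numerically equal in the one-variable case
```

The equality holds only for simple (one-regressor) linear regression with an intercept; with multiple regressors R² is no longer the square of any single pairwise r.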
Interval estimates of parameters
regression parameters Statistics distribution
The distributional properties of the OLS estimators: β̂i estimates βi, i.e., sample statistics stand in for population parameters.
Small sample: t distribution; large sample: standard normal distribution. β̂1 and β̂2 both follow normal distributions.
In small samples, if the unbiased estimate σ̂² is used in place of σ² to compute the standard error, the standardized statistic no longer follows the normal distribution but a t distribution with n−2 degrees of freedom.
Generally speaking, after standardization β̂1 and β̂2 follow the t distribution with n−2 degrees of freedom.
In large samples, this can be treated approximately as a normal distribution.
regression parameters interval estimate
Regression analysis hopes to use the sample estimate β̂1 in place of the population parameter β1.
Hypothesis testing can test the range of possible hypothesis values of the population parameter (such as whether it is zero) through the results of a sampling, but it does not indicate how "close" the sample parameter value is to the true value of the population parameter in a sampling.
To judge how well the sample estimate can "approximately replace" the true value of the population parameter, an interval centered on the sample estimate is constructed, and one asks with what probability this interval contains the true parameter value. This method is the confidence-interval estimation of parameters.
To judge how "close" the estimate β̂ is to the true parameter value, pre-select a probability α (0 < α < 1) and find a positive number δ such that the random interval (β̂ − δ, β̂ + δ) contains the true parameter value with probability 1 − α, i.e., P(β̂ − δ ≤ β ≤ β̂ + δ) = 1 − α.
If such an interval exists, it is called a confidence interval; 1-α is called the confidence coefficient (confidence level); α is called the significance level; The endpoints of a confidence interval are called confidence limits or critical values.
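The small-sample confidence interval for the slope, β̂2 ± t(α/2, n−2)·se(β̂2), can be computed directly; a minimal sketch on simulated data (the data and true parameters are hypothetical):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 20
X = rng.uniform(0, 10, n)
Y = 1.0 + 0.5 * X + rng.normal(scale=1.0, size=n)

Sxx = np.sum((X - X.mean())**2)
b2 = np.sum((X - X.mean()) * (Y - Y.mean())) / Sxx
b1 = Y.mean() - b2 * X.mean()
resid = Y - (b1 + b2 * X)

sigma2_hat = np.sum(resid**2) / (n - 2)        # unbiased estimate of σ²
se_b2 = np.sqrt(sigma2_hat / Sxx)              # standard error of the slope

alpha = 0.05
t_crit = stats.t.ppf(1 - alpha / 2, df=n - 2)  # small sample: t(n−2) critical value
ci = (b2 - t_crit * se_b2, b2 + t_crit * se_b2)
print(ci)                                      # 95% confidence interval for β2
```

In large samples the t critical value is simply replaced by the standard-normal one (≈1.96 at the 5% level).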
From the known to the unknown; infer the whole from the parts; know more from less. Example: pp. 149–151.
standard error
Estimate of the population regression standard deviation σ: S.E. of regression
Interval estimation of regression coefficients: 3 situations
Textbook P41
Significance test of variables (t test)
Regression analysis is to determine whether the explanatory variable X is a significant influencing factor of the explained variable Y.
In the one-variable linear model, it is necessary to determine whether X has a significant linear impact on Y. This requires a significance test of variables. The method used to test the significance of variables is hypothesis testing in mathematical statistics.
In econometrics, significance testing is mainly conducted on whether the true value of the parameter of a variable is zero.
The sample estimate β̂1 varies from sample to sample, but the population parameter β1 is a single constant.
Hypothesis testing
Hypothesis testing makes an assumption in advance about a population parameter or the form of the population distribution, then uses sample information to judge whether that null hypothesis is reasonable, i.e., whether the sample information differs significantly from it, and accordingly decides whether to accept or reject the null hypothesis.
Formulate hypotheses - draw samples - make decisions
The logical reasoning method used in hypothesis testing is proof by contradiction.
The basic idea: first assume that the null hypothesis is correct, and then based on the sample information, observe whether the results caused by this assumption are reasonable, and use appropriate statistics that conform to a certain probability distribution and a given significance level.
The judgment of whether the result is reasonable rests on the principle that "small-probability events are unlikely to occur in a single sampling". If such an event does occur, the null hypothesis is judged incorrect and is rejected.
To test whether an explanatory variable in the regression model significantly affects the explained variable, the null hypothesis is usually H0: β1 = 0, and the test asks whether this hypothesis holds.
Suppose β̂1 is calculated as 0.53. Different data yield different results, so 0.53 by itself is not stable or robust; a number obtained from one particular sample is not very reliable.
Most tests use α = 0.05; a smaller α is a stricter standard and makes the null hypothesis harder to reject. Different standards lead to different test conclusions. We want the coefficient to be significant, so that the result is meaningful.
P value
The p value is a probability: the probability, under the null hypothesis, of obtaining a statistic at least as extreme as the one calculated from the sample.
Take a two-sided test with statistic U as an example: if the value computed from the sample is U0, the p value is defined by P{|U| ≥ U0} = p.
The output results of most computer software report p values.
What is the relationship between the p value and the test level α?
α is set by the researcher; the p value is computed from the sample and is the exact observed significance level.
When P<α, the value of the statistic is in the rejection region of the null hypothesis, so the test conclusion is to reject the null hypothesis at the α level;
When p > α, the statistic falls in the acceptance region of the null hypothesis, so the conclusion is to accept the null hypothesis at level α.
Relationship between α and p, via comparing the calculated value with the critical value: e.g., if |t| = 1.36 < 2 (the critical value), then p > α and the null hypothesis is accepted; if p < α, it is rejected.
converted into corresponding probabilities
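The t statistic, its two-sided p value, and the p-versus-α decision rule can be sketched in a few lines; the data here are simulated purely for illustration:

```python
import numpy as np
from scipy import stats

# Hypothetical data: the true slope is 0.5, so H0: β2 = 0 should tend to be rejected
rng = np.random.default_rng(2)
n = 20
X = rng.uniform(0, 10, n)
Y = 1.0 + 0.5 * X + rng.normal(scale=1.0, size=n)

Sxx = np.sum((X - X.mean())**2)
b2 = np.sum((X - X.mean()) * (Y - Y.mean())) / Sxx
b1 = Y.mean() - b2 * X.mean()
resid = Y - (b1 + b2 * X)
sigma2_hat = np.sum(resid**2) / (n - 2)
se_b2 = np.sqrt(sigma2_hat / Sxx)

t_stat = b2 / se_b2                                      # test statistic for H0: β2 = 0
p_value = 2 * (1 - stats.t.cdf(abs(t_stat), df=n - 2))   # two-sided p value

alpha = 0.05
t_crit = stats.t.ppf(1 - alpha / 2, df=n - 2)
reject = p_value < alpha     # p < α  ⇔  |t| > critical value
print(t_stat, p_value, reject)
```

The two decision rules (compare p with α, or compare |t| with the critical value) always agree, which is the point of the comparison described above.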
Univariate linear regression model predictions
Point forecast for Y0 mean
By directly substituting the predicted value Xf of the explanatory variable into the estimated sample regression function, the predicted value of the mean value of the explained variable can be calculated.
Confidence interval for the Y0 mean prediction
Forecast the average value of Y0 to see what distribution it obeys
See textbook 44-45 for details.
Prediction interval for the overall value of Y0
Convert the obeyed distribution into t distribution, and the residual ei obeys the normal distribution
in conclusion
1. The prediction interval for individual values is wider than the prediction interval for the average value;
2. For a fixed sample size, the confidence band is narrowest at the mean X̄ and widens as Xf moves away from it. When using a regression model for prediction, Xf should therefore not deviate too far from X̄, or prediction accuracy falls sharply; in practice we generally predict only, e.g., the next year;
3. Prediction accuracy improves as the sample size grows, and is poor when the sample is too small. As n → ∞, the sampling error tends to 0, so the prediction error of the mean value also tends to 0, while the prediction error of an individual value then depends only on the variance σ² of the random disturbance term μi. The larger the sample size n, the higher the prediction accuracy; the smaller n, the lower;
Multiple linear regression model
multiple linear regression Models and classical assumptions
General form/random disturbance term form
Yi = β1 + β2X2i + β3X3i + … + βkXki + μi
There are k parameters and k−1 explanatory variables X.
A (k−1)-variable regression model.
In the model, βj (j = 1, 2, …, k) is a partial regression coefficient.
In the real economy, there is rarely one explanatory variable, and there are many variables that can be measured and observed.
Holding the other explanatory variables constant, βj measures the effect of a one-unit change in the jth explanatory variable on the mean of the explained variable.
In matrix/vector form the multivariate linear model is Y = Xβ + μ.
Classic assumptions of multivariate linear models
1. Zero mean assumption/error term is unbiased—the mean of the random disturbance term is 0: E(μi)=0
2. Homoscedasticity and autocorrelation-free/random disturbance terms are uncorrelated with each other and have the same variance: Var(U)=σ²In
3. The random disturbance term is uncorrelated with the explanatory variables: Cov(Xji, μi) = 0, j = 2, 3, …, k
4. No multicollinearity. This assumption is unique to the multiple linear regression model: the k−1 explanatory-variable vectors are linearly independent, so the explanatory-variable matrix X has full column rank: rank(X) = k
5. Normality assumption: the random disturbance term μi follows a normal distribution: μi ~ N(0, σ²), i = 1, 2, …, n
Desired properties of the OLS estimators: unbiasedness, efficiency, and consistency, so that the sample estimates can stand in for the population parameters.
multiple linear regression Model estimation OLS
Least squares estimates of parameters
Derivation of parameter estimates
Minimum sum of squares of residuals
Textbook P 67-68 Process
Variance-covariance matrix of multivariate model parameters
The sample standard deviation of a parameter estimator is also called the coefficient standard error.
Example: PPT207-214
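The closed-form estimator β̂ = (X'X)⁻¹X'y, together with σ̂² and the variance-covariance matrix σ̂²(X'X)⁻¹, can be sketched in a few lines of matrix code (the data-generating process is simulated for illustration):

```python
import numpy as np

rng = np.random.default_rng(4)
n, k = 100, 3                          # k parameters: intercept + 2 slopes
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta_true = np.array([1.0, 2.0, -0.5]) # hypothetical true parameters
y = X @ beta_true + rng.normal(scale=0.3, size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y           # β̂ = (X'X)⁻¹ X'y

resid = y - X @ beta_hat
sigma2_hat = resid @ resid / (n - k)   # σ̂² with n−k degrees of freedom
cov_beta = sigma2_hat * XtX_inv        # variance-covariance matrix of β̂
se_beta = np.sqrt(np.diag(cov_beta))   # coefficient standard errors
print(beta_hat, se_beta)
```

In practice one would use a numerically stable solver (e.g. `np.linalg.lstsq`) rather than an explicit inverse; the explicit form is shown here only to mirror the textbook derivation.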
Variable Observability and Data Quality
Properties of Parametric Least Squares Estimation
The least squares estimator of a multiple linear regression model is also the best linear unbiased estimator
linear properties
Unbiasedness: by the zero-mean assumption, β̂ is an unbiased estimate of β.
minimum variance
The least squares estimator β̂ of the parameter vector β has the smallest variance among all linear unbiased estimators of β.
multiple linear regression Model testing
Goodness of fit test
Multiple determination coefficient R²
Value range: [0,1]
The closer R² is to 1, the better the model fits the data.
Example questions P217-221: Increase the number of explanatory variables in sequence. Each time an explanatory variable is added, R² will increase. But when R² is the largest, it does not necessarily prove that this is an optimal model or that it has the highest degree of explanation.
What kind of model is the optimal model?
Look at the accuracy it requires
Look at the cost
It is not about including the most variables; only the most appropriate, scientifically chosen ones are best.
Which one is optimal based on theoretical analysis, statistical indicators and actual conditions?
The modified coefficient of determination is more accurate than the original R²
The raw multiple coefficient of determination R² never decreases as explanatory variables are added, so R² values cannot always be compared directly across models. If model A has 3 explanatory variables with R² = 0.9 and model B has 100 with R² = 0.97, model B's higher degree of explanation is partly automatic inflation from adding regressors; there is "water" in it, and as a measure of accuracy the raw R² is misleading.
The multiple coefficient of determination has an important property: it is a non-decreasing function of the number of explanatory variables in the model. With the sample size fixed, as explanatory variables are added, the total sum of squares TSS does not change, while the explained sum of squares ESS may increase, so R² grows. This is a defect when using R² to compare the fit of two models with the same explained variable but different numbers of explanatory variables: the values cannot simply be compared directly. R² involves only variation and ignores degrees of freedom. Correcting the variation by degrees of freedom removes this difficulty: with a fixed sample size, adding explanatory variables increases the number of parameters to estimate and costs degrees of freedom. Degrees of freedom are therefore used to correct the residual sum of squares and the regression sum of squares in R², yielding the adjusted coefficient of determination R̄².
Adjusted (corrected) coefficient of determination R̄²
It should be emphasized that both the coefficient of determination and the adjusted coefficient of determination computed for a sample-estimated regression model are random variables that change with sampling. In applied econometric analysis, one usually hopes that the model's R² or R̄² is large, i.e., close to 1, indicating a good fit to the data.
However, the coefficient of determination measures only goodness of fit. A large R² or R̄² means only that the explanatory variables included in the model jointly have a large impact on the explained variable; it does not mean that each individual explanatory variable has a large influence.
In regression analysis, not only must the model have a high degree of fitting, but also a reliable estimator of the overall regression coefficient must be obtained. Therefore, when selecting a model, the quality of the model cannot be judged simply by the level of the coefficient of determination.
R² rises as explanatory variables are added, while the adjusted R̄² first rises and then falls. In the example above, the model with 3 or 4 explanatory variables therefore fits the data best on balance.
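The adjustment is R̄² = 1 − (1−R²)(n−1)/(n−k). A short sketch showing that adding an irrelevant regressor cannot lower R² while the degrees-of-freedom penalty holds R̄² ≤ R² (the data are simulated for illustration):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 40
x1 = rng.normal(size=n)
x_noise = rng.normal(size=n)                  # irrelevant regressor
y = 1.0 + 2.0 * x1 + rng.normal(scale=0.5, size=n)

def fit_r2(X):
    nn, k = X.shape
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ beta
    R2 = 1 - resid @ resid / np.sum((y - y.mean())**2)
    R2_adj = 1 - (1 - R2) * (nn - 1) / (nn - k)   # R̄² = 1 − (1−R²)(n−1)/(n−k)
    return R2, R2_adj

X_small = np.column_stack([np.ones(n), x1])
X_big = np.column_stack([np.ones(n), x1, x_noise])

R2_s, R2adj_s = fit_r2(X_small)
R2_b, R2adj_b = fit_r2(X_big)
print(R2_s, R2adj_s, R2_b, R2adj_b)
```

Raw R² can only rise when a regressor is added, whereas R̄² rises only if the new variable explains enough to offset the lost degree of freedom; this is why R̄² is used for comparing models with different numbers of regressors.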
t-test
PPT P225-232 Textbook P 75-76
The purpose of multiple linear regression analysis is not only to obtain a model with higher goodness of fit, nor to seek the significance of the overall equation, but to make meaningful estimates of each overall regression parameter.
Because the overall linear relationship of the equation is significant does not necessarily mean that the impact of each explanatory variable on the explained variable is significant. Therefore, significance testing must also be performed on each explanatory variable separately.
The significance test of each regression coefficient in multiple regression analysis is to test whether the explanatory variable corresponding to the regression coefficient has a significant impact on the explained variable when other explanatory variables remain unchanged. The test method is basically the same as that of simple linear regression.
F test
Since the multiple linear regression model contains multiple explanatory variables, whether there is a significant linear relationship between them and the explained variables requires further judgment. That is to say, we need to make inferences about whether the linear relationship between the explained variable and all explanatory variables in the model is significant as a whole.
Textbook P73-74
In the case of single regression, since there is only one explanatory variable, there is no overall test problem of the joint influence of the explanatory variables, and there is no need for the F test. In essence, the F test is consistent with the t test, that is, the F statistic is equal to the square of the t statistic.
The formula shows that the statistics F and R² move in the same direction, so the two are consistent: generally, the better the model fits the observations, the more significant the overall linear relationship of the model. Finding the quantitative link between the two test statistics, so that they verify each other, is therefore useful in practice. R² has no exact distribution and offers no definite threshold for how large it must be for the model to pass; the statistic F, by contrast, has an exact distribution, so a statistically rigorous conclusion can be drawn at a given significance level.
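The link between F and R², F = [R²/(k−1)] / [(1−R²)/(n−k)], and the identity F = t² in the single-regressor case can both be verified numerically (simulated data, hypothetical parameters):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 25
X = rng.normal(size=n)
Y = 1.0 + 0.7 * X + rng.normal(size=n)

Sxx = np.sum((X - X.mean())**2)
b2 = np.sum((X - X.mean()) * (Y - Y.mean())) / Sxx
b1 = Y.mean() - b2 * X.mean()
resid = Y - (b1 + b2 * X)
sigma2_hat = np.sum(resid**2) / (n - 2)

t_stat = b2 / np.sqrt(sigma2_hat / Sxx)        # t statistic for the slope

R2 = 1 - np.sum(resid**2) / np.sum((Y - Y.mean())**2)
F_stat = (R2 / 1) / ((1 - R2) / (n - 2))       # F = [R²/(k−1)] / [(1−R²)/(n−k)], k = 2
print(F_stat, t_stat**2)
```

Because F is a monotone increasing function of R² (with n and k fixed), a larger R² always means a larger F, which is the "same direction" consistency noted above.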
multiple linear regression model predictions
point prediction
Substitute data directly
Interval forecast
Interval prediction of mean E(Y0)
analogy to one dollar
Interval forecast for individual values Y0
violate classic assumed
multicollinearity
definition
perfect multicollinearity
With no error term: an exact linear relationship holds among the explanatory variables.
incomplete multicollinearity
With an error term added: only an approximate linear relationship holds among the explanatory variables.
Is there a high degree of linear correlation between multiple explanatory variables?
If no complete or incomplete linear relationship exists between the explanatory variables, there is said to be no multicollinearity. In matrix terms, X has full column rank: rank(X) = k, so (X'X)⁻¹ exists.
The possible relationships among explanatory variables in a regression model fall into three situations:
(1) r(xi, xj) = 0: no linear relationship between the explanatory variables; the variables are mutually orthogonal. In this case there is actually no need for multiple regression: each parameter can be estimated by a simple regression of Y on the corresponding X.
(2) r(xi, xj) = 1: complete collinearity among the explanatory variables. The model parameters are then indeterminate: intuitively, when two variables move together, it is very hard to separate the impact of each explanatory variable on the explained variable.
(3) 0 < r(xi, xj) < 1: some degree of linear relationship between the explanatory variables. This is the situation most often met in practice.
It needs to be emphasized that there is no linear relationship between explanatory variables, which does not mean that there is no nonlinear relationship. When there is a nonlinear relationship between explanatory variables, it does not violate the assumption of no multicollinearity.
As the degree of collinearity increases, the accuracy and stability of the parameter estimates deteriorate; incomplete multicollinearity is therefore in fact a serious problem.
generate background
(1) Economic variables share common trends. With time-series data, for example, income, consumption, and employment are all affected by the business cycle: over time they all grow during upturns and all decline during contractions. When such variables enter the model together as explanatory variables, multicollinearity arises.
(2) The model includes lagged variables. When lagged values of the explanatory variables are introduced into the model, successive values of the same variable are correlated with each other.
(3) Multicollinearity can also occur in models built from cross-sectional data, since explanatory variables are often closely related in economic terms — samples of Chinese provinces, listed companies, or large firms, for instance. Jiangsu and Zhejiang are close in space, economically interconnected, and both relatively developed. In cross-sectional modeling, many variables move with the scale of development and show a common growth trend: inputs such as capital, labor, technology, and energy are all related to the scale of output, so multicollinearity arises easily. It also arises when changes in some factors are highly correlated with changes in others. For example, a regression of grain output on fertilizer consumption, irrigated area, and agricultural investment performed poorly, because the effect of agricultural investment was already reflected through the two factors of fertilizer consumption and irrigated area.
(4) The sample data themselves may be the cause: when sampling from the population, correlations can appear, and multicollinearity arises easily. Related variables may be selected at modeling time, and the influencing factors are somewhat correlated. For example, sampling may cover only a limited range of the explanatory-variable values in the population, so the variables vary little; or restrictions in the population make the sample data on several explanatory variables correlated. Consider the multiplier effect of investment and consumption: one yuan of consumption can stimulate several times as much national income.
For example, GDP and CPI are positively correlated, influence each other, and spiral upward.
have consequences
Consequences of complete multicollinearity
Rare in practice. The inverse matrix (X'X)⁻¹ does not exist, the normal equations have infinitely many solutions, and the parameter estimates are indeterminate: the parameters cannot be estimated.
The variance of the parameter estimates is infinite
Consequences of incomplete multicollinearity Worse
1. The variance of the parameter estimator is large, the covariance increases, and the ordinary least squares parameter estimator is not effective.
R²₂₃ is the squared correlation coefficient between the variables X2 and X3; the larger it is, the larger the variance of the estimator, and the less efficient OLS becomes.
2. The economic meaning of the parameter estimator is unreasonable, and the coefficient sign is inconsistent with the theory.
There may be multiple linear correlations, with one variable standing in for a related variable; a coefficient sign contradicting theory is an obvious warning signal.
See PPT analysis
3. The contribution of each explanatory variable to the regression sum of squares cannot be accurately distinguished.
Suppose X3 = R23·X2 (0 < R23 < 1): there is a positive correlation between X2 and X3, and the two exert cross-influence.
Ŷ = β̂1 + β̂2X2 + β̂3X3 = β̂1 + (β̂2 + β̂3·R23)X2
The economic meaning of β̂2 itself: with the other variables unchanged, each unit increase in X2 raises the explained variable Y by β̂2 units. But this interpretation presupposes that X2 is unrelated to X3, so that β̂2 alone measures the impact of X2 on Y.
If X2 and X3 are related, β̂2 no longer represents the influence of X2 on Y: a change in X2 also changes X3, which in turn affects Y. β̂2 then loses its meaning as an estimate, because the effects of X2 and X3 cannot be cleanly separated, which undermines the interpretation of the coefficients.
4. The t values shrink, the significance tests of the variables become meaningless, and spurious t-test results are obtained. This is the most serious harm: there is no way to know whether a variable is genuinely significant or not, so the result loses research value.
What impact does it have on parameter estimation and hypothesis testing?
Or is it due to the problem of multicollinearity that the variables interfere with each other and we cannot use the t test to estimate the results?
By the formula for the t value, a high degree of collinearity rapidly inflates the variance of the parameter estimates, which shrinks the t value; the null hypothesis "the coefficient is 0", which should be rejected, is then mistakenly accepted, and the t test becomes invalid.
The F-test is also distorted and invalid.
5. The confidence interval becomes wider and the prediction function of the model fails.
The confidence interval is enlarged, the prediction accuracy decreases, and the error becomes larger, which affects the parameter estimation and hypothesis testing of the model.
test
Simple correlation coefficient matrix method
It is often used in time series, and you can intuitively see the degree of correlation between variables.
The simple correlation coefficient test method is a simple method that uses the degree of linear correlation between explanatory variables to determine whether there is serious multicollinearity.
Generally speaking, if the simple correlation coefficient (zero-order correlation coefficient) of each two explanatory variables is relatively high, such as greater than 0.8, it can be considered that there is serious multicollinearity.
However, it should be noted that a higher simple correlation coefficient is only a sufficient condition for the existence of multicollinearity, not a necessary condition. Especially in regression models with more than two explanatory variables, sometimes low simple correlation coefficients may also suffer from multicollinearity. Therefore, multicollinearity cannot be accurately judged simply based on the correlation coefficient.
Comprehensive judgment method of variable significance (t) and equation significance (F, R²)
If R2 is large and the F value is significantly greater than the critical value at a given significance level, but the partial regression coefficient value corresponding to the variable is not significant, it indicates that the model has multicollinearity.
auxiliary regression method
Summary of methods for multicollinearity diagnosis
(1) The correlation coefficient between explanatory variables is large
(2) In the regression model, although R2 is large, the signs of some important explanatory variables do not meet economic significance, and the parameters do not pass the significance test;
(3) In the auxiliary regression of Xj on the other explanatory variables, if the regression parameters pass the significance test and the VIF is much greater than 10, the degree of multicollinearity is serious;
(4) If the newly added explanatory variables in the original equation can improve the goodness of fit, but have a significant impact on the signs and significance tests of other parameters, it can be judged that the new explanatory variables have caused multicollinearity.
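The auxiliary-regression diagnostic VIF_j = 1/(1 − R_j²) can be sketched directly; the variables below are simulated so that one pair is nearly collinear (all names and data are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 100
x2 = rng.normal(size=n)
x3 = x2 + rng.normal(scale=0.05, size=n)   # nearly collinear with x2
x4 = rng.normal(size=n)                    # unrelated regressor

def vif(cols, j):
    """VIF_j = 1/(1 − R_j²) from the auxiliary regression of
    column j on the other explanatory variables (plus a constant)."""
    yj = cols[:, j]
    others = np.delete(cols, j, axis=1)
    Xa = np.column_stack([np.ones(len(yj)), others])
    beta = np.linalg.lstsq(Xa, yj, rcond=None)[0]
    resid = yj - Xa @ beta
    R2 = 1 - resid @ resid / np.sum((yj - yj.mean())**2)
    return 1.0 / (1.0 - R2)

X = np.column_stack([x2, x3, x4])
vifs = [vif(X, j) for j in range(3)]
print(vifs)
```

The collinear pair produces VIFs far above the conventional threshold of 10, while the unrelated variable's VIF stays near 1, matching the rule of thumb in point (3).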
Remedies and Corrections Improvements
1. Eliminate variables method
2. Increase sample capacity
3. Use prior information to change the constraint form of parameters
4. Transform the model into a difference equation form
5. Stepwise regression method (key points)
step
The explained variable Y is regressed on each explanatory variable Xi (i = 1, 2, …, k) separately; each regression equation is assessed against economic theory and statistical tests, and the best one is chosen as the basic regression equation. On this basis, other explanatory variables are introduced one at a time and the regression re-run, gradually expanding the number of explanatory variables in the model until the best overall estimated model emerges.
1. y performs a one-way regression on each explanatory variable
2. Choose the one with the largest R2 as the basic equation
3. Add explanatory variables one at a time, requiring R̄² to increase and each variable's t test to remain significant.
4. If the t-test of other explanatory variables is not significant, delete the variable
5. Select the optimal regression model
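The steps above can be sketched as a simplified forward selection using adjusted R̄² as the sole criterion (the textbook's procedure also checks t significance and deletes variables that become insignificant; the data and true coefficients here are made up):

```python
import numpy as np

rng = np.random.default_rng(8)
n = 200
X_all = rng.normal(size=(n, 4))   # candidate explanatory variables (columns 0-3)
# Hypothetical truth: only columns 0 and 1 matter, column 0 most strongly
y = 3.0 * X_all[:, 0] + 2.0 * X_all[:, 1] + rng.normal(scale=0.1, size=n)

def adj_r2(y, X):
    nn, k = X.shape
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ beta
    R2 = 1 - resid @ resid / np.sum((y - y.mean())**2)
    return 1 - (1 - R2) * (nn - 1) / (nn - k)

selected, remaining = [], list(range(X_all.shape[1]))
best = -np.inf
while remaining:
    # try adding each remaining variable; keep the one that improves R̄² most
    scores = {j: adj_r2(y, np.column_stack([np.ones(n)]
                        + [X_all[:, i] for i in selected + [j]]))
              for j in remaining}
    j_best = max(scores, key=scores.get)
    if scores[j_best] <= best:    # no improvement in adjusted R̄² → stop
        break
    best = scores[j_best]
    selected.append(j_best)
    remaining.remove(j_best)
print(selected)
```

Step 2 of the procedure corresponds to the first pick (the best single-regressor equation); steps 3-4 correspond to the add-and-check loop.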
6. Ridge regression estimation
Heteroskedasticity
autocorrelation
Distributed lag model and autoregressive model
Dummy variable regression