MindMap Gallery: Econometrics
Econometrics is a branch of economics that takes mathematical economics and mathematical statistics as its methodological basis. It attempts to synthesize the theoretical and empirical quantitative approaches to economic problems.
Edited at 2024-10-15 00:35:43
econometrics
Satisfies the classical assumptions
Econometrics Overview
definition
Econometrics is a science that is based on economic theory, uses mathematical statistical methods and computer technology as tools, and conducts quantitative analysis of economic phenomena, economic relationships and economic laws based on actual observational data.
The science of studying the laws of economic activities and their applications using quantitative methods.
Study with variance
China Statistical Yearbook
It is the product of the combination of economics, statistics, and mathematics.
Macroeconomics and microeconomics are largely qualitative; there, research need only establish, for example, that a negative correlation exists
Require
Master the methodology of econometric analysis
Able to establish and apply simple econometric models to analyze real economic problems
Initial mastery of EViews 6
effect
Estimating economic relationships
Not only qualitative, but also quantitative
What are the characteristics of the demand for enterprise products? Is price elastic? Is it inelastic? How effective is the price strategy?
Garment factory pricing
The econometric model must establish the relationship between demand and price and must be subject to profit maximization.
What is the impact of advertising on sales revenue and profits?
What is the marginal propensity to consume of urban residents in China?
What factors affect the salary of newly graduated college students? Major? School reputation?
Hypothesis testing
Is there really gender discrimination in wages?
Have government policies to encourage fuel-efficient vehicles significantly reduced fuel consumption across the country?
Does the taxi price increase really improve the situation of taxi drivers?
predict
Manufacturers need to make product sales forecasts
The government needs to make economic development forecasts
Cities need to forecast transportation development and energy demand
Stock market players need to make stock price predictions and risk estimates
Origin and development
produce
In a market economy, there are intricate relationships between market entities. To survive and develop in fierce competition, companies must have reliable market forecasts.
If the government wants to intervene in the operation of the national economy, it needs to analyze economic trends in a timely manner.
develop
●Computer applications
●Model variables and equations
From few to many: models have grown in scale, and there is a trend toward merging separate models into an overall model
●New breakthroughs in theories and methods
In addition to the classical linear econometric model, research fields now include nonlinear models, rational expectations models, variable-parameter models, nonparametric and semiparametric models, dynamic models, time series models, cointegration theory, Bayesian methods, small-sample theory, and more.
●Expansion of application fields
Applications in macroeconomic and microeconomic fields have shifted from prediction to more testing of economic theoretical assumptions and policy assumptions.
Research steps
a statement of an economic theory or hypothesis
Keynes believed that, generally speaking, people tend to increase their consumption as their income increases, but consumption does not increase as much as income. In economic terms: the marginal propensity to consume, that is, the change in consumption for every unit change in income, is greater than 0 and less than 1.
Build mathematical (mathematical economics) models
deterministic relationship
Y=β0+β1*X, X: income, Y: consumption, 0<β1<1
Build statistical or econometric models
Related relationships
Y = β0 + β1X + μ
The functional form of the model should be chosen based on the specific data pattern
μ: the random error term, capturing the statistical (rather than exact) relationship between household consumption expenditure and disposable income. Its presence means that even at the same disposable income, consumption expenditure retains some randomness. μ also stands in for all other explanatory variables. β0, β1 and the variance of μ are the parameters to be estimated.
Collect and process sample data;
Personal consumption expenditures and gross domestic product data of the United States from 1980 to 1991
Parameter estimation of econometric models;
Obtained with EViews software
The estimates of β0 and β1 are -231.8 and 0.719 respectively: Ŷ = -231.8 + 0.719X, se: (94.53) (0.022); R² = 0.991; F = 1094.1
Why estimate?
Generally speaking, the parameters (β0, β1) are unknown and cannot be directly observed.
Due to the existence of random terms, the parameters cannot be calculated accurately and need to be estimated by selecting an appropriate method through sample observation values.
(How to estimate the parameters of the overall model through sample observations is the core content of econometrics)
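As a concrete illustration of estimating population parameters from a sample, here is a minimal Python sketch. The income figures and true coefficients are invented for illustration; they are not the US 1980-1991 data used in the text.

```python
import numpy as np

# Hypothetical sample: disposable income X and consumption Y for 12 households
rng = np.random.default_rng(0)
X = np.linspace(800, 4100, 12)                  # disposable income
Y = 200 + 0.7 * X + rng.normal(0, 50, X.size)   # consumption; true beta1 = 0.7

# Least-squares fit of Y = b0 + b1*X
b1, b0 = np.polyfit(X, Y, 1)
print(f"Y_hat = {b0:.1f} + {b1:.3f} X")
```

The estimated slope should land near the true marginal propensity to consume of 0.7, and in particular inside the (0, 1) interval Keynes' theory requires.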
Parameter estimation method
Econometrics has developed a wealth of parameter estimation methods.
Mainly composed of Two major categories
Classical estimation method
least squares estimate
maximum likelihood estimation
Simultaneous equation estimation method, etc.
instrumental variable method
Bayesian estimation method (not covered in this course)
You may take the postgraduate entrance examination
Test hypotheses from models
Purpose
Evaluate the model and estimated parameters to determine whether they are theoretically meaningful and statistically significant.
The conclusion may merely be an accidental result of sampling
May violate basic assumptions of econometric estimation
economic significance test
Check whether each parameter is consistent with economic theory and actual experience
In the consumption function example, Ŷ = -231.8 + 0.719X: is 0 < β1 < 1 satisfied?
ln(per capita food demand) = -2.0 + 0.5·ln(per capita income) - 4.5·ln(food price) + 0.8·ln(other commodity prices)
Statistical test
Determined by mathematical statistics theory
include
Goodness of fit test: R2
Overall significance test: F
Variable significance test: t
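These three statistics can all be computed directly from an OLS fit. A minimal Python sketch on synthetic data (all numbers invented for illustration):

```python
import numpy as np

# Synthetic sample; true slope is 1.5
rng = np.random.default_rng(1)
n = 30
X = rng.uniform(0, 10, n)
Y = 2.0 + 1.5 * X + rng.normal(0, 1.0, n)

# Simple OLS fit
b1 = ((X - X.mean()) * (Y - Y.mean())).sum() / ((X - X.mean()) ** 2).sum()
b0 = Y.mean() - b1 * X.mean()
Yhat = b0 + b1 * X
e = Y - Yhat

TSS = ((Y - Y.mean()) ** 2).sum()     # total sum of squares
ESS = ((Yhat - Y.mean()) ** 2).sum()  # explained (regression) sum of squares
RSS = (e ** 2).sum()                  # residual sum of squares

R2 = ESS / TSS                        # goodness of fit
sigma2 = RSS / (n - 2)                # unbiased estimate of the error variance
se_b1 = np.sqrt(sigma2 / ((X - X.mean()) ** 2).sum())
t_b1 = b1 / se_b1                     # t statistic for H0: beta1 = 0
F = ESS / (RSS / (n - 2))             # overall F statistic
print(R2, t_b1, F)
```

In simple (one-variable) regression the overall F statistic equals the square of the slope's t statistic, which the code above reproduces numerically.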
econometric test
Determined by econometric theory
include
Heteroskedasticity test
serial correlation test
Collinearity test
Econometric research is a dynamic process. Only after the model passes the above-mentioned tests can it be practically applied. If it fails the test, the model needs to be revised, re-designed, re-estimated and re-tested.
Test the correctness of the model—hypothesis testing of the model;
Model prediction test
Determined by the application requirements of the model
include
Stability test: re-estimation with expanded sample
Forecasting performance test: actual forecasting at an out-of-sample point
Model use/application
structural analysis
Econometric models are used to quantitatively analyze the relationship between variables, with the purpose of clarifying and explaining relevant economic phenomena. Such as elasticity analysis, multiplier analysis
Policy simulation and evaluation
Use econometric models to simulate the implementation effects of various economic policies in order to compare various policy options and make choices
The "economic policy laboratory" function of econometric models
Economic forecasting: using econometric models to predict the future values of economic variables
Theory verification
Data and variables
data type
Time series data (same space, different times)
Data for the same individual at different times
A student's weight recorded over ten years yields a time series of sample size 10;
stability problem
Differences from cross-sectional data
Cross-sectional data (same time, different spaces)
Data for a group of individuals at the same time.
Heteroskedasticity
It is a batch of survey data that occurred at the same time section. Study changes at a certain point in time.
For example, industrial census data, population census data, household means survey data, etc.
At a given point in time, record the weights of all 30 students in a class to obtain cross-sectional data of sample size 30.
Mixed data (panel data): It is a combination of time series data and cross-sectional data.
GDP of each province in China from 1978 to 1998
Dummy variables (either-or variables represented by 0-1)
A dummy variable is a variable that only takes one of 1 or 0, and is generally used to represent qualitative variables, such as policy variables, condition variables, etc.
Dummy variables can be combined to represent multiple states.
Can represent qualitative data
Number of dummy variables used = number of states to represent - 1. Three states need only 2 dummy variables; using 3 dummies for 3 states produces perfect multicollinearity (the dummy-variable trap).
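The k-1 rule can be sketched with pandas, whose `get_dummies` drops the first category precisely to avoid the multicollinearity trap (the `region` variable and its categories here are hypothetical):

```python
import pandas as pd

# Hypothetical qualitative variable with 3 states
df = pd.DataFrame({"region": ["east", "central", "west", "east", "west"]})

# 3 states need only 2 dummies; drop_first avoids the dummy-variable trap
dummies = pd.get_dummies(df["region"], drop_first=True)
print(dummies.columns.tolist())  # two dummy columns for three states
```

The dropped category ("central", the alphabetically first) becomes the baseline against which the coefficients on the remaining dummies are interpreted.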
Wage = β0 + β1·exp + … + β5·Dum + μ (random disturbance term). Whether discrimination exists is an empirical question that theory alone cannot settle. With a gender dummy (0 = male, 1 = female), a significant β5 (e.g. β5 = 0.8 > 0) indicates that gender affects wages, i.e. gender discrimination. If work experience is unavailable, proxy it with other variables (internship experience, club experience). Typical regressors: school, major, work experience.
Policy variables: one can study whether a given policy is effective, e.g. whether it affects fuel consumption. dum = 0: policy not implemented; dum = 1: policy implemented. How should the dummy series be set?
Data source
Collection by government agencies, specialized agencies, international agencies or individuals
variable
Distinguish from the causal relationship of variables (single equation model)
Explained variable (response variable)—the variable to be analyzed and studied
The variables as the research object are the "fruit" in the causal relationship, such as the output in the production function. They are the explained variables in the model. In the single equation model, they are at the left end.
Explanatory variable (independent variable) - the variable that explains the main reason for the change in the response variable (Non-main reasons are classified as random items)
Variables as "causes", such as capital and labor technology in the production function, are explanatory variables in the model. In the single equation model, they are at the right end.
Distinguish from the nature of variables
Endogenous variables—variables whose values are determined by the model and are the result of solving the model
belongs to the strain variable
It is determined by the model of the economic system under study itself.
Exogenous variables—variables whose values are determined outside the model
The values of exogenous variables are determined outside the model under study and are not affected by factors internal to the model.
It is a "given" or "known" value that is specified in advance before the model is solved.
Belongs to independent variable
Classification
Policy variables (variables that policymakers can control, such as government spending, interest rates, money supply, etc.)
Non-policy variables (variables that are difficult or uncontrollable, such as climate, natural disasters, agricultural harvests, exchange rates, etc.)
Lag variables: lagged endogenous variables, lagged exogenous variables
Predetermined variables: predetermined endogenous variables and exogenous variables
relation
Changes in the value of exogenous variables can affect changes in endogenous variables
Endogenous variables cannot in turn affect exogenous variables
Model
Regression model: reveal the causal relationship between variables
According to the model form
linear model
Y = β1 + β2X2 + β3X3 + … + βkXk + μ. This is the most commonly used model form and can be estimated using the linear regression methods of mathematical statistics.
linear regression
The "order" counts the explanatory variables; the model above is called a (k-1)-variable linear regression model.
Unary regression
When there is only one explanatory variable, it is called a simple linear regression model, also called a bivariate regression model;
multiple regression
When there is more than one explanatory variable, the model is called a multiple linear regression model. The "order" counts the explanatory variables, so the model above is a (k-1)-variable linear regression model.
nonlinear regression
log-log model
The basic form is Y = e^(β1)·X^(β2)·e^(u), where u is the random disturbance term.
Expressed in natural logarithms: lnY = β1 + β2·lnX + u
β2 is the elasticity of Y with respect to X
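That β2 is the elasticity can be checked numerically: simulate data from a constant-elasticity relationship and recover the exponent by regressing lnY on lnX. All numbers below are invented for illustration.

```python
import numpy as np

# Simulate Y = exp(beta1) * X^beta2 * exp(u) with assumed true values
rng = np.random.default_rng(3)
beta1, beta2 = 0.5, 1.2
X = rng.uniform(1, 100, 500)
u = rng.normal(0, 0.05, X.size)
Y = np.exp(beta1 + beta2 * np.log(X) + u)

# OLS in logs: the slope estimates the constant elasticity beta2
slope, intercept = np.polyfit(np.log(X), np.log(Y), 1)
print(slope)  # close to the assumed elasticity 1.2
```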
semi-log model
Basic forms: lnY = α1 + α2X + u (log-lin) and Y = β1 + β2·lnX + u (lin-log)
This model is called the constant percentage growth model.
α2 equals the constant relative (percentage) change in Y per unit absolute change in X.
β2 equals the absolute change in the mean or expected value of Y per unit relative change in X.
reciprocal model
The reciprocal model Y = β1 + β2(1/X) + u: as X increases, Y decreases nonlinearly toward the intercept β1 when β2 > 0 (and increases toward it when β2 < 0); the intercept is the asymptote.
Classic single-equation econometric model ——Simple/univariate linear regression model
Regression analysis overview and regression function
1. The relationship between variables and Basic concepts of regression analysis
economic variables The relationship can be Divided into two categories
(1) Deterministic relationship or functional relationship
It studies deterministic relationships between non-random variables, e.g. Y = C + I + G.
(2) Statistical dependence or related relationship
It studies statistical relationships between random variables in non-deterministic phenomena, e.g. C = βY + u.
The investigation of statistical dependence between variables is mainly through
regression analysis
There is a causal relationship
Related analysis
no causal relationship
Correlation refers to the degree or strength of the relationship between two or more variables. Correlation analysis is the most basic method of studying the relationship between variables. The correlation coefficient derived from correlation analysis is a basic statistic of regression analysis. Mastering it will help you analyze and understand economic issues and econometric models.
①According to intensity
Complete correlation: a functional relationship between variables, e.g. a circle's circumference L = 2πR.
High correlation (strong correlation): an approximately functional relationship between variables, e.g. household income and expenditure in China.
Weak correlation: a relationship exists between variables but is not obvious, e.g. China's cultivated area and output in recent years.
Zero correlation: There is no relationship between variables. For example, the academic performance and age of students in a certain class.
Use scatter plots/correlation plots to depict
linear correlation
Positive correlation
Not relevant
negative correlation
Correlation coefficient: -1≤ρxy≤1
nonlinear correlation
Positive correlation
negative correlation
Not relevant
simple linear Related metrics
Use the simple linear correlation coefficient (the correlation coefficient for short), denoted ρ, to measure the strength of linear correlation between two variables.
In terms of random variables, ρ = Cov(X, Y) / √(Var(X)·Var(Y)).
Note: the correlation coefficient ρ refers to the population. In practice the data at hand are usually a sample; the sample correlation coefficient is written r, i.e. r is the estimate of the population correlation coefficient ρ.
The sample correlation coefficient is r = Σ(xi − x̄)(yi − ȳ) / √(Σ(xi − x̄)²·Σ(yi − ȳ)²), where n is the sample size, xi and yi are the sample observations, and x̄, ȳ are their sample means.
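The sample formula can be computed directly and cross-checked against numpy's built-in routine. The data below are invented for illustration.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.9])

# Sample correlation coefficient from the definition
xd, yd = x - x.mean(), y - y.mean()
r = (xd * yd).sum() / np.sqrt((xd ** 2).sum() * (yd ** 2).sum())

# Cross-check against numpy's built-in
assert abs(r - np.corrcoef(x, y)[0, 1]) < 1e-12
print(r)  # near 1: strong positive linear correlation
```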
Note
(1) Non-linear correlation does not mean irrelevance;
(2) Correlation analysis treats any (two) variables symmetrically, and both variables are regarded as random.
X has an impact on Y, and Y has an impact on X
There is an asymmetry in the way regression analysis handles variables, that is, it distinguishes between dependent variables (explained variables) and independent variables (explanatory variables): the former is a random variable, and the latter is not.
Only X has an effect on Y
regression analysis
Regression analysis is the theory and method of studying uncertain statistical relationships between variables.
Its basic idea
It consists in estimating and/or predicting the (overall) mean of the explained variable through the known or set value of the explanatory variable.
Here: the former variable is called the explained variable or the response variable, and the latter variable(s) is called the explanatory variable or the independent variable.
Regression analysis forms the methodological basis of econometrics; its main contents include
(1) Choose an appropriate method to estimate the parameters of the econometric model based on the sample observations and obtain the regression equation;
(2) Conduct significance tests on regression equations and parameter estimates;
(3) Use regression equations for analysis, evaluation and prediction.
Yt = β̂0 + β̂1·Xt + et (sample regression function). Taking the conditional expectation of both sides: E(Y|Xt) = β0 + β1Xt (population regression function)
Expected mean, overall regression line
Consider the linear regression (statistical) model Yi = β0 + β1Xi + μi, which represents the true relationship between Y and X. Here Y is the explained (dependent) variable, X the explanatory (independent) variable, μ the random error term, β0 the constant term, and β1 the regression coefficient (marginal effect). The model splits into two parts: (1) the regression function part E(Yi) = β0 + β1Xi, and (2) the random part μi.
2. Overall regression function
In Example 2.1, the consumption expenditure of an individual household is Yi = E(Y|Xi) + μi = β0 + β1Xi + μi (*)
That is, given the income level Xi, the individual household’s Expenditure can be expressed as the sum of two parts
(1) The average consumption expenditure E(Y|Xi) of all households at this income level, called the systematic or deterministic part
(2) Other random or non-deterministic μi.
The formula (*) is called the random setting form of the overall regression function (equation) PRF.
Since random terms are introduced into the equation, it becomes an econometric model, so it is also called an overall regression model.
It shows that in addition to the systematic influence of the explanatory variable, the explained variable is also affected by the randomness of other factors.
overall regression function
It depicts the change pattern between the explanatory variable X and the average/expected value of the explained variable Y.
The corresponding function E(Y|Xi) = f(Xi) is called the (bivariate) overall regression function (PRF)
The regression function (PRF) explains how the average state of the explained variable Y (overall conditional expectation) changes with the explanatory variable X.
Functional form: Can be linear or nonlinear.
In Example 2.1, when residents' consumption expenditure is regarded as a linear function of disposable income, E(Y|Xi) = β0 + β1Xi is linear. Here β0 and β1 are unknown parameters, called regression coefficients.
overall regression line
Drawing the scatter plot, we find that as income increases, consumption also increases "on average", and the conditional mean of Y all falls on a straight line with a positive slope. This straight line is called the overall regression line.
The expected trajectory of the explained variable Y for given values of the explanatory variable X is called the overall regression line/curve.
3. Random disturbance term
The overall regression function explains the average consumption expenditure level of households in the community at a given income level Xi. However, for an individual household, its consumption expenditure may deviate from this average level.
Write μi = Yi − E(Y|Xi).
μi is the deviation of the observed value Yi from its expected value E(Y|Xi). It is an unobservable random variable, also called the random disturbance term or random error term.
Random error terms mainly include The influence of the following factors (consumption)
1) The influence of neglected factors that are not important in the explanatory variables; (preference)
In addition to income affecting consumption, there are also commodity prices, consumer preferences and age, etc.
2) As a comprehensive representative of many small influencing factors;
Consumers’ mood, emotion, personality
Secondary influencing factors are automatically absorbed into the random disturbance term
3) The impact of observation errors on variable observation values;
There are inherent errors when collecting data, such as measurement errors.
4) The impact of the setting error of the model relationship;
5) The influence of other random factors.
Random human behavior, natural disasters, environmental factors
The mathematical model is poorly formed.
When building the model, if the functional form is chosen wrongly or the model is mis-designed, important factors may be omitted and end up in the random disturbance term.
Merger error (merger of two equations)
Generate and design random The main reasons for the error term:
1) Theoretical ambiguity
The consumption function has been studied thoroughly, but the distance between asteroids is relatively vague.
2) Lack of data;
Some data are hard to find, such as data on official corruption.
3) The principle of saving.
The simpler the model, the better
4. Sample regression function (SRF) /sample regression model
The sample regression function also has the random form Yi = β̂0 + β̂1Xi + ei
Here ei is the (sample) residual term, representing the combined effect of the other random factors influencing Yi; it can be regarded as the estimate μ̂i of μi.
Since random terms are introduced into the equation, it becomes an econometric model, so it is also called a sample regression model.
sample regression line
The sample scatter plot is approximately a straight line. Draw a straight line to best fit the scatter plot. Since the sample is taken from the population, the line can approximately represent the population regression line. This line is called the sample regression line. The functional form of the sample regression line is called the sample regression function (SRF)
The functional form of the sample regression line is called the sample regression function (SRF)
PPT 90-92 Case Analysis
The main purpose of regression analysis
Estimate the population regression function PRF based on the sample regression function SRF
Univariate linear regression model parameter estimates of
Assumptions about the random disturbance term μ (the classical assumptions). The first three chapters take these assumptions as given.
Zero mean assumption
Homoskedasticity assumption
No autocorrelation assumption
It is assumed that the random disturbance term is uncorrelated with the explanatory variables
Normality assumption
Assumptions about the explained variable y
Assumption 1: The conditional expectation of the random disturbance term is 0
Find the expectation on both sides of the equation at the same time. The expectation of the constant is itself
E(Y|Xi) = β1 + β2Xi
Assumption 2: Homoskedasticity assumption
Var(Yi|Xi)=σ²
Assumption 3: No autocorrelation assumption
Cov(Yi,Yj)=0
Assumption 4: Normality assumption
Yi ~ N(β1 + β2Xi, σ²)
If these assumptions are met, parameter estimation can be performed.
Model parameter estimation: ordinary least squares (OLS). 80% of exam questions use this method.
To make the sample regression function estimate the population regression function as closely as possible, the error between the fitted value Ŷi = β̂1 + β̂2Xi and the actual Yi must be minimized; that is, the residuals ei = Yi − β̂1 − β̂2Xi should be made as small as possible. In practice, the sum of squared residuals Σei² is minimized.
The principle of the least squares method is to determine the straight line position by "minimizing the sum of squares of the residuals". In addition to being more convenient in calculation, the least squares method also has excellent properties in estimators.
The principle of the least squares method: Find a straight line that minimizes the sum (sum of squares) of the longitudinal distances from all these points to the straight line minΣei²
Σei² = e1² + e2² + … + en²: minimize the residual sum of squares. Some households have positive residuals (e.g. +10) and others negative (e.g. −10); the signs could cancel to 0, so the squared terms must be used.
Dispersion form of ordinary least squares parameter estimators
Let xi = Xi − X̄ and yi = Yi − Ȳ; then β̂2 = Σxiyi / Σxi² and β̂1 = Ȳ − β̂2X̄
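The deviation form can be verified numerically against a library least-squares fit. The data below are invented for illustration.

```python
import numpy as np

X = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
Y = np.array([3.1, 6.8, 9.2, 13.1, 15.8])

# Deviation (dispersion) form: x_i = X_i - Xbar, y_i = Y_i - Ybar
x = X - X.mean()
y = Y - Y.mean()
b2 = (x * y).sum() / (x * x).sum()   # slope
b1 = Y.mean() - b2 * X.mean()        # intercept

# Same answer as numpy's least-squares polynomial fit
slope, intercept = np.polyfit(X, Y, 1)
assert abs(b2 - slope) < 1e-8 and abs(b1 - intercept) < 1e-8
print(b1, b2)
```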
Properties of least squares estimators: three desirable properties of point estimators are required.
Once the model parameters are estimated, the accuracy of the estimates must be considered: can they represent the true values of the population parameters? In other words, the statistical properties of the estimators must be examined. The quality of an estimator can be judged from the following aspects:
(1) Unbiasedness
That is, whether the mean of the estimator's sampling distribution equals the true value of the population parameter;
The mean of the sample statistic's sampling distribution equals the population parameter being estimated.
E(θ̂) = θ; θ̂ is then called an unbiased estimator of θ, indicating that on average it hits the target.
(2)Consistency
That is, when the sample size approaches infinity, does it converge to the true value of the population according to probability?
An asymptotic property: as n tends to infinity (the larger the sample), the closer the sample estimate is to the population parameter and the smaller the error.
If as the sample size n increases, the sample estimator becomes closer and closer to the true value of the population in a probability sense, then the estimator is said to be a consistent estimator of the parameter to be estimated.
(3) Effectiveness
That is, among all unbiased estimators, whether it has the smallest variance;
A comparison of variances: lower variance means smaller fluctuation around the true value, higher accuracy, and estimates closer to the population parameter.
Smaller variance and more efficient
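Unbiasedness can be illustrated by simulation: over many repeated samples, the average of the OLS slope estimates should approach the true slope. The true coefficients and sample design below are invented for illustration.

```python
import numpy as np

# Monte Carlo sketch of unbiasedness of the OLS slope estimator
rng = np.random.default_rng(42)
true_b0, true_b1 = 1.0, 0.5
X = np.linspace(0, 10, 20)

estimates = []
for _ in range(2000):
    Y = true_b0 + true_b1 * X + rng.normal(0, 1, X.size)
    b1, b0 = np.polyfit(X, Y, 1)
    estimates.append(b1)

print(np.mean(estimates))  # close to the true value 0.5
```

Individual estimates scatter around 0.5 (each sample gives a different answer), but their average converges to the true value, which is exactly what E(β̂) = β asserts.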
rational economic man
Univariate linear regression model Hypothesis testing and interval estimation
Goodness of fit test
Test the degree of fit between the sample regression line and the sample observation values, and evaluate the effect of model estimation.
Index to measure the goodness of fit: coefficient of determination (determination coefficient) R2
Question: Using the ordinary least squares estimation method has ensured that the model best fits the sample observations, why do we need to test the degree of fit?
Decomposition of sum of squares of total deviation/decomposition of variance: formula derivation method
The total dispersion of the observed Y values around their mean can be decomposed into two parts: one attributable to the regression line (ESS) and the other to the random disturbances (RSS).
For a given sample, TSS is fixed. The closer the actual observations lie to the sample regression line, the larger the share of ESS in TSS and the better the regression equation explains Y. Hence goodness of fit = regression sum of squares ESS / total dispersion TSS of Y.
The coefficient of determination R² of the sample
Value range: [0,1]
It is generally better when >0.7
The closer R² is to 1, the closer the actual observations lie to the sample regression line, the higher the goodness of fit, and the better the regression equation explains Y.
Features
①The determined coefficient is a non-negative statistic;
②The value range of the determination coefficient is 0≤R2≤1:
③The determination coefficient R2 is a function of the sample observation value and a random variable that changes with sampling.
The statistical reliability of the coefficient of determination should also be tested.
Ŷi − Ȳ is the difference between the fitted value from the sample regression and the mean of the observations, which can be regarded as the part explained by the regression line; ei = Yi − Ŷi is the difference between the actual observation and the fitted value, the part the regression line cannot explain. If Yi = Ŷi, i.e. the observation falls exactly on the sample regression line, the fit is perfect: all of the "dispersion" comes from the regression line and none from the residuals.
The proportion of the total variation in Y explained by variation in X, i.e. the explanatory power of the regression equation. Example: 98.69% of the sample variation in household consumption expenditure Y can be explained by changes in household income X.
The relationship between the coefficient of determination R² and the correlation coefficient r
The coefficient of determination is numerically equal to the square of the simple linear correlation coefficient.
But conceptually the two are clearly different.
First, R² is defined in terms of the estimated regression model: it measures how well the model fits the sample observations, i.e., the degree to which the explanatory variables explain the variation of the explained variable. The correlation coefficient r is defined for a pair of variables: it measures the degree of linear dependence between them.
Second, the coefficient of determination reflects the asymmetric, regression-based relationship between the explanatory and explained variables: it gives the proportion of the variation of Y explained by X, but says nothing about Y explaining X. By contrast, r measures the symmetric correlation between X and Y and involves no causal direction.
Moreover, the coefficient of determination is non-negative, with range 0 ≤ R² ≤ 1, while the correlation coefficient can be positive or negative, with range −1 ≤ r ≤ 1.
In econometrics, the estimation, testing and application of regression models are mainly studied, so from the practical application of regression analysis, the determination coefficient is more meaningful than the correlation coefficient.
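The numerical identity R² = r² in the one-variable case can be verified directly; a short sketch on simulated data (the data-generating process is made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=50)
Y = 2.0 + 1.5 * X + rng.normal(scale=0.5, size=50)

# simple correlation coefficient r between X and Y
r = np.corrcoef(X, Y)[0, 1]

# R² from the fitted simple linear regression
b2 = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean())**2)
b1 = Y.mean() - b2 * X.mean()
Y_hat = b1 + b2 * X
R2 = np.sum((Y_hat - Y.mean())**2) / np.sum((Y - Y.mean())**2)

print(R2, r**2)   # numerically equal in the one-variable case
```

The equality holds only for simple (one-regressor) linear regression with an intercept; with multiple regressors R² is no longer the square of any single pairwise r.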
Interval estimates of parameters
regression parameters Statistics distribution
The distributional properties of the OLS estimators: β̂i estimates βi, i.e., sample statistics stand in for population parameters.
Small sample: t distribution; large sample: standard normal distribution. β̂1 and β̂2 both follow normal distributions.
In small samples, if the unbiased estimate σ̂² is used in place of σ² to compute the standard error, the standardized statistic no longer follows the normal distribution but a t distribution with n−2 degrees of freedom.
Generally speaking, after standardization β̂1 and β̂2 follow the t distribution with n−2 degrees of freedom.
In large samples, this can be treated approximately as a normal distribution.
regression parameters interval estimate
Regression analysis hopes to use the sample estimate β̂1 in place of the population parameter β1.
Hypothesis testing can test the range of possible hypothesis values of the population parameter (such as whether it is zero) through the results of a sampling, but it does not indicate how "close" the sample parameter value is to the true value of the population parameter in a sampling.
To judge how well the sample estimate can "approximately replace" the true value of the population parameter, an interval centered on the sample estimate is constructed, and one asks with what probability this interval contains the true parameter value. This method is the confidence-interval estimation of parameters.
To judge how "close" the estimate β̂ is to the true parameter value, pre-select a probability α (0 < α < 1) and find a positive number δ such that the random interval (β̂ − δ, β̂ + δ) contains the true parameter value with probability 1 − α, i.e., P(β̂ − δ ≤ β ≤ β̂ + δ) = 1 − α.
If such an interval exists, it is called a confidence interval; 1-α is called the confidence coefficient (confidence level); α is called the significance level; The endpoints of a confidence interval are called confidence limits or critical values.
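The small-sample confidence interval for the slope, β̂2 ± t(α/2, n−2)·se(β̂2), can be computed directly; a minimal sketch on simulated data (the data and true parameters are hypothetical):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 20
X = rng.uniform(0, 10, n)
Y = 1.0 + 0.5 * X + rng.normal(scale=1.0, size=n)

Sxx = np.sum((X - X.mean())**2)
b2 = np.sum((X - X.mean()) * (Y - Y.mean())) / Sxx
b1 = Y.mean() - b2 * X.mean()
resid = Y - (b1 + b2 * X)

sigma2_hat = np.sum(resid**2) / (n - 2)        # unbiased estimate of σ²
se_b2 = np.sqrt(sigma2_hat / Sxx)              # standard error of the slope

alpha = 0.05
t_crit = stats.t.ppf(1 - alpha / 2, df=n - 2)  # small sample: t(n−2) critical value
ci = (b2 - t_crit * se_b2, b2 + t_crit * se_b2)
print(ci)                                      # 95% confidence interval for β2
```

In large samples the t critical value is simply replaced by the standard-normal one (≈1.96 at the 5% level).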
From the known to the unknown; infer the whole from the parts; know more from less. Example: pp. 149–151.
standard error
Estimate of the population regression standard deviation σ: S.E. of regression
Interval estimation of regression coefficients: 3 situations
Textbook P41
Significance test of variables (t test)
Regression analysis is to determine whether the explanatory variable X is a significant influencing factor of the explained variable Y.
In the one-variable linear model, it is necessary to determine whether X has a significant linear impact on Y. This requires a significance test of variables. The method used to test the significance of variables is hypothesis testing in mathematical statistics.
In econometrics, significance testing is mainly conducted on whether the true value of the parameter of a variable is zero.
The sample estimate β̂1 varies from sample to sample, but the population parameter β1 is a single constant.
Hypothesis testing
Hypothesis testing makes an assumption in advance about a population parameter or the form of the population distribution, then uses sample information to judge whether that null hypothesis is reasonable, i.e., whether the sample information differs significantly from it, and accordingly decides whether to accept or reject the null hypothesis.
Formulate hypotheses - draw samples - make decisions
The logical reasoning method used in hypothesis testing is proof by contradiction.
The basic idea: first assume that the null hypothesis is correct, and then based on the sample information, observe whether the results caused by this assumption are reasonable, and use appropriate statistics that conform to a certain probability distribution and a given significance level.
The judgment of whether the result is reasonable rests on the principle that "small-probability events are unlikely to occur in a single sampling". If such an event does occur, the null hypothesis is judged incorrect and is rejected.
To test whether an explanatory variable in the regression model significantly affects the explained variable, the null hypothesis is usually H0: β1 = 0, and the test asks whether this hypothesis holds.
Suppose β̂1 is calculated as 0.53. Different data yield different results, so 0.53 by itself is not stable or robust; a number obtained from one particular sample is not very reliable.
Most tests use α = 0.05; a smaller α is a stricter standard and makes the null hypothesis harder to reject. Different standards lead to different test conclusions. We want the coefficient to be significant, so that the result is meaningful.
P value
The p value is a probability: the probability, under the null hypothesis, of obtaining a statistic at least as extreme as the one calculated from the sample.
Take a two-sided test with statistic U as an example: if the value computed from the sample is U0, the p value is defined by P{|U| ≥ U0} = p.
The output results of most computer software report p values.
What is the relationship between the p value and the test level α?
α is set by the researcher; the p value is computed from the sample and is the exact observed significance level.
When P<α, the value of the statistic is in the rejection region of the null hypothesis, so the test conclusion is to reject the null hypothesis at the α level;
When p > α, the statistic falls in the acceptance region of the null hypothesis, so the conclusion is to accept the null hypothesis at level α.
Relationship between α and p, via comparing the calculated value with the critical value: e.g., if |t| = 1.36 < 2 (the critical value), then p > α and the null hypothesis is accepted; if p < α, it is rejected.
converted into corresponding probabilities
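The t statistic, its two-sided p value, and the p-versus-α decision rule can be sketched in a few lines; the data here are simulated purely for illustration:

```python
import numpy as np
from scipy import stats

# Hypothetical data: the true slope is 0.5, so H0: β2 = 0 should tend to be rejected
rng = np.random.default_rng(2)
n = 20
X = rng.uniform(0, 10, n)
Y = 1.0 + 0.5 * X + rng.normal(scale=1.0, size=n)

Sxx = np.sum((X - X.mean())**2)
b2 = np.sum((X - X.mean()) * (Y - Y.mean())) / Sxx
b1 = Y.mean() - b2 * X.mean()
resid = Y - (b1 + b2 * X)
sigma2_hat = np.sum(resid**2) / (n - 2)
se_b2 = np.sqrt(sigma2_hat / Sxx)

t_stat = b2 / se_b2                                      # test statistic for H0: β2 = 0
p_value = 2 * (1 - stats.t.cdf(abs(t_stat), df=n - 2))   # two-sided p value

alpha = 0.05
t_crit = stats.t.ppf(1 - alpha / 2, df=n - 2)
reject = p_value < alpha     # p < α  ⇔  |t| > critical value
print(t_stat, p_value, reject)
```

The two decision rules (compare p with α, or compare |t| with the critical value) always agree, which is the point of the comparison described above.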
Univariate linear regression model predictions
Point forecast for Y0 mean
By directly substituting the predicted value Xf of the explanatory variable into the estimated sample regression function, the predicted value of the mean value of the explained variable can be calculated.
Confidence interval for the Y0 mean prediction
Forecast the average value of Y0 to see what distribution it obeys
See textbook 44-45 for details.
Prediction interval for the overall value of Y0
Convert the obeyed distribution into t distribution, and the residual ei obeys the normal distribution
in conclusion
1. The prediction interval for individual values is wider than the prediction interval for the average value;
2. For a fixed sample size, the confidence band is narrowest at the mean X̄ and widens as Xf moves away from it. When using a regression model for prediction, Xf should therefore not deviate too far from X̄, or prediction accuracy falls sharply; in practice we generally predict only, e.g., the next year;
3. Prediction accuracy improves as the sample size grows, and is poor when the sample is too small. As n → ∞, the sampling error tends to 0, so the prediction error of the mean value also tends to 0, while the prediction error of an individual value then depends only on the variance σ² of the random disturbance term μi. The larger the sample size n, the higher the prediction accuracy; the smaller n, the lower;
Multiple linear regression model
multiple linear regression Models and classical assumptions
General form/random disturbance term form
Yi = β1 + β2X2i + β3X3i + … + βkXki + μi
There are k parameters and k−1 explanatory variables X.
A (k−1)-variable regression model.
In the model, βj (j = 1, 2, …, k) is a partial regression coefficient.
In the real economy, there is rarely one explanatory variable, and there are many variables that can be measured and observed.
Holding the other explanatory variables constant, βj measures the effect of a one-unit change in the jth explanatory variable on the mean of the explained variable.
In matrix/vector form the multivariate linear model is Y = Xβ + μ.
Classic assumptions of multivariate linear models
1. Zero mean assumption/error term is unbiased—the mean of the random disturbance term is 0: E(μi)=0
2. Homoscedasticity and autocorrelation-free/random disturbance terms are uncorrelated with each other and have the same variance: Var(U)=σ²In
3. The random disturbance term is uncorrelated with the explanatory variables: Cov(Xji, μi) = 0, j = 2, 3, …, k
4. No multicollinearity. This assumption is unique to the multiple linear regression model: the k−1 explanatory-variable vectors are linearly independent, so the explanatory-variable matrix X has full column rank: rank(X) = k
5. Normality assumption: the random disturbance term μi follows a normal distribution: μi ~ N(0, σ²), i = 1, 2, …, n
Desired properties of the OLS estimators: unbiasedness, efficiency, and consistency, so that the sample estimates can stand in for the population parameters.
multiple linear regression Model estimation OLS
Least squares estimates of parameters
Derivation of parameter estimates
Minimum sum of squares of residuals
Textbook P 67-68 Process
Variance-covariance matrix of multivariate model parameters
The sample standard deviation of a parameter estimator is also called the coefficient standard error.
Example: PPT207-214
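The closed-form estimator β̂ = (X'X)⁻¹X'y, together with σ̂² and the variance-covariance matrix σ̂²(X'X)⁻¹, can be sketched in a few lines of matrix code (the data-generating process is simulated for illustration):

```python
import numpy as np

rng = np.random.default_rng(4)
n, k = 100, 3                          # k parameters: intercept + 2 slopes
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta_true = np.array([1.0, 2.0, -0.5]) # hypothetical true parameters
y = X @ beta_true + rng.normal(scale=0.3, size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y           # β̂ = (X'X)⁻¹ X'y

resid = y - X @ beta_hat
sigma2_hat = resid @ resid / (n - k)   # σ̂² with n−k degrees of freedom
cov_beta = sigma2_hat * XtX_inv        # variance-covariance matrix of β̂
se_beta = np.sqrt(np.diag(cov_beta))   # coefficient standard errors
print(beta_hat, se_beta)
```

In practice one would use a numerically stable solver (e.g. `np.linalg.lstsq`) rather than an explicit inverse; the explicit form is shown here only to mirror the textbook derivation.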
Variable Observability and Data Quality
Properties of Parametric Least Squares Estimation
The least squares estimator of a multiple linear regression model is also the best linear unbiased estimator
linear properties
Unbiasedness: by the zero-mean assumption, β̂ is an unbiased estimate of β.
minimum variance
The least squares estimator β̂ of the parameter vector β has the smallest variance among all linear unbiased estimators of β.
multiple linear regression Model testing
Goodness of fit test
Multiple determination coefficient R²
Value range: [0,1]
The closer R² is to 1, the better the model fits the data.
Example questions P217-221: Increase the number of explanatory variables in sequence. Each time an explanatory variable is added, R² will increase. But when R² is the largest, it does not necessarily prove that this is an optimal model or that it has the highest degree of explanation.
What kind of model is the optimal model?
Look at the accuracy it requires
Look at the cost
It is not about including the most variables; only the most appropriate, scientifically chosen ones are best.
Which one is optimal based on theoretical analysis, statistical indicators and actual conditions?
The modified coefficient of determination is more accurate than the original R²
The raw multiple coefficient of determination R² never decreases as explanatory variables are added, so R² values cannot always be compared directly across models. If model A has 3 explanatory variables with R² = 0.9 and model B has 100 with R² = 0.97, model B's higher degree of explanation is partly automatic inflation from adding regressors; there is "water" in it, and as a measure of accuracy the raw R² is misleading.
The multiple coefficient of determination has an important property: it is a non-decreasing function of the number of explanatory variables in the model. With the sample size fixed, as explanatory variables are added, the total sum of squares TSS does not change, while the explained sum of squares ESS may increase, so R² grows. This is a defect when using R² to compare the fit of two models with the same explained variable but different numbers of explanatory variables: the values cannot simply be compared directly. R² involves only variation and ignores degrees of freedom. Correcting the variation by degrees of freedom removes this difficulty: with a fixed sample size, adding explanatory variables increases the number of parameters to estimate and costs degrees of freedom. Degrees of freedom are therefore used to correct the residual sum of squares and the regression sum of squares in R², yielding the adjusted coefficient of determination R̄².
Adjusted (corrected) coefficient of determination R̄²
It should be emphasized that both the coefficient of determination and the adjusted coefficient of determination computed for a sample-estimated regression model are random variables that change with sampling. In applied econometric analysis, one usually hopes that the model's R² or R̄² is large, i.e., close to 1, indicating a good fit to the data.
However, the coefficient of determination measures only goodness of fit. A large R² or R̄² means only that the explanatory variables included in the model jointly have a large impact on the explained variable; it does not mean that each individual explanatory variable has a large influence.
In regression analysis, not only must the model have a high degree of fitting, but also a reliable estimator of the overall regression coefficient must be obtained. Therefore, when selecting a model, the quality of the model cannot be judged simply by the level of the coefficient of determination.
R² rises as explanatory variables are added, while the adjusted R̄² first rises and then falls. In the example above, the model with 3 or 4 explanatory variables therefore fits the data best on balance.
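The adjustment is R̄² = 1 − (1−R²)(n−1)/(n−k). A short sketch showing that adding an irrelevant regressor cannot lower R² while the degrees-of-freedom penalty holds R̄² ≤ R² (the data are simulated for illustration):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 40
x1 = rng.normal(size=n)
x_noise = rng.normal(size=n)                  # irrelevant regressor
y = 1.0 + 2.0 * x1 + rng.normal(scale=0.5, size=n)

def fit_r2(X):
    nn, k = X.shape
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ beta
    R2 = 1 - resid @ resid / np.sum((y - y.mean())**2)
    R2_adj = 1 - (1 - R2) * (nn - 1) / (nn - k)   # R̄² = 1 − (1−R²)(n−1)/(n−k)
    return R2, R2_adj

X_small = np.column_stack([np.ones(n), x1])
X_big = np.column_stack([np.ones(n), x1, x_noise])

R2_s, R2adj_s = fit_r2(X_small)
R2_b, R2adj_b = fit_r2(X_big)
print(R2_s, R2adj_s, R2_b, R2adj_b)
```

Raw R² can only rise when a regressor is added, whereas R̄² rises only if the new variable explains enough to offset the lost degree of freedom; this is why R̄² is used for comparing models with different numbers of regressors.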
t-test
PPT P225-232 Textbook P 75-76
The purpose of multiple linear regression analysis is not only to obtain a model with higher goodness of fit, nor to seek the significance of the overall equation, but to make meaningful estimates of each overall regression parameter.
Because the overall linear relationship of the equation is significant does not necessarily mean that the impact of each explanatory variable on the explained variable is significant. Therefore, significance testing must also be performed on each explanatory variable separately.
The significance test of each regression coefficient in multiple regression analysis is to test whether the explanatory variable corresponding to the regression coefficient has a significant impact on the explained variable when other explanatory variables remain unchanged. The test method is basically the same as that of simple linear regression.
F test
Since the multiple linear regression model contains multiple explanatory variables, whether there is a significant linear relationship between them and the explained variables requires further judgment. That is to say, we need to make inferences about whether the linear relationship between the explained variable and all explanatory variables in the model is significant as a whole.
Textbook P73-74
In the case of single regression, since there is only one explanatory variable, there is no overall test problem of the joint influence of the explanatory variables, and there is no need for the F test. In essence, the F test is consistent with the t test, that is, the F statistic is equal to the square of the t statistic.
The formula shows that the statistics F and R² move in the same direction, so the two are consistent: generally, the better the model fits the observations, the more significant the overall linear relationship of the model. Finding the quantitative link between the two test statistics, so that they verify each other, is therefore useful in practice. R² has no exact distribution and offers no definite threshold for how large it must be for the model to pass; the statistic F, by contrast, has an exact distribution, so a statistically rigorous conclusion can be drawn at a given significance level.
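The link between F and R², F = [R²/(k−1)] / [(1−R²)/(n−k)], and the identity F = t² in the single-regressor case can both be verified numerically (simulated data, hypothetical parameters):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 25
X = rng.normal(size=n)
Y = 1.0 + 0.7 * X + rng.normal(size=n)

Sxx = np.sum((X - X.mean())**2)
b2 = np.sum((X - X.mean()) * (Y - Y.mean())) / Sxx
b1 = Y.mean() - b2 * X.mean()
resid = Y - (b1 + b2 * X)
sigma2_hat = np.sum(resid**2) / (n - 2)

t_stat = b2 / np.sqrt(sigma2_hat / Sxx)        # t statistic for the slope

R2 = 1 - np.sum(resid**2) / np.sum((Y - Y.mean())**2)
F_stat = (R2 / 1) / ((1 - R2) / (n - 2))       # F = [R²/(k−1)] / [(1−R²)/(n−k)], k = 2
print(F_stat, t_stat**2)
```

Because F is a monotone increasing function of R² (with n and k fixed), a larger R² always means a larger F, which is the "same direction" consistency noted above.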
multiple linear regression model predictions
point prediction
Substitute data directly
Interval forecast
Interval prediction of mean E(Y0)
analogy to one dollar
Interval forecast for individual values Y0
violate classic assumed
multicollinearity
definition
perfect multicollinearity
With no error term: an exact linear relationship holds among the explanatory variables.
incomplete multicollinearity
With an error term added: only an approximate linear relationship holds among the explanatory variables.
Is there a high degree of linear correlation between multiple explanatory variables?
If no complete or incomplete linear relationship exists between the explanatory variables, there is said to be no multicollinearity. In matrix terms, X has full column rank: rank(X) = k, so (X'X)⁻¹ exists.
The possible relationships among explanatory variables in a regression model fall into three situations:
(1) r(xi, xj) = 0: no linear relationship between the explanatory variables; the variables are mutually orthogonal. In this case there is actually no need for multiple regression: each parameter can be estimated by a simple regression of Y on the corresponding X.
(2) r(xi, xj) = 1: complete collinearity among the explanatory variables. The model parameters are then indeterminate: intuitively, when two variables move together, it is very hard to separate the impact of each explanatory variable on the explained variable.
(3) 0 < r(xi, xj) < 1: some degree of linear relationship between the explanatory variables. This is the situation most often met in practice.
It needs to be emphasized that there is no linear relationship between explanatory variables, which does not mean that there is no nonlinear relationship. When there is a nonlinear relationship between explanatory variables, it does not violate the assumption of no multicollinearity.
As the degree of collinearity increases, the accuracy and stability of the parameter estimates deteriorate; incomplete multicollinearity is therefore in fact a serious problem.
generate background
(1) Economic variables share common trends. With time-series data, for example, income, consumption, and employment are all affected by the business cycle: over time they all grow during upturns and all decline during contractions. When such variables enter the model together as explanatory variables, multicollinearity arises.
(2) The model includes lagged variables. When lagged values of the explanatory variables are introduced into the model, successive values of the same variable are correlated with each other.
(3) Multicollinearity can also occur in models built from cross-sectional data, since explanatory variables are often closely related in economic terms — samples of Chinese provinces, listed companies, or large firms, for instance. Jiangsu and Zhejiang are close in space, economically interconnected, and both relatively developed. In cross-sectional modeling, many variables move with the scale of development and show a common growth trend: inputs such as capital, labor, technology, and energy are all related to the scale of output, so multicollinearity arises easily. It also arises when changes in some factors are highly correlated with changes in others. For example, a regression of grain output on fertilizer consumption, irrigated area, and agricultural investment performed poorly, because the effect of agricultural investment was already reflected through the two factors of fertilizer consumption and irrigated area.
(4) The sample data themselves may be the cause: when sampling from the population, correlations can appear, and multicollinearity arises easily. Related variables may be selected at modeling time, and the influencing factors are somewhat correlated. For example, sampling may cover only a limited range of the explanatory-variable values in the population, so the variables vary little; or restrictions in the population make the sample data on several explanatory variables correlated. Consider the multiplier effect of investment and consumption: one yuan of consumption can stimulate several times as much national income.
For example, GDP and CPI are positively correlated, influence each other, and spiral upward.
have consequences
Consequences of complete multicollinearity
Rare in practice. The inverse matrix (X'X)⁻¹ does not exist, the normal equations have infinitely many solutions, and the parameter estimates are indeterminate: the parameters cannot be estimated.
The variance of the parameter estimates is infinite
Consequences of incomplete multicollinearity Worse
1. The variance of the parameter estimator is large, the covariance increases, and the ordinary least squares parameter estimator is not effective.
R²₂₃ is the squared correlation coefficient between the variables X2 and X3; the larger it is, the larger the variance of the estimator, and the less efficient OLS becomes.
2. The economic meaning of the parameter estimator is unreasonable, and the coefficient sign is inconsistent with the theory.
There may be multiple linear correlations, with one variable standing in for a related variable; a coefficient sign contradicting theory is an obvious warning signal.
See PPT analysis
3. The contribution of each explanatory variable to the regression sum of squares cannot be accurately distinguished.
Suppose X3 = R23·X2 (0 < R23 < 1): there is a positive correlation between X2 and X3, and the two exert cross-influence.
Ŷ = β̂1 + β̂2X2 + β̂3X3 = β̂1 + (β̂2 + β̂3·R23)X2
The economic meaning of β̂2 itself: with the other variables unchanged, each unit increase in X2 raises the explained variable Y by β̂2 units. But this interpretation presupposes that X2 is unrelated to X3, so that β̂2 alone measures the impact of X2 on Y.
If X2 and X3 are related, β̂2 no longer represents the influence of X2 on Y: a change in X2 also changes X3, which in turn affects Y. β̂2 then loses its meaning as an estimate, because the effects of X2 and X3 cannot be cleanly separated, which undermines the interpretation of the coefficients.
4. The t values shrink, the significance tests of the variables become meaningless, and spurious t-test results are obtained. This is the most serious harm: there is no way to know whether a variable is genuinely significant or not, so the result loses research value.
What impact does it have on parameter estimation and hypothesis testing?
Or is it due to the problem of multicollinearity that the variables interfere with each other and we cannot use the t test to estimate the results?
By the formula for the t value, a high degree of collinearity rapidly inflates the variance of the parameter estimates, which shrinks the t value; the null hypothesis "the coefficient is 0", which should be rejected, is then mistakenly accepted, and the t test becomes invalid.
The F-test is also distorted and invalid.
5. The confidence interval becomes wider and the prediction function of the model fails.
The confidence interval is enlarged, the prediction accuracy decreases, and the error becomes larger, which affects the parameter estimation and hypothesis testing of the model.
test
Simple correlation coefficient matrix method
It is often used in time series, and you can intuitively see the degree of correlation between variables.
The simple correlation coefficient test method is a simple method that uses the degree of linear correlation between explanatory variables to determine whether there is serious multicollinearity.
Generally speaking, if the simple correlation coefficient (zero-order correlation coefficient) of each two explanatory variables is relatively high, such as greater than 0.8, it can be considered that there is serious multicollinearity.
However, it should be noted that a higher simple correlation coefficient is only a sufficient condition for the existence of multicollinearity, not a necessary condition. Especially in regression models with more than two explanatory variables, sometimes low simple correlation coefficients may also suffer from multicollinearity. Therefore, multicollinearity cannot be accurately judged simply based on the correlation coefficient.
Comprehensive judgment method of variable significance (t) and equation significance (F, R²)
If R2 is large and the F value is significantly greater than the critical value at a given significance level, but the partial regression coefficient value corresponding to the variable is not significant, it indicates that the model has multicollinearity.
auxiliary regression method
Summary of methods for multicollinearity diagnosis
(1) The correlation coefficient between explanatory variables is large
(2) In the regression model, although R2 is large, the signs of some important explanatory variables do not meet economic significance, and the parameters do not pass the significance test;
(3) In the auxiliary regression of Xj on the other explanatory variables, if the regression parameters pass the significance test and the VIF is much greater than 10, the degree of multicollinearity is serious;
(4) If the newly added explanatory variables in the original equation can improve the goodness of fit, but have a significant impact on the signs and significance tests of other parameters, it can be judged that the new explanatory variables have caused multicollinearity.
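The auxiliary-regression diagnostic VIF_j = 1/(1 − R_j²) can be sketched directly; the variables below are simulated so that one pair is nearly collinear (all names and data are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 100
x2 = rng.normal(size=n)
x3 = x2 + rng.normal(scale=0.05, size=n)   # nearly collinear with x2
x4 = rng.normal(size=n)                    # unrelated regressor

def vif(cols, j):
    """VIF_j = 1/(1 − R_j²) from the auxiliary regression of
    column j on the other explanatory variables (plus a constant)."""
    yj = cols[:, j]
    others = np.delete(cols, j, axis=1)
    Xa = np.column_stack([np.ones(len(yj)), others])
    beta = np.linalg.lstsq(Xa, yj, rcond=None)[0]
    resid = yj - Xa @ beta
    R2 = 1 - resid @ resid / np.sum((yj - yj.mean())**2)
    return 1.0 / (1.0 - R2)

X = np.column_stack([x2, x3, x4])
vifs = [vif(X, j) for j in range(3)]
print(vifs)
```

The collinear pair produces VIFs far above the conventional threshold of 10, while the unrelated variable's VIF stays near 1, matching the rule of thumb in point (3).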
Remedies and Corrections Improvements
1. Eliminate variables method
2. Increase sample capacity
3. Use prior information to change the constraint form of parameters
4. Transform the model into a difference equation form
5. Stepwise regression method (key points)
step
The explained variable Y is regressed on each explanatory variable Xi (i = 1, 2, …, k) separately; each regression equation is assessed against economic theory and statistical tests, and the best one is chosen as the basic regression equation. On this basis, other explanatory variables are introduced one at a time and the regression re-run, gradually expanding the number of explanatory variables in the model until the best overall estimated model emerges.
1. y performs a one-way regression on each explanatory variable
2. Choose the one with the largest R2 as the basic equation
3. Add explanatory variables one at a time, requiring R̄² to increase and each variable's t test to remain significant.
4. If the t-test of other explanatory variables is not significant, delete the variable
5. Select the optimal regression model
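The steps above can be sketched as a simplified forward selection using adjusted R̄² as the sole criterion (the textbook's procedure also checks t significance and deletes variables that become insignificant; the data and true coefficients here are made up):

```python
import numpy as np

rng = np.random.default_rng(8)
n = 200
X_all = rng.normal(size=(n, 4))   # candidate explanatory variables (columns 0-3)
# Hypothetical truth: only columns 0 and 1 matter, column 0 most strongly
y = 3.0 * X_all[:, 0] + 2.0 * X_all[:, 1] + rng.normal(scale=0.1, size=n)

def adj_r2(y, X):
    nn, k = X.shape
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ beta
    R2 = 1 - resid @ resid / np.sum((y - y.mean())**2)
    return 1 - (1 - R2) * (nn - 1) / (nn - k)

selected, remaining = [], list(range(X_all.shape[1]))
best = -np.inf
while remaining:
    # try adding each remaining variable; keep the one that improves R̄² most
    scores = {j: adj_r2(y, np.column_stack([np.ones(n)]
                        + [X_all[:, i] for i in selected + [j]]))
              for j in remaining}
    j_best = max(scores, key=scores.get)
    if scores[j_best] <= best:    # no improvement in adjusted R̄² → stop
        break
    best = scores[j_best]
    selected.append(j_best)
    remaining.remove(j_best)
print(selected)
```

Step 2 of the procedure corresponds to the first pick (the best single-regressor equation); steps 3-4 correspond to the add-and-check loop.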
6. Ridge regression estimation
Heteroskedasticity
autocorrelation
Distributed lag model and autoregressive model
Dummy variable regression