Applied Economics, Lecture 3: Outliers, Leverage and Influence

Shared by: Truong Doan | File type: PDF | Pages: 8



Applied Econometrics
Lecture 3: Outliers, Leverage and Influence
Written by Nguyen Hoang Bao, May 20, 2004

‘Life is the art of drawing sufficient conclusions from insufficient premises’ (Samuel Butler)

1) Introduction

The estimates of the regression parameters can be strongly influenced by a few extreme observations. A residual plot lets us pick out the individual data points whose values are unusually high or low relative to the fitted line. We can therefore use the residual plot to find outliers, that is, points which are inadequately captured by the regression model itself.

2) Identification of outliers

The percentiles that cut the data into four quarters have special names: the 25th and 75th percentiles are called the lower and upper quartiles (QL and QU). The lower quartile is the [integer((n+1)/2)+1]/2-th value from the bottom of the ordered list; the upper quartile is the [integer((n+1)/2)+1]/2-th value from the top.

A data point Y0 is considered to be an outlier if

  Y0 < QL - 1.5 IQR   or   Y0 > QU + 1.5 IQR

where IQR is the inter-quartile range (IQR = QU - QL) (Source: Hoaglin, 1983).

3) Outliers

An outlier is a point which is far removed from its fitted value (i.e., it has a large residual). Large in this context does not refer to the absolute size of a residual but to its size relative to most of the other residuals in the regression. In univariate analysis, an outlier is defined with reference to the mean of its own variable. In bivariate analysis, an outlier is a point with a large residual (i.e., its Y value is far removed from its fitted value).

Apart from graphical methods, we can also rely on special statistics to detect outliers. In order to compare a large residual with the other residuals, we may calculate the standardized residual, which is simply the residual divided by the standard error of the estimate ($e_i / s$). But an outlier in the data set will itself inflate the standard error of the regression, which makes the standardized residual look smaller than it should. Hence we use the studentized residual instead.
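Returning briefly to the quartile rule of section 2, the following Python sketch shows one way to apply it. The function names are hypothetical, the sample is the Y column of Regression 3 from the workshop data table, and one assumption goes beyond the lecture text: when the depth [integer((n+1)/2)+1]/2 is not a whole number, the two neighbouring ordered values are averaged.

```python
import math


def quartiles(values):
    """Return (QL, QU) located at depth [integer((n+1)/2)+1]/2 from each end."""
    ordered = sorted(values)
    n = len(ordered)
    depth = (math.floor((n + 1) / 2) + 1) / 2  # 1-based position counted from either end
    lo, hi = math.floor(depth), math.ceil(depth)
    # Assumption (not stated in the lecture): a fractional depth such as 3.5
    # is resolved by averaging the two neighbouring ordered values.
    q_lower = (ordered[lo - 1] + ordered[hi - 1]) / 2
    q_upper = (ordered[-lo] + ordered[-hi]) / 2
    return q_lower, q_upper


def iqr_outliers(values):
    """Return the points lying outside QL - 1.5 IQR and QU + 1.5 IQR."""
    q_lower, q_upper = quartiles(values)
    iqr = q_upper - q_lower
    low_fence = q_lower - 1.5 * iqr
    high_fence = q_upper + 1.5 * iqr
    return [v for v in values if v < low_fence or v > high_fence]


# Y column of Regression 3 in the workshop data table.
y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]

print("QL, QU:", quartiles(y3))
print("outliers:", iqr_outliers(y3))
```

With n = 11 the depth is 3.5, so each quartile is the average of the 3rd and 4th values from its end of the ordered list, and the only value outside the fences is 12.74.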
The studentized residual is defined as

$$ t_i = \frac{e_i}{s_{(i)} \sqrt{1 - h_i}} $$

where
  $e_i$ is the residual ($e_i = y_i - \hat{y}_i$),
  $s_{(i)}$ is the standard error of the regression estimated after dropping the ith observation from the sample,
  $h_i$ is the hat statistic for the ith observation, which is defined as

$$ h_i = \frac{1}{n} + \frac{(X_i - \bar{X})^2}{\sum_{j=1}^{n} (X_j - \bar{X})^2} $$

The additional term in the denominator, $\sqrt{1 - h_i}$, is necessary because the residuals do not have constant variance: the variance of the ith residual is proportional to $1 - h_i$, so it differs across observations even when the errors themselves have constant variance. With this adjustment factor we get a t-statistic, which tests whether the ith residual is significantly different from zero and hence signals an outlier, a point which does not really fit the overall pattern. Alternatively, the same test is given by the t-statistic of the coefficient of a dummy variable which picks out the single observation i from the sample.

4) Leverage

A data point has high leverage if it is far removed in the X-direction (i.e., it is a disproportionate distance away from the middle range of the X values) (Myers, 1990). Points of high leverage can exert undue influence on the outcome of a least squares regression. That is, points with high leverage are capable of exerting a strong pull on the slope of the regression line.

In univariate analysis, the definitions of an outlier and of a point of leverage are the same: a point which is an outlier also has high leverage with respect to the mean. In bivariate analysis, a point of high leverage (with respect to the slope coefficient) is one which is far removed in the X-direction (as opposed to an outlier, which is far removed in the Y-direction).

A test statistic for leverage is the hat statistic:

$$ h_i = \frac{1}{n} + \frac{(X_i - \bar{X})^2}{\sum_{j=1}^{n} (X_j - \bar{X})^2} $$
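As an illustration of the hat statistic formula, here is a minimal Python sketch. The function name `hat_values` is hypothetical, and the sample is the X4 column of Regression 4 from the workshop data table, chosen because ten of its eleven values are identical and one lies far away in the X-direction.

```python
def hat_values(x):
    """Hat statistic h_i = 1/n + (X_i - mean)^2 / sum_j (X_j - mean)^2."""
    n = len(x)
    mean_x = sum(x) / n
    sxx = sum((xj - mean_x) ** 2 for xj in x)  # overall variability along the X-axis
    return [1 / n + (xi - mean_x) ** 2 / sxx for xi in x]


# X4 column of Regression 4 in the workshop data table.
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
h = hat_values(x4)

for xi, hi in zip(x4, h):
    print(f"X = {xi:2d}   h = {hi:.3f}")
print("sum of h:", round(sum(h), 6))
```

Here the mean of X4 is 9 and the sum of squared deviations is 110, so the point at X = 19 has h = 1/11 + 100/110 = 1 while every other point has h = 0.1. The hat statistics of a regression with an intercept and one regressor always sum to 2, which is a convenient check on the calculation.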
The hat statistic $h_i$ serves as a measure of the leverage of the ith data point. It measures leverage because its numerator is the squared distance of the ith data point from the mean in the X-direction, while its denominator is a measure of the overall variability of the data points along the X-axis. Therefore, the greater the distance of $X_i$ from its mean, the higher the value of $h_i$ and the higher the leverage of the ith data point. $h_i$ can vary from 1/n (i.e., close to zero in a large sample) for a point with no leverage, and tends to one for very high leverage. The following guidelines, based on the maximum observed value max($h_i$), have been suggested (Huber, 1981):

  max(h_i) < 0.2           little to worry about
  0.2 < max(h_i) < 0.5     risky
  0.5 < max(h_i)           too much leverage

5) Influence

A data point is influential if removing it from the sample would markedly change the position of the least squares regression line (Moore and McCabe, 1989). Hence, influential data points pull the regression line in their direction.

Influential data points do not necessarily produce large residuals. That is, they are not always outliers as well, although they can be. Conversely, an outlier is not necessarily an influential point, particularly when it is a point with little leverage.

In univariate analysis, an outlier has high leverage and will be influential. In bivariate analysis, high leverage is a necessary condition for influence on the slope, but not a sufficient one. Similarly, an outlier may not be influential if it has low leverage, and a point of high leverage may not show up as an outlier if its leverage is strong enough to pull the regression line close to itself.

A test for influence is the DFBETA statistic, which is defined as [1]:

$$ DFBETA_i = \frac{\beta_1 - \beta_{1(i)}}{SE(\beta_1)_{(i)}} $$

where the bracketed subscript (i) refers to the value of the statistic when the ith observation is excluded from the regression.
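The DFBETA definition can be sketched in Python by refitting the line once per deleted observation. This is only an illustration under stated assumptions: the helper names `fit_line` and `dfbetas` are hypothetical, the model is the simple regression Y = β0 + β1X, and both the deleted-sample slope and its standard error are taken from the regression that excludes observation i, following the (i) notation above. The data are the Y and X3 columns of Regression 3 in the workshop data table.

```python
import math


def fit_line(x, y):
    """OLS fit of Y = b0 + b1*X; returns (b0, b1, standard error of b1)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b1 = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sxx
    b0 = mean_y - b1 * mean_x
    rss = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    s = math.sqrt(rss / (n - 2))  # standard error of the regression
    return b0, b1, s / math.sqrt(sxx)


def dfbetas(x, y):
    """DFBETA_i = (b1 - b1(i)) / SE(b1)(i), refitting without observation i."""
    _, b1, _ = fit_line(x, y)
    result = []
    for i in range(len(x)):
        x_del = x[:i] + x[i + 1:]  # sample with observation i removed
        y_del = y[:i] + y[i + 1:]
        _, b1_del, se_del = fit_line(x_del, y_del)
        result.append((b1 - b1_del) / se_del)
    return result


# Regression 3 in the workshop data table.
x3 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]

for i, d in enumerate(dfbetas(x3, y3), start=1):
    print(f"observation {i:2d}: DFBETA = {d:10.3f}")
```

In this sample the third observation (X = 13, Y = 12.74) stands out by a wide margin: dropping it lowers the slope noticeably, and the remaining points lie so close to a straight line that the deleted-sample standard error is very small.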
[1] We suppose that the regression model can be specified as $Y = \beta_0 + \beta_1 X$.

The DFBETAs measure the sensitivity of the slope coefficient to the deletion of the ith data point.
The following cut-offs apply to the absolute value of DFBETA:

  if $|DFBETA_i| < 2/\sqrt{n}$, the point has no influence
  if $|DFBETA_i| > 3/\sqrt{n}$, the point is influential
  if $2/\sqrt{n} < |DFBETA_i| < 3/\sqrt{n}$, the test is inconclusive

The regression analysis should capture the general pattern in the data; an influential point can prevent this from being so. Hence, influential points are often best dropped from the regression. DFBETAs should always be used in conjunction with diagnostic regression graphics. It is always possible that a cluster of points, rather than a single data point, is exerting influence.

Table 5: Summary measures of outliers, leverage and influence

Statistic: Studentized residual ($t_i$)
  Formula: $t_i = \frac{e_i}{s_{(i)} \sqrt{1 - h_i}}$
  Use: Outliers
  Critical value: Critical values are available (higher than for the usual t-test), but we recommend using $t_i$ as an exploratory tool.

Statistic: Hat statistic ($h_i$)
  Formula: $h_i = \frac{1}{n} + \frac{(X_i - \bar{X})^2}{\sum_{j=1}^{n} (X_j - \bar{X})^2}$
  Use: Leverage
  Critical value: Bounded by 1/n (no leverage) and 1 (extreme leverage); values above 0.5 indicate excessive leverage and values over 0.2 indicate that the observation may give problems.

Statistic: DFBETA
  Formula: $DFBETA_i = \frac{\beta_1 - \beta_{1(i)}}{SE(\beta_1)_{(i)}}$
  Use: Influence
  Critical value: Under $2/\sqrt{n}$, the point has no influence; over $3/\sqrt{n}$, the point is influential, and strongly so if DFBETA exceeds 2.

Note: n is the sample size; k is the number of regressors; the subscript (i) (i.e., with parentheses) indicates an estimate from the sample omitting observation i. In each case you should use the absolute value of the calculated statistic.

Source: Mukherjee Chandan, Howard White and Marc Wuyts (1998), ‘Econometrics and Data Analysis for Developing Countries’, Routledge, London, UK.
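As a companion to the first row of Table 5, the studentized residual can be computed directly from its definition. The sketch below is a hedged illustration, not the lecture's own procedure: the function names are hypothetical, $s_{(i)}$ is obtained by literally refitting the line without observation i, while $e_i$ and $h_i$ come from the full sample. The data are again the Y and X3 columns of Regression 3 in the workshop data table.

```python
import math


def ols(x, y):
    """OLS fit of Y = b0 + b1*X; returns (b0, b1, s), s = standard error of the regression."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b1 = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sxx
    b0 = mean_y - b1 * mean_x
    rss = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    return b0, b1, math.sqrt(rss / (n - 2))


def studentized_residuals(x, y):
    """t_i = e_i / (s(i) * sqrt(1 - h_i)) for every observation."""
    n = len(x)
    b0, b1, _ = ols(x, y)
    mean_x = sum(x) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    t_values = []
    for i in range(n):
        e_i = y[i] - b0 - b1 * x[i]                    # residual from the full-sample fit
        h_i = 1 / n + (x[i] - mean_x) ** 2 / sxx       # hat statistic from the full sample
        _, _, s_del = ols(x[:i] + x[i + 1:], y[:i] + y[i + 1:])  # s(i): fit without point i
        t_values.append(e_i / (s_del * math.sqrt(1 - h_i)))
    return t_values


# Regression 3 in the workshop data table.
x3 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]

for i, t in enumerate(studentized_residuals(x3, y3), start=1):
    print(f"observation {i:2d}: t = {t:10.3f}")
```

Using $s_{(i)}$ rather than the full-sample s matters here: the third observation inflates s considerably, whereas the regression without it fits the other ten points almost exactly, so its studentized residual is far larger than its standardized residual would suggest. The sketch would fail with a division by zero for a point with $h_i = 1$, such as X = 19 in Regression 4.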
References

Bao, Nguyen Hoang (1995), ‘Applied Econometrics’, Lecture notes and Readings, Vietnam-Netherlands Project for MA Program in Economics of Development.
Hoaglin, D.C., Mosteller, F. and Tukey, J. (1983), Understanding Robust and Exploratory Data Analysis, New York: John Wiley.
Huber, P.J. (1981), Robust Statistics, New York: John Wiley.
Maddala, G.S. (1992), Introduction to Econometrics, New York: Macmillan Publishing Company.
Moore, D.S. and McCabe, G.P. (1989), Introduction to the Practice of Statistics, New York: Freeman.
Mukherjee, C., White, H. and Wuyts, M. (1998), Econometrics and Data Analysis for Developing Countries, London: Routledge.
Myers, R.H. (1990), Classical and Modern Regression with Applications, Second Edition, Boston, MA: PWS-Kent.
Workshop 3: Outliers, Leverage and Influence

1) Look carefully at the four plots in the attached figure. For each plot, write down whether any of the points is an outlier, a point of high leverage, an influential point, or some combination of these. Briefly comment on your findings.

Hint:
  Outliers are not necessarily influential (plot 4),
  but they can be so, depending on leverage (plot 3).
  Yet high leverage points are not always influential (plot 1),
  and influential points are not necessarily outliers (plot 2).

Plot summary

  Plot      Outliers   Leverage   Influence
  Plot 1    ______     ______     ______
  Plot 2    ______     ______     ______
  Plot 3    ______     ______     ______
  Plot 4    ______     ______     ______

2) An examination of residuals provides a diagnostic check on the model. When the regression model is inadequately specified, the residuals are not just pure noise. Instead, they contain a message that can help us to specify a better model. Consider the four different relations between Y and X plotted below (Anscombe, 1973), a simplified version of some common phenomena.

2.1) Calculate the regression line (Y against X) and graph it in panels 1, 2, 3 and 4 together with the data points.

2.2) State which graph corresponds to each of these situations:
  (i) The relation is really curved, rather than linear
  (ii) The positive relation is entirely the result of just one data point
  (iii) The residual variance is entirely the result of just one data point, which may very well be recorded in error
  (iv) It makes good sense to use the regression line for prediction

2.3) Briefly, what lesson does this show?
  Regression 1     Regression 2     Regression 3     Regression 4
  Y       X1       Y       X2       Y       X3       Y       X4
  8.04    10       9.14    10       7.46    10       6.58    8
  6.95    8        8.14    8        6.77    8        5.76    8
  7.58    13       8.74    13       12.74   13       7.71    8
  8.81    9        8.77    9        7.11    9        8       8
  8.33    11       9.26    11       7.81    11       8.47    8
  9.96    14       8.1     14       8.84    14       7.04    8
  7.24    6        6.13    6        6.08    6        5.25    8
  4.26    4        3.1     4        5.39    4        12.5    19
  10.84   12       9.13    12       8.15    12       5.56    8
  4.82    7        7.26    7        6.42    7        7.91    8
  5.68    5        4.74    5        5.73    5        6.89    8

3) The identification of outliers in univariate analysis

Using the data file LEACCESS.WK1, identify whether there are any outliers in each of the following data sets:
3.1) LE
3.2) Y
3.3) ln(Y)

4) The identification of outliers in bivariate analysis

4.1) Using the data file AIDSAV, test whether observation 26 (Lesotho) is:
  a) an outlier
  b) a point of high leverage
  c) an influential point

4.2) Draw the scatter plot of S/Y against A/Y, showing the regression line with and without point 26 in the same graph.

4.3) What happens to the R2 when observation 26 is dropped from the data set? Explain.

4.4) Are there any other problematic points in the sample?
4.5) Show algebraically that a point with no leverage cannot have any influence on the slope coefficient.

5) Outliers in bivariate analysis

5.1) Using the data file HOLMQ, which contains the data for EDUEXP and EAID, examine the figure and test whether any of the points is:
  a) an outlier
  b) a point of high leverage
  c) an influential point

5.2) Draw the scatter plot, showing the fitted line. Briefly comment on your findings.