What to do after OLS Regression modeling

Once you have estimated a regression and examined the fit of the model, there are certain assumptions that must be met for the estimates of the population parameters and the tests of statistical significance to be valid. Here is a quick primer on them:

1. No specification error.

a. No relevant independent variables have been excluded from the model.

b. No irrelevant variables have been included.

c. The relationship between Y and each X is linear, and the effects of the k independent variables are additive.

2. No measurement error.

a. All variables have been measured without error.

b. All independent variables (X1, X2, …, Xk) are quantitative or dichotomous, and the dependent variable, Y, is quantitative, continuous, and unbounded.

3. All independent variables have a nonzero variance.

4. There is no perfect multicollinearity (i.e. there is no exact linear relationship between two or more independent variables).

5. Each independent variable is uncorrelated with the error term. For each Xi, COV(Xij, εj) = 0.

6. For each set of values of the k independent variables (X1, X2, …, Xk), the variance of the error term is constant, or homoscedastic. Thus VAR(εj | X1j, X2j, …, Xkj) = σ², where σ² is a constant.

7. The errors associated with one observation (Yj) should not be correlated with the errors associated with any other observation (Yh); this assumption is known as lack of autocorrelation. Thus, COV(εj, εh) = 0.

8. For each set of values of the k independent variables (X1, X2, …, Xk), the mean or expected value of the error term is 0. Thus, E(εj) = 0.

9. For each set of values of the k independent variables, εj is normally distributed (Berry and Feldman, 1985: 10; Berry, 1993: 12).

If you want to delve further, here is more information about each assumption, how to detect violations, and how to deal with them.
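Before going through the assumptions one by one, here is a minimal sketch, in Python with statsmodels, of fitting an OLS model on a small hypothetical data set. The later snippets reuse this `df` and the fitted `model` and its residuals; the variable names are, of course, just placeholders.

```python
# Minimal OLS fit whose residuals the diagnostics below examine.
# Hypothetical data: a DataFrame `df` with columns y, x1, x2.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(42)
df = pd.DataFrame({"x1": rng.normal(size=200), "x2": rng.normal(size=200)})
df["y"] = 1.0 + 2.0 * df["x1"] - 0.5 * df["x2"] + rng.normal(size=200)

model = smf.ols("y ~ x1 + x2", data=df).fit()
print(model.summary())        # coefficients, standard errors, t-tests, F-test
residuals = model.resid       # used by the residual-based diagnostics below
```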

Specification error: Using the wrong independent variables

First, consider the impact of including irrelevant variables. Suppose the true population model is:

Y = α + β1X1 + ε

However, the model that is estimated is:

Y = α + β1X1 + β2X2 + ε

X1 is therefore the relevant variable, and X2 is an irrelevant variable that should not have been included in the model. Since X2 has no effect in the population, β2 = 0, and the estimate b2 is unbiased for zero, so that E(b2) = 0. Similarly, b1 remains unbiased, so that E(b1) = β1 (Berry and Feldman, 1985: 19).

However, the inclusion of irrelevant variables will affect the standard errors. If X1 and X2 are correlated, the standard error of b1 will be inflated when the irrelevant variable X2 is included in the model. In fact, the degree to which the standard error is inflated is directly related to the size of the correlation between the independent variables. This reduces the efficiency of the estimate. Lastly, we may draw the wrong conclusion from a hypothesis test and commit a type II error (Berry and Feldman, 1985: 19–20).

As stated earlier, specification error also occurs when relevant variables are excluded from the model. This assumption is related to the assumption that each independent variable must not be correlated with the error term. For example, if the true model in the population has two relevant variables:

Y = α + β1X1 + β2X2 + ε

Suppose a researcher leaves out X2:

Y = α + β1X1 + ε

In this situation, the included variable X1 will be correlated with the error term, which now contains the effect of the excluded X2. The slope will be biased, as the expected value of b1 is no longer β1. Instead,

E(b1) = β1 + β2b21

where b21 is the slope coefficient that would be obtained by regressing X2 on X1. The magnitude of the bias depends directly on the relationship between the included and excluded variables. However, the standard error will decrease. This happens because the standard error of a coefficient tends to increase as the magnitude of its correlation with the other independent variables increases (Berry and Feldman, 1985: 20–22).
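The bias formula can be checked with a short simulation (hypothetical parameters): when the correlated variable X2 is dropped, the estimated slope on X1 drifts toward β1 + β2b21.

```python
# Simulation sketch of omitted-variable bias (hypothetical parameters).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 5000
x1 = rng.normal(size=n)
x2 = 0.6 * x1 + rng.normal(size=n)                    # X2 correlated with X1
y = 1.0 + 2.0 * x1 + 1.5 * x2 + rng.normal(size=n)    # true model: beta1 = 2, beta2 = 1.5

full = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()
short = sm.OLS(y, sm.add_constant(x1)).fit()           # X2 omitted

b21 = sm.OLS(x2, sm.add_constant(x1)).fit().params[1]  # slope from regressing X2 on X1
print(full.params[1])                 # close to 2.0 (unbiased)
print(short.params[1])                # close to 2.0 + 1.5 * b21 (biased)
print(2.0 + 1.5 * b21)                # the predicted biased value
```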

Non-linearity and Non-Additivity

In a regression model, we assume that for each set of values of the k independent variables (X1j, X2j, …, Xkj), the mean of the distribution of Yj falls on the surface:

E(Yj) = α + β1X1j + β2X2j + … + βkXkj

The assumptions of linearity and additivity are implicit in this equation. “Linearity is the assumption that for each independent variable Xi, the amount of change in the mean value of Y associated with a unit increase in Xi, holding all other independent variables constant, is the same regardless of the level of Xi… Additivity is the assumption that for each independent variable Xi, the amount of change in E(Y) associated with a unit increase in Xi (holding all other independent variables constant) is the same regardless of the values of the other independent variables in the equation.” (ibid, 1985: 51).

A nonlinear relationship makes the unstandardized coefficient biased. In fact, a partial slope that appears insignificant may become significant once the nonlinear relationship is specified. Even if the partial slope is significant, we can probably improve the goodness-of-fit by specifying the nonlinear effect.

The best way to detect both nonlinearity and nonadditivity is to use theory to hypothesize the form of the nonlinear or nonadditive relationship. A researcher can also run diagnostic tests to detect nonlinearity. Plotting residuals or studentized residuals against each X, augmented by a lowess smooth, is helpful for detecting departures from linearity, as sketched below. However, these plots cannot distinguish between monotone (the slope does not change sign) and nonmonotone (the slope changes sign) nonlinearity. The distinction matters because monotone nonlinearity can be corrected by a single transformation, whereas nonmonotone nonlinearity requires a polynomial in powers of X, with the degree depending on the number of times the slope changes sign (Fox, 1991: 54).
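As a sketch of that residual check, using the `model` and `df` from the first snippet, the residuals can be plotted against an independent variable with a lowess smooth laid on top; curvature in the smooth suggests a departure from linearity.

```python
# Residuals vs. an independent variable, with a lowess smooth (sketch).
import matplotlib.pyplot as plt
import statsmodels.api as sm

smooth = sm.nonparametric.lowess(model.resid, df["x1"])
plt.scatter(df["x1"], model.resid, s=10, alpha=0.5)
plt.plot(smooth[:, 0], smooth[:, 1], color="red")   # curvature hints at nonlinearity
plt.axhline(0, linestyle="--", color="grey")
plt.xlabel("x1")
plt.ylabel("Residuals")
plt.show()
```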

If nonadditive relationships are suspected, there are several nonadditive specifications that can be used in estimating an OLS regression. One commonly used specification is called a dummy variable interactive model, and it is applicable when one of the independent variables is dichotomous. Suppose a variable X interacts with a dummy variable D in influencing the dependent variable Y. We can let D take on the two values 0 and 1 and estimate two regression models,

When D = 0: Y = α0 + β0X + ε0

When D = 1: Y = α1 + β1X + ε1

estimating each by OLS on the corresponding subsample. The research hypothesis of interaction can then be tested against the null hypothesis that β0 = β1 (Berry and Feldman, 1985: 65–66).

The second interactive specification is called a multiplicative model. “It is applicable when two independent variables — X1 and X2, both measured at the interval level — are thought to interact in influencing Y such that the slope of the relationship between each independent variable and E(Y) is linearly related to the value of the other independent variables.” (ibid, 1985: 67). The specification takes the form

Y = α + β1X1 + β2X2 + β3(X1X2) + ε
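As a sketch (using the hypothetical `df` and the formula interface from the first snippet), the multiplicative model can be written directly, and the coefficient on the product term estimates β3:

```python
# Multiplicative (interaction) specification: y ~ x1 + x2 + x1:x2 (sketch).
import statsmodels.formula.api as smf

interaction = smf.ols("y ~ x1 + x2 + x1:x2", data=df).fit()
print(interaction.params["x1:x2"])    # estimate of beta3
print(interaction.pvalues["x1:x2"])   # test of the interaction term
```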

Multicollinearity

First, we need to distinguish between perfect collinearity and less extreme forms of multicollinearity. Perfect collinearity exists when one of the independent variables in a regression equation is perfectly linearly related to one or more of the other independent variables. It may also arise when a mistake is made in handling dummy variables while incorporating discrete independent variables into a regression model, or when the sample size is very small (Berry, 1993: 25–26).

“The problem with perfect collinearity is that among a set of observations, an infinite number of regression surfaces fit the observations equally well, and therefore, it is impossible to derive unique estimates of the intercept and partial slope coefficients for the regression equation. Fortunately, perfect collinearity is rarely found in social science research.” (Berry and Feldman, 1985: 38).

A strong, but less than perfect, linear relationship between the Xs causes the least-squares estimates to be unstable, although they remain BLUE. Coefficient standard errors are large, reflecting the imprecision of the estimates of the βs. The major effect of multicollinearity is therefore on significance tests and confidence intervals for regression coefficients: when high multicollinearity is present, confidence intervals tend to be wide and t-statistics tend to be small (Fox, 1991: 11; Berry and Feldman, 1985: 41).

A common rule of thumb is that multicollinearity should be suspected when none of the t-tests for individual coefficients is significant even though the F-test for the full model is significant. The most commonly used check for multicollinearity is inspection of the matrix of bivariate correlations. Another is to regress each independent variable on all the other independent variables and examine the R-squares from these auxiliary regressions (Berry and Feldman, 1985: 42–43). The impact of collinearity on the precision of estimation is captured by 1/(1 − Rj²), called the variance-inflation factor, VIFj,

where the estimated variance of the least-squares regression coefficient bj is

V̂(bj) = [s² / ((n − 1)sj²)] × [1 / (1 − Rj²)]

A VIF of four or greater signals a collinearity problem (Fox, 1991: 11).
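As a sketch of that check on the hypothetical `df`, statsmodels computes each VIFj from exactly these auxiliary regressions:

```python
# Variance-inflation factors, VIF_j = 1 / (1 - R_j^2)  (sketch).
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

exog = sm.add_constant(df[["x1", "x2"]])
for i, name in enumerate(exog.columns):
    if name == "const":
        continue
    print(name, variance_inflation_factor(exog.values, i))
```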

Although the ideal solution to collinearity is to gather new data, this is often impractical. Other, less adequate strategies for dealing with collinearity are:

Model respecification: A researcher can respecify the model. For example, several regressors that can be conceptualized as alternative indicators of the same underlying concept can be combined into a single measure.

Variable selection: A common approach is to reduce the regressors in the model to a less highly correlated set. Forward stepwise methods add variables to the model one at a time, whereas backward stepwise methods start with the full model and delete variables one at a time.

Biased estimation: The general idea in this technique is to trade a small amount of bias in the coefficient estimates for a substantial reduction in coefficient sampling variance. The result is a smaller mean-squared error of estimation of the βs than provided by the least-squares estimates. The most common biased method is called ridge regression.

Prior information about the βs: The final approach is to introduce additional prior information that helps reduce the ambiguity produced by collinearity. For example, suppose we want to estimate the model

Y = α + β1X1 + β2X2 + β3X3 + ε

where Y is savings, X1 is income from wages and salaries, X2 is dividend income from stocks, and X3 is interest income. If we have trouble estimating β2 and β3 because X2 and X3 are highly correlated, we can impose the prior restriction that the two coefficients share a common value, β*, and estimate the model as (Fox, 1991: 13–20)

Y = α + β1X1 + β*(X2 + X3) + ε

Heteroscedasticity and autocorrelation

Heteroscedasticity refers to the situation in which the error term in a regression model does not have constant variance. It occurs, first, when the dependent variable is measured with error and the amount of error varies with the value of the independent variable. It is also likely ‘when the unit of analysis is an aggregate and the dependent variable is an average of values for the individual objects composing the aggregate units — such as mean income in some aggregate unit.’ Lastly, it occurs when a relevant variable that has an interaction effect with one of the independent variables has been excluded from the model. Autocorrelation occurs when the error terms associated with different observations are correlated. It is generally a problem in time-series regression models, in which the observations consist of a single individual or unit measured at multiple points in time (Berry and Feldman, 1985: 73–77).

Even with heteroscedasticity, the OLS estimators of both the intercept and the partial slope coefficients remain unbiased. But with heteroscedasticity or autocorrelation, the least-squares estimates of the intercept and partial slope coefficients are no longer BLUE. Another problem is that the standard errors are biased. They can be negatively or positively biased depending on the sign of the correlation between the independent variables and the variance of the error term. If a standard error is negatively biased, the confidence interval for β will be too narrow, which means that b may incorrectly appear to be statistically significant when hypothesis tests are conducted. On the other hand, when a standard error is positively biased, the confidence interval is too wide and significance tests become too difficult to pass (ibid, 1985: 78).

If heteroscedasticity is suspected, the first check should be a visual inspection of plots of the regression residuals. In particular, examine graphs in which the residuals are plotted against the independent variables suspected of being correlated with the variance of the error term. Another way of detecting heteroscedasticity is the Goldfeld-Quandt test. In this test, the n observations in the sample are reordered by increasing magnitude of the independent variable suspected to covary with the variance of the error term. A certain number of observations (denoted m) are then deleted from the middle, leaving n − m observations. OLS regression is used to estimate the coefficients separately for (1) the first (n − m)/2 observations and (2) the last (n − m)/2 observations. The error sums of squares for the two sub-samples are calculated, and an F-test is conducted of the null hypothesis that the error term is homoscedastic (ibid, 1985: 80–81):

H0: VAR(εj | X1j, X2j, …, Xkj) = σ²

H1: VAR(εj | X1j, X2j, …, Xkj) ≠ σ²

where σ² is a constant.
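A sketch of this test, using the hypothetical `df` from the earlier snippets and sorting the observations by x1 (statsmodels implements the Goldfeld-Quandt procedure directly):

```python
# Goldfeld-Quandt test for heteroscedasticity (sketch).
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_goldfeldquandt

exog = sm.add_constant(df[["x1", "x2"]])
# Sort by the variable suspected to drive the error variance (column 1 = x1),
# drop the middle 20% of observations, and compare the two sub-sample fits.
f_stat, p_value, _ = het_goldfeldquandt(np.asarray(df["y"]), np.asarray(exog), idx=1, drop=0.2)
print(f_stat, p_value)
```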

Another test for heteroscedasticity is White’s test. To test the null hypothesis of homoscedasticity, the squared OLS residuals are regressed on the original regressors, their squares, and their cross-products, and nR² from this auxiliary regression is compared to a chi-square distribution (McClendon, 1994: 181).
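As a sketch, this auxiliary regression is packaged as a single statsmodels call applied to the residuals from the earlier fit:

```python
# White's general test for heteroscedasticity (sketch).
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_white

exog = sm.add_constant(df[["x1", "x2"]])
lm_stat, lm_pvalue, f_stat, f_pvalue = het_white(model.resid, np.asarray(exog))
print(lm_stat, lm_pvalue)     # nR^2 statistic and its chi-square p-value
```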

If tests indicate that heteroscedasticity is present, the researcher must consider the possibility that it results from an interaction with an independent variable that has been left out of the model. Theory is the best guide for dealing with this problem. To deal with heteroscedasticity or autocorrelation, a researcher can use the Generalized Least Squares (GLS) technique, which yields estimators that are BLUE. It accomplishes this by using information about the relationship between (i) the variance of the error term and an independent variable (with heteroscedasticity) or (ii) the error terms associated with different observations (with autocorrelation). To deal with heteroscedasticity, observations expected to have error terms with large variance are given a small weight compared to observations whose error terms have small variance (Berry and Feldman, 1985: 86).
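For the heteroscedasticity case, this weighting idea corresponds to weighted least squares. The sketch below assumes, purely for illustration, that the error variance grows with x1², and weights each observation by the inverse of that assumed variance:

```python
# Weighted least squares as a simple GLS-style correction (sketch, assumed variance form).
import statsmodels.api as sm

exog = sm.add_constant(df[["x1", "x2"]])
weights = 1.0 / (1.0 + df["x1"] ** 2)     # hypothetical: Var(e) proportional to 1 + x1^2
wls_fit = sm.WLS(df["y"], exog, weights=weights).fit()
print(wls_fit.params)
```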

Non-normally distributed errors

The assumption of normally distributed errors is important because, although the validity of least-squares estimation is robust to non-normality, the method is not robust in efficiency. Non-normally distributed errors are more of a problem in small samples; in large samples, the central-limit theorem assures that under broad conditions inferences based on the least-squares estimators are valid. Highly skewed error distributions not only generate outliers in the direction of the skew but also compromise the interpretation of the least-squares fit. Finally, a multimodal error distribution suggests the omission of one or more qualitative variables that divide the data naturally into groups (Fox, 1991: 40–41).

There are several graphical methods for examining whether the error term is normally distributed. One such method is the normal probability plot of residuals, which permits us to compare visually the studentized residuals, which are t-distributed, to the unit-normal distribution. “If ti were drawn from a unit-normal distribution, then, within the bounds of sampling error, ti = zi. Consequently, we expect to find an approximately linear plot with zero intercept and unit slope, a line that can be placed on the plot for comparison. Nonlinearity in the plot, in contrast, is symptomatic of non-normality.” (ibid, 1991: 42). Another graph that can be used to examine normality is a histogram of the residuals, which does a good job of conveying general distributional information (ibid, 1991: 44). A researcher can also test whether the error term is normally distributed using the Shapiro-Wilk test (Shapiro and Wilk, 1965):

H0: VAR(ej | X1j, X2j, …, Xkj) = 1 and E(ej | X1j, X2j, …, Xkj) = 0 (the studentized residuals follow a unit-normal distribution)

H1: VAR(ej | X1j, X2j, …, Xkj) ≠ 1 or E(ej | X1j, X2j, …, Xkj) ≠ 0
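A sketch of both checks, applied to the residuals from the earlier fit (the quantile plot compares the residuals to a normal distribution, and scipy provides the Shapiro-Wilk test):

```python
# Normal quantile-quantile plot and Shapiro-Wilk test on the residuals (sketch).
import matplotlib.pyplot as plt
import statsmodels.api as sm
from scipy.stats import shapiro

sm.qqplot(model.resid, line="s")     # points near the line suggest normal errors
plt.show()

w_stat, p_value = shapiro(model.resid)
print(w_stat, p_value)               # small p-value: reject normality
```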

A researcher can correct for a skewed error distribution by transforming the dependent variable. This can be done by moving up or down the ladder of roots and powers, depending on the direction of the skew (Fox, 1991: 47). Another technique for correcting a non-normally distributed error term is reweighted least-squares estimation, which allows you to re-estimate not only the standard errors but also the intercept and partial slopes.
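As a sketch of moving down the ladder of powers, here is a hypothetical right-skewed outcome (an exponential function of x1) fitted before and after a log transformation of the dependent variable:

```python
# Transforming the dependent variable down the ladder of powers (sketch, hypothetical data).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
skewed = pd.DataFrame({"x1": rng.normal(size=300)})
skewed["income"] = np.exp(1.0 + 0.5 * skewed["x1"] + rng.normal(scale=0.5, size=300))

raw_fit = smf.ols("income ~ x1", data=skewed).fit()          # strongly skewed residuals
log_fit = smf.ols("np.log(income) ~ x1", data=skewed).fit()  # roughly symmetric residuals
print(raw_fit.resid.skew(), log_fit.resid.skew())
```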

Assumptions about the level of measurement

Assumption 2(b) requires that the independent variables in a regression equation be quantitative or dichotomous and that the dependent variable be quantitative, continuous, and unbounded. Even so, a variety of discrete variables are appropriate in a regression equation.

Discrete variables fall into three types: dichotomous; three or more unordered values (qualitative); and three or more ordered values, which may be quantitative or non-quantitative.

Discrete variables can thus be dichotomous, or have three or more values that are either unordered (e.g., race) or ordered (e.g., the number of children). Ordered values are of two types: quantitative and non-quantitative. To be appropriate as independent variables in a regression model, ordered discrete variables should be quantitative, although whether an ordered variable is quantitative is not always obvious. According to Berry, it is sometimes appropriate to treat a quantitative ordered discrete variable as if it were continuous and thereby use it as a dependent variable. He also argues that qualitative unordered discrete variables and non-quantitative ordered discrete variables are ‘never appropriate as distinct regressors in a regression model.’ If a researcher uses a dichotomous dependent variable in a regression, the assumption of normally distributed errors is violated.

Suppose Y can only assume the values 1 and 0. Then

εj can take on only two values: 1 − (α + Σ βiXij) or −(α + Σ βiXij)

Therefore, the variance of the error term is (α + Σ βiXij) · [1 − (α + Σ βiXij)]

This also shows that the variance of the error term varies with the values of the independent variables, which violates the assumption of homoscedasticity.
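A quick numeric sketch of that formula, with hypothetical values of α + Σ βiXij, makes the point:

```python
# Error variance of a binary-Y regression at different predicted values (sketch).
import numpy as np

p = np.array([0.1, 0.3, 0.5, 0.8])   # hypothetical values of alpha + sum(beta_i * X_ij)
print(p * (1 - p))                   # 0.09, 0.21, 0.25, 0.16 -- the variance moves with X
```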

Another consequence is that the coefficients have ‘nonsensical interpretations.’

If the value of Y is 0 or 1, then the expected value of Y must equal (1 multiplied by the probability that Y equals 1) plus (0 multiplied by the probability that Y equals 0):

E(Yj | X1j, X2j, …, Xkj) = [1 · P(Yj = 1 | X1j, X2j, …, Xkj)] + [0 · P(Yj = 0 | X1j, X2j, …, Xkj)]

Since the rightmost bracketed term equals 0, the equation reduces to

E(Yj | X1j, X2j, …, Xkj) = P(Yj = 1 | X1j, X2j, …, Xkj)

This means that the expected value of Yj equals the probability that Yj equals 1.

Lastly, the linearity assumption of regression is not plausible (Berry, 1993: 46–49).

Measurement error

There are mainly two types of measurement error: random and non-random. Non-random measurement error occurs when a researcher measures some other variable(s) in addition to the true variable of interest. Random error is error introduced into the data as ‘unsystematic noise.’ It can arise when data are collected from respondents who guess, or when response categories are vague and poorly defined. It can also arise from errors in recording the data or mistakes in coding and keypunching (Berry and Feldman, 1985: 26–27).

The first consequence of random error in the dependent variable is that it increases the variance of the dependent variable, since

Y* = Y + μ

As a result, the share of the variance explained by the regression will be smaller. Although the partial coefficient estimators remain unbiased, they will be less efficient: the standard errors are inflated because the error variance increases. It therefore becomes more difficult to achieve statistical significance, and there is a higher probability of committing a type II error.

The situation is more complex when there is measurement error in one or more of the independent variables. Suppose that in a bivariate regression model there is random measurement error in the independent variable. The equation we wish to estimate is

Y = α + βX + ε

But instead of measuring X, we have X*, where

X* = X + μ

In our sample, therefore, we are estimating

Y = a + bX* + e

The magnitude of E(b) will be less than β. It will be

E(b) = β · s²X / (s²X + s²μ)

Therefore, the expected value of b is the product of β and the reliability of X* (rxx):

E(b) = βrxx

Therefore, the regression coefficient will be biased. In a multivariate model, it is difficult to predict the direction of the bias; it depends on the extent of the measurement error and the magnitude of the correlations between the independent variables (Berry and Feldman, 1985: 28–29).
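The attenuation result can be checked with a short simulation (hypothetical parameters): adding noise to X shrinks the estimated slope toward β times the reliability of X*.

```python
# Attenuation of a slope under random measurement error in X (sketch).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 10000
x = rng.normal(size=n)                       # true X, variance 1
y = 2.0 + 1.5 * x + rng.normal(size=n)       # true slope beta = 1.5
x_star = x + rng.normal(scale=0.5, size=n)   # observed X* = X + mu, Var(mu) = 0.25

reliability = 1.0 / (1.0 + 0.25)             # s2_X / (s2_X + s2_mu) = 0.8
b = sm.OLS(y, sm.add_constant(x_star)).fit().params[1]
print(b, 1.5 * reliability)                  # b is close to beta * r_xx = 1.2
```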

The appropriate time to deal with the problem of measurement error is before a regression equation is estimated. Strategies of data collection and coding should be designed to minimize measurement error, and multiple indicators of a concept should be collected so that estimates of measurement error can be obtained and more reliable measures produced. Lastly, instrumental variables can be employed to deal with measurement error. An instrumental variable for an independent variable Xi is a variable that is correlated with Xi but has no direct effect on the dependent variable, only an indirect effect through Xi. However, this is often not a practical solution, as it is difficult to identify instrumental variables (ibid, 1985: 34).

Zero variance in independent variables

If a variable has zero variance, it is not a variable but a constant. An independent variable with zero variance has zero covariance with the dependent variable, and its partial coefficient cannot be meaningfully estimated. This problem is rarely encountered in practical research; restricted variance in the independent variables is a more common problem. As a result of restricted variance, partial slopes are underestimated and standard errors are inflated. It therefore becomes more difficult to achieve statistical significance, and there is a higher probability of committing a type II error.

Each independent variable must not be correlated with the error term

The error term refers to the combined effect of all variables that influence the dependent variable but are excluded from the regression, along with the random component of the dependent variable’s behavior:

εj = δ0 + δ1Z1j + δ2Z2j + … + δmZmj + Rj

where the excluded independent variables are labeled Zs, R denotes the random component, and the intercept is δ0 (Berry, 1993: 28–29).

This assumption is closely related to the assumption of no specification error. If an independent variable is correlated with the error term, the slope will be biased, as the expected value of b1 is no longer β1. However, the standard error will decrease, because the standard error of a coefficient tends to increase as the magnitude of its correlation with the other independent variables increases (Berry and Feldman, 1985: 20–22).

Mean of the error term is zero

In a regression model, it is assumed that

Yj = α + β1X1j + β2X2j + … + βkXkj + εj

with E(εj | X1j, X2j, …, Xkj) = 0. However, if this assumption is violated, then E(εj | X1j, X2j, …, Xkj) = μj, where μj is not zero. Therefore,

E(Yj | X1j, X2j, …, Xkj) = α + β1X1j + β2X2j + … + βkXkj

is not true; instead,

E(Yj | X1j, X2j, …, Xkj) = α + β1X1j + β2X2j + … + βkXkj + μj

where μj may be a constant or may vary across observations.

If μj is a constant, the least-squares estimators of the partial slope coefficients remain unbiased, but the intercept is biased by μ units. When the mean of the disturbance term varies across cases, the expected value of Y is determined by the regression parameters, the values of the Xs, and μj. Thus μj becomes, in effect, a variable excluded from the regression model, resulting in bias in the partial slope coefficient estimators. We cannot deal with this problem statistically. A researcher can assume that this assumption is not violated if there is non-constant error variance and the error term is normally distributed (Berry, 1993: 41–45).

Bibliography

Berry, William D. 1993. Understanding Regression Assumptions. Beverly Hills, CA: Sage.

Berry, William D., and Stanley Feldman. 1985. Multiple Regression in Practice. Beverly Hills, CA: Sage.

Fox, John. 1991. Regression Diagnostics. Beverly Hills, CA: Sage.

McClendon, McKee J. 1994. Multiple Regression and Causal Analysis. Itasca, IL: F. E. Peacock.

Shapiro, S. S., and M. B. Wilk. 1965. “An Analysis of Variance Test for Normality (Complete Samples).” Biometrika 52 (3–4): 591–611.

PS: This is an old paper that I wrote during my PhD and recently came across on an old hard drive. I remember putting my heart and soul into writing it, but until now the only other person who had read it was my advisor. In today’s world it’s so much easier to share your knowledge.

Note of caution: Some of the formatting for the equations is hard to do in Medium so some equations are not 100% correct.

#research #statistics #regression
