Linear regression

From Wikipedia, the free encyclopedia

Linear regression is a form of regression analysis in which data are modeled by a function that is a linear combination of the model parameters and depends on one or more independent variables; the parameters are estimated by least squares. In simple linear regression the model function represents a straight line. The results of data fitting are subject to statistical analysis.

Example of linear regression with one dependent and one independent variable.

Definitions

The data consist of m values, $y_1, \ldots, y_m$, of the dependent variable (response variable), y, derived from observations. The dependent variable is subject to error. This error is assumed to be a random variable with a mean of zero. Systematic error (e.g. mean ≠ 0) may be present, but its treatment is outside the scope of regression analysis. The independent variable (explanatory variable), $x$, is assumed to be error-free. If this is not so, modeling should be done using errors-in-variables model techniques. The independent variables are also called regressors, exogenous variables, input variables and predictor variables.

In general there are n parameters to be determined, $\beta_1, \ldots, \beta_n$. The model is a linear combination of these parameters:

$y_i = \sum_{j=1}^{n} X_{ij} \beta_j + \varepsilon_i.$

The elements of the matrix X, that is $X_{ij}$, are constants or functions of the independent variable, $x_i$, and $\varepsilon_i$ is an observational error. Models which do not conform to this specification must be treated by nonlinear regression.

In simple linear regression the data model represents a straight line and can be written as

$y_i = \beta_1 + x_i \beta_2 + \varepsilon_i$

Here n = 2; $\beta_1$ (intercept) and $\beta_2$ (slope) are the parameters of the model, and the coefficients are $X_{i1} = 1$ and $X_{i2} = x_i$.

A linear regression model need not be a linear function of the independent variable. For example, the model $y=\beta_1+\beta_2 x+\beta_3 x^2+\varepsilon$ is linear in the parameters $\beta_1$, $\beta_2$ and $\beta_3$, but it is not linear in x, as $X_{i3}=x_i^2$ is a nonlinear function of $x_i$. An illustration of this model is shown in the example below.
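As a minimal sketch of how the matrix X is assembled for the straight-line and quadratic models described above, the following Python snippet builds one row of coefficients per observation. The function name `design_matrix` and the sample values are illustrative assumptions, not part of the original text.

```python
def design_matrix(xs, degree):
    """Return one row [1, x, x**2, ..., x**degree] per observation x."""
    return [[x ** p for p in range(degree + 1)] for x in xs]

xs = [1.0, 2.0, 3.0]            # made-up values of the independent variable

X_line = design_matrix(xs, 1)   # columns: 1, x       (simple linear regression)
X_quad = design_matrix(xs, 2)   # columns: 1, x, x^2  (still linear in the parameters)

print(X_line)
print(X_quad)
```

The quadratic model only adds a column to X; the fitting procedure itself is unchanged, which is what "linear in the parameters" means in practice.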

If there is any linear dependence amongst the columns of X, then the parameters β cannot be estimated by least squares unless β is constrained, as, for example, by requiring the sum of some of its components to be 0. However, some linear combinations of the components of β may still be uniquely estimable in such cases. For example, the model

$y_i=\beta_1+x_i\beta_2+2x_i\beta_3+\varepsilon_i\,$

cannot be solved for $\beta_2$ and $\beta_3$ independently, as the matrix of coefficients, X, has the reduced rank 2. In this case the model can be rewritten as

$y_i=\beta_1+x_i(\beta_2+2\beta_3)+\varepsilon_i\,$

and can be solved to give values for $\hat\beta_1$ and the composite entity $(\widehat{\beta_2+2\beta_3})$ .
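The rank deficiency can be checked numerically. The sketch below, using made-up integer values of x, forms $\mathbf{X^TX}$ for the three-parameter model and for the rewritten two-parameter model, and compares their determinants. The helper names `gram` and `det3` are hypothetical.

```python
def gram(X):
    """Return X^T X for a matrix given as a list of rows."""
    n = len(X[0])
    return [[sum(row[i] * row[j] for row in X) for j in range(n)]
            for i in range(n)]

def det3(A):
    """Determinant of a 3x3 matrix by cofactor expansion along the first row."""
    return (A[0][0] * (A[1][1] * A[2][2] - A[1][2] * A[2][1])
            - A[0][1] * (A[1][0] * A[2][2] - A[1][2] * A[2][0])
            + A[0][2] * (A[1][0] * A[2][1] - A[1][1] * A[2][0]))

xs = [1, 2, 3, 4]                       # made-up integer observations

# Columns 1, x, 2x: the third column is twice the second.
X_full = [[1, x, 2 * x] for x in xs]
det_full = det3(gram(X_full))

# Rewritten model with columns 1, x and the composite parameter.
G = gram([[1, x] for x in xs])
det_reduced = G[0][0] * G[1][1] - G[0][1] * G[1][0]

print(det_full)      # zero: the normal equations have no unique solution
print(det_reduced)   # non-zero: the rewritten model can be solved
```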

Unless otherwise stated it is assumed that the observational errors are uncorrelated and belong to a normal distribution. This, or another assumption, is used when performing statistical tests on the regression results. An equivalent formulation of simple linear regression that explicitly shows the linear regression as a model of conditional expectation can be given as

$\mbox{E}(y|x) = \alpha + \beta x \,$

The conditional distribution of y given x is a linear transformation of the distribution of the error term.

Notation and naming conventions

Scalars and vectors are denoted by lower case letters.
Matrices are denoted by upper case letters.
Parameters are denoted by Greek letters.
Vectors and matrices are denoted by bold letters.
A parameter with a hat, such as $\hat\beta_j$ , denotes a parameter estimator.

Least-squares analysis

Main article: Linear least squares

The first objective of regression analysis is to best-fit the data by adjusting the parameters of the model. Of the different criteria that can be used to define what constitutes a best fit, the least squares criterion is a very powerful one. The objective function, S, is defined as the sum of squared residuals, $r_i$:

$S=\sum_{i=1}^{m}r_i^2,$

where each residual is the difference between the observed value and the value calculated by the model:

$r_i = y_i-\sum_{j=1}^{n} X_{ij} \beta_j.$

The best fit is obtained when S, the sum of squared residuals, is minimized. Subject to certain conditions, the parameter estimators then have minimum variance among linear unbiased estimators (Gauss–Markov theorem) and may also represent a maximum likelihood solution to the optimization problem.

From the theory of linear least squares, the parameter estimators are found by solving the normal equations

$\sum_{i=1}^{m}\sum_{k=1}^{n} X_{ij}X_{ik}\hat \beta_k=\sum_{i=1}^{m} X_{ij}y_i,~~~ j=1,\ldots,n$

In matrix notation, these equations are written as

$\mathbf{\left(X^TX\right) \boldsymbol {\hat \beta}=X^Ty}$ ,

Thus, when the matrix $\mathbf{X^TX}$ is non-singular:

$\boldsymbol {\hat \beta} = \mathbf{\left(X^TX\right)^{-1} X^Ty}$ ,

For the special case of straight-line fitting, explicit expressions are given in the section on the linear case.
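The normal equations can be formed and solved directly as a set of simultaneous linear equations. The following is a minimal pure-Python sketch under stated assumptions: the function names `normal_equations` and `solve` are hypothetical, the data are made up so that they lie exactly on the line y = 1 + 2x, and Gaussian elimination with partial pivoting is one possible solver, not the only one.

```python
def normal_equations(X, y):
    """Return (X^T X, X^T y) for a design matrix X (list of rows) and data y."""
    n = len(X[0])
    A = [[sum(r[j] * r[k] for r in X) for k in range(n)] for j in range(n)]
    b = [sum(r[j] * yi for r, yi in zip(X, y)) for j in range(n)]
    return A, b

def solve(A, b):
    """Solve A beta = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]   # augmented copy
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        if abs(M[p][c]) < 1e-12:
            raise ValueError("X^T X is singular or nearly singular")
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    beta = [0.0] * n
    for i in range(n - 1, -1, -1):                 # back substitution
        s = sum(M[i][k] * beta[k] for k in range(i + 1, n))
        beta[i] = (M[i][n] - s) / M[i][i]
    return beta

# Made-up data lying exactly on y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0]
y = [1.0, 3.0, 5.0, 7.0]
X = [[1.0, x] for x in xs]

A, b = normal_equations(X, y)
beta = solve(A, b)
print(beta)   # intercept and slope, close to 1 and 2
```

Solving the equations in this way avoids forming the inverse of $\mathbf{X^TX}$ explicitly when only the parameter estimates are needed.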

Regression statistics

The second objective of regression is the statistical analysis of the results of data fitting.

Denote by $\sigma^2$ the variance of the error term $\varepsilon$ (so that $\varepsilon_i \sim N(0,\sigma^2)\,$ for every $i=1,\ldots,m$). An unbiased estimate of $\sigma^2$ is given by

$\hat \sigma^2 = \frac {S} {m-n}$ .

The relation between the estimate and the true value is:

$\hat\sigma^2 \cdot \frac{m-n}{\sigma^2} \sim \chi_{m-n}^2$

where $\chi_{m-n}^2$ has a chi-square distribution with $m - n$ degrees of freedom.

Application of this test requires that $\sigma^2$, the variance of an observation of unit weight, be known or estimated independently of the fit. If the $\chi^2$ test is passed, the data may be said to be fitted within observational error.
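The variance estimate above amounts to dividing the sum of squared residuals by the number of degrees of freedom. A minimal sketch, with made-up observed and fitted values and a hypothetical function name `residual_variance`:

```python
def residual_variance(y, y_fit, n_params):
    """Return S / (m - n): the unbiased estimate of the error variance."""
    S = sum((yi - fi) ** 2 for yi, fi in zip(y, y_fit))
    dof = len(y) - n_params
    if dof <= 0:
        raise ValueError("need more observations than parameters")
    return S / dof

# Made-up observations and fitted values from a two-parameter model.
y_obs = [1.1, 1.9, 3.2, 3.8]
y_fit = [1.0, 2.0, 3.0, 4.0]

sigma2_hat = residual_variance(y_obs, y_fit, n_params=2)
print(sigma2_hat)
```

Here the residuals are 0.1, -0.1, 0.2 and -0.2, so S = 0.10 and, with m - n = 2, the estimate is 0.05 up to floating-point rounding.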

The solution to the normal equations can be written as ^[1]

$\hat\boldsymbol\beta=(\mathbf{X^TX)^{-1}X^Ty}$

This shows that the parameter estimators are linear combinations of the dependent variable. It follows that, if the observational errors are normally distributed, each parameter estimator, after subtracting the true value and dividing by its estimated standard deviation, follows a Student's t distribution with $m - n$ degrees of freedom. The estimated standard deviation of a parameter estimator is given by

$\hat\sigma_j=\sqrt{ \frac{S}{m-n}\left[\mathbf{(X^TX)}^{-1}\right]_{jj}}$

The $100(1 - \alpha)\%$ confidence interval for the parameter $\beta_j$ is computed as follows:

$\hat \beta_j \pm t_{\frac{\alpha }{2},m-n} \hat \sigma_j$

The residuals can be expressed as

$\mathbf{\hat r = y-X \hat\boldsymbol\beta= y-X(X^TX)^{-1}X^Ty}\,$

The matrix $\mathbf{X(X^TX)^{-1}X^T}$ is known as the hat matrix and has the useful property that it is idempotent. Using this property it can be shown that, if the errors are normally distributed, the residuals are also normally distributed with zero mean, and that studentized residuals (residuals divided by their estimated standard deviations) can be compared against a Student's t distribution. Studentized residuals are useful in testing for outliers.
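The idempotence of the hat matrix can be verified numerically on a small case. The sketch below uses a made-up three-point straight-line problem and hypothetical helper functions (`matmul`, `transpose`, `inverse_2x2`); the explicit 2 by 2 inverse is used only to keep the example short.

```python
def matmul(A, B):
    """Product of two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def inverse_2x2(A):
    """Closed-form inverse of a 2x2 matrix."""
    (a, b), (c, d) = A
    det = a * d - b * c
    if det == 0:
        raise ValueError("matrix is singular")
    return [[d / det, -b / det], [-c / det, a / det]]

# Made-up straight-line problem: columns 1 and x.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]
y = [[1.0], [2.0], [4.0]]

Xt = transpose(X)
H = matmul(matmul(X, inverse_2x2(matmul(Xt, X))), Xt)   # hat matrix
HH = matmul(H, H)                                       # should equal H

fitted = matmul(H, y)                                   # H y gives the fitted values
residuals = [yi[0] - fi[0] for yi, fi in zip(y, fitted)]

print(H)
print(residuals)
```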

Given a value of the independent variable, x_d, the predicted response is calculated as

$y_d=\sum_{j=1}^{n}X_{dj}\hat\beta_j$

Writing the elements $X_{dj},\ j=1,\ldots,n$ as the vector $\mathbf z$, the $100(1 - \alpha)\%$ mean response confidence interval for the prediction is given, using error propagation theory, by:

$\mathbf{z^T\hat\boldsymbol\beta} \pm t_{ \frac{\alpha }{2} ,m-n} \hat \sigma \sqrt {\mathbf{ z^T(X^T X)^{- 1}z }}$

The $100(1 - \alpha)\%$ predicted response confidence intervals for the data are given by:

$\mathbf z^T \hat\boldsymbol\beta \pm t_{\frac{\alpha }{2},m-n} \hat \sigma \sqrt {1 + \mathbf{z^T(X^TX)^{-1}z}}$ .

Linear case

In the case that the formula to be fitted is a straight line, $y=\alpha+\beta x\!$ , the normal equations are

$\begin{array}{lcl} m\ \alpha + \sum x_i\ \beta =\sum y_i \\ \sum x_i\ \alpha + \sum x_i^2\ \beta =\sum x_iy_i \end{array}$

where all summations are from i = 1 to i = m. Hence, by Cramer's rule,

$\hat\beta = \frac {m \sum x_iy_i - \sum x_i \sum y_i} {\Delta} =\frac{\sum(x_i-\bar{x})(y_i-\bar{y})}{\sum(x_i-\bar{x})^2} \,$

$\hat\alpha = \frac {\sum x_i^2 \sum y_i - \sum x_i \sum x_iy_i} {\Delta}= \bar y-\bar x \hat\beta$

where

$\Delta = m\sum x_i^2 - \left(\sum x_i\right)^2$
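The closed-form expressions above translate directly into code. The following is a minimal sketch with made-up data; the function name `fit_line` is hypothetical.

```python
def fit_line(xs, ys):
    """Return (alpha_hat, beta_hat) for y = alpha + beta*x via Cramer's rule."""
    m = len(xs)
    sx = sum(xs)
    sy = sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    delta = m * sxx - sx * sx          # determinant of the normal equations
    if delta == 0:
        raise ValueError("all x values are equal; the slope is undefined")
    beta = (m * sxy - sx * sy) / delta
    alpha = (sxx * sy - sx * sxy) / delta
    return alpha, beta

# Made-up data scattered about a straight line.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.2, 4.1, 6.2, 7.9, 10.1]

alpha_hat, beta_hat = fit_line(xs, ys)
print(alpha_hat, beta_hat)
```

For these data the sums are $\sum x_i = 15$, $\sum y_i = 30.5$, $\sum x_i^2 = 55$ and $\sum x_i y_i = 111.1$, so $\Delta = 50$, the slope is 98/50 = 1.96 and the intercept is 11/50 = 0.22, which also equals $\bar y - \bar x \hat\beta = 6.1 - 3 \times 1.96$.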

The covariance matrix of the parameter estimators $(\hat\alpha, \hat\beta)$ is estimated by

$\frac{\hat\sigma^2}{\Delta}\begin{pmatrix} \sum x_i^2 & -\sum x_i \\ -\sum x_i & m \end{pmatrix}$

The mean response confidence interval is given by

$y_d = (\hat\alpha+\hat\beta x_d) \pm t_{ \frac{\alpha }{2} ,m-2} \hat \sigma \sqrt {\frac{1}{m} + \frac{(x_d - \bar{x})^2}{\sum (x_i - \bar{x})^2}}$

The predicted response confidence interval is given by

$y_d = (\hat\alpha+\hat\beta x_d) \pm t_{ \frac{\alpha }{2} ,m-2} \hat \sigma \sqrt {1+\frac{1}{m} + \frac{(x_d - \bar{x})^2}{\sum (x_i - \bar{x})^2}}$
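The two intervals for a straight-line fit can be sketched as follows. The data are made up, the function name `line_intervals` is hypothetical, and the critical value of Student's t is passed in by the caller because the Python standard library provides no t quantile function; the value 3.182 used here is the two-sided 95% value for 3 degrees of freedom.

```python
import math

def line_intervals(xs, ys, x_d, t_crit):
    """Return (centre, mean-response half-width, predicted-response half-width)."""
    m = len(xs)
    xbar = sum(xs) / m
    ybar = sum(ys) / m
    sxx = sum((x - xbar) ** 2 for x in xs)
    beta = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / sxx
    alpha = ybar - beta * xbar
    # Residual sum of squares and estimated error standard deviation.
    S = sum((y - alpha - beta * x) ** 2 for x, y in zip(xs, ys))
    sigma = math.sqrt(S / (m - 2))
    leverage = 1.0 / m + (x_d - xbar) ** 2 / sxx
    centre = alpha + beta * x_d
    half_mean = t_crit * sigma * math.sqrt(leverage)
    half_pred = t_crit * sigma * math.sqrt(1.0 + leverage)
    return centre, half_mean, half_pred

xs = [1.0, 2.0, 3.0, 4.0, 5.0]          # made-up data
ys = [2.2, 4.1, 6.2, 7.9, 10.1]

centre, half_mean, half_pred = line_intervals(xs, ys, x_d=3.0, t_crit=3.182)
print(centre, half_mean, half_pred)
```

The predicted response interval is always wider than the mean response interval, because it also includes the variance of a new observation. Both intervals are narrowest at $x_d = \bar x$ and widen as $x_d$ moves away from the mean.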

Variance analysis

Variance analysis is similar to ANOVA in that the total sum of squares is split into two components. The regression (or explained) sum of squares, SSR, is given, for a model that includes a constant term, by:

$\mathit{SSR} = \sum {\left( {\hat y_i - \bar y} \right)^2 } = \hat\boldsymbol\beta^T \mathbf{X}^T \mathbf y - \frac{1}{m}\left( \mathbf {y^T u u^T y} \right)$

where $\bar y = \frac{1}{m} \sum y_i$ and u is an m by 1 vector of ones. Note that the terms $\mathbf{y^T u}$ and $\mathbf{u^T y}$ are both equal to $\sum y_i$, and so the term $\frac{1}{m} \mathbf{y^T u u^T y}$ is equal to $\frac{1}{m}\left(\sum y_i\right)^2$. Some texts use these abbreviations differently (for example, RSS for the residual sum of squares); the definitions given here are the ones used in this section.

The error (or unexplained) sum of squares, ESS, which is the sum of squared residuals S, is given by:

$\mathit{ESS} = \sum \left( y_i - \hat y_i \right)^2 = \mathbf{ y^T y - \hat\boldsymbol\beta^T X^T y}$

The total sum of squares, TSS, is given by

${\mathit{TSS} = \sum\left( y_i-\bar y \right)^2 = \mathbf{ y^T y}-\frac{1}{m}\left( \mathbf{y^Tuu^Ty}\right)=\mathit{SSR}+ \mathit{ESS}}$

The coefficient of determination, $R^2$, is then given by

${R^2 = \frac{\mathit{SSR}}{{\mathit{TSS}}} = 1 - \frac{\mathit{ESS}}{\mathit{TSS}}}$
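The decomposition can be checked numerically. The sketch below uses made-up data together with the intercept (0.22) and slope (1.96) of their least-squares line, which are assumed here rather than recomputed; the function name `sums_of_squares` is hypothetical. The identity TSS = SSR + ESS relies on the fitted values coming from a least-squares fit that includes a constant term.

```python
def sums_of_squares(ys, fitted):
    """Return (SSR, ESS, TSS) for observed values ys and fitted values."""
    ybar = sum(ys) / len(ys)
    ssr = sum((f - ybar) ** 2 for f in fitted)              # explained
    ess = sum((y - f) ** 2 for y, f in zip(ys, fitted))     # unexplained
    tss = sum((y - ybar) ** 2 for y in ys)                  # total
    return ssr, ess, tss

xs = [1.0, 2.0, 3.0, 4.0, 5.0]          # made-up data
ys = [2.2, 4.1, 6.2, 7.9, 10.1]

# Least-squares line for these data (assumed: intercept 0.22, slope 1.96).
fitted = [0.22 + 1.96 * x for x in xs]

ssr, ess, tss = sums_of_squares(ys, fitted)
r_squared = ssr / tss

print(ssr, ess, tss)
print(r_squared)
```

For these data SSR is about 38.416, ESS about 0.044 and TSS about 38.46, so $R^2$ is close to 1.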

Example

To illustrate the various goals of regression, we give an example. The following data set gives the average heights and weights for American women aged 30-39 (source: The World Almanac and Book of Facts, 1975).

Height/m	1.47	1.50	1.52	1.55	1.57	1.60	1.63	1.65	1.68	1.70	1.73	1.75	1.78	1.80	1.83
Weight/kg	52.21	53.12	54.48	55.84	57.20	58.57	59.93	61.29	63.11	64.47	66.28	68.10	69.92	72.19	74.46

A plot of weight against height shows that the relationship is not well described by a straight line, so a regression is performed by modeling the data by a parabola:

$y = \beta_0 + \beta_1 x+ \beta_2 x^2 +\varepsilon\!$

where the dependent variable, y, is weight and the independent variable, x, is height.

Place the coefficients, $1, x_i\ \mbox{and}\ x_i^2$ , of the parameters for the ith observation in the ith row of the matrix X.

$X= \begin{bmatrix} 1&1.47&2.16\\ 1&1.50&2.25\\ 1&1.52&2.31\\ 1&1.55&2.40\\ 1&1.57&2.46\\ 1&1.60&2.56\\ 1&1.63&2.66\\ 1&1.65&2.72\\ 1&1.68&2.82\\ 1&1.70&2.89\\ 1&1.73&2.99\\ 1&1.75&3.06\\ 1&1.78&3.17\\ 1&1.80&3.24\\ 1&1.83&3.35\\ \end{bmatrix}$

The values of the parameters are found by solving the normal equations

$\mathbf{(X^TX)\boldsymbol\hat\beta=X^Ty}$

Element ij of the normal equation matrix, $\mathbf{X^TX}$, is formed by summing the products of column i and column j of X.

$\left(\mathbf{X^TX}\right)_{ij}=\sum_{k=1}^{15} X_{ki}X_{kj}$

Element i of the right-hand side vector $\mathbf{X^Ty}$ is formed by summing the products of column i of X with the column of dependent variable values.

$\left(\mathbf{X^Ty}\right)_i=\sum_{k=1}^{15} X_{ki} y_k$

Thus, the normal equations are

$\begin{bmatrix} 15&24.76&41.05\\ 24.76&41.05&68.37\\ 41.05&68.37&114.35\\ \end{bmatrix} \begin{bmatrix} \hat\beta_0\\ \hat\beta_1\\ \hat\beta_2\\ \end{bmatrix} = \begin{bmatrix} 931\\ 1548\\ 2586\\ \end{bmatrix}$

Solving these equations gives the parameter estimates (value $\pm$ standard deviation):

$\hat\beta_0=129 \pm 16$

$\hat\beta_1=-143 \pm 20$

$\hat\beta_2=62 \pm 6$

The calculated values are given by

$y^{calc}_i = \hat\beta_0 + \hat\beta_1 x_i+ \hat\beta_2 x^2_i$

The observed and calculated data are plotted together and the residuals, $y_i-y^{calc}_i$ , are calculated and plotted. Standard deviations are calculated using the sum of squares, $S = 0.76$ .

The confidence intervals are computed using:

$[\hat{\beta_j}-\hat\sigma_j t_{m-n;1-\frac{\alpha}{2}};\hat{\beta_j}+\hat\sigma_j t_{m-n;1-\frac{\alpha}{2}}]$

With $\alpha = 5\%$, $t_{m-n;1-\frac{\alpha}{2}} = 2.2$. Therefore, the 95% confidence intervals are:

$\beta_0\in[92.9,164.7]$

$\beta_1\in[-186.8,-99.5]$

$\beta_2\in[48.7,75.2]$
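The whole example can be reproduced with a short pure-Python script. This is a sketch under stated assumptions: the helper `solve` (Gaussian elimination with partial pivoting) is a hypothetical name, the squares of the heights are computed in full precision rather than rounded to two decimals as in the matrix shown above, and the critical value 2.2 is the rounded figure quoted in the text.

```python
import math

heights = [1.47, 1.50, 1.52, 1.55, 1.57, 1.60, 1.63, 1.65,
           1.68, 1.70, 1.73, 1.75, 1.78, 1.80, 1.83]            # metres
weights = [52.21, 53.12, 54.48, 55.84, 57.20, 58.57, 59.93, 61.29,
           63.11, 64.47, 66.28, 68.10, 69.92, 72.19, 74.46]     # kilograms

def solve(A, b):
    """Solve A beta = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    beta = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(M[i][k] * beta[k] for k in range(i + 1, n))
        beta[i] = (M[i][n] - s) / M[i][i]
    return beta

# Design matrix with columns 1, x, x^2 and the normal equations.
X = [[1.0, x, x * x] for x in heights]
m, n = len(weights), 3
XtX = [[sum(r[i] * r[j] for r in X) for j in range(n)] for i in range(n)]
Xty = [sum(r[i] * y for r, y in zip(X, weights)) for i in range(n)]

beta = solve(XtX, Xty)

# Residuals and their sum of squares, S.
residuals = [y - sum(b * c for b, c in zip(beta, r))
             for r, y in zip(X, weights)]
S = sum(r * r for r in residuals)

# Diagonal of (X^T X)^-1, found by solving against unit vectors.
inv_diag = [solve(XtX, [1.0 if k == j else 0.0 for k in range(n)])[j]
            for j in range(n)]
std_err = [math.sqrt(S / (m - n) * d) for d in inv_diag]

t_crit = 2.2   # rounded value quoted in the text for m - n = 12
intervals = [(b - t_crit * s, b + t_crit * s) for b, s in zip(beta, std_err)]

print(beta)
print(std_err)
print(S)
print(intervals)
```

The printed estimates, standard deviations, sum of squares and intervals can be compared with the rounded values quoted above. Because the columns 1, x and $x^2$ are nearly collinear over this narrow range of heights, the results are sensitive to rounding of the input data.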

Examining results of regression models

Checking model assumptions

The model assumptions are checked by calculating the residuals and plotting them. The following plots can be constructed to test the validity of the assumptions:

Residuals against the explanatory variables in the model, as illustrated above. The residuals should have no relation to these variables (look for possible non-linear relations) and the spread of the residuals should be the same over the whole range.
Residuals against explanatory variables not in the model. Any relation of the residuals to these variables would suggest considering these variables for inclusion in the model.
Residuals against the fitted values, $\hat\mathbf y\,$ .
A time series plot of the residuals, that is, plotting the residuals as a function of time.
Residuals against the preceding residual.
A normal probability plot of the residuals to test normality. The points should lie along a straight line.

There should be no noticeable pattern in any of these plots except the last.

Checking model structure

The structure of the model, in terms of whether all variables need to be included, can be checked using any of the following methods:

Using the confidence interval for each of the parameters, $\beta_j$. If the confidence interval includes 0, the corresponding term can be removed from the model. Ideally, a new regression analysis excluding that term is then performed, and the process is repeated until there are no more terms to remove. This is equivalent to using a t-test for each variable.
Computing F-statistics to check whether groups of explanatory variables can be removed.

Seeing how good the model is

When fitting a straight line, calculate the coefficient of determination. The closer the value is to 1, the better the fit. This coefficient gives the fraction of the observed variation in the dependent variable that is explained by the given variables.
Examine the observational and prediction confidence intervals. The narrower they are, the better.

Other procedures

Weighted least squares

Weighted least squares is a generalisation of the least squares method, used when the observational errors have unequal variance.

Errors-in-variables model

An errors-in-variables model, or total least squares, is used when the independent variable is subject to error.

Generalized linear model

A generalized linear model is used when the distribution of the errors is not a normal distribution. Examples include the exponential, gamma, inverse Gaussian, Poisson, binomial and multinomial distributions.

Robust regression

Main article: robust regression

A host of alternative approaches to the computation of regression parameters are included in the category known as robust regression. One technique minimizes the mean absolute error, or some other function of the residuals, instead of the mean squared error used in ordinary linear regression. Robust regression is much more computationally intensive than linear regression and is somewhat more difficult to implement as well. While least squares estimates are not very sensitive to violations of the assumption that the errors are normally distributed, this is not true when the variance or mean of the error distribution is not bounded, or when an analyst who can identify outliers is unavailable.

Among Stata users, robust regression is frequently taken to mean linear regression with Huber-White standard error estimates, owing to the naming conventions for regression commands. This procedure relaxes the assumption of homoscedasticity for variance estimates only; the coefficient estimates are still ordinary least squares (OLS) estimates. This occasionally leads to confusion: Stata users sometimes believe that linear regression is a robust method when this option is used, although it is not robust in the sense of outlier resistance.

Applications of linear regression

Linear regression is widely used in biological, behavioral and social sciences to describe relationships between variables. It ranks as one of the most important tools used in these disciplines.

The trend line

For trend lines as used in technical analysis, see Trend lines (technical analysis)

A trend line represents a trend, the long-term movement in time series data after other components have been accounted for. It tells whether a particular data set (say GDP, oil prices or stock prices) has increased or decreased over a period of time. A trend line could simply be drawn by eye through a set of data points, but more properly its position and slope are calculated using statistical techniques such as linear regression. Trend lines typically are straight lines, although some variations use higher-degree polynomials depending on the degree of curvature desired in the line.

Trend lines are sometimes used in business analytics to show changes in data over time. This has the advantage of being simple. Trend lines are often used to argue that a particular action or event (such as training, or an advertising campaign) caused observed changes at a point in time. This is a simple technique, and does not require a control group, experimental design, or a sophisticated analysis technique. However, it suffers from a lack of scientific validity in cases where other potential changes can affect the data.

Epidemiology

As one example, early evidence relating tobacco smoking to mortality and morbidity came from studies employing regression. Researchers usually include several variables in their regression analysis in an effort to remove factors that might produce spurious correlations. For the cigarette smoking example, researchers might include socio-economic status in addition to smoking to ensure that any observed effect of smoking on mortality is not due to some effect of education or income. However, it is never possible to include all possible confounding variables in a study employing regression. For the smoking example, a hypothetical gene might increase mortality and also cause people to smoke more. For this reason, randomized controlled trials are considered to be more trustworthy than a regression analysis.

Finance

The capital asset pricing model uses linear regression as well as the concept of Beta for analyzing and quantifying the systematic risk of an investment. This comes directly from the Beta coefficient of the linear regression model that relates the return on the investment to the return on all risky assets.

Regression may not be the appropriate way to estimate beta in finance, given that beta is supposed to measure the volatility of an investment relative to the volatility of the market as a whole. This would require that both variables be treated in the same way when estimating the slope, whereas regression treats all variability as being in the investment returns variable, i.e. it only considers residuals in the dependent variable.^[2]

References

[1] The parameter estimators should be obtained by solving the normal equations as simultaneous linear equations. The inverse normal equations matrix need only be calculated in order to obtain the standard deviations on the parameters.
[2] Tofallis, C. (2008). "Investment Volatility: A Critique of Standard Beta Estimation and a Simple Way Forward". European Journal of Operational Research.

Additional sources

Cohen, J., Cohen P., West, S.G., & Aiken, L.S. (2003). Applied multiple regression/correlation analysis for the behavioral sciences. (2nd ed.) Hillsdale, NJ: Lawrence Erlbaum Associates
Charles Darwin. The Variation of Animals and Plants under Domestication. (1869) (Chapter XIII describes what was known about reversion in Galton's time. Darwin uses the term "reversion".)
Draper, N.R. and Smith, H. Applied Regression Analysis Wiley Series in Probability and Statistics (1998)
Francis Galton. "Regression Towards Mediocrity in Hereditary Stature," Journal of the Anthropological Institute, 15:246-263 (1886).
Robert S. Pindyck and Daniel L. Rubinfeld (1998, 4th ed.). Econometric Models and Economic Forecasts, ch. 1 (Intro, incl. appendices on Σ operators & derivation of parameter est.) & Appendix 4.3 (mult. regression in matrix form).

