# Linear Regression Analysis

Predictive learning is a process in which a model is trained on observations whose outcomes are known, and the model is then used to predict, for a given new observation, either a continuous value or a categorical label. This split yields two types of data mining techniques: classification for a categorical label and regression for a continuous value.

Linear regression is not only the first but also the simplest of the regression techniques. As the name indicates, linear regression computes a linear model, which is the line of best fit for a set of data points. In a two-dimensional dataset, i.e., one with two variables, an input (predictor) variable $$x$$ and an output (dependent) variable $$y$$, the resulting linear model captures the linear part of the relationship between $$x$$ and $$y$$, which is expressed in the following equation:

$\begin{eqnarray} y \approx ax + b \tag{1} \end{eqnarray}$

The parameters $$a$$ and $$b$$ are typically calculated by the least squares approach. Once the model is developed, given a sample value of $$x$$, the model computes its associated $$y$$, which, in predictive learning, is taken as the predicted value of the true $$y$$.

## 1. Least Squares Approach

With the linear model above, each $$x$$ produces a fitted value $${\hat y}$$. The least squares approach finds the $$a$$ and $$b$$ that minimize the sum of the squared distances between each observed response $$y$$ and its fitted value $${\hat y}$$. The sum of the squared errors over $$n$$ data points is

$\begin{eqnarray} SE &=& \sum_{i=1}^n({\hat y_i}-y_i)^2 \\ &=& \sum_{i=1}^n(ax_i + b -y_i)^2 \tag{2} \end{eqnarray}$

To minimize $$SE$$ with respect to $$a$$ and $$b$$, set the following two partial derivatives to zero:

$\frac{\partial SE}{\partial a} = 0, \quad \frac{\partial SE}{\partial b} = 0$

Expanding these derivatives gives the two normal equations $$a\overline{x^2} + b\overline{x} = \overline{xy}$$ and $$a\overline{x} + b = \overline{y}$$. Simplifying them (the intermediate steps are skipped here) shows that the following two points lie on the best-fit line:

$\begin{eqnarray} (\overline{x}, \overline{y}), \left(\frac{\overline{x^2}}{\overline{x}}, \frac {\overline{xy}}{\overline{x}}\right) \end{eqnarray}$

The overline represents the mean.

Thus, the slope $$a$$ and the intercept $$b$$ are

$\begin{eqnarray} a &=& \frac {\overline{x}\,\overline{y}-\overline{xy}} {\left(\overline{x}\right)^2-\overline{x^2}} \\ b &=& {\overline{y}}-a{\overline{x}} \tag{3} \end{eqnarray}$
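As a quick check on formula (3), the following sketch computes $$a$$ and $$b$$ directly from the sample means and compares them with the coefficients returned by R's lm function; the toy data here are made up for illustration:

```r
# Toy data: y is roughly 2 + 3x plus Gaussian noise
x <- 1:50
y <- 2 + 3*x + rnorm(50, sd = 10)

# Slope and intercept from formula (3), using sample means
a <- (mean(x)*mean(y) - mean(x*y)) / (mean(x)^2 - mean(x^2))
b <- mean(y) - a*mean(x)

# The two results should agree (up to floating-point error)
c(b, a)
coef(lm(y ~ x))
```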

### Multiple predictor variables

If the predictor consists of $$m$$ variables, $$x_1, x_2, ..., x_m$$, then the linear equation for the response $$y$$ is

$\begin{eqnarray} y \approx b + \sum_{i=1}^{m}a_ix_i \tag{4} \end{eqnarray}$
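The least squares solution generalizes accordingly: stacking the predictors into a design matrix $$X$$ with a leading column of ones for the intercept, the coefficients solve the normal equations $$X^TXw = X^Ty$$. A minimal sketch, with made-up data for two predictors:

```r
# Made-up data with two predictor variables
x1 <- runif(50); x2 <- runif(50)
y  <- 1 + 2*x1 - 3*x2 + rnorm(50, sd = 0.5)

# Design matrix with an intercept column of ones
X <- cbind(1, x1, x2)

# Solve the normal equations t(X) %*% X %*% w = t(X) %*% y
w <- solve(t(X) %*% X, t(X) %*% y)

# Should agree with lm's coefficients
drop(w)
coef(lm(y ~ x1 + x2))
```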

## 2. R-squared

R-squared, a.k.a. the coefficient of determination, measures how good a linear regression line is: it is the proportion of the variation in $$y$$ that is captured by the linear model. R-squared is calculated by the following formula:

$\begin{eqnarray} r^2 &=& 1- \frac{SE}{SE_{\overline{y}}} \\ &=& 1 - \frac { \sum_{i=1}^n({\hat y_i}-y_i)^2 } { \sum_{i=1}^n(\overline{y}-y_i)^2 } \tag{5} \end{eqnarray}$

R-squared lies between 0 and 1. In general, the higher the R-squared, the better the model fits the data.
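To make formula (5) concrete, the following sketch computes R-squared by hand and compares it with the value reported by R; the data are made up for illustration:

```r
x <- 1:50
y <- 2 + 3*x + rnorm(50, sd = 10)
fit <- lm(y ~ x)

# Formula (5): r^2 = 1 - SE / SE_ybar
se      <- sum((fitted(fit) - y)^2)   # sum of squared errors of the fit
se_ybar <- sum((mean(y) - y)^2)       # squared errors around the mean
1 - se/se_ybar

# Should match the value computed by R
summary(fit)$r.squared
```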

## 3. R Function lm

R has a function lm to fit a linear model. The function lm returns an object of class "lm".

### Example 1

The following snippet shows an example of using lm to fit a dataset of two variables, x and y.

```r
# Construct 50 points with y = 2 + 3x plus Gaussian noise
df <- data.frame(x=c(1:50))
df$y <- 2 + 3*df$x + rnorm(50, sd=10)
# Fit y against x and report the model summary
lm.xy <- lm(y~x, df)
summary(lm.xy)
```

The following is the result of one run of the R snippet above (since the noise is random, the exact numbers will vary from run to run).

```
Call:
lm(formula = y ~ x, data = df)

Residuals:
    Min      1Q  Median      3Q     Max
-22.174  -6.641  -1.059   7.731  24.380

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -2.2205     3.2104  -0.692    0.492
x             3.1069     0.1096  28.355   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 11.18 on 48 degrees of freedom
Multiple R-squared:  0.9437,	Adjusted R-squared:  0.9425
F-statistic:   804 on 1 and 48 DF,  p-value: < 2.2e-16
```

#### Accessing Results

The fitted model is read from the Coefficients table, which gives $$\hat y = -2.2205 + 3.1069x$$.

To examine the fit, use the following expression to access R-squared from the summary; here it is $$0.9436641$$.

```r
summary(lm.xy)$r.squared
```

The residuals are the differences between each $$y$$ and its fitted value $${\hat y}$$.

```r
residuals(lm.xy)
```

To get the coefficients, call the coef function.

```r
coef(lm.xy)
```

```
(Intercept)           x 
  -2.220484    3.106902 
```

#### Other information

In addition, the lm object lm.xy can be plotted by the following script:

```r
par(mfrow=c(2,2))
plot(lm.xy)
```

To go back to a single graph per window, run par(mfrow=c(1,1)).

### Example 2 (Multiple predictor variables)

The following snippet shows how to fit a linear model of $$z$$ against two predictor variables $$x$$ and $$y$$.

```r
df <- data.frame(x=c(1:50))
df$y <- 2 + 3*df$x + rnorm(50, sd=10)
df$z <- -2 + 3*df$x - 5*df$y + rnorm(50, sd=1)

lm.zxy <- lm(z ~ x+y, df)
summary(lm.zxy)
```
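Once fitted, the model can also be used for prediction with R's predict function. A small sketch, where the new x and y values are made up for illustration:

```r
# Predict z for hypothetical new observations of x and y
newdata <- data.frame(x = c(10, 20), y = c(35, 70))
predict(lm.zxy, newdata)
```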

### Example 3

The following example constructs three response variables $$y1$$, $$y2$$ and $$y3$$ from a predictor variable $$x$$ by adding random Gaussian noise of increasing standard deviation, and computes a linear regression model for each.

```r
df <- data.frame(x = c(1:50))
# y = 2 + 3x + N(0,sd)
df$y1 <- 2 + 3 * df$x + rnorm(50, sd = 10)
df$y2 <- 2 + 3 * df$x + rnorm(50, sd = 30)
df$y3 <- 2 + 3 * df$x + rnorm(50, sd = 50)
```
The following script plots each pair in a scatterplot with its regression line and annotates each panel with the fitted equation and R-squared:

```r
library(ggplot2)
#install.packages("devtools")
#library(devtools)
#install_github("kassambara/easyGgplot2")
library(easyGgplot2)

# lm_eqn creates a string which writes the linear model equation.
lm_eqn = function(m) {
  l <- list(a = format(coef(m)[1], digits = 2),
            b = format(abs(coef(m)[2]), digits = 2),
            r2 = format(summary(m)$r.squared, digits = 3));
  if (coef(m)[2] >= 0) {
    eq <- substitute(italic(hat(y)) == a + b %.% italic(x)*","~~italic(r)^2~"="~r2, l)
  } else {
    eq <- substitute(italic(hat(y)) == a - b %.% italic(x)*","~~italic(r)^2~"="~r2, l)
  }
  as.character(as.expression(eq));
}

# createPlot returns a ggplot object that visualizes a pair x and y
# in a scatterplot with a linear regression line drawn over it.
createPlot <- function(vec, ylabel){
  p <- ggplot(df, aes(x = df$x, y = vec)) +
    geom_smooth(method="lm", se=FALSE, color="red", formula=y~x) +
    geom_point() +
    labs(x="x", y=ylabel) +
    theme(panel.background=element_rect(fill='white', colour="black"),
          axis.title=element_text(size=9, face='bold')) +
    geom_text(aes(x = 15, y = max(vec)-20), size=4,
              label = lm_eqn(lm(vec ~ x, df)), parse = TRUE)
  return(p)
}

library(plyr)
ps <- apply(df[,2:4], 2, createPlot, ylabel='y')

#options(repr.plot.width=5,repr.plot.height=5)
ggplot2.multiplot(ps[[1]], ps[[2]], ps[[3]], cols=1)
```
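Since easyGgplot2 is only distributed on GitHub, a CRAN-packaged alternative for the final arrangement step is gridExtra; this replaces only the last line above, under the assumption that ps holds the three ggplot objects:

```r
# Equivalent last step using gridExtra instead of easyGgplot2
library(gridExtra)
grid.arrange(ps[[1]], ps[[2]], ps[[3]], ncol = 1)
```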

## 4. Applications

For prediction purposes, linear regression can

- Fit a linear model for the relationship between a quantitative label $$y$$ and one or more predictors (regular attributes). The trained linear model can then predict $$y$$ for instances that arrive without an accompanying $$y$$.

For descriptive learning, linear regression can

- Measure the strength of the relationship between $$y$$ and each predictor attribute,

- Identify the less significant predictors (see the sketch below).
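As an illustration of the descriptive use, reusing the lm.zxy model from Example 2, the coefficient table in the summary carries the t statistics and p-values that indicate each predictor's significance:

```r
# The Pr(>|t|) column flags predictors whose estimated effect
# is not significantly different from zero
summary(lm.zxy)$coefficients
```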

Continue with the post Predictive Learning from an Operational Perspective