
Predictive learning is a process in which a model is trained from observations with known predictors and responses, and the trained model is then used to predict, for a given new observation, either a continuous value or a categorical label. This gives rise to two types of data mining techniques: classification for a categorical label and regression for a continuous value.

Linear regression is not only the first but also the simplest of the regression techniques. As the name indicates, linear regression computes a linear model, the line of best fit for a set of data points. In a two-dimensional dataset, i.e., one with two variables, an input (predictor) variable \(x\) and an output (dependent) variable \(y\), the resulting linear model represents the linear part of the relationship between \(x\) and \(y\), expressed in the following equation:

\[ \begin{eqnarray} y \approx ax + b \tag{1} \end{eqnarray} \]

The parameters \(a\) and \(b\) are typically calculated by the least squares approach. Once the model has been fitted, it can compute, for a given sample value of \(x\), the associated \(y\), which in predictive learning is taken as the predicted value of the true \(y\).
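For instance, with hypothetical parameter values \(a = 3\) and \(b = 2\), a new observation \(x = 10\) would be predicted as \(\hat y = 3 \cdot 10 + 2 = 32\).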

1. Least Squares Approach

With the linear model above, each \(x\) produces a fitted value \({\hat y}\). The least squares approach finds the \(a\) and \(b\) that minimize the sum of the squared distances between each observed response \(y\) and its fitted value \({\hat y}\). The sum of the squared errors over \(n\) data points is

\[ \begin{eqnarray} SE &=& \sum_{i=1}^n({\hat y_i}-y_i)^2 \\\\\\ &=& \sum_{i=1}^n(ax_i + b -y_i)^2 \tag{2} \end{eqnarray} \]

To minimize \(SE\) with respect to \(a\) and \(b\), set the following two partial derivatives to zero:

\[ \frac{\partial SE}{\partial a} = 0 \\\\\\ \frac{\partial SE}{\partial b} = 0 \]
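
Expanding the two derivatives term by term gives

\[ \frac{\partial SE}{\partial a} = 2\sum_{i=1}^n x_i(ax_i + b - y_i) = 0 \\\\\\ \frac{\partial SE}{\partial b} = 2\sum_{i=1}^n (ax_i + b - y_i) = 0 \]

Dividing each equation by \(2n\) rewrites the conditions in terms of means: \(a\overline{x^2} + b\overline{x} = \overline{xy}\) and \(a\overline{x} + b = \overline{y}\).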

Simplifying the equations above (the remaining steps are skipped here) shows that the following two points lie on the best-fit line:

\[ \begin{eqnarray} (\overline{x}, \overline{y}), \left(\frac{\overline{x^2}}{\overline{x}}, \frac {\overline{xy}}{\overline{x}}\right) \end{eqnarray} \]

The overline represents the mean.

Thus, the slope \(a\) and the intercept \(b\) are

\[ a = \frac {\overline{x}\hspace{.1cm}\overline{y}-\overline{xy}} {\left(\overline{x}\right)^2-\overline{x^2}} \\\\\\ b = {\overline{y}}-a{\overline{x}} \tag{3} \]
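
As a quick check of equation (3), the following R sketch computes the slope and intercept directly from the means; the small dataset and variable names are made up for illustration.

# A small illustrative dataset, roughly following y = 2x
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 4.3, 5.9, 8.2, 9.8)

# Slope and intercept from equation (3)
a <- (mean(x) * mean(y) - mean(x * y)) / (mean(x)^2 - mean(x^2))
b <- mean(y) - a * mean(x)
c(a, b)   # should match coef(lm(y ~ x)) from section 3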

Multiple predictor variables

If the predictor consists of \(m\) variables, \(x_1, x_2, ..., x_m\), then the linear equation for the response \(y\) is

\[ \begin{eqnarray} y \approx b + \sum_{i=1}^{m}a_ix_i \tag{4} \end{eqnarray} \]

2. R-squared

To measure how well a linear regression line fits the data, R-squared, also known as the coefficient of determination, gives the proportion of the variation in \(y\) that is captured by the linear model. R-squared is calculated by the following formula:

\[ \begin{eqnarray} r^2 &=& 1- \frac{SE}{SE_{\overline{y}}} \\\\\\ &=& 1 - \frac { \sum_{i=1}^n({\hat y_i}-y_i)^2 } { \sum_{i=1}^n(\overline{y}-y_i)^2 } \tag{5} \end{eqnarray} \]

Here \(SE_{\overline{y}}\) is the squared error of the baseline model that always predicts the mean \(\overline{y}\). R-squared lies between 0 and 1, and in general, the higher the R-squared, the better the model fits the data.
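
To make formula (5) concrete, the following sketch computes R-squared by hand in R, reusing the made-up dataset and the slope and intercept formulas from the least squares section; it should agree with the value reported by summary(lm(y ~ x)) for the same data.

x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 4.3, 5.9, 8.2, 9.8)

# Slope, intercept and fitted values from equation (3)
a <- (mean(x) * mean(y) - mean(x * y)) / (mean(x)^2 - mean(x^2))
b <- mean(y) - a * mean(x)
y.hat <- a * x + b

# R-squared from equation (5)
SE      <- sum((y.hat - y)^2)
SE.ybar <- sum((mean(y) - y)^2)
1 - SE / SE.ybar   # same value as summary(lm(y ~ x))$r.squared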

3. R Function lm

R has a function lm to fit a linear model. The function lm returns an object of class "lm".

Example 1

The following snippet shows an example of using lm to fit a dataset of two variables, x and y.

df <- data.frame(x=c(1:50))
df$y <- 2 + 3*df$x + rnorm(50, sd=10)   # y = 2 + 3x plus Gaussian noise (sd = 10)
lm.xy <- lm(y~x, df)                    # fit the linear model y ~ x
summary(lm.xy)

The following is the result after running the R snippet above.

Call:
lm(formula = y ~ x, data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-22.174  -6.641  -1.059   7.731  24.380 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -2.2205     3.2104  -0.692    0.492    
x             3.1069     0.1096  28.355   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 11.18 on 48 degrees of freedom
Multiple R-squared:  0.9437,	Adjusted R-squared:  0.9425 
F-statistic:   804 on 1 and 48 DF,  p-value: < 2.2e-16

Accessing Results

The fitted model is read from the Coefficients table: y = -2.2205 + 3.1069x.

To examine the fit, access R-squared from the summary with the following expression; for this model it is \(0.9436641\).

summary(lm.xy)$r.squared 

The Residuals are the differences between each observed \(y\) and its fitted value \({\hat y}\), accessible with:

residuals(lm.xy)

To get the coefficients, call the coef function.

coef(lm.xy)

(Intercept)           x 
  -2.220484    3.106902 
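
Since prediction is the point of fitting the model, new values of x can be fed to the fitted object through R's predict function. The values below are arbitrary examples:

new.x <- data.frame(x = c(5.5, 60))   # new observations (arbitrary example values)
predict(lm.xy, newdata = new.x)       # predicted y for each new x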

Other information

In addition, plotting the lm object lm.xy produces R's four standard diagnostic plots; the following script arranges them in a 2-by-2 grid:

par(mfrow=c(2,2)) 
plot(lm.xy)

To go back to a single graph per window, run par(mfrow=c(1,1)).

Example 2 (Multiple predictor variables)

The following snippet shows how to fit a linear model of \(z\) against two predictor variables \(x\) and \(y\).

df <- data.frame(x=c(1:50))
df$y <- 2 + 3*df$x + rnorm(50, sd=10)
df$z <- -2 + 3*df$x - 5*df$y + rnorm(50, sd=1)   # z depends linearly on both x and y

lm.zxy <- lm(z ~ x+y, df)
summary(lm.zxy)
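
Because z was generated as \(-2 + 3x - 5y\) plus a small amount of noise, one informal check is to compare the fitted coefficients against those generating values:

coef(lm.zxy)   # should land close to the generating values -2, 3 and -5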

Example 3

The following example constructs three response variables \(y1\), \(y2\) and \(y3\) from a predictor variable \(x\) by adding Gaussian noise of increasing standard deviation, fits a linear regression model to each, and plots the results.

df <- data.frame(x = c(1:50))
# y = 2 + 3x + N(0, sd)
df$y1 <- 2 + 3 * df$x + rnorm(50, sd = 10)
df$y2 <- 2 + 3 * df$x + rnorm(50, sd = 30)
df$y3 <- 2 + 3 * df$x + rnorm(50, sd = 50)
library(ggplot2)
#install.packages("devtools")
#library(devtools)
#install_github("kassambara/easyGgplot2")
library(easyGgplot2)

# lm_eqn creates a string which writes the linear model equation.
# Note: here a is the intercept coef(m)[1] and b the slope coef(m)[2].
lm_eqn = function(m) {
  l <- list(a = format(coef(m)[1], digits = 2),
            b = format(abs(coef(m)[2]), digits = 2),
            r2 = format(summary(m)$r.squared, digits = 3));
  if (coef(m)[2] >= 0)  {
    eq <- substitute(italic(hat(y)) == a + b %.% italic(x)*","~~italic(r)^2~"="~r2,l)
  } else {
    eq <- substitute(italic(hat(y)) == a - b %.% italic(x)*","~~italic(r)^2~"="~r2,l)
  }
  as.character(as.expression(eq));
}

# createPlot returns a ggplot object that visualizes one x-y pair as a
# scatterplot with its linear regression line and equation label.
createPlot <- function(vec, ylabel){
  p <- ggplot(df, aes(x = df$x, y = vec)) +
    geom_smooth(method="lm", se=FALSE, color="red", formula=y~x) +
    geom_point() +
    labs(x="x", y=ylabel) +
    theme(panel.background=element_rect(fill='white', colour = "black"),
          axis.title=element_text(size=9, face='bold')) +
    geom_text(aes(x = 15, y = max(vec)-20), size=4,
              label = lm_eqn(lm(vec ~ x, df)), parse = TRUE)
  return(p)
}

#library(plyr)   # not needed here; apply() below is base R
# Apply createPlot over the columns y1, y2, y3 to get a list of three plots.
ps <- apply(df[,2:4], 2, createPlot, ylabel='y')

#options(repr.plot.width=5,repr.plot.height=5)
ggplot2.multiplot(ps[[1]], ps[[2]], ps[[3]], cols=1)
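
easyGgplot2 is only available from GitHub, so installing it may be inconvenient. As an alternative, a similar stacked three-panel figure can be produced with ggplot2 alone by reshaping df into long form and faceting; this minimal sketch reuses df from above and omits the per-panel equation labels.

# Reshape df from wide (y1, y2, y3) to long form using base R
long <- data.frame(x = rep(df$x, 3),
                   y = c(df$y1, df$y2, df$y3),
                   series = rep(c("y1", "y2", "y3"), each = nrow(df)))

ggplot(long, aes(x = x, y = y)) +
  geom_smooth(method = "lm", se = FALSE, color = "red", formula = y ~ x) +
  geom_point() +
  facet_wrap(~ series, ncol = 1)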

4. Applications

For prediction, linear regression can

  • Fit a linear model for the relationship between a quantitative label \(y\) and one or more predictors (regular attributes). The trained linear model can then predict \(y\) for instances that arrive without an accompanying \(y\).

For descriptive learning, linear regression can

  • Measure the strength of the relationship between \(y\) and each predictor attribute,

  • Identify the less significant predictors, as illustrated by the sketch after this list.
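
As a brief sketch of that second point, reusing the lm.zxy model from Example 2: the coefficient table returned by summary reports a t statistic and p-value for each predictor, and predictors with large p-values contribute little to explaining the response once the other predictors are in the model.

# Estimate, standard error, t value and p-value for each predictor
summary(lm.zxy)$coefficients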

Continue with the post Predictive Learning from an Operational Perspective