The four primary steps of statistical model building are:
When creating a linear regression model, one of the data assumptions is that the relationship between the outcome and predictor variables is linear.
This can be tested visually by creating a scatter plot.
It can also be tested by calculating the correlation coefficient. A correlation coefficient near 0 suggests a weak linear relationship, while a correlation coefficient near 1 or -1 suggests a strong linear relationship.
To calculate the correlation coefficient in R, call the cor.test() function using two feature columns as parameters.
# This finds the correlation coefficient between the TV and Sales columns of the data frame named advertising.
coefficient <- cor.test(advertising$TV, advertising$Sales)
coefficient$estimate
When creating a linear regression model, one of the data assumptions is that the data has no extreme values that are not representative of the actual relationship between the two variables.
We can visualize outliers of a feature by creating a boxplot.
# This will create a box plot of the sales column from a data frame named advertising
plot <- advertising %>%
  ggplot(aes(y = sales)) +
  geom_boxplot()
The lm() function creates a linear regression model in R. This function takes an R formula Y ~ X where Y is the outcome variable and X is the predictor variable.
To create a multiple linear regression model in R, add additional predictor variables using +.
# This creates a simple linear regression model where sales is the outcome variable and podcast is the predictor variable. The data used is a data frame named train.
model <- lm(sales ~ podcast, data = train)

# This creates a multiple linear regression model where the predictor variables are podcast and TV.
model2 <- lm(sales ~ podcast + TV, data = train)
Residual Standard Error (RSE) provides an absolute measure of lack of fit of a linear regression model to the data. Because it is measured in the units of the outcome variable, it is not always clear what RSE value constitutes a strongly fitted model.
For example, if we create a model that predicts the amount of money earned from sales based on TV advertisements, RSE would be measured in dollars (the units of the outcome variable).
In R, the RSE of a linear regression model can be found by calling the summary() function using the model as a parameter. It can also be found by calling the sigma() function using the model as a parameter.
# RSE can be found in the summary of a model.
summary(model)

# This will also return the RSE of a model.
sigma(model)
A linear regression model’s R Squared value describes the proportion of variance explained by the model.
A value of 1 means that all of the variance in the data is explained by the model, and the model fits the data well. A value of 0 means that none of the variance is explained by the model.
In R, the R Squared value of a linear regression model can be found by calling the summary() function using the model as a parameter.
# This finds the R Squared value of a linear regression model named model.
summary(model)$r.squared
After creating a linear regression model, the vertical distance between a data point and the model’s estimation is called a residual.
This image shows a visualization of residuals. The blue dots are the model’s estimation for a given x value. The black dots are the original data points, scaled to be larger the further they are from the estimation. The residuals are the vertical distance between each estimation and each data point.
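As a minimal sketch (assuming the model and train data frame from the lm() example above), residuals can be computed by hand as the observed values minus the model’s predictions, or retrieved directly with residuals():
# Residuals are the observed values minus the model's fitted values.
# Assumes a model fitted with lm(sales ~ podcast, data = train).
predicted <- predict(model, newdata = train)
manual_residuals <- train$sales - predicted

# The same values are available directly from the fitted model.
all.equal(unname(manual_residuals), unname(residuals(model)))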
Comparing a linear regression model to a LOESS smoother is one way to visualize where the linear regression model diverges from the training data.
In R, you can visualize a LOESS smoother by adding geom_smooth(se = FALSE, color = "red") to your plot.
In the image, the red line is the LOESS smoother. You’ll see that as the values on the x axis increase, the data starts to diverge from the model.
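A rough sketch of such a plot, assuming the train data frame with the sales and podcast columns from the earlier examples:
# Scatter plot of the training data with a fitted regression line (blue)
# and a LOESS smoother (red) layered on top.
library(ggplot2)

ggplot(train, aes(podcast, sales)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = "blue") +
  geom_smooth(method = "loess", se = FALSE, color = "red")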
The interpretation of coefficients in multiple linear regression is different from that of coefficients in simple linear regression. The coefficient of an independent continuous variable represents the difference in the predicted value of the outcome variable for a one-unit increase in the predictor variable, given that all other variables in the model are held constant.
For example, if we made a multiple linear regression model that tried to predict sales based on tv advertisement and podcast advertisement, the coefficient for podcast advertisement would represent the change in the number of sales per one dollar increase in podcast advertisement, assuming tv advertisement was held constant.
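As a small sketch (assuming the model2 fit from the lm() example above), the fitted coefficients can be inspected with coef():
# Assumes model2 <- lm(sales ~ podcast + TV, data = train) from above.
# coef() returns the intercept plus one coefficient per predictor.
coef(model2)

# The podcast coefficient: the predicted change in sales for a one-unit
# increase in podcast advertising, holding TV constant.
coef(model2)["podcast"]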
When creating a linear regression model using one predictor variable, the regression coefficient represents the difference in the predicted value of the outcome variable for each one-unit increase in the predictor variable.
For example, if we made a simple linear regression model that tried to predict sales based on podcast advertisements, the coefficient for podcast advertisement would represent the change in the number of sales per one dollar increase in podcast advertisement.
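As a quick illustration (assuming the simple model fit from above; the podcast values here are made up), predictions at two podcast values one unit apart differ by exactly the coefficient:
# Assumes model <- lm(sales ~ podcast, data = train) from above.
# Predict at two podcast values that differ by one unit (illustrative values).
preds <- predict(model, newdata = data.frame(podcast = c(100, 101)))

# The difference between the two predictions equals the podcast coefficient.
diff(preds)
coef(model)["podcast"]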