Linear Regressions

Tips before you start:

  • You can pull up documentation for a function by executing ?function (e.g. ?lm) in the Console.
  • Have the tidyverse package installed and the dplyr library loaded in RStudio.
# only need to run once in the console if tidyverse not installed: install.packages('tidyverse')
library(dplyr)

In this activity, we build linear models to predict life expectancy from one or more independent (explanatory) variables. Download the Life Expectancy dataset by right-clicking on this text and selecting “Download linked file as”. Save it in the same folder as your current R Markdown file.

Data source.

  1. Import the dataset by typing the following:

    life_expectancy <- read.csv("WHO_Life_Expectancy_Data.csv") #change to the appropriate file path to the downloaded dataset on your computer
    head(life_expectancy,2)
    

    We will use only the data from 2015, as it is the most recent year. Filter the rows with dplyr, which we cover in the workshop Introduction to RStudio/Data Manipulation:

    life_expectancy_2015 <- life_expectancy %>% filter(Year == 2015)
    head(life_expectancy_2015,2)
    

    The lm(...) command creates linear regression models. It takes the following format:

    lm([response variable] ~ [predictor variables], data = [data source])
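
    As a minimal sketch of this format, the following fits a model on R's built-in cars dataset. That dataset is not part of this activity; it is used here only because it ships with R. The name fit_cars is illustrative.

```r
# Illustration only: `cars` is a dataset built into R, not the WHO data.
# Response variable: dist (stopping distance); predictor variable: speed.
fit_cars <- lm(dist ~ speed, data = cars)

# Additional predictors are joined with `+`, e.g. lm(y ~ x1 + x2, data = d).
coef(fit_cars)  # named vector with entries "(Intercept)" and "speed"
```

    The response always goes to the left of the ~ and the predictors to the right.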
    
  2. Simple linear regression:

    • A plot of Schooling vs Life expectancy shows that a linear relationship is reasonable. life_expectancy_2015$Schooling selects the Schooling column in the life_expectancy_2015 dataset; the same applies for the Life.expectancy column. Run:

      plot(life_expectancy_2015$Schooling, life_expectancy_2015$Life.expectancy)
      
    • We create a simple linear regression model where the response variable is Life expectancy and the predictor variable is Schooling. The model can be written as:

      life expectancy = slope * schooling + intercept

In R:

lm_schooling <- lm(Life.expectancy ~ Schooling, data = life_expectancy_2015)
summary(lm_schooling)

[Figure: plot the data]

The small p-values (<0.001) indicate that both the intercept and slope estimates are statistically significant. The R-squared value of 0.6694 indicates that 66.94% of the variation in Life expectancy can be explained by Schooling. We can write the fitted model as: Life expectancy = 2.2287 * Schooling + 42.9016. In other words, each one-unit increase in Schooling is associated with an increase of about 2.23 years in predicted life expectancy.
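
The numbers quoted from the summary can also be pulled out of the model object directly. The sketch below uses simulated stand-in data (an assumption, so the estimates will not match the WHO results), and the names sim, fit_sim, and fit_summary are illustrative. It also checks that a prediction from predict() is the same as plugging a value into slope * schooling + intercept.

```r
# Simulated stand-in data (an assumption for illustration, not the WHO dataset).
set.seed(1)
sim <- data.frame(Schooling = runif(100, min = 5, max = 20))
sim$Life.expectancy <- 43 + 2.2 * sim$Schooling + rnorm(100, sd = 4)

fit_sim <- lm(Life.expectancy ~ Schooling, data = sim)
fit_summary <- summary(fit_sim)

coef(fit_sim)             # intercept and slope estimates
fit_summary$coefficients  # estimates, standard errors, t values, p-values
fit_summary$r.squared     # proportion of variation explained by the model

# A prediction is just slope * schooling + intercept:
b <- coef(fit_sim)
manual <- unname(b["Schooling"] * 12 + b["(Intercept)"])
predicted <- unname(predict(fit_sim, newdata = data.frame(Schooling = 12)))
all.equal(manual, predicted)
```

The same calls should work on lm_schooling once the WHO data is loaded.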

  • Add this regression line to the plot with abline(lm_schooling):
plot(life_expectancy_2015$Schooling, life_expectancy_2015$Life.expectancy)
abline(lm_schooling)

[Figure: scatter plot with the regression line]

  • Get the 95% confidence interval for the coefficient estimates: confint(lm_schooling)
  • Linear regression makes several assumptions about the data:

    Linearity of the data & constant variance: check that the Residuals vs Fitted plot shows no pattern. The red line should be fairly flat, and the points should be evenly scattered around it.

plot(lm_schooling, 1)

[Figure: Residuals vs Fitted plot]

Normality: points should be close to the line in the Normal Q-Q plot.

plot(lm_schooling, 2)

[Figure: Normal Q-Q plot]

Overall, the assumptions are met. However, the Schooling vs. Life expectancy plot shows a few apparent outliers. We may want to examine these data points in further analysis.
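
One possible way to pick out candidate outliers for a closer look is to flag observations with large standardized residuals. This is a sketch on simulated stand-in data with one planted outlier, not the WHO data. The cutoff of 3 is a common rule of thumb rather than a formal test, and the names sim, fit_sim, std_res, and flagged are illustrative.

```r
# Simulated stand-in data with one planted outlier (illustration only).
set.seed(2)
sim <- data.frame(Schooling = runif(60, min = 5, max = 20))
sim$Life.expectancy <- 43 + 2.2 * sim$Schooling + rnorm(60, sd = 3)
sim$Life.expectancy[10] <- sim$Life.expectancy[10] - 30  # planted outlier in row 10

fit_sim <- lm(Life.expectancy ~ Schooling, data = sim)

# Standardized residuals: residuals scaled by their estimated standard deviation.
std_res <- rstandard(fit_sim)

# Flag rows whose standardized residual exceeds 3 in absolute value.
flagged <- which(abs(std_res) > 3)
sim[flagged, ]  # inspect the flagged rows
```

Flagged rows are candidates for inspection, not points to delete automatically.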

  3. Multiple linear regression:

    • We want to expand our model to consider an additional predictor variable, body mass index (BMI). Run the following code:
life_expectancy_2015 %>% 
    select(Life.expectancy, BMI, Schooling) %>% # get the relevant columns from the dataset
    pairs() # draw pairwise scatter plots

[Figure: pairwise scatter plots of Life.expectancy, BMI, and Schooling]

From the plot, BMI doesn’t look as good as Schooling as a predictor of Life expectancy. But we will go ahead and fit a multiple regression model to have a concrete result.

  • Create a multiple linear regression model where the response variable is Life expectancy and the independent variables are BMI and Schooling. The model can be written as:

Life expectancy = slope_1 * BMI + slope_2 * Schooling + intercept

Run the following code for that model:

lm_multiple <- lm(Life.expectancy ~ Schooling + BMI, data = life_expectancy_2015)
summary(lm_multiple)

[Figure: modelling life expectancy]

As we suspected, BMI is not a significant predictor, with a p-value of 0.234. The model as a whole is still significant (p-value < 2.2e-16) because Schooling is included. Since adding BMI contributes little, we conclude that the simple regression model is adequate for these data. For the sake of completeness, the fitted multiple regression model can be written as:

Life expectancy = 2.16981 * Schooling + 0.02442 * BMI + 42.57196
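
One way to compare the simple and multiple models more formally is an F-test on the two nested fits with anova(). The sketch below uses simulated stand-in data in which BMI has no real effect (an assumption for illustration), and all object names are illustrative. With the actual WHO data, lm() drops rows with missing values, so the two models may be fitted to different rows. anova() will refuse to compare them unless both are fitted to the same complete rows.

```r
# Simulated stand-in data (illustration only): BMI has no real effect here.
set.seed(3)
n <- 150
sim <- data.frame(Schooling = runif(n, 5, 20), BMI = runif(n, 15, 70))
sim$Life.expectancy <- 43 + 2.2 * sim$Schooling + rnorm(n, sd = 4)

fit_simple   <- lm(Life.expectancy ~ Schooling, data = sim)
fit_multiple <- lm(Life.expectancy ~ Schooling + BMI, data = sim)

# F-test: does adding BMI reduce the residual sum of squares significantly?
comparison <- anova(fit_simple, fit_multiple)
comparison

# Adjusted R-squared penalizes extra predictors, so it is easier to compare
# across models with different numbers of predictors than plain R-squared.
c(simple   = summary(fit_simple)$adj.r.squared,
  multiple = summary(fit_multiple)$adj.r.squared)
```

When only one predictor is added, the p-value of this F-test equals the p-value of that predictor's t-test in the summary of the larger model.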

  • As with the simple model, plot() produces the diagnostic graphs used to check the model assumptions:
plot(lm_multiple)
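
When run interactively, plot() on a fitted model shows its diagnostic plots one at a time. A minimal sketch of placing the four default plots in a single 2 x 2 grid follows. It uses R's built-in mtcars data as a stand-in, so fit_mt and old_par are illustrative names rather than part of this activity.

```r
# Illustration on R's built-in `mtcars` data (an assumption; not the WHO data).
fit_mt <- lm(mpg ~ wt + hp, data = mtcars)

old_par <- par(mfrow = c(2, 2))  # 2 x 2 plotting grid; remember the old settings
plot(fit_mt)                     # the four default diagnostic plots
par(old_par)                     # restore the previous layout
```

Replacing fit_mt with lm_multiple should give the same layout for the model in this activity.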
  4. Conduct a simple or multiple regression analysis with other variables of your choosing:
    • Make a scatter plot to explore their linear relationship
    • Build a linear regression model
    • Assess the results. Let the instructors know if you need help! This activity’s code in your Markdown file may look like this:

[Figure: screenshot of the Activity 3 R Markdown code]

NEXT STEP: If statements, loops, and custom functions