Linear Regressions
Tips before you start:
 You can pull up documentation for a function by executing ?function (e.g. ?lm) in the Console.
Have the tidyverse package installed and the dplyr library loaded in RStudio.
install.packages('tidyverse')
library(dplyr)
 dplyr cheat sheet
In this activity, we build linear models to predict life expectancy from one or more independent (explanatory) variables. Download the Life Expectancy dataset here and save it in your current working directory or somewhere you can find it later. Data source.

Import the dataset by typing the following:
life_expectancy <- read.csv("WHO Life Expectancy Data.csv") # change to the file path of the downloaded dataset on your computer
head(life_expectancy, 2)
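The rest of this activity refers to the column as Life.expectancy, with a dot. That is because read.csv, by default (check.names = TRUE), converts header names into syntactically valid R names. A minimal sketch of that behaviour on a made-up two-row CSV (the values and the exact header text are assumptions for illustration, not taken from the WHO file):

```r
# Toy CSV text standing in for the real file (values are made up).
csv_text <- "Country,Year,Life expectancy,Schooling\nA,2015,70.1,12.0\nB,2015,65.3,9.5"

# Default: check.names = TRUE replaces the space with a dot.
default_names <- names(read.csv(text = csv_text))

# check.names = FALSE keeps the header as written; such columns need backticks.
raw_names <- names(read.csv(text = csv_text, check.names = FALSE))

print(default_names)
print(raw_names)
```

If the column names in your copy of the dataset differ, run names(life_expectancy) after importing and adjust the code in this activity accordingly.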
We will use only the data from 2015, as it is the most recent year. Filter with dplyr, which we cover in the workshop Introduction to RStudio/Data Manipulation:
life_expectancy_2015 <- life_expectancy %>% filter(Year == 2015)
head(life_expectancy_2015, 2)
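The filter step keeps only the rows whose Year equals 2015. If dplyr is not available, the same rows can be selected with base R. A minimal sketch on a made-up data frame (toy values, not from the WHO data):

```r
# Toy data frame with two years per country (values are made up).
toy <- data.frame(
  Country = c("A", "A", "B", "B"),
  Year = c(2014, 2015, 2014, 2015),
  Schooling = c(11.8, 12.0, 9.3, 9.5)
)

# Keep only the 2015 rows with subset().
toy_2015 <- subset(toy, Year == 2015)

# The same selection written with logical indexing.
toy_2015_idx <- toy[toy$Year == 2015, ]

print(toy_2015)
```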
The lm(...) command creates linear regression models. It takes the following format:
lm([response variable] ~ [predictor variables], data = [data source])
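To see the formula format in action before touching the WHO data, here is a small sketch using the cars and mtcars datasets that ship with R (these datasets are our choice for illustration and are not part of this activity's data):

```r
# One predictor: stopping distance modelled by speed.
fit_one <- lm(dist ~ speed, data = cars)

# Several predictors are joined with "+": mpg modelled by weight and horsepower.
fit_two <- lm(mpg ~ wt + hp, data = mtcars)

# The coefficient names show which terms each model contains.
print(names(coef(fit_one)))
print(names(coef(fit_two)))
```

Because data = is supplied, the variable names in the formula are looked up as columns of that data frame, so there is no need to write cars$dist or mtcars$mpg.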

Simple linear regression:

A plot of Schooling vs Life expectancy shows that a linear relationship is reasonable. life_expectancy_2015$Schooling selects the Schooling column in the life_expectancy_2015 dataset; the same applies to the Life expectancy column. Run:
plot(life_expectancy_2015$Schooling, life_expectancy_2015$`Life.expectancy`)

We create a simple linear regression model where the response variable is Life expectancy and the predictor variable is Schooling. The model can be written as:
Life expectancy = slope * Schooling + intercept
In R:
lm_schooling <- lm(`Life.expectancy` ~ Schooling, data = life_expectancy_2015)
summary(lm_schooling)
The small p-values (<0.001) indicate that the estimates for the intercept and the slope are statistically significant. The R-squared value of 0.6694 indicates that 66.94% of the variation in Life expectancy can be explained by Schooling. We can write the model mathematically as:
Life expectancy = 2.2287 * Schooling + 42.9016
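The numbers in the summary can also be pulled out of the model object directly, and the fitted equation can be used for prediction. A minimal sketch on simulated data (the data below are randomly generated for illustration, so the estimates will not match the WHO results):

```r
# Simulated data loosely shaped like the Schooling example (made-up values).
set.seed(1)
schooling <- runif(50, min = 5, max = 20)
sim <- data.frame(
  Schooling = schooling,
  Life.expectancy = 43 + 2.2 * schooling + rnorm(50, sd = 4)
)

fit <- lm(Life.expectancy ~ Schooling, data = sim)

# Intercept and slope as a named numeric vector.
estimates <- coef(fit)

# Proportion of variation in the response explained by the model.
r_squared <- summary(fit)$r.squared

# Predicted life expectancy for 12 years of schooling.
pred <- predict(fit, newdata = data.frame(Schooling = 12))

# The same prediction computed by hand from the fitted equation.
manual <- estimates["(Intercept)"] + estimates["Schooling"] * 12

print(estimates)
print(r_squared)
print(pred)
```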

Add this regression line to the plot with
abline(lm_schooling)
 Get the 95% confidence interval for the coefficient estimates:
confint(lm_schooling)
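For a linear model, each interval is the estimate plus or minus a t critical value times the standard error. The sketch below checks this by hand against confint, using R's built-in cars dataset (chosen here only for illustration):

```r
fit <- lm(dist ~ speed, data = cars)

# Estimates and standard errors from the coefficient table.
coef_table <- coef(summary(fit))
est <- coef_table[, "Estimate"]
se <- coef_table[, "Std. Error"]

# Critical value of the t distribution for a two-sided 95% interval.
t_crit <- qt(0.975, df = df.residual(fit))

manual_ci <- cbind(lower = est - t_crit * se, upper = est + t_crit * se)
builtin_ci <- confint(fit, level = 0.95)

print(manual_ci)
print(builtin_ci)
```

Changing the level argument (for example level = 0.99) gives wider or narrower intervals.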

Linear regression makes several assumptions about the data:
Linearity and constant variance: in the Residuals vs Fitted plot, we want to see no pattern. The red line should be fairly flat and the points should be evenly scattered around it.
plot(lm_schooling, 1)
Normality: points should be close to the line in the Normal Q-Q plot.
plot(lm_schooling, 2)
Overall, the assumptions are met. However, the Schooling vs. Life expectancy plot shows a few possible outliers, which we may want to examine in further analysis.
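One possible way to find such points is to look at standardized residuals. The sketch below uses simulated data with one outlier injected on purpose. The cutoff of 3 is a common rule of thumb that we assume here, not a fixed standard:

```r
# Simulated linear data (made-up values) with one deliberately shifted point.
set.seed(42)
x <- seq(1, 20, length.out = 40)
y <- 40 + 2 * x + rnorm(40, sd = 1)
y[10] <- y[10] + 15  # inject an obvious outlier at observation 10

fit <- lm(y ~ x)

# Residuals rescaled to have roughly unit variance.
std_res <- rstandard(fit)

# Indices of observations whose standardized residual exceeds the cutoff.
flagged <- which(abs(std_res) > 3)

print(flagged)
```

With the real data, the flagged row numbers could then be used to look up which countries they are, for example life_expectancy_2015[flagged, ].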


Multiple linear regression:

We want to expand our model to consider an additional predictor variable, BMI. Run the following code:
plot(select(life_expectancy_2015, one_of(c("Life.expectancy", "BMI", "Schooling"))))
From the plot, BMI doesn’t look as good as Schooling as a predictor of Life expectancy. But we will go ahead and fit a multiple regression model to have a concrete result.

Create a multiple linear regression model where the response variable is Life expectancy and the independent variables are BMI and Schooling. The model can be written as: Life expectancy = slope_1 * BMI + slope_2 * Schooling + intercept
Run the following code for that model:
lm_multiple <- lm(`Life.expectancy` ~ Schooling + BMI, data = life_expectancy_2015)
summary(lm_multiple)
As we expected, BMI is not a significant predictor, with a p-value of 0.234. The overall model is still significant, with a p-value of 2.2e-16, because Schooling is included. Since adding BMI does not meaningfully improve the model, we conclude that the simple regression model adequately fits the data. For the sake of completeness, the multiple regression model can be written as:
Life expectancy = 2.16981 * Schooling + 0.02442 * BMI + 42.57196
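The comparison between the simple and the multiple model can also be made formally with anova, which tests whether the extra predictor improves the fit. A minimal sketch on simulated data in which BMI is unrelated to the response by construction (all values are made up, so the output will not match the WHO results):

```r
# Simulated data: Life.expectancy depends on Schooling only.
set.seed(7)
n <- 100
sim <- data.frame(Schooling = runif(n, 5, 20), BMI = runif(n, 15, 65))
sim$Life.expectancy <- 43 + 2.2 * sim$Schooling + rnorm(n, sd = 4)

fit_simple <- lm(Life.expectancy ~ Schooling, data = sim)
fit_multiple <- lm(Life.expectancy ~ Schooling + BMI, data = sim)

# F-test comparing the two nested models.
comparison <- anova(fit_simple, fit_multiple)
print(comparison)

# With one added predictor, the F-test p-value matches that predictor's t-test p-value.
p_f <- comparison$`Pr(>F)`[2]
p_t <- coef(summary(fit_multiple))["BMI", "Pr(>|t|)"]
```

Note that anova requires both models to be fitted to the same rows. If a predictor has missing values in the real data, lm drops those rows and the comparison may fail until both models are fitted on the same subset.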

Similar to the simple model, this command produces graphs to check the model assumptions.
plot(lm_multiple)
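Calling plot on a model draws four diagnostic plots one after another. To see them side by side, one option is to split the plotting area into a 2 x 2 grid first. A small sketch using R's built-in mtcars dataset (chosen here only for illustration):

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Split the plotting area into 2 rows and 2 columns, remembering the old settings.
old_par <- par(mfrow = c(2, 2))

# Draws the four default diagnostic plots into the grid.
plot(fit)

# Restore the previous plotting layout.
par(old_par)
```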


Conduct a single or multiple regression analysis with other variables of your choosing:
 Make a scatter plot to explore their linear relationship
 Build a linear regression model
 Assess the results.
Let the instructors know if you need help! This activity’s code in your Markdown file may look like this: