4-Data Visualization with ggplot2

If you and your group have any questions or get stuck as you work through this in-class exercise, please ask the instructor for assistance. Have fun!

The ggplot2 package is a popular system for creating data visualizations like plots, charts, graphs, etc.

In this activity, you will make a scatter plot, a bar chart, and a line chart.

1. Getting Ready


Task 1.1: Install and load the ‘ggthemes’ and ‘janitor’ packages.

  • Package names:
    • tidyverse
    • ggthemes
    • janitor

Check Your Code

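install.packages("ggthemes") #then, as always, press 'enter' or 'return' to submit the command for execution
install.packages("janitor")
library(tidyverse) #provides read_csv(), ggplot(), and %>%; run install.packages("tidyverse") first if it is not installed yet
library(ggthemes) #Do not wrap library() parameter string in quotes
library(janitor)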


Hint: For install.packages(), wrap the package name in "" quotation marks
- For library(), you do not need to wrap the parameter in "" quotation marks

More about ggthemes here. More about janitor here.


Task 1.2: Read and clean your data set.

  • Data set file name: flavors_of_cacao.csv (unless you changed the filename after downloading)
  • Name your variable: chocolateData
  • Clean the column header names using clean_names() where the parameter is chocolateData (leave parentheses blank if piping)
  • Remove first (empty) row using filter(ref != "REF")

Check Your Code

#if your file cannot be found, enter `getwd()` into your console and it will tell you the file path you should most likely use. If you cannot find the file, use Option a.
chocolateData <- read_csv("Desktop/flavors_of_cacao.csv") %>%
  clean_names() %>% #Clean the column header names
  filter(ref != "REF")

#If you get a column specification message (it is informational, not an error), you can silence it by adding `, show_col_types = FALSE` as a parameter to read_csv()
#e.g. chocolateData <- read_csv("Desktop/flavors_of_cacao.csv", show_col_types = FALSE)
## Rows: 1795 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (6): Company, SpecificBeanOrigin_BarName, Cocoa_Percent, Company_Locatio...
## dbl (3): REF, Review_Date, Rating
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Hint: See Activity 3, Task 3.1 for instructions on importing a csv file.
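To see what clean_names() does on its own, here is a minimal sketch on a tiny hand-made table. The four column headers are copied from the column specification shown above; the two rows of values are placeholders for illustration, and the names `messy` and `cleaned` are made up for this example.

```r
library(janitor)

# A tiny stand-in table: only the header style matters here,
# the values are placeholders
messy <- data.frame(
  Company                    = c("A. Morin", "A. Morin"),
  SpecificBeanOrigin_BarName = c("Agua Grande", "Kpime"),
  Cocoa_Percent              = c("63%", "70%"),
  Review_Date                = c(2016, 2015)
)

# clean_names() rewrites the headers in lower-case snake_case
cleaned <- clean_names(messy)

names(cleaned)
# "company" "specific_bean_origin_bar_name" "cocoa_percent" "review_date"
```

Consistent, lower-case names are easier to type and remember when you refer to columns later in aes().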


Task 1.3: Preview the first 5 rows of your chocolate data.

Check Your Code

#preview first 5 lines of chocolateData
chocolateData %>% head(5)
## # A tibble: 5 × 9
##   company  specific_bean_origin_bar_name   ref review_date cocoa_percent
##   <chr>    <chr>                         <dbl>       <dbl> <chr>        
## 1 A. Morin Agua Grande                    1876        2016 63%          
## 2 A. Morin Kpime                          1676        2015 70%          
## 3 A. Morin Atsane                         1676        2015 70%          
## 4 A. Morin Akata                          1680        2015 70%          
## 5 A. Morin Quilla                         1704        2015 70%          
## # ℹ 4 more variables: company_location <chr>, rating <dbl>, bean_type <chr>,
## #   broad_bean_origin <chr>


2. Creating Plots and Charts in ggplot2

Here is some information about creating and formatting plots, common to all types we will look at in this activity. Don’t do anything yet!

The commands to begin plots and charts are very similar. Let’s first look at the commonalities. For all of them, we will use the ggplot() function and a geometry function. ggplot() parameters are:

  • The dataset used for the plot data = datasetName
  • The aesthetic mappings. These specify which column’s values are assigned to the x axis, and which column’s values are assigned to the y axis.
    • aes(x = columnForXAxis, y = columnForYAxis)

The geometry function is attached to the ggplot() function with + geom_ and is completed by the type of plot or chart:

  • scatter plots or point plots: + geom_point()
  • bar charts: + geom_bar()
  • line charts: + geom_line()

Plots will appear in the “Plots” tab (probably in the bottom right hand quadrant of your workspace).
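The pattern described above (data, aesthetic mappings, then a geometry) can be sketched as follows. This is only an illustration: it uses R's built-in mtcars data set instead of the chocolate data, so the column names wt and mpg are not part of this activity.

```r
library(ggplot2)

# mtcars ships with R: wt = car weight, mpg = miles per gallon
p <- ggplot(data = mtcars, aes(x = wt, y = mpg)) +  # data + aesthetic mappings
  geom_point()                                      # geometry: one point per row

p  # printing the plot object draws it
```

Swapping geom_point() for another geometry function changes the type of chart while the data and mappings stay the same.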

2.1. Scatter Plots

First things first, we need to quickly clean up our dataframe for scatter plots. Copy and paste the following code into your console, and execute it to prepare our data.


#remove the percentage signs from the column cocoa_percent by converting the values to numbers
chocolateData$cocoa_percent <- parse_number(chocolateData$cocoa_percent)

#make sure the data type of each column is correct.
chocolateData <- type_convert(chocolateData)
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   company = col_character(),
##   specific_bean_origin_bar_name = col_character(),
##   company_location = col_character(),
##   bean_type = col_character(),
##   broad_bean_origin = col_character()
## )
#You can ignore the Column Specification comment in the output. It indicates the column specification, which describes the data types of various columns after conversion, and shows that several columns have been confirmed as character columns.
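As a small illustration of the first step above, this sketch shows parse_number() on three stand-alone text values rather than on the full cocoa_percent column. The vector names `percent_text` and `percent_num` are made up for this example.

```r
library(readr)

percent_text <- c("63%", "70%", "72.5%")    # stored as text because of the % sign
percent_num  <- parse_number(percent_text)  # drops the non-numeric characters

percent_num         # 63.0 70.0 72.5
class(percent_num)  # "numeric"
```

Once the values are numeric, ggplot2 can place them on a continuous axis instead of treating each percentage as a separate category.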


Let’s apply the ggplot command above to create a scatter plot.

Definition - Scatter plot: A plot with two axes, each representing a different variable. Each individual observation is shown using a single point. The position of the point is determined by the value of the variables assigned to the x and y axes for that observation.

Chocolate bar pseudo scatter plot

Task 2.1.1: Make a scatter plot of the cocoa percentage and the rating a chocolate bar received.

  • Using chocolate data : chocolateData
  • X-axis = Cocoa percentage: cocoa_percent
  • Y-axis = Rating a chocolate bar received: rating

Check Your Code

ggplot(data = chocolateData, aes(x = cocoa_percent, y = rating)) +
    geom_point() # then add a layer of points


Output


Before we add details to our plot, we need to learn about the different components. Again, wait until the next task to do anything.

Definition - Fitted line: (aka. a ‘line of best fit’) is a line representing some function of x and y that has the best fit (or the smallest overall error) for the observed data.

Function for adding a smooth line to a plot: geom_smooth(method = "")

  • method type specifies the type of smoothing to be used
    • Linear Model ("lm"): fits a linear regression model, suitable for linear relationships.
    • Locally Estimated Scatterplot Smoothing ("loess"): creates a smooth line through the plot by fitting simple models in a localized manner, which can handle non-linear relationships well. Ideal for smaller datasets.
    • Generalized Additive Model ("gam"): models complex, nonlinear trends in data. Ideal for larger datasets.
    • Robust Linear Model (the rlm() function from the MASS package, passed as method = MASS::rlm): similar to a linear model but less sensitive to outliers. It’s useful when your data contains outliers that might skew the results of a standard linear model.
    • Splines: provide a way to smoothly interpolate between fixed points, creating a piecewise polynomial function. They are useful for fitting complex, flexible models to data. There is no "splines" method value; a spline fit is usually requested through the formula parameter, for example method = "lm" with formula = y ~ splines::bs(x, 3).
    • Moving Average: smooths data by averaging successive subsets of the full dataset. It’s useful for highlighting trends in noisy data. There is no "ma" method value; a moving average has to be calculated separately and drawn with geom_line().


  • Fitted line: method = "lm"
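To compare two of the smoothing methods listed above, here is a minimal sketch that draws a straight "lm" line and a curved "loess" line on the same scatter plot. It uses R's built-in mtcars data rather than the chocolate data, and the se = FALSE and colour settings are choices made for this illustration only.

```r
library(ggplot2)

p <- ggplot(data = mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, colour = "blue") +   # straight fitted line
  geom_smooth(method = "loess", se = FALSE, colour = "red")   # local, curved fit

p
```

Setting se = FALSE hides the shaded confidence band so the two lines are easier to tell apart.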

Task 2.1.2: Make another scatter plot of the cocoa percentage and the rating a chocolate bar received, this time with a “line of best fit”. (Informative axis labels and a title are added in Task 2.1.3.)

  • Using chocolate data : chocolateData
  • X-axis = Cocoa percentage: cocoa_percent
  • Y-axis = Rating a chocolate bar received: rating
  • Line of best fit: geom_smooth(method = "lm")

Check Your Code

ggplot(data = chocolateData, aes(x = cocoa_percent, y = rating)) +
  geom_point() + # then add a layer of points
  geom_smooth(method = "lm")
## `geom_smooth()` using formula = 'y ~ x'


Task 2.1.3: Add descriptive axis labels and a title to your scatter plot.

We’re going to add a title and axis labels using the labs() function.

  • Labels + labs(title = "", x = "", y = " ")

Check Your Code

#you can use the following labels or make your own.
ggplot(data = chocolateData, aes(x = cocoa_percent, y = rating)) +
  geom_point() + # then add a layer of points
  geom_smooth(method = "lm") + 
  labs(title = "Rating of Chocolate Bar by Cocoa Percentage", x = "Cocoa Percentage", y = "Chocolate Bar Rating") # x is cocoa_percent, y is rating


Output:

## `geom_smooth()` using formula = 'y ~ x'

2.2. Bar Charts

First things first, we need to quickly clean up our dataframe for bar charts. Copy and paste the following code into your console, and execute.


chocolateData$bean_type_simplified <- word(chocolateData$bean_type, 1)

chocolateData$bean_type_simplified <- gsub('[[:punct:]]', '', chocolateData$bean_type_simplified)
chocolateData$bean_type_simplified <- trimws(chocolateData$bean_type_simplified)

chocolateData <- chocolateData %>%
  filter(str_detect(bean_type_simplified, "\\S")) # This ensures the string contains at least one non-whitespace character

commonBeanTypes <- chocolateData %>%
  select(bean_type_simplified) %>%
  group_by(bean_type_simplified) %>%
  count() %>%
  filter(n > 20) %>%
  ungroup() %>%
  mutate(bean_type_simplified = as.factor(bean_type_simplified))


# Filter chocolateData to only include common beans
chocolateData_commonBeans <- chocolateData %>%
  filter(bean_type_simplified %in% commonBeanTypes$bean_type_simplified)


A bar chart illustrates categories along the x axis and the count of observations from each category on the y axis.

To make a bar chart, you need the data (categories, and values relevant to those categories), and the categories the data will be separated by (each representing one bar).

The first 5 rows of the bars made of common beans:

## # A tibble: 5 × 10
##   company  specific_bean_origin_bar_name   ref review_date cocoa_percent
##   <chr>    <chr>                         <dbl>       <dbl>         <dbl>
## 1 A. Morin Carenero                       1315        2014            70
## 2 A. Morin Sur del Lago                   1315        2014            70
## 3 A. Morin Puerto Cabello                 1319        2014            70
## 4 A. Morin Madagascar                     1011        2013            70
## 5 A. Morin Chuao                          1015        2013            70
## # ℹ 5 more variables: company_location <chr>, rating <dbl>, bean_type <chr>,
## #   broad_bean_origin <chr>, bean_type_simplified <chr>

The bars will represent the following categories:

## # A tibble: 4 × 2
##   bean_type_simplified     n
##   <fct>                <int>
## 1 Blend                   41
## 2 Criollo                213
## 3 Forastero              195
## 4 Trinitario             436

With the code above, you now have:

  • A dataset chocolateData_commonBeans: containing the chocolate bars made with the most common beans
  • A table commonBeanTypes: listing the common bean types and how many bars use each one. These bean types will be used as the categories for the x-axis.
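The count-then-filter pattern used in the preparation code can be sketched on a tiny made-up table. Everything here is illustrative: the table `bars`, its six rows, and the cut-off of 2 are invented so the example is easy to check by hand (the activity itself keeps bean types with more than 20 bars). count(bean) is used as shorthand for group_by() followed by count().

```r
library(dplyr)

# Made-up data: one row per chocolate bar
bars <- tibble(
  bean = c("Criollo", "Criollo", "Criollo", "Trinitario", "Trinitario", "Blend")
)

# Count rows per bean type, then keep only the types that appear at least twice
common <- bars %>%
  count(bean) %>%
  filter(n >= 2)

# Keep only the rows whose bean type is listed in the 'common' table
bars_common <- bars %>%
  filter(bean %in% common$bean)

common$bean        # "Criollo" "Trinitario"
nrow(bars_common)  # 5
```

The single "Blend" bar is dropped because its bean type does not reach the cut-off.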

Task 2.2.1: Create a basic bar chart

Your chart will illustrate how many chocolate bars are made with each of the common bean types.

  • Bean type: bean_type_simplified

Check Your Code

ggplot(chocolateData_commonBeans, aes(x = bean_type_simplified)) + # inside aes(), use the bare column name (no dataset$ prefix)
  geom_bar()

Hint: geom type = “bar”

Output:


Task 2.2.2: Create a stacked bar chart

A stacked bar chart shows two dimensions (variables) of data. Each bar represents one category of the first variable, and each bar is divided into sections that represent the categories of a second variable.

To add a second dimension,

  • following the same command as the bar chart above, modify it by:
    • adding the parameter fill = factor2name to aes(), where ‘factor2name’ is the second variable’s column name.
    • setting the parameter of geom_bar() to position="stack"
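As a sketch of these two modifications, the example below builds a stacked bar chart from the mpg data set that ships with ggplot2, not from the chocolate data. The columns class and drv belong to mpg and are used here only for illustration.

```r
library(ggplot2)

# mpg ships with ggplot2: class = type of car, drv = drive train (f, r, 4)
p <- ggplot(mpg, aes(x = class, fill = drv)) +  # fill adds the second variable
  geom_bar(position = "stack")                  # sections are stacked within each bar

p
```

Each bar is one car class, and the coloured sections within a bar show how many cars of that class have each drive train.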


Check Your Code

ggplot(chocolateData_commonBeans, aes(x = bean_type_simplified, fill = company_location)) +
  geom_bar(position = "stack")


Output:


2.3. Line Charts

Task 2.3.1: Create a variable with the mean chocolate rating by year.

Using piping, create a new variable, meanRatingByYear

  • base data: chocolateData
  • group_by: review_date
  • use summarise()
    • the parameter is rating=mean(rating)
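The group_by() and summarise() pattern described in these bullets can be sketched on a tiny made-up table, where the means are easy to check by hand. The table `ratings` and its five rows are invented for this illustration.

```r
library(dplyr)

# Made-up ratings: two bars reviewed in 2014, three in 2015
ratings <- tibble(
  review_date = c(2014, 2014, 2015, 2015, 2015),
  rating      = c(3, 4, 2, 3, 4)
)

# One row per year, holding the mean rating for that year
mean_by_year <- ratings %>%
  group_by(review_date) %>%
  summarise(rating = mean(rating))

mean_by_year$rating  # 3.5 3.0
```

The result has one row per group, which is the shape a line chart needs: one x value (the year) and one y value (the mean rating).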

Check Your Code

meanRatingByYear <- chocolateData %>%
  group_by(review_date) %>%
  summarise(rating = mean(rating))

Your output will be:

Then convert review_date to whole numbers (integer class) by entering:

meanRatingByYear$review_date <- as.integer(meanRatingByYear$review_date)

Task 2.3.2: Create a line chart using the mean chocolate rating by year.

Here we’ll make a line chart to show how the mean rating of chocolate has changed by year.

  • Your base data will be the mean rating table you just created
  • the x axis value will be the review date
  • the y axis will be the rating
  • the geom type is line, with no parameter

After the geom type, add scale_x_continuous() so that every review year gets its own label on the x axis:

+ scale_x_continuous(breaks = meanRatingByYear$review_date, labels = as.character(meanRatingByYear$review_date))

Check Your Code

ggplot(meanRatingByYear, aes(x = review_date, y = rating)) +
  geom_line()+  scale_x_continuous(
    breaks = meanRatingByYear$review_date,  # Use actual review dates for breaks
    labels = as.character(meanRatingByYear$review_date)  # Convert to character to avoid decimals
  )


Output:


Task 2.3.3: Style your line chart.

Using the same chart you just made, add some stylistic features and modifications.

  • rename the x label to “Review Date”
  • rename the y label to “Rating”
  • Add a title, “Change in Rating Over Time”, using the title parameter of labs() (ggtitle() also works)

Check Your Code

ggplot(meanRatingByYear, aes(x = review_date, y = rating)) +
  geom_line() +
  scale_x_continuous(
    breaks = meanRatingByYear$review_date,  # Use actual review dates for breaks
    labels = as.character(meanRatingByYear$review_date)  # Convert to character to avoid decimals
  ) +
  labs(
    x = "Review Date", 
    y = "Rating", 
    title = "Change in Rating Over Time"
  ) 


Output:
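The ggthemes package loaded in Task 1.1 offers ready-made themes that can be added as one more layer. The sketch below is optional and not part of the task: it uses a small made-up table (`demo`) so it runs on its own, and theme_few() is just one of the themes in ggthemes, picked here as an example.

```r
library(ggplot2)
library(ggthemes)

# Made-up yearly means, used only so the example runs on its own
demo <- data.frame(
  review_date = 2013:2016,
  rating      = c(3.1, 3.2, 3.3, 3.2)
)

p <- ggplot(demo, aes(x = review_date, y = rating)) +
  geom_line() +
  labs(x = "Review Date", y = "Rating", title = "Change in Rating Over Time") +
  theme_few()  # one of the ready-made themes from ggthemes

p
```

To try it on your own chart, add + theme_few() to the end of the line chart command from this task.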

APPENDIX: ggplot2 Cheatsheet

NEXT STEPS: Earn a Workshop Badge