4-Importing Data

So far, we’ve created our own objects by manually entering all of the data in the console. In this section, we’ll learn how to create objects by importing (aka ‘reading’) data compiled outside of R, and then visualise that data with a histogram.

1. Importing Excel data into R

R can handle multiple file types:

  • .csv (comma separated values)
  • Excel (.xls, .xlsx)
  • .txt (and .tsv - tab separated values)
  • .json (used for nested data structures)
    • Nested data like this often does not fit neatly into a single 2-dimensional table.
  • SPSS (another specialized statistics software)
  • Data scraped from the web or via an API.
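As a rough illustration of reading two of these file types, the sketch below writes a tiny made-up table to temporary .csv and .tsv files and reads them back using base R. The `pets` data and the temporary file paths are assumptions for the example only; the packages named in the closing comments are common choices for the other formats, not the only ones.

```r
# Build a tiny example table (hypothetical data, for illustration only)
pets <- data.frame(
  name = c("Tom", "Luna"),
  age = c(3, 5),
  stringsAsFactors = FALSE
)

# Write it out as .csv and .tsv files in a temporary folder
csv_path <- tempfile(fileext = ".csv")
tsv_path <- tempfile(fileext = ".tsv")
write.csv(pets, csv_path, row.names = FALSE)
write.table(pets, tsv_path, sep = "\t", row.names = FALSE, quote = FALSE)

# Read each file back into R as a data frame
pets_from_csv <- read.csv(csv_path)
pets_from_tsv <- read.delim(tsv_path)

# Other formats usually need an add-on package, for example:
#   readxl::read_excel()  for .xls / .xlsx
#   jsonlite::fromJSON()  for .json
#   haven::read_sav()     for SPSS .sav files
```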

Task 1-1

Download data.

Download and save this Excel spreadsheet of Income data

  • Note: Please remember where the income.xlsx file is saved (usually in a “downloads” or “desktop” folder).

Task 1-2

Import data.

Import the dataset of Income data

  • From the top menu bar, select:
    • File
    • Import Dataset
    • From Excel
  • In the ‘Import Excel Data’ window, select your file by either:
    • Entering the file path to the income.xlsx file you just downloaded, or
    • Selecting “Browse” on the right side of the path bar and locating the file in the file browser.
  • Under ‘Import Options’, make sure ‘Name’ contains the name you want the variable to have. Ours will be ‘income’.
  • Click “Import”.
  • If asked to install the readxl package, click Yes.

Don’t worry about making a mistake importing this data. You can always remove it using the rm() function.
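The same import can also be written as code. The following is a minimal sketch, assuming the file was saved to a “Downloads” folder in your home directory (the path is hypothetical, so adjust it to your own) and that the readxl package is installed. It also shows rm() on a throwaway object.

```r
# Hypothetical location of the downloaded file; adjust to where you saved it
income_path <- file.path("~", "Downloads", "income.xlsx")

# Only attempt the import if readxl is installed and the file is found
if (requireNamespace("readxl", quietly = TRUE) && file.exists(income_path)) {
  income <- readxl::read_excel(income_path)
}

# rm() removes an object from the workspace
scratch <- c(1, 2, 3)
rm(scratch)
exists("scratch")  # FALSE, assuming nothing else is named 'scratch'
```

If an import goes wrong, rm(income) removes the data frame in the same way, and you can then import again.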

Browse and import menu and buttons


Import excel data window


What you just imported is now stored as a ‘data frame’ object whose name is income.

Definition - Data frame: essentially a table. It is a 2-dimensional object whose columns can hold different types of data.

More about data frames

Data frames contain information about a set of objects (e.g., cats).
- The data frame will contain one or more columns and one or more rows.
- One column contains related values (column 1 = age, column 2 = eye color).
- Because a column contains the same type of information, it is equivalent to a vector.
- e.g., the ‘eye color’ column will contain characters, not numbers.
- One row denotes one object from the set. In a data frame of information about a set of cats, each row is information about one specific cat.

A row can contain many different bits of information, like age (numeric), eye color (character), breed (character), and whether or not the cat is spayed/neutered (logical, i.e. TRUE or FALSE). Because a row may contain values of different types, it cannot be stored as an ordinary vector. R treats it as a list, which can contain values of different types.
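The column and row behaviour described above can be checked on a small made-up data frame. The `cats` data below is hypothetical and exists only for illustration.

```r
# Hypothetical data about three cats, for illustration only
cats <- data.frame(
  age = c(2, 7, 4),
  eyeColour = c("green", "blue", "amber"),
  breed = c("Siamese", "Maine Coon", "Persian"),
  neutered = c(TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

# A column holds one type of value, so it behaves like a vector
class(cats$age)           # "numeric"
class(cats$eyeColour)     # "character"
is.vector(cats$neutered)  # TRUE

# A row mixes types; R returns it as a one-row data frame, which is a list
first_cat <- cats[1, ]
class(first_cat)    # "data.frame"
is.list(first_cat)  # TRUE
```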

To see the data in our data frame, simply enter the name of the data frame in the console and press ‘Enter’ or ‘Return’.

Check your code

income


The following will be the output:

## # A tibble: 10 × 4
##       id gender income experience
##    <dbl> <chr>   <dbl>      <dbl>
##  1     1 M       23000          3
##  2     2 M       55000          7
##  3     3 M       43000          5
##  4     4 F       37000          5
##  5     5 M       75000          9
##  6     6 M       72000         10
##  7     7 F      121000         13
##  8     8 F       27000          1
##  9     9 F       57000          8
## 10    10 F       91000         10

We will explore other ways to view and preview content of our data frames in Activity 3.

Note: <chr> stands for the “character” data type and <dbl> stands for the “double” data type (double-precision floating-point numbers).

We can see now that our data frame income contains 10 objects (rows) and 4 variables (columns).

  • It can be inferred that this data relates to 10 people
  • The values recorded for each person are:
    • id (in lieu of a name) (dbl)
    • gender (chr)
    • income (dbl)
    • experience (dbl)
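The row count, column count and column types can also be queried directly. The sketch below rebuilds the table by hand from the printed output above so that it runs without the Excel file. Note that this produces a plain data frame, not the tibble that the import creates.

```r
# Rebuild the income table by hand from the printed output
# (a plain data.frame stand-in for the imported tibble)
income <- data.frame(
  id = as.numeric(1:10),
  gender = c("M", "M", "M", "F", "M", "M", "F", "F", "F", "F"),
  income = c(23000, 55000, 43000, 37000, 75000,
             72000, 121000, 27000, 57000, 91000),
  experience = c(3, 7, 5, 5, 9, 10, 13, 1, 8, 10),
  stringsAsFactors = FALSE
)

nrow(income)  # 10 rows (people)
ncol(income)  # 4 columns (variables)

# Storage type of each column: "double" matches <dbl>, "character" matches <chr>
sapply(income, typeof)
```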

Task 1-3

Display summary statistics.

Display a summary of statistics for the income data.

Check your code

summary(income)
##        id           gender              income         experience   
##  Min.   : 1.00   Length:10          Min.   : 23000   Min.   : 1.00  
##  1st Qu.: 3.25   Class :character   1st Qu.: 38500   1st Qu.: 5.00  
##  Median : 5.50   Mode  :character   Median : 56000   Median : 7.50  
##  Mean   : 5.50                      Mean   : 60100   Mean   : 7.10  
##  3rd Qu.: 7.75                      3rd Qu.: 74250   3rd Qu.: 9.75  
##  Max.   :10.00                      Max.   :121000   Max.   :13.00
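Each figure in the summary can also be computed on its own. As a small sketch, the income column is typed in by hand below (copied from the printed table) and run through the individual base R functions.

```r
# Income values copied from the printed table
income_values <- c(23000, 55000, 43000, 37000, 75000,
                   72000, 121000, 27000, 57000, 91000)

min(income_values)     # 23000
median(income_values)  # 56000
mean(income_values)    # 60100
max(income_values)     # 121000

# 1st and 3rd quartiles: 38500 and 74250
quantile(income_values, c(0.25, 0.75))
```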


2. Visualize Income with a Histogram

In 3.2 we made a histogram to visualize the distribution of the pig weights. Remember that the histogram function takes a vector as its main argument.

To extract a vector (column) from our data frame, we will pass in dataframeName$columnName, where the name of the data frame and the name of the column are separated by a $ sign.

  • Replace dataframeName with the name of your imported data
  • Replace columnName with the column name representing the information you would like to analyse.
  • e.g. ‘eyeColour’ might be the column name in a dataframe named ‘cats’.
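For instance, with a small made-up `cats` data frame (hypothetical data, for illustration only), the `$` operator pulls out one column as a vector:

```r
# Hypothetical data frame named 'cats' with an 'eyeColour' column
cats <- data.frame(
  age = c(2, 7, 4),
  eyeColour = c("green", "blue", "amber"),
  stringsAsFactors = FALSE
)

# dataframeName$columnName returns that column as a vector
eye_colours <- cats$eyeColour
eye_colours  # "green" "blue" "amber"

# Double square brackets with the column name give the same vector
identical(eye_colours, cats[["eyeColour"]])  # TRUE
```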

Task 2

Create a histogram.

Display the vector of data relating to ‘experience’ as a histogram.

  • X-label: ‘Experience’
  • Title: ‘Histogram of Experience’

Check your code

#Remember, the generated histogram will appear in the Plot tab.
hist(income$experience, main='Histogram of Experience',xlab='Experience')

The following will be the output:

We can see in the histogram that there are 7 intervals with equally spaced breaks. In this case, the height of each bar is equal to the number of observations falling in that interval.

  • Why are there 7 intervals? R automatically chooses the number of intervals for you.
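To see which intervals R chose, one option is to ask hist() for its calculations without drawing the plot. In this sketch the experience values are typed in by hand from the printed table. By default R starts from a suggested number of classes (Sturges’ rule) and then rounds the breakpoints to ‘pretty’ values, which is why the final number of intervals can differ from the suggestion.

```r
# Experience values copied from the printed table
experience <- c(3, 7, 5, 5, 9, 10, 13, 1, 8, 10)

# Number of classes suggested by the default rule (Sturges)
nclass.Sturges(experience)  # 5

# plot = FALSE returns the histogram calculations without drawing anything
h <- hist(experience, plot = FALSE)
h$breaks          # 0 2 4 6 8 10 12 14
h$counts          # 1 1 2 2 3 0 1
length(h$counts)  # 7 intervals
```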

Additional: If you would prefer fewer intervals (i.e., ‘bins’), you can request a number of them with the breaks parameter. R treats a single number as a suggestion and may adjust it so that the breakpoints fall on round values.

Check your code for a custom number of intervals

#breaks suggests the number of intervals; R may adjust it to get round breakpoints
#You can add the custom labels if you would like: `main='Histogram of Experience', xlab='Experience'`
hist(income$experience, breaks=3)
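If you need an exact number of bins, one approach is to pass breaks a vector of breakpoints instead of a single number. The sketch below assumes breakpoints at 0, 4, 8, 12 and 16 (chosen here only because they cover the data) to get exactly 4 intervals. The experience values are again typed in by hand.

```r
# Experience values copied from the printed table
experience <- c(3, 7, 5, 5, 9, 10, 13, 1, 8, 10)

# Five breakpoints define exactly four intervals
four_bins <- hist(experience, breaks = seq(0, 16, by = 4), plot = FALSE)
four_bins$breaks  # 0 4 8 12 16
four_bins$counts  # 2 4 3 1

# Drop plot = FALSE (and add main= and xlab= labels) to draw the histogram
```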


NEXT STEP: Tidyverse and Data Manipulation