Basic Data Analysis
- 1 Importing data into R
- 2 Data frames
- 3 Summary statistics
- 4 Histograms
- 5 Packages and additional functions

So far, we have created our own objects by manually entering all of the data in the console. In this section, we will learn how to create objects by importing (aka ‘reading’) data (compiled outside of R) into R, perform basic statistics on it, and visualize it with a histogram.
1. Importing data into R
1.1. Working directory
Before importing your data into R, it is important to understand what the working directory is. The working directory is the location on your computer (i.e., the folder) where R looks for files when importing data and where it saves files. You typically want to have all the files related to a single project in the same folder, so that R can easily find them, and you know where they are saved.
You can check the path of your working directory by running the function getwd() in the console.
⭐ Task 3-1
Check your working directory.
Type in getwd() in the console and hit enter.
Check your code
getwd()
## [1] “A Path to a Folder”
You will get a path to a folder on your computer. This is your current working directory.
More often than not, you will want to change your working directory to a specific folder rather than the default folder. To do that, you can use the setwd() function. Inside the parentheses (i.e. as the function parameter), you should type the path to the folder between quotes. For example, let’s assume you want your working directory to be a folder called “my_project” that is in the main Documents folder. You would type:
setwd("C:/Documents/my_project")
⭐ Task 3-2
Change your working directory.
Change your working directory to a folder where you will keep all the files related to this workshop. Note: You should use forward slashes to denote the path to your folder. This should work on both Mac and Windows.
Check your code
setwd("Path to Folder") # NOTE: you should change this to the path to the folder in your computer!
If you are working alone on your scripts, always on the same computer, it is good practice to start every script by setting the working directory using setwd(). However, once you start collaborating with others, the paths to the folders can differ between computers. At that point, you might want to learn about R Projects, which make all paths relative to a pre-specified project working directory. You can read more about it here.
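As a rough sketch of the steps above, the code below checks that a folder exists before switching to it, and then switches back. The folder name my_project and the use of a temporary directory are assumptions made only so that the example runs on any computer; in practice you would use the path to your own folder.

```r
# Hypothetical project folder, created inside R's temporary directory
# so that this sketch runs on any computer
project_dir <- file.path(tempdir(), "my_project")
dir.create(project_dir, showWarnings = FALSE)

old_wd <- getwd()            # remember the current working directory

if (dir.exists(project_dir)) {
  setwd(project_dir)         # switch only if the folder really exists
}

new_wd <- getwd()
print(basename(new_wd))      # name of the folder we are now working in

setwd(old_wd)                # go back to where we started
```

Building the path with file.path() uses forward slashes, which, as noted above, should work on both Mac and Windows.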
1.2. Importing tabular data
Now that you have your working directory set up, you can import your data into R. R can handle multiple file types:
- .csv (comma-separated values)
- Excel (.xls, .xlsx)
- .txt (and .tsv - tab-separated values)
- .json (used for nested data structures; these would likely be arrays of more than 2 dimensions)
- SPSS (another specialized statistics software)
- Data scraped from the web or via an API.
For tabular data, you will most likely be importing .csv or .xlsx files. In this workshop, we will work with .csv files because you can import them with base R. If you have your data in Excel, you can save it as .csv by clicking on File > Save as. If you want to import .xlsx files directly, you will need to install a specific package (see here).
⭐ Task 3-3
Download data.
Download and save this spreadsheet of Income data.
- Note: Please save the file in your working directory, specified in the task above.
To import a .csv file in R, you can use the read.csv() function. This function takes as its main argument the name of the file you want to import. This should be in quotes and include the file type.
If you want R to import your file and save it in an object, you need to specify the name of the object and use the <- symbol to assign the imported file to the object:
# This code will create an object called object.name with the data from the .csv file
object.name <- read.csv("path-to-file.csv")
Attention: if you do not assign the imported data to an object, R will simply print it in the console and will not save it for future use. Always import data by assigning it to an object.
⭐ Task 3-4
Import data.
Use the function read.csv() to import the dataset of Income data to an object called “income”.
Check your code
income <- read.csv("income.csv")
After running this code, you should see the object “income” in your environment panel in the top right.
If you get an error message that says “No such file or directory”, it’s probably because you did not save the .csv file in your working directory, or because there is a typo in the file name.
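One possible way to troubleshoot this, sketched below, is to ask R whether it can see the file before trying to read it. The file name income.csv comes from the task above; the message text is only an illustration.

```r
file_name <- "income.csv"

if (file.exists(file_name)) {
  # The file is in the working directory, so it is safe to import it
  income <- read.csv(file_name)
} else {
  # Show where R is looking and which .csv files it can actually see there
  message("Could not find ", file_name, " in ", getwd())
  print(list.files(pattern = "\\.csv$"))
}
```

If the list printed by list.files() does not include your file, either move the file into the working directory or change the working directory with setwd().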
There are other functions in R to import other types of tabular data, and a generic function called read.table(), which is really useful if you need to specify some details when importing data, for example, which values to consider NA. To learn more about it, check this.
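As a minimal sketch of what read.table() can do, the example below writes a tiny tab-separated file to a temporary location and reads it back, telling R which text should count as NA. The file contents and the word "missing" are made up for illustration; they are not part of the workshop data.

```r
# Write a tiny tab-separated file to a temporary location (made-up data)
tmp_file <- tempfile(fileext = ".tsv")
writeLines(c("id\tincome", "1\t23000", "2\tmissing", "3\t43000"), tmp_file)

small <- read.table(
  tmp_file,
  header = TRUE,           # the first line holds the column names
  sep = "\t",              # values are separated by tabs
  na.strings = "missing"   # treat the text "missing" as NA
)

print(small)
print(is.na(small$income)) # TRUE marks the value that was read as NA
```

Because "missing" is converted to NA during import, the income column is still read as numbers rather than text.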
2. Data frames
Now that you imported your file into R, we can take a closer look at it. The file you just imported is an object of the type data frame.
To check that, you can run the code:
# The function class() tells you the type of object. It is good for checking if you imported your files correctly
class(income)
## [1] "data.frame"
Definition - Data frame: essentially a table. It is a two-dimensional object that can hold different types of data.
- Usually, data frames are used to store values of variables (i.e. the columns) recorded for different observations (i.e. the rows). For example, different observations made for different cats.
- Data frames can contain one or more columns and one or more rows.
- All values in a column are related (e.g., column 1 = age, column 2 = eye color)
- Because the column contains the same type of information, it is equivalent to a vector (i.e., the ‘eye color’ column will contain characters, not numbers).
- One row denotes one object from the set. For example, in the data frame of information about a set of cats, each row contains information about one specific cat.
- A row can contain many different bits of information, like age (numerical), eye color (character), breed (character), whether or not it’s spayed/neutered (boolean). Because rows may contain values of different types, one row would most likely not be a vector. It would likely be a list, which can contain values of different types.
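To make the cat example concrete, here is a small sketch that builds such a data frame by hand. The names and values are invented for illustration only.

```r
# A made-up data frame describing three cats
cats <- data.frame(
  name = c("Milo", "Luna", "Tiger"),
  age = c(2, 5, 11),
  eye_color = c("green", "blue", "yellow"),
  neutered = c(TRUE, TRUE, FALSE),
  stringsAsFactors = FALSE   # keep text columns as character
)

# A column holds one type of data and behaves like a vector
print(class(cats$eye_color))
print(is.vector(cats$age))

# A row mixes several types; R returns it as a one-row data frame,
# which is itself a kind of list
first_cat <- cats[1, ]
print(class(first_cat))
print(is.list(first_cat))
```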
To see the data in your data frame, simply enter the name of the data frame in the console and press ‘enter’ or ‘return’.
income
##    id gender income experience
## 1   1      M  23000          3
## 2   2      M  55000          7
## 3   3      M  43000          5
## 4   4      F  37000          5
## 5   5      M  75000          9
## 6   6      M  72000         10
## 7   7      F 121000         13
## 8   8      F  27000          1
## 9   9      F  57000          8
## 10 10      F  91000         10
If your data frame is too long, you might want to check just the top rows. You can do that with the function head():
head(income)
##   id gender income experience
## 1  1      M  23000          3
## 2  2      M  55000          7
## 3  3      M  43000          5
## 4  4      F  37000          5
## 5  5      M  75000          9
## 6  6      M  72000         10
Another useful way to inspect your data frame is to use the str() function:
str(income)
## 'data.frame': 10 obs. of 4 variables:
## $ id : int 1 2 3 4 5 6 7 8 9 10
## $ gender : chr "M" "M" "M" "F" ...
## $ income : int 23000 55000 43000 37000 75000 72000 121000 27000 57000 91000
## $ experience: int 3 7 5 5 9 10 13 1 8 10
This tells you that your data frame is made of 10 observations of 4 variables. It can be inferred that this data relates to 10 people. It then tells you the name of each variable (id, gender, income, experience), the data type of each variable (int = integer, chr = character), and the first few values of each column.
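If you only need one piece of this information at a time, a few base R functions return it directly. The sketch below rebuilds the income data frame by hand from the values printed above, so it runs even without income.csv; with the imported file you would skip that first step.

```r
# Rebuild the income data by hand from the values shown above
income <- data.frame(
  id = 1:10,
  gender = c("M", "M", "M", "F", "M", "M", "F", "F", "F", "F"),
  income = as.integer(c(23000, 55000, 43000, 37000, 75000,
                        72000, 121000, 27000, 57000, 91000)),
  experience = as.integer(c(3, 7, 5, 5, 9, 10, 13, 1, 8, 10)),
  stringsAsFactors = FALSE
)

print(dim(income))            # number of rows, then number of columns
print(names(income))          # the variable (column) names
print(sapply(income, class))  # the data type of each column
```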
You can use the $ symbol to point R to a specific column inside your data frame. For example, if you want to check the individual values for gender, you can type:
income$gender
## [1] "M" "M" "M" "F" "M" "M" "F" "F" "F" "F"
These columns are treated as vectors in R, so if you wanted to get the 4th value of the column gender, you can use the indexing inside [] that you learned in the previous section:
income$gender[4]
## [1] "F"
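The same square brackets can also be applied to the data frame itself, using the form [row, column]. The sketch below is one way to do this; it rebuilds the income data by hand from the values printed above so that it runs on its own. The last two lines use a logical condition inside the brackets, which goes slightly beyond what has been covered so far.

```r
# Rebuild the income data by hand from the values shown above
income <- data.frame(
  id = 1:10,
  gender = c("M", "M", "M", "F", "M", "M", "F", "F", "F", "F"),
  income = c(23000, 55000, 43000, 37000, 75000,
             72000, 121000, 27000, 57000, 91000),
  experience = c(3, 7, 5, 5, 9, 10, 13, 1, 8, 10),
  stringsAsFactors = FALSE
)

# Row 4 of the column "gender": same value as income$gender[4]
print(income[4, "gender"])

# The first three values of the income column
print(income$income[1:3])

# Incomes of the people whose gender is "F"
female_income <- income$income[income$gender == "F"]
print(female_income)
```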
If you want to explore more ways to view and preview the content of our data frames, check out the Data Analysis with RStudio - Data cleaning and manipulation and visualization workshop. You can also go here for more information about data frames.
3. Summary statistics
Statistics is the science of collecting, analyzing, and interpreting data to uncover patterns and trends, and to inform decisions based on this data.
If you’re unfamiliar with statistics, you can learn more about it from the W3Schools Statistics Tutorial.
In this section, we’ll be focusing on
- Basic statistical measures
- Presenting data in a histogram
More on data visualization is covered in the Data Analysis with RStudio - Data cleaning and manipulation and visualization workshop, and more on data analysis, such as statistical tests, is covered in the Data Analysis with RStudio - Intermediate data analysis workshop.
Basic statistical measures
The function names for the following three statistical measures (mean, median, standard deviation) are quite intuitive: each is simply the name or an abbreviation of the measure.
Each function takes as its argument the vector containing the values of the variable we are analyzing.
These three functions are designed for numeric and integer data. Character (text) data does not give a meaningful result; mean(), for example, returns NA with a warning. Logical (TRUE/FALSE) values are treated as 1 and 0, so mean() of a logical vector returns the proportion of TRUE values.
⭐ Task 3-5
Get the mean (average) income.
Mean: the average value in a set.
The mean() function calculates the sum of the values in the set and divides the sum by the number of items in the set.
Write and execute a command that outputs the mean income across the 10 people in our dataset. Remember: you can use the $ symbol to extract one column (i.e., one vector) from your data frame.
Check your code
# output the average income
mean(income$income)
## [1] 60100
⭐ Task 3-6
Get the median value.
Write and execute a command that outputs the median value of income.
Median: the middle value in a sorted set (e.g., lowest to highest). The function in R is median().
Check your code
median(income$income)
## [1] 56000
The output tells you the income value that falls between the higher income half and the lower income half of the people in your dataset.
⭐ Task 3-7
Get standard deviation.
Standard deviation: Describes how spread out the data is.
The function in R is sd()
Write and execute a command that outputs the standard deviation of the income.
The output tells you how much the individual incomes vary from the average income.
- A small standard deviation means that most people have an income that is close to the average, indicating uniformity in income.
- A large standard deviation suggests a wide range of incomes.
Check your code
sd(income$income)
## [1] 30479.32
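To see what these three functions do under the hood, the sketch below recomputes each measure step by step with basic arithmetic. The vector x holds the ten income values printed earlier, typed in by hand so the example runs without the data file. Note that sd() divides by n - 1 (the sample standard deviation), not by n.

```r
# The ten income values from the dataset, entered by hand
x <- c(23000, 55000, 43000, 37000, 75000,
       72000, 121000, 27000, 57000, 91000)
n <- length(x)

# Mean: the sum of the values divided by how many there are
manual_mean <- sum(x) / n

# Median: with an even number of values, the average of the two middle ones
sorted <- sort(x)
manual_median <- (sorted[n / 2] + sorted[n / 2 + 1]) / 2

# Standard deviation: square root of the summed squared distances
# from the mean, divided by n - 1
manual_sd <- sqrt(sum((x - manual_mean)^2) / (n - 1))

print(manual_mean)
print(manual_median)
print(manual_sd)
```

The three results should match the output of mean(), median() and sd() shown in the tasks above.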
Up until now, you were calculating mean, median and standard deviation for one single variable in your data frame. However, often you will want to calculate that for the entire data frame. For this, a useful function is summary(), which takes a data frame as input and returns a summary of each variable as the output.
⭐ Task 3-8
Get summary of statistics.
Display a summary of statistics for the income data.
Check your code
summary(income)
##        id           gender              income         experience
##  Min.   : 1.00   Length:10          Min.   : 23000   Min.   : 1.00
##  1st Qu.: 3.25   Class :character   1st Qu.: 38500   1st Qu.: 5.00
##  Median : 5.50   Mode  :character   Median : 56000   Median : 7.50
##  Mean   : 5.50                      Mean   : 60100   Mean   : 7.10
##  3rd Qu.: 7.75                      3rd Qu.: 74250   3rd Qu.: 9.75
##  Max.   :10.00                      Max.   :121000   Max.   :13.00
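Note that summary() does not report the standard deviation. One possible way to get a single measure for every numeric column at once is sapply(), which applies a function to each column in turn. This is only a sketch; sapply() is not covered elsewhere in this workshop, and the data frame is rebuilt by hand from the values printed above so the example runs on its own.

```r
# Rebuild the income data by hand from the values shown above
income <- data.frame(
  id = 1:10,
  gender = c("M", "M", "M", "F", "M", "M", "F", "F", "F", "F"),
  income = c(23000, 55000, 43000, 37000, 75000,
             72000, 121000, 27000, 57000, 91000),
  experience = c(3, 7, 5, 5, 9, 10, 13, 1, 8, 10),
  stringsAsFactors = FALSE
)

# Find out which columns hold numbers (gender does not)
numeric_cols <- sapply(income, is.numeric)

# Apply mean() and sd() to each numeric column
col_means <- sapply(income[numeric_cols], mean)
col_sds <- sapply(income[numeric_cols], sd)

print(col_means)
print(col_sds)
```

The id column is included because it is stored as a number, even though its mean and standard deviation are not meaningful.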
4. Histograms
Histogram: A graph used for understanding and analysing the distribution of values in a vector.
A histogram illustrates:
- Where data points tend to cluster
- The variability of data
- The shape of variability
The histogram will appear in the Plots tab (bottom right quadrant if you haven’t modified your RStudio layout).
To create a histogram, you can use the function hist(). For example, for a histogram of the income data:
# Remember that income$income grabs the variable "income" in the data frame "income"
hist(income$income)
We can also pass in additional parameters to control the way our plot looks.
Some of the frequently used parameters are:
- main: The title of the plot, e.g., main = "This is the Plot Title"
- xlab: The x-axis label, e.g., xlab = "The X Label"
- ylab: The y-axis label, e.g., ylab = "The Y Label". “Frequency” is the default value, and you don’t have to specify it unless you would like a different label.
Multiple parameters are given to a function by putting them in parentheses separated by commas, function_name(parameter1, parameter2):
# The first parameter is the name of the data (vector) object
# 'main' is the graph title
# 'xlab' is the label of the x-axis
# named parameters such as 'main' and 'xlab' can be given in any order after the data object
# the y-axis label on a histogram defaults to "Frequency". You can add 'ylab=""' if you'd like a different label.
hist(income$income, xlab="Income", main = "Histogram of Income")

⭐ Task 3-9
Create a histogram.
Create a histogram for the experience data using the histogram function hist(). Remember to add an informative title and labels.
- Parameter: vector of values to plot
Check your code
hist(income$experience, main = "Histogram of Experience", xlab = "Experience")

We can see in the histogram that there are 7 intervals with equally spaced breaks. In this case, the height of a cell is equal to the number of observations falling in that cell.
- Why are there 7 intervals? R automatically chooses the number of intervals for you.
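Additional: If you prefer having fewer or more intervals (i.e., ‘bins’), you can set that using the breaks parameter. When breaks is a single number, R treats it as a suggestion and picks round break points close to that number.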
⭐ Task 3-10
Create a histogram with a different number of intervals.
Use the argument breaks inside the function hist() to create a histogram of experience that has only 3 intervals.
Check your code
# breaks is the suggested number of intervals; R may adjust it to get round break points
# You can add the custom labels if you would like `main='Histogram of Experience',xlab='Experience', `
hist(income$experience, main = "Histogram of Experience", xlab = "Experience", breaks = 3)
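If you want to see exactly which break points R chose, or to fix them yourself, one option is sketched below. Setting plot = FALSE makes hist() return the numbers behind the histogram instead of drawing it. The experience values are typed in by hand from the data shown earlier, and the break points 0, 5, 10, 15 are just one possible choice.

```r
# The ten experience values from the dataset, entered by hand
experience <- c(3, 7, 5, 5, 9, 10, 13, 1, 8, 10)

# Default behaviour: R chooses the break points
h_default <- hist(experience, plot = FALSE)
print(h_default$breaks)   # the edges of the intervals
print(h_default$counts)   # how many observations fall in each interval

# A single number is only a suggestion for the number of intervals
h_three <- hist(experience, breaks = 3, plot = FALSE)
print(h_three$breaks)

# A vector of values sets the exact break points
h_exact <- hist(experience, breaks = seq(0, 15, by = 5), plot = FALSE)
print(h_exact$counts)
```

Passing a vector to breaks is the way to go when you need full control, because R then uses exactly those interval edges.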

5. Packages and additional functions
One of the most fascinating things about R is that it has an active community developing new packages every day, which makes R very powerful. A package is a collection of functions, data sets, documentation, and tests, external to base R, that provides additional capabilities. For example, if you want to calculate the skewness and kurtosis of a variable or distribution, you will need to install an additional package called “moments” that has those functions, as they are not available in base R.
We can install packages in the console using the install.packages() function. You should use the console and not the code editor to run this code because you only need to install the package once.
⭐ Task 3-11
Install the moments package in the Console window.
Package names: moments.
Check your code
install.packages("moments") # Install the moments package
Hint: wrap the package name in "" quotations, because it is a string type.
Note: The installation may take a while. When it’s complete, the right angle bracket > will appear at the last line of your console.
Confirm installation
To check if the package is installed, enter the following in the console:
# Paste these lines into the console, and then run.
installed <- installed.packages() # this creates an object with names of installed packages
"moments" %in% rownames(installed) # this looks for moments in that object
## [1] TRUE
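A shorter alternative, sketched below, is the base R function requireNamespace(), which returns TRUE if a package can be found and FALSE otherwise. The messages are only an illustration, and the sketch deliberately does not install anything on its own.

```r
pkg <- "moments"

# TRUE if the package is installed, FALSE if it is not
is_installed <- requireNamespace(pkg, quietly = TRUE)

if (is_installed) {
  message("The package ", pkg, " is installed.")
} else {
  message("The package ", pkg, " is missing. ",
          "Run install.packages(\"", pkg, "\") in the console.")
}
```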
Load the libraries.
After we install a package, we have to load it using the library() function.
- You do not need to wrap the package name in quotes when using library().
Why no quotations for library()?
When you install a package in R using install.packages(), the package name must be a character string, hence the quotes. This is because install.packages() is a function that takes a character vector as its argument, representing the names of the packages to be installed.
However, when you load a package using library() or require(), you’re not passing a character string; instead, you’re using a non-evaluated expression that refers to the package name. Here, the package name is an object of mode “name” which library() interprets as the name of a package to load.
In summary, the quotes are needed for install.packages() because it expects a character string, while library() is designed to take an unquoted name that it interprets as a package name.
- ❗ Put this command in your R script, not in the console. Why? The package only needs to be installed once, but it needs to be loaded any time you are running your script.
# Load the packages
library(moments)
Now that you have loaded the package moments, you can use it to calculate the kurtosis and skewness.
- Kurtosis characterizes the relative peakedness or flatness of a distribution compared with the normal distribution, for which the kurtosis estimator equals 3. Relatively peaked distributions are said to have “positive kurtosis” and are indicated by values larger than 3. Relatively flat distributions are said to have “negative kurtosis” and are indicated by values smaller than 3.

To calculate the kurtosis of income, we use the function kurtosis() from the moments package. This function takes in as an argument a numeric vector:
kurtosis(income$income)
## [1] 2.603855
- Skewness characterizes the degree of asymmetry of a distribution around its mean. Positive skewness indicates a distribution with an asymmetric tail extending toward more positive values. Negative skewness indicates a distribution with an asymmetric tail extending toward more negative values.

⭐ Task 3-12
Calculate the skewness of the income data.
Use the function skewness() to calculate the skewness of the income data.
Check your code
skewness(income$income)
## [1] 0.6433921
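For readers curious about where these two numbers come from, the sketch below recomputes them with base R only. It assumes the usual moment-based definitions: skewness is the third central moment divided by the second central moment raised to the power 3/2, and kurtosis is the fourth central moment divided by the square of the second. The income values are typed in by hand so the example runs without the data file or the moments package.

```r
# The ten income values from the dataset, entered by hand
x <- c(23000, 55000, 43000, 37000, 75000,
       72000, 121000, 27000, 57000, 91000)

# Distance of each value from the mean
d <- x - mean(x)

# Central moments: averages of the distances raised to a power
m2 <- mean(d^2)
m3 <- mean(d^3)
m4 <- mean(d^4)

manual_skewness <- m3 / m2^(3 / 2)
manual_kurtosis <- m4 / m2^2

print(manual_skewness)
print(manual_kurtosis)
```

Both results should agree with the output of skewness() and kurtosis() shown above.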
Great job! Now you know how to use the basic syntax of R in RStudio to calculate basic statistical measures! This is the official end of this workshop, but if you want, we have an optional activity for you about how to troubleshoot errors in R.