# install.packages("tidyverse") # Run this line if you don't have it installed
# install.packages("DiagrammeR") # Run this line if you don't have it installed
library(tidyverse)
library(DiagrammeR)

R Workshop: The Grammar of Data Wrangling with dplyr
1. Introduction: What is Data Wrangling?
Welcome! Before we can make beautiful plots or build complex models, we need our data in the right format, and raw data almost never arrives that way. It’s often messy, has too many columns, not enough columns, or is just plain confusing.
Data Wrangling (or data manipulation) is the process of cleaning, structuring, and enriching raw data into a desired format for better decision-making in less time.
Today, we’re going to learn the most popular and powerful tool for this in R: the dplyr package.
dplyr is beloved because it provides a “grammar” for data manipulation. It gives us a few simple verbs (functions) that we can combine in logical sentences to solve complex data problems.
2. The Problem: The Instagram Reels Conundrum
We’ve all been there: “I’ll just watch one…” and suddenly it’s 45 minutes later.
As data scientists, we’ve been asked to investigate a new synthetic dataset. Our goal is to understand the factors related to high Instagram Reels usage. We have data on 200 users, including:
- Demographics (age, gender)
- Screen time (Reels hours per week)
- Self-reported wellness (stress, anxiety, sleep quality)
- Personality metrics (extraversion, conscientiousness)
Our main questions are:
- What does a “high-usage” user look like?
- Is Reels usage related to personality?
- Does higher usage correlate with higher stress or anxiety?
- Are there differences between groups (e.g., by gender)?
Let’s use dplyr to “wrangle” this data and find some answers!
3. Setup: Loading Our Tools and Data
First, we need to load our tools. We’ll load the tidyverse, which is a collection of R packages (including dplyr) designed for data science. We’ll also load DiagrammeR for our visual mind maps.
Next, let’s create our synthetic dataset. (In a real-world scenario, you would load this from a CSV or database using read_csv()).
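For reference, here is what that real-world loading step would look like. This is just a sketch: to keep it self-contained it first writes a tiny throwaway CSV (the file name tiny_demo.csv is made up for this example), then reads it back with readr’s read_csv().

```r
library(readr)

# Write a tiny throwaway CSV so this sketch is self-contained
write_csv(
  data.frame(user_id = c("user_1", "user_2"), age = c(25, 40)),
  "tiny_demo.csv"
)

# In a real project, this is usually your very first step:
tiny_data <- read_csv("tiny_demo.csv")

# read_csv() returns a tibble, so all the dplyr verbs below work on it directly
print(tiny_data)
```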
# We use set.seed() so that our "random" data is the same for everyone
# This makes our workshop reproducible!
set.seed(123)
n <- 200 # We'll create 200 users
# --- Create our base variables ---
age <- sample(18:60, n, replace = TRUE)
gender <- sample(c("Male", "Female", "Non-binary"), n, replace = TRUE, prob = c(0.45, 0.45, 0.1))
stress_level <- sample(1:10, n, replace = TRUE)
anxiety_level <- sample(1:10, n, replace = TRUE)
personality_extraversion <- sample(1:10, n, replace = TRUE)
personality_conscientiousness <- sample(1:10, n, replace = TRUE)
reported_sleep_quality <- sample(1:10, n, replace = TRUE)
# --- Create our "outcome" variable: Reels Usage ---
# We'll make it correlated with our other variables to make it interesting!
# Logic: More usage for younger, more extraverted, more anxious, poor sleepers
base_usage <- 10
reels_hours_per_week <- base_usage +
  (personality_extraversion * 0.5) +
  (anxiety_level * 0.4) -
  ((age - 18) * 0.2) - # Younger people use it more
  (reported_sleep_quality * 0.3) +
  rnorm(n, mean = 0, sd = 3) # Add some random noise
# Clean up the data: no negative hours, and round to 1 decimal
reels_hours_per_week[reels_hours_per_week < 0] <- 0
reels_hours_per_week <- round(reels_hours_per_week, 1)
# --- Combine into our final dataset (a "tibble") ---
reels_data <- tibble::tibble(
  user_id = paste0("user_", 1:n),
  age,
  gender,
  reels_hours_per_week,
  stress_level,
  anxiety_level,
  personality_extraversion,
  personality_conscientiousness,
  reported_sleep_quality
)
# Let's take a look at our data!
print(reels_data)

# A tibble: 200 × 9
user_id age gender reels_hours_per_week stress_level anxiety_level
<chr> <int> <chr> <dbl> <int> <int>
1 user_1 48 Female 8.2 5 7
2 user_2 32 Male 11.7 2 3
3 user_3 31 Male 10.3 2 7
4 user_4 20 Male 18.3 3 10
5 user_5 59 Non-binary 0 5 3
6 user_6 60 Female 0.3 6 6
7 user_7 54 Female 6.3 10 7
8 user_8 31 Male 15.4 3 4
9 user_9 42 Male 7.7 3 10
10 user_10 43 Female 7.9 1 1
# ℹ 190 more rows
# ℹ 3 more variables: personality_extraversion <int>,
# personality_conscientiousness <int>, reported_sleep_quality <int>
# exporting the dataset for use in challenges
write_csv(reels_data, "reels_data.csv")

4. The Core Verbs of Data Wrangling
Let’s learn the grammar. The most important part of dplyr is the native R pipe operator: |>.
The “Then” Operator: The Native R Pipe (|>)
Think of the pipe (|>) as the word “then”. It takes the output of the step before it and feeds it as the first argument to the step after it.
This pipe is built directly into R (since version 4.1.0), so you don’t need any special packages to use it.
Without the pipe, code looks nested and hard to read: print(head(reels_data, 5))
With the pipe, code reads like a sentence: reels_data |> head(5) |> print() (Read as: “Take the reels_data, then take the head of it with 5 rows, then print it.”)
We will use the pipe |> to connect all our dplyr verbs.
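To make the difference concrete, here is one tiny computation written both ways, using only base R functions so it runs anywhere with R 4.1 or newer:

```r
x <- c(3, 1, 2)

# Nested style: you have to read it from the inside out
nested_result <- round(mean(sort(x)))

# Piped style: reads left to right, like a sentence
# "Take x, THEN sort it, THEN take the mean, THEN round it"
piped_result <- x |> sort() |> mean() |> round()

# Both produce exactly the same result
identical(nested_result, piped_result)
```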
Verb 1: select() - Choosing Columns
What it does: Subsets your data by picking specific columns.
Excel Analogy: This is like right-clicking and “Hiding” or “Unhiding” columns in Excel. You could also think of it as creating a new sheet and only copy-pasting the columns you want to keep.
Our Question: We are interested in personality. Let’s just look at the user_id and the two personality columns.
# --- Using select() ---
# We take our data, THEN...
personality_data <- reels_data |>
  # ...we select *only* these three columns
  select(user_id, personality_extraversion, personality_conscientiousness)
# Let's see the result
print(personality_data)

# A tibble: 200 × 3
user_id personality_extraversion personality_conscientiousness
<chr> <int> <int>
1 user_1 1 7
2 user_2 9 5
3 user_3 3 8
4 user_4 9 1
5 user_5 2 7
6 user_6 4 10
7 user_7 8 7
8 user_8 4 5
9 user_9 2 1
10 user_10 6 7
# ℹ 190 more rows
Bonus Tip: You can also use select() to remove columns using the - (minus) sign.
# This gives us all columns *except* stress_level and anxiety_level
no_wellness_data <- reels_data |>
  select(-stress_level, -anxiety_level)
# print(no_wellness_data)

Verb 2: filter() - Subsetting Rows
What it does: Picks rows based on a logical condition.
Excel Analogy: This is the Filter button in Excel (the little funnel icon). You’re clicking the dropdown on a column header and checking boxes or setting a custom filter (e.g., ‘Number Filters’ > ‘Greater Than…’).
Our Question: Who are our “power users”? Let’s find everyone who uses Reels for more than 20 hours per week.
# --- Using filter() ---
# We take our data, THEN...
power_users <- reels_data |>
  # ...we filter for rows where this condition is TRUE
  filter(reels_hours_per_week > 20)
print(power_users)

# A tibble: 0 × 9
# ℹ 9 variables: user_id <chr>, age <int>, gender <chr>,
# reels_hours_per_week <dbl>, stress_level <int>, anxiety_level <int>,
# personality_extraversion <int>, personality_conscientiousness <int>,
# reported_sleep_quality <int>
Notice the result has 0 rows: in our synthetic dataset, nobody reports more than 20 hours of Reels per week. An empty result is still an answer!
You can also use multiple conditions. Let’s find users who are under 25 AND have a stress level above 7.
# We can use comma (,) or the & symbol for AND
young_and_stressed <- reels_data |>
  filter(
    age < 25,
    stress_level > 7
  )
print(young_and_stressed)

# A tibble: 6 × 9
user_id age gender reels_hours_per_week stress_level anxiety_level
<chr> <int> <chr> <dbl> <int> <int>
1 user_71 21 Male 13.2 10 4
2 user_123 23 Female 10.1 8 6
3 user_135 24 Female 14.6 8 6
4 user_160 19 Female 14.3 10 7
5 user_178 24 Male 7.9 9 6
6 user_179 24 Female 4.9 9 3
# ℹ 3 more variables: personality_extraversion <int>,
# personality_conscientiousness <int>, reported_sleep_quality <int>
Verb 3: mutate() - Creating New Columns
What it does: Creates new columns (or changes existing ones) based on other columns.
Excel Analogy: This is like adding a new, blank column in Excel and writing a formula that refers to other cells in the same row. For example, writing =IF(D2 < 5, "Low", IF(D2 <= 15, "Medium", "High")) in a new column E and dragging that formula all the way down.
Our Question: The “hours” column is too specific. Let’s create a usage_category (‘Low’, ‘Medium’, ‘High’) to make it easier to analyze.
We’ll use a helpful function called case_when() inside our mutate() call.
# --- Using mutate() ---
# We take our data, THEN...
reels_data_with_categories <- reels_data |>
  # ...we create a new column called 'usage_category'
  mutate(
    usage_category = case_when(
      reels_hours_per_week < 5 ~ "Low",       # If hours < 5, label "Low"
      reels_hours_per_week <= 15 ~ "Medium",  # If hours is between 5 and 15
      reels_hours_per_week > 15 ~ "High"      # If hours > 15
    )
  )
# Print the result, but let's select just a few columns to see
reels_data_with_categories |>
  select(user_id, reels_hours_per_week, usage_category)

# A tibble: 200 × 3
user_id reels_hours_per_week usage_category
<chr> <dbl> <chr>
1 user_1 8.2 Medium
2 user_2 11.7 Medium
3 user_3 10.3 Medium
4 user_4 18.3 High
5 user_5 0 Low
6 user_6 0.3 Low
7 user_7 6.3 Medium
8 user_8 15.4 High
9 user_9 7.7 Medium
10 user_10 7.9 Medium
# ℹ 190 more rows
Verb 4: arrange() - Sorting Your Data
What it does: Re-orders the rows of your data based on one or more columns.
Excel Analogy: This is the Sort button in Excel. You select a column and choose ‘Sort A to Z’ (ascending) or ‘Sort Z to A’ (descending).
Our Question: Who are the top 5 most anxious users in our dataset?
# --- Using arrange() ---
# By default, arrange() sorts in ASCENDING order (A-Z, 1-10)
# To sort in DESCENDING order (10-1, Z-A), we use the desc() helper
most_anxious_users <- reels_data |>
  arrange(desc(anxiety_level)) # Sort by anxiety_level, highest to lowest
# Let's just look at the top 5
most_anxious_users |>
  select(user_id, anxiety_level, stress_level) |>
  head(5)

# A tibble: 5 × 3
user_id anxiety_level stress_level
<chr> <int> <int>
1 user_4 10 3
2 user_9 10 3
3 user_26 10 5
4 user_27 10 2
5 user_48 10 9
Verbs 5 & 6: group_by() and summarise() - The Dynamic Duo
This is the most powerful combination in dplyr.
Excel Analogy: This is the Pivot Table in Excel.
- group_by(gender) is exactly like dragging the gender field into the ‘Rows’ area of a Pivot Table.
- summarise(...) is like dragging reels_hours_per_week and stress_level into the ‘Values’ area and setting them to ‘Average’.
Thinking it Through: Our “Mind Map” for Grouping and Summarising
Our Question: What is the average Reels usage and average stress level for each gender group?
# --- Using group_by() and summarise() ---
# We take our data, THEN...
summary_by_gender <- reels_data |>
  # ...we create groups for each unique value in 'gender'
  group_by(gender) |>
  # ...THEN, we collapse each group into summary statistics
  summarise(
    # We create a new column 'average_reels_usage'
    average_reels_usage = mean(reels_hours_per_week),
    # We create 'average_stress'
    average_stress = mean(stress_level),
    # n() is a special function that counts the number of rows in each group
    number_of_users = n()
  )
print(summary_by_gender)

# A tibble: 3 × 4
gender average_reels_usage average_stress number_of_users
<chr> <dbl> <dbl> <int>
1 Female 9.29 5.63 97
2 Male 9.09 5.81 83
3 Non-binary 8.28 5.95 20
Look at that! In just a few lines of code, we have a clear, aggregated table answering our question. This is the magic of dplyr and Pivot Tables!
5. Demo: Putting It All Together
Now, let’s combine our verbs into one “sentence” to answer a more complex question.
Thinking it Through: Our “Mind Map” for the Final Analysis
Our Final Problem: “Find the average reels_hours_per_week and anxiety_level for users who are ‘highly extraverted’ (Extraversion > 7). We only want to see results for the ‘Male’ and ‘Female’ groups, and the final table should be sorted to show which group has the highest average Reels usage.”
Here is the entire chain of logic:
# --- The Final Chain ---
final_analysis <- reels_data |>
  # 1. First, FILTER to get only the users we care about
  filter(
    personality_extraversion > 7,    # Must be highly extraverted
    gender %in% c("Male", "Female")  # %in% is a helper for 'is one of these'
  ) |>
  # 2. THEN, GROUP by the 'gender' column
  group_by(gender) |>
  # 3. THEN, SUMMARISE each group
  summarise(
    avg_reels_usage = mean(reels_hours_per_week),
    avg_anxiety = mean(anxiety_level),
    count = n()
  ) |>
  # 4. FINALLY, ARRANGE the resulting summary table
  arrange(desc(avg_reels_usage))
# Let's print our final answer!
print(final_analysis)

# A tibble: 2 × 4
gender avg_reels_usage avg_anxiety count
<chr> <dbl> <dbl> <int>
1 Male 10.8 5.2 20
2 Female 10.4 6.85 34
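A quick aside on %in%, since it did the filtering work above: it checks each element on its left for membership in the vector on its right, returning a logical vector that filter() can use directly. A tiny base-R illustration:

```r
genders <- c("Male", "Female", "Non-binary", "Female")

# TRUE wherever the element appears in the set on the right
keep <- genders %in% c("Male", "Female")
print(keep)

# This is equivalent to the longer form:
# genders == "Male" | genders == "Female"
```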
6. Conclusion
Today, you’ve learned the core “grammar” of data manipulation in R:
- select() to pick columns (like Hiding columns)
- filter() to pick rows (like the Filter button)
- mutate() to create/change columns (like writing a formula)
- arrange() to sort rows (like the Sort button)
- group_by() & summarise() to aggregate data (like a Pivot Table)
- And the |> native R pipe to connect them all together!
With these six verbs, you can solve an incredible range of data wrangling problems. Happy coding!
7. Your Turn: Challenges!
Now it’s time to test your skills. Try to complete the following code chunks to answer the questions. The ___ is a placeholder for the code you need to add.
Challenge 1: The Least Stressed
Question: Find the 3 least stressed users (stress_level) who are over 40 years old. Show only their user_id, age, and stress_level.
Hint: By default, arrange() sorts from lowest to highest (ascending).
# --- Challenge 1 Code ---
# Find the 3 least stressed users over 40
least_stressed_over_40 <- reels_data |>
  filter(age > ___) |>
  arrange(____) |> # Sort by stress_level (lowest first)
  select(user_id, age, ____) |>
  head(3)
# After you fill in the blanks, uncomment the line below to see your answer
# print(least_stressed_over_40)

Challenge 2: Low-Stress Summary
Question: Create a summary that shows the average personality_extraversion and average reported_sleep_quality for each gender group. But, only include users who have a stress_level of 5 or less.
# --- Challenge 2 Code ---
# Average personality and sleep for low-stress users by gender
low_stress_summary <- reels_data |>
  filter(stress_level <= ___) |>
  group_by(____) |>
  summarise(
    avg_extraversion = mean(____),
    avg_sleep = mean(____)
  )
# After you fill in the blanks, uncomment the line below to see your answer
# print(low_stress_summary)

Challenge 3: Wellness Score
Question: First, create a new column wellness_score which is the sum of stress_level and anxiety_level. Then, find the average wellness_score for users in each gender group.
# --- Challenge 3 Code ---
# Average "wellness_score" by gender
wellness_by_gender <- reels_data |>
  mutate(wellness_score = ____ + ____) |>
  group_by(____) |>
  summarise(
    avg_wellness = mean(____)
  )
# After you fill in the blanks, uncomment the line below to see your answer
# print(wellness_by_gender)