In our dplyr session, we learned how to ask specific questions of our data. But what if we don’t know what to ask yet? We need to do an “Exploratory Data Analysis” (EDA)—an initial investigation to understand our data’s structure, find missing values, and discover patterns.
This used to be a very manual process. Now, R has incredible packages that can give us a comprehensive overview with just one or two lines of code.
Today, we’ll learn a 3-step workflow for rapid EDA:
The 5-Second Check-up: A quick look in the console.
The 5-Minute “Table 1”: The essential summary for any clinical or medical paper.
The 1-Minute Full Physical: A complete, automated HTML report.
Our Case Study: Clinical Patient Data
For this session, we’ll use a new dataset, df.xlsx. This is a synthetic (fake) dataset representing baseline characteristics for a group of clinical study participants.
The data contains the following columns:
Age: Patient’s age in years.
BMI: Body Mass Index.
WaistCircumference: Waist circumference in cm.
Cholesterol: Total cholesterol in mg/dL.
CVD_Risk_Score: A calculated cardiovascular disease risk score.
Has_DM: Categorical ("Yes"/"No") indicating if the patient has Diabetes Mellitus.
Has_IHD: Categorical ("Yes"/"No") indicating if the patient has Ischemic Heart Disease.
This is a classic dataset for which we’d want to generate a “Table 1” of baseline characteristics.
1. Setup and Data Import
First, we need to load all our packages for this session, including our “golden combo” of rio and here.
```r
# Load all our tools for today
library(rio)          # For the powerful `import()` function
library(here)         # For finding our files safely
library(dplyr)        # For data manipulation (we'll need `mutate`)
library(skimr)        # For the "5-second check-up"
library(gtsummary)    # For the "5-minute Table 1"
library(DataExplorer) # For the "1-minute Full Physical"
```
Now, let’s import our data. We’ll use rio::import() because it’s “smart” and can handle .xlsx files, .csv files, .sav files, and more with the same function.
```r
# Import the data
patient_data <- import(here("dataset", "df.xlsx"))

# Let's peek at the data
head(patient_data)
```

```
  Age  BMI WaistCircumference Cholesterol CVD_Risk_Score Has_DM Has_IHD
1  67 29.5               85.2         188           54.1     No      No
2  67 29.2               86.5         175           43.8    Yes      No
3  41 23.3               61.6         148            6.0     No      No
4  63 26.4               81.6         175           35.1     No      No
5  56 23.0               61.5         167           18.6     No      No
6  51 23.9               70.4         155           10.9     No      No
```
2. The 5-Second Check-up: skimr
The summary() function in base R is okay, but skimr::skim() is far better. It gives you a rich, text-based summary in your console and intelligently groups variables by type (e.g., numeric vs. character).
```r
# Run skim() on our data
skimr::skim(patient_data)
```

```
── Data Summary ────────────────────────
                           Values      
Name                       patient_data
Number of rows             1000        
Number of columns          7           
Column type frequency:                 
  character                2           
  numeric                  5           
Group variables            None        

── Variable type: character ─────────────────────────────────────────────────
  skim_variable n_missing complete_rate min max empty n_unique whitespace
1 Has_DM                0             1   2   3     0        2          0
2 Has_IHD               0             1   2   3     0        2          0

── Variable type: numeric ───────────────────────────────────────────────────
  skim_variable      n_missing complete_rate   mean    sd   p0    p25    p50    p75  p100 hist 
1 Age                        0             1  49.53 11.67 30.0  39.00  49.00  60.00  70.0 ▇▇▇▆▆
2 BMI                        0             1  24.25  3.36 13.6  22.10  24.30  26.60  36.4 ▁▅▇▃▁
3 WaistCircumference         0             1  69.98 12.59 26.1  61.68  70.00  78.53 121.8 ▁▅▇▂▁
4 Cholesterol                0             1 155.44 23.72 84.0 140.00 155.00 171.00 236.0 ▁▅▇▃▁
5 CVD_Risk_Score             0             1  16.92 14.90  0.0   5.60  11.95  24.50  77.0 ▇▃▂▁▁
```
How to Read This
Look at the skim() output. It’s fantastic!
It tells us we have 7 variables and 0 missing values.
It separates character variables (Has_DM, Has_IHD) from numeric variables.
For numeric data, it gives us key stats (mean, median, sd) and even a tiny hist (histogram) so we can see the shape of the distribution at a glance. We can see Age is fairly uniform, but CVD_Risk_Score is “right-skewed.”
For character data, it tells us there are 2 unique values for both, which is what we’d expect for “Yes/No” columns.
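You can confirm the skew that the tiny histograms hint at with nothing but base R: in a right-skewed variable the long tail pulls the mean well above the median. A quick sketch (the values in the comments come from the skim() output above; your copy of df.xlsx may differ slightly):

```r
# Quick base-R skew check: mean vs. median
mean(patient_data$CVD_Risk_Score)    # ~16.9 -- well above the median,
median(patient_data$CVD_Risk_Score)  # ~12.0    so right-skewed

mean(patient_data$Age)               # ~49.5 -- close to the median of 49,
median(patient_data$Age)             #          consistent with a fairly uniform shape
```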
3. A Quick Prep-Step: Converting to Factors
Our “5-second check-up” told us Has_DM and Has_IHD are “character” (text). For statistics and tables, it’s best practice to convert these into factors.
A factor tells R that a variable is categorical and gives it specific “levels” (e.g., “No” and “Yes”). This ensures our tables and plots use the correct order.
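To see why the levels = argument matters, here is a toy example (the risk vector below is made up for illustration, not from our dataset). By default R orders levels alphabetically, which is often not the order we mean:

```r
# Hypothetical toy vector, not from df.xlsx
risk <- c("Low", "High", "Medium", "Low")

# Default: alphabetical levels -- "High", "Low", "Medium" (not meaningful)
levels(factor(risk))

# Explicit levels give the order tables and plots will actually use
risk_f <- factor(risk, levels = c("Low", "Medium", "High"))
levels(risk_f)  # "Low" "Medium" "High"
table(risk_f)   # counts reported in that order
```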
We can do this easily with dplyr::mutate().
```r
# Create a new, "clean" dataframe
patient_data_clean <- patient_data |>
  mutate(
    # Convert Has_DM to a factor with "No" as the first level
    Has_DM = factor(Has_DM, levels = c("No", "Yes")),
    # Convert Has_IHD to a factor with "No" as the first level
    Has_IHD = factor(Has_IHD, levels = c("No", "Yes"))
  )

# Let's check the structure of our new 'clean' data
# We can use str() (structure) from base R for this
str(patient_data_clean)
```
Notice how Has_DM and Has_IHD are now listed as Factor w/ 2 levels "No","Yes"? This is perfect. We’ll use patient_data_clean from now on.
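Besides str(), you can spot-check a single column directly; a quick sketch:

```r
# Confirm the conversion column by column
class(patient_data_clean$Has_DM)   # "factor"
levels(patient_data_clean$Has_DM)  # "No" "Yes" -- "No" is the reference level
```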
4. The 5-Minute “Table 1”: gtsummary
This is the killer app for clinical researchers. The gtsummary package creates beautiful, publication-ready summary tables (“Table 1”) with one function: tbl_summary().
First, let’s get a basic summary of our whole cohort.
```r
# Create a basic summary table
patient_data_clean |>
  gtsummary::tbl_summary()
```

| Characteristic | N = 1,000¹ |
|---|---|
| Age | 49 (39, 60) |
| BMI | 24.3 (22.1, 26.6) |
| WaistCircumference | 70 (62, 79) |
| Cholesterol | 155 (140, 171) |
| CVD_Risk_Score | 12 (6, 25) |
| Has_DM | 187 (19%) |
| Has_IHD | 38 (3.8%) |

¹ Median (Q1, Q3); n (%)
That’s a beautiful table! Notice it’s smart:

- It gives median (Q1, Q3) for every numeric variable. That’s the default, and a safe choice for skewed data like CVD_Risk_Score. (If you prefer mean (SD), you can request it via the statistic = argument.)
- It gives n (%) for our factor variables.
- The footnote records exactly which statistics were used.
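For example, if a journal wants mean (SD) rather than the default median (Q1, Q3), tbl_summary() lets you override the statistic by variable type. A sketch (see ?tbl_summary for the full syntax):

```r
library(gtsummary)

# Sketch: report mean (SD) for every continuous variable
patient_data_clean |>
  tbl_summary(
    statistic = list(all_continuous() ~ "{mean} ({sd})"),
    digits = list(all_continuous() ~ 1)  # one decimal place
  )
```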
But the real magic is using the by = argument. Let’s create a “Table 1” that compares all variables by Diabetes status (Has_DM).
```r
# Create a summary table, grouped by Diabetes status
patient_data_clean |>
  gtsummary::tbl_summary(
    by = Has_DM  # The grouping variable
  ) |>
  add_p()  # Add a column of p-values!
```
| Characteristic | No, N = 813¹ | Yes, N = 187¹ | p-value² |
|---|---|---|---|
| Age | 48 (38, 59) | 57 (46, 63) | <0.001 |
| BMI | 23.7 (21.6, 25.9) | 26.7 (24.6, 28.6) | <0.001 |
| WaistCircumference | 68 (60, 76) | 80 (71, 87) | <0.001 |
| Cholesterol | 153 (139, 169) | 161 (149, 179) | <0.001 |
| CVD_Risk_Score | 11 (5, 22) | 22 (10, 36) | <0.001 |
| Has_IHD | 28 (3.4%) | 10 (5.3%) | 0.2 |

¹ Median (Q1, Q3); n (%)
² Wilcoxon rank sum test; Pearson’s Chi-squared test
This is the ‘Wow’ Moment
With one line of piped code, we just created a complete “Table 1” that:

1. Summarized all our variables.
2. Split the summary by our group of interest (Has_DM).
3. Ran an appropriate statistical test for each comparison (here, the defaults: Wilcoxon rank sum for numeric variables, Pearson’s Chi-squared for categorical).
4. Formatted it all in a table ready to be copied into a manuscript.
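The default tests are sensible, but add_p() also lets you swap them, e.g., if a variable is normally distributed or a reviewer asks for Fisher’s exact test with small cell counts. A hedged sketch (see ?add_p for the accepted test names):

```r
library(gtsummary)

# Sketch: override the default tests in add_p()
patient_data_clean |>
  tbl_summary(by = Has_DM) |>
  add_p(test = list(
    all_continuous()  ~ "t.test",      # t-test instead of Wilcoxon
    all_categorical() ~ "fisher.test"  # Fisher's exact instead of Chi-squared
  ))
```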
5. The 1-Minute Full Physical: DataExplorer
gtsummary is perfect for presenting your data. But DataExplorer is perfect for exploring it.
This package creates a full, automated HTML report that gives you a deep dive into every aspect of your data.
```r
# This command will create and open an HTML file
# We set eval=FALSE so it doesn't run *inside* this document
DataExplorer::create_report(
  data = patient_data_clean,
  output_file = "Patient_EDA_Report.html",
  y = "CVD_Risk_Score"  # We can tell it our main outcome variable
)
```
When you run the code above, R will generate an HTML file in your project folder. Open it in a web browser! You will see:
Data Structure: An overview of all variables.
Missing Data: A beautiful plot showing where (if any) missing data is.
Univariate Analysis: Histograms for every numeric variable and bar plots for every categorical variable.
Bivariate Analysis: A correlation heatmap (to see how Age, BMI, and Cholesterol relate) and scatter plots.
Principal Component Analysis (PCA): An advanced look at the relationships in your data.
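If you don’t need the whole report, DataExplorer also exposes the individual plots, so you can call just the pieces you want; a sketch of the most useful ones:

```r
library(DataExplorer)

# Individual pieces of the full report, on demand
plot_intro(patient_data_clean)        # rows, columns, and missingness at a glance
plot_missing(patient_data_clean)      # missing-data profile per variable
plot_histogram(patient_data_clean)    # histograms of all numeric variables
plot_bar(patient_data_clean)          # bar plots of all categorical variables
plot_correlation(patient_data_clean)  # correlation heatmap
```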
Conclusion
Exploratory Data Analysis doesn’t have to take days. By using a simple 3-step workflow, you can get a comprehensive understanding of your data in minutes:
skimr::skim() for a quick console overview.
gtsummary::tbl_summary() for a publication-ready “Table 1.”
DataExplorer::create_report() for a deep-dive automated HTML report.
Using these tools lets you move on to the more important part: thinking about what the data means.