In our dplyr session, we learned how to ask specific questions of our data. But what if we don’t know what to ask yet? We need to do an “Exploratory Data Analysis” (EDA)—an initial investigation to understand our data’s structure, find missing values, and discover patterns.
This used to be a very manual process. Now, R has incredible packages that can give us a comprehensive overview with just one or two lines of code.
Today, we’ll learn a 3-step workflow for rapid EDA:
The 5-Second Check-up: A quick look in the console.
The 5-Minute “Table 1”: The essential summary for any clinical or medical paper.
The 1-Minute Full Physical: A complete, automated HTML report.
Our Case Study: Clinical Patient Data
For this session, we’ll use a new dataset, df.xlsx. This is a synthetic (fake) dataset representing baseline characteristics for a group of clinical study participants.
The data contains the following columns:
Age: Patient’s age in years.
BMI: Body Mass Index.
WaistCircumference: Waist circumference in cm.
Cholesterol: Total cholesterol in mg/dL.
CVD_Risk_Score: A calculated cardiovascular disease risk score.
Has_DM: Categorical ("Yes"/"No") indicating if the patient has Diabetes Mellitus.
Has_IHD: Categorical ("Yes"/"No") indicating if the patient has Ischemic Heart Disease.
This is a classic dataset for which we’d want to generate a “Table 1” of baseline characteristics.
1. Setup and Data Import
First, we need to load all our packages for this session, including our “golden combo” of rio and here.
```r
# Load all our tools for today
library(rio)          # For the powerful `import()` function
library(here)         # For finding our files safely
library(dplyr)        # For data manipulation (we'll need `mutate`)
library(skimr)        # For the "5-second check-up"
library(gtsummary)    # For the "5-minute Table 1"
library(DataExplorer) # For the "1-minute Full Physical"
```
Now, let’s import our data. We’ll use rio::import() because it’s “smart” and can handle .xlsx files, .csv files, .sav files, and more with the same function.
```r
# Import the data
patient_data <- import(here("dataset", "df.xlsx"))

# Let's peek at the data
head(patient_data)
```

```
  Age  BMI WaistCircumference Cholesterol CVD_Risk_Score Has_DM Has_IHD
1  67 29.5               85.2         188           54.1     No      No
2  67 29.2               86.5         175           43.8    Yes      No
3  41 23.3               61.6         148            6.0     No      No
4  63 26.4               81.6         175           35.1     No      No
5  56 23.0               61.5         167           18.6     No      No
6  51 23.9               70.4         155           10.9     No      No
```
2. The 5-Second Check-up: skimr
The summary() function in base R is okay, but skimr::skim() is far better. It gives you a rich, text-based summary in your console and intelligently groups variables by type (e.g., numeric vs. character).
```r
# Run skim() on our data
skimr::skim(patient_data)
```

```
── Data Summary ────────────────────────
                           Values      
Name                       patient_data
Number of rows             1000        
Number of columns          7           
Column type frequency:                 
  character                2           
  numeric                  5           
Group variables            None        

── Variable type: character ─────────────────────────────────────────────────
  skim_variable n_missing complete_rate min max empty n_unique whitespace
1 Has_DM                0             1   2   3     0        2          0
2 Has_IHD               0             1   2   3     0        2          0

── Variable type: numeric ───────────────────────────────────────────────────
  skim_variable      n_missing complete_rate   mean    sd   p0    p25    p50    p75  p100 hist 
1 Age                        0             1  49.53 11.67 30.0  39.00  49.00  60.00  70.0 ▇▇▇▆▆
2 BMI                        0             1  24.25  3.36 13.6  22.10  24.30  26.60  36.4 ▁▅▇▃▁
3 WaistCircumference         0             1  69.98 12.59 26.1  61.68  70.00  78.53 121.8 ▁▅▇▂▁
4 Cholesterol                0             1 155.44 23.72 84.0 140.00 155.00 171.00 236.0 ▁▅▇▃▁
5 CVD_Risk_Score             0             1  16.92 14.90  0.0   5.60  11.95  24.50  77.0 ▇▃▂▁▁
```
How to Read This
Look at the skim() output. It’s fantastic!
It tells us we have 7 variables and 0 missing values.
It separates character variables (Has_DM, Has_IHD) from numeric variables.
For numeric data, it gives us key stats (mean, median, sd) and even a tiny hist (histogram) so we can see the shape of the distribution at a glance. We can see Age is fairly uniform, but CVD_Risk_Score is “right-skewed.”
For character data, it tells us there are 2 unique values for both, which is what we’d expect for “Yes/No” columns.
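You can confirm the skew that the tiny histograms hint at with nothing but base R: in a right-skewed variable the long tail pulls the mean well above the median. A quick sketch (the values in the comments come from the skim() output above; your copy of df.xlsx may differ slightly):

```r
# Quick base-R skew check: mean vs. median
mean(patient_data$CVD_Risk_Score)    # ~16.9 -- well above the median,
median(patient_data$CVD_Risk_Score)  # ~12.0    so right-skewed

mean(patient_data$Age)               # ~49.5 -- close to the median of 49,
median(patient_data$Age)             #          consistent with a fairly uniform shape
```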
3. A Quick Prep-Step: Converting to Factors
Our “5-second check-up” told us Has_DM and Has_IHD are “character” (text). For statistics and tables, it’s best practice to convert these into factors.
A factor tells R that a variable is categorical and gives it specific “levels” (e.g., “No” and “Yes”). This ensures our tables and plots use the correct order.
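To see why the levels = argument matters, here is a toy example (the risk vector below is made up for illustration, not from our dataset). By default R orders levels alphabetically, which is often not the order we mean:

```r
# Hypothetical toy vector, not from df.xlsx
risk <- c("Low", "High", "Medium", "Low")

# Default: alphabetical levels -- "High", "Low", "Medium" (not meaningful)
levels(factor(risk))

# Explicit levels give the order tables and plots will actually use
risk_f <- factor(risk, levels = c("Low", "Medium", "High"))
levels(risk_f)  # "Low" "Medium" "High"
table(risk_f)   # counts reported in that order
```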
We can do this easily with dplyr::mutate().
```r
# Create a new, "clean" dataframe
patient_data_clean <- patient_data |>
  mutate(
    # Convert Has_DM to a factor with "No" as the first level
    Has_DM = factor(Has_DM, levels = c("No", "Yes")),
    # Convert Has_IHD to a factor with "No" as the first level
    Has_IHD = factor(Has_IHD, levels = c("No", "Yes"))
  )

# Let's check the structure of our new 'clean' data
# We can use str() (structure) from base R for this
str(patient_data_clean)
```
Notice how Has_DM and Has_IHD are now listed as Factor w/ 2 levels "No","Yes"? This is perfect. We’ll use patient_data_clean from now on.
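Besides str(), you can spot-check a single column directly; a quick sketch:

```r
# Confirm the conversion column by column
class(patient_data_clean$Has_DM)   # "factor"
levels(patient_data_clean$Has_DM)  # "No" "Yes" -- "No" is the reference level
```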
4. The 5-Minute “Table 1”: gtsummary
This is the killer app for clinical researchers. The gtsummary package creates beautiful, publication-ready summary tables (“Table 1”) with one function: tbl_summary().
First, let’s get a basic summary of our whole cohort.
```r
# Create a basic summary table
patient_data_clean |>
  gtsummary::tbl_summary()
```

| Characteristic | N = 1,000¹ |
|---|---|
| Age | 49 (39, 60) |
| BMI | 24.3 (22.1, 26.6) |
| WaistCircumference | 70 (62, 79) |
| Cholesterol | 155 (140, 171) |
| CVD_Risk_Score | 12 (6, 25) |
| Has_DM | 187 (19%) |
| Has_IHD | 38 (3.8%) |

¹ Median (Q1, Q3); n (%)
That’s a beautiful table! Notice it’s smart:

- It gives median (Q1, Q3) for every numeric variable. That’s the default, and a safe choice for skewed data like CVD_Risk_Score. (If you prefer mean (SD), you can request it via the statistic = argument.)
- It gives n (%) for our factor variables.
- The footnote records exactly which statistics were used.
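For example, if a journal wants mean (SD) rather than the default median (Q1, Q3), tbl_summary() lets you override the statistic by variable type. A sketch (see ?tbl_summary for the full syntax):

```r
library(gtsummary)

# Sketch: report mean (SD) for every continuous variable
patient_data_clean |>
  tbl_summary(
    statistic = list(all_continuous() ~ "{mean} ({sd})"),
    digits = list(all_continuous() ~ 1)  # one decimal place
  )
```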
But the real magic is using the by = argument. Let’s create a “Table 1” that compares all variables by Diabetes status (Has_DM).
```r
# Create a summary table, grouped by Diabetes status
patient_data_clean |>
  gtsummary::tbl_summary(
    by = Has_DM  # The grouping variable
  ) |>
  add_p()  # Add a column of p-values!
```
| Characteristic | No, N = 813¹ | Yes, N = 187¹ | p-value² |
|---|---|---|---|
| Age | 48 (38, 59) | 57 (46, 63) | <0.001 |
| BMI | 23.7 (21.6, 25.9) | 26.7 (24.6, 28.6) | <0.001 |
| WaistCircumference | 68 (60, 76) | 80 (71, 87) | <0.001 |
| Cholesterol | 153 (139, 169) | 161 (149, 179) | <0.001 |
| CVD_Risk_Score | 11 (5, 22) | 22 (10, 36) | <0.001 |
| Has_IHD | 28 (3.4%) | 10 (5.3%) | 0.2 |

¹ Median (Q1, Q3); n (%)
² Wilcoxon rank sum test; Pearson’s Chi-squared test
This is the ‘Wow’ Moment
With one line of piped code, we just created a complete “Table 1” that:

1. Summarized all our variables.
2. Split the summary by our group of interest (Has_DM).
3. Ran an appropriate statistical test for each comparison (here, the defaults: Wilcoxon rank sum for numeric variables, Pearson’s Chi-squared for categorical).
4. Formatted it all in a table ready to be copied into a manuscript.
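The default tests are sensible, but add_p() also lets you swap them, e.g., if a variable is normally distributed or a reviewer asks for Fisher’s exact test with small cell counts. A hedged sketch (see ?add_p for the accepted test names):

```r
library(gtsummary)

# Sketch: override the default tests in add_p()
patient_data_clean |>
  tbl_summary(by = Has_DM) |>
  add_p(test = list(
    all_continuous()  ~ "t.test",      # t-test instead of Wilcoxon
    all_categorical() ~ "fisher.test"  # Fisher's exact instead of Chi-squared
  ))
```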
5. The 1-Minute Full Physical: DataExplorer
gtsummary is perfect for presenting your data. But DataExplorer is perfect for exploring it.
This package creates a full, automated HTML report that gives you a deep dive into every aspect of your data.
```r
# This command will create and open an HTML file
# We set eval=FALSE so it doesn't run *inside* this document
DataExplorer::create_report(
  data = patient_data_clean,
  output_file = "Patient_EDA_Report.html",
  y = "CVD_Risk_Score"  # We can tell it our main outcome variable
)
```
When you run the code above, R will generate an HTML file in your project folder. Open it in a web browser! You will see:
Data Structure: An overview of all variables.
Missing Data: A beautiful plot showing where (if any) missing data is.
Univariate Analysis: Histograms for every numeric variable and bar plots for every categorical variable.
Bivariate Analysis: A correlation heatmap (to see how Age, BMI, and Cholesterol relate) and scatter plots.
Principal Component Analysis (PCA): An advanced look at the relationships in your data.
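If you don’t need the whole report, DataExplorer also exposes the individual plots, so you can call just the pieces you want; a sketch of the most useful ones:

```r
library(DataExplorer)

# Individual pieces of the full report, on demand
plot_intro(patient_data_clean)        # rows, columns, and missingness at a glance
plot_missing(patient_data_clean)      # missing-data profile per variable
plot_histogram(patient_data_clean)    # histograms of all numeric variables
plot_bar(patient_data_clean)          # bar plots of all categorical variables
plot_correlation(patient_data_clean)  # correlation heatmap
```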
Conclusion
Exploratory Data Analysis doesn’t have to take days. By using a simple 3-step workflow, you can get a comprehensive understanding of your data in minutes:
skimr::skim() for a quick console overview.
gtsummary::tbl_summary() for a publication-ready “Table 1.”
DataExplorer::create_report() for a deep-dive automated HTML report.
Using these tools lets you move on to the more important part: thinking about what the data means.