Quick Exploratory Data Analysis

Minimal Code, Maximum Insight

Introduction

In our dplyr session, we learned how to ask specific questions of our data. But what if we don’t know what to ask yet? We need to do an “Exploratory Data Analysis” (EDA)—an initial investigation to understand our data’s structure, find missing values, and discover patterns.

This used to be a very manual process. Now, R has incredible packages that can give us a comprehensive overview with just one or two lines of code.

Today, we’ll learn a 3-step workflow for rapid EDA:

  1. The 5-Second Check-up: A quick look in the console.
  2. The 5-Minute “Table 1”: The essential summary for any clinical or medical paper.
  3. The 1-Minute Full Physical: A complete, automated HTML report.

Our Case Study: Clinical Patient Data

For this session, we’ll use a new dataset, df.xlsx. This is a synthetic (fake) dataset representing baseline characteristics for a group of clinical study participants.

The data contains the following columns:

  • Age: Patient’s age in years.
  • BMI: Body Mass Index.
  • WaistCircumference: Waist circumference in cm.
  • Cholesterol: Total cholesterol in mg/dL.
  • CVD_Risk_Score: A calculated cardiovascular disease risk score.
  • Has_DM: Categorical ("Yes"/"No") indicating if the patient has Diabetes Mellitus.
  • Has_IHD: Categorical ("Yes"/"No") indicating if the patient has Ischemic Heart Disease.

This is a classic dataset for which we’d want to generate a “Table 1” of baseline characteristics.
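Because the dataset is synthetic, it helps to see how such fake clinical data can be simulated. This is a hypothetical sketch of how one *might* generate a comparable toy dataset, not how df.xlsx was actually built:

```r
set.seed(2025)  # for reproducibility; these distributions are illustrative guesses
n <- 1000

fake_patients <- data.frame(
  Age    = round(runif(n, 30, 70)),                 # roughly uniform ages
  BMI    = round(rnorm(n, mean = 24, sd = 3.4), 1), # roughly normal BMI
  Has_DM = sample(c("No", "Yes"), n, replace = TRUE, prob = c(0.8, 0.2))
)

head(fake_patients)
```

Simulated data like this is handy for practising a workflow before touching real (and confidential) patient records.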

1. Setup and Data Import

First, we need to load all our packages for this session, including our “golden combo” of rio and here.

# Load all our tools for today
library(rio)             # For the powerful `import()` function
library(here)            # For finding our files safely
here() starts at /Users/drpakhare/Dropbox/R Workshop 2025 AIIMS Bhopal
library(dplyr)           # For data manipulation (we'll need `mutate`)

library(skimr)           # For the "5-second check-up"
library(gtsummary)       # For the "5-minute Table 1"
library(DataExplorer)    # For the "1-minute Full Physical"

Now, let’s import our data. We’ll use rio::import() because it’s “smart” and can handle .xlsx files, .csv files, .sav files, and more with the same function.

# Import the data
patient_data <- import(here("dataset", "df.xlsx"))

# Let's peek at the data
head(patient_data)
  Age  BMI WaistCircumference Cholesterol CVD_Risk_Score Has_DM Has_IHD
1  67 29.5               85.2         188           54.1     No      No
2  67 29.2               86.5         175           43.8    Yes      No
3  41 23.3               61.6         148            6.0     No      No
4  63 26.4               81.6         175           35.1     No      No
5  56 23.0               61.5         167           18.6     No      No
6  51 23.9               70.4         155           10.9     No      No

2. The 5-Second Check-up: skimr

The summary() function in base R is okay, but skimr::skim() is far better. It gives you a rich, text-based summary in your console and intelligently groups variables by type (e.g., numeric vs. character).

# Run skim() on our data
skimr::skim(patient_data)
Data summary
Name patient_data
Number of rows 1000
Number of columns 7
_______________________
Column type frequency:
character 2
numeric 5
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
Has_DM 0 1 2 3 0 2 0
Has_IHD 0 1 2 3 0 2 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Age 0 1 49.53 11.67 30.0 39.00 49.00 60.00 70.0 ▇▇▇▆▆
BMI 0 1 24.25 3.36 13.6 22.10 24.30 26.60 36.4 ▁▅▇▃▁
WaistCircumference 0 1 69.98 12.59 26.1 61.68 70.00 78.53 121.8 ▁▅▇▂▁
Cholesterol 0 1 155.44 23.72 84.0 140.00 155.00 171.00 236.0 ▁▅▇▃▁
CVD_Risk_Score 0 1 16.92 14.90 0.0 5.60 11.95 24.50 77.0 ▇▃▂▁▁

How to Read This

Look at the skim() output. It’s fantastic!

  • It tells us we have 7 variables and 0 missing values.
  • It separates character variables (Has_DM, Has_IHD) from numeric variables.
  • For numeric data, it gives us key stats (mean, median, sd) and even a tiny hist (histogram) so we can see the shape of the distribution at a glance. We can see Age is fairly uniform, but CVD_Risk_Score is “right-skewed.”
  • For character data, it tells us there are 2 unique values for both, which is what we’d expect for “Yes/No” columns.
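`skim()` is also flexible: it accepts tidyselect-style column selections, and it respects `dplyr::group_by()`. A quick sketch of both:

```r
library(dplyr)
library(skimr)

# Skim only the variables you care about
patient_data |>
  skim(Age, BMI, CVD_Risk_Score)

# Skim within subgroups: group_by() first, then skim()
# produces one row of statistics per group
patient_data |>
  group_by(Has_DM) |>
  skim(CVD_Risk_Score)
```

The grouped version is a fast way to eyeball whether a variable differs between groups before building a formal table.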

3. A Quick Prep-Step: Converting to Factors

Our “5-second check-up” told us Has_DM and Has_IHD are “character” (text). For statistics and tables, it’s best practice to convert these into factors.

A factor tells R that a variable is categorical and gives it specific “levels” (e.g., “No” and “Yes”). This ensures our tables and plots use the correct order.

We can do this easily with dplyr::mutate().

# Create a new, "clean" dataframe
patient_data_clean <- patient_data |>
  mutate(
    # Convert Has_DM to a factor with "No" as the first level
    Has_DM = factor(Has_DM, levels = c("No", "Yes")),
    
    # Convert Has_IHD to a factor with "No" as the first level
    Has_IHD = factor(Has_IHD, levels = c("No", "Yes"))
  )

# Let's check the structure of our new 'clean' data
# We can use str() (structure) from base R for this
str(patient_data_clean)
'data.frame':   1000 obs. of  7 variables:
 $ Age               : num  67 67 41 63 56 51 59 35 56 58 ...
 $ BMI               : num  29.5 29.2 23.3 26.4 23 23.9 22.4 19.7 21.5 27.9 ...
 $ WaistCircumference: num  85.2 86.5 61.6 81.6 61.5 70.4 67.8 48 65.4 72.1 ...
 $ Cholesterol       : num  188 175 148 175 167 155 163 155 165 195 ...
 $ CVD_Risk_Score    : num  54.1 43.8 6 35.1 18.6 10.9 20.9 5.3 15.7 41.3 ...
 $ Has_DM            : Factor w/ 2 levels "No","Yes": 1 2 1 1 1 1 1 1 1 1 ...
 $ Has_IHD           : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...

Notice how Has_DM and Has_IHD are now listed as Factor w/ 2 levels "No","Yes"? This is perfect. We’ll use patient_data_clean from now on.
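When there are many “Yes”/“No” columns, repeating the same `factor()` call gets tedious. An equivalent, more compact version uses `dplyr::across()` to apply one conversion to several columns at once:

```r
library(dplyr)

# Same result as the explicit version above, but one rule for both columns
patient_data_clean <- patient_data |>
  mutate(
    across(c(Has_DM, Has_IHD), ~ factor(.x, levels = c("No", "Yes")))
  )
```

With dozens of binary variables, `across()` with a selection helper like `starts_with("Has_")` scales much better than writing each conversion by hand.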

4. The 5-Minute “Table 1”: gtsummary

This is the killer app for clinical researchers. The gtsummary package creates beautiful, publication-ready summary tables (“Table 1”) with one function: tbl_summary().

First, let’s get a basic summary of our whole cohort.

# Create a basic summary table
patient_data_clean |>
  gtsummary::tbl_summary()
Characteristic         N = 1,000¹
─────────────────────────────────────
Age                    49 (39, 60)
BMI                    24.3 (22.1, 26.6)
WaistCircumference     70 (62, 79)
Cholesterol            155 (140, 171)
CVD_Risk_Score         12 (6, 25)
Has_DM                 187 (19%)
Has_IHD                38 (3.8%)
─────────────────────────────────────
¹ Median (Q1, Q3); n (%)

That’s a beautiful table! Notice the smart defaults:

  • Numeric variables are summarized as median (Q1, Q3), which is robust to skewed data like CVD_Risk_Score. (If you prefer mean (SD) for a variable, you can request it via the statistic argument.)
  • Factor variables are summarized as n (%).
  • The footnote documenting which statistics were used is generated automatically.
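The defaults are easy to override. This sketch shows a few common customizations; the display labels are our own choices, not anything stored in the data:

```r
library(gtsummary)

patient_data_clean |>
  tbl_summary(
    # Report mean (SD) instead of median (IQR) for all continuous variables
    statistic = all_continuous() ~ "{mean} ({sd})",
    # One decimal place for continuous summaries
    digits = all_continuous() ~ 1,
    # Reader-friendly labels (hypothetical wording we chose ourselves)
    label = list(
      Has_DM  ~ "Diabetes Mellitus",
      Has_IHD ~ "Ischemic Heart Disease"
    )
  )
```

The `variable ~ value` formula syntax is used throughout gtsummary, so learning it here pays off for `add_p()`, `label`, and friends as well.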

But the real magic is using the by = argument. Let’s create a “Table 1” that compares all variables by Diabetes status (Has_DM).

# Create a summary table, grouped by Diabetes status
patient_data_clean |>
  gtsummary::tbl_summary(
    by = Has_DM  # The grouping variable
  ) |>
  add_p() # Add a column of p-values!
Characteristic       No, N = 813¹        Yes, N = 187¹       p-value²
─────────────────────────────────────────────────────────────────────
Age                  48 (38, 59)         57 (46, 63)         <0.001
BMI                  23.7 (21.6, 25.9)   26.7 (24.6, 28.6)   <0.001
WaistCircumference   68 (60, 76)         80 (71, 87)         <0.001
Cholesterol          153 (139, 169)      161 (149, 179)      <0.001
CVD_Risk_Score       11 (5, 22)          22 (10, 36)         <0.001
Has_IHD              28 (3.4%)           10 (5.3%)           0.2
─────────────────────────────────────────────────────────────────────
¹ Median (Q1, Q3); n (%)
² Wilcoxon rank sum test; Pearson’s Chi-squared test

This is the ‘Wow’ Moment

With one line of piped code, we just created a complete “Table 1” that:

  1. Summarized all our variables.
  2. Split the summary by our group of interest (Has_DM).
  3. Ran an appropriate statistical test for each comparison (Wilcoxon rank-sum for numeric variables, Pearson’s Chi-squared for categorical ones, as the footnote reports).
  4. Formatted it all in a table ready to be copied into a manuscript.
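The table can be polished further by chaining more gtsummary helpers onto the same pipe. A sketch of some common additions (the spanning-header text is our own wording):

```r
library(gtsummary)

patient_data_clean |>
  tbl_summary(by = Has_DM) |>
  add_p() |>
  add_overall() |>   # extra column summarizing the whole cohort
  bold_labels() |>   # bold the variable names for readability
  # Label the grouped columns; the header text here is illustrative
  modify_spanning_header(all_stat_cols() ~ "**Diabetes Mellitus**")
```

Each helper returns a modified table object, so you can keep piping until the table matches your journal’s requirements.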

5. The 1-Minute Full Physical: DataExplorer

gtsummary is perfect for presenting your data. But DataExplorer is perfect for exploring it.

This package creates a full, automated HTML report that gives you a deep dive into every aspect of your data.

# This command will create and open an HTML file
# We set eval=FALSE so it doesn't run *inside* this document
DataExplorer::create_report(
  data = patient_data_clean,
  output_file = "Patient_EDA_Report.html",
  y = "CVD_Risk_Score" # We can tell it our main outcome variable
)

When you run the code above, R will generate an HTML file in your project folder. Open it in a web browser! You will see:

  • Data Structure: An overview of all variables.
  • Missing Data: A beautiful plot showing where (if any) missing data is.
  • Univariate Analysis: Histograms for every numeric variable and bar plots for every categorical variable.
  • Bivariate Analysis: A correlation heatmap (to see how Age, BMI, and Cholesterol relate) and scatter plots.
  • Principal Component Analysis (PCA): An advanced look at the relationships in your data.
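The full report is assembled from individual plotting functions that you can also call one at a time, which is often faster when you only need a single view:

```r
library(DataExplorer)

plot_intro(patient_data_clean)        # high-level overview of the data's structure
plot_missing(patient_data_clean)      # missingness per variable
plot_histogram(patient_data_clean)    # one histogram per numeric variable
plot_bar(patient_data_clean)          # bar plots for the categorical variables
plot_correlation(patient_data_clean)  # correlation heatmap
```

Calling these directly in the console is a good habit when you want one specific plot without regenerating the whole HTML report.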

Conclusion

Exploratory Data Analysis doesn’t have to take days. By using a simple 3-step workflow, you can get a comprehensive understanding of your data in minutes:

  1. skimr::skim() for a quick console overview.
  2. gtsummary::tbl_summary() for a publication-ready “Table 1.”
  3. DataExplorer::create_report() for a deep-dive automated HTML report.

Using these tools lets you move on to the more important part: thinking about what the data means.