Data Wrangling with R — Beginner’s Tutorial

Author

CEU, AIIMS Bhopal

Published

November 1, 2025

Data Wrangling-

The process of removing errors and combining complex data sets to make them more accessible and easier to analyze. # Tidyverse package There are eight core Tidyverse packages namely ggplot2, dplyr, tidyr, readr, purrr, tibble, stringr, and forcats .an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures. Easier to read than base r

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.2
✔ ggplot2   4.0.0     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.1.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
set.seed(1234)
df <- data.frame(
  Previous_status=factor(rep(c("Diabetes", "Non-Diabetes"), each=200)),
  FBS=round(c(rnorm(200, mean=160, sd=20),
                 rnorm(200, mean=100, sd=20))),
  BMI=round(c(rnorm(200,mean=32,sd=8),
              rnorm(200,mean=30.5,sd=7)),1),
  HbA1c=round(c(rnorm(200, mean=10.60, sd=1.5),
rnorm(200, mean=6.1, sd=0.5)) ,1),
Smoking=rbinom(n=400,size=1,prob=0.30),
Gender=rbinom(n=400,size = 1,prob=0.45)
)
df$Gender<-as.factor(df$Gender)
levels(df$Gender)<-c("Female","Male")
df$Smoking<-as.factor(df$Smoking)
levels(df$Smoking)<-c("Non-Smoker","Smoker")
df %>% mutate(BMI_Cat=factor(case_when(BMI>30~"obese",
                                BMI<22~"Not-obese",
                                TRUE~"Pre-obese")))->df
df %>% mutate(serum_creat=round(rnorm(400, mean=1.2, sd=0.2),1))->df
df %>% mutate(diet=factor(rep(c("Vegetarian","NonVegetarian"),each=200)))->df
df %>% mutate( ID = paste(row_number(),"XYZ",sep = "-"))->df

str(df)
'data.frame':   400 obs. of  10 variables:
 $ Previous_status: Factor w/ 2 levels "Diabetes","Non-Diabetes": 1 1 1 1 1 1 1 1 1 1 ...
 $ FBS            : num  136 166 182 113 169 170 149 149 149 142 ...
 $ BMI            : num  22.2 32.3 28.6 24.8 35.3 33.2 43.7 23 27.9 31.4 ...
 $ HbA1c          : num  9.1 8.5 10.5 13.3 10.5 11.8 8.9 10.3 11.4 10.1 ...
 $ Smoking        : Factor w/ 2 levels "Non-Smoker","Smoker": 1 1 2 1 2 2 1 2 2 2 ...
 $ Gender         : Factor w/ 2 levels "Female","Male": 1 1 1 1 2 2 1 1 2 2 ...
 $ BMI_Cat        : Factor w/ 3 levels "Not-obese","obese",..: 3 2 3 3 2 2 2 3 3 2 ...
 $ serum_creat    : num  1.2 1 1.2 1.5 1.1 1.2 1.4 0.9 0.9 1 ...
 $ diet           : Factor w/ 2 levels "NonVegetarian",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ ID             : chr  "1-XYZ" "2-XYZ" "3-XYZ" "4-XYZ" ...
summary(df)
     Previous_status      FBS             BMI            HbA1c      
 Diabetes    :200    Min.   : 32.0   Min.   : 9.10   Min.   : 4.60  
 Non-Diabetes:200    1st Qu.:103.0   1st Qu.:25.75   1st Qu.: 6.10  
                     Median :132.0   Median :30.70   Median : 7.00  
                     Mean   :130.1   Mean   :30.78   Mean   : 8.32  
                     3rd Qu.:157.0   3rd Qu.:35.52   3rd Qu.:10.50  
                     Max.   :221.0   Max.   :57.60   Max.   :14.30  
       Smoking       Gender         BMI_Cat     serum_creat   
 Non-Smoker:272   Female:222   Not-obese: 41   Min.   :0.600  
 Smoker    :128   Male  :178   obese    :210   1st Qu.:1.100  
                               Pre-obese:149   Median :1.200  
                                               Mean   :1.204  
                                               3rd Qu.:1.300  
                                               Max.   :1.800  
            diet          ID           
 NonVegetarian:200   Length:400        
 Vegetarian   :200   Class :character  
                     Mode  :character  
                                       
                                       
                                       
head(df)
  Previous_status FBS  BMI HbA1c    Smoking Gender   BMI_Cat serum_creat
1        Diabetes 136 22.2   9.1 Non-Smoker Female Pre-obese         1.2
2        Diabetes 166 32.3   8.5 Non-Smoker Female     obese         1.0
3        Diabetes 182 28.6  10.5     Smoker Female Pre-obese         1.2
4        Diabetes 113 24.8  13.3 Non-Smoker Female Pre-obese         1.5
5        Diabetes 169 35.3  10.5     Smoker   Male     obese         1.1
6        Diabetes 170 33.2  11.8     Smoker   Male     obese         1.2
        diet    ID
1 Vegetarian 1-XYZ
2 Vegetarian 2-XYZ
3 Vegetarian 3-XYZ
4 Vegetarian 4-XYZ
5 Vegetarian 5-XYZ
6 Vegetarian 6-XYZ

TidyData format-1Each variable is saved in its own coloumn 2 Each observation is saved in its own row

Useful for the r vectorize operation

Reshaping data set

pivot_longer() “lengthens” data, increasing the number of rows and decreasing the number of columns

pivot_wider() “widens” data, increasing the number of columns and decreasing the number of rows

dim(df)
[1] 400  10
names(df)
 [1] "Previous_status" "FBS"             "BMI"             "HbA1c"          
 [5] "Smoking"         "Gender"          "BMI_Cat"         "serum_creat"    
 [9] "diet"            "ID"             
glimpse(df)
Rows: 400
Columns: 10
$ Previous_status <fct> Diabetes, Diabetes, Diabetes, Diabetes, Diabetes, Diab…
$ FBS             <dbl> 136, 166, 182, 113, 169, 170, 149, 149, 149, 142, 150,…
$ BMI             <dbl> 22.2, 32.3, 28.6, 24.8, 35.3, 33.2, 43.7, 23.0, 27.9, …
$ HbA1c           <dbl> 9.1, 8.5, 10.5, 13.3, 10.5, 11.8, 8.9, 10.3, 11.4, 10.…
$ Smoking         <fct> Non-Smoker, Non-Smoker, Smoker, Non-Smoker, Smoker, Sm…
$ Gender          <fct> Female, Female, Female, Female, Male, Male, Female, Fe…
$ BMI_Cat         <fct> Pre-obese, obese, Pre-obese, Pre-obese, obese, obese, …
$ serum_creat     <dbl> 1.2, 1.0, 1.2, 1.5, 1.1, 1.2, 1.4, 0.9, 0.9, 1.0, 1.3,…
$ diet            <fct> Vegetarian, Vegetarian, Vegetarian, Vegetarian, Vegeta…
$ ID              <chr> "1-XYZ", "2-XYZ", "3-XYZ", "4-XYZ", "5-XYZ", "6-XYZ", …
str(df)
'data.frame':   400 obs. of  10 variables:
 $ Previous_status: Factor w/ 2 levels "Diabetes","Non-Diabetes": 1 1 1 1 1 1 1 1 1 1 ...
 $ FBS            : num  136 166 182 113 169 170 149 149 149 142 ...
 $ BMI            : num  22.2 32.3 28.6 24.8 35.3 33.2 43.7 23 27.9 31.4 ...
 $ HbA1c          : num  9.1 8.5 10.5 13.3 10.5 11.8 8.9 10.3 11.4 10.1 ...
 $ Smoking        : Factor w/ 2 levels "Non-Smoker","Smoker": 1 1 2 1 2 2 1 2 2 2 ...
 $ Gender         : Factor w/ 2 levels "Female","Male": 1 1 1 1 2 2 1 1 2 2 ...
 $ BMI_Cat        : Factor w/ 3 levels "Not-obese","obese",..: 3 2 3 3 2 2 2 3 3 2 ...
 $ serum_creat    : num  1.2 1 1.2 1.5 1.1 1.2 1.4 0.9 0.9 1 ...
 $ diet           : Factor w/ 2 levels "NonVegetarian",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ ID             : chr  "1-XYZ" "2-XYZ" "3-XYZ" "4-XYZ" ...
summary(df)
     Previous_status      FBS             BMI            HbA1c      
 Diabetes    :200    Min.   : 32.0   Min.   : 9.10   Min.   : 4.60  
 Non-Diabetes:200    1st Qu.:103.0   1st Qu.:25.75   1st Qu.: 6.10  
                     Median :132.0   Median :30.70   Median : 7.00  
                     Mean   :130.1   Mean   :30.78   Mean   : 8.32  
                     3rd Qu.:157.0   3rd Qu.:35.52   3rd Qu.:10.50  
                     Max.   :221.0   Max.   :57.60   Max.   :14.30  
       Smoking       Gender         BMI_Cat     serum_creat   
 Non-Smoker:272   Female:222   Not-obese: 41   Min.   :0.600  
 Smoker    :128   Male  :178   obese    :210   1st Qu.:1.100  
                               Pre-obese:149   Median :1.200  
                                               Mean   :1.204  
                                               3rd Qu.:1.300  
                                               Max.   :1.800  
            diet          ID           
 NonVegetarian:200   Length:400        
 Vegetarian   :200   Class :character  
                     Mode  :character  
                                       
                                       
                                       
df.l<- pivot_longer(data=df,cols=c("Smoking","diet","Previous_status", "Gender" ) , names_to=" variable_name",values_to="status" )
df.w<- pivot_wider(data=df.l, names_from=" variable_name",values_from="status" )
##This removes leading and trailing spaces from all column names.
#Then your original code will work fine:
names(df.l) <- trimws(names(df.l))
df.w1<- pivot_wider(data=df.l,id_cols=!serum_creat ,names_from="variable_name" ,values_from="status" )

Dplyr action verbs-arrange/rename

names(df)
 [1] "Previous_status" "FBS"             "BMI"             "HbA1c"          
 [5] "Smoking"         "Gender"          "BMI_Cat"         "serum_creat"    
 [9] "diet"            "ID"             
df %>% arrange(HbA1c)->df.h
df %>% arrange(desc(HbA1c))->df.h
df %>% arrange(HbA1c,serum_creat)->df.h1## Order rows by values of a column(low to high). 
df.h1 %>% rename(Smoking_Status=Smoking)->df.h1
colnames(df.h1)
 [1] "Previous_status" "FBS"             "BMI"             "HbA1c"          
 [5] "Smoking_Status"  "Gender"          "BMI_Cat"         "serum_creat"    
 [9] "diet"            "ID"             

subset observation(dplyr action verb= filter/slice for row and dplyr action verb= select for col )

df %>% filter(  HbA1c> 10)->df.r
df %>% filter(  HbA1c> 10 & Gender=="Female")->df.r1
#Rows with minimum and maximum values of a variable
 df%>% slice_min(HbA1c, n = 3)
  Previous_status FBS  BMI HbA1c    Smoking Gender   BMI_Cat serum_creat
1    Non-Diabetes 103 16.5   4.6 Non-Smoker Female Not-obese         1.1
2    Non-Diabetes  88 31.7   4.9 Non-Smoker   Male     obese         1.2
3    Non-Diabetes 117 29.7   4.9     Smoker   Male Pre-obese         1.2
           diet      ID
1 NonVegetarian 321-XYZ
2 NonVegetarian 218-XYZ
3 NonVegetarian 385-XYZ
df%>% slice_max(BMI, n = 4)
  Previous_status FBS  BMI HbA1c    Smoking Gender BMI_Cat serum_creat
1        Diabetes 174 57.6  12.0 Non-Smoker Female   obese         1.3
2        Diabetes 194 49.9  11.7     Smoker   Male   obese         1.2
3        Diabetes 137 49.8  10.8     Smoker Female   obese         1.1
4    Non-Diabetes  88 49.4   6.1 Non-Smoker   Male   obese         1.1
           diet      ID
1    Vegetarian  86-XYZ
2    Vegetarian  93-XYZ
3    Vegetarian  97-XYZ
4 NonVegetarian 371-XYZ
# Rows can be dropped with negative indices:
df.rx<-slice(df, -(1:9))
#Randomly select fraction of rows.
df %>% slice_sample( n = 12)
   Previous_status FBS  BMI HbA1c    Smoking Gender   BMI_Cat serum_creat
1         Diabetes 150 41.2  10.8 Non-Smoker Female     obese         1.4
2     Non-Diabetes  88 49.4   6.1 Non-Smoker   Male     obese         1.1
3         Diabetes 193 20.2  10.8 Non-Smoker Female Not-obese         1.0
4     Non-Diabetes  90 32.8   6.4 Non-Smoker Female     obese         1.3
5         Diabetes 150 31.1  11.5 Non-Smoker Female     obese         1.0
6         Diabetes 157 36.4  11.7 Non-Smoker   Male     obese         1.1
7     Non-Diabetes  99 31.5   6.0 Non-Smoker Female     obese         1.4
8     Non-Diabetes 108 41.9   6.0 Non-Smoker   Male     obese         1.2
9         Diabetes 151 22.2  10.0 Non-Smoker   Male Pre-obese         1.5
10    Non-Diabetes 106 37.0   6.1 Non-Smoker Female     obese         1.2
11        Diabetes 156 32.9  10.6     Smoker Female     obese         1.3
12        Diabetes 211 25.6  11.4 Non-Smoker   Male Pre-obese         1.3
            diet      ID
1     Vegetarian 188-XYZ
2  NonVegetarian 371-XYZ
3     Vegetarian  57-XYZ
4  NonVegetarian 307-XYZ
5     Vegetarian 104-XYZ
6     Vegetarian  76-XYZ
7  NonVegetarian 314-XYZ
8  NonVegetarian 269-XYZ
9     Vegetarian 102-XYZ
10 NonVegetarian 300-XYZ
11    Vegetarian 164-XYZ
12    Vegetarian  62-XYZ
#Randomly select n rows.
df.r2<-slice(df.r1, 10:15)

subset observation(dplyr action verb= select for col )

names(df)
 [1] "Previous_status" "FBS"             "BMI"             "HbA1c"          
 [5] "Smoking"         "Gender"          "BMI_Cat"         "serum_creat"    
 [9] "diet"            "ID"             
df %>% select(Previous_status,FBS,BMI,HbA1c)->df.s
df %>% select(1:4)->df.s
df %>% select(-c(6:10))->df.s
df %>% select(starts_with("B"))->df.b #from tidyr...so there will be ends_with 
df %>% select(-FBS)->df.s2

Grouping and summarise

Also a new variable

## creating a new variable 
##using ifelse
df %>% mutate(BMI.s=ifelse(BMI>22,"obese","not-obese"))->df.ss
## using case_when
df %>% mutate(SC=case_when(
  serum_creat>1.35 & Gender=="Male"~"ab",
  serum_creat<1.35 & Gender=="Male"~"n",
  serum_creat>1.04 & Gender=="Female"~"ab",
  TRUE~"n"))->df.ss1
#typical range for serum creatinine  For adult men, 0.74 to 1.35 mg/dL  For adult women, 0.59 to 1.04 mg/dL 
df %>% mutate(Serum_creat_cat=factor(case_when(serum_creat>1.35 & Gender=="Male" ~"sc_above_normal",
                                               serum_creat>1.04 & Gender=="Female" ~"sc_above_normal",
                                
                                TRUE~"sc_normal")))->df
df %>% mutate(Serum_creat_catb=factor(if_else(serum_creat>1.20 ,"sc_above_normal","sc_normal")))->df
colnames(df)
 [1] "Previous_status"  "FBS"              "BMI"              "HbA1c"           
 [5] "Smoking"          "Gender"           "BMI_Cat"          "serum_creat"     
 [9] "diet"             "ID"               "Serum_creat_cat"  "Serum_creat_catb"
df %>% group_by(Previous_status,Gender) %>% summarise(Average=mean(serum_creat),Std.dev=sd(serum_creat))
`summarise()` has grouped output by 'Previous_status'. You can override using
the `.groups` argument.
# A tibble: 4 × 4
# Groups:   Previous_status [2]
  Previous_status Gender Average Std.dev
  <fct>           <fct>    <dbl>   <dbl>
1 Diabetes        Female    1.19   0.223
2 Diabetes        Male      1.19   0.193
3 Non-Diabetes    Female    1.22   0.205
4 Non-Diabetes    Male      1.22   0.212
df %>% group_by(Previous_status,Gender,Serum_creat_catb) %>% summarise(count_cat=n(),Average=mean(serum_creat),Std.dev=sd(serum_creat),IQR=IQR(serum_creat))->df.sc
`summarise()` has grouped output by 'Previous_status', 'Gender'. You can
override using the `.groups` argument.
df %>%  
  group_by(Previous_status,Gender) %>%
  summarise(across(c(2:3,serum_creat), mean, na.rm = TRUE))
Warning: There was 1 warning in `summarise()`.
ℹ In argument: `across(c(2:3, serum_creat), mean, na.rm = TRUE)`.
ℹ In group 1: `Previous_status = Diabetes` `Gender = Female`.
Caused by warning:
! The `...` argument of `across()` is deprecated as of dplyr 1.1.0.
Supply arguments directly to `.fns` through an anonymous function instead.

  # Previously
  across(a:b, mean, na.rm = TRUE)

  # Now
  across(a:b, \(x) mean(x, na.rm = TRUE))
`summarise()` has grouped output by 'Previous_status'. You can override using
the `.groups` argument.
# A tibble: 4 × 5
# Groups:   Previous_status [2]
  Previous_status Gender   BMI HbA1c serum_creat
  <fct>           <fct>  <dbl> <dbl>       <dbl>
1 Diabetes        Female  32.6 10.6         1.19
2 Diabetes        Male    30.0 10.5         1.19
3 Non-Diabetes    Female  29.3  6.07        1.22
4 Non-Diabetes    Male    31.4  6.08        1.22
df %>%  
  group_by(Previous_status,Gender) %>%
  summarise(across(c(2:3,serum_creat), ~ mean(.x, na.rm = TRUE),.names = "{col}_prop")) %>% select(Previous_status, ends_with("prop"))
`summarise()` has grouped output by 'Previous_status'. You can override using
the `.groups` argument.
# A tibble: 4 × 4
# Groups:   Previous_status [2]
  Previous_status BMI_prop HbA1c_prop serum_creat_prop
  <fct>              <dbl>      <dbl>            <dbl>
1 Diabetes            32.6      10.6              1.19
2 Diabetes            30.0      10.5              1.19
3 Non-Diabetes        29.3       6.07             1.22
4 Non-Diabetes        31.4       6.08             1.22
#We’ll use ~ to indicate that we’re supplying a lambda function and use .x to indicate where the variable in across is used.

combining data set

# Create emp Data Frame
library(tidyverse)
a=data.frame(
  emp_id=c(1,2,3,4,5,6),
  name=c("A","B","C","D","E","F"),
  superior_name=c("XX","XX","YY","XX","YY","XX"),
  dept_id=c(10,20,10,10,40,50),
  dept_branch_id= c(101,102,101,101,104,105)
)

# Create dept Data Frame
b=data.frame(
  dept_id=c(10,20,30,40),
  dept_name=c("med","surg","ortho","cfm"),
  dept_branch_id= c(101,102,103,104)
)
a
  emp_id name superior_name dept_id dept_branch_id
1      1    A            XX      10            101
2      2    B            XX      20            102
3      3    C            YY      10            101
4      4    D            XX      10            101
5      5    E            YY      40            104
6      6    F            XX      50            105
b
  dept_id dept_name dept_branch_id
1      10       med            101
2      20      surg            102
3      30     ortho            103
4      40       cfm            104
df.lj<-left_join(a, b, by = "dept_id")
df.lj
  emp_id name superior_name dept_id dept_branch_id.x dept_name dept_branch_id.y
1      1    A            XX      10              101       med              101
2      2    B            XX      20              102      surg              102
3      3    C            YY      10              101       med              101
4      4    D            XX      10              101       med              101
5      5    E            YY      40              104       cfm              104
6      6    F            XX      50              105      <NA>               NA
#Join matching rows from b to a.
df.rj<-right_join(a, b, by = "dept_id")
df.rj
  emp_id name superior_name dept_id dept_branch_id.x dept_name dept_branch_id.y
1      1    A            XX      10              101       med              101
2      2    B            XX      20              102      surg              102
3      3    C            YY      10              101       med              101
4      4    D            XX      10              101       med              101
5      5    E            YY      40              104       cfm              104
6     NA <NA>          <NA>      30               NA     ortho              103
#Join matching rows from a to b.
df.ij<-inner_join(a, b, by = "dept_id")
df.ij
  emp_id name superior_name dept_id dept_branch_id.x dept_name dept_branch_id.y
1      1    A            XX      10              101       med              101
2      2    B            XX      20              102      surg              102
3      3    C            YY      10              101       med              101
4      4    D            XX      10              101       med              101
5      5    E            YY      40              104       cfm              104
#Join data. Retain only rows in both sets.
df.fj<-full_join(a, b, by = "dept_id")
df.fj
  emp_id name superior_name dept_id dept_branch_id.x dept_name dept_branch_id.y
1      1    A            XX      10              101       med              101
2      2    B            XX      20              102      surg              102
3      3    C            YY      10              101       med              101
4      4    D            XX      10              101       med              101
5      5    E            YY      40              104       cfm              104
6      6    F            XX      50              105      <NA>               NA
7     NA <NA>          <NA>      30               NA     ortho              103
#Join data. Retain all values, all rows.

Special values (NaN) (Inf) NA and NULL

#0/0#answer is NaN
#1/0 answer is Inf
sv1<-c(1,0,1,0,3,1)
sv2<-c(1,3,1,2,NA,1)
s<-sv1+sv2
s
[1]  2  3  2  2 NA  2
#Warning message:In Ops.factor(f1, f2) : ‘+’ not meaningful for factors
#Q =What is the difference  between NA an NaN
# NA=Not Available ( can be with character ,numeric or logical )
# NaN=Not a Number (only with numeric vector )
### NaN is also NA but converse is not true 
f3<- c(1,2,3,NA,4,5,NA)
is.na(f3)
[1] FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE
is.nan(f3)
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
f4 <- c(5, 9, NaN, 3, 8, NA, NaN) 

is.na(f4)
[1] FALSE FALSE  TRUE FALSE FALSE  TRUE  TRUE
is.nan(f4)
[1] FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE
#NaN in R means “Not a Number” which means there is something or some value, but it cannot be described in the computer. NaN designates a result that cannot be calculated for whatever reason, or it is not a floating-point number.

#NULL : is for empty object.
a<-c()
length(a)
[1] 0
dim(a)
NULL

Attributes

can be thought of nameplates and identifiers of r objects .

Objects can have attributes. Attributes are part of the object. These include:

names

dimnames

dim

class

attributes (contain metadata)

df<-data.frame(a=c(2,3,5,6,7,3),b=c("a","b","c","d","e","f"))
a<-c(2,3,5,6,7,3)
class(a)
[1] "numeric"
length(a)
[1] 6
dim(a)
NULL
attributes(df)
$names
[1] "a" "b"

$class
[1] "data.frame"

$row.names
[1] 1 2 3 4 5 6
mymatrix<-matrix(1:9,nrow=3,ncol=3)
mymatrix
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9
attributes(mymatrix)
$dim
[1] 3 3

subsetting in r

name<-c("A","B","C","D","H")
age<-c(32,36,29,25,21)
sex<-c("M","M","F","F","M")
hba1c<-c(6.4,7.8,9.1,6.4,10)
df<-data.frame(name,age,sex,hba1c)
attributes(df)
$names
[1] "name"  "age"   "sex"   "hba1c"

$class
[1] "data.frame"

$row.names
[1] 1 2 3 4 5
dim(df)
[1] 5 4
class(df)
[1] "data.frame"
df
  name age sex hba1c
1    A  32   M   6.4
2    B  36   M   7.8
3    C  29   F   9.1
4    D  25   F   6.4
5    H  21   M  10.0
## I want rows of  hba1c>6.5.
df6.5<-df[df$hba1c>6.5,]## by indexing 
df6.5
  name age sex hba1c
2    B  36   M   7.8
3    C  29   F   9.1
5    H  21   M  10.0
df6.5<-subset(df,hba1c>6.5)# by base subsetting 
df6.5
  name age sex hba1c
2    B  36   M   7.8
3    C  29   F   9.1
5    H  21   M  10.0
library(dplyr) # by dplyr
df %>% select(1:4) %>% filter(hba1c>6.5)->df6.5
df6.5
  name age sex hba1c
1    B  36   M   7.8
2    C  29   F   9.1
3    H  21   M  10.0
df6.5<-df[which(df$hba1c>6.5),] ## using which 
df6.5
  name age sex hba1c
2    B  36   M   7.8
3    C  29   F   9.1
5    H  21   M  10.0
## but if i want to extract the age only  of participants having hba1c>6.5
df$age[which(df$hba1c>6.5)]
[1] 36 29 21
## but if i want to extract the gender of participants having hba1c>6.5
df$sex[which(df$hba1c>6.5)]
[1] "M" "F" "M"
## I want to extract the 5th row 
df[5,]
  name age sex hba1c
5    H  21   M    10
## I want to extract the 3rd coloumn
df[,3]
[1] "M" "M" "F" "F" "M"
## i want to extract the 4th observation in 5th row
df[5,4]
[1] 10

missing value extraction

m1<-c(3,4,6,NA,7,4)
m2<-c("a","b","c", NA,"e","f")
m3<-c(2.6,2.9,5.6,NA,NA,4.9)
df.m<-data.frame(m1,m2,m3)
str(df.m)
'data.frame':   6 obs. of  3 variables:
 $ m1: num  3 4 6 NA 7 4
 $ m2: chr  "a" "b" "c" NA ...
 $ m3: num  2.6 2.9 5.6 NA NA 4.9
summary(df.m)
       m1           m2                  m3       
 Min.   :3.0   Length:6           Min.   :2.600  
 1st Qu.:4.0   Class :character   1st Qu.:2.825  
 Median :4.0   Mode  :character   Median :3.900  
 Mean   :4.8                      Mean   :4.000  
 3rd Qu.:6.0                      3rd Qu.:5.075  
 Max.   :7.0                      Max.   :5.600  
 NA's   :1                        NA's   :2      
df.c<- df.m[complete.cases(df.m),]
df.m
  m1   m2  m3
1  3    a 2.6
2  4    b 2.9
3  6    c 5.6
4 NA <NA>  NA
5  7    e  NA
6  4    f 4.9
df.c
  m1 m2  m3
1  3  a 2.6
2  4  b 2.9
3  6  c 5.6
6  4  f 4.9
## remove na only from a single col 

What is data wrangling?

Concept: Data wrangling (also called data munging or cleaning) is the process of transforming raw data into a tidy, analysis-ready form. Typical steps include: import, inspect, clean, transform, reshape, and export.


Core dplyr verbs — explained and demonstrated

# Load tidyverse (if not already in your file)
library(dplyr)
library(tidyr)

select() — choose columns

# Select a subset of columns
select(df, 1:5)

Concept: select() keeps only the columns you need; use tidyselect helpers like starts_with(), ends_with(), contains(), matches().

filter() — row selection by condition

# Rows where a numeric column (replace 'age' with your column) is > 50
filter(df, if ("age" %in% names(df)) age > 50 else TRUE)

Concept: filter() uses logical expressions; combine conditions with &, |, !.

mutate() — create or transform columns

# Create a new variable 'bmi_cat' from a BMI column if present
mutate(df, 
       BMI_Category = if ("BMI" %in% names(df)) case_when(
         BMI < 18.5 ~ "Underweight",
         BMI >= 18.5 & BMI < 25 ~ "Normal",
         BMI >= 25 & BMI < 30 ~ "Overweight",
         BMI >= 30 ~ "Obese",
         TRUE ~ NA_character_
       ) else NA_character_)

Concept: mutate() adds new columns or modifies existing ones, applying vectorized operations.

arrange() — sort rows

# Arrange rows by descending value of a column (replace 'score' as appropriate)
arrange(df, desc(if ("score" %in% names(df)) score else NA))

group_by() + summarise() — aggregation

# Example: mean of a variable by group (replace 'groupvar' and 'measure' with real names)
if ("Gender" %in% names(df) & "HbA1c" %in% names(df)) {
  df %>%
    group_by(Gender) %>%
    summarise(n = n(), mean_HbA1c = mean(HbA1c, na.rm = TRUE))
} else {
  tibble(message = "Please replace Gender/HbA1c with column names from your dataset")
}

Concept: group_by() defines groups; summarise() computes summary statistics per group. Remember to ungroup() when further row-wise operations are needed.


Reshaping data with tidyr

pivot_longer() — wide → long

# Example: pivot multiple measurement columns into key-value pairs
# Replace measure1, measure2 with your column names
if (all(c("measure1","measure2") %in% names(df))) {
  df %>%
    pivot_longer(cols = c(measure1, measure2), names_to = "measure", values_to = "value")
} else {
  tibble(message = "No generic measure1/measure2 columns found in dataset; replace with actual column names to run pivot_longer example")
}

pivot_wider() — long → wide

# Example: spread measure back into columns
# This assumes a long-format table called 'long_df' exists
# long_df %>% pivot_wider(names_from = measure, values_from = value)

Concept: Reshaping is essential for plotting and modeling — tidy data means one variable per column and one observation per row.


Joins — combining tables

# Example placeholders — replace 'other_df' with real second table if available
# left_join(x = df, y = other_df, by = "id")

Concept: left_join() keeps all rows of x; inner_join() keeps only matches; full_join() keeps all rows from both. Always check join keys and duplicates.


Handling missing values

# Count missing per column
sapply(df, function(x) sum(is.na(x)))
# Remove rows with any NA
df %>% drop_na()
# Or keep complete cases
df[complete.cases(df), ]

Tip: Decide on a strategy (remove, impute, flag) depending on why values are missing.


Practical piping patterns with %>%

# A typical pipeline: filter -> select -> mutate -> summarise
df %>%
  filter(if ("Gender" %in% names(df)) Gender == "Male" else TRUE) %>%
  select(where(is.numeric)) %>%
  summarise_all(list(mean = ~mean(.x, na.rm = TRUE)))

Concept: The pipe passes the left-hand side as the first argument to the function on the right — improving readability.


Wrangling Workflow Summary (Complete Mini-Pipeline)

# Replace column names below with ones present in your dataset.
workflow_result <- df %>%
  filter(if ("Diabetes_Status" %in% names(df)) Diabetes_Status == "Diabetes" else TRUE) %>%
  mutate(
    BMI = if ("BMI" %in% names(df)) BMI else NA_real_,
    BMI_Category = if (!("BMI_Category" %in% names(df)) & "BMI" %in% names(df)) 
      case_when(
        BMI < 18.5 ~ "Underweight",
        BMI >= 18.5 & BMI < 25 ~ "Normal",
        BMI >= 25 & BMI < 30 ~ "Overweight",
        BMI >= 30 ~ "Obese",
        TRUE ~ NA_character_
      ) else if ("BMI_Category" %in% names(df)) BMI_Category else NA_character_
  ) %>%
  group_by(if ("Gender" %in% names(df)) Gender else NULL, BMI_Category) %>%
  summarise(
    n = n(),
    mean_HbA1c = if ("HbA1c" %in% names(df)) mean(HbA1c, na.rm = TRUE) else NA_real_,
    median_FBS = if ("FBS" %in% names(df)) median(FBS, na.rm = TRUE) else NA_real_
  ) %>%
  arrange(desc(n))

workflow_result

Final Tips and Good Practices

  • Keep raw data untouched; create transformed copies.
  • Use version control (git) for reproducibility.
  • Document your steps with comments and meaningful object names.
  • Save cleaned datasets to disk (e.g., write_csv(clean_df, "clean_data.csv")) for future work.

End of appended wrangling content