Exploratory graphs Part - 1

EDA Beginner boxplots summary

I love exploring data as soon as I import it. In this post, I describe some basics of exploratory analysis and a few graphs that reveal a lot about your data.

Soundarya Soundararajan
2022-04-01

For the initial steps right after importing the data, I have written a blog post here.

Libraries

library(tidyverse)
library(naniar) # for missing data

Data

head(iris) # displays just the top of the data of interest
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

Missing data

As a first exploration, I like to check whether any data is missing. For this we shall use vis_miss from the naniar package.

vis_miss(iris)

This dataset is complete. Wondering how it would look if some data were actually missing?

library(readxl)
data <- read_excel("data.xlsx", # I am importing my data
  na = "NA" # missing values are coded as NA in my data
)

vis_miss(data)

The black band indicates missing data. For data with many observations, missing values will more often look like thin lines than bands.
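
To put numbers on what vis_miss shows, here is a minimal sketch in base R. The iris_na copy and the positions blanked out are made up purely for illustration; they do not come from my real data.

```r
# Hypothetical example: copy iris and blank out a few values
iris_na <- iris
iris_na$Sepal.Length[c(3, 10, 25)] <- NA
iris_na$Petal.Width[c(7, 50)] <- NA

# Quick yes/no check: does the data contain any missing value?
has_missing <- anyNA(iris_na)

# Number of missing values in each column
na_counts <- colSums(is.na(iris_na))
na_counts

# Percentage of missing values in each column
na_pct <- round(100 * colMeans(is.na(iris_na)), 2)
na_pct
```

Passing iris_na to vis_miss() should then show marks only in the two columns that were modified.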

Boxplots

Most often, I explore data with boxplots to study its spread.

iris %>%
  ggplot(aes(x = Species, y = Sepal.Length)) +
  geom_boxplot(width = 0.5)

I explore outliers quickly this way and color them if needed.

iris %>%
  ggplot(aes(x = Species, y = Sepal.Length)) +
  geom_boxplot(width = 0.5, outlier.color = "red")
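
If you also want the outlying values themselves and not just the red points, one possible sketch uses boxplot.stats() from base R (package grDevices). It applies the usual 1.5 x IQR rule based on hinges, which may differ slightly at the edges from what geom_boxplot draws. The name outliers is just my choice for illustration.

```r
# Split Sepal.Length by species, then pull out the points that
# boxplot.stats() flags as lying beyond the whiskers
outliers <- lapply(
  split(iris$Sepal.Length, iris$Species),
  function(x) boxplot.stats(x)$out
)
outliers

# Rows of iris behind the flagged values for one species
iris[iris$Species == "virginica" &
       iris$Sepal.Length %in% outliers$virginica, ]
```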

I have neither added any customization nor modified themes as these are exploratory. Some call these “dirty plots”, but I don’t as I find all outputs from R beautiful. :-)

For a more polished version, see here.

Base plot way

If you prefer the base plot method, it's much simpler, and the code is below.

boxplot(x = iris$Sepal.Length)
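
Note that the call above draws a single box for all 150 flowers. To get one box per species, as in the ggplot version, a sketch using the formula interface of boxplot() could look like this. The object name bp and the axis labels are my own choices.

```r
# One box per species using the formula interface: y ~ group
bp <- boxplot(
  Sepal.Length ~ Species,
  data = iris,
  xlab = "Species",
  ylab = "Sepal.Length"
)

# boxplot() invisibly returns the numbers behind the plot
bp$names # group labels
bp$n     # observations per group
bp$stats # whisker ends, hinges and median, one column per group
```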

In my personal experience, I have moved away from base plots (many might argue against this) and have grown closer to the ggplot methods.

Pick what you are comfortable with.

Summary stats

Last but not least, one of the quickest and most effective ways to look at any data or its variables is through summary statistics.

summary(iris) # full data
  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
 Median :5.800   Median :3.000   Median :4.350   Median :1.300  
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
       Species  
 setosa    :50  
 versicolor:50  
 virginica :50  
                
                
                
summary(iris$Sepal.Length) #variable of interest
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  4.300   5.100   5.800   5.843   6.400   7.900 
# If the variable is a factor, the count of each level is displayed.
summary(iris$Species)
    setosa versicolor  virginica 
        50         50         50 
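
Since the boxplots above compare species, the same summaries can also be computed per group. This is a minimal sketch with aggregate() from base R; the object names species_mean and species_median are mine.

```r
# Mean Sepal.Length for each species
species_mean <- aggregate(Sepal.Length ~ Species, data = iris, FUN = mean)
species_mean

# Median Sepal.Length for each species
species_median <- aggregate(Sepal.Length ~ Species, data = iris, FUN = median)
species_median
```

The medians here correspond to the thick middle lines in the boxplots.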

Happy exploring until part-2 of exploratory graphs.

Citation

For attribution, please cite this work as

Soundararajan (2022, April 1). My R Space: Exploratory graphs Part - 1. Retrieved from https://github.com/soundarya24/SoundBlog/posts/2022-04-01-exploratory-graphs-part-1/

BibTeX citation

@misc{soundararajan2022exploratory,
  author = {Soundararajan, Soundarya},
  title = {My R Space: Exploratory graphs Part - 1},
  url = {https://github.com/soundarya24/SoundBlog/posts/2022-04-01-exploratory-graphs-part-1/},
  year = {2022}
}