Exploratory Data Analysis (EDA) before Analysis - A Step-by-Step Guide for Beginners in R

EDA Beginner

Exploratory data analysis (EDA) is a crucial step in the analytical process, offering a profound understanding of our data and revealing hidden insights.

Soundarya Soundararajan
2023-06-05

Introduction

Photo by Joshua Earle on Unsplash

Exploratory data analysis (EDA) allows us to go beyond mere numbers and variables and develop a deep connection with our data. EDA is akin to conducting “night science,” where we embark on a journey of exploration without predefined hypotheses, enabling us to stumble upon fascinating discoveries that can shape our research trajectory. By engaging in EDA, we break free from assumptions and gain a comprehensive understanding of our data, setting the stage for impactful analysis.

The Power of EDA = breaking free from assumptions and driving impactful analysis

EDA empowers us to make better decisions throughout the analytical process. It provides us with the necessary knowledge to select appropriate statistical techniques, identify potential biases or outliers, and choose the most relevant variables for analysis. By investing time in EDA, we lay a solid foundation that allows us to extract meaningful insights and draw accurate conclusions.

However, it is important to strike a balance and approach EDA with intention. While it is tempting to delve into endless exploratory analyses, we must remember that EDA serves as a means to an end. It offers a preliminary understanding of our data and helps us identify key areas of interest. By focusing on extracting actionable insights, we can ensure that EDA drives our analysis forward and leads to informed decision-making.

The Process

In this blog post, we will take a step-by-step journey into the world of EDA in R. We will explore the art of understanding, visualizing, and exploring data to unlock its full potential. Together, we will discover the transformative power of EDA, gaining the tools and knowledge to conduct impactful analysis and make informed decisions based on a deep understanding of our data.

Data Import

For detailed instructions on importing data into R, you can refer to this blog post. If you have any questions as a beginner, you can also check out the Twitter thread I wrote addressing common queries.

Overview of Continuous Variables

The following techniques provide a starting point for EDA on continuous variables. It’s important to customize and combine them based on your specific data and research goals. As you gain more experience, you can explore additional methods and advanced visualizations to uncover further insights in your data.

Summary Statistics

Begin by calculating summary statistics such as mean, median, standard deviation, minimum, and maximum. This provides a quick overview of the central tendency, spread, and range of your continuous variables. In R, you can use functions like summary(), mean(), median(), sd(), min(), and max().

Detailed steps are here
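As a minimal sketch of these functions, the block below summarizes body_mass_g from the penguins data (the palmerpenguins package used later in this post). The choice of column and the name mass_stats are only for illustration. Because this column contains missing values, na.rm = TRUE is needed; without it, mean() and the other functions return NA.

```r
library(palmerpenguins)

# body_mass_g contains missing values, so drop them explicitly with na.rm = TRUE
mass <- penguins$body_mass_g

mass_stats <- c(
  mean   = mean(mass, na.rm = TRUE),
  median = median(mass, na.rm = TRUE),
  sd     = sd(mass, na.rm = TRUE),
  min    = min(mass, na.rm = TRUE),
  max    = max(mass, na.rm = TRUE)
)
round(mass_stats, 1)

# Without na.rm = TRUE the result is NA
mean(mass)
```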

What to observe

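For the data as a whole, str() and dim() are good starting points: they show the structure of the data, including variable types and dimensions. summary() adds per-column summaries together with counts of missing values (NA's). The output below shows str() followed by summary() for the penguins data.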
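The calls behind the output that follows are not shown in the post. Assuming it was produced from the penguins data in the palmerpenguins package (the dataset used later on), a sketch along these lines should reproduce it:

```r
library(palmerpenguins)

str(penguins)      # variable types and the first few values of each column
dim(penguins)      # number of rows and columns
summary(penguins)  # per-column summaries, including NA counts
```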

tibble [344 × 8] (S3: tbl_df/tbl/data.frame)
 $ species          : Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ island           : Factor w/ 3 levels "Biscoe","Dream",..: 3 3 3 3 3 3 3 3 3 3 ...
 $ bill_length_mm   : num [1:344] 39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ...
 $ bill_depth_mm    : num [1:344] 18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ...
 $ flipper_length_mm: int [1:344] 181 186 195 NA 193 190 181 195 193 190 ...
 $ body_mass_g      : int [1:344] 3750 3800 3250 NA 3450 3650 3625 4675 3475 4250 ...
 $ sex              : Factor w/ 2 levels "female","male": 2 1 1 NA 1 2 1 2 NA NA ...
 $ year             : int [1:344] 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...
      species          island    bill_length_mm  bill_depth_mm  
 Adelie   :152   Biscoe   :168   Min.   :32.10   Min.   :13.10  
 Chinstrap: 68   Dream    :124   1st Qu.:39.23   1st Qu.:15.60  
 Gentoo   :124   Torgersen: 52   Median :44.45   Median :17.30  
                                 Mean   :43.92   Mean   :17.15  
                                 3rd Qu.:48.50   3rd Qu.:18.70  
                                 Max.   :59.60   Max.   :21.50  
                                 NA's   :2       NA's   :2      
 flipper_length_mm  body_mass_g       sex           year     
 Min.   :172.0     Min.   :2700   female:165   Min.   :2007  
 1st Qu.:190.0     1st Qu.:3550   male  :168   1st Qu.:2007  
 Median :197.0     Median :4050   NA's  : 11   Median :2008  
 Mean   :200.9     Mean   :4202                Mean   :2008  
 3rd Qu.:213.0     3rd Qu.:4750                3rd Qu.:2009  
 Max.   :231.0     Max.   :6300                Max.   :2009  
 NA's   :2         NA's   :2                                 

Histograms

Create histograms to visualize the distribution of your continuous variables. Histograms display the frequency or count of values within specific intervals or bins. In R, you can use the hist() function to generate histograms. Experiment with different bin widths (controlled through the breaks argument) to find the most informative representation of your data. You can find more information on histograms in this blogpost.
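Here is a small sketch of how the number of bins changes the picture, again using body_mass_g from penguins as an illustrative column. Note that a single number passed to breaks is treated by hist() as a suggestion, so the actual number of bins may differ slightly.

```r
library(palmerpenguins)

mass <- penguins$body_mass_g

# Default binning
hist(mass, main = "Body mass (default bins)", xlab = "Body mass (g)")

# Ask for roughly 30 bins; hist() treats a single number as a suggestion
h <- hist(mass, breaks = 30,
          main = "Body mass (about 30 bins)", xlab = "Body mass (g)")

# The returned object holds the bin edges and counts
h$breaks
sum(h$counts)  # number of non-missing values that were binned
```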

Box Plots

Construct box plots to examine the distribution, central tendency, and variability of your continuous variables. Box plots provide a visual representation of the median, quartiles, and any potential outliers in your data. In R, you can use the boxplot() function to create box plots. For more details on dot and box plots, refer to this blogpost.
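For example, the formula interface of boxplot() draws one box per group. The sketch below compares body mass across species in penguins; the variable choice is an assumption made for illustration.

```r
library(palmerpenguins)

# One box per species, using the formula interface: response ~ group
boxplot(body_mass_g ~ species, data = penguins,
        xlab = "Species", ylab = "Body mass (g)")

# boxplot() also returns the statistics it draws
b <- boxplot(body_mass_g ~ species, data = penguins, plot = FALSE)
b$names  # group labels
b$stats  # rows: lower whisker, lower hinge, median, upper hinge, upper whisker
```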

Density Plots

Generate density plots to visualize a smoothed estimate of the probability density of your continuous variables. Density plots provide insights into the shape and skewness of the distribution. In R, you can use functions like density() and plot() to create density plots. For a comprehensive guide, check out this detailed blogpost.
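A minimal sketch with base R, using body_mass_g from penguins as an example column. density() stops with an error if the input contains missing values, so na.rm = TRUE is passed here.

```r
library(palmerpenguins)

# Kernel density estimate; na.rm = TRUE drops the missing values first
d <- density(penguins$body_mass_g, na.rm = TRUE)

plot(d, main = "Body mass density", xlab = "Body mass (g)")

# The estimate is evaluated on a grid of points (512 by default)
length(d$x)
d$bw  # bandwidth chosen by the default rule
```

A smaller or larger bandwidth can be tried through the adjust argument of density(), which plays a role similar to the bin width of a histogram.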

Scatter Plots

Use scatter plots to explore the relationship between two continuous variables. Scatter plots can help identify patterns, trends, or correlations. In R, you can use the plot() function, specifying the x and y variables. Learn how to draw grouped scatter plots here.
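As a sketch, the block below plots flipper length against body mass in penguins and colors the points by species. The pair of variables and the coloring are illustrative choices, not the author's.

```r
library(palmerpenguins)

# Integer codes 1, 2, 3 for the species, used as plotting colors
species_col <- as.integer(penguins$species)

# Points with a missing x or y value are simply not drawn
plot(penguins$flipper_length_mm, penguins$body_mass_g,
     col = species_col, pch = 19,
     xlab = "Flipper length (mm)", ylab = "Body mass (g)")

legend("topleft", legend = levels(penguins$species),
       col = seq_along(levels(penguins$species)), pch = 19)
```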

Correlation Analysis

Assess the correlation between pairs of continuous variables using correlation coefficients such as Pearson’s correlation coefficient. This helps identify relationships and dependencies between variables. In R, you can use the cor() function to calculate correlations. Using pairs() is a great way to quickly understand the relationships.

pairs(iris)

If you are familiar with the tidyverse, you can try:

iris |>
  dplyr::select(-Species) |>
  pairs()
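To put numbers next to the pairs() plot, cor() can be applied to several numeric columns at once. The sketch below assumes the four measurement columns of penguins. Since they contain missing values, use = "complete.obs" is set; with the default setting the affected entries would be NA.

```r
library(palmerpenguins)

# Keep only the numeric measurement columns
num_cols <- c("bill_length_mm", "bill_depth_mm",
              "flipper_length_mm", "body_mass_g")
measurements <- penguins[, num_cols]

# Pearson correlations, computed on rows without any missing value
cor_mat <- cor(measurements, use = "complete.obs", method = "pearson")
round(cor_mat, 2)
```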

Overview of Categorical Variables

When working with categorical variables, understanding their distribution and frequencies is essential. In this section, we will explore techniques to analyze and visualize categorical variables, examining the frequencies and proportions of their categories to reveal patterns and relationships within the data.

Frequencies and proportions

Here, we will calculate and analyze the counts and proportions of each category within a categorical variable. This analysis provides a quick overview of the distribution and prevalence of different categories, highlighting the relative frequencies and proportions for further investigation.

library(palmerpenguins)
table(penguins$sex)

female   male 
   165    168 

prop.table(table(penguins$sex))

   female      male 
0.4954955 0.5045045 

# Quickly visualize the frequencies
barplot(table(penguins$sex))
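Note that table() silently drops missing values by default, which is why the counts above add up to 333 rather than 344. A small sketch of how to keep them visible, using the useNA argument:

```r
library(palmerpenguins)

# useNA = "ifany" adds an NA category whenever missing values are present
sex_counts <- table(penguins$sex, useNA = "ifany")
sex_counts

# Proportions over all rows, missing values included
round(prop.table(sex_counts), 3)
```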

From dplyr

Here, we use the count() function from the dplyr package to calculate the number of occurrences of each category in a variable. The result is a table with one row per category, which gives a clear and concise summary of how the categories are distributed.

penguins |> 
  dplyr::count(species)

# A tibble: 3 × 2
  species       n
  <fct>     <int>
1 Adelie      152
2 Chinstrap    68
3 Gentoo      124
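The same pipeline can be extended to show proportions next to the counts. This is a sketch, and the names species_summary and prop are chosen here for illustration.

```r
library(palmerpenguins)
library(dplyr)

species_summary <- penguins |>
  count(species, sort = TRUE) |>   # largest category first
  mutate(prop = n / sum(n))        # share of all rows

species_summary
```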

Know what is missing

By being aware of what is missing in our data, we can take proactive steps to address these gaps and ensure the integrity of our analysis. In this regard, I leave the treatment of missing values to your discretion, knowing that your choices will be guided by the principles of EDA.

library(naniar)
vis_miss(penguins)

By utilizing the naniar package and the vis_miss() function, we can easily visualize missing values within the penguins dataset. This visualization helps us identify the patterns and extent of missing data, allowing us to make informed decisions on how to handle missing values in our analysis.
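If you want the numbers behind the picture, a base R sketch like the one below counts missing values per column. It does not need naniar, and the name na_counts is just an illustrative choice.

```r
library(palmerpenguins)

# Number of missing values in each column
na_counts <- colSums(is.na(penguins))
na_counts[na_counts > 0]

# The same as a percentage of rows
round(colMeans(is.na(penguins)) * 100, 1)

# Rows with no missing value at all
sum(complete.cases(penguins))
```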

Outliers

Outliers are data points that deviate significantly from the overall pattern of the dataset. They can have a significant impact on statistical analyses and models. To detect outliers in your data, various techniques such as graphical methods and statistical tests can be employed. For a detailed exploration of outlier detection in R, I recommend referring to the comprehensive blog post by statsandr which provides valuable insights and practical approaches to identify and handle outliers in your data. You can find the blog post here.
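One common graphical-style rule, the one behind the whiskers of a box plot, flags values that lie more than 1.5 times the interquartile range beyond the quartiles. The helper below is a hypothetical sketch of that rule (iqr_outliers is not a function from any package), and it is only one of several possible approaches.

```r
# Hypothetical helper: values outside [Q1 - k * IQR, Q3 + k * IQR]
iqr_outliers <- function(x, k = 1.5) {
  x <- x[!is.na(x)]                                  # ignore missing values
  q <- quantile(x, probs = c(0.25, 0.75), names = FALSE)
  fence <- k * (q[2] - q[1])                         # k times the IQR
  x[x < q[1] - fence | x > q[2] + fence]
}

# Toy example: 100 is far from the rest of the values
iqr_outliers(c(1, 2, 3, 4, 5, 100))

# Applied to a real column
library(palmerpenguins)
iqr_outliers(penguins$body_mass_g)
```

Base R's boxplot.stats(x)$out applies a closely related rule based on hinges, so its result can differ slightly for small samples.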

Simplifying EDA with Packages

In the realm of exploratory data analysis (EDA), utilizing specialized packages can simplify the process and enhance efficiency. Two such packages, “inspectdf” and “summarytools,” offer valuable functionalities for gaining insights from your data. The “inspectdf” package provides comprehensive summaries of categorical variables, allowing you to understand their distributions, frequencies, and proportions effortlessly.

library(inspectdf)
inspect_cat(penguins)

# A tibble: 3 × 5
  col_name   cnt common common_pcnt levels          
  <chr>    <int> <chr>        <dbl> <named list>    
1 island       3 Biscoe        48.8 <tibble [3 × 3]>
2 sex          3 male          48.8 <tibble [3 × 3]>
3 species      3 Adelie        44.2 <tibble [3 × 3]>

show_plot(inspect_cat(penguins))

Creating a Summary of All Variables

The dfSummary() function from the summarytools package offers a concise summary of all variables in a data frame, presenting an overview of your data at a glance. Together with inspectdf, it can streamline your EDA workflow and help you extract meaningful information from your dataset with ease.

Like this!

Output from the summarytools package

For a detailed guide on creating a beautiful summary, refer to this blogpost
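The call that produced the summary is not shown in the post. A minimal sketch, assuming the penguins data and default settings, could look like this:

```r
library(palmerpenguins)
library(summarytools)

# One row per variable: type, basic statistics or frequencies, and missing values
penguin_summary <- dfSummary(penguins)
penguin_summary

# In an interactive session, view() from summarytools renders an HTML version
# view(penguin_summary)
```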

Conclusion and Decision-making

EDA goes beyond providing an overview: it guides the rest of the analysis. It shows what needs cleaning by surfacing anomalies, inconsistencies, and missing values, and it helps identify which variables are relevant enough to carry forward. By conducting thorough EDA, researchers can work through the complexities of their data, safeguard data quality, and extract meaningful insights that drive informed decision-making.

Closing Thoughts

Although a layperson may perceive a person engaging in exploratory data analysis (EDA) as being lost, there is a sense of tranquility in embracing the profound insights that EDA offers. As one delves into the depths of data exploration, they gain a majestic overview that guides their analytical journey. This sentiment brings to mind the renowned quote by German philosopher Friedrich Nietzsche:

When you stare into the abyss, the abyss stares back at you.

See image in the introduction.


Citation

For attribution, please cite this work as

Soundararajan (2023, June 5). My R Space: Exploratory Data Analysis (EDA) before Analysis - A Step-by-Step Guide for Beginners in R. Retrieved from https://github.com/soundarya24/SoundBlog/posts/2023-06-05-eda/

BibTeX citation

@misc{soundararajan2023exploratory,
  author = {Soundararajan, Soundarya},
  title = {My R Space: Exploratory Data Analysis (EDA) before Analysis - A Step-by-Step Guide for Beginners in R},
  url = {https://github.com/soundarya24/SoundBlog/posts/2023-06-05-eda/},
  year = {2023}
}