Exploratory data analysis (EDA) is a crucial step in the analytical process, offering a profound understanding of our data and revealing hidden insights.
EDA allows us to go beyond mere numbers and variables and develop a deep connection with our data. It is akin to conducting “night science,” where we explore without predefined hypotheses and may stumble upon discoveries that shape our research trajectory. By engaging in EDA, we set our assumptions aside and gain a fuller understanding of our data, setting the stage for impactful analysis.
EDA empowers us to make better decisions throughout the analytical process. It provides us with the necessary knowledge to select appropriate statistical techniques, identify potential biases or outliers, and choose the most relevant variables for analysis. By investing time in EDA, we lay a solid foundation that allows us to extract meaningful insights and draw accurate conclusions. However, it is important to strike a balance and approach EDA with intention. While it is tempting to delve into endless exploratory analyses, we must remember that EDA serves as a means to an end. It offers a preliminary understanding of our data and helps us identify key areas of interest. By focusing on extracting actionable insights, we can ensure that EDA drives our analysis forward and leads to informed decision-making.
In this blog post, we will take a step-by-step journey into the world of EDA in R. We will explore the art of understanding, visualizing, and exploring data to unlock its full potential. Together, we will discover the transformative power of EDA, gaining the tools and knowledge to conduct impactful analysis and make informed decisions based on a deep understanding of our data.
For detailed instructions on importing data into R, you can refer to this blog post. If you have any questions as a beginner, you can also check out the Twitter thread I wrote addressing common queries.
The following techniques provide a starting point for EDA on continuous variables. It’s important to customize and combine them based on your specific data and research goals. As you gain more experience, you can explore additional methods and advanced visualizations to uncover further insights in your data.
Begin by calculating summary statistics such as mean, median, standard deviation, minimum, and maximum. This provides a quick overview of the central tendency, spread, and range of your continuous variables. In R, you can use functions like summary(), mean(), median(), sd(), min(), and max().
Detailed steps are here
Look at the mean and median to understand the typical or average value of the variable. This provides insight into the center of the distribution.
Assess the standard deviation to gauge the variability or dispersion of the data points around the mean. A higher standard deviation indicates greater variability.
Identify the minimum and maximum values to understand the range of the data. This helps identify potential outliers or extreme values.
Check the interquartile range (IQR), the distance between the 25th and 75th percentiles. It describes the spread of the middle 50% of the data, is less sensitive to extreme values than the standard deviation, and is the basis of the common 1.5 × IQR rule for flagging outliers.
Observe the skewness value to determine the asymmetry of the data distribution. A skewness close to zero suggests a symmetric distribution; positive skewness indicates a longer right tail, and negative skewness a longer left tail.
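The checks listed above can be sketched for a single variable as follows. This is a minimal example on body_mass_g from the penguins data used later in this post. Base R has no built-in skewness function, so moment_skewness() is a hypothetical helper written for this sketch, using the simple moment-based definition; dedicated packages may use slightly different formulas.

```r
library(palmerpenguins)

# Work on one continuous variable; drop NAs once so every statistic
# is computed on the same set of observations.
mass <- penguins$body_mass_g
mass <- mass[!is.na(mass)]

stats <- c(
  mean   = mean(mass),
  median = median(mass),
  sd     = sd(mass),
  min    = min(mass),
  max    = max(mass),
  iqr    = IQR(mass)
)

# Hypothetical helper (not part of base R): moment-based skewness,
# i.e. the mean cubed deviation divided by the cubed population
# standard deviation.
moment_skewness <- function(x) {
  centred <- x - mean(x)
  mean(centred^3) / mean(centred^2)^1.5
}

round(c(stats, skewness = moment_skewness(mass)), 2)
```

Comparing the mean with the median in this output is a quick first check of symmetry before looking at the skewness value itself.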
For the dataset as a whole, str() and dim() are good starting points: they show the structure of the data, including variable types and dimensions. summary() adds a per-variable summary and the number of missing values (NA's) in each column.
library(palmerpenguins)
str(penguins)
tibble [344 × 8] (S3: tbl_df/tbl/data.frame)
$ species : Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...
$ island : Factor w/ 3 levels "Biscoe","Dream",..: 3 3 3 3 3 3 3 3 3 3 ...
$ bill_length_mm : num [1:344] 39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ...
$ bill_depth_mm : num [1:344] 18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ...
$ flipper_length_mm: int [1:344] 181 186 195 NA 193 190 181 195 193 190 ...
$ body_mass_g : int [1:344] 3750 3800 3250 NA 3450 3650 3625 4675 3475 4250 ...
$ sex : Factor w/ 2 levels "female","male": 2 1 1 NA 1 2 1 2 NA NA ...
$ year : int [1:344] 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...
library(palmerpenguins)
summary(penguins)
      species          island    bill_length_mm  bill_depth_mm  
 Adelie   :152   Biscoe   :168   Min.   :32.10   Min.   :13.10  
 Chinstrap: 68   Dream    :124   1st Qu.:39.23   1st Qu.:15.60  
 Gentoo   :124   Torgersen: 52   Median :44.45   Median :17.30  
                                 Mean   :43.92   Mean   :17.15  
                                 3rd Qu.:48.50   3rd Qu.:18.70  
                                 Max.   :59.60   Max.   :21.50  
                                 NA's   :2       NA's   :2      
 flipper_length_mm  body_mass_g       sex           year     
 Min.   :172.0     Min.   :2700   female:165   Min.   :2007  
 1st Qu.:190.0     1st Qu.:3550   male  :168   1st Qu.:2007  
 Median :197.0     Median :4050   NA's  : 11   Median :2008  
 Mean   :200.9     Mean   :4202                Mean   :2008  
 3rd Qu.:213.0     3rd Qu.:4750                3rd Qu.:2009  
 Max.   :231.0     Max.   :6300                Max.   :2009  
 NA's   :2         NA's   :2                                 
Create histograms to visualize the distribution of your continuous variables. Histograms display the frequency or count of values within specific intervals or bins. In R, you can use the hist() function to generate histograms. Experiment with different bin widths to find the most informative representation of your data. You can find more information on histograms in this blogpost
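As a minimal sketch of the bin-width experiment described above, the snippet below draws the same variable with two different values of breaks. The choice of body_mass_g and of 10 versus 30 bins is only an illustration.

```r
library(palmerpenguins)

# Two panels side by side so the effect of the bin count is easy to compare.
old_par <- par(mfrow = c(1, 2))

# `breaks` is only a suggestion: R picks "pretty" cut points close to it.
# hist() silently ignores missing values.
h_coarse <- hist(penguins$body_mass_g, breaks = 10,
                 main = "About 10 bins", xlab = "Body mass (g)")
h_fine <- hist(penguins$body_mass_g, breaks = 30,
               main = "About 30 bins", xlab = "Body mass (g)")

par(old_par)  # restore the previous plotting layout

# The returned objects hold the bin edges and counts for inspection.
length(h_coarse$counts)
length(h_fine$counts)
```

Too few bins can hide features such as a second peak, while too many make the plot noisy, so it is worth trying a few values.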
Construct box plots to examine the distribution, central tendency, and variability of your continuous variables. Box plots provide a visual representation of the median, quartiles, and any potential outliers in your data. In R, you can use the boxplot() function to create box plots. For more details on dot and box plots, refer to this blogpost.
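A short sketch of a grouped box plot with the formula interface of boxplot() follows. Splitting body_mass_g by species is an illustrative choice, not the only sensible grouping.

```r
library(palmerpenguins)

# One box per species. The thick line is the median, the box spans the
# quartiles, and by default the whiskers reach the most extreme points
# within 1.5 * IQR of the box; anything beyond is drawn individually.
box_info <- boxplot(body_mass_g ~ species, data = penguins,
                    xlab = "Species", ylab = "Body mass (g)")

# The returned list holds the numbers behind the plot.
box_info$names  # group labels
box_info$n      # non-missing observations per group
box_info$stats  # whisker ends, hinges and median, one column per group
```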
Generate density plots to visualize the probability density function of your continuous variables. Density plots provide insights into the shape and skewness of the distribution. In R, you can use functions like density() and plot() to create density plots. For a comprehensive guide, check out this detailed blogpost.
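A minimal density plot might look like the sketch below. Note that density() stops with an error when the input contains missing values unless na.rm = TRUE is set; the variable chosen here is just an example.

```r
library(palmerpenguins)

# Kernel density estimate of flipper length; na.rm = TRUE drops the NAs.
flipper_density <- density(penguins$flipper_length_mm, na.rm = TRUE)

plot(flipper_density,
     main = "Density of flipper length",
     xlab = "Flipper length (mm)")

# Mark the individual observations along the x-axis.
rug(penguins$flipper_length_mm)
```

The smoothness of the curve is controlled by the bandwidth (the bw argument of density()), which plays a role similar to the bin width of a histogram.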
Use scatter plots to explore the relationship between two continuous variables. Scatter plots can help identify patterns, trends, or correlations. In R, you can use the plot() function, specifying the x and y variables. Learn how to draw grouped scatter plots here.
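The following base R sketch plots two continuous variables against each other and colours the points by species. The variable pair and the colour coding are illustrative choices.

```r
library(palmerpenguins)

# Integer codes 1, 2, 3 for the three species; base graphics maps these
# to the first three colours of the current palette.
species_codes <- as.integer(penguins$species)

plot(penguins$flipper_length_mm, penguins$body_mass_g,
     col = species_codes, pch = 19,
     xlab = "Flipper length (mm)", ylab = "Body mass (g)")

# Legend entries use the same integer codes as the points.
legend("topleft",
       legend = levels(penguins$species),
       col = seq_along(levels(penguins$species)),
       pch = 19)
```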
Assess the correlation between pairs of continuous variables using correlation coefficients such as Pearson’s correlation coefficient. This helps identify relationships and dependencies between variables. In R, you can use the cor() function to calculate correlations. Using pairs() is a great way to quickly understand the relationships.
pairs(iris)
If you are familiar with the tidyverse, you can try:
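One possible tidyverse-flavoured version is sketched below. It is an assumption about what fits here, not necessarily the author's original snippet. It uses dplyr to pick the numeric measurements from the penguins data, computes a Pearson correlation matrix with cor(), and, only if the GGally package happens to be installed, draws a ggplot2-based pairs plot with ggpairs().

```r
library(palmerpenguins)
library(dplyr)

# Keep the numeric measurements, drop the year column (a survey label
# rather than a measurement) and remove rows with missing values, since
# cor() returns NA for any pair that contains missing data.
measurements <- penguins |>
  select(where(is.numeric), -year) |>
  na.omit()

# Pearson correlation for every pair of measurements.
cor_matrix <- cor(measurements)
round(cor_matrix, 2)

# Optional: a ggplot2-style pairs plot; skipped when GGally is not installed.
if (requireNamespace("GGally", quietly = TRUE)) {
  print(GGally::ggpairs(measurements))
}
```

Instead of dropping rows beforehand, cor() also accepts use = "pairwise.complete.obs" to handle missing values pair by pair.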
When working with categorical variables, understanding their distribution and frequencies is essential for gaining insights into your data. In this section, we will explore techniques to analyze and visualize categorical variables, allowing us to uncover patterns, proportions, and relationships within the data. By examining the frequencies and proportions of different categories, we can uncover valuable information that contributes to a comprehensive understanding of the dataset.
Here, we will calculate and analyze the counts and proportions of each category within a categorical variable. This analysis provides a quick overview of the distribution and prevalence of different categories, highlighting the relative frequencies and proportions for further investigation.
library(palmerpenguins)
table(penguins$sex)
female male
165 168
prop.table(table(penguins$sex))
female male
0.4954955 0.5045045
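Note that table() drops missing values by default, which is why the counts above add up to 333 rather than 344. A small variation, sketched here, keeps them visible with the useNA argument:

```r
library(palmerpenguins)

# useNA = "ifany" adds an <NA> column whenever missing values are present.
sex_counts <- table(penguins$sex, useNA = "ifany")
sex_counts

# Proportions are now relative to all 344 penguins, including those with
# unknown sex.
round(prop.table(sex_counts), 3)
```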
Here, we use the dplyr package to calculate the counts of each category within a categorical variable. The count() function returns a table with the number of occurrences of each category in the specified variable, giving a clear and concise summary of how the categories are distributed.
penguins |>
dplyr::count(species)
# A tibble: 3 × 2
species n
<fct> <int>
1 Adelie 152
2 Chinstrap 68
3 Gentoo 124
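As a small extension of the count() call above, proportions can be added in the same pipeline and the counts drawn as a bar chart. The column name prop and the use of base barplot() are choices made for this sketch.

```r
library(palmerpenguins)
library(dplyr)

# Count each species, sort from most to least frequent, and add the
# share of the whole dataset.
species_summary <- penguins |>
  count(species, sort = TRUE) |>
  mutate(prop = n / sum(n))

species_summary

# A quick bar chart of the same counts.
barplot(species_summary$n,
        names.arg = as.character(species_summary$species),
        ylab = "Number of penguins")
```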
By being aware of what is missing in our data, we can take proactive steps to address these gaps and ensure the integrity of our analysis. In this regard, I leave the treatment of missing values to your discretion, knowing that your choices will be guided by the principles of EDA.
By utilizing the naniar package and the vis_miss() function, we can easily visualize missing values within the penguins dataset. This visualization helps us identify the patterns and extent of missing data, allowing us to make informed decisions on how to handle missing values in our analysis.
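A minimal sketch of both steps follows: counting missing values per column in base R, then drawing the vis_miss() plot. The requireNamespace() guard is an addition for this example so that the snippet still runs when naniar is not installed.

```r
library(palmerpenguins)

# Number of missing values in each column.
missing_per_column <- colSums(is.na(penguins))
missing_per_column

# Overall percentage of missing cells in the dataset.
round(100 * mean(is.na(penguins)), 2)

# Visual overview of where the missing values sit, one column per variable.
if (requireNamespace("naniar", quietly = TRUE)) {
  print(naniar::vis_miss(penguins))
}
```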
Outliers are data points that deviate significantly from the overall pattern of the dataset. They can have a significant impact on statistical analyses and models. To detect outliers in your data, various techniques such as graphical methods and statistical tests can be employed. For a detailed exploration of outlier detection in R, I recommend referring to the comprehensive blog post by statsandr which provides valuable insights and practical approaches to identify and handle outliers in your data. You can find the blog post here.
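As one simple example of such a technique, the sketch below applies the 1.5 × IQR rule, the same rule that box plots use to draw individual points. flag_iqr_outliers() is a hypothetical helper written for this post, not a function from any package, and the multiplier k = 1.5 is a convention rather than a fixed threshold.

```r
library(palmerpenguins)

# Hypothetical helper: returns the values lying more than k * IQR below
# the first quartile or above the third quartile.
flag_iqr_outliers <- function(x, k = 1.5) {
  x <- x[!is.na(x)]
  quartiles <- quantile(x, probs = c(0.25, 0.75))
  fence_width <- k * (quartiles[[2]] - quartiles[[1]])
  lower_fence <- quartiles[[1]] - fence_width
  upper_fence <- quartiles[[2]] + fence_width
  x[x < lower_fence | x > upper_fence]
}

# A toy vector with one obviously extreme value.
toy_outliers <- flag_iqr_outliers(c(10, 12, 11, 13, 12, 95))
toy_outliers

# The same rule applied to penguin body mass.
mass_outliers <- flag_iqr_outliers(penguins$body_mass_g)
length(mass_outliers)
```

In the toy vector only 95 is flagged. For body mass, the quartiles reported by summary() earlier (3550 and 4750) put the fences at 1750 g and 6550 g, so no penguin is flagged by this rule. A flagged value is a prompt for a closer look, not proof of an error.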
In the realm of exploratory data analysis (EDA), utilizing specialized packages can simplify the process and enhance efficiency. Two such packages, “inspectdf” and “summarytools,” offer valuable functionalities for gaining insights from your data. The “inspectdf” package provides comprehensive summaries of categorical variables, allowing you to understand their distributions, frequencies, and proportions effortlessly.
library(inspectdf)
inspect_cat(penguins)
# A tibble: 3 × 5
col_name cnt common common_pcnt levels
<chr> <int> <chr> <dbl> <named list>
1 island 3 Biscoe 48.8 <tibble [3 × 3]>
2 sex 3 male 48.8 <tibble [3 × 3]>
3 species 3 Adelie 44.2 <tibble [3 × 3]>
show_plot(inspect_cat(penguins))
On the other hand, the “dfSummary” function from the summarytools package offers a sleek and concise summary of all variables, presenting an overview of your data at a glance. By leveraging these powerful packages, you can streamline your EDA workflow and extract meaningful information from your dataset with ease.
Like this!
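A call along the following lines produces such a summary. This is a minimal sketch that assumes the summarytools package is installed; the exact options behind the output shown in the original post are not known.

```r
library(palmerpenguins)
library(summarytools)

# One row per variable: type, basic statistics or frequencies,
# number of valid values and number of missing values.
penguin_summary <- dfSummary(penguins)

# Plain-text version in the console.
print(penguin_summary)

# In an interactive session, summarytools::view() opens a formatted
# HTML version instead:
# view(penguin_summary)
```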
For a detailed guide on creating a beautiful summary, refer to this blogpost.
EDA goes beyond providing an overview. It guides the rest of the analysis by revealing what needs cleaning (anomalies, inconsistencies, and missing values) and which variables are relevant enough to carry forward. By conducting a thorough EDA, researchers can work through the complexities of their data, safeguard data quality, and extract meaningful insights that drive informed decision-making.
Although a layperson may perceive a person engaging in exploratory data analysis (EDA) as being lost, there is a sense of tranquility in embracing the profound insights that EDA offers. As one delves into the depths of data exploration, they gain a majestic overview that guides their analytical journey. This sentiment brings to mind the renowned quote by German philosopher Friedrich Nietzsche:
When you stare into the abyss, the abyss stares back at you.
See image in the introduction.
Exploratory Data Analysis with R - An excellent book by Roger Peng.
Rushworth A (2022). inspectdf: Inspection, Comparison and Visualisation of Data Frames. R package version 0.0.12, https://CRAN.R-project.org/package=inspectdf.
Comtois D (2022). summarytools: Tools to Quickly and Neatly Summarize Data. R package version 1.0.1, https://CRAN.R-project.org/package=summarytools.
Dataset used in this analysis - Horst AM, Hill AP, Gorman KB (2020). palmerpenguins: Palmer Archipelago (Antarctica) penguin data. R package version 0.1.0. https://allisonhorst.github.io/palmerpenguins/. doi: 10.5281/zenodo.3960218.
For attribution, please cite this work as
Soundararajan (2023, June 5). My R Space: Exploratory Data Analysis (EDA) before Analysis - A Step-by-Step Guide for Beginners in R. Retrieved from https://github.com/soundarya24/SoundBlog/posts/2023-06-05-eda/
BibTeX citation
@misc{soundararajan2023exploratory,
  author = {Soundararajan, Soundarya},
  title = {My R Space: Exploratory Data Analysis (EDA) before Analysis - A Step-by-Step Guide for Beginners in R},
  url = {https://github.com/soundarya24/SoundBlog/posts/2023-06-05-eda/},
  year = {2023}
}