I love exploring data as soon as I import it. In this post, I describe some basics of exploratory analyses and graphs that reveal a lot about your data.
For the initial steps right after importing the data, I have written a blog post here.
head(iris) # displays just the top of the data of interest
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
As a first exploration, I like to check whether any data are missing. For this we will use vis_miss() from the naniar package.
library(naniar) # provides vis_miss()
vis_miss(iris)
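A numeric check can complement the plot. The following is a minimal sketch in base R (not part of naniar) that counts missing values per column; the variable names `missing_per_column` and `total_missing` are my own choices for illustration.

```r
# Count missing values in each column (base R, no extra packages needed)
missing_per_column <- colSums(is.na(iris))
print(missing_per_column)

# Total number of missing cells in the whole data frame
total_missing <- sum(is.na(iris))
print(total_missing)
```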
This dataset is complete. Wondering how it would look if some data were actually missing?
library(readxl)
data <- read_excel(
  "data.xlsx", # importing my own data
  na = "NA"    # missing values are coded as NA in my data
)
vis_miss(data)
The black band indicates missing data. For datasets with many observations, this will more often look like a line than a band.
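If you do not have a file with gaps at hand, you can simulate one. This is only a sketch: the seed, the number of blanked values, and the name `iris_missing` are arbitrary assumptions, and the plot is drawn only if naniar happens to be installed.

```r
set.seed(42) # arbitrary seed, only so the example is reproducible

# Copy iris and blank out 10 randomly chosen Sepal.Length values
iris_missing <- iris
rows_to_blank <- sample(nrow(iris_missing), size = 10)
iris_missing$Sepal.Length[rows_to_blank] <- NA

# Numeric check: missing values per column
print(colSums(is.na(iris_missing)))

# Visual check, skipped if naniar is not installed
if (requireNamespace("naniar", quietly = TRUE)) {
  print(naniar::vis_miss(iris_missing))
}
```

Changing `size` lets you see how the band thins out or thickens as the share of missing values changes.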
Most often, I explore data with boxplots to study the spread of each variable.
library(dplyr)   # provides the %>% pipe
library(ggplot2) # provides ggplot(), aes(), geom_boxplot()
iris %>%
  ggplot(aes(x = Species, y = Sepal.Length)) +
  geom_boxplot(width = 0.5)
I explore outliers quickly this way and color them if needed.
iris %>%
  ggplot(aes(x = Species, y = Sepal.Length)) +
  geom_boxplot(width = 0.5, outlier.color = "red")
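To list the outlying values rather than just see them, one option is base R's boxplot.stats(). This is a sketch, not the method the plots above use internally: boxplot.stats() applies the 1.5 × IQR rule of base boxplot(), while ggplot2 computes the quartiles slightly differently, so the two can disagree at the margins. The name `outliers_by_species` is my own.

```r
# Split Sepal.Length by species, then extract the points that fall
# beyond the whiskers according to boxplot.stats()
outliers_by_species <- lapply(
  split(iris$Sepal.Length, iris$Species),
  function(x) boxplot.stats(x)$out
)
print(outliers_by_species)
```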
I have neither added any customization nor modified themes as these are exploratory. Some call these “dirty plots”, but I don’t as I find all outputs from R beautiful. :-)
For a more polished version, see here.
If you prefer the base plot method, it's much simpler, and the code is below.
boxplot(x = iris$Sepal.Length)
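The call above draws a single box for all 150 flowers. To get one box per species, as in the ggplot version, base boxplot() also accepts a formula. A minimal sketch follows; `box_stats` is just an illustrative name for the statistics that boxplot() returns invisibly.

```r
# One box per species, using the formula interface of base boxplot()
box_stats <- boxplot(
  Sepal.Length ~ Species,
  data = iris,
  ylab = "Sepal.Length",
  outcol = "red" # color of the outlier points
)

# The returned list holds the group names and the plotted statistics
print(box_stats$names)
```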
In my personal experience, I have moved away from base plots (many might argue against this) and grown closer to ggplot methods.
Pick what you are comfortable with.
Last but not least, summary statistics are one of the quickest and most effective ways to get an overview of a dataset or its variables.
summary(iris) # full data
Sepal.Length Sepal.Width Petal.Length Petal.Width
Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
Median :5.800 Median :3.000 Median :4.350 Median :1.300
Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
Species
setosa :50
versicolor:50
virginica :50
summary(iris$Sepal.Length) #variable of interest
Min. 1st Qu. Median Mean 3rd Qu. Max.
4.300 5.100 5.800 5.843 6.400 7.900
# If the variable is a factor, the number of observations in each level is displayed.
summary(iris$Species)
setosa versicolor virginica
50 50 50
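Since the boxplots above compare species, it can also help to compute the summaries per group. This is a sketch using base R only; `species_means` and `summary_by_species` are names I chose for illustration.

```r
# Mean Sepal.Length for each species
species_means <- aggregate(Sepal.Length ~ Species, data = iris, FUN = mean)
print(species_means)

# Full summary() output for each species
summary_by_species <- tapply(iris$Sepal.Length, iris$Species, summary)
print(summary_by_species)
```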
Happy exploring until part-2 of exploratory graphs.
For attribution, please cite this work as
Soundararajan (2022, April 1). My R Space: Exploratory graphs Part - 1. Retrieved from https://github.com/soundarya24/SoundBlog/posts/2022-04-01-exploratory-graphs-part-1/
BibTeX citation
@misc{soundararajan2022exploratory, author = {Soundararajan, Soundarya}, title = {My R Space: Exploratory graphs Part - 1}, url = {https://github.com/soundarya24/SoundBlog/posts/2022-04-01-exploratory-graphs-part-1/}, year = {2022} }