Density Plots

EDA, Beginner, Density plots

A detailed walkthrough of drawing density plots.

Soundarya Soundararajan
08-08-2021

I prefer density plots because I find they depict distributions better than histograms. A density plot is, in effect, a smoothed version of a histogram.
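To make the "smoothed histogram" idea concrete, here is a minimal sketch that draws a histogram of mpg on the density scale and overlays the density curve on top. The bin count of 10, the grey fill, and the object name p_compare are arbitrary choices of mine, and after_stat() assumes ggplot2 3.3.0 or later.

```r
library(ggplot2)

# histogram rescaled to density units so both layers share the same y axis
p_compare <- ggplot(mtcars, aes(mpg)) +
  geom_histogram(
    aes(y = after_stat(density)),
    bins = 10, # arbitrary bin count, chosen for illustration
    fill = "grey80",
    color = "white"
  ) +
  geom_density() # the smooth curve drawn over the bars

p_compare
```

Changing the number of bins changes the look of the histogram quite a bit, while the density curve stays the same.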

Libraries

library(tidyverse) # just tidyverse is enough for today

For today’s demonstration, we will work with the mtcars dataset. We will use two of its variables: mpg, the miles per gallon (aka mileage), and am, the transmission type (automatic or manual).
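If you have not met these two columns before, a quick peek along these lines may help. This is only a sketch; the object name am_counts is my own, and the 0/1 coding noted in the comment comes from the mtcars help page.

```r
library(dplyr)

# first few rows of the two columns used in this post
mtcars %>%
  select(mpg, am) %>%
  head()

# number of cars per transmission code (0 = automatic, 1 = manual)
am_counts <- count(mtcars, am)
am_counts
```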

Want to cut to the chase? Skip ahead to the final plot or the much prettier plot.

Density plot - rough

Let’s draw a density plot of mileage and color it by transmission type.

mtcars %>% # your data here
  ggplot(aes(mpg, # inside aesthetics, add your variable of interest
    fill = am # we want to fill the plots by transmission type
  )) +
  geom_density() # :-)

Okay and not okay. What is the problem with this figure? We did not get the colored plot we expected. But why? To fill by group, the grouping variable must be coded as a categorical variable, which R calls a factor. Let’s check how the am variable is coded.

Checking the structure

str(mtcars)
'data.frame':   32 obs. of  11 variables:
 $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
 $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
 $ disp: num  160 160 108 258 360 ...
 $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
 $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
 $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
 $ qsec: num  16.5 17 18.6 19.4 17 ...
 $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
 $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
 $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
 $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

Ah, we see that the am variable is coded as a numeric variable. Let’s change its type to a factor.
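If you only care about one column, a narrower check than str() on the whole data frame is possible. The sketch below uses base R only; am_factor is a name I made up for illustration.

```r
# class of the am column as shipped with mtcars
class(mtcars$am)
is.factor(mtcars$am)

# factor() turns the 0/1 codes into a categorical variable with two levels
am_factor <- factor(mtcars$am)
levels(am_factor)
```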

Redrawing the plot

mtcars %>%
  ggplot(aes(mpg,
    # note that we code am variable as a factor
    fill = factor(am)
  )) +
  geom_density()

Now the colors follow transmission type. However, where the two densities overlap, one hides part of the other. Hence I prefer lighter, semi-transparent fills.

Add alpha to see overlaps

Setting alpha inside a geom controls its transparency: values run from 0 (fully transparent) to 1 (fully opaque).
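A minimal sketch of this step, reusing fill = factor(am) from the previous plot, could look like the following. I picked alpha = 0.4 because that is the value used in the later plots of this post; p_alpha is just an illustrative name.

```r
library(ggplot2)

p_alpha <- ggplot(mtcars, aes(mpg, fill = factor(am))) +
  geom_density(alpha = 0.4) # semi-transparent fill so overlaps stay visible

p_alpha
```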

We also see that the am groups, and therefore the legend entries, are coded as 0 and 1. It would be better to spell out what 0 and 1 mean.

To achieve this, let us recode the variable using the mutate command and store the result under a new name.

Changing to factor variable

df <- mtcars %>%
  mutate(
    am = # am variable should be mutated (changed) to..
    factor(am, # a factor variable..
      levels = c(0, 1), # which has these levels
      labels = c("Automatic", "Manual") # add corresponding names to the levels
    )
  )
# to check
str(df$am)
 Factor w/ 2 levels "Automatic","Manual": 2 2 2 1 1 1 1 1 1 1 ...

Yup, it is a factor with 2 levels now. Awesome!
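Beyond checking the type, it can be reassuring to confirm that every 0 became "Automatic" and every 1 became "Manual". One way to do that, sketched here with base R's table(), is to cross-tabulate the old codes against the new labels. The name recode_check is my own, and the recoding is repeated so the snippet runs on its own.

```r
library(dplyr)

df <- mtcars %>%
  mutate(am = factor(am, levels = c(0, 1), labels = c("Automatic", "Manual")))

# rows: original 0/1 codes, columns: new labels
recode_check <- table(original = mtcars$am, recoded = df$am)
recode_check
```

All counts should sit on the diagonal; anything off the diagonal would mean the labels were attached to the wrong codes.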

Add title and labels

df %>%
  ggplot(aes(mpg, fill = am)) +
  geom_density(alpha = 0.4) +
  theme_classic() +
  labs( # I am adding title and axes labels using this command
    x = "Miles/(US) gallon",
    y = "Density",
    title = "Density plots of mileage by transmission type"
  )

The legend title still needs some work; am is not intuitive.

Change legend title

df %>%
  ggplot(aes(mpg, fill = am)) +
  geom_density(alpha = 0.4) +
  theme_classic() +
  labs(
    x = "Miles/(US) gallon",
    y = "Density",
    title = "Density plots of mileage by transmission type",
    fill = "Transmission type"
  )

To change the legend title, I set fill inside labs(), because fill is the aesthetic that the legend describes.
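As an aside, labs() is not the only route. The sketch below gets the same legend title by naming the fill scale instead, which can be handy if you are already customizing that scale. It is an alternative I am adding for illustration, not the approach used in the rest of this post, and p_legend is a made-up name.

```r
library(dplyr)
library(ggplot2)

df <- mtcars %>%
  mutate(am = factor(am, levels = c(0, 1), labels = c("Automatic", "Manual")))

p_legend <- ggplot(df, aes(mpg, fill = am)) +
  geom_density(alpha = 0.4) +
  theme_classic() +
  scale_fill_discrete(name = "Transmission type") # legend title set on the scale

p_legend
```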

This is pretty good and we can end it here. But if you want to step this up, try adding the sample sizes to the density plots.

Saving sample sizes as dataframe

# sample sizes by transmission group
df %>%
  group_by(am) %>%
  summarize(samplesize = n())
# A tibble: 2 × 2
  am        samplesize
  <fct>          <int>
1 Automatic         19
2 Manual            13
# let's save this with the coordinates to plot.

# to annotate
fortext <- df %>%
  group_by(am) %>%
  summarize(samplesize = n()) %>%
  mutate(
    # i added these manually to match up with the final figure
    x = c(12, 30),
    y = c(0.075, 0.055)
  )

We will now use this summary from the object fortext to annotate the plots.
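The x and y values above were typed in by hand. If you would rather derive a starting position from the data, one option is to place each label just above the peak of its group's density curve. The sketch below is an assumption-heavy alternative, not what the final figure uses: density_peak() and fortext_auto are hypothetical names, the 0.01 offset is arbitrary, and stats::density() with its defaults may differ slightly from what geom_density() draws, so some manual adjustment may still be needed.

```r
library(dplyr)

df <- mtcars %>%
  mutate(am = factor(am, levels = c(0, 1), labels = c("Automatic", "Manual")))

# hypothetical helper: location and height of the highest point of a
# kernel density estimate, using the defaults of stats::density()
density_peak <- function(v) {
  d <- density(v)
  i <- which.max(d$y)
  c(x = d$x[i], y = d$y[i])
}

fortext_auto <- df %>%
  group_by(am) %>%
  summarize(
    samplesize = n(),
    x = density_peak(mpg)[["x"]],
    # arbitrary offset so the text sits just above the curve
    y = density_peak(mpg)[["y"]] + 0.01
  )

fortext_auto
```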

Prettier final plot

df %>%
  ggplot(aes(mpg, fill = am)) +
  geom_density(alpha = 0.8) +
  geom_text( # we call this to annotate
    data = fortext, # using this data we created earlier
    aes(
      x = x, # x and y we input manually
      y = y,
      label = am, # asking to label the transmission types
      color = am
    ),
    fontface = "bold" # a constant, so it goes outside aes()
  ) +
  scale_fill_manual(values = c("#868E74", "#AD5988")) +
  # adding sample sizes
  geom_text(
    data = fortext,
    aes(
      x = x,
      y = y - 0.005, # a bit lower than the transmission type label
      color = am,
      label = str_glue(
        "n = {samplesize}" # I want n = before the sample sizes
      )
    ),
    fontface = "bold"
  ) +
  scale_color_manual(values = c("#868E74", "#AD5988")) +
  theme_classic() +
  labs(
    x = "Miles/(US) gallon",
    y = "Density",
    title = "Density plots of mileage by transmission type"
  ) +
  theme(legend.position = "none") # no longer required since we have labeled them in the plots

Oh how much I love it!

“Jump for Joy” by kreg.steppe is licensed with CC BY-SA 2.0. To view a copy of this license, visit https://creativecommons.org/licenses/by-sa/2.0/

I would probably do this if this was my first density plot!

See you in the next one!

Citation

For attribution, please cite this work as

Soundararajan (2021, Aug. 8). My R Space: Density Plots. Retrieved from https://github.com/soundarya24/SoundBlog/posts/2021-08-05-density-plots/

BibTeX citation

@misc{soundararajan2021density,
  author = {Soundararajan, Soundarya},
  title = {My R Space: Density Plots},
  url = {https://github.com/soundarya24/SoundBlog/posts/2021-08-05-density-plots/},
  year = {2021}
}