Histograms

Histograms Beginner ggplot

Learn to draw histograms in three beginner-friendly ways.

Soundarya Soundararajan
08-05-2021

Histograms not only tell you about the distribution of the variable of interest, but also inform you of outliers. This makes histograms one of my favorite plots in exploratory data analysis.

My dataset to demonstrate histograms today is iris. Briefly speaking, iris is a dataset with 5 variables, which gives us measurements on 50 flowers from each of 3 species of iris (150 flowers in total). To know more about this dataset, type ?iris in the console and hit enter. You will see the relevant information open in the help window.
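Before plotting, it can help to look at the dataset's structure. The lines below are a minimal sketch using only functions that ship with R; nothing here is specific to this post beyond the built-in iris data.

```r
# iris is built into R, so no loading step is needed
str(iris)            # variable names and types: 4 numeric columns and 1 factor
dim(iris)            # number of rows and columns
table(iris$Species)  # how many flowers were measured per species
head(iris)           # first six rows
```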

Now, let us draw histograms.

Base R method

Prerequisites

What do I mean when I say base R?

I just mean I am going to use functions that come with R itself and do not require loading any additional packages (generally, you load a package like this: library("package name")).
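One way to see this for yourself, as a small sketch: hist() lives in the graphics package, which R attaches automatically at startup, so no library() call is needed before using it.

```r
# Packages attached in a fresh R session (graphics is among them by default)
search()

# hist() comes from the graphics package, which is part of base R
"package:graphics" %in% search()
environmentName(environment(hist))
```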

How to call a variable in the dataset?

For example, I want to draw a histogram for one of the variables in the iris dataset, let’s say Sepal length. This is how I call the variable.

iris$Sepal.Length #note the operator $
  [1] 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9 5.4 4.8 4.8 4.3 5.8 5.7
 [17] 5.4 5.1 5.7 5.1 5.4 5.1 4.6 5.1 4.8 5.0 5.0 5.2 5.2 4.7 4.8 5.4
 [33] 5.2 5.5 4.9 5.0 5.5 4.9 4.4 5.1 5.0 4.5 4.4 5.0 5.1 4.8 5.1 4.6
 [49] 5.3 5.0 7.0 6.4 6.9 5.5 6.5 5.7 6.3 4.9 6.6 5.2 5.0 5.9 6.0 6.1
 [65] 5.6 6.7 5.6 5.8 6.2 5.6 5.9 6.1 6.3 6.1 6.4 6.6 6.8 6.7 6.0 5.7
 [81] 5.5 5.5 5.8 6.0 5.4 6.0 6.7 6.3 5.6 5.5 5.5 6.1 5.8 5.0 5.6 5.7
 [97] 5.7 6.2 5.1 5.7 6.3 5.8 7.1 6.3 6.5 7.6 4.9 7.3 6.7 7.2 6.5 6.4
[113] 6.8 5.7 5.8 6.4 6.5 7.7 7.7 6.0 6.9 5.6 7.7 6.3 6.7 7.2 6.2 6.1
[129] 6.4 7.2 7.4 7.9 6.4 6.3 6.1 7.7 6.3 6.4 6.0 6.9 6.7 6.9 5.8 6.8
[145] 6.7 6.7 6.3 6.5 6.2 5.9

You see that all values of sepal length are displayed when I call the variable.
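The $ operator is not the only way to pull a column out of a data frame. As a short sketch, the three forms below return the same numeric vector, and summary() gives a compact overview instead of printing all the values.

```r
# Three equivalent ways to extract one column as a vector
a <- iris$Sepal.Length
b <- iris[["Sepal.Length"]]
c_col <- iris[, "Sepal.Length"]

identical(a, b)      # same values, same type
identical(a, c_col)

# A compact overview instead of printing every value
length(a)
summary(a)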

Histograms by the hist() function

Histograms in base R are drawn by the function hist().

hist(iris$Sepal.Length)

That’s it, we have a histogram. I often stop here, without much customization, when I perform exploratory data analysis.
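If you are curious how hist() decided where the bars go, it also returns the binning it computed. The sketch below uses plot = FALSE to get that object without drawing, and then passes custom breaks; the choice of 0.25-wide bins is only an illustration, not something from the plot above.

```r
x <- iris$Sepal.Length

# Compute the histogram without drawing it
h <- hist(x, plot = FALSE)
h$breaks   # bin edges chosen by the default rule
h$counts   # number of flowers in each bin

# Narrower bins via the breaks argument (illustrative choice of 0.25)
h_fine <- hist(x, breaks = seq(4, 8, by = 0.25), plot = FALSE)
length(h_fine$counts)   # number of bins
```

Drop plot = FALSE from either call to draw the histogram as usual.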

What are the problems in the current plot? None, really. It just needs some aesthetic improvements to make it prettier and, more importantly, publishable.

So, let’s learn to tweak this a bit, in case you have to draw one for your manuscript.

Color of the bars: the default is grey, but we can change this.

Change the bar colors

hist(iris$Sepal.Length, #note the comma, to add the next argument
     col = "darkgreen") #note the argument col; the color is given within quotes.

Now, the distinction between bars is not clear. Let’s correct it.

Make the bars stand out

hist(iris$Sepal.Length,
     col = "darkgreen",
     border = "white") 

By adding a white border to the dark bars, we have made the bars stand out.

Wouldn’t labeling the bars look good?

Label the counts

hist(iris$Sepal.Length,
     col = "darkgreen",
     border = "white",
     labels = TRUE)

Yes!

Now if we edit the title and axis labels, it would look much better.

Edit title and axes labels

hist(iris$Sepal.Length,
     col = "darkgreen",
     border = "white",
     labels = TRUE,
     main = "Histogram of Sepal Length from Iris dataset", #note main changes the plot title
     xlab = "Sepal Length in cm") #xlab changes the x-axis label

I think it looks perfect. Feel free to skip the unnecessary steps and edit as needed.
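If the plot is meant for a manuscript, you will also want it in a file. One possible approach in base R, sketched here, is to open a graphics device, draw the plot, and close the device. The file name, the temporary folder, and the 6 x 4 inch size are my assumptions for illustration; png() and tiff() follow the same open, draw, close pattern.

```r
# Hypothetical output location; replace with a path in your project
out_file <- file.path(tempdir(), "sepal_length_hist.pdf")

pdf(out_file, width = 6, height = 4)   # open a PDF device (size in inches)
hist(iris$Sepal.Length,
     col = "darkgreen",
     border = "white",
     labels = TRUE,
     main = "Histogram of Sepal Length from Iris dataset",
     xlab = "Sepal Length in cm")
dev.off()                              # close the device so the file is written

file.exists(out_file)
```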

Now, there is one more way of producing plots, which is more customizable.

Enter the world of ggplot. I personally use base R to get to know the dataset, but I prefer ggplot for all the plots I export.

The ggplot method

library(tidyverse)

Why did I load the tidyverse library when I only needed ggplot? The tidyverse is a bundle of many packages that are often required for plots and analyses. So, as a habit, I load tidyverse, which includes ggplot2, whenever I start my analyses.

Prerequisite

  1. Get introduced to the pipe operator %>%. It comes from another package, magrittr, and it will be essential here. To know more about pipes, see the magrittr documentation. For now, just know that the pipe passes whatever is on its left to the function on its right, so it lets you chain steps of code together, and that it becomes available when you load the tidyverse.
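To make the pipe less mysterious, here is a small sketch showing that the piped and unpiped forms do the same thing. It loads ggplot2 and magrittr directly rather than the whole tidyverse, which is my choice to keep the example light; bins = 10 is set only to avoid the message about the default number of bins.

```r
library(ggplot2)
library(magrittr)   # provides %>%

# The pipe passes the left-hand side as the first argument on the right
identical(iris %>% head(3), head(iris, 3))

# So these two plots are built from the same data in the same way
p_piped  <- iris %>% ggplot(aes(x = Sepal.Length)) + geom_histogram(bins = 10)
p_direct <- ggplot(iris, aes(x = Sepal.Length)) + geom_histogram(bins = 10)

# Compare the bar heights that ggplot computed for each plot
all.equal(layer_data(p_piped)$count, layer_data(p_direct)$count)
```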

Produce the plot ggplot way

iris %>% #with the pipe we ask R to connect the data with the ggplot function
  ggplot(aes(x = Sepal.Length)) + 
  geom_histogram() 

Note that between ggplot() and the geom function there is a plus symbol, not the pipe. We have produced a plot, but we need to customize it.

Change colors and bins

iris %>% 
  ggplot(aes(x = Sepal.Length)) +
  geom_histogram(
    bins = 10, # to learn more about bins, see side bar
    color = "white", #note that here it is color, unlike col in base R
    fill = "darkgreen"
  ) 

Looks much better, but we still need to get rid of the grey background and change the axis titles.
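A quick note on bins, as a sketch: bins sets how many bars to draw, while binwidth sets how wide each bar is in the units of the variable. The 0.5 cm width and the boundary of 4 below are illustrative choices of mine, not values used elsewhere in this post.

```r
library(ggplot2)

# Fix the number of bars
p_bins <- ggplot(iris, aes(x = Sepal.Length)) +
  geom_histogram(bins = 10, color = "white", fill = "darkgreen")

# Fix the width of each bar instead (0.5 cm), with bin edges aligned at 4
p_width <- ggplot(iris, aes(x = Sepal.Length)) +
  geom_histogram(binwidth = 0.5, boundary = 4,
                 color = "white", fill = "darkgreen")

# Inspect the counts ggplot computed for each version
layer_data(p_bins)$count
layer_data(p_width)$count
```

Print p_bins or p_width to see the plots. Either way, every flower ends up in exactly one bar; only the grouping changes.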

Add title and axes titles

iris %>% 
  ggplot(aes(x = Sepal.Length)) +
  geom_histogram(
    bins = 10, 
    color = "white", 
    fill = "darkgreen"
  ) +
  labs(x = "Sepal Length in cm",
       title = "Histogram of Sepal Length from iris dataset")

Change the background

iris %>% 
  ggplot(aes(x = Sepal.Length)) +
  geom_histogram(
    bins = 10, 
    color = "white", 
    fill = "darkgreen"
  ) +
  labs(x = "Sepal Length in cm",
       title = "Histogram of Sepal Length from iris dataset") +
  theme_classic() 

I change the background by changing the theme, and there are many themes to choose from. When you start typing theme_ you will see a number of options come up; select any one and play around. Note that the theme is added last, with a plus symbol.
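To play around with themes without retyping everything, one option is to store the plot in an object and add a different theme each time. The sketch below also saves one version with ggsave(); the object names, the temporary file path, and the 6 x 4 inch size are assumptions for illustration.

```r
library(ggplot2)

# Build the plot once and keep it in an object
p <- ggplot(iris, aes(x = Sepal.Length)) +
  geom_histogram(bins = 10, color = "white", fill = "darkgreen") +
  labs(x = "Sepal Length in cm",
       title = "Histogram of Sepal Length from iris dataset")

# Try different built-in themes on the same plot
p_classic <- p + theme_classic()
p_minimal <- p + theme_minimal()
p_bw      <- p + theme_bw()

# Save one version to a file (hypothetical path; size in inches)
out_file <- file.path(tempdir(), "sepal_length_ggplot.pdf")
ggsave(out_file, plot = p_classic, width = 6, height = 4)
```

Typing p_classic, p_minimal, or p_bw in the console displays the corresponding plot.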

Easiest way

Probably the easiest way to produce plots is via the esquisse package. Once you install esquisse with install.packages("esquisse"), it gets added to RStudio as an add-in. You will find it under the Addins menu in RStudio.

After installing, click the Addins menu.

There you can find the ggplot builder.

Search for esquisse in the add-ins.

This gives you an interface where you can click to select the x and y variables and the plot type, without having to write the code yourself. Once the plot is produced, you can export the code for reproducibility.

This is a great find for beginners who want to produce plots quickly, and for those who are frustrated with learning to code.
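If you prefer the console to the Addins menu, the builder can also be started from code. This is a minimal sketch that assumes esquisse is already installed; the guard only makes sure the interface is launched in an interactive session, since it opens a point-and-click window.

```r
# Run once if needed: install.packages("esquisse")

# Only launch when working interactively and when esquisse is available
can_launch <- interactive() && requireNamespace("esquisse", quietly = TRUE)

if (can_launch) {
  esquisse::esquisser(iris)   # opens the builder with iris preloaded
}
```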

That’s it for histograms today. I personally prefer density plots to histograms, and I will describe them in a later post.

Happy plotting until then!

Citation

For attribution, please cite this work as

Soundararajan (2021, Aug. 5). My R Space: Histograms. Retrieved from https://github.com/soundarya24/SoundBlog/posts/2021-08-04-histograms/

BibTeX citation

@misc{soundararajan2021histograms,
  author = {Soundararajan, Soundarya},
  title = {My R Space: Histograms},
  url = {https://github.com/soundarya24/SoundBlog/posts/2021-08-04-histograms/},
  year = {2021}
}