Today we add colors to the scatterplot.
Welcome to Day 6 of “Viz with Me”!
Yesterday, we practiced creating scatterplots and adding labels to the axes.
library(tidyverse)
library(palmerpenguins)
penguins %>%
  ggplot(aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point() +
  labs(title = "Flipper Length vs Body Mass",
       x = "Flipper Length (mm)",
       y = "Body Mass (g)",
       subtitle = "Penguins dataset",
       caption = "Source: Palmer Penguins")
Goals for today: let’s add colors to the plot we have built so far.
We specify the color inside the aes() function, using the name of the column that contains the categorical variable.
But how do we know which variables in the dataset are categorical?
We can use the glimpse() function to check the data type of each column.
library(palmerpenguins)
library(tidyverse)
glimpse(penguins)
Rows: 344
Columns: 8
$ species <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Ad…
$ island <fct> Torgersen, Torgersen, Torgersen, Torgersen…
$ bill_length_mm <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39…
$ bill_depth_mm <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19…
$ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193…
$ body_mass_g <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 46…
$ sex <fct> male, female, female, NA, female, male, fe…
$ year <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, …
glimpse() comes from the dplyr package. It gives a concise summary of the dataset. Don’t worry, dplyr is loaded when you load tidyverse.
The output shows the data type of each column. Look for columns with the data type factor, abbreviated as fct. These are the categorical variables in the dataset.
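If scanning the glimpse() output by eye feels error-prone, the factor columns can also be listed in code. This is a minimal sketch, assuming a dplyr version that provides where() (1.0.0 or later); the name factor_cols is only illustrative.

```r
library(tidyverse)
library(palmerpenguins)

# Keep only the columns stored as factors, then return their names
factor_cols <- penguins %>%
  select(where(is.factor)) %>%
  names()

factor_cols
```

Any column name this returns is a candidate for the color aesthetic.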
There are three categorical variables in the penguins dataset: species, island, and sex.
We can use any of these to color the points.
Now, let’s add colors based on the species of penguins.
Here’s the updated code:
penguins %>%
  ggplot(aes(bill_length_mm, body_mass_g,
             color = species)) +
  geom_point()
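Also, did you notice? I didn’t write x = and y = inside aes() here. That’s because aes() takes x and y as its first two arguments, so ggplot2 matches them by position. Once you’re comfortable with specifying the x and y axes, you can skip naming them too. Just remember, the first variable goes to the x-axis and the second to the y-axis. Note that this plot uses bill_length_mm on the x-axis instead of yesterday’s flipper_length_mm, and because there is no labs() call, the axis labels default to the column names.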
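To see that the positional and named forms describe the same mapping, the two aes() calls can be compared directly. This is a small sketch; the object names positional and named are only illustrative.

```r
library(ggplot2)

# Positional form: the first argument maps to x, the second to y
positional <- aes(bill_length_mm, body_mass_g, color = species)

# Named form: the same mapping with x and y spelled out
named <- aes(x = bill_length_mm, y = body_mass_g, color = species)

# Both mappings carry the same aesthetic names
names(positional)
names(named)
```

Aesthetics other than x and y, such as color, always need to be named.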
If you’re wondering whether color can be set inside aes() within geom_point(), you are right. The following code would also work:
penguins %>%
  ggplot(aes(bill_length_mm, body_mass_g)) +
  geom_point(aes(color = species))
However, we prefer the first method for the following reason:
Right now, the code is simple, so adding colors inside geom_point() might seem fine. But as we add more layers, such as regression lines, we want the colors to stay consistent across all of them. Aesthetics set in ggplot(aes()) are inherited by every layer, while aesthetics set inside geom_point() apply to the points only. This is why it’s better to define the color in ggplot(aes()) itself.
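The difference becomes visible once a second layer is added. The sketch below is only an illustration: geom_smooth(method = "lm") stands in for the regression lines mentioned above, and the names p_global and p_local are made up for this comparison.

```r
library(tidyverse)
library(palmerpenguins)

# Color set in ggplot(aes()): every layer inherits color = species
p_global <- penguins %>%
  ggplot(aes(bill_length_mm, body_mass_g, color = species)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

# Color set in geom_point(aes()): only the points know about species
p_local <- penguins %>%
  ggplot(aes(bill_length_mm, body_mass_g)) +
  geom_point(aes(color = species)) +
  geom_smooth(method = "lm", se = FALSE)

p_global
p_local
```

In the first plot, geom_smooth() draws one line per species in the matching colors. In the second, it draws a single line through all the penguins, because the smoothing layer never receives the color mapping.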
The full code would look like this:
penguins %>%
  ggplot(aes(bill_length_mm, body_mass_g,
             color = species)) +
  geom_point() +
  labs(
    title = "My First Plot",
    x = "Bill Length of the Penguins",
    y = "Body Mass of the Penguins",
    caption = "Data from the Palmer Penguins package"
  )
The legend is added to the plot automatically. Its title is the name of the column we used to color the points, and its labels are the species of penguins found in that column.
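Since the legend title defaults to the column name, it can be overridden the same way as the axis labels, through labs(). This is a minimal sketch; the title text "Penguin species" and the object name p are only examples.

```r
library(tidyverse)
library(palmerpenguins)

p <- penguins %>%
  ggplot(aes(bill_length_mm, body_mass_g, color = species)) +
  geom_point() +
  # The color argument renames the legend that belongs to the color aesthetic
  labs(color = "Penguin species")

p
```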
That’s it for today! Tomorrow, we’ll learn how to differentiate between groups using shapes in a plot. Stay tuned!
Jump ahead to Day 7.
For attribution, please cite this work as
Soundararajan (2024, Oct. 6). My R Space: Day 6 of viz with me. Retrieved from https://github.com/soundarya24/SoundBlog/posts/2024-10-06-day-6-of-viz-with-me/
BibTeX citation
@misc{soundararajan2024day,
  author = {Soundararajan, Soundarya},
  title = {My R Space: Day 6 of viz with me},
  url = {https://github.com/soundarya24/SoundBlog/posts/2024-10-06-day-6-of-viz-with-me/},
  year = {2024}
}