Get a glimpse of the data after importing

EDA Beginner

Right after importing the data, there are a few immediate steps to follow for a good workflow.

Soundarya Soundararajan
08-02-2021

Once you have imported the data, it is available in your environment. Let us say our data is iris: type View(iris) in the console and run it (hit Enter) to have a look at the dataset.

There is a series of commands you might want to run once the data is imported.

Check the dimensions of the data

dim(iris)
[1] 150   5

You guessed it right: this is the number of rows and columns. The iris dataset has 150 rows (or observations) and 5 columns (or variables).
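If you need only one of the two numbers, base R also has nrow() and ncol(). A minimal sketch on the same iris data:

```r
# Number of rows (observations) and columns (variables), one at a time
nrow(iris)    # 150
ncol(iris)    # 5

# dim() returns both numbers as a vector, so indexing works too
dim(iris)[1]  # 150
```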

Check the top and bottom of the data

It is otherwise called checking the head and tail ends of the data.

head(iris) 
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
tail(iris)
    Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
145          6.7         3.3          5.7         2.5 virginica
146          6.7         3.0          5.2         2.3 virginica
147          6.3         2.5          5.0         1.9 virginica
148          6.5         3.0          5.2         2.0 virginica
149          6.2         3.4          5.4         2.3 virginica
150          5.9         3.0          5.1         1.8 virginica
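Both functions show six rows by default. If you would like more or fewer, you can pass the n argument. A small sketch:

```r
# First three rows of the data
head(iris, n = 3)

# Last three rows of the data (rows 148 to 150)
tail(iris, n = 3)
```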

Get a glimpse of the variables

There are two ways to do this.

By checking the structure of the data

str(iris)
'data.frame':   150 obs. of  5 variables:
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
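str() tells us that Species is a factor with 3 levels, but it abbreviates the list. Assuming you want to see all the levels, or just the class of each column, base R can show that as well:

```r
# Class of every column, returned as a named character vector
sapply(iris, class)

# All levels of the factor column
levels(iris$Species)
# "setosa" "versicolor" "virginica"
```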

The other way is to use glimpse(), which becomes available when you load the tidyverse (it comes from the dplyr package).

library(tidyverse)
glimpse(iris)
Rows: 150
Columns: 5
$ Sepal.Length <dbl> 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.…
$ Sepal.Width  <dbl> 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.…
$ Petal.Length <dbl> 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.…
$ Petal.Width  <dbl> 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.…
$ Species      <fct> setosa, setosa, setosa, setosa, setosa, setosa,…

Get an overall summary of the dataset

After importing, many people “eyeball” the data. The dfSummary() function from summarytools does the same job, only more objectively. Once you start using it, you will appreciate its utility.

library(summarytools)
# view(dfSummary(iris))

This opens the following page in your browser.

[Figure: A beautiful summary of your dataset]
This gives a compact summary of all the variables: variable type, stats and histograms for continuous variables, and levels for categorical variables. The table also includes the counts of valid and missing values. It comes in handy right after import for a quick look at the data.

Be mindful of doing this on a large dataset; it might take a lot of time. In that case, you can choose the variables you want in the summary using select() (more on this in a later post).
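Until that post arrives, here is a rough sketch of the idea using base R subsetting instead of select(). The name iris_small is only an illustration, and the summarytools lines are commented out, as above, so that nothing opens unless you want it to:

```r
# Keep only the columns you want summarised (base R subsetting)
iris_small <- iris[, c("Sepal.Length", "Species")]

# Confirm that only two columns are left
dim(iris_small)  # 150 rows, 2 columns

# Then summarise the smaller data frame as before
# library(summarytools)
# view(dfSummary(iris_small))
```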

Visualize the missing data

There is no missing data in iris, but I want to show how well vis_miss() captures missing data, so I shall use another dataset with some NAs.

library(readxl)
df <- read_excel("data.xlsx", 
    na = "NA")
df
# A tibble: 5 × 2
      x     y
  <dbl> <dbl>
1     1     2
2     2     4
3     3    NA
4     4     6
5     5    NA

I created a dummy dataset with a few NAs.
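Before plotting, it helps to know how many values are missing. Here is a minimal base R sketch; df_demo is a hypothetical stand-in that recreates the dummy data above, since data.xlsx is not shared with this post:

```r
# Hypothetical stand-in for the Excel file used above
df_demo <- data.frame(x = c(1, 2, 3, 4, 5),
                      y = c(2, 4, NA, 6, NA))

anyNA(iris)              # FALSE: iris has no missing values
colSums(is.na(df_demo))  # NAs per column: x = 0, y = 2
mean(is.na(df_demo$y))   # 0.4, i.e. 40% of y is missing
```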

library(visdat) # assumed here; provides vis_miss() (naniar re-exports it too)
vis_miss(df)

Note that once the library is loaded, we can just call the function vis_miss() by its name.

I suggest not running vis_miss() on all your data at once. In this demonstration there are only a few variables; in reality you might have many, and they will clutter the vis_miss() output. To avoid this, select only the variables of interest before visualizing missing values.

library(tidyverse)
iris %>% # your data frame here, followed by the pipe (%>%)
  select(Sepal.Length, Species) %>% # select() comes from dplyr, loaded with the tidyverse
  vis_miss()

Let me know in the comments what more you would like to do with the data before running your descriptives and analyses.

Happy after importing!

Citation

For attribution, please cite this work as

Soundararajan (2021, Aug. 2). My R Space: Get a glimpse of the data after importing. Retrieved from https://github.com/soundarya24/SoundBlog/posts/2021-08-01-after-importing-data/

BibTeX citation

@misc{soundararajan2021get,
  author = {Soundararajan, Soundarya},
  title = {My R Space: Get a glimpse of the data after importing},
  url = {https://github.com/soundarya24/SoundBlog/posts/2021-08-01-after-importing-data/},
  year = {2021}
}