Once we import the data, there are some immediate steps to follow for a good workflow.
Say you have imported the data and it is available in the environment. Let us say our data is iris; you can type View(iris) in the console, run it (hit Enter), and have a look at the dataset.
There are a series of commands you might want to check once the data is imported.
dim(iris)
[1] 150 5
You guessed it right, this is the number of rows and columns: the iris dataset has 150 rows (or observations) and 5 columns (or variables).
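If you want the two numbers separately, for example to check them against what you expected from the source file, base R has nrow() and ncol(). This is a minimal sketch; the expected counts in the check are simply the ones shown above for iris, and you would swap in your own.

```r
# iris ships with base R (datasets package), so nothing needs importing here.
n_rows <- nrow(iris)  # number of observations
n_cols <- ncol(iris)  # number of variables

cat("Rows:", n_rows, "Columns:", n_cols, "\n")

# Stop early if the import does not have the shape we expected.
stopifnot(n_rows == 150, n_cols == 5)
```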
Next, look at the first and last few rows. It is otherwise called checking the head and tail ends of the data.
head(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
tail(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
145 6.7 3.3 5.7 2.5 virginica
146 6.7 3.0 5.2 2.3 virginica
147 6.3 2.5 5.0 1.9 virginica
148 6.5 3.0 5.2 2.0 virginica
149 6.2 3.4 5.4 2.3 virginica
150 5.9 3.0 5.1 1.8 virginica
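Both functions show six rows by default. If you want fewer or more, you can pass the n argument; the value 3 below is just an arbitrary choice for illustration.

```r
# Show only the first and last three rows instead of the default six.
first_rows <- head(iris, n = 3)
last_rows  <- tail(iris, n = 3)

print(first_rows)
print(last_rows)
```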
There are two ways to check the structure of the data, that is, the type of each variable.
The first is by checking the structure of the data with str():
str(iris)
'data.frame': 150 obs. of 5 variables:
$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
The other way is glimpse(), available through a library called tidyverse:
library(tidyverse)
glimpse(iris)
Rows: 150
Columns: 5
$ Sepal.Length <dbl> 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.…
$ Sepal.Width <dbl> 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.…
$ Petal.Length <dbl> 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.…
$ Petal.Width <dbl> 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.…
$ Species <fct> setosa, setosa, setosa, setosa, setosa, setosa,…
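Both outputs above are meant for reading. If you need the variable types as a value you can work with (say, to count how many numeric columns you have), a small base R sketch like the following might help. It is not part of either function above, just a complementary check.

```r
# Collect the class of every column into a named character vector.
var_types <- sapply(iris, class)
print(var_types)

# Count how many columns are numeric and how many are factors.
n_numeric <- sum(var_types == "numeric")
n_factor  <- sum(var_types == "factor")
cat("Numeric columns:", n_numeric, "Factor columns:", n_factor, "\n")
```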
After importing, many people “eyeball” the data; the dfSummary() function is equivalent to that, only more objective. Once you start using this, you will appreciate its utility.
library(summarytools)
# view(dfSummary(iris))
Uncommenting and running the second line opens a summary page in your browser.
This gives a compact summary of all the variables: variable type, stats and histograms for continuous variables and levels for categorical variables. This table also includes valid and missing values. This comes in handy after import to quickly see the data.
Be mindful of doing this for a large dataset; it might take a lot of time. In that case, you can pick only the variables you want the summary to cover, using select (more on this in a later post).
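As a rough sketch of the idea, here I keep two columns of iris (an arbitrary choice for illustration) and summarise only those. I use base R indexing so the example runs without extra packages; dplyr::select(iris, Sepal.Length, Species) would give the same subset. The fallback to summary() is my own addition for the case where summarytools is not installed.

```r
# Keep only the variables of interest (two columns picked for illustration).
iris_subset <- iris[, c("Sepal.Length", "Species")]

# dfSummary() comes from summarytools; fall back to base summary() if the
# package is not installed.
if (requireNamespace("summarytools", quietly = TRUE)) {
  print(summarytools::dfSummary(iris_subset))
} else {
  print(summary(iris_subset))
}
```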
There is no missing data in iris. But I want to show how beautifully missing data can be captured, so I shall use another dataset with some NAs.
library(readxl)
df <- read_excel("data.xlsx",
na = "NA")
df
# A tibble: 5 × 2
x y
<dbl> <dbl>
1 1 2
2 2 4
3 3 NA
4 4 6
5 5 NA
I created a dummy dataset with a few NAs.
vis_miss(df)
Note that as we have already loaded the library, we can just call the function vis_miss.
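If you also want the numbers behind the picture, base R can count the missing values directly. In this sketch I rebuild the dummy data by hand with data.frame() so it runs without the Excel file; the values are the same as in the tibble printed above.

```r
# Recreate the dummy data shown above without needing data.xlsx.
df <- data.frame(
  x = c(1, 2, 3, 4, 5),
  y = c(2, 4, NA, 6, NA)
)

# Number of missing values in each column: x has 0, y has 2.
na_per_column <- colSums(is.na(df))
print(na_per_column)

# Percentage of all cells that are missing.
pct_missing <- 100 * mean(is.na(df))
cat("Missing cells (%):", pct_missing, "\n")
```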
I suggest not trying vis_miss on all your data at once. Here, in the demonstration, there are only a few variables; in reality you might have many variables, which will clutter the vis_miss output. To avoid this, select only the variables of interest to visualize missing values.
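Here is a minimal sketch of that advice. The data frame wide_df and its id column are made up for illustration, and I am assuming vis_miss is the one from the visdat package; the plot is only built if that package is installed.

```r
# Hypothetical dataset with one column (id) we do not care about here.
wide_df <- data.frame(
  id = 1:5,
  x  = c(1, 2, 3, 4, 5),
  y  = c(2, 4, NA, 6, NA)
)

# Keep only the variables whose missingness we want to inspect.
vars_of_interest <- c("x", "y")
subset_df <- wide_df[, vars_of_interest]

# Assumption: vis_miss() from visdat. Skipped if the package is missing.
if (requireNamespace("visdat", quietly = TRUE)) {
  p <- visdat::vis_miss(subset_df)
  # print(p) draws the plot
}
```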
Let me know in the comments what more you would like to do with the data before running your descriptives and analyses.
Happy after importing!
For attribution, please cite this work as
Soundararajan (2021, Aug. 2). My R Space: Get a glimpse of the data after importing. Retrieved from https://github.com/soundarya24/SoundBlog/posts/2021-08-01-after-importing-data/
BibTeX citation
@misc{soundararajan2021get,
  author = {Soundararajan, Soundarya},
  title = {My R Space: Get a glimpse of the data after importing},
  url = {https://github.com/soundarya24/SoundBlog/posts/2021-08-01-after-importing-data/},
  year = {2021}
}