Learn a streamlined, beginner-friendly approach to transform raw data into clean, analyzable data in R.
Data cleaning is often considered the most time-consuming step in any data analysis project (Chu et al. 2016). But it doesn’t have to be! In this guide, I will walk you through a simplified outline of how to work with raw data and transform it into analyzable data while maintaining a reproducible workflow in R.
Before you begin, ensure you’re working within an R project. If you’re unfamiliar with setting up R projects, refer to my blog post on starting projects in RStudio.
Required organization:
scripts folder: to store R scripts
cleanData folder: to store cleaned data
Place your raw data inside the rawData folder. This could be in the form of CSV, Excel, or any other file format. For this example, I will use a CSV file named penguins.csv. This can be done manually by copying the file into the designated folder in your project directory.
At this stage, the data is only on your system, not in your R environment.
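If you prefer to set up the folders from the R console rather than by hand, a minimal sketch could look like this. It assumes your working directory is the project root; the folder names are the ones used in this post, and the loop is only an illustration, not a required step.

```r
# Folder names used in this post; run this from the project root.
project_folders <- c("rawData", "scripts", "cleanData")

for (folder in project_folders) {
  # showWarnings = FALSE keeps this quiet if the folder already exists
  dir.create(folder, showWarnings = FALSE)
}

# Confirm that all three folders now exist
dir.exists(project_folders)
```

Copying penguins.csv into rawData can still be done manually, as described above.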
Next, import the raw data into your R environment. Here’s how:
Write a script to load the data (e.g., data_import.R) and save this script inside your scripts folder.
If you are wondering about the here package, refer to my blog post on it. The here package helps create paths that are relative to the project directory.
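A data_import.R script along these lines would do the job. Treat it as a sketch under stated assumptions: the object name penguins_raw is hypothetical, and base R's file.path() stands in for here::here() so that the example runs without extra packages. A temporary folder with a few made-up rows plays the role of the project, so the code can be run as is.

```r
# Stand-in project: a temporary folder plays the role of the project root.
project_root <- file.path(tempdir(), "demo_project")
dir.create(file.path(project_root, "rawData"),
           recursive = TRUE, showWarnings = FALSE)

# Made-up rows, written only so the example has a file to read.
writeLines(
  c("species,body_mass_g,sex",
    "Adelie,3750,male",
    "Adelie,3800,",
    "Gentoo,,female"),
  file.path(project_root, "rawData", "penguins.csv")
)

# The import itself: build the path from its parts, then read the CSV.
# With the here package installed, the path would instead be
# here::here("rawData", "penguins.csv").
raw_path <- file.path(project_root, "rawData", "penguins.csv")

# Treat empty cells as missing values (NA) while reading
penguins_raw <- read.csv(raw_path, na.strings = c("", "NA"))

dim(penguins_raw)
```

In a real project you would skip the stand-in part and point the path at your own rawData/penguins.csv.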
Before cleaning, take a quick look at the data to understand its structure and identify potential issues like missing values (NAs). For example:
This step helps you decide what transformations or cleaning steps are required. I see a lot of NAs in the sex column. I will address this in the next step.
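A quick inspection might look like the sketch below. The small data frame is a made-up stand-in for the imported penguins data (in the real workflow, use the object created by your import script), and the object names are assumptions. All functions are base R.

```r
# Made-up stand-in for the imported data; replace with your own object.
penguins_raw <- data.frame(
  species     = c("Adelie", "Adelie", "Gentoo", "Chinstrap"),
  body_mass_g = c(3750, 3800, NA, 3500),
  sex         = c("male", NA, "female", NA)
)

str(penguins_raw)      # column names, types, and the first few values
summary(penguins_raw)  # per-column summaries; numeric columns also report NAs

# Number of missing values in each column
na_counts <- colSums(is.na(penguins_raw))
na_counts
```

Counting NAs per column makes it easy to spot which columns, such as sex here, need attention before analysis.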
Cleaning data involves removing or dealing with inconsistencies like missing values. Here’s an example of a basic cleaning process:
Remove rows with missing values (NAs):
Save this cleaned version as an object in R.
At this stage, your cleaned data exists in the R environment but hasn’t been saved to a file.
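One way to do this in base R is na.omit(), which drops every row that contains at least one NA. The sketch below uses made-up stand-in rows and hypothetical object names; other functions, such as tidyr::drop_na(), do the same job.

```r
# Made-up stand-in for the imported data; replace with your own object.
penguins_raw <- data.frame(
  species     = c("Adelie", "Adelie", "Gentoo", "Chinstrap"),
  body_mass_g = c(3750, 3800, NA, 3500),
  sex         = c("male", NA, "female", "female")
)

# Keep only the rows with no missing values and store them as a new object
penguins_clean <- na.omit(penguins_raw)

# Compare row counts before and after cleaning
nrow(penguins_raw)
nrow(penguins_clean)
```

Dropping rows is the simplest option, but it discards any row with a missing value in any column, so it is worth checking how many rows remain.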
Finally, save the cleaned data to the cleanData folder. This ensures that you can access the cleaned data for analysis without having to repeat the cleaning process.
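Saving could be sketched as follows. The file name penguins_clean.csv and the object names are assumptions, the rows are made up, and a temporary folder stands in for the project root so the example runs anywhere. In a real project, the path would point at your own cleanData folder, for example via here::here("cleanData", "penguins_clean.csv").

```r
# Made-up stand-in for the cleaned data; replace with your own object.
penguins_clean <- data.frame(
  species     = c("Adelie", "Chinstrap"),
  body_mass_g = c(3750, 3500),
  sex         = c("male", "female")
)

# Stand-in project root with a cleanData folder inside it
project_root <- file.path(tempdir(), "demo_project")
clean_dir <- file.path(project_root, "cleanData")
dir.create(clean_dir, recursive = TRUE, showWarnings = FALSE)

# Hypothetical file name for the cleaned data
clean_path <- file.path(clean_dir, "penguins_clean.csv")

# row.names = FALSE avoids writing an extra column of row numbers
write.csv(penguins_clean, clean_path, row.names = FALSE)

file.exists(clean_path)
```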
Now, your cleaned data is saved and ready for analysis.
The full workflow is here:
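As a compact sketch, the import, inspect, clean, and save steps can be wrapped in one function. The function name clean_raw_csv, the file name penguins_clean.csv, and the made-up rows are all assumptions for illustration, and base R is used throughout so the example runs without extra packages.

```r
# Hypothetical helper: import, inspect, clean, and save in one call.
clean_raw_csv <- function(raw_path, clean_path) {
  # Import: read the raw CSV, treating empty cells as NA
  raw <- read.csv(raw_path, na.strings = c("", "NA"))

  # Inspect: show the number of missing values per column
  print(colSums(is.na(raw)))

  # Clean: drop every row that contains at least one NA
  clean <- na.omit(raw)

  # Save: create the output folder if needed, then write the CSV
  dir.create(dirname(clean_path), recursive = TRUE, showWarnings = FALSE)
  write.csv(clean, clean_path, row.names = FALSE)

  invisible(clean)
}

# Demo in a temporary stand-in project with made-up rows
project_root <- file.path(tempdir(), "demo_workflow")
raw_path <- file.path(project_root, "rawData", "penguins.csv")
clean_path <- file.path(project_root, "cleanData", "penguins_clean.csv")

dir.create(dirname(raw_path), recursive = TRUE, showWarnings = FALSE)
writeLines(
  c("species,body_mass_g,sex",
    "Adelie,3750,male",
    "Adelie,3800,",
    "Gentoo,,female",
    "Chinstrap,3500,female"),
  raw_path
)

penguins_clean <- clean_raw_csv(raw_path, clean_path)
nrow(penguins_clean)
```

In your own project, you would pass paths inside rawData and cleanData instead of the temporary ones.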
For attribution, please cite this work as
Soundararajan (2024, Dec. 6). My R Space: From Raw Data to Analyzable Data: 5 Simple Steps for Beginners. Retrieved from https://github.com/soundarya24/SoundBlog/posts/2024-12-06-from-raw-data-to-analyzable-data-5-simple-steps-for-beginners/
BibTeX citation
@misc{soundararajan2024from,
  author = {Soundararajan, Soundarya},
  title = {My R Space: From Raw Data to Analyzable Data: 5 Simple Steps for Beginners},
  url = {https://github.com/soundarya24/SoundBlog/posts/2024-12-06-from-raw-data-to-analyzable-data-5-simple-steps-for-beginners/},
  year = {2024}
}