From Raw Data to Analyzable Data: 5 Simple Steps for Beginners

Data import Beginner

Learn a streamlined, beginner-friendly approach to transform raw data into clean, analyzable data in R.

Soundarya Soundararajan
2024-12-06

Data cleaning is often considered the most time-consuming step in any data analysis project (Chu et al. 2016). But it doesn’t have to be! In this guide, I will walk you through a simplified outline of how to turn raw data into analyzable data while maintaining a reproducible workflow in R.

Prerequisites

Before you begin, ensure you’re working within an R project. If you’re unfamiliar with setting up R projects, refer to my blog post on starting projects in RStudio.

Required organization:

Screenshot of the basic folder structure
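The same layout can also be created from the R console. What follows is a minimal sketch in base R, assuming only the three folder names used in this post (rawData, Scripts, cleanData). The name project_dir is a hypothetical stand-in that points to a temporary folder so the example runs anywhere; inside a real R project you would set it to ".".

```r
# Stand-in for the project root (assumption); use "." inside a real R project.
project_dir <- file.path(tempdir(), "my_project")

# Folder names taken from this post.
folders <- c("rawData", "Scripts", "cleanData")

# Create each folder; recursive = TRUE also creates project_dir if it is missing.
for (folder in folders) {
  dir.create(file.path(project_dir, folder), recursive = TRUE, showWarnings = FALSE)
}

# Confirm that all three folders now exist.
dir.exists(file.path(project_dir, folders))
```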

Step 1: Organize Your Raw Data

Place your raw data inside the rawData folder. The file can be a CSV, an Excel workbook, or any other format. For this example, I will use a CSV file named penguins.csv. You can do this manually by copying the file into the rawData folder in your project directory.

At this stage, the data is only on your system, not in your R environment.

Step 2: Import the Data into R

Next, import the raw data into your R environment. Here’s how:

Write a script to load the data (e.g., data_import.R) and save this script inside your Scripts folder.

How my script to import data looks

If you are wondering about the here package, refer to my blog post on the here package. The here package helps create paths that are relative to the project directory.
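Since the script itself appears only as a screenshot, here is a rough sketch of what an import script could look like. It is not necessarily the script from the screenshot. To keep it runnable anywhere, it uses base R and first writes a tiny made-up penguins.csv into a temporary stand-in project folder; project_dir, raw_path, penguins_raw, and the three sample rows are all assumptions for illustration.

```r
# data_import.R: a sketch of an import script.

# Stand-in project with a tiny, made-up penguins.csv (assumption).
# In a real project, rawData/penguins.csv already exists and this part is not needed.
project_dir <- file.path(tempdir(), "my_project")
dir.create(file.path(project_dir, "rawData"), recursive = TRUE, showWarnings = FALSE)
writeLines(
  c("species,bill_length_mm,sex",
    "Adelie,39.1,male",
    "Adelie,NA,NA",
    "Gentoo,46.1,female"),
  file.path(project_dir, "rawData", "penguins.csv")
)

# Build the path to the raw file.
# With the here package, the equivalent would be here::here("rawData", "penguins.csv").
raw_path <- file.path(project_dir, "rawData", "penguins.csv")

# Read the CSV into the R environment; the text "NA" is read as a missing value.
penguins_raw <- read.csv(raw_path)
head(penguins_raw)
```

After this runs, the data lives in the R environment as penguins_raw, while the file in rawData stays untouched.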

Step 3: Explore the Data

Before cleaning, take a quick look at the data to understand its structure and identify potential issues like missing values (NAs). For example:
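A quick exploration might look like the sketch below. It uses a small made-up data frame (penguins_raw, an assumption standing in for the imported data) so that it runs on its own; with real data you would apply the same calls to the object you imported.

```r
# Made-up stand-in for the imported data (assumption, not the real penguins file).
penguins_raw <- data.frame(
  species = c("Adelie", "Adelie", "Gentoo", "Chinstrap"),
  bill_length_mm = c(39.1, NA, 46.1, 49.0),
  sex = c("male", NA, NA, "female")
)

# Structure: column names, types, and the first few values.
str(penguins_raw)

# Per-column summaries; numeric columns also report their NA count.
summary(penguins_raw)

# Number of missing values in each column.
na_counts <- colSums(is.na(penguins_raw))
na_counts
```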

This step helps you decide what transformations or cleaning steps are required. I see a lot of NAs in the sex column, which I will address in the next step.

Step 4: Clean the Data

Cleaning data involves removing or dealing with inconsistencies like missing values. Here’s an example of a basic cleaning process:

Remove rows with missing values (NAs):

Save this cleaned version as an object in R.
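One way to do this in base R is na.omit(), sketched below on a small made-up data frame (an assumption standing in for the imported data). The original post may have used a different function; the object names penguins_raw and penguins_clean are illustrative.

```r
# Made-up stand-in for the imported data (assumption).
penguins_raw <- data.frame(
  species = c("Adelie", "Adelie", "Gentoo", "Chinstrap"),
  bill_length_mm = c(39.1, NA, 46.1, 49.0),
  sex = c("male", NA, NA, "female")
)

# Drop every row that has an NA in any column and keep the result as a new object.
penguins_clean <- na.omit(penguins_raw)

# Check how many rows were removed and that no NAs remain.
nrow(penguins_raw) - nrow(penguins_clean)
anyNA(penguins_clean)
```

Note that this drops a row if any column is missing, which can discard a lot of data; restricting the check to the columns you actually need is often a gentler choice.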

At this stage, your cleaned data exists in the R environment but hasn’t been saved to a file.

Step 5: Save the Cleaned Data

Finally, save the cleaned data to the cleanData folder. This ensures that you can access the cleaned data for analysis without having to repeat the cleaning process.
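A minimal sketch of the save step in base R follows. The file name penguins_clean.csv, the object penguins_clean, and the temporary project_dir are assumptions made so the example runs on its own; in a real project the path would point at your own cleanData folder.

```r
# Made-up stand-in for the cleaned data from the previous step (assumption).
penguins_clean <- data.frame(
  species = c("Adelie", "Chinstrap"),
  bill_length_mm = c(39.1, 49.0),
  sex = c("male", "female")
)

# Stand-in project root (assumption); use "." inside a real R project.
project_dir <- file.path(tempdir(), "my_project")
dir.create(file.path(project_dir, "cleanData"), recursive = TRUE, showWarnings = FALSE)

# Write the cleaned data; row.names = FALSE avoids an extra index column in the file.
clean_path <- file.path(project_dir, "cleanData", "penguins_clean.csv")
write.csv(penguins_clean, clean_path, row.names = FALSE)

# Confirm the file was written.
file.exists(clean_path)
```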

Now, your cleaned data is saved and ready for analysis.

The full workflow is here:
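As a compact recap, Steps 2 to 5 can be wrapped in a single function. This is a sketch under assumptions, not the original workflow: run_workflow, the output file name penguins_clean.csv, and the demo data are hypothetical, and base R is used in place of the here package so the example runs anywhere.

```r
# Run Steps 2 to 5 for a project rooted at project_dir.
run_workflow <- function(project_dir) {
  raw_path   <- file.path(project_dir, "rawData", "penguins.csv")
  clean_path <- file.path(project_dir, "cleanData", "penguins_clean.csv")

  penguins_raw <- read.csv(raw_path)           # Step 2: import
  print(colSums(is.na(penguins_raw)))          # Step 3: explore missing values
  penguins_clean <- na.omit(penguins_raw)      # Step 4: clean

  # Step 5: save, creating cleanData first if it does not exist yet.
  dir.create(dirname(clean_path), recursive = TRUE, showWarnings = FALSE)
  write.csv(penguins_clean, clean_path, row.names = FALSE)

  invisible(penguins_clean)
}

# Demo on a temporary stand-in project with a tiny made-up CSV (assumption).
demo_dir <- file.path(tempdir(), "workflow_demo")
dir.create(file.path(demo_dir, "rawData"), recursive = TRUE, showWarnings = FALSE)
writeLines(
  c("species,sex", "Adelie,male", "Gentoo,NA"),
  file.path(demo_dir, "rawData", "penguins.csv")
)
cleaned <- run_workflow(demo_dir)
```

In a real project, calling run_workflow(".") from the project root would read rawData/penguins.csv and write the cleaned file into cleanData.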

References

Chu, Xu, Ihab F. Ilyas, Sanjay Krishnan, and Jiannan Wang. 2016. “Data Cleaning: Overview and Emerging Challenges.” In Proceedings of the 2016 International Conference on Management of Data, 2201–6. SIGMOD ’16. Association for Computing Machinery. https://doi.org/10.1145/2882903.2912574.

Citation

For attribution, please cite this work as

Soundararajan (2024, Dec. 6). My R Space: From Raw Data to Analyzable Data: 5 Simple Steps for Beginners. Retrieved from https://github.com/soundarya24/SoundBlog/posts/2024-12-06-from-raw-data-to-analyzable-data-5-simple-steps-for-beginners/

BibTeX citation

@misc{soundararajan2024from,
  author = {Soundararajan, Soundarya},
  title = {My R Space: From Raw Data to Analyzable Data: 5 Simple Steps for Beginners},
  url = {https://github.com/soundarya24/SoundBlog/posts/2024-12-06-from-raw-data-to-analyzable-data-5-simple-steps-for-beginners/},
  year = {2024}
}