RYouReadyCleaning1 | handwriststudygroup

Gemstracker and HandWristStudyGroup data

R-course: RYouReady

General introduction to R and GitHub

Applying R to HandWristStudyGroup data

CleanItUp1: Variables

No dataset is ready for analysis the moment it has been generated. There are always superfluous variables, missing data, unclear variable names, or impossible answers (for example, someone entering a height of 180 meters instead of 180 centimeters). In addition, you sometimes need to merge multiple data files. You may also need to calculate new variables, for example because you want to look at the effect of BMI in your analysis while you only have the variables height and weight, or because you want to know how old your subjects are while you only have their year of birth and the date on which they completed a form.
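The BMI and age examples above can be sketched in a few lines of base R. This is a minimal illustration only: the data frame `subjects`, its column names, and its values are made up and are not part of the HandWristStudyGroup data. Age is approximated as the year of completion minus the year of birth, because only the birth year is assumed to be known.

```r
# Hypothetical example data: names and values are invented for illustration
subjects <- data.frame(
  height_cm    = c(180, 165),
  weight_kg    = c(81, 60),
  birth_year   = c(1980, 1995),
  completed_on = as.Date(c("2020-06-15", "2021-03-01"))
)

# BMI = weight in kilograms divided by the square of height in meters
subjects$bmi <- subjects$weight_kg / (subjects$height_cm / 100)^2

# Approximate age: year in which the form was completed minus year of birth
subjects$age <- as.integer(format(subjects$completed_on, "%Y")) - subjects$birth_year

# Flag heights outside an (assumed) plausible range of 50 to 250 cm
subjects$height_implausible <- subjects$height_cm < 50 | subjects$height_cm > 250

print(subjects)
```

The plausibility range is an assumption chosen for the example; pick limits that make sense for your own population.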

Cleaning up your data can be time-consuming; some people say it is often the most time-consuming part of a data project. Still, every dataset needs this preprocessing before it can be analysed. Clean it up!

What do we clean?

The data we usually work with in R, and that we use in this course, has the same layout as an Excel sheet: each variable has its own column and each observation its own row. A subject who has been measured once therefore has one row, with the value of a specific variable in each cell. In CleanItUp, we clean up the variables first and the rows after that. Finally, we teach you how to calculate new variables.

CleanItUp: The variables

First an Introduction by the R Ladies:

The next video (13:20 min) explains how you can organise and rename variables

Assignment cleaning names

Continue in your RYouReady_Basics script.
Install the janitor package and add a line that loads the package to your script.
Run the function 'clean_names' on the data frame Example_LongFormat and assign the result to a new variable (using <-) called cleannames_data_long. Inspect the result with the command 'view(cleannames_data_long)'. What has changed?

#Install package (only needed once)---
install.packages("janitor")

#Load package janitor---
library(janitor)

#If you get an error loading the janitor package, try----
install.packages("rlang")
install.packages("devtools")
devtools::install_github("sfirke/janitor")

#Run function clean_names---
cleannames_data_long <- clean_names(Example_LongFormat)

#Inspect the result
#view() is provided by the tibble package (attached with the tidyverse); in base R, use View()
view(cleannames_data_long)

Answers
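To see what clean_names() does without opening the full dataset, you can try it on a tiny data frame. This is a sketch with a made-up data frame `messy`; only the column names Patient.traject.ID and Survey.ID are taken from the course data, the values are invented.

```r
library(janitor)

# Hypothetical data frame: only the column names matter here
messy <- data.frame(
  Patient.traject.ID = c(1, 2),
  Survey.ID          = c(10, 11),
  behandeling        = c("A", "B")
)

# Column names before cleaning
names(messy)

# clean_names() converts the names to lower-case snake_case
cleaned <- clean_names(messy)

# Column names after cleaning: patient_traject_id, survey_id, behandeling
names(cleaned)
```

Only the names change; the values in the columns stay the same.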

Now watch the following video (2:57 min) on how to reorganize the variables in your data frame

Assignment organising variables

Now make a new data frame with the name select_data_long based on cleannames_data_long (the version of Example_LongFormat with cleaned names), with 'behandeling' as the first variable, then 'rounddescription', and then everything else.
Remove the variable patient_traject_id from the new data frame by using the 'select' command with a - sign in front of the variable name.

#Create a data frame with the name select_data_long
#The cleaned data frame is used so that the name patient_traject_id exists
select_data_long <- select(cleannames_data_long, behandeling, rounddescription, everything())

#Remove the variable patient_traject_id (originally Patient.traject.ID) from the data
select_data_long <- select(select_data_long, -patient_traject_id)

#inspect the result
view(select_data_long)

Answers
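The two uses of select() in this assignment, reordering with everything() and dropping a column with a - sign, can be tried on a small example. This is a sketch that assumes the dplyr package is installed; the data frame `toy` and its values are made up for illustration.

```r
library(dplyr)

# Hypothetical data frame with invented values
toy <- data.frame(
  patient_traject_id = c(1, 2),
  rounddescription   = c("Intake", "3 months"),
  score              = c(40, 55),
  behandeling        = c("A", "B")
)

# Put two variables first; everything() adds the remaining columns in their original order
reordered <- select(toy, behandeling, rounddescription, everything())
names(reordered)

# A minus sign drops a column and keeps all the others
trimmed <- select(reordered, -patient_traject_id)
names(trimmed)
```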

The convenience of %>%

In the next video (6:09 min) you will learn how to chain several actions in a smart and easy-to-read way with the pipe operator (%>%).
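As a quick illustration of why the pipe is easier to read, the sketch below writes the same two steps once as nested function calls and once as a pipe. It assumes the dplyr package is installed (dplyr makes %>% available); the data frame `toy` is made up for the example.

```r
library(dplyr)

# Hypothetical data frame with invented values
toy <- data.frame(
  patient_traject_id = c(1, 2),
  score              = c(40, 55),
  behandeling        = c("A", "B")
)

# Without a pipe: the calls are nested and have to be read from the inside out
nested <- select(select(toy, behandeling, everything()), -patient_traject_id)

# With a pipe: read from top to bottom; the result on the left-hand side
# becomes the first argument of the next function
piped <- toy %>%
  select(behandeling, everything()) %>%
  select(-patient_traject_id)

# Both approaches give the same data frame
identical(nested, piped)
```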

Assignment clean it up with Pipes

Make a pipe (%>%) in which you take Example_LongFormat and create a new data frame with the name data_long_clean. In the pipe, clean the names (clean_names command) and make sure that the variables 'behandeling' and 'rounddescription' are the first two variables in the new data frame, followed by the rest.
Remove the variable Patient.traject.ID (called patient_traject_id after cleaning the names) with the select() command.

From now on, use the pipe operator (%>%) for all assignments, even if this is not explicitly mentioned.

#Make a pipe in which you take Example_LongFormat and create a new data frame with the name data_long_clean----

data_long_clean <- Example_LongFormat %>%
  clean_names() %>%
  select(behandeling, rounddescription, everything()) %>%
  select(-patient_traject_id)

#Remember that because of clean_names(), Patient.traject.ID and Survey.ID are now named patient_traject_id and survey_id

Answers

Now that you have cleaned the columns (variables), you can start thinking about the rows (subjects).

CleanItUp2