4.0. ETL and Selection | handwriststudygroup

4.0 Generating data sets in R from GemsTracker output

Selecting specific data sets

For the HandWristStudyGroup data, data sets are established in two steps:

ETL
Selection

ETL (Extract, Transform, Load)

In the video below (29:17 min), Jeanne Bakx explains the basics of an ETL script.

As discussed in the video, the data exported from GemsTracker contains two files per survey:

A file with the respondent data, containing everything that each subject filled in. A row is added each time someone fills in the form, so one subject can have multiple rows.
A codebook of the survey, defining the variable names and the scoring.
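
As a rough illustration of these two files, the sketch below builds a tiny respondent table and a matching codebook by hand. All column names, codes, and values are hypothetical and are not taken from a real GemsTracker export.

```r
# Hypothetical respondent data: one row per completed form,
# so subject "P01" appears twice.
respondents <- data.frame(
  patient_id   = c("P01", "P01", "P02"),
  survey_id    = c(101, 101, 101),
  completed_on = as.Date(c("2020-01-10", "2020-04-12", "2020-02-03")),
  q1           = c("A1", "A2", "A3"),
  stringsAsFactors = FALSE
)

# Hypothetical codebook: variable names and the meaning of each answer code.
codebook <- data.frame(
  survey_id   = c(101, 101, 101),
  question    = c("q1", "q1", "q1"),
  answer_code = c("A1", "A2", "A3"),
  answer_text = c("No pain", "Some pain", "Severe pain"),
  stringsAsFactors = FALSE
)

# Count how many rows each subject contributed.
rows_per_subject <- table(respondents$patient_id)
print(rows_per_subject)
```

In a real ETL both tables would be read from the exported files rather than typed in.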

Since measurement tracks in GemsTracker generally contain multiple questionnaires (PROMs or clinician-reported forms), data from all questionnaires have to be combined in a comprehensible format and coupled at the subject level, so that you know which scores from different questionnaires belong to the same subject.

In an ETL script, all exports of all questionnaires in a measurement track are imported into R, preprocessed, and, after some basic cleaning, stored in separate files, with each questionnaire or form as an element of a list.
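
A minimal sketch of this idea, assuming two already cleaned questionnaires that share a subject identifier. The names patient_id, pain_form, function_form, and track_data are made up for the example and do not come from the HandWristStudyGroup scripts.

```r
# Hypothetical cleaned exports of two questionnaires in one measurement track.
pain_form <- data.frame(
  patient_id = c("P01", "P02", "P03"),
  pain_score = c(3, 7, 5),
  stringsAsFactors = FALSE
)
function_form <- data.frame(
  patient_id     = c("P01", "P02"),
  function_score = c(80, 55),
  stringsAsFactors = FALSE
)

# Keep every questionnaire as a named element of one list.
track_data <- list(pain = pain_form, hand_function = function_form)

# Couple the questionnaires at the subject level. all = TRUE keeps subjects
# who filled in only one of the forms; their missing score becomes NA.
combined <- Reduce(
  function(x, y) merge(x, y, by = "patient_id", all = TRUE),
  track_data
)
print(combined)
```

Keeping the list next to the merged table leaves each questionnaire available on its own while still allowing subject-level comparisons.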

The ETL of the HandWristStudyGroup is even more extensive, since it combines data from 8 different measurement tracks (e.g., a track for a patient undergoing thumb surgery and a track for a patient with a neuropathy).

Although ETLs can be defined in different ways, below we explain how an ETL can be made for a very simple case: a measurement track with only two questionnaires, using example data from a few simulated patients.

In the R repository, in the folder scripts/ETL, you can find an R script called example_ETL.R that follows the steps below.

Matching codebooks

In the first step of the ETL, the script checks whether the questions in each survey have remained the same. This check is necessary because changes are sometimes made to the data collection, and these need to be detected. It is therefore the first step that must pass before you can run the rest of the ETL script.

To check the matching, the codebooks of all surveys are loaded and combined into one large codebook. To make sure all questionIDs are unique, the title of the question is combined with the surveyID into the variable rowID (lines 36-40), resulting in the following:

[Figure: Matching codebooks]
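
The construction of rowID described above could look roughly like the sketch below. The column names surveyID and rowID follow the text; the column name title, the survey numbers, and the question titles are assumptions made for the example, and this is not a copy of example_ETL.R.

```r
# Two hypothetical codebooks that share the question title "pain".
codebook_a <- data.frame(
  surveyID = 101,
  title    = c("pain", "function"),
  stringsAsFactors = FALSE
)
codebook_b <- data.frame(
  surveyID = 102,
  title    = c("pain", "satisfaction"),
  stringsAsFactors = FALSE
)

# Combine all codebooks into one large codebook.
codebook <- rbind(codebook_a, codebook_b)

# The title alone is not unique, so combine it with the surveyID.
codebook$rowID <- paste(codebook$surveyID, codebook$title, sep = "_")
print(codebook)
```

Here "pain" occurs in both surveys, but "101_pain" and "102_pain" are distinct identifiers.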

In the HandWristStudyGroup ETL, the first step likewise checks whether the questions in each survey have remained the same, and this check must pass before the rest of the ETL script runs. If a survey differs from the previous version of the export, a warning reports a mismatch with the codebook, along with information about the missing or added question. If no surveys have changed, the ETL script continues and loads the respondent data.
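
One way such a check could be written is sketched below, comparing the question identifiers of the previous export with those of the current one. The function check_codebook and the identifiers are hypothetical; the actual HandWristStudyGroup script may do this differently.

```r
# Hypothetical question identifiers from the previous and the current export.
previous_ids <- c("101_pain", "101_function", "102_pain")
current_ids  <- c("101_pain", "102_pain", "102_satisfaction")

# Report questions that disappeared or were added since the previous export.
check_codebook <- function(previous, current) {
  missing_ids <- setdiff(previous, current)
  added_ids   <- setdiff(current, previous)
  matches     <- length(missing_ids) == 0 && length(added_ids) == 0
  if (!matches) {
    warning(
      "Mismatch with the codebook. Missing: ",
      paste(missing_ids, collapse = ", "),
      "; added: ",
      paste(added_ids, collapse = ", "),
      call. = FALSE
    )
  }
  list(matches = matches, missing = missing_ids, added = added_ids)
}

# This call raises a warning because the two exports differ.
result <- check_codebook(previous_ids, current_ids)

# Only continue loading respondent data when the codebooks match.
if (result$matches) {
  message("Codebooks match, loading respondent data.")
}
```

Returning the missing and added identifiers as well as issuing the warning makes it easy to inspect the differences afterwards.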

Re-coding raw data

The raw data exported from GemsTracker stores all given answers as LimeSurvey answer codes. These answer codes are not always easy to interpret. Before recoding, the raw data often has a structure like the one shown below: the left part contains all the information about the measurement tracks (red), the middle part shows the token information (green), and the right part holds the answers to the questions of the given survey (blue).
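
Re-coding can be sketched as a lookup of each answer code in the codebook. The codes, labels, and column names below are invented for the example and do not reflect the real LimeSurvey codes.

```r
# Hypothetical codebook entries for one question.
codebook <- data.frame(
  question    = c("pain", "pain", "pain"),
  answer_code = c("A1", "A2", "A3"),
  answer_text = c("None", "Moderate", "Severe"),
  stringsAsFactors = FALSE
)

# Hypothetical raw answers, stored as answer codes.
raw <- data.frame(
  patient_id = c("P01", "P02", "P03"),
  pain       = c("A3", "A1", "A2"),
  stringsAsFactors = FALSE
)

# Look up each raw code in the codebook and replace it by its label.
pain_codes   <- codebook[codebook$question == "pain", ]
recoded      <- raw
recoded$pain <- pain_codes$answer_text[match(raw$pain, pain_codes$answer_code)]
print(recoded)
```

A code that does not occur in the codebook becomes NA with this approach, which makes unexpected codes easy to spot.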

Selection

In selection, the researcher takes the output from the ETL and defines exactly which subjects to include, based on specific characteristics (e.g., specific treatments) and the availability of specific data (e.g., only subjects who filled in both the baseline and the 12-month follow-up). In addition, the researcher can select which questionnaires to include in further analysis and which format the data file should have (e.g., a 'long' or a 'wide' format).
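
The sketch below shows one possible selection in base R: keep one treatment, keep only subjects with both time points, and convert the result from long to wide format. The data, the column names, and the time point labels are hypothetical, and this is not the selection script presented in the videos.

```r
# Hypothetical ETL output in long format: one row per subject per time point.
etl_output <- data.frame(
  patient_id = c("P01", "P01", "P02", "P03", "P03"),
  treatment  = c("thumb", "thumb", "thumb", "nerve", "nerve"),
  timepoint  = c("baseline", "12m", "baseline", "baseline", "12m"),
  pain_score = c(7, 2, 6, 5, 4),
  stringsAsFactors = FALSE
)

# Select on a characteristic: keep one treatment only.
selected <- etl_output[etl_output$treatment == "thumb", ]

# Select on data availability: keep subjects with both required time points.
required <- c("baseline", "12m")
has_all <- tapply(
  selected$timepoint, selected$patient_id,
  function(tp) all(required %in% tp)
)
complete_ids <- names(has_all)[has_all]
long_data <- selected[selected$patient_id %in% complete_ids, ]

# Long format has one row per measurement; wide format has one row per
# subject, with a separate score column for each time point.
wide_data <- reshape(
  long_data,
  idvar     = c("patient_id", "treatment"),
  timevar   = "timepoint",
  direction = "wide"
)
print(wide_data)
```

Long format tends to suit repeated-measures models and plotting, while wide format is convenient for comparing two time points per subject.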

We have separate versions of the selection script. In the two videos below (3:11 min + 16:19 min), Lisa Hoogendam presents the selection script she has developed and explains the long and wide data formats that can be produced.

Problems watching this video? Click here to watch the video on YouTube

Problems watching this video? Click here to watch the video on YouTube