-
Notifications
You must be signed in to change notification settings - Fork 2
/
README.Rmd
156 lines (121 loc) · 8.21 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
---
output: github_document
editor_options:
chunk_output_type: console
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-",
out.width = "100%"
)
options(tibble.print_min = 5, tibble.print_max = 5)
```
# inspectEHR <a href='https://cc-hic.github.io/inspectEHR/'><img src='man/figures/logo.png' align="right" height="139" /></a>
<!-- badges: start -->
[![Build Status](https://travis-ci.org/CC-HIC/inspectEHR.svg?branch=master)](https://travis-ci.org/CC-HIC/inspectEHR)
[![codecov](https://codecov.io/gh/CC-HIC/inspectEHR/branch/master/graph/badge.svg)](https://codecov.io/gh/CC-HIC/inspectEHR)
[![Lifecycle Status](https://img.shields.io/badge/lifecycle-experimental-orange.svg)](https://www.tidyverse.org/lifecycle/)
<!-- badges: end -->
## Overview
inspectEHR is a data wrangling, cleaning and reporting tool for CC-HIC. It is designed to run against the CC-HIC EAV table structure (which at present exists in PostgreSQL and SQLite flavours). We are about to undergo a major rewrite to a OHDSI CDM version 6, so this package will be in flux. Once these functions have been ported across and tested, we will aim to submit to CRAN. Please see the `R` vignettes for further details on how to use the package to perform the most common tasks:
* `extract_demographics()` produces a table for time invariant variables
* `extract_timevarying()` produces a table for longitudinal variables
* `clean()` cleans the above tables according to pre-defined standards.
* `report()` produces a data quality report (note, being fixed for IDHS usage)
## Installation
```{r, eval = FALSE}
# install directly from github with
remotes::install_github("cc-hic/inspectEHR")
```
A copy should already be installed into the group library for the CC-HIC team inside the UCL safe haven. If you are having problems with this, please contact me directly.
## Usage
A synthetic database with 1000 patients ships with inspectEHR for you to explore. Actual values are garbage, but everything is logically consitent (e.g. patients discharge after they arrive). The code to produce a more comprehensive synthetic test database is found in `data-raw/write_synthetic_data.R`. We have embedded the first 1000 patients so as to not make the synehtic database combersome.
```{r, message = FALSE}
library(inspectEHR)
# Synthetic database ships with inspectEHR
db_pth <- system.file("testdata/synthetic_db.sqlite3", package = "inspectEHR")
ctn <- connect(sqlite_file = db_pth)
# Extract static variables. Rename on the fly.
dtb <- extract_demographics(
connection = ctn,
episode_ids = 13639:13643,
code_names = c("NIHR_HIC_ICU_0017", "NIHR_HIC_ICU_0019"),
rename = c("height", "weight")
)
head(dtb)
# Extract time varying variables. Rename on the fly.
ltb <- extract_timevarying(
ctn,
episode_ids = 13639:13643,
code_names = "NIHR_HIC_ICU_0108",
rename = "hr")
head(ltb)
# Pull out to any arbitrary temporal resolution and custom define the
# behaviour for information recorded at resolution higher than you are sampling.
# only extract the first 24 hours of data
ltb_2 <- extract_timevarying(
ctn,
episode_ids = 13639:13643,
code_names = "NIHR_HIC_ICU_0108",
rename = "hr",
cadance = 2, # 1 row every 2 hours
coalesce_rows = mean, # use mean to downsample to our 2 hour cadence
time_boundaries = c(0, 24)
)
head(ltb_2)
DBI::dbDisconnect(ctn)
```
## Getting help
If you find a bug, please file a minimal reproducible example on [github](https://github.com/cc-hic/inspectEHR/issues).
## Reporting Data Quality Issues
Please submit an issue and tag it with "data quality". Data quality issues are often related to a specific site. If this is the case, please also tag the site.
## Data Quality Rules
The data quality rules are largely based upon standards set by OHDSI<sup>1</sup>, and the consensus guidelines by Khan et al.<sup>2</sup>. The CC-HIC currently uses an episode centric model. As such, many of the data quality checks are based around this way of thinking. As we move to OMOP (a patient centric model) many of these will change accordingly. General conventions follow that:
* Verification refers to an intenal check of the data, possibly against a data specification or domain knowledge.
* Validation refers to an external check of the data, against a secondary resource that is known to be trusted.
### Episode Verification
- Internal consistency check of episode
- Each episode has a unique identifier
- Each episode has a start date
- Each episode has an end date defined by one of:
- End of episode date time (often administrative in nature and used with caution)
- On unit death date time from any means
- Death
- Brain stem Death (BSD)
- "Body Removal": strongly suggesting death
- The last measured bedside physiological result (HR and SpO2)
- Death (other than BSD) cannot occur before the last caputre of a direct physiological measure (e.g. a heart rate, or a blood pressure), but could occur before the return of a lab measure (e.g. a CRP or a pathology result)
- Records that are flagged as "open" are dropped.
### Event/Value Verification
- Value conformance: data conforms to accepted type and formatting constraints
- Relational conformance: unique keys are not duplicated and there are no orphaned rows in the database
- Computational conformance: calculated and contributed values agree with oneanother
- e.g. BMI (derived from height and weight) in plausible range and agree with a contributed BMI
- e.g. APACHE in plausible range and agree with a contributed APACHE
- Completeness: contributed data items match local capability (i.e. missingness only occurs when the data doesn't exist)
- Plausibility:
- Uniqueness plausibility: descriptions of singular events/objects are not duplicated
- Atemporal plausibility:
- Events occur within their episode (within a reasonable buffer time)
- Events fall within an accepted range, follow the expected distribution and agree with internal/local knowledge
- Distributions for independent measures of the same fact are in agreement:
- e.g. Systolic > MAP > Diastolic
- e.g. blood gas sodium and lab sodium
- e.g. menorrhagia diagnosis only linked to female sex
- Repeated measures of the same event show the expected variability
- Temporal plausibility:
- Observed values, and sequences of values that represent state transitions, correspond to an expected temporal property:
- e.g. Hosp Admission -> ICU Admission -> ICU Discharge -> Hosp Discharge
- Value density over time are consistent with local expectations
- e.g. busy and quiet periods are reflected in the data
### Event Validation
The same principles listed above can be retested under a validation framework. That is to seak an external resource to validate the values in question. An aim is to check the data against the ICNARC reports, which will allows for some external validation against a gold standard resource. At present, the validation that is performed is to compaire all sites against each other. In this sense, each site is used as a validation check against the others. Distrepancies should either be due to systemic errors, or differences in casemix.
## Synthetic Database
There is a copy of the CC-HIC database located in `data-raw/synthetic_db.sqlite3`. This is a structurally sound copy of the real database, but entirely hand crafted (so there is no patient data here). It is quite sparse at present, but I will add more variables as time goes on.
For now, it acts as a test database to make sure inspectEHR works as it should.
---
1. https://www.ohdsi.org/analytic-tools/achilles-for-data-characterization/
2. Kahn, Michael G.; Callahan, Tiffany J.; Barnard, Juliana; Bauck, Alan E.; Brown, Jeff; Davidson, Bruce N.; Estiri, Hossein; Goerg, Carsten; Holve, Erin; Johnson, Steven G.; Liaw, Siaw-Teng; Hamilton-Lopez, Marianne; Meeker, Daniella; Ong, Toan C.; Ryan, Patrick; Shang, Ning; Weiskopf, Nicole G.; Weng, Chunhua; Zozus, Meredith N.; and Schilling, Lisa (2016) "A Harmonized Data Quality Assessment Terminology and Framework for the Secondary Use of Electronic Health Record Data," eGEMs (Generating Evidence & Methods to improve patient outcomes): Vol. 4: Iss. 1, Article 18.