The pmdplyr package is an extension to dplyr designed for cleaning and
managing panel and hierarchical data. It contains variations on the
dplyr mutate
and _join
functions that address common panel data
needs, and contains functions for managing and cleaning panel data. The
goal is to get you a nice tidy pibble
panel data object, which you can
panel_convert()
for use in one of the many packages that help you
analyze panel data.
Unlike other panel data packages, functions in pmdplyr are all designed to work even if there is more than one observation per individual per period. This comes in handy if each individual is observed multiple times per period - for example, multiple classes per student per term; or if you have hierarchical data - for example, multiple companies per country.
Examples of pmdplyr use are below. These examples only cover some of the
functionality of the package. See the
Reference
pkgdown
page for a full list of functions, or help(pmdplyr)
.
You can install pmdplyr from CRAN:
install.packages("pmdplyr")
The development version can be installed from GitHub:
# install.packages("devtools")
devtools::install_github("NickCH-K/pmdplyr")
Let’s start with the fairly straightforward Scorecard
data, which
describes how well students who attended that college are doing years
after attendance. Scorecard
observations are uniquely identified by
college ID unitid
and year year
.
# Note that pmdplyr automatically loads dplyr as well
library(pmdplyr)
data(Scorecard)
# Let's declare it as a pibble panel-data tibble
Scorecard <- as_pibble(Scorecard, .i = unitid, .t = year)
# We also have this data on the December unemployment rate for US college grads nationally
# but only every other year
unemp_data <- data.frame(
unemp_year = c(2006, 2008, 2010, 2012, 2014, 2016, 2018),
unemp = c(.017, .036, .048, .040, .028, .025, .020)
)
I am interested in measuring the differences in ex-student earnings
earnings_med
between two-year and four-year colleges
(pred_degree_awarded_ipeds == 2
or 3
, respectively). But before we
can do that we need to clean the data.
Scorecard %>%
# We want pred_degree_awarded_ipeds to be consistent within college. No changers!
# So let's drop them by using fixed_check with .resolve = "drop" to lose inconsistencies
fixed_force(
.var = pred_degree_awarded_ipeds,
.within = unitid,
.resolve = "drop"
) %>%
# Then, get rid of pred_degree_awarded_ipeds == 1
# And simplify our terms
filter(pred_degree_awarded_ipeds %in% c(2, 3)) %>%
mutate(FourYear = pred_degree_awarded_ipeds == 3) %>%
# earnings_med has some missing values - let's fill them in with
# the most recent nonmissing observations we have
# - panel_locf respects the panel structure declared above with as_pibble()
mutate(earnings_med = panel_locf(earnings_med)) %>%
# Now let's bring in that unemployment data!
# Since it's every other year, it won't line up properly
# in the join. So let's use inexact_join to get the MOST RECENT
# year to join with
inexact_left_join(unemp_data, var = year, jvar = unemp_year, method = "last") %>%
# To adjust for state-level trends, let's also control for a tlag of
# average earnings within state.
# The lag is at the state level, and state-year doesn't uniquely identify,
# But that's okay! We just pick a .resolve function to handle disagreements.
# (We could also do this straight in the regression model itself)
mutate(lag_state_earnings = tlag(earnings_med,
.i = state_abbr,
.t = year,
.resolve = mean
)) -> scorecard_clean
# Now we can run a basic regression.
lm(
earnings_med ~
FourYear +
unemp +
lag_state_earnings,
data = scorecard_clean
) %>%
summary()
#>
#> Call:
#> lm(formula = earnings_med ~ FourYear + unemp + lag_state_earnings,
#> data = scorecard_clean)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -25341 -4917 -528 4091 54587
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 5.933e+03 1.772e+03 3.348 0.000826 ***
#> FourYearTRUE 9.088e+03 3.511e+02 25.886 < 2e-16 ***
#> unemp -4.564e+04 2.529e+04 -1.805 0.071215 .
#> lag_state_earnings 7.348e-01 4.027e-02 18.244 < 2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 8209 on 2474 degrees of freedom
#> (25161 observations deleted due to missingness)
#> Multiple R-squared: 0.3431, Adjusted R-squared: 0.3423
#> F-statistic: 430.6 on 3 and 2474 DF, p-value: < 2.2e-16
We could even improve that code - why not run the anti_join()
and
inexact_left_join()
using safe_join()
? When we do the
inexact_left_join()
, for example, we’re assuming that unemp_data
is
uniquely identified by unemp_year
—is it really? safe_join()
would
check for us and minimize error.
Let’s shift focus to the SPrail
data set, which is not nearly as well
behaved as Scorecard
! This data contains the price
of 2,000 train
trips taken on the Spanish rail system. At any given time period, there
are multiple tickets sold per route, which is identified by origin
and
destination
. And the insert_date
of sale is down to the minute - far
more precise than we might actually want to use.
Let’s say we are interested in how much of a premium you’ll have to pay
for different ticket types relative to the cheapo option - a tourist
ticket with a transfer (train_class == "Turista con enlace"
),
accounting for differences in prices between routes.
We have some difficulties to cover: making the ID and time variables behave, accounting for the between-route differences, and figuring out how to compare each price to the cheapo price.
data(SPrail)
SPrail %>%
# We have two ID variables - origin and destination.
# pmdplyr has no problem with this, but maybe we want to export
# to something like plm later, which can't handle it.
# So let's use id_variable to combine them into one
mutate(route_ID = id_variable(origin, destination)) %>%
# We have a time variable down to the minute. Too fine-grained!
# Let's back things up to the daily level, and
# create a nice integer time variable that's easy to use
mutate(day = time_variable(insert_date, .method = "day")) %>%
# Now we can declare a pibble
as_pibble(.i = route_ID, .t = day) %>%
# We want to account for between-route differences in price,
# so let's isolate the within variation
mutate(price_w = within_i(price)) %>%
# We want to compare to the cheapo option, so let's use
# mutate_subset to get the average price of the cheapo option
# and propogate that to the other options for comparison
mutate_subset(
cheapo_price = mean(price, na.rm = TRUE),
.filter = train_class == "Turista con enlace"
) %>%
mutate(premium = price - cheapo_price) %>%
filter(train_class %in% c("Preferente", "Turista", "Turista Plus")) %>%
# Now let's compare premia
group_by(train_class) %>%
summarize(premium = mean(premium, na.rm = TRUE)) -> sprail_compare_premia
sprail_compare_premia
#> # A tibble: 3 x 2
#> train_class premium
#> <fct> <dbl>
#> 1 Preferente 29.9
#> 2 Turista 8.36
#> 3 Turista Plus 14.4
And so there we have it—Preferente
will really set you back relative
to the cheapo ticket on the same route.
Please note that the ‘pmdplyr’ project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.