This is an R package designed to aid in the analysis of panel data,
designs in which the same group of respondents/entities are
contacted/measured multiple times. panelr provides some
useful infrastructure, like a panel_data object class, as
well as automating some emerging methods for analyses of these data.
wbm() automates the “within-between” (also known as
“between-within” and “hybrid”) specification that combines the desirable
aspects of both fixed effects and random effects econometric models and
fits them using the lme4 package in the backend. Bayesian
estimation of these models is supported by interfacing with the
brms package (wbm_stan()) and GEE estimation
via geepack (wbgee()).
It also automates the fairly new “asymmetric effects” specification
described by Allison
(2019) and supports estimation via GLS for linear asymmetric effects
models (asym()) and via GEE for non-Gaussian models
(asym_gee()).
panelr is now available via CRAN.
install.packages("panelr")panel_data framesWhile not strictly required, the best way to start is to declare your
data as panel data. I’ll load the example data WageData to
demonstrate.
library(panelr)
data("WageData")
colnames(WageData) [1] "exp" "wks" "occ" "ind" "south" "smsa" "ms" "fem"
[9] "union" "ed" "blk" "lwage" "t" "id"
The two key variables here are t and id.
t is the wave of the survey the row of the data refers to
while id is the survey respondent. This is a perfectly
balanced data set, so there are 7 observations for each of the 595
respondents. We will use those two pieces of information to create a
panel_data object.
wages <- panel_data(WageData, id = id, wave = t)
wages# Panel data: 4,165 x 14
# entities: id [595]
# wave variable: t [1, 2, 3, ... (7 waves)]
id t exp wks occ ind south smsa ms fem union ed
<fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 3 32 0 0 1 0 1 0 0 9
2 1 2 4 43 0 0 1 0 1 0 0 9
3 1 3 5 40 0 0 1 0 1 0 0 9
4 1 4 6 39 0 0 1 0 1 0 0 9
5 1 5 7 42 0 1 1 0 1 0 0 9
6 1 6 8 35 0 1 1 0 1 0 0 9
7 1 7 9 32 0 1 1 0 1 0 0 9
8 2 1 30 34 1 0 0 0 1 0 0 11
9 2 2 31 27 1 0 0 0 1 0 0 11
10 2 3 32 33 1 1 0 0 1 0 1 11
# ... with 4,155 more rows, and 2 more variables: blk <dbl>, lwage <dbl>
We have to tell panel_data() which column refers to the
unique identifiers for respondents/entities (the latter when you have
something like countries or companies instead of people) and which
column refers to the period/wave of data collection.
Note that the resulting panel_data object will remember
which of the columns is the ID column and which is the wave column. It
will also fight you a bit when you do things that might have the side
effect of dropping those columns or putting them out of time order.
panel_data frames are modified tibbles (tibble package)
that are grouped by entity (i.e., the ID column).
panel_data frames are meant to play nice with the tidyverse. Here’s a
quick sample of how a tidy workflow with panelr can
work:
library(dplyr)
data("WageData")
# Create `panel_data` object
wages <- panel_data(WageData, id = id, wave = t) %>%
# Pass to mutate, which will calculate statistics groupwise when appropriate
mutate(
wage = exp(lwage), # reverse transform the log wage variable
mean_wage_individual = mean(wage), # means calculated separately by entity
lag_wage = lag(wage) # mutate() will calculate lagged values correctly
) %>%
# Use `panelr`'s complete_data() to filter for entities that have
# enough observations
complete_data(wage, union, min.waves = 5) %>% # drop if there aren't 5 completions
# You can use unpanel() if you need to do rowwise or columnwise operations
unpanel() %>%
mutate(
mean_wage_grand = mean(wage)
) %>%
# You'll need to convert back to panel_data if you want to keep using panelr functions
panel_data(id = id, wave = t)wbm() — the
within-between modelAnyone can fit a within-between model without the use of this package as it is just a particular specification of a multilevel model. With that said, it’s something that will require some programming and could be rather prone to error. In the best case, it is cumbersome and inefficient to create the necessary variables.
wbm() is the primary model-fitting function that you’ll
use from this package and it fits within-between models for you,
utilizing lme4 as
a backend for estimation.
A three-part model syntax is used that goes like this:
dv ~ varying_variables | invariant_variables | cross_level_interactions/random effects
It works like a typical formula otherwise. The bars just tell
panelr how to treat the variables. Note also that you can
specify random slopes using lme4-style syntax in the third
part of the formula as well. A random intercept for the ID variable is
included by default and doesn’t need to be specified in the formula.
Lagged variables are supported as well through the lag()
function. Unlike base R, panelr lags the variables
correctly — wave 1 observations will have NA values for the lagged
variable rather than taking the final wave value of the previous
entity.
Here we will specify a model using the wages data. We
will predict logged wages (lwage) using two time-varying
variables — lagged union membership (union) and
contemporaneous weeks worked (wks) — along with a
time-invariant predictor, a binary indicator for black race
(blk). For demonstrative purposes, we’ll fit a random slope
for lag(union) and a cross-level interaction between
blk and wks.
model <- wbm(lwage ~ lag(union) + wks | blk | blk * wks + (lag(union) | id), data = wages)
summary(model)MODEL INFO:
Entities: 595
Time periods: 2-7
Dependent variable: lwage
Model type: Linear mixed effects
Specification: within-between
MODEL FIT:
AIC = 1427.04, BIC = 1495.03
Pseudo-R² (fixed effects) = 0.05
Pseudo-R² (total) = 0.75
Entity ICC = 0.73
WITHIN EFFECTS:
---------------------------------------------------------
Est. S.E. t val. d.f. p
---------------- ------- ------ -------- --------- ------
lag(union) 0.04 0.04 1.24 88.17 0.22
wks -0.00 0.00 -1.51 2948.04 0.13
---------------------------------------------------------
BETWEEN EFFECTS:
---------------------------------------------------------------
Est. S.E. t val. d.f. p
----------------------- ------- ------ -------- -------- ------
(Intercept) 6.20 0.24 25.89 571.97 0.00
imean(lag(union)) 0.03 0.04 0.72 593.27 0.47
imean(wks) 0.01 0.01 2.30 571.29 0.02
blk -0.35 0.06 -5.65 591.87 0.00
---------------------------------------------------------------
CROSS-LEVEL INTERACTIONS:
------------------------------------------------------
Est. S.E. t val. d.f. p
------------- ------- ------ -------- --------- ------
wks:blk -0.00 0.00 -1.06 2956.56 0.29
------------------------------------------------------
p values calculated using Satterthwaite d.f.
RANDOM EFFECTS:
-------------------------------------
Group Parameter Std. Dev.
---------- -------------- -----------
id (Intercept) 0.3785
id lag(union) 0.24
Residual 0.2291
-------------------------------------
Note that imean() is an internal function that
calculates the individual-level mean, which represents the
between-subjects effects of the time-varying predictors. The within
effects are the time-varying predictors at the occasion level with the
individual-level mean subtracted. If you want the model specified such
that the occasion level predictors do not have the mean subtracted, use
the model = "contextual" argument. The “contextual” label
refers to the way these terms are normally interpreted when it is
specified that way.
You may also use model = "between" to fit what
econometricians call the random effects model, which does not
disaggregate the within- and between-entity variation.
widen_panel() and
long_panel()Two functions that should cover your bases for the tricky business of
reshaping panel data are included. Sometimes, like for
doing SEM-based analyses, you need your data in wide format — i.e., one
row per entity. widen_panel() makes that easy and should
require minimal trial and error or thinking.
Perhaps more often, your raw data are already in wide format and you
need to get it into long format to do cool stuff like use
wbm(). That can be very tricky, but
long_panel() (I didn’t think lengthen_panel()
or longen_panel() quite worked as names) should cover most
situations. You tell it what the labels for periods are (e.g., does it
range from 1 to 5, "A" to
"E", or something else?), where they are located (before or
after the variable’s name?), and what kinds of formatting go
before/after it. Check out the vignette for more details and some worked
examples.
I’m happy to receive bug reports, suggestions, questions, and (most of all) contributions to fix problems and add features. I prefer you use the Github issues system over trying to reach out to me in other ways. Pull requests for contributions are encouraged.
Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.
The source code of this package is licensed under the MIT License.