Motivation

The purpose of this document is to show how to use the rstap package including the modeling assumptions and the kind of data it uses. We’ll begin by loading in the appropriate libraries which include dplyr and tidyr for data manipulation, ggplot2 for plotting and rstap for model fitting.

A typical data structure will involve subjects and features at different locations and/or times. For purposes of this example, we simulate 550 subjects and 55 features uniformly in a 1x2 and 3x3 grid, respectively. We’ll assume we’re modeling the BMI of the subjects as a function of Fast Food Restaurants (FF) in their surrounding area.

rstap models data assuming the relationship between the mean and the covariates is the following:

\[ E[BMI_i] = \alpha + Sex_i \delta + FF_i(\theta)\beta\\ X_i(\theta) = \sum_{d \in \mathcal{D}_i} \mathcal{K}_s(\frac{d}{\theta}) \] Where \(\mathcal{K}_s\) is a real-valued function such that \(\mathcal{K}_s:[0.\infty] \to [0,1]\). Furthermore, we have \(\mathcal{K}_s\) either monotonically decreasing or increasing depending on whether one is modeling a spatial or temporal relationship between the features and subjects, respectively. The default weight function for a spatial decay function in the rstap package is the complementary error function, though others are available. For examples of others in use in the rstap package see the Mis-Specified Weight Function vignette.

For this example we use the following fixed parameters to simulate our dataset:

supposing we use all the features simulated in the space, then this results in the following dataset, with the following marginal distribution of the outcome, \(y\).

The exposure decay function and histogram of the standardized exposure is as follows

We then set-up the two dataframes needed for stap to model these data. The first is a fairly typical data frame with the outcome, covariates and subject ID. This is the same kind of dataset that could be used with a function like lm()

subj_id BMI sex
1 25.13904 M
2 28.14755 M
3 29.03859 M
4 29.58273 F
5 30.30569 F
6 30.60156 F

The distance dataframe contains the corresponding subject id to pair with each built environment feature in the chosen space along with their associated distance. Note that the Built Environment Features are labeled as “Fast_Food” for this example - this will be important syntactically for the stap_glm() function.

subj_id BEF Distance
1 Fast_Food 0.3090334
2 Fast_Food 0.1461124
3 Fast_Food 0.7505497
4 Fast_Food 0.6547669
5 Fast_Food 0.2206448
6 Fast_Food 0.3910711

These data are then modeled with rstap in the following manner - note the placement of “Fast_Food” and sap(), designating the rows labeled as Fast_Food in the distance_data as the appropriate ones to model as Spatial Aggregated Predictor with these data. Note that we sample the posterior with 4 chains for 2000 iterations here. This is very conservative and fewer samples/chains could be used when first fitting a model to make sure the sampler is functioning appropriately.

We’ll first look at the quick summary contained in the model’s print-out

Further model details, including diagnostics concerning convergence properties can be found by using the summary function:

Checking our estimates, the model captures the true values very well.

One typical way to check for model goodness-of-fit is via posterior predictive checks. These are available here via the posterior_predict function and the ppc_dens_overlay function from the bayesplot package.

We’ll take this as sufficient evidence (for this vignette) that the model works. Further model evaluation can be found in a different vignette, to be added later. Let’s now examine the case when we mis-specify the inclusion distance or superflous data is included.

Mis-Specified Inclusion Distance

While there are tools, such as the dlm package, that can provide an investigator with a sense of what appropriate inclusiong distance to set, it is still worth considering what may occur if the inclusion distance is mis-specified. The following section considers this in two cases of data exclusion: non-informative, and informative.

Non-Informative data excluded

Note that the histogram of distances include many over 1 unit - which from the above graph we know only add “negligible” information.

Keeping this in mind - let’s see what happens to our estimates when we set the exclusion distance to be 1.25 miles.

We excluded about half of the total data (not exactly even across individuals)

and yet the estimates are still very good

Extreme Missing data

To show the consequences of using an extreme inclusion distance - extreme with respect to the true terminal inclusion distance - we simulate the data under an alternate scale. Note that the following exposure curve does not even terminate at the maximum distance between a “Fast Food” store and a subject.

Fitting the model on a reduced domain of distances from the terminating distance,results in an highly inflated estimate of the spatial scale, as the posterior will attempt to approximate the second graph shown below - an exposure relationship that is approaching uniform on the domain of distances “observed”.

Simulating a dataset under this model and fitting an rstap model to it with the restricted domain…

We can see a much wider estimate of the scale parameter, and the corresponding effect is biased down from where it should be.

Here is a plot of the estimate on the full domain of the data - we can see that the exposure function approximation becomes less accurate on the set of distances not included in the model.

As can be seen below, the approximation is quite good on the reduced domain, but the reduced number of distances results in a wider uncertainty interval, and the constrained domain results in an inflated estimate. The take home lesson here being, that if the effect of some BEF extends out a certain distance, but you don’t have data at that distance, your estimate will become biased to reflect the spatial exposure curve under the distances observed.

Caveats

It is important to note the limitations of these simulations. To begin with, the exposure covariate \(\tilde{X}\) is an unknown function of the distances between the subjects and the businesses and the spatial scale. Thus these simulations do not represent the probably more realistic scenarios in which:

  1. The decay function is mis-specified or

  2. Multiple built-environment features are of interest.

These will be the subject of future vignettes.