Motivation

The purpose of this document is to identify what impact, if any, spatial correlation within and between built environment features may have on the ability of the rstap package estimating procedure to correctly estimate parameters. We’ll begin by loading in the appropriate libraries, including, the rstap package for modeling these kinds of data.

library(rstap)
library(dplyr)
library(tidyr)
library(ggplot2)

Within Feature Correlation

To induce correlation between business locations, we generate the x and y coordinates from an inhomogenous poisson process, with intensity function \(f(x,y) = 200xy\). So the mean number of businesses increases as \(x\) and \(y\) increase.

intenfun <- function(x,y) 200 * x *y
x <- seq(from = 0, to = 1, by = 0.01)
y <- seq(from = 0, to = 1, by = 0.01)
xy <- expand.grid(x=x,y=y)
z <- intenfun(xy$x,xy$y)
cbind(x=xy$x,y=xy$y,mean = z) %>% as_data_frame() %>% 
    mutate(cut_y = cut(y,4)) %>% 
    ggplot(aes(x=x,y=y,color=z)) + geom_point() + scale_color_continuous(low='white',high='red') +
        theme_bw() + ggtitle("Intensity Function: Mean of Poisson Across 2D Space")

The following plot shows where all subjects and built environment features are located in our artificial space.

set.seed(2524)
num_subj <- 550
bef_rpp <- spatstat::rpoispp(intenfun, lmax = 5E3)
bef_df <- data_frame(x = bef_rpp$x,
                     y = bef_rpp$y, class = 'BEF')

sub_df <- data_frame(x = runif(n = num_subj, min = 0, max = 1.0),
                     y = runif(n = num_subj, min = 0, max = 1.0),
                     class = "Subject")
rbind(bef_df,sub_df) %>% ggplot(aes(x=x,y=y,color=class)) + geom_point() + 
    theme_bw() + ggtitle("Correlated Business Positions")

Then suppose the form of some measurement on the subjects is:

\[ E[Y_i] = \alpha + Z_i \delta + X_i(\theta)\beta_2\\ X_i(\theta) = \sum_{d \in \mathcal{D}_i} \text{w}(\frac{d}{\theta}) \]

Using the following fixed parameter estimates.

alpha <- -.9
Z <- rbinom(n = num_subj, prob = .45, size = 1)
delta <- -.8
theta <- .3
beta <- .4

Assuming the exposure is imposed across all current features, then this results in the following dataset and marginal distribution of the outcome.

dists <- fields::rdist(as.matrix(sub_df[,1:2]),
                       as.matrix(bef_df[,1:2]))
X <- apply(dists,1,function(x) sum(pracma::erfc(x/theta)))
X_tilde <- (X-mean(X))/sd(X) ## center and scale the exposure effect
y <- alpha + Z*delta + X_tilde*beta
mu <- binomial(link='logit')$linkinv(alpha + Z*delta + X_tilde*beta)
N1 <- rbinom(n = num_subj,size = 100,prob = .5)
y <- rbinom(n = num_subj, size = N1, prob = mu)
hist(y)

Additionally, this results in the following decay function and marginal distribution of “Exposure”.

d <- seq(from = 0, to = max(dists), by = 0.01)
X_theta_one <- pracma::erfc(d/theta)
par(mfrow = c(1,2))
plot(d,X_theta_one,type='l',main = "Exposure Decay Function", xlab = "Distance",
     ylab = "Exposure")
hist(X_tilde)

Modeling these data with rstap…

distance_data <- dists %>% as_data_frame() %>%
    mutate(subj_id = 1:nrow(sub_df)) %>% 
    gather(contains("V"), key = 'BEF',value = 'Distance') %>% 
    mutate(BEF = 'Fast_Food')

subject_data <- data_frame(subj_id = 1:num_subj,
                           N = N1,
                           NS = y,
                           NF = N - NS,
                           sex = factor(Z,labels=c("M","F")))

fit <- stap_glm(formula = cbind(NS,NF) ~ sex + sap(Fast_Food),
                subject_data = subject_data,
                distance_data = distance_data,
                family = binomial(link = 'logit'),
                id_key = 'subj_id',
                prior_intercept = normal(location = 0, scale = 3, autoscale = F),
                prior_stap = normal(location = 0, scale = 3, autoscale = F),
                prior_theta = log_normal(location = 0, scale = 1), 
                prior_aux = cauchy(location = 0,scale = 5),
                max_distance = max(dists), chains = 2, cores = 2) ## include all data
results <- rbind(c(alpha,delta,beta,theta),coef(fit))
rownames(results) <- c("True","Estimated")
results
se(fit)

From the above we can see that spatial correlation within built environment features does not eliminate the ability of rstap to correctly estimate the spatial scale. Comparing it to the introductory vignette we can see it does not substantially increase the variance of the estimate either, compared to a completely independent configuration.

Between and Within Feature Correlation

To induce correlation within and between business locations, we generate the x and y coordinates of both features from an inhomogenous poisson process, with the same intensity function \(f(x,y) = 200xy\). So the mean number of both businesses increases as \(x\) and \(y\) increase.

set.seed(2524)
intenfun <- function(x,y) 250 * x * y
bef_rpp <- spatstat::rpoispp(intenfun, lmax = 500)
bef_df <- data_frame(x = bef_rpp$x,
                     y = bef_rpp$y, class = 'BEF_1')
intenfun <- function(x,y) 210 * x * y
bef_rpp <- spatstat::rpoispp(intenfun, lmax = 500)
bef_df <- rbind(bef_df,data_frame(x = bef_rpp$x,
                     y = bef_rpp$y, class = 'BEF_2'))
rbind(sub_df,bef_df) %>% ggplot(aes(x=x,y=y,color=class)) + 
    geom_point() + theme_bw()

Then suppose the form of some measurement on the subjects is:

\[ E[Y_i] = \alpha + Z_i \delta + X_{i,1}(\theta_1)\beta_2 + X_{i,2}(\theta_2)\beta_3 \\ X_{i,j}(\theta) = \sum_{d \in \mathcal{D}_i} \text{w}(\frac{d}{\theta_j}) \quad j = 1,2 \]

Here we use the following fixed parameters. Note that the spatial scales are different for each class of built environment feature.

alpha <- -.9
Z <- rbinom(n = num_subj, prob = .45, size = 1)
delta <- -.8
theta <- .3
theta_2 <- .9
beta <- c(.4,-.5)

supposing we (again) use all the data available (no exclusion distance), then this results in the following dataset

dists_1 <- fields::rdist(as.matrix(sub_df[,1:2]),
                       as.matrix(bef_df[bef_df$class=='BEF_1',1:2]))
X_1 <- apply(dists_1,1,function(x) sum(pracma::erfc(x/theta)))
X_tilde <- (X_1-mean(X_1))/sd(X_1) ## center and scale the exposure effect
dists_2 <- fields::rdist(as.matrix(sub_df[,1:2]),
                       as.matrix(bef_df[bef_df$class=='BEF_2',1:2]))
X_2 <- apply(dists_2,1,function(x) sum(pracma::erfc(x/theta_2)))
X_tilde <- cbind(X_tilde,(X_2-mean(X_2))/sd(X_2))
mu <- binomial(link='logit')$linkinv(alpha + Z*delta + X_tilde %*% beta)
y <- rbinom(n= num_subj, size = N1, p = mu)
par(mfrow=c(1,2))
hist(y)
hist(dists)
d <- seq(from = 0, to = max(dists), by = 0.01)
X_theta_one <- pracma::erfc(d/theta)
X_theta_two <- pracma::erfc(d/theta_2)
par(mfrow = c(2,2),mar=rep(.8,4))
plot(d,X_theta_one,type='l',main = "Exposure Decay Function BEF 1", xlab = "Distance", ylab = "Exposure")
plot(d,X_theta_two,type='l', main = "Exposure Decay Function BEF 2")
hist(X_tilde[,1],main="")
hist(X_tilde[,2],main = "")

Since these businesses, let’s say they’re Fast Food Restaurants and Coffee shops, were generated using the same intensity function from an inhomogenous poisson model, we should expect the exposure to be correlated, since these businesses tend to be located in the same area - as indeed, they may be in real life as well.

cor(X_tilde[,1],X_tilde[,2])

Looking at a plot of the “Exposure” from each.

plot(X_1,X_2, xlab = "FFR Exposure", ylab = "Coffee Exposure",
     main = "Exposure Correlation")

This may cause problems in our modeling, but we should check to make sure.

Then we model this with rstap in the following manner

results <- rbind(c(alpha,delta,beta,theta,theta_2),coef(fit))
rownames(results) <- c("True","Estimated")
results
se(fit)

Looking at the results above we can see that we’re able to capture the true scales and effects, but with much larger variability as compared to before.

Examing the correlation between the exposure covariates, we can see we largely underestimate the true correlation.

cors <- sapply(1:nrow(fit$X_tilde), function(x) cor(fit$X_tilde[x,,1],fit$X_tilde[x,,2], method='spearman'))
summary(cors)

The plots below showing the distribution in samples from the distribution of spatial scales illustrate this further.

b <- rstan::extract(fit$stapfit)
betas <- b$beta %>% as_data_frame() %>% transmute(Fast_Food = V1,
                                                  Coffee_Shops = V2)
betas %>% gather(Fast_Food,Coffee_Shops,key='Store_Class',value='sample') %>% 
    ggplot(aes(x=sample,color = Store_Class)) + geom_density() + theme_bw() + 
    geom_vline(aes(xintercept = beta[1]),linetype = 2, color = 'blue') + 
    geom_vline(aes(xintercept = beta[2]),linetype = 2, color =' red')

To see whether this is a true consequence of correlation or simply the weak coffee shop effect we simulate a new data set with a stronger coffee shop effect.

alpha <- 22.5
Z <- rbinom(n = num_subj, prob = .45, size = 1)
delta <- -.8
theta <- .3
theta_2 <- .9
beta <- c(1.2,2.5)
sigma <- 2.3
results_CF <- rbind(c(alpha,delta,beta,theta,theta_2),coef(fit_CF))
rownames(results_CF) <- c("True","Estimated")
results_CF
se(fit_CF)

The lower variability in spatial scale in these most recent results suggest the former high variability likely was a consequence of the coffee shop effect, rather than the collinearity.

Conclusions

This modeling framework can still estimate parameters in a setting where spatial coordinates are correlated, leading to correlated exposure covariates. This is important since built environment features will often be correlated in space. One qualification that should be made is that the intensity function above is relatively simple, and other intensity functions may induce differing correlation patterns. Additionally, differing processes that allow for greater within feature correlation, e.g. the Hawkes Process, may induce a correlation structure that may not be accounted for by this simulation’s set-up and results.