`Spatial_Correlation_I.Rmd`

The purpose of this document is to identify what impact, if any, spatial correlation within and between built environment features may have on the ability of the rstap package estimating procedure to correctly estimate parameters. We’ll begin by loading in the appropriate libraries, including, the `rstap`

package for modeling these kinds of data.

```
library(rstap)
library(dplyr)
library(tidyr)
library(ggplot2)
```

To induce correlation between business locations, we generate the x and y coordinates from an inhomogenous poisson process, with intensity function \(f(x,y) = 200xy\). So the mean number of businesses increases as \(x\) and \(y\) increase.

```
intenfun <- function(x,y) 200 * x *y
x <- seq(from = 0, to = 1, by = 0.01)
y <- seq(from = 0, to = 1, by = 0.01)
xy <- expand.grid(x=x,y=y)
z <- intenfun(xy$x,xy$y)
cbind(x=xy$x,y=xy$y,mean = z) %>% as_data_frame() %>%
mutate(cut_y = cut(y,4)) %>%
ggplot(aes(x=x,y=y,color=z)) + geom_point() + scale_color_continuous(low='white',high='red') +
theme_bw() + ggtitle("Intensity Function: Mean of Poisson Across 2D Space")
```

The following plot shows where all subjects and built environment features are located in our artificial space.

```
set.seed(2524)
num_subj <- 550
bef_rpp <- spatstat::rpoispp(intenfun, lmax = 5E3)
bef_df <- data_frame(x = bef_rpp$x,
y = bef_rpp$y, class = 'BEF')
sub_df <- data_frame(x = runif(n = num_subj, min = 0, max = 1.0),
y = runif(n = num_subj, min = 0, max = 1.0),
class = "Subject")
rbind(bef_df,sub_df) %>% ggplot(aes(x=x,y=y,color=class)) + geom_point() +
theme_bw() + ggtitle("Correlated Business Positions")
```

Then suppose the form of some measurement on the subjects is:

\[ E[Y_i] = \alpha + Z_i \delta + X_i(\theta)\beta_2\\ X_i(\theta) = \sum_{d \in \mathcal{D}_i} \text{w}(\frac{d}{\theta}) \]

Using the following fixed parameter estimates.

```
alpha <- -.9
Z <- rbinom(n = num_subj, prob = .45, size = 1)
delta <- -.8
theta <- .3
beta <- .4
```

Assuming the exposure is imposed across all current features, then this results in the following dataset and marginal distribution of the outcome.

```
dists <- fields::rdist(as.matrix(sub_df[,1:2]),
as.matrix(bef_df[,1:2]))
X <- apply(dists,1,function(x) sum(pracma::erfc(x/theta)))
X_tilde <- (X-mean(X))/sd(X) ## center and scale the exposure effect
y <- alpha + Z*delta + X_tilde*beta
mu <- binomial(link='logit')$linkinv(alpha + Z*delta + X_tilde*beta)
N1 <- rbinom(n = num_subj,size = 100,prob = .5)
y <- rbinom(n = num_subj, size = N1, prob = mu)
hist(y)
```

Additionally, this results in the following decay function and marginal distribution of “Exposure”.

```
d <- seq(from = 0, to = max(dists), by = 0.01)
X_theta_one <- pracma::erfc(d/theta)
par(mfrow = c(1,2))
plot(d,X_theta_one,type='l',main = "Exposure Decay Function", xlab = "Distance",
ylab = "Exposure")
hist(X_tilde)
```

Modeling these data with rstap…

```
distance_data <- dists %>% as_data_frame() %>%
mutate(subj_id = 1:nrow(sub_df)) %>%
gather(contains("V"), key = 'BEF',value = 'Distance') %>%
mutate(BEF = 'Fast_Food')
subject_data <- data_frame(subj_id = 1:num_subj,
N = N1,
NS = y,
NF = N - NS,
sex = factor(Z,labels=c("M","F")))
fit <- stap_glm(formula = cbind(NS,NF) ~ sex + sap(Fast_Food),
subject_data = subject_data,
distance_data = distance_data,
family = binomial(link = 'logit'),
id_key = 'subj_id',
prior_intercept = normal(location = 0, scale = 3, autoscale = F),
prior_stap = normal(location = 0, scale = 3, autoscale = F),
prior_theta = log_normal(location = 0, scale = 1),
prior_aux = cauchy(location = 0,scale = 5),
max_distance = max(dists), chains = 2, cores = 2) ## include all data
```

```
results <- rbind(c(alpha,delta,beta,theta),coef(fit))
rownames(results) <- c("True","Estimated")
results
```

`se(fit)`

From the above we can see that spatial correlation within built environment features does not eliminate the ability of `rstap`

to correctly estimate the spatial scale. Comparing it to the introductory vignette we can see it does not substantially increase the variance of the estimate either, compared to a completely independent configuration.

To induce correlation within **and** between business locations, we generate the x and y coordinates of both features from an inhomogenous poisson process, with the **same** intensity function \(f(x,y) = 200xy\). So the mean number of *both* businesses increases as \(x\) and \(y\) increase.

```
set.seed(2524)
intenfun <- function(x,y) 250 * x * y
bef_rpp <- spatstat::rpoispp(intenfun, lmax = 500)
bef_df <- data_frame(x = bef_rpp$x,
y = bef_rpp$y, class = 'BEF_1')
intenfun <- function(x,y) 210 * x * y
bef_rpp <- spatstat::rpoispp(intenfun, lmax = 500)
bef_df <- rbind(bef_df,data_frame(x = bef_rpp$x,
y = bef_rpp$y, class = 'BEF_2'))
rbind(sub_df,bef_df) %>% ggplot(aes(x=x,y=y,color=class)) +
geom_point() + theme_bw()
```

Then suppose the form of some measurement on the subjects is:

\[ E[Y_i] = \alpha + Z_i \delta + X_{i,1}(\theta_1)\beta_2 + X_{i,2}(\theta_2)\beta_3 \\ X_{i,j}(\theta) = \sum_{d \in \mathcal{D}_i} \text{w}(\frac{d}{\theta_j}) \quad j = 1,2 \]

Here we use the following fixed parameters. Note that the spatial scales are different for each class of built environment feature.

```
alpha <- -.9
Z <- rbinom(n = num_subj, prob = .45, size = 1)
delta <- -.8
theta <- .3
theta_2 <- .9
beta <- c(.4,-.5)
```

supposing we (again) use all the data available (no exclusion distance), then this results in the following dataset

```
dists_1 <- fields::rdist(as.matrix(sub_df[,1:2]),
as.matrix(bef_df[bef_df$class=='BEF_1',1:2]))
X_1 <- apply(dists_1,1,function(x) sum(pracma::erfc(x/theta)))
X_tilde <- (X_1-mean(X_1))/sd(X_1) ## center and scale the exposure effect
dists_2 <- fields::rdist(as.matrix(sub_df[,1:2]),
as.matrix(bef_df[bef_df$class=='BEF_2',1:2]))
X_2 <- apply(dists_2,1,function(x) sum(pracma::erfc(x/theta_2)))
X_tilde <- cbind(X_tilde,(X_2-mean(X_2))/sd(X_2))
mu <- binomial(link='logit')$linkinv(alpha + Z*delta + X_tilde %*% beta)
y <- rbinom(n= num_subj, size = N1, p = mu)
par(mfrow=c(1,2))
hist(y)
hist(dists)
```

```
d <- seq(from = 0, to = max(dists), by = 0.01)
X_theta_one <- pracma::erfc(d/theta)
X_theta_two <- pracma::erfc(d/theta_2)
par(mfrow = c(2,2),mar=rep(.8,4))
plot(d,X_theta_one,type='l',main = "Exposure Decay Function BEF 1", xlab = "Distance", ylab = "Exposure")
plot(d,X_theta_two,type='l', main = "Exposure Decay Function BEF 2")
hist(X_tilde[,1],main="")
hist(X_tilde[,2],main = "")
```

Since these businesses, let’s say they’re Fast Food Restaurants and Coffee shops, were generated using the same intensity function from an inhomogenous poisson model, we should expect the exposure to be correlated, since these businesses tend to be located in the same area - as indeed, they may be in real life as well.

`cor(X_tilde[,1],X_tilde[,2])`

Looking at a plot of the “Exposure” from each.

```
plot(X_1,X_2, xlab = "FFR Exposure", ylab = "Coffee Exposure",
main = "Exposure Correlation")
```

This may cause problems in our modeling, but we should check to make sure.

Then we model this with rstap in the following manner

```
results <- rbind(c(alpha,delta,beta,theta,theta_2),coef(fit))
rownames(results) <- c("True","Estimated")
results
```

`se(fit)`

Looking at the results above we can see that we’re able to capture the true scales and effects, but with much larger variability as compared to before.

Examing the correlation between the exposure covariates, we can see we largely **underestimate** the true correlation.

```
cors <- sapply(1:nrow(fit$X_tilde), function(x) cor(fit$X_tilde[x,,1],fit$X_tilde[x,,2], method='spearman'))
summary(cors)
```

The plots below showing the distribution in samples from the distribution of spatial scales illustrate this further.

```
b <- rstan::extract(fit$stapfit)
betas <- b$beta %>% as_data_frame() %>% transmute(Fast_Food = V1,
Coffee_Shops = V2)
betas %>% gather(Fast_Food,Coffee_Shops,key='Store_Class',value='sample') %>%
ggplot(aes(x=sample,color = Store_Class)) + geom_density() + theme_bw() +
geom_vline(aes(xintercept = beta[1]),linetype = 2, color = 'blue') +
geom_vline(aes(xintercept = beta[2]),linetype = 2, color =' red')
```

To see whether this is a true consequence of correlation or simply the weak coffee shop effect we simulate a new data set with a stronger coffee shop effect.

```
alpha <- 22.5
Z <- rbinom(n = num_subj, prob = .45, size = 1)
delta <- -.8
theta <- .3
theta_2 <- .9
beta <- c(1.2,2.5)
sigma <- 2.3
```

```
results_CF <- rbind(c(alpha,delta,beta,theta,theta_2),coef(fit_CF))
rownames(results_CF) <- c("True","Estimated")
results_CF
```

`se(fit_CF)`

The lower variability in spatial scale in these most recent results suggest the former high variability likely was a consequence of the coffee shop effect, rather than the collinearity.

This modeling framework can still estimate parameters in a setting where spatial coordinates are correlated, leading to correlated exposure covariates. This is important since built environment features will often be correlated in space. One qualification that should be made is that the intensity function above is relatively simple, and other intensity functions may induce differing correlation patterns. Additionally, differing processes that allow for greater within feature correlation, e.g. the Hawkes Process, may induce a correlation structure that may not be accounted for by this simulation’s set-up and results.