Bayesian-Statistics-Final-Project/Final_Report_Template.Rmd at main · timdebord/Bayesian-Statistics-Final-Project · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
---
title: "Bayesian Spatial Models"
author: "Tim DeBord & Andrea Zantek"
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(message=FALSE, warning=FALSE)
library(tidyverse)
library(gridExtra)
library(bayesplot)
library(runjags)
library(kableExtra)
```

```{r, include=FALSE, message=FALSE, warning=FALSE}
#install.packages("INLA",repos=c(getOption("repos"),INLA="https://inla.r-inla-download.org/R/stable"), dep=TRUE)
#install.packages('spDataLarge', repos='https://nowosad.github.io/drat/', type='source')
```


```{r, include=FALSE, message=FALSE, warning=FALSE}
library(spocc)
library(sf)
library(mapview)
library(SpatialEpi)
library(sf)
library(dplyr)
library(spdep)
library(INLA)
library(ggplot2)
library(htmltools)
library(leaflet)
library(knitr)
```

## Introduction
Spatial data is simply data which directly or indirectly references geographical locations. Spatial data is essential across various fields such as Environmental Science, Ecology, Urban Planning, Development, Agriculture, Forestry, and many others, providing critical insights and tools for informed decision-making and efficient resource management. The most basic version of spatial data is the generic map. It provides information on how the land is connected and what features are where. An advantage of spatial data is its ease of use; maps are easy to read and understand, whereas a regression table is not easily understood without proper training. Maps usually have boundaries which help group the objects of the map together, this grouping defines areal data.

Areal data is a type of spatial data formatted with boundaries separating areas; each area has a value for each attribute and the response variable. There are two main types of areal data- regular lattice data and irregular lattice data. Lattice data are observations from a random process observed over a countable collection of spatial regions and supplemented by a neighborhood structure. Each unit (cell or point) in the lattice is associated with a specific location in the study area. Irregular lattice data utilizes organic boundaries, like a county or census tract. Regular lattice data involves splitting up an area of land into equally sized grids. Regular lattice data is particularly useful in modeling count data. Furthermore, regular lattice data is sometimes constructed by overlaying a gridded structure on a study area and calculating the number of events in each square. For example, looking at the map of the United States, the state boundaries provide an example of the irregular lattice structure. However, let us consider the United States map again. Say we do not like the idea of states, and we think we would learn more looking at an evenly distributed grid instead, this is an example of regular lattice. Although areal data is useful, it is not without flaws. Areal data assumes homogeneity within units, potentially overlooking internal variations and autocorrelation that complicates statistical analysis.

Spatial autocorrelation occurs when the value of a variable at one location is correlated with the values of the same variable at adjacent locations. This means that spatial data observations are not independent of each other, and the presence of a certain value at one location makes it more likely to observe similar values at nearby locations. This helps in identifying spatial patterns in data. For example, high spatial autocorrelation in environmental data might indicate regions with similar ecological characteristics. The correlated nature of spatial data intuitively makes sense to be implemented within priors used in Bayesian inference.

```{r, fig.cap="Raster Versus Polygon Data", echo=FALSE, out.width = '80%', fig.align='center'}
knitr::include_graphics("rastpol.png")
```

```{r, echo=FALSE}
df <- occ(query = "Giraffa camelopardalis", from = "gbif",
          date = c("2000-01-01", "2019-12-31"),
          gbifopts = list(continent = "Africa"),
          has_coords = TRUE, limit = 1000)
d <- occ2df(df)
```

Using the `spocc` library in R, we can pull information from the Interface to Species Occurrence Data Sources Database by following the code structure. From there, you can plug in any scientific animal name, select the continent, and it will output a map showing the locations where the animal has been spotted. This provides researchers with information about where these animals are being observed in the wild. If we were to further analyze the data beyond just specific points, researchers could identify trends for that specific animal. In the example we provide, we see that giraffes are located in concentrated areas. With further research, we know this makes sense as "[m]ost giraffes live in grasslands and open woodlands in East Africa, especially in reserves such as the Serengeti National Park and the Amboseli National Park. Some are also found in the reserves of Southern Africa."

```{r, echo=FALSE}
giraffe_spotting <- st_as_sf(d, coords = c("longitude", "latitude"))
st_crs(giraffe_spotting) <- 4326
mapview(giraffe_spotting)
```

## Methodology

To model the outcome variable, we’ll construct a model where the outcome is a function of the covariates and random effects, ui and vi. Covariates are the variables included in a model to explain and predict variations in the outcome variable. They help improve the accuracy and reliability of the model by controlling for confounding factors. Random effects in spatial models capture unobserved heterogeneity and account for the correlation between observations due to spatial proximity or hierarchical structure. They are essential for modeling spatial autocorrelation and improving the fit and accuracy of the model. Together, covariates and random effects allow spatial models to more accurately and comprehensively describe and predict spatial phenomena, accounting for both observed and unobserved sources of variability. The spatially unstructured random effect term, vi, captures random noise in the data. Bayesian spatial models incorporate an additional spatially structured random effect term, ui, to capture spatial dependencies. The spatially structured random effects account for the phenomenon of spatial autocorrelation, which is when neighboring areas have similar values.

A common, yet critical, assumption in Bayesian Spatial Modeling is that the spatial random effects jointly follow a normal distribution. This assumption allows us to utilize the INLA package, which offers a computationally efficient alternative to MCMC methods. Spatially structured random effects are ultimately determined by the adjacency matrix, \( W \), where the \( (i,j) \)th element of the matrix, denoted by \( w_{ij} \), spatially connects areas \( i \) and \( j \) in some fashion, \( i, j \in \{1, \ldots, n\} \). Areas are adjacent if they share at least one boundary. In Regular Lattice Models, adjacency can be just a single point (queen adjacency) or at least a whole segment (rook adjacency). In Irregular Lattice Models, the areas are considered neighbors as long as they are touching. The adjacency relationships for n observations can be expressed in an n x n adjacency matrix, with non-zero entries indicating adjacency between regions. The adjacency matrix can have a binary structure, with 0’s and 1’s if the areas are neighbors, resulting in a symmetric matrix. The adjacency matrix can also have a row-standardized structure, where the values in each row sum to 1.

```{r, fig.cap="Binary Adjacency Matrix for Irregular Lattice Data", echo=FALSE, out.width = '80%', fig.align='center'}
knitr::include_graphics("A-simple-Bayesian-network-structure-and-its-adjacency-matrix.png")
```

The most widely used R program to make Bayesian inference is a Integrated Nested Laplace Approximation (INLA). To fit a model using INLA, we need to take two steps. First, we write the linear predictor of the model as a formula object in R. Then, we can run the model by calling the `inla()` function, where we specify the formula, the family, the data, and other options we need to specify in our model. The default prior is loggamma. The execution of `inla()` returns an object that contains information about the fitted model, including several summaries and the posterior results of the parameters, the linear predictors, and the fitted values. These posteriors can then be processed using a set of functions provided by INLA. It should be noted that INLA is still working out a lot of its kinks which is a result of the rapid developments emerging from the field of Bayesian spatial inference.

We’ll specify a model using the Poisson framework to model lip cancer counts. This can be mathematically represented as

\[ Y_i \sim \text{Pois}(E_i \theta_i) \]

Each \(Y_i\), which represents an observed count in a specific area \(i\), follows a Poisson distribution with mean \(E_i \theta_i\). The \(E_i\) term represents the baseline expected count for an area \(i\). Its value is directly derived from the average of the total area, however, it is specific to each area in that it accounts for factors like population size. The \(\theta_i\) represents the relative risk for each area \(i\). For example, a \(\theta_i\) value of 2 corresponds with a risk twice as high in area \(i\) when compared to the standard population.

In the case of disease counts, it’s common to work with \(\log(\theta_i)\) to ensure that the values remain positive. The \(\log(\theta_i)\) expression can be represented as follows:

\[ \log(\theta_i) = \beta_0 + \beta_1 \text{AFF}_i + u_i + v_i \]

\(\beta_0\) serves as the intercept and \(\beta_1\) is the coefficient of the \(\text{AFF}_i\) covariate corresponding to area \(i\). For each one-unit increase in the \(\text{AFF}_i\) value for area \(i\), the relative risk \(\theta_i\) increases by a factor of \(e^{\beta_1}\), holding all other covariates constant.

The spatially unstructured effects are denoted \(v_i\) and capture the random noise, variation unrelated to spatial patterns. In Besag models, the spatially unstructured effects are estimated indepdently from the spatially structured effects. They follow IID Normal distributions with mean 0 and variance \(\frac{1}{\tau_2}\) and can be expressed mathematically as:

\[ v_i \overset{\text{i.i.d.}}{\sim} Norm(0, \frac{1}{\tau_2}) \]

Each \(u_i\) is determined from a Conditional Autoregressive (CAR) distribution in both Besag models. Essentially, this means that each \(u_i\) for an area \(i\) is determined conditionally based on the random effects of neighboring areas. This can be mathematically represented as:

\[ u_i | {u}_{-i} \sim \text{Norm}(\overline{U}_{\delta_i}, \frac{1}{\tau_1 n_{\delta_i}}) \]

The \(u_i\) term represents the spatially structured random effect of area \(i\) and the \( u_{-i} \) term represents the random effects of the neighboring areas. This conditionally follows a normal distribution with mean \(\overline{U}_{\delta_i}\) and variance \(\frac{1}{\tau_1 n_{\delta_i}}\).

The \(\overline{U}_{\delta_i}\) represents the mean value of the neighboring areas, and can be mathematically represented as:

\[ \frac{1}{n_{\delta_i}} \sum\limits_{i \sim j} u_j \]

where \(i \sim j\) indicates the neighboring areas of area \(i\), \(n_{\delta_i}\) represents the number of neighbors, and \(u_j\) represents the spatial effects of the neighboring areas.

The variance is \(\frac{1}{\tau_1 n_{\delta_i}}\). Note that as the number of neighbors increases, the denominator, \(n_{\delta_i}\), increases, reducing the variance overall. This aligns with our statistical intuition; having more neighbors reduces the variance for an area’s CAR distribution because it is analogous to having a larger sample size, as we are borrowing information from more areas.

\(\tau_1\) and \(\tau_2\) are hyperparameters that can be specified based on prior beliefs. \(\tau_1\) is represented as \(\theta_i = \log(\tau_1)\) and the prior is defined on \(\theta_i\). We will rely on the default non-informative prior \(\text{logGamma}(1, 5 \times 10^{-5})\) for \(\theta_i\). Higher values of \(\tau_1\) indicate less variance, which means that neighboring areas are more similar. Lower values of \(\tau_1\) indicate more spatial variability. We also rely on the default prior for \(\tau_2\) which is \(\text{logGamma}(1, 1)\).


The main difference between the Besag and BYM models revolves around the \(v_i\) terms. In Besag models, the \(v_i\) term is captured separately and estimated independently of the \(u_i\) terms. As such, the \(v_i\) term solely represents random noise. In a BYM model, the \(v_i\) and \(u_i\) terms are jointly distributed, providing a more balanced view of the variation. Thus, when using the BYM specification, the \(v_i\) estimates are much smaller since we are allowing the \(u_i\) terms to capture some of the local variability. This can be mathematically represented as:

\[
u_i \mid v_i \sim \text{Norm} \left( \overline{U}_{\delta_i} + \rho (v_i - \overline{V}), \frac{1}{\tau_1 n_{\delta_i}} \right)
\]


From these equations,  \( \rho \) is the spatial correlation parameter, which measures the strength of spatial dependence between neighboring areas. \( \overline{U}_{\delta_i} \) represents the average of the spatially structured random effects of neighboring areas. \( \overline{V} \) is the mean value of all \( v_i \) terms.


The hyperparameters are represented as:
\( \boldsymbol{\theta_i} = (\theta_1, \theta_2) = (\log \tau_1, \log \tau_2) \)

The default priors for the hyperparameters are \( \tau_1 \sim \text{logGamma}(1, 5 \times 10^{-5}) \) and \( \tau_2 \sim \text{logGamma}(1, 1) \).


## Data and Results


Our data analysis will be using data about Scotland from the `SpatialEpi` package. For each of the Scotland counties, the data contains the number of observed and expected lip cancer cases between 1975 and 1980, as well as a variable that indicates the proportion of the population engaged in agriculture, fishing, or forestry (AFF). The AFF variable is related to exposure to sunlight which is a risk factor for lip cancer. The data also contain the map information of Scotland counties, from which we will be using to make Bayesian inferences. In order to analyze the data, we create a data frame called `d` with columns containing the counties ids, the observed and expected number of lip cancer cases, the `AFF` values, and the Standardized Instance Ratio (SIR). The `SIR` is defined as:  \[ {SIR}_i = Y_i / E_i  \]

Specifically, `d` contains the following columns:

* `county`: ID of each county
* `Y`: observed number of lip cancer cases in each county
* `E`: expected number of lip cancer cases in each county
* `AFF`: proportion of population engaged in agriculture, fishing, or forestry
* `SIR`: Standardized Incidence Ratio (SIR) of each county


```{r, echo=FALSE}
data(scotland)
```

```{r, echo=FALSE}
# Convert sp object to sf and transform CRS
map <- scotland$spatial.polygon
map_sf <- st_as_sf(map)
map_sf <- st_set_crs(map_sf, 27700)
map_sf_transformed <- st_transform(map_sf, crs = 27700)

# Transform to WGS84 (longitude and latitude)
map_sf_wgs84 <- st_transform(map_sf_transformed, crs = 4326)

# Prepare the data
d <- scotland$data[, c("county.names", "cases", "expected", "AFF")]
names(d) <- c("county", "Y", "E", "AFF")
d$SIR <- d$Y / d$E

# Add county names to the map_sf_wgs84 object
map_sf_wgs84$county <- scotland$data$county.names

# Combine the data with the spatial polygons using dplyr
map_with_data <- left_join(map_sf_wgs84, d, by = "county")

# Check the combined data
print(head(map_with_data))

# Create the leaflet map
l <- leaflet(map_with_data) %>% addTiles()

# Define color palette
pal <- colorNumeric(palette = "YlOrRd", domain = map_with_data$SIR)

# Define labels
labels <- sprintf(
  "<strong>%s</strong><br/>Observed: %s<br/>Expected: %s<br/>AFF: %s<br/>SIR: %s",
  map_with_data$county, map_with_data$Y, round(map_with_data$E, 2),
  map_with_data$AFF, round(map_with_data$SIR, 2)
) %>% lapply(htmltools::HTML)

# Add polygons and labels to the leaflet map
l %>%
  addPolygons(
    color = "grey", weight = 1,
    fillColor = ~pal(SIR), fillOpacity = 0.5,
    highlightOptions = highlightOptions(weight = 4),
    label = labels,
    labelOptions = labelOptions(
      style = list(
        "font-weight" = "normal",
        padding = "3px 8px"
      ),
      textsize = "15px", direction = "auto"
    )
  ) %>%

  addLegend(
    pal = pal, values = ~SIR, opacity = 0.5,
    title = "SIR", position = "bottomright"
  )
```

From the map, we see that, on average, the northern parts of Scotland have higher SIRs than the southern parts, with Skye-Lochlash having the highest SIR of 6.43.

```{r, echo=FALSE}
# Create a neighbor list from polygon list
nb <- poly2nb(map_with_data)
#head(nb)

# Convert neighbor list to INLA format
nb2INLA("map.adj", nb)
g <- inla.read.graph(filename = "map.adj")

# Add area identifiers to the data frame
map_with_data$idareau <- 1:nrow(map_with_data)
map_with_data$idareav <- 1:nrow(map_with_data)

# Check structure of map_with_data
#str(map_with_data)
```

We'll first start with a simple IID model without the inclusion of any spatial random effects to serve as the baseline for comparing further models.

```{r, echo=FALSE}
# Define the formula for the first model
formula1 <- Y ~ AFF   +
  f(idareav, model = "iid")

# Fit the first INLA model
res1 <- inla(formula1,
             family = "poisson", data = map_with_data,
             control.predictor = list(compute = FALSE),
             control.compute = list(dic = TRUE, waic = TRUE, cpo = TRUE, return.marginals.predictor = TRUE))

summary(res1)
```

```{r, echo=FALSE}
dic_res1 <- 536.47
waic_res1 <- 548.07
```

When we only set the equation to account for IID random effects, we observe the intercept \(\beta_0\)= 1.895 with a 95% credible interval equal to (1.518, 2.254). The coefficient of AFF is \(\beta_1\)= 1.173 with a 95% credible interval equal to (-2.098, 4.475). This indicates that there is evidence that an increase in AFF is associated with an increase in lip cancer risk, on average.

In addition, we see that the precision for the \(v_i\) terms has a mean of 1.91.

Next, we'll implement a Besag + IID model. To ensure the inclusion of the \(v_i\) terms, we'll need to add an additional specification for them.

```{r, echo=FALSE}
set.seed(123)

# Define the formula for the second model
formula2 <- Y ~ AFF + f(idareau, model = "besag", graph = g, scale.model = TRUE) +
  f(idareav, model = "iid")

# Fit the second INLA model

res2 <- inla(formula2,
             family = "poisson", data = map_with_data, E = map_with_data$E,
             control.predictor = list(compute = TRUE),
             control.compute = list(dic = TRUE, waic = TRUE, cpo = TRUE, return.marginals.predictor = TRUE))

summary(res2)
```

```{r, echo=FALSE}
dic_res2 <- 299.53
waic_res2 <- 298.25
```

When we use the Besag + IID model, we observe the intercept \(\beta_0\)=-0.306 with a 95% credible interval equal to (-0.538, -0.069), and the coefficient of AFF is \(\beta_1\)= 4.314 with a 95% credible interval equal to (1.751, 6.762). This indicates that there is evidence that an increase in AFF is associated with an increase in lip cancer risk, on average.

In addition, we see that the precision for the \(u_i\) terms has a mean of 4.14 and the precision for the \(v_i\) terms has a mean of 22,001.62.


We'll now implement the BYM model. Note that because the \(u_i\)  and \(v_i\) terms are jointly distributed, we do not need to add a specification for the \(v_i\) terms.

```{r, echo=FALSE}
set.seed(123)

# Define the formula for the third model
formula3 <- Y ~ AFF + f(idareau, model = "bym", graph = g)

# Fit the third INLA model

res3 <- inla(formula3,
             family = "poisson", data = map_with_data, E = map_with_data$E,
             control.predictor = list(compute = TRUE),
             control.compute = list(dic = TRUE, waic = TRUE, cpo = TRUE, return.marginals.predictor = TRUE))


# Summarize the results of the third model
summary(res3)
```

When we use the BYM model, we observe the intercept \(\beta_0\)= -0.318 with a 95% credible interval equal to (-0.553, -0.076), and the coefficient of AFF is \(\beta_1\)= 4.276 with a 95% credible interval equal to (1.643, 6.777). This indicates that there is evidence that an increase in AFF is associated with an increase in lip cancer risk, on average.

In addition, we see that the precision for the \(u_i\) terms has a mean of 2.64 and the precision for the \(v_i\) terms has a mean of 2215.43. There is a significant decrease in the precision of the \(v_i\) terms compared to the Besag + IID model. This is because we are jointly considering the random effects, thus allowing the \(u_i\) terms to account for some of the variation.

```{r, echo=FALSE}
dic_res3 <- 298.64
waic_res3 <- 296.43
```

Next, we'll examine the \(\beta_1\) distributions across all three models: IID, Besag + IID, and BYM.

```{r, echo=FALSE}
# Create the marginal data frames
marginal1 <- inla.smarginal(res1$marginals.fixed$AFF)
marginal2 <- inla.smarginal(res2$marginals.fixed$AFF)
marginal3 <- inla.smarginal(res3$marginals.fixed$AFF)

# Convert to data frames and add a model identifier
marginal1 <- data.frame(marginal1)
marginal1$model <- "Model 1"

marginal2 <- data.frame(marginal2)
marginal2$model <- "Model 2"

marginal3 <- data.frame(marginal3)
marginal3$model <- "Model 3"

# Combine the data frames
marginal_combined <- rbind(marginal1, marginal2, marginal3)

# Plot the combined data frame
ggplot(marginal_combined, aes(x = x, y = y, color = model)) +
  geom_line() +
  labs(x = expression(beta[1]), y = "Density") +
  geom_vline(xintercept = 0, col = "black") +
  theme_bw() +
  theme(legend.title = element_blank())
```

When looking at the graph, it is clear that the distribution for the \(\beta_1\) value for the IID model is distinct from the others. This suggests that our `res2` (Besag + IID) and `res3` (BYM) models account for the spatial autocorrelation effectively. Despite their differences, all three models showed a positive correlation between SIR and AFF on average, suggesting those in AFF are at risk for lip cancer.

```{r, echo=FALSE}
# Create a data frame for the model summary
model_summary <- data.frame(
  Model = c("IID", "BESAG + IID", "BYM"),
  DIC = c(dic_res1, dic_res2, dic_res3),
  WAIC = c(waic_res1, waic_res2, waic_res3)
)

# Create kable table
kable(model_summary, caption = "Summary of Model Comparison Metrics", format = "markdown")
```

In order to compare models, we'll rely on the DIC and WAIC metrics. DIC is simply a generalization of the familiar AIC metric. It is used to compare the goodness-of-fit with the complexity of the model, with smaller values indicating a superior model. WAIC aims to measure predictive performance and is based on the idea of LOOCV. Lower WAIC values indicate better predictive performance. We see that the BYM emerges as the superior model. Furthermore, the significant decrease in the DIC and WAIC values with the inclusion of the spatial effects indicates the presence of spatial autocorrelation in the data.


We see that the BYM is our best model. So, let us look at a map of the predicted Relative Risk (RR) of the BYM model to visualize our findings.
```{r, echo=FALSE}
 # Assuming map_with_data contains the columns RR, LL, and UL from the INLA results
map_with_data$RR <- res3$summary.fitted.values[, "mean"]
map_with_data$LL <- res3$summary.fitted.values[, "0.025quant"]
map_with_data$UL <- res3$summary.fitted.values[, "0.975quant"]


# Create a color palette for the Relative Risks (RR)
pal <- colorNumeric(palette = "YlOrRd", domain = map_with_data$RR)


# Create labels for the leaflet map
labels <- sprintf(
  "<strong> %s </strong> <br/> Observed: %s <br/> Expected: %s <br/> AFF: %s <br/> SIR: %s <br/> RR: %s (%s, %s)",
  map_with_data$county, map_with_data$Y, round(map_with_data$E, 2),
  map_with_data$AFF, round(map_with_data$SIR, 2), round(map_with_data$RR, 2),
  round(map_with_data$LL, 2), round(map_with_data$UL, 2)
) %>% lapply(htmltools::HTML)


# Create the leaflet map

lRR <- leaflet(map_with_data) %>%
  addTiles() %>%
  addPolygons(
    color = "grey", weight = 1, fillColor = ~ pal(RR),
    fillOpacity = 0.5,
    highlightOptions = highlightOptions(weight = 4),
    label = labels,
    labelOptions = labelOptions(
      style = list("font-weight" = "normal", padding = "3px 8px"),
      textsize = "15px", direction = "auto"
    )
  ) %>%
  addLegend(
    pal = pal, values = ~RR, opacity = 0.5, title = "RR",
    position = "bottomright"
  )

# Display the leaflet map
lRR
```

We see that our prior about the northern parts of Scotland being more at risk than southern parts to be well founded, although the ranges for most counties are relatively quite large.

Now, let us use our posterior results from the BYM model to see the probability of a county having a relative risk (`RR`) greater than 2.

```{r, echo=FALSE}

# Calculate and assign the marginalized values
marg <- res3$marginals.fitted.values[[1]]
exc <- sapply(res3$marginals.fitted.values, function(marg) {1 - inla.pmarginal(q = 2, marginal = marg)})

# Assign the calculated values to the `exc` column in `map_with_data`
map_with_data$exc <- exc
```

```{r, echo=FALSE}
# Create a color palette for the probability of RR > 2 (exc)
pal <- colorNumeric(palette = "YlOrRd", domain = map_with_data$exc)

# Create labels for the leaflet map
labels <- sprintf(
  "<strong> %s </strong> <br/>
  Observed: %s <br/> Expected: %s <br/>
  AFF: %s <br/> SIR: %s <br/> RR: %s (%s, %s) <br/> P(RR > 2): %s",
  map_with_data$county, map_with_data$Y, round(map_with_data$E, 2),
  map_with_data$AFF, round(map_with_data$SIR, 2), round(map_with_data$RR, 2),
  round(map_with_data$LL, 2), round(map_with_data$UL, 2), round(map_with_data$exc, 2)
) %>% lapply(htmltools::HTML)

# Create the leaflet map
lexc <- leaflet(map_with_data) %>%
  addTiles() %>%
  addPolygons(
    color = "grey", weight = 1, fillColor = ~pal(exc),
    fillOpacity = 0.5,
    highlightOptions = highlightOptions(weight = 4),
    label = labels,
    labelOptions = labelOptions(
      style = list(
        "font-weight" = "normal",
        padding = "3px 8px"
      ),
      textsize = "15px", direction = "auto"
    )
  ) %>%

  addLegend(
    pal = pal, values = ~exc, opacity = 0.5, title = "P(RR > 2)",
    position = "bottomright"
  )

# Display the leaflet map
lexc
```

This map provides evidence of excess risk within individual areas. In areas with probabilities close to 1, it is very likely that the relative risk exceeds 2. Conversely, areas with probabilities close to 0 correspond to areas where it is very unlikely that the relative risk exceeds 2. Areas with probabilities around 0.5 have the highest uncertainty and correspond to areas where the relative risk is equally likely to be below or above 2. We observe that the counties in the north of Scotland are the counties where it is most likely that the relative risk exceeds 2. This map also furthers the notion that our data suffers from autocorrelation.

## Discussion/Conclusion

Our study utilized Bayesian spatial models to analyze patterns of lip cancer in Scotland. We found significant evidence of autocorrelation, meaning that neighboring areas had similar SIR values. Model comparison metrics indicated that the BYM framework emerged as the superior model due to its joint modeling of the spatially structured and unstructured random effects.

Findings indicate that increased labor force participation in AFF activities is associated with increased lip cancer rates. This implies the presence of occupational health risks within the sector, such as increased UV exposure. Although, a major limitation in our study is the lack of additional covariates such as tobacco use that may help to improve the model. Other suggestions include expanding the study area and implementing regular lattice methodologies.

## References

* https://inla.r-inla-download.org/r-inla.org/doc/latent/bym.pdf
* https://inla.r-inla-download.org/r-inla.org/doc/latent/besag.pdf
* https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3795684/
* https://becarioprecario.bitbucket.io/inla-gitbook/ch-spatial.html
* https://www.paulamoraga.com/book-spatial/bayesian-spatial-models.html
* https://www.paulamoraga.com/book-geospatial/sec-arealdataexamplespatial.html#data-and-map
* https://www.paulamoraga.com/book-spatial/geostatistical-data-1.html
* https://becarioprecario.bitbucket.io/inla-gitbook/ch-INLA.html
* https://www.paulamoraga.com/book-spatial/r-packages-to-download-open-spatial-data.html#population-health-and-other-spatial-data
* https://cran.r-project.org/web/packages/SpatialEpi/SpatialEpi.pdf
* https://jakubnowosad.com/spData/#spdep
* https://www.google.com/search?q=where+do+giraffes+live+in+africa&oq=where+do+gira&gs_lcrp=EgZjaHJvbWUqBwgCEAAYgAQyBwgAEAAYgAQyBggBEEUYOTIHCAIQABiABDIHCAMQABiABDIHCAQQABiABDIHCAUQABiABDIHCAYQABiABDIHCAcQABiABDIHCAgQABiABDIHCAkQABiABNIBCDYzMjRqMGo3qAIAsAIA&sourceid=chrome&ie=UTF-8#ip=1

## Appendix

```{r, fig.cap="Raster Versus Polygon Data",, echo=TRUE, eval=FALSE, out.width = '80%', fig.align='center'}
knitr::include_graphics("rastpol.png")
```

```{r, echo=TRUE, eval=FALSE}
df <- occ(query = "Giraffa camelopardalis", from = "gbif",
          date = c("2000-01-01", "2019-12-31"),
          gbifopts = list(continent = "Africa"),
          has_coords = TRUE, limit = 1000)
d <- occ2df(df)
```

```{r, echo=TRUE, eval=FALSE}
giraffe_spotting <- st_as_sf(d, coords = c("longitude", "latitude"))
st_crs(giraffe_spotting) <- 4326
mapview(giraffe_spotting)
```

```{r, fig.cap="Binary Adjacency Matrix for Irregular Lattice Data", echo=TRUE, eval=FALSE, out.width = '80%', fig.align='center'}
knitr::include_graphics("A-simple-Bayesian-network-structure-and-its-adjacency-matrix.png")
```

```{r, echo=TRUE, eval=FALSE}
data(scotland)
```


```{r, echo=TRUE, eval=FALSE}
library(SpatialEpi)
library(sf)
library(dplyr)
library(leaflet)

# Convert sp object to sf and transform CRS
map <- scotland$spatial.polygon
map_sf <- st_as_sf(map)
map_sf <- st_set_crs(map_sf, 27700)
map_sf_transformed <- st_transform(map_sf, crs = 27700)

# Transform to WGS84 (longitude and latitude)
map_sf_wgs84 <- st_transform(map_sf_transformed, crs = 4326)

# Prepare the data
d <- scotland$data[, c("county.names", "cases", "expected", "AFF")]
names(d) <- c("county", "Y", "E", "AFF")
d$SIR <- d$Y / d$E

# Add county names to the map_sf_wgs84 object
map_sf_wgs84$county <- scotland$data$county.names

# Combine the data with the spatial polygons using dplyr
map_with_data <- left_join(map_sf_wgs84, d, by = "county")

# Check the combined data
print(head(map_with_data))

# Create the leaflet map
l <- leaflet(map_with_data) %>% addTiles()

# Define color palette
pal <- colorNumeric(palette = "YlOrRd", domain = map_with_data$SIR)

# Define labels
labels <- sprintf(
  "<strong>%s</strong><br/>Observed: %s<br/>Expected: %s<br/>AFF: %s<br/>SIR: %s",
  map_with_data$county, map_with_data$Y, round(map_with_data$E, 2),
  map_with_data$AFF, round(map_with_data$SIR, 2)
) %>% lapply(htmltools::HTML)

# Add polygons and labels to the leaflet map
l %>%
  addPolygons(
    color = "grey", weight = 1,
    fillColor = ~pal(SIR), fillOpacity = 0.5,
    highlightOptions = highlightOptions(weight = 4),
    label = labels,
    labelOptions = labelOptions(
      style = list(
        "font-weight" = "normal",
        padding = "3px 8px"
      ),
      textsize = "15px", direction = "auto"
    )
  ) %>%

  addLegend(
    pal = pal, values = ~SIR, opacity = 0.5,
    title = "SIR", position = "bottomright"
  )
```

```{r, echo=TRUE, eval=FALSE}
# Create a neighbor list from polygon list
nb <- poly2nb(map_with_data)
head(nb)

# Convert neighbor list to INLA format
nb2INLA("map.adj", nb)
g <- inla.read.graph(filename = "map.adj")

# Add area identifiers to the data frame
map_with_data$idareau <- 1:nrow(map_with_data)
map_with_data$idareav <- 1:nrow(map_with_data)

# Check structure of map_with_data
str(map_with_data)
```

```{r, echo=TRUE, eval=FALSE}
# Define the formula for the first model
formula1 <- Y ~ AFF   +
  f(idareav, model = "iid")

# Fit the first INLA model
res1 <- inla(formula1,
             family = "poisson", data = map_with_data,
             control.predictor = list(compute = FALSE),
             control.compute = list(dic = TRUE, waic = TRUE, cpo = TRUE, return.marginals.predictor = TRUE))

summary(res1)
```

```{r, echo=TRUE, eval=FALSE}
set.seed(123)

# Define the formula for the second model
formula2 <- Y ~ AFF + f(idareau, model = "besag", graph = g, scale.model = TRUE) +
  f(idareav, model = "iid")

# Fit the second INLA model

res2 <- inla(formula2,
             family = "poisson", data = map_with_data, E = map_with_data$E,
             control.predictor = list(compute = TRUE),
             control.compute = list(dic = TRUE, waic = TRUE, cpo = TRUE, return.marginals.predictor = TRUE))


summary(res2)
```

```{r, echo=TRUE, eval=FALSE}
dic_res2 <- 299.53
waic_res2 <- 298.25
```

```{r, echo=TRUE, eval=FALSE}
dic_res1 <- 536.47
waic_res1 <- 548.07
```

```{r, echo=TRUE, eval=FALSE}
set.seed(123)

# Define the formula for the second model
formula2 <- Y ~ AFF + f(idareau, model = "besag", graph = g, scale.model = TRUE) +
  f(idareav, model = "iid")

# Fit the second INLA model

res2 <- inla(formula2,
             family = "poisson", data = map_with_data, E = map_with_data$E,
             control.predictor = list(compute = TRUE),
             control.compute = list(dic = TRUE, waic = TRUE, cpo = TRUE, return.marginals.predictor = TRUE))

summary(res2)
```

```{r, echo=TRUE, eval=FALSE}
dic_res2 <- 299.53
waic_res2 <- 298.25
```

```{r, echo=TRUE, eval=FALSE}
set.seed(123)

# Define the formula for the third model
formula3 <- Y ~ AFF + f(idareau, model = "bym", graph = g)

# Fit the third INLA model

res3 <- inla(formula3,
             family = "poisson", data = map_with_data, E = map_with_data$E,
             control.predictor = list(compute = TRUE),
             control.compute = list(dic = TRUE, waic = TRUE, cpo = TRUE, return.marginals.predictor = TRUE))


# Summarize the results of the third model
summary(res3)
```

```{r, echo=TRUE, eval=FALSE}
dic_res3 <- 298.64
waic_res3 <- 296.43
```

```{r, echo=TRUE, eval=FALSE}
# Create the marginal data frames
marginal1 <- inla.smarginal(res1$marginals.fixed$AFF)
marginal2 <- inla.smarginal(res2$marginals.fixed$AFF)
marginal3 <- inla.smarginal(res3$marginals.fixed$AFF)

# Convert to data frames and add a model identifier
marginal1 <- data.frame(marginal1)
marginal1$model <- "Model 1"

marginal2 <- data.frame(marginal2)
marginal2$model <- "Model 2"

marginal3 <- data.frame(marginal3)
marginal3$model <- "Model 3"

# Combine the data frames
marginal_combined <- rbind(marginal1, marginal2, marginal3)

# Plot the combined data frame
ggplot(marginal_combined, aes(x = x, y = y, color = model)) +
  geom_line() +
  labs(x = expression(beta[1]), y = "Density") +
  geom_vline(xintercept = 0, col = "black") +
  theme_bw() +
  theme(legend.title = element_blank())
```

```{r, echo=TRUE, eval=FALSE}
 # Assuming map_with_data contains the columns RR, LL, and UL from the INLA results
map_with_data$RR <- res3$summary.fitted.values[, "mean"]
map_with_data$LL <- res3$summary.fitted.values[, "0.025quant"]
map_with_data$UL <- res3$summary.fitted.values[, "0.975quant"]


# Create a color palette for the Relative Risks (RR)
pal <- colorNumeric(palette = "YlOrRd", domain = map_with_data$RR)


# Create labels for the leaflet map
labels <- sprintf(
  "<strong> %s </strong> <br/> Observed: %s <br/> Expected: %s <br/> AFF: %s <br/> SIR: %s <br/> RR: %s (%s, %s)",
  map_with_data$county, map_with_data$Y, round(map_with_data$E, 2),
  map_with_data$AFF, round(map_with_data$SIR, 2), round(map_with_data$RR, 2),
  round(map_with_data$LL, 2), round(map_with_data$UL, 2)
) %>% lapply(htmltools::HTML)


# Create the leaflet map

lRR <- leaflet(map_with_data) %>%
  addTiles() %>%
  addPolygons(
    color = "grey", weight = 1, fillColor = ~ pal(RR),
    fillOpacity = 0.5,
    highlightOptions = highlightOptions(weight = 4),
    label = labels,
    labelOptions = labelOptions(
      style = list("font-weight" = "normal", padding = "3px 8px"),
      textsize = "15px", direction = "auto"
    )
  ) %>%
  addLegend(
    pal = pal, values = ~RR, opacity = 0.5, title = "RR",
    position = "bottomright"
  )

# Display the leaflet map
lRR
```

```{r, echo=TRUE, eval=FALSE}

# Calculate and assign the marginalized values
marg <- res3$marginals.fitted.values[[1]]
exc <- sapply(res3$marginals.fitted.values, function(marg) {1 - inla.pmarginal(q = 2, marginal = marg)})

# Assign the calculated values to the `exc` column in `map_with_data`
map_with_data$exc <- exc
```

```{r, echo=TRUE, eval=FALSE}
# Create a color palette for the probability of RR > 2 (exc)
pal <- colorNumeric(palette = "YlOrRd", domain = map_with_data$exc)

# Create labels for the leaflet map
labels <- sprintf(
  "<strong> %s </strong> <br/>
  Observed: %s <br/> Expected: %s <br/>
  AFF: %s <br/> SIR: %s <br/> RR: %s (%s, %s) <br/> P(RR > 2): %s",
  map_with_data$county, map_with_data$Y, round(map_with_data$E, 2),
  map_with_data$AFF, round(map_with_data$SIR, 2), round(map_with_data$RR, 2),
  round(map_with_data$LL, 2), round(map_with_data$UL, 2), round(map_with_data$exc, 2)
) %>% lapply(htmltools::HTML)

# Create the leaflet map
lexc <- leaflet(map_with_data) %>%
  addTiles() %>%
  addPolygons(
    color = "grey", weight = 1, fillColor = ~pal(exc),
    fillOpacity = 0.5,
    highlightOptions = highlightOptions(weight = 4),
    label = labels,
    labelOptions = labelOptions(
      style = list(
        "font-weight" = "normal",
        padding = "3px 8px"
      ),
      textsize = "15px", direction = "auto"
    )
  ) %>%

  addLegend(
    pal = pal, values = ~exc, opacity = 0.5, title = "P(RR > 2)",
    position = "bottomright"
  )

# Display the leaflet map
lexc
```

```{r, echo=FALSE}
# running this code chunk will display the results, but not show the code
# include code only in the Appendix (see below)
```

```{r, echo=FALSE, fig.align='center', fig.height=4, fig.width=4, fig.cap="Write figure caption here"}
# adjust figure height, width, and caption as required
```

```{r, echo=TRUE, eval=FALSE}
# running this code chunk will show the code, but the results will not be displayed
```

```{r, echo=TRUE, fig.align='center', fig.height=4, fig.width=4, fig.cap="Write figure caption here"}
# for any additional figures
```

```{}
inla.list.models()
```