--- title: "hhh4addon: extending the functionality of surveillance:hhh4" author: "Johannes Bracher (johannes.bracher@uzh.ch), University of Zurich" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Vignette Title} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ## Purpose of the `R` package `hhh4addon` The `R` package `hhh4addon` extends the functionality of the `surveillance` package (Meyer et al 2017), more specifically the implementation of the endemic-epidemic model class in the function `hhh4`. It adds the following features: - Fitting models with higher-order lags. - Computation of predictive and marginal (stationary/periodically stationary) first and second moments. As `hhh4addon` can only be used in combination with `surveillance` we assume familiarity with this package and in particular the `hhh4` function in the following. ## The endemic-epidemic model class and its extension to higher-order lags We only give a brief description of the endemic-epidemic framework and the `hhh4` function, details can be found in Meyer et al (2017) and the vignettes `vignette("hhh4")` and `vignette("hhh4_spacetime")` from the `surveillance` package. Counts $X_{it}$ from units $i = 1, ..., m$ and time $t$ are modelled as $$ X_{it} \mid \mathbf{X}_{t - 1} \sim \text{NegBin}(\mu_{it}, \psi_i); X_{it} \bot X_{jt} \mid \mathbf{X}_{t - 1} $$ $$ \mu_{it} = e_{it}\nu_{it} + \lambda_{it}X_{i, t - 1} + \phi_{it}\sum_{j \neq i} \lfloor w_{ji}\rfloor X_{j, t - 1}. $$ Here, the negative binomial distribution is parametrized by its mean $\mu_{it}$ and an overdispersion parameter $\psi_i$ so that $\text{Var}(X_t \mid \mathbf{X}_{t - 1}) = \mu_{it}\cdot(1 + \psi_i\mu_{it})$. The term $e_{it}\nu_{it}$ is referred to as the endemic component of incidence, where $e_{it}$ is a population offset. The remaining autoregressive terms form the epidemic component, with $\lambda_{it}X_{i, t - 1}$ often called the autoregressive part and $\phi_{it}\sum_{j \neq i} w_{ji}X_{j, t - 1}$ the neighbourhood part. Various specifications for the normalized weights $\lfloor w_{ij}\rfloor$ exist, see `vignette("hhh4_spacetime")` from the surveillance package. The parameters $\nu_{it}, \lambda_{it}$ and $\phi_{it}$ are themselves modelled in a log-linear fashion. While in principle covariates can enter here, it is common to include only an intercept and sine/cosine terms for seasonality, e.g. $$ \log(\nu_{it}) = \alpha^{(\nu)}_i + \beta^{(\nu)}_i \sin(2\pi t/\omega) + \gamma^{(\nu)}_i \cos(2\pi t/\omega), $$ $$ \log(\lambda_{it}) = \alpha^{(\lambda)}_i + \beta^{(\lambda)}_i \sin(2\pi t/\omega) + \gamma^{(\lambda)}_i \cos(2\pi t/\omega), $$ $$ \log(\phi_{it}) = \alpha^{(\phi)}_i + \beta^{(\phi)}_i \sin(2\pi t/\omega) + \gamma^{(\phi)}_i \cos(2\pi t/\omega). $$ Models of this type can be fitted using the function `hhh4` from `surveillance`. Numerous aspects of the model can be specified via the `control` list, with the parametrization of $\nu_{it}$, $\lambda_{it}$ and $\phi_{it}$ steered by the elements `end`, `ar` and `ne`. Whether unit-specific parameters are necessary and identifiable depends on the data at hand, see the vignettes from `surveillance` for more information on model building. The additional functionality offered by `hhh4addon` is the inclusion of higher-order lags, i.e. 
The additional functionality offered by `hhh4addon` is the inclusion of higher-order lags: it provides methods to fit models of the form

$$
X_{it} \mid \mathbf{X}_{t - 1}, ..., \mathbf{X}_{t - D} \sim \text{NegBin}(\mu_{it}, \psi_i)
$$
$$
\mu_{it} = \nu_{it} + \lambda_{it}\sum_{d = 1}^D \lfloor u_d\rfloor X_{i, t - d} + \phi_{it}\sum_{j\neq i}\sum_{d = 1}^D \lfloor w_{ji}\rfloor \lfloor u_d\rfloor X_{j, t - d}
$$

where the weights $u_d$ are normalized so that $\sum_{d = 1}^D \lfloor u_d\rfloor = 1$. This means that instead of only the previous observation $\mathbf{X}_{t - 1}$, a weighted average of the $D$ preceding observations $\mathbf{X}_{t - 1},\dots, \mathbf{X}_{t - D}$ enters. Note that the weights $u_d$ are shared between the autoregressive and neighbourhood components; it is not possible to specify them separately.

Currently four parameterizations of the weights $u_d$ are implemented. The default is a geometric, i.e. exponentially decaying, lag weighting structure
$$
u_d = \alpha(1 - \alpha)^{d - 1}, \ \ \alpha \in (0, 1).
$$
A second option is a (shifted) Poisson weighting,
$$
u_d = \frac{\alpha^{d - 1}}{(d - 1)!}\exp(-\alpha), \ \ \alpha > 0,
$$
which unlike the geometric formulation does not force the first lag weight $u_1$ to be the largest. Thirdly, there are linearly decaying lag weights,
$$
u_d = \max(1 - \alpha d, 0), \ \ \alpha \in (0, 1).
$$
The last pre-implemented lag structure is a simple AR(2) version with
$$
u_1 = \alpha,\ \ \ u_2 = 1 - \alpha, \ \ \ \alpha \in (0, 1).
$$
A numerical illustration of the normalized geometric weights is given below, once the example data have been introduced.

When `hhh4addon` is loaded, a modified version `hhh4_lag` of `hhh4` is available. The following additional specifications can be made in the `control` list:

- `funct_lag`: a function to compute lag weights from a (scalar) parameter `par_lag`. Moreover, the function needs to take the smallest and largest lags to receive positive weights as arguments (`min_lag` and `max_lag`). The four parameterizations mentioned above can be specified via the functions `geometric_lag`, `poisson_lag`, `linear_lag` and `ar2_lag`. Alternatively, a user-defined function can be provided; in this case we recommend consulting the source code of e.g. `geometric_lag` and adapting it accordingly.
- `par_lag`: the weighting parameter entering into `funct_lag`; for `geometric_lag`, `linear_lag` and `ar2_lag` this is $\text{logit}(\alpha)$, for `poisson_lag` it is $\log(\alpha)$. These choices enable unconstrained optimization of the (profile) likelihood when $\alpha$ is estimated from the data (see below).
- `min_lag`: the lowest lag to receive a positive weight (the weights for lags 1 through `min_lag - 1` are forced to zero). The default value 1 should be kept in most cases.
- `max_lag`: the highest lag to be included. Note that the `subset` specified in `control` needs to be compatible with `max_lag` (specifically, `subset[1] > max_lag` needs to hold).

The return object of `hhh4_lag` is an object of class `hhh4lag`, which inherits from the regular `hhh4` class. We exemplify this extension with a simple univariate analysis of the `salmonella.agona` data from `surveillance` (see plot below). All syntax also translates to the multivariate case.

```{r, fig.show='hold', message=FALSE}
library(surveillance)
library(hhh4addon)
data("salmonella.agona") # get data
# convert old "disProg" to new "sts" data class:
salmonella <- disProg2sts(salmonella.agona)
# plot data:
plot(salmonella)
```
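As announced above, the following minimal sketch computes the normalized geometric lag weights $\lfloor u_d\rfloor$ directly from their definition (plain base R, no package internals; the values $\alpha = 0.8$ and $D = 5$ match the model fitted below):

```{r}
# normalized geometric lag weights for alpha = 0.8 and D = 5, computed
# directly from u_d = alpha * (1 - alpha)^(d - 1):
alpha <- 0.8
D <- 5
u <- alpha * (1 - alpha)^(0:(D - 1)) # raw weights u_1, ..., u_D
round(u / sum(u), 3)                 # normalized weights, summing to 1
```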
First we fit a univariate `hhh4` model with only first lags and seasonality in both the endemic (`end`) and epidemic/autoregressive (`ar`) component, i.e.

$$
\mu_t = \nu_t + \lambda_t X_{t - 1}
$$
$$
\log(\nu_t) = \alpha^{(\nu)} + \beta^{(\nu)} \sin(2\pi t/\omega) + \gamma^{(\nu)} \cos(2\pi t/\omega)
$$
$$
\log(\lambda_t) = \alpha^{(\lambda)} + \beta^{(\lambda)} \sin(2\pi t/\omega) + \gamma^{(\lambda)} \cos(2\pi t/\omega)
$$

```{r, message=FALSE, warning=FALSE}
control_salmonella <- list(end = list(f = addSeason2formula(~ 1)),
                           ar = list(f = addSeason2formula(~ 1)),
                           family = "NegBin1", subset = 6:312)
# Note: we fit to the subset 6:312 to ensure comparability with the
# higher-order lag models fitted next.
fit_salmonella_ar1 <- hhh4(salmonella, control_salmonella) # use regular hhh4 function
AIC(fit_salmonella_ar1)
```

Next we fit a higher-order lag model with
$$
\mu_t = \nu_t + \lambda_t\sum_{d = 1}^5 \lfloor u_d\rfloor X_{t - d},
$$
geometric lag weights $u_d$ and a fixed value of $\alpha = 0.8$. We fit the model to the data from week 6 onwards.

```{r}
par_lag_08 <- log(0.8/(1 - 0.8)) # the par_lag value corresponding to alpha = 0.8
control_salmonella_08 <- list(end = list(f = addSeason2formula(~ 1)),
                              ar = list(f = addSeason2formula(~ 1)),
                              funct_lag = geometric_lag, # default
                              max_lag = 5, # default
                              par_lag = par_lag_08, # new parameter
                              family = "NegBin1", subset = 6:312)
# Note that funct_lag = geometric_lag and max_lag = 5 are the defaults in
# hhh4_lag and would not need to be specified explicitly.
fit_salmonella_geom_08 <- hhh4_lag(salmonella, control_salmonella_08) # now use hhh4_lag
plot(fit_salmonella_geom_08, names = "")
AIC(fit_salmonella_geom_08)
```

We can see that in terms of AIC this model is better than the previously fitted model with only first lags. To estimate `par_lag` (i.e. a suitable transformation of $\alpha$) from the data we can use the wrapper `profile_par_lag`, which re-fits the model for different values of `par_lag` and uses numerical optimization (`optim`) to find the optimal one; $\alpha$ is thus estimated via a profile likelihood approach.

```{r}
control_salmonella_geom <- list(end = list(f = addSeason2formula(~ 1)),
                                ar = list(f = addSeason2formula(~ 1)),
                                funct_lag = geometric_lag, max_lag = 5,
                                family = "NegBin1", subset = 6:312)
fit_salmonella_geom <- profile_par_lag(salmonella, control_salmonella_geom)
AIC(fit_salmonella_geom)
summary(fit_salmonella_geom)
```

The best fit is achieved with $\alpha = 0.56$, i.e. almost half of the epidemic component's contribution comes from lags of order larger than one (the normalized lag weights are stored in `fit_salmonella_geom$distr_lag`). An (older) alternative for estimating $\alpha$ is `fit_par_lag`, which, instead of applying `optim`, fits the model for a user-provided vector of values for `par_lag` (argument `range_par`). Under certain circumstances this is computationally faster than the use of `optim`.

```{r}
grid_alpha <- seq(from = 0.01, to = 0.99, by = 0.02)
grid_par_lag <- log(grid_alpha/(1 - grid_alpha)) # move to the logit scale
fit_salmonella_geom_grid <- fit_par_lag(salmonella, control_salmonella_geom,
                                        range_par = grid_par_lag)
```
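As the transformation between $\alpha$ and `par_lag` recurs throughout this vignette, it can be convenient to wrap it in small helper functions. A minimal sketch (these helpers are hypothetical conveniences, not part of `hhh4addon`):

```{r}
# hypothetical helpers mirroring the transformations used above (not part of hhh4addon):
logit <- function(alpha) log(alpha / (1 - alpha))      # alpha -> par_lag (geometric, linear, ar2)
inv_logit <- function(par_lag) 1 / (1 + exp(-par_lag)) # par_lag -> alpha
logit(0.8)             # equals par_lag_08 from above
inv_logit(logit(0.56))
```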
The function `fit_par_lag` returns a list containing the best model (`$best_mod`) and the AIC values of the models corresponding to the different values of `par_lag` (`$AICs`). We can thus plot the AIC as a function of $\alpha$.

```{r}
plot(grid_alpha, fit_salmonella_geom_grid$AICs, type = "l",
     xlab = expression(alpha), ylab = "AIC")
```

A remark on the computation of the AIC values: the AIC of a model which was fitted with `fit_par_lag` or `profile_par_lag` is 2 points higher than that of a model with the same value of `par_lag` specified manually instead of being estimated. This is because the estimated weighting parameter counts as an additional degree of freedom.

```{r}
par_lag_0.56 <- log(0.56/(1 - 0.56)) # the par_lag value corresponding to alpha = 0.56
control_salmonella_geom.056 <- list(end = list(f = addSeason2formula(~ 1)),
                                    ar = list(f = addSeason2formula(~ 1)),
                                    par_lag = par_lag_0.56,
                                    family = "NegBin1", subset = 6:312)
fit_salmonella_geom.056 <- hhh4_lag(salmonella, control_salmonella_geom.056)
AIC(fit_salmonella_geom.056); AIC(fit_salmonella_geom); AIC(fit_salmonella_geom_grid$best_mod)
```

For comparison we also fit models with Poisson and AR(2) lags.

```{r}
# Poisson lags:
control_salmonella_pois <- control_salmonella
control_salmonella_pois$funct_lag <- poisson_lag
fit_salmonella_pois <- profile_par_lag(salmonella, control_salmonella_pois)
AIC(fit_salmonella_pois)
# AR(2) lags:
control_salmonella_ar2 <- control_salmonella
control_salmonella_ar2$funct_lag <- ar2_lag
fit_salmonella_ar2 <- profile_par_lag(salmonella, control_salmonella_ar2)
AIC(fit_salmonella_ar2)
```

These parameterizations lead to a slightly worse model fit than the geometric lag weights.

Most of the functionality for `hhh4` objects is by now also available for `hhh4lag` objects (as returned by `hhh4_lag`, `profile_par_lag` and `fit_par_lag`). For instance, we can simulate from a fitted model. Note that in this case the starting values need to be specified as a matrix with `max_lag` rows.

```{r}
# simulate 1000 trajectories for weeks 261 through 270, using the observations
# from weeks 256 through 260 as starting values:
set.seed(17)
sim <- simulate(fit_salmonella_geom, subset = 261:270,
                y.start = fit_salmonella_geom$stsObj@observed[256:260, , drop = FALSE],
                nsim = 1000, simplify = TRUE)
# plot one trajectory:
plot(261:270, sim[, , 1], type = "h", xlab = "week", ylab = "No. infected")
```

Also, we can generate rolling one-week-ahead forecasts (i.e. from models which are iteratively re-fitted each week). This is done using the new function `hhh4addon::oneStepAhead_hhh4lag` (the function `surveillance::oneStepAhead` cannot be applied to `hhh4lag` objects and will throw an error). By default the lag weighting parameter `par_lag` is not re-fitted (as this can lead to quite long computation times); to re-fit it, set the argument `refit_par_lag` to `TRUE`. The result of `hhh4addon::oneStepAhead_hhh4lag` can be handled just like the return value of `surveillance::oneStepAhead`. We illustrate this for weeks 261--312 as the validation period.

```{r}
owa_forecasts_geom <- oneStepAhead_hhh4lag(fit_salmonella_geom, tp = c(260, 311),
                                           refit_par_lag = FALSE)
# forecasts are made for weeks tp[1] + 1 through tp[2] + 1, i.e. 261 through 312
colMeans(scores(owa_forecasts_geom)) # average scores
# for comparison: model with only first lags:
owa_forecasts_ar1 <- oneStepAhead(fit_salmonella_ar1, tp = c(260, 311))
colMeans(scores(owa_forecasts_ar1)) # average scores
```

The geometric-lag model thus performs slightly better in terms of the logarithmic and Dawid-Sebastiani scores, but worse in terms of the ranked probability score (all scores are negatively oriented, i.e. smaller values indicate better forecasts).
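For reference, the logarithmic score and the Dawid-Sebastiani score of a univariate predictive distribution $P$ with probability mass function $f_P$, mean $\mu_P$ and standard deviation $\sigma_P$, evaluated at an observed count $x$, are given by (Held et al 2017)

$$
\text{logS}(P, x) = -\log f_P(x), \ \ \ \text{DSS}(P, x) = \left(\frac{x - \mu_P}{\sigma_P}\right)^2 + 2\log(\sigma_P),
$$

so that smaller average scores indicate better forecasts.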
## Computing predictive and marginal moments

The second functionality of `hhh4addon` concerns predictive and marginal (stationary or periodically stationary) moments. These quantities can be computed analytically, i.e. without the need for simulation, for the endemic-epidemic class; see Held, Meyer and Bracher (2017) and Bracher and Held (2019) for the theoretical background. To illustrate the use of predictive moments we re-fit the above models to weeks 6--260 of the `salmonella.agona` data. The rest is again kept for validation of our predictions.

```{r}
control_salmonella.sub <- list(end = list(f = addSeason2formula(~ 1)),
                               ar = list(f = addSeason2formula(~ 1)),
                               family = "NegBin1", subset = 6:260)
fit_salmonella_ar1.sub <- hhh4(salmonella, control_salmonella.sub)
fit_salmonella_geom.sub <- profile_par_lag(salmonella, control_salmonella.sub)
```

Predictive moments for weeks 261, 262, ... can now be calculated and plotted. Note that these correspond to a *path forecast* rather than a weekly updated rolling forecast as before, i.e. the forecasts for weeks 261, 262, ... are all conditioned on the observation from week 260.

```{r}
pred_mom_ar1 <- predictive_moments(fit_salmonella_ar1.sub, t_condition = 260,
                                   lgt = 52, return_Sigma = TRUE)
# if return_Sigma = TRUE the full predictive covariance matrix is returned;
# this may be costly in terms of storage for complex models and long forecasts.
pred_mom_geom <- predictive_moments(fit_salmonella_geom.sub, t_condition = 260,
                                    lgt = 52, return_Sigma = TRUE)
plot(fit_salmonella_ar1.sub, names = "")
fanplot_prediction(pred_mom_geom, add = TRUE)
```

The fan plots shown here are based on negative binomial approximations of the predictive distributions via the first two moments. We can also use these predictive moments to evaluate the Dawid-Sebastiani score of the path forecasts (a proper scoring rule for predictive model assessment, Held et al 2017).

```{r}
# Note that ds_score_hhh4 requires the full predictive covariance matrix, i.e.
# return_Sigma = TRUE needs to be set in predictive_moments.
ds_score_hhh4(pred_mom_ar1)
ds_score_hhh4(pred_mom_geom)
```

The Dawid-Sebastiani score is negatively oriented, i.e. the lower value obtained for the model `fit_salmonella_geom.sub` with geometric lags indicates a slightly better path forecast than the one from the simpler first-order model `fit_salmonella_ar1.sub`.

The `predictive_moments` function can also be used to check that the simulation functions are doing the right thing. Here we can re-use the previously simulated data:

```{r}
cond_mom_geom_260 <- predictive_moments(fit_salmonella_geom, t_condition = 260, lgt = 10)
plot(261:270, apply(sim, 1:2, mean), xlab = "week", ylab = "predictive mean")
lines(261:270, cond_mom_geom_260$mu_matrix)
legend("topright", pch = c(1, NA), lty = c(NA, 1),
       legend = c("simulation-based", "analytical"), bty = "n")
plot(261:270, apply(sim, 1:2, sd), xlab = "week", ylab = "predictive standard deviation")
lines(261:270, sqrt(cond_mom_geom_260$var_matrix))
```

The agreement between the analytical results and the simulation-based estimates of the predictive moments is relatively poor with 1,000 simulated paths, but becomes very good with 10,000 (not done here as this is costly in terms of storage and computation time).
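For completeness, the larger simulation mentioned above could be run as follows (not evaluated here, for the storage and runtime reasons just stated; `sim_large` is only an illustrative name):

```{r, eval=FALSE}
# re-run the earlier simulation with 10,000 paths for a closer match with
# the analytical moments (not evaluated, as this is costly):
sim_large <- simulate(fit_salmonella_geom, subset = 261:270,
                      y.start = fit_salmonella_geom$stsObj@observed[256:260, , drop = FALSE],
                      nsim = 10000, simplify = TRUE)
```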
The function `stationary_moments` can be applied in the same way (without specifying a `t_condition` and `lgt`) to obtain the marginal moments of a fitted model. Note that this is only possible for models with either time-constant parameters (stationary moments) or periodically varying parameters (periodically stationary moments). For horizons exceeding a few weeks, the predictive and marginal moments are practically indistinguishable, as the forecast converges to the periodically stationary behaviour of the model.

```{r}
marg_mom_geom <- stationary_moments(fit_salmonella_geom)
fanplot_stationary(marg_mom_geom, timepoints = 1:52, add_legend = TRUE)
```

## References

Bracher, J. and L. Held (2017). Periodically stationary multivariate autoregressive models. Preprint: https://arxiv.org/abs/1707.04635.

Bracher, J. and L. Held (2019). Endemic-epidemic models with discrete-time serial interval distributions for infectious disease prediction. Preprint: https://arxiv.org/abs/1901.03090.

Held, L., S. Meyer, and J. Bracher (2017). Probabilistic forecasting in infectious disease epidemiology: the 13th Armitage lecture. *Statistics in Medicine* 36 (22), 3443–3460.

Meyer, S., L. Held, and M. Höhle (2017). Spatio-temporal analysis of epidemic phenomena using the R package surveillance. *Journal of Statistical Software* 77 (11), 1–55.