hhh4addon: extending the functionality of surveillance:hhh4

Purpose of the R package hhh4addon

The R package hhh4addon extends the functionality of the surveillance package (Meyer et al. 2017), more specifically the implementation of the endemic-epidemic model class in the function hhh4. It adds the following features:

  • Fitting models with higher-order lags.
  • Computation of predictive and marginal (stationary/periodically stationary) first and second moments.

As hhh4addon can only be used in combination with surveillance, we assume familiarity with that package, and in particular with the hhh4 function, in the following.

The endemic-epidemic model class and its extension to higher-order lags

We only give a brief description of the endemic-epidemic framework and the hhh4 function; details can be found in Meyer et al. (2017) and the vignettes vignette("hhh4") and vignette("hhh4_spacetime") from the surveillance package. Counts $X_{it}$ from units $i = 1, \dots, m$ and time points $t$ are modelled as

$$X_{it} \mid \mathbf{X}_{t - 1} \sim \mathrm{NegBin}(\mu_{it}, \psi_i), \quad X_{it} \perp X_{jt} \mid \mathbf{X}_{t - 1} \ \text{for} \ i \neq j,$$
$$\mu_{it} = e_{it}\nu_{it} + \lambda_{it}X_{i, t - 1} + \phi_{it}\sum_{j \neq i} w_{ji}X_{j, t - 1}.$$

Here, the negative binomial distribution is parametrized by its mean $\mu_{it}$ and an overdispersion parameter $\psi_i$ so that $\mathrm{Var}(X_{it} \mid \mathbf{X}_{t - 1}) = \mu_{it}(1 + \psi_i\mu_{it})$. The term $e_{it}\nu_{it}$ is referred to as the endemic component of incidence, where $e_{it}$ is a population offset. The remaining autoregressive terms form the epidemic component, with $\lambda_{it}X_{i, t - 1}$ often called the autoregressive part and $\phi_{it}\sum_{j \neq i} w_{ji}X_{j, t - 1}$ the neighbourhood part. Various specifications for the normalized weights $w_{ji}$ exist; see vignette("hhh4_spacetime") from the surveillance package. The parameters $\nu_{it}$, $\lambda_{it}$ and $\phi_{it}$ are themselves modelled in a log-linear fashion. While in principle covariates can enter here, it is common to include only an intercept and sine/cosine terms for seasonality, e.g.

$$\log(\nu_{it}) = \alpha_i^{(\nu)} + \beta_i^{(\nu)}\sin(2\pi t/\omega) + \gamma_i^{(\nu)}\cos(2\pi t/\omega),$$
$$\log(\lambda_{it}) = \alpha_i^{(\lambda)} + \beta_i^{(\lambda)}\sin(2\pi t/\omega) + \gamma_i^{(\lambda)}\cos(2\pi t/\omega),$$
$$\log(\phi_{it}) = \alpha_i^{(\phi)} + \beta_i^{(\phi)}\sin(2\pi t/\omega) + \gamma_i^{(\phi)}\cos(2\pi t/\omega),$$

where $\omega$ denotes the period of the data, e.g. $\omega = 52$ for weekly data. Models of this type can be fitted using the function hhh4 from surveillance. Numerous aspects of the model can be specified via the control list, with the parametrization of $\nu_{it}$, $\lambda_{it}$ and $\phi_{it}$ steered by the elements end, ar and ne. Whether unit-specific parameters are necessary and identifiable depends on the data at hand; see the vignettes from surveillance for more information on model building.
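
The sine/cosine terms above are what the helper addSeason2formula from surveillance generates (with its default period of 52 for weekly data); this helper is used in all model specifications below:

library(surveillance)
# add one pair of seasonal terms to an intercept-only formula:
addSeason2formula(~ 1)
# yields ~1 + sin(2 * pi * t/52) + cos(2 * pi * t/52), matching the
# coefficient names in the model summaries below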

The additional functionality offered by hhh4addon is the inclusion of higher-order lags, i.e. it provides methods to fit models of the form

$$X_{it} \mid \mathbf{X}_{t - 1}, \ldots, \mathbf{X}_{t - D} \sim \mathrm{NegBin}(\mu_{it}, \psi_i),$$
$$\mu_{it} = \nu_{it} + \lambda_{it}\sum_{d = 1}^{D} u_d X_{i, t - d} + \phi_{it}\sum_{j \neq i}\sum_{d = 1}^{D} w_{ji} u_d X_{j, t - d},$$

where the lag weights $u_d$ are normalized so that $\sum_{d = 1}^{D} u_d = 1$. This means that instead of the previous observation $\mathbf{X}_{t - 1}$, a weighted average of the $D$ preceding observations $\mathbf{X}_{t - 1}, \ldots, \mathbf{X}_{t - D}$ enters. Note that the weights $u_d$ are shared between the autoregressive and neighbourhood components and cannot be specified separately.

Currently four parameterizations of the weights $u_d$ are implemented (all given here up to the normalization described above); the sketch below illustrates them:

  • The default is a geometric, i.e. exponentially decaying lag weighting structure $u_d = \alpha(1 - \alpha)^{d - 1}$, $\alpha \in (0, 1)$.
  • A second option is a (shifted) Poisson weighting, $u_d = \frac{\alpha^{d - 1}}{(d - 1)!}\exp(-\alpha)$, $\alpha > 0$, which unlike the geometric formulation does not force the first lag weight $u_1$ to be the largest.
  • Thirdly, there are linearly decaying lag weights, $u_d = \max(1 - \alpha d, 0)$, $\alpha \in (0, 1)$.
  • The last pre-implemented lag structure is a simple AR(2) version with $u_1 = \alpha$, $u_2 = 1 - \alpha$, $\alpha \in (0, 1)$.
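
To make these parameterizations concrete, the following minimal sketch computes normalized weight vectors directly from the formulas above (plain R; the helper names are our own and not part of the hhh4addon API):

# normalized geometric lag weights, u_d proportional to alpha * (1 - alpha)^(d - 1):
geometric_weights <- function(alpha, D) {
  u <- alpha * (1 - alpha)^(0:(D - 1))
  u / sum(u) # normalize so that the weights sum to one
}
# normalized shifted Poisson lag weights, u_d proportional to alpha^(d - 1) / (d - 1)! * exp(-alpha):
poisson_weights <- function(alpha, D) {
  u <- dpois(0:(D - 1), lambda = alpha)
  u / sum(u)
}
round(geometric_weights(0.8, 5), 3) # weights for alpha = 0.8 and max_lag = 5, as used below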

When hhh4addon is loaded, a modified version hhh4_lag of hhh4 is available. The following additional specifications can be made in the control list:

  • funct_lag: a function to compute lag weights from a (scalar) parameter par_lag. The function moreover needs to take the smallest and largest lag to receive a positive weight as arguments (min_lag and max_lag). The four parameterizations mentioned above can be specified via the functions geometric_lag, poisson_lag, linear_lag and ar2_lag. Alternatively, a user-defined function can be provided; in this case we recommend consulting the source code of e.g. geometric_lag and adapting it accordingly (see the sketch after this list).
  • par_lag: the weighting parameter entering into funct_lag; for geometric_lag, linear_lag and ar2_lag this is logit(α), for poisson_lag it is log(α). These choices enable unconstrained optimization of the (profile) likelihood when α is estimated from the data (see below).
  • min_lag: the lowest lag to receive a positive weight (the weights for lags 1 through min_lag - 1 are forced to zero). The default value 1 should be kept in most cases.
  • max_lag: the highest lag to be included. Note that the subset specified in control needs to be compatible with max_lag (specifically subset[1] > max_lag needs to hold).
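
To illustrate what a user-defined weighting function could look like, here is a minimal hypothetical sketch assuming only the interface described above (arguments par_lag, min_lag and max_lag, returning a normalized weight vector); compare with the source code of geometric_lag before actual use:

# hypothetical example: equal weights for all lags between min_lag and max_lag,
# ignoring par_lag (illustration only, not part of hhh4addon)
uniform_lag <- function(par_lag, min_lag, max_lag) {
  weights <- numeric(max_lag)
  weights[min_lag:max_lag] <- 1
  weights / sum(weights) # normalize so that the weights sum to one
}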

The return object of hhh4_lag is an object of class hhh4lag, which inherits from the regular hhh4 class.

We exemplify this extension with a simple univariate analysis of the salmonella.agona data from surveillance (see plot below). All syntax also translates to the multivariate case.

library(surveillance)
library(hhh4addon)
data("salmonella.agona") # get data
# convert old "disProg" to new "sts" data class
salmonella <- disProg2sts(salmonella.agona)
# plot data:
plot(salmonella)

First we fit a univariate hhh4 model with only first lags and seasonality in both the endemic (end) and epidemic/autoregressive (ar) component, i.e.

$$\mu_t = \nu_t + \lambda_t X_{t - 1},$$
$$\log(\nu_t) = \alpha^{(\nu)} + \beta^{(\nu)}\sin(2\pi t/\omega) + \gamma^{(\nu)}\cos(2\pi t/\omega),$$
$$\log(\lambda_t) = \alpha^{(\lambda)} + \beta^{(\lambda)}\sin(2\pi t/\omega) + \gamma^{(\lambda)}\cos(2\pi t/\omega).$$

control_salmonella <- list(end = list(f = addSeason2formula(~ 1)),
                            ar = list(f = addSeason2formula(~ 1)),
                            family = "NegBin1", subset = 6:312)
# Note: we fit to subset 6:312 to ensure comparability with the higher-order lag
# models fitted next.
fit_salmonella_ar1 <- hhh4(salmonella, control_salmonella) # use regular hhh4 function
AIC(fit_salmonella_ar1)
## [1] 1229.134

Next we fit a higher-order lag model with

$$\mu_t = \nu_t + \lambda_t\sum_{d = 1}^{5} u_d X_{t - d},$$

geometric lag weights $u_d$ and a fixed value of $\alpha = 0.8$. We again fit the model to data from week 6 onwards.

par_lag_08 <- log(0.8/(1 - 0.8)) # the par_lag value corresponding to alpha = 0.8
control_salmonella_08 <- list(end = list(f = addSeason2formula(~ 1)),
                              ar = list(f = addSeason2formula(~ 1)),
                              funct_lag = geometric_lag, # default
                              max_lag = 5, # default
                              par_lag = par_lag_08, # new parameter
                              family = "NegBin1", subset = 6:312)
# Note that funct_lag = geometric_lag and max_lag = 5 are the defaults in hhh4_lag and
# would not need to be specified explicitly.
fit_salmonella_geom_08 <- hhh4_lag(salmonella, control_salmonella_08) # now use hhh4_lag
plot(fit_salmonella_geom_08, names = "")
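
As stated above, the fitted object carries both classes and can largely be handled like a regular hhh4 fit:

class(fit_salmonella_geom_08) # inherits from both "hhh4lag" and "hhh4"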

AIC(fit_salmonella_geom_08)
## [1] 1225.157

We can see that in terms of AIC this model is better than the previously fitted model with only first lags. To estimate par_lag (i.e. a suitable transformation of α) from the data, we can use the wrapper profile_par_lag, which re-fits the model for different values of par_lag and uses numerical optimization (optim) to find the optimal one; α is thus estimated via a profile likelihood approach.

control_salmonella_geom <- list(end = list(f = addSeason2formula(~ 1)),
                                ar = list(f = addSeason2formula(~ 1)),
                                funct_lag = geometric_lag,
                                max_lag = 5,
                                family = "NegBin1", subset = 6:312)
fit_salmonella_geom <- profile_par_lag(salmonella, control_salmonella_geom)
AIC(fit_salmonella_geom)
## [1] 1224.65
summary(fit_salmonella_geom)
## 
## Call: 
## hhh4_lag(stsObj = stsObj, control = control)
## 
## Coefficients:
##                         Estimate  Std. Error
## ar.1                    -0.99658   0.26263  
## ar.sin(2 * pi * t/52)   -0.34039   0.20947  
## ar.cos(2 * pi * t/52)   -0.64112   0.28667  
## end.1                    0.37801   0.15091  
## end.sin(2 * pi * t/52)  -0.25764   0.15369  
## end.cos(2 * pi * t/52)   0.10855   0.17548  
## overdisp                 0.14835   0.04328  
## 
## Distributed lags used (max_lag = 5). Weights: 0.56; 0.25; 0.11; 0.05; 0.02
## Use distr_lag() to check the applied lag distribution and parameters.
## 
## Log-likelihood:   -604.32 
## AIC:              1224.65 
## BIC:              1254.46 
## 
## Number of units:        1 
## Number of time points:  307

The best fit is achieved with α = 0.56, i.e. almost half of the epidemic component's contribution comes from lags of order larger than one (the standardized lag weights are stored in fit_salmonella_geom$distr_lag).
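
The fitted weights can be inspected directly:

fit_salmonella_geom$distr_lag # standardized lag weights, cf. the summary above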

An (older) alternative for estimating α is fit_par_lag, which, instead of applying optim, fits the model for each value of par_lag in a user-provided vector (argument range_par). Under certain circumstances this is computationally faster than using optim.

grid_alpha <- seq(from = 0.01, to = 0.99, by = 0.02)
grid_par_lag <- log(grid_alpha/(1 - grid_alpha)) # move to logit scale
fit_salmonella_geom_grid <- fit_par_lag(salmonella, control_salmonella_geom,
                                   range_par = grid_par_lag)

The function fit_par_lag returns a list containing the best model ($best_mod) and the AIC values of the models corresponding to the different values of par_lag ($AICs). We can thus plot the AIC as a function of α.

plot(grid_alpha, fit_salmonella_geom_grid$AICs, type = "l",
     xlab = expression(alpha), ylab = "AIC")

A remark on the computation of the AIC values: the AIC of a model fitted with fit_par_lag or profile_par_lag is 2 points higher than that of a model with the same value of par_lag specified manually rather than estimated. This is because the estimation of par_lag costs one degree of freedom.

par_lag_0.56 <- log(0.56/(1 - 0.56)) # par_lag value corresponding to alpha = 0.56
control_salmonella_geom.056 <- list(end = list(f = addSeason2formula(~ 1)),
                            ar = list(f = addSeason2formula(~ 1)),
                            par_lag = par_lag_0.56,
                            family = "NegBin1", subset = 6:312)
fit_salmonella_geom.056 <- hhh4_lag(salmonella, control_salmonella_geom.056)
AIC(fit_salmonella_geom.056); AIC(fit_salmonella_geom); AIC(fit_salmonella_geom_grid$best_mod)
## [1] 1222.654
## [1] 1224.65
## [1] 1224.65
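
Indeed, the difference corresponds almost exactly to the two-point penalty:

AIC(fit_salmonella_geom) - AIC(fit_salmonella_geom.056) # 1224.650 - 1222.654 = 1.996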

For comparison we also fit models with Poisson and AR(2) lags.

# Poisson lags:
control_salmonella_pois <- control_salmonella
control_salmonella_pois$funct_lag <- poisson_lag
fit_salmonella_pois <- profile_par_lag(salmonella, control_salmonella_pois)
AIC(fit_salmonella_pois)
## [1] 1225.025
# AR(2) lags:
control_salmonella_ar2 <- control_salmonella
control_salmonella_ar2$funct_lag <- ar2_lag
fit_salmonella_ar2 <- profile_par_lag(salmonella, control_salmonella_ar2)
AIC(fit_salmonella_ar2)
## [1] 1226.999

These parameterizations lead to a slightly worse model fit than the geometric lag weights.

Most of the functionality for hhh4 objects is now also available for hhh4lag objects (as returned by hhh4_lag, profile_par_lag and fit_par_lag). For instance, we can simulate from a fitted model. Note that in this case the starting values need to be specified as a matrix with max_lag rows.

# simulate 1000 trajectories for weeks 261 through 270, using values from 
# 256 through 260 as starting values:
set.seed(17)
sim <- simulate(fit_salmonella_geom, subset = 261:270,
                y.start = fit_salmonella_geom$stsObj@observed[256:260, , drop = FALSE],
                nsim = 1000, simplify = TRUE)
# plot one trajectory:
plot(261:270, sim[, , 1], type = "h", xlab = "week", ylab = "No. infected")

Also, we can generate rolling one-week-ahead forecasts (i.e. from models which are iteratively re-fitted each week). This is done using the new function hhh4addon::oneStepAhead_hhh4lag (the function surveillance::oneStepAhead cannot be applied to hhh4lag objects and will throw an error). By default the lag weighting parameter par_lag is not re-estimated, as this can lead to quite long computation times; to re-estimate it, set the argument refit_par_lag to TRUE. The result of hhh4addon::oneStepAhead_hhh4lag can be handled just like the return value of surveillance::oneStepAhead. We illustrate this with weeks 261–312 as the validation period.

owa_forecasts_geom <- oneStepAhead_hhh4lag(fit_salmonella_geom,
                                      tp = c(260, 311), refit_par_lag = FALSE)
## Lag weights are not re-estimated for the one-step-ahead forecasts (set refit_par_lag = TRUE to re-fit them).
# forecasts are done for tp[1] + 1, tp[2] + 1, ...
colMeans(scores(owa_forecasts_geom))
##     logs      rps      dss      ses 
## 2.044585 1.125719 2.553112 4.517066
# for comparison: model with only first lags:
owa_forecasts_ar1 <- oneStepAhead(fit_salmonella_ar1,
                                      tp = c(260, 311))
## One-step-ahead prediction @ t = 260 ...
## One-step-ahead prediction @ t = 261 ...
## [... analogous progress messages for t = 262 through t = 311 ...]
colMeans(scores(owa_forecasts_ar1)) # average scores
##     logs      rps      dss      ses 
## 2.058342 1.116286 2.664664 4.376845

The geometric-lag model thus performs slightly better in terms of the logarithmic and Dawid-Sebastiani scores, but slightly worse in terms of the ranked probability and squared error scores (all scores are negatively oriented).
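
To judge whether such score differences are more than noise, one option (not part of the analysis above) is a Monte Carlo permutation test for paired score differences, as provided by surveillance::permutationTest; a sketch:

scores_geom <- scores(owa_forecasts_geom) # matrix with one column per scoring rule
scores_ar1 <- scores(owa_forecasts_ar1)
# test the difference in mean RPS between the two models:
permutationTest(scores_geom[, "rps"], scores_ar1[, "rps"])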

Computing predictive and marginal moments

The second functionality of hhh4addon concerns predictive and marginal (stationary or periodically stationary) moments. For the endemic-epidemic class, these quantities can be computed analytically, i.e. without the need for simulation; see Held, Meyer and Bracher (2017) and Bracher and Held (2019) for the theoretical background.

To illustrate the use of predictive moments we re-fit the above models to weeks 6–260 of the salmonella.agona data. The rest is again kept for validation of our predictions.

control_salmonella.sub <- list(end = list(f = addSeason2formula(~ 1), lag = 1),
                            ar = list(f = addSeason2formula(~ 1), lag = 1),
                            family = "NegBin1", subset = 6:260)
fit_salmonella_ar1.sub <- hhh4(salmonella, control_salmonella.sub)
fit_salmonella_geom.sub <- profile_par_lag(salmonella, control_salmonella.sub)

Predictive moments for weeks 261, 262, … can now be calculated and plotted. Note that these correspond to a path forecast rather than a weekly updated rolling forecast as before, i.e. the forecasts for weeks 261, 262, … are all conditioned on the observations up to week 260.

pred_mom_ar1 <- predictive_moments(fit_salmonella_ar1.sub, t_condition = 260, 
                               lgt = 52, return_Sigma = TRUE)
# if return_Sigma = TRUE, the full predictive covariance matrix is returned;
# this may be costly in terms of storage for complex models and long forecasts.
pred_mom_geom <- predictive_moments(fit_salmonella_geom.sub, 
                                    t_condition = 260, lgt = 52, return_Sigma = TRUE)

plot(fit_salmonella_ar1.sub, names = "")
fanplot_prediction(pred_mom_geom, add = TRUE)

The fan plot shown here is based on negative binomial approximations of the predictive distributions via their first two moments; the underlying moment matching is sketched below. We can also use the predictive moments to evaluate the Dawid-Sebastiani score of the path forecasts (a proper scoring rule for predictive model assessment; Held et al. 2017).
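
As an illustration of such a moment-based approximation (our own sketch; we assume fanplot_prediction uses an equivalent parametrization): for a mean μ and variance σ² > μ, the size parameter of R's negative binomial is μ²/(σ² − μ).

# negative binomial quantiles matched to a given mean and variance
# (illustrative helper, not part of hhh4addon):
nb_quantile <- function(p, mu, sigma2) {
  qnbinom(p, mu = mu, size = mu^2 / (sigma2 - mu))
}
# e.g. approximate 10% and 90% predictive quantiles for the first forecast week:
nb_quantile(c(0.1, 0.9), mu = pred_mom_geom$mu_matrix[1],
            sigma2 = pred_mom_geom$var_matrix[1])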

# Note that ds_score_hhh4 requires the full predictive covariance matrix,
# i.e. return_Sigma = TRUE needs to be used in predictive_moments.
ds_score_hhh4(pred_mom_ar1)
## [1] 1.43501
ds_score_hhh4(pred_mom_geom)
## [1] 1.431232

The Dawid-Sebastiani score, too, is negatively oriented, i.e. the prediction from the model fit_salmonella_geom.sub with geometric lags is slightly better than the one from the simpler fit_salmonella_ar1.sub.

The predictive_moments function can also be used to check that the simulation functions behave correctly. Here we can re-use the previously simulated trajectories:

cond_mom_geom_260 <- predictive_moments(fit_salmonella_geom, t_condition = 260, lgt = 10)
plot(261:270, apply(sim, 1:2, mean), xlab = "week", ylab = "predictive mean")
lines(261:270, cond_mom_geom_260$mu_matrix)
legend("topright", pch = c(1, NA), lty = c(NA, 1),
       legend = c("simulation-based", "analytical"), bty = "n")

plot(261:270, apply(sim, 1:2, sd), xlab = "week", ylab = "predictive standard deviation")
lines(261:270, sqrt(cond_mom_geom_260$var_matrix))

The agreement between the analytical results and the simulation-based estimates of the predictive moments is relatively poor with 1,000 simulated paths, but becomes very good with 10,000 (not shown here, as this is costly in terms of storage and computation time).

The function stationary_moments can be applied in the same way (without specifying t_condition and lgt) to obtain the marginal moments of a fitted model. Note that this is only possible for models with either time-constant parameters (stationary moments) or periodically varying parameters (periodically stationary moments). For horizons exceeding a few weeks, the predictive and marginal moments become indistinguishable, as the forecast converges to the periodically stationary behaviour of the model.

marg_mom_geom <- stationary_moments(fit_salmonella_geom)
fanplot_stationary(marg_mom_geom, timepoints = 1:52, add_legend = TRUE)

References

Bracher, J. and L. Held (2019). Endemic-epidemic models with discrete-time serial interval distributions for infectious disease prediction. Preprint: https://arxiv.org/abs/1901.03090.

Bracher, J. and L. Held (2017). Periodically stationary multivariate autoregressive models. Preprint: https://arxiv.org/abs/1707.04635.

Held, L., S. Meyer, and J. Bracher (2017). Probabilistic forecasting in infectious disease epidemiology: the 13th Armitage lecture. Statistics in Medicine 36 (22), 3443–3460.

Meyer, S., L. Held, and M. Höhle (2017). Spatio-temporal analysis of epidemic phenomena using the R package surveillance. Journal of Statistical Software 77 (11), 1–55.