R
package hhh4addon
The R
package hhh4addon
extends the
functionality of the surveillance
package (Meyer et al
2017), more specifically the implementation of the endemic-epidemic
model class in the function hhh4
.
It adds the following features:
As hhh4addon
can only be used in combination with
surveillance
we assume familiarity with this package and in
particular the hhh4
function in the following.
We only give a brief description of the endemic-epidemic framework
and the hhh4
function, details can be found in Meyer et al
(2017) and the vignettes vignette("hhh4")
and
vignette("hhh4_spacetime")
from the
surveillance
package. Counts Xit
from units i = 1, ..., m and time
t are modelled as
Xit ∣ Xt − 1 ∼ NegBin(μit, ψi); Xit⊥Xjt ∣ Xt − 1
μit = eitνit + λitXi, t − 1 + ϕit∑j ≠ i⌊wji⌋Xj, t − 1.
Here, the negative binomial distribution is parametrized by its mean
μit and
an overdispersion parameter ψi so that Var(Xt ∣ Xt − 1) = μit ⋅ (1 + ψiμit).
The term eitνit
is referred to as the endemic component of incidence, where eit is
a population offset. The remaining autoregressive terms form the
epidemic component, with λitXi, t − 1
often called the autoregressive part and ϕit∑j ≠ iwjiXj, t − 1
the neighbourhood part. Various specifications for the normalized
weights ⌊wij⌋
exist, see vignette("hhh4_spacetime")
from the surveillance
package. The parameters νit, λit
and ϕit are
themselves modelled in a log-linear fashion. While in principle
covariates can enter here, it is common to include only an intercept and
sine/cosine terms for seasonality, e.g. log (νit) = αi(ν) + βi(ν)sin (2πt/ω) + γi(ν)cos (2πt/ω),
log (λit) = αi(λ) + βi(λ)sin (2πt/ω) + γi(λ)cos (2πt/ω),
log (ϕit) = αi(ϕ) + βi(ϕ)sin (2πt/ω) + γi(ϕ)cos (2πt/ω).
Models of this type can be fitted using the function hhh4
from surveillance
. Numerous aspects of the model can be
specified via the control
list, with the parametrization of
νit,
λit and
ϕit
steered by the elements end
, ar
and
ne
. Whether unit-specific parameters are necessary and
identifiable depends on the data at hand, see the vignettes from
surveillance
for more information on model building.
The additional functionality offered by hhh4addon
is the
inclusion of higher-order lags, i.e. it provides methods to fit models
of the form Xit ∣ Xt − 1, ..., Xt − D, ∼ NegBin(μit, ψi)
$$
\mu_{it} = \nu_{it} + \lambda_{it}\sum_{d = 1}^D \lfloor u_d\rfloor
X_{i, t - d} + \phi_{it}\sum_{j\neq i}\sum_{d = 1}^D \lfloor
w_{ji}\rfloor \lfloor u_d\rfloor X_{j, t - d}
$$ where the weights ud are
normalized so that $\sum_{d = 1}^D \lfloor
u_d\rfloor = 1$. This means that instead of the previous
observation Xt − 1 a
weighted average of the D
preceding observations Xt − 1, …, Xt − D
enters. Note that the weights ud are shared
between the autoregressive and neighbourhood component and it is not
possible to specify them separately.
Currently four parameterizations of the weights ud are implemented. The default is a geometric, i.e. exponentially decaying lag weighting structure ud = α(1 − α)d − 1, α ∈ (0, 1). A second option is a (shifted) Poisson weighting, $$ u_d = \frac{\alpha^{d - 1}}{(d - 1)!}\exp(-\alpha), \ \ \alpha > 0 $$ which unlike the geometric formulation does not force the first lag weight u1 to be the largest. Thirdly there are linearly decaying lag weights, ud = max (1 − αd, 0), α ∈ (0, 1). The last pre-implemented lag structure is a simple AR(2) version with u1 = α, u2 = 1 − α, α ∈ (0, 1).
When hhh4addon
is loaded, a modified version
hhh4lag
of hhh4
is available. The following
additional specifications can be made in the control
list:
funct_lag
: a function to compute lag weights from a
(scalar) parameter par_lag
. Moreover the function needs to
take the smallest and largest lag to receive a positive weight as
arguments (min_lag
and max_lag
). The four
parameterizations mentioned above can be specified (implemented in the
functions geometric lag
, poisson_lag
,
linear_lag
and ar2_lag
). Alternatively a
user-defined function can be provided. For this case we recommend to
consult the source code of e.g. geometric_lag
and adapt it
accordingly.par_lag
: the weighting parameter entering into
funct_lag
; for geometric_lag
,
linear_lag
and ar2_lag
this is logit(α), for
poisson_lag
it is log (α). These choices enable
unconstrained optimization of the (profile) likelihood, when α is estimated from the data (see
below).min_lag
: the lowest lag to receive a positive weight
(the weights for lags 1 through min_lag - 1
are forced to
zero). The default value 1 should be kept in most cases.max_lag
: the highest lag to be included. Note that the
subset
specified in control
needs to be
compatible with max_lag
(specifically
subset[1] > max_lag
needs to hold).The return object of hhh4lag
is an object of class
hhh4lag
, which inherits from the regular hhh4
class.
We exemplify this extension with a simple univariate analysis of the
salmonella.agona
data from surveillance
(see
plot below). All syntax also translates to the multivariate case.
library(surveillance)
library(hhh4addon)
data("salmonella.agona") # get data
# convert old "disProg" to new "sts" data class
salmonella <- disProg2sts(salmonella.agona)
# plot data:
plot(salmonella)
First we fit a univariate hhh4
model with only first
lags and seasonality in both the endemic (en
) and
epidemic/autoregressive (ar
) component, i.e. μt = νt + ϕtXt − 1
log (νt) = α(ν) + β(ν)sin (2πt/ω) + γ(ν)cos (2πt/ω)
log (ϕt) = α(ϕ) + β(ϕ)sin (2πt/ω) + γ(ϕ)cos (2πt/ω)
control_salmonella <- list(end = list(f = addSeason2formula(~ 1)),
ar = list(f = addSeason2formula(~ 1)),
family = "NegBin1", subset = 6:312)
# Note: we fit to subset 6:312 to ensure comparability with the higher-order lag
# models fitted next.
fit_salmonella_ar1 <- hhh4(salmonella, control_salmonella) # use regular hhh4 function
AIC(fit_salmonella_ar1)
## [1] 1229.134
Next we fit a higher-order lag model with $$ \mu_t = \nu_t + \lambda_{it}\sum_{d = 1}^5 \lfloor u_d\rfloor X_{t - d}, $$ geometric lag weights ud and a fixed value of α = 0.8. We fit the model to data from week 6 onwards.
par_lag_08 <- log(0.8/(1 - 0.8)) # the par_lag value corresponding to alpha = 0.8
control_salmonella_08 <- list(end = list(f = addSeason2formula(~ 1)),
ar = list(f = addSeason2formula(~ 1)),
funct_lag = geometric_lag, # default
max_lag = 5, # default
par_lag = par_lag_08, # new parameter
family = "NegBin1", subset = 6:312)
# Note that funct_lag = geometric_lag and max_lag = 5 are the defaults in hhh4lag and
# would not need to be specified explicitly.
fit_salmonella_geom_08 <- hhh4_lag(salmonella, control_salmonella_08) # now use hhh4lag
plot(fit_salmonella_geom_08, names = "")
## [1] 1225.157
We can see that in terms of AIC this model is better than the
previously fitted model with only first lags. To estimate
par_lag
(i.e. a suitable transformation of α) from the data we can use the
wrapper profile_par_lag
which re-fits the model for
different values of par_lag
and uses numerical optimization
(optim
) to find the optimal one (α is thus estimated via a profile
likelihood approach).
control_salmonella_geom <- list(end = list(f = addSeason2formula(~ 1)),
ar = list(f = addSeason2formula(~ 1)),
funct_lag = geometric_lag,
max_lag = 5,
family = "NegBin1", subset = 6:312)
fit_salmonella_geom <- profile_par_lag(salmonella, control_salmonella_geom)
AIC(fit_salmonella_geom)
## [1] 1224.65
##
## Call:
## hhh4_lag(stsObj = stsObj, control = control)
##
## Coefficients:
## Estimate Std. Error
## ar.1 -0.99658 0.26263
## ar.sin(2 * pi * t/52) -0.34039 0.20947
## ar.cos(2 * pi * t/52) -0.64112 0.28667
## end.1 0.37801 0.15091
## end.sin(2 * pi * t/52) -0.25764 0.15369
## end.cos(2 * pi * t/52) 0.10855 0.17548
## overdisp 0.14835 0.04328
##
## Distributed lags used (max_lag = 5). Weights: 0.56; 0.25; 0.11; 0.05; 0.02
## Use distr_lag() to check the applied lag distribution and parameters.
##
## Log-likelihood: -604.32
## AIC: 1224.65
## BIC: 1254.46
##
## Number of units: 1
## Number of time points: 307
The best fit is achieved with α = 0.56, i.e. almost half of the
contribution of the epidemic contribution comes from lags of order
larger than one (the standardized lag weights are stored in
fit_salmonella_geom$distr_lag
).
An (older) alternative in order to estimate α is fit_par_lag
which
instead of applying optim
fits the model for a vector of
values for par_lag
provided by the user (argument
range_par
). Under certain circumstances this is
computationally faster than the use of optim
.
grid_alpha <- seq(from = 0.01, to = 0.99, by = 0.02)
grid_par_lag <- log(grid_alpha/(1 - grid_alpha)) # move to logit scale
fit_salmonella_geom_grid <- fit_par_lag(salmonella, control_salmonella_geom,
range_par = grid_par_lag)
The function fit_par_lag
returns a list containing the
best model ($best_mod
) and the AIC values of the models
corresponding to the different values of par_lag
($AICs
). We can thus plot the AIC as a function of α.
A remark on the computation of the AIC values: The AIC of a model
which was fitted with fit_par_lag
or
profile_par_lag
is 2 points higher than that of a model
with the same value of par_lag
, but specified manually
instead of being estimated. This is due to the loss of one degree of
freedom.
par_lag_0.56 <- log(0.56/(1 - 0.56)) # par_lag value corresponding to alpha = 0.56
control_salmonella_geom.056 <- list(end = list(f = addSeason2formula(~ 1)),
ar = list(f = addSeason2formula(~ 1), use_distr_lag = TRUE),
par_lag = par_lag_0.56,
family = "NegBin1", subset = 6:312)
fit_salmonella_geom.056 <- hhh4_lag(salmonella, control_salmonella_geom.056)
AIC(fit_salmonella_geom.056); AIC(fit_salmonella_geom); AIC(fit_salmonella_geom_grid$best_mod)
## [1] 1222.654
## [1] 1224.65
## [1] 1224.65
For comparison we also fit models with Poisson and AR(2) lags.
# Poisson lags:
control_salmonella_pois <- control_salmonella
control_salmonella_pois$funct_lag <- poisson_lag
fit_salmonella_pois <- profile_par_lag(salmonella, control_salmonella_pois)
AIC(fit_salmonella_pois)
## [1] 1225.025
# AR(2) lags:
control_salmonella_ar2 <- control_salmonella
control_salmonella_ar2$funct_lag <- ar2_lag
fit_salmonella_ar2 <- profile_par_lag(salmonella, control_salmonella_ar2)
AIC(fit_salmonella_ar2)
## [1] 1226.999
These parameterizations lead to a slightly worse model fit than the geometric lags weights.
Most of the functionality for hhh4
objects is by now
also available for hhh4lag
objects (as returned by
hhh4lag
, profile_par_lag
and
fit_par_lag
). For instance we can simulate from a fitted
model. Note that in this case the starting values need to be specified
as a matrix with max_lag
rows.
# simulate 1000 trajectories for weeks 261 through 270, using values from
# 256 through 260 as starting values:
set.seed(17)
sim <- simulate(fit_salmonella_geom, subset = 261:270,
y.start = fit_salmonella_geom$stsObj@observed[256:260, , drop = FALSE],
nsim = 1000, simplify = TRUE)
# plot one trajectory:
plot(261:270, sim[, , 1], type = "h", xlab = "week", ylab = "No. infected")
Also, we can generate rolling one-week-ahead forecasts (i.e. from
models which are iteratively re-fitted each week). This is done using
the new function hhh4addon:oneStepAhead_hhh4lag
(the
function surveillance:oneStepAhead
cannot be applied and
will throw an error). By default the lag weighting parameter
par_lag
is not re-fitted (as this can lead to quite long
computation times); to re-fit it set the argument
refit_par_lag
to TRUE
. The result of
hhh4addon:oneStepAhead_hhh4lag
can be handled just like the
return from surveillance:oneStepAhead
would. We illustrate
this for weeks 261–312 as the validation period.
owa_forecasts_geom <- oneStepAhead_hhh4lag(fit_salmonella_geom,
tp = c(260, 311), refit_par_lag = FALSE)
## Lag weights are not re-estimated for the one-step-ahead forecasts (set refit_par_lag = TRUE to re-fit them).
## logs rps dss ses
## 2.044585 1.125719 2.553112 4.517066
# for comparison: model with only first lags:
owa_forecasts_ar1 <- oneStepAhead(fit_salmonella_ar1,
tp = c(260, 311))
## One-step-ahead prediction @ t = 260 ...
## One-step-ahead prediction @ t = 261 ...
## One-step-ahead prediction @ t = 262 ...
## One-step-ahead prediction @ t = 263 ...
## One-step-ahead prediction @ t = 264 ...
## One-step-ahead prediction @ t = 265 ...
## One-step-ahead prediction @ t = 266 ...
## One-step-ahead prediction @ t = 267 ...
## One-step-ahead prediction @ t = 268 ...
## One-step-ahead prediction @ t = 269 ...
## One-step-ahead prediction @ t = 270 ...
## One-step-ahead prediction @ t = 271 ...
## One-step-ahead prediction @ t = 272 ...
## One-step-ahead prediction @ t = 273 ...
## One-step-ahead prediction @ t = 274 ...
## One-step-ahead prediction @ t = 275 ...
## One-step-ahead prediction @ t = 276 ...
## One-step-ahead prediction @ t = 277 ...
## One-step-ahead prediction @ t = 278 ...
## One-step-ahead prediction @ t = 279 ...
## One-step-ahead prediction @ t = 280 ...
## One-step-ahead prediction @ t = 281 ...
## One-step-ahead prediction @ t = 282 ...
## One-step-ahead prediction @ t = 283 ...
## One-step-ahead prediction @ t = 284 ...
## One-step-ahead prediction @ t = 285 ...
## One-step-ahead prediction @ t = 286 ...
## One-step-ahead prediction @ t = 287 ...
## One-step-ahead prediction @ t = 288 ...
## One-step-ahead prediction @ t = 289 ...
## One-step-ahead prediction @ t = 290 ...
## One-step-ahead prediction @ t = 291 ...
## One-step-ahead prediction @ t = 292 ...
## One-step-ahead prediction @ t = 293 ...
## One-step-ahead prediction @ t = 294 ...
## One-step-ahead prediction @ t = 295 ...
## One-step-ahead prediction @ t = 296 ...
## One-step-ahead prediction @ t = 297 ...
## One-step-ahead prediction @ t = 298 ...
## One-step-ahead prediction @ t = 299 ...
## One-step-ahead prediction @ t = 300 ...
## One-step-ahead prediction @ t = 301 ...
## One-step-ahead prediction @ t = 302 ...
## One-step-ahead prediction @ t = 303 ...
## One-step-ahead prediction @ t = 304 ...
## One-step-ahead prediction @ t = 305 ...
## One-step-ahead prediction @ t = 306 ...
## One-step-ahead prediction @ t = 307 ...
## One-step-ahead prediction @ t = 308 ...
## One-step-ahead prediction @ t = 309 ...
## One-step-ahead prediction @ t = 310 ...
## One-step-ahead prediction @ t = 311 ...
## logs rps dss ses
## 2.058342 1.116286 2.664664 4.376845
The geometric-lag model thus performs slightly better in terms of the logarithmic and Dawid-Sebastiani scores, but worse in terms of the ranked probability score (all scores are negatively oriented).
The second functionality of hhh4addon
concerns
predictive and marginal (stationary or periodically stationary) moments.
These quantities can be computed without the need for simulation for the
endemic-epidemic class, see Held, Meyer and Bracher (2017) and Bracher
and Held (2019) for the theoretical background.
To illustrate the use of predictive moments we re-fit the above
models to weeks 6–260 of the salmonella.agona
data. The
rest is again kept for validation of our predictions.
control_salmonella.sub <- list(end = list(f = addSeason2formula(~ 1), lag = 1),
ar = list(f = addSeason2formula(~ 1), lag = 1),
family = "NegBin1", subset = 6:260)
fit_salmonella_ar1.sub <- hhh4(salmonella, control_salmonella.sub)
fit_salmonella_geom.sub <- profile_par_lag(salmonella, control_salmonella.sub)
Predictive moments for weeks 261, 262, … can now be calculated and plotted. Note that these correspond to a path forecast rather than a weekly updated rolling forecast as before, i.e. the forecasts for weeks 261, 262,… are all conditioned on the observation from week 260.
pred_mom_ar1 <- predictive_moments(fit_salmonella_ar1.sub, t_condition = 260,
lgt = 52, return_Sigma = TRUE)
# if return_Sigma == TRUE the full predictive covariance function is returned
# this may be costly in terms of storage for complex models and long forecasts.
pred_mom_geom <- predictive_moments(fit_salmonella_geom.sub,
t_condition = 260, lgt = 52, return_Sigma = TRUE)
plot(fit_salmonella_ar1.sub, names = "")
fanplot_prediction(pred_mom_geom, add = TRUE)
The fanplots shown here are based on negative binomial approximations of the predictive distributions via the first two moments. We can also use these predictive moments to evaluate the Dawid-Sebastiani score of the path forecasts (a proper scoring rule for predictive model assessment, Held et al 2017).
# Note that ds_score requires that the full predictive covariance matrix is
# available, i.e. compute_Sigma = TRUE is used in predictive_moments.
ds_score_hhh4(pred_mom_ar1)
## [1] 1.43501
## [1] 1.431232
The David-Sebastiani score is negatively oriented here, i.e. the
prediction from the model fit_salmonella_geom.sub
with
geometric lags is slightly better than the one from the simpler
fit_salmonella.sub
.
The predictive_moments
function can also be used to
check that the simulation functions are doing the right thing. Here we
can re-use the previously simulated data:
cond_mom_geom_260 <- predictive_moments(fit_salmonella_geom, t_condition = 260, lgt = 10)
plot(261:270, apply(sim, 1:2, mean), xlab = "week", ylab = "predictive mean")
lines(261:270, cond_mom_geom_260$mu_matrix)
legend("topright", pch = c(1, NA), lty = c(NA, 1),
legend = c("simulation-based", "analytical"), bty = "n")
plot(261:270, apply(sim, 1:2, sd), xlab = "week", ylab = "predictive standard deviation")
lines(261:270, sqrt(cond_mom_geom_260$var_matrix))
The agreement between the analytical results and the simulation-based estimates of the predictive moments is relatively poor with 1,000 simulated paths, but gets very good with 10,000 (not applied here as costly in terms of storage and computation time).
The function stationary_moments
can be applied in the
same way (without specifying a t_condition
and
lgt
) to obtain the marginal moments of a fitted model. Note
that this is only possible for models with either time-constant
parameters (stationary moments) or periodically varying parameters
(periodically stationary moments). For horizons exceeding a few weeks
the predictive and marginal moments are indistinguishable (as the
forecast goes to the periodically stationary behaviour of the
model).
Bracher, J. and L. Held (2019). Endemic-epidemic models with discrete-time serial interval distributions for infectious disease prediction. Preprint: https://arxiv.org/abs/1901.03090.
Bracher, J. and Held, L. (2017) Periodically stationary multivariate autoregressive models. Preprint: https://arxiv.org/abs/1707.04635
Held, L., S. Meyer, and J. Bracher (2017). Probabilistic forecasting in infectious disease epidemiology: the 13th Armitage lecture. Statistics in Medicine 36 (22), 3443–3460.
Meyer, S., L. Held, and M. Höhle (2017). Spatio-temporal analysis of epidemic phenomena using the R package surveillance. Journal of Statistical Software 77 (11), 1–55.