Disease outbreaks and metrics

Overview

When an infectious disease outbreak begins, a time-sensitive question arises: “tomorrow, are things going to be better or worse?”

A great deal of research has gone into how to answer this question, including the development of forecasting tools to attempt to predict what might be coming, as well as data streams and metrics to summarize and understand that data. Here we focus more on the latter approach.

Suppose you work at a public health agency, and you have the following reported case data in blue:

You may want to know whether cases tomorrow will be (a) higher than today or (b) lower than today. Judging by eye alone, either seems plausible: in case (a), perhaps today’s count is an outlier and the true trend will continue upwards; in case (b), perhaps today’s count is not an outlier and tomorrow’s cases will be lower.

However, the process that generates these new cases, i.e., infection, has in most instances already occurred by the time a case is reported. A more helpful question might be: are people (still) infecting other people in sufficient numbers that we can expect cases to generally keep increasing? Reported cases are a lagging indicator of the current state of disease transmission. If we understand this dynamic, we can better anticipate how many cases to expect in the near future and whether control measures are effectively slowing transmission.

This is what the effective reproductive number, \(R(t)\), aims to capture. The reproductive number is the average number of people an infectious individual will infect at time \(t\), and it is typically estimated at a daily resolution. The time-varying reproductive number is estimated from a time series of case count data like that shown in the plots above. But a second critical piece of information is needed to indicate when reported cases might have been infected:

How long does it take for an infected person to infect others? This is described by the generation interval, the time between the infection of an infector and the infection of their infectee. It can be summarized by its mean, but more often it is described by a statistical distribution. For example, the infector of an infected individual might have been infected 1 day before that individual with 30% probability, 2 days before with 40% probability, or 3 days before with 30% probability.
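As a small worked illustration, the mean of the example distribution above can be computed directly. This is only a sketch in Python; the variable names are illustrative and the probabilities are the example values from the text, not estimates for any real pathogen.

```python
# Discrete generation interval from the example in the text:
# key = days between infection of the infector and infection of the infectee,
# value = probability of that delay.
generation_interval = {1: 0.30, 2: 0.40, 3: 0.30}

# A valid probability distribution must sum to 1.
total_probability = sum(generation_interval.values())
assert abs(total_probability - 1.0) < 1e-9

# The mean is the probability-weighted average of the possible delays.
mean_gi = sum(day * prob for day, prob in generation_interval.items())
print(f"Mean generation interval: {mean_gi:.1f} days")
```

Two distributions can share this mean and still have very different shapes, which is one reason the full distribution is usually preferred over the mean alone.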

The generation interval is central to estimating the reproductive number. The generation interval can be estimated by a number of methods, including analyzing data of infector-infectee pairs.
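To make the role of the generation interval concrete, here is a deliberately simplified sketch of a renewal-equation style ratio: new infections on day \(t\) divided by past infections weighted by the generation interval. It is an illustration only, not the method of any particular package; the function name `naive_rt` and the infection counts are hypothetical, and real tools add smoothing, uncertainty quantification, and adjustments for reporting delays.

```python
def naive_rt(infections, gi_pmf):
    """Crude R(t): infections[t] / sum_s gi_pmf[s-1] * infections[t-s].

    gi_pmf[s-1] is the probability that the generation interval is s days.
    Entries are None where there is too little history or the
    denominator is zero.
    """
    estimates = []
    for t in range(len(infections)):
        if t < len(gi_pmf):
            # Not enough earlier days to apply the full distribution.
            estimates.append(None)
            continue
        # "Infection pressure": earlier infections weighted by how likely
        # each one is to be transmitting on day t.
        pressure = sum(
            w * infections[t - s] for s, w in enumerate(gi_pmf, start=1)
        )
        estimates.append(infections[t] / pressure if pressure > 0 else None)
    return estimates


# Hypothetical daily infection counts (not real data).
infections = [10, 12, 15, 18, 22, 26, 30, 30, 28, 25]
gi_pmf = [0.30, 0.40, 0.30]  # example distribution from the text

for day, value in enumerate(naive_rt(infections, gi_pmf)):
    label = "n/a" if value is None else f"{value:.2f}"
    print(f"day {day}: R(t) = {label}")
```

When infections are constant from day to day, this ratio equals 1, which matches the "stable" interpretation of \(R(t)\).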

Knowing \(R(t)\) can help you make an informed guess about the current state of a disease outbreak and about near-term trends, because its value at a specific point in time is interpreted as follows:

R(t) | Interpretation at time t | Outbreak is ...
< 1 | Each infected person infects on average fewer than one additional person | shrinking
= 1 | Each infected person infects on average about one additional person | stable
> 1 | Each infected person infects on average more than one additional person | growing
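The table above can be expressed as a small helper, sketched here in Python. The function name `outbreak_status` and the `tolerance` band around 1 are assumptions made for illustration; in practice the uncertainty interval of the estimate, not a fixed band, should guide the interpretation.

```python
def outbreak_status(rt, tolerance=0.05):
    """Map a point estimate of R(t) to the qualitative status in the table.

    `tolerance` is a hypothetical band around 1 treated as "stable";
    it is not a standard threshold.
    """
    if rt < 1 - tolerance:
        return "shrinking"
    if rt > 1 + tolerance:
        return "growing"
    return "stable"


for estimate in (0.8, 1.0, 1.3):
    print(f"R(t) = {estimate}: outbreak is {outbreak_status(estimate)}")
```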

However, estimating \(R(t)\) is not straightforward; it is the subject of a wealth of academic research and a proliferation of software packages. The purpose of this website is to provide guidance in choosing among packages for the R statistical software.

How to choose a tool to estimate \(R(t)\)

There has been a proliferation of software tools that make inferences about the current state of an infectious disease outbreak.

When choosing a tool to estimate \(R(t)\), it is important to keep in mind that \(R(t)\) is a latent variable, which means it cannot be measured directly. Instead, it can only be estimated from observable variables (like reported case counts).

The ideal estimator of \(R(t)\) requires a list of the number of newly infected cases by infection date and the generation interval. This is because we want to know about the state of disease based on when people are infected, not when they report having symptoms.

In reality, we usually observe only the number of newly reported cases and can estimate only the serial interval, which is the time between symptom onset in an infector and symptom onset in their infectee. In this case, the estimate of \(R(t)\) will lag reality unless adjustments are made.
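The lag can be illustrated by passing a hypothetical infection curve through a hypothetical infection-to-report delay distribution. Everything in this Python sketch is assumed for illustration: the function name `expected_reports`, the counts, and the delay probabilities are not taken from any dataset or package.

```python
def expected_reports(infections, delay_pmf):
    """Convolve infections with an infection-to-report delay distribution.

    delay_pmf[d] is the probability that a case is reported d days after
    infection. Reports that would fall after the last day are dropped.
    """
    reports = [0.0] * len(infections)
    for t, n in enumerate(infections):
        for d, p in enumerate(delay_pmf):
            if t + d < len(reports):
                reports[t + d] += n * p
    return reports


# Hypothetical infections by infection date, peaking on day 4.
infections = [0, 10, 20, 40, 80, 40, 20, 10, 0, 0, 0, 0]
# Hypothetical delay: most cases are reported 2 to 3 days after infection.
delay_pmf = [0.0, 0.1, 0.3, 0.4, 0.2]

reports = expected_reports(infections, delay_pmf)
peak_infection_day = infections.index(max(infections))
peak_report_day = reports.index(max(reports))
print(f"Infections peak on day {peak_infection_day}")
print(f"Expected reports peak on day {peak_report_day}")
```

Because the reported curve peaks later than the infection curve, an \(R(t)\) computed directly from reports describes transmission that happened several days earlier.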

Each software package that estimates \(R(t)\) makes different adjustments and assumptions about how these parameters relate, which leads to variations in estimated \(R(t)\) even if the same input data are used. In addition, different packages require different levels of input data to provide additional robustness in estimated outputs.

The purpose of this document

Therefore, the purpose of this document is to provide guidance about which \(R(t)\) estimation software to choose for different analytical goals. First, see our Example outbreak for the different components of a disease outbreak that can be modeled differently. Next, see our Decision tool for how to choose software for different analytical goals.


We limit the methods discussed here to those for estimating historical to present-day \(R(t)\) values using daily case count data, where a case can be flexibly defined as an individual with a reported positive test (whether through healthcare-seeking behavior, routine surveillance, or a hospital admission). Other methods not discussed here include:

We also limit the discussion to packages in the statistical software R (R Core Team 2022), which may exclude some packages in other software programs that combine many of the methodological considerations discussed below (Yang et al. 2022).

Funding, authors, and acknowledgements

This work is supported by CDC grant NU38FT000013.

The lead authors of this document are at Boston University in the School of Public Health:

  • Chad Milando, Laura White

Many additional co-authors contributed to this document, including:

  • Harry Hochheiser, University of Pittsburgh
  • Pragati Prasad, Emory University
  • Md. Sakhawat Hossain, Clemson University
  • Sutyajeet Soneja, Johns Hopkins
  • Benjamin Singer, UC Berkeley