There is substantial under-reporting behavior that is consistent within a single geo. For `nhsn`, this is mostly systematic, though some locations have sudden corrections. We can probably improve our forecasts by modeling revision behavior for most locations, and by adding alerts for what look like reporting errors. `nssp` has smaller, more regular corrections, and is a better predictor of the final value than the initial `nhsn` release.
Flu and covid have similar levels of revision, and states with large sudden corrections have them for both diseases (specifically `az`, `mn`, and `nh` all had unusual revisions, though `dc` has covid corrections that flu doesn't), and the lists of states with the highest relative corrections contain a similar set of states. We haven't checked the exact correlation between the flu and covid revisions, but it is likely high, since the data is reported through the same channels by the same people. We should look into the extra columns to see if they provide useful information for handling revision behavior.
First we get the archive, drop data from before the first version so as not to clutter the revision behavior, and display the overall revision summary¹.
nhsn_archive_flu <- qs2::qs_read(here::here("flu_hosp_prod", "objects", "nhsn_archive_data"))
nhsn_archive_flu <- nhsn_archive_flu$DT %>%
  filter(time_value >= "2024-11-19", geo_value %nin% c("vi", "as", "gu")) %>%
  as_epi_archive()
nhsn_archive_flu$time_type <- "day"
revision_sum <- nhsn_archive_flu %>% epiprocess::revision_analysis(value, min_waiting_period = NULL)
revision_sum %>% print(quick_revision = 7)
##
## ── An epi_archive spanning 2024-11-23 to 2025-05-10. ──
##
## ── Min lag (time to first version):
## min median mean max
## 4 days 4 days 4.2 days 25 days
##
## ── Fraction of epi_key + time_values with
## No revisions:
## • 222 out of 1,336 (16.62%)
##
## Quick revisions (last revision within 7 days of the `time_value`):
## • 379 out of 1,336 (28.37%)
##
## Few revisions (At most 3 revisions for that `time_value`):
## • 957 out of 1,336 (71.63%)
##
##
##
## ── Fraction of revised epi_key + time_values which have:
##
## Less than 0.1 spread in relative value:
## • 609 out of 1,114 (54.67%)
##
## Spread of more than 2719.05 in actual value (when revised):
## • 7 out of 1,114 (0.63%)
##
##
##
## ── Days until within 20% of the latest value:
## min median mean max
## 4 days 4 days 7.9 days 174 days
So about a sixth have no revisions at all, a bit over a quarter resolve within a week, and around 70% have at most a few revisions. Around half of those with revisions have little relative change (less than 10% of the max value). The "actual value" spread isn't really worth thinking about, because this is count data (so only 7 keys crossing that threshold doesn't tell us much).
Here’s a plot of the version changes for all locations.
text_size <- 6
nhsn_archive_flu %>%
  autoplot(value) +
  theme(
    strip.text.x = element_text(size = text_size, margin = margin(.1, 0, .1, 0, "cm")),
    axis.text = element_text(size = text_size, angle = 45),
    legend.title = element_text(size = text_size),
    legend.text = element_text(size = text_size),
    legend.key.size = unit(0.5, "cm")
  ) +
  scale_size_manual(values = c(0.5))
Since this is probably too small to be legible, let's find the states with the worst revision behavior and plot those, where "worst" means the largest average relative spread:
av_re_spread <- revision_sum$revision_behavior %>%
group_by(geo_value) %>%
summarize(rel_spread = mean(rel_spread, na.rm = TRUE)) %>%
arrange(desc(rel_spread)) %>%
filter(geo_value %nin% c("vi", "as", "gu"))
av_re_spread %>%
print(n=20)
## # A tibble: 54 × 2
## geo_value rel_spread
## <chr> <dbl>
## 1 pr 0.357
## 2 wa 0.279
## 3 ok 0.261
## 4 az 0.259
## 5 nm 0.215
## 6 nh 0.214
## 7 ar 0.212
## 8 id 0.206
## 9 dc 0.198
## 10 mo 0.197
## 11 mt 0.178
## 12 or 0.169
## 13 ri 0.139
## 14 ak 0.136
## 15 sd 0.129
## 16 md 0.127
## 17 ga 0.119
## 18 mn 0.113
## 19 ky 0.112
## 20 va 0.111
## # ℹ 34 more rows
The worst 18, excluding the geos we don't actually forecast (so no `as`, `vi`, or `gu`):
nhsn_filtered <- nhsn_archive_flu %>%
filter(geo_value %in% av_re_spread$geo_value[1:18])
nhsn_filtered$DT %<>%
mutate(geo_value = factor(geo_value, levels = av_re_spread$geo_value[1:18]))
autoplot(nhsn_filtered, "value") + facet_wrap(~geo_value, ncol = 3, scales = "free") + theme(strip.text.x = element_text(size = 8))
These cover a range of revision behaviors; some are badly off for a single `time_value`, such as `mn` or `nh`, while most have fairly constant revisioning. `az` is unusual in having two bad `version`s, when the entire history was revised down by a factor of 2 and then reverted to what it had been previously. Most locations systematically report lower than the final values, with some, such as `nm`, `nh`, `ok`, and `ak`, particularly bad about this. It is probably worth adding a key-level measure of how systematic the bias is to `revision_analysis`, or perhaps a separate function; a sketch of one possible measure follows. These biases seem likely to be estimable beforehand. The exceptions that include at least one case of over-reporting are `mn`, `id`, and `nh`; those over-reports are not later backfilled, so they look more like bad estimates or data-entry errors.
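As a minimal sketch of such a measure (an assumed shape, not an existing `epiprocess` function): per geo, compare each first release against the latest value, recording the fraction that land below it and the mean signed relative error.
# Sketch of a per-geo systematic-bias measure (not part of revision_analysis).
first_release <- nhsn_archive_flu$DT %>%
  as_tibble() %>%
  group_by(geo_value, time_value) %>%
  slice_min(version, n = 1) %>%
  ungroup() %>%
  select(geo_value, time_value, value_first = value)
latest_values <- nhsn_archive_flu %>%
  epix_as_of_current() %>%
  select(geo_value, time_value, value_latest = value)
first_release %>%
  inner_join(latest_values, by = c("geo_value", "time_value")) %>%
  group_by(geo_value) %>%
  summarize(
    # near 1: first releases almost always below the final value
    frac_under = mean(value_first < value_latest, na.rm = TRUE),
    # signed relative error; pmax() guards against zero counts
    mean_rel_bias = mean((value_first - value_latest) / pmax(value_latest, 1), na.rm = TRUE)
  ) %>%
  arrange(mean_rel_bias)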
nssp_archive_flu <- qs2::qs_read(here::here("flu_hosp_prod", "objects", "nssp_archive_data"))
nssp_archive_flu <- nssp_archive_flu$DT %>%
  filter(time_value >= "2024-11-19", geo_value %nin% c("vi", "as", "gu")) %>%
  as_epi_archive()
nssp_archive_flu$time_type <- "day"
revision_sum_nssp <- nssp_archive_flu %>% epiprocess::revision_analysis(nssp, min_waiting_period = NULL)
revision_sum_nssp %>% print(quick_revision = 7)
##
## ── An epi_archive spanning 2024-11-20 to 2025-05-07. ──
##
## ── Min lag (time to first version):
## min median mean max
## 4 days 4 days 4.6 days 25 days
##
## ── Fraction of epi_key + time_values with
## No revisions:
## • 169 out of 1,250 (13.52%)
##
## Quick revisions (last revision within 7 days of the `time_value`):
## • 142 out of 1,250 (11.36%)
##
## Few revisions (At most 3 revisions for that `time_value`):
## • 873 out of 1,250 (69.84%)
##
##
##
## ── Fraction of revised epi_key + time_values which have:
##
## Less than 0.1 spread in relative value:
## • 929 out of 1,081 (85.94%)
##
## Spread of more than 0.703 in actual value (when revised):
## • 44 out of 1,081 (4.07%)
##
##
##
## ── Days until within 20% of the latest value:
## min median mean max
## 4 days 4 days 4.7 days 25 days
So rather few weeks (14%) have no revision at all, but ~70% have only a small number of revisions, and most (86%) end up very close to their final values.
av_re_spread <- revision_sum_nssp$revision_behavior %>%
group_by(geo_value) %>%
summarize(rel_spread = mean(rel_spread, na.rm = TRUE)) %>%
arrange(desc(rel_spread)) %>%
filter(geo_value %nin% c("vi", "as", "gu"))
nssp_archive_flu %>%
  autoplot(nssp, .facet_filter = geo_value %in% av_re_spread$geo_value[1:18]) +
  facet_wrap(~geo_value, ncol = 3, scales = "free") +
  theme(strip.text.x = element_text(size = 8))
These corrections are much more regular than `nhsn`'s, so nowcasting to adjust the values looks quite feasible; one simple approach is sketched below.
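As a sketch of what such a nowcast could look like (an assumption, not what ran in production): estimate each geo's typical latest/first-release ratio on past data, then scale incoming first releases by it.
# Sketch of a multiplicative backfill correction for nssp.
first_nssp <- nssp_archive_flu$DT %>%
  as_tibble() %>%
  group_by(geo_value, time_value) %>%
  slice_min(version, n = 1) %>%
  ungroup()
ratios <- first_nssp %>%
  inner_join(
    nssp_archive_flu %>% epix_as_of_current(),
    by = c("geo_value", "time_value"),
    suffix = c("_first", "_latest")
  ) %>%
  filter(nssp_first > 0) %>% # avoid dividing by zero-valued first releases
  group_by(geo_value) %>%
  summarize(correction = median(nssp_latest / nssp_first, na.rm = TRUE))
# a new first release would then be nowcast as nssp_first * correction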
Does `nssp` actually correlate better with the latest value than `nhsn` itself does? This is a property we relied on throughout the season to generate our corrections. Below we compare the correlation of `nssp` as it was on the forecast date with the `nhsn` value as it eventually settled.
nssp_archive_flu
## → An `epi_archive` object, with metadata:
## ℹ Min/max time values: 2024-11-20 / 2025-05-07
## ℹ First/last version with update: 2024-12-01 / 2025-05-11
## ℹ Versions end: 2025-05-11
## ℹ A preview of the table (4572 rows x 4 columns):
## Key: <geo_value, time_value, version>
## geo_value time_value version nssp
## <char> <Date> <Date> <num>
## 1: ak 2024-11-20 2024-12-01 1.33
## 2: ak 2024-11-20 2024-12-08 1.35
## 3: ak 2024-11-20 2024-12-22 1.32
## 4: ak 2024-11-20 2025-01-05 1.34
## 5: ak 2024-11-27 2024-12-01 1.74
## ---
## 4568: wy 2025-04-09 2025-04-13 0.00
## 4569: wy 2025-04-16 2025-04-20 0.00
## 4570: wy 2025-04-23 2025-04-27 0.00
## 4571: wy 2025-04-30 2025-05-04 0.00
## 4572: wy 2025-05-07 2025-05-11 0.00
# "diagonal" slices: for each version, keep only the newest time_value,
# i.e. the data exactly as it appeared on each forecast date
nssp_diagonal <- nssp_archive_flu %>%
  epix_slide(\(x, group, ref_time) x %>% filter(time_value == max(time_value))) %>%
  select(time_value = version, geo_value, nssp)
nhsn_diagonal <- nhsn_archive_flu %>%
  epix_slide(\(x, group, ref_time) x %>% filter(time_value == max(time_value))) %>%
  select(time_value = version, geo_value, value)
# the fully revised series, restricted to dates with a diagonal value
nhsn_latest <- nhsn_archive_flu %>%
  epix_as_of_current() %>%
  filter(time_value %in% nhsn_diagonal$time_value)
nssp_diagonal
## # A tibble: 1,198 × 3
## time_value geo_value nssp
## <date> <chr> <dbl>
## 1 2024-12-01 ak 1.74
## 2 2024-12-01 al 1.17
## 3 2024-12-01 ar 0.38
## 4 2024-12-01 az 3.36
## 5 2024-12-01 ca 1.25
## 6 2024-12-01 co 0.69
## 7 2024-12-01 ct 0.21
## 8 2024-12-01 dc 0.17
## 9 2024-12-01 de 0.34
## 10 2024-12-01 fl 2.02
## # ℹ 1,188 more rows
joined_dataset <- nhsn_latest %>%
left_join(
nhsn_diagonal,
by = c("geo_value", "time_value"),
suffix = c("_latest", "_diag")
) %>%
left_join(
    # nssp version dates are offset by one day from nhsn's, so shift to align
    nssp_diagonal %>% mutate(time_value = time_value - 1),
by = c("geo_value", "time_value")
) %>%
filter(geo_value %nin% c("vi", "as", "gu", "mp"))
corrs <- joined_dataset %>%
epi_cor(value_latest, value_diag) %>%
left_join(
joined_dataset %>%
epi_cor(value_latest, nssp),
by = "geo_value", suffix = c("_release", "_nssp")
) %>%
pivot_longer(names_to = "correlation_against", cols = c(cor_release, cor_nssp))
corrs %>% drop_na() %>% group_by(geo_value) %>% filter(n()==2) %>%
ggplot(aes(x = geo_value, y = value, fill = correlation_against)) +
geom_bar(position = "dodge", stat = "identity") + theme(legend.position = "bottom")
Red, on the left for each location, gives the correlation of the final value with `nssp` as it was at the time of the forecast, while blue, on the right, gives the correlation of the final value with `nhsn` as it was. Everywhere but Louisiana, `nssp` correlates more strongly with the eventual value than `nhsn` itself does.
And now for roughly the same idea, but for covid.
First we get the archive, drop data from before the first version so as not to clutter the revision behavior, and display the overall revision summary².
nhsn_archive_covid <- qs2::qs_read(here::here("covid_hosp_prod", "objects", "nhsn_archive_data"))
nhsn_archive_covid <- nhsn_archive_covid$DT %>% filter(time_value >= "2024-11-19") %>%
filter(geo_value %nin% c("vi", "as", "gu")) %>% as_epi_archive()
nhsn_archive_covid$time_type <- "day"
revision_sum <- nhsn_archive_covid %>%
epiprocess::revision_analysis(value, min_waiting_period = NULL)
revision_sum %>% print(quick_revision = 7)
##
## ── An epi_archive spanning 2024-11-23 to 2025-05-10. ──
##
## ── Min lag (time to first version):
## min median mean max
## 4 days 4 days 4.2 days 25 days
##
## ── Fraction of epi_key + time_values with
## No revisions:
## • 211 out of 1,336 (15.79%)
##
## Quick revisions (last revision within 7 days of the `time_value`):
## • 382 out of 1,336 (28.59%)
##
## Few revisions (At most 3 revisions for that `time_value`):
## • 1,012 out of 1,336 (75.75%)
##
##
##
## ── Fraction of revised epi_key + time_values which have:
##
## Less than 0.1 spread in relative value:
## • 615 out of 1,125 (54.67%)
##
## Spread of more than 929.6 in actual value (when revised):
## • 6 out of 1,125 (0.53%)
##
##
##
## ── Days until within 20% of the latest value:
## min median mean max
## 4 days 4 days 8.1 days 174 days
Again about a sixth have no revisions, a bit over a quarter resolve within a week, and around three quarters have at most a few revisions. Around half of those with revisions have little relative change (less than 10% of the max value). As with flu, the "actual value" spread isn't worth dwelling on for count data (only 6 keys cross that threshold).
Here’s a plot of the version changes for all locations.
nhsn_archive_covid %>%
  autoplot(value) +
  theme(
    strip.text.x = element_text(size = text_size, margin = margin(.1, 0, .1, 0, "cm")),
    axis.text = element_text(size = text_size, angle = 45),
    legend.title = element_text(size = text_size),
    legend.text = element_text(size = text_size),
    legend.key.size = unit(0.5, "cm")
  ) +
  scale_size_manual(values = c(0.5))
Since this is probably too small to be legible, let's again find the states with the worst revision behavior and plot those, where "worst" means the largest average relative spread:
av_re_spread <- revision_sum$revision_behavior %>%
group_by(geo_value) %>%
summarize(rel_spread = mean(rel_spread, na.rm = TRUE)) %>%
arrange(desc(rel_spread)) %>%
filter(geo_value %nin% c("vi", "as", "gu"))
av_re_spread %>%
print(n=20)
## # A tibble: 54 × 2
## geo_value rel_spread
## <chr> <dbl>
## 1 pr 0.348
## 2 az 0.281
## 3 dc 0.265
## 4 mo 0.253
## 5 ar 0.233
## 6 nm 0.226
## 7 wa 0.224
## 8 ok 0.216
## 9 id 0.197
## 10 mt 0.193
## 11 ks 0.171
## 12 nh 0.169
## 13 ri 0.162
## 14 ky 0.161
## 15 hi 0.144
## 16 sd 0.135
## 17 va 0.131
## 18 or 0.115
## 19 mn 0.113
## 20 nv 0.110
## # ℹ 34 more rows
This list is remarkably similar to the per-geo average relative spreads in the flu section (a quick check of that similarity is sketched at the end of this subsection). Again we plot the worst 18, excluding the geos we don't actually forecast (so no `as`, `vi`, or `gu`).
nhsn_filtered <- nhsn_archive_covid %>%
filter(geo_value %in% av_re_spread$geo_value[1:18])
nhsn_filtered$DT %<>%
mutate(geo_value = factor(geo_value, levels = av_re_spread$geo_value[1:18]))
autoplot(nhsn_filtered, "value") + facet_wrap(~geo_value, ncol = 3, scales = "free") + theme(strip.text.x = element_text(size = 8))
Strictly visually, this seems to revise more than the flu data, which doesn't fit well with the numeric revision analysis, so something odd is going on; perhaps the revisions are more chaotic.
- `mn`, instead of having a single bad time point for a single version, has its entire trajectory wrong in the same way that `az` does, possibly on the same `version`.
- `az` has revision behavior similar to its flu data (a couple of weeks where the values were wildly off).
- `nh` likewise over-estimates a specific single time value quite badly, but otherwise has the typical under-reporting problems.
- `dc` has new and wild behavior; this is likely as visually striking as it is because the values are so small.
- `mo` has new revision behavior, differing primarily in the magnitude of the corrections.
- `ok` has more extreme under-reporting than in the case of flu, but again this is likely an artifact of the number of cases, suggesting its under-reporting happens in absolute rather than relative counts³.
Like flu, these seem mostly estimable beforehand, with some more drastic exceptions.
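As a quick, hypothetical check of the similarity with flu noted above: if the two per-geo spread tables had been kept under separate names (say `av_re_spread_flu` and `av_re_spread_covid`; the code above reuses `av_re_spread` for both), we could correlate them directly.
# Hypothetical: assumes the flu and covid spread tables were saved separately.
inner_join(
  av_re_spread_flu, av_re_spread_covid,
  by = "geo_value", suffix = c("_flu", "_covid")
) %>%
  summarize(
    pearson = cor(rel_spread_flu, rel_spread_covid, use = "complete.obs"),
    spearman = cor(rel_spread_flu, rel_spread_covid, method = "spearman", use = "complete.obs")
  )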
nssp_archive_covid <- qs2::qs_read(here::here("covid_hosp_prod", "objects", "nssp_archive_data"))
nssp_archive_covid <- nssp_archive_covid$DT %>%
  filter(time_value >= "2024-11-19", geo_value %nin% c("vi", "as", "gu")) %>%
  as_epi_archive()
nssp_archive_covid$time_type <- "day"
revision_sum_nssp <- nssp_archive_covid %>% epiprocess::revision_analysis(nssp, min_waiting_period = NULL)
revision_sum_nssp %>% print(quick_revision = 7)
##
## ── An epi_archive spanning 2024-11-20 to 2025-05-07. ──
##
## ── Min lag (time to first version):
## min median mean max
## 4 days 4 days 4.6 days 25 days
##
## ── Fraction of epi_key + time_values with
## No revisions:
## • 168 out of 1,250 (13.44%)
##
## Quick revisions (last revision within 7 days of the `time_value`):
## • 139 out of 1,250 (11.12%)
##
## Few revisions (At most 3 revisions for that `time_value`):
## • 1,100 out of 1,250 (88%)
##
##
##
## ── Fraction of revised epi_key + time_values which have:
##
## Less than 0.1 spread in relative value:
## • 877 out of 1,082 (81.05%)
##
## Spread of more than 0.162 in actual value (when revised):
## • 70 out of 1,082 (6.47%)
##
##
##
## ── Days until within 20% of the latest value:
## min median mean max
## 4 days 4 days 4.8 days 25 days
So few weeks (13%) have no revision at all, but ~88% have only a small number of revisions, and most (81%) end up close to their final values.
av_re_spread <- revision_sum_nssp$revision_behavior %>%
group_by(geo_value) %>%
summarize(rel_spread = mean(rel_spread, na.rm = TRUE)) %>%
arrange(desc(rel_spread)) %>%
filter(geo_value %nin% c("vi", "as", "gu"))
nssp_archive_covid %>%
  autoplot(nssp, .facet_filter = geo_value %in% av_re_spread$geo_value[1:18]) +
  facet_wrap(~geo_value, ncol = 3, scales = "free") +
  theme(strip.text.x = element_text(size = 8))
These corrections are again much more regular than `nhsn`'s, so nowcasting to adjust the values looks feasible here too.
Does `nssp` actually correlate better with the latest value than `nhsn` itself does? As with flu, we compare the correlation of `nssp` as it was on the forecast date with the `nhsn` value as it eventually settled.
nssp_diagonal <- nssp_archive_covid %>%
epix_slide(\(x, group, ref_time) x %>% filter(time_value == max(time_value))) %>%
select(time_value = version, geo_value, nssp)
nhsn_diagonal <- nhsn_archive_covid %>%
epix_slide(\(x, group, ref_time) x %>% filter(time_value == max(time_value))) %>%
select(time_value = version, geo_value, value)
nhsn_latest <- nhsn_archive_covid %>% epix_as_of_current() %>% filter(time_value %in% nhsn_diagonal$time_value)
nssp_diagonal
## # A tibble: 1,198 × 3
## time_value geo_value nssp
## <date> <chr> <dbl>
## 1 2024-12-01 ak 0.39
## 2 2024-12-01 al 0.42
## 3 2024-12-01 ar 0.45
## 4 2024-12-01 az 1.72
## 5 2024-12-01 ca 0.38
## 6 2024-12-01 co 1.11
## 7 2024-12-01 ct 0.77
## 8 2024-12-01 dc 0.22
## 9 2024-12-01 de 0.59
## 10 2024-12-01 fl 0.31
## # ℹ 1,188 more rows
joined_dataset <- nhsn_latest %>%
left_join(
nhsn_diagonal,
by = c("geo_value", "time_value"),
suffix = c("_latest", "_diag")
) %>%
left_join(
    # as before, shift nssp's version dates by one day to align with nhsn's
    nssp_diagonal %>% mutate(time_value = time_value - 1),
by = c("geo_value", "time_value")
) %>%
filter(geo_value %nin% c("vi", "as", "gu", "mp"))
corrs <- joined_dataset %>%
epi_cor(value_latest, value_diag) %>%
left_join(
joined_dataset %>%
epi_cor(value_latest, nssp),
by = "geo_value", suffix = c("_release", "_nssp")
) %>%
pivot_longer(names_to = "correlation_against", cols = c(cor_release, cor_nssp))
corrs %>% drop_na() %>% group_by(geo_value) %>% filter(n()==2) %>%
ggplot(aes(x = geo_value, y = value, fill = correlation_against)) +
geom_bar(position = "dodge", stat = "identity") + theme(legend.position = "bottom")
Red, on the left for each location, gives the correlation of the final value with `nssp` as it was at the time of the forecast, while blue, on the right, gives the correlation of the final value with `nhsn` as it was. For most locations `nssp` correlates somewhat more strongly with the final value than the initial `nhsn` release does; `nv`, `dc`, `md`, and `mt` are all appreciably better. `hi` is notable in that its initial release has a negative correlation, while `nssp` does better, though at ~0.25 it is still only weakly correlated with the final value. `al`, `ma`, and `sd` are exceptions where the initial release correlates better than `nssp`, though really they're around the same value.
Finally, we compare the actual revision behavior with the values we manually substituted during the season:
data_substitutions <- readr::read_csv(
here::here(glue::glue("flu_data_substitutions.csv")),
comment = "#",
show_col_types = FALSE
) %>%
mutate(time_value = round_date(time_value, unit = "week", week_start = 6))
nhsn_archive_flu %>%
  autoplot(value, .facet_filter = geo_value %in% (data_substitutions$geo_value %>% unique())) +
  geom_point(data = data_substitutions, aes(x = time_value, y = value)) +
  facet_wrap(~geo_value, scale = "free")
which doesn't look all that great. To calculate how much closer to (or further from) the final value our substitutions were, first we construct the relevant snapshots:
final_values <- nhsn_archive_flu %>% epix_as_of_current() %>% mutate(time_value = round_date(time_value, unit = "week", week_start = 6))
data_as_it_was <- map(
data_substitutions$forecast_date %>% unique(),
\(version) nhsn_archive_flu %>% epix_as_of(version) %>% mutate(forecast_date = version)
) %>%
list_rbind() %>% mutate(time_value = round_date(time_value, unit = "week", week_start = 6))
and then we compute several notions of difference:
full_table <- data_substitutions %>%
left_join(
final_values,
by = join_by(geo_value, time_value), suffix = c("_substituted", "_final")
) %>%
left_join(
data_as_it_was,
by = join_by(geo_value, forecast_date, time_value)
) %>%
  rename(as_of_value = value)
diffs <- full_table %>%
  mutate(
    abs_diff = value_substituted - value_final,
    rel_diff = abs_diff / value_final,
    rel_rev_diff = abs_diff / (as_of_value - value_final)
  ) %>%
  arrange(rel_diff)
diffs %>%
select(-forecast_date, -value_substituted, -value_final, -as_of_value) %>%
print(n = 23)
## # A tibble: 23 × 5
## geo_value time_value abs_diff rel_diff rel_rev_diff
## <chr> <date> <dbl> <dbl> <dbl>
## 1 nm 2025-02-15 -78 -0.315 0.661
## 2 sd 2025-01-04 -14 -0.298 0.412
## 3 nj 2025-03-22 -85 -0.246 0.274
## 4 ok 2025-02-01 -266 -0.238 5.66
## 5 ok 2025-02-15 -186 -0.237 0.660
## 6 ms 2025-02-01 -117 -0.218 117
## 7 ca 2025-02-01 -525 -0.119 3.70
## 8 de 2025-02-01 -10 -0.0714 -Inf
## 9 nm 2025-02-01 -14 -0.0654 1.4
## 10 id 2025-02-01 -9 -0.0566 -3
## 11 ga 2025-02-15 -34 -0.0463 0.105
## 12 hi 2025-03-01 3 0.0448 -0.2
## 13 az 2025-02-01 43 0.0533 0.683
## 14 nh 2025-03-15 21 0.266 Inf
## 15 id 2025-01-04 81 0.335 -1.42
## 16 tn 2025-03-22 122 0.396 -4.88
## 17 nh 2025-03-22 27 0.509 Inf
## 18 dc 2025-02-15 47 0.566 -2.47
## 19 dc 2025-02-22 58 0.806 -2.64
## 20 dc 2025-02-22 58 0.806 -4.83
## 21 dc 2025-02-15 77 0.928 Inf
## 22 dc 2025-02-15 77 0.928 Inf
## 23 nh 2025-03-29 66 1.94 0.733
The table is sorted by `rel_diff`.
- `abs_diff` gives how much our adjustment was over by (positive means we were over the latest value).
- `rel_diff` gives the difference relative to the latest value; for `nh` on 03/29 (the last entry in the table), our estimate was nearly twice the latest value. This normalization mostly compensates for the different magnitudes of values across time and geo.
- `rel_rev_diff` is the most appropriate to view as a score for the substitution. It divides `abs_diff` by how much the value as of the forecast date was off; for `nh` on 03/29 again, it was merely 73%, so we did bring the value closer to the truth. Any entry with absolute value <1 is "successful", in the sense that we were closer to the latest value than the as-of value was. An infinite value means we adjusted a value that was never subsequently corrected. The sign of `rel_rev_diff` is a bit confusing: positive means our estimate and the as-of value were both larger or both smaller than the latest value, while negative means one was larger and one smaller. A toy example follows.
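To make the sign convention concrete, here is a toy example with made-up numbers: the latest value is 100 and the as-of value was 80 (under-reported by 20).
value_final <- 100
as_of_value <- 80
# substituting 90 gives (90 - 100) / (80 - 100) = 0.5: same side as the
# as-of value and closer to the final one (|0.5| < 1, an improvement)
(90 - value_final) / (as_of_value - value_final)
# substituting 110 gives -0.5: we overshot to the other side of the final
# value, but were still closer in absolute terms
(110 - value_final) / (as_of_value - value_final)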
For how many did we substitute a more accurate value?
mean(abs(diffs$rel_rev_diff) < 1)
## [1] 0.3478261
Around 35%, so not a great track record overall. How about when we were below the target vs. above it?
diffs %>%
mutate(is_over = rel_diff > 0) %>%
group_by(is_over) %>%
summarize(fraction_improved = mean(abs(rel_rev_diff) < 1))
## # A tibble: 2 × 2
## is_over fraction_improved
## <lgl> <dbl>
## 1 FALSE 0.455
## 2 TRUE 0.25
So we did somewhat better when our substitution was below the latest value (45% improved), but much worse when it was above (25%).
Overall, it turns out our value substitutions did not actually help much for flu.
Now the same comparison for covid: revision behavior against the values we substituted.
data_substitutions <- readr::read_csv(
here::here(glue::glue("covid_data_substitutions.csv")),
comment = "#",
show_col_types = FALSE
) %>%
mutate(time_value = round_date(time_value, unit = "week", week_start = 6))
nhsn_archive_covid %>%
  autoplot(value, .facet_filter = geo_value %in% (data_substitutions$geo_value %>% unique())) +
  geom_point(data = data_substitutions, aes(x = time_value, y = value)) +
  facet_wrap(~geo_value, scale = "free")
To calculate how much closer (or further) we were from the final value, first we construct the relevant snapshots:
final_values <- nhsn_archive_covid %>% epix_as_of_current() %>% mutate(time_value = round_date(time_value, unit = "week", week_start = 6))
data_as_it_was <- map(
data_substitutions$forecast_date %>% unique(),
\(version) nhsn_archive_covid %>% epix_as_of(version) %>% mutate(forecast_date = version)
) %>%
list_rbind() %>% mutate(time_value = round_date(time_value, unit = "week", week_start = 6))
final_values %>% filter(time_value > "2025-01-03")
## An `epi_df` object, 1,018 x 3 with metadata:
## * geo_type = nation
## * time_type = week
## * as_of = 2025-05-16
##
## # A tibble: 1,018 × 3
## geo_value time_value value
## <chr> <date> <dbl>
## 1 ak 2025-01-04 6
## 2 ak 2025-01-11 14
## 3 ak 2025-01-18 7
## 4 ak 2025-01-25 5
## 5 ak 2025-02-01 9
## 6 ak 2025-02-08 10
## 7 ak 2025-02-15 11
## 8 ak 2025-02-22 10
## 9 ak 2025-03-01 19
## 10 ak 2025-03-08 14
## # ℹ 1,008 more rows
full_table <- data_substitutions %>%
left_join(
final_values,
by = join_by(geo_value, time_value), suffix = c("_substituted", "_final")
) %>%
left_join(
data_as_it_was,
by = join_by(geo_value, forecast_date, time_value)
) %>%
  rename(as_of_value = value)
diffs <- full_table %>%
  mutate(
    abs_diff = value_substituted - value_final,
    rel_diff = abs_diff / value_final,
    rel_rev_diff = abs_diff / (as_of_value - value_final)
  ) %>%
  arrange(rel_diff)
diffs %>%
select(-forecast_date, -value_substituted, -value_final, -as_of_value) %>%
print(n = 23)
## # A tibble: 8 × 5
## geo_value time_value abs_diff rel_diff rel_rev_diff
## <chr> <date> <dbl> <dbl> <dbl>
## 1 ga 2025-02-15 -52 -0.198 0.473
## 2 nj 2025-03-22 -41 -0.132 0.152
## 3 ky 2025-03-22 -5 -0.0370 0.263
## 4 dc 2025-02-15 5 0.385 -1
## 5 ar 2025-03-22 24 0.393 -0.8
## 6 ak 2025-03-22 5 0.5 -1.67
## 7 nh 2025-01-11 64 0.744 -10.7
## 8 sd 2025-03-01 15 0.75 -3.75
The table is sorted by `rel_diff`. As before, `abs_diff` gives how much our adjustment was over by (positive means we were over the latest value), `rel_diff` gives that difference relative to the latest value, and `rel_rev_diff` divides `abs_diff` by how much the value as of the forecast date was off. Entries with absolute value <1 are "successful", in the sense that we were closer to the latest value than the as-of value was; an infinite value means we adjusted a value that was never corrected, and the sign again records whether our estimate and the as-of value were on the same side of the latest value.
We did fewer substitutions for covid, but they were more consistently good, and there were no cases where we changed a value that was never actually revised.
For how many did we substitute a more accurate value?
mean(abs(diffs$rel_rev_diff) < 1)
## [1] 0.5
50%, so overall no better than a coin flip. How about when we were below the target vs. above it?
diffs %>%
mutate(is_over = rel_diff > 0) %>%
group_by(is_over) %>%
summarize(fraction_improved = mean(abs(rel_rev_diff) < 1))
## # A tibble: 2 × 2
## is_over fraction_improved
## <lgl> <dbl>
## 1 FALSE 1
## 2 TRUE 0.2
When our substitution was below the latest value, we always improved the situation; when it was above, we usually made things worse (only 1 of 5 improved).
Visually, the flu and covid nhsn datasets seem to have related revision behavior. We should probably study this more carefully, though the question is somewhat difficult. One option: for each `(time_value, geo_value)` pair, correlate the two diseases' values across `version`s. Alternatively, we could do the same for the differences between versions (i.e., correlate the correction amounts).
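A rough, hypothetical sketch of the first idea (untested; note that archives only store rows that changed, so the flu and covid version sets won't align perfectly, and a fuller version could align them with `epix_merge()` first):
# correlate flu vs covid values across the versions the archives share
shared <- inner_join(
  nhsn_archive_flu$DT %>% as_tibble() %>% select(geo_value, time_value, version, value),
  nhsn_archive_covid$DT %>% as_tibble() %>% select(geo_value, time_value, version, value),
  by = c("geo_value", "time_value", "version"),
  suffix = c("_flu", "_covid")
)
version_cors <- shared %>%
  group_by(geo_value, time_value) %>%
  filter(n() >= 3) %>% # need several shared versions for a meaningful cor
  summarize(version_cor = cor(value_flu, value_covid), .groups = "drop")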
¹ `min_waiting_period` is `NULL` here since we're plotting mid-season, while `quick_revision = 7` because this is weekly data represented using days (the versions are days).
² `min_waiting_period` is `NULL` here since we're plotting mid-season, while `quick_revision = 7` because this is weekly data represented using days (the versions are days).
³ Which would definitely be a somewhat odd effect.