Skip to contents

Introduction

This vignette explains why tidycreel’s effort estimates work the way they do. It is written for creel biologists who know their surveys but have not had to think carefully about sampling theory. By the end you will understand three things:

  1. What a Primary Sampling Unit (PSU) is and why it matters for variance.
  2. How the Horvitz–Thompson estimator turns a sample of daily counts into a population total, and how the Rasmussen (1994) two-stage variance decomposes standard error into between-day and within-day components.
  3. How the progressive count estimator differs mechanically from the instantaneous count estimator.

Each concept is developed from first principles, illustrated with a worked numeric example, and confirmed against estimate_effort() output.


What is a PSU?

In an instantaneous creel survey, the observer selects a random time on a given day and records the number of anglers visible on the survey section. Each day × stratum combination is treated as one sampling unit. In survey-sampling language these units are called Primary Sampling Units (PSUs).

Imagine a three-day survey with two day-type strata:

Six PSUs formed by crossing 3 days × 2 strata.
Date Day type PSU
2024-06-03 weekday PSU-1
2024-06-03 weekend PSU-2
2024-06-04 weekday PSU-3
2024-06-04 weekend PSU-4
2024-06-05 weekday PSU-5
2024-06-05 weekend PSU-6

Six PSUs in total — three per stratum. If the survey selected two weekday PSUs and two weekend PSUs, the sample would contain four PSUs out of six.

The population we want to estimate is the season total over all NhN_h PSUs in stratum hh. The sample gives us observations on nhn_h of those PSUs.


The Basic Effort Estimator

Horvitz–Thompson form

For a stratified random sample, the Horvitz–Thompson (HT) estimator of the population total is:

Ê=isyiπi\hat{E} = \sum_{i \in s} \frac{y_i}{\pi_i}

  • yiy_i = observed count (number of anglers) on PSU ii
  • πi\pi_i = inclusion probability for PSU ii (probability that PSU ii is selected into the sample)
  • ss = the set of sampled PSUs

When sampling is done without replacement and all PSUs in a stratum are equally likely to be chosen, the inclusion probability is πi=nh/Nh\pi_i = n_h / N_h. Substituting into the HT form and recognising that the sum over shs_h covers nhn_h PSUs gives the familiar stratum-total estimator.

Practical creel form

For stratum hh with NhN_h total days and nhn_h sampled days:

Êh=Nhnhishyi\hat{E}_h = \frac{N_h}{n_h} \sum_{i \in s_h} y_i

  • NhN_h = total days available in stratum hh during the season
  • nhn_h = number of days actually sampled in stratum hh
  • yiy_i = count (angler-hours or angler-trips) recorded on sampled day ii
  • shs_h = the set of sampled days in stratum hh

The ratio Nh/nhN_h / n_h is the expansion factor: if only one in three days was sampled, each observed count represents three days worth of effort in the population.

Worked numeric

Stratum: “weekday”. N=n=3N = n = 3 days sampled (all available days observed). Counts: y1=150,y2=230,y3=190y_1 = 150, \; y_2 = 230, \; y_3 = 190 angler-hours.

Êweekday=33×(150+230+190)=1×570=570 angler-hours\hat{E}_{\text{weekday}} = \frac{3}{3} \times (150 + 230 + 190) = 1 \times 570 = 570 \text{ angler-hours}

Confirming with R:

library(tidycreel)

calendar <- data.frame(
  date = as.Date(c("2024-06-03", "2024-06-04", "2024-06-05")),
  day_type = rep("weekday", 3),
  open_hours = rep(10, 3)
)
design <- creel_design(calendar, date = date, strata = day_type)

# n_anglers holds angler-hours (count × open hours already combined)
counts <- data.frame(
  date      = as.Date(c("2024-06-03", "2024-06-04", "2024-06-05")),
  day_type  = rep("weekday", 3),
  n_anglers = c(150L, 230L, 190L)
)
design <- add_counts(design, counts)
#> Warning in svydesign.default(ids = psu_formula, strata = strata_formula, : No
#> weights or probabilities supplied, assuming equal probability
result <- estimate_effort(design)
result$estimates
#> # A tibble: 1 × 7
#>   estimate    se se_between se_within ci_lower ci_upper     n
#>      <dbl> <dbl>      <dbl>     <dbl>    <dbl>    <dbl> <int>
#> 1      570  69.3       69.3         0     272.     868.     3

The estimate column (570) matches the by-hand result. With a single count per day se_within is always 0 — there is no within-day replication to estimate within-day variance.


Multiple Counts per Day: Rasmussen Two-Stage Variance

Why replicate counts within a day?

When two observers (or the same observer on two circuits) count the same section twice on the same day — once in the morning and once in the afternoon — we have replicated measurements within each PSU. This within-day replication does something useful: it separates measurement noise from true day-to-day variation.

Without multiple counts per day, we cannot tell whether the variation across days is caused by real differences in angler activity or by random counting error. With replication we can decompose the total variance into two components:

SE2(Ê)=SEbetween2+SEwithin2\text{SE}^2(\hat{E}) = \text{SE}^2_{\text{between}} + \text{SE}^2_{\text{within}}

Between-day component

The between-day component measures how much estimated effort varies from one sampled PSU to the next. It is computed by treating the within-day mean yi\bar{y}_i (average of all counts on day ii) as the PSU-level observation and applying the standard variance-of-a-total formula:

SEbetween2=Nh2nhsy2\text{SE}^2_{\text{between}} = \frac{N_h^2}{n_h} \cdot s^2_{\bar{y}}

  • NhN_h = total days available in stratum hh
  • nhn_h = number of sampled days
  • sy2s^2_{\bar{y}} = sample variance of the within-day means yi\bar{y}_i

In plain language: this is the variance you would report even with a single count per day, using the daily mean in place of a single count. It reflects how much effort differs from one day to the next.

Within-day component

The within-day component measures how much counts vary within a single day — that is, how much the morning count differs from the afternoon count. The Rasmussen (1994) formula is:

SEwithin2=Nh2nh2i=1nhsd,i2ki\text{SE}^2_{\text{within}} = \frac{N_h^2}{n_h^2} \sum_{i=1}^{n_h} \frac{s^2_{d,i}}{k_i}

  • sd,i2s^2_{d,i} = sample variance of counts on day ii across its kik_i circuits
  • kik_i = number of counts (circuits) on day ii

Equivalently, using the pooled formulation implemented in tidycreel:

SEwithin2=Nhkavgswithin2\text{SE}^2_{\text{within}} = \frac{N_h}{k_{\text{avg}}} \cdot s^2_{\text{within}}

where swithin2=iSSd,inh(kavg1)s^2_{\text{within}} = \frac{\sum_{i} \text{SS}_{d,i}}{n_h (k_{\text{avg}} - 1)} is the pooled within-day variance and SSd,i=j(yijyi)2\text{SS}_{d,i} = \sum_j (y_{ij} - \bar{y}_i)^2 is the sum of squared deviations within day ii.

In plain language: this is how much counting error you are carrying around. A large within-day component means the morning and afternoon counts disagree substantially — the observer is seeing large swings in angler activity within a single day, or measurement error is high.

Worked numeric

Four weekdays are sampled (N=n=4N = n = 4). Each day receives two counts (morning am and afternoon pm):

Day am count pm count Daily mean yi\bar{y}_i SSd,i\text{SS}_{d,i}
1 10 20 15 50
2 20 30 25 50
3 30 40 35 50
4 40 50 45 50

Grand mean of daily means: y=(15+25+35+45)/4=30\bar{\bar{y}} = (15 + 25 + 35 + 45)/4 = 30

Between-day SE (step by step):

Sample variance of daily means:

sy2=(1530)2+(2530)2+(3530)2+(4530)241=225+25+25+2253=5003166.67s^2_{\bar{y}} = \frac{(15-30)^2 + (25-30)^2 + (35-30)^2 + (45-30)^2}{4 - 1} = \frac{225 + 25 + 25 + 225}{3} = \frac{500}{3} \approx 166.67

Between-day variance (without finite-population correction, as in the with-replacement design assumed by survey::svytotal()):

SEbetween2=N2nsy2=164×166.67=4×166.67=666.67\text{SE}^2_{\text{between}} = \frac{N^2}{n} \cdot s^2_{\bar{y}} = \frac{16}{4} \times 166.67 = 4 \times 166.67 = 666.67

SEbetween=666.6725.82\text{SE}_{\text{between}} = \sqrt{666.67} \approx 25.82

Within-day SE (step by step):

kavg=2k_{\text{avg}} = 2; all SSd,i=50\text{SS}_{d,i} = 50:

swithin2=SSd,in(kavg1)=50+50+50+504×(21)=2004=50s^2_{\text{within}} = \frac{\sum \text{SS}_{d,i}}{n \cdot (k_{\text{avg}} - 1)} = \frac{50 + 50 + 50 + 50}{4 \times (2 - 1)} = \frac{200}{4} = 50

SEwithin2=Nkavgswithin2=42×50=100\text{SE}^2_{\text{within}} = \frac{N}{k_{\text{avg}}} \cdot s^2_{\text{within}} = \frac{4}{2} \times 50 = 100

SEwithin=100=10\text{SE}_{\text{within}} = \sqrt{100} = 10

Combined SE:

SE=666.67+100=766.6727.70\text{SE} = \sqrt{666.67 + 100} = \sqrt{766.67} \approx 27.70

Confirming with R:

calendar_m <- data.frame(
  date       = as.Date(c("2024-06-03", "2024-06-04", "2024-06-05", "2024-06-06")),
  day_type   = rep("weekday", 4),
  open_hours = rep(10, 4)
)
design_m <- creel_design(calendar_m, date = date, strata = day_type)

multi_counts <- data.frame(
  date = as.Date(rep(
    c("2024-06-03", "2024-06-04", "2024-06-05", "2024-06-06"),
    each = 2
  )),
  day_type = rep("weekday", 8),
  count_time = rep(c("am", "pm"), 4),
  n_anglers = c(10L, 20L, 20L, 30L, 30L, 40L, 40L, 50L)
)
design_m <- add_counts(design_m, multi_counts, count_time_col = count_time)
#> Warning in svydesign.default(ids = psu_formula, strata = strata_formula, : No
#> weights or probabilities supplied, assuming equal probability
result_m <- estimate_effort(design_m)
result_m$estimates
#> # A tibble: 1 × 7
#>   estimate    se se_between se_within ci_lower ci_upper     n
#>      <dbl> <dbl>      <dbl>     <dbl>    <dbl>    <dbl> <int>
#> 1      120  27.7       25.8        10     31.9     208.     4

The R output confirms: se_between 25.82\approx 25.82, se_within =10.00= 10.00, and se 27.70\approx 27.70 — matching the by-hand calculations exactly.

When does se_within matter?

se_within is small when angler activity is stable throughout the day. It is large when fishing activity follows a strong temporal pattern — for example, an active morning bite followed by a quiet midday and a busy evening. If se_within / se > 0.5, within-day variation is the dominant source of uncertainty, and collecting additional circuits per day would improve precision more than adding more sampled days.


The Progressive Count Estimator

The progressive count is a distinct field method, not a variant of the instantaneous count. Instead of taking a snapshot at a random moment, the observer traverses the entire survey section and records every angler encountered during the traversal. One pass of the section is called a circuit.

Because the observer is moving rather than counting instantaneously, the estimator formula and its interpretation differ.

Formula and derivation

Let CdC_d be the number of anglers encountered during one circuit on day dd, and let TdT_d be the total hours the section is open on day dd.

Êd=Cd×Td\hat{E}_d = C_d \times T_d

  • CdC_d = progressive count (all anglers encountered during the circuit on day dd)
  • TdT_d = total open hours on day dd

Why does τ\tau cancel? During a circuit of duration τ\tau hours, the observer encounters anglers at rate Cd/τC_d / \tau anglers per hour. Over the full TdT_d open hours, the total angler-hours is:

Êd=Cdτ×Td×τ=Cd×Td\hat{E}_d = \frac{C_d}{\tau} \times T_d \times \tau = C_d \times T_d

The circuit time τ\tau appears in both numerator and denominator and cancels exactly. The result is that a progressive count records total angler-hours directly, with τ\tau needed only to derive the expansion factor internally.

Worked numeric

Three weekdays; circuit time τ=2\tau = 2 hours; section open Td=8T_d = 8 hours each day. Counts: C1=15C_1 = 15, C2=20C_2 = 20, C3=25C_3 = 25.

Day CdC_d TdT_d Êd=Cd×Td\hat{E}_d = C_d \times T_d
1 15 8 120
2 20 8 160
3 25 8 200

Season total (N = n = 3): Ê=(3/3)×(120+160+200)=480\hat{E} = (3/3) \times (120 + 160 + 200) = 480 angler-hours.

Confirming with R (note: add_counts() stores the expanded Êd\hat{E}_d values in the n_anglers column automatically):

calendar_p <- data.frame(
  date       = as.Date(c("2024-06-03", "2024-06-04", "2024-06-05")),
  day_type   = rep("weekday", 3),
  open_hours = rep(8, 3)
)
design_p <- creel_design(calendar_p, date = date, strata = day_type)

prog_counts <- data.frame(
  date        = as.Date(c("2024-06-03", "2024-06-04", "2024-06-05")),
  day_type    = rep("weekday", 3),
  n_anglers   = c(15L, 20L, 25L),
  shift_hours = rep(8, 3)
)
design_p <- add_counts(
  design_p, prog_counts,
  count_type = "progressive",
  circuit_time = 2,
  period_length_col = shift_hours
)
#> Warning in svydesign.default(ids = psu_formula, strata = strata_formula, : No
#> weights or probabilities supplied, assuming equal probability

# Per-day Ê_d values stored after C × T expansion
design_p$counts
#>         date day_type n_anglers
#> 1 2024-06-03  weekday       120
#> 2 2024-06-04  weekday       160
#> 3 2024-06-05  weekday       200

result_p <- estimate_effort(design_p)
result_p$estimates
#> # A tibble: 1 × 7
#>   estimate    se se_between se_within ci_lower ci_upper     n
#>      <dbl> <dbl>      <dbl>     <dbl>    <dbl>    <dbl> <int>
#> 1      480  69.3       69.3         0     182.     778.     3

The n_anglers column shows 120, 160, 200 — the pre-expanded daily totals — and the season total is 480 angler-hours, matching the by-hand calculation.


Summary

Three estimator pathways are available for instantaneous creel surveys:

Scenario Data collected Key formula se_within
Single count per day One count per PSU Êh=(Nh/nh)yi\hat{E}_h = (N_h/n_h)\sum y_i Always 0
Multiple counts per day k2k \geq 2 counts per PSU Êh=(Nh/nh)yi\hat{E}_h = (N_h/n_h)\sum \bar{y}_i Nonzero; measures within-day variance
Progressive count One circuit per day Êd=Cd×Td\hat{E}_d = C_d \times T_d Always 0

Single count per day is the classical creel estimator. All variance is between-day variation; there is no within-day component to estimate.

Multiple counts per day (Rasmussen two-stage) adds a within-day variance component. Use this design when angler activity patterns within a day are uncertain or known to vary substantially. Monitor the ratio se_within / se: a value above 0.5 means that adding circuits per day improves precision more than adding sampled days.

Progressive counts are appropriate when the section is long enough that a spot-check count would miss part of the angler population. The C×TC \times T formula recovers total angler-hours without needing τ\tau in the final estimate. Variance structure is the same as for single-count instantaneous surveys.


References

  • Hoenig, J. M., Robson, D. S., Jones, C. M., and Pollock, K. H. (1993). Scheduling counts in the instantaneous and progressive count methods for estimating sportfishing effort. North American Journal of Fisheries Management, 13, 723–736.

  • Pope, S. L., et al. (in press). Progressive count methods for creel surveys.

  • Rasmussen, P. W. (1994). Two-stage variance estimation for creel surveys with multiple counts per sampling unit. Cited in Su, Y.-S., and Liu, P. (2025), equations 17–18.

  • Su, Y.-S., and Liu, P. (2025). Flexible creel survey estimators. Canadian Journal of Fisheries and Aquatic Sciences, 82, 1–27.


For the API walkthrough showing these functions on your own data, see Flexible Count Estimation.