
Bus-Route Survey: Equation Traceability
Source:vignettes/bus-route-equations.Rmd
bus-route-equations.RmdOverview
This document maps every formula used in tidycreel’s bus-route estimation to its published primary source, including the specific equation number and page number. It is intended as an auditable record for reviewers, developers, and users who want to verify that the implementation matches the statistical theory.
The bus-route estimation framework in tidycreel follows two primary sources:
- Jones & Pollock (2012): Chapter 19 of Fisheries Techniques (3rd ed.), pp. 883–919. This chapter provides the general nonuniform probability creel survey estimators.
- Malvestuto (1996): Chapter 20 of Fisheries Techniques (2nd ed.), pp. 591–623. Box 20.6 provides a worked numerical example that tidycreel reproduces exactly.
The four key computations are: inclusion probability (πᵢ), enumeration expansion, expanded effort per interview, and the Horvitz-Thompson total estimators for effort and harvest. Each is documented in a section below.
1. Inclusion Probability (πᵢ)
Published source: Jones & Pollock (2012), p. 912
Formula:
where is the probability of selecting site during a given circuit pass, and is the probability that the sampling period is included in the survey. For uniform period sampling (all periods equally likely), is the same for all sites and circuits.
Statistical meaning: is the inclusion probability — the marginal probability that unit is included in the sample under the two-stage design (stage 1: select a period; stage 2: traverse the circuit). Jones & Pollock (2012) treat this as the product of the two independent selection probabilities.
R implementation:
-
creel_design()inR/creel-design.R: acceptsp_siteandp_periodcolumns from thesampling_frameargument and precomputespi_i = p_site * p_periodfor each site×circuit combination. The result is stored indesign$bus_route$sampling_frame. -
add_interviews()inR/creel-design.R: joins the precomputedpi_ivalue from the sampling frame to each interview row, storing it as the.pi_icolumn.
Malvestuto (1996) Box 20.6 values:
| Site | p_site | p_period | πᵢ |
|---|---|---|---|
| A | 0.30 | 0.50 | 0.150 |
| B | 0.25 | 0.50 | 0.125 |
| C | 0.40 | 0.50 | 0.200 |
| D | 0.05 | 0.50 | 0.025 |
2. Enumeration Expansion
Published source: Malvestuto (1996), Box 20.6, p. 614
Formula:
where is the total number of angler parties observed at site during the visit, and is the number of those parties actually interviewed.
Statistical meaning: When not all parties at a site are interviewed, the expansion factor rescales the interviewed sample to represent the full party count. If all parties are interviewed (), the expansion is 1 and no adjustment is needed. This is the case in Box 20.6 Example 1.
R implementation:
-
add_interviews()inR/creel-design.R: computes.expansion = n_counted / n_interviewedfor each interview row. Whenn_counted = 0andn_interviewed = 0(zero-effort site),.expansionis set toNA; the effort estimator treats this as zero contribution.
3. Effort Estimator (Jones & Pollock Eq. 19.4)
Published source: Jones & Pollock (2012), Eq. 19.4, p. 911
Formula:
where is the enumeration-expanded effort for interview (see Section 4 below), is the inclusion probability, and the sum runs over all interview records.
Statistical meaning: This is a Horvitz-Thompson (HT) total estimator. Dividing by is the HT inverse-probability weight: if a unit is sampled with probability , it represents units in the population. Summing the weighted contributions gives an unbiased estimator of the population total (under the design).
R implementation:
-
estimate_effort_br()inR/creel-estimates-bus-route.R, lines 21–215:- Line 69:
interviews$.e_i <- interviews[[effort_col]] * interviews$.expansion(computes expanded effort; see Section 4) - Line 83:
interviews$.contribution <- interviews$.e_i / interviews$.pi_i(computes for each row) - Line 96:
total_estimate <- sum(interviews$.contribution, na.rm = TRUE)(sums to produce )
- Line 69:
-
estimate_effort()inR/creel-estimates.R, line 310: dispatches toestimate_effort_br()whendesign$design_type == "bus_route".
4. Expanded Effort per Interview (eᵢ)
Published source: Malvestuto (1996), Box 20.6, p. 614 (implicit in the expansion step)
Formula:
Statistical meaning: The effort recorded in an interview represents only the fishing party interviewed. Multiplying by the expansion factor scales that effort to represent all parties at the site, not just the parties.
R implementation:
-
estimate_effort_br()inR/creel-estimates-bus-route.R:- Line 69:
interviews$.e_i <- interviews[[effort_col]] * interviews$.expansion
- Line 69:
When expansion = 1 (all parties interviewed, as in Box 20.6 Example
1),
equals the raw effort. For example, Site C has 6 interviews each with
hours_fished = 57.5 / 6:
Wait — more precisely, each of the 6 Site C rows contributes , and these sum to 57.5 h before the weight is applied, yielding the site contribution of 287.5 angler-hours.
5. Harvest Estimator (Jones & Pollock Eq. 19.5)
Published source: Jones & Pollock (2012), Eq. 19.5, p. 912
Formula:
where is the enumeration-expanded harvest for interview .
Statistical meaning: Structurally identical to the effort estimator (Eq. 19.4) with harvest substituted for effort. Both are HT totals over the same sampling design.
R implementation:
-
estimate_harvest_br()inR/creel-estimates-bus-route.R, lines 218–470: computes.h_i = harvest * .expansionand.contribution = .h_i / .pi_i, then sums viasurvey::svytotal(~.contribution, ...). -
estimate_harvest_rate()inR/creel-estimates.R: dispatches toestimate_harvest_br()whendesign$design_type == "bus_route".
6. Variance Estimation
Published source: Horvitz & Thompson (1952);
implemented via Taylor linearization in the R survey
package (Lumley 2010).
Method: tidycreel uses
survey::svytotal(~.contribution, svy_br) where
svy_br is an svydesign object constructed from
the interview data. The .contribution column holds
(or
).
The survey package applies Taylor linearization to compute
the standard error of this total.
R implementation:
-
estimate_effort_br(), lines 99–112: constructssvy_brviasurvey::svydesign(ids = ~1, ...), appliesget_variance_design()for the selected variance method, and callssurvey::svytotal(~.contribution, svy_br). - Phase 26 cross-validation
(
tests/testthat/test-cross-validation.R) confirms that thesurvey::svytotalvariance matches a manually constructedsvydesigncalculation to tolerance 1e-6.
Bootstrap (variance = "bootstrap") and jackknife
(variance = "jackknife") alternatives are available via
get_variance_design(), consistent with all other tidycreel
estimators.
7. Summary Traceability Table
| Quantity | Formula | Source | Page | R Location |
|---|---|---|---|---|
| Inclusion probability | πᵢ = p_site × p_period | Jones & Pollock (2012) | p. 912 |
creel-design.R: creel_design(),
add_interviews()
|
| Enumeration expansion | expansion = n_counted / n_interviewed | Malvestuto (1996) Box 20.6 | p. 614 |
creel-design.R: add_interviews()
|
| Expanded effort | eᵢ = hours_fished × expansion | Malvestuto (1996) Box 20.6 | p. 614 |
creel-estimates-bus-route.R line 69 |
| HT effort total | Ê = Σ(eᵢ/πᵢ) | Jones & Pollock (2012) Eq. 19.4 | p. 911 |
creel-estimates-bus-route.R lines 83, 96 |
| Expanded harvest | hᵢ = harvest × expansion | Malvestuto (1996) Box 20.6 | p. 614 |
creel-estimates-bus-route.R (harvest branch) |
| HT harvest total | Ĥ = Σ(hᵢ/πᵢ) | Jones & Pollock (2012) Eq. 19.5 | p. 912 |
creel-estimates-bus-route.R lines 218–470 |
| Variance | Taylor linearization on Σ(eᵢ/πᵢ) | Lumley (2010) | — | survey::svytotal(~.contribution, svy_br) |
8. Why πᵢ Matters: A Quantitative Example
Some implementations use a fixed value such as for all sites, or compute from interview timing data (e.g., wait time / circuit time). Both approaches are statistically incorrect: neither is the inclusion probability of the sampling design.
The bias can be large and heterogeneous across sites. Using the Malvestuto (1996) Box 20.6 data, here is the effect of substituting for the correct design-based values:
| Site | Correct πᵢ | eᵢ (h) | Correct eᵢ/πᵢ | Incorrect (π=0.5) eᵢ/πᵢ | Error |
|---|---|---|---|---|---|
| A | 0.150 | 30.0 | 200.0 | 60.0 | −70% |
| B | 0.125 | 20.0 | 160.0 | 40.0 | −75% |
| C | 0.200 | 57.5 | 287.5 | 115.0 | −60% |
| D | 0.025 | 5.0 | 200.0 | 10.0 | −95% |
| Total | 847.5 | 225.0 | −73% |
With , the total effort estimate would be 225.0 angler-hours — a 73% underestimate of the correct 847.5. The bias is not uniform: Site D is underestimated by 95% because it has the lowest correct (0.025) but the incorrect formula assigns it the same weight as Site C (π = 0.5).
The direction and magnitude of bias depend entirely on the distribution of site probabilities in the actual design. The only way to avoid bias is to use the design-specified inclusion probabilities — which is what tidycreel does.
For the specific Site C example cited in the design documents: correct contribution = 57.5 / 0.20 = 287.5; with π = 0.5 it would be 57.5 / 0.5 = 115.0 — a 2.5× underestimate for that site alone.
References
Horvitz, D. G., & Thompson, D. J. (1952). A generalization of sampling without replacement from a finite universe. Journal of the American Statistical Association, 47(260), 663–685.
Jones, C. M., & Pollock, K. H. (2012). Recreational survey methods: estimation of effort, harvest, and abundance. Chapter 19 in Fisheries Techniques (3rd ed.), pp. 883–919. American Fisheries Society.
Lumley, T. (2010). Complex Surveys: A Guide to Analysis Using R. Wiley.
Malvestuto, S. P. (1996). Sampling the recreational angler. Chapter 20 in Fisheries Techniques (2nd ed.), pp. 591–623. American Fisheries Society.
Malvestuto, S. P., Davies, W. D., & Shelton, W. L. (1978). An evaluation of the roving creel survey with nonuniform probability sampling. Transactions of the American Fisheries Society, 107(2), 255–262.