Partial Pooling for Imbalanced Sex Ratios

The Problem: Imbalanced Sex Ratios

Elasmobranch datasets frequently have unequal sample sizes between sexes. The imbalanced_data example dataset illustrates this common scenario with 150 females but only 34 males. Fitting separate models for each sex leads to wide credible intervals for the sparse sex, estimates that are unstable because a few influential observations drive them, and inefficient use of information: both sexes belong to the same species and share much of their biology, yet separate models ignore this.

library(vitalBayes)
library(data.table)

# Load the imbalanced dataset (150F, 34M, 13 embryos)
data(imbalanced_data)

# Check the sex ratio
imbalanced_data[embryo == FALSE, .N, by = sex]
#       sex   N
# 1: female 150
# 2:   male  34

Three Estimation Strategies

Consider estimating a parameter \(\theta\) (e.g., \(L_{50}\)) for each sex:

1. Complete Pooling (No Sex Effect)

Assume both sexes share identical parameters:

\[\theta_{\text{female}} = \theta_{\text{male}} = \theta\]

This ignores real biological differences between sexes, such as sexual dimorphism in size at maturity.

2. No Pooling (Fully Separate)

Estimate each sex independently:

\[\theta_{\text{female}} \perp \theta_{\text{male}}\]

This ignores that both sexes belong to the same species. The sparse sex gets unreliable estimates, and its point estimates can be driven by a handful of observations near the maturity transition.

3. Partial Pooling (Hierarchical)

Model sex-specific parameters as draws from a common distribution:

\[\theta_s \sim \mathcal{N}(\mu, \tau^2)\]

where \(\mu\) is the species-level mean and \(\tau\) controls between-sex variation.

The model learns \(\tau\) from the data. If the sexes are similar, the estimated \(\tau\) is small and the sex-specific estimates shrink toward each other. If the sexes differ, \(\tau\) is large and the estimates stay separated. When one sex is sparse, it borrows strength from the data-rich sex, which reduces variance without assuming that both sexes are identical.
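
To make "borrowing strength" concrete, consider a simplified normal approximation (an illustrative assumption, not the likelihood that vitalBayes fits). Suppose each sex has a data-only estimate \(\hat{\theta}_s\) with standard error \(\sigma_s\), and that \(\mu\) and \(\tau\) are treated as known. The conditional posterior mean of \(\theta_s\) is then a precision-weighted average of the sex-specific estimate and the species-level mean:

\[\mathbb{E}[\theta_s \mid \hat{\theta}_s, \mu, \tau] = (1 - B_s)\,\hat{\theta}_s + B_s\,\mu, \qquad B_s = \frac{\sigma_s^2}{\sigma_s^2 + \tau^2}\]

The sparse sex has the larger \(\sigma_s\), so its weight \(B_s\) is closer to 1 and its estimate moves further toward \(\mu\). As \(\tau \to 0\), \(B_s \to 1\) (complete pooling); as \(\tau \to \infty\), \(B_s \to 0\) (no pooling).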

The vitalBayes Implementation

Non-Centered Parameterization

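For numerical stability, vitalBayes uses a non-centered parameterization on the log scale, so \(\mu\) and \(\tau\) below are the mean and between-sex standard deviation of \(\log(\theta_s)\):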

\[\log(\theta_s) = \mu + \tau \cdot \eta_s, \quad \eta_s \sim \mathcal{N}(0, 1)\]

This separates the global mean (\(\mu\)) from the sex-specific deviations (\(\eta_s\)), which typically improves MCMC sampling markedly when \(\tau\) is small. Under a centered parameterization, HMC struggles with the “funnel” geometry that arises when \(\tau\) is near zero; the non-centered form largely avoids this by sampling \(\eta_s\) on a standard normal scale and rescaling.
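
The following base R sketch illustrates that the centered and non-centered forms describe the same distribution for \(\log(\theta_s)\); only the quantity being sampled differs. The values of `mu` and `tau` are assumptions chosen for illustration and are not taken from any vitalBayes fit.

```r
set.seed(1)

mu  <- log(100)  # assumed species-level mean on the log scale (100 cm)
tau <- 0.05      # assumed small between-sex SD on the log scale
n   <- 1e5       # number of Monte Carlo draws

# Centered: draw log(theta) directly from N(mu, tau^2)
log_theta_centered <- rnorm(n, mean = mu, sd = tau)

# Non-centered: draw a standard normal, then shift and scale
eta <- rnorm(n)
log_theta_noncentered <- mu + tau * eta

# Both parameterizations imply the same distribution for log(theta)
rbind(
  centered     = c(mean = mean(log_theta_centered),
                   sd   = sd(log_theta_centered)),
  non_centered = c(mean = mean(log_theta_noncentered),
                   sd   = sd(log_theta_noncentered))
)
```

The benefit is for the sampler rather than the model: `eta` has unit scale whatever the value of `tau`, so the geometry explored by HMC does not narrow as `tau` approaches zero.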

Working on the log scale ensures that parameters remain positive (lengths, ages, and growth coefficients are all positive quantities) and that the hierarchical variation is proportional rather than additive: a 10% difference between sexes means the same thing whether the species reaches 50 cm or 300 cm.
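
A small numerical check of the proportionality claim, using two hypothetical species and assumed values for `tau` and `eta` (none of these numbers come from the package datasets):

```r
tau <- 0.1                            # assumed between-sex SD on the log scale
eta <- c(female = 0.5, male = -0.5)   # assumed standardized deviations

# Two hypothetical species with very different sizes (cm)
mu_small <- log(50)
mu_large <- log(300)

theta_small <- exp(mu_small + tau * eta)
theta_large <- exp(mu_large + tau * eta)

# The female:male ratio is identical for both species
ratio_small <- theta_small[["female"]] / theta_small[["male"]]
ratio_large <- theta_large[["female"]] / theta_large[["male"]]
c(small = ratio_small, large = ratio_large)

# The absolute difference in cm scales with body size
c(small = theta_small[["female"]] - theta_small[["male"]],
  large = theta_large[["female"]] - theta_large[["male"]])
```

The ratio depends only on `tau * (eta_female - eta_male)`, which is why a single `tau` can describe relative dimorphism regardless of the species' absolute size.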

Prior on \(\tau\)

The between-sex standard deviation uses a half-normal prior:

\[\tau \sim \text{Half-Normal}(0, \sigma_\tau)\]

where \(\sigma_\tau\) is set via the prior_tau argument. This prior has its mode at \(\tau = 0\), so it supports near-complete pooling when the data warrant it; it still permits large \(\tau\) when the sexes genuinely differ; and it avoids the heavy tails of a half-Cauchy prior, which can cause divergences in Stan.
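
To get a feel for what this prior implies, the sketch below simulates half-normal draws in base R and compares them with two analytical facts about the half-normal distribution. The scale 0.5 matches the prior_tau value used in the maturity examples; everything else is illustrative.

```r
set.seed(42)

sigma_tau <- 0.5   # half-normal scale, matching prior_tau = 0.5 in the text
n <- 1e5

# A half-normal draw is the absolute value of a zero-mean normal draw
tau_draws <- abs(rnorm(n, mean = 0, sd = sigma_tau))

# Prior quantiles of tau
quantile(tau_draws, probs = c(0.05, 0.5, 0.95))

# Analytical checks:
#   mean of Half-Normal(0, sigma) = sigma * sqrt(2 / pi)
#   P(tau < sigma)                = 2 * pnorm(1) - 1
prior_mean_exact <- sigma_tau * sqrt(2 / pi)
prob_below_scale <- 2 * pnorm(1) - 1
c(simulated_mean       = mean(tau_draws),
  exact_mean           = prior_mean_exact,
  prob_tau_below_scale = prob_below_scale)
```

About two thirds of the prior mass lies below the scale parameter, so the prior favors modest between-sex variation while leaving room for larger values.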

Practical Usage

Enabling Partial Pooling

# Prepare maturity data from the imbalanced dataset
mat_data <- imbalanced_data[embryo == FALSE & !is.na(mat)]

# With partial pooling (recommended for imbalanced data)
L50_pooled <- fit_bayesian_maturity(
  maturity    = "mat",
  lt          = "fl",
  sex         = "sex",
  data        = mat_data,
  use_pooling = TRUE,    # Enable hierarchical structure
  prior_tau   = 0.5      # Half-normal scale (on log scale)
)

# Without partial pooling (separate estimation)
L50_unpooled <- fit_bayesian_maturity(
  maturity    = "mat",
  lt          = "fl",
  sex         = "sex",
  data        = mat_data,
  use_pooling = FALSE
)

Comparing Results

# Pooled estimates
L50_pooled$summary("L50")

# Unpooled estimates
L50_unpooled$summary("L50")

# The pooled male estimate will typically have:
# 1. Similar median to unpooled
# 2. Narrower credible interval
# 3. Slight shrinkage toward the female estimate

Formal Comparison

Use compare_pooling() to quantify the difference:

compare_pooling(
  pooled   = L50_pooled,
  unpooled = L50_unpooled,
  params   = "L50"
)

# Output shows:
# - Point estimates for each approach
# - CI widths (pooled typically narrower for sparse sex)
# - Shrinkage magnitude

Understanding Shrinkage

Partial pooling produces shrinkage: estimates for the sparse group are pulled toward the overall mean. The amount of shrinkage depends on the sample size ratio (more imbalance produces more shrinkage for the sparse sex), the observed difference (large apparent differences produce less shrinkage), and the within-group variance (high variance produces more shrinkage, since noisy data constrain the estimate less).
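
These dependencies can be sketched numerically under a simplified normal approximation, in which the weight placed on the common mean is `se^2 / (se^2 + tau^2)`, with `se` the standard error of a sex-specific estimate. This is an illustration only and not the computation vitalBayes performs; the within-sex SD `sigma_y` and the `tau` values are assumptions, and `shrink_weight()` is a hypothetical helper. Only the sample sizes are taken from the example dataset.

```r
# Assumed (illustrative) inputs, not estimates from imbalanced_data
n       <- c(female = 150, male = 34)  # sample sizes from the example dataset
sigma_y <- 0.30                        # assumed within-sex SD on the log scale
se      <- sigma_y / sqrt(n)           # standard error of each sex-specific mean

# Hypothetical helper: weight on the common mean for a given between-sex SD
shrink_weight <- function(se, tau) se^2 / (se^2 + tau^2)

taus <- c(0.02, 0.10, 0.50)
weights <- sapply(taus, function(tau) shrink_weight(se, tau))
colnames(weights) <- paste0("tau=", taus)

# Rows are sexes, columns are tau values; 1 = complete pooling, 0 = no pooling
round(weights, 3)
```

In every column the male row has the larger weight, reflecting the smaller sample, and the weights for both sexes fall as `tau` grows. Increasing `sigma_y` raises all the weights.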

Visualizing Shrinkage

library(ggplot2)

# Extract posteriors
draws_pooled <- L50_pooled$draws("L50", format = "df")
draws_unpooled <- L50_unpooled$draws("L50", format = "df")

# Compare male estimates (the sparse sex)
ggplot() +
  geom_density(data = draws_unpooled, aes(x = `L50[2]`, fill = "Unpooled"),
               alpha = 0.5) +
  geom_density(data = draws_pooled, aes(x = `L50[2]`, fill = "Pooled"),
               alpha = 0.5) +
  labs(x = "Male L50 (cm)", y = "Density",
       title = "Effect of Partial Pooling on Male L50 Estimate",
       subtitle = "Note the narrower credible interval with pooling") +
  theme_vital()
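
The visual comparison can be backed by numbers. The sketch below defines a hypothetical helper, `interval_summary()`, that reports the median and 95% interval width of a vector of draws. It is demonstrated on simulated stand-in vectors, not on output from fitted models, so the printed values carry no biological meaning.

```r
# Hypothetical helper: median, interval bounds, and interval width
interval_summary <- function(draws, prob = 0.95) {
  alpha <- (1 - prob) / 2
  q <- quantile(draws, probs = c(alpha, 0.5, 1 - alpha), names = FALSE)
  c(median = q[2], lower = q[1], upper = q[3], width = q[3] - q[1])
}

# Simulated stand-ins for male L50 draws (NOT output from a fitted model)
set.seed(7)
male_unpooled <- rnorm(4000, mean = 95, sd = 6)
male_pooled   <- rnorm(4000, mean = 97, sd = 4)

comparison <- rbind(
  unpooled = interval_summary(male_unpooled),
  pooled   = interval_summary(male_pooled)
)
round(comparison, 1)

# With real fits, the vectors would come from the draws extracted above,
# e.g. draws_unpooled$`L50[2]` and draws_pooled$`L50[2]`
```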

Partial Pooling in Growth Models

The same hierarchical principles apply to growth parameters, with one caveat for the maturity-based parameterization. If the upstream maturity models were themselves fitted with partial pooling, the maturity parameters \((L_{mat}, t_{mat})\) already carry pooled information. Pooling them again in the growth model would double-pool them, which can over-shrink sex differences.

vitalBayes addresses this with selective pooling: by default, only \(L_\infty\) and \(L_0\) are pooled in the growth model’s hierarchical structure, while \(L_{mat}\) and \(t_{mat}\) receive direct sex-specific priors from the upstream maturity fits. This is controlled by the pool_maturity argument and is auto-detected when vitalBayes maturity fit objects are provided. See vignette("fit_bayesian_growth") for the complete treatment.

# Prepare growth data from imbalanced dataset
gdata <- imbalanced_data[embryo == FALSE & !is.na(age)]

# First, fit maturity models with pooling
L50_fit <- fit_bayesian_maturity(
  maturity = "mat", lt = "fl", sex = "sex",
  data = imbalanced_data[embryo == FALSE & !is.na(mat)],
  use_pooling = TRUE
)

t50_fit <- fit_bayesian_maturity(
  maturity = "mat", age = "age", sex = "sex",
  data = imbalanced_data[embryo == FALSE & !is.na(mat) & !is.na(age)],
  use_pooling = TRUE
)

# Growth model with selective pooling (auto-detected)
growth_pooled <- fit_bayesian_growth(
  lt          = "fl",
  age         = "age",
  sex         = "sex",
  data        = gdata,
  k_based     = FALSE,
  length.mature_stanfit = L50_fit,
  age.mature_stanfit    = t50_fit,
  use_pooling = TRUE,
  prior_tau   = 0.2     # Tighter for growth (less between-sex variation expected)
)

# Sex-specific estimates with uncertainty reduction
growth_pooled$summary(c("Linf", "k"))

# Sex differences (still estimable!)
growth_pooled$summary(c("Linf_diff", "k_diff"))

Choosing `prior_tau`

The prior_tau argument controls the half-normal scale for between-sex SD on the log scale:

| Value | Interpretation | Use Case |
|-------|----------------|----------|
| 0.1 | Expect very similar sexes | Growth rate, measurement error |
| 0.2 | Expect modest differences | `Linf`, `L0`, growth models (default) |
| 0.5 | Expect moderate differences | `L50`, `t50`, maturity models (default) |
| 1.0 | Expect substantial differences | Rare; consider no pooling |

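Since parameters are modeled on the log scale, \(\tau\) acts approximately as a proportional difference: a log-scale SD of 0.1 is about 10% variation, while 0.5 corresponds to a multiplicative factor of \(e^{0.5} \approx 1.65\). Note that prior_tau is the scale of the half-normal prior on \(\tau\), not \(\tau\) itself, so prior_tau = 0.5 places about two thirds of the prior mass on \(\tau < 0.5\). That is the loose sense in which 0.5 corresponds to roughly 50% expected variation between sexes before seeing the data. For parameters like L0, which tend to differ less between sexes than maturity parameters, a tighter value (0.2) is often appropriate.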
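
One way to judge a candidate prior_tau is to simulate the female:male ratio it implies before any data are seen. The sketch below assumes the non-centered log-scale structure described earlier, under which the ratio is `exp(tau * (eta_f - eta_m))`. The function `prior_ratio_interval()` is a hypothetical helper written for this illustration and is not part of vitalBayes.

```r
set.seed(123)
n <- 1e5

# Hypothetical helper: prior-implied female:male ratio for a half-normal scale
prior_ratio_interval <- function(prior_tau, n) {
  tau   <- abs(rnorm(n, 0, prior_tau))   # tau ~ Half-Normal(0, prior_tau)
  eta_f <- rnorm(n)                      # standardized female deviation
  eta_m <- rnorm(n)                      # standardized male deviation
  ratio <- exp(tau * (eta_f - eta_m))    # theta_female / theta_male
  quantile(ratio, probs = c(0.05, 0.5, 0.95), names = FALSE)
}

scales <- c(0.1, 0.2, 0.5, 1.0)
intervals <- t(sapply(scales, prior_ratio_interval, n = n))
dimnames(intervals) <- list(paste0("prior_tau=", scales),
                            c("q05", "median", "q95"))

# 90% prior interval for the female:male ratio under each scale
round(intervals, 2)
```

If the upper end of the interval for a given scale exceeds any dimorphism that is biologically plausible for the species, a tighter prior_tau may be worth considering.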

Diagnosing Pooling Behavior

Check the Estimated \(\tau\)

# Large tau = sexes are different (less shrinkage)
# Small tau = sexes are similar (more shrinkage)
L50_pooled$summary("tau_L50")

A very small \(\tau\) (say \(< 0.02\)) indicates that the data strongly support similar parameters across sexes; in that regime partial pooling approaches complete pooling. A large \(\tau\) (say \(> 0.5\)) indicates substantial dimorphism; partial pooling then provides minimal shrinkage and approaches independent estimation.
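
Because \(\tau\) has a full posterior rather than a single value, it can help to report how much posterior mass falls in each regime. The sketch below uses a hypothetical helper, `tau_regime()`, with the illustrative thresholds from the paragraph above, applied to simulated stand-in draws and not to output from a fitted model. With a real fit, the draws would presumably be extracted in the same way as the L50 draws elsewhere in this vignette.

```r
# Hypothetical helper: posterior mass of tau in the two regimes from the text
tau_regime <- function(tau_draws, small = 0.02, large = 0.5) {
  c(median          = median(tau_draws),
    p_near_pooled   = mean(tau_draws < small),   # close to complete pooling
    p_near_separate = mean(tau_draws > large))   # close to no pooling
}

# Simulated stand-in draws (NOT output from a fitted model)
set.seed(11)
tau_draws <- abs(rnorm(4000, mean = 0, sd = 0.15))

round(tau_regime(tau_draws), 3)
```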

Compare LOO-CV

loo_pooled <- compute_loo(L50_pooled)
loo_unpooled <- compute_loo(L50_unpooled)

compare_loo(
  "Partial Pooling" = loo_pooled,
  "No Pooling" = loo_unpooled
)

# Pooled model often has better (higher) elpd when:
# - Sample sizes are imbalanced
# - True sex differences are modest

Example: Full Workflow with Pooling

# ---- Data Prep ----
# Use the imbalanced dataset to demonstrate pooling benefits
data(imbalanced_data)
mat_data <- imbalanced_data[embryo == FALSE & !is.na(mat)]
gdata <- imbalanced_data[embryo == FALSE & !is.na(age)]

# ---- Maturity with Pooling ----
L50_fit <- fit_bayesian_maturity(
  maturity = "mat", lt = "fl", sex = "sex",
  data = mat_data,
  use_pooling = TRUE,
  prior_tau = 0.5
)

t50_fit <- fit_bayesian_maturity(
  maturity = "mat", age = "age", sex = "sex",
  data = mat_data[!is.na(age)],
  use_pooling = TRUE,
  prior_tau = 0.5
)

# ---- Growth with Selective Pooling ----
# pool_maturity auto-detects to FALSE because L50_fit and t50_fit
# are CmdStanMCMC objects from vitalBayes maturity fits
growth_fit <- fit_bayesian_growth(
  lt          = "fl",
  age         = "age",
  sex         = "sex",
  data        = gdata,
  k_based     = FALSE,
  length.mature_stanfit = L50_fit,
  age.mature_stanfit    = t50_fit,
  use_pooling = TRUE,
  prior_tau   = 0.2
)

# ---- Results ----
# Sex-specific estimates with appropriate uncertainty
growth_fit$summary(c("Linf", "k"))

# Credible sex differences
growth_fit$summary(c("Linf_diff", "k_diff"))

Comparing Datasets: Balanced vs Imbalanced

The package includes datasets that illustrate when pooling matters most:

# Imbalanced data (150F, 34M) - pooling helps substantially
data(imbalanced_data)
imbalanced_data[embryo == FALSE, .N, by = sex]

# Balanced data (189F, 176M) - pooling still works but less critical
data(growth_data)
growth_data[embryo == FALSE, .N, by = sex]

# Limited data (24F, 18M) - pooling essential for both sexes
data(limited_data)
limited_data[embryo == FALSE, .N, by = sex]

Summary

| Aspect | No Pooling | Partial Pooling |
|--------|------------|-----------------|
| Sparse sex CI | Wide | Narrower |
| Point estimates | Unstable | Regularized |
| Shrinkage | None | Data-adaptive |
| Sex differences | Direct | Via difference parameters |
| Recommended for | Balanced data | Imbalanced data |

For most elasmobranch life history analyses, partial pooling is the recommended default when fitting two-sex models, and it matters most when sample sizes are imbalanced or small.