---
title: "Optimal design for growth reference centiles"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Optimal design for growth reference centiles}
  %\VignetteEngine{knitr::rmarkdown}
  \usepackage[utf8]{inputenc}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

# Sample size and sample composition

This vignette describes two functions for estimating optimal sample size and 
sample composition in growth reference centile studies, based on the paper
by Cole (2020).

There are two criteria that determine the optimal sample size `N` and sample
composition $\lambda$ in centile studies: 

* the centile of interest, which is expressed as a z-score `z`
* the required level of precision for that centile, expressed as the z-score standard error `SEz`

Take for example the Cuban Growth Study (Healy 1974, Jordan 1975), which was 
designed to estimate the 3rd height centile for boys aged 8 with a standard error (`SE`) of 0.3 cm.

The 3rd centile corresponds to `z` = -1.88, and the required `SE` in z-score units 
is obtained by dividing the `SE` by the height `SD` at age 8 of 5.75 cm. This gives the
precision in z-score units as `SEz` = 0.3/5.75 = 0.053.

Cole (2020) pointed out that when calculating sample size it is essential to think also
about _sample composition_, i.e. the age distribution of the measurements. All centile
studies need to collect extra data in infancy compared to later in childhood, because babies grow
faster than children. But _just how much_ more 
depends on which centile is the focus. The median requires heavy infant oversampling, 
whereas the 0.4th centile where `z` = -2.67 needs very little.

The degree of oversampling is defined by the quantity $\lambda$, see Cole (2020),
where $\lambda$ = 1 means uniform sampling, while for example $\lambda$ = 0.4 indicates heavy infant oversampling.

To run the examples, first load the `tidyverse` and `sitar` libraries.

```{r message = FALSE, echo = FALSE}
  library(ggplot2)
  library(dplyr)
  library(purrr)
  library(tidyr)
  library(forcats)
  library(sitar)
```

## optimal_design

The `optimal_design` function takes as arguments `z`, `lambda`, `N`, `SEz` and `age`. It then calculates the 
optimal sample composition, based either on `z` or `lambda`, and the optimal sample 
size based either on `N` or `SEz`. If both `N` and `SEz` are given, `N` takes precedence. 
If both `z` and `lambda` are given, `SEz` also depends on `age`. 

The return value is `p`, the centile corresponding to `z`, with its 95% confidence interval.

### Example

The following code estimates optimal `lambda` and `SEz` for nine centiles 
spaced two-thirds of a z-score apart, based on a sample of 10,000 children.

```{r}
knitr::kable(optimal_design(z = -4:4*2/3, N = 10000), digits = c(2, 2, 0, 3, 0, 2, 2))
```

What the table shows is unsurprising, that a sample of 10,000 children estimates 
the median (`p` = 50th centile) much more precisely than the 0.4th centile (and by 
symmetry the 99.6th centile), with `SEz` 0.033 compared to 0.065. The 95% 
confidence interval for the 50th centile is (47.3, 52.7) and that for the 0.4th
centile is (0.26, 0.56).

In addition `lambda` increases from 0.38 on the median to 0.90 on the 
0.4th and 99.6th centiles, indicating a large shift in sample composition. 
Let's see what the corresponding age distributions look like.

```{r, fig.width = 7}
minage <- 0
maxage <- 20
N <- 1e4
map_dfr(c(1, 0.9, 0.38), ~{
  tibble(age = (runif(N, minage^.x, maxage^.x))^(1/.x),
         p = .x)
}) %>% 
  mutate(lambda = fct_reorder(factor(p), -p))  %>%
  ggplot(aes(age, group = lambda)) +
  geom_histogram(fill = 'gray', binwidth = 1) +
  facet_wrap(. ~ lambda, labeller = label_both) +
  xlab('age (years)') + ylab('frequency')
```

The figure shows the impact of `lambda` on the age distribution. When `lambda` = 1
sampling is uniform across the age range -- the histogram shows one-year age groups with about 500 per group. 
In practice this design is rarely appropriate for growth studies.

With `lambda` = 0.9, suitable for the 0.4th and 99.6th centiles, the sampling is
close to uniform. 

However with `lambda` = 0.38, which is appropriate for estimating
the median, there is dramatic oversampling in early infancy. This is needed to
accurately estimate the shape of the median curve soon after birth when it is at 
its steepest.

## n_agegp

The function `n_agegp` extends `optimal_design` by giving the numbers of measurements to 
collect per age group.

It takes as arguments `z`, `lambda`, `N` and `SEz`, similar to `optimal_design`. In addition
its arguments `minage`, `maxage` and `n_groups` define the age range and number of groups. 
The groups are of equal width, defaulting to 20 groups from 0 to 20 years, i.e. each one year width.

It returns `n_varying` the numbers per age group, and `age` the mean ages for the groups.

In addition it returns results for cohort studies where the number per age group 
is fixed at `n` = `N` / `n_groups`, and the target ages of measurement are given by `age_varying`.

### Example 1

The following code calculates the age group sizes optimised for centiles from the 0.4th to the 99.6th,
with a sample of 10,000 children from 0 to 20 years in one-year groups.

```{r plot}
  n_table <- map_dfc(-4:4*2/3, ~{
    n_agegp(z = .x, N = 10000) %>% 
      select(!!z2cent(.x) := n_varying)
  }) %>% 
  bind_cols(tibble(age = paste(0:19, 1:20, sep = '-')), .)
knitr::kable(n_table)
```

The table shows how the sample composition varies depending on which centile is optimised. 
Comparing the 50th and 0.4th centiles, the optimal number for the 50th centile is more than four times the size 
in the first year but less than half the size at age 19-20. This matches the histogram above.

### Example 2

`n_agegp` can also be used to design studies over a narrower age range, e.g. 0-5 years.
For example there are 3506 children in this age range in the table optimised for the 2nd centile, 
and the following code returns the same group sizes (bar rounding):

```{r}
knitr::kable(n_agegp(z = 2, N = 3506, minage = 0, maxage = 5, n_groups = 5) %>% 
  select(age, n_varying))
```

This shows that the sample composition optimised for age 0-20 is also optimal for subsets of the age range. 

### Example 3

A third example contrasts the Cuban Growth Study design, as mentioned earlier, 
with the optimal design based on the specified precision of `SEz` = 0.053 on the 3rd centile, 
which Healy (1974) showed required 1000 boys aged 8. 

For an optimal design the required precision should apply 
across the age range. The code shows the age distribution for 0-20 years that was used 
in the Study, including a 10% uplift in numbers to give 1100 at age 8,
and the corresponding design had it been optimised.

```{r, fig.width = 7}
# define CGS age distribution
  tibble(age = c(1/6, 1/2, 5/6, 5/4, 7/4, 19/8, 25/8, 4:10,
                 c(22:29/2 - 1/4), 15:18, 77/4),
         n = c(rep(12, 3), rep(10, 4), 13, rep(11, 6), rep(6, 3),
               rep(11, 5), 15, rep(8, 3), 13) * 100,
         span = c(rep(1/3, 3), rep(1/2, 2), rep(3/4, 2), rep(1, 7),
                  rep(1/2, 8), rep(1, 4), 3/2)) %>% 
    ggplot(aes(x = age, y = n/span, width = span-0.02)) +
    xlab('age (years)') + ylab('frequency') +
    geom_bar(fill = 'gray', stat = 'identity') +
    geom_bar(aes(y = n_varying, width = NULL), 
             data = n_agegp(z = qnorm(0.03), SEz = 0.3 / 5.75) %>% 
               mutate(n_varying = n_varying * 1.1), 
             width = 0.98, fill = 'gray50', stat = 'identity')
```

The pale gray histogram shows the complex sample composition used in the Cuban Growth Study, with
oversampling in infancy and puberty reflecting the higher growth velocity at those ages.

However the pubertal oversampling is unnecessary, and the design ignores the 
saving in numbers that is achieved by smoothing the centile curves. 

The darker histogram shows the optimal design. Overall the sample size could have been reduced from 28,000 to around 11,000. 
This highlights the saving in resources that optimisation can achieve. 

## References

Cole TJ. 2020. Sample size and sample composition for constructing growth reference centiles. Stat Methods Med Res (in press).

Healy MJR. 1974. Notes on the statistics of growth standards. Ann Hum Biol 1:41-46.

Jordan J, Ruben M, Jernandez J, Bebelagua A, Tanner JM, Goldstein H. 1975. The 1972 Cuban national child growth study as an example of population health monitoring: design and methods. Ann Hum Biol 2:153-171.