diff --git a/src/bibtex/all.bib b/src/bibtex/all.bib
index 1aae02559..f4af1a3c7 100644
--- a/src/bibtex/all.bib
+++ b/src/bibtex/all.bib
@@ -1867,4 +1867,31 @@ @article{Magnusson+etal:2024:posteriordb
   author={Magnusson, M{\aa}ns and Torgander, Jakob and B{\"u}rkner, Paul-Christian and Zhang, Lu and Carpenter, Bob and Vehtari, Aki},
   journal={arXiv preprint arXiv:2407.04967},
   year={2024}
+}
+
+@article{egozcue+etal:2003,
+  title={Isometric logratio transformations for compositional data analysis},
+  author={Egozcue, Juan Jos{\'e} and Pawlowsky-Glahn, Vera and Mateu-Figueras, Gl{\`o}ria and Barcelo-Vidal, Carles},
+  journal={Mathematical Geology},
+  volume={35},
+  number={3},
+  pages={279--300},
+  year={2003}
+}
+
+@incollection{filzmoser+etal:2018,
+  title={Geometrical properties of compositional data},
+  author={Filzmoser, Peter and Hron, Karel and Templ, Matthias},
+  booktitle={Applied Compositional Data Analysis: With Worked Examples in R},
+  pages={35--68},
+  year={2018},
+  publisher={Springer}
+}
+
+@misc{seyboldt:2024,
+  author={Seyboldt, Adrian},
+  title={Add ZeroSumNormal distribution},
+  note={numpyro GitHub pull request \#1751},
+  year={2024},
+  url={https://github.com/pyro-ppl/numpyro/pull/1751#issuecomment-1980569811}
 }
\ No newline at end of file
diff --git a/src/functions-reference/real-valued_basic_functions.qmd b/src/functions-reference/real-valued_basic_functions.qmd
index e3848c507..e0e2c6648 100644
--- a/src/functions-reference/real-valued_basic_functions.qmd
+++ b/src/functions-reference/real-valued_basic_functions.qmd
@@ -783,8 +783,10 @@
 calculations, but the result is likely to be reduced acceptance
 probabilities and less efficient sampling.
 
 The rounding functions cannot be used as indices to arrays because
-they return real values.  Stan may introduce integer-valued versions
-of these in the future, but as of now, there is no good workaround.
+they return real values.
For operations over `data` or in the +`generated quantities` block, the +[`to_int()` function](integer-valued_basic_functions.qmd#casting-functions) + can be used. \index{{\tt \bfseries floor }!{\tt (T x): R}|hyperpage} @@ -1636,7 +1638,8 @@ proportion theta, defined by \begin{eqnarray*} Calculates the log mixture density given `thetas`, mixing proportions which should be between 0 and 1 and sum to 1, and `lps`, log densities. -These two containers must have the same length. +The `lps` variable must be either a 1-d container of the same +length as `thetas`, or an array of such. \begin{eqnarray*} \mathrm{log\_mix}(\theta, \lambda) diff --git a/src/reference-manual/expressions.qmd b/src/reference-manual/expressions.qmd index a9e39e28c..d04eff572 100644 --- a/src/reference-manual/expressions.qmd +++ b/src/reference-manual/expressions.qmd @@ -149,9 +149,9 @@ any of the following. ``` int, real, complex, vector, simplex, unit_vector, -ordered, positive_ordered, row_vector, matrix, -cholesky_factor_corr, cholesky_factor_cov, -corr_matrix, cov_matrix, array +sum_to_zero_vector, ordered, positive_ordered, +row_vector, matrix, cholesky_factor_corr, +cholesky_factor_cov, corr_matrix, cov_matrix, array ``` The following built in functions are also reserved and @@ -810,9 +810,9 @@ In addition to single integer indexes, as described in [the language indexing section](#language-indexing.section), Stan supports multiple indexing. Multiple indexes can be integer arrays of indexes, lower bounds, upper bounds, lower and upper bounds, or simply shorthand for -all of the indexes. If the upper bound is smaller than the lower bound, -the range is empty (unlike, e.g., in R). The upper bound and lower bound can be -expressions that evaluate to integer. A complete list of index types is +all of the indexes. If the upper bound is smaller than the lower bound, +the range is empty (unlike, e.g., in R). The upper bound and lower bound can be +expressions that evaluate to integer. 
A complete list of index types is
 given in the following table.
 
 ##### Indexing Options Table {- #index-types-table}
@@ -1078,6 +1078,7 @@ the following table shows the mapping from types to their primitive types.
 | `vector`             | `vector`     |
 | `simplex`            | `vector`     |
 | `unit_vector`        | `vector`     |
+| `sum_to_zero_vector` | `vector`     |
 | `ordered`            | `vector`     |
 | `positive_ordered`   | `vector`     |
 | `row_vector`         | `row_vector` |
@@ -1378,7 +1379,7 @@ model {
 }
 ```
 
-Algebraically, 
+Algebraically,
 [the distribution statement](statements.qmd#distribution-statements.section)
 in the model could be reduced to
diff --git a/src/reference-manual/grammar.txt b/src/reference-manual/grammar.txt
index c18fea74c..657eb8b7c 100644
--- a/src/reference-manual/grammar.txt
+++ b/src/reference-manual/grammar.txt
@@ -24,6 +24,7 @@
 <identifier> ::= IDENTIFIER
   | TRUNCATE
+  | JACOBIAN
 
 <decl_identifier> ::= <identifier>
   | <reserved_word>
@@ -57,10 +58,13 @@
   | POSITIVEORDERED
   | SIMPLEX
   | UNITVECTOR
+  | SUMTOZERO
   | CHOLESKYFACTORCORR
   | CHOLESKYFACTORCOV
   | CORRMATRIX
   | COVMATRIX
+  | STOCHASTICCOLUMNMATRIX
+  | STOCHASTICROWMATRIX
   | PRINT
   | REJECT
   | FATAL_ERROR
@@ -165,11 +169,16 @@
   | POSITIVEORDERED LBRACK <expression> RBRACK
   | SIMPLEX LBRACK <expression> RBRACK
   | UNITVECTOR LBRACK <expression> RBRACK
+  | SUMTOZERO LBRACK <expression> RBRACK
   | CHOLESKYFACTORCORR LBRACK <expression> RBRACK
   | CHOLESKYFACTORCOV LBRACK <expression> [COMMA <expression>] RBRACK
   | CORRMATRIX LBRACK <expression> RBRACK
   | COVMATRIX LBRACK <expression> RBRACK
+  | STOCHASTICCOLUMNMATRIX LBRACK <expression> COMMA
+    <expression> RBRACK
+  | STOCHASTICROWMATRIX LBRACK <expression> COMMA
+    <expression> RBRACK
 
 <type_constraint> ::= [LABRACK <range> RABRACK]
   | LABRACK <offset_mult> RABRACK
diff --git a/src/reference-manual/syntax.qmd b/src/reference-manual/syntax.qmd
index d8bfad955..276a80348 100644
--- a/src/reference-manual/syntax.qmd
+++ b/src/reference-manual/syntax.qmd
@@ -113,6 +113,7 @@ The raw output is available [here](https://raw.githubusercontent.com/stan-dev/do
 ```
 <identifier> ::= IDENTIFIER
   | TRUNCATE
+  | JACOBIAN
 
 <decl_identifier> ::= <identifier>
@@ -175,11 +176,16 @@ The raw output is available [here](https://raw.githubusercontent.com/stan-dev/do
   | POSITIVEORDERED LBRACK <expression> RBRACK
   | SIMPLEX LBRACK <expression> RBRACK
   | UNITVECTOR LBRACK <expression> RBRACK
+  | SUMTOZERO LBRACK <expression> RBRACK
   | CHOLESKYFACTORCORR LBRACK <expression> RBRACK
   | CHOLESKYFACTORCOV LBRACK <expression> [COMMA <expression>] RBRACK
   | CORRMATRIX LBRACK <expression> RBRACK
   | COVMATRIX LBRACK <expression> RBRACK
+  | STOCHASTICCOLUMNMATRIX LBRACK <expression> COMMA
+    <expression> RBRACK
+  | STOCHASTICROWMATRIX LBRACK <expression> COMMA
+    <expression> RBRACK
 
 <type_constraint> ::= [LABRACK <range> RABRACK]
   | LABRACK <offset_mult> RABRACK
diff --git a/src/reference-manual/transforms.qmd b/src/reference-manual/transforms.qmd
index 679e589d9..5f6ab8b27 100644
--- a/src/reference-manual/transforms.qmd
+++ b/src/reference-manual/transforms.qmd
@@ -15,7 +15,7 @@ constrained to be ordered, positive ordered, or simplexes.  Matrices
 may be constrained to be correlation matrices or covariance matrices.
 This chapter provides a definition of the transforms used for each
 type of variable.  For examples of how to declare and define these
-variables in a Stan program, see section 
+variables in a Stan program, see section
 [Variable declaration](types.qmd#variable-declaration.section).
 
 Stan converts models to C++ classes which define probability
@@ -39,7 +39,7 @@ the exact arithmetic:
   rounded to the boundary. This may cause unexpected warnings or
   errors, if in other parts of the code the boundary value is
   invalid. For example, we may observe floating-point value 0 for
-  a variance parameter that has been declared to be larger than 0. 
+  a variance parameter that has been declared to be larger than 0.
   See more about [Floating point Arithmetic in Stan user's guide](../stan-users-guide/floating-point.qmd)).
 - CmdStan stores the output to CSV files with 6 significant digits
   accuracy by default, but the constraints are checked with 8
@@ -462,6 +462,89 @@ p_Y(y)
 $$
 
+
+## Zero sum vector
+
+Vectors that are constrained to sum to zero are useful for, among
+other things, additive varying effects, such as varying slopes or
+intercepts in a regression model (e.g., for income deciles).
+
+A zero sum $K$-vector $x \in \mathbb{R}^K$ satisfies the constraint
+$$
+\sum_{k=1}^K x_k = 0.
+$$ + +For the transform, Stan uses the first part of an isometric log ratio +transform; see [@egozcue+etal:2003] for the basic definitions and +Chapter 3 of [@filzmoser+etal:2018] for the pivot coordinate version +used here. Stan uses the isometric log ratio transform because it +induces a geometry with zero correlation among the dimensions, making +it easier for HMC to explore than simpler alternatives such as setting +the final element to the negative sum of the first elements; see, e.g., +[@seyboldt:2024]. + + + + +### Zero sum transform {-} + +The (unconstraining) transform is defined iteratively. Given an $x \in +\mathbb{R}^{N + 1}$ that sums to zero (i.e., $\sum_{n=1}^{N+1} x_n = +0$), the transform proceeds as follows to produce an unconstrained $y +\in \mathbb{R}^N$. + +The transform is initialized by setting +$$ +S_N = 0 +$$ +and +$$ +y_N = -x_{N + 1} \cdot \frac{\sqrt{N \cdot (N + 1)}{N}}. +$$ +The for each $n$ from $N - 1$ down to $1$, let +$$ +w_{n + 1} = \frac{y_{n + 1}}{\sqrt{(n + 1) \cdot (n + 2)}}, +$$ +$$ +S_n = S_{n + 1} + w_{n + 1}, +$$ +and +$$ +y_n = (S_n - x_{n + 1}) \cdot \frac{\sqrt{n \cdot (n + 1)}}{n}. +$$ + +### Zero sum inverse transform {-} + +The inverse (constraining) transform follows the isometric logratio tranform. +It maps an unconstrained vector $y \in \mathbb{R}^N$ to a zero-sum vector $x \in +\mathbb{R}^{N + 1}$ such that +$$ +\sum_{n=1}^{N + 1} x_n = 0. +$$ +The values are defined inductively, starting with +$$ +x_1 = \sum_{n=1}^N \frac{y_n}{\sqrt{n \cdot (n + 1)}} +$$ +and then setting +$$ +x_{n + 1} = \sum_{i = n + 1}^N \frac{\sqrt{y_i}}{\sqrt{i \cdot (i + 1)}} +- n \cdot \frac{y_n}{\sqrt{n \cdot (n + 1)}}. +$$ +for $n \in 1{:}N$. + +The definition is such that +$$ +\sum_{n = 1}^{N + 1} x_n = 0 +$$ +by construction, because each of the terms added to $x_{n}$ is then +subtracted from $x_{n + 1}$ the number of times it shows up in earlier terms. 
+ +### Absolute Jacobian determinant of the zero sum inverse transform {-} + +The inverse transform is a linear operation, leading to a constant Jacobian +determinant which is therefore not included. + + ## Unit simplex {#simplex-transform.section} Variables constrained to the unit simplex show up in multivariate diff --git a/src/reference-manual/types.qmd b/src/reference-manual/types.qmd index b113778b8..999751551 100644 --- a/src/reference-manual/types.qmd +++ b/src/reference-manual/types.qmd @@ -117,9 +117,10 @@ vector[3] rho; ``` There are also special data types for structured vectors and -matrices. There are four constrained vector data types, `simplex` +matrices. There are five constrained vector data types, `simplex` for unit simplexes, `unit_vector` for unit-length vectors, -`ordered` for ordered vectors of scalars and +`sum_to_zero_vector` for vectors that sum to zero, +`ordered` for ordered vectors of scalars, and `positive_ordered` for vectors of positive ordered scalars. There are specialized matrix data types `corr_matrix` and `cov_matrix` for correlation matrices (symmetric, positive @@ -692,6 +693,26 @@ unit vectors, this is only done up to a statically specified accuracy threshold $\epsilon$ to account for errors arising from floating-point imprecision. +### Vectors that sum to zero {-} + +A zero-sum vector is constrained such that the +sum of its elements is always $0$. These are sometimes useful +for resolving identifiability issues in regression models. +While the underlying vector has only $N - 1$ degrees of freedom, +zero sum vectors are declared with their full dimensionality. +For instance, `beta` is declared to be a zero-sum $5$-vector (4 DoF) by + +```stan +sum_to_zero_vector[5] beta; +``` + +Zero sum vectors are implemented as vectors and may be assigned to other +vectors and vice-versa. 
+Zero sum vector variables, like other constrained variables, are
+validated to ensure that they do indeed sum to zero; for zero sum
+vectors, this is only done up to a statically specified accuracy
+threshold $\epsilon$ to account for errors arising from
+floating-point imprecision.
+
 ### Ordered vectors {-}
 
 An ordered vector type in Stan represents a vector whose entries are
@@ -1636,6 +1657,8 @@ dimensions `matrix[K, K]` types.
 +---------------------+-------------------------+----------------------------------------------+
 |                     |                         | `unit_vector[N]`                             |
 +---------------------+-------------------------+----------------------------------------------+
+|                     |                         | `sum_to_zero_vector[N]`                      |
++---------------------+-------------------------+----------------------------------------------+
 | `row_vector`        | `row_vector[N]`         | `row_vector[N]`                              |
 +---------------------+-------------------------+----------------------------------------------+
 |                     |                         | `row_vector[N]`                              |
@@ -1700,6 +1723,8 @@ dimensions `matrix[K, K]` types.
 +---------------------+-------------------------+----------------------------------------------+
 |                     |                         | `array[M] unit_vector[N]`                    |
 +---------------------+-------------------------+----------------------------------------------+
+|                     |                         | `array[M] sum_to_zero_vector[N]`             |
++---------------------+-------------------------+----------------------------------------------+
diff --git a/src/stan-users-guide/regression.qmd b/src/stan-users-guide/regression.qmd
index ab82b2b7a..232c89f50 100644
--- a/src/stan-users-guide/regression.qmd
+++ b/src/stan-users-guide/regression.qmd
@@ -521,46 +521,79 @@ centered around zero, as is typical for regression coefficients.
 
 ## Parameterizing centered vectors
 
-It is often convenient to define a parameter vector $\beta$ that is
-centered in the sense of satisfying the sum-to-zero constraint,
-$$
-\sum_{k=1}^K \beta_k = 0.
-$$ - -Such a parameter vector may be used to identify a multi-logit -regression parameter vector (see the [multi-logit -section](#multi-logit.section) for details), or may be used for -ability or difficulty parameters (but not both) in an IRT model (see -the [item-response model section](#item-response-models.section) for -details). - - -### $K-1$ degrees of freedom {-} - -There is more than one way to enforce a sum-to-zero constraint on a -parameter vector, the most efficient of which is to define the $K$-th -element as the negation of the sum of the elements $1$ through $K-1$. +When there are varying effects in a regression, the resulting +likelihood is not identified unless further steps are taken. For +example, we might have a global intercept $\alpha$ and then a varying +effect $\beta_k$ for age group $k$ to make a linear predictor $\alpha + +\beta_k$. With this predictor, we can add a constant to $\alpha$ and +subtract from each $\beta_k$ and get exactly the same likelihood. + +The traditional approach to identifying such a model is to pin the +first varing effect to zero, i.e., $\beta_1 = 0$. With one of the +varying effects fixed, you can no longer add a constant to all of them +and the model's likelihood is identified. In addition to the +difficulty in specifying such a model in Stan, it is awkward to +formulate priors because the other coefficients are all interpreted +relative to $\beta_1$. + +In a Bayesian setting, a proper prior on each of the $\beta$ is enough +to identify the model. Unfortunately, this can lead to inefficiency +during sampling as the model is still only weakly identified through +the prior---there is a very simple example of the difference in +the discussion of collinearity in @collinearity.section. 
+
+An alternative identification strategy that allows a symmetric prior
+is to enforce a sum-to-zero constraint on the varying effects, i.e.,
+$\sum_{k=1}^K \beta_k = 0.$
+
+A parameter vector constrained to sum to zero may also be used to
+identify a multi-logit regression parameter vector (see the
+[multi-logit section](#multi-logit.section) for details), or may be
+used for ability or difficulty parameters (but not both) in an IRT
+model (see the [item-response model
+section](#item-response-models.section) for details).
+
+
+### Built-in sum-to-zero vector {-}
+
+As of Stan 2.36, there is a built-in `sum_to_zero_vector` type, which
+can be used as follows.
 
 ```stan
 parameters {
-  vector[K - 1] beta_raw;
-  // ...
-}
-transformed parameters {
-  vector[K] beta = append_row(beta_raw, -sum(beta_raw));
+  sum_to_zero_vector[K] beta;
   // ...
 }
 ```
 
-Placing a prior on `beta_raw` in this parameterization leads to
-a subtly different posterior than that resulting from the same prior
-on `beta` in the original parameterization without the
-sum-to-zero constraint.  Most notably, a simple prior on each
-component of `beta_raw` produces different results than putting
-the same prior on each component of an unconstrained $K$-vector
-`beta`.  For example, providing a $\textsf{normal}(0,5)$ prior
-on `beta` will produce a different posterior mode than placing
-the same prior on `beta_raw`.
+This produces a vector of size `K` such that `sum(beta) = 0`.  The
+unconstrained representation requires only `K - 1` values because the
+last component is determined by the first `K - 1`.
+
+Placing a prior on `beta` in this parameterization, for example,
+
+```stan
+  beta ~ normal(0, 1);
+```
+
+leads to a subtly different posterior than what you would get with the
+same prior on an unconstrained size-`K` vector.  As explained below,
+the variance is reduced.
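The "variance is reduced" claim has a closed form: restricting independent standard normals to the zero-sum hyperplane is the same as projecting them with $P = I - \frac{1}{K}\mathbf{1}\mathbf{1}^\top$, and since $P$ is symmetric and idempotent, $P$ is also the covariance of the constrained vector. A small illustrative computation (ours, not from the Stan docs):

```python
K = 5
# Projection onto the zero-sum hyperplane: P = I - (1/K) * ones * ones^T.
# If z ~ normal(0, I), then P z sums to zero and has covariance
# P I P^T = P^2 = P, so the marginal variance of each component is the
# diagonal entry 1 - 1/K rather than 1.
P = [[(1.0 if i == j else 0.0) - 1.0 / K for j in range(K)] for i in range(K)]

marginal_var = P[0][0]               # 1 - 1/K
pairwise_corr = P[0][1] / P[0][0]    # -1/(K - 1)
print(marginal_var, pairwise_corr)   # 0.8 -0.25
```

This $1 - 1/K$ factor is also where the `inv(sqrt(1 - inv(K)))` scale used elsewhere in this section comes from: dividing by $\sqrt{1 - 1/K}$ restores standard-normal margins.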
+
+The sum-to-zero constraint can be implemented naively by setting the
+last element to the negative sum of the first elements, i.e., $\beta_K
+= -\sum_{k=1}^{K-1} \beta_k.$  But that leads to high correlation among
+the $\beta_k$.
+
+The transform used in Stan eliminates these correlations by
+constructing an orthogonal basis and applying it to the
+zero-sum constraint; @seyboldt:2024 provides an explanation.  The
+*Stan Reference Manual* provides the details in the chapter on
+transforms.  Although any orthogonal basis can be used, Stan uses the
+inverse isometric log ratio transform because it is convenient to
+describe and simplifies to efficient scalar operations rather than
+more expensive matrix operations.
+
 #### Marginal distribution of sum-to-zero components {-}
@@ -575,83 +608,27 @@ model {
 }
 ```
 
-The components are not independent, as they must sum zero.  No
-Jacobian is required because summation and negation are linear
-operations (and thus have constant Jacobians).
+The scale component can be multiplied by `sigma` to produce a
+`normal(0, sigma)` prior marginally.
 
 To generate distributions with marginals other than standard normal,
 the resulting `beta` may be scaled by some factor `sigma` and
 translated to some new location `mu`.
 
-### QR decomposition {-}
-
-Aaron Goodman, on the Stan forums, also provided this approach, which
-calculates a QR decomposition in the transformed data block, then uses
-it to transform to a sum-to-zero parameter `x`,
-
-```stan
-transformed data{
-  matrix[K, K] A = diag_matrix(rep_vector(1, K));
-  matrix[K, K - 1] A_qr;
-  for (i in 1:K - 1) A[K, i] = -1;
-  A[K, K] = 0;
-  A_qr = qr_Q(A)[ , 1:(K - 1)];
-}
-parameters {
-  vector[K - 1] beta_raw;
-}
-transformed parameters{
-  vector[K] beta = A_qr * beta_raw;
-}
-model {
-  beta_raw ~ normal(0, inv(sqrt(1 - inv(K))));
-}
-```
-
-This produces a marginal standard normal distribution on the values of
-`beta`, which will sum to zero by construction of the QR decomposition.
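The orthogonal-basis idea, and in particular the QR construction shown above, can be checked directly with NumPy. This is an illustrative check of its key properties, not the Stan implementation:

```python
import numpy as np

K = 4
# Same matrix as in the transformed data block above: an identity matrix
# with the last row replaced by (-1, ..., -1, 0).
A = np.eye(K)
A[K - 1, :K - 1] = -1.0
A[K - 1, K - 1] = 0.0
A_qr = np.linalg.qr(A)[0][:, :K - 1]

ok_zero_sum = np.allclose(A_qr.sum(axis=0), 0.0)   # each basis vector sums to zero
ok_orthonormal = np.allclose(A_qr.T @ A_qr, np.eye(K - 1))
# With beta_raw ~ normal(0, 1/sqrt(1 - 1/K)), beta = A_qr @ beta_raw has
# marginal variances diag(A_qr A_qr^T) / (1 - 1/K) = 1, i.e., standard
# normal margins, because A_qr A_qr^T is the zero-sum projection matrix.
ok_margin = np.allclose(np.diag(A_qr @ A_qr.T), 1.0 - 1.0 / K)
print(ok_zero_sum, ok_orthonormal, ok_margin)  # True True True
```

Any `beta = A_qr @ beta_raw` therefore sums to zero exactly (up to rounding), with uncorrelated unconstrained coordinates, which is the same property the built-in transform provides.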
- - -### Translated and scaled simplex {-} - -An alternative approach that's less efficient, but amenable to a -symmetric prior, is to offset and scale a simplex. - -```stan -parameters { - simplex[K] beta_raw; - real beta_scale; - // ... -} -transformed parameters { - vector[K] beta; - beta = beta_scale * (beta_raw - inv(K)); - // ... -} -``` - -Here `inv(K)` is just a short way to write `1.0 / K`. Given -that `beta_raw` sums to 1 because it is a simplex, the -elementwise subtraction of `inv(K)` is guaranteed to sum to zero. -Because the magnitude of the elements of the simplex is bounded, a -scaling factor is required to provide `beta` with $K$ degrees of -freedom necessary to take on every possible value that sums to zero. - -With this parameterization, a Dirichlet prior can be placed on -`beta_raw`, perhaps uniform, and another prior put on -`beta_scale`, typically for "shrinkage." - - ### Soft centering {-} -Adding a prior such as $\beta \sim \textsf{normal}(0,\sigma)$ will provide a kind -of soft centering of a parameter vector $\beta$ by preferring, all -else being equal, that $\sum_{k=1}^K \beta_k = 0$. This approach is only -guaranteed to roughly center if $\beta$ and the elementwise addition $\beta + c$ -for a scalar constant $c$ produce the same likelihood (perhaps by -another vector $\alpha$ being transformed to $\alpha - c$, as in the -IRT models). This is another way of achieving a symmetric prior. +Adding a prior such as $\beta \sim \textsf{normal}(0,\epsilon)$ for a +small $\epsilon$ will provide a kind of soft centering of a parameter +vector $\beta$ by preferring, all else being equal, that $\sum_{k=1}^K +\beta_k = 0$. This approach is only guaranteed to roughly center if +$\beta$ and the elementwise addition $\beta + c$ for a scalar constant +$c$ produce the same likelihood (perhaps by another vector $\alpha$ +being transformed to $\alpha - c$, as in the IRT models). 
This is
+another way of achieving a symmetric prior, though it requires
+choosing an $\epsilon$.  If $\epsilon$ is too large, there won't be a
+strong enough centering effect, and if it is too small, it will add
+high curvature to the target density and impede sampling.
 
 ## Ordered logistic and probit regression {#ordered-logistic.section}
diff --git a/src/theming/stan.xml b/src/theming/stan.xml
index d41c7cc06..73777de01 100644
--- a/src/theming/stan.xml
+++ b/src/theming/stan.xml
@@ -41,6 +41,7 @@
       <item>positive_ordered</item>
      <item>simplex</item>
      <item>unit_vector</item>
+     <item>sum_to_zero_vector</item>
      <item>row_vector</item>
      <item>matrix</item>
      <item>array</item>
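To see why the soft-centering prior described above prefers the centered member of each likelihood-equivalent family: shifting $\beta$ by a constant leaves the likelihood alone (the shift can be absorbed into the intercept) but changes the prior term, which is maximized when the shift equals the mean. A toy check with made-up numbers:

```python
import math

def log_prior(beta, eps):
    # normal(0, eps) log density, dropping additive constants
    return sum(-0.5 * (b / eps) ** 2 for b in beta)

beta = [1.3, -0.2, 0.7, 2.0]
eps = 0.5
center = sum(beta) / len(beta)
centered = [b - center for b in beta]

# Every shift beta - c has the same likelihood, but the prior is highest
# for the shift that centers the vector (sum of squares is minimized at
# the mean), which is what softly enforces sum(beta) approximately 0:
best = all(log_prior(centered, eps) >= log_prior([b - c for b in beta], eps)
           for c in (-1.0, 0.0, 0.3, 1.0, 2.0))
print(best)  # True
```

Smaller `eps` sharpens this preference, at the cost of the extra curvature mentioned above.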