%%%%%%%%%%%%%
Documentation
%%%%%%%%%%%%%

.. contents::
   :local:
   :backlinks: none

--------
Overview
--------

Origins and design
==================

The |pj| project was born from our need to simulate arbitrarily complex SSE processes, something we could not do with many existing applications.
A series of related diversification models had been implemented in excellent computational tools, but each of those tools made unique modeling assumptions that we wanted to relax.

Rather than reverse-engineering and modifying code written by others, however, the path of least resistance seemed to be writing our own simulator.
We also realized that our program could come in handy in multiple projects, current and future, though only if written as more than a "one-off" chunk of code.

Ideally, |pj| would be "modular" in its architecture, both in terms of its codebase and in how models are represented and specified.
This modularity would allow future models to be more easily implemented -- potentially by other developers -- and seamlessly patched into |pj|.

With the above in mind, we reasoned our work would be more likely to bear fruit (and faster) if we used a programming language that:

(i) was easy to prototype and debug model code in,
(ii) was cross-platform,
(iii) supported object-oriented programming, and
(iv) had biology and data-science libraries available.

Python was our language of choice.
As for item (iv), for example, we could make heavy use of the great `Dendropy <https://dendropy.org/>`_ library, maintained by Jeet Sukumaran and collaborators.

Modularity was achieved by building |pj| around a graphical modeling architecture.
This design (used in the past, see below) would not only make |pj| a more flexible simulator, but also more generally useful in the future, as a tool for method development and inference.

Graphical models
================

|pj| follows the graphical model paradigm popularized in the last decade by programs like `RevBayes <https://revbayes.github.io/>`_, `BEAST <https://beast.community/>`_ and `BEAST 2 <https://www.beast2.org/>`_.
In what follows, we shall offer just a brief refresher on what probabilistic graphical models are, and one simple example of their use in evolutionary modeling.
Those interested in a more detailed exposition will find it in Höhna et al. (2014) and references therein (also, take a look `here <https://revbayes.github.io/tutorials/intro/graph_models.html>`_).

Just as with the RevBayes and BEAST programs, specifying a model in |pj| amounts to writing down any number of conditional dependencies among random variables.
Taken collectively, these random variables and their dependencies comprise a type of probabilistic graphical model called a Bayesian network.
As the name suggests, such models can be described as graphs -- directed acyclic graphs (**DAG**), more specifically -- where each node (or vertex) represents a random variable, and each edge a conditional dependency between the two nodes (variables) it connects.

A Bayesian network representing a very simple model can be seen in figure 1a:

.. figure:: ../images/simple_graphical_model_both_manual.png
   :figwidth: 100%
   :scale: 40%
   :align: center

   **Figure 1.** A simple probabilistic model represented by (a) a Bayesian network, and (b) a factor graph.
   From the notation in the main text, :math:`\theta=\{sd, m\}`.
   The dashed box is called a "plate"; for each node it envelops, it denotes multiple (in this case, precisely five) i.i.d. random variables.
   White and gray nodes denote random variables whose values we do (i.e., data) and do not know (i.e., parameters), respectively.

If we were to write down what this Bayesian network represents in probability terms, we would arrive at a joint density (assuming continuous variables) given by the expression:

.. math::
   :label: jointprob

   f_{\Theta,D}(\theta,d) = f_{\Theta|D}(\theta|D=d)f_D(d) = f_{D|\Theta}(d|\Theta=\theta)f_\Theta(\theta),

where :math:`\theta=\{sd, m\}`.

By comparing the Bayesian network graph and the expression above, the attentive reader will notice that functions are not represented in the graph.
We do not know exactly how random variables are related, other than :math:`d_1` through :math:`d_5` depending somehow on :math:`sd` and :math:`m`.
This is why it can be helpful to adopt a more general graphical representation: a factor graph (Fig. 1b).

The model underlying the factor graph in figure 1b is the same as that shown by the Bayesian network, the main difference being the additional factor nodes (the filled squares).
Factor nodes can make the relationships between two or more random variables more explicit, especially if their (factor) functions are annotated.

In the example above we can see a factor whose function :math:`f_{D|\Theta}(d|\Theta=\theta)` gives us the probability of :math:`\boldsymbol{d} = \{d_i: 1 \leq i \leq n\}` given :math:`\theta`.
It is annotated as "Normal", so we in fact know a closed-form expression for this factor function: it is the probability density function of a normal distribution.
Random variables :math:`sd` and :math:`m` stand for the standard deviation and the mean of that normal distribution.
(For those curious to see just how complicated factor graphs can be, an example can be found in Zhang et al., 2023; see their Supplementary Fig. X.)
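Reading the factor graph this way, the model's joint density is simply the product of its factor functions. The minimal Python sketch below (our own illustration, not part of |pj|; density functions written out with the standard library) evaluates that product for the figure-1 model:

```python
import math

def norm_pdf(x, mean, sd):
    # probability density function of a Normal(mean, sd)
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))

def expon_pdf(x, rate):
    # probability density function of an Exponential(rate)
    return rate * math.exp(-rate * x) if x >= 0.0 else 0.0

def joint_density(sd, m, d):
    # f(sd, m, d) = f(sd) * f(m) * prod_i f(d_i | sd, m):
    # one term per factor node in the factor graph of figure 1b
    prior_sd = expon_pdf(sd, rate=1.0)          # sd ~ Exponential(1.0)
    prior_m = norm_pdf(m, mean=0.0, sd=1.0)     # m ~ Z, the standard normal
    likelihood = 1.0
    for di in d:
        likelihood *= norm_pdf(di, mean=m, sd=sd)
    return prior_sd * prior_m * likelihood
```

Each factor node contributes exactly one term to the product, which is what makes the factor-graph representation so convenient for bookkeeping.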

How to read a DAG?
------------------

Most biologists specifying a DAG on a computer do so as the first step of statistical **inference**.
They are interested in estimating unknown quantities about the natural world.
These quantities -- the random variables! -- whose values we do not know (but would like to estimate) are referred to as the **parameters** of the model.
Of course, parameter estimation requires that we know the value(s) of one or more random variables, which we naturally refer to as **data**.

In the example in figure 1, the parameters are :math:`sd` and :math:`m`, jointly referred to as :math:`\theta`; the data are the nodes labeled :math:`d_i`.
(Note how they are colored differently: white for data, gray for parameters.)
Carrying out statistical inference thus implies reading the DAG from its data nodes, normally at the bottom, towards the parameter nodes above.

The most natural approach to parameter estimation requires that we simply take the middle and right-hand side terms in equation :eq:`jointprob`,
and solve for :math:`f_{\Theta|D}(\theta|D=d)`, the posterior density function:

.. math::
   :label: bayestheorem

   f_{\Theta|D}(\theta|D=d) = \frac{f_{D|\Theta}(d|\Theta=\theta)f_\Theta(\theta)}{f_D(d)}

This expression is known as Bayes' theorem.
What programs like RevBayes, BEAST and BEAST 2 do is evaluate the posterior density function at several values of :math:`\theta`, and output the resulting (posterior) distribution.
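As a toy illustration of that idea, the Python sketch below (our own, not any of those programs' actual machinery) evaluates an unnormalized posterior for :math:`m` on a grid of candidate values, holding :math:`sd` fixed at 1.0 for simplicity:

```python
import math

def norm_pdf(x, mean, sd):
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))

def unnorm_posterior(m, d):
    # f(m | d) is proportional to f(d | m) * f(m),
    # with a standard-normal prior on m and sd fixed at 1.0
    likelihood = 1.0
    for di in d:
        likelihood *= norm_pdf(di, mean=m, sd=1.0)
    return likelihood * norm_pdf(m, mean=0.0, sd=1.0)

d_obs = [0.9, 1.1, 1.0, 0.8, 1.2]                  # hypothetical data
grid = [i / 10 for i in range(-20, 21)]            # candidate m values in [-2.0, 2.0]
weights = [unnorm_posterior(m, d_obs) for m in grid]
posterior = [w / sum(weights) for w in weights]    # normalized over the grid
best = grid[max(range(len(grid)), key=lambda i: weights[i])]
```

Real inference programs do something far more sophisticated (e.g., MCMC), but the principle is the same: evaluate the numerator of Bayes' theorem at many values of :math:`\theta` and summarize the result.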

----

|pj|, however, is among other things a collection of **simulation** (rather than estimation) engines.
Borrowing the jargon used above, we are primarily interested in generating values for random variables, some of which are parameters, some data.

Here, it makes sense to think of factors as the distributions from which values will be sampled (though as we will see below, factors can also represent deterministic functions).
As opposed to estimation, simulation starts at the top (or "outer") layers of the DAG, from parameters whose values we do know, and flows in the direction pointed to by arrows (normally downwards; Fig. 1b).

In figure 1b, for example, we start from known values for the parameters of the distributions at the top.
The exponential distribution from which we sample (i.e., simulate) :math:`sd` has a rate of 1.0; the standard normal distribution (:math:`Z`) from which we sample :math:`m`, by definition, has a mean of 0.0 and a standard deviation of 1.0.
We then define a normal distribution from the sampled values of :math:`sd` and :math:`m`, and in turn sample five times from that normal distribution, obtaining our data values in :math:`\boldsymbol{d}`.
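The whole forward pass just described fits in a few lines of Python (a standard-library sketch of ours, with an arbitrary seed for reproducibility; |pj| users would of course write this in *phylojunction* instead):

```python
import random

rng = random.Random(42)  # arbitrary seed, for reproducibility

# Step 1: sample the parameters from the top-layer distributions.
sd = rng.expovariate(1.0)   # sd ~ Exponential(rate=1.0)
m = rng.gauss(0.0, 1.0)     # m ~ Z, the standard normal

# Step 2: sample the data, conditionally on the sampled parameters.
d = [rng.gauss(m, sd) for _ in range(5)]  # d_i ~ Normal(m, sd), i = 1..5
```

Note how the code follows the arrows of the DAG: parameters first, then the data that depend on them.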

Specifying a model (an example)
===============================

|pj| takes a model specification approach that sits between those adopted by the BEAST and RevBayes communities, and that largely intersects with the `LinguaPhylo <https://linguaphylo.github.io/>`_ project.
DAG-building instructions in |pj| are written in its own programming language, *phylojunction* (written in lowercase), whose syntax evokes RevBayes' Rev language.
But unlike Rev, *phylojunction* is not a fully fledged scripting language; it is lightweight and behaves more like a markup language such as XML, BEAST's format of choice.

Commands in *phylojunction* can be read as mathematical statements, and are naturally interpreted as instructions for building a node in a DAG.
Here is what a series of those commands would look like if written as a *phylojunction* script:

.. code-block::
   :caption: **Example script 1.** Script written in *phylojunction* specifying a time-homogeneous birth-death model.

   # hyperprior
   m <- 0.0 # mean of log-normal below
   sd <- 0.1 # standard deviation of log-normal below
   # rate values
   d <- 1.0 # death
   b ~ lognormal(mean=m, sd=sd) # birth
   # deterministic rate containers
   dr := sse_rate(name="death_rate", value=d, event="extinction")
   br := sse_rate(name="birth_rate", value=b, event="speciation")
   O <- 2.0 # origin age
   # deterministic parameter stash
   s := sse_stash(flat_rate_mat=[dr, br], n_states=1, n_epochs=1) # parameter stash
   # phylogenetic tree
   T ~ discrete_sse(stash=s, stop="age", stop_value=O, origin="true")

Each line in the above script is a command instructing the application engine to take some form of user input, produce some value from it, and then store that value permanently in a new variable created on the spot.

Every command string consists of an assignment operator (e.g., ``<-``, ``:=``, ``~``) placed between the variable being created (on its left side) and some user input (on its right side).
We can look at some of these commands individually:

* ``d <- 1.0`` creates a variable ``d`` (the death rate) that stores the constant value ``1.0``. We use ``<-`` when we know the value of a random variable and want to create a constant node in the DAG;
* ``b ~ lognormal(mean=m, sd=sd)`` creates a variable ``b`` (the birth rate) that will store a random value drawn from a user-specified distribution. Here, that distribution is a log-normal with mean ``m`` and standard deviation ``sd``;
* ``dr := sse_rate(name="death_rate", value=d, event="extinction")`` calls deterministic function ``sse_rate``, which creates variable ``dr`` from the value stored in ``d`` and some other user input. There is no stochasticity to deterministic assignments; they help make explicit the steps in which variables are transformed, annotated or combined with others.

And this is the DAG such script instantiates:

.. figure:: ../images/bd_graphical_model_manual.png
   :figwidth: 100%
   :scale: 40%
   :align: center

   **Figure 2.** Factor graph representing the time-homogeneous birth-death model specified in example script 1.

Note how this DAG has a factor node characterized by a different type of function: deterministic function ``sse_rate``.
Such functions are denoted by filled diamonds instead of filled squares (those represent distributions), and will have their output enclosed in a hollow diamond.
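One way to picture the three assignment operators is as constructors for three kinds of DAG nodes. The Python classes below are a hypothetical sketch of ours (not |pj|'s actual internals), with the mapping to operators noted in comments:

```python
import random

class ConstantNode:
    """Created by '<-': stores a known value verbatim."""
    def __init__(self, value):
        self.value = value

class StochasticNode:
    """Created by '~': stores one draw from a distribution."""
    def __init__(self, sample_fn):
        self.value = sample_fn()

class DeterministicNode:
    """Created by ':=': no stochasticity; the value is a pure
    function of the parent nodes' values."""
    def __init__(self, fn, *parents):
        self.value = fn(*[p.value for p in parents])

# Mirroring the first lines of example script 1:
m = ConstantNode(0.0)
sd = ConstantNode(0.1)
b = StochasticNode(lambda: random.lognormvariate(m.value, sd.value))
br = DeterministicNode(lambda v: {"name": "birth_rate", "value": v}, b)
```

The deterministic node simply repackages the birth-rate draw with an annotation, which is all ``sse_rate`` conceptually does in the script above.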

Multiple samples and replicates
-------------------------------

Unlike other probabilistic programming languages, in *phylojunction* the output of every function is immutable and depends exclusively on that function's input.
Initialized variables cannot be altered by mutator methods or side-effects.

An important consequence of variable immutability is that it precludes loop control structures (e.g., *for* and *while* loops).
A reasonable question is then: *How does one have the model be sampled (i.e., simulated) multiple times?*

Let us look at the simple model shown in figure 1.
Ten independent samples of that model can be obtained with the following *phylojunction* script:

.. code-block::
   :caption: **Example script 2.** Script written in *phylojunction* specifying the simple model in figure 1 and sampling (i.e., simulating) it ten times.

   sd ~ exponential(rate=1) # this value will be vectorized
   m ~ normal(mean=0.0, sd=1.0) # ... and so will this!
   d ~ normal(n=10, nr=5, mean=m, sd=sd)

As can be seen from the last line in script 2, all it took was providing argument ``n=10`` to the function call of a distribution.
But note how the first and second lines in script 2 do not set ``n=10``.
Only a single value is drawn from the exponential and the standard normal distributions; implicitly, those commands set ``n=1``.

|pj| deals with this discrepancy in the requested number of samples by **vectorizing** the single values stored in :math:`sd` and :math:`m`.
For example, if the sampled value of :math:`m` is 0.1, under the hood |pj| converts :math:`m=0.1` into :math:`\boldsymbol{m}=\{0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1\}`.
In such case, all ten samples requested by command ``d ~ normal(n=10, nr=5, mean=m, sd=sd)`` would thus come from normal distributions with the same mean of 0.1.

.. warning::
   |pj| will raise an error if the numbers of samples ``n`` specified in two commands are different and both greater than 1.
   In other words, vectorization can only be applied to variables holding a single value.
   (This type of behavior should be familiar to R users.)
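A hypothetical Python sketch of this vectorization rule (our own illustration, not |pj|'s real code) might look like:

```python
def vectorize(values, n):
    """Stretch a single-valued variable to n copies; reject
    mismatched multi-valued variables (mirroring the warning above)."""
    if len(values) == 1:
        return values * n  # e.g., [0.1] -> [0.1, 0.1, ..., 0.1]
    if len(values) != n:
        raise ValueError("sample counts differ and both exceed 1")
    return values

m = [0.1]                 # single draw, as in script 2
m_vec = vectorize(m, 10)  # ten copies of 0.1
```

This is analogous to recycling in R or broadcasting in NumPy: only a length-1 value can be silently stretched to match a larger request.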

----

In addition to simulating an entire model multiple times as explained above, it is also possible to specify a model with multiple i.i.d. random variables, alternatively referred to as **replicates**.
Multiple replicates can be specified by providing an additional argument ``nr`` to distribution calls, e.g., ``nr=5`` as in the last line of script 2.

In a DAG, i.i.d. random variables can be collectively represented by a single plated node (Fig. 1), or each be represented by an individual node.
There are in principle no constraints on the number of replicates a plated node may represent.
Errors will be thrown only if such nodes are used as input for a function, and the replicate count somehow violates that function's signature.
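In Python terms, the combination of ``n`` samples and ``nr`` replicates can be pictured as a two-dimensional draw, with one row per model sample and one i.i.d. replicate per column (our sketch, not |pj| output):

```python
import random

rng = random.Random(1)  # arbitrary seed, for reproducibility
n, nr = 10, 5           # n model samples, nr replicates each

m, sd = 0.0, 1.0        # assume these were sampled upstream
# Mirrors 'd ~ normal(n=10, nr=5, mean=m, sd=sd)' in script 2:
d = [[rng.gauss(m, sd) for _ in range(nr)] for _ in range(n)]
```

Each row corresponds to one full simulation of the model; within a row, the five values are the i.i.d. replicates that a single plated node would represent.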

------------------------------
Graphical user interface (GUI)
------------------------------

.. figure:: ../images/pjgui_model.png
   :figwidth: 100%
   :align: center

   **Figure 3.** Test.

----------------------------
Command-line interface (CLI)
----------------------------


-------
Lexicon
-------

.. toctree::
   :maxdepth: 2

   parametric.rst