
New models in flocker #105

Open
mhollanders opened this issue Jul 9, 2024 · 8 comments

Comments

@mhollanders

Hi there,

I'm currently working on occupancy models geared specifically towards passive acoustic monitoring, where sites are allowed to change occupancy state throughout the acoustic deployment period and the observation process deals with the waiting times between detections. I'm currently writing the manuscript, and it would be great if these models were available in existing R packages, for which I think flocker would make the most sense. Are you open to something like this?

Cheers,

Matt

@abfleishman
Contributor

@mhollanders I too am very interested in models like the ones you describe. Giant +1!

@jsocolar
Owner

jsocolar commented Aug 28, 2024

Let's try to do this, though I cannot make any promises about timescales for getting things done on my end. If somebody is willing to do a big chunk of the lifting, I can commit to doing any necessary reviews and polishing in a timely manner.

To get started, check out the instructions here: https://github.com/jsocolar/flocker/blob/main/inst/rmd/adding_a_family.Rmd

The main thing is to get a sketch of what the likelihood should look like, and then a sketch of what the flockerdata format for this model type needs to look like (if different from the format for one of the existing model types). I recommend against starting more substantial work on the code prior to agreeing on the data format. Usually, the biggest challenges involve getting the indexing right during the data packing (in R code) and the data unpacking (in Stan code), and it's easy to waste a lot of time here by racing ahead before we agree on the data format.

@mhollanders
Author

Hi @jsocolar, sounds good. For the single-season model, the key data structures are (1) the number of detections per site, primary occasion, and secondary occasion, and (2) the waiting times between detections. For the multi-season version, there's an additional index for season. For the single-season model, the idea is that there is an "overall occupancy" state for each site (analogous to the single-season occupancy probability), but that occupancy state is allowed to change within a deployment period. So essentially, this is just a zero-inflated (with the extra occupancy probability) dynamic occupancy model where the observed data are waiting times. I should probably take a look at what the current data format looks like for dynamic occupancy, as it should be analogous. For the multi-season version, I doubt that there is already a data type that has all the indices.

@jsocolar
Owner

Could you write out the likelihood in math notation?

@mhollanders
Author

Yes. The idea is that there's an overall occupancy state of site $i \in 1:I$ in season $k \in 1:K$; I'll just write it out for $K=1$ for now. Then, within each season, the deployment period of the acoustic devices is split into occasions $j \in 1:J$, where the interval length $\delta$ should be informed by the species' ecology. The overall occupancy state $m_i$ is a Bernoulli trial, analogous to the single-season occupancy probability:

$$ m_i \sim \text{Bernoulli} (\psi_i) $$

where $\psi_i$ can be modeled flexibly, i.e. with covariates or (spatial) random effects. Then conditional on $m_i = 1$, the within-occasion occupancy state $z_{ij}$ is allowed to change, and is also modeled with a Bernoulli trial (I opted for this over a dynamic occupancy model with colonisation/emigration probabilities):

$$ z_{ij} | m_i \sim \text{Bernoulli} (\chi_{ij}) $$

where again, $\chi_{ij}$ can vary by site and occasion.

Conditional on $z_{ij} = 1$, the observed waiting times between detections $\mathbf{y}_{ij}$ are modeled with an exponential distribution:

$$ y_{ij} | z_{ij} \sim \text{Exponential} (\mu_{ij}) $$

where the last waiting time (which is the total occasion length if no detections were made) is modeled with the complement of the CDF of the exponential distribution (i.e., it is right-censored). The call rate $\mu_{ij}$ can also be modeled flexibly with a Hawkes process, which addresses the non-independence of calls, e.g. due to animals vocalising repeatedly or to each other.

So basically, instead of assuming the site is occupied throughout the deployment period, the occupancy process is modeled more flexibly, and all of the detection data are used in the observation process.
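To make the marginalization concrete, here is a minimal numerical sketch of this likelihood for one site, assuming a constant exponential rate within each occasion (no Hawkes process) and equal occasion lengths. The function names (`site_loglik`, `exp_waiting_loglik`) are hypothetical, not flocker code:

```python
import numpy as np

def exp_waiting_loglik(waits, total_len, mu):
    """Log-likelihood of exponential waiting times within one occasion;
    the interval after the last detection is right-censored (survival term)."""
    waits = np.asarray(waits, dtype=float)
    log_pdf = np.sum(np.log(mu) - mu * waits)      # density of observed gaps
    log_surv = -mu * (total_len - waits.sum())     # ccdf of the final gap
    return log_pdf + log_surv

def site_loglik(occasions, total_len, psi, chi, mu):
    """Marginal log-likelihood for one site: `occasions` is a list of
    (possibly empty) waiting-time vectors, one per occasion j."""
    # conditional on the site being occupied overall (m = 1)
    lp_occ = 0.0
    for j, y in enumerate(occasions):
        lp_z1 = np.log(chi[j]) + exp_waiting_loglik(y, total_len, mu[j])
        # z_ij = 0 is only consistent with zero detections in occasion j
        lp_z0 = np.log1p(-chi[j]) if len(y) == 0 else -np.inf
        lp_occ += np.logaddexp(lp_z0, lp_z1)
    # m = 0 is only consistent with zero detections at the whole site
    all_empty = all(len(y) == 0 for y in occasions)
    lp_m0 = np.log1p(-psi) if all_empty else -np.inf
    return np.logaddexp(lp_m0, np.log(psi) + lp_occ)
```

For a site with no detections anywhere, this reduces to $(1-\psi) + \psi \prod_j \left[(1-\chi_j) + \chi_j e^{-\mu_j \delta}\right]$, which is a useful sanity check.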

@jsocolar
Owner

jsocolar commented Oct 19, 2024

I've had a substantial think about this, and I think I have a plan of attack that is relevant to the $K = 1$ case, which is a necessary first step on any potential journey towards other values of $K$. Tell me if the following makes sense, or if I'm missing something important.

There are two changes we need to make.

  1. We need the ability to wrap arbitrary groups of sites inside an extra layer of occupancy model, such that when the top layer is "occupied" the associated group of sites gets treated as an ordinary occupancy model, and when the top layer is "unoccupied" the associated group of sites must structurally yield zero detections. It turns out that this is in some important respects just a neat generalization of the existing data-augmented model, and I can see a variety of use-cases for it.
  2. We need the ability to pass data on (ordered) waiting times rather than binary detection/non-detection, and express a likelihood conditional on occupancy as a function of the waiting-times data.

It turns out that (1) is going to be relatively easy (but not trivial) to implement. Because there are lots of use-cases, it's a good place to start because the effort won't be wasted even if later steps get bogged down. And happily, most of the machinery already exists for this in the data-augmented multi-species model framework, where collections of units (units = species-sites) are wrapped inside groupings (species) that have a collective binary occupancy state at the group level. I think the data-augmented custom family is already essentially the same thing as this, and all that needs to change is:

  • relax the data formatting to avoid assuming that all groups are identically sized
  • change how flock calls this model so that we can have a covariate-based formula for the top-level likelihood. (Currently, when we call the augmented model, under the hood we construct and use an intercept-only formula for the parameter Omega.)
  • make sure the likelihood uses the relevant values of Omega and not just the first value under the assumption that they are all structurally identical
  • change the name so that we can still run post-processing specific to the augmented model even when these assumptions are relaxed.

My preliminary thought about (2), i.e. passing waiting times as data, is that I don't want to implement it for this complicated multi-level occupancy model without first getting it implemented for a garden-variety single-season model. I think I can readily help with getting the data structure right, and it sounds like the likelihood piece will be pretty simple too.

If you're still gung-ho to take this on, my suggestion to you is that you start with change number 1. Below, I'm going to provide a pretty detailed, sort of micro-managing guide to how to do that. Once you implement 1, I think you'll have a strong feel for how everything works, and probably can tackle 2 pretty readily, which will be slightly less trivial because it cannot lean quite so heavily on existing machinery that can be coopted for the same task. With both 1 & 2 under your belt, you'll be able to combine them effectively, I think. And then we can relax the assumption that $K = 1$, which honestly will be pretty brutal to do but in a way that is more tedious than difficult.

Ok, so here we go, with a detailed plan for step 1:
I suggest that we pass the data (which for now is still binary detection/nondetection data) as an I x J x K array where rows I are closure-units (you call these "occasions"), columns J are repeat sampling events within a closure-unit, and slices K are unit-groups (you call these "sites"). Allowable values are 1 (detection), 0 (no detection), and NA (no sampling event). Then the task looks like:

  1. Borrow code from make_flocker_data_augmented to write a new function make_flocker_data_twolevel that is similar to make_flocker_data_augmented with two differences:
    1. It allows more raggedness. make_flocker_data_augmented expects possible raggedness such that different numbers of columns might be filled for different combinations of I and K. The new function needs to additionally expect possible raggedness such that different numbers of slices might be filled for different values of I.
    2. It either doesn't include n_aug or allows it to be set to zero. Which choice is better depends on how you implement item (2) below.
  2. Refactor make_flocker_data_augmented to use make_flocker_data_twolevel internally, to avoid code duplication and added maintenance burden.
  3. I would suggest beginning with a PR that includes ONLY these first two changes, to keep things streamlined and manageable.
  4. Modify flocker:::make_occupancy_augmented_lpmf in the following ways
    1. change its name to make_occupancy_twolevel_lpmf
    2. Instead of just using the first element of the linear predictor for Omega (under the assumption that it is intercept only and therefore constant), make sure that you're using the correct element of Omega regardless of what the formula is.
    3. give it an extra argument called type with allowable values augmented or twolevel. In the event that the type is augmented, make sure that the stan function produced is named occupancy_augmented_lpmf. If type = twolevel then name the stan function occupancy_twolevel_lpmf.
    4. Update every instance of make_occupancy_augmented_lpmf to make_occupancy_twolevel_lpmf in both the codebase and the doc, and add the type = augmented argument everywhere necessary.
  5. Add an occupancy_twolevel custom family to occupancy_families.R
  6. Modify flocker::flock
    1. change the argument augmented = FALSE to multilevel = NULL.
      1. Behavior under the default NULL should be exactly as the behavior under the current default.
      2. Other allowable values should be "augmented" yielding identical behavior to the current augmented = TRUE and "twolevel", yielding the new generalized two-level model.
    2. Add an argument analogous to the current f_auto = NULL but call it f_Omega. This will be where we pass our formula on the top-level occupancy probability.
    3. Propagate these changes down through flock_, flocker_fit_code_util and all downstream functions to enable fitting the generalized two-level model!
  7. Update the doc as appropriate.
  8. Submit another self-contained PR at this stage.
  9. Implement the relevant post-processing functionality, and submit that in a stand-alone PR. Lean on the code for the data-augmented model, and refactor as necessary to minimize the maintenance burden.
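To illustrate the proposed data format in step 1, here is a toy sketch of the I x J x K array with the two kinds of raggedness it needs to support. This is Python/numpy rather than R (so `np.nan` stands in for NA), and the dimensions are made up for illustration:

```python
import numpy as np

# Toy version of the proposed I x J x K detection array:
# I = 3 closure-units ("occasions"), J = 4 repeat sampling events,
# K = 2 unit-groups ("sites"). Values: 1 = detection, 0 = no detection,
# np.nan = no sampling event (the R analogue would be NA).
obs = np.full((3, 4, 2), np.nan)

# Unit-group 0 is ragged: occasion 0 has three visits, occasion 1 has
# two, and occasion 2 was never sampled (its slice stays all-NaN).
obs[0, :3, 0] = [1, 0, 0]
obs[1, :2, 0] = [0, 0]

# Unit-group 1 has all three occasions fully sampled.
obs[:, :, 1] = 0
obs[2, 1, 1] = 1

# Summaries ignore unsampled cells, e.g. detections per unit-group:
dets_per_group = np.nansum(obs, axis=(0, 1))
```

The packing code would then need to record, per unit-group, which occasions and visits are actually filled, which is exactly the extra raggedness described in item 1.1 above.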

@mhollanders
Author

Hi Jacob,

I'm having a crack at steps 1–3 as outlined above. Pardon me if I'm a bit slow to understand all of this, as I've never worked on an R package before. I have a few questions.

  • make_flocker_data_augmented() additionally expects at least two sampling events, with possible values 1, 0, or NA. However, the described model can have a single sampling event, namely the waiting time from the start to the end of the occasion when no detections were made. Additionally, the possible values are actually the waiting times, not 1, 0, or NA.
  • I don't know what you mean by "refactoring" the original function. Can you point me to a resource to understand this?

Sorry about this, I'm a bit out of my depth.

Matt

@jsocolar
Owner

jsocolar commented Oct 25, 2024

No worries! Thanks for taking a look. An initial word of opportunity and caution: I'm really happy to walk you through the development of these functions (like honestly it would be wonderful!), but this is probably only a good idea if you are substantially interested in learning about writing the software. If your main interest is just in being able to fit the model, then I suspect that this could turn into a painful slog that might peter out rather than end with a usable deliverable.

With that said:

  • We are splitting the work into two parts. The first part ONLY creates the two-level occupancy framework, which enables us to model the occupancy of occasions nested inside of sites, but continues to use binary detection/nondetection data over repeat visits (within occasions) for the likelihood at the occasion level. We will swap the likelihood over to use waiting times in a separate step. There are two reasons for splitting the work like this.
    • Trying to do it all at once would be error-prone and would lead to pull requests that are too large to review effectively.
    • These variations should be useful on their own, and not only in concert. When doing this work, I don't want to end up with a two-level waiting-times model and nothing else. I want to end up with two-level models using both ordinary and waiting-times likelihoods, and with waiting-times models both for the one-level and two-level cases.
  • "Refactor" in this context means to re-organize the code into useful component functions that are concise, efficient, and reusable. Here's an even better definition from our robot overlords: https://chatgpt.com/share/671b9c25-d0f4-8000-aba4-b3549e1fbeed In particular, if we just wrote make_flocker_data_twolevel by borrowing a bunch of code from make_flocker_data_augmented while leaving make_flocker_data_augmented untouched, then we'd end up with a whole bunch of code duplication. In the future, if we spotted any bugs or wanted to make the code more efficient or more general, we would have to do it in two places, which is extra work. The solution is to put the core functionality into a single function, and then let the other function call that core function internally. Because make_flocker_data_twolevel is more general than make_flocker_data_augmented, my suggestion is to move the main code into make_flocker_data_twolevel, and then re-write make_flocker_data_augmented to call make_flocker_data_twolevel internally, after performing the necessary additional checks for non-raggedness (every species represented on every visit that exists) and appending the augmented zero-data before formatting with make_flocker_data_twolevel.
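For concreteness, the wrapper-plus-core refactor pattern described above might look like the following. This is a schematic Python sketch, not flocker's actual R code: the function names mirror make_flocker_data_twolevel and make_flocker_data_augmented, but the bodies are placeholder stubs:

```python
def make_flocker_data_twolevel(obs):
    """Core function: accepts (possibly ragged) two-level data and does
    the real packing work. Shown here as a stub returning a dict."""
    return {"type": "twolevel", "obs": obs}

def check_non_ragged(obs):
    """Augmented-specific validation: every group must be the same size
    (every species represented on every visit that exists)."""
    if len({len(group) for group in obs}) > 1:
        raise ValueError("augmented model requires non-ragged data")

def make_flocker_data_augmented(obs, n_aug):
    """Thin wrapper: run the augmented-specific checks, append the
    all-zero augmented pseudo-groups, then delegate to the core
    function so the packing logic lives in exactly one place."""
    check_non_ragged(obs)
    zero_group = [0] * len(obs[0])
    obs_aug = list(obs) + [list(zero_group) for _ in range(n_aug)]
    packed = make_flocker_data_twolevel(obs_aug)
    packed["type"] = "augmented"
    return packed
```

The payoff is that a bug fix or generalization in the core function automatically benefits both entry points.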
