Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

demes-python and demes-spec both carry example yaml files #203

Open
grahamgower opened this issue Feb 16, 2021 · 20 comments
Open

demes-python and demes-spec both carry example yaml files #203

grahamgower opened this issue Feb 16, 2021 · 20 comments

Comments

@grahamgower
Copy link
Member

It's useful to have examples in each repository for testing, and probably awkward to force devs to clone both repositories just to run the tests. So if we keep both copies, we should have a way to synchronise them, or at least validate the demes-python repository examples against the spec.

@apragsdale
Copy link
Member

Picking up this thread again. @molpopgen and I were just putting together some tests for fwdpy11, and were discussing that it would be really handy to have this suite of examples ship with demes-python as data files, so that testing loading and converting can be done directly from the demes package (I was having the same thoughts with testing moments as well). I think that it will be really useful to have those examples so that loading and converting can be testing without having to clone the whole demes repo.

What do we think about this?

@molpopgen
Copy link
Collaborator

I think having the examples as data files is a good idea. There's the testing angle, and then fwdpy11 could also provide convenience functions for generating the models for simulations using those data files.

@molpopgen
Copy link
Collaborator

While I'm thinking about this: is there a "version" field in our spec somewhere? That'd be quite useful to track changes in models over time.

@grahamgower
Copy link
Member Author

grahamgower commented Apr 5, 2021

Having some set of example models available as data files might be useful, yeah. Here are a few additional considerations.

  • The spec repository needs a test set too, but probably with a different focus (add set of test json files demes-spec#27).
  • For testing within demes-python, we may want to include models that are not reasonable to simulate. E.g. edge cases with respect to various parameters or graph structure.
  • stdpopsim will (probably later this year) include a broad set of demes YAML files, and this should be the preferred source for simulatable models.
  • We probably don't want to make any gaurantees about not changing the set of example models in the demes-python repo.
  • We want to export the hypothesis strategy for generating valid graphs. These can be used for testing converters to/from other formats, including other simulators. However, these randomly generated graphs deliberately contain lots of weird edge cases, and are unlikely to be simulatable in general.

While I'm thinking about this: is there a "version" field in our spec somewhere? That'd be quite useful to track changes in models over time.

We don't have a version field for the models, no. What would a simulator would do with this field though? If a model has a bug, it should be fixed and a new software release (e.g. stdpopsim release) will include the fix. If a model is otherwise changed, it should be given a different filename (unless maybe it's a toy/testing model in demes-python and the test requirements have been changed).

@molpopgen
Copy link
Collaborator

Having some set of example models available as data files might be useful, yeah. Here are a few additional considerations.

* The spec repository needs a test set too, but probably with a different focus ([popsim-consortium/demes-spec#27](https://github.com/popsim-consortium/demes-spec/issues/27)).

Right, these would be different.

* For testing within `demes-python`, we may want to include models that are not reasonable to simulate. E.g. edge cases with respect to various parameters or graph structure.

Wouldn't the edge cases belong in tests/ and not be installed?

* `stdpopsim` will (probably later this year) include a broad set of demes YAML files, and this should be the preferred source for simulatable models.

Why prefer this package? I think this may be problematic for simulation authors whose work is not supported by stdpopsim.

* We probably don't want to make any gaurantees about not changing the set of example models in the `demes-python` repo.

I guess I was thinking of distinguishing testing models from published models like the Gutenkunst, for example

* We want to export the `hypothesis` strategy for generating valid graphs. These can be used for testing converters to/from other formats, including other simulators. However, these randomly generated graphs deliberately contain lots of weird edge cases, and are unlikely to be simulatable in general.

While I'm thinking about this: is there a "version" field in our spec somewhere? That'd be quite useful to track changes in models over time.

We don't have a version field for the models, no. What would a simulator would do with this field though? If a model has a bug, it should be fixed and a new software release (e.g. stdpopsim release) will include the fix. If a model is otherwise changed, it should be given a different filename (unless maybe it's a toy/testing model in demes-python and the test requirements have been changed).

Ultimately, this would end up in the metadata in a tree sequence file. Changing the file name breaks API for down stream packages. Changing the version number does not, and makes the metadata no longer equal, which may be useful.

@grahamgower
Copy link
Member Author

* `stdpopsim` will (probably later this year) include a broad set of demes YAML files, and this should be the preferred source for simulatable models.

Why prefer this package? I think this may be problematic for simulation authors whose work is not supported by stdpopsim.

Collecting demographic models is a core goal of the stdpopsim package. It shouldn't matter whether fwdpy11 is a supported backend simulation engine in stdpopsim or not, because the YAML files will be there for anyone to use. Making these available as data files in a standard location seems like a good idea and should be fairly easy. Then they could be accessed with a simple wrapper function like this:

import pkgutil  # core python lib
import demes

def get_stdpopsim_model(species_id: str, model_id: str) -> demes.Graph:
    # pkgutil.get_data() returns the file contents as bytes
    yaml_bytes = pkgutil.get_data("stdpopsim", "demographic_models/{species_id}/{model_id}.yaml")
    yaml_str = yaml_bytes.decode()
    return demes.loads(yaml_str)

graph = get_stdpopsim_model("HomSap", "OutOfAfricaArchaicAdmixture_5R19")

We could probably even separate out the data files into a stdpopsim-data pypi package, so no additional dependencies would be required. I guess there's no problem keeping this in the stdpopsim git repo along side the stdpopsim package.

While I'm thinking about this: is there a "version" field in our spec somewhere? That'd be quite useful to track changes in models over time.

We don't have a version field for the models, no. What would a simulator would do with this field though? If a model has a bug, it should be fixed and a new software release (e.g. stdpopsim release) will include the fix. If a model is otherwise changed, it should be given a different filename (unless maybe it's a toy/testing model in demes-python and the test requirements have been changed).

Ultimately, this would end up in the metadata in a tree sequence file. Changing the file name breaks API for down stream packages. Changing the version number does not, and makes the metadata no longer equal, which may be useful.

Hmm... ok. But wouldn't that be just as easily dealt with by including the YAML contents in the tree sequence provenance? Or a checksum? One could do something slightly more sophisticated like this:

import hashlib  # core python lib
import demes

def graph_checksum(graph: demes.Graph) -> str:
    # We'll probably make a gaurantee that the asdict() result will be stable
    # across `demes` versions, barring bugs (but not for asdict_simplified()).
    data = graph.asdict()

    # delete stuff we don't want in the checksum
    for key in ("description", "doi"):
        data.pop(key, None)
    for deme in data["demes"]:
        deme.pop("description", None)
        if "ancestors" in deme:
            # sort ancestors/proportions lists
            deme["ancestors"], deme["proportions"] = zip(
                *sorted(zip(deme["ancestors"], deme["proportions"]))
            )

    m = hashlib.sha256()
    m.update(repr(data).encode())
    return m.hexdigest()

@apragsdale
Copy link
Member

I think that we'd want to keep the demes-python package as light as possible and not pull in stdpopsim, which is pretty dependency heavy, just to gain access to a handful of example files. I see stdpopsim depending on demes in the future, but not the other way around. I guess a thing I've run into with getting demes into moments and fwdpy11 is that there are now duplicates of some demographic models, the same ones that are already in the demes examples directory if you clone the repo, so it seems natural to make those examples that are already there available as data files. I'm hesitant to add dependencies just to gain access to a couple example files.

@molpopgen
Copy link
Collaborator

which is pretty dependency heavy, just to gain access to a handful of example files.

This is my concern as well. A stdpopsim-demes-models repo would be preferred over the entire stdpopsim.

@molpopgen
Copy link
Collaborator

A related concern is licensing. stdpopsim is GPL3. While that is compatible with fwdpy11, compatibility w/moments is unclear, and other consuming packages may have similar concerns.

@grahamgower
Copy link
Member Author

To clarify, I'm not suggesting that demes-python would depend on stdpopsim, nor that demes-python would need to access the stdpopsim models at all.

@jeromekelleher
Copy link
Member

I'm with @grahamgower here --- stdpopsim will have a large number of curated demes models which we can easily add in a supported interface for accessing programatically. This is entirely separate from the demes package./

I don't think it's a good idea shipping example models as part of the package. We are taking on a responsibility of ensuring that they are correct, which I don't want to do. (Even if we say they're just examples and not to do it, people will assume they are correct anyway.) It's just maintenance burden for no real benefit, IMO.

@molpopgen
Copy link
Collaborator

I'm with @grahamgower here --- stdpopsim will have a large number of curated demes models which we can easily add in a supported interface for accessing programatically. This is entirely separate from the demes package./

I don't think it's a good idea shipping example models as part of the package. We are taking on a responsibility of ensuring that they are correct, which I don't want to do. (Even if we say they're just examples and not to do it, people will assume they are correct anyway.) It's just maintenance burden for no real benefit, IMO.

I'd like to avoid having the demographic models shipping with that package. It brings in a bunch of dependencies that users of other simulators just don't need. And it would also force other developers to keep up with their dependencies, such as the minimum python version, which has dropped support for 3.6 ahead of its end of life. That kind of stuff just poses complications for other tools.

I propose an entirely separate repository for the curated models. That keeps it as minimal as possible, and it doesn't even need to be Python. It could just be a bunch of directories organized in some logical way, and consumers could deal with it however they see fit.

@molpopgen
Copy link
Collaborator

The more I think about it, the more it makes sense that the models shouldn't be associated with a Python package at all. If someone re-implements the specification another language, then it's weird for them to do something like include a Python package as a sub module.

@jeromekelleher
Copy link
Member

jeromekelleher commented Apr 6, 2021

Sure, I agree with your points here @molpopgen, but it's going to be a lot of work to make all this happen. We need a lot of infrastructure to make stdpopsim happen and it's going to need to be refactored quite a lot to keep it updated over time. It's a lot of work, and I'm not convinced there's really that much benefit in breaking out the demes specifications.

If we just want a handful of examples for documentation purposes, then I think we already have that.

If we want a set of carefully chosen examples that illustrate various aspects of the spec, and that form set of positive and negative test cases for evaluating parsers/implementations, then models from the literature are useless there anyway and we should make up artificial ones.

ps. Are there that many deps with stdpopsim? It's just msprime and a few standard Python things at the moment AFAIK, pretty lightweight in the scheme of things.

@grahamgower
Copy link
Member Author

ps. Are there that many deps with stdpopsim? It's just msprime and a few standard Python things at the moment AFAIK, pretty lightweight in the scheme of things.

This is actually really annoying to discover. I ended up installing pipdeptree. (Results below for stdpopsim HEAD)

$ pipdeptree -p stdpopsim -f | sed 's/ //g' | sort -u
Warning!! Cyclic dependencies found:
* mdit-py-plugins => markdown-it-py => mdit-py-plugins
* markdown-it-py => mdit-py-plugins => markdown-it-py
------------------------------------------------------------------------
appdirs==1.4.4
asciitree==0.3.3
attrs==20.3.0
fasteners==0.15
humanize==3.1.0
jsonschema==3.2.0
kastore==0.3.1
monotonic==1.5
msprime==1.0.0b1
newick==1.0.0
numcodecs==0.7.2
numpy==1.20.2
pandas==1.1.4
pyrsistent==0.17.3
pyslim==0.600
python-dateutil==2.8.1
pytz==2021.1
setuptools==54.0.0
six==1.15.0
stdpopsim@file:///home/grg/src/stdpopsim
svgwrite==1.4
tskit==0.3.5
zarr==2.5.0

@apragsdale
Copy link
Member

Maybe we punt this for now then. Personally, I think there are quite a lot of extra dependencies of stdpopsim that I wouldn't want to need to install just to get demes functionality in moments, for example, which doesn't use slim or msprime or tskit, pandas, newick, etc...

I agree with @molpopgen that breaking out the (future) catalog of demes specified demographic models into its own repository could be quite handy, though I also agree with @jeromekelleher that it would require a lot of refactoring and new infrastructure. I suppose, when we get to the point of converting all demographic models in stdpopsim over to demes format, we could take some time then to think if there are simplifications and refactoring that could be made. For example, it would be awesome to be able to access the demes-specified demographies data files to use in moments, even though moments is not a supported backend of stdpopsim and I wouldn't want to make stdpopsim a dependency of moments...

@jeromekelleher
Copy link
Member

I think there's some crossed wires here: nobody is suggesting that stdpopsim becomes a dependency of demes. Graham and I are just saying that if someone wants a set of curated demes models from the literature, they can get them from stdpopsim.

If you have a python package that needs these, then a dependency on stdpopsim seems a reasonable price. I guess we could break out the stdpopsim-catalog as a separate minimal package at some point, but there's a lot of work involved and not that many developers.

@apragsdale
Copy link
Member

Ok - yeah, I think there were some crossed wires, and I didn't understand the suggestions I think. Sorry about that! That's all quite reasonable, and I agree it's a lot of work and few developers, but if there is good reason to, I think it's still worth considering when we get to that point of converting the stdpopsim models.

More simply, what are we doing with the YAML files in the examples dir? Are they used in the docs, and if not then should they be removed so that they don't need to be maintained? Jerome you make the good point that we don't want to guarantee their correctness here, and maintaining them can be a hassle..

@grahamgower
Copy link
Member Author

More simply, what are we doing with the YAML files in the examples dir? Are they used in the docs, and if not then should they be removed so that they don't need to be maintained? Jerome you make the good point that we don't want to guarantee their correctness here, and maintaining them can be a hassle..

They've served a use during the initial stages of development, where we wanted to exchange our thoughts about how the yaml format should look, but that's probably coming to an end. We do still want to include some yaml files for the tests, so probably we should just move them somewhere under the tests/ folder.

@grahamgower
Copy link
Member Author

A related concern is licensing. stdpopsim is GPL3. While that is compatible with fwdpy11, compatibility w/moments is unclear, and other consuming packages may have similar concerns.

I think there's a fairly clear case for using a separate license (or dual license) for the yaml files that will be included in stdpopsim. While there is some intellectual property associated with implementing and curating the models, these also derive from details in publications. Certainly these models are "free for all to use" in spirit, and distributing stdpopsim under the GPL because the python code does import msprime is irrelevant to the license of any specific yaml file when it is copied out of the repo. Probably we could just place the yaml files into the public domain, or use a creative commons (doc-style) license of some sort. But this is a conversation we should have in the stdpopsim repo, and once the models have been converted.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants