Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

🗂️Allow version restrictions in dependencies_results #52

Open
funnell opened this issue May 31, 2023 · 5 comments
Open

🗂️Allow version restrictions in dependencies_results #52

funnell opened this issue May 31, 2023 · 5 comments

Comments

@funnell
Copy link
Collaborator

funnell commented May 31, 2023

The new dependencies_result features in PRs #22 #51 looks very nice.
We often need to get the latest results from an app, given some version restrictions.
For example, we want an app to pull in the latest results from a dependency such that the dependency app's major version is 3.
Or we might also want to be able to specify a version range like >=3.1,<4 (it wouldn't necessarily have to be specified in this format though).

@juanesarango
Copy link
Contributor

These are good ideas. Im definitely down to implement something better to define dependencies

@nickp60
Copy link
Collaborator

nickp60 commented Jun 12, 2023

Hi @juanesarango , to elaborate on our use case, we have two major analysis categories that roughly contain steps for preprocessing, one or more analyses, and then some downstream apps relying on the analyses results. All analyses are therefore linked to parent analyses, which trace back to either the proprocessing app or to the raw data itself.

We use the following method to fetch the appropriate parent analyses. The repo this comes from is in the msk enterprise github; happy to add you as a collaborator to see it all in context.

def get_latest_analysis_results(
    app_name,
    app_min_version,
    assembly,
    targets,
    results_keys=['fq1', 'fq2'],
    verbose=False,
    sample_level=False,
):
    """Retrieve analysis results with the highest valid application version.
    Args:
        app_name (str): application name, e.g. "Shotgun Preprocessing"
        app_min_version (str): minimum version of the application, e.g. '2.0.0'
                               Only analyses whose application version matches this argument's major version will be considered.
        assembly (str): the name of the assembly
        targets: a list of target experiment objects
        results_keys (list): the expected results keys. The default is a fastq pair.
        verbose: enable verbose logging
        sample_level: find sample_level parent apps. Note this should be set to
                      FALSE when aggregating experiments to create sample-level
                      analyses
    Returns:
        analysis_keys: list with the primary keys of the retrieved analysis
        results: dictionary of results
    """

    if app_name is None:
        if results_keys == ['fq1', 'fq2']:
            return get_raw_fastqs_as_analysis(targets)
        else:
            raise ValueError(
                "app_name can only be set to None to retrieve a fastq pair (ie results_keys = ['fq1', 'fq2']"
            )
    parent_analyses = get_analyses(
        targets__pk__in=[x.pk for x in targets],
        assembly__name=assembly,
        application__name=app_name,
        status='SUCCEEDED',
        verbose=verbose,
    )
    assert (
        len(parent_analyses) >= 1
    ), f'No parent {app_name} analyses found for {targets}'
    filtered_parent_analyses = filter_candidate_analyses(
        parent_analyses, app_min_version, targets=targets, sample_level=sample_level
    )
    analysis_keys = [x.pk for x in filtered_parent_analyses]
    results = {}
    for r in results_keys:
        for analysis in filtered_parent_analyses:
            try:
                if r not in results.keys():
                    results[r] = [analysis.results[r]]
                else:
                    results[r].append(analysis.results[r])
            except KeyError:
                raise ValueError(
                    f'Requested result {r} not available from analyses {analysis}'
                )
    return analysis_keys, results

Our get_dependencies() methods call this, and each app subclass version is assigned class attributes for which parent app and which parent app versions are valid. Eg

class Trim100bpMouse(_TrimBase):
    NAME = 'Trim 100bp'
    ASSEMBLY = 'shotgun-mouse'
    SPECIES = 'MOUSE'
    TRIM_LENGTH = 100
    PARENT_APP = 'Shotgun Preprocess'
    MIN_PARENT_VERSION = '3.0.0'

where the Trimming app relies on the "Shotgun Preprocess" app with a minimum valid version of 3.0.0.

Let me know if any of this is useful for you; we'd be happy to help get all cleaned up into a PR if so!

@juanesarango
Copy link
Contributor

Hi @nick, great stuff. What I do is something very similar.

  • How do you handle it when you have for example 2 dependencies (or parent apps as you call them)?
  • Can you show me what does filter_candidate_analyses do? I would like to see the logic behind using the app_min_version.

@nickp60
Copy link
Collaborator

nickp60 commented Jun 13, 2023

Sure! Internally, we decided that major versions indicate changes to data, minor versions are bumped for results format changes, and patch versions are bumped for any other inconsequential change. Practically, this means that the major version gets bumped any time we have a consequential change to the preprocessing step, as that needs to cascade down to all child apps. So, in short, we check that the candidate parent analysis

  • matches the major version of the app (eg was run on the same preprocessed data)
  • meets or exceeds the APP_MIN_VERSION, using standard semver parsing
  • (sample-level analyses only) the parent sample identifier matches the target sample identifier
  • is the most recent (if multiple valid versions exist with the correct major version)
def valid_parent_version(parent_version, min_version):
    """Check that the major version matches exactly and that the parent app meets minimum version requirements.
    The first check ensure that apps are not run on incompatible upstream data;
    eg if preprocessing is at version 2 but a downstream app can work on versions 1 and on,
    we want prevent confusion from having both version 1 and version2 downstream results.
    """
    major_version_is_valid = (
        version.parse(parent_version).major == version.parse(min_version).major
    )
    meets_or_exceeds = version.parse(parent_version) >= version.parse(min_version)
    return major_version_is_valid and meets_or_exceeds


def filter_candidate_analyses(
    parent_analyses, app_min_version, targets, sample_level=False
):
    # filter for analyses run meeting the min version requirement
    # ensure that the major version of the downsampling app matches the
    # major version of the preprocessing app and that the preprocessing
    # app meets the minimum version of the preprocessing app
    parent_analyses[:] = [
        x
        for x in parent_analyses
        if valid_parent_version(x['application']['version'], app_min_version)
    ]
    assert len(parent_analyses) >= 1, f'No analyses met version criteria'

    if sample_level:
        parent_analyses[:] = [
            x
            for x in parent_analyses
            if get_sorted_targets_string(x.targets)
            == get_sorted_targets_string(targets)
        ]
        assert (
            len(parent_analyses) >= 1
        ), f'No sample-level analyses found matching all targets'

    # run on the most recent if multiple available, per target
    most_recent_parent_analyses = []
    for target in targets:
        candidates = []
        for pa in parent_analyses:
            if target.identifier in [x.identifier for x in pa.targets]:
                candidates.append(pa)
        assert (
            len(candidates) > 0
        ), f'no analyses from {parent_analyses} matched target identifiers: {target.identifier}. Should this sample be excluded?'

        parent_versions = [
            version.parse(x['application']['version']) for x in candidates
        ]
        mrpa = [
            x
            for x in candidates
            if version.parse(x['application']['version']) == max(parent_versions)
        ]
        assert (
            len(mrpa) == 1
        ), f'multiple valid parent versions for {target.identifier}: {mrpa}'
        # don't add duplicate analyses; this is relevant for sample-level analyses
        # where each target can have multiple parent sets
        if mrpa[0].pk not in [x.pk for x in most_recent_parent_analyses]:
            most_recent_parent_analyses.append(mrpa[0])
    return most_recent_parent_analyses

@funnell
Copy link
Collaborator Author

funnell commented Jun 13, 2023

Maybe Isabl could accept a string to use with SpecifierSet from the packaging library to allow version range specification (etc.)

SpecifierSet('>=1.0.0,!=2.0.0')

https://packaging.pypa.io/en/latest/specifiers.html#packaging.specifiers.SpecifierSet

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants