Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Release/drs 1.5.0 #408

Open
wants to merge 9 commits into
base: master
Choose a base branch
from
Open

Release/drs 1.5.0 #408

wants to merge 9 commits into from

Conversation

briandoconnor
Copy link
Contributor

@briandoconnor briandoconnor commented Sep 13, 2024

This pull request compares the primary branch (master) with the release/drs-1.5.0 branch.

DRS 1.5.0 has several small but useful changes that set us up for more significant changes in 1.6 and beyond. These include:

  • a mechanism to declare DRS objects as available or not (for example, in cold storage)
  • recommended best practices for linking Data Connect and DRS, for querying metadata related to DRS objects
  • addition of cloud metadata on DRS objects
  • high-level statistics in /service-info, specifically total object counts and total object size in bytes

Documentation for 1.5 can be found here and the schema can be examined here.

Please feel free to use this PR to comment on DRS 1.5.0 and, for GA4GH Driver Projects, we ask you to use this PR to vote on approval for DRS 1.5. The voting process is simple: add a comment with 1) your driver project name, 2) +1 (approved), 0 (neutral), or -1 (concerns blocking approval), and 3) any comments (required if you vote "-1", the cloud lead and DRS champions will reach out to discuss).

We would like to have voting by driver projects done by Oct 1st, 2024 with a followup discussion and merge (if no "-1" votes are received) by the following Cloud WS meeting.

briandoconnor and others added 9 commits November 26, 2023 22:17
DRS 1.4.0 adds bulk operations, this update includes a syntax fix in the schema
* adding stats endpoint

* stats endpoint

* move stats into status_info

* adding the two new variables to the service-info response (objectCloud and totalObjectSize).

---------

Co-authored-by: Brian O'Connor <[email protected]>
@@ -3,7 +3,6 @@ description: The object that contains the DRS object IDs array
properties:
bulk_object_ids:
type: array
description: DRS object IDs
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor question: why was this description deleted?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There was a duplicate "description" entry here which caused a silent build failure for our documentation (took a long time for me to find that!!)

@dglazer
Copy link
Member

dglazer commented Sep 17, 2024

All of Us +1 on this PR -- nice list of small incremental enhancements, with no downside

@briandoconnor
Copy link
Contributor Author

briandoconnor commented Sep 18, 2024

Thanks @dglazer !! Can you edit your comment to indicate the driver project(s) you're voting for? Assuming All of Us but just wanted to make it clear as we tally the votes.

Copy link

@CRMacPherson CRMacPherson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Roche imCORE +1

description: >-
Name of the cloud service provider that the object belongs to.
If the cloud service is Amazon Web Services, Google Cloud Platform or Azure the values should be `aws`, `gcp`, or `azure` respectively.
example: aws, gcp, or azure

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure how pedantic you want to be but this is more a repeat of the description and not a valid example. Maybe aws to keep it consistent with your region example?


With DRS objects it may be necessary to attach additional metadata to your objects. We believe that a change to the API of DRS to include metadata is not in the spirit of the DRS spec and in general DRS should have no knowledge of the metadata associated with the objects. We have found that a good GA4GH standard to support this is Data Connect (https://github.com/ga4gh-discovery/data-connect). The general approach would be to have a Data Connect service on your platform and to include "tables" with the ID matching your DRS ID for the same object. This means that if you have metadata associated with an object id `abcd` (ex. additional information about Compound Objects) all you need to do is request the information from the Data Connect client at `/tables/abcd/info`. There are optional functionalities of Data Connect, such as querying of tables, but we do not explore them or give any recommendations here.

Here is an example of using Data Connect with DRS in the fasp-scripts repository (https://github.com/ga4gh/fasp-scripts/blob/master/notebooks/drs/DRS%20File%20Data.ipynb). In this notebook we can see that data connect is used to get DRS IDs from a platform. Those DRS IDs are then used to gather aditional information about the file that might be necessary for analysis. This is just one example of how DRS and Data Connect can interact with each other to gather information about data on a platform.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use "Data Connect" here "... can see that data connect is used to ...".

Copy link

@davidlougheed davidlougheed left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Vote: -1

I'm part of the PCGL driver project. In its current state, I am voting against merging this version of the DRS spec, for the reasons outlined below.

@@ -6,6 +6,12 @@ properties:
maxBulkRequestLength:
type: integer
description: The max length the bullk request endpoints can handle (>= 1) before generating a 413 error e.g. how long can the arrays bulk_object_ids and bulk_object_access_ids be for this server.
objectCount:
Copy link

@davidlougheed davidlougheed Oct 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in the documentation, these stats show as top-level in the "response sample", vs in a "stats" object in the example

}
...
"stats": {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this does not match what the response sample indicates

additionally, I would prefer some consistency with projects like refget and htsget, where the service-specific additions to the service-info response are nested in an object named after the service artifact

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey David, could you point me to the specifics for refget and htsget. I would love to understand how they are written so we could incorporate those into DRS

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for pointing this out @davidlougheed , I think it's a good idea to be consistent here so Michael or I will take a look and try to address your feedback ASAP.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For refget, see http://samtools.github.io/hts-specs/refget.html - specifically GET /sequence/service-info and the Example JSON request and response section below it. They put myriad service-specific service info properties in a refget sub-object, which I like as a pattern since it makes it clear it's an extension of the service-info spec.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although I don't feel as strongly about how the stats are nested in the response as much as I feel that they should be nested, in one form or another.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants