Support data object is member of a collection #394

Open
perolavsvendsen opened this issue Nov 15, 2023 · 12 comments
Labels: Data definitions, enhancement

Comments

@perolavsvendsen
Member

perolavsvendsen commented Nov 15, 2023

Discussions 13 November 2023
@jcrivenaes @daniel-sol @HansKallekleiv

In some cases, exported data objects from FMU are related to each other. The specific example discussed revolves around volume calculations, where the same data are represented as a table, a 3D grid parameter and a surface.

A need from Webviz is to programmatically link these together. Conceptually, they are all "siblings". This has some parallels to the existing aggregation_id used for aggregated data, where multiple results from the same operation are assigned a common aggregation_id so that they can be linked together.

Elements discussed:

  • Should we simply add collection_id as an argument to ExportData.__init__(), and then forward this to the outgoing metadata? That means each object from this operation is tagged with the same ID. So, when a client is using one of the siblings, it will 1) know that there are other siblings and 2) be able to identify them from the collection_id.
  • The name collection_id is perhaps not great.
  • From the data perspective, this may be more like member_of.
  • A data object can be member of more than one collection/family.
  • We want it to be unique per case, but we want elements exported from multiple realizations to belong to the same collection. The proposed solution is to pass e.g. collection_tagname (placeholder name) as an argument to ExportData(), and to generate a uuid4 as a hash of the current case.uuid and collection_tagname. This ensures that the collection_id (placeholder name) is identical across all realizations, but unique per case.
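
A minimal sketch of the ID derivation described in the last bullet. Note that a uuid4 is random by definition, so this sketch assumes a name-based UUID (uuid.uuid5 from the standard library) as a stand-in for "uuid4 as a hash". The function name collection_id and the case UUIDs are made up for illustration and are not part of fmu-dataio.

```python
import uuid


def collection_id(case_uuid: str, collection_name: str) -> str:
    """Derive a stable collection ID from a case UUID and a collection name.

    Assumption: uuid5 (namespace + name, SHA-1 based) stands in for the
    "uuid4 as a hash" idea, since uuid4 itself cannot be reproduced.
    """
    namespace = uuid.UUID(case_uuid)
    return str(uuid.uuid5(namespace, collection_name))


# Made-up case UUIDs, for illustration only.
case_a = "b77e5c35-524b-43d7-9356-aa2ef2e382c3"
case_b = "d390eb27-f07f-4f9b-bf97-2b61a773753c"

# Two realizations in the same case compute the same ID independently.
id_real_0 = collection_id(case_a, "inplace_volumes")
id_real_1 = collection_id(case_a, "inplace_volumes")

# The same name in another case gives a different ID.
id_other_case = collection_id(case_b, "inplace_volumes")
```

Because the ID depends only on the case UUID and the name, no coordination between realizations is needed.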

(Also discussions on gathering all Volume-related exports into one single class, to make essentially a "one-liner". This proposal would be a first step towards this, by doing a very small iteration in the right direction - which includes some tag in the outgoing metadata that can be used. @daniel-sol can you describe in separate issue?)

Proposal

  • Add input argument collection_tagname to ExportData()
  • Create uuid4 as a hash of current case.uuid and collection_tagname
  • Append to data.member_of (placeholder name) in outgoing metadata.

In metadata definitions, add the necessary tag. Placeholder name: member_of.

Example:

data:
  name: MyName
  member_of:
    - b77e5c35-524b-43d7-9356-aa2ef2e382c3
    - d390eb27-f07f-4f9b-bf97-2b61a773753c

Example script:

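from fmu.dataio import ExportData

# collection_tagname is the proposed argument; it does not exist yet
exp = ExportData(collection_tagname="inplace_volumes")

# AllTheElements: the related objects (table, 3D grid parameter, surface)
for elem in AllTheElements:
    exp.export(elem)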

In this proposal, the collection tagname is not directly represented in the outgoing metadata. Its only purpose is to be the basis for the hash. We could include it, but the risk is that clients will then use it directly.

Please add discussion details I have forgotten.

@perolavsvendsen
Member Author

Suggest adding a relations block in the metadata, next to data - not adding relational information inside the data block.

Example:

$schema: <schema_url>
data:
  x: y
  z: 1
relations:
  member_of:
    - <uuid4>
    - <uuid4>

The upside of using a list is that it captures the case where an object is a member of more than one collection. The downside is searchability, but it is probably possible to work around that.
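
To illustrate how a client could still find siblings when member_of is a list, here is a rough sketch that groups metadata documents by collection ID. The documents are plain dicts following the relations.member_of layout suggested above. The function name, object names and the short IDs "c1" and "c2" are placeholders.

```python
from collections import defaultdict


def group_by_collection(metadata_docs):
    """Map each collection ID to the names of the objects that list it."""
    groups = defaultdict(list)
    for doc in metadata_docs:
        # Objects without a relations block simply belong to no collection.
        for cid in doc.get("relations", {}).get("member_of", []):
            groups[cid].append(doc["data"]["name"])
    return dict(groups)


docs = [
    {"data": {"name": "volumes_table"}, "relations": {"member_of": ["c1"]}},
    {"data": {"name": "volumes_grid"}, "relations": {"member_of": ["c1", "c2"]}},
    {"data": {"name": "unrelated_surface"}},
]

groups = group_by_collection(docs)
```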

@perolavsvendsen perolavsvendsen self-assigned this Nov 18, 2023
@perolavsvendsen
Member Author

Since a data object can be a member of more than one collection, the input argument must support more than one collection_name.

Will use collection_name, not collection_tagname, to avoid confusion with the other "tagname".
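
One way the argument could accept either a single name or several is to normalize the input up front. This is only a sketch. The helper normalize_collection_names is hypothetical and not part of fmu-dataio.

```python
def normalize_collection_names(collection_name):
    """Return a de-duplicated list of names from None, a string, or an iterable."""
    if collection_name is None:
        return []
    if isinstance(collection_name, str):
        candidates = [collection_name]
    else:
        candidates = list(collection_name)
    for name in candidates:
        if not isinstance(name, str) or not name.strip():
            raise ValueError(f"Invalid collection name: {name!r}")
    # dict.fromkeys keeps first-seen order while dropping duplicates.
    return list(dict.fromkeys(candidates))


single = normalize_collection_names("inplace_volumes")
several = normalize_collection_names(["inplace_volumes", "seismic", "inplace_volumes"])
```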

@perolavsvendsen
Member Author

Draft PR for discussions: #396

There are some unanswered questions still.

@perolavsvendsen perolavsvendsen added the enhancement and Data definitions labels Nov 18, 2023
@perolavsvendsen perolavsvendsen linked a pull request Nov 18, 2023 that will close this issue
@perolavsvendsen
Member Author

Possibly confusing that the collection_name disappears. Should keep this (in addition to uuid).

@perolavsvendsen
Member Author

Instead of specifying collection_name we could consider simply using the instance of ExportData() as the identifier of the collection. Upside is invisibility, no API change. Downside is invisibility (user is not able to find "his" collection easily). No name.

Also, ExportData instances have no name/identifier.

@jcrivenaes
Collaborator

Instead of specifying collection_name we could consider simply using the instance of ExportData() as the identifier of the collection. Upside is invisibility, no API change. Downside is invisibility (user is not able to find "his" collection easily). No name.

That is an interesting idea, but it requires some user awareness: (1) You need to export everything needed for your "collection" in the same script using the same instance, and (2) there is a risk of the opposite: people export things that are not intended as a collection by using the same instance. So they get a number of "unaware" collections...

For (1), the instance settings (properties) also need to be updated in the export() jobs, since the exported objects will in this case be a mix of e.g. surfaces and tables.

@perolavsvendsen
Member Author

Yes, there are some snags. If combined with a "best practice" description of how we intend fmu-dataio to be used (#395), it could be part of the contract that everything exported within the same instance of ExportData implicitly is a collection. But I also see that this will quickly break down, and some will probably want to export parts of a "collection" from different scripts (hence different instances).

So I don't think we should pursue this...

@perolavsvendsen
Member Author

Possibly confusing that the collection_name disappears. Should keep this (in addition to uuid).

Updated draft PR #396 to produce relations.collections as a list of objects, where each "collection" is a dict containing uuid and name:

relations:
  collections:
    - name: "mycollection" # as written by user in input arguments
      uuid: <uuid4> # hash of name + case.uuid, so identical only within context of case
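
A rough sketch of how such a relations block could be assembled from the user-supplied names. As an assumption, the uuid is derived with uuid.uuid5 using the case UUID as namespace, since a plain uuid4 is random and cannot be reproduced across realizations. build_relations and the case UUID are hypothetical.

```python
import uuid


def build_relations(case_uuid, collection_names):
    """Build a relations block with one {name, uuid} entry per collection."""
    namespace = uuid.UUID(case_uuid)
    return {
        "collections": [
            # The name is kept as written by the user; the uuid is only
            # identical within the context of one case.
            {"name": name, "uuid": str(uuid.uuid5(namespace, name))}
            for name in collection_names
        ]
    }


case_uuid = "b77e5c35-524b-43d7-9356-aa2ef2e382c3"  # made-up case UUID
relations = build_relations(case_uuid, ["mycollection", "inplace_volumes"])
```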

@jcrivenaes
Collaborator

I wonder if the wanted collection names should be defined at the "case" level (i.e. they must exist in the global config or similar). Reasons:

  • The client (e.g. Webviz) can then easily parse all possible collections early and add logic for further work
  • Avoid that typos in collection names in "various scripts around" become a root of confusion (e.g. one script uses the name "volumetrics" while the other uses "Volumentrics", and the user is unaware)
  • Avoid too many collections introduced as "nice to have"
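
As a sketch of the typo protection this would give: if the allowed names are read from the global config (represented here by a plain list, since the config layout is not decided), an export could refuse unknown names and suggest the closest match. check_collection_name is a hypothetical helper. difflib is from the standard library.

```python
import difflib


def check_collection_name(name, allowed):
    """Return the name if it is pre-defined, otherwise raise with a hint."""
    if name in allowed:
        return name
    suggestions = difflib.get_close_matches(name, allowed, n=1, cutoff=0.6)
    hint = f" Did you mean '{suggestions[0]}'?" if suggestions else ""
    raise ValueError(f"Unknown collection '{name}'.{hint}")


# Stand-in for collection names defined in the global config.
allowed_collections = ["volumetrics", "hc_thickness"]

ok = check_collection_name("volumetrics", allowed_collections)

try:
    check_collection_name("Volumentrics", allowed_collections)
    message = ""
except ValueError as err:
    message = str(err)
```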

@perolavsvendsen
Member Author

I wonder if the wanted collection names should be defined at the "case" level (i.e. they must exist in the global config or similar)...

Technically on model level, which makes sense. It has been a stated requirement that a "collection" must be unique to the case (hence it is hashed together with the case.uuid). But it could be that it sometimes must be unique also on iteration (ensemble) level, and possibly on realization level.

So, given these "levels" in the "hierarchy":
fmu (all FMU models)
model (this FMU model) <-- Should be defined here?
case (this specific case of this model)
ensemble (this specific ensemble, in this case)
realization (this specific realization of this iteration)

...we have assumed that it must be unique on case level, but that would still require clients to specify which iteration. So perhaps this should rather be unique on model level. Clients would then have to specify the case, but that is a pretty common situation to be in?

Key question here is perhaps "what is the level of uniqueness needed for collections"? Do we need the unique id's at all, or is it enough with a string identifier/name?
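
To make the trade-off between levels concrete, here is a small sketch where the level of uniqueness is a parameter. The ID is derived from every level down to the chosen one, plus the collection name. The context keys and values are invented, and uuid.NAMESPACE_URL is an arbitrary choice of namespace for the sketch.

```python
import uuid

# Levels below "fmu" in the hierarchy, from widest to narrowest.
LEVELS = ["model", "case", "ensemble", "realization"]


def scoped_collection_id(name, context, level):
    """Derive an ID that is unique at the given level and shared below it."""
    depth = LEVELS.index(level) + 1
    parts = [str(context[lvl]) for lvl in LEVELS[:depth]]
    key = "/".join(parts + [name])
    return str(uuid.uuid5(uuid.NAMESPACE_URL, key))


# Two cases of the same model (all identifiers are made up).
ctx_a = {"model": "mymodel", "case": "case-a", "ensemble": "iter-0", "realization": 0}
ctx_b = dict(ctx_a, case="case-b")

# Unique on model level: both cases share the ID, so a client must also filter on case.
model_id_a = scoped_collection_id("inplace_volumes", ctx_a, "model")
model_id_b = scoped_collection_id("inplace_volumes", ctx_b, "model")

# Unique on case level: the IDs differ between the two cases.
case_id_a = scoped_collection_id("inplace_volumes", ctx_a, "case")
case_id_b = scoped_collection_id("inplace_volumes", ctx_b, "case")
```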

Ref defining globally: So when exporting a data object tied to a collection, it must be verified that this collection is defined (in global config)? Not 100 % sure about the user experience, but that can be worked on.

@perolavsvendsen
Member Author

Discussion 23.11.2023

  • Current user story is related to inplace volumes - what other collections are there?
    • HC thickness maps from simulator

Important that tagged collections can be grouped through Elasticsearch
Is tagname essentially a "collection"?

@perolavsvendsen
Member Author

From: #396 (comment) @daniel-sol

I have problems understanding how we are actually going to solve the multicollection part. In theory I think the idea is good, but in practice I cannot really see how this will be done. Imagine that you have a specific grid property. This can be part of several collections:

  • As part of the export of a 3D grid and related properties
  • As part of properties used as input to inplace calculations
  • As part of input to seismic forward modelling
    Without thinking this through properly, one would set up these exports as three different scripts utilizing ExportData. Should we then allow for checking that the object is already exported, and then just write to the connected metadata file? Or do we imagine people thinking this through upfront, and then adding all these x collection names in one script? I foresee that this will not scale all that well...

The alternative is that one exports this grid property three times with different collection names, or something similar.

Yes, I agree. This is conceptually hard to see a smooth solution to, given current constraints etc.

Multiple pointers to the same data object would probably create a HUGE bookkeeping overhead (gut feeling). Multiple uploads of the same object is an option, but it doesn't feel good.
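
For reference, the "object is already exported, just write to the connected metadata" option could look roughly like the sketch below, which adds collections to an existing metadata dict without duplicating entries. It assumes the relations.collections layout from the draft PR. add_to_collections is a hypothetical helper, and reading and writing the metadata file is left out.

```python
def add_to_collections(metadata, new_collections):
    """Append collections to metadata["relations"]["collections"], skipping known uuids."""
    relations = metadata.setdefault("relations", {})
    existing = relations.setdefault("collections", [])
    known = {entry["uuid"] for entry in existing}
    for entry in new_collections:
        if entry["uuid"] not in known:
            existing.append(dict(entry))
            known.add(entry["uuid"])
    return metadata


# Metadata as written by a first export script (placeholder uuids).
metadata = {
    "data": {"name": "MyName"},
    "relations": {"collections": [{"name": "grid_export", "uuid": "id-1"}]},
}

# A second script exports the same object as part of other collections.
second_script = [
    {"name": "grid_export", "uuid": "id-1"},
    {"name": "inplace_input", "uuid": "id-2"},
]
add_to_collections(metadata, second_script)
names = [entry["name"] for entry in metadata["relations"]["collections"]]
```

This keeps a single copy of the object, but every script would still need to locate and rewrite the existing metadata, which is the bookkeeping concern raised above.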

A possible option is that collections are pre-defined, and that export scripts point to one or more specific collections when exporting. In the metadata, they are then explicitly listed. But this does not solve how this is to be used by a client. Then the client must also know the context of each collection, etc.

The fundamental need here was to be able to link objects together, e.g. "as a consumer of FMU results, I would like to be able to find connected data, so that I can co-visualize connected data in multiple contexts." or something like that. The collections-concept may not be the best way to handle it. The current solution is using name which is not a good solution either.

2 participants