Decide whether to return paths [for PD data] in aggregate mode #208

alyssadai · 2023-10-23T16:02:13Z

Context

Currently, when the API is deployed in "aggregate mode" (NB_RETURN_AGG=true), the only participant-level info returned from the graph in response to a query is the absolute path to the BIDS session directory on the server where the data is stored for all matching sessions. (These paths come from the Neurobagel CLI, which automatically extracts them based on the BIDS directory root that a user provides when running the CLI.) For an example of what this looks like in the query tool, see
https://neurobagel.org/query_tool/#protected-participant-level-results-in-aggregate-mode

E.g., for a clinical dataset stored on the BIC, the aggregate results would include absolute session directory paths on the BIC server for matching sessions. (For OpenNeuro data/DataLad datasets which are stored on GitHub and not on a specific server, the paths have instead been manually modified to be the absolute path from the repository root (since they need to be DataLad-gettable))

The reason we originally included these session paths even in the aggregate mode was that we wanted to provide some way for a user to be able to actually find/access the actual data files matching their query. Our reasoning was that if the query sender is not authenticated to access the data (e.g., doesn't have permission to log onto the BIC), just by seeing the paths they still wouldn't be able to get the raw data anyways.

That said, since the session paths do also contain the BIDS subject identifiers, one concern is that even these paths are exposing too much identifying info for a supposedly more secure aggregate response.

So, we need to decide:

Do we want to return any kind of paths in aggregate mode? (If not, we would essentially only return info about the # matching subjects and the dataset name)
For datasets that are only stored on a server, should the paths be absolute paths?
Should we have subject/session-level paths or only dataset-level paths?

The text was updated successfully, but these errors were encountered:

alyssadai · 2023-10-25T03:11:27Z

@surchs, @rmanaem, anything to add here?

alyssadai · 2023-10-25T17:44:25Z

@nikhil153, @michellewang, any thoughts on this issue / the specific decisions listed?

The first application of what we decide would be the PPMI dataset, since we intend to deploy the API for this dataset in aggregate mode.

michellewang · 2023-10-25T18:06:59Z

Do we want to return any kind of paths in aggregate mode? (If not, we would essentially only return info about the # matching subjects and the dataset name)

For the aggregate mode, if we're not allowed to share any subject ID to people without explicit access to the dataset, then we can't share BIDS paths at all because even at the single file level the subject ID is in the path. So the safest/most general case would be to not share any paths in aggregate mode, and users querying the graph will only know how many records fit their criteria.

However, this gets annoying for the users who do actually have access to the data, because then there's no easy way for them to find the data (they would have to e.g. do their own PyBIDS thing). Maybe this can be solved in the future with an authentication layer in Neurobagel?

For datasets that are only stored on a server, should the paths be absolute paths?

I don't have a very strong personal opinion on this. The only thing to consider would be the "risk" of sharing absolute paths openly on the web. If we instead use paths relative to the BIDS directory root then there would be an added step before users can find the data on the server (will need to ask someone where the BIDS directory is) but that should not be difficult, especially if they already have access to the server (and if not then they will need to talk to people anyway to get access to the server). So maybe go with relative paths for more "security".

Should we have subject/session-level paths or only dataset-level paths?

I would say dataset-level paths? Not a strong preference but I am not sure I can think of a case where a subject/session-level path would be more useful than a dataset-level path.

surchs · 2023-10-25T18:10:27Z

However, this gets annoying for the users who do actually have access to the data, because then there's no easy way for them to find the data (they would have to e.g. do their own PyBIDS thing). Maybe this can be solved in the future with an authentication layer in Neurobagel?

Agreed on this being annoying. I think mid-term we'll address it with authentication/authorization as you say. But one thing I discussed with Nikhil for our specific case is that we could provide some kind of hashed/encrypted result that a user can decrypt on the BIC server to obtain the correct paths - since we control both the graph and the server this could even just be a little decrypter script that lives in the /data/pd path on the BIC, so only people with read permission to this location can use it.

michellewang · 2023-10-25T18:15:10Z

we could provide some kind of hashed/encrypted result that a user can decrypt on the BIC server to obtain the correct paths

I like this idea!

alyssadai · 2023-10-27T17:46:01Z

Thanks a lot for your input @michellewang! It looks like for now the most straightforward solution would be to simply remove the session path values from the aggregate results, so we'll go ahead with implementing that in #210.

In future, we will investigate if we can provide the session paths as encrypted strings and store a decryption script somewhere on the BIC, as suggested.

Will close this discussion issue if there are no other comments 🙂

alyssadai mentioned this issue Oct 23, 2023

MNI PD data part of public federation neurobagel/planning#36

Closed

13 tasks

alyssadai added feat:improve flag:discuss Flag issue that needs to be discussed before it can be implemented. labels Oct 23, 2023

alyssadai changed the title ~~Update the API so that it does not return paths for PD data in aggregate mode~~ Decide whether to update the API so that it does not return paths for PD data in aggregate mode Oct 23, 2023

alyssadai transferred this issue from neurobagel/planning Oct 25, 2023

alyssadai changed the title ~~Decide whether to update the API so that it does not return paths for PD data in aggregate mode~~ Decide whether to return paths [for PD data] in aggregate mode Oct 25, 2023

surchs moved this from Specify - Active to Specify - Done in Neurobagel Oct 27, 2023

alyssadai mentioned this issue Oct 27, 2023

Remove session path values from aggregate results #210

Closed

alyssadai closed this as completed Oct 27, 2023

github-project-automation bot moved this from Specify - Done to Review - Done in Neurobagel Oct 27, 2023

alyssadai moved this from Review - Done to Implement - Done in Neurobagel Oct 27, 2023

surchs moved this from Implement - Done to Review - Done in Neurobagel Oct 27, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Decide whether to return paths [for PD data] in aggregate mode #208

Decide whether to return paths [for PD data] in aggregate mode #208

alyssadai commented Oct 23, 2023 •

edited

Loading

alyssadai commented Oct 25, 2023

alyssadai commented Oct 25, 2023

michellewang commented Oct 25, 2023

surchs commented Oct 25, 2023

michellewang commented Oct 25, 2023

alyssadai commented Oct 27, 2023

Decide whether to return paths [for PD data] in aggregate mode #208

Decide whether to return paths [for PD data] in aggregate mode #208

Comments

alyssadai commented Oct 23, 2023 • edited Loading

Context

So, we need to decide:

alyssadai commented Oct 25, 2023

alyssadai commented Oct 25, 2023

michellewang commented Oct 25, 2023

surchs commented Oct 25, 2023

michellewang commented Oct 25, 2023

alyssadai commented Oct 27, 2023

alyssadai commented Oct 23, 2023 •

edited

Loading