Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Decide whether to return paths [for PD data] in aggregate mode #208

Closed
Tracked by #36
alyssadai opened this issue Oct 23, 2023 · 6 comments
Closed
Tracked by #36

Decide whether to return paths [for PD data] in aggregate mode #208

alyssadai opened this issue Oct 23, 2023 · 6 comments
Labels
flag:discuss Flag issue that needs to be discussed before it can be implemented.

Comments

@alyssadai
Copy link
Contributor

alyssadai commented Oct 23, 2023

Context

Currently, when the API is deployed in "aggregate mode" (NB_RETURN_AGG=true), the only participant-level info returned from the graph in response to a query is the absolute path to the BIDS session directory on the server where the data is stored for all matching sessions. (These paths come from the Neurobagel CLI, which automatically extracts them based on the BIDS directory root that a user provides when running the CLI.) For an example of what this looks like in the query tool, see
https://neurobagel.org/query_tool/#protected-participant-level-results-in-aggregate-mode

E.g., for a clinical dataset stored on the BIC, the aggregate results would include absolute session directory paths on the BIC server for matching sessions. (For OpenNeuro data/DataLad datasets which are stored on GitHub and not on a specific server, the paths have instead been manually modified to be the absolute path from the repository root (since they need to be DataLad-gettable))

The reason we originally included these session paths even in the aggregate mode was that we wanted to provide some way for a user to be able to actually find/access the actual data files matching their query. Our reasoning was that if the query sender is not authenticated to access the data (e.g., doesn't have permission to log onto the BIC), just by seeing the paths they still wouldn't be able to get the raw data anyways.

That said, since the session paths do also contain the BIDS subject identifiers, one concern is that even these paths are exposing too much identifying info for a supposedly more secure aggregate response.

So, we need to decide:

  • Do we want to return any kind of paths in aggregate mode? (If not, we would essentially only return info about the # matching subjects and the dataset name)
  • For datasets that are only stored on a server, should the paths be absolute paths?
  • Should we have subject/session-level paths or only dataset-level paths?
@alyssadai alyssadai added feat:improve flag:discuss Flag issue that needs to be discussed before it can be implemented. labels Oct 23, 2023
@alyssadai alyssadai changed the title Update the API so that it does not return paths for PD data in aggregate mode Decide whether to update the API so that it does not return paths for PD data in aggregate mode Oct 23, 2023
@alyssadai alyssadai transferred this issue from neurobagel/planning Oct 25, 2023
@alyssadai alyssadai changed the title Decide whether to update the API so that it does not return paths for PD data in aggregate mode Decide whether to return paths [for PD data] in aggregate mode Oct 25, 2023
@alyssadai
Copy link
Contributor Author

@surchs, @rmanaem, anything to add here?

@alyssadai
Copy link
Contributor Author

@nikhil153, @michellewang, any thoughts on this issue / the specific decisions listed?

The first application of what we decide would be the PPMI dataset, since we intend to deploy the API for this dataset in aggregate mode.

@michellewang
Copy link

Do we want to return any kind of paths in aggregate mode? (If not, we would essentially only return info about the # matching subjects and the dataset name)

For the aggregate mode, if we're not allowed to share any subject ID to people without explicit access to the dataset, then we can't share BIDS paths at all because even at the single file level the subject ID is in the path. So the safest/most general case would be to not share any paths in aggregate mode, and users querying the graph will only know how many records fit their criteria.

However, this gets annoying for the users who do actually have access to the data, because then there's no easy way for them to find the data (they would have to e.g. do their own PyBIDS thing). Maybe this can be solved in the future with an authentication layer in Neurobagel?

For datasets that are only stored on a server, should the paths be absolute paths?

I don't have a very strong personal opinion on this. The only thing to consider would be the "risk" of sharing absolute paths openly on the web. If we instead use paths relative to the BIDS directory root then there would be an added step before users can find the data on the server (will need to ask someone where the BIDS directory is) but that should not be difficult, especially if they already have access to the server (and if not then they will need to talk to people anyway to get access to the server). So maybe go with relative paths for more "security".

Should we have subject/session-level paths or only dataset-level paths?

I would say dataset-level paths? Not a strong preference but I am not sure I can think of a case where a subject/session-level path would be more useful than a dataset-level path.

@surchs
Copy link
Contributor

surchs commented Oct 25, 2023

However, this gets annoying for the users who do actually have access to the data, because then there's no easy way for them to find the data (they would have to e.g. do their own PyBIDS thing). Maybe this can be solved in the future with an authentication layer in Neurobagel?

Agreed on this being annoying. I think mid-term we'll address it with authentication/authorization as you say. But one thing I discussed with Nikhil for our specific case is that we could provide some kind of hashed/encrypted result that a user can decrypt on the BIC server to obtain the correct paths - since we control both the graph and the server this could even just be a little decrypter script that lives in the /data/pd path on the BIC, so only people with read permission to this location can use it.

@michellewang
Copy link

we could provide some kind of hashed/encrypted result that a user can decrypt on the BIC server to obtain the correct paths

I like this idea!

@surchs surchs moved this from Specify - Active to Specify - Done in Neurobagel Oct 27, 2023
@alyssadai
Copy link
Contributor Author

Thanks a lot for your input @michellewang! It looks like for now the most straightforward solution would be to simply remove the session path values from the aggregate results, so we'll go ahead with implementing that in #210.

In future, we will investigate if we can provide the session paths as encrypted strings and store a decryption script somewhere on the BIC, as suggested.

Will close this discussion issue if there are no other comments 🙂

@github-project-automation github-project-automation bot moved this from Specify - Done to Review - Done in Neurobagel Oct 27, 2023
@alyssadai alyssadai moved this from Review - Done to Implement - Done in Neurobagel Oct 27, 2023
@surchs surchs moved this from Implement - Done to Review - Done in Neurobagel Oct 27, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
flag:discuss Flag issue that needs to be discussed before it can be implemented.
Projects
Archived in project
Development

No branches or pull requests

3 participants