-
Notifications
You must be signed in to change notification settings - Fork 4
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Decide whether to return paths [for PD data] in aggregate mode #208
Comments
@nikhil153, @michellewang, any thoughts on this issue / the specific decisions listed? The first application of what we decide would be the PPMI dataset, since we intend to deploy the API for this dataset in aggregate mode. |
For the aggregate mode, if we're not allowed to share any subject ID to people without explicit access to the dataset, then we can't share BIDS paths at all because even at the single file level the subject ID is in the path. So the safest/most general case would be to not share any paths in aggregate mode, and users querying the graph will only know how many records fit their criteria. However, this gets annoying for the users who do actually have access to the data, because then there's no easy way for them to find the data (they would have to e.g. do their own PyBIDS thing). Maybe this can be solved in the future with an authentication layer in Neurobagel?
I don't have a very strong personal opinion on this. The only thing to consider would be the "risk" of sharing absolute paths openly on the web. If we instead use paths relative to the BIDS directory root then there would be an added step before users can find the data on the server (will need to ask someone where the BIDS directory is) but that should not be difficult, especially if they already have access to the server (and if not then they will need to talk to people anyway to get access to the server). So maybe go with relative paths for more "security".
I would say dataset-level paths? Not a strong preference but I am not sure I can think of a case where a subject/session-level path would be more useful than a dataset-level path. |
Agreed on this being annoying. I think mid-term we'll address it with authentication/authorization as you say. But one thing I discussed with Nikhil for our specific case is that we could provide some kind of hashed/encrypted result that a user can decrypt on the BIC server to obtain the correct paths - since we control both the graph and the server this could even just be a little decrypter script that lives in the /data/pd path on the BIC, so only people with read permission to this location can use it. |
I like this idea! |
Thanks a lot for your input @michellewang! It looks like for now the most straightforward solution would be to simply remove the session path values from the aggregate results, so we'll go ahead with implementing that in #210. In future, we will investigate if we can provide the session paths as encrypted strings and store a decryption script somewhere on the BIC, as suggested. Will close this discussion issue if there are no other comments 🙂 |
Context
Currently, when the API is deployed in "aggregate mode" (
NB_RETURN_AGG=true
), the only participant-level info returned from the graph in response to a query is the absolute path to the BIDS session directory on the server where the data is stored for all matching sessions. (These paths come from the Neurobagel CLI, which automatically extracts them based on the BIDS directory root that a user provides when running the CLI.) For an example of what this looks like in the query tool, seehttps://neurobagel.org/query_tool/#protected-participant-level-results-in-aggregate-mode
E.g., for a clinical dataset stored on the BIC, the aggregate results would include absolute session directory paths on the BIC server for matching sessions. (For OpenNeuro data/DataLad datasets which are stored on GitHub and not on a specific server, the paths have instead been manually modified to be the absolute path from the repository root (since they need to be DataLad-gettable))
The reason we originally included these session paths even in the aggregate mode was that we wanted to provide some way for a user to be able to actually find/access the actual data files matching their query. Our reasoning was that if the query sender is not authenticated to access the data (e.g., doesn't have permission to log onto the BIC), just by seeing the paths they still wouldn't be able to get the raw data anyways.
That said, since the session paths do also contain the BIDS subject identifiers, one concern is that even these paths are exposing too much identifying info for a supposedly more secure aggregate response.
So, we need to decide:
The text was updated successfully, but these errors were encountered: