Unable to access protected datasets on PAVICS #122

Open
tlogan2000 opened this issue Mar 13, 2023 · 26 comments
@tlogan2000
Describe the bug

Recent updates to Twitcher (or possibly Magpie) break previous workflows for accessing group-protected datasets on PAVICS. Code will successfully navigate THREDDS and create xarray datasets, but will fail when actually trying to load data.

Devs who wish to debug this can let me know so I can add their PAVICS username to the appropriate group.

To Reproduce

To run directly on PAVICS, see the example notebook:
https://pavics.ouranos.ca/jupyter/hub/user-redirect/lab/tree/public/logan-public/Tests/THREDDS_Issues_March2023/Access-ESPO-G6_PAVICS_BROKEN.ipynb

from siphon.http_util import session_manager
from siphon.catalog import TDSCatalog
import requests
from requests_magpie import MagpieAuth, MagpieAuthenticationError
from pathlib import Path
import xarray as xr
from dask.distributed import Client
from dask.diagnostics import ProgressBar
import shutil

# Preliminary ESPO-G6 v1
cat_url = "https://pavics.ouranos.ca/twitcher/ows/proxy/thredds/catalog/datasets/simulations/bias_adjusted/cmip6/ouranos/ESPO-G/ESPO-G6v1.0.0/catalog.xml"

# Preliminary ESPO-R5 v1
#cat_url = "https://pavics.ouranos.ca/twitcher/ows/proxy/thredds/catalog/datasets/simulations/bias_adjusted/cmip5/ouranos/ESPO-R/ESPO-R5v1.0.0/catalog.xml"  

# the PAVICS_credentials.txt file is a simple text file with the user on the first line and the password on the second
# These are the same as for jupyterhub
with Path('/notebook_dir/writable-workspace/PAVICS_credentials.txt').open() as f:
    AUTH_USR = f.readline().strip()
    AUTH_PWD = f.readline().strip()

# Authentication object that will request and cache the authentication cookies
auth = MagpieAuth("https://pavics.ouranos.ca/magpie/", AUTH_USR, AUTH_PWD)
# A full session, within which to make OPeNDAP requests.
session = requests.Session()
session.auth = auth

# Pass the authentication manager to siphon
session_manager.set_session_options(auth=auth)

cat = TDSCatalog(cat_url)

ds = xr.open_dataset(
    cat.datasets[0].access_urls["OPENDAP"],
    chunks=dict(time=1460, lat=50, lon=50),
    decode_timedelta=False,
    session=session,
    engine="pydap",
    user_charset="utf8",
)
display(ds)

# loading data fails
with ProgressBar():
    ds.isel(time=0).tasmin.plot()

Expected behavior

Data is accessed and loaded successfully.

@fmigneault
Contributor

Is there an estimated date of when this script last ran successfully, to see whether any relevant Magpie/Twitcher version changes since then could have caused this?

My guess would be that it is not related to Magpie/Twitcher code changes themselves (there has not really been any change related to permission resolution for a while), but perhaps to the default permissions and/or components loaded in birdhouse-deploy.

@tlogan2000
Author

@fmigneault The script was working early to mid February. @tlvu Is there a history of deployment changes we can track?

@tlvu

tlvu commented Mar 13, 2023

@fmigneault The script was working early to mid February. @tlvu Is there a history of deployment changes we can track?

@tlogan2000
Yes, the Twitcher/Magpie upgrade was done in PR bird-house/birdhouse-deploy#283, which was merged to our fork in commit Ouranosinc/birdhouse-deploy@f72d36c on Feb 13, so it should have gone live on Feb 14 via nightly autodeploy.

That said, @fmigneault's concern that there could be another cause than Twitcher/Magpie is valid. To test that theory, I'll try to roll back Twitcher/Magpie to the previous version to see whether the problem goes away. If it does, then the problem is clearly caused by the new Twitcher/Magpie deployed on Feb 14.

Hopefully the downgrade is easy, since there was a DB upgrade for Magpie (see comment bird-house/birdhouse-deploy#283 (review)). In fact, is there a procedure to downgrade Magpie documented somewhere?

@tlvu

tlvu commented Mar 13, 2023

@fmigneault

As I suspected, the previous Magpie does not start with the DB already upgraded for the new Magpie:

[2023-03-13 20:46:17,015] INFO       [MainThread][alembic.runtime.migration] Will assume transactional DDL.
[2023-03-13 20:46:17,198] ERROR      [MainThread][alembic.util.messaging] Can't locate revision identified by '5e5acc33adce'
FAILED: Can't locate revision identified by '5e5acc33adce'

Identifier '5e5acc33adce' matches the upgrade in comment bird-house/birdhouse-deploy#283 (review).

Do you happen to have the procedure to downgrade Magpie?

@fmigneault
Contributor

@tlvu
You can run the downgrade manually with the Makefile from Magpie.
https://github.com/Ouranosinc/Magpie/blob/master/Makefile#L282-L288

You can probably even set 0c6269f410cd (the revision just before) manually in the DB, since that migration only adds an index to the schema and does not modify the data itself.

@tlvu

tlvu commented Mar 14, 2023

Documenting commands to pseudo downgrade Magpie DB:

+ docker exec postgres-magpie psql -U pavics magpiedb -c 'select * from alembic_version where version_num='\''5e5acc33adce'\'''
 version_num
--------------
 5e5acc33adce
(1 row)

+ docker exec postgres-magpie psql -U pavics magpiedb -c 'update alembic_version set version_num='\''0c6269f410cd'\'' where version_num='\''5e5acc33adce'\'''
UPDATE 1

@tlvu

tlvu commented Mar 14, 2023

Subsequent upgrade back to the latest Magpie version does not work:

sqlalchemy.exc.ProgrammingError: (psycopg2.errors.DuplicateTable) relation "ix_users_email_unique_case_insensitive" already exists

[SQL: CREATE UNIQUE INDEX ix_users_email_unique_case_insensitive ON users (lower(email))]

@tlvu

tlvu commented Mar 14, 2023

@fmigneault I tried to follow this, as you suggested: https://github.com/Ouranosinc/Magpie/blob/e6a7c34b97877d28dc90538d157e3a948b91b6a4/Makefile#L272-L288

With the new Magpie version running, I ran the following commands:

APP_INI=/opt/local/src/magpie/config/magpie.ini
DB_REVISION=5e5acc33adce
CONTAINER_NAME=magpie
docker exec -e MAGPIE_INI_FILE_PATH="$APP_INI" $CONTAINER_NAME alembic -c "${APP_INI}" downgrade $DB_REVISION

And it does not do anything. Is there something I missed?

If possible, could you provide the exact command to downgrade from the context of the PAVICS environment, rather than a dev environment?

@tlvu

tlvu commented Mar 15, 2023

@fmigneault

I tried to do this manually: https://github.com/Ouranosinc/Magpie/blob/e6a7c34b97877d28dc90538d157e3a948b91b6a4/magpie/alembic/versions/2022-09-01_5e5acc33adce_case_insensitive_email_constraint.py#L29-L30

docker exec postgres-magpie psql -U pavics magpiedb -c 'DROP INDEX ix_users_email_unique_case_insensitive'

It also hangs forever (I waited 2 hours).

I think the reason my previous alembic downgrade did nothing is that it also hangs.

Can you advise how to drop that index, or investigate the hang when trying to drop it? Without dropping the index, I cannot go back to the latest Magpie.

@fmigneault
Contributor

fmigneault commented Mar 15, 2023

@tlvu
I think the alembic downgrade command you used is good, but it should point at 0c6269f410cd instead of 5e5acc33adce (it should be read as "downgrade to <revision>"). Using that command, it should automatically perform the DROP INDEX you found in its Python script.

Then, using a Magpie Docker version from before the 5e5acc33adce revision was introduced will make 0c6269f410cd the "latest". That means you would have to use https://github.com/Ouranosinc/Magpie/releases/tag/3.26.0 or older, since the revision was introduced in Ouranosinc/Magpie@9c3d8ad, first tagged in 3.27.0. You cannot re-use any version that includes 5e5acc33adce, as its migration will be applied automatically on startup of the Magpie webapp.
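Putting this advice together with the command run earlier in the thread, the corrected invocation would presumably look as follows (a sketch only; the container name and ini path are reused from the earlier comment, and the command targets a live deployment):

```shell
# Sketch: same command as before, but targeting the revision *below*
# the one being undone ("downgrade to <revision>").
APP_INI=/opt/local/src/magpie/config/magpie.ini
DB_REVISION=0c6269f410cd   # revision just before 5e5acc33adce
CONTAINER_NAME=magpie

docker exec -e MAGPIE_INI_FILE_PATH="$APP_INI" "$CONTAINER_NAME" \
    alembic -c "$APP_INI" downgrade "$DB_REVISION"
```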

@tlvu

tlvu commented Mar 16, 2023

@fmigneault My problem is that the alembic downgrade (to both 5e5acc33adce and 0c6269f410cd) and the manual DROP INDEX ix_users_email_unique_case_insensitive both hang.

I thought it was a problem with my VM, but it also hangs on the staging machine.

If you happen to have a PAVICS stack running, can you try the rollback yourself? If not, it's okay, I'll try to figure this out next week. This week I am on Misha's PR.

@fmigneault
Contributor

@tlvu
The issue seems specific to the Docker environment. The command also hangs on my side when using the birdhouse Docker containers.
I'm not quite sure why: I've tried both docker exec on the running instance and a separate docker run call with everything stopped, and they all hang.

The corresponding command in a local environment applies the operations without hanging, and the changes are reflected in the database:

[screenshot: successful alembic downgrade in a local environment]

There must be something specific happening due to the Docker runtime.

@fmigneault
Contributor

@tlvu
I've dug deeper into the postgres-magpie Docker instance.
There seem to be many idle transactions not being closed properly (still investigating the cause), causing the new transaction created by the downgrade to block until one becomes available.
As a "quick fix", I force-closed them with pavics-compose up -d --force-recreate postgres-magpie and then ran the alembic downgrade command again, successfully.
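Those idle transactions can be seen directly through PostgreSQL's standard pg_stat_activity view (a sketch; the container, user, and database names are reused from the psql commands earlier in the thread):

```shell
# List connections stuck "idle in transaction"; these hold locks that
# block the DROP INDEX taken by the alembic downgrade.
docker exec postgres-magpie psql -U pavics magpiedb -c \
  "SELECT pid, state, age(now(), xact_start) AS xact_age, left(query, 60) AS last_query
     FROM pg_stat_activity
    WHERE state LIKE 'idle in transaction%'
    ORDER BY xact_start;"
```

A stuck backend could then be terminated individually with `SELECT pg_terminate_backend(<pid>);` rather than recreating the whole container.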

@fmigneault
Contributor

Relevant PR to address the issue: Ouranosinc/Magpie#580

@tlvu

tlvu commented Mar 22, 2023

As a "quick fix", I force-closed them with pavics-compose up -d --force-recreate postgres-magpie and then ran the alembic downgrade command again, successfully.

Sorry for the late reply. I was busy with Misha's PR.

Taking clues from you, a simple ./pavics-compose.sh restart postgres-magpie also allowed me to DROP INDEX ix_users_email_unique_case_insensitive.

I did this on our staging machine and rolled back to the older Twitcher/Magpie. I will try to reproduce Travis's problem with the older Twitcher/Magpie.

@tlvu

tlvu commented Mar 22, 2023

@tlogan2000 I think I found the root cause: it's the pydap engine in the xr.open_dataset call. The bad news is that without the pydap engine, it's 2 to 3 times slower.

I rolled back Magpie on our staging host Boreas and was able to reproduce the problem there too (both behind Twitcher and directly with testthredds): https://pavics.ouranos.ca/jupyter/hub/user-redirect/lab/tree/public/lvu-public/Access-ESPO-G6_PAVICS_BROKEN-with-pydap-boreas-old-magpie.ipynb

Then I removed the pydap engine and it works, both behind Twitcher and directly with testthredds: https://pavics.ouranos.ca/jupyter/hub/user-redirect/lab/tree/public/lvu-public/Access-ESPO-G6_PAVICS_WORKING-boreas-old-magpie.ipynb

Production is still on the latest Magpie. We can re-run my notebooks against production and will probably get the same result. Magpie is probably not the cause.

@tlvu

tlvu commented Mar 22, 2023

Confirmed it works on production with latest Magpie as well, as long as the engine is not pydap.

@fmigneault I noticed, however, a performance problem with Twitcher/Magpie, possibly worse with the newer version.

When hitting production (on fast SSD) behind Nginx and the newer Twitcher/Magpie 3.31.0, it took around 210 s; behind Nginx only, around 33 s.

When hitting staging (on a spinning HDD) behind Nginx and the older Twitcher/Magpie 3.26.0, it took around 113 s; behind Nginx only, around 56 s.

Each number is the average of 3 runs, to remove the effect of caching.

@tlogan2000
Author

Yes, pydap is required to access protected datasets using a session, etc. I just removed the group permission for the ESPO-G6 data this week, so I think we may still have an issue.

@tlvu

tlvu commented Mar 23, 2023

Yes, pydap is required to access protected datasets using a session, etc. I just removed the group permission for the ESPO-G6 data this week, so I think we may still have an issue.

Oh! So my test works simply because I did not need authentication, the dataset being public now. Had the dataset still been protected, we would need the pydap engine and the problem would be back?

@fmigneault
Contributor

Is it possible to point me to the specific pydap function that handles the request, or to get logs of the requests sent toward Twitcher?
(note: I can't access the notebooks)

If the behaviour differs depending on whether pydap is used, it must inject something into the request that causes a side effect, or something related to the timing/caching of multiple requests is at play.
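For the request logs: one generic way to see exactly what the client sends over the wire (a sketch using only the standard library; it assumes the HTTP stack goes through http.client, as requests/urllib3 and hence pydap's HTTP layer do):

```python
import http.client
import logging

# Turn on low-level HTTP wire logging: every request line and header
# sent through http.client gets printed, making it easy to diff what
# a pydap-driven request injects versus a plain requests.get() of the
# same OPeNDAP URL.
http.client.HTTPConnection.debuglevel = 1

# Also surface urllib3's own debug messages (connection reuse, retries).
logging.basicConfig(level=logging.DEBUG)
logging.getLogger("urllib3").setLevel(logging.DEBUG)
```

With this enabled before the `xr.open_dataset(...)` call, the notebook output would include each outgoing request's headers, which could then be compared between the pydap and non-pydap paths.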

@tlvu
Can you try with Magpie 3.30.0, to reduce the scope to look through?
One notable change in 3.31.0 is the use of twitcher>=0.8.0, with tweaks to the MagpieAdapter interface.

@tlogan2000
Author

@tlvu I updated the notebook to point to the ESPO-R5 data (still protected) and the problems are present: https://pavics.ouranos.ca/jupyter/hub/user-redirect/lab/tree/public/logan-public/Tests/THREDDS_Issues_March2023/Access-ESPO-G6_PAVICS_BROKEN.ipynb

@tlogan2000
Author

Is it possible to point me to the specific pydap function that handles the request, or to get logs of the requests sent toward Twitcher?
(note: I can't access the notebooks)

@fmigneault I think you will need a PAVICS account to reproduce the issue, even if I point you toward the specific call. I think you already have an account, but if you need to change your password etc., let me know.

@fmigneault
Contributor

@tlogan2000
I don't think I have one, and if I do, I no longer have the username/password.

@tlvu

tlvu commented May 11, 2023

Good news: with the command sudo sysctl net.ipv4.tcp_tw_reuse=1, suggested by @ldperron, https://pavics.ouranos.ca/jupyter/hub/user-redirect/lab/tree/public/logan-public/Tests/THREDDS_Issues_March2023/Access-ESPO-G6_PAVICS_BROKEN.ipynb is passing.

It's a bit less safe with this setting turned on, but according to @ldperron it should be okay, since the conditions for it to be unsafe are pretty rare.

Still, I'll run a couple of Jenkins runs on production just to be sure.

In the meantime, @tlogan2000, please confirm the fix works for your use cases, or test a bigger use case to probe the limits of this workaround.
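For reference, this is how such a setting is typically made permanent (standard sysctl mechanics; the config file name below is an arbitrary choice):

```shell
# Apply immediately (as in the comment above):
sudo sysctl net.ipv4.tcp_tw_reuse=1

# Persist across reboots by dropping a config file and reloading:
echo 'net.ipv4.tcp_tw_reuse = 1' | sudo tee /etc/sysctl.d/90-tcp-tw-reuse.conf
sudo sysctl --system   # re-reads /etc/sysctl.d/*.conf
```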

@tlvu

tlvu commented May 12, 2023

FYI @moulab88, this is the issue for which I wanted to test net.ipv4.tcp_tw_reuse=1 in production. If the results look good, I will make this change permanent. Do you have any objections?

@moulab88

You can go with the change. This setting was tested a few years ago, but we will check the connection states.
