Unable to access protected datasets on PAVICS #122

Open
tlogan2000 opened this issue Mar 13, 2023 · 26 comments
@tlogan2000
Describe the bug

Recent updates to Twitcher (or possibly Magpie) break previous workflows for accessing group-protected datasets on PAVICS. Code will successfully navigate THREDDS and create xarray datasets, but will fail when actually trying to load data.

Devs who wish to debug this can let me know so I can add their PAVICS username to the appropriate group.

To Reproduce

To run directly on PAVICS, see the example notebook:
https://pavics.ouranos.ca/jupyter/hub/user-redirect/lab/tree/public/logan-public/Tests/THREDDS_Issues_March2023/Access-ESPO-G6_PAVICS_BROKEN.ipynb

from siphon.http_util import session_manager
from siphon.catalog import TDSCatalog
import requests
from requests_magpie import MagpieAuth, MagpieAuthenticationError
from pathlib import Path
import xarray as xr
from dask.distributed import Client
from dask.diagnostics import ProgressBar
import shutil

# Preliminary ESPO-G6 v1
cat_url = "https://pavics.ouranos.ca/twitcher/ows/proxy/thredds/catalog/datasets/simulations/bias_adjusted/cmip6/ouranos/ESPO-G/ESPO-G6v1.0.0/catalog.xml"

# Preliminary ESPO-R5 v1
#cat_url = "https://pavics.ouranos.ca/twitcher/ows/proxy/thredds/catalog/datasets/simulations/bias_adjusted/cmip5/ouranos/ESPO-R/ESPO-R5v1.0.0/catalog.xml"  

# the PAVICS_credentials.txt file is a simple text file with the user on the first line and the password on the second
# These are the same as for jupyterhub
with Path('/notebook_dir/writable-workspace/PAVICS_credentials.txt').open() as f:
    AUTH_USR = f.readline().strip()
    AUTH_PWD = f.readline().strip()

# Authentication object that will request and cache the authentication cookies
auth = MagpieAuth("https://pavics.ouranos.ca/magpie/", AUTH_USR, AUTH_PWD)
# A full session, within which to make OPeNDAP requests.
session = requests.Session()
session.auth = auth

# Pass the authentication manager to siphon
session_manager.set_session_options(auth=auth)

cat = TDSCatalog(cat_url)

ds = xr.open_dataset(
    cat.datasets[0].access_urls["OPENDAP"],
    chunks=dict(time=1460, lat=50, lon=50),
    decode_timedelta=False,
    session=session,
    engine="pydap",
    user_charset="utf8",
)
display(ds)

# loading data fails
with ProgressBar():
    ds.isel(time=0).tasmin.plot()

Expected behavior

Data is accessed and loaded successfully.

@fmigneault
Contributor

Is there an estimated date of when this script last ran successfully, to see whether any relevant Magpie/Twitcher version changes since then could have caused this?

My guess would be that it is not related to Magpie/Twitcher code changes themselves (there has not really been any change related to permission resolution for a while), but perhaps to the default permissions and/or components loaded in birdhouse-deploy.

@tlogan2000
Author

@fmigneault The script was working early to mid February. @tlvu Is there a history of deployment changes we can track?

@tlvu

tlvu commented Mar 13, 2023

@fmigneault The script was working early to mid February. @tlvu Is there a history of deployment changes we can track?

@tlogan2000
Yes, the Twitcher/Magpie upgrade was done in PR bird-house/birdhouse-deploy#283, which was merged to our fork in commit Ouranosinc/birdhouse-deploy@f72d36c on Feb 13, so it should have gone live on Feb 14 via nightly autodeploy.

That said, @fmigneault's concern that there could be another cause than Twitcher/Magpie is valid. To test that theory, I'll try to roll back Twitcher/Magpie to the previous version to see whether the problem goes away. If it does, then the problem is clearly caused by the new Twitcher/Magpie deployed on Feb 14.

Hopefully the downgrade is easy, since there was a DB upgrade for Magpie (see comment bird-house/birdhouse-deploy#283 (review)). In fact, is there a procedure to downgrade Magpie documented somewhere?

@tlvu

tlvu commented Mar 13, 2023

@fmigneault

As I suspected, the previous Magpie does not start with the DB already upgraded for the new Magpie:

[2023-03-13 20:46:17,015] INFO       [MainThread][alembic.runtime.migration] Will assume transactional DDL.
[2023-03-13 20:46:17,198] ERROR      [MainThread][alembic.util.messaging] Can't locate revision identified by '5e5acc33adce'
FAILED: Can't locate revision identified by '5e5acc33adce'

Identifier '5e5acc33adce' matches the upgrade in comment bird-house/birdhouse-deploy#283 (review).

Do you happen to have the procedure to downgrade Magpie?

@fmigneault
Contributor

@tlvu
You can run the downgrade manually with the Makefile from Magpie.
https://github.com/Ouranosinc/Magpie/blob/master/Makefile#L282-L288

You can probably even set 0c6269f410cd (the revision just before) manually in the DB, since that migration only adds an index to the schema and does not modify the data itself.

@tlvu

tlvu commented Mar 14, 2023

Documenting commands to pseudo downgrade Magpie DB:

+ docker exec postgres-magpie psql -U pavics magpiedb -c 'select * from alembic_version where version_num='\''5e5acc33adce'\'''
 version_num
--------------
 5e5acc33adce
(1 row)

+ docker exec postgres-magpie psql -U pavics magpiedb -c 'update alembic_version set version_num='\''0c6269f410cd'\'' where version_num='\''5e5acc33adce'\'''
UPDATE 1

@tlvu

tlvu commented Mar 14, 2023

Subsequent upgrade back to the latest Magpie version does not work:

sqlalchemy.exc.ProgrammingError: (psycopg2.errors.DuplicateTable) relation "ix_users_email_unique_case_insensitive" already exists

[SQL: CREATE UNIQUE INDEX ix_users_email_unique_case_insensitive ON users (lower(email))]

@tlvu

tlvu commented Mar 14, 2023

@fmigneault I tried to follow this, as you suggested: https://github.com/Ouranosinc/Magpie/blob/e6a7c34b97877d28dc90538d157e3a948b91b6a4/Makefile#L272-L288

With the new Magpie version running, I ran the following commands:

APP_INI=/opt/local/src/magpie/config/magpie.ini
DB_REVISION=5e5acc33adce
CONTAINER_NAME=magpie
docker exec -e MAGPIE_INI_FILE_PATH="$APP_INI" $CONTAINER_NAME alembic -c "${APP_INI}" downgrade $DB_REVISION

And it does not do anything. Is there something I missed?

If possible, could you provide the exact command to downgrade from the context of the PAVICS environment, rather than a dev environment?

@tlvu

tlvu commented Mar 15, 2023

@fmigneault

I tried to do this manually: https://github.com/Ouranosinc/Magpie/blob/e6a7c34b97877d28dc90538d157e3a948b91b6a4/magpie/alembic/versions/2022-09-01_5e5acc33adce_case_insensitive_email_constraint.py#L29-L30

docker exec postgres-magpie psql -U pavics magpiedb -c 'DROP INDEX ix_users_email_unique_case_insensitive'

It also hangs forever (I waited 2 hours).

I think the reason my previous alembic downgrade did nothing is that it also hangs.

Can you advise how to drop that index, or investigate the hang when trying to drop it? Without dropping the index, I cannot go back to the latest Magpie.

@fmigneault
Contributor

fmigneault commented Mar 15, 2023

@tlvu
I think the alembic downgrade command you used is good, but it should point at 0c6269f410cd instead of 5e5acc33adce (it should be read as "downgrade to <revision>"). Using that command, it should automatically perform the DROP INDEX you found in its Python script.

Then, using a Magpie Docker version from before the 5e5acc33adce revision was introduced will make 0c6269f410cd the "latest". That means you would have to use https://github.com/Ouranosinc/Magpie/releases/tag/3.26.0 or older, since the revision was introduced in Ouranosinc/Magpie@9c3d8ad, first tagged in 3.27.0. You cannot re-use any version that includes 5e5acc33adce, as its migration will be applied automatically on startup of the Magpie webapp.
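Putting this advice together with the command run earlier in the thread, the corrected invocation would presumably look as follows (a sketch only; the container name and ini path are reused from the earlier comment, and the command targets a live deployment):

```shell
# Sketch: same command as before, but targeting the revision *below*
# the one being undone ("downgrade to <revision>").
APP_INI=/opt/local/src/magpie/config/magpie.ini
DB_REVISION=0c6269f410cd   # revision just before 5e5acc33adce
CONTAINER_NAME=magpie

docker exec -e MAGPIE_INI_FILE_PATH="$APP_INI" "$CONTAINER_NAME" \
    alembic -c "$APP_INI" downgrade "$DB_REVISION"
```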

@tlvu

tlvu commented Mar 16, 2023

@fmigneault My problem is that the alembic downgrade (to both 5e5acc33adce and 0c6269f410cd) and the manual DROP INDEX ix_users_email_unique_case_insensitive both hang.

I thought it was a problem with my VM, but it also hangs on the staging machine.

If you happen to have a PAVICS stack running, can you try the rollback yourself? If not, it's okay, I'll try to figure this out next week. This week I am on Misha's PR.

@fmigneault
Contributor

@tlvu
The issue seems specific to the Docker environment. The command also hangs on my side when using the birdhouse Docker containers.
I'm not quite sure why: I've tried both docker exec on the running instance and a separate docker run call with everything stopped, and they all hang.

The corresponding command in a local environment applies the operations without hanging, and the changes are reflected in the database:

[screenshot: successful alembic downgrade in a local environment]

There must be something specific happening due to the Docker runtime.

@fmigneault
Contributor

@tlvu
I've dug deeper into the postgres-magpie Docker instance.
There seem to be many idle transactions not being closed properly (still investigating the cause), causing the new transaction created by the downgrade to block until one becomes available.
As a "quick fix", I force-closed them with pavics-compose up -d --force-recreate postgres-magpie and then ran the alembic downgrade command again, successfully.
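Those idle transactions can be seen directly through PostgreSQL's standard pg_stat_activity view (a sketch; the container, user, and database names are reused from the psql commands earlier in the thread):

```shell
# List connections stuck "idle in transaction"; these hold locks that
# block the DROP INDEX taken by the alembic downgrade.
docker exec postgres-magpie psql -U pavics magpiedb -c \
  "SELECT pid, state, age(now(), xact_start) AS xact_age, left(query, 60) AS last_query
     FROM pg_stat_activity
    WHERE state LIKE 'idle in transaction%'
    ORDER BY xact_start;"
```

A stuck backend could then be terminated individually with `SELECT pg_terminate_backend(<pid>);` rather than recreating the whole container.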

@fmigneault
Contributor

Relevant PR to address the issue: Ouranosinc/Magpie#580

@tlvu

tlvu commented Mar 22, 2023

As a "quick fix", I force-closed them with pavics-compose up -d --force-recreate postgres-magpie and then ran the alembic downgrade command again, successfully.

Sorry for the late reply. I was busy with Misha's PR.

Taking clues from you, a simple ./pavics-compose.sh restart postgres-magpie also allowed me to DROP INDEX ix_users_email_unique_case_insensitive.

I did this on our staging machine and rolled back to the older Twitcher/Magpie. I will try to reproduce Travis's problem with the older Twitcher/Magpie.

@tlvu

tlvu commented Mar 22, 2023

@tlogan2000 I think I found the root cause: it's the pydap engine in the xr.open_dataset call. The bad news is that without the pydap engine, it's 2 to 3 times slower.

I rolled back Magpie on our staging host Boreas and was able to reproduce the problem there too (both behind Twitcher and directly with testthredds): https://pavics.ouranos.ca/jupyter/hub/user-redirect/lab/tree/public/lvu-public/Access-ESPO-G6_PAVICS_BROKEN-with-pydap-boreas-old-magpie.ipynb

Then I removed the pydap engine and it works, both behind Twitcher and directly with testthredds: https://pavics.ouranos.ca/jupyter/hub/user-redirect/lab/tree/public/lvu-public/Access-ESPO-G6_PAVICS_WORKING-boreas-old-magpie.ipynb

Production is still on the latest Magpie. We can re-run my notebooks against production and will probably get the same result. Magpie is probably not the cause.

@tlvu

tlvu commented Mar 22, 2023

Confirmed it works on production with latest Magpie as well, as long as the engine is not pydap.

@fmigneault I noticed, however, a performance problem with Twitcher/Magpie, possibly worse with the newer version.

When hitting production (on fast SSD) behind Nginx and the newer Twitcher/Magpie 3.31.0, it took around 210 s; behind Nginx only, around 33 s.

When hitting staging (on a spinning HDD) behind Nginx and the older Twitcher/Magpie 3.26.0, it took around 113 s; behind Nginx only, around 56 s.

Each number is the average of 3 runs, to remove the effect of caching.

@tlogan2000
Author

Yes, pydap is required to access protected datasets using a session, etc. I just removed the group permission for the ESPO-G6 data this week, so I think we may still have an issue.

@tlvu

tlvu commented Mar 23, 2023

Yes, pydap is required to access protected datasets using a session, etc. I just removed the group permission for the ESPO-G6 data this week, so I think we may still have an issue.

Oh! So my test works simply because I did not need authentication, the dataset being public now. Had the dataset still been protected, we would need the pydap engine and the problem would be back?

@fmigneault
Contributor

Is it possible to point me to the specific pydap function that handles the request, or to get logs of the requests sent toward Twitcher?
(note: I can't access the notebooks)

If the behaviour differs depending on whether pydap is used, it must inject something into the request that causes a side effect, or something related to the timing/caching of multiple requests is at play.
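For the request logs: one generic way to see exactly what the client sends over the wire (a sketch using only the standard library; it assumes the HTTP stack goes through http.client, as requests/urllib3 and hence pydap's HTTP layer do):

```python
import http.client
import logging

# Turn on low-level HTTP wire logging: every request line and header
# sent through http.client gets printed, making it easy to diff what
# a pydap-driven request injects versus a plain requests.get() of the
# same OPeNDAP URL.
http.client.HTTPConnection.debuglevel = 1

# Also surface urllib3's own debug messages (connection reuse, retries).
logging.basicConfig(level=logging.DEBUG)
logging.getLogger("urllib3").setLevel(logging.DEBUG)
```

With this enabled before the `xr.open_dataset(...)` call, the notebook output would include each outgoing request's headers, which could then be compared between the pydap and non-pydap paths.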

@tlvu
Can you try with Magpie 3.30.0, to reduce the scope to look through?
One notable change in 3.31.0 is the use of twitcher>=0.8.0, with tweaks to the MagpieAdapter interface.

@tlogan2000
Author

@tlvu I updated the notebook to point to the ESPO-R5 data (still protected) and the problems are present: https://pavics.ouranos.ca/jupyter/hub/user-redirect/lab/tree/public/logan-public/Tests/THREDDS_Issues_March2023/Access-ESPO-G6_PAVICS_BROKEN.ipynb

@tlogan2000
Author

Is it possible to point me to the specific pydap function that handles the request, or to get logs of the requests sent toward Twitcher?
(note: I can't access the notebooks)

@fmigneault I think you will need a PAVICS account to reproduce the issue, even if I point you toward the specific call. I think you already have an account, but if you need to change your password etc., let me know.

@fmigneault
Contributor

@tlogan2000
I don't think I have one, and if I do, I no longer have the username/password.

@tlvu

tlvu commented May 11, 2023

Good news: with the command sudo sysctl net.ipv4.tcp_tw_reuse=1, suggested by @ldperron, https://pavics.ouranos.ca/jupyter/hub/user-redirect/lab/tree/public/logan-public/Tests/THREDDS_Issues_March2023/Access-ESPO-G6_PAVICS_BROKEN.ipynb is passing.

It's a bit less safe with this setting turned on, but according to @ldperron it should be okay, since the conditions for it to be unsafe are pretty rare.

Still, I'll run a couple of Jenkins runs on production just to be sure.

In the meantime, @tlogan2000, please confirm the fix works for your use cases, or test a bigger use case to probe the limits of this workaround.
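For reference, this is how such a setting is typically made permanent (standard sysctl mechanics; the config file name below is an arbitrary choice):

```shell
# Apply immediately (as in the comment above):
sudo sysctl net.ipv4.tcp_tw_reuse=1

# Persist across reboots by dropping a config file and reloading:
echo 'net.ipv4.tcp_tw_reuse = 1' | sudo tee /etc/sysctl.d/90-tcp-tw-reuse.conf
sudo sysctl --system   # re-reads /etc/sysctl.d/*.conf
```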

@tlvu

tlvu commented May 12, 2023

FYI @moulab88, this is the issue for which I wanted to test net.ipv4.tcp_tw_reuse=1 in production. If the results look good, I will make this change permanent. Do you have any objections?

@moulab88

You can go with the change. This setting was tested a few years ago, but we will check the connection states.
