Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Custom notebooks #150

Merged
merged 24 commits into from
May 11, 2021
Merged

Custom notebooks #150

merged 24 commits into from
May 11, 2021

Conversation

ChaamC
Copy link
Collaborator

@ChaamC ChaamC commented Apr 19, 2021

New solution for custom notebooks, related to both pull requests :
bird-house/pavics-jupyter-base#3
crim-ca/pavics-jupyter-images#3

The new solutions uses a custom script that is executed directly on a specific image such as pavics/crim-jupyter-eo or pavics/crim-jupyter-nlp. This script can be executed by configuring a job for the scheduler, using the AUTODEPLOY_EXTRA_SCHEDULER_JOBS variable in the env.local file. The script uses a yaml config file found on the specific images' repo, which describes the source of the notebook files to download and the destination location where to store them.

There was also a need to restructure the tutorial-notebooks folder, in order to better control which notebooks are available depending of the selected image in JupyterHub.

We can discuss of the solution here, but I would propose putting all the current notebooks to a 'common' folder, which would contain all the notebooks that would be available to all images. There would also be other folders for each specific images.
So we would have these paths for example :

  • tutorial-notebooks/common
  • tutorial-notebooks/crim-eo
  • tutorial-notebooks/crim-nlp
  • ...
    We can then mount the common folder for all images, and we can mount the folder related to our specific image.
    For now, I have added the common folder for the volume mounting as an example, but the folder is empty at this current state, since we would need to move the desired notebooks there.

A restriction with my solution is that the specific image notebook folder has to have the same name as the name of the image in the whitelist from JupyterHub (see the code of the CustomDockerSpawner which uses the name of the spawned image to mount the corresponding folder.)

Copy link
Collaborator

@dbyrns dbyrns left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really good job! We can see the thing takes form and it's elegant. I still see some improvements, but we are real close.

# Add more jobs to ./components/scheduler/config.yml
#
# Potential usages: other deployment, backup jobs on the same machine
#
#export AUTODEPLOY_EXTRA_SCHEDULER_JOBS=""
#
# Example extra job that deploys custom notebooks for a specific image
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we just create a script that loop over DOCKER_NOTEBOOK_IMAGES env var (https://github.com/bird-house/birdhouse-deploy/blob/master/birdhouse/env.local.example#L214) and perform this operation on each of them?
So, if I want to offer the eo image I add it the the available images env and voilà! I also get all its notebooks.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean automatically creating a job for each image found in DOCKER_NOTEBOOK_IMAGES?
I suppose we could, most of the parameters will stay the same with only the name (such as eo, nlp, etc.) changing.
But, do we always want to create a job for all of those images? For example, I don't want to have notebooks specific to pavics/workflow-tests, I don't want to have a job for this image. But I guess we could organize those images differently, like having another variable that contains the list of the images for which we want to actually run the script.

Or did you mean a single job that will run the script for each images?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean a single job running a script updating all images. Basically what you already have, but inside a loop. Apart the image name everything else is the same.
And yes for all images found in DOCKER_NOTEBOOK_IMAGES. Right now we have workflow-tests, but the name is misleading because it's the primary image that is used by almost everyone! In fact all images in DOCKER_NOTEBOOK_IMAGES are available to user, so yes I think that we want to keep them updated.

#
# This is meant to be run on the same host running PAVICS.
#
# The data details is specified in a yaml config file (TEMPLATE_CONFIG_YML), using this format :
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not clear here that the yaml config file must be in the image. This is the case right?
Also the file is expected to be located to a particular location : https://github.com/bird-house/birdhouse-deploy/pull/150/files#diff-17e9b2e274b97a022f797d1f221f2b50144c0ce3b70537a9faa2b7412b2a2cafR159

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, in our case, we run this script on an image, so the config must be on the same image, yes. But, I suppose the script could be used directly on the host, without using any image. Since our use case here is to run it on a docker image, I should just put a clarification that the config file should be on the image too though. :P

And, the script has to be copied, when building the docker image, to the same path that will be written for the job command (like on the link you put here.)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hum ok, I agree that this script doesn't mind where the config file is located, it is passed as an argument.
In fact, what is needed is a better place to document the location of the config file in images. All images will require this config file at a specific location so that the job (my link) can find it.

#- name: notebookdeploy-eo
# comment: Auto-deploy tutorial notebooks for the eo image
# schedule: '${AUTODEPLOY_NOTEBOOK_FREQUENCY}'
# command: '/deploy-data-specific-image /notebook_config.yml.template'
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Following my previous comment regarding this config, given that this file /notebook_config.yml.template is in the image, why is it a .template?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This "cronjob" is again pretty much similar to the existing https://github.com/bird-house/birdhouse-deploy/blob/57a320c3583f804838e8c18aeb064266527a0bc9/birdhouse/components/scheduler/deploy_data_job.env. Same question, if a feature is missing, can we add to it than forking it?

This one you might not be aware since it's under the "scheduler" component because it has to be used with that component.

Using that "generic" cronjob allows a much simplified cronjob definition like this https://github.com/bird-house/birdhouse-deploy/blob/57a320c3583f804838e8c18aeb064266527a0bc9/birdhouse/components/scheduler/deploy_raven_testdata_to_thredds.env

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, I used the template word because in this version of the config, we find the variable ${JUPYTERHUB_USER_DATA_DIR}, which gets replaced by its value when running the script.
We do a copy of the config file using envsubst < $TEMPLATE_CONFIG_YML > $CONFIG_YML, where the copy has the real value instead of the variable.
I agree 'template' is not the best word here though. I don't know if you have a better idea.

In regards to your comment about the dest_dir variable found in the yaml config, I could remove it, and we could always output the notebooks to the same directory, in ${JUPYTERHUB_USER_DATA_DIR}/tutorial-notebooks/{$IMAGE_NAME}.
We could remove that option of customization, which could remove my need of having a copy of the config with variables to replace.
If we decide to remove the dest_dir option, I will just remove all mentions of .template, and remove the envsubst command from the script.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, it solves two issues and make the whole thing simpler, so I would remove the dest_dir option from the config.
But I would not use the directory ${JUPYTERHUB_USER_DATA_DIR}/tutorial-notebooks/{$IMAGE_NAME}. This is where the notebooks need to be on the host, but in the image it could be anywhere like /tmp as long as the calling command mount the host volume at that point. My suggestion still hold : in the docker command include the volume mount and set an environment variable telling the script where to put the stuff (the dest_dir, but as a env var rather than a config)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, now that I have more time to digest this PR ...

I would suggest moving this sample big blob of code to generate the cronjob also into jupyter-pavics-base repo so it is together with the script deploy-data-specific-image it wraps.

  • It makes the env.local.example shorter and less intimidating
  • You now own the "code" so if you need to fix something (like add/remove a volume-mount) you can do it and it's transparent to all the callers, all they need is always run the latest version. You basically provide the user with a stable interface and give you control to change the implementation.
  • Later, in a different PR, once we "migrate" to the generic deploy-data script, the matching generic cronjob components/scheduler/deploy_data_job.env will also have to be adapted for whatever newer options that deploy-data will support. And we still keep the same consistency that the "script" and the "cronjob wrapper" are together.

Try to inspire from the existing generic cronjob wrapper, especially the part about

if [ -z "`echo "$AUTODEPLOY_EXTRA_SCHEDULER_JOBS" | grep $DEPLOY_DATA_JOB_JOB_NAME`" ]; then
# Add job only if not already added (config is read more than once during
# autodeploy process).
to ensure it also works during autodeploy (there is a slight difference between ./pavics-compose.sh invoked manually on the console and invoke inside the scheduler component).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Forgot to say, once you move the "cronjob generation code" out to jupyter-pavics-base, you can reference it back in env.local.example to "advertise" it, like

# Load pre-configured cronjob to automatically deploy Raven testdata to Thredds
# for Raven tutorial notebooks.
#
# See the job for additional possible configurations. The "scheduler"
# component needs to be enabled for this pre-configured job to work.
#
#if [ -f "$COMPOSE_DIR/components/scheduler/deploy_raven_testdata_to_thredds.env" \
# -a -f "$COMPOSE_DIR/components/scheduler/deploy_data_job.env" ]; then
# . $COMPOSE_DIR/components/scheduler/deploy_raven_testdata_to_thredds.env
# . $COMPOSE_DIR/components/scheduler/deploy_data_job.env
#fi

birdhouse/default.env Show resolved Hide resolved
birdhouse/deployment/deploy-data-specific-image Outdated Show resolved Hide resolved
Copy link
Collaborator

@tlvu tlvu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we are getting somewhere in the good direction.

I just want to ensure a seemless transition and also more code re-use.

Consider this my first-pass review as I have not seen the other 2 PR that leverage this feature yet.

birdhouse/config/jupyterhub/jupyterhub_config.py.template Outdated Show resolved Hide resolved
birdhouse/default.env Show resolved Hide resolved
echo "Extracting ${FULL_URL} to ${DEST_DIR}"

# Download the data from github and copy it to the destination directory
svn export --force $FULL_URL $DEST_DIR
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Download from github using svn client? I didn't know that it could be done! But why not just use a git client?

The bigger question is this script seems to do a very similar work as the existing deploy-data script. Not clear what feature is missing but why we can not re-use the existing deploy-data script? If a feature is missing, can we add it instead of forking another script?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They do have a lot of similarities. I could give it a try and try using the deploy-data script directly, but I am scared of finding a blocking point sometime if we want to add different features to one of the use case.
They are almost similar though, if not the same, for now.

Also, I see that the deploy-data script requires the use of Docker, which means we have to install Docker on the pavics-jupyter-base (this would replace the yq/jq installation actually). I don't know what is the best here, if we want to keep those images to the bare minimum.

@dbyrns I would be curious about your input on this :)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have enough information on what @tlvu would like you to reuse to make that call. If he could show us how it could easily be done I'm not against it. As long as we don't have to invest another couple of days to do that. The same apply to the other reuse request, maybe you could talk directly to each other so that it could be done fast.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the main difference between our scripts is that the deploy-data script is meant to run on a generic image that includes Docker and Git, but our new script deploy-data-specific-image is meant to be run directly on one of our own specific image, which includes our config.yaml file.

A benefit of using the new script is that it lets us have the yaml file directly on the specific image's repo, keeping it close to its related context. For example, this means, if a developer wants to include a new folder for the crim-eo environment, he just goes to the crim-eo's repo and changes the config there.

I am not sure if we could do this easily with the deploy-data script? I think the yaml files are stored directly on the birdhouse-deploy repo? Such as : deploy_raven_testdata_to_thredds.env and deploy-data-raven-testdata-to-thredds.yml

They are then added as a job in the env.local file :

#if [ -f "$COMPOSE_DIR/components/scheduler/deploy_raven_testdata_to_thredds.env" \

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that I have more time to digest this change. I would suggest moving this deploy-data-specific-image script into the jupyter-pavics-base image so the script and the config yaml (that is in the child image) is part of the same final image.

Why? So eventually Jenkins can also checkout all the notebooks and run tests on them, like currently with the default Jupyter image. The e2e repo should not be responsible for checking out the notebooks like currently and will enable that same e2e repo to test against different notebooks depending on the Jupyter image. That requires implementation of Ouranosinc/PAVICS-e2e-workflow-tests#57 but we can start to lay the ground in that direction.

For the scheduler cronjob, this means you don't even have to volume-mount the script from outside since it's already inside the image !

As for re-using the existing generic deploy-data script, I think it simply boils down to

  1. being able to use yq as docker run (currently) or as installed (your case), will have to add a new switch for that.
  2. download the repo as a tar/zip archive (your way) instead of git pull (current), will have to add a new switch for that as well.
    Note that going the tar/zip archive route nullify the caching provided by git pull. However, git pull way waste more space because basically there is a double checkout, one used for caching, one in the /data/jupyter_user_data/tutorial-notebooks/.... Each side have pros and cons so for generic and re-usabilty sake, we can implement both.

In a different PR, once deploy-data have the additional mode of operations, the jupyter-pavics-base will wget the script part of the build and make it available inside the image as deploy-data-specific-image is, without being directly committed as deploy-data-specific-image.

#- name: notebookdeploy-eo
# comment: Auto-deploy tutorial notebooks for the eo image
# schedule: '${AUTODEPLOY_NOTEBOOK_FREQUENCY}'
# command: '/deploy-data-specific-image /notebook_config.yml.template'
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This "cronjob" is again pretty much similar to the existing https://github.com/bird-house/birdhouse-deploy/blob/57a320c3583f804838e8c18aeb064266527a0bc9/birdhouse/components/scheduler/deploy_data_job.env. Same question, if a feature is missing, can we add to it than forking it?

This one you might not be aware since it's under the "scheduler" component because it has to be used with that component.

Using that "generic" cronjob allows a much simplified cronjob definition like this https://github.com/bird-house/birdhouse-deploy/blob/57a320c3583f804838e8c18aeb064266527a0bc9/birdhouse/components/scheduler/deploy_raven_testdata_to_thredds.env

birdhouse/env.local.example Outdated Show resolved Hide resolved
birdhouse/env.local.example Outdated Show resolved Hide resolved
Copy link
Collaborator

@tlvu tlvu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So here my "final-pass" review.

Good job, you seem to grasp the scheduler component and how to add more "jobs" to it.

Things are aligned in the right direction, that's the most important point. There are improvements but we do not have to make them right now.

Here are what I suggest to merge this PR without too much more work:

  • ensure 100% seemless transition (ie when MOUNT_IMAGE_SPECIFIC_NOTEBOOKS == false should behave exactly as before, as if this change never existed).
  • move the custom, not generic deploy script and the matching cronjob generation to jupyter-pavics-base

All the issues about code re-use, we can handle in a "phase 2" sort of PR, iterative/incremental development style.

Final thing, remember to test this change in an autodeploy context (meaning have autodeploy job in the scheduler callling ./pavics-compose.sh which generate the actual ./components/scheduler/config.yml with all the jobs fully instantiated, make sure no double job and all the variables expanded correctly).

# 'nlp-crim': os.environ['DOCKER_NOTEBOOK_IMAGES'].split()[2],
#c.DockerSpawner.image_whitelist = {os.environ['JUPYTERHUB_IMAGE_SELECTION_NAMES'].split()[0]: os.environ['DOCKER_NOTEBOOK_IMAGES'].split()[0],
# os.environ['JUPYTERHUB_IMAGE_SELECTION_NAMES'].split()[1]: os.environ['DOCKER_NOTEBOOK_IMAGES'].split()[1],
# os.environ['JUPYTERHUB_IMAGE_SELECTION_NAMES'].split()[2]: os.environ['DOCKER_NOTEBOOK_IMAGES'].split()[2],
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For these lines, I am aware it is not super clean. I haven't found an easier way to implement it yet, since the concerned variables have to be working both on shellscript and python.

About DOCKER_NOTEBOOK_IMAGES and JUPYTERHUB_IMAGE_SELECTION_NAMES, we could have use those names in just one variable instead of 2, using a string "formatted similarly to a dictionary" such as "pavics;pavics/workflow-tests:210216 eo-crim;pavics/crim-jupyter-eo:0.1.0 ...." but it would require additional string splitting, so I don't think it would be cleaner.

Another thing is we could probably simplify this by using the same name all the time, instead of having 2 different name format. We would use the image name such as pavics/crim-jupyter-eo:0.1.0 everywhere though, meaning it would be what we see in the jupyterhub list, and it would be the name of the directories for the tutorial-notebooks, instead of the shorter version eo-crim. I don't think it would be necessarily cleaner either. (Potential problem with the '/' found in the image names, clashing with using it as a directory name?)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed it's ugly, but also agree other alternative either complexify the implemenation a lot or look less great. Open to other alternative as well, if someone see something we both missed ...

Copy link
Collaborator

@tlvu tlvu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just first pass. Have not seen the other matching PRs. Nothing major so far, looking good.


# Log file location. Default location under /var/log/PAVICS/ has built-in logrotate.
if [ -z "$DEPLOY_DATA_JOB_LOGFILE" ]; then
DEPLOY_DATA_JOB_LOGFILE="/var/log/PAVICS/${DEPLOY_DATA_JOB_JOB_NAME}.log"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not using the new PAVICS_LOGDIR you added?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will replace it here, but if I search for the "/var/log/PAVICS" path in the repo, I see that it is referenced many times, in different other jobs or configs (logrotate, autodeploy, trigger-deploy-notebook, ...) Should I also take the liberty of replacing those with the new variable? I am not sure if I want to change anything in parts of the code that I don't know about.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tlvu Let me know what I should do for this :)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tlvu ?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, missed this question. Yes indeed, will be nice to clean up all existing hardcode but I'd suggest to just finish this PR that's been open for a while already. You can open another PR to fix the remaining of reference.

birdhouse/config/jupyterhub/jupyterhub_config.py.template Outdated Show resolved Hide resolved
birdhouse/env.local.example Outdated Show resolved Hide resolved
birdhouse/env.local.example Outdated Show resolved Hide resolved
birdhouse/env.local.example Outdated Show resolved Hide resolved
Copy link
Collaborator

@tlvu tlvu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work. I think you got all the pieces. Just simplify them a bit since you are not aiming for something generic (it's all specific to notebooks deployment).

birdhouse/env.local.example Outdated Show resolved Hide resolved
# 'nlp-crim': os.environ['DOCKER_NOTEBOOK_IMAGES'].split()[2],
#c.DockerSpawner.image_whitelist = {os.environ['JUPYTERHUB_IMAGE_SELECTION_NAMES'].split()[0]: os.environ['DOCKER_NOTEBOOK_IMAGES'].split()[0],
# os.environ['JUPYTERHUB_IMAGE_SELECTION_NAMES'].split()[1]: os.environ['DOCKER_NOTEBOOK_IMAGES'].split()[1],
# os.environ['JUPYTERHUB_IMAGE_SELECTION_NAMES'].split()[2]: os.environ['DOCKER_NOTEBOOK_IMAGES'].split()[2],
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed it's ugly, but also agree other alternative either complexify the implemenation a lot or look less great. Open to other alternative as well, if someone see something we both missed ...

birdhouse/env.local.example Outdated Show resolved Hide resolved
birdhouse/env.local.example Outdated Show resolved Hide resolved
birdhouse/env.local.example Outdated Show resolved Hide resolved
birdhouse/env.local.example Show resolved Hide resolved
Copy link
Collaborator

@dbyrns dbyrns left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel bad for the long journey, but we are almost there.

@tlvu
Copy link
Collaborator

tlvu commented Apr 27, 2021

I feel bad for the long journey, but we are almost there.

Yes sorry @ChaamC I was the one who lead you into having the loop outside env.local in comment #150 (comment).

At the time, not knowing how things will turn out, I was simply thinking with what I have currently.

@ChaamC
Copy link
Collaborator Author

ChaamC commented May 4, 2021

@dbyrns @tlvu @matprov
Updated PR : moved the env file to the jupyter-pavics-base repo, and fixed other concerns pointed in your feedback

Let me know if it works!

@ChaamC ChaamC requested review from dbyrns and tlvu May 4, 2021 17:11
Copy link
Collaborator

@tlvu tlvu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good on this side.

Copy link
Collaborator

@dbyrns dbyrns left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perfect! I really like how it turns out :)

@ChaamC
Copy link
Collaborator Author

ChaamC commented May 10, 2021

@dbyrns @tlvu @matprov

After testing on the daccs-iac, I found one bug with the current solution.
It seems the existing 'notebookdeploy' scheduler job is in conflict with the new specific image jobs we are trying to add.
I did not see it at first on a local setup, since the 'notebookdeploy' scheduler job was only happening one time per hour, but I started seeing weird behaviors with the volume mounts on the daccs-iac, with the 'notebookdeploy' job happening at each 5 minutes on the daccs-iac setup.

At each 5 minutes, I first see the image-specific folders getting updated as expected, but the 'notebookdeploy' job finishes after, and erases the eo-crim/nlp-crim folders. I was testing the specific image job at each minute, so the following minutes, the eo-crim/nlp-crim come back and get updated as expected. The problem then reoccurs each 5 minutes.

I think the problem is that the 'notebookdeploy' job seems to erase everything in the 'tutorial-notebooks' folder, when running the trigger-deploy-notebook script :

rm -rf $TUTORIAL_NOTEBOOKS_DIR/*

I am not sure what would be to best solution here. Maybe we need to restructure those directories.
I would be curious about your feedback on this.
Should we close this PR first, and attack this problem in a new one?
We can still keep the feature disabled in the configs for now while we fix this part.

@tlvu
Copy link
Collaborator

tlvu commented May 10, 2021

the 'notebookdeploy' job seems to erase everything in the 'tutorial-notebooks' folder

@ChaamC
Yes indeed, I forgot again about this little detail. It deletes everything so that when an existing notebook is renamed or deleted, it does not leave behind the old name.

A quick solution I can propose to you is to deploy your notebooks to a different location on disk, ex: /data/jupyterhub_user_data/tutorial-notebooks-multi-image/. So on disk it's different location but you can still volume-mount to the same /notebook_dir/tutorial-notebooks/ inside the Jupyter container.

So the feature toggle MOUNT_IMAGE_SPECIFIC_NOTEBOOKS now will also look for the new location on disk.

I had to do the exact same trick (deploy to another location on disk /data/jupyterhub_user_data/pavics-homepage) to deploy our extra tutorial notebook from our new landing page in PR bird-house/birdhouse-deploy-ouranos#8.

Good thing you tested in the real IAC env that replicate the prod env !

@tlvu
Copy link
Collaborator

tlvu commented May 10, 2021

Speaking of not leaving old name behind, does svn export --force $FULL_URL $FULL_DEST_DIR in https://github.com/bird-house/pavics-jupyter-base/blob/6a17b1965346f0c8d0893b5accda0f4f52cf230a/scheduler-jobs/deploy_data_specific_image#L86 override the entire folder?

The current deploy-data (

--recursive --links --checksum --delete \
) has the rsync --delete option to handle that case.

This is still a corner case. If renamed or deleted notebooks is not handled yet, just merge this PR now and handle it in a different PR.

@ChaamC
Copy link
Collaborator Author

ChaamC commented May 11, 2021

I checked for the svn command and it doesn't seem to override the folder. If I put some random file in the folder tutorial-notebooks/eo-crim/eo, and run the deploy job with those settings :

source_dir: eo/notebooks
dest_sub_dir: eo

It just takes every file from the source_dir, and sends them to the dest_sub_dir, updating any file that already exists with the same name.
Any other file found in the dest_sub_dir will just stay there for now.

@ChaamC
Copy link
Collaborator Author

ChaamC commented May 11, 2021

Updated with new fix.
tutorial-notebooks files related to specific images are not found on another folder on the host.
I will do a quick check in the iac too, and I think I can merge this after, if it's good for everyone!

Copy link
Collaborator

@tlvu tlvu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good for me with the new location on disk for notebooks. But I'd rather we play safe and revert the c.DockerSpawner.pull_policy = "always" change unless there is a compelling reason for it.

Make sure to re-test in a real IAC setup.

birdhouse/config/jupyterhub/jupyterhub_config.py.template Outdated Show resolved Hide resolved
@ChaamC
Copy link
Collaborator Author

ChaamC commented May 11, 2021

Did more tests in IAC setup, and new fix seems to be working fine now!

I see that the E2E tests are failing, but they seems to have failed in other recent PRs too, with similar types of errors, so I suppose it is correct for me to ignore this, and to go on with the merge.

Edit : Received confirmation by Mathieu that the E2E errors are not related to this PR, so proceding with merges :)

@ChaamC ChaamC merged commit d90765a into master May 11, 2021
@ChaamC ChaamC deleted the custom-notebooks branch May 11, 2021 16:57
@matprov
Copy link
Collaborator

matprov commented May 11, 2021

Good to know @ChaamC ! That was not an easy one, but it's finally getting merged - great job :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants