DataprocSpawner enables JupyterHub to spawn single-user Jupyter notebooks that run on Dataproc clusters. This provides users with ephemeral clusters for data science without the pain of managing them.
- Product Documentation
- DISCLAIMER: DataprocSpawner only supports zonal DNS names. If your project uses global DNS names, see the instructions on how to migrate to zonal DNS names.
Supported Python Versions: Python >= 3.6
In order to use this library, you first need to go through the following steps:
- Select or create a Cloud Platform project
- Enable billing for your project
- Enable the Google Cloud Dataproc API
- Setup Authentication
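Once the steps above are complete, JupyterHub is wired to the spawner through ``jupyterhub_config.py``. A minimal sketch of that wiring follows; the ``dataprocspawner.DataprocSpawner`` class path and the ``project`` setting shown here are assumptions for illustration, not taken verbatim from this README:

```python
# jupyterhub_config.py -- minimal sketch (class path and property
# names are assumptions; check the shipped example config).
c.JupyterHub.spawner_class = 'dataprocspawner.DataprocSpawner'

# Project in which the Dataproc clusters are created.
c.DataprocSpawner.project = '<YOUR_PROJECT_ID>'
```

The ``c`` object is injected by JupyterHub when it loads the file, so this fragment is not meant to run standalone.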
To try it locally for development purposes, run the following from the root folder:

```shell
chmod +x deploy_local_example.sh
./deploy_local_example.sh <YOUR_PROJECT_ID> <YOUR_GCS_CONFIG_LOCATIONS> <YOUR_AUTHENTICATED_EMAIL>
```
The script will start a local container image and authenticate it using your local credentials.
Note: Although you can try the DataprocSpawner image locally, you might run into networking problems.
To try it out in the Cloud, the quickest way is to use a test Compute Engine instance. The following steps take you through the process.
- Set your working project:

  ```shell
  PROJECT_ID=<YOUR_PROJECT_ID>
  VM_NAME=vm-spawner
  ```
- Run the example script, which:

  a. Creates a Dockerfile
  b. Creates a ``jupyter_config.py`` example file that uses a dummy authenticator
  c. Deploys a Docker image of the JupyterHub spawner to Google Container Registry
  d. Creates a container-based Compute Engine instance
  e. Returns the IP of the instance that runs JupyterHub

  ```shell
  bash deploy_gce_example.sh ${PROJECT_ID} ${VM_NAME}
  ```
- After the script finishes, you should see an IP displayed. You can use that IP to access your setup at ``<IP>:8000``. You might have to wait a few minutes until the container is deployed on the instance.
To troubleshoot:
- SSH into the VM:

  ```shell
  gcloud compute ssh ${VM_NAME}
  ```
- From the VM console, install some useful tools:

  ```shell
  apt-get update
  apt-get install vim
  ```
- From the VM console, you can:

  - List the running containers with ``docker ps``
  - Display container logs with ``docker logs -f <CONTAINER_ID>``
  - Execute code in the container with ``docker exec -it <CONTAINER_ID> /bin/bash``
  - Restart the container for changes to take effect with ``docker restart <CONTAINER_ID>``
- DataprocSpawner defaults to port 12345. The port can be set within ``jupyterhub_config.py``; see JupyterHub's documentation for more information:

  ```python
  c.Spawner.port = {port number}
  ```
- The default region for Dataproc clusters is ``us-central1`` and the default zone is ``us-central1-a``. Using ``global`` is currently unsupported. To change the region, pick a region and zone from the list of available Compute Engine regions and zones, and include the following lines in ``jupyterhub_config.py``:

  ```python
  c.DataprocSpawner.region = '{region}'
  c.DataprocSpawner.zone = '{zone that is within the chosen region}'
  ```
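Because the zone must sit inside the chosen region, a quick sanity check can catch a mismatched pair before the Hub starts. This helper is purely illustrative and not part of DataprocSpawner; it relies only on the Compute Engine naming convention that a zone is the region name plus a single-letter suffix:

```python
def zone_in_region(zone: str, region: str) -> bool:
    """Compute Engine zones are named '<region>-<suffix>', e.g. 'us-central1-a'."""
    return zone.rsplit("-", 1)[0] == region

# The documented defaults form a valid pair.
print(zone_in_region("us-central1-a", "us-central1"))   # True
print(zone_in_region("europe-west1-b", "us-central1"))  # False
```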
- For an example of how to run the DataprocSpawner in production, refer to the ai-notebook-extended GitHub repository.
- For a Google-supported version of the Dataproc Spawner, refer to the official Dataproc Hub documentation.