This container was created to support data science experiments, mainly in the context of Kaggle competitions.
Bundled tools:
- Based on Ubuntu 16.04
- Python 3
- Jupyter
- TensorFlow (CPU and GPU flavors)
- Spark driver (set SPARK_MASTER ENV pointing to your Spark Master)
- For creating a Spark Cluster, you can check
- Scoop, h5py, pandas, scikit, TFLearn, plotly
- pyexcel-ods, pydicom, textblob, wavio, trueskill, cytoolz, ImageHash...
CPU only:
- create docker-compose.yml
    version: "3"
    services:
      datascience-tools:
        image: flaviostutz/datascience-tools
        ports:
          - 8888:8888
          - 6006:6006
        volumes:
          - /notebooks:/notebooks
        environment:
          - JUPYTER_TOKEN=flaviostutz
docker-compose up
GPU support for TensorFlow:
- Prepare the host machine with NVIDIA CUDA drivers
sudo apt-key adv --fetch-keys
sudo sh -c 'echo "deb /" > /etc/apt/sources.list.d/cuda.list'
sudo apt-get update && sudo apt-get install -y --no-install-recommends cuda-drivers
- Install nvidia-docker and nvidia-docker-plugin
wget -P /tmp
sudo dpkg -i /tmp/nvidia-docker*.deb && rm /tmp/nvidia-docker*.deb
- Install nvidia-docker (
nvidia-docker run -d -v /root:/notebooks -v /root/input:/notebooks/input -v /root/output:/notebooks/output -p 8888:8888 -p 6006:6006 --name jupyter flaviostutz/datascience-tools:latest-gpu
If you wish this container to run automatically on host boot, add these lines to /etc/rc.local:
cd /root/datascience-tools/run ./ >> /var/log/boot-script
- Change "/root/datascience-tools" to where you cloned this repo
- http://[ip]:8888 for Jupyter
- http://[ip]:6006 for TensorBoard
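To confirm that both services are reachable after the container starts, a small port check can help. The following is a minimal sketch using only the Python standard library; the helper names `is_port_open` and `check_services` are hypothetical and not part of this image.

```python
import socket


def is_port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolvable hosts.
        return False


def check_services(host, ports=None):
    """Map each service name to whether its port accepts connections."""
    # Default ports are the ones documented above for Jupyter and TensorBoard.
    ports = {"Jupyter": 8888, "TensorBoard": 6006} if ports is None else ports
    return {name: is_port_open(host, port) for name, port in ports.items()}
```

Calling `check_services("localhost")` on the host returns a dictionary with one boolean per service. An open port only shows that something is listening, not that Jupyter has finished starting.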
- When this container starts, it runs:
- Jupyter Notebook server on port 8888
- TensorBoard server on port 6006
- A custom script located at /notebooks/
- If it doesn't exist, it is ignored
- If it exists, it is run once every time you start or restart the container
- You can use this script to run large batch processes on servers that may boot or shut down at random (as happens with AWS Spot Instances), so that the script can resume previous work when the server restarts
- Make sure your script saves partial results and resumes from them, so that computing time is not wasted
- On the host OS, you have to run this docker container with "--restart=always" so that it is started automatically during boot
- You can edit this file with the Jupyter editor
- Example script:
#!/bin/bash python
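The partial save/resume advice above could be implemented in many ways. Here is a minimal sketch, assuming the work is a list of independent items and that a JSON file on the mapped volume is an acceptable checkpoint. The function names and the checkpoint layout are hypothetical, not something this image provides.

```python
import json
import os
import tempfile


def load_checkpoint(path):
    """Return the saved state, or a fresh one if no checkpoint exists yet."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"next_item": 0, "results": []}


def save_checkpoint(path, state):
    """Write to a temp file, then rename, so an interrupted write
    does not corrupt the existing checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)  # atomic on POSIX within one filesystem


def run_batch(items, checkpoint_path, process):
    """Process items in order, skipping those finished before the last shutdown."""
    state = load_checkpoint(checkpoint_path)
    for i in range(state["next_item"], len(items)):
        state["results"].append(process(items[i]))
        state["next_item"] = i + 1
        save_checkpoint(checkpoint_path, state)  # persist after every item
    return state["results"]
```

A script started on each boot would call something like `run_batch(work_items, "/notebooks/output/checkpoint.json", process_one)`, where the path and both names are placeholders. Saving after every item is the simplest choice; for very cheap items, saving every N items would reduce disk writes at the cost of redoing up to N items after a restart.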
docker build . -f Dockerfile
docker build . -f Dockerfile-gpu
A good practice is to store your notebook scripts in a git repository.
Run the datascience-tools container and map the "/notebooks" volume inside the container to the path on your computer where you cloned the git repository.
You can edit, save, and run the scripts from the web interface (http://localhost:8888) or directly with other tools on your computer. Because the volume is mapped, you can commit and push your code to the repository directly, with no copying to or from the container.
    version: "3"
    services:
      datascience-tools:
        image: flaviostutz/datascience-tools
        ports:
          - 8888:8888
          - 6006:6006
        volumes:
          - /Users/flaviostutz/Documents/development/flaviostutz/puzzler/notebooks:/notebooks
- For running in production, create a new container with "FROM flaviostutz/datascience-tools" and add your script files to "/notebooks" so when you run the container it will have your custom scripts embedded into it. No "volume" mapping is needed for this container. During container startup, script /notebooks/ will run if present.
JUPYTER_TOKEN - token that users need to open Jupyter. Defaults to '', in which case the user is not asked for a token or password.
SPARK_MASTER - Spark master address. Used if you want to send jobs to an external Spark cluster and still control the whole job from Jupyter Notebook itself.
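Inside a notebook, these variables can be read from the process environment. The sketch below is one possible way to do that with the standard library. The helper names are hypothetical, and the `local[*]` fallback is an assumption of this example, not documented behavior of the image.

```python
import os
from urllib.parse import quote


def spark_master_from_env(env=None, default="local[*]"):
    """Return the Spark master URL from SPARK_MASTER, or a fallback.

    The "local[*]" fallback is an assumption made for this sketch.
    """
    env = os.environ if env is None else env
    value = env.get("SPARK_MASTER", "").strip()
    return value or default


def jupyter_url(host, env=None, port=8888):
    """Build the Jupyter URL, appending the token only when one is set."""
    env = os.environ if env is None else env
    token = env.get("JUPYTER_TOKEN", "")
    base = f"http://{host}:{port}/"
    # Jupyter accepts the token as a "token" query parameter.
    return f"{base}?token={quote(token)}" if token else base
```

With PySpark available, the first helper could feed `SparkSession.builder.master(spark_master_from_env())`. The second returns a link that opens Jupyter without a login prompt when a token is configured.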