The objective of this tutorial is to introduce students to methods and tools for large-scale training and inference for OCR and content analysis.
- discussion of deep learning for OCR: text recognition, segmentation, preprocessing, table analysis
- different kinds of self-supervised, semi-supervised, and unsupervised training for OCR
- representation of big OCR datasets using WebDataset, shards, and OCR
- large scale data processing with Docker and Kubernetes
The tutorial will consist of a presentation/introduction and hands-on experimentation.
The entire runtime, packages, Python installations, and utilities needed for the tutorial are packaged up in a single Docker image called "tmbdev/ocr2021". If you have Docker running on your machine, you should be able to run this image.
Please set up Docker on your machine before the start of the tutorial so that you can get started right away.
Ideally, you should also have a GPU available to you, though you can run the scripts (albeit slowly) without a GPU.
You may be able to use Docker from your Ubuntu (or other Linux) distro directly using the built-in docker package, though the docker.io package is preferable. Ideally, you also set up your machine to support NVIDIA GPUs.
- NVIDIA documentation for Docker.
- Gist showing docker.io and nvidia-docker2 installation
- On Ubuntu, you can install Kubernetes with
snap install microk8s
You can install Ubuntu on Windows by using the free VirtualBox software. That software works on Windows, Linux, and OS X. After installing VirtualBox, you can install Ubuntu 20.04 directly in it. There is no GPU suppport with this setup, however.
Instead of VirtualBox, you can use Microsoft WSL to run Unbuntu on Windows. Microsoft also provides a version of Docker that integrates with both Windows and WSL and supports Kubernetes. Be sure to install Ubuntu (20.04) for WSL. GPU computing is supported by WSL and Docker on WSL.
The steps for making this work are: (1) install WSL on Windows, (2) upgrade to WSL2, (3) install Docker for Windows (Docker info, Microsoft Documentation), (4) enable Kubernetes in Docker for Windows. For CUDA support, install NVIDIA drivers for Windows and NVIDIA drivers for WSL (CUDA on WSL).
Here is the guide for installing Docker on OSX. You need an x86 Mac for this to work. There does not seem to be CUDA support on OSX.
Installing VirtualBox is another alternative.
If you have Linux/Ubuntu installed on your desktop but are working on a laptop, consider just installing Docker on your desktop and accessing your desktop remotely via VNC.