Skip to content

ocropus-archive/DUP-ocr2021

Repository files navigation

Large Scale Processing and Training for OCR, Content Analysis, and Information Extraction

The objective of this tutorial is to introduce students to methods and tools for large-scale training and inference for OCR and content analysis.

  • discussion of deep learning for OCR: text recognition, segmentation, preprocessing, table analysis
  • different kinds of self-supervised, semi-supervised, and unsupervised training for OCR
  • representation of big OCR datasets using WebDataset, shards, and OCR
  • large scale data processing with Docker and Kubernetes

The tutorial will consist of a presentation/introduction and hands-on experimentation.

Setting Up your Machine

The entire runtime, packages, Python installations, and utilities needed for the tutorial are packaged up in a single Docker image called "tmbdev/ocr2021". If you have Docker running on your machine, you should be able to run this image.

Please set up Docker on your machine before the start of the tutorial so that you can get started right away.

Ideally, you should also have a GPU available to you, though you can run the scripts (albeit slowly) without a GPU.

Setup on Linux

You may be able to use Docker from your Ubuntu (or other Linux) distro directly using the built-in docker package, though the docker.io package is preferable. Ideally, you also set up your machine to support NVIDIA GPUs.

Setup on Windows with VirtualBox

You can install Ubuntu on Windows by using the free VirtualBox software. That software works on Windows, Linux, and OS X. After installing VirtualBox, you can install Ubuntu 20.04 directly in it. There is no GPU suppport with this setup, however.

Setup on Windows with WSL

Instead of VirtualBox, you can use Microsoft WSL to run Unbuntu on Windows. Microsoft also provides a version of Docker that integrates with both Windows and WSL and supports Kubernetes. Be sure to install Ubuntu (20.04) for WSL. GPU computing is supported by WSL and Docker on WSL.

The steps for making this work are: (1) install WSL on Windows, (2) upgrade to WSL2, (3) install Docker for Windows (Docker info, Microsoft Documentation), (4) enable Kubernetes in Docker for Windows. For CUDA support, install NVIDIA drivers for Windows and NVIDIA drivers for WSL (CUDA on WSL).

Setup on OSX

Here is the guide for installing Docker on OSX. You need an x86 Mac for this to work. There does not seem to be CUDA support on OSX.

Installing VirtualBox is another alternative.

Remote Access to Your Desktop

If you have Linux/Ubuntu installed on your desktop but are working on a laptop, consider just installing Docker on your desktop and accessing your desktop remotely via VNC.

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages