Course: OCRopus

Introduction to OCRopus

OCRopus

  • open source document analysis and OCR system
  • written in C++
  • designed to handle all languages and scripts
  • use from the command line
  • scripting
  • use as library
  • Apache 2 license: open source and commercial use

Running OCRopus

Set Up Your Environment

export dgraphics=1
export OCROPUS=$HOME/ocropus
export PATH="$OCROPUS/ocroscript:$OCROPUS/scripts:$PATH"
export OCROSCRIPTS=$OCROPUS/ocroscript/scripts
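
Before running anything, it can help to confirm that the settings above took effect. The following is a minimal sketch, not part of OCRopus: check_ocropus_env is a hypothetical helper name, and it only assumes the OCROPUS and OCROSCRIPTS variables and the ocroscript command shown on this page.

```shell
# Hypothetical helper (not shipped with OCRopus): verify that the
# environment described above is in place. Prints one message per
# problem to stderr and returns non-zero if anything is missing.
check_ocropus_env() {
    rc=0
    if [ ! -d "${OCROPUS:-}" ]; then
        echo "OCROPUS is not set to an existing directory" >&2
        rc=1
    fi
    if [ ! -d "${OCROSCRIPTS:-}" ]; then
        echo "OCROSCRIPTS is not set to an existing directory" >&2
        rc=1
    fi
    if ! command -v ocroscript >/dev/null 2>&1; then
        echo "ocroscript not found on PATH" >&2
        rc=1
    fi
    return "$rc"
}
```

Calling check_ocropus_env after sourcing the settings gives a quick yes/no answer; the messages say which setting to revisit.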

Run OCRopus

ocroscript recognize alice_1.png > out.html
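
The single-page command above extends naturally to several pages. This is a rough sketch under stated assumptions: html_name, recognize_all, and the OCR_CMD override are hypothetical names introduced here for illustration, and the only OCRopus call used is the `ocroscript recognize` invocation shown above.

```shell
# Derive the output name by replacing the image extension with .html.
# Assumes the file name has an extension (e.g. alice_1.png).
html_name() {
    printf '%s\n' "${1%.*}.html"
}

# Run the recognizer once per image and write one HTML file per page.
# OCR_CMD is a hypothetical override, useful for a dry run
# (e.g. OCR_CMD=echo); it defaults to the command shown above.
recognize_all() {
    for img in "$@"; do
        ${OCR_CMD:-ocroscript recognize} "$img" > "$(html_name "$img")" || return 1
    done
}
```

For example, `recognize_all alice_1.png` writes alice_1.html next to the input image; further image files can be passed as extra arguments.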

View the Result

firefox out.html

OCRopus Functionality

  • image processing toolbox
  • image preprocessing and cleanup
  • layout analysis (standard and new algorithms)
  • text line recognition ("OCR")
  • statistical natural language processing

OCRopus Platforms

  • primary development platform: Ubuntu (current release)
  • code has few system dependencies
  • ported to
    • Debian, Fedora, ...
    • Windows
    • OS X
    • Solaris

Installing OCRopus

  • your distribution may already package it (apt-get install ...)
  • you can install a released version
    • download from ocropus.org
    • read the installation instructions
  • you can install from subversion
    • check out from code.google.com/p/ocropus (follow links from ocropus.org)
    • read the installation instructions

OCRopus external dependencies

  • iulib -- image processing library
  • libjpeg, libpng, libtiff -- used for image I/O
  • OpenFST -- weighted finite state transducers
  • Tesseract -- mature OCR system (mostly character recognition)
  • Leptonica -- some additional document processing functions (will be removed later)

IULib Directory Organization

  • colib -- basic data types (narray, etc.)
  • imgbits -- morphology and binary image processing
  • imgio -- image reading and writing in different formats
  • imglib -- image processing
  • utils -- utilities (including graphics display)
  • vidio -- video I/O

OCRopus Directory Organization

  • data
  • ext
    • lua, tolua -- scripting language (we eventually want to move to standard Lua 5)
    • voronoi -- Kise's contributed Voronoi layout analysis
  • ocr-* -- OCRopus library files
  • ocroscript -- OCRopus toplevel (Lua scripting)
    • tests -- unit tests
    • scripts -- top level commands
  • python-binding
    • Python bindings for OCRopus (via Lua)
  • utilities
    • various scripts for analyzing the distribution

OCRopus Directory Organization: ocr-* Libraries

  • ocr-binarize -- binarization methods
  • ocr-bpnet -- MLP-based text line recognizer
  • ocr-deskew-rast -- RAST-based page deskewing
  • ocr-doc-clean -- document cleanup-related functions
  • ocr-langmods -- OpenFST bindings, language models, beam search
  • ocr-layout-rast -- RAST-based layout analysis
  • ocr-leptonica -- Leptonica text/image segmentation
  • ocr-pageseg -- other layout analysis methods and text/image segmentation
  • ocr-samples -- simple sample implementations of some of the OCR interfaces
  • ocr-tesseract -- Tesseract bindings
  • ocr-utils -- common utility classes

Build Systems & Platforms

  • there are two build systems
    • autoconf/automake -- official, supported build system
    • scons -- used internally
  • supported platforms
    • Ubuntu Linux -- what we develop on; Subversion version should always compile and run
    • other Linux -- probably will compile and run without changes
    • OS X -- requires minor changes; we try to get it working around the time of releases (Subversion version may not always compile/run)
    • Windows -- requires different build system; we try to get this working around the time of releases (Subversion version may not always compile/run)

Documentation

  • we're working on it
    • documentation on ocropus.org
    • this wiki
    • doxygen-extracted API level reference
    • please contribute HOWTOs etc.
  • where you should look
    • ocroscript/tests -- unit tests in Lua
    • ocroscript/scripts -- top level commands
    • ocroscript/*.pkg -- functionality exposed through Lua


