Course: OCRopus

Course Outline

Document Analysis with the OCRopus OCR Platform


Intended Audience:

students and researchers interested in the state of the art in document analysis, including OCR and layout analysis; students, researchers, and industrial developers interested in using the OCRopus open source OCR system as a platform for research and development

Prerequisites for participants:

some knowledge of pattern recognition, machine learning, image processing (including C/C++ programming and/or scripting in languages like Matlab, Python, or Lua)

Preliminaries

  • Overview of OCRopus: a general overview of the OCRopus architecture
    • goals, information flow, general architecture
  • Introduction to OCRopus: basic information about OCRopus
    • installing, running, and extending OCRopus; Lua scripting; writing C++ extensions

Developing With and For OCRopus

  • Scripting OCRopus in Lua
    • introduction to Lua scripting and Lua modules available from within OCRopus
  • Extending OCRopus in C++
    • writing code using OCRopus as a library, and writing new OCRopus components; building, linking; overview of basic and core data structures
  • Debugging Graphics (dgraphics)
    • simple image display functions
  • The C/Lua Embedding
    • how to embed OCRopus in C programs, bind it to other scripting languages, etc.

Image Processing

  • Image Processing: general-purpose linear and non-linear image processing functions
    • Gaussian filters, FIR/IIR/FFT filters, feature detection, median and rank filters, edge detection, corner detection
  • Binary and Morphological Image Processing: efficient document image processing and cleanup using fast morphology
    • morphological decomposition, bit-blit and run-length based implementations, output complexity, marker morphology, applications to document cleanup and layout analysis
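
As a rough illustration of the run-length based implementations listed under Binary and Morphological Image Processing, here is a minimal Python sketch of one-dimensional erosion on a run-length encoded row. The names `runs_of_ones` and `erode_row` are hypothetical and are not part of the OCRopus API; the sketch assumes a centered horizontal structuring element of odd width and treats pixels outside the row as background.

```python
def runs_of_ones(row):
    """Encode a binary row as a list of half-open [start, end) runs of 1-pixels."""
    runs, start = [], None
    for i, value in enumerate(row):
        if value and start is None:
            start = i                      # a new run begins here
        elif not value and start is not None:
            runs.append((start, i))        # the current run ended at i
            start = None
    if start is not None:
        runs.append((start, len(row)))     # run reaches the end of the row
    return runs


def erode_row(row, width):
    """Erode a binary row with a centered horizontal element of odd width.

    Illustrative helper only: each run of length n shrinks by width // 2
    pixels on both sides, and runs shorter than `width` disappear.
    """
    if width % 2 == 0:
        raise ValueError("this sketch assumes an odd structuring-element width")
    r = width // 2
    out = [0] * len(row)
    for start, end in runs_of_ones(row):
        for i in range(start + r, end - r):
            out[i] = 1
    return out
```

The work is proportional to the number of runs plus the number of surviving pixels rather than to the row length times the element width, which is the reason run-length representations are attractive for large, mostly white document images.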

The OCR Interfaces

  • OCR Interfaces: interface classes to OCR components
    • pluggable algorithms and extensibility
  • Binarization: automatic and adaptive document image binarization and segmentation
    • review of standard document binarization methods, fast integral-image based implementations of adaptive binarization methods
  • Skew Detection: skew detection and correction, dewarping
    • overview of skew detection methods based on the docstrum, Fourier transforms, mathematical morphology, line fitting, and RANSAC; detailed description of robust least-squares matching for skew correction
  • Text-Image Segmentation: text/image segmentation in documents
    • local image features used for pixel and region classification; classification based on logistic regression and decision trees
  • Layout Analysis: geometric document layout analysis
    • review of morphological layout analysis, docstrum, XY cut, Voronoi methods, whitespace methods, and computational geometry methods
  • Isolated Character Recognition: the isolated character recognition interfaces
    • usage and function of the interface
  • Text Line Recognition: the text line recognition interfaces
    • usage and function of the interface
  • WFST Operations and Basic Language Modeling
    • weighted finite state transducers, composition, search, relationship to HMMs; "classical" language models, including n-graphs, n-grams, dictionaries, and tries; expression of classical language models in weighted finite state transducers
  • hOCR Output Generation
    • DOM generation, DOM scripting for layout analysis; alternative output formats
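
To make the integral-image idea from the Binarization item concrete, the following Python sketch thresholds each pixel against its local mean, computed in constant time per pixel from a summed-area table. It is only an illustration under stated assumptions: the function names, the window `radius`, and the sensitivity `t` are hypothetical, the local-mean rule is a simple stand-in for the adaptive methods covered in the course, and none of this is the OCRopus implementation.

```python
def integral_image(img):
    """Return a (h+1) x (w+1) summed-area table for a 2D list of gray values."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii


def adaptive_binarize(img, radius=1, t=0.15):
    """Mark a pixel as ink (1) if it is darker than (1 - t) times its local mean.

    Assumed parameters: `radius` is the half-size of the square window and
    `t` is the relative darkness required; both are illustrative defaults.
    """
    h, w = len(img), len(img[0])
    ii = integral_image(img)
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        y0, y1 = max(0, y - radius), min(h, y + radius + 1)
        for x in range(w):
            x0, x1 = max(0, x - radius), min(w, x + radius + 1)
            area = (y1 - y0) * (x1 - x0)
            # Window sum from four table lookups, independent of window size.
            total = ii[y1][x1] - ii[y0][x1] - ii[y1][x0] + ii[y0][x0]
            if img[y][x] * area < total * (1 - t):
                out[y][x] = 1
    return out
```

Because the window sum costs four lookups regardless of `radius`, the running time depends only on the image size, which is what makes large adaptive windows practical on full-page scans.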

More on Algorithms

  • OCR by Oversegmentation: text line recognition using oversegmentation
    • recognition by oversegmentation, grouping, and segmentation graphs; different methods for oversegmentation (upper contour, skeletal graph, dynamic programming); statistically sound estimation of segmentation costs and posterior probabilities; relationship to speech recognition and HMM-based methods
  • Adaptation of Character Recognizers: improving performance by taking advantage of style consistency within documents
    • parametric adaptation, script and font recognition, shape dictionary augmentation, shape clustering, co-training, stochastic gradient descent, mixture models, hierarchical Bayesian methods
  • Adaptation of Language Models: improving performance by taking advantage of language model consistency within documents
    • language recognition, dictionary augmentation, language model adaptation

More on Extending

  • Integrating Your Own OCR and Layout
    • interfaces and strategies for integrating your own OCR and layout analysis code into OCRopus, for evaluation, testing, and production use
  • Future Work and Contributions
    • some hints about current status and areas of work; relationship to published methods; parametric adaptation; requests for contributions

Additional Topics


  • Statistical Layout Analysis: statistical document layout analysis
    • an overview of statistical approaches to layout analysis, including MRFs, stochastic 2D grammars, 2D HMMs, turbo decoding, and structural mixture models
  • Shape-Based Recognition: shape clustering for OCR
    • clustering methods for OCR, both for style modeling and for speeding up OCR engines; contour-based methods; shape matching-based methods; statistically sound modeling of shape variations
  • Advanced Statistical Language Modeling: statistical language modeling for OCR using WFSTs and the OpenFST library
    • smoothing; representation of recognition alternatives; integration with statistical natural language post-processing, including information extraction; use of WFST models for training OCR with misaligned ground truth
  • Camera-Based Dewarping Methods
    • brief overview of image dewarping methods and camera-based methods
  • More Performance Evaluation
    • an overview of layout performance evaluation metrics; color-based pixel-accurate ground truth representation; bipartite matching and computation of vectorial performance scores; evaluation of segmentation performance; evaluation of character recognition performance
  • Document Image Degradation: artificially generating degraded document images
    • degradation models; applications to training data generation
  • Performance Evaluation: benchmarking and performance evaluation
    • edit distance computations; hOCR tools
  • Constrained Text Line Finding for Layout Analysis
    • a general-purpose layout analysis method that you can easily customize
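
The edit distance computations named under Performance Evaluation can be sketched with the standard Levenshtein dynamic program. The helper names `edit_distance` and `char_error_rate` below are hypothetical and unrelated to any OCRopus or hOCR tool; the sketch assumes unit costs for insertion, deletion, and substitution, and a non-empty ground-truth string for the error rate.

```python
def edit_distance(a, b):
    """Levenshtein distance between sequences a and b with unit costs."""
    prev = list(range(len(b) + 1))         # distances from the empty prefix of a
    for i, ca in enumerate(a, 1):
        cur = [i]                          # deleting the first i symbols of a
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1]


def char_error_rate(ocr_text, truth):
    """Edit distance normalized by ground-truth length (assumes truth is non-empty)."""
    return edit_distance(ocr_text, truth) / len(truth)
```

For example, `edit_distance("kitten", "sitting")` evaluates to 3. Keeping only two rows of the table uses memory proportional to the shorter dimension, which matters when whole pages of OCR output are compared against ground truth.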
