Document Analysis with the OCRopus OCR Platform
Intended Audience:
students and researchers interested in the state of the art in document analysis, including OCR and layout analysis; students, researchers, and industrial developers interested in using the OCRopus open source OCR system as a platform for research and development
Prerequisites for participants:
some knowledge of pattern recognition, machine learning, and image processing, including C/C++ programming and/or scripting in languages like Matlab, Python, or Lua
Preliminaries
- Overview of OCRopus: a general overview of the OCRopus architecture
- goals, information flow, general architecture
- Introduction to OCRopus: basic information about OCRopus
- installing, running, and extending OCRopus; Lua scripting; writing C++ extensions
Developing With and For OCRopus
- Scripting OCRopus in Lua
- introduction to Lua scripting and Lua modules available from within OCRopus
- Extending OCRopus in C++
- writing code using OCRopus as a library, and writing new OCRopus components; building and linking; overview of the core data structures
- Debugging Graphics (dgraphics)
- simple image display functions
- The C/Lua Embedding
- how to embed OCRopus in C programs, bind it to other scripting languages, etc.
Image Processing
- Image Processing: general purpose linear and non-linear image processing functions
- Gaussian filters, FIR/IIR/FFT filters, feature detection, median and rank filters, edge detection, corner detection
- Binary and Morphological Image Processing: efficient document image processing and cleanup using fast morphology
- morphological decomposition, bit-blit and run-length based implementations, output complexity, marker morphology, applications to document cleanup and layout analysis
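As a small illustration of the morphological decomposition topic in this session, the sketch below erodes a binary image with a rectangular structuring element by running a horizontal 1D pass followed by a vertical 1D pass, and compares the result with a direct 2D erosion. It is a minimal sketch under stated assumptions: the function names are hypothetical and are not part of the OCRopus API, the image is a plain list of 0/1 rows rather than a bit-blit or run-length representation, and pixels outside the image are treated as background.

```python
def erode_1d(row, width):
    """Erode a 0/1 sequence with a centered flat segment of odd length `width`.

    Positions whose window would reach outside the sequence become 0,
    i.e. everything outside the image counts as background.
    """
    r = width // 2
    n = len(row)
    out = []
    for i in range(n):
        lo, hi = i - r, i + r
        if lo < 0 or hi >= n:
            out.append(0)
        else:
            out.append(1 if all(row[lo:hi + 1]) else 0)
    return out


def transpose(img):
    """Swap rows and columns of a list-of-lists image."""
    return [list(col) for col in zip(*img)]


def erode_rect(img, width, height):
    """Erode with a width x height rectangle, decomposed into two 1D passes."""
    horiz = [erode_1d(row, width) for row in img]
    vert = [erode_1d(col, height) for col in transpose(horiz)]
    return transpose(vert)


def erode_rect_direct(img, width, height):
    """Reference implementation: test the full 2D window at every pixel."""
    rows, cols = len(img), len(img[0])
    ry, rx = height // 2, width // 2
    out = [[0] * cols for _ in range(rows)]
    for y in range(ry, rows - ry):
        for x in range(rx, cols - rx):
            out[y][x] = int(all(
                img[yy][xx]
                for yy in range(y - ry, y + ry + 1)
                for xx in range(x - rx, x + rx + 1)
            ))
    return out
```

The decomposed version inspects roughly width + height pixels per output pixel instead of width * height, which is the reason decomposition matters for the large structuring elements used in document cleanup.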
The OCR Interfaces
- OCR Interfaces: interface classes to OCR components
- pluggable algorithms and extensibility
- Binarization: automatic and adaptive document image binarization and segmentation
- review of standard document binarization methods, fast integral-image based implementations of adaptive binarization methods
- Skew Detection: skew detection and correction, dewarping
- overview of skew detection methods based on the docstrum, Fourier transforms, mathematical morphology, line fitting, and RANSAC; detailed description of robust least-squares matching for skew correction
- Text-Image Segmentation: text/image segmentation in documents
- local image features used for pixel and region classification, logistic regression and decision tree based classification
- Layout Analysis: geometric document layout analysis
- review of morphological layout analysis, docstrum, XY cut, Voronoi methods, whitespace methods, and computational geometry methods
- Isolated Character Recognition: the isolated character recognition interfaces
- usage and function of the interface
- Text Line Recognition: the text line recognition interfaces
- usage and function of the interface
- WFST Operations and Basic Language Modeling
- weighted finite state transducers, composition, search, relationship to HMMs; "classical" language models, including n-graphs, n-grams, dictionaries, and tries; expression of classical language models in weighted finite state transducers
- hOCR Output Generation
- DOM generation, DOM scripting for layout analysis; alternative output formats
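To make the "fast integral-image based implementations of adaptive binarization methods" item from the Binarization session concrete, here is a minimal sketch of one well-known adaptive method, Sauvola thresholding, computed with integral images so that each local mean and variance costs a constant number of lookups. The function names and the default parameters (`radius`, `k`, `R`) are illustrative assumptions, not OCRopus code or settings, and the outline does not say which adaptive method the tutorial covers.

```python
import math


def integral_image(img):
    """Return a (rows+1) x (cols+1) table with ii[y][x] = sum of img[0:y][0:x]."""
    rows, cols = len(img), len(img[0])
    ii = [[0] * (cols + 1) for _ in range(rows + 1)]
    for y in range(rows):
        row_sum = 0
        for x in range(cols):
            row_sum += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii


def window_sum(ii, y0, x0, y1, x1):
    """Sum of the pixels in rows y0..y1-1 and columns x0..x1-1, in O(1)."""
    return ii[y1][x1] - ii[y0][x1] - ii[y1][x0] + ii[y0][x0]


def sauvola_binarize(img, radius=7, k=0.3, R=128.0):
    """Return a 0/1 image where 1 marks pixels darker than the local threshold.

    Threshold per pixel: mean * (1 + k * (std / R - 1)), with mean and std
    taken over a (2*radius+1)^2 window clipped to the image borders.
    """
    rows, cols = len(img), len(img[0])
    ii = integral_image(img)
    ii_sq = integral_image([[v * v for v in row] for row in img])
    out = [[0] * cols for _ in range(rows)]
    for y in range(rows):
        y0, y1 = max(0, y - radius), min(rows, y + radius + 1)
        for x in range(cols):
            x0, x1 = max(0, x - radius), min(cols, x + radius + 1)
            n = (y1 - y0) * (x1 - x0)
            mean = window_sum(ii, y0, x0, y1, x1) / n
            var = window_sum(ii_sq, y0, x0, y1, x1) / n - mean * mean
            std = math.sqrt(max(var, 0.0))  # guard against rounding below zero
            threshold = mean * (1.0 + k * (std / R - 1.0))
            out[y][x] = 1 if img[y][x] < threshold else 0
    return out
```

Because the two integral images are built once, the running time is proportional to the number of pixels and does not grow with the window size.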
More on Algorithms
- OCR by Oversegmentation: text line recognition using oversegmentation
- recognition by oversegmentation, grouping, and segmentation graphs; different methods for oversegmentation (upper contour, skeletal graph, dynamic programming); statistically sound estimation of segmentation costs and posterior probabilities; relationship to speech recognition and HMM-based methods
- Adaptation of Character Recognizers: improving performance by taking advantage of style consistency within documents
- parametric adaptation, script and font recognition, shape dictionary augmentation, shape clustering, co-training, stochastic gradient descent, mixture models, hierarchical Bayesian methods
- Adaptation of Language Models: improving performance by taking advantage of language model consistency within documents
- language recognition, dictionary augmentation, language model adaptation
More on Extending
- Integrating Your Own OCR and Layout
- interfaces and strategies for integrating your own OCR and layout analysis code into OCRopus, for evaluation, testing, and production use
- Future Work and Contributions
- some hints about current status and areas of work; relationship to published methods; parametric adaptation; requests for contributions
- Statistical Layout Analysis: statistical document layout analysis
- an overview of statistical approaches to layout analysis, including MRFs, stochastic 2D grammars, 2D HMMs, turbo decoding, and structural mixture models
- Shape Based Recognition: shape clustering for OCR
- clustering methods for OCR, both for style modeling and for speeding up OCR engines; contour-based methods; shape matching-based methods; statistically sound modeling of shape variations
- Advanced Statistical Language Modeling: statistical language modeling for OCR using WFSTs and the OpenFst library
- smoothing; representation of recognition alternatives; integration with statistical natural language post-processing, including information extraction; use of WFST models for training OCR with misaligned ground truth
- Camera-Based Dewarping Methods
- brief overview of image dewarping methods and camera-based methods
- More Performance Evaluation
- an overview of layout performance evaluation metrics; color-based pixel-accurate ground truth representation; bipartite matching and computation of vectorial performance scores; evaluation of segmentation performance; evaluation of character recognition performance
- Document Image Degradation: artificially generating degraded document images
- degradation models; applications to training data generation
- Performance Evaluation: benchmarking and performance evaluation
- edit distance computations; hOCR tools
- Constrained Text Line Finding for Layout Analysis
- a general-purpose layout analysis method that you can easily customize
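The edit distance computations listed under Performance Evaluation can be illustrated with a short sketch of the standard Levenshtein distance and a character error rate derived from it. This is a generic textbook formulation, not the implementation used by OCRopus or by the hOCR tools; the function names are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (unit cost per edit).

    Uses the usual dynamic program but keeps only one previous row,
    so memory is proportional to len(hyp).
    """
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,             # delete r from ref
                cur[j - 1] + 1,          # insert h into ref
                prev[j - 1] + (r != h),  # match or substitute
            ))
        prev = cur
    return prev[-1]


def character_error_rate(ref, hyp):
    """Edit distance normalized by the length of the ground-truth text."""
    if not ref:
        raise ValueError("reference text must not be empty")
    return edit_distance(ref, hyp) / len(ref)
```

For example, recognizing "document" as "docurnent" (a typical "m" versus "rn" confusion) costs one substitution and one insertion, giving a character error rate of 2/8 = 0.25.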