Course: OCRopus

Overview of OCRopus

Home Page


www.ocropus.org

  • documentation, source code
  • project pages
  • list of contributors
  • list of resources

OCRopus License


OCRopus is distributed under the Apache 2 license
  • very widely used license
  • similar to BSD license
  • far more liberal than GPL/LGPL
  • use it for open source projects
  • use it for closed source products
  • if you redistribute it, you need to keep the license text and attribution notices
  • if you sue claiming that the software infringes your patents, the patent license granted to you terminates
  • see the license for details

Target Applications

  • document indexing and search
  • on-line reading
  • re-typesetting
  • editing
  • information extraction

State of OCR

  • OCR is lots of different things...
    • physical layout analysis
    • logical layout analysis
    • text recognition
    • font recognition, style recognition
    • math recognition
    • table recognition
  • all of these problems are unsolved for unrestricted input
  • what works fairly well...
    • physical layout analysis for simple layouts
    • text recognition for alphabetic languages and clean inputs
  • OCR is harder than speech recognition
    • expectation of 100% accuracy
    • human-designed medium, rather than natural medium

State of Commercial OCR




Commercial OCR Use Cases

  • primary use cases
    • desktop scanning, some bill and mail processing
  • layout analysis
    • rule-based, not trainable, prone to catastrophic failures
  • character recognition
    • ad-hoc classifier combination, speed-oriented
  • language modeling
    • dictionary lookup, backtracking
  • adaptation
    • per-page “retraining”, dictionary augmentation
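
To make "dictionary lookup, backtracking" concrete, here is a minimal sketch of the general idea. It is an illustration only, not the method of any particular commercial engine. It assumes the character classifier returns, for each position, a few candidate characters with non-negative costs (lower is better). The function name `best_dictionary_word` and the data layout are hypothetical.

```python
def best_dictionary_word(candidates, dictionary):
    """Pick the cheapest candidate sequence that spells a dictionary word.

    candidates: list with one entry per character position; each entry is a
                list of (char, cost) pairs with cost >= 0 (assumed format).
    dictionary: set of allowed words.
    Returns (word, total_cost), or None if no candidate path is a word.
    """
    # All prefixes of all dictionary words, used to abandon dead ends early.
    prefixes = {w[:k] for w in dictionary for k in range(len(w) + 1)}
    best = None

    def search(pos, prefix, cost):
        nonlocal best
        if best is not None and cost >= best[1]:
            return  # this path cannot beat the best word found so far
        if pos == len(candidates):
            if prefix in dictionary:
                best = (prefix, cost)
            return
        # Try the cheapest candidates first; returning from the recursive
        # call and moving to the next candidate is the backtracking step.
        for ch, c in sorted(candidates[pos], key=lambda t: t[1]):
            if prefix + ch in prefixes:
                search(pos + 1, prefix + ch, cost + c)

    search(0, "", 0.0)
    return best


# Toy input: the classifier prefers "1" and "0" over "l" and "o".
toy_candidates = [
    [("c", 0.1), ("e", 0.3)],
    [("1", 0.1), ("l", 0.2)],
    [("0", 0.1), ("o", 0.4)],
    [("s", 0.1)],
    [("e", 0.1)],
]
result = best_dictionary_word(toy_candidates, {"close", "else"})
```

In the toy input the raw top choices spell "c10se", but the dictionary constraint steers the search to "close". The sketch also shows why this approach struggles with names and identifiers: a string that is not in the dictionary yields no result at all.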

OCR Tradeoffs

  • throughput and memory requirements
  • average error rate
  • frequency of catastrophic failures
  • types of errors
    • layout analysis errors
    • OCR errors correctable by language models
    • OCR errors on names / identifiers
  • cost
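
Two of these tradeoffs, average error rate and frequency of catastrophic failures, can be measured separately. The sketch below assumes the usual definition of character error rate (edit distance divided by reference length). The 0.5 cutoff for calling a page "catastrophic" is an arbitrary assumption for illustration, and the function names are hypothetical.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: insertions, deletions, substitutions cost 1."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # delete r from the reference
                curr[j - 1] + 1,         # insert h into the reference
                prev[j - 1] + (r != h),  # substitute, or match at cost 0
            ))
        prev = curr
    return prev[-1]


def character_error_rate(ref: str, hyp: str) -> float:
    """Edit distance normalized by the length of the reference text."""
    if not ref:
        raise ValueError("reference text must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


def summarize(pages, catastrophic_threshold=0.5):
    """pages: list of (reference, ocr_output) pairs, one per page.

    Returns (mean error rate, fraction of pages at or above the threshold).
    The threshold is an illustrative assumption, not a standard value.
    """
    rates = [character_error_rate(ref, hyp) for ref, hyp in pages]
    mean_rate = sum(rates) / len(rates)
    catastrophic = sum(r >= catastrophic_threshold for r in rates) / len(rates)
    return mean_rate, catastrophic


mean_rate, catastrophic_fraction = summarize([
    ("abcd", "abcd"),  # perfect page
    ("abcd", "wxyz"),  # every character wrong
])
```

Reporting both numbers matters because a system can have a low average error rate while still failing completely on a small share of pages.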

OCRopus Architecture



OCRopus Goals


  • performance
    • significantly reduce average character error rate
    • greatly reduce undetected catastrophic failures
    • meet production throughput/memory requirements
  • functionality
    • any script, any language
    • pluggable architecture
    • testable architecture
    • fully statistical foundation
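
As a rough picture of what "pluggable" and "testable" can mean in practice, the sketch below treats each processing stage as an interchangeable function. The stage signatures and names here are hypothetical and are not the OCRopus interfaces. Pages and lines are represented as plain strings so that the example runs without any image library.

```python
from typing import Callable, List

# Hypothetical stage types, for illustration only.
Segmenter = Callable[[List[str]], List[str]]  # page rows -> text-line "images"
Recognizer = Callable[[str], str]             # one line "image" -> text


def run_pipeline(page: List[str], segment: Segmenter,
                 recognize: Recognizer) -> List[str]:
    """Run layout analysis, then recognize each line it returns."""
    return [recognize(line) for line in segment(page)]


def drop_blank_rows(page: List[str]) -> List[str]:
    """Toy segmenter: every non-blank row counts as one text line."""
    return [row for row in page if row.strip()]


def fake_recognizer(line: str) -> str:
    """Stub recognizer for tests: returns the input in upper case."""
    return line.upper()


output = run_pipeline(["ab", "", "cd"], drop_blank_rows, fake_recognizer)
```

Because each stage is passed in as a parameter, a stub like `fake_recognizer` can stand in for a real component, which lets the layout stage be tested in isolation.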


