extensively studied in speech recognition, some statistical NLP
techniques are directly applicable to OCR, but results are not
techniques that don't work for speech may work well for OCR
statistics of OCR are very different
Simple Techniques for Adaptation of Language Models
language recognition
recognize the language using fast, heuristic methods
dictionary augmentation
very commonly used in OCR systems
after a first round of recognition with a generic language model, pick any new words found in the recognized text whose likelihood is above a given threshold and add them to the (dictionary-based) language model
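A minimal sketch of the augmentation step, assuming the first recognition pass yields (word, likelihood) pairs and the dictionary-based model can be represented as a plain set of words. The function name, the data, and the threshold value are hypothetical and chosen only for illustration.

```python
def augment_dictionary(dictionary, recognized_words, threshold):
    """Return a copy of `dictionary` extended with confident new words.

    recognized_words: iterable of (word, likelihood) pairs from the
    first recognition pass (assumed format, not a real OCR API).
    """
    augmented = set(dictionary)
    for word, likelihood in recognized_words:
        # Only words that are new AND strictly above the threshold are added.
        if word not in dictionary and likelihood > threshold:
            augmented.add(word)
    return augmented


# Hypothetical first-pass output: two confident new words, one doubtful one.
first_pass = [("the", 0.99), ("tesseract", 0.93), ("qvx", 0.40), ("ocropus", 0.88)]
dictionary = {"the", "of", "and"}

augmented = augment_dictionary(dictionary, first_pass, threshold=0.85)
print(sorted(augmented))
```

A second recognition pass would then run with the augmented dictionary; the low-likelihood string "qvx" stays out of it.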
MAP language recognition
given a set of language models and a first-round recognized text, pick the "best" language model according to its posterior probability given that text, and repeat recognition
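The selection step can be sketched as follows, assuming each language model is a unigram word distribution stored as a dict, unknown words receive a small floor probability, and a prior over models is available. All names, probabilities, and the floor value are illustrative assumptions.

```python
import math


def log_likelihood(model, words, floor=1e-6):
    """Log-probability of `words` under a unigram model (dict word -> prob).

    Words missing from the model get the assumed `floor` probability.
    """
    return sum(math.log(model.get(w, floor)) for w in words)


def map_select(models, priors, words):
    """Return (name of the MAP model, dict of unnormalized log posteriors)."""
    scores = {
        name: math.log(priors[name]) + log_likelihood(model, words)
        for name, model in models.items()
    }
    return max(scores, key=scores.get), scores


# Toy models and a noisy first-round text (all values are made up).
models = {
    "english": {"the": 0.5, "and": 0.3, "cat": 0.2},
    "german": {"der": 0.5, "und": 0.3, "katze": 0.2},
}
priors = {"english": 0.5, "german": 0.5}
text = ["der", "katze", "und", "the"]

best, scores = map_select(models, priors, text)
print(best)
```

Recognition would then be repeated with the selected model. Working in log space avoids underflow on longer texts.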
language model mixture
given a set of language models, pick mixture weights and use the resulting weighted combination as the language model (e.g., by optimizing the likelihood of the first-round recognized text and then repeating recognition)
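One common way to optimize the likelihood of the first-round text is expectation-maximization over the mixture weights. The sketch below assumes unigram models stored as dicts, a floor probability for unknown words, and a non-empty text; the slides do not specify the optimizer, so EM is an assumption here, as are all names and numbers.

```python
def em_mixture_weights(models, words, iterations=50, floor=1e-6):
    """Estimate mixture weights that maximize the likelihood of `words`.

    models: list of unigram models (dict word -> prob).
    words: non-empty list of first-round recognized words.
    """
    k = len(models)
    weights = [1.0 / k] * k  # start from a uniform mixture
    for _ in range(iterations):
        totals = [0.0] * k
        for w in words:
            # E-step: responsibility of each model for this word.
            joint = [weights[i] * models[i].get(w, floor) for i in range(k)]
            norm = sum(joint)
            for i in range(k):
                totals[i] += joint[i] / norm
        # M-step: new weights are the average responsibilities.
        weights = [t / len(words) for t in totals]
    return weights


def mixture_prob(models, weights, word, floor=1e-6):
    """Probability of `word` under the weighted mixture of models."""
    return sum(lam * m.get(word, floor) for lam, m in zip(weights, models))


# Toy example: a general model and a domain model (values are made up).
general = {"the": 0.6, "of": 0.4}
legal = {"the": 0.2, "plaintiff": 0.5, "court": 0.3}
text = ["the", "plaintiff", "court", "the", "plaintiff"]

weights = em_mixture_weights([general, legal], text)
print(weights)
```

In this toy text the domain-specific words dominate, so the weight of the second model ends up larger; the mixture from `mixture_prob` would then serve as the language model for the repeated recognition pass.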