Course: OCRopus

hOCR Output Generation

  • OCRopus outputs the result of physical layout analysis as a DOM tree
  • the structure is roughly:
    • ocr_page
      • ocr_column
        • ocr_line
          • actual text
  • this DOM tree can then be post-processed for logical layout analysis
    • right now, this is a purely rule-based process ("if spacing > 1.5 x-height and ... then ...")
    • we've made it comparatively easy by making it scriptable
  • in the future
    • we plan to apply machine learning to this problem
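
The kind of rule quoted above ("if spacing > 1.5 x-height ...") can be sketched in plain Lua. This is only an illustration, not the OCRopus implementation: the table layout (lines with y0, y1 and text fields), the function name mark_paragraphs, and the fixed x-height passed in are all assumptions, and the elided part of the rule is left out, so only the spacing test is applied.

```lua
-- Hypothetical sketch: group the lines of one column into paragraphs.
-- Each line is a table with y0 (top) and y1 (bottom) in pixels, plus its text.
local function mark_paragraphs(lines, xheight, factor)
    factor = factor or 1.5
    local paragraphs = {}
    local current = nil
    for idx, line in ipairs(lines) do
        -- Vertical gap between this line and the previous one (0 for the first line).
        local gap = 0
        if idx > 1 then
            gap = line.y0 - lines[idx - 1].y1
        end
        -- Rule: a gap larger than factor * x-height starts a new paragraph.
        if current == nil or gap > factor * xheight then
            current = {}
            paragraphs[#paragraphs + 1] = current
        end
        current[#current + 1] = line.text
    end
    return paragraphs
end

local column = {
    { y0 = 100, y1 = 120, text = "First line of a paragraph," },
    { y0 = 126, y1 = 146, text = "continued here." },
    { y0 = 180, y1 = 200, text = "A new paragraph after a large gap." },
}
local paras = mark_paragraphs(column, 12)
print(#paras) --> 2
```

With an x-height of 12 the threshold is 18 pixels, so the 6-pixel gap keeps the second line in the first paragraph and the 34-pixel gap opens a second one. Because the rule is an ordinary function over a table, changing the threshold or adding further conditions is a small script edit.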

Toplevel Loop


NB: the actual code is still somewhat different; we're cleaning it up.

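-- Assumes the OCRopus Lua bindings are loaded and that recognizer, fst, text,
-- page_image, page_seg, and line_image have been set up beforehand; the steps
-- that fill page_image and page_seg for each page are omitted here.
dom = HocrDom()
pageno = 1
for i = 1, #arg do
    -- each command-line argument is a page specification
    pages = Pages()
    pages:parseSpec(arg[i])
    while pages:nextPage() do
        dom:startPage(pageno, arg[i])
        regions = RegionExtractor()
        regions:setPageLines(page_seg)
        -- loop over the extracted line regions; j avoids shadowing the outer i
        for j = 1, regions:length()-1 do
            regions:extractLine(line_image, page_image, j, 1)
            recognizer:recognizeLine(fst, line_image)
            fst:bestpath(text) -- or use a language model
            bbox = regions:bbox(j)
            dom:appendLine(text, bbox)
        end
        dom:endPage()
        pageno = pageno + 1 -- advance the page counter for the next page
    end
end
-- rule-based post-processing of the physical layout, then serialization
logical_layout_analysis(dom)
dom:writeHocr()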
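
For orientation, hOCR is HTML in which each element carries an ocr_* class and records its bounding box in the title attribute as "bbox x0 y0 x1 y1". The plain-Lua sketch below builds a small page/column/line tree by hand and serializes it in that style. The table layout, the choice of div and span tags, and the function names escape, bbox_title and to_hocr are hypothetical; they do not reflect the HocrDom API used above.

```lua
-- Hypothetical sketch, not the HocrDom API.
-- Escape characters that are special in HTML text.
local function escape(s)
    return (s:gsub("[&<>]", { ["&"] = "&amp;", ["<"] = "&lt;", [">"] = "&gt;" }))
end

-- Format a bounding box {x0, y0, x1, y1} as an hOCR title property.
local function bbox_title(b)
    return string.format("bbox %d %d %d %d", b[1], b[2], b[3], b[4])
end

-- Serialize a node: leaves carry text, inner nodes carry children.
local function to_hocr(node, indent)
    indent = indent or ""
    local tag = (node.class == "ocr_line") and "span" or "div"
    local open = string.format("%s<%s class='%s' title='%s'>",
                               indent, tag, node.class, bbox_title(node.bbox))
    if node.text then
        return open .. escape(node.text) .. "</" .. tag .. ">\n"
    end
    local parts = { open, "\n" }
    for _, child in ipairs(node.children or {}) do
        parts[#parts + 1] = to_hocr(child, indent .. "  ")
    end
    parts[#parts + 1] = indent .. "</" .. tag .. ">\n"
    return table.concat(parts)
end

local page = {
    class = "ocr_page", bbox = { 0, 0, 600, 800 },
    children = {
        { class = "ocr_column", bbox = { 50, 50, 550, 750 },
          children = {
              { class = "ocr_line", bbox = { 50, 50, 550, 80 }, text = "Tom & Jerry" },
          } },
    },
}
local html = to_hocr(page)
io.write(html)
```

The nesting of the output mirrors the ocr_page / ocr_column / ocr_line structure, so a later logical layout pass can work on the same tree before it is written out.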
