hOCR Output Generation
- OCRopus generates physical layout analysis as a DOM tree
- structure is roughly like:
- this DOM tree can then be post-processed for logical layout analysis
- right now, this is a purely rule-based process ("if spacing > 1.5 x-height and ... then ...")
- we've made it comparatively easy by making it scriptable
- in the future, we will be applying machine learning to this problem
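To make the rule-based step concrete, here is a minimal sketch of one such rule: start a new paragraph whenever the vertical gap between consecutive lines exceeds 1.5 times the x-height. The function name `group_paragraphs`, the line records, and the `{x0, y0, x1, y1}` bounding-box layout are hypothetical and chosen for illustration; they are not the OCRopus scripting API.

```lua
-- Hypothetical sketch, not OCRopus code: group text lines into paragraphs
-- with a single spacing rule. Each line is a table with a `bbox` field
-- {x0, y0, x1, y1}; y is assumed to increase down the page.
function group_paragraphs(lines, xheight, factor)
  factor = factor or 1.5          -- threshold as a multiple of the x-height
  local paragraphs = {}
  local current, prev = nil, nil
  for _, line in ipairs(lines) do
    -- Gap between the bottom of the previous line and the top of this one.
    local gap = prev and (line.bbox[2] - prev.bbox[4]) or 0
    if current == nil or gap > factor * xheight then
      current = {}                -- rule fired: open a new paragraph
      paragraphs[#paragraphs + 1] = current
    end
    current[#current + 1] = line
    prev = line
  end
  return paragraphs
end

-- Usage: three lines, with a large gap before the third.
sample_lines = {
  { text = "First line",    bbox = {100, 100, 900, 130} },
  { text = "second line",   bbox = {100, 140, 900, 170} },
  { text = "New paragraph", bbox = {100, 230, 900, 260} },
}
sample_paragraphs = group_paragraphs(sample_lines, 20)
print(#sample_paragraphs)
```

Because each rule is a small scripted predicate like this, rules can be added or tuned without touching the recognizer itself.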
Toplevel Loop
NB: the actual code is still somewhat different; we're cleaning it up.
    dom = HocrDom()
    pageno = 1
    for i = 1, #arg do
      -- each command-line argument is a page specification
      pages = Pages()
      pages:parseSpec(arg[i])
      while pages:nextPage() do
        dom:startPage(pageno, arg[i])
        -- (obtaining page_image and page_seg for the current page is not shown)
        regions = RegionExtractor()
        regions:setPageLines(page_seg)
        for j = 1, regions:length()-1 do
          regions:extractLine(line_image, page_image, j, 1)
          recognizer:recognizeLine(fst, line_image)
          fst:bestpath(text) -- or use a language model
          bbox = regions:bbox(j)
          dom:appendLine(text, bbox)
        end
        dom:endPage()
        pageno = pageno + 1
      end
    end
    -- post-process the physical layout, then serialize
    logical_layout_analysis(dom)
    dom:writeHocr()
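The loop ends with `dom:writeHocr()`. As a rough picture of what such a serialization step could produce, the sketch below writes each recognized line as an hOCR `ocr_line` span carrying its bounding box in the `title` attribute, wrapped in an `ocr_page` div. The helper names `html_escape`, `hocr_line`, and `hocr_page` are hypothetical; this is not the `HocrDom` implementation, and a real page element would usually carry more properties than `ppageno`.

```lua
-- Hypothetical sketch, not the HocrDom implementation.

-- Escape the characters that are special in HTML text and attribute values.
function html_escape(s)
  return (s:gsub("[&<>\"']", {
    ["&"] = "&amp;", ["<"] = "&lt;", [">"] = "&gt;",
    ['"'] = "&quot;", ["'"] = "&#39;",
  }))
end

-- One text line: class ocr_line, bounding box stored in the title attribute.
function hocr_line(text, bbox)
  return string.format(
    "<span class='ocr_line' title='bbox %d %d %d %d'>%s</span>",
    bbox[1], bbox[2], bbox[3], bbox[4], html_escape(text))
end

-- One page: class ocr_page wrapping its lines, one per output line.
function hocr_page(pageno, lines)
  local out = {}
  out[#out + 1] = string.format("<div class='ocr_page' title='ppageno %d'>", pageno)
  for _, line in ipairs(lines) do
    out[#out + 1] = "  " .. hocr_line(line.text, line.bbox)
  end
  out[#out + 1] = "</div>"
  return table.concat(out, "\n")
end

-- Usage: serialize a one-line page.
print(hocr_page(1, { { text = "a < b", bbox = {10, 20, 300, 50} } }))
```

Keeping the geometry in `title` attributes means the output remains ordinary HTML, so later logical layout passes can rewrite the tree without losing the physical coordinates.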