There are two main approaches to document layout analysis. Bottom-up approaches iteratively parse a document starting from the raw pixel data. They typically first segment the image into connected regions of black and white pixels, then group these regions into words, then into text lines, and finally into text blocks.[3][4] Top-down approaches instead attempt to iteratively cut a document into columns and blocks based on white space and geometric information.[5]
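The first bottom-up stage, finding connected regions of black pixels, can be illustrated with a minimal sketch. It assumes a binarized page stored as a list of rows in which 1 means black and 0 means white, and it uses 8-connectivity. The function name `connected_components` and the bounding-box output format are illustrative choices, not part of any cited method.

```python
from collections import deque


def connected_components(image):
    """Return bounding boxes (top, left, bottom, right), inclusive, of the
    8-connected groups of black pixels in a binary image (1 = black)."""
    rows = len(image)
    cols = len(image[0]) if rows else 0
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if image[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first flood fill from this unvisited black pixel.
            seen[r][c] = True
            queue = deque([(r, c)])
            top, left, bottom, right = r, c, r, c
            while queue:
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and image[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes


page = [
    [1, 1, 0, 0, 0, 1],
    [1, 0, 0, 0, 0, 1],
    [0, 0, 0, 1, 0, 0],
]
boxes = connected_components(page)
print(boxes)  # [(0, 0, 1, 1), (0, 5, 1, 5), (2, 3, 2, 3)]
```

The later bottom-up stages would then group these boxes into words, text lines, and text blocks, for example by comparing the distances between neighboring boxes.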
Bottom-up approaches are the traditional ones, and they have the advantage of requiring no assumptions about the overall structure of the document. On the other hand, they require iterative segmentation and clustering, which can be time-consuming.[6] Top-down approaches have the advantage of parsing the global structure of a document directly, which eliminates the need to iteratively cluster the possibly hundreds or even thousands of characters and symbols that appear in a document. They tend to be faster, but to operate robustly they typically require a number of assumptions about the layout of the document.[7] An example of a top-down approach is the recursive X-Y cut algorithm, which decomposes the document into rectangular sections.[8]
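The idea behind a recursive X-Y cut can be sketched as follows. This is a simplified illustration on raw pixels, not the bounding-box formulation of the cited paper. It assumes a binary image stored as a list of rows (1 = black) and splits each region at its widest run of empty rows or columns. The names `xy_cut` and `widest_gap` and the `min_gap` threshold are hypothetical.

```python
def widest_gap(occupied, min_gap):
    """Given sorted indices that contain ink, return the widest empty run
    between them as a half-open range (start, end), or None."""
    best = None
    for a, b in zip(occupied, occupied[1:]):
        width = b - a - 1
        if width >= min_gap and (best is None or width > best[1] - best[0]):
            best = (a + 1, b)
    return best


def xy_cut(image, top, left, bottom, right, min_gap=2):
    """Recursively split the half-open region [top, bottom) x [left, right)
    at white gaps; return leaf blocks as (top, left, bottom, right)."""
    rows = [r for r in range(top, bottom) if any(image[r][left:right])]
    if not rows:
        return []  # region contains no black pixels
    cols = [c for c in range(left, right)
            if any(image[r][c] for r in range(top, bottom))]
    # Shrink the region to the bounding box of its black pixels.
    top, bottom = rows[0], rows[-1] + 1
    left, right = cols[0], cols[-1] + 1
    row_gap = widest_gap(rows, min_gap)
    col_gap = widest_gap(cols, min_gap)
    if row_gap is None and col_gap is None:
        return [(top, left, bottom, right)]  # no cut possible: leaf block

    def width(gap):
        return gap[1] - gap[0]

    if col_gap is None or (row_gap is not None
                           and width(row_gap) >= width(col_gap)):
        start, end = row_gap  # horizontal cut between two groups of rows
        return (xy_cut(image, top, left, start, right, min_gap)
                + xy_cut(image, end, left, bottom, right, min_gap))
    start, end = col_gap      # vertical cut between two groups of columns
    return (xy_cut(image, top, left, bottom, start, min_gap)
            + xy_cut(image, top, end, bottom, right, min_gap))


page = [
    [1, 1, 1, 1, 1, 1, 1, 1, 1],  # a full-width heading
    [0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0],
    [1, 1, 1, 0, 0, 0, 1, 1, 1],  # two columns of text
    [1, 1, 1, 0, 0, 0, 1, 1, 1],
]
blocks = xy_cut(page, 0, 0, len(page), len(page[0]))
print(blocks)  # [(0, 0, 1, 9), (3, 0, 5, 3), (3, 6, 5, 9)]
```

The `min_gap` threshold is one example of the layout assumptions that top-down methods rely on: gaps narrower than it, such as the spacing between words, are never cut.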
Two issues are common to any approach to document layout analysis: noise and skew. Noise refers to image noise, such as salt and pepper noise or Gaussian noise. Skew refers to the fact that a document image may be rotated so that the text lines are not perfectly horizontal. Both document layout analysis algorithms and optical character recognition algorithms commonly assume that the characters in the document image are oriented so that text lines are horizontal. Therefore, if skew is present, it is important to rotate the document image to remove it.
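As an illustration of noise removal, a 3×3 median filter is one common way to suppress salt and pepper noise. The sketch below is a minimal example, not the method of any cited work. It assumes a grayscale image stored as a list of rows of integer intensities, and the function name `median_filter` is an illustrative choice.

```python
def median_filter(image):
    """Apply a 3x3 median filter to a grayscale image (list of rows).
    Border pixels use only the neighbors that exist; when the window has
    an even number of pixels, the upper of the two middle values is taken."""
    rows, cols = len(image), len(image[0])
    out = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            window = sorted(
                image[y][x]
                for y in range(max(0, r - 1), min(rows, r + 2))
                for x in range(max(0, c - 1), min(cols, c + 2))
            )
            out[r][c] = window[len(window) // 2]
    return out


# A white page (255) with a single black speck (0) in the middle.
noisy = [[255] * 5 for _ in range(5)]
noisy[2][2] = 0
clean = median_filter(noisy)
print(clean[2][2])  # 255
```

Because an isolated speck is always outvoted by its neighbors, it disappears, while larger solid regions such as character strokes are mostly preserved.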
It follows that the first steps in any document layout analysis pipeline are to remove image noise and to estimate the skew angle of the document.
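One simple way to estimate a skew angle is a projection-profile search, sketched below. This is an illustrative technique and not necessarily the one used by the algorithm described in this section. It assumes the black pixels are given as (x, y) coordinates with y increasing downward, and it tries candidate angles in a fixed range. When the candidate matches the true skew, the pixels of each text line project onto the same row, so the profile becomes sharply peaked. The name `estimate_skew` and its parameters are hypothetical.

```python
import math


def estimate_skew(points, max_angle=5.0, step=0.25):
    """Return the candidate angle (degrees) whose projection profile is most
    sharply peaked. A positive angle means text lines descend to the right
    in image coordinates (y pointing down)."""
    best_angle, best_score = 0.0, -1
    n_steps = int(round(2 * max_angle / step))
    for i in range(n_steps + 1):
        angle = -max_angle + i * step
        theta = math.radians(angle)
        cos_t, sin_t = math.cos(theta), math.sin(theta)
        # Histogram of pixel rows after undoing a rotation by `angle`.
        profile = {}
        for x, y in points:
            row = round(y * cos_t - x * sin_t)
            profile[row] = profile.get(row, 0) + 1
        # The sum of squared bin counts is largest when pixels pile up in few rows.
        score = sum(count * count for count in profile.values())
        if score > best_score:
            best_angle, best_score = angle, score
    return best_angle


# Synthetic page: three one-pixel-thick text lines skewed by 2 degrees.
true_skew = 2.0
slope = math.tan(math.radians(true_skew))
points = [(x, round(y0 + x * slope)) for y0 in (20, 40, 60) for x in range(200)]
estimated = estimate_skew(points)
print(estimated)
```

The document image would then be rotated by the negative of the estimated angle to make the text lines horizontal. The search range and step size trade accuracy against running time.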
In this section we will walk through the steps of a bottom-up document layout analysis algorithm developed in 1993 by O'Gorman.[9] The steps in this approach are as follows:
Baird, H.S. (July 1992). "Anatomy of a versatile page reader". Proceedings of the IEEE. 80 (7): 1059–1065. CiteSeerX 10.1.1.40.8060. doi:10.1109/5.156469.
Cattoni, R.; Coianiz, T.; Messelodi, S.; Modena, C. M. "Geometric Layout Analysis Techniques for Document Image Understanding: a Review". ITC-irst Technical Report TR#9703-09.
O'Gorman, L. (1993). "The document spectrum for page layout analysis". IEEE Transactions on Pattern Analysis and Machine Intelligence. 15 (11): 1162–1173. doi:10.1109/34.244677.
Seong-Whan Lee; Dae-Seok Ryu (2001). "Parameter-free geometric document layout analysis". IEEE Transactions on Pattern Analysis and Machine Intelligence. 23 (11): 1240–1256. CiteSeerX 10.1.1.574.7875. doi:10.1109/34.969115.
Ha, Jaekyu; Haralick, Robert M.; Phillips, Ihsin T. (1995). "Recursive X-Y Cut using Bounding Boxes of Connected Components" (PDF). Proceedings of the Third International Conference on Document Analysis and Recognition (ICDAR '95). http://haralick.org/conferences/71280952.pdf