BDID - Contents

Contents

The Business Document Image Dataset offers considerable variability in layout and format. It includes color, grayscale, and black-and-white document images (all encoded as 24-bit RGB), one- and two-column pages, various background colors, etc.

The current dataset comprises 1232 one-page document images extracted from annual reports of publicly traded companies and from evaluation reports, reviews, or plans issued by government agencies and universities. The images are available at several resolutions (100, 150, 200, 250, and 300 DPI). Each document image contains text and at least one chart, image (picture), or table, for a total of:

488 charts (281 bar, 60 line, 89 pie/doughnut, 41 mixed, 17 other),
59 images,
1682 tables.
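
Because the same pages are distributed at several resolutions, region coordinates measured at one resolution may need rescaling before they are overlaid on an image at another. The following is a minimal sketch of that conversion, assuming coordinates are pixel positions that scale linearly with DPI. The function name `scale_points` and the sample box are hypothetical, and the resolution at which the ground truth coordinates were recorded is not stated here.

```python
def scale_points(points, source_dpi, target_dpi):
    """Rescale (x, y) pixel coordinates from one scan resolution to another.

    Assumes coordinates scale linearly with DPI; results are rounded to
    whole pixels.
    """
    return [
        (round(x * target_dpi / source_dpi), round(y * target_dpi / source_dpi))
        for x, y in points
    ]


# Hypothetical rectangle outlined on a 300 DPI image, mapped to 100 DPI.
box_at_300 = [(300, 600), (900, 600), (900, 1200), (300, 1200)]
box_at_100 = scale_points(box_at_300, source_dpi=300, target_dpi=100)
print(box_at_100)
```

Rounding to whole pixels can shift a boundary by up to half a pixel, so a region scaled down and back up is not guaranteed to match its original outline exactly.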

Ground truth

For each document image, ground truth data have been prepared with Aletheia 1.5, a layer-based advanced document layout and text ground-truthing system targeting large-scale digitization. The software lets the user outline objects on a document image, assign each one to one of many predefined region types, set its properties, and save the data in the PAGE format, which is defined by an XML schema.
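
Since the ground truth is stored as XML, it can be read with a standard XML parser. The sketch below is one possible way to list the regions in such a file, not the dataset's official tooling. It assumes region elements have names ending in `Region` (such as `TextRegion` or `TableRegion`) with a `Coords` child, and it accepts outlines written either as a `points` attribute or as `Point` children, because the schema version used by the dataset is not stated here. The namespace and the sample document are hypothetical placeholders.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree adds to tag names."""
    return tag.rsplit("}", 1)[-1]


def read_regions(xml_text):
    """Return a list of (region_type, [(x, y), ...]) for every region found."""
    root = ET.fromstring(xml_text)
    regions = []
    for elem in root.iter():
        name = local_name(elem.tag)
        if not name.endswith("Region"):
            continue
        # Only direct children count, so nested regions keep their own outline.
        coords = next((c for c in elem if local_name(c.tag) == "Coords"), None)
        if coords is None:
            continue
        if coords.get("points"):
            # Outline given as "x1,y1 x2,y2 ..." in a single attribute.
            points = [
                tuple(int(v) for v in pair.split(","))
                for pair in coords.get("points").split()
            ]
        else:
            # Outline given as <Point x=".." y=".."/> child elements.
            points = [
                (int(p.get("x")), int(p.get("y")))
                for p in coords
                if local_name(p.tag) == "Point"
            ]
        regions.append((name, points))
    return regions


# Hypothetical sample; the namespace is a placeholder, not the real schema URI.
SAMPLE = """<PcGts xmlns="urn:example:page">
  <Page imageFilename="sample.png" imageWidth="850" imageHeight="1100">
    <TextRegion id="r1"><Coords points="50,40 400,40 400,120 50,120"/></TextRegion>
    <TableRegion id="r2"><Coords>
      <Point x="50" y="200"/><Point x="400" y="200"/>
      <Point x="400" y="500"/><Point x="50" y="500"/>
    </Coords></TableRegion>
  </Page>
</PcGts>"""

regions = read_regions(SAMPLE)
print(Counter(name for name, _ in regions))
```

Matching on local names rather than a fixed namespace keeps the sketch independent of the schema version, at the cost of not validating the file against the schema.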

Two different types of ground truth are available: basic and layered.

Basic ground truth

In the basic version, ground truth information includes four types of document objects: text, graphics/charts, images, and tables. Document objects are physically delimited by a close-fitting isothetic polygon or rectangle, and logically defined by the type of region used (text, chart, etc.). Whenever a document object contains embedded text (e.g. text in tables), additional text regions are defined on a separate second layer.
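
An isothetic polygon is one whose edges are all parallel to the image axes, so every edge is either horizontal or vertical. The sketch below shows a simple check of that property for a list of vertices. The function name `is_isothetic` and the sample outlines are hypothetical, and the check assumes vertices are listed in order around the outline with no repeated consecutive points.

```python
def is_isothetic(points):
    """Return True if every edge of the closed polygon is horizontal or vertical."""
    if len(points) < 4:
        # An axis-parallel polygon needs at least four vertices.
        return False
    # Pair each vertex with the next one, wrapping around to close the outline.
    edges = zip(points, points[1:] + points[:1])
    # Exactly one coordinate may stay constant along an edge.
    return all((x1 == x2) != (y1 == y2) for (x1, y1), (x2, y2) in edges)


# Hypothetical outlines: an L-shaped region and a trapezoid.
l_shape = [(0, 0), (4, 0), (4, 2), (2, 2), (2, 4), (0, 4)]
trapezoid = [(0, 0), (4, 0), (3, 2), (0, 2)]
print(is_isothetic(l_shape), is_isothetic(trapezoid))
```

A rectangle is the simplest isothetic polygon, which is why the two shapes are mentioned together as region outlines.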

Basic ground truth: sample color document images (top) and basic ground truth overlaid on grayscale versions (bottom)

[sources: http://www.cn.ca, http://www.awincomefund.ca, http://www.bombardier.com]

Layered ground truth

In the layered version, ground truth information includes a first layer of region-based data and a second layer of pixel-based data. Region-based data identify document objects from a high-level viewpoint. Document objects are often composite in nature, comprising several elemental patterns (e.g., a chart often includes textual and graphical elements). Pixel-based data identify homogeneous groups of pixels from a low-level viewpoint, and represent the decomposition of document objects into elemental patterns. Both document objects and groups of pixels are physically delimited by a close-fitting isothetic polygon or rectangle, and logically defined by their type. Region-based data include six types of document objects: text blocks, charts, images, tables, separators, and background. Pixel-based data include four types of elemental patterns: textual, graphical, pictorial, and interstitial.

Layered ground truth: sample color document image (top) and layered ground truth overlaid on grayscale version (bottom), with region-based data (bottom left), pixel-based data (bottom middle), and zoomed-in section showing the decomposition of a chart into its elemental patterns (bottom right)

[source: http://www.awincomefund.ca]
