Overview

The Business Document Image Dataset (BDID) is a collection of digitally born, single-page color document images extracted from public business documents, together with their associated physical and logical layout information (ground truth). It is made freely available to the document image analysis research community to facilitate the design, development, testing, and evaluation of algorithms for layout segmentation and classification of document images.

 

Why business documents?

 

Contemporary business documents are designed to help readers quickly connect data presented in textual, graphical, and pictorial formats. As a result, they typically combine these elements in diverse, multi-layered mixtures and varied layouts, which makes them a rich data source for developing and testing new document image analysis techniques.

 

Electronic documents are standard in the business intelligence world. The BDID contains digitally born document images, as opposed to scanned or camera-captured images of paper documents, which removes the need for pre-processing steps such as de-skewing.

 

Another important characteristic that sets this dataset apart is that its document images are extracted from business documents publicly available on the web. Example source documents include annual reports of publicly traded companies, evaluation reports, reviews or plans from government agencies and universities.

 

Business document image

[source: http://www.timhortons.com]

Copyright © 2013-2017  |  Last modified on 2017/05/15
