AI-SDV is the place to be for everyone involved in advanced search and data applications, text mining and visualization technologies. AI-SDV 2022 took place on Oct 10-11, 2022, in Vienna (Austria).

Karakun expert Dr. Holger Keibel presented a talk about information extraction from tabular documents.

Abstract & Slides

In our customer projects involving automated document processing, we often encounter document types that provide crucial data in the form of tables. Established text analytics algorithms are usually optimized to operate on running text, so they tend to produce rather poor results on tables: they do not capture the non-sequential relations inside them (e.g. interpreting the content of a table cell relative to its column title, or treating line breaks inside a cell differently from line breaks between cells or rows).

While there are elaborate information extraction products on the market for a few highly specific types of tabular documents, there is no general approach. The main reason is that table structures can be encoded by a heterogeneous range of layout means (e.g. column boundaries can be signaled by lines, by aligned text, or by white space).
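
To make the white-space case concrete, here is a minimal sketch in Python. It is not the method presented in the talk. It assumes the table is already available as plain-text lines with consistently aligned columns and a header in the first line. The function names `find_column_spans` and `parse_table` and the `min_gap` parameter are hypothetical and exist only for illustration.

```python
def find_column_spans(lines, min_gap=2):
    """Return (start, end) character spans of columns, inferred from white space.

    Assumption: a column boundary is a run of at least `min_gap` character
    positions that are blank in every line of the table.
    """
    width = max(len(line) for line in lines)
    padded = [line.ljust(width) for line in lines]
    # A position is occupied if any line has a non-space character there.
    occupied = [any(row[i] != " " for row in padded) for i in range(width)]

    spans = []
    start = None  # start of the column currently being scanned
    gap = 0       # length of the current run of blank positions
    for i, used in enumerate(occupied):
        if used:
            if start is None:
                start = i
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap:
                # The column ended just before the blank run began.
                spans.append((start, i - gap + 1))
                start = None
                gap = 0
    if start is not None:
        spans.append((start, width - gap))
    return spans


def parse_table(lines, min_gap=2):
    """Interpret each cell relative to its column title (first line = header)."""
    spans = find_column_spans(lines, min_gap)
    rows = [[line[s:e].strip() for s, e in spans] for line in lines]
    header, body = rows[0], rows[1:]
    return [dict(zip(header, row)) for row in body]


# Made-up sample data for illustration only.
sample = [
    "Item        Qty   Price",
    "Widget      2     9.50",
    "Gear box    10    120.00",
]
records = parse_table(sample)
```

Each record maps a column title to the cell value in that row, so "Gear box" stays one cell even though it contains a space. A sketch like this breaks down as soon as columns are signaled by ruling lines, cells span several lines, or a scan is skewed, which is the kind of heterogeneity described above.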

In this talk, we will illustrate several solutions that we have developed for a range of challenges occurring in this context, both for scanned and digitally generated documents.

Want to learn more about how language analytics and information extraction can boost your business?

Get in touch with us