
Table extraction from PDFs with OpenText Knowledge Discovery (IDOL)

by   in Data Analytics

Valuable data often resides in PDF or Office documents in the form of tables. The challenge, however, lies in extracting this information and being able to query it using natural language. In this article, we will explore how IDOL can be used to extract information from tables and query it using natural language.

KeyView Filter

The OpenText KeyView Filter SDK enables you to incorporate text extraction functionality into your applications. It extracts text and metadata from a wide variety of file formats on numerous platforms and can automatically recognize over 2000 document types. It supports both file-based and stream-based I/O, and it can run filtering in a separate process for added protection.

In this case, we use KeyView Filter to logically parse PDFs and extract tables in a tab-delimited format. Because KeyView supports so many file formats, the same approach works for tables in other document types, not only PDFs.
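To make the tab-delimited format concrete, here is a minimal Python sketch of how such output could be turned into rows for downstream use. It relies only on the standard library and does not call KeyView itself. The sample text, the column names, and the parse_table helper are assumptions made for illustration, not actual KeyView output.

```python
import csv
import io

# Hypothetical tab-delimited text standing in for extracted table output.
SAMPLE = (
    "Model\tYear\tMarket share\n"
    "Model A\t2022\t11.5%\n"
    "Model A\t2023\t12.4%\n"
    "Model B\t2023\t8.1%\n"
)

def parse_table(text):
    """Parse tab-delimited text into a list of dictionaries keyed by the header row."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return [dict(row) for row in reader]

rows = parse_table(SAMPLE)
for row in rows:
    print(row["Model"], row["Year"], row["Market share"])
```

Keeping each row keyed by its header makes it easy to check that the column structure survived extraction before the content is passed on.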

Knowledge Discovery (IDOL) Answer Server

The key to successfully performing RAG (Retrieval Augmented Generation) is to have full control over the context. Once KeyView Filter has extracted the content containing the table in tab-delimited format, we can pass that content along with the question to an LLM and get a precise answer.
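As a rough illustration of passing the extracted table and the question to an LLM together, the sketch below assembles a single prompt string. The prompt wording, the build_prompt helper, and the sample table are assumptions for illustration. This is not how Answer Server builds its requests internally, and the call to the LLM itself is left out.

```python
def build_prompt(table_text, question):
    """Combine extracted table text and a user question into a single prompt."""
    return (
        "Answer the question using only the table below. "
        "The table is tab-delimited and its first row is the header.\n\n"
        f"Table:\n{table_text.strip()}\n\n"
        f"Question: {question.strip()}\n"
        "Answer:"
    )

# Hypothetical table text standing in for real extracted content.
table_text = "Model\tYear\tMarket share\nModel A\t2023\t12.4%\n"
prompt = build_prompt(table_text, "What is the market share for Model A in 2023?")
print(prompt)
```

Restricting the model to the supplied table is one simple way to keep control over the context, since the answer can then be checked against the extracted rows.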

Table extraction in action

In the image above, we are performing table extraction from a PDF document and answering a question while highlighting the answer in the PDF's HTML rendering so that the user can verify the answer visually.

Answer Server can also return the paragraph from which the answer was drawn.
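The idea of tracing an answer back to its source paragraph can be sketched with a simple text match. This is only an illustration under assumed inputs: the find_source_paragraph helper and the sample paragraphs are hypothetical, and Answer Server's own mechanism is not described here and may work quite differently.

```python
def find_source_paragraph(paragraphs, answer):
    """Return (index, paragraph) for the first paragraph containing the answer, or None."""
    needle = answer.casefold()
    for index, paragraph in enumerate(paragraphs):
        if needle in paragraph.casefold():
            return index, paragraph
    return None

# Hypothetical paragraphs standing in for extracted document content.
paragraphs = [
    "This report summarises vehicle sales by model.",
    "Model A\t2023\t12.4%",
    "Figures are rounded to one decimal place.",
]
match = find_source_paragraph(paragraphs, "12.4%")
print(match)
```

Knowing which paragraph supplied the answer is what lets a viewer highlight it so the user can verify the result visually.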

In this example, the question is, "What is the market share for T-150 Ford in 2023?" IDOL can retrieve the answer from a database of documents in various formats. The KeyView Filter SDK is available for C, C++, Java, .NET, and Python.

For a deeper understanding of these advancements or to talk about potential collaborations, please feel free to explore OpenText's overview of AI text analytics or drop me an email at vjoseph@opentext.com.

If you are based in Europe or North America, please drop a line to lobrien@opentext.com.

Join OpenText on LinkedIn and follow @OpenText on X.

The OpenText Content Services team

Labels:

IDOL
Unstructured Data Analytics