An Introduction to OpenText File Content Extraction
This series of blog posts is a guide to Knowledge Discovery (IDOL), starting with a high-level overview and moving into increasingly focused areas of Knowledge Discovery (IDOL) functionality. It is intended for anyone who wants to learn more about what Knowledge Discovery (IDOL) can do to help them get the most out of their data.
As life becomes increasingly digitized, we have become used to working with many different software applications. From email and presentations to images, Web pages, PDFs, and ZIP archives, we deal with hundreds of different types of files without thinking about how the computer stores and processes the information that we read, manipulate, and use.
In fact, there are thousands of file formats, all storing different types of content, and used for different purposes. To get the most value out of your data, you need to be able to extract value from all these different formats.
File Content Extraction provides the tools needed to identify over 2000 different file types, to filter out the text content, and to convert it to formats that are easier to process automatically, use, and view.
Identify Formats
Before you can do anything with your content, you need to know what you have.
File formats include a vast array of different things, from the comparatively simple plain text to very specialised formats for computer-aided design (CAD). File Content Extraction can identify over 2000 formats, and increases this number with every release.
You might think that you can identify a file format by using its file extension. However, the file extension is an unreliable marker. In some cases, the same file extension might refer to many different versions of an application, or even to files from two very different pieces of software. In other cases, a file might be mislabeled, either by accident or by design. For example, a malicious actor might rename a file to get past a simple firewall that attempts to block ZIP or EXE files.
As a result, File Content Extraction ignores the file extension and examines the content of a file to identify it correctly. Many formats use a ‘magic number’ at the start of the file, which is useful for identification. However, magic numbers can be ambiguous and are sometimes insufficient, so File Content Extraction examines the file more deeply to check its basic validity before it determines the format, which increases confidence in the result.
In all cases, File Content Extraction does the minimum amount of work required to be confident of the file format, so it can detect formats as quickly as possible.
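To make the idea of content-based detection concrete, here is a minimal Python sketch that checks a handful of well-known magic numbers and compares the result with the file extension. It is only an illustration of the principle, not the File Content Extraction API: the `SIGNATURES` table and the `sniff_format` and `extension_matches` helpers are hypothetical names invented for this example, and a real detector inspects far more than the first few bytes.

```python
# Illustrative only: a tiny signature table, not the File Content Extraction API.
SIGNATURES = [
    (b"%PDF-", "pdf"),
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"GIF87a", "gif"),
    (b"GIF89a", "gif"),
    (b"PK\x03\x04", "zip"),
]


def sniff_format(data: bytes) -> str:
    """Return a format name based on the leading bytes, or 'unknown'."""
    for magic, name in SIGNATURES:
        if data.startswith(magic):
            return name
    return "unknown"


def extension_matches(filename: str, data: bytes) -> bool:
    """Report whether the file extension agrees with the sniffed content."""
    extension = filename.rsplit(".", 1)[-1].lower() if "." in filename else ""
    return extension == sniff_format(data)
```

With this sketch, a ZIP archive that has been renamed to `report.txt` is still sniffed as `zip`, and `extension_matches` flags the mismatch. It also shows why a magic number alone can be ambiguous: DOCX and XLSX files are ZIP packages and begin with the same `PK` signature, so telling them apart requires looking deeper into the file.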
Depending on your data, and your uses, you can optionally turn on KeyView’s source code detection, which similarly scans a file to work out the main programming language used in a source file.
Container Files
In addition to simple file formats, many formats can contain other files. The most obvious example of this is a ZIP, which is an archive of other files. However, lots of other formats can contain other files, such as emails with attachments, Word documents with embedded images or spreadsheet tables, and many others.
File Content Extraction can open these container files and extract the files inside. This step is an essential part of processing file data, and is made simple in a workflow. You can then process the subfiles in the same way as any other documents, to extract all the value from your data.
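As a rough picture of what recursive subfile extraction involves, the following Python sketch walks a ZIP archive held in memory and descends into any nested archives it finds. It uses only the standard `zipfile` module and handles only ZIP containers; `iter_subfiles` is a hypothetical helper written for this post, not part of the File Content Extraction SDKs, which cover many more container types such as emails with attachments.

```python
import io
import zipfile


def iter_subfiles(data: bytes, prefix: str = ""):
    """Yield (path, content) for every leaf file, descending into nested ZIPs."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            content = archive.read(info)
            path = prefix + info.filename
            if zipfile.is_zipfile(io.BytesIO(content)):
                # The subfile is itself a container: recurse into it.
                yield from iter_subfiles(content, prefix=path + "/")
            else:
                yield path, content
```

Each yielded subfile can then be passed through the same identification and filtering steps as any top-level document. A production workflow would also want limits on recursion depth and extracted size, which this sketch leaves out.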
Text Filtering
After you identify the format of a file, you might want to be able to get useful content out of it. In KeyView, this process is known as text filtering.
File Content Extraction can detect and retrieve the text in many different file formats. It can filter the obvious text that a user sees when they read the file, and in many cases it can also filter hidden text. Hidden text is anything that is present in the file but not immediately obvious to a user, such as comments and tracked changes, and explicitly hidden text, such as accessibility text in PDF.
File Content Extraction can also retrieve many different kinds of metadata information from your files, from the author and change date information, to security classification information that might determine what users are allowed to access the information.
It is worth noting that File Content Extraction can also perform character set conversion, so that you can change text from other character sets into more widely accepted and managed ones, such as UTF-8, which can simplify processing.
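The basic idea behind character set conversion can be shown in a few lines of Python. This sketch assumes that the source character set is already known; the `to_utf8` helper is a hypothetical name used for illustration and says nothing about how File Content Extraction performs the conversion internally.

```python
def to_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Re-encode text from a known source character set as UTF-8."""
    # Decode the raw bytes into Unicode text, then encode that text as UTF-8.
    text = raw.decode(source_encoding)
    return text.encode("utf-8")
```

For example, `to_utf8(b"caf\xe9", "latin-1")` returns `b"caf\xc3\xa9"`: the single Latin-1 byte for "é" becomes the two-byte UTF-8 sequence. Once everything is in one encoding, downstream tools no longer need to handle each legacy character set separately.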
File Transformation
After you have filtered the text from your files, you can use it for a lot of different things.
In most cases, when you need the text content for something, you use Filter to get the text, and use the output directly. For example, Knowledge Discovery (IDOL) Ingest uses Filter to get the text from your documents, then processes that content and indexes it into a Knowledge Discovery (IDOL) Text index. You can also use the filtered text for additional processing, such as using OpenText Named Entity Recognition to search for any personally identifiable information.
In some cases you might need to use the whole document for something else, in which case you can use KeyView Export.
File Content Extraction has tools to convert many supported file formats into HTML, XML, or PDF.
HTML export allows you to generate a version of your files that you can read in a Web browser. This might be for ease of access, to make files accessible where a user might not have the required software on their computer. It also allows you to embed document viewing in your web applications. In some cases it might even be an essential method for accessing file formats that no longer have a supported native reader; in this way KeyView can help users access content that would otherwise be lost and unusable.
Similarly, PDF export allows you to convert your documents to PDF, for ease of viewing.
XML export provides a more structured version of the plain text filter output. It contains information about the document layout, and retains the original structure. This output might be useful for document indexing for search, for example so that you can put headings in particular fields, or specialized document parsing in an application that requires the document structure.
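To illustrate why structured output helps with indexing, here is a small Python sketch that reads a structured XML document and puts headings and body text into separate fields. The element names (`doc`, `heading`, `para`) and the `to_index_fields` helper are assumptions made up for this example; the real XML Export output uses its own schema, which is described in the technical documentation.

```python
import xml.etree.ElementTree as ET

# Hypothetical structured output; real XML Export output uses its own schema.
SAMPLE = """<doc>
  <heading>Quarterly Report</heading>
  <para>Revenue grew in the third quarter.</para>
  <heading>Outlook</heading>
  <para>Growth is expected to continue.</para>
</doc>"""


def to_index_fields(xml_text: str) -> dict:
    """Split a structured document into separate heading and body fields."""
    root = ET.fromstring(xml_text)
    return {
        "headings": [el.text or "" for el in root.iter("heading")],
        "body": [el.text or "" for el in root.iter("para")],
    }
```

Calling `to_index_fields(SAMPLE)` gives a dictionary with the two headings in one field and the two paragraphs in another, so a search index could, for instance, weight matches in headings more heavily than matches in body text.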
Document Security
Secure file sharing in an organization often means that files are encrypted. File Content Extraction has tools for working with these encrypted files, particularly with Microsoft Azure Information Protection (AIP) and Rights Management System (RMS).
In many cases, File Content Extraction can identify the format of the file even without decryption, because some of the metadata information is not encrypted. Where the encryption key is available, File Content Extraction can then decrypt and process the secure documents, making it easier for you to access the content.
File Content Extraction SDKs
File Content Extraction is a set of libraries, intended as tools that you can embed into your own software. To use File Content Extraction, you use one of the SDK packages:
- Filter SDK allows you to identify formats, filter text and metadata, and extract subfiles. Filter is the most commonly needed and used File Content Extraction tool.
The Filter SDK provides C, C++, Java, .NET, and Python APIs.
- Export SDK allows you to convert a document to HTML, XML, or PDF output.
The HTML and XML Export SDKs provide C and Java APIs, while the PDF Export SDK is currently available only in C.
- Panopticon is a File Content Extraction tool for decrypting files that are protected by Microsoft Rights Management System (RMS).
Further Reading
The File Content Extraction technical documentation provides details about how to use and program with the File Content Extraction SDKs.
More Information
Learn more about what Unstructured Data Analytics can do for you.
The OpenText Analytics & AI team