
Large PDFs Are No Longer a Burden

by   in Data Analytics

PDF (Portable Document Format) files are ubiquitous. Invented by Adobe in the early 1990s, PDF files are used to publish documentation, tax forms, contracts and newspapers, to capture scanned data, and for many other purposes. PDF readers are available on virtually every modern computer system. In the last 30 years of computing, you would have had to work hard to avoid interacting with a PDF file.

If (ok, when) you’ve used a PDF file, you have also encountered large ones, containing tens to hundreds, and in rarer cases thousands, of pages. These large page-count PDF files are also often large in file size, which increases download times, slows email, makes sharing cumbersome and makes it difficult to find what you’re really looking for. While PDF readers have an embedded search function, you still have to download the whole file first, and the download time varies with file size and network speed. Plus, the search within PDF readers is not always fast. Remember, the goal here is to locate valuable information and do so quickly: time is money.

What if there was a better way?

With IDOL version 12.13, it is now possible to set up a smart search workflow that:

  • ingests PDFs as pages
  • identifies the exact pages in the PDF that match the query terms
  • dynamically extracts the pages of interest and renders them as HTML with term highlighting when the user wants to view the large PDF
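The idea behind page-level matching can be sketched in a few lines of Python. This is only a toy illustration and not how IDOL is implemented: the function names `build_page_index` and `matching_pages` and the three sample pages are invented for the example. Each page is indexed as its own unit, so a query can report which pages contain all of its terms.

```python
import re
from collections import defaultdict

def build_page_index(pages):
    """Map each lowercase term to the set of 1-based page numbers containing it."""
    index = defaultdict(set)
    for page_number, text in enumerate(pages, start=1):
        for term in re.findall(r"[a-z0-9]+", text.lower()):
            index[term].add(page_number)
    return index

def matching_pages(index, query):
    """Return the sorted page numbers that contain every term in the query."""
    terms = re.findall(r"[a-z0-9]+", query.lower())
    if not terms:
        return []
    page_sets = [index.get(term, set()) for term in terms]
    return sorted(set.intersection(*page_sets))

# Hypothetical three-page document; the text is invented for illustration.
pages = [
    "Introduction to the filter SDK",
    "Send documentation feedback to the team",
    "Appendix: send feedback on the documentation",
]
index = build_page_index(pages)
print(matching_pages(index, "Send documentation feedback"))  # [2, 3]
```

Because the result is a list of page numbers rather than a single document hit, a viewer only needs to fetch and render those pages.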

For large page-count PDF files, this approach can result in a significant reduction in the time to download the content of interest.  And with just the relevant pages downloaded, answers are easier to find.

Let us look at a few examples. Example #1 uses KeyViewFilterSDK_12.13_CProgramming.pdf, a 401-page file of ~2.5MB. Two scenarios will be explored: the first with standard ingestion, indexing and viewing, and the second with the new selective page rendering feature enabled. The other two examples will explore a mix of other content and more sophisticated searches.

In the first scenario of example #1, the KeyViewFilterSDK_12.13_CProgramming.pdf file is ingested with IDOL NiFi Ingest and indexed into IDOL Content without page sectioning. When a query for "Send documentation feedback" is issued, a single search result is returned. Upon viewing, the entire 401-page PDF is downloaded, and the user must locate the two matches on pages 12 and 401. The download time for ~2.5MB will vary based on network speed. In most cases, downloading will be fast since the file is not very large. However, Acrobat Reader took almost 12 seconds to find the second match on page 401. Using IDOL View Server to render the entire document as HTML (with term highlighting) took ~30 seconds to display all 401 pages (~57MB). Both approaches are too slow.

In scenario #2, the file is ingested with IDOL NiFi Ingest and indexed into IDOL Content as a page-sectioned PDF file. For the same query, a single result is also returned, but with the two matching pages (12 and 401) identified in the search result. This means that when IDOL View Server is asked to render the document, only two pages (~185KB file size) are produced, taking only ~160 milliseconds to render (with term highlighting), download and display. Access to the two pages of relevant content felt instantaneous.

Figures 1 and 2 below illustrate the IDOL Content search results and IDOL View with selective page rendering.

 

Figure 1 IDOL Content search results with the additional page information returned.

Figure 2 IDOL View rendition of pages 12 and 401 with query terms highlighted.

 

For the second example, the ArcSight Console User Guide document was used. It has more pages (1114) and a larger file size (~15MB). As a larger file, it will take longer to download than the file in example #1, with the actual download time varying based on network speed. Locating content of interest across 1114 pages will also take longer than in the 401-page file from example #1. When IDOL View Server is used to render a single page of results, a small ~160KB file of relevant data is rendered and downloaded. It does not require a network engineer to know that downloading the ~160KB single-page rendition will be much faster than downloading the entire ~15MB PDF. And locating the information in a single-page rendition will also be faster.
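To put rough numbers on that comparison, here is a small back-of-the-envelope sketch. The two file sizes come from the example above. The link speeds are assumptions picked for illustration, sizes are treated as decimal units (1MB = 1,000,000 bytes), and latency and protocol overhead are ignored, so real transfers will take somewhat longer.

```python
def transfer_seconds(size_bytes, megabits_per_second):
    """Ideal transfer time: payload bits divided by link speed in bits per second."""
    return (size_bytes * 8) / (megabits_per_second * 1_000_000)

FULL_PDF_BYTES = 15 * 1_000_000   # ~15MB full PDF (from the example above)
ONE_PAGE_BYTES = 160 * 1_000      # ~160KB single-page rendition (from the example above)

# Link speeds are hypothetical values chosen for illustration only.
for label, mbps in [("slow mobile link", 2), ("home broadband", 50)]:
    full = transfer_seconds(FULL_PDF_BYTES, mbps)
    page = transfer_seconds(ONE_PAGE_BYTES, mbps)
    print(f"{label} ({mbps} Mbit/s): full PDF {full:.2f}s, single page {page:.3f}s")

print(f"size ratio: about {FULL_PDF_BYTES / ONE_PAGE_BYTES:.0f}x")
```

On the assumed 2 Mbit/s link, the full PDF needs about a minute while the single page needs well under a second. The payload is roughly 94 times smaller whatever the link speed.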

The final example involves what is basic functionality for a search engine: proximity and stemming. An IDOL query for `API NEAR implementation` yields hits on thirteen different pages across the 401-page KeyViewFilterSDK_12.13_CProgramming.pdf file.  Stemming yields matches like `implemented` from the query term `implementation`. And the NEAR proximity operator yields a match like `…defines the API functions implemented by the….`.  IDOL View Server returns the thirteen relevant pages and highlights the matching search terms with the NEAR query syntax and stemming both respected in the term highlighting.

When the same search was attempted with the built-in find feature of web browsers and Acrobat Reader, a wall was hit, since neither supports stemming or proximity searches. When a similar query, `api implementation`, is used, both Acrobat Reader and the web browser find matches on just two pages, omitting the other eleven pages of interest.
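The difference between a literal find and a stemmed proximity match can be illustrated with a toy Python sketch. It is not IDOL's stemmer or its NEAR implementation: the suffix rules, the five-word window and the order-independent matching are all assumptions made for the example, and `stem` and `near_matches` are hypothetical helper names.

```python
import re

SUFFIXES = ("ation", "ing", "ed", "s")  # toy rules, not a real stemming algorithm

def stem(word):
    """Strip one common suffix, keeping a stem of at least three letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def near_matches(text, term_a, term_b, window=5):
    """Return (i, j) word positions where the stemmed terms fall within `window` words."""
    stems = [stem(w) for w in re.findall(r"[A-Za-z0-9]+", text)]
    a, b = stem(term_a), stem(term_b)
    positions_a = [i for i, s in enumerate(stems) if s == a]
    positions_b = [j for j, s in enumerate(stems) if s == b]
    return [(i, j) for i in positions_a for j in positions_b if abs(i - j) <= window]

sentence = "defines the API functions implemented by the"

# A literal find for the query word fails: the text says "implemented".
print("implementation" in sentence.lower())                # False

# Stemming maps both words to "implement", two words away from "API".
print(near_matches(sentence, "API", "implementation"))     # [(2, 4)]
```

The literal search misses the sentence entirely, while the stemmed proximity check finds `API` at word position 2 and `implemented` at position 4.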

Figures 3 and 4 below illustrate the IDOL Content search and IDOL View with selective page rendering results.

Figure 3 IDOL Content search results with the additional page information returned for `api NEAR implementation`.

 

Figure 4 IDOL View rendition of page 371 with query terms highlighted.

The overall benefits of selective PDF page rendering will vary based on a few factors:

  • size of the original PDF - including page count and the file size
  • speed of the network used to download the document
  • capabilities of the device viewing the document

What is certain is that downloading and viewing a 1100-page PDF file on a mobile phone will be slower than downloading and viewing only the few pages that match. And it will be more cumbersome to find the content of interest across 1100 pages than across a few relevant pages.

Learn more

View Server has other viewing-related capabilities like:

  • redaction based on IDOL Eduction and Grammar Packs for use cases like PII and other sensitive data masking
  • integration with IDOL Connectors to enable secure, connector-based viewing from cloud storage systems from vendors like Google, AWS, Microsoft, Dropbox and Box, or from applications like Micro Focus Content Manager, OpenText Content Server, IBM FileNet, Salesforce, ServiceNow and many others
  • HTML renditions for file formats besides PDF, such as spreadsheets, presentations, word processing documents and others
  • Universal Viewing, which automatically selects the appropriate fetch method for the original document and offers an alternative in case the source repository is unexpectedly unavailable

You can learn more about IDOL and View Server at https://www.microfocus.com/idol.

For the more technically inclined, the IDOL Documentation is available here - select 12.13 or higher for selective page rendering-supported versions.  The following links point to specific CFS, Content and View documentation-related settings for selective page rendering.

The Micro Focus IM&G team

Know your data | empower your people | drive your future Join our community | @microfocusimg | www.microfocus.com
