An Introduction to Knowledge Discovery (IDOL) Text Analytics

by   in Data Analytics

Working with Text Analytics is more than just Text

This series of blog posts is a guide to Knowledge Discovery (IDOL), starting with a high-level overview (https://community.microfocus.com/img/b/backupandgovernance/posts/opentext-idol-ai) and moving into increasingly focused areas of Knowledge Discovery functionality. It is intended for anyone who wants to learn more about how Knowledge Discovery can help them get the most out of their data.

Ever since the invention of writing, text has been an important and extensive form of communication. The earliest written records give us an insight into aspects of history that other artefacts from the time cannot.

Today, while other forms of media become increasingly popular, text is still the backbone of a lot of information technology. From email to social media, companies use text in a large proportion of communications. Even where video and audio formats exist, it is sometimes desirable to transcribe the speech to text to analyse and review its content alongside other documents.

This unstructured text data is generally more difficult for computers to process than structured data, but it contains a wealth of valuable information if it can be extracted.

Knowledge Discovery (IDOL) Text Analytics is the group of products that allow you to automatically process, search, and analyse this unstructured text.

 

Knowledge Discovery (IDOL) Content Component

At the heart of many of Knowledge Discovery's text processing and analytics functions is the Content component. It is the main engine that processes and stores (indexes) your text data, and allows you to query it.

Indexing is the process of taking raw text and converting it into a form that is easily searchable. When you index a document, Knowledge Discovery Content processes and stores the text body, along with any metadata about this text. When it processes the text, it uses a probabilistic approach to create a language model that adapts to the contents of your documents. This method allows Knowledge Discovery to adjust flexibly to data that contains specialist language, or highly context-dependent word usage.
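To make the idea of indexing concrete, here is a minimal sketch of an inverted index, the classic structure that maps each term to the documents that contain it. This is an illustration only: it does not use the Knowledge Discovery API, it does not reproduce the probabilistic language model described above, and the function names and sample documents are assumptions made for this example.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(documents):
    """Map each term to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index

def search(index, query):
    """Return the IDs of documents that contain every term in the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

# Hypothetical sample documents.
documents = {
    "doc1": "Quarterly sales report for the retail division",
    "doc2": "Retail customer feedback and support emails",
    "doc3": "Engineering design notes",
}
index = build_index(documents)
print(sorted(search(index, "retail")))
print(sorted(search(index, "retail report")))
```

Once the index is built, a search looks up terms directly rather than rescanning every document, which is what makes indexed text "easily searchable".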

Querying allows you to retrieve the data. In addition to simple keyword searches, Knowledge Discovery has a variety of advanced search tools to help you find the content you need. Exact phrase search and proximity operators allow you to specify precise search criteria. Document tagging and faceted search (filtering) can make it easier for you to pick a target subject area, while tools such as fuzzy spelling and Soundex (sounds-like) options allow you to broaden your search terms.
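As an illustration of how a sounds-like match can work, the following sketch implements the classic American Soundex encoding, in which words that sound alike receive the same four-character code. Whether Knowledge Discovery's Soundex option uses exactly this variant is an assumption; the code is a standalone example, not product code.

```python
# Build the letter-to-digit table used by American Soundex.
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        CODES[ch] = digit

def soundex(word):
    """Return the four-character Soundex code for a word."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    first = letters[0]
    encoded = first.upper()
    previous = CODES.get(first, "")
    for ch in letters[1:]:
        code = CODES.get(ch, "")
        # Emit a digit only when it differs from the previous code.
        if code and code != previous:
            encoded += code
        # 'h' and 'w' do not separate repeated codes; vowels do.
        if ch not in "hw":
            previous = code
    # Pad with zeros, then truncate to one letter plus three digits.
    return (encoded + "000")[:4]

print(soundex("Robert"), soundex("Rupert"))
print(soundex("Smith"), soundex("Smyth"))
```

Because "Robert" and "Rupert" share a code, a search for one can be broadened to return documents that mention the other.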

In addition, you can supply a sentence, or even a whole document, as query text. Knowledge Discovery pre-processes the query text in the same way as when you index a document, to find the most important terms and phrases. This method provides a powerful tool for finding other documents like your sample, without having to spend time defining what that means.
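The sketch below shows one simple way to pick out the most important terms from a block of query text. It uses TF-IDF weighting as a stand-in technique, which is not the probabilistic method that Knowledge Discovery uses; the function names, the smoothing formula, and the sample corpus are all assumptions for illustration.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphabetic terms."""
    return re.findall(r"[a-z]+", text.lower())

def top_terms(query_text, corpus, k=3):
    """Rank the terms in query_text by TF-IDF against the corpus."""
    doc_tokens = [set(tokenize(doc)) for doc in corpus]
    counts = Counter(tokenize(query_text))
    scores = {}
    for term, tf in counts.items():
        # Number of corpus documents that contain the term.
        df = sum(1 for tokens in doc_tokens if term in tokens)
        # Smoothed inverse document frequency: rarer terms score higher.
        idf = math.log((1 + len(corpus)) / (1 + df)) + 1
        scores[term] = tf * idf
    # Highest score first; ties are broken alphabetically.
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]

# Hypothetical indexed documents and query text.
corpus = [
    "a report on sales",
    "a report on staffing",
    "a note on turbine maintenance",
]
print(top_terms("a report on turbine blade corrosion", corpus))
```

Terms that are common across the corpus, such as "a" and "on", fall to the bottom of the ranking, so the distinctive terms drive the search for similar documents.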

While you might only need a few of these options for a particular use case, the overall approach means that Knowledge Discovery is adaptable to a wide variety of different data stores.

Data Enrichment for Search

With additional enrichment to the data that you index, you can include many different types of fields in your documents to allow you to perform more complex search operations. Each field contains a piece of information, which might be unstructured text, like a summary, or a structured tag, such as a date or name. Fields can originate from metadata that exists in the source document, or they can be derived from the document, for example by using entity extraction, sentiment analysis or face recognition.

Using data enrichment to add fields in this way means that you can improve performance for some kinds of search, such as:

  • Faceted Search. Use the data in fields as filters that allow you to refine search results, for example to search only for products with a particular colour or price.
  • Geospatial Search. Add coordinates (such as latitude and longitude) or region information to your documents to allow you to search for documents by location. Knowledge Discovery allows you to search for a particular location, locations in a range, or locations within an area that you define (either a simple circle or a polygon).
  • Bias Results. Use field data to ensure that results with a particular value or range of values score more highly, without explicitly excluding other options. For example, you might want results in a particular price range to occur at the top of the search results, while still allowing for other results.
  • User metafields. Perform calculations on field values ‘on the fly’ during queries, to allow you to search for value combinations that aren’t stored in existing fields. For example, you can search for documents where the sum of two price fields is less than a particular value.
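The four field-based operations above can be sketched in plain Python. This is a conceptual illustration, not Knowledge Discovery query syntax: the documents, field names, relevance scores, and helper functions are all hypothetical, and the geospatial example covers only the simple circle case, using the haversine great-circle distance.

```python
import math

# Hypothetical documents with enrichment fields (all values are made up).
products = [
    {"name": "Desk lamp", "colour": "red", "price": 30.0, "shipping": 5.0,
     "lat": 51.5074, "lon": -0.1278},   # London
    {"name": "Floor lamp", "colour": "blue", "price": 80.0, "shipping": 15.0,
     "lat": 52.2053, "lon": 0.1218},    # Cambridge
    {"name": "Reading lamp", "colour": "red", "price": 45.0, "shipping": 10.0,
     "lat": 55.9533, "lon": -3.1883},   # Edinburgh
]
# Hypothetical relevance scores returned by a text query.
base_scores = {"Desk lamp": 0.6, "Floor lamp": 0.7, "Reading lamp": 0.65}

def facet_filter(docs, field, value):
    """Faceted search: keep documents whose field equals the value."""
    return [d for d in docs if d.get(field) == value]

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two coordinates."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dlon = math.radians(lon2 - lon1)
    a = (math.sin((p2 - p1) / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(dlon / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(a))

def within_radius(docs, lat, lon, radius_km):
    """Geospatial search: keep documents inside a circle."""
    return [d for d in docs
            if haversine_km(d["lat"], d["lon"], lat, lon) <= radius_km]

def biased_rank(docs, scores, field, low, high, boost=0.2):
    """Bias results: boost documents in a range without excluding others."""
    def score(d):
        bonus = boost if low <= d[field] <= high else 0.0
        return scores[d["name"]] + bonus
    return sorted(docs, key=score, reverse=True)

def total_below(docs, field_a, field_b, limit):
    """User metafield: compare the sum of two fields, computed on the fly."""
    return [d for d in docs if d[field_a] + d[field_b] < limit]

def names(docs):
    return [d["name"] for d in docs]

print(names(facet_filter(products, "colour", "red")))
print(names(within_radius(products, 51.5074, -0.1278, 100)))
print(names(biased_rank(products, base_scores, "price", 40, 50)))
print(names(total_below(products, "price", "shipping", 50)))
```

Note that the biased ranking still returns all three products; the price range changes only the order, which is the difference between biasing and filtering.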

Query Analytics

Knowledge Discovery query analytics help you to find out more about your data. They can reveal information about the content you have in your index, show which topics are most important, and even track how those topics change over time.

Knowledge Discovery provides many query analytics options out-of-the-box, available either with the Content component, or by using other Knowledge Discovery components that integrate with it and use your data index. For example, you can automatically categorize your documents, and cluster similar topics together.

The Find user interface provides examples of many analytics, such as:

  • Comparisons between two result sets.
  • Timelines to show how the stories in the data change over time.
  • Topic maps to show the important subject areas in a particular set of results.
  • Geographical maps that show where particular stories originate.

Entity Extraction

Entity extraction is a tool that Knowledge Discovery (IDOL) provides in several different products, powered by the Eduction engine.

An entity is any small, definable snippet of information that you might want to find in text, such as a name, an address, or an ID number.

Entity extraction has many uses. Recently, a very common use is to find personally identifiable information (PII) to ensure compliance with regulations such as the EU’s General Data Protection Regulation (GDPR). You can use Eduction to find out whether a particular document contains any PII, and flag these documents for appropriate processing.

You can also use entity extraction for document enrichment, for example to find and tag documents that refer to a particular celebrity or organisation, which makes those documents easier for your users to find. Another common use is sentiment analysis, which finds positive and negative parts of phrases to determine the sentiment surrounding an object within a comment or review.

To find entities, Eduction defines grammars. The grammar might be a list of terms (such as names), or a pattern that defines what the entity looks like. For example, for telephone numbers the grammar defines the number of digits, and use of spaces and other delimiters, rather than explicitly listing every possible number.
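The two kinds of grammar described above, a list of terms and a pattern, can be illustrated with Python regular expressions. This is not Eduction grammar syntax and does not call Eduction; the term list, the phone number format (three digits, three digits, four digits, separated by hyphens, dots, or spaces), and the function name are assumptions chosen for the example.

```python
import re

# Term-list "grammar": every entity is listed explicitly (hypothetical list).
ORGANISATIONS = ["OpenText", "Micro Focus"]
ORG_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(name) for name in ORGANISATIONS) + r")\b"
)

# Pattern "grammar": describes the shape of the entity (digit counts and
# delimiters) instead of listing every possible number.
PHONE_PATTERN = re.compile(r"\b\d{3}[-. ]\d{3}[-. ]\d{4}\b")

def extract_entities(text):
    """Return (entity_type, matched_text) pairs in order of appearance."""
    found = []
    for entity_type, pattern in (("organisation", ORG_PATTERN),
                                 ("phone", PHONE_PATTERN)):
        for match in pattern.finditer(text):
            found.append((match.start(), entity_type, match.group()))
    found.sort()
    return [(entity_type, value) for _, entity_type, value in found]

text = "Contact OpenText support on 555-010-1234 or 555 010 9876."
for entity in extract_entities(text):
    print(entity)
```

The pattern finds both phone numbers even though neither appears in any list, which is why pattern grammars suit entities such as ID numbers, dates, and telephone numbers.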

Knowledge Discovery provides out-of-the-box grammars for many different types of entity. A large set of standard grammars covers a wide variety of common entities. In addition, there are premium grammar packages, which provide grammars tailored to particular uses, such as finding PII from a large number of countries and in a variety of languages. These grammars tend to be maintained more rigorously than the standard grammars, and are being expanded with every release.

You can read more about using Knowledge Discovery (IDOL) Eduction to find PII

Natural Language Question Answering

As human use of computers and the internet has evolved, the emphasis has changed from requiring users to carefully select keyword search terms, to allowing us to use more fluid text input to find what we want, and in many cases providing simple direct answers to a question. Computers must now work out the parts of a question that represent the information that the user wants, and how to map this to the data available.

Answer Server provides several methods for this kind of natural language question answering.

  • Answer Bank. An FAQ-style set of model questions and their answers.
  • Fact Bank. A database of facts that map pieces of information together (for example, to map ‘capital city’ and ‘Spain’ to ‘Madrid’), to provide simple answers to questions.
  • Passage Extractor. An index of data that contains information that might be relevant to user questions. Passage Extractor automatically finds phrases, sentences, or short passages from the data that contain a likely answer to the question. This option is unique to IDOL, and uses the same text processing and matching that the Content component uses for search.
  • Conversation (Virtual Assistant). A scriptable system that allows a back-and-forth conversation with users to answer questions and perform various tasks.
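To show the difference between the first two methods, here is a small sketch of an FAQ-style lookup and a fact lookup. It does not use the Answer Server API: the stored questions, answers, and facts are invented, and matching a user question to a model question by word overlap (Jaccard similarity) is an assumption chosen for simplicity, not the matching that Answer Server performs.

```python
import re

# Hypothetical Answer Bank: model questions mapped to curated answers.
ANSWER_BANK = {
    "How do I reset my password?":
        "Use the 'Forgot password' link on the sign-in page.",
    "How do I contact support?":
        "Email the support desk.",
}

# Hypothetical Fact Bank: (property, entity) pairs mapped to a value.
FACT_BANK = {
    ("capital city", "spain"): "Madrid",
    ("capital city", "france"): "Paris",
}

def tokens(text):
    """Return the set of lowercase words in the text."""
    return set(re.findall(r"[a-z]+", text.lower()))

def answer_faq(question, threshold=0.5):
    """Return the answer whose model question best overlaps the input."""
    asked = tokens(question)
    best_answer, best_score = None, 0.0
    for model_question, answer in ANSWER_BANK.items():
        model = tokens(model_question)
        union = asked | model
        score = len(asked & model) / len(union) if union else 0.0
        if score > best_score:
            best_answer, best_score = answer, score
    # Decline to answer when no model question is similar enough.
    return best_answer if best_score >= threshold else None

def answer_fact(prop, entity):
    """Look up a stored fact, ignoring case."""
    return FACT_BANK.get((prop.lower(), entity.lower()))

print(answer_faq("How can I reset my password"))
print(answer_fact("capital city", "Spain"))
```

An Answer Bank returns a prepared answer for a question it recognises, whereas a Fact Bank assembles an answer from structured facts. A real system must also work out which property and entity a free-text question refers to, a step that this sketch leaves out.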

You can use one or more of these methods to provide quick answers to simple user questions, leaving your support staff free to deal with more complicated tasks. 

Further Reading

Knowledge Discovery Text Analytics provides many powerful components and tools for processing and analysing your text content. If you are interested in reading further about specific components, refer to the product documentation for each one.

More Information 

Learn more about what unstructured data analytics can do for you.

The OpenText Analytics & AI team