An Introduction to IDOL Text Analytics: making the most of the written word
This series of blog posts is a guide to IDOL, starting with a high-level overview titled An Introduction to Micro Focus IDOL - Find and Use Valuable Data, and moving into increasingly focused areas of IDOL functionality. It is intended for anyone who wants to learn more about how IDOL can help them get the most out of their data.
Ever since the invention of writing, text has been an important and extensive form of communication. The earliest written records give us an insight into aspects of history that other artefacts from the time cannot.
Today, while other forms of media become increasingly popular, text is still the backbone of a lot of information technology. From email to social media, companies use text in a large proportion of communications. Even where video and audio formats exist, it is sometimes desirable to transcribe the speech to text to analyse and review its content alongside other documents.
This unstructured text is generally harder for computers to process than structured data, but it holds a wealth of valuable information for anyone who can extract it.
IDOL Text Analytics is the group of IDOL products that allows you to automatically process, search, and analyse this unstructured text.
IDOL Content Component
At the heart of much of IDOL’s text processing and analytics is the IDOL Content component. It is the main engine behind indexing and querying your text data.
Indexing is the process of taking raw text and converting it into a form that is easily searchable. When you index a document, IDOL Content processes and stores the text body, along with any metadata. When it processes the text, it uses a probabilistic approach to create a language model that adapts to the contents of your documents. This method allows IDOL to adjust flexibly to data that contains specialist language, or highly context-dependent word usage.
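To make the idea of indexing concrete, here is a minimal Python sketch of an inverted index, a common structure for making raw text searchable. It is an illustration only: it does not reproduce IDOL Content's probabilistic language model, and the names tokenize, build_index, and search, as well as the sample documents, are hypothetical.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map each term to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return the IDs of documents that contain every term in the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical sample data, for illustration only.
documents = {
    "doc1": "Quarterly sales report for the London office",
    "doc2": "Meeting notes: London office relocation",
    "doc3": "Sales forecast for next quarter",
}
index = build_index(documents)
print(sorted(search(index, "london office")))
```

The work of splitting and normalising text happens once, at index time, so each query only needs a few set lookups rather than a scan of every document.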
Querying allows you to retrieve the data. In addition to simple keyword searches, IDOL has a variety of advanced search tools to help you find the content you need. Document tagging and faceted search make it easier to pick a target subject area, while keyword tools such as fuzzy spelling and Soundex (sounds-like) options allow you to broaden your search terms.
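As an illustration of how a sounds-like match can work, the following Python sketch implements the classic American Soundex encoding, in which names that sound alike share a four-character code. This is the textbook algorithm, shown under the assumption that it is a reasonable stand-in; the exact variant that IDOL applies may differ.

```python
# Consonant groups used by classic American Soundex.
CODES = {}
for letters, digit in [
    ("bfpv", "1"),
    ("cgjkqsxz", "2"),
    ("dt", "3"),
    ("l", "4"),
    ("mn", "5"),
    ("r", "6"),
]:
    for ch in letters:
        CODES[ch] = digit


def soundex(word):
    """Return the four-character Soundex code for a word."""
    letters = [c for c in word.lower() if "a" <= c <= "z"]
    if not letters:
        return ""
    first = letters[0]
    encoded = first.upper()
    prev = CODES.get(first, "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        code = CODES.get(ch, "")  # vowels and y have no code
        if code and code != prev:
            encoded += code
        prev = code
    return (encoded + "000")[:4]  # pad with zeros, keep four characters


print(soundex("Robert"), soundex("Rupert"))
```

Two search terms can then be treated as a sounds-like match when their codes are equal, which is how a query for one spelling of a name can also find another.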
In addition, you can supply a sentence, or even a whole document, as query text. IDOL processes the query text in the same way as when you index a document, to find the most important terms and phrases. This method provides a powerful tool for finding other documents like your sample, without having to manually define what that means.
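One way to picture how important terms can be picked out of a sample text is TF-IDF weighting, sketched below in Python. This is a swapped-in, well-known technique used purely for illustration, not IDOL's own probabilistic weighting; the function top_terms and the sample corpus are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic terms."""
    return re.findall(r"[a-z]+", text.lower())


def top_terms(sample, corpus, k=3):
    """Rank the terms in `sample` by term frequency times inverse document frequency."""
    doc_tokens = [set(tokenize(doc)) for doc in corpus]
    n_docs = len(corpus)
    counts = Counter(tokenize(sample))
    scores = {}
    for term, tf in counts.items():
        # Number of corpus documents that contain the term.
        df = sum(1 for tokens in doc_tokens if term in tokens)
        # Smoothed inverse document frequency: terms found everywhere score 0.
        idf = math.log((1 + n_docs) / (1 + df))
        scores[term] = tf * idf
    # Highest score first; ties are broken alphabetically.
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


# Hypothetical sample data, for illustration only.
corpus = [
    "the cat and the dog sat on the mat",
    "the dog and the boy chased the ball",
    "the weather and the tides",
    "the boy and the cat",
]
sample = "the dog chased the dog and the ball"
print(top_terms(sample, corpus))
```

Words that appear in every document, such as "the" and "and", receive a weight of zero, so the terms that remain are the ones that distinguish the sample from the rest of the collection. Those terms can then be used as the query.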
While you might only need a few of these options for a particular use-case, the overall approach means that IDOL is adaptable to a wide variety of different data stores.
Query Analytics
IDOL query analytics help you find out more about your data. They can reveal information about the content in your index, show which topics are most important, and even track those topics over time.
IDOL provides many query analytics options out-of-the-box, available either with the Content component, or by using other IDOL components that integrate with it and use your data index.
The Find user interface provides examples of many analytics, such as:
- Comparisons between two queries
- Timelines to show how the stories in the data change over time
- Topic maps to show the important subject areas in a particular set of results
- Geographical maps that show where particular stories originate
Entity Extraction
Entity extraction is a tool that IDOL provides in several different products, powered by the IDOL Eduction engine.
An entity is any small, definable snippet of information that you might want to find in text, such as a name, an address, or an ID number.
Entity extraction has many uses. Recently, a very common use is to find personally identifiable information (PII) to ensure compliance with regulations such as the EU’s General Data Protection Regulation (GDPR). You can use Eduction to find out whether a particular document contains any PII, and flag these documents for appropriate processing.
You can also use entity extraction for document enrichment, for example to find and tag documents that refer to a particular celebrity or organisation, which makes those documents easier for your users to find. Another common use is sentiment analysis, which finds positive and negative phrases to determine the sentiment of a comment or review.
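The idea behind sentiment analysis can be sketched with a very small word-list approach in Python. This is a deliberately simplified assumption about how positive and negative phrases might be counted; the word lists and the sentiment function are hypothetical and far smaller than anything a production grammar would use.

```python
import re

# Hypothetical word lists, for illustration only.
POSITIVE = {"excellent", "great", "love", "helpful"}
NEGATIVE = {"slow", "broken", "poor", "terrible"}


def sentiment(text):
    """Classify text by counting positive and negative words."""
    words = re.findall(r"[a-z']+", text.lower())
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


print(sentiment("Great product, excellent support"))
```

A sketch like this ignores negation and context ("not great" would count as positive), which is one reason real sentiment grammars match phrases rather than single words.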
To find entities, Eduction defines grammars. The grammar might be a list of terms (such as names), or a pattern that defines what the entity looks like. For example, for telephone numbers the grammar defines the number of digits, and use of spaces and other delimiters, rather than explicitly listing every possible number.
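The two kinds of grammar described above, a list of terms and a pattern, can be illustrated with Python regular expressions. This is an analogy only: Eduction grammars have their own format, which is not shown here, and the phone number layout (a North American style number), the name list, and the extract_entities function are all assumptions made for the example.

```python
import re

# Pattern-style grammar: describes what a phone number looks like
# instead of listing every possible number.
PHONE = re.compile(
    r"""
    (?<!\d)                 # not preceded by another digit
    (?:\+1[\s.-]?)?         # optional country code
    (?:\(\d{3}\)|\d{3})     # area code, with or without parentheses
    [\s.-]?                 # optional delimiter
    \d{3}
    [\s.-]?                 # optional delimiter
    \d{4}
    (?!\d)                  # not followed by another digit
    """,
    re.VERBOSE,
)

# List-style grammar: an explicit set of terms (hypothetical names).
NAMES = ["Ada Lovelace", "Alan Turing"]
NAME = re.compile("|".join(re.escape(name) for name in NAMES))


def extract_entities(text):
    """Return (type, matched text, offset) tuples in document order."""
    entities = [("phone", m.group(), m.start()) for m in PHONE.finditer(text)]
    entities += [("name", m.group(), m.start()) for m in NAME.finditer(text)]
    return sorted(entities, key=lambda entity: entity[2])


text = "Call Ada Lovelace on 555-123-4567 or (555) 987 6543, ref 1234567890123."
for kind, value, offset in extract_entities(text):
    print(kind, value, offset)
```

The digit checks at either end of the pattern stop it from matching part of a longer number, such as the 13-digit reference in the sample text.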
IDOL provides out-of-the-box grammars for many different types of entity. In particular, there are packages specifically tailored to certain uses, such as finding PII in a large number of countries and languages.
Read more about using IDOL Eduction to find PII.
Natural Language Question Answering
As human use of computers and the internet has evolved, the emphasis has changed from forcing users to carefully select keyword search terms, to allowing them to use more fluid text input to find what they want, and in many cases providing simple direct answers to a question. Computers must now work out which parts of a question represent the information that the user wants, and how to map this to the data available.
IDOL Answer Server provides several methods for this kind of natural language question answering:
- Answer Bank. An FAQ-style set of model questions and answers
- Fact Bank. A database of facts that map pieces of information together (for example, to map ‘capital city’ and ‘Spain’ to ‘Madrid’), to provide simple answers to questions
- Passage Extractor. An index of data that contains information that might be relevant to user questions. Passage Extractor automatically finds phrases, sentences, or short passages from the data that contain a likely answer to the question. This option is unique to IDOL, and uses the same text processing and matching
- Conversation (Virtual Assistant). A scriptable system that allows a back-and-forth conversation with users to answer questions and perform various tasks
You can use one or more of these methods to provide quick answers to simple user questions, leaving your support staff free to deal with more complicated tasks.
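To illustrate the Fact Bank idea of mapping pieces of information together, here is a minimal Python sketch of a fact lookup. It is a toy under stated assumptions, not Answer Server's data model or question parsing: the FACTS table, the single question template, and the answer function are hypothetical.

```python
import re

# Hypothetical fact table: (property, entity) -> value.
FACTS = {
    ("capital city", "spain"): "Madrid",
    ("capital city", "france"): "Paris",
}

# One simple question template: "What is the <property> of <entity>?"
QUESTION = re.compile(
    r"what is the (?P<prop>.+?) of (?P<entity>.+?)\??$",
    re.IGNORECASE,
)


def answer(question):
    """Return the stored fact for a question, or None if nothing matches."""
    match = QUESTION.match(question.strip())
    if match is None:
        return None
    key = (match.group("prop").lower(), match.group("entity").lower())
    return FACTS.get(key)


print(answer("What is the capital city of Spain?"))
```

Returning None when nothing matches is a deliberate choice in this sketch: it leaves room to fall back to another method, or to hand the question to a person.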
Further Reading
IDOL Text Analytics provides many powerful components and tools for processing and analysing your text content. If you are interested in reading further about specific components, here are some links to our product documentation: