An Introduction to Knowledge Discovery (IDOL) Ingest
One processing system for all your company’s data
This series of blog posts is a guide to Knowledge Discovery (IDOL), starting with a high-level overview and moving into increasingly focused areas of IDOL functionality. It is intended for anyone who wants to learn more about how Knowledge Discovery (IDOL) can help them get the most out of their data.
Data is everywhere. It’s not just that there are so many different kinds of data to get to grips with; there are also many places to store the same thing. If you have a photo, you might store it locally on your phone or computer file system, upload it to the cloud, send it to a friend using social media or a sharing app, email it, or post it on a web page.
In each of these cases, it’s the same data—the same photo—but you could store it in any one of hundreds of different repositories.
Each repository has its own way to store and file the data. In most cases, you need to provide credentials to access the system, to make sure that your data is safe, and shared only with those who have permission.
As individual users, we interact with all these systems in a fairly seamless way. But what happens when a company needs to centralise its data somehow? Perhaps you want to make it easier for people in your organisation to find particular pieces of content, or perhaps you need to review your data to check for stored personal information, to comply with standards such as GDPR. Or perhaps you just want to review what you have so that you can make better use of it.
Knowledge Discovery (IDOL) Ingest
Knowledge Discovery (IDOL) Ingest allows you to access your content in its original repository, and retrieve it for processing.
For each repository, there is a connector, which is a component that you configure with the details for a particular repository. Knowledge Discovery (IDOL) has connectors for the file system and general websites, as well as for email, file sharing applications, content management systems, and social media.
The connector accesses your repository, and sends the files into your ingest system. It also monitors the original repository for updates, and keeps the ingest system in sync with any changes.
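As a rough illustration of what keeping the ingest system in sync involves, the following minimal sketch compares two snapshots of a repository and reports what was added, updated, or removed. It is not the connector implementation: the function name `diff_snapshots` and the idea of representing a repository as a mapping from path to content hash are assumptions made purely for this example.

```python
from typing import Dict, List, Tuple


def diff_snapshots(
    previous: Dict[str, str], current: Dict[str, str]
) -> Tuple[List[str], List[str], List[str]]:
    """Compare two {path: content_hash} snapshots of a repository.

    Hypothetical helper for illustration only; real connectors track
    changes in repository-specific ways.
    """
    # Paths that exist now but did not exist at the last synchronisation.
    added = sorted(p for p in current if p not in previous)
    # Paths that existed before but are gone now.
    removed = sorted(p for p in previous if p not in current)
    # Paths present in both snapshots whose content hash has changed.
    updated = sorted(
        p for p in current if p in previous and current[p] != previous[p]
    )
    return added, updated, removed


if __name__ == "__main__":
    last_sync = {"report.docx": "h1", "photo.jpg": "h2"}
    now = {"photo.jpg": "h3", "notes.txt": "h4"}
    print(diff_snapshots(last_sync, now))
```

Each of the three lists corresponds to a different action downstream: ingest the new item, re-ingest the changed item, or remove the deleted item from the index.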
When your files reach the ingest system, you can process them by performing various media and text analytics, and extracting the text and metadata.
Afterwards, you can convert each file into a Knowledge Discovery (IDOL) document and index it into a Knowledge Discovery (IDOL) text index. Alternatively, you can send it to other kinds of index. In some cases, you can modify the document in the original repository, for example to redact sensitive content or to remove documents that have reached a storage policy expiration date.
Apache NiFi User Interface
You deploy Knowledge Discovery (IDOL) Ingest by using components on the Apache NiFi framework.
All your connectors and data processing steps are available through the NiFi user interface, which provides a single place to create and configure your whole ingest chain.
The user interface allows you to configure all the connectors that retrieve data from each supported repository. NiFi also provides an easy way to monitor your data flow and see how it changes at each step. You can easily spot processing bottlenecks, or pause and reconfigure a step while the rest of the system is running.
File Processing
Knowledge Discovery (IDOL) Ingest allows you to configure additional processing for your documents, such as entity extraction and media analysis. These tools can add value to your source documents. For example, you can:
- Extract text and metadata from hundreds of supported file formats, by using Knowledge Discovery (IDOL) KeyView.
- Run entity extraction to find important topics and enrich your documents, or to find PII so that you can process and store the documents appropriately.
- Perform OCR on images and scanned documents to extract additional text, which you can include in the document that you send to your index.
- Perform other media analysis on images and video, such as face recognition to identify people in images, and add the results to the document metadata.
This data enrichment has many uses, from enabling compliance with data storage regulations, to enhancing user search experience by making it easier to tag and filter content.
NiFi Ingest has processors for Knowledge Discovery (IDOL) media analytics, entity extraction, and text analytics (such as categorization). It also includes all the Knowledge Discovery (IDOL) KeyView functionality required to process your file formats and extract text.
In addition, Knowledge Discovery (IDOL) Ingest can add value by performing field standardisation. When you have data from multiple sources, each original repository might store the same piece of metadata in different ways. For example, they might use ‘author’, ‘author name’ or ‘sender’ for the user who created a document. Knowledge Discovery (IDOL) Ingest can convert all these different forms to a standard name, which makes processing much more straightforward in your index and front-end applications.
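The idea behind field standardisation can be sketched in a few lines of Python. This is a simplified illustration, not how Knowledge Discovery (IDOL) Ingest is configured: the alias table `FIELD_ALIASES` and the function `standardise_fields` are hypothetical names, and the mapping contains only the three author-style fields mentioned above.

```python
from typing import Dict

# Hypothetical alias table: each repository-specific field name (lower case)
# maps to the standard name used in the index.
FIELD_ALIASES: Dict[str, str] = {
    "author": "author",
    "author name": "author",
    "sender": "author",
}


def standardise_fields(metadata: Dict[str, str]) -> Dict[str, str]:
    """Return a copy of the metadata with known aliases renamed.

    Field names that are not in the alias table are kept unchanged.
    If two aliases of the same standard field are present, the later
    one overwrites the earlier one.
    """
    standardised: Dict[str, str] = {}
    for name, value in metadata.items():
        canonical = FIELD_ALIASES.get(name.strip().lower(), name)
        standardised[canonical] = value
    return standardised


if __name__ == "__main__":
    email_metadata = {"Sender": "alice", "subject": "Q3 report"}
    cms_metadata = {"Author Name": "bob", "title": "Policy"}
    print(standardise_fields(email_metadata))
    print(standardise_fields(cms_metadata))
```

After this step, a front-end application can filter on a single `author` field regardless of which repository a document came from.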
You can also use your Knowledge Discovery (IDOL) Ingest network in your front-end applications. When you have set up Knowledge Discovery (IDOL) Ingest with access to your repositories, you can use it to retrieve the original version of a document. When a user runs a query, and you want to show a preview of a result, your connectors can access the original repository to retrieve and view the document, for example by using the Knowledge Discovery (IDOL) View component to convert it to web-friendly HTML.
Security
Security is a key concern for all data. As well as secure access to the repositories, Knowledge Discovery (IDOL) Ingest supports Knowledge Discovery (IDOL) document security.
Document security ensures that your data index honours any access restrictions in your original repositories. The connectors read the user and group restrictions on the original source document and store this information in the document Access Control List (ACL).
When you index the data, the Knowledge Discovery (IDOL) Content Component stores the ACL with the document, and uses it when retrieving the document in queries. Users must provide authentication details that confirm that they have permission to access a document; otherwise the document is not returned in search results.
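Conceptually, the ACL check at query time works like the filter below. This is a minimal sketch under stated assumptions rather than the Content Component's actual behaviour: the `Document` class, its `allowed_users` and `allowed_groups` fields, and the functions `can_read` and `filter_results` are all invented for illustration, and the model ignores details such as deny rules and nested groups.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List


@dataclass(frozen=True)
class Document:
    """Hypothetical indexed document carrying a simplified ACL."""

    reference: str
    allowed_users: FrozenSet[str] = frozenset()
    allowed_groups: FrozenSet[str] = frozenset()


def can_read(doc: Document, user: str, groups: Iterable[str]) -> bool:
    """Allow access if the user, or any of the user's groups, is on the ACL."""
    return user in doc.allowed_users or bool(doc.allowed_groups & set(groups))


def filter_results(
    results: Iterable[Document], user: str, groups: Iterable[str]
) -> List[Document]:
    """Drop every result that the user is not permitted to see."""
    group_set = set(groups)
    return [doc for doc in results if can_read(doc, user, group_set)]


if __name__ == "__main__":
    hits = [
        Document("payroll.xlsx", allowed_groups=frozenset({"hr"})),
        Document("handbook.pdf", allowed_users=frozenset({"bob"})),
    ]
    visible = filter_results(hits, user="bob", groups=["engineering"])
    print([doc.reference for doc in visible])
```

The point of the sketch is that filtering happens on the ACL stored with each document, so a user never sees a result that the original repository would have hidden from them.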
Data Migration and Modification
In addition to retrieving data from your repositories, Knowledge Discovery (IDOL) Ingest can also write data to some repositories. You can use it to migrate data between different repositories, or to process and modify data before returning it to its original source.
For example, you might want to extract data from a repository, encrypt it, and then return it. Or you might want to retrieve the data, scan it for sensitive information such as PII, and redact the sensitive content before returning it.
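To make the scan-and-redact step concrete, here is a deliberately simple sketch that replaces email addresses in a piece of text. It is only a stand-in: a real deployment would rely on entity extraction to find PII rather than a single regular expression, and the names `EMAIL_PATTERN` and `redact_emails` are assumptions made for this example.

```python
import re

# Simplified pattern for email addresses. It is an assumption for this
# example and will not match every valid address.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact_emails(text: str, placeholder: str = "[REDACTED]") -> str:
    """Replace every substring that looks like an email address."""
    return EMAIL_PATTERN.sub(placeholder, text)


if __name__ == "__main__":
    original = "Contact jane.doe@example.com about the invoice."
    print(redact_emails(original))
```

In a full workflow, the redacted text would then be written back to the repository in place of the original content.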
Further Reading
This brief introduction to Knowledge Discovery (IDOL) Ingest gives you some idea of what’s possible when you use a Knowledge Discovery (IDOL) system to connect all your data from different sources into a single enriched index.
For information about the different connectors available, see the Knowledge Discovery (IDOL) Documentation page.
The NiFi Ingest guide provides information about how to set up and configure NiFi Ingest.