
ControlPoint ingestion process: components, architecture and flow


The purpose of this brief knowledge document is to describe the components involved in the ingestion process, how they work together, and where bottlenecks or congestion points can occur. This should help readers understand the architecture and contribute to the correct sizing of environments to avoid poor performance.

Simplified schema of components

Here is a simplified schema of the components involved in the ingestion of files by ControlPoint.

clipboard_image_0.png

Components and flow

File repositories

This is the starting point: a file server, a SharePoint site, and so on.

Nothing is installed on this server, but CP connects to it and, acting like a user, copies all the files to a temporary work folder on the connector server for further processing. This step is therefore I/O intensive.

The source of data can be the weak point in the architecture if it is not sized to deliver data to CP fast enough.

  • Be aware of the load generated by the parallel connections, i.e. repositories, configured in CP.
  • Check that the disk and network are fast enough to read and send files to the connectors.

What influences the performance:

  • Number and size of files: they are copied over the network
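
As a rough aid for these checks, here is a minimal Python sketch that estimates how long a full copy of a repository over the network would take. The file count, average file size and effective throughput are purely illustrative assumptions, not measured values.

def copy_time_hours(file_count, avg_file_mb, effective_mb_per_s):
    """Return an estimated repository copy time in hours."""
    total_mb = file_count * avg_file_mb
    return total_mb / effective_mb_per_s / 3600

# Example: 2 million files averaging 0.5 MB, over a link that sustains an
# effective 50 MB/s once protocol overhead and source disk speed are included.
print(f"{copy_time_hours(2_000_000, 0.5, 50):.1f} hours")  # ~5.6 hours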

Connector

The connectors read all files in the sources and copy them into a local temporary folder. The connector also starts creating a temporary structured file (XML-like) with metadata collected from the source (e.g. file name, folder). This file is used throughout the whole process, and more metadata is appended to it at each stage. Files are processed in batches of 100 by default.

  • Network Intensive: consider the bandwidth.
  • Requires a temp disk.
    • The size of the disk must be planned: roughly the number of files in a batch multiplied by the size of the biggest files that CP scans (a rough sizing sketch follows the performance notes at the end of this section).
    • It also needs some space for the temp files.
    • It needs space if “analyze sub items” is checked as zip files will be expanded.
    • It should be on a high-performance disk as there are lots of reads, writes and deletes.

What influences the performance:

  • Number and size of files: they are copied over the network into the temp folder
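
As a sizing aid, here is a minimal Python sketch of the temp-folder rule of thumb given earlier (batch size multiplied by the largest file CP scans, plus room for expanded archives and temp files). The expansion factor, headroom and example values are assumptions chosen for illustration.

def temp_disk_gb(batch_size, largest_file_gb, expansion_factor=3.0, headroom_gb=10):
    """Return a conservative temp-folder size estimate in GB."""
    working_set = batch_size * largest_file_gb
    # expansion_factor adds room for expanded archives ("analyze sub items")
    # and intermediate temp files; headroom_gb is a fixed safety margin.
    return working_set * expansion_factor + headroom_gb

# Default batch of 100 files; largest scanned file assumed to be 2 GB.
print(f"{temp_disk_gb(100, 2):.0f} GB")  # 610 GB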

Framework (a.k.a. CFS)

Once the connector has collected the files, it puts the list of files into the CFS queue. CFS picks the files up from the temp folder in order to process them. CFS is an orchestrator: it uses sub-processes or modules to collect more information from the files, normalize the metadata so that it is consistent in the CP data warehouse, and enrich it as much as we want. Below is a list of the main sequential processes orchestrated by CFS. Not all of them are required; this depends on the use case. From a configuration perspective, the use case determines which options we select, and the selected options determine which of these processes run. See the later chapter on which option triggers which function.
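
To make the orchestration idea more concrete, here is a conceptual Python sketch, not the real CFS API or configuration, of a batch of documents flowing through sequential stages that each enrich the same work record. The WorkRecord structure, the stage functions and their results are invented for illustration.

from dataclasses import dataclass, field

@dataclass
class WorkRecord:
    """Stand-in for the XML-like temporary work file built per document."""
    path: str
    metadata: dict = field(default_factory=dict)
    text: str = ""
    entities: list = field(default_factory=list)

def keyview(rec):       # detect the format, extract known metadata and text
    rec.metadata["format"] = "docx"          # placeholder result
    rec.text = "...extracted text..."
    return rec

def eduction(rec):      # pattern matching over the extracted text
    rec.entities.append({"type": "credit_card", "count": 1})
    return rec

def metastore(rec):     # hand the finished record to the MetaStore stage
    print(f"storing metadata for {rec.path}")
    return rec

def process_batch(paths, stages):
    for path in paths:
        rec = WorkRecord(path)
        for stage in stages:
            rec = stage(rec)

# Only the stages required by the use case are wired into the pipeline.
process_batch(["\\\\fileserver\\share\\contract.docx"], [keyview, eduction, metastore])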

KeyView

First in this story is KeyView. It recognizes file types, extracts known metadata from known file types (e.g. the author of an Office document), extracts all the text that can be found in the file, and puts all of that in the temp work file.

  • It uses some CPU as it detects file formats and extracts metadata and text
  • It uses disk to write the temp file

What influences the performance:

  • Number of files: it opens them all to check the file format and potentially extract known metadata
  • Size of files: it extracts all the text from them and puts it in the temp work file

MediaServer

If the document has a format that may require deeper processing, the file might be sent to MediaServer. This applies to "media" format files, such as audio, video, or image documents that require conversion to text. KeyView has already recognized the file type; based on that, CFS can send images to MediaServer for processing (for example OCR) and add the recognized text to the temp work file.

  • This process is CPU intensive

What influences the performance:

  • Number of files and the amount of text to recognize in them

Eduction

Eduction is our module for pattern matching. It works with the temp work file, on the text extracted by KeyView or MediaServer, and searches for patterns. Eduction then feeds the temp work file back with the patterns as metadata fields. This valuable information identified in the content becomes metadata for CP.

  • This process is CPU intensive
  • There are a lot of parameters to configure what is extracted and these have a strong influence on performance.
  • Scripting can be done during the Eduction stage to adapt to the customer use case. For example, some customers only require a count of the number of hits, not the value of the Eduction grammar hit itself, so we could adapt Eduction so that the value of the data is deleted but the number of hits is kept (example: we can detect that a document contains 35 credit card numbers without keeping the card values). Such scripts will affect performance, either negatively or positively; a sketch of this idea follows below.

What influences the performance:

  • Mainly the number of patterns we extract, i.e. how many entities potentially exist in the scanned environment, or more precisely how many we need to extract from the documents.
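
Here is a minimal sketch of the "keep the count, drop the value" idea mentioned above. In a real deployment this would typically be a post-processing script in the Eduction/CFS configuration; the plain Python below, with hypothetical match data, only illustrates the transformation.

def keep_counts_only(matches):
    """Collapse raw Eduction matches into per-entity hit counts."""
    counts = {}
    for match in matches:
        counts[match["entity"]] = counts.get(match["entity"], 0) + 1
    # The sensitive values themselves are discarded; only the counts remain.
    return counts

matches = [
    {"entity": "credit_card", "value": "4111111111111111"},
    {"entity": "credit_card", "value": "5500005555555559"},
    {"entity": "national_id", "value": "123-45-6789"},
]
print(keep_counts_only(matches))   # {'credit_card': 2, 'national_id': 1}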

Validation and further processing

There are other out-of-the-box processes that happen next. They normalize and format information and metadata so that it arrives clean in the data warehouse.

This is where we can have, for example, a validation function that checks the validity of a pattern using a known algorithm. It can also be any custom function that counts, calculates, cleans up, and so on. For example, if we detect credit cards in the Eduction stage, the validation stage checks that the credit card number is valid using a special checksum (the Luhn algorithm; a sketch follows after the performance note below).

Validation usually lowers throughput, as some calculation is done. On the other hand, it lowers the number of false positives, i.e. improves the accuracy of the results, and therefore reduces the amount of data to ingest into the database.

  • These functions are usually CPU intensive as they perform calculations.

What influences the performance:

  • Number of patterns to validate
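
Below is a minimal sketch of the kind of check the validation stage performs, using the Luhn checksum that is commonly applied to credit card numbers. This is plain Python for illustration, not the actual ControlPoint validation code.

def luhn_valid(number: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    digits = [int(d) for d in number if d.isdigit()]
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:          # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

print(luhn_valid("4111111111111111"))   # True  (well-known test number)
print(luhn_valid("4111111111111112"))   # False (fails the checksum)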

Temporary working file is ready

Our temp work file is now full of valuable information, in a structured format. It is now time to insert it into our databases. There are two databases we can send the information to: the MetaStore and the IDOL content index. Not all projects require the IDOL index database; this depends on the use case. The IDOL index is needed if the use case requires the ability to search content. For example, a project can use Eduction on the content of documents without indexing that content into the IDOL index; in that case only the Eduction values are kept, and they are stored in the MetaStore. An illustrative sketch of such a work file follows the list below.

The two databases are:

  • MetaStore in SQL Server: this is where all metadata will go
  • Content in IDOL index database: this is where we will collect index and categorization information if this option was selected.
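
As an illustration only, and not the real CFS document format, the following Python sketch builds an XML-like record showing roughly what the enriched temp work file conceptually contains at this point: source metadata from the connector, format and known metadata from KeyView, entities from Eduction and, optionally, the extracted text. All element and attribute names are invented.

import xml.etree.ElementTree as ET

doc = ET.Element("document", reference="\\\\fileserver\\share\\contract.docx")
ET.SubElement(doc, "source_metadata", name="contract.docx", folder="\\\\fileserver\\share")
ET.SubElement(doc, "keyview", format="docx", author="J. Smith")
ET.SubElement(doc, "eduction", entity="credit_card", count="2")
ET.SubElement(doc, "content").text = "full extracted text, kept only if content indexing is selected"

print(ET.tostring(doc, encoding="unicode"))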

MetaStore

Once the temp work file is filled with all the information collected by the previous processes, it is time for the MetaStore to come into play. The MetaStore component writes the metadata into the MetaStore database in SQL Server.

  • It consumes CPU power as it must transform the metadata into SQL requests (a hedged sketch of this transformation appears at the end of this section).
  • It takes some network bandwidth to transfer this data to the SQL Server, but this is condensed information and not that big.

I would recommend putting the MetaStore service on the same server as the Connector and Framework, as they form a consistent unit: one server, one connector, one framework, one MetaStore.

What influences the performance:

  • Number of files: there is a record for each file in the database
  • Number of patterns extracted: there is a record for each pattern
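
The sketch below hints at the kind of transformation the MetaStore service performs: turning collected metadata into parameterised SQL statements. The table and column names are invented for illustration and are not the real ControlPoint MetaStore schema.

document = {
    "path": "\\\\fileserver\\share\\contract.docx",
    "author": "J. Smith",
    "entities": [{"type": "credit_card", "count": 2}],
}

doc_sql = "INSERT INTO Documents (Path, Author) VALUES (?, ?)"
doc_params = (document["path"], document["author"])

entity_sql = "INSERT INTO DocumentEntities (Path, EntityType, HitCount) VALUES (?, ?, ?)"
entity_rows = [(document["path"], e["type"], e["count"]) for e in document["entities"]]

# In the real service these statements would be executed against SQL Server;
# here we simply print them with their parameters.
print(doc_sql, doc_params)
print(entity_sql, entity_rows)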

SQL Server

The SQL Server is here to receive the data and store it in the database files. Data is first written to the log files and then to the database file itself. It is therefore important to separate the logs and the database files onto two disks to spread the I/Os. The database file can itself be split into multiple files to distribute the load; this is called partitioning and is supported by the CP database installer.

  • SQL Server requires very low-latency disks (SSD or flash) in order to write data chunks and, more importantly, to commit quickly.
  • SQL Server requires quite a lot of memory in order to keep data queued in memory during intensive ingestion.

What influences the performance:

  • Number of files
  • Number of patterns extracted

IDOL Content Index

When this option is selected during ingestion (users can, for example, select metadata-only ingestion, which excludes content indexing), the content of documents can also be indexed. This task is done in parallel by the IDOL component: the temp work file is sent both to the MetaStore and to IDOL for parallel processing.

IDOL will take the information in the temp work file and add it to its index database.

  • The process is CPU intensive in order to build the index and categorize content intelligently
  • It is disk I/O intensive, as it stores the whole index in files on disk, and that index can represent 20% to 40% of the original content in size. Remember to consider this when sizing your Content Index (a sizing sketch follows below).

What influences the performance:

  • Number and size of files: terms are extracted from the files and put in this dataset in an organized way.
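
As a quick sizing aid based on the 20% to 40% rule of thumb above, the Python sketch below estimates the disk space the content index may need; the 10 TB corpus size is just an example.

def index_size_tb(corpus_tb, low=0.20, high=0.40):
    """Return the (low, high) estimated index size in TB."""
    return corpus_tb * low, corpus_tb * high

low, high = index_size_tb(10)   # e.g. 10 TB of original content to index
print(f"Plan for roughly {low:.1f} to {high:.1f} TB of index storage")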

Influence of selected options on components

As you may be starting to understand by now, the options selected during the creation of a repository have a strong impact on what happens later in the processing of files, and therefore on the performance of the system.

Here is a high-level view of which option involves which component.

clipboard_image_1.png

Depending on the level of indexing chosen, data will go in different places.

clipboard_image_2.png

So, we have three levels of information capture and they are cumulative:

  1. In the first level, only the metadata from the source application is collected. In SharePoint, that would be the metadata in its database; in a file system, that would be the properties of the files, etc. At this level the document is not opened and we do not look inside it. Although the option exists, this is not the main use case, as CP can deliver much more value from the content of documents.
  2. The second level does everything in level 1 plus some cleverer things. Here we open the document and extract known metadata from certain file types with KeyView, e.g. the author of an Office document. Eduction also looks inside the content of the document to find values relevant to the business, which we extract as metadata. We can think of this as transforming an unstructured document into a structured record that describes what it means for the business: "content is translated into essential metadata". Note that at this level the content is not fully indexed and is therefore not searchable; only the metadata extracted from it can be searched. This is useful when an organization wants to find specific values in its documents but does not need to search their content later.
  3. The third level does the same as the previous ones plus content indexing. This provides two key capabilities.
    1. First, it allows searching for any term in the file content, e.g. searching for all documents containing the word "Paris".
    2. Secondly, it provides machine-learning capability to automatically group documents by concept, e.g. "give me all documents that are similar to these ten example contracts".

Choose only the processes required for your customer use case

To conclude, ControlPoint offers many options when choosing how to ingest data. It can scan, process, and then index this data in different ways. The important point is to always choose only the processes required for the use case, so that the organization gets its desired results as fast as possible. It is important to have this "consultancy" conversation with business users; the settings come last. The project starts with understanding what the organization needs to see from a functional perspective, what the use case is, and what we want to do with the information we get. Questions that should be asked before starting could be:

  • Do I need to scan all my data sources or specific repositories (e.g. from a department)?
  • Can I do a first high-level (simple metadata) scan to get a view of files that could be deleted (redundant, obsolete, trivial)? Experience shows this can reduce the number of files by 30%.
  • What are the essential patterns I need to identify sensitive data? Do I need that many?
  • Do I need the value of the entities extracted? Do I only need to know how many there are? Or do I simply need to know that there is at least one (or ten) in a document to know it is sensitive?
  • Do I need to index the content? Will it be useful later?

Doing a cleanup before the deep analysis, minimizing the number of patterns to match, reducing the number of entities extracted, and reducing the size of the index are the essential steps that give the project a faster return on value.

Labels:

How To-Best Practice