The purpose of this brief knowledge document is to describe the components involved in the ingestion process, how they work together, and where bottlenecks or congestion points can occur. This should help readers understand the architecture and contribute to correctly sizing environments to avoid poor performance.
Here is a simplified diagram of the components involved in the ingestion of files by ControlPoint.
This is the starting point: a file server, a SharePoint site…
Nothing is installed on this server; ControlPoint (CP) connects to it and acts like a user copying all the files into a temporary work folder on the connector server for further processing. This step is therefore I/O intensive.
The data source can be the weak point in the architecture if it is not sized to deliver data to CP fast enough.
What influences the performance:
The connectors read all files in the sources and copy them into a local temporary folder. This starts the creation of a temporary structured file (XML-like) containing metadata collected from the source (e.g. file name, folder…). This file is used throughout the whole process, and more metadata is appended to it at each stage. Files are processed in batches of 100 by default.
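As a rough illustration only (not the connector's real code or file format; the helper names and XML layout below are assumptions), the batching and the seeding of the temp work file could be sketched in Python like this:

```python
import xml.etree.ElementTree as ET
from itertools import islice
from pathlib import Path

BATCH_SIZE = 100  # default batch size mentioned above


def batch_files(source_dir: str, batch_size: int = BATCH_SIZE):
    """Yield lists of source files, batch_size at a time."""
    files = (p for p in Path(source_dir).rglob("*") if p.is_file())
    while True:
        batch = list(islice(files, batch_size))
        if not batch:
            break
        yield batch


def start_work_record(path: Path) -> ET.Element:
    """Create an XML-like temp work record seeded with source metadata.
    Later stages (KeyView, Eduction, ...) append more fields to it."""
    doc = ET.Element("document")
    ET.SubElement(doc, "filename").text = path.name
    ET.SubElement(doc, "folder").text = str(path.parent)
    ET.SubElement(doc, "size").text = str(path.stat().st_size)
    return doc
```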
What influences the performance:
Once the connector has collected the files, it puts the list of files in the CFS queue. CFS picks the files up from the temp folder in order to process them. CFS is an orchestrator: it uses sub-processes or modules to collect more information from the files, normalize the metadata so that it is consistent in the CP data warehouse, and enrich it as much as we want. Below is a list of the main sequential processes orchestrated by CFS. Not all of them are required; this depends on the use case. From a configuration perspective, the use case determines which options we select, and the options determine whether or not these processes are triggered. See the later chapter on which option triggers which function.
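Purely as a conceptual sketch (the stage names are placeholders, not CFS configuration), the orchestration amounts to running a configurable list of enrichment stages over each work record, in order:

```python
from typing import Callable, Iterable
import xml.etree.ElementTree as ET

Stage = Callable[[ET.Element], None]


def run_pipeline(doc: ET.Element, stages: Iterable[Stage]) -> None:
    """Apply the configured enrichment stages to one work record, in order.
    Which stages are in the list depends on the repository options chosen."""
    for stage in stages:
        stage(doc)


# Example: a metadata-only configuration would run fewer stages than a
# full deep-analysis configuration (the stage functions are hypothetical).
# run_pipeline(doc, [keyview_stage, eduction_stage, validation_stage])
```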
First in this story comes KeyView. It recognizes file types, extracts known metadata from known file types (e.g. the author of an Office document), extracts all the text it can find in the file, and puts all of that in the temp work file.
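KeyView itself is a proprietary library with its own API, so the following is only a loose sketch of the idea (type detection plus text extraction feeding the work record), using nothing but the Python standard library:

```python
import mimetypes
import xml.etree.ElementTree as ET
from pathlib import Path


def detect_and_extract(path: Path, doc: ET.Element) -> None:
    """Guess the file type, then add the type and any extracted text to the
    temp work record (very loosely mimicking the role KeyView plays)."""
    mime, _ = mimetypes.guess_type(path.name)
    ET.SubElement(doc, "filetype").text = mime or "unknown"
    if mime and mime.startswith("text/"):
        # Format-aware extraction (Office, PDF, ...) is exactly what KeyView
        # provides; this sketch only handles plain text to stay self-contained.
        ET.SubElement(doc, "content").text = path.read_text(errors="ignore")
```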
What influences the performance:
If the document has a format that requires deeper processing, the file can be sent to MediaServer. This applies to “media” files, such as audio, video, or image documents that require conversion to text. KeyView has already recognized the file type; based on that, CFS can send images to MediaServer to be processed (for example with OCR) and so add the recognized text to the temp work file.
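The routing decision itself is simple to picture. In a hedged sketch (the type list and the `ocr` callable are assumptions, not MediaServer's actual interface):

```python
import xml.etree.ElementTree as ET

# Types that would be handed to MediaServer rather than indexed as-is (illustrative list).
MEDIA_TYPES = {"image/jpeg", "image/png", "image/tiff", "audio/mpeg", "video/mp4"}


def enrich_with_media_text(path, doc: ET.Element, ocr) -> None:
    """If the detected type is a media type, run the supplied OCR or
    speech-to-text callable and append the recognized text to the work record."""
    if doc.findtext("filetype", default="unknown") in MEDIA_TYPES:
        ET.SubElement(doc, "content").text = ocr(path)
```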
What influences the performance:
Eduction is our module for pattern matching. It works on the temp work file, searching the text extracted by KeyView or MediaServer for patterns. Eduction then writes the matches back into the temp work file as metadata fields. This valuable information identified in the content becomes metadata for CP.
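Conceptually this is pattern matching over the extracted text. A minimal sketch with regular expressions follows; real Eduction grammars are far richer than these toy patterns, which are assumptions for illustration only:

```python
import re
import xml.etree.ElementTree as ET

# Toy "grammars": Eduction ships curated grammars for many entity types.
PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "credit_card": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}


def educe(doc: ET.Element) -> None:
    """Scan the extracted text and write each match back into the work record
    as a metadata field named after the entity type."""
    text = doc.findtext("content", default="")
    for entity, pattern in PATTERNS.items():
        for match in pattern.findall(text):
            ET.SubElement(doc, entity).text = match.strip()
```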
What influences the performance:
Other out-of-the-box processes then take place. They normalize and format information and metadata so that it is clean in the data warehouse.
This is where we can have, for example, a validation function that checks the validity of a pattern against a known algorithm. It can also be any custom function that counts, calculates, cleans up… For example, if we detect credit card numbers in the Eduction stage, the validation stage checks that each number is valid using a special checksum.
Validation usually lowers throughput, since some calculation has to be done. On the other hand, it lowers the number of false positives, i.e. it improves the accuracy of the results, and therefore reduces the amount of data to ingest into the database.
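For credit card numbers the checksum in question is typically the Luhn algorithm. As an illustration only (not ControlPoint's actual validation code), a Python version looks like this:

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    digits = [int(c) for c in number if c.isdigit()]
    if len(digits) < 13:  # too short to be a payment card number
        return False
    total = 0
    # Double every second digit, starting from the right-hand neighbour
    # of the check digit, and subtract 9 from any doubled value above 9.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d = d * 2 - 9 if d * 2 > 9 else d * 2
        total += d
    return total % 10 == 0


# "4111 1111 1111 1111" passes the check; "4111 1111 1111 1112" is rejected
# as a false positive and never reaches the database.
```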
What influences the performance:
Our temp work file is now full of valuable information, in a structured format. It is now time to insert it into our databases. There are two databases we can send the information to: the MetaStore and the IDOL content index. Not all projects require the IDOL index; this depends on the use case. The IDOL index is needed if the use case requires content search. For example, a project can use Eduction on the content of documents but not index that content into the IDOL index; in that case only the Eduction values are kept, and they are stored in the MetaStore.
The two databases are:
Once the temp work file is filled with all the information collected by the previous processes, it is time for the MetaStore to come into play. The MetaStore component puts the metadata into the MetaStore database in SQL Server.
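Under the hood these are ordinary SQL inserts. Purely as a sketch, assuming a pyodbc connection and an invented table layout (the table and column names below are not the real MetaStore schema):

```python
import pyodbc


def store_metadata(conn_str: str, rows: list[tuple[str, str, str]]) -> None:
    """Insert (document id, field name, field value) rows into a metadata table.
    Table and column names are placeholders, not the actual MetaStore schema."""
    with pyodbc.connect(conn_str) as conn:
        cursor = conn.cursor()
        cursor.fast_executemany = True  # send the batch in one round trip
        cursor.executemany(
            "INSERT INTO DocumentMetadata (DocId, FieldName, FieldValue) VALUES (?, ?, ?)",
            rows,
        )
        conn.commit()
```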
I would recommend putting the MetaStore service on the same server as the Connector and the Connector Framework, as they form one consistent pipeline: one server, one connector, one framework, one MetaStore.
What influences the performance:
SQL Server receives the data and stores it in the database files. Data is first written to the log files and then to the database file itself. It is therefore important to separate the log and database files onto two different disks to spread the I/O. The database file can itself be split into multiple files to distribute the load; this is called partitioning and is supported by the CP database installer.
What influences the performance:
When the corresponding option is selected during ingestion (users can, for example, choose metadata-only ingestion, which rules out content indexing), the content of documents can also be indexed. This task is done in parallel by the IDOL component: the temp work file is sent both to the MetaStore and to IDOL for parallel processing.
IDOL takes the information in the temp work file and adds it to its index database.
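The fan-out itself is straightforward; as a minimal sketch (the two sender callables are placeholders, not real ControlPoint APIs), the parallel hand-off could look like this:

```python
from concurrent.futures import ThreadPoolExecutor, wait


def dispatch(work_file, send_to_metastore, send_to_idol) -> None:
    """Hand the finished temp work file to both stores in parallel and wait
    for both writes to complete before moving on to the next batch."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        futures = [
            pool.submit(send_to_metastore, work_file),
            pool.submit(send_to_idol, work_file),
        ]
        wait(futures)
```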
What influences the performance:
As you may be starting to understand, the options selected when creating a repository have a strong impact on what happens later in the processing of files, and therefore on the performance of the system.
Here is a high-level view of which option involves which component.
Depending on the level of indexing chosen, data goes to different places.
So, we have three levels of information capture and they are cumulative:
To conclude, ControlPoint has many options for how to ingest data. It can scan, process, and then index this data in different ways. The important point is to always choose only the processes required for the use case, so that the organization gets its desired results as fast as possible. It is important to have this “consultancy” conversation with business users. The settings come last: the project starts with understanding what the organization needs from a functional perspective, what the use case is, and what we want to do with the information we get. Questions to ask before starting could be:
Doing a cleanup before the deep analysis, minimizing the number of patterns to match, reducing the number of entities extracted, and reducing the size of the index are the essential steps that give the project a faster time to value.