Why entity extraction?
Entity extraction, also known as named entity recognition or resolution, is part of IDOL's extensive text analytics for natural language processing, which is itself part of our complete platform for unstructured data analytics.
So, why is this important? It's all about creating structured information from unstructured data. This data, sometimes known as human data, is all around us in documents, emails, and even social media, and unlike structured data stored neatly in regimented databases, computers find it difficult to understand. Extracting entities lets us apply standard structured-data processes to the results, as found in generic business applications such as:
- File analysis - Find PII for governance
- eDiscovery - Find associations with a person and an event, e.g. DSAR or FOIR actions
- Data Loss Prevention - Used in firewalls to classify content and prevent IP leaks
This information can also be used as part of further unstructured analytics in search and knowledge discovery, as used by enterprises and intelligence agencies.
We provide two sets of grammars: basic grammars that the user can expand to cover a host of standard classes, and specific context-driven grammars that we curate to maintain performance and accuracy:
- PII - Personally identifiable information, covering 13 categories of entities across 38 different countries
- PHI - Protected health information, normally associated with the North American healthcare industry
- PCI - Personal credit card information
- PSI - Personal security information, such as account details and access keys
- Gov - Controlled Unclassified Information (CUI) and classified documents
Things to Consider
Keywords won't cut the mustard
These semantic grammars use extra information around the entity. Context and landmarks are used to disambiguate and to give a confidence score that can later be used to filter out false positives. These can be phrases, single words, or even just characters. Their proximity to the identified entity candidate, as well as the strength of the context, based on natural language processing (NLP) techniques, is used to create the score.
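To make the idea concrete, here is a minimal sketch of proximity-weighted context scoring, assuming a hypothetical landmark list and a simple linear distance decay; IDOL's actual grammars use curated context lists and more sophisticated scoring.

```python
import re

# Hypothetical landmark phrases and weights for a US SSN grammar; real
# grammars ship with curated context lists.
LANDMARKS = {"social security number": 0.9, "ssn": 0.8}

def context_confidence(text, start, end, window=60):
    """Score a candidate entity (at character offsets start..end) by the
    landmarks found near it; closer landmarks contribute more."""
    score = 0.0
    lowered = text.lower()
    for phrase, weight in LANDMARKS.items():
        for match in re.finditer(re.escape(phrase), lowered):
            # Character distance between the landmark and the candidate.
            distance = min(abs(match.start() - end), abs(start - match.end()))
            if distance <= window:
                score += weight * (1 - distance / window)  # linear decay
    return min(score, 1.0)

text = "Employee SSN: 123-45-6789 recorded for payroll."
candidate = re.search(r"\b\d{3}-\d{2}-\d{4}\b", text)
print(context_confidence(text, candidate.start(), candidate.end()))  # ~0.77
```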
You can't have your cake and eat it!
While entity extraction improves the efficiency and scalability of an organization's document compliance, each business use case needs to trade off false positives (entities incorrectly extracted) against false negatives (entities missed). This is often because certain entities of interest have a very common format with no checksum or context to distinguish them.
Another problem is the names of people. Almost any set of letters is a potentially valid name somewhere. Where possible, narrowing down the countries and languages of interest and using prior knowledge of the statistical frequency of common and uncommon names can reduce false positives. Organizations need to evaluate the cost of false positives and/or false negatives in their results and use this information to make trade-offs between time/resources and accuracy.
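As a rough illustration of how a name-frequency prior might feed into that trade-off (the frequency table, combination rule, and threshold below are made up for the example, not IDOL's data or method):

```python
# Illustrative only: a tiny table standing in for real, locale-specific
# statistics on surname frequency.
NAME_FREQUENCY = {"smith": 0.023, "jones": 0.018, "nguyen": 0.011}

def name_prior(candidate, floor=1e-6):
    """Prior probability that a token really is a surname; unknown strings
    fall back to a small floor so strong context can still rescue rare names."""
    return max(NAME_FREQUENCY.get(candidate.lower(), 0.0), floor)

def accept_name(candidate, context_score, threshold=0.01):
    # Combine the frequency prior with the context score; weak on both
    # counts means the candidate is probably a false positive.
    return name_prior(candidate) * (1 + context_score) >= threshold

print(accept_name("Smith", context_score=0.2))   # True
print(accept_name("Qwrtp", context_score=0.2))   # False
```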
Tables pose another issue
Tables in documents can contain large numbers of entities that to the human eye are clearly defined, with strong landmarks and context derived from the column and row headers. It is therefore important that the extracted source text retains information about the underlying structure of the original document. This allows algorithms to adapt their interpretation of proximity when arriving at a confidence score, binding more strongly to the headers. IDOL KeyView, which extracts the underlying text, and IDOL's optical character recognition (OCR) software both pass this important structural information on to the entity extraction engine.
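A simple sketch of how structural information can change scoring, assuming a hypothetical header-landmark list: a candidate in a column headed "SSN" gets a boost even though the header may sit thousands of characters away in the flattened text.

```python
# Hypothetical header landmarks; in practice these would come from the same
# curated context lists the grammar already uses.
HEADER_LANDMARKS = {"ssn", "social security number", "credit card"}

def table_confidence(column_header, base_score, boost=0.4):
    """Treat a matching column header as strong context for every cell in
    that column, regardless of its distance in the raw character stream."""
    if column_header and column_header.strip().lower() in HEADER_LANDMARKS:
        return min(base_score + boost, 1.0)
    return base_score

print(table_confidence("SSN", 0.5))    # 0.9 - structural evidence boosts the score
print(table_confidence("Notes", 0.5))  # 0.5 - unchanged
```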
Speed of delivery
Entities fall into two camps: those that are simple and thus very quick to search for, and those that have a much more complex structure and are much harder to find. Names and addresses fall into the latter camp, with detection times potentially orders of magnitude longer than detecting, say, telephone or passport numbers. One way to speed this up is pre-filtering. In certain circumstances, you can search for a simpler grammar as an indicator of a more complex entity's presence, e.g. ZIP codes for addresses. This creates candidate windows within your document to which the slower, complex search can then be applied, as sketched below. The design of the pre-filter then becomes a trade-off between efficacy and accuracy. This method is ideal for documents with few or even no hits, where it significantly reduces processing time.
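Here is an illustrative pre-filter sketch, using a cheap ZIP-code regex to carve out candidate windows for a slower address grammar; the window radius and the stub address finder are assumptions made for the example.

```python
import re

ZIP_RE = re.compile(r"\b\d{5}(?:-\d{4})?\b")  # cheap indicator pattern

def candidate_windows(text, radius=200):
    """Yield text slices around ZIP-code hits; only these windows are passed
    to the expensive address grammar."""
    for match in ZIP_RE.finditer(text):
        yield text[max(match.start() - radius, 0):min(match.end() + radius, len(text))]

def find_addresses(window):
    # Stand-in for the slow, full address grammar.
    return re.findall(r"\d+ [A-Z][a-z]+ (?:St|Ave|Rd)\b", window)

document = "Invoice.\n\nShip to: 221 Main St, Springfield, IL 62704.\n" + "filler " * 1000
for window in candidate_windows(document):
    print(find_addresses(window))  # only a small slice is ever scanned
```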
Finally, parallelization can be employed to speed things up, especially with multiple grammars.
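For example, independent grammars can be fanned out across workers and their results merged; the toy grammars below are just regex stand-ins for the real thing.

```python
import re
from concurrent.futures import ThreadPoolExecutor

# Toy stand-ins for independent grammars; each scans the document on its own.
GRAMMARS = {
    "phone": lambda text: re.findall(r"\b\d{3}-\d{3}-\d{4}\b", text),
    "zip": lambda text: re.findall(r"\b\d{5}\b", text),
}

def extract_all(text):
    """Run each grammar in its own worker and merge the results by type."""
    with ThreadPoolExecutor(max_workers=len(GRAMMARS)) as pool:
        futures = {name: pool.submit(fn, text) for name, fn in GRAMMARS.items()}
        return {name: future.result() for name, future in futures.items()}

print(extract_all("Call 555-123-4567; office ZIP 94105."))
```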
When to stop looking
Another way of reducing resource requirements is to use entity aggregation to trigger an early exit from the analysis. This involves looking at the count and type, but not the value, of an entity. Once you have found sufficient entities, you can flag the document as sensitive and stop further analysis, e.g. one name and one address, or two credit card numbers. This is useful when you come across large database files full of customer details. Another reason to stop early and move on is when an inordinate amount of time is being taken and no results are returned, which is typical of large "empty" files. This exit strategy becomes even more beneficial when the ingestion chain uses pipelining to open files and start processing immediately, rather than waiting for everything to be loaded.
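A sketch of such an aggregation-based early exit, with hypothetical sensitivity rules (one name plus one address, or two card numbers); the rule format is an assumption for the example.

```python
from collections import Counter

# Hypothetical sensitivity rules: stop as soon as any one rule is satisfied.
EXIT_RULES = [
    {"name": 1, "address": 1},   # one name plus one address
    {"credit_card": 2},          # or two card numbers
]

def is_sensitive(counts):
    return any(all(counts[etype] >= n for etype, n in rule.items())
               for rule in EXIT_RULES)

def scan(entity_stream):
    """Consume (type, value) pairs lazily; only counts and types are
    inspected, never values, and scanning stops as soon as a rule trips."""
    counts = Counter()
    for etype, _value in entity_stream:
        counts[etype] += 1
        if is_sensitive(counts):
            return True  # flag the document as sensitive and skip the rest
    return False

hits = iter([("credit_card", "4111..."), ("name", "A. Person"), ("credit_card", "5500...")])
print(scan(hits))  # True after the second card number
```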
On the face of it, entity extraction seems a simple part of natural language processing, but as with everything in life, doing it and doing it properly are very far apart. IDOL can help you do it properly.