The Chicken and the Egg of Named Entity Recognition
Using algorithms to identify and pull out certain elements of text can be easy - what's difficult is getting the data right.
8/13/19 @ 5:25PM CST
A rendering of SpaCy's Named Entity Parser in HTML
SpaCy

In this post, I want to discuss the differences between 1) rule-based systems, 2) supervised automata, 3) hybrid active semi-supervised automata, and 4) fully unsupervised methods. The main issue with NER as it stands is that to get the data to train a model, you need to have the model already trained.

Rule-based systems are very similar to find & replace algorithms, with a foundation in regular expressions. Regular expressions use literal and special characters to find specific patterns in a text corpus. The patterns are identified using 'rules' (i.e. every instance of 'X' to be labeled follows X's specific pattern). Early automated annotation systems were often rule-based, but this poses several problems for domain-specific NER tasks, or anything beyond analyzing large bodies of 'low-intensity text' - low-intensity in this case meaning news articles, excerpts of novels, and other bodies of writing that use common language. Rule-based systems are brittle and often can't capture any kind of nuance or diversity in the text they encounter. They also require a lot of manual effort to construct, domain-specific expertise, and a decent amount of knowledge of RegEx, which can become quite complex for certain tasks. Rule-based systems have an advantage in their efficiency and speed - especially for smaller tasks like find and replace - but when they're asked to do more complex work, they aren't up to the job.
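To make the rule idea concrete, here's a minimal sketch of a regex-based 'NER' pass in Python. The patterns and the example sentence are illustrative assumptions, not a real rule set - a production system would need many more rules and exceptions.

```python
# A minimal sketch of a rule-based "NER" pass using regular expressions.
# The patterns below are illustrative assumptions, not a production rule set.
import re

# Each rule maps an entity label to a regex pattern.
RULES = {
    "DATE": re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),      # e.g. 8/13/19
    "MONEY": re.compile(r"\$\d+(?:,\d{3})*(?:\.\d{2})?"),     # e.g. $12,500.00
    "ORG": re.compile(r"\b(?:Inc|Corp|LLC|Ltd)\.?\b"),        # crude organization suffix cue
}

def rule_based_ner(text):
    """Return (label, matched_text, start, end) tuples for every rule match."""
    entities = []
    for label, pattern in RULES.items():
        for match in pattern.finditer(text):
            entities.append((label, match.group(), match.start(), match.end()))
    return sorted(entities, key=lambda e: e[2])

print(rule_based_ner("Acme Corp. was paid $12,500.00 on 8/13/19."))
```

Even in this toy form you can see the brittleness: any date written as "August 13th" or any organization without a legal suffix slips straight past the rules.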

The next system of applied NER is supervised automata, which comprise the majority of mainstream, off-the-shelf NER tools online. SpaCy is an excellent example of this. To build a supervised automated system, a corpus from the desired domain is manually annotated with the entity types you wish to identify. These annotations - examples of the named entities that you want to locate and extract - are used to train a model to recognize the same patterns. Entities can be locations, expressions of dates or times, or anything else that can be classified. Practitioners of supervised automata typically use pre-annotated general datasets obtained from machine learning hubs like Kaggle. This reliance on general datasets is a big part of why the widespread use of supervised automata is limited to general knowledge extraction or - like I mentioned in the last section - low-intensity writing. Supervised automata typically begin by using the dataset to train a simple 'memorizer' baseline through ML packages like sklearn and evaluating it with 5-fold cross-validation. Scores like precision, recall, and the F1 score are used to determine the model's success and are common in NLP tasks. Machine learning is then used to improve generalization - the system's ability to handle new, previously unseen text - which shows up largely as better recall. While formerly a very difficult task, packages like sklearn make it easy to convert the data into simple feature vectors (i.e. numbers) that a ML approach can use. Creating a feature map (i.e. a matrix of numbers) is relatively simple by combining sklearn's RandomForestClassifier, LabelEncoder, and Pipeline with NumPy arrays (plus a simple custom memory-tagger baseline). Results from sklearn can be further improved with more sophisticated methods like conditional random fields, neural networks, long short-term memory (LSTM) networks, character embeddings, residual LSTMs, ELMo, and BERT (<- yes, those are names of Sesame Street characters, it's an inside joke among NLP engineers).
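As a rough illustration of the feature-vector step, here's a sketch of a per-token classifier built with sklearn's RandomForestClassifier and evaluated with cross-validation. The toy token/tag pairs and hand-rolled features are assumptions for demonstration only; a real system would use a proper annotated corpus, richer features, and 5-fold rather than 2-fold cross-validation.

```python
# A minimal sketch of the per-token "feature vector" approach, using scikit-learn.
# The toy tokens and the hand-rolled features are illustrative assumptions.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import classification_report

# Toy (word, label) pairs -- in practice these come from an annotated corpus.
tokens = [("Paris", "LOC"), ("is", "O"), ("lovely", "O"), ("in", "O"), ("May", "DATE"),
          ("Obama", "PER"), ("visited", "O"), ("Berlin", "LOC"), ("in", "O"), ("2016", "DATE"),
          ("Merkel", "PER"), ("spoke", "O")]

def to_features(word):
    """Convert a single token into a simple numeric feature vector."""
    return np.array([
        word.istitle(),   # capitalized -> often a name
        word.isdigit(),   # all digits -> often a date or number
        len(word),        # word length
        word.isupper(),   # acronyms
    ], dtype=float)

X = np.array([to_features(w) for w, _ in tokens])   # the "feature map"
y = np.array([tag for _, tag in tokens])

# 5-fold cross-validation would be typical; 2 folds here because the toy set is tiny.
clf = RandomForestClassifier(n_estimators=50, random_state=0)
pred = cross_val_predict(clf, X, y, cv=2)
print(classification_report(y, pred, zero_division=0))
```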

Manually annotating large corpora to create training data is expensive and requires new annotations every time a new named entity type is defined or a new field-specific body of text needs to be covered. Supervised automata would actually be the perfect approach if you could get your hands on perfectly annotated data every time you wanted to work with named entities. Unfortunately - and especially in highly paid, highly technical fields like mathematics, clinical studies, law, and the less data-driven humanities - it's really difficult not only to find someone willing to annotate that much information at low cost, but also to determine what is worth annotating in the first place. Some scientists have developed workarounds for this - figuring out proxy information to substitute for specific labels is a big part of what data engineers do. One significant approach is matching existing dictionaries or repositories of known named entities against the text, with mechanisms for unsupervised disambiguation based on contextual evidence. However, these methods largely apply only to the low-intensity corpora that I mentioned above. They struggle with high-intensity corpora for two reasons. First, named entities in these fields (clinical studies, law) have many linguistic variations - there are often several ways to refer to the same named entity - which makes it difficult for semi-supervised automata to work well. Second, high-intensity named entities are often multi-token terms with nested structures that include several other entities inside them, which makes determining the boundaries of these structures a challenge in and of itself.
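Here's what the dictionary-matching idea can look like in practice - a minimal sketch using spaCy's PhraseMatcher (spaCy v3 API assumed). The tiny gazetteer and example sentence are made-up stand-ins for a real resource like UMLS, and no disambiguation step is shown.

```python
# A minimal sketch of dictionary (gazetteer) matching with spaCy's PhraseMatcher.
# The tiny "gazetteer" below is an illustrative assumption; in practice it would
# come from a curated resource like UMLS.
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")                            # tokenizer only, no trained model needed
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")   # case-insensitive matching

gazetteer = {
    "DISEASE": ["type 2 diabetes", "hypertension"],
    "DRUG": ["metformin", "lisinopril"],
}
for label, terms in gazetteer.items():
    matcher.add(label, [nlp.make_doc(t) for t in terms])

doc = nlp("The patient with Type 2 Diabetes was started on metformin.")
for match_id, start, end in matcher(doc):
    print(nlp.vocab.strings[match_id], "->", doc[start:end].text)
```

Notice that "type 2 diabetes" only matches because the surface form happens to line up with the dictionary entry - the linguistic variation and nesting problems described above are exactly what this kind of exact matching can't handle.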

That leads me to #4 - completely unsupervised automata based on computational pattern recognition. Current systems are typically derived from language formatting hypotheses - one such example being Zhang and Elhadad 2013, which observed that named entities in the same class tend to have similar vocabulary and occur in similar contexts [1]. Similar systems typically start with a resource describing all of the possible terms that a named entity can consist of - like the clinical UMLS (Unified Medical Language System) - and the terms are then matched wherever they occur as noun phrase chunks in a corpus. A signature is created for each seed term in the form of a vector representation based on the inverse document frequencies (IDF) of the words occurring within the term and the words occurring in the contexts in which the term appears in the corpus. The context of a term occurrence is defined as the previous and the next two words. A signature of the target entity class is then computed by averaging the signatures of all its seed terms. During testing, the method first computes a signature for a candidate entity and then computes its cosine similarity with the signatures of all the entity classes. The candidate entity is assigned the entity class with which it has the highest cosine similarity, as long as that similarity is above a predetermined threshold. This is but one method of creating an unsupervised NER - combining a database of potential terms, vectorizing them based on their IDF given a certain context, deriving their signatures, and assigning a label based on cosine similarity measures. Improvements to this method can be added through machine learning applied to the automatically annotated corpus, allowing the system to recognize named entities that occur in varying contexts. Full parsing of the vocabulary can also improve the potential matches.
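A simplified sketch of that signature-and-cosine-similarity pipeline is below. It's written in the spirit of the method described above, not as a reproduction of Zhang and Elhadad's implementation: the seed terms, contexts, and candidate are toy assumptions, and sklearn's TfidfVectorizer stands in for the IDF bookkeeping.

```python
# A simplified sketch of the signature-and-cosine-similarity idea.
# Seed terms, contexts, and the candidate are toy assumptions.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Each seed term is represented by its own words plus the words around its
# corpus occurrences (here faked as short context strings).
seed_terms = {
    "DISEASE": ["chronic kidney disease diagnosed with stage",
                "type 2 diabetes history of poorly controlled"],
    "DRUG":    ["metformin 500 mg started on daily",
                "lisinopril 10 mg continued on tablet"],
}

corpus = [text for texts in seed_terms.values() for text in texts]
vectorizer = TfidfVectorizer().fit(corpus)   # IDF weights learned from the seed "corpus"

# Class signature = average of the signatures of its seed terms.
signatures = {
    label: np.asarray(vectorizer.transform(texts).mean(axis=0))
    for label, texts in seed_terms.items()
}

def classify(candidate_with_context, threshold=0.1):
    """Assign the class whose signature is most similar, if above the threshold."""
    vec = vectorizer.transform([candidate_with_context]).toarray()
    best_label, best_score = None, threshold
    for label, sig in signatures.items():
        score = cosine_similarity(vec, sig)[0, 0]
        if score > best_score:
            best_label, best_score = label, score
    return best_label, best_score

print(classify("stage 3 chronic kidney disease diagnosed"))
```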

The final form of named entity recognition is active learning semi-supervised models - a way to benefit from the extensive literature and proven success of supervised models while avoiding the 'chicken and egg' situation where, in order to get the large amounts of data required to train the models, you need to have developed the model in the first place. Examples of this kind of named entity recognition include training on automatically generated annotations, or focusing solely on relation extraction. The active learning component exists in tools like SpaCy's Prodigy, a way of reducing the manual annotation effort (despite still requiring some kind of manual annotation to begin with). The learning process grows the corpus of data interactively and wisely by making the system ask for the most helpful examples for the human to annotate, or by reducing the annotation process to a simple binary decision (i.e. does this example match or not). Other methods of reducing the annotation burden include pre-annotation, where a system first automatically annotates a corpus that a human then corrects. Provided that the pre-annotated corpus is of high quality, this can save a large amount of effort.
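To show what 'asking for the most helpful examples' means mechanically, here's a generic uncertainty-sampling loop - a sketch of the active learning idea, not of how Prodigy works internally. The feature vectors and the 'human' labels are randomly generated stand-ins.

```python
# A minimal sketch of uncertainty sampling: the model asks a human to label the
# pool examples it is least sure about. Generic illustration, not Prodigy's internals.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Toy feature vectors for an unlabeled pool and a small labeled seed set.
X_pool = rng.normal(size=(200, 4))
X_seed = rng.normal(size=(10, 4))
y_seed = np.array([0, 1] * 5)                  # 0 = not an entity, 1 = entity

for round_ in range(3):
    model = LogisticRegression().fit(X_seed, y_seed)
    probs = model.predict_proba(X_pool)[:, 1]
    uncertainty = np.abs(probs - 0.5)          # closest to 0.5 = least confident
    ask = np.argsort(uncertainty)[:5]          # the 5 most useful examples to annotate

    # In a real tool the human answers a binary accept/reject question here;
    # we fake the annotation with random labels.
    new_labels = rng.integers(0, 2, size=len(ask))
    X_seed = np.vstack([X_seed, X_pool[ask]])
    y_seed = np.concatenate([y_seed, new_labels])
    X_pool = np.delete(X_pool, ask, axis=0)
    print(f"round {round_}: labeled set now has {len(y_seed)} examples")
```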

The unsupervised learning approach shows the most promise. This method could greatly reduce the manual effort and cost of building high-intensity NER models, which is really the only worthwhile application of NER for information-gathering purposes (low-intensity NER is good for later downstream tasks). Unsupervised models can also be used to strengthen semi-supervised models by providing more annotations automatically (thus solving the chicken and the egg).

NER
Big Data
Code
Machine Learning
Read Next:
  • Learning for clinical named entity recognition without manual annotations
  • Unsupervised biomedical named entity recognition: Experiments with clinical and biological texts
  • Highest Rated ML Projects on Github