The Basics of Natural Language Processing

In the last post, I covered the origins of artificial intelligence (AI) and machine learning (ML). In this post, I’d like to focus on another sub-discipline of AI: natural language processing (NLP). One of the earliest goals of AI was to create computers and programs that could understand human speech: respond to verbal commands and “speak” in a natural-sounding, human-like way. Early science fiction movies included robots that exhibited human-like understanding. This might have led to a general misconception that NLP was easy when, in fact, it has produced some of the most challenging and intractable problems in all of computer science: speech-to-text transcription, language translation and document summarization, to name a few.

One way to think about NLP is to observe the relationship between text or speech and semantics, or meaning. When you see the word “cat” you probably think of a feline animal – either a specific one, or perhaps just the notion of a cat, domestic or wild. But when you see the sentence “Charlie Parker was the coolest cat to ever play the sax!” you understand the meaning differently.

As with all of artificial intelligence, there have been two fundamentally different approaches to NLP: knowledge representation and machine learning. In NLP, knowledge representation attacked the problem head-on, first by parsing the text, then by using various means of mapping each word or phrase to its meaning. One data structure used for this in AI is called a semantic network. As the name implies, it is a network, with nodes and edges. In a semantic network, each node (called a meme) represents a distinct meaning; the edges represent the relationship of one meaning to another. Pictured below is an example of a semantic network.


A semantic network with the memes and edges labeled.
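
To make the idea concrete, here is a minimal sketch of a semantic network in Python. It is only an illustration, not the tool described in this post: the class name, the relation labels (is_a, has) and the handful of memes are all assumptions chosen for the example. It stores labeled edges between meanings, plus a small lexicon that maps a word like “cat” to more than one meme.

```python
from collections import defaultdict


class SemanticNetwork:
    """A tiny labeled, directed graph: nodes are meanings, edges are relations."""

    def __init__(self):
        # meme -> list of (relation, target meme)
        self.edges = defaultdict(list)
        # surface word -> set of memes the word can refer to
        self.lexicon = defaultdict(set)

    def add_relation(self, source, relation, target):
        self.edges[source].append((relation, target))

    def add_word(self, word, meme):
        self.lexicon[word.lower()].add(meme)

    def meanings(self, word):
        """All memes a word may map to, in alphabetical order."""
        return sorted(self.lexicon.get(word.lower(), set()))

    def related(self, meme, relation):
        """Follow edges labeled `relation` transitively, starting at `meme`."""
        found, stack = [], [meme]
        while stack:
            current = stack.pop()
            for rel, target in self.edges.get(current, []):
                if rel == relation and target not in found:
                    found.append(target)
                    stack.append(target)
        return found


# Hypothetical memes and relations, hand-entered for illustration only.
net = SemanticNetwork()
net.add_relation("FELINE", "is_a", "MAMMAL")
net.add_relation("MAMMAL", "is_a", "ANIMAL")
net.add_relation("FELINE", "has", "FUR")
net.add_relation("JAZZ_MUSICIAN", "is_a", "PERSON")
net.add_word("cat", "FELINE")
net.add_word("cat", "JAZZ_MUSICIAN")

print(net.meanings("cat"))            # ['FELINE', 'JAZZ_MUSICIAN']
print(net.related("FELINE", "is_a"))  # ['MAMMAL', 'ANIMAL']
```

Note that the sketch only lists the candidate meanings of “cat”. Deciding which one applies in a given sentence is the hard part, and every node, edge and lexicon entry still has to be typed in by hand.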

This approach had many challenges. First, creating a reasonably complete semantic network by hand is really, really tedious. Second, figuring out how to accurately map words and phrases to the right meanings turned out to be really, really difficult. In my personal experience, I used a tool based on a semantic network to try to identify risk-related language in documents and highlight it. I was working as part of a team made up of three computer science Ph.D.s and two other people with equivalent experience. It took the five of us approximately six months to create a system that appeared not to miss anything, and not to highlight the wrong things too frequently. This is a good example of why knowledge representation in general has fallen out of favor: the solutions are hand-crafted, usually by experts, and they are brittle, meaning they don’t adapt to changes in (in this case) language. Think about COVID-19: would our system have picked out language with that word in it (assuming that such language tends to be risk-related)? I don’t know.

Again, as with AI in general, the machine learning approach has come to be favored, as it overcomes the challenges of knowledge representation. Within machine learning, there are two main types of tasks: supervised and unsupervised. Supervised learning requires training sets – examples of correct input and output – and produces statistical models that, given similar input, will produce an output and the probability of that output being correct (the confidence). Unsupervised learning works by finding patterns – clusters, trends, anomalies – in very large sets of data. At RDC, we use both approaches: supervised ML for text classification (is this text risk-relevant or not?) and for named entity recognition/entity extraction, and unsupervised ML for custom word embedding.
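
As a rough illustration of supervised text classification, the sketch below trains a very small naive Bayes model and returns a label together with a confidence. This is not RDC’s production approach; naive Bayes is simply one of the easiest supervised models to show in a few lines. The six training sentences, the labels risk and not_risk, and all function names are made up for the example.

```python
import math
from collections import Counter

# Hypothetical, hand-made training set: (text, label) pairs.
TRAINING = [
    ("executive charged with fraud and money laundering", "risk"),
    ("company fined for bribery of officials", "risk"),
    ("former director arrested in embezzlement case", "risk"),
    ("company opens new office downtown", "not_risk"),
    ("executive speaks at charity dinner", "not_risk"),
    ("director announces quarterly product launch", "not_risk"),
]


def tokenize(text):
    return text.lower().split()


def train(examples):
    """Count words per label; these counts are the whole 'model'."""
    word_counts = {}          # label -> Counter of words seen with that label
    label_counts = Counter()  # label -> number of training texts
    vocab = set()
    for text, label in examples:
        label_counts[label] += 1
        counts = word_counts.setdefault(label, Counter())
        for word in tokenize(text):
            counts[word] += 1
            vocab.add(word)
    return word_counts, label_counts, vocab


def predict(model, text):
    """Return (most likely label, probability of that label)."""
    word_counts, label_counts, vocab = model
    total_docs = sum(label_counts.values())
    log_scores = {}
    for label in label_counts:
        total_words = sum(word_counts[label].values())
        score = math.log(label_counts[label] / total_docs)
        for word in tokenize(text):
            if word not in vocab:
                continue  # ignore words never seen in training
            # Laplace (add-one) smoothing avoids zero probabilities.
            numerator = word_counts[label][word] + 1
            score += math.log(numerator / (total_words + len(vocab)))
        log_scores[label] = score
    # Turn log scores into probabilities that sum to 1.
    top = max(log_scores.values())
    weights = {lab: math.exp(s - top) for lab, s in log_scores.items()}
    norm = sum(weights.values())
    probs = {lab: w / norm for lab, w in weights.items()}
    best = max(probs, key=probs.get)
    return best, probs[best]


model = train(TRAINING)
label, confidence = predict(model, "director charged with bribery")
print(label, round(confidence, 2))
```

With so little training data the confidence is not meaningful in itself. The point is the shape of the workflow: labeled examples go in, and a model comes out that returns both an answer and a probability.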

At RDC, our media analysts select the attributes of a risk event – the perpetrator, the event type, the event stage, the date, location, etc. – and add them to the GRID database. Entity extraction for common entity types, such as PERSON, ORGANIZATION, MONEY and GEOPOLITICAL ENTITY (like a nation or state), comes out of the box with many of the NLP libraries available for languages commonly used in machine learning, like Python and R. Adding new entity types, like perpetrator and risk-event-type, requires processing a very large number of documents through a neural network, which identifies patterns in the language – some of which identify the entities of interest to us. Pictured below is an example of this bespoke entity extraction.

In the text, places are highlighted in green, organizations in blue, people in light-orange and perpetrators in darker orange. This can help a media analyst to identify the risk-relevant attributes and add them to a GRID profile.
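
The highlighting step itself can be sketched in a few lines of Python. To keep the example self-contained, a hypothetical lookup table stands in for the trained neural network, so this shows only how extracted spans become highlighted text, not how the entities are actually found. The names, the labels and the [text|LABEL] markup (used here instead of colors) are all invented for illustration.

```python
import re

# Hypothetical lookup table standing in for a trained entity extraction model.
GAZETTEER = {
    "John Doe": "PERPETRATOR",
    "Acme Bank": "ORGANIZATION",
    "Geneva": "PLACE",
}


def extract_entities(text, gazetteer=GAZETTEER):
    """Return sorted, non-overlapping (start, end, label) character spans."""
    spans = []
    for name, label in gazetteer.items():
        pattern = r"\b" + re.escape(name) + r"\b"
        for match in re.finditer(pattern, text):
            spans.append((match.start(), match.end(), label))
    spans.sort()
    result, last_end = [], 0
    for start, end, label in spans:
        if start >= last_end:  # drop any span that overlaps an earlier one
            result.append((start, end, label))
            last_end = end
    return result


def highlight(text, spans):
    """Wrap each span as [text|LABEL], a plain-text stand-in for color."""
    pieces, cursor = [], 0
    for start, end, label in spans:
        pieces.append(text[cursor:start])
        pieces.append(f"[{text[start:end]}|{label}]")
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)


article = "John Doe was arrested in Geneva after an audit at Acme Bank."
spans = extract_entities(article)
print(highlight(article, spans))
# [John Doe|PERPETRATOR] was arrested in [Geneva|PLACE] after an audit at [Acme Bank|ORGANIZATION].
```

A lookup table like this only recognizes names it has already been given, which is the same brittleness that affected the hand-built approach. A trained model is meant to pick out entities it has never seen by relying on the language around them.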

Over time, improving the accuracy of entity extraction significantly improves our adverse media article processing by semi-automating the information extraction.

John Chung, RDC Data Scientist, also contributed to this post.