Can computers help Security Force Monitor’s researchers increase their speed and accuracy when extracting relevant data about security forces from the text of news articles and reports?
Over the last few months Yue “Ulysses” Chang, a master’s student at the Data Science Institute at Columbia University, has interned with us to help us explore this question. The quick answer is “yes, 79% of the time.” The longer (and hopefully readable) answer starts in this blog post, the first of a short series.
How do we create data about security forces?
Each week at Security Force Monitor we identify and read hundreds of news articles, reports, maps and datasets – these are the “sources” out of which we pull thousands of little details about organizations, names and locations. We stitch these together to create the rich view of security force structures and commanders that you can search through on WhoWasInCommand.com.
This is time-consuming work. In most cases, after we’ve found a useful source, we just have to read through it, identify the snippets of information we need and then copy, paste or re-type them into our databases. Here are a few paragraphs from a typical source – “Police IG Redeploys AIGs, CPs For April 11 Polls” – published by Channels TV on 10 April 2015:
We can use the information in this source to support the statements below, and enter the relevant values into a database:
- On 10 April 2015 (the publication date of the source) Inspector General of Police, Assistant Inspector General, and Deputy Inspector-General of Police are ranks in the police force in Nigeria.
- On 10 April 2015 Force Public Relations Officer is a title in the police force in Nigeria.
- On 10 April 2015 Suleiman Abba is a person holding the rank of Inspector General of Police (IGP) in Nigeria.
- On 10 April 2015 Aigusman Gwary is a person holding the rank of Assistant Inspector General of Police (AIG) in Nigeria.
- On 10 April 2015 six Deputy Inspectors of Police “coordinate activities” in six geo-political zones.
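One way to picture what “entering the relevant values into a database” means is a small record that ties each claim to a date and to the source that supports it. The sketch below is only an illustration: the class and field names (`Source`, `DataPoint`, `valid_on` and so on) are hypothetical and are not our actual database schema.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Source:
    """A published item that can be cited as evidence (hypothetical fields)."""
    title: str
    publisher: str
    published: date


@dataclass(frozen=True)
class DataPoint:
    """One claim about an entity, valid on a date and backed by a source."""
    entity: str      # who or what the claim is about
    attribute: str   # e.g. "rank" or "title"
    value: str       # the value asserted for that attribute
    valid_on: date   # the date on which the source supports the claim
    source: Source


channels_tv = Source(
    title="Police IG Redeploys AIGs, CPs For April 11 Polls",
    publisher="Channels TV",
    published=date(2015, 4, 10),
)

# The third statement in the list above, expressed as a record.
abba_rank = DataPoint(
    entity="Suleiman Abba",
    attribute="rank",
    value="Inspector General of Police",
    valid_on=channels_tv.published,
    source=channels_tv,
)

print(abba_rank)
```

Keeping the source attached to every record is what lets several sources accumulate behind the same claim over time.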
Every one of these data points has at least a single source. For example, to make our data on units – distinct parts of security forces such as army battalions or police divisions – we looked at around 3,500 unique sources taken from over 200 different publications. From these sources we were able to evidence 25,505 data points with a single source, 2,086 data points with more than 10 distinct sources, and 59 data points with over 40 distinct sources.
We presently cover branches of the security forces of Nigeria, Mexico and Egypt. As we expand our coverage to other countries we will need to consider ways of reducing the time spent and risk of error inherent in this part of our research process. If we can reduce the time we spend searching, cutting and pasting bits of text, then we can spend more time cross-referencing and producing interesting analysis from the data. Could we get more help from computers than we currently do?
“NLP”, “NER” … ?
Computers can read too. Sort of.
Natural Language Processing (NLP) is a long-established field of computer science that looks at how machines relate to people’s speech and writing, and ultimately how they can comprehend information passed to them by a person. The fruits of NLP research power everything from the recommendations you get on search engines to those (irritating) automated voice call systems and the (less irritating) digital voice assistants. Named Entity Recognition (NER) is the sub-field of NLP that gives computers the capability to pick out things that people can recognise in text, like the names of people, organizations and locations, or dates. Could NLP and NER be applied to our work?
We can start exploring this question very quickly by using one of the numerous “off the shelf” NLP and NER toolsets. To test our ideas out we have chosen a toolkit called spaCy. It offers a wide range of functions and is free and open source, so we can use it without direct cost.
Without any modification spaCy can assess text and identify persons, organizations, locations, dates and lots of other types of entities. It can also be trained to improve its ability to detect these entities (for example, by adding a new geographical model), and to identify new entities such as rank and role, or connections between entities. What’s not to love?
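To make the idea of “new entities such as rank and role” concrete, here is a minimal sketch using spaCy’s rule-based `entity_ruler` component. Note the assumptions: it uses the spaCy v3 API, it relies on hand-written patterns rather than the statistical training described above, the `RANK` and `TITLE` labels are our own invention, and the sample sentence is made up for illustration rather than quoted from the source.

```python
import spacy

# A blank English pipeline: tokenizer only, no pretrained NER model.
# Loading a pretrained pipeline instead, e.g. spacy.load("en_core_web_sm"),
# would add PERSON, ORG, GPE and DATE predictions alongside these rules.
nlp = spacy.blank("en")

# Assumption: spaCy v3, where pipeline components are added by name.
ruler = nlp.add_pipe("entity_ruler")

# Hypothetical labels and patterns for ranks and titles in the police force.
ruler.add_patterns([
    {"label": "RANK", "pattern": "Inspector General of Police"},
    {"label": "RANK", "pattern": "Assistant Inspector General of Police"},
    {"label": "TITLE", "pattern": "Force Public Relations Officer"},
])


def find_entities(text):
    """Return (text, label) pairs for every entity the pipeline finds."""
    doc = nlp(text)
    return [(ent.text, ent.label_) for ent in doc.ents]


# Made-up sentence for illustration only.
sample = "The Inspector General of Police, Suleiman Abba, has redeployed officers."
entities = find_entities(sample)
print(entities)
```

Exact-phrase rules like these are brittle (they will miss “Inspector-General”, for instance), which is one reason training a model on existing data is the more interesting route.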
NER and real sources
Let’s give it a try. We can take the text from the sample news article we analysed above, and place it into spaCy. It will highlight different parts of the text that it considers to be entities:
The performance here is OK, but it is not without problems. For example, spaCy correctly picked out all but one of the people mentioned in the article (“Aigusman Gwary”, which it tagged as an Organization rather than a Person). It has also successfully identified Lagos and Bauchi as geo-political entities (“GPE”) but misses “Akwa Ibom” and “Rivers”, and mis-categorizes “Jigawa” as an organization. There are other misses in there too.
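Hits and misses like these can be turned into numbers by comparing the entities a tool predicts against a hand-checked list. The sketch below computes precision, recall and F1 over (text, label) pairs. The two sets contain only the handful of entities named in the paragraph above, so the resulting figures illustrate the arithmetic and are not a measurement of spaCy on the full article. The function name `score_entities` is hypothetical.

```python
def score_entities(gold, predicted):
    """Compare predicted (text, label) pairs against a hand-checked gold set.

    Returns (precision, recall, f1). A prediction counts as correct only
    when both the text and the label match exactly.
    """
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Illustrative subset: what a researcher would have tagged by hand.
gold = {
    ("Suleiman Abba", "PERSON"),
    ("Aigusman Gwary", "PERSON"),
    ("Lagos", "GPE"),
    ("Bauchi", "GPE"),
    ("Akwa Ibom", "GPE"),
    ("Rivers", "GPE"),
    ("Jigawa", "GPE"),
}

# Illustrative subset: the same entities as described in the text above.
predicted = {
    ("Suleiman Abba", "PERSON"),
    ("Aigusman Gwary", "ORG"),  # wrong label
    ("Lagos", "GPE"),
    ("Bauchi", "GPE"),
    ("Jigawa", "ORG"),          # wrong label; Akwa Ibom and Rivers are missed
}

precision, recall, f1 = score_entities(gold, predicted)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Exact matching on both text and label is a strict choice. A more lenient scheme could give partial credit when the span is right but the label is wrong.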
Bringing this into the work of Security Force Monitor
In this post we’ve outlined the challenges we face, and in broad terms the way that we see this set of technologies offering an opportunity to address them. The intriguing question for us is how to take the raw capabilities of NER and have them benefit our research work in specific and effective ways. We have a long list of things on our mind, including:
- The skills and financial costs that will be required to develop, implement and maintain such a system in a way that is reliable and effective.
- Whether we can improve the performance of the algorithm by using the data we have already collected to train spaCy to better pick out what we are looking for in a source.
- How to reconcile the stream of information coming in from NER with the data we already have – for example, what process will we use to figure out if “Jane S Smith” and “Jayne S Smith” are the same person?
- How we evaluate NLP and NER systems so we know whether they are getting better (or worse!).
- The type of workflow and user interface that would be needed to bring these capabilities effectively into our research work so they are actually helpful.
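The reconciliation question in the list above (“Jane S Smith” versus “Jayne S Smith”) can be approached, at its simplest, by scoring how similar two name strings are and flagging close pairs for a researcher to check. The sketch below uses Python’s standard-library `difflib`. It is an assumption about one possible starting point, not a description of what we have built, and the function names and the 0.9 threshold are hypothetical.

```python
from difflib import SequenceMatcher


def name_similarity(a, b):
    """Return a similarity ratio between 0 and 1 for two names.

    Names are lower-cased and whitespace-normalised first so that case
    and spacing differences do not count against the match.
    """
    normalise = lambda s: " ".join(s.lower().split())
    return SequenceMatcher(None, normalise(a), normalise(b)).ratio()


def possible_same_person(a, b, threshold=0.9):
    """Flag a pair of names for human review when they look alike.

    The threshold is a hypothetical value that would need tuning on
    real data.
    """
    return name_similarity(a, b) >= threshold


similar = possible_same_person("Jane S Smith", "Jayne S Smith")
different = possible_same_person("Jane S Smith", "Suleiman Abba")
print(similar, different)
```

String similarity alone cannot establish that two names refer to the same person: it would need to be combined with context such as rank, unit and dates, and with a researcher’s judgement.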
In the next post in this series, Ulysses and I will start digging into these questions and revealing some of the work that we have done so far.
(Article edited on 22 February 2018 to correct typos and clarify wording)