
Image: Alexa Steinbrück / Better Images of AI / Explainable AI / Licensed by CC-BY 4.0
I’m pleased to release a technical working paper we’ve written on Natural Language Processing (NLP) and our research into abuses by police and armies. The paper explores how an NLP system can be deployed to extract information from sources on security force units and their related personnel. It is a practical contribution to the ongoing conversation about the tactical use of AI in human rights work. The experiment detailed in the paper shows some tantalizing possibilities, but also reveals key challenges often overlooked in the conversation around deploying technology for human rights: sustainability, and the creation of new working methods to harness new technologies.
Our paper – “NLP in Human Rights Research – Extracting Knowledge Graphs About Police and Army Units and Their Commanders” – is the outcome of a collaboration between Security Force Monitor, Dr Daniel Bauer of the Department of Computer Science at Columbia University, and Yueen Ma, a postgraduate student in the same department. It would not have been possible to do this work without their expertise and support.
Download our paper from arXiv or by clicking the image below.
Before digging into the paper and what it means for SFM’s work, I’d like to give a bit of context showing how we got to this point.
Round 1: Tentative NLP experiments
At SFM, we rely almost exclusively on text-heavy digital documents (websites, PDFs, spreadsheets) to evidence the data we create. We read these documents and extract scraps of information on the organizational structure, personnel, and geographical footprint of security and defence forces at different points in time. These scraps are then entered into a graph-like data structure, and published on WhoWasInCommand.com, our public data resource on security forces. This is a foundational part of our work, and takes a lot of our time. Reducing the human time and resources required to do this would be a massive win for us.
For years we have wondered whether a cluster of technologies and methods collectively known as Natural Language Processing (NLP) might help speed up this step. We have long been gripped by the idea that the advanced, evolving technologies of artificial intelligence – of which NLP is a part – could and should be more useful to us. But what does this mean in practice for human rights groups? Although there is a small body of NLP work from within the domains of human rights and investigative journalism, its focus has been on entity resolution, topic identification and sentiment analysis rather than detailed data extraction, reducing its relevance to our particular challenge.
Way back in 2018, we conducted some early NLP and machine learning experiments attempting to recreate our research workflow of reading a text, extracting scraps of information, and structuring those scraps into data. This first round of experiments showed us the potential of using NLP in our own research, while also throwing up complex problems for which, at the time, we had no clear path towards a solution.
Our first experiments used a free, open source NLP framework (called spaCy) to pick out “persons” and “units” from within a text. This capacity is known as Named Entity Recognition (NER) and while it looked like a promising tool for building the foundations of our datasets, we quickly ran into multiple challenges.
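To make that concrete, here is a minimal sketch of entity extraction with spaCy. It uses spaCy’s stock English model and its generic labels (PERSON, ORG and so on) rather than the custom “person” and “unit” labels we worked towards, and the example sentence is invented, so treat it as an illustration of the approach rather than our actual pipeline.

```python
import spacy

# Stock English pipeline; a project-specific model would be trained to
# emit "person" and "unit" labels instead of the generic ones used here.
nlp = spacy.load("en_core_web_sm")

# An invented example sentence (the name is a placeholder).
text = "Major General John Doe was posted to 3 Division of the Nigerian Army."

doc = nlp(text)
for ent in doc.ents:
    # Each entity comes with its text, a label and character offsets.
    print(ent.text, ent.label_, ent.start_char, ent.end_char)
```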
We did not at that time have a dataset we could use to train, improve and – most critically – evaluate the performance of the NER algorithm. To do this, we would need a dataset that told us precisely which text in a source a value (such as a unit or person name) was extracted from. The data we did have at that time told us only which datapoint came from which source – that’s just the way our data is structured. Remedying this would mean going back to the original source texts and annotating them in ways that the NER algorithm could understand. Without such training data, we would also be unable to extend the NLP system to pick out relationships between the units and persons it was able to identify. This relationship extraction task better mirrors the actual work that we have to do “by hand”, and we felt it would be a better test of the technology’s potential for us than entity extraction alone. It is also an order of magnitude more complex, and in 2018 we did not have the methodological support or expertise to address it.
However, by the end of this first round of experimentation with NLP we at least had a far better understanding of the challenges and how to approach them. The development process helped us imagine the sort of data handling pipelines into which NLP capabilities could be included. It helped us consider how to obtain and clean up website text, design user interfaces to review and correct suggestions made by NLP tools, integrate newly-obtained data into our existing datasets, and create virtuous loops that would use the interaction between ourselves and the NLP tools to improve the system’s performance.
Round 2: Much better NLP experiments
We started the second round of NLP experiments by partnering with Dr Daniel Bauer, an NLP expert at the Department of Computer Science at Columbia University. With his advice, we tightly focused the research on the extraction of information about the relationships between persons and units – whether this person was posted to that unit. This is one of a number of elemental data creation tasks that SFM does; others include examining the connections between units, establishing where (and when) units are located, and tracking the ranks of persons within the same unit through time. We excluded all of these from the research task, focusing only on that single question. We separated the system design into two main processes that could be worked on sequentially and evaluated independently: first, the named entity recognition steps (identifying the name of a person or unit), and then the relationship extraction steps (determining that this person is posted to that unit).
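In code terms, that separation might look something like the skeleton below: two independently testable stages chained together. This is only a schematic of the design, with placeholder bodies, not our actual implementation.

```python
# Schematic of the two-stage design: a named entity recognition step
# feeds a relationship extraction step, and each stage can be evaluated
# on its own. The bodies are placeholders, not our implementation.

def extract_entities(text):
    """Stage 1: return person and unit mentions with character offsets."""
    raise NotImplementedError  # e.g. a custom-trained NER model

def extract_relations(text, entities):
    """Stage 2: return (person, posted_to, unit) links between mentions."""
    raise NotImplementedError  # e.g. a rule-based baseline or a neural model

def process(text):
    entities = extract_entities(text)
    relations = extract_relations(text, entities)
    return entities, relations
```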
Learning the lesson from our first experiment, we created a training dataset and source text corpus, which we have published online (along with exhaustive documentation about how we made it). The training dataset is based on the text of the 130 most information-rich articles about security and defence forces in Nigeria. We made over 3,600 annotations to this body of text, highlighting units and persons and the relationships that exist between them. This annotated text provides a “gold standard” against which the data pulled out by the NLP system can be compared and assessed.
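To give a feel for what a “gold standard” annotation looks like, here is a purely hypothetical annotated record – the sentence, field names and labels are invented for illustration; the published dataset defines its own actual format.

```python
# A hypothetical annotated record. The sentence, field names and labels
# are invented for illustration; see the published nlp_starter_dataset
# repository for the real format.
example = {
    "text": "Major General John Doe was posted to 3 Division.",
    "entities": [
        {"id": "P1", "label": "PERSON", "start": 14, "end": 22},  # "John Doe"
        {"id": "U1", "label": "UNIT", "start": 37, "end": 47},    # "3 Division"
    ],
    "relations": [
        # The relationship we care about: this person is posted to that unit.
        {"type": "POSTED_TO", "head": "P1", "tail": "U1"},
    ],
}
```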
The limitations of this training data are obvious: the sources are English language only, Nigeria is the only country of focus, and the number of sources is small compared to the typical NLP project (NLP systems are often trained on massive document collections). As time and resources become available we can extend the training data to include different languages, security forces from different countries, and different types of entities and relationships between them. However, it provided a resource of sufficient quality to move the research task forward. We were able to boost the training dataset a bit by adding some more examples of text describing persons and units drawn from our own dataset, as well as entries from a standard named entity recognition dataset (CoNLL-2003). This gave the NLP system more to work with.
With more expertise, a tighter research focus, and stronger foundations we have made a great deal more progress, which I will discuss below.
What the results do and don’t tell us
The working paper outlines the context of SFM’s work and the research challenge we set out to address, and details the available training resources and evaluation criteria. After this, it becomes a technical document aimed at NLP and machine learning practitioners, giving a detailed description of the methods we used to extract entities from the text corpus and find the relationships between them. The relationship extraction task is the more complicated of the two processes. We look at how three different approaches (“nearest person”, “shortest dependency path”, and “neural network”) perform against this task, showing visually the different types of mistakes each makes and the adaptations we made in response.
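For readers who want a feel for the simplest of the three, a “nearest person” baseline boils down to a heuristic like the sketch below: link each unit mention to whichever person mention sits closest to it in the text. This is an illustrative simplification, not the code from our repository.

```python
# Illustrative "nearest person" heuristic: attach each unit mention to
# the person mention closest to it in the text. A simplification of the
# idea, not the implementation in sfm-graph-extractor.

def nearest_person_relations(entities):
    """entities: dicts with a "label" ("PERSON" or "UNIT") and character
    offsets "start"/"end", as produced by an upstream NER step."""
    persons = [e for e in entities if e["label"] == "PERSON"]
    units = [e for e in entities if e["label"] == "UNIT"]
    relations = []
    for unit in units:
        if not persons:
            break
        # Pick the person whose span starts nearest to this unit's span.
        closest = min(persons, key=lambda p: abs(p["start"] - unit["start"]))
        relations.append({"type": "POSTED_TO", "person": closest, "unit": unit})
    return relations
```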
We then evaluate each of the processes with two main metrics: “precision” and “recall”. An algorithm’s precision tells us the proportion of correct predictions it made in those cases where it pulled out an entity or established a relationship between entities. A value of 1.0 is “perfect”, meaning it got it right every time it guessed! Recall tells us what proportion of the entities and relationships in the “gold standard” data – the sources we annotated ourselves – the algorithm managed to find. Again, a value of 1.0 means it found all of them.
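For anyone who wants the arithmetic pinned down, both metrics are simple ratios over true positives (correct predictions), false positives (spurious predictions) and false negatives (annotations the system missed). The counts below are made up purely to show the calculation and are not taken from the paper.

```python
# Precision and recall from raw counts. These numbers are invented to
# illustrate the arithmetic; the real figures are in the working paper.
true_positives = 82   # predictions that match a gold-standard annotation
false_positives = 18  # predictions with no matching annotation
false_negatives = 18  # gold-standard annotations the system missed

precision = true_positives / (true_positives + false_positives)  # 0.82
recall = true_positives / (true_positives + false_negatives)     # 0.82

print(f"precision={precision:.2f}, recall={recall:.2f}")
```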
What do these metrics tell us about our specific NLP system? From the paper, here is the evaluation of the Named Entity Recognition (NER) model – this picks out the names of units, persons and terms that could be connections (like rank and role). Reading the data below, the precision in “All Classes” is 0.82, which means that when the model makes a prediction it gets it right 82% of the time (around 4 in every 5 predictions). The recall values are similar, meaning that the algorithm missed 18% (around 1 in 5) of the entities that we had tagged in the training data.

Also clipped from the paper is our evaluation of the Relationship Extraction (RE) algorithm, which guesses when a person is posted to a unit. This compares the performance of a number of different approaches to the task. The precision of all approaches is noticeably lower than for the NER task. The highest precision value is 0.71, which means that in cases where the algorithm made a guess about how entities were connected, it got it right 71% of the time (just below 3 in 4 cases). Overall, though, the recall values are comparable with those of the NER model: the approaches identified between 70% and 83% of the relationships we had tagged in the training data.

Even with the imperfections in the accuracy of the system’s outputs, there is promise. It was able to deliver these results in x seconds on a laptop, as compared to the many hours it would take a human researcher to read the 130 text sources and accurately extract data from them. With more work we may find ways of improving the performance; however, it is likely that no system will ever have perfect accuracy. At this point the question we need to ask is: “What do we do with algorithms that guess correctly about ¾ of the time and have a blind spot for between ⅕ and ¼ of possible entities?”
To answer this we need to return to the idea that motivates this research: the need to reduce the time and human resources it takes to extract data from text sources on an ongoing basis. This task is a combination of human judgement (to identify the correct data to extract) and mechanical work (to extract it). A perfect NLP system would accomplish both fully, like our researchers can: it would identify the right data and extract it accurately into a database. However, the limits of the NLP system, as quantified in the working paper, show that this isn’t possible. The system will find a great deal of the data, and will be correct some but not all of the time. Where it is correct, it can extract data instantly.
To bridge this gap, we need to design an interplay between an NLP system, which makes suggestions about what it finds in the text, and a human reviewer, who might accept, correct or reject each suggestion (or claim). When we were first looking at this problem, this is the sort of “human in the loop” user interface we came up with.

It’s not great, but I think it captures the right concept: let the NLP system do the “grunt” work and have a human quickly refine and improve the results. Anyhow, we are getting ahead of ourselves. So, what next?
What do we do next?
There is a big gap between the system we have designed here and a technology that we might be able to incorporate into our work, and we have to ask some difficult questions about long term value and costs:
- Is our own data management approach and toolset sufficiently malleable to even incorporate an NLP system? If not, what would need to be implemented and at what cost?
- If we push out from the limitations in scope (specific task, English language only) that frame this part of our experiment, would we still be able to obtain results that pass the threshold of usefulness, or would we find the system breaks down?
- In what ways can the experimental system we have built be improved, or surpassed and re-engineered to take advantage of new methods and tools that may have evolved since we started this project, and are we able to support that cost?
- Are the specific capacities of the experimental system likely to be commodified – made widely available and easily accessible – any time soon, and is the entry point for this type of technology likely to become sufficiently affordable, transparent, ethically sound and sustainable for groups like ours?
Presently, the answer to all these questions is probably “not yet”, but we’re keen to keep working on pushing those answers towards “maybe” and “yes”. The experimental results show us that to discount the technology completely would be to deprive ourselves of useful possibilities in the future. The better view, perhaps, is simply that we should persist in expanding what we know about NLP’s value to our work. We’ll keep the experiments running and use them to inform our decisions around technology and methods.
We also recognise that it remains unusual for human rights groups to delve into technologies like NLP from the perspective of how they can be tactically useful, rather than how they can be harmful. One of our intentions in releasing this paper, data and code is to give anyone else in our sector who is considering the use of NLP an idea of how to explore the question.
Finally, there is another community that is important to this work: NLP researchers. The experiment we have run may not address novel issues in NLP technology or methods, but it provides a new domain and dataset for their applications. The needs, constraints and concerns of human rights groups – and perhaps civil society organizations more generally – provide valuable new angles on the societal value of NLP. In particular, ethical quandaries around how training datasets are constructed, the human rights implications of the deployment of NLP systems, and dependence on corporate technologies are much more prominent drivers of decision making in human rights groups than in commercial domains. We hope this experiment sparks a conversation with the NLP community about these issues and more.
Resources mentioned in this blog post
All the documents, code and data we have referenced in this blogpost can be found at the links below:
- “Working Paper: NLP in Human Rights Research – Extracting Knowledge Graphs About Police and Army Units and Their Commanders.” (Download: PDF, arXiv)
- NLP system and model, on Github: sfm-graph-extractor
- Python package, on PyPI: extract-sfm
- NLP training dataset, on Github: nlp_starter_dataset