WhoWasInCommand shows you all the sources used to evidence every piece of data it provides.
When you’re browsing your favourite units and commanders on WhoWasInCommand.com – like Operation Lafiya Dole, for example – just hover your mouse over (or tap on, if you’re on a mobile or tablet) the bit of data you’re interested in and this happens:
This interaction gives you a lot of useful information:
The green circle tells you that we have rated this bit of data as “High Confidence” (meaning it is drawn from a wide variety of sources of different types)
The pop-over that appears tells you how many sources there are
You can scroll to see all the sources, along with a link to each source’s original URL (even if it’s now dead) and a link to the copy of the source we made by submitting it to the Internet Archive
But it’s not my view that counts – it’s your view as a user that matters.
We get a lot of questions about our sources. Whilst this feature is a practical way to answer them, I suspect that a lot of users don’t use it, either because it’s not immediately apparent that it’s there, or because it’s not how they would think to go looking for sources.
We ought to sit down with people who use WhoWasInCommand, watch how they use the site, and ask them for ideas about how we can make these sorts of features clearer.
Most of the time we use OpenStreetMap (OSM) as a gazetteer; that is, a means of representing the geographical aspects of Security Force Monitor’s data.
For example, our research indicates that the Mexican army unit 105 Batallón de Infantería had a base in Frontera, Coahuila, Mexico from 24 February 2014. To geocode this data we will search OSM to find the nearest “object” to the named settlement – in this case a “node” called Frontera (ID number 215400772) – and link it to the unit as a base. Our Research Handbook contains the rules we use for doing this.
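The “nearest object” step can be pictured in a few lines of Python. This is only an illustrative sketch, not the Monitor’s actual tooling: the second candidate and both sets of coordinates are hypothetical stand-ins, and only the Frontera node ID comes from the example above.

```python
from math import asin, cos, radians, sin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometres."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))

def nearest_object(settlement, candidates):
    """Pick the OSM object whose coordinates are closest to the settlement."""
    lat, lon = settlement
    return min(candidates, key=lambda c: haversine_km(lat, lon, c["lat"], c["lon"]))

# Hypothetical candidates from a place-name search; coordinates are illustrative
candidates = [
    {"id": 215400772, "name": "Frontera", "lat": 26.93, "lon": -101.45},
    {"id": 999999999, "name": "Frontera (elsewhere)", "lat": 18.53, "lon": -92.65},
]
base = nearest_object((26.92, -101.44), candidates)
```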
When we publish the data on WhoWasInCommand.com it will be displayed in the “Sites” section of the record for 105 Batallón de Infantería along with all the sources that evidence it:
So far we have found OSM to be a good enough gazetteer. And it’s free. And it’s open licensed. And we can fix it if we need to. So you won’t find us moaning and whinging.
However, OSM has a number of issues with accuracy, coverage and change over time so we do not use OSM as a primary source of information. Instead we use it as one of a number of sources of lead information which help us piece together the geographical footprint of a security force. It’s why, for example, we don’t place 105 Batallón de Infantería directly at Venustiano Carranza International Airport, even though this is the case on OpenStreetMap. We don’t (yet) have other sources to evidence this, but OSM gives us a useful prompt to investigate this further.
I’ll cover the pros and cons of using OSM in our research in a future blog post, but for now I’d like to talk about how we use OSM in the early stages of research into a security force.
OSM is a useful tool for getting an impression of a security force’s physical infrastructure: lead information about where it may have bases and facilities, and the terrain that may be reserved for use by security forces (like firing ranges and training areas). How do we do this?
OpenStreetMap is a database
The points, lines and polygons (“objects”) you see on OSM are described with “tags”: for example, a tag can define a line as a “road” or a shape as a “building”, and give it a name. Incredibly, there are over 70,000 different ways to describe an object on OSM, but the tag we’re interested in is “landuse=military”.
OSM currently has 70,641 objects to which the tag “landuse=military” has been applied. OSM’s own documentation about this tag is here. The tag can be refined further by applying another tag called “military=[something]” – the [something] in question can take values like these:
There are currently over 290 additional tags used on OSM to increase the specificity about the type of military land use.
How can we use this information to aid our research? The usual need we have is for a BIG LIST that we can simply go through one by one, using the entries as starting points for searches or to cross-reference data we get from other sources. Although we can view these items on OSM, we can’t get such a BIG LIST from the map itself. To do this we need to use a way of accessing OSM’s data called the Overpass API. This is mostly used by programmers, but for us patient non-programmers there is a slightly easier way to use this API – it’s called Overpass Turbo.
Using Overpass Turbo to show military land use on OSM
So, here goes. Let’s ask OSM what objects in Mexico are tagged with “landuse=military”. Head over to Overpass Turbo:
After opening that link copy the below into the input area on the left-hand side and then hit the “Run” button (top left):
// Limit the search to “Mexico”
{{geocodeArea:Mexico}}->.searchArea;
// Pull together the results that we want
( // Ask for the objects we want, and the tags we want
  node["landuse"="military"](area.searchArea);
  way["landuse"="military"](area.searchArea);
  relation["landuse"="military"](area.searchArea);
);
// Print out the results
out body; >; out skel qt;
What’s this then? Yes, it’s a map of just those objects tagged with “landuse=military”:
Exciting! You can export this into a common geographical format (like KML or GeoJSON). But I said we needed a list. Let’s alter the query a bit. Try putting this into the editor:
// Get a CSV output
[out:csv(name, "name:es", "name:en", ::type, ::id, ::lat, ::lon; true; ",")][timeout:25];
// Limit the search to “Mexico”
{{geocodeArea:Mexico}}->.searchArea;
// Pull together the results that we want
( // Ask for the objects we want, and the tags we want
  node["landuse"="military"](area.searchArea);
  way["landuse"="military"](area.searchArea);
  relation["landuse"="military"](area.searchArea);
);
// Print out the results (centre points give ways and relations a lat/lon)
out center qt;
Same data, but in a list that we can throw into a spreadsheet to work on further:
Even the snippet above gives us some unit and facility names to research further, as well as the locations of possible facilities that perhaps someone with local knowledge has flagged as being used for military stuff.
The queries above can be altered to search within different countries or other defined areas, examine different tags (like “amenity=police”… give it a try), and export more data (such as an object’s history).
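If you’d rather skip the copy-paste step, the exported CSV can be handled in a few lines of Python. This is a sketch only: the sample rows below are made up, and the header simply mirrors the sort of columns an out:csv query can request.

```python
import csv
import io

# Illustrative sample of Overpass CSV output; rows and names are invented
sample = """name,name:es,name:en,@type,@id,@lat,@lon
Campo Militar (ejemplo),Campo Militar (ejemplo),,way,123456,19.45,-99.22
Zona hipotética,Zona hipotética,,node,654321,25.68,-100.31
"""

rows = list(csv.DictReader(io.StringIO(sample)))
# Each row is now a dict keyed by column name, ready for a spreadsheet or database
names = [row["name"] for row in rows]
```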
As well as being a map that we can search, OpenStreetMap is a database that we can query in depth.
Historical and contemporary military and police locations may be identified inside OpenStreetMap using the “landuse=military” tag. More information about the tagging system can be found on OSM’s own TagInfo service.
Using Overpass Turbo we can pull out that information and use it as lead information during our research. Overpass Turbo is free to use, and can output maps and lists. The Overpass query language is documented here and there are some super examples on the OSM wiki here.
I’m sure there are more elegant ways to use Overpass Turbo than my basic code, so should anyone wish to help us out I’m all ears (tom [at] securityforcemonitor.org). We’re also interested in improving the data on military and police facilities that exists in OSM, … but that’s another post.
I hope this has been a helpful read, and do comment, respond and correct as needed.
It exists only in the Internet Archive’s Wayback Machine now. Here are two captures of that URL – made in 2004 and 2005 respectively. The screenshots below show only the first 10 entries (of over 40 in each). Can you spot the difference?
Although the archived URL is the same, the content is not. For example, in the February 2004 snapshot SEDENA lists “Noe Antonio Ordoñez Herran” as the commander of 1/a Z.M. However, by October 2005 SEDENA lists “Germán Redondo Azuara” as the commanding officer. This is a substantive difference that we want to capture; there are also other differences between these two snapshots.
How do we approach it? First, we establish the total number of snapshots. Helpfully, the Wayback Machine tells us this for any URL that it holds snapshots for. For example, the present SEDENA page was captured 57 times:
It is likely that a page like this was updated regularly: the little bar chart tells us that the snapshots differ in size, indicating that something changed. The changes could be an update to the text in the list of commanders, or a design change of some sort that affects the page size.
Do we have to wade through all of them to find out what the differences are? No. The Wayback Machine can tell us which snapshots differ from the previous ones. Therefore, we can just go to those that differ in some way from the others and extract information from those.
To do this, we have to use another way of asking the Wayback Machine questions: the Wayback CDX server. The CDX server is a more advanced way to query the Wayback Machine, though still one you can use from your browser. It doesn’t have a graphical user interface for browsing the archived pages; rather, it provides metadata about the snapshots.
Here’s Wayback Machine data about our URL, but viewed from the CDX server:
These are the first few rows of the same 57 results, shown as metadata rather than as a navigable, graphical version of the web captures themselves. I’m sure you can figure out how to turn this list into a spreadsheet that you can use to organise your research (hint: copy-paste into your favourite spreadsheet, then text-to-columns using a space as the separator).
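That copy-and-split step can also be scripted. Here is a minimal sketch, assuming the CDX server’s default space-separated columns (urlkey, timestamp, original URL, MIME type, status code, digest, length); the sample line itself is invented, not real SEDENA data.

```python
# Default CDX server columns, space-separated, one snapshot per line
FIELDS = ["urlkey", "timestamp", "original", "mimetype",
          "statuscode", "digest", "length"]

# Invented sample line in the CDX server's output format
sample = ("mx,gob,sedena)/ 20040219123456 http://www.sedena.gob.mx/ "
          "text/html 200 AAAABBBBCCCCDDDD 5123")

# One dict per snapshot, keyed by column name
snapshots = [dict(zip(FIELDS, line.split()))
             for line in sample.splitlines() if line.strip()]
```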
By changing the URL a bit we can filter out snapshots that are the same as the preceding one:
We’ve tacked on two new bits to the end of the query URL. The first is &showDupeCount=true:
This shows which of the snapshots have duplicates. The second is &collapse=digest:
This has the effect of removing data about snapshots whose content digest is the same as the previous snapshot’s.
Overall, our results are filtered from 57 down to 31 snapshots. It’s removed 26 that were the same as the preceding one and saved us a good hour of work.
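What the server is doing can be pictured in a few lines of Python: keep a snapshot only when its content digest differs from the snapshot immediately before it. The timestamps and digests below are invented for illustration.

```python
# (timestamp, digest) pairs for each snapshot, in capture order; values invented
snapshots = [
    ("20040219", "AAA"),
    ("20040501", "AAA"),  # same digest as previous snapshot: a duplicate
    ("20051012", "BBB"),
    ("20060101", "BBB"),  # duplicate again
    ("20070315", "CCC"),
]

# Keep the first snapshot, plus any whose digest differs from its predecessor
changed = [snap for i, snap in enumerate(snapshots)
           if i == 0 or snap[1] != snapshots[i - 1][1]]
```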
As it happens, of those 31 snapshots only 12 hold content that is useful to us. The remainder are captures of server errors, because SEDENA changed its official website (and URL structure) four times between 2004 and 2017. But that, my friends, is another blogpost.
So, to wrap up:
The Wayback Machine has the equivalent of an advanced query that helps us find out when snapshots of the same page differ from each other.
Using it at the beginning of a bit of research can save you a lot of time.
I hope this helps some of you save time when trawling the Wayback Machine, and encourages you to experiment a bit with the more obscure features of well-known tools. It certainly helps us create the rich data you see on WhoWasInCommand.com.
Can computers help Security Force Monitor’s researchers increase their speed and accuracy when extracting relevant data about security forces from the text of news articles and reports?
Over the last few months Yue “Ulysses” Chang, a masters student at the Data Science Institute at Columbia University, has interned with us to help explore this question. The quick answer is “yes, 79% of the time”. The longer (and hopefully nice and readable) answer starts in this blog post, the first of a short series.
How do we create data about security forces?
Each week at Security Force Monitor we identify and read hundreds of news articles, reports, maps and datasets – these are the “sources” from which we pull thousands of little details about organizations, names and locations. We stitch these together to create the rich view of security force structures and commanders that you can search through on WhoWasInCommand.com.
This is time-consuming work. In most cases, after we’ve found a useful source we just have to read through it, identify the snippets of information we need and then copy, paste or re-type them into our databases. Here are a few paragraphs from a typical source – “Police IG Redeploys AIGs, CPs For April 11 Polls” – published by Channels TV on 10 April 2015:
We can use the information in this source to support the below statements, and enter the relevant values into a database:
On 10 April 2015 (the publication date of the source) Inspector General of Police, Assistant Inspector General, and Deputy Inspector-General of Police are ranks in the police force in Nigeria.
On 10 April 2015 Force Public Relations Officer is a title in the police force in Nigeria.
On 10 April 2015 Suleiman Abba is a person holding the rank of Inspector General of Police (IGP) in Nigeria.
On 10 April 2015 Aigusman Gwary is a person holding the rank of Assistant Inspector General of Police (AIG) in Nigeria.
On 10 April 2015 six Deputy Inspectors of Police “coordinate activities” in six geo-political zones.
Every one of these data points has at least one source. For example, to make our data on units – distinct parts of security forces such as army battalions or police divisions – we looked at around 3,500 unique sources taken from over 200 different publications. From these sources we were able to evidence 25,505 data points with a single source, 2,086 data points with more than 10 distinct sources, and 59 data points with over 40 distinct sources.
We presently cover branches of the security forces of Nigeria, Mexico and Egypt. As we expand our coverage to other countries we will need to consider ways of reducing the time spent and risk of error inherent in this part of our research process. If we can reduce the time we spend searching, cutting and pasting bits of text, then we can spend more time cross-referencing and producing interesting analysis from the data. Could we get more help from computers than we currently do?
“NLP”, “NER” … ?
Computers can read too. Sort of.
Natural Language Processing (NLP) is a long-established field of computer science that looks at how machines relate to people’s speech and writing, and ultimately how they can comprehend information passed to them by a person. The fruits of NLP research power everything from search engine recommendations to (irritating) automated voice call systems and (less irritating) digital voice assistants. Named Entity Recognition (NER) is the sub-field of NLP that gives computers the capability to pick out things that people can recognise in text – like names of persons, organizations, locations and dates. Could these techniques be applied to our work?
We can start exploring this question very quickly by using one of numerous “off the shelf” NLP and NER toolsets. To test our ideas out we have chosen a toolkit called spaCy. It has a wide range of functions and is free and open source, so we can use it without direct cost.
Without any modification spaCy can assess text and identify persons, organizations, locations, dates and lots of other types of entities. It can also be trained to improve its ability to detect the above entities (like adding in a new geographical model), and to identify new entities such as rank and role, or connections between entities. What’s not to love?
NER and real sources
Let’s give it a try. We can take the text from the sample news article we analysed above, and place it into spaCy. It will highlight different parts of the text that it considers to be entities:
The performance here is OK, but it is not without problems. For example, spaCy correctly picked out all but one of the people mentioned in the article (“Aigusman Gwary”, whom it tagged as an Organization rather than a Person). It also successfully identified Lagos and Bauchi as geo-political entities (“GPE”), but missed “Akwa Ibom” and “Rivers”, and mis-categorized “Jigawa” as an organization. There are other misses in there too.
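Hits and misses like these can be counted systematically, which matters when judging whether a system is getting better or worse. Here is a minimal sketch of entity-level precision and recall; the gold and predicted lists are hand-made from the example above, not actual spaCy output.

```python
# Entities we expected (gold) vs what a tagger returned (predicted),
# as (text, label) pairs; both lists are illustrative and hand-made
gold = {("Suleiman Abba", "PERSON"), ("Aigusman Gwary", "PERSON"),
        ("Lagos", "GPE"), ("Bauchi", "GPE"), ("Akwa Ibom", "GPE"),
        ("Rivers", "GPE"), ("Jigawa", "GPE")}
predicted = {("Suleiman Abba", "PERSON"), ("Aigusman Gwary", "ORG"),
             ("Lagos", "GPE"), ("Bauchi", "GPE"), ("Jigawa", "ORG")}

true_positives = len(gold & predicted)      # exact text-and-label matches
precision = true_positives / len(predicted)  # how much of the output was right
recall = true_positives / len(gold)          # how much of the truth was found
```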
Bringing this into the work of Security Force Monitor
In this post we’ve outlined the challenges we face, and in broad terms the way that we see this set of technologies offering an opportunity to address them. The intriguing question for us is how to take the raw capabilities of NER and have them benefit our research work in specific and effective ways. We have a long list of things on our mind, including:
The skills and financial costs that will be required to develop, implement and maintain such a system in a way that is reliable and effective.
Whether we can improve the performance of the algorithm by using the data we have already collected to train spaCy to better pick out what we are looking for in a source.
How to reconcile the stream of information coming in from NER with the data we already have – for example, what process will we use to figure out if “Jane S Smith” and “Jayne S Smith” are the same person?
How we evaluate NLP and NER systems so we know whether they are getting better (or worse!).
The type of workflow and user interface that would be needed to bring these capabilities effectively into our research work so they are actually helpful.
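On the reconciliation point, one cheap first pass is plain string similarity from Python’s standard library. The threshold below is an arbitrary illustration, not our production rule: a high score only flags a pair for human review.

```python
from difflib import SequenceMatcher

def name_similarity(a, b):
    """Rough similarity between two names, from 0.0 to 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

score = name_similarity("Jane S Smith", "Jayne S Smith")
# A high score flags the pair for a researcher; it does not prove identity
needs_review = score > 0.9
```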
In the next post in this series, Ulysses and I will start digging into these questions and revealing some of the work that we have done so far.
(Article edited on 22 February 2018 to correct typos and clarify wording)