WhoWasInCommand shows you all the sources that evidence every piece of data – but you probably missed the way it does this

WhoWasInCommand shows you all the sources used to evidence every piece of data it provides.

When you’re browsing your favourite units and commanders on WhoWasInCommand.com – like Operation Lafiya Dole, for example –  just hover your mouse over (or tap on, if you’re on a mobile or tablet) the bit of data you’re interested in and this happens:

sources_show_em_all

This interaction gives you a lot of useful information:

  • Because the little circle is green, it tells you that we have rated this bit of data as “High Confidence” (which means it is drawn from a wide variety of sources of different types)
  • The pop-over that appears when you click tells you how many sources there are
  • You can scroll to see all the sources, along with links to each source’s URL (even if it’s now dead) and a link to a copy of the source we made by submitting it to the Internet Archive
  • The little question mark icon links off to the page in our Research Handbook that answers questions about this widget.

Now, I think this feature is pretty cool (well, I designed it so I would say that). We did some  research into how citations, references and footnotes were managed on websites, and our hunch was this would be a good start.

But it’s not my view that counts – it’s your view as a user that matters.

We get a lot of questions about our sources and whilst it’s clear this feature is a practical way to deliver information that answers those questions, I suspect that a lot of users don’t use it either because it’s not immediately apparent it is there or because it is not how users would think about how to find sources.

We could do with sitting down with people who are using WhoWasInCommand, watching how they use the site, and asking them for ideas about how we can make these sorts of features clearer.

Any volunteers?

OpenStreetMap is (sometimes) a handy database of military and police locations – here’s how to see them

overpassblog3
OpenStreetMap – 70,641 objects are tagged with “landuse=military”. Source: TagInfo, 6 July 2018

Most of the time we use OpenStreetMap (OSM) as a gazetteer; that is, a means of representing the geographical aspects of Security Force Monitor’s data.

For example, our research indicates that the Mexican army unit 105 Batallón de Infantería had a base in Frontera, Coahuila, Mexico from 24 February 2014. To geocode this data we will search OSM to find the nearest “object” to the named settlement – in this case a “node” called Frontera (ID number 215400772)  – and link it to the unit as a base. Our Research Handbook contains the rules we use for doing this.

When we publish the data on WhoWasInCommand.com it will be displayed in the “Sites” section of the record for 105 Batallón de Infantería along with all the sources that evidence it:

overpassblog1
WhoWasInCommand.com: sites for 105 Batallón de Infantería, Mexico

So far we have found OSM to be a good enough gazetteer. And it’s free. And it’s open licensed. And we can fix it if we need to. So you won’t find us moaning and whinging.

However, OSM has a number of issues with accuracy, coverage and change over time so we do not use OSM as a primary source of information. Instead we use it as one of a number of sources of lead information which help us piece together the geographical footprint of a security force. It’s why, for example, we don’t place 105 Batallón de Infantería directly at Venustiano Carranza International Airport, even though this is the case on OpenStreetMap. We don’t (yet) have other sources to evidence this, but OSM gives us a useful prompt to investigate this further.

I’ll cover the pros and cons of using OSM in our research in a future blog post but for now I’d like to talk about how we use OSM in the early stages of research into a security force.

OSM is a useful tool for getting an impression of a security force’s physical infrastructure: lead information about where it may have bases and facilities, and the terrain that may be reserved for use by security forces (like firing ranges and training areas). How do we do this?

OpenStreetMap is a database

The points, lines and polygons (“objects”) you see on OSM are described with “tags”: for example, a tag can define a line as a “road” or a shape as a “building”, and give it a name. Incredibly, on OSM there are over 70,000 different ways to describe an object, but the tag we’re interested in is “landuse=military”.

OSM currently has 70,641 objects to which the tag “landuse=military” has been applied. OSM’s own documentation about this tag is here. The tag can be refined further by applying another tag called “military=[something]” – the [something] in question can be values like the below:

  • military=airfield
  • military=barracks
  • military=bunker
  • military=checkpoint
  • military=training_area

There are currently over 290 additional tags used on OSM to increase the specificity about the type of military land use.

How can we use this information to aid our research? The usual need we have is for a BIG LIST that we can simply go through one by one and use as starting points for searches or to cross reference data we get from other sources. Although we can view these items on OSM we can’t get such a BIG LIST. To do this we need to use a way of accessing OSM’s data called Overpass API. This is mostly used by programmers but for us patient non-programmers there is a slightly easier way to use this API – it’s called Overpass Turbo.

Using Overpass Turbo to show military land use on OSM

So, here goes. Let’s ask OSM what objects in Mexico are tagged with “landuse=military”.  Head over to Overpass Turbo:

https://overpass-turbo.eu/

After opening that link copy the below into the input area on the left-hand side and then hit the “Run” button (top left):

// Limit the search to “Mexico”
{{geocodeArea:Mexico}}->.searchArea;
// Pull together the results that we want
(
 // Ask for the objects we want, and the tags we want
 way["landuse"="military"](area.searchArea);
 relation["landuse"="military"](area.searchArea);
 node["landuse"="military"](area.searchArea);
);
// Print out the results
out body;
>;
out skel qt;

What’s this then? Yes, it’s a map of just those objects tagged with “landuse=military”:

overpassblog4
Overpass Turbo – map of objects tagged “landuse=military” in Mexico (live)

Exciting! You can export this into a common geographical format (like KML or geoJSON). But I said we needed a list. Let’s alter the query a bit. Try putting this into the editor:

// Get a CSV output
[out:csv(name, "tags:name:es", "tags:name:en", ::"type", ::"id", ::"lat", ::"lon";true;",")][timeout:25];

// Limit the search to “Mexico”
{{geocodeArea:Mexico}}->.searchArea;
// Pull together the results that we want
(
 // Ask for the objects we want, and the tags we want
 way["landuse"="military"](area.searchArea);
 relation["landuse"="military"](area.searchArea);
 node["landuse"="military"](area.searchArea);
);
// Print out the results
out body;
>;
out skel qt;

Same data, but in a list that we can throw into a spreadsheet to work on some more:

overpassblog5
Overpass Turbo – CSV list of objects tagged “landuse=military” in Mexico (live)

Even the snippet above gives us some unit and facility names to research further, as well as the locations of possible facilities that perhaps someone with local knowledge has flagged as being used for military stuff.
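If you would rather sift the list with a short script than a spreadsheet, here is a minimal Python sketch. It assumes you have copied the CSV text out of Overpass Turbo. The column headers (like "@type") and every row in SAMPLE_EXPORT are invented for illustration, so check them against your own export. It keeps only the rows that carry a name, since those are the ones that make good starting points for searches.

```python
import csv
import io

# Hypothetical export: the headers and rows below are made up for illustration.
SAMPLE_EXPORT = """name,@type,@id,@lat,@lon
Example Barracks,node,1001,25.1000,-101.2000
,node,1002,25.3000,-101.4000
Example Training Area,node,1003,26.5000,-100.6000
"""


def named_leads(csv_text):
    """Return the rows of a CSV export that have a non-empty 'name' column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row for row in reader if row["name"].strip()]


if __name__ == "__main__":
    for lead in named_leads(SAMPLE_EXPORT):
        print(lead["name"], lead["@type"], lead["@id"])
```

Unnamed objects are not useless (they still mark a location someone tagged as military), but they need a different kind of follow-up.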

The queries above can be altered to search within different countries or other defined areas, examine different tags (like “amenity=police”… give it a try), and export more data (such as an object’s history).
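To make that swapping a bit less fiddly, here is a small Python sketch that writes out the query text for a given area and tag. The function name build_overpass_query and its parameters are my own invention, not part of Overpass Turbo. It only produces text in the same shape as the queries above for you to paste into the editor.

```python
def build_overpass_query(area, key, value, csv_fields=None):
    """Return Overpass Turbo query text for objects tagged key=value in an area.

    csv_fields, if given, is a list of column declarations written exactly
    as they would appear inside [out:csv(...)].
    """
    lines = []
    if csv_fields:
        fields = ", ".join(csv_fields)
        # Same CSV settings as the query above: header row on, comma separator
        lines.append('[out:csv(' + fields + ';true;",")][timeout:25];')
    # Limit the search to the named area (Overpass Turbo shortcut)
    lines.append("{{geocodeArea:" + area + "}}->.searchArea;")
    lines.append("(")
    for object_type in ("way", "relation", "node"):
        lines.append(f' {object_type}["{key}"="{value}"](area.searchArea);')
    lines.append(");")
    # Print out the results
    lines.extend(["out body;", ">;", "out skel qt;"])
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_overpass_query("Nigeria", "amenity", "police"))
```

Calling it with "Mexico", "landuse" and "military" gives the body of the first query in this post (minus the comments).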

Wrapping up

  • As well as being a map that we can search, OpenStreetMap is a database that we can query in depth.
  • Historical and contemporary military and police locations may be identified inside OpenStreetMap using the “landuse” tag. More information about the tagging system can be found on OSM’s own TagInfo service.
  • Using Overpass Turbo we can pull out that information and use it as lead information during our research. Overpass Turbo is free to use, and can output maps and lists. The Overpass query language is documented here and there are some super examples on the OSM wiki here.

I’m sure there are more elegant ways to use Overpass Turbo than my basic code, so should anyone wish to help us out  I’m all ears (tom [at] securityforcemonitor.org). We’re also interested in improving the data on military and police facilities that exists in OSM, … but that’s another post.

I hope this has been a helpful read, and do comment, respond and correct as needed.

Not all snapshots are created equal – a time-saving Wayback Machine technique

We’re going to write about our daily work more often.  I’ll go first with a nerdy research tip:

The Internet Archive’s Wayback Machine (the awesomeness of which I won’t bang on about) can show you when captures of the same page differ in some way from each other.

So what?

Here’s a long dead page used by La Secretaría de la Defensa Nacional (SEDENA) in Mexico to list the commanding officers of Zonas Militares (a major tier of the army in Mexico).

http://www.sedena.gob.mx:80/ejercito/comandancias/zon_mil.htm

It exists only in the Internet Archive’s Wayback Machine now. Here are two captures of that URL – made in 2004 and 2005 respectively. The screenshots below show only the first 10 entries (of over 40 in each). Can you spot the difference?

cdxblog1
Clipping from 8 February 2004 Wayback Machine snapshot of SEDENA army commanders page
cdxblog2
Clipping from 3 October 2005 Wayback Machine snapshot of SEDENA army commanders page

Although the archived URL is the same, the content is not. For example, in the February 2004 snapshot SEDENA lists “Noe Antonio Ordoñez Herran” as the commander of 1/a Z.M. However, by October 2005 SEDENA lists “Germán Redondo Azuara” as the commanding officer. This is a substantive difference that we want to capture; there are also other differences between these two snapshots.

How do we approach it? First, we establish the total number of snapshots. Helpfully, the Wayback Machine tells us this for any URL that it holds snapshots for. For example, the present SEDENA page was captured 57 times:

cdxblog3

It is likely that a page like this was updated regularly: the little bar chart tells us that there are differences in the sizes of the snapshots, indicating that something changed. The changes could be an update to the text in the list of commanders, or a design change of some sort that affects the page size.

Do we have to wade through all of them to find out what the differences are? No. The Wayback Machine can tell us which snapshots differ from the previous ones. Therefore, we can just go to those that differ in some way from the others and extract information from those.

To do this, we have to use another way to ask the Wayback Machine questions: the Wayback CDX server. The CDX server is a more advanced way to query the Wayback Machine, and you can still use it from your browser. It doesn’t have a graphical user interface for browsing the archived pages; rather, it provides metadata about the snapshots.

Here’s Wayback Machine data about our URL, but viewed from the CDX server:

cdxblog4
Some output from the Wayback Machine CDX server.

Here’s the URL that gives you those results:

https://web.archive.org/cdx/search/cdx?url=http://www.sedena.gob.mx:80/ejercito/comandancias/zon_mil.htm

These are a few rows of the same 57 results, shown as metadata rather than as a navigable, graphical version of the web captures themselves. I’m sure you can figure out how to turn this list into a spreadsheet that you can use to organise your research (hint: copy-paste into your favourite spreadsheet, then text-to-columns using a space as the separator).
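If a spreadsheet isn’t your thing, the same split-on-spaces trick can be sketched in a few lines of Python. This assumes the CDX server’s default output, where each row has seven space-separated fields in the order listed in CDX_FIELDS. The two sample rows use the capture dates mentioned above, but the time of day, digests and lengths are placeholders I made up.

```python
# Default field order of the plain-text CDX output
CDX_FIELDS = ["urlkey", "timestamp", "original", "mimetype",
              "statuscode", "digest", "length"]

# Illustrative rows only: digests, lengths and times of day are placeholders.
SAMPLE_CDX = """\
mx,gob,sedena)/ejercito/comandancias/zon_mil.htm 20040208000000 http://www.sedena.gob.mx:80/ejercito/comandancias/zon_mil.htm text/html 200 EXAMPLEDIGESTAAAA 1111
mx,gob,sedena)/ejercito/comandancias/zon_mil.htm 20051003000000 http://www.sedena.gob.mx:80/ejercito/comandancias/zon_mil.htm text/html 200 EXAMPLEDIGESTBBBB 2222
"""


def parse_cdx(text):
    """Turn space-separated CDX rows into a list of dictionaries."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        # Skip blank or unexpected lines rather than guessing at their layout
        if len(parts) == len(CDX_FIELDS):
            rows.append(dict(zip(CDX_FIELDS, parts)))
    return rows


if __name__ == "__main__":
    for row in parse_cdx(SAMPLE_CDX):
        print(row["timestamp"], row["statuscode"], row["digest"])
```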

By changing the URL a bit we can filter out snapshots that are the same as the preceding one:

https://web.archive.org/cdx/search/cdx?url=http://www.sedena.gob.mx:80/ejercito/comandancias/zon_mil.htm&showDupeCount=true&collapse=digest

We’ve tacked on two new bits to the end of the query URL:

&showDupeCount=true

This shows which of the snapshots have duplicates. And then:

&collapse=digest

This has the effect of removing data about snapshots that are the same as the previous one.

Overall, our results are filtered from 57 down to 31 snapshots. It’s removed 26 that were the same as the preceding one and saved us a good hour of work.
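For anyone who wants to script this, here is a rough Python sketch of the two steps. cdx_query_url builds the same kind of query URL as above from a page address and some parameters. collapse_adjacent mimics, on rows you already have, what collapse=digest is described as doing: dropping any snapshot whose digest matches the one just before it. Both function names are mine, and the local collapse is only an approximation of the server’s behaviour.

```python
from urllib.parse import urlencode

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def cdx_query_url(page_url, **params):
    """Build a CDX server query URL for page_url plus optional parameters."""
    query = {"url": page_url, **params}
    # Keep ':' and '/' readable so the result looks like the URLs in this post
    return CDX_ENDPOINT + "?" + urlencode(query, safe=":/")


def collapse_adjacent(rows, field="digest"):
    """Drop rows whose field value is the same as the row kept just before."""
    kept = []
    for row in rows:
        if not kept or kept[-1][field] != row[field]:
            kept.append(row)
    return kept


if __name__ == "__main__":
    page = "http://www.sedena.gob.mx:80/ejercito/comandancias/zon_mil.htm"
    print(cdx_query_url(page, showDupeCount="true", collapse="digest"))
```

Note that only neighbouring duplicates are removed: if a page changes and later changes back, both versions stay in the list.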

As it happens, of those 31 snapshots only 12 hold content that is useful to us. The remainder are captures of server errors, because SEDENA changed its official website (and URL structure) four times between 2004 and 2017. But that, my friends, is another blogpost.
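In the meantime, one rough way to thin the list further is to keep only the rows with an HTTP status code of 200 and turn each into a link to its snapshot. The Python sketch below assumes each row is a dictionary with "timestamp", "original" and "statuscode" keys (my own layout, not something the CDX server hands you directly). An error page that was served with a 200 status will still slip through, so this narrows the reading list rather than replacing the reading.

```python
WAYBACK_PREFIX = "https://web.archive.org/web"


def useful_snapshot_urls(rows):
    """Return Wayback Machine links for rows captured with HTTP status 200."""
    return [
        f"{WAYBACK_PREFIX}/{row['timestamp']}/{row['original']}"
        for row in rows
        if row["statuscode"] == "200"
    ]


if __name__ == "__main__":
    # Placeholder rows: the timestamps and status codes are invented.
    example_rows = [
        {"timestamp": "20040208000000", "original": "http://example.org/a", "statuscode": "200"},
        {"timestamp": "20100101000000", "original": "http://example.org/a", "statuscode": "404"},
    ]
    for url in useful_snapshot_urls(example_rows):
        print(url)
```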

So, to wrap up:

  • The Wayback Machine has the equivalent of an advanced query that helps us find out when snapshots of the same page differ from each other.
  • It’s called the Wayback CDX server, and you can read more about what it does on its Github page.
  • Using it at the beginning of a bit of research can save you a lot of time.

I hope this helps some of you save time when trawling the Wayback Machine, and encourages you to experiment a bit with obscurer features of well known tools. It certainly helps us create the rich data you see on WhoWasInCommand.com.

Cheers!

Analysis of sources for data on security forces: can computers help us out?

all_5_illustration
Image: Security Force Monitor uses sources like news articles, NGO reports and official press releases to create data on the structure, leadership and behaviour of security forces

Can computers help Security Force Monitor’s researchers increase their speed and accuracy when extracting relevant data about security forces from the text of news articles and reports?

Over the last few months Yue “Ulysses” Chang, a masters student at the Data Science Institute at Columbia University, has interned with us to help us explore this question. The quick answer is “yes, 79% of the time.” The longer (and hopefully nice and readable) answer starts in this blog post, the first of a short series.

How do we create data about security forces?

Each week at Security Force Monitor we identify and read 100s of news articles, reports, maps and datasets – these are the “sources” out of which we pull thousands of little details about organizations, names and locations. We stitch these together to create the rich view of security force structures and commanders that you can search through on WhoWasInCommand.com.

This is time-consuming work. In most cases after we’ve found a useful source we just have to read through it, identify the snippets of information we need and then copy, paste or re-type them into our databases. Here’s a few paragraphs from a typical source – “Police IG Redeploys AIGs, CPs For April 11 Polls” – published by Channels TV on 10 April 2015:

ig_redeploys_aigs_2015-410
Image: excerpt from “Police IG Redeploys AIGs, CPs For April 11 Polls”, Channels TV (Nigeria), 10 April 2015.

We can use the information in this source to support the below statements, and enter the relevant values into a database:

  • On 10 April 2015 (the publication date of the source) Inspector General of Police, Assistant Inspector General, and Deputy Inspector-General of Police are ranks in the police force in Nigeria.
  • On 10 April 2015 Force Public Relations Officer is a title in the police force in Nigeria.
  • On 10 April 2015  Suleiman Abba is a person holding the rank of Inspector General of Police (IGP) in Nigeria.
  • On 10 April 2015 Aigusman Gwary is a person holding the rank of Assistant Inspector General of Police (AIG) in Nigeria.
  • On 10 April 2015 six Deputy Inspectors of Police “coordinate activities” in six geo-political zones.

Every one of these data points has at least a single source. For example, to make our data on units – distinct parts of security forces such as army battalions or police divisions – we looked at around 3500 unique sources taken from over 200 different publications.  From these sources we were able to evidence 25,505 data points with a single source, 2086 data points with more than 10 distinct sources, and 59 data points with over 40 distinct sources.

table_sources_to_datapoints
Table: How many data points about the units on WhoWasInCommand.com are evidenced by more than one unique source?

We presently cover branches of the security forces of Nigeria, Mexico and Egypt.  As we expand our coverage to other countries we will need to consider ways of reducing the time spent and risk of error inherent in this part of our research process. If we can reduce the time we spend searching, cutting and pasting bits of text, then we can spend more time cross-referencing and producing interesting analysis from the data. Could we get more help from computers than we currently do?

“NLP”, “NER” … ?

Computers can read too. Sort of.

Natural Language Processing (NLP) is a long-established field of computer science that looks at how machines relate to people’s speech and writing, and ultimately how they can comprehend information passed to them by a person. The fruits of NLP research power everything from the recommendations you get on search engines to those (irritating) automated voice call systems and the (less irritating) digital voice assistants. Named Entity Recognition (NER) is the sub-field of NLP that gives computers the capability to pick out things that people can recognise in text – like names, persons, organizations, locations and dates. Could these technologies be applied to our work?

We can start exploring this question very quickly by using one of numerous “off the shelf” NLP and NER toolsets. To test our ideas out we have chosen a toolkit called spaCy.  This has the benefit of having a wide range of functions, and being free and open source – this enables us to use the toolset without direct cost.

Without any modification spaCy can assess text and identify persons, organizations, locations, dates and lots of other types of entities. It can also be trained to improve its ability to detect the above entities (like adding in a new geographical model), and identify new entities such as rank and role, or connections between entities. What’s not to love?

NER and real sources

Let’s give it a try. We can take the text from the sample news article we analysed above, and place it into spaCy. It will highlight different parts of the text that it considers to be entities:

displacy_ner_example
Image: Use of unmodified, untrained spaCy NER algorithm to identify people, places and organizations in text (see an interactive version of this example).

The performance here is OK, but it is not without problems. For example, spaCy correctly picked out all but one of the people mentioned in the article (“Aigusman Gwary”, who it tagged as an Organization rather than a Person). It has also successfully identified Lagos and Bauchi as geo-political entities (“GPE”) but misses “Akwa Ibom” and “Rivers”, and mis-categorizes “Jigawa” as an organization. There are other misses in there too.
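One common way to put numbers on hits and misses like these is to compare what the tool found against a hand-made list of what it should have found, then work out precision (how much of what it found was right) and recall (how much of what was there it found). The Python sketch below does that for a handful of the entities mentioned above. The two sets are a hand-built illustration drawn from this paragraph, not the full output of spaCy on the article, and the label names are assumptions, so treat the resulting scores as a toy example.

```python
def precision_recall(gold, predicted):
    """Score predicted (text, label) pairs against a hand-labelled gold set."""
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# What a researcher would label by hand (illustrative subset only)
GOLD = {
    ("Suleiman Abba", "PERSON"),
    ("Aigusman Gwary", "PERSON"),
    ("Lagos", "GPE"),
    ("Bauchi", "GPE"),
    ("Akwa Ibom", "GPE"),
    ("Rivers", "GPE"),
    ("Jigawa", "GPE"),
}

# What the tool returned for the same entities, per the description above
PREDICTED = {
    ("Suleiman Abba", "PERSON"),
    ("Aigusman Gwary", "ORG"),   # tagged as an organization, not a person
    ("Lagos", "GPE"),
    ("Bauchi", "GPE"),
    ("Jigawa", "ORG"),           # mis-categorized as an organization
}

if __name__ == "__main__":
    p, r = precision_recall(GOLD, PREDICTED)
    print(f"precision={p:.2f} recall={r:.2f}")
```

An entity counts as correct here only if both its text and its label match, which is a strict choice; a looser score could give partial credit for finding the right text with the wrong label.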

Bringing this into the work of Security Force Monitor

In this post we’ve outlined the challenges we face, and in broad terms the way that we see this set of technologies offering an opportunity to address them. The intriguing question for us is how to take the raw capabilities of NER and have them benefit our research work in specific and effective ways. We have a long list of things on our mind, including:

  • The skills and financial costs that will be required to develop, implement and maintain such a system in a way that is reliable and effective.
  • Whether we can improve the performance of the algorithm by using the data we have already collected to train spaCy to better pick out what we are looking for in a source.
  • How to reconcile the stream of information coming in from NER with the data we already have – for example, what process will we use to figure out if “Jane S Smith” and “Jayne S Smith” are the same person?
  • How we evaluate NLP and NER systems so we know whether they are getting better (or worse!).
  • The type of workflow and user interface that would be needed to bring these capabilities effectively into our research work so they are actually helpful.
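To give a flavour of the “Jane S Smith” versus “Jayne S Smith” problem, here is a minimal Python sketch that scores how alike two names are using the standard library’s difflib. It is one simple option among many rather than the approach we have settled on. The 0.9 threshold is an arbitrary assumption, and any match it suggests would still need a researcher to confirm it.

```python
from difflib import SequenceMatcher


def normalise(name):
    """Lower-case a name and squeeze repeated whitespace."""
    return " ".join(name.lower().split())


def name_similarity(a, b):
    """Return a similarity score between 0 and 1 for two names."""
    return SequenceMatcher(None, normalise(a), normalise(b)).ratio()


def possible_same_person(a, b, threshold=0.9):
    """Flag two names as a candidate match for a human to review."""
    return name_similarity(a, b) >= threshold


if __name__ == "__main__":
    for pair in [("Jane S Smith", "Jayne S Smith"),
                 ("Jane S Smith", "Suleiman Abba")]:
        print(pair, round(name_similarity(*pair), 2), possible_same_person(*pair))
```

String similarity alone cannot tell two different people with near-identical names apart, which is why the other details we hold (rank, unit, dates) would need to feed into any real matching process.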

In the next post in this series, Ulysses and I will start digging into these questions and revealing some of the work that we have done so far.

(Article edited on 22 February 2018 to correct typos and clarify wording)