Unlocking the Department of State’s foreign military training data for good this time

Since 1999 the United States government has published detailed data on the foreign military units that it has trained. Today, we’re releasing a new version of this data, automatically extracted from official sources, along with a simple search tool through which to explore it:

https://trainingdata.securityforcemonitor.org

This database contains over 200,000 training entries and covers the period 2001 to 2018. It includes data extracted from over 5,600 pages taken from 17 of the 19 “Foreign Military Training and DoD Engagement Activities of Interest” reports issued jointly by the U.S. Department of State and U.S. Department of Defense. While the two most recent reports are on the State Department’s website, the source documents for reports issued before 2016 can be found in the State Department archives.

You can use the tool to run simple or complex queries on the data, and craft the way the results are displayed. For example, here are all the units that have trained at Lackland Air Force Base, all mentions of “Blackwater” in the data, the times the U.S. trained Nigeria’s Guards Brigade, and the time a trainee got married and so couldn’t make the training. We warn you now: the tool has a steep learning curve. But it is highly functional, we developed it for our own analysis needs, and we hope it will be useful to anyone interested in U.S. training activities.

We’ll follow up this post with more guidance on how to use the tool. The project’s GitHub page contains a detailed guide to the dataset along with the code used to create it.

Why have we done this?

Released annually since 2000, this important, statutorily mandated report shows in great detail how the U.S. has spent much of its training and assistance budget, and with what aims. For most years this data includes the name of the training course, the trainee unit, the trainee unit’s country, the exact start and end dates, and so on. The value of this very detailed data is high but its accessibility is very low. This is because of the backwards way the data are published: thousands of pages of tables in PDFs.

This is bad. It just is.

And so each year someone has to take one for the team and spend a month copying, pasting, formatting and cross-checking the new data into some format we can use, like an Excel sheet. Armed with this data we can all crunch the new numbers and begin to answer bread-and-butter questions about the U.S. military’s training budget: has expenditure gone up or down in comparison to previous years? What sorts of training are increasing; which are stagnating?

To date, our peers at Security Assistance Monitor (SAM) have taken on this noble task and published the data on their site, along with a wide range of other sources. Their site has a powerful report builder and interface for digging into the training programs right down to the level of a specific training recipient. And you can even download all their data for free. SAM also does deep analytical dives on trends in U.S. security assistance around the world.

Why, then, have we created our own version of this data from scratch, and released a search tool for it? Simply put, we have very specific needs which meant we had to create a tool to address them. In particular we needed:

  • Transparent, repeatable data processing: our aim was to create a clean, accurate and standardized machine-readable copy of the data that could be easily updated. To do this, it is important to show how the data have been processed. The code and process we developed (published here on GitHub) creates a clear and visible link between the data and the original source material. It also enables us to see where we’ve made errors.
  • Exact dates for all trainings: our data on security forces is time-bound (as we discussed here), and having data on the exact start dates and end dates of trainings is critical for us. However, that information was not readily available (except in those dreaded PDFs). We started our coding just to get this data for a particular country, but then decided it was simpler to get all of it for every country for every year.
  • Sharing our tools: with over 200,000 rows of data, this project grew far beyond an Excel/Google Sheets file so we moved it into a database for our own analysis needs. The tool, called Datasette, is highly functional and serves the data into your browser. It enables us to build powerful queries, create different views of the data, and export the results in a variety of formats. However, why keep this to ourselves when others could benefit? We thought this might be useful to you as well.
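
For readers who prefer scripts to the browser, Datasette also exposes query results as JSON. The sketch below only builds such a query URL and does not fetch anything. The database name "training", the table name "training" and the column name "location" are hypothetical placeholders, not the real names in our data model, so check the dataset guide before adapting it.

```python
from typing import Dict, Optional
from urllib.parse import urlencode

# Site address taken from this post.
BASE_URL = "https://trainingdata.securityforcemonitor.org"


def build_query_url(database: str, sql: str,
                    params: Optional[Dict[str, str]] = None) -> str:
    """Build a Datasette JSON API URL for a SQL query with named parameters."""
    # "_shape=array" asks Datasette to return a plain JSON list of row objects.
    query = {"sql": sql, "_shape": "array"}
    # Named parameters (":place" in the SQL) are passed as extra query-string keys.
    query.update(params or {})
    return f"{BASE_URL}/{database}.json?{urlencode(query)}"


# Hypothetical database, table and column names, used for illustration only.
url = build_query_url(
    "training",
    "select * from training where location like :place limit 10",
    {"place": "%Lackland%"},
)
print(url)
```

Passing the search term as a named parameter, rather than pasting it into the SQL string, keeps the query reusable and avoids quoting mistakes.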

Our hope is that when the next report arrives in a few short months, we will be able to turn it into machine-readable data and pass it around the sector in minutes, rather than months. We also hope to integrate this data into our platform WhoWasInCommand.com, which will bring a powerful new dimension of analysis to its users.

Updates:

  • We updated this post on 2021-08-19 to fix those direct links to the database that were broken by changes to the data model necessitated by the incorporation of data from the FY 2019-2020 report.

Image: Chris Barbalis