Data update: Detailed sourcing for our database of 255,007 U.S. foreign military trainings

Photo: The South entrance of the Harry S. Truman Federal Building which is the headquarters of the Department of State, Carol Highsmith, 2016 (source). Somewhere in here are the two missing Foreign Military Training reports covering 2021 to 2023.

For over twenty years, the Foreign Military Training and DoD Engagement Activities of Interest report, published jointly by the U.S. Department of State and the Department of Defense, has provided a measure of transparency about the foreign (non-U.S.) police and military units that have received training and support from U.S. forces. These huge reports are one way the U.S. Government can realise its legal commitments under the “Leahy Laws”: they publicly state which units, by virtue of having received training, passed its human rights vetting requirements in that fiscal year.

The report’s release is usually a big event here at Security Force Monitor, and in years past we have marked it by turning the contents of the report into a searchable database that journalists, human rights researchers and others interested in accountability can use – our earlier updates cover the basics of why and how we have done this, when we have made updates, and how others have been using it.

However, for the last two years we have not updated this dataset because the reports have not appeared online. These reports have been delayed in the past – for example, the 2020-2021 report was published about six months later than we expected. Neither Department has issued a statement explaining the lengthy delays to the two missing reports. While we wait for new data, we have taken the opportunity to improve the data that is already there.

The enhanced dataset is now online here:

https://trainingdata.securityforcemonitor.org

What’s improved?

The main improvement is small but consequential. For nearly all 255,007 training interventions in the dataset you can now see the page number of the report from which it was extracted. The source material for this dataset is 96 distinct PDFs and 16 webpages, which together total 6,908 distinct pages of usable data. We went back over all the source documents and linked the 30-40 distinct training interventions listed on each page to the number of that page. We didn’t include this in the first versions of the dataset on its release in 2019. With this improvement, though, users of the data are now able to verify a row of data against its source far more quickly. They can also generate a precise citation for a row of data to the page level, so others can find and verify it quickly. Finally, it opens up more opportunities to use our database systematically in research: depending on their requirements, a researcher can query it by various categories and criteria, run free-text searches, or work through a report page by page.
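To illustrate what page-level sourcing makes possible, here is a minimal Python sketch that builds a citation for a row and groups rows by source page. The field names (`report`, `page`, `course`) and the sample values are hypothetical placeholders, not the dataset's actual column names or contents; check the field documentation before adapting it.

```python
from collections import defaultdict

# Placeholder rows: field names and values are invented for illustration
# and do not come from the real dataset.
rows = [
    {"report": "Example Report A", "page": 12, "course": "Example course 1"},
    {"report": "Example Report A", "page": 12, "course": "Example course 2"},
    {"report": "Example Report A", "page": 13, "course": "Example course 3"},
]


def cite(row):
    """Return a page-level citation string for one row."""
    return f'{row["report"]}, p. {row["page"]}'


def by_page(rows):
    """Group rows by (report, page) so a report can be checked page by page."""
    pages = defaultdict(list)
    for row in rows:
        pages[(row["report"], row["page"])].append(row)
    return dict(pages)


print(cite(rows[0]))
for (report, page), items in sorted(by_page(rows).items()):
    print(report, page, len(items))
```

Grouping on the pair of report and page, rather than on page number alone, matters because the same page number recurs across the different source documents.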

Here are a few sample queries you can run that show the value of these enhancements:

To deliver this we improved the scrapers that extract the data from the PDFs, resolving a number of issues that existed in the first dataset. We also included the data from the 1999-2000 and 2000-2001 reports, which were not part of our earlier data releases. The main downside is that we were unable to retain the unique identifiers (training:id:admin) for each training that existed in the earlier dataset. This is frustrating – and a sin as old as the Internet – but is a technical trade-off in favour of a better overall dataset. For anyone who has used the IDs from the earlier dataset and needs to match them to the newly minted identifiers, we have published a developer tool that can do this most of the time.
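For readers wondering what matching old identifiers to new ones might involve, the sketch below shows one generic approach: pair rows whose content fields are identical after normalisation, and leave ambiguous or missing cases unresolved. This is not a description of how the published developer tool works. The content field names and sample values are hypothetical, and it assumes both datasets hold their identifier in a field named `training:id:admin`.

```python
from collections import defaultdict


def content_key(row, fields):
    """Build a comparison key: lower-cased, whitespace-collapsed field values."""
    return tuple(" ".join(str(row[f]).lower().split()) for f in fields)


def index_ids(rows, fields, id_field):
    """Map each content key to the list of IDs that share it."""
    index = defaultdict(list)
    for row in rows:
        index[content_key(row, fields)].append(row[id_field])
    return index


def match_ids(old_rows, new_rows, fields, id_field="training:id:admin"):
    """Return ({old_id: new_id}, [unresolved old IDs]).

    An old ID is matched only when its content key is unique in both
    datasets; anything else is reported as unresolved.
    """
    old_index = index_ids(old_rows, fields, id_field)
    new_index = index_ids(new_rows, fields, id_field)
    matched, unresolved = {}, []
    for key, old_ids in old_index.items():
        new_ids = new_index.get(key, [])
        if len(old_ids) == 1 and len(new_ids) == 1:
            matched[old_ids[0]] = new_ids[0]
        else:
            unresolved.extend(old_ids)
    return matched, unresolved


# Placeholder data, invented for illustration only.
old = [
    {"training:id:admin": "old-1", "course": "Example  Course 1", "year": "2001"},
    {"training:id:admin": "old-2", "course": "Example course 2", "year": "2001"},
]
new = [
    {"training:id:admin": "new-9", "course": "example course 1", "year": "2001"},
]
matched, unresolved = match_ids(old, new, fields=["course", "year"])
print(matched, unresolved)
```

Refusing to guess when a key is duplicated or absent is deliberate: a matcher of this kind succeeds only most of the time, and an explicit list of unresolved IDs is safer than a silent mismatch.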

As before, we have made public all the code we developed to extract the data from the official reports and create this database, along with extensive documentation of the fields you can expect to find in the dataset.

Key resources