Earlier this year, the International Consortium of Investigative Journalists and our reporting partners faced a common problem in this line of work: we had received more leaked documents than we could hope to read from start to finish.
A whistleblower had given us more than 200,000 files from an offshore law firm based on the island tax haven of Mauritius. The stories buried in these documents would eventually shed light on a system that diverts tax revenue from poor nations back to the coffers of Western corporations.
But first, we had to figure out how to sort through this mass of data.
Rather than attempt to read every document, our reporting team turned to “machine learning” to automate the sorting process. This subset of artificial intelligence could learn to identify files more likely to contain stories — pulling out, for instance, information-rich tax returns from the massive trove to queue up for review by our reporters.
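To make the idea concrete, here is a minimal sketch of how a program can learn to flag documents for review from a handful of labeled examples. It is not the system the Mauritius Leaks team used, which this article does not describe. It swaps in a simple naive Bayes word-count classifier with add-one smoothing, and the training snippets, the "review" and "skip" labels, and all function names are hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical, hand-labeled snippets for illustration only
# (not taken from any leaked files).
TRAINING = [
    ("tax return income assessment year taxable profit", "review"),
    ("annual tax return dividend withholding treaty", "review"),
    ("taxable income statement filed with revenue authority", "review"),
    ("lunch menu for the office party on friday", "skip"),
    ("printer is out of toner please order more", "skip"),
    ("office closed for public holiday on monday", "skip"),
]


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def train(examples):
    """Count how often each word appears under each label."""
    word_counts = {}
    doc_counts = Counter()
    for text, label in examples:
        doc_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(tokenize(text))
    vocab = {word for counts in word_counts.values() for word in counts}
    return word_counts, doc_counts, vocab


def score(text, model, label):
    """Log-probability of the label given the text (naive Bayes)."""
    word_counts, doc_counts, vocab = model
    total_words = sum(word_counts[label].values())
    log_prob = math.log(doc_counts[label] / sum(doc_counts.values()))
    for word in tokenize(text):
        if word not in vocab:
            continue  # words never seen in training carry no signal
        # Add-one smoothing keeps unseen label/word pairs from zeroing out.
        count = word_counts[label][word] + 1
        log_prob += math.log(count / (total_words + len(vocab)))
    return log_prob


def classify(text, model):
    """Return the label with the highest score."""
    return max(model[1], key=lambda label: score(text, model, label))


def rank_for_review(texts, model):
    """Order documents so the most 'review'-like ones come first."""
    def margin(text):
        return score(text, model, "review") - score(text, model, "skip")
    return sorted(texts, key=margin, reverse=True)


model = train(TRAINING)
queue = rank_for_review(
    ["office party lunch on friday",
     "corporate tax return showing taxable income"],
    model,
)
```

The ranked `queue` puts the documents that look most like the "review" examples first, which mirrors the idea of queuing likely tax returns for reporters. As with any such model, the output is a lead to check, not a finding.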
This was just one example in a presentation on machine learning in journalism that John Keefe, an editor at Quartz and a member of the Mauritius Leaks reporting team, gave last week to ICIJ members.
Keefe, an early adopter of algorithms in reporting, defines machine learning as the use of complex code to create programs that can detect patterns and sort information faster than any team of humans. Such tools have helped parse massive datasets in many of ICIJ’s recent investigations, including the Paradise Papers and the Implant Files.
As datasets become bigger and more complex, machine learning models that help reporters sort and analyze data are becoming not only more sophisticated but also more accessible to reporters everywhere, Keefe explained. Pre-established learning models, which initially took major work to create, can now be easily fitted to new datasets. Some of these models are even available for free online. “This has only happened in the past few years,” Keefe said.
Keefe’s session was part of ICIJ Labs, a new webinar series in which ICIJ’s journalist members discuss their work with industry leaders. ICIJ members from more than two dozen countries, including India, Japan, Israel, Peru, France, Sweden, Germany, Russia, Egypt, Jordan and Slovenia, attended.
Keefe emphasized that the point of machine learning isn’t to create perfectly crunched data, but instead to automate repetitive tasks that frequently involve sorting through text, numbers or visual information. “What this does is help you find more documents that you wouldn’t have been able to find with a plain text search,” Keefe said. “But you still have to go back and double check, just like with any source.”
Machine learning in journalism is evolving rapidly, Keefe said. “We have not even scratched the surface on ways that we can use machine learning systems like this to help solve our problems.”