How we mined more than 715,000 Luanda Leaks records
Luanda Leaks was a trove of more than 715,000 records – so how do we go about tackling such a massive dataset? And how can we be sure we haven’t missed major stories?
The Luanda Leaks is a collection of more than 715,000 financial and business records – some of which are hundreds of pages each – that provide a window into how Africa’s richest woman, Isabel dos Santos, made her fortune at Angola’s expense.
A leak of 715,000 files is certainly not the largest leak that the International Consortium of Investigative Journalists has had to investigate – the Panama Papers and Paradise Papers were more than 24 million documents combined. But it’s still more documents than a single reporter can possibly read on their own.
Assuming each document is about 500 words long (or the equivalent in a spreadsheet), it would take more than two and a half years of non-stop reading for a single person to pore over every file.
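That estimate can be checked with some quick arithmetic. The sketch below uses the document count and the 500-word figure from above; the reading speed of 250 words per minute is our own assumption for illustration, not a number ICIJ measured.

```python
# Back-of-the-envelope reading-time estimate.
# DOCUMENTS and WORDS_PER_DOCUMENT come from the article;
# WORDS_PER_MINUTE is an assumed reading speed.
DOCUMENTS = 715_000
WORDS_PER_DOCUMENT = 500
WORDS_PER_MINUTE = 250  # assumption


def reading_time_years(documents, words_per_document, words_per_minute):
    """Years of reading around the clock, with no breaks."""
    total_words = documents * words_per_document
    minutes = total_words / words_per_minute
    return minutes / (60 * 24 * 365)


years = reading_time_years(DOCUMENTS, WORDS_PER_DOCUMENT, WORDS_PER_MINUTE)
print(f"{years:.1f} years of non-stop reading")
```

At that assumed speed the total comes to roughly 2.7 years. A slower or faster reader shifts the figure proportionally.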
So how do we go about tackling such a massive dataset? And how can we be sure we haven’t missed major stories?
We use a combination of technology, data analysis, research and the power of collaboration: journalists from 20 countries, exploring the documents, sharing their findings in a radical way, and then taking their reporting well beyond the leak.
Here are five elements that were key for mining the Luanda Leaks documents:
1. Dig through documents with Datashare
The first step was to upload all 715,000 leaked files into Datashare, our secure document research platform, which the 120 journalists working on Luanda Leaks could use to search, filter, star and tag the documents. Datashare is able to automatically recognize, highlight and list the names of people, organizations and locations in the files, and is also able to extract text from scanned documents and images to make it searchable.
We estimated the server cost of this operation at $13,300 (please donate to ICIJ). Extracting text from images, a process known as optical character recognition (OCR), was the most expensive step.
With more than half of the documents written in Portuguese, digging into the leak was even more challenging – the majority of journalists working on this project were not fluent in Portuguese. For security and source-protection reasons, we wanted to avoid common online machine-translation tools (such as Google Translate or DeepL) and, instead, have the translation directly available on Datashare. So we decided to use an open-source piece of software called Apertium. Our team wrapped Apertium within a command-line tool that was able to translate any language pair directly into Datashare. We published the code of this tool on GitHub.
Journalists could also “batch search” documents: if they had lists of people, organizations or anything that they wanted to search en masse, they could upload a spreadsheet of search terms and Datashare would provide results per query, which helped save a lot of time.
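To give a feel for the idea, here is a minimal sketch of a batch search in plain Python. It is not Datashare's code or API: the `batch_search` function, its `exact` flag, and the sample documents and company names are all hypothetical placeholders, and a real platform would query a search index rather than scan text in memory.

```python
import csv
import io


def batch_search(terms_csv, documents, exact=True):
    """Return {term: [document ids]} for every term in the CSV's first column.

    exact=True matches the whole phrase; exact=False only requires
    that every word of the term appears somewhere in the document.
    """
    reader = csv.reader(io.StringIO(terms_csv))
    terms = [row[0].strip() for row in reader if row and row[0].strip()]

    results = {}
    for term in terms:
        needle = term.lower()
        hits = []
        for doc_id, text in documents.items():
            haystack = text.lower()
            if exact:
                matched = needle in haystack
            else:
                matched = all(word in haystack for word in needle.split())
            if matched:
                hits.append(doc_id)
        results[term] = hits
    return results


# Placeholder documents and search terms, invented for illustration.
documents = {
    "doc-1": "Invoice issued to Example Holdings Ltd for consulting services",
    "doc-2": "Minutes of the board meeting of Acme Trading SA",
    "doc-3": "Transfer from Holdings account: Example reference 42",
}
terms_csv = "Example Holdings\nAcme Trading\nNo Such Company\n"

exact_results = batch_search(terms_csv, documents, exact=True)
loose_results = batch_search(terms_csv, documents, exact=False)

for term, hits in exact_results.items():
    print(term, "->", hits)
```

Each term gets its own result list, including an empty one when nothing matches, so a reporter can see at a glance which names on a list never appear in the files.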
2. Visualizing the data
After Datashare created an index of all the people and companies named in the 715,000 files, reporters needed a way to visualize all the connections and networks of entities and documents in the leak.
Drawing on experience from previous investigations, including the Panama and Paradise Papers, we used Neo4j and Linkurious (along with Talend and an SQL server) to generate a database and create a visual representation of the networks within the leak.
Using Linkurious, reporters could type a name like “Sonangol” and then find which documents were linked to this company. The visualization would show other people and companies that were also connected to the same documents, making it easy to identify links between the hundreds of thousands of records.
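The underlying idea is a graph in which entities and documents are nodes and a mention is a link between them. The sketch below mimics that lookup with plain Python dictionaries. It is an illustration only, not how Neo4j or Linkurious work internally, and the function names, entity names and document ids are hypothetical.

```python
from collections import defaultdict


def build_index(mentions):
    """Index (document_id, entity_name) pairs in both directions."""
    docs_by_entity = defaultdict(set)
    entities_by_doc = defaultdict(set)
    for doc_id, entity in mentions:
        docs_by_entity[entity].add(doc_id)
        entities_by_doc[doc_id].add(entity)
    return docs_by_entity, entities_by_doc


def connected_entities(name, docs_by_entity, entities_by_doc):
    """Map each co-mentioned entity to the documents it shares with `name`."""
    shared = defaultdict(set)
    for doc_id in docs_by_entity.get(name, ()):
        for other in entities_by_doc[doc_id]:
            if other != name:
                shared[other].add(doc_id)
    return {entity: sorted(docs) for entity, docs in shared.items()}


# Placeholder mentions, invented for illustration.
mentions = [
    ("doc-1", "Company A"),
    ("doc-1", "Person X"),
    ("doc-2", "Company A"),
    ("doc-2", "Company B"),
    ("doc-2", "Person X"),
    ("doc-3", "Company B"),
]

docs_by_entity, entities_by_doc = build_index(mentions)
links = connected_entities("Company A", docs_by_entity, entities_by_doc)
for entity, docs in sorted(links.items()):
    print(entity, "shares", docs)
```

A graph database answers the same question with a query and scales it to hundreds of thousands of records, while the visualization layer draws the result as a network.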
3. Training the machines
When you’ve got hundreds of thousands (or even millions) of files to search through, it’s essential to be able to spot patterns in the documents, and quickly identify the more useful records for further scrutiny. For example, in the Luanda Leaks, how could we find all the files that contained contracts? Or could we identify other types of useful documents that were similar to each other, but wouldn’t necessarily show up with simple keyword searches?
Over the last three years, ICIJ has been exploring ways in which artificial intelligence can help journalists with investigations. For Luanda Leaks, we partnered with Quartz AI Studio to see if machine learning could help provide answers to reporters’ requests.
A starting point, for example, was the extraction of titles and subtitles from the documents and the generation of clusters based on words they had in common. Reporters would then get a spreadsheet that gave insights into the type of documents available in the leak that were similar to each other.
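As a rough illustration of grouping documents by the words their titles share, here is a small sketch. It is not the method Quartz AI Studio used: the greedy approach, the Jaccard similarity measure, the `threshold` value, the stopword list and the sample titles are all assumptions chosen to keep the example short.

```python
import re

# A tiny, assumed stopword list; a real one would be much longer
# and would cover Portuguese as well as English.
STOPWORDS = {"a", "an", "and", "for", "of", "the", "to"}


def title_words(title):
    """Lowercased alphabetic words in a title, minus stopwords."""
    return {w for w in re.findall(r"[a-z]+", title.lower()) if w not in STOPWORDS}


def jaccard(a, b):
    """Share of words two sets have in common (0.0 to 1.0)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_titles(titles, threshold=0.3):
    """Greedy clustering: a title joins the first cluster whose
    first member it resembles closely enough, else starts a new one."""
    clusters = []  # list of (seed_words, member_titles)
    for title in titles:
        words = title_words(title)
        for seed_words, members in clusters:
            if jaccard(words, seed_words) >= threshold:
                members.append(title)
                break
        else:
            clusters.append((words, [title]))
    return [members for _, members in clusters]


# Placeholder titles, invented for illustration.
titles = [
    "Consulting Services Agreement",
    "Agreement for Consulting Services",
    "Balance Sheet 2015",
    "Balance Sheet and Notes",
    "Minutes of the Board Meeting",
]

clusters = cluster_titles(titles)
for members in clusters:
    print(members)
```

With these sample titles the two agreements land together, the two balance sheets land together, and the meeting minutes form a cluster of their own.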
Another process involved journalists identifying types of documents of interest like “balance sheets” or “financial agreements” that could be used to train the computer to find other files that contained similar information or looked alike.
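The sketch below shows the general shape of that workflow: a few labeled examples go in, and new text is matched to the closest label. It is a deliberately simple word-count comparison using cosine similarity, not the models built for Luanda Leaks, and the function names and sample texts are hypothetical.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Count the lowercased alphabetic words in a text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def train(labeled_examples):
    """Turn {label: [texts]} into one summed word-count profile per label."""
    profiles = {}
    for label, texts in labeled_examples.items():
        profile = Counter()
        for text in texts:
            profile.update(bag_of_words(text))
        profiles[label] = profile
    return profiles


def classify(text, profiles):
    """Return (label, similarity) for the profile closest to the text."""
    words = bag_of_words(text)
    scored = ((label, cosine(words, profile)) for label, profile in profiles.items())
    return max(scored, key=lambda pair: pair[1])


# Placeholder training texts, invented for illustration.
labeled_examples = {
    "balance sheet": [
        "total assets total liabilities shareholders equity",
        "current assets and current liabilities at year end",
    ],
    "financial agreement": [
        "the lender agrees to lend the borrower the principal amount",
        "this loan agreement is made between the lender and the borrower",
    ],
}

profiles = train(labeled_examples)
label_a, score_a = classify("assets and liabilities as of december", profiles)
label_b, score_b = classify("the borrower shall repay the lender", profiles)
print(label_a, round(score_a, 2))
print(label_b, round(score_b, 2))
```

Real document classifiers work with far more examples and richer features, such as page layout, but the loop is the same: reporters label, the machine generalizes, and reporters check the results.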
The results were shared in spreadsheets and integrated into Datashare, where partners could select clusters of similar documents for further exploration: bank documents, contracts, minutes, bank transfers, utility bills, water bills, among others.
4. The human factor
Using technology helps make the files easier to process, organize and explore. But making sense of the information and connecting it with additional sources wouldn’t have been possible without reporters. More than 120 brains put the pieces together and added months of reporting to leads they found in the documents.
Answering key reporting questions and finding newsworthy stories required substantial data analysis and research. Could we identify all the companies in which Isabel dos Santos and her husband Sindika Dokolo held a stake as shareholders? Would it be possible to know how much money was invoiced by PwC and Boston Consulting to dos Santos’ and Dokolo’s companies for services provided?
To answer these and other questions, we explored thousands of invoices and corporate records, checked external data sources and manually built our own databases to track the research and facilitate the analysis. Each data entry had to be verified and fact-checked. And now, you can download a database of more than 400 connections to Isabel dos Santos and Sindika Dokolo.
5. Radical sharing
“How is the information connected across countries? Any exciting findings? How can we help each other? When do we publish?”
When you’re dealing with massive datasets, the only way to investigate thoroughly is to share the load. ICIJ’s Global I-Hub is where radical collaborative magic happens!
Journalists could log in to securely share their findings and coordinate the reporting. This technology is essential to keep the team together and the communication flowing for months. For Luanda Leaks alone, journalists shared more than 3,500 messages on 837 topics – an effort that couldn’t have happened over email and phone calls alone.
Who worked on the project?
The ICIJ team: Anne L’Hôte, Ashlee Guevara, Bruno Thomas, Delphine Reuter, Jelena Cosic, Madeline O’Leary, Mago Torres, Miguel Fiandor, Pauliina Sinauer, Rigoberto Carvajal, Soline Ledésert, Pierre Romera, Emilia Diaz-Struck
Quartz’s AI Studio: Jeremy Merrill, John Keefe