How ICIJ deals with massive data leaks like the Panama Papers and Paradise Papers

The International Consortium of Investigative Journalists has repeatedly faced the challenge of digging through gigantic amounts of data.

For instance, we recently shared with partners a new trove of 1.2 million leaked documents from the same law firm at the heart of the Panama Papers investigation, Mossack Fonseca. This was on top of the 11.5 million Panama Papers files brought to us in 2015 by the German newspaper Süddeutsche Zeitung and 13.6 million documents that were the basis of the subsequent Paradise Papers probe.

If a single journalist were to spend one minute reading each file in the Paradise Papers, it would take 26 years to go through all of them. Obviously, that’s not realistic. So, we asked ourselves, how can we find a shortcut? How can we make research more efficient and less time-consuming?
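The 26-year figure can be checked with a few lines of arithmetic. The sketch below assumes the journalist reads around the clock, every day of the year, which is the only assumption under which the numbers in the text work out; the function name is ours.

```python
# Back-of-the-envelope check of the reading-time estimate.
DOCUMENTS = 13_600_000            # Paradise Papers files, from the text
MINUTES_PER_DOCUMENT = 1          # one minute per file, from the text
MINUTES_PER_YEAR = 60 * 24 * 365  # assumption: reading nonstop, day and night


def years_to_read(documents, minutes_each=MINUTES_PER_DOCUMENT):
    """Return how many years it takes to read every document once."""
    return documents * minutes_each / MINUTES_PER_YEAR


print(round(years_to_read(DOCUMENTS)))  # prints 26
```

At a more human pace of eight hours a day, the same reading would take roughly three times as long.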

1. Human resources: Engage with partners

Gaming the global tax system is best investigated by a worldwide network of journalists. And that’s ICIJ’s model. We rally the best reporters on five continents to optimize research efforts and connect the data dots from one country to another.

Tax stories are like a puzzle with missing pieces. A reporter in Estonia might understand one end of the story; a Brazilian reporter might come across the other end. Bring them together, and you get closer to the complete picture. ICIJ’s job is both to connect those reporters and to ensure that they share everything they find in the data.

We call our philosophy radical sharing. ICIJ’s partners communicate their findings as they are working, not only with their immediate co-workers, but also with journalists who may be halfway around the world.

Emilia Díaz-Struck, ICIJ’s research editor, and Vanessa Wormer, from Süddeutsche Zeitung, explaining the data to reporters at a Paradise Papers meeting in 2016.

In order to promote collaboration, ICIJ provides a communication platform called the Global I-Hub. It has been described by its users as a “private Facebook” and allows the same kind of direct sharing of information that occurs in a physical newsroom. Reporters join groups that follow specific subjects – countries, sports, arts, litigation or any other topic in which they are interested. Within those groups, they can post about even more specific topics like a politician they found in the data or a specific transaction they are looking into. This is where most of the discussion happens, where journalists cross-check information and share notes and interesting documents.

It took ICIJ several projects to get reporters comfortable with the I-Hub. To ease their way onto the platform and deal with technical issues, ICIJ’s regional coordinators offer support. This is key to ensuring reporters meet the required security standard.

2. Secure communications: Encrypt everything

When you conduct an investigation involving 396 journalists, you have to be realistic about security: every individual is a potential target for attackers, and the risk of breach is really high. To mitigate this risk, ICIJ uses multiple defenses.

It is mandatory when joining an ICIJ investigation to set up a PGP key pair to encrypt emails. The principle of PGP is simple. You own two keys: one is public and is communicated to any potential correspondent, who can use it to send you encrypted emails. The second key is private and should never leave your computer. The private key serves only one purpose: to decrypt emails encrypted with your public key.

Think of PGP as a safe box where people can store messages for you. Only you have the key to open it and read the messages.
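To make the two-key idea concrete, here is a toy sketch of public-key encryption using textbook RSA with tiny primes. It is not PGP and must never be used for real security: actual PGP software uses far larger keys and combines public-key and symmetric encryption. The primes, the function names and the message are illustrative assumptions.

```python
# Toy RSA key pair: tiny primes, for illustration only.
P, Q = 61, 53
N = P * Q                    # modulus shared by both keys (3233)
PHI = (P - 1) * (Q - 1)      # 3120
E = 17                       # public exponent
D = pow(E, -1, PHI)          # private exponent (modular inverse, Python 3.8+)

PUBLIC_KEY = (E, N)          # shared with any correspondent
PRIVATE_KEY = (D, N)         # never leaves your computer


def encrypt(message: bytes, public_key):
    """Anyone holding the public key can lock a message, byte by byte."""
    e, n = public_key
    return [pow(byte, e, n) for byte in message]


def decrypt(ciphertext, private_key):
    """Only the private key holder can unlock the message."""
    d, n = private_key
    return bytes(pow(number, d, n) for number in ciphertext)


locked = encrypt(b"meet at noon", PUBLIC_KEY)
print(decrypt(locked, PRIVATE_KEY))  # prints b'meet at noon'
```

The public key plays the role of the slot in the safe box: anyone can drop a message in, but only the private key opens it.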

Like every security measure, PGP has vulnerabilities. For instance, it could easily be compromised if spyware is running on your computer, recording words as you type or sniffing every file on your disk. This highlights the importance of accumulating several layers of security. If one of those layers breaks, we hope the other layers will narrow the impact of a breach.

To verify the identity of its partners, ICIJ implements two-factor authentication on all of its platforms. This technique is very popular with major websites including Google, Twitter and Facebook. It provides the user with a second, temporary code required to log in. This code, a series of numbers, is usually generated on a different device, for instance, your phone, and expires quickly.
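One widely used way to generate such temporary codes is the time-based one-time password (TOTP) scheme standardized in RFC 6238. The text does not say which scheme ICIJ's platforms use, so the sketch below is only an illustration of the general idea; the shared secret is the test value from the RFC, not a real credential.

```python
import hashlib
import hmac
import struct
import time


def totp(secret: bytes, at_time=None, step=30, digits=6):
    """Compute a time-based one-time password (RFC 6238, HMAC-SHA1)."""
    if at_time is None:
        at_time = time.time()
    # The code only depends on which 30-second window we are in.
    counter = int(at_time) // step
    digest = hmac.new(secret, struct.pack(">Q", counter), hashlib.sha1).digest()
    # Dynamic truncation (RFC 4226): pick 4 bytes based on the last nibble.
    offset = digest[-1] & 0x0F
    number = struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(number % 10 ** digits).zfill(digits)


# Test secret from RFC 6238; the phone and the server both hold a copy.
SHARED_SECRET = b"12345678901234567890"
print(totp(SHARED_SECRET, at_time=59))  # prints 287082
```

Because the phone and the server derive the code from the same secret and the current time, the code never has to travel ahead of the login, and it stops working once the time window has passed.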

On some sensitive platforms, we even add a third authentication factor: the client certificate. Basically, it is a small file reporters store and configure on their laptops. Our network system will deny access to any device that doesn’t have this certificate.
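As a rough illustration of how a server can refuse devices that lack a certificate, the sketch below configures a TLS context with Python's standard ssl module. It is a generic example, not ICIJ's actual setup; the function name and the file path parameters are hypothetical.

```python
import ssl


def build_server_context(ca_file=None, cert_file=None, key_file=None):
    """Build a TLS server context that rejects clients without a certificate.

    All three paths are hypothetical: ca_file would hold the authority that
    signed the reporters' certificates, cert_file and key_file the server's
    own certificate and private key.
    """
    context = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    # CERT_REQUIRED makes the handshake fail if the client presents no
    # certificate, or one that the trusted authority did not sign.
    context.verify_mode = ssl.CERT_REQUIRED
    if ca_file:
        context.load_verify_locations(cafile=ca_file)
    if cert_file:
        context.load_cert_chain(certfile=cert_file, keyfile=key_file)
    return context


context = build_server_context()
print(context.verify_mode == ssl.CERT_REQUIRED)  # prints True
```

With a context like this, the check happens during the connection handshake, before any login page is even served.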

One other noteworthy mechanism ICIJ uses to improve its security is Ciphermail. This software runs as a proxy between our platforms and our users’ mailboxes. It identifies the PGP key associated with an email address and encrypts emails automatically when they are sent through our platforms. So, in short: any email reporters receive from ICIJ is encrypted.

3. Refine raw data

The Paradise Papers was a cache of 13.6 million documents. One of the main challenges in exploring them was that the leak came from a variety of sources: Appleby, Asiaciti Trust and 19 national corporate registries. When you have a closer look at the documents, you quickly notice their diverse content and character and the large presence of non-machine-readable formats.

Files from the Paradise Papers: the breakdown of files in Appleby and Asiaciti’s data.

Emails, PDF, Excel files – those documents reflect the internal activities of the two offshore law firms ICIJ investigated. Of course, this material was not originally structured in a way that would facilitate an investigation by journalists. ICIJ had to find the best way for its partners to dig into the two largest leaks in history.

ICIJ’s engineers put together a complex and powerful framework to allow reporters to search these documents. Using the expandable capacity of cloud computing, the documents were stored on an encrypted disk that was submitted to an extraction pipeline, a series of software systems that takes text from documents and converts it into data that our search engine can use.

Most of the files were PDFs, images, emails, invoices and the like. None were easily searchable. We had to find a way to facilitate access to these files. Using technologies like Apache Tika (to extract metadata and text), Apache Solr (to build search engines) and Tesseract (to turn images into text), the team built an open source tool called Extract with the single mission of turning raw documents into searchable, machine-readable content. This tool was particularly helpful in distributing the now-accessible data across up to 30 servers, all administered by ICIJ, to deliver the data to its journalists.
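The general shape of such a pipeline can be sketched in a few lines: pick an extractor based on the file type, pull out the text, then feed the words into an index that a search engine can query. The sketch below uses only Python's standard library; it does not call Tika, Solr or Tesseract and is not how Extract is implemented. The extractor functions, file names and sample contents are all hypothetical.

```python
from collections import defaultdict
from pathlib import PurePath


def extract_plain_text(raw: bytes) -> str:
    """Stand-in for a text extractor: decode the bytes as UTF-8."""
    return raw.decode("utf-8", errors="replace")


def extract_with_ocr(raw: bytes) -> str:
    """Placeholder: a real pipeline would hand the image to an OCR engine."""
    raise NotImplementedError("OCR is not implemented in this sketch")


# Hypothetical routing table: file extension -> extractor.
EXTRACTORS = {
    ".txt": extract_plain_text,
    ".eml": extract_plain_text,
    ".png": extract_with_ocr,
}


def build_index(documents):
    """Return (inverted index of word -> file names, files that were skipped)."""
    index = defaultdict(set)
    skipped = []
    for name, raw in documents.items():
        extractor = EXTRACTORS.get(PurePath(name).suffix.lower())
        try:
            if extractor is None:
                raise NotImplementedError(f"no extractor for {name}")
            text = extractor(raw)
        except NotImplementedError:
            skipped.append(name)
            continue
        for token in text.lower().split():
            word = token.strip(".,;:!?\"'()")
            if word:
                index[word].add(name)
    return index, skipped


documents = {
    "memo.txt": b"Offshore trust registered in Bermuda.",
    "mail.eml": b"Invoice for the Bermuda company",
    "scan.png": b"\x89PNG",
}
index, skipped = build_index(documents)
print(sorted(index["bermuda"]))  # prints ['mail.eml', 'memo.txt']
print(skipped)                   # prints ['scan.png']
```

Keeping a list of skipped files matters at this scale: it tells the team which formats still need an extractor before the leak is fully searchable.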

With efficiency and accessibility as the goal, ICIJ had to build a user interface to allow journalists to explore the refined information extracted from “unstructured data,” the hodge-podge of different types of documents from various sources. Once again, the choice was to reuse an open source tool, named Blacklight, which offers a user-friendly web portal where journalists can look into documents and use advanced search queries (like approximate string matching) to identify leads hidden in the leak.
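Approximate string matching is what lets a search for a name still succeed when the document contains a typo or a variant spelling. A common way to measure how close two strings are is the edit distance: the number of single-character insertions, deletions and substitutions needed to turn one into the other. The sketch below illustrates the idea only; it is not how Blacklight or its search backend implement it, and the function names and sample names are ours.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings (dynamic programming)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete a character from a
                current[j - 1] + 1,      # insert a character into a
                previous[j - 1] + cost,  # substitute (or keep) a character
            ))
        previous = current
    return previous[-1]


def fuzzy_search(query, names, max_distance=2):
    """Return the names within max_distance edits of the query."""
    query = query.lower()
    return [n for n in names if edit_distance(query, n.lower()) <= max_distance]


names = ["Mossak Fonseca", "Mossack Fonsecca", "Appleby"]
print(fuzzy_search("Mossack Fonseca", names))
# prints ['Mossak Fonseca', 'Mossack Fonsecca']
```

The tolerance is a trade-off: a larger max_distance catches more misspellings but also returns more unrelated names.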

4. Explore structured data

While ICIJ is committed to publishing information that is of public interest, we are obliged to do so without disclosing information that could jeopardize our sources’ anonymity. For this reason we decided to recreate the corporate registries that were leaked to us instead of using the registries themselves. In order to do that we had to comb through the registries for the names of entities and officers.

The Panama Papers and Paradise Papers data look very similar. However, the second is much more complex since it includes data from 21 different sources. Out of those sources, ICIJ focused its efforts on extracting data from only seven corporate registries and one Appleby database. Each source had to be treated with specific tools. To do so we created a series of scrapers, an army of small software systems that have only one mission: turning unstructured data into actual machine-readable database formats.

To build this database, ICIJ relied on Neo4j, an amazing technology that helped us convert data into graphs. Most people already use regular databases where data can be understood as tables. In a graph database, your information is stored as points of intersection (nodes) and links (edges), information that explains how companies and individuals may intersect – for instance, an individual might be the shareholder of one or several companies, thereby connecting them.
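The node-and-edge model can be sketched with plain Python dictionaries. This is a minimal in-memory illustration, not Neo4j and not ICIJ's data model; the class, the relationship label and all the names in the example are made up.

```python
from collections import defaultdict


class Graph:
    """A tiny undirected graph of officers and companies."""

    def __init__(self):
        self.nodes = {}                 # node id -> properties
        self.edges = defaultdict(list)  # node id -> [(relationship, other id)]

    def add_node(self, node_id, kind, name):
        self.nodes[node_id] = {"kind": kind, "name": name}

    def add_edge(self, source, relationship, target):
        # Store the link on both ends so it can be followed either way.
        self.edges[source].append((relationship, target))
        self.edges[target].append((relationship, source))

    def connected_companies(self, company_id):
        """Companies that share at least one shareholder with company_id."""
        found = set()
        for relationship, person in self.edges[company_id]:
            if relationship != "shareholder_of":
                continue
            for other_relationship, other in self.edges[person]:
                if other_relationship == "shareholder_of" and other != company_id:
                    found.add(self.nodes[other]["name"])
        return sorted(found)


graph = Graph()
graph.add_node("p1", "officer", "Jane Doe")          # fictional person
graph.add_node("c1", "company", "Alpha Ltd")         # fictional companies
graph.add_node("c2", "company", "Beta Holdings")
graph.add_node("c3", "company", "Gamma Inc")
graph.add_edge("p1", "shareholder_of", "c1")
graph.add_edge("p1", "shareholder_of", "c2")

print(graph.connected_companies("c1"))  # prints ['Beta Holdings']
```

In a table-based database the same question would require joining the shareholder table with itself; in a graph it is simply a matter of following two links.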

Because all the data was imported from a large variety of documents, it was essential for ICIJ to guarantee the integrity and the quality of the information. For that purpose developers used a tool called Talend which functions as an intermediary among data sources. Talend helped structure, transform and run tests on data to ensure it was uniform and searchable but not fundamentally altered. For instance, most of the documents ICIJ obtained used different date formats. Talend helped us to turn all of them into a single format.
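The date example gives a feel for this kind of cleanup. The sketch below tries a list of known formats and converts whatever matches into a single ISO format. It is written in Python for illustration and is not a Talend job; the list of formats is an assumption, and in practice each source would need its own list, since a value like 03/04/2005 is ambiguous without knowing whether the day or the month comes first.

```python
from datetime import datetime

# Assumed formats, tried in order. Day-first is assumed for slashed dates.
KNOWN_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%d-%b-%Y", "%B %d, %Y")


def normalize_date(raw):
    """Return the date as YYYY-MM-DD, or None if no known format matches."""
    cleaned = raw.strip()
    for date_format in KNOWN_FORMATS:
        try:
            return datetime.strptime(cleaned, date_format).date().isoformat()
        except ValueError:
            continue
    return None


for value in ["23/03/2005", "23-Mar-2005", "March 23, 2005", "unknown"]:
    print(value, "->", normalize_date(value))
```

Returning None instead of guessing keeps unparseable values visible, so they can be reviewed rather than silently altered.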

Finally, we imported this database into Linkurious, a tool for exploring and visualizing the refined, structured data. From there, ICIJ was able to publish the results on the Offshore Leaks Database website.

Creating shortcuts for research had taken us down a long road.

Want to start exploring ICIJ’s Offshore Leaks Database? Here’s a helpful how-to.