DATA TOOLS

How ICIJ’s Datashare project will help journalists breach borders

Behind the scenes, the Panama Papers and Paradise Papers investigations were powered by a central platform that allowed the International Consortium of Investigative Journalists’ worldwide network of journalists to collaborate while exploring tens of millions of files. As successful as these probes proved, the platform could neither search nor connect data stored on individual reporters’ computers.

Enter ICIJ’s Datashare platform, which, for the first time, will allow journalists across borders to collaboratively and discreetly mine the information contained in their respective documents.

To understand how it works, imagine an investigative reporter, let’s call her Gabriela, awash in a flood of potentially damning documents linking a government minister to contract price-fixing. We’ll call our fictional minister João Silva.

By downloading ICIJ’s Datashare, Gabriela can create a search engine for the documents that had previously threatened to swamp her investigation.

Datashare integrates ICIJ’s battle-tested Extract technology, which pulls machine-readable text out of files (using Apache Tika), applies optical character recognition (Tesseract OCR) to images, then adds the resulting text to a search engine (Elasticsearch).
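The extract-then-index flow described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Datashare's own code: `extract_text` is a hypothetical stand-in for the extraction and OCR step (Apache Tika and Tesseract in the real pipeline), `TinyIndex` is a toy inverted index standing in for Elasticsearch, and the file names and contents are invented.

```python
import re
from collections import defaultdict


def extract_text(raw):
    """Hypothetical stand-in for text extraction (Tika/Tesseract in the article)."""
    return raw.decode("utf-8", errors="replace")


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


class TinyIndex:
    """Toy inverted index standing in for the search engine (an assumption)."""

    def __init__(self):
        # Maps each token to the set of document ids that contain it.
        self.postings = defaultdict(set)

    def add(self, doc_id, text):
        for token in tokenize(text):
            self.postings[token].add(doc_id)

    def search(self, query):
        """Return the ids of documents that contain every query term."""
        terms = tokenize(query)
        if not terms:
            return set()
        result = set(self.postings.get(terms[0], set()))
        for term in terms[1:]:
            result &= self.postings.get(term, set())
        return result


# Invented sample files, already on the reporter's own machine.
documents = {
    "memo.txt": b"Minister Silva approved the contract.",
    "invoice.txt": b"Invoice for road works, approved by the ministry.",
}

index = TinyIndex()
for doc_id, raw in documents.items():
    index.add(doc_id, extract_text(raw))

print(index.search("silva contract"))
```

Everything here runs locally, which mirrors the article's point that the reporter's files become searchable on her own computer.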

In effect, it lets Gabriela automatically extract the names of individuals (like João Silva), organizations and locations from her files in English, Spanish, French and German using open source software, including Stanford CoreNLP, MIT Information Extraction, Apache OpenNLP, IXA pipes and GATE.

Figure: The three steps in the Datashare process. (Credit: ICIJ's Datashare)

It also helps her discover other names and connections relevant to her investigation. In the case of João Silva, her analysis shows he has a connection with a corrupt engineering magnate, Lucas Machado (another fictional character).
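One simple way to surface such connections is to count which names appear in the same documents as the person of interest. The sketch below assumes the entities have already been extracted per document. The `co_mentions` helper, the file names and the "Ministry of Works" entity are all hypothetical, and the article does not say that Datashare uses this particular method.

```python
from collections import Counter


def co_mentions(entities_by_doc, target):
    """Count how often other names share a document with `target`."""
    counts = Counter()
    for names in entities_by_doc.values():
        if target in names:
            counts.update(name for name in names if name != target)
    return counts


# Invented output of an entity-extraction step, keyed by file name.
entities_by_doc = {
    "memo.txt": {"João Silva", "Lucas Machado"},
    "contract.pdf": {"João Silva", "Lucas Machado", "Ministry of Works"},
    "invoice.txt": {"Ministry of Works"},
}

links = co_mentions(entities_by_doc, "João Silva")
print(links.most_common())
```

Names that recur alongside the target across many files, like Machado here, are natural leads to check by hand.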

When our ace reporter Gabriela makes the vital link between Silva and Machado, Datashare does not send the data to third-party platforms, such as servers controlled by Google, for analysis.

Crucially, however, it does allow Gabriela to connect not only with hundreds of other journalists but also with all the leaked data from previous ICIJ investigations she has access to, including the Panama Papers and Paradise Papers, hosted on ICIJ’s Knowledge Center.

Datashare allows Gabriela to upload documents and easily share them with journalists across the world.

The new network also enables Gabriela to put out a call for more information relating to João Silva.

When Anastasia, a Russian journalist, gets an alert from Datashare that someone on the network is chasing information on João Silva, a name mentioned in some of her documents, the plot thickens. By sharing documents, Gabriela and Anastasia connect Silva with Alexander Smirnov, a businessman close to Russian organized crime.
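The alert in this scenario amounts to matching a queried name against the names found in each colleague's documents. The sketch below is an assumption about how such matching could work, not a description of Datashare's implementation: `normalize` and `peers_to_alert` are hypothetical helpers, and the accent- and case-insensitive comparison is an illustrative design choice.

```python
import unicodedata


def normalize(name):
    """Strip accents, ignore case and collapse whitespace before comparing."""
    decomposed = unicodedata.normalize("NFKD", name)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return " ".join(stripped.casefold().split())


def peers_to_alert(query, names_by_peer):
    """Return the peers whose extracted names include the queried name."""
    key = normalize(query)
    return sorted(
        peer
        for peer, names in names_by_peer.items()
        if key in {normalize(n) for n in names}
    )


# Invented name lists; only names are compared, never document contents.
names_by_peer = {
    "Anastasia": {"Joao Silva", "Alexander Smirnov"},
    "Another reporter": {"Lucas Machado"},
}

alerts = peers_to_alert("João  SILVA", names_by_peer)
print(alerts)
```

Normalizing before comparing matters because the same name is often spelled with and without accents across documents in different languages.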

And after further investigation, Gabriela and Anastasia are confident enough in their collective findings to publish a series of front-page stories that they would never have uncovered without their confidential connection and the inquisitorial power of Datashare.

Datashare not only avoids duplication of effort; it also opens up new avenues of exploration and encourages diverse thinking. At the same time, it improves security and privacy, keeping information from the prying eyes of the corrupt and the powerful while remaining accessible to selected journalists 24/7.

American poet and political activist Muriel Rukeyser wrote: “The universe is made of stories, not atoms.” ICIJ’s Datashare project aims to make those atoms collide, with international impact.

ICIJ would like to thank the David and Helen Gurley Brown Institute for Media Innovation, Frontline and the Swedish Postcode Lottery for supporting this work.