The team behind ICIJ's Datashare accept the prize for best open source software at the 2019 Paris Open Source Summit.

DATA JOURNALISM

How ICIJ will rock its tech in 2020

Approximately one year ago, ICIJ unveiled the first beta version of Datashare, its self-hosted document search platform. You may have tried it, and who knows, you might even have found interesting leads in your documents thanks to this tool. But at the turn of this new decade, you could well be wondering: what is coming next?

Technology is our never-ending story at ICIJ. Technology and data have been the backbone of all our global investigations over the last seven years.

So as soon as we finish one story, we are already wondering what we are going to do next. Obvious for a media organization, right? But there is one thing that makes ICIJ unique among news outlets: we’ve accumulated millions of leaked documents over the years. And because people are sending us documents every day, this number is still growing.

To face this challenge, we needed to reinvent the way ICIJ deals with leaks.

Our first step was to produce a tool capable of indexing and exploring millions of documents. Enter Datashare.

We built Datashare on top of Extract, another piece of open source technology we created. Extract pulls text out of files while distributing the workload across dozens of servers, which lets us multiply the available power and process vast amounts of data much more quickly. It acts like a conductor, instructing each server, according to its availability, to handle a specific set of documents.
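
To make the conductor idea concrete, here is a minimal sketch in Python of availability-based distribution: documents sit in a shared queue and each worker takes the next one as soon as it is free. This is an illustration only, not Extract's actual code; the function name, the worker names and the use of threads in place of real servers are all assumptions.

```python
import queue
import threading


def distribute_documents(paths, worker_names):
    """Hand each document to whichever worker becomes available first."""
    tasks = queue.Queue()
    for path in paths:
        tasks.put(path)

    assignments = {name: [] for name in worker_names}
    lock = threading.Lock()

    def worker(name):
        while True:
            try:
                # A free worker pulls the next pending document.
                path = tasks.get_nowait()
            except queue.Empty:
                return  # Nothing left to process.
            # A real worker would extract the text of `path` here.
            with lock:
                assignments[name].append(path)

    threads = [threading.Thread(target=worker, args=(n,)) for n in worker_names]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return assignments


# Hypothetical file names and server names, for illustration only.
paths = [f"leak/doc-{i}.pdf" for i in range(20)]
result = distribute_documents(paths, ["server-a", "server-b", "server-c"])
```

Which worker ends up with which document varies from run to run, but every document is handled exactly once, and adding workers adds capacity without changing the logic.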

On the reporting side, the way that ICIJ deals with big datasets is to pull together big collaborations – it’s a proven formula that works well. But at an individual level, every journalist who starts to work on a leak rapidly faces the immensity of the leak itself. Yes, Datashare makes it easy to search through documents, but where should a reporter start?

To help our reporters, our team of developers must think like reporters. And one of the first things any journalist looks for in a dataset is familiar names. So we got to work, figuring out how technology could help, and created an automated entity extraction feature. When our system finds the name of an organization, person, or location, it indexes it so it becomes both highly visible and rapidly searchable for our users. Clever, yes; but it requires that the system works correctly, and it assumes that the entity is relevant to the investigation. These two conditions are far from easy to achieve.
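
As a rough illustration of what "indexing an entity" means, the Python sketch below maps each name it finds to the set of documents that mention it. It is a simplified stand-in, not Datashare's pipeline: a hand-written list of known names replaces a real entity recognition model, matching is a naive case-insensitive substring test, and the documents and names are made up.

```python
from collections import defaultdict


def build_entity_index(documents, known_entities):
    """Map (category, name) to the ids of the documents mentioning that name."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        lowered = text.lower()
        for name, category in known_entities.items():
            # Naive substring match; a real system would use a trained model.
            if name.lower() in lowered:
                index[(category, name)].add(doc_id)
    return dict(index)


# Made-up documents and names, for illustration only.
documents = {
    "doc-1": "Acme Holdings wired money to Jane Doe in Geneva.",
    "doc-2": "Jane Doe resigned from the board.",
}
known_entities = {
    "Jane Doe": "PERSON",
    "Acme Holdings": "ORGANIZATION",
    "Geneva": "LOCATION",
}

index = build_entity_index(documents, known_entities)
```

Looking up `index[("PERSON", "Jane Doe")]` then returns every document that mentions her, which is what makes a familiar name rapidly searchable. The sketch also shows both weak points: a missed or wrong match breaks the lookup, and nothing says whether a matched name matters to the story.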

In addition to entity extraction, we’ve also started exploring artificial intelligence (AI) as a potential assistant for reporters. As we recently shared on our blog, AI and, more specifically, machine learning, can be extremely efficient at categorizing records. When it’s trained correctly, such technology is not only fast, it’s also resistant to common reporting missteps that might lead to a missed document or an overlooked story lead.

All these grand ideas are wonderful, but ICIJ is only a small organization with limited means. So instead of relying only on our own resources, we are taking steps toward more openness and scalability. We may not be able to study all the algorithms available on earth ourselves, but we can certainly facilitate collaborations to expand our knowledge base.

In 2020, we are going to focus our efforts on making Datashare a more versatile and transparent tool.

There’s an API for that

To do so, we are going to improve our Application Programming Interface (API) and write consistent documentation for how to access and use it. An API is the part of the software that exposes features to other applications. For instance, OpenCorporates has an API for querying its database of companies. With Datashare, we have already used the API to build a command-line tool that can batch-tag documents.

Use ElasticSearch under the hood

When building Datashare, one of our strategic decisions was to use ElasticSearch to create an index of documents. You may never have heard of this powerful piece of software – but it’s currently behind a lot of search engines you probably use every day.

It would be fair to describe Datashare as a nice looking (and very pink!) web interface for ElasticSearch. We want our search platform to be user-friendly while keeping all the powerful ElasticSearch features available for advanced users.

This way we ensure that Datashare is usable by non tech-savvy reporters, but still robust enough to satisfy data analysts and developers.
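
For readers curious about what those advanced features look like, the Python sketch below builds a request body using Elasticsearch's `query_string` query, which accepts boolean operators and quoted phrases. The field name `content` and the example query are assumptions made for illustration; they are not taken from Datashare's actual index mapping.

```python
import json


def build_search_body(query, field="content", size=10):
    """Build an Elasticsearch request body for a Lucene-style query string."""
    return {
        "size": size,  # Maximum number of hits to return.
        "query": {
            "query_string": {
                "query": query,
                "default_field": field,  # Assumed field name.
            }
        },
        # Ask Elasticsearch to return matching snippets for the same field.
        "highlight": {"fields": {field: {}}},
    }


body = build_search_body('"offshore company" AND (Panama OR Bahamas)')
print(json.dumps(body, indent=2))
```

A body like this would be sent to an index's `_search` endpoint. A friendly interface can generate it from a simple search box, while an analyst can write the query string by hand.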

ICIJ’s Datashare tool allows users to search and analyze documents securely.

Make plugins, not war

But what if you want to integrate text translations into Datashare’s interface? Or make it display a tweet that you scraped with Twint? Or create a custom form to filter by country? Currently, you’d have to fork the Datashare code and submit a “pull request” so our team could add your feature to the code base.

Next year we will explore the possibility of creating plugins, to make this process more accessible. Instead of modifying Datashare directly, you could isolate your code with a specific set of features and then configure Datashare to use it. Each Datashare user could pick the plugins they need or want, and have a fully customized installation of our search platform. Neat idea, isn’t it?
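
One common way to structure such a system is a registry: each plugin registers itself under a name, and a configuration lists which names to switch on. The Python sketch below shows the pattern in miniature. It is a generic illustration, not a description of how Datashare plugins will work; every name in it, including `register`, `apply_plugins` and the two sample plugins, is hypothetical.

```python
from typing import Callable, Dict, List

Document = Dict[str, object]
Plugin = Callable[[Document], Document]

# Registry of available plugins, keyed by name.
_REGISTRY: Dict[str, Plugin] = {}


def register(name: str) -> Callable[[Plugin], Plugin]:
    """Decorator that makes a function available as a named plugin."""
    def decorator(func: Plugin) -> Plugin:
        _REGISTRY[name] = func
        return func
    return decorator


@register("word-count")
def add_word_count(doc: Document) -> Document:
    # Returns a copy of the document with one extra field.
    return {**doc, "word_count": len(str(doc.get("text", "")).split())}


@register("uppercase-title")
def uppercase_title(doc: Document) -> Document:
    return {**doc, "title": str(doc.get("title", "")).upper()}


def apply_plugins(doc: Document, enabled: List[str]) -> Document:
    """Run only the plugins a given installation has switched on, in order."""
    for name in enabled:
        doc = _REGISTRY[name](doc)
    return doc


doc = {"title": "memo", "text": "Transfer the funds on Friday"}
enriched = apply_plugins(doc, ["word-count"])
```

The core application never needs to know what a plugin does. It only looks up the enabled names, so two users can run the same code base with different feature sets.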

Analyze documents the way you like

Finally, we want to enable arbitrary code execution on Datashare for advanced text-mining and analysis. We envision two different methods a user could choose from, depending on the code and their level of expertise: first, code snippets that could be executed directly from the web interface (using a scripting language like Python); and second, a dedicated environment for advanced users who need to run heavy analysis with a custom set of libraries. The magic of this sort of functionality is that it would be helpful for both our external users – like our media partners – and our own team of reporters and analysts who constantly need to process documents.

Take the discussion where it belongs

One of the secrets to ICIJ’s collaborative success is the transparency between partners. We ask all our media partners to embrace the concept of “radical sharing,” which encourages reporters to share every lead or potential story they discover during an investigation. This all happens in a secure, virtual newsroom – the I-Hub – which we built to facilitate collaboration across countries, continents and languages.

In 2019, we migrated the I-Hub to a brand new platform. We wanted a secure system that could accommodate more customization and evolve over time.

For 2020, we’ve obtained a grant from the Swedish Postcode Foundation to build more bridges between the I-Hub and Datashare. We want to ensure conversations can flow smoothly between our research tool and our collaborative platform.

The first thing we want to do is allow users to start a discussion directly on documents they discover on Datashare. This “commenting system” will be automatically reflected on the I-Hub. This way every journalist who wants to flag a document will be able to see what other users say about it.

Once this commenting system is operational, we want to go further and give users the ability to annotate specific sentences or paragraphs within documents.

Driving investigations with a dashboard

During investigations, the I-Hub is a bit like a control tower and ICIJ staff are the operators. Without proper and transparent communication, our partners would be in the dark. The topics we work on are very complex and even more complex when you try to cover so many countries – for new partners joining an investigation, it can be a daunting task to get up to speed, and a challenge to stay on top of everything that’s happening.

ICIJ’s I-Hub is a secure, virtual newsroom that connects journalists around the world.

We are committed to making our reporters’ lives simpler, and will be adding many new features to the I-Hub, including a shared agenda (with project milestones), a list of the leading stories, info about upcoming meetings, and more!

Stronger and more stable infrastructure

With every new tool, we introduce potential new points of vulnerability in our growing tech network. Given the number of services and users, the size of our data and the highly confidential nature of our work, we could easily face an industrial nightmare if we didn’t take adequate precautions. With the help of a pentester (a person who puts the security of our platforms to the test) and our systems administration team, we’ll continue to roll out new protocols to ensure the continuity and maintainability of all our platforms.