Starting today, you can get a taste of the International Consortium of Investigative Journalists’ powerful research platform without having to install any software – and use it to explore a set of leaked documents.
Datashare, ICIJ’s open-source document analysis tool, is now available online via a web browser, complete with more than 1,000 files from the Luxembourg Leaks investigation.
This Datashare demo website gives users a sense of how investigative journalists work on documents and use ICIJ’s tool to help find stories. This is the cloud version of Datashare that allows multiple users to simultaneously search a central dataset – which is what ICIJ uses for its global investigations. Also available for download and installation is the local version, which is exactly the same but works locally on your machine with your documents. You can get this local version for free here.
We’ve added the documents from ICIJ’s Luxembourg Leaks, so users can test the Datashare features that ICIJ journalists use during their investigations.
Here are three tips to help make the most of your searches on Datashare:
1. Where to start? Explore ‘named entities’
There are 1,086 documents in Luxembourg Leaks… a far cry from the 715,000 records of Luanda Leaks or the 2.6-terabyte trove that was the Panama Papers. Regardless of size, all investigations begin with the same question: where do I start?
To get a broad idea of what story leads could be buried in a dataset, we recommend beginning with the “named entity” filters in Datashare. These filters list the people, organizations and locations mentioned in the documents – often essential starting points for an investigation.
The Natural Language Processing pipelines that Datashare relies on to automatically identify named entities are very powerful, but not yet perfect, so you will find errors. Still, they help provide an overview of the central figures in your data.
For example, looking at the list of ‘People’ Datashare has identified in the Luxembourg Leaks documents shows you that the names ‘Kohl’ and ‘Marius Kohl’ appeared the most. This makes sense given that, as head of Luxembourg’s revenue authority, Marius Kohl was a central figure in the reporting.
2. Search lists of people, organizations, or topics you watch
Instead of running searches one by one, you can upload entire lists of search terms and search them in batches.
Whether it’s a collection of names or places or anything else, you can prepare your list in a spreadsheet, export it as a CSV, and upload it to Datashare. In return, you will get results for each item in a table view or in a downloadable CSV file.
For example, you could create a list of every member of parliament or the biggest companies in your country – and then upload it to see if any appear in our Lux Leaks data. Datashare will then search our documents – and tell you which names it finds.
To demonstrate, we prepared two batch searches that you can find here and here. The first one looks for every country and the second one searches for demonyms. Together, they give a quick first overview of which nations and citizens are mentioned in these leaks.
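If your list lives in a script rather than a spreadsheet, a few lines of Python can produce the CSV. The sketch below is only an illustration: it assumes a single column with one search term per row, which may not match the exact layout Datashare expects (check the user guide), and the names in the list and the helper `build_batch_csv` are hypothetical.

```python
import csv
import io


def build_batch_csv(terms):
    """Return CSV text with one search term per row.

    Assumption: a single-column file, one term per row. Blank entries are
    dropped and duplicates (ignoring case) are removed, keeping the first.
    """
    seen = set()
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for term in terms:
        cleaned = term.strip()
        key = cleaned.casefold()
        if cleaned and key not in seen:
            seen.add(key)
            writer.writerow([cleaned])
    return buffer.getvalue()


# Hypothetical list of names to look for in the documents.
names = ["Marius Kohl", "  Investcorp ", "Barclays", "barclays", ""]
csv_text = build_batch_csv(names)
print(csv_text)
```

Saving `csv_text` to a file gives you something you can upload as a batch search. Cleaning the list first keeps the results table free of empty or repeated rows.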
3. Want to catch typos and make better searches? Use search operators
In Luxembourg, private limited liability companies are called SARL (“Société à Responsabilité Limitée”).
To search all SARLs in Luxembourg Leaks, you can first type ‘SARL’ in Datashare’s search bar: it finds 661 documents.
But to catch some potential typos, you can use an operator that expands your search (this is called fuzzy searching). A search for ‘SARL~1’ leads to 763 documents. This tilde (~) allows you to catch insertions, deletions, substitutions and transpositions – so you’re sure you don’t miss any mentions, such as the 102 Lux Leaks documents that spelled SARL in a slightly different way, like in the sentence:
“Both Investcorp and Barclays invested in the structure through their own Luxembourg companies respectively named Vending Investments S.arl. (“Luxco 1) and Colonnade Holdeo No. 11 S.a rl. (“Colonnade Holdco”).”
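To get a feel for what “within one edit” means, here is a minimal Python sketch of an edit distance that counts insertions, deletions, substitutions and transpositions of adjacent characters. It illustrates the idea only and is not Datashare’s actual implementation: how the tool splits text into words before comparing them is not covered here, and the function names and sample spellings are hypothetical.

```python
def edit_distance(a, b):
    """Count the edits needed to turn a into b.

    Allowed edits: insert, delete or substitute one character, or swap two
    adjacent characters (the "optimal string alignment" variant).
    """
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete every character of a
    for j in range(cols):
        d[0][j] = j  # insert every character of b
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Two adjacent characters swapped counts as a single edit.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def fuzzy_match(term, candidate, max_edits=1):
    """True if candidate is within max_edits of term, ignoring case."""
    return edit_distance(term.lower(), candidate.lower()) <= max_edits


# Hypothetical misspellings of SARL.
for word in ["SARL", "SRAL", "SARLS", "SAL", "SURL", "SA"]:
    print(word, fuzzy_match("SARL", word))
```

In this sketch every sample spelling except “SA” is within one edit of “SARL”. Raising `max_edits` catches more variants but also more unrelated words, which is the trade-off to keep in mind when choosing the number after the tilde.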
Other operators, such as AND, NOT, and wildcards, are also available and will help you make more efficient searches. Check out the user guide for more information about search operators.
You’re now ready to explore Datashare!
Follow Datashare’s updates on Twitter with the hashtag #ICIJDatashare.