The rise of the machine has freed ICIJ members globally to pore over millions of documents in a custom-built search engine.
But even this next-level research has posed substantial challenges: for example, what to do when certain phrases return an indigestible 150,000 results? Clearly, the next step to speeding up our research was to intelligently filter information relevant to each investigation.
Here’s how we streamlined the previously daunting process, giving us both unprecedented flexibility and far more manageable, relevant search results.
In leaks like the Paradise Papers, we dealt with millions of documents (including PDFs, photos, and emails) that traditional tools like Excel can’t handle. This is known as Big Data, where huge volumes of often unorganized data need to be tamed into structured sets.
We upload leaked files to a server for indexing, using Apache Solr, a freely available, open-source search platform. For documents stored as image files (e.g., a printed PDF signed and scanned into a computer) we need to use a technique known as optical character recognition, or OCR.
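To give a sense of what indexing looks like under the hood, here is a minimal sketch of how documents can be submitted to Solr’s JSON update handler. The core name, host, and field names are invented for illustration; ICIJ’s actual configuration is not public.

```python
import json

# Hypothetical Solr core and host; the real setup will differ.
SOLR_URL = "http://localhost:8983/solr/leak_docs/update?commit=true"

def build_index_payload(docs):
    """Build the JSON body Solr's update handler expects:
    a list of documents, each a flat dict of field -> value."""
    return json.dumps(docs)

docs = [
    {"id": "doc-001", "file_name": "transfer.pdf",
     "extension": "pdf", "size_bytes": 48213,
     "content": "Wire transfer instructions ..."},
]
payload = build_index_payload(docs)
# In a real run you would POST this payload to Solr, e.g. with requests:
# requests.post(SOLR_URL, data=payload,
#               headers={"Content-Type": "application/json"})
```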
ICIJ’s team developed our own OCR tool, called Extract, which recognizes and indexes text using the power of as many as 30 servers at a time. It builds on components from existing open-source OCR tools such as Apache Tika and Tesseract. You can take a look at the tool on our GitHub.
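One core decision a pipeline like Extract has to make is which files actually need OCR. The routing logic below is an illustrative simplification, not Extract’s actual code: image formats always go to OCR, and a PDF whose extracted text layer comes back nearly empty was probably scanned, so it goes to OCR too. The threshold and function names are assumptions.

```python
# Simplified OCR-routing sketch: image files go to Tesseract-style OCR,
# everything else gets normal text extraction (as Tika would provide).
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg", ".tif", ".tiff", ".bmp"}

def needs_ocr(path, extracted_text=""):
    """Decide whether a file should be OCRed.

    Image files always need OCR; a PDF whose text layer came back
    (near-)empty was probably scanned, so it needs OCR as well.
    """
    ext = path[path.rfind("."):].lower() if "." in path else ""
    if ext in IMAGE_EXTENSIONS:
        return True
    if ext == ".pdf" and len(extracted_text.strip()) < 10:
        return True  # scanned PDF with no usable text layer
    return False

print(needs_ocr("scan.tiff"))                         # True
print(needs_ocr("report.pdf", "Full text here xyz"))  # False
```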
Then we use Talend Studio. It’s a data transformation and analysis software based on visual components that you drag around a work area and connect to create a flow.
First, we automate a search and store the results by creating a Talend job that searches for a topic, organization name or individual that we are investigating, such as “Glencore.” The search will return hits and other information from the text of the documents and their metadata, such as the document ID, file name, file root, extension and size in bytes.
The search results are stored in a relational database, which lets us track how items relate to one another. It’s not necessary to store the full text of every document, because analyzing it is a slow process. We can add a filter so only the first few pages or words of each document are analyzed.
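The automated search itself boils down to a parameterized Solr query. The sketch below builds the query string such a job might send; the field names mirror the metadata mentioned above but are assumptions, and the highlighting parameters show one standard Solr way to get short snippets back instead of full document text.

```python
from urllib.parse import urlencode

# Hypothetical query against a Solr core; field names are illustrative.
def build_search_params(term, max_chars=500):
    return urlencode({
        "q": f'content:"{term}"',
        "fl": "id,file_name,extension,size_bytes",  # metadata fields only
        "rows": 100,
        # Highlighting asks Solr for short snippets around each hit
        # rather than the full document text, which keeps analysis fast.
        "hl": "true",
        "hl.fl": "content",
        "hl.fragsize": max_chars,
    })

params = build_search_params("Glencore")
# A Talend job would then issue:
# GET http://localhost:8983/solr/leak_docs/select?<params>
print(params)
```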
Clustering is a technique that allows us to group similar things. Instead of scrolling through a mass of PDFs, it’s helpful to create groups of documents by topic or by the type of document, so the reporter can access similar documents all at once, such as all the PDFs concerning fund transfers.
We use RapidMiner to process the text and metadata of documents and create clusters based on common words and phrases. RapidMiner is a powerful tool that makes it easy to implement data mining algorithms, and it also uses a visual workspace rather than a drab text editor.
Now, we need to process the content of the documents by applying transformations and filters to the document text in RapidMiner. Here is a more detailed workflow:
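The actual workflow is built from visual RapidMiner operators, but the transformations it applies can be sketched in plain Python: tokenize, lowercase, drop stopwords, crudely stem, then group documents that share enough tokens. The stopword list, stemming rule, and grouping threshold below are all simplifications for illustration, not the operators RapidMiner uses.

```python
import re

# Illustrative stand-ins for text-processing operators:
# tokenize, lowercase, filter stopwords, and naively stem plurals.
STOPWORDS = {"the", "of", "to", "a", "and", "for", "in", "is"}

def transform(text):
    tokens = re.findall(r"[a-z]+", text.lower())        # tokenize + lowercase
    tokens = [t for t in tokens if t not in STOPWORDS]  # filter stopwords
    return {t[:-1] if t.endswith("s") else t for t in tokens}  # naive stem

def cluster(docs, min_shared=2):
    """Group documents that share at least `min_shared` transformed
    tokens with the first member of an existing group."""
    groups = []
    for doc_id, text in docs.items():
        tokens = transform(text)
        for group in groups:
            if len(tokens & group["tokens"]) >= min_shared:
                group["ids"].append(doc_id)
                break
        else:
            groups.append({"tokens": tokens, "ids": [doc_id]})
    return groups

docs = {
    "d1": "Transfer of funds to the account",
    "d2": "Funds transfer for account holders",
    "d3": "Certificate of incorporation",
}
groups = cluster(docs)
print([g["ids"] for g in groups])  # [['d1', 'd2'], ['d3']]
```

Documents d1 and d2 land in one cluster because they share the stemmed tokens “fund,” “transfer” and “account,” while d3 forms its own group.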
With this process, you can create groups of documents that contain common tokens (words). For example, similar documents will share a distinctive set of transformed tokens.
After processing, we can execute queries in Solr to get a list of documents that match the search parameters. Without opening a single document, we know what types of documents we have and whether any interest us. We also know the words and phrases needed to find them directly.
With this structure, we can now create spreadsheets in Talend. We’ve got the list of search results, their document IDs, the cluster ID for each document, a field containing the cluster tokens and the first 500 characters of each document, which serve as a preview before downloading. The spreadsheet can also contain a URL that leads to each document so it can be downloaded directly from the platform.
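The resulting spreadsheet is a simple tabular export. A minimal sketch of generating it with Python’s standard `csv` module follows; the row values and URL scheme are invented placeholders, since in the real workflow they come from the search results and the clustering step.

```python
import csv
import io

# Illustrative rows; real ones come from search results and clustering.
rows = [
    {"doc_id": "doc-001", "cluster_id": 7,
     "cluster_tokens": "fund transfer account",
     "preview": "Wire transfer instructions for ..."[:500],
     "url": "https://example.org/platform/doc-001"},  # hypothetical URL
]

buffer = io.StringIO()
writer = csv.DictWriter(
    buffer,
    fieldnames=["doc_id", "cluster_id", "cluster_tokens", "preview", "url"],
)
writer.writeheader()
writer.writerows(rows)
spreadsheet = buffer.getvalue()
print(spreadsheet)
```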
This kind of spreadsheet allows us to explore documents more efficiently because journalists can discard the documents they’re not interested in and easily determine which are relevant.
Once we’ve classified some leaked documents using the above processes, we can use machine learning to automatically classify other documents that we haven’t even opened.
First, we need a structured dataset to train the machine learning model in RapidMiner, which will then learn to classify new files using the existing data and metadata.
After applying the transformations and filters explained in Step 1, the model will take into account all tokens (words) and/or phrases that have to be present in a file to classify it accordingly. So when we add a new file to the system, the model can determine which category to assign the file to.
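To make the idea concrete, here is a toy token-overlap classifier standing in for the trained RapidMiner model. The categories and training texts are invented for illustration; a real model would use a proper algorithm (e.g. Naive Bayes) over far more data, but the shape is the same: learn which tokens characterize each category, then assign new files to the best-scoring one.

```python
import re
from collections import Counter, defaultdict

def tokens(text):
    return re.findall(r"[a-z]+", text.lower())

def train(labeled_docs):
    """Count which tokens appear in each category's training files."""
    model = defaultdict(Counter)
    for category, text in labeled_docs:
        model[category].update(tokens(text))
    return model

def classify(model, text):
    doc = Counter(tokens(text))
    # Score each category by how many of the new file's tokens
    # also appear in that category's training vocabulary.
    scores = {cat: sum(doc[t] for t in doc if t in vocab)
              for cat, vocab in model.items()}
    return max(scores, key=scores.get)

# Invented training set; the real one is the classified leak documents.
model = train([
    ("bank_statement", "account balance statement withdrawal deposit"),
    ("contract", "agreement party obligations signature terms"),
])
print(classify(model, "monthly statement showing account deposits"))
```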
Now the machine will automatically place new files into categories.
But, there’s still work to be done. At ICIJ, we have plans to repurpose existing tools and to refine the machine learning model.
For a long time, email services have used classification models to recognize spam. We want to apply the same technique to the email files in our document sets, flagging irrelevant conversations. That way, we can discard emails and other files that aren’t of public interest, reducing both the number of results our searches return and the number of files we process with other components.
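At its simplest, such a filter is a binary relevance test. The sketch below is a deliberately crude keyword version, with an invented term list, just to show the discard step; the planned approach would use a trained classifier rather than fixed keywords.

```python
# Minimal spam-style filtering sketch: discard emails whose text
# matches none of the investigation's terms. Term list is invented.
RELEVANT_TERMS = {"transfer", "offshore", "shareholder", "trust"}

def is_relevant(email_text):
    words = set(email_text.lower().split())
    return bool(words & RELEVANT_TERMS)

emails = [
    "Lunch on Friday?",
    "Please confirm the offshore transfer to the trust.",
]
kept = [e for e in emails if is_relevant(e)]
print(kept)  # keeps only the second email
```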
Right now, we’ve applied this strategy to a subset of files containing special keywords, but we plan to apply automatic classification to all the files in a leak. We will train the model on a complete, diverse set of files, and eventually the model should be able to teach itself how to classify every new file.
Razzan Nakhlawi contributed to this story.