The rise of the machine has freed ICIJ members globally to pore over millions of documents in a custom-built search engine.
But even this next-level research has posed substantial challenges: for example, what to do when certain phrases return an indigestible 150,000 results? Clearly, the next step to speeding up our research was to intelligently filter information relevant to each investigation.
Here’s how we streamlined the previously daunting process, giving us both unprecedented flexibility and the required search success rate.
Step 1: Wrangle the big data
In leaks like the Paradise Papers, we dealt with millions of documents (including PDFs, photos, and emails) that traditional platforms like Excel can’t process. This is known as Big Data, where huge volumes of often unorganized data need to be tamed into structured sets.
We upload leaked files to a server for indexing using Apache Solr, which is freely available open source software. For documents stored as image files (e.g., a document that was printed, signed and scanned back into a computer as a PDF), we need to use a technique known as optical character recognition (OCR).
ICIJ’s team developed its own OCR tool, called Extract, which recognizes and indexes the text using the power of 30 or more servers. It uses components from existing open source tools like Apache Tika and Tesseract. You can take a look at the tool on our GitHub.
Then we use Talend Studio, data transformation and analysis software based on visual components that you drag around a work area and connect to create a flow.
First, we automate a search and store the results by creating a Talend job that searches for a topic, organization name or individual that we are investigating, such as “Glencore.” The search returns hits and other information from the text of the documents and their metadata, such as the document ID, file name, file root, extension and size in bytes.
The search results are stored in a relational database, a database that recognizes the relationships between items. It isn’t necessary to store the full text of every document, because analyzing it is a slow process. We can add a filter so that only the first few pages or words of each document are analyzed.
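As a rough illustration of the two ideas above (not ICIJ’s actual Talend job), the sketch below builds a search URL using Solr’s standard `q`, `fl`, `rows` and `wt` query parameters, and trims a document to its first few words before analysis. The field names in `RESULT_FIELDS`, the core name in the example URL and both helper functions are assumptions made for this example.

```python
from urllib.parse import urlencode

# Hypothetical field names; a real Solr schema will use its own.
RESULT_FIELDS = ["id", "file_name", "file_root", "extension", "size_bytes"]


def build_solr_query_url(base_url, term, rows=100):
    """Return a Solr /select URL that searches for `term` as an exact phrase."""
    params = {
        "q": f'"{term}"',                 # quotes make Solr match the whole phrase
        "fl": ",".join(RESULT_FIELDS),    # return only the metadata we need
        "rows": rows,                     # maximum number of hits to return
        "wt": "json",                     # ask for a JSON response
    }
    return f"{base_url.rstrip('/')}/select?{urlencode(params)}"


def first_words(text, limit=200):
    """Keep only the first `limit` whitespace-separated words of a document."""
    return " ".join(text.split()[:limit])


if __name__ == "__main__":
    # The host and core name ("leak") are placeholders.
    print(build_solr_query_url("http://localhost:8983/solr/leak", "Glencore", rows=10))
    print(first_words("This loan agreement is made between the parties", limit=3))
```

Limiting both the returned fields and the amount of text keeps each later analysis step small, which matters when a single phrase can match tens of thousands of documents.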
Step 2: Group documents by cluster
Clustering is a technique that allows us to group similar things. Instead of scrolling through a mass of PDFs, it’s helpful to create groups of documents by topic or by the type of document, so the reporter can access similar documents all at once, such as all the PDFs concerning fund transfers.
We use RapidMiner to process the text and metadata of documents and create clusters based on common words and phrases. RapidMiner is a powerful tool that makes it easy to implement data mining algorithms, and it also uses a visual workspace rather than a drab text editor.
Now, we need to process the content of the documents by applying transformations and filters to the document text in RapidMiner. Here is a more detailed workflow:
- Tokenize: This process separates the text into a sequence of individual words or “tokens,” preparing it for further manipulation.
- Filter tokens (optional): In one case, we noticed that the most relevant keywords and phrases to identify a document were written in capital letters, so we added a token filter to only analyze uppercase words longer than three characters.
- Eliminate stop-words: Remove words that are commonplace in daily language and aren’t key to deciphering the document, for example, “a,” “and,” “the,” “be.”
- Stemming: For all the tokens (or words) in the document, we find the root word. For example, fishing, fished and fisher derive from fish. Certain types of stemming do not produce real words, but machine-calculated stems. For instance, the words expression and expressive may be shortened to expres. This technique identifies themes in a document by turning all derivatives of a root word into a single token.
- Clustering: Common words and phrases in a document are grouped together, or clustered, so that the similarities among documents can be found.
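The first four steps of this workflow can be sketched in a few lines of Python. This is a simplified stand-in for the RapidMiner operators, not a reproduction of them: the stop-word list is a tiny sample, and `stem` is a toy suffix stripper rather than a full algorithm such as Porter’s. All function names here are assumptions made for the example.

```python
import re

# A tiny sample list; real stop-word lists contain hundreds of entries.
STOP_WORDS = {"a", "an", "and", "the", "be", "of", "to", "in", "is"}
# Toy suffix rules for illustration only, not the Porter algorithm.
SUFFIXES = ("ing", "ed", "er", "s")


def tokenize(text):
    """Split text into alphabetic word tokens, dropping punctuation and digits."""
    return re.findall(r"[A-Za-z]+", text)


def keep_uppercase(tokens, min_length=4):
    """Optional filter: keep uppercase tokens longer than three characters."""
    return [t for t in tokens if t.isupper() and len(t) >= min_length]


def remove_stop_words(tokens):
    """Drop commonplace words that say little about the document."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]


def stem(token):
    """Lowercase a token and strip one common suffix, keeping at least 3 letters."""
    word = token.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text, uppercase_only=False):
    """Run tokenize -> (optional filter) -> stop-word removal -> stemming."""
    tokens = tokenize(text)
    if uppercase_only:
        tokens = keep_uppercase(tokens)
    tokens = remove_stop_words(tokens)
    return [stem(t) for t in tokens]


if __name__ == "__main__":
    print(preprocess("Fishing and fished the fisher"))
    print(preprocess("LOAN AGREEMENT between the parties", uppercase_only=True))
```

Because every derivative collapses to the same stem, documents that use different forms of a word still end up sharing tokens, which is what the clustering step relies on.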
With this process, you can create groups of documents that contain common tokens (words). For example, this is a set of transformed tokens found in documents with similarities:
- loan agreement
- share purchas agreement (this will then pick up “share purchasing agreement” or “share purchased agreement,” as it is using the root word)
- altern director appoint confirm
- written resolut sole sharehold adopt
- fund agreement plc
- transfer agreement
- board director unanim written resolut herebi adopt
- altern director resign
- privat confidenti
- power attornei
- director appoint confirm
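To show how shared tokens can turn into groups, here is a minimal sketch that clusters documents by the overlap of their token sets (Jaccard similarity) in a single greedy pass. It is an assumption chosen for brevity, not the algorithm RapidMiner applies; the document IDs, the `threshold` value and the function names are hypothetical.

```python
def jaccard(a, b):
    """Share of distinct tokens two documents have in common (0.0 to 1.0)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def cluster_documents(docs, threshold=0.5):
    """Greedy single-pass clustering.

    `docs` maps a document ID to its list of stemmed tokens. Each document
    joins the first cluster whose seed document is similar enough;
    otherwise it starts a new cluster of its own.
    """
    clusters = []  # list of (seed_tokens, [document IDs])
    for doc_id, tokens in docs.items():
        for seed_tokens, members in clusters:
            if jaccard(tokens, seed_tokens) >= threshold:
                members.append(doc_id)
                break
        else:
            # No existing cluster was close enough.
            clusters.append((tokens, [doc_id]))
    return [members for _, members in clusters]


if __name__ == "__main__":
    docs = {
        "d1": ["loan", "agreement"],
        "d2": ["loan", "agreement", "plc"],
        "d3": ["power", "attornei"],
        "d4": ["director", "appoint", "confirm"],
        "d5": ["altern", "director", "appoint", "confirm"],
    }
    for cluster_id, members in enumerate(cluster_documents(docs)):
        print(cluster_id, members)
```

A greedy pass depends on the order of the documents and on the threshold, so a production workflow would more likely use an established clustering algorithm and tune it against documents a reporter has already reviewed.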
After processing, we can execute queries in Solr to get a list of documents that match the search parameters. Without opening a single document, we know what type of documents we have, and if we have one we’re interested in. We also know the words and phrases to find them directly.
With this structure, we can now create spreadsheets in Talend. Each one holds the list of search results, their document IDs, the ID numbers of the document clusters, a field that contains the cluster tokens, and the first 500 characters of each document, which serve as a preview before downloading. The spreadsheet can also contain a URL that leads to each document so it can be downloaded directly from the platform.
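A spreadsheet like this can be approximated with Python’s standard `csv` module. The sketch below is only an illustration of the layout described above, not the Talend job itself; the column names, the shape of the input records and the base URL are assumptions.

```python
import csv
import io

# Hypothetical column names mirroring the fields described in the text.
COLUMNS = ["document_id", "cluster_id", "cluster_tokens", "preview", "url"]


def build_rows(documents, base_url, preview_length=500):
    """Turn document records into spreadsheet rows with a short text preview."""
    rows = []
    for doc in documents:
        rows.append({
            "document_id": doc["id"],
            "cluster_id": doc["cluster_id"],
            "cluster_tokens": " ".join(doc["cluster_tokens"]),
            "preview": doc["text"][:preview_length],        # first 500 characters
            "url": f"{base_url.rstrip('/')}/{doc['id']}",   # direct download link
        })
    return rows


def write_csv(rows, handle):
    """Write the rows, with a header line, to any file-like object."""
    writer = csv.DictWriter(handle, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerows(rows)


if __name__ == "__main__":
    documents = [{
        "id": "doc-1",
        "cluster_id": 3,
        "cluster_tokens": ["loan", "agreement"],
        "text": "THIS LOAN AGREEMENT is made between the parties named below.",
    }]
    buffer = io.StringIO()
    write_csv(build_rows(documents, "https://example.org/docs"), buffer)
    print(buffer.getvalue())
```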
This kind of spreadsheet allows us to explore documents more efficiently because journalists can discard the documents they’re not interested in and easily determine which are relevant.
Step 3: Automatic file classification
Once we’ve classified some leaked documents using the above processes, we can use machine learning to automatically classify other documents that we haven’t even opened.
First, we need a structured dataset to train the machine learning model in RapidMiner, which will then learn to classify new files using the existing data and metadata.
After applying the transformations and filters explained in Step 2, the model takes into account all the tokens (words) and phrases that have to be present in a file to classify it accordingly. So when we add a new file to the system, the model can determine which category to assign the file to.
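The text does not say which learning algorithm the RapidMiner model uses, so the sketch below assumes one common choice for text classification: multinomial naive Bayes with add-one smoothing, written with the standard library only. The class name, the training examples and the category labels in the usage section are hypothetical.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesClassifier:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.token_counts = defaultdict(Counter)  # category -> token -> count
        self.doc_counts = Counter()               # category -> number of documents
        self.vocabulary = set()                   # every token seen in training

    def train(self, labeled_docs):
        """`labeled_docs` is an iterable of (tokens, category) pairs."""
        for tokens, category in labeled_docs:
            self.doc_counts[category] += 1
            self.token_counts[category].update(tokens)
            self.vocabulary.update(tokens)

    def classify(self, tokens):
        """Return the category with the highest log-probability, or None if untrained."""
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocabulary)
        best_category, best_score = None, -math.inf
        for category, doc_count in self.doc_counts.items():
            counts = self.token_counts[category]
            total_tokens = sum(counts.values())
            # Start from the prior: how common this category was in training.
            score = math.log(doc_count / total_docs)
            for token in tokens:
                # Add-one smoothing keeps unseen tokens from zeroing the score.
                score += math.log((counts[token] + 1) / (total_tokens + vocab_size))
            if score > best_score:
                best_category, best_score = category, score
        return best_category


if __name__ == "__main__":
    model = NaiveBayesClassifier()
    model.train([
        (["loan", "agreement"], "LOAN AGREEMENT"),
        (["loan", "agreement", "plc"], "LOAN AGREEMENT"),
        (["power", "attornei"], "POWER OF ATTORNEY"),
        (["power", "attornei", "grant"], "POWER OF ATTORNEY"),
    ])
    print(model.classify(["loan", "agreement", "sign"]))
```

The training pairs would come from documents that have already been grouped and labeled, with tokens produced by the same tokenizing, filtering and stemming steps used for clustering.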
Now the machine will automatically place new files into categories, such as:
- NOMINEE AGREEMENT
- LOAN AGREEMENT
- UNANIMOUS WRITTEN RESOLUTIONS
- SHARE TRANSFER AGREEMENT
- SALE AND PURCHASE AGREEMENT
- REGISTER OF SHAREHOLDER
- REGISTER OF MEMBERS
- POWER OF ATTORNEY
- PLAN OF DISSOLUTION
- OFFICER’S APPOINTMENT CONFIRMATION
- OFFICER’S RESIGNATION NOTARIAL
- MANAGEMENT AGREEMENT
- FUNDING AGREEMENT
- FINANCE AGREEMENT
- DIRECTORS’ APPOINTMENT
But there’s still work to be done. At ICIJ, we have plans to repurpose existing tools and to refine the machine learning model.
Discarding spam and useless files
For a long time, email services have used classification models to recognize spam. We want to apply this technology to the email files in our document set to flag both spam and irrelevant conversations. That way, we can discard those emails and other files that aren’t of public interest, and so reduce the number of results we obtain through our searches and the number of files we process with other components.
Classifying all the files
Right now, we’ve applied this strategy to a subset of files containing special keywords, but we plan to extend automatic classification to all the files in a leak. We will train the machine learning model using a complete set of diverse files, and eventually, the model should be able to teach itself how to classify all new files.
Razzan Nakhlawi contributed to this story.