How do you tackle some of history’s largest data leaks? With ETL, SQL, Neo4j and a bit of Cypher.
Unless you’re a computer programmer, those acronyms and program names likely appear more intimidating than useful. They look that way to many investigative journalists as well. But in a world where the story is hidden in the details, and the details come in the form of data, investigative journalists are trading pounding the pavement for a “digital toolbelt” to make sense of leaks before sharing their findings in an online newsroom.
This is a look at the International Consortium of Investigative Journalists’ favorite data journalism tools: what they are, how they work and where we’ve used them before. We believe in open source, or freely available, technology and use it wherever we can. Our tools range from simple spreadsheets to software designed especially for the kind of “big data” that produced some of our major investigations.
Spreadsheets + Pivot Tables (Google Sheets/ Microsoft Excel)
Improving spreadsheet literacy is a big step towards being more comfortable with data, for journalists and other citizens alike. ICIJ journalists use both Microsoft Excel and Google Sheets on a daily basis. However, Pierre Romera, ICIJ’s Chief Technology Officer, says only data sets slated for publication should go on Google Sheets, because of Google’s lower security bar. Sensitive data needs to be kept in-house. “When we need to dig into a data set that we don’t want to release, that’s when we build on our servers,” he says.
Increased comfort with spreadsheets creates big returns in the form of pivot tables, which allow journalists to pull information from a huge data set in very little time. Pivot tables aggregate a spreadsheet by a particular value to help summarize data. They can also be used to identify inconsistencies (for example, by counting how many instances of a value a set contains), and to find patterns in data sets.
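For readers curious about the mechanics, here is a minimal Python sketch of the core operation a pivot table performs: grouping rows by one column and counting them. The leak-style records are invented for illustration.

```python
from collections import defaultdict

# Invented records standing in for rows of a spreadsheet.
rows = [
    {"jurisdiction": "Cayman Islands", "company": "A Ltd"},
    {"jurisdiction": "Panama", "company": "B Corp"},
    {"jurisdiction": "Cayman Islands", "company": "C Ltd"},
]

# Count how many companies each jurisdiction contains, the same
# summary a pivot table with a COUNT aggregation would produce.
counts = defaultdict(int)
for row in rows:
    counts[row["jurisdiction"]] += 1

print(dict(counts))  # {'Cayman Islands': 2, 'Panama': 1}
```

The same grouping, applied to thousands of rows, is what lets a pivot table surface inconsistencies (a jurisdiction that appears once where it should appear dozens of times, say) in seconds.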
Datawrapper
Not a developer? This is for you. Miguel Fiandor, one of ICIJ’s web and data application developers, highlighted Datawrapper as a data visualization tool that doesn’t require code.
“Data is not everything, but it can be the foundation to understanding the highly complex issues of our time,” Datawrapper co-founder Mirko Lorenz said.
Datawrapper, an open source tool for anyone who wants to create a chart or map from their data, came out of the knowledge that expecting every journalist to know code is unrealistic.
“The idea is that you have a two-minute interaction with the tool,” Lorenz said.
This interaction consists of importing data from a spreadsheet or an external link, selecting a headline and hitting “publish.” Users emerge with a chart, anything from a bar graph to a density plot, ready to be embedded in a story.
With individual and newsroom subscriptions, Datawrapper is geared towards people who are not equipped with the data capabilities of a large news outlet and only charges its users if their graphs have more than 10,000 views each month.
Jupyter Notebook
A more recent addition to Fiandor’s “digital toolbelt” is Jupyter Notebook, a web application that works a little like a souped-up Google Drive. Notebook users can share and publish documents containing live code (especially Python – more on Python below), visualizations, images, videos and other content.
“You can follow step by step in the notebook what you’ve been doing with the data,” Fiandor said, making the tool a good one for presenting data for analysis, especially while collaborating.
Jupyter Notebook is open source: it is freely available, and users may study or change the code as needed.
OpenRefine
“I use [OpenRefine] like most other people use Excel,” Romera says. OpenRefine (previously Google Refine), another open source tool, is very similar to Microsoft Excel and does not require coding skills to import and manipulate datasets big and small.
Raw data often needs “cleaning” before journalists can get information from it. Clean data is written in a single, consistent format and has no duplicate records, meaning no rows that appear more than once. Clean data is also properly filtered and partitioned. So a data set listing domestic addresses and their occupants might be divided by state, P.O. boxes separate from on-site mailboxes, with no lurking corporate addresses.
OpenRefine helps the data team merge copied sets of data points, check for inconsistencies or mistakes and otherwise make sure that a “raw” dataset is ready for analysis. And with just a few lines of code (see more on coding below), OpenRefine can compile data in a matter of moments – for example, adding population growth statistics, GDP, and demographic breakdowns to a data set that contains a list of countries.
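OpenRefine does this work through its interface, but the underlying steps can be sketched in a few lines of Python. The raw records and the enrichment table below are invented for illustration.

```python
import csv
import io

# Hypothetical raw export: inconsistent casing and a duplicated row.
raw = """country,company
france,Alpha SA
FRANCE,Alpha SA
Panama,Beta Inc
"""

# Invented enrichment table keyed by country name.
population = {"France": 68_000_000, "Panama": 4_400_000}

cleaned, seen = [], set()
for row in csv.DictReader(io.StringIO(raw)):
    row["country"] = row["country"].strip().title()   # one consistent format
    key = (row["country"], row["company"])
    if key in seen:                                   # drop duplicate records
        continue
    seen.add(key)
    row["population"] = population.get(row["country"])  # enrich the data set
    cleaned.append(row)

print(cleaned)
```

Normalize the format first, then deduplicate, then enrich: the same order of operations the data team follows whatever the tool.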
Programming languages: Python and R
And what if journalists are ready to turbo-charge their data capability?
“Every tool you have gets more powerful when you can improve [on] it,” Romera said, and the way to do that is by learning to code.
Coding languages Python and R get the most mileage in data analytics.
Python is the best choice for machine learning, data extraction and database building – all essential parts of ICIJ’s data operations – and many of the more complex tools on this list are either built in Python or require it to run. ICIJ data expert Rigoberto Carvajal emphasized R as a powerful tool for everything from “scraping,” or collecting, data, to cleaning and processing to data analysis and visualization.
Aspiring coders can consult Codecademy to begin getting familiar with R, Python, or any number of coding languages.
Talend Studio
Data expert Carvajal also singled out the open-source Talend Studio for his work with “big data.”
“[Talend] saves me hours or days of programming for several data tasks,” Carvajal says. The software lets him build ETL, or extract, transform, and load, processes. ETL consists of “reading” data from its original location and converting it into one desired format before rewriting it into a new database, all according to predetermined rules. Talend speeds up the ETL process by letting Carvajal build programs for these rules, when he would otherwise be writing those rules “by hand.”
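A toy version of that extract-transform-load flow, offered as an illustration rather than anything Talend generates, might look like this in Python, using the standard library’s csv and sqlite3 modules:

```python
import csv
import io
import sqlite3

# Extract: "read" the data from its original (CSV) format.
source = io.StringIO("name,incorporated\nA Ltd,1999\nB Corp,2004\n")
rows = list(csv.DictReader(source))

# Transform: apply a predetermined rule, here casting years to integers.
for row in rows:
    row["incorporated"] = int(row["incorporated"])

# Load: rewrite the converted rows into a new database.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE companies (name TEXT, incorporated INTEGER)")
db.executemany("INSERT INTO companies VALUES (:name, :incorporated)", rows)
```

Tools like Talend generate and orchestrate pipelines of this shape visually, so each rule does not have to be written by hand.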
SQL
SQL, pronounced “sequel,” stands for Structured Query Language: Carvajal’s preferred tool for interacting with databases. Its capabilities include retrieving information from a database, updating it and creating stored “procedures.” Using SQL, a user can filter a data set for a particular piece of information, find patterns in data, or combine these functions to explore “relational” databases: collections of tables connected by shared values.
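For example, here is a minimal, invented query that filters one table and joins it to a related one through a shared key, run through Python’s built-in sqlite3 module. The table and column names are illustrative.

```python
import sqlite3

# A tiny relational database: two tables linked by officer_id.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE officers (id INTEGER, name TEXT);
CREATE TABLE companies (id INTEGER, name TEXT, officer_id INTEGER);
INSERT INTO officers VALUES (1, 'J. Doe');
INSERT INTO companies VALUES (10, 'A Ltd', 1), (11, 'B Corp', 1);
""")

# Filter for one officer and join across the shared key.
query = """
SELECT officers.name, companies.name
FROM companies
JOIN officers ON officers.id = companies.officer_id
WHERE officers.name = 'J. Doe'
"""
results = db.execute(query).fetchall()
print(sorted(results))
```

One query answers a question (“which companies is this person attached to?”) that would take many manual lookups across separate spreadsheets.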
Pandas
Pandas is an open-source tool that uses Python to work with large data sets by indexing, tagging or otherwise manipulating data to make it easier to analyze. Some of its functions include removing empty cells and duplicated data points and “labeling,” or describing, data with a desired descriptor: for example, determining whether a document is written in Russian or French. Pandas also combines smaller data sets into larger ones to make extracting information from data that much simpler.
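A short sketch of those operations, with invented column names (this requires the third-party pandas package):

```python
import pandas as pd

# Invented records with a duplicated row and a missing value.
docs = pd.DataFrame({
    "doc_id": [1, 1, 2, 3],
    "country": ["Panama", "Panama", "Malta", None],
})

docs = docs.drop_duplicates()           # remove duplicated data points
docs = docs.dropna(subset=["country"])  # drop rows with empty cells

# Combine with a second, smaller data set on a shared column.
regions = pd.DataFrame({"country": ["Panama", "Malta"],
                        "region": ["Americas", "Europe"]})
merged = docs.merge(regions, on="country")
print(merged)
```

Each step here mirrors a cleaning task described above, but scripted, so it can be rerun identically on the next leak.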
Neo4j + Linkurious
ICIJ’s investigations often expose previously unknown links between entities or characters in a data set that seem unrelated at first glance. And had we been investigating a leak like the Panama Papers “by hand,” many relationships between different individuals and shell companies would have gone unnoticed.
Tools like Neo4j ensure journalists can have the full picture of a given piece of data — not just the pieces of data to which it’s immediately connected but also second- and third-degree relationships, as well as the different ways in which two pieces of data could be linked.
Take U.S. Secretary of Commerce Wilbur Ross, a stakeholder in 11 shell companies that all listed their headquarters at the same address in the Cayman Islands. Neo4j allows journalists to connect that data point to Ross himself via his companies and, ultimately, to a third data point, Ross’ office address. That relationship between the two addresses certainly isn’t an easily visible one, nor is the connection, through Ross, between the different shell companies to their single Cayman Islands location.
To create what visitors to our website see when they search the Offshore Leaks Database, ICIJ used Linkurious to visualize the Neo4j graph database and make it searchable by labeling and categorizing nodes (data points) and edges (relationships between points) and indexing those labels for the search engine. When harnessing the database for research, journalists literate in the Neo4j language Cypher can search for certain relationships by using “patterns” to query varying sets of results.
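A Cypher “pattern” query of the kind described might look like the sketch below. The node labels, relationship types and property names are illustrative assumptions, not the actual Offshore Leaks schema.

```cypher
// Hypothetical pattern: a person, the companies they are an officer
// of, and where those companies are registered.
MATCH (person:Officer {name: "Wilbur Ross"})
      -[:OFFICER_OF]->(company:Entity)
      -[:REGISTERED_AT]->(addr:Address)
RETURN company.name, addr.full_address
```

A single pattern like this walks the second- and third-degree relationships described above, from a person through their companies to a shared address, in one query.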
Hopefully, this piece has demystified some of the tools in ICIJ’s data operation, but there’s no denying that talking about these tools and their operations involves a lot of acronyms and jargon. Below is a glossary defining some common processes and terms in our digital toolkit, as well as a few other useful words to know for the curious reader.
Glossary of Terms
OCR: Optical Character Recognition. OCR converts images of text (scans, photographs or PDFs) into machine-encoded text, making it useful for digitizing print copies of anything.
Unicode: a computing standard that assigns a unique number, or code point, to every character across the world’s writing systems, thereby making text universally representable and exchangeable.
Node: (in graph databases) a vertex, or the point at which lines meet, in a graph of points. Nodes can be categorized or linked with “edges,” or relationships. Most graphs represent nodes with a circle.
Edge: (in graph databases) another name for a relationship between two nodes in a graph database. Graphs represent these relationships as lines between the nodes. If, for example, you had a business person as one node, and a company as a second node, the edge would describe and visualize the relationship between the two – such as “shareholder” or “director” or “owner.”
ETL: Extract, Transform and Load is the process of reading data in its original format, sorting and cleaning it by removing redundant information, and then converting it into a particular format or consolidating it before re-writing it into a target database.
Fuzzy searching: also known as approximate string matching, fuzzy searching is the process of searching terms in a database while allowing for misspellings, cross-language similarities and other near matches. It ensures that a search turns up the most complete set of results possible.
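As an illustration, Python’s standard difflib module offers a simple form of approximate matching; the names below are invented.

```python
import difflib

# An invented name index to search against.
index = ["Jonathan Smith", "Ioannis Smyrnios", "Johann Schmidt"]

# Find close matches for a misspelled query despite the typo.
matches = difflib.get_close_matches("Jonathon Smith", index, n=3, cutoff=0.8)
print(matches)  # ['Jonathan Smith']
```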
Cypher: the query language of Neo4j, which allows users to describe nodes, edges and their properties as well as the actions they want done on them — for example, searching or finding a pattern in a series of nodes.
Faceting: arranges search results into subsets, categorized by date ranges, number ranges or keywords, and counts the number of results that appear for a particular query.
Scraping: the process of extracting data from websites, documents, images and more. It focuses on the transformation of unstructured data (for example, a website’s HTML) into structured data (databases, spreadsheets, etc.).
BSD: Berkeley Software Distribution, a “family” of permissive free software licenses.
API: Application Programming Interface. An API is a set of tools and documentation that a developer offers so that other developers can interact with their system.