Investigative reporters have traditionally been very protective of their work. Yet, the old guard has found a strong and increasingly necessary ally in a group outside the bounds of traditional journalism – the open data community.
As the documents available to reporters have grown in both volume and complexity, newsrooms have become increasingly open to asking ordinary citizens for help with rescuing, transcribing, and digitizing everything from scanned images and handwritten PDFs to documents dumped in a lake. And the trend is spreading globally.
In 2009, the Guardian asked its readers to go through over 200,000 expense reports from British representatives. The crowdsourcing initiative revealed one of the country’s biggest political scandals, unveiling a series of questionable spending by elected officials. Three years later, ProPublica launched a similar initiative to collect political ad spending data from filings produced by the Federal Communications Commission. The “MP Expense” and “Free the Files” projects remain key references for newsrooms experimenting with crowdsourcing.
Technology has opened exciting opportunities for journalists willing to dig into data, but it also brings a new challenge: scale. When documents are produced as PDFs, when they are handwritten and not structured in a way that allows a computer to analyze them, reporters have no choice but to use the old method of entering the data manually.
Last May, the Argentinean newspaper La Nación completed the digitization of two years of expenses of the National Senate in two months with the help of the broader community.
“The key for these projects is the relationship the newspaper has with the community,” said Gabriela Rodriguez, a programmer and former Knight-Mozilla fellow at La Nación, who worked on the project.
La Nación invited transparency activists and university professors to work on the files before opening the project to the public. Around 500 people participated overall. To rally the community, the newspaper tapped into the patriotism of its readership with a hashtag: #semanademayo.
“[Semana de Mayo] was a historical day in Argentina in 1810. It is famous for the phrase ‘El Pueblo Quiere Saber,’ [which means] ‘the people want to know,’” explained Florencia Coelho, Digital Research and Training Manager at La Nación. “We asked them: What can you do today for your country?” Within a week, the newspaper had enough participants to finish on National Day.
Getting citizens to participate is just the beginning. Outsourcing data transcription also comes with legal concerns.
In France, the nonprofit organization Regards Citoyens recently completed the transcription of elected officials’ declarations of interest, which were disclosed for the first time thanks to a new law passed last year. The handwritten documents, in addition to sometimes being hard to read, had been filled out differently by each Member of Parliament.
“It all went quite fast, we knew they wouldn’t have time to come up with online forms,” said Benjamin Ooghe, who volunteers for the organization. “There weren’t a lot of guidelines on the form itself.”
One of the main concerns of the organization was to make sure that the transcribed data matched exactly what was written on paper. “We asked several people to transcribe the same documents and we didn’t stop until three people found exactly the same thing.”
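The rule described in that quote, accepting a transcription only once three people have entered exactly the same thing, can be sketched in a few lines of Python. This is an illustration only, not the code Regards Citoyens used: the function names, the whitespace normalization, and the sample entries are all made up for the example.

```python
from collections import Counter
from typing import Iterable, Optional

# From the quote: stop once three people have entered the same text.
REQUIRED_MATCHES = 3


def normalize(text: str) -> str:
    """Collapse runs of whitespace (an assumption of this sketch, so that
    stray spaces do not prevent otherwise identical entries from matching)."""
    return " ".join(text.split())


def consensus(transcriptions: Iterable[str],
              required: int = REQUIRED_MATCHES) -> Optional[str]:
    """Return the first entry submitted identically `required` times,
    or None if no entry has reached that count yet."""
    counts: Counter = Counter()
    for entry in transcriptions:
        key = normalize(entry)
        counts[key] += 1
        if counts[key] >= required:
            return key
    return None


# Hypothetical volunteer entries for a single field of a single form.
entries = [
    "Conseil municipal, 500 EUR",
    "Conseil municipale, 500 EUR",   # a typo, so it does not count as a match
    "Conseil municipal, 500 EUR",
    "Conseil municipal,  500 EUR",   # extra space, removed by normalize()
]
print(consensus(entries))
```

As long as `consensus` returns `None`, the document would simply go back into the queue for another volunteer.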
In Germany, the publication of the documents itself has been contested. In late 2012, the WAZ media group launched a platform to transcribe briefings of the German Parliament on the country’s involvement in Afghanistan. The documents, gathered by investigative reporter David Schraven from several sources, were scanned PDFs of poor quality.
“We quickly decided to publish everything,” explained Schraven. “We knew there would be no problem at all for our sources and the soldiers in Afghanistan, there were no military secrets in these papers.”
Yet the German Department of Defense sued the organization on the basis of property rights and asked for the documents to be removed from the Internet. The trial is still ongoing.
More recently, Ukrainian reporters rescued tens of thousands of documents sinking in the reservoir of former President Viktor Yanukovych’s estate. Not knowing how much longer they would be able to access the precious receipts showing the extent of Yanukovych’s corruption, reporters from several media organizations decided to collaborate to dry, sort and digitize the documents. Ukrainian citizens are now being asked to help flag companies, names and places in the documents uploaded to the YanukovychLeaks platform.
Are these crowdsourcing experiments likely to be repeated? Journalists at La Nación seem to think so. The VozData platform was developed so that it could be used for any project that requires liberating data from PDFs. “For us this is just the starting point,” Coelho said. “We want to create a group where we invite users to participate. It’s like reaching a new level of Candy Crush transparency.”
A group of students at Columbia Journalism School is also working on a tool, InfoScribe, which would enable media organizations to find citizens willing to help them unlock PDFs.
“Not all media organizations have the resources to build this or a readership big enough to participate,” said cofounder Madeline Ross. “And the number of documents used as sources is not going to be shrinking.”