DIGITAL TOOLS

Web scraping: How to harvest data for untold stories

The thing about investigative reporting is, it’s hard work.

Sure, we’ve got more data available now. But data presents its own challenges: You’re tackling a massive pile of information, looking for the few best bits.

A technique called web scraping can help: it uses a piece of code or a program to extract information from a website that otherwise is not easily downloadable.

Web scraping gives you access to information living on the internet. If you can view it on a website, you can harvest it. And since you can collect it, you might as well automate that process for large datasets — at least if the website’s terms and conditions don’t say otherwise.

And it really helps. “You might go to an agency’s website to get some data you’re interested in, but the way they’ve got their web app set up you’ve got to click through 3,000 pages to get all of the information,” said Investigative Reporters and Editors training director Cody Winchester.

What’s the solution? Web scraping. You can write a script in a programming language (Python is one) that automatically flicks through all of the pages and funnels the desired information into a spreadsheet. Or you could bypass coding completely and use an application such as Outwit Hub, a point-and-click tool that recognizes online elements, then downloads and organizes them into datasets.
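For readers curious what such a script might look like, here is a minimal Python sketch using only the standard library. It is an illustration under stated assumptions, not any newsroom's actual code: the names `TableRowParser`, `scrape_pages` and `fetch_page` are hypothetical, and it assumes each page holds its records in an ordinary HTML table.

```python
import csv
import io
import time
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:
                self.rows.append(self._row)
            self._row = None


def scrape_pages(fetch_page, last_page, out_file, delay_seconds=0.0):
    """Fetch pages 1..last_page, write every table row as CSV, return the row count.

    fetch_page is a hypothetical callback: it takes a page number and returns
    that page's HTML as a string (in real use it would download the page).
    """
    writer = csv.writer(out_file)
    total = 0
    for page_number in range(1, last_page + 1):
        parser = TableRowParser()
        parser.feed(fetch_page(page_number))
        writer.writerows(parser.rows)
        total += len(parser.rows)
        time.sleep(delay_seconds)  # pause between pages so the site isn't hammered
    return total


# Stand-in pages so the sketch runs without touching a real website.
FAKE_PAGES = {
    1: "<table><tr><td>Ann</td><td>10</td></tr></table>",
    2: "<table><tr><td>Bob</td><td>20</td></tr></table>",
}

spreadsheet = io.StringIO()
row_count = scrape_pages(lambda n: FAKE_PAGES[n], last_page=2, out_file=spreadsheet)
```

In real use, `fetch_page` would download each page (for example with `urllib.request.urlopen`) and `out_file` would be a file opened for writing, producing a CSV that any spreadsheet program can open.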

Why do it?

Web scraping gives reporters the ability to create their own datasets with scraped information, opening the possibility of discovering new stories — a priority for investigative journalists.

Jodi Upton, the Knight Chair of Data and Explanatory Journalism at Syracuse University, began her career doing old-school “scraping.” Before online databases were widely used, when she only had access to paper records, she created her own databases manually. For Upton’s work, it was a necessity.


“When you’re trying to do news stories or investigative projects that require really interesting data, often it means you are creating a database yourself,” Upton said. Now it’s a lot easier, though the raw product, data itself, isn’t always easy to get your hands on.

There isn’t much incentive for organizations to disclose important data unless required to by law. Even then, the government does a poor job of data maintenance.

“We do have some data from the government, but we know that it is so inaccurately kept that there are some really good stories in finding out just how wrong they are,” Upton said.

Working on USA Today’s Mass Killings project, an investigation into Federal Bureau of Investigation mass homicide data, Upton and the rest of the data team scoured FBI data for mass homicides. The data was so poorly kept that the team had to hand-check and verify every incident itself. They found many more incidents the FBI had failed to log.

Upton said she was concerned. “This is our premier crime fighting agency in the U.S. and when it comes to mass killings, they’re right around only 57 percent of the time.”

Sometimes the government will simply refuse to hand over datasets.

IRE’s Winchester described his attempt to get a database from a South Dakota government lobbyist, who argued that putting data up on a webpage was transparent enough:

“I put in a records request to get the data in the database that was powering their web app, and they successfully argued, ‘We’re already making the information available, we don’t have to do anything special to give it to you as data’.”

Aside from structured data, which is organized to make it more accessible, some stories are born from journalists giving structure to unstructured information. In 2013, Reuters investigated a marketplace for adopted children, in which parents or guardians who had taken children in were offering them to strangers on Yahoo message boards.

The investigative team scraped the message boards and found 261 children on offer. The team was then able to organize the children by gender, age, nationality and by their situations, such as having special needs or a history of abuse.

“That is not a dataset that a government agency produces. That is not a dataset that is easy to obtain in any way. It was just scraping effectively; a social media scraping,” Upton said.

How could you use web scraping?

Samantha Sunne, a freelance data and investigative reporter, created a whole tutorial for those without coding experience. “When I’m investigating stories as a reporter, I don’t actually write code that often,” Sunne said.

Instead, she uses Google Sheets to scrape tables and lists off a single page, using a simple formula within the program. The formula imports a few HTML elements into Google Sheets and is easy enough for anyone with basic HTML knowledge to follow.

You can read her entire tutorial here.

“I’ve used it for court documents at a local courthouse, I use it for job postings for a newsletter I write about journalism happenings,” Sunne said.

“It’s a spreadsheet that automatically updates from like 30 different job boards. It makes the most sense for things that continually update like that.”

How does ICIJ use web scraping? (This is for our more technically savvy readers!)

ICIJ developer Miguel Fiandor handles data harvesting on a much grander scale, trawling hundreds of thousands of financial documents.

Fiandor’s process begins by opening DevTools, the developer tools built into Google’s Chrome browser. It’s a mode that allows the user to see the inner workings of a website and play around with its code.

Then he uses the ‘Network’ tab in the Developer Tools window to find the exact request he needs. (A request is how a browser retrieves a webpage’s files from the website’s servers.)

He studies the communication between the website and his browser and isolates the requests he wants to target. Fiandor tests those requests with cURL, a command-line tool (available on Linux, macOS and Windows) that he can run from his computer terminal. This bypasses the need for a browser.
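The same idea, rebuilding a request seen in the Network tab so it can be sent without a browser, can be sketched in Python with the standard library. This is an illustration only, not Fiandor's code: the URL, form fields and header values below are made up, and `build_search_request` is a hypothetical helper.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_search_request(url, form_fields, headers):
    """Recreate a form submission (a POST request) copied from the Network tab."""
    body = urlencode(form_fields).encode("utf-8")
    return Request(url, data=body, headers=headers, method="POST")


# Hypothetical values: in practice these are copied from the request
# that the browser's Network tab shows for the page you care about.
request = build_search_request(
    "https://registry.example.com/search",
    {"name": "acme", "page": 1},
    {"User-Agent": "newsroom-research-script", "Accept": "text/html"},
)


def send(req, timeout=30):
    """Send the request and return the response body as text."""
    with urlopen(req, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Nothing is sent until `send(request)` is called, so the request can be inspected first and compared against what the browser sent.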

Next, Fiandor uses Beautiful Soup, a Python library that has to be installed separately before a script can use it.

Code for scraping a corporate registry used in the Paradise Papers.

Beautiful Soup allows the user to parse HTML, or separate it into useful elements. After the request, he’ll save the data onto his computer, then route those elements into a spreadsheet and run his script.
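To give a feel for the parsing step, here is a small sketch that uses Beautiful Soup (installed with `pip install beautifulsoup4`) to pull the rows out of an HTML table and turn them into spreadsheet-ready CSV. It is not the registry scraper shown above: the sample HTML and company names are invented, and `table_to_rows` and `rows_to_csv` are hypothetical helper names.

```python
import csv
import io

from bs4 import BeautifulSoup

# Invented sample standing in for a page saved from a website.
SAMPLE_HTML = """
<table id="companies">
  <tr><th>Name</th><th>Jurisdiction</th></tr>
  <tr><td>Example Holdings Ltd</td><td>Bermuda</td></tr>
  <tr><td>Sample Trust Co</td><td>Isle of Man</td></tr>
</table>
"""


def table_to_rows(html):
    """Parse HTML and return each table row as a list of cell texts."""
    soup = BeautifulSoup(html, "html.parser")  # Python's built-in parser
    rows = []
    for tr in soup.find_all("tr"):
        cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        if cells:
            rows.append(cells)
    return rows


def rows_to_csv(rows):
    """Serialize rows as CSV text that a spreadsheet program can open."""
    buffer = io.StringIO()
    csv.writer(buffer).writerows(rows)
    return buffer.getvalue()


rows = table_to_rows(SAMPLE_HTML)
csv_text = rows_to_csv(rows)
```

Writing `csv_text` to a file ending in `.csv` gives a file that opens directly in Excel or Google Sheets.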

Simple enough, right?