DATA JOURNALISM

Why journalists need an archiving system

Taking good care of data requires time and money, but the loss of irreplaceable work can come at a higher cost. Here are a few basic tips for journalists to improve personal archiving practices.

This post, by Talya Cooper, has been republished with kind permission from the Global Investigative Journalism Network.

As an archivist, I love the clichéd scenes in mystery movies when a grizzled journalist pulls a dusty banker’s box out of a teetering stack of files in their garage, filled with the exact right pieces of evidence — miraculously free of bugs and mildew — to bring a wrongdoer to justice. To me, this is as rich a fantasy as the films’ happy endings.

I spent several years working alongside journalists in a newsroom, where I witnessed some unholy messes of files on both physical and virtual desktops. The reporting process enveloped my colleagues. Usually, even before one story was published, they had begun generating new files for the next piece. This led to an occasional scramble when they had to recover older notes, either to revisit a subject or to answer questions about the conclusions they had drawn in their reporting.

Taking good care of data requires some time and money, but the loss of irreplaceable reporting work can come at a higher cost. It may be difficult or impossible to return to interviewees, recover government documents that appear on and disappear from official websites, or just find a detail that might have been in an audio recording, or maybe it was in an email conversation, or was it a text message?

Journalists have three key considerations to balance as they plan to preserve investigative materials. In addition to the general preservation issues that are my area of expertise, they also have to consider legal as well as digital security concerns.

Lynn Oberlander, a media attorney who has served as in-house counsel for publications including The New Yorker and my former employer The Intercept, and who currently works at the law firm Ballard Spahr, advises that “as a basic tenet, legal liability should not govern how journalists do their work.” Consequently, if keeping notes or recordings might benefit a reporter in the future, she recommends that they preserve those materials.

But at the same time, she notes the importance of weighing the risks of archiving information for highly sensitive stories. “If you are working with confidential sources and you don’t anticipate a need to go back to the records and you haven’t explicitly been told that you’re going to be sued, it might be worthwhile to have a practice of getting rid of your source material,” Oberlander explains. She further notes that to minimize the risk to sources in such cases, it’s vital to follow proper security protocols, which may incorporate using non-internet connected computers, encrypted messaging tools and hard drives, and other similar measures.

She has another general principle for reporters to bear in mind: “Whatever they do, they should generally be consistent. If they have all of their notes for the last 20 years but the one story that may be problematic, those are the ones they’ve deleted — that could be a problem.”

Consistency is a key principle for digital preservation as well. Archivists don’t have a precise formula for how to label folders or how often to back up your files. Rather, we advise creating a system that makes sense to you, applying a few basic principles I’ll describe, and then sticking to them.

And although digital security is somewhat outside the scope of a brief article — it’s an entire ecology of practices, beginning from the moment you create materials or communicate with sources, and depending on numerous factors ranging from the topics you report on to specific laws in the country where you work — consistency is vital to keeping those materials safe as well. Omitting any step of a recommended security protocol can open up a valuable document or a confidential source to risk.

Emilia Díaz-Struck, research editor and Latin America coordinator at the International Consortium of Investigative Journalists, also emphasizes consistency, for both security purposes and to alleviate headaches later on. Her top tip for reporters beginning investigative projects is simply: “Plan ahead.” She points out that starting with a clear set of decisions about where to store investigative material as you accumulate it is much simpler than trying to make sense of a downloads folder full of data in the middle of a project.

Here are a few basic tips for journalists hoping to improve their personal archiving practices. These steps relate primarily to stories with only low to medium levels of security. National security-level reporting requires rigorous adherence to protocols that might include storing all materials on air-gapped devices (computers that have had their wireless card removed and have never been connected to the internet), destruction of certain materials, and even physical security measures.

1. Store related files in a folder and give the folders names that make sense

Establish a consistent naming convention for your projects. A date, the name of the publication if you write for different outlets, and some descriptive text are enough: “202107_GIJN_digitalpreservation” is an example of a simple and comprehensible folder naming convention. Create a folder following that convention for each project, and save relevant files in that folder. Although searching the contents of your hard drive is one way to locate data, you can save yourself time with this basic organizational method. It’s especially important to store media like screenshots, photos, audio, and video (which you cannot search in plain text and which often have generic file names created by your computer, camera, or audio recorder) in clearly labeled folders.
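The convention above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a prescribed tool: the function names project_folder_name and create_project_folder are hypothetical, and stripping everything except letters and digits is one possible choice for keeping names portable across systems.

```python
import re
from datetime import date
from pathlib import Path


def project_folder_name(when: date, outlet: str, description: str) -> str:
    """Build a name like 202107_GIJN_digitalpreservation."""
    # Assumption: keep only letters and digits so the name works on any file system.
    clean_outlet = re.sub(r"[^A-Za-z0-9]", "", outlet)
    clean_description = re.sub(r"[^A-Za-z0-9]", "", description).lower()
    return f"{when:%Y%m}_{clean_outlet}_{clean_description}"


def create_project_folder(root: Path, when: date, outlet: str, description: str) -> Path:
    """Create the project folder under root; do nothing if it already exists."""
    folder = root / project_folder_name(when, outlet, description)
    folder.mkdir(parents=True, exist_ok=True)
    return folder
```

Calling project_folder_name(date(2021, 7, 15), "GIJN", "digital preservation") yields the example name used above. The exact pattern matters less than applying the same one to every project.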

It may help to rename files, following your convention, to easily identify important data you plan to use or cite in your piece. If you write notes or have paper files relating to a piece, label the notebook or the folders where you store these items with the same name you used for the digital folder, so you can correlate them easily. I also recommend keeping a record of any agreements with the publication that is printing the story along with the related files; different organizations may have varying rules about ownership of materials and their reuse.

Díaz-Struck of ICIJ specifically recommends creating a distinct folder of the data and documents that back up claims you make in your reporting. If, at a later date, you decide to delete some of the files you have accumulated, keeping a clear record of the materials you cite or allude to will protect you in case anyone challenges your findings, and can help refresh your memory as to how you reached your conclusions. As her team works with data at ICIJ, they make sure to label the original version of the data as well as the final version of the data from which they report; they also document their methodology and any transformations they make to data, and archive this process document alongside the data itself.
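One lightweight way to keep such a process document is an append-only log stored next to the data. The sketch below is an assumption about how that could look, not a description of ICIJ's actual workflow: the function name log_transformation, the field names, and the JSON Lines format are all hypothetical choices made for the example.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def log_transformation(log_path: Path, source_file: str, output_file: str, description: str) -> dict:
    """Append one transformation record to a JSON Lines process log."""
    entry = {
        "recorded_at": datetime.now(timezone.utc).isoformat(),
        "source_file": source_file,    # the version you started from
        "output_file": output_file,    # the version this step produced
        "description": description,    # what you did, in plain language
    }
    # Append rather than overwrite, so earlier steps are never lost.
    with log_path.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(entry) + "\n")
    return entry
```

Each call adds one line to the log, so the file reads as a chronological account of how the original data became the final version you reported from.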

2. Back up your files, and then back them up again

One of the most important and catchiest acronyms in archives work is LOCKSS: “Lots Of Copies Keeps Stuff Safe.” Because digital files are so easily replicable, it’s simple to generate multiple copies to help you in the case of data loss. Crucially, save at least one copy in a different physical location. For instance, backing your computer’s files up to a portable hard drive is a great first step. But if you choose to keep both the computer and backup drive on your desk and the ceiling over your desk springs a leak, you will lose the data — unless you have also backed up your files to the cloud. Conversely, if an upload to your cloud backup doesn’t work properly, you might lose data unless you’ve saved it to a hard drive.

  • Back up files to an external hard drive. Programs like Apple’s Time Machine and Windows’ Backup make this process easy to automate; alternatively, create a calendar reminder for yourself to copy key folders on a regular basis. When the drive is full, label it clearly, perhaps with the date range of the files you’re archiving, and store it in a cool, dry place. If you have a budget for your archive, you might opt for a RAID array, which is essentially a set of linked hard drives. Most RAID configurations store data redundantly; that is to say, if one drive in the array fails, the data can still be recovered from the others. If your project has any level of sensitivity, encrypt your drive with a secure passphrase.
  • For non-sensitive files, consider backing up your files to a cloud service. The service you select may vary depending on your location, specific needs, and budget. For instance, if you work with video and audio files, you might prioritize a service that offers a lot of storage cheaply, like Backblaze. Alternatively, you may prefer a more secure backup service with servers outside the country where you live, like Sync or Tresorit, which encrypt files both in transit and when they are being stored.
  • If you prefer not to use a cloud backup service (a prudent step when dealing with sensitive data or when reporting on national security or intelligence topics), back up the hard drive to another hard drive and store it in a physically different place from where you live or work. Keep a copy at home and a copy at the office, or to be extra secure, mail one to a friend in a different city or country so that in the event of a problem, whether a hurricane or a police raid, your data will not be destroyed. For sensitive files, you can use a tool like OnionShare, which transfers files over Tor onion services, to send them to a colleague whom you trust to store them on an encrypted machine, following proper security protocols. Always consult with a digital security expert prior to taking any of these steps. While LOCKSS might be a sound archival principle, sometimes proliferating copies of certain sensitive materials can heighten the associated risks.
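If you copy key folders yourself rather than relying on backup software, it is worth confirming that the copy matches the original. The following Python sketch shows one way to do that by comparing SHA-256 checksums. The function names (sha256_of, find_mismatches, back_up_folder) are hypothetical, and the script assumes an unencrypted, non-sensitive folder and a destination drive that is already mounted.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_mismatches(source: Path, destination: Path) -> list:
    """List source files that are missing from, or differ in, the backup."""
    mismatched = []
    for original in sorted(source.rglob("*")):
        if original.is_file():
            copy = destination / original.relative_to(source)
            if not copy.is_file() or sha256_of(original) != sha256_of(copy):
                mismatched.append(original)
    return mismatched


def back_up_folder(source: Path, destination: Path) -> list:
    """Copy source into destination, then verify every file by checksum."""
    # dirs_exist_ok lets you rerun the backup into the same destination.
    shutil.copytree(source, destination, dirs_exist_ok=True)
    return find_mismatches(source, destination)
```

An empty list from back_up_folder means every file was copied intact. Running find_mismatches on its own later is a quick way to check whether an older backup has silently degraded.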

3. Save the final product

In 2017, in response to a staff unionization effort, billionaire Joe Ricketts shut down the network of local news websites he owned — including the sites’ entire archives. The archives eventually went back up, but their momentary erasure demonstrates the ephemerality of online news content. Content can also disappear for non-malicious reasons, like a change to a content management system that makes older content hard to find through search, or a shift in a site’s business model that deprioritizes archival content.

While the Internet Archive captures a significant amount of data, it does not capture every page of every site on every day, meaning that an individual piece or post could slip through the cracks, or require you to spend a significant amount of time digging through various captures to find the exact page you need. Whether you fear a hostile government or billionaire will have your writing taken down, or whether you simply wish to keep an accurate record of your work, it’s worth the effort to preserve your final product. Pieces with interactive components present additional challenges. As web technology changes, visualizations or animations might not render properly, even if the page has been scraped by the Internet Archive. To keep a record of your published work, take the following steps:

  • Ensure an article is saved by the Internet Archive by adding the Wayback Machine extension to your browser; clicking it will trigger a crawl of the page by the Internet Archive.
  • Save the website as a PDF, and add it to the folder where you have saved related materials. Although you may lose some graphic content, PDF is considered an archival format and is easy to back up and open in a variety of other tools.
  • For pieces with interactive components, create a recording of the website using Conifer, a web recording tool originally invented for the preservation of internet-based artworks. Conifer allows you to open a site and creates a recording as you navigate through its features: opening tool tips, playing back video or GIFs, and so on. You can then download the recording as a WARC (the same file format used to save webpages to the Internet Archive), and even “play back” the website like you would a video.
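If you publish often, the first step can also be scripted. The sketch below assumes the Wayback Machine's public Save Page Now address (https://web.archive.org/save/ followed by the page URL) accepts a plain request; the function names and the User-Agent string are hypothetical, and the service may throttle or change this behavior, so treat it as a starting point rather than a guaranteed method.

```python
import urllib.parse
import urllib.request

# Assumption: the public Save Page Now address of the Wayback Machine.
SAVE_ENDPOINT = "https://web.archive.org/save/"


def save_page_now_url(article_url: str) -> str:
    """Return the Save Page Now address for an article URL."""
    parts = urllib.parse.urlsplit(article_url)
    if parts.scheme not in ("http", "https") or not parts.netloc:
        raise ValueError(f"not a web address: {article_url!r}")
    return SAVE_ENDPOINT + article_url


def request_capture(article_url: str, timeout: float = 60.0) -> int:
    """Ask the Wayback Machine to crawl the page; return the HTTP status."""
    request = urllib.request.Request(
        save_page_now_url(article_url),
        headers={"User-Agent": "personal-archive-sketch"},  # hypothetical label
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Even when a capture request succeeds, keep your own PDF or WARC copy as well, since a single remote copy runs against the LOCKSS principle.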

Fundamentally, it’s important to remember that digital archiving is an ongoing practice, not a set of steps you can take and then forget. There’s no digital equivalent of a dusty box in a garage; technologies change, and the measures you take to ensure your files are accessible in the future must evolve as well (imagine if you had all your interview notes on floppy disks!).

In the end, remember that no person or organization is obliged to preserve your creative work. If you want a record of what you’ve created in the news media, you need to take action. And as a journalist, your archiving work is vital. News archives (whether in fragile bound volumes of newsprint, on microfiche, on steno pads in reporters’ personal archives, or now in the form of bits and bytes) contribute significantly to our collective understanding of history. Just because our times are volatile, their record need not be.

Additional Resources

  • The Activists’ Guide to Archiving Video: This manual was created by international nonprofit WITNESS to support individuals who want to create and safely share video documentation of human rights abuses. Even if you do not work with video, these tips, which include analyzing, creating, and deleting file metadata; source protection best practices; and safe file transfer and back up, have a lot of relevance for journalists creating materials in the field.
  • Preserve This Podcast: Designed for independent podcasters, this site’s resources (which include articles, a podcast, and a zine) clearly illustrate the nuts and bolts of media archiving for nonprofessionals.
  • Digitalpreservation.gov: The Library of Congress in the United States has a clear (if slightly older) site that focuses on how to digitize analog materials and safely store the digitized copies.
  • Freedom of the Press Foundation: FPF offers numerous training resources for journalists thinking about digital security and source protection.
  • Endangered But Not Too Late: The State of Digital News Preservation: If you’re interested in digging into this topic further, the Reynolds Journalism Institute at University of Missouri conducted this extensive study of news organizations’ archiving practices, with recommendations at the newsroom level for how to ensure digital news content can serve as an historical resource going forward. RJI is also in the process of developing open source tools that will help news organizations preserve their content.

This post was originally published at GIJN.org. Talya Cooper is an archivist and researcher based in New York.
