Birth-control implants ruptured their wombs, shredding internal organs. Breast implants broke inside their bodies causing persistent pain. Devices meant to keep their hearts beating in rhythm delivered jolting shocks, in some cases even triggering strokes.

Journalists who reported the Implant Files, the International Consortium of Investigative Journalists’ award-winning investigation into the lax regulation of the $400 billion medical device industry worldwide, heard horror stories like these again and again. Patients harmed by medical devices come from all backgrounds, but most of the thousands we heard from shared a defining characteristic: they were women.

And it wasn’t just “women’s devices” that had hurt them, but sex-neutral implants like artificial hips.

The trend we noticed is consistent with studies that have shown that women experience higher rates of hip-implant failure than men and have stronger immunological reactions to metal-containing devices.

We wanted to know more, and so do health researchers.

“Having information about sex is very important because some products differ in safety between men and women,” said Diana Zuckerman, an expert on medical devices and president of the National Center for Health Research, a think tank in Washington, D.C.

The natural place to look was the Food and Drug Administration’s Manufacturer and User Facility Device Experience (MAUDE) database. This public dataset contains 8 million reports filed by manufacturers, doctors and patients when a medical device is suspected to have caused an injury or a death or has malfunctioned in a way that may put people at risk.

We hit a wall.

It turns out that while the FDA gathers information on the sex of patients, the agency won’t make the data public for “patient confidentiality” reasons, an FDA spokesperson told us. In contrast, a comparable FDA database on adverse events for drugs discloses information about the sex of patients on its public dashboard.

How about aggregate data, we asked? The FDA said no again. This time the agency said it didn’t have the resources to comply with the request.

We didn’t give up. Our team noticed that the sex of patients is often disclosed through pronouns (“the patient reported she can see blood in the tubing of her insulin infusion”) and adjectives (“a male patient in good health underwent a knee repair”) in the ‘incident’ reports submitted to the FDA.

Extracting sex information manually from a maze of millions of records would have been an impossible task. But what about recruiting computer intelligence to do the work for us?

Six months later, an algorithm we wrote had managed to positively identify the sex of the patient in more than 340,000 injury and death cases. Of those, 67 percent were women and 33 percent were men.

It is impossible to generalize the findings about patient sex to the entire universe of injuries and deaths reported to MAUDE, but the results do serve as additional evidence of a patient-sex imbalance that experts say the FDA is not doing enough to explore.

Rep. Rosa DeLauro, a Democrat from Connecticut, has pressed the FDA to tighten medical device oversight for two decades. She has paid particular attention to problematic women-focused devices, including textured breast implants, which are linked to a rare form of blood cancer, and Essure, an implantable contraceptive that was recently pulled from the market.

Earlier this year she reintroduced the Medical Device Safety Act, which would make it easier for harmed patients to sue device makers.

In an email, DeLauro called on the FDA to release the full dataset, so the public can better understand potential threats posed by devices that may disproportionately affect women or men.

Releasing it, she said, “is the right thing to do.”

In an email, an FDA spokesperson said the agency agrees that the public should have more information about adverse event reports for medical devices, but that the technology behind the MAUDE platform is outdated, limiting the agency’s ability to accomplish that goal. Congress recently allocated money just for that purpose and upgrades are underway, the spokesperson said.

The spokesperson also pointed to actions the agency is taking to better understand how medical devices perform in female patients, which include public hearings and new research programs to evaluate and monitor devices used specifically for women.

Putting AI at the service of journalism

ICIJ’s efforts to catalog the sex of hundreds of thousands of patients injured or killed by a medical device began earlier this year at Stanford University’s Gates Building, where Google was born 22 years ago. We shared tea and spicy roasted peanuts with students and research scientists as we brainstormed ways to obtain the public interest data the U.S. government had denied us.

The meeting kicked off a partnership between ICIJ and the lab of Prof. Chris Ré, a MacArthur genius grant recipient whose team aims to make it faster and easier for humans to teach computers what they know about a topic, so computers can pick up some painstaking tasks from humans (and humans can do more meaningful work!).

At ICIJ we had another goal: to continue to explore ways in which artificial intelligence can help investigative reporters tackle document-heavy stories so vast they seem impossible to execute.

Also, can journalists and academics work well together? We wanted to test this, too.

ICIJ’s Emilia Diaz-Struck mapping out the plan.

Ré’s lab had developed a tool called Snorkel, which uses a machine learning approach called “weak supervision” to classify text and images. In other words, the computer could draw conclusions from a relatively small amount of data classified by humans, combined with a few rules applied to a large amount of unclassified data. (Here’s a technical description of Snorkel.)

If you are feeling a little dizzy already, let me give you an example to help. The predominant machine learning approach today is “supervised learning,” in which humans spend hundreds of hours manually labeling thousands of data points (for example: this is a benign tumor, this is a malignant tumor) so the computer can learn to recognize patterns and make predictions.

Well, that’s not going to happen in a resource- and time-strapped newsroom, which is pretty much all of them. With Snorkel’s weak supervision, the computer gets a set of rules to follow so that it can do the labeling usually done by humans all by itself.

Here’s how our Implant Files machine learning experiment worked, in four steps:

Step 1 – Generate a first set of labels

The first step was to label a small dataset for development and validation. That meant, yes, some manual labor. Three journalists classified patient sex in 1,000 incident reports from the MAUDE database. They used three categories: female, male and unknown. This last category meant that the narrative in the incident report didn’t have enough markers to allow us to determine the sex of the patient. We noticed this was the case in many instances because of the poor quality of the FDA data, which is self-reported rather than mandated.
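For illustration only, here is a minimal sketch of how a hand-labeled development set like this could be loaded in Python with pandas. The file name and column names are hypothetical, not ICIJ’s actual files.

```python
import pandas as pd

# Hypothetical file: 1,000 MAUDE incident reports hand-labeled by journalists.
# Assumed columns: "report_id", "text" (the incident narrative), "label".
df_dev = pd.read_csv("maude_dev_sample.csv")

# Map the three human categories to the integer codes used later for training.
LABEL_MAP = {"unknown": -1, "female": 0, "male": 1}
Y_dev = df_dev["label"].map(LABEL_MAP).values

# How many reports fell into each category.
print(df_dev["label"].value_counts())
```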

Step 2 – Give the computer parameters or rules

ICIJ’s senior data analyst Rigoberto Carvajal and ace Stanford computer science student Mandy Lu set out to write a set of rules (known as “labeling functions” in Snorkel lingo) that would help train the computer to classify patient sex automatically. The rules revolved around pronouns commonly associated with males and females, as well as adjectives and body organs. Here’s an image of what some of these rules look like in our code:

A snippet of the rules used to identify sex in medical device reports.
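ICIJ’s actual rules live in its public repository; the snippet below is only a simplified sketch of what Snorkel labeling functions of this kind can look like, using hypothetical regular expressions over an assumed “text” column.

```python
import re
from snorkel.labeling import labeling_function

# Integer codes for the three outcomes a rule can vote for.
ABSTAIN, FEMALE, MALE = -1, 0, 1

@labeling_function()
def lf_female_pronouns(x):
    # Pronouns commonly used for female patients in the incident narrative.
    return FEMALE if re.search(r"\b(she|her|hers|herself)\b", x.text, re.IGNORECASE) else ABSTAIN

@labeling_function()
def lf_male_pronouns(x):
    # Pronouns commonly used for male patients.
    return MALE if re.search(r"\b(he|him|his|himself)\b", x.text, re.IGNORECASE) else ABSTAIN

@labeling_function()
def lf_female_anatomy(x):
    # Adjectives and sex-specific organs mentioned in the report.
    return FEMALE if re.search(r"\b(female patient|pregnan\w+|uterus|ovar\w+)\b", x.text, re.IGNORECASE) else ABSTAIN
```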

“It’s like creating a dictionary for the machine,” explains ICIJ’s data editor, Emilia Diaz-Struck, who led the team of journalists.

As with children, it’s teaching by repetition: providing the computer multiple examples of the different ways in which something is true until the concept sticks.

With the help of the rules, the computer begins its training process. It then produces a machine-classified dataset showing whether it recognized the patient affected by the medical device as female or male, or whether that was impossible to know from the narrative of the report.
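Under weak supervision, the rules are applied to every report and then combined by Snorkel’s label model. A minimal sketch of that training step, reusing the labeling functions above and assuming the narratives sit in a hypothetical DataFrame called df_reports, might look like this:

```python
from snorkel.labeling import PandasLFApplier, LFAnalysis
from snorkel.labeling.model import LabelModel

lfs = [lf_female_pronouns, lf_male_pronouns, lf_female_anatomy]  # plus the rest of the rules

# Apply every rule to every report; each cell of L_train is FEMALE, MALE or ABSTAIN.
applier = PandasLFApplier(lfs=lfs)
L_train = applier.apply(df=df_reports)

# How often each rule fires, and how much the rules overlap or conflict.
print(LFAnalysis(L=L_train, lfs=lfs).lf_summary())

# The label model weighs the noisy rules and combines them into one prediction per report.
label_model = LabelModel(cardinality=2, verbose=True)  # two classes: female, male
label_model.fit(L_train=L_train, n_epochs=500, seed=123)
```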

Step 3 – Bring back the humans to validate

To test that the computer was learning correctly, we brought back the journalists to check two different sets of results. They made some great discoveries that show the limits of computer intelligence:

  • The computer struggled to identify the patient when multiple people, such as doctors, nurses and spouses, were named in the incident report. A distinction humans can make very quickly turned out to be a big ordeal for the algorithm. To help it along, research scientist Jason Fries, who was also project manager, wrote additional rules that guided the computer to recognize terms associated with spousal relationships or healthcare workers, making it easier to rule out people other than the patient.
  • Journalists found cases of false positives associated with the words female and male because, it turns out, certain devices have an assigned sex. So, in those cases, the female or male references referred to medical device parts and not to patients. Here’s an example: “a crack was identified on the female connector.” (One way to handle such cases in code is shown in the sketch after this list.)
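One way to encode fixes like these, offered here as an illustration rather than ICIJ’s actual code and reusing the constants from the earlier snippet, is to have a rule screen out device parts and people other than the patient before voting:

```python
# Hypothetical patterns for device parts and for people who are not the patient.
DEVICE_PART = r"\b(fe)?male\s+(connector|luer|adapter|port|end)\b"
NON_PATIENT = r"\b(wife|husband|spouse|nurse|physician|surgeon)\b"

@labeling_function()
def lf_female_word(x):
    # Drop "female connector"-style phrases, which describe device parts, not patients.
    text = re.sub(DEVICE_PART, " ", x.text, flags=re.IGNORECASE)
    # Abstain when the narrative names spouses or healthcare workers, so their sex
    # is not mistaken for the patient's.
    if re.search(NON_PATIENT, text, re.IGNORECASE):
        return ABSTAIN
    return FEMALE if re.search(r"\bfemale\b", text, re.IGNORECASE) else ABSTAIN
```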

Journalists were crucial “domain experts” who fact-checked the algorithm’s results in real time so computer scientists could tweak it to improve accuracy. And we were careful to have at least two fact-checkers for each round, so they could review each other’s work.
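Once a hand-labeled sample exists, the model’s overall accuracy can also be checked automatically against it. A minimal sketch, reusing the names from the snippets above:

```python
import numpy as np

# Score the label model only on reports the journalists could label as female or male.
known = Y_dev != -1
L_dev = applier.apply(df=df_dev[known])
print(label_model.score(L=L_dev, Y=Y_dev[known], metrics=["accuracy"]))
```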

Step 4 – Run! (the code against the full dataset)

Finally, the moment arrived to run the algorithm against the whole FDA MAUDE database. Focusing on the decade that ended in 2017 (the same time frame we used in the Implant Files investigation), the computer was able to assign a sex to patients in 23 percent of the injury and death cases identified in the Implant Files investigation, and it did so with 96 percent accuracy. Those patients had been harmed by devices ranging from glucose testing sets to cardiovascular devices and prostheses.

What happened to the other 77 percent? The computer assigned them the “unknown” label because the incident reports did not provide enough information for the computer to make an accurate prediction. The issue, most often, was the lack of a sex-specific adjective or pronoun.
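For completeness, here is a sketch of that final pass, again reusing the names introduced above and assuming the full decade of narratives sits in a hypothetical df_all_reports DataFrame:

```python
# Apply the trained model to every injury and death report in the study period.
L_all = applier.apply(df=df_all_reports)
df_all_reports["predicted_sex"] = label_model.predict(L=L_all, tie_break_policy="abstain")

# -1 = unknown (not enough markers in the narrative), 0 = female, 1 = male.
shares = df_all_reports["predicted_sex"].map({-1: "unknown", 0: "female", 1: "male"}).value_counts(normalize=True)
print(shares)
```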

What we learned

It wasn’t just the machine that got smarter as a result of the collaboration between ICIJ and Stanford. The machine learning exercise gave our newsroom a deeper understanding of the ways in which AI can augment our work, especially when lots of messy data are involved. It also taught us:

  • Data quality matters: the device injury and death reports were complex and all over the place (written by a multitude of different groups of people, such as doctors, patients and manufacturers), which made the algorithm’s work extra hard. Research scientist Jason Fries kept reminding us that if human readers cannot agree on a label, the machine will not do any better. “It will reflect the confusion.”
  • Humans matter: the work would have been impossible without journalists guiding the research and checking the results all the way through. Artificial intelligence will not replace reporters any time soon; it will complement their work and pick up the most repetitive aspects of their research, freeing them up to focus on the things only humans can do well: contextualize, empathize, tell stories.
  • Less is more: using a simple rule-based approach to label data, as opposed to a completely ‘black box’ process, allows us, and the public, to know exactly what the computer is doing. “In journalism, we need to see the process step by step,” says ICIJ’s data analyst Rigoberto Carvajal.

If you are interested in building on our medical devices work or using a similar machine learning approach to solve a different journalistic problem, we have made our code public on GitHub. You can get in touch with us at data@icij.org.

The following people are part of the Machine Learning for Investigations partnership between ICIJ and Stanford University: Alex Ratner, Jason Fries, Mandy Lu, Jared Dunnmon, Alison Callahan, Sen Wu, Emilia Diaz-Struck, Rigoberto Carvajal, Delphine Reuter, Zshekinah Collier, Karrie Kehoe, Karen de la Hoz.

Marina Walker Guevara, ICIJ director of strategic initiatives, developed the partnership between Stanford and ICIJ when she was a John S. Knight Fellow at the university in 2018-2019.