Every investigation offers ICIJ’s data team an opportunity to learn — and the Uber Files was no different.

It may not have been our first time working with a large, leaked dataset. But rather than mapping complex networks of offshore companies or tracing dirty money flows from country to country, the Uber Files presented a new challenge: connecting digital calendar events to real-life meetings and sketching out pictures of relationships based on message exchanges between high-profile movers and shakers in politics and the transportation industry.

The Uber Files is based on a collection of more than 124,000 records, including 83,000 emails and other files created between 2013 and 2017, a period when the U.S. ride-hailing giant was expanding across the globe. The files were leaked to The Guardian and shared with ICIJ. After publication, ex-Uber lobbyist Mark MacGann came forward as the source of the leak.

The Uber Files included communications between company executives, as well as messages with key political figures and their representatives that showed Uber’s tactics in trying to gain access to markets around the world. The records contained details about meetings Uber lobbyists held with world leaders and other public officials to try to influence legislation and revealed the company’s use of stealth technology and evasive tactics to thwart regulators and law enforcement in at least six countries.

After receiving the files from The Guardian, ICIJ uploaded them onto its bespoke research platform, Datashare, allowing journalists from more than 40 media partners in 29 countries to search and review the leaked documents.

Here are some questions and answers about ICIJ’s methods in processing and analyzing the Uber Files data.

How did the Uber Files records differ from previous leaks to ICIJ?

The Uber Files records were mostly emails, which comprised 83,000 of the leak’s 124,000 files. Other ICIJ projects were more of a mix.

Some of the first questions ICIJ had to address were what information could be structured from the emails, text and calendar items in the leak and whether external databases could help the research and reporting.

And the subject area was new. Previous leaks ICIJ has investigated related mainly to the offshore financial system, and efforts to structure data (i.e. organize information into standardized, searchable fields) centered on highlighting information tied to people and companies using entities registered in secrecy jurisdictions. In the case of the Uber Files, the records centered on communications and lobbying efforts to try to influence key stakeholders in different countries and regulations.

How did ICIJ identify the major figures in the Uber Files?

Using tools such as Apache Tika, Python and the pandas library, ICIJ extracted the email addresses, along with the names and domains associated with them. The information was organized in a spreadsheet to help identify key people, including politicians and government officials, whom Uber executives had contacted.
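As a rough illustration of this kind of extraction, the sketch below splits email header values into a display name, an address and a domain, then counts how often each domain appears. It is not ICIJ's actual code: the header strings are invented, the function name `extract_contacts` is hypothetical, and it uses only Python's standard library rather than Tika or pandas.

```python
from collections import Counter
from email.utils import getaddresses

# Invented header values standing in for the From/To/Cc fields
# that a text-extraction step would pull out of each email.
headers = [
    "Jane Doe <jane.doe@example.org>, press@example.com",
    '"Doe, Jane" <Jane.Doe@example.org>',
    "Transport Ministry <office@ministry.example>",
]

def extract_contacts(header_values):
    """Return one row per address: display name, lowercased address, domain."""
    rows = []
    for name, address in getaddresses(header_values):
        if "@" not in address:
            continue  # skip entries that did not parse into an address
        address = address.lower()
        domain = address.rsplit("@", 1)[1]
        rows.append({"name": name, "email": address, "domain": domain})
    return rows

contacts = extract_contacts(headers)

# Domains that appear often (a ministry, a parliament) can point
# reporters toward the institutions worth a closer look.
domain_counts = Counter(row["domain"] for row in contacts)
```

Rows shaped like these can be written to a spreadsheet or loaded into a pandas DataFrame for sorting and filtering.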

ICIJ examined three types of group calendar files. ICIJ extracted details of scheduled meetings between Uber’s representatives and politicians and public officials, then reviewed thousands of internal emails and messages to confirm that the meetings took place. Additionally, ICIJ explored public records in countries and institutions where officials are required to declare their meetings and schedules.
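The comparison between calendar entries and public disclosures can be pictured as a set difference. The sketch below is an assumption about how such a check might look, not a description of ICIJ's process: the records are invented, and matching on an exact date and name is a simplification.

```python
from datetime import date

# Invented records: (meeting date, official) pairs.
calendar_meetings = {
    (date(2015, 3, 10), "official a"),
    (date(2015, 6, 2), "official b"),
    (date(2016, 1, 20), "official c"),
}
public_disclosures = {
    (date(2015, 3, 10), "official a"),
}

def undisclosed(leaked, disclosed):
    """Meetings in the leaked calendars with no matching public record."""
    return sorted(leaked - disclosed)

candidates = undisclosed(calendar_meetings, public_disclosures)
```

Anything left in `candidates` is only a lead. A calendar entry shows that a meeting was scheduled, so each one would still need to be confirmed against emails and messages.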

ICIJ found more than 100 meetings that took place from 2014 to 2016 between Uber executives and public officials, including 12 with representatives of the European Commission, that hadn’t been publicly disclosed. Company executives held private meetings with at least six world leaders, one vice president and three deputy prime ministers, the analysis found.

ICIJ also used public databases such as the European Union’s transparency register, the disclosure logs for the U.S. Senate, and the French lobbying registry.

ICIJ and its partners were also able to examine correspondence between the company and academics who published research favorable to the company, showing that the research was coordinated. The files revealed what data Uber provided the academics, the lobbying messages the research would be supporting and in what countries, what media the academics would appear on to present the results of the research, and what messages the academics would be pushing towards the general public and politicians. ICIJ also explored publicly available databases of academic papers to identify additional publications funded or supported by Uber, often featuring current or former Uber employees as co-researchers. The data findings on academic research appeared in stories by The Guardian and other ICIJ media partners.

ICIJ also identified several spreadsheets in the leaked records that contained information about what the company called potential “stakeholders” that could be of interest for Uber. ICIJ combined the information in one master spreadsheet and organized it by country to help partners’ reporting. The research showed Uber, with the help of an advisory firm, had compiled more than 1,850 stakeholders, including sitting and former public officials, think tanks and citizens groups, in 29 countries and in EU institutions.
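Merging several spreadsheets into one master list usually means normalizing column names, dropping duplicates and grouping by country. The sketch below shows one way to do that, assuming invented sheets with `name`, `country` and `role` columns. The column names and the function `combine_by_country` are hypothetical, not taken from the leaked files.

```python
import csv
import io
from collections import defaultdict

# Invented sheets whose headers differ only in capitalization.
sheet_a = "Name,Country,Role\nPerson One,France,Official\nGroup Two,Belgium,Think tank\n"
sheet_b = "name,country,role\nPerson Three,France,Former official\nPerson One,France,Official\n"

def read_rows(text):
    """Yield rows with lowercased column names and trimmed values."""
    for row in csv.DictReader(io.StringIO(text)):
        yield {key.strip().lower(): value.strip() for key, value in row.items()}

def combine_by_country(*sheets):
    """Merge sheets, drop repeated (name, country) pairs, group by country."""
    seen = set()
    by_country = defaultdict(list)
    for sheet in sheets:
        for row in read_rows(sheet):
            key = (row["name"].lower(), row["country"].lower())
            if key in seen:
                continue  # same stakeholder already listed in an earlier sheet
            seen.add(key)
            by_country[row["country"]].append(row)
    return dict(by_country)

master = combine_by_country(sheet_a, sheet_b)
```

In practice, names are rarely spelled identically across sheets, so exact-match deduplication like this would need a manual review on top of it.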

How did you go about quantifying the scale of Uber’s operations over time?

ICIJ used the Wayback Machine from the Internet Archive to analyze previous versions of Uber’s website and gather historical reports about how the location of the company’s operations grew and changed over time.
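For readers who want to try something similar, the Wayback Machine exposes a CDX search endpoint that lists archived captures of a page. The sketch below builds a query for one capture per month and turns a response into snapshot links. It is a minimal example under stated assumptions: the page `uber.com/cities` is only a guess at a useful URL, the sample response is invented, and no request is sent.

```python
import json
from urllib.parse import urlencode

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"

def build_cdx_query(page_url, start_year, end_year):
    """Build a CDX query URL asking for successful captures in a year range."""
    params = {
        "url": page_url,
        "from": str(start_year),
        "to": str(end_year),
        "output": "json",
        "filter": "statuscode:200",
        "collapse": "timestamp:6",  # keep at most one capture per month
    }
    return f"{CDX_ENDPOINT}?{urlencode(params)}"

def snapshot_urls(cdx_json_text):
    """Turn a CDX JSON response (header row first) into snapshot links."""
    rows = json.loads(cdx_json_text)
    if not rows:
        return []
    header, captures = rows[0], rows[1:]
    ts, orig = header.index("timestamp"), header.index("original")
    return [f"https://web.archive.org/web/{row[ts]}/{row[orig]}" for row in captures]

# Hypothetical target page, chosen only for illustration.
query = build_cdx_query("uber.com/cities", 2013, 2017)

# Invented response in the CDX JSON layout, used instead of a live request.
sample_response = json.dumps([
    ["urlkey", "timestamp", "original", "mimetype", "statuscode", "digest", "length"],
    ["com,uber)/cities", "20140101000000", "https://www.uber.com/cities",
     "text/html", "200", "ABC", "1234"],
])
links = snapshot_urls(sample_response)
```

Each link in `links` opens the page as it looked at that capture, which makes it possible to compare a list of cities or countries from one period to the next.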

ICIJ also used Uber’s financial statements filed with the U.S. Securities and Exchange Commission to track growth since 2019, when the company went public.

What were the biggest challenges in analyzing the Uber Files dataset?

Like all leaks, the documents represent only a small slice of a larger reality. Not all countries where Uber had operations were mentioned in the records. The files covered only the period through 2017. As always, additional research and reporting were necessary.

ICIJ also encountered a discrepancy between what the company said in its financial statements and what it told users on its website about how many markets it was in. The team had to check for methodological differences in the two figures. Uber eventually told ICIJ that “with respect to city counts, the way we calculated that number changed in 2020.”

ICIJ had to parse countries’ varying lobbying and meeting-disclosure records, accounting for variations in the type of data available and its quality. Not all countries have lobbying regulations or public lobbying records and not all require politicians to report the meetings they hold as part of their official responsibilities.

Any advice for tackling this kind of dataset?

Remember to connect leaked data with public records. This helps with validation efforts and provides additional information that can be valuable for analysis.

Understand the limitations of each dataset, its structure and quality, and which questions can or can’t be answered through data analysis.

Review regulations across countries and how they affect data gathering and analysis. Talk to experts to get perspective on how regulations are applied in the real world.

When structuring data and performing different types of analysis, document the process and leave plenty of time for fact-checking. All data gathering and analyses performed by ICIJ for the Uber Files were fact-checked.

Is ICIJ going to release the Uber Files data?

ICIJ doesn’t release personal data en masse. It will continue to explore the datasets with media partners. More than 180 journalists have spent months searching the data for stories that are in the public interest. If you have any tips or information you would like to share with the team of journalists who worked on the Uber Files, you can send an email to contact@icij.org.