Science Shouldn’t Give Data Brokers Cover for Stealing Your Privacy

In the guise of collecting scientific data, data brokers are running a massive privacy invasion. Researchers should stop helping them

People at a busy intersection, with a visualization of a mobile computing device sending out computer code and revealing its location.

When SafeGraph got caught selling location information on Planned Parenthood visitors last year, the data broker responded to public outcry by removing its family planning center data. But CEO Auren Hoffman tried to flip the script, claiming his company’s practice of harvesting and sharing sensitive data was actually an engine for beneficial research on abortion access—brandishing science as a shield for shredding people’s privacy.

SafeGraph’s move to cloak its privacy pillaging behind science is just one example of an industry-wide dodge. Other companies such as Veraset, Cuebiq and X-Mode also operate so-called data-for-good programs with academics and seized on the COVID pandemic to expand them. These brokers provide location data to academic researchers, whose studies have appeared in prestigious venues such as Nature and the Proceedings of the National Academy of Sciences USA. Yet in 2020 Veraset also gave Washington, D.C., officials bulk location data on hundreds of thousands of people without their consent. And a proposed class-action lawsuit this year named Cuebiq, X-Mode and SafeGraph among data brokers that bought location data from the family tracking app Life360 without users’ consent.

Data brokers are buying and selling hundreds of millions of people’s location information, and too many researchers are inadvertently providing public-relations cover to this massive privacy invasion by using the data in scientific studies.


Researchers must carefully consider whether such data make them accomplices to this dubious practice. Lawmakers must act now to halt this trampling of Americans’ privacy rights. And the legal barricades that prevent full scrutiny of data brokers’ abuses must be dismantled.

SafeGraph’s removal of the clinic data was the real problem, Hoffman argued in a May 2022 interview with the now-defunct tech news site Protocol: “Once we decided to take it down, we had hundreds of researchers complain,” he said. Yet when pressed, he could not name any—and the fact remains that the data put actual abortion seekers, providers and advocates in danger in the wake of the U.S. Supreme Court’s ruling in Dobbs v. Jackson Women's Health Organization.

Location data brokers such as SafeGraph, Veraset and the others simply don’t meet the standards demanded of researchers who work with human subjects, starting with the fact that meaningful “opt-in” consent is consistently missing from their business practices. Data brokers often argue that the data they collect are “opt in” because users have agreed to share that information with an app—even though the overwhelming majority of users have no idea that it’s being sold on the side to brokers who, in turn, sell it to businesses, governments, local law enforcement and others.

In fact, Google concluded that SafeGraph’s practices were so out of line that it banned any apps using the company’s code from its Google Play app store, and both Apple and Google banned X-Mode from their respective app stores.

Furthermore, the data feeding into data brokers’ products can easily be linked to identifiable people despite the companies’ weak claims of anonymization. Information about where a person has been is itself enough: One widely cited study from 2013 found that researchers could uniquely characterize 50 percent of people using only two randomly chosen time and location data points.
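To see how little it takes, consider a minimal sketch, in Python with invented toy data, of the kind of uniqueness test behind that finding (this is not the study's actual code): pick a couple of random time-and-place points from one device's trace and count how many traces in the data set contain them all.

```python
import random

def is_unique(traces, target, k=2, rng=random):
    """Return True if k random points from target's trace match no other trace.

    traces maps a device ID to a set of (hour, cell) tuples: coarsened
    spatio-temporal pings (e.g., hour of day plus an antenna or grid cell).
    """
    points = rng.sample(sorted(traces[target]), k)
    matching = [d for d, trace in traces.items() if all(p in trace for p in points)]
    return matching == [target]

# Toy data: even two points single out each device here.
traces = {
    "device_a": {(9, "cell_12"), (13, "cell_40"), (18, "cell_12")},
    "device_b": {(9, "cell_12"), (14, "cell_77"), (20, "cell_03")},
}
unique = sum(is_unique(traces, d) for d in traces)
print(f"{unique} of {len(traces)} traces are pinned down by just 2 points")
```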

Due to the rapid growth of social media and smartphone use, data brokers today collect sensitive user data from a much wider variety of sources than in 2013, including hidden tracking code running in the background of mobile apps. While collection techniques vary and are often obscured behind nondisclosure agreements (NDAs), the raw data brokers collect and process are built on sensitive, individual location traces.

Aggregating location data can sometimes preserve individual privacy, with safeguards accounting for the size of the data set and the type of data it includes. But no privacy-preserving aggregation protocols can justify the initial collection of location data from people without their consent.
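For illustration, here is a minimal sketch of one such safeguard: counting distinct devices per area and time window and suppressing any cell that falls below a minimum count. The threshold and field names are hypothetical assumptions, not any broker's actual protocol; and even done well, this protects only the published aggregate, not the raw traces collected upstream.

```python
MIN_COUNT = 20  # hypothetical suppression threshold

def aggregate(pings, min_count=MIN_COUNT):
    """pings: iterable of (device_id, area, hour) tuples.

    Returns {(area, hour): device_count}, counting each device once per
    cell and dropping any cell seen by fewer than min_count devices.
    """
    devices = {}
    for device_id, area, hour in pings:
        devices.setdefault((area, hour), set()).add(device_id)
    return {cell: len(ids) for cell, ids in devices.items() if len(ids) >= min_count}
```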

Data brokers’ products are notoriously easy to reidentify, especially when combined with other data sets—and that’s exactly what some academic studies are doing. Studies have combined data broker locations with Census data, real-time Google Maps traffic estimates, local household surveys and figures from the Federal Highway Administration. While researchers appear intent on building the most reliable and comprehensive possible data sets, this merging is also a first step to reidentifying the data.
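A crude, entirely hypothetical sketch shows why merging matters: once an auxiliary data set maps places to people, linking an "anonymized" device to a name can be as simple as guessing its home from where it spends its nights. All names and data below are invented for illustration.

```python
from collections import Counter

def infer_home(pings):
    """Guess a device's home block as its most common nighttime location.
    pings: list of (hour, block) tuples for one device."""
    night = [block for hour, block in pings if hour >= 22 or hour < 6]
    return Counter(night).most_common(1)[0][0] if night else None

def link(device_pings, household_roster):
    """Attach names to device IDs wherever the inferred home block appears
    in an auxiliary roster mapping blocks to a listed household."""
    return {device: household_roster[home]
            for device, pings in device_pings.items()
            if (home := infer_home(pings)) in household_roster}

# Invented example: one "anonymized" trace plus one public-ish record.
device_pings = {"device_a": [(23, "block_9"), (2, "block_9"), (13, "block_4")]}
household_roster = {"block_9": "the Smith household"}
print(link(device_pings, household_roster))  # {'device_a': 'the Smith household'}
```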

Behind layers of NDAs, data brokers typically hide their business practices—and the web of data aggregators, ad tech exchanges and mobile apps that their data stores are built on—from scrutiny. This should be a red flag for institutional review boards (IRBs), which oversee proposed research involving human subjects: IRBs need visibility into whether and how data brokers and their partners actually obtain consent from users. Likewise, academics themselves have an interest in confirming the integrity and provenance of the data on which their work relies.

Without such verification, some researchers describe data broker information with prattle that mirrors marketing language. For example, one paper described SafeGraph data as “anonymized human mobility data,” and another called them “foot traffic data from opt-in smartphone GPS tracking.” A third described data broker Spectus as providing “anonymous, privacy-compliant location data” with an “ironclad privacy framework.” None of this is close to the whole truth.

One Nature paper even paradoxically characterized Veraset’s location data as being both “fine-grained” and “anonymized.” Its specific data points included “anonymized device IDs” and “the timestamps, and precise geographical coordinates of dwelling points” where a device spent more than five minutes. Such fine-grained data cannot be anonymous.

Academic data sharing programs will remain disingenuous public relations ploys until companies obey data privacy and transparency requirements. The sensitive location data that brokers provide should only be collected and used with specific, informed consent, and subjects must have the right to withdraw that consent at any time.

We need comprehensive federal consumer data privacy legislation to enforce these standards—far more comprehensive than what Congress has put on the table to date. Such a bill must not preempt even stricter state laws; it should serve as a floor instead of a ceiling. And it must include a private right of action so that ordinary people can sue data brokers who violate their privacy rights, as well as strong minimization provisions that prohibit companies from processing a person’s data except as strictly necessary to provide them the service they asked for. The bill must also require informed, voluntary, specific, opt-in consent for any other processing (not the opt-out scenario that often exists now) and must prohibit pay-for-privacy schemes in which companies charge higher prices, or degrade service, for people who refuse to waive their privacy rights.

And we must strip away the NDAs to allow research into the data brokers themselves: their business practices, their partners, the ways their data can be abused, and the steps that can be taken to protect the people they put in harm’s way.

Data brokers claim they are bringing transparency to tech or “democratizing access to data.” But their scientific data sharing programs are nothing more than attempts to control the narrative around their unpopular and nonconsensual business practices. Critical academic research must not become reliant on profit-driven data pipelines that endanger the safety, privacy and economic opportunities of millions of people without their meaningful consent.

This is an opinion and analysis article, and the views expressed by the author or authors are not necessarily those of Scientific American.