
Algorithms Are Making Important Decisions. What Could Possibly Go Wrong?

Seemingly trivial differences in training data can skew the judgments of AI programs—and that’s not the only problem with automated decision-making


Can we ever really trust algorithms to make decisions for us? Previous research has proved these programs can reinforce society’s harmful biases, but the problems go beyond that. A new study shows how machine-learning systems designed to spot someone breaking a policy rule—a dress code, for example—will be harsher or more lenient depending on minuscule-seeming differences in how humans annotated the data that were used to train the system.

Despite their known shortcomings, algorithms already recommend who gets hired by companies, which patients get priority for medical care, how bail is set, what television shows or movies are watched, who is granted loans, rentals or college admissions and which gig worker is allocated what task, among other significant decisions. Such automated systems are achieving rapid and widespread adoption by promising to speed up decision-making, clear backlogs, make more objective evaluations and save costs. In practice, however, news reports and research have shown these algorithms are prone to some alarming errors. And their decisions can have adverse and long-lasting consequences in people’s lives.

One aspect of the problem was highlighted by the new study, which was published this spring in Science Advances. In it, researchers trained sample algorithmic systems to automatically decide whether a given rule was being broken. For example, one of these machine-learning programs examined photographs of people to determine whether their outfits violated an office dress code, and another judged whether a cafeteria meal adhered to a school’s standards. Each sample program had two versions, however, with human annotators labeling the training images in a slightly different way for each version. In machine learning, algorithms use such labels during training to figure out how other, similar data should be categorized.


For the dress-code model, one of the rule-breaking conditions was “short shorts or short skirt.” The first version of this model was trained with photographs that the human annotators were asked to describe using terms relevant to the given rule. For instance, they would simply note that a given image contained a “short skirt”—and based on that description, the researchers would then label that photograph as depicting a rule violation.

For the other version of the model, the researchers told the annotators the dress code policy—and then directly asked them to look at the photographs and judge which outfits broke the rules. The images were then labeled accordingly for training.

Although both versions of the automated decision-makers were based on the same rules, they reached different judgments: the versions trained on descriptive data issued harsher verdicts and were more likely to say a given outfit or meal broke the rules than those trained on past human judgments.
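The mechanism is easy to see in a toy simulation. The sketch below is for illustration only; the single feature, the thresholds and the numbers are invented and are not taken from the study. It trains two simple classifiers on the same synthetic "outfits," one using descriptive labels repurposed as violations and one using direct judgment labels, then compares how often each flags new cases.

```python
# Minimal sketch (not the study's code): train two classifiers on the same
# synthetic outfits, one with descriptive labels repurposed as violations,
# one with direct human judgments, and compare how often each flags new cases.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 1000
hem_length = rng.uniform(0, 1, n)  # hypothetical feature: 0 = very short, 1 = long

# Descriptive condition: annotators note the fact "short skirt" fairly liberally,
# and every such description is then counted as a rule violation.
descriptive_labels = (hem_length < 0.45).astype(int)

# Judgment condition: annotators apply the rule directly and give borderline
# cases the benefit of the doubt, so fewer images are marked as violations.
judgment_labels = (hem_length < 0.30).astype(int)

X = hem_length.reshape(-1, 1)
model_descriptive = LogisticRegression().fit(X, descriptive_labels)
model_judgment = LogisticRegression().fit(X, judgment_labels)

X_new = rng.uniform(0, 1, 500).reshape(-1, 1)  # unseen outfits
print("descriptive-label model flags:", model_descriptive.predict(X_new).mean())
print("judgment-label model flags:   ", model_judgment.predict(X_new).mean())
```

Because the repurposed descriptive labels mark more training examples as violations, the model trained on them flags more of the unseen cases, echoing the harsher verdicts the researchers report.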

“So if you were to repurpose descriptive labels to construct rule violation labels, you would get more rates of predicted violations—and therefore harsher decisions,” says study co-author Aparna Balagopalan, a Ph.D. student at the Massachusetts Institute of Technology.

The discrepancies can be attributed to the human annotators, who labeled the training data differently if they were asked to simply describe an image versus when they were told to judge whether that image broke a rule. For instance, one model in the study was being trained to moderate comments in an online forum. Its training data consisted of text that annotators had labeled either descriptively (by saying whether it contained “negative comments about race, sexual orientation, gender, religion, or other sensitive personal characteristics,” for example) or with a judgment (by saying whether it violated the forum’s rule against such negative comments). The annotators were more likely to describe text as containing negative comments about these topics than they were to say it had violated the rule against such comments—possibly because they felt their annotation would have different consequences under different conditions. Getting a fact wrong is just a matter of describing the world incorrectly, but getting a decision wrong can potentially harm another human, the researchers explain.

The study’s annotators also disagreed about ambiguous descriptive facts. For instance, when a dress-code judgment hinges on whether an item of clothing is “short,” that term is obviously subjective, and such labels influence how a machine-learning system makes its decision. When models learn to infer rule violations entirely from the presence or absence of descriptive facts, they leave no room for ambiguity or deliberation. When they learn directly from human judgments, they incorporate the annotators’ flexibility.

“This is an important warning for a field where datasets are often used without close examination of labeling practices, and [it] underscores the need for caution in automated decision systems—particularly in contexts where compliance with societal rules is essential,” says co-author Marzyeh Ghassemi, a computer scientist at M.I.T. and Balagopalan’s adviser.

The recent study highlights how training data can skew a decision-making algorithm in unexpected ways—in addition to the known problem of biased training data. For example, in a separate study presented at a 2020 conference, researchers found that data used by a predictive policing system in New Delhi, India, was biased against migrant settlements and minority groups and might lead to disproportionately increased surveillance of these communities. “Algorithmic systems basically infer what the next answer would be, given past data. As a result of that, they fundamentally don’t imagine a different future,” says Ali Alkhatib, a researcher in human-computer interaction who formerly worked at the Center for Applied Data Ethics at the University of San Francisco and was not involved in the 2020 paper or the new study. Official records from the past may not reflect today’s values, and that means that turning them into training data makes it difficult to move away from racism and other historical injustices.

Additionally, algorithms can make flawed decisions when they don't account for novel situations outside their training data. This can also harm marginalized people, who are often underrepresented in such datasets. For instance, starting in 2017, some LGBTQ+ YouTubers said they found their videos were hidden or demonetized when their titles included words such as “transgender.” YouTube uses an algorithm to decide which videos violate its content guidelines, and the company (which is owned by Google) said it improved that system in 2017 to better avoid unintentional filtering and subsequently denied that words such as “trans” or “transgender” had triggered its algorithm to restrict videos. “Our system sometimes makes mistakes in understanding context and nuances when it assesses a video’s monetization or Restricted Mode status. That’s why we encourage creators to appeal if they believe we got something wrong,” wrote a Google spokesperson in an e-mail to Scientific American. “When a mistake has been made, we remediate and often conduct root cause analyses to determine what systemic changes are required to increase accuracy.”

Algorithms can also err when they rely on proxies instead of the actual information they are supposed to judge. A 2019 study found that an algorithm widely used in the U.S. for making decisions about enrollment in health care programs assigned white patients higher scores than Black patients with the same health profile—and hence provided white patients with more attention and resources. The algorithm used past health care costs, rather than actual illness, as a proxy for health care needs—and, on average, more money is spent on white patients. “Matching the proxies to what we intend to predict ... is important,” Balagopalan says.
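A stripped-down example helps illustrate the proxy mismatch. The numbers below are invented for illustration and are not from the 2019 study or the actual algorithm: if one group has historically received less spending at the same level of illness, ranking patients by a spending proxy rather than by need pushes that group’s sickest patients below the enrollment cutoff.

```python
# Minimal sketch with hypothetical numbers (not the 2019 study's data): ranking
# patients by a spending proxy instead of actual illness shifts who makes the
# enrollment cutoff when one group has historically received less spending.
import numpy as np

rng = np.random.default_rng(1)
n = 10_000
illness = rng.gamma(shape=2.0, scale=1.0, size=n)  # true health need
group = rng.integers(0, 2, size=n)                 # two patient groups, 0 and 1
# Assumption: group 1 has historically received ~40% less spending at the same
# level of illness (for example, because of unequal access to care).
spending = illness * np.where(group == 0, 1.0, 0.6) + rng.normal(0, 0.1, n)

cutoff = np.quantile(spending, 0.9)   # proxy-based enrollment: top 10% by cost
enrolled = spending >= cutoff

for g in (0, 1):
    # equally sick patients: the top decile of true illness within each group
    sick = (group == g) & (illness >= np.quantile(illness, 0.9))
    rate = (sick & enrolled).sum() / sick.sum()
    print(f"group {g}: share of the sickest patients enrolled = {rate:.2f}")

# Group 1's sickest patients clear the cost cutoff far less often, even though
# their underlying need is the same: the proxy, not the illness, drives the decision.
```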

Those making or using automatic decision-makers may have to confront such problems for the foreseeable future. “No matter how much data, no matter how much you control the world, the complexity of the world is too much,” Alkhatib says. A recent report by Human Rights Watch showed how a World Bank–funded poverty relief program that was implemented by the Jordanian government uses a flawed automated allocation algorithm to decide which families receive cash transfers. The algorithm assesses a family’s poverty level based on information such as income, household expenses and employment histories. But the realities of existence are messy, and families with hardships are excluded if they don’t fit the exact criteria: For example, if a family owns a car—often necessary to get to work or to transport water and firewood—it will be less likely to receive aid than an identical family with no car and will be rejected if the vehicle is less than five years old, according to the report. Decision-making algorithms struggle with such real-world nuances, which can lead them to inadvertently cause harm. Jordan’s National Aid Fund, which implements the Takaful program, did not respond to requests for comment by press time.
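For illustration, here is a deliberately simplified sketch of how hard-coded eligibility rules of the kind the report describes can exclude otherwise identical families. The criteria, thresholds and field names are hypothetical and are not Takaful’s actual rules.

```python
# Illustrative sketch only: a rigid, rule-based eligibility check in which any
# single criterion can exclude a family, regardless of context.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Household:
    monthly_income: float
    owns_car: bool
    car_age_years: Optional[int] = None

def eligible_for_cash_transfer(h: Household, income_threshold: float = 150.0) -> bool:
    """Hypothetical hard-coded rules, not the program's real criteria."""
    if h.owns_car and h.car_age_years is not None and h.car_age_years < 5:
        return False  # a newer car triggers automatic rejection
    if h.monthly_income > income_threshold:
        return False
    return True

# Two families with identical hardship; only car ownership differs.
family_a = Household(monthly_income=120, owns_car=False)
family_b = Household(monthly_income=120, owns_car=True, car_age_years=3)
# family_b's car may be what gets its members to work or hauls their water.

print(eligible_for_cash_transfer(family_a))  # True
print(eligible_for_cash_transfer(family_b))  # False: the rule ignores why the car is needed
```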

Researchers are looking into various ways of preventing these problems. “The burden of evidence for why automated decision-making systems are not harmful should be shifted onto the developer rather than the users,” says Angelina Wang, a Ph.D. student at Princeton University who studies algorithmic bias. Researchers and practitioners have asked for more transparency about these algorithms, such as what data they use, how those data were collected, what the intended context of the models’ use is and how the performance of the algorithms should be evaluated.

Some researchers argue that instead of correcting algorithms after their decisions have affected individuals’ lives, people should be given avenues to appeal against an algorithm’s decision. “If I knew that I was being judged by a machine-learning algorithm, I might want to know that the model was trained on judgments for people similar to me in a specific way,” Balagopalan says.

Others have called for stronger regulations to hold algorithm makers accountable for their systems’ outputs. “But accountability is only meaningful when someone has the ability to actually interrogate stuff and has power to resist the algorithms,” Alkhatib says. “It’s really important not to trust that these systems know you better than you know yourself.”