
New Tool Reveals How AI Makes Decisions

Large language models such as ChatGPT tend to make things up. A new approach now allows the systems to explain their responses—at least partially


When I asked ChatGPT to name the editor in chief of Spektrum der Wissenschaft, the German-language sister publication of Scientific American, the answer was, “I have the latest information until 2021, and at that time Dr. Florian Freistetter was the editor-in-chief of Spektrum der Wissenschaft. However, it is possible that the information has changed since then.” This highlights one of the biggest drawbacks of current language-generating artificial intelligence programs: they “hallucinate.” While Spektrum der Wissenschaft features a lot of Freistetter’s work, he was never a staff member, let alone editor in chief, of the magazine. That’s why it is important to work on so-called explainable AI (XAI) models that can justify their answers and thus become more transparent.

Most AI programs function like a “black box.” “We know exactly what a model does but not why it has now specifically recognized that a picture shows a cat,” computer scientist Kristian Kersting of the Technical University of Darmstadt in Germany told the German-language newspaper Handelsblatt. That dilemma prompted Kersting—along with computer scientists Patrick Schramowski of the Technical University of Darmstadt and Björn Deiseroth, Mayukh Deb and Samuel Weinbach, all at the Heidelberg, Germany–based AI company Aleph Alpha—to introduce an algorithm called AtMan earlier this year. AtMan allows large AI systems such as ChatGPT, Dall-E and Midjourney to finally explain their outputs.

In mid-April 2023 Aleph Alpha integrated AtMan into its own language model Luminous, allowing the AI to explain its output. Those who want to try their hand at it can use the Luminous playground for free for tasks such as summarizing text or completing an input. For example, “I like to eat my burger with” is followed by the answer “fries and salad.” Then, thanks to AtMan, it is possible to determine which input words led to the output: in this case, “burger” and “like.”


AtMan’s explanatory power is limited to the input data, however. It can indeed explain that the words “burger” and “like” most strongly led Luminous to complete the input with “fries and salad.” But it cannot explain how Luminous knows that burgers are often eaten with fries and salad. That knowledge remains hidden in the data with which the model was trained.

AtMan also cannot debunk all of the lies (the so-called hallucinations) told by AI systems—such as that Florian Freistetter is my boss. Nevertheless, the ability to explain AI reasoning from input data offers enormous advantages. For example, it is possible to quickly check whether an AI-generated summary is correct—and to ensure the AI hasn’t added anything. Such an ability also plays an important role from an ethical perspective. “If a bank uses an algorithm to calculate a person’s creditworthiness, for example, it is possible to check which personal data led to the result: Did the AI use discriminatory characteristics such as skin color, gender, and so on?” says Deiseroth, who co-developed AtMan.

Moreover, AtMan is not limited to pure language models. It can also be used to examine the output of AI programs that generate or process images. This applies not only to programs such as Dall-E but also to algorithms that analyze medical scans in order to diagnose various disorders. Such a capability makes an AI-generated diagnosis more comprehensible. Physicians could even learn from the AI if it were to recognize patterns that previously eluded humans.

AI Algorithms Are a “Black Box”

“AI systems are being developed extremely quickly and sometimes integrated into products too early,” says Schramowski, who was also involved in the development of AtMan. “It’s important that we understand how an AI arrives at a conclusion so that we can improve it.” That’s because algorithms are still a “black box”: while researchers understand how they generally function, it’s often unclear why a specific output follows a particular input. Worse, if the same input is run through a model several times in a row, the output can vary. The reason for this is the way AI systems work.

Modern AI systems—such as language models, machine translation programs or image-generating algorithms—are constructed from neural networks. The structure of these networks is based on the visual cortex of the brain, in which individual cells called neurons pass signals to one another via connections called synapses. In a neural network, computing units act as the “neurons,” and they are arranged in several layers, one after the other. As in the brain, the connections between these artificial neurons are called “synapses,” and each one is assigned a numerical value called its “weight.”

If, for example, a user wants to pass an image to such a program, the visual is first converted into a list of numbers where each pixel corresponds to an entry. The neurons of the first layer then accept these numerical values.


A neural network is an algorithm with a structure that is modeled on that of the human brain. It consists of computing units that act like neurons and are labeled “n” (for neuron) and “h” (for hidden neuron), as well as suitable weights labeled “w,” which are determined by the training. Credit: Manon Bischoff/Spektrum der Wissenschaft, styled by Scientific American

Next, the data pass through the neural network layer by layer: the value of a neuron in one layer is multiplied by the weight of the synapse and transferred to the neuron in the next layer. There, if necessary, the result is added to the values arriving over other synapses that end at the same neuron. Thus, the program processes the original input layer by layer until the neurons of the last layer provide an output—for example, whether there is a cat, dog or seagull in the image.


An image with four pixels, which can be processed by a neural network. Credit: Manon Bischoff/Spektrum der Wissenschaft, styled by Scientific American
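For readers who want to see the arithmetic spelled out, here is a minimal sketch in Python (using NumPy) of such a forward pass for a four-pixel image like the one above. The layer sizes, weights and class labels are invented purely for illustration; they do not correspond to any real model.

```python
import numpy as np

def relu(x):
    # simple nonlinearity applied after each weighted sum
    return np.maximum(0, x)

# A four-pixel image flattened into a list of numbers (values are invented)
pixels = np.array([0.2, 0.9, 0.4, 0.7])

# Randomly chosen weights: 4 input neurons -> 3 hidden neurons -> 3 output neurons
rng = np.random.default_rng(0)
w_hidden = rng.normal(size=(4, 3))   # "synapses" between input and hidden layer
w_output = rng.normal(size=(3, 3))   # "synapses" between hidden and output layer

# Each hidden neuron adds up the values arriving over its synapses (value * weight) ...
hidden = relu(pixels @ w_hidden)
# ... and passes its result on to the next layer in the same way.
scores = hidden @ w_output

# Turn the three output values into probabilities for "cat", "dog" and "seagull"
probs = np.exp(scores) / np.exp(scores).sum()
print(dict(zip(["cat", "dog", "seagull"], probs.round(3))))
```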

But how do you make sure that a network processes the input data in a way that produces a meaningful result? For this, the weights—the numerical values of the synapses—must be calibrated correctly. If they are set appropriately, the program can describe a wide variety of images. You don’t configure the weights yourself; instead you subject the AI to training so that it finds values that are as suitable as possible.

This works as follows: The neural network starts with a random selection of weights. Then the program is presented with tens of thousands or hundreds of thousands of sample images, all with corresponding labels such as “seagull,” “cat” and “dog.” The network processes the first image and produces an output that it compares with the given label. If the result differs from the label (which is almost certain at the beginning), so-called backpropagation kicks in. This means the algorithm moves backward through the network, tracking which weights significantly influenced the result—and modifying them. The algorithm repeats this combination of processing, checking and weight adjustment with all the training data. If the training is successful, the algorithm is then able to correctly describe even previously unseen images.
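The training cycle described above can be sketched in a few lines of Python with the PyTorch library. The toy network, the random stand-in “images” and their labels are placeholders chosen only to show the loop of processing, checking and weight adjustment, not the setup of any model mentioned in this article.

```python
import torch
from torch import nn

# A toy classifier: four pixel values in, three classes ("cat", "dog", "seagull") out.
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 3))
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

# Stand-in training data: random "images" with random labels.
images = torch.rand(100, 4)
labels = torch.randint(0, 3, (100,))

for epoch in range(20):
    outputs = model(images)          # 1. process the images with the current weights
    loss = loss_fn(outputs, labels)  # 2. compare the outputs with the given labels
    optimizer.zero_grad()
    loss.backward()                  # 3. backpropagation: trace which weights shaped the error
    optimizer.step()                 # 4. adjust those weights slightly
```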

Two Methods for Understanding AI Results

Often, however, it is not merely the AI’s answer that is interesting but also what information led it to its judgment. In the medical field, for example, one would like to know why a program believes it has detected signs of a disease in a scan. To find out, one could in principle look inside the trained model itself, because it contains all of that information. But modern neural networks have hundreds of billions of parameters—so it’s impossible to keep track of all of them.

Nevertheless, there are ways to make an AI’s results more transparent. One approach relies on backpropagation: as in the training process, one traces back how the output was generated from the input data. To do this, one follows the “synapses” with the largest weights backward through the network and can thus infer which of the original input data most influenced the result.
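One common concrete form of this backward-tracing idea is a gradient-based “saliency” computation, sketched below in Python with PyTorch as an assumption rather than the exact procedure used by the researchers: the gradient of the chosen output with respect to the input indicates which input values most influenced the result. The tiny model here is a made-up stand-in for a trained classifier.

```python
import torch
from torch import nn

# A made-up "trained" classifier that takes a four-pixel image.
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 3))

image = torch.rand(4, requires_grad=True)   # track gradients with respect to the input
scores = model(image)
top_class = scores.argmax()

# Trace the chosen output backward through the network to the input pixels.
scores[top_class].backward()

# Pixels with a large gradient magnitude influenced the decision the most.
saliency = image.grad.abs()
print(saliency)
```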

Another method is to use a perturbation model, in which human testers can change the input data slightly and observe how this changes the AI’s output. This makes it possible to learn which input data influenced the result most.
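The perturbation approach can be sketched just as briefly: blank out one part of the input at a time, rerun the model, and record how much the output changes. The model and input below are again invented only to show the principle.

```python
import torch
from torch import nn

# Again a made-up model and input, just to show the principle.
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 3))
image = torch.rand(4)

with torch.no_grad():
    baseline = model(image)
    top_class = baseline.argmax()

    # Blank out each "pixel" in turn and see how strongly the score of the
    # predicted class drops; a large drop means that pixel mattered.
    for i in range(image.numel()):
        perturbed = image.clone()
        perturbed[i] = 0.0
        change = baseline[top_class] - model(perturbed)[top_class]
        print(f"pixel {i}: score change {change.item():+.3f}")
```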

These two XAI methods have been widely used. But they fail with large AI models such as ChatGPT, Dall-E or Luminous, which have several billion parameters. Backpropagation, for example, demands too much memory: to traverse the network backward, one would have to keep a record of the many billions of parameters along the way. That is feasible while training an AI in a huge data center, but the same method cannot be run over and over again just to check an individual input.

In the perturbation model the limiting factor is not memory but rather computing power. If one wants to know, for example, which area of an image was decisive for an AI’s response, one would have to vary each pixel individually and generate a new output from it in each instance. This requires a lot of time, as well as computing power that is not available in practice.

To develop AtMan, Kersting’s team successfully adapted the perturbation model for large AI systems so that the necessary computing power remains manageable. Unlike conventional algorithms, AtMan does not vary the input values directly but instead modifies data that has already passed several layers into the network. This saves a considerable number of computing steps.

An Explainable AI for Transformer Models

To understand how this works, you need to know how AI models such as ChatGPT function. These are a specific type of neural network, called transformer networks. They were originally developed to process natural language, but they are now also used in image generation and recognition.

The most difficult task in processing language is to convert words into suitable mathematical representations. For images, this step is simple: they are turned into a long list of pixel values, and if the entries of two lists are close to each other, the lists correspond to visually similar images. A similar procedure must be found for words: semantically similar words such as “house” and “cottage” should have similar representations, while similarly spelled words with different meanings, such as “house” and “mouse,” should be further apart in their mathematical form.


When creating a language model, one of the most difficult tasks is to represent words appropriately. Expressions that are similar in meaning should also be similar in their mathematical representation. Credit: Manon Bischoff/Spektrum der Wissenschaft, styled by Scientific American
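The following toy example illustrates what “similar mathematical representation” means. The three-dimensional vectors are invented; real embeddings have hundreds or thousands of dimensions. Cosine similarity measures how close two vectors are.

```python
import numpy as np

# Invented three-dimensional "embeddings"; real models use far more dimensions.
embeddings = {
    "house":   np.array([0.81, 0.10, 0.30]),
    "cottage": np.array([0.77, 0.15, 0.28]),   # similar meaning -> similar vector
    "mouse":   np.array([0.05, 0.92, 0.40]),   # similar spelling, different meaning
}

def cosine_similarity(a, b):
    # 1.0 means the vectors point in the same direction, 0.0 means they are unrelated
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(embeddings["house"], embeddings["cottage"]))  # high (about 1.0)
print(cosine_similarity(embeddings["house"], embeddings["mouse"]))    # much lower
```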

Transformers can master this challenging task: they convert words into a particularly suitable mathematical representation. This requires a lot of work, however. Developers have to feed the network enormous amounts of text so that it learns which words appear in similar contexts and are thus semantically similar.

It’s All about Attention

But that alone is not enough. You also have to make sure that the AI understands a longer input after training. For example, take the first lines of the German-language Wikipedia entry on Spektrum der Wissenschaft. They translate roughly to “Spektrum der Wissenschaft is a popular monthly science magazine. It was founded in 1978 as a German-language edition of Scientific American, which has been published in the U.S. since 1845, but over time has taken on an increasingly independent character from the U.S. original.” How does the language model know what “U.S.” and “original” refer to in the second sentence? In the past, most neural networks failed at such tasks—that is, until 2017, when experts at Google Brain introduced a new type of network architecture based solely on the so-called attention mechanism, the core of transformer networks.

Attention enables AI models to recognize the most important information in an input: Which words are related? What content is most relevant to the output? Thus, an AI model is able to recognize references between words that are far apart in the text. To do this, attention takes each word in a sentence and relates it to every other word. So for the sentence in the example from Wikipedia, the model starts with “Spektrum” and compares it to all the other words in the entry, including “is,” “science,” and so on. This process allows a new mathematical representation of the input words to be found—and one that takes into account the content of the sentence. This attention step occurs both during training and in operation when users type something.


An illustration of the attention mechanism in an AI model. Credit: Manon Bischoff/Spektrum der Wissenschaft, styled by Scientific American
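At its core, the attention step is a small amount of matrix arithmetic. The sketch below implements scaled dot-product attention, the operation at the heart of the 2017 transformer architecture; the word vectors and projection matrices are random placeholders rather than learned values.

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in embeddings for the words of a short input; one 8-dimensional vector per word.
words = ["Spektrum", "der", "Wissenschaft", "is", "a", "magazine"]
x = rng.normal(size=(len(words), 8))

# Learned projection matrices (random here) turn each word into a query, a key and a value.
w_q, w_k, w_v = (rng.normal(size=(8, 8)) for _ in range(3))
q, k, v = x @ w_q, x @ w_k, x @ w_v

# Relate every word to every other word: entry (i, j) says how much attention
# word i pays to word j.
scores = q @ k.T / np.sqrt(k.shape[1])
weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)  # softmax per row

# New representation of each word: a weighted mix of all the words' values.
output = weights @ v
print(weights.round(2))   # each row sums to 1; large entries mark strong relationships
```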

This is how language models such as ChatGPT or Luminous are able to process an input and generate a response from it. By determining what content to pay attention to, the program can calculate which words are most likely to follow the input.

Shifting the Focus in a Targeted Manner

This attention mechanism can be used to make language models more transparent. AtMan, named after the idea of “attention manipulation,” specifically manipulates how much attention an AI pays to certain input words. It can direct attention toward certain content and away from other content. This makes it possible to see which parts of the input were crucial for the output—without consuming too much computing power.

For instance, researchers can pass the following text to a language model: “Hello, my name is Lucas. I like soccer and math. I have been working on ... for the past few years.” The model originally completed this sentence by filling in the blank with “my degree in computer science.” When the researchers told the model to increase its attention to “soccer,” the output changed to “the soccer field.” When they increased attention to “math,” they got “math and science.”
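The principle behind such attention manipulation can be illustrated by extending the attention sketch above: before the softmax, the score for one chosen token is nudged up or down, and one observes how much the resulting representations shift. This is only a schematic illustration of the idea described in the article, not Aleph Alpha’s actual AtMan implementation, and the vectors are random placeholders.

```python
import numpy as np

def attention(q, k, v, token=None, factor=1.0):
    # Scaled dot-product attention with an optional manipulation: adding
    # log(factor) to the scores for one token multiplies its unnormalized
    # softmax weight by `factor` (factor < 1 suppresses it, factor > 1 amplifies it).
    scores = q @ k.T / np.sqrt(k.shape[1])
    if token is not None:
        scores[:, token] += np.log(factor)
    weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(2)
tokens = ["I", "like", "soccer", "and", "math"]
q, k, v = (rng.normal(size=(len(tokens), 8)) for _ in range(3))

normal = attention(q, k, v)
less_soccer = attention(q, k, v, token=tokens.index("soccer"), factor=0.1)
more_math = attention(q, k, v, token=tokens.index("math"), factor=5.0)

# How far the representations move when attention to a token is damped or
# boosted indicates how strongly that token influences the model's output.
print(np.linalg.norm(normal - less_soccer), np.linalg.norm(normal - more_math))
```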

Thus, AtMan represents an important advance in the field of XAI and can bring us closer to understanding AI systems. But it still does not save language models from wild hallucination—and it cannot explain why ChatGPT believes that Florian Freistetter is editor in chief of Spektrum der Wissenschaft.

It can at least be used to control what content the AI does and doesn’t take into account, however. “This is important, for example, in algorithms that assess a person’s creditworthiness,” Schramowski explains. “If a program bases its results on sensitive data such as a person’s skin color, gender or origin, you can specifically turn off the focus on that.” AtMan can also raise questions if it reveals that an AI program’s output is only minimally influenced by the content passed to it. In that case, the AI has evidently drawn all of its generated content from the training data. “You should then check the results thoroughly,” Schramowski says.

AtMan can process not only text data in this way but any kind of data that a transformer model works with. For example, the algorithm can be combined with an AI that provides descriptions of images. This can be used to find out which areas of an image led to the description provided. In their publication, the researchers looked at a photograph of a panda—and found the AI based its description of “panda” mainly on the animal’s face.

“And it seems like AtMan can do even more,” says Deiseroth, who also helped develop the algorithm. “You could use the explanations from AtMan specifically to improve AI models.” Past work has already shown that smaller AI systems produce better results when trained to provide good reasoning. Now it remains to be investigated whether the same is true for AtMan and large transformer models. “But we still need to check that,” Deiseroth says.

This article originally appeared in Spektrum der Wissenschaft and was reproduced with permission.