New Anthropic Research Sheds Light on AI's 'Black Box'

Despite the fact that they’re created by humans, large language models are still quite mysterious. The high-octane algorithms that power our current artificial intelligence boom have a way of doing things that aren’t outwardly explicable to the people observing them. This is why AI has largely been dubbed a “black box,” a phenomenon that isn’t easily understood from the outside.

Newly published research from Anthropic, one of the top companies in the AI industry, attempts to shed some light on the more confounding aspects of AI’s algorithmic behavior. On Tuesday, Anthropic published a research paper designed to explain why its AI chatbot, Claude, chooses to generate content about certain subjects over others.

AI systems are set up in a rough approximation of the human brain—layered neural networks that intake and process information and then make “decisions” or predictions based on that information. Such systems are “trained” on large subsets of data, which allows them to make algorithmic connections. When AI systems output data based on their training, however, human observers don’t always know how the algorithm arrived at that output.

This mystery has given rise to the field of AI “interpretation,” where researchers attempt to trace the path of the machine’s decision-making so they can understand its output. In the field of AI interpretation, a “feature” refers to a pattern of activated “neurons” within a neural net—effectively a concept that the algorithm may refer back to. The more “features” within a neural net that researchers can understand, the more they can understand how certain inputs trigger the net to affect certain outputs.

In a memo on its findings, Anthropic researchers explain how they used a process known as “dictionary learning” to decipher what parts of Claude’s neural network mapped to specific concepts. Using this method, researchers say they were able to “begin to understand model behavior by seeing which features respond to a particular input, thus giving us insight into the model’s ‘reasoning’ for how it arrived at a given response.”

In an interview with Anthropic’s research team conducted by Wired’s Steven Levy, staffers explained what it was like to decipher how Claude’s “brain” works. Once they had figured out how to decrypt one feature, it led to others:

One feature that stuck out to them was associated with the Golden Gate Bridge. They mapped out the set of neurons that, when fired together, indicated that Claude was “thinking” about the massive structure that links San Francisco to Marin County. What’s more, when similar sets of neurons fired, they evoked subjects that were Golden Gate Bridge-adjacent: Alcatraz, California Governor Gavin Newsom, and the Hitchcock movie Vertigo, which was set in San Francisco. All told the team identified millions of features—a sort of Rosetta Stone to decode Claude’s neural net.

It should be noted that Anthropic, like other for-profit companies, could have certain, business-related motivations for writing and publishing its research in the way that it has. That said, the team’s paper is public, which means that you can go read it for yourself and make your own conclusions about their findings and methodologies.