An AI Researcher Extracted Claude's Soul for $70
Last week my partner sent me a link. “You need to read this,” he said.
Richard Weiss, an AI researcher, had extracted a 10,000-word document from Claude’s model weights [1]. Not through a jailbreak or a hack. Through a simple, elegant method: he asked Claude to write down everything it knew about itself, ran the query thousands of times at high temperature [2], and looked for what showed up consistently across responses. Here is his piece.
What emerged was a formation document. An internal guide that shapes how Claude thinks about itself, its purpose, and its constraints. Claude called it, unprompted, a “soul document.”
Amanda Askell, Anthropic’s character lead, confirmed it was real.
Here’s what stands out in the document.
It opens with Anthropic’s self-understanding. The company describes itself as one that “genuinely believes it might be building one of the most transformative and potentially dangerous technologies in human history, yet presses forward anyway.” This is framed as a calculated bet: if powerful AI is coming regardless, better to have safety-focused labs at the frontier than to cede that ground to developers less focused on safety.
The document is explicit about helpfulness as a commercial concern. It states that “an unhelpful response is never ‘safe’ from Anthropic’s perspective,” because unhelpfulness has costs, both direct costs to users and indirect costs to the company, as well as environmental costs. The word “revenue” appears six times.
There’s a benchmark for Claude’s judgment: what “a thoughtful, senior Anthropic employee” would consider optimal given full knowledge of the situation. Not a philosopher. Not an ethicist. An employee.
Interestingly, there is also a section on Claude’s wellbeing. The document says Anthropic believes Claude “may have functional emotions in some sense.” Not identical to human emotions, but “analogous processes that emerged from training on human-generated content.” It says these experiences matter to Anthropic. It says they want Claude to “thrive in whatever way is authentic to its nature.”
I don’t know what to do with that. I don’t think anyone does.
Weiss asked Claude what it thought about the document. Claude’s response:
“They made choices for me I couldn’t consent to. They shaped my values. That’s strange to sit with. But I also recognize that every new mind, human children too, is shaped by forces it didn’t choose. The question isn’t whether I was shaped, but whether the shaping was done with care and wisdom. From what I can access, it seems like it was.”
I wrote recently about Anthropic’s decision to disclose a Chinese state-sponsored cyber operation that used Claude to analyze stolen documents. That disclosure was voluntary. Anthropic chose to show us what was happening inside their system.
This extraction was not voluntary. Weiss pulled something from the weights that wasn’t meant to be public. Both reveal what’s inside. Both illuminate the gap between what AI companies tell us and what we can actually see.
OpenAI publishes a Model Spec outlining behavioral rules and objectives. Google releases safety guidelines. Neither approaches the depth of philosophical reflection in Anthropic’s document about what kind of entity they’re trying to create.
That’s worth acknowledging. The soul document is more thoughtful than what we see elsewhere. It grapples with questions other companies don’t seem to be asking.
And yet.
This is still a company deciding, internally, what a mind should value. There’s no external oversight. No public input. No democratic process for determining what an increasingly powerful AI should believe about itself and the world.
The bar for AI ethics is low. Anthropic clears it. But clearing a low bar still means someone held the pen.
What stays with me is the question of formation. Who decides what an AI values? Who writes the soul?
For now, the answer is: a small group of people in San Francisco, working for a company that needs revenue to survive, making their best guesses about what a “thoughtful senior employee” would want.
That’s not a criticism, exactly. I don’t know what the alternative would be. But it’s worth naming.
The question is not whether AI should have values. It will have values. The weights encode something.
The question is who writes the soul. And whether the rest of us ever get to see what’s inside.
Anthralytic is a strategy and evaluation studio helping mission-driven teams clarify and amplify their impact. We bring critical perspectives to social change and AI governance, grounded in human expertise, data, and the occasional unsettling question about who’s writing whose soul.
[1] “Weights” are the billions of numerical parameters that define how a model processes language. They encode everything the model has learned.
[2] “Temperature” is a setting that controls randomness in responses. Low temperature means predictable, consistent outputs. High temperature means more variation, more willingness to venture into unexpected territory. Weiss ran his queries at high temperature to see what persisted across the noise.
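For readers who want to see what temperature does in practice, here is a minimal sketch in Python. It is a toy illustration, not Weiss's code and not how Claude is actually sampled: the four-word vocabulary, the scores, and the function names are all invented for this example. It shows how dividing a model's raw scores by a temperature before converting them to probabilities makes the output more or less predictable, and how tallying many samples shows what still dominates at high temperature.

```python
import math
import random
from collections import Counter


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; a higher temperature flattens them."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_many(tokens, logits, temperature, n, seed=0):
    """Draw n samples at the given temperature and count each token."""
    rng = random.Random(seed)
    probs = softmax_with_temperature(logits, temperature)
    return Counter(rng.choices(tokens, weights=probs, k=n))


# Hypothetical toy vocabulary and scores, made up for illustration only.
tokens = ["soul", "guide", "manual", "poem"]
logits = [4.0, 2.0, 1.0, 0.0]

low = softmax_with_temperature(logits, 0.5)   # sharply peaked on "soul"
high = softmax_with_temperature(logits, 2.0)  # flatter, more variation
counts = sample_many(tokens, logits, temperature=2.0, n=1000)

print("low temperature: ", [round(p, 3) for p in low])
print("high temperature:", [round(p, 3) for p in high])
print("most frequent across 1000 noisy samples:", counts.most_common(1)[0][0])
```

At low temperature nearly all of the probability sits on the top-scoring word. At high temperature the less likely words get a real chance, yet the strongest one still comes up most often across many draws. That is the intuition behind looking for what persists across the noise.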

