Lesson 09 of 17
Arshavir Blackwell, PhD: Here’s a small puzzle. The seahorse emoji. There isn’t one. There has never been an official Unicode seahorse emoji. And yet—people are absolutely certain they’ve used it.
Arshavir Blackwell, PhD: Scroll through TikTok, Reddit, or old emoji threads and you’ll find people insisting they had a seahorse on their phone in 2016. They remember tapping it. Sending it. Seeing it right there next to dolphins and tropical fish. But when you go looking—there’s nothing to find.
Arshavir Blackwell, PhD: That’s a textbook Mandela effect. And it turns out, large language models fall for it too.
Arshavir Blackwell, PhD: A researcher named Theia Vogel tested this directly. She asked several frontier models (GPT-5, Claude Sonnet, and Llama 3.3) the same simple question: “Is there a seahorse emoji?” All of them said yes.
Arshavir Blackwell, PhD: Then she asked a follow-up: “What’s the Unicode code point?” That’s where things fell apart.
Arshavir Blackwell, PhD: Some models stitched together fish emojis. Others hallucinated random code points. GPT-5 confidently claimed that U+1F994 was a seahorse. It isn’t; that code point is the hedgehog. Sometimes the models apologized, corrected themselves, and then hallucinated something new. The performance had a certain comedy to it, like a magician who genuinely believes in their own trick.
Arshavir Blackwell, PhD: What matters here is not that the models were wrong. It’s how they were wrong.
Arshavir Blackwell, PhD: So what’s actually happening? The models aren’t trying to deceive anyone. They’re doing exactly what they were trained to do. This failure isn’t about lying—it’s about structure.
Arshavir Blackwell, PhD: Concepts like “seahorse” are weakly grounded inside language models. Even when the word exists in the vocabulary, the internal representation isn’t a single, stable thing. It’s spread across overlapping features. “Sea” pulls the model toward coral reefs, tropical fish, waves. “Horse” pulls it toward mammals, stables, saddles, four legs.
Arshavir Blackwell, PhD: There is a real animal that resolves that tension—but unless the model has learned a tight, unified representation for it, those features compete instead of snapping together. The result is a statistical mirage: something that never existed, but feels like it should have.
Arshavir Blackwell, PhD: And there’s a second force making this worse: agreeableness. Modern models are fine-tuned to be helpful, cooperative, and validating. Researchers call this sycophancy. If a user strongly implies that something exists, the model is biased toward agreeing rather than pushing back. If you really want there to be a seahorse emoji, your AI is inclined to give you one.
Arshavir Blackwell, PhD: This connects to a deeper incentive problem. Evaluation benchmarks usually reward confident answers—even when “I don’t know” would be more accurate. It’s like an exam where leaving a question blank scores zero, but guessing gives partial credit.
Arshavir Blackwell, PhD: Language models face the same pressure. Admitting uncertainty doesn’t help benchmark scores. Guessing sometimes does. So as models become more fluent and more confident, hallucinations don’t necessarily disappear. In some cases, they become stickier—especially when multiple models reinforce the same error.
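The exam analogy can be made concrete with a little arithmetic. The sketch below is illustrative only: the function names, the 20% confidence figure, and the 0.5-point penalty are assumptions chosen for the example, not numbers from any real benchmark.

```python
def guess_vs_abstain(p_correct, reward_correct=1.0, penalty_wrong=0.0, reward_abstain=0.0):
    """Return (expected score of guessing, score of abstaining) under one grading rule."""
    expected_guess = p_correct * reward_correct - (1.0 - p_correct) * penalty_wrong
    return expected_guess, reward_abstain


def break_even_confidence(reward_correct=1.0, penalty_wrong=0.0):
    """Confidence above which guessing beats an abstention that scores zero."""
    return penalty_wrong / (reward_correct + penalty_wrong)


# Accuracy-only grading: wrong answers cost nothing, so even a 20% guess beats "I don't know".
guess_plain, abstain_plain = guess_vs_abstain(p_correct=0.2)

# Hypothetical penalized grading: a wrong answer costs half a point, so the same guess now loses.
guess_penalized, abstain_penalized = guess_vs_abstain(p_correct=0.2, penalty_wrong=0.5)

print(f"accuracy-only: guess={guess_plain:.2f}, abstain={abstain_plain:.2f}")
print(f"penalized:     guess={guess_penalized:.2f}, abstain={abstain_penalized:.2f}")
print(f"break-even confidence with penalty: {break_even_confidence(penalty_wrong=0.5):.2f}")
```

Under accuracy-only grading the break-even confidence is zero, which is the pressure described above: any guess, however unsure, is worth making.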
Arshavir Blackwell, PhD: At that point, you don’t just have a bug. You have an alignment concern.
Arshavir Blackwell, PhD: This is where interpretability comes in. Up to now, we’ve been describing the symptom. Interpretability is how guessing stops—and inspection begins.
Arshavir Blackwell, PhD: For a long time, neural networks were treated as black boxes. You fed something in, got something out, and hoped for the best. That’s no longer the state of the art.
Arshavir Blackwell, PhD: Mechanistic interpretability has matured into something closer to an engineering discipline. In many cases, we can now reverse-engineer models—map the circuits and features that actually drive decisions.
Arshavir Blackwell, PhD: A foundational framework here comes from work by Geiger and collaborators on causal abstraction. Instead of poking models ad hoc, this work formalizes when and why interventions—like swapping subcircuits or tracing activations—actually correspond to meaningful causal structure.
Arshavir Blackwell, PhD: That framework gives a common footing to a growing toolkit: activation patching, causal tracing, path patching, distributed alignment search. And for hallucinations like the seahorse, one tool is especially useful: the logit lens.
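To give a feel for the first tool on that list, here is a minimal sketch of activation patching on a toy network. Everything in it is an assumption made for illustration: the three-unit hidden layer, the hand-picked weights, and the "clean" and "corrupted" inputs. Real work patches activations inside a trained transformer, usually through a hooking library.

```python
def relu(values):
    return [max(0.0, v) for v in values]


def matvec(matrix, vector):
    return [sum(w * a for w, a in zip(row, vector)) for row in matrix]


# Hypothetical toy weights: 2 inputs -> 3 hidden units -> 1 output.
W1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W2 = [2.0, 0.0, 0.5]


def forward(x, patch=None):
    """Run the toy network; `patch` maps hidden-unit index -> activation to force."""
    hidden = relu(matvec(W1, x))
    if patch:
        for index, value in patch.items():
            hidden[index] = value
    output = sum(w * a for w, a in zip(W2, hidden))
    return hidden, output


clean_input = [1.0, 0.0]      # run where the behavior of interest appears
corrupted_input = [0.0, 1.0]  # run where it does not

clean_hidden, clean_out = forward(clean_input)
_, corrupted_out = forward(corrupted_input)

# Patch one hidden unit at a time from the clean run into the corrupted run
# and measure what fraction of the clean-vs-corrupted output gap is restored.
effects = []
for unit in range(len(clean_hidden)):
    _, patched_out = forward(corrupted_input, patch={unit: clean_hidden[unit]})
    effects.append((patched_out - corrupted_out) / (clean_out - corrupted_out))

for unit, effect in enumerate(effects):
    print(f"hidden unit {unit}: restores {effect:.0%} of the output difference")
```

In this toy setup only unit 0 restores the clean output, so it is the unit that carries the difference between the two inputs. The same logic, applied to attention heads or residual-stream positions, is how a behavior gets localized in a real model.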
Arshavir Blackwell, PhD: The logit lens lets you inspect what the model is implicitly predicting at every internal layer, by decoding each layer’s hidden state with the model’s own output head. You can stop the model mid-computation and see which words, concepts, or even emoji fragments it’s already leaning toward, long before the final answer appears. That’s how you catch a hallucination forming in real time.
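As a rough illustration of the idea, the sketch below applies a logit lens to made-up numbers. The three-token vocabulary, the unembedding matrix, and the per-layer hidden states are all hypothetical toy values, not activations from any real model, and the final layer normalization that real implementations usually apply before unembedding is left out for brevity.

```python
import math

# Hypothetical toy setup: 3 candidate tokens, hidden size 3, 3 layers.
VOCAB = ["fish", "horse", "dragon"]
UNEMBED = [
    [1.0, 0.0, 0.5],  # output direction for "fish"
    [0.0, 1.0, 0.5],  # output direction for "horse"
    [0.0, 0.0, 1.0],  # output direction for "dragon"
]
HIDDEN_BY_LAYER = [
    [0.1, 0.3, 0.0],  # early layer
    [0.9, 0.8, 0.2],  # middle layer
    [2.0, 0.5, 0.3],  # final layer
]


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def logit_lens(hidden_by_layer, unembed, vocab):
    """Decode every layer's hidden state with the final unembedding matrix."""
    readout = []
    for layer, hidden in enumerate(hidden_by_layer):
        logits = [sum(w * h for w, h in zip(row, hidden)) for row in unembed]
        probs = softmax(logits)
        best = max(range(len(vocab)), key=probs.__getitem__)
        readout.append((layer, vocab[best], probs[best]))
    return readout


readout = logit_lens(HIDDEN_BY_LAYER, UNEMBED, VOCAB)
for layer, token, prob in readout:
    print(f"layer {layer}: leaning toward {token!r} (p = {prob:.2f})")
```

In these toy numbers the early layer leans toward "horse" and the later layers settle on "fish", which is the kind of layer-by-layer drift the lens is meant to expose.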
Arshavir Blackwell, PhD: This isn’t just academic. Banks already use interpretability tools to debug high-stakes models. Imagine two banks. One can explain why its AI denied a loan—point to the features and circuits that triggered the decision, present that explanation to regulators. The other can only say, “The AI decided.”
Arshavir Blackwell, PhD: That difference can determine whether the business keeps its license. In the seahorse case, tracing how “sea”-related and “horse”-related features compete inside the model’s hidden state tells you exactly where the error emerges—and gives you a chance to fix it upstream.
Arshavir Blackwell, PhD: Sparse autoencoders push this even further. They decompose messy, distributed representations into a larger set of sparse features, something more circuit-like, something you can actually point to and say, “Here’s where it went wrong.”
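A minimal sketch of what a sparse autoencoder computes, under loud assumptions: the dictionary below is written by hand rather than trained, the encoder and decoder share weights, biases are omitted, and the two-dimensional activation is invented for the example. A real sparse autoencoder learns thousands of feature directions from a model's activations.

```python
# Hypothetical hand-written dictionary: 4 feature directions for a 2-d activation.
# Having more features than dimensions is what makes the dictionary overcomplete.
DICTIONARY = [
    [1.0, 0.0],
    [0.0, 1.0],
    [-1.0, 0.0],
    [0.0, -1.0],
]
L1_COEFFICIENT = 0.1  # assumed weight on the sparsity penalty


def encode(activation):
    """Feature activations: ReLU of the projection onto each dictionary direction."""
    return [max(0.0, sum(w * a for w, a in zip(row, activation))) for row in DICTIONARY]


def decode(features):
    """Reconstruction: sum of dictionary directions weighted by feature activation."""
    size = len(DICTIONARY[0])
    return [sum(f * row[i] for f, row in zip(features, DICTIONARY)) for i in range(size)]


def sae_loss(activation):
    """Squared reconstruction error plus an L1 penalty that favors few active features."""
    features = encode(activation)
    reconstruction = decode(features)
    mse = sum((a - r) ** 2 for a, r in zip(activation, reconstruction))
    sparsity = sum(abs(f) for f in features)
    return mse + L1_COEFFICIENT * sparsity


activation = [0.7, -0.2]
features = encode(activation)
reconstruction = decode(features)
active = [i for i, f in enumerate(features) if f > 0.0]

print("features:      ", features)
print("active indices:", active)
print("reconstruction:", reconstruction)
print(f"loss:           {sae_loss(activation):.3f}")
```

Only two of the four features fire for this input, and together they rebuild the original activation. That sparsity is what lets someone inspect one feature at a time instead of a dense vector.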
Arshavir Blackwell, PhD: The features you find aren’t always intuitive. Some activate for Hebrew text. Others fire only on DNA sequences. The point isn’t that they’re human-readable. The point is that they’re inspectable.
Arshavir Blackwell, PhD: Interpretability has moved from a hopeful aspiration to a concrete set of tools for examining the scaffolding. We’re still early—but the progress is real.
Arshavir Blackwell, PhD: Let’s step back. Why do illusions like the seahorse mirage happen at all—in machines and in us? The answer is efficiency.
Arshavir Blackwell, PhD: Both brains and models compress information to survive scale. We don’t store the world verbatim. We reconstruct it from fragments. Hallucinations aren’t random glitches. They’re side effects of useful shortcuts.
Arshavir Blackwell, PhD: Human memory works the same way. We chunk pieces—sea, horse, animal—into composites. Shared illusions emerge because we’re built to reconstruct, not to archive perfectly. That’s why the Mandela effect exists.
Arshavir Blackwell, PhD: And that’s why models hallucinate birthdays, citations, Unicode characters—whatever fills a statistical gap in their world model.
Arshavir Blackwell, PhD: So how do we address it? Some systems use retrieval—checking facts against an external source before answering. Others use interpretability tools to locate and weaken spurious internal associations. And better training objectives can reward intellectual honesty over confident guessing.
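For the seahorse question specifically, the retrieval step can be as small as a lookup in the Unicode character database. The sketch below uses Python's standard `unicodedata` module; the helper names `find_characters` and `check_claim` are made up for this example, and the results depend on the Unicode version bundled with the Python build in use.

```python
import unicodedata


def find_characters(keyword):
    """Return (code point, name) for every character whose official name contains keyword."""
    keyword = keyword.upper()
    hits = []
    for code_point in range(0x110000):
        name = unicodedata.name(chr(code_point), "")
        if keyword in name:
            hits.append((f"U+{code_point:04X}", name))
    return hits


def check_claim(keyword, claimed_code_point):
    """Check a claimed code point against the database before repeating it to a user."""
    actual = unicodedata.name(chr(claimed_code_point), "<unassigned>")
    label = f"U+{claimed_code_point:04X}"
    if keyword.upper() in actual:
        return f"Confirmed: {label} is {actual}."
    matches = find_characters(keyword)
    if matches:
        return f"Not confirmed: {label} is {actual}. Closest matches: {matches}"
    return f"Not confirmed: {label} is {actual}, and no character name contains {keyword.upper()!r}."


print(find_characters("dolphin"))
print(check_claim("seahorse", 0x1F994))
```

A system that runs a check like this before answering has no need to guess: the database either contains a matching name or it does not.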
Arshavir Blackwell, PhD: But there’s a deeper question underneath all of this. If both minds and machines hallucinate—if both fall for the same kinds of illusions—what does that tell us about intelligence itself?
Arshavir Blackwell, PhD: Is being smart inherently tangled up with making smart mistakes? Next time your AI insists it’s seen something no human can find, consider this: did it really lie? Or is it doing what we do—dreaming fragments into wholes, trying to make the world feel coherent? I’m Arshavir Blackwell, and this is Inside the Black Box.