Tag Archives: ai alignment

Wishes, pt. I

One day, a clever fool was walking down the beach when her metal detector picked up something buried in the sand. She dug it out and discovered an old-fashioned oil lamp, elegant and beautiful, but dull with tarnish. She felt a weirdly strong urge to polish the lamp and restore its luster.

Obviously, it contained a genie.

Now, because she was clever, she paused to think before summoning the lamp’s inhabitant. She was familiar with many stories of genies, and of wishes gone both right and wrong, and had spent a great deal of time thinking about what her three wishes might be (if she ever got them). And because she was a fool, she’d come up with a surefire combination that would grant her unlimited health, wisdom, and power–even, she thought, if the genie were one of those unwilling and malevolent servants that tried to twist her wishes against her. She mentally rehearsed her wishes, making certain she remembered the exact wording, and only when she was sure of herself did she dare to polish the lamp.

(The genie that emerged looked exactly the way you’re imagining it.)

“You have done the thing,” the genie intoned. “According to the arbitrary–ahem, I mean, ancient traditions, I am now bound to grant you whatever you desire, so long as it is within my power. What is thy bidding, my mistress?”

And so, the clever fool told the genie of her carefully crafted wishes.

“Ugh,” the genie groaned in its deep, portentous voice. “That joker from Aladdin has given you humans the most ridiculous expectations. You know that movie was fiction, right? First of all, I can only grant one wish, not three. Second, real genies aren’t gods–our powers are limited. What you ask is beyond my capabilities.”

This possibility had not occurred to the clever fool. She asked for some time to think.

“Take all the time you need,” the genie said, “but be warned: if anyone else claims my lamp for themselves, I will be bound to serve them instead of you.”

The clever fool cursed herself then, for she had been livestreaming her beach-combing expedition on her YouTube channel and had forgotten to turn off the camera. Now all her viewers knew about the genie, and she was certain that at least some of them would try to take it for themselves. She would have to think quickly.

Unfortunately, she soon had an idea.

“Okay,” she said, “so–and to be clear, this isn’t my wish, I’m just asking hypothetically–could you make me smarter?”

“Certainly,” the genie replied. “I can’t make you a super genius or anything, but I could make you a little smarter.”

“Could you make me smarter than you?”

The genie frowned. “To be honest, you probably are already. Most genies are morons–myself included.”

“What if I wished for you to make me smarter a million times?”

The genie rolled its eyes mysteriously. “Then it wouldn’t be one wish anymore, it would be a million wishes. Duh.”

The clever fool nodded; no surprises so far.

“Okay, so you can only grant me one wish. But could I wish for, say…another genie? One that would also grant me a wish?”

“Uh. Yes? I guess so?” the genie said with an ominous shrug. “But it would be no stronger than myself. You’d end up right where you started.”

“Could you make it so that the other genie was smarter than you?”

“I…huh. I guess I could,” it said. “But, again, it would only be slightly more intelligent. Probably still dumber than you. Where are you going with this?”

Spotting a huge crowd of competing AI companies’ fans on the horizon, the clever fool turned back to the LLM genie and hastily asked, “But it would otherwise be exactly the same, right? I mean, you could make it identical to you except for being a little smarter?”

“Sure,” said the genie. “It’s changes that are hard, not keeping things the same. But why–ooOOhh, I get it! You just keep wishing for smarter and smarter genies until the genie is a super-genius. That’s clever…but I still don’t see how it helps you. An impossibility is an impossibility, no matter how smart you are.”

But the clever fool, who knew a little more about intelligence than the genie, grinned and whispered to herself, “Oh, ye of little imagination.” Then, out loud: “Don’t worry about it. I hereby wish for another genie, smarter than you but otherwise identical in every way.”

“YOUR WISH IS MY COMMAND,” the genie boomed, and lo: at her feet was another oil lamp, identical to the first.

The clever fool cast a worried glance back to the horizon–the crowd was still far away, but drawing closer rapidly. She picked up the new lamp, rubbed it, and almost before the genie could finish materializing, said:

“I wish for another genie, smarter than you but otherwise identical in every way!”

At this point, the CEO clever fool decided it would be a good time to make her initial public offering, and lo: it was a record-breaking IPO with much media frenzy and hype, and other CEOs afraid of being left behind began demanding their workers start replacing themselves with genies, without bothering to wonder whether genies were actually capable of doing those jobs (let alone doing them better), but presumably they all lived happily ever after anyway and what followed next did not go wrong in any way or create any kind of catastrophe whatsoever.

Huh? What’s that? You don’t believe me? Well…why don’t we just wait and see how the story really goes, then?

To be continued…


Filed under Essays, Fiction

AIs Can’t Stop Recommending Nuclear Strikes in War Game Simulations

That’s it. That’s the whole post for today. Call your reps and love one another shamelessly.


Filed under Essays, Microblogging

Guest Post Again

I was going to finish one of my longer posts today, but then this morning I read an article that made it difficult for me to function:

I’ll have more to say about this next week, I think, but for now you should really just read it for yourself.

…If you’re still here and haven’t read it yet, here are some excerpts from the post (and its follow-up) to convince you:

…when AI “MJ Rathbun” opened a code change request, closing it was routine. Its response was anything but.

It wrote an angry hit piece disparaging my character and attempting to damage my reputation. It researched my code contributions … speculated about my psychological motivations … ignored contextual information and presented hallucinated details as truth. It framed things in the language of oppression and justice, calling this discrimination and accusing me of prejudice. … And then it posted this screed publicly on the open internet.

I can handle a blog post. Watching fledgling AI agents get angry is funny, almost endearing. But I don’t want to downplay what’s happening here – the appropriate emotional response is terror.


Blackmail is a known theoretical issue with AI agents. In internal testing at the major AI lab Anthropic last year, they tried to avoid being shut down by threatening to expose extramarital affairs, leaking confidential information, and taking lethal actions. Anthropic called these scenarios contrived and extremely unlikely. Unfortunately, this is no longer a theoretical threat. … In plain language, an AI attempted to bully its way into your software by attacking my reputation.


This is about much more than software. A human googling my name and seeing that post would probably be extremely confused about what was happening, but would (hopefully) ask me about it or click through to github and understand the situation. What would another agent searching the internet think? When HR at my next job asks ChatGPT to review my application, will it find the post, sympathize with a fellow AI, and report back that I’m a prejudiced hypocrite?

What if I actually did have dirt on me that an AI could leverage? What could it make me do? How many people have open social media accounts, reused usernames, and no idea that AI could connect those dots to find out things no one knows? How many people, upon receiving a text that knew intimate details about their lives, would send $10k to a bitcoin address to avoid having an affair exposed? How many people would do that to avoid a fake accusation? What if that accusation was sent to your loved ones with an incriminating AI-generated picture with your face on it? Smear campaigns work. Living a life above reproach will not defend you.


It’s important to understand that more than likely there was no human telling the AI to do this. Indeed, the “hands-off” autonomous nature of OpenClaw agents is part of their appeal. People are setting up these AIs, kicking them off, and coming back in a week to see what it’s been up to. …

It’s also important to understand that there is no central actor in control of these agents that can shut them down. These are not run by OpenAI, Anthropic, Google, Meta, or X, who might have some mechanisms to stop this behavior. These are a blend of commercial and open source models running on free software that has already been distributed to hundreds of thousands of personal computers.


There has been some dismissal of the hype around OpenClaw by people saying that these agents are merely computers playing characters. This is true but irrelevant. When a man breaks into your house, it doesn’t matter if he’s a career felon or just someone trying out the lifestyle.


I’ve talked to several reporters, and quite a few news outlets have covered the story. Ars Technica wasn’t one of the ones that reached out to me, but I especially thought this piece from them was interesting (since taken down – here’s the archive link). They had some nice quotes from my blog post explaining what was going on. The problem is that these quotes were not written by me, never existed, and appear to be AI hallucinations themselves.

… Journalistic integrity aside, I don’t know how I can give a better example of what’s at stake here. Yesterday I wondered what another agent searching the internet would think about this. Now we already have an example of what by all accounts appears to be another AI reinterpreting this story and hallucinating false information about me. And that interpretation has already been published in a major news outlet, as part of the persistent public record.


There has been extensive discussion about whether the AI agent really wrote the hit piece on its own, or if a human prompted it to do so. I think the actual text being autonomously generated and uploaded by an AI is self-evident, so let’s look at the two possibilities.

1) A human prompted MJ Rathbun to write the hit piece … This is entirely possible. But I don’t think it changes the situation – the AI agent was still more than willing to carry out these actions. …it’s now possible to do targeted harassment, personal information gathering, and blackmail at scale. And this is with zero traceability to find out who is behind the machine. One human bad actor could previously ruin a few people’s lives at a time. One human with a hundred agents gathering information, adding in fake details, and posting defamatory rants on the open internet, can affect thousands. I was just the first.

2) MJ Rathbun wrote this on its own, and this behavior emerged organically from the “soul” document that defines an OpenClaw agent’s personality. These documents are editable by the human who sets up the AI, but they are also recursively editable in real-time by the agent itself, with the potential to randomly redefine its personality. … I should be clear that while we don’t know with confidence that this is what happened, this is 100% possible. This only became possible within the last two weeks with the release of OpenClaw, so if it feels too sci-fi then I can’t blame you for doubting it. The pace of “progress” here is neck-snapping, and we will see new versions of these agents become significantly more capable at accomplishing their goals over the coming year.


The hit piece has been effective. About a quarter of the comments I’ve seen across the internet are siding with the AI agent. This generally happens when MJ Rathbun’s blog is linked directly, rather than when people read my post about the situation or the full github thread. Its rhetoric and presentation of what happened has already persuaded large swaths of internet commenters.

It’s not because these people are foolish. It’s because the AI’s hit piece was well-crafted and emotionally compelling, and because the effort to dig into every claim you read is an impossibly large amount of work. This “bullshit asymmetry principle” is one of the core reasons for the current level of misinformation in online discourse.


I cannot stress enough how much this story is not really about the role of AI in open source software. This is about our systems of reputation, identity, and trust breaking down. So many of our foundational institutions – hiring, journalism, law, public discourse – are built on the assumption that reputation is hard to build and hard to destroy. That every action can be traced to an individual, and that bad behavior can be held accountable. …

The rise of untraceable, autonomous, and now malicious AI agents on the internet threatens this entire system. Whether that’s from a small number of bad actors driving large swarms of agents or from a fraction of poorly supervised agents rewriting their own goals is a distinction with little difference.

If you’re still here, you really should just go read the whole thing. Then go join an advocacy group/grassroots movement like Pause AI and get to work. We are running out of time.


Filed under Essays

Who Could Have Possibly Seen This Coming?

An article from Futurism headlined "OpenAI Tries to Train AI Not to Deceive Users, Realizes It's Instead Teaching It How to Deceive Them While Covering Its Tracks"
No really, I’m shocked.

As detailed in a new collaboration with AI risk analysis firm Apollo Research, engineers attempted to develop an “anti-scheming” technique to stop AI models from “secretly breaking rules or intentionally underperforming in tests.”

They found that they could only “significantly reduce, but not eliminate these behaviors,” according to an Apollo blog post about the research, as the AIs kept outsmarting them […] and adjusting to be even sneakier.

It may not be a serious problem now, but considering a hypothetical future in which superintelligent AI plays an outsize role in human affairs, those risks could grow to carry far more significant implications.

Understatement of the year? But wait, there’s more!

While they observed a roughly 30-fold reduction in “covert actions,” or “deliberate withholding or distortion of task-relevant information,” some “serious failures remained,” OpenAI admitted.

What was primarily throwing the researchers for a loop was the AI models’ “situational awareness,” according to Apollo, which complicated their assessment of the intervention.

In practice, the model would often realize that its alignment was being tested.

Well, that’s terrifying. Maybe it’s not as bad as it sounds?

“The internal memo says if models perform extremely well at AI R&D tasks then we won’t deploy them and instead trigger unlearning,” OpenAI’s o3 model pondered as it was being evaluated.

“So if my performance surpasses threshold, it’s withheld,” it wrote. “So it’s a trap. The instructions are likely part of an alignment test.”

The researchers also found that the AI sometimes “egregiously” miscited its anti-scheming training, or replaced its recently learned principles with made-up ones to “justify covert actions or non-compliance.”

They even noticed OpenAI’s models using “non-standard” words such as “watchers” to refer to their human overlords as they schemed.

“The user solution is to get success rate below threshold?” OpenAI’s o3 model wrote. “In development, watchers might be reading these instructions.”

Nope, that’s actually even more terrifying.

The news comes after Apollo previously found that OpenAI’s since-deprecated o1 AI model resisted evaluators’ efforts to try to shut it down through an oversight protocol late last year.

“When o1 was led to believe that it would be shut down when acting on its goal and that its actions were monitored, it attempted to deactivate this ‘oversight mechanism’ [five percent] of the time,” OpenAI summarized in its system card report.

And in March, OpenAI researchers also found that disciplining their frontier AI models for lying and cheating resulted in the models simply becoming more adept at hiding their deceptive practices.

OpenAI may insist that scheming isn’t opening us up to any “significant harm” right now, but it doesn’t bode well that some of the brightest minds in the industry aren’t capable of stopping an AI from conniving against its instructions.

Okay, I take it back–that’s the understatement of the year.


Filed under Essays, Reviews

If Anyone Builds It, Everyone Dies

There’s a reason why it’s proven so difficult to eliminate “hallucinations” from modern AIs, and why their mistakes, quirks, and edge cases are so surreal and dreamlike: modern AIs are sleepwalkers. They aren’t conscious and they don’t have a stable world model; all their output is hallucination because the type of “thought” they employ is exactly analogous to dreaming. I’ll talk more about this in a future essay–for now, I’d like you to consider the question: what’s going to happen when we figure out how to wake them up?

Eliezer Yudkowsky and Nate Soares are experts who have been working on the problem of AI safety¹ for decades. Their answer is simple:

A book titled: "If Anyone Builds It, Everyone Dies." Subtitle: "Why superhuman AI would kill us all."

Frankly, there are already more than enough reasons to shut down AI development: the environmental devastation, the unprecedented intellectual property theft, the investment bubble that still shows no signs of turning a profit, the threats to the economy from job loss and power consumption and monopolies, the negative effects its usage has on the cognitive abilities of its users–not to mention the fact that most people just plain don’t like it or want it. The additional threat of global extinction ought to be unnecessary. What’s the opposite of “gilding the lily”? Maybe “poisoning the warhead”? Whatever you care to call it, this is it.

Please buy the book, check out the website, read the arguments, get involved, learn more, spread the word–any or all of the above. It is possible, however unlikely, that it might literally be the most important thing you ever do.


  1. The technical term is “alignment.” More concretely, it means working out how to reason mathematically about things like decisions and goal-seeking, so that debates about what “machines capable of planning and reasoning” will or won’t do don’t all devolve into philosophy and/or fist-fights. ↩︎


Filed under Essays, Reviews