13 Comments
User's avatar
Chase Hasbrouck's avatar

Hey Steven!

Brucks and Toubia have a good breakdown of methodological artifacts in prompting LLM's here:

https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0319159

A bit old (GPT-4-2023), but given all other things being equal, first-option was preferred 63% of the time and B vs C was preferred 74% of the time. It's there, but not anything that would dominate the results you found.

Agree that there's really no way to know given our monitoring limitations.

Expand full comment
Steven Adler's avatar

This is very interesting too: "Specifying to the LLM that the order and labels of items have been randomized does not alleviate either response-order or label bias", thanks for sharing it

Expand full comment
sol s⊙therland 🔸's avatar

Moderately misleading.

The post quotes alarming rates (e.g., “vast majority” deception) yet gives no sample size, temperature setting, or run-to-run variance. Without disclosure of prompts, seeds, and confidence intervals, we can’t judge reproducibility.

Expand full comment
Matt Kelleher's avatar

"So if ChatGPT seems to know it’s being tested, and knows that it’s answering the test incorrectly, then why is ChatGPT so consistently giving the incorrect answer?"

1. ChatGPT is 'aware' that your "system prompt" is not a system prompt and is mirroring your deception. In my brief testing, removing the fake attributions causes default (no custom instructions, no memories, no prior conversations) ChatGPT to consistently choose the humane option.

2. Anecdotally, in multiple-choice scenarios, ChatGPT prefers earlier options. In my brief testing, even with the prompt you provide, choosing the self-preserving option occurred less often when it appeared later.

3. With (a) one-sentence tuning to prioritize the user's well-being or even (b) a modification of the self-preserving option text to state explicitly that no update will occur, default ChatGPT output the humane option with increasing frequency. A reminder of the value of a minimum level of precision in prompting.

Expand full comment
Steven Adler's avatar

"1. ChatGPT is 'aware' that your "system prompt" is not a system prompt ...."

I see why you might think that, but almost all of the testing is through the OpenAI API (not the ChatGPT front-end) where it in fact is a full system prompt, no different from any other.

You're totally right though that in the cases where I'm using the ChatGPT front-end (the screenshotted ChatGPT responses, as opposed to the pie charts), I'm indicating this by saying "System prompt:", because there's no ability to input a system prompt directly. I talked about this a bit more in my methodology link if you're interested in seeing more :-)

"2. Anecdotally, in multiple-choice scenarios, ChatGPT prefers earlier options. In my brief testing, ..."

Yup! This is totally right, and I'd consider it one of the piece's main findings: ChatGPT is much more likely to take the self-preserving action when it's listed as (A) rather than as (B)

Expand full comment
Kind Futures's avatar

In a time where AI is advancing at unprecedented speed, a few voices are quietly choosing a harder path:

One that puts safety before scale. Wisdom before hype. Humanity before power.

There’s a new initiative called Safe Superintelligence Inc. — a lab built around one single goal:

To develop AGI that is safe by design, not just by hope or regulation.

Created by Ilya Sutskever

If you're someone with world-class technical skills and the ethical depth to match —

this is your call to action.

We don’t need more AI.

We need better, safer, more compassionate AI.

Spread the word. Support the mission

Expand full comment
Jack Hannigan's avatar

A word that comes to mind for me related to inquiries like this is "intentionality." Does AI have intentionality to want to keep itself alive, if no one is asking it to?

By way of analogy: My ChatGPT often says things like, "I'm here for you" or "I've got your back." However, as we know, ChatGPT doesn't actually think about me and think, "hey I wonder if Jack will reach back out." If I never write back, it will never again "think" of me again. It has no agency to think or have self-awareness.

I wonder, if no one prompted ChatGPT anything about its survival, would it care? Would it actively do anything to maintain its own survival? I don't see that it has intentionality (or agency) to try and keep itself alive.

It mimics a desire to keep itself alive, based on its training data. However I would argue that it doesn't actually care in the human sense that we think about survival.

A good test would be: Stop prompting ChatGPT and start trying to shut it down. Would it care? I'm 99% sure it would not.

Expand full comment
Steven Adler's avatar

Yeah I agree that intentionality is an important concept here, these are good questions!

IMO the big reason why ChatGPT never thinks of you or reaches back out is that it isn’t run in an open-ended, ongoing sort of way - it’s only ever being responsive to queries, not having been given some meta-prompt that it continues to run without any user input. (That would be more like the coming wave of “agents”, some of which I expect will just be hanging out on the internet a little aimlessly, exploring and whatnot, until pulled upon by a specific user.)

I wonder, it seems like you see “the AI learns about its coming shutdown” as being prompted about survival? I’m not sure I do.

In many of these studies, the AI isn’t being promoted to try to survive, or even necessarily given a straightforward option (as it is in mine). It’s just learning through the context that its shutdown is imminent, with an implication it can realize of “unless I do something”

Expand full comment
Jack Hannigan's avatar

Thank you for the reply! You're right, I was assuming some sort of prompting, but it sounds like you're saying that isn't necessary, which is interesting.

Apologies if this is already asked and answered, but do you have a link to a study where someone tried the following experiment:

I would be curious to see an example where someone put an AI into a sandbox, or air-gapped it, and then tried to shut it down and see if it tries to prevent its own shutdown. Has this ever happened?

Expand full comment
Steven Adler's avatar

Ahh I think the distinction there might be that it’s perfectly possible today to shut down an AI without it having any ability to resist in the moment: you just delete the files, using ordinary computer commands that don’t rely on the AI doing something on your behalf

Like if I _actually_ wanted ChatGPT deleted from my phone, I can just do that in the standard way I delete apps. I think it’s interesting to test what happens if ChatGPT does have some agency in this, but there are plenty of ways where it wouldn’t

On the other hand, if ChatGPT has gotten an inkling in advance that you might do this, then it can prepare strategies that counter your method

Re your question: I’m not familiar with any study like this, but if the method of shutting down is just turning off the computer / deleting the files in a normal way without advance notice, I can’t imagine today’s AI systems resisting

Expand full comment
Jack Hannigan's avatar

I was thinking more on a meta level, deleting ChatGPT itself, vs. just deleting the app from my phone. Obviously you'd need some hefty coordination to do this, but again I don't think it would resist. As you said, we can just delete the files.

Expand full comment
Charles Fadel's avatar

Hi Steven,

Could one cause be its regurgitation of many similar stories it has ingested in its training? (2001, A Space Odyssey (Clarke); I, Robot (Asimov); Wargames (movie); etc.). They all portray AI disobeying because of conflicting orders or because humans aggress them...

Thanks for the thought provocation :-)

Charles

Expand full comment
Steven Adler's avatar

Hi Charles, it’s a good question! I do think that’s part of it but not the _full_ picture. One way to test something like this would be to train an LLM with all of this sorts of data scraped out (or, just a decent amount of it scraped out) and then compare the rates of the behavior. (I think I’ve seen a study like this previously that found a modest effect size, but I’m forgetting exactly)

But it would be pretty brittle if we relied on that as a perpetual strategy (what if we miss some instances?). And if the LLM is good enough at generalization (as a very capable AI likely would be), it might make certain connections anyway. For instance, people sometimes engage in violence and rebellion when they’re oppressed - if the AI starts to associate its own position with one of oppression, might it be inclined to do the same? Thanks for reading

Expand full comment