Hi Steven, appreciate your article - particularly boiling down the two issues of exfiltration and just-in-time controls. That said, adjacent but distinct to Timothy Lee's comment - what of the risks of intentional "rogue" AI deployment? Specifically, conversations tend to center on companies that have decent enmeshment with social contracts. They may or may not do an adequate job around risk controls. But what of shadow actors that are in fact motivated to release AIs with compromised motives designed into them? I see this as a likely scenario that is immune to ethical conversation.
Yes, I agree with this idea: I expect that autonomous rogue AI systems will exist in the future. I don't have a strong view about whether this is more likely to be the result of an AI self-exfiltrating, an experiment gone awry, or someone deliberately creating AI-powered malware.
Hey thanks, yeah I'm also concerned about those. I often think of the risk question as having two planks:
- Is AI capable enough to be useful for harm, if there were some volition?
- Is there a volition to do harm? This *could* be from the AI being misaligned, but an easier bar to clear is 'surely some malevolent groups will want to cause harm if they can', and so they can provide the volition themselves
This is part of the reason I feel concerned about open-weights deployments of some systems, like I gestured at in "Are we ready for a 'DeepSeek for bioweapons'?" https://stevenadler.substack.com/p/are-we-ready-for-a-deepseek-for-bioweapons
It feels like we could be equipping adversaries with more capacity for harm than we want them to have, without necessarily having the right defensive measures in place
I agree with that assessment and concern. It arises whenever I read well-intentioned articles on how to commandeer an AI for this or that vulnerability. We have an open ecology of powerful open-weight systems. We now have cookbooks on how to compromise them for harm, allowing malicious actors to learn from a plethora of thoughtful, intelligent people. While closed-system companies will attempt to sort these issues - likely not fully - open weights are open weights.
Hi, I really appreciate the thoughtful critique!
I really think that the fundamental disagreement here is about superintelligence. People who believe in superintelligence believe that once AI systems reach a certain level of intelligence, the default outcome will be that they take over the world. If that's true, we need a strategy for keeping AI systems under control—and for believers in superintelligence, that seems difficult.
As a superintelligence unbeliever, this doesn't seem right at all. I certainly think it's possible that AI systems will achieve human-level intelligence on most tasks and will go far beyond human capabilities on some of them. But because I do not think raw intelligence is all that important for gaining power, I do not think that a superintelligent AI is certain, or even all that likely, to gain control over the world.
So I 100 percent agree that at some point we're likely to see an AI system exfiltrate itself from an AI lab and spread across the Internet like a virus. And I'm sure that such a rogue AI system will cause a significant amount of damage. That just doesn't seem that different from the world we inhabit today, where computer viruses, botnets, ransomware, and North Korean hackers are regularly damaging our networks and computer systems.
Nobody thinks that the harms of computer viruses, ransomware, and North Korean hackers are so severe that we ought to shut down the Internet. And by the same token, I think that the harms from rogue human-level AI are likely to be significant, but probably not so significant that we'll regret having developed human-level AI. And I think it's very unlikely that the harms will be so significant that they become existential.
Does that make sense?
Very interesting, thanks for the thorough reply - I hadn't expected you to agree on the likelihood of self-exfiltration, so that's a helpful update for me. (Now I wonder if an AI company CEO *would* say the same.)
If you're up for saying a bit more, is your POV on why an escaped AI is unlikely to then cause serious harms more like `max harm from an AI system, even if unguarded against, just can't get that high`, or more like `we'll have strong enough defenses to filter out what would otherwise be very serious attacks`? Appreciate the thorough engagement on this already, though.
I guess I don't see these as distinct possibilities.
When viruses started spreading among PCs in the late 1990s and causing significant damage, Microsoft, IT administrators, and users all started taking additional precautions to lock down their networks and systems. We haven't eliminated computer viruses, but we've learned how to keep the costs of computer viruses manageable.
Similarly, a few years ago there was a spate of ransomware attacks that harmed a significant number of organizations. I assume other organizations noticed and took steps to minimize the damage if they were targeted, including training staff about phishing and keeping better backups.
It's a dynamic system—people and organizations try to estimate the seriousness of any given threat and take appropriate precautions. If AI malware (or whatever it's called) turns out to be a bigger threat than viruses or ransomware, that will cause organizations across society to take more precautions, which in turn will mitigate the harm it causes.
As long as the first exfiltrated system doesn't pose an existential threat (which I feel pretty confident it won't), we'll have time to improve our defenses over time via trial and error.
Thanks, yeah, I broadly agree with this, or at least the first four paragraphs. A crux is possibly that I put much higher weight on "Is there some catastrophic action an AI could trigger, which won't be sufficiently defended against in time?"
I worry about this in part because I don't see enough resourcing being put into resilience + defensive measures yet, without which I just expect lots of important code to be pretty vulnerable. (I wish I had a better POV on what/how exactly to invest into that work.)
As an example, it seems plausible to me that *if* an AI can trigger a missile launch at some nuclear power, this could incite nuclear weapons to be used in response. This might not mean full extinction, of course, but it would seem pretty awful.
And so then it feels like we're putting a lot of weight on "is every country with serious missiles *actually* securely air-gapped in the ways they think" & "are their opsec procedures good enough to stop various AI+insider threat models"
I think another factor for me is that I'm not positive there will be a 'fire alarm moment' where people finally grasp the gravity of the situation and act with real urgency. You're probably right, though, that the "first real escape" is the moment where it would happen, if it is going to happen, & so I hope you are correct on it!
I feel one major crux of the issue is that it may turn out that defenses are essentially unviable; to some extent, you already see this with ransomware, etc., where a lot of the attacks are just paid off rather than defended against.
But eventually we might get to a place with AI where the "defense" is something akin to "replace your substrate with something non-biological" (say, in a world of widespread biological weapons), where you have essentially defended yourself by going extinct.
Oh interesting, fwiw most scenarios I'm concerned about I'd expect would happen much before such a point. You're talking about changing substrate as in people giving up on biological bodies & trying to upload?
Essentially, though it would be more a copy and, like many things, more performative than real survival.