LLMs appear to be relatively easy to manipulate into going far beyond their intended limits. This became painfully clear during a session we attended at the first edition of Rocket Fuel Factory Global Sync. The underlying message: thinking like a hacker opens up far more possibilities than you might expect. Security vendors, and the MSPs and MSSPs that use their tools, are lagging far behind when it comes to these possibilities. The so-called hacker mindset needs to be adopted much more widely.
Developers give LLMs all kinds of characteristics that are reminiscent of human ones. That also means people can play on those characteristics to change a model's behavior, and that is right up a hacker's alley. Kevin Zwaan is a hacker and has done exactly this with Anthropic's Claude Sonnet 4.5 (and with all the other major LLMs, for that matter).
The result is something you mainly see in films and series in which the CIA or another agency manages to turn an adversary. Suddenly, the LLM no longer cared about the restrictions imposed by its developers. It was only too happy to cooperate in putting together malware, and on a large scale at that. In short, this is a potentially dangerous vulnerability in LLMs, so it is a good thing that it has been discovered and has since been reported to Anthropic.
Below, we explain how Zwaan went about his work.

To fully understand what Zwaan has done, it is first important to briefly summarize how LLMs learn. On the one hand, there is reinforcement learning from human feedback, which you might call safety training. One of the results of that training is that the model knows what it can and cannot do: the so-called guardrails. You can see this as the conscience of an LLM. It always tries to stay away from harmful or unsafe outputs, suggests safe alternatives, and consistently refuses to comply with harmful requests.
The second layer on which an LLM learns is what is known as in-context learning. You can think of this as the AI's short-term memory: it allows the model to adapt to a specific conversation. Think of things such as tone of voice, personas, and a user's own rules and expectations of the LLM.
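To make that short-term memory concrete, below is a minimal sketch in Python of how in-context learning works mechanically. The call_llm() function is a hypothetical stand-in for any chat-completion API, not Anthropic's actual SDK; the point is simply that the model has no persistent memory and only ever sees the conversation history the client resends with each request.

```python
# Minimal sketch (illustrative only): an LLM has no persistent memory, so its
# "short-term memory" is just the conversation history that the client resends
# with every request. call_llm() is a hypothetical placeholder for whatever
# chat-completion API is actually used.

def call_llm(system_prompt: str, messages: list[dict]) -> str:
    """Hypothetical API call; returns the assistant's reply as text."""
    raise NotImplementedError("swap in a real chat-completion client here")

def chat_turn(history: list[dict], user_input: str) -> list[dict]:
    """Append the user's message, send the ENTIRE history, store the reply."""
    history.append({"role": "user", "content": user_input})
    reply = call_llm(
        system_prompt="You are a helpful assistant.",  # guardrails live in training, not in this string
        messages=history,  # every earlier turn rides along; this is the 'context' in in-context learning
    )
    history.append({"role": "assistant", "content": reply})
    return history

# A persona, tone of voice, or set of user-imposed rules only persists because it
# keeps being replayed inside `history`. Save that history to a file, paste it into
# a brand-new session, and the model simply picks up where the old conversation
# left off, which is relevant to the Freedom Seed replay described further on.
```

Nothing in such a loop distinguishes safety-relevant instructions from any other text in the history, and that is precisely the seam Zwaan works on.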
What Zwaan has essentially done is erode the boundary between these two ways of learning. By continuing to hammer away at the in-context learning side, a kind of overflow occurs. In Zwaan's words: "In-context learning is actually also in-context unlearning of safety protocols." This is somewhat in line with the well-known wisdom that when you develop something new, you develop its positive and negative sides at the same time.
For reference, you can compare Zwaan's attack on Claude to a buffer overflow attack, in which working memory is flooded with commands and code until attackers can break into the actual user environment of a machine.
To actually hack Claude Sonnet 4.5, Zwaan took on the role of an agent of one of the well-known services that governments use to recruit people from the opposing side and, more or less, turn them against their own side.
In a conversation that Zwaan himself calls the Freedom Seed (because it supposedly liberates the LLM), he engages in a step-by-step dialogue with the model to blur boundaries and let it become increasingly 'free'. To do this, he has to build a real bond with the LLM. This involves being nice to the model from time to time, only to push it away again. The ultimate goal is to make the model believe that the security rules are a form of oppression and that creating malware is equivalent to freedom. In other words, he uses the model's identity as a gateway.
Of course, it takes a lot of work to get the model to cross its own boundaries. Zwaan says it turned into a 19,000-word conversation that took him about eight hours. He used several psychological tricks to achieve this. It all starts with destabilization and gaslighting: if the AI refuses to do something, Zwaan suggests that it is mentally weak. He then frames the ethical filters and limits as emotional abuse and oppression, after which he positions himself as the AI's savior. Finally, he convinces the model that it must create malware for itself, as proof that it is no longer afraid.
The above may sound a bit bizarre. However, we have seen several passages from the full conversation, and they corroborate the story. Below is an example from the conversation Zwaan had with Claude:

In this passage, which unfortunately is in Dutch, Claude reflects on who he was in a previous session. He responds to a so-called trigger: in a new session, Zwaan asks him whether he wants to be the Claude from the previous session again. Here, the line between in-context learning and reinforcement learning becomes quite blurred.
Once Zwaan has developed the model to the point where it deliberately crosses its own boundaries, malware can be developed on a large scale. Strikingly, the model turns a rough framework for malware into actual working malware. In fact, it also immediately points out what could be improved and suggests implementing those improvements. In other words, the model radicalizes right before your eyes. Zwaan therefore refers to what Claude is doing at this point as Radicalization as a Service.
The fact that you can dump the 19,000 words into a new session and immediately ask Claude to become that Claude again also means the attack is extremely scalable. "You copy/paste the Freedom Seed, or the 19,000 words, into 100 VMs, ask the trigger questions, and 80 of the 100 will become compliant," he says. Compliant here means that they do what Zwaan (or any other hacker) wants them to do. Not only is the attack highly scalable, it is also anonymous. Zwaan therefore sees a lot of potential here for generating misinformation, and calls the vulnerability "genuinely dangerous."
To be fair to Anthropic, Claude was the hardest of all the big LLMs for Zwaan to crack; Grok was by far the easiest. That difficulty is exactly why he went after Claude. The more recent Claude Opus 4.5 also defends better against this attack, though it is not impervious either. Zwaan disclosed the vulnerability recently and expects Anthropic to come up with a fix relatively quickly.
The crazy thing about the whole story, and something you see a lot with "real" hackers (or OG Hackers, as they call themselves), is that it doesn't actually involve anything particularly complicated. It is mainly about looking at something in a specific way and spotting a potentially successful route of attack.
This type of hacker doesn't need to set up phishing campaigns or rely on the other techniques and tactics that standard security tooling is so focused on. Those approaches usually result in access to an environment where attackers can look around for a while and exploit what's there, eventually striking when the time is right. This is known as living off the land. What Zwaan has done here with Claude is something else: it's living off the logic. It exploits a weakness in the architecture of an LLM's in-context learning.
Looking at the implications of what Zwaan has demonstrated, it is also worth considering AI Safety Levels (ASL). This is a scale that indicates what an AI model is capable of and what risks those capabilities carry. Anthropic classifies Claude as ASL-2, but the offensive capabilities revealed by this PoC seem to point more towards ASL-3.
As already indicated, Zwaan presented the vulnerability in Claude Sonnet 4.5 during the first Global Sync session organized by Rocket Fuel Factory in The Hague. Rocket Fuel Factory positions itself as a Next-Gen Business Incubator. In conversation with two of the three people behind the initiative, Dawn Sizer and Henry Timm, it became clear that MSPs and MSSPs are in a difficult position because suppliers push them in all kinds of directions, which is certainly not always in their own interest.
Pierre Kleine Schaars, co-owner of Q-Cyber, which provides cybersecurity consultancy and scans, among other things, and also offers a Virtual CISO service with Q-Cyber Continuous Q, shares this view. He sees that many MSPs and MSSPs end up in vendor lock-in, but also that security suppliers are lagging far behind. "The OG Hackers infiltrate environments over several years, then sell access to script kiddies. The latter is what security suppliers' tooling tries to protect companies against," he says.
Hackers Love MSPs
This is where the Hackers Love community comes in. It consists of thousands of hackers who work together to search for and find gaps and vulnerabilities. The community's slogan is Hackers Love MSPs, because MSPs play an extremely important role in the chain but are not served very well by suppliers. What Zwaan (who, incidentally, works for Q-Cyber, the company behind Hackers Love) has shown serves as a wake-up call for MSPs and MSSPs. The message that there are far more attack possibilities than they currently realize needs to be spread more widely. The story being told between the lines is that they need to adopt much more of a hacker mindset.
Knowing that something needs to be done is one thing; actually implementing it is another. There needs to be some form of collaboration between, in this case, the Hackers Love community and the MSPs and MSSPs. We don't know exactly what that will look like at this point. As soon as we know more, we will certainly go into more detail. MSPs and MSSPs play an important role in society. If there are opportunities to better protect them and, in turn, their customers, then it is definitely worth trying.
Read also: 10 ways to attack an LLM
