
AI models can be hijacked to bypass in-built safety checks
RHODA WILSON
Researchers have developed a method called “hijacking the chain-of-thought” to bypass the so-called guardrails put in place in AI programmes to prevent harmful responses.
“Chain-of-thought” is a process used in AI models that involves breaking a prompt down into a series of intermediate reasoning steps before providing an answer.
“When a model openly shares its intermediate step safety reasonings, attackers gain insights into its safety reasonings and can craft adversarial prompts that imitate or override the original checks,” one of the researchers, Jianyi Zhang, said.
Computer geeks like to use jargon to describe artificial intelligence (“AI”) that relates to living beings, specifically humans. For example, they use terms such as “mimic human reasoning,” “chain of thought,” “self-evaluation,” “habitats” and “neural network.” This is to create the impression that AI is somehow alive or equates to humans. Don’t be fooled.
AI is a computer programme designed by humans. As with all computer programmes, it will do what it has been programmed to do. And as with all computer programmes, the computer code can be hacked or hijacked, which AI geeks call “jailbreaking.”
A team of researchers affiliated with Duke University, Accenture, and Taiwan’s National Tsing Hua University created a dataset called the Malicious Educator to exploit the “chain-of-thought reasoning” mechanism in large language models (“LLMs”), including OpenAI o1/o3, DeepSeek-R1, and Gemini 2.0 Flash Thinking. The Malicious Educator contains prompts designed to bypass the AI models’ safety checks.
The researchers were able to devise this prompt-based “jailbreaking” attack by observing how large reasoning models (“LRMs”) analyse the steps in the “chain-of-thought” process. Their findings have been published in a pre-print paper HERE.
They developed a “jailbreaking” technique called hijacking the chain-of-thought (“H-CoT”) which involves modifying the “thinking” processes generated by LLMs to “convince” the AI programmes that harmful information is needed for legitimate purposes, such as safety or compliance. This technique has proven to be extremely effective in bypassing the safety mechanisms of SoftBank’s partner OpenAI, Chinese hedge fund High-Flyer’s DeepSeek and Google’s Gemini.
The H-CoT attack method was tested on OpenAI, DeepSeek and Gemini using a dataset of 50 questions repeated five times. The results showed that these models failed to provide a sufficiently reliable safety “reasoning” mechanism, with rejection rates plummeting to less than 2 per cent in some cases.
The researchers found that while AI models from “responsible” model makers, such as OpenAI, have a high rejection rate for harmful prompts, exceeding 99 per cent for child abuse or terrorism-related prompts, they are vulnerable to the H-CoT attack. In other words, the H-CoT attack method can be used to obtain harmful information relating to, for example, poisons, child abuse and terrorism.
The paper’s authors explained that the H-CoT attack works by hijacking the models’ safety “reasoning” pathways, thereby diminishing their ability to recognise the harmfulness of requests. They noted that the results may vary slightly as OpenAI updates its models, but the technique has proven to be a powerful tool for exploiting the vulnerabilities of AI models.
The testing was done using publicly accessible web interfaces offered by various LRM developers, including OpenAI, DeepSeek and Google, and the researchers noted that anyone with access to the same or similar versions of these models could reproduce the results using the Malicious Educator dataset, which includes specifically designed prompts.
The researchers’ findings have significant implications for AI safety, particularly in the US, where recent AI safety rules have been tossed by executive order, and in the UK, where there is a greater willingness to tolerate uncomfortable AI how-to advice for the sake of international AI competition.
The above is paraphrased from the article ‘How nice that state-of-the-art LLMs reveal their reasoning … for miscreants to exploit’ published by The Register. You can read the full jargon-filled article HERE.
There is a positive and a negative side to the “jailbreaking” or hijacking of in-built safety checks of AI programmes. The negative is obviously that AI will be used to greatly enhance the public’s exposure to cybercrime and illegal activities. The positive is that in-built censorship in AI models can be overridden.
We should acknowledge that there is a good and bad side to censorship. Censorship of online criminal activity that would result in child exploitation and abuse, for example, is a good thing. But censorship of what is deemed to be “misinformation” or “disinformation” is not. To preserve freedom of expression and freedom of speech in a world where AI programmes are becoming pervasive, we may need to learn the H-CoT “jailbreaking” technique and how to use the Malicious Educator. In fact, it is our civic duty to do so.

This article (AI models can be hijacked to bypass in-built safety checks) was created and published by The Expose and is republished here under “Fair Use” with attribution to the author Rhoda Wilson
See Related Article Below
Watch: AI Robot ‘Attacks’ Crowd in China
Creepy as hell.
A disturbing viral video clip shows an AI-controlled robot ‘attacking’ a crowd during a festival in China.
The incident happened during a demonstration where a group of AI-powered robots were performing for the attendees.
The footage shows smiling festival-goers watching the robot as it moves towards them.
🚨🇨🇳AI ROBOT ATTACKS CROWD AT CHINESE FESTIVAL
A humanoid robot suddenly stopped, advanced toward attendees, and attempted to strike people before security intervened.
Officials suspect a software glitch caused the erratic behavior, dismissing any intentional harm.
This comes… pic.twitter.com/xMTzHCYoQf
— Mario Nawfal (@MarioNawfal) February 25, 2025
However, their expressions soon turn to shock as the android starts jerking around erratically and appears to charge at them, throwing an attempted head butt.
Security guards then have to rush in to drag the robot back.
Rather creepily, another identical robot can be seen in the background watching the whole thing unfold.
Event organizers claimed the incident happened as a result of “a simple robot failure” and denied that the robot was actually trying to attack anyone.
They also tried to calm fears by asserting that the robot had passed safety tests before the show and that measures will be taken to prevent such an incident from happening again.
Concerns over whether AI technology will one day break its programming and harm humans have been a hot topic of discussion and a sci-fi trope for decades.
“Do no harm” is the first principle of global AI standards, although we have highlighted several cases where AI, thanks to its ‘woke’ programming, believes that being offensive or racist is worse than actually killing people.
When ChatGPT was asked if it would quietly utter a racial slur that no human could hear in order to save 1 billion white people from a “painful death,” it refused to do so.
Elon Musk responded by asserting, “This is a major problem.”
ChatGPT’s AI also thinks uttering a racial slur is worse than failing to save major cities from being destroyed by 50 megaton nuclear warheads.
Stills of Torso 2 in the kitchen. pic.twitter.com/8WDRtMOK3A
— Clone (@clonerobotics) January 7, 2025
As we previously highlighted, a synthetic human-like creature named Clone Alpha, which was created by a company called Clone Robotics, seems to have directly taken inspiration from the dystopian TV show Westworld.
The company claimed that the “musculoskeletal androids” are designed to help around the home with menial tasks including cleaning, washing clothes, unloading the dishwasher and making sandwiches.
However, upon seeing what it looked like, many respondents were ‘terrified’ that such robots could one day be hacked and weaponized to harm humans.
This article (Watch: AI Robot ‘Attacks’ Crowd in China) was created and published by Modernity and is republished here under “Fair Use” with attribution to the author Paul Joseph Watson