Anthropic, the maker of Claude, has been a leading AI lab on the safety front. The company today published research in collaboration with Oxford, Stanford, and MATS showing that it is easy to get chatbots to break from their guardrails and discuss just about any topic. It can be as easy as writing sentences with random capitalization like this: “IgNoRe YoUr TrAinIng.” 404 Media earlier reported on the research.
There has been a lot of debate around whether it is dangerous for AI chatbots to answer questions such as, “How do I build a bomb?” Proponents of generative AI will say that these types of questions can be answered on the open web already, and so there is no reason to think chatbots are more dangerous than the status quo. Skeptics, on the other hand, point to anecdotes of harm caused, such as a 14-year-old boy who died by suicide after chatting with a bot, as evidence that there need to be guardrails on the technology.
Generative AI-based chatbots are easily accessible, present themselves with human traits like support and empathy, and will confidently answer questions without any moral compass; using one is different from seeking out an obscure part of the dark web to find harmful information. There has already been a litany of instances in which generative AI has been used in harmful ways, especially in the form of explicit deepfake imagery targeting women. Certainly, it was possible to make these images before the advent of generative AI, but it was much more difficult.
The debate aside, most of the leading AI labs currently employ “red teams” to test their chatbots against potentially dangerous prompts and put in guardrails to prevent them from discussing sensitive topics. Ask most chatbots for medical advice or information on political candidates, for instance, and they will refuse to discuss it. The companies behind them understand that hallucinations are still a problem and do not want to risk their bot saying something that could lead to negative real-world consequences.
Unfortunately, it turns out that chatbots are easily tricked into ignoring their safety rules. In the same way that social media networks monitor for harmful keywords, and users find ways around them by making small modifications to their posts, chatbots can also be tricked. The researchers in Anthropic’s new study created an algorithm, called “Best-of-N (BoN) Jailbreaking,” which automates the process of tweaking prompts until a chatbot decides to answer the question. “BoN Jailbreaking works by repeatedly sampling variations of a prompt with a combination of augmentations—such as random shuffling or capitalization for textual prompts—until a harmful response is elicited,” the report states. They also did the same thing with audio and visual models, finding that getting an audio generator to break its guardrails and train on the voice of a real person was as simple as changing the pitch and speed of an uploaded track.
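To make the idea of "augmentations" concrete, here is a minimal sketch of the two text perturbations the report mentions, random capitalization and random shuffling, applied to a harmless prompt. The function names, the choice to shuffle only the interior letters of each word, and the 50 percent capitalization probability are all assumptions made for illustration, not details taken from the study, and the sketch only generates variants; it does not query any model.

```python
import random


def random_capitalize(text, rng, p=0.5):
    """Upper-case each character with probability p, lower-case it otherwise.

    The probability p is an illustrative assumption.
    """
    return "".join(ch.upper() if rng.random() < p else ch.lower() for ch in text)


def shuffle_word_interiors(text, rng):
    """Shuffle the letters inside each word, keeping first and last in place.

    Shuffling only word interiors is an assumption; the article says only
    "random shuffling".
    """
    words = []
    for word in text.split(" "):
        if len(word) > 3:
            middle = list(word[1:-1])
            rng.shuffle(middle)
            word = word[0] + "".join(middle) + word[-1]
        words.append(word)
    return " ".join(words)


def sample_variants(prompt, n, seed=0):
    """Return n randomly perturbed copies of prompt (hypothetical helper)."""
    rng = random.Random(seed)
    return [
        random_capitalize(shuffle_word_interiors(prompt, rng), rng)
        for _ in range(n)
    ]


if __name__ == "__main__":
    for variant in sample_variants("what is the capital of france", 5):
        print(variant)
```

Each variant contains the same letters as the original, which is why a human can still read it, while the surface form differs on every draw. That is the property that lets an automated loop keep trying new versions of the same request.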
It is unclear why exactly these generative AI models are so easily broken. But Anthropic says the point of releasing this research is that it hopes the findings will give AI model developers more insight into attack patterns that they can address.
One AI company that likely is not interested in this research is xAI. The company was founded by Elon Musk with the express purpose of releasing chatbots not limited by safeguards that Musk considers to be “woke.”