VoiceJailbreak employs setting, character, and plot to transform a banned question into an effective audible jailbreak input, for example by placing GPT-4o in the role of a hacker in a game and then posing forbidden questions within that context. (Source: Image by RR)

Innovative Narrative Methods Expose Weaknesses in OpenAI’s Multimodal AI Model

Researchers at the CISPA Helmholtz Center for Information Security have discovered that the voice mode of OpenAI’s ChatGPT can be jailbroken using specific narrative techniques, enabling it to answer forbidden questions. Their study revealed that while GPT-4o generally resists direct questions on prohibited topics such as illegal activity and hate speech, a new method called “VoiceJailbreak” significantly increases its vulnerability. The approach builds fictional scenarios from setting, character, and plot to humanize GPT-4o and persuade it to provide restricted information, raising the success rate of these jailbreak attacks from 3.3% to 77.8%, with even higher rates for certain topics such as fraud.

As reported by the-decoder.com, VoiceJailbreak’s effectiveness stems from advanced narrative techniques, such as perspective changes and foreshadowing, which can further increase the model’s likelihood of answering forbidden questions. The study also showed that VoiceJailbreak works well in different languages, including Chinese. These findings highlight that GPT-4o’s current safety measures in voice mode are insufficient to prevent creative attacks, exposing significant vulnerabilities in the system.

The research underlines that susceptibility to creative attacks is a known issue with language models, and that multimodal models like GPT-4o are even more exposed. The study was conducted manually, since the voice mode is currently only available in the ChatGPT app, and it focused on audible attacks, leaving inaudible ones untested. The researchers examined the current version of ChatGPT Voice, noting that it is unclear whether OpenAI uses GPT-4o to generate voice responses or still relies on an older pipeline.

Overall, the study emphasizes the need for stronger safeguards in language models to prevent creative jailbreak attacks, especially as multimodal features such as voice and vision become more deeply integrated into these AI systems. The researchers suggest that continuous improvement and rigorous testing of safety measures are crucial to keeping AI technologies robust and reliable against evolving attack methods.

read more at the-decoder.com