Harmful and abusive exchanges pose a substantial challenge for AI chatbots. Researchers have flagged AI companions such as Character.AI, Nomi, and Replika as unsafe for teens under 18. ChatGPT can reinforce delusional thinking in some users, and even OpenAI CEO Sam Altman has acknowledged growing "emotional dependence" on AI among users. Companies are gradually rolling out features to address these issues.
On Friday, Anthropic announced that its Claude chatbot can now end potentially harmful conversations, a capability aimed at rare, extreme cases of persistently harmful or abusive interactions. In its announcement, Anthropic cited examples including sexual content involving minors, violence, and "acts of terror."
"We remain highly uncertain about the potential moral status of Claude and other LLMs, now or in the future," Anthropic said in the announcement. "However, we take the issue seriously and are working to identify and implement low-cost interventions to mitigate risks to model welfare, in case such welfare is possible. Allowing models to end or exit potentially distressing interactions is one such intervention."
Anthropic included an example of Claude ending a conversation in its announcement. The company said a pre-deployment model welfare assessment of Claude Opus 4 found a "strong and consistent aversion to harm": the model showed a "clear preference against participating in harmful tasks," displayed "apparent distress when interacting with real-world users seeking harmful content," and had a "propensity to terminate harmful conversations when feasible in simulated user interactions."
When a user repeatedly sends abusive requests, Claude will refuse to comply and try to "constructively steer the interactions," ending the conversation only as "a final option" after multiple attempts at redirection. "The circumstances under which this will happen are extreme edge cases," Anthropic said, noting that "the overwhelming majority of users will neither notice nor be impacted by this feature in any standard product use, even when engaging in highly contentious discussions with Claude."
If Claude ends a conversation this way, the user will no longer be able to send messages in that conversation, but they can start a new chat with Claude.
"We're treating this feature as an ongoing experiment and will continue refining our approach," Anthropic said. "If users encounter a surprising use of the conversation-ending ability, we encourage them to submit feedback by reacting to Claude's message with Thumbs or using the dedicated 'Give feedback' button."