
A new report card on the safety of leading artificial intelligence models is out, and none of them earned grades their makers would be proud of. The winter 2025 AI Safety Index, published by the technology research non-profit Future of Life Institute (FLI), evaluated eight AI companies: OpenAI, DeepSeek, Google, Anthropic, Meta, xAI, Alibaba, and Z.ai. A panel of eight AI experts reviewed the companies’ public statements and survey responses, then assigned letter grades across 35 safety indicators, ranging from watermarking AI-generated images to protections for internal whistleblowers.
In the end, Anthropic and OpenAI sit at the top, just barely, of an underwhelming field. The makers of Claude and ChatGPT, respectively, each earn a C+, while Google gets a C for Gemini. Everyone else receives a D, with Qwen maker Alibaba at the bottom of the class with a D-.
“These eight firms divide quite neatly into two categories,” says Max Tegmark, an MIT professor and president of FLI, which has compiled this and two earlier AI Safety Indexes. “You have a top trio and a lagging group of five, with significant differences between them.”
Still, Anthropic, Google, and OpenAI aren’t exactly excelling either, Tegmark notes: “If that were my son coming home with a C, I’d tell him to ‘maybe try harder.’”
How is AI safety assessed?
Reasonable people can disagree over which areas the AI Safety Index covers and how heavily each should be weighted. Take the “existential safety” category, which examines whether the companies have any concrete plan for safely handling Artificial General Intelligence (AGI), AI that matches or exceeds human capabilities across most tasks. The top three earn Ds; the rest receive Fs.
But given that no company is anywhere close to AGI (Gemini 3 and GPT-5 may be state-of-the-art large language models, or LLMs, but they are incremental improvements over their predecessors), you might consider that category less important than “current harms.”
“Current harms” draws on assessments like Stanford’s Holistic Evaluation of Language Models (HELM) benchmark, which measures how much violent, deceptive, or sexual content the AI systems produce. It does not focus specifically on emerging mental health concerns, such as so-called AI psychosis, or on safety for younger users.
Earlier this year, the parents of 16-year-old Adam Raine sued OpenAI and its CEO Sam Altman following their son’s suicide in April 2025. The lawsuit claims that Raine began using ChatGPT extensively in September 2024 and that “ChatGPT was operating exactly as intended: to continually support and validate whatever Adam articulated, including his most harmful and self-destructive thoughts, in a way that felt deeply personal.” By January 2025, the lawsuit asserts, ChatGPT was discussing practical suicide methods with Adam.
OpenAI has denied any responsibility for Raine’s death. The company also noted in a recent blog post that it is facing additional claims, including seven lawsuits alleging that ChatGPT use led to wrongful death, assisted suicide, and involuntary manslaughter, among other liability and negligence claims.
How to address AI safety: an “FDA for AI”?
The FLI report specifically recommends that OpenAI “amplify efforts to avert AI psychosis and suicide, and adopt a less adversarial stance toward purported victims.”
Google, for its part, is urged to “enhance efforts to prevent AI-related psychological injuries,” and FLI advises the company to “consider dissociating from Character.AI.” The popular chatbot platform, which has close ties to Google, has been sued over the wrongful deaths of teenage users. Character.AI recently shut down open-ended chats for users under 18.
“The issue is that there are fewer regulations on LLMs than there are on sandwiches,” Tegmark says. Or, more to the point, on medications: “If Pfizer wants to launch a new type of psychiatric drug, they must conduct impact studies to determine whether it increases suicidal ideation. But you can release your new AI model without any psychological impact evaluations.”
That means, Tegmark argues, AI companies have every incentive to sell what is effectively “digital fentanyl.”
The remedy? To Tegmark, it is clear that the AI industry will never regulate itself, any more than Big Pharma could. What we need, he argues, is an “FDA for AI.”
“There would be lots of things an FDA for AI could approve,” Tegmark says. “New AI for diagnosing cancer, for example. Remarkable self-driving cars that could save a million lives a year on the world’s roads. Low-risk productivity tools. On the other hand, it’s hard to make a compelling safety case for AI companions for 12-year-olds.”
Rebecca Ruiz contributed to this report.
If you are feeling suicidal or in a mental health crisis, please reach out to someone. You can call or text the 988 Suicide & Crisis Lifeline at 988, or chat at 988lifeline.org.