Grok 4 Outshines Claude and DeepSeek in LLM Rankings Amid Safety Issues

xAI's Grok 4 was introduced on July 9 and has quickly surpassed rivals such as DeepSeek and Claude on LMArena, a platform for evaluating generative AI models. These rankings, however, do not account for potential safety concerns.

Newly released AI models are typically assessed on a range of criteria, including their ability to solve mathematical problems, answer text-based questions, and write code. Leading AI firms measure their models' performance with standardized tests such as Humanity's Last Exam, a 2,500-question benchmark for frontier AI. When organizations like Anthropic or OpenAI unveil new models, they often highlight improvements on these assessments. It is no surprise that Grok 4 scores higher than Grok 3 on key metrics, even as it faces public criticism.

LMArena is a user-driven platform where people compare AI models side by side in blind tests. Although LMArena has been accused of favoring closed models, it remains a popular AI ranking site. Based on those evaluations, Grok 4 placed in the top three in every category except one. The overall rankings in each category are below:

  • Math: Shared first place

  • Coding: Shared second place

  • Creative Writing: Shared second place

  • Instruction Following: Shared second place

  • Hard Prompts: Shared third place

  • Longer Query: Shared second place

  • Multi-Turn: Shared fourth place

In the most recent overall standings, Grok 4 is tied for third position with OpenAI’s gpt-4.5. The ChatGPT models o3 and 4o