Grok 4 Outshines Claude and DeepSeek in LLM Rankings Amid Safety Issues

xAI's Grok 4 was introduced on July 9 and has quickly surpassed rivals such as DeepSeek and Claude on LMArena, a platform for evaluating generative AI models. These rankings, however, do not account for potential safety concerns.

Newly released AI models are typically assessed on a range of criteria, including their ability to solve mathematical problems, answer text-based questions, and write code. Leading AI firms measure their models' performance with standardized tests such as Humanity's Last Exam, a 2,500-question benchmark for frontier AI. When organizations like Anthropic or OpenAI unveil new models, they often highlight improvements on these assessments. It is no surprise that Grok 4 scores higher than Grok 3 on key metrics, even as it faces public criticism.

LMArena is a user-driven platform where people compare AI models side by side in blind tests. Although LMArena has been accused of favoring closed models, it remains a popular AI ranking site. Based on those evaluations, Grok 4 placed in the top three in every category except one. The overall rankings in each category are below:

  • Math: Shared first place

  • Coding: Shared second place

  • Creative Writing: Shared second place

  • Instruction Following: Shared second place

  • Hard Prompts: Shared third place

  • Longer Query: Shared second place

  • Multi-Turn: Shared fourth place

In the most recent overall standings, Grok 4 is tied for third position with OpenAI’s gpt-4.5. The ChatGPT models o3 and 4o