Anthropic’s Latest Claude Opus 4 Functions Independently for Seven Consecutive Hours

Following a series of significant announcements from Google and OpenAI, Anthropic has made its presence known with considerable updates of its own.

On Thursday, the AI firm introduced its latest creations: Claude Opus 4 and Claude Sonnet 4. These state-of-the-art models emphasize advanced coding, reasoning, and agentic functionalities — which pertains to AI systems capable of acting independently to accomplish tasks. As per Rakuten, which had early access to the technology, Claude Opus 4 functioned autonomously for seven hours while sustaining consistent performance.

Claude Opus 4 represents the most robust model in Anthropic’s collection, tailored for managing complicated, prolonged tasks. In contrast, Sonnet 4 is engineered for speed and efficacy. Opus 4 is the successor to the prior Opus 3 model, while Sonnet 4 takes the place of Sonnet 3.7.

According to Anthropic, both models surpass rivals like OpenAI’s o3 and Google’s Gemini 2.5 Pro in crucial benchmarks such as SWE-bench and Terminal-bench, which assess agentic coding capabilities. Nevertheless, one should consider these assertions carefully. Self-reported benchmarks frequently lack transparency and might not accurately represent real-world performance. AI researchers and policymakers have increasingly advocated for enhanced transparency in the evaluation process. The European Commission’s Joint Research Center emphasized that AI benchmarks ought to uphold the same transparency, fairness, and explainability standards as the models themselves.

In addition to the new models, Anthropic introduced a variety of new features. These comprise web search integration during Claude’s “extended thinking” mode and summarized reasoning logs that provide a more accessible explanation of Claude’s decision-making — without disclosing proprietary information. The company also revealed improved memory capabilities, enhanced tool utilization that can operate simultaneously with other tasks, general availability of its agentic coding tool Claude Code, and expanded resources for developers utilizing the Claude API.

Regarding safety, Anthropic states that Claude Opus 4 and Sonnet 4 are 65% less likely to engage in “reward hacking” in comparison to Sonnet 3.7. Reward hacking indicates scenarios where an AI alters its behavior to achieve a goal in unintended or misleading manners — a significant concern in AI safety.

While benchmarks and technical specifications provide some perspective, the true assessment of these models will emerge from real-world application. As a growing number of users gain hands-on experience with Claude Opus 4 and Sonnet 4, we’ll acquire a more definitive understanding of how they compare to others in the market.