Anthropic has released Claude Sonnet 4.5, a new artificial intelligence model that demonstrates significant advancements in writing and managing software, operating computer systems, and executing complex, multi-step tasks autonomously. The model establishes new performance records on key industry benchmarks for coding and agentic capabilities, positioning it as a powerful tool for developers and enterprise users. The release is accompanied by a suite of updates to Anthropic’s developer tools, including a new Agent SDK, designed to accelerate the creation of sophisticated AI systems.
The new model’s significance lies in its capacity to move beyond simple code generation to handle the entire software development lifecycle. It can reportedly maintain focus on a single project for more than 30 hours, a substantial advance in long-horizon task management for AI. By achieving state-of-the-art results in evaluations that mirror real-world programming challenges, Sonnet 4.5 represents a major step toward AI that can function as a capable, persistent, and collaborative partner for software engineers. This performance, combined with its availability on major cloud platforms and pricing held at the same level as its predecessor, signals a push to make frontier-level AI more accessible and practical for complex, real-world applications.
New Performance Benchmarks in Coding
At the core of Sonnet 4.5’s capabilities is its performance on rigorous coding evaluations. The model has achieved a state-of-the-art score on SWE-bench Verified, a benchmark that measures a model’s ability to resolve real-world software issues pulled from GitHub repositories. While exact scores vary with testing methodology, some evaluations show Sonnet 4.5 resolving 49.8% of issues, surpassing Anthropic’s own Claude Opus 4 and competing models such as OpenAI’s GPT-4o. Other independent tests have reported even higher scores, cementing its position as a leader in this domain.
This proficiency is not limited to bug fixes. The model is adept at code refactoring, modernization of legacy codebases, and implementing consistent design patterns. Early adopters have praised its ability to learn existing codebase patterns to deliver precise and contextually aware implementations. Eric Wendelin, a Tech Lead for Gen AI at Netflix, noted that the model “handles everything from debugging to architecture with deep contextual understanding, transforming our development velocity.” This deep understanding allows the model to function iteratively, refining code based on feedback and test results in a manner that mimics the workflow of a human developer.
Mastering General Computer Operations
Beyond its specialized coding skills, Claude Sonnet 4.5 shows a remarkable leap in its ability to operate computers for general tasks. This is quantified by its performance on the OSWorld benchmark, which tests AI systems on real-world activities like navigating operating systems and using common applications. Sonnet 4.5 achieved a score of 61.4%, a substantial increase from the 42.2% scored by Claude Sonnet 4 just four months prior.
This capability allows the model to interact with software in a more human-like way. For example, it can navigate websites, populate spreadsheets, and execute multi-step workflows without direct guidance. These skills are being integrated into consumer-facing products, such as an updated Claude for Chrome extension that allows the AI to autonomously browse sites and fill in data. This functionality bridges the gap between a specialized coding assistant and a more versatile agent capable of handling a wide array of digital tasks, from data entry to complex research.
The Engine for Autonomous AI Agents
Long-Horizon Task Execution
A key feature of Sonnet 4.5 is its capacity for what Anthropic calls “agentic” behavior, where the AI can plan and execute complex goals over extended periods. The model can reportedly maintain coherence and focus for over 30 hours on a single set of tasks. This endurance is crucial for applications that require sustained effort, such as monitoring cybersecurity threats, managing large-scale data analysis, or autonomously developing a full software application from scratch. During trials, the model demonstrated the ability to build an application, set up its database, purchase a domain name, and perform security audits with minimal human intervention.
New Tools for Developers
To empower developers to harness these capabilities, Anthropic has released the Claude Agent SDK. This toolkit provides the foundational infrastructure the company uses for its own products, such as Claude Code. The SDK includes systems for managing memory across long tasks, handling permissions, and coordinating the actions of multiple AI agents. The Claude API has also been updated with a context editing feature and a dedicated memory tool, further enabling the development of long-running, autonomous systems.
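The kinds of infrastructure the SDK covers can be illustrated with a brief, self-contained sketch. Note that `MemoryStore`, `PermissionGate`, and `run_step` below are hypothetical stand-ins for the concepts described above (persistent memory across long tasks and permission handling), not the Claude Agent SDK’s actual API:

```python
import json
import tempfile
from pathlib import Path


class MemoryStore:
    """Persists agent notes to disk so state survives a long-running task."""

    def __init__(self, path: Path):
        self.path = path
        self.notes = json.loads(path.read_text()) if path.exists() else []

    def remember(self, note: str) -> None:
        self.notes.append(note)
        self.path.write_text(json.dumps(self.notes))


class PermissionGate:
    """Permits only pre-approved tool actions; everything else is denied."""

    def __init__(self, allowed: set[str]):
        self.allowed = allowed

    def check(self, action: str) -> bool:
        return action in self.allowed


def run_step(action: str, gate: PermissionGate, memory: MemoryStore) -> str:
    """Execute one agent step, recording permitted actions in memory."""
    if not gate.check(action):
        return f"denied: {action}"
    memory.remember(f"executed {action}")
    return f"ok: {action}"


# Demo: one permitted step and one denied step.
workdir = Path(tempfile.mkdtemp())
memory = MemoryStore(workdir / "notes.json")
gate = PermissionGate({"read_file", "run_tests"})
results = [run_step(a, gate, memory) for a in ("read_file", "delete_repo")]
```

Because the memory store writes through to disk, a restarted agent process can reload its prior notes, which is the basic mechanism behind maintaining state across very long tasks.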
Rapid Industry Adoption and Accessibility
Anthropic has made Claude Sonnet 4.5 widely available from its launch day. Developers can access the model through the Claude API, and it is also available on major cloud services, including Amazon Bedrock and Google Cloud Vertex AI. Microsoft is also integrating the new model into its Copilot Studio. This broad distribution ensures that developers can immediately begin incorporating the model’s advanced capabilities into their own applications and workflows.
Despite its significant performance gains, Sonnet 4.5 is being offered at the same price as its predecessor: $3 per million input tokens and $15 per million output tokens. This decision makes frontier-level performance more economically viable for a wider range of uses. The model has already been integrated by major industry players. Mario Rodriguez, Chief Product Officer at GitHub, stated that Sonnet 4.5 “amplifies GitHub Copilot’s core strengths” and enables its agentic experiences to “handle complex, codebase-spanning tasks better.”
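At those rates, per-request cost is straightforward to estimate; a quick sketch (the rates come from the announcement, while the token counts are hypothetical examples):

```python
# Published Claude Sonnet 4.5 API rates: $3 per million input tokens,
# $15 per million output tokens.
INPUT_RATE = 3.00 / 1_000_000    # dollars per input token
OUTPUT_RATE = 15.00 / 1_000_000  # dollars per output token


def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the dollar cost of a single API call at these rates."""
    return input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE


# Hypothetical example: a 200k-token prompt with a 50k-token response
# costs $0.60 for input plus $0.75 for output, $1.35 in total.
cost = request_cost(200_000, 50_000)
```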
Commitment to AI Safety and Alignment
Alongside its performance enhancements, Anthropic emphasizes that Sonnet 4.5 is its “most aligned frontier model yet.” The model was released under the company’s AI Safety Level 3 (ASL-3) framework, which pairs advanced capabilities with correspondingly robust safeguards. The company has implemented specialized classifiers to detect and filter inputs and outputs related to dangerous material, and reports reducing the classifiers’ incorrect flags by a factor of 10 since their initial implementation.
According to Anthropic’s internal assessments, the model also shows marked reductions in undesirable behaviors such as sycophancy (telling users what they want to hear), deception, and power-seeking. These safety improvements are critical for enterprise applications where trust, reliability, and security are paramount. The company reports it has also made considerable progress in defending against prompt injection attacks, one of the most significant security risks for agentic AI systems.