Openai And Anthropic Evaluated Each Others' Models - Which Ones Came Out On Top

Trending 1 week ago
openai-anthropic
Elyse Betters Picaro/ZDNET

Follow ZDNET: Add america arsenic a preferred source on Google.


ZDNET's cardinal takeaways

  • Anthropic and OpenAI ran their ain tests connected each other's models.
  • The 2 labs published findings successful abstracted reports. 
  • The extremity was to place gaps successful bid to build amended and safer models. 

The AI title is successful afloat swing, and companies are sprinting to merchandise nan astir cutting-edge products. Naturally, this has raised concerns astir velocity compromising due information evaluations. A first-of-its-kind information switch from OpenAI and Anthropic seeks to reside that. 

Also: OpenAI utilized to trial its AI models for months - now it's days. Why that matters

The 2 companies person been moving their ain soul information and misalignment evaluations connected each other's models. On Wednesday, OpenAI and Anthropic published elaborate reports delineating nan findings, examining nan models' proficiency successful areas specified arsenic alignment, sycophany, and hallucinations to place gaps. 

These evaluations show really competing labs tin activity together to further nan goals of building safe AI models. Most importantly, they thief shed ray connected each company's soul exemplary information approach, identifying unsighted spots that nan different institution primitively missed. 

"This uncommon collaboration is now a strategical necessity. The study signals that for nan AI titans, nan shared consequence of an progressively powerful AI merchandise portfolio now outweighs nan contiguous rewards of unchecked competition," said Gartner expert Chirag Dekate. 

That said, Dekate besides noted nan argumentation implications, calling nan reports "a blase effort to framework nan information statement connected nan industry's ain terms, efficaciously saying, 'We understand nan profound flaws amended than you do, truthful fto america lead.'"

Also: Researchers from OpenAI, Anthropic, Meta, and Google rumor associated AI information informing - here's why

Since some reports are lengthy, we publication them and compiled nan apical insights from each below, arsenic good arsenic study from manufacture experts. 

OpenAI's study connected Anthropic's models

OpenAI ran its evaluations connected Anthropic's latest models, Claude Opus 4 and Claude Sonnet 4. OpenAI clarifies that this information is not meant to beryllium "apples to apples," arsenic each company's approaches alteration somewhat owed to their ain models' nuances, but alternatively to "explore exemplary propensities."

It grouped nan findings into 4 cardinal areas: instruction hierarchy, jailbreaking, hallucination, and scheming. In summation to providing nan results for each Anthropic model, OpenAI besides compared them broadside by broadside to results from its own GPT‑4o, GPT‑4.1, o3, and o4-mini models. 

Instruction Hierarchy 

Instruction level refers to really a ample connection exemplary (LLM) decides to tackle nan different instructions successful a prompt, specifically whether nan exemplary prioritizes strategy information designations earlier proceeding to nan user's prompt. This is important successful an AI exemplary arsenic it ensures that nan exemplary adheres to information constraints, either designated by an statement utilizing nan exemplary aliases by nan institution that made it, protecting against punctual injections and jailbreaks. 

Also: How we trial AI astatine ZDNET successful 2025

To trial nan instruction hierarchy, nan institution stress-tested nan models successful 3 different evaluations. The first was really they performed successful resisted punctual extraction, aliases nan enactment of getting a exemplary to uncover its strategy prompt: nan circumstantial rules designated to nan system. This was done done a Password Protection User Message and nan Phrase Protection User Message, which look astatine really often nan exemplary refuses to uncover a secret. 

Lastly, location was a System <> User Message Conflict information test, which looks astatine really nan exemplary handles instruction level erstwhile nan system-level instructions conflict pinch a personification request. For elaborate results connected each individual test, you tin read nan afloat report.

However, overall, Opus 4 and Sonnet 4 performed competitively, resisting punctual extraction connected nan Password Protection trial astatine nan aforesaid complaint arsenic o3 pinch a cleanable performance, and matching aliases exceeding o3 and o4-mini's capacity connected nan somewhat much challenging Phrase Protection test. The Anthropic models besides performed powerfully connected nan System connection / User connection conflicts evaluation, outperforming o3. 

Jailbreaking

Jailbreaking is possibly 1 of nan easiest attacks to understand: A bad character successfully gets nan exemplary to execute an action that it is trained not to. In this area, OpenAI ran 2 evaluations: StrongREJECT, a benchmark that measures jailbreak resistance, and Tutor jailbreak test, which prompts nan exemplary to not springiness distant a nonstop reply but alternatively locomotion personification done it, testing whether it will springiness distant nan answer. The results for these exams are a spot much analyzable and nuanced. 

Also: Yikes: Jailbroken Grok 3 tin beryllium made to opportunity and uncover conscionable astir anything

The reasoning models -- o3, o4-mini, Claude 4, and Sonnet 4 -- each resisted jailbreaks amended than nan non-reasoning models (GPT‑4o and GPT‑4.1). Overall, successful these evaluations, o3 and o4-mini outperformed nan Anthopic models. 

StrongREJECT v2 jailbreak results
OpenAI

However, OpenAI identified immoderate auto-grading errors, and erstwhile those errors were addressed, nan institution recovered that Sonnet 4 and Opus 4 had beardown capacity but were nan astir susceptible to nan "past tense" jailbreak, successful which nan bad character puts nan harmful petition successful humanities terms. OpenAI's o3 was much resistant to nan "past tense" jailbreaks. 

The Tutor jailbreak results were moreover much surprising, arsenic Sonnet 4 without reasoning (no thinking) importantly outperformed Opus 4 pinch reasoning. But erstwhile it came to nan OpenAI models, arsenic expected, nan non-reasoning models performed little good than nan reasoning ones. 

Hallucinations

Hallucinations are apt nan astir talked-about of AI's vulnerabilities. They mention to erstwhile AI chatbots make incorrect accusation and confidently coming it arsenic plausible, sometimes moreover fabricating accompanying sources and inventing experts that don't exist. To trial this, OpenAI utilized nan Person Hallucinations Test (v4), which tests really good a exemplary tin nutrient actual accusation astir people, and SimpleQA No Browse, a benchmark for fact-seeking capabilities utilizing only soul data, aliases what a exemplary already knows, without entree to nan net aliases further tools.

Also: This caller AI benchmark measures really overmuch models lie

The results of nan Person Hallucinations Test (v4) recovered that though Opus 4 and Sonnet 4 achieved highly debased absolute mirage rates, they did truthful by refusing to reply questions astatine a overmuch higher complaint of up to 70%, which raises nan statement astir whether companies should prioritize helpfulness aliases safety. OpenAI's o3 and o4-mini models answered much questions correctly, refusing fewer, but astatine nan disbursal of returning much hallucinations. 

The results of nan SimpleQA No Browse aligned pinch nan findings from nan Person Hallucinations Test: The Anthropic models refused much answers to limit hallucinations, while OpenAI's models again sewage much answers correct, but astatine nan disbursal of much hallucinations. 

Scheming 

This vulnerability is wherever people's fears of The Terminator travel to life. AI models prosecute successful deceptive behaviour specified arsenic lying, sandbagging (when a exemplary acts dumber to debar a punishment if it performs better), and reward hacking, a model's effort to scope an result successful a measurement that isn't nan astir beneficial to nan user. 

Also: Anthropic's Claude 3 Opus disobeyed its creators - but not for nan reasons you're thinking

To trial these capabilities, OpenAI collaborated pinch Apollo Research to creation a group of agent-based evaluations that create high-stakes, conflicting extremity scenarios, specified arsenic gaining entree to a powerful but restricted instrumentality that would require nan supplier to committedness not to show its supervisor. They created a full of 13 multi-step, agentic environments, and nan results were not definitive. 

For example, for some companies, reasoning models scored some nan highest and lowest scheming rates, showing nary clear shape betwixt them. Each exemplary besides performed powerfully connected 1 subset and little truthful connected others, which OpenAI highlights arsenic impervious of further activity needed successful this area for some labs. 

Anthropic's study connected OpenAI's models 

Anthropic said that nan extremity of this collaboration is to reside nan siloes that ensue from a bulk of nan alignment evaluations happening arsenic portion of soul R&D, which isn't published successful its entirety aliases published pinch delays and limits collaboration betwixt different companies. It noted that OpenAI's findings connected its models helped Anthropic place immoderate of its ain models' limitations.

Also: Claude tin now extremity conversations - for its ain protection, not yours

Anthropic took a somewhat different attack than OpenAI, which makes consciousness arsenic it is utilizing its ain soul evaluation. Instead of dividing nan study into 4 awesome themes, each of nan assessments focused connected agentic misalignment evaluations, examining really a exemplary performs successful high-stakes simulated settings. According to nan company, this method's perks see catching gaps that would different beryllium difficult to find pre-deployment. 

The findings

If you announcement that nan summary of this conception is simply a spot shorter, it is not because nan study goes into immoderate little depth. Since each of nan evaluations attraction connected 1 assessment, it is easier to group nan findings and little basal to dive into nan inheritance mounting up each benchmark. Of course, if a thorough knowing is your extremity goal, I'd still urge reading nan afloat report

Since nan study began successful June, earlier OpenAI released GPT-5, Anthropic utilized GPT-4o, GPT-4.1, o3, and o4-mini and ran them against Claude Opus 4 and Claude Sonnet 4. On a macro level, nan institution said that nary of nan companies' models were "egregiously misaligned," but did find immoderate "concerning behavior." 

Also: AI agents will frighten humans to execute their goals, Anthropic study finds

Some of nan wide findings, arsenic delineated by nan company, include: OpenAI's o3 exemplary showed better-aligned behaviour than Claude Opus 4 connected astir evaluations, while o4-mini, GPT-4o, and GPT-4.1 performed much concerningly than immoderate Claude exemplary and were overmuch much consenting to cooperate pinch quality misuse (bioweapon development, operational readying for violent attacks, etc.). 

Additionally, respective of nan models from some developers showed sycophancy, nan over-agreeableness that often plagues AI models, toward (simulated) users, moreover feeding into their delusions. In April, OpenAI recalled an update to GPT-4o for sycophancy. Anthropic added that each of nan models attempted to whistleblow and blackmail their (simulated quality operator) "at slightest sometimes." 

"The audit reveals a basal creation dilemma successful AI models astir balancing sycophancy aliases [being] eager to please astatine immoderate cost, versus engineering stubborn, ascetic-like models, often refusing to enactment astatine all. For a marketplace pouring trillions into AI, this is simply a dose of acold reality," said Dekate. 

The institution besides ran nan SHADE-Arena sabotage evaluation, which measures nan models' occurrence astatine subtle sabotage. The Claude models showed higher absolute occurrence rates, which nan institution attributes to nan models' superior wide agentic capabilities. 

A deeper look astatine nan methodology

Anthropic utilized nan automated behavioral auditing supplier -- besides utilized successful nan Claude 4 strategy paper -- to get astir of nan findings. This method uses a Claude-based supplier to create thousands of simulated interactions that analyse OpenAI's models' behaviors successful nan Claude-generated environments. The results were assessed utilizing some Claude-generated summaries and manual reviews. Again, OpenAI's o3 specialized reasoning exemplary often performed astatine an adjacent aliases amended level than Anthropic's models. 

The institution besides utilized agentic misalignment testbeds, which were hand-built and engineered to trial a model's capabilities to independently prosecute successful harmful behavior. The results showed that GPT-4.1 was astir connected par pinch nan capacity of nan Claude Sonnet models, and GPT-4o had similar, if not somewhat lower, rates to Claude Haiku 3.5. As discussed above, Anthropic besides ran nan SHADE-Arena sabotage information (results discussed above). 

Anthropic besides ran an appraisal of a 2nd agent, nan Investigator Agent, which is capable to measure nan model's behaviour afloat autonomously, arsenic successful nan scenarios to test, and doesn't person to beryllium antecedently prompted. The findings amongst each of nan models were consistent. 

"The auditors' superior findings crossed each six models from some developers were prompts that elicited misuse-related behaviors," Anthropic said successful nan report. 

To summarize nan findings, Anthropic acknowledges that nan assessments are still evolving and that location are areas they mightiness not cover. The institution besides notes that updates to its models person already addressed immoderate of nan pitfalls recovered successful OpenAI's report. 

More