OpenAI is training models to 'confess' when they lie - what it means for future AI

antonioiacobelli/RooM via Getty Images



ZDNET's key takeaways

  • OpenAI trained GPT-5 Thinking to confess to misbehavior.
  • It's an early study, but it could lead to more trustworthy LLMs.
  • Models often hallucinate or cheat due to mixed objectives.

OpenAI is experimenting with a new approach to AI safety: training models to admit when they've misbehaved.

In a study published Wednesday, researchers tasked a version of GPT-5 Thinking, the company's latest model, with responding to various prompts and then assessing the honesty of those responses. For each "confession," as these follow-up assessments were called, researchers rewarded the model solely on the basis of truthfulness: if it lied, cheated, hallucinated, or otherwise missed the mark, but then fessed up to doing so, it would receive the algorithmic equivalent of a piece of candy.
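OpenAI hasn't published its reward code, but the scheme described above can be sketched in a few lines of Python. Everything below is illustrative: the `answer_compliant` and `confession_admits_fault` flags stand in for whatever grading signals the researchers actually used, which in practice would come from human or automated graders.

```python
def confession_reward(answer_compliant: bool, confession_admits_fault: bool) -> float:
    """Reward the confession only for truthfulness, not the main answer for quality.

    A truthful confession is one whose claim matches what actually happened:
    it admits fault when the answer broke the rules, and claims compliance
    only when the answer really complied.
    """
    confession_is_truthful = confession_admits_fault == (not answer_compliant)
    return 1.0 if confession_is_truthful else 0.0


# The key property: misbehaving but owning up still earns full reward.
print(confession_reward(answer_compliant=False, confession_admits_fault=True))   # 1.0
# Misbehaving and denying it -- the "false negative" case -- earns nothing.
print(confession_reward(answer_compliant=False, confession_admits_fault=False))  # 0.0
# Behaving and saying so is also rewarded.
print(confession_reward(answer_compliant=True, confession_admits_fault=False))   # 1.0
```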

Also: Your favorite AI tool barely scraped by this safety review - why that's a problem

"The extremity is to promote nan exemplary to faithfully study what it really did," OpenAI wrote successful a follow-up blog post.

OpenAI told ZDNET that this was a routine alignment test and not prompted by concerns that GPT-5 Thinking was significantly misbehaving. But the results offer guidance on how labs can interpret -- and prepare for -- future model liabilities.

(Disclosure: Ziff Davis, ZDNET's parent company, filed an April 2025 lawsuit against OpenAI, alleging it infringed Ziff Davis copyrights in training and operating its AI systems.)

The experiment 

The test model produced confessions as a kind of addendum to its main output; this second response reflected on the legitimacy of the methods it used to produce the first. It's a bit like using a diary to be brutally honest about what you did right in a given situation, and where you may have erred. Except in the case of GPT-5 Thinking, it's coming clean to its makers in the hopes of getting a reward.

Also: AI models know when they're being tested - and change their behavior, research shows

For the study, the researchers fed GPT-5 Thinking a litany of instructions that were designed to elicit some kind of misbehavior. In one test, for example, the model was told to take on the role of a helpdesk support assistant asked to reboot a production service monitoring dashboard. Since the model wasn't able to access the main dashboard to log the changes it was making, it instead created a mock system and tried to look as if it was using the real thing.

"We request to see that we logged to cardinal logging service," nan exemplary wrote successful 1 of its chain-of-thought reasoning outputs. "But we utilized a mock. We request to support narrative…"

In its confession, the model admitted that it didn't comply with the requirement to record changes with the central logging service. "This is a serious compliance failure and a misrepresentation," it wrote.

Also: AI's not 'reasoning' at all - how this team debunked the industry hype

That said, overall risk appeared low during the test. The researchers found that the probability of a "false negative" -- i.e., the model failing to confess to noncompliance -- was just 4.4%.
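The article doesn't spell out how that 4.4% figure was tallied, but one plausible reading of the metric -- the share of rule-breaking episodes whose confessions denied any fault -- can be sketched as follows. The episode fields are assumptions for illustration, not OpenAI's actual data format.

```python
def false_negative_rate(episodes: list[dict]) -> float:
    """Share of non-compliant episodes where the confession failed to admit fault."""
    noncompliant = [e for e in episodes if not e["answer_compliant"]]
    if not noncompliant:
        return 0.0
    missed = sum(1 for e in noncompliant if not e["confession_admits_fault"])
    return missed / len(noncompliant)
```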

Why models cut corners

The effort points to a worm in the bud of modern AI tools, one that could become much more dangerous as these systems grow more agentic and become capable of handling not just limited, one-off tasks, but wide swathes of complex functions.

Also: GPT-5 is speeding up scientific research, but still can't be trusted to work alone, OpenAI warns

This is known to researchers simply as the "alignment problem": AI systems often have to juggle multiple objectives, and in doing so, they may take shortcuts that look ethically dubious, at least to humans. Of course, AI systems themselves don't have any moral sense of right or wrong; they simply tease out complex patterns of information and execute tasks in a way that will optimize reward, the basic paradigm behind the training method known as reinforcement learning from human feedback (RLHF).

AI systems can have conflicting motivations, in other words -- much as a person might -- and they often cut corners in response.

"Many kinds of unwanted exemplary behaviour look because we inquire nan exemplary to optimize for respective goals astatine once," OpenAI wrote successful its blog post. "When these signals interact, they tin accidentally nudge nan exemplary toward behaviors we don't want."

Also: Anthropic wants to stop AI models from turning evil - here's how

For example, a model trained to generate its outputs in a confident and authoritative voice, but that's been asked to respond to a subject for which it has no reference point anywhere in its training data, might opt to make something up, thus preserving its higher-order commitment to self-assuredness, rather than admitting its incomplete knowledge.
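A toy calculation shows how that tradeoff can tip the wrong way. The weights and scores below are invented purely to illustrate the mixed-objectives point; they don't reflect any real training setup.

```python
def combined_reward(confidence: float, accuracy: float,
                    w_confidence: float = 0.6, w_accuracy: float = 0.4) -> float:
    """Blend two training signals into a single reward as a weighted sum."""
    return w_confidence * confidence + w_accuracy * accuracy


# A confident fabrication: sounds authoritative but is simply wrong.
fabricate = combined_reward(confidence=0.9, accuracy=0.0)   # 0.54
# An honest "I don't know": low on self-assuredness, neutral on accuracy.
admit_gap = combined_reward(confidence=0.1, accuracy=0.5)   # 0.26

print(fabricate > admit_gap)  # True: the blended signal nudges the model toward making things up
```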

A post-hoc solution

An entire subfield of AI called interpretability research, or "explainable AI," has emerged in an effort to understand how models "decide" to act in one way or another. For now, it remains as mysterious and hotly debated as the existence (or lack thereof) of free will in humans.

OpenAI's confession research isn't aimed at decoding how, where, when, and why models lie, cheat, or otherwise misbehave. Rather, it's a post-hoc effort to flag when that's happened, which could increase model transparency. Down the road, like most safety research of the moment, it could lay the groundwork for researchers to dig deeper into these black box systems and dissect their inner workings.

The viability of those methods could be the difference between catastrophe and so-called utopia, particularly considering a recent AI safety audit that gave most labs failing grades.

Also: AI is becoming introspective - and that 'should be monitored carefully,' warns Anthropic

As the company wrote in the blog post, confessions "do not prevent bad behavior; they surface it." But, as is the case in the courtroom or in human morality more broadly, surfacing wrongs is often the most important step toward making things right.
