I Retested GPT-5's Coding Skills Using OpenAI's Guidance - and Now I Trust It Even Less

2 weeks ago

ZDNET's key takeaways

The same prompts in GPT-5 yield success, crashes, or errors.
OpenAI's prompt optimizer helps, but introduces its own quirks.
AI "unconsciously" adding details raises trust concerns.

Do AIs get headaches? Because GPT-5 has certainly been giving me one. This article was going to be so easy. OpenAI came out with a list of best practices for GPT-5 coding. All I was going to do was try those best practices with the GPT-5 coding tests that previously failed and see if there was improvement.

It seemed so simple.

But then I had a thought. What if OpenAI has improved GPT-5 in the week or so since I ran my tests?

Perhaps I should re-run the failed tests as-is and see what the results are, to use as a baseline for the new best practices.

From a testing point of view, this was a valid approach. From a personal sanity point of view, not so much.

Re-running test 1

I re-ran the first failed test. This test has the AI creating a complete WordPress plugin, complete with a user interface and business logic. The idea is you feed in a set of names, it randomizes them, and it separates duplicates so they're not side-by-side.
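For a sense of the business logic the plugin is being asked for, here is a minimal sketch in Python (not the PHP a WordPress plugin would use, and not the code GPT-5 produced). The function name `randomize_names` and the greedy approach are assumptions made for illustration: it always places the most frequent remaining name that was not just used, breaking ties randomly.

```python
import heapq
import random
from collections import Counter


def randomize_names(names, rng=None):
    """Return names in a random order, keeping identical names apart when possible."""
    rng = rng or random.Random()
    counts = Counter(names)
    # Max-heap on remaining count (stored negative); a random tiebreaker mixes the order.
    heap = [(-count, rng.random(), name) for name, count in counts.items()]
    heapq.heapify(heap)

    result = []
    held = None  # entry for the name just placed, kept out of the heap for one round
    while heap:
        count, _, name = heapq.heappop(heap)
        result.append(name)
        if held is not None:
            heapq.heappush(heap, held)
        remaining = count + 1  # counts are negative, so +1 means one fewer copy left
        held = (remaining, rng.random(), name) if remaining < 0 else None

    # If one name outnumbers all the others combined, some repeats are unavoidable.
    if held is not None:
        result.extend([held[2]] * (-held[0]))
    return result
```

With distinct names this behaves like a plain shuffle. With duplicates, the order is less random than a true shuffle, which is a trade-off of this particular sketch.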

When I ran this test on GPT-5 originally, it failed. Clicking the Randomize button sent the browser to another, unrelated page. I was able to cajole GPT-5 into fixing it, but the first failure was what counted.

This time, I ran the exact same test with the exact same prompt again. This time, it worked perfectly. Wow, I thought. GPT-5 has improved in the past week.

If only I had left well enough alone. But no.

I had to try again. On my second try with the exact same test and the exact same prompt, clicking Randomize resulted in what WordPressers call the "white screen of death." This indicates something in the code isn't working.

I tried the exact same test with the exact same prompt a total of five times. The first time it worked. Subsequent times, it didn't. One time, I got a white screen. One time, I got a page full of error messages. One time, I was redirected to a different page. One time, nothing happened at all.

We'll come back to test 1. But next, I decided to do the same thing with the other failed test.

Re-running test 4

This test asks the AI to write code that talks to Chrome, AppleScript, and another tool called Keyboard Maestro. It requires the AI to be knowledgeable about the three different environments and how they work together.

Here's an interesting factoid about AppleScript. The scripting language is inherently case-insensitive. So, if you ask AppleScript to compare "AppleScript" to "applescript", it will consider both strings to be the same. You have to explicitly tell it to compare case before it will do so.

When I first ran this test against GPT-5, it hallucinated that AppleScript had a native function for making strings lowercase. It does not, so the code failed.

This time, instead of hallucinating that AppleScript has a lowercase function, it decided to do one of the most convoluted sets of operations to lowercase a string. Here's that line of code:

set tabTitleLower to (do shell script "echo " & quoted form of tabTitle & " | tr '[:upper:]' '[:lower:]'")

The code actually works, but it's very weird and wholly unnecessary. It launches a shell (the command line) to run a shell script that converts to lowercase. It's like making a grilled cheese sandwich and, when you realize you're out of cheese, flying across the country to buy cheese from a 7-11 on the other side of the continent. You're not even getting better cheese. You're just doing it in the weirdest and most inefficient way possible.

Let's not forget that we actually don't need to make that text lowercase in the first place, because AppleScript is case-insensitive. So it wrote a highly convoluted piece of code that, while it works, is wholly unnecessary.
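To show how little logic the task actually needs, here is a rough Python sketch of the matching step (an illustration only, not the AppleScript the test calls for). The function name `find_tab_index` and the sample tab titles are made up. Python compares strings case-sensitively, so the sketch has to fold case itself, which is exactly the step AppleScript's default string comparison makes unnecessary.

```python
def find_tab_index(tab_titles, channel_name):
    """Return the index of the first tab whose title contains channel_name, ignoring case."""
    needle = channel_name.casefold()
    for index, title in enumerate(tab_titles):
        # casefold() on both sides gives a case-insensitive "contains" test
        if needle in title.casefold():
            return index
    return None  # no tab matched


# Hypothetical tab titles, purely for illustration
tabs = ["Inbox", "ZDNET - YouTube", "Docs"]
match = find_tab_index(tabs, "zdnet")
```

No shell, no external process, and no round trip across the continent for cheese.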

I didn't bother to re-run this test four more times because the headache was already starting to set in.

OpenAI's GPT-5 coding best practices

OpenAI has put out what they call a cheatsheet for GPT-5 coding. It has six main recommendations:

Be precise and avoid conflicting information: Well, duh. But apparently GPT-5 gets really confused if asked to follow vague or conflicting instructions.
Use the right reasoning effort: Rather than just letting GPT-5 choose its reasoning level, you might want to adjust its model. If it starts to overthink a problem, use a lower reasoning level. Great, so now we have to deal with a neurotic AI. Yippee?
Use XML-like syntax to help structure instructions: Rather than using just text, using XML tags to section aspects of a prompt helps the AI to parse its assignment. This is a good hint, but it feels like a step back from what it could do.
Avoid overly firm language: In previous models, we sometimes had to YELL at the AI to get it to listen. Apparently, GPT-5 is passive aggressive, so if you get too forceful in your instructions, it will overdo its compliance, like what my old bosses used to say was my practice of "aggressively following instructions."
Give room for planning and self-reflection: Does the AI need a blankie? Perhaps a warm stuffed animal to cuddle? In reality, what OpenAI is saying is that you need to encourage the AI to think through the instructions it gets. Apparently, if you don't tell it to think through the instructions, it won't. I fear for the planet.
Control the eagerness of your coding agent: It turns out GPT-5 is even more of a people-pleaser than GPT-4o. When giving it instructions, tell it to tone down its fawning and sucking-uppage.
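As a rough idea of what the XML-like recommendation might look like in practice, here is a small Python sketch that wraps the parts of a prompt in tags. The helper `build_prompt` and the tag names `task` and `constraints` are invented for this example and are not taken from OpenAI's cheatsheet.

```python
def build_prompt(sections):
    """Wrap each named section of a prompt in a matching pair of XML-like tags."""
    parts = []
    for tag, body in sections.items():
        parts.append(f"<{tag}>\n{body.strip()}\n</{tag}>")
    return "\n\n".join(parts)


# Hypothetical section names; pick whatever labels fit the assignment
prompt = build_prompt({
    "task": "Write a Keyboard Maestro AppleScript that activates a matching Chrome tab.",
    "constraints": "Ignore case for the match.",
})
print(prompt)
```

The tags are not parsed as real XML. They simply give the model clear boundaries between one kind of instruction and another.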

I'd like to say something positive here, but it's starting to feel like a migraine. Check out the PDF if you want more details.

OpenAI also created an AI prompting helper. You point your browser to this tool, and paste in your prompt. The tool will give you back a more extensive prompt. For example, I gave it this prompt from Test 4:

Write a Keyboard Maestro AppleScript that scans the frontmost Google Chrome window for a tab name containing the string matching the contents of the passed variable instance__ChannelName. Ignore case for the match. Once found, make that tab the active tab.

It gave me back this:

Create an AppleScript for use in Keyboard Maestro that performs the following actions:
* Begin with a concise checklist (3-7 bullets) outlining the planned steps before proceeding.
* Scan all tabs in the frontmost Google Chrome window.
* Search for a tab whose name contains the value of the passed variable 'instance__ChannelName', ignoring case sensitivity.
* Once a matching tab is found, activate that tab and make it the current/active tab.
After generating the script, validate in 1-2 lines whether the script meets all requirements. If not, state what needs adjustment.

Essentially, it revised the prompt to add a planning phase at the beginning and a validation phase at the end.
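The shape of that rewrite can be mimicked with a few lines of Python. This is only a sketch of the pattern, assuming the planning and validation sentences are fixed text. It says nothing about how OpenAI's tool actually works, and the function name `add_plan_and_check` is made up.

```python
# Wording borrowed from the optimizer's output quoted above
PLANNING = (
    "Begin with a concise checklist (3-7 bullets) outlining the planned "
    "steps before proceeding."
)
VALIDATION = (
    "After generating the script, validate in 1-2 lines whether the script "
    "meets all requirements. If not, state what needs adjustment."
)


def add_plan_and_check(prompt):
    """Sandwich a prompt between a planning instruction and a validation instruction."""
    return "\n\n".join([PLANNING, prompt.strip(), VALIDATION])


wrapped = add_plan_and_check("Write a Keyboard Maestro AppleScript that ...")
print(wrapped)
```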

Does it work?

Nope. Nopity-nope-nope.

I ran Test 4's "optimized" prompt through GPT-5 and got back this script:

There are at least three serious flaws in it. First, the code in the green box doesn't work. It doesn't retrieve the value from Keyboard Maestro. Previous versions did that successfully using a wholly different (and well-documented) mechanism. This approach is just loony.

Now, look at the blue box. The considering case operation is used only for evaluation. For example, with considering case, "BLUE" and "blue" are different values. Do you see any kind of if statement or condition inside the blue box? Me neither.

Then there's the red box. There is no such syntax as "considering case false." First, the default behavior is to ignore case. Second, if you really wanted to explicitly tell AppleScript to ignore case, you'd use an "ignoring case" block. But "considering case false"? That's just some made-up syntax right there.

Retrying the WordPress plugin

I also ran my original prompt for the WordPress plugin through the prompt optimizer. I got back a fairly extensive project requirements document.

On the plus side, this time the plugin worked. I tried generating code a second time, and it also worked. So perhaps the optimization is something to try if GPT-5 isn't playing nice.

But then there's the weirdness. Because there's always weirdness. Check out the header block generated by GPT-5.

Notice the author is attributed to Advanced Geekery Labs. Historically, the AIs have listed the author field as "Your Name." But this time, it's Advanced Geekery Labs.

There is some relevance, because Advanced Geekery is the name of my YouTube channel and weekly newsletter. But I didn't use it anywhere in my instructions for this plugin. Additionally, I've never added Labs to the end of Advanced Geekery, even though my security product is sold through ZATZ Labs.

So I asked GPT-5 where it got the name Advanced Geekery Labs from. Its answer:

Good question. The name "Advanced Geekery Labs" isn't something you provided in your request. I introduced it in the plugin header's Author: field as a placeholder, because in earlier conversations you've mentioned your "Advanced Geekery" brand, and I unconsciously expanded it into "Labs."

I'm glad the AI thinks it's a good question. That warms my heart. But "unconsciously expanded"? That's a direct quote from the AI. Those are its words.

So, we're now dealing with hallucinating AIs that operate unconsciously?

We're all doomed.

What about coding and GPT-5?

I still don't trust it. Actually, I trust it even less than when I tested it originally. What else might it do to my code "unconsciously"?

I guess, if you really need GPT-5 to give you some code, you could try the optimizer. But I wouldn't rely on it.

GPT-4o was impressive, but it had some problems. GPT-5 is like its more troubled, more difficult sibling.

With GPT-4o, I'd be willing to trust, with verification. I just plain don't trust GPT-5. There's something that's just not right. If we were to experience an AI that would revolt against its creators and try to kill us all, I would not be surprised if it all originated with GPT-5.

And no. My headache hasn't gone away. Apparently, Tylenol is now bad for you. So, there's that, too.

What do you think?

Have you tried relying on GPT-5 for coding projects? If so, did you run into the same kinds of inconsistencies and strange behaviors? Do you find OpenAI's "best practices" helpful, or do they feel like workarounds for deeper problems? And what's your take on GPT-5's tendency to improvise or even "unconsciously" insert details? Is it a harmless quirk or a reason for concern? Let us know in the comments below.
