New GPT-5.4 clobbers humans on pro-level work in OpenAI's tests – by 83%

GPT-5.4 model shown on an orange and blue cloud background. (Image: OpenAI)

ZDNET's key takeaways

GPT-5.4's 83% score suggests AI rivals expert professionals.
Tests span nine industries and 44 real-world occupations.
New capabilities boost coding, tools, and computer control.

It seems like only yesterday that OpenAI released its GPT-5.2 model to the world. In fact, it's been less than three months. Thursday, OpenAI is releasing the thinking model of GPT-5.4.

What exactly does that mean? In this article, I'll briefly touch on the official announcement and availability details, and then I'll dive into what I think is the most startling detail: GPT-5.4 can match or outperform human professionals 83% of the time, according to OpenAI.

(Disclosure: Ziff Davis, ZDNET's parent company, filed an April 2025 lawsuit against OpenAI, alleging it infringed Ziff Davis copyrights in training and operating its AI systems.)

Availability details

OpenAI says GPT-5.4 is “the most capable and efficient frontier model for complex professional work.” Within ChatGPT, the company calls this model GPT-5.4 Thinking. It is also being released in the API, within the Codex programming tool, and in a GPT-5.4 Pro version.

In terms of overall performance, the company says that GPT-5.4 is “18% less likely to contain errors, and individual claims are 33% less likely to be false compared to GPT-5.2, based on prompts where users previously flagged factual mistakes.”

It's always nice when an extremely powerful artificial intelligence makes stuff up less frequently.

As for availability, the company will offer GPT-5.4 via API on Friday. It will be “rolling out” across ChatGPT paid tiers and in Codex, which presumably means it will show up fairly soon for most users.

But what about GPT-5.3?

It gives me no joy to say this, but OpenAI's naming conventions give me a headache. When it comes to naming, it feels like it fired all its experienced product managers and replaced them with a GPT-3.5 instance from 2022.

So, OK, OpenAI released GPT-5.3-Codex last month. That's the first version of Codex that used itself to help build itself. Skynet, anyone?

Then, two days ago…two days ago it released GPT-5.3 Instant. This, according to the company, “makes everyday conversations more consistently helpful and fluid.” It's available to all users of ChatGPT. In the API, it's released as gpt-5.3-chat-latest. Not gpt-5.3-chat-instant, because that would make too much sense.

And now, we have GPT-5.4. So in the space between Tuesday and Thursday, OpenAI has released a GPT-5.3 and a GPT-5.4 model. You'd have to be an AI to keep track of it all.

Because such crimes against coherent versioning make me twitchy, I had to ask the OpenAI communications team about it. They were patient and kind enough to answer:

GPT-5.4 is our first mainline reasoning model that incorporates the frontier coding capabilities of gpt-5.3-codex, and that is rolling out across ChatGPT, the API, and Codex. We're calling it GPT-5.4 to reflect that jump, and to simplify the choice between models when using Codex. Over time, you can expect our Instant models and Thinking models to evolve at different speeds.

I still don't like it. If Instant and Thinking are really two separate products, they should have completely separate versioning. 5.3 and 5.4 are too close and too confusing. If they're considered to be different variants of the same product, they should share version numbers.

But hey. OpenAI is worth something on the order of $840 billion, and I own a 14-year-old Ford. What do I know? Let's move on to the part where we all worry about our job security.

Testing real-world AI ability

In September, OpenAI introduced a new AI evaluation test called GDPval. It's a test designed to measure how well AI models perform “economically valuable, real-world tasks.”

The test measures performance across nine industries and 44 occupations. OpenAI chose industries that each contribute 5% or more to the US gross domestic product. Within each industry, the company selected up to five occupations, limiting the pool to those with less than 40% physical or manual work and picking the ones that account for the highest total wages and compensation.
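The selection rules described above can be sketched in a few lines of Python. This is only an illustration of the stated criteria, not OpenAI's actual procedure: the `Occupation` fields, the `select_occupations` function, and all sample figures are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Occupation:
    name: str
    industry: str
    physical_share: float  # fraction of the job that is physical or manual work
    total_wages: float     # total wages paid across the occupation (made-up units)


def select_occupations(occupations, industry_gdp_share,
                       min_gdp_share=0.05, max_physical=0.40, per_industry=5):
    """Apply the three stated filters: industry size, physical work, total wages."""
    selected = {}
    for industry, share in industry_gdp_share.items():
        # Keep only industries contributing 5% or more of GDP.
        if share < min_gdp_share:
            continue
        # Keep only occupations with less than 40% physical or manual work.
        candidates = [o for o in occupations
                      if o.industry == industry and o.physical_share < max_physical]
        # Rank by total wages and take up to five per industry.
        candidates.sort(key=lambda o: o.total_wages, reverse=True)
        selected[industry] = [o.name for o in candidates[:per_industry]]
    return selected


# Hypothetical sample data, for illustration only.
sample = [
    Occupation("Software developers", "Professional services", 0.05, 900.0),
    Occupation("Lawyers", "Professional services", 0.05, 700.0),
    Occupation("Surveying technicians", "Professional services", 0.60, 100.0),
    Occupation("Farm managers", "Agriculture", 0.30, 50.0),
]
gdp_share = {"Professional services": 0.08, "Agriculture": 0.01}
print(select_occupations(sample, gdp_share))
```

In this made-up sample, the agriculture industry drops out on the GDP threshold and the surveying role drops out on the physical-work threshold, which is the same kind of narrowing that leads to a list dominated by knowledge work.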

It basically picked a cross-section of knowledge-related jobs where AI could have the most impact “on real-world productivity.” The intent was that the GPT models could help professionals get more done, but it's not too big a leap to infer that these occupations are also the most at risk from AI replacement or augmentation.

Here's how those occupations fit into their industries.

Finance and insurance: Customer service representatives; financial and investment analysts; financial managers; personal financial advisors; securities, commodities, and financial services sales agents
Retail trade: Pharmacists; first-line supervisors of retail sales workers; general and operations managers; private detectives and investigators
Wholesale trade: Sales managers; order clerks; first-line supervisors of non-retail sales workers; sales representatives (wholesale and manufacturing, except technical and scientific products); sales representatives (wholesale and manufacturing, technical and scientific products)
Real estate and rental and leasing: Concierges; property, real estate, and community association managers; real estate sales agents; real estate brokers; counter and rental clerks
Government: Recreation workers; compliance officers; first-line supervisors of police and detectives; administrative services managers; child, family, and school social workers
Manufacturing: Mechanical engineers; industrial engineers; buyers and purchasing agents; shipping, receiving, and inventory clerks; first-line supervisors of production and operating workers
Professional, scientific, and technical services: Software developers; lawyers; accountants and auditors; computer and information systems managers; project management specialists
Health care and social assistance: Registered nurses; nurse practitioners; medical and health services managers; first-line supervisors of office and administrative support workers; medical secretaries and administrative assistants
Information: Audio and video technicians; producers and directors; news analysts, reporters, and journalists; film and video editors; editors

I could get picky about which occupations are the most impactful in the various industries, but this selection is a good one for testing model performance overall.

The tests themselves are interesting in both how they are constructed and how they are measured.

OpenAI worked with experienced professionals in each occupation to create a set of tasks that “reflect their day-to-day work.” The task sets all went through many rounds of expert review and resulted in a series of fully reviewed, complex tasks per industry.

One of the manufacturing engineer tasks, for example, involves the design of a jig (which guides a tool) or a fixture (which holds the work) to simplify the reeling in and reeling out of a cable spool for underground mining operations.

Grading for each of these tests was done by human professionals in each of the occupations. The graders weren't told whether the results were from the AI, or from other professionals in their fields.
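To make the scoring concrete, here is a minimal Python sketch of how blind head-to-head verdicts could be rolled up into a single "matches or exceeds" percentage. The article does not spell out OpenAI's exact aggregation, so the verdict labels, the `win_or_tie_rate` function, and the sample data are all assumptions for illustration.

```python
from collections import Counter

VALID_VERDICTS = {"win", "tie", "loss"}  # recorded from the AI's point of view


def win_or_tie_rate(verdicts):
    """Fraction of tasks where the grader rated the AI's work equal to or better than the human's."""
    counts = Counter(verdicts)
    unknown = set(counts) - VALID_VERDICTS
    if unknown:
        raise ValueError(f"unexpected verdicts: {sorted(unknown)}")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no verdicts to score")
    return (counts["win"] + counts["tie"]) / total


# Hypothetical verdicts for ten graded tasks, not real GDPval data.
sample = ["win"] * 6 + ["tie"] * 2 + ["loss"] * 2
print(f"{win_or_tie_rate(sample):.0%}")  # prints 80%
```

Note that ties count toward the headline number in this sketch, which is why a "matches or exceeds" rate is not the same thing as an outright win rate.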

Additionally, OpenAI built an automated grading system based on the work of the human graders, so that the humans don't have to take their time grading each iteration of the AI model. I'm sure OpenAI constructed this automated system with all appropriate safeguards, but I worry that some level of inherent bias might be possible when letting an AI grade the performance of an AI.

Ethan Mollick, associate professor and co-director of the Generative AI Lab at Wharton, describes the GDPval test as “probably the most economically relevant measure of AI ability.”

83% of the time

The speed of improvement is insane. GPT-5.1 was released in November and had a GDPval score of 38.8%. In December, just a month later, GPT-5.2 performance exploded to nearly double that, to 70.9%.

Professor Mollick described the importance of GDPval running on GPT-5.2. He said, “In head-to-head competition with human experts on tasks that require 4-8 hours for a human to do, GPT-5.2 wins 71% of the time as judged by other humans.”

Now, in early March, less than three months after GPT-5.2, GPT-5.4 matches or exceeds the performance of human professionals 83% of the time!
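The progression is easier to judge with the arithmetic spelled out. The following Python sketch uses only the scores quoted in this article (83.0 stands in for the rounded 83% figure), and the `step_changes` helper is an illustrative name, not anything from OpenAI.

```python
# GDPval scores as quoted in this article, in release order.
scores = [("GPT-5.1", 38.8), ("GPT-5.2", 70.9), ("GPT-5.4", 83.0)]


def step_changes(scores):
    """Compare each model with its predecessor: percentage-point gain and ratio."""
    changes = []
    for (prev_name, prev), (name, cur) in zip(scores, scores[1:]):
        changes.append({
            "from": prev_name,
            "to": name,
            "point_gain": round(cur - prev, 1),  # difference in percentage points
            "ratio": round(cur / prev, 2),       # new score relative to old score
        })
    return changes


for change in step_changes(scores):
    print(change)
```

By this arithmetic, the move from GPT-5.1 to GPT-5.2 was a gain of about 32 percentage points (a ratio of 1.83), while the move from GPT-5.2 to GPT-5.4 was about 12 points (a ratio of 1.17). The gains are getting smaller in absolute terms as the score approaches the ceiling.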

In other words, in roughly five out of every six cases where the same task was given to an experienced human pro and GPT-5.4, the AI either kept up with or blew past the human, at least according to its grader, which may have been human or AI.

Sit with that for a few minutes. We're not just talking about programming tasks. We're talking about a wide range of industries and a wider range of high-value occupations.

According to Daniel Swiecki, head of Artificial Intelligence Solutions at Walleye Capital, “On our toughest internal finance and Excel evaluations, GPT-5.4 outperformed prior models, improving accuracy by 30 percentage points. This step change in reliability materially expands our automation of model updates and scenario analyses for fundamental investors.”

The freaky thing is this sort of performance could take us in two directions. On the one hand, it could help augment human pros, giving experienced folks the ability to get more done, faster. On the other hand, it could well be seen as the harbinger of a time when the AI is simply replacing the humans in high-value, high-skill jobs.

The future is probably not going to be all one or all the other. But even as OpenAI takes a victory lap for its latest release, those of us who support our families based on a lifetime of skill building within those professions have to rock back on our heels, take deep, worried breaths, and hope for the best.

Speaking personally, my approach has been to learn all I can, as quickly as I can, and use AI as much as I can. That helps me describe all of this to you, but it also helps me augment my individual productivity using AI resources, particularly for programming.

But I worry. AI slop is a real thing, and as that slop gets better and better, each of us will be competing with a giant superbrain that never sleeps, never eats, and is improving at almost supernatural speed.

More capabilities

In addition to overall performance, GPT-5.4 improves on other core capabilities.

Tool use: GPT-5.4 improves how AI agents select and use external tools, enabling them to complete multi-step workflows more accurately and efficiently while reducing token usage.
Computer vision: The new model enhances visual understanding, allowing it to better interpret complex images, parse documents, and reason about visual information with higher accuracy.
Computer use capabilities: Within the API and Codex, GPT-5.4 introduces native computer-use abilities that let agents interact with software systems through screenshots, keyboard and mouse commands, and automated workflows across applications.
Coding: GPT-5.4 combines the coding strengths of GPT-5.3-Codex with improved reasoning and tool use, helping developers build, debug, and iterate on complex software tasks more effectively.

Stay tuned. GPT-5.4 Thinking will be in your ChatGPT interface shortly. Let the competition begin.

What do you think?

What do you think about GPT-5.4's claim that it can match or outperform human professionals 83% of the time? Does that seem like a meaningful benchmark for real-world work?

Have you started integrating AI into your own professional workflow? If so, where does it help the most or fall short? Looking ahead, do you see tools like this mostly augmenting human expertise, or eventually replacing parts of it?

Share your thoughts and experiences in the comments below.

You can follow my day-to-day project updates on social media. Be sure to subscribe to my weekly update newsletter, and follow me on Twitter/X at @DavidGewirtz, on Facebook at Facebook.com/DavidGewirtz, on Instagram at Instagram.com/DavidGewirtz, on Bluesky at @DavidGewirtz.com, and on YouTube at YouTube.com/DavidGewirtzTV.
