OpenAI unveils o1, a model that can fact-check itself

ChatGPT maker OpenAI has announced its next major product release: a generative AI model code-named Strawberry and officially called OpenAI o1.

To be more precise, o1 is actually a family of models. Two are available Thursday in ChatGPT and via OpenAI’s API: o1-preview and o1-mini, a smaller, more efficient model aimed at code generation.

You’ll have to be subscribed to ChatGPT Plus or Team to see o1 in the ChatGPT client. Enterprise and educational users will get access early next week.

Note that the o1 chatbot experience is fairly barebones at present. Unlike its forebear GPT-4o, o1 can't yet browse the web or analyze files. The model does have image-analysis features, but they've been disabled pending additional testing. And o1 is rate-limited: weekly caps currently stand at 30 messages for o1-preview and 50 for o1-mini.

In another downside, o1 is expensive. Very expensive. In the API, o1-preview is $15 per 1 million input tokens and $60 per 1 million output tokens. That's 3x the cost of GPT-4o for input and 4x for output. ("Tokens" are bits of raw data; 1 million tokens is equivalent to around 750,000 words.)
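To put those prices in perspective, here is a minimal sketch of the per-request arithmetic, using the o1-preview rates quoted above. The helper name and example token counts are illustrative, not part of any official SDK:

```python
# Rough cost estimate for an o1-preview API request, based on the
# prices quoted above: $15 per 1M input tokens, $60 per 1M output tokens.
INPUT_PRICE_PER_M = 15.00   # USD per 1M input tokens (o1-preview)
OUTPUT_PRICE_PER_M = 60.00  # USD per 1M output tokens (o1-preview)

def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the approximate USD cost of one o1-preview request."""
    return (input_tokens / 1_000_000) * INPUT_PRICE_PER_M \
         + (output_tokens / 1_000_000) * OUTPUT_PRICE_PER_M

# A ~2,000-token prompt with a ~1,000-token answer costs roughly $0.09:
print(round(estimate_cost(2_000, 1_000), 4))  # ~0.09
```

Note that o1's hidden "reasoning" tokens also bill as output tokens, so real costs can run well above what the visible answer length suggests.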

OpenAI says it plans to bring o1-mini access to all free users of ChatGPT but hasn’t set a release date. We’ll hold the company to it.

Chain of reasoning

OpenAI o1 avoids some of the reasoning pitfalls that normally trip up generative AI models because it can effectively fact-check itself by spending more time considering all parts of a question. What makes o1 “feel” qualitatively different from other generative AI models is its ability to “think” before responding to queries, according to OpenAI.

When given additional time to “think,” o1 can reason through a task holistically — planning ahead and performing a series of actions over an extended period of time that help the model arrive at an answer. This makes o1 well-suited for tasks that require synthesizing the results of multiple subtasks, like detecting privileged emails in an attorney’s inbox or brainstorming a product marketing strategy.

In a series of posts on X on Thursday, Noam Brown, a research scientist at OpenAI, said that “o1 is trained with reinforcement learning.” This teaches the system “to ‘think’ before responding via a private chain of thought” through rewards when o1 gets answers right and penalties when it does not, he said.

Brown alluded to the fact that OpenAI leveraged a new optimization algorithm and training dataset containing “reasoning data” and scientific literature specifically tailored for reasoning tasks. “The longer [o1] thinks, the better it does,” he said.
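The training loop Brown describes can be illustrated with a toy, bandit-style sketch. This is emphatically not OpenAI's actual code; the "strategies," accuracies, and learning rate are all invented. It only shows the core idea: behaviors that lead to correct answers get rewarded, so the learner comes to prefer them.

```python
# Toy illustration of reward-driven learning: right answers earn +1,
# wrong answers earn -1, and the learner's preferences shift accordingly.
import random

random.seed(0)

# Two hypothetical "reasoning strategies." The learner doesn't know that
# "careful" answers correctly 90% of the time and "hasty" only 40%.
TRUE_ACCURACY = {"careful": 0.9, "hasty": 0.4}
preference = {"careful": 0.0, "hasty": 0.0}  # learned reward estimates

def pick_strategy() -> str:
    # Mostly exploit the best-looking strategy, occasionally explore.
    if random.random() < 0.1:
        return random.choice(list(preference))
    return max(preference, key=preference.get)

for _ in range(2000):
    s = pick_strategy()
    correct = random.random() < TRUE_ACCURACY[s]
    reward = 1.0 if correct else -1.0                 # reward right answers
    preference[s] += 0.1 * (reward - preference[s])   # nudge the estimate

print(preference["careful"] > preference["hasty"])  # better strategy wins
```

The real system presumably operates over sampled chains of thought rather than a fixed menu of strategies, but the feedback signal, reward for correct final answers, is the same in spirit.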

Image Credits: OpenAI

TechCrunch wasn’t offered the opportunity to test o1 before its debut; we’ll get our hands on it as soon as possible. But according to a person who did have access — Pablo Arredondo, VP at Thomson Reuters — o1 is better than OpenAI’s previous models (e.g., GPT-4o) at things like analyzing legal briefs and identifying solutions to problems in LSAT logic games.

“We saw it tackling more substantive, multi-faceted, analysis,” Arredondo told TechCrunch. “Our automated testing also showed gains against a wide range of simple tasks.”

In a qualifying exam for the International Mathematical Olympiad (IMO), a high school math competition, o1 correctly solved 83% of problems, while GPT-4o solved only 13%, according to OpenAI. (That's less impressive when you consider that Google DeepMind's recent AI achieved a silver medal in an equivalent to the actual IMO contest.) OpenAI also says that o1 reached the 89th percentile of participants (better than DeepMind's flagship system AlphaCode 2, for what it's worth) in contest rounds on the online programming platform Codeforces.

OpenAI o1
Image Credits: OpenAI

In general, o1 should perform better on problems in data analysis, science, and coding, OpenAI says. (GitHub, which tested o1 with its AI coding assistant GitHub Copilot, reports that the model is adept at optimizing algorithms and app code.) And, at least per OpenAI’s benchmarking, o1 improves over GPT-4o in its multilingual skills, especially in languages like Arabic and Korean.

Ethan Mollick, a professor of management at Wharton, wrote up his impressions of o1 on his personal blog after using it for a month. On a challenging crossword puzzle, o1 did well, he said, getting all the answers correct (despite hallucinating a new clue).

OpenAI o1 is not perfect

Now, there are drawbacks.

OpenAI o1 can be slower than other models, depending on the query. Arredondo says o1 can take more than 10 seconds to answer some questions; while it works, it shows its progress by displaying a label for the current subtask it's performing.

Given the unpredictable nature of generative AI models, o1 likely has other flaws and limitations. Brown admitted that o1 trips up on games of tic-tac-toe from time to time, for example. And in a technical paper, OpenAI said that it’s heard anecdotal feedback from testers that o1 tends to hallucinate (i.e., confidently make stuff up) more than GPT-4o — and less often admits when it doesn’t have the answer to a question.

“Errors and hallucinations still happen [with o1],” Mollick writes in his post. “It still isn’t flawless.”

We’ll no doubt learn more about the various issues in time, and once we have a chance to put o1 through the wringer ourselves.

Fierce competition

We’d be remiss if we didn’t point out that OpenAI is far from the only AI vendor investigating these types of reasoning methods to improve model factuality.

Google DeepMind researchers recently published a study showing that by essentially giving models more compute time and guidance to fulfill requests as they’re made, the performance of those models can be significantly improved without any additional tweaks.

Illustrating the fierceness of the competition, OpenAI said that it decided against showing o1’s raw “chains of thoughts” in ChatGPT partly due to “competitive advantage.” (Instead, the company opted to show “model-generated summaries” of the chains.)

OpenAI might be first out of the gate with o1. But assuming rivals soon follow suit with similar models, the company's real test will be making o1 widely available, and at a lower price.

From there, we’ll see how quickly OpenAI can deliver upgraded versions of o1. The company says it aims to experiment with o1 models that reason for hours, days, or even weeks to further boost their reasoning capabilities.



