OpenAI blames its massive ChatGPT outage on a ‘new telemetry service’

OpenAI is blaming one of the longest outages in its history on a “new telemetry service” gone awry.

On Wednesday, OpenAI’s AI-powered chatbot platform, ChatGPT; its video generator, Sora; and its developer-facing API experienced major disruptions starting at around 3 p.m. Pacific. OpenAI acknowledged the problem soon after and began working on a fix. But it’d take the company roughly three hours to restore all services.

In a postmortem published late Thursday, OpenAI wrote that the outage wasn’t caused by a security incident or recent product launch, but by a telemetry service it deployed Wednesday to collect Kubernetes metrics. Kubernetes is an open source program that helps manage containers, or packages of apps and related files that are used to run software in isolated environments.

“Telemetry services have a very wide footprint, so this new service’s configuration unintentionally caused … resource-intensive Kubernetes API operations,” OpenAI wrote in the postmortem. “[Our] Kubernetes API servers became overwhelmed, taking down the Kubernetes control plane in most of our large [Kubernetes] clusters.”

That’s a lot of jargon, but basically, the new telemetry service affected OpenAI’s Kubernetes operations, including a resource that many of the company’s services rely on for DNS resolution. DNS resolution converts domain names to IP addresses; it’s the reason you’re able to type “Google.com” instead of “142.250.191.78.”

OpenAI’s use of DNS caching, which holds info about previously-looked-up domain names (like website addresses) and their corresponding IP addresses, complicated matters by “delay[ing] visibility,” OpenAI wrote, and “allowing the rollout [of the telemetry service] to continue before the full scope of the problem was understood.”
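To see why a cache can hide a failure for a while, here is a minimal sketch in Python. It is not OpenAI’s setup: the `TTLDnsCache` class, the 30-second lifetime, the `service.internal` name, and the `10.0.0.7` address are all assumptions made up for illustration. The point is only that cached answers keep working after the backend behind them has failed, so the breakage surfaces when entries expire.

```python
from typing import Callable, Dict, List, Tuple


class TTLDnsCache:
    """Toy DNS cache (hypothetical): remembers name -> IP answers for ttl seconds."""

    def __init__(self, resolve: Callable[[str], str], ttl: float,
                 clock: Callable[[], float]):
        self._resolve = resolve
        self._ttl = ttl
        self._clock = clock
        self._entries: Dict[str, Tuple[str, float]] = {}  # name -> (ip, expiry)

    def lookup(self, name: str) -> str:
        now = self._clock()
        entry = self._entries.get(name)
        if entry is not None and now < entry[1]:
            return entry[0]  # served from cache; the backend is not consulted
        ip = self._resolve(name)  # may raise if the backend is down
        self._entries[name] = (ip, now + self._ttl)
        return ip


def simulate() -> List[Tuple[float, str]]:
    """Return (time, outcome) pairs for lookups before and after a backend failure."""
    state = {"now": 0.0, "backend_up": True}

    def resolver(name: str) -> str:
        if not state["backend_up"]:
            raise RuntimeError("DNS backend unavailable")
        return "10.0.0.7"  # made-up address for illustration

    cache = TTLDnsCache(resolver, ttl=30.0, clock=lambda: state["now"])
    timeline: List[Tuple[float, str]] = []
    for t in (0.0, 10.0, 20.0, 40.0):
        state["now"] = t
        if t >= 10.0:
            state["backend_up"] = False  # the backend fails at t=10
        try:
            timeline.append((t, cache.lookup("service.internal")))
        except RuntimeError:
            timeline.append((t, "FAILED"))
    return timeline
```

In this toy run, the backend breaks at the 10-second mark, yet lookups at 10 and 20 seconds still succeed from the cache. The first visible failure comes at 40 seconds, after the cached entry has expired.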

OpenAI says that it was able to detect the issue “a few minutes” before customers ultimately started seeing an impact, but that it wasn’t able to quickly implement a fix because it had to work around the overwhelmed Kubernetes servers.

“This was a confluence of multiple systems and processes failing simultaneously and interacting in unexpected ways,” the company wrote. “Our tests didn’t catch the impact the change was having on the Kubernetes control plane [and] remediation was very slow because of the locked-out effect.”

OpenAI says that it’ll adopt several measures to prevent similar incidents, including improved phased rollouts with better monitoring for infrastructure changes and new mechanisms to ensure that OpenAI engineers can access the company’s Kubernetes API servers under any circumstances.
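For readers unfamiliar with the term, a phased rollout ships a change to a small group first and only continues if that group stays healthy. The Python sketch below shows the general idea; it is not OpenAI’s tooling. The `phased_rollout` function, its `deploy` and `is_healthy` callbacks, and the batch sizes are hypothetical names and values chosen for illustration.

```python
from typing import Callable, List, Sequence


def phased_rollout(
    clusters: Sequence[str],
    deploy: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    phase_sizes: Sequence[int] = (1, 2, 4),
) -> List[str]:
    """Deploy in growing batches and stop as soon as a batch looks unhealthy.

    Returns the clusters that received the change.
    """
    deployed: List[str] = []
    remaining = list(clusters)
    sizes = list(phase_sizes)
    while remaining:
        # Once the configured sizes run out, the final phase takes everything left.
        size = sizes.pop(0) if sizes else len(remaining)
        batch, remaining = remaining[:size], remaining[size:]
        for cluster in batch:
            deploy(cluster)
            deployed.append(cluster)
        if not all(is_healthy(c) for c in batch):
            break  # halt: later phases never receive the change
    return deployed
```

A gate like this is only as good as its health check. As the postmortem’s point about DNS caching suggests, a check that runs before the damage becomes visible would let the rollout continue, so a real system would presumably also wait and watch control-plane load between phases.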

“We apologize for the impact that this incident caused to all of our customers – from ChatGPT users to developers to businesses who rely on OpenAI products,” OpenAI wrote. “We’ve fallen short of our own expectations.”



Lisa Holden
Lisa Holden is a news writer for LinkDaddy News. She writes health, sport, tech, and more. Some of her favorite topics include the latest trends in fitness and wellness, the best ways to use technology to improve your life, and the latest developments in medical research.
