This Week in AI: Tech giants embrace synthetic data

Hiya, folks, welcome to TechCrunch’s regular AI newsletter. If you want this in your inbox every Wednesday, sign up here.

This week in AI, synthetic data rose to prominence.

OpenAI last Thursday introduced Canvas, a new way to interact with ChatGPT, its AI-powered chatbot platform. Canvas opens a window with a workspace for writing and coding projects. Users can generate text or code in Canvas, then, if necessary, highlight sections to edit using ChatGPT.

From a user perspective, Canvas is a big quality-of-life improvement. But what’s most interesting about the feature, to us, is the fine-tuned model powering it. OpenAI says it tailored its GPT-4o model using synthetic data to “enable new user interactions” in Canvas.

“We used novel synthetic data generation techniques, such as distilling outputs from OpenAI’s o1-preview, to fine-tune the GPT-4o to open canvas, make targeted edits, and leave high-quality comments inline,” ChatGPT head of product Nick Turley wrote in a post on X. “This approach allowed us to rapidly improve the model and enable new user interactions, all without relying on human-generated data.”

OpenAI isn’t the only Big Tech company increasingly relying on synthetic data to train its models.

In developing Movie Gen, a suite of AI-powered tools for creating and editing video clips, Meta partially relied on synthetic captions generated by an offshoot of its Llama 3 models. The company recruited a team of human annotators to fix errors in and add more detail to these captions, but the bulk of the groundwork was automated.

OpenAI CEO Sam Altman has argued that AI will someday produce synthetic data good enough to train itself, effectively. That would be advantageous for firms like OpenAI, which spends a fortune on human annotators and data licenses.

Meta has fine-tuned the Llama 3 models themselves using synthetic data. And OpenAI is said to be sourcing synthetic training data from o1 for its next-generation model, code-named Orion.

But embracing a synthetic-data-first approach comes with risks. As a researcher recently pointed out to me, the models used to generate synthetic data unavoidably hallucinate (i.e., make things up) and contain biases and limitations. These flaws manifest in the models’ generated data.

Using synthetic data safely, then, requires thoroughly curating and filtering it — as is the standard practice with human-generated data. Failing to do so could lead to model collapse, where a model becomes less “creative” — and more biased — in its outputs, eventually seriously compromising its functionality.
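What does curating and filtering synthetic data look like in practice? Here's a minimal, purely illustrative sketch: exact-duplicate removal plus a couple of heuristic quality checks. Real pipelines are far more sophisticated (classifier-based quality scoring, n-gram deduplication, bias audits), and the thresholds below are made-up placeholders, not anyone's production settings.

```python
def curate(samples, min_len=20, max_len=2000):
    """Keep synthetic text samples that pass simple quality heuristics."""
    seen = set()
    kept = []
    for text in samples:
        normalized = " ".join(text.split()).lower()
        if normalized in seen:
            continue  # drop exact duplicates
        if not (min_len <= len(normalized) <= max_len):
            continue  # drop degenerate lengths
        words = normalized.split()
        if len(set(words)) < 0.3 * len(words):
            continue  # drop highly repetitive text (a hallmark of collapse)
        seen.add(normalized)
        kept.append(text)
    return kept
```

The repetition check is the interesting one: as model collapse sets in, generated text tends toward low lexical diversity, so a distinct-word ratio is a crude but cheap early-warning filter.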

This isn’t an easy task at scale. But with real-world training data becoming more costly (not to mention challenging to obtain), AI vendors may see synthetic data as the sole viable path forward. Let’s hope they exercise caution in adopting it.

News

Ads in AI Overviews: Google says it’ll soon begin to show ads in AI Overviews, the AI-generated summaries it supplies for certain Google Search queries.

Google Lens, now with video: Lens, Google’s visual search app, has been upgraded with the ability to answer near-real-time questions about your surroundings. You can capture a video via Lens and ask questions about objects of interest in the video. (Ads probably coming for this too.)

From Sora to DeepMind: Tim Brooks, one of the leads on OpenAI’s video generator, Sora, has left for rival Google DeepMind. Brooks announced in a post on X that he’ll be working on video generation technologies and “world simulators.”

Fluxing it up: Black Forest Labs, the Andreessen Horowitz-backed startup behind the image generation component of xAI’s Grok assistant, has launched an API in beta — and released a new model.

Not so transparent: California’s recently passed AB-2013 bill requires companies developing generative AI systems to publish a high-level summary of the data that they used to train their systems. So far, few companies are willing to say whether they’ll comply. The law gives them until January 2026.

Research paper of the week

Apple researchers have been hard at work on computational photography for years, and an important aspect of that process is depth mapping. Originally this was done with stereoscopy or a dedicated depth sensor like a lidar unit, but those tend to be expensive, complex, and take up valuable internal real estate. Doing it strictly in software is preferable in many ways. That’s what this paper, Depth Pro, is all about.

Aleksei Bochkovskii et al. share a method for zero-shot monocular depth estimation with high detail: it works from a single camera image, needs no training on specific subjects (it handles a camel despite never having seen one), and captures even difficult details like tufts of hair. It’s almost certainly in use on iPhones right now (though probably an improved, custom-built version), but you can try a little depth estimation of your own using the code at this GitHub page.

Model of the week

Google has released a new model in its Gemini family, Gemini 1.5 Flash-8B, that it claims is among its most performant.

A “distilled” version of Gemini 1.5 Flash, which was already optimized for speed and efficiency, Gemini 1.5 Flash-8B costs 50% less to use, has lower latency, and comes with 2x higher rate limits in AI Studio, Google’s AI-focused developer environment.

“Flash-8B nearly matches the performance of the 1.5 Flash model launched in May across many benchmarks,” Google writes in a blog post. “Our models [continue] to be informed by developer feedback and our own testing of what is possible.”

Gemini 1.5 Flash-8B is well-suited for chat, transcription, and translation, Google says, or any other task that’s “simple” and “high-volume.” In addition to AI Studio, the model is also available for free through Google’s Gemini API, rate-limited at 4,000 requests per minute.
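If you're building against a cap like that 4,000-requests-per-minute limit, you generally want client-side pacing rather than hammering the endpoint and handling rejections. Here's a minimal sliding-window limiter sketch; it's generic (not tied to any Google SDK), and the injectable clock is just there to make it testable.

```python
import time
from collections import deque

class RateLimiter:
    """Sliding-window limiter: allow at most max_per_minute requests per minute."""

    def __init__(self, max_per_minute, clock=time.monotonic):
        self.max = max_per_minute
        self.clock = clock
        self.sent = deque()  # timestamps of recent requests

    def acquire(self):
        now = self.clock()
        # Forget requests older than one minute.
        while self.sent and now - self.sent[0] >= 60:
            self.sent.popleft()
        if len(self.sent) >= self.max:
            return False  # over the cap; caller should back off and retry
        self.sent.append(now)
        return True
```

In real code you'd sleep until the oldest timestamp ages out instead of returning False, but the bookkeeping is the same.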

Grab bag

Speaking of cheap AI, Anthropic has released a new feature, the Message Batches API, that lets devs process large numbers of AI model queries asynchronously for less money.

Similar to Google’s batching requests for the Gemini API, devs using Anthropic’s Message Batches API can send up to 10,000 queries per batch. Each batch is processed within a 24-hour window and costs 50% less than standard API calls.
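The back-of-the-envelope math on that discount is simple. The 50% figure is from Anthropic's announcement; the per-query cost below is a made-up placeholder, since actual spend depends on model choice and token counts.

```python
def batch_savings(n_queries, cost_per_query, discount=0.50):
    """Compare standard vs. batched cost for n_queries at a flat discount."""
    standard = n_queries * cost_per_query
    batched = standard * (1 - discount)
    return standard, batched

# A full 10,000-query batch at a hypothetical $0.01 per query:
standard, batched = batch_savings(10_000, cost_per_query=0.01)
```

At any flat per-query cost, a maxed-out batch simply halves the bill, which is why the feature targets high-volume jobs like dataset classification and model evals rather than latency-sensitive chat.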

Anthropic says that the Message Batches API is ideal for “large-scale” tasks like dataset analysis, classification of large datasets, and model evaluations. “For example,” the company writes in a post, “analyzing entire corporate document repositories — which might involve millions of files — becomes more economically viable by leveraging [this] batching discount.”

The Message Batches API is available in public beta with support for Anthropic’s Claude 3.5 Sonnet, Claude 3 Opus, and Claude 3 Haiku models.




Lisa Holden
Lisa Holden is a news writer for LinkDaddy News. She writes health, sport, tech, and more. Some of her favorite topics include the latest trends in fitness and wellness, the best ways to use technology to improve your life, and the latest developments in medical research.
