Hugging Face releases a benchmark for testing generative AI on health tasks

As I wrote recently, generative AI models are increasingly being brought to healthcare settings — in some cases prematurely, perhaps. Early adopters believe that they’ll unlock increased efficiency while revealing insights that’d otherwise be missed. Critics, meanwhile, point out that these models have flaws and biases that could contribute to worse health outcomes.

But is there a quantitative way to know how helpful — or harmful — a model might be when tasked with things like summarizing patient records or answering health-related questions?

Hugging Face, the AI startup, proposes a solution in a newly released benchmark test called Open Medical-LLM. Created in partnership with researchers at the nonprofit Open Life Science AI and the University of Edinburgh's Natural Language Processing Group, Open Medical-LLM aims to standardize the evaluation of generative AI models across a range of medical tasks.

Open Medical-LLM isn’t a from-scratch benchmark per se, but rather a stitching-together of existing test sets — MedQA, PubMedQA, MedMCQA and so on — designed to probe models for general medical knowledge and related fields, such as anatomy, pharmacology, genetics and clinical practice. The benchmark contains multiple choice and open-ended questions that require medical reasoning and understanding, drawing from material including U.S. and Indian medical licensing exams and college biology test question banks.
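At its simplest, scoring a model on the benchmark's multiple-choice portion reduces to checking whether the model picks each question's keyed answer. The sketch below is a toy illustration of that idea, not Hugging Face's actual evaluation harness; the item format, the `accuracy` helper, and the `always_a` stand-in model are all hypothetical.

```python
# Minimal sketch (assumed, not the real Open Medical-LLM harness): scoring a
# model's answers on multiple-choice items shaped like MedQA/MedMCQA questions.

def accuracy(items, predict):
    """Fraction of items where the model's choice matches the keyed answer."""
    correct = sum(1 for item in items if predict(item) == item["answer"])
    return correct / len(items)

# Two toy items in a MedQA-like shape: question, lettered options, answer key.
items = [
    {
        "question": "Which vitamin deficiency causes scurvy?",
        "options": {"A": "Vitamin A", "B": "Vitamin C",
                    "C": "Vitamin D", "D": "Vitamin K"},
        "answer": "B",
    },
    {
        "question": "Which organ produces insulin?",
        "options": {"A": "Liver", "B": "Spleen",
                    "C": "Pancreas", "D": "Kidney"},
        "answer": "C",
    },
]

# A stand-in "model" that always answers "A". A real evaluation would prompt
# an LLM with the question and options, then parse the letter it chooses.
def always_a(item):
    return "A"

print(accuracy(items, always_a))  # 0.0: neither keyed answer is "A"
```

In practice a leaderboard like this normalizes such per-dataset accuracies across many test sets; the single-number convenience is exactly what critics warn reads as more clinically meaningful than it is.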

“[Open Medical-LLM] enables researchers and practitioners to identify the strengths and weaknesses of different approaches, drive further advancements in the field and ultimately contribute to better patient care and outcome,” Hugging Face writes in a blog post.


Image Credits: Hugging Face

Hugging Face is positioning the benchmark as a “robust assessment” of healthcare-bound generative AI models. But some medical experts on social media cautioned against putting too much stock into Open Medical-LLM, lest it lead to ill-informed deployments.

On X, Liam McCoy, a resident physician in neurology at the University of Alberta, pointed out that the gap between the “contrived environment” of medical question-answering and actual clinical practice can be quite large.

Hugging Face research scientist Clémentine Fourrier — who co-authored the blog post — agreed.

“These leaderboards should only be used as a first approximation of which [generative AI model] to explore for a given use case, but then a deeper phase of testing is always needed to examine the model’s limits and relevance in real conditions,” Fourrier said in a post on X. “Medical [models] should absolutely not be used on their own by patients, but instead should be trained to become support tools for MDs.”

It brings to mind Google’s experience several years ago attempting to bring an AI screening tool for diabetic retinopathy to healthcare systems in Thailand.

As Devin reported in 2020, Google created a deep learning system that scanned images of the eye, looking for evidence of retinopathy — a leading cause of vision loss. But despite high theoretical accuracy, the tool proved impractical in real-world testing, frustrating both patients and nurses with inconsistent results and a general lack of harmony with on-the-ground practices.

It’s telling that, of the 139 AI-related medical devices the U.S. Food and Drug Administration has approved to date, none use generative AI. It’s exceptionally difficult to test how a generative AI tool’s performance in the lab will translate to hospitals and outpatient clinics, and — perhaps more importantly — how the outcomes might trend over time.

That’s not to suggest Open Medical-LLM isn’t useful or informative. The results leaderboard, if nothing else, serves as a reminder of just how poorly models answer basic health questions. But neither Open Medical-LLM nor any other benchmark is a substitute for carefully thought-out real-world testing.


Lisa Holden
Lisa Holden is a news writer for LinkDaddy News. She writes health, sport, tech, and more. Some of her favorite topics include the latest trends in fitness and wellness, the best ways to use technology to improve your life, and the latest developments in medical research.
