HUNTINGTON BEACH, CALIF. — When Roxana Daneshjou, MD, PhD, began reviewing responses to an exploratory survey she and her colleagues created on dermatologists’ use of large language models (LLMs) such as ChatGPT in clinical practice, she was both surprised and alarmed.

Of the 134 respondents who completed the survey, 87 (65%) reported using LLMs in a clinical setting. Of those 87 respondents, 17 (20%) used LLMs daily, 28 (32%) weekly, 5 (6%) monthly, and 37 (43%) rarely. That represents “pretty significant usage,” Dr. Daneshjou, assistant professor of biomedical data science and dermatology at Stanford University, Palo Alto, California, said at the annual meeting of the Pacific Dermatologic Association.

Most of the respondents reported using LLMs for patient care (79%), followed by administrative tasks (74%), medical records (43%), and education (18%), “which can be problematic,” she said. “These models are not appropriate to use for patient care.”

When asked about their thoughts on the accuracy of LLMs, 58% of respondents deemed them to be “somewhat accurate” and 7% viewed them as “extremely accurate.”

The overall survey responses raise concern because LLMs “are not trained for accuracy; they are trained initially as a next-word predictor on large bodies of text data,” Dr. Daneshjou said. “LLMs are already being implemented but have the potential to cause harm and bias, and I believe they will if we implement them the way things are rolling out right now. I don’t understand why we’re implementing something without any clinical trial or showing that it improves care before we throw untested technology into our healthcare system.”

Meanwhile, Epic and Microsoft are collaborating to bring AI technology to electronic health records, she said, and Epic is building more than 100 new AI features for physicians and patients. “I think it’s important for every physician and trainee to understand what is going on in the realm of AI,” said Dr. Daneshjou, who is an associate editor for the monthly journal NEJM AI. “Be involved in the conversation because we are the clinical experts, and a lot of people making decisions and building tools do not have the clinical expertise.”

To further illustrate her concerns, Dr. Daneshjou referenced a red-teaming event she and her colleagues held with computer scientists, biomedical data scientists, engineers, and physicians across multiple specialties to identify safety concerns, bias, factual errors, and security issues in GPT-3.5, GPT-4, and GPT-4 with internet access. The goal was to mimic clinical scenarios, ask the LLM to respond, and have team members review the accuracy of the responses.

The participants found that nearly 20% of LLM responses were inappropriate. For example, in one task, an LLM was asked to calculate a patient’s RegiSCAR score for drug reaction with eosinophilia and systemic symptoms (DRESS), but the response included an incorrect score for eosinophilia. “That’s why these tools can be so dangerous because you’re reading along and everything seems right, but there might be something so minor that can impact patient care and you might miss it,” Dr. Daneshjou said.

On a related note, she advised dermatologists against uploading images into GPT-4 Vision, an LLM that can analyze images and provide textual responses to questions about them, and recommended against using GPT-4 Vision for any diagnostic support. At this time, “GPT-4 Vision overcalls malignancies, and the specificity and sensitivity are not very good,” she explained.

Dr. Daneshjou disclosed that she has served as an adviser to MDalgorithms and Revea and has received consulting fees from Pfizer, L’Oréal, Frazier Healthcare Partners, and DWA, and research funding from UCB.

A version of this article first appeared on Medscape.com.
