Timely and Promising

Large language models (LLMs) may help streamline clinical guideline development by dramatically reducing the time and cost required for systematic reviews, according to a pilot study from the American Gastroenterological Association (AGA).

Faster, cheaper study screening could allow societies to update clinical recommendations more frequently, improving alignment with the latest evidence, lead author Sunny Chung, MD, of Yale School of Medicine, New Haven, Connecticut, and colleagues reported.

“Each guideline typically requires 5 to 15 systematic reviews, making the process time-consuming (averaging more than 60 weeks) and costly (more than $140,000),” the investigators wrote in Gastroenterology. “One of the most critical yet time-consuming steps in systematic reviews is title and abstract screening. LLMs have the potential to make this step more efficient.”

To test this approach, the investigators developed, validated, and applied a dual-model LLM screening pipeline with human-in-the-loop oversight, focusing on randomized controlled trials in AGA guidelines. 

The system was built using the 2021 guideline on moderate-to-severe Crohn’s disease, targeting biologic therapies for induction and maintenance of remission. 

Using chain-of-thought prompting and structured inclusion criteria based on the PICO framework, the investigators deployed GPT-4o (OpenAI) and Gemini 1.5 Pro (Google DeepMind) as independent screeners, each assessing titles and abstracts according to standardized logic encoded in JavaScript Object Notation (JSON). This approach mimicked a traditional double-reviewer system.
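The report describes this workflow rather than releasing code, but its shape can be sketched roughly as follows. In this illustrative Python snippet, the PICO criteria, the prompt wording, the call_model placeholder standing in for the GPT-4o and Gemini 1.5 Pro APIs, and the rule of flagging a record whenever either model votes to include it are all assumptions made for the example, not details taken from the study.

```python
import json

# Minimal sketch of a dual-LLM title/abstract screen; not the authors' code.
# PICO_CRITERIA, the prompt wording, call_model(), and the "flag if either
# model says include" rule are illustrative assumptions.

PICO_CRITERIA = {
    "population": "adults with moderate-to-severe Crohn's disease",
    "intervention": "biologic therapy for induction or maintenance of remission",
    "comparator": "placebo or another active therapy",
    "outcome": "clinical remission or response",
    "design": "randomized controlled trial",
}

PROMPT = (
    "You are screening a title and abstract for a systematic review.\n"
    "Inclusion criteria (PICO): {criteria}\n"
    "Think through each criterion step by step, then answer only with JSON: "
    '{{"include": true or false, "reason": "<one sentence>"}}\n\n'
    "Title and abstract:\n{record}"
)

def screen(record: str, call_model) -> dict:
    """One independent LLM screener returning a structured verdict."""
    prompt = PROMPT.format(criteria=json.dumps(PICO_CRITERIA), record=record)
    return json.loads(call_model(prompt))  # call_model wraps a GPT-4o or Gemini call

def dual_screen(record: str, model_a, model_b) -> bool:
    """Mimic double review: send to human adjudication if either model includes it."""
    return any(screen(record, m)["include"] for m in (model_a, model_b))
```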

After initial testing, the pipeline was validated in a 2025 update of the same guideline, this time spanning 6 focused clinical questions on advanced therapies and immunomodulators. Results were compared against manual screening by 2 experienced human reviewers, with total screening time documented. 

The system was then tested across 4 additional guideline topics: fecal microbiota transplantation (FMT) for irritable bowel syndrome and Clostridioides difficile, gastroparesis, and hepatocellular carcinoma. A final test applied the system to a forthcoming guideline on complications of acute pancreatitis.

Across all topics, the dual-LLM system achieved 100% sensitivity in identifying randomized controlled trials (RCTs). For the 2025 update of the AGA guideline on Crohn’s disease, the models flagged 418 of 4,377 abstracts for inclusion, capturing all 25 relevant RCTs in just 48 minutes. Manual screening of the same dataset previously took almost 13 hours.
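Those counts also show where the efficiency comes from: the models do not need to match human precision, only to hand reviewers a much smaller pile that still contains every relevant trial. A quick check in Python using the figures reported above (the precision value is inferred from those counts, not reported by the authors):

```python
# Quick check on the Crohn's disease update numbers reported above; the
# precision figure is an inference from those counts, not a value given
# by the authors.
total_abstracts, flagged, relevant_rcts, captured = 4377, 418, 25, 25

sensitivity = captured / relevant_rcts      # 25/25 -> 100%
precision = captured / flagged              # 25/418 -> about 6%
human_workload = flagged / total_abstracts  # reviewers verify roughly 10% of the pile

print(f"sensitivity {sensitivity:.0%}, precision {precision:.1%}, "
      f"share left for human review {human_workload:.0%}")
```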

Comparable accuracy and time savings were observed for the other topics. 

The pipeline correctly flagged all 13 RCTs in 4,820 studies on FMT for irritable bowel syndrome, and all 16 RCTs in 5,587 studies on FMT for Clostridioides difficile, requiring 27 and 66 minutes, respectively. Similarly, the system captured all 11 RCTs in 3,919 hepatocellular carcinoma abstracts and all 18 RCTs in 1,578 studies on gastroparesis, completing each task in under 65 minutes. Early testing on the upcoming guideline for pancreatitis yielded similar results.

Cost analysis underscored the efficiency of this approach. At an estimated $175–200 per hour for expert screeners, traditional abstract screening would cost around $2,500 per review, versus approximately $100 for the LLM approach—a 96% reduction.
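Those figures are broadly consistent with one another, as a rough check shows; the study does not spell out its cost model, so the sketch below simply multiplies the quoted hourly rates by the roughly 13 hours of manual screening:

```python
# Rough consistency check of the cost comparison. The hourly rates and the
# ~13 hours of manual screening come from the article; the simple
# hours-times-rate model is an assumption.
manual_hours = 13
rate_low, rate_high = 175, 200
manual_cost = (manual_hours * rate_low, manual_hours * rate_high)  # ($2,275, $2,600)
llm_cost = 100
savings = 1 - llm_cost / 2500  # against the ~$2,500 figure -> 96%

print(manual_cost, f"{savings:.0%} reduction")
```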

The investigators cautioned that human oversight remains necessary to verify the relevance of studies flagged by the models. While the system’s sensitivity was consistent across topics, it also flagged articles that expert reviewers ultimately excluded. Broader validation will be required to assess performance across non-RCT study designs, such as observational or case-control studies, they added.

“As medical literature continues to expand, the integration of artificial intelligence into evidence synthesis processes will become increasingly vital,” Dr. Chung and colleagues wrote. “With further refinement and broader validation, this LLM-based pipeline has the potential to revolutionize evidence synthesis and set a new standard for guideline development.”

This study was funded by the National Institutes of Health’s National Institute of Diabetes and Digestive and Kidney Diseases. The investigators reported no conflicts of interest.
 
Ethan Goh, MD, executive director of the Stanford AI Research and Science Evaluation (ARISE) Network, described the AGA pilot as both timely and promising.

“I’m certainly bullish about the use case,” he said in an interview. “Their study design and application is also robust, so I would congratulate them.”

Goh, a general editor for BMJ Digital Health & AI, predicted “huge potential” in the strategy for both clinicians and the general population, who benefit from the most up-to-date guidelines possible.

“I believe that using AI can represent a much faster, more cost effective, efficient way of gathering all these information sources,” he said.

Still, humans will need to be involved in the process.

“[This AI-driven approach] will always need some degree of expert oversight and judgement,” Goh said. 

Speaking more broadly about automating study aggregation, Goh said AI may still struggle to determine which studies are most clinically relevant.

“When we use [AI models] to pull out medical references, anecdotally, I don’t think they’re always getting the best ones all the time, or even necessarily the right ones,” he said.

And as AI models grow more impressive, these shortcomings become less apparent, potentially lulling humans into overconfidence.

“Humans are humans,” Goh said. “We get lazy over time. That will be one of the challenges. As the systems get increasingly good, humans start to defer more and more of their judgment to them and say, ‘All right, AI, you’re doing good. Just do 100% automation.’ And then [people] start fact checking or reviewing even less.”

AI could also undermine automated reviews in another way: AI-generated publications that appear genuine, but aren’t, may creep into the dataset.

Despite these concerns, Goh concluded on an optimistic note. 

“I think that there are huge ways to use AI, tools, not to replace, but to augment and support human judgment,” he said.

Ethan Goh, MD, is senior research engineer and executive director of the Stanford AI Research and Science Evaluation (ARISE) Network at Stanford (Calif.) University. He declared no conflicts of interest.
