Groups of physicians and trainees diagnose clinical cases with more accuracy than individuals, according to a study of solo and aggregate diagnoses collected through an online medical teaching platform.
“These findings suggest that using the concept of collective intelligence to pool many physicians’ diagnoses could be a scalable approach to improve diagnostic accuracy,” wrote lead author Michael L. Barnett, MD, of Harvard University in Boston and his coauthors, adding that “groups of all sizes outperformed individual subspecialists on cases in their own subspecialty.” The study was published online in JAMA Network Open.
This cross-sectional study examined 1,572 cases solved within the Human Diagnosis Project (Human Dx) system, an online platform for authoring and diagnosing teaching cases. The system presents real-life cases from clinical practices and asks respondents to generate ranked differential diagnoses. Cases are tagged for specialties based on both intended diagnoses and the top diagnoses chosen by respondents. All cases used in this study were authored between May 7, 2014, and October 5, 2016, and had 10 or more respondents.
Of the 2,069 attending physicians, fellows, residents, and medical students (users) who solved cases within the Human Dx system, 1,452 (70.2%) were trained in internal medicine, 1,228 (59.4%) were residents or fellows, 431 (20.8%) were attending physicians, and 410 (19.8%) were medical students. To create a collective differential, Dr. Barnett and his colleagues aggregated the responses of up to nine participants via a weighted combination of each clinician’s top three diagnoses, an approach they dubbed “collective intelligence.”
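The article does not report the exact weighting scheme, but the pooling step can be illustrated with a short sketch. The Python example below is a hypothetical illustration rather than the authors’ actual method: it gives each respondent’s first, second, and third choices descending weights (values chosen for illustration only) and re-ranks the pooled scores to form a collective differential.

```python
from collections import defaultdict

# Hypothetical rank weights: a respondent's top choice counts most.
# The study describes a weighted combination of each clinician's top
# three diagnoses, but the actual weights are not reported here.
RANK_WEIGHTS = [3.0, 2.0, 1.0]

def collective_differential(ranked_lists, group_size=9):
    """Pool up to `group_size` respondents' top-three differentials
    into one ranked collective differential (illustrative sketch)."""
    scores = defaultdict(float)
    for respondent in ranked_lists[:group_size]:
        for rank, diagnosis in enumerate(respondent[:3]):
            scores[diagnosis] += RANK_WEIGHTS[rank]
    # Sort pooled diagnoses by total weight, highest first.
    return sorted(scores, key=scores.get, reverse=True)

# Example: three clinicians' ranked differentials for one case.
responses = [
    ["pulmonary embolism", "pneumonia", "heart failure"],
    ["pneumonia", "pulmonary embolism", "bronchitis"],
    ["pulmonary embolism", "heart failure", "pneumonia"],
]
print(collective_differential(responses))
# -> ['pulmonary embolism', 'pneumonia', 'heart failure', 'bronchitis']
```

In this sketch, a diagnosis that several clinicians rank highly rises to the top of the collective list even if no single respondent ranked it first, which is the basic intuition behind pooling ranked differentials.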
The diagnostic accuracy for groups of nine was 85.6% (95% confidence interval, 83.9%-87.4%), compared with 62.5% for individual users (95% CI, 60.1%-64.9%), a difference of 23 percentage points (95% CI, 14.9%-31.2%; P less than .001). Groups of five were 17.8 percentage points more accurate than individuals (95% CI, 14.0%-21.6%; P less than .001), and groups of two were 12.5 percentage points more accurate (95% CI, 9.3%-15.8%; P less than .001). Taken together, these results point to an association between larger group size and greater diagnostic accuracy.
Individual specialists solved cases in their particular areas with a diagnostic accuracy of 66.3% (95% CI, 59.1%-73.5%), compared with nonmatched specialty accuracy of 63.9% (95% CI, 56.6%-71.2%). Groups, however, outperformed specialists across the board: 77.7% accuracy for a group of two (95% CI, 70.1%-84.6%; P less than .001) and 85.5% accuracy for a group of nine (95% CI, 75.1%-95.9%; P less than .001).
The coauthors outlined the limitations of their study, including the possibility that the users who contributed cases to Human Dx may not be representative of the medical community as a whole. They also noted that, while their 431 attending physicians constituted the “largest number ... to date in a study of collective intelligence,” trainees still made up almost 80% of users. In addition, they acknowledged that Human Dx was not designed to generate collective diagnoses or to assess collective intelligence; a platform created with that ability in mind might have returned different results. Finally, they were unable to assess whether greater diagnostic accuracy would have translated into changes in treatment, calling it “an important question for future work.”
The authors disclosed several conflicts of interest. One doctor reported receiving personal fees from Greylock McKinnon Associates; another reported receiving personal fees from the Human Diagnosis Project and serving as its nonprofit director during the study. A third reported consulting for a company that makes patient-safety monitoring systems and receiving compensation from a not-for-profit incubator, along with holding equity in three medical data and software companies.
SOURCE: Barnett ML et al. JAMA Netw Open. 2019 Mar 1. doi: 10.1001/jamanetworkopen.2019.0096.
Although this study from Barnett et al. is not the silver bullet for misdiagnosis, better understanding why physicians make mistakes is a necessary and valuable undertaking, according to Stephan D. Fihn, MD, of the University of Washington, Seattle.
In the past, the “correct” diagnostic approach included making a list of potential diagnoses and systematically ruling them out one by one, a process conveyed via clinicopathologic conferences in teaching hospitals. These, Dr. Fihn recalled, lasted until medical educators recognized them as “more ... theatrical events than meaningful teaching exercises” and understood that master clinicians did not actually think in the manner this approach modeled. Since then, the maturation of cognitive psychology and “a growing literature” have made diagnostic error seem like a common, sometimes unavoidable element of being human.
What can be done? Computerized diagnostic programs have long been one possibility, but “none have achieved the breadth of content and accuracy necessary to be adopted to any great extent,” Dr. Fihn wrote. Another option is crowdsourcing, as described in this study from Barnett and colleagues. Their approach has its pitfalls: A 62.5% level of diagnostic accuracy from individuals is not very high, which suggests either difficult cases or a preponderance of inexperienced clinicians who may benefit from collective intelligence even more. Regardless, he stated, “clinicians need to be cognizant of their own inherent limitations and acknowledge fallibility”; being humble and willing to seek advice “remain important, albeit imperfect, antidotes to misdiagnosis.”
These comments are adapted from an accompanying editorial (JAMA Netw Open. 2019 Mar 1. doi: 10.1001/jamanetworkopen.2019.1071). No conflicts of interest were reported.