ChatGPT in Health and Medicine

Mar 21, 2023

Hype or a Gamechanger?

Key Takeaways

ChatGPT has ushered in a new round of hype about disruption, but we should be cautious about how deep its impact on healthcare will be. There is a large gap between the “intelligence” that drives clinical judgement and the “intelligence” in generative AI tools such as ChatGPT. These tools have a long way to go before clinical adoption and before they meet the growing consensus on standards for responsible and trustworthy AI.

The first wave of system-wide adoption is currently being proven out in administrative applications that reduce work burdens and administrative waste. For many administrative tasks, these large language models may find utility alongside robotic process automation (RPA), AI, and other automation tools over the next year or two. The question will be which tools work best for specific administrative functions while integrating easily into existing technologies and workflows.

Sustainability of large language models is a serious problem. The power consumption required merely to train these models is a challenge, given the trend toward Net Zero and more sustainable computing architectures. This exacerbates equity concerns about who can afford to innovate in AI as the cost of computing grows.

Introduction

Last year, I wrote a report examining best practices and frameworks for the responsible development of AI tools for the healthcare enterprise, AI and Trust in Healthcare. Given the hyperbolic rise of this topic in everyday conversations, we deemed a discussion of ChatGPT in the context of this research warranted.

By now everyone has seen the hype around the now-ubiquitous Chat Generative Pre-trained Transformer (ChatGPT) as the next new thing coming from Silicon Valley’s OpenAI labs. The frenzy has driven conversations ranging from heated debates about copyright law and the downfall of creative professions, to concerns that bots will help the next generation of students cheat on exams, to the death of quality journalism. As with other technological advancements, we will see waves of criticism, the usual schadenfreude-infused calls for the disruption of everything, and a lot of mediocre startups claiming to disrupt their chosen industry until we begin to see the real possibilities of ChatGPT, one way or another, through the fog of bullshit (we’ll get back to what we mean by this shortly).

PLOS Digital Health has already published a study showing that ChatGPT can pass the US Medical Licensing Exam (USMLE). Specifically, it scored at or near the passing threshold on all three steps of the exam without any additional or special training. The authors assert that this proves the tool’s potential role in medical education and/or clinical decision-making. The provider platform Doximity also launched a beta version of its own ChatGPT tool for doctors, DocsGPT, which targets administrative tasks such as faxing preauthorizations, charting, and appeals letters. Doximity is going after the roughly 70% of doctors who still use fax machines.

What ChatGPT isn’t

When we hear the hype about AI in general or about particular tools such as ChatGPT, it is important to go back to what intelligence really is and compare that to what ChatGPT does. Intelligence is an area of great interest in AI and ethics debates. Gary Marcus and Emily Bender are known for responding to the first wave of hype around new AI developments by pointing out the differences between AI tools and true human intelligence. For example, common sense is far more complex than the signifier ‘common’ suggests. It is built on years of human learning, interpretation, and other complex reasoning skills, including judgement, that an inanimate algorithm following programmed rules cannot replicate. This is why many are skeptical that autonomous vehicles will ever become mainstream.

Merriam-Webster defines ‘intelligence’ as:

a: the ability to learn or understand or to deal with new or trying situations

b: the ability to apply knowledge to manipulate one’s environment or to think abstractly as measured by objective criteria (as tests)

Let’s return to the bullshit part. A number of commentators on ChatGPT have been deploying philosopher Harry Frankfurt’s definition of bullshit as “speech that is intended to persuade without concern for the truth.” Arvind Narayanan and Sayash Kapoor at Princeton use this definition to argue that ChatGPT is the greatest bullshitter ever. That is, LLMs are very good at producing what they refer to as “plausible text” but not true statements.

This is a key consideration when thinking about where to apply these models and tools: there is no ground truth in their training. The models do not think or exercise expert judgement. It is therefore critical to be careful about the types of tasks we ask LLMs to solve. Routine administrative tasks and paperwork that are rules-based, or whose outputs can be easily checked for errors, are the low-hanging fruit. We already have plenty of AI and non-AI bots and automation tools in use in these domains (see UiPath, Notable Health, Infinitus, etc.).
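To make the “easily checked for errors” point concrete, here is a minimal sketch (in Python) of what a rules-based check on an LLM-drafted document might look like. The field names and rules are purely illustrative assumptions, not drawn from any real payer or vendor specification.

# Hypothetical rules-based check on an LLM-drafted prior-authorization request.
# Field names and rules are illustrative only.
REQUIRED_FIELDS = {"patient_name", "member_id", "cpt_code", "icd10_code", "ordering_provider"}

def validate_prior_auth(draft: dict) -> list[str]:
    """Return a list of problems found in an LLM-generated draft."""
    problems = []
    # Every required field must be present and non-empty.
    for field in sorted(REQUIRED_FIELDS):
        if not draft.get(field):
            problems.append(f"missing field: {field}")
    # A simple format check that requires no clinical judgement.
    cpt = draft.get("cpt_code", "")
    if cpt and not (cpt.isdigit() and len(cpt) == 5):
        problems.append(f"CPT code looks malformed: {cpt!r}")
    return problems

# A draft with a missing member ID fails before it ever reaches a payer.
draft = {"patient_name": "Jane Doe", "cpt_code": "72148",
         "icd10_code": "M54.16", "ordering_provider": "Dr. Smith"}
print(validate_prior_auth(draft))  # ['missing field: member_id']

Checks like these are cheap, deterministic, and auditable, which is exactly why administrative drafting is a more forgiving place for LLMs than clinical reasoning.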

But clinical decision support and the actual practice of medicine may be another matter. In December, Google/DeepMind released a medical LLM called Med-PaLM. The model was trained on seven question-answering datasets spanning medical exam questions to consumer health queries. The developers of Med-PaLM are targeting use cases including knowledge retrieval, summarization of key findings, clinical decision support, and primary care triage. They note that it does not yet perform as well as clinicians.

So, what does this mean? The published comparisons show the following differences between Med-PaLM and clinicians:

  • Incorrect retrieval of information: 16.9% for Med-PaLM vs. 3.6% for clinicians
  • Incorrect reasoning: 10.1% of Med-PaLM answers vs. 2.1% for clinicians
  • Incorrect comprehension: 18.7% for Med-PaLM vs. 2.2% for clinicians

These are substantial differences; many would not consider this good enough for clinical practice yet. (Many AI experts would also argue that this is the wrong way to think about AI’s role in medicine: AI will augment physicians’ clinical judgement, not replace it.)
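To put those gaps in perspective, here is a quick back-of-the-envelope comparison in Python using only the figures listed above (a sketch, not a substitute for the evaluation methodology in the Med-PaLM paper):

# Error rates reported above (percent): Med-PaLM vs. clinicians
error_rates = {
    "incorrect retrieval":     (16.9, 3.6),
    "incorrect reasoning":     (10.1, 2.1),
    "incorrect comprehension": (18.7, 2.2),
}
for category, (model, clinician) in error_rates.items():
    print(f"{category}: {model / clinician:.1f}x the clinician error rate")
# incorrect retrieval: 4.7x the clinician error rate
# incorrect reasoning: 4.8x the clinician error rate
# incorrect comprehension: 8.5x the clinician error rate

In other words, the model makes these classes of error roughly five to eight times as often as clinicians do.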

Beyond the Accuracy Challenge

Beyond the question of whether we can trust foundation models to provide accurate answers or better diagnoses (the answer is maybe, but there is still a long way to go for clinical practice), there is another major problem with foundation models: the energy consumed to train them is enormous. In an era when sustainability has become an important consideration across all industries, LLMs and other foundation models are almost unconscionable energy hogs. Estimates put the CO2 produced in training each of these models at 25 to 500 metric tons, depending on the efficiency of the processors and the energy source. For comparison, an average vehicle produces 57 metric tons of CO2 over its entire lifetime.
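A quick sketch of that arithmetic, using only the figures cited above:

# Rough comparison in metric tons of CO2, using the estimates cited above
training_low, training_high = 25, 500   # per training run, depending on hardware and energy source
car_lifetime = 57                       # average vehicle, over its entire lifetime
print(f"one training run is roughly {training_low / car_lifetime:.1f} to "
      f"{training_high / car_lifetime:.1f} car-lifetimes of CO2")
# one training run is roughly 0.4 to 8.8 car-lifetimes of CO2

That is, a single training run at the high end of the estimate emits as much CO2 as several cars do over their entire lifetimes.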

Figure 2: Growing power consumption with foundation models (Source: Economist)

The power consumption problem is not only about sustainability; it is also an equity issue. Only the largest academic institutions with well-funded AI programs (most often via lucrative corporate grants or partnerships) and the largest companies have the resources to train these models. This raises concerns that the ability to build LLMs will be limited to those with access to sufficient capital and/or partnerships with big tech, while smaller universities, for example, will be locked out.

Conclusion

So where do we see these large language models going in healthcare, particularly our tool du jour, ChatGPT? The answer is that it is too early to tell, and there are plenty of reasons to be skeptical of the hype. Yes, LLMs are undoubtedly an advance that will reshape the conversational AI space and offer potential for tools that address administrative waste, which is critically important in an era of workforce shortages. But for high-risk tasks where accuracy is vital, we still have a long way to go.

Use of these tools still requires a clinician to discern whether answers are accurate, or just bullshit (so to speak). Healthcare has been burned plenty of times in recent decades by new technologies whose marketing got ahead of the science, and the AI space already has some high-profile failures. These latest shiny new things will almost certainly lose their luster before we see robust applications deployed in clinical settings.

Like many AI/ML tools in this space, LLMs will best be viewed as augmented intelligence or wayfinding aids that help clinicians work through difficult cases and complement, rather than replace, human judgement. Building that trust, however, will be a long road.

In last year’s report, AI and Trust in Healthcare, I highlighted the growing consensus on what constitutes responsible AI in healthcare. The pillars of the approach include validation/verification, bias mitigation, transparency/explainability, fairness, and health equity. By these criteria alone, we are a long way from ChatGPT being used effectively in clinical care. The broad-based hype of recent weeks may come back to haunt us in the coming months as premature experiments are rolled out in the name of advancing the technology, without the requisite guardrails fully in place.
