Cranfield, UK, 4th June 2024. Good day to you, my fellow AI enthusiast. I am thinking about changing the format of these monthly newsletters to make them shorter and punchier, with more opinion on the key themes that have emerged during the month and some thoughts that will hopefully spur discussion within the community.
You can find the most recent generative AI (gen AI) in healthcare news that I have curated by following this link; the images containing the announcements are also displayed below for your reference. The theme I picked up on from this month’s announcements is the volume of foundation models released, so I would like to talk about how we should differentiate between these models and measure their success. Please note that we will focus on large language models (LLMs) that work with text rather than on multi-modal models, such as those from Gleamer and SmartAlpha, which deserve a deep dive of their own.
Medical LLMs and Benchmarks
This month’s announcements included one from John Snow Labs (JSL), which claimed to have achieved superior performance on medical question-answering benchmarks compared with competitors such as Google’s Med-PaLM 2 and GPT-4, whilst UAE-based M42 announced Med42, its own model focused on the medical domain.
We all know that medical LLMs, just like the standard ones, come in different shapes and sizes. It is probably a coincidence that both the JSL model and Med42 come in 70-billion-parameter variants, since the industry seems to be in love with 70B models. I have attached the names of the top 11 models on the Hugging Face leaderboard as of the end of May 2024; as you can tell from the letters 70B, a few of them have 70 billion parameters.
However, the underlying technology of the two models is different. The John Snow Labs model is proprietary, meaning we do not know what data sources were used to train it, whereas Med42 is a fine-tuned version of Llama 3, the most recent open-source model from Meta (Facebook). Both approaches have benefits and drawbacks that are being fiercely debated in the industry. Some individuals, including Yann LeCun, are pro-open, whilst others, who have a desire to draw in huge profits, would prefer a closed-source model; examples are the teams behind the closed-source models on the market, including big names such as OpenAI and Anthropic.
One challenge that arises is evaluating these models, as humans need a point of reference for decision-making. If I were a healthcare decision-maker (CIO, CTO, etc.), I would want to know which model is the best overall and which is best suited to my specific needs (or use cases). For instance, Hugging Face has a leaderboard that ranks LLMs by their average score across six benchmarks, whilst LMSYS has created a chatbot arena where chatbots are pitted against each other and human votes decide which LLM gave the more favourable answer.
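To make the difference between these two styles of ranking concrete, here is a minimal sketch. Everything in it is an assumption for illustration: the model names, benchmark names, scores and votes are made up, and a plain win rate stands in for whatever rating method the arena actually uses.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical scores (percent) on hypothetical benchmarks; not real results.
scores = {
    "model_a": {"bench_1": 70.0, "bench_2": 80.0, "bench_3": 60.0},
    "model_b": {"bench_1": 75.0, "bench_2": 65.0, "bench_3": 85.0},
}

def rank_by_average(scores):
    """Leaderboard style: rank models by their unweighted mean benchmark score."""
    averages = {model: mean(per_bench.values()) for model, per_bench in scores.items()}
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)

# Hypothetical arena votes, each recorded as (winner, loser) for one head-to-head.
votes = [("model_a", "model_b"), ("model_a", "model_b"), ("model_b", "model_a")]

def win_rates(votes):
    """Arena style: share of head-to-head comparisons each model won."""
    wins = defaultdict(int)
    games = defaultdict(int)
    for winner, loser in votes:
        wins[winner] += 1
        games[winner] += 1
        games[loser] += 1
    return {model: wins[model] / games[model] for model in games}

leaderboard = rank_by_average(scores)
arena = win_rates(votes)
print(leaderboard)
print(arena)
```

In this toy data, model_b leads on average benchmark score while model_a wins more of the human votes, which is exactly the kind of disagreement a decision-maker would need to interpret.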
That is all well and good, but these are domain-agnostic benchmarks; how about something relevant to the healthcare field, you may ask? Last month (April 2024), Hugging Face released the Open Medical-LLM Leaderboard, which benchmarks LLMs in healthcare. “The Leaderboard aims to track, rank and evaluate the performance of LLMs on medical question-answering tasks. It evaluates LLMs across diverse medical datasets, including MedQA (USMLE), PubMedQA, MedMCQA, and subsets of MMLU related to medicine and biology. The leaderboard aims to assess each model’s medical knowledge and question-answering capabilities comprehensively.” At the time of writing, only two of the top 12 models had a working hyperlink; the rest took you to an error page. I’m not sure why this is, but Hugging Face, please fix the issue… We have covered this topic in our recent insight here.
Some healthcare IT companies have created benchmarking tools for AI systems. For instance, K Health released K-QA, a curated benchmark of 1,212 de-identified questions asked by K Health’s users. My colleague Alan Stoddart has written a wonderful piece diving deeper into the knowledge agent developed by K Health.
More recently, Epic announced and released an open-source validation tool for AI applications used by healthcare systems (Alan also wrote about this). Epic will soon add generative AI tools to its product portfolio. The validation software is available on GitHub. The validation suite meets a need among the EHR vendor’s customers to assess the tools they are considering and understand their impact.
These evaluation tools require considerable effort and are invaluable for giving users a sense of different models’ relative performance. However, I have one fear: vendors developing AI and gen AI solutions will strive to outperform competitors in benchmarks/evaluation tools rather than building the best solution possible.
For example, the leakage of benchmark datasets into training data can artificially inflate benchmark results. Another issue is that LLMs can “overfit”, a term that describes models memorising specific problems from benchmarks instead of learning general problem-solving skills. In a recent study, Scale AI created a new dataset called GSM1k, which closely resembles the widely used GSM8k benchmark for testing models on maths. When tested on GSM1k, several top models scored much worse than they did on GSM8k, suggesting memorisation rather than true understanding.
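Both problems can be illustrated with a rough sketch. The word n-gram overlap check below is one simple, commonly used heuristic for spotting leaked benchmark items, not the method used in the Scale AI study, and all texts, function names and accuracy figures are made up for illustration.

```python
def ngrams(text, n):
    """Return the set of word n-grams in a text (lower-cased, split on whitespace)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def contamination_rate(benchmark_questions, training_docs, n=5):
    """Fraction of benchmark questions sharing at least one n-gram with the training docs."""
    if not benchmark_questions:
        return 0.0
    train_grams = set()
    for doc in training_docs:
        train_grams |= ngrams(doc, n)
    flagged = [q for q in benchmark_questions if ngrams(q, n) & train_grams]
    return len(flagged) / len(benchmark_questions)

def score_gap(public_accuracy, fresh_accuracy):
    """Drop in accuracy when moving from a public benchmark to a fresh look-alike set."""
    return public_accuracy - fresh_accuracy

# Hypothetical data: the first question also appears inside a scraped training document.
benchmark = [
    "a patient presents with chest pain and fever",
    "what is the normal range of serum potassium",
]
training = ["forum post: a patient presents with chest pain and fever, what next"]

rate = contamination_rate(benchmark, training, n=5)
gap = score_gap(0.80, 0.65)  # made-up accuracies on a public and a fresh test set
print(rate, gap)
```

A large positive gap on a fresh set that mirrors the public benchmark is a warning sign of memorisation, although it does not prove it on its own.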
When developing applications that require clear-cut, correct or incorrect answers, we can compile a test set of labelled examples with the correct answers and then measure the percentage of times the LLM produces the correct output. However, many applications based on LLMs produce free-text output that doesn’t have a single correct response. Teams will often use advanced language models to evaluate such outputs, but the results can be unreliable. In addition, the fact that humans prefer certain answers does not necessarily mean those answers are more accurate.
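For the clear-cut case, the measurement is simple enough to sketch in a few lines. The questions, labels and the `toy_model` function below are placeholders standing in for a real test set and a real LLM call.

```python
def normalise(answer):
    """Ignore case and surrounding whitespace when comparing answers."""
    return answer.strip().lower()

def accuracy(model, test_set):
    """Share of labelled examples for which the model's output matches the label."""
    correct = sum(normalise(model(question)) == normalise(label)
                  for question, label in test_set)
    return correct / len(test_set)

# Hypothetical multiple-choice test set: (question, correct option).
test_set = [("question 1", "A"), ("question 2", "B"), ("question 3", "D")]

# Placeholder for an LLM: returns canned answers, one of which is wrong.
canned_answers = {"question 1": "A", "question 2": " b ", "question 3": "C"}

def toy_model(question):
    return canned_answers[question]

result = accuracy(toy_model, test_set)
print(f"{result:.0%} correct")
```

Exact matching of this kind only works when the answer space is closed, such as option letters or yes/no; it breaks down for free-text output.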
Despite their limitations, benchmarks are extremely useful, and in general I would argue that they are a force for good. I believe we need external validation, but I find it problematic when the validation is left to the vendors themselves, as they can tweak their models to outperform competitors. Perhaps an unbiased and objective third party, unrelated to healthcare IT, could provide healthcare-specific benchmarks that enable a fair comparison between products and test what the models have learned rather than what they have memorised. I am sure healthcare providers and some very confident vendors would be interested in such a service. However, the problem of commercialisation means that it is unlikely to happen.
Yet there is still hope that consortiums such as the Trustworthy & Responsible AI Network (TRAIN), which has close to 20 providers in its network, can do the job. Members aim to improve the trustworthiness of AI in healthcare by sharing best practices on safe and responsible AI use, enabling registration of clinical AI systems, providing tools to measure AI outcomes and assess bias, and facilitating a federated registry for sharing real-world AI algorithm performance data among organisations. Perhaps consortiums such as TRAIN are best positioned to provide AI evaluation tools for the healthcare industry.
I would love to hear your thoughts, so if you would like to chat further, please feel free to reach out or connect with me on LinkedIn. Also, it would be great to hear your opinion on the new format, whether it works or not, and what improvements could be made. Thank you for investing your time in this update, and I wish you a fantastic day ahead! 👋
Related Research
Generative AI Market Intelligence Service 2024
This Market Intelligence Service delivers data, insights, and thorough analysis of the worldwide market potential for vendors leveraging generative AI in healthtech. The Service encompasses Medical/Clinical IT, EMR & Digital Health, Pharma & Life Sciences, and Big Tech vendors, exploring their opportunities and strategies in the realm of generative AI.
About The Author
Vlad joined Signify Research in 2023 as a Senior Market Analyst in the Digital Health team. He brings several years of experience in the consulting industry, having undertaken strategy, planning, and due diligence assignments for governments, operators, and service providers. Vlad holds an MSc degree with distinction in Business with Consulting from the University of Warwick.
About the AI in Healthcare Team
Signify Research’s AI in Healthcare team delivers in-depth market intelligence and insights across a breadth of healthcare technology sectors. Our areas of coverage include medical imaging analysis, clinical IT systems, pharmaceutical and life sciences applications, as well as electronic medical records and broader digital health solutions. Our reports provide a data-centric and global outlook of each market with granular country-level insights. Our research process draws on primary data collected from in-depth interviews with healthcare professionals and technology vendors to provide a balanced and objective view of the market.
About Signify Research
Signify Research provides healthtech market intelligence powered by data that you can trust. We blend insights collected from in-depth interviews with technology vendors and healthcare professionals with sales data reported to us by leading vendors to provide a complete and balanced view of the market trends. Our coverage areas are Medical Imaging, Clinical Care, Digital Health, Diagnostic and Lifesciences and Healthcare IT.
Clients worldwide rely on direct access to our expert Analysts for their opinions on the latest market trends and developments. Our market analysis reports and subscriptions provide data-driven insights which business leaders use to guide strategic decisions. We also offer custom research services for clients who need information that can’t be obtained from our off-the-shelf research products or who require market intelligence tailored to their specific needs.
More Information
To find out more:
E: enquiries@signifyresearch.net
T: +44 (0) 1234 986111
www.signifyresearch.net