Signify Premium Insight: Addressing the Shortcomings of AI
Published: November 4, 2021
This Insight is part of your subscription to Signify Premium Insights – Medical Imaging. This content is only available to individuals with an active account for this paid-for service and is the copyright of Signify Research. Content cannot be shared or distributed to non-subscribers or other third parties without express written consent from Signify Research. To view other recent Premium Insights that are part of the service please click here.
Co-written by Dr. Sanjay Parekh
A literature review recently published in the BMJ concluded that artificial intelligence tools are still some way off outperforming radiologists in the reading of breast images. The team conducting the research, based at the University of Warwick (UK), reviewed 12 studies published across the last decade, and found that 34 out of 36 tools (94%) performed less accurately than a single radiologist, while all 36 algorithms were found to be less accurate than the double reading of two radiologists in breast screening programmes. These and other findings of the review appeared damning, leading the researchers to conclude that: “Current evidence on the use of AI systems in breast cancer screening is a long way from having the quality and quantity required for its implementation into clinical practice.” The team also highlighted that far more clinical validation studies are needed to properly evaluate commercially available AI systems.
- Clinical validation remains a barrier to the increased adoption of medical imaging AI. Despite this, and despite the negative studies that are published periodically, this barrier is being eroded. Clinical validation studies, both positive and negative, often require context and have caveats that make it impossible to interpret the health of an entire market from them alone.
- Medical imaging AI is a technology that is best deployed as a complementary tool. As such, studies which pit algorithms against radiologists are almost guaranteed to be less successful than those which see algorithms enhancing radiologists, as they would in real clinical practice.
- The context of a solution’s use is also highly relevant. There is far less opportunity for something to go wrong in AI-led TB triage, in a region with a paucity of radiologists, compared with a well-established breast screening programme, which could see unnecessary tests or treatments administered. Furthermore, both of these screening scenarios are different again compared to a hospital’s radiology department where a patient has presented with symptoms and AI might successfully supplement or streamline a radiologist’s work.
- AI is a rapidly evolving market, and so clinical validation studies must also move in lockstep. As the market matures, prospective studies in a clinical setting must begin to replace the retrospective studies that have, so far, been relied on.
The Signify View
Many young vendors with carefully developed machine learning solutions are, at present, touting the advantages of those solutions. They are attempting to impress upon providers the potential their tools offer, convincing them to open their purse strings and make a purchase. However, the lack of clinical validation is, as detailed in Signify Research’s Machine Learning in Medical Imaging report, one of the key barriers stopping these purchases being made. The BMJ study seems to suggest that this restraint is well-founded, with AI solutions apparently not living up to some of the radiologist-replacing expectation placed upon them.
The problem, though, may not lie wholly with the algorithms tested, but in part with that expectation placed upon them. There have been dramatic technical leaps in medical imaging AI in recent years, with great strides made in the area of machine learning, so it is possible that results from the review could be improved if only studies from today were used, rather than those which dated back as far as 2018, as in the BMJ study. Even so, any such improvement would still be measured against the expectation that the tools perform better than radiologists, even ultimately replacing them. This is, as detailed in the previous Premium Insight A Study in Validation, not the best way for AI solutions to be used in clinical practice. While certain more straightforward tasks that radiologists complete, such as those that involve quantification and routine measurements, are ripe to be carried out by an algorithm rather than a doctor, in the vast majority of cases an AI solution will be best used in addition to a radiologist.
The advantages of using an AI solution alongside the experienced and nuanced approach of a radiologist were highlighted in the review (such as fewer cancers missed because an AI algorithm is unaffected by fatigue or subjective diagnosis). So too were the drawbacks of using AI tools alone, one of the main ones being their frequent oversensitivity. The tools identified features of an image that could be indicative of breast cancer but are often also clinically unimportant (that is, mammographic features detected by AI that are unlikely to lead to or present as disease). Relying on AI alone, without the tempering wisdom of a radiologist, could therefore increase rates of overdiagnosis and overtreatment, leading to more unnecessary diagnostic tests and treatments, and tipping the net effect of the tools towards harm rather than help.
Help Where it is Needed
There are other instances where this balance is more emphatically positive. Another recent study, published in The Lancet Digital Health, considered five different AI solutions for detecting tuberculosis and found that, in general, they performed well, according to the study “outperforming experienced certified radiologists”. TB is more prevalent in less developed and more rural regions. These areas will often have fewer radiologists, and AI may be a way to identify more cases. As well as the solutions themselves working effectively, the clinical expectation for these tools is also more tangible. The study reported that using the tools could avoid follow-on Xpert testing (an assay test for the rapid diagnosis of TB and drug resistance) and reduce the numbers needed to treat (the number of patients that need to be treated to prevent one additional negative outcome), whilst maintaining high sensitivity.
As such, even if these tools suffered from the same issue of overdiagnosis as many of the breast analysis tools, if they are able to reduce the radiologist workload by removing the need for a radiologist to read every single scan, there are dramatic savings to be had within a health network. Another clinical factor that makes TB triage a more suitable space for tools that can be used with less supervision is the associated risk. Oversensitivity in an algorithm to detect TB may result in unnecessary sputum tests being conducted on patients without the disease, a far less invasive process than a breast biopsy. This circumstantial appropriateness is one of the key factors in the development and use of machine learning in medical imaging.
The Real World
Other studies have endeavoured to be more clinically representative in other ways. One recent study, published in the RSNA journal Radiology, assessed an AI algorithm’s effectiveness at skeletal age assessment on hand radiograph examinations. The study is significant for several reasons. It is among the first prospective, multicentre, randomised controlled clinical AI trials carried out, and not only offers very positive results, but also demonstrates clinical performance in line with performance in the lab (strong algorithm performance on retrospective data). Once again, this research showed how AI tools were best used in addition to radiologists, with the study finding that, overall, skeletal age assessment accuracy was improved through the use of the tool. In this case, radiologists in the study were shown the results of the AI solution and were able to either accept or reject the algorithm’s prediction. This is significant as it represents how most providers would actually use tools in their hospitals, instead of only testing the technical accuracy of the algorithms. It also highlights that AI-augmented radiologist diagnosis should be the objective for vendors, rather than aiming to replace radiologists themselves.
The study, which was conducted across six centres, including a reference centre, showed consistent improvements to the accuracy of skeletal age assessments across four of the five test centres (with respect to the reference centre), suggesting that the algorithm is generalisable across a variety of sites. Importantly, this includes sites that didn’t contribute to the training dataset. Interestingly, at the fifth site, where performance was lower than at other sites, radiologists alone were more accurate than those at other sites but were also more likely to override highly accurate predictions made by the AI. This ultimately reduced the overall accuracy of the radiologist plus algorithm combination. Once again, this is an example of true clinical validation, with the research highlighting potential pitfalls associated with the adoption of AI that extend beyond those that are strictly technical.
Despite these broadly positive results, however, there remains the question of what the solution is actually achieving. While it has been shown to be able to assess skeletal age accurately and efficiently, with the study demonstrating its effectiveness in real clinical settings, it remains to be seen whether any providers would be willing to pay for such a solution. Unlike in the case of AI solutions such as breast lesion detection, stroke detection and triage, or coronary artery disease assessment (using FFR-CT tools), for example, where clinical value is clearly added, the financial case for a skeletal ageing algorithm is less obvious.
Good. Bad. Inevitable?
More universally, these and other studies show the progress that is being made in terms of clinical validation. While some studies, such as that published in the BMJ, are at first glance very negative, and seem to presage the frequently prophesied AI winter, the reality is more nuanced. The research aptly demonstrated that machine learning tools aren’t, and perhaps never will be, able to make radiologists redundant, but, to the relief of almost all AI developers hoping to sell their solutions to providers, this scenario isn’t the one in which their tools have been developed to work. Rather than foreshadowing increasingly difficult conditions for the medical imaging AI market, the Warwick team’s review (BMJ study) serves as a benchmark of where the industry is today. It also tempers some of the expectation, fuelled in part by the Silicon Valley-esque valuations, that medical imaging AI would be as quickly disruptive as some tech firms. It serves as a reminder that healthcare technology moves slowly and treads carefully, as a result of the discipline’s inherent risks.
These more negative studies are important, but they must also be contextualised alongside other studies that highlight the progress being made in overcoming the barrier of clinical validation. Some studies, such as that from The Lancet Digital Health (TB study), highlight situations where AI might most feasibly be deployed, as an imperfect but preferable alternative to present situations. Others are more pragmatic, with the Radiology study showing success with a practical approach to medical imaging AI use in the near term. These successes aren’t merely academic either. Viz.ai’s successful bid to become the first AI vendor to secure a New Technology Add-on Payment (NTAP), for example, came as a result of successful clinical validation.
This is, simply, how the market will continue to progress. Far from portending the end of medical imaging AI, these studies, both positive and negative, are instead indications of technical development in a highly regulated market. They may shape the market in the near term, determining how AI is first brought into radiology departments or influencing where vendors choose to prioritise, for example, but in the long term they will be no more damning than that.