Researchers evaluated different artificial intelligence models for assessing HER2 expression and found they had the lowest agreement when identifying HER2-low samples, a challenge similar to the one pathologists face when scoring HER2 expression based on immunohistochemistry (IHC) staining.
In a study, published in Modern Pathology earlier this month, researchers tested the agreement on HER2 expression results between 10 different computational pathology AI models reading the same breast cancer whole-slide images. The models tested included tools from Nucleai, PathAI, 4D Path, BostonGene, AstraZeneca, Panakeia, Lunit, Indica Labs, and Caris Life Sciences.
AI-assisted tools from these companies are designed to help pathologists assess IHC expression faster and standardize the sometimes-subjective results from human scoring. IHC results for HER2 expression have become even more complex with the introduction of HER2-low and -ultralow categories in breast cancer after the US Food and Drug Administration approved AstraZeneca and Daiichi Sankyo’s Enhertu (trastuzumab deruxtecan) in these indications in 2022 and 2025.
Prior to Enhertu’s HER2-low and -ultralow scoring, breast cancer patients had to have an IHC score of 3+, or 2+ with a positive in situ hybridization (ISH) result, to be eligible for a HER2-targeted treatment. Patients who had a score of IHC 0 or 1+, or 2+ with ISH-negative results, were considered HER2-negative. The HER2-low category expanded that eligibility for Enhertu to patients with an IHC result of 1+, or IHC 2+ and negative by ISH. The ultralow indication went even further, allowing patients with an IHC 0 score with membrane staining to receive the treatment. Roche’s Pathway Anti-HER2/neu (4B5) Rabbit Monoclonal Primary Antibody test was also approved by the FDA as a companion diagnostic to identify patients with HER2-low and -ultralow tumors who are eligible for Enhertu.
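The category rules described above can be written out as a small lookup. The following Python sketch is illustrative only and is not a clinical tool: the function name `her2_category`, its parameters, and the labels "HER2-null" and "indeterminate" are assumptions introduced here for the example, not terms from the study or from any guideline text.

```python
from typing import Optional


def her2_category(
    ihc: str,
    ish_positive: Optional[bool] = None,
    membrane_staining: bool = False,
) -> str:
    """Map an IHC score (plus ISH and staining info) to a HER2 category.

    Hypothetical helper that mirrors the rules as summarized in the
    article; it is not a validated scoring algorithm.
    """
    if ihc == "3+":
        return "HER2-positive"
    if ihc == "2+":
        # IHC 2+ is resolved by the in situ hybridization (ISH) result.
        if ish_positive is None:
            return "indeterminate"  # assumed label: ISH result still needed
        return "HER2-positive" if ish_positive else "HER2-low"
    if ihc == "1+":
        return "HER2-low"
    if ihc == "0":
        # IHC 0 with some membrane staining falls under the ultralow indication.
        # "HER2-null" is an assumed label for IHC 0 with no staining.
        return "HER2-ultralow" if membrane_staining else "HER2-null"
    raise ValueError(f"Unknown IHC score: {ihc!r}")


if __name__ == "__main__":
    print(her2_category("2+", ish_positive=False))
    print(her2_category("0", membrane_staining=True))
```

Under these assumptions, an IHC 2+ sample changes category depending only on the ISH result, and an IHC 0 sample depends only on whether any membrane staining is seen, which is where small scoring differences can shift treatment eligibility.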
With these new categories, concordance between labs and individual pathologists on whether a sample is HER2-low, -ultralow, or negative by IHC can vary significantly. To help improve this agreement and ensure the right patients are getting the right treatment, researchers are developing new resources such as these AI tools to help pathologists score low HER2 expression or reference tools to help standardize equipment and readings.
In the study published in Modern Pathology, which was led by the nonprofit Friends of Cancer Research (FOCR), the researchers aimed to better understand the capabilities of these AI tools that have been developed to help pathologists score IHC results. Mark Stewart, an author of the study and VP of science policy at FOCR, noted that this study is the first to evaluate this many independently developed AI pathology models on the same data set.
“It’s an opportunity to help characterize where we are as a field in terms of the current capabilities of these tools and understanding how variable the outputs of these tools might be,” Stewart said. “There can be variability in the types of training data that were used to develop the models or variability in the biology of breast cancer. We wanted to see [the] impact that might have on the AI models’ ability to predict HER2 status.”
In the study, the 10 models assessed HER2 scoring on a data set from a single institution that included 1,124 whole-slide images from 733 breast cancer patients. The researchers acknowledged that these models had differences, including the type of slides and staining they required, whether human intervention was needed to identify controls or tumor regions within the slide, and the types of results that each model produced. Three pathologists also assessed HER2 expression in the samples.
Across the 10 models, the overall percent agreement was 65 percent for HER2 scoring based on guidelines from the American Society of Clinical Oncology and College of American Pathologists (ASCO-CAP), determined according to the percentage of cells with HER2 staining. Between the models and the pathologists, the agreement was the same, at 65 percent. The three pathologists’ HER2 scoring agreed 70 percent of the time.
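For readers unfamiliar with the metric, one common way to summarize agreement among several raters is to average the simple percent agreement over every pair of raters. The sketch below assumes that approach; the article does not describe the study's exact calculation, and the function names and toy scores are made up for illustration and are not study data.

```python
from itertools import combinations
from typing import Dict, List, Sequence


def percent_agreement(a: Sequence[str], b: Sequence[str]) -> float:
    """Fraction of samples on which two raters gave the same score."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("Raters must score the same non-empty set of samples")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def mean_pairwise_agreement(scores_by_rater: Dict[str, List[str]]) -> float:
    """Average percent agreement over all pairs of raters (assumed method)."""
    pairs = list(combinations(scores_by_rater.values(), 2))
    if not pairs:
        raise ValueError("Need at least two raters")
    return sum(percent_agreement(a, b) for a, b in pairs) / len(pairs)


# Toy scores for four slides from three hypothetical models (not study data).
toy_scores = {
    "model_a": ["0", "1+", "2+", "3+"],
    "model_b": ["0", "1+", "1+", "3+"],
    "model_c": ["1+", "1+", "2+", "3+"],
}

if __name__ == "__main__":
    print(f"{mean_pairwise_agreement(toy_scores):.2f}")
```

In the toy data, all three models agree on the 3+ slide and disagree only on the lower scores, which is the same pattern the study reports at a much larger scale.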
When the researchers looked at agreement based on the HER2 IHC categories, it was higher among samples scored HER2 IHC 3+ than for samples with lower HER2 expression. Agreement on IHC 3+ scoring was 88 percent compared to 64 percent, 60 percent, and 62 percent for HER2 IHC 0, 1+, and 2+ scoring, respectively.
They also tested the models’ agreement when assessing the extreme positive or negative scores by looking at agreement for IHC 0 versus IHC 1+, 2+, and 3+ and IHC 3+ versus IHC 0, 1+, and 2+. Agreement was highest, 97 percent, for distinguishing HER2 3+ from all other categories. Agreement on determining IHC 0 compared to the other scores was 85.6 percent.
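The effect of collapsing four IHC categories into a two-way comparison can be illustrated with a short sketch. It assumes simple percent agreement between two raters, and the helper names and toy scores are hypothetical, chosen only to show why agreement tends to rise once the 0, 1+, and 2+ scores are pooled.

```python
from typing import List, Sequence


def percent_agreement(a: Sequence, b: Sequence) -> float:
    """Fraction of samples on which two raters gave the same label."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("Raters must score the same non-empty set of samples")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def one_vs_rest(scores: Sequence[str], target: str) -> List[bool]:
    """Collapse IHC scores to a binary label: target category vs. all others."""
    return [s == target for s in scores]


# Toy scores for five slides from two hypothetical models (not study data).
toy_a = ["0", "1+", "2+", "3+", "1+"]
toy_b = ["0", "2+", "1+", "3+", "0"]

four_way = percent_agreement(toy_a, toy_b)
three_plus_vs_rest = percent_agreement(
    one_vs_rest(toy_a, "3+"), one_vs_rest(toy_b, "3+")
)
zero_vs_rest = percent_agreement(
    one_vs_rest(toy_a, "0"), one_vs_rest(toy_b, "0")
)

if __name__ == "__main__":
    print(f"four-way: {four_way:.2f}")
    print(f"3+ vs rest: {three_plus_vs_rest:.2f}")
    print(f"0 vs rest: {zero_vs_rest:.2f}")
```

In this toy case the two models disagree on three of five slides when all four categories are kept, but every disagreement among 1+ and 2+ disappears in the 3+ versus rest comparison, since those scores land in the same pooled group.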
“These AI models were able to, in fact, predict HER2 status across the more than 1,000 patient samples for this project,” Stewart said. “Having said that, we did see varying levels of agreement from the tools. We saw very high agreement in terms of being able to identify HER2-positive 3+ cases. But that variability increased as we went into HER2-low, the 1+ or 2+ categories. It’s important to note that variability mirrors what we see with human pathologists.”
While Stewart isn’t surprised by that, he added that HER2-low is an evolving category and, as the biology of HER2-low cancers becomes better defined and the HER2 categories are more established in practice, he expects the AI tools will improve.
This research came out of FOCR’s Digital and Computational Pathology Tool Harmonization Project (Digital PATH), which aims to identify opportunities for harmonizing methodologies and to support more consistent measurement and use of these tools. With these results, the group hopes to highlight the importance of reference data sets.
Stewart said the group hopes to gain “greater clarity” from the US Food and Drug Administration on how reference data sets could play a role in its assessment of AI digital pathology tools. Currently, only one of the tools evaluated in the Modern Pathology study, PathAI’s AISight Dx pathology slide image management software, has FDA clearance.
If the FDA’s views on reference data sets for AI digital pathology tools are established, Stewart said it could serve as an “incentive for people to put more effort into developing and making these types of data sets available.”
Ryan Hohman, VP of public affairs at FOCR, added that the organization is starting to discuss with members of Congress the importance of validated reference data sets for FDA review of AI tools.
Variability in the performance of these tools could impact patients whose tumors are assessed by AI digital pathology models, Stewart noted. He added that there should be “mechanisms in place to ensure that the tools are meeting adequate performance [standards] and are doing what they’re intended to do.”
“There’s a lot of conversations around potential variability when there are multiple tools out there that have gone through different regulatory pathways or there’s a lack of regulatory pathways and the impact that has on patients in terms of the information they’re getting from these tests,” Stewart said. “As a field, we need to be thinking critically about how we can help facilitate the development of these reference data sets to benchmark these assays and potentially provide a more efficient path for evaluating and, ideally, validating these tools for various uses.”