Project Pulse | Evaluating AI Agreement in HER2 Assessment Using Whole-Slide Imaging: Insights from the Digital PATH Project

Biomarkers play a key role in identifying which patients are most likely to benefit from specific therapies, and advances in digital pathology and artificial intelligence (AI) are beginning to reshape how those biomarkers are assessed. Scientists use assays, or tests, to assess biomarkers. Different assays may produce different results, which could affect the therapy a patient receives.

The Friends of Cancer Research Diagnostic Harmonization Portfolio provides insights into the variability that can exist among diagnostics and offers efficient solutions to enhance alignment and review. Our most recent research partnership, the Digital PATH Project, recently published a manuscript of its findings in Diagnostics. Below, we've included a companion blog post to provide context, including a shorter summary of the manuscript's findings.

Background: Targeted Therapies and HER2

As cancer research has advanced, the development of targeted therapies tailored to specific tumor biology has transformed oncology treatment. Rather than relying on a one-size-fits-all approach, modern therapeutic approaches increasingly focus on molecular features that define subgroups of patients most likely to benefit. One well-established example of this precision medicine approach is targeting human epidermal growth factor receptor 2 (HER2) in breast cancer.

HER2 is a protein expressed on the surface of cells that plays a role in regulating cell growth. HER2 overexpression drives more aggressive tumor growth and increased risk of disease progression. Assessing HER2 status is a critical step in identifying patients who may benefit from HER2-targeted treatments.

Like many biomarkers, HER2 expression exists along a continuum. To standardize classification, pathologists use guidelines developed by the American Society of Clinical Oncology and the College of American Pathologists (ASCO–CAP) to score HER2 on a scale from 0 to 3+ based on how intense and complete HER2 staining looks when a tumor sample is examined under a microscope.

Historically, HER2-targeted therapies were limited to patients with HER2 3+ tumors, reflecting very high levels of expression. However, recent advances—particularly the development of antibody-drug conjugates (ADCs), which deliver cancer-killing drugs directly to HER2-expressing cells—have expanded treatment options to patients with lower levels of HER2 expression. As a result, accurately distinguishing HER2 0, 1+, and 2+ tumors has become increasingly important for both clinical decision-making and clinical trial design.

Current Challenges in HER2 Detection 

HER2 scoring is typically performed by pathologists, who carefully examine stained tumor slides and assign scores based on established criteria. While this approach can work well, it is inherently subjective, and different pathologists may interpret the same tissue sample differently.

As computational and AI technologies have advanced, there has been growing interest in using these tools to support pathology workflows. Digital pathology enables high-resolution whole-slide images to be analyzed by AI-driven computational models, offering the potential for increased consistency, reduced variability, and improved efficiency compared to manual review alone. 

However, tool developers create AI-enabled assessment tools independently, often using different training data and methodologies. As a result, there is limited understanding of how these models perform relative to one another and how their outputs align with human pathologist assessments. Large-scale comparative evaluations help characterize the current state of the technology, clarify agreement across models, and identify sources of variability.

The Digital PATH Project focused on HER2 because it is routinely assessed in breast cancer and has well-established clinical scoring guidelines. The study leveraged a dataset of more than 1,000 digital images that were stained and collected using a pre-specified protocol. Because of ongoing drug development in this space, multiple AI models are actively under development, allowing for a robust analysis of 10 independently developed models. Finally, the HER2 scoring system is universally accepted, enabling a clear comparison between the AI models and pathologist readings.


How Our Study Contributes to the Field 

We established a multi-stakeholder group to evaluate HER2 scoring consistency across independently developed AI models and to compare model outputs with human pathologist assessments. 

Study design 

  • AI models: We evaluated 10 independently developed AI models. 
  • Data: We used >1,000 tumor samples from 733 patients with breast cancer, all from the same institution. 
  • Blinding: The images were anonymized, and the AI models had no access to patient information or pathologist scores. 
  • Workflow: AI models scored the images on their own, without human review, allowing for direct comparison of model-only outputs. 

Model outputs 

The AI models reported HER2 status in several different ways. First, they used the standard clinical scoring system with four categories (ASCO-CAP IHC scores: 0, 1+, 2+, and 3+). In addition, the models were evaluated using three simplified, yes-or-no (binary) approaches, where cases were grouped into one of two categories instead of four: 

  • No HER2 expression (0) versus any HER2 expression (1+, 2+, or 3+) 
  • Low or no HER2 expression (0 or 1+) versus higher expression (2+ or 3+) 
  • Not high HER2 (0, 1+, or 2+) versus high HER2 expression (3+) 
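The three binary groupings above can be sketched in code. This is an illustrative mapping only (the function and label names are our own, not part of the study); it assumes HER2 scores are represented as the strings "0", "1+", "2+", and "3+":

```python
# Illustrative sketch of the three binary groupings of ASCO-CAP HER2 scores.
# Score values are assumed to be one of: "0", "1+", "2+", "3+".

def any_expression(score):
    """No HER2 expression (0) vs. any HER2 expression (1+, 2+, 3+)."""
    return "any" if score in {"1+", "2+", "3+"} else "none"

def low_vs_higher(score):
    """Low or no expression (0, 1+) vs. higher expression (2+, 3+)."""
    return "higher" if score in {"2+", "3+"} else "low_or_none"

def high_only(score):
    """Not high (0, 1+, 2+) vs. high HER2 expression (3+)."""
    return "high" if score == "3+" else "not_high"
```

Collapsing four categories into two means some disagreements between adjacent scores (say, 1+ versus 2+) disappear under one grouping but remain under another, which is why the three binarizations can show different agreement levels.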
Research Findings 

When we compared how different AI models classify HER2 status, we found that their level of agreement depended on the level of HER2 expression in a tumor sample. To quantify consistency between models, we used overall percent agreement, which measures how often different models or readers assign the same classification. 
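As a simplified sketch (not the study's actual analysis code), overall percent agreement across a set of models can be computed by comparing every pair of models on every case and counting how often their classifications match:

```python
from itertools import combinations

def overall_percent_agreement(scores_by_model):
    """Fraction of (model pair, case) comparisons with matching classifications.

    scores_by_model: a list of equal-length score lists, one list per model,
    where scores_by_model[m][i] is model m's classification for case i.
    """
    n_cases = len(scores_by_model[0])
    matches = 0
    total = 0
    # Compare every pair of models on every case.
    for a, b in combinations(scores_by_model, 2):
        for i in range(n_cases):
            total += 1
            if a[i] == b[i]:
                matches += 1
    return matches / total

# Example: three models scoring two cases.
models = [["0", "1+"], ["0", "2+"], ["0", "1+"]]
opa = overall_percent_agreement(models)  # 4 matches out of 6 comparisons
```

Overall percent agreement is a deliberately simple metric: it treats all disagreements equally, so a 0-versus-1+ mismatch counts the same as a 0-versus-3+ mismatch.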

When HER2 status was reported using the standard four clinical categories (HER2 0, 1+, 2+, and 3+), agreement was moderate. AI models agreed with one another 65% of the time, and agreement between AI models and human pathologists was of similar magnitude.

We then looked at agreement using simplified, binary classifications that grouped HER2 categories in different ways. Overall, these binary approaches led to higher agreement across models, though performance still varied depending on HER2 expression level: 

  • Agreement was strongest when identifying tumors with high HER2 expression (HER2 3+). In this setting, AI models agreed more than 97% of the time, indicating that they can reliably recognize cases with high HER2 levels.  
  • Agreement was lower when distinguishing tumors with no HER2 expression (HER2 0) from those with any detectable expression, with models agreeing about 86% of the time.  
  • The lowest level of agreement occurred when separating low or no HER2 expression (HER2 0 or 1+) from higher expression (HER2 2+ or 3+), where agreement dropped to around 80%. 

Notably, disagreements—whether between human readers or AI models—were most common when distinguishing between adjacent categories, such as HER2 0 versus 1+. In contrast, larger differences in expression, like HER2 0 versus HER2 3+, were much easier to identify consistently. 

We also explored why there were large disagreements (e.g., situations where some AI models called a sample 0 while others called it 3+). A pathologist re-reviewed approximately 50 slides and noted tissue heterogeneity or borderline staining in discordant cases, suggesting underlying tissue characteristics can contribute to variability in both AI model outputs and human assessments. 

Interpretation 

These findings indicate that AI-based HER2 assessments are most consistent in cases of high HER2 expression, where staining is strong and unambiguous. In contrast, tumors with low or intermediate HER2 expression are more challenging to align on, highlighting an area where both AI models and human readers continue to face interpretive challenges. 

Importantly, the similarity in disagreement patterns between AI models and pathologists suggests that variability may be driven less by failures in tumor detection and more by fundamental challenges in interpreting HER2 staining, particularly at the lower end of the expression spectrum. Large disagreements were often associated with technical factors, such as poor or uneven staining, blurry or damaged areas on sample slides, or tumors where HER2 expression varied across different regions. 

Remaining Questions and Future Directions 

This study provides important insights into the current state of AI-based HER2 assessment, illuminating several remaining questions: 

  • Are observed limitations driven by AI models or by HER2 testing methods themselves? Many HER2 assays were originally designed to identify tumors with very high HER2 expression (HER2 3+), which may limit their ability to consistently distinguish lower levels of HER2 expression. 
  • Can complementary quantitative tests help improve confidence in AI-based assessments? Using additional assays alongside digital pathology could provide confirmation of results and serve as a useful resource during AI model development, training, and validation. 
  • What reference standards are needed? Establishing consistent reference datasets and benchmarks would facilitate more robust evaluation, comparison, and validation of AI tools across studies and clinical settings. 
  • How should AI-driven models be regulated and implemented in practice? Emerging regulatory frameworks, including the U.S. Food and Drug Administration’s (FDA) Good Machine Learning Practices (GMLP), will be critical for ensuring AI tools are safe, reliable, and consistent. 

Overall, our findings suggest that AI models face many of the same interpretive challenges as human pathologists, particularly when evaluating HER2-low disease. This work serves as an important foundational study, offering early guidance on how AI can be integrated into clinical trial workflows. We are already applying these insights to our ai.RECIST project, which focuses on how AI-driven models can improve the accuracy and consistency of tumor assessments in clinical trials.  


Tags

Friends Project Pulse