AI has immense potential to transform healthcare, enhancing diagnostics, optimising workflows, and personalising treatment. Yet the OECD’s 2024 report, AI in Health: Huge Potential, Huge Risks, warns that poorly governed AI can amplify inequities, compromise patient safety, and erode public trust. Responsible evaluation is therefore a leadership imperative, not a technical afterthought.

Defining Purpose and Risk: The First Step in Responsible AI

Healthcare leaders must first establish the intended purpose, risk level, and value claims of any AI tool.

According to NICE’s Evidence Standards Framework for Digital Health Technologies, classification into Tiers A–C determines the required level of scrutiny. Tools that diagnose, treat, or guide clinical management fall under Tier C, requiring high standards of safety, performance, and cost-effectiveness.

At the scoping stage, evaluators should engage early with developers and regulators to confirm whether the AI qualifies as a medical device, what approvals (such as UKCA marking) apply, and what validation data are required. This early alignment helps avoid costly redesigns later in the deployment cycle.

Learning from the NHS: Building an Evaluation Framework

The NHS AI Lab’s AI in Health and Care Award programme offers a tested model for independent assessment. Its framework evaluates AI across eight domains:

  • Safety
  • Accuracy
  • Effectiveness
  • Value
  • Fit with sites
  • Implementation
  • Scalability
  • Sustainability

This approach allows both technical validation and real-world feasibility testing, ensuring AI solutions are practical and safe for frontline use.

Figure: A Proposed Phased Approach to Responsible AI Development and Deployment (Source: Makhni et al., Mayo Clinic Proceedings: Digital Health, 2025)

What Makes an Effective AI Evaluation?

Key lessons from the NHS and global literature include:

  • Mixed-methods evaluation: Combine quantitative clinical data with qualitative feedback from clinicians and patients to capture usability and impact.
  • Flexible study designs: Use quasi-experimental or pragmatic trial designs that adapt to evolving technologies.
  • Data governance clarity: Define data-sharing responsibilities, anonymisation, and post-market monitoring from the outset.
  • Longer evaluation timelines: Allow ‘bedding-in’ periods so staff can adjust workflows before outcomes are measured.

These steps ensure that evaluations move beyond proof of concept into real-world impact.

Ensuring Safety, Fairness, and Explainability

AI systems must be safe, fair, and explainable. Frameworks such as CHAI’s Blueprint for Trustworthy AI and Singapore’s Artificial Intelligence in Healthcare Guidelines (AIHGle) emphasise transparency across the entire lifecycle, from data sourcing and algorithm training to deployment and monitoring.

Leaders should implement continuous validation processes to detect model drift and retrain algorithms on updated datasets. Bias monitoring can include health equity audits and performance analysis across patient demographic groups to identify disparities.
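As a hedged illustration of what such a demographic audit might look like in practice, the Python sketch below compares a model’s sensitivity and false-positive rate across patient groups and flags any group that falls well below the overall figure. The data, column names, and 10-point tolerance are illustrative assumptions, not values drawn from any cited framework.

```python
import pandas as pd

# Illustrative audit data: one row per patient, with the model's prediction,
# the confirmed outcome, and a demographic attribute (all values synthetic).
audit = pd.DataFrame({
    "ethnicity": ["A", "A", "A", "B", "B", "B", "C", "C", "C", "C"],
    "y_true":    [1,   0,   1,   1,   1,   0,   0,   1,   0,   1],
    "y_pred":    [1,   0,   1,   0,   1,   0,   1,   1,   0,   0],
})

def subgroup_metrics(df: pd.DataFrame) -> pd.Series:
    """Sensitivity and false-positive rate for one demographic subgroup."""
    tp = ((df.y_true == 1) & (df.y_pred == 1)).sum()
    fn = ((df.y_true == 1) & (df.y_pred == 0)).sum()
    fp = ((df.y_true == 0) & (df.y_pred == 1)).sum()
    tn = ((df.y_true == 0) & (df.y_pred == 0)).sum()
    return pd.Series({
        "n": len(df),
        "sensitivity": tp / (tp + fn) if (tp + fn) else float("nan"),
        "false_positive_rate": fp / (fp + tn) if (fp + tn) else float("nan"),
    })

report = audit.groupby("ethnicity").apply(subgroup_metrics)
overall_sensitivity = subgroup_metrics(audit)["sensitivity"]

# Flag subgroups whose sensitivity falls well below the overall figure;
# the 10-percentage-point tolerance is an arbitrary illustrative threshold.
report["flagged"] = report["sensitivity"] < overall_sensitivity - 0.10
print(report)
```

In a real audit the same comparison would be run on representative local data and reviewed alongside clinical context, not treated as an automatic pass or fail.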

Equally critical is explainability. Tools like model cards or Unique AI Identifiers help document a system’s purpose, data lineage, and limitations, fostering accountability across developers, clinicians, and healthcare organisations.
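To make the idea concrete, the fragment below sketches one possible structure for such documentation; the field names and example values are assumptions for illustration and do not follow any single published model-card template.

```python
from dataclasses import dataclass, field

@dataclass
class ModelCard:
    """Minimal, illustrative model card for a clinical AI tool."""
    name: str
    intended_use: str                 # clinical purpose and target population
    training_data: str                # provenance and date range of source data
    evaluation_summary: str           # headline performance and test population
    known_limitations: list[str] = field(default_factory=list)
    out_of_scope_uses: list[str] = field(default_factory=list)
    version: str = "0.1"

# Hypothetical example entry, not a real product.
card = ModelCard(
    name="sepsis-risk-model",
    intended_use="Early warning of adult inpatient sepsis risk to support, not replace, clinical judgement.",
    training_data="De-identified EHR records from three hospitals, 2018-2023.",
    evaluation_summary="External validation at one additional site; see full report.",
    known_limitations=["Not validated in paediatric or maternity populations."],
    out_of_scope_uses=["Autonomous treatment decisions without clinician review."],
)
print(card)
```

Keeping a record like this under version control alongside the model makes it easier to show clinicians and regulators exactly what was deployed, and when.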

Figure: Strategies to Improve Development and Dissemination of AI Tools in Health and Healthcare (Source: Angus et al., JAMA, 2025)

Measuring Value: Beyond Cost and Efficiency

AI adoption should be grounded in evidence of measurable value. NICE requires proof not only of clinical outcomes but also of cost-effectiveness and budget impact.

Evaluators should:

  1. Map current care pathways.
  2. Identify how AI changes those pathways.
  3. Measure both direct costs (licensing, infrastructure, training) and indirect effects (efficiency gains, error reduction), as in the sketch that follows this list.
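The arithmetic behind step 3 can be kept deliberately simple at first pass. The sketch below nets annual direct costs against estimated indirect savings to give a headline budget impact; every figure is an invented placeholder rather than a benchmark.

```python
# Illustrative annual figures (all values are invented placeholders).
direct_costs = {
    "licensing": 120_000,        # annual licence fee
    "infrastructure": 35_000,    # hosting and integration
    "training": 20_000,          # staff training and backfill
}

indirect_effects = {
    "clinician_time_saved": 90_000,   # efficiency gains converted to cost
    "avoided_repeat_tests": 25_000,   # error and duplication reduction
}

total_cost = sum(direct_costs.values())
total_saving = sum(indirect_effects.values())
net_budget_impact = total_cost - total_saving

print(f"Total direct cost: £{total_cost:,}")
print(f"Estimated savings: £{total_saving:,}")
print(f"Net budget impact: £{net_budget_impact:,} per year")
```

A fuller economic evaluation would add sensitivity analysis around the savings estimates, which are usually the least certain inputs.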

According to OECD guidance, value should encompass trust, equity, and workforce sustainability, in addition to financial metrics.

The Need for Continuous Oversight

Post-deployment evaluation is just as crucial as pre-deployment testing. Leaders should:

  • Establish AI safety reporting systems similar to those used in NHS clinical incident processes (a minimal monitoring sketch follows this list).
  • Implement independent assurance mechanisms, like the Coalition for Health AI’s proposed “AI Labs”, to audit datasets and manage certification renewals.
  • Invest in ongoing training, equipping clinicians to interpret AI outputs responsibly and maintain human oversight in automated environments.
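As one illustration of how routine monitoring might feed such a safety reporting system, the sketch below checks an observed performance metric against a threshold agreed before go-live and assembles a minimal incident record when it is breached. The metric, threshold, and record fields are all assumptions made for the sake of the example.

```python
from datetime import date

# Illustrative monitoring check: compare this month's observed sensitivity
# against the level agreed at deployment (both figures are invented).
AGREED_MINIMUM_SENSITIVITY = 0.85   # threshold set before go-live
observed_sensitivity = 0.78         # value from this month's audit sample

def build_incident_report(metric: str, observed: float, threshold: float) -> dict:
    """Assemble a minimal safety incident record for human review."""
    return {
        "date": date.today().isoformat(),
        "metric": metric,
        "observed": observed,
        "threshold": threshold,
        "action": "Escalate to clinical safety officer and review automated use",
    }

if observed_sensitivity < AGREED_MINIMUM_SENSITIVITY:
    report = build_incident_report(
        "sensitivity", observed_sensitivity, AGREED_MINIMUM_SENSITIVITY
    )
    print("Safety report raised:", report)
else:
    print("Performance within agreed bounds; no report needed.")
```

The point is not the code but the discipline: the threshold, the escalation route, and the named owner should all be agreed before deployment, not after the first incident.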

Transparency in Reporting: The Path to Accountability

Transparent reporting ensures trust and reproducibility in AI-driven healthcare research. Several frameworks now support this goal:

Figure: Examples of AI-Related Reporting Guidelines (Source: Flanagin et al., JAMA, 2024)

These include:

  • CONSORT-AI and SPIRIT-AI for clinical trial reports and protocols.
  • MI-CLAIM, CLAIM, and MINIMAR for clinical AI modelling and imaging.
  • DECIDE-AI for early-stage live evaluations.
  • STARD-AI and TRIPOD-AI for diagnostic accuracy and prediction-model studies, and CHART (in development) for chatbot research.

These frameworks are helping healthcare move towards a shared standard of AI transparency and reliability.

A Continuous Cycle of Trust

Evaluating AI is not a one-off exercise; it’s an iterative process:
Plan → Test → Deploy → Monitor → Refine

The goal is not to slow innovation but to ensure that AI delivers equitable, safe, and explainable improvements in care.

Before deploying AI in healthcare, leaders should always ask:

  1. Is it safe, effective, and fair?
  2. Is it transparent, explainable, and accountable?
  3. Will it sustainably improve outcomes for patients and staff?

When frameworks such as those pioneered by NHS England and NICE are rigorously applied, AI can enhance, not endanger, the human dimension of healthcare.

Authored by Tom Varghese, Global Product Marketing & Growth Manager at Orion Health.


Resources

  • AI in Health and Care Award. Planning and Implementing Real-World Artificial Intelligence (AI) Evaluations: Lessons from the AI in Health and Care Award. NHS England, 2024
  • Coalition for Health AI (CHAI). Blueprint for Trustworthy AI: Implementation Guidance and Assurance for Healthcare. Version 1.0. April 2023
  • Flanagin, Annette, Romain Pirracchio, Rohan Khera, Michael Berkwits, Yulin Hswen, and Kirsten Bibbins-Domingo. “Reporting Use of AI in Research and Scholarly Publication—JAMA Network Guidance.” JAMA 331, no. 13 (April 2, 2024): 1096–1098
  • Labkoff, Steven, Bilikis Oladimeji, Joseph Kannry, et al. “Toward a Responsible Future: Recommendations for AI-Enabled Clinical Decision Support.” Journal of the American Medical Informatics Association 31, no. 11 (2024): 2730–2739
  • Makhni, Sonya, Jose Rico, Paul Cerrato, et al. “A Comprehensive Approach to Responsible AI Development and Deployment.” Mayo Clinic Proceedings: Digital Health 3, no. 4 (2025): 100294. 
  • Ministry of Health (Singapore). Artificial Intelligence in Healthcare Guidelines (AIHGle). Published October 2021
  • National Institute for Health and Care Excellence (NICE). Evidence Standards Framework for Digital Health Technologies. Updated August 2022
  • Nature Medicine Editorial Board. “Setting Guidelines to Report the Use of AI in Clinical Trials.” Nature Medicine 26, no. 9 (September 2020): 1311
  • Organisation for Economic Co-operation and Development (OECD). AI in Health: Huge Potential, Huge Risks. Paris: OECD Publishing, 2024