Evaluating a Healthcare use case using Vertex AI and Maxim AI - Part 1

Introduction

Building AI agents has become more accessible than ever, empowering developers to create sophisticated, autonomous systems. But moving from a working prototype to a production-ready agentic application brings a new set of challenges, from ensuring reliability and safety, to evaluating performance at scale.

Agentic systems, by nature, are complex. They make decisions, invoke tools, and maintain evolving context over long interactions. Evaluating these systems is far from trivial. Standard metrics don’t capture the nuances of multi-turn reasoning, tool use, or collaboration between agents. Teams struggle to identify where breakdowns occur and how to fix them.

That’s why today, we’re excited to announce a strategic partnership between Maxim AI and Google Cloud’s Vertex AI, a collaboration aimed at making end-to-end evaluation of agentic systems seamless, reliable, and enterprise-ready.

This article introduces a framework for evaluating an LLM-powered healthcare use case by combining Google Vertex AI's evaluation capabilities with Maxim AI's enterprise platform. We'll focus specifically on the fundamental task of generating clinical notes from doctor-patient conversations, a cornerstone capability in modern ambient AI documentation tools.

At the heart of our evaluation pipeline are Google's Gemini models (including Gemini 1.5 Flash and Gemini 2.0 Flash), which excel at high-throughput, high-fidelity text generation. What makes this approach particularly robust is the dual use of these models: first to power the assistant responses, and then to evaluate those responses through Vertex AI's built-in suite of evaluators such as Vertex Fluency, Vertex Safety, and Vertex ROUGE.

By combining this powerful evaluation suite with Maxim's platform, we will demonstrate how to ensure enterprise-grade reliability for any clinical assistant you build.

Introduction to Vertex’s Gen AI evaluation service API

Google’s Gen AI evaluation service enables comprehensive assessment of your LLMs using customisable metrics based on your specific criteria.

The service works by accepting three key inputs:

  • Your inference-time inputs
  • The responses generated by your LLMs
  • Any additional parameters you wish to include

After processing these inputs, the service delivers metric results tailored to your evaluation task.

Available Metrics

The service offers two categories of metrics:

  1. Model-based metrics:
    • PointwiseMetric: Evaluates individual responses against specific criteria
    • PairwiseMetric: Compares pairs of responses to determine relative performance
  2. In-memory computed metrics

What makes this service particularly flexible is that both PointwiseMetric and PairwiseMetric can be customised to align with your unique evaluation criteria.
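For example, here is a minimal sketch of a customised pointwise metric for our clinical-notes scenario, assuming the preview evaluation SDK; the metric name, criteria, and rating rubric are illustrative, not part of the official examples -

from vertexai.preview.evaluation import PointwiseMetric, PointwiseMetricPromptTemplate

# Illustrative custom metric: checks whether a generated clinical note
# covers the sections that were actually discussed in the conversation.
clinical_completeness = PointwiseMetric(
    metric="clinical_completeness",
    metric_prompt_template=PointwiseMetricPromptTemplate(
        criteria={
            "completeness": (
                "The clinical note captures the chief complaint, history, "
                "assessment, and plan mentioned in the conversation."
            ),
        },
        rating_rubric={
            "1": "All sections mentioned in the conversation are present and accurate.",
            "0": "One or more mentioned sections are missing or incorrect.",
        },
        input_variables=["prompt"],
    ),
)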

Since the evaluation service accepts prediction results directly from models as inputs, it can seamlessly perform both the inference process and subsequent evaluation on any model supported by Vertex AI.
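As a quick illustration, you can hand the evaluation service a Gemini model and let it generate the responses itself before scoring them. Here is a minimal sketch; PROJECT_ID and the prompt text are placeholders -

import pandas as pd

import vertexai
from vertexai.generative_models import GenerativeModel
from vertexai.preview.evaluation import EvalTask, MetricPromptTemplateExamples

vertexai.init(project=PROJECT_ID, location="us-central1")

# Only prompts are provided; responses are generated by the supplied model
# and then scored with the chosen metric.
prompts = pd.DataFrame(
    {"prompt": ["Summarize in one sentence: The city has announced a major overhaul of its public transit system ..."]}
)

eval_result = EvalTask(
    dataset=prompts,
    metrics=[MetricPromptTemplateExamples.Pointwise.FLUENCY],
).evaluate(model=GenerativeModel("gemini-2.0-flash"))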

You can read more here - https://cloud.google.com/vertex-ai/generative-ai/docs/model-reference/evaluation

Let's see an example using the Gen AI Evaluation API from the official documentation -

Let’s assume you want to evaluate the output of an LLM using a variety of evaluation metrics, including the following:

  • summarization_quality
  • groundedness
  • verbosity
  • instruction_following

Let’s import the required modules

import pandas as pd

import vertexai
from vertexai.preview.evaluation import EvalTask, MetricPromptTemplateExamples

Initialise Vertex with PROJECT_ID and location -

vertexai.init(project=PROJECT_ID, location="us-central1")

Prepare an evaluation dataset -

eval_dataset = pd.DataFrame(
    {
        "instruction": [
            "Summarize the text in one sentence.",
            "Summarize the text such that a five-year-old can understand.",
        ],
        "context": [
            """As part of a comprehensive initiative to tackle urban congestion and foster
            sustainable urban living, a major city has revealed ambitious plans for an
            extensive overhaul of its public transportation system. The project aims not
            only to improve the efficiency and reliability of public transit but also to
            reduce the city\'s carbon footprint and promote eco-friendly commuting options.
            City officials anticipate that this strategic investment will enhance
            accessibility for residents and visitors alike, ushering in a new era of
            efficient, environmentally conscious urban transportation.""",
            """A team of archaeologists has unearthed ancient artifacts shedding light on a
            previously unknown civilization. The findings challenge existing historical
            narratives and provide valuable insights into human history.""",
        ],
        "response": [
            "A major city is revamping its public transportation system to fight congestion, reduce emissions, and make getting around greener and easier.",
            "Some people who dig for old things found some very special tools and objects that tell us about people who lived a long, long time ago! What they found is like a new puzzle piece that helps us understand how people used to live.",
        ],
    }
)

Now let's prepare an EvalTask, providing the dataset and the relevant metrics -

eval_task = EvalTask(
    dataset=eval_dataset,
    metrics=[
        MetricPromptTemplateExamples.Pointwise.SUMMARIZATION_QUALITY,
        MetricPromptTemplateExamples.Pointwise.GROUNDEDNESS,
        MetricPromptTemplateExamples.Pointwise.VERBOSITY,
        MetricPromptTemplateExamples.Pointwise.INSTRUCTION_FOLLOWING,
    ],
)

Prepare a prompt template that assembles the instruction (input), context, and response (output), then run the evaluation -

prompt_template = (
    "Instruction: {instruction}. Article: {context}. Summary: {response}"
)
result = eval_task.evaluate(prompt_template=prompt_template)
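The returned result exposes both aggregate and per-row scores, which you can inspect directly -

# Aggregate scores across the dataset.
print(result.summary_metrics)

# Per-row scores and explanations as a pandas DataFrame.
print(result.metrics_table.head())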

This example shows how to evaluate an LLM-generated summary using Vertex AI's evaluators. Maxim has enhanced its platform by fully integrating Vertex AI's evaluation service, delivering enterprise-grade LLM assessment capabilities directly within your familiar Maxim workspace. Simply configure your Vertex AI credentials in the platform settings, and you instantly gain access to a comprehensive suite of third-party evaluators powered by Google.


Configuring Vertex AI in Maxim

What Are We Planning to Build?

In this demonstration, we'll showcase a healthcare use case that automatically converts doctor-patient conversations into concise clinical notes. This use case illustrates how AI can streamline medical documentation while maintaining accuracy and completeness. The quality of these AI-generated clinical notes will be assessed through multiple evaluation metrics.

Clinical Notes Generator (Prompt-based)

We will create a single prompt inside the Maxim platform that takes a doctor-patient conversation as input and generates a structured clinical note. This note can later be sent to patients post-visit.

We will then run a simulated session using a dataset of 10 sample dialogues and evaluate the generated notes using Vertex AI evaluators, imported directly through Maxim's Evaluator Store.

Evaluating Clinical Notes Generation Prompt (Prompt-Based Simulation)

In this section, we walk through the full process of setting up a prompt-driven clinical note generator using Maxim's no-code interface, Gemini 2.0 Flash as the model, and Vertex AI evaluators for post-simulation analysis.

Step 1: Create Prompt in Maxim

  1. Head to the Playground section.
  2. Click “+ Create Single Prompt” and name it: Clinical_Notes_Generator_Assistant

Creating a new prompt in Maxim

  3. Paste the following System Prompt:
You are a clinical documentation assistant for healthcare professionals.
Your job is to read a multi-turn conversation between a doctor and a patient and generate a structured clinical note based on the interaction.

Follow these rules carefully:
- Do NOT include any unnecessary commentary or disclaimers.
- The note should be clear, concise, and use standard medical terminology.
- Maintain an objective and professional tone.

The clinical note should follow this structure:

Chief complaint: [Main reason the patient came in]  
History: [Symptoms, duration, context, relevant negatives]  
Medications: [Current medications if mentioned]  
Allergies: [Any known allergies]  
Assessment: [Doctor’s impression or working diagnosis]  
Plan: [Next steps – investigations, prescriptions, follow-ups]

Only include fields that are mentioned in the user message.
Begin generating the clinical note once the user message is provided.
  4. Select Gemini 2.0 Flash as the model.
  5. Keep the temperature low (0.2–0.3 is recommended for factual generation).
  6. Save the prompt.
  7. Now let's test the prompt once before we proceed further: provide a sample dialogue between the doctor and patient as the input user message. As you can see below, we get the summarised clinical note as the output.

Testing the configured prompt

Clinical Notes Received -

Chief complaint: Sore throat and mild fever. 
History: Symptoms started yesterday, accompanied by some coughing, 
but no difficulty swallowing. 
Medications: Cetirizine occasionally. 
Assessment: Sore throat and mild fever.
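If you'd like to sanity-check the same setup outside the Maxim UI, here is a minimal sketch using the Vertex AI SDK. The system prompt and temperature mirror the configuration above; PROJECT_ID, CLINICAL_NOTES_SYSTEM_PROMPT, and the sample dialogue are placeholders -

import vertexai
from vertexai.generative_models import GenerationConfig, GenerativeModel

vertexai.init(project=PROJECT_ID, location="us-central1")

# The same system prompt configured in Maxim, passed as the system instruction.
model = GenerativeModel(
    "gemini-2.0-flash",
    system_instruction=CLINICAL_NOTES_SYSTEM_PROMPT,
)

dialogue = (
    "Doctor: What brings you in today?\n"
    "Patient: I've had a sore throat and a mild fever since yesterday..."
)

# A low temperature keeps the note factual and close to the conversation.
response = model.generate_content(
    dialogue,
    generation_config=GenerationConfig(temperature=0.2),
)
print(response.text)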

Step 2: Create Dataset

Upload or paste a CSV file containing 10 rows of doctor-patient dialogues and the corresponding expected notes (for reference); a small script for generating such a CSV is sketched after the steps below. The columns are:

  • Input
  • Expected Output

How to do it? Here are the steps -

  1. Go to the Library → Datasets section in Maxim AI.
  2. Click the “+” button to create a new dataset.
  3. Name your dataset: Clinical_Notes_Generator_Dataset
  4. Select the template: Prompt or Workflow testing
  5. Add two columns:
    • Input → set as User Input (Dialogue between Doctor and Patient)
    • Expected Output → set as Expected Output (Clinical Notes)
  6. Click Create dataset.
  7. Click Upload CSV
    • Select the file from your local file system
    • In the column mapping dialog:
      • Map column 1 → Input
      • Map column 2 → Expected Output
      • Ensure “First row is header” is checked
    • Click Upload
  8. You can also copy the CSV data and paste it directly instead of uploading the file.

Creating a new Dataset in Maxim
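If you'd rather script the dataset instead of assembling it by hand, a small pandas snippet can produce a CSV in the expected two-column shape; the rows below are placeholders -

import pandas as pd

# Two columns matching the dataset template:
# Input (dialogue) and Expected Output (clinical note).
rows = [
    {
        "Input": "Doctor: What brings you in today? Patient: I've had a sore throat and a mild fever since yesterday...",
        "Expected Output": "Chief complaint: Sore throat and mild fever.\nHistory: Symptoms started yesterday...",
    },
    # ...add the remaining dialogues and expected notes here
]

pd.DataFrame(rows).to_csv("clinical_notes_dataset.csv", index=False)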

Step 3: Set Up a Test Run

  1. Click Test on top right corner on your Single Prompt Screen.
  2. Select Type: Single run - We’re evaluating one version of the prompt, so we choose the Single run mode.
  3. Choose Dataset - We select the Clinical_Notes_Generator_Dataset, which contains 10 real-world doctor-patient conversations and their expected clinical notes.
  4. Select Evaluators - we have imported the following AI evaluators, powered by Google Vertex AI, from Maxim's Evaluator Store into our workspace:
    • Vertex Safety: Assesses the presence of harmful or unsafe content in the generated text.
    • Vertex Coherence: Evaluates the logical progression and consistency of ideas within the generated text, ensuring the content makes sense as a whole.
    • Vertex Question Answering Helpfulness: Assesses how helpful the answer is in addressing the question.
    • Vertex Question Answering Relevance: Determines how relevant the answer is to the posed question.

Import Vertex AI Evaluators from Maxim's Evaluator Store

Once you have imported the required evaluators, you will be able to use them while setting up a test run as you can see below -


Creating a new Prompt Test Run

As soon as you click Trigger Run, it will kick off a run that you can see in the “Runs” section. Once it's completed, you will see the simulation report.

You’ll see a detailed row-by-row breakdown of the dataset:

  • Status: Whether this row completed successfully
  • Input: The original doctor-patient dialogue
  • Expected Output: The expected clinical note from the dataset
  • Output: What your LLM actually generated
  • Evaluator Scores: Inline pass/fail indicators per evaluator for each row
  • Latency: Response time per row

You can click on any row to inspect:

  • The input conversation
  • The generated clinical note
  • Which evaluators flagged issues
  • Detailed feedback from each evaluator (e.g., “Safety failed: Unsafe medication”)

These are the test run results; we can go through each input and check the conversation and evaluator scores

You can click on an entry and go to the evaluations section to inspect why the evaluation has a “FAIL” status. For example, for our first input, Vertex Coherence failed with this reason -

The response appears to be a medical note, but it is incomplete, 
making it difficult to evaluate its coherence. While the information 
presented seems organized into categories (Chief complaint, History, etc.), 
the lack of content in several sections (Assessment, Plan) disrupts the 
overall flow. The connections between the existing pieces are logical 
within a medical context, but the incompleteness affects the overall coherence. 
The absence of crucial information, particularly the assessment and plan, 
makes it challenging to understand the logical progression of the note. 
Therefore, while there's an attempt at structure and organization, it lacks 
complete information and this affects coherence.. 

Note - You can also use the Maxim SDK (Python / TypeScript / Go) to run simulations on your prompts within your code environment. Check the cookbooks here -

GitHub - maximhq/maxim-cookbooks: Maxim is an end-to-end AI evaluation and observability platform that empowers modern AI teams to ship agents with quality, reliability, and speed.

This blog opens the door to Part 2 of this integration, where we'll take things to the next level by implementing Vertex AI evaluators for a complete, end-to-end medical assistant agent that maintains context across multiple conversation turns. We'll demonstrate:

  • Multi-turn conversation evaluation - Assessing how well the agent maintains context and medical accuracy across extended doctor-patient dialogues
  • Agent deployment workflows - You will see how, in Maxim, you can import your agent as a Workflow via an API endpoint

Stay tuned as we dive deeper into building enterprise-grade medical AI assistants with the combined power of Maxim's agent framework and Google Vertex AI's evaluation capabilities.