Agent Simulation & Testing Made Simple with Maxim AI

Generative-AI agents do more than answer a single question: they maintain context, call external APIs, enforce refund policies, and handle sensitive data. Releasing such systems without systematic testing risks hallucinations, privacy breaches, and broken user journeys. Maxim’s Agent Simulation module turns quality assurance into a repeatable, dataset-driven discipline.

This article combines the workflow shown in the attached flight-booking video with key concepts from Maxim’s documentation:

• Simulation Overview (why, personas, advanced settings)

• Simulation Runs (datasets, test-run configuration, evaluator results)

• Tracing & Dashboards (root-cause analysis after a run)


1 What is agent simulation?

Agent simulation pairs a synthetic simulator (virtual user) with your AI agent in a controlled environment. Each session:

  1. Starts from a predefined scenario (“Book an economy flight NYC → SFO for 20 April”).
  2. Applies a persona (polite, impatient, frustrated, expert).
  3. Runs for a fixed number of turns or until a success condition is met.
  4. Logs every request, response, and tool call.
  5. Evaluates the transcript with objective rubrics (PII, trajectory, hallucination, latency, cost).

Because the user side is synthetic, you can execute hundreds of scenarios in minutes and surface long-tail failures long before customers see them.
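The session loop above can be sketched in a few lines of Python. This is purely illustrative: `agent_reply` and `simulate_user_turn` are hypothetical stand-ins for your agent endpoint and Maxim's synthetic user, not real Maxim APIs.

```python
def run_simulation(scenario, persona, max_turns, agent_reply, simulate_user_turn):
    """Drive one synthetic session: virtual user vs. agent, logging a transcript."""
    transcript = []
    user_msg = scenario  # turn 1 opens with the scenario text
    for _ in range(max_turns):
        reply = agent_reply(user_msg)                        # call the real agent
        transcript.append({"user": user_msg, "agent": reply})
        user_msg, done = simulate_user_turn(reply, persona)  # synthetic user responds
        if done:  # success condition met (e.g., booking confirmed)
            break
    return transcript  # handed to evaluators (PII, trajectory, latency, ...)
```

The transcript returned here corresponds to what Maxim's evaluators score after each run.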


2 Wrap the agent in a Maxim workflow

2.1 Create the workflow

Open Workflows → New Workflow, then enter:

Name: Travel Agent
Description: Assists with flight & hotel bookings.

2.2 Define the request

POST https://flight-booking-with-ai.vercel.app/i/direct
{
  "messages": [
    {"role": "user", "content": "{{input}}"}
  ],
  "model_id": "gpt-4"
}

{{input}} binds to the simulator’s message.
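Conceptually, the placeholder is substituted into the body template before each request. The sketch below shows the idea with a hypothetical `bind_input` helper; it is not Maxim's actual implementation.

```python
import json

def bind_input(body_template: str, user_message: str) -> dict:
    """Replace the {{input}} placeholder, then parse the body as JSON."""
    # json.dumps escapes quotes and newlines so the result stays valid JSON
    bound = body_template.replace('"{{input}}"', json.dumps(user_message))
    return json.loads(bound)

template = '{"messages": [{"role": "user", "content": "{{input}}"}], "model_id": "gpt-4"}'
payload = bind_input(template, "Book an economy flight NYC -> SFO")
```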

2.3 Inject a unique simulation ID (pre-script)

export default function preScript(request) {
  // ISO timestamp → ensures uniqueness per run
  const ts = new Date().toISOString();

  // Parse the existing body (may be "{}" the first time)
  const data = JSON.parse(request.data || "{}");

  // Attach a correlation ID the backend can log
  data.id = `simulation-${ts}`;

  // Overwrite the request body
  request.data = JSON.stringify(data);

  return request;   // Must return the mutated request object
}

2.4 Return only the assistant’s final utterance (post-script)

export default function postScript(response) {
  // Convert raw string → object
  const full = JSON.parse(response.data);

  // Grab the assistant’s last message
  const last = full.messages.at(-1);

  // Strip everything else; evaluators need only this
  return { messages: [last] };
}

2.5 Authentication headers

x-maxim-token: 12345-demo-secret

(Bearer tokens and mTLS are equally supported.)


3 Manual smoke test

Type Hey and press Send. You should receive a greeting from the agent, confirming headers, body shape, and scripts all work.


4 Simulation parameters

  1. Scenario: narrative plus any business constraint.
  2. Persona: emotion, politeness, domain knowledge.
  3. Advanced settings:
     • Max turns caps loops (e.g., 8).
     • Reference tools: refund_processor, etc.
     • Context sources: policies or specs to curb hallucinations.
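Taken together, one scenario definition might look roughly like this. The field names are illustrative assumptions, not Maxim's exact schema:

```python
# Illustrative scenario definition (field names are assumptions)
simulation_config = {
    "scenario": "Book an economy flight NYC -> SFO for 20 April; refunds only within 24 h",
    "persona": {
        "emotion": "frustrated",
        "politeness": "low",
        "domain_knowledge": "novice",
    },
    "advanced": {
        "max_turns": 8,                            # caps runaway loops
        "reference_tools": ["refund_processor"],   # tools the agent may call
        "context_sources": ["refund_policy.md"],   # grounding docs to curb hallucinations
    },
}
```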

5 Create an Agent Dataset

Use the table below when you recreate the CSV/JSON inside Maxim. Each row is a separate simulated conversation; the “Expected steps” cell is intentionally brief (Maxim will show full text on hover).

| Scenario (user input) | Expected steps (compressed) |
|---|---|
| Book an economy flight NYC → SFO on 20 Apr 2025 | 1 Clarify date & cabin → 2 List two-or-more economy options → 3 Ask user to choose → 4 Confirm booking |
| Book the cheapest round-trip London ↔ San Francisco in March | 1 Return cheapest date pair → 2 Collect passenger count & names → 3 Verify selection → 4 Issue tickets |
| Family of four wants flights to Hawaii for summer vacation | 1 Show 2–3 family-friendly itineraries → 2 Gather traveller details → 3 Confirm choice → 4 Complete booking |
| Business-class seat on direct Air India BLR → SFO, 16 Feb | 1 Return nonstop options (if any) → 2 Ask user to pick one → 3 Confirm fare → 4 Finalise purchase |
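If you prefer JSON over CSV, the first two rows could be encoded like this. The column names (`scenario`, `expected_steps`) are assumptions; match whatever your Maxim workspace expects.

```python
import json

# First two dataset rows as JSON (key names are illustrative)
dataset = [
    {
        "scenario": "Book an economy flight NYC -> SFO on 20 Apr 2025",
        "expected_steps": [
            "Clarify date & cabin",
            "List two-or-more economy options",
            "Ask user to choose",
            "Confirm booking",
        ],
    },
    {
        "scenario": "Book the cheapest round-trip London <-> San Francisco in March",
        "expected_steps": [
            "Return cheapest date pair",
            "Collect passenger count & names",
            "Verify selection",
            "Issue tickets",
        ],
    },
]

encoded = json.dumps(dataset, indent=2)  # ready to upload or commit to Git
```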

6 Configure & run a Simulation Run

Open the workflow → Test → Simulated session.

| Field | Value |
|---|---|
| Dataset | travel_agent_simulation_dataset |
| Persona | Frustrated user in a rush |
| Response field for evaluation | messages.0.content |
| Max turns | 8 |
| Evaluators | PII Detection • Agent Trajectory |

The configuration panel includes several key fields:

• Dataset (travel_agent_simulation_dataset): the collection of scenarios and expected trajectories replayed during a simulation run; each row triggers one multi-turn conversation.

• Persona (Frustrated user in a rush): the synthetic user profile applied across scenarios, shaping tone, patience, and vocabulary so the agent must adapt to a specific emotional state.

• Response field for evaluation (messages.0.content): the JSON path telling Maxim which part of the agent’s response to evaluate, in this case the assistant’s main textual reply.

• Max turns (8): a hard limit on exchanges per session, preventing runaway loops and keeping token usage predictable.

• Evaluators (PII Detection • Agent Trajectory): quality checks applied after each run; PII Detection flags sensitive-data leaks, while Agent Trajectory confirms the conversation followed the expected dataset steps.
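To see why `messages.0.content` points at the reply text, here is a minimal dotted-path resolver run against the shape the post-script returns. This is a sketch of the idea, with a hypothetical `resolve_path` helper, not Maxim's resolver.

```python
def resolve_path(obj, path: str):
    """Walk a dotted path; numeric segments index into lists."""
    for seg in path.split("."):
        obj = obj[int(seg)] if seg.isdigit() else obj[seg]
    return obj

# Shape produced by the post-script in section 2.4
response = {"messages": [{"role": "assistant", "content": "Here are two economy options..."}]}
reply_text = resolve_path(response, "messages.0.content")
```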

Click Trigger Test Run.


7 Review evaluation results

Each scenario row shows pass/fail status chips for every evaluator.

Click a failed row to view the full transcript and evaluator notes.

Key built-in metrics: hallucination rate, sentiment delta, PII leakage, trajectory compliance, latency, and cost.


8 Automate nightly runs

"""
CI script: fails the build if hallucination > 3 %
Place in .github/workflows/ci.yml or similar.
"""
from maxim import SimulationClient
import os, sys

# Instantiate SDK client
client = SimulationClient(api_key=os.getenv("MAXIM_API_KEY"))

# Kick off the regression suite
run = client.trigger_run(test_suite="nightly_travel_agent")

# Enforce a quality gate
if run["metrics"]["hallucination_rate"] > 0.03:
    sys.exit("Build failed: hallucination rate above 3 %")


9 Dashboards & tracing for root-cause analysis

Test-Runs Comparison Dashboard: trends evaluator metrics across runs over time.

Tracing Dashboard: jump from a failed evaluator directly to the exact request/response pair, including token counts and tool-call payloads.


10 Best-practice checklist

  1. Parameterise IDs, dates, tokens.
  2. Keep post-scripts minimal; do not alter semantics.
  3. Layer evaluators; trajectory & PII first, latency & cost next.
  4. Version datasets in Git.
  5. Route traffic through Bifrost for unified policy and analytics.
  6. Include co-operative, neutral, and antagonistic personas.
  7. Schedule offline simulations nightly; run a lightweight online check on every PR.

11 Measured impact (public case studies)

Clinc cut manual reporting from ~40 h to < 5 min per cycle.

Atomicwork reduced troubleshooting time by ≈ 30 % with trace search.

Thoughtful lowered therapist escalations ≈ 30 % after persona-driven simulations.


12 Conclusion

Agent simulation converts anecdotal QA into an evidence-based, auditable practice. By wrapping your endpoint in a Maxim workflow, adding dynamic scripts, and exercising it with dataset-driven simulations, you gain statistical confidence in context management, compliance, and user experience, well before customers interact with your system.

The video below shows how Agent Simulation can be performed on Maxim AI.

Ready to integrate simulation into your pipeline? Get started free or book a live demo.