How to Build Reliable Multi-Agent Systems with Google ADK and Maxim AI: Instrumentation, Evals, and Observability
Google’s Agent Development Kit (ADK) makes it straightforward to design multi‑agent systems, while Maxim provides the end‑to‑end stack for simulation, evaluation, and observability required to ship these systems reliably. This guide shows how to combine ADK and Maxim for robust agent tracing, debugging, and continuous quality measurement, with code you can copy-paste to get started fast.
Why reliability demands observability and evaluation
Agentic applications are non‑deterministic by design: decisions vary across turns, tools are invoked dynamically, and context evolves throughout sessions. That makes traditional “single input → single output” testing insufficient. You need three layers working together:
- Distributed tracing and logging: Capture complete execution paths across agents, tools, and workflows to pinpoint bottlenecks and failure modes. Google ADK natively supports multi‑agent orchestration and tooling, and its docs emphasize evaluation and debugging patterns across agents and workflows. See the official documentation for architecture, agents, tools, and workflow orchestration patterns in Python and Java: Agent Development Kit. For ADK’s enterprise framing and capabilities on Google Cloud, review Vertex AI Agent Builder and Google’s blog walkthrough of multi‑agent orchestration with ADK: Build multi‑agentic systems using Google ADK.
- Programmatic and LLM‑as‑a‑judge evals: Quantify output quality at session, trace, and span levels; detect regressions across versions; and gate releases. ADK’s docs outline built‑in evaluation capabilities across step‑wise execution and final responses: Evaluate agents in ADK.
- Production monitoring and feedback loops: Track latency, token usage, cost, error categories, and user feedback across live traffic. Maxim’s observability suite is purpose‑built for AI applications: Agent Observability. Pair observability with pre‑release simulation and evals to create measurable quality baselines: Agent Simulation & Evaluation.
Together, ADK orchestrates agent teams and tool use; Maxim turns their behavior into actionable telemetry and quality signals, helping AI engineering and product teams iterate confidently.
What Google ADK brings to agent development
Google ADK is a modular framework for building agents with clear instructions, robust tool schemas, and flexible orchestration across sequential, parallel, and loop workflows. It supports Gemini models, integrates with broader ecosystems via MCP and third‑party tools, and includes built‑in session and memory services. If you are new to ADK, start with the official docs for Python and Java quickstarts, agents, tools, and runners: Agent Development Kit. For a step‑by‑step tutorial on multi‑agent patterns deployed on Google Cloud, see the product page and docs: Vertex AI Agent Builder.
Key capabilities relevant to reliability:
- Multi‑agent orchestration: Compose specialized agents and coordinate via sub‑agents or agents‑as‑tools. See ADK’s workflow agents, parallel execution, and agent transfer patterns in the docs: Agents and workflows.
- Structured function tools: ADK auto‑inspects Python/Java function signatures and docstrings to generate tool schemas, which is critical for predictable tool invocation and debugging (a minimal sketch follows this list). Learn more: Function tools.
- Sessions and memory: Short‑term session state and long‑term memory via in‑memory or Vertex AI Memory Bank services. Reference: Sessions & Memory.
- Built‑in evaluation: Evaluate step‑wise trajectories and outcomes against test suites to improve agents pre‑release. Reference: Evaluate agents.
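To make these building blocks concrete, here is a minimal sketch of an ADK agent that wraps a plain Python function as a tool and delegates to a specialist sub‑agent. The agent names, instructions, tool, and model identifier are illustrative assumptions rather than a prescribed setup; consult the ADK quickstart for the current APIs.

# Minimal ADK sketch: a function tool plus a coordinator with one sub-agent.
# Names, instructions, and the model id are illustrative assumptions.
from google.adk.agents import Agent


def get_order_status(order_id: str) -> dict:
    """Look up the status of an order by its ID."""
    # Hypothetical lookup; replace with your real data source.
    return {"order_id": order_id, "status": "shipped"}


order_agent = Agent(
    name="order_agent",
    model="gemini-2.0-flash",  # use a model available in your project
    instruction="Answer order-status questions using the get_order_status tool.",
    tools=[get_order_status],  # ADK derives the tool schema from the signature and docstring
)

root_agent = Agent(
    name="support_coordinator",
    model="gemini-2.0-flash",
    instruction="Route order questions to order_agent; answer general questions yourself.",
    sub_agents=[order_agent],
)

Because the tool schema comes straight from the function signature and docstring, tool-call failures surfaced in traces usually point back to a schema or docstring problem rather than to the orchestration layer.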
How Maxim complements ADK for reliability
Maxim is an end‑to‑end platform for AI simulation, evaluation, and observability. Engineering and product teams use it to:
- Run agent simulations across hundreds of scenarios and personas, replay traces from any step, and measure conversational/task success: Agent Simulation & Evaluation.
- Configure flexible evaluators (deterministic, statistical, LLM‑as‑a‑judge) at session/trace/span levels; combine with human‑in‑the‑loop workflows to align with user preference: Agent Simulation & Evaluation.
- Instrument production observability with distributed tracing, custom dashboards, alerts, and automated quality checks against rules: Agent Observability.
- Manage prompt engineering and versioning, compare models/parameters across output quality, latency, and cost, and deploy prompts without code changes: Experimentation (Playground++).
For teams routing across providers or needing enterprise‑grade reliability behind a single API, Maxim’s gateway Bifrost offers unified access, automatic fallbacks, load balancing, semantic caching, governance, observability, and OpenAI‑compatible APIs. Explore: Bifrost Features, Fallbacks and load balancing, Observability, and Drop‑in replacement.
Step‑by‑step: Instrument Google ADK with Maxim (copy‑paste setup)
Below is a minimal Python setup to run an ADK agent locally with Maxim instrumentation, enabling agent tracing, token/cost metrics, latency tracking, and structured logs. It uses Gemini via Google AI Studio for simplicity.
- Install dependencies:
pip install maxim-py google-adk python-dotenv
- Add environment variables:
# .env
GOOGLE_CLOUD_PROJECT=your-project-id
GOOGLE_CLOUD_LOCATION=us-central1
GOOGLE_API_KEY=your-google-api-key
GOOGLE_GENAI_USE_VERTEXAI=False
# Maxim
MAXIM_API_KEY=your-maxim-api-key
MAXIM_LOG_REPO_ID=your-log-repository-id
- Initialize Maxim and instrument ADK in your agent package’s __init__.py, so instrumentation is in place before the root agent is exposed:
# your_agent/__init__.py
import os

from dotenv import load_dotenv

load_dotenv()
os.environ.setdefault("GOOGLE_GENAI_USE_VERTEXAI", "False")

from . import agent

try:
    from maxim import Maxim
    from maxim.logger.google_adk import instrument_google_adk

    print("Initializing Maxim instrumentation for Google ADK...")
    maxim = Maxim()
    maxim_logger = maxim.logger()
    instrument_google_adk(maxim_logger, debug=True)
    print("Maxim instrumentation complete!")

    root_agent = agent.root_agent
except ImportError as e:
    print(f"Could not initialize Maxim instrumentation: {e}")
    root_agent = agent.root_agent
- Create a runner script for interactive sessions:
#!/usr/bin/env python3
# run_with_maxim.py
import asyncio
import sys
from pathlib import Path

sys.path.insert(0, str(Path(__file__).parent))

from your_agent import root_agent
from google.adk.runners import InMemoryRunner
from google.genai.types import Part, UserContent


async def interactive_session():
    print("\n" + "=" * 80)
    print("Agent - Conversational Mode")
    print("=" * 80)

    runner = InMemoryRunner(agent=root_agent)
    session = await runner.session_service.create_session(
        app_name=runner.app_name, user_id="user"
    )

    print("\nType your message (or 'exit' to quit)")
    print("=" * 80 + "\n")

    try:
        while True:
            try:
                user_input = input("You: ").strip()
            except EOFError:
                break
            if not user_input:
                continue
            if user_input.lower() in ["exit", "quit"]:
                break

            content = UserContent(parts=[Part(text=user_input)])
            print("\nAgent: ", end="", flush=True)
            try:
                async for event in runner.run_async(
                    user_id=session.user_id,
                    session_id=session.id,
                    new_message=content,
                ):
                    if event.content and event.content.parts:
                        for part in event.content.parts:
                            if part.text:
                                print(part.text, end="", flush=True)
            except Exception as e:
                print(f"\n\nError: {e}")
                continue
            print("\n")
    finally:
        from maxim.logger.google_adk.client import end_maxim_session

        end_maxim_session()
        print("\n" + "=" * 80)
        print("View traces at: https://app.getmaxim.ai")
        print("=" * 80 + "\n")


if __name__ == "__main__":
    asyncio.run(interactive_session())
Run with:
python3 run_with_maxim.py
This pattern uses ADK’s in‑memory runner to keep local development fast while Maxim captures structured traces, metrics, and logs. For ADK APIs, agents, runners, and tooling semantics, see the official quickstart and API references: ADK Quickstart (Python), Python API reference.
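If you prefer a non‑interactive variant, for example as a CI smoke test, the same runner and session APIs used above can be driven with a single scripted turn. The sketch below reuses only the imports and calls already shown in the runner script; the file name and prompt are illustrative.

# smoke_test.py - one scripted turn through the instrumented agent (illustrative)
import asyncio

from your_agent import root_agent
from google.adk.runners import InMemoryRunner
from google.genai.types import Part, UserContent


async def smoke_test(prompt: str) -> str:
    runner = InMemoryRunner(agent=root_agent)
    session = await runner.session_service.create_session(
        app_name=runner.app_name, user_id="ci"
    )

    chunks = []
    async for event in runner.run_async(
        user_id=session.user_id,
        session_id=session.id,
        new_message=UserContent(parts=[Part(text=prompt)]),
    ):
        if event.content and event.content.parts:
            chunks.extend(part.text for part in event.content.parts if part.text)
    return "".join(chunks)


if __name__ == "__main__":
    print(asyncio.run(smoke_test("Summarize your capabilities in one sentence.")))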
Advanced: Node‑level evaluation and custom callbacks for granular insights
Maxim’s instrumentation supports callbacks around generations, traces, and spans, which are ideal for custom metrics such as latency, tokens per second, and cost estimates, as well as for tagging agent outputs for downstream analysis. The example below adds per‑generation latency and an end‑of‑trace cost estimate:
# your_agent/__init__.py (callbacks example)
import os
import time

from dotenv import load_dotenv

load_dotenv()
os.environ.setdefault("GOOGLE_GENAI_USE_VERTEXAI", "False")

from . import agent

try:
    from maxim import Maxim
    from maxim.logger.google_adk import instrument_google_adk

    class MaximCallbacks:
        def __init__(self):
            # Start times keyed by request object id, used to compute per-generation latency
            self.starts = {}

        async def before_generation(self, ctx, llm_request, model_info, messages):
            self.starts[id(llm_request)] = time.time()

        async def after_generation(self, ctx, llm_response, generation, generation_result, usage_info, content, tool_calls):
            gen_id = id(getattr(ctx, "llm_request", None))
            if gen_id in self.starts:
                latency = time.time() - self.starts[gen_id]
                generation.add_metric("latency_seconds", latency)
                total_tokens = usage_info.get("total_tokens", 0)
                if latency > 0:
                    generation.add_metric("tokens_per_second", total_tokens / latency)
                del self.starts[gen_id]
            generation.add_tag("model_provider", "google")
            generation.add_tag("has_tool_calls", "yes" if tool_calls else "no")

        async def after_trace(self, invocation_context, trace, agent_output, trace_usage):
            total_tokens = trace_usage.get("total_tokens", 0)
            estimated_cost = (total_tokens / 1000.0) * 0.01  # illustrative rate; substitute your model's pricing
            trace.add_metric("estimated_cost", estimated_cost)
            trace.add_tag("estimated_cost_usd", f"${estimated_cost:.4f}")

        async def after_span(self, invocation_context, agent_span, agent_output):
            agent_name = invocation_context.agent.name
            output_length = len(agent_output) if agent_output else 0
            agent_span.add_tag("agent_name", agent_name)
            agent_span.add_tag("output_length", str(output_length))
            agent_span.add_metadata({"output_stats": {"length": output_length}})

    callbacks = MaximCallbacks()
    maxim = Maxim()
    instrument_google_adk(
        maxim.logger(),
        debug=True,
        before_generation_callback=callbacks.before_generation,
        after_generation_callback=callbacks.after_generation,
        after_trace_callback=callbacks.after_trace,
        after_span_callback=callbacks.after_span,
    )
    print("Maxim instrumentation with custom callbacks enabled!")

    root_agent = agent.root_agent
except ImportError as e:
    print(f"Could not initialize Maxim: {e}")
    root_agent = agent.root_agent
Use cases for callbacks:
- Agent debugging and tracing: Tag spans by agent name, annotate output stats, and identify slow nodes for optimization (e.g., parallelization using ADK’s ParallelAgent; a hedged sketch follows this list). See parallel orchestration patterns in Google’s tutorial: Build multi‑agentic systems using Google ADK.
- LLM observability and monitoring: Track latency, token throughput, and cost per trace for live SLO monitoring in Maxim: Agent Observability.
- Evals and trustworthy AI: Route spans into eval pipelines and flag hallucinations or tool‑use failures with custom labels and metrics. Configure evaluators and human review from Maxim’s UI and SDKs: Agent Simulation & Evaluation.
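For the parallelization point above, here is a sketch of ADK’s ParallelAgent pattern under stated assumptions: the agent names, instructions, and model id are illustrative, and the exact workflow composition should be checked against ADK’s workflow agent docs.

# Illustrative fan-out: run independent research agents concurrently, then merge.
# Each sub-agent appears as its own span in Maxim, making slow branches easy to spot.
from google.adk.agents import Agent, ParallelAgent, SequentialAgent

flight_agent = Agent(
    name="flight_agent",
    model="gemini-2.0-flash",
    instruction="Find flight options for the requested trip.",
)
hotel_agent = Agent(
    name="hotel_agent",
    model="gemini-2.0-flash",
    instruction="Find hotel options for the requested trip.",
)

# ParallelAgent runs its sub-agents concurrently
research = ParallelAgent(name="trip_research", sub_agents=[flight_agent, hotel_agent])

summarizer = Agent(
    name="summarizer",
    model="gemini-2.0-flash",
    instruction="Combine the flight and hotel findings into one itinerary.",
)

# SequentialAgent runs the fan-out first, then the summarizer
root_agent = SequentialAgent(name="trip_planner", sub_agents=[research, summarizer])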
Voice and RAG observability: what changes in multimodal agents
When agents interact through streaming audio or rely on retrieval pipelines, your observability strategy benefits from a few additions:
- Voice tracing: Track bidirectional streaming events, model segments, endpoint latency, and error categories across turns. Google highlights ADK’s unique streaming capabilities for human‑like conversations in the Agent Builder product description: Vertex AI Agent Builder.
- RAG tracing: Log embedding, retrieval, re‑ranking, and grounding steps as distinct spans, including latency and token attribution. Ground responses where applicable via Vertex AI Search or Google Search, and include retrieval metadata in traces for auditing and evals. See Google’s grounding resources and RAG options within Agent Builder: Vertex AI Agent Builder.
With Maxim, you can create custom dashboards that slice traces by voice vs. text, RAG vs. non‑RAG, model version, and prompt version to manage AI quality, hallucination detection, and RAG observability without guesswork (a small tagging sketch follows): Agent Observability.
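One lightweight way to produce those slices is to reuse the callback hooks from the previous section and tag each trace with modality and retrieval metadata. The sketch below is an assumption‑heavy illustration: is_voice_session and retrieved_chunks are hypothetical attributes you would populate from your own application state, and only the add_tag/add_metric calls shown earlier are used.

# Hypothetical standalone after_trace callback for voice/RAG dashboard slices.
# is_voice_session and retrieved_chunks are placeholders for your own app state.
async def tag_modality_and_rag(invocation_context, trace, agent_output, trace_usage):
    is_voice = getattr(invocation_context, "is_voice_session", False)
    retrieved_chunks = getattr(invocation_context, "retrieved_chunks", [])

    trace.add_tag("modality", "voice" if is_voice else "text")
    trace.add_tag("uses_rag", "yes" if retrieved_chunks else "no")
    if retrieved_chunks:
        trace.add_metric("retrieved_chunk_count", len(retrieved_chunks))

# Wire it in the same way as above, e.g. after_trace_callback=tag_modality_and_rag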
Pre‑release simulation and continuous evaluation
Before shipping updates, use Maxim’s simulation to reproduce real‑world scenarios, measure agent evaluation metrics, and catch regressions:
- Simulate conversational trajectories and re‑run from any step to isolate the root cause: Agent Simulation & Evaluation.
- Run LLM evaluations using programmatic rules, statistical tests, and LLM‑as‑a‑judge, and complement them with human‑in‑the‑loop review for nuanced criteria: Agent Simulation & Evaluation.
- Curate datasets from production logs, eval outputs, and feedback using Maxim’s Data Engine to maintain test suites that reflect reality.
Pair this with ADK’s built‑in evaluation concepts and test runs across workflows to validate step‑by‑step execution and final responses: ADK Evaluate agents.
Operationalizing reliability with Bifrost (Maxim’s AI gateway)
Production reliability often depends on your gateway and routing strategy. Bifrost provides:
- Unified OpenAI‑compatible API across providers and models: Unified Interface.
- Automatic fallbacks and load balancing to mitigate provider/model incidents: Fallbacks.
- Semantic caching to cut repeated cost and latency: Semantic Caching.
- Governance and budget management for enterprise control: Governance.
- Native observability with Prometheus metrics and distributed tracing: Observability.
- Drop‑in replacement for provider SDKs to get started in seconds (a usage sketch follows this list): Drop‑in replacement.
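Because Bifrost exposes an OpenAI‑compatible surface, pointing an existing OpenAI client at it is typically a one‑line change. The sketch below is an assumption: the base URL, API‑key handling, and model identifier depend on your Bifrost deployment and provider configuration.

# Illustrative drop-in: reuse the OpenAI SDK against a Bifrost deployment.
# base_url, api_key handling, and the model name are assumptions for your setup.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",  # your Bifrost endpoint (assumed)
    api_key="placeholder",  # provider keys may be held by Bifrost, per your governance setup
)

response = client.chat.completions.create(
    model="gemini-2.0-flash",  # routed/mapped by Bifrost according to your provider config
    messages=[{"role": "user", "content": "Ping through the gateway."}],
)
print(response.choices[0].message.content)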
Combining ADK’s orchestration, Maxim’s evals and observability, and Bifrost’s gateway features yields a defensible operational posture for AI monitoring, LLM monitoring, and agent observability at scale.
Recommended structure for teams adopting ADK + Maxim
- Development: Build and run agents locally with ADK’s InMemoryRunner. Instrument with Maxim for agent tracing, token/cost metrics, and structured logs. Use Maxim’s Experimentation to manage prompt engineering and compare models on AI quality, cost, and latency: Experimentation.
- Pre‑release: Use Maxim simulation and evals to establish baselines. Configure evaluators at session/trace/span levels. Block deploys on regressions using quantitative gates: Agent Simulation & Evaluation.
- Production: Route via Bifrost with fallbacks and load balancing. Instrument AI observability and alerting. Monitor LLM traces for bottlenecks and apply agent debugging workflows to outliers: Agent Observability, Bifrost Observability.
For architectural and capability references from Google, rely on the official docs and product pages: Agent Development Kit, Vertex AI Agent Builder, and Google’s multi‑agent tutorial: Build multi‑agentic systems using Google ADK.
Ready to see your agents with full‑stack AI reliability, from pre‑release simulation to production observability? Book a personalized walkthrough: Request a Maxim demo, or start building now: Sign up.