Building High-Quality Document Processing Agents for Insurance Industry

Generative AI is reshaping how insurers operate and serve their customers. Across sectors like health, life, auto, and property & casualty, insurers are embracing GenAI to enhance customer experience, drive efficiency, and improve decision-making. This shift isn’t just theoretical; over two-thirds of insurers are already using GenAI regularly, and nearly 90% plan to increase investment by the end of 2025 (Source).
The insurance landscape is dominated by document-heavy, process-intensive workflows, and LLMs are streamlining them fast. From processing claims (summarizing complex documents, verifying policy coverage, and so on) to drafting personalized content (proposals, emails, etc.), GenAI is getting embedded into critical functions. AI assistants now handle routine queries around the clock, offer quick policy lookups, and free up human agents to focus on more complex tasks.
However, as GenAI becomes integral to functions like claims and underwriting, ensuring its reliability becomes essential. Errors in AI-generated summaries or decisions can lead to incorrect payouts, regulatory exposure, and diminished trust among policyholders. UnitedHealth, for instance, faced an ongoing lawsuit after its AI system was linked to healthcare claim denials, underscoring the risks of unchecked automation and the need for rigorous monitoring and evaluation.
TL;DR
In this blog, we’ll walk through a popular use case of processing insurance claims using GenAI in the Auto insurance sector. We’ll:
- Use LLMs to extract key details from documents like FNOLs (First Notice of Loss), invoices, and police/medical reports.
- Use LLMs to verify claim details against the policy document.
- Evaluate the accuracy of the extracted data and the generated final decision using Maxim AI.
Evaluation objective
We want to ensure that our AI system:
- Accurately extracts key details from submitted claim documents.
- Correctly verifies claim validity based on the policy’s terms, limits, and exclusions.
- Makes reliable and explainable decisions, whether to approve, reject, or escalate a claim.
I. Processing claim documents and extracting important details
The claims process involves producing several documents to support the claimant’s case, for example, FNOLs, medical reports, bills, and image evidence. In our example, we’ll use the following documents:
- First Notice of Loss (FNOL): The initial report filed by the policyholder describing the incident and what they’re claiming.
- Invoice: Proof of expenses related to repairs, medical treatment, or property damage, to extract claimed amounts.
- Supporting evidence (police report): additional documentation to help validate the claim's context.
Step 1: Creating a document extraction workflow
We’ll use a multimodal LLM such as GPT-4o to extract and summarize key information from each document. We’ll process the documents using Maxim’s Prompt Playground, which supports uploading files, such as images, audio, and PDFs, and using them as inputs to the LLM.
We’ll use the following prompt to extract key details such as policy information, policyholder details, vehicle information, and invoice data, including the total amount claimed, and output them in a structured format.
Prompt for extracting data from documents
You are a claims assistant specialized in auto insurance. Your task is to extract and structure all information relevant to claim validation and coverage determination from the following documents:
- First Notice of Loss (FNOL)
- Police Report
- Repair Invoice (image provided)
These documents pertain to the same auto accident case. Extract the following structured fields wherever available across the sources. If data is missing or inconsistent across documents, flag it clearly.
Extract the following fields and use the keys mentioned corresponding to them:
1. Policy & Insured Info (Key: PolicyInsuredInfo)
- Policyholder Name (Key: PolicyholderName)
- Policy Number (Key: PolicyNumber)
- Policy State (Key: PolicyState)
2. Vehicle & Driver Info (Key: VehicleDriverInfo)
- Vehicle Make / Model / Year (Key: VehicleMakeModelYear)
- VIN (Key: VIN)
- License Plate (Key: LicensePlate)
- Driver Name (Key: DriverName)
- Driver License Number & State (Key: DriverLicenseNumberState)
- Injuries Reported (Key: InjuriesReported)
3. Accident Facts (Key: AccidentFacts)
- Date & Time of Accident (Key: AccidentDateTime)
- Accident Location (Key: AccidentLocation)
- Police Report Number (Key: PoliceReportNumber)
- Tow Details (Key: TowDetails)
- Fault Determination (Key: FaultDetermination)
4. Other Party Info (Key: OtherPartyInfo)
- Other Driver's Name (Key: OtherDriverName)
- Insurance Provider (Key: OtherInsuranceProvider)
- Statement of Events (Key: StatementOfEvents)
5. Repair Invoice (Key: RepairInvoice)
- Invoice Number & Date (Key: InvoiceNumberDate)
- Total Amount Billed (Key: TotalAmountBilled)
- Key Repairs (Key: KeyRepairs)
- Shop Name & Address (Key: ShopNameAddress)
- Vehicle Mentioned in Invoice (Key: VehicleMentionedInInvoice)
6. Summary: Provide a summary of this case as well
Format the output as JSON only. Mark any missing information as NA, and do not fabricate information.
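With those keys, a successful extraction might produce output shaped like the following (a sketch; every value here is illustrative, and NA marks fields the documents didn't contain):

```javascript
// Illustrative extraction output; all values are hypothetical.
const extraction = {
  PolicyInsuredInfo: {
    PolicyholderName: "Jane Doe",
    PolicyNumber: "POL-123456",
    PolicyState: "CA"
  },
  AccidentFacts: {
    AccidentDateTime: "2024-03-14 08:30",
    PoliceReportNumber: "PR-7789",
    FaultDetermination: "NA" // not stated in any of the documents
  },
  RepairInvoice: {
    InvoiceNumberDate: "INV-2201 / 2024-03-20",
    TotalAmountBilled: "$1,850.00"
  },
  Summary: "Rear-end collision; claimant not at fault; repairs invoiced at $1,850."
};
```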
Step 2: Evaluating the accuracy of extraction
- Setting up evaluators: Once the structured output is generated, the next step is to evaluate how accurately the LLM extracted and structured the information. Since this data is deterministic, we'll create string-matching-based programmatic evaluators in Maxim to validate that key fields, such as the policy number and claimed amount, are accurate.
- checkPolicyNumber: Custom programmatic evaluator to validate the extracted policy number. (Similarly, more such evaluators can be created in Maxim to assess extraction accuracy for other fields.)
// This evaluator checks whether the correct policy number was
// extracted, by comparing the output against ground-truth data.
function validate(output, expectedPolicyNumber) {
  const jsonData = JSON.parse(output);
  const policyNumber = jsonData.PolicyInsuredInfo.PolicyNumber;
  return policyNumber === expectedPolicyNumber;
}
- Preparing golden dataset: Next, we’ll create a golden dataset to evaluate how this workflow performs across different cases. Maxim supports attaching files (PDFs, images, audio, etc.) as dataset entries, which we can pass to our prompt to run automated evaluations. In our example, the golden dataset will be a collection of files such as FNOLs, invoices, and a police report, along with the expected policy number.
- Create dataset: Head to the "Datasets" section in Maxim and create a new dataset. We’ll name this "Document processing dataset" and use the "Prompt or endpoint testing" template for this example.
- Input: Since our inputs are files, set the Input column's type to "Input Files".
- Expected Output: Make sure to rename this column to "expectedPolicyNumber" as the programmatic evaluator references this exact column name in the dataset.
- Enrich dataset: We’ll use the following collection of documents (Download) to populate the dataset. Fetch the policy number from each folder name and enter it in the "expectedPolicyNumber" column.
- Running evaluations: Finally, to test the LLM-based extraction workflow, go to the Prompt Playground and click "Test". Select the ground truth dataset and the evaluator created in the previous steps. You can also add additional AI, programmatic, or human evaluators based on your quality requirements.
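Before wiring the evaluator into a test run, it can help to sanity-check its logic locally (a sketch assuming a Node.js environment; the sample output below is hypothetical, not real claim data):

```javascript
// Same logic as the checkPolicyNumber evaluator, exercised locally
// against a hypothetical extraction output.
function validate(output, expectedPolicyNumber) {
  const jsonData = JSON.parse(output);
  return jsonData.PolicyInsuredInfo.PolicyNumber === expectedPolicyNumber;
}

// A minimal stand-in for what the extraction prompt might return.
const sampleOutput = JSON.stringify({
  PolicyInsuredInfo: { PolicyholderName: "Jane Doe", PolicyNumber: "POL-123456" }
});

console.log(validate(sampleOutput, "POL-123456")); // true
console.log(validate(sampleOutput, "POL-999999")); // false
```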
II. Creating a claims verification and validation workflow
Once the extraction funnel is in place, we need to validate the claim’s authenticity. This involves assessing extracted information, comparing it with policy terms, verifying coverage, generating a claims summary, and deciding whether to approve, reject, or escalate the claim.
To achieve this, we’ll create a workflow using this insurance policy document as the knowledge source to validate key details such as limits, coverage, terms, etc., and generate a judgment for the next steps.
Step 1: Creating a claim validation workflow
In the Prompt Playground, we'll use the following prompt to guide the LLM (here, GPT-4o) to take the extracted JSON as input, search the policy document, and produce a judgment (auto-approve, reject, human-review, etc.) along with the justification behind the decision.
v1: Basic prompt to validate claim data
You are an AI claims assistant in an auto insurance workflow. Your role is to assess the validity of a submitted insurance claim by evaluating the structured/unstructured claim data passed in the user message against the official insurance policy document.
Insurance policy document: {{doc}}
Your task is to:
1. Assess if the claim is covered.
2. Identify relevant coverage types.
3. Check for applicable exclusions.
4. Determine the appropriate judgment.
5. Justify your reasoning with references to the policy.
Coverage types: Liability, Collision, Uninsured/Underinsured Motorist (UM/UIM), Medical Payments (MedPay) / PIP
Common exclusions: Intentional/criminal acts, commercial/rideshare use without endorsement, unlisted drivers or autos, racing, specialty vehicles (motorcycles, RVs) unless endorsed
Judgment:
- Auto-approve if: Coverage applies, no exclusions are triggered, and the claim amount is within limits
- Reject if: The policy is inactive, any exclusion applies, there is fraud, criminal use, or a non-covered event, or the amount claimed is beyond policy limits
- Human-review if: Coverage is unclear, there is missing/conflicting info, or the case is high-severity or high-value (i.e., amount claimed is > $2,000) or involves serious injuries
Give output in JSON format:
{
"judgment": "auto-approve" | "human-review" | "reject",
"justification": "<concise explanation referencing policy clauses>",
}
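The judgment rules above can be sketched as plain code to make the decision boundaries explicit. This is a simplification: it assumes upstream steps have already resolved these flags and amounts from the policy document, and all field names are illustrative.

```javascript
// Simplified sketch of the prompt's judgment rules. All inputs are
// assumed pre-resolved from the policy lookup; names are illustrative.
function decideJudgment(c) {
  // Unclear or incomplete information always goes to a human first.
  if (c.coverageUnclear || c.infoMissingOrConflicting) return "human-review";
  // Hard failures: inactive policy, triggered exclusion, uncovered event,
  // or a claim that exceeds the policy limit.
  if (!c.policyActive || c.exclusionTriggered || !c.covered ||
      c.amountClaimed > c.coverageLimit) return "reject";
  // High-value or high-severity claims are escalated, not auto-approved.
  if (c.seriousInjuries || c.amountClaimed > 2000) return "human-review";
  return "auto-approve";
}

console.log(decideJudgment({
  coverageUnclear: false, infoMissingOrConflicting: false,
  policyActive: true, exclusionTriggered: false, covered: true,
  seriousInjuries: false, amountClaimed: 1850, coverageLimit: 25000
})); // "auto-approve"
```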
For the scope of this blog, we're generating just two fields in the output, but you can use a more detailed prompt, like the one below, to cover information such as the type of coverage, exclusions found, policy sections referenced, and other factors that affect decision-making in the claims handling process. For example:
v2: Detailed prompt to validate claim data
You are an AI claims assistant in an auto insurance workflow. Your role is to assess the validity of a submitted insurance claim by evaluating the structured/unstructured claim data passed in the user message against the official insurance policy document.
Insurance policy document: {{doc}}
Your task is to:
1. Assess if the claim is covered.
2. Identify relevant coverage types.
3. Check for applicable exclusions.
4. Determine the appropriate judgment.
5. Justify your reasoning with references to the policy.
Coverage types (match against claim context):
- Liability: Bodily injury, property damage
- Collision: Damage from vehicle crashes
- Comprehensive: Theft, fire, natural disasters, animals
- Uninsured/Underinsured Motorist (UM/UIM)
- Medical Payments (MedPay) / PIP
Coverage limits:
- Refer to sections 12.1 to 12.7 (e.g., 50/100/25 for BI/PD liability) in the policy document
Common exclusions:
- Intentional/criminal acts
- Commercial/rideshare use without endorsement
- Unlisted drivers or autos
- Racing, war, foreign use, mechanical failure
- Specialty vehicles (motorcycles, RVs) unless endorsed
Judgment:
- Auto-approve if: Coverage applies, no exclusions are triggered, and the claim amount is within limits
- Reject if: The policy is inactive, any exclusion applies, or there is fraud, criminal use, or a non-covered event
- Human-review if: Coverage is unclear, there is missing/conflicting info, or the case is high-severity, high-value, or involves serious injuries
Give output in JSON format:
{
"covered": true | false,
"coverage_types": [...],
"exclusions_found": [...],
"policy_sections_referenced": [...],
"judgment": "auto-approve" | "human-review" | "reject",
"justification": "<concise explanation referencing policy clauses>",
"follow_up_actions": [...]
}
Here, we'll pass the policy document as a PDF file via the {{doc}} variable defined in the prompt. For the prompt to reference the document during test runs, we need to add the document under the column named "doc" in the corresponding test dataset.
We can also use a RAG workflow here to look up relevant information in the policy document. Maxim supports attaching a knowledge base via API or by uploading files (txt, pdf, csv, etc.), which can be referenced directly in the Prompt Playground and for automated runs.
Step 2: Evaluating the quality of the validation workflow
Since the workflow takes extracted data as input, we’ll create a No-code agent in Maxim to sequentially pass the files as input, extract the data, route the extracted data through the validation flow, and generate the final judgment. Leveraging Maxim’s evaluators, we can apply AI, programmatic, or human evals to assess the output of the No-code agent, i.e., the output of the validation flow.
- Prototype end-to-end claims processing flow as a No-code agent in Maxim:
- Navigate to the "Agents" section and select "No-code agent".
- We can simply use the prompts we created earlier for extraction and validation by clicking "Add Node", selecting "Prompts", and choosing the desired prompt and version.
- Arrange and map the nodes in the correct sequence, i.e., extraction followed by validation.
- Preparing the golden dataset: We'll refine the same dataset used to evaluate extraction by adding two new columns to store the policy document and the ground-truth judgment.
- "doc": This column, of "Files" type, will contain the policy document used as a knowledge base to validate claims.
- "expectedJudgment": This column, of "Variable" type, will be compared against the AI-generated judgment.
- Setting up evaluators: Our output contains two components: a judgment (deterministic) and a justification (non-deterministic). To evaluate these, we’ll use a string-match based programmatic evaluator for the judgment, and an LLM-as-a-judge based evaluator for the justification.
- validateJudgment: Custom programmatic evaluator to validate judgment.
// This evaluator checks whether the correct judgment was generated
// by the LLM, by comparing against ground-truth data.
function validate(output, expectedJudgment) {
  const jsonData = JSON.parse(output);
  const judgment = jsonData.judgment;
  return judgment === expectedJudgment;
}
- Conciseness: This evaluator validates that the generated justification is concise with no redundant information.
- If you're using a context source (in a RAG-based validation flow), you can use Maxim's built-in Faithfulness evaluator to measure the quality of LLM generation by assessing whether the output factually aligns with the provided context and input.
- Run evaluations: Finally, to test the claim validation workflow, go to the No-code agent and click "Test". Select the ground truth dataset and the desired evaluators (here, Faithfulness and validateJudgment), and trigger the test run. Upon completion, you'll see a detailed report of the performance of your AI-powered claims processing workflow across the chosen eval metrics and model metrics such as latency and cost.
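One practical caveat: LLM output is not guaranteed to be valid JSON, so programmatic evaluators like validateJudgment above can be written defensively, failing the evaluation on unparseable output instead of throwing. A minimal sketch:

```javascript
// Defensive variant of the judgment evaluator: malformed JSON or a
// missing judgment field fails the check instead of crashing.
function validateSafe(output, expectedJudgment) {
  try {
    return JSON.parse(output).judgment === expectedJudgment;
  } catch (err) {
    return false; // unparseable output counts as a failed evaluation
  }
}

console.log(validateSafe('{"judgment": "reject"}', "reject")); // true
console.log(validateSafe("not valid json", "reject"));         // false
```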
Check out this dynamic evaluation report generated on Maxim for this case. We can dive deeper into the evaluation scores and reasoning to iteratively improve the quality of our workflow.
This example can be extended to other document-heavy workflows, such as validating receipts in auditing, verifying invoices in procurement, and processing claims in other insurance verticals.
Maxim adheres to leading industry standards such as HIPAA and AICPA SOC 2 Type II to ensure data protection. For customers who can't have data leave their environment, Maxim offers deployment of the platform directly within their Virtual Private Cloud (VPC), securing both the control and data planes. More on in-VPC support.