ClaimLens
Case Study • RAG Systems • Retrieval Engineering
Most RAG systems fail on real-world documents.
ClaimLens solves this by replacing naive chunking with deterministic clause-level retrieval — built for insurance policies where precision isn't optional.
Introduction
ClaimLens is a production-oriented Retrieval-Augmented Generation (RAG) system designed for insurance policy analysis, where accuracy and traceability are critical.
Traditional RAG pipelines often rely on heuristic chunking and loosely grounded outputs, which can lead to inconsistent retrieval and hallucinations. In domains like insurance, where decisions depend on precise clauses, this becomes a major limitation.
This project focuses on treating retrieval and reasoning as structured, deterministic systems rather than black-box pipelines, ensuring that every output is grounded, traceable, and evaluable.
Problem & Motivation
Most RAG tutorials suggest a simple pipeline: chunk documents, embed them, retrieve by similarity, and generate answers with an LLM. This works well for clean text but breaks down on real-world documents like insurance policies.
Insurance PDFs are structurally complex, with inconsistent numbering, repeated headings, annexures, and noisy formatting. Naive token-based chunking ignores these structures, often splitting clauses incorrectly or missing important context entirely.
The core problem wasn't retrieval — it was structure.
To address this, I designed a deterministic clause parser that:
• Detects multiple clause formats (numbered, Roman-numeral, alphabetic, definitions)
• Assigns canonical IDs to each clause for traceability
• Enforces fail-fast behavior to avoid silent parsing errors
This ensures that each retrieval unit maps directly to a real legal clause, improving both retrieval accuracy and interpretability.
The parser is not perfect; multi-column layouts and inconsistent formatting remain challenging. Even so, it significantly outperforms naive chunking, and the accompanying evaluation framework makes it possible to improve performance iteratively.
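The clause parser described above can be sketched as a small pattern-matching loop. The patterns and the `parse_clauses` helper below are illustrative, not the actual ClaimLens implementation; a production parser needs many more clause variants and disambiguation (e.g. "(i)" read as alphabetic vs. Roman).

```python
import re

# Hypothetical patterns for the clause formats named above.
CLAUSE_PATTERNS = [
    ("numbered",   re.compile(r"^(\d+(?:\.\d+)*)[.)]\s+(.*)")),
    ("roman",      re.compile(r"^([IVXLC]+)[.)]\s+(.*)")),
    ("alphabetic", re.compile(r"^\(([a-z])\)\s+(.*)")),
    ("definition", re.compile(r'^"([^"]+)"\s+means\s+(.*)')),
]

def parse_clauses(lines):
    """Split policy lines into clauses with canonical IDs; any line that
    matches no known format raises immediately (fail-fast)."""
    clauses = []
    for line in lines:
        stripped = line.strip()
        if not stripped:
            continue
        for kind, pattern in CLAUSE_PATTERNS:
            match = pattern.match(stripped)
            if match:
                clauses.append({"id": f"{kind}:{match.group(1)}",
                                "text": match.group(2)})
                break
        else:
            # Never silently drop or mis-chunk a clause.
            raise ValueError(f"unrecognized clause format: {stripped!r}")
    return clauses
```

The `for/else` raise is the fail-fast behavior: a formatting variant the parser has never seen halts ingestion rather than producing a corrupted retrieval unit.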
Overview
ClaimLens is designed as a structured retrieval system rather than a naive RAG pipeline.
The system enforces:
• Deterministic parsing for stable retrieval units
• Canonical identifiers for traceability
• Evaluation-driven design for measurable performance
The goal is to move from "LLM-generated answers" to reliable, reproducible decision support.
System Architecture
The architecture is designed to separate concerns across ingestion, retrieval, ranking, and reasoning, ensuring each component is independently optimizable and testable.
• Ingestion → Page-level document loading
• Clause Splitter → Deterministic clause extraction
• Retriever → Dense retrieval (FAISS)
• Reranker → Cross-Encoder ranking refinement
• Reasoner → LLM with strict schema validation
• Pipeline → End-to-end orchestration
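The separation of concerns above can be sketched as independently swappable stages. The stubs below are illustrative only: keyword overlap stands in for FAISS dense retrieval, a pass-through stands in for the cross-encoder, and the stage names are not the actual ClaimLens API.

```python
def ingest(pages):
    # Page-level loading: drop empty pages, normalize whitespace.
    return [p.strip() for p in pages if p.strip()]

def split_clauses(pages):
    # Deterministic clause extraction (stubbed as one clause per page).
    return [{"id": f"C{i}", "text": t} for i, t in enumerate(pages, 1)]

def retrieve(clauses, query, k=40):
    # Stand-in for FAISS dense retrieval: keyword-overlap scoring.
    words = query.lower().split()
    return sorted(clauses,
                  key=lambda c: -sum(w in c["text"].lower() for w in words))[:k]

def rerank(candidates, query, top=5):
    # Stand-in for the cross-encoder refinement step.
    return candidates[:top]

def reason(evidence, query):
    # Stand-in for the schema-validated LLM call.
    return {"answer": "(LLM answer placeholder)",
            "citations": [c["id"] for c in evidence]}

def pipeline(pages, query):
    # End-to-end orchestration: each stage is testable in isolation.
    clauses = split_clauses(ingest(pages))
    return reason(rerank(retrieve(clauses, query), query), query)
```

Because each stage only consumes the previous stage's output, any one of them can be swapped for a real implementation without touching the others.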
Architecture
The two diagrams below translate the system description into a product view and an execution flow, making it easier to see how ClaimLens moves from a policy question to a grounded answer.
Diagram 01
ClaimLens System Surface
A high-level product view showing how the experience layer, service layer, and retrieval/reasoning engine work together.
Experience Layer
Portfolio UI / user-facing interactions
Service Layer
API orchestration and request handling
ClaimLens Engine
Clause parsing, retrieval, reranking, reasoning
Data Foundation
Model Runtime
Diagram 02
ClaimLens Retrieval and Reasoning Flow
The query is normalized, routed through retrieval, and only then passed into a constrained reasoning layer for a grounded final answer.
Coverage Query
User asks about claim eligibility or policy terms
Query Builder
Transforms the request into retrieval-friendly intent
Pipeline Orchestrator
Coordinates retrieval, reranking, and answer assembly
Retrieval Lane
Reasoning Lane
Clause Evidence
Top-ranked passages retained for answer generation
Validation Gate
Pydantic schema and retry logic enforce structure
Structured Answer
Grounded response with confidence and citations
Design Constraints
• High precision required for legal clause interpretation
• Inconsistent document structures across insurers
• Need for traceable and explainable outputs
• Minimizing hallucinations in LLM reasoning
Key Engineering Decisions
Deterministic Clause Parsing
Moved from token-based chunking to deterministic clause parsing to ensure retrieval operates on semantically meaningful and stable units, improving both recall and interpretability.
Canonical Clause IDs
Token-based chunks lacked identity across runs, making evaluation inconsistent. Introduced canonical clause identifiers so that retrieval experiments are reproducible and traceable across different queries and document versions.
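One way to realize such identifiers is to derive them from document identity, clause numbering, and normalized clause text rather than chunk position. The scheme below is illustrative only; the case study does not specify the real ID format.

```python
import hashlib

def canonical_id(doc_id: str, clause_number: str, text: str) -> str:
    """Stable across runs and re-parses: whitespace and casing changes
    do not alter the ID, but any substantive text edit does."""
    normalized = " ".join(text.split()).lower()
    digest = hashlib.sha1(normalized.encode("utf-8")).hexdigest()[:8]
    return f"{doc_id}/{clause_number}#{digest}"
```

Including a content digest means a silently edited clause gets a new ID, so stale evaluation labels fail loudly instead of matching the wrong text.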
Fail-Fast Design
Silent failures in LLM pipelines produce unreliable outputs that are difficult to debug. Applied fail-fast validation with explicit error handling at each stage, ensuring that failures surface immediately and prevent cascading issues downstream.
Retrieval Pipeline
• Dense Retrieval (FAISS + BGE embeddings)
• Top-K = 40 candidate generation
• Cross-Encoder reranking → Top 5
• Eliminated manual hybrid weighting
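The two-stage funnel above (dense top-40, cross-encoder top-5) can be sketched with plain cosine similarity standing in for FAISS + BGE, and a precomputed score lookup standing in for the cross-encoder call; both substitutions are assumptions for the sake of a self-contained example.

```python
from math import sqrt

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

def two_stage_retrieve(query_vec, clause_vecs, cross_score, k=40, top=5):
    """Stage 1: dense top-k by embedding similarity (FAISS in production).
    Stage 2: cross-encoder scores reorder only the k survivors; just the
    top few reach the reasoner. cross_score(i) stands in for scoring
    clause i against the query with a cross-encoder."""
    candidates = sorted(range(len(clause_vecs)),
                        key=lambda i: -cosine(query_vec, clause_vecs[i]))[:k]
    return sorted(candidates, key=lambda i: -cross_score(i))[:top]
```

The design point is the asymmetry: the cheap dense stage scores every clause, while the expensive cross-encoder only ever sees k candidates, which keeps latency bounded.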
Reasoning & Validation
• Strict JSON schema enforcement (Pydantic)
• Citation grounding constraints
• Retry mechanism on validation failure
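The validation gate can be sketched as follows, using stdlib JSON checks in place of the actual Pydantic model; the field names (`answer`, `confidence`, `citations`) are assumptions, not the real schema.

```python
import json

# Expected shape of the LLM's output (illustrative schema).
REQUIRED = {"answer": str, "confidence": float, "citations": list}

def validate(raw):
    """Parse and type-check the LLM's JSON output, enforcing the
    citation-grounding constraint."""
    data = json.loads(raw)
    for field, typ in REQUIRED.items():
        if not isinstance(data.get(field), typ):
            raise ValueError(f"field {field!r} missing or wrong type")
    if not data["citations"]:
        raise ValueError("answer must cite at least one clause")
    return data

def reason_with_retry(call_llm, prompt, max_attempts=3):
    """Re-prompt on validation failure; after bounded retries, fail
    loudly rather than pass malformed output downstream."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return validate(call_llm(prompt))
        except (ValueError, json.JSONDecodeError) as exc:
            last_error = exc
    raise RuntimeError(f"validation failed after {max_attempts} attempts: {last_error}")
```

The bounded retry is what turns an unreliable generator into a fail-fast component: either a schema-conforming answer comes back, or the pipeline surfaces an explicit error.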
What Makes This Different
Typical RAG
• Token-based chunking
• Weak evaluation
• Hallucination prone
ClaimLens
• Deterministic clause parsing
• Canonical IDs
• Strict validation
Evaluation
Evaluation was treated as a first-class component rather than an afterthought.
Metrics such as Recall@20 and MRR were used to measure retrieval effectiveness, ensuring that relevant clauses are consistently surfaced before reasoning.
This enabled iterative improvements in retrieval quality instead of relying on subjective output inspection.
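Both metrics reduce to a few lines; here `ranked` is the retriever's ordered list of clause IDs and `relevant` the labeled gold clauses for a query.

```python
def recall_at_k(ranked, relevant, k=20):
    """Fraction of relevant clause IDs that appear in the top-k results."""
    return len(set(ranked[:k]) & set(relevant)) / len(relevant)

def mrr(queries):
    """Mean reciprocal rank of the first relevant clause per query.
    queries: list of (ranked_ids, relevant_ids) pairs."""
    total = 0.0
    for ranked, relevant in queries:
        rank = next((i + 1 for i, cid in enumerate(ranked) if cid in relevant), None)
        total += 1.0 / rank if rank else 0.0
    return total / len(queries)
```

Because clauses carry canonical IDs, these set comparisons are exact; with positional token chunks the same gold labels would drift between runs.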
Recall@20: 0.93
MRR: 0.89
Trade-offs
• Deterministic parsing increases complexity but improves consistency
• Cross-encoder reranking improves accuracy at the cost of latency
• Strict validation reduces flexibility but ensures reliability
Challenges
• Handling inconsistent clause structures across insurers
• Reducing noise from dense retrieval
• Enforcing strict schema validation on LLM outputs
Future Improvements
• Adaptive retrieval based on query intent
• Learning-to-rank for dynamic reranking optimization
• Feedback loop for continuous evaluation improvement
• Integration with LangGraph for agentic workflows
What I Learned
• Retrieval quality is the primary bottleneck in RAG systems
• Evaluation is essential for iterative improvement
• Structure and constraints improve LLM reliability more than prompt tuning
Key Insight
Reliable RAG systems are not achieved by better prompts, but by designing retrieval and reasoning as structured, deterministic pipelines with measurable performance.