At our lab, I’m currently involved in a research project commissioned by an insurance company in South Korea. This is a one-year project, and over the course of the year, I plan to share regular updates and insights through a series of blog posts.
Here’s the project overview.
Motivation
Insurance companies manage a vast number of insurance products developed over the past few decades. Each product is typically accompanied by a set of long, complex documents filled with professional jargon and intricate structures.
This complexity poses a major challenge for both customers and insurance consultants. Consider a scenario: a customer asks a consultant about their claim eligibility after an accident or illness. However, the consultant may not have the expertise or time to manually locate the relevant sections across several lengthy documents.
In such situations, a well-trained Information Retrieval (IR) model can be a game-changer—automating the process of identifying relevant information quickly and accurately. Ultimately, this has the potential to significantly reduce operational costs and improve service quality.
Data Preprocessing
We are provided with numerous PDF documents in a variety of formats. To handle this diversity, our goal is to convert them into a consistent structure using Large Language Models (LLMs).
Before we can leverage LLMs, we must first extract text from these PDF files. We’re currently evaluating various Python libraries to determine which one performs best for our use case. Once the text extraction is complete, the text will be fed into LLMs with carefully crafted prompts that guide the models to standardize the content structure.
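To make the cleanup step concrete, here is a minimal sketch of what happens between raw PDF extraction and the LLM call. The function names, regexes, and prompt template are illustrative assumptions, not our actual pipeline:

```python
import re

def clean_extracted_text(raw: str) -> str:
    """Normalize text pulled from a PDF before sending it to an LLM.

    PDF extractors often leave hard line breaks mid-sentence and
    hyphenate words across lines; this joins them back together.
    """
    # Rejoin words hyphenated across line breaks: "cover-\nage" -> "coverage"
    text = re.sub(r"-\n(?=\w)", "", raw)
    # Collapse single line breaks into spaces, but keep paragraph breaks
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    # Squeeze runs of spaces and tabs
    return re.sub(r"[ \t]+", " ", text).strip()

def build_standardization_prompt(text: str) -> str:
    """Wrap cleaned text in an (illustrative) prompt asking the LLM to
    restructure it into a consistent section layout."""
    return (
        "Restructure the following insurance document excerpt into "
        "sections titled 'Coverage', 'Exclusions', and 'Claims Procedure'. "
        "Preserve all factual content.\n\n" + text
    )

raw = "The policy provides cover-\nage for accidental injury\nsustained abroad.\n\nExclusions apply."
cleaned = clean_extracted_text(raw)
# -> "The policy provides coverage for accidental injury sustained abroad.\n\nExclusions apply."
```

The section titles in the prompt are placeholders; in practice the target structure will be designed around the actual product documents.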
Data Generation
To train an IR model, we need a large number of (query, passage) pairs. However, we have no real-world user queries for these documents. To address this, we plan to generate synthetic queries using LLMs.
The process involves prompting the LLM to generate queries based on specific sections of insurance documents. However, there’s a risk that the generated query may align more closely with a different section. To filter such cases, we will implement an encoder-based validation model to ensure that the query remains semantically aligned with the original passage.
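The validation logic can be sketched as follows. Here a toy bag-of-words similarity stands in for the real encoder (which would be a trained sentence encoder producing dense embeddings); the acceptance rule shown, keep a query only if its source passage is its nearest passage, is one simple option:

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; a real pipeline would use a
    trained encoder model here."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def query_is_valid(query: str, source_passage: str, all_passages: list[str]) -> bool:
    """Keep a synthetic query only if its source passage is the passage
    it is most similar to among all candidates."""
    q = embed(query)
    scores = [cosine(q, embed(p)) for p in all_passages]
    best = max(range(len(all_passages)), key=lambda i: scores[i])
    return all_passages[best] == source_passage

passages = [
    "Dental treatment is covered after a 90 day waiting period.",
    "Traffic accident hospitalization is covered up to 30 days.",
]
ok = query_is_valid(
    "How long is the dental treatment waiting period?", passages[0], passages
)
# -> True: the query matches its source passage best
```

Queries that score higher against a different passage than their source are discarded before training.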
Training the Information Retrieval Model
Once the training data is ready, we’ll move on to training the IR model. Our current plan is to use Dense Passage Retrieval (DPR)-based methods, though agent-based approaches are also under consideration.
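At the heart of DPR training is a contrastive objective over in-batch negatives: each query should score its own passage higher than every other passage in the batch. A minimal pure-Python sketch of that loss, assuming the similarity matrix of query/passage dot products has already been computed by the two encoders (a real implementation would use dual BERT-style encoders in a framework like PyTorch):

```python
import math

def dpr_loss(sim_matrix: list[list[float]]) -> float:
    """Average negative log-likelihood of the correct passage per query.

    sim_matrix[i][j] is the similarity between query i and passage j;
    passage i is query i's positive, and the other passages in the
    batch serve as negatives.
    """
    total = 0.0
    for i, row in enumerate(sim_matrix):
        log_denom = math.log(sum(math.exp(s) for s in row))
        total += log_denom - row[i]  # -log softmax at the positive
    return total / len(sim_matrix)

# Toy batch of 2: positives on the diagonal score higher,
# so the loss is small; swapping them makes it large.
good = dpr_loss([[5.0, 1.0], [0.5, 4.0]])
bad = dpr_loss([[1.0, 5.0], [4.0, 0.5]])
```

Minimizing this loss pushes query embeddings toward their positive passages and away from the in-batch negatives.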
The final system will highlight the most relevant parts of a document in response to a user query—taking into account both the query and customer metadata. These relevant sections will be visually marked using bounding boxes within the documents.
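One way to represent the output of that last step is a small record linking each retrieved span back to a location on a PDF page. The schema below is a sketch of what we have in mind, not a finalized interface; the coordinate convention (PDF points, origin at the bottom-left) is an assumption that will depend on the extraction library we settle on:

```python
from dataclasses import dataclass

@dataclass
class Highlight:
    """A retrieved span mapped back to its location in the source PDF."""
    doc_id: str
    page: int
    bbox: tuple[float, float, float, float]  # (x0, y0, x1, y1) in PDF points

def to_highlights(retrieved: list[dict]) -> list[Highlight]:
    """Convert retriever output (illustrative schema) into drawable boxes."""
    return [
        Highlight(r["doc_id"], r["page"], tuple(r["bbox"]))
        for r in retrieved
    ]

boxes = to_highlights([
    {"doc_id": "policy_A", "page": 3, "bbox": [72.0, 540.0, 520.0, 600.0]},
])
```

A rendering layer would then draw each `bbox` on its page so the consultant sees the relevant clauses directly in the original document.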