[Insurance Retrieval] PDF Parsing

All insurance policy documents are provided as PDF files, so we must convert them to text before feeding them to an LLM (e.g., ChatGPT) to get structured output.

We tested several PDF parser packages. I was in charge of the experiments with PDFPlumber, an open-source package.

(You can find the code in my GitHub repository: https://github.com/HanKyeol0/Information-Retrieval-for-Insurance-Documents)

PDFPlumber provides methods for extracting the different components of a PDF file. Insurance documents generally contain text, images, and tables, and each extracted component carries attributes such as the x/y coordinates of its bounding box.

Words, tables, and images can be detected on each page like this:

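import pdfplumber

# pdf_path and start_page are defined elsewhere in the script
with pdfplumber.open(pdf_path) as pdf:
    for i, page in enumerate(pdf.pages, start=start_page-1):
        texts = page.extract_words()  # word dicts with "text", "x0", "x1", "top", "bottom"
        tables = page.find_tables()   # Table objects, each exposing a .bbox
        images = page.images          # image dicts with position attributes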

To handle tables in a consistent and robust format, I convert each extracted table into HTML and wrap it in special tokens (<|TABLE START|> and <|TABLE END|>) so that the LLM can recognize where a table begins and ends:

def extract_tableHTML(page, table_bbox, row_y_threshold=5):
    """Rebuild the table inside table_bbox as HTML wrapped in table tokens."""
    # is_word_in_box and group_row_words_into_cells are helpers not shown in this post
    words = page.extract_words()
    table_words = [w for w in words if is_word_in_box(w, table_bbox)]

    rows = []
    current_row = []
    prev_y = None

    # Group words into rows: a word stays in the current row while its top
    # is within row_y_threshold of the previous word's top
    for word in sorted(table_words, key=lambda w: (w["top"], w["x0"])):
        if prev_y is None or abs(word["top"] - prev_y) <= row_y_threshold:
            current_row.append(word)
        else:
            rows.append(current_row)
            current_row = [word]
        prev_y = word["top"]
    if current_row:
        rows.append(current_row)

    # table HTML: one <tr> per row, one <td> per cell
    html = "<table>\n"
    for row in rows:
        html += "<tr>"
        cells = group_row_words_into_cells(row)
        for cell_text in cells:
            html += f"<td>{cell_text}</td>"
        html += "</tr>\n"
    html += "</table>"
    return "<|TABLE START|>\n" + html + "\n<|TABLE END|>"
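
The function above calls two helpers, is_word_in_box and group_row_words_into_cells, whose bodies are not shown in this post. The following is only a minimal sketch of what they might look like, not the code from the repository. It assumes that a word belongs to a table when its center point falls inside the table's bounding box (x0, top, x1, bottom), and that two neighboring words belong to different cells when the horizontal gap between them exceeds a threshold. The tolerance and cell_gap parameters and their default values are hypothetical.

```python
def is_word_in_box(word, bbox, tolerance=1.0):
    """Return True if the word's center lies inside bbox (assumed rule).

    word is a pdfplumber-style dict with "x0", "x1", "top", "bottom".
    bbox is (x0, top, x1, bottom). tolerance is a hypothetical margin.
    """
    x0, top, x1, bottom = bbox
    center_x = (word["x0"] + word["x1"]) / 2
    center_y = (word["top"] + word["bottom"]) / 2
    inside_x = x0 - tolerance <= center_x <= x1 + tolerance
    inside_y = top - tolerance <= center_y <= bottom + tolerance
    return inside_x and inside_y


def group_row_words_into_cells(row, cell_gap=10.0):
    """Join the words of one row into cell strings (assumed heuristic).

    A new cell starts whenever the horizontal gap between two neighboring
    words is larger than cell_gap, a hypothetical threshold in PDF points.
    """
    cells = []
    current = []
    prev_x1 = None
    for word in sorted(row, key=lambda w: w["x0"]):
        if prev_x1 is not None and word["x0"] - prev_x1 > cell_gap:
            cells.append(" ".join(current))
            current = []
        current.append(word["text"])
        prev_x1 = word["x1"]
    if current:
        cells.append(" ".join(current))
    return cells
```

A gap-based split is simple but can merge cells in tightly packed tables; using the cell boundaries that the parser detects would be more reliable where they are available.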

To shape the data into concise, structured units, appropriate chunking is necessary. I defined the following rules for chunking:

If the vertical gap between two lines is larger than a certain threshold, the latter line starts a new chunk. Otherwise, the two lines are treated as part of the same paragraph.
Tables and images are treated as independent chunks.
In some cases, there are not enough line breaks even between different paragraphs. To handle this, if a single chunk grows too long (longer than a certain threshold), the code looks for lines that start with “조”, which denotes an article in Korean, and starts a new chunk at each such line.
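
The text rules above can be sketched roughly as follows. This is a simplified illustration, not the code from the repository: it handles text lines only (tables and images would be added as their own chunks), it measures the gap as the distance from the bottom of one line to the top of the next, and it measures chunk length in characters. The function name chunk_lines and the parameters gap_threshold, max_chars, and article_prefix, along with their default values, are all assumptions.

```python
def chunk_lines(lines, gap_threshold=12.0, max_chars=1000, article_prefix="조"):
    """Split text lines into chunks using vertical gaps and article markers.

    lines is a list of dicts with "text", "top", and "bottom", already
    sorted from the top of the page to the bottom.
    """
    # Rule 1: a large vertical gap starts a new chunk
    groups = []
    current = []
    prev_bottom = None
    for line in lines:
        if prev_bottom is not None and line["top"] - prev_bottom > gap_threshold:
            groups.append(current)
            current = []
        current.append(line["text"])
        prev_bottom = line["bottom"]
    if current:
        groups.append(current)

    # Rule 3: an overly long chunk is split at lines starting with the article marker
    chunks = []
    for group in groups:
        if sum(len(text) for text in group) <= max_chars:
            chunks.append("\n".join(group))
            continue
        part = []
        for text in group:
            if part and text.lstrip().startswith(article_prefix):
                chunks.append("\n".join(part))
                part = []
            part.append(text)
        chunks.append("\n".join(part))
    return chunks
```

Both thresholds depend on the font size and layout of the documents, so they would need to be tuned on real pages, and article_prefix should match how the article headings are actually written.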

I won’t include the chunking code in this post since it is quite long; you can find it in the repository.

After all components have been extracted from a page, they are sorted by y-position again and saved to a .txt file.
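
As a rough illustration of this final step, the sketch below sorts a page's components by their top coordinate and writes them to a text file. It is not the repository's code: the function name save_page_components, the (top, text) tuple layout, and the blank line used as a chunk separator are assumptions made for the example.

```python
from pathlib import Path


def save_page_components(components, out_path):
    """Sort components top to bottom and write them to a text file.

    components is a list of (top, text) tuples, where top is the
    component's y-position on the page and text is its chunk string
    (a paragraph, or a table already converted to HTML).
    """
    ordered = sorted(components, key=lambda component: component[0])
    # Separate chunks with a blank line (an assumed convention)
    page_text = "\n\n".join(text for _, text in ordered)
    Path(out_path).write_text(page_text, encoding="utf-8")
    return page_text
```

Sorting on the top coordinate alone keeps a table in its original position between the paragraphs that surround it.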

In the end, I found that PDFPlumber had some minor drawbacks during the experiments, so OpenParser was eventually selected as the final PDF parser.
