Extracting Tabular Data from a PDF using Indexify

In this notebook, we're going to learn how we can extract transactional data from a PDF using Indexify. For that, we'll be using a sample PDF that contains transactional data from a Home Owners Association (HOA).

We will explore several ways to extract this data from the PDF with Indexify extractors into a structured format that can then be used in a RAG pipeline. This is a preview of the data that we will extract from the PDF.

Preview data

Setup

%pip install indexify-extractor-sdk indexify virtualenv

Trying out different extractors offered

Indexify offers several PDF and invoice extractors. Here are a few that worked really well for getting various fields from my HOA statement.

First, try these extractors locally to get a feel for them.

PDFExtractor & SchemaExtractor

First, we will try PDFExtractor with SchemaExtractor. By default, SchemaExtractor uses OpenAI and takes the Content produced by the previous extractor in the chain as the data for schema-driven JSON extraction; however, both the schema and the data can be overridden manually. It can extract all the values from the text in one shot.

Download the PDF extractor and Schema extractor:

!indexify-extractor download tensorlake/pdf-extractor
!indexify-extractor download tensorlake/schema

Load the PDF extractor and the file:

import requests
from indexify_extractor_sdk import load_extractor, Content

# Download the sample HOA statement
resp = requests.get("https://pub-226479de18b2493f96b64c6674705dd8.r2.dev/Statement_HOA.pdf")
resp.raise_for_status()  # stop early if the download failed

with open("Statement_HOA.pdf", "wb") as f:
    f.write(resp.content)

# Load the PDF extractor and wrap the downloaded file as Content
pdfextractor, pdfconfig_cls = load_extractor("indexify_extractors.pdf-extractor.pdf_extractor:PDFExtractor")
content = Content.from_file("Statement_HOA.pdf")

Extract the data and find the content with content_type "text/plain":

pdf_result = pdfextractor.extract(content)
text_content = next(content.data.decode("utf-8") for content in pdf_result if content.content_type == "text/plain")
text_content
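
Note that next(...) without a default raises StopIteration when no item has the requested content type. The following is a minimal sketch of a more forgiving lookup. FakeContent and first_text are hypothetical names introduced here for illustration; the only assumption about the SDK is that each result item exposes content_type and data, as the cell above already relies on.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class FakeContent:
    """Hypothetical stand-in for an extractor result item (not the SDK class)."""
    content_type: str
    data: bytes


def first_text(items: Iterable, content_type: str = "text/plain") -> Optional[str]:
    """Return the decoded data of the first item with this content type, or None."""
    match = next((item for item in items if item.content_type == content_type), None)
    return match.data.decode("utf-8") if match is not None else None


# Made-up items for illustration only.
sample = [
    FakeContent("application/json", b"{}"),
    FakeContent("text/plain", b"hello"),
]
print(first_text(sample))
print(first_text(sample, "image/png"))
```

With the real result, the call would presumably look like first_text(pdf_result), followed by a check for None before continuing.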

Define the schema as a Pydantic model, then load the Schema extractor and use it to extract the JSON:

from pydantic import BaseModel

class Invoice(BaseModel):
    invoice_number: str
    date: str
    account_number: str
    owner: str
    address: str
    last_month_balance: str
    current_amount_due: str
    registration_key: str
    due_date: str

schema = Invoice.model_json_schema()

schemaextractor, schemaconfig_cls = load_extractor("indexify_extractors.schema.schema_extractor:SchemaExtractor")

config = schemaconfig_cls(service="openai", schema=schema)
result = schemaextractor.extract(Content.from_text(text_content), config)
llm_content = next(content.data.decode("utf-8") for content in result if content.content_type == "text/plain")
llm_content
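
Since the output comes from an LLM, it may be worth checking it before using it downstream. The sketch below assumes the extractor returns a JSON object as a string whose keys match the Invoice field names; that is an assumption, not something the notebook confirms. parse_invoice and EXPECTED_FIELDS are hypothetical names, and the example payload is a placeholder rather than real extractor output.

```python
import json

# Field names copied from the Invoice model above.
EXPECTED_FIELDS = [
    "invoice_number", "date", "account_number", "owner", "address",
    "last_month_balance", "current_amount_due", "registration_key", "due_date",
]


def parse_invoice(raw: str) -> dict:
    """Parse a JSON string and fail loudly if any expected field is absent."""
    record = json.loads(raw)
    if not isinstance(record, dict):
        raise ValueError("expected a JSON object")
    missing = [name for name in EXPECTED_FIELDS if name not in record]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    return record


# Placeholder payload with empty values, for illustration only.
example = json.dumps({name: "" for name in EXPECTED_FIELDS})
record = parse_invoice(example)
print(sorted(record))
```

In the notebook, parse_invoice(llm_content) would be the natural call, assuming llm_content holds plain JSON with no surrounding text.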

PDFExtractor & LLMExtractor

Next, for more control, we will try PDFExtractor with LLMExtractor. PDFExtractor extracts both the text and the tables in one shot and passes them to the chained LLMExtractor, which can then be used for question answering.

Download the LLM extractor:

!indexify-extractor download tensorlake/llm

Define the question and the extraction prompt, then load the LLM extractor and use it to extract the JSON:

query = "by when do I have to make the payment and what amount? also what was the EV charge amount?"
prompt = """Extract information according to this schema and return json in this format {"Invoice No.": "", "Date": "", "Account Number": "", "Owner": "", "Address": "", "Registration Key": "", "Last Month Balance": "", "Current Amount Due": "", "Due Date": ""}:
Axis\nSTATEMENTInvoice No. "Invoice No."\nDate: 4/19/2024\nAccount Number:\nOwner:\nProperty:"Account Number"\n"Owner"\n"Property"\n"Owner"\n"Property"\n"Address"SUMMARY OF ACCOUNT\nLast Month Balance:\nCurrent Amount Due:"Last Month Balance"\n"Current Amount Due"\nAccount details on back.\nProfessionally\nprepared by:\nSTATEMENT MESSAGE\nWelcome to Action Property Management! We are excited to be\nserving your community. Our Community Care team is more than\nhappy to assist you with any billing questions you may have. For\ncontact options, please visit www.actionlife.com/contact. Visit the\nAction Property Management web page at: www.actionlife.com.BILLING QUESTIONS\nScan the QR code to\ncontact our\nCommunity Care\nteam.\nactionlife.com/contact\nCommunityCare@actionlife.com\nRegister your Resident\nPortal account now!\nRegistration Key/ID:\n"Registration Key"\nresident.actionlife.com\nTo learn more about issues facing HOAs, say "Hey Siri, search the web for The Uncommon Area by Action Property Management."\nMake checks payable to:\nAxisAccount Number: "Account Number"\nOwner: "Owner"\nPLEASE REMIT PAYMENT TO:\n** AUTOPAY SCHEDULED **\n** NO REMITTANCE NECESSARY **CURRENT AMOUNT DUE\n"Current Amount Due"\nDUE DATE\n"Due Date"\n0049 00008330 0000922000203826 7 00065303 00000000 9"""

llmextractor, llmconfig_cls = load_extractor("indexify_extractors.llm.llm_extractor:LLMExtractor")

config = llmconfig_cls(service="openai", prompt=prompt)
result = llmextractor.extract(Content.from_text(text_content), config)
llm_content = next(content.data.decode("utf-8") for content in result if content.content_type == "text/plain")
llm_content

Table Extraction

The document also contains tables, so let's pick the table data (content_type "application/json") out of the PDFExtractor result and load it into a DataFrame:

import json
import pandas as pd

json_content = next(content.data for content in pdf_result if content.content_type == "application/json")

# Convert the JSON string to a Python dictionary
data_dict = json.loads(json_content)

# Convert the dictionary to a pandas DataFrame
df = pd.DataFrame.from_dict(data_dict, orient="index")

# Print the DataFrame
print(df)

Question answering with extracted content:

config = llmconfig_cls(service="openai", prompt=str(data_dict) + str(llm_content))
result = llmextractor.extract(Content.from_text(query), config)
llm_content = next(content.data.decode("utf-8") for content in result if content.content_type == "text/plain")
llm_content
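
The cell above builds the prompt by concatenating str(data_dict) and llm_content, which produces a Python repr with no separation between the two parts. As a possible alternative, the sketch below labels each part and serializes the table data as JSON. build_qa_prompt is a hypothetical helper, the instruction wording is an assumption, and the sample values are made up for illustration.

```python
import json


def build_qa_prompt(table_data: dict, fields_json: str) -> str:
    """Combine table data and extracted fields into one labelled context prompt."""
    return (
        "Answer the question using only the statement data below.\n\n"
        "Table data (JSON):\n"
        f"{json.dumps(table_data, indent=2)}\n\n"
        "Extracted fields (JSON):\n"
        f"{fields_json}\n"
    )


# Made-up values for illustration only, not taken from the sample PDF.
prompt_text = build_qa_prompt(
    {"0": {"Description": "Example charge", "Amount": "0.00"}},
    '{"Due Date": ""}',
)
print(prompt_text)
```

The result could then be passed as prompt=build_qa_prompt(data_dict, llm_content) when creating the LLM extractor config.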

LayoutLMDocumentQA

Next, we try LayoutLMDocumentQA. It can't extract all the values in one shot, but it can answer one question at a time.

Download the extractor:

!indexify-extractor download tensorlake/layoutlm-document-qa-extractor

Load the extractor (the file was already loaded into content earlier):

from indexify_extractor_sdk import load_extractor, Content
extractor, config_cls = load_extractor("indexify_extractors.layoutlm_document_qa.layoutlm_document_qa:LayoutLMDocumentQA")

Ask the extractor a question:

config = config_cls(query="What's the due date?")
result = extractor.extract(content, config)
result
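
Because this extractor handles one question per call, collecting several fields means looping over questions. Here is a minimal sketch of that loop. ask_all is a hypothetical helper, and fake_ask is a stand-in so the example runs without the extractor installed.

```python
from typing import Callable, Dict, Iterable


def ask_all(ask: Callable[[str], object], questions: Iterable[str]) -> Dict[str, object]:
    """Call ask once per question and collect the answers keyed by question."""
    return {question: ask(question) for question in questions}


def fake_ask(question: str) -> str:
    """Stand-in for the real extractor call; it just echoes the question."""
    return f"(answer to: {question})"


answers = ask_all(fake_ask, ["What's the due date?", "What's the outstanding amount?"])
for question, answer in answers.items():
    print(question, "->", answer)
```

With the objects loaded above, the callable could be lambda q: extractor.extract(content, config_cls(query=q)), mirroring the single-question cell.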

Start the Indexify Server

To run these extractions continuously:

Download the Indexify Server
Start it in development mode on your laptop
Create extraction policies with questions that extract the fields from the PDF
Finally, get all the extracted values for a document by making an API call

Download the Server

!curl https://getindexify.ai | sh

Terminal 1 (start the server in development mode):

./indexify server -d

Create the Extraction Graph

# ExtractionGraph is needed below to build the graph from YAML
from indexify import IndexifyClient, ExtractionGraph
client = IndexifyClient()

extraction_graph_spec = """
name: "pdf"
extraction_policies:
  - extractor: "tensorlake/layoutlm-document-qa-extractor"
    name: "hoa-fees-due-date"
    input_params:
      query: "What's the due date?"

  - extractor: "tensorlake/layoutlm-document-qa-extractor"
    name: "hoa-fees-outstanding"
    input_params:
      query: "What's the outstanding amount?"
"""

extraction_graph = ExtractionGraph.from_yaml(extraction_graph_spec)
client.create_extraction_graph(extraction_graph)
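
Each question needs its own extraction policy, so the YAML grows with every field. The sketch below generates a spec of the same shape from a mapping of policy name to question. build_graph_spec is a hypothetical helper; it uses json.dumps to quote the strings, relying on JSON strings being valid YAML double-quoted scalars.

```python
import json
from typing import Dict

EXTRACTOR = "tensorlake/layoutlm-document-qa-extractor"


def build_graph_spec(graph_name: str, policies: Dict[str, str]) -> str:
    """Render an extraction graph spec with one policy per (name, query) pair."""
    lines = [f"name: {json.dumps(graph_name)}", "extraction_policies:"]
    for policy_name, query in policies.items():
        lines += [
            f"  - extractor: {json.dumps(EXTRACTOR)}",
            f"    name: {json.dumps(policy_name)}",
            "    input_params:",
            f"      query: {json.dumps(query)}",
        ]
    return "\n".join(lines) + "\n"


spec = build_graph_spec("pdf", {
    "hoa-fees-due-date": "What's the due date?",
    "hoa-fees-outstanding": "What's the outstanding amount?",
})
print(spec)
```

The generated string could then be handed to ExtractionGraph.from_yaml(spec) in place of the hand-written spec.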

Upload Files

content_id = client.upload_file("pdf", "Statement_HOA.pdf")
client.wait_for_extraction(content_id)
content_id

client.get_structured_data(content_id)

client.sql_query("select * from ingestion;")