Extracting Tabular Data from a PDF using Indexify¶
In this notebook, we're going to learn how we can extract transactional data from a PDF using Indexify. For that, we'll be using a sample PDF that contains transactional data from a Home Owners Association (HOA).
We will explore several ways to extract this data from the PDF with Indexify extractors into a structured format that can then be used in a RAG pipeline. This is a preview of the data that we will extract from the PDF.
Setup¶
%pip install indexify-extractor-sdk indexify virtualenv
Trying out different extractors offered¶
Indexify offers several PDF and invoice extractors. Here are a few that worked really well for getting various fields from my HOA receipt.
First, try these extractors locally.
PDFExtractor & SchemaExtractor¶
First, we will try PDFExtractor with SchemaExtractor. By default, SchemaExtractor uses OpenAI and treats the content produced by the chained extractor as the data from which to extract JSON matching the schema; both the schema and the data can be overridden manually. It can extract all the values from the text in one shot.
Download the PDF extractor and Schema extractor:
!indexify-extractor download tensorlake/pdf-extractor
!indexify-extractor download tensorlake/schema
Load the PDF extractor and the file:
import requests
req = requests.get("https://pub-226479de18b2493f96b64c6674705dd8.r2.dev/Statement_HOA.pdf")
with open("Statement_HOA.pdf", "wb") as f:
    f.write(req.content)
from indexify_extractor_sdk import load_extractor, Content
pdfextractor, pdfconfig_cls = load_extractor("indexify_extractors.pdf-extractor.pdf_extractor:PDFExtractor")
content = Content.from_file("Statement_HOA.pdf")
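Before handing the download to the extractor, it can help to confirm that the response really is a PDF and not, say, an HTML error page. The sketch below relies only on the fact that a well-formed PDF begins with the signature `%PDF-`; the helper names `looks_like_pdf` and `save_pdf` are hypothetical and not part of Indexify.

```python
from pathlib import Path

# A well-formed PDF file begins with this signature.
PDF_MAGIC = b"%PDF-"


def looks_like_pdf(data: bytes) -> bool:
    """Return True if the bytes begin with the PDF file signature."""
    return data.startswith(PDF_MAGIC)


def save_pdf(data: bytes, path: str) -> Path:
    """Write the bytes to disk, refusing anything that is not a PDF."""
    if not looks_like_pdf(data):
        raise ValueError("downloaded content does not look like a PDF")
    target = Path(path)
    target.write_bytes(data)
    return target
```

In the notebook this could replace the plain `f.write(...)` step, for example `save_pdf(req.content, "Statement_HOA.pdf")`.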
Extract the data and find the content with content_type "text/plain":
pdf_result = pdfextractor.extract(content)
text_content = next(content.data.decode("utf-8") for content in pdf_result if content.content_type == "text/plain")
text_content
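Note that `next(...)` over a generator raises `StopIteration` when nothing matches, which happens if the extractor returns no "text/plain" part. A minimal sketch of a safer lookup follows; `FakeContent` is a hypothetical stand-in that only mimics the two attributes used in this notebook (`content_type` and `data`), and `first_of_type` is an illustrative helper, not an SDK function.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class FakeContent:
    """Hypothetical stand-in exposing the two attributes the notebook reads."""
    content_type: str
    data: bytes


def first_of_type(items: Iterable[FakeContent], content_type: str) -> Optional[bytes]:
    """Return the payload of the first item with the given type, or None."""
    # The second argument to next() is the default used when nothing matches.
    return next((c.data for c in items if c.content_type == content_type), None)


results = [
    FakeContent("application/json", b'{"0": {}}'),
    FakeContent("text/plain", b"Invoice text"),
]
text = first_of_type(results, "text/plain")    # payload of the text part
missing = first_of_type(results, "image/png")  # None instead of an exception
```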
Load the Schema extractor and use it to extract the JSON:
from pydantic import BaseModel
class Invoice(BaseModel):
    invoice_number: str
    date: str
    account_number: str
    owner: str
    address: str
    last_month_balance: str
    current_amount_due: str
    registration_key: str
    due_date: str
schema = Invoice.model_json_schema()
schemaextractor, schemaconfig_cls = load_extractor("indexify_extractors.schema.schema_extractor:SchemaExtractor")
config = schemaconfig_cls(service="openai", schema=schema)
result = schemaextractor.extract(Content.from_text(text_content), config)
llm_content = next(content.data.decode("utf-8") for content in result if content.content_type == "text/plain")
llm_content
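The extractor returns its answer as text, so it is worth parsing it and checking that every schema field is present before using it downstream. The following is a sketch using only the standard library; it assumes the output keys match the `Invoice` field names and that the model may wrap the JSON in a Markdown code fence. `parse_invoice` is a hypothetical helper.

```python
import json

# Field names mirror the Invoice model defined in the notebook.
REQUIRED_FIELDS = [
    "invoice_number", "date", "account_number", "owner", "address",
    "last_month_balance", "current_amount_due", "registration_key", "due_date",
]

FENCE = "`" * 3  # Markdown code-fence marker


def parse_invoice(raw: str) -> dict:
    """Parse model output into a dict and fail loudly on missing fields."""
    text = raw.strip()
    # Some models wrap JSON in a Markdown code fence; drop it if present.
    if text.startswith(FENCE):
        text = text.strip("`")
        if text.startswith("json"):
            text = text[len("json"):]
    record = json.loads(text)
    missing = [name for name in REQUIRED_FIELDS if name not in record]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    return record
```

In the notebook, this would be called as `parse_invoice(llm_content)`.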
PDFExtractor & LLMExtractor¶
Next, for more control, we will try PDFExtractor with LLMExtractor. PDFExtractor extracts the values from both text and tables in one shot and passes them to the chained LLMExtractor, which can also be used for question answering.
Download the LLM extractor:
!indexify-extractor download tensorlake/llm
Load the LLM extractor and use it to extract the JSON:
query = "by when do I have to make the payment and what amount? also what was the EV charge amount?"
prompt = """Extract information according to this schema and return json in this format {"Invoice No.": "", "Date": "", "Account Number": "", "Owner": "", "Address": "", "Registration Key": "", "Last Month Balance": "", "Current Amount Due": "", "Due Date": ""}:
Axis\nSTATEMENTInvoice No. "Invoice No."\nDate: 4/19/2024\nAccount Number:\nOwner:\nProperty:"Account Number"\n"Owner"\n"Property"\n"Owner"\n"Property"\n"Address"SUMMARY OF ACCOUNT\nLast Month Balance:\nCurrent Amount Due:"Last Month Balance"\n"Current Amount Due"\nAccount details on back.\nProfessionally\nprepared by:\nSTATEMENT MESSAGE\nWelcome to Action Property Management! We are excited to be\nserving your community. Our Community Care team is more than\nhappy to assist you with any billing questions you may have. For\ncontact options, please visit www.actionlife.com/contact. Visit the\nAction Property Management web page at: www.actionlife.com.BILLING QUESTIONS\nScan the QR code to\ncontact our\nCommunity Care\nteam.\nactionlife.com/contact\nCommunityCare@actionlife.com\nRegister your Resident\nPortal account now!\nRegistration Key/ID:\n"Registration Key"\nresident.actionlife.com\nTo learn more about issues facing HOAs, say "Hey Siri, search the web for The Uncommon Area by Action Property Management."\nMake checks payable to:\nAxisAccount Number: "Account Number"\nOwner: "Owner"\nPLEASE REMIT PAYMENT TO:\n** AUTOPAY SCHEDULED **\n** NO REMITTANCE NECESSARY **CURRENT AMOUNT DUE\n"Current Amount Due"\nDUE DATE\n"Due Date"\n0049 00008330 0000922000203826 7 00065303 00000000 9"""
llmextractor, llmconfig_cls = load_extractor("indexify_extractors.llm.llm_extractor:LLMExtractor")
config = llmconfig_cls(service="openai", prompt=prompt)
result = llmextractor.extract(Content.from_text(text_content), config)
llm_content = next(content.data.decode("utf-8") for content in result if content.content_type == "text/plain")
llm_content
Table Extraction¶
The document also contains tables, so let's pick out the table data, which has content_type "application/json", and load it into a DataFrame:
import json
import pandas as pd
json_content = next(content.data for content in pdf_result if content.content_type == "application/json")
# Convert the JSON string to a Python dictionary
data_dict = json.loads(json_content)
# Convert the dictionary to a pandas DataFrame
df = pd.DataFrame.from_dict(data_dict, orient="index")
# Print the DataFrame
print(df)
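Using `orient="index"` means the outer keys of the dictionary become row labels and each inner dictionary maps column names to values. The sketch below shows the same reshaping in plain Python, which can be handy when pandas is not needed. The sample rows are made up for illustration and are not taken from the PDF; `rows_from_index_dict` is a hypothetical helper.

```python
def rows_from_index_dict(table: dict) -> list[dict]:
    """Flatten {row_label: {column: value}} into a list of row dicts."""
    rows = []
    for label, columns in table.items():
        row = {"row": label}  # keep the original row label
        row.update(columns)
        rows.append(row)
    return rows


# Made-up sample in the shape that orient="index" expects; not data from the PDF.
sample = {
    "0": {"description": "Assessment", "amount": "100.00"},
    "1": {"description": "Late fee", "amount": "10.00"},
}
rows = rows_from_index_dict(sample)
```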
Question answering with extracted content:
config = llmconfig_cls(service="openai", prompt=str(data_dict) + str(llm_content))
result = llmextractor.extract(Content.from_text(query), config)
llm_content = next(content.data.decode("utf-8") for content in result if content.content_type == "text/plain")
llm_content
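Concatenating `str(data_dict)` and `str(llm_content)` works, but the model sees one undifferentiated string. One possible refinement, sketched here under the assumption that labelled sections help the model, is to serialize the table as JSON and mark each part. `build_context_prompt` is a hypothetical helper, not part of Indexify.

```python
import json


def build_context_prompt(table: dict, fields: str) -> str:
    """Combine table rows and extracted fields into one labelled prompt."""
    return (
        "Answer the question using only the statement data below.\n\n"
        "Table rows (JSON):\n"
        f"{json.dumps(table, indent=2)}\n\n"
        "Extracted fields:\n"
        f"{fields}"
    )
```

It would slot into the cell above as `llmconfig_cls(service="openai", prompt=build_context_prompt(data_dict, llm_content))`.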
LayoutLMDocumentQA¶
Next, we try LayoutLMDocumentQA. It can't extract all the values in one shot, but it can answer one question at a time.
Download the extractor:
!indexify-extractor download tensorlake/layoutlm-document-qa-extractor
Load the extractor and the file:
from indexify_extractor_sdk import load_extractor, Content
extractor, config_cls = load_extractor("indexify_extractors.layoutlm_document_qa.layoutlm_document_qa:LayoutLMDocumentQA")
Ask the extractor a question:
config = config_cls(query="What's the due date?")
result = extractor.extract(content, config)
result
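Because this extractor answers one question per call, collecting several fields means looping over the questions. The sketch below shows that pattern with a stand-in function in place of the real extractor, so it runs without downloading the model; `ask_all` and `fake_ask` are hypothetical names.

```python
from typing import Callable, Dict


def ask_all(ask: Callable[[str], str], questions: Dict[str, str]) -> Dict[str, str]:
    """Call `ask` once per question and collect answers keyed by field name."""
    return {field: ask(question) for field, question in questions.items()}


# Stand-in for the real extractor call so the sketch runs on its own.
def fake_ask(question: str) -> str:
    return f"answer to: {question}"


questions = {
    "due_date": "What's the due date?",
    "outstanding": "What's the outstanding amount?",
}
answers = ask_all(fake_ask, questions)
```

In the notebook, `ask` would wrap `extractor.extract(content, config_cls(query=question))`; how to pull the answer text out of the returned result depends on the extractor's output and is not shown here.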
Start the Indexify Server¶
To make this extractor run continuously on incoming documents:
- Download the Indexify Server
- Start it in development mode on your laptop
- Create extraction policies with questions that extract the fields from the PDF
- Finally, get all the extracted values for a document by making an API call
Download the Server¶
!curl https://getindexify.ai | sh
Terminal 1:
./indexify server -d
Create the Extraction Graph¶
from indexify import IndexifyClient, ExtractionGraph
client = IndexifyClient()
extraction_graph_spec = """
name: "pdf"
extraction_policies:
  - extractor: "tensorlake/layoutlm-document-qa-extractor"
    name: "hoa-fees-due-date"
    input_params:
      query: "What's the due date?"
  - extractor: "tensorlake/layoutlm-document-qa-extractor"
    name: "hoa-fees-outstanding"
    input_params:
      query: "What's the outstanding amount?"
"""
extraction_graph = ExtractionGraph.from_yaml(extraction_graph_spec)
client.create_extraction_graph(extraction_graph)
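The policies in this graph differ only in their name and question, so with more fields it may be easier to generate the spec than to write it by hand. The following is a sketch that renders the same YAML layout from a dictionary; `build_graph_spec` is a hypothetical helper, and it assumes the names and questions contain no double quotes (they would need escaping).

```python
def build_graph_spec(graph_name: str, extractor: str, questions: dict) -> str:
    """Render an extraction graph YAML with one policy per question."""
    lines = [f'name: "{graph_name}"', "extraction_policies:"]
    for policy_name, query in questions.items():
        lines += [
            f'  - extractor: "{extractor}"',
            f'    name: "{policy_name}"',
            "    input_params:",
            f'      query: "{query}"',
        ]
    return "\n".join(lines) + "\n"


spec = build_graph_spec(
    "pdf",
    "tensorlake/layoutlm-document-qa-extractor",
    {
        "hoa-fees-due-date": "What's the due date?",
        "hoa-fees-outstanding": "What's the outstanding amount?",
    },
)
```

The resulting string can be passed to `ExtractionGraph.from_yaml(spec)` in the same way as the hand-written spec.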
Upload Files¶
content_id = client.upload_file("pdf", "Statement_HOA.pdf")
client.wait_for_extraction(content_id)
content_id
client.get_structured_data(content_id)
client.sql_query("select * from ingestion;")