# Install required packages (run once)
# %pip install langchain langchain-ollama langchain-google-genai \
# pydantic pandas pymupdf pillow python-dotenv
# uv add langchain langchain-ollama langchain-google-genai \
# pydantic pandas pymupdf pillow python-dotenv

Structured Output Extraction with Pydantic and LangChain
Introduction
Large Language Models (LLMs) are powerful at understanding unstructured content — text, images, PDFs — but their raw output is free-form text that is hard to use programmatically. The pattern covered in this notebook solves that problem:
- Define a Pydantic schema — a typed Python class that describes exactly what fields you want back.
- Bind the schema to an LLM via LangChain’s `with_structured_output()`.
- Batch process files (`.jpg` images or `.pdf` documents) and collect every result into a tidy pandas `DataFrame`.
I demonstrate everything twice: once with local open-source models via Ollama (llava or llama3.2-vision for images; llama3.2 for text), and once with Google Gemini 2.5 Pro (a proprietary, cloud-based model), using a hypothetical invoice as the running example.
Setup & Installation
import os
import base64
import json
from pathlib import Path
from typing import Optional
import pandas as pd
from pydantic import BaseModel, Field, field_validator, model_validator
from PIL import Image
import fitz # PyMuPDF
# LangChain imports
from langchain_core.messages import HumanMessage, SystemMessage
from langchain_ollama import ChatOllama
from langchain_google_genai import ChatGoogleGenerativeAI
# Load API keys from a .env file (GOOGLE_API_KEY=...)
from dotenv import load_dotenv
load_dotenv()

You should have a .env file in the root directory of your project. It should contain the API keys you are using, e.g. GOOGLE_API_KEY=<api key here>, as mentioned in the setup. Add this .env file to your .gitignore so you don’t accidentally commit it to git.
Structured Output in Principle
1.1 Why Pydantic?
LLMs return strings. When you need structured data (e.g. for a database or downstream computation), you have two bad options without Pydantic:
- Parse with regex — brittle, breaks on any formatting variation.
- Trust the model to return valid JSON — it often adds markdown fences, explanatory text, or simply hallucinates keys.
Pydantic gives you a contract: a strongly-typed schema that validates, coerces, and rejects bad data with clear error messages.
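A minimal illustration of that contract, using a toy schema (not part of the invoice pipeline):

```python
from pydantic import BaseModel, ValidationError

class Price(BaseModel):
    amount: float
    currency: str

# Coercion: the string "12.50" is accepted and converted to a float
ok = Price(amount="12.50", currency="EUR")
print(ok.amount)  # 12.5

# Rejection: a non-numeric amount fails with a clear, typed error
try:
    Price(amount="twelve", currency="EUR")
except ValidationError as exc:
    print("rejected:", exc.error_count(), "error(s)")
```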
1.2 Anatomy of a Good Pydantic Schema
A good schema for LLM extraction follows these principles:
| Principle | Why it matters |
|---|---|
| Use `Field(description=...)` on every field | The description is injected into the JSON schema that the model receives. Clear descriptions = better extraction. |
| Use `Optional[T]` for fields that may be absent | LLMs sometimes cannot find a value; forcing the field causes hallucination. |
| Add `field_validator` for domain constraints | Catches out-of-range or nonsensical values before they reach your code. |
| Nest sub-models for related groups of fields | Keeps the top-level schema readable and makes partial extraction easier. |
| Prefer `str` over `Enum` for messy real-world data | Enums cause failures when the model uses a synonym. |
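The last row of the table is easy to demonstrate with a toy schema (the class names here are hypothetical):

```python
from enum import Enum
from pydantic import BaseModel, ValidationError

class PaymentEnum(str, Enum):
    cash = "cash"
    card = "card"

class StrictReceipt(BaseModel):
    payment: PaymentEnum  # rejects anything outside the enum

class LooseReceipt(BaseModel):
    payment: str  # accepts synonyms; normalise later in a validator

# A model answering "credit card" breaks the Enum...
try:
    StrictReceipt(payment="credit card")
except ValidationError:
    print("Enum rejected the synonym")

# ...while the plain str field keeps the data for later cleanup
print(LooseReceipt(payment="credit card").payment)
```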
1.3 Example: Invoice extraction schema
The goal is simple: you want an AI model to read an invoice and hand you back organised, reliable data — not a wall of text. This code describes the exact shape of that data.
Think of a Pydantic BaseModel as a form with labelled fields. Instead of getting a free-form answer from the AI, you’re saying: “Fill in this form, with these fields, in these data types.” Pydantic then checks that the form was filled in correctly before the data ever reaches your code.
The following is example code followed by an elaboration:
# Example: Invoice extraction schema
class LineItem(BaseModel):
"""A single line item on an invoice."""
description: str = Field(description="Short description of the product or service")
quantity: float = Field(description="Number of units")
unit_price: float = Field(description="Price per unit in the invoice currency")
total: float = Field(description="Line total = quantity × unit_price")
class InvoiceData(BaseModel):
"""Structured data extracted from an invoice image or PDF page."""
invoice_number: Optional[str] = Field(
default=None,
description="Invoice or reference number, e.g. 'INV-2024-0042'",
)
invoice_date: Optional[str] = Field(
default=None,
description="Issue date in ISO 8601 format (YYYY-MM-DD) if determinable",
)
vendor_name: str = Field(description="Name of the company that issued the invoice")
vendor_address: Optional[str] = Field(
default=None, description="Full address of the vendor"
)
customer_name: Optional[str] = Field(
default=None, description="Name of the customer / bill-to party"
)
line_items: list[LineItem] = Field(
default_factory=list,
description="All individual line items found on the invoice",
)
subtotal: Optional[float] = Field(
default=None, description="Amount before tax"
)
tax_amount: Optional[float] = Field(
default=None, description="Total tax or VAT amount"
)
total_amount: float = Field(
description="Grand total amount due (including tax)"
)
currency: str = Field(
default="USD",
description="Three-letter ISO 4217 currency code, e.g. 'EUR' or 'USD'",
)
# ── Validators ──────────────────────────────────────────────────────────
@field_validator("currency")
@classmethod
def normalise_currency(cls, v: str) -> str:
"""Uppercase and strip whitespace from the currency code."""
return v.strip().upper()
@field_validator("total_amount", "subtotal", "tax_amount", mode="before")
@classmethod
def non_negative(cls, v):
"""Monetary amounts must be non-negative."""
if v is not None and float(v) < 0:
raise ValueError("Monetary amount cannot be negative")
return v
@model_validator(mode="after")
def check_totals(self) -> "InvoiceData":
"""Warn if subtotal + tax ≠ total (allow 1 % tolerance for rounding)."""
if self.subtotal is not None and self.tax_amount is not None:
expected = self.subtotal + self.tax_amount
if abs(expected - self.total_amount) / max(self.total_amount, 1) > 0.01:
# Don't raise — just log; the model may have seen a discount line
print(
f"Totals may not add up: {self.subtotal} + {self.tax_amount}"
f" ≠ {self.total_amount}"
)
        return self

The part of the code underneath:
class LineItem(BaseModel):
description: str
quantity: float
unit_price: float
    total: float

…represents a single row on an invoice — one product or service. An invoice for office supplies might have three line items: pens, paper, and a stapler. Each one has a description, how many were bought, what each cost, and the row total.
The Field(description="...") part is instructions for the AI, not for Python. When LangChain sends this schema to the model, those descriptions travel with it, so the AI knows what each field means.
The InvoiceData class is the full invoice. Most fields are straightforward — vendor name, date, total — but three things are worth highlighting:
`Optional[str]` vs `str`: some fields have `Optional[...]` and a `default=None`; others don’t.
- `vendor_name: str` — required. Every invoice has a sender. If the AI can’t find it, that’s an error.
- `invoice_number: Optional[str] = None` — optional. Some invoices genuinely don’t have a number. Rather than forcing the AI to make something up, you tell it “if you can’t find this, just leave it blank.”
This distinction matters a lot in practice. Forcing required fields on uncertain data causes the AI to hallucinate values. Making genuinely required fields non-optional means you catch real problems early.
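A standalone sketch of this required-vs-optional behaviour, using a hypothetical mini-schema rather than the full `InvoiceData`:

```python
from typing import Optional
from pydantic import BaseModel, ValidationError

class MiniInvoice(BaseModel):
    vendor_name: str                      # required: extraction fails without it
    invoice_number: Optional[str] = None  # optional: cleanly None when absent

# Missing the optional field is fine
inv = MiniInvoice(vendor_name="Acme B.V.")
print(inv.invoice_number)  # None

# Missing the required field fails loudly instead of silently
try:
    MiniInvoice(invoice_number="INV-1")
except ValidationError as exc:
    print(len(exc.errors()), "validation error")
```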
`line_items: list[LineItem]`: this is where the two forms connect. The main invoice form contains a list of line item forms — potentially zero, one, or many. Pydantic validates each entry in that list against the `LineItem` schema automatically. This nesting lets you model the real structure of a document rather than flattening everything.

Sanity checking: once the AI fills in the form, Pydantic runs your validators before handing the data to you. There are three here:

- `normalise_currency` — a simple cleanup. If the AI returns `" eur "` or `"Eur"`, this strips the whitespace and uppercases it to `"EUR"`. You’d rather fix small formatting issues than reject otherwise good data.
- `non_negative` — a domain rule. Monetary amounts on an invoice can’t be negative. If the AI returns `-150.0` for a total (which can happen with confused models), this raises an error immediately rather than silently putting a nonsensical value in your DataFrame.
- `check_totals` — a cross-field consistency check. It looks at multiple fields together: does `subtotal + tax_amount` roughly equal `total_amount`? If not, it prints a warning. Notice it doesn’t raise an error here — the comment explains why: a discount line could legitimately cause a mismatch. So it flags it for a human to review rather than rejecting the whole extraction.
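Nested validation plus a cleanup validator, in a compact standalone sketch (the `Order`/`Item` models are hypothetical):

```python
from pydantic import BaseModel, Field, field_validator

class Item(BaseModel):
    description: str
    quantity: float

class Order(BaseModel):
    currency: str
    items: list[Item] = Field(default_factory=list)

    @field_validator("currency")
    @classmethod
    def normalise(cls, v: str) -> str:
        # Same idea as normalise_currency above: fix formatting, don't reject
        return v.strip().upper()

# Plain dicts (what the LLM effectively returns) are validated and
# coerced into typed Item instances automatically.
order = Order(currency="  eur ", items=[{"description": "Pens", "quantity": "3"}])
print(order.currency)           # EUR
print(order.items[0].quantity)  # 3.0
```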
You can of course also ask an LLM to assist you in writing a Pydantic schema and its validators.
Without this structure, you’d ask the AI “extract the invoice data” and get back a paragraph of text you’d have to parse yourself. With it, you get a Python object where result.total_amount is guaranteed to be a non-negative float, result.line_items is a typed list you can loop over, and any missing optional fields are cleanly None rather than absent keys that crash your code.
# Inspect the JSON schema that LangChain will send to the model
import json
print(json.dumps(InvoiceData.model_json_schema(), indent=2))

1.4 Binding the Schema to a Model with with_structured_output
with_structured_output() is the cleanest LangChain idiom. It:
- Generates the JSON schema from your Pydantic model.
- Instructs the LLM to use function / tool calling (preferred) or JSON-mode as a fallback.
- Automatically validates and coerces the response back into your Pydantic class.
# ── Ollama (local) ───────────────────────────────────────────────────────────
llm_ollama = ChatOllama(model="llama3.2", temperature=0)
extractor_ollama = llm_ollama.with_structured_output(InvoiceData)
# ── Gemini 2.5 Pro (cloud) ───────────────────────────────────────────────────
llm_gemini = ChatGoogleGenerativeAI(
model="gemini-2.5-pro",
temperature=0,
google_api_key=os.getenv("GOOGLE_API_KEY"),
)
extractor_gemini = llm_gemini.with_structured_output(InvoiceData)

# ── Minimal smoke-test: extract from plain text ──────────────────────────────
SAMPLE_TEXT = """
INVOICE
Vendor: Acme Software B.V.
Invoice #: INV-2025-0117
Date: 2025-03-01
Bill To: DataCorp GmbH
Items:
- Annual SaaS licence 1 × €4 800.00 = €4 800.00
- Premium support 1 × €1 200.00 = €1 200.00
Subtotal: €6 000.00
VAT (21 %): €1 260.00
Total Due: €7 260.00
"""
messages = [
SystemMessage(
content=(
"You are an expert invoice parser. "
"Extract all fields from the provided invoice text."
)
),
HumanMessage(content=SAMPLE_TEXT),
]
# With Ollama
result_ollama: InvoiceData = extractor_ollama.invoke(messages)
print(result_ollama.model_dump_json(indent=2))

# With Gemini
result_gemini: InvoiceData = extractor_gemini.invoke(messages)
print(result_gemini.model_dump_json(indent=2))

Extracting from Image Files (.jpg)
2.1 Encoding Images for Multimodal Models
Both Ollama vision models and Gemini accept images as base64-encoded strings embedded in the chat message. The helper below handles any common image format.
def encode_image(image_path: str | Path) -> tuple[str, str]:
"""
Read an image file and return (base64_string, mime_type).
Automatically converts non-JPEG/PNG files to JPEG.
"""
path = Path(image_path)
suffix = path.suffix.lower()
mime_map = {".jpg": "image/jpeg", ".jpeg": "image/jpeg", ".png": "image/png"}
if suffix not in mime_map:
# Convert to JPEG via Pillow
img = Image.open(path).convert("RGB")
import io
buf = io.BytesIO()
img.save(buf, format="JPEG")
return base64.b64encode(buf.getvalue()).decode(), "image/jpeg"
with open(path, "rb") as f:
return base64.b64encode(f.read()).decode(), mime_map[suffix]
def image_to_human_message(image_path: str | Path, prompt: str) -> HumanMessage:
"""Wrap an image + text prompt in a LangChain HumanMessage."""
b64, mime = encode_image(image_path)
return HumanMessage(
content=[
{
"type": "image_url",
"image_url": {"url": f"data:{mime};base64,{b64}"},
},
{"type": "text", "text": prompt},
]
)

2.2 The Extraction Prompt
INVOICE_EXTRACTION_PROMPT = (
"You are an expert document parser. "
"Extract every piece of financial information from this invoice image. "
"If a field is not visible or not applicable, return null. "
"Do not invent values that are not in the image."
)

2.3 Processing a Single Image
def extract_from_image(
image_path: str | Path,
extractor, # a bound with_structured_output chain
system_prompt: str = INVOICE_EXTRACTION_PROMPT,
) -> InvoiceData:
"""Run structured extraction on a single image file."""
msg = image_to_human_message(image_path, system_prompt)
result = extractor.invoke([SystemMessage(content=system_prompt), msg])
    return result

2.4 Selecting the Right Ollama Vision Model
Not all Ollama models support images. Use a vision-capable model:
# Pull a vision model (one-time setup)
ollama pull llava # 7 B, solid general vision
ollama pull llama3.2-vision # Meta's official multimodal variant
ollama pull minicpm-v # Smaller, faster for lighter hardware

# Vision-capable Ollama model
llm_ollama_vision = ChatOllama(model="llama3.2-vision", temperature=0)
extractor_ollama_vision = llm_ollama_vision.with_structured_output(InvoiceData)
# Gemini already supports vision; reuse the same extractor
extractor_gemini_vision = extractor_gemini

Extracting from PDF Files
3.1 Strategy: Page-by-Page Image Rendering
PDFs can contain native text, scanned images, or mixed content. The most robust strategy is to render each page to an image and send it to the vision model. PyMuPDF (fitz) does this in a few lines.
def pdf_pages_to_images(
pdf_path: str | Path,
dpi: int = 150,
) -> list[bytes]:
"""
Render every page of a PDF to a JPEG byte string.
Returns a list where index i = page i (0-based).
"""
doc = fitz.open(str(pdf_path))
pages_bytes = []
for page in doc:
mat = fitz.Matrix(dpi / 72, dpi / 72) # scale factor
pix = page.get_pixmap(matrix=mat, colorspace=fitz.csRGB)
pages_bytes.append(pix.tobytes("jpeg"))
doc.close()
return pages_bytes
def pdf_page_to_human_message(page_bytes: bytes, prompt: str) -> HumanMessage:
"""Wrap a rendered PDF page (as bytes) in a LangChain HumanMessage."""
b64 = base64.b64encode(page_bytes).decode()
return HumanMessage(
content=[
{
"type": "image_url",
"image_url": {"url": f"data:image/jpeg;base64,{b64}"},
},
{"type": "text", "text": prompt},
]
)

def extract_from_pdf(
pdf_path: str | Path,
extractor,
page_index: int = 0, # which page to extract from (default: first)
system_prompt: str = INVOICE_EXTRACTION_PROMPT,
) -> InvoiceData:
"""
Extract structured data from one page of a PDF.
For multi-page invoices you may loop over pages or concatenate their text.
"""
pages = pdf_pages_to_images(pdf_path)
if page_index >= len(pages):
raise IndexError(f"Page {page_index} does not exist in {pdf_path}")
msg = pdf_page_to_human_message(pages[page_index], system_prompt)
    return extractor.invoke([SystemMessage(content=system_prompt), msg])

Alternative: Text Extraction (for digitally native PDFs)
If you know the PDF is not scanned, extracting plain text is faster and cheaper:
def extract_text_from_pdf(pdf_path: str | Path) -> str:
"""Extract all text from a PDF using PyMuPDF."""
doc = fitz.open(str(pdf_path))
text = "\n\n".join(page.get_text() for page in doc)
doc.close()
return text
def extract_from_pdf_text(
pdf_path: str | Path,
extractor,
system_prompt: str = INVOICE_EXTRACTION_PROMPT,
) -> InvoiceData:
"""Structured extraction using plain-text content (no vision required)."""
text = extract_text_from_pdf(pdf_path)
messages = [
SystemMessage(content=system_prompt),
HumanMessage(content=f"Invoice content:\n\n{text}"),
]
    return extractor.invoke(messages)

You can check whether a PDF is digitally native by opening it and trying to copy and paste some text. If your PDF reader lets you select and copy the text, the PDF contains a native text layer and the plain-text route will work.
Batch Processing to DataFrame
4.1 The Batch Runner
from pathlib import Path
def batch_extract(
file_paths: list[str | Path],
extractor,
use_vision: bool = True,
) -> pd.DataFrame:
"""
Extract InvoiceData from a mixed list of .jpg / .png / .pdf files.
Parameters
----------
file_paths : list of paths to process
extractor : a with_structured_output chain bound to InvoiceData
use_vision : if True, PDFs are rendered to images; otherwise plain text
extraction is used for PDFs (requires selectable text)
Returns
-------
pandas DataFrame — one row per file, all InvoiceData fields as columns,
plus a `source_file` column and an `error` column for failed extractions.
"""
records = []
for path in file_paths:
path = Path(path)
print(f" Processing: {path.name} … ", end="", flush=True)
row = {"source_file": path.name, "error": None}
try:
suffix = path.suffix.lower()
if suffix in {".jpg", ".jpeg", ".png"}:
result = extract_from_image(path, extractor)
elif suffix == ".pdf":
if use_vision:
result = extract_from_pdf(path, extractor)
else:
result = extract_from_pdf_text(path, extractor)
else:
raise ValueError(f"Unsupported file type: {suffix}")
# Flatten the Pydantic model (nested models → JSON strings)
data = result.model_dump()
# Serialize nested line_items list as JSON string for the DataFrame
data["line_items"] = json.dumps(data["line_items"])
row.update(data)
print("✓")
except Exception as exc:
row["error"] = str(exc)
print(f"✗ {exc}")
records.append(row)
df = pd.DataFrame(records)
# Reorder: source_file first, then all InvoiceData fields, error last
fixed_cols = ["source_file"]
invoice_cols = [c for c in df.columns if c not in ("source_file", "error")]
error_col = ["error"]
df = df[fixed_cols + invoice_cols + error_col]
    return df

4.2 Running the Batch
# ── Point this at your own folder of invoices ────────────────────────────────
INVOICE_DIR = Path("./invoices") # adjust as needed
image_files = sorted(INVOICE_DIR.glob("*.jpg")) + sorted(INVOICE_DIR.glob("*.png"))
pdf_files = sorted(INVOICE_DIR.glob("*.pdf"))
all_files = image_files + pdf_files
print(f"Found {len(all_files)} files to process.\n")

# ── With Ollama (local, no API cost) ─────────────────────────────────────────
print("=== Ollama (llama3.2-vision) ===")
df_ollama = batch_extract(all_files, extractor_ollama_vision, use_vision=True)
print(df_ollama.head())

# ── With Gemini 2.5 Pro (cloud) ───────────────────────────────────────────────
print("=== Gemini 2.5 Pro ===")
df_gemini = batch_extract(all_files, extractor_gemini_vision, use_vision=True)
print(df_gemini.head())

4.3 Post-Processing the DataFrame
# ── Save results ─────────────────────────────────────────────────────────────
df_ollama.to_csv("invoices_ollama.csv", index=False)
df_gemini.to_csv("invoices_gemini.csv", index=False)
# ── Basic analysis ────────────────────────────────────────────────────────────
print("Total extracted (Gemini):")
print(df_gemini[["source_file", "vendor_name", "total_amount", "currency"]])
# Revenue per currency
print("\nTotal revenue by currency:")
print(
df_gemini.groupby("currency")["total_amount"]
.sum()
.reset_index()
.rename(columns={"total_amount": "total"})
)
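The `line_items` column still holds JSON strings at this point. A self-contained sketch of expanding them into one row per item (the data below is hypothetical, shaped like `batch_extract()` output):

```python
import json
import pandas as pd

# Hypothetical rows shaped like batch_extract() output
df = pd.DataFrame({
    "source_file": ["a.pdf", "b.pdf"],
    "line_items": [
        json.dumps([
            {"description": "Pens", "quantity": 3, "unit_price": 1.5, "total": 4.5},
            {"description": "Paper", "quantity": 1, "unit_price": 5.0, "total": 5.0},
        ]),
        json.dumps([]),  # an invoice where no line items were found
    ],
})

# Parse the JSON, explode to one row per item, drop empty invoices
items = (
    df.assign(line_items=df["line_items"].map(json.loads))
      .explode("line_items")
      .dropna(subset=["line_items"])
      .reset_index(drop=True)
)
items = pd.concat(
    [items[["source_file"]], pd.json_normalize(items["line_items"].tolist())],
    axis=1,
)
print(items)  # columns: source_file, description, quantity, unit_price, total
```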
# Failed extractions
failed = df_gemini[df_gemini["error"].notna()]
if not failed.empty:
print(f"\n⚠️ {len(failed)} file(s) failed:")
    print(failed[["source_file", "error"]])

Comparing Ollama vs. Gemini
# ── Compare accuracy on the same files ───────────────────────────────────────
comparison_cols = ["source_file", "invoice_number", "total_amount", "currency"]
compare = (
df_ollama[comparison_cols]
.merge(
df_gemini[comparison_cols],
on="source_file",
suffixes=("_ollama", "_gemini"),
)
)
# Flag rows where models disagree on the total
compare["total_match"] = (
compare["total_amount_ollama"].round(2) == compare["total_amount_gemini"].round(2)
)
print(compare)

Advanced Schema Design Tips
6.1 Optional Fields with Sensible Defaults
Use Optional liberally for fields that may genuinely be absent in some documents. If you force a field and the model cannot find it, it will hallucinate.
from typing import Literal
class ReceiptData(BaseModel):
"""Receipt data — looser schema for point-of-sale receipts."""
merchant_name: str = Field(description="Name of the shop or merchant")
merchant_category: Optional[str] = Field(
default=None,
description="Category of business, e.g. 'grocery', 'restaurant', 'pharmacy'",
)
transaction_date: Optional[str] = Field(
default=None, description="Date of transaction (YYYY-MM-DD)"
)
transaction_time: Optional[str] = Field(
default=None, description="Time of transaction (HH:MM, 24 h)"
)
items_purchased: list[str] = Field(
default_factory=list,
description="Plain-text names of items purchased",
)
total_amount: float = Field(description="Total amount paid")
payment_method: Optional[Literal["cash", "card", "mobile", "other"]] = Field(
default=None,
description="How the customer paid. Use 'other' if unclear.",
)
    currency: str = Field(default="USD")

6.2 Retry on Validation Failure
Pydantic raises ValidationError when the model returns bad data. Wrap calls in a retry loop for production code:
from pydantic import ValidationError
import time
def extract_with_retry(
messages,
extractor,
max_retries: int = 3,
wait_seconds: float = 2.0,
):
"""Retry extraction on validation errors, adding the error to the prompt."""
history = list(messages)
for attempt in range(1, max_retries + 1):
try:
return extractor.invoke(history)
except ValidationError as exc:
if attempt == max_retries:
raise
error_feedback = (
f"Your previous response failed schema validation:\n{exc}\n"
"Please correct your response and try again."
)
history.append(HumanMessage(content=error_feedback))
            time.sleep(wait_seconds)

6.3 Union Types for Mixed Document Batches
If your batch contains both invoices and receipts, but you don’t know in advance which pages are which, run a first pass that classifies the document, then extract with the matching schema (a Pydantic discriminated Union is an alternative single-schema design):
from typing import Literal
class DocumentClassifier(BaseModel):
"""First-pass classifier to decide which schema to use."""
document_type: Literal["invoice", "receipt", "unknown"] = Field(
description="Type of financial document"
)
confidence: float = Field(
ge=0.0, le=1.0,
description="Confidence score between 0 and 1"
)
def extract_any_document(image_path, extractor_classify, extractor_invoice, extractor_receipt):
"""Two-pass extraction: classify first, then extract with the right schema."""
classifier = extractor_classify # bound to DocumentClassifier
msg = image_to_human_message(image_path, "Classify this financial document.")
doc_type: DocumentClassifier = classifier.invoke([msg])
if doc_type.document_type == "invoice":
return extract_from_image(image_path, extractor_invoice)
elif doc_type.document_type == "receipt":
return extract_from_image(image_path, extractor_receipt)
else:
        raise ValueError(f"Unknown document type: {doc_type.document_type}")

Model Setup Reference
Ollama Setup (local)
# 1. Install Ollama → https://ollama.com/download
# 2. Pull models
ollama pull llama3.2 # Text-only, good for PDF text extraction
ollama pull llama3.2-vision # Multimodal — images + text
ollama pull llava # Alternative vision model
ollama pull minicpm-v # Lightweight vision model
# 3. Verify Ollama is running
ollama list

# Text-only Ollama model (for PDF text extraction)
llm_ollama_text = ChatOllama(model="llama3.2", temperature=0)
extractor_ollama_text = llm_ollama_text.with_structured_output(InvoiceData)
# Vision Ollama model
llm_ollama_vision = ChatOllama(model="llama3.2-vision", temperature=0)
extractor_ollama_vision = llm_ollama_vision.with_structured_output(InvoiceData)

Gemini 2.5 Pro Setup (cloud)
# 1. Create a Google AI Studio API key → https://aistudio.google.com
# 2. Add to .env: GOOGLE_API_KEY=your_key_here
# 3. Install: pip install langchain-google-genai

llm_gemini = ChatGoogleGenerativeAI(
model="gemini-2.5-pro", # or "gemini-2.0-flash" for faster/cheaper
temperature=0,
google_api_key=os.getenv("GOOGLE_API_KEY"),
)
extractor_gemini = llm_gemini.with_structured_output(InvoiceData)

Summary
| Step | What you do | Key tool |
|---|---|---|
| Define schema | Subclass `BaseModel`, annotate fields with `Field(description=…)` | `pydantic` |
| Bind to LLM | `llm.with_structured_output(MyModel)` | `langchain_core` |
| Load images | `encode_image()` → base64 → `HumanMessage` | Pillow, `base64` |
| Load PDFs | `fitz.open()` → render pages → base64 | PyMuPDF |
| Batch process | Loop files, call extractor, catch errors | pure Python |
| Collect results | `model.model_dump()` → `pd.DataFrame` | `pandas` |
| Compare models | Merge DataFrames on `source_file` | `pandas` |
The same pattern works for any domain — not just invoices. Swap InvoiceData for a schema that fits your problem (medical records, scientific papers, product listings, …) and the rest of the pipeline stays identical.