Structured Output Extraction with Pydantic and LangChain

Author

Bas Machielsen

Published

March 16, 2026

Introduction

Large Language Models (LLMs) are powerful at understanding unstructured content — text, images, PDFs — but their raw output is free-form text that is hard to use programmatically. The pattern covered in this notebook solves that problem:

  1. Define a Pydantic schema — a typed Python class that describes exactly what fields you want back.
  2. Bind the schema to an LLM via LangChain’s with_structured_output().
  3. Batch process files (.jpg images or .pdf documents) and collect every result into a tidy pandas DataFrame.

I demonstrate everything twice: once with a local open-source model via Ollama (llava or llama3.2-vision for images; llama3.2 for text), and once with Google Gemini 2.5 Pro (a proprietary, cloud-based model), using a hypothetical invoice as the running example.

Setup & Installation

# Install required packages (run once)
# %pip install langchain langchain-ollama langchain-google-genai \
#              pydantic pandas pymupdf pillow python-dotenv
# uv add langchain langchain-ollama langchain-google-genai \
#              pydantic pandas pymupdf pillow python-dotenv
import os
import base64
import json
from pathlib import Path
from typing import Optional

import pandas as pd
from pydantic import BaseModel, Field, field_validator, model_validator
from PIL import Image
import fitz  # PyMuPDF

# LangChain imports
from langchain_core.messages import HumanMessage, SystemMessage
from langchain_ollama import ChatOllama
from langchain_google_genai import ChatGoogleGenerativeAI

# Load API keys from a .env file (GOOGLE_API_KEY=...)
from dotenv import load_dotenv
load_dotenv()

Your project root should contain a .env file holding the API keys you use, e.g. GOOGLE_API_KEY=<api key here>, as mentioned in the setup. Add .env to your .gitignore so you don’t accidentally commit it to git.

Structured Output in Principle

1.1 Why Pydantic?

LLMs return strings. When you need structured data (e.g. for a database or downstream computation), you have two bad options without Pydantic:

  1. Parse with regex — brittle, breaks on any formatting variation.
  2. Trust the model to return valid JSON — it often adds markdown fences, explanatory text, or simply hallucinates keys.

Pydantic gives you a contract: a strongly-typed schema that validates, coerces, and rejects bad data with clear error messages.
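To make that contract concrete, here is a minimal sketch (the Price model is a hypothetical example, not part of the invoice pipeline):

```python
from pydantic import BaseModel, ValidationError

class Price(BaseModel):
    amount: float
    currency: str

# Coercion: the numeric string "19.99" becomes the float 19.99
p = Price(amount="19.99", currency="EUR")
assert p.amount == 19.99

# Rejection: a non-numeric amount raises a clear ValidationError
try:
    Price(amount="twenty", currency="EUR")
except ValidationError as exc:
    print(exc.errors()[0]["loc"])  # points at the offending field
```

Bad data never reaches your code silently: it is either coerced into the declared type or rejected with an error that names the field.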

1.2 Anatomy of a Good Pydantic Schema

A good schema for LLM extraction follows these principles:

  • Use Field(description=...) on every field: the description is injected into the JSON schema that the model receives, so clearer descriptions mean better extraction.
  • Use Optional[T] for fields that may be absent: LLMs sometimes cannot find a value, and forcing the field causes hallucination.
  • Add field_validator for domain constraints: this catches out-of-range or nonsensical values before they reach your code.
  • Nest sub-models for related groups of fields: this keeps the top-level schema readable and makes partial extraction easier.
  • Prefer str over Enum for messy real-world data: Enums cause validation failures when the model uses a synonym.
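The last principle is easy to demonstrate. A hedged sketch with hypothetical models (not part of the pipeline) shows why a strict Enum can reject an otherwise good extraction while a plain str survives:

```python
from enum import Enum
from pydantic import BaseModel, ValidationError

class Currency(str, Enum):
    EUR = "EUR"
    USD = "USD"

class StrictInvoice(BaseModel):
    currency: Currency  # Enum: any synonym fails validation

class LooseInvoice(BaseModel):
    currency: str       # str: accept everything, normalise in a validator

try:
    StrictInvoice(currency="Euro")  # the model answered with a synonym
except ValidationError:
    print("Enum rejected 'Euro'")

loose = LooseInvoice(currency="Euro")  # survives; a validator can map it later
assert loose.currency == "Euro"
```

With the str version you keep the data and normalise it afterwards, rather than losing the whole extraction to one synonym.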

1.3 Example: Invoice extraction schema

The goal is simple: you want an AI model to read an invoice and hand you back organised, reliable data — not a wall of text. This code describes the exact shape of that data.

Think of a Pydantic BaseModel as a form with labelled fields. Instead of getting a free-form answer from the AI, you’re saying: “Fill in this form, with these fields, in these data types.” Pydantic then checks that the form was filled in correctly before the data ever reaches your code.

The following is example code followed by an elaboration:

# Example: Invoice extraction schema

class LineItem(BaseModel):
    """A single line item on an invoice."""
    description: str = Field(description="Short description of the product or service")
    quantity: float = Field(description="Number of units")
    unit_price: float = Field(description="Price per unit in the invoice currency")
    total: float = Field(description="Line total = quantity × unit_price")


class InvoiceData(BaseModel):
    """Structured data extracted from an invoice image or PDF page."""

    invoice_number: Optional[str] = Field(
        default=None,
        description="Invoice or reference number, e.g. 'INV-2024-0042'",
    )
    invoice_date: Optional[str] = Field(
        default=None,
        description="Issue date in ISO 8601 format (YYYY-MM-DD) if determinable",
    )
    vendor_name: str = Field(description="Name of the company that issued the invoice")
    vendor_address: Optional[str] = Field(
        default=None, description="Full address of the vendor"
    )
    customer_name: Optional[str] = Field(
        default=None, description="Name of the customer / bill-to party"
    )
    line_items: list[LineItem] = Field(
        default_factory=list,
        description="All individual line items found on the invoice",
    )
    subtotal: Optional[float] = Field(
        default=None, description="Amount before tax"
    )
    tax_amount: Optional[float] = Field(
        default=None, description="Total tax or VAT amount"
    )
    total_amount: float = Field(
        description="Grand total amount due (including tax)"
    )
    currency: str = Field(
        default="USD",
        description="Three-letter ISO 4217 currency code, e.g. 'EUR' or 'USD'",
    )

    # ── Validators ──────────────────────────────────────────────────────────

    @field_validator("currency")
    @classmethod
    def normalise_currency(cls, v: str) -> str:
        """Uppercase and strip whitespace from the currency code."""
        return v.strip().upper()

    @field_validator("total_amount", "subtotal", "tax_amount", mode="before")
    @classmethod
    def non_negative(cls, v):
        """Monetary amounts must be non-negative."""
        if v is not None and float(v) < 0:
            raise ValueError("Monetary amount cannot be negative")
        return v

    @model_validator(mode="after")
    def check_totals(self) -> "InvoiceData":
        """Warn if subtotal + tax ≠ total (allow 1 % tolerance for rounding)."""
        if self.subtotal is not None and self.tax_amount is not None:
            expected = self.subtotal + self.tax_amount
            if abs(expected - self.total_amount) / max(self.total_amount, 1) > 0.01:
                # Don't raise — just log; the model may have seen a discount line
                print(
                    f"Totals may not add up: {self.subtotal} + {self.tax_amount}"
                    f" ≠ {self.total_amount}"
                )
        return self

Start with a stripped-down version of the LineItem class:

class LineItem(BaseModel):
    description: str
    quantity: float
    unit_price: float
    total: float

This class represents a single row on an invoice — one product or service. An invoice for office supplies might have three line items: pens, paper, and a stapler. Each one has a description, how many were bought, what each cost, and the row total.

The Field(description="...") part is instructions for the AI, not for Python. When LangChain sends this schema to the model, those descriptions travel with it, so the AI knows what each field means.

The InvoiceData class is the full invoice. Most fields are straightforward — vendor name, date, total — but three things are worth highlighting:

  1. Optional[str] vs str: some fields have Optional[...] and a default=None; others don’t.

    • vendor_name: str is required. Every invoice has a sender; if the AI can’t find it, that’s an error.
    • invoice_number: Optional[str] = None is optional. Some invoices genuinely don’t have a number. Rather than forcing the AI to make something up, you tell it: “if you can’t find this, just leave it blank.”

    This distinction matters a lot in practice. Forcing required fields on uncertain data causes the AI to hallucinate values. Making genuinely required fields non-optional means you catch real problems early.

  2. line_items: list[LineItem]: this is where the two forms connect. The main invoice form contains a list of line item forms — potentially zero, one, or many. Pydantic validates each entry in that list against the LineItem schema automatically. This nesting lets you model the real structure of a document rather than flattening everything.

  3. Sanity Checking: once the AI fills in the form, Pydantic runs your validators before handing the data to you. There are three here:

    1. normalise_currency — a simple cleanup. If the AI returns " eur " or "Eur", this strips the whitespace and uppercases it to "EUR". You’d rather fix small formatting issues than reject otherwise good data.

    2. non_negative — a domain rule. Monetary amounts on an invoice can’t be negative. If the AI returns -150.0 for a total (which can happen with confused models), this raises an error immediately rather than silently putting a nonsensical value in your DataFrame.

    3. check_totals — a cross-field consistency check. It looks at multiple fields together: does subtotal + tax_amount roughly equal total_amount? If not, it prints a warning. Notice it doesn’t raise an error here — the comment explains why: a discount line could legitimately cause a mismatch. So it flags it for a human to review rather than rejecting the whole extraction.
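You can watch these validators fire without calling a model. Here is a trimmed-down stand-in for InvoiceData (MiniInvoice and Item are illustrative names; the behaviour mirrors the validators above):

```python
from pydantic import BaseModel, Field, field_validator

class Item(BaseModel):
    description: str
    total: float

class MiniInvoice(BaseModel):
    """Trimmed-down stand-in for InvoiceData, for illustration only."""
    vendor_name: str
    line_items: list[Item] = Field(default_factory=list)
    total_amount: float
    currency: str = "USD"

    @field_validator("currency")
    @classmethod
    def normalise_currency(cls, v: str) -> str:
        return v.strip().upper()

    @field_validator("total_amount", mode="before")
    @classmethod
    def non_negative(cls, v):
        if float(v) < 0:
            raise ValueError("Monetary amount cannot be negative")
        return v

raw = {
    "vendor_name": "Acme",
    "line_items": [{"description": "SaaS licence", "total": 4800.0}],
    "total_amount": "7260.0",   # numeric string: checked, then coerced to float
    "currency": " eur ",
}
inv = MiniInvoice(**raw)
assert inv.currency == "EUR"              # normalise_currency fired
assert inv.line_items[0].total == 4800.0  # nested Item was validated
assert inv.total_amount == 7260.0         # coerced from string to float
```

Note that the mode="before" validator sees the raw value (here a string) before Pydantic coerces it, which is why it calls float(v) itself.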

You can of course also ask an LLM to assist you in writing a Pydantic schema and its validators.

Without this structure, you’d ask the AI “extract the invoice data” and get back a paragraph of text you’d have to parse yourself. With it, you get a Python object where result.total_amount is guaranteed to be a non-negative float, result.line_items is a typed list you can loop over, and any missing optional fields are cleanly None rather than absent keys that crash your code.

# Inspect the JSON schema that LangChain will send to the model
import json
print(json.dumps(InvoiceData.model_json_schema(), indent=2))

1.4 Binding the Schema to a Model with with_structured_output

with_structured_output() is the cleanest LangChain idiom. It:

  • Generates the JSON schema from your Pydantic model.
  • Instructs the LLM to use function / tool calling (preferred) or JSON-mode as a fallback.
  • Automatically validates and coerces the response back into your Pydantic class.
# ── Ollama (local) ───────────────────────────────────────────────────────────
llm_ollama = ChatOllama(model="llama3.2", temperature=0)
extractor_ollama = llm_ollama.with_structured_output(InvoiceData)

# ── Gemini 2.5 Pro (cloud) ───────────────────────────────────────────────────
llm_gemini = ChatGoogleGenerativeAI(
    model="gemini-2.5-pro",
    temperature=0,
    google_api_key=os.getenv("GOOGLE_API_KEY"),
)
extractor_gemini = llm_gemini.with_structured_output(InvoiceData)
# ── Minimal smoke-test: extract from plain text ──────────────────────────────
SAMPLE_TEXT = """
INVOICE
Vendor: Acme Software B.V.
Invoice #: INV-2025-0117
Date: 2025-03-01

Bill To: DataCorp GmbH

Items:
  - Annual SaaS licence    1 × €4 800.00  = €4 800.00
  - Premium support        1 × €1 200.00  = €1 200.00

Subtotal: €6 000.00
VAT (21 %): €1 260.00
Total Due: €7 260.00
"""

messages = [
    SystemMessage(
        content=(
            "You are an expert invoice parser. "
            "Extract all fields from the provided invoice text."
        )
    ),
    HumanMessage(content=SAMPLE_TEXT),
]

# With Ollama
result_ollama: InvoiceData = extractor_ollama.invoke(messages)
print(result_ollama.model_dump_json(indent=2))
# With Gemini
result_gemini: InvoiceData = extractor_gemini.invoke(messages)
print(result_gemini.model_dump_json(indent=2))

Extracting from Image Files (.jpg)

2.1 Encoding Images for Multimodal Models

Both Ollama vision models and Gemini accept images as base64-encoded strings embedded in the chat message. The helper below handles any common image format.

def encode_image(image_path: str | Path) -> tuple[str, str]:
    """
    Read an image file and return (base64_string, mime_type).
    Automatically converts non-JPEG/PNG files to JPEG.
    """
    path = Path(image_path)
    suffix = path.suffix.lower()
    mime_map = {".jpg": "image/jpeg", ".jpeg": "image/jpeg", ".png": "image/png"}

    if suffix not in mime_map:
        # Convert to JPEG via Pillow
        img = Image.open(path).convert("RGB")
        import io
        buf = io.BytesIO()
        img.save(buf, format="JPEG")
        return base64.b64encode(buf.getvalue()).decode(), "image/jpeg"

    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode(), mime_map[suffix]


def image_to_human_message(image_path: str | Path, prompt: str) -> HumanMessage:
    """Wrap an image + text prompt in a LangChain HumanMessage."""
    b64, mime = encode_image(image_path)
    return HumanMessage(
        content=[
            {
                "type": "image_url",
                "image_url": {"url": f"data:{mime};base64,{b64}"},
            },
            {"type": "text", "text": prompt},
        ]
    )

2.2 The Extraction Prompt

INVOICE_EXTRACTION_PROMPT = (
    "You are an expert document parser. "
    "Extract every piece of financial information from this invoice image. "
    "If a field is not visible or not applicable, return null. "
    "Do not invent values that are not in the image."
)

2.3 Processing a Single Image

def extract_from_image(
    image_path: str | Path,
    extractor,           # a bound with_structured_output chain
    system_prompt: str = INVOICE_EXTRACTION_PROMPT,
) -> InvoiceData:
    """Run structured extraction on a single image file."""
    msg = image_to_human_message(image_path, system_prompt)
    result = extractor.invoke([SystemMessage(content=system_prompt), msg])
    return result

2.4 Selecting the Right Ollama Vision Model

Not all Ollama models support images. Use a vision-capable model:

# Pull a vision model (one-time setup)
ollama pull llava            # 7 B, solid general vision
ollama pull llama3.2-vision  # Meta's official multimodal variant
ollama pull minicpm-v        # Smaller, faster for lighter hardware
# Vision-capable Ollama model
llm_ollama_vision = ChatOllama(model="llama3.2-vision", temperature=0)
extractor_ollama_vision = llm_ollama_vision.with_structured_output(InvoiceData)

# Gemini already supports vision; reuse the same extractor
extractor_gemini_vision = extractor_gemini

Extracting from PDF Files

3.1 Strategy: Page-by-Page Image Rendering

PDFs can contain native text, scanned images, or mixed content. The most robust strategy is to render each page to an image and send it to the vision model. PyMuPDF (fitz) does this in a few lines.

def pdf_pages_to_images(
    pdf_path: str | Path,
    dpi: int = 150,
) -> list[bytes]:
    """
    Render every page of a PDF to a JPEG byte string.
    Returns a list where index i = page i (0-based).
    """
    doc = fitz.open(str(pdf_path))
    pages_bytes = []
    for page in doc:
        mat = fitz.Matrix(dpi / 72, dpi / 72)  # scale factor
        pix = page.get_pixmap(matrix=mat, colorspace=fitz.csRGB)
        pages_bytes.append(pix.tobytes("jpeg"))
    doc.close()
    return pages_bytes


def pdf_page_to_human_message(page_bytes: bytes, prompt: str) -> HumanMessage:
    """Wrap a rendered PDF page (as bytes) in a LangChain HumanMessage."""
    b64 = base64.b64encode(page_bytes).decode()
    return HumanMessage(
        content=[
            {
                "type": "image_url",
                "image_url": {"url": f"data:image/jpeg;base64,{b64}"},
            },
            {"type": "text", "text": prompt},
        ]
    )
def extract_from_pdf(
    pdf_path: str | Path,
    extractor,
    page_index: int = 0,           # which page to extract from (default: first)
    system_prompt: str = INVOICE_EXTRACTION_PROMPT,
) -> InvoiceData:
    """
    Extract structured data from one page of a PDF.
    For multi-page invoices you may loop over pages or concatenate their text.
    """
    pages = pdf_pages_to_images(pdf_path)
    if page_index >= len(pages):
        raise IndexError(f"Page {page_index} does not exist in {pdf_path}")
    msg = pdf_page_to_human_message(pages[page_index], system_prompt)
    return extractor.invoke([SystemMessage(content=system_prompt), msg])

Alternative: Text Extraction (for digitally native PDFs)

If you know the PDF is not scanned, extracting plain text is faster and cheaper:

def extract_text_from_pdf(pdf_path: str | Path) -> str:
    """Extract all text from a PDF using PyMuPDF."""
    doc = fitz.open(str(pdf_path))
    text = "\n\n".join(page.get_text() for page in doc)
    doc.close()
    return text


def extract_from_pdf_text(
    pdf_path: str | Path,
    extractor,
    system_prompt: str = INVOICE_EXTRACTION_PROMPT,
) -> InvoiceData:
    """Structured extraction using plain-text content (no vision required)."""
    text = extract_text_from_pdf(pdf_path)
    messages = [
        SystemMessage(content=system_prompt),
        HumanMessage(content=f"Invoice content:\n\n{text}"),
    ]
    return extractor.invoke(messages)

You can check this by opening the PDF and trying to select and copy some text. If your PDF reader recognises the text, the PDF is digitally native and plain-text extraction will work.

Batch Processing to DataFrame

4.1 The Batch Runner

from pathlib import Path

def batch_extract(
    file_paths: list[str | Path],
    extractor,
    use_vision: bool = True,
) -> pd.DataFrame:
    """
    Extract InvoiceData from a mixed list of .jpg / .png / .pdf files.

    Parameters
    ----------
    file_paths  : list of paths to process
    extractor   : a with_structured_output chain bound to InvoiceData
    use_vision  : if True, PDFs are rendered to images; otherwise plain text
                  extraction is used for PDFs (requires selectable text)

    Returns
    -------
    pandas DataFrame — one row per file, all InvoiceData fields as columns,
    plus a `source_file` column and an `error` column for failed extractions.
    """
    records = []

    for path in file_paths:
        path = Path(path)
        print(f"  Processing: {path.name} … ", end="", flush=True)
        row = {"source_file": path.name, "error": None}

        try:
            suffix = path.suffix.lower()

            if suffix in {".jpg", ".jpeg", ".png"}:
                result = extract_from_image(path, extractor)

            elif suffix == ".pdf":
                if use_vision:
                    result = extract_from_pdf(path, extractor)
                else:
                    result = extract_from_pdf_text(path, extractor)

            else:
                raise ValueError(f"Unsupported file type: {suffix}")

            # Flatten the Pydantic model (nested models → JSON strings)
            data = result.model_dump()
            # Serialize nested line_items list as JSON string for the DataFrame
            data["line_items"] = json.dumps(data["line_items"])
            row.update(data)
            print("✓")

        except Exception as exc:
            row["error"] = str(exc)
            print(f"✗  {exc}")

        records.append(row)

    df = pd.DataFrame(records)

    # Reorder: source_file first, then all InvoiceData fields, error last
    fixed_cols = ["source_file"]
    invoice_cols = [c for c in df.columns if c not in ("source_file", "error")]
    error_col = ["error"]
    df = df[fixed_cols + invoice_cols + error_col]

    return df

4.2 Running the Batch

# ── Point this at your own folder of invoices ────────────────────────────────
INVOICE_DIR = Path("./invoices")          # adjust as needed

image_files = sorted(INVOICE_DIR.glob("*.jpg")) + sorted(INVOICE_DIR.glob("*.png"))
pdf_files   = sorted(INVOICE_DIR.glob("*.pdf"))
all_files   = image_files + pdf_files

print(f"Found {len(all_files)} files to process.\n")
# ── With Ollama (local, no API cost) ─────────────────────────────────────────
print("=== Ollama (llama3.2-vision) ===")
df_ollama = batch_extract(all_files, extractor_ollama_vision, use_vision=True)
print(df_ollama.head())
# ── With Gemini 2.5 Pro (cloud) ───────────────────────────────────────────────
print("=== Gemini 2.5 Pro ===")
df_gemini = batch_extract(all_files, extractor_gemini_vision, use_vision=True)
print(df_gemini.head())

4.3 Post-Processing the DataFrame

# ── Save results ─────────────────────────────────────────────────────────────
df_ollama.to_csv("invoices_ollama.csv", index=False)
df_gemini.to_csv("invoices_gemini.csv", index=False)

# ── Basic analysis ────────────────────────────────────────────────────────────
print("Total extracted (Gemini):")
print(df_gemini[["source_file", "vendor_name", "total_amount", "currency"]])

# Revenue per currency
print("\nTotal revenue by currency:")
print(
    df_gemini.groupby("currency")["total_amount"]
    .sum()
    .reset_index()
    .rename(columns={"total_amount": "total"})
)

# Failed extractions
failed = df_gemini[df_gemini["error"].notna()]
if not failed.empty:
    print(f"\n⚠️  {len(failed)} file(s) failed:")
    print(failed[["source_file", "error"]])
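Because batch_extract stores line_items as a JSON string, you can recover one row per line item later. A minimal sketch with a hypothetical one-row frame standing in for a real extraction result:

```python
import json
import pandas as pd

# Hypothetical extracted frame; line_items is a JSON string, as in batch_extract
df = pd.DataFrame({
    "source_file": ["inv1.pdf"],
    "line_items": ['[{"description": "SaaS licence", "quantity": 1.0, '
                   '"unit_price": 4800.0, "total": 4800.0}]'],
})

# Parse the JSON column, then explode to one row per line item
exploded = (
    df.assign(line_items=df["line_items"].map(json.loads))
      .explode("line_items")
      .reset_index(drop=True)
)
items = pd.concat(
    [exploded.drop(columns="line_items"),
     pd.json_normalize(exploded["line_items"].tolist())],
    axis=1,
)
print(items[["source_file", "description", "total"]])
```

This gives you a long-format table of line items, keyed by source_file, that you can aggregate like any other DataFrame.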
The following example compares the Ollama and Gemini results on the same files:
# ── Compare accuracy on the same files ───────────────────────────────────────
comparison_cols = ["source_file", "invoice_number", "total_amount", "currency"]

compare = (
    df_ollama[comparison_cols]
    .merge(
        df_gemini[comparison_cols],
        on="source_file",
        suffixes=("_ollama", "_gemini"),
    )
)

# Flag rows where models disagree on the total
compare["total_match"] = (
    compare["total_amount_ollama"].round(2) == compare["total_amount_gemini"].round(2)
)

print(compare)

Advanced Schema Design Tips

6.1 Optional Fields with Sensible Defaults

Use Optional liberally for fields that may genuinely be absent in some documents. If you force a field and the model cannot find it, it will hallucinate.

from typing import Literal

class ReceiptData(BaseModel):
    """Receipt data — looser schema for point-of-sale receipts."""

    merchant_name: str = Field(description="Name of the shop or merchant")
    merchant_category: Optional[str] = Field(
        default=None,
        description="Category of business, e.g. 'grocery', 'restaurant', 'pharmacy'",
    )
    transaction_date: Optional[str] = Field(
        default=None, description="Date of transaction (YYYY-MM-DD)"
    )
    transaction_time: Optional[str] = Field(
        default=None, description="Time of transaction (HH:MM, 24 h)"
    )
    items_purchased: list[str] = Field(
        default_factory=list,
        description="Plain-text names of items purchased",
    )
    total_amount: float = Field(description="Total amount paid")
    payment_method: Optional[Literal["cash", "card", "mobile", "other"]] = Field(
        default=None,
        description="How the customer paid. Use 'other' if unclear.",
    )
    currency: str = Field(default="USD")

6.2 Retry on Validation Failure

Pydantic raises ValidationError when the model returns bad data. Wrap calls in a retry loop for production code:

from pydantic import ValidationError
import time

def extract_with_retry(
    messages,
    extractor,
    max_retries: int = 3,
    wait_seconds: float = 2.0,
):
    """Retry extraction on validation errors, adding the error to the prompt."""
    history = list(messages)

    for attempt in range(1, max_retries + 1):
        try:
            return extractor.invoke(history)
        except ValidationError as exc:
            if attempt == max_retries:
                raise
            error_feedback = (
                f"Your previous response failed schema validation:\n{exc}\n"
                "Please correct your response and try again."
            )
            history.append(HumanMessage(content=error_feedback))
            time.sleep(wait_seconds)

6.3 Handling Mixed Document Batches

If your batch contains both invoices and receipts, but you don’t know which files are which, use a two-pass pipeline: classify each document first, then extract with the schema that matches.

from typing import Literal

class DocumentClassifier(BaseModel):
    """First-pass classifier to decide which schema to use."""
    document_type: Literal["invoice", "receipt", "unknown"] = Field(
        description="Type of financial document"
    )
    confidence: float = Field(
        ge=0.0, le=1.0,
        description="Confidence score between 0 and 1"
    )

def extract_any_document(image_path, extractor_classify, extractor_invoice, extractor_receipt):
    """Two-pass extraction: classify first, then extract with the right schema."""
    classifier = extractor_classify  # bound to DocumentClassifier
    msg = image_to_human_message(image_path, "Classify this financial document.")
    doc_type: DocumentClassifier = classifier.invoke([msg])

    if doc_type.document_type == "invoice":
        return extract_from_image(image_path, extractor_invoice)
    elif doc_type.document_type == "receipt":
        return extract_from_image(image_path, extractor_receipt)
    else:
        raise ValueError(f"Unknown document type: {doc_type.document_type}")

Model Setup Reference

Ollama Setup (local)

# 1. Install Ollama  →  https://ollama.com/download
# 2. Pull models
ollama pull llama3.2           # Text-only, good for PDF text extraction
ollama pull llama3.2-vision    # Multimodal — images + text
ollama pull llava              # Alternative vision model
ollama pull minicpm-v          # Lightweight vision model

# 3. Verify Ollama is running
ollama list
# Text-only Ollama model (for PDF text extraction)
llm_ollama_text = ChatOllama(model="llama3.2", temperature=0)
extractor_ollama_text = llm_ollama_text.with_structured_output(InvoiceData)

# Vision Ollama model
llm_ollama_vision = ChatOllama(model="llama3.2-vision", temperature=0)
extractor_ollama_vision = llm_ollama_vision.with_structured_output(InvoiceData)

Gemini 2.5 Pro Setup (cloud)

# 1. Create a Google AI Studio API key  →  https://aistudio.google.com
# 2. Add to .env:  GOOGLE_API_KEY=your_key_here
# 3. Install:  pip install langchain-google-genai
llm_gemini = ChatGoogleGenerativeAI(
    model="gemini-2.5-pro",          # or "gemini-2.0-flash" for faster/cheaper
    temperature=0,
    google_api_key=os.getenv("GOOGLE_API_KEY"),
)
extractor_gemini = llm_gemini.with_structured_output(InvoiceData)

Summary

  • Define schema: subclass BaseModel, annotate fields with Field(description=…) (pydantic)
  • Bind to LLM: llm.with_structured_output(MyModel) (langchain_core)
  • Load images: encode_image() → base64 → HumanMessage (Pillow, base64)
  • Load PDFs: fitz.open() → render pages → base64 (PyMuPDF)
  • Batch process: loop over files, call the extractor, catch errors (pure Python)
  • Collect results: model.model_dump() → pd.DataFrame (pandas)
  • Compare models: merge DataFrames on source_file (pandas)

The same pattern works for any domain — not just invoices. Swap InvoiceData for a schema that fits your problem (medical records, scientific papers, product listings, …) and the rest of the pipeline stays identical.
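As an illustration, here is a hedged sketch of a schema for a different domain (PaperMetadata is a hypothetical name; only the schema changes, the rest of the pipeline does not):

```python
from typing import Optional
from pydantic import BaseModel, Field

class PaperMetadata(BaseModel):
    """Hypothetical schema for scientific-paper front matter."""
    title: str = Field(description="Full paper title")
    authors: list[str] = Field(default_factory=list, description="Author names, in order")
    year: Optional[int] = Field(default=None, description="Publication year, if printed")
    abstract: Optional[str] = Field(default=None, description="Abstract text, if present")

# The rest of the pipeline is unchanged:
# extractor = llm.with_structured_output(PaperMetadata)
paper = PaperMetadata(title="Attention Is All You Need", authors=["A. Vaswani"], year=2017)
assert paper.abstract is None  # missing optional fields are cleanly None
```

Everything downstream (batching, error handling, the DataFrame collection) works verbatim with the new schema.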