Why JDF is AI-friendly

This format was designed in the AI era, on purpose, for AI consumption. Large language models read text. PDFs are binary. JDF is the document format that closes the gap — and every AI-shaped workflow built on top of PDF, DOCX, or HTML scrapes is a hack around that gap.

Why AI engines prefer JDF in one sentence: JSON is the language LLMs are trained to read, write, and reason about — handing them a .jdf is like handing a programmer source code instead of a screenshot of source code.

Why every modern AI engine reads JDF more easily

Every major LLM family (GPT-4, Claude, Gemini, Llama, Mistral) has seen large amounts of JSON in training and is tuned to emit it. Tool calls return JSON. Function arguments are JSON. Structured output modes (OpenAI's response_format, Anthropic's tools, Google's responseSchema) all build on JSON Schema. So when you hand a model a .jdf:

No format tax. The model doesn't burn context on "what does this byte sequence mean" — it walks the tree the way it walks any function call argument.
Structure is explicit. A heading is "heading": 1, not "this line happens to be 24pt bold". The model never has to infer structure that the format already encodes.
Schemas are first-class. Point any LLM at spec/jdf-schema.json and it can validate, generate, and edit JDF documents with structured-output guarantees. Try that with PDF.
Token-efficient. Short JSON keys tokenise cheaply under BPE: "heading": 2 costs only a handful of tokens. The equivalent visual cue in a PDF text dump is the entire surrounding line context the model has to read to guess the level.
No OCR layer. PDFs frequently store text as glyph outlines or as an invisible OCR overlay on a scanned image. JDF is plain UTF-8 strings: there is nothing to OCR, nothing to misread, and no glyph-to-character mapping to get wrong.
Round-trip safe. If a model writes a JDF, it can be opened, rendered, edited by a human, and fed back into another model, all without lossy parsing in between. You cannot say this about PDF.
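The round-trip claim can be checked mechanically. The sketch below parses a small document, edits one heading, serialises it, and parses it again. It assumes only the field names used in this page's sample document (pages, elements, heading, content); roundTrip and sampleJdf are illustrative names, not part of any JDF library.

```javascript
// Sketch: a JDF document survives parse -> edit -> serialise -> parse unchanged.
// Field names follow the sample document on this page; `roundTrip` is a
// hypothetical helper, not part of any JDF library.
const sampleJdf = JSON.stringify({
  $jdf: "1.0.0",
  meta: { title: "Q3 Earnings", author: "Acme Inc." },
  pages: [{ elements: [{ type: "text", content: "Revenue grew 18% YoY.", heading: 1 }] }],
});

function roundTrip(jdfText, edit) {
  const doc = JSON.parse(jdfText);              // what a model or tool reads
  edit(doc);                                    // a human or model edits the tree in place
  const written = JSON.stringify(doc, null, 2); // what gets saved back to disk
  return { edited: doc, reparsed: JSON.parse(written) };
}

const { edited, reparsed } = roundTrip(sampleJdf, (doc) => {
  doc.pages[0].elements[0].heading = 2; // demote the heading one level
});

// Serialising and re-parsing yields a structurally identical tree.
console.log(JSON.stringify(edited) === JSON.stringify(reparsed)); // true
console.log(reparsed.pages[0].elements[0].heading);               // 2
```

Nothing here depends on a JDF-specific parser; the same check would work in any language with a JSON library.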

The PDF problem for LLMs

When you hand a PDF to an LLM, you can't actually hand it the file. You hand it the output of a parser:

pdftotext often returns a soup of broken lines, lost columns, and wrongly-grouped tables.
pdfplumber, PyMuPDF, and similar libraries work better but still inject layout artifacts the model has to "guess past".
Multi-column papers come out interleaved.
Tables come out as columns of numbers without row alignment.
Footnotes mix into the body. Headers repeat on every page. Page numbers fragment paragraphs.
Embedded images and diagrams: gone, or replaced with [image].

Every prompt that says "here is a PDF, summarize it" is actually saying "here is one parser's best guess at the PDF — please reconstruct what the author meant". The model burns tokens recovering structure that should never have been lost.

The JDF answer

A .jdf is JSON. The model already speaks JSON natively. It doesn't need a parser. It doesn't need to guess at structure. The structure is the file.

{
  "$jdf": "1.0.0",
  "meta": { "title": "Q3 Earnings", "author": "Acme Inc." },
  "pages": [{
    "elements": [
      { "type": "text", "content": "Revenue grew 18% YoY.", "heading": 1 },
      { "type": "table",
        "headers": ["Quarter", "Revenue", "YoY"],
        "rows": [["Q1", "$2.3M", "+12%"], ["Q2", "$2.7M", "+15%"]] }
    ]
  }]
}

That's the file. Headings are tagged. Tables are real arrays with named columns. The author is in meta.author, not buried in a corrupt XMP packet.
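To make that concrete, here is a minimal sketch that reads the sample document above with nothing but JSON.parse. The variable names are illustrative, and the code relies only on the fields visible in the sample, not on the full schema.

```javascript
// Sketch: reading the sample document with nothing but JSON.parse.
const earningsJdf = `{
  "$jdf": "1.0.0",
  "meta": { "title": "Q3 Earnings", "author": "Acme Inc." },
  "pages": [{
    "elements": [
      { "type": "text", "content": "Revenue grew 18% YoY.", "heading": 1 },
      { "type": "table",
        "headers": ["Quarter", "Revenue", "YoY"],
        "rows": [["Q1", "$2.3M", "+12%"], ["Q2", "$2.7M", "+15%"]] }
    ]
  }]
}`;

const earnings = JSON.parse(earningsJdf);
const allElements = earnings.pages.flatMap((page) => page.elements);

// The author is a plain field lookup.
const author = earnings.meta.author;

// Headings are the text elements that carry a `heading` level.
const headings = allElements
  .filter((el) => el.type === "text" && el.heading !== undefined)
  .map((el) => ({ level: el.heading, text: el.content }));

// Pair each table row with the header names to get one record per row.
const tableRecords = allElements
  .filter((el) => el.type === "table")
  .flatMap((table) =>
    table.rows.map((row) =>
      Object.fromEntries(table.headers.map((name, i) => [name, row[i]]))
    )
  );

console.log(author);          // Acme Inc.
console.log(headings);        // [ { level: 1, text: 'Revenue grew 18% YoY.' } ]
console.log(tableRecords[1]); // { Quarter: 'Q2', Revenue: '$2.7M', YoY: '+15%' }
```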

Concrete consequences for AI workflows

1. Cleaner context

The model sees the document the way you'd want it to. No [image truncated], no broken hyphenation, no "page 3 of 14" line in the middle of a paragraph.

2. Fewer tokens

JSON is more compact than parser-extracted text once you account for the structure tags the model would otherwise have to infer. You pay for what's there — not for noise.

3. Tool use that just works

If you're building an agent that operates on documents, you can hand it jq queries directly:

# pull every heading
jq '.pages[].elements[] | select(.heading) | .content' doc.jdf

# list all tables
jq '.pages[].elements[] | select(.type == "table")' doc.jdf

# find every external link
jq '.. | objects | select(.type == "external") | .target' doc.jdf

No PDF library. No regex. No "the parser missed this row". The agent's tool call returns structured data because the document is structured data.
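Where an agent runtime has no shell, the same three queries can be written as plain functions. This is a sketch, not an official JDF API: the function names are made up for illustration, and the link shape ({ type: "external", target }) is inferred from the jq query above rather than taken from the schema.

```javascript
// Sketch: the three jq queries above as plain functions an agent could expose as tools.
// The link shape ({ type: "external", target }) is inferred from the jq query,
// not taken from the JDF schema.
function* walkObjects(value) {
  // Depth-first visit of every object in the tree, like jq's `.. | objects`.
  if (Array.isArray(value)) {
    for (const item of value) yield* walkObjects(item);
  } else if (value !== null && typeof value === "object") {
    yield value;
    for (const child of Object.values(value)) yield* walkObjects(child);
  }
}

const elementsOf = (doc) => doc.pages.flatMap((page) => page.elements);

// Every element that carries a heading level.
const listHeadings = (doc) =>
  elementsOf(doc).filter((el) => el.heading != null).map((el) => el.content);

// Every table element, unchanged.
const listTables = (doc) => elementsOf(doc).filter((el) => el.type === "table");

// Every object anywhere in the tree whose type is "external".
const listExternalLinks = (doc) =>
  [...walkObjects(doc)].filter((obj) => obj.type === "external").map((obj) => obj.target);

const toolDoc = {
  $jdf: "1.0.0",
  pages: [{
    elements: [
      { type: "text", content: "Overview", heading: 1 },
      { type: "text", content: "Full filing",
        link: { type: "external", target: "https://example.com/filing" } },
      { type: "table", headers: ["Quarter", "Revenue"], rows: [["Q1", "$2.3M"]] },
    ],
  }],
};

console.log(listHeadings(toolDoc));      // [ 'Overview' ]
console.log(listTables(toolDoc).length); // 1
console.log(listExternalLinks(toolDoc)); // [ 'https://example.com/filing' ]
```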

4. Generation is symmetric

If an LLM can read JDF natively, it can also write JDF natively. Asking GPT or Claude to "produce a one-page report" works, because the output is a JSON object — the same shape it just consumed. PDF generation requires either a tool call into a heavyweight library or hallucinated bytes.
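Generated output still deserves a check before it is saved. The sketch below accepts a model's reply only if it parses as JSON and has the top-level shape used on this page. checkJdfShape is a hypothetical stand-in for real validation against spec/jdf-schema.json and covers only the fields shown here.

```javascript
// Sketch: accept model output only if it parses and has the expected top-level shape.
// `checkJdfShape` is a hypothetical stand-in for validating against
// spec/jdf-schema.json; it covers only the fields shown on this page.
function checkJdfShape(text) {
  let doc;
  try {
    doc = JSON.parse(text);
  } catch (err) {
    return { ok: false, errors: [`not valid JSON: ${err.message}`] };
  }
  if (doc === null || typeof doc !== "object" || Array.isArray(doc)) {
    return { ok: false, errors: ["top level must be an object"] };
  }
  const errors = [];
  if (typeof doc.$jdf !== "string") errors.push("missing $jdf version string");
  if (!Array.isArray(doc.pages)) {
    errors.push("pages must be an array");
  } else {
    doc.pages.forEach((page, p) => {
      if (!page || !Array.isArray(page.elements)) {
        errors.push(`pages[${p}].elements must be an array`);
        return;
      }
      page.elements.forEach((el, e) => {
        const where = `pages[${p}].elements[${e}]`;
        if (!el || typeof el.type !== "string") {
          errors.push(`${where} needs a string type`);
        } else if (el.type === "table" && !(Array.isArray(el.headers) && Array.isArray(el.rows))) {
          errors.push(`${where} table needs headers and rows arrays`);
        }
      });
    });
  }
  return { ok: errors.length === 0, errors, doc };
}

const goodOutput =
  '{"$jdf":"1.0.0","pages":[{"elements":[{"type":"text","content":"Summary","heading":1}]}]}';
const badOutput = '{"pages":[{"elements":[{"type":"table","headers":["A"]}]}]}';

console.log(checkJdfShape(goodOutput).ok);                   // true
console.log(checkJdfShape(badOutput).errors.length);         // 2
console.log(checkJdfShape("Sure! Here is your report:").ok); // false
```

The error strings can be fed back to the model as a retry prompt, which is usually simpler than trying to repair a broken binary export.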

5. RAG without the PDF middle-man

Most retrieval-augmented systems ingest PDFs by extracting text, embedding chunks, and praying the chunking didn't slice through a table. With JDF you embed elements directly:

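// Each element is its own retrieval unit, with metadata.
// `embed(text, metadata)` stands in for your embedding call and is passed in by the caller.
async function embedDocument(doc, embed) {
  for (const [pageIndex, page] of doc.pages.entries()) {
    const pageNumber = pageIndex + 1;
    for (const el of page.elements) {
      if (el.type === "text") {
        await embed(el.content, { type: "text", heading: el.heading, page: pageNumber });
      } else if (el.type === "table") {
        // Keep headers and rows together so the table stays one chunk.
        const text = JSON.stringify({ headers: el.headers, rows: el.rows });
        await embed(text, { type: "table", rows: el.rows.length, page: pageNumber });
      }
    }
  }
}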

Each chunk carries semantic metadata (type, heading level, page number) that the retriever can filter and rank on. No more "this chunk is half a paragraph and half a table cell".
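A table element has to become text before it can be embedded. One possible serialisation, sketched here as an assumption rather than a JDF convention, emits one line per row with each cell labelled by its column header. tableToText is an illustrative name.

```javascript
// Sketch: flatten a table element into one labelled line per row before embedding.
// `tableToText` is an illustrative helper, not part of any JDF tooling.
function tableToText(table) {
  return table.rows
    .map((row) => row.map((cell, i) => `${table.headers[i]}: ${cell}`).join("; "))
    .join("\n");
}

const revenueTable = {
  type: "table",
  headers: ["Quarter", "Revenue", "YoY"],
  rows: [["Q1", "$2.3M", "+12%"], ["Q2", "$2.7M", "+15%"]],
};

console.log(tableToText(revenueTable));
// Quarter: Q1; Revenue: $2.3M; YoY: +12%
// Quarter: Q2; Revenue: $2.7M; YoY: +15%
```

Keeping the column name next to every cell makes each line self-describing, so a retrieved row still makes sense without the rest of the table.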

Side-by-side: same document, two formats

Imagine asking a model "what was Q2 revenue?" — given the same source document.

| What the LLM sees | PDF (after `pdftotext`) | JDF |
| --- | --- | --- |
| Headings | Mixed in with body text; the model guesses from line length and surrounding whitespace. | `"heading": 1`: explicit, queryable, no inference. |
| Tables | Cells smear across columns. Multi-row headers collapse. The model may invent rows. | `{"headers": [...], "rows": [...]}`: every cell at its real coordinate. |
| Footnotes | Inserted into the body text, so the model reads them as part of the paragraph. | Distinct elements with their own type. Trivial to filter out or surface. |
| Page numbers / running headers | Appear every N lines and fragment paragraphs. | Live in `header` / `footer` blocks, never in the content tree. |
| Images / charts | Replaced with `[image]` or dropped silently. | Stored in `resources.images` with alt text; a vision model can read them at the right anchor in the doc. |
| Links | Often lost: the text shows but the underlying URL is gone. | First-class `link` field on text and runs. |
| Validation | None. Anything goes, and the model has to trust the parser. | JSON Schema: validates in any IDE, fails fast in CI. |
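With the JDF column, the "what was Q2 revenue?" question becomes a coordinate lookup. The sketch below assumes the table shape from the sample document and that the first column holds the row label; lookupCell is a hypothetical helper.

```javascript
// Sketch: answer "what was Q2 revenue?" by coordinates instead of a text scan.
// `lookupCell` is a hypothetical helper; it assumes the first column holds the row label.
function lookupCell(doc, rowLabel, columnName) {
  for (const page of doc.pages) {
    for (const el of page.elements) {
      if (el.type !== "table") continue;
      const col = el.headers.indexOf(columnName);
      if (col === -1) continue; // this table has no such column
      const row = el.rows.find((r) => r[0] === rowLabel);
      if (row) return row[col];
    }
  }
  return undefined; // no table has that row and column
}

const q3Report = {
  $jdf: "1.0.0",
  pages: [{
    elements: [
      { type: "text", content: "Revenue grew 18% YoY.", heading: 1 },
      { type: "table",
        headers: ["Quarter", "Revenue", "YoY"],
        rows: [["Q1", "$2.3M", "+12%"], ["Q2", "$2.7M", "+15%"]] },
    ],
  }],
};

console.log(lookupCell(q3Report, "Q2", "Revenue")); // $2.7M
console.log(lookupCell(q3Report, "Q4", "Revenue")); // undefined
```

A missing row comes back as undefined rather than as a plausible-looking guess, which is the behaviour you want from a tool the model calls.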

Why this matters for your AI stack

Cheaper inference. A clean structured doc costs fewer tokens than a noisy parsed PDF. Every prompt is shorter and more focused.
Higher accuracy. Models tend to hallucinate less when the context is structured. "Find the Q2 revenue" reduces to a JSON path, not a free-text scan.
Reliable structured output. Ask the LLM to produce a JDF document and it actually can — because the target schema is the same JSON it was trained to emit. Try asking it for a bit-perfect PDF.
Tool-friendly. Agents can call jq, JSONPath, or any JSON manipulation library on a .jdf. PDF has no query path this direct.
Diffable history. Every revision is a real diff, so an LLM reviewing a doc revision sees exactly what changed — not a 200-page re-export.
Trainable on its own. If you fine-tune a model, JDF training pairs are tiny, structured, and lossless. PDF pairs need a parser, a normaliser, and a prayer.
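The diffable-history point can be sketched as an element-level diff between two revisions. This version matches elements by position (page index, element index), which is an assumption made for brevity; a real tool would probably want stable element ids, which this page does not describe. diffRevisions is an illustrative name.

```javascript
// Sketch: element-level diff between two revisions of a JDF document.
// Elements are matched by position (page index, element index); comparison via
// JSON.stringify is key-order sensitive, which is acceptable for a sketch.
function diffRevisions(before, after) {
  const changes = [];
  const pageCount = Math.max(before.pages.length, after.pages.length);
  for (let p = 0; p < pageCount; p++) {
    const oldEls = before.pages[p]?.elements ?? [];
    const newEls = after.pages[p]?.elements ?? [];
    const count = Math.max(oldEls.length, newEls.length);
    for (let e = 0; e < count; e++) {
      const path = `pages[${p}].elements[${e}]`;
      const a = oldEls[e];
      const b = newEls[e];
      if (a === undefined) {
        changes.push({ path, kind: "added", after: b });
      } else if (b === undefined) {
        changes.push({ path, kind: "removed", before: a });
      } else if (JSON.stringify(a) !== JSON.stringify(b)) {
        changes.push({ path, kind: "changed", before: a, after: b });
      }
    }
  }
  return changes;
}

const rev1 = { pages: [{ elements: [
  { type: "text", content: "Revenue grew 15% YoY.", heading: 1 },
  { type: "text", content: "Margins were flat." },
] }] };
const rev2 = { pages: [{ elements: [
  { type: "text", content: "Revenue grew 18% YoY.", heading: 1 },
  { type: "text", content: "Margins were flat." },
  { type: "text", content: "Guidance raised for Q4." },
] }] };

console.log(diffRevisions(rev1, rev2).map((c) => `${c.kind} ${c.path}`));
// [ 'changed pages[0].elements[0]', 'added pages[0].elements[2]' ]
```

A reviewer model then only needs the changes array in its context, not both full revisions.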

The bigger picture

Markdown gave humans and machines a shared source format for short text on the web. JDF does the same thing for the next layer up: long-form documents with structure, layout, and embedded media. PDF dates from 1993 and was built to reproduce fixed page layouts on screen and in print; it predates almost every concern AI workflows now have.

If your stack involves an LLM touching documents, the format ought to be one the LLM can read without a translator. JDF is that format.
