Why JDF is AI-friendly

This format was designed in the AI era, on purpose, for AI consumption. Large language models read text. PDFs are binary. JDF is the document format that closes the gap — and every AI-shaped workflow built on top of PDF, DOCX, or HTML scrapes is a hack around that gap.

Why AI engines prefer JDF in one sentence: JSON is the language LLMs are trained to read, write, and reason about — handing them a .jdf is like handing a programmer source code instead of a screenshot of source code.

Why every modern AI engine reads JDF more easily

Every major LLM (GPT-4, Claude, Gemini, Llama, Mistral) was trained on enormous corpora of JSON. Tool calls return JSON. Function arguments are JSON. Structured output modes (OpenAI's response_format, Anthropic's tools, Google's responseSchema) are all specified with JSON Schema or a close subset of it. The model has seen JSON at enormous scale during pretraining. So when you hand it a .jdf:

The PDF problem for LLMs

When you hand a PDF to an LLM, you can't actually hand it the file. You hand it the output of a parser:

Every prompt that says "here is a PDF, summarize it" is actually saying "here is one parser's best guess at the PDF — please reconstruct what the author meant". The model burns tokens recovering structure that should never have been lost.

The JDF answer

A .jdf is JSON. The model already speaks JSON natively. It doesn't need a parser. It doesn't need to guess at structure. The structure is the file.

{
  "$jdf": "1.0.0",
  "meta": { "title": "Q3 Earnings", "author": "Acme Inc." },
  "pages": [{
    "elements": [
      { "type": "text", "content": "Revenue grew 18% YoY.", "heading": 1 },
      { "type": "table",
        "headers": ["Quarter", "Revenue", "YoY"],
        "rows": [["Q1", "$2.3M", "+12%"], ["Q2", "$2.7M", "+15%"]] }
    ]
  }]
}

That's the file. Headings are tagged. Tables are real arrays with named columns. The author is in meta.author, not buried in a corrupt XMP packet.
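
Because the file is plain JSON, reading it needs nothing beyond a standard JSON parser. A minimal sketch in JavaScript, assuming only the fields shown in the example above; the helper name `outline` is illustrative and not part of JDF:

```javascript
// The sample document from above, inlined as a string for illustration.
const source = `{
  "$jdf": "1.0.0",
  "meta": { "title": "Q3 Earnings", "author": "Acme Inc." },
  "pages": [{
    "elements": [
      { "type": "text", "content": "Revenue grew 18% YoY.", "heading": 1 },
      { "type": "table",
        "headers": ["Quarter", "Revenue", "YoY"],
        "rows": [["Q1", "$2.3M", "+12%"], ["Q2", "$2.7M", "+15%"]] }
    ]
  }]
}`;

const doc = JSON.parse(source);

// Hypothetical helper: summarize a parsed document without any PDF tooling.
function outline(parsed) {
  const elements = parsed.pages.flatMap((page) => page.elements);
  return {
    author: parsed.meta.author,
    headings: elements.filter((el) => el.heading).map((el) => el.content),
    tableCount: elements.filter((el) => el.type === "table").length,
  };
}

console.log(outline(doc));
```

The author, the headings, and the table count all come from direct property access, with no text extraction step in between.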

Concrete consequences for AI workflows

1. Cleaner context

The model sees the document the way you'd want it to. No [image truncated], no broken hyphenation, no "page 3 of 14" line in the middle of a paragraph.

2. Fewer tokens

Once you account for the structure the model would otherwise have to infer, JSON costs fewer tokens than parser-extracted text: there are no repeated running headers, stray page numbers, or layout debris to pay for. You pay for what's there, not for noise.

3. Tool use that just works

If you're building an agent that operates on documents, you can hand it jq queries directly:

# pull every heading
jq '.pages[].elements[] | select(.heading) | .content' doc.jdf

# list all tables
jq '.pages[].elements[] | select(.type == "table")' doc.jdf

# find every external link
jq '.. | objects | select(.type == "external") | .target' doc.jdf

No PDF library. No regex. No "the parser missed this row". The agent's tool call returns structured data because the document is structured data.
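
For agents that call functions rather than shelling out to jq, the same three queries can be sketched in plain JavaScript. The link shape used in the sample (`link: { type: "external", target }`) is an assumption inferred from the last jq query, and the function names are illustrative:

```javascript
// Flatten every element on every page, like jq's `.pages[].elements[]`.
function allElements(doc) {
  return doc.pages.flatMap((page) => page.elements);
}

// Equivalent of: select(.heading) | .content
function headings(doc) {
  return allElements(doc).filter((el) => el.heading).map((el) => el.content);
}

// Equivalent of: select(.type == "table")
function tables(doc) {
  return allElements(doc).filter((el) => el.type === "table");
}

// Equivalent of: .. | objects | select(.type == "external") | .target
// Walks every nested value, as jq's recursive descent does.
function externalLinks(value, found = []) {
  if (Array.isArray(value)) {
    for (const item of value) externalLinks(item, found);
  } else if (value !== null && typeof value === "object") {
    if (value.type === "external") found.push(value.target);
    for (const child of Object.values(value)) externalLinks(child, found);
  }
  return found;
}

// Small sample document; the `link` object shape is assumed, not taken from a spec.
const sample = {
  pages: [{
    elements: [
      { type: "text", content: "Overview", heading: 1 },
      { type: "text", content: "See the site.", link: { type: "external", target: "https://example.com" } },
      { type: "table", headers: ["Quarter"], rows: [["Q1"]] },
    ],
  }],
};

console.log(headings(sample), tables(sample).length, externalLinks(sample));
```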

4. Generation is symmetric

If an LLM can read JDF natively, it can also write JDF natively. Asking GPT or Claude to "produce a one-page report" works, because the output is a JSON object — the same shape it just consumed. PDF generation requires either a tool call into a heavyweight library or hallucinated bytes.
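
Model output still deserves a check before it is accepted. What follows is a minimal sketch of a structural check, assuming only the fields shown in the example above. It is a hand-written stand-in rather than the JDF JSON Schema, the rule that every row must match the header width is an assumption, and `checkJdf` is a hypothetical name:

```javascript
// Returns a list of problems; an empty list means the text passed these basic checks.
function checkJdf(text) {
  let doc;
  try {
    doc = JSON.parse(text);
  } catch (err) {
    return [`not valid JSON: ${err.message}`];
  }
  if (doc === null || typeof doc !== "object" || Array.isArray(doc)) {
    return ["top level is not an object"];
  }
  const problems = [];
  if (typeof doc.$jdf !== "string") problems.push('missing "$jdf" version string');
  if (!Array.isArray(doc.pages)) {
    problems.push('"pages" is not an array');
    return problems;
  }
  doc.pages.forEach((page, p) => {
    if (!page || !Array.isArray(page.elements)) {
      problems.push(`pages[${p}].elements is not an array`);
      return;
    }
    page.elements.forEach((el, e) => {
      const where = `pages[${p}].elements[${e}]`;
      if (!el || typeof el.type !== "string") {
        problems.push(`${where} has no "type"`);
      } else if (el.type === "table") {
        // Assumed rule: every row has exactly one cell per header.
        const width = Array.isArray(el.headers) ? el.headers.length : -1;
        const rowsOk = Array.isArray(el.rows) &&
          el.rows.every((row) => Array.isArray(row) && row.length === width);
        if (!rowsOk) problems.push(`${where} rows do not match headers`);
      }
    });
  });
  return problems;
}

const good = '{"$jdf":"1.0.0","pages":[{"elements":[{"type":"text","content":"Hi"}]}]}';
const bad = '{"$jdf":"1.0.0","pages":[{"elements":[{"type":"table","headers":["A","B"],"rows":[["1"]]}]}]}';

console.log(checkJdf(good)); // []
console.log(checkJdf(bad));  // one problem: the row has one cell but there are two headers
```

A failed check can be fed straight back to the model as a retry prompt, since each problem names the exact element that broke.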

5. RAG without the PDF middle-man

Most retrieval-augmented systems ingest PDFs by extracting text, embedding chunks, and praying the chunking didn't slice through a table. With JDF you embed elements directly:

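// Each element is its own retrieval unit, with metadata.
// `embed` and `serialize` are placeholders for your own embedding call
// and table-to-text helper; run this inside an async function or ES module.
for (const [pageIndex, page] of doc.pages.entries()) {
  const pageNumber = pageIndex + 1; // 1-based, taken from position in `pages`
  for (const el of page.elements) {
    if (el.type === "text") {
      await embed(el.content, { type: "text", heading: el.heading, page: pageNumber });
    } else if (el.type === "table") {
      await embed(serialize(el), { type: "table", rows: el.rows.length, page: pageNumber });
    }
  }
}

Each chunk carries semantic metadata (type, heading level, page number) that the retriever can filter and rank on. No more "this chunk is half a paragraph and half a table cell".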
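
The loop above leaves `serialize` undefined. One possible version, offered as a sketch rather than a prescribed helper, turns each row into "Header: value" pairs so the column names stay attached to the cells when the text is embedded:

```javascript
// One possible `serialize`: render a table element as text for an embedding model.
function serialize(table) {
  return table.rows
    .map((row) => row.map((cell, i) => `${table.headers[i]}: ${cell}`).join(", "))
    .join("\n");
}

// The table from the sample document earlier on this page.
const revenueTable = {
  type: "table",
  headers: ["Quarter", "Revenue", "YoY"],
  rows: [["Q1", "$2.3M", "+12%"], ["Q2", "$2.7M", "+15%"]],
};

console.log(serialize(revenueTable));
// Quarter: Q1, Revenue: $2.3M, YoY: +12%
// Quarter: Q2, Revenue: $2.7M, YoY: +15%
```

Keeping the header next to every value means a retrieved row still makes sense on its own, without the rest of the table.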

Side-by-side: same document, two formats

Imagine asking a model "what was Q2 revenue?" — given the same source document.

| What the LLM sees | PDF (after pdftotext) | JDF |
| --- | --- | --- |
| Headings | Mixed in with body text; model guesses from line length and surrounding whitespace. | "heading": 1 is explicit, queryable, no inference. |
| Tables | Cells smear across columns. Multi-row headers collapse. Model invents rows. | {"headers": [...], "rows": [...]} puts every cell at its real coordinate. |
| Footnotes | Inserted into the body line. Hallucinated as part of the paragraph. | Distinct elements with their own type. Trivial to filter out or surface. |
| Page numbers / running headers | Appear every N lines, fragment paragraphs. | Live in header / footer blocks, never in the content tree. |
| Images / charts | Replaced with [image] or dropped silently. | Stored in resources.images with alt text; a vision model can read them at the right anchor in the doc. |
| Links | Often lost: text shows but the underlying URL is gone. | First-class link field on text and runs. |
| Validation | None: anything goes, model has to trust the parser. | JSON Schema, validates in any IDE, fails fast in CI. |
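
With the table intact, the "what was Q2 revenue?" question above becomes a lookup rather than an inference. A minimal sketch, assuming the table shape from the sample document earlier on this page; `lookup` is a hypothetical helper name:

```javascript
// Find a cell by row key and column name in the first table that has both columns.
function lookup(doc, keyColumn, keyValue, wantedColumn) {
  for (const page of doc.pages) {
    for (const el of page.elements) {
      if (el.type !== "table") continue;
      const keyIdx = el.headers.indexOf(keyColumn);
      const wantedIdx = el.headers.indexOf(wantedColumn);
      if (keyIdx === -1 || wantedIdx === -1) continue;
      const row = el.rows.find((r) => r[keyIdx] === keyValue);
      if (row) return row[wantedIdx];
    }
  }
  return undefined; // no table holds that row and column
}

const report = {
  pages: [{
    elements: [
      { type: "text", content: "Revenue grew 18% YoY.", heading: 1 },
      { type: "table",
        headers: ["Quarter", "Revenue", "YoY"],
        rows: [["Q1", "$2.3M", "+12%"], ["Q2", "$2.7M", "+15%"]] },
    ],
  }],
};

console.log(lookup(report, "Quarter", "Q2", "Revenue")); // $2.7M
```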

Why this matters for your AI stack

The bigger picture

Markdown caught on across the web because it let humans and machines share the same source for short text. JDF does the same thing for the next layer up: long-form documents with structure, layout, and embedded media. PDF dates from 1993 and was built to reproduce the printed page; it predates almost every concern AI workflows now have.

If your stack involves an LLM touching documents, the format ought to be one the LLM can read without a translator. JDF is that format.