Introducing griddy-forge: Faster Document Ingestion for AI Spreadsheet Workflows

We published griddy-forge, a benchmark-driven document ingestion engine that replaces MarkItDown in Griddy. It is up to 34x faster, with higher extraction quality across PDFs, DOCX, spreadsheets, and CSV.

Reviewed by Griddy

When you upload a document to Griddy — a PDF, a spreadsheet, a DOCX — the first thing that happens is ingestion. The file gets converted into structured text that the AI agent can reason over. That ingestion step determines how fast the agent responds and how accurately it understands your document.

Until now, Griddy used MarkItDown for that job. It worked, but the conversion was often slower than acceptable for interactive use, and extraction fidelity on complex documents left gaps that downstream reasoning had to work around.

Today we are publishing griddy-forge, a document ingestion engine we built from scratch to solve that problem. The paper is now available on Zenodo: griddy-forge: Benchmark-Driven Document Ingestion for AI Spreadsheet Workflows.

What griddy-forge does

griddy-forge converts documents into structured text optimized for LLM consumption. It supports PDFs, DOCX, XLSX, XLS, CSV, and TSV — the file types people actually work with in spreadsheet workflows.

The engine includes two rendering modes:

  • LLM-text — structured output optimized for agent reasoning, with lower token footprint
  • Markdown — standard markdown output for parser-to-parser comparisons

Both modes are more accurate than the MarkItDown path they replace, and faster on every benchmark lane reported here except the frozen holdout PDFs.
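To make the difference between the two modes concrete, here is a minimal sketch that renders the same small CSV two ways. It is not griddy-forge's actual output format: the function names `render_markdown` and `render_compact`, the sample data, and the tab-delimited layout are all assumptions chosen for illustration, and character count stands in as a crude proxy for token footprint.

```python
import csv
import io

# Hypothetical sample data, not taken from the griddy-forge benchmarks.
SAMPLE_CSV = "region,units,revenue\nNorth,120,3400.50\nSouth,95,2710.00\n"


def parse_rows(text: str) -> list[list[str]]:
    """Parse CSV text into a list of rows, header first."""
    return [row for row in csv.reader(io.StringIO(text)) if row]


def render_markdown(rows: list[list[str]]) -> str:
    """Render rows as a standard pipe-delimited Markdown table."""
    header, *body = rows
    lines = ["| " + " | ".join(header) + " |"]
    lines.append("| " + " | ".join("---" for _ in header) + " |")
    lines.extend("| " + " | ".join(row) + " |" for row in body)
    return "\n".join(lines)


def render_compact(rows: list[list[str]]) -> str:
    """Render rows as tab-delimited text with no table decoration.

    This layout is an assumption for illustration only.
    """
    return "\n".join("\t".join(row) for row in rows)


rows = parse_rows(SAMPLE_CSV)
markdown_text = render_markdown(rows)
compact_text = render_compact(rows)

# Character count is only a rough stand-in for token count.
saving = 1 - len(compact_text) / len(markdown_text)
print(markdown_text)
print(compact_text)
print(f"compact layout is {saving:.0%} smaller by character count")
```

The same cell values survive in both renderings. Only the decoration around them changes, which is where a leaner agent-facing format can save prompt space.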

The product result

The benchmark results that matter most are the ones closest to real Griddy usage.

On Griddy workload files (prompt-library CSVs, real spreadsheet fixtures):

Metric | griddy-forge | MarkItDown
Latency | 9.2 ms | 167.6 ms
Quality | 1.000 | 0.853
Token footprint | 30.1% fewer | baseline

That is an 18.2x speedup with a perfect quality score and a 30.1% smaller token footprint.

On Griddy-derived documents (internal-style PDFs and DOCX briefs):

Metric | griddy-forge | MarkItDown
Latency | 237.1 ms | 710.4 ms
Quality | 0.988 | 0.913

A 3x speedup with higher fidelity on the kinds of documents Griddy users actually upload.
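The speedup figures quoted for these two lanes follow directly from the latency rows. The short check below recomputes them from the published numbers. The variable and function names are ours, chosen for illustration.

```python
# Latency (ms) and quality scores as reported for the two Griddy lanes.
LANES = {
    "workload files": {"forge_ms": 9.2, "markitdown_ms": 167.6,
                       "forge_q": 1.000, "markitdown_q": 0.853},
    "derived documents": {"forge_ms": 237.1, "markitdown_ms": 710.4,
                          "forge_q": 0.988, "markitdown_q": 0.913},
}


def speedup(lane: dict) -> float:
    """Baseline latency divided by griddy-forge latency."""
    return lane["markitdown_ms"] / lane["forge_ms"]


def quality_gain(lane: dict) -> float:
    """Absolute difference in quality score (griddy-forge minus MarkItDown)."""
    return lane["forge_q"] - lane["markitdown_q"]


for name, lane in LANES.items():
    print(f"{name}: {speedup(lane):.1f}x faster, quality {quality_gain(lane):+.3f}")
# workload files: 18.2x faster, quality +0.147
# derived documents: 3.0x faster, quality +0.075
```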

Common document results

griddy-forge is not just faster on Griddy-specific files. Across standard document types:

  • Public office files (DOCX, XLS, XLSX, CSV): 18.6x faster, quality 1.000 vs 0.993
  • Synthetic DOCX (comments, nested tables, hyperlinks): 34.2x faster, quality 1.000 vs 0.892
  • Synthetic spreadsheets (formulas, dates, multi-sheet): 19.8x faster, quality 1.000 vs 0.833

The speed differences are not rounding errors. On common office documents, griddy-forge typically finishes in single-digit milliseconds where MarkItDown takes hundreds.
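For readers who want to sanity-check latency claims like these on their own files, a timing harness can be as simple as the sketch below. This is not the paper's benchmark methodology: `median_latency_ms` and `toy_convert` are hypothetical names, and the stand-in converter only decodes bytes so that the example runs without any ingestion library installed.

```python
import statistics
import time
from typing import Callable


def median_latency_ms(convert: Callable[[bytes], str], payload: bytes, runs: int = 20) -> float:
    """Return the median wall-clock time of convert(payload) in milliseconds."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        convert(payload)
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


def toy_convert(payload: bytes) -> str:
    """Hypothetical stand-in converter: decode bytes and normalize line endings."""
    return payload.decode("utf-8").replace("\r\n", "\n")


# Synthetic payload; swap in the bytes of a real file to measure a real converter.
payload = b"a,b,c\r\n1,2,3\r\n" * 1000
latency = median_latency_ms(toy_convert, payload)
print(f"median latency: {latency:.3f} ms")
```

Taking the median over repeated runs keeps a single slow run (cold cache, garbage collection) from skewing the number. To compare two engines, pass each engine's conversion function to the same harness with the same payload.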

Public PDF benchmarks

PDF extraction is the hardest lane. griddy-forge was evaluated against MarkItDown, PyMuPDF4LLM, and LiteParse on public government and academic PDFs.

On the 25-document standard public PDF lane (IRS, NOAA, FDIC, NSF, NIST, ACL, CISA, and GovInfo sources):

Engine | Latency | Quality
griddy-forge | 3.70 s | 0.946
MarkItDown | 4.02 s | 0.672
PyMuPDF4LLM | 19.62 s | 0.644
LiteParse | 31.08 s | 0.647

On the 12-document hard PDF lane (dense digital-text PDFs):

Engine | Latency | Quality
griddy-forge | 2.74 s | 0.921
MarkItDown | 3.43 s | 0.672
PyMuPDF4LLM | 10.98 s | 0.647
LiteParse | 15.48 s | 0.628

The quality gap is the headline: +0.274 on standard PDFs and +0.249 on hard PDFs versus MarkItDown. That is the difference between an agent that understands a document's structure and one that partially garbles it.
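The same comparison can be run against every engine in both PDF lanes, not just MarkItDown. The sketch below recomputes the quality gaps and latency ratios from the published figures. The `compare` helper is our own illustration, not part of griddy-forge.

```python
# (latency in seconds, quality score) per engine, as reported for each lane.
STANDARD = {
    "griddy-forge": (3.70, 0.946),
    "MarkItDown": (4.02, 0.672),
    "PyMuPDF4LLM": (19.62, 0.644),
    "LiteParse": (31.08, 0.647),
}
HARD = {
    "griddy-forge": (2.74, 0.921),
    "MarkItDown": (3.43, 0.672),
    "PyMuPDF4LLM": (10.98, 0.647),
    "LiteParse": (15.48, 0.628),
}


def compare(lane: dict, reference: str = "griddy-forge") -> dict:
    """For each other engine, return (latency ratio, quality gap) relative to the reference.

    Latency ratio above 1 means the other engine is slower than the reference.
    Quality gap above 0 means the reference scores higher.
    """
    ref_latency, ref_quality = lane[reference]
    return {
        name: (round(latency / ref_latency, 2), round(ref_quality - quality, 3))
        for name, (latency, quality) in lane.items()
        if name != reference
    }


for label, lane in (("standard", STANDARD), ("hard", HARD)):
    for engine, (ratio, gap) in compare(lane).items():
        print(f"{label:8s} vs {engine:12s}: {ratio:.2f}x latency, quality gap {gap:+.3f}")
```

Against MarkItDown this reproduces the +0.274 and +0.249 gaps, and it shows that the latency advantage over MarkItDown on PDFs is modest (roughly 1.1x to 1.25x) compared with the office-document lanes.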

Honest holdout results

The paper includes a frozen 10-document holdout lane that was never tuned against. This is where the results are most honest:

  • griddy-forge quality: 0.669 vs MarkItDown: 0.547 (+0.122)
  • griddy-forge is slightly slower on these documents (2.48 s vs 2.28 s)

Quality improves across all holdout categories — research reports, scientific articles, legal opinions, and multi-column technical papers — but latency is not yet a win on difficult unseen PDFs. The paper reports this directly rather than burying it.

Why this matters for Griddy users

Every time you upload a file to Griddy, the quality of what the AI sees depends on ingestion. Faster ingestion means faster responses. Higher fidelity means fewer mistakes when the agent reads your tables, parses your numbers, or follows your document structure.

griddy-forge makes both of those better, and the improvements are largest on the exact file types Griddy users work with most: spreadsheets, CSVs, and business documents.

NOTE

The full paper, benchmark data, and publication summary are available at doi.org/10.5281/zenodo.19386269. Benchmark output-token counts are repo-internal proxy metrics, not provider billing tokens.

The Griddy way

You do not need to think about document ingestion at all. Upload a PDF, a spreadsheet, or a DOCX to Griddy and the agent handles the rest — now backed by griddy-forge instead of MarkItDown.

"Import the data from this PDF into my spreadsheet and clean up the formatting"

griddy-forge reads the document. Griddy does the work.
