Introducing griddy-forge: Faster Document Ingestion for AI Spreadsheet Workflows
We published griddy-forge, a benchmark-driven document ingestion engine that replaces MarkItDown in Griddy. It is up to 34x faster, with higher extraction quality across PDFs, DOCX, spreadsheets, and CSV files.
Reviewed by Griddy
When you upload a document to Griddy — a PDF, a spreadsheet, a DOCX — the first thing that happens is ingestion. The file gets converted into structured text that the AI agent can reason over. That ingestion step determines how fast the agent responds and how accurately it understands your document.
Until now, Griddy used MarkItDown for that job. It worked, but the conversion was often slower than acceptable for interactive use, and extraction fidelity on complex documents left gaps that downstream reasoning had to work around.
Today we are publishing griddy-forge, a document ingestion engine we built from scratch to address both problems: speed and fidelity. The paper, "griddy-forge: Benchmark-Driven Document Ingestion for AI Spreadsheet Workflows", is now available on Zenodo.
What griddy-forge does
griddy-forge converts documents into structured text optimized for LLM consumption. It supports PDFs, DOCX, XLSX, XLS, CSV, and TSV — the file types people actually work with in spreadsheet workflows.
The engine includes two rendering modes:
- LLM-text — structured output optimized for agent reasoning, with lower token footprint
- Markdown — standard markdown output for parser-to-parser comparisons
Both modes replace the MarkItDown path. In the benchmarks below, griddy-forge scores higher on quality in every lane and is faster in every lane except the holdout PDFs.
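To make the footprint difference between a Markdown rendering and a more compact rendering concrete, here is a minimal sketch. It is an assumption-heavy illustration: the tab-separated "compact" form below is hypothetical and is not griddy-forge's actual LLM-text format, the function names are invented for this example, and character count is only a rough proxy for token count.

```python
# A tiny table, used only for illustration.
rows = [
    ["Region", "Q1", "Q2"],
    ["North", "1200", "1350"],
    ["South", "980", "1010"],
]


def to_markdown(rows: list[list[str]]) -> str:
    """Render rows as a pipe table with a header separator line."""
    header, *body = rows
    lines = [
        "| " + " | ".join(header) + " |",
        "|" + "|".join("---" for _ in header) + "|",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)


def to_compact(rows: list[list[str]]) -> str:
    """Hypothetical compact rendering: one tab-separated line per row."""
    return "\n".join("\t".join(row) for row in rows)


md = to_markdown(rows)
compact = to_compact(rows)
saving = 1 - len(compact) / len(md)  # fraction of characters saved
print(f"markdown: {len(md)} chars, compact: {len(compact)} chars, {saving:.0%} fewer")
```

The point of the sketch is only that table decoration (pipes, padding, separator rows) costs prompt space without adding information the agent needs, which is the kind of overhead a purpose-built LLM rendering can avoid.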
The product result
The benchmark results that matter most are the ones closest to real Griddy usage.
On Griddy workload files (prompt-library CSVs, real spreadsheet fixtures):
| Metric | griddy-forge | MarkItDown |
|---|---|---|
| Latency | 9.2 ms | 167.6 ms |
| Quality | 1.000 | 0.853 |
| Token footprint | 30.1% smaller | baseline |
That is an 18.2x speedup, a perfect quality score on this lane, and a 30.1% smaller prompt footprint.
On Griddy-derived documents (internal-style PDFs and DOCX briefs):
| Metric | griddy-forge | MarkItDown |
|---|---|---|
| Latency | 237.1 ms | 710.4 ms |
| Quality | 0.988 | 0.913 |
A 3x speedup with higher fidelity on the kinds of documents Griddy users actually upload.
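The headline multipliers follow directly from the latency rows in the two tables above. As a quick check, a few lines of Python reproduce them. The `speedup` helper and the dictionary layout are illustrative names, not part of griddy-forge.

```python
# Latencies in milliseconds, copied from the two tables above,
# stored as (griddy-forge, MarkItDown) per benchmark lane.
LATENCY_MS = {
    "Griddy workload files": (9.2, 167.6),
    "Griddy-derived documents": (237.1, 710.4),
}


def speedup(candidate_ms: float, baseline_ms: float) -> float:
    """Return how many times faster the candidate is than the baseline."""
    return baseline_ms / candidate_ms


for lane, (forge_ms, markitdown_ms) in LATENCY_MS.items():
    print(f"{lane}: {speedup(forge_ms, markitdown_ms):.1f}x faster")
# Griddy workload files: 18.2x faster
# Griddy-derived documents: 3.0x faster
```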
Common document results
griddy-forge is not just faster on Griddy-specific files. Across standard document types:
- Public office files (DOCX, XLS, XLSX, CSV): 18.6x faster, quality 1.000 vs 0.993
- Synthetic DOCX (comments, nested tables, hyperlinks): 34.2x faster, quality 1.000 vs 0.892
- Synthetic spreadsheets (formulas, dates, multi-sheet): 19.8x faster, quality 1.000 vs 0.833
The speed differences are not rounding errors. On common office documents, griddy-forge typically finishes in single-digit milliseconds where MarkItDown takes hundreds.
Public PDF benchmarks
PDF extraction is the hardest lane. griddy-forge was evaluated against MarkItDown, PyMuPDF4LLM, and LiteParse on public government and academic PDFs.
On the 25-document standard public PDF lane (IRS, NOAA, FDIC, NSF, NIST, ACL, CISA, and GovInfo sources):
| Engine | Latency | Quality |
|---|---|---|
| griddy-forge | 3.70 s | 0.946 |
| MarkItDown | 4.02 s | 0.672 |
| PyMuPDF4LLM | 19.62 s | 0.644 |
| LiteParse | 31.08 s | 0.647 |
On the 12-document hard PDF lane (dense digital-text PDFs):
| Engine | Latency | Quality |
|---|---|---|
| griddy-forge | 2.74 s | 0.921 |
| MarkItDown | 3.43 s | 0.672 |
| PyMuPDF4LLM | 10.98 s | 0.647 |
| LiteParse | 15.48 s | 0.628 |
The quality gap is the headline: +0.274 on standard PDFs and +0.249 on hard PDFs versus MarkItDown. That is the difference between an agent that understands a document's structure and one that partially garbles it.
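The gaps quoted above can be recomputed from the two PDF tables. The sketch below does that for every engine, not only MarkItDown. The data layout and the `compare` function are hypothetical names used only for this illustration.

```python
# (latency in seconds, quality score) per engine, copied from the tables above.
STANDARD_LANE = {
    "griddy-forge": (3.70, 0.946),
    "MarkItDown": (4.02, 0.672),
    "PyMuPDF4LLM": (19.62, 0.644),
    "LiteParse": (31.08, 0.647),
}
HARD_LANE = {
    "griddy-forge": (2.74, 0.921),
    "MarkItDown": (3.43, 0.672),
    "PyMuPDF4LLM": (10.98, 0.647),
    "LiteParse": (15.48, 0.628),
}


def compare(lane: dict, reference: str = "griddy-forge") -> dict:
    """Map each other engine to (quality gap, latency ratio) versus the reference."""
    ref_latency, ref_quality = lane[reference]
    return {
        engine: (round(ref_quality - quality, 3), round(latency / ref_latency, 2))
        for engine, (latency, quality) in lane.items()
        if engine != reference
    }


for name, lane in (("standard", STANDARD_LANE), ("hard", HARD_LANE)):
    for engine, (gap, ratio) in compare(lane).items():
        print(f"{name:8} vs {engine:12} quality +{gap:.3f}, {ratio:.2f}x the latency")
```

For MarkItDown this yields quality gaps of 0.274 on the standard lane and 0.249 on the hard lane, matching the figures in the text.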
Honest holdout results
The paper includes a frozen 10-document holdout lane that was never tuned against, which makes it the fairest test of how the engine handles unseen documents:
- griddy-forge quality: 0.669 vs MarkItDown: 0.547 (+0.122)
- griddy-forge is slightly slower on these documents (2.48 s vs 2.28 s)
Quality improves across all holdout categories — research reports, scientific articles, legal opinions, and multi-column technical papers — but latency is not yet a win on difficult unseen PDFs. The paper reports this directly rather than burying it.
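To put the holdout trade-off in relative terms, the two figures above can be turned into percentages. This is only arithmetic on the published numbers, and the variable names are illustrative.

```python
# Holdout lane figures quoted above.
forge_quality, markitdown_quality = 0.669, 0.547
forge_latency_s, markitdown_latency_s = 2.48, 2.28

quality_gain = forge_quality - markitdown_quality             # absolute gain
relative_gain = quality_gain / markitdown_quality             # gain relative to MarkItDown
latency_penalty = forge_latency_s / markitdown_latency_s - 1  # fraction slower

print(f"quality: +{quality_gain:.3f} ({relative_gain:.1%} relative)")
print(f"latency: {latency_penalty:.1%} slower")
# quality: +0.122 (22.3% relative)
# latency: 8.8% slower
```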
Why this matters for Griddy users
Every time you upload a file to Griddy, the quality of what the AI sees depends on ingestion. Faster ingestion means faster responses. Higher fidelity means fewer mistakes when the agent reads your tables, parses your numbers, or follows your document structure.
griddy-forge makes both of those better, and the improvements are largest on the exact file types Griddy users work with most: spreadsheets, CSVs, and business documents.
The Griddy way
You do not need to think about document ingestion at all. Upload a PDF, a spreadsheet, or a DOCX to Griddy and the agent handles the rest — now backed by griddy-forge instead of MarkItDown.
"Import the data from this PDF into my spreadsheet and clean up the formatting"
griddy-forge reads the document. Griddy does the work.