
Ingestion Is the Whole Game

Why a cheaper, faster, multimodal model still doesn't fix the foundation — and why we had to build our own.

On 28 April, NVIDIA released Nemotron 3 Nano Omni. It is a 30-billion-parameter hybrid multimodal model with native vision, native audio, a long context window, and per-page inference costs around 700× lower than the frontier closed model we currently use for structured extraction. On paper, it is exactly the model we have been waiting for.

We tested it carefully on the documents that actually matter to our clients. It failed. Not catastrophically, and not in the ways a benchmark leaderboard would surface — but in exactly the way we have watched every other ingestion tool fail.

This is not a post about Nemotron. The lesson is older and more important than any one model. Ingestion — the step where a document becomes structured data that an AI can reason over — is the single most consequential layer in the entire document-intelligence pipeline. Almost nothing on the market does it well enough for high-stakes work. And it is the layer that most teams treat as a commodity.

What we had already tried

We have been trying to outsource ingestion for two years. We have used Mistral OCR. We have evaluated Google Document AI. We have tested unstructured.io. We have tried the obvious hybrid pattern — pass the document through a cheap OCR layer, then hand the result to a frontier reasoning model and let the frontier model recover what the OCR layer missed.

None of it works on the documents our clients send us.

Mistral OCR is the best of the pure text-based options. On a page of clean prose it is excellent and inexpensive. On a financial table, the structure is gone the moment it has finished. Mistral does not know what a table is in any sense the downstream system can use. It produces a linearised representation of the cells — text, in order, with some markdown around it — and once the structure has been flattened, no LLM behind it can put it back. Claude Sonnet cannot. GPT cannot. We tried both, repeatedly.

The hybrid pattern fails for the same reason. By the time the cheap layer has finished, the frontier model is not reconstructing the table from the document. It is guessing at the table from a corrupted representation of the document. The strongest reasoning model in the world cannot recover information that has already been destroyed upstream.
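A toy illustration of why flattening is irreversible, under stated assumptions: the two tables below are invented for this example, and `SpanHeader`, `Table`, `linearise`, and `header_path` are hypothetical names, not part of any tool discussed here. The tables differ only in how far each top-level header spans, which is exactly the information a pipe-separated text dump has no way to carry.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class SpanHeader:
    label: str
    span: int  # number of leaf columns this top-level header covers


@dataclass
class Table:
    top: List[SpanHeader]              # top header row, with column spans
    leaf: List[str]                    # second header row, one label per column
    rows: List[Tuple[str, List[str]]]  # (row label, cell texts)


def linearise(t: Table) -> str:
    """Emit cells in reading order as pipe-separated text; spans are dropped."""
    lines = [" | ".join(h.label for h in t.top), " | ".join(t.leaf)]
    lines += [" | ".join([label, *cells]) for label, cells in t.rows]
    return "\n".join(lines)


def header_path(t: Table, col: int) -> Tuple[str, str]:
    """Resolve which top-level header governs a given leaf column."""
    start = 0
    for h in t.top:
        if start <= col < start + h.span:
            return (h.label, t.leaf[col])
        start += h.span
    raise IndexError(f"column {col} is not covered by any top-level header")


# Invented data: same cells, same reading order, different header spans.
leaf = ["Shares", "Par value", "Shares", "Par value"]
rows = [("Outstanding", ["100", "0.01", "40", "0.05"])]
table_a = Table([SpanHeader("Class A", 2), SpanHeader("Class B", 2)], leaf, rows)
table_b = Table([SpanHeader("Class A", 1), SpanHeader("Class B", 3)], leaf, rows)

same_text = linearise(table_a) == linearise(table_b)  # identical flattened text
path_a = header_path(table_a, 1)  # ("Class A", "Par value")
path_b = header_path(table_b, 1)  # ("Class B", "Par value")
```

Both tables flatten to the same string, yet the value in the second column belongs to a different class of stock in each. A model that only sees the flattened text can pick one reading, but it cannot know which one the page intended.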

What we found with Nemotron

When NVIDIA released Nemotron 3 Nano Omni at the end of April, we ran it through the same evaluation we run everything else through. We picked a representative test page: a securities-registration table from a recent 10-K. Hierarchical column headers. The legal caption "Securities registered pursuant to Section 12(b) of the Act." Trading symbols across multiple classes of stock. Footnotes. A layout that spans the page non-trivially. The kind of page that, in a real diligence workflow, sits silently in the middle of a 250-page filing and either anchors the entire analysis or quietly poisons it.

Nemotron can see. The raw OCR is clean. Glyphs render correctly. Currency, headings, and special characters, including the ☒ and ☐ checkbox marks that appear on cover pages, come through. Hosted on Nebius Token Factory with strict JSON-schema enforcement enabled, the model produced valid, parseable output every time. No markdown fences. No hallucinated fields. Latency in the 15–23 second range per page. The schema compliance was, frankly, the best we have seen from an open multimodal model.

Then we re-ran the same page.

The element count changed — one table on one run, two on the next. Rows split differently. The table title came back, on the first call, as a column header. On the next call, as a synthesised name the model had invented. On the call after that, as the actual legal caption. The schema was rock-solid. The semantics were drifting between calls.
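One way to make this kind of drift measurable is to re-run the same page several times and compare a fingerprint of the fields that carry meaning, rather than checking schema validity. The sketch below is a minimal version of that idea and not our evaluation harness: the output shape (`tables`, `title`, `title_role`, `rows`), the function names, and the three canned outputs are all assumptions made up for the example.

```python
import hashlib
import json
from collections import Counter
from typing import Any, Callable, Dict


def semantic_fingerprint(extraction: Dict[str, Any]) -> str:
    """Hash only the meaning-bearing fields: table count, titles, title roles, row counts."""
    summary = [
        {
            "title": t.get("title"),
            "title_role": t.get("title_role"),
            "n_rows": len(t.get("rows", [])),
        }
        for t in extraction.get("tables", [])
    ]
    payload = json.dumps(summary, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def stability(extract: Callable[[], Dict[str, Any]], runs: int = 5) -> float:
    """Fraction of runs that agree with the most common fingerprint (1.0 = fully stable)."""
    prints = [semantic_fingerprint(extract()) for _ in range(runs)]
    return Counter(prints).most_common(1)[0][1] / runs


# Canned, invented outputs standing in for three calls on the same page.
# All three would pass the same JSON schema.
canned = iter([
    {"tables": [{"title": "Title of each class", "title_role": "column_header",
                 "rows": [["a"], ["b"], ["c"]]}]},
    {"tables": [{"title": "Registered Securities Table", "title_role": "synthesised",
                 "rows": [["a"], ["b"]]},
                {"title": None, "title_role": None, "rows": [["c"]]}]},
    {"tables": [{"title": "Securities registered pursuant to Section 12(b) of the Act",
                 "title_role": "caption", "rows": [["a"], ["b"], ["c"]]}]},
])

drifting_score = stability(lambda: next(canned), runs=3)

fixed = {"tables": [{"title": "T", "title_role": "caption", "rows": [["a"]]}]}
stable_score = stability(lambda: fixed, runs=3)
```

With the canned outputs, no two runs agree, so the drifting score is one third, while an extractor that returns the same structure every time scores 1.0. A real harness would call the model inside `extract` and would likely fingerprint header hierarchies and cell attribution as well.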

That is the part that matters. Schema validity is table stakes. What an ingestion layer has to produce, for the system above it to do its job, is not "valid JSON." It is structure: which value belongs to which entity, which header dominates which sub-header, what an empty cell means, what unit a number is expressed in, which footnote modifies which figure, whether two rows are siblings or one is a child of the other. None of those questions is answered by a JSON layout of the page. And Nemotron, like every visually trained model we have evaluated, is not optimising for them.

This is not a Nemotron problem

Every model in this category is being trained against the wrong objective. The reward function, implicitly or explicitly, rewards a faithful reconstruction of what the page looks like. Bounding boxes. Layout fidelity. Glyph accuracy. Table outlines. That is the wrong target.

A downstream reasoning model does not want a visual reproduction of the page. It wants the structure that the visual layout was trying to communicate. A pixel-faithful but structurally confused output is not a partial success. It is worse than no extraction at all, because it hands the model above it confident-looking inputs that lead to confidently wrong answers. The user never sees the failure. The reasoning trace looks fine. The number is wrong.

This is the failure mode we keep finding. Not in Nemotron specifically. In Mistral OCR. In Google Document AI on tables it does not have a template for. In unstructured.io on filings that do not match its parsers. In every visually trained ingestion tool we have evaluated against real client documents.

Where the failure goes

Document intelligence is a pipeline, and ingestion is the foundation. Every layer above it inherits its errors. If the ingestion layer hands a frontier reasoning model a corrupted table — wrong row groupings, wrong header attribution, a number assigned to the wrong column — the reasoning is irrelevant. The model is reasoning, eloquently and at considerable cost, over the wrong facts.

The most expensive mistake in a due diligence workflow is not at the answer layer. It is at the ingestion layer. The answer-layer mistake is visible: a user can read the summary, push back, ask for sources, request a second opinion. The ingestion mistake is silent. The user sees a confident, well-formatted analysis. They do not see that the figure it relies on was lifted from the wrong row two steps earlier in the pipeline. The footnote that would have changed the conclusion was attached to the wrong line item before the model ever began to reason.

That is the failure mode we are most paranoid about. Not the model that hallucinates. The model that confidently summarises a table it never correctly parsed.

What people actually do today

Most of the document-intelligence products on the market are a model with a drive connector. Drop in a PDF. Ask a question. Trust the answer.

On clean, text-native PDFs with no tables, this works. On the documents our clients send us, it does not. A 10-K. A Word document where a table spans three pages and Word's own renderer breaks the spans across page boundaries. An HTML filing from a regulator with idiosyncratic element nesting. An XBRL submission, where the structured data and the human-readable filing have diverged. A board pack assembled from PowerPoint exports, PDF scans, and the occasional photographed annex. A side letter that arrives as a phone-camera capture of a printout.

"Connect your drive and ask the model" is not a document-intelligence strategy. It is a demo. It works until the document gets interesting, and then it fails silently.

What ingestion actually has to do

The objective of an ingestion layer is not to reproduce the page. It is to produce a representation of the page that lets the system above it reason without ambiguity about: which value belongs to which entity; which header dominates which sub-header, across any depth of nesting; what an empty cell means in context (zero, not applicable, redacted, suppressed); what unit a figure is expressed in, and which surrounding text modifies that unit; which footnote applies to which value; whether two rows are siblings, or whether one is a child of the other; when a table continues on the next page, and when a new table begins; when a value is part of a totalling chain, and when it is not.
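To make those requirements concrete, here is a rough sketch of a representation in which each of those questions has an explicit answer, plus two checks that flag where an answer is missing or inconsistent. It is illustrative only and is not the Vela-Ingestion schema: every class, field, and function name is hypothetical, the figures are invented, and table continuation across pages is left out for brevity.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional, Tuple


class EmptyMeaning(Enum):
    ZERO = "zero"
    NOT_APPLICABLE = "not_applicable"
    REDACTED = "redacted"
    SUPPRESSED = "suppressed"


@dataclass
class Cell:
    entity: str                       # row entity the value belongs to
    header_path: Tuple[str, ...]      # full header chain, outermost first
    value: Optional[float]            # None when the cell is empty
    unit: Optional[str] = None        # e.g. "USD millions"
    empty_meaning: Optional[EmptyMeaning] = None
    footnotes: List[str] = field(default_factory=list)


@dataclass
class Row:
    label: str
    parent: Optional[str] = None      # parent row label; None for top-level rows
    total_of: List[str] = field(default_factory=list)  # rows this one sums


def ambiguities(cells: List[Cell], rows: List[Row]) -> List[str]:
    """List every place where a downstream model would have to guess."""
    labels = {r.label for r in rows}
    problems = []
    for c in cells:
        where = f"{c.entity} {'/'.join(c.header_path)}"
        if c.entity not in labels:
            problems.append(f"{where}: value attached to unknown row")
        if c.value is None and c.empty_meaning is None:
            problems.append(f"{where}: empty cell with no stated meaning")
        if c.value is not None and c.unit is None:
            problems.append(f"{where}: value with no unit")
    for r in rows:
        if r.parent is not None and r.parent not in labels:
            problems.append(f"{r.label}: parent row {r.parent!r} not found")
    return problems


def total_mismatches(cells: List[Cell], rows: List[Row], tol: float = 0.5) -> List[str]:
    """Check each declared totalling chain against the sum of its parts."""
    by_key = {(c.entity, c.header_path): c.value for c in cells}
    out = []
    for r in rows:
        for path in sorted({c.header_path for c in cells}) if r.total_of else []:
            total = by_key.get((r.label, path))
            parts = [by_key.get((child, path)) for child in r.total_of]
            if total is None or any(p is None for p in parts):
                continue  # incomplete chain; ambiguities() reports the gap
            if abs(total - sum(parts)) > tol:
                out.append(f"{r.label} {'/'.join(path)}: {total} != {sum(parts)}")
    return out


# Invented example: one parent row with two children, one unexplained empty cell.
rows = [Row("Revenue", total_of=["Product", "Services"]),
        Row("Product", parent="Revenue"), Row("Services", parent="Revenue")]
cells = [Cell("Revenue", ("FY2024",), 100.0, "USD millions"),
         Cell("Product", ("FY2024",), 60.0, "USD millions"),
         Cell("Services", ("FY2024",), None)]
problems = ambiguities(cells, rows)
```

Here `problems` contains a single entry for the Services cell, because the extraction does not say whether the blank means zero, not applicable, redacted, or suppressed. The point of the sketch is the shape of the contract: anything the downstream system would otherwise have to infer is either stated or reported as missing.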

None of those questions are answered by visual reconstruction. They are answered by structural understanding, and structural understanding has to be designed into the pipeline. It does not emerge from a multimodal model being trained on more pixels.

Why we built Vela-Ingestion

We did not want to build our own ingestion layer. We tried to avoid it. We tested every credible alternative on the market — open-source, commercial, hosted, on-premise, single-model, multi-stage. We tried hybrids. We tried cheap-then-expensive. We tried fine-tuning. None of it cleared the bar we needed to clear for the documents our clients work on.

So we built Vela-Ingestion. We built it because we had to, not because we wanted to.

It is vision-based, like the better tools on the market — for the reasons described in our earlier piece on documents as attack vectors, you do not want any extraction surface other than what is visibly rendered on the page. But it is not vision-as-reconstruction. It is vision in service of a rigid, structured extraction workflow that knows what it is looking at and what the downstream system needs. The objective is not to make a JSON copy of the page. The objective is to produce the smallest, most unambiguous representation that preserves everything a downstream model will need to reason without guessing.

I am not going to describe the recipe. It took us a long time to get here, and it is one of the parts of Vela we are most careful about. What I will say is that across every client document set we have processed — financial statements, registration filings, contracts, term sheets, regulatory submissions, board packs, ESG disclosures — it has not failed us. Not on a table that spans three pages. Not on a column header that fans out across two rows. Not on an XBRL filing whose structured data disagrees with the rendered version. Not on a scanned annex stapled into a clean digital PDF. That is not a claim I would have made about anything else we tested, and it is not a claim I make lightly.

The forward-looking note

Open multimodal models will keep getting better. Nemotron 4 will be better than Nemotron 3. The cost curve will keep collapsing — today's 700× gap will likely become 7,000× before long. None of that, on its own, will fix the problem this piece describes.

The problem is not model quality. It is the problem statement the model is being asked to solve. A model trained to reconstruct visible pages will never, on its own, be a reliable ingestion layer for high-stakes documents. The fix is not a better OCR. The fix is a different objective.

If a model provider eventually trains an ingestion model whose reward function is structural fidelity rather than visual fidelity — one that is graded on whether a downstream reasoning model can answer the right question, not on whether a human evaluator can recognise the page — we will be the first to swap our own pipeline out for it. Until then, we will keep building.
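The difference between the two objectives can be shown with two toy metrics. This is a sketch under our own assumptions, not an established benchmark: `glyph_fidelity` and `structural_fidelity` are hypothetical names, and the ground truth and prediction are invented. The first metric asks whether the right strings came off the page; the second asks whether a lookup by entity and header path returns the right value.

```python
from collections import Counter
from typing import Dict, Tuple

Key = Tuple[str, Tuple[str, ...]]  # (entity, header path)
Extraction = Dict[Key, str]


def glyph_fidelity(truth: Extraction, pred: Extraction) -> float:
    """Share of ground-truth cell strings present in the prediction, ignoring position."""
    t, p = Counter(truth.values()), Counter(pred.values())
    return sum((t & p).values()) / sum(t.values())


def structural_fidelity(truth: Extraction, pred: Extraction) -> float:
    """Share of (entity, header path) lookups that return the ground-truth value."""
    correct = sum(1 for key, value in truth.items() if pred.get(key) == value)
    return correct / len(truth)


# Invented ground truth for a two-row, two-column table.
truth: Extraction = {
    ("Class A", ("Shares",)): "100",
    ("Class A", ("Par value",)): "0.01",
    ("Class B", ("Shares",)): "40",
    ("Class B", ("Par value",)): "0.05",
}

# Invented prediction: every glyph is right, but the Class B values are swapped.
pred: Extraction = dict(truth)
pred[("Class B", ("Shares",))] = "0.05"
pred[("Class B", ("Par value",))] = "40"

glyph_score = glyph_fidelity(truth, pred)
structure_score = structural_fidelity(truth, pred)
```

On this example the glyph score is 1.0 and the structural score is 0.5: a perfect reading of the page, and half of the questions a downstream model could ask answered wrongly. A training signal built on the first number has no reason to fix the second.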

Why this matters

My colleague Arnaud Blandin argued, in his piece on the governed system of action, that the integrity of any system of action depends on the integrity of what is fed into it. I agree with him, and I want to make it more specific.

The single highest-leverage point in any document-driven AI system is the place where unstructured input becomes structured input. Everything above it is downstream of it. Everything above it inherits its errors. Get that layer wrong, and no model, no reasoning trace, no governance plane, no human-in-the-loop further down the pipeline can fully recover. The error has already entered the record.

Ingestion is not a commodity. It is not a checkbox. It is the foundation, and it is where most of the failure modes that show up later in a due-diligence workflow were actually born.

It is the whole game.

Vela Intelligence builds decision intelligence infrastructure for regulated, high-stakes environments. Vela-Ingestion is the structured-extraction layer at the bottom of that stack: vision-based, workflow-shaped, and built to make every layer above it trustworthy. For a walkthrough on real documents, write to contact@velaintelligence.com.
