A new open document format backed by IBM, NVIDIA, and Red Hat is being positioned as the AI-era successor to PDF. The LF AI & Data Foundation announced the formation of the DocLang Specification Working Group on June 9.
DocLang is an open, universal, AI-native format designed to standardise how enterprises prepare, exchange, and govern document data for AI systems. Its backers want it to play the role for machines that PDF has played for human readers.
“Our vision is for DocLang to become a broadly adopted international standard for AI-ready documents, providing a consistent representation for both humans and machines, much as PDF became the universal standard for document exchange in the human-centric era,” said Peter Staar, Principal Research Scientist and Manager at IBM Software.
The working group operates under the Linux Foundation's LF AI & Data body. Founding members IBM, NVIDIA, and Red Hat are joined by contributors ABBYY and HumanSignal.
It will run under the Joint Development Foundation's vendor-neutral, open governance model. Its goal is a specification supporting reliable, interoperable document processing across AI and agentic workflows.
A key feature for compliance teams is embedded governance controls. These help downstream systems enforce policies on privacy, extraction scope, and model training permissions.
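The announcement does not describe how these controls are encoded, so the following is only a minimal sketch of how a downstream system might act on them. Every name here (`Governance`, `allow_training`, `allow_extraction`, `contains_personal_data`, `permitted_for_training`) is a hypothetical placeholder, not part of the DocLang specification, and the sample documents are invented.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Governance:
    # Hypothetical flags; the real specification may model these differently.
    allow_training: bool
    allow_extraction: bool
    contains_personal_data: bool


@dataclass(frozen=True)
class Document:
    name: str
    governance: Governance


def permitted_for_training(doc: Document) -> bool:
    """Assumed policy: train only on opted-in documents without personal data."""
    g = doc.governance
    return g.allow_training and not g.contains_personal_data


def select_training_set(docs: List[Document]) -> List[str]:
    """Return the names of documents that pass the assumed training policy."""
    return [d.name for d in docs if permitted_for_training(d)]


# Invented sample documents, for illustration only.
docs = [
    Document("press-release", Governance(True, True, False)),
    Document("hr-record", Governance(True, True, True)),
    Document("contract", Governance(False, True, False)),
]

print(select_training_set(docs))
```

Under these assumptions, only the first document would be passed to a training pipeline: the second carries personal data and the third withholds training permission.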
The specification will also preserve semantic meaning and geometric layout in a single format. It encodes structural elements such as headings, paragraphs, and tables alongside their position on the page. The format is optimised for modern AI tokenisation and modelling approaches.
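As a rough illustration of what carrying structure and geometry together can mean, the sketch below pairs each element's type and text with its position on the page. It is not DocLang syntax, which the announcement does not show: the `Element` and `BoundingBox` types, their field names, and the sample content are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class BoundingBox:
    # Hypothetical geometry: page number plus coordinates measured from the top-left.
    page: int
    left: float
    top: float
    right: float
    bottom: float


@dataclass(frozen=True)
class Element:
    kind: str  # structural role, e.g. "heading", "paragraph", "table"
    text: str
    box: BoundingBox


def reading_order(elements: List[Element]) -> List[Element]:
    """Sort by page, then top-to-bottom, then left-to-right (a simple heuristic)."""
    return sorted(elements, key=lambda e: (e.box.page, e.box.top, e.box.left))


def outline(elements: List[Element]) -> List[str]:
    """Use the structural role to pull headings out in reading order."""
    return [e.text for e in reading_order(elements) if e.kind == "heading"]


# Invented sample page, deliberately listed out of order.
elements = [
    Element("paragraph", "Revenue grew in the second quarter.", BoundingBox(1, 72, 140, 540, 180)),
    Element("heading", "Quarterly Results", BoundingBox(1, 72, 90, 540, 120)),
    Element("table", "Region | Revenue", BoundingBox(1, 72, 200, 540, 320)),
]

print([e.kind for e in reading_order(elements)])
print(outline(elements))
```

Keeping both kinds of information on one record lets a consumer answer layout questions (where an element sits on the page) and semantic questions (which elements are headings) without consulting a second file.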
Enterprises currently work across a fragmented landscape of formats, including PDFs and JPEGs, built for human rather than AI consumption. The foundation said this disconnect can introduce complexity, raise costs, and reduce reliability when extracting meaning from business documents.
“Documents remain one of the most important sources of enterprise knowledge, but most were never designed for AI-driven workflows,” said Mark Collier, general manager of AI & Infrastructure at the Linux Foundation and executive director of LF AI & Data.
“With the launch of the DocLang Working Group, we are bringing the open source community together to develop a vendor-neutral, interoperable standard that helps organizations prepare document data for AI more reliably, transparently, and at scale,” Collier said.
Staar said DocLang draws on years of IBM research into document representation for AI, including OTSL for compact tables and DocTags for preserving structure.
“Together with our industry partners, we have distilled these lessons into DocLang, a new AI-native format for unstructured content,” he said.
Maxime Vermeir, Vice President, AI Strategy at ABBYY, said the format addresses a foundational problem in enterprise AI.
“Documents were built for humans, not machines,” he said.
“By introducing a minimal, standardized, and AI-native representation of document structure, layout, meaning and governance, DocLang creates a far more deterministic foundation for modern AI systems,” Vermeir said.
The working group builds on Docling, the open source document processing toolkit hosted by LF AI & Data. Docling was developed by the AI for Knowledge team at IBM Research Zurich and released as open source in 2024.
Docling converts formats including PDF, Word, PowerPoint, Excel, HTML, and images into structured, AI-ready outputs. DocLang complements it by defining an open standard for expressing and exchanging that structured output across systems. Together, the two projects span document ingestion, parsing, standardised representation, and downstream consumption by language models and agents.