The PDF Arlington Model covers the requirements of PDF object model as specified in ISO 32000-2:2020 (PDF 2.0, including resolved errata) as well as some (but not all!) aspects from earlier Adobe PDF references and various extensions (identified by the predicate fn:Extension(...)
). It is based on its own formal grammar serialized as a set of TSV files (see also the Arlington Model Notes). References to Tables and clauses below all refer to ISO 32000-2:2020.
veraPDF adds the support for this model by translating TSV files into its own validation profile based on veraPDF formal grammar for validation rules.
This repository implements the Java-based utility performing this translation.
The Arlington PDF model utilizes an expanded set of types in order to more precisely define PDF objects and data integrity relationships. veraPDF fully supports this expanded set and implements certain additional checks. These checks include:
- all dictionary keys need to be direct objects as is required by ISO 32000-2:2020 otherwise veraPDF will terminate processing with an exception message:
Exception: Caught unexpected runtime exception during validation caused by exception: Wrapp org.verapdf.exceptions.VeraPDFParserException: Error while parsing object X Y ... invalid pdf dictionary (object key - X Y obj, offset ZZZZ)
. dates
must be a valid PDF date stringrectangle
must be a PDF array with precisely 4 entriesmatrix
must be a PDF array with precisely 6 numeric entriesbitmask
must be a PDF integer objectstring-ascii
must contain only 7-bit ASCII characters (<= 0x7F)name-tree
is a complex data structure (see Table 36), internally comprising various PDF objects:- NameTreeNode shall not contain entries except Kids, Limits, Names.
- Entry Kids in NameTreeNode shall have type Array.
- Entry Kids in NameTreeNode is required, when
fn:Not(fn:IsPresent(Names))
. - Entry Kids with type Array in NameTreeNode shall satisfy special case:
fn:Not(fn:IsPresent(fn:IsPresent(Names)))
- Entry Limits in NameTreeNode shall have type Array.
- Entry Names in NameTreeNode shall have type Array.
- Entry Names in NameTreeNode is required, when
fn:Not(fn:IsPresent(Kids))
. - Entry Names with type Array in NameTreeNode shall satisfy special case:
fn:Not(fn:IsPresent(fn:IsPresent(Kids)))
- NameTreeNodeLimitsArray shall contain exactly 2 elements.
- Entry 0 in NameTreeNodeLimitsArray shall have type String.
- Entry 1 in NameTreeNodeLimitsArray shall have type String.
- NameTreeNodeNamesArray shall contain 2 * n elements.
- NameTreeNodeNamesArraySubArray shall contain exactly 2 elements check.
- Entry 0 in NameTreeNodeNamesArraySubArray shall have type String.
- Entry in NameTreeNodesArray shall have type Dictionary.
number-tree
is a complex data structure (see Table 37), internally comprising various PDF objects:- NumberTreeNode shall not contain entries except Kids, Limits, Nums.
- Entry Kids in NumberTreeNode shall have type Array
- Entry Kids in NumberTreeNode is required, when
fn:Not(fn:IsPresent(Nums))
- Entry Kids with type Array in NumberTreeNode shall satisfy special case:
fn:Not(fn:IsPresent(fn:IsPresent(Nums)))
- Entry Limits in NumberTreeNode shall have type Array
- Entry Nums in NumberTreeNode shall have type Array
- Entry Nums in NumberTreeNode is required, when
fn:Not(fn:IsPresent(Kids))
- Entry Nums with type Array in NumberTreeNode shall satisfy special case:
fn:Not(fn:IsPresent(fn:IsPresent(Kids))) && fn:ArraySortAscending(Nums,2)
- NumberTreeNodeLimitsArray shall contain exactly 2 elements
- Entry 0 in NumberTreeNodeLimitsArray shall have type Integer
- Entry 1 in NumberTreeNodeLimitsArray shall have type Integer
- NumberTreeNodeNumsArray shall contain 2 * n elements
- NumberTreeNodeNumsArraySubArray shall contain exactly 2 elements
- Entry 0 in NumberTreeNodeNumsArraySubArray shall have type Integer
- Entry in NumberTreeNodesArray shall have type Dictionary
Note however that veraPDF does not perform any checks against architectural limitations or previous limitations noted in older PDF specifications (such as assumptions about the size of integers, the length of PDF names or strings, etc).
The following Arlington predicates are not supported:
- AlwaysUnencrypted
- FontHasLatinChars
- Ignore
- IsLastInNumberFormatArray
- IsMeaningful
- KeyNameIsColorant
veraPDF adds extra logic to the Arlington model. Namely, it:
-
extends Arlington model for checking PDF Fields and Widgets by automatically generating new types
AnnotWidgetField
,AnnotWidgetFieldTx
,AnnotWidgetFieldCh
,AnnotWidgetFieldBtn
,AnnotWidgetFieldSig
that represent merged Widget annotation and Form field dictionaries (ISO 32000-2, 12.5.6.19). See pdf-association/arlington-pdf-model#28 for additional discussion. -
includes fixes of some known Arlington model issues:
- wrong use of fn:InNameTree predicate. See pdf-association/arlington-pdf-model#49
- incorrect required condition for AS entry in Widget annotations. See pdf-association/arlington-pdf-model#61
- null in array of structure elements. See pdf-association/arlington-pdf-model#90
-
supports some future Arlington model predicates:
- ValueOnlyWhen. See pdf-association/arlington-pdf-model#74
- IsInArray. See pdf-association/arlington-pdf-model#114
-
contains additional rule about PageLabel number tree. See pdf-association/arlington-pdf-model#118
-
custom structure attribute names of the standard structure attribute owners (Table 376) defined by external specifications (such as
HTML-3.20
,HTML-4.01
,CSS-1
,CSS-2
, etc.) will be reported as deviations (i.e. veraPDF does not implement an implicit understanding of those external specifications) -
the PDF Metadata (clause 14.3) and Associated Files (clause 14.13) features are documented as being permitted on any PDF object, however ISO 32000-2:2020 only formally documents the presence of the Metadata and AF keys on a few specific objects. Because the Arlington model accurately reflects the specification as written, it only defines support for those explicit objects in the TSV data. As a result, veraPDF will report deviations when a Metadata or AF keys occurs in other objects, which may or may not be an issue.
- See the Arlington Model Notes and PDF Errata #403.
-
veraPDF does not check object numbers against the trailer Size entry and thus will not report object numbers that are invalid (see Table 15) - this is not correct according to the PDF specification.
- If veraPDF can’t find an object by its object number, it shows the log message "Trying to get object <object number> that is not present in the document" and considers that this object is null (which is correct according to the PDF specification).
-
veraPDF only operates as a "PDF 1.5 processor" and will always process both conventional cross-reference sections and all cross-reference streams that are present in hybrid-reference PDF files.
- Thus veraPDF will not report issues that a pre-PDF 1.5 processor may experience when processing the same hybrid-reference PDF file (i.e. when objects defined in cross-reference streams are not present). See clause 7.5.8.4.
-
the set of enabled extensions in the Arlington model (predicate
fn:Extension(...)
) will influence the deviations reported by veraPDF. Currently this is configurable in veraPDF via the CLI and GUI, but not yet the REST implementation.