Converting PDF to PDF/A

pdf

Given a PDF of random origin, how do I, on Linux:

  • confirm whether it is in PDF/A format already?
  • if it is not in PDF/A format, convert it to PDF/A with a minimum loss of fidelity?

I am aware that the conversion may cause loss of exotic elements of the document, but let's assume that the ability to open the document at all in a relatively far future is more important than such spiffy features (which might not be available/readable at such a time anyway). I would rather be able to visually confirm the accuracy of the conversion when I can trivially view the documents side by side than risk not being able to open the original file.

Best Answer

Identification

I found DROID (Digital Record and Object Identification), which looks like the right tool for identifying PDF/A files. It's Java-based and can be run from a GUI or the command line.

excerpt

DROID is a software tool developed by The National Archives to perform automated batch identification of file formats. Developed by its Digital Preservation Department as part of its broader digital preservation activities, DROID is designed to meet the fundamental requirement of any digital repository to be able to identify the precise format of all stored digital objects, and to link that identification to a central registry of technical information about that format and its dependencies.

Given that it's sponsored by the National Archives, I would assume it's the right tool for this, considering the intended purpose of the PDF/A format. The project is also open source: the code is available on GitHub, and binaries are available from the National Archives website.
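As a quick first pass before reaching for a full identification tool, you can check whether a file even claims to be PDF/A. PDF/A files declare their conformance in the XMP metadata with a `pdfaid:part` entry, so a plain grep can spot the claim. This is only a rough sketch: the function names `claims_pdfa` and `report_pdfa` are made up for illustration, it assumes the XMP packet is stored uncompressed, and a declaration is not proof of conformance.

```shell
# Heuristic only: checks whether a file *declares* PDF/A conformance in its
# XMP metadata. It does not validate the file against the standard.
claims_pdfa() {
    # -a: treat the binary PDF as text; -q: report only via the exit status.
    LC_ALL=C grep -a -q 'pdfaid:part' -- "$1"
}

# Print a one-line verdict for each file given as an argument.
report_pdfa() {
    for f in "$@"; do
        if claims_pdfa "$f"; then
            echo "$f: declares PDF/A (claim only, not validated)"
        else
            echo "$f: no PDF/A declaration found"
        fi
    done
}
```

Calling `report_pdfa *.pdf` in a directory gives one line per file. Anything that declares PDF/A should still go through a real validator.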

Validation & Conversion

If you're looking for a tool to perform validation and conversion, I believe PDFBox can do this. PDFBox lists PDF/A validation right on the front page of its website. It's another Java application 8-).

excerpt from website

PDF/A Validation
Validate PDFs against the PDF/A ISO standard.

Under the command-line tools section on the left of their main page, they show the following usage for the tool:

$ java -jar pdfbox-app-x.y.z.jar org.apache.pdfbox.ConvertColorspace [OPTIONS] <inputfile> <outputfile>

veraPDF is another tool capable of validating PDF/A; it is part of the Open Preservation Foundation’s reference tool set. It’s also a Java application.
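A veraPDF check might look something like the sketch below. The wrapper name `validate_pdfa` and the default flavour `1b` are my own choices for illustration. The `--flavour` and `--format text` options are what I understand the veraPDF command-line launcher to accept, so confirm them against `verapdf --help` on your install before relying on this.

```shell
# Validate one PDF against a PDF/A flavour with veraPDF (sketch).
# Usage: validate_pdfa FILE [FLAVOUR]   e.g. validate_pdfa doc.pdf 1b
validate_pdfa() {
    file=$1
    flavour=${2:-1b}    # assumed default; pick the flavour you are targeting
    if ! command -v verapdf >/dev/null 2>&1; then
        echo "verapdf not found in PATH" >&2
        return 127
    fi
    # --flavour selects the validation profile; --format text asks for a
    # short plain-text verdict. Check both against `verapdf --help`.
    verapdf --flavour "$flavour" --format text "$file"
}
```

The function bails out with status 127 when the `verapdf` launcher is not installed, so a batch script can tell "tool missing" apart from "file failed validation".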

Conversion

For just doing conversion, I found this method in a blog post titled "Free way to convert an existing PDF to PDF/A", which uses the following tools:

  • Ghostscript 8.64 only
  • PDFBox 0.7.3
  • pdfmarks (a file that supplies additional metadata)
  • PDFA_def.ps
  • USWebCoatedSWOP.icc

With the above in place you use the following command:

$ gs -sDEVICE=pdfwrite -q -dNOPAUSE -dBATCH -dNOSAFER     \
-dPDFA -dUseCIEColor -sProcessColorModel=DeviceCMYK       \
-sOutputFile=Out_PDFA.pdf PDFA_def.ps pdfmarks IN_PDF.pdf

It isn't without its warts. The article discusses one of them: the print flags on hyperlinks need fixing. It provides a Java application that you can use to fix these:

$ java FixPrintFlag Out_PDFA.pdf New_verifiablePDFA.pdf

It's not pretty but appears to be workable. See the article for more details.
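If you have more than one file to convert, the Ghostscript step can be wrapped in a small function that fails early when a helper file is missing. This is just a sketch: the name `to_pdfa` and the exit status 2 are made up, the flags are copied unchanged from the command above, and it assumes you run it from the directory that holds `PDFA_def.ps` and `pdfmarks`. The ICC profile is not checked, because its path is set inside `PDFA_def.ps`.

```shell
# Wrap the Ghostscript step so it fails early when a helper file is missing.
# Usage: to_pdfa IN.pdf OUT.pdf   (run from the directory holding the helpers)
to_pdfa() {
    src=$1
    dst=$2
    for needed in "$src" PDFA_def.ps pdfmarks; do
        if [ ! -f "$needed" ]; then
            echo "missing required file: $needed" >&2
            return 2
        fi
    done
    # Same flags as the command in the answer; only the file names vary.
    gs -sDEVICE=pdfwrite -q -dNOPAUSE -dBATCH -dNOSAFER \
       -dPDFA -dUseCIEColor -sProcessColorModel=DeviceCMYK \
       -sOutputFile="$dst" PDFA_def.ps pdfmarks "$src"
}
```

For example, `to_pdfa IN_PDF.pdf Out_PDFA.pdf` reproduces the original command, and the result still needs the FixPrintFlag pass described in the article.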
