Given a PDF of random origin, how do I, on Linux:
- confirm whether it is in PDF/A format already?
- if it is not in PDF/A format, convert it to PDF/A with a minimum loss of fidelity?
I am aware that the conversion may cause loss of exotic elements of the document, but let's assume that the ability to open the document at all in a relatively far future is more important than such spiffy features (which might not be available/readable at such a time anyway). I would rather be able to visually confirm the accuracy of the conversion when I can trivially view the documents side by side than risk not being able to open the original file.
Best Answer
Identification
DROID (Digital Record and Object Identification) looks like the right tool for identifying PDF/A files. It's Java-based and can be run from a GUI or from the command line.
Since it's sponsored by the National Archives, and PDF/A is a format intended for archiving, I would assume it's the right tool for the job. The project is also open source: the code is available on GitHub, and binary packages can be downloaded from the National Archives website.
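As a quick first check before reaching for a full tool, note that a PDF/A file declares its level in its XMP metadata through the `pdfaid:part` and `pdfaid:conformance` properties. The following is a minimal sketch of that check, not how DROID works; the function name `claimed_pdfa_level` is made up for illustration, and it assumes the XMP packet is stored uncompressed and in an ASCII-compatible encoding such as UTF-8. It only reports what the file claims, which is not the same as validating it.

```python
import re
import sys
from pathlib import Path

# Match both XMP serialisations: attribute form (pdfaid:part="1")
# and element form (<pdfaid:part>1</pdfaid:part>).
_PART = re.compile(rb'pdfaid:part\s*(?:=\s*["\']|>)\s*(\d)')
_CONF = re.compile(rb'pdfaid:conformance\s*(?:=\s*["\']|>)\s*([A-Za-z])')


def claimed_pdfa_level(data: bytes):
    """Return e.g. '1B' if the raw bytes declare a PDF/A level, else None."""
    part = _PART.search(data)
    if part is None:
        return None
    level = part.group(1).decode("ascii")
    conf = _CONF.search(data)
    if conf is not None:
        level += conf.group(1).decode("ascii").upper()
    return level


if __name__ == "__main__":
    # Usage: python script.py file1.pdf file2.pdf ...
    for name in sys.argv[1:]:
        level = claimed_pdfa_level(Path(name).read_bytes())
        print(f"{name}: {'PDF/A-' + level if level else 'no PDF/A claim found'}")
```

A file that passes this check still needs to go through a real validator, since any producer can write the claim without meeting the rest of the standard.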
Validation & Conversion
If you're looking for a tool to perform validation & conversion, I believe PDFBox can do this. PDFBox lists PDF/A validation right on the front page of its website. It's another Java application 8-).
Under the command-line tools section on the left of their main page, they show the following usage for the tool:
veraPDF is another tool capable of validating PDF/A; it is part of the Open Preservation Foundation’s reference tool set. It’s also a Java application.
Conversion
For conversion alone, I found a method in a blog post titled "Free way to convert an existing PDF to PDF/A" that uses the following tools:
With the above in place you use the following command:
It isn't without its warts. The article discusses one of them: the print flags on hyperlinks need fixing. It provides a Java application that you can use to fix these:
It's not pretty but appears to be workable. See the article for more details.
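Independently of the article's method, here is a hedged sketch of one common route: asking Ghostscript's `pdfwrite` device for PDF/A output. This is an assumption on my part, not the command from the blog post. The helper name `ghostscript_pdfa_args`, the choice of PDF/A-2, and the RGB colour strategy are illustrative only. For output that actually validates, Ghostscript generally also needs an output intent definition (it ships a `PDFA_def.ps` template for that purpose), which this sketch leaves out, so run the result through a validator.

```python
import shutil
import subprocess
import sys


def ghostscript_pdfa_args(src: str, dst: str, part: int = 2) -> list:
    """Build a Ghostscript argument list that requests PDF/A output."""
    return [
        "gs",
        f"-dPDFA={part}",                 # requested PDF/A part (1, 2 or 3)
        "-dBATCH",
        "-dNOPAUSE",
        "-sDEVICE=pdfwrite",
        "-sColorConversionStrategy=RGB",  # assumption: an RGB workflow is acceptable
        "-dPDFACompatibilityPolicy=1",    # omit features that cannot be made conformant
        f"-sOutputFile={dst}",
        src,
    ]


if __name__ == "__main__":
    # Usage: python script.py input.pdf output.pdf
    if len(sys.argv) == 3 and shutil.which("gs"):
        subprocess.run(ghostscript_pdfa_args(sys.argv[1], sys.argv[2]), check=True)
    else:
        print("usage: script.py input.pdf output.pdf (requires gs on PATH)")
```

Building the argument list in a separate function keeps the command easy to inspect or adjust before anything is executed, which helps when comparing the converted file side by side with the original.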