Ubuntu – Remove pages with redundant content from PDF document

pdf

I have a huge PDF document (several hundred pages) which was very likely produced from a PowerPoint presentation.

It contains one page per presentation step, so if on one PowerPoint slide there was some text or an image that appeared later with an animation, there will be two pages in the PDF: one with only the initial content of the slide and one with the animated object added.

I do not have access to the original presentation, only the PDF.

Is there a way (script, application, …) to automatically remove all redundant pages and keep only the last page of each set of pages that belong to one original slide, i.e. the page with all the information, discarding the incomplete pages that show the animation steps?

The check algorithm would actually be quite simple: for each page, check whether the following page contains at least all information (text, images, objects) from the current page or more, and if so, remove the current page.
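To make that check concrete, here is a minimal sketch in Python. It assumes the text of each page has already been extracted into a list of strings, one per page; how that happens is left open. The names `lines_of` and `pages_to_keep` are made up for illustration. It only compares text lines, so images and other objects are not compared, and pages without any extractable text are kept because nothing can be decided about them.

```python
from collections import Counter


def lines_of(page_text):
    """Return the non-empty lines of a page as a multiset, with whitespace normalized."""
    return Counter(
        " ".join(line.split())
        for line in page_text.splitlines()
        if line.strip()
    )


def pages_to_keep(page_texts):
    """Return the 0-based indices of pages that are not contained in the page after them."""
    keep = []
    for i, text in enumerate(page_texts):
        current = lines_of(text)
        if current and i + 1 < len(page_texts):
            following = lines_of(page_texts[i + 1])
            # Counter subtraction drops non-positive counts, so the result is
            # empty exactly when every line of this page is also on the next one.
            if not (current - following):
                continue  # redundant animation step, skip it
        keep.append(i)
    return keep


if __name__ == "__main__":
    # Hypothetical sample data: three animation steps of one slide, then a new slide.
    sample = ["Title", "Title\nPoint A", "Title\nPoint A\nPoint B", "Other slide"]
    print(pages_to_keep(sample))
```

An exact line-by-line comparison like this is strict: if the text extraction reorders or re-wraps lines between two steps of the same slide, the earlier page will be kept rather than removed, which errs on the safe side.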

Unfortunately I have no idea whether or how it is possible to compare pages of PDF documents and find out whether the elements of one are a subset of the other or not.

Can you give me any solution that helps me to tidy up my huge PDF document and remove the animation steps in order to print it out?

I do not really want a PDF editor application where I would have to check and remove all the pages manually anyway; I already have https://www.xodo.com/ for that. What I am hoping for is a way to automate the process.

Best Answer

Have you looked into PDFBox? You can invoke its various features from the command line. You can extract each page as text, use diff to see whether each successive page consists mostly of additions to the previous one, keep track of the interesting pages, then use PDFBox again to extract only those pages.
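The diff step could look roughly like the following Python sketch, which uses the standard `difflib` module in place of the `diff` command. It assumes the per-page text has already been extracted (for example with PDFBox) and loaded into a list of strings; the function names and the `max_removed` tolerance are hypothetical choices, not part of any tool. The result is a list of 1-based page numbers, which can then be passed to whatever tool does the final page extraction.

```python
import difflib


def is_mostly_additions(prev_text, next_text, max_removed=0):
    """True if going from prev_text to next_text removes at most max_removed lines."""
    prev_lines = prev_text.splitlines()
    next_lines = next_text.splitlines()
    # In ndiff output, lines starting with "- " exist only in the first sequence.
    removed = sum(
        1 for entry in difflib.ndiff(prev_lines, next_lines)
        if entry.startswith("- ")
    )
    return removed <= max_removed


def interesting_pages(page_texts, max_removed=0):
    """Return 1-based numbers of pages that the following page does not supersede."""
    keep = []
    for i, text in enumerate(page_texts):
        is_last = i + 1 == len(page_texts)
        if is_last or not is_mostly_additions(text, page_texts[i + 1], max_removed):
            keep.append(i + 1)
    return keep


if __name__ == "__main__":
    # Hypothetical sample data standing in for the extracted page texts.
    sample = ["A", "A\nB", "A\nB\nC", "X\nY"]
    print(interesting_pages(sample))
```

Raising `max_removed` above 0 makes the check more tolerant of small differences in the extracted text, at the risk of dropping a page that really was a different slide. Note that a page with no extractable text counts as "only additions" here, so image-only pages should be checked by hand.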

I assume from your rating that you don't need detailed instructions for how to accomplish all this :)