Grab the content from a .pages file, without using Pages.app

pages

I’m interested in automatically extracting the content of .pages documents, preferably via a programming language on a web server.

There are quite a few references around the web that say Pages documents are actually zipped archives containing various files (e.g. http://www.tuaw.com/2009/11/02/iwork-secret-life-as-zip-file-revealed-includes-pdf-preview/), but a Pages document sent to me by a friend doesn’t seem to be unzippable by Mac OS X’s Archive Utility, or The Unarchiver, no matter what I change the extension to.

Is there a way to get the content from recent Pages files?

Best Answer

If you're trying to roll your own solution, the actual .pages file is a package, i.e. a folder that Finder displays as a single file. If you right-click it, you can choose Show Package Contents. Inside the resulting folder will be all the graphic files plus a file called index.xml.gz. If you decompress that file (it is gzip-compressed, so gunzip will do it), you get an XML file containing all the text in the Pages document.
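Since you asked about doing this from a programming language, here is a minimal Python sketch of the steps above. It uses only the standard library. The function names `read_index_xml` and `extract_text` are made up for illustration. Two things in it are assumptions rather than anything I've confirmed for your file: that a zipped Pages document (the kind the TUAW article describes) stores the same data as `index.xml` or `index.xml.gz` at the top level of the archive, and that the readable text sits in ordinary XML text nodes.

```python
import gzip
import os
import xml.etree.ElementTree as ET
import zipfile


def read_index_xml(pages_path):
    """Return the raw XML bytes of the index file in a .pages document."""
    if os.path.isdir(pages_path):
        # Package form: a folder containing index.xml.gz.
        with gzip.open(os.path.join(pages_path, "index.xml.gz"), "rb") as f:
            return f.read()
    if zipfile.is_zipfile(pages_path):
        # Assumption: the zipped form keeps the index at the archive root.
        with zipfile.ZipFile(pages_path) as archive:
            names = archive.namelist()
            if "index.xml.gz" in names:
                return gzip.decompress(archive.read("index.xml.gz"))
            if "index.xml" in names:
                return archive.read("index.xml")
    raise ValueError("no index.xml(.gz) found in %r" % pages_path)


def extract_text(xml_bytes):
    """Join every non-empty text node in the XML, one per line."""
    # Assumption: the document text lives in plain text nodes.
    root = ET.fromstring(xml_bytes)
    return "\n".join(t.strip() for t in root.itertext() if t.strip())
```

You would call it as `extract_text(read_index_xml("/path/to/document.pages"))`. This dumps every text node, so expect some non-body strings mixed in; filtering by element name would be the next step once you have looked at the XML from a real file.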