I am trying to convert docx files to PDF on my Ubuntu server from the command line, but none of the converters I have tried so far converts Word 2007/2010/2013 files correctly.
Apparently online converters manage it without any problems, but web services are not an option because the files contain sensitive data. For tests I use this Word 2007 file because it contains some important elements (formulas, vector graphics, images, lists, etc.). I tested the following tools (partly from this post):
lowriter (LibreOffice Writer) – incorrect output (the circle is supposed to be on the last page, not the first one)
unoconv – the same as LibreOffice, since it doesn't use its own converter. Converting to odt first and then to pdf messes the file up completely.
abiword --to=pdf filename.doc – incorrect and incomplete (many elements are missing)
OpenOffice Writer – same result as for abiword
wvPDF – crashes with the following error message:
~$ wvPDF 2007_Office_DocEncryption.docx test.pdf
Current directory: /home/webmt/dev/test/
Some problem running latex.
Check for Errors in test.log
Continuing…
Conversion into dvi failed
Is there any way to convert docx files to PDF on Linux correctly? It would also help me if I knew it works for someone with any of the programs I already mentioned.
I will start a bounty as soon as SE lets me.
p.s. I'm using Ubuntu server 12.04
Conclusion:
I have to conclude that, for me and for now, there is no reliable tool on Ubuntu that handles the new MS Word formats with all kinds of elements and creates a one-to-one copy of docx files. None of the tools I tested could convert the sample file properly. Since I will be facing very different document versions and contents, and output quality is one of the highest priorities, I will end up performing the conversions by means of VB macros in Word on a Windows server connected to my Linux server.
I will set the post getting the best results as the accepted answer. However, the bounty was intended for a solution with absolutely correct conversion. Thanks to everyone, again.
Best Answer
This answer passes all tests except the flow chart one in your test document.
Why is this better than the other methods suggested thus far?
I have tested the other methods suggested so far (especially oowriter and ebook-convert), but they pass fewer tests than this method. The ebook-convert method strips the margins and part of the text out of the document.
This method even yields better results than a professional converter such as rainbowpdf.
I also tried converting it to html, but the drawing with the square in the circle and the flow chart are incorrect.
Why does the flow chart test fail?
It seems that libreoffice and unoconv have some problems with correctly rendering the flow chart in the .docx file. This is probably because it was made using smart art in Microsoft Office, which is the root of the problem; it is a bug that is also discussed in this thread. The textual and visual information is present in the pdf resulting from the above method, as you can see (I had to select the text, though).
The font color, for instance, is not read properly and some lines are too long. I am not aware of any Linux solution that is able to display smart art correctly. :(
This is also the reason why all the print solutions posted on this page will not satisfy you.
In short
In short, what you are doing is really hard, and at present there are no solutions that will fully satisfy you. The Achilles' heel of docx-to-pdf conversion is the smart art. If you can live without that, or if you can find a way to spot smart art and convert it somehow into an image, you can reach your goal.
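Spotting smart art, at least, can be sketched fairly simply. This is a minimal sketch, not a tested pipeline: it assumes that a .docx is a ZIP archive and that Word stores its SmartArt parts under `word/diagrams/` (which is where files like drawing1.xml normally live). The function names `smartart_parts` and `has_smartart` are made up for this illustration.

```python
import zipfile


def smartart_parts(docx_path):
    """Return the archive members that look like SmartArt parts.

    Assumption: SmartArt is stored under word/diagrams/ inside the
    .docx ZIP container. docx_path may be a path or a file-like object.
    """
    with zipfile.ZipFile(docx_path) as archive:
        return sorted(
            name
            for name in archive.namelist()
            if name.startswith("word/diagrams/")
        )


def has_smartart(docx_path):
    """True if the document appears to contain at least one SmartArt part."""
    return bool(smartart_parts(docx_path))
```

A server-side script could call `has_smartart()` before converting and route flagged files to a different workflow (for example, back to the author, or to a Windows machine running Word).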
Option 1. Force your users to deal with the problem
This is a very inelegant solution. Your content creators could save their smart art as jpg as described in the office help pages and hence the conversion would be possible on your server.
Option 2. Hack your way around the problem
If the flow charts are often very similar, and depending on how good a developer you are, you could try to convert the smart art separately. You could extract the drawing1.xml file from the .docx cluster of documents and then use natural language processing and some crazy hacks to rebuild the smart art. For instance, you'd have to mess with this type of xml:
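Getting at that XML from a script could look roughly like the sketch below. It rests on two assumptions: that the part sits at `word/diagrams/drawing1.xml` inside the .docx ZIP archive, and that the visible text is held in DrawingML `<a:t>` elements. The function name `smartart_text` is made up for this illustration, and the code only pulls out the text runs; it does not rebuild any layout.

```python
import zipfile
import xml.etree.ElementTree as ET

# DrawingML main namespace, conventionally bound to the "a:" prefix.
A_NS = "http://schemas.openxmlformats.org/drawingml/2006/main"


def smartart_text(docx_path, part="word/diagrams/drawing1.xml"):
    """Return the text runs (<a:t> elements) found in one SmartArt part.

    Assumption: the part name above matches the document; zipfile raises
    KeyError if the archive has no such member.
    """
    with zipfile.ZipFile(docx_path) as archive:
        root = ET.fromstring(archive.read(part))
    # iter() walks the whole tree, so nesting depth does not matter.
    return [node.text for node in root.iter(f"{{{A_NS}}}t") if node.text]
```

The returned list keeps document order, so it could be written to a plain text file next to the converted pdf as a fallback for the parts the converter renders badly.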
Or, as a minimal solution, you could at least extract the text (<a:t>?) from the file and save it in an easier way. Or, if the flow charts of your pdfs are all the same, you could write a script to change the text color and the line length in the xml itself. Then you could run doc2pdf and you'd have a file that essentially has all the right info, but maybe not the formatting. In the case of flow charts you'd probably also want to include some of the formatting, because the formatting is part of the info.
Option 3. Use a third party service
I have done some more research over the past few days and found a service that does the conversion perfectly: zamzar. Zamzar allows you to upload a docx file and then emails you a link. They also have a (paying?) service where you can send any file to pdf@zamzar.com and then get the converted file back in your inbox. You could easily build a system around this where you automatically send the file and parse the result from the email. This is not much work, and the end result is the best.
Notes
I have actually taken the time to test the different methods proposed on this page. Please back any comments up with actual tests.