docx2txt works on the information in the .docx file, which is a zipped set of XML files.
With regard to line wrapping, the .docx XML data only includes information about paragraphs and hard breaks, not about soft breaks. Soft breaks are a result of rendering the text in a specific font, font size and page width. docx2txt normally just tries to fit text in 80 columns (the width is configurable), without any regard for font and font size. If your .docx contains font information from a Windows system that is not available on Unix/Linux, then exporting to .txt via Open/LibreOffice is also unlikely to result in the same layout, although it tries to do a good job¹.
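To illustrate why the wrap width alone decides where the soft breaks land, here is a minimal sketch using fold from coreutils. This is not how docx2txt wraps internally (that is an assumption-free stand-in, not its algorithm); the sample paragraph and variable names are made up for the example.

```shell
# One paragraph, stored as a single logical line (as in the .docx XML).
paragraph='Soft line breaks are not stored in the document; they appear only when a renderer fits the paragraph into a given width.'

# Wrap the same text at two different widths, breaking at spaces (-s).
wrapped80=$(printf '%s\n' "$paragraph" | fold -s -w 80)
wrapped40=$(printf '%s\n' "$paragraph" | fold -s -w 40)

# Print both results; the breaks fall in different places.
printf '%s\n\n%s\n' "$wrapped80" "$wrapped40"
```

The text is identical in both cases; only the chosen width changes the line breaks, which is why a fixed column width cannot reproduce what Word renders with real font metrics.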
So docx2txt, or any other command-line utility, including command-line driven Open/LibreOffice processing, is not guaranteed to convert the text to the same layout as exporting from Word does².
If you want to (or are forced by client requirements to) render exactly as Word does, there is in my experience only one way: let Word do the rendering. When faced with a problem similar to yours³, and having incompatible results using other tools, including OpenOffice, I resorted to installing a Windows VM on the host Linux server. On the guest VM a program watches for incoming files to be converted on the host, then starts and drives Word to do the conversion and copies back the result⁴.
Decisions about using CR/LF or LF only, and UTF-8 or some other encoding, for the .txt largely depend on how the resulting files are used. If the resulting files are used on Windows I would definitely go with CR/LF, UTF-8 and a UTF-8 BOM. Modern programs on Linux are able to deduce that a file is UTF-8 on their own, but will not barf on the BOM and may even use that information. You should test all your target applications for compatibility if those are known up front.
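As a rough sketch of the Windows-friendly variant (CR/LF, UTF-8, BOM), something like the following could post-process a converted file. The helper name to_windows_txt and the file names are hypothetical, and it assumes the input is already UTF-8 with LF-only line endings; running it on a file that already has CR/LF would double the CRs.

```shell
# to_windows_txt is a made-up helper name, not part of docx2txt.
to_windows_txt() {
    # $1: input file (assumed UTF-8, LF line endings)  $2: output file
    {
        printf '\357\273\277'                 # UTF-8 BOM: bytes EF BB BF
        awk '{ printf "%s\r\n", $0 }' "$1"    # end every line with CR LF
    } > "$2"
}

# Try it on a throwaway file and look at the raw bytes.
workdir=$(mktemp -d)
printf 'first line\nsecond line\n' > "$workdir/in.txt"
to_windows_txt "$workdir/in.txt" "$workdir/out.txt"
od -c "$workdir/out.txt" | head -n 3
```

The od output makes it easy to check that the BOM is there and that each line ends in \r \n before you hand the file to the target applications.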
¹ This sort of incompatibility is the primary reason some of my friends cannot change to Linux from Windows, although they would like to. They have to use Microsoft Word, as Open/LibreOffice every once in a while mangles texts they exchange with clients.
² You can install all the fonts used in the Word files and might get lucky for some texts, some of the time.
³ Rendering PDFs from .doc/.docx
⁴ The program uses GUI automation (as if someone is clicking its menus) and doesn't attempt to drive Word via an API. I am pretty sure the latter can be done as well and would have the advantage of not breaking things if Word gets upgraded.
You can use the entries in /proc to truncate such files.
# ls -l /proc/4315/fd
That will show all the files opened by process 4315. You've already used lsof and that shows that the deleted file is file descriptor 135, so you can free the space used by that deleted file as follows:
# > /proc/4315/fd/135
The same goes for the other deleted file, opened by process 44654; there it's file descriptor 133, so:
# > /proc/44654/fd/133
You should now see that the space is freed up.
You can also use this to copy the contents of a file that's been deleted but is still held open by a process: just cp /proc/XXX/fd/YY /some/other/place/filename.
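If you want to try this safely before touching a real process, here is a small self-contained sketch on a throwaway file. It is Linux only (it needs /proc), and the temp files, the descriptor number 3 and the sleep process are all made up for the demo; they stand in for your real PID and descriptor.

```shell
# Throwaway demo, Linux only (needs /proc).
tmp=$(mktemp)

# Background process that keeps the file open on descriptor 3.
( exec 3>>"$tmp"; printf 'some log data\n' >&3; exec sleep 30 ) >/dev/null 2>&1 &
pid=$!
sleep 1                                 # give it time to open and write

rm "$tmp"                               # the name is gone, the data is not
ls -l "/proc/$pid/fd/3"                 # link target is marked "(deleted)"

rescued=$(mktemp)
cp "/proc/$pid/fd/3" "$rescued"         # copy the contents out first
size_before=$(wc -c < "/proc/$pid/fd/3")

: > "/proc/$pid/fd/3"                   # truncate: this frees the space
size_after=$(wc -c < "/proc/$pid/fd/3")

echo "before: $size_before bytes, after: $size_after bytes"
kill "$pid"                             # clean up the demo process
```

Copying before truncating is the safer order if there is any chance you still need the data, since the truncate cannot be undone.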
UNIX/Linux does not have the same early DOS / CP/M heritage that Windows does. So extensions are generally less significant to most UNIX utilities and tools.
I usually use a command-line only environment. Extensions in such an environment under Linux aren't really significant except as a convenience to the operator or user. (I don't have enough experience with KDE or GNOME to know how their file managers deal with extensions.)
But such convenience is usually important. If config.ini is really in Microsoft-standard ".ini" format, I'd let the extension stand. Plain old text files usually carry no extension in Linux, but this isn't universal for all programs' configuration files. The programmer usually gets to decide that.
I think ".txt" is useful under Linux if you want to emphasize that it's NOT a configuration file or other machine-readable document. However, in source distributions, the convention is to name such files in all caps without an extension (e.g. README, INSTALL, COPYING, etc.)
There are some standards and conventions but nothing stopping you from naming anything whatever you want, unless you are sharing things with others.
In Windows, naming a file .exe indicates to the shell (usually explorer.exe) that it's an executable file. UNIX builds this knowledge into the file system's permissions. If the proper x bits (see man chmod) are set, it is recognized as executable by shells and kernel functions (I believe). Beyond this, Linux doesn't care, most shells won't care, and most programs look in the file to find its "type."
Of course, there's the nice command file which can analyze the file and tell you what it is with a degree of certainty. I believe if it can't match the data in the file with any known type, and if it contains only printable ASCII/Unicode characters, then it assumes it's a text file.
@Bruce Ediger below is absolutely correct. There is nothing in the kernel or filesystem level, i.e. Linux itself, enforcing or caring that the contents of a file need to match up with its name, or the program that is supposed to understand it. This doesn't mean it's not possible to create a shell or launcher utility to do things based on filename.
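A quick way to convince yourself of this is to give a script a deliberately misleading name. The sketch below uses a made-up file notes.txt in a temp directory, and assumes that directory is not mounted noexec; the file call is skipped if file is not installed.

```shell
dir=$(mktemp -d)

# A shell script with a deliberately misleading ".txt" name.
printf '#!/bin/sh\necho hello\n' > "$dir/notes.txt"

# Without the x bit it is just data, whatever the name says.
[ -x "$dir/notes.txt" ] || echo "not executable yet"

# Set the x bit; now it runs, and the extension is ignored.
chmod +x "$dir/notes.txt"
out=$("$dir/notes.txt")
echo "$out"

# file looks at the contents, not the name (skipped if not installed).
if command -v file >/dev/null 2>&1; then
    file "$dir/notes.txt"
fi
```

The permission bits and the first line of the file decide what happens here; the ".txt" suffix never enters into it.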