Count of hyphenated words and their unhyphenated counterparts in LaTeX files

latex

I have a thesis of approximately 100,000 words, typeset in LaTeX. I have hyphenated some words rather inconsistently, for example "spider-fear" and "spider fear".

I would like to get a list of all hyphenated words in the tex files (each with a count), together with a count of how many times the unhyphenated version of each word appears.

Presumably using a tool like awk, grep or sed?

Best Answer

You can do this by means of a spiffy Perl program, texcount.pl, which you can download from this Web page. This program counts words in TeX documents (or letters, or mathematical formulas, ...), a non-trivial task given that TeX-specific keywords have to be excluded from the count. The program has a number of features and options (which, however, I have never used), but the one you need is:

   texcount.pl -freq myfile.tex

which will print the full list of words used, with their frequency of appearance, to standard output. You can then easily parse this list to see where you have used both the hyphenated and the non-hyphenated form. Please note that the program can also handle multi-file projects, where sections, appendices, bibliography and so on are stored in different files. It will not, however (at least AFAIK), point to the precise location of the words: you will have to hunt them down one by one.
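The comparison itself can also be sketched directly on the source text. What follows is a minimal Python sketch, not part of texcount: the function name hyphen_report is hypothetical, and it reads the raw text, so unlike texcount it does not strip TeX markup (a hyphen inside a \label or \cite key would be counted too). It collects every hyphenated word and counts how often the same parts appear separated by whitespace instead, line breaks included.

```python
import re
from collections import Counter

# A hyphenated word: runs of letters joined by single hyphens, e.g. "spider-fear".
# The lookarounds reject "--" dashes and hyphens glued to other word characters.
HYPHENATED = re.compile(r"(?<![\w-])[A-Za-z]+(?:-[A-Za-z]+)+(?![\w-])")


def hyphen_report(text):
    """Map each hyphenated word to (hyphenated count, spaced count), ignoring case."""
    lowered = text.lower()
    hyphenated = Counter(HYPHENATED.findall(lowered))
    report = {}
    for word, n_hyphenated in hyphenated.items():
        # Unhyphenated counterpart: the same parts separated by whitespace,
        # which may include a line break in the source file.
        parts = [re.escape(part) for part in word.split("-")]
        spaced = re.compile(r"(?<![\w-])" + r"\s+".join(parts) + r"(?![\w-])")
        report[word] = (n_hyphenated, len(spaced.findall(lowered)))
    return report
```

To use it on a real project, read the contents of the tex files into one string, pass that string to hyphen_report, and print the entries whose second count is non-zero: those are the words that are hyphenated inconsistently.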

Edit:

A quick but partial solution to finding all occurrences of the non-hyphenated expressions is the following:

  grep -n 'spider *fear' file.tex

which searches for the two words separated by zero or more (the * symbol) spaces, and returns the line number (the -n option) of each occurrence. This is fast, but it is incomplete: grep works line by line, so it cannot find the expression spider fear when it is split across two lines. Since for arbitrary expressions a line break can occur even within words, finding these occurrences will require a tad more work than I am willing to do.

Edit 2:

Another bit of the solution is the following:

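   grep -n -A 1 'spider *$' filename | grep '^[0-9]*- *fear'

This finds every line that ends with spider followed by an unspecified number of spaces, prints it together with the line after it (the -A 1 option), and then keeps only those following lines that begin with an unspecified number of spaces and then the word fear. The -n option has to go on the first grep, which reads the file itself: each following line is then printed with its line number in the file and a hyphen in front of it, and that prefix is what the second pattern matches. On the second grep, -n would only number the lines coming through the pipe.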

Keep in mind that, in all of the previous cases, you are searching for lower-case expressions only. If you wish to include capitals, just substitute grep -i for grep.
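The same-line case, the line-break case and the capitals can also be handled in one pass. Here is a minimal Python sketch under a few assumptions: the function name find_phrase is hypothetical, the whole file is read into a single string first, and the two words must be separated by at least one whitespace character (so, unlike the grep pattern with *, it does not match spiderfear).

```python
import re


def find_phrase(text, first, second):
    """Yield (line number, matched text) for `first` and `second` separated
    only by whitespace, which may include a line break. Case is ignored."""
    pattern = re.compile(
        r"\b" + re.escape(first) + r"\s+" + re.escape(second) + r"\b",
        re.IGNORECASE,
    )
    for match in pattern.finditer(text):
        # 1-based number of the line on which the match starts.
        line_no = text.count("\n", 0, match.start()) + 1
        yield line_no, match.group(0)
```

For an expression split over two lines, the reported number is that of the line holding the first word.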

The only part that is missing now is when the words themselves are broken across lines, as in

    spi
    der