Count of hyphenated words and their unhyphenated counterparts in latex files

latex

I have a thesis of approximately 100,000 words, typeset in LaTeX. I have hyphenated some of the words inconsistently, for example "spider-fear" and "spider fear".

I would like to get a list of all words in the tex files that are hyphenated (along with a count) and then I would also like a count for the number of times that the unhyphenated version also appears.

Presumably using a tool like awk, grep or sed?

Best Answer

You can do this with a handy Perl program, texcount.pl, which you can download from this Web page. It counts words (or letters, or mathematical formulas, ...) in TeX documents, a non-trivial task because TeX-specific keywords must be excluded from the count. The program has a number of features and options (which, however, I have never used), but the one you need is:

   texcount.pl -freq myfile.tex

which prints the full list of words used, with their frequencies, to standard output. You can then parse this list to see where you have used hyphenated and non-hyphenated combinations. Note that the program also handles multi-file projects, where sections, appendices, bibliography and so on are stored in different files. It will not, however (at least AFAIK), point to the precise location of the words: you will have to hunt them down one by one.
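
The comparison asked for in the question can also be sketched directly in Python. This is only a minimal sketch and is not part of texcount: the helper name `hyphen_report` is made up here, it only recognises ASCII letters, and it works on the raw text, so it does not skip TeX commands or comments the way texcount does.

```python
import re
from collections import Counter


def hyphen_report(text):
    """Map each hyphenated word to (hyphenated, spaced, joined) counts.

    Hypothetical helper: works on raw text and does not strip TeX markup.
    """
    lowered = text.lower()
    # Letter runs joined by single hyphens, e.g. "spider-fear".
    hyphenated = Counter(re.findall(r"\b[a-z]+(?:-[a-z]+)+\b", lowered))
    report = {}
    for word, count in hyphenated.items():
        parts = word.split("-")
        # \s+ also matches a newline, so "spider\nfear" is counted too.
        spaced = re.compile(r"\b" + r"\s+".join(parts) + r"\b")
        joined = re.compile(r"\b" + "".join(parts) + r"\b")
        report[word] = (
            count,
            len(spaced.findall(lowered)),
            len(joined.findall(lowered)),
        )
    return report


sample = "Spider-fear is common. Spider fear varies, and spider\nfear fades."
for word, counts in sorted(hyphen_report(sample).items()):
    print(word, *counts)
```

Each output line gives the hyphenated word followed by three counts: the hyphenated form, the form with whitespace, and the form written as one word. For a real thesis you would read the text from the .tex files instead of using the `sample` string.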

Edit:

A quick but partial solution to finding all occurrences of the non-hyphenated expressions is the following:

  grep -n 'spider *fear' file.tex

which searches for the two words separated by zero or more spaces (the * symbol) and prints the line number (the -n option) of each occurrence. This is fast but incomplete: grep works line by line, so it cannot find spider fear when the expression is split across two or more lines. Since for arbitrary expressions such a split can occur even within words, finding these occurrences will require a tad more work than I am willing to do.

Edit 2:

Another bit of the solution is the following:

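   grep -n -A 1 'spider *$' filename | grep '^[0-9]*- *fear'

This finds every line that ends with spider followed by any number of spaces, prints it together with the line after it (the -A 1 option), and then keeps only those following lines that begin with any number of spaces and then the word fear. Because -n is given to the first grep, every line it prints is prefixed with its line number in the file (followed by - for the context lines, which is what the second pattern tests), so the output shows where each occurrence is.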

Keep in mind that, in all of the previous cases, you are searching for lower-case expressions only. If you wish to include capitals, just substitute grep -i for grep.
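
The same-line and split-line searches can be combined in one pass by letting the whitespace between the words include newlines. The following Python sketch is an illustration only: the helper name `find_phrase` is made up, and unlike the first grep pattern it requires at least one whitespace character between the words, so it does not match "spiderfear".

```python
import re


def find_phrase(text, words, ignore_case=True):
    """Return the 1-based line numbers on which the phrase starts.

    Hypothetical helper: `words` is a list such as ["spider", "fear"].
    """
    flags = re.IGNORECASE if ignore_case else 0
    # \s+ matches spaces, tabs and newlines, so a phrase split over
    # two lines is found as well.
    pattern = re.compile(
        r"\b" + r"\s+".join(re.escape(w) for w in words) + r"\b", flags
    )
    return [text.count("\n", 0, m.start()) + 1 for m in pattern.finditer(text)]


sample = "A spider fear study.\nSome have Spider\n  fear too.\nspider-fear\n"
print(find_phrase(sample, ["spider", "fear"]))
```

Passing `ignore_case=False` corresponds to using grep without -i.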

The only case still missing is when a word itself is broken across lines, as in

    spi
    der
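
One way to cover this last case is to allow an optional line break between every pair of letters. The Python sketch below is a rough illustration under that assumption: the helper name `broken_word_pattern` is made up, and since the pattern has no word boundaries it also matches inside longer words such as "spiders".

```python
import re


def broken_word_pattern(word):
    """Build a regex matching `word` with line breaks between letters.

    Hypothetical helper: blanks around the line break are tolerated,
    but plain spaces inside the word on a single line are not.
    """
    gap = r"(?:[ \t]*\n[ \t]*)?"
    return re.compile(gap.join(re.escape(ch) for ch in word), re.IGNORECASE)


sample = "a spi\n    der here\nand a spider there\n"
pattern = broken_word_pattern("spider")
for m in pattern.finditer(sample):
    line = sample.count("\n", 0, m.start()) + 1
    print(line, repr(m.group()))
```

Each output line gives the line number on which the match starts, followed by the matched text with any embedded line break made visible.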

Related Solutions

Mac – Word count for LaTeX within emacs

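(defun latex-word-count ()
  "Run texcount.pl on the file visited by the current buffer."
  (interactive)
  (shell-command (concat "/usr/local/bin/texcount.pl "
                         ; "uncomment then options go here "
                         ;; Quote the file name so that spaces in the path are safe.
                         (shell-quote-argument (buffer-file-name)))))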

You may put texcount.pl somewhere other than /usr/local/bin; just modify the code accordingly. This creates a new command, M-x latex-word-count, which runs texcount.pl on the current file. The script reads the file on disk, so the result will be wrong if you have unsaved changes. To pass command-line arguments, remove the semicolon and replace the filler text with the options you want. You can bind the command to a key with something like this in your .emacs:

(define-key latex-mode-map "\C-cw" 'latex-word-count)

The page which describes how to install texcount is here: texcount faq. Short version:

sudo cp texcount.pl /usr/local/bin/texcount.pl

or alternatively you could do as they recommend and simply name it texcount, and update the code appropriately.

Create latex style files

There's not much special about sty or cls files; they're just LaTeX files with a special purpose and another file extension. You could use any editor to write them, preferably your favourite LaTeX editor. I'm not aware of any dedicated editor just for style and class files; and I'm not really sure how the WYSIWYG concept could be applied to styles/classes anyway.

If you just want to collect some LaTeX settings/definitions in a common file, use your favourite editor to write them (or copy them from a document where they're already working). Insert \ProvidesFile{packagename} at the beginning of the file. Save it with a .sty extension in a place where TeX can find it. Then you can invoke \usepackage{packagename} in your LaTeX documents, and your package will be loaded right away.

Here's an example where I put together my settings for letters with the scrlettr class:

\ProvidesFile{FJ-Brief-CB}

\name{Florian Jenn}
\signature{\bigskip Florian Jenn}

\address{Some street 123 \quad 03\,044 Cottbus}

\subjecton

% and so on...

For “real” packages, consult “LaTeX2e for class and package writers” at http://www.latex-project.org/guides/clsguide.pdf, as already mentioned by user33872. Additionally, there's a short overview by Joseph Wright: http://www.texdev.net/2009/10/05/the-dtx-format/. Basically, you'll have to write a doc (dtx) file, from which the sty and documentation files can be generated.

Any editors that can be used for LaTeX should do; however, it's nice to have explicit dtx (docTeX) support. AFAIK, Emacs (docTeX mode in AUCTeX) or WinEdt (see http://www.winedt.org/Config/modes/DTX.php) have it. I've had a quick look at Kile and TeXmaker – they don't have explicit modes (editing dtx is still possible, just not so nice). See also Joseph Wright's notes on editing dtx: http://www.texdev.net/2009/10/11/working-with-dtx-files/
