Linux – Word count for markdown

linux, markdown, pandoc, word count

Is there a way to get a word count of natural language words in Markdown (or better, Pandoc Markdown) from the command line? wc gives a very rough estimate, but it is naive: it counts anything surrounded by whitespace as a word, including header markers, bullet points, and URLs in links.
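To make the problem concrete, here is a minimal sketch of how wc treats Markdown syntax. The sample text and the variable names (sample, naive_count) are made up for illustration; the sample contains five natural-language words (Heading, See, the, docs, here).

```shell
# A tiny, made-up Markdown sample: one heading and one bullet with a link.
sample='# Heading

- See [the docs](https://example.com/page) here'

# wc -w splits on whitespace only, so "#", "-", and the link target
# are all counted as words. tr strips any padding that some wc builds add.
naive_count=$(printf '%s\n' "$sample" | wc -w | tr -d '[:space:]')
echo "wc -w reports: $naive_count"   # prints "wc -w reports: 7"
```

The count is 7 rather than 5 because the heading marker and the bullet marker each count as a word, and the link is split at its one internal space into two tokens.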

Ideally, I would strip all Markdown formatting (including Pandoc citations, if possible) and pipe the result through wc, but I can't find a way to do that: pandoc's plain-text output format still includes a lot of Markdown styling.

Best Answer

There is a Lua filter for that, given as an example in the pandoc documentation: https://pandoc.org/lua-filters.html#counting-words-in-a-document

Save the following code as wordcount.lua:

-- Counts the words in the body of a document.

local words = 0

local wordcount = {
  Str = function(el)
    -- Don't count a token that is entirely punctuation.
    if el.text:match("%P") then
      words = words + 1
    end
  end,

  Code = function(el)
    -- Inline code: count whitespace-separated tokens.
    local _, n = el.text:gsub("%S+", "")
    words = words + n
  end,

  CodeBlock = function(el)
    -- Code blocks: count whitespace-separated tokens.
    local _, n = el.text:gsub("%S+", "")
    words = words + n
  end
}

function Pandoc(el)
  -- Skip metadata and count only the body.
  pandoc.walk_block(pandoc.Div(el.blocks), wordcount)
  print(words .. " words in body")
  -- Exit here so pandoc does not go on to write the converted document.
  os.exit(0)
end

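Then call pandoc like this. It prints a single line of the form "N words in body" and exits without writing any converted output: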

pandoc --lua-filter wordcount.lua myfile.md
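
If you want only the number, for use in a script, the output line can be trimmed down. The following is a rough sketch, not part of the pandoc documentation: parse_wordcount and md_wordcount are hypothetical helper names, and it assumes pandoc is on your PATH and wordcount.lua is in the current directory.

```shell
# Hypothetical helper: read the filter's "N words in body" line on stdin
# and print just the leading number.
parse_wordcount() {
  awk '/ words in body$/ { print $1 }'
}

# Hypothetical wrapper: run the word count filter on one file and print
# only the count. Assumes pandoc is installed and wordcount.lua is in
# the current directory.
md_wordcount() {
  pandoc --lua-filter wordcount.lua "$1" | parse_wordcount
}

# Example use (not executed here):
#   md_wordcount myfile.md
```

Keeping the parsing in its own function means it can be checked without pandoc, for example by piping a literal "123 words in body" line into parse_wordcount.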