11 Interlude: Most Frequent Words

In this interlude we will develop a program that reads a text and prints the most frequent words in that text. As in the previous interlude, the program here is quite simple, but it uses some more advanced features, such as iterators and anonymous functions.

The main data structure of our program is a table that maps each word found in the text to its frequency counter. With this data structure, the program has three main tasks:

Read the text, counting the number of occurrences of each word.
Sort the list of words in descending order of frequencies.
Print the first n entries in the sorted list.

To read the text, we can iterate over all its lines and, for each line, over all its words. For each word that we read, we increment its respective counter:

      local counter = {}
      
      for line in io.lines() do
        for word in string.gmatch(line, "%w+") do
          counter[word] = (counter[word] or 0) + 1
        end
      end

Here, we describe a “word” using the pattern "%w+", that is, one or more alphanumeric characters.
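To see what this pattern actually treats as a word, the following minimal sketch applies the same gmatch loop to a single line. The sample line and the table name found are made up for illustration. It assumes plain ASCII input in the default locale, where %w matches only letters and digits.

```lua
-- Hypothetical sample line, not taken from any real input.
local line = "Don't panic: the answer is 42, the Answer!"

-- Collect every match of "%w+" in order of appearance.
local found = {}
for word in string.gmatch(line, "%w+") do
  found[#found + 1] = word
end

print(table.concat(found, " "))
--> Don t panic the answer is 42 the Answer
```

Note that the apostrophe splits "Don't" into two words, that digits count as word characters, and that "answer" and "Answer" are distinct keys, so the program would count them separately.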

The next step is to sort the list of words. However, as the attentive reader may have noticed already, we do not have a list of words to sort! Nevertheless, it is easy to create one, using the words that appear as keys in table counter:

      local words = {}    -- list of all words found in the text
      
      for w in pairs(counter) do
        words[#words + 1] = w
      end

Once we have the list, we can sort it using table.sort:

      table.sort(words, function (w1, w2)
        return counter[w1] > counter[w2] or
               counter[w1] == counter[w2] and w1 < w2
      end)

Remember that the order function must return true when w1 must come before w2 in the result. Words with larger counters come first; words with equal counters come in alphabetical order.
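As a small illustration of this order function, the sketch below sorts a handful of words with hand-written counters. The counter values are hypothetical, chosen so that two words tie. It assumes lowercase ASCII words, for which string comparison gives the usual alphabetical order.

```lua
-- Hypothetical counters, made up for illustration.
local counter = {lua = 3, table = 5, sort = 3, word = 1}

-- Build the list of words from the keys, as in the main program.
local words = {}
for w in pairs(counter) do
  words[#words + 1] = w
end

-- Larger counters first; ties broken alphabetically.
table.sort(words, function (w1, w2)
  return counter[w1] > counter[w2] or
         counter[w1] == counter[w2] and w1 < w2
end)

print(table.concat(words, " "))   --> table lua sort word
```

The tie-breaking clause is what makes the output reproducible: pairs traverses the keys in an unspecified order, so without that clause "lua" and "sort" could come out in either order.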

Figure 11.1, “Word-frequency program”, presents the complete program.

Figure 11.1. Word-frequency program

      local counter = {}
      
      for line in io.lines() do
        for word in string.gmatch(line, "%w+") do
          counter[word] = (counter[word] or 0) + 1
        end
      end
      
      local words = {}    -- list of all words found in the text
      
      for w in pairs(counter) do
        words[#words + 1] = w
      end
      
      table.sort(words, function (w1, w2)
        return counter[w1] > counter[w2] or
               counter[w1] == counter[w2] and w1 < w2
      end)
      
      -- number of words to print
      local n = math.min(tonumber(arg[1]) or math.huge, #words)
      
      for i = 1, n do
        io.write(words[i], "\t", counter[words[i]], "\n")
      end

The last loop prints the result, which is the first n words and their respective counters. The program assumes that its first argument is the number of words to be printed; if no argument is given, or if the argument is not a number, it prints all words.
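The expression that computes n can be examined in isolation. The following sketch wraps it in a helper function; the name howmany and the total of 250 words are assumptions made only for this illustration and do not appear in the program.

```lua
-- Hypothetical helper that isolates the expression used in the program.
-- 'argument' plays the role of arg[1]; 'total' plays the role of #words.
local function howmany (argument, total)
  return math.min(tonumber(argument) or math.huge, total)
end

print(howmany("10", 250))     --> 10
print(howmany(nil, 250))      --> 250
print(howmany("ten", 250))    --> 250
print(howmany("1000", 250))   --> 250
```

When the argument is absent or is not a number, tonumber returns nil, and math.huge takes its place; math.min then caps the result at the number of words available, so the loop never indexes past the end of the list.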

As an example, we show the result of applying this program to this book:

      $ lua wordcount.lua 10 < book.of
      the	5996
      a	3942
      to	2560
      is	1907
      of	1898
      in	1674
      we	1496
      function	1478
      and	1424
      x	1266

Exercises

Exercise 11.1: When we apply the word-frequency program to a text, usually the most frequent words are uninteresting small words like articles and prepositions. Change the program so that it ignores words with fewer than four letters.

Exercise 11.2: Repeat the previous exercise but, instead of using length as the criterion for ignoring a word, the program should read from a text file a list of words to be ignored.