In this interlude we will develop a program that reads a text and prints the most frequent words in that text. As in the previous interlude, the program here is quite simple, but it uses some more advanced features, such as iterators and anonymous functions.
The main data structure of our program is a table that maps each word found in the text to its frequency counter. With this data structure, the program has three main tasks:
Read the text, counting the number of occurrences of each word.
Sort the list of words in descending order of frequencies.
Print the first n entries in the sorted list.
To read the text, we iterate over all its lines and, for each line, over all its words. For each word that we read, we increment its respective counter:
local counter = {}
for line in io.lines() do
  for word in string.gmatch(line, "%w+") do
    counter[word] = (counter[word] or 0) + 1
  end
end
Here, we describe a "word" using the pattern '%w+', that is, one or more alphanumeric characters.
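As a quick illustration (the sample sentence here is invented, not part of the program), each match of this pattern is a maximal run of alphanumeric characters, so punctuation and spaces are skipped:

-- each match of "%w+" is a maximal run of alphanumeric characters
for word in string.gmatch("To be, or not to be", "%w+") do
  io.write(word, " ")    --> To be or not to be
end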
The next step is to sort the list of words. However, as the attentive reader may have noticed already, we do not have a list of words to sort! Nevertheless, it is easy to create one, using the words that appear as keys in the table counter:
local words = {}    -- list of all words found in the text
for w in pairs(counter) do
  words[#words + 1] = w
end
Once we have the list, we can sort it using table.sort:
table.sort(words, function (w1, w2)
  return counter[w1] > counter[w2] or
         counter[w1] == counter[w2] and w1 < w2
end)
Remember that the order function must return true when w1 must come before w2 in the result. Words with larger counters come first; words with equal counters come in alphabetical order.
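To make the tie-breaking concrete, here is a small hand-made example; the words and counts below are invented purely for illustration:

-- a toy counter table, just for illustration
local counter = {the = 3, a = 3, lua = 5, word = 1}
local words = {"word", "a", "lua", "the"}

table.sort(words, function (w1, w2)
  return counter[w1] > counter[w2] or
         counter[w1] == counter[w2] and w1 < w2
end)

print(table.concat(words, ", "))    --> lua, a, the, word

The word with the largest counter ("lua") comes first; "a" and "the" have equal counters, so they appear in alphabetical order.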
Figure 11.1, “Word-frequency program” presents the complete program.
Figure 11.1. Word-frequency program
local counter = {}
for line in io.lines() do
  for word in string.gmatch(line, "%w+") do
    counter[word] = (counter[word] or 0) + 1
  end
end

local words = {}    -- list of all words found in the text
for w in pairs(counter) do
  words[#words + 1] = w
end

table.sort(words, function (w1, w2)
  return counter[w1] > counter[w2] or
         counter[w1] == counter[w2] and w1 < w2
end)

-- number of words to print
local n = math.min(tonumber(arg[1]) or math.huge, #words)

for i = 1, n do
  io.write(words[i], "\t", counter[words[i]], "\n")
end
The last loop prints the result, that is, the first n words and their respective counters. The program assumes that its first argument is the number of words to be printed; by default, it prints all words when no argument is given.
As an example, we show the result of applying this program over this book:
$ lua wordcount.lua 10 < book.of
the        5996
a          3942
to         2560
is         1907
of         1898
in         1674
we         1496
function   1478
and        1424
x          1266
Exercise 11.1: When we apply the word-frequency program to a text, usually the most frequent words are uninteresting small words like articles and prepositions. Change the program so that it ignores words with fewer than four letters.
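One possible approach (a sketch, not the only solution) is to add a length test inside the reading loop, so that short words never enter the counter table:

local counter = {}
for line in io.lines() do
  for word in string.gmatch(line, "%w+") do
    -- skip short words such as articles and prepositions
    if #word >= 4 then
      counter[word] = (counter[word] or 0) + 1
    end
  end
end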
Exercise 11.2: Repeat the previous exercise but, instead of using length as the criterion for ignoring a word, the program should read from a text file a list of words to be ignored.
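A possible sketch for this exercise reads the ignore list into a set before processing the text; the file name "ignore.txt" below is only an example, not something fixed by the exercise:

-- build a set of words to ignore; any text file with the
-- unwanted words (e.g., one per line) would do
local ignored = {}
for line in io.lines("ignore.txt") do
  for word in string.gmatch(line, "%w+") do
    ignored[word] = true
  end
end

local counter = {}
for line in io.lines() do
  for word in string.gmatch(line, "%w+") do
    if not ignored[word] then
      counter[word] = (counter[word] or 0) + 1
    end
  end
end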