19 Interlude: Markov Chain Algorithm

Our next complete program is an implementation of the Markov chain algorithm, described by Kernighan & Pike in their book The Practice of Programming (Addison-Wesley, 1999).

The program generates pseudo-random text based on what words can follow a sequence of n previous words in a base text. For this implementation, we will assume that n is two.

The first part of the program reads the base text and builds a table that, for each prefix of two words, gives a list of the words that follow that prefix in the text. After building the table, the program uses it to generate random text, wherein each word follows two previous words with the same probability as in the base text. As a result, we have text that is very, but not quite, random. For instance, when applied to this book, the output of the program has pieces like this: “Constructors can also traverse a table constructor, then the parentheses in the following line does the whole file in a field n to store the contents of each function, but to show its only argument. If you want to find the maximum element in an array can return both the maximum value and continues showing the prompt and running the code. The following words are reserved and cannot be used to convert between degrees and radians.”

To use a two-word prefix as a key in tables, we will represent it by the two words concatenated with a space in between:

      function prefix (w1, w2)
        return w1 .. " " .. w2
      end

We use the string NOWORD (a newline) to initialize the prefix words and to mark the end of the text. For instance, for the text "the more we try the more we do" the table of following words would be like this:

      { ["\n \n"] = {"the"},
        ["\n the"] = {"more"},
        ["the more"] = {"we", "we"},
        ["more we"] = {"try", "do"},
        ["we try"] = {"the"},
        ["try the"] = {"more"},
        ["we do"] = {"\n"},
      }

The program keeps its table in the variable statetab. To insert a new word in a list in this table, we use the following function:

      function insert (prefix, value)
        local list = statetab[prefix]
        if list == nil then
          statetab[prefix] = {value}
        else
          list[#list + 1] = value
        end
      end

It first checks whether that prefix already has a list; if not, it creates a new one with the new value. Otherwise, it inserts the new value at the end of the existing list.

To build the statetab table, we keep two variables, w1 and w2, with the last two words read. We read the words using the iterator allwords, from the section called “Iterators and Closures”, but we adapted the definition of “word” to include optional punctuation marks, such as commas and periods (see Figure 19.1, “Auxiliary definitions for the Markov program”). For each new word read, we add it to the list associated with w1–w2 and then update w1 and w2.

After building the table, the program starts to generate a text with MAXGEN words. First, it re-initializes variables w1 and w2. Then, for each prefix, it chooses a next word randomly from the list of valid next words, prints this word, and updates w1 and w2. Figure 19.1, “Auxiliary definitions for the Markov program” and Figure 19.2, “The Markov program” show the complete program.

Figure 19.1. Auxiliary definitions for the Markov program

      function allwords ()
        local line = io.read()    -- current line
        local pos = 1             -- current position in the line
        return function ()        -- iterator function
          while line do           -- repeat while there are lines
            local w, e = string.match(line, "(%w+[,;.:]?)()", pos)
            if w then                         -- found a word?
              pos = e                         -- update next position
              return w                        -- return the word
            else
              line = io.read()         -- word not found; try next line
              pos = 1                  -- restart from first position
            end
          end
          return nil              -- no more lines: end of traversal
        end
      end
      
      function prefix (w1, w2)
        return w1 .. " " .. w2
      end
      
      local statetab = {}
      
      function insert (prefix, value)
        local list = statetab[prefix]
        if list == nil then
          statetab[prefix] = {value}
        else
          list[#list + 1] = value
        end
      end

Figure 19.2. The Markov program

      local MAXGEN = 200
      local NOWORD = "\n"
      
      -- build table
      local w1, w2 = NOWORD, NOWORD
      for nextword in allwords() do
        insert(prefix(w1, w2), nextword)
        w1 = w2; w2 = nextword;
      end
      insert(prefix(w1, w2), NOWORD)
      
      -- generate text
      w1 = NOWORD; w2 = NOWORD     -- reinitialize
      for i = 1, MAXGEN do
        local list = statetab[prefix(w1, w2)]
        -- choose a random item from list
        local r = math.random(#list)
        local nextword = list[r]
        if nextword == NOWORD then return end
        io.write(nextword, " ")
        w1 = w2; w2 = nextword
      end

Exercises

Exercise 19.1: Generalize the Markov-chain algorithm so that it can use any size for the sequence of previous words used in the choice of the next word.