18 Iterators and the Generic for

We have been using the generic for for several tasks through the book, such as reading the lines of a file or traversing the matches of a pattern over a subject. However, we still do not know how to create our own iterators. In this chapter, we will fill this gap. Starting with simple iterators, we will learn how to use all the power of the generic for to write all kinds of iterators.

An iterator is any construction that allows us to iterate over the elements of a collection. In Lua, we typically represent iterators by functions: each time we call the function, it returns the next element from the collection. A typical example is io.read: each time we call it, it returns the next line of the standard input file, returning nil when there are no more lines to be read.

Any iterator needs to keep some state between successive calls, so that it knows where it is and how to proceed from there. For io.read, C keeps that state in its stream structure. For our own iterators, closures provide an excellent mechanism for keeping state. Remember that a closure is a function that accesses one or more local variables from its enclosing environment. These variables keep their values across successive calls to the closure, allowing the closure to remember where it is along a traversal. Of course, to create a new closure we must also create its non-local variables. Therefore, a closure construction typically involves two functions: the closure itself and a factory, the function that creates the closure plus its enclosing variables.

As an example, let us write a simple iterator for a list. Unlike ipairs, this iterator does not return the index of each element, only its value:

      function values (t)
        local i = 0
        return function ()  i = i + 1; return t[i]  end
      end

In this example, values is the factory. Each time we call this factory, it creates a new closure (the iterator itself). This closure keeps its state in its external variables t and i, which are also created by values. Each time we call the iterator, it returns a next value from the list t. After the last element the iterator returns nil, which signals the end of the iteration.

We can use this iterator in a while loop:

      t = {10, 20, 30}
      iter = values(t)           -- creates the iterator
      while true do
        local element = iter()   -- calls the iterator
        if element == nil then break end
        print(element)
      end

However, it is easier to use the generic for. After all, it was designed for this kind of iteration:

      t = {10, 20, 30}
      for element in values(t) do
        print(element)
      end

The generic for does all the bookkeeping for an iteration loop: it keeps the iterator function internally, so that we do not need the iter variable; it calls the iterator for each new iteration; and it stops the loop when the iterator returns nil. (In the next section, we will see that the generic for does even more than that.)

As a more advanced example, Figure 18.1, “Iterator to traverse all words from the standard input” shows an iterator to traverse all the words from the standard input.

To do this traversal, we keep two values: the contents of the current line (variable line), and where we are on this line (variable pos). With this data, we can always generate the next word. The main part of the iterator function is the call to string.match, which searches for a word in the current line starting at the current position. It describes a word using the pattern ’%w+’, which matches one or more alphanumeric characters. If it finds a word, it captures and returns the word and the position of the first character after it (with an empty capture). The function then updates the current position and returns this word. Otherwise, the iterator reads a new line and repeats the search. If there are no more lines, it returns nil to signal the end of the iteration.

Despite its complexity, the use of allwords is straightforward:

      for word in allwords() do
        print(word)
      end

This is a common situation with iterators: they may not be easy to write, but they are easy to use. This is not a big problem; more often than not, end users programming in Lua do not define iterators, but just use those provided by the application.

One drawback of those previous iterators is that we need to create a new closure to initialize each new loop. For many situations, this is not a real problem. For instance, in the allwords iterator, the cost of creating one single closure is negligible compared to the cost of reading a whole file. However, in some situations this overhead can be inconvenient. In such cases, we can use the generic for itself to keep the iteration state. In this section, we will see the facilities that the generic for offers to hold state.

We saw that the generic for keeps the iterator function internally, during the loop. Actually, it keeps three values: the iterator function, an invariant state, and a control variable. Let us see the details now.

The syntax for the generic for is as follows:

      for var-list in exp-list do
        body
      end

Here, var-list is a list of one or more variable names, separated by commas, and exp-list is a list of one or more expressions, also separated by commas. Usually the expression list has only one element, a call to an iterator factory. In the next code, for instance, the list of variables is k,v and the list of expressions has the single element pairs(t):

      for k, v in pairs(t) do print(k, v) end

We call the first (or only) variable in the list the control variable. Its value is never nil during the loop, because when it becomes nil the loop ends.

The first thing the for does is to evaluate the expressions after the in. These expressions should result in the three values kept by the for: the iterator function, the invariant state, and the initial value for the control variable. Like in a multiple assignment, only the last (or the only) element of the list can result in more than one value; and the number of values is adjusted to three, extra values being discarded or nils added as needed. For instance, when we use simple iterators, the factory returns only the iterator function, so the invariant state and the control variable get nil.

After this initialization step, the for calls the iterator function with two arguments: the invariant state and the control variable. From the standpoint of the for construct, the invariant state has no meaning at all. The for only passes the state value from the initialization step to all calls to the iterator function. Then the for assigns the values returned by the iterator function to the variables declared by its variable list. If the first returned value (the one assigned to the control variable) is nil, the loop terminates. Otherwise, the for executes its body and calls the iteration function again, repeating the process.

More precisely, a construction like

      for var_1, ..., var_n in explist do block end

is equivalent to the following code:

      do
        local _f, _s, _var = explist
        while true do
          local var_1, ... , var_n = _f(_s, _var)
          _var = var_1
          if _var == nil then break end
          block
        end
      end

So, if our iterator function is f, the invariant state is s, and the initial value for the control variable is a0, the control variable will loop over the values a1 = f(s, a0), a2 = f(s, a1), and so on, until ai is nil. If the for has other variables, they simply get the extra values returned by each call to f.

As the name implies, a stateless iterator is an iterator that does not keep any state by itself. Therefore, we can use the same stateless iterator in multiple loops, avoiding the cost of creating new closures.

As we just saw, the for loop calls its iterator function with two arguments: the invariant state and the control variable. A stateless iterator generates the next element for the iteration using only these two values. A typical example of this kind of iterator is ipairs, which iterates over all elements of a sequence:

      a = {"one", "two", "three"}
      for i, v in ipairs(a) do
        print(i, v)
      end

The whole state of the iteration comprises the table being traversed (the invariant state, which does not change during the loop), plus the current index (the control variable). Both ipairs (the factory) and the iterator are quite simple; we could write them in Lua as follows:

      local function iter (t, i)
        i = i + 1
        local v = t[i]
        if v then
          return i, v
        end
      end
      
      function ipairs (t)
        return iter, t, 0
      end

When Lua calls ipairs(t) in a for loop, it gets three values: the function iter as the iterator, the table t as the invariant state, and zero as the initial value for the control variable. Then, Lua calls iter(t, 0), which results in 1,t[1] (unless t[1] is already nil). In the second iteration, Lua calls iter(t, 1), which results in 2,t[2], and so on, until the first nil element.

The function pairs, which iterates over all elements of a table, is similar, except that its iterator function is next, which is a primitive function in Lua:

      function pairs (t)
        return next, t, nil
      end

The call next(t, k), where k is a key of the table t, returns a next key in the table, in an arbitrary order, plus the value associated with this key as a second return value. The call next(t, nil) returns a first pair. When there are no more pairs, next returns nil.

We might use next directly, without calling pairs:

      for k, v in next, t do
        loop body
      end

Remember that the for loop adjusts its expression list to three results, so that it gets next, t, and nil; this is exactly what it gets when it calls pairs(t).

Another interesting example of a stateless iterator is one to traverse a linked list. (Linked lists are not frequent in Lua, but sometimes we need them.) A first thought could be to use only the current node as the control variable, so that the iterator function could return its next node:

      local function getnext (node)
          return node.next
      end
      
      function traverse (list)
        return getnext, nil, list
      end

However, this implementation would skip the first node. Instead, we can use the following code:

      local function getnext (list, node)
        if not node then
          return list
        else
          return node.next
        end
      end
      
      function traverse (list)
        return getnext, list, nil
      end

The trick here is to use the first node as the invariant state (the second value returned by traverse), besides the current node as the control variable. The first time the iterator function getnext is called, node will be nil, and so the function will return list as the first node. In subsequent calls, node will not be nil, and so the iterator will return node.next, as expected.

A common confusion happens when programmers try to order the entries of a table. In a table, the entries form a set, and have no order whatsoever. If we want to order them, we have to copy the keys to an array and then sort the array.

We saw an example of this technique in the Most Frequent Words program, in Chapter 11, Interlude: Most Frequent Words. Let us see here another example. Suppose that we read a source file and build a table that gives, for each function name, the line where this function is defined; something like this:

      lines = {
        ["luaH_set"] = 10,
        ["luaH_get"] = 24,
        ["luaH_present"] = 48,
      }

Now we want to print these function names in alphabetical order. If we traverse this table with pairs, the names appear in an arbitrary order. We cannot sort them directly, because these names are keys of the table. However, when we put them into an array, then we can sort them. First, we must create an array with these names, then sort it, and finally print the result:

      a = {}
      for n in pairs(lines) do a[#a + 1] = n end
      table.sort(a)
      for _, n in ipairs(a) do print(n) end

Some people get confused here. After all, for Lua, arrays also have no order (they are tables, after all). But we know how to count! So, we impose the order, when we access the array with ordered indices. That is why we should always traverse arrays with ipairs, rather than pairs. The first function imposes the key order 1, 2, etc., whereas the latter uses the natural arbitrary order of the table (which may not be what we need, even though usually it is).

Now we are ready to write an iterator that traverses a table following the order of its keys:

      function pairsByKeys (t, f)
        local a = {}
        for n in pairs(t) do   -- create a list with all keys
          a[#a + 1] = n
        end
        table.sort(a, f)       -- sort the list
        local i = 0            -- iterator variable
        return function ()     -- iterator function
          i = i + 1
          return a[i], t[a[i]]   -- return key, value
        end
      end

The factory function pairsByKeys first collects the keys into an array, then it sorts the array, and finally it returns the iterator function. At each step, the iterator returns the next key and value from the original table, following the order in the array a. An optional parameter f allows the specification of an alternative order.

With this function, it is easy to solve our initial problem of traversing a table in order:

      for name, line in pairsByKeys(lines) do
        print(name, line)
      end

As usual, all the complexity is hidden inside the iterator.

The name iterator is a little misleading, because our iterators do not iterate: what iterates is the for loop. Iterators only provide the successive values for the iteration. Maybe a better name would be generator” —which generates elements for the iteration— but iterator is already well established in other languages, such as Java.

However, there is another way to build iterators wherein iterators actually do the iteration. When we use such iterators, we do not write a loop; instead, we simply call the iterator with an argument that describes what the iterator must do at each iteration. More specifically, the iterator receives as argument a function that it calls inside its loop.

As a concrete example, let us rewrite once more the allwords iterator using this style:

      function allwords (f)
        for line in io.lines() do
          for word in string.gmatch(line, "%w+") do
            f(word)    -- call the function
          end
        end
      end

To use this iterator, we must supply the loop body as a function. If we want only to print each word, we simply use print:

      allwords(print)

Often, we use an anonymous function as the body. For instance, the next code fragment counts how many times the word hello appears in the input file:

      local count = 0
      allwords(function (w)
        if w == "hello" then count = count + 1 end
      end)
      print(count)

The same task, written with the previous iterator style, is not very different:

      local count = 0
      for w in allwords() do
        if w == "hello" then count = count + 1 end
      end
      print(count)

True iterators were popular in older versions of Lua, when the language did not have the for statement. How do they compare with generator-style iterators? Both styles have approximately the same overhead: one function call per iteration. On the one hand, it is easier to write the iterator with true iterators (although we can recover this easiness with coroutines, as we will see in the section called “Coroutines as Iterators”). On the other hand, the generator style is more flexible. First, it allows two or more parallel iterations. (For instance, consider the problem of iterating over two files comparing them word by word.) Second, it allows the use of break and return inside the iterator body. With a true iterator, a return returns from the anonymous function, not from the function doing the iteration. For these reasons, overall I usually prefer generators.

Exercise 18.1: Write an iterator fromto such that the next loop becomes equivalent to a numeric for:

      for i in fromto(n, m) do
        body
      end

Can you implement it as a stateless iterator?

Exercise 18.2: Add a step parameter to the iterator from the previous exercise. Can you still implement it as a stateless iterator?

Exercise 18.3: Write an iterator uniquewords that returns all words from a given file without repetitions. (Hint: start with the allwords code in Figure 18.1, “Iterator to traverse all words from the standard input”; use a table to keep all words already reported.)

Exercise 18.4: Write an iterator that returns all non-empty substrings of a given string.

Exercise 18.5: Write a true iterator that traverses all subsets of a given set. (Instead of creating a new table for each subset, it can use the same table for all its results, only changing its contents between iterations.)

Personal copy of Eric Taylor <jdslkgjf.iapgjflksfg@yandex.com>