We have been using the generic for for several tasks through the book, such as reading the lines of a file or traversing the matches of a pattern over a subject. However, we still do not know how to create our own iterators. In this chapter, we will fill this gap. Starting with simple iterators, we will learn how to use all the power of the generic for to write all kinds of iterators.
An iterator is any construction that allows us to
iterate over the elements of a collection.
In Lua, we typically represent iterators by functions:
each time we call the function,
it returns the “next” element from the collection.
A typical example is io.read
:
each time we call it,
it returns the next line of the standard input file,
returning nil when there are no more lines to be read.
Any iterator needs to keep some state between
successive calls,
so that it knows where it is and how to proceed from there.
For io.read
, C keeps that state in its stream structure.
For our own iterators,
closures provide an excellent mechanism for keeping state.
Remember that a closure is a function that accesses one or
more local variables from its enclosing environment.
These variables keep their values across successive calls
to the closure,
allowing the closure to remember where it is along a traversal.
Of course, to create a new closure we must also create its
non-local variables.
Therefore, a closure construction typically involves two functions:
the closure itself
and a factory,
the function that creates the closure plus its enclosing variables.
As an example,
let us write a simple iterator for a list.
Unlike ipairs
,
this iterator does not return the index of each element,
only its value:
function values (t) local i = 0 return function () i = i + 1; return t[i] end end
In this example, values
is the factory.
Each time we call this factory,
it creates a new closure (the iterator itself).
This closure keeps its state in its external variables
t
and i
,
which are also created by values
.
Each time we call the iterator,
it returns a next value from the list t
.
After the last element the iterator returns nil,
which signals the end of the iteration.
We can use this iterator in a while loop:
t = {10, 20, 30} iter = values(t) -- creates the iterator while true do local element = iter() -- calls the iterator if element == nil then break end print(element) end
However, it is easier to use the generic for. After all, it was designed for this kind of iteration:
t = {10, 20, 30} for element in values(t) do print(element) end
The generic for does all the bookkeeping for an
iteration loop:
it keeps the iterator function internally,
so that we do not need the iter
variable;
it calls the iterator for each new iteration;
and it stops the loop when the iterator returns nil.
(In the next section,
we will see that the generic for does
even more than that.)
As a more advanced example, Figure 18.1, “Iterator to traverse all words from the standard input” shows an iterator to traverse all the words from the standard input.
Figure 18.1. Iterator to traverse all words from the standard input
function allwords () local line = io.read() -- current line local pos = 1 -- current position in the line return function () -- iterator function while line do -- repeat while there are lines local w, e = string.match(line, "(%w+)()", pos) if w then -- found a word? pos = e -- next position is after this word return w -- return the word else line = io.read() -- word not found; try next line pos = 1 -- restart from first position end end return nil -- no more lines: end of traversal end end
To do this traversal, we keep two values:
the contents of the current line (variable line
),
and where we are on this line (variable pos
).
With this data, we can always generate the next word.
The main part of the iterator function
is the call to string.match
,
which searches for a word in the current line
starting at the current position.
It describes a “word” using the pattern ’%w+
’,
which matches one or more alphanumeric characters.
If it finds a word,
it captures and returns the word
and the position of the first character after it
(with an empty capture).
The function then updates the current position
and returns this word.
Otherwise, the iterator reads a new line and repeats the search.
If there are no more lines,
it returns nil to signal the end of the iteration.
Despite its complexity, the use of allwords
is straightforward:
for word in allwords() do print(word) end
This is a common situation with iterators: they may not be easy to write, but they are easy to use. This is not a big problem; more often than not, end users programming in Lua do not define iterators, but just use those provided by the application.
One drawback of those previous iterators is that we need
to create a new closure to initialize each new loop.
For many situations, this is not a real problem.
For instance, in the allwords
iterator,
the cost of creating one single closure is negligible compared
to the cost of reading a whole file.
However, in some situations this overhead can be inconvenient.
In such cases, we can use the generic for itself
to keep the iteration state.
In this section,
we will see the facilities that the
generic for offers to hold state.
We saw that the generic for keeps the iterator function internally, during the loop. Actually, it keeps three values: the iterator function, an invariant state, and a control variable. Let us see the details now.
The syntax for the generic for is as follows:
forvar-list
inexp-list
dobody
end
Here, var-list
is a list of one or more variable names,
separated by commas,
and exp-list
is a list of one or more expressions,
also separated by commas.
Usually the expression list has only one element,
a call to an iterator factory.
In the next code,
for instance,
the list of variables is k,v
and the list of expressions has the single element pairs(t)
:
for k, v in pairs(t) do print(k, v) end
We call the first (or only) variable in the list the control variable. Its value is never nil during the loop, because when it becomes nil the loop ends.
The first thing the for does is to evaluate the expressions after the in. These expressions should result in the three values kept by the for: the iterator function, the invariant state, and the initial value for the control variable. Like in a multiple assignment, only the last (or the only) element of the list can result in more than one value; and the number of values is adjusted to three, extra values being discarded or nils added as needed. For instance, when we use simple iterators, the factory returns only the iterator function, so the invariant state and the control variable get nil.
After this initialization step, the for calls the iterator function with two arguments: the invariant state and the control variable. From the standpoint of the for construct, the invariant state has no meaning at all. The for only passes the state value from the initialization step to all calls to the iterator function. Then the for assigns the values returned by the iterator function to the variables declared by its variable list. If the first returned value (the one assigned to the control variable) is nil, the loop terminates. Otherwise, the for executes its body and calls the iteration function again, repeating the process.
More precisely, a construction like
for var_1, ..., var_n inexplist
doblock
end
is equivalent to the following code:
do local _f, _s, _var =explist
while true do local var_1, ... , var_n = _f(_s, _var) _var = var_1 if _var == nil then break endblock
end end
So, if our iterator function is f, the invariant state is s, and the initial value for the control variable is a0, the control variable will loop over the values a1 = f(s, a0), a2 = f(s, a1), and so on, until ai is nil. If the for has other variables, they simply get the extra values returned by each call to f.
As the name implies, a stateless iterator is an iterator that does not keep any state by itself. Therefore, we can use the same stateless iterator in multiple loops, avoiding the cost of creating new closures.
As we just saw,
the for loop calls its iterator function with two arguments:
the invariant state and the control variable.
A stateless iterator generates the next element for
the iteration using only these two values.
A typical example of this kind of iterator is ipairs
,
which iterates over all elements of a sequence:
a = {"one", "two", "three"} for i, v in ipairs(a) do print(i, v) end
The whole state of the iteration comprises the table being traversed
(the invariant state,
which does not change during the loop),
plus the current index (the control variable).
Both ipairs
(the factory) and the iterator are quite simple;
we could write them in Lua as follows:
local function iter (t, i) i = i + 1 local v = t[i] if v then return i, v end end function ipairs (t) return iter, t, 0 end
When Lua calls ipairs(t)
in a for loop,
it gets three values:
the function iter
as the iterator,
the table t
as the invariant state,
and zero as the initial value for the control variable.
Then, Lua calls iter(t, 0)
,
which results in 1,t[1]
(unless t[1]
is already nil).
In the second iteration, Lua calls iter(t, 1)
,
which results in 2,t[2]
, and so on,
until the first nil element.
The function pairs
,
which iterates over all elements of a table, is similar,
except that its iterator function is next
,
which is a primitive function in Lua:
function pairs (t) return next, t, nil end
The call next(t, k)
,
where k
is a key of the table t
,
returns a next key in the table,
in an arbitrary order,
plus the value associated with this key
as a second return value.
The call next(t, nil)
returns a first pair.
When there are no more pairs, next
returns nil.
We might use next
directly,
without calling pairs
:
for k, v in next, t do
loop body
end
Remember that the for loop adjusts its
expression list to three results,
so that it gets next
, t
, and nil;
this is exactly what it gets when it calls pairs(t)
.
Another interesting example of a stateless iterator is one to traverse a linked list. (Linked lists are not frequent in Lua, but sometimes we need them.) A first thought could be to use only the current node as the control variable, so that the iterator function could return its next node:
local function getnext (node) return node.next end function traverse (list) return getnext, nil, list end
However, this implementation would skip the first node. Instead, we can use the following code:
local function getnext (list, node) if not node then return list else return node.next end end function traverse (list) return getnext, list, nil end
The trick here is to use the first node as the invariant state
(the second value returned by traverse
),
besides the current node as the control variable.
The first time the iterator function getnext
is called,
node
will be nil,
and so the function will return list
as the first node.
In subsequent calls, node
will not be nil,
and so the iterator will return node.next
,
as expected.
A common confusion happens when programmers try to order the entries of a table. In a table, the entries form a set, and have no order whatsoever. If we want to order them, we have to copy the keys to an array and then sort the array.
We saw an example of this technique in the “Most Frequent Words” program, in Chapter 11, Interlude: Most Frequent Words. Let us see here another example. Suppose that we read a source file and build a table that gives, for each function name, the line where this function is defined; something like this:
lines = { ["luaH_set"] = 10, ["luaH_get"] = 24, ["luaH_present"] = 48, }
Now we want to print these function names in alphabetical order.
If we traverse this table with pairs
,
the names appear in an arbitrary order.
We cannot sort them directly,
because these names are keys of the table.
However, when we put them into an array,
then we can sort them.
First, we must create an array with these names, then sort it,
and finally print the result:
a = {} for n in pairs(lines) do a[#a + 1] = n end table.sort(a) for _, n in ipairs(a) do print(n) end
Some people get confused here.
After all, for Lua, arrays also have no order
(they are tables, after all).
But we know how to count!
So, we impose the order,
when we access the array with ordered indices.
That is why we should always traverse arrays
with ipairs
, rather than pairs
.
The first function imposes the key order 1, 2, etc.,
whereas the latter uses the natural arbitrary order of the table
(which may not be what we need,
even though usually it is).
Now we are ready to write an iterator that traverses a table following the order of its keys:
function pairsByKeys (t, f) local a = {} for n in pairs(t) do -- create a list with all keys a[#a + 1] = n end table.sort(a, f) -- sort the list local i = 0 -- iterator variable return function () -- iterator function i = i + 1 return a[i], t[a[i]] -- return key, value end end
The factory function pairsByKeys
first collects the keys into an array,
then it sorts the array,
and finally it returns the iterator function.
At each step, the iterator returns
the next key and value from the original table,
following the order in the array a
.
An optional parameter f
allows the specification of an
alternative order.
With this function, it is easy to solve our initial problem of traversing a table in order:
for name, line in pairsByKeys(lines) do print(name, line) end
As usual, all the complexity is hidden inside the iterator.
The name “iterator” is a little misleading, because our iterators do not iterate: what iterates is the for loop. Iterators only provide the successive values for the iteration. Maybe a better name would be “generator” —which generates elements for the iteration— but “iterator” is already well established in other languages, such as Java.
However, there is another way to build iterators wherein iterators actually do the iteration. When we use such iterators, we do not write a loop; instead, we simply call the iterator with an argument that describes what the iterator must do at each iteration. More specifically, the iterator receives as argument a function that it calls inside its loop.
As a concrete example,
let us rewrite once more the allwords
iterator using this style:
function allwords (f) for line in io.lines() do for word in string.gmatch(line, "%w+") do f(word) -- call the function end end end
To use this iterator, we must supply the loop body as a function.
If we want only to print each word,
we simply use print
:
allwords(print)
Often, we use an anonymous function as the body. For instance, the next code fragment counts how many times the word “hello” appears in the input file:
local count = 0 allwords(function (w) if w == "hello" then count = count + 1 end end) print(count)
The same task, written with the previous iterator style, is not very different:
local count = 0 for w in allwords() do if w == "hello" then count = count + 1 end end print(count)
True iterators were popular in older versions of Lua, when the language did not have the for statement. How do they compare with generator-style iterators? Both styles have approximately the same overhead: one function call per iteration. On the one hand, it is easier to write the iterator with true iterators (although we can recover this easiness with coroutines, as we will see in the section called “Coroutines as Iterators”). On the other hand, the generator style is more flexible. First, it allows two or more parallel iterations. (For instance, consider the problem of iterating over two files comparing them word by word.) Second, it allows the use of break and return inside the iterator body. With a true iterator, a return returns from the anonymous function, not from the function doing the iteration. For these reasons, overall I usually prefer generators.
Exercise 18.1:
Write an iterator fromto
such that
the next loop becomes equivalent to a numeric for:
for i in fromto(n, m) do
body
end
Can you implement it as a stateless iterator?
Exercise 18.2: Add a step parameter to the iterator from the previous exercise. Can you still implement it as a stateless iterator?
Exercise 18.3:
Write an iterator uniquewords
that returns
all words from a given file without repetitions.
(Hint: start with the allwords
code in Figure 18.1, “Iterator to traverse all words from the standard input”;
use a table to keep all words already reported.)
Exercise 18.4: Write an iterator that returns all non-empty substrings of a given string.
Exercise 18.5: Write a true iterator that traverses all subsets of a given set. (Instead of creating a new table for each subset, it can use the same table for all its results, only changing its contents between iterations.)
Personal copy of Eric Taylor <jdslkgjf.iapgjflksfg@yandex.com>