14 Data Structures

Tables in Lua are not a data structure; they are the data structure. We can represent all structures that other languages offer —arrays, records, lists, queues, sets— with tables in Lua. Moreover, Lua tables implement all these structures efficiently.

In more traditional languages, such as C and Pascal, we implement most data structures with arrays and lists (where lists = records + pointers). Although we can implement arrays and lists using Lua tables (and sometimes we do this), tables are more powerful than arrays and lists; many algorithms are simplified to the point of triviality with the use of tables. For instance, we seldom write a search in Lua, because tables offer direct access to any type.

It takes a while to learn how to use tables efficiently. Here, we will see how to implement typical data structures with tables and cover some examples of their uses. We will start with arrays and lists, not because we need them for the other structures, but because most programmers are already familiar with them. (We have already seen the basics of this material in Chapter 5, Tables, but I will repeat it here for completeness.) Then we will continue with more advanced examples, such as sets, bags, and graphs.

We implement arrays in Lua simply by indexing tables with integers. Therefore, arrays do not have a fixed size, but grow as needed. Usually, when we initialize the array we define its size indirectly. For instance, after the following code, any attempt to access a field outside the range 1–1000 will return nil, instead of zero:

      local a = {}    -- new array
      for i = 1, 1000 do
        a[i] = 0
      end

The length operator (#) uses this fact to find the size of an array:

      print(#a)        --> 1000

We can start an array at index zero, one, or any other value:

      -- create an array with indices from -5 to 5
      a = {}
      for i = -5, 5 do
        a[i] = 0
      end

However, it is customary in Lua to start arrays with index one. The Lua libraries adhere to this convention; so does the length operator. If our arrays do not start with one, we will not be able to use these facilities.

We can use a constructor to create and initialize arrays in a single expression:

      squares = {1, 4, 9, 16, 25, 36, 49, 64, 81}

Such constructors can be as large as we need. In Lua, it is not uncommon data-description files with constructors with a few million elements.

There are two main ways to represent matrices in Lua. The first one is with a jagged array (an array of arrays), that is, a table wherein each element is another table. For instance, we can create a matrix of zeros with dimensions N by M with the following code:

      local mt = {}          -- create the matrix
      for i = 1, N do
        local row = {}       -- create a new row
        mt[i] = row
        for j = 1, M do
          row[j] = 0
        end
      end

Because tables are objects in Lua, we have to create each row explicitly to build a matrix. On the one hand, this is certainly more verbose than simply declaring a matrix, as we do in C. On the other hand, it gives us more flexibility. For instance, we can create a triangular matrix by changing the inner loop in the previous example to for j=1,i do ... end. With this code, the triangular matrix uses only half the memory of the original one.

The second way to represent a matrix is by composing the two indices into a single one. Typically, we do this by multiplying the first index by a suitable constant and then adding the second index. With this approach, the following code would create our matrix of zeros with dimensions N by M:

      local mt = {}          -- create the matrix
      for i = 1, N do
        local aux = (i - 1) * M
        for j = 1, M do
          mt[aux + j] = 0
        end
      end

Quite often, applications use a sparse matrix, a matrix wherein most elements are zero or nil. For instance, we can represent a graph by its adjacency matrix, which has the value x in position (m,n) when there is a connection with cost x between nodes m and n. When these nodes are not connected, the value in position (m,n) is nil. To represent a graph with ten thousand nodes, where each node has about five neighbors, we will need a matrix with a hundred million entries (a square matrix with 10000 columns and 10000 rows), but approximately only fifty thousand of them will not be nil (five non-nil columns for each row, corresponding to the five neighbors of each node). Many books on data structures discuss at length how to implement such sparse matrices without wasting 800 MB of memory, but we seldom need these techniques when programming in Lua. Because we represent arrays with tables, they are naturally sparse. With our first representation (tables of tables), we will need ten thousand tables, each one with about five elements, with a grand total of fifty thousand entries. With the second representation, we will have a single table, with fifty thousand entries in it. Whatever the representation, we need space only for the non-nil elements.

We cannot use the length operator over sparse matrices, because of the holes (nil values) between active entries. This is not a big loss; even if we could use it, we probably would not. For most operations, it would be quite inefficient to traverse all those empty entries. Instead, we can use pairs to traverse only the non-nil elements. As an example, let us see how to do matrix multiplication with sparse matrices represented by jagged arrays.

Suppose we want to multiply a matrix a[M,K] by a matrix b[K,N], producing the matrix c[M,N]. The usual matrix-multiplication algorithm goes like this:

      for i = 1, M do
        for j = 1, N do
          c[i][j] = 0
          for k = 1, K do
            c[i][j] = c[i][j] + a[i][k] * b[k][j]
          end
        end
      end

The two outer loops traverse the entire resulting matrix, and for each element, the inner loop computes its value.

For sparse matrices with jagged arrays, this inner loop is a problem. Because it traverses a column of b, instead of a row, we cannot use something like pairs here: the loop has to visit each row looking whether that row has an element in that column. Instead of visiting only a few non-zero elements, the loop visits all zero elements, too. (Traversing a column can be an issue in other contexts, too, because of its loss of spatial locality.)

The following algorithm is quite similar to the previous one, but it reverses the order of the two inner loops. With this simple change, it avoids traversing columns:

      -- assumes 'c' has zeros in all elements
      for i = 1, M do
        for k = 1, K do
          for j = 1, N do
            c[i][j] = c[i][j] + a[i][k] * b[k][j]
          end
        end
      end

Now, the middle loop traverses the row a[i], and the inner loop traverses the row b[k]. Both can use pairs, visiting only the non-zero elements. The initialization of the resulting matrix c is not an issue here, because an empty sparse matrix is naturally filled with zeros.

Figure 14.1, “Multiplication of sparse matrices” shows the complete implementation of the above algorithm, using pairs and taking care of sparse entries. This implementation visits only the non-nil elements, and the result is naturally sparse. Moreover, the code deletes resulting entries that by chance evaluate to zero.

Because tables are dynamic entities, it is easy to implement linked lists in Lua. We represent each node with a table (what else?); links are simply table fields that contain references to other tables. For instance, let us implement a singly-linked list, where each node has two fields, value and next. A simple variable is the list root:

      list = nil

To insert an element at the beginning of the list, with a value v, we do this:

      list = {next = list, value = v}

To traverse the list, we write this:

      local l = list
      while l do
        visit l.value
        l = l.next
      end

We can also implement easily other kinds of lists, such as doubly-linked lists or circular lists. However, we seldom need those structures in Lua, because usually there is a simpler way to represent our data without using linked lists. For instance, we can represent a stack with an (unbounded) array.

A simple way to implement queues in Lua is with functions insert and remove from the table library. As we saw in the section called “The Table Library”, these functions insert and remove elements in any position of an array, moving other elements to accommodate the operation. However, these moves can be expensive for large structures. A more efficient implementation uses two indices, one for the first element and another for the last. With that representation, we can insert or remove an element at both ends in constant time, as shown in Figure 14.2, “A double-ended queue”.

If we use this structure in a strict queue discipline, calling only pushLast and popFirst, both first and last will increase continually. However, because we represent arrays in Lua with tables, we can index them either from 1 to 20 or from 16777201 to 16777220. With 64-bit integers, such a queue can run for thirty thousand years, doing ten million insertions per second, before it has problems with overflows.

As I said, before, we seldom do searches in Lua. Instead, we use what we call an index table or a reverse table.

Suppose we have a table with the names of the days of the week:

      days = {"Sunday", "Monday", "Tuesday", "Wednesday",
              "Thursday", "Friday", "Saturday"}

Now we want to translate a name into its position in the week. We can search the table, looking for the given name. A more efficient approach, however, is to build a reverse table, say revDays, which has the names as indices and the numbers as values. This table would look like this:

      revDays = {["Sunday"] = 1,   ["Monday"] = 2,
                 ["Tuesday"] = 3,  ["Wednesday"] = 4,
                 ["Thursday"] = 5, ["Friday"] = 6,
                 ["Saturday"] = 7}

Then, all we have to do to find the order of a name is to index this reverse table:

      x = "Tuesday"
      print(revDays[x])    --> 3

Of course, we do not need to declare the reverse table manually. We can build it automatically from the original one:

      revDays = {}
      for k,v in pairs(days) do
        revDays[v] = k
      end

The loop will do the assignment for each element of days, with the variable k getting the keys (1, 2, ...) and v the values ("Sunday", "Monday", ...).

Suppose we want to list all identifiers used in a program source; for that, we need to filter the reserved words out of our listing. Some C programmers could be tempted to represent the set of reserved words as an array of strings and search this array to know whether a given word is in the set. To speed up the search, they could even use a binary tree to represent the set.

In Lua, an efficient and simple way to represent such sets is to put the set elements as indices in a table. Then, instead of searching the table for a given element, we just index the table and test whether the result is nil. In our example, we could write the following code:

      reserved = {
        ["while"] = true,     ["if"] = true,
        ["else"] = true,      ["do"] = true,
      }
      
      for w in string.gmatch(s, "[%a_][%w_]*") do
        if not reserved[w] then
          do something with 'w'   -- 'w' is not a reserved word
        end
      end

(In the definition of reserved, we cannot write while = true, because while is not a valid name in Lua. Instead, we use the notation ["while"] = true.)

We can have a clearer initialization using an auxiliary function to build the set:

      function Set (list)
        local set = {}
        for _, l in ipairs(list) do set[l] = true end
        return set
      end
      
      reserved = Set{"while", "end", "function", "local", }

We can also use another set to collect the identifiers:

      local ids = {}
      for w in string.gmatch(s, "[%a_][%w_]*") do
        if not reserved[w] then
          ids[w] = true
        end
      end
      
      -- print each identifier once
      for w in pairs(ids) do print(w) end

Bags, also called multisets, differ from regular sets in that each element can appear multiple times. An easy representation for bags in Lua is similar to the previous representation for sets, but with a counter associated with each key.[14] To insert an element, we increment its counter:

      function insert (bag, element)
        bag[element] = (bag[element] or 0) + 1
      end

To remove an element, we decrement its counter:

      function remove (bag, element)
        local count = bag[element]
        bag[element] = (count and count > 1) and count - 1 or nil
      end

We only keep the counter if it already exists and it is still greater than zero.

Suppose we are building a string piecemeal, for instance reading a file line by line. Our typical code could look like this:

      local buff = ""
      for line in io.lines() do
        buff = buff .. line .. "\n"
      end

Despite its innocent look, this code in Lua can cause a huge performance penalty for large files: for instance, it takes more than 30 seconds to read a 4.5 MB file on my new machine.

Why is that? To understand what happens, let us imagine that we are in the middle of the read loop; each line has 20 bytes and we have already read some 2500 lines, so buff is a string with 50 kB. When Lua concatenates buff..line.."\n", it allocates a new string with 50020 bytes and copies the 50000 bytes from buff into this new string. That is, for each new line, Lua moves around 50 kB of memory, and growing. The algorithm is quadratic. After reading 100 new lines (only 2 kB), Lua has already moved more than 5 MB of memory. When Lua finishes reading 350 kB, it has moved around more than 50 GB. (This problem is not peculiar to Lua: other languages wherein strings are immutable values present a similar behavior, Java being a famous example.)

Before we continue, we should remark that, despite all I said, this situation is not a common problem. For small strings, the above loop is fine. To read an entire file, Lua provides the io.read("a") option, which reads the file at once. However, sometimes we must face this problem. Java offers the StringBuffer class to ameliorate the problem. In Lua, we can use a table as the string buffer. The key to this approach is the function table.concat, which returns the concatenation of all the strings of a given list. Using concat, we can write our previous loop as follows:

      local t = {}
      for line in io.lines() do
        t[#t + 1] = line .. "\n"
      end
      local s = table.concat(t)

This algorithm takes less than 0.05 seconds to read the same file that took more than half a minute to read with the original code. (Nevertheless, for reading a whole file it is still better to use io.read with the "a" option.)

We can do even better. The function concat accepts an optional second argument, which is a separator to be inserted between the strings. Using this separator, we do not need to insert a newline after each line:

      local t = {}
      for line in io.lines() do
        t[#t + 1] = line
      end
      s = table.concat(t, "\n") .. "\n"

The function inserts the separator between the strings, but we still have to add the last newline. This last concatenation creates a new copy of the resulting string, which can be quite long. There is no option to make concat insert this extra separator, but we can deceive it, inserting an extra empty string in t:

      t[#t + 1] = ""
      s = table.concat(t, "\n")

Now, the extra newline that concat adds before this empty string is at the end of the resulting string, as we wanted.

Like any decent language, Lua allows multiple implementations for graphs, each one better adapted to some particular algorithms. Here we will see a simple object-oriented implementation, where we represent nodes as objects (actually tables, of course) and arcs as references between nodes.

We will represent each node as a table with two fields: name, with the node’s name; and adj, with the set of nodes adjacent to this one. Because we will read the graph from a text file, we need a way to find a node given its name. So, we will use an extra table mapping names to nodes. Given a name, function name2node returns the corresponding node:

      local function name2node (graph, name)
        local node = graph[name]
        if not node then
          -- node does not exist; create a new one
          node = {name = name, adj = {}}
          graph[name] = node
        end
        return node
      end

Figure 14.3, “Reading a graph from a file” shows the function that builds a graph.

It reads a file where each line has two node names, meaning that there is an arc from the first node to the second. For each line, the function uses string.match to split the line in two names, finds the nodes corresponding to these names (creating the nodes if needed), and connects the nodes.

Figure 14.4, “Finding a path between two nodes” illustrates an algorithm using such graphs.

The function findpath searches for a path between two nodes using a depth-first traversal. Its first parameter is the current node; the second is its goal; the third parameter keeps the path from the origin to the current node; the last parameter is a set with all the nodes already visited, to avoid loops. Note how the algorithm manipulates nodes directly, without using their names. For instance, visited is a set of nodes, not of node names. Similarly, path is a list of nodes.

To test this code, we add a function to print a path and some code to put it all to work:

      function printpath (path)
        for i = 1, #path do
          print(path[i].name)
        end
      end
      
      g = readgraph()
      a = name2node(g, "a")
      b = name2node(g, "b")
      p = findpath(a, b)
      if p then printpath(p) end

Exercise 14.1: Write a function to add two sparse matrices.

Exercise 14.2: Modify the queue implementation in Figure 14.2, “A double-ended queue” so that both indices return to zero when the queue is empty.

Exercise 14.3: Modify the graph structure so that it can keep a label for each arc. The structure should represent each arc by an object, too, with two fields: its label and the node it points to. Instead of an adjacent set, each node keeps an incident set that contains the arcs that originate at that node.

Adapt the function readgraph to read two node names plus a label from each line in the input file. (Assume that the label is a number.)

Exercise 14.4: Assume the graph representation of the previous exercise, where the label of each arc represents the distance between its end nodes. Write a function to find the shortest path between two given nodes, using Dijkstra’s algorithm.



[14] We already used this representation for the most-frequent-words program in Chapter 11, Interlude: Most Frequent Words.

Personal copy of Eric Taylor <jdslkgjf.iapgjflksfg@yandex.com>