15 Data Files and Serialization

When dealing with data files, it is usually much easier to write the data than to read it back. When we write a file, we have full control of what is going on. When we read a file, on the other hand, we do not know what to expect. Besides all the kinds of data that a correct file can contain, a robust program should also handle bad files gracefully. Therefore, coding robust input routines is always difficult. In this chapter, we will see how we can use Lua to eliminate all code for reading data from our programs, simply by writing the data in an appropriate format. More specifically, we write data as Lua programs that, when run, rebuild the data.

Data description has been one of the main applications of Lua since its creation in 1993. At that time, the main alternative for a textual data-description language would be SGML. For many people (including us), SGML is bloated and complex. In 1998, some people simplified it to create XML, which in our view is still bloated and complex. Other people shared our view, and some of them created JSON (in 2001). JSON is based on Javascript and quite similar to restricted Lua data files. On the one hand, JSON has a big advantage of being an international standard, and several languages (including Lua) have libraries to manipulate JSON files. On the other hand, Lua files are trivial to read and more flexible.

Using a full programming language for data description is surely flexible, but it brings two problems. One is security, as data files can run amok inside our program. We can solve that by running the file in a sandbox, which we will discuss in the section called “Sandboxing”.

The other problem is performance. Lua not only runs fast, but it also compiles fast. For instance, in my new machine, Lua 5.3 reads, compiles, and runs a program with ten million assignments in four seconds, using 240 MB. For comparison, Perl 5.18 takes 21 seconds and 6 GB, Python 2.7 and Python 3.4 trash the machine, Node.js 0.10.25 gives an out of memory error after eight seconds, and Rhino 1.7 also gives an out of memory error, after six minutes.

Table constructors provide an interesting alternative for file formats. With a little extra work when writing data, reading becomes trivial. The technique is to write our data file as Lua code that, when run, rebuilds the data into the program. With table constructors, these chunks can look remarkably like a plain data file.

Let us see an example to make things clear. If our data file is in a predefined format, such as CSV (Comma-Separated Values) or XML, we have little choice. However, if we are going to create the file for our own use, we can use Lua constructors as our format. In this format, we represent each data record as a Lua constructor. Instead of writing in our data file something like

      Donald E. Knuth,Literate Programming,CSLI,1992
      Jon Bentley,More Programming Pearls,Addison-Wesley,1990

we write this:

      Entry{"Donald E. Knuth",
            "Literate Programming",
            "CSLI",
            1992}
      
      Entry{"Jon Bentley",
            "More Programming Pearls",
            "Addison-Wesley",
            1990}

Remember that Entry{code} is the same as Entry({code}), that is, a call to some function Entry with a table as its single argument. So, that previous piece of data is a Lua program. To read that file, we only need to run it, with a sensible definition for Entry. For instance, the following program counts the number of entries in a data file:

      local count = 0
      function Entry () count = count + 1 end
      dofile("data")
      print("number of entries: " .. count)

The next program collects in a set the names of all authors found in the file, and then prints them:

      local authors = {}      -- a set to collect authors
      function Entry (b) authors[b[1]] = true end
      dofile("data")
      for name in pairs(authors) do print(name) end

Note the event-driven approach in these program fragments: the function Entry acts as a callback function, which is called during the dofile for each entry in the data file.

When file size is not a big concern, we can use name-value pairs for our representation:[15]

      Entry{
        author = "Donald E. Knuth",
        title = "Literate Programming",
        publisher = "CSLI",
        year = 1992
      }
      
      Entry{
        author = "Jon Bentley",
        title = "More Programming Pearls",
        year = 1990,
        publisher = "Addison-Wesley",
      }

This format is what we call a self-describing data format, because each piece of data has attached to it a short description of its meaning. Self-describing data are more readable (by humans, at least) than CSV or other compact notations; they are easy to edit by hand, when necessary; and they allow us to make small modifications in the basic format without having to change the data file. For instance, if we add a new field we need only a small change in the reading program, so that it supplies a default value when the field is absent.

With the name-value format, our program to collect authors becomes this:

      local authors = {}      -- a set to collect authors
      function Entry (b) authors[b.author] = true end
      dofile("data")
      for name in pairs(authors) do print(name) end

Now the order of fields is irrelevant. Even if some entries do not have an author, we have to adapt only the function Entry:

      function Entry (b)
        authors[b.author or "unknown"] = true
      end

Frequently we need to serialize some data, that is, to convert the data into a stream of bytes or characters, so that we can save it into a file or send it through a network connection. We can represent serialized data as Lua code in such a way that, when we run the code, it reconstructs the saved values into the reading program.

Usually, if we want to restore the value of a global variable, our chunk will be something like varname = exp, where exp is the Lua code to create the value. The varname is the easy part, so let us see how to write the code that creates a value. For a numeric value, the task is easy:

      function serialize (o)
        if type(o) == "number" then
          io.write(tostring(o))
        else other cases
        end
      end

By writing a float in decimal format, however, we risk losing some precision. We can use a hexadecimal format to avoid this problem. With format ("%a"), the read float will have exactly the same bits of the original one. Moreover, since Lua 5.3 we should distinguish between integers and floats, so that they can be restored with the correct subtype:

      local fmt = {integer = "%d", float = "%a"}
      
      function serialize (o)
        if type(o) == "number" then
          io.write(string.format(fmt[math.type(o)], o))
        else other cases

For a string value, a naive approach would be something like this:

      if type(o) == "string" then
        io.write("'", o, "'")

However, if the string contains special characters (such as quotes or newlines) the resulting code will not be a valid Lua program.

You may be tempted to solve this problem changing quotes:

      if type(o) == "string" then
        io.write("[[", o, "]]")

Beware of code injection! If a malicious user manages to direct your program to save something like " ]]..os.execute('rm *')..[[ " (for instance, she can supply this string as her address), your final chunk will be like this one:

      varname = [[ ]]..os.execute('rm *')..[[ ]]

You will have a bad surprise trying to load this data.

A simple way to quote a string in a secure way is with the option "%q" from string.format. This option was designed to save the string in a way that it can be safely read back by Lua. It surrounds the string with double quotes and properly escapes double quotes, newlines, and some other characters inside the string:

      a = 'a "problematic" \\string'
      print(string.format("%q", a))    --> "a \"problematic\" \\string"

Using this feature, our serialize function now looks like this:

      function serialize (o)
        if type(o) == "number" then
          io.write(string.format(fmt[math.type(o)], o))
        elseif type(o) == "string" then
          io.write(string.format("%q", o))
        else other cases
        end
      end

Lua 5.3.3 extended the format option "%q" to work also with numbers (plus nil and Booleans), again writing them in a proper way to be read back by Lua. (In particular, it formats floats in hexadecimal, to ensure full precision.) Thus, since that version, we can simplify and extend serialize even more:

      function serialize (o)
        local t = type(o)
        if t == "number" or t == "string" or t == "boolean" or
           t == "nil" then
          io.write(string.format("%q", o))
        else other cases
        end
      end

Another way to save strings is the notation [=[...]=] for long strings. However, this notation is mainly intended for hand-written code, where we do not want to change a literal string in any way. In automatically generated code, it is easier to escape problematic characters, as the option "%q" from string.format does.

If you nevertheless want to use the long-string notation for automatically generated code, you must take care of some details. The first one is that you must choose a proper number of equals signs. A good proper number is one more than the maximum that appears in the original string. Because strings containing long sequences of equals signs are common (e.g., comments delimiting parts of a source code), we should limit our attention to sequences of equals signs enclosed by square bracket. The second detail is that Lua always ignores a newline at the beginning of a long string; a simple way to avoid this problem is to add always a newline to be ignored.

The function quote in Figure 15.1, “Quoting arbitrary literal strings” is the result of our previous remarks.

It takes an arbitrary string and returns it formatted as a long string. The call to gmatch creates an iterator to traverse all occurrences of the pattern ’]=*%f[%]]’ (that is, a closing square bracket followed by a sequence of zero or more equals signs followed by a frontier with a closing square bracket) in the string s. For each occurrence, the loop updates n with the maximum number of equals signs so far. After the loop, we use string.rep to replicate an equals sign n + 1 times, which is one more than the maximum occurring in the string. Finally, string.format encloses s with pairs of brackets with the correct number of equals signs in between and adds extra spaces around the quoted string plus a newline at the beginning of the enclosed string.

(We might be tempted to use the simpler pattern ’]=*]’, which does not use a frontier pattern for the second square bracket, but there is a subtlety here. Suppose the subject is "]=]==]". The first match is "]=]". After it, what is left in the string is "==]", and so there is no other match; in the end of the loop, n would be one instead of two. The frontier pattern does not consume the bracket, so that it remains in the subject for the following matches.)

Our next (and harder) task is to save tables. There are several ways to save them, according to what assumptions we make about the table structure. No single algorithm seems appropriate for all cases. Simple tables not only can use simpler algorithms, but also the output can be shorter and clearer.

Our first attempt is in Figure 15.2, “Serializing tables without cycles”.

Despite its simplicity, that function does a reasonable job. It even handles nested tables (that is, tables within other tables), as long as the table structure is a tree —that is, there are no shared subtables and no cycles. (A small aesthetic improvement would be to indent nested tables; see Exercise 15.1.)

The previous function assumes that all keys in a table are valid identifiers. If a table has numeric keys, or string keys that are not syntactic valid Lua identifiers, we are in trouble. A simple way to solve this difficulty is to use the following code to write each key:

      io.write(string.format(" [%s] = ", serialize(k)))

With this change, we improve the robustness of our function, at the cost of the aesthetics of the resulting file. Consider the next call:

      serialize{a=12, b='Lua', key='another "one"'}

The result of this call using the first version of serialize is this:

      {
        a = 12,
        b = "Lua",
        key = "another \"one\"",
      }

Compare it with the result using the second version:

      {
        ["a"] = 12,
        ["b"] = "Lua",
        ["key"] = "another \"one\"",
      }

We can have both robustness and aesthetics by testing for each case whether it needs the square brackets; again, we will leave this improvement as an exercise.

To handle tables with generic topology (i.e., with cycles and shared subtables) we need a different approach. Constructors cannot create such tables, so we will not use them. To represent cycles we need names, so our next function will get as arguments the value to be saved plus its name. Moreover, we must keep track of the names of the tables already saved, to reuse them when we detect a cycle. We will use an extra table for this tracking. This table will have previously saved tables as indices and their names as the associated values.

The resulting code is in Figure 15.3, “Saving tables with cycles”.

We keep the restriction that the tables we want to save have only strings and numbers as keys. The function basicSerialize serializes these basic types, returning the result. The next function, save, does the hard work. The saved parameter is the table that keeps track of tables already saved. As an example, suppose we build a table like this:

      a = {x=1, y=2; {3,4,5}}
      a[2] = a    -- cycle
      a.z = a[1]  -- shared subtable

The call save("a", a) will save it as follows:

      a = {}
      a[1] = {}
      a[1][1] = 3
      a[1][2] = 4
      a[1][3] = 5
      
      a[2] = a
      a["y"] = 2
      a["x"] = 1
      a["z"] = a[1]

The actual order of these assignments may vary, as it depends on a table traversal. Nevertheless, the algorithm ensures that any node needed in a new definition is already defined.

If we want to save several values with shared parts, we can make the calls to save them using the same saved table. For instance, assume the following two tables:

      a = {{"one", "two"}, 3}
      b = {k = a[1]}

If we save them independently, the result will not have common parts. However, if we use the same saved table for both calls to save, then the result will share common parts:

      local t = {}
      save("a", a, t)
      save("b", b, t)
      
        --> a = {}
        --> a[1] = {}
        --> a[1][1] = "one"
        --> a[1][2] = "two"
        --> a[2] = 3
        --> b = {}
        --> b["k"] = a[1]

As is usual in Lua, there are several other alternatives. Among them, we can save a value without giving it a global name (instead, the chunk builds a local value and returns it), we can use the list syntax when possible (see the exercises for this chapter), and so on. Lua gives you the tools; you build the mechanisms.

Exercise 15.1: Modify the code in Figure 15.2, “Serializing tables without cycles” so that it indents nested tables. (Hint: add an extra parameter to serialize with the indentation string.)

Exercise 15.2: Modify the code of the previous exercise so that it uses the syntax ["key"]=value, as suggested in the section called “Saving tables without cycles”.

Exercise 15.3: Modify the code of the previous exercise so that it uses the syntax ["key"]=value only when necessary (that is, when the key is a string but not a valid identifier).

Exercise 15.4: Modify the code of the previous exercise so that it uses the constructor syntax for lists whenever possible. For instance, it should serialize the table {14, 15, 19} as {14, 15, 19}, not as {[1] = 14, [2] = 15, [3] = 19}. (Hint: start by saving the values of the keys 1, 2, ..., as long as they are not nil. Take care not to save them again when traversing the rest of the table.)

Exercise 15.5: The approach of avoiding constructors when saving tables with cycles is too radical. It is possible to save the table in a more pleasant format using constructors for the simple case, and to use assignments later only to fix sharing and loops. Reimplement the function save (Figure 15.3, “Saving tables with cycles”) using this approach. Add to it all the goodies that you have implemented in the previous exercises (indentation, record syntax, and list syntax).



Personal copy of Eric Taylor <jdslkgjf.iapgjflksfg@yandex.com>