When dealing with data files, it is usually much easier to write the data than to read it back. When we write a file, we have full control of what is going on. When we read a file, on the other hand, we do not know what to expect. Besides all the kinds of data that a correct file can contain, a robust program should also handle bad files gracefully. Therefore, coding robust input routines is always difficult. In this chapter, we will see how we can use Lua to eliminate all code for reading data from our programs, simply by writing the data in an appropriate format. More specifically, we write data as Lua programs that, when run, rebuild the data.
Data description has been one of the main applications of Lua since its creation in 1993. At that time, the main alternative for a textual data-description language would be SGML. For many people (including us), SGML is bloated and complex. In 1998, some people simplified it to create XML, which in our view is still bloated and complex. Other people shared our view, and some of them created JSON (in 2001). JSON is based on Javascript and quite similar to restricted Lua data files. On the one hand, JSON has a big advantage of being an international standard, and several languages (including Lua) have libraries to manipulate JSON files. On the other hand, Lua files are trivial to read and more flexible.
Using a full programming language for data description is surely flexible, but it brings two problems. One is security, as “data” files can run amok inside our program. We can solve that by running the file in a sandbox, which we will discuss in the section called “Sandboxing”.
The other problem is performance. Lua not only runs fast, but it also compiles fast. For instance, in my new machine, Lua 5.3 reads, compiles, and runs a program with ten million assignments in four seconds, using 240 MB. For comparison, Perl 5.18 takes 21 seconds and 6 GB, Python 2.7 and Python 3.4 trash the machine, Node.js 0.10.25 gives an “out of memory” error after eight seconds, and Rhino 1.7 also gives an “out of memory” error, after six minutes.
Table constructors provide an interesting alternative for file formats. With a little extra work when writing data, reading becomes trivial. The technique is to write our data file as Lua code that, when run, rebuilds the data into the program. With table constructors, these chunks can look remarkably like a plain data file.
Let us see an example to make things clear. If our data file is in a predefined format, such as CSV (Comma-Separated Values) or XML, we have little choice. However, if we are going to create the file for our own use, we can use Lua constructors as our format. In this format, we represent each data record as a Lua constructor. Instead of writing in our data file something like
Donald E. Knuth,Literate Programming,CSLI,1992 Jon Bentley,More Programming Pearls,Addison-Wesley,1990
we write this:
Entry{"Donald E. Knuth", "Literate Programming", "CSLI", 1992} Entry{"Jon Bentley", "More Programming Pearls", "Addison-Wesley", 1990}
Remember that Entry{
is the same as
code
}Entry({
, that is,
a call to some function code
})Entry
with a table as its single argument.
So, that previous piece of data is a Lua program.
To read that file,
we only need to run it,
with a sensible definition for Entry
.
For instance, the following program counts the number
of entries in a data file:
local count = 0 function Entry () count = count + 1 end dofile("data") print("number of entries: " .. count)
The next program collects in a set the names of all authors found in the file, and then prints them:
local authors = {} -- a set to collect authors function Entry (b) authors[b[1]] = true end dofile("data") for name in pairs(authors) do print(name) end
Note the event-driven approach in these program fragments:
the function Entry
acts as a callback function,
which is called during the dofile
for each entry in
the data file.
When file size is not a big concern, we can use name-value pairs for our representation:[15]
Entry{ author = "Donald E. Knuth", title = "Literate Programming", publisher = "CSLI", year = 1992 } Entry{ author = "Jon Bentley", title = "More Programming Pearls", year = 1990, publisher = "Addison-Wesley", }
This format is what we call a self-describing data format, because each piece of data has attached to it a short description of its meaning. Self-describing data are more readable (by humans, at least) than CSV or other compact notations; they are easy to edit by hand, when necessary; and they allow us to make small modifications in the basic format without having to change the data file. For instance, if we add a new field we need only a small change in the reading program, so that it supplies a default value when the field is absent.
With the name-value format, our program to collect authors becomes this:
local authors = {} -- a set to collect authors function Entry (b) authors[b.author] = true end dofile("data") for name in pairs(authors) do print(name) end
Now the order of fields is irrelevant.
Even if some entries do not have an author,
we have to adapt only the function Entry
:
function Entry (b) authors[b.author or "unknown"] = true end
Frequently we need to serialize some data, that is, to convert the data into a stream of bytes or characters, so that we can save it into a file or send it through a network connection. We can represent serialized data as Lua code in such a way that, when we run the code, it reconstructs the saved values into the reading program.
Usually, if we want to restore the value of a global variable,
our chunk will be something like varname =
,
where exp
exp
is the Lua code to create the value.
The varname
is the easy part,
so let us see how to write the code that creates a value.
For a numeric value, the task is easy:
function serialize (o)
if type(o) == "number" then
io.write(tostring(o))
else other cases
end
end
By writing a float in decimal format, however,
we risk losing some precision.
We can use a hexadecimal format to avoid this problem.
With format ("%a"
),
the read float will have exactly the same bits of the original one.
Moreover, since Lua 5.3 we should distinguish between
integers and floats,
so that they can be restored with the correct subtype:
local fmt = {integer = "%d", float = "%a"}
function serialize (o)
if type(o) == "number" then
io.write(string.format(fmt[math.type(o)], o))
else other cases
For a string value, a naive approach would be something like this:
if type(o) == "string" then io.write("'", o, "'")
However, if the string contains special characters (such as quotes or newlines) the resulting code will not be a valid Lua program.
You may be tempted to solve this problem changing quotes:
if type(o) == "string" then io.write("[[", o, "]]")
Beware of code injection!
If a malicious user manages to direct your
program to save something like
" ]]..os.execute('rm *')..[[ "
(for instance, she can supply this string as her address),
your final chunk will be like this one:
varname = [[ ]]..os.execute('rm *')..[[ ]]
You will have a bad surprise trying to load this “data”.
A simple way to quote a string in a secure way
is with the option "%q"
from string.format
.
This option was designed to save the string
in a way that it can be safely read back by Lua.
It surrounds the string with double quotes
and properly escapes double quotes, newlines,
and some other characters inside the string:
a = 'a "problematic" \\string' print(string.format("%q", a)) --> "a \"problematic\" \\string"
Using this feature, our serialize
function now looks like this:
function serialize (o)
if type(o) == "number" then
io.write(string.format(fmt[math.type(o)], o))
elseif type(o) == "string" then
io.write(string.format("%q", o))
else other cases
end
end
Lua 5.3.3 extended the format option "%q"
to work also with numbers (plus nil and Booleans),
again writing them in a proper way to be read back by Lua.
(In particular, it formats floats in hexadecimal,
to ensure full precision.)
Thus, since that version,
we can simplify and extend serialize
even more:
function serialize (o)
local t = type(o)
if t == "number" or t == "string" or t == "boolean" or
t == "nil" then
io.write(string.format("%q", o))
else other cases
end
end
Another way to save strings is the notation
[=[...]=]
for long strings.
However, this notation is mainly intended for hand-written code,
where we do not want to change a literal string in any way.
In automatically generated code,
it is easier to escape problematic characters,
as the option "%q"
from string.format
does.
If you nevertheless want to use the long-string notation for automatically generated code, you must take care of some details. The first one is that you must choose a proper number of equals signs. A good proper number is one more than the maximum that appears in the original string. Because strings containing long sequences of equals signs are common (e.g., comments delimiting parts of a source code), we should limit our attention to sequences of equals signs enclosed by square bracket. The second detail is that Lua always ignores a newline at the beginning of a long string; a simple way to avoid this problem is to add always a newline to be ignored.
The function quote
in Figure 15.1, “Quoting arbitrary literal strings”
is the result of our previous remarks.
Figure 15.1. Quoting arbitrary literal strings
function quote (s) -- find maximum length of sequences of equals signs local n = -1 for w in string.gmatch(s, "]=*%f[%]]") do n = math.max(n, #w - 1) -- -1 to remove the ']' end -- produce a string with 'n' plus one equals signs local eq = string.rep("=", n + 1) -- build quoted string return string.format(" [%s[\n%s]%s] ", eq, s, eq) end
It takes an arbitrary string and
returns it formatted as a long string.
The call to gmatch
creates an iterator to
traverse all occurrences of the pattern ’]=*%f[%]]
’
(that is, a closing square bracket followed by
a sequence of zero or more equals signs followed
by a frontier with a closing square bracket)
in the string s
.
For each occurrence,
the loop updates n
with
the maximum number of equals signs so far.
After the loop, we use string.rep
to replicate an equals sign n + 1
times,
which is one more than the maximum occurring in the string.
Finally, string.format
encloses s
with pairs of
brackets with the correct number of equals signs in between
and adds extra spaces around the quoted string plus a newline at
the beginning of the enclosed string.
(We might be tempted to use the simpler pattern ’]=*]
’,
which does not use a frontier pattern for the second square bracket,
but there is a subtlety here.
Suppose the subject is "]=]==]"
.
The first match is "]=]"
.
After it, what is left in the string is "==]"
,
and so there is no other match;
in the end of the loop, n
would be one instead of two.
The frontier pattern does not consume the bracket,
so that it remains in the subject for the following matches.)
Our next (and harder) task is to save tables. There are several ways to save them, according to what assumptions we make about the table structure. No single algorithm seems appropriate for all cases. Simple tables not only can use simpler algorithms, but also the output can be shorter and clearer.
Our first attempt is in Figure 15.2, “Serializing tables without cycles”.
Figure 15.2. Serializing tables without cycles
function serialize (o) local t = type(o) if t == "number" or t == "string" or t == "boolean" or t == "nil" then io.write(string.format("%q", o)) elseif t == "table" then io.write("{\n") for k,v in pairs(o) do io.write(" ", k, " = ") serialize(v) io.write(",\n") end io.write("}\n") else error("cannot serialize a " .. type(o)) end end
Despite its simplicity, that function does a reasonable job. It even handles nested tables (that is, tables within other tables), as long as the table structure is a tree —that is, there are no shared subtables and no cycles. (A small aesthetic improvement would be to indent nested tables; see Exercise 15.1.)
The previous function assumes that all keys in a table are valid identifiers. If a table has numeric keys, or string keys that are not syntactic valid Lua identifiers, we are in trouble. A simple way to solve this difficulty is to use the following code to write each key:
io.write(string.format(" [%s] = ", serialize(k)))
With this change, we improve the robustness of our function, at the cost of the aesthetics of the resulting file. Consider the next call:
serialize{a=12, b='Lua', key='another "one"'}
The result of this call
using the first version of serialize
is this:
{ a = 12, b = "Lua", key = "another \"one\"", }
Compare it with the result using the second version:
{ ["a"] = 12, ["b"] = "Lua", ["key"] = "another \"one\"", }
We can have both robustness and aesthetics by testing for each case whether it needs the square brackets; again, we will leave this improvement as an exercise.
To handle tables with generic topology (i.e., with cycles and shared subtables) we need a different approach. Constructors cannot create such tables, so we will not use them. To represent cycles we need names, so our next function will get as arguments the value to be saved plus its name. Moreover, we must keep track of the names of the tables already saved, to reuse them when we detect a cycle. We will use an extra table for this tracking. This table will have previously saved tables as indices and their names as the associated values.
The resulting code is in Figure 15.3, “Saving tables with cycles”.
Figure 15.3. Saving tables with cycles
function basicSerialize (o) -- assume 'o' is a number or a string return string.format("%q", o) end function save (name, value, saved) saved = saved or {} -- initial value io.write(name, " = ") if type(value) == "number" or type(value) == "string" then io.write(basicSerialize(value), "\n") elseif type(value) == "table" then if saved[value] then -- value already saved? io.write(saved[value], "\n") -- use its previous name else saved[value] = name -- save name for next time io.write("{}\n") -- create a new table for k,v in pairs(value) do -- save its fields k = basicSerialize(k) local fname = string.format("%s[%s]", name, k) save(fname, v, saved) end end else error("cannot save a " .. type(value)) end end
We keep the restriction that the tables we want to save
have only strings and numbers as keys.
The function basicSerialize
serializes these basic types,
returning the result.
The next function, save
, does the hard work.
The saved
parameter is the table that keeps track of
tables already saved.
As an example, suppose we build a table like this:
a = {x=1, y=2; {3,4,5}} a[2] = a -- cycle a.z = a[1] -- shared subtable
The call save("a", a)
will save it as follows:
a = {} a[1] = {} a[1][1] = 3 a[1][2] = 4 a[1][3] = 5 a[2] = a a["y"] = 2 a["x"] = 1 a["z"] = a[1]
The actual order of these assignments may vary, as it depends on a table traversal. Nevertheless, the algorithm ensures that any node needed in a new definition is already defined.
If we want to save several values with shared parts,
we can make the calls to save
them using
the same saved
table.
For instance, assume the following two tables:
a = {{"one", "two"}, 3} b = {k = a[1]}
If we save them independently,
the result will not have common parts.
However,
if we use the same saved
table for both calls to save
,
then the result will share common parts:
local t = {} save("a", a, t) save("b", b, t) --> a = {} --> a[1] = {} --> a[1][1] = "one" --> a[1][2] = "two" --> a[2] = 3 --> b = {} --> b["k"] = a[1]
As is usual in Lua, there are several other alternatives. Among them, we can save a value without giving it a global name (instead, the chunk builds a local value and returns it), we can use the list syntax when possible (see the exercises for this chapter), and so on. Lua gives you the tools; you build the mechanisms.
Exercise 15.1:
Modify the code in Figure 15.2, “Serializing tables without cycles”
so that it indents nested tables.
(Hint:
add an extra parameter to
serialize
with the indentation string.)
Exercise 15.2:
Modify the code of the previous exercise
so that it uses the syntax ["
,
as suggested in the section called “Saving tables without cycles”.
key
"]=value
Exercise 15.3:
Modify the code of the previous exercise
so that it uses the syntax ["
only when necessary
(that is, when the key is a string but not a valid identifier).
key
"]=value
Exercise 15.4:
Modify the code of the previous exercise
so that it uses the constructor syntax for lists
whenever possible.
For instance,
it should serialize the table {14, 15, 19}
as {14, 15, 19}
,
not as {[1] = 14, [2] = 15, [3] = 19}
.
(Hint:
start by saving the values of the keys 1, 2, ...,
as long as they are not nil.
Take care not to save them again when traversing
the rest of the table.)
Exercise 15.5:
The approach of avoiding constructors
when saving tables with cycles is too radical.
It is possible to save the table in a more pleasant format
using constructors for the simple case,
and to use assignments later only to fix sharing and loops.
Reimplement the function save
(Figure 15.3, “Saving tables with cycles”) using this approach.
Add to it all the goodies that you have implemented
in the previous exercises
(indentation, record syntax, and list syntax).
[15] If this format reminds you of BibTeX, it is not a coincidence. BibTeX was one of the inspirations for the constructor syntax in Lua.
Personal copy of Eric Taylor <jdslkgjf.iapgjflksfg@yandex.com>