13 Bits and Bytes

Lua handles binary data similarly to text. A string in Lua can contain any bytes, and almost all library functions that handle strings can handle arbitrary bytes. We can even do pattern matching on binary data. On top of that, Lua 5.3 introduced extra facilities to manipulate binary data: besides integer numbers, it brought bitwise operators and functions to pack and unpack binary data. In this chapter, we will cover these and other facilities for handling binary data in Lua.

Starting with version 5.3, Lua offers a standard set of bitwise operators on numbers. Unlike arithmetic operations, bitwise operators only work on integer values. The bitwise operators are & (bitwise AND), | (bitwise OR), ~ (bitwise exclusive-OR), >> (logical right shift), << (left shift), and the unary ~ (bitwise NOT). (Note that, in several languages, the exclusive-OR operator is denoted by ^. In Lua, ^ means exponentiation.)

      > string.format("%x", 0xff & 0xabcd)    --> cd
      > string.format("%x", 0xff | 0xabcd)    --> abff
      > string.format("%x", 0xaaaa ~ -1)      --> ffffffffffff5555
      > string.format("%x", ~0)               --> ffffffffffffffff

(Several examples in this chapter will use string.format to show results in hexadecimal.)

All bitwise operators work on all bits of integers. In Standard Lua, that means 64 bits. That can be a problem when implementing algorithms that assume 32-bit integers (e.g., the cryptographic hash SHA-2). However, it is not difficult to perform 32-bit integer manipulation. Except for the right-shift operation, all bitwise operations on 64 bits agree with the same operations on 32 bits, if we simply ignore the higher half bits. The same is true for addition, subtraction, and multiplication. So, all we have to do to operate on 32-bit integers is to erase the higher 32 bits of an integer before a right shift. (We seldom do divisions on that kind of computations.)

Both shift operators fill with zeros the vacant bits. This is usually called logical shifts. Lua does not offer an arithmetic right shift, which fills vacant bits with the signal bit. We can perform the equivalent to arithmetic shifts with a floor division by an appropriate power of two. (For instance, x // 16 is the same as an arithmetic shift by four.)

Negative displacements shift in the other direction, that is, a >> n is the same as a << -n:

      > string.format("%x", 0xff << 12)    --> ff000
      > string.format("%x", 0xff >> -12)   --> ff000

If the displacement is equal to or larger than the number of bits in the integer representation (64 in Standard Lua, 32 in Small Lua), the result is zero, as all bits are shifted out of the result:

      > string.format("%x", -1 << 80)    --> 0

The representation of integers uses one bit to store the signal. Therefore, the maximum integer that we can represent with 64-bit integers is 263 - 1, instead of 264 - 1. Usually, this difference is irrelevant, as 263 - 1 is quite large already. However, sometimes we cannot waste a bit for the signal, because we are either handling external data with unsigned integers or implementing some algorithm that needs integers with all their 64 bits. Moreover, in Small Lua the difference can be quite significant. For instance, if we use a 32-bit signed integer as a position in a file, we are limited to 2 GB files; an unsigned integer doubles that limit.

Lua does not offer explicit support for unsigned integers. Nevertheless, with some care, it is not difficult to handle unsigned integers in Lua, as we will see now.

We can write constants larger than 263 - 1 directly, despite appearances:

      > x = 13835058055282163712         -- 3 << 62
      > x                                --> -4611686018427387904

The problem here is not the constant, but the way Lua prints it: the standard way to print numbers interprets them as signed integers. We can use the %u or %x options in string.format to see integers as unsigned:

      > string.format("%u", x)           --> 13835058055282163712
      > string.format("0x%X", x)         --> 0xC000000000000000

Due to the way signed integers are represented (two’s complement), the operations of addition, subtraction, and multiplication work the same way for signed and unsigned integers:

      > string.format("%u", x)             --> 13835058055282163712
      > string.format("%u", x + 1)         --> 13835058055282163713
      > string.format("%u", x - 1)         --> 13835058055282163711

(With such a large value, multiplying x even by two would overflow, so we did not include that operation in the example.)

Order operators work differently for signed and unsigned integers. The problem appears when we compare two integers with a difference in the higher bit. For signed integers, the integer with that bit set is the smaller, because it represents a negative number:

      > 0x7fffffffffffffff < 0x8000000000000000    --> false

This result is clearly incorrect if we regard both integers as unsigned. So, we need a different operation to compare unsigned integers. Lua 5.3 provides math.ult (unsigned less than) for that need:

      > math.ult(0x7fffffffffffffff, 0x8000000000000000)   --> true

Another way to do the comparison is to flip the signal bit of both operands before doing a signed comparison:

      > mask = 0x8000000000000000
      > (0x7fffffffffffffff ~ mask) < (0x8000000000000000 ~ mask)
        --> true

Unsigned division is also different from its signed version. Figure 13.1, “Unsigned division” shows an algorithm for unsigned division.

The first test (d < 0) is equivalent to testing whether d is larger than 263. In that case, the quotient can only be 1 (if n is equal to or larger than d) or 0. Otherwise, we do the equivalent of dividing the dividend by two, then dividing the result by the divisor, and then multiplying the result by two. The right shift is equivalent to an unsigned division by two; the result will be a non-negative signed integer. The subsequent left shift corrects the quotient, undoing this previous division.

In general, floor(floor(n / 2) / d) * 2 (the computation done by the algorithm) is not equal to floor(((n / 2) / d) * 2) (the correct result). However, it is not difficult to prove that the difference is at most one. So, the algorithm computes the rest of the division (in the variable r) and checks whether it is greater than the divisor: if so, it corrects the quotient (adding one to it) and it is done.

Converting an unsigned integer to/from a float needs some adjustments. To convert an unsigned integer to a float, we can convert it as a signed integer and correct the result with a modulo operator:

      > u = 11529215046068469760          -- an example
      > f = (u + 0.0) % 2^64
      > string.format("%.0f", f)          --> 11529215046068469760

The value of u + 0.0 is -6917529027641081856, because the standard conversion sees u as a signed integer. The modulo operation brings the value back to the range of unsigned integers. (In real code we do not need the addition, because the modulo with a float would do the conversion anyway.)

To convert from a float to an unsigned integer, we can use the following code:

      > f = 0xA000000000000000.0             -- an example
      > u = math.tointeger(((f + 2^63) % 2^64) - 2^63)
      > string.format("%x", u)               --> a000000000000000

The addition transforms a value greater than 263 in a value greater than 264. The modulo operator then projects this value to the range [0,263), and the subtraction makes it a negative value (that is, a value with the highest bit set). For a value smaller than 263, the addition keeps it smaller than 264, the modulo operator does not affect it, and the subtraction restores its original value.

Lua 5.3 also introduced functions for converting between binary data and basic values (numbers and strings). The function string.pack packs values into a binary string; string.unpack extracts those values from the string.

Both string.pack and string.unpack get as their first argument a format string, which describes how the data is packed. Each letter in this string describes how to pack/unpack one value; see the following example:

      > s = string.pack("iii", 3, -27, 450)
      > #s                             --> 12
      > string.unpack("iii", s)        --> 3   -27   450   13

This call to string.pack creates a string with the binary codes of three integers (according to the description "iii"), each encoding its corresponding argument. The string length will be three times the native size of an integer (3 times 4 bytes in my machine). The call to string.unpack decodes three integers (again according to "iii") from the given string and returns the decoded values.

The function string.unpack also returns the position in the string after the last item read, to simplify iterations. (This explains the 13 in the results of the last example.) Accordingly, it accepts an optional third argument, which tells where to start reading. For instance, the next example prints all strings packed inside a given string:

      s = "hello\0Lua\0world\0"
      local i = 1
      while i <= #s do
        local res
        res, i = string.unpack("z", s, i)
        print(res)
      end
        --> hello
        --> Lua
        --> world

As we will see, the z option means a zero-terminated string, so that the call to unpack extracts the string at position i from s and returns that string plus the next position for the loop.

There are several options for coding an integer, each corresponding to a native integer size: b (char), h (short), i (int), and l (long); the option j uses the size of a Lua integer. To use a fixed, machine-independent size, we can suffix the i option with a number from one to 16. For instance, i7 will produce seven-byte integers. All sizes check for overflows:

      > x = string.pack("i7", 1 << 54)
      > string.unpack("i7", x)             --> 18014398509481984   8
      > x = string.pack("i7", -(1 << 54))
      > string.unpack("i7", x)             --> -18014398509481984  8
      > x = string.pack("i7", 1 << 55)
      stdin:1: bad argument #2 to 'pack' (integer overflow)

We can pack and unpack integers wider than native Lua integers but, when unpacking, their actual values must fit into Lua integers:

      > x = string.pack("i12", 2^61)
      > string.unpack("i12", x)      --> 2305843009213693952    13
      > x = "aaaaaaaaaaaa"           -- fake a large 12-byte number
      > string.unpack("i12", x)
      stdin:1: 12-byte integer does not fit into Lua Integer

Each integer option has an upper-case version corresponding to an unsigned integer of the same size:

      > s = "\xFF"
      > string.unpack("b", s)       --> -1    2
      > string.unpack("B", s)       --> 255   2

Moreover, unsigned integers have an extra option T for size_t. (The size_t type in ISO C is an unsigned integer larger enough to hold the size of any object.)

We can pack strings in three representations: zero-terminated strings, fixed-length strings, and strings with explicit length. Zero-terminated strings use the z option. For fixed-length strings, we use the option cn, where n is the number of bytes in the packed string. The last option for strings stores the string preceded by its length. In that case, the option has the format sn, where n is the size of the unsigned integer used to store the length. For instance, the option s1 stores the string length in one byte:

      s = string.pack("s1", "hello")
      for i = 1, #s do print((string.unpack("B", s, i))) end
        --> 5                           (length)
        --> 104                         ('h')
        --> 101                         ('e')
        --> 108                         ('l')
        --> 108                         ('l')
        --> 111                         ('o')

Lua raises an error if the length does not fit into the given size. We can also use a pure s as the option; in that case, the length is stored as a size_t, which is large enough to hold the length of any string. (In 64-bit machines, size_t usually is an eight-byte unsigned integer, which may be a waste of space for small strings.)

For floating-point numbers, there are three options: f for single precision, d for double precision, and n for a Lua float.

The format string also has options to control the endianess and the alignment of the binary data. By default, a format uses the machine’s native endianess. The > option turns all subsequent encodings in that format to big endian, or network byte order:

      s = string.pack(">i4", 1000000)
      for i = 1, #s do print((string.unpack("B", s, i))) end
        --> 0
        --> 15
        --> 66
        --> 64

The < option turns to little endian:

      s = string.pack("<i2 i2", 500, 24)
      for i = 1, #s do print((string.unpack("B", s, i))) end
        --> 244
        --> 1
        --> 24
        --> 0

Finally, the = option turns back to the default machine’s native endianess.

For alignment, the !n option forces data to align at indices that are multiples of n. More specifically, if the item is smaller than n, it is aligned at its own size; otherwise, it is aligned at n. For instance, suppose we start the format string with !4. Then, one-byte integers will be written in indices multiple of one (that is, any index), two-byte integers will be written in indices multiple of two, and four-byte or larger integers will be written in indices multiple of four. The ! option (without a number) sets the alignment to the machine’s native alignment.

The function string.pack does the alignment by adding zeros to the resulting string until the index has a proper value. The function string.unpack simply skips the padding when reading the string. Alignment only works for powers of two: if we set the alignment to four and try to manipulate a three-byte integer, Lua will raise an error.

Any format string works as if prefixed by "=!1", which means native endianess and no alignment (as every index is a multiple of one). We can change the endianess and the alignment at any point during the translation.

If needed, we can add padding manually. The option x means one byte of padding; string.pack adds a zero byte to the resulting string; string.unpack skips one byte from the subject string.

The functions io.input and io.output always open a file in text mode. In POSIX, there is no difference between binary files and text files. In some systems like Windows, however, we must open binary files in a special way, using the letter b in the mode string of io.open.

Typically, we read binary data either with the "a" pattern, that reads the whole file, or with the pattern n, that reads n bytes. (Lines make no sense in a binary file.) As a simple example, the following program converts a text file from Windows format to POSIX format —that is, it translates sequences of carriage return–newlines to newlines:

      local inp = assert(io.open(arg[1], "rb"))
      local out = assert(io.open(arg[2], "wb"))
      
      local data = inp:read("a")
      data = string.gsub(data, "\r\n", "\n")
      out:write(data)
      
      assert(out:close())

It cannot use the standard I/O streams (stdin/stdout), because these streams are open in text mode. Instead, it assumes that the names of the input file and the output file are arguments to the program. We can call this program with the following command line:

      > lua prog.lua file.dos file.unix

As another example, the following program prints all strings found in a binary file:

      local f = assert(io.open(arg[1], "rb"))
      local data = f:read("a")
      local validchars = "[%g%s]"
      local pattern = "(" .. string.rep(validchars, 6) .. "+)\0"
      for w in string.gmatch(data, pattern) do
        print(w)
      end

The program assumes that a string is any zero-terminated sequence of six or more valid characters, where a valid character is any character accepted by the pattern validchars. In our example, this pattern comprises the printable characters. We use string.rep and concatenation to create a pattern that matches all sequences of six or more validchars ended by a zero. The parentheses in the pattern capture the string without the zero.

Our last example is a program to make a dump of a binary file, showing its contents in hexadecimal. Figure 13.2, “Dumping the dump program” shows the result of applying this program to itself on a POSIX machine.

The complete program is here:

      local f = assert(io.open(arg[1], "rb"))
      local blocksize = 16
      for bytes in f:lines(blocksize) do
        for i = 1, #bytes do
          local b = string.unpack("B", bytes, i)
          io.write(string.format("%02X ", b))
        end
        io.write(string.rep("   ", blocksize - #bytes))
        bytes = string.gsub(bytes, "%c", ".")
        io.write(" ", bytes, "\n")
      end

Again, the first program argument is the input file name; the output is regular text, so it can go to the standard output. The program reads the file in chunks of 16 bytes. For each chunk, it writes the hexadecimal representation of each byte, and then it writes the chunk as text, changing control characters to dots. We use string.rep to fill with blanks the last line (which in general will not have exactly 16 bytes), keeping the alignment.

Exercise 13.1: Write a function to compute the modulo operation for unsigned integers.

Exercise 13.2: Implement different ways to compute the number of bits in the representation of a Lua integer.

Exercise 13.3: How can you test whether a given integer is a power of two?

Exercise 13.4: Write a function to compute the Hamming weight of a given integer. (The Hamming weight of a number is the number of ones in its binary representation.)

Exercise 13.5: Write a function to test whether the binary representation of a given integer is a palindrome.

Exercise 13.6: Implement a bit array in Lua. It should support the following operations:

  • newBitArray(n) (creates an array with n bits),

  • setBit(a, n, v) (assigns the Boolean value v to bit n of array a),

  • testBit(a, n) (returns a Boolean with the value of bit n).

Exercise 13.7: Suppose we have a binary file with a sequence of records, each one with the following format:

      struct Record {
        int x;
        char[3] code;
        float value;
      };

Write a program that reads that file and prints the sum of the value fields.

Personal copy of Eric Taylor <jdslkgjf.iapgjflksfg@yandex.com>