4 Strings

Strings represent text. A string in Lua can contain a single letter or an entire book. Programs that manipulate strings with 100K or 1M characters are not unusual in Lua.

Strings in Lua are sequences of bytes. The Lua core is agnostic about how these bytes encode text. Lua is eight-bit clean and its strings can contain bytes with any numeric code, including embedded zeros. This means that we can store any binary data into a string. We can also store Unicode strings in any representation (UTF-8, UTF-16, etc.); however, as we will discuss, there are several good reasons to use UTF-8 whenever possible. The standard string library that comes with Lua assumes one-byte characters, but it can handle UTF-8 strings quite reasonably. Moreover, since version 5.3, Lua comes with a small library to help with the UTF-8 encoding.

Strings in Lua are immutable values. We cannot change a character inside a string, as we can in C; instead, we create a new string with the desired modifications, as in the next example:

      a = "one string"
      b = string.gsub(a, "one", "another")  -- change string parts
      print(a)       --> one string
      print(b)       --> another string

Strings in Lua are subject to automatic memory management, like all other Lua objects (tables, functions, etc.). This means that we do not have to worry about allocation and deallocation of strings; Lua handles it for us.

We can get the length of a string using the length operator (denoted by #):

      a = "hello"
      print(#a)             --> 5
      print(#"good bye")    --> 8

This operator always counts the length in bytes, which in some encodings is not the same as the number of characters.
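
The difference shows up as soon as a string holds non-ASCII text. The following sketch assumes Lua 5.3 or later (for the \u escape and the utf8 library, both discussed later in this chapter) and writes the word ação with escapes, so that it does not depend on the encoding of the source file:

```lua
-- "ação" written with escapes: ç is U+00E7, ã is U+00E3
local s = "a\u{E7}\u{E3}o"
local nbytes = #s            -- length in bytes
local nchars = utf8.len(s)   -- length in UTF-8 characters
print(nbytes, nchars)        --> 6    4
```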

We can concatenate two strings with the concatenation operator .. (two dots). If any operand is a number, Lua converts this number to a string:

      > "Hello " .. "World"     --> Hello World
      > "result is " .. 3       --> result is 3

(Some languages use the plus sign for concatenation, but 3 + 5 is different from 3 .. 5.)

Remember that strings in Lua are immutable values. The concatenation operator always creates a new string, without any modification to its operands:

      > a = "Hello"
      > a .. " World"           --> Hello World
      > a                       --> Hello

Literal strings

We can delimit literal strings by single or double matching quotes:

      a = "a line"
      b = 'another line'

They are equivalent; the only difference is that inside each kind of quote we can use the other quote without escapes.

As a matter of style, most programmers always use the same kind of quotes for the same kind of strings, where the “kinds” of strings depend on the program. For instance, a library that manipulates XML may reserve single-quoted strings for XML fragments, because those fragments often contain double quotes.

Strings in Lua can contain the following C-like escape sequences:

`\a`	bell
`\b`	back space
`\f`	form feed
`\n`	newline
`\r`	carriage return
`\t`	horizontal tab
`\v`	vertical tab
`\\`	backslash
`\"`	double quote
`\'`	single quote

The following examples illustrate their use:

      > print("one line\nnext line\n\"in quotes\", 'in quotes'")
      one line
      next line
      "in quotes", 'in quotes'
      > print('a backslash inside quotes: \'\\\'')
      a backslash inside quotes: '\'
      > print("a simpler way: '\\'")
      a simpler way: '\'

We can specify a character in a literal string also by its numeric value through the escape sequences \ddd and \xhh, where ddd is a sequence of up to three decimal digits and hh is a sequence of exactly two hexadecimal digits. As a somewhat artificial example, the two literals "ALO\n123\"" and '\x41LO\10\04923"' have the same value in a system using ASCII: 0x41 (65 in decimal) is the ASCII code for A, 10 is the code for newline, and 49 is the code for the digit 1. (In this example we must write 49 with three digits, as \049, because it is followed by another digit; otherwise Lua would read the escape as \492.) We could also write that same string as '\x41\x4c\x4f\x0a\x31\x32\x33\x22', representing each character by its hexadecimal code.
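
We can check the equivalence directly. This is only a small sketch, assuming an ASCII-compatible system; the variable names are ours:

```lua
local s1 = "ALO\n123\""
local s2 = '\x41LO\10\04923"'
local s3 = '\x41\x4c\x4f\x0a\x31\x32\x33\x22'
print(s1 == s2, s2 == s3)    --> true    true
print(#s1)                   --> 8
```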

Since Lua 5.3, we can also specify UTF-8 characters with the escape sequence \u{h... h}; we can write any number of hexadecimal digits inside the curly brackets:

      > "\u{3b1} \u{3b2} \u{3b3}"         --> α β γ

(The above example assumes a UTF-8 terminal.)
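
The escape produces the UTF-8 encoding of the character, that is, a sequence of bytes. A short sketch (Lua 5.3 or later assumed) that inspects those bytes without relying on the terminal:

```lua
local alpha = "\u{3b1}"            -- Greek small letter alpha
print(#alpha)                      --> 2
print(string.byte(alpha, 1, -1))   --> 206    177
print(alpha == "\xCE\xB1")         --> true
```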

We can also delimit literal strings by matching double square brackets, as we do with long comments. Literals in this bracketed form can run for several lines and do not interpret escape sequences. Moreover, Lua ignores the first character of such a literal when this character is a newline. This form is especially convenient for writing strings that contain large pieces of code, as in the following example:

      page = [[
      <html>
      <head>
        <title>An HTML Page</title>
      </head>
      <body>
        <a href="http://www.lua.org">Lua</a>
      </body>
      </html>
      ]]
      
      io.write(page)

Sometimes, we may need to enclose a piece of code containing something like a = b[c[i]] (notice the ]] in this code), or we may need to enclose some code that already has some code commented out. To handle such cases, we can add any number of equals signs between the two opening brackets, as in [===[. After this change, the literal string ends only at the next closing brackets with the same number of equals signs in between (]===], in our example). The scanner ignores any pairs of brackets with a different number of equals signs. By choosing an appropriate number of signs, we can enclose any literal string without having to modify it in any way.
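
As a minimal sketch, the fragment mentioned above can be enclosed with two equals signs (any number that does not occur in the text would do):

```lua
local code = [==[
a = b[c[i]]
]==]
-- the ]] inside the text does not close the literal,
-- and the newline right after [==[ is skipped
print(code == "a = b[c[i]]\n")   --> true
```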

This same facility is valid for comments, too. For instance, if we start a long comment with --[=[, it extends until the next ]=]. This facility allows us to comment out easily a piece of code that contains parts already commented out.
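
A small sketch of this use; the variable n is only there to show that none of the commented-out lines runs:

```lua
local n = 1
--[=[
n = n + 1
--[[ n = n + 10 ]]
n = n + 100
]=]
print(n)   --> 1
```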

Long strings are the ideal format to include literal text in our code, but we should not use them for non-text literals. Although literal strings in Lua can contain arbitrary bytes, it is not a good idea to use this feature (e.g., you may have problems with your text editor); moreover, end-of-line sequences like "\r\n" may be normalized to "\n" when read. Instead, it is better to code arbitrary binary data using numeric escape sequences either in decimal or in hexadecimal, such as "\x13\x01\xA1\xBB". However, this poses a problem for long strings, because they would result in quite long lines. For those situations, since version 5.2 Lua offers the escape sequence \z: it skips all subsequent space characters in the string until the first non-space character. The next example illustrates its use:

      data = "\x00\x01\x02\x03\x04\x05\x06\x07\z
              \x08\x09\x0A\x0B\x0C\x0D\x0E\x0F"

The \z at the end of the first line skips the following end-of-line and the indentation of the second line, so that the byte \x08 directly follows \x07 in the resulting string.
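
We can confirm that no spaces or newlines end up in the string with a shorter variant of the same idea (Lua 5.2 or later assumed):

```lua
local data = "\x00\x01\x02\x03\z
              \x04\x05\x06\x07"
print(#data)                  --> 8
print(string.byte(data, 5))   --> 4
```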

Coercions

Lua provides automatic conversions between numbers and strings at run time. Any numeric operation applied to a string tries to convert the string to a number. Lua applies such coercions not only in arithmetic operators, but also in other places that expect a number, such as the argument to math.sin.
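
The following sketch illustrates both cases. It compares the results with == instead of printing them, because whether such a result is an integer or a float can differ between Lua versions:

```lua
local double = "10" * 2            -- the string is converted to a number
local root = math.sqrt("16")       -- library functions coerce, too
print(double == 20, root == 4)     --> true    true

-- a string that does not denote a number raises an error
local ok = pcall(function () return "abc" + 1 end)
print(ok)                          --> false
```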

Conversely, whenever Lua finds a number where it expects a string, it converts the number to a string:

      print(10 .. 20)           --> 1020

(When we write the concatenation operator right after a numeral, we must separate them with a space; otherwise, Lua thinks that the first dot is a decimal point.)

Many people argue that these automatic coercions were not a good idea in the design of Lua. As a rule, it is better not to count on them. They are handy in a few places, but add complexity both to the language and to programs that use them.

As a reflection of this “second-class status”, Lua 5.3 did not implement a full integration of coercions and integers, favoring instead a simpler and faster implementation. The rule for arithmetic operations is that the result is an integer only when both operands are integers; a string is not an integer, so any arithmetic operation with strings is handled as a floating-point operation:

      > "10" + 1             --> 11.0

To convert a string to a number explicitly, we can use the function tonumber, which returns nil if the string does not denote a proper number. Otherwise, it returns integers or floats, following the same rules of the Lua scanner:

      > tonumber("  -3 ")        --> -3
      > tonumber(" 10e4 ")       --> 100000.0
      > tonumber("10e")          --> nil   (not a valid number)
      > tonumber("0x1.3p-4")     --> 0.07421875

By default, tonumber assumes decimal notation, but we can specify any base between 2 and 36 for the conversion:

      > tonumber("100101", 2)       --> 37
      > tonumber("fff", 16)         --> 4095
      > tonumber("-ZZ", 36)         --> -1295
      > tonumber("987", 8)          --> nil

In the last line, the string does not represent a proper numeral in the given base, so tonumber returns nil.
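
Because tonumber signals failure with nil instead of raising an error, it is easy to wrap. The function parse below is a hypothetical helper, not part of any library; it is one possible way to supply a default value for invalid input:

```lua
-- hypothetical helper: convert text to a number in the given base
-- (decimal when base is nil); return default when text is invalid
local function parse(text, base, default)
  local n = tonumber(text, base)
  if n == nil then return default end
  return n
end

print(parse("ff", 16, 0))     --> 255
print(parse("987", 8, 0))     --> 0
print(parse("10e", nil, -1))  --> -1
```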

To convert a number to a string, we can call the function tostring:

      print(tostring(10) == "10")   --> true

These conversions are always valid. Remember, however, that we have no control over the format (e.g., the number of decimal digits in the resulting string). For full control, we should use string.format, which we will see in the next section.

Unlike arithmetic operators, order operators never coerce their arguments. Remember that "0" is different from 0. Moreover, 2 < 15 is obviously true, but "2" < "15" is false (alphabetical order). To avoid inconsistent results, Lua raises an error when we mix strings and numbers in an order comparison, such as 2 < "15".
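
A short sketch of these three cases; it uses pcall only to capture the error of the mixed comparison:

```lua
print("2" < "15")                      --> false  (alphabetical order)
print(tonumber("2") < tonumber("15"))  --> true   (numeric order)

local ok = pcall(function () return 2 < "15" end)
print(ok)                              --> false  (the comparison raised an error)
```

When the operands may be strings that denote numbers, converting them explicitly with tonumber makes the intended order clear.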

The String Library

The power of a raw Lua interpreter to manipulate strings is quite limited. A program can create string literals, concatenate them, compare them, and get string lengths. However, it cannot extract substrings or examine their contents. The full power to manipulate strings in Lua comes from its string library.

As I mentioned before, the string library assumes one-byte characters. This equivalence is true for several encodings (e.g., ASCII or ISO-8859-1), but it breaks in any Unicode encoding. Nevertheless, as we will see, several parts of the string library are quite useful for UTF-8.

Some functions in the string library are quite simple: the call string.len(s) returns the length of a string s; it is equivalent to #s. The call string.rep(s, n) returns the string s repeated n times; we can create a string of 1 MB (e.g., for tests) with string.rep("a", 2^20). The function string.reverse reverses a string. The call string.lower(s) returns a copy of s with the upper-case letters converted to lower case; all other characters in the string are unchanged. The function string.upper converts to upper case.

      > string.rep("abc", 3)             --> abcabcabc
      > string.reverse("A Long Line!")   --> !eniL gnoL A
      > string.lower("A Long Line!")     --> a long line!
      > string.upper("A Long Line!")     --> A LONG LINE!

As a typical use, if we want to compare two strings regardless of case, we can write something like this:

      string.lower(a) < string.lower(b)
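
The same comparison can serve as the order function for table.sort. A minimal sketch, with a list of names made up for illustration:

```lua
local names = {"banana", "Cherry", "apple"}
table.sort(names, function (a, b)
  return string.lower(a) < string.lower(b)
end)
print(table.concat(names, " "))   --> apple banana Cherry
```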

The call string.sub(s, i, j) extracts a piece of the string s, from the i-th to the j-th character inclusive. (The first character of a string has index 1.) We can also use negative indices, which count from the end of the string: index -1 refers to the last character, -2 to the previous one, and so on. Therefore, the call string.sub(s, 1, j) gets a prefix of the string s with length j; string.sub(s, j, -1) gets a suffix of the string, starting at the j-th character; and string.sub(s, 2, -2) returns a copy of the string s with the first and last characters removed:

      > s = "[in brackets]"
      > string.sub(s, 2, -2)      --> in brackets
      > string.sub(s, 1, 1)       --> [
      > string.sub(s, -1, -1)     --> ]

Remember that strings in Lua are immutable. Like any other function in Lua, string.sub does not change the value of a string, but returns a new string. A common mistake is to write something like string.sub(s, 2, -2) and assume that it will modify the value of s. If we want to modify the value of a variable, we must assign the new value to it:

      s = string.sub(s, 2, -2)
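
Prefixes and suffixes give a simple way to test how a string starts or ends. The functions startswith and endswith below are hypothetical helpers, shown only as a sketch:

```lua
-- does s begin with prefix?
local function startswith(s, prefix)
  return string.sub(s, 1, #prefix) == prefix
end

-- does s end with suffix? (the empty suffix needs its own test,
-- because string.sub(s, -0) returns the whole string)
local function endswith(s, suffix)
  return suffix == "" or string.sub(s, -#suffix) == suffix
end

print(startswith("hello world", "hello"))   --> true
print(endswith("hello world", "word"))      --> false
```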

The functions string.char and string.byte convert between characters and their internal numeric representations. The function string.char gets zero or more integers, converts each one to a character, and returns a string concatenating all these characters. The call string.byte(s, i) returns the internal numeric representation of the i-th character of the string s; the second argument is optional; the call string.byte(s) returns the internal numeric representation of the first (or single) character of s. The following examples assume the ASCII encoding for characters:

      print(string.char(97))                    --> a
      i = 99; print(string.char(i, i+1, i+2))   --> cde
      print(string.byte("abc"))                 --> 97
      print(string.byte("abc", 2))              --> 98
      print(string.byte("abc", -1))             --> 99

In the last line, we used a negative index to access the last character of the string.

A call like string.byte(s, i, j) returns multiple values with the numeric representation of all characters between indices i and j (inclusive):

      print(string.byte("abc", 1, 2))           --> 97 98

A nice idiom is {string.byte(s, 1, -1)}, which creates a list with the codes of all characters in s. (This idiom only works for strings somewhat shorter than 1 MB. Lua limits its stack size, which in turn limits the maximum number of returns from a function. The default stack limit is one million entries.)
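
As a sketch of this idiom, we can unpack a short string into its codes and rebuild it with string.char (table.unpack assumes Lua 5.2 or later, and the same limit on the number of values applies to it):

```lua
local s = "abc"
local codes = {string.byte(s, 1, -1)}
print(#codes, codes[1], codes[3])        --> 3    97    99

local copy = string.char(table.unpack(codes))
print(copy == s)                         --> true
```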

The function string.format is a powerful tool for formatting strings and converting numbers to strings. It returns a copy of its first argument, the so-called format string, with each directive in that string replaced by a formatted version of its corresponding argument. The directives in the format string have rules similar to those of the C function printf. A directive is a percent sign plus a letter that tells how to format the argument: d for a decimal integer, x for hexadecimal, f for a floating-point number, s for strings, plus several others.

      > string.format("x = %d  y = %d", 10, 20)   --> x = 10  y = 20
      > string.format("x = %x", 200)              --> x = c8
      > string.format("x = 0x%X", 200)            --> x = 0xC8
      > string.format("x = %f", 200)              --> x = 200.000000
      > tag, title = "h1", "a title"
      > string.format("<%s>%s</%s>", tag, title, tag)
        --> <h1>a title</h1>

Between the percent sign and the letter, a directive can include other options that control the details of the formatting, such as the number of decimal digits of a floating-point number:

      print(string.format("pi = %.4f", math.pi))      --> pi = 3.1416
      d = 5; m = 11; y = 1990
      print(string.format("%02d/%02d/%04d", d, m, y)) --> 05/11/1990

In the first example, the %.4f means a floating-point number with four digits after the decimal point. In the second example, the %02d means a decimal number with zero padding and at least two digits; the directive %2d, without the zero, would use blanks for padding. For a complete description of these directives, see the documentation of the C function printf, as Lua calls the standard C library to do the hard work here.
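
A few more combinations of width, padding, and alignment, as a sketch; the square brackets are only there to make the padding visible:

```lua
print(string.format("[%5d]", 42))          --> [   42]
print(string.format("[%-5d]", 42))         --> [42   ]
print(string.format("[%05.1f]", 3.14159))  --> [003.1]
print(string.format("[%8s]", "lua"))       --> [     lua]
```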

We can call all functions from the string library as methods on strings, using the colon operator. For instance, we can rewrite the call string.sub(s, i, j) as s:sub(i, j); string.upper(s) becomes s:upper(). (We will discuss the colon operator in detail in Chapter 21, Object-Oriented Programming.)
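
A short sketch of the method notation. Note that, to call a method directly on a literal string, we must enclose the literal in parentheses:

```lua
local s = "[in brackets]"
print(s:sub(2, -2))               --> in brackets
print(s:sub(2, -2):upper())       --> IN BRACKETS
print(("ab"):rep(3))              --> ababab
print(s:len() == #s)              --> true
```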

The string library also includes several functions based on pattern matching. The function string.find searches for a pattern in a given string:

      > string.find("hello world", "wor")   --> 7   9
      > string.find("hello world", "war")   --> nil

It returns the initial and final positions of the pattern in the string, or nil if it cannot find the pattern. The function string.gsub (Global SUBstitution) replaces all occurrences of a pattern in a string with another string:

      > string.gsub("hello world", "l", ".")     --> he..o wor.d    3
      > string.gsub("hello world", "ll", "..")   --> he..o world    1
      > string.gsub("hello world", "a", ".")     --> hello world    0

It also returns, as a second result, the number of replacements it made.
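
The second result is sometimes all we want, and sometimes a nuisance. The sketch below shows both situations; count is a hypothetical helper, and it assumes that its second argument contains only plain letters, with none of the special pattern characters:

```lua
-- hypothetical helper: number of occurrences of pat in s
local function count(s, pat)
  local _, n = string.gsub(s, pat, "")
  return n
end

print(count("hello world", "l"))    --> 3
print(count("hello world", "xyz"))  --> 0

-- extra parentheses discard the count and keep only the new string
local dotted = (string.gsub("hello world", "l", "."))
print(dotted)                       --> he..o wor.d
```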

We will discuss more about these functions and all about pattern matching in Chapter 10, Pattern Matching.

Unicode

Since version 5.3, Lua includes a small library to support operations on Unicode strings encoded in UTF-8. Even before that library, Lua already offered reasonable support for UTF-8 strings.

UTF-8 is the dominant encoding for Unicode on the Web. Because of its compatibility with ASCII, UTF-8 is also the ideal encoding for Lua. That compatibility is enough to ensure that several string-manipulation techniques that work on ASCII strings also work on UTF-8 with no modifications.

UTF-8 represents each Unicode character using a variable number of bytes. For instance, it represents A with one byte, 65; it represents the Hebrew character Aleph, which has code 1488 in Unicode, with the two-byte sequence 215–144. UTF-8 represents all characters in the ASCII range as in ASCII, that is, with a single byte smaller than 128. It represents all other characters using sequences of bytes where the first byte is in the range [194,244] and the continuation bytes are in the range [128,191]. More specifically, the range of the starting bytes for two-byte sequences is [194,223]; for three-byte sequences, the range is [224,239]; and for four-byte sequences, it is [240,244]. None of those ranges overlap. This property ensures that the code sequence of any character never appears as part of the code sequence of any other character. In particular, a byte smaller than 128 never appears in a multibyte sequence; it always represents its corresponding ASCII character.
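
Because the ranges do not overlap, the first byte alone tells how long a sequence is. The function seqlen below is a hypothetical helper that simply transcribes the ranges above; it is a sketch, not a validator (it does not check the continuation bytes). The last line assumes Lua 5.3 or later for utf8.char:

```lua
-- hypothetical helper: number of bytes in the UTF-8 sequence that
-- starts with byte b, or nil if b cannot start a sequence
local function seqlen(b)
  if b < 128 then return 1
  elseif b >= 194 and b <= 223 then return 2
  elseif b >= 224 and b <= 239 then return 3
  elseif b >= 240 and b <= 244 then return 4
  else return nil
  end
end

print(seqlen(string.byte("A")))        --> 1
print(seqlen(215))                     --> 2    (first byte of Aleph)
print(seqlen(144))                     --> nil  (a continuation byte)
print(utf8.char(1488) == "\215\144")   --> true
```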

Several things in Lua “just work” for UTF-8 strings. Because Lua is 8-bit clean, it can read, write, and store UTF-8 strings just like other strings. Literal strings can contain UTF-8 data. (Of course, you probably will want to edit your source code as a UTF-8 file in a UTF-8–aware editor.) The concatenation operation works correctly for UTF-8 strings. String order operators (less than, less equal, etc.) compare UTF-8 strings following the order of their character codes in Unicode.

Lua’s operating-system library and I/O library are mainly interfaces to the underlying system, so their support for UTF-8 strings depends on that underlying system. On Linux, for instance, we can use UTF-8 for file names, but Windows uses UTF-16. Therefore, to manipulate Unicode file names on Windows, we need either extra libraries or changes to the standard Lua libraries.

Let us now see how functions from the string library handle UTF-8 strings. The functions reverse, upper, lower, byte, and char do not work for UTF-8 strings, as all of them assume that one character is equivalent to one byte. The functions string.format and string.rep work without problems with UTF-8 strings except for the format option '%c', which assumes that one character is one byte. The functions string.len and string.sub work correctly with UTF-8 strings, with indices referring to byte counts (not character counts). More often than not, this is what we need.
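
The following sketch makes the byte orientation concrete. It assumes Lua 5.3 or later and uses utf8.len (from the utf8 library) only as a quick validity check; on invalid input that function returns nil plus the position of the first invalid byte:

```lua
local s = "r\u{E9}sum\u{E9}"             -- "résumé"
print(#s, string.len(s))                 --> 8    8
print(string.sub(s, 1, 3) == "r\u{E9}")  --> true  (bytes 1 to 3)

-- cutting in the middle of a character leaves an invalid sequence
print(utf8.len(string.sub(s, 1, 2)))     --> nil    2

-- reversing the bytes breaks the multibyte sequences
print(utf8.len(string.reverse(s)))       --> nil    1
```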

Let us now have a look at the new utf8 library. The function utf8.len returns the number of UTF-8 characters (codepoints) in a given string. Moreover, it validates the string: if it finds any invalid byte sequence, it returns nil plus the position of the first invalid byte:

      > utf8.len("résumé")              --> 6
      > utf8.len("ação")                --> 4
      > utf8.len("Månen")               --> 5
      > utf8.len("ab\x93")              --> nil    3

(Of course, to run these examples we need a terminal that understands UTF-8.)

The functions utf8.char and utf8.codepoint are the equivalent of string.char and string.byte in the UTF-8 world:

      > utf8.char(114, 233, 115, 117, 109, 233)    --> résumé
      > utf8.codepoint("résumé", 6, 7)             --> 109    233

Note the indices in the last line. Most functions in the utf8 library work with indices in bytes. For instance, the call utf8.codepoint(s, i, j) considers both i and j to be byte positions in string s. If we want to use character indices, the function utf8.offset converts a character position to a byte position:

      > s = "Nähdään"
      > utf8.codepoint(s, utf8.offset(s, 5))    --> 228
      > utf8.char(228)                          --> ä

In this example, we used utf8.offset to get the byte index of the fifth character in the string, and then provided that index to codepoint.
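
The same conversion lets us extract substrings by character positions. The function usub below is a hypothetical helper, sketched under the assumption that both indices are positive and within the string (1 <= i <= j <= utf8.len(s)); it does not handle negative or out-of-range indices:

```lua
-- hypothetical helper: substring from the i-th to the j-th character
local function usub(s, i, j)
  local first = utf8.offset(s, i)          -- byte where character i starts
  local last = utf8.offset(s, j + 1) - 1   -- byte just before character j+1
  return string.sub(s, first, last)
end

local s = "N\u{E4}hd\u{E4}\u{E4}n"         -- "Nähdään"
print(usub(s, 2, 5) == "\u{E4}hd\u{E4}")   --> true
print(usub(s, 7, 7))                       --> n
```

The sketch relies on utf8.offset returning the position just after the end of the string when asked for the character that follows the last one.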

As in the string library, the character index for utf8.offset can be negative, in which case the counting is from the end of the string:

      > s = "ÃøÆËÐ"
      > string.sub(s, utf8.offset(s, -2))    --> ËÐ

The last function in the utf8 library is utf8.codes. It allows us to iterate over the characters in a UTF-8 string:

      for i, c in utf8.codes("Ação") do
        print(i, c)
      end
        --> 1    65
        --> 2    231
        --> 4    227
        --> 6    111

This construction traverses all characters in the given string, assigning the position in bytes and the numeric code of each character to two local variables. In our example, the loop body only prints the values of those variables. (We will discuss iterators in more detail in Chapter 18, Iterators and the Generic for.)
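
A common use of this loop is to collect the codepoints of a string into a list, which we can then handle with character indices. A minimal sketch, assuming Lua 5.3 or later:

```lua
local s = "A\u{E7}\u{E3}o"                -- "Ação"
local cps = {}
for _, c in utf8.codes(s) do
  cps[#cps + 1] = c                       -- append each codepoint
end
print(#cps, cps[2])                       --> 4    231

-- utf8.char rebuilds the original string from the list
print(utf8.char(table.unpack(cps)) == s)  --> true
```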

Unfortunately, there is not much more that Lua can offer. Unicode has too many peculiarities. Almost no concept can be abstracted away from specific languages. Even the concept of what is a character is vague, because there is no one-to-one correspondence between Unicode coded characters and graphemes. For instance, the common grapheme é can be represented by a single codepoint ("\u{E9}") or by two codepoints, an e followed by a diacritical mark ("e\u{301}"). Other apparently basic concepts, such as what is a letter, also change across different languages. Because of this complexity, complete support for Unicode demands huge tables, which are incompatible with the small size of Lua. So, for anything fancier, the best approach is an external library.
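
The two representations of é make the point concrete. In the sketch below (Lua 5.3 or later assumed), both strings look the same when displayed, but Lua sees them as different strings with different lengths:

```lua
local composed = "\u{E9}"        -- é as a single codepoint
local decomposed = "e\u{301}"    -- e followed by a combining acute accent
print(composed == decomposed)                     --> false
print(utf8.len(composed), utf8.len(decomposed))   --> 1    2
print(#composed, #decomposed)                     --> 2    3
```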

Exercises

Exercise 4.1: How can you embed the following fragment of XML as a string in a Lua program?

      <![CDATA[
        Hello world
      ]]>

Show at least two different ways.

Exercise 4.2: Suppose you need to write a long sequence of arbitrary bytes as a literal string in Lua. What format would you use? Consider issues like readability, maximum line length, and size.

Exercise 4.3: Write a function to insert a string into a given position of another one:

      > insert("hello world", 1, "start: ")    --> start: hello world
      > insert("hello world", 7, "small ")     --> hello small world

Exercise 4.4: Redo the previous exercise for UTF-8 strings:

      > insert("ação", 5, "!")     --> ação!

(Note that the position now is counted in codepoints.)

Exercise 4.5: Write a function to remove a slice from a string; the slice should be given by its initial position and its length:

      > remove("hello world", 7, 4)     --> hello d

Exercise 4.6: Redo the previous exercise for UTF-8 strings:

      > remove("ação", 2, 2)     --> ao

(Here, both the initial position and the length should be counted in codepoints.)

Exercise 4.7: Write a function to check whether a given string is a palindrome:

      > ispali("step on no pets")     --> true
      > ispali("banana")              --> false

Exercise 4.8: Redo the previous exercise so that it ignores differences in spaces and punctuation.

Exercise 4.9: Redo the previous exercise for UTF-8 strings.