10 Pattern Matching

Unlike several other scripting languages, Lua uses neither POSIX regex nor Perl regular expressions for pattern matching. The main reason for this decision is size: a typical implementation of POSIX regular expressions takes more than 4000 lines of code, which is more than half the size of all Lua standard libraries together. In comparison, the implementation of pattern matching in Lua has less than 600 lines. Of course, pattern matching in Lua cannot do all that a full POSIX implementation does. Nevertheless, pattern matching in Lua is a powerful tool, and includes some features that are difficult to match with standard POSIX implementations.

The Pattern-Matching Functions

The string library offers four functions based on patterns. We have already had a glimpse at find and gsub; the other two are match and gmatch (Global Match). Now we will see all of them in detail.

The function `string.find`

The function string.find searches for a pattern inside a given subject string. The simplest form of a pattern is a word, which matches only a copy of itself. For instance, the pattern ’hello’ will search for the substring "hello" inside the subject string. When string.find finds its pattern, it returns two values: the index where the match begins and the index where the match ends. If it does not find a match, it returns nil:

      s = "hello world"
      i, j = string.find(s, "hello")
      print(i, j)                      --> 1    5
      print(string.sub(s, i, j))       --> hello
      print(string.find(s, "world"))   --> 7    11
      i, j = string.find(s, "l")
      print(i, j)                      --> 3    3
      print(string.find(s, "lll"))     --> nil

When a match succeeds, we can call string.sub with the values returned by find to get the part of the subject string that matched the pattern. For simple patterns, this is necessarily the pattern itself.

The function string.find has two optional parameters. The third parameter is an index that tells where in the subject string to start the search. The fourth parameter, a Boolean, indicates a plain search. A plain search, as the name implies, does a plain “find substring” search in the subject, ignoring patterns:

      > string.find("a [word]", "[")
      stdin:1: malformed pattern (missing ']')
      > string.find("a [word]", "[", 1, true)   --> 3   3

In the first call, the function complains because ’[’ has a special meaning in patterns. In the second call, the function treats ’[’ as a simple string. Note that we cannot pass the fourth optional parameter without the third one.

The function `string.match`

The function string.match is similar to find, in the sense that it also searches for a pattern in a string. However, instead of returning the positions where it found the pattern, it returns the part of the subject string that matched the pattern:

      print(string.match("hello world", "hello"))   --> hello

For fixed patterns such as ’hello’, this function is pointless. It shows its power when used with variable patterns, as in the next example:

      date = "Today is 17/7/1990"
      d = string.match(date, "%d+/%d+/%d+")
      print(d)  --> 17/7/1990

Shortly we will discuss the meaning of the pattern ’%d+/%d+/%d+’ and more advanced uses for string.match.

The function `string.gsub`

The function string.gsub has three mandatory parameters: a subject string, a pattern, and a replacement string. Its basic use is to substitute the replacement string for all occurrences of the pattern inside the subject string:

      s = string.gsub("Lua is cute", "cute", "great")
      print(s)         --> Lua is great
      s = string.gsub("all lii", "l", "x")
      print(s)         --> axx xii
      s = string.gsub("Lua is great", "Sol", "Sun")
      print(s)         --> Lua is great

An optional fourth parameter limits the number of substitutions to be made:

      s = string.gsub("all lii", "l", "x", 1)
      print(s)          --> axl lii
      s = string.gsub("all lii", "l", "x", 2)
      print(s)          --> axx lii

Instead of a replacement string, the third argument to string.gsub can be also a function or a table, which is called (or indexed) to produce the replacement string; we will cover this feature in the section called “Replacements”.

The function string.gsub also returns as a second result the number of times it made the substitution.

The function `string.gmatch`

The function string.gmatch returns a function that iterates over all occurrences of a pattern in a string. For instance, the following example collects all words of a given string s:

      s = "some string"
      words = {}
      for w in string.gmatch(s, "%a+") do
        words[#words + 1] = w
      end

As we will discuss shortly, the pattern ’%a+’ matches sequences of one or more alphabetic characters (that is, words). So, the for loop will iterate over all words of the subject string, storing them in the list words.

Patterns

Most pattern-matching libraries use the backslash as an escape. However, this choice has some annoying consequences. For the Lua parser, patterns are regular strings. They have no special treatment and follow the same rules as other strings. Only the pattern-matching functions interpret them as patterns. Because the backslash is the escape character in Lua, we have to escape it to pass it to any function. Patterns are naturally hard to read, and writing "\\" instead of "\" everywhere does not help.

We could ameliorate this problem with long strings, enclosing patterns between double brackets. (Some languages recommend this practice.) However, the long-string notation seems cumbersome for patterns, which are usually short. Moreover, we would lose the ability to use escapes inside patterns. (Some pattern-matching tools work around this limitation by reimplementing the usual string escapes.)

Lua’s solution is simpler: patterns in Lua use the percent sign as an escape. (Several functions in C, such as printf and strftime, adopt the same solution.) In general, any escaped alphanumeric character has some special meaning (e.g., ’%a’ matches any letter), while any escaped non-alphanumeric character represents itself (e.g., ’%.’ matches a dot).

We will start our discussion about patterns with character classes. A character class is an item in a pattern that can match any character in a specific set. For instance, the class %d matches any digit. Therefore, we can search for a date in the format dd/mm/yyyy with the pattern ’%d%d/%d%d/%d%d%d%d’:

      s = "Deadline is 30/05/1999, firm"
      date = "%d%d/%d%d/%d%d%d%d"
      print(string.match(s, date))   --> 30/05/1999

The following table lists the predefined character classes with their meanings:

`.`	all characters
`%a`	letters
`%c`	control characters
`%d`	digits
`%g`	printable characters except spaces
`%l`	lower-case letters
`%p`	punctuation characters
`%s`	space characters
`%u`	upper-case letters
`%w`	alphanumeric characters
`%x`	hexadecimal digits

An upper-case version of any of these classes represents the complement of the class. For instance, ’%A’ represents all non-letter characters:

      print((string.gsub("hello, up-down!", "%A", ".")))
        --> hello..up.down.

(When printing the results of gsub, I am using extra parentheses to discard its second result, which is the number of substitutions.)

Some characters, called magic characters, have special meanings when used in a pattern. Patterns in Lua use the following magic characters:

      ( ) . % + - * ? [ ] ^ $

As we have seen, the percent sign works as an escape for these magic characters. So, ’%?’ matches a question mark and ’%%’ matches a percent sign itself. We can escape not only the magic characters, but also any non-alphanumeric character. When in doubt, play safe and use an escape.

A char-set allows us to create our own character classes, grouping single characters and classes inside square brackets. For instance, the char-set ’[%w_]’ matches both alphanumeric characters and underscores, ’[01]’ matches binary digits, and ’[%[%]]’ matches square brackets. To count the number of vowels in a text, we can write this code:

      _, nvow = string.gsub(text, "[AEIOUaeiou]", "")

We can also include character ranges in a char-set, by writing the first and the last characters of the range separated by a hyphen. I seldom use this feature, because most useful ranges are predefined; for instance, ’%d’ substitutes ’[0-9]’, and ’%x’ substitutes ’[0-9a-fA-F]’. However, if you need to find an octal digit, you may prefer ’[0-7]’ instead of an explicit enumeration like ’[01234567]’.

We can get the complement of any char-set by starting it with a caret: the pattern ’[^0-7]’ finds any character that is not an octal digit and ’[^\n]’ matches any character different from newline. Nevertheless, remember that you can negate simple classes with its upper-case version: ’%S’ is simpler than ’[^%s]’.

We can make patterns still more useful with modifiers for repetitions and optional parts. Patterns in Lua offer four modifiers:

`+`	1 or more repetitions
`*`	0 or more repetitions
`-`	0 or more lazy repetitions
`?`	optional (0 or 1 occurrence)

The plus modifier matches one or more characters of the original class. It will always get the longest sequence that matches the pattern. For instance, the pattern ’%a+’ means one or more letters (a word):

      print((string.gsub("one, and two; and three", "%a+", "word")))
        --> word, word word; word word

The pattern ’%d+’ matches one or more digits (an integer numeral):

      print(string.match("the number 1298 is even", "%d+"))   --> 1298

The asterisk modifier is similar to plus, but it also accepts zero occurrences of characters of the class. A typical use is to match optional spaces between parts of a pattern. For instance, to match an empty parenthesis pair, such as () or ( ), we can use the pattern ’%(%s*%)’, where the pattern ’%s*’ matches zero or more spaces. (Parentheses have a special meaning in a pattern, so we must escape them.) As another example, the pattern ’[_%a][_%w]*’ matches identifiers in a Lua program: a sequence starting with a letter or an underscore, followed by zero or more underscores or alphanumeric characters.

Like an asterisk, the minus modifier also matches zero or more occurrences of characters of the original class. However, instead of matching the longest sequence, it matches the shortest one. Sometimes there is no difference between asterisk and minus, but usually they give rather different results. For instance, if we try to find an identifier with the pattern ’[_%a][_%w]-’, we will find only the first letter, because the ’[_%w]-’ will always match the empty sequence. On the other hand, suppose we want to erase comments in a C program. Many people would first try ’/%*.*%*/’ (that is, a "/*" followed by a sequence of any characters followed by "*/", written with the appropriate escapes). However, because the ’.*’ expands as far as it can, the first "/*" in the program would close only with the last "*/":

      test = "int x; /* x */  int y; /* y */"
      print((string.gsub(test, "/%*.*%*/", "")))
        --> int x;

The pattern ’.-’, instead, will expand only as much as necessary to find the first "*/", so that we get the desired result:

      test = "int x; /* x */  int y; /* y */"
      print((string.gsub(test, "/%*.-%*/", "")))
        --> int x;   int y;

The last modifier, the question mark, matches an optional character. As an example, suppose we want to find an integer in a text, where the number can contain an optional sign. The pattern ’[+-]?%d+’ does the job, matching numerals like "-12", "23", and "+1009". The character class ’[+-]’ matches either a plus or a minus sign; the following ? makes this sign optional.

Unlike some other systems, in Lua we can apply a modifier only to a character class; there is no way to group patterns under a modifier. For instance, there is no pattern that matches an optional word (unless the word has only one letter). Usually, we can circumvent this limitation using some of the advanced techniques that we will see in the end of this chapter.

If a pattern begins with a caret, it will match only at the beginning of the subject string. Similarly, if it ends with a dollar sign, it will match only at the end of the subject string. We can use these marks both to restrict the matches that we find and to anchor patterns. For instance, the next test checks whether the string s starts with a digit:

      if string.find(s, "^%d") then ...

The next one checks whether that string represents an integer number, without any other leading or trailing characters:

      if string.find(s, "^[+-]?%d+$") then ...

The caret and dollar signs are magic only when used in the beginning or end of the pattern. Otherwise, they act as regular characters matching themselves.

Another item in a pattern is ’%b’, which matches balanced strings. We write this item as ’%bxy’, where x and y are any two distinct characters; the x acts as an opening character and the y as the closing one. For instance, the pattern ’%b()’ matches parts of the string that start with a left parenthesis and finish at the respective right one:

      s = "a (enclosed (in) parentheses) line"
      print((string.gsub(s, "%b()", "")))     --> a  line

Typically, we use this pattern as ’%b()’, ’%b[]’, ’%b{}’, or ’%b<>’, but we can use any two distinct characters as delimiters.

Finally, the item ’%f[char-set]’ represents a frontier pattern. It matches an empty string only if the next character is in char-set but the previous one is not:

      s = "the anthem is the theme"
      print((string.gsub(s, "%f[%w]the%f[%W]", "one")))
        --> one anthem is one theme

The pattern ’%f[%w]’ matches a frontier between a non-alphanumeric and an alphanumeric character, and the pattern ’%f[%W]’ matches a frontier between an alphanumeric and a non-alphanumeric character. Therefore, the given pattern matches the string "the" only as an entire word. Note that we must write the char-set inside brackets, even when it is a single class.

The frontier pattern treats the positions before the first and after the last characters in the subject string as if they had the null character (ASCII code zero). In the previous example, the first "the" starts with a frontier between a null character, which is not in the set ’[%w]’, and a t, which is.

Captures

The capture mechanism allows a pattern to yank parts of the subject string that match parts of the pattern for further use. We specify a capture by writing the parts of the pattern that we want to capture between parentheses.

When a pattern has captures, the function string.match returns each captured value as a separate result; in other words, it breaks a string into its captured parts.

      pair = "name = Anna"
      key, value = string.match(pair, "(%a+)%s*=%s*(%a+)")
      print(key, value)  --> name  Anna

The pattern ’%a+’ specifies a non-empty sequence of letters; the pattern ’%s*’ specifies a possibly empty sequence of spaces. So, in the example above, the whole pattern specifies a sequence of letters, followed by a sequence of spaces, followed by an equals sign, again followed by spaces, plus another sequence of letters. Both sequences of letters have their patterns enclosed in parentheses, so that they will be captured if a match occurs. Below is a similar example:

      date = "Today is 17/7/1990"
      d, m, y = string.match(date, "(%d+)/(%d+)/(%d+)")
      print(d, m, y)  --> 17  7  1990

In this example, we use three captures, one for each sequence of digits.

In a pattern, an item like ’%n’, where n is a single digit, matches only a copy of the n-th capture. As a typical use, suppose we want to find, inside a string, a substring enclosed between single or double quotes. We could try a pattern such as ’["'].-["']’, that is, a quote followed by anything followed by another quote; but we would have problems with strings like "it's all right". To solve this problem, we can capture the first quote and use it to specify the second one:

      s = [[then he said: "it's all right"!]]
      q, quotedPart = string.match(s, "([\"'])(.-)%1")
      print(quotedPart)   --> it's all right
      print(q)            --> "

The first capture is the quote character itself and the second capture is the contents of the quote (the substring matching the ’.-’).

A similar example is this pattern, which matches long strings in Lua:

      %[(=*)%[(.-)%]%1%]

It will match an opening square bracket followed by zero or more equals signs, followed by another opening square bracket, followed by anything (the string content), followed by a closing square bracket, followed by the same number of equals signs, followed by another closing square bracket:

      p = "%[(=*)%[(.-)%]%1%]"
      s = "a = [=[[[ something ]] ]==] ]=]; print(a)"
      print(string.match(s, p))    --> =       [[ something ]] ]==]

The first capture is the sequence of equals signs (only one sign in this example); the second is the string content.

The third use of captured values is in the replacement string of gsub. Like the pattern, the replacement string can also contain items like "%n", which are changed to the respective captures when the substitution is made. In particular, the item "%0" becomes the whole match. (By the way, a percent sign in the replacement string must be escaped as "%%".) As an example, the following command duplicates every letter in a string, with a hyphen between the copies:

      print((string.gsub("hello Lua!", "%a", "%0-%0")))
        --> h-he-el-ll-lo-o L-Lu-ua-a!

This one interchanges adjacent characters:

      print((string.gsub("hello Lua", "(.)(.)", "%2%1")))
        --> ehll ouLa

As a more useful example, let us write a primitive format converter, which gets a string with commands written in a LaTeX style and changes them to a format in XML style:

      \command{some text}    -->    <command>some text</command>

If we disallow nested commands, the following call to string.gsub does the job:

      s = [[the \quote{task} is to \em{change} that.]]
      s = string.gsub(s, "\\(%a+){(.-)}", "<%1>%2</%1>")
      print(s)
        --> the <quote>task</quote> is to <em>change</em> that.

(In the next section, we will see how to handle nested commands.)

Another useful example is how to trim a string:

      function trim (s)
        s = string.gsub(s, "^%s*(.-)%s*$", "%1")
        return s
      end

Note the judicious use of pattern modifiers. The two anchors (^ and $) ensure that we get the whole string. Because the ’.-’ in the middle tries to expand as little as possible, the two enclosing patterns ’%s*’ match all spaces at both extremities.

Replacements

As we have seen already, we can use either a function or a table as the third argument to string.gsub, instead of a string. When invoked with a function, string.gsub calls the function every time it finds a match; the arguments to each call are the captures, and the value that the function returns becomes the replacement string. When invoked with a table, string.gsub looks up the table using the first capture as the key, and the associated value is used as the replacement string. If the result from the call or from the table lookup is nil, gsub does not change the match.

As a first example, the following function does variable expansion: it substitutes the value of the global variable varname for every occurrence of $varname in a string:

      function expand (s)
        return (string.gsub(s, "$(%w+)", _G))
      end
      
      name = "Lua"; status = "great"
      print(expand("$name is $status, isn't it?"))
        --> Lua is great, isn't it?

(As we will discuss in detail in Chapter 22, The Environment, _G is a predefined table containing all global variables.) For each match with ’$(%w+)’ (a dollar sign followed by a name), gsub looks up the captured name in the global table _G; the result replaces the match. When the table does not have the key, there is no replacement:

      print(expand("$othername is $status, isn't it?"))
        --> $othername is great, isn't it?

If we are not sure whether the given variables have string values, we may want to apply tostring to their values. In this case, we can use a function as the replacement value:

      function expand (s)
        return (string.gsub(s, "$(%w+)", function (n)
                  return tostring(_G[n])
                end))
      end
      
      print(expand("print = $print; a = $a"))
        --> print = function: 0x8050ce0; a = nil

Inside expand, for each match with ’$(%w+)’, gsub calls the given function with the captured name as argument; the returned string replaces the match.

The last example goes back to our format converter from the previous section. Again, we want to convert commands in LaTeX style (\example{text}) to XML style (<example>text</example>), but allowing nested commands this time. The following function uses recursion to do the job:

      function toxml (s)
        s = string.gsub(s, "\\(%a+)(%b{})", function (tag, body)
              body = string.sub(body, 2, -2)  -- remove the brackets
              body = toxml(body)              -- handle nested commands
              return string.format("<%s>%s</%s>", tag, body, tag)
            end)
        return s
      end
      
      print(toxml("\\title{The \\bold{big} example}"))
        --> <title>The <bold>big</bold> example</title>

URL encoding

Our next example will use URL encoding, which is the encoding used by HTTP to send parameters embedded in a URL. This encoding represents special characters (such as =, &, and +) as "%xx", where xx is the character code in hexadecimal. After that, it changes spaces to plus signs. For instance, it encodes the string "a+b = c" as "a%2Bb+%3D+c". Finally, it writes each parameter name and parameter value with an equals sign in between and appends all resulting pairs name = value with an ampersand in between. For instance, the values

      name = "al";  query = "a+b = c"; q="yes or no"

are encoded as "name=al&query=a%2Bb+%3D+c&q=yes+or+no".

Now, suppose we want to decode this URL and store each value in a table, indexed by its corresponding name. The following function does the basic decoding:

      function unescape (s)
        s = string.gsub(s, "+", " ")
        s = string.gsub(s, "%%(%x%x)", function (h)
              return string.char(tonumber(h, 16))
            end)
        return s
      end
      
      print(unescape("a%2Bb+%3D+c"))  --> a+b = c

The first gsub changes each plus sign in the string to a space. The second gsub matches all two-digit hexadecimal numerals preceded by a percent sign and calls an anonymous function for each match. This function converts the hexadecimal numeral into a number (using tonumber with base 16) and returns the corresponding character (string.char).

To decode the pairs name=value, we use gmatch. Because neither names nor values can contain either ampersands or equals signs, we can match them with the pattern ’[^&=]+’:

      cgi = {}
      function decode (s)
        for name, value in string.gmatch(s, "([^&=]+)=([^&=]+)") do
          name = unescape(name)
          value = unescape(value)
          cgi[name] = value
        end
      end

The call to gmatch matches all pairs in the form name=value. For each pair, the iterator returns the corresponding captures (as marked by the parentheses in the matching string) as the values for name and value. The loop body simply applies unescape to both strings and stores the pair in the cgi table.

The corresponding encoding is also easy to write. First, we write the escape function; this function encodes all special characters as a percent sign followed by the character code in hexadecimal (the format option "%02X" makes a hexadecimal number with two digits, using 0 for padding), and then changes spaces to plus signs:

      function escape (s)
        s = string.gsub(s, "[&=+%%%c]", function (c)
              return string.format("%%%02X", string.byte(c))
            end)
        s = string.gsub(s, " ", "+")
        return s
      end

The encode function traverses the table to be encoded, building the resulting string:

      function encode (t)
        local b = {}
        for k,v in pairs(t) do
          b[#b + 1] = (escape(k) .. "=" .. escape(v))
        end
        -- concatenates all entries in 'b', separated by "&"
        return table.concat(b, "&")
      end
      
      t = {name = "al",  query = "a+b = c", q = "yes or no"}
      print(encode(t)) --> q=yes+or+no&query=a%2Bb+%3D+c&name=al

Tab expansion

An empty capture like ’()’ has a special meaning in Lua. Instead of capturing nothing (a useless task), this pattern captures its position in the subject string, as a number:

      print(string.match("hello", "()ll()"))   --> 3  5

(Note that the result of this example is not the same as what we get from string.find, because the position of the second empty capture is after the match.)

A nice example of the use of position captures is for expanding tabs in a string:

      function expandTabs (s, tab)
        tab = tab or 8       -- tab "size" (default is 8)
        local corr = 0       -- correction
        s = string.gsub(s, "()\t", function (p)
              local sp = tab - (p - 1 + corr)%tab
              corr = corr - 1 + sp
              return string.rep(" ", sp)
            end)
        return s
      end

The gsub pattern matches all tabs in the string, capturing their positions. For each tab, the anonymous function uses this position to compute the number of spaces needed to arrive at a column that is a multiple of tab: it subtracts one from the position to make it relative to zero and adds corr to compensate for previous tabs. (The expansion of each tab affects the position of the following ones.) It then updates the correction for the next tab: minus one for the tab being removed, plus sp for the spaces being added. Finally, it returns a string with the appropriate number of spaces to replace the tab.

Just for completeness, let us see how to reverse this operation, converting spaces to tabs. A first approach could also involve the use of empty captures to manipulate positions, but there is a simpler solution: at every eighth character, we insert a mark in the string. Then, wherever the mark is preceded by spaces, we replace the sequence spaces–mark by a tab:

      function unexpandTabs (s, tab)
        tab = tab or 8
        s = expandTabs(s, tab)
        local pat = string.rep(".", tab)
        s = string.gsub(s, pat, "%0\1")
        s = string.gsub(s, " +\1", "\t")
        s = string.gsub(s, "\1", "")
        return s
      end

The function starts by expanding the string to remove any previous tabs. Then it computes an auxiliary pattern for matching all sequences of eight characters, and uses this pattern to add a mark (the control character \1) after every eight characters. It then substitutes a tab for all sequences of one or more spaces followed by a mark. Finally, it removes the marks left (those not preceded by spaces).

Tricks of the Trade

Pattern matching is a powerful tool for manipulating strings. We can perform many complex operations with only a few calls to string.gsub. However, as with any power, we must use it carefully.

Pattern matching is not a replacement for a proper parser. For quick-and-dirty programs, we can do useful manipulations on source code, but it may be hard to build a product with quality. As a good example, consider the pattern we used to match comments in a C program: ’/%*.-%*/’. If the program has a literal string containing "/*", we may get a wrong result:

      test = [[char s[] = "a /* here";  /* a tricky string */]]
      print((string.gsub(test, "/%*.-%*/", "<COMMENT>")))
        --> char s[] = "a <COMMENT>

Strings with such contents are rare. For our own use, that pattern will probably do its job, but we should not distribute a program with such a flaw.

Usually, pattern matching is efficient enough for Lua programs: my new machine takes less than 0.2 seconds to count all words in a 4.4 MB text (850 K-words).^[12] But we can take precautions. We should always make the pattern as specific as possible; loose patterns are slower than specific ones. An extreme example is ’(.-)%$’, to get all text in a string up to the first dollar sign. If the subject string has a dollar sign, everything goes fine, but suppose that the string does not contain any dollar signs. The algorithm will first try to match the pattern starting at the first position of the string. It will go through all the string, looking for a dollar. When the string ends, the pattern fails for the first position of the string. Then, the algorithm will do the whole search again, starting at the second position of the string, only to discover that the pattern does not match there, too, repeating the search for every position in the string. This will take a quadratic time, resulting in more than four minutes in my new machine for a string of 200K characters. We can correct this problem simply by anchoring the pattern at the first position of the string, with ’^(.-)%$’. The anchor tells the algorithm to stop the search if it cannot find a match at the first position. With the anchor, the match runs in a hundredth of a second.

Beware also of empty patterns, that is, patterns that match the empty string. For instance, if we try to match names with a pattern like ’%a*’, we will find names everywhere:

      i, j = string.find(";$%  **#$hello13", "%a*")
      print(i,j)   --> 1  0

In this example, the call to string.find has correctly found an empty sequence of letters at the beginning of the string.

It never makes sense to write a pattern that ends with the minus modifier, because it will match only the empty string. This modifier always needs something after it to anchor its expansion. Similarly, patterns that include ’.*’ are tricky, because this construction can expand much more than we intended.

Sometimes, it is useful to use Lua itself to build a pattern. We already used this trick in our function to convert spaces to tabs. As another example, let us see how we can find long lines in a text, for instance lines with more than 70 characters. A long line is a sequence of 70 or more characters different from newline. We can match a single character different from newline with the character class ’[^\n]’. Therefore, we can match a long line with a pattern that repeats 70 times the pattern for one character, finishing in a repetition for that pattern (to match the rest of the line). Instead of writing this pattern by hand, we can create it with string.rep:

      pattern = string.rep("[^\n]", 70) .. "+"

As another example, suppose we want to make a case-insensitive search. A way of doing this is to change any letter x in the pattern to the class ’[xX]’, that is, a class including both the lower and the upper-case versions of the original letter. We can automate this conversion with a function:

      function nocase (s)
        s = string.gsub(s, "%a", function (c)
              return "[" .. string.lower(c) .. string.upper(c) .. "]"
            end)
        return s
      end
      
      print(nocase("Hi there!"))   --> [hH][iI] [tT][hH][eE][rR][eE]!

Sometimes, we want to replace every plain occurrence of s1 with s2, without regarding any character as magic. If the strings s1 and s2 are literals, we can add proper escapes to magic characters while we write the strings. If these strings are variable values, we can use another gsub to put the escapes for us:

      s1 = string.gsub(s1, "(%W)", "%%%1")
      s2 = string.gsub(s2, "%%", "%%%%")

In the search string, we escape all non-alphanumeric characters (thus the upper-case W). In the replacement string, we escape only the percent sign.

Another useful technique for pattern matching is to preprocess the subject string before the real work. Suppose we want to change to upper case all quoted strings in a text, where a quoted string starts and ends with a double quote ("), but may contain escaped quotes ("\""):

      follows a typical string: "This is \"great\"!".

One approach for handling such cases is to preprocess the text to encode the problematic sequence as something else. For instance, we could code "\"" as "\1". However, if the original text already contains a "\1", we are in trouble. An easy way to do the encoding and avoid this problem is to code all sequences "\x" as "\ddd", where ddd is the decimal representation of the character x:

      function code (s)
        return (string.gsub(s, "\\(.)", function (x)
                  return string.format("\\%03d", string.byte(x))
                end))
      end

Now any sequence "\ddd" in the encoded string must have come from the coding, because any "\ddd" in the original string has been coded, too. So, the decoding is an easy task:

      function decode (s)
        return (string.gsub(s, "\\(%d%d%d)", function (d)
                  return "\\" .. string.char(tonumber(d))
                end))
      end

Now we can complete our task. As the encoded string does not contain any escaped quote ("\""), we can search for quoted strings simply with ’".-"’:

      s = [[follows a typical string: "This is \"great\"!".]]
      s = code(s)
      s = string.gsub(s, '".-"', string.upper)
      s = decode(s)
      print(s)    --> follows a typical string: "THIS IS \"GREAT\"!".

We can also write it like here:

      print(decode(string.gsub(code(s), '".-"', string.upper)))

The applicability of pattern-matching functions to UTF-8 strings depends on the pattern. Literal patterns work without problems, due to the key property of UTF-8 that the encoding of any character never appears inside the encoding of any other character. Character classes and character sets work only for ASCII characters. For instance, the pattern ’%s’ works on UTF-8 strings, but it will match only the ASCII white spaces; it will not match extra Unicode white spaces such as a non-break space (U+00A0) or a Mongolian vowel separator (U+180E).

Judicious patterns can bring some extra power to Unicode handling. A good example is the predefined pattern utf8.charpattern, which matches exactly one UTF-8 character. The utf8 library defines this pattern as follows:

      utf8.charpattern = [\0-\x7F\xC2-\xF4][\x80-\xBF]*

The first part is a class that matches either ASCII characters (range [0, 0x7F]) or initial bytes for multibyte sequences (range [0xC2, 0xF4]). The second part matches zero or more continuation bytes (range [0x80, 0xBF]).

Exercises

Exercise 10.1: Write a function split that receives a string and a delimiter pattern and returns a sequence with the chunks in the original string separated by the delimiter:

      t = split("a whole new world", " ")
      -- t = {"a", "whole", "new", "world"}

How does your function handle empty strings? (In particular, is an empty string an empty sequence or a sequence with one empty string?)

Exercise 10.2: The patterns ’%D’ and ’[^%d]’ are equivalent. What about the patterns ’[^%d%u]’ and ’[%D%U]’?

Exercise 10.3: Write a function transliterate. This function receives a string and replaces each character in that string with another character, according to a table given as a second argument. If the table maps a to b, the function should replace any occurrence of a with b. If the table maps a to false, the function should remove occurrences of a from the resulting string.

Exercise 10.4: At the end of the section called “Captures”, we defined a trim function. Because of its use of backtracking, this function can take a quadratic time for some strings. (For instance, in my new machine, a match for a 100 KB string can take 52 seconds.)

Create a string that triggers this quadratic behavior in function trim.
Rewrite that function so that it always works in linear time.

Exercise 10.5: Write a function to format a binary string as a literal in Lua, using the escape sequence \x for all bytes:

      print(escape("\0\1hello\200"))
        --> \x00\x01\x68\x65\x6C\x6C\x6F\xC8

As an improved version, use also the escape sequence \z to break long lines.

Exercise 10.6: Rewrite the function transliterate for UTF-8 characters.

Exercise 10.7: Write a function to reverse a UTF-8 string.

^[12] “My new machine” is an Intel Core i7-4790 3.6 GHz, with 8 GB of RAM. I measured all performance data in this book on that machine.