Unlike several other scripting languages, Lua uses neither POSIX regex nor Perl regular expressions for pattern matching. The main reason for this decision is size: a typical implementation of POSIX regular expressions takes more than 4000 lines of code, which is more than half the size of all Lua standard libraries together. In comparison, the implementation of pattern matching in Lua has less than 600 lines. Of course, pattern matching in Lua cannot do all that a full POSIX implementation does. Nevertheless, pattern matching in Lua is a powerful tool, and includes some features that are difficult to match with standard POSIX implementations.
The string library offers four functions based on patterns.
We have already had a glimpse at find
and gsub
;
the other two are match
and gmatch
(Global Match).
Now we will see all of them in detail.
The function string.find
searches for
a pattern inside a given subject string.
The simplest form of a pattern is a word,
which matches only a copy of itself.
For instance, the pattern ’hello
’ will search for
the substring "hello"
inside the subject string.
When string.find
finds its pattern,
it returns two values:
the index where the match begins and the index where the match ends.
If it does not find a match,
it returns nil:
s = "hello world" i, j = string.find(s, "hello") print(i, j) --> 1 5 print(string.sub(s, i, j)) --> hello print(string.find(s, "world")) --> 7 11 i, j = string.find(s, "l") print(i, j) --> 3 3 print(string.find(s, "lll")) --> nil
When a match succeeds,
we can call string.sub
with the values returned by find
to
get the part of the subject string that matched the pattern.
For simple patterns, this is necessarily the pattern itself.
The function string.find
has two optional parameters.
The third parameter is
an index that tells where in the subject string to start the search.
The fourth parameter, a Boolean,
indicates a plain search.
A plain search, as the name implies,
does a plain “find substring” search in the subject,
ignoring patterns:
> string.find("a [word]", "[") stdin:1: malformed pattern (missing ']') > string.find("a [word]", "[", 1, true) --> 3 3
In the first call,
the function complains because ’[
’
has a special meaning in patterns.
In the second call, the function treats ’[
’ as a simple string.
Note that we cannot pass the fourth optional parameter
without the third one.
The function string.match
is similar to find
,
in the sense that it also searches for a pattern in a string.
However, instead of returning the positions where it found the pattern,
it returns the part of the subject string that matched the pattern:
print(string.match("hello world", "hello")) --> hello
For fixed patterns such as ’hello
’,
this function is pointless.
It shows its power when used with variable patterns,
as in the next example:
date = "Today is 17/7/1990" d = string.match(date, "%d+/%d+/%d+") print(d) --> 17/7/1990
Shortly we will discuss the meaning of the pattern ’%d+/%d+/%d+
’
and more advanced uses for string.match
.
The function string.gsub
has three mandatory parameters:
a subject string, a pattern, and a replacement string.
Its basic use is to substitute
the replacement string for all occurrences of
the pattern inside the subject string:
s = string.gsub("Lua is cute", "cute", "great") print(s) --> Lua is great s = string.gsub("all lii", "l", "x") print(s) --> axx xii s = string.gsub("Lua is great", "Sol", "Sun") print(s) --> Lua is great
An optional fourth parameter limits the number of substitutions to be made:
s = string.gsub("all lii", "l", "x", 1) print(s) --> axl lii s = string.gsub("all lii", "l", "x", 2) print(s) --> axx lii
Instead of a replacement string,
the third argument to string.gsub
can be also
a function or a table,
which is called (or indexed) to produce the replacement string;
we will cover this feature in the section called “Replacements”.
The function string.gsub
also returns as a
second result the number of times
it made the substitution.
The function string.gmatch
returns a function that iterates over all occurrences of
a pattern in a string.
For instance, the following example collects all
words of a given string s
:
s = "some string" words = {} for w in string.gmatch(s, "%a+") do words[#words + 1] = w end
As we will discuss shortly,
the pattern ’%a+
’ matches sequences of one or more
alphabetic characters (that is, words).
So, the for loop will iterate over all words
of the subject string,
storing them in the list words
.
Most pattern-matching libraries use the backslash
as an escape.
However, this choice has some annoying consequences.
For the Lua parser, patterns are regular strings.
They have no special treatment
and follow the same rules as other strings.
Only the pattern-matching functions interpret them as patterns.
Because the backslash is the escape character in Lua,
we have to escape it to pass it to any function.
Patterns are naturally hard to read,
and writing "\\"
instead of "\"
everywhere
does not help.
We could ameliorate this problem with long strings, enclosing patterns between double brackets. (Some languages recommend this practice.) However, the long-string notation seems cumbersome for patterns, which are usually short. Moreover, we would lose the ability to use escapes inside patterns. (Some pattern-matching tools work around this limitation by reimplementing the usual string escapes.)
Lua’s solution is simpler:
patterns in Lua use the percent sign as an escape.
(Several functions in C,
such as printf
and strftime
,
adopt the same solution.)
In general,
any escaped alphanumeric character has some special meaning
(e.g., ’%a
’ matches any letter),
while any escaped non-alphanumeric character represents itself
(e.g., ’%.
’ matches a dot).
We will start our discussion about patterns with character classes.
A character class is an item in a pattern that can match any
character in a specific set.
For instance, the class %d
matches any digit.
Therefore, we can search for a date in the format dd/mm/yyyy
with the pattern ’%d%d/%d%d/%d%d%d%d
’:
s = "Deadline is 30/05/1999, firm" date = "%d%d/%d%d/%d%d%d%d" print(string.match(s, date)) --> 30/05/1999
The following table lists the predefined character classes with their meanings:
| all characters |
| letters |
| control characters |
| digits |
| printable characters except spaces |
| lower-case letters |
| punctuation characters |
| space characters |
| upper-case letters |
| alphanumeric characters |
| hexadecimal digits |
An upper-case version of any of these classes represents
the complement of the class.
For instance, ’%A
’ represents all non-letter characters:
print((string.gsub("hello, up-down!", "%A", "."))) --> hello..up.down.
(When printing the results of gsub
,
I am using extra parentheses to discard its second result,
which is the number of substitutions.)
Some characters, called magic characters, have special meanings when used in a pattern. Patterns in Lua use the following magic characters:
( ) . % + - * ? [ ] ^ $
As we have seen,
the percent sign works as an escape
for these magic characters.
So, ’%?
’ matches a question mark and
’%%
’ matches a percent sign itself.
We can escape not only the magic characters,
but also any non-alphanumeric character.
When in doubt, play safe and use an escape.
A char-set allows us to create our own character classes,
grouping single characters and classes
inside square brackets.
For instance,
the char-set ’[%w_]
’ matches both alphanumeric characters
and underscores,
’[01]
’ matches binary digits,
and ’[%[%]]
’ matches square brackets.
To count the number of vowels in a text,
we can write this code:
_, nvow = string.gsub(text, "[AEIOUaeiou]", "")
We can also include character ranges in a char-set,
by writing the first and the last characters of the range
separated by a hyphen.
I seldom use this feature,
because most useful ranges are predefined;
for instance,
’%d
’ substitutes ’[0-9]
’,
and ’%x
’ substitutes ’[0-9a-fA-F]
’.
However, if you need to find an octal digit,
you may prefer ’[0-7]
’
instead of an explicit enumeration like ’[01234567]
’.
We can get the complement of any char-set
by starting it with a caret:
the pattern ’[^0-7]
’ finds any character that is not an octal digit
and ’[^\n]
’ matches any character different from newline.
Nevertheless, remember that you can negate simple
classes with its upper-case version:
’%S
’ is simpler than ’[^%s]
’.
We can make patterns still more useful with modifiers for repetitions and optional parts. Patterns in Lua offer four modifiers:
| 1 or more repetitions |
| 0 or more repetitions |
| 0 or more lazy repetitions |
| optional (0 or 1 occurrence) |
The plus modifier matches one or more
characters of the original class.
It will always get the longest sequence that matches the pattern.
For instance, the pattern ’%a+
’ means
one or more letters (a word):
print((string.gsub("one, and two; and three", "%a+", "word"))) --> word, word word; word word
The pattern ’%d+
’ matches one or more digits (an integer numeral):
print(string.match("the number 1298 is even", "%d+")) --> 1298
The asterisk modifier is similar to plus,
but it also accepts zero occurrences of characters of the class.
A typical use is to match optional
spaces between parts of a pattern.
For instance, to match an empty parenthesis pair,
such as ()
or ( )
,
we can use the pattern ’%(%s*%)
’,
where the pattern ’%s*
’ matches zero or more spaces.
(Parentheses have a special meaning in a pattern,
so we must escape them.)
As another example, the pattern ’[_%a][_%w]*
’
matches identifiers in a Lua program:
a sequence starting with a letter or an underscore,
followed by zero or more underscores or alphanumeric characters.
Like an asterisk,
the minus modifier also matches zero or more occurrences
of characters of the original class.
However, instead of matching the longest sequence,
it matches the shortest one.
Sometimes there is no difference between asterisk and minus,
but usually they give rather different results.
For instance, if we try to find an identifier with the
pattern ’[_%a][_%w]-
’,
we will find only the first letter,
because the ’[_%w]-
’ will always match the empty sequence.
On the other hand,
suppose we want to erase comments in a C program.
Many people would first try ’/%*.*%*/
’
(that is, a "/*"
followed by a sequence of any
characters followed by "*/"
,
written with the appropriate escapes).
However, because the ’.*
’ expands as far as it can,
the first "/*"
in the program would close only
with the last "*/"
:
test = "int x; /* x */ int y; /* y */" print((string.gsub(test, "/%*.*%*/", ""))) --> int x;
The pattern ’.-
’, instead,
will expand only as much as necessary to find the first "*/"
,
so that we get the desired result:
test = "int x; /* x */ int y; /* y */" print((string.gsub(test, "/%*.-%*/", ""))) --> int x; int y;
The last modifier, the question mark,
matches an optional character.
As an example, suppose we want to find an integer in a text,
where the number can contain an optional sign.
The pattern ’[+-]?%d+
’ does the job,
matching numerals like "-12"
, "23"
, and "+1009"
.
The character class ’[+-]
’ matches
either a plus or a minus sign;
the following ?
makes this sign optional.
Unlike some other systems, in Lua we can apply a modifier only to a character class; there is no way to group patterns under a modifier. For instance, there is no pattern that matches an optional word (unless the word has only one letter). Usually, we can circumvent this limitation using some of the advanced techniques that we will see in the end of this chapter.
If a pattern begins with a caret,
it will match only at the beginning of the subject string.
Similarly, if it ends with a dollar sign,
it will match only at the end of the subject string.
We can use these marks
both to restrict the matches that we find
and to anchor patterns.
For instance,
the next test checks whether the string s
starts with a digit:
if string.find(s, "^%d") then ...
The next one checks whether that string represents an integer number, without any other leading or trailing characters:
if string.find(s, "^[+-]?%d+$") then ...
The caret and dollar signs are magic only when used in the beginning or end of the pattern. Otherwise, they act as regular characters matching themselves.
Another item in a pattern is ’%b
’,
which matches balanced strings.
We write this item as ’%b
’,
where xy
x
and y
are any two distinct characters;
the x
acts as an opening character
and the y
as the closing one.
For instance, the pattern ’%b()
’ matches
parts of the string that start with a
left parenthesis and finish at the respective right one:
s = "a (enclosed (in) parentheses) line" print((string.gsub(s, "%b()", ""))) --> a line
Typically, we use this pattern as
’%b()
’, ’%b[]
’, ’%b{}
’, or ’%b<>
’,
but we can use any two distinct characters as delimiters.
Finally,
the item ’%f[
’ represents a frontier pattern.
It matches an empty string only if
the next character is in char-set
]char-set
but the previous one is not:
s = "the anthem is the theme" print((string.gsub(s, "%f[%w]the%f[%W]", "one"))) --> one anthem is one theme
The pattern ’%f[%w]
’ matches a frontier between
a non-alphanumeric and an alphanumeric character,
and
the pattern ’%f[%W]
’ matches a frontier between
an alphanumeric and a non-alphanumeric character.
Therefore, the given pattern matches the string
"the"
only as an entire word.
Note that we must write the char-set inside brackets,
even when it is a single class.
The frontier pattern treats the positions
before the first and after the last characters in the subject string
as if they had the null character (ASCII code zero).
In the previous example, the first "the"
starts with a frontier
between a null character, which is not in the set ’[%w]
’,
and a t
, which is.
The capture mechanism allows a pattern to yank parts of the subject string that match parts of the pattern for further use. We specify a capture by writing the parts of the pattern that we want to capture between parentheses.
When a pattern has captures,
the function string.match
returns each captured value
as a separate result;
in other words,
it breaks a string into its captured parts.
pair = "name = Anna" key, value = string.match(pair, "(%a+)%s*=%s*(%a+)") print(key, value) --> name Anna
The pattern ’%a+
’ specifies a non-empty sequence of letters;
the pattern ’%s*
’ specifies a possibly empty sequence of spaces.
So, in the example above,
the whole pattern specifies a sequence of letters,
followed by a sequence of spaces,
followed by an equals sign, again followed by spaces,
plus another sequence of letters.
Both sequences of letters have their
patterns enclosed in parentheses,
so that they will be captured if a match occurs.
Below is a similar example:
date = "Today is 17/7/1990" d, m, y = string.match(date, "(%d+)/(%d+)/(%d+)") print(d, m, y) --> 17 7 1990
In this example, we use three captures, one for each sequence of digits.
In a pattern,
an item like ’%
’, where n
n
is a single digit,
matches only a copy of the n
-th capture.
As a typical use, suppose we want to find, inside a string,
a substring enclosed between single or double quotes.
We could try a pattern such as ’["'].-["']
’,
that is, a quote followed by anything followed by another quote;
but we would have problems with strings like
"it's all right"
.
To solve this problem, we can capture the first quote
and use it to specify the second one:
s = [[then he said: "it's all right"!]] q, quotedPart = string.match(s, "([\"'])(.-)%1") print(quotedPart) --> it's all right print(q) --> "
The first capture is the quote character itself
and the second capture is the contents of the quote
(the substring matching the ’.-
’).
A similar example is this pattern, which matches long strings in Lua:
%[(=*)%[(.-)%]%1%]
It will match an opening square bracket followed by zero or more equals signs, followed by another opening square bracket, followed by anything (the string content), followed by a closing square bracket, followed by the same number of equals signs, followed by another closing square bracket:
p = "%[(=*)%[(.-)%]%1%]" s = "a = [=[[[ something ]] ]==] ]=]; print(a)" print(string.match(s, p)) --> = [[ something ]] ]==]
The first capture is the sequence of equals signs (only one sign in this example); the second is the string content.
The third use of captured values is in the
replacement string of gsub
.
Like the pattern,
the replacement string can also contain items like "%
,
which are changed to the respective captures when the
substitution is made.
In particular, the item n
""%0"
becomes the whole match.
(By the way, a percent sign in the replacement string
must be escaped as "%%"
.)
As an example,
the following command duplicates every letter in a string,
with a hyphen between the copies:
print((string.gsub("hello Lua!", "%a", "%0-%0"))) --> h-he-el-ll-lo-o L-Lu-ua-a!
This one interchanges adjacent characters:
print((string.gsub("hello Lua", "(.)(.)", "%2%1"))) --> ehll ouLa
As a more useful example, let us write a primitive format converter, which gets a string with commands written in a LaTeX style and changes them to a format in XML style:
\command{some text} --> <command>some text</command>
If we disallow nested commands,
the following call to string.gsub
does the job:
s = [[the \quote{task} is to \em{change} that.]] s = string.gsub(s, "\\(%a+){(.-)}", "<%1>%2</%1>") print(s) --> the <quote>task</quote> is to <em>change</em> that.
(In the next section, we will see how to handle nested commands.)
Another useful example is how to trim a string:
function trim (s) s = string.gsub(s, "^%s*(.-)%s*$", "%1") return s end
Note the judicious use of pattern modifiers.
The two anchors (^
and $
)
ensure that we get the whole string.
Because the ’.-
’ in the middle tries to expand as little as possible,
the two enclosing patterns ’%s*
’ match all spaces at both extremities.
As we have seen already,
we can use either a function or a table
as the third argument to string.gsub
,
instead of a string.
When invoked with a function,
string.gsub
calls the
function every time it finds a match;
the arguments to each call are the captures,
and the value that the function returns
becomes the replacement string.
When invoked with a table,
string.gsub
looks up the table
using the first capture as the key,
and the associated value is used as the replacement string.
If the result from the call or from the table lookup is nil,
gsub
does not change the match.
As a first example,
the following function does variable expansion:
it substitutes the value of the global variable varname
for every occurrence of $varname
in a string:
function expand (s) return (string.gsub(s, "$(%w+)", _G)) end name = "Lua"; status = "great" print(expand("$name is $status, isn't it?")) --> Lua is great, isn't it?
(As we will discuss in detail in Chapter 22, The Environment,
_G
is a predefined table containing all global variables.)
For each match with ’$(%w+)
’ (a dollar sign followed by a name),
gsub
looks up the captured name in the global table _G
;
the result replaces the match.
When the table does not have the key,
there is no replacement:
print(expand("$othername is $status, isn't it?")) --> $othername is great, isn't it?
If we are not sure whether the given variables have string values,
we may want to apply tostring
to their values.
In this case, we can use a function as the replacement value:
function expand (s) return (string.gsub(s, "$(%w+)", function (n) return tostring(_G[n]) end)) end print(expand("print = $print; a = $a")) --> print = function: 0x8050ce0; a = nil
Inside expand
, for each match with ’$(%w+)
’,
gsub
calls the given function with the captured name
as argument;
the returned string replaces the match.
The last example goes back to our format converter
from the previous section.
Again, we want to convert commands in
LaTeX style (\example{text}
)
to XML style (<example>text</example>
),
but allowing nested commands this time.
The following function uses recursion to do the job:
function toxml (s) s = string.gsub(s, "\\(%a+)(%b{})", function (tag, body) body = string.sub(body, 2, -2) -- remove the brackets body = toxml(body) -- handle nested commands return string.format("<%s>%s</%s>", tag, body, tag) end) return s end print(toxml("\\title{The \\bold{big} example}")) --> <title>The <bold>big</bold> example</title>
Our next example will use URL encoding,
which is the encoding used by HTTP to send parameters
embedded in a URL.
This encoding represents special characters
(such as =
, &
, and +
) as "%
,
where xx
"xx
is the character code in hexadecimal.
After that, it changes spaces to plus signs.
For instance,
it encodes the string "a+b = c"
as "a%2Bb+%3D+c"
.
Finally, it writes each parameter name and parameter
value with an equals sign in between
and appends all resulting pairs name = value
with an ampersand in between.
For instance, the values
name = "al"; query = "a+b = c"; q="yes or no"
are encoded as "name=al&query=a%2Bb+%3D+c&q=yes+or+no"
.
Now, suppose we want to decode this URL and store each value in a table, indexed by its corresponding name. The following function does the basic decoding:
function unescape (s) s = string.gsub(s, "+", " ") s = string.gsub(s, "%%(%x%x)", function (h) return string.char(tonumber(h, 16)) end) return s end print(unescape("a%2Bb+%3D+c")) --> a+b = c
The first gsub
changes each plus sign in the string to a space.
The second gsub
matches all two-digit hexadecimal numerals
preceded by a percent sign
and calls an anonymous function for each match.
This function converts the hexadecimal numeral into a number
(using tonumber
with base 16)
and returns the corresponding character (string.char
).
To decode the pairs name=value
, we use gmatch
.
Because neither names nor values
can contain either ampersands or equals signs,
we can match them with the pattern ’[^&=]+
’:
cgi = {} function decode (s) for name, value in string.gmatch(s, "([^&=]+)=([^&=]+)") do name = unescape(name) value = unescape(value) cgi[name] = value end end
The call to gmatch
matches all pairs
in the form name=value
.
For each pair,
the iterator returns the corresponding captures
(as marked by the parentheses in the matching string)
as the values for name
and value
.
The loop body simply applies unescape
to both strings
and stores the pair in the cgi
table.
The corresponding encoding is also easy to write.
First, we write the escape
function;
this function encodes all special characters as a percent sign
followed by the character code in hexadecimal
(the format
option "%02X"
makes
a hexadecimal number with two digits,
using 0 for padding),
and then changes spaces to plus signs:
function escape (s) s = string.gsub(s, "[&=+%%%c]", function (c) return string.format("%%%02X", string.byte(c)) end) s = string.gsub(s, " ", "+") return s end
The encode
function traverses the table to be encoded,
building the resulting string:
function encode (t) local b = {} for k,v in pairs(t) do b[#b + 1] = (escape(k) .. "=" .. escape(v)) end -- concatenates all entries in 'b', separated by "&" return table.concat(b, "&") end t = {name = "al", query = "a+b = c", q = "yes or no"} print(encode(t)) --> q=yes+or+no&query=a%2Bb+%3D+c&name=al
An empty capture like ’()
’ has a special meaning in Lua.
Instead of capturing nothing (a useless task),
this pattern captures its position in the subject string,
as a number:
print(string.match("hello", "()ll()")) --> 3 5
(Note that the result of this example is not the same as what
we get from string.find
,
because the position of the second empty capture is
after the match.)
A nice example of the use of position captures is for expanding tabs in a string:
function expandTabs (s, tab) tab = tab or 8 -- tab "size" (default is 8) local corr = 0 -- correction s = string.gsub(s, "()\t", function (p) local sp = tab - (p - 1 + corr)%tab corr = corr - 1 + sp return string.rep(" ", sp) end) return s end
The gsub
pattern matches all tabs in the string,
capturing their positions.
For each tab, the anonymous function uses this position to
compute the number of spaces needed to arrive at a column
that is a multiple of tab
:
it subtracts one from the position to make it relative to zero and
adds corr
to compensate for previous tabs.
(The expansion of each tab affects the position of the following ones.)
It then updates the correction for the next tab:
minus one for the tab being removed,
plus sp
for the spaces being added.
Finally, it returns a string with the appropriate number of spaces
to replace the tab.
Just for completeness, let us see how to reverse this operation, converting spaces to tabs. A first approach could also involve the use of empty captures to manipulate positions, but there is a simpler solution: at every eighth character, we insert a mark in the string. Then, wherever the mark is preceded by spaces, we replace the sequence spaces–mark by a tab:
function unexpandTabs (s, tab) tab = tab or 8 s = expandTabs(s, tab) local pat = string.rep(".", tab) s = string.gsub(s, pat, "%0\1") s = string.gsub(s, " +\1", "\t") s = string.gsub(s, "\1", "") return s end
The function starts by expanding the string
to remove any previous tabs.
Then it computes an
auxiliary pattern for matching all sequences of eight characters,
and uses this pattern to add a mark (the control character \1
)
after every eight characters.
It then substitutes a tab for all
sequences of one or more spaces followed by a mark.
Finally, it removes the marks left (those not preceded by spaces).
Pattern matching is a powerful tool for manipulating strings.
We can perform many complex operations with only a few
calls to string.gsub
.
However, as with any power, we must use it carefully.
Pattern matching is not a replacement for a proper parser.
For quick-and-dirty programs,
we can do useful manipulations on source code,
but it may be hard to build a product with quality.
As a good example, consider the pattern we used to match
comments in a C program:
’/%*.-%*/
’.
If the program has a literal string containing "/*"
,
we may get a wrong result:
test = [[char s[] = "a /* here"; /* a tricky string */]] print((string.gsub(test, "/%*.-%*/", "<COMMENT>"))) --> char s[] = "a <COMMENT>
Strings with such contents are rare. For our own use, that pattern will probably do its job, but we should not distribute a program with such a flaw.
Usually, pattern matching is efficient enough for Lua programs:
my new machine
takes less than 0.2 seconds to
count all words in a 4.4 MB text (850 K-words).[12]
But we can take precautions.
We should always make the pattern as specific as possible;
loose patterns are slower than specific ones.
An extreme example is ’(.-)%$
’,
to get all text in a string up to the first dollar sign.
If the subject string has a dollar sign,
everything goes fine,
but suppose that the string does not contain any dollar signs.
The algorithm will first try to match the pattern starting at the
first position of the string.
It will go through all the string, looking for a dollar.
When the string ends, the pattern fails
for the first position of the string.
Then, the algorithm will do the whole search again,
starting at the second position of the string,
only to discover that the pattern does not match there, too,
repeating the search for every position in the string.
This will take a quadratic time,
resulting in more than four minutes
in my new machine for a string of 200K characters.
We can correct this problem simply by anchoring the pattern at
the first position of the string, with ’^(.-)%$
’.
The anchor tells the algorithm to stop the search if it cannot find
a match at the first position.
With the anchor, the match runs in a hundredth of a second.
Beware also of empty patterns,
that is, patterns that match the empty string.
For instance, if we try to match names with a pattern like
’%a*
’, we will find names everywhere:
i, j = string.find(";$% **#$hello13", "%a*") print(i,j) --> 1 0
In this example,
the call to string.find
has correctly found an empty sequence of
letters at the beginning of the string.
It never makes sense to write a pattern that ends
with the minus modifier,
because it will match only the empty string.
This modifier always needs something after it
to anchor its expansion.
Similarly,
patterns that include ’.*
’ are tricky,
because this construction can expand much more than we intended.
Sometimes, it is useful to use Lua itself to build a pattern.
We already used this trick in our function to convert
spaces to tabs.
As another example,
let us see how we can find long lines in a text,
for instance lines with more than 70 characters.
A long line is a sequence of 70 or more characters
different from newline.
We can match a single character different from newline
with the character class ’[^\n]
’.
Therefore, we can match a long line with a pattern that repeats
70 times the pattern for one character,
finishing in a repetition for that pattern
(to match the rest of the line).
Instead of writing this pattern by hand,
we can create it with string.rep
:
pattern = string.rep("[^\n]", 70) .. "+"
As another example,
suppose we want to make a
case-insensitive search.
A way of doing this is to change any letter x
in the pattern
to the class ’[
’, that is,
a class including both the lower and the upper-case versions
of the original letter.
We can automate this conversion with a function:
xX
]
function nocase (s) s = string.gsub(s, "%a", function (c) return "[" .. string.lower(c) .. string.upper(c) .. "]" end) return s end print(nocase("Hi there!")) --> [hH][iI] [tT][hH][eE][rR][eE]!
Sometimes, we want to replace every plain occurrence
of s1
with s2
,
without regarding any character as magic.
If the strings s1
and s2
are literals,
we can add proper escapes to
magic characters while we write the strings.
If these strings are variable values,
we can use another gsub
to put the escapes for us:
s1 = string.gsub(s1, "(%W)", "%%%1") s2 = string.gsub(s2, "%%", "%%%%")
In the search string, we escape all non-alphanumeric characters
(thus the upper-case W
).
In the replacement string, we escape only the percent sign.
Another useful technique for pattern matching is to
preprocess the subject string before the real work.
Suppose we want to change to upper case all quoted strings in a text,
where a quoted string starts and
ends with a double quote ("
),
but may contain escaped quotes ("\""
):
follows a typical string: "This is \"great\"!".
One approach for handling such cases is to preprocess the text
to encode the problematic sequence as something else.
For instance, we could code "\""
as "\1"
.
However, if the original text already contains a "\1"
,
we are in trouble.
An easy way to do the encoding and avoid this problem is to
code all sequences "\
as x
""\
,
where ddd
"ddd
is the decimal representation
of the character x
:
function code (s) return (string.gsub(s, "\\(.)", function (x) return string.format("\\%03d", string.byte(x)) end)) end
Now any sequence "\
in the encoded string must have come from the coding,
because any ddd
""\
in the original string has been coded, too.
So, the decoding is an easy task:
ddd
"
function decode (s) return (string.gsub(s, "\\(%d%d%d)", function (d) return "\\" .. string.char(tonumber(d)) end)) end
Now we can complete our task.
As the encoded string does not contain any escaped quote ("\""
),
we can search for quoted strings simply with ’".-"
’:
s = [[follows a typical string: "This is \"great\"!".]] s = code(s) s = string.gsub(s, '".-"', string.upper) s = decode(s) print(s) --> follows a typical string: "THIS IS \"GREAT\"!".
We can also write it like here:
print(decode(string.gsub(code(s), '".-"', string.upper)))
The applicability of pattern-matching functions
to UTF-8 strings depends on the pattern.
Literal patterns work without problems,
due to the key property of UTF-8 that
the encoding of any character never appears
inside the encoding of any other character.
Character classes and character sets work only for ASCII characters.
For instance, the pattern ’%s
’ works on UTF-8 strings,
but it will match only the ASCII white spaces;
it will not match extra Unicode white spaces such as
a non-break space (U+00A0)
or a Mongolian vowel separator (U+180E).
Judicious patterns can bring
some extra power to Unicode handling.
A good example is the predefined pattern utf8.charpattern
,
which matches exactly one UTF-8 character.
The utf8
library defines this pattern as follows:
utf8.charpattern = [\0-\x7F\xC2-\xF4][\x80-\xBF]*
The first part is a class that matches either ASCII characters (range [0, 0x7F]) or initial bytes for multibyte sequences (range [0xC2, 0xF4]). The second part matches zero or more continuation bytes (range [0x80, 0xBF]).
Exercise 10.1:
Write a function split
that
receives a string and a delimiter pattern
and returns a sequence with the
chunks in the original string separated by the delimiter:
t = split("a whole new world", " ") -- t = {"a", "whole", "new", "world"}
How does your function handle empty strings? (In particular, is an empty string an empty sequence or a sequence with one empty string?)
Exercise 10.2:
The patterns ’%D
’ and ’[^%d]
’ are equivalent.
What about the patterns
’[^%d%u]
’ and ’[%D%U]
’?
Exercise 10.3:
Write a function transliterate.
This function receives a string and replaces each
character in that string with another character,
according to a table given as a second argument.
If the table maps a
to b
,
the function should replace any occurrence of a
with b
.
If the table maps a
to false,
the function should remove occurrences of a
from the resulting string.
Exercise 10.4:
At the end of the section called “Captures”,
we defined a trim
function.
Because of its use of backtracking,
this function can take a quadratic time for some
strings.
(For instance, in my new machine,
a match for a 100 KB string can take 52 seconds.)
Create a string that triggers this quadratic behavior
in function trim
.
Rewrite that function so that it always works in linear time.
Exercise 10.5:
Write a function to format a binary string as a literal in Lua,
using the escape sequence \x
for all bytes:
print(escape("\0\1hello\200")) --> \x00\x01\x68\x65\x6C\x6C\x6F\xC8
As an improved version,
use also the escape sequence \z
to break long lines.
Exercise 10.6: Rewrite the function transliterate for UTF-8 characters.
[12] “My new machine” is an Intel Core i7-4790 3.6 GHz, with 8 GB of RAM. I measured all performance data in this book on that machine.
Personal copy of Eric Taylor <jdslkgjf.iapgjflksfg@yandex.com>