Please explain the expression on your face
They Might Be Giants, “Unrelated Thing” (1994)
In this chapter, you will write a Rust version of grep
, which will find lines of input that match a given regular expression.1
By default the input comes from STDIN
, but you can provide the names of one or more files or directories if you use a recursive option to find all the files in those directories.
The normal output will be the lines that match the given pattern, but you can invert the match to find the lines that don’t match.
You can also instruct grep
to print the number of matching lines instead of the lines of text.
Pattern matching is normally case-sensitive, but you can use an option to perform case-insensitive matching.
While the original program can do more, the challenge program will go only this far.
In writing this program, you’ll learn about:
Using a case-sensitive regular expression
Variations of regular expression syntax
Another syntax to indicate a trait bound
Using Rust’s bitwise exclusive-OR operator
I’ll start by showing the manual page for the BSD grep
to give you a sense of the many options the command will accept:
GREP(1) BSD General Commands Manual GREP(1) NAME grep, egrep, fgrep, zgrep, zegrep, zfgrep -- file pattern searcher SYNOPSIS grep [-abcdDEFGHhIiJLlmnOopqRSsUVvwxZ] [-A num] [-B num] [-C[num]] [-e pattern] [-f file] [--binary-files=value] [--color[=when]] [--colour[=when]] [--context[=num]] [--label] [--line-buffered] [--null] [pattern] [file ...] DESCRIPTION The grep utility searches any given input files, selecting lines that match one or more patterns. By default, a pattern matches an input line if the regular expression (RE) in the pattern matches the input line without its trailing newline. An empty expression matches every line. Each input line that matches at least one of the patterns is written to the standard output. grep is used for simple patterns and basic regular expressions (BREs); egrep can handle extended regular expressions (EREs). See re_format(7) for more information on regular expressions. fgrep is quicker than both grep and egrep, but can only handle fixed patterns (i.e. it does not interpret regular expressions). Patterns may consist of one or more lines, allowing any of the pattern lines to match a portion of the input.
The GNU version is very similar:
GREP(1) General Commands Manual GREP(1) NAME grep, egrep, fgrep - print lines matching a pattern SYNOPSIS grep [OPTIONS] PATTERN [FILE...] grep [OPTIONS] [-e PATTERN | -f FILE] [FILE...] DESCRIPTION grep searches the named input FILEs (or standard input if no files are named, or if a single hyphen-minus (-) is given as file name) for lines containing a match to the given PATTERN. By default, grep prints the matching lines.
To demonstrate the features of grep
that the challenge program is expected to implement, I’ll use some files from the book’s GitHub repository.
If you want to follow along, change into the 09_grepr/tests/inputs directory:
$ cd 09_grepr/tests/inputs
Here are the files that I’ve included:
empty.txt: an empty file
fox.txt: a file with a single line of text
bustle.txt: a poem by Emily Dickinson with eight lines of text and one blank line
nobody.txt: another poem by the Belle of Amherst with eight lines of text and one blank line
To start, verify for yourself that grep fox empty.txt
will print nothing when using an empty file.
As shown by the usage, grep
accepts a regular expression as the first positional argument and optionally some input files for the rest.
Note that an empty regular expression will match all lines of input, and here I’ll use the input file fox.txt, which contains one line of text:
$ grep "" fox.txt The quick brown fox jumps over the lazy dog.
In the following Emily Dickinson poem, notice that Nobody is always capitalized:
$ cat nobody.txt I'm Nobody! Who are you? Are you—Nobody—too? Then there's a pair of us! Don't tell! they'd advertise—you know! How dreary—to be—Somebody! How public—like a Frog— To tell one's name—the livelong June— To an admiring Bog!
If I search for Nobody, the two lines containing the string are printed:
$ grep Nobody nobody.txt I'm Nobody! Who are you? Are you—Nobody—too?
If I search for lowercase nobody with grep nobody nobody.txt
, nothing is printed.
I can, however, use -i|--ignore-case
to find these lines:
$ grep -i nobody nobody.txt I'm Nobody! Who are you? Are you—Nobody—too?
I can use the -v|--invert-match
option to find the lines that don’t match the pattern:
$ grep -v Nobody nobody.txt Then there's a pair of us! Don't tell! they'd advertise—you know! How dreary—to be—Somebody! How public—like a Frog— To tell one's name—the livelong June— To an admiring Bog!
The -c|--count
option will cause the output to be a summary of the number of times a match occurs:
$ grep -c Nobody nobody.txt 2
I can combine -v
and -c
to count the lines not matching:
$ grep -vc Nobody nobody.txt 7
When searching multiple input files, each line of output includes the source filename:
$ grep The *.txt bustle.txt:The bustle in a house bustle.txt:The morning after death bustle.txt:The sweeping up the heart, fox.txt:The quick brown fox jumps over the lazy dog. nobody.txt:Then there's a pair of us!
The filename is also included for the counts:
$ grep -c The *.txt bustle.txt:3 empty.txt:0 fox.txt:1 nobody.txt:1
Normally, the positional arguments are files, and the inclusion of a directory such as my $HOME directory will cause grep
to print a warning:
$ grep The bustle.txt $HOME fox.txt bustle.txt:The bustle in a house bustle.txt:The morning after death bustle.txt:The sweeping up the heart, grep: /Users/kyclark: Is a directory fox.txt:The quick brown fox jumps over the lazy dog.
Directory names are acceptable only when using the -r|--recursive
option to find all the files in a directory that contain matching text.
In this command, I’ll use .
to indicate the current working directory:
$ grep -r The . ./nobody.txt:Then there's a pair of us! ./bustle.txt:The bustle in a house ./bustle.txt:The morning after death ./bustle.txt:The sweeping up the heart, ./fox.txt:The quick brown fox jumps over the lazy dog.
The -r
and -i
short flags can be combined to perform a recursive, case-insensitive search of one or more directories:
$ grep -ri the . ./nobody.txt:Then there's a pair of us! ./nobody.txt:Don't tell! they'd advertise—you know! ./nobody.txt:To tell one's name—the livelong June— ./bustle.txt:The bustle in a house ./bustle.txt:The morning after death ./bustle.txt:The sweeping up the heart, ./fox.txt:The quick brown fox jumps over the lazy dog.
Without any positional arguments for inputs, grep
will read STDIN
:
$ cat * | grep -i the The bustle in a house The morning after death The sweeping up the heart, The quick brown fox jumps over the lazy dog. Then there's a pair of us! Don't tell! they'd advertise—you know! To tell one's name—the livelong June—
The name of the challenge program should be grepr
(pronounced grep-er) for a Rust version of grep
.
Start with cargo new grepr
, then copy the book’s 09_grepr/tests directory into your new project.
Modify your Cargo.toml to include the following dependencies:
[dependencies]
clap
=
"2.33"
regex
=
"1"
walkdir
=
"2"
sys-info
=
"0.9"
[dev-dependencies]
assert_cmd
=
"2"
predicates
=
"2"
rand
=
"0.8"
You can run cargo test
to perform an initial build and run the tests, all of which should fail.
Update src/main.rs to the standard code used in previous programs:
fn
main
()
{
if
let
Err
(
e
)
=
grepr
::get_args
().
and_then
(
grepr
::run
)
{
eprintln
!
(
"{}"
,
e
);
std
::process
::exit
(
1
);
}
}
Following is how I started my src/lib.rs.
Note that all the Boolean options default to false
:
use
clap
:
:
{
App
,
Arg
}
;
use
regex
:
:
{
Regex
,
RegexBuilder
}
;
use
std
::
error
::
Error
;
type
MyResult
<
T
>
=
Result
<
T
,
Box
<
dyn
Error
>
>
;
#[
derive(Debug)
]
pub
struct
Config
{
pattern
:
Regex
,
files
:
Vec
<
String
>
,
recursive
:
bool
,
count
:
bool
,
invert_match
:
bool
,
}
The pattern
option is a compiled regular expression.
The files
option is a vector of strings.
The recursive
option is a Boolean for whether or not to recursively search directories.
The count
option is a Boolean for whether or not to display a count of the matches.
The invert_match
option is a Boolean for whether or not to find lines that do not match the pattern.
The program will have an insensitive
option, but you may notice that my Config
does not. Instead, I use regex::RegexBuilder
to create the regex using the case_insensitive
method.
Here is how I started my get_args
function.
You should fill in the missing parts:
pub
fn
get_args
()
->
MyResult
<
Config
>
{
let
matches
=
App
::new
(
"grepr"
)
.
version
(
"0.1.0"
)
.
author
(
"Ken Youens-Clark <kyclark@gmail.com>"
)
.
about
(
"Rust grep"
)
// What goes here?
.
get_matches
();
Ok
(
Config
{
pattern
:...
files
:...
recursive
:...
count
:...
invert_match
:...
})
}
Start your run
by printing the configuration:
pub
fn
run
(
config
:Config
)
->
MyResult
<
()
>
{
println
!
(
"{:#?}"
,
config
);
Ok
(())
}
Your next goal is to update your get_args
so that your program can produce the following usage:
$ cargo run -- -h grepr 0.1.0 Ken Youens-Clark <kyclark@gmail.com> Rust grep USAGE: grepr [FLAGS] <PATTERN> [FILE]... FLAGS: -c, --count Count occurrences -h, --help Prints help information -i, --insensitive Case-insensitive -v, --invert-match Invert match -r, --recursive Recursive search -V, --version Prints version information ARGS: <PATTERN> Search pattern<FILE>... Input file(s) [default: -]
The search pattern is a required argument.
The input files are optional and default to a dash for STDIN
.
Your program should be able to print a Config
like the following when provided a pattern and no input files:
$ cargo run -- dog Config { pattern: dog, files: [ "-", ], recursive: false, count: false, invert_match: false, }
Printing a regular expression means calling the Regex::as_str
method. RegexBuilder::build
notes that this “will produce the pattern given to new
verbatim. Notably, it will not incorporate any of the flags set on this builder.”
The program should be able to handle one or more input files and handle the flags:
$ cargo run -- dog -ricv tests/inputs/*.txt Config { pattern: dog, files: [ "tests/inputs/bustle.txt", "tests/inputs/empty.txt", "tests/inputs/fox.txt", "tests/inputs/nobody.txt", ], recursive: true, count: true, invert_match: true, }
Your program should reject an invalid regular expression, and you can reuse code from the findr
program in Chapter 7 to handle this.
For instance, *
signifies zero or more of the preceding pattern.
By itself, this is incomplete and should cause an error message:
$ cargo run -- \* Invalid pattern "*"
Stop reading here and write your get_args
to match the preceding description. Your program should also pass cargo test dies
.
Following is how I declared my arguments:
pub
fn
get_args
(
)
->
MyResult
<
Config
>
{
let
matches
=
App
::
new
(
"
grepr
"
)
.
version
(
"
0.1.0
"
)
.
author
(
"
Ken Youens-Clark <kyclark@gmail.com>
"
)
.
about
(
"
Rust grep
"
)
.
arg
(
Arg
::
with_name
(
"
pattern
"
)
.
value_name
(
"
PATTERN
"
)
.
help
(
"
Search pattern
"
)
.
required
(
true
)
,
)
.
arg
(
Arg
::
with_name
(
"
files
"
)
.
value_name
(
"
FILE
"
)
.
help
(
"
Input file(s)
"
)
.
multiple
(
true
)
.
default_value
(
"
-
"
)
,
)
.
arg
(
Arg
::
with_name
(
"
insensitive
"
)
.
short
(
"
i
"
)
.
long
(
"
insensitive
"
)
.
help
(
"
Case-insensitive
"
)
.
takes_value
(
false
)
,
)
.
arg
(
Arg
::
with_name
(
"
recursive
"
)
.
short
(
"
r
"
)
.
long
(
"
recursive
"
)
.
help
(
"
Recursive search
"
)
.
takes_value
(
false
)
,
)
.
arg
(
Arg
::
with_name
(
"
count
"
)
.
short
(
"
c
"
)
.
long
(
"
count
"
)
.
help
(
"
Count occurrences
"
)
.
takes_value
(
false
)
,
)
.
arg
(
Arg
::
with_name
(
"
invert
"
)
.
short
(
"
v
"
)
.
long
(
"
invert-match
"
)
.
help
(
"
Invert match
"
)
.
takes_value
(
false
)
,
)
.
get_matches
(
)
;
The rest of the positional arguments are for the inputs. The default is a dash.
The recursive
flag will handle searching for files in directories.
The invert
flag will search for lines not matching the pattern.
Here, the order in which you declare the positional parameters is important, as the first one defined will be for the first positional argument. You may define the optional arguments before or after the positional parameters.
Next, I used the arguments to create a regular expression that will incorporate the insensitive
option:
let
pattern
=
matches
.
value_of
(
"
pattern
"
)
.
unwrap
(
)
;
let
pattern
=
RegexBuilder
::
new
(
pattern
)
.
case_insensitive
(
matches
.
is_present
(
"
insensitive
"
)
)
.
build
(
)
.
map_err
(
|
_
|
format
!
(
"
Invalid pattern
\"
{}
\"
"
,
pattern
)
)
?
;
Ok
(
Config
{
pattern
,
files
:
matches
.
values_of_lossy
(
"
files
"
)
.
unwrap
(
)
,
recursive
:
matches
.
is_present
(
"
recursive
"
)
,
count
:
matches
.
is_present
(
"
count
"
)
,
invert_match
:
matches
.
is_present
(
"
invert
"
)
,
}
)
}
The pattern
is required, so it should be safe to unwrap the value.
The RegexBuilder::new
method will create a new regular expression.
The RegexBuilder::case_insensitive
method will cause the regex to disregard case in comparisons when the insensitive
flag is present.
The RegexBuilder::build
method will compile the regex.
If build
returns an error, use Result::map_err
to create an error message stating that the given pattern is invalid.
Return the Config
.
RegexBuilder::build
will reject any pattern that is not a valid regular expression, and this raises an interesting point.
There are many syntaxes for writing regular expressions.
If you look closely at the manual page for grep
, you’ll notice these options:
-E, --extended-regexp Interpret pattern as an extended regular expression (i.e. force grep to behave as egrep). -e pattern, --regexp=pattern Specify a pattern used during the search of the input: an input line is selected if it matches any of the specified patterns. This option is most useful when multiple -e options are used to specify multiple patterns, or when a pattern begins with a dash ('-').
The converse of these options is:
-G, --basic-regexp Interpret pattern as a basic regular expression (i.e. force grep to behave as traditional grep).
Regular expressions have been around since the 1950s, when they were invented by the American mathematician Stephen Cole Kleene.2
Since that time, the syntax has been modified and expanded by various groups, perhaps most notably by the Perl community, which created Perl Compatible Regular Expressions (PCRE).
By default, grep
will parse only basic regexes, but the preceding flags can allow it to use other varieties.
For instance, I can use the pattern ee
to search for any lines containing two adjacent es.
Note that I have added the bold style in the following output to help you see the pattern that was found:
$ grep 'ee' tests/inputs/* tests/inputs/bustle.txt:The sweeping up the heart,
If I want to find any character that is repeated twice, the pattern is (.)\1
, where the dot (.
) represents any character and the capturing parentheses allow me to use the backreference \1
to refer to the first capture group.
This is an example of an extended expression and so requires the -E
flag:
$ grep -E '(.)\1' tests/inputs/* tests/inputs/bustle.txt:The sweeping up the heart, tests/inputs/bustle.txt:And putting love away tests/inputs/bustle.txt:We shall not want to use again tests/inputs/nobody.txt:Are you—Nobody—too? tests/inputs/nobody.txt:Don't tell! they'd advertise—you know! tests/inputs/nobody.txt:To tell one's name—the livelong June—
The Rust regex
crate’s documentation notes that its “syntax is similar to Perl-style regular expressions, but lacks a few features like look around and backreferences.”
(Look-around assertions allow the expression to assert that a pattern must be followed or preceded by another pattern, and backreferences allow the pattern to refer to previously captured values.)
This means that the challenge program will work more like egrep
in handling extended regular expressions by default.
Sadly, this also means that the program will not be able to handle the preceding pattern because it requires backreferences.
It will still be a wicked cool program to write, though, so let’s keep going.
Next, I need to find all the files to search.
Recall that the user might provide directory names with the --recursive
option to search for all the files contained in each directory; otherwise, directory names should result in a warning printed to STDERR
.
I decided to write a function called find_files
that will accept a vector of strings that may be file or directory names along with a Boolean for whether or not to recurse into directories.
It returns a vector of MyResult
values that will hold a string that is the name of a valid file or an error message:
fn
find_files
(
paths
:&
[
String
],
recursive
:bool
)
->
Vec
<
MyResult
<
String
>>
{
unimplemented
!
();
}
To test this, I can add a tests
module to src/lib.rs.
Note that this will use the rand
crate that should be listed in the [dev-dependencies]
section of your Cargo.toml, as noted earlier in the chapter:
#[cfg(test)]
mod
tests
{
use
super
::find_files
;
use
rand
::{
distributions
::Alphanumeric
,
Rng
};
#[test]
fn
test_find_files
()
{
// Verify that the function finds a file known to exist
let
files
=
find_files
(
&
[
"./tests/inputs/fox.txt"
.
to_string
()],
false
);
assert_eq
!
(
files
.
len
(),
1
);
assert_eq
!
(
files
[
0
].
as_ref
().
unwrap
(),
"./tests/inputs/fox.txt"
);
// The function should reject a directory without the recursive option
let
files
=
find_files
(
&
[
"./tests/inputs"
.
to_string
()],
false
);
assert_eq
!
(
files
.
len
(),
1
);
if
let
Err
(
e
)
=
&
files
[
0
]
{
assert_eq
!
(
e
.
to_string
(),
"./tests/inputs is a directory"
);
}
// Verify the function recurses to find four files in the directory
let
res
=
find_files
(
&
[
"./tests/inputs"
.
to_string
()],
true
);
let
mut
files
:Vec
<
String
>
=
res
.
iter
()
.
map
(
|
r
|
r
.
as_ref
().
unwrap
().
replace
(
"
\\
"
,
"/"
))
.
collect
();
files
.
sort
();
assert_eq
!
(
files
.
len
(),
4
);
assert_eq
!
(
files
,
vec
!
[
"./tests/inputs/bustle.txt"
,
"./tests/inputs/empty.txt"
,
"./tests/inputs/fox.txt"
,
"./tests/inputs/nobody.txt"
,
]
);
// Generate a random string to represent a nonexistent file
let
bad
:String
=
rand
::thread_rng
()
.
sample_iter
(
&
Alphanumeric
)
.
take
(
7
)
.
map
(
char
::from
)
.
collect
();
// Verify that the function returns the bad file as an error
let
files
=
find_files
(
&
[
bad
],
false
);
assert_eq
!
(
files
.
len
(),
1
);
assert
!
(
files
[
0
].
is_err
());
}
}
Stop reading and write the code to pass cargo test test_find_files
.
Here is how I can use find_files
in my code:
pub
fn
run
(
config
:Config
)
->
MyResult
<
()
>
{
println
!
(
"pattern
\"
{}
\"
"
,
config
.
pattern
);
let
entries
=
find_files
(
&
config
.
files
,
config
.
recursive
);
for
entry
in
entries
{
match
entry
{
Err
(
e
)
=>
eprintln
!
(
"{}"
,
e
),
Ok
(
filename
)
=>
println
!
(
"file
\"
{}
\"
"
,
filename
),
}
}
Ok
(())
}
My solution uses WalkDir
, which I introduced in Chapter 7.
See if you can get your program to reproduce the following output.
To start, the default input should be a dash (-
), to represent reading from STDIN
:
$ cargo run -- fox pattern "fox" file "-"
Explicitly listing a dash as the input should produce the same output:
$ cargo run -- fox - pattern "fox" file "-"
The program should handle multiple input files:
$ cargo run -- fox tests/inputs/* pattern "fox" file "tests/inputs/bustle.txt" file "tests/inputs/empty.txt" file "tests/inputs/fox.txt" file "tests/inputs/nobody.txt"
A directory name without the --recursive
option should be rejected:
$ cargo run -- fox tests/inputs pattern "fox" tests/inputs is a directory
With the --recursive
flag, it should find the directory’s files:
$ cargo run -- -r fox tests/inputs pattern "fox" file "tests/inputs/empty.txt" file "tests/inputs/nobody.txt" file "tests/inputs/bustle.txt" file "tests/inputs/fox.txt"
Invalid file arguments should be printed to STDERR
in the course of handling each entry.
In the following example, blargh represents a nonexistent file:
$ cargo run -- -r fox blargh tests/inputs/fox.txt pattern "fox" blargh: No such file or directory (os error 2) file "tests/inputs/fox.txt"
Now it’s time for your program to open the files and search for matching lines.
I suggest you again use the open
function from earlier chapters, which will open and read either an existing file or STDIN
for a filename that equals a dash (-
):
fn
open
(
filename
:&
str
)
->
MyResult
<
Box
<
dyn
BufRead
>>
{
match
filename
{
"-"
=>
Ok
(
Box
::new
(
BufReader
::new
(
io
::stdin
()))),
_
=>
Ok
(
Box
::new
(
BufReader
::new
(
File
::open
(
filename
)
?
))),
}
}
This will require you to expand your program’s imports with the following:
use
std
::{
error
::Error
,
fs
::{
self
,
File
},
io
::{
self
,
BufRead
,
BufReader
},
};
When reading the lines, be sure to preserve the line endings as one of the input files contains Windows-style CRLF endings.
My solution uses a function called find_lines
, which you can start with the following:
fn
find_lines
<
T
:
BufRead
>
(
mut
file
:
T
,
pattern
:
&
Regex
,
invert_match
:
bool
,
)
->
MyResult
<
Vec
<
String
>
>
{
unimplemented
!
(
)
;
}
The file
option must implement the std::io::BufRead
trait.
The pattern
option is a reference to a compiled regular expression.
The invert_match
option is a Boolean for whether to reverse the match
operation.
In the wcr
program from Chapter 5, I used impl BufRead
to indicate a value that must implement the BufRead
trait. In the preceding code, I’m using <T: BufRead>
to indicate the trait bound for the type T
. They both accomplish the same thing, but I wanted to show another common way to write this.
To test this function, I expanded my tests
module by adding the following test_find_lines
function, which again uses std::io::Cursor
to create a fake filehandle that implements BufRead
for testing:
#[cfg(test)]
mod
test
{
use
super
::{
find_files
,
find_lines
};
use
rand
::{
distributions
::Alphanumeric
,
Rng
};
use
regex
::{
Regex
,
RegexBuilder
};
use
std
::io
::Cursor
;
#[test]
fn
test_find_files
()
{}
// Same as before
#[test]
fn
test_find_lines
()
{
let
text
=
b"Lorem
\n
Ipsum
\r\n
DOLOR"
;
// The pattern _or_ should match the one line, "Lorem"
let
re1
=
Regex
::new
(
"or"
).
unwrap
();
let
matches
=
find_lines
(
Cursor
::new
(
&
text
),
&
re1
,
false
);
assert
!
(
matches
.
is_ok
());
assert_eq
!
(
matches
.
unwrap
().
len
(),
1
);
// When inverted, the function should match the other two lines
let
matches
=
find_lines
(
Cursor
::new
(
&
text
),
&
re1
,
true
);
assert
!
(
matches
.
is_ok
());
assert_eq
!
(
matches
.
unwrap
().
len
(),
2
);
// This regex will be case-insensitive
let
re2
=
RegexBuilder
::new
(
"or"
)
.
case_insensitive
(
true
)
.
build
()
.
unwrap
();
// The two lines "Lorem" and "DOLOR" should match
let
matches
=
find_lines
(
Cursor
::new
(
&
text
),
&
re2
,
false
);
assert
!
(
matches
.
is_ok
());
assert_eq
!
(
matches
.
unwrap
().
len
(),
2
);
// When inverted, the one remaining line should match
let
matches
=
find_lines
(
Cursor
::new
(
&
text
),
&
re2
,
true
);
assert
!
(
matches
.
is_ok
());
assert_eq
!
(
matches
.
unwrap
().
len
(),
1
);
}
}
Stop reading and write the function that will pass cargo test test_find_lines
.
Next, I suggest you incorporate these ideas into your run
:
pub
fn
run
(
config
:
Config
)
->
MyResult
<
(
)
>
{
let
entries
=
find_files
(
&
config
.
files
,
config
.
recursive
)
;
for
entry
in
entries
{
match
entry
{
Err
(
e
)
=
>
eprintln
!
(
"
{}
"
,
e
)
,
Ok
(
filename
)
=
>
match
open
(
&
filename
)
{
Err
(
e
)
=
>
eprintln
!
(
"
{}: {}
"
,
filename
,
e
)
,
Ok
(
file
)
=
>
{
let
matches
=
find_lines
(
file
,
&
config
.
pattern
,
config
.
invert_match
,
)
;
println
!
(
"
Found {:?}
"
,
matches
)
;
}
}
,
}
}
Ok
(
(
)
)
}
Look for the input files.
Handle the errors from finding input files.
Try to open a valid filename.
Handle errors opening a file.
Use the open filehandle to find the lines matching (or not matching) the regex.
At this point, the program should show the following output:
$ cargo run -- -r fox tests/inputs/* Found Ok([]) Found Ok([]) Found Ok(["The quick brown fox jumps over the lazy dog.\n"]) Found Ok([])
Modify this version to meet the criteria for the program. Start as simply as possible, perhaps by using an empty regular expression that should match all the lines from the input:
$ cargo run -- "" tests/inputs/fox.txt The quick brown fox jumps over the lazy dog.
Be sure you are reading STDIN
by default:
$ cargo run -- "" < tests/inputs/fox.txt The quick brown fox jumps over the lazy dog.
Run with several input files and a case-sensitive pattern:
$ cargo run -- The tests/inputs/* tests/inputs/bustle.txt:The bustle in a house tests/inputs/bustle.txt:The morning after death tests/inputs/bustle.txt:The sweeping up the heart, tests/inputs/fox.txt:The quick brown fox jumps over the lazy dog. tests/inputs/nobody.txt:Then there's a pair of us!
Then try to print the number of matches instead of the lines:
$ cargo run -- --count The tests/inputs/* tests/inputs/bustle.txt:3 tests/inputs/empty.txt:0 tests/inputs/fox.txt:1 tests/inputs/nobody.txt:1
Incorporate the --insensitive
option:
$ cargo run -- --count --insensitive The tests/inputs/* tests/inputs/bustle.txt:3 tests/inputs/empty.txt:0 tests/inputs/fox.txt:1 tests/inputs/nobody.txt:3
Next, try to invert the matching:
$ cargo run -- --count --invert-match The tests/inputs/* tests/inputs/bustle.txt:6 tests/inputs/empty.txt:0 tests/inputs/fox.txt:0 tests/inputs/nobody.txt:8
Be sure your --recursive
option works:
$ cargo run -- -icr the tests/inputs tests/inputs/empty.txt:0 tests/inputs/nobody.txt:3 tests/inputs/bustle.txt:3 tests/inputs/fox.txt:1
Handle errors such as the nonexistent file blargh while processing the files in order:
$ cargo run -- fox blargh tests/inputs/fox.txt blargh: No such file or directory (os error 2) tests/inputs/fox.txt:The quick brown fox jumps over the lazy dog.
Another potential problem you should gracefully handle is failure to open a file, perhaps due to insufficient permissions:
$ touch hammer && chmod 000 hammer $ cargo run -- fox hammer tests/inputs/fox.txt hammer: Permission denied (os error 13) tests/inputs/fox.txt:The quick brown fox jumps over the lazy dog.
It’s go time. These challenges are getting harder, so it’s OK to feel a bit overwhelmed by the requirements. Tackle each task in order, and keep running cargo test
to see how many you’re able to pass. When you get stuck, run grep
with the arguments from the test and closely examine the output. Then run your program with the same arguments and try to find the differences.
I will always stress that your solution can be written however you like as long as it passes the provided test suite.
In the following find_files
function, I choose to use the imperative approach of manually pushing to a vector rather than collecting from an iterator.
The function will either collect a single error for a bad path or flatten the iterable WalkDir
to recursively get the files.
Be sure you add use std::fs
and use walkdir::WalkDir
for this code:
fn
find_files
(
paths
:
&
[
String
]
,
recursive
:
bool
)
->
Vec
<
MyResult
<
String
>
>
{
let
mut
results
=
vec
!
[
]
;
for
path
in
paths
{
match
path
.
as_str
(
)
{
"
-
"
=
>
results
.
push
(
Ok
(
path
.
to_string
(
)
)
)
,
_
=
>
match
fs
::
metadata
(
path
)
{
Ok
(
metadata
)
=
>
{
if
metadata
.
is_dir
(
)
{
if
recursive
{
for
entry
in
WalkDir
::
new
(
path
)
.
into_iter
(
)
.
flatten
(
)
.
filter
(
|
e
|
e
.
file_type
(
)
.
is_file
(
)
)
{
results
.
push
(
Ok
(
entry
.
path
(
)
.
display
(
)
.
to_string
(
)
)
)
;
}
}
else
{
results
.
push
(
Err
(
From
::
from
(
format
!
(
"
{} is a directory
"
,
path
)
)
)
)
;
}
}
else
if
metadata
.
is_file
(
)
{
results
.
push
(
Ok
(
path
.
to_string
(
)
)
)
;
}
}
Err
(
e
)
=
>
{
results
.
push
(
Err
(
From
::
from
(
format
!
(
"
{}: {}
"
,
path
,
e
)
)
)
)
}
}
,
}
}
results
}
Initialize an empty vector to hold the results
.
Iterate over each of the given paths.
First, accept a dash (-
) as a path, for STDIN
.
Try to get the path’s metadata.
Check if the path is a directory.
Check if the user wants to recursively search directories.
Add all the files in the given directory to the results
.
Iterator::flatten
will take the Ok
or Some
variants for Result
and Option
types and will ignore the Err
and None
variants, meaning it will ignore any errors with files found by recursing through directories.
Note an error that the given entry is a directory.
If the path is a file, add it to the results
.
This arm will be triggered by nonexistent files.
Next, I will share my find_lines
function.
The following code requires that you add use std::mem
to your imports.
This borrows heavily from previous functions that read files line by line, so I won’t comment on code I’ve used before:
fn
find_lines
<
T
:
BufRead
>
(
mut
file
:
T
,
pattern
:
&
Regex
,
invert_match
:
bool
,
)
->
MyResult
<
Vec
<
String
>
>
{
let
mut
matches
=
vec
!
[
]
;
let
mut
line
=
String
::
new
(
)
;
loop
{
let
bytes
=
file
.
read_line
(
&
mut
line
)
?
;
if
bytes
=
=
0
{
break
;
}
if
pattern
.
is_match
(
&
line
)
^
invert_match
{
matches
.
push
(
mem
::
take
(
&
mut
line
)
)
;
}
line
.
clear
(
)
;
}
Ok
(
matches
)
}
Initialize a mutable vector to hold the matching lines.
Use the BitXor
bit-wise exclusive OR operator (^
) to determine if the line should be included.
Use std::mem::take
to take ownership of the line. I could have used clone
to copy the string and add it to the matches
, but take
avoids an unnecessary copy.
In the preceding function, the bitwise XOR comparison (^
) could also be expressed using a combination of the logical AND (&&
) and OR operators (||
) like so:
if
(
pattern
.
is_match
(
&
line
)
&
&
!
invert_match
)
|
|
(
!
pattern
.
is_match
(
&
line
)
&
&
invert_match
)
{
matches
.
push
(
line
.
clone
(
)
)
;
}
Verify that the line matches and the user does not want to invert the match.
Alternatively, check if the line does not match and the user wants to invert the match.
At the beginning of the run
function, I decided to create a closure to handle the printing of the output with or without the filenames given the number of input files:
pub
fn
run
(
config
:
Config
)
->
MyResult
<
(
)
>
{
let
entries
=
find_files
(
&
config
.
files
,
config
.
recursive
)
;
let
num_files
=
entries
.
len
(
)
;
let
=
|
fname
:
&
str
,
val
:
&
str
|
{
if
num_files
>
1
{
!
(
"
{}:{}
"
,
fname
,
val
)
;
}
else
{
!
(
"
{}
"
,
val
)
;
}
}
;
Find all the inputs.
Find the number of inputs.
Create a print
closure that uses the number of inputs to decide whether to print the filenames in the output.
Continuing from there, the program attempts to find the matching lines from the entries:
for
entry
in
entries
{
match
entry
{
Err
(
e
)
=
>
eprintln
!
(
"
{}
"
,
e
)
,
Ok
(
filename
)
=
>
match
open
(
&
filename
)
{
Err
(
e
)
=
>
eprintln
!
(
"
{}: {}
"
,
filename
,
e
)
,
Ok
(
file
)
=
>
{
match
find_lines
(
file
,
&
config
.
pattern
,
config
.
invert_match
,
)
{
Err
(
e
)
=
>
eprintln
!
(
"
{}
"
,
e
)
,
Ok
(
matches
)
=
>
{
if
config
.
count
{
(
&
filename
,
&
format
!
(
"
{}
\n
"
,
matches
.
len
(
)
)
,
)
;
}
else
{
for
line
in
&
matches
{
(
&
filename
,
line
)
;
}
}
}
}
}
}
,
}
}
Ok
(
(
)
)
}
Print errors like nonexistent files to STDERR
.
Attempt to open a file. This might fail due to permissions.
Attempt to find the matching lines of text.
Print errors to STDERR
.
Decide whether to print the number of matches or the matches themselves.
At this point, the program should pass all the tests.
The Rust ripgrep
tool implements many of the features of grep
and is worthy of your study.
You can install the program using the instructions provided and then execute rg
.
As shown in Figure 9-1, the matching text is highlighted in the output.
Try to add that feature to your program using Regex::find
to find the start and stop positions of the matching pattern and something like termcolor
to highlight the matches.
ripgrep
tool will highlight the matching text.The author of ripgrep
wrote an extensive blog post about design decisions that went into writing the program.
In the section “Repeat After Me: Thou Shalt Not Search Line by Line,” the author discusses the performance hit of searching over lines of text, the majority of which will not match.
This chapter challenged you to extend skills you learned in Chapter 7, such as recursively finding files in directories and using regular expressions. In this chapter, you combined those skills to find content inside files matching (or not matching) a given regex. In addition, you learned the following:
How to use RegexBuilder
to create more complicated regular expressions using, for instance, the case-insensitive option to match strings regardless of case.
There are multiple syntaxes for writing regular expressions that different tools recognize, such as PCRE. Rust’s regex
engine does not implement some features of PCRE, such as look-around assertions or backreferences.
You can indicate a trait bound like BufRead
in function signatures using either impl BufRead
or <T: BufRead>
.
Rust’s bitwise XOR operator can replace more complex logical operations that combine AND and OR comparisons.
In the next chapter, you’ll learn more about iterating the lines of a file, how to compare strings, and how to create a more complicated enum
type.
1 The name grep
comes from the ed
command g/re/p
, which means “global regular expression print,” where ed
is the standard text editor.
2 If you would like to learn more about regexes, I recommend Mastering Regular Expressions, 3rd ed., by Jeffrey E. F. Friedl (O’Reilly).