I’m a mess / Since you cut me out / But Chucky’s arm keeps me company
They Might Be Giants, “Cyclops Rock” (2001)
For the next challenge program, you will create a Rust version of cut, which will excise text from a file or STDIN.
The selected text could be a range of bytes or characters, or it might be fields denoted by a delimiter, such as a comma or tab, that creates the field boundaries.
You learned how to select a contiguous range of characters or bytes in Chapter 4 while working on the headr program, but this challenge goes further, as the selections may be noncontiguous and in any order. For example, the selection 3,1,5-7 should cause the challenge program to print the third, first, and fifth through seventh bytes, characters, or fields, in that order.
The challenge program will capture the spirit of the original but will not strive for complete fidelity, as I will suggest a few changes that I feel are improvements.
In this chapter, you will learn how to do the following:

- Read and write a delimited text file using the csv crate
- Dereference a value using *
- Use Iterator::flatten to remove nested structures from iterators
- Use Iterator::flat_map to combine Iterator::map and Iterator::flatten
I will start by reviewing the portion of the BSD cut manual page that describes the features of the program you will write:
CUT(1)                    BSD General Commands Manual                   CUT(1)

NAME
     cut -- cut out selected portions of each line of a file

SYNOPSIS
     cut -b list [-n] [file ...]
     cut -c list [file ...]
     cut -f list [-d delim] [-s] [file ...]

DESCRIPTION
     The cut utility cuts out selected portions of each line (as specified by
     list) from each file and writes them to the standard output. If no file
     arguments are specified, or a file argument is a single dash ('-'), cut
     reads from the standard input. The items specified by list can be in
     terms of column position or in terms of fields delimited by a special
     character. Column numbering starts from 1.

     The list option argument is a comma or whitespace separated set of
     numbers and/or number ranges. Number ranges consist of a number, a dash
     ('-'), and a second number and select the fields or columns from the
     first number to the second, inclusive. Numbers or number ranges may be
     preceded by a dash, which selects all fields or columns from 1 to the
     last number. Numbers or number ranges may be followed by a dash, which
     selects all fields or columns from the last number to the end of the
     line. Numbers and number ranges may be repeated, overlapping, and in
     any order. If a field or column is specified multiple times, it will
     appear only once in the output. It is not an error to select fields or
     columns not present in the input line.
The original tool offers quite a few options, but the challenge program will implement only the following:
-b list   The list specifies byte positions.

-c list   The list specifies character positions.

-d delim  Use delim as the field delimiter character instead of the tab
          character.

-f list   The list specifies fields, separated in the input by the field
          delimiter character (see the -d option.) Output fields are
          separated by a single occurrence of the field delimiter character.
As usual, the GNU version offers both short and long flags for these options:
NAME
       cut - remove sections from each line of files

SYNOPSIS
       cut OPTION... [FILE]...

DESCRIPTION
       Print selected parts of lines from each FILE to standard output.

       Mandatory arguments to long options are mandatory for short options
       too.

       -b, --bytes=LIST
              select only these bytes

       -c, --characters=LIST
              select only these characters

       -d, --delimiter=DELIM
              use DELIM instead of TAB for field delimiter

       -f, --fields=LIST
              select only these fields; also print any line that contains
              no delimiter character, unless the -s option is specified
Both tools implement the selection ranges in similar ways: numbers can be selected individually, in closed ranges like 1-3, or in partially defined ranges like -3 to indicate 1 through 3 or 5- to indicate 5 to the end. The challenge program, however, will support only closed ranges.
I’ll use some of the files found in the book’s 08_cutr/tests/inputs directory to show the features that the challenge program will implement.
You should change into this directory if you want to execute the following commands:
$ cd 08_cutr/tests/inputs
First, consider a file of fixed-width text where each column occupies a fixed number of characters:
$ cat books.txt
Author              Year Title
Émile Zola          1865 La Confession de Claude
Samuel Beckett      1952 Waiting for Godot
Jules Verne         1870 20,000 Leagues Under the Sea
The Author column takes the first 20 characters:
$ cut -c 1-20 books.txt
Author
Émile Zola
Samuel Beckett
Jules Verne
The publication Year column spans the next five characters:
$ cut -c 21-25 books.txt
Year
1865
1952
1870
The Title column fills the remainder of the line, where the longest title is 28 characters. Note here that I intentionally request a larger range than exists to show that this is not considered an error:
$ cut -c 26-70 books.txt
Title
La Confession de Claude
Waiting for Godot
20,000 Leagues Under the Sea
The program does not allow me to rearrange the output by requesting the range 26-55 for the Title followed by the range 1-20 for the Author. Instead, the selections are placed in their original, ascending order:
$ cut -c 26-55,1-20 books.txt
Author              Title
Émile Zola          La Confession de Claude
Samuel Beckett      Waiting for Godot
Jules Verne         20,000 Leagues Under the Sea
I can use the option -c 1 to select the first character, like so:
$ cut -c 1 books.txt
A
É
S
J
As you’ve seen in previous chapters, bytes and characters are not always interchangeable. For instance, the É in Émile Zola is a Unicode character that is composed of two bytes, so asking for just one byte will result in invalid UTF-8 that is represented with the Unicode replacement character:
$ cut -b 1 books.txt
A
�
S
J
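Here is a small standalone Rust sketch (not part of the challenge code) demonstrating the mismatch between characters and bytes:

fn main() {
    let name = "Émile Zola";
    // Ten characters, but eleven bytes: the É occupies two bytes in UTF-8.
    assert_eq!(name.chars().count(), 10);
    assert_eq!(name.len(), 11);
}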
In my experience, fixed-width datafiles are less common than those where the columns of data are delimited by a character such as a comma or a tab. Consider the same data in the file books.tsv, where the file extension .tsv stands for tab-separated values (TSV) and the columns are delimited by the tab:
$ cat books.tsv
Author          Year    Title
Émile Zola      1865    La Confession de Claude
Samuel Beckett  1952    Waiting for Godot
Jules Verne     1870    20,000 Leagues Under the Sea
By default, cut assumes the tab character is the field delimiter, so I can use the -f option to select, for instance, the publication year in the second column and the title in the third column, like so:
$ cut -f 2,3 books.tsv
Year    Title
1865    La Confession de Claude
1952    Waiting for Godot
1870    20,000 Leagues Under the Sea
The comma is another common delimiter, and such files often have the extension .csv for comma-separated values (CSV). Following is the same data as a CSV file:
$ cat books.csv
Author,Year,Title
Émile Zola,1865,La Confession de Claude
Samuel Beckett,1952,Waiting for Godot
Jules Verne,1870,"20,000 Leagues Under the Sea"
To parse a CSV file, I must indicate the delimiter with the -d option. Note that I'm still unable to reorder the fields in the output: I indicate 2,1 for the second column followed by the first, but the columns come back in their original order:
$ cut -d , -f 2,1 books.csv
Author,Year
Émile Zola,1865
Samuel Beckett,1952
Jules Verne,1870
You may have noticed that the third title contains a comma in 20,000 and so the title has been enclosed in quotes to indicate that this comma is not a field delimiter.
This is a way to escape the delimiter, or to tell the parser to ignore it.
Unfortunately, neither the BSD nor the GNU version of cut recognizes this, and so both truncate the title prematurely:
$ cut -d , -f 1,3 books.csv
Author,Title
Émile Zola,La Confession de Claude
Samuel Beckett,Waiting for Godot
Jules Verne,"20
Noninteger values for any of the list option values are rejected:
$ cut -f foo,bar books.tsv
cut: [-cf] list: illegal list value
Any error opening a file is handled in the course of processing, with a message printed to STDERR. In the following example, blargh represents a nonexistent file:
$ cut -c 1 books.txt blargh movies1.csv
A
É
S
J
cut: blargh: No such file or directory
t
T
L
Finally, the program will read STDIN by default or if the given input filename is a dash (-):
$ cat books.tsv | cut -f 2
Year
1865
1952
1870
The challenge program is expected to implement just this much, with the following changes:
The name of the challenge program should be cutr (pronounced cut-er) for a Rust version of cut.
I recommend you begin with cargo new cutr and then copy the 08_cutr/tests directory into your project.
My solution will use the following crates, which you should add to your Cargo.toml:
[dependencies]
clap = "2.33"
csv = "1"
regex = "1"

[dev-dependencies]
assert_cmd = "2"
predicates = "2"
rand = "0.8"
The csv crate will be used to parse delimited files such as CSV files. Run cargo test to download the dependencies and run the tests, all of which should fail.
Use the following structure for your src/main.rs:
fn main() {
    if let Err(e) = cutr::get_args().and_then(cutr::run) {
        eprintln!("{}", e);
        std::process::exit(1);
    }
}
In the following code, I want to highlight that I'm creating an enum whose variants can hold a value. In this case, the type alias PositionList, which is a Vec<Range<usize>> or a vector of std::ops::Range structs, will represent spans of positive integer values. Here is how I started my src/lib.rs:
use crate::Extract::*;
use clap::{App, Arg};
use std::{error::Error, ops::Range};

type MyResult<T> = Result<T, Box<dyn Error>>;
type PositionList = Vec<Range<usize>>;

#[derive(Debug)]
pub enum Extract {
    Fields(PositionList),
    Bytes(PositionList),
    Chars(PositionList),
}

#[derive(Debug)]
pub struct Config {
    files: Vec<String>,
    delimiter: u8,
    extract: Extract,
}
- The use crate::Extract::* import allows me to use Fields(...) instead of Extract::Fields(...).
- A PositionList is a vector of Range<usize> values.
- The Extract enum holds the variants for extracting fields, bytes, or characters.
- The files parameter will be a vector of strings.
- The delimiter should be a single byte.
- The extract field will hold one of the Extract variants.
Unlike the original cut tool, the challenge program will allow only a comma-separated list of either single numbers or ranges like 2-4. Also, the challenge program will use the selections in the given order rather than rearranging them into ascending order.
You can start your get_args by expanding on the following skeleton:
pub fn get_args() -> MyResult<Config> {
    let matches = App::new("cutr")
        .version("0.1.0")
        .author("Ken Youens-Clark <kyclark@gmail.com>")
        .about("Rust cut")
        // What goes here?
        .get_matches();

    Ok(Config {
        files: ...
        delimiter: ...
        extract: ...
    })
}
Begin your run by printing the config:
pub fn run(config: Config) -> MyResult<()> {
    println!("{:#?}", &config);
    Ok(())
}
Following is the expected usage for the program:
$ cargo run -- --help
cutr 0.1.0
Ken Youens-Clark <kyclark@gmail.com>
Rust cut

USAGE:
    cutr [OPTIONS] [FILE]...

FLAGS:
    -h, --help       Prints help information
    -V, --version    Prints version information

OPTIONS:
    -b, --bytes <BYTES>        Selected bytes
    -c, --chars <CHARS>        Selected characters
    -d, --delim <DELIMITER>    Field delimiter [default: 	]
    -f, --fields <FIELDS>      Selected fields

ARGS:
    <FILE>...    Input file(s) [default: -]
To parse and validate the range values for the byte, character, and field arguments, I wrote a function called parse_pos that accepts a &str and might return a PositionList. Here is how you might start it:
fn parse_pos(range: &str) -> MyResult<PositionList> {
    unimplemented!();
}
This function is similar to the parse_positive_int function from Chapter 4. See how much of that code can be reused here. To help you along, I have written an extensive unit test for the numbers and number ranges that should be accepted or rejected. The numbers may have leading zeros but may not contain any nonnumeric characters, number ranges must be denoted with a dash (-), and multiple numbers and ranges can be separated with commas.
In this chapter, I will create a unit_tests module so that cargo test unit will run all the unit tests. Note that my implementation of parse_pos uses index positions, subtracting one from each value for zero-based indexing, but you may prefer to handle this differently. Add the following to your src/lib.rs:
#[cfg(test)]
mod unit_tests {
    use super::parse_pos;

    #[test]
    fn test_parse_pos() {
        // The empty string is an error
        assert!(parse_pos("").is_err());

        // Zero is an error
        let res = parse_pos("0");
        assert!(res.is_err());
        assert_eq!(res.unwrap_err().to_string(), "illegal list value: \"0\"");

        let res = parse_pos("0-1");
        assert!(res.is_err());
        assert_eq!(res.unwrap_err().to_string(), "illegal list value: \"0\"");

        // A leading "+" is an error
        let res = parse_pos("+1");
        assert!(res.is_err());
        assert_eq!(res.unwrap_err().to_string(), "illegal list value: \"+1\"");

        let res = parse_pos("+1-2");
        assert!(res.is_err());
        assert_eq!(res.unwrap_err().to_string(), "illegal list value: \"+1-2\"");

        let res = parse_pos("1-+2");
        assert!(res.is_err());
        assert_eq!(res.unwrap_err().to_string(), "illegal list value: \"1-+2\"");

        // Any non-number is an error
        let res = parse_pos("a");
        assert!(res.is_err());
        assert_eq!(res.unwrap_err().to_string(), "illegal list value: \"a\"");

        let res = parse_pos("1,a");
        assert!(res.is_err());
        assert_eq!(res.unwrap_err().to_string(), "illegal list value: \"a\"");

        let res = parse_pos("1-a");
        assert!(res.is_err());
        assert_eq!(res.unwrap_err().to_string(), "illegal list value: \"1-a\"");

        let res = parse_pos("a-1");
        assert!(res.is_err());
        assert_eq!(res.unwrap_err().to_string(), "illegal list value: \"a-1\"");

        // Wonky ranges
        let res = parse_pos("-");
        assert!(res.is_err());

        let res = parse_pos(",");
        assert!(res.is_err());

        let res = parse_pos("1,");
        assert!(res.is_err());

        let res = parse_pos("1-");
        assert!(res.is_err());

        let res = parse_pos("1-1-1");
        assert!(res.is_err());

        let res = parse_pos("1-1-a");
        assert!(res.is_err());

        // First number must be less than second
        let res = parse_pos("1-1");
        assert!(res.is_err());
        assert_eq!(
            res.unwrap_err().to_string(),
            "First number in range (1) must be lower than second number (1)"
        );

        let res = parse_pos("2-1");
        assert!(res.is_err());
        assert_eq!(
            res.unwrap_err().to_string(),
            "First number in range (2) must be lower than second number (1)"
        );

        // All the following are acceptable
        let res = parse_pos("1");
        assert!(res.is_ok());
        assert_eq!(res.unwrap(), vec![0..1]);

        let res = parse_pos("01");
        assert!(res.is_ok());
        assert_eq!(res.unwrap(), vec![0..1]);

        let res = parse_pos("1,3");
        assert!(res.is_ok());
        assert_eq!(res.unwrap(), vec![0..1, 2..3]);

        let res = parse_pos("001,0003");
        assert!(res.is_ok());
        assert_eq!(res.unwrap(), vec![0..1, 2..3]);

        let res = parse_pos("1-3");
        assert!(res.is_ok());
        assert_eq!(res.unwrap(), vec![0..3]);

        let res = parse_pos("0001-03");
        assert!(res.is_ok());
        assert_eq!(res.unwrap(), vec![0..3]);

        let res = parse_pos("1,7,3-5");
        assert!(res.is_ok());
        assert_eq!(res.unwrap(), vec![0..1, 6..7, 2..5]);

        let res = parse_pos("15,19-20");
        assert!(res.is_ok());
        assert_eq!(res.unwrap(), vec![14..15, 18..20]);
    }
}
Some of the preceding tests check for a specific error message to help you write the parse_pos function; however, these could prove troublesome if you were trying to internationalize the error messages. An alternative way to check for specific errors would be to use enum variants, which would allow the user interface to customize the output while still testing for specific errors.
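For example, here is a minimal sketch of that idea (not the approach used in this chapter): a hypothetical ParseError enum carries the offending value, tests can match on the variant, and a Display implementation renders the user-facing text:

use std::fmt;

// Hypothetical error type; the names are illustrative, not from the book's code.
#[derive(Debug, PartialEq)]
enum ParseError {
    IllegalValue(String),
    InvertedRange(usize, usize),
}

impl fmt::Display for ParseError {
    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
        match self {
            ParseError::IllegalValue(val) => {
                write!(f, "illegal list value: \"{}\"", val)
            }
            ParseError::InvertedRange(n1, n2) => write!(
                f,
                "First number in range ({}) must be lower than second number ({})",
                n1, n2
            ),
        }
    }
}

// A test could then check the variant rather than the rendered string:
// assert_eq!(err, ParseError::IllegalValue("+1".to_string()));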
At this point, I expect you can read the preceding code well enough to understand how the function should work. I recommend you stop reading at this point and write the code that will pass this test.
After cargo test unit passes, incorporate the parse_pos function into get_args so that your program will reject invalid arguments and print an error message like the following:
$ cargo run -- -f foo,bar tests/inputs/books.tsv
illegal list value: "foo"
The program should also reject invalid ranges:
$ cargo run -- -f 3-2 tests/inputs/books.tsv
First number in range (3) must be lower than second number (2)
When given valid arguments, your program should display a structure like so:
$ cargo run -- -f 1 -d , tests/inputs/movies1.csv
Config {
    files: [
        "tests/inputs/movies1.csv",
    ],
    delimiter: 44,
    extract: Fields(
        [
            0..1,
        ],
    ),
}
- The positional argument goes into files.
- The -d value of a comma has a byte value of 44.
- The -f 1 argument creates the Extract::Fields variant holding the single range 0..1.
When parsing a TSV file, use the tab as the default delimiter, which has a byte value of 9:
$ cargo run -- -f 2-3 tests/inputs/movies1.tsv
Config {
    files: [
        "tests/inputs/movies1.tsv",
    ],
    delimiter: 9,
    extract: Fields(
        [
            1..3,
        ],
    ),
}
Note that the options for -f|--fields, -b|--bytes, and -c|--chars should all be mutually exclusive:
$ cargo run -- -f 1 -b 8-9 tests/inputs/movies1.tsv
error: The argument '--fields <FIELDS>' cannot be used with '--bytes <BYTES>'
Stop here and get your program working as described. The program should be able to pass all the tests that verify the validity of the inputs, which you can run with cargo test dies:
running 10 tests
test dies_bad_delimiter ... ok
test dies_chars_fields ... ok
test dies_chars_bytes_fields ... ok
test dies_bytes_fields ... ok
test dies_chars_bytes ... ok
test dies_not_enough_args ... ok
test dies_empty_delimiter ... ok
test dies_bad_digit_field ... ok
test dies_bad_digit_bytes ... ok
test dies_bad_digit_chars ... ok
If you find you need more guidance on writing the parse_pos function, I'll provide that in the next section.
The parse_pos function I will show relies on a parse_index function that attempts to parse a string into a positive index value one less than the given number, because the user will provide one-based values but Rust needs zero-offset indexes. The given string may not start with a plus sign, and the parsed value must be greater than zero. Note that closures normally accept arguments inside pipes (||), but the following function uses two closures that accept no arguments, which is why the pipes are empty. Both closures instead reference the provided input value. For the following code, be sure to add use std::num::NonZeroUsize to your imports:
fn parse_index(input: &str) -> Result<usize, String> {
    let value_error = || format!("illegal list value: \"{}\"", input);
    input
        .starts_with('+')
        .then(|| Err(value_error()))
        .unwrap_or_else(|| {
            input
                .parse::<NonZeroUsize>()
                .map(|n| usize::from(n) - 1)
                .map_err(|_| value_error())
        })
}
- Create a closure that accepts no arguments and formats an error string.
- Check whether the input value starts with a plus sign. If so, create an error. Otherwise, continue with the following closure, which also accepts no arguments.
- Use str::parse to parse the input value, and use the turbofish to indicate the return type of std::num::NonZeroUsize, which is a positive integer value.
- If the input value parses successfully, cast the value to a usize and decrement it to a zero-based offset.
- If the value does not parse, generate an error by calling the value_error closure.
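If bool::then and NonZeroUsize are unfamiliar, this small standalone sketch (my own illustration, not from the challenge code) shows how they behave:

use std::num::NonZeroUsize;

fn main() {
    // NonZeroUsize refuses to parse zero, which handily rejects "0".
    assert!("0".parse::<NonZeroUsize>().is_err());
    assert_eq!("3".parse::<NonZeroUsize>().map(usize::from), Ok(3));

    // bool::then returns Some(closure result) for true and None for false.
    assert_eq!(true.then(|| "yes"), Some("yes"));
    assert_eq!(false.then(|| "yes"), None);
}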
The following shows how parse_index is used in the parse_pos function. Add use regex::Regex to your imports for this:
fn parse_pos(range: &str) -> MyResult<PositionList> {
    let range_re = Regex::new(r"^(\d+)-(\d+)$").unwrap();
    range
        .split(',')
        .into_iter()
        .map(|val| {
            parse_index(val).map(|n| n..n + 1).or_else(|e| {
                range_re.captures(val).ok_or(e).and_then(|captures| {
                    let n1 = parse_index(&captures[1])?;
                    let n2 = parse_index(&captures[2])?;
                    if n1 >= n2 {
                        return Err(format!(
                            "First number in range ({}) \
                            must be lower than second number ({})",
                            n1 + 1,
                            n2 + 1
                        ));
                    }
                    Ok(n1..n2 + 1)
                })
            })
        })
        .collect::<Result<_, _>>()
        .map_err(From::from)
}
- Create a regular expression to match two integers separated by a dash, using parentheses to capture the matched numbers.
- Split the provided range value on the comma and turn the result into an iterator. In the event there are no commas, the provided value itself will be used.
- Map each split value into the closure.
- If parse_index parses a single number, then create a Range for the value. Otherwise, note the error value e and continue trying to parse a range.
- If the Regex matches the value, the numbers in parentheses will be available through Regex::captures.
- Parse the two captured numbers as index values.
- If the first value is greater than or equal to the second, return an error.
- Otherwise, create a Range from the lower number to the higher number, adding 1 to ensure the upper number is included.
- Use Iterator::collect to gather the values as a Result.
- Map any problems through From::from to create an error.
The regular expression in the preceding code is enclosed in r"" to denote a raw string, which prevents Rust from interpreting backslash-escaped values in the string. For instance, you've seen that Rust will interpret \n as a newline. Without the raw string, the compiler complains that \d is an unknown character escape:
error: unknown character escape: `d`
   --> src/lib.rs:127:35
    |
127 |     let range_re = Regex::new("^(\d+)-(\d+)$").unwrap();
    |                                   ^ unknown character escape
    |
    = help: for more information, visit
      <https://static.rust-lang.org/doc/master/reference.html#literals>
I would like to highlight the parentheses in the regular expression ^(\d+)-(\d+)$, which indicates one or more digits followed by a dash followed by one or more digits, as shown in Figure 8-1. If the regular expression matches the given string, then I can use Regex::captures to extract the digits that are surrounded by the parentheses. Note that they are available in one-based counting, so the contents of the first capturing parentheses are available in position 1 of the captures.
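As a quick illustration of that one-based indexing (a standalone sketch, not part of the challenge code):

use regex::Regex;

fn main() {
    let range_re = Regex::new(r"^(\d+)-(\d+)$").unwrap();
    if let Some(captures) = range_re.captures("3-5") {
        assert_eq!(&captures[0], "3-5"); // position 0 is the whole match
        assert_eq!(&captures[1], "3");   // first capture group
        assert_eq!(&captures[2], "5");   // second capture group
    }
}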
Now that you have a way to parse and validate the numeric ranges, finish your get_args function before reading further. Here is how I incorporate the parse_pos function into my get_args. First, I define all the arguments:
pub fn get_args() -> MyResult<Config> {
    let matches = App::new("cutr")
        .version("0.1.0")
        .author("Ken Youens-Clark <kyclark@gmail.com>")
        .about("Rust cut")
        .arg(
            Arg::with_name("files")
                .value_name("FILE")
                .help("Input file(s)")
                .multiple(true)
                .default_value("-"),
        )
        .arg(
            Arg::with_name("delimiter")
                .value_name("DELIMITER")
                .short("d")
                .long("delim")
                .help("Field delimiter")
                .default_value("\t"),
        )
        .arg(
            Arg::with_name("fields")
                .value_name("FIELDS")
                .short("f")
                .long("fields")
                .help("Selected fields")
                .conflicts_with_all(&["chars", "bytes"]),
        )
        .arg(
            Arg::with_name("bytes")
                .value_name("BYTES")
                .short("b")
                .long("bytes")
                .help("Selected bytes")
                .conflicts_with_all(&["fields", "chars"]),
        )
        .arg(
            Arg::with_name("chars")
                .value_name("CHARS")
                .short("c")
                .long("chars")
                .help("Selected characters")
                .conflicts_with_all(&["fields", "bytes"]),
        )
        .get_matches();
- The required files option accepts multiple values and defaults to a dash.
- The delimiter option uses the tab as the default value.
- The fields option conflicts with chars and bytes.
- The bytes option conflicts with fields and chars.
- The chars option conflicts with fields and bytes.
Next, I convert the delimiter to a vector of bytes and verify that the vector contains a single byte:
    let delimiter = matches.value_of("delimiter").unwrap();
    let delim_bytes = delimiter.as_bytes();
    if delim_bytes.len() != 1 {
        return Err(From::from(format!(
            "--delim \"{}\" must be a single byte",
            delimiter
        )));
    }
I use the parse_pos function to handle all the optional list values:
    let fields = matches.value_of("fields").map(parse_pos).transpose()?;
    let bytes = matches.value_of("bytes").map(parse_pos).transpose()?;
    let chars = matches.value_of("chars").map(parse_pos).transpose()?;
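Option::transpose does the heavy lifting here by turning an Option<Result<T, E>> inside out into a Result<Option<T>, E>, so the ? can propagate a parse failure while a missing argument simply stays None. A minimal standalone illustration:

fn main() {
    // Some("1") -> Some(Ok(1)) -> Ok(Some(1))
    let good: Result<Option<i32>, _> =
        Some("1").map(|s: &str| s.parse::<i32>()).transpose();
    assert_eq!(good, Ok(Some(1)));

    // A missing value stays None rather than becoming an error.
    let missing: Result<Option<i32>, std::num::ParseIntError> =
        None.map(|s: &str| s.parse::<i32>()).transpose();
    assert_eq!(missing, Ok(None));
}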
Next, I figure out which Extract variant to create, or generate an error if the user fails to select bytes, characters, or fields:
    let extract = if let Some(field_pos) = fields {
        Fields(field_pos)
    } else if let Some(byte_pos) = bytes {
        Bytes(byte_pos)
    } else if let Some(char_pos) = chars {
        Chars(char_pos)
    } else {
        return Err(From::from("Must have --fields, --bytes, or --chars"));
    };
If the code makes it to this point, then I appear to have valid arguments that I can return:
    Ok(Config {
        files: matches.values_of_lossy("files").unwrap(),
        delimiter: *delim_bytes.first().unwrap(),
        extract,
    })
}
Use Vec::first to select the first element of the vector. Because I have verified that this vector contains exactly one byte, it is safe to call Option::unwrap.
In the preceding code, I use the dereference operator * in the expression *delim_bytes.first().unwrap() to dereference the value, which is a &u8. The code will not compile without the asterisk, and the error message shows exactly where to add the dereference operator:
error[E0308]: mismatched types
  --> src/lib.rs:94:20
   |
94 |         delimiter: delim_bytes.first().unwrap(),
   |                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^ expected `u8`, found `&u8`
   |
help: consider dereferencing the borrow
   |
94 |         delimiter: *delim_bytes.first().unwrap(),
   |                    +
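Here is the same fix in miniature (a standalone sketch of the situation): first on a byte slice yields an Option<&u8>, and the * copies the u8 out of the reference:

fn main() {
    let delim_bytes = ",".as_bytes(); // &[u8]
    let first: &u8 = delim_bytes.first().unwrap();
    let delimiter: u8 = *first; // dereference the &u8 to get the u8 value
    assert_eq!(delimiter, b',');
}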
Next, you will need to figure out how to use this information to extract the desired bits from the inputs. In Chapters 4 and 5, you learned how to process lines, bytes, and characters in a file. You should draw on those programs to help you select characters and bytes in this challenge. One difference is that line endings need not be preserved, so you may use BufRead::lines to read the lines of input text. To start, you might consider bringing in the open function to open each file:
fn open(filename: &str) -> MyResult<Box<dyn BufRead>> {
    match filename {
        "-" => Ok(Box::new(BufReader::new(io::stdin()))),
        _ => Ok(Box::new(BufReader::new(File::open(filename)?))),
    }
}
The preceding function will require some additional imports:
use crate::Extract::*;
use clap::{App, Arg};
use regex::Regex;
use std::{
    error::Error,
    fs::File,
    io::{self, BufRead, BufReader},
    num::NonZeroUsize,
    ops::Range,
};
You can expand your run to handle good and bad files:
pub fn run(config: Config) -> MyResult<()> {
    for filename in &config.files {
        match open(filename) {
            Err(err) => eprintln!("{}: {}", filename, err),
            Ok(_) => println!("Opened {}", filename),
        }
    }
    Ok(())
}
At this point, the program should pass cargo test skips_bad_file, and you can manually verify that it skips invalid files such as the nonexistent blargh:
$ cargo run -- -c 1 tests/inputs/books.csv blargh
Opened tests/inputs/books.csv
blargh: No such file or directory (os error 2)
Now consider how you might extract ranges of characters from each line of a filehandle. I wrote a function called extract_chars that will return a new string composed of the characters at the given index positions:
fn extract_chars(line: &str, char_pos: &[Range<usize>]) -> String {
    unimplemented!();
}
I originally wrote the preceding function with the type annotation &PositionList for char_pos. When I checked the code with Clippy, it suggested the type &[Range<usize>] instead. The type &PositionList is more restrictive on callers than is really necessary, and I do make use of the additional flexibility in the tests, so Clippy is being quite helpful here:
warning: writing `&Vec<_>` instead of `&[_]` involves one more reference
and cannot be used with non-Vec-based slices
   --> src/lib.rs:223:40
    |
223 | fn extract_chars(line: &str, char_pos: &PositionList) -> String {
    |                                        ^^^^^^^^^^^^^
    |
    = note: `#[warn(clippy::ptr_arg)]` on by default
    = help: for further information visit
      https://rust-lang.github.io/rust-clippy/master/index.html#ptr_arg
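That flexibility is easy to see in a tiny sketch of my own (count_positions is a made-up stand-in for extract_chars): a &[Range<usize>] parameter accepts a reference to an array literal or to a Vec, because both coerce to a slice:

use std::ops::Range;

// Illustrative only: any function taking &[Range<usize>] accepts both forms.
fn count_positions(pos: &[Range<usize>]) -> usize {
    pos.len()
}

fn main() {
    assert_eq!(count_positions(&[0..1, 2..3]), 2); // reference to an array
    assert_eq!(count_positions(&vec![0..1]), 1);   // &Vec coerces to a slice
}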
The following is a test you can add to the unit_tests module. Be sure to add extract_chars to the module's imports:
#[test]
fn test_extract_chars() {
    assert_eq!(extract_chars("", &[0..1]), "".to_string());
    assert_eq!(extract_chars("ábc", &[0..1]), "á".to_string());
    assert_eq!(extract_chars("ábc", &[0..1, 2..3]), "ác".to_string());
    assert_eq!(extract_chars("ábc", &[0..3]), "ábc".to_string());
    assert_eq!(extract_chars("ábc", &[2..3, 1..2]), "cb".to_string());
    assert_eq!(extract_chars("ábc", &[0..1, 1..2, 4..5]), "áb".to_string());
}
I also wrote a similar extract_bytes function to parse out bytes:
fn extract_bytes(line: &str, byte_pos: &[Range<usize>]) -> String {
    unimplemented!();
}
For the following unit test, be sure to add extract_bytes to the module's imports:
#[test]
fn test_extract_bytes() {
    assert_eq!(extract_bytes("ábc", &[0..1]), "�".to_string());
    assert_eq!(extract_bytes("ábc", &[0..2]), "á".to_string());
    assert_eq!(extract_bytes("ábc", &[0..3]), "áb".to_string());
    assert_eq!(extract_bytes("ábc", &[0..4]), "ábc".to_string());
    assert_eq!(extract_bytes("ábc", &[3..4, 2..3]), "cb".to_string());
    assert_eq!(extract_bytes("ábc", &[0..2, 5..6]), "á".to_string());
}
Note that selecting one byte from the string ábc should break the multibyte á and result in the Unicode replacement character.
Once you have written these two functions so that they pass tests, incorporate them into your main program so that you pass the integration tests for printing bytes and characters. The failing tests that include tsv and csv in the names involve reading text delimited by tabs and commas, which I’ll discuss in the next section.
Next, you will need to learn how to parse comma- and tab-delimited text files.
Technically, all the files you’ve read to this point were delimited in some manner, such as with newlines to denote the end of a line.
In this case, a delimiter like a tab or a comma is used to separate the fields of a record, which is terminated with a newline.
Sometimes the delimiting character may also be part of the data, as when the title 20,000 Leagues Under the Sea occurs in a CSV file.
In this case, the field should be enclosed in quotes to escape the delimiter.
As noted in the chapter's introduction, neither the BSD nor the GNU version of cut respects this escaped delimiter, but the challenge program will.
The easiest way to properly parse delimited text is to use something like the csv crate. I highly recommend that you first read the crate's tutorial, which explains the basics of working with delimited text files and how to use the csv crate effectively.
Consider the following example that shows how you can use this crate to parse delimited data. If you would like to compile and run this code, start a new project, add the csv = "1" dependency to your Cargo.toml, and copy the tests/inputs/books.csv file into the root directory of the new project. Use the following for src/main.rs:
use csv::{ReaderBuilder, StringRecord};
use std::fs::File;

fn main() -> std::io::Result<()> {
    let mut reader = ReaderBuilder::new()
        .delimiter(b',')
        .from_reader(File::open("books.csv")?);

    println!("{}", fmt(reader.headers()?));
    for record in reader.records() {
        println!("{}", fmt(&record?));
    }

    Ok(())
}

fn fmt(rec: &StringRecord) -> String {
    rec.into_iter().map(|v| format!("{:20}", v)).collect()
}
- Use csv::ReaderBuilder to parse a file. The delimiter must be a single u8 byte.
- The from_reader method accepts a value that implements the Read trait.
- The Reader::headers method will return the column names in the first row as a StringRecord.
- The Reader::records method provides access to an iterator over StringRecord values.
- Print a formatted version of each record.
- Use Iterator::map to format the values into fields 20 characters wide and collect the values into a new String.
If you run this program, you will see that the comma in 20,000 Leagues Under the Sea was not used as a field delimiter because it was found within quotes, which themselves are metacharacters that have been removed:
$ cargo run
Author              Year                Title
Émile Zola          1865                La Confession de Claude
Samuel Beckett      1952                Waiting for Godot
Jules Verne         1870                20,000 Leagues Under the Sea
In addition to csv::ReaderBuilder, you should use csv::WriterBuilder in your solution to escape the input delimiter in the output of the program.
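Here is a minimal sketch (my illustration, not the challenge solution) of csv::WriterBuilder quoting a field that contains the delimiter:

use csv::WriterBuilder;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Write one record to an in-memory buffer with a comma delimiter.
    let mut wtr = WriterBuilder::new().delimiter(b',').from_writer(vec![]);
    wtr.write_record(&["Jules Verne", "20,000 Leagues Under the Sea"])?;
    let out = String::from_utf8(wtr.into_inner()?)?;
    // The embedded comma causes the field to be quoted on output.
    assert_eq!(out, "Jules Verne,\"20,000 Leagues Under the Sea\"\n");
    Ok(())
}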
Think about how you might use some of the ideas I just demonstrated in your challenge program. For example, you could write a function like extract_fields that accepts a csv::StringRecord and pulls out the fields found in the PositionList. For the following function, add use csv::StringRecord to the top of src/lib.rs:
fn extract_fields(record: &StringRecord, field_pos: &[Range<usize>]) -> Vec<String> {
    unimplemented!();
}
Following is a unit test for this function that you can add to the unit_tests module:
#[test]
fn test_extract_fields() {
    let rec = StringRecord::from(vec!["Captain", "Sham", "12345"]);
    assert_eq!(extract_fields(&rec, &[0..1]), &["Captain"]);
    assert_eq!(extract_fields(&rec, &[1..2]), &["Sham"]);
    assert_eq!(extract_fields(&rec, &[0..1, 2..3]), &["Captain", "12345"]);
    assert_eq!(extract_fields(&rec, &[0..1, 3..4]), &["Captain"]);
    assert_eq!(extract_fields(&rec, &[1..2, 0..1]), &["Sham", "Captain"]);
}
At this point, the unit_tests module will need all of the following imports:
use super::{extract_bytes, extract_chars, extract_fields, parse_pos};
use csv::StringRecord;
Once you are able to pass this last unit test, you should use all of the extract_* functions to print the desired bytes, characters, and fields from the input files. Be sure to run cargo test to see what is and is not working. This is a challenging program, so don't give up too quickly. Fear is the mind-killer.
I’ll show you my solution now, but I would again stress that there are many ways to write this program.
Any version that passes the test suite is acceptable.
I'll begin by showing how I evolved extract_chars to select the characters. In this first version, I initialize a mutable vector to accumulate the results and then use an imperative approach to select the desired characters:
fn extract_chars(line: &str, char_pos: &[Range<usize>]) -> String {
    let chars: Vec<_> = line.chars().collect();
    let mut selected: Vec<char> = vec![];
    for range in char_pos.iter().cloned() {
        for i in range {
            if let Some(val) = chars.get(i) {
                selected.push(*val)
            }
        }
    }
    selected.iter().collect()
}
- Use str::chars to split the line of text into characters. The Vec type annotation is required because Iterator::collect can return many different types of collections.
- Initialize a mutable vector to hold the selected characters.
- Iterate over each Range of indexes.
- Iterate over each value in the Range.
- Use Vec::get to select the character at the index. This might fail if the user has requested positions beyond the end of the string, but a failure to select a character should not generate an error.
- If it's possible to select the character, use Vec::push to add it to the selected characters. Note the use of * to dereference &val.
- Use Iterator::collect to create a String from the characters.
I can simplify the selection of the characters by using Iterator::filter_map, which yields only the values for which the supplied closure returns Some(value):
fn extract_chars(line: &str, char_pos: &[Range<usize>]) -> String {
    let chars: Vec<_> = line.chars().collect();
    let mut selected: Vec<char> = vec![];
    for range in char_pos.iter().cloned() {
        selected.extend(range.filter_map(|i| chars.get(i)));
    }
    selected.iter().collect()
}
The preceding versions both initialize a variable to collect the results. The next version avoids that mutability and leads to a shorter function by using Iterator::map and Iterator::flatten, which, according to the documentation, "is useful when you have an iterator of iterators or an iterator of things that can be turned into iterators and you want to remove one level of indirection":
fn extract_chars(line: &str, char_pos: &[Range<usize>]) -> String {
    let chars: Vec<_> = line.chars().collect();
    char_pos
        .iter()
        .cloned()
        .map(|range| range.filter_map(|i| chars.get(i)))
        .flatten()
        .collect()
}
- Use Iterator::map to process each Range and select the characters.
- Use Iterator::flatten to remove the nested structures.
Without Iterator::flatten, Rust will show the following error:
error[E0277]: a value of type `String` cannot be built from an iterator
over elements of type `FilterMap<std::ops::Range<usize>,
In the findr program from Chapter 7, I used Iterator::filter_map to combine the operations of filter and map. Similarly, the operations of flatten and map can be combined with Iterator::flat_map in this shortest and final version of the function:
fn extract_chars(line: &str, char_pos: &[Range<usize>]) -> String {
    let chars: Vec<_> = line.chars().collect();
    char_pos
        .iter()
        .cloned()
        .flat_map(|range| range.filter_map(|i| chars.get(i)))
        .collect()
}
The selection of bytes is very similar, but I have to deal with the fact that String::from_utf8_lossy needs a slice of bytes, unlike the previous example where I could collect an iterator of character references into a String. As with extract_chars, the goal is to return a new string, but there is a potential problem if the byte selection breaks Unicode characters and so produces an invalid UTF-8 string:
fn extract_bytes(line: &str, byte_pos: &[Range<usize>]) -> String {
    let bytes = line.as_bytes();
    let selected: Vec<_> = byte_pos
        .iter()
        .cloned()
        .flat_map(|range| range.filter_map(|i| bytes.get(i)).copied())
        .collect();
    String::from_utf8_lossy(&selected).into_owned()
}
- Break the line into a vector of bytes.
- Use Iterator::flat_map to select bytes at the wanted positions and copy the selected bytes.
- Use String::from_utf8_lossy to generate a possibly invalid UTF-8 string from the selected bytes. Use Cow::into_owned to clone the data, if needed.
In the preceding code, I'm using get to select the bytes, which yields references, so collecting would produce a vector of byte references (Vec<&u8>); but String::from_utf8_lossy expects a slice of bytes (&[u8]). To fix this, I use Iterator::copied to create copies of the elements and avoid the following error:
error[E0308]: mismatched types
   --> src/lib.rs:215:29
    |
215 |     String::from_utf8_lossy(&selected).into_owned()
    |                             ^^^^^^^^^ expected slice `[u8]`,
    |                                       found struct `Vec`
    |
    = note: expected reference `&[u8]`
               found reference `&Vec<&u8>`
Finally, I would note the necessity of using Cow::into_owned at the end of the function. Without this, I get a compilation error that suggests an alternate way to convert the Cow value to a String:
error[E0308]: mismatched types
   --> src/lib.rs:178:5
    |
171 | fn extract_bytes(line: &str, byte_pos: &[Range<usize>]) -> String {
    |                                                             ------
    |                                 expected `String` because of return type
...
178 |     String::from_utf8_lossy(&selected)
    |     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^- help: try using a conversion
    |     |                                   method: `.to_string()`
    |     |
    |     expected struct `String`, found enum `Cow`
    |
    = note: expected struct `String`
                 found enum `Cow<'_, str>`
While the Rust compiler is extremely strict, I appreciate how informative and helpful the error messages are.
Selecting the fields from a csv::StringRecord is almost identical to extracting characters from a line:
fn extract_fields(record: &StringRecord, field_pos: &[Range<usize>]) -> Vec<String> {
    field_pos
        .iter()
        .cloned()
        .flat_map(|range| range.filter_map(|i| record.get(i)))
        .map(String::from)
        .collect()
}
Use StringRecord::get to try to get the field for the index position.
There's another way to write this function so that it will return a Vec<&str>, which will be slightly more memory efficient as it will not make copies of the strings. The trade-off is that I must indicate the lifetimes. First, let me naively try to write it like so:
// This will not compile
fn extract_fields(record: &StringRecord, field_pos: &[Range<usize>]) -> Vec<&str> {
    field_pos
        .iter()
        .cloned()
        .flat_map(|range| range.filter_map(|i| record.get(i)))
        .collect()
}
If I try to compile this, the Rust compiler will complain about lifetimes:
error[E0106]: missing lifetime specifier
   --> src/lib.rs:203:10
    |
201 |     record: &StringRecord,
    |             -------------
202 |     field_pos: &[Range<usize>],
    |                ---------------
203 | ) -> Vec<&str> {
    |          ^ expected named lifetime parameter
    |
    = help: this function's return type contains a borrowed value, but the
      signature does not say whether it is borrowed from `record` or `field_pos`
The error message continues with directions for how to amend the code to add lifetimes:
help: consider introducing a named lifetime parameter
    |
200 ~ fn extract_fields<'a>(
201 ~     record: &'a StringRecord,
202 ~     field_pos: &'a [Range<usize>],
203 ~ ) -> Vec<&'a str> {
The suggestion is actually overconstraining the lifetimes. The returned string slices refer to values owned by the StringRecord, so only record and the return value need to share a lifetime. The following version with lifetimes works well:
fn extract_fields<'a>(
    record: &'a StringRecord,
    field_pos: &[Range<usize>],
) -> Vec<&'a str> {
    field_pos
        .iter()
        .cloned()
        .flat_map(|range| range.filter_map(|i| record.get(i)))
        .collect()
}
Both the version returning Vec<String> and the version returning Vec<&'a str> will pass the test_extract_fields unit test. The latter is slightly more efficient and shorter but carries more cognitive overhead. Choose whichever version you feel you'll be able to understand six weeks from now.
For the following code, be sure to add the following imports to src/lib.rs:
use csv::{ReaderBuilder, StringRecord, WriterBuilder};
Here is my run function that passes all the tests for printing the desired ranges of characters, bytes, and records:
pub fn run(config: Config) -> MyResult<()> {
    for filename in &config.files {
        match open(filename) {
            Err(err) => eprintln!("{}: {}", filename, err),
            Ok(file) => match &config.extract {
                Fields(field_pos) => {
                    let mut reader = ReaderBuilder::new()
                        .delimiter(config.delimiter)
                        .has_headers(false)
                        .from_reader(file);

                    let mut wtr = WriterBuilder::new()
                        .delimiter(config.delimiter)
                        .from_writer(io::stdout());

                    for record in reader.records() {
                        let record = record?;
                        wtr.write_record(extract_fields(&record, field_pos))?;
                    }
                }
                Bytes(byte_pos) => {
                    for line in file.lines() {
                        println!("{}", extract_bytes(&line?, byte_pos));
                    }
                }
                Chars(char_pos) => {
                    for line in file.lines() {
                        println!("{}", extract_chars(&line?, char_pos));
                    }
                }
            },
        }
    }
    Ok(())
}
- If the user has requested fields from a delimited file, use csv::ReaderBuilder to create a mutable reader using the given delimiter, and do not treat the first row as headers.
- Use csv::WriterBuilder to correctly escape delimiters in the output.
- Iterate through the records.
- Write the extracted fields to the output.
- Iterate the lines of text and print the extracted bytes.
- Iterate the lines of text and print the extracted characters.
The csv::Reader will attempt to parse the first row for the column names by default. For this program, I don't need to do anything special with these values, so I don't parse the first line as a header row. If I used the default behavior, I would have to handle the headers separately from the rest of the records.
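To see what that default buys you, here is a small standalone sketch (my own illustration, not the challenge code) contrasting the two reader configurations:

use csv::ReaderBuilder;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let data = "Author,Year\nÉmile Zola,1865\n";

    // With the default has_headers(true), records() skips the first row...
    let mut with_headers = ReaderBuilder::new().from_reader(data.as_bytes());
    assert_eq!(with_headers.records().count(), 1);

    // ...while has_headers(false) treats every row as a record.
    let mut without_headers = ReaderBuilder::new()
        .has_headers(false)
        .from_reader(data.as_bytes());
    assert_eq!(without_headers.records().count(), 2);
    Ok(())
}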
Note that I'm using the csv crate to both parse the input and write the output, so this program will correctly handle delimited text files, which I feel is an improvement over the original cut programs.
I'll use tests/inputs/books.csv again to demonstrate that cutr will correctly select a field containing the delimiter and will create output that properly escapes the delimiter:
$ cargo run -- -d , -f 1,3 tests/inputs/books.csv
Author,Title
Émile Zola,La Confession de Claude
Samuel Beckett,Waiting for Godot
Jules Verne,"20,000 Leagues Under the Sea"
This was a fairly complex program with a lot of options, but I found the strictness of the Rust compiler kept me focused on how to write a solution.
I have several ideas for how you can expand this program.
Alter the program to allow partial ranges like -3, meaning 1 through 3, or 5-, meaning 5 to the end. Consider using std::ops::RangeTo to model -3 and std::ops::RangeFrom for 5-. Be aware that clap will try to interpret the value -3 as an option when you run cargo run -- -f -3 tests/inputs/books.tsv, so use -f=-3 instead.
The final version of the challenge program uses the --delim value as both the input and output delimiter. Add an option to specify the output delimiter, and have it default to the input delimiter. Add an optional output filename, and let it default to STDOUT.
The -n option from the BSD and GNU cut versions, which prevents multibyte characters from being split, seems like a fun challenge to implement. I also quite like the --complement option from GNU cut, which complements the set of selected bytes, characters, or fields so that the positions not indicated are shown.
Finally, for more ideas on how to deal with delimited text records, check out the xsv crate, a "fast CSV command line toolkit written in Rust."
Gaze upon the knowledge you gained in this chapter:
- You learned how to dereference a variable that contains a reference using the * operator.
- Sometimes actions on iterators return other iterators. You saw how Iterator::flatten will remove the inner structures to flatten the result.
- You learned how the Iterator::flat_map method combines Iterator::map and Iterator::flatten into one operation for more concise code.
- You used a get function for selecting positions from a vector or fields from a csv::StringRecord. This action might fail, so you used Iterator::filter_map to return only those values that are successfully retrieved.
- You compared how to return a String versus a &str from a function, the latter of which required indicating lifetimes.
- You can now parse and create delimited text using the csv crate.
In the next chapter, you will learn more about regular expressions and chaining operations on iterators.