Never looked at you before with / Common sense
They Might Be Giants, “Circular Karate Chop” (2013)
In this chapter, you will write a Rust version of the comm
(common) utility, which will read two files and report the lines of text that are common to both and the lines that are unique to each.
These are set operations where the common lines are the intersection of the two files and the unique lines are the difference.
If you are familiar with databases, you might also consider these as types of join operations.
You will learn how to:
Manually iterate the lines of a filehandle using Iterator::next
match
on combinations of possibilities using a tuple
Use std::cmp::Ordering
when comparing strings
To show you what will be expected of your program, I’ll start by reviewing part of the manual page for the BSD comm
to see how the tool works:
COMM(1) BSD General Commands Manual COMM(1) NAME comm -- select or reject lines common to two files SYNOPSIS comm [-123i] file1 file2 DESCRIPTION The comm utility reads file1 and file2, which should be sorted lexically, and produces three text columns as output: lines only in file1; lines only in file2; and lines in both files. The filename ''-'' means the standard input. The following options are available: -1 Suppress printing of column 1. -2 Suppress printing of column 2. -3 Suppress printing of column 3. -i Case insensitive comparison of lines. Each column will have a number of tab characters prepended to it equal to the number of lower numbered columns that are being printed. For exam- ple, if column number two is being suppressed, lines printed in column number one will not have any tabs preceding them, and lines printed in column number three will have one. The comm utility assumes that the files are lexically sorted; all charac- ters participate in line comparisons.
The GNU version has some additional options but lacks a case-insensitive option:
$ comm --help Usage: comm [OPTION]... FILE1 FILE2 Compare sorted files FILE1 and FILE2 line by line. When FILE1 or FILE2 (not both) is -, read standard input. With no options, produce three-column output. Column one contains lines unique to FILE1, column two contains lines unique to FILE2, and column three contains lines common to both files. -1 suppress column 1 (lines unique to FILE1) -2 suppress column 2 (lines unique to FILE2) -3 suppress column 3 (lines that appear in both files) --check-order check that the input is correctly sorted, even if all input lines are pairable --nocheck-order do not check that the input is correctly sorted --output-delimiter=STR separate columns with STR --total output a summary -z, --zero-terminated line delimiter is NUL, not newline --help display this help and exit --version output version information and exit Note, comparisons honor the rules specified by 'LC_COLLATE'. Examples: comm -12 file1 file2 Print only lines present in both file1 and file2. comm -3 file1 file2 Print lines in file1 not in file2, and vice versa.
At this point, you may be wondering exactly why you’d use this. Suppose you have a file containing a list of cities where your favorite band played on their last tour:
$ cd 10_commr/tests/inputs/ $ cat cities1.txt Jackson Denton Cincinnati Boston Santa Fe Tucson
Another file lists the cities on their current tour:
$ cat cities2.txt San Francisco Denver Ypsilanti Denton Cincinnati Boston
You can use comm
to find which cities occur in both sets by suppressing columns 1 (the lines unique to the first file) and 2 (the lines unique to the second file) and only showing column 3 (the lines common to both files).
This is like an inner join in SQL, where only data that occurs in both inputs is shown.
Note that both files need to be sorted first:
$ comm -12 <(sort cities1.txt) <(sort cities2.txt) Boston Cincinnati Denton
If you wanted the cities the band played only on the first tour, you could suppress columns 2 and 3:
$ comm -23 <(sort cities1.txt) <(sort cities2.txt) Jackson Santa Fe Tucson
Finally, if you wanted the cities they played only on the second tour, you could suppress columns 1 and 3:
$ comm -13 <(sort cities1.txt) <(sort cities2.txt) Denver San Francisco Ypsilanti
The first or second file can be STDIN
, as denoted by a filename consisting of a dash (-
):
$ sort cities2.txt | comm -12 <(sort cities1.txt) - Boston Cincinnati Denton
As with the GNU comm
, only one of the inputs may be a dash with the challenge program.
Note that BSD comm
can perform case-insensitive comparisons when the -i
flag is present.
For instance, I can put the first tour cities in lowercase:
$ cat cities1_lower.txt jackson denton cincinnati boston santa fe tucson
and the second tour cities in uppercase:
$ cat cities2_upper.txt SAN FRANCISCO DENVER YPSILANTI DENTON CINCINNATI BOSTON
Then I can use the -i
flag to find the cities in common:
$ comm -i -12 <(sort cities1_lower.txt) <(sort cities2_upper.txt) boston cincinnati denton
I know the tour cities example is a trivial one, so I’ll give you another example drawn from my experience in bioinformatics, which is the intersection of computer science and biology. Given a file of protein sequences, I can run an analysis that will group similar sequences into clusters. I can then use comm
to compare the clustered proteins to the original list and find the proteins that failed to cluster. There may be something unique to these unclustered proteins that bears further analysis.
This is as much as the challenge program is expected to implement.
One change from the BSD version is that I use the GNU version’s optional output column delimiter that defaults to a tab character, which is the normal output from comm
.
The program in this chapter will be called commr
(pronounced comm-er, which is basically how the British pronounce the word comma) for a Rust version of comm
.
I suggest you use cargo new commr
to start, then add the following dependencies to your Cargo.toml file:
[dependencies]
clap
=
"2.33"
[dev-dependencies]
assert_cmd
=
"2"
predicates
=
"2"
rand
=
"0.8"
Copy my 10_commr/tests directory into your project, and then run cargo test
to run the tests, which should all fail.
No surprises here, but I suggest the following for your src/main.rs:
fn
main
()
{
if
let
Err
(
e
)
=
commr
::get_args
().
and_then
(
commr
::run
)
{
eprintln
!
(
"{}"
,
e
);
std
::process
::exit
(
1
);
}
}
You can start src/lib.rs with the following code:
use
clap
:
:
{
App
,
Arg
}
;
use
std
::
error
::
Error
;
type
MyResult
<
T
>
=
Result
<
T
,
Box
<
dyn
Error
>
>
;
#[
derive(Debug)
]
pub
struct
Config
{
file1
:
String
,
file2
:
String
,
show_col1
:
bool
,
show_col2
:
bool
,
show_col3
:
bool
,
insensitive
:
bool
,
delimiter
:
String
,
}
The first input filename is a String
.
The second input filename is a String
.
A Boolean for whether or not to show the first column of output.
A Boolean for whether or not to show the second column of output.
A Boolean for whether or not to show the third column of output.
A Boolean for whether or not to perform case-insensitive comparisons.
The output column delimiter, which will default to a tab.
Normally I give my Config
fields the same names as the arguments, but I don’t like the negative suppress verb, preferring instead the positive show. I feel this leads to more readable code, as I will demonstrate later.
You can fill in the missing parts of the following code to begin your get_args
function:
pub
fn
get_args
()
->
MyResult
<
Config
>
{
let
matches
=
App
::new
(
"commr"
)
.
version
(
"0.1.0"
)
.
author
(
"Ken Youens-Clark <kyclark@gmail.com>"
)
.
about
(
"Rust comm"
)
// What goes here?
.
get_matches
();
Ok
(
Config
{
file1
:...
file2
:...
show_col1
:...
show_col2
:...
show_col3
:...
insensitive
:...
delimiter
:...
})
}
Start your run
function by printing the config
:
pub
fn
run
(
config
:Config
)
->
MyResult
<
()
>
{
println
!
(
"{:#?}"
,
config
);
Ok
(())
}
Your program should be able to produce the following usage:
$ cargo run -- -h commr 0.1.0 Ken Youens-Clark <kyclark@gmail.com> Rust comm USAGE: commr [FLAGS] [OPTIONS] <FILE1> <FILE2> FLAGS: -h, --help Prints help information -i Case-insensitive comparison of lines -1 Suppress printing of column 1 -2 Suppress printing of column 2 -3 Suppress printing of column 3 -V, --version Prints version information OPTIONS: -d, --output-delimiter <DELIM> Output delimiter [default: ] ARGS: <FILE1> Input file 1 <FILE2> Input file 2
If you run your program with no arguments, it should fail with a message that the two file arguments are required:
$ cargo run error: The following required arguments were not provided: <FILE1> <FILE2> USAGE: commr <FILE1> <FILE2> --output-delimiter <DELIM> For more information try --help
If you supply two positional arguments, you should get the following output:
$ cargo run -- tests/inputs/file1.txt tests/inputs/file2.txt Config { file1: "tests/inputs/file1.txt",file2: "tests/inputs/file2.txt", show_col1: true,
show_col2: true, show_col3: true, insensitive: false, delimiter: "\t", }
The two positional arguments are parsed into file1
and file2
.
All the rest of the values use defaults, which are true
for the Booleans and the tab character for the output delimiter.
Verify that you can set all the other arguments as well:
$ cargo run -- tests/inputs/file1.txt tests/inputs/file2.txt -123 -d , -i Config { file1: "tests/inputs/file1.txt", file2: "tests/inputs/file2.txt", show_col1: false,show_col2: false, show_col3: false, insensitive: true,
delimiter: ",",
}
Stop reading and make your program match the preceding output.
Following is how I defined the arguments in my get_args
.
I don’t have much to comment on here since it’s so similar to previous programs:
pub
fn
get_args
()
->
MyResult
<
Config
>
{
let
matches
=
App
::new
(
"commr"
)
.
version
(
"0.1.0"
)
.
author
(
"Ken Youens-Clark <kyclark@gmail.com>"
)
.
about
(
"Rust comm"
)
.
arg
(
Arg
::with_name
(
"file1"
)
.
value_name
(
"FILE1"
)
.
help
(
"Input file 1"
)
.
takes_value
(
true
)
.
required
(
true
),
)
.
arg
(
Arg
::with_name
(
"file2"
)
.
value_name
(
"FILE2"
)
.
help
(
"Input file 2"
)
.
takes_value
(
true
)
.
required
(
true
),
)
.
arg
(
Arg
::with_name
(
"suppress_col1"
)
.
short
(
"1"
)
.
takes_value
(
false
)
.
help
(
"Suppress printing of column 1"
),
)
.
arg
(
Arg
::with_name
(
"suppress_col2"
)
.
short
(
"2"
)
.
takes_value
(
false
)
.
help
(
"Suppress printing of column 2"
),
)
.
arg
(
Arg
::with_name
(
"suppress_col3"
)
.
short
(
"3"
)
.
takes_value
(
false
)
.
help
(
"Suppress printing of column 3"
),
)
.
arg
(
Arg
::with_name
(
"insensitive"
)
.
short
(
"i"
)
.
takes_value
(
false
)
.
help
(
"Case-insensitive comparison of lines"
),
)
.
arg
(
Arg
::with_name
(
"delimiter"
)
.
short
(
"d"
)
.
long
(
"output-delimiter"
)
.
value_name
(
"DELIM"
)
.
help
(
"Output delimiter"
)
.
default_value
(
"
\t
"
)
.
takes_value
(
true
),
)
.
get_matches
();
Ok
(
Config
{
file1
:matches
.
value_of
(
"file1"
).
unwrap
().
to_string
(),
file2
:matches
.
value_of
(
"file2"
).
unwrap
().
to_string
(),
show_col1
:!
matches
.
is_present
(
"suppress_col1"
),
show_col2
:!
matches
.
is_present
(
"suppress_col2"
),
show_col3
:!
matches
.
is_present
(
"suppress_col3"
),
insensitive
:matches
.
is_present
(
"insensitive"
),
delimiter
:matches
.
value_of
(
"delimiter"
).
unwrap
().
to_string
(),
})
}
The next step is checking and opening the input files.
I suggest a modification to the open
function used in several previous chapters:
fn
open
(
filename
:
&
str
)
->
MyResult
<
Box
<
dyn
BufRead
>
>
{
match
filename
{
"
-
"
=
>
Ok
(
Box
::
new
(
BufReader
::
new
(
io
::
stdin
(
)
)
)
)
,
_
=
>
Ok
(
Box
::
new
(
BufReader
::
new
(
File
::
open
(
filename
)
.
map_err
(
|
e
|
format
!
(
"
{}: {}
"
,
filename
,
e
)
)
?
,
)
)
)
,
}
}
This will require you to expand your imports with the following:
use
std
::{
error
::Error
,
fs
::File
,
io
::{
self
,
BufRead
,
BufReader
},
};
As noted earlier, only one of the inputs is allowed to be a dash, for STDIN
.
You can use the following code for your run
that will check the filenames and then open the files:
pub
fn
run
(
config
:
Config
)
->
MyResult
<
(
)
>
{
let
file1
=
&
config
.
file1
;
let
file2
=
&
config
.
file2
;
if
file1
=
=
"
-
"
&
&
file2
=
=
"
-
"
{
return
Err
(
From
::
from
(
"
Both input files cannot be STDIN (
\"
-
\"
)
"
)
)
;
}
let
_file1
=
open
(
file1
)
?
;
let
_file2
=
open
(
file2
)
?
;
println
!
(
"
Opened {} and {}
"
,
file1
,
file2
)
;
Ok
(
(
)
)
}
Check that both of the filenames are not a dash (-
).
Attempt to open the two input files.
Print a message so you know what happened.
Your program should reject two STDIN
arguments:
$ cargo run -- - - Both input files cannot be STDIN ("-")
It should be able to print the following for two good input files:
$ cargo run -- tests/inputs/file1.txt tests/inputs/file2.txt Opened tests/inputs/file1.txt and tests/inputs/file2.txt
It should reject a bad file for either argument, such as the nonexistent blargh:
$ cargo run -- tests/inputs/file1.txt blargh blargh: No such file or directory (os error 2)
At this point, your program should pass all the tests for cargo test dies
that check for missing or bad input arguments:
running 4 tests test dies_both_stdin ... ok test dies_no_args ... ok test dies_bad_file1 ... ok test dies_bad_file2 ... ok
Your program can now validate all the arguments and open the input files, either of which may be STDIN
.
Next, you need to iterate over the lines from each file to compare them.
The files in 10_commr/tests/inputs that are used in the tests are:
empty.txt: an empty file
blank.txt: a file with one blank line
file1.txt: a file with four lines of text
file2.txt: a file with two lines of text
You may use BufRead::lines
to read files as it is not necessary to preserve line endings.
Start simply, perhaps using the empty.txt file and file1.txt.
Try to get your program to reproduce the following output from comm
:
$ cd tests/inputs/ $ comm file1.txt empty.txt a b c d
Then reverse the argument order and ensure that you get the same output, but now in column 2, like this:
$ comm empty.txt file1.txt a b c d
Next, look at the output from the BSD version of comm
using file1.txt and file2.txt.
The order of the lines shown in the following command is the expected output for the challenge program:
$ comm file1.txt file2.txt B a b c d
The GNU comm
uses a different ordering for which lines to show first when they are not equal.
Note that the line B is shown after b:
$ comm file1.txt file2.txt a b B c d
Next, consider how you will handle the blank.txt file that contains a single blank line. In the following output, notice that the blank line is shown first, then the two lines from file2.txt:
$ comm tests/inputs/blank.txt tests/inputs/file2.txt B c
I suggest you start by trying to read a line from each file.
The documentation for
BufRead::lines
notes that it will return a None
when it reaches the end of the file.
Starting with the empty file as one of the arguments will force you to deal with having an uneven number of lines, where you will have to advance one of the filehandles while the other stays the same.
Then, when you use two nonempty files, you’ll have to consider how to read the files until you have matching lines and move them independently otherwise.
As always, I’ll stress that the only requirement for your code is to pass the test suite. I doubt you will have written the same code as I did, but that’s what I find so fun and creative about coding. In my solution, I decided to create iterators to retrieve the lines from the filehandles. These iterators incorporate a closure to handle case-insensitive comparisons:
pub
fn
run
(
config
:
Config
)
->
MyResult
<
(
)
>
{
let
file1
=
&
config
.
file1
;
let
file2
=
&
config
.
file2
;
if
file1
=
=
"
-
"
&
&
file2
=
=
"
-
"
{
return
Err
(
From
::
from
(
"
Both input files cannot be STDIN (
\"
-
\"
)
"
)
)
;
}
let
case
=
|
line
:
String
|
{
if
config
.
insensitive
{
line
.
to_lowercase
(
)
}
else
{
line
}
}
;
let
mut
lines1
=
open
(
file1
)
?
.
lines
(
)
.
filter_map
(
Result
::
ok
)
.
map
(
case
)
;
let
mut
lines2
=
open
(
file2
)
?
.
lines
(
)
.
filter_map
(
Result
::
ok
)
.
map
(
case
)
;
let
line1
=
lines1
.
next
(
)
;
let
line2
=
lines2
.
next
(
)
;
println
!
(
"
line1 = {:?}
"
,
line1
)
;
println
!
(
"
line2 = {:?}
"
,
line2
)
;
Ok
(
(
)
)
}
Create a closure to lowercase each line of text when config.insensitive
is true
.
Open the files, create iterators that remove errors, and then map the lines through the case
closure.
The Iterator::next
method advances an iterator and returns the next value. Here, it will retrieve the first line from a filehandle.
Print the first two values.
In the preceding code, I used the function Result::ok
rather than writing a closure |line| line.ok()
. They both accomplish the same thing, but the first is shorter.
As I suggested, I’ll start with one of the files being empty. Moving to the root directory of the chapter, I ran the program with the following input files:
$ cd ../.. $ cargo run -- tests/inputs/file1.txt tests/inputs/empty.txt line1 = Some("a") line2 = None
That led me to think about how I can move through the lines of each iterator based on the four different combinations of Some(line)
and None
that I can get from two iterators.
In the following code, I place the possibilities inside a tuple, which is a finite heterogeneous sequence surrounded by parentheses:
let
mut
line1
=
lines1
.
next
(
)
;
let
mut
line2
=
lines2
.
next
(
)
;
while
line1
.
is_some
(
)
|
|
line2
.
is_some
(
)
{
match
(
&
line1
,
&
line2
)
{
(
Some
(
_
)
,
Some
(
_
)
)
=
>
{
line1
=
lines1
.
next
(
)
;
line2
=
lines2
.
next
(
)
;
}
(
Some
(
_
)
,
None
)
=
>
{
line1
=
lines1
.
next
(
)
;
}
(
None
,
Some
(
_
)
)
=
>
{
line2
=
lines2
.
next
(
)
;
}
_
=
>
(
)
,
}
;
}
Make the line variables mutable.
Execute the loop as long as one of the filehandles produces a line.
Compare all possible combinations of the two line variables for two variants.
When both are Some
values, use Iterator::next
to retrieve the next line from both filehandles.
When there is only the first value, ask for the next line from the first filehandle.
Do the same for the second filehandle.
Do nothing for any other condition.
When I have only one value from the first or second file, I should print the value in the first or second column, respectively. When I have two values from the files and they are the same, I should print a value in column 3. When I have two values and the first value is less than the second, I should print the first value in column 1; otherwise, I should print the second value in column 2. To understand this last point, consider the following two input files, which I’ll place side by side so you can imagine how the code will read the lines:
$ cat tests/inputs/file1.txt $ cat tests/inputs/file2.txt a B b c c d
To help you see the output from BSD comm
, I will pipe the output into sed
(stream editor) to replace each tab character (\t
) with the string --->
to make it clear which columns are being printed:
$ comm tests/inputs/file1.txt tests/inputs/file2.txt | sed "s/\t/--->/g"--->B a b --->--->c d
The sed
command s//
will substitute values, replacing the string between the first pair of slashes with the string between the second pair. The final g
is the global flag to substitute every occurrence.
Now imagine your code reads the first line from each input and has a from file1.txt and B from file2.txt.
They are not equal, so the question is which to print.
The goal is to mimic BSD comm
, so I know that the B should come first and be printed in the second column.
When I compare a and B, I find that B is less than a when they are ordered by their code point, or numerical value.
To help you see this, I’ve included a program in util/ascii that will show you a range of the ASCII table starting at the first printable character.
Note that B has a value of 66 while a is 97:
33: ! 52: 4 71: G 90: Z 109: m 34: " 53: 5 72: H 91: [ 110: n 35: # 54: 6 73: I 92: \ 111: o 36: $ 55: 7 74: J 93: ] 112: p 37: % 56: 8 75: K 94: ^ 113: q 38: & 57: 9 76: L 95: _ 114: r 39: ' 58: : 77: M 96: ` 115: s 40: ( 59: ; 78: N 97: a 116: t 41: ) 60: < 79: O 98: b 117: u 42: * 61: = 80: P 99: c 118: v 43: + 62: > 81: Q 100: d 119: w 44: , 63: ? 82: R 101: e 120: x 45: - 64: @ 83: S 102: f 121: y 46: . 65: A 84: T 103: g 122: z 47: / 66: B 85: U 104: h 123: { 48: 0 67: C 86: V 105: i 124: | 49: 1 68: D 87: W 106: j 125: } 50: 2 69: E 88: X 107: k 126: ~ 51: 3 70: F 89: Y 108: l 127: DEL
To mimic BSD comm
, I should print the lower value (B) first and draw another value from that file for the next iteration; the GNU version does the opposite.
In the following code, I’m concerned only with the ordering, and I’ll handle the indentation in a moment.
Note that you should add use std::cmp::Ordering::*
to your imports for this code:
let
mut
line1
=
lines1
.
next
(
)
;
let
mut
line2
=
lines2
.
next
(
)
;
while
line1
.
is_some
(
)
|
|
line2
.
is_some
(
)
{
match
(
&
line1
,
&
line2
)
{
(
Some
(
val1
)
,
Some
(
val2
)
)
=
>
match
val1
.
cmp
(
val2
)
{
Equal
=
>
{
println
!
(
"
{}
"
,
val1
)
;
line1
=
lines1
.
next
(
)
;
line2
=
lines2
.
next
(
)
;
}
Less
=
>
{
println
!
(
"
{}
"
,
val1
)
;
line1
=
lines1
.
next
(
)
;
}
Greater
=
>
{
println
!
(
"
{}
"
,
val2
)
;
line2
=
lines2
.
next
(
)
;
}
}
,
(
Some
(
val1
)
,
None
)
=
>
{
println
!
(
"
{}
"
,
val1
)
;
line1
=
lines1
.
next
(
)
;
}
(
None
,
Some
(
val2
)
)
=
>
{
println
!
(
"
{}
"
,
val2
)
;
line2
=
lines2
.
next
(
)
;
}
_
=
>
(
)
,
}
}
Use Ord::cmp
to compare the first value to the second. This will return a variant from std::cmp::Ordering
.
When the two values are equal, print the first and get values from each of the files.
When the value from the first file is less than the value from the second file, print the first and request the next value from the first file.
When the first value is greater than the second, print the value from the second file and request the next value from the second file.
When there is a value only from the first file, print it and continue requesting values from the first file.
When there is a value only from the second file, print it and continue requesting values from the second file.
If I run this code using a nonempty file and an empty file, it works:
$ cargo run -- tests/inputs/file1.txt tests/inputs/empty.txt a b c d
If I use file1.txt and file2.txt, it’s not far from the expected output:
$ cargo run -- tests/inputs/file1.txt tests/inputs/file2.txt B a b c d
I decided to create an enum
called Column
to represent in which column I should print a value.
Each variant holds a &str
, which requires a lifetime annotation.
You can place the following at the top of src/lib.rs, near your Config
declaration.
Be sure to add use crate::Column::*
to your import so you can reference Col1
instead of
Column::Col1
:
enum
Column
<
'a
>
{
Col1
(
&
'a
str
),
Col2
(
&
'a
str
),
Col3
(
&
'a
str
),
}
Next, I created a closure called print
to handle the printing of the output.
The following code belongs in the run
function:
let
=
|
col
:
Column
|
{
let
mut
columns
=
vec
!
[
]
;
match
col
{
Col1
(
val
)
=
>
{
if
config
.
show_col1
{
columns
.
push
(
val
)
;
}
}
Col2
(
val
)
=
>
{
if
config
.
show_col2
{
if
config
.
show_col1
{
columns
.
push
(
"
"
)
;
}
columns
.
push
(
val
)
;
}
}
Col3
(
val
)
=
>
{
if
config
.
show_col3
{
if
config
.
show_col1
{
columns
.
push
(
"
"
)
;
}
if
config
.
show_col2
{
columns
.
push
(
"
"
)
;
}
columns
.
push
(
val
)
;
}
}
}
;
if
!
columns
.
is_empty
(
)
{
println
!
(
"
{}
"
,
columns
.
join
(
&
config
.
delimiter
)
)
;
}
}
;
Create a mutable vector to hold the output columns.
Given text for column 1, add the value only if the column is shown.
Given text for column 2, add the values for the two columns only if they are shown.
Given text for column 3, add the values for the three columns only if they are shown.
If there are columns to print, join them on the output delimiter.
Originally I used the field suppress_col1
, which had me writing if !config.suppress_col1
, a double negative that is much harder to comprehend. In general, I would recommend using positive names like do_something rather than dont_do_something.
Here is how I incorporate the print
closure:
let
mut
line1
=
lines1
.
next
(
)
;
let
mut
line2
=
lines2
.
next
(
)
;
while
line1
.
is_some
(
)
|
|
line2
.
is_some
(
)
{
match
(
&
line1
,
&
line2
)
{
(
Some
(
val1
)
,
Some
(
val2
)
)
=
>
match
val1
.
cmp
(
val2
)
{
Equal
=
>
{
(
Col3
(
val1
)
)
;
line1
=
lines1
.
next
(
)
;
line2
=
lines2
.
next
(
)
;
}
Less
=
>
{
(
Col1
(
val1
)
)
;
line1
=
lines1
.
next
(
)
;
}
Greater
=
>
{
(
Col2
(
val2
)
)
;
line2
=
lines2
.
next
(
)
;
}
}
,
(
Some
(
val1
)
,
None
)
=
>
{
(
Col1
(
val1
)
)
;
line1
=
lines1
.
next
(
)
;
}
(
None
,
Some
(
val2
)
)
=
>
{
(
Col2
(
val2
)
)
;
line2
=
lines2
.
next
(
)
;
}
_
=
>
(
)
,
}
}
Draw the initial values from the two input files.
When the values are the same, print one of them in column 3.
When the first value is less than the second, print the first value in column 1.
When the first value is greater than the second, print the second value in column 2.
When there is a value only from the first file, print it in column 1.
When there is a value only from the second file, print it in column 2.
I like having the option to change the output delimiter from a tab to something more visible:
$ cargo run -- -d="--->" tests/inputs/file1.txt tests/inputs/file2.txt --->B a b --->--->c d
The version I presented mimics the BSD version of comm
.
Alter the program to match the GNU output, and also add the additional options from that version.
Be sure you update the test suite and test files to verify that your program works exactly like the GNU version.
Change the column suppression flags to selection flags, so -12
would mean show the first two columns only.
Without any column selections, all the columns should be shown.
This is similar to how the wcr
program works, where the default is to show all the columns for lines, words, and characters, and the selection of any of those columns suppresses those not selected.
Update the tests to verify that your program works correctly.
As I noted in the chapter introduction, comm
performs basic join operations on two files, which is similar to the join
program.
Run man join
to read the manual page for that program, and use your experience from writing commr
to write a Rust version.
I would suggest the ingenious name joinr
.
Generate input files, and then use join
to create the output files you can use to verify that your version maintains fidelity to the original tool.
Until I wrote my own version of the utility, I had to look at the manual page every time I used comm
, to remember what the flags meant.
I also imagined it to be a very complicated program, but I find the solution quite simple and elegant.
Consider what you learned:
You can choose when to advance any iterator by using Iterator::next
. For instance, when used with a filehandle, you can manually select the next line.
You can use match
on combinations of possibilities by grouping them into a tuple.
You can use the cmp
method of the Ord
trait to compare one value to another. The result is a variant of std::cmp::Ordering
.
You can create an enum
called Column
where the variants can hold a &str
value as long as you include lifetime annotations.
In the next chapter, you’ll learn how to move to a line or byte position in a file.