Chapter 10. Boston Commons

Never looked at you before with / Common sense

They Might Be Giants, “Circular Karate Chop” (2013)

In this chapter, you will write a Rust version of the comm (common) utility, which will read two files and report the lines of text that are common to both and the lines that are unique to each. These are set operations where the common lines are the intersection of the two files and the unique lines are the difference. If you are familiar with databases, you might also consider these as types of join operations.

You will learn how to:

  • Manually iterate the lines of a filehandle using Iterator::next

  • match on combinations of possibilities using a tuple

  • Use std::cmp::Ordering when comparing strings

How comm Works

To show you what will be expected of your program, I’ll start by reviewing part of the manual page for the BSD comm to see how the tool works:

COMM(1)                   BSD General Commands Manual                  COMM(1)

NAME
     comm -- select or reject lines common to two files

SYNOPSIS
     comm [-123i] file1 file2

DESCRIPTION
     The comm utility reads file1 and file2, which should be sorted lexically,
     and produces three text columns as output: lines only in file1; lines
     only in file2; and lines in both files.

     The filename ''-'' means the standard input.

     The following options are available:

     -1      Suppress printing of column 1.

     -2      Suppress printing of column 2.

     -3      Suppress printing of column 3.

     -i      Case insensitive comparison of lines.

     Each column will have a number of tab characters prepended to it equal to
     the number of lower numbered columns that are being printed.  For exam-
     ple, if column number two is being suppressed, lines printed in column
     number one will not have any tabs preceding them, and lines printed in
     column number three will have one.

     The comm utility assumes that the files are lexically sorted; all charac-
     ters participate in line comparisons.

The GNU version has some additional options but lacks a case-insensitive option:

$ comm --help
Usage: comm [OPTION]... FILE1 FILE2
Compare sorted files FILE1 and FILE2 line by line.

When FILE1 or FILE2 (not both) is -, read standard input.

With no options, produce three-column output.  Column one contains
lines unique to FILE1, column two contains lines unique to FILE2,
and column three contains lines common to both files.

  -1              suppress column 1 (lines unique to FILE1)
  -2              suppress column 2 (lines unique to FILE2)
  -3              suppress column 3 (lines that appear in both files)

  --check-order     check that the input is correctly sorted, even
                      if all input lines are pairable
  --nocheck-order   do not check that the input is correctly sorted
  --output-delimiter=STR  separate columns with STR
  --total           output a summary
  -z, --zero-terminated    line delimiter is NUL, not newline
      --help     display this help and exit
      --version  output version information and exit

Note, comparisons honor the rules specified by 'LC_COLLATE'.

Examples:
  comm -12 file1 file2  Print only lines present in both file1 and file2.
  comm -3 file1 file2  Print lines in file1 not in file2, and vice versa.

At this point, you may be wondering exactly why you’d use this. Suppose you have a file containing a list of cities where your favorite band played on their last tour:

$ cd 10_commr/tests/inputs/
$ cat cities1.txt
Jackson
Denton
Cincinnati
Boston
Santa Fe
Tucson

Another file lists the cities on their current tour:

$ cat cities2.txt
San Francisco
Denver
Ypsilanti
Denton
Cincinnati
Boston

You can use comm to find which cities occur in both sets by suppressing columns 1 (the lines unique to the first file) and 2 (the lines unique to the second file) and only showing column 3 (the lines common to both files). This is like an inner join in SQL, where only data that occurs in both inputs is shown. Note that both files need to be sorted first:

$ comm -12 <(sort cities1.txt) <(sort cities2.txt)
Boston
Cincinnati
Denton

If you wanted the cities the band played only on the first tour, you could suppress columns 2 and 3:

$ comm -23 <(sort cities1.txt) <(sort cities2.txt)
Jackson
Santa Fe
Tucson

Finally, if you wanted the cities they played only on the second tour, you could suppress columns 1 and 3:

$ comm -13 <(sort cities1.txt) <(sort cities2.txt)
Denver
San Francisco
Ypsilanti

The first or second file can be STDIN, as denoted by a filename consisting of a dash (-):

$ sort cities2.txt | comm -12 <(sort cities1.txt) -
Boston
Cincinnati
Denton

As with the GNU comm, only one of the inputs may be a dash with the challenge program. Note that BSD comm can perform case-insensitive comparisons when the -i flag is present. For instance, I can put the first tour cities in lowercase:

$ cat cities1_lower.txt
jackson
denton
cincinnati
boston
santa fe
tucson

and the second tour cities in uppercase:

$ cat cities2_upper.txt
SAN FRANCISCO
DENVER
YPSILANTI
DENTON
CINCINNATI
BOSTON

Then I can use the -i flag to find the cities in common:

$ comm -i -12 <(sort cities1_lower.txt) <(sort cities2_upper.txt)
boston
cincinnati
denton
Note

I know the tour cities example is a trivial one, so I’ll give you another example drawn from my experience in bioinformatics, which is the intersection of computer science and biology. Given a file of protein sequences, I can run an analysis that will group similar sequences into clusters. I can then use comm to compare the clustered proteins to the original list and find the proteins that failed to cluster. There may be something unique to these unclustered proteins that bears further analysis.

This is as much as the challenge program is expected to implement. One change from the BSD version is that I use the GNU version’s optional output column delimiter that defaults to a tab character, which is the normal output from comm.

Getting Started

The program in this chapter will be called commr (pronounced comm-er, which is basically how the British pronounce the word comma) for a Rust version of comm. I suggest you use cargo new commr to start, then add the following dependencies to your Cargo.toml file:

[dependencies]
clap = "2.33"

[dev-dependencies]
assert_cmd = "2"
predicates = "2"
rand = "0.8"

Copy my 10_commr/tests directory into your project, and then run cargo test to run the tests, which should all fail.

Defining the Arguments

No surprises here, but I suggest the following for your src/main.rs:

fn main() {
    if let Err(e) = commr::get_args().and_then(commr::run) {
        eprintln!("{}", e);
        std::process::exit(1);
    }
}

You can start src/lib.rs with the following code:

use clap::{App, Arg};
use std::error::Error;

type MyResult<T> = Result<T, Box<dyn Error>>;

#[derive(Debug)]
pub struct Config {
    file1: String, 1
    file2: String, 2
    show_col1: bool, 3
    show_col2: bool, 4
    show_col3: bool, 5
    insensitive: bool, 6
    delimiter: String, 7
}
1

The first input filename is a String.

2

The second input filename is a String.

3

A Boolean for whether or not to show the first column of output.

4

A Boolean for whether or not to show the second column of output.

5

A Boolean for whether or not to show the third column of output.

6

A Boolean for whether or not to perform case-insensitive comparisons.

7

The output column delimiter, which will default to a tab.

Note

Normally I give my Config fields the same names as the arguments, but I don’t like the negative suppress verb, preferring instead the positive show. I feel this leads to more readable code, as I will demonstrate later.

You can fill in the missing parts of the following code to begin your get_args function:

pub fn get_args() -> MyResult<Config> {
    let matches = App::new("commr")
        .version("0.1.0")
        .author("Ken Youens-Clark <kyclark@gmail.com>")
        .about("Rust comm")
        // What goes here?
        .get_matches();

    Ok(Config {
        file1: ...
        file2: ...
        show_col1: ...
        show_col2: ...
        show_col3: ...
        insensitive: ...
        delimiter: ...
    })
}

Start your run function by printing the config:

pub fn run(config: Config) -> MyResult<()> {
    println!("{:#?}", config);
    Ok(())
}

Your program should be able to produce the following usage:

$ cargo run -- -h
commr 0.1.0
Ken Youens-Clark <kyclark@gmail.com>
Rust comm

USAGE:
    commr [FLAGS] [OPTIONS] <FILE1> <FILE2>

FLAGS:
    -h, --help       Prints help information
    -i               Case-insensitive comparison of lines
    -1               Suppress printing of column 1
    -2               Suppress printing of column 2
    -3               Suppress printing of column 3
    -V, --version    Prints version information

OPTIONS:
    -d, --output-delimiter <DELIM>    Output delimiter [default: 	]

ARGS:
    <FILE1>    Input file 1
    <FILE2>    Input file 2

If you run your program with no arguments, it should fail with a message that the two file arguments are required:

$ cargo run
error: The following required arguments were not provided:
    <FILE1>
    <FILE2>

USAGE:
    commr <FILE1> <FILE2> --output-delimiter <DELIM>

For more information try --help

If you supply two positional arguments, you should get the following output:

$ cargo run -- tests/inputs/file1.txt tests/inputs/file2.txt
Config {
    file1: "tests/inputs/file1.txt", 1
    file2: "tests/inputs/file2.txt",
    show_col1: true, 2
    show_col2: true,
    show_col3: true,
    insensitive: false,
    delimiter: "\t",
}
1

The two positional arguments are parsed into file1 and file2.

2

All the rest of the values use defaults, which are true for the Booleans and the tab character for the output delimiter.

Verify that you can set all the other arguments as well:

$ cargo run -- tests/inputs/file1.txt tests/inputs/file2.txt -123 -d , -i
Config {
    file1: "tests/inputs/file1.txt",
    file2: "tests/inputs/file2.txt",
    show_col1: false, 1
    show_col2: false,
    show_col3: false,
    insensitive: true, 2
    delimiter: ",", 3
}
1

The -123 sets each of the show values to false.

2

The -i sets insensitive to true.

3

The -d option sets the output delimiter to a comma (,).

Note

Stop reading and make your program match the preceding output.

Following is how I defined the arguments in my get_args. I don’t have much to comment on here since it’s so similar to previous programs:

pub fn get_args() -> MyResult<Config> {
    let matches = App::new("commr")
        .version("0.1.0")
        .author("Ken Youens-Clark <kyclark@gmail.com>")
        .about("Rust comm")
        .arg(
            Arg::with_name("file1")
                .value_name("FILE1")
                .help("Input file 1")
                .takes_value(true)
                .required(true),
        )
        .arg(
            Arg::with_name("file2")
                .value_name("FILE2")
                .help("Input file 2")
                .takes_value(true)
                .required(true),
        )
        .arg(
            Arg::with_name("suppress_col1")
                .short("1")
                .takes_value(false)
                .help("Suppress printing of column 1"),
        )
        .arg(
            Arg::with_name("suppress_col2")
                .short("2")
                .takes_value(false)
                .help("Suppress printing of column 2"),
        )
        .arg(
            Arg::with_name("suppress_col3")
                .short("3")
                .takes_value(false)
                .help("Suppress printing of column 3"),
        )
        .arg(
            Arg::with_name("insensitive")
                .short("i")
                .takes_value(false)
                .help("Case-insensitive comparison of lines"),
        )
        .arg(
            Arg::with_name("delimiter")
                .short("d")
                .long("output-delimiter")
                .value_name("DELIM")
                .help("Output delimiter")
                .default_value("\t")
                .takes_value(true),
        )
        .get_matches();

    Ok(Config {
        file1: matches.value_of("file1").unwrap().to_string(),
        file2: matches.value_of("file2").unwrap().to_string(),
        show_col1: !matches.is_present("suppress_col1"),
        show_col2: !matches.is_present("suppress_col2"),
        show_col3: !matches.is_present("suppress_col3"),
        insensitive: matches.is_present("insensitive"),
        delimiter: matches.value_of("delimiter").unwrap().to_string(),
    })
}

Validating and Opening the Input Files

The next step is checking and opening the input files. I suggest a modification to the open function used in several previous chapters:

fn open(filename: &str) -> MyResult<Box<dyn BufRead>> {
    match filename {
        "-" => Ok(Box::new(BufReader::new(io::stdin()))),
        _ => Ok(Box::new(BufReader::new(
            File::open(filename)
                .map_err(|e| format!("{}: {}", filename, e))?, 1
        ))),
    }
}
1

Incorporate the filename into the error message.

This will require you to expand your imports with the following:

use std::{
    error::Error,
    fs::File,
    io::{self, BufRead, BufReader},
};

As noted earlier, only one of the inputs is allowed to be a dash, for STDIN. You can use the following code for your run that will check the filenames and then open the files:

pub fn run(config: Config) -> MyResult<()> {
    let file1 = &config.file1;
    let file2 = &config.file2;

    if file1 == "-" && file2 == "-" { 1
        return Err(From::from("Both input files cannot be STDIN (\"-\")"));
    }

    let _file1 = open(file1)?; 2
    let _file2 = open(file2)?;
    println!("Opened {} and {}", file1, file2); 3

    Ok(())
}
1

Check that both of the filenames are not a dash (-).

2

Attempt to open the two input files.

3

Print a message so you know what happened.

Your program should reject two STDIN arguments:

$ cargo run -- - -
Both input files cannot be STDIN ("-")

It should be able to print the following for two good input files:

$ cargo run -- tests/inputs/file1.txt tests/inputs/file2.txt
Opened tests/inputs/file1.txt and tests/inputs/file2.txt

It should reject a bad file for either argument, such as the nonexistent blargh:

$ cargo run -- tests/inputs/file1.txt blargh
blargh: No such file or directory (os error 2)

At this point, your program should pass all the tests for cargo test dies that check for missing or bad input arguments:

running 4 tests
test dies_both_stdin ... ok
test dies_no_args ... ok
test dies_bad_file1 ... ok
test dies_bad_file2 ... ok

Processing the Files

Your program can now validate all the arguments and open the input files, either of which may be STDIN. Next, you need to iterate over the lines from each file to compare them. The files in 10_commr/tests/inputs that are used in the tests are:

  • empty.txt: an empty file

  • blank.txt: a file with one blank line

  • file1.txt: a file with four lines of text

  • file2.txt: a file with two lines of text

You may use BufRead::lines to read files as it is not necessary to preserve line endings. Start simply, perhaps using the empty.txt file and file1.txt. Try to get your program to reproduce the following output from comm:

$ cd tests/inputs/
$ comm file1.txt empty.txt
a
b
c
d

Then reverse the argument order and ensure that you get the same output, but now in column 2, like this:

$ comm empty.txt file1.txt
	a
	b
	c
	d

Next, look at the output from the BSD version of comm using file1.txt and file2.txt. The order of the lines shown in the following command is the expected output for the challenge program:

$ comm file1.txt file2.txt
	B
a
b
		c
d

The GNU comm uses a different ordering for which lines to show first when they are not equal. Note that the line B is shown after b:

$ comm file1.txt file2.txt
a
b
	B
		c
d

Next, consider how you will handle the blank.txt file that contains a single blank line. In the following output, notice that the blank line is shown first, then the two lines from file2.txt:

$ comm tests/inputs/blank.txt tests/inputs/file2.txt

	B
	c

I suggest you start by trying to read a line from each file. The documentation for BufRead::lines notes that it will return a None when it reaches the end of the file. Starting with the empty file as one of the arguments will force you to deal with having an uneven number of lines, where you will have to advance one of the filehandles while the other stays the same. Then, when you use two nonempty files, you’ll have to consider how to read the files until you have matching lines and move them independently otherwise.

Solution

As always, I’ll stress that the only requirement for your code is to pass the test suite. I doubt you will have written the same code as I did, but that’s what I find so fun and creative about coding. In my solution, I decided to create iterators to retrieve the lines from the filehandles. These iterators incorporate a closure to handle case-insensitive comparisons:

pub fn run(config: Config) -> MyResult<()> {
    let file1 = &config.file1;
    let file2 = &config.file2;

    if file1 == "-" && file2 == "-" {
        return Err(From::from("Both input files cannot be STDIN (\"-\")"));
    }

    let case = |line: String| { 1
        if config.insensitive {
            line.to_lowercase()
        } else {
            line
        }
    };

    let mut lines1 = open(file1)?.lines().filter_map(Result::ok).map(case); 2
    let mut lines2 = open(file2)?.lines().filter_map(Result::ok).map(case);

    let line1 = lines1.next(); 3
    let line2 = lines2.next();
    println!("line1 = {:?}", line1); 4
    println!("line2 = {:?}", line2);

    Ok(())
}
1

Create a closure to lowercase each line of text when config.insensitive is true.

2

Open the files, create iterators that remove errors, and then map the lines through the case closure.

3

The Iterator::next method advances an iterator and returns the next value. Here, it will retrieve the first line from a filehandle.

4

Print the first two values.

Note

In the preceding code, I used the function Result::ok rather than writing a closure |line| line.ok(). They both accomplish the same thing, but the first is shorter.

As I suggested, I’ll start with one of the files being empty. Moving to the root directory of the chapter, I ran the program with the following input files:

$ cd ../..
$ cargo run -- tests/inputs/file1.txt tests/inputs/empty.txt
line1 = Some("a")
line2 = None

That led me to think about how I can move through the lines of each iterator based on the four different combinations of Some(line) and None that I can get from two iterators. In the following code, I place the possibilities inside a tuple, which is a finite heterogeneous sequence surrounded by parentheses:

    let mut line1 = lines1.next(); 1
    let mut line2 = lines2.next();

    while line1.is_some() || line2.is_some() { 2
        match (&line1, &line2) { 3
            (Some(_), Some(_)) => { 4
                line1 = lines1.next();
                line2 = lines2.next();
            }
            (Some(_), None) => { 5
                line1 = lines1.next();
            }
            (None, Some(_)) => { 6
                line2 = lines2.next();
            }
            _ => (), 7
        };
    }
1

Make the line variables mutable.

2

Execute the loop as long as one of the filehandles produces a line.

3

Compare all possible combinations of the two line variables for two variants.

4

When both are Some values, use Iterator::next to retrieve the next line from both filehandles.

5

When there is only the first value, ask for the next line from the first filehandle.

6

Do the same for the second filehandle.

7

Do nothing for any other condition.

When I have only one value from the first or second file, I should print the value in the first or second column, respectively. When I have two values from the files and they are the same, I should print a value in column 3. When I have two values and the first value is less than the second, I should print the first value in column 1; otherwise, I should print the second value in column 2. To understand this last point, consider the following two input files, which I’ll place side by side so you can imagine how the code will read the lines:

$ cat tests/inputs/file1.txt          $ cat tests/inputs/file2.txt
a                                     B
b                                     c
c
d

To help you see the output from BSD comm, I will pipe the output into sed (stream editor) to replace each tab character (\t) with the string ---> to make it clear which columns are being printed:

$ comm tests/inputs/file1.txt tests/inputs/file2.txt | sed "s/\t/--->/g" 1
--->B
a
b
--->--->c
d
1

The sed command s// will substitute values, replacing the string between the first pair of slashes with the string between the second pair. The final g is the global flag to substitute every occurrence.

Now imagine your code reads the first line from each input and has a from file1.txt and B from file2.txt. They are not equal, so the question is which to print. The goal is to mimic BSD comm, so I know that the B should come first and be printed in the second column. When I compare a and B, I find that B is less than a when they are ordered by their code point, or numerical value. To help you see this, I’ve included a program in util/ascii that will show you a range of the ASCII table starting at the first printable character. Note that B has a value of 66 while a is 97:

 33: !	 52: 4	 71: G	 90: Z	109: m
 34: "	 53: 5	 72: H	 91: [	110: n
 35: #	 54: 6	 73: I	 92: \	111: o
 36: $	 55: 7	 74: J	 93: ]	112: p
 37: %	 56: 8	 75: K	 94: ^	113: q
 38: &	 57: 9	 76: L	 95: _	114: r
 39: '	 58: :	 77: M	 96: `	115: s
 40: (	 59: ;	 78: N	 97: a	116: t
 41: )	 60: <	 79: O	 98: b	117: u
 42: *	 61: =	 80: P	 99: c	118: v
 43: +	 62: >	 81: Q	100: d	119: w
 44: ,	 63: ?	 82: R	101: e	120: x
 45: -	 64: @	 83: S	102: f	121: y
 46: .	 65: A	 84: T	103: g	122: z
 47: /	 66: B	 85: U	104: h	123: {
 48: 0	 67: C	 86: V	105: i	124: |
 49: 1	 68: D	 87: W	106: j	125: }
 50: 2	 69: E	 88: X	107: k	126: ~
 51: 3	 70: F	 89: Y	108: l	127: DEL

To mimic BSD comm, I should print the lower value (B) first and draw another value from that file for the next iteration; the GNU version does the opposite. In the following code, I’m concerned only with the ordering, and I’ll handle the indentation in a moment. Note that you should add use std::cmp::Ordering::* to your imports for this code:

    let mut line1 = lines1.next();
    let mut line2 = lines2.next();

    while line1.is_some() || line2.is_some() {
        match (&line1, &line2) {
            (Some(val1), Some(val2)) => match val1.cmp(val2) { 1
                Equal => { 2
                    println!("{}", val1);
                    line1 = lines1.next();
                    line2 = lines2.next();
                }
                Less => { 3
                    println!("{}", val1);
                    line1 = lines1.next();
                }
                Greater => { 4
                    println!("{}", val2);
                    line2 = lines2.next();
                }
            },
            (Some(val1), None) => {
                println!("{}", val1); 5
                line1 = lines1.next();
            }
            (None, Some(val2)) => {
                println!("{}", val2); 6
                line2 = lines2.next();
            }
            _ => (),
        }
    }
1

Use Ord::cmp to compare the first value to the second. This will return a variant from std::cmp::Ordering.

2

When the two values are equal, print the first and get values from each of the files.

3

When the value from the first file is less than the value from the second file, print the first and request the next value from the first file.

4

When the first value is greater than the second, print the value from the second file and request the next value from the second file.

5

When there is a value only from the first file, print it and continue requesting values from the first file.

6

When there is a value only from the second file, print it and continue requesting values from the second file.

If I run this code using a nonempty file and an empty file, it works:

$ cargo run -- tests/inputs/file1.txt tests/inputs/empty.txt
a
b
c
d

If I use file1.txt and file2.txt, it’s not far from the expected output:

$ cargo run -- tests/inputs/file1.txt tests/inputs/file2.txt
B
a
b
c
d

I decided to create an enum called Column to represent in which column I should print a value. Each variant holds a &str, which requires a lifetime annotation. You can place the following at the top of src/lib.rs, near your Config declaration. Be sure to add use crate::Column::* to your import so you can reference Col1 instead of Column::Col1:

enum Column<'a> {
    Col1(&'a str),
    Col2(&'a str),
    Col3(&'a str),
}

Next, I created a closure called print to handle the printing of the output. The following code belongs in the run function:

    let print = |col: Column| {
        let mut columns = vec![]; 1
        match col {
            Col1(val) => {
                if config.show_col1 { 2
                    columns.push(val);
                }
            }
            Col2(val) => {
                if config.show_col2 { 3
                    if config.show_col1 {
                        columns.push("");
                    }
                    columns.push(val);
                }
            }
            Col3(val) => {
                if config.show_col3 { 4
                    if config.show_col1 {
                        columns.push("");
                    }
                    if config.show_col2 {
                        columns.push("");
                    }
                    columns.push(val);
                }
            }
        };

        if !columns.is_empty() { 5
            println!("{}", columns.join(&config.delimiter));
        }
    };
1

Create a mutable vector to hold the output columns.

2

Given text for column 1, add the value only if the column is shown.

3

Given text for column 2, add the values for the two columns only if they are shown.

4

Given text for column 3, add the values for the three columns only if they are shown.

5

If there are columns to print, join them on the output delimiter.

Tip

Originally I used the field suppress_col1, which had me writing if !config.suppress_col1, a double negative that is much harder to comprehend. In general, I would recommend using positive names like do_something rather than dont_do_something.

Here is how I incorporate the print closure:

    let mut line1 = lines1.next(); 1
    let mut line2 = lines2.next();

    while line1.is_some() || line2.is_some() {
        match (&line1, &line2) {
            (Some(val1), Some(val2)) => match val1.cmp(val2) {
                Equal => {
                    print(Col3(val1)); 2
                    line1 = lines1.next();
                    line2 = lines2.next();
                }
                Less => {
                    print(Col1(val1)); 3
                    line1 = lines1.next();
                }
                Greater => {
                    print(Col2(val2)); 4
                    line2 = lines2.next();
                }
            },
            (Some(val1), None) => {
                print(Col1(val1)); 5
                line1 = lines1.next();
            }
            (None, Some(val2)) => {
                print(Col2(val2)); 6
                line2 = lines2.next();
            }
            _ => (),
        }
    }
1

Draw the initial values from the two input files.

2

When the values are the same, print one of them in column 3.

3

When the first value is less than the second, print the first value in column 1.

4

When the first value is greater than the second, print the second value in col­umn 2.

5

When there is a value only from the first file, print it in column 1.

6

When there is a value only from the second file, print it in column 2.

I like having the option to change the output delimiter from a tab to something more visible:

$ cargo run -- -d="--->" tests/inputs/file1.txt tests/inputs/file2.txt
--->B
a
b
--->--->c
d

With these changes, all the tests pass.