Chapter 11. Tailor Swyfte

From the embryonic whale to the monkey with no tail

They Might Be Giants, “Mammal” (1992)

The challenge in this chapter will be to write a version of tail, which is the converse of head from Chapter 4. The program will show you the last bytes or lines of one or more files or STDIN, usually defaulting to the last 10 lines. Again the program will have to deal with bad input and will possibly mangle Unicode characters. Due to some limitations with how Rust currently handles STDIN, the challenge program will read only regular files.

In this chapter, you will learn how to do the following:

  • Initialize a static, global, computed value

  • Seek to a line or byte position in a filehandle

  • Indicate multiple trait bounds on a type using the where clause

  • Build a release binary with Cargo

  • Benchmark programs to compare runtime performance

How tail Works

To demonstrate how the challenge program should work, I’ll first show you a portion of the manual page for the BSD tail. Note that the challenge program will only implement some of these features:

TAIL(1)                   BSD General Commands Manual                  TAIL(1)

NAME
     tail -- display the last part of a file

SYNOPSIS
     tail [-F | -f | -r] [-q] [-b number | -c number | -n number] [file ...]

DESCRIPTION
     The tail utility displays the contents of file or, by default, its stan-
     dard input, to the standard output.

     The display begins at a byte, line or 512-byte block location in the
     input.  Numbers having a leading plus ('+') sign are relative to the
     beginning of the input, for example, ''-c +2'' starts the display at the
     second byte of the input.  Numbers having a leading minus ('-') sign or
     no explicit sign are relative to the end of the input, for example,
     ''-n2'' displays the last two lines of the input.  The default starting
     location is ''-n 10'', or the last 10 lines of the input.

The BSD version has many options, but these are the only ones relevant to the challenge program:

     -c number
             The location is number bytes.

     -n number
             The location is number lines.

     -q      Suppresses printing of headers when multiple files are being
             examined.

     If more than a single file is specified, each file is preceded by a
     header consisting of the string ''==> XXX <=='' where XXX is the name of
     the file unless -q flag is specified.

Here’s part of the manual page for GNU tail, which includes long option names:

TAIL(1)                          User Commands                         TAIL(1)

NAME
    tail - output the last part of files

SYNOPSIS
    tail [OPTION]... [FILE]...

DESCRIPTION
    Print  the  last  10  lines of each FILE to standard output.  With more
    than one FILE, precede each with a header giving the file  name.   With
    no FILE, or when FILE is -, read standard input.

    Mandatory  arguments  to  long  options are mandatory for short options
    too.

    -c, --bytes=K
           output the last K bytes; or use -c +K to output  bytes  starting
           with the Kth of each file

    -n, --lines=K
           output the last K lines, instead of the last 10; or use -n +K to
              output starting with the Kth

I’ll use files in the book’s 11_tailr/tests/inputs directory to demonstrate the features of tail that the challenge will implement. As in previous chapters, there are examples with Windows line endings that must be preserved in the output. The files I’ll use are:

  • empty.txt: an empty file

  • one.txt: a file with one line of UTF-8 Unicode text

  • two.txt: a file with two lines of ASCII text

  • three.txt: a file with three lines of ASCII text and CRLF line terminators

  • ten.txt: a file with 10 lines of ASCII text

Change into the chapter’s directory:

$ cd 11_tailr

By default, tail will show the last 10 lines of a file, which you can see with tests/inputs/ten.txt:

$ tail tests/inputs/ten.txt
one
two
three
four
five
six
seven
eight
nine
ten

Run it with -n 4 to see the last four lines:

$ tail -n 4 tests/inputs/ten.txt
seven
eight
nine
ten

Use -c 8 to select the last eight bytes of the file. In the following output, there are six byte-sized characters and two byte-sized newline characters, for a total of eight bytes. Pipe the output to cat -e to display the dollar sign ($) at the end of each line:

$ tail -c 8 tests/inputs/ten.txt | cat -e
ine$
ten$

With multiple input files, tail will print separators between each file. Any errors opening files (such as for nonexistent or unreadable files) will be noted to STDERR without any file headers. For instance, blargh represents a nonexistent file in the following command:

$ tail -n 1 tests/inputs/one.txt blargh tests/inputs/three.txt
==> tests/inputs/one.txt <==
Öne line, four wordś.
tail: blargh: No such file or directory

==> tests/inputs/three.txt <==
four words.

The -q flag will suppress the file headers:

$ tail -q -n 1 tests/inputs/*.txt
Öne line, four wordś.
ten
four words.
Four words.

The end of tests/inputs/one.txt has a funky Unicode ś thrown in for good measure, which is a multibyte Unicode character. If you request the last four bytes of the file, two will be for ś, one for the period, and one for the final newline:

$ tail -c 4 tests/inputs/one.txt
ś.

If you ask for only three, the ś will be split, and you should see the Unicode unknown character:

$ tail -c 3 tests/inputs/one.txt
�.

Requesting more lines or bytes than a file contains is not an error and will cause tail to print the entire file:

$ tail -n 1000 tests/inputs/one.txt
Öne line, four wordś.
$ tail -c 1000 tests/inputs/one.txt
Öne line, four wordś.

As noted in the manual pages, -n or -c values may begin with a plus sign to indicate a line or byte position from the beginning of the file rather than the end. A start position beyond the end of the file is not an error, and tail will print nothing, which you can see if you run tail -n +1000 tests/inputs/one.txt. In the following command, I use -n +8 to start printing from line 8:

$ tail -n +8 tests/inputs/ten.txt
eight
nine
ten

It’s possible to split multibyte characters with byte selection. For example, the tests​/⁠inputs/one.txt file starts with the Unicode character Ö, which is two bytes long. In the following command, I use -c +2 to start printing from the second byte, which will split the multibyte character, resulting in the unknown character:

$ tail -c +2 tests/inputs/one.txt
�ne line, four wordś.

To start printing from the second character, I must use -c +3 to start printing from the third byte:

$ tail -c +3 tests/inputs/one.txt
ne line, four wordś.

Both the BSD and GNU versions will accept 0 and -0 for -n or -c. The GNU version will show no output at all, while the BSD version will show no output when run with a single file but will still show the file headers when there are multiple input files. The following behavior of BSD is expected of the challenge program:

$ tail -n 0 tests/inputs/*
==> tests/inputs/empty.txt <==

==> tests/inputs/one.txt <==

==> tests/inputs/ten.txt <==

==> tests/inputs/three.txt <==

==> tests/inputs/two.txt <==

Both versions interpret the value +0 as starting at the zeroth line or byte, so the whole file will be shown:

$ tail -n +0 tests/inputs/one.txt
Öne line, four wordś.
$ tail -c +0 tests/inputs/one.txt
Öne line, four wordś.

Both versions will reject any value for -n or -c that cannot be parsed as an integer:

$ tail -c foo tests/inputs/one.txt
tail: illegal offset -- foo

While tail has several more features, this is as much as your program needs to implement.

Getting Started

The challenge program will be called tailr (pronounced tay-ler). I recommend you begin with cargo new tailr and then add the following dependencies to Cargo.toml:

[dependencies]
clap = "2.33"
num = "0.4"
regex = "1"
once_cell = "1" 1

[dev-dependencies]
assert_cmd = "2"
predicates = "2"
rand = "0.8"
1

The once_cell crate will be used to create a computed static value.

Copy the book’s 11_tailr/tests directory into your project, and then run cargo test to download the needed crates, build your program, and ensure that you fail all the tests.

Defining the Arguments

Use the same structure for src/main.rs as in the previous chapters:

fn main() {
    if let Err(e) = tailr::get_args().and_then(tailr::run) {
        eprintln!("{}", e);
        std::process::exit(1);
    }
}

The challenge program should have similar options as headr, but this program will need to handle both positive and negative values for the number of lines or bytes. In headr, I used the usize type, which is an unsigned integer that can represent only positive values. In this program, I will use i64, the 64-bit signed integer type, to also store negative numbers. Additionally, I need some way to differentiate between 0, which means nothing should be selected, and +0, which means everything should be selected. I decided to create an enum called TakeValue to represent this, but you may choose a different way. You can start src/lib.rs with the following if you want to follow my lead:

use crate::TakeValue::*; 1
use clap::{App, Arg};
use std::error::Error;

type MyResult<T> = Result<T, Box<dyn Error>>;

#[derive(Debug, PartialEq)] 2
enum TakeValue {
    PlusZero, 3
    TakeNum(i64), 4
}
1

This will allow the code to use PlusZero instead of TakeValue::PlusZero.

2

The PartialEq is needed by the tests to compare values.

3

This variant represents an argument of +0.

4

This variant represents a valid integer value.

Here is the Config I suggest you create to represent the program’s arguments:

#[derive(Debug)]
pub struct Config {
    files: Vec<String>, 1
    lines: TakeValue, 2
    bytes: Option<TakeValue>, 3
    quiet: bool, 4
}
1

files is a vector of strings.

2

lines is a TakeValue that should default to TakeNum(-10) to indicate the last 10 lines.

3

bytes is an optional TakeValue for how many bytes to select.

4

The quiet flag is a Boolean for whether or not to suppress the headers between multiple files.

Following is a skeleton for get_args you can fill in:

pub fn get_args() -> MyResult<Config> {
    let matches = App::new("tailr")
        .version("0.1.0")
        .author("Ken Youens-Clark <kyclark@gmail.com>")
        .about("Rust tail")
        // What goes here?
        .get_matches();

    Ok(Config {
        files: ...
        lines: ...
        bytes: ...
        quiet: ...
    })
}

I suggest you start your run function by printing the config:

pub fn run(config: Config) -> MyResult<()> {
    println!("{:#?}", config);
    Ok(())
}

First, get your program to print the following usage:

$ cargo run -- -h
tailr 0.1.0
Ken Youens-Clark <kyclark@gmail.com>
Rust tail

USAGE:
    tailr [FLAGS] [OPTIONS] <FILE>...

FLAGS:
    -h, --help       Prints help information
    -q, --quiet      Suppress headers
    -V, --version    Prints version information

OPTIONS:
    -c, --bytes <BYTES>    Number of bytes
    -n, --lines <LINES>    Number of lines [default: 10]

ARGS:
    <FILE>...    Input file(s)

If you run the program with no arguments, it should fail with an error that at least one file argument is required because this program will not read STDIN by default:

$ cargo run
error: The following required arguments were not provided:
    <FILE>...

USAGE:
    tailr <FILE>... --lines <LINES>

For more information try --help

Run the program with a file argument and see if you can get this output:

$ cargo run -- tests/inputs/one.txt
Config {
    files: [
        "tests/inputs/one.txt", 1
    ],
    lines: TakeNum( 2
        -10,
    ),
    bytes: None, 3
    quiet: false, 4
}
1

The positional arguments belong in files.

2

The lines argument should default to TakeNum(-10) to take the last 10 lines.

3

The bytes argument should default to None.

4

The quiet option should default to false.

Run the program with multiple file arguments and the -c option to ensure you get the following output:

$ cargo run -- -q -c 4 tests/inputs/*.txt
Config {
    files: [
        "tests/inputs/empty.txt", 1
        "tests/inputs/one.txt",
        "tests/inputs/ten.txt",
        "tests/inputs/three.txt",
        "tests/inputs/two.txt",
    ],
    lines: TakeNum( 2
        -10,
    ),
    bytes: Some( 3
        TakeNum(
            -4,
        ),
    ),
    quiet: true, 4
}
1

The positional arguments are parsed as files.

2

The lines argument is still set to the default.

3

Now bytes is set to Some(TakeNum(-4)) to indicate the last four bytes should be taken.

4

The -q flag causes the quiet option to be true.

You probably noticed that the value 4 was parsed as a negative number even though it was provided as a positive value. The numeric values for lines and bytes should be negative to indicate that the program will take values from the end of the file. A plus sign is required to indicate that the starting position is from the beginning of the file:

$ cargo run -- -n +5 tests/inputs/ten.txt 1
Config {
    files: [
        "tests/inputs/ten.txt",
    ],
    lines: TakeNum(
        5, 2
    ),
    bytes: None,
    quiet: false,
}
1

The +5 argument indicates the program should start printing on the fifth line.

2

The value is recorded as a positive integer.

Both -n and -c are allowed to have a value of 0, which will mean that no lines or bytes will be shown:

$ cargo run -- tests/inputs/empty.txt -c 0
Config {
    files: [
        "tests/inputs/empty.txt",
    ],
    lines: TakeNum(
        -10,
    ),
    bytes: Some(
        TakeNum(
            0,
        ),
    ),
    quiet: false,
}

As with the original versions, the value +0 indicates that the starting point is the beginning of the file, so all the content will be shown:

$ cargo run -- tests/inputs/empty.txt -n +0
Config {
    files: [
        "tests/inputs/empty.txt",
    ],
    lines: PlusZero, 1
    bytes: None,
    quiet: false,
}
1

The PlusZero variant represents +0.

Any noninteger value for -n and -c should be rejected:

$ cargo run -- tests/inputs/empty.txt -n foo
illegal line count -- foo
$ cargo run -- tests/inputs/empty.txt -c bar
illegal byte count -- bar

The challenge program should consider -n and -c mutually exclusive:

$ cargo run -- tests/inputs/empty.txt -n 1 -c 1
error: The argument '--lines <LINES>' cannot be used with '--bytes <BYTES>'

Parsing Positive and Negative Numeric Arguments

In Chapter 4, the challenge program validated the numeric arguments using the parse_positive_int function to reject any values that were not positive integers. This program needs to accept any integer value, and it also needs to handle an optional + or - sign. Here is the start of the function parse_num I’d like you to write that will accept a &str and will return a TakeValue or an error:

fn parse_num(val: &str) -> MyResult<TakeValue> {
    unimplemented!();
}

Add the following unit test to a tests module in your src/lib.rs:

#[cfg(test)]
mod tests {
    use super::{parse_num, TakeValue::*};

    #[test]
    fn test_parse_num() {
        // All integers should be interpreted as negative numbers
        let res = parse_num("3");
        assert!(res.is_ok());
        assert_eq!(res.unwrap(), TakeNum(-3));

        // A leading "+" should result in a positive number
        let res = parse_num("+3");
        assert!(res.is_ok());
        assert_eq!(res.unwrap(), TakeNum(3));

        // An explicit "-" value should result in a negative number
        let res = parse_num("-3");
        assert!(res.is_ok());
        assert_eq!(res.unwrap(), TakeNum(-3));

        // Zero is zero
        let res = parse_num("0");
        assert!(res.is_ok());
        assert_eq!(res.unwrap(), TakeNum(0));

        // Plus zero is special
        let res = parse_num("+0");
        assert!(res.is_ok());
        assert_eq!(res.unwrap(), PlusZero);

        // Test boundaries
        let res = parse_num(&i64::MAX.to_string());
        assert!(res.is_ok());
        assert_eq!(res.unwrap(), TakeNum(i64::MIN + 1));

        let res = parse_num(&(i64::MIN + 1).to_string());
        assert!(res.is_ok());
        assert_eq!(res.unwrap(), TakeNum(i64::MIN + 1));

        let res = parse_num(&format!("+{}", i64::MAX));
        assert!(res.is_ok());
        assert_eq!(res.unwrap(), TakeNum(i64::MAX));

        let res = parse_num(&i64::MIN.to_string());
        assert!(res.is_ok());
        assert_eq!(res.unwrap(), TakeNum(i64::MIN));

        // A floating-point value is invalid
        let res = parse_num("3.14");
        assert!(res.is_err());
        assert_eq!(res.unwrap_err().to_string(), "3.14");

        // Any noninteger string is invalid
        let res = parse_num("foo");
        assert!(res.is_err());
        assert_eq!(res.unwrap_err().to_string(), "foo");
    }
}
Note

I suggest that you stop reading and take some time to write this function. Do not proceed until it passes cargo test test_parse_num. In the next section, I’ll share my solution.

Using a Regular Expression to Match an Integer with an Optional Sign

Following is one version of the parse_num function that passes the tests. Here I chose to use a regular expression to see if the input value matches an expected pattern of text. If you want to include this version in your program, be sure to add use regex::Regex:

fn parse_num(val: &str) -> MyResult<TakeValue> {
    let num_re = Regex::new(r"^([+-])?(\d+)$").unwrap(); 1

    match num_re.captures(val) {
        Some(caps) => {
            let sign = caps.get(1).map_or("-", |m| m.as_str()); 2
            let num = format!("{}{}", sign, caps.get(2).unwrap().as_str()); 3
            if let Ok(val) = num.parse() { 4
                if sign == "+" && val == 0 { 5
                    Ok(PlusZero) 6
                } else {
                    Ok(TakeNum(val)) 7
                }
            } else {
                Err(From::from(val)) 8
            }
        }
        _ => Err(From::from(val)), 9
    }
}
1

Create a regex to find an optional leading + or - sign followed by one or more numbers.

2

If the regex matches, the optional sign will be the first capture. Assume the minus sign if there is no match.

3

The digits of the number will be in the second capture. Format the sign and digits into a string.

4

Attempt to parse the number as an i64, which Rust infers from the function’s return type.

5

Check if the sign is a plus and the parsed value is 0.

6

If so, return the PlusZero variant.

7

Otherwise, return the parsed value.

8

Return the unparsable number as an error.

9

Return an invalid argument as an error.

Regular expression syntax can be daunting to the uninitiated. Figure 11-1 shows each element of the pattern used in the preceding function.

clru 1101
Figure 11-1. This is a regular expression that will match a positive or negative integer.

You’ve seen much of this syntax in previous programs. Here’s a review of all the parts of this regex:

I’d like to make one small change. The first line of the preceding function creates a regular expression by parsing the pattern each time the function is called:

fn parse_num(val: &str) -> MyResult<TakeValue> {
    let num_re = Regex::new(r"^([+-])?(\d+)$").unwrap();
    ...
}

I’d like my program to do the work of compiling the regex just once. You’ve seen in earlier tests how I’ve used const to create a constant value. It’s common to use ALL_CAPS to name global constants and to place them near the top of the crate, like so:

// This will not compile
const NUM_RE: Regex = Regex::new(r"^([+-])?(\d+)$").unwrap();

If I try to run the test again, I get the following error telling me that I cannot use a computed value for a constant:

error[E0015]: calls in constants are limited to constant functions, tuple
structs and tuple variants
  --> src/lib.rs:10:23
   |
10 | const NUM_RE: Regex = Regex::new(r"^([+-])?(\d+)$").unwrap();
   |                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Enter once_cell, which provides a mechanism for creating lazily evaluated statics. To use this, you must first add the dependency to Cargo.toml, which I included at the start of this chapter. To create a lazily evaluated regular expression just one time in my program, I add the following to the top of src/lib.rs:

use once_cell::sync::OnceCell;

static NUM_RE: OnceCell<Regex> = OnceCell::new();

The only change to the parse_num function is to initialize NUM_RE the first time the function is called:

fn parse_num(val: &str) -> MyResult<TakeValue> {
    let num_re =
        NUM_RE.get_or_init(|| Regex::new(r"^([+-])?(\d+)$").unwrap());
    // Same as before
}

It is not a requirement that you use a regular expression to parse the numeric arguments. Here’s a method that relies only on Rust’s internal parsing capabilities:

fn parse_num(val: &str) -> MyResult<TakeValue> {
    let signs: &[char] = &['+', '-']; 1
    let res = val
        .starts_with(signs) 2
        .then(|| val.parse())
        .unwrap_or_else(|| val.parse().map(i64::wrapping_neg)); 3

    match res {
        Ok(num) => {
            if num == 0 && val.starts_with('+') { 4
                Ok(PlusZero)
            } else {
                Ok(TakeNum(num))
            }
        }
        _ => Err(From::from(val)), 5
    }
}
1

The type annotation is required because Rust infers the type &[char; 2], which is a reference to an array, but I want to coerce the value to a slice.

2

If the given value starts with a plus or minus sign, use str::parse, which will use the sign to create a positive or negative number, respectively.

3

Otherwise, parse the number and use i64::wrapping_neg to compute the negative value; that is, a positive value will be returned as negative, while a negative value will remain negative.

4

If the result is a successfully parsed i64, check whether to return PlusZero when the number is 0 and the given value starts with a plus sign; otherwise, return the parsed value.

5

Return the unparsable value as an error.

You may have found another way to figure this out, and that’s the point with functions and testing. It doesn’t much matter how a function is written as long as it passes the tests. A function is a black box where something goes in and something comes out, and we write enough tests to convince ourselves that the function works correctly.

Note

Now that you have a way to validate the numeric arguments, stop and finish your get_args function.

Parsing and Validating the Command-Line Arguments

Following is how I wrote my get_args function. First, I declare all my arguments with clap:

pub fn get_args() -> MyResult<Config> {
    let matches = App::new("tailr")
        .version("0.1.0")
        .author("Ken Youens-Clark <kyclark@gmail.com>")
        .about("Rust tail")
        .arg(
            Arg::with_name("files") 1
                .value_name("FILE")
                .help("Input file(s)")
                .required(true)
                .multiple(true),
        )
        .arg(
            Arg::with_name("lines") 2
                .short("n")
                .long("lines")
                .value_name("LINES")
                .help("Number of lines")
                .default_value("10"),
        )
        .arg(
            Arg::with_name("bytes") 3
                .short("c")
                .long("bytes")
                .value_name("BYTES")
                .conflicts_with("lines")
                .help("Number of bytes"),
        )
        .arg(
            Arg::with_name("quiet") 4
                .short("q")
                .long("quiet")
                .help("Suppress headers"),
        )
        .get_matches();
1

The files argument is positional and requires at least one value.

2

The lines argument has a default value of 10.

3

The bytes argument is optional and conflicts with lines.

4

The quiet flag is optional.

Then I validate lines and bytes and return the Config:

    let lines = matches
        .value_of("lines")
        .map(parse_num)
        .transpose()
        .map_err(|e| format!("illegal line count -- {}", e))?; 1

    let bytes = matches
        .value_of("bytes")
        .map(parse_num)
        .transpose()
        .map_err(|e| format!("illegal byte count -- {}", e))?; 2

    Ok(Config { 3
        files: matches.values_of_lossy("files").unwrap(),
        lines: lines.unwrap(),
        bytes,
        quiet: matches.is_present("quiet"),
    })
}
1

Attempt to parse lines as an integer, creating a useful error message when invalid.

2

Do the same for bytes.

3

Return the Config.

At this point, the program passes all the tests included with cargo test dies:

running 4 tests
test dies_no_args ... ok
test dies_bytes_and_lines ... ok
test dies_bad_lines ... ok
test dies_bad_bytes ... ok

Counting the Total Lines and Bytes in a File

Next, it’s time to figure out how to read a file from a given byte or line location. For instance, the default case is to print the last 10 lines of a file, so I need to know how many lines are in the file to figure out which is the tenth from the end. The same is true for bytes. I also need to determine if the user has requested more lines or bytes than the file contains. When this value is negative—meaning the user wants to start beyond the beginning of the file—the program should print the entire file. When this value is positive—meaning the user wants to start beyond the end of the file—the program should print nothing.

I decided to create a function called count_lines_bytes that takes a filename and returns a tuple containing the total number of lines and bytes in the file. Here is the function’s signature:

fn count_lines_bytes(filename: &str) -> MyResult<(i64, i64)> {
    unimplemented!()
}

If you want to create this function, modify your tests module to add the following unit test:

#[cfg(test)]
mod tests {
    use super::{count_lines_bytes, parse_num, TakeValue::*};

    #[test]
    fn test_parse_num() {} // Same as before

    #[test]
    fn test_count_lines_bytes() {
        let res = count_lines_bytes("tests/inputs/one.txt");
        assert!(res.is_ok());
        assert_eq!(res.unwrap(), (1, 24));

        let res = count_lines_bytes("tests/inputs/ten.txt");
        assert!(res.is_ok());
        assert_eq!(res.unwrap(), (10, 49));
    }

You can expand your run to temporarily print this information:

pub fn run(config: Config) -> MyResult<()> {
    for filename in config.files {
        match File::open(&filename) {
            Err(err) => eprintln!("{}: {}", filename, err),
            Ok(_) => {
                let (total_lines, total_bytes) =
                    count_lines_bytes(&filename)?;
                println!(
                    "{} has {} lines and {} bytes",
                    filename, total_lines, total_bytes
                );
            }
        }
    }
    Ok(())
}
Note

I decided to pass the filename to the count_lines_bytes function instead of the filehandle that is returned by File::open because the filehandle will be consumed by the function, making it unavailable for use in selecting the bytes or lines.

Verify that it looks OK:

$ cargo run tests/inputs/one.txt tests/inputs/ten.txt
tests/inputs/one.txt has 1 lines and 24 bytes
tests/inputs/ten.txt has 10 lines and 49 bytes

Finding the Starting Line to Print

My next step was to write a function to print the lines of a given file. Following is the signature of my print_lines function. Be sure to add use std::io::BufRead for this:

fn print_lines(
    mut file: impl BufRead, 1
    num_lines: &TakeValue, 2
    total_lines: i64, 3
) -> MyResult<()> {
    unimplemented!();
}
1

The file argument should implement the BufRead trait.

2

The num_lines argument is a TakeValue describing the number of lines to print.

3

The total_lines argument is the total number of lines in this file.

I can find the starting line’s index using the number of lines the user wants to print and the total number of lines in the file. Since I will also need this logic to find the starting byte position, I decided to write a function called get_start_index that will return Some<u64> when there is a valid starting position or None when there is not. A valid starting position must be a positive number, so I decided to return a u64. Additionally, the functions where I will use the returned index also require this type:

fn get_start_index(take_val: &TakeValue, total: i64) -> Option<u64> {
    unimplemented!();
}

Following is a unit test you can add to the tests module that might help you see all the possibilities more clearly. For instance, the function returns None when the given file is empty or when trying to read from a line beyond the end of the file. Be sure to add get_start_index to the list of super imports:

#[test]
fn test_get_start_index() {
    // +0 from an empty file (0 lines/bytes) returns None
    assert_eq!(get_start_index(&PlusZero, 0), None);

    // +0 from a nonempty file returns an index that
    // is one less than the number of lines/bytes
    assert_eq!(get_start_index(&PlusZero, 1), Some(0));

    // Taking 0 lines/bytes returns None
    assert_eq!(get_start_index(&TakeNum(0), 1), None);

    // Taking any lines/bytes from an empty file returns None
    assert_eq!(get_start_index(&TakeNum(1), 0), None);

    // Taking more lines/bytes than is available returns None
    assert_eq!(get_start_index(&TakeNum(2), 1), None);

    // When starting line/byte is less than total lines/bytes,
    // return one less than starting number
    assert_eq!(get_start_index(&TakeNum(1), 10), Some(0));
    assert_eq!(get_start_index(&TakeNum(2), 10), Some(1));
    assert_eq!(get_start_index(&TakeNum(3), 10), Some(2));

    // When starting line/byte is negative and less than total,
    // return total - start
    assert_eq!(get_start_index(&TakeNum(-1), 10), Some(9));
    assert_eq!(get_start_index(&TakeNum(-2), 10), Some(8));
    assert_eq!(get_start_index(&TakeNum(-3), 10), Some(7));

    // When starting line/byte is negative and more than total,
    // return 0 to print the whole file
    assert_eq!(get_start_index(&TakeNum(-20), 10), Some(0));
}

Once you figure out the line index to start printing, use this information in the print_lines function to iterate the lines of the input file and print all the lines after the starting index, if there is one.

Solution

I’ll walk you through how I arrived at a solution. My solution will incorporate several dependencies, so this is how my src/lib.rs starts:

use crate::TakeValue::*;
use clap::{App, Arg};
use once_cell::sync::OnceCell;
use regex::Regex;
use std::{
    error::Error,
    fs::File,
    io::{BufRead, BufReader, Read, Seek, SeekFrom},
};

type MyResult<T> = Result<T, Box<dyn Error>>;

static NUM_RE: OnceCell<Regex> = OnceCell::new();

I suggested writing several intermediate functions in the first part of this chapter, so next I’ll show my versions that pass the unit tests I provided.

Counting All the Lines and Bytes in a File

I will start by showing my count_lines_bytes function to count all the lines and bytes in a file. In previous programs, I have used BufRead::read_line, which writes into a String. In the following function, I use BufRead::read_until to read raw bytes to avoid the cost of creating strings, which I don’t need:

fn count_lines_bytes(filename: &str) -> MyResult<(i64, i64)> {
    let mut file = BufReader::new(File::open(filename)?); 1
    let mut num_lines = 0; 2
    let mut num_bytes = 0;
    let mut buf = Vec::new();
    loop {
        let bytes_read = file.read_until(b'\n', &mut buf)?; 3
        if bytes_read == 0 { 4
            break;
        }
        num_lines += 1; 5
        num_bytes += bytes_read as i64; 6
        buf.clear(); 7
    }
    Ok((num_lines, num_bytes)) 8
}
1

Create a mutable filehandle to read the given filename.

2

Initialize counters for the number of lines and bytes as well as a buffer for reading the lines.

3

Use BufRead::read_until to read bytes until a newline byte. This function returns the number of bytes that were read from the filehandle.

4

When no bytes were read, exit the loop.

5

Increment the line count.

6

Increment the byte count. Note that BufRead::read_until returns a usize that must be cast to i64 to add the value to the num_bytes tally.

7

Clear the buffer before reading the next line.

8

Return a tuple containing the number of lines and bytes in the file.

Finding the Start Index

To find the starting line or byte position, my program relies on a get_start_index function that uses the desired location and the total number of lines and bytes in the file:

fn get_start_index(take_val: &TakeValue, total: i64) -> Option<u64> {
    match take_val {
        PlusZero => {
            if total > 0 { 1
                Some(0)
            } else {
                None
            }
        }
        TakeNum(num) => {
            if num == &0 || total == 0 || num > &total { 2
                None
            } else {
                let start = if num < &0 { total + num } else { num - 1 }; 3
                Some(if start < 0 { 0 } else { start as u64 }) 4
            }
        }
    }
}
1

When the user wants to start at index 0, return 0 if the file is not empty; otherwise, return None.

2

Return None if the user wants to select nothing, the file is empty, or the user wants to select more data than is available in the file.

3

If the desired number of lines or bytes is negative, add it to the total; otherwise, subtract one from the desired number to get the zero-based offset.

4

If the starting index is less than 0, return 0; otherwise, return the starting index as a u64.

Printing the Lines

The following is my print_lines function, much of which is similar to count_lines_bytes:

fn print_lines(
    mut file: impl BufRead,
    num_lines: &TakeValue,
    total_lines: i64,
) -> MyResult<()> {
    if let Some(start) = get_start_index(num_lines, total_lines) { 1
        let mut line_num = 0; 2
        let mut buf = Vec::new();
        loop {
            let bytes_read = file.read_until(b'\n', &mut buf)?;
            if bytes_read == 0 {
                break;
            }
            if line_num >= start { 3
                print!("{}", String::from_utf8_lossy(&buf)); 4
            }
            line_num += 1;
            buf.clear();
        }
    }

    Ok(())
}
1

Check if there is a valid starting position when trying to read the given number of lines from the total number of lines available.

2

Initialize variables for counting and reading lines from the file.

3

Check if the given line is at or beyond the starting point.

4

If so, convert the bytes to a string and print.

Here is how I can integrate this into my run function:

pub fn run(config: Config) -> MyResult<()> {
    for filename in config.files {
        match File::open(&filename) {
            Err(err) => eprintln!("{}: {}", filename, err),
            Ok(file) => {
                let (total_lines, _total_bytes) =
                    count_lines_bytes(filename)?; 1
                let file = BufReader::new(file); 2
                print_lines(file, &config.lines, total_lines)?; 3
            }
        }
    }
    Ok(())
}
1

Count the total lines and bytes in the current file.

2

Create a BufReader with the opened filehandle.

3

Print the requested number of lines.

A quick check shows this will select, for instance, the last three lines:

$ cargo run -- -n 3 tests/inputs/ten.txt
eight
nine
ten

I can get the same output by staring on the eighth line:

$ cargo run -- -n +8 tests/inputs/ten.txt
eight
nine
ten

If I run cargo test at this point, I pass more than two-thirds of the tests.

Printing the Bytes

Next, I will show my print_bytes function:

fn print_bytes<T: Read + Seek>(
    mut file: T,
    num_bytes: &TakeValue,
    total_bytes: i64,
) -> MyResult<()> {
    if let Some(start) = get_start_index(num_bytes, total_bytes) { 1
        file.seek(SeekFrom::Start(start))?; 2
        let mut buffer = Vec::new(); 3
        file.read_to_end(&mut buffer)?; 4
        if !buffer.is_empty() {
            print!("{}", String::from_utf8_lossy(&buffer)); 5
        }
    }

    Ok(())
}
1

See if there is a valid starting byte position.

2

Use Seek::seek to move to the desired byte position as defined by SeekFrom::Start.

3

Create a mutable buffer for reading the bytes.

4

Read from the byte position to the end of the file and place the results into the buffer.

5

If the buffer is not empty, convert the selected bytes to a String and print.

Here’s how I integrated this into the run function:

pub fn run(config: Config) -> MyResult<()> {
    for filename in config.files {
        match File::open(&filename) {
            Err(err) => eprintln!("{}: {}", filename, err),
            Ok(file) => {
                let (total_lines, total_bytes) =
                    count_lines_bytes(filename)?;
                let file = BufReader::new(file);
                if let Some(num_bytes) = &config.bytes { 1
                    print_bytes(file, num_bytes, total_bytes)?; 2
                } else {
                    print_lines(file, &config.lines, total_lines)?; 3
                }
            }
        }
    }
    Ok(())
}
1

Check if the user has requested a byte selection.

2

If so, print the selected bytes.

3

Otherwise, print the line selection.

A quick check with cargo test shows I’m inching ever closer to passing all my tests. All the failing tests start with multiple, and they are failing because my program is not printing the headers separating the output from each file. I’ll modify the code from Chapter 4 for this. Here is my final run function that will pass all the tests:

pub fn run(config: Config) -> MyResult<()> {
    let num_files = config.files.len(); 1
    for (file_num, filename) in config.files.iter().enumerate() { 2
        match File::open(&filename) {
            Err(err) => eprintln!("{}: {}", filename, err),
            Ok(file) => {
                if !config.quiet && num_files > 1 { 3
                    println!(
                        "{}==> {} <==",
                        if file_num > 0 { "\n" } else { "" },
                        filename
                    );
                }

                let (total_lines, total_bytes) =
                    count_lines_bytes(filename)?;
                let file = BufReader::new(file);
                if let Some(num_bytes) = &config.bytes {
                    print_bytes(file, num_bytes, total_bytes)?;
                } else {
                    print_lines(file, &config.lines, total_lines)?;
                }
            }
        }
    }
    Ok(())
}
1

Find the number of files.

2

Use Iterator::enumerate to iterate through the index positions and filenames.

3

If the config quiet option is false and there are multiple files, print the header.

Benchmarking the Solution

How does the tailr program compare to tail for the subset of features it shares? I suggested earlier that you could use the biggie program to create large input files to test your program. I created a file called 1M.txt that has one million lines of randomly generated text to use in testing my program. I can use the time command to see how long it takes for tail to find the last 10 lines of the 1M.txt file:2

$ time tail 1M.txt > /dev/null 1

real    0m0.022s 2
user    0m0.006s 3
sys     0m0.015s 4
1

I don’t want to see the output from the command, so I redirect it to /dev/null, a special system device that ignores its input.

2

The real time is wall clock time, measuring how long the process took from start to finish.

3

The user time is how long the CPU spent in user mode outside the kernel.

4

The sys time is how long the CPU spent working inside the kernel.

I want to build the fastest version of tailr possible to compare to tail, so I’ll use cargo build --release to create a release build. The binary will be created at target​/⁠release/tailr. This build of the program appears to be much slower than tail:

$ time target/release/tailr 1M.txt > /dev/null

real    0m0.564s
user    0m0.071s
sys     0m0.030s

This is the start of a process called benchmarking, where I try to compare how well different programs work. Running one iteration and eyeballing the output is not very scientific or effective. Luckily, there is a Rust crate called hyperfine that can do this much better. After installing it with cargo install hyperfine, I can run benchmarks and find that my Rust program is about 10 times slower than the system tail when printing the last 10 lines from the 1M.txt file:

$ hyperfine -i -L prg tail,target/release/tailr '{prg} 1M.txt > /dev/null'
Benchmark #1: tail 1M.txt > /dev/null
  Time (mean ± σ):       9.0 ms ±   0.7 ms    [User: 3.7 ms, System: 4.1 ms]
  Range (min … max):     7.6 ms …  12.6 ms    146 runs

Benchmark #2: target/release/tailr 1M.txt > /dev/null
  Time (mean ± σ):      85.3 ms ±   0.8 ms    [User: 68.1 ms, System: 14.2 ms]
  Range (min … max):    83.4 ms …  86.6 ms    32 runs

Summary
  'tail 1M.txt > /dev/null' ran
    9.46 ± 0.79 times faster than 'target/release/tailr 1M.txt > /dev/null'

If I request the last 100K lines, however, the Rust version is about 80 times faster than tail:

$ hyperfine -i -L prg tail,target/release/tailr '{prg} -n 100000 1M.txt
  > /dev/null'
Benchmark #1: tail -n 100000 1M.txt > /dev/null
  Time (mean ± σ):     10.338 s ±  0.052 s    [User: 5.643 s, System: 4.685 s]
  Range (min … max):   10.245 s … 10.424 s    10 runs

Benchmark #2: target/release/tailr -n 100000 1M.txt > /dev/null
  Time (mean ± σ):     129.1 ms ±   3.8 ms    [User: 98.8 ms, System: 26.6 ms]
  Range (min … max):   127.0 ms … 144.2 ms    19 runs

Summary
  'target/release/tailr -n 100000 1M.txt > /dev/null' ran
   80.07 ± 2.37 times faster than 'tail -n 100000 1M.txt > /dev/null'

If I change the command to {prg} -c 100 1M.txt to print the last 100 bytes, the Rust version is around 9 times slower:

Summary
  'tail -c 100 1M.txt > /dev/null' ran
    8.73 ± 2.49 times faster than 'target/release/tailr -c 100 1M.txt
    > /dev/null'

If I request the last million bytes, however, the Rust version is a little faster:

Summary
  'target/release/tailr -c 1000000 1M.txt > /dev/null' ran
    1.12 ± 0.05 times faster than 'tail -c 1000000 1M.txt > /dev/null'

To improve the performance, the next step would probably be profiling the code to find where Rust is using most of the time and memory. Program optimization is a fascinating and deep topic well beyond the scope of this book.

Summary

Reflect upon your progress in this chapter:

  • You learned how to create a regular expression as a static, global variable using the once_cell crate.

  • You learned how to seek a line or byte position in a filehandle.

  • You saw how to indicate multiple trait bounds like <T: Read + Seek> and also how to write this using where.

  • You learned how to make Cargo build a release binary.

  • You used hyperfine to benchmark programs.

In the next chapter, you will learn how to use and control pseudorandom number generators to make random selections.

1 The range here means all the characters between those two code points. Refer to the output from the ascii program in Chapter 10 to see that the contiguous values from 0 to 9 are all numbers. Contrast this with the values from A to z where various punctuation characters fall in the middle, which is why you will often see the range [A-Za-z] to select ASCII alphabet characters.

2 I used macOS 11.6 running on a MacBook Pro M1 with 8 cores and 8 GB RAM for all the benchmarking tests.