Chapter 17. Strings and Text

The string is a stark data structure and everywhere it is passed there is much duplication of process. It is a perfect vehicle for hiding information.

Alan Perlis, epigram #34

We’ve been using Rust’s main textual types, String, str, and char, throughout the book. In “String Types”, we described the syntax for character and string literals, and showed how strings are represented in memory. In this chapter, we cover text handling in more detail.

In this chapter:

  • Some background on Unicode, and how Rust’s char, str, and String types represent it

  • Methods for creating Strings, and for appending to, searching, and iterating over text

  • Rust’s formatting facilities, and how to make your own types work with them

  • Regular expressions, using the regex crate

  • Unicode normalization, using the unicode-normalization crate

Some Unicode Background

This book is about Rust, not Unicode, which has entire books devoted to it already. But Rust’s character and string types are designed around Unicode. Here are a few bits of Unicode that help explain Rust.

UTF-8

The Rust String and str types represent text using the UTF-8 encoding form. UTF-8 encodes a character as a sequence of one to four bytes (Figure 17-1).

(UTF-8 encodes Unicode code points using one through four bytes.)
Figure 17-1. The UTF-8 encoding

There are two restrictions on well-formed UTF-8 sequences. First, only the shortest encoding for any given code point is considered well-formed; you can’t spend four bytes encoding a code point that would fit in three. This rule ensures that there is exactly one UTF-8 encoding for a given code point. Second, well-formed UTF-8 must not encode numbers from 0xd800 through 0xdfff or beyond 0x10ffff: those are either reserved as UTF-16 surrogates, not characters in their own right, or outside Unicode’s range entirely.

Figure 17-2 shows some examples.

(examples of encoding the characters '*', Greek mu, Japanese 'sabi', and the crab emoji in UTF-8)
Figure 17-2. UTF-8 examples

Note that, even though the crab emoji has an encoding whose leading byte contributes only zeros to the code point, it still needs a four-byte encoding: three-byte UTF-8 encodings can only convey 16-bit code points, and 0x1f980 is 17 bits long.
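
For example, a quick check of that four-byte case:

assert_eq!('🦀'.len_utf8(), 4);
assert_eq!("🦀".as_bytes(), &[0xf0, 0x9f, 0xa6, 0x80]);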

Here’s a quick example of a string containing characters with encodings of varying lengths:

assert_eq!("うどん: udon".as_bytes(),
           &[0xe3, 0x81, 0x86, // う
             0xe3, 0x81, 0xa9, // ど
             0xe3, 0x82, 0x93, // ん
             0x3a, 0x20, 0x75, 0x64, 0x6f, 0x6e // : udon
           ]);

The diagram shows some very helpful properties of UTF-8:

  • Since UTF-8 encodes code points 0 through 0x7f as nothing more than the bytes 0 through 0x7f, a range of bytes holding ASCII text is valid UTF-8. And if a string of UTF-8 includes only characters from ASCII, the reverse is also true: the UTF-8 encoding is valid ASCII.

    The same is not true for Latin-1: for example, Latin-1 encodes 'é' as the byte 0xe9, which UTF-8 would interpret as the first byte of a three-byte encoding.

  • From looking at any byte’s upper bits, you can immediately tell whether it is the start of some character’s UTF-8 encoding, or a byte from the midst of one.

  • An encoding’s first byte alone tells you the encoding’s full length, via its leading bits.

  • Since no encoding is longer than four bytes, UTF-8 processing never requires unbounded loops, which is nice when working with untrusted data.

  • In well-formed UTF-8, you can always tell unambiguously where characters’ encodings begin and end, even if you start from a random point in the midst of the bytes. UTF-8 first bytes and following bytes are always distinct, so one encoding cannot start in the midst of another. The first byte determines the encoding’s total length, so no encoding can be a prefix of another. This has a lot of nice consequences. For example, searching a UTF-8 string for an ASCII delimiter character requires only a simple scan for the delimiter’s byte. It can never appear as any part of a multibyte encoding, so there’s no need to keep track of the UTF-8 structure at all. Similarly, algorithms that search for one byte string in another will work without modification on UTF-8 strings, even though some don’t even examine every byte of the text being searched.
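
    For instance, searching for an ASCII ':' in text containing multibyte characters is just a byte scan:

    assert_eq!("うどん: udon".find(':'), Some(9)); // the ':' follows three 3-byte characters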

Although variable-width encodings are more complicated than fixed-width encodings, these characteristics make UTF-8 more comfortable to work with than you might expect. The standard library handles most aspects for you.

Characters (char)

A Rust char is a 32-bit value holding a Unicode code point. A char is guaranteed to fall in the range from 0 to 0xd7ff, or in the range 0xe000 to 0x10ffff; all the methods for creating and manipulating char values ensure that this is true. The char type implements Copy and Clone, along with all the usual traits for comparison, hashing, and formatting.
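
For instance, std::char::from_u32 enforces those limits, returning None for surrogates and out-of-range values:

assert_eq!(std::char::from_u32(0x1f980), Some('🦀'));
assert_eq!(std::char::from_u32(0xd800), None);   // a surrogate code point
assert_eq!(std::char::from_u32(0x110000), None); // beyond Unicode's range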

In the descriptions that follow, the variable ch is always of type char.

Classifying Characters

The char type has methods for classifying characters into a few common categories. These all draw their definitions from Unicode, as shown in the following table.

ch.is_numeric()
    A numeric character. This includes the Unicode general categories “Number; digit” and “Number; letter”, but not “Number; other”.
    Examples: '4'.is_numeric(), 'ᛮ'.is_numeric(), !'⑧'.is_numeric()

ch.is_alphabetic()
    An alphabetic character: Unicode’s “Alphabetic” derived property.
    Examples: 'q'.is_alphabetic(), '七'.is_alphabetic()

ch.is_alphanumeric()
    Either numeric or alphabetic, as defined above.
    Examples: '9'.is_alphanumeric(), '饂'.is_alphanumeric(), !'*'.is_alphanumeric()

ch.is_whitespace()
    A whitespace character: Unicode character property “WSpace=Y”.
    Examples: ' '.is_whitespace(), '\n'.is_whitespace(), '\u{A0}'.is_whitespace()

ch.is_control()
    A control character: Unicode’s “Other, control” general category.
    Examples: '\n'.is_control(), '\u{85}'.is_control()

String and str

Rust’s String and str types are guaranteed to hold only well-formed UTF-8. The library ensures this by restricting the ways you can create String and str values and the operations you can perform on them, such that the values are well-formed when introduced, and remain so as you work with them. All their methods protect this guarantee: no safe operation on them can introduce ill-formed UTF-8. This simplifies code that works with the text.

Rust places text-handling methods on either str or String depending on whether the method needs a resizable buffer or is happy just using the text in place. Since String dereferences to &str, every method defined on str is directly available on String as well. This section presents methods from both types, grouped by rough function.

These methods index text by byte offsets, and measure its length in bytes, rather than characters. In practice, given the nature of Unicode, indexing by character is not as useful as it may seem, and byte offsets are faster and simpler. If you try to use a byte offset that lands in the midst of some character’s UTF-8 encoding, the method panics, so you can’t introduce ill-formed UTF-8 this way.
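
For example, byte offsets that land on character boundaries work fine; offsets that don't would panic rather than produce ill-formed UTF-8:

let text = "うどん";                // each character is three bytes in UTF-8
assert_eq!(&text[0..3], "う");      // a boundary offset: fine
assert!(!text.is_char_boundary(1)); // offset 1 falls inside the first character
// &text[0..1] would panic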

A String is implemented as a wrapper around a Vec<u8> that ensures the vector’s contents are always well-formed UTF-8. Rust will never change String to use a more complicated representation, so you can assume that String shares Vec’s performance characteristics.

In these explanations, the following variables have the given types:

Variable   Presumed type
string     String
slice      &str or something that dereferences to one, like String or Rc<String>
ch         char
n          usize, a length
i, j       usize, a byte offset
range      A range of usize byte offsets, either fully bounded like i..j, or partly bounded like i.., ..j, or ..
pattern    Any pattern type: char, String, &str, &[char], or FnMut(char) -> bool

We describe pattern types in “Patterns for Searching Text”.

Creating String Values

There are a few common ways to create String values:

  • String::new() returns a fresh, empty string. This has no heap-allocated buffer, but will allocate one as needed.

  • String::with_capacity(n) returns a fresh, empty string with a buffer pre-allocated to hold at least n bytes. If you know the length of the string you’re building in advance, this constructor lets you get the buffer sized correctly from the start, instead of resizing the buffer as you build the string. The string will still grow its buffer as needed if its length exceeds n bytes. Like vectors, strings have capacity, reserve, and shrink_to_fit methods, but usually the default allocation logic is fine.

  • slice.to_string() allocates a fresh String whose contents are a copy of slice. We’ve been using expressions like "literal text".to_string() throughout the book to make Strings from string literals.

  • iter.collect() constructs a string by concatenating an iterator’s items, which can be char, &str, or String values. For example, to remove all spaces from a string, you can write:

    let spacey = "man hat tan";
    let spaceless: String =
        spacey.chars().filter(|c| !c.is_whitespace()).collect();
    assert_eq!(spaceless, "manhattan");

    Using collect this way takes advantage of String’s implementation of the std::iter::FromIterator trait.

  • The &str type cannot implement Clone: the trait requires clone on a &T to return a T value, but str is unsized. However, &str does implement ToOwned, which lets the implementer specify its owned equivalent, so slice.to_owned() returns a copy of slice as a freshly allocated String.
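
For example, a quick sketch combining String::with_capacity and to_owned from the list above:

let mut greeting = String::with_capacity(16); // room for at least 16 bytes
greeting.push_str("hello, ");
greeting.push_str("world");
assert_eq!(greeting, "hello, world");
assert!(greeting.capacity() >= 16);           // no reallocation was needed

let slice = "borrowed";
let owned: String = slice.to_owned();         // a freshly allocated copy
assert_eq!(owned, slice);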

Appending and Inserting Text

The following methods add text to a String:

  • string.push(ch) appends the character ch to the end of string.

  • string.push_str(slice) appends the full contents of slice.

  • string.extend(iter) appends the items produced by the iterator iter to the string. The iterator can produce char, &str, or String values. These are String’s implementations of std::iter::Extend:

    let mut also_spaceless = "con".to_string();
    also_spaceless.extend("tri but ion".split_whitespace());
    assert_eq!(also_spaceless, "contribution");
  • string.insert(i, ch) inserts the single character ch at byte offset i in string. This entails shifting over any characters after i to make room for ch, so building up a string this way can require time quadratic in the length of the string.

  • string.insert_str(i, slice) does the same for slice, with the same performance caveat.
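
    For example, a small sketch of insert and insert_str (remember that the offsets are byte offsets):

    let mut label = "number".to_string();
    label.insert(0, '#');            // "#number"
    label.insert_str(7, " nine");    // byte offset 7 is the end of "#number"
    assert_eq!(label, "#number nine");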

String implements std::fmt::Write, meaning that the write! and writeln! macros can append formatted text to Strings:

use std::fmt::Write;

let mut letter = String::new();
writeln!(letter, "Whose {} these are I think I know", "rutabagas")?;
writeln!(letter, "His house is in the village though;")?;
assert_eq!(letter, "Whose rutabagas these are I think I know\n\
                    His house is in the village though;\n");

Since write! and writeln! are designed for writing to output streams, they return a Result, and Rust complains if you ignore it. This code uses the ? operator to handle it, but writing to a String is actually infallible, so in this case calling .unwrap() would be OK too.

Since String implements Add<&str> and AddAssign<&str>, you can write code like this:

let left = "partners".to_string();
let mut right = "crime".to_string();
assert_eq!(left + " in " + &right, "partners in crime");

right += " doesn't pay";
assert_eq!(right, "crime doesn't pay");

When applied to strings, the + operator takes its left operand by value, so it can actually reuse that String as the result of the addition. As a consequence, if the left operand’s buffer is large enough to hold the result, no allocation is needed.

In an unfortunate lack of symmetry, the left operand of + cannot be a &str, so you cannot write:

let parenthetical = "(" + string + ")";

You must instead write:

let parenthetical = "(".to_string() + string + ")";

However, this restriction does discourage building up strings from the end backward. This approach performs poorly because the text must be repeatedly shifted toward the end of the buffer.

Building strings from beginning to end by appending small pieces, however, is efficient. A String behaves the way a vector does, always at least doubling its buffer’s size when it needs more capacity. As explained in “Building Vectors Element by Element”, this keeps recopying overhead proportional to the final size. Even so, using String::with_capacity to create strings with the right buffer size to begin with avoids resizing at all, and can reduce the number of calls to the heap allocator.

Conventions for Searching and Iterating

Rust’s standard library functions for searching text and iterating over text follow some naming conventions to make them easier to remember:

  • Most operations process text from start to end, but operations with names starting with r work from end to start. For example, rsplit is the reverse-order version of split. Changing the direction can affect not only the order in which values are produced, but sometimes also the values themselves.

  • Iterators with names ending in n limit themselves to a given number of matches.

  • Iterators with names ending in _indices produce, along with the usual iteration values, the byte offsets in the slice at which they appear.

The standard library doesn’t provide all combinations for every operation. For example, many operations don’t need an n variant, as it’s easy enough to simply end the iteration early.

Patterns for Searching Text

When a standard library function needs to search, match, split, or trim text, it accepts several different types to represent what to look for:

let haystack = "One fine day, in the middle of the night";

assert_eq!(haystack.find(','), Some(12));
assert_eq!(haystack.find("night"), Some(35));
assert_eq!(haystack.find(char::is_whitespace), Some(3));

These types are called patterns, and most operations support them:

assert_eq!("## Elephants"
           .trim_left_matches(|ch: char| ch == '#' || ch.is_whitespace()),
           "Elephants");

The standard library supports four main kinds of patterns:

  • A char as a pattern matches that character.

  • A String or &str or &&str as a pattern matches a substring equal to the pattern.

  • A FnMut(char) -> bool closure as a pattern matches a single character for which the closure returns true.

  • A &[char] as a pattern (not a &str, but a slice of char values) matches any single character that appears in the list. Note that if you write out the list as an array literal, you may need to use an as expression to get the type right:

    let code = "\t    function noodle() { ";
    assert_eq!(code.trim_left_matches(&[' ', '\t'] as &[char]),
               "function noodle() { ");
    // Shorter equivalent: &[' ', '\t'][..]

    Otherwise, Rust will be confused by the fixed-size array type &[char; 2], which is unfortunately not a pattern type.

In the library’s own code, a pattern is any type that implements the std::str::Pattern trait. The details of Pattern are not yet stable, so you can’t implement it for your own types in stable Rust, but the door is open to permit regular expressions and other sophisticated patterns in the future. Rust does guarantee that the pattern types supported now will continue to work in the future.

Iterating over Text

The standard library provides several ways to iterate over a slice’s text. Figure 17-3 shows examples of some of them.

You can think of the split and match families as being complements of each other: splits are the ranges between matches.

(A sample text, and the items various iterators would produce            when applied to it.)
Figure 17-3. Some ways to iterate over a slice

For some kinds of patterns, working from end to start can change the values produced; for an example, see the splits on the pattern "rr" in the figure. Patterns that always match a single character can’t behave this way. When an iterator would produce the same set of items in either direction (that is, when only the order is affected), the iterator is a DoubleEndedIterator, meaning that you can apply its rev method to iterate in the other order, and draw items from either end:

  • slice.chars() returns an iterator over slice’s characters.

  • slice.char_indices() returns an iterator over slice’s characters and their byte offsets:

    assert_eq!("élan".char_indices().collect::<Vec<_>>(),
               vec![(0, 'é'), // has a two-byte UTF-8 encoding
                    (2, 'l'),
                    (3, 'a'),
                    (4, 'n')]);

    Note that this is not equivalent to .chars().enumerate(), since it supplies each character’s byte offset within the slice, instead of just numbering the characters.

  • slice.bytes() returns an iterator over the individual bytes of slice, exposing the UTF-8 encoding:

    assert_eq!("élan".bytes().collect::<Vec<_>>(),
               vec![195, 169, b'l', b'a', b'n']);
  • slice.lines() returns an iterator over the lines of slice. Lines are terminated by "\n" or "\r\n". Each item produced is a &str borrowing from slice. The items do not include the lines’ terminating characters.

  • slice.split(pattern) returns an iterator over the portions of slice separated by matches of pattern. This produces empty strings between immediately adjacent matches, as well as for matches at the beginning and end of slice.

  • The slice.rsplit(pattern) method is the same, but scans slice from end to start, producing matches in that order.

  • slice.split_terminator(pattern) and slice.rsplit_terminator(pattern) are similar, except that the pattern is treated as a terminator, not a separator: if pattern matches at the right end of slice, the iterators do not produce an empty slice representing the empty string between that match and the end of the slice, as split and rsplit do. For example:

    // The ':' characters are separators here. Note the final "".
    assert_eq!("jimb:1000:Jim Blandy:".split(':').collect::<Vec<_>>(),
               vec!["jimb", "1000", "Jim Blandy", ""]);
    
    // The '\n' characters are terminators here.
    assert_eq!("127.0.0.1  localhost\n\
                127.0.0.1  www.reddit.com\n"
               .split_terminator('\n').collect::<Vec<_>>(),
               vec!["127.0.0.1  localhost",
                    "127.0.0.1  www.reddit.com"]);
                    // Note, no final ""!
  • The slice.splitn(n, pattern) and slice.rsplitn(n, pattern) methods are like split and rsplit, except that they split the string into at most n slices, at the first or last n-1 matches for pattern.

  • slice.split_whitespace() returns an iterator over the whitespace-separated portions of slice. A run of multiple whitespace characters is considered a single separator. Leading and trailing whitespace are ignored. This uses the same definition of whitespace as char::is_whitespace:

    let poem = "This  is  just  to say\n\
                I have eaten\n\
                the plums\n\
                again\n";
    
    assert_eq!(poem.split_whitespace().collect::<Vec<_>>(),
               vec!["This", "is", "just", "to", "say",
                    "I", "have", "eaten", "the", "plums",
                    "again"]);
  • slice.matches(pattern) returns an iterator over the matches for pattern in slice. slice.rmatches(pattern) is the same, but iterates from end to start.

  • slice.match_indices(pattern) and slice.rmatch_indices(pattern) are similar, except that the items produced are (offset, match) pairs, where offset is the byte offset at which the match begins, and match is the matching slice.
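
For example, a quick sketch of splitn and match_indices from the list above:

assert_eq!("jimb:1000:Jim Blandy:/home/jimb".splitn(3, ':').collect::<Vec<_>>(),
           vec!["jimb", "1000", "Jim Blandy:/home/jimb"]);

// Matches never overlap: after the match at offset 0, the search resumes at offset 2.
assert_eq!("aaaaa".match_indices("aa").collect::<Vec<_>>(),
           vec![(0, "aa"), (2, "aa")]);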

Converting Other Types to Strings

There are three main ways to convert nontextual values to strings:

  • Types that have a natural human-readable form can implement the std::fmt::Display trait, which lets you use the {} format parameter in the format! and println! macros.

  • Any type that implements Display also gets the std::string::ToString trait automatically, so you can call value.to_string() to get a String directly.

  • Types meant mainly for programmers’ eyes can implement std::fmt::Debug, which the {:?} format parameter uses. The easiest way is to add #[derive(Debug)] to the type’s definition.

The Display and Debug formatting traits are just two among several that the format! macro and its relatives use to format values as text. We’ll cover the others, and explain how to implement them all, in “Formatting Values”.
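
For example, a quick check of the to_string and {:?} routes:

assert_eq!(1729.to_string(), "1729");                     // via Display and ToString
assert_eq!(format!("{:?}", "tricky\n"), "\"tricky\\n\""); // via Debug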

Borrowing as Other Text-Like Types

You can borrow a slice’s contents in several different ways:

  • Slices and Strings implement AsRef<str>, AsRef<[u8]>, AsRef<Path>, and AsRef<OsStr>, so you can pass them directly to functions whose parameters are bounded by those traits.

  • They also implement std::borrow::Borrow<str>, which is what lets collections like HashMap<String, V> look up entries given only a &str key.
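
For instance, because str implements AsRef<Path> and AsRef<[u8]>, you can view a slice as either type; a minimal sketch (the filename here is just an arbitrary example):

use std::path::Path;

let path: &Path = "data/output.txt".as_ref();
assert_eq!(path, Path::new("data/output.txt"));

let bytes: &[u8] = "udon".as_ref();
assert_eq!(bytes, &b"udon"[..]);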

Producing Text from UTF-8 Data

If you have a block of bytes that you believe contains UTF-8 data, you have a few options for converting them into Strings or slices, depending on how you want to handle errors:

  • std::str::from_utf8(byte_slice) checks that the slice is well-formed UTF-8 and returns a Result: an Ok(&str) borrowing the bytes if it is, and an error if it isn’t.

  • String::from_utf8(vec) does the same for an owned Vec<u8>. If the bytes are well-formed, the String simply takes over the vector’s buffer, with no copying; if not, the error value hands the vector back to you.

  • String::from_utf8_lossy(byte_slice) never fails: it returns a Cow<str> that borrows the bytes directly when they are well-formed UTF-8, and otherwise produces a fresh String with each ill-formed sequence replaced by the Unicode replacement character, '\u{fffd}'.
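
For example, a quick sketch of the checked and lossy routes:

let good = vec![0xe3, 0x81, 0x86, 0xe3, 0x81, 0xa9, 0xe3, 0x82, 0x93];
assert_eq!(std::str::from_utf8(&good), Ok("うどん"));

let bad = b"ud\xc0on";                     // 0xc0 can never begin a well-formed encoding
assert!(std::str::from_utf8(bad).is_err());
assert_eq!(String::from_utf8_lossy(bad), "ud\u{fffd}on");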

Putting Off Allocation

Suppose you want your program to greet the user. On Unix, you could write:

fn get_name() -> String {
    std::env::var("USER") // Windows uses "USERNAME"
        .unwrap_or("whoever you are".to_string())
}

println!("Greetings, {}!", get_name());

For Unix users, this greets them by username. For Windows users and the tragically unnamed, it provides alternative stock text.

The std::env::var function returns a String—and has good reasons to do so that we won’t go into here. But that means the alternative stock text must also be returned as a String. This is disappointing: when get_name returns a static string, no allocation should be necessary at all.

The nub of the problem is that sometimes the return value of get_name should be an owned String, sometimes it should be a &'static str, and we can’t know which one it will be until we run the program. This dynamic character is the hint to consider using std::borrow::Cow, the clone-on-write type that can hold either owned or borrowed data.

As explained in “Borrow and ToOwned at Work: The Humble Cow”, Cow<'a, T> is an enum with two variants: Owned and Borrowed. Borrowed holds a reference &'a T, and Owned holds the owning version of &T: String for &str, Vec<i32> for &[i32], and so on. Whether Owned or Borrowed, a Cow<'a, T> can always produce a &T for you to use. In fact, Cow<'a, T> dereferences to &T, behaving as a kind of smart pointer.

Changing get_name to return a Cow results in the following:

use std::borrow::Cow;

fn get_name() -> Cow<'static, str> {
    std::env::var("USER")
        .map(|v| Cow::Owned(v))
        .unwrap_or(Cow::Borrowed("whoever you are"))
}

If this succeeds in reading the "USER" environment variable, the map returns the resulting String as a Cow::Owned. If it fails, the unwrap_or returns its static &str as a Cow::Borrowed. The caller can remain unchanged:

println!("Greetings, {}!", get_name());

As long as T implements the std::fmt::Display trait, displaying a Cow<'a, T> produces the same results as displaying a T.

Cow is also useful when you may or may not need to modify some text you’ve borrowed. When no changes are necessary, you can continue to borrow it. But Cow’s namesake clone-on-write behavior can give you an owned, mutable copy of the value on demand. Cow’s to_mut method makes sure the Cow is Cow::Owned, applying the value’s ToOwned implementation if necessary, and then returns a mutable reference to the value.

So if you find that some of your users, but not all, have titles by which they would prefer to be addressed, you can say:

fn get_title() -> Option<&'static str> { ... }

let mut name = get_name();
if let Some(title) = get_title() {
    name.to_mut().push_str(", ");
    name.to_mut().push_str(title);
}

println!("Greetings, {}!", name);

This might produce output like the following:

$ cargo run
Greetings, jimb, Esq.!
$

What’s nice here is that, if get_name() returns a static string and get_title returns None, the Cow simply carries the static string all the way through to the println!. You’ve managed to put off allocation unless it’s really necessary, while still writing straightforward code.

Since Cow is frequently used for strings, the standard library has some special support for Cow<'a, str>. It provides From and Into conversions from both String and &str, so you can write get_name more tersely:

fn get_name() -> Cow<'static, str> {
    std::env::var("USER")
        .map(|v| v.into())
        .unwrap_or("whoever you are".into())
}

Cow<'a, str> also implements std::ops::Add and std::ops::AddAssign, so to add the title to the name, you could write:

if let Some(title) = get_title() {
    name += ", ";
    name += title;
}

Or, since a String can be a write! macro’s destination:

use std::fmt::Write;

if let Some(title) = get_title() {
    write!(name.to_mut(), ", {}", title).unwrap();
}

As before, no allocation occurs until you try to modify the Cow.

Keep in mind that not every Cow<..., str> must be 'static: you can use Cow to borrow previously computed text until the moment a copy becomes necessary.

Formatting Values

Throughout the book, we’ve been using text formatting macros like println!:

println!("{:.3}µs: relocated {} at {:#x} to {:#x}, {} bytes",
         0.84391, "object",
         140737488346304_usize, 6299664_usize, 64);

That call produces the following output:

0.844µs: relocated object at 0x7fffffffdcc0 to 0x602010, 64 bytes

The string literal serves as a template for the output: each {...} in the template gets replaced by the formatted form of one of the following arguments. The template string must be a constant, so that Rust can check it against the types of the arguments at compile time. Each argument must be used; Rust reports a compile-time error otherwise.

Several standard library features share this little language for formatting strings:

  • The format! macro uses it to build Strings.

  • The println! and print! macros write formatted text to the standard output stream.

  • The writeln! and write! macros write it to a designated output stream.

  • The panic! macro uses it to build a (hopefully informative) expression of terminal dismay.

Rust’s formatting facilities are designed to be open-ended. You can extend these macros to support your own types by implementing the std::fmt module’s formatting traits. And you can use the format_args! macro and the std::fmt::Arguments type to make your own functions and macros support the formatting language.

Formatting macros always borrow shared references to their arguments; they never take ownership of them or mutate them.

The template’s {...} forms are called format parameters, and have the form {which:how}. Both parts are optional; {} is frequently used.

The which value selects which argument following the template should take the parameter’s place. You can select arguments by index or by name. Parameters with no which value are simply paired with arguments from left to right.

The how value says how the argument should be formatted: how much padding, to which precision, in which numeric radix, and so on. If how is present, the colon before it is required.

Here are some examples:

Template string                Argument list                      Result
"number of {}: {}"             "elephants", 19                    "number of elephants: 19"
"from {1} to {0}"              "the grave", "the cradle"          "from the cradle to the grave"
"v = {:?}"                     vec![0,1,2,5,12,29]                "v = [0, 1, 2, 5, 12, 29]"
"name = {:?}"                  "Nemo"                             "name = \"Nemo\""
"{:8.2} km/s"                  11.186                             "   11.19 km/s"
"{:20} {:02x} {:02x}"          "adc #42", 105, 42                 "adc #42              69 2a"
"{1:02x} {2:02x} {0}"          "adc #42", 105, 42                 "69 2a adc #42"
"{lsb:02x} {msb:02x} {insn}"   insn="adc #42", lsb=105, msb=42    "69 2a adc #42"

If you want to include '{' or '}' characters in your output, double the characters in the template:

assert_eq!(format!("{{a, c}} ⊂ {{a, b, c}}"),
           "{a, c} ⊂ {a, b, c}");

Formatting Text Values

When formatting a textual type like &str or String (char is treated like a single-character string), the how value of a parameter has several parts, all optional.

  • A text length limit. Rust truncates your argument if it is longer than this. If you specify no limit, Rust uses the full text.

  • A minimum field width. After any truncation, if your argument is shorter than this, Rust pads it on the right (by default) with spaces (by default) to make a field of this width. If omitted, Rust doesn’t pad your argument.

  • An alignment. If your argument needs to be padded to meet the minimum field width, this says where your text should be placed within the field. <, ^, and > put your text at the start, middle, and end, respectively.

  • A padding character to use in this padding process. If omitted, Rust uses spaces. If you specify the padding character, you must also specify the alignment.

Here are some examples showing how to write things out, and their effects. All are using the same eight-character argument, "bookends":

Features in use                        Template string   Result
Default                                "{}"              "bookends"
Minimum field width                    "{:4}"            "bookends"
                                       "{:12}"           "bookends    "
Text length limit                      "{:.4}"           "book"
                                       "{:.12}"          "bookends"
Field width, length limit              "{:12.20}"        "bookends    "
                                       "{:4.20}"         "bookends"
                                       "{:4.6}"          "booken"
                                       "{:6.4}"          "book  "
Aligned left, width                    "{:<12}"          "bookends    "
Centered, width                        "{:^12}"          "  bookends  "
Aligned right, width                   "{:>12}"          "    bookends"
Pad with '=', centered, width          "{:=^12}"         "==bookends=="
Pad '*', aligned right, width, limit   "{:*>12.4}"       "********book"
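
A few of these, checked directly:

assert_eq!(format!("{:.4}", "bookends"),     "book");
assert_eq!(format!("{:=^12}", "bookends"),   "==bookends==");
assert_eq!(format!("{:*>12.4}", "bookends"), "********book");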

Rust’s formatter has a naïve understanding of width: it assumes each character occupies one column, with no regard for combining characters, half-width katakana, zero-width spaces, or the other messy realities of Unicode. For example:

assert_eq!(format!("{:4}", "th\u{e9}"),   "th\u{e9} ");
assert_eq!(format!("{:4}", "the\u{301}"), "the\u{301}");

Although Unicode says these strings are both equivalent to "thé", Rust’s formatter doesn’t know that characters like '\u{301}', COMBINING ACUTE ACCENT, need special treatment. It pads the first string correctly, but assumes the second is four columns wide and adds no padding. Although it’s easy to see how Rust could improve in this specific case, true multilingual text formatting for all of Unicode’s scripts is a monumental task, best handled by relying on your platform’s user interface toolkits, or perhaps by generating HTML and CSS and making a web browser sort it all out.

Along with &str and String, you can also pass formatting macros smart pointer types with textual referents, like Rc<String> or Cow<'a, str>, without ceremony.

Since filename paths are not necessarily well-formed UTF-8, std::path::Path isn’t quite a textual type; you can’t pass a std::path::Path directly to a formatting macro. However, a Path’s display method returns a value you can format that sorts things out in a platform-appropriate way:

println!("processing file: {}", path.display());

Formatting Numbers

When the formatting argument has a numeric type like usize or f64, the parameter’s how value has the following parts, all optional:

  • A padding and alignment, which work as they do with textual types.

  • A + character, requesting that the number’s sign always be shown, even when the argument is positive.

  • A # character, requesting an explicit radix prefix like 0x or 0b. See the “notation” bullet point that concludes this list.

  • A 0 character, requesting that the minimum field width be satisfied by including leading zeros in the number, instead of the usual padding approach.

  • A minimum field width. If the formatted number is not at least this wide, Rust pads it on the left (by default) with spaces (by default) to make a field of the given width.

  • A precision for floating-point arguments, indicating how many digits Rust should include after the decimal point. Rust rounds or zero-extends as necessary to produce exactly this many fractional digits. If the precision is omitted, Rust tries to accurately represent the value using as few digits as possible. For arguments of integer type, the precision is ignored.

  • A notation. For integer types, this can be b for binary, o for octal, or x or X for hexadecimal with lower- or uppercase letters. If you included the # character, these include an explicit Rust-style radix prefix: 0b, 0o, or 0x. For floating-point types, a notation of e or E requests scientific notation, with a normalized coefficient, using e or E for the exponent. If you don’t specify any notation, Rust formats numbers in decimal.

Some examples of formatting the i32 value 1234:

Features in use                          Template string   Result
Default                                  "{}"              "1234"
Forced sign                              "{:+}"            "+1234"
Minimum field width                      "{:12}"           "        1234"
                                         "{:2}"            "1234"
Sign, width                              "{:+12}"          "       +1234"
Leading zeros, width                     "{:012}"          "000000001234"
Sign, zeros, width                       "{:+012}"         "+00000001234"
Aligned left, width                      "{:<12}"          "1234        "
Centered, width                          "{:^12}"          "    1234    "
Aligned right, width                     "{:>12}"          "        1234"
Aligned left, sign, width                "{:<+12}"         "+1234       "
Centered, sign, width                    "{:^+12}"         "   +1234    "
Aligned right, sign, width               "{:>+12}"         "       +1234"
Padded with '=', centered, width         "{:=^12}"         "====1234===="
Binary notation                          "{:b}"            "10011010010"
Width, octal notation                    "{:12o}"          "        2322"
Sign, width, hexadecimal notation        "{:+12x}"         "        +4d2"
Sign, width, hex with capital digits     "{:+12X}"         "        +4D2"
Sign, explicit radix prefix, width, hex  "{:+#12x}"        "      +0x4d2"
Sign, radix, zeros, width, hex           "{:+#012x}"       "+0x0000004d2"
                                         "{:+#06x}"        "+0x4d2"

As the last two examples show, the minimum field width applies to the entire number, sign, radix prefix, and all.

Negative numbers always include their sign. The results are like those shown in the “forced sign” examples.

When you request leading zeros, alignment and padding characters are simply ignored, since the zeros expand the number to fill the entire field.
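
A few quick checks of these rules:

assert_eq!(format!("{:+#012x}", 1234), "+0x0000004d2");
assert_eq!(format!("{:012}", -1234),   "-00000001234"); // the sign counts toward the width
assert_eq!(format!("{:+}", 1234),      "+1234");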

Using the argument 1234.5678, we can show effects specific to floating-point types:

Features in use                     Template string   Result
Default                             "{}"              "1234.5678"
Precision                           "{:.2}"           "1234.57"
                                    "{:.6}"           "1234.567800"
Minimum field width                 "{:12}"           "   1234.5678"
Minimum, precision                  "{:12.2}"         "     1234.57"
                                    "{:12.6}"         " 1234.567800"
Leading zeros, minimum, precision   "{:012.6}"        "01234.567800"
Scientific                          "{:e}"            "1.2345678e3"
Scientific, precision               "{:.3e}"          "1.235e3"
Scientific, minimum, precision      "{:12.3e}"        "     1.235e3"
                                    "{:12.3E}"        "     1.235E3"

Formatting Values for Debugging

To help with debugging and logging, the {:?} parameter formats any public type in the Rust standard library in a way meant to be helpful to programmers. You can use this to inspect vectors, slices, tuples, hash tables, threads, and hundreds of other types.

For example, you can write the following:

use std::collections::HashMap;
let mut map = HashMap::new();
map.insert("Portland", (45.5237606,-122.6819273));
map.insert("Taipei",   (25.0375167, 121.5637));
println!("{:?}", map);

This prints:

{"Taipei": (25.0375167, 121.5637), "Portland": (45.5237606, -122.6819273)}

The HashMap and (f64, f64) types already know how to format themselves, with no effort required on your part.

If you include the # character in the format parameter, Rust will pretty-print the value. Changing this code to say println!("{:#?}", map) leads to this output:

{
    "Taipei": (
        25.0375167,
        121.5637
    ),
    "Portland": (
        45.5237606,
        -122.6819273
    )
}

These exact forms aren’t guaranteed, and do sometimes change from one Rust release to the next.

As we’ve mentioned, you can use the #[derive(Debug)] syntax to make your own types work with {:?}:

#[derive(Copy, Clone, Debug)]
struct Complex { r: f64, i: f64 }

With this definition in place, we can use a {:?} format to print Complex values:

let third = Complex { r: -0.5, i: f64::sqrt(0.75) };
println!("{:?}", third);

This prints:

Complex { r: -0.5, i: 0.8660254037844386 }

This is fine for debugging, but it might be nice if {} could print them in a more traditional form, like -0.5 + 0.8660254037844386i. In “Formatting Your Own Types”, we’ll show how to do exactly that.

Formatting Your Own Types

The formatting macros use a set of traits defined in the std::fmt module to convert values to text. You can make Rust’s formatting macros format your own types by implementing one or more of these traits yourself.

The notation of a format parameter indicates which trait its argument’s type must implement:

Notation   Example      Trait                 Purpose
none       {}           std::fmt::Display     Text, numbers, errors: the catch-all trait
b          {bits:#b}    std::fmt::Binary      Numbers in binary
o          {:#5o}       std::fmt::Octal       Numbers in octal
x          {:4x}        std::fmt::LowerHex    Numbers in hexadecimal, lowercase digits
X          {:016X}      std::fmt::UpperHex    Numbers in hexadecimal, uppercase digits
e          {:.3e}       std::fmt::LowerExp    Floating-point numbers in scientific notation
E          {:.3E}       std::fmt::UpperExp    Same, with an uppercase E
?          {:#?}        std::fmt::Debug       Debugging view, for developers
p          {:p}         std::fmt::Pointer     Pointer as address, for developers

When you put the #[derive(Debug)] attribute on a type definition so that you can use the {:?} format parameter, you are simply asking Rust to implement the std::fmt::Debug trait for you.

The formatting traits all have the same structure, differing only in their names. We’ll use std::fmt::Display as a representative:

trait Display {
    fn fmt(&self, dest: &mut std::fmt::Formatter)
        -> std::fmt::Result;
}

The fmt method’s job is to produce a properly formatted representation of self and write its characters to dest. In addition to serving as an output stream, the dest argument also carries details parsed from the format parameter, like the alignment and minimum field width.

For example, earlier in this chapter we suggested that it would be nice if Complex values printed themselves in the usual a + bi form. Here’s a Display implementation that does that:

use std::fmt;

impl fmt::Display for Complex {
    fn fmt(&self, dest: &mut fmt::Formatter) -> fmt::Result {
        let i_sign = if self.i < 0.0 { '-' } else { '+' };
        write!(dest, "{} {} {}i", self.r, i_sign, f64::abs(self.i))
    }
}

This takes advantage of the fact that Formatter is itself an output stream, so the write! macro can do most of the work for us. With this implementation in place, we can write the following:

let one_twenty = Complex { r: -0.5, i: 0.866 };
assert_eq!(format!("{}", one_twenty),
           "-0.5 + 0.866i");

let two_forty = Complex { r: -0.5, i: -0.866 };
assert_eq!(format!("{}", two_forty),
           "-0.5 - 0.866i");

It’s sometimes helpful to display complex numbers in polar form: if you imagine a line drawn on the complex plane from the origin to the number, the polar form gives the line’s length, and its counterclockwise angle from the positive x-axis. The # character in a format parameter typically selects some alternate display form; the Display implementation could treat it as a request to use polar form:

impl fmt::Display for Complex {
    fn fmt(&self, dest: &mut fmt::Formatter) -> fmt::Result {
        let (r, i) = (self.r, self.i);
        if dest.alternate() {
            let abs = f64::sqrt(r * r + i * i);
            let angle = f64::atan2(i, r) / std::f64::consts::PI * 180.0;
            write!(dest, "{} ∠ {}°", abs, angle)
        } else {
            let i_sign = if i < 0.0 { '-' } else { '+' };
            write!(dest, "{} {} {}i", r, i_sign, f64::abs(i))
        }
    }
}

Using this implementation:

let ninety = Complex { r: 0.0, i: 2.0 };
assert_eq!(format!("{}", ninety),
           "0 + 2i");
assert_eq!(format!("{:#}", ninety),
           "2 ∠ 90°");

Although the formatting traits’ fmt methods return a fmt::Result value (a typical module-specific Result type), you should propagate failures only from operations on the Formatter, as the fmt::Display implementation does with its calls to write!; your formatting functions must never originate errors themselves. This allows macros like format! to simply return a String instead of a Result<String, ...>, since appending the formatted text to a String never fails. It also ensures that any errors you do get from write! or writeln! reflect real problems from the underlying I/O stream, not formatting issues.

Formatter has plenty of other helpful methods, including some for handling structured data like maps, lists, and so on, which we won’t cover here; consult the online documentation for the full details.
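
For instance, one of those helpers, Formatter::debug_struct, lets you hand-write a Debug implementation that matches the derived output; a minimal sketch, re-declaring Complex locally to keep it self-contained:

use std::fmt;

struct Complex { r: f64, i: f64 }

impl fmt::Debug for Complex {
    fn fmt(&self, dest: &mut fmt::Formatter) -> fmt::Result {
        // Produces the same "Complex { r: ..., i: ... }" form that
        // #[derive(Debug)] would generate.
        dest.debug_struct("Complex")
            .field("r", &self.r)
            .field("i", &self.i)
            .finish()
    }
}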

Using the Formatting Language in Your Own Code

You can write your own functions and macros that accept format templates and arguments by using Rust’s format_args! macro and the std::fmt::Arguments type. For example, suppose your program needs to log status messages as it runs, and you’d like to use Rust’s text formatting language to produce them. The following would be a start:

fn logging_enabled() -> bool {
    ...
}

use std::fs::OpenOptions;
use std::io::Write;

fn write_log_entry(entry: std::fmt::Arguments) {
    if logging_enabled() {
        // Keep things simple for now, and just
        // open the file every time.
        let mut log_file = OpenOptions::new()
            .append(true)
            .create(true)
            .open("log-file-name")
            .expect("failed to open log file");

        log_file.write_fmt(entry)
            .expect("failed to write to log");
    }
}

You can call write_log_entry like so:

write_log_entry(format_args!("Hark! {:?}\n", mysterious_value));

At compile time, the format_args! macro parses the template string and checks it against the arguments’ types, reporting an error if there are any problems. At runtime, it evaluates the arguments and builds an Arguments value carrying all the information necessary to format the text: a pre-parsed form of the template, along with shared references to the argument values.

Constructing an Arguments value is cheap: it’s just gathering up some pointers. No formatting work takes place yet, only the collection of the information needed to do so later. This can be important: if logging is not enabled, any time spent converting numbers to decimal, padding values, and so on would be wasted.

The File type implements the std::io::Write trait, whose write_fmt method takes an Arguments value and does the formatting. It writes the results to the underlying stream.

That call to write_log_entry isn’t pretty. This is where a macro can help:

macro_rules! log { // no ! needed after name in macro definitions
    ($format:tt, $($arg:expr),*) => (
        write_log_entry(format_args!($format, $($arg),*))
    )
}

We cover macros in detail in Chapter 20. For now, take it on faith that this defines a new log! macro that passes its arguments along to format_args!, and then calls your write_log_entry function on the resulting Arguments value. The formatting macros like println!, writeln!, and format! are all roughly the same idea.

You can use log! like so:

log!("O day and night, but this is wondrous strange! {:?}\n",
     mysterious_value);

Hopefully, this looks a little better.

Regular Expressions

The external regex crate is Rust’s official regular expression library. It provides the usual searching and matching functions. It has good support for Unicode, but it can search byte strings as well. Although it doesn’t support some features you’ll often find in other regular expression packages, like backreferences and look-around patterns, those simplifications allow regex to ensure that searches take time linear in the size of the expression and in the length of the text being searched. These guarantees, among others, make regex safe to use even with untrusted expressions searching untrusted text.

In this book, we’ll provide only an overview of regex; you should consult its online documentation for details.

Although the regex crate is not in std, it is maintained by the Rust library team, the same group responsible for std. To use regex, put the following line in the [dependencies] section of your crate’s Cargo.toml file:

regex = "0.2.2"

Then place an extern crate item in your crate’s root:

extern crate regex;

In the following sections, we’ll assume that you have these changes in place.

Basic Regex Use

A Regex value represents a parsed regular expression, ready to use. The Regex::new constructor tries to parse a &str as a regular expression, and returns a Result:

use regex::Regex;

// A semver version number, like 0.2.1.
// May contain a pre-release version suffix, like 0.2.1-alpha.
// (No build metadata suffix, for brevity.)
//
// Note use of r"..." raw string syntax, to avoid backslash blizzard.
let semver = Regex::new(r"(\d+)\.(\d+)\.(\d+)(-[-.[:alnum:]]*)?")?;

// Simple search, with a Boolean result.
let haystack = r#"regex = "0.2.5""#;
assert!(semver.is_match(haystack));

The Regex::captures method searches a string for the first match, and returns a regex::Captures value holding match information for each group in the expression:

// You can retrieve capture groups:
let captures = semver.captures(haystack)
    .ok_or("semver regex should have matched")?;
assert_eq!(&captures[0], "0.2.5");
assert_eq!(&captures[1], "0");
assert_eq!(&captures[2], "2");
assert_eq!(&captures[3], "5");

Indexing a Captures value panics if the requested group didn’t match. To test whether a particular group matched, you can call Captures::get, which returns an Option<regex::Match>. A Match value records a single group’s match:

assert_eq!(captures.get(4), None);
assert_eq!(captures.get(3).unwrap().start(), 13);
assert_eq!(captures.get(3).unwrap().end(), 14);
assert_eq!(captures.get(3).unwrap().as_str(), "5");

You can iterate over all the matches in a string:

let haystack = "In the beginning, there was 1.0.0. \
                For a while, we used 1.0.1-beta, \
                but in the end, we settled on 1.2.4.";

let matches: Vec<&str> = semver.find_iter(haystack)
    .map(|match_| match_.as_str())
    .collect();
assert_eq!(matches, vec!["1.0.0", "1.0.1-beta", "1.2.4"]);

The find_iter iterator produces a Match value for each nonoverlapping match of the expression, working from the start of the string to the end. The captures_iter method is similar, but produces Captures values recording all capture groups. Searching is slower when capture groups must be reported, so if you don’t need them, it’s best to use one of the methods that doesn’t return them.
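
For example, a quick sketch using captures_iter with the semver and haystack values from above, collecting just the major version numbers:

let majors: Vec<&str> = semver.captures_iter(haystack)
    .map(|caps| caps.get(1).unwrap().as_str())
    .collect();
assert_eq!(majors, vec!["1", "1", "1"]);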

Building Regex Values Lazily

The Regex::new constructor can be expensive: constructing a Regex for a 1200-character regular expression can take almost a millisecond on a fast developer machine, and even a trivial expression takes microseconds. It’s best to keep Regex construction out of heavy computational loops; instead, you should construct your Regex once, and then reuse the same one.

The lazy_static crate provides a nice way to construct static values lazily the first time they are used. To start with, note the dependency in your Cargo.toml file:

[dependencies]
lazy_static = "0.2.8"

This crate provides a macro to declare such variables:

#[macro_use]
extern crate lazy_static;

lazy_static! {
    static ref SEMVER: Regex
        = Regex::new(r"(\d+)\.(\d+)\.(\d+)(-[-.[:alnum:]]*)?")
              .expect("error parsing regex");
}

The macro expands to a declaration of a static variable named SEMVER, but its type is not exactly Regex. Instead, it’s a macro-generated type that implements Deref<Target=Regex> and therefore exposes all the same methods as a Regex. The first time SEMVER is dereferenced, the initializer is evaluated, and the value saved for later use. Since SEMVER is a static variable, not just a local variable, the initializer runs at most once per program execution.

With this declaration in place, using SEMVER is straightforward:

use std::io::BufRead;

let stdin = std::io::stdin();
for line in stdin.lock().lines() {
    let line = line?;
    if let Some(match_) = SEMVER.find(&line) {
        println!("{}", match_.as_str());
    }
}

You can put the lazy_static! declaration in a module, or even inside the function that uses the Regex, if that’s the most appropriate scope. The regular expression is still always compiled only once per program execution.

Normalization

Most users would consider the French word for tea, thé, to be three characters long. However, Unicode actually has two ways to represent this text:

  • In the composed form, thé comprises the three characters 't', 'h', and 'é', where 'é' is a single Unicode character with code point 0xe9.

  • In the decomposed form, thé comprises the four characters 't', 'h', 'e', and '\u{301}', where the 'e' is the plain ASCII character, without an accent, and code point 0x301 is the “COMBINING ACUTE ACCENT” character, which adds an acute accent to whatever character it follows.

Unicode does not consider either the composed or the decomposed form of é to be the “correct” one; rather, it considers them both equivalent representations of the same abstract character. Unicode says both forms should be displayed in the same way, and text input methods are permitted to produce either, so users will generally not know which form they are viewing or typing. (Rust lets you use Unicode characters directly in string literals, so you can simply write "thé" if you don’t care which encoding you get. Here we’ll use the \u escapes for clarity.)

However, considered as Rust &str or String values, "th\u{e9}" and "the\u{301}" are completely distinct. They have different lengths, compare as unequal, have different hash values, and order themselves differently with respect to other strings:

assert!("th\u{e9}" != "the\u{301}");
assert!("th\u{e9}" >  "the\u{301}");

// A Hasher is designed to accumulate the hash of a series of values,
// so hashing just one is a bit clunky.
use std::hash::{Hash, Hasher};
use std::collections::hash_map::DefaultHasher;
fn hash<T: ?Sized + Hash>(t: &T) -> u64 {
    let mut s = DefaultHasher::new();
    t.hash(&mut s);
    s.finish()
}

// These values may change in future Rust releases.
assert_eq!(hash("th\u{e9}"),   0x53e2d0734eb1dff3);
assert_eq!(hash("the\u{301}"), 0x90d837f0a0928144);

Clearly, if you intend to compare user-supplied text, or use it as a key in a hash table or B-tree, you will need to put each string in some canonical form first.

Fortunately, Unicode specifies normalized forms for strings. Whenever two strings should be treated as equivalent according to Unicode’s rules, their normalized forms are character-for-character identical. When encoded with UTF-8, they are byte-for-byte identical. This means you can compare normalized strings with ==, use them as keys in a HashMap or HashSet, and so on, and you’ll get Unicode’s notion of equality.

Failure to normalize can even have security consequences. For example, if your website normalizes usernames in some cases but not others, you could end up with two distinct users named bananasflambé, which some parts of your code treat as the same user, but others distinguish, resulting in one’s privileges being extended incorrectly to the other. Of course, there are many ways to avoid this sort of problem, but history shows there are also many ways not to.

Normalization Forms

Unicode defines four normalized forms, each of which is appropriate for different uses. There are two questions to answer:

  • First, do you prefer characters to be as composed as possible or as decomposed as possible?

    For example, the most composed representation of the Vietnamese word Phở is the three-character string "Ph\u{1edf}", where both the tonal mark ̉ and the vowel mark ̛ are applied to the base character “o” in a single Unicode character, '\u{1edf}', which Unicode dutifully names LATIN SMALL LETTER O WITH HORN AND HOOK ABOVE.

    The most decomposed representation splits out the base letter and its two marks into three separate Unicode characters: 'o', '\u{31b}' (COMBINING HORN), and '\u{309}' (COMBINING HOOK ABOVE), resulting in "Pho\u{31b}\u{309}". (Whenever combining marks appear as separate characters, rather than as part of a composed character, all normalized forms specify a fixed order in which they must appear, so normalization is well specified even when characters have multiple accents.)

    The composed form generally has fewer compatibility problems, since it more closely matches the representations most languages used for their text before Unicode became established. It may also work better with naïve string formatting features like Rust’s format! macro. The decomposed form, on the other hand, may be better for displaying text or searching, since it makes the detailed structure of the text more explicit.

  • The second question is: if two character sequences represent the same fundamental text, but differ in the way that text should be formatted, do you want to treat them as equivalent, or keep them distinct?

    Unicode has separate characters for the ordinary digit '5', the superscript digit '⁵' (or '\u{2075}'), and the circled digit '⑤' (or '\u{2464}'), but declares all three to be compatibility equivalent. Similarly, Unicode has a single character for the ligature ffi ('\u{fb03}'), but declares this to be compatibility equivalent to the three-character sequence "ffi".

    Compatibility equivalence makes sense for searches: a search for "difficult", using only ASCII characters, ought to match the string "di\u{fb03}cult", which uses the ffi ligature. Applying compatibility decomposition to the latter string would replace the ligature with the three plain letters "ffi", making the search easier. But normalizing text to a compatibility equivalent form can lose essential information, so it should not be applied carelessly. For example, it would be incorrect in most contexts to store "2⁵" as "25".

The Unicode Normalization Form C and Normalization Form D (NFC and NFD) use the maximally composed and maximally decomposed forms of each character, but do not try to unify compatibility equivalent sequences. The NFKC and NFKD normalization forms are like NFC and NFD, but normalize all compatibility equivalent sequences to some simple representative of their class.

The World Wide Web Consortium’s “Character Model For the World Wide Web” recommends using NFC for all content. The Unicode Identifier and Pattern Syntax annex suggests using NFKC for identifiers in programming languages, and offers principles for adapting the form when necessary.

The unicode-normalization Crate

Rust’s unicode-normalization crate provides a trait that adds methods to &str to put the text in any of the four normalized forms. To use it, add the following line to the [dependencies] section of your Cargo.toml file:

unicode-normalization = "0.1.5"

The top file of your crate needs an extern crate declaration:

extern crate unicode_normalization;

With these declarations in place, a &str has four new methods that return iterators over a particular normalized form of the string:

use unicode_normalization::UnicodeNormalization;

// No matter what representation the lefthand string uses
// (you shouldn't be able to tell just by looking),
// these assertions will hold.
assert_eq!("Phở".nfd().collect::<String>(), "Pho\u{31b}\u{309}");
assert_eq!("Phở".nfc().collect::<String>(), "Ph\u{1edf}");

// The lefthand side here uses the "ffi" ligature character.
assert_eq!("① Di\u{fb03}culty".nfkc().collect::<String>(), "1 Difficulty");

Taking a normalized string and normalizing it again in the same form is guaranteed to return identical text.

Although any substring of a normalized string is itself normalized, the concatenation of two normalized strings is not necessarily normalized: for example, the second string might start with combining characters that should be placed before combining characters at the end of the first string.
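
A quick sketch of that caveat:

use unicode_normalization::UnicodeNormalization;

let a = "e";        // already in NFC
let b = "\u{301}";  // a lone COMBINING ACUTE ACCENT is also in NFC
let concatenated = format!("{}{}", a, b);

// Renormalizing the concatenation composes the pair into a single 'é'.
assert_eq!(concatenated, "e\u{301}");
assert_eq!(concatenated.nfc().collect::<String>(), "\u{e9}");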

As long as a text uses no unassigned code points when it is normalized, Unicode promises that its normalized form will not change in future versions of the standard. This means that normalized forms are generally safe to use in persistent storage, even as the Unicode standard evolves.