Chapter 23. Foreign Functions

“Cyberspace. Unthinkable complexity. Lines of light ranged in the non-space of the mind, clusters and constellations of data. Like city lights, receding . . .”

William Gibson, Neuromancer

Tragically, not every program in the world is written in Rust. There are many critical libraries and interfaces implemented in other languages that we would like to be able to use in our Rust programs. Rust’s foreign function interface (FFI) lets Rust code call functions written in C, and in some cases C++. Since most operating systems offer C interfaces, Rust’s foreign function interface allows immediate access to all sorts of low-level facilities.

In this chapter, we’ll write a program that links with libgit2, a C library for working with the Git version control system. First, we’ll show what it’s like to use C functions directly from Rust, using the unsafe features demonstrated in the previous chapter. Then, we’ll show how to construct a safe interface to libgit2, taking inspiration from the open source git2-rs crate, which does exactly that.

We’ll assume that you’re familiar with C and the mechanics of compiling and linking C programs. Working with C++ is similar. We’ll also assume that you’re somewhat familiar with the Git version control system.

There do exist Rust crates for communicating with many other languages, including Python, JavaScript, Lua, and Java. We don’t have room to cover them here, but ultimately, all these interfaces are built using the C foreign function interface, so this chapter should give you a head start no matter which language you need to work with.

Finding Common Data Representations

The common denominator of Rust and C is machine language, so in order to anticipate what Rust values look like to C code, or vice versa, you need to consider their machine-level representations. Throughout the book, we’ve made a point of showing how values are actually represented in memory, so you’ve probably noticed that the data worlds of C and Rust have a lot in common: a Rust usize and a C size_t are identical, for example, and structs are fundamentally the same idea in both languages. To establish a correspondence between Rust and C types, we’ll start with primitives and then work our way up to more complicated types.

Given its primary use as a systems programming language, C has always been surprisingly loose about its types’ representations: an int is typically 32 bits long, but could be longer, or as short as 16 bits; a C char may be signed or unsigned; and so on. To cope with this variability, Rust’s std::os::raw module defines a set of Rust types that are guaranteed to have the same representation as certain C types (Table 23-1). These cover the primitive integer and character types.

Table 23-1. std::os::raw types in Rust
C type                    Corresponding std::os::raw type
short                     c_short
int                       c_int
long                      c_long
long long                 c_longlong
unsigned short            c_ushort
unsigned, unsigned int    c_uint
unsigned long             c_ulong
unsigned long long        c_ulonglong
char                      c_char
signed char               c_schar
unsigned char             c_uchar
float                     c_float
double                    c_double
void *, const void *      *mut c_void, *const c_void
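To see how these aliases resolve on a particular machine, a small program can print their sizes. This is only a sketch: the helper name c_type_sizes is ours, not part of any library, and the numbers it reports depend on the target platform.

```rust
use std::mem::size_of;
use std::os::raw::{c_char, c_int, c_long, c_longlong, c_short};

/// Returns (C type name, size in bytes) for a few std::os::raw aliases.
/// `c_type_sizes` is a hypothetical helper written for this illustration.
fn c_type_sizes() -> Vec<(&'static str, usize)> {
    vec![
        ("char", size_of::<c_char>()),
        ("short", size_of::<c_short>()),
        ("int", size_of::<c_int>()),
        ("long", size_of::<c_long>()),
        ("long long", size_of::<c_longlong>()),
    ]
}

fn main() {
    // The output differs between targets; that variability is the
    // reason the aliases exist.
    for (name, size) in c_type_sizes() {
        println!("{:<10} {} bytes", name, size);
    }
}
```

On a typical 64-bit Linux system, long comes out as 8 bytes, while on 64-bit Windows it is 4. Writing c_long instead of i64 or i32 lets the same declaration work on both.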

Some notes about Table 23-1:

For defining Rust struct types compatible with C structs, you can use the #[repr(C)] attribute. Placing #[repr(C)] above a struct definition asks Rust to lay out the struct’s fields in memory the same way a C compiler would lay out the analogous C struct type. For example, libgit2’s git2/errors.h header file defines the following C struct to provide details about a previously reported error:

typedef struct {
    char *message;
    int klass;
} git_error;

You can define a Rust type with an identical representation as follows:

use std::os::raw::{c_char, c_int};

#[repr(C)]
pub struct git_error {
    pub message: *const c_char,
    pub klass: c_int
}

The #[repr(C)] attribute affects only the layout of the struct itself, not the representations of its individual fields, so to match the C struct, each field must use the C-like type as well: *const c_char for char *, c_int for int, and so on.

In this particular case, the #[repr(C)] attribute probably doesn’t change the layout of git_error. There really aren’t too many interesting ways to lay out a pointer and an integer. But whereas C and C++ guarantee that a structure’s members appear in memory in the order they’re declared, each at a distinct address, Rust reorders fields to minimize the overall size of the struct, and zero-sized types take up no space. The #[repr(C)] attribute tells Rust to follow C’s rules for the given type.
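A struct whose fields need padding makes the difference easier to see. The following sketch uses two hypothetical types, CLayout and RustLayout, with the same fields; it assumes a target where u32 is 4-byte aligned, which holds on the common desktop platforms.

```rust
use std::mem::{align_of, size_of};

// Declared in an order that forces padding under C's layout rules.
#[repr(C)]
struct CLayout {
    a: u8,
    b: u32,
    c: u8,
}

// Same fields, default representation: Rust is free to reorder them.
#[allow(dead_code)]
struct RustLayout {
    a: u8,
    b: u32,
    c: u8,
}

/// Byte offset of each CLayout field from the start of the struct.
fn c_offsets() -> [usize; 3] {
    let v = CLayout { a: 0, b: 0, c: 0 };
    let base = &v as *const CLayout as usize;
    [
        &v.a as *const u8 as usize - base,
        &v.b as *const u32 as usize - base,
        &v.c as *const u8 as usize - base,
    ]
}

fn main() {
    println!("repr(C): size {}, align {}, offsets {:?}",
             size_of::<CLayout>(), align_of::<CLayout>(), c_offsets());
    // The default layout is unspecified, so this number may change
    // between compiler versions.
    println!("default: size {}", size_of::<RustLayout>());
}
```

With #[repr(C)], the fields sit at offsets 0, 4, and 8, and the struct occupies 12 bytes, just as a C compiler would arrange it. The default representation makes no such promise, which is why it must not be used for types shared with C.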

You can also use #[repr(C)] to control the representation of C-style enums:

#[repr(C)]
#[allow(non_camel_case_types)]
enum git_error_code {
    GIT_OK         =  0,
    GIT_ERROR      = -1,
    GIT_ENOTFOUND  = -3,
    GIT_EEXISTS    = -4,
    ...
}

Normally, Rust plays all sorts of games when choosing how to represent enums. For example, we mentioned the trick Rust uses to store Option<&T> in a single word (if T is sized). Without #[repr(C)], Rust would use a single byte to represent the git_error_code enum; with #[repr(C)], Rust uses a value the size of a C int, just as C would.

You can also ask Rust to give an enum the same representation as some integer type. Starting the preceding definition with #[repr(i16)] would give you a 16-bit type with the same representation as the following C++ enum:

#include <stdint.h>

enum git_error_code: int16_t {
    GIT_OK         =  0,
    GIT_ERROR      = -1,
    GIT_ENOTFOUND  = -3,
    GIT_EEXISTS    = -4,
    ...
};
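The effect of the two attributes can be checked with size_of. In this sketch the enum names CodeAsInt and CodeAsI16 and the helper code_from_raw are hypothetical, and only a few of the error codes are reproduced.

```rust
use std::mem::size_of;
use std::os::raw::c_int;

#[repr(C)]
#[allow(dead_code)]
#[derive(Clone, Copy, Debug, PartialEq)]
enum CodeAsInt { Ok = 0, Error = -1, NotFound = -3 }

#[repr(i16)]
#[derive(Clone, Copy, Debug, PartialEq)]
enum CodeAsI16 { Ok = 0, Error = -1, NotFound = -3 }

/// Converts a raw status into the enum, refusing values that match no
/// variant. Storing an unlisted discriminant in a Rust enum is undefined
/// behavior, so values arriving from C should be checked like this
/// rather than transmuted.
fn code_from_raw(raw: i16) -> Option<CodeAsI16> {
    match raw {
        0 => Some(CodeAsI16::Ok),
        -1 => Some(CodeAsI16::Error),
        -3 => Some(CodeAsI16::NotFound),
        _ => None,
    }
}

fn main() {
    println!("repr(C) enum:   {} bytes (c_int is {})",
             size_of::<CodeAsInt>(), size_of::<c_int>());
    println!("repr(i16) enum: {} bytes", size_of::<CodeAsI16>());
    println!("{:?} {:?}", code_from_raw(-3), code_from_raw(7));
}
```

The explicit conversion function reflects a difference between the languages: a C enum variable can hold any integer, whereas a Rust enum must always hold one of its declared variants.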

#[repr(C)] applies to unions as well. Fields of #[repr(C)] unions always start at the first bit of the union’s memory (index 0).

Suppose you have a C struct that uses a union to hold some data and a tag value to indicate which field of the union should be used, similar to a Rust enum.

enum tag {
    FLOAT = 0,
    INT   = 1,
};

union number {
    float f;
    int i;
};

struct tagged_number {
    enum tag t;
    union number n;
};

Rust code can interoperate with this structure by applying #[repr(C)] to the enum, structure, and union types, and using a match expression that selects a union field within the larger struct based on the tag:

#[repr(C)]
enum Tag {
    Float = 0,
    Int = 1
}

#[repr(C)]
union FloatOrInt {
    f: f32,
    i: i32,
}

#[repr(C)]
struct Value {
    tag: Tag,
    union: FloatOrInt
}

fn is_zero(v: Value) -> bool {
    use self::Tag::*;
    unsafe {
        match v {
            Value { tag: Int, union: FloatOrInt { i: 0 } } => true,
            Value { tag: Float, union: FloatOrInt { f: num } } => num == 0.0,
            _ => false
        }
    }
}

Even complex structures can be easily used across the FFI boundary using this kind of technique.
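One way to keep the unsafe part small is to convert the tagged struct into an ordinary Rust enum as soon as it crosses the boundary. The following is a sketch of that idea, not something the C library provides: the Number enum and the to_number function are hypothetical, and the union field is named payload here for readability.

```rust
#[repr(C)]
#[derive(Clone, Copy)]
enum Tag { Float = 0, Int = 1 }

#[repr(C)]
#[derive(Clone, Copy)]
union FloatOrInt { f: f32, i: i32 }

#[repr(C)]
#[derive(Clone, Copy)]
struct Value { tag: Tag, payload: FloatOrInt }

/// A safe Rust view of the same information.
#[derive(Debug, PartialEq)]
enum Number { Float(f32), Int(i32) }

/// Reads whichever union field the tag says is live.
fn to_number(v: Value) -> Number {
    match v.tag {
        // Safety: we trust the tag to name the field that was last
        // written. That is the contract the C side must uphold.
        Tag::Float => Number::Float(unsafe { v.payload.f }),
        Tag::Int => Number::Int(unsafe { v.payload.i }),
    }
}

fn main() {
    let a = Value { tag: Tag::Int, payload: FloatOrInt { i: 42 } };
    let b = Value { tag: Tag::Float, payload: FloatOrInt { f: 1.5 } };
    println!("{:?} {:?}", to_number(a), to_number(b));
}
```

After the conversion, the rest of the program can use match on Number without any unsafe blocks.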

Passing strings between Rust and C is a little harder. C represents a string as a pointer to an array of characters, terminated by a null character. Rust, on the other hand, stores the length of a string explicitly, either as a field of a String or as the second word of a fat reference &str. Rust strings are not null-terminated; in fact, they may include null characters in their contents, like any other character.

This means that you can’t borrow a Rust string as a C string: if you pass C code a pointer into a Rust string, it could mistake an embedded null character for the end of the string or run off the end looking for a terminating null that isn’t there. Going the other direction, you may be able to borrow a C string as a Rust &str, as long as its contents are well-formed UTF-8.

This situation effectively forces Rust to treat C strings as types entirely distinct from String and &str. In the std::ffi module, the CString and CStr types represent owned and borrowed null-terminated arrays of bytes. Compared to String and str, the methods on CString and CStr are quite limited, restricted to construction and conversion to other types. We’ll show these types in action in the next section.
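As a brief preview, the following sketch exercises both directions without calling any C code. The helper names to_c_string and borrow_as_str are ours; the CString and CStr methods they call are from the standard library.

```rust
use std::ffi::{CStr, CString};

/// Copies `s` into a null-terminated buffer, or returns None if `s`
/// contains an interior null byte that C would misread as the end.
fn to_c_string(s: &str) -> Option<CString> {
    CString::new(s).ok()
}

/// Borrows a null-terminated byte slice as a &str, provided it has
/// exactly one null (at the end) and its contents are valid UTF-8.
fn borrow_as_str(bytes_with_nul: &[u8]) -> Option<&str> {
    CStr::from_bytes_with_nul(bytes_with_nul).ok()?.to_str().ok()
}

fn main() {
    let owned = to_c_string("hello").expect("no interior nulls");
    // as_bytes_with_nul exposes the terminator that CString appended.
    println!("{:?}", owned.as_bytes_with_nul());
    println!("{:?}", to_c_string("he\0llo"));
    println!("{:?}", borrow_as_str(b"world\0"));
    // 0xE9 alone is not valid UTF-8, so this cannot be borrowed as &str.
    println!("{:?}", borrow_as_str(b"caf\xe9\0"));
}
```

Going from Rust to C always copies when starting from a &str, because the terminator has to be stored somewhere. Going from C to Rust can borrow, but only after the bytes have been validated.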

Declaring Foreign Functions and Variables

An extern block declares functions or variables defined in some other library that the final Rust executable will be linked with. For example, on most platforms, every Rust program is linked against the standard C library, so we can tell Rust about the C library’s strlen function like this:

use std::os::raw::c_char;

extern {
    fn strlen(s: *const c_char) -> usize;
}

This gives Rust the function’s name and type, while leaving the definition to be linked in later.

Rust assumes that functions declared inside extern blocks use C conventions for passing arguments and accepting return values. They are defined as unsafe functions. These are the right choices for strlen: it is indeed a C function, and its specification in C requires that you pass it a valid pointer to a properly terminated string, which is a contract that Rust cannot enforce. (Almost any function that takes a raw pointer must be unsafe: safe Rust can construct raw pointers from arbitrary integers, and dereferencing such a pointer would be undefined behavior.)

With this extern block, we can call strlen like any other Rust function, although its type gives it away as a tourist:

use std::ffi::CString;

let rust_str = "I'll be back";
let null_terminated = CString::new(rust_str).unwrap();
unsafe {
    assert_eq!(strlen(null_terminated.as_ptr()), 12);
}

The CString::new function builds a null-terminated C string. It first checks its argument for embedded null characters, since those cannot be represented in a C string, and returns an error if it finds any (hence the need to unwrap the result). Otherwise, it adds a null byte to the end and returns a CString owning the resulting characters.

The cost of CString::new depends on what type you pass it. It accepts anything that implements Into<Vec<u8>>. Passing a &str entails an allocation and a copy, as the conversion to Vec<u8> builds a heap-allocated copy of the string for the vector to own. But passing a String by value simply consumes the string and takes over its buffer, so unless appending the null character forces the buffer to be resized, the conversion requires no copying of text or allocation at all.

CString dereferences to CStr, whose as_ptr method returns a *const c_char pointing at the start of the string. This is the type that strlen expects. In the example, strlen runs down the string, finds the null character that CString::new placed there, and returns the length, as a byte count.
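These pieces can be combined into a small safe wrapper. The sketch below assumes, as the text does, that the program is linked against the standard C library; the function name c_strlen is hypothetical.

```rust
use std::ffi::CString;
use std::os::raw::c_char;

// The same declaration as in the text, with the ABI spelled out.
// In the 2024 edition this block must be written `unsafe extern "C"`.
extern "C" {
    fn strlen(s: *const c_char) -> usize;
}

/// Hypothetical safe wrapper around strlen. Returns None when `s`
/// contains an interior null and so cannot become a C string.
fn c_strlen(s: &str) -> Option<usize> {
    // Bind the CString to a variable so its buffer stays alive
    // for the duration of the call.
    let c = CString::new(s).ok()?;
    // Safety: `c` is null-terminated and outlives the call.
    Some(unsafe { strlen(c.as_ptr()) })
}

fn main() {
    println!("{:?}", c_strlen("I'll be back"));
    println!("{:?}", c_strlen("embedded\0null"));
}
```

Note that the CString is bound to a local variable before as_ptr is called. Writing CString::new(s).unwrap().as_ptr() in a single expression would drop the temporary CString at the end of the statement, leaving a dangling pointer that the compiler cannot diagnose as an error because raw pointers carry no lifetime.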

You can also declare global variables in extern blocks. POSIX systems have a global variable named environ that holds the values of the process’s environment variables. In C, it’s declared:

extern char **environ;

In Rust, you would say:

use std::ffi::CStr;
use std::os::raw::c_char;

extern {
    static environ: *mut *mut c_char;
}

To print the environment’s first element, you could write:

unsafe {
    if !environ.is_null() && !(*environ).is_null() {
        let var = CStr::from_ptr(*environ);
        println!("first environment variable: {}",
                 var.to_string_lossy())
    }
}

After making sure environ has a first element, the code calls CStr::from_ptr to build a CStr that borrows it. The to_string_lossy method returns a Cow<str>: if the C string contains well-formed UTF-8, the Cow borrows its content as a &str, not including the terminating null byte. Otherwise, to_string_lossy makes a copy of the text in the heap, replaces the ill-formed UTF-8 sequences with the official Unicode replacement character, , and builds an owning Cow from that. Either way, the result implements Display, so you can print it with the {} format parameter.
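The borrow-or-copy behavior of to_string_lossy can be observed directly. This sketch builds its CStr values from byte literals rather than from the real environment, and the helper lossy_is_borrowed is a name we made up for the illustration.

```rust
use std::borrow::Cow;
use std::ffi::CStr;

/// True if to_string_lossy was able to borrow, meaning the input
/// was already well-formed UTF-8 and nothing had to be copied.
fn lossy_is_borrowed(c: &CStr) -> bool {
    matches!(c.to_string_lossy(), Cow::Borrowed(_))
}

fn main() {
    let good = CStr::from_bytes_with_nul(b"PATH=/usr/bin\0").unwrap();
    // 0xE9 is how Latin-1 encodes an accented e; on its own it is
    // not a valid UTF-8 sequence.
    let bad = CStr::from_bytes_with_nul(b"NAME=caf\xe9\0").unwrap();

    println!("{} (borrowed: {})",
             good.to_string_lossy(), lossy_is_borrowed(good));
    println!("{} (borrowed: {})",
             bad.to_string_lossy(), lossy_is_borrowed(bad));
}
```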

Using Functions from Libraries

To use functions provided by a particular library, you can place a #[link] attribute atop the extern block that names the library Rust should link the executable with. For example, here’s a program that calls libgit2’s initialization and shutdown functions, but does nothing else:

use std::os::raw::c_int;

#[link(name = "git2")]
extern {
    pub fn git_libgit2_init() -> c_int;
    pub fn git_libgit2_shutdown() -> c_int;
}

fn main() {
    unsafe {
        git_libgit2_init();
        git_libgit2_shutdown();
    }
}

The extern block declares the extern functions as before. The #[link(name = "git2")] attribute leaves a note in the crate to the effect that, when Rust creates the final executable or shared library, it should link against the git2 library. Rust uses the system linker to build executables; on Unix, this passes the argument -lgit2 on the linker command line; on Windows, it passes git2.LIB.

#[link] attributes work in library crates, too. When you build a program that depends on other crates, Cargo gathers together the link notes from the entire dependency graph and includes them all in the final link.

In this example, if you would like to follow along on your own machine, you’ll need to build libgit2 for yourself. We used libgit2 version 0.25.1. To compile libgit2, you will need to install the CMake build tool and the Python language; we used CMake version 3.8.0 and Python version 2.7.13.

The full instructions for building libgit2 are available on its website, but they’re simple enough that we’ll show the essentials here. On Linux, assume you’ve already unzipped the library’s source into the directory /home/jimb/libgit2-0.25.1:

$ cd /home/jimb/libgit2-0.25.1
$ mkdir build
$ cd build
$ cmake ..
$ cmake --build .

On Linux, this produces a shared library /home/jimb/libgit2-0.25.1/build/libgit2.so.0.25.1 with the usual nest of symlinks pointing to it, including one named libgit2.so. On macOS, the results are similar, but the library is named libgit2.dylib.

On Windows, things are also straightforward. Assume you’ve unzipped the source into the directory C:\Users\JimB\libgit2-0.25.1. In a Visual Studio command prompt:

> cd C:\Users\JimB\libgit2-0.25.1
> mkdir build
> cd build
> cmake -A x64 ..
> cmake --build .

These are the same commands as used on Linux, except that you must request a 64-bit build when you run CMake the first time to match your Rust compiler. (If you have installed the 32-bit Rust toolchain, then you should omit the -A x64 flag to the first cmake command.) This produces an import library git2.LIB and a dynamic-link library git2.DLL, both in the directory C:\Users\JimB\libgit2-0.25.1\build\Debug. (The remaining instructions are shown for Unix, except where Windows is substantially different.)

Create the Rust program in a separate directory:

$ cd /home/jimb
$ cargo new --bin git-toy
     Created binary (application) `git-toy` package

Take the code shown earlier and put it in src/main.rs. Naturally, if you try to build this, Rust has no idea where to find the libgit2 you built:

$ cd git-toy
$ cargo run
   Compiling git-toy v0.1.0 (/home/jimb/git-toy)
error: linking with `cc` failed: exit code: 1
  |
  = note: /usr/bin/ld: error: cannot find -lgit2
          src/main.rs:11: error: undefined reference to 'git_libgit2_init'
          src/main.rs:12: error: undefined reference to 'git_libgit2_shutdown'
          collect2: error: ld returned 1 exit status


error: aborting due to previous error

error: could not compile `git-toy`.

To learn more, run the command again with --verbose.

You can tell Rust where to search for libraries by writing a build script, Rust code that Cargo compiles and runs at build time. Build scripts can do all sorts of things: generate code dynamically, compile C code to be included in the crate, and so on. In this case, all you need is to add a library search path to the executable’s link command. When Cargo runs the build script, it parses the build script’s output for information of this sort, so the build script simply needs to print the right magic to its standard output.

To create your build script, add a file named build.rs in the same directory as the Cargo.toml file, with the following contents:

fn main() {
    println!(r"cargo:rustc-link-search=native=/home/jimb/libgit2-0.25.1/build");
}

This is the right path for Linux; on Windows, you would change the path following the text native= to C:\Users\JimB\libgit2-0.25.1\build\Debug. (We’re cutting some corners to keep this example simple; in a real application, you should avoid using absolute paths in your build script. We cite documentation that shows how to do it right at the end of this section.)

Now you can almost run the program. On macOS it may work immediately; on a Linux system you will probably see something like the following:

$ cargo run
   Compiling git-toy v0.1.0 (/tmp/rustbook-transcript-tests/git-toy)
    Finished dev [unoptimized + debuginfo] target(s)
     Running `target/debug/git-toy`
target/debug/git-toy: error while loading shared libraries:
libgit2.so.25: cannot open shared object file: No such file or directory

This means that, although Cargo succeeded in linking the executable against the library, the dynamic loader doesn’t know where to find the shared library at run time. Windows reports this failure by popping up a dialog box. On Linux, you must set the LD_LIBRARY_PATH environment variable:

$ export LD_LIBRARY_PATH=/home/jimb/libgit2-0.25.1/build:$LD_LIBRARY_PATH
$ cargo run
    Finished dev [unoptimized + debuginfo] target(s) in 0.0 secs
     Running `target/debug/git-toy`

On macOS, you may need to set DYLD_LIBRARY_PATH instead.

On Windows, you must set the PATH environment variable:

> set PATH=C:\Users\JimB\libgit2-0.25.1\build\Debug;%PATH%
> cargo run
    Finished dev [unoptimized + debuginfo] target(s) in 0.0 secs
     Running `target/debug/git-toy`
>

Naturally, in a deployed application you’d want to avoid having to set environment variables just to find your library’s code. One alternative is to statically link the C library into your crate. This copies the library’s object files into the crate’s .rlib file, alongside the object files and metadata for the crate’s Rust code. The entire collection then participates in the final link.

It is a Cargo convention that a crate that provides access to a C library should be named LIB-sys, where LIB is the name of the C library. A -sys crate should contain nothing but the statically linked library and Rust modules containing extern blocks and type definitions. Higher-level interfaces then belong in crates that depend on the -sys crate. This allows multiple downstream crates to depend on the same -sys crate, assuming there is a single version of the -sys crate that meets everyone’s needs.

For the full details on Cargo’s support for build scripts and linking with system libraries, see the online Cargo documentation. It shows how to avoid absolute paths in build scripts, control compilation flags, use tools like pkg-config, and so on. The git2-rs crate also provides good examples to emulate; its build script handles some complex situations.
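As a small step in that direction, a build script can read the library directory from an environment variable instead of hardcoding it. This is a minimal sketch, not what git2-rs does: the variable name LIBGIT2_LIB_DIR and the helper link_search_directive are our own inventions, while cargo:rustc-link-search and cargo:rerun-if-env-changed are directives that Cargo documents.

```rust
use std::env;

/// Formats the directive Cargo looks for on the build script's
/// standard output.
fn link_search_directive(dir: &str) -> String {
    format!("cargo:rustc-link-search=native={}", dir)
}

fn main() {
    // LIBGIT2_LIB_DIR is a hypothetical variable name chosen for this
    // sketch. Ask Cargo to rerun the script whenever it changes.
    println!("cargo:rerun-if-env-changed=LIBGIT2_LIB_DIR");

    // If the variable is unset, print nothing and let the linker fall
    // back on its default search path.
    if let Ok(dir) = env::var("LIBGIT2_LIB_DIR") {
        println!("{}", link_search_directive(&dir));
    }
}
```

With this build.rs in place, each developer can point the build at their own copy of the library without editing any source files.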

A Raw Interface to libgit2

Figuring out how to use libgit2 properly breaks down into two questions:

  • What does it take to use libgit2 functions in Rust?

  • How can we build a safe Rust interface around them?

We’ll take these questions one at a time. In this section, we’ll write a program that’s essentially a single giant unsafe block filled with nonidiomatic Rust code, reflecting the clash of type systems and conventions that is inherent in mixing languages. We’ll call this the raw interface. The code will be messy, but it will make plain all the steps that must occur for Rust code to use libgit2.

Then, in the next section, we’ll build a safe interface to libgit2 that puts Rust’s types to use enforcing the rules libgit2 imposes on its users. Fortunately, libgit2 is an exceptionally well-designed C library, so the questions that Rust’s safety requirements force us to ask all have pretty good answers, and we can construct an idiomatic Rust interface with no unsafe functions.

The program we’ll write is very simple: it takes a path as a command-line argument, opens the Git repository there, and prints out the head commit. But this is enough to illustrate the key strategies for building safe and idiomatic Rust interfaces.

For the raw interface, the program will end up needing a somewhat larger collection of functions and types from libgit2 than we used before, so it makes sense to move the extern block into its own module. We’ll create a file named raw.rs in git-toy/src whose contents are as follows:

#![allow(non_camel_case_types)]

use std::os::raw::{c_int, c_char, c_uchar};

#[link(name = "git2")]
extern {
    pub fn git_libgit2_init() -> c_int;
    pub fn git_libgit2_shutdown() -> c_int;
    pub fn giterr_last() -> *const git_error;

    pub fn git_repository_open(out: *mut *mut git_repository,
                               path: *const c_char) -> c_int;
    pub fn git_repository_free(repo: *mut git_repository);

    pub fn git_reference_name_to_id(out: *mut git_oid,
                                    repo: *mut git_repository,
                                    reference: *const c_char) -> c_int;

    pub fn git_commit_lookup(out: *mut *mut git_commit,
                             repo: *mut git_repository,
                             id: *const git_oid) -> c_int;

    pub fn git_commit_author(commit: *const git_commit) -> *const git_signature;
    pub fn git_commit_message(commit: *const git_commit) -> *const c_char;
    pub fn git_commit_free(commit: *mut git_commit);
}

#[repr(C)] pub struct git_repository { _private: [u8; 0] }
#[repr(C)] pub struct git_commit { _private: [u8; 0] }

#[repr(C)]
pub struct git_error {
    pub message: *const c_char,
    pub klass: c_int
}

pub const GIT_OID_RAWSZ: usize = 20;

#[repr(C)]
pub struct git_oid {
    pub id: [c_uchar; GIT_OID_RAWSZ]
}

pub type git_time_t = i64;

#[repr(C)]
pub struct git_time {
    pub time: git_time_t,
    pub offset: c_int
}

#[repr(C)]
pub struct git_signature {
    pub name: *const c_char,
    pub email: *const c_char,
    pub when: git_time
}

Each item here is modeled on a declaration from libgit2’s own header files. For example, libgit2-0.25.1/include/git2/repository.h includes this declaration:

extern int git_repository_open(git_repository **out, const char *path);

This function tries to open the Git repository at path. If all goes well, it creates a git_repository object and stores a pointer to it in the location pointed to by out. The equivalent Rust declaration is the following:

pub fn git_repository_open(out: *mut *mut git_repository,
                           path: *const c_char) -> c_int;

The libgit2 public header files define the git_repository type as a typedef for an incomplete struct type:

typedef struct git_repository git_repository;

Since the details of this type are private to the library, the public headers never define struct git_repository, ensuring that the library’s users can never build an instance of this type themselves. One possible analogue to an incomplete struct type in Rust is this:

#[repr(C)] pub struct git_repository { _private: [u8; 0] }

This is a struct type containing an array with no elements. Since the _private field isn’t pub, values of this type cannot be constructed outside this module, which is perfect as the reflection of a C type that only libgit2 should ever construct, and which is manipulated solely through raw pointers.
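The pattern can be tried out without linking any C code. In the sketch below, everything is hypothetical: widget plays the role of the opaque type, and the fake_lib module is a stand-in, written in Rust, for the C library that would really own the object.

```rust
use std::os::raw::c_int;

mod raw {
    // Opaque to callers: zero-sized, and the private field prevents
    // construction outside this module.
    #[repr(C)]
    #[allow(non_camel_case_types, dead_code)]
    pub struct widget { _private: [u8; 0] }
}

// Stand-in for the C library. Only this module knows the real type.
mod fake_lib {
    use super::raw::widget;
    use std::os::raw::c_int;

    struct RealWidget { id: c_int }

    pub extern "C" fn widget_new(id: c_int) -> *mut widget {
        Box::into_raw(Box::new(RealWidget { id })) as *mut widget
    }

    pub unsafe extern "C" fn widget_id(w: *const widget) -> c_int {
        unsafe { (*(w as *const RealWidget)).id }
    }

    pub unsafe extern "C" fn widget_free(w: *mut widget) {
        unsafe { drop(Box::from_raw(w as *mut RealWidget)) }
    }
}

/// Creates a widget, reads its id back, and frees it, touching the
/// object only through raw pointers to the opaque type.
fn round_trip(id: c_int) -> c_int {
    let w = fake_lib::widget_new(id);
    unsafe {
        let got = fake_lib::widget_id(w);
        fake_lib::widget_free(w);
        got
    }
}

fn main() {
    println!("opaque type size: {}", std::mem::size_of::<raw::widget>());
    println!("id read back: {}", round_trip(7));
}
```

The client code never learns the size or fields of the real object; it can only pass the pointer back to the library, which is the same discipline the C header enforces.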

Writing large extern blocks by hand can be a chore. If you are creating a Rust interface to a complex C library, you may want to try using the bindgen crate, which has functions you can use from your build script to parse C header files and generate the corresponding Rust declarations automatically. We don’t have space to show bindgen in action here, but bindgen’s page on crates.io includes links to its documentation.

Next we’ll rewrite main.rs completely. First, we need to declare the raw module:

mod raw;

According to libgit2’s conventions, fallible functions return an integer code that is positive or zero on success, and negative on failure. If an error occurs, the giterr_last function will return a pointer to a git_error structure providing more details about what went wrong. libgit2 owns this structure, so we don’t need to free it ourselves, but it could be overwritten by the next library call we make. A proper Rust interface would use Result, but in the raw version, we want to use the libgit2 functions just as they are, so we’ll have to roll our own function for handling errors:

use std::ffi::CStr;
use std::os::raw::c_int;

fn check(activity: &'static str, status: c_int) -> c_int {
    if status < 0 {
        unsafe {
            let error = &*raw::giterr_last();
            println!("error while {}: {} ({})",
                     activity,
                     CStr::from_ptr(error.message).to_string_lossy(),
                     error.klass);
            std::process::exit(1);
        }
    }

    status
}

We’ll use this function to check the results of libgit2 calls like this:

check("initializing library", raw::git_libgit2_init());

This uses the same CStr methods used earlier: from_ptr to construct the CStr from a C string and to_string_lossy to turn that into something Rust can print.
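The check function dereferences the pointer returned by giterr_last unconditionally. A slightly more defensive variant is sketched here against a hand-built git_error, so that it runs without libgit2. It assumes the library might hand back a null pointer or a null message; the describe function and the sample values are hypothetical.

```rust
use std::ffi::CStr;
use std::os::raw::{c_char, c_int};

#[repr(C)]
#[allow(non_camel_case_types)]
struct git_error {
    message: *const c_char,
    klass: c_int,
}

/// Formats an error record, tolerating a null record or a null message.
/// The caller must ensure that a non-null `err` points to a valid
/// git_error whose message, if non-null, is null-terminated.
unsafe fn describe(err: *const git_error) -> String {
    // as_ref turns a null pointer into None instead of dereferencing it.
    match unsafe { err.as_ref() } {
        None => "unknown error".to_string(),
        Some(e) if e.message.is_null() => format!("error class {}", e.klass),
        Some(e) => {
            let msg = unsafe { CStr::from_ptr(e.message) }.to_string_lossy();
            format!("{} ({})", msg, e.klass)
        }
    }
}

fn main() {
    // An arbitrary sample record; the class number has no significance.
    let sample = git_error {
        message: b"repository not found\0".as_ptr() as *const c_char,
        klass: 6,
    };
    unsafe {
        println!("{}", describe(&sample));
        println!("{}", describe(std::ptr::null()));
    }
}
```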

Next, we need a function to print out a commit:

unsafe fn show_commit(commit: *const raw::git_commit) {
    let author = raw::git_commit_author(commit);

    let name = CStr::from_ptr((*author).name).to_string_lossy();
    let email = CStr::from_ptr((*author).email).to_string_lossy();
    println!("{} <{}>\n", name, email);

    let message = raw::git_commit_message(commit);
    println!("{}", CStr::from_ptr(message).to_string_lossy());
}

Given a pointer to a git_commit, show_commit calls git_commit_author and git_commit_message to retrieve the information it needs. These two functions follow a convention that the libgit2 documentation explains as follows:

If a function returns an object as a return value, that function is a getter and the object’s lifetime is tied to the parent object.

In Rust terms, author and message are borrowed from commit: show_commit doesn’t need to free them itself, but it must not hold on to them after commit is freed. Since this API uses raw pointers, Rust won’t check their lifetimes for us: if we do accidentally create dangling pointers, we probably won’t find out about it until the program crashes.

The preceding code assumes these fields hold UTF-8 text, which is not always correct. Git permits other encodings as well. Interpreting these strings properly would probably entail using the encoding crate. For brevity’s sake, we’ll gloss over those issues here.

Our program’s main function reads as follows:

use std::ffi::CString;
use std::mem;
use std::ptr;
use std::os::raw::c_char;

fn main() {
    let path = std::env::args().nth(1)
        .expect("usage: git-toy PATH");
    let path = CString::new(path)
        .expect("path contains null characters");

    unsafe {
        check("initializing library", raw::git_libgit2_init());

        let mut repo = ptr::null_mut();
        check("opening repository",
              raw::git_repository_open(&mut repo, path.as_ptr()));

        let c_name = b"HEAD\0".as_ptr() as *const c_char;
        let oid = {
            let mut oid = mem::MaybeUninit::uninit();
            check("looking up HEAD",
                  raw::git_reference_name_to_id(oid.as_mut_ptr(), repo, c_name));
            oid.assume_init()
        };

        let mut commit = ptr::null_mut();
        check("looking up commit",
              raw::git_commit_lookup(&mut commit, repo, &oid));

        show_commit(commit);

        raw::git_commit_free(commit);

        raw::git_repository_free(repo);

        check("shutting down library", raw::git_libgit2_shutdown());
    }
}

This starts with code to handle the path argument and initialize the library, all of which we’ve seen before. The first novel code is this:

let mut repo = ptr::null_mut();
check("opening repository",
      raw::git_repository_open(&mut repo, path.as_ptr()));

The call to git_repository_open tries to open the Git repository at the given path. If it succeeds, it allocates a new git_repository object for it and sets repo to point to that. Rust implicitly coerces references into raw pointers, so passing &mut repo here provides the *mut *mut git_repository the call expects.

This shows another libgit2 convention in use (from the libgit2 documentation):

Objects which are returned via the first argument as a pointer-to-pointer are owned by the caller and it is responsible for freeing them.

In Rust terms, functions like git_repository_open pass ownership of the new value to the caller.
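The pointer-to-pointer convention can be illustrated with a pair of stand-in functions written in Rust. Nothing here comes from libgit2: thing_open and thing_free are hypothetical, and they merely imitate the shape of an open/free pair so the example runs without linking any C code.

```rust
use std::os::raw::c_int;
use std::ptr;

// Stand-in for a C function that returns a new object through its
// first argument and reports success or failure as an integer.
unsafe extern "C" fn thing_open(out: *mut *mut c_int, value: c_int) -> c_int {
    if value < 0 {
        return -1; // failure: *out is left untouched
    }
    unsafe { *out = Box::into_raw(Box::new(value)); }
    0
}

// Stand-in for the matching C function that releases the object.
unsafe extern "C" fn thing_free(thing: *mut c_int) {
    unsafe { drop(Box::from_raw(thing)); }
}

/// Opens a thing, reads it, and frees it, as the owner must.
fn open_read_free(value: c_int) -> Option<c_int> {
    let mut thing: *mut c_int = ptr::null_mut();
    unsafe {
        // `&mut thing` coerces to the `*mut *mut c_int` the callee expects.
        if thing_open(&mut thing, value) < 0 {
            return None; // nothing was allocated, so nothing to free
        }
        let got = *thing;
        thing_free(thing); // the caller owns the object, so the caller frees it
        Some(got)
    }
}

fn main() {
    println!("{:?}", open_read_free(5));
    println!("{:?}", open_read_free(-1));
}
```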

Next, consider the code that looks up the object hash of the repository’s current head commit:

let oid = {
    let mut oid = mem::MaybeUninit::uninit();
    check("looking up HEAD",
          raw::git_reference_name_to_id(oid.as_mut_ptr(), repo, c_name));
    oid.assume_init()
};

The git_oid type stores an object identifier—a 160-bit hash code that Git uses internally (and throughout its delightful user interface) to identify commits, individual versions of files, and so on. This call to git_reference_name_to_id looks up the object identifier of the current "HEAD" commit.
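The c_name argument in that call is built from a byte-string literal with an explicit trailing \0, which yields a static C string with no allocation. The following sketch shows the idiom in isolation; head_name is a hypothetical helper, and CStr::from_ptr is used only to read the pointer back and confirm what C would see.

```rust
use std::ffi::CStr;
use std::os::raw::c_char;

/// Returns a pointer to a static, null-terminated C string.
/// The b"..." literal has type &'static [u8; 5], so the pointer
/// remains valid for the whole life of the program.
fn head_name() -> *const c_char {
    b"HEAD\0".as_ptr() as *const c_char
}

fn main() {
    let c_name = head_name();
    // Safety: c_name points at a static literal that ends in a null byte.
    let back = unsafe { CStr::from_ptr(c_name) };
    println!("{:?} has {} bytes before the terminator",
             back, back.to_bytes().len());
}
```

Rust does not add the terminator for you: leaving out the \0 would compile without complaint, and the C function would read past the end of the literal looking for one.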

In C it’s perfectly normal to initialize a variable by passing a pointer to it to some function that fills in its value; this is how git_reference_name_to_id expects to treat its first argument. But Rust won’t let us borrow a reference to an uninitialized variable. We could initialize oid with zeros, but this is a waste: any value stored there will simply be overwritten.

It is possible to ask Rust to give us uninitialized memory, but because reading uninitialized memory at any time is instant undefined behavior, Rust provides an abstraction, MaybeUninit, to ease its use. MaybeUninit<T> tells the compiler to set aside enough memory for your type T, but not to touch it until you say that it’s safe to do so. While this memory is owned by the MaybeUninit, the compiler will also avoid certain optimizations that could otherwise cause undefined behavior even without any explicit access to the uninitialized memory in your code.

MaybeUninit provides a method, as_mut_ptr(), that produces a *mut T pointing to the potentially uninitialized memory it wraps. By passing that pointer to a foreign function that initializes the memory and then calling the unsafe method assume_init on the MaybeUninit to produce a fully initialized T, you can avoid undefined behavior without the additional overhead that comes from initializing and immediately throwing away a value. assume_init is unsafe because calling it on a MaybeUninit without being certain that the memory is actually initialized will immediately cause undefined behavior.

In this case, it is safe because git_reference_name_to_id initializes the memory owned by the MaybeUninit. We could use MaybeUninit for the repo and commit variables as well, but since these are just single words, we just go ahead and initialize them to null:

let mut commit = ptr::null_mut();
check("looking up commit",
      raw::git_commit_lookup(&mut commit, repo, &oid));

This takes the commit’s object identifier and looks up the actual commit, storing a git_commit pointer in commit on success.
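To make the MaybeUninit pattern used for oid concrete, here is a sketch that runs without libgit2. The Oid type, the fill_oid function (a stand-in, written in Rust, for a C function that initializes its output parameter), and make_oid are all hypothetical.

```rust
use std::mem::MaybeUninit;
use std::os::raw::c_int;

#[repr(C)]
#[derive(Debug, PartialEq)]
struct Oid {
    id: [u8; 20],
}

// Stand-in for a C function that fills in *out and returns 0 on
// success or a negative code on failure.
unsafe extern "C" fn fill_oid(out: *mut Oid, byte: u8) -> c_int {
    if byte == 0 {
        return -1; // failure: *out stays uninitialized
    }
    unsafe { out.write(Oid { id: [byte; 20] }); }
    0
}

/// Obtains an Oid from the stand-in function without first
/// initializing the memory it will overwrite.
fn make_oid(byte: u8) -> Option<Oid> {
    let mut oid = MaybeUninit::<Oid>::uninit();
    unsafe {
        if fill_oid(oid.as_mut_ptr(), byte) < 0 {
            // The callee did not initialize the memory, so we must
            // not call assume_init on this path.
            return None;
        }
        Some(oid.assume_init())
    }
}

fn main() {
    println!("{:?}", make_oid(0xab).map(|o| o.id[0]));
    println!("{:?}", make_oid(0));
}
```

In the program above, check exits the process when a call fails, so assume_init is reached only after a successful call. Code that keeps running after a failure needs an explicit early return like the one in make_oid.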

The remainder of the main function should be self-explanatory. It calls the show_commit function defined earlier, frees the commit and repository objects, and shuts down the library.

Now we can try out the program on any Git repository ready at hand:

$ cargo run /home/jimb/rbattle
    Finished dev [unoptimized + debuginfo] target(s) in 0.0 secs
     Running `target/debug/git-toy /home/jimb/rbattle`
Jim Blandy <jimb@red-bean.com>

Animate goop a bit.

A Safe Interface to libgit2

The raw interface to libgit2 is a perfect example of an unsafe feature: it certainly can be used correctly (as we do here, so far as we know), but Rust can’t enforce the rules you must follow. Designing a safe API for a library like this is a matter of identifying all these rules and then finding ways to turn any violation of them into a type or borrow-checking error.

Here, then, are libgit2’s rules for the features the program uses:

  • You must call git_libgit2_init before using any other library function. You must not use any library function after calling git_libgit2_shutdown.

  • All values passed to libgit2 functions must be fully initialized, except for output parameters.

  • When a call fails, output parameters passed to hold the results of the call are left uninitialized, and you must not use their values.

  • A git_commit object refers to the git_repository object it is derived from, so the former must not outlive the latter. (This isn’t spelled out in the libgit2 documentation; we inferred it from the presence of certain functions in the interface and then verified it by reading the source code.)

  • Similarly, a git_signature is always borrowed from a given git_commit, and the former must not outlive the latter. (The documentation does cover this case.)

  • The message associated with a commit and the name and email address of the author are all borrowed from the commit and must not be used after the commit is freed.

  • Once a libgit2 object has been freed, it must never be used again.

As it turns out, you can build a Rust interface to libgit2 that enforces all of these rules, either through Rust’s type system or by managing details internally.

Before we get started, let’s restructure the project a little bit. We’d like to have a git module that exports the safe interface, of which the raw interface from the previous program is a private submodule.

The whole source tree will look like this:

git-toy/
├── Cargo.toml
├── build.rs
└── src/
    ├── main.rs
    └── git/
        ├── mod.rs
        └── raw.rs

Following the rules we explained in “Modules in Separate Files”, the source for the git module appears in git/mod.rs, and the source for its git::raw submodule goes in git/raw.rs.

Once again, we’re going to rewrite main.rs completely. It should start with a declaration of the git module:

mod git;

Then, we’ll need to create the git subdirectory and move raw.rs into it:

$ cd /home/jimb/git-toy
$ mkdir src/git
$ mv src/raw.rs src/git/raw.rs

The git module needs to declare its raw submodule. The file src/git/mod.rs must say:

mod raw;

Since it’s not pub, this submodule is not visible to the main program.
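The same visibility arrangement can be tried out in a single file using inline modules. The following is only a sketch: the function names version and describe are invented for illustration and have nothing to do with libgit2.

```rust
mod git {
    // Private submodule: visible inside `git`, invisible outside it.
    mod raw {
        // `pub` makes this function visible to the parent module `git`,
        // but the module `raw` itself is still private to `git`.
        pub fn version() -> u32 {
            7
        }
    }

    // The public face of the module wraps the private function.
    pub fn describe() -> String {
        format!("raw interface version {}", raw::version())
    }
}

// Code outside `git` can call `git::describe()`, but writing
// `git::raw::version()` out here would be rejected, because the
// module `raw` is private.
```

Calling git::describe() from the crate root works, while any attempt to name git::raw from there is a compile-time error.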

In a bit we’ll need to use some functions from the libc crate, so we must add a dependency in Cargo.toml. The full file now reads:

[package]
name = "git-toy"
version = "0.1.0"
authors = ["You <you@example.com>"]
edition = "2018"

[dependencies]
libc = "0.2"

Now that we’ve restructured our modules, let’s consider error handling. Even libgit2’s initialization function can return an error code, so we’ll need to have this sorted out before we can get started. An idiomatic Rust interface needs its own Error type that captures the libgit2 failure code as well as the error message and class from giterr_last. A proper error type must implement the usual Error, Debug, and Display traits. Then, it needs its own Result type that uses this Error type. Here are the necessary definitions in src/git/mod.rs:

use std::error;
use std::fmt;
use std::result;

#[derive(Debug)]
pub struct Error {
    code: i32,
    message: String,
    class: i32
}

impl fmt::Display for Error {
    fn fmt(&self, f: &mut fmt::Formatter) -> result::Result<(), fmt::Error> {
        // Displaying an `Error` simply displays the message from libgit2.
        // Naming the trait explicitly means `Display` need not be in scope.
        fmt::Display::fmt(&self.message, f)
    }
}

impl error::Error for Error { }

pub type Result<T> = result::Result<T, Error>;

To check the result from raw library calls, the module needs a function that turns a libgit2 return code into a Result:

use std::os::raw::c_int;
use std::ffi::CStr;

fn check(code: c_int) -> Result<c_int> {
    if code >= 0 {
        return Ok(code);
    }

    unsafe {
        let error = raw::giterr_last();

        // libgit2 ensures that (*error).message is always non-null and null
        // terminated, so this call is safe.
        let message = CStr::from_ptr((*error).message)
            .to_string_lossy()
            .into_owned();

        Err(Error {
            code: code as i32,
            message,
            class: (*error).klass as i32
        })
    }
}

The main difference between this and the check function from the raw version is that this constructs an Error value instead of printing an error message and exiting immediately.
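To see how a Result-returning check composes with the ? operator, here is a minimal sketch that needs no C library at all. It assumes a made-up last_error_message function in place of giterr_last, and open_then_count is a hypothetical caller; only the shape of check follows the definition above.

```rust
use std::fmt;
use std::os::raw::c_int;

#[derive(Debug)]
pub struct Error {
    code: i32,
    message: String,
}

impl fmt::Display for Error {
    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
        write!(f, "{}", self.message)
    }
}

impl std::error::Error for Error {}

pub type Result<T> = std::result::Result<T, Error>;

/// Hypothetical stand-in for `giterr_last`: a real binding would ask the
/// library for the message instead of making one up.
fn last_error_message(code: c_int) -> String {
    format!("simulated failure (code {})", code)
}

/// Non-negative codes are successes; negative codes become `Err`.
fn check(code: c_int) -> Result<c_int> {
    if code >= 0 {
        Ok(code)
    } else {
        Err(Error { code: code as i32, message: last_error_message(code) })
    }
}

/// Two fake library calls chained with `?`: the first failure ends the
/// function early and is handed to the caller.
pub fn open_then_count(open_code: c_int, count_code: c_int) -> Result<c_int> {
    check(open_code)?;
    let count = check(count_code)?;
    Ok(count)
}
```

Because the failure is an ordinary value, the caller decides whether to print it, retry, or propagate it further.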

Now we’re ready to tackle libgit2 initialization. The safe interface will provide a Repository type that represents an open Git repository, with methods for resolving references, looking up commits, and so on. Continuing in git/mod.rs, here’s the definition of Repository:

/// A Git repository.
pub struct Repository {
    // This must always be a pointer to a live `git_repository` structure.
    // No other `Repository` may point to it.
    raw: *mut raw::git_repository
}

A Repository’s raw field is not public. Since only code in this module can access the raw::git_repository pointer, getting this module right should ensure the pointer is always used correctly.

If the only way to create a Repository is to successfully open a fresh Git repository, that will ensure that each Repository points to a distinct git_repository object:

use std::path::Path;
use std::ptr;

impl Repository {
    pub fn open<P: AsRef<Path>>(path: P) -> Result<Repository> {
        ensure_initialized();

        let path = path_to_cstring(path.as_ref())?;
        let mut repo = ptr::null_mut();
        unsafe {
            check(raw::git_repository_open(&mut repo, path.as_ptr()))?;
        }
        Ok(Repository { raw: repo })
    }
}

Since the only way to do anything with the safe interface is to start with a Repository value, and Repository::open starts with a call to ensure_initialized, we can be confident that ensure_initialized will be called before any libgit2 functions. Its definition is as follows:

fn ensure_initialized() {
    static ONCE: std::sync::Once = std::sync::Once::new();
    ONCE.call_once(|| {
        unsafe {
            check(raw::git_libgit2_init())
                .expect("initializing libgit2 failed");
            assert_eq!(libc::atexit(shutdown), 0);
        }
    });
}

extern fn shutdown() {
    unsafe {
        if let Err(e) = check(raw::git_libgit2_shutdown()) {
            eprintln!("shutting down libgit2 failed: {}", e);
            std::process::abort();
        }
    }
}

The std::sync::Once type helps run initialization code in a thread-safe way. Only the first thread to call ONCE.call_once runs the given closure. Any subsequent calls, by this thread or any other, block until the first has completed and then return immediately, without running the closure again. Once the closure has finished, calling ONCE.call_once is cheap, requiring nothing more than an atomic load of a flag stored in ONCE.

In the preceding code, the initialization closure calls git_libgit2_init and checks the result. It punts a bit and just uses expect to make sure initialization succeeded, instead of trying to propagate errors back to the caller.
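The run-once behavior of std::sync::Once can be observed in isolation. In this sketch, a counter stands in for the call to the library's initialization function, and init_runs_after_racing is a hypothetical helper that calls ensure_initialized from several threads at once.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Once;
use std::thread;

// Counts how many times the initialization closure actually ran.
static INIT_RUNS: AtomicUsize = AtomicUsize::new(0);

fn ensure_initialized() {
    static ONCE: Once = Once::new();
    ONCE.call_once(|| {
        // Stand-in for a call such as `git_libgit2_init`.
        INIT_RUNS.fetch_add(1, Ordering::SeqCst);
    });
}

/// Call `ensure_initialized` from `threads` threads at once and report
/// how many times the closure has run in total.
pub fn init_runs_after_racing(threads: usize) -> usize {
    let handles: Vec<_> = (0..threads)
        .map(|_| thread::spawn(ensure_initialized))
        .collect();
    for handle in handles {
        handle.join().expect("worker thread panicked");
    }
    INIT_RUNS.load(Ordering::SeqCst)
}
```

However many threads race, and however many times the helper is called, the counter stays at one.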

To make sure the program calls git_libgit2_shutdown, the initialization closure uses the C library’s atexit function, which takes a pointer to a function to invoke before the process exits. Rust closures cannot serve as C function pointers: a closure is a value of some anonymous type carrying the values of whatever variables it captures or references to them; a C function pointer is just a pointer. However, Rust fn types work fine, as long as you declare them extern so that Rust knows to use the C calling conventions. The shutdown function, private to this module, fits the bill and ensures libgit2 gets shut down properly.

In “Unwinding”, we mentioned that it is undefined behavior for a panic to cross language boundaries. The call from atexit to shutdown is such a boundary, so it is essential that shutdown not panic. This is why shutdown can’t simply use .expect to handle errors reported from raw::git_libgit2_shutdown. Instead, it must report the error and terminate the process itself. POSIX forbids calling exit within an atexit handler, so shutdown calls std::process::abort to terminate the program abruptly.
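An atexit handler has no way to report failure, which is why shutdown aborts. When a callback does return a status code, another option is to catch the panic before it reaches the boundary, using std::panic::catch_unwind. The sketch below illustrates that alternative and is not what our module does; risky, guarded_callback, and call_from_c are hypothetical names, and call_from_c merely imitates foreign code that holds a C function pointer.

```rust
use std::os::raw::c_int;
use std::panic;

/// Hypothetical stand-in for work that might panic.
fn risky(input: c_int) -> c_int {
    if input < 0 {
        panic!("negative input");
    }
    input * 2
}

/// A callback with the C calling convention. Any panic raised by `risky`
/// is caught here and reported as -1, so unwinding never reaches the
/// caller on the other side of the boundary.
extern "C" fn guarded_callback(input: c_int) -> c_int {
    match panic::catch_unwind(|| risky(input)) {
        Ok(value) => value,
        Err(_) => -1,
    }
}

/// Stand-in for foreign code that only knows about C function pointers.
pub fn call_from_c(cb: extern "C" fn(c_int) -> c_int, arg: c_int) -> c_int {
    cb(arg)
}
```

Note that catch_unwind only intercepts panics that unwind; in a program built to abort on panic, there is nothing to catch.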

It might be possible to arrange to call git_libgit2_shutdown sooner—say, when the last Repository value is dropped. But no matter how we arrange things, calling git_libgit2_shutdown must be the safe API’s responsibility. The moment it is called, any extant libgit2 objects become unsafe to use, so a safe API must not expose this function directly.

A Repository’s raw pointer must always point to a live git_repository object. This implies that the only way to close a repository is to drop the Repository value that owns it:

impl Drop for Repository {
    fn drop(&mut self) {
        unsafe {
            raw::git_repository_free(self.raw);
        }
    }
}

By calling git_repository_free only when the sole pointer to the raw::git_repository is about to go away, the Repository type also ensures the pointer will never be used after it’s freed.
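The ownership pattern is easy to exercise without libgit2. In the following sketch, raw_open and raw_free are invented stand-ins for a C library's allocation and release functions (implemented here with Box), and a counter records how many objects are alive so the effect of Drop can be checked.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Number of "foreign" objects currently alive.
static LIVE: AtomicUsize = AtomicUsize::new(0);

// Hypothetical stand-ins for a C library's open and free functions.
// `Box::into_raw` and `Box::from_raw` play the roles of malloc and free.
fn raw_open() -> *mut u32 {
    LIVE.fetch_add(1, Ordering::SeqCst);
    Box::into_raw(Box::new(0))
}

/// Safety: `ptr` must come from `raw_open` and must not be freed twice.
unsafe fn raw_free(ptr: *mut u32) {
    unsafe {
        drop(Box::from_raw(ptr));
    }
    LIVE.fetch_sub(1, Ordering::SeqCst);
}

/// Owns exactly one raw object. The field is private, and the type is
/// neither `Clone` nor `Copy`, so no second owner can ever exist.
pub struct Handle {
    raw: *mut u32,
}

impl Handle {
    pub fn open() -> Handle {
        Handle { raw: raw_open() }
    }
}

impl Drop for Handle {
    fn drop(&mut self) {
        // This is the only place the raw object is freed.
        unsafe { raw_free(self.raw) }
    }
}

/// Number of raw objects that have been opened but not yet freed.
pub fn live_objects() -> usize {
    LIVE.load(Ordering::SeqCst)
}
```

Every Handle that goes out of scope frees its object exactly once, so live_objects returns to zero without any explicit cleanup call.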

The Repository::open method uses a private function called path_to_cstring, which has two definitions—one for Unix-like systems and one for Windows:

use std::ffi::CString;

#[cfg(unix)]
fn path_to_cstring(path: &Path) -> Result<CString> {
    // The `as_bytes` method exists only on Unix-like systems.
    use std::os::unix::ffi::OsStrExt;

    Ok(CString::new(path.as_os_str().as_bytes())?)
}

#[cfg(windows)]
fn path_to_cstring(path: &Path) -> Result<CString> {
    // Try to convert to UTF-8. If this fails, libgit2 can't handle the path
    // anyway.
    match path.to_str() {
        Some(s) => Ok(CString::new(s)?),
        None => {
            let message = format!("Couldn't convert path '{}' to UTF-8",
                                  path.display());
            Err(message.into())
        }
    }
}

The libgit2 interface makes this code a little tricky. On all platforms, libgit2 accepts paths as null-terminated C strings. On Windows, libgit2 assumes these C strings hold well-formed UTF-8 and converts them internally to the 16-bit paths Windows actually requires. This usually works, but it’s not ideal. Windows permits filenames that are not well-formed Unicode and thus cannot be represented in UTF-8. If you have such a file, it’s impossible to pass its name to libgit2.

In Rust, the proper representation of a filesystem path is a std::path::Path, carefully designed to handle any path that can appear on Windows or POSIX. This means that there are Path values on Windows that one cannot pass to libgit2, because they are not well-formed UTF-8. So although path_to_cstring’s behavior is less than ideal, it’s actually the best we can do given libgit2’s interface.

The two path_to_cstring definitions just shown rely on conversions to our Error type: the ? operator attempts such conversions, and the Windows version explicitly calls .into(). These conversions are unremarkable:

impl From<String> for Error {
    fn from(message: String) -> Error {
        Error { code: -1, message, class: 0 }
    }
}

// NulError is what `CString::new` returns if a string
// has embedded zero bytes.
impl From<std::ffi::NulError> for Error {
    fn from(e: std::ffi::NulError) -> Error {
        Error { code: -1, message: e.to_string(), class: 0 }
    }
}
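To see the ? operator apply such a conversion, consider this standalone sketch. The Error type is a cut-down copy of ours, and to_c_string is a hypothetical helper, not part of the git module.

```rust
use std::ffi::{CString, NulError};

#[derive(Debug)]
pub struct Error {
    code: i32,
    message: String,
}

impl Error {
    pub fn code(&self) -> i32 {
        self.code
    }

    pub fn message(&self) -> &str {
        &self.message
    }
}

impl From<NulError> for Error {
    fn from(e: NulError) -> Error {
        Error { code: -1, message: e.to_string() }
    }
}

/// `CString::new` returns `Result<CString, NulError>`. On failure, `?`
/// passes the `NulError` through `Error::from` before returning it.
pub fn to_c_string(text: &str) -> Result<CString, Error> {
    let c_string = CString::new(text)?;
    Ok(c_string)
}
```

A string such as "HEAD" converts cleanly, while one with an embedded zero byte comes back as an Err holding our own Error type rather than a NulError.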

Next, let’s figure out how to resolve a Git reference to an object identifier. Since an object identifier is just a 20-byte hash value, it’s perfectly fine to expose it in the safe API:

/// The identifier of some sort of object stored in the Git object
/// database: a commit, tree, blob, tag, etc. This is a wide hash of the
/// object's contents.
pub struct Oid {
    pub raw: raw::git_oid
}

We’ll add a method to Repository to perform the lookup:

use std::mem;
use std::os::raw::c_char;

impl Repository {
    pub fn reference_name_to_id(&self, name: &str) -> Result<Oid> {
        let name = CString::new(name)?;
        unsafe {
            let oid = {
                let mut oid = mem::MaybeUninit::uninit();
                check(raw::git_reference_name_to_id(
                        oid.as_mut_ptr(), self.raw,
                        name.as_ptr() as *const c_char))?;
                oid.assume_init()
            };
            Ok(Oid { raw: oid })
        }
    }
}

Although oid is left uninitialized when the lookup fails, this function guarantees that its caller can never see the uninitialized value simply by following Rust’s Result idiom: either the caller gets an Ok carrying a properly initialized Oid value, or it gets an Err.
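The same pattern can be tried without libgit2. In this sketch, fill_id is an invented function that behaves like a C routine with an output parameter, and make_id is a hypothetical safe wrapper around it; neither is part of the git module.

```rust
use std::mem::MaybeUninit;
use std::os::raw::c_int;

/// Hypothetical stand-in for a C function with an output parameter.
/// On success it writes all 20 bytes of `*out` and returns 0; on failure
/// it returns -1 and leaves `*out` untouched.
///
/// Safety: `out` must be valid for writing a `[u8; 20]`.
unsafe fn fill_id(out: *mut [u8; 20], seed: u8, succeed: bool) -> c_int {
    if !succeed {
        return -1;
    }
    unsafe {
        out.write([seed; 20]);
    }
    0
}

/// Safe wrapper: the caller gets either an initialized array or an error
/// code, never the uninitialized bytes.
pub fn make_id(seed: u8, succeed: bool) -> Result<[u8; 20], c_int> {
    let mut id = MaybeUninit::<[u8; 20]>::uninit();
    let code = unsafe { fill_id(id.as_mut_ptr(), seed, succeed) };
    if code < 0 {
        // Return before `assume_init`: the buffer was never written.
        return Err(code);
    }
    // Safety: `fill_id` reported success, so all 20 bytes were written.
    Ok(unsafe { id.assume_init() })
}
```

The early return on failure is what keeps assume_init sound: it is reached only on the path where the buffer has been written.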

Next, the module needs a way to retrieve commits from the repository. We’ll define a Commit type as follows:

use std::marker::PhantomData;

pub struct Commit<'repo> {
    // This must always be a pointer to a usable `git_commit` structure.
    raw: *mut raw::git_commit,
    _marker: PhantomData<&'repo Repository>
}

As we mentioned earlier, a git_commit object must never outlive the git_repository object it was retrieved from. Rust’s lifetimes let the code capture this rule precisely.

The RefWithFlag example in the previous chapter used a PhantomData field to tell Rust to treat a type as if it contained a reference with a given lifetime, even though the type apparently contained no such reference. The Commit type needs to do something similar. In this case, the _marker field’s type is PhantomData<&'repo Repository>, indicating that Rust should treat Commit<'repo> as if it held a reference with lifetime 'repo to some Repository.

The method for looking up a commit is as follows:

impl Repository {
    pub fn find_commit(&self, oid: &Oid) -> Result<Commit> {
        let mut commit = ptr::null_mut();
        unsafe {
            check(raw::git_commit_lookup(&mut commit, self.raw, &oid.raw))?;
        }
        Ok(Commit { raw: commit, _marker: PhantomData })
    }
}

How does this relate the Commit’s lifetime to the Repository’s? The signature of find_commit omits the lifetimes of the references involved according to the rules outlined in “Omitting Lifetime Parameters”. If we were to write the lifetimes out, the full signature would read:

fn find_commit<'repo, 'id>(&'repo self, oid: &'id Oid)
    -> Result<Commit<'repo>>

This is exactly what we want: Rust treats the returned Commit as if it borrows something from self, which is the Repository.
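Here is a small sketch of the same technique that compiles on its own. Repo and Entry are invented types that play the roles of Repository and Commit; the Entry holds nothing but a raw pointer, yet the borrow checker treats it as a borrow of the Repo.

```rust
use std::marker::PhantomData;

pub struct Repo {
    // Hypothetical stand-in for the data a real repository would own.
    entries: Vec<String>,
}

/// Holds only a raw pointer, yet behaves as if it borrowed from a `Repo`.
pub struct Entry<'repo> {
    raw: *const String,
    _marker: PhantomData<&'repo Repo>,
}

impl Repo {
    pub fn new(entries: Vec<String>) -> Repo {
        Repo { entries }
    }

    // Written with explicit lifetimes; eliding them gives the same result.
    pub fn find<'repo>(&'repo self, index: usize) -> Option<Entry<'repo>> {
        let entry = self.entries.get(index)?;
        Some(Entry { raw: entry, _marker: PhantomData })
    }
}

impl<'repo> Entry<'repo> {
    pub fn text(&self) -> &str {
        // Safety: `raw` points into a `Repo` that the `'repo` borrow
        // keeps alive and unmodified for as long as this `Entry` exists.
        let entry: &String = unsafe { &*self.raw };
        entry.as_str()
    }
}

// The borrow checker now rejects code like this:
//
//     let entry = {
//         let repo = Repo::new(vec!["first".to_string()]);
//         repo.find(0).unwrap()
//     };  // error: `repo` does not live long enough
```

Without the _marker field, the lifetime parameter would be unused and Rust would reject the definition of Entry outright.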

When a Commit is dropped, it must free its raw::git_commit:

impl<'repo> Drop for Commit<'repo> {
    fn drop(&mut self) {
        unsafe {
            raw::git_commit_free(self.raw);
        }
    }
}

From a Commit, you can borrow a Signature (a name and email address) and the text of the commit message:

impl<'repo> Commit<'repo> {
    pub fn author(&self) -> Signature {
        unsafe {
            Signature {
                raw: raw::git_commit_author(self.raw),
                _marker: PhantomData
            }
        }
    }

    pub fn message(&self) -> Option<&str> {
        unsafe {
            let message = raw::git_commit_message(self.raw);
            char_ptr_to_str(self, message)
        }
    }
}

Here’s the Signature type:

pub struct Signature<'text> {
    raw: *const raw::git_signature,
    _marker: PhantomData<&'text str>
}

A git_signature object always borrows its text from elsewhere; in particular, signatures returned by git_commit_author borrow their text from the git_commit. So our safe Signature type includes a PhantomData<&'text str> to tell Rust to behave as if it contained a &str with a lifetime of 'text. Just as before, Commit::author properly connects this 'text lifetime of the Signature it returns to that of the Commit without us needing to write a thing. The Commit::message method does the same with the Option<&str> holding the commit message.

A Signature includes methods for retrieving the author’s name and email address:

impl<'text> Signature<'text> {
    /// Return the author's name as a `&str`,
    /// or `None` if it is not well-formed UTF-8.
    pub fn name(&self) -> Option<&str> {
        unsafe {
            char_ptr_to_str(self, (*self.raw).name)
        }
    }

    /// Return the author's email as a `&str`,
    /// or `None` if it is not well-formed UTF-8.
    pub fn email(&self) -> Option<&str> {
        unsafe {
            char_ptr_to_str(self, (*self.raw).email)
        }
    }
}

The preceding methods depend on a private utility function char_ptr_to_str:

/// Try to borrow a `&str` from `ptr`, given that `ptr` may be null or
/// refer to ill-formed UTF-8. Give the result a lifetime as if it were
/// borrowed from `_owner`.
///
/// Safety: if `ptr` is non-null, it must point to a null-terminated C
/// string that is safe to access for at least as long as the lifetime of
/// `_owner`.
unsafe fn char_ptr_to_str<T>(_owner: &T, ptr: *const c_char) -> Option<&str> {
    if ptr.is_null() {
        None
    } else {
        CStr::from_ptr(ptr).to_str().ok()
    }
}

The _owner parameter’s value is never used, but its lifetime is. Making the lifetimes in this function’s signature explicit gives us:

unsafe fn char_ptr_to_str<'o, T: 'o>(_owner: &'o T, ptr: *const c_char)
    -> Option<&'o str>

The CStr::from_ptr function returns a &CStr whose lifetime is completely unbounded, since it was borrowed from a dereferenced raw pointer. Unbounded lifetimes are almost always inaccurate, so it’s good to constrain them as soon as possible. Including the _owner parameter causes Rust to attribute its lifetime to the return value’s type, so callers can receive a more accurately bounded reference.
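To watch the owner-based lifetime at work outside of libgit2, here is a sketch in which a hypothetical Note type owns some C text, standing in for a git_commit. The helper is repeated with its lifetimes written out so the example stands alone.

```rust
use std::ffi::{CStr, CString};
use std::os::raw::c_char;
use std::ptr;

/// Borrow a `&str` from `ptr`, with a lifetime tied to `_owner`.
///
/// Safety: if `ptr` is non-null, it must point to a null-terminated C
/// string that stays valid for as long as `_owner` is borrowed.
unsafe fn char_ptr_to_str<'o, T: 'o>(_owner: &'o T, ptr: *const c_char)
    -> Option<&'o str>
{
    if ptr.is_null() {
        None
    } else {
        let c_str = unsafe { CStr::from_ptr(ptr) };
        c_str.to_str().ok()
    }
}

/// Hypothetical owner of some C text.
pub struct Note {
    text: Option<CString>,
}

impl Note {
    pub fn new(text: Option<&str>) -> Note {
        Note { text: text.map(|t| CString::new(t).expect("no interior NUL")) }
    }

    /// The returned `&str` cannot outlive `self`, because
    /// `char_ptr_to_str` ties its lifetime to the `&self` borrow.
    pub fn text(&self) -> Option<&str> {
        let raw = match &self.text {
            Some(c_string) => c_string.as_ptr(),
            None => ptr::null(),
        };
        unsafe { char_ptr_to_str(self, raw) }
    }
}
```

Holding on to the result of text after the Note has been dropped is a borrow-checking error, which is exactly the constraint an unbounded lifetime would have lost.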

It is not clear from the libgit2 documentation whether a git_signature’s name and email pointers can be null, despite the documentation for libgit2 being quite good. Your authors dug around in the source code for some time without being able to persuade themselves one way or the other and finally decided that char_ptr_to_str had better be prepared for null pointers just in case. In Rust, this sort of question is answered immediately by the type: if it’s &str, you can count on the string to be there; if it’s Option<&str>, it’s optional.

Finally, we’ve provided safe interfaces for all the functionality we need. The new main function in src/main.rs is slimmed down quite a bit and looks like real Rust code:

fn main() {
    let path = std::env::args_os().nth(1)
        .expect("usage: git-toy PATH");

    let repo = git::Repository::open(&path)
        .expect("opening repository");

    let commit_oid = repo.reference_name_to_id("HEAD")
        .expect("looking up 'HEAD' reference");

    let commit = repo.find_commit(&commit_oid)
        .expect("looking up commit");

    let author = commit.author();
    println!("{} <{}>\n",
             author.name().unwrap_or("(none)"),
             author.email().unwrap_or("(none)"));

    println!("{}", commit.message().unwrap_or("(none)"));
}

In this chapter, we’ve gone from simplistic interfaces that don’t provide many safety guarantees to a safe API wrapping an inherently unsafe API by arranging for any violation of the latter’s contract to be a Rust type error. The result is an interface that Rust can ensure you use correctly. For the most part, the rules we’ve made Rust enforce are the sorts of rules that C and C++ programmers end up imposing on themselves anyway. What makes Rust feel so much stricter than C and C++ is not that the rules are so foreign, but that this enforcement is mechanical and comprehensive.

Conclusion

Rust is not a simple language. Its goal is to span two very different worlds. It’s a modern programming language, safe by design, with conveniences like closures and iterators, yet it aims to put you in control of the raw capabilities of the machine it runs on, with minimal run-time overhead.

The contours of the language are determined by these goals. Rust manages to bridge most of the gap with safe code. Its borrow checker and zero-cost abstractions put you as close to the bare metal as possible without risking undefined behavior. When that’s not enough or when you want to leverage existing C code, unsafe code and the foreign function interface stand ready. But again, the language doesn’t just offer you these unsafe features and wish you luck. The goal is always to use unsafe features to build safe APIs. That’s what we did with libgit2. It’s also what the Rust team has done with Box, Vec, the other collections, channels, and more: the standard library is full of safe abstractions, implemented with some unsafe code behind the scenes.

A language with Rust’s ambitions was, perhaps, not destined to be the simplest of tools. But Rust is safe, fast, concurrent—and effective. Use it to build large, fast, secure, robust systems that take advantage of the full power of the hardware they run on. Use it to make software better.