A Fast LinkedList in Rust: Part 1 - Design

Posted by Rahul Singh on July 21st, 2022 Tagged Rust, LinkedList

LinkedLists in general, have one big advantage - when you add a node to the list you can hold a long lived "pointer"^[3] to that node that doesn’t get invalidated even when the list is modified. This "pointer" can then be used to remove that element from the list in constant time, regardless of the position of that element in the list and without having to move every element. That is the whole point of using a LinkedList^[1]

We take this behavior for granted in languages such as Go and Java because the garbage collector makes sure not to reclaim memory held by the node when we release the "pointer". In C & C++, it's the responsibility of the developer to ensure that "pointer" is not de-referenced once the node is removed, but the language itself doesn't prevent anyone from holding a pointer to things that don't exist any more.

With Rust (safe rust to be specific) things are a bit different. You can't hold a long lived "pointer" to a node in a linked list because of the ownership and borrowing rules. In fact it's hard to even build the list because you can't have two nodes point to each other (in safe rust) because it's not clear who "owns" the memory for any given node. There is a good book that goes through this in great detail, along with several ways to get around this but all of them come up short. The std::collections::LinkedList in the Rust standard library also comes up short.

This really defeats the purpose of using a linkedlist in the first place.

It seems like no one has really solved the problem of building a useful linkedlist in rust. Furthermore, the generally accepted consensus seems to be that LinkedLists are terrible data structures and should be avoided at all costs. ^[2]

For certain problems, LinkedLists are perfect (LRU Cache is one such example) and rather than add to the debate, I’ll define what useful means and then walk through an implementation that works.

Two Important Properties of a Useful Linked List

For a LinkedList to be truly useful, it must support two operations:

The push operation must return some sort of “pointer” to the node inserted into the list and that “pointer” must remain valid between mutations of the list.
The list must support inserts and removes from the middle of the list in constant time, via this pointer.

These two properties are useful because we can hold onto the "pointer" returned when we add a node to the list for as long as we like and the use that pointer to refer to that node regardless of where that node is in the list. This "pointer" then gives us constant time access to that particular node and consequently constant time removes and deletes of either that node or the next / prev nodes even if they are in the middle of the list.

Lets walk through implementing a linked list with the two properties above (specifically a doubly linked list) with the stipulation that we'll need unsafe rust for this task.

Structure of the LinkedList

Nodes for our linked list will be allocated on the heap and we need a way to represent links between them. It is important that the links be nullable to indicate that the last node points to a non-existent node". In other words we need a "pointer" (a logical reference) to heap allocated memory or "null".^[4]

We have several different options. We can use Box to refer to heap allocated memory and wrap that in an Option so a pointer to non-existent next node can be represented by None. This is what the std::collections::LinkedList implementation does except that it uses NonNull rather than Box.

Here we will use raw pointers (and unsafe rust) and manage memory with a free list ourselves using core::alloc & core::dealloc. Here is what a node in the LinkedList looks like and list struct itself:

pub struct InternalNode {
    val: T,
    prev: *mut InternalNode,
    next: *mut InternalNode,
}

pub struct LinkedList {
    head: *mut InternalNode,
    tail: *mut InternalNode,
}

We can use core::ptr::null_mut() to get a null pointer and is_null() to check if its null. This does mean that we have to handle allocation and deallocation as well as reading and writing to memory ourselves.

Managing memory with a FreeList

One issue with LinkedLists (and any linked data structure in general) is that every time an element is added or removed from the list, memory is allocated and de-allocated. Allocating & de-allocating memory is expensive and one way to implement a fast linked list is to use a free list.

The free list is just another linked list of "free" nodes on the heap that can be reused when a new element is pushed onto the list. When an element is popped off the list the node is returned to the freelist (rather than de-allocated) so it can be used again later. When a new element is inserted into the list a node is removed from the free list and moved to the main list (and no memory allocation is necessary). Since the freelist is a linkedlist itself, it simply reuses the next and prev pointers of InternalNode:

pub struct FreeList {
    capacity: usize,
    len: usize,
    head: *mut InternalNode,
    tail: *mut InternalNode,
}

When we create the LinkedList we initialize the freelist with an initial capacity. When an element is pushed on to the list, we acquire a node from the free list and write to that memory and connect the pointers. When an element is popped off the list we simply return that node to the freelist rather than de-allocating it.

This gives us another rather nice property. A Node is always valid memory - the only difference is whether its a logically valid element on the main list or logically free because its on the free list.

We have one more detail to make our linked list useful (as well as fast) - constant time mutations in the middle of the list. To achieve this we need a way to maintain long lived references to the node in the list so we can use these to mutate the list at that node.

Handles as long lived "Pointers"

A handle is a logical reference to a node in the linked list. Its a wrapper around a raw pointer with the constraint that clients of the list cannot de-reference the raw pointer. Secondly it cannot be used independently. It can only be passed to the list.

To implement the handle we introduce to additional concepts: ContainerId (cid) and NodeId (nid). The ContainerId is a unique integer that identifies an instance of a container (a linked list) so handles cannot be used across lists. This ties the handle to the list it was obtained from.

The nodeId is another integer that refers to a unique node in the list that that handle points to. The nodeId is incremented every time a node is moved from the free list to the main list. This ensures that when a node is reused from the freelist the old handle becomes invalid because the nodeId does't match anymore. Since the raw pointer within the handle will always be valid, we cannot use it to determine is a node is logically a new one.

Handles don't hold rust references to the linked list itself so that they can be long lived. Handles are simply wrappers around a raw pointer. This allows us to push an element to the list and obtain a handle to the node just pushed. We can then use this handle at any time to refer to that node.

With these two concepts (raw pointers to memory in a freelist and handles) our linked list now looks like this:

pub(super) struct InternalNode {
    pub(super) val: T,
    pub(super) fl_node: bool,
    pub(super) nid: usize,
    pub(super) prev: *mut InternalNode,
    pub(super) next: *mut InternalNode,
}

pub struct FastLinkedList {
    cid: usize,
    nid: usize,
    pub(super) head: *mut InternalNode,
    pub(super) tail: *mut InternalNode,
    len: usize,
    fl: fl::FreeList,
}

The struct that implements the handle is:

pub struct Node {
    pub(super) cid: usize,
    pub(super) nid: usize,
    pub(super) ptr: *mut InternalNode,
}

A Working Example

Lets see a real example where we create a new instance of the list and the add a few elements to it. We'll hold onto the handles of the every element pushed onto the list and then add an element to the middle of the list at any position.

use deepmesa::collections::LinkedList;

let mut list = LinkedList::::with_capacity(10);
let handle_1 = list.push_front(1);
let handle_2 = list.push_front(2);
let handle_3 = list.push_front(3);

// The list is now 3 <-> 2 <-> 1 
// We can insert an element between 2 and 3 like so

handle_4 = list.push_next(&handle_2, 4)
// The list is now 3 <-> 2 <-> 4 <-> 1 
// We can use any of the handles to push an element

handle_5 = list.push_next(&handle_3, 5);
// The list is now 3 <-> 5 <-> 2 <-> 4 <-> 1 
// Removing an element is easy as well

val = list.pop_prev(&handle_4);
assert_eq!(val, Some(2));

The complete implementation is here and the Github Repository is here.

In Part 2 - Performance we look at performance data for the deepmesa LinkedList.

[1] This concept can also be extended to all linked data structures such as trees or graphs

[2] I’m not sure if that's purely because of the cache locality problem endemic to linked lists or somehow related to the difficulty of building a linkedlist in Rust but either way I’m not convinced that a general disparagement of the Linked List is warranted.

[3] I'm using quotes around words such as "null" and "pointer" to emphasize that these are concepts rather than actual implementation because rust doesn't have the concept of null like other languages do. Pointers in rust come with specific constraints whereas "pointer" in this case means "to refer to something" rather than a specific implementation.

arrsingh.com

Distributed Systems & Artificial Intelligence