Vectors (Dynamic Arrays)
This page covers the design of the PVec type provided by magpie.
From the outside, a PVec<T> should be thought of as functionally identical to a Vec<T>. The
difference from the user's perspective is that a BumpAllocator instance must be supplied
explicitly, so that the vector implementation can request file-backed memory within the same object.
In the rest of this page, the term "persistable vector" refers to a PVec<T>.
Design
The approximate definition of a persistable vector is found in the following code listing.
```rust
pub struct PVec<T> {
    len: usize,
    capacity: usize,
    buf: usize,
    allocator: Option<BumpAllocator>,
}
```
The "header" of a vector contains information about its allocated capacity and its current length,
along with a buf entry that is the offset, in bytes, to the first allocated element of the vector.
In this way, we do not need to store pointer information as part of our "serialized" vector
representation in the underlying file. Any instance of Magpie that loads the underlying file (and,
of course, casts it to the correct datatype) will be able to load and safely access the same vector
in memory, as all references are encoded as relative offsets within the file.
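To make the position-independence argument concrete, the following sketch shows why an offset-based reference stays valid no matter where the file is mapped. The `resolve` function is illustrative only, not magpie's API:

```rust
// Resolve a relative offset against whatever base address the "file"
// currently has. An absolute pointer stored in the file would break
// after remapping; base + offset is recomputed from the live mapping.
fn resolve(base: *const u8, offset: usize) -> *const u8 {
    unsafe { base.add(offset) }
}

fn main() {
    let mapping_a = [10u8, 20, 30, 40];
    let mapping_b = mapping_a; // the same bytes at a different address
    unsafe {
        // Offset 2 names the same logical byte in both "mappings".
        assert_eq!(*resolve(mapping_a.as_ptr(), 2), 30);
        assert_eq!(*resolve(mapping_b.as_ptr(), 2), 30);
    }
}
```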
A persistable vector does not model its underlying buffer as a contiguous region of memory. This is
a conscious decision, in an effort to avoid internal fragmentation of the backing storage whenever
the vector needs to be resized. Instead, a persistable vector is composed of one or more sections. A
section is created as the result of a call to the bump allocator's allocate() method, whenever
the vector must grow beyond its current capacity. The newly allocated region of
memory is, itself, contiguous within the process' virtual address space.
Entries within the persistable vector are of the following type:
```rust
enum Entry<T> {
    SectionBegin { section_len: usize },
    SectionEnd { next_section_offset: usize },
    Elem(T),
}
```
Each section contains two "bookends": the initial element, SectionBegin, describes the section
immediately following it (namely, how many elements it can hold), and the last element,
SectionEnd, contains the offset to the next allocated section (if any). The memory between
those elements holds the user's data, with each element wrapped in an Entry::Elem variant.
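The bookend-hopping lookup this layout enables can be sketched on a simplified in-memory model (a flat slice of entries rather than a real file mapping). The `find_section` helper and the placeholder `Elem(0)` slots for unused capacity are inventions for this example, not magpie's internals:

```rust
// Mirrors the Entry listing above.
enum Entry<T> {
    SectionBegin { section_len: usize },
    SectionEnd { next_section_offset: usize },
    Elem(T),
}

// Given a logical index, return (section_offset, index_within_section)
// by hopping from one SectionEnd bookend to the next SectionBegin.
fn find_section<T>(buf: &[Entry<T>], mut index: usize) -> (usize, usize) {
    let mut section = 0; // offset of the current SectionBegin
    loop {
        let Entry::SectionBegin { section_len } = &buf[section] else {
            panic!("expected SectionBegin");
        };
        if index < *section_len {
            return (section, index);
        }
        index -= *section_len;
        // SectionEnd sits immediately after the section's element slots.
        let Entry::SectionEnd { next_section_offset } = &buf[section + section_len + 1] else {
            panic!("expected SectionEnd");
        };
        section = *next_section_offset;
    }
}

fn main() {
    // Two sections of capacity 2 and 4; the second begins at offset 4.
    let buf = vec![
        Entry::SectionBegin { section_len: 2 },
        Entry::Elem(10),
        Entry::Elem(11),
        Entry::SectionEnd { next_section_offset: 4 },
        Entry::SectionBegin { section_len: 4 },
        Entry::Elem(12),
        Entry::Elem(13),
        Entry::Elem(0), // unused capacity, placeholder for the sketch
        Entry::Elem(0),
        Entry::SectionEnd { next_section_offset: 0 },
    ];
    assert_eq!(find_section(&buf, 1), (0, 1)); // falls in the first section
    assert_eq!(find_section(&buf, 3), (4, 1)); // second section, slot 1
}
```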
Memory Layout
The vector uses a bump allocator to request contiguous regions of memory on resize events. The
vector grows geometrically, doubling its capacity on each resize request. The image below
illustrates the sample layout in memory (and consequently, within a backing file) of a vector with a
single section (in green, delimited by SectionBegin and SectionEnd entries), which has been
allocated after some other data stored by the object.
The below image shows the effect of calling push() on the above vector. Because the allocated
buffer was at capacity, a new allocation was requested. The new allocation occurred immediately after
the previous buffer, since no other allocation had intervened. The old SectionEnd pointer was
updated with the new section's offset, and the original element was inserted into the newly
allocated section.
To index into the vector, the implementation reads the offset contained within buf and constructs
a pointer into the loaded file by computing, in essence, &buf + *buf.
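That address computation can be written out in miniature. `Header` below is a simplified stand-in for the vector's header, and the concrete offset is made up for the example:

```rust
// `buf` holds a byte offset; the first entry's address is the address
// of the `buf` field itself plus the offset stored in it (&buf + *buf).
struct Header {
    buf: usize, // byte offset from this field to the first entry
}

fn first_entry_addr(header: &Header) -> usize {
    let buf_field_addr = &header.buf as *const usize as usize;
    buf_field_addr + header.buf
}

fn main() {
    let header = Header { buf: 64 };
    let base = &header.buf as *const usize as usize;
    assert_eq!(first_entry_addr(&header), base + 64);
}
```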
An advantage of this layout is that resizing never copies existing data: because we drop the invariant that the underlying buffer resides within a contiguous region of memory, old sections stay in place and a new section is simply linked after them.
This layout, however, has implications for access times. Most importantly, accesses to the vector's
elements through an index (and, consequently, insertions and removals) are no longer truly constant-time
O(1), though they still complete within a bounded number of steps.
Indexing proceeds by accessing each segment sequentially, and checking if the requested index falls
within it. If so, the implementation fetches the entry from the contiguous block of the section's
memory. If not, the next section's address is computed by reading the offset information encoded in
SectionEnd, and the above check is repeated. This means that access times are O(log(n / C)),
where C is the vector's initial capacity and n is the number of elements actually stored. This means
that the initial capacity users choose for the vector matters. A value far below the
number of elements the vector ends up holding will lead to a larger number of relatively small
sections being allocated, and therefore more steps before the appropriate section is identified. For
this reason, the initial capacity should be (roughly) in the same order of magnitude as the
anticipated number of elements in the vector.
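A back-of-the-envelope check of the O(log(n / C)) bound: with the total capacity doubling on every resize, the number of sections needed for n elements starting from initial capacity C is about 1 + log2(n / C). The helper below is a sketch of that arithmetic, not magpie's code:

```rust
// Count how many sections a vector ends up with, assuming each resize
// doubles the total capacity (i.e. the new section is as large as
// everything allocated so far).
fn sections_needed(initial_capacity: usize, n: usize) -> usize {
    let mut sections = 1;
    let mut total = initial_capacity;
    while total < n {
        total *= 2; // each resize doubles the total capacity
        sections += 1;
    }
    sections
}

fn main() {
    assert_eq!(sections_needed(1000, 1000), 1); // well-sized: one section
    assert_eq!(sections_needed(4, 1000), 9);    // undersized: many hops
}
```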
API
The full list of supported operations can be found in the corresponding module-level docs. For the prototype version, the supported operations are the fundamental ones, namely:

- Inserting elements: fn push(&mut self, elem: T)
- Removing elements: fn pop(&mut self) -> Option<T>
- Querying the vector's length: fn len(&self) -> usize
- Indexing into the vector through the [] operator.
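To make the API surface concrete, here is a toy stand-in that mirrors these operations on top of a plain Vec. It is not magpie's implementation (no allocator, no sections, no file backing); it only shows the shape of the calls:

```rust
use std::ops::Index;

// Toy facade with the same operation set as the persistable vector.
struct PVec<T> {
    inner: Vec<T>,
}

impl<T> PVec<T> {
    fn new() -> Self {
        PVec { inner: Vec::new() }
    }
    fn push(&mut self, elem: T) {
        self.inner.push(elem);
    }
    fn pop(&mut self) -> Option<T> {
        self.inner.pop()
    }
    fn len(&self) -> usize {
        self.inner.len()
    }
}

// Indexing through the [] operator.
impl<T> Index<usize> for PVec<T> {
    type Output = T;
    fn index(&self, i: usize) -> &T {
        &self.inner[i]
    }
}

fn main() {
    let mut v = PVec::new();
    v.push(1);
    v.push(2);
    assert_eq!(v.len(), 2);
    assert_eq!(v[1], 2);
    assert_eq!(v.pop(), Some(2));
}
```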