Vectors (Dynamic Arrays)
This page covers the design of the PVec type provided by magpie.
From the outside, a PVec<T> should be thought of as functionally identical to a Vec<T>. The
difference from the user's perspective is that a BumpAllocator instance must be supplied
explicitly, so that the vector implementation can request file-backed memory within the same object.
In the rest of this page, the term "persistable vector" refers to a PVec<T>.
Design
The approximate definition of a persistable vector is found in the following code listing.
```rust
pub struct PVec<T> {
    len: usize,
    capacity: usize,
    buf: usize,
    allocator: Option<BumpAllocator>,
}
```
The "header" of a vector contains information about its allocated capacity and its current length,
along with a buf entry that is the offset, in bytes, to the first allocated element of the vector.
In this way, we do not need to store pointer information as part of our "serialized" vector
representation in the underlying file. Any instance of Magpie that loads the underlying file (and,
of course, casts it to the correct datatype) will be able to load and safely access the same vector
in memory, as all references are encoded as relative offsets within the file.
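To make the position-independence argument concrete, the following sketch shows why an offset-based reference stays valid no matter where the file is mapped. The `resolve` function is illustrative only, not magpie's API:

```rust
// Resolve a relative offset against whatever base address the "file"
// currently has. An absolute pointer stored in the file would break
// after remapping; base + offset is recomputed from the live mapping.
fn resolve(base: *const u8, offset: usize) -> *const u8 {
    unsafe { base.add(offset) }
}

fn main() {
    let mapping_a = [10u8, 20, 30, 40];
    let mapping_b = mapping_a; // the same bytes at a different address
    unsafe {
        // Offset 2 names the same logical byte in both "mappings".
        assert_eq!(*resolve(mapping_a.as_ptr(), 2), 30);
        assert_eq!(*resolve(mapping_b.as_ptr(), 2), 30);
    }
}
```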
A persistable vector does not model its underlying buffer as a contiguous region of memory. This is
a conscious decision, in an effort to avoid internal fragmentation of the backing storage whenever
the vector needs to be resized. Instead, a persistable vector is composed of one or more sections. A
section is created as the result of a call to the bump allocator's allocate() method, whenever
the vector must grow beyond its current capacity. The newly allocated region of
memory is, itself, contiguous within the process' virtual address space.
Entries within the persistable vector are of the following type:
```rust
enum Entry<T> {
    SectionBegin { section_len: usize },
    SectionEnd { next_section_offset: usize },
    Elem(T),
}
```
Each section contains two "bookends": the initial element, SectionBegin, describes the section
immediately following it (namely, how many elements it can hold), and the last element,
SectionEnd, contains the offset to the next allocated section (if any). The memory between
those elements holds the user's data, with each element wrapped in an Entry::Elem variant.
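The bookend-hopping lookup this layout enables can be sketched on a simplified in-memory model (a flat slice of entries rather than a real file mapping). The `find_section` helper and the placeholder `Elem(0)` slots for unused capacity are inventions for this example, not magpie's internals:

```rust
// Mirrors the Entry listing above.
enum Entry<T> {
    SectionBegin { section_len: usize },
    SectionEnd { next_section_offset: usize },
    Elem(T),
}

// Given a logical index, return (section_offset, index_within_section)
// by hopping from one SectionEnd bookend to the next SectionBegin.
fn find_section<T>(buf: &[Entry<T>], mut index: usize) -> (usize, usize) {
    let mut section = 0; // offset of the current SectionBegin
    loop {
        let Entry::SectionBegin { section_len } = &buf[section] else {
            panic!("expected SectionBegin");
        };
        if index < *section_len {
            return (section, index);
        }
        index -= *section_len;
        // SectionEnd sits immediately after the section's element slots.
        let Entry::SectionEnd { next_section_offset } = &buf[section + section_len + 1] else {
            panic!("expected SectionEnd");
        };
        section = *next_section_offset;
    }
}

fn main() {
    // Two sections of capacity 2 and 4; the second begins at offset 4.
    let buf = vec![
        Entry::SectionBegin { section_len: 2 },
        Entry::Elem(10),
        Entry::Elem(11),
        Entry::SectionEnd { next_section_offset: 4 },
        Entry::SectionBegin { section_len: 4 },
        Entry::Elem(12),
        Entry::Elem(13),
        Entry::Elem(0), // unused capacity, placeholder for the sketch
        Entry::Elem(0),
        Entry::SectionEnd { next_section_offset: 0 },
    ];
    assert_eq!(find_section(&buf, 1), (0, 1)); // falls in the first section
    assert_eq!(find_section(&buf, 3), (4, 1)); // second section, slot 1
}
```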
Memory Layout
The vector uses a bump allocator to request contiguous regions of memory on resize events. The
vector grows geometrically, doubling its capacity on each resize request. The image below
illustrates the sample layout in memory (and consequently, within a backing file) of a vector with a
single section (in green, delimited by SectionBegin and SectionEnd entries), which has been
allocated after some other data stored by the object.
The below image shows the effect of calling push() on the above vector. Because the allocated
buffer was at capacity, a new allocation was requested. The new allocation occurred immediately after
the previous buffer, since no other allocation had intervened. The old SectionEnd pointer was
updated with the new section's offset, and the original element was inserted into the newly
allocated section.
To index into the vector, the implementation reads the offset contained within buf and constructs
a pointer into the loaded file by computing, in essence, &buf + *buf.
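That address computation can be written out in miniature. `Header` below is a simplified stand-in for the vector's header, and the concrete offset is made up for the example:

```rust
// `buf` holds a byte offset; the first entry's address is the address
// of the `buf` field itself plus the offset stored in it (&buf + *buf).
struct Header {
    buf: usize, // byte offset from this field to the first entry
}

fn first_entry_addr(header: &Header) -> usize {
    let buf_field_addr = &header.buf as *const usize as usize;
    buf_field_addr + header.buf
}

fn main() {
    let header = Header { buf: 64 };
    let base = &header.buf as *const usize as usize;
    assert_eq!(first_entry_addr(&header), base + 64);
}
```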
An advantage of this layout is that resizing never copies existing data: because we drop the invariant that the underlying buffer resides within a contiguous region of memory, old sections stay in place and a new section is simply linked after them.
This layout, however, has implications for access times. Most importantly, accesses to the vector's
elements through an index (and, consequently, insertions and removals) are no longer truly constant-time
O(1), though they still complete within a bounded number of steps.
Indexing proceeds by accessing each segment sequentially, and checking if the requested index falls
within it. If so, the implementation fetches the entry from the contiguous block of the section's
memory. If not, the next section's address is computed by reading the offset information encoded in
SectionEnd, and the above check is repeated. This means that access times are O(log(n / C)),
where C is the vector's initial capacity and n is the number of elements actually stored. This means
that the initial capacity users choose for the vector matters. A value far below the
number of elements the vector ends up holding will lead to a larger number of relatively small
sections being allocated, and therefore more steps before the appropriate section is identified. For
this reason, the initial capacity should be (roughly) in the same order of magnitude as the
anticipated number of elements in the vector.
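A back-of-the-envelope check of the O(log(n / C)) bound: with the total capacity doubling on every resize, the number of sections needed for n elements starting from initial capacity C is about 1 + log2(n / C). The helper below is a sketch of that arithmetic, not magpie's code:

```rust
// Count how many sections a vector ends up with, assuming each resize
// doubles the total capacity (i.e. the new section is as large as
// everything allocated so far).
fn sections_needed(initial_capacity: usize, n: usize) -> usize {
    let mut sections = 1;
    let mut total = initial_capacity;
    while total < n {
        total *= 2; // each resize doubles the total capacity
        sections += 1;
    }
    sections
}

fn main() {
    assert_eq!(sections_needed(1000, 1000), 1); // well-sized: one section
    assert_eq!(sections_needed(4, 1000), 9);    // undersized: many hops
}
```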
API
The full list of supported operations can be found in the corresponding module-level docs. For the prototype version, the supported operations are the fundamental ones, namely:

- Inserting elements: fn push(&mut self, elem: T)
- Removing elements: fn pop(&mut self) -> Option<T>
- Querying the vector's length: fn len(&self) -> usize
- Indexing into the vector through the [] operator.
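To make the API surface concrete, here is a toy stand-in that mirrors these operations on top of a plain Vec. It is not magpie's implementation (no allocator, no sections, no file backing); it only shows the shape of the calls:

```rust
use std::ops::Index;

// Toy facade with the same operation set as the persistable vector.
struct PVec<T> {
    inner: Vec<T>,
}

impl<T> PVec<T> {
    fn new() -> Self {
        PVec { inner: Vec::new() }
    }
    fn push(&mut self, elem: T) {
        self.inner.push(elem);
    }
    fn pop(&mut self) -> Option<T> {
        self.inner.pop()
    }
    fn len(&self) -> usize {
        self.inner.len()
    }
}

// Indexing through the [] operator.
impl<T> Index<usize> for PVec<T> {
    type Output = T;
    fn index(&self, i: usize) -> &T {
        &self.inner[i]
    }
}

fn main() {
    let mut v = PVec::new();
    v.push(1);
    v.push(2);
    assert_eq!(v.len(), 2);
    assert_eq!(v[1], 2);
    assert_eq!(v.pop(), Some(2));
}
```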