Multiversioning
Summary
This page covers the design of the multiversioning subsystem of magpie, as well as how it integrates with nanotransaction execution. Details of how epics make use of versioning can be found in the epic design doc.
- Introduction
- Version Materialization and Storage
- Nanotransaction Scheduling
- Versioned Object Resolution
- Epics and Multiversioning
Introduction
Currently, worker-resident objects are maintained in a per-worker object
tracker. The
execution subsystem uses the object tracker to resolve IPtr instances to concrete, in-memory
object references.
The object tracker only maintains "current" object versions. Because of the structure of our execution subsystem, this means that even read-only transactions end up serialized behind previously scheduled transactions that modify any subset of the read-only transaction's dependencies. Additionally, even though we maintain version history for each object (in the form of pre- and post-images in transaction log entries), we currently have no way of materializing past versions, which means that we cannot support any isolation mode stronger than "read committed". (In this context, "read committed" means that the effects of nanotransactions executed as part of an epic are immediately externalized to the system, as opposed to becoming visible only once the epic successfully commits.)
The goal of the present page is to formalize the design of the versioning subsystem, concretely describe how it interacts with the nanotransaction execution subsystem to provide higher concurrency for read-mostly workloads, and sketch how epics will make use of this subsystem.
Version Materialization and Storage
A new per-worker component, the version manager, is introduced to supplement the functionality of the object tracker. The version manager maintains a mapping from object IDs to a list of committed versions of the given object in a lock-free hashmap variant, each entry storing a circular array of pointers to heap-allocated materialized versions of the target object.
On object initialization, we store a copy of the object in the table. Any time the object is modified as part of a nanotransaction, we materialize a new read-only version of the object by applying the set of relevant post-images to the latest in-memory copy of the object, and writing the new version pointer to the tail of the array.
This process can either happen downstream from the log manager by a "versioning thread", or it can be done by the worker thread that made the changes. I am unsure which is going to perform better, so I want to experiment with both.
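As a rough illustration of the structure described above, here is a minimal sketch of the version manager's table. All names (`VersionManager`, `VersionRing`, `MaterializedVersion`) and the cache depth are assumptions for illustration; the real implementation would use a lock-free hashmap rather than the plain `HashMap` standing in here, and post-image application is simplified to storing the new payload directly.

```rust
use std::collections::HashMap;
use std::sync::Arc;

// Assumed cache depth; the doc does not fix the number of slots.
const VERSION_CACHE_SLOTS: usize = 4;

type ObjectId = u64;
type Version = u64;

// A materialized, read-only snapshot of an object at a given version.
struct MaterializedVersion {
    version: Version,
    payload: Vec<u8>,
}

// Fixed-capacity ring holding the most recent committed versions of one
// object, ordered oldest..newest.
struct VersionRing {
    slots: Vec<Arc<MaterializedVersion>>,
}

impl VersionRing {
    fn push(&mut self, v: Arc<MaterializedVersion>) {
        if self.slots.len() == VERSION_CACHE_SLOTS {
            self.slots.remove(0); // evict the oldest cached version
        }
        self.slots.push(v); // new version goes to the tail of the ring
    }
    fn latest(&self) -> Option<&Arc<MaterializedVersion>> {
        self.slots.last()
    }
}

struct VersionManager {
    table: HashMap<ObjectId, VersionRing>,
}

impl VersionManager {
    fn new() -> Self {
        Self { table: HashMap::new() }
    }

    // On object initialization, store a copy of the object in the table.
    fn init(&mut self, id: ObjectId, payload: Vec<u8>) {
        let first = Arc::new(MaterializedVersion { version: 0, payload });
        self.table.insert(id, VersionRing { slots: vec![first] });
    }

    // After a committed modification, materialize a new read-only version
    // and append its pointer at the tail of the ring.
    fn publish(&mut self, id: ObjectId, post_image: Vec<u8>) {
        let ring = self.table.get_mut(&id).expect("unknown object id");
        let next = ring.latest().map(|v| v.version + 1).unwrap_or(0);
        ring.push(Arc::new(MaterializedVersion { version: next, payload: post_image }));
    }
}
```

Whether `publish()` runs on a dedicated versioning thread downstream of the log manager or on the mutating worker thread is exactly the open question noted above; the data structure is the same in either case.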
Storage
Our overall approach is close to the delta storage approach of existing MVCC databases; we maintain
the most recent version of all objects on disk and in-memory, and we maintain deltas generated from
effectful nanotransactions in the form of post-images in associated transaction log entries. If we
need to materialize a version that is not already cached, we need to "walk" the chain of deltas,
starting from the oldest version available to us through the version manager and repeatedly applying
the equivalent of logical undo() operations until we arrive at the desired version.
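The delta walk can be sketched as follows. This is a toy model under stated assumptions: objects are flattened to a `Vec<i64>`, each pre-image delta restores a single slot, and the `Snapshot`/`Delta` names are invented for illustration. Starting from the oldest version the version manager still holds, we apply logical undo steps (newest pre-image first) until we reach the requested version.

```rust
#[derive(Clone)]
struct Snapshot {
    version: u64,
    data: Vec<i64>, // toy stand-in for a materialized object
}

// One pre-image delta: restore slot `index` to `value`, rolling the
// snapshot back to `to_version`.
struct Delta {
    to_version: u64,
    index: usize,
    value: i64,
}

// Walk the chain of pre-images (ordered newest to oldest) from the oldest
// cached snapshot until we arrive at `target`, or report failure if the
// chain does not reach that far back.
fn materialize_past(oldest_cached: &Snapshot, pre_images: &[Delta], target: u64) -> Option<Snapshot> {
    let mut cur = oldest_cached.clone();
    for d in pre_images {
        if cur.version <= target {
            break;
        }
        cur.data[d.index] = d.value; // logical undo of one modification
        cur.version = d.to_version;
    }
    if cur.version == target { Some(cur) } else { None }
}
```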
The reason why our approach is "close" to delta storage is that we unfortunately have to store as many copies of the entire materialized object in memory as there are slots in each version cache. Ideally, we would be maintaining a page cache and doing direct I/O from disk; given this infrastructure, we could avoid the above storage amplification by enforcing page-level copy-on-write. With that said, switching over to direct I/O and implementing a performant cache seems like more trouble than it's worth at the present moment, so for now we will just accept the fact that we incur high storage amplification and rely on Rust's clone-on-write smart pointer types.
Nanotransaction Scheduling
In the presence of object versioning, the worker-local scheduler will have to be modified to accommodate the following scenarios:
- Read-only nanotransactions: This class of nanotransactions is unconstrained in terms of which executor thread they can run on, so the scheduler can apply any constant-time heuristic (e.g. least-in-flight) to schedule.
- Read-write nanotransactions: These are nanotransactions that have both mutable and immutable object parameters, or nanotransactions that allocate an object and initialize it with data extracted from a set of input, read-only objects. For this class of nanotransactions, the scheduler will apply its existing scheduling logic, i.e. they will be executed on a core that locally owns all the objects that are passed in as arguments, regardless of whether the nanotransaction mutates the target object or not. This is to maintain the serializability of individual nanotransaction histories.
Nanotransactions that just issue blind writes (i.e. write-only nanotransactions) to a set of input objects are treated the same as read-write nanotransactions.
As an example, assume objects 123 and 234, and three nanotransaction definitions,
ro_nando(&obj) (read-only nanotransaction), rw_nando(&obj_1, &mut obj_2) (read-write), and
wo_nando(&mut obj) (write-only). The below figure shows a snapshot of the job queues of a worker
with two executor threads, after a mix of the above nanotransactions have been scheduled.
All nanotransaction activations that intend to modify at least one of the two objects would get
scheduled on whatever executor thread owns the current object locally, while the read-only
nanotransactions on the same objects are free to be scheduled on any thread according to capacity.
Until the instances of ro_nando() start running in thread #2, we do not know which specific
version of the target objects they will read -- all that the system guarantees is that they will read
the most recently committed version of the target object (see the object
resolution section for more). The two executor threads will proceed
independently with the execution of their respective queues without needing to synchronize at any point.
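The placement rule above can be condensed into a small dispatch function. This is a sketch, not the scheduler's actual interface: the enum and function names are assumptions, blind writes are folded into the read-write class as described, and the least-in-flight heuristic is reduced to a precomputed thread index.

```rust
#[derive(Debug, PartialEq)]
enum NandoClass {
    ReadOnly,
    // Read-write and write-only (blind-write) nandos are treated the same.
    ReadWrite,
}

// Pick an executor thread for a nanotransaction activation.
// `owner_thread` is the thread that locally owns all object arguments;
// `least_loaded` is whatever a constant-time heuristic (e.g. least-in-flight)
// currently suggests.
fn pick_thread(class: &NandoClass, owner_thread: usize, least_loaded: usize) -> usize {
    match class {
        // Read-only nandos are unconstrained: any executor thread will do.
        NandoClass::ReadOnly => least_loaded,
        // Effectful nandos must run where their objects are owned, to keep
        // per-object nanotransaction histories serializable.
        NandoClass::ReadWrite => owner_thread,
    }
}
```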
Nanotransaction Classification at Scheduling Time
Since the only information about the nanotransaction activation available to the nanotransaction
scheduler is the target nanotransaction's name and its arguments (in the form of IPtr instances in
the case of object parameters), we have no way of determining the class a particular nanotransaction
belongs to. To address this, the nandoize macro will have to be modified to generate that metadata
as part of nandoization.
This metadata will be made available to the scheduler either programmatically (by generating
something like the resolve_function() resolver per processed library, e.g. get_operation_class(name: str) -> NandoClassEnum) or by encoding that metadata in an auxiliary, well-known, per-library binary file
that is loaded by the scheduler during library loading.
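Under the programmatic option, the code that nandoize emits per library could look roughly like the following. The match arms here are hand-written for the three example nandos from the previous section; in generated code they would be derived from the mutability of each nando's object parameters. `NandoClassEnum` mirrors the name used above, and the string-based lookup is an assumption about the generated interface.

```rust
#[derive(Debug, PartialEq)]
enum NandoClassEnum {
    ReadOnly,
    ReadWrite, // also covers write-only (blind-write) nandos
}

// Hypothetical per-library classifier emitted during nandoization,
// analogous to the existing resolve_function() resolver.
fn get_operation_class(name: &str) -> Option<NandoClassEnum> {
    match name {
        "ro_nando" => Some(NandoClassEnum::ReadOnly),
        "rw_nando" | "wo_nando" => Some(NandoClassEnum::ReadWrite),
        _ => None, // unknown nando: not defined in this library
    }
}
```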
Versioned Object Resolution
The nandoize macro will have to be further modified to generate the correct resolution code for
versioned objects.
Currently our procedural macro emits calls to the resolve_object!() macro for all of an
activation's object arguments. The resolver code is then executed by the executor thread that the
activation is scheduled on as part of the nanotransaction's "preamble".
A new macro, resolve_versioned_object!(), will resolve the target object from the worker's version
manager -- if a version is not specified explicitly, the result will be the most recently committed
version of the object; if a version is specified, the version manager will return a cached version
(if one still exists) or attempt to re-materialize the specified object version. Explicit versioning
will probably only be used in epics for now, so we can move this materialization off the
executor threads' hot path.
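The resolution rule that resolve_versioned_object!() would expand into can be sketched as a plain function. Everything here is an assumption about shape, not the macro's actual expansion: the cache is modeled as a slice ordered oldest to newest, and re-materialization is abstracted as a callback (the slow path that walks the delta chain, kept off the hot path).

```rust
#[derive(Clone, Debug, PartialEq)]
struct Versioned {
    version: u64,
    data: i64, // toy object payload
}

// Resolve an object from the version manager's cache.
// - `requested == None`: return the most recently committed version.
// - `requested == Some(v)`: return the cached copy of version `v` if one
//   still exists, otherwise fall back to re-materialization.
fn resolve_versioned(
    cache: &[Versioned],
    requested: Option<u64>,
    rematerialize: impl Fn(u64) -> Option<Versioned>,
) -> Option<Versioned> {
    match requested {
        None => cache.last().cloned(),
        Some(v) => cache
            .iter()
            .find(|c| c.version == v)
            .cloned()
            .or_else(|| rematerialize(v)),
    }
}
```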
Epics and Multiversioning
This section provides a sketch for how epics will interact with the versioning subsystem; a detailed integration design is beyond the scope of this page.
Epics will run under at least two different isolation modes -- a weak one that immediately externalizes the effects of the individual nanotransactions that are running as part of the epic, and a stronger, SI-like mode that establishes a snapshot at the start of the epic entry point and externalizes the epic's effects at the end of the epic. As mentioned in the introduction, the first mode is relatively straightforward, so we will focus on snapshot-based epics.
There are two issues that need to be addressed:
- Maintaining per-object version constraints, so that as the epic unfolds and new objects are added to the epic's read set, we refine a constraint set over the object versions that the epic may access, based on the versions it has already accessed.
- Merging the effects of an epic (buffered while the epic is in progress in e.g. shadow objects) into the primary object copy (mmap-backed version that independent nanotransactions access).
The solution to the first issue will probably be independent of the per-instance version manager. We want versioning constraints to be persisted and shareable, so ideally we will store constraints on a per-object basis, in the object's metadata. These constraints will be modified any time a transaction executes over multiple objects, as part of the nanotransaction itself. The execution engine will then be responsible for reading the version constraints in the metadata of the target objects and for supplying versions that satisfy the established constraints when requesting an object version from the version manager.
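One very rough way to make "refining a constraint set" concrete is to represent each per-object constraint as an inclusive version interval and refine by intersection; an empty intersection means the epic cannot pick a consistent version and must abort. The interval representation is purely an assumption for illustration; the design above only requires that constraints be refinable, persistable, and shareable.

```rust
// Hypothetical per-object version constraint, stored in object metadata.
#[derive(Clone, Copy, Debug, PartialEq)]
struct VersionConstraint {
    lo: u64, // oldest acceptable version (inclusive)
    hi: u64, // newest acceptable version (inclusive)
}

impl VersionConstraint {
    // Refine this constraint against another one (e.g. one derived from a
    // version the epic already read). `None` means the constraints are
    // unsatisfiable and the epic has to abort.
    fn refine(self, other: VersionConstraint) -> Option<VersionConstraint> {
        let lo = self.lo.max(other.lo);
        let hi = self.hi.min(other.hi);
        if lo <= hi { Some(VersionConstraint { lo, hi }) } else { None }
    }
}
```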
To merge the effects of an epic at the time when the epic "commits", we will create a new system
nanotransaction, apply(). This transaction will be passed the epic's complete write set, together
with a set of transaction log entries (generated by the nanotransactions in the epic and buffered by
the system) that contain the epic's writes to the objects in the write set. apply() needs to
perform a version check over all the objects in the write set -- if the version of any current
in-memory object differs from the version of that object as recorded in the epic's read set,
then apply() exits and the epic is aborted/rolled back. If, on the other hand, all version checks
succeed, the epic will commit, and so apply() is free to fast-forward the target objects by
effectively running the redo logic from recovery, and then to pass the set of transaction log
entries to the worker's log manager to append to the log.
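The version check at the heart of apply() amounts to a set-wide comparison, sketched below under assumed names: the epic's read set is modeled as a map from object ID to the version the epic observed, and the check passes only if every such object still sits at that version. Redo application and log appending are omitted.

```rust
use std::collections::HashMap;

// Returns true iff every object the epic read is still at the version the
// epic observed; any mismatch means a concurrent commit intervened and the
// epic must be aborted/rolled back.
fn version_check(
    read_set: &HashMap<u64, u64>, // object id -> version the epic read
    current: &HashMap<u64, u64>,  // object id -> current committed version
) -> bool {
    read_set.iter().all(|(id, v)| current.get(id) == Some(v))
}
```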