Multiversioning
Summary
This page covers the design of the multiversioning subsystem of magpie, as well as how it integrates with nanotransaction execution. Details of how epics make use of versioning can be found in the epic design doc.
- Introduction
- Version Materialization and Storage
- Nanotransaction Scheduling
- Versioned Object Resolution
- Epics and Multiversioning
Introduction
Currently, worker-resident objects are maintained in a per-worker object
tracker. The
execution subsystem uses the object tracker to resolve IPtr instances to concrete, in-memory
object references.
The object tracker only maintains "current" object versions. Because of the structure of our execution subsystem, this means that even read-only transactions end up serialized behind previously scheduled transactions that modify any subset of the read-only transaction's dependencies. Additionally, even though we maintain version history for each object (in the form of pre- and post-images in transaction log entries), we currently have no way of materializing past versions, which means that we cannot support any isolation mode stronger than "read committed". (In this context, "read committed" means that the effects of nanotransactions executed as part of an epic are immediately externalized to the system, as opposed to becoming visible only once the epic successfully commits.)
The goal of the present page is to formalize the design of the versioning subsystem, concretely describe how it interacts with the nanotransaction execution subsystem to provide higher concurrency for read-mostly workloads, and sketch how epics will make use of this subsystem.
Version Materialization and Storage
A new per-worker component, the version manager, is introduced to supplement the functionality of the object tracker. The version manager maintains a mapping from object IDs to a list of committed versions of the given object in a lock-free hashmap variant, each entry storing a circular array of pointers to heap-allocated materialized versions of the target object.
On object initialization, we store a copy of the object in the table. Any time the object is modified as part of a nanotransaction, we materialize a new read-only version of the object by applying the set of relevant post-images to the latest in-memory copy of the object, and writing the new version pointer to the tail of the array.
This process can either happen downstream from the log manager by a "versioning thread", or it can be done by the worker thread that made the changes. I am unsure which is going to perform better, so I want to experiment with both.
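As a rough illustration of the structure described above, here is a minimal sketch of the version manager's table. All names (`VersionManager`, `VersionRing`, `MaterializedVersion`) and the cache depth are assumptions for illustration; the real implementation would use a lock-free hashmap rather than the plain `HashMap` standing in here, and post-image application is simplified to storing the new payload directly.

```rust
use std::collections::HashMap;
use std::sync::Arc;

// Assumed cache depth; the doc does not fix the number of slots.
const VERSION_CACHE_SLOTS: usize = 4;

type ObjectId = u64;
type Version = u64;

// A materialized, read-only snapshot of an object at a given version.
struct MaterializedVersion {
    version: Version,
    payload: Vec<u8>,
}

// Fixed-capacity ring holding the most recent committed versions of one
// object, ordered oldest..newest.
struct VersionRing {
    slots: Vec<Arc<MaterializedVersion>>,
}

impl VersionRing {
    fn push(&mut self, v: Arc<MaterializedVersion>) {
        if self.slots.len() == VERSION_CACHE_SLOTS {
            self.slots.remove(0); // evict the oldest cached version
        }
        self.slots.push(v); // new version goes to the tail of the ring
    }
    fn latest(&self) -> Option<&Arc<MaterializedVersion>> {
        self.slots.last()
    }
}

struct VersionManager {
    table: HashMap<ObjectId, VersionRing>,
}

impl VersionManager {
    fn new() -> Self {
        Self { table: HashMap::new() }
    }

    // On object initialization, store a copy of the object in the table.
    fn init(&mut self, id: ObjectId, payload: Vec<u8>) {
        let first = Arc::new(MaterializedVersion { version: 0, payload });
        self.table.insert(id, VersionRing { slots: vec![first] });
    }

    // After a committed modification, materialize a new read-only version
    // and append its pointer at the tail of the ring.
    fn publish(&mut self, id: ObjectId, post_image: Vec<u8>) {
        let ring = self.table.get_mut(&id).expect("unknown object id");
        let next = ring.latest().map(|v| v.version + 1).unwrap_or(0);
        ring.push(Arc::new(MaterializedVersion { version: next, payload: post_image }));
    }
}
```

Whether `publish()` runs on a dedicated versioning thread downstream of the log manager or on the mutating worker thread is exactly the open question noted above; the data structure is the same in either case.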
Storage
Our overall approach is close to the delta storage approach of existing MVCC databases; we maintain
the most recent version of all objects on disk and in-memory, and we maintain deltas generated from
effectful nanotransactions in the form of post-images in associated transaction log entries. If we
need to materialize a version that is not already cached, we need to "walk" the chain of deltas,
starting from the oldest version available to us through the version manager and repeatedly applying
the equivalent of logical undo() operations until we arrive at the desired version.
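The delta walk can be sketched as follows. This is a toy model under stated assumptions: objects are flattened to a `Vec<i64>`, each pre-image delta restores a single slot, and the `Snapshot`/`Delta` names are invented for illustration. Starting from the oldest version the version manager still holds, we apply logical undo steps (newest pre-image first) until we reach the requested version.

```rust
#[derive(Clone)]
struct Snapshot {
    version: u64,
    data: Vec<i64>, // toy stand-in for a materialized object
}

// One pre-image delta: restore slot `index` to `value`, rolling the
// snapshot back to `to_version`.
struct Delta {
    to_version: u64,
    index: usize,
    value: i64,
}

// Walk the chain of pre-images (ordered newest to oldest) from the oldest
// cached snapshot until we arrive at `target`, or report failure if the
// chain does not reach that far back.
fn materialize_past(oldest_cached: &Snapshot, pre_images: &[Delta], target: u64) -> Option<Snapshot> {
    let mut cur = oldest_cached.clone();
    for d in pre_images {
        if cur.version <= target {
            break;
        }
        cur.data[d.index] = d.value; // logical undo of one modification
        cur.version = d.to_version;
    }
    if cur.version == target { Some(cur) } else { None }
}
```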
The reason why our approach is "close" to delta storage is that we unfortunately have to store as many copies of the entire materialized object in memory as there are slots in each version cache. Ideally, we would be maintaining a page cache and doing direct I/O from disk; given this infrastructure, we could avoid the above storage amplification by enforcing page-level copy-on-write. With that said, switching over to direct I/O and implementing a performant cache seems like more trouble than it's worth at the present moment, so for now we will just accept the fact that we incur high storage amplification and rely on Rust's clone-on-write smart pointer types.
Nanotransaction Scheduling
In the presence of object versioning, the worker-local scheduler will have to be modified to accommodate the following scenarios:
- Read-only nanotransactions: This class of nanotransactions is unconstrained in terms of which executor thread they can run on, so the scheduler can apply any constant-time heuristic (e.g. least-in-flight) to schedule.
- Read-write nanotransactions: These are nanotransactions that have both mutable and immutable object parameters, or nanotransactions that allocate an object and initialize it with data extracted from a set of input, read-only objects. For this class of nanotransactions, the scheduler will apply its existing scheduling logic, i.e. they will be executed on a core that locally owns all the objects that are passed in as arguments, regardless of whether the nanotransaction mutates the target object or not. This is to maintain the serializability of individual nanotransaction histories.
Nanotransactions that just issue blind writes (i.e. write-only nanotransactions) to a set of input objects are treated the same as read-write nanotransactions.
As an example, assume objects 123 and 234, and three nanotransaction definitions,
ro_nando(&obj) (read-only nanotransaction), rw_nando(&obj_1, &mut obj_2) (read-write), and
wo_nando(&mut obj) (write-only). The below figure shows a snapshot of the job queues of a worker
with two executor threads, after a mix of the above nanotransactions have been scheduled.
All nanotransaction activations that intend to modify at least one of the two objects would get
scheduled on whatever executor thread owns the current object locally, while the read-only
nanotransactions on the same objects are free to be scheduled on any thread according to capacity.
Until the instances of ro_nando() start running in thread #2, we do not know which specific
version of the target objects they will read -- all that the system guarantees is that they will read
the most recently committed version of the target object (see the object
resolution section for more). The two executor threads will proceed
independently with the execution of their respective queues without needing to synchronize at any point.
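The placement rule above can be condensed into a small dispatch function. This is a sketch, not the scheduler's actual interface: the enum and function names are assumptions, blind writes are folded into the read-write class as described, and the least-in-flight heuristic is reduced to a precomputed thread index.

```rust
#[derive(Debug, PartialEq)]
enum NandoClass {
    ReadOnly,
    // Read-write and write-only (blind-write) nandos are treated the same.
    ReadWrite,
}

// Pick an executor thread for a nanotransaction activation.
// `owner_thread` is the thread that locally owns all object arguments;
// `least_loaded` is whatever a constant-time heuristic (e.g. least-in-flight)
// currently suggests.
fn pick_thread(class: &NandoClass, owner_thread: usize, least_loaded: usize) -> usize {
    match class {
        // Read-only nandos are unconstrained: any executor thread will do.
        NandoClass::ReadOnly => least_loaded,
        // Effectful nandos must run where their objects are owned, to keep
        // per-object nanotransaction histories serializable.
        NandoClass::ReadWrite => owner_thread,
    }
}
```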
Nanotransaction Classification at Scheduling Time
Since the only information about the nanotransaction activation available to the nanotransaction
scheduler is the target nanotransaction's name and its arguments (in the form of IPtr instances in
the case of object parameters), we have no way of determining the class a particular nanotransaction
belongs to. To address this, the nandoize macro will have to be modified to generate that metadata
as part of nandoization.
This metadata will be made available to the scheduler either programmatically (by generating
something like the resolve_function() resolver per processed library, e.g. get_operation_class(name: str) -> NandoClassEnum) or by encoding that metadata in an auxiliary, well-known, per-library binary file
that is loaded by the scheduler during library loading.
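Under the programmatic option, the code that nandoize emits per library could look roughly like the following. The match arms here are hand-written for the three example nandos from the previous section; in generated code they would be derived from the mutability of each nando's object parameters. `NandoClassEnum` mirrors the name used above, and the string-based lookup is an assumption about the generated interface.

```rust
#[derive(Debug, PartialEq)]
enum NandoClassEnum {
    ReadOnly,
    ReadWrite, // also covers write-only (blind-write) nandos
}

// Hypothetical per-library classifier emitted during nandoization,
// analogous to the existing resolve_function() resolver.
fn get_operation_class(name: &str) -> Option<NandoClassEnum> {
    match name {
        "ro_nando" => Some(NandoClassEnum::ReadOnly),
        "rw_nando" | "wo_nando" => Some(NandoClassEnum::ReadWrite),
        _ => None, // unknown nando: not defined in this library
    }
}
```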
Versioned Object Resolution
The nandoize macro will have to be further modified to generate the correct resolution code for
versioned objects.
Currently our procedural macro emits calls to the resolve_object!() macro for all of an
activation's object arguments. The resolver code is then executed by the executor thread that the
activation is scheduled on as part of the nanotransaction's "preamble".
A new macro, resolve_versioned_object!(), will resolve the target object from the worker's version
manager -- if a version is not specified explicitly, the result will be the most recently committed
version of the object; if a version is specified, the version manager will return a cached version
(if one still exists) or attempt to re-materialize the specified object version. Explicit versioning
will probably only be used in epics for now, so we can move this materialization off the
executor threads' hot path.
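The resolution rule that resolve_versioned_object!() would expand into can be sketched as a plain function. Everything here is an assumption about shape, not the macro's actual expansion: the cache is modeled as a slice ordered oldest to newest, and re-materialization is abstracted as a callback (the slow path that walks the delta chain, kept off the hot path).

```rust
#[derive(Clone, Debug, PartialEq)]
struct Versioned {
    version: u64,
    data: i64, // toy object payload
}

// Resolve an object from the version manager's cache.
// - `requested == None`: return the most recently committed version.
// - `requested == Some(v)`: return the cached copy of version `v` if one
//   still exists, otherwise fall back to re-materialization.
fn resolve_versioned(
    cache: &[Versioned],
    requested: Option<u64>,
    rematerialize: impl Fn(u64) -> Option<Versioned>,
) -> Option<Versioned> {
    match requested {
        None => cache.last().cloned(),
        Some(v) => cache
            .iter()
            .find(|c| c.version == v)
            .cloned()
            .or_else(|| rematerialize(v)),
    }
}
```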
Epics and Multiversioning
This section provides a sketch for how epics will interact with the versioning subsystem; a detailed integration design is beyond the scope of this page.
Epics will run under at least two different isolation modes -- a weak one that immediately externalizes the effects of the individual nanotransactions that are running as part of the epic, and a stronger, SI-like mode that establishes a snapshot at the start of the epic entry point and externalizes the epic's effects at the end of the epic. As mentioned in the introduction, the first mode is relatively straightforward, so we will focus on snapshot-based epics.
There are two issues that need to be addressed:
- Maintaining per-object version constraints, so that as the epic unfolds and new objects are added to the epic's read set, we refine a constraint set over the object versions that the epic may access, based on the versions it has already accessed.
- Merging the effects of an epic (buffered while the epic is in progress in e.g. shadow objects) into the primary object copy (mmap-backed version that independent nanotransactions access).
The solution to the first issue will probably be independent of the per-instance version manager. We want versioning constraints to be persisted and shareable, so ideally we will store constraints on a per-object basis, in the object's metadata. These constraints will be modified any time a transaction executes over multiple objects, as part of the nanotransaction itself. The execution engine will then be responsible for reading the version constraints in the metadata of the target objects and for supplying versions that satisfy the established constraints when requesting an object version from the version manager.
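One very rough way to make "refining a constraint set" concrete is to represent each per-object constraint as an inclusive version interval and refine by intersection; an empty intersection means the epic cannot pick a consistent version and must abort. The interval representation is purely an assumption for illustration; the design above only requires that constraints be refinable, persistable, and shareable.

```rust
// Hypothetical per-object version constraint, stored in object metadata.
#[derive(Clone, Copy, Debug, PartialEq)]
struct VersionConstraint {
    lo: u64, // oldest acceptable version (inclusive)
    hi: u64, // newest acceptable version (inclusive)
}

impl VersionConstraint {
    // Refine this constraint against another one (e.g. one derived from a
    // version the epic already read). `None` means the constraints are
    // unsatisfiable and the epic has to abort.
    fn refine(self, other: VersionConstraint) -> Option<VersionConstraint> {
        let lo = self.lo.max(other.lo);
        let hi = self.hi.min(other.hi);
        if lo <= hi { Some(VersionConstraint { lo, hi }) } else { None }
    }
}
```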
To merge the effects of an epic at the time when the epic "commits", we will create a new system
nanotransaction, apply(). This transaction will be passed the epic's complete write set, together
with a set of transaction log entries (generated by the nanotransactions in the epic and buffered by
the system) that contain the epic's writes to the objects in the write set. apply() needs to
perform a version check over all the objects in the write set -- if the version of any current
in-memory object differs from the version of that object as recorded in the epic's read set,
then apply() exits and the epic is aborted/rolled back. If, on the other hand, all version checks
succeed, the epic will commit, and so apply() is free to fast-forward the target objects by
effectively running the redo logic from recovery, and then to pass the set of transaction log
entries to the worker's log manager to append to the log.
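The version check at the heart of apply() amounts to a set-wide comparison, sketched below under assumed names: the epic's read set is modeled as a map from object ID to the version the epic observed, and the check passes only if every such object still sits at that version. Redo application and log appending are omitted.

```rust
use std::collections::HashMap;

// Returns true iff every object the epic read is still at the version the
// epic observed; any mismatch means a concurrent commit intervened and the
// epic must be aborted/rolled back.
fn version_check(
    read_set: &HashMap<u64, u64>, // object id -> version the epic read
    current: &HashMap<u64, u64>,  // object id -> current committed version
) -> bool {
    read_set.iter().all(|(id, v)| current.get(id) == Some(v))
}
```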