Nanotransactions - v1

Execution Engine Overview

Basic Architecture Diagram

Requests for nanotransaction activations arrive at the local instance of the Activation Router component -- for now, think of this as the global component of a two-tier scheduler. If the activation router determines that the given nanotransaction can execute locally (potentially after waiting for incoming data dependencies to arrive at the present host), it forwards the nanotransaction request to the Transaction Manager (1). The transaction manager logs the intent to run the given nanotransaction and then forwards the request to the local Nando Scheduler (2), receiving back a future that it can await for the nanotransaction's termination. The rest of the nanotransaction's lifecycle is synchronous. The local scheduler constructs a nanotransaction activation and enqueues it for execution on the Nando Executor (3). The executor outputs the nanotransaction's execution status, along with any new log entries that need to be appended to the local log, to the Log Manager (4), which is also responsible for informing the local scheduler of the outcome of the nanotransaction (5). At this point, the future returned initially over (2) is completed.

Component Breakdown

Activation Router

For incoming nanotransaction activation requests (say, from a local magpie REPL or from another magpie instance on the same cluster), the activation router is responsible for ensuring that the nanotransaction can be scheduled locally (either by awaiting in-progress data dependency movements or by initiating them itself), and for either forwarding the request to a different host (eventually) or submitting it for local execution across interface (1) to the local Transaction Manager.

It is also possible for the Transaction Manager to notify the activation router of the need to schedule a nanotransaction on a remote host (although this will probably not need to happen until we want to support epics).

Finally, this component is responsible for notifying the request's source of its eventual status (success / failure / fwd / etc.).

Transaction Manager

The interface between the application execution environment and the nanotransaction execution infrastructure. It is responsible for recording events to the local intent log (not pictured) on all incoming event edges (for transaction recovery and lineage tracking), for forwarding local execution requests to the local scheduler, and for forwarding requests for remote transactions to the Activation Router.

Nando Scheduler

Local scheduler responsible for scheduling activations that are ready to be executed (in terms of dependencies etc.) to the appropriate task queue feeding into the local executor. The scheduler is somewhat critical to the correct execution of nanotransactions, as it also enforces mutual exclusion (by ensuring that any "active" object is exclusively owned by exactly one thread in the local executor at any given time). Some more detail around this is provided in the "Local Scheduling" section.
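The mutual-exclusion invariant can be sketched as follows. This is only an illustration under assumed names (SchedulerSketch, pick_queue) and an assumed hash-based initial placement -- the real policy would also weigh locality, as discussed below:

```rust
use std::collections::HashMap;

type ObjectId = u64;
type QueueId = usize;

// Sketch: the scheduler remembers which executor queue (and hence which
// thread) owns each active object, and routes every activation touching
// that object to the same queue.
struct SchedulerSketch {
    num_queues: usize,
    owner: HashMap<ObjectId, QueueId>, // active object -> exclusive owner
}

impl SchedulerSketch {
    fn new(num_queues: usize) -> Self {
        Self { num_queues, owner: HashMap::new() }
    }

    // Route to the existing owner, or assign one (here: trivial hash
    // placement, standing in for a locality-aware policy).
    fn pick_queue(&mut self, obj: ObjectId) -> QueueId {
        let n = self.num_queues;
        *self.owner.entry(obj).or_insert((obj as usize) % n)
    }
}

fn main() {
    let mut s = SchedulerSketch::new(4);
    let q = s.pick_queue(42);
    // Repeated activations on the same object land on the same queue,
    // which is what yields single-threaded access to the object.
    assert_eq!(q, s.pick_queue(42));
}
```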

Nando Executor

A pool of N threads (configurable), each of which is pinned to a hardware core, consuming nanotransaction activations from its incoming queue (produced by the local scheduler) and outputting log records (and, in the case of epics, continuations) to the local log manager. Inspired by various thread-per-core models, the goal here is to minimize latency on the nanotransaction execution path by executing many small units of work (nanotransactions) in a way that maximizes locality and minimizes both context switching and the need for synchronization between execution threads.
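A minimal sketch of one such worker, assuming a plain channel per thread; core pinning and the actual nanotransaction body are elided, and the Activation/LogRecord shapes are placeholders, not the real data structures:

```rust
use std::sync::mpsc;
use std::thread;

// Placeholder shapes for this sketch.
struct Activation { txn_id: u64 }
struct LogRecord { txn_id: u64, ok: bool }

// One executor worker: drain activations from its queue, run each one,
// and hand the resulting log record off to the log manager's channel so
// that persistence never happens on the execution path.
fn run_worker(rx: mpsc::Receiver<Activation>, log_tx: mpsc::Sender<LogRecord>) {
    for act in rx {
        // Execute the nanotransaction body here (elided); assume success.
        let _ = log_tx.send(LogRecord { txn_id: act.txn_id, ok: true });
    }
}

fn main() {
    let (act_tx, act_rx) = mpsc::channel();
    let (log_tx, log_rx) = mpsc::channel();
    let worker = thread::spawn(move || run_worker(act_rx, log_tx));

    act_tx.send(Activation { txn_id: 7 }).unwrap();
    drop(act_tx); // close the queue so the worker loop exits

    let rec = log_rx.recv().unwrap();
    assert_eq!(rec.txn_id, 7);
    assert!(rec.ok);
    worker.join().unwrap();
}
```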

Log Manager / Commit Engine

A component dedicated to flushing log entries to stable storage. This component also encapsulates the commit/abort logic of the transaction execution system. In the initial design, all nanotransactions should commit barring exceptional circumstances, so the commit logic is essentially a no-op (or rather a list of assertions to make sure we didn't mess anything up in the scheduling and execution layers) -- however, it becomes important for supporting epics, and hence fleshing it out is reserved for that design.
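The "commit logic as a list of assertions" idea might look like the following sketch, checking one plausible invariant: an object that appears in both the read set and the write set must advance by exactly one version. The Version shape mirrors the log-entry structures later in this document, but assert_commit_ok is an illustrative name:

```rust
use std::collections::HashMap;

// Mirrors the Version struct used in the transaction log entry.
struct Version { object_id: u64, version: u64 }

// v1 commit "logic": everything should commit, so this only sanity-checks
// what the scheduling and execution layers produced.
fn assert_commit_ok(read_set: &[Version], write_set: &[Version]) {
    let reads: HashMap<u64, u64> =
        read_set.iter().map(|v| (v.object_id, v.version)).collect();
    for w in write_set {
        if let Some(r) = reads.get(&w.object_id) {
            // An object both read and written must advance by exactly one.
            assert_eq!(w.version, r + 1);
        }
    }
}

fn main() {
    let reads = vec![Version { object_id: 1, version: 3 }];
    let writes = vec![Version { object_id: 1, version: 4 }];
    assert_commit_ok(&reads, &writes); // 4 == 3 + 1, so commit proceeds
}
```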

Design Discussion

Motivating this design

The above kind of comes out of frustration on my part around the lack of control in pure async runtimes like tokio. Scheduling is opaque, metrics are confusing, and at the end of the day it is a lot of technical baggage for a relatively narrow purpose (we are not trying to power the same kinds of use cases as things like tokio, so we can lose some of the generality).

The TPC-style setup of the execution engine was influenced by the design of "modern", low-latency systems (e.g. LMAX, TigerBeetle, ScyllaDB) and the rationale behind them, namely: you will go fast if you avoid cross-thread synchronization as much as possible (hence the ownership of each active object by at most one thread), maximize the mileage you get out of your cache (with the thread-per-object model and the scheduling policy working together to leverage both spatial and temporal locality), and minimize context switching (in which we are aided by the constraints we place on individual nandos and what they can actually do at runtime; separating the logging functionality into another thread/component is also in service of this target).

Local Scheduling

The job of the local scheduler is to enforce single-threaded access to the objects being operated on by nanotransactions at any given time. Its scheduling policy should strive to minimize (or eliminate) thrashing of local objects between cores, leveraging locality in a working set.

The scheduler can receive a request for execution from two different components: the local Transaction Manager, because of an "external" request for a nanotransaction activation on the given machine, or (in the case of epics) from the local "Commit Engine".

The commit engine is responsible for inspecting the output of "yielding" executions that have requested spawns of downstream subcomputations (in the form of nando activations), and for forwarding them to the local scheduler.

Object Versioning

Each object is associated with a version (starting at 0), which increases in strict monotonic order as updates are applied to it. Versions reside in an object's header, and are incremented by the owning executor thread after any nanotransaction invocation that modifies the object's contents. A transaction's log entry includes version information for both its read set and write set (which might include one or more output objects that this transaction created, again at version 0).
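A sketch of the bump described above, assuming a minimal object header; bump_after_write is an illustrative name for what the owning executor thread does after a mutating invocation:

```rust
// Minimal stand-in for an object header carrying the version.
struct ObjectHeader { object_id: u64, version: u64 }

// After a nanotransaction modifies the object, the owning thread records
// the version it observed (read set) and the bumped version (write set).
fn bump_after_write(hdr: &mut ObjectHeader) -> (u64, u64) {
    let observed = hdr.version; // version the nando read
    hdr.version += 1;           // strictly monotonic increment
    (observed, hdr.version)     // (read-set version, write-set version)
}

fn main() {
    // New objects start at version 0.
    let mut hdr = ObjectHeader { object_id: 7, version: 0 };
    assert_eq!(bump_after_write(&mut hdr), (0, 1));
    assert_eq!(bump_after_write(&mut hdr), (1, 2));
}
```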

Versions are available to all components of the system -- they can be specified explicitly in a nanotransaction's activation request as part of the nanotransaction's input, but only for read dependencies. These versions can be materialized if they are not already present locally (because of an object move, garbage collection, etc.).

Object versions are not user-visible or user-accessible, so the only case in which the system requests an object at a specific version is for read-only transactions that do not need access to the latest version of an object. In that case, a recent version is selected by the runtime during transaction submission, and the nanotransaction is permitted to proceed (potentially on a different executor thread than the one that presently owns the target object).

Absent a version specification, the runtime will use the latest version of an object. Depending on the access patterns for a given object (and the associated complexity and benefits), it might be appropriate to attempt to keep two live versions of an object around, a "back buffer" for writes and a "front buffer" that is one version behind the "head" of an object and is used to service reads (in order to avoid blocking write nanotransactions).
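The front/back buffer idea could be sketched like this; DoubleBuffered and publish are illustrative names, and cross-thread synchronization (which the real design would need) is elided:

```rust
// Two live copies of an object: the "front" serves reads one version
// behind the head, the "back" absorbs writes; publishing swaps the roles.
struct DoubleBuffered<T: Clone> {
    bufs: [T; 2],
    front: usize, // index of the buffer serving reads
    version: u64, // version of the front buffer
}

impl<T: Clone> DoubleBuffered<T> {
    fn new(v: T) -> Self {
        Self { bufs: [v.clone(), v], front: 0, version: 0 }
    }

    fn read(&self) -> (&T, u64) {
        (&self.bufs[self.front], self.version)
    }

    // Write nanotransactions mutate the back buffer, never blocking reads.
    fn write_back(&mut self, f: impl FnOnce(&mut T)) {
        let back = 1 - self.front;
        f(&mut self.bufs[back]);
    }

    // Make the freshly written back buffer visible to readers.
    fn publish(&mut self) {
        self.front = 1 - self.front;
        self.version += 1;
    }
}

fn main() {
    let mut obj = DoubleBuffered::new(0u32);
    obj.write_back(|v| *v = 10); // readers still see the old front copy
    assert_eq!(*obj.read().0, 0);
    obj.publish();
    assert_eq!(obj.read(), (&10, 1));
}
```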

Technical Reference

Data Structures

Intent Log Entry

enum IntentKind {
    OWNERSHIP_CHANGE,
    USER_TRANSACTION_EXECUTION,
}

enum IntentLogEntryStatus {
    START,            // set before sending top-level transaction to nando scheduler
    SUCCESS,          // successful commit response from scheduler
    ABORTED(String),  // abort notification from scheduler, with the reason for the abort
}

struct IntentLogEntry {
    kind: IntentKind,
    txn_id: TxnId,
    status: IntentLogEntryStatus,
    timestamp: PersistableTimestamp,
}

All intended actions on objects are first entered in the owning instance's local intent log. This includes both application transaction execution, as well as internal operations like ownership changes.

For each submitted transaction, there will always be an entry with status START. For each completed transaction, there will also be an entry at a strictly greater timestamp with status either SUCCESS or ABORTED, and a reason for the abort if any (which might be application-specific error codes or system abort codes).
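This invariant yields a natural recovery rule: after a crash, any transaction with a START entry but no later terminal entry is in doubt. A sketch, with simplified entry shapes and in_doubt as an illustrative name:

```rust
use std::collections::HashSet;

// Simplified mirrors of the intent-log shapes above.
enum Status { Start, Success, Aborted(String) }
struct Entry { txn_id: u64, status: Status, timestamp: u64 }

// Scan the log in order: a START opens a transaction, a terminal entry
// closes it; whatever remains open at the end is in doubt.
fn in_doubt(log: &[Entry]) -> Vec<u64> {
    let mut open = HashSet::new();
    for e in log {
        match &e.status {
            Status::Start => { open.insert(e.txn_id); }
            Status::Success | Status::Aborted(_) => { open.remove(&e.txn_id); }
        }
    }
    let mut ids: Vec<u64> = open.into_iter().collect();
    ids.sort();
    ids
}

fn main() {
    let log = vec![
        Entry { txn_id: 1, status: Status::Start, timestamp: 10 },
        Entry { txn_id: 2, status: Status::Start, timestamp: 11 },
        Entry { txn_id: 1, status: Status::Success, timestamp: 12 },
    ];
    // txn 2 started but never terminated: it is in doubt after a crash.
    assert_eq!(in_doubt(&log), vec![2]);
}
```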

Integrating "destructive" system operations like ownership changes into the transaction path seems like a good design choice, as it restricts the number of concurrent (i.e. unsafe) interactions we need to consider when building the system. Under this model, an ownership change is a "linear" process from the instance's activation router receiving the ownership change request, through the transaction system (which places a tombstone for the moved data in its local log for any lagging requests targeted at the object under ownership transfer), to the component that is responsible for the shuffling of data between hosts.

Transaction Log Entry

struct Image<T> {
    field: IPtr,  // IPtr is a triple of (object_id, offset, size)
    pre_value: T,
    post_value: Option<T>,
}

struct Version {
    object_id: ObjectId,
    version: u64,
}

struct TransactionLogEntry {
    txn_id: TxnId,
    // The type parameter of Image is elided in this sketch; concretely
    // these would be type-erased or serialized images.
    images: Vec<Image>,
    read_set: Vec<Version>,
    // Object IDs in write_set that also appear in read_set
    // will be accompanied by a version that
    // is the old version incremented by one
    write_set: Vec<Version>,
    timestamp: PersistableTimestamp,
}

Instances of TransactionLogEntry are constructed by worker threads in the Nando Executor during nanotransaction execution, and are then passed off to the Log Manager over (4) (most likely over a queue) for persistence.

In case of an application-level abort() in the middle of executing a nanotransaction, it is the executor thread's responsibility to re-apply all pre-values from images to recover the state of the target objects. This could be made more elegant with shadow paging, but I am not sure what the relative overheads are or how the two approaches compare to each other.
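The undo path could be sketched as follows, flattening IPtr to a plain byte offset for illustration; undo is a hypothetical helper, not part of the design's interfaces:

```rust
// Simplified image: IPtr reduced to a byte offset into one target object.
struct Image { offset: usize, pre_value: u8, post_value: Option<u8> }

// On an application-level abort, walk the recorded images and restore
// each pre-image. Reverse order ensures that if two images touched the
// same location, the oldest pre-image wins.
fn undo(object: &mut [u8], images: &[Image]) {
    for img in images.iter().rev() {
        object[img.offset] = img.pre_value;
    }
}

fn main() {
    let mut obj = vec![1u8, 2, 3];
    let images = vec![
        Image { offset: 0, pre_value: 1, post_value: Some(9) },
        Image { offset: 2, pre_value: 3, post_value: Some(7) },
    ];
    obj[0] = 9; obj[2] = 7;  // the nando's writes so far
    undo(&mut obj, &images); // abort: roll them back
    assert_eq!(obj, vec![1, 2, 3]);
}
```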

An open question with regard to both kinds of log entries is whether or not we need a concept like an LSN, both for easier (semantic) association of entries (which could also be accomplished through the included txn_id) as well as managing the externalization of effects.

Interfaces

1. GC <-> TM

TODO

2. TM <-> Nando Scheduler

struct NandoHandle<T> {
    // PhantomData keeps the otherwise-unused result type parameter.
    _marker: std::marker::PhantomData<T>,
}

impl<T> Future for NandoHandle<T> {
    type Output = Result<T, TxnExecutionError>;
    /* ... */
}

struct ActivationIntent {
    txn_id: TxnId,
    /* ... */
}

struct Activation {
    txn_id: TxnId,
    /* ... */
}

fn schedule_activation<T>(activation_intent: ActivationIntent) -> NandoHandle<T> {
    /* ... */
}

ActivationIntent should contain the name of the nanotransaction we want to schedule and a list of its arguments, either as IPtr entries (for objects to be resolved) or as values (for any pass-by-value arguments). The Nando Scheduler translates this into a concrete Activation that can then be executed by Nando Executor threads.
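That translation might be sketched like this, with Arg, ResolvedArg, and translate as illustrative stand-ins for the real types:

```rust
// Stand-ins for the argument forms an ActivationIntent can carry.
#[derive(Clone)]
enum Arg {
    Ptr(u64),   // an IPtr-like reference to be resolved by the scheduler
    Value(i64), // a pass-by-value argument, carried through unchanged
}

enum ResolvedArg {
    Object(String), // a resolved object handle (stand-in representation)
    Value(i64),
}

struct ActivationIntent { txn_id: u64, nando: String, args: Vec<Arg> }
struct Activation { txn_id: u64, nando: String, args: Vec<ResolvedArg> }

// The scheduler resolves pointer arguments to concrete object handles and
// produces an Activation that executor threads can run directly.
fn translate(intent: ActivationIntent, lookup: impl Fn(u64) -> String) -> Activation {
    let args = intent.args.into_iter().map(|a| match a {
        Arg::Ptr(id) => ResolvedArg::Object(lookup(id)),
        Arg::Value(v) => ResolvedArg::Value(v),
    }).collect();
    Activation { txn_id: intent.txn_id, nando: intent.nando, args }
}

fn main() {
    let intent = ActivationIntent {
        txn_id: 1,
        nando: "transfer".into(),
        args: vec![Arg::Ptr(42), Arg::Value(100)],
    };
    let act = translate(intent, |id| format!("obj-{}", id));
    assert_eq!(act.txn_id, 1);
    assert!(matches!(act.args[0], ResolvedArg::Object(_)));
    assert!(matches!(act.args[1], ResolvedArg::Value(100)));
}
```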

Although ActivationIntent and Activation might appear redundant at this point, their distinction becomes clearer in the epics document. For now it should suffice to say that an ActivationIntent can lead to the spawning of multiple connected Activations, but from the caller's perspective (i.e. Transaction Manager), that is opaque.