Architecture

Edit this page on GitHub

SirixDB is a temporal, append-only database that never overwrites data. Every transaction commit creates an immutable snapshot. It uses copy-on-write with path copying to share unchanged data between revisions, keeping storage minimal. The storage engine is log-structured and optimized for flash drives — sequential writes only, no WAL, no compaction.

Node-Based Document Model

Unlike document databases that store JSON as opaque blobs, SirixDB decomposes each document into a tree of fine-grained nodes. Each node has a stable 64-bit nodeKey that never changes across revisions. Nodes are linked via parent, child, and sibling pointers, enabling O(1) navigation in any direction.

Field names are stored once in an in-memory dictionary and referenced by 32-bit keys, saving space when the same field appears thousands of times.

JSON Tree Encoding Input JSON {"name":"Alice","scores":[95,87]} DOC key=0 OBJECT key=1 OBJ_KEY "name" key=2 STR "Alice" key=3 OBJ_KEY "scores" key=4 ARRAY key=5 NUM 95 key=6 NUM 87 key=7 Structure node Object key Value node

Each node type maps directly to a JSON construct: OBJECT, ARRAY, OBJECT_KEY, STRING_VALUE, NUMBER_VALUE, BOOLEAN_VALUE, and NULL_VALUE. Navigation between nodes is O(1) via stored pointers — no scanning required.

Copy-on-Write with Path Copying

When a transaction modifies data, SirixDB doesn’t rewrite existing pages. Instead, it copies only the modified page and its ancestor path to the root. All unchanged pages are shared between the old and new revision via pointers. This is the same principle used in persistent data structures and ZFS.

New / copied page Modified page Shared pointer (unchanged) Uber RevRoot Indirect Indirect Page A Page B Page C Page D Uber RevRoot Indirect' Page A' Uber RevRoot Indirect' Page D' only changed path copied On-Disk Layout (append-only) revisions Rev 1 Rev 2 Rev 3 data Rev 1 pages Δ Rev 2 Δ Rev 3 Time Rev 1 Rev 2 Rev 3

This means a revision that modifies a single record only writes the modified page plus its ancestor path — typically 3-4 pages. A 10 GB database with 1,000 revisions and 0.1% change each requires roughly 20 GB total, not 10 TB.

Physically, each resource is stored in two append-only files: a data file for page content (IndirectPages, RecordPages) and a revisions file for revision metadata (RevisionRootPages, UberPage references). The UberPage is conceptually at the top — it references the latest RevisionRootPage — and is always written last as an atomic operation. Even if a crash occurs mid-commit, the previous valid state is preserved.

Page Structure

Each resource is organized as a trie of pages. The RevisionRootPage is the entry point for a single revision, branching into subtrees for data, indexes, and metadata.

Page Hierarchy (single revision) UberPage IndirectPages RevisionRootPage author, timestamp, commit message Data IndirectPages RecordPage RecordPage JSON/XML nodes PathSummary unique path trie NamePage field name dictionary PathPage / CASPage Index Tree 1 Index Tree 2 user-defined indexes

UberPage — The root entry point. Written last during a commit as an atomic operation. Contains a reference to the IndirectPage tree that addresses all revisions.

IndirectPages — Fan-out nodes (512 references each) that form the trie structure. Borrowed from ZFS, they store checksums in parent references for data integrity. Unused slots are tracked with a bitset for compact storage.

RevisionRootPage — Entry point for a single revision. Stores the author, timestamp, and optional commit message. Branches to the data trie, path summary, name dictionary, and index pages.

RecordPages — Leaf pages storing up to 1024 nodes each. These are the pages that get versioned by the sliding snapshot algorithm.

Sliding Snapshot Versioning

SirixDB doesn’t just copy entire pages on every change. It versions RecordPages at a sub-page level, storing only changed records. The sliding snapshot algorithm, developed by Marc Kramis, avoids the trade-off between read performance and write amplification that plagues traditional approaches.

Page Versioning Strategies Full Copy Incremental Sliding Snapshot Rev 1 Rev 2 Rev 3 Rev 4 Rev 5 A B C D A B' C D A B' C' D A' B' C' D A' B' C' D' Fast reads, wasteful writes A B C D B' C' A' B' C' D write spike! D' Compact, but periodic write spikes A B C D B' C' + A,D A' + B',C' D' + A' Bounded reads, no spikes
Strategy Reads to reconstruct Write cost per revision Write spikes?
Full 1 page Entire page (all records) No
Incremental Up to all revisions Only changed records Yes (periodic full dump)
Differential 2 pages All changes since last full dump Yes (growing deltas)
Sliding Snapshot At most N fragments Changed + expired records No

The sliding snapshot uses a window of size N (typically 3-5). Changed records are always written. Records older than N revisions that haven’t been written are carried forward. This guarantees that at most N page fragments need to be read to reconstruct any page — regardless of total revision count.

For details, see Marc Kramis’s thesis: Evolutionary Tree-Structured Storage: Concepts, Interfaces, and Applications.

Secondary Indexes

SirixDB supports three types of user-defined secondary indexes, all stored in the same versioned trie structure as the data. Indexes are part of the transaction and version with the data — the index at revision 42 always matches the data at revision 42.

Path Summary

Every resource maintains a compact path summary — a trie of all unique paths in the document. Each unique path gets a path class reference (PCR), a stable integer ID. Nodes in the main data tree reference their PCR, enabling efficient path-based lookups.

Path Summary and Index Architecture Data Tree OBJECT KEY "users" KEY "config" ARRAY OBJ OBJ "name" "Alice" "name" "Bob" Path Summary / PCR=0 users PCR=1 config PCR=5 [] PCR=2 name PCR=3 age PCR=4 Index Types Name Index hash("name") → {key5,key12} Path Index PCR=3 → {key5, key12} CAS Index (PCR=4, 30) → {key7}

Index Types

Index Key Use case
Name Field name hash → node keys Find all nodes named "email" regardless of path
Path PCR → node keys Find all nodes at path /users/[]/name
CAS (PCR + typed value) → node keys Find all users where age > 30 on path /users/[]/age

CAS (content-and-structure) indexes are the most selective — they index both the path and the typed value, enabling efficient range queries. All indexes are stored in balanced binary search trees (Red-Black trees) within the same versioned page structure.

For the JSONiq API to create and query indexes, see the Function Reference.

Further Reading