Architecture


What if your database never forgot anything, and you could query any point in its history as fast as querying the present — without storage exploding?

SirixDB is a bitemporal node store that makes version control a first-class citizen of the storage engine. Every commit creates an immutable snapshot. Every revision is queryable. And unlike naive approaches that copy everything (git-style) or maintain expensive logs (event sourcing), SirixDB achieves this through structural sharing and a novel sliding snapshot versioning algorithm.

| Feature | What it means | Why it matters |
| --- | --- | --- |
| Bitemporal | Every revision preserved | Git-like history for your data |
| Append-Only | No in-place updates | No WAL needed, crash-safe by design |
| Copy-on-Write | Modified pages copied, unchanged shared | O(Δ) storage per revision |
| Structural Sharing | Unchanged subtrees reference existing pages | Billion-node docs with small revisions |
| Log-Structured | Sequential writes only | SSD-friendly, no random write I/O |

Node-Based Document Model

Unlike document databases that store JSON as opaque blobs, SirixDB decomposes each document into a tree of fine-grained nodes. Each node has a stable 64-bit nodeKey that never changes across revisions. Nodes are linked via parent, child, and sibling pointers, enabling O(1) navigation in any direction.

Field names are stored once in an in-memory dictionary and referenced by 32-bit keys, saving space when the same field appears thousands of times.

Why Node Storage Matters

Most document databases (MongoDB, CouchDB) treat a document as an opaque blob: store it, retrieve it, replace it. SirixDB understands the structure of your data — and that changes everything.

| Aspect | Document Store | SirixDB Node Store |
| --- | --- | --- |
| Size limits | 16 MB (MongoDB), 400 KB (DynamoDB) | Unlimited — nodes stored independently |
| Update granularity | Replace entire document | Write only changed nodes |
| Query efficiency | Load doc, filter in app | Navigate directly to target nodes |
| Memory footprint | Entire doc in memory | Stream nodes, never load full tree |
| Versioning granularity | "Document changed" | "These specific nodes changed" |
| Diff precision | "Something's different" | Exact path to every modified node |

This means: a 100 GB JSON dataset? Store it as one logical resource — no sharding. Change one field in a deeply nested object? SirixDB writes just that node plus the modified path. Need the history of a specific field across 1,000 commits? That’s a fast index lookup, not a full-document diff.

[Figure: JSON tree encoding. The input {"name":"Alice","scores":[95,87]} becomes DOC (key 0), OBJECT (key 1), OBJECT_KEY "name" (key 2), STRING "Alice" (key 3), OBJECT_KEY "scores" (key 4), ARRAY (key 5), NUMBER 95 (key 6), NUMBER 87 (key 7). Legend: structure node, object key, value node.]

Each node type maps directly to a JSON construct: OBJECT, ARRAY, OBJECT_KEY, STRING_VALUE, NUMBER_VALUE, BOOLEAN_VALUE, and NULL_VALUE. Navigation between nodes is O(1) via stored pointers — no scanning required.
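As a rough illustration of this decomposition (the key assignment and node shapes here are simplified assumptions, not SirixDB's actual encoding), a JSON value can be flattened into a node table plus an interned name dictionary:

```python
import json

def decompose(value, nodes=None, names=None, parent=None):
    """Flatten a JSON value into nodeKey -> node, interning field names."""
    if nodes is None:
        nodes, names = {}, {}
    key = len(nodes)                     # illustrative stable key: insertion order
    if isinstance(value, dict):
        nodes[key] = {"type": "OBJECT", "parent": parent}
        for field, child in value.items():
            name_key = names.setdefault(field, len(names))  # name stored once
            field_key = len(nodes)
            nodes[field_key] = {"type": "OBJECT_KEY", "name": name_key, "parent": key}
            decompose(child, nodes, names, parent=field_key)
    elif isinstance(value, list):
        nodes[key] = {"type": "ARRAY", "parent": parent}
        for child in value:
            decompose(child, nodes, names, parent=key)
    elif isinstance(value, bool):
        nodes[key] = {"type": "BOOLEAN_VALUE", "value": value, "parent": parent}
    elif value is None:
        nodes[key] = {"type": "NULL_VALUE", "parent": parent}
    elif isinstance(value, str):
        nodes[key] = {"type": "STRING_VALUE", "value": value, "parent": parent}
    else:
        nodes[key] = {"type": "NUMBER_VALUE", "value": value, "parent": parent}
    return nodes, names

nodes, names = decompose(json.loads('{"name":"Alice","scores":[95,87]}'))
```

Note how "name" and "scores" land in the dictionary once, and every node carries a parent reference for constant-time upward navigation.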

Why Bitemporal Storage?

Traditional databases force painful workarounds for temporal queries — audit tables, change data capture pipelines, backup diffs. SirixDB eliminates all of that.

Query any point in history

A customer claims they were charged the wrong price. Your current database shows today’s price. What was the price at the moment of their order?

let $catalog := jn:open('shop', 'products', xs:dateTime('2024-01-15T15:23:47Z'))
return $catalog.products[?$$.sku eq "SKU-12345"].price

One query. Exact answer. No audit infrastructure required — the database remembers everything.

Structured diffs between any two points in time

A production outage started at 2:00 AM. What configuration changes were made since midnight?

let $midnight := jn:open('configs', 'production', xs:dateTime('2024-01-15T00:00:00Z'))
let $incident := jn:open('configs', 'production', xs:dateTime('2024-01-15T02:00:00Z'))
return jn:diff('configs', 'production', sdb:revision($midnight), sdb:revision($incident))

Returns a structured JSON diff showing exactly what was inserted, deleted, updated, and moved — with node-level precision.

No History Table Performance Tax

The “standard” approach to temporal data — history tables with valid_from and valid_to timestamps — comes with hidden costs: every index grows 3x (includes timestamps), every query needs timestamp range filters, updates require two writes (insert new + update old), and joining two temporal tables means intersecting validity intervals.

SirixDB sidesteps all of this. Indexes are scoped to a revision — query revision 42 and you get revision 42’s index, no filtering. Finding the right revision for a timestamp is a single O(log R) binary search, amortized across an entire session.
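The timestamp-to-revision lookup can be sketched as a plain binary search over per-revision commit timestamps (the list-of-commit-times layout here is an assumption for illustration):

```python
import bisect
from datetime import datetime, timezone

def revision_at(commit_times, ts):
    """Index of the newest revision committed at or before ts (O(log R))."""
    i = bisect.bisect_right(commit_times, ts) - 1
    if i < 0:
        raise ValueError("no revision exists at or before the given timestamp")
    return i

# Four revisions committed on Jan 1, 5, 10, and 15.
commits = [datetime(2024, 1, d, tzinfo=timezone.utc) for d in (1, 5, 10, 15)]
```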

Copy-on-Write with Path Copying

When a transaction modifies data, SirixDB doesn’t rewrite existing pages. Instead, it copies only the modified page and its ancestor path to the root. All unchanged pages are shared between the old and new revision via pointers. This is the same principle used in persistent data structures and ZFS.

[Figure: Copy-on-write path copying over revisions 1-3. Rev 1 writes Uber, RevRoot, Indirect, and Pages A-D. Rev 2 copies only Indirect' and Page A'; Rev 3 copies only Indirect' and Page D'. All unchanged pages are shared between revisions via pointers; only the changed path is copied.]

This means a revision that modifies a single record only writes the modified page plus its ancestor path — typically 3-4 pages. A 10 GB database with 1,000 revisions and 0.1% change each requires roughly 20 GB total, not 10 TB.
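The path-copying idea can be sketched on a toy persistent tree (the Page shape here is assumed for illustration, not SirixDB's on-disk format): updating one record copies only the root-to-leaf path, while sibling subtrees are shared by reference.

```python
class Page:
    def __init__(self, children=None, records=None):
        self.children = children  # list of child Pages (inner page)
        self.records = records    # dict of record_id -> value (leaf RecordPage)

def update(page, path, record_id, value):
    """Return a new revision root; only the touched path is copied."""
    if not path:                              # leaf: copy the RecordPage
        records = dict(page.records)
        records[record_id] = value
        return Page(records=records)
    slot = path[0]
    children = list(page.children)            # shallow copy: siblings stay shared
    children[slot] = update(page.children[slot], path[1:], record_id, value)
    return Page(children=children)

leaf_a, leaf_b = Page(records={1: "x"}), Page(records={2: "y"})
rev1 = Page(children=[leaf_a, leaf_b])
rev2 = update(rev1, [0], 1, "x'")             # new root + new leaf_a only

assert rev2.children[1] is leaf_b             # unchanged subtree is shared
assert rev1.children[0].records[1] == "x"     # old revision is untouched
```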

Physically, each resource is stored in two append-only logical devices (files). LD₁ stores page content (IndirectPages, RecordPages, NamePages) followed by a RevisionRootPage at the end of each revision’s data. LD₂ stores the UberPage — a sequence of timestamp + offset pairs, one per revision, pointing to the corresponding RRP in LD₁.

[Figure: Logical device layout. LD₂ stores the UberPage as (timestamp, offset) pairs pointing to each revision's RevisionRootPage in LD₁; each revision appends only modified page fragments (copy-on-write).]

The UberPage is always written last as an atomic operation. Even if a crash occurs mid-commit, the previous valid state is preserved. Checksums are stored in parent page references (borrowed from ZFS), so corruption is detected immediately on read. Because the store is append-only and the UberPage commit is atomic, no write-ahead log (WAL) is needed — the on-disk state is always consistent.
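A minimal sketch of this commit ordering, with assumed file names and a made-up fixed 16-byte UberPage entry layout: pages are appended to LD₁ and fsynced before the (timestamp, offset) entry is appended to LD₂, so a crash between the two steps simply leaves the previous revision as the newest valid one.

```python
import os
import struct
import tempfile
import time

def commit(ld1_path, ld2_path, page_bytes):
    """Append a revision: LD1 data first, then the LD2 pointer entry."""
    with open(ld1_path, "ab") as ld1:
        offset = ld1.tell()
        ld1.write(page_bytes)            # page fragments + RevisionRootPage
        ld1.flush()
        os.fsync(ld1.fileno())           # durable before any pointer exists
    with open(ld2_path, "ab") as ld2:
        ld2.write(struct.pack("<qq", time.time_ns(), offset))  # 16-byte append
        ld2.flush()
        os.fsync(ld2.fileno())

def latest_revision_offset(ld2_path):
    """Read the newest (timestamp, offset) entry from LD2."""
    size = os.path.getsize(ld2_path)
    with open(ld2_path, "rb") as ld2:
        ld2.seek(size - 16)
        _, offset = struct.unpack("<qq", ld2.read(16))
    return offset

workdir = tempfile.mkdtemp()
ld1_path, ld2_path = os.path.join(workdir, "ld1"), os.path.join(workdir, "ld2")
commit(ld1_path, ld2_path, b"rev1-pages-and-revision-root")
commit(ld1_path, ld2_path, b"rev2-delta")
```

If the process dies after the LD₁ fsync but before the LD₂ append, the orphaned bytes at the end of LD₁ are unreachable garbage, not corruption: readers only follow offsets recorded in LD₂.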

Page Structure

Each resource is organized as a trie of pages. The RevisionRootPage is the entry point for a single revision, branching into subtrees for data, indexes, and metadata.

[Figure: Page hierarchy for a single revision. The UberPage points to the RevisionRootPage (author, timestamp, commit message), which branches into: data IndirectPages leading to RecordPages (JSON/XML nodes); the PathSummary (unique path tree); the NamePage (name index trees); the PathPage (path index trees); and the CASPage (CAS index trees).]

UberPage — The root entry point. Written last during a commit as an atomic operation. Contains a reference to the IndirectPage tree that addresses all revisions.

IndirectPages — Fan-out nodes (512 references each) that form the trie structure. Borrowed from ZFS, they store checksums in parent references for data integrity. Unused slots are tracked with a bitset for compact storage.

RevisionRootPage — Entry point for a single revision. Stores the author, timestamp, and optional commit message. Branches to the data trie, path summary, name dictionary, and index pages.

RecordPages — Leaf pages storing up to 1024 nodes each. These are the pages that get versioned by the sliding snapshot algorithm.
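Addressing a record through the trie can be sketched as integer arithmetic over the stated fan-outs (1024 records per RecordPage, 512 references per IndirectPage); the fixed three-level depth here is an assumption for illustration:

```python
RECORDS_PER_PAGE = 1024
FANOUT = 512  # 512 references per IndirectPage, i.e. 9 bits per level

def page_path(node_key, levels=3):
    """Return ([slot per indirect level], record index) for a nodeKey."""
    page_no, record_idx = divmod(node_key, RECORDS_PER_PAGE)
    slots = []
    for _ in range(levels):
        page_no, slot = divmod(page_no, FANOUT)
        slots.append(slot)
    return list(reversed(slots)), record_idx  # root-to-leaf order
```

For example, nodeKey 1,000,000 lives in RecordPage 976 at record index 576, reached via indirect slots [0, 1, 464]; the deterministic arithmetic is what makes lookups index-free.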

Sliding Snapshot Versioning

SirixDB doesn’t just copy entire pages on every change. It versions RecordPages at a sub-page level, storing only changed records. The sliding snapshot algorithm, developed by Marc Kramis, avoids the trade-off between read performance and write amplification that plagues traditional approaches.

[Figure: Page versioning strategies compared over revisions 1-5 for a page with records A-D. Full copy: fast reads, wasteful writes. Incremental: compact, but periodic full dumps cause write spikes. Differential: two reads per page, but growing deltas plus spikes. Sliding snapshot: bounded reads, no spikes.]
| Strategy | Reads to reconstruct | Write cost per revision | Write spikes? |
| --- | --- | --- | --- |
| Full | 1 page | Entire page (all records) | No |
| Incremental | All since last full dump | Only changed records | Yes (periodic full dump) |
| Differential | 2 pages (full + diff) | All changes since last full dump | Yes (growing deltas + full dump) |
| Sliding Snapshot | At most N fragments | Changed + expired records | No |

The sliding snapshot uses a window of size N (typically 3-5). Changed records are always written; in addition, any record whose most recent copy is about to slide out of the N-fragment window is carried forward into the new fragment. This guarantees that at most N page fragments need to be read to reconstruct any page, regardless of the total revision count.
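A simplified sketch of the idea (real fragments carry per-record revision metadata; the dict-based page fragments here are an illustration only):

```python
N = 4  # window size: a read touches at most N fragments

def write_fragment(fragments, changes):
    """Append a fragment holding the changes plus any expiring records."""
    fragment = dict(changes)
    if len(fragments) >= N:
        leaving = fragments[-N]              # about to slide out of the window
        kept = set(changes)
        for frag in fragments[-(N - 1):]:    # fragments still inside the window
            kept |= frag.keys()
        for rec, val in leaving.items():
            if rec not in kept:
                fragment[rec] = val          # carry forward: reads stay bounded
    fragments.append(fragment)

def read_page(fragments):
    """Reconstruct the full page from at most the last N fragments."""
    page = {}
    for frag in fragments[-N:]:
        page.update(frag)                    # newer fragments win
    return page

fragments = []
write_fragment(fragments, {"A": 1, "B": 1, "C": 1, "D": 1, "E": 1})  # initial page
write_fragment(fragments, {"B": 2})
write_fragment(fragments, {"C": 2})
write_fragment(fragments, {"A": 2})
write_fragment(fragments, {"D": 2})  # E's only copy would expire: carried forward
```

After the fifth write, record E (never modified) rides along in the newest fragment, so reconstructing the page never needs the fragment that slid out of the window.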

For details, see Marc Kramis’s thesis: Evolutionary Tree-Structured Storage: Concepts, Interfaces, and Applications.

Transaction Model

SirixDB uses a single-writer, multiple-reader (SWMR) concurrency model per resource. At most one write transaction can be active on a resource at any time, while an unlimited number of concurrent read transactions can proceed without locks.

Read transactions are bound to a specific revision and see an immutable snapshot — they never observe uncommitted changes and never block writers. This is true snapshot isolation without the overhead of MVCC bookkeeping at query time.

Uncommitted changes are held in an in-memory transaction intent log (TIL). On commit, changes are written sequentially to the append-only log and the new UberPage is flushed atomically. On abort, the in-memory state is simply discarded — nothing was written to disk.

Because each resource is an independent log-structured store, write transactions on different resources proceed in parallel with no coordination.
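The SWMR model with an intent log can be sketched as follows (class and method names are invented for illustration; this is not SirixDB's API):

```python
import threading

class Resource:
    """One resource: a list of immutable revision snapshots plus a writer lock."""
    def __init__(self):
        self.revisions = [{}]               # revision 0: empty snapshot
        self._write_lock = threading.Lock()

    def read_txn(self, revision=-1):
        return self.revisions[revision]     # immutable snapshot, lock-free

    def write_txn(self):
        return WriteTxn(self)

class WriteTxn:
    def __init__(self, resource):
        resource._write_lock.acquire()      # SWMR: at most one writer per resource
        self.resource = resource
        self.intent_log = {}                # uncommitted changes live only here

    def put(self, key, value):
        self.intent_log[key] = value

    def commit(self):
        base = self.resource.revisions[-1]
        self.resource.revisions.append({**base, **self.intent_log})
        self.resource._write_lock.release()

    def abort(self):
        self.intent_log.clear()             # nothing ever touched the store
        self.resource._write_lock.release()

res = Resource()
snapshot = res.read_txn()       # pinned to revision 0
txn = res.write_txn()
txn.put("price", 42)            # buffered in the intent log only
txn.commit()                    # appended as a new immutable revision
```

Note that the reader's pinned snapshot is unaffected by the concurrent commit: it still sees revision 0, with no locking or MVCC bookkeeping at read time.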

Secondary Indexes

SirixDB supports three types of user-defined secondary indexes, all stored in the same versioned trie structure as the data. Indexes are part of the transaction and version with the data — the index at revision 42 always matches the data at revision 42.

Path Summary

Every resource maintains a compact path summary — a tree of all unique paths in the document. Each unique path gets a path class reference (PCR), a stable integer ID. Nodes in the main data tree reference their PCR, enabling efficient path-based lookups.
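A sketch of how such a summary might be built (the tuple-path encoding and PCR numbering are assumptions for illustration; array steps collapse to "[]" so all elements of an array share one path class):

```python
def build_path_summary(value, path=(), pcrs=None):
    """Assign a stable integer PCR to every unique path in the document."""
    if pcrs is None:
        pcrs = {(): 0}                       # PCR 0: document root
    if isinstance(value, dict):
        for field, child in value.items():
            p = path + (field,)
            pcrs.setdefault(p, len(pcrs))    # existing paths keep their PCR
            build_path_summary(child, p, pcrs)
    elif isinstance(value, list):
        p = path + ("[]",)
        pcrs.setdefault(p, len(pcrs))
        for child in value:
            build_path_summary(child, p, pcrs)
    return pcrs

doc = {"users": [{"name": "Alice", "age": 28}, {"name": "Bob", "age": 35}],
       "config": {}}
pcrs = build_path_summary(doc)
```

Both user objects map onto the same path classes, so however many users exist, the summary stays as small as the document's shape.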

[Figure: Path summary and index architecture. The data tree (node keys k0-k14) holds the JSON nodes; the path summary assigns PCRs (root 0, users 1, [] 2, name 3, age 4, config 5); the name index maps hash("name") to {k5, k10}, the path index maps PCR 3 to {k5, k10}, and the CAS index maps (PCR 4, 28) to {k8}.]

Index Types

| Index | Key | Use case |
| --- | --- | --- |
| Name | Field name hash → node keys | Find all nodes named "email" regardless of path |
| Path | PCR → node keys | Find all nodes at path /users/[]/name |
| CAS | (PCR + typed value) → node keys | Find all users where age > 30 on path /users/[]/age |

CAS (content-and-structure) indexes are the most selective — they index both the path and the typed value, enabling efficient range queries. All indexes are stored in Height Optimized Tries (HOT) within the same versioned page structure.
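A CAS lookup can be sketched as a sorted per-PCR value list plus a binary search (the PCR number and node keys below are assumed example values, and a real HOT index is far more compact):

```python
import bisect

# One sorted (value, node_key) list per PCR, so a path-scoped range query
# is a binary search plus a short forward scan.
cas_index = {
    4: [(28, "k8"), (35, "k13")],   # PCR 4: hypothetical /users/[]/age entries
}

def range_query(index, pcr, greater_than):
    """Node keys whose typed value exceeds greater_than on the given path class."""
    entries = index.get(pcr, [])
    # Sentinel sorts after any node key, so equal values are excluded (strict >).
    start = bisect.bisect_right(entries, (greater_than, chr(0x10FFFF)))
    return [node for _, node in entries[start:]]
```

Because both the path class and the value are in the key, "age > 30 on /users/[]/age" never touches nodes named age elsewhere in the document.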

For the JSONiq API to create and query indexes, see the Function Reference.

Further Reading