Architecture


What if your database never forgot anything, and you could query any point in its history as fast as querying the present — without storage exploding?

SirixDB is a bitemporal node store that makes version control a first-class citizen of the storage engine. Every commit creates an immutable snapshot. Every revision is queryable. And unlike naive approaches that copy everything (git-style) or maintain expensive logs (event sourcing), SirixDB achieves this through structural sharing and a novel sliding snapshot versioning algorithm.

| Feature | What it means | Why it matters |
| --- | --- | --- |
| Bitemporal | Every revision preserved | Git-like history for your data |
| Append-Only | No in-place updates | No WAL needed, crash-safe by design |
| Copy-on-Write | Modified pages copied, unchanged shared | O(Δ) storage per revision |
| Structural Sharing | Unchanged subtrees reference existing pages | Billion-node docs with small revisions |
| Log-Structured | Sequential writes only | SSD-friendly, no random write I/O |

Node-Based Document Model

Unlike document databases that store JSON as opaque blobs, SirixDB decomposes each document into a tree of fine-grained nodes. Each node has a stable 64-bit nodeKey that never changes across revisions. Nodes are linked via parent, child, and sibling pointers, enabling O(1) navigation in any direction.

Field names are stored once in an in-memory dictionary and referenced by 32-bit keys, saving space when the same field appears thousands of times.

Why Node Storage Matters

Most document databases (MongoDB, CouchDB) treat a document as an opaque blob: store it, retrieve it, replace it. SirixDB understands the structure of your data — and that changes everything.

| Aspect | Document Store | SirixDB Node Store |
| --- | --- | --- |
| Size limits | 16 MB (MongoDB), 400 KB (DynamoDB) | Unlimited — nodes stored independently |
| Update granularity | Replace entire document | Write only changed nodes |
| Query efficiency | Load doc, filter in app | Navigate directly to target nodes |
| Memory footprint | Entire doc in memory | Stream nodes, never load full tree |
| Versioning granularity | "Document changed" | "These specific nodes changed" |
| Diff precision | "Something's different" | Exact path to every modified node |

This means: a 100 GB JSON dataset? Store it as one logical resource — no sharding. Change one field in a deeply nested object? SirixDB writes just that node plus the modified path. Need the history of a specific field across 1,000 commits? That’s a fast index lookup, not a full-document diff.

[Figure: JSON tree encoding. The input {"name":"Alice","scores":[95,87]} becomes DOC (key 0), OBJECT (key 1), OBJECT_KEY "name" (key 2), STRING "Alice" (key 3), OBJECT_KEY "scores" (key 4), ARRAY (key 5), NUMBER 95 (key 6), NUMBER 87 (key 7). Legend: structure node, object key, value node.]

Each node type maps directly to a JSON construct: OBJECT, ARRAY, OBJECT_KEY, STRING_VALUE, NUMBER_VALUE, BOOLEAN_VALUE, and NULL_VALUE. Navigation between nodes is O(1) via stored pointers — no scanning required.
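As a rough illustration of this decomposition (the key assignment and node shapes here are simplified assumptions, not SirixDB's actual encoding), a JSON value can be flattened into a node table plus an interned name dictionary:

```python
import json

def decompose(value, nodes=None, names=None, parent=None):
    """Flatten a JSON value into nodeKey -> node, interning field names."""
    if nodes is None:
        nodes, names = {}, {}
    key = len(nodes)                     # illustrative stable key: insertion order
    if isinstance(value, dict):
        nodes[key] = {"type": "OBJECT", "parent": parent}
        for field, child in value.items():
            name_key = names.setdefault(field, len(names))  # name stored once
            field_key = len(nodes)
            nodes[field_key] = {"type": "OBJECT_KEY", "name": name_key, "parent": key}
            decompose(child, nodes, names, parent=field_key)
    elif isinstance(value, list):
        nodes[key] = {"type": "ARRAY", "parent": parent}
        for child in value:
            decompose(child, nodes, names, parent=key)
    elif isinstance(value, bool):
        nodes[key] = {"type": "BOOLEAN_VALUE", "value": value, "parent": parent}
    elif value is None:
        nodes[key] = {"type": "NULL_VALUE", "parent": parent}
    elif isinstance(value, str):
        nodes[key] = {"type": "STRING_VALUE", "value": value, "parent": parent}
    else:
        nodes[key] = {"type": "NUMBER_VALUE", "value": value, "parent": parent}
    return nodes, names

nodes, names = decompose(json.loads('{"name":"Alice","scores":[95,87]}'))
```

Note how "name" and "scores" land in the dictionary once, and every node carries a parent reference for constant-time upward navigation.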

Why Bitemporal Storage?

Traditional databases force painful workarounds for temporal queries — audit tables, change data capture pipelines, backup diffs. SirixDB eliminates all of that.

Query any point in history

A customer claims they were charged the wrong price. Your current database shows today’s price. What was the price at the moment of their order?

let $catalog := jn:open('shop', 'products', xs:dateTime('2024-01-15T15:23:47Z'))
return $catalog.products[?$$.sku eq "SKU-12345"].price

One query. Exact answer. No audit infrastructure required — the database remembers everything.

Structured diffs between any two points in time

A production outage started at 2:00 AM. What configuration changes were made since midnight?

let $midnight := jn:open('configs', 'production', xs:dateTime('2024-01-15T00:00:00Z'))
let $incident := jn:open('configs', 'production', xs:dateTime('2024-01-15T02:00:00Z'))
return jn:diff('configs', 'production', sdb:revision($midnight), sdb:revision($incident))

Returns a structured JSON diff showing exactly what was inserted, deleted, updated, and moved — with node-level precision.

No History Table Performance Tax

The “standard” approach to temporal data — history tables with valid_from and valid_to timestamps — comes with hidden costs: every index grows 3x (includes timestamps), every query needs timestamp range filters, updates require two writes (insert new + update old), and joining two temporal tables means intersecting validity intervals.

SirixDB sidesteps all of this. Indexes are scoped to a revision — query revision 42 and you get revision 42’s index, no filtering. Finding the right revision for a timestamp is a single O(log R) binary search, amortized across an entire session.
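The timestamp-to-revision lookup can be sketched as a plain binary search over per-revision commit timestamps (the list-of-commit-times layout here is an assumption for illustration):

```python
import bisect
from datetime import datetime, timezone

def revision_at(commit_times, ts):
    """Index of the newest revision committed at or before ts (O(log R))."""
    i = bisect.bisect_right(commit_times, ts) - 1
    if i < 0:
        raise ValueError("no revision exists at or before the given timestamp")
    return i

# Four revisions committed on Jan 1, 5, 10, and 15.
commits = [datetime(2024, 1, d, tzinfo=timezone.utc) for d in (1, 5, 10, 15)]
```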

Copy-on-Write with Path Copying

When a transaction modifies data, SirixDB doesn’t rewrite existing pages. Instead, it copies only the modified page and its ancestor path to the root. All unchanged pages are shared between the old and new revision via pointers. This is the same principle used in persistent data structures and ZFS.

[Figure: Copy-on-write path copying over revisions 1-3. Rev 1 writes Uber, RevRoot, Indirect, and Pages A-D. Rev 2 copies only Indirect' and Page A'; Rev 3 copies only Indirect' and Page D'. All unchanged pages are shared between revisions via pointers; only the changed path is copied.]

This means a revision that modifies a single record only writes the modified page plus its ancestor path — typically 3-4 pages. A 10 GB database with 1,000 revisions and 0.1% change each requires roughly 20 GB total, not 10 TB.
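The path-copying idea can be sketched on a toy persistent tree (the Page shape here is assumed for illustration, not SirixDB's on-disk format): updating one record copies only the root-to-leaf path, while sibling subtrees are shared by reference.

```python
class Page:
    def __init__(self, children=None, records=None):
        self.children = children  # list of child Pages (inner page)
        self.records = records    # dict of record_id -> value (leaf RecordPage)

def update(page, path, record_id, value):
    """Return a new revision root; only the touched path is copied."""
    if not path:                              # leaf: copy the RecordPage
        records = dict(page.records)
        records[record_id] = value
        return Page(records=records)
    slot = path[0]
    children = list(page.children)            # shallow copy: siblings stay shared
    children[slot] = update(page.children[slot], path[1:], record_id, value)
    return Page(children=children)

leaf_a, leaf_b = Page(records={1: "x"}), Page(records={2: "y"})
rev1 = Page(children=[leaf_a, leaf_b])
rev2 = update(rev1, [0], 1, "x'")             # new root + new leaf_a only

assert rev2.children[1] is leaf_b             # unchanged subtree is shared
assert rev1.children[0].records[1] == "x"     # old revision is untouched
```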

Physically, each resource is stored in two append-only logical devices (files). LD₁ stores page content (IndirectPages, RecordPages, NamePages) followed by a RevisionRootPage at the end of each revision’s data. LD₂ stores the UberPage — a sequence of timestamp + offset pairs, one per revision, pointing to the corresponding RRP in LD₁.

[Figure: Logical device layout. LD₂ stores the UberPage as (timestamp, offset) pairs pointing to each revision's RevisionRootPage in LD₁; each revision appends only modified page fragments (copy-on-write).]

The UberPage is always written last as an atomic operation. Even if a crash occurs mid-commit, the previous valid state is preserved. Checksums are stored in parent page references (borrowed from ZFS), so corruption is detected immediately on read. Because the store is append-only and the UberPage commit is atomic, no write-ahead log (WAL) is needed — the on-disk state is always consistent.
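A minimal sketch of this commit ordering, with assumed file names and a made-up fixed 16-byte UberPage entry layout: pages are appended to LD₁ and fsynced before the (timestamp, offset) entry is appended to LD₂, so a crash between the two steps simply leaves the previous revision as the newest valid one.

```python
import os
import struct
import tempfile
import time

def commit(ld1_path, ld2_path, page_bytes):
    """Append a revision: LD1 data first, then the LD2 pointer entry."""
    with open(ld1_path, "ab") as ld1:
        offset = ld1.tell()
        ld1.write(page_bytes)            # page fragments + RevisionRootPage
        ld1.flush()
        os.fsync(ld1.fileno())           # durable before any pointer exists
    with open(ld2_path, "ab") as ld2:
        ld2.write(struct.pack("<qq", time.time_ns(), offset))  # 16-byte append
        ld2.flush()
        os.fsync(ld2.fileno())

def latest_revision_offset(ld2_path):
    """Read the newest (timestamp, offset) entry from LD2."""
    size = os.path.getsize(ld2_path)
    with open(ld2_path, "rb") as ld2:
        ld2.seek(size - 16)
        _, offset = struct.unpack("<qq", ld2.read(16))
    return offset

workdir = tempfile.mkdtemp()
ld1_path, ld2_path = os.path.join(workdir, "ld1"), os.path.join(workdir, "ld2")
commit(ld1_path, ld2_path, b"rev1-pages-and-revision-root")
commit(ld1_path, ld2_path, b"rev2-delta")
```

If the process dies after the LD₁ fsync but before the LD₂ append, the orphaned bytes at the end of LD₁ are unreachable garbage, not corruption: readers only follow offsets recorded in LD₂.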

Page Structure

Each resource is organized as a trie of pages. The RevisionRootPage is the entry point for a single revision, branching into subtrees for data, indexes, and metadata.

[Figure: Page hierarchy for a single revision. The UberPage points to the RevisionRootPage (author, timestamp, commit message), which branches into: data IndirectPages leading to RecordPages (JSON/XML nodes); the PathSummary (unique path tree); the NamePage (name index trees); the PathPage (path index trees); and the CASPage (CAS index trees).]

UberPage — The root entry point. Written last during a commit as an atomic operation. Contains a reference to the IndirectPage tree that addresses all revisions.

IndirectPages — Fan-out nodes (512 references each) that form the trie structure. Borrowed from ZFS, they store checksums in parent references for data integrity. Unused slots are tracked with a bitset for compact storage.

RevisionRootPage — Entry point for a single revision. Stores the author, timestamp, and optional commit message. Branches to the data trie, path summary, name dictionary, and index pages.

RecordPages — Leaf pages storing up to 1024 nodes each. These are the pages that get versioned by the sliding snapshot algorithm.
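Addressing a record through the trie can be sketched as integer arithmetic over the stated fan-outs (1024 records per RecordPage, 512 references per IndirectPage); the fixed three-level depth here is an assumption for illustration:

```python
RECORDS_PER_PAGE = 1024
FANOUT = 512  # 512 references per IndirectPage, i.e. 9 bits per level

def page_path(node_key, levels=3):
    """Return ([slot per indirect level], record index) for a nodeKey."""
    page_no, record_idx = divmod(node_key, RECORDS_PER_PAGE)
    slots = []
    for _ in range(levels):
        page_no, slot = divmod(page_no, FANOUT)
        slots.append(slot)
    return list(reversed(slots)), record_idx  # root-to-leaf order
```

For example, nodeKey 1,000,000 lives in RecordPage 976 at record index 576, reached via indirect slots [0, 1, 464]; the deterministic arithmetic is what makes lookups index-free.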

Sliding Snapshot Versioning

SirixDB doesn’t just copy entire pages on every change. It versions RecordPages at a sub-page level, storing only changed records. The sliding snapshot algorithm, developed by Marc Kramis, avoids the trade-off between read performance and write amplification that plagues traditional approaches.

[Figure: Page versioning strategies compared over revisions 1-5 for a page with records A-D. Full copy: fast reads, wasteful writes. Incremental: compact, but periodic full dumps cause write spikes. Differential: two reads per page, but growing deltas plus spikes. Sliding snapshot: bounded reads, no spikes.]
| Strategy | Reads to reconstruct | Write cost per revision | Write spikes? |
| --- | --- | --- | --- |
| Full | 1 page | Entire page (all records) | No |
| Incremental | All since last full dump | Only changed records | Yes (periodic full dump) |
| Differential | 2 pages (full + diff) | All changes since last full dump | Yes (growing deltas + full dump) |
| Sliding Snapshot | At most N fragments | Changed + expired records | No |

The sliding snapshot uses a window of size N (typically 3-5). Changed records are always written; in addition, any record whose most recent copy is about to slide out of the N-fragment window is carried forward into the new fragment. This guarantees that at most N page fragments need to be read to reconstruct any page, regardless of the total revision count.
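A simplified sketch of the idea (real fragments carry per-record revision metadata; the dict-based page fragments here are an illustration only):

```python
N = 4  # window size: a read touches at most N fragments

def write_fragment(fragments, changes):
    """Append a fragment holding the changes plus any expiring records."""
    fragment = dict(changes)
    if len(fragments) >= N:
        leaving = fragments[-N]              # about to slide out of the window
        kept = set(changes)
        for frag in fragments[-(N - 1):]:    # fragments still inside the window
            kept |= frag.keys()
        for rec, val in leaving.items():
            if rec not in kept:
                fragment[rec] = val          # carry forward: reads stay bounded
    fragments.append(fragment)

def read_page(fragments):
    """Reconstruct the full page from at most the last N fragments."""
    page = {}
    for frag in fragments[-N:]:
        page.update(frag)                    # newer fragments win
    return page

fragments = []
write_fragment(fragments, {"A": 1, "B": 1, "C": 1, "D": 1, "E": 1})  # initial page
write_fragment(fragments, {"B": 2})
write_fragment(fragments, {"C": 2})
write_fragment(fragments, {"A": 2})
write_fragment(fragments, {"D": 2})  # E's only copy would expire: carried forward
```

After the fifth write, record E (never modified) rides along in the newest fragment, so reconstructing the page never needs the fragment that slid out of the window.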

For details, see Marc Kramis’s thesis: Evolutionary Tree-Structured Storage: Concepts, Interfaces, and Applications.

Transaction Model

SirixDB uses a single-writer, multiple-reader (SWMR) concurrency model per resource. At most one write transaction can be active on a resource at any time, while an unlimited number of concurrent read transactions can proceed without locks.

Read transactions are bound to a specific revision and see an immutable snapshot — they never observe uncommitted changes and never block writers. This is true snapshot isolation without the overhead of MVCC bookkeeping at query time.

Uncommitted changes are held in an in-memory transaction intent log (TIL). On commit, changes are written sequentially to the append-only log and the new UberPage is flushed atomically. On abort, the in-memory state is simply discarded — nothing was written to disk.

Because each resource is an independent log-structured store, write transactions on different resources proceed in parallel with no coordination.
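The SWMR model with an intent log can be sketched as follows (class and method names are invented for illustration; this is not SirixDB's API):

```python
import threading

class Resource:
    """One resource: a list of immutable revision snapshots plus a writer lock."""
    def __init__(self):
        self.revisions = [{}]               # revision 0: empty snapshot
        self._write_lock = threading.Lock()

    def read_txn(self, revision=-1):
        return self.revisions[revision]     # immutable snapshot, lock-free

    def write_txn(self):
        return WriteTxn(self)

class WriteTxn:
    def __init__(self, resource):
        resource._write_lock.acquire()      # SWMR: at most one writer per resource
        self.resource = resource
        self.intent_log = {}                # uncommitted changes live only here

    def put(self, key, value):
        self.intent_log[key] = value

    def commit(self):
        base = self.resource.revisions[-1]
        self.resource.revisions.append({**base, **self.intent_log})
        self.resource._write_lock.release()

    def abort(self):
        self.intent_log.clear()             # nothing ever touched the store
        self.resource._write_lock.release()

res = Resource()
snapshot = res.read_txn()       # pinned to revision 0
txn = res.write_txn()
txn.put("price", 42)            # buffered in the intent log only
txn.commit()                    # appended as a new immutable revision
```

Note that the reader's pinned snapshot is unaffected by the concurrent commit: it still sees revision 0, with no locking or MVCC bookkeeping at read time.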

Secondary Indexes

SirixDB supports three types of user-defined secondary indexes, all stored in the same versioned trie structure as the data. Indexes are part of the transaction and version with the data — the index at revision 42 always matches the data at revision 42.

Path Summary

Every resource maintains a compact path summary — a tree of all unique paths in the document. Each unique path gets a path class reference (PCR), a stable integer ID. Nodes in the main data tree reference their PCR, enabling efficient path-based lookups.
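A sketch of how such a summary might be built (the tuple-path encoding and PCR numbering are assumptions for illustration; array steps collapse to "[]" so all elements of an array share one path class):

```python
def build_path_summary(value, path=(), pcrs=None):
    """Assign a stable integer PCR to every unique path in the document."""
    if pcrs is None:
        pcrs = {(): 0}                       # PCR 0: document root
    if isinstance(value, dict):
        for field, child in value.items():
            p = path + (field,)
            pcrs.setdefault(p, len(pcrs))    # existing paths keep their PCR
            build_path_summary(child, p, pcrs)
    elif isinstance(value, list):
        p = path + ("[]",)
        pcrs.setdefault(p, len(pcrs))
        for child in value:
            build_path_summary(child, p, pcrs)
    return pcrs

doc = {"users": [{"name": "Alice", "age": 28}, {"name": "Bob", "age": 35}],
       "config": {}}
pcrs = build_path_summary(doc)
```

Both user objects map onto the same path classes, so however many users exist, the summary stays as small as the document's shape.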

[Figure: Path summary and index architecture. The data tree (node keys k0-k14) holds the JSON nodes; the path summary assigns PCRs (root 0, users 1, [] 2, name 3, age 4, config 5); the name index maps hash("name") to {k5, k10}, the path index maps PCR 3 to {k5, k10}, and the CAS index maps (PCR 4, 28) to {k8}.]

Index Types

| Index | Key | Use case |
| --- | --- | --- |
| Name | Field name hash → node keys | Find all nodes named "email" regardless of path |
| Path | PCR → node keys | Find all nodes at path /users/[]/name |
| CAS | (PCR + typed value) → node keys | Find all users where age > 30 on path /users/[]/age |

CAS (content-and-structure) indexes are the most selective — they index both the path and the typed value, enabling efficient range queries. All indexes are stored in Height Optimized Tries (HOT) within the same versioned page structure.
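A CAS lookup can be sketched as a sorted per-PCR value list plus a binary search (the PCR number and node keys below are assumed example values, and a real HOT index is far more compact):

```python
import bisect

# One sorted (value, node_key) list per PCR, so a path-scoped range query
# is a binary search plus a short forward scan.
cas_index = {
    4: [(28, "k8"), (35, "k13")],   # PCR 4: hypothetical /users/[]/age entries
}

def range_query(index, pcr, greater_than):
    """Node keys whose typed value exceeds greater_than on the given path class."""
    entries = index.get(pcr, [])
    # Sentinel sorts after any node key, so equal values are excluded (strict >).
    start = bisect.bisect_right(entries, (greater_than, chr(0x10FFFF)))
    return [node for _, node in entries[start:]]
```

Because both the path class and the value are in the key, "age > 30 on /users/[]/age" never touches nodes named age elsewhere in the document.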

For the JSONiq API to create and query indexes, see the Function Reference.

Further Reading