Architecture
What if your database never forgot anything, and you could query any point in its history as fast as querying the present — without storage exploding?
SirixDB is a bitemporal node store that makes version control a first-class citizen of the storage engine. Every commit creates an immutable snapshot. Every revision is queryable. And unlike naive approaches that copy everything (git-style) or maintain expensive logs (event sourcing), SirixDB achieves this through structural sharing and a novel sliding snapshot versioning algorithm.
| Feature | What it means | Why it matters |
|---|---|---|
| Bitemporal | Every revision preserved | Git-like history for your data |
| Append-Only | No in-place updates | No WAL needed, crash-safe by design |
| Copy-on-Write | Modified pages copied, unchanged shared | O(Δ) storage per revision |
| Structural Sharing | Unchanged subtrees reference existing pages | Billion-node docs with small revisions |
| Log-Structured | Sequential writes only | SSD-friendly, no random write I/O |
Node-Based Document Model
Unlike document databases that store JSON as opaque blobs, SirixDB decomposes each document into a tree of fine-grained nodes. Each node has a stable 64-bit nodeKey that never changes across revisions. Nodes are linked via parent, child, and sibling pointers, enabling O(1) navigation in any direction.
Field names are stored once in an in-memory dictionary and referenced by 32-bit keys, saving space when the same field appears thousands of times.
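The node layout described above can be sketched as a small record plus a shared name dictionary. This is a simplified Python model for illustration only, not SirixDB's actual Java classes; all field and class names here are assumptions:

```python
from dataclasses import dataclass

@dataclass
class Node:
    node_key: int                 # stable 64-bit ID, never changes across revisions
    kind: str                     # OBJECT, OBJECT_KEY, STRING_VALUE, ...
    parent_key: int = -1          # pointers enabling O(1) navigation
    first_child_key: int = -1
    left_sibling_key: int = -1
    right_sibling_key: int = -1
    name_key: int = -1            # 32-bit reference into the name dictionary
    value: object = None

class NameDictionary:
    """Field names stored once, referenced by small integer keys."""
    def __init__(self):
        self._keys = {}
        self._names = []

    def key_for(self, name):
        if name not in self._keys:
            self._keys[name] = len(self._names)
            self._names.append(name)
        return self._keys[name]

    def name_for(self, key):
        return self._names[key]

# Decompose {"user": "alice"} into three linked nodes sharing one dictionary.
names = NameDictionary()
root = Node(node_key=1, kind="OBJECT")
key = Node(node_key=2, kind="OBJECT_KEY", parent_key=1,
           name_key=names.key_for("user"))
val = Node(node_key=3, kind="STRING_VALUE", parent_key=2, value="alice")
root.first_child_key = key.node_key
key.first_child_key = val.node_key
```

A second object with a `"user"` field would reuse dictionary key 0 rather than store the string again.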
Why Node Storage Matters
Most document databases (MongoDB, CouchDB) treat a document as an opaque blob: store it, retrieve it, replace it. SirixDB understands the structure of your data — and that changes everything.
| Aspect | Document Store | SirixDB Node Store |
|---|---|---|
| Size limits | 16 MB (MongoDB), 400 KB (DynamoDB) | Unlimited — nodes stored independently |
| Update granularity | Replace entire document | Write only changed nodes |
| Query efficiency | Load doc, filter in app | Navigate directly to target nodes |
| Memory footprint | Entire doc in memory | Stream nodes, never load full tree |
| Versioning granularity | “Document changed” | “These specific nodes changed” |
| Diff precision | “Something’s different” | Exact path to every modified node |
This means: a 100 GB JSON dataset? Store it as one logical resource — no sharding. Change one field in a deeply nested object? SirixDB writes just that node plus the modified path. Need the history of a specific field across 1,000 commits? That’s a fast index lookup, not a full-document diff.
Each node type maps directly to a JSON construct: OBJECT, ARRAY, OBJECT_KEY, STRING_VALUE, NUMBER_VALUE, BOOLEAN_VALUE, and NULL_VALUE. Navigation between nodes is O(1) via stored pointers — no scanning required.
Why Bitemporal Storage?
Traditional databases force painful workarounds for temporal queries — audit tables, change data capture pipelines, backup diffs. SirixDB eliminates all of that.
Query any point in history
A customer claims they were charged the wrong price. Your current database shows today’s price. What was the price at the moment of their order?
```xquery
let $catalog := jn:open('shop', 'products', xs:dateTime('2024-01-15T15:23:47Z'))
return $catalog.products[?$$.sku eq "SKU-12345"].price
```
One query. Exact answer. No audit infrastructure required — the database remembers everything.
Structured diffs between any two points in time
A production outage started at 2:00 AM. What configuration changes were made since midnight?
```xquery
let $midnight := jn:open('configs', 'production', xs:dateTime('2024-01-15T00:00:00Z'))
let $incident := jn:open('configs', 'production', xs:dateTime('2024-01-15T02:00:00Z'))
return jn:diff('configs', 'production', sdb:revision($midnight), sdb:revision($incident))
```
Returns a structured JSON diff showing exactly what was inserted, deleted, updated, and moved — with node-level precision.
No History Table Performance Tax
The “standard” approach to temporal data — history tables with valid_from and valid_to timestamps — comes with hidden costs: every index grows 3x (includes timestamps), every query needs timestamp range filters, updates require two writes (insert new + update old), and joining two temporal tables means intersecting validity intervals.
SirixDB sidesteps all of this. Indexes are scoped to a revision — query revision 42 and you get revision 42’s index, no filtering. Finding the right revision for a timestamp is a single O(log R) binary search, amortized across an entire session.
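That timestamp-to-revision lookup boils down to a binary search over the sequence of (commit timestamp, revision) pairs the UberPage maintains. A minimal sketch under that assumption follows; the function name and list layout are hypothetical, and in practice the sorted entries would be cached for the whole session rather than rebuilt per call:

```python
import bisect

def revision_for(commits, at):
    """Return the newest revision committed at or before `at`, in O(log R)."""
    # In a real session the key list would be built once and cached.
    timestamps = [ts for ts, _ in commits]
    i = bisect.bisect_right(timestamps, at)
    if i == 0:
        raise ValueError("no revision exists at or before the given time")
    return commits[i - 1][1]

# One (commit timestamp, revision) pair per commit, sorted by time.
commits = [(100.0, 1), (200.0, 2), (350.0, 3)]
revision_for(commits, 250.0)   # revision 2 was current at t=250
```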
Copy-on-Write with Path Copying
When a transaction modifies data, SirixDB doesn’t rewrite existing pages. Instead, it copies only the modified page and its ancestor path to the root. All unchanged pages are shared between the old and new revision via pointers. This is the same principle used in persistent data structures and ZFS.
This means a revision that modifies a single record only writes the modified page plus its ancestor path — typically 3-4 pages. A 10 GB database with 1,000 revisions and 0.1% change each requires roughly 20 GB total, not 10 TB.
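Path copying itself fits in a few lines. The toy persistent tree below stands in for SirixDB's page trie (real IndirectPages hold up to 512 references, and pages live on disk, not in dicts):

```python
def update(page, path, value):
    """Return a new root; `page` is a dict {slot: child} or a leaf value."""
    if not path:
        return value                      # new leaf replaces the old one
    slot, *rest = path
    copy = dict(page)                     # copy ONE page on the path...
    copy[slot] = update(page.get(slot, {}), rest, value)
    return copy                           # ...unchanged slots still point at old children

rev1 = {0: {0: "a", 1: "b"}, 1: {0: "c"}}
rev2 = update(rev1, [0, 1], "B")          # modify one leaf
```

After the update, `rev2[1] is rev1[1]` holds: the untouched subtree is shared by reference between both revisions, while `rev1` itself remains an intact snapshot.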
Physically, each resource is stored in two append-only logical devices (files). LD₁ stores page content (IndirectPages, RecordPages, NamePages) followed by a RevisionRootPage at the end of each revision’s data. LD₂ stores the UberPage — a sequence of timestamp + offset pairs, one per revision, each pointing to the corresponding RevisionRootPage in LD₁.
The UberPage is always written last as an atomic operation. Even if a crash occurs mid-commit, the previous valid state is preserved. Checksums are stored in parent page references (borrowed from ZFS), so corruption is detected immediately on read. Because the store is append-only and the UberPage commit is atomic, no write-ahead log (WAL) is needed — the on-disk state is always consistent.
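The commit protocol described above can be sketched as two appends with an fsync between them. File names, the fixed 16-byte LD₂ record, and all function names below are assumptions for illustration:

```python
import os, struct, tempfile

ENTRY = struct.Struct("<dQ")              # 8-byte timestamp + 8-byte offset

def commit(ld1_path, ld2_path, page_bytes, timestamp):
    offset = os.path.getsize(ld1_path)    # this revision's data starts here
    with open(ld1_path, "ab") as ld1:     # pages ... RevisionRootPage last
        ld1.write(page_bytes)
        ld1.flush(); os.fsync(ld1.fileno())
    with open(ld2_path, "ab") as ld2:     # the atomic "UberPage" step
        ld2.write(ENTRY.pack(timestamp, offset))
        ld2.flush(); os.fsync(ld2.fileno())

def head_revision(ld2_path):
    """Read the last complete LD2 entry; a torn tail is simply ignored."""
    size = os.path.getsize(ld2_path)
    complete = size - size % ENTRY.size
    if complete == 0:
        return None
    with open(ld2_path, "rb") as ld2:
        ld2.seek(complete - ENTRY.size)
        return ENTRY.unpack(ld2.read(ENTRY.size))

demo = tempfile.mkdtemp()
ld1 = os.path.join(demo, "ld1.data")
ld2 = os.path.join(demo, "ld2.uber")
open(ld1, "wb").close(); open(ld2, "wb").close()
commit(ld1, ld2, b"rev1-pages", 100.0)
commit(ld1, ld2, b"rev2-pages", 200.0)
```

A crash that tears the final LD₂ append leaves a partial record at the tail; recovery skips it and resolves the head to the previous, fully written entry, which is why no WAL is required.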
Page Structure
Each resource is organized as a trie of pages. The RevisionRootPage is the entry point for a single revision, branching into subtrees for data, indexes, and metadata.
UberPage — The root entry point. Written last during a commit as an atomic operation. Contains a reference to the IndirectPage tree that addresses all revisions.
IndirectPages — Fan-out nodes (512 references each) that form the trie structure. Borrowed from ZFS, they store checksums in parent references for data integrity. Unused slots are tracked with a bitset for compact storage.
RevisionRootPage — Entry point for a single revision. Stores the author, timestamp, and optional commit message. Branches to the data trie, path summary, name dictionary, and index pages.
RecordPages — Leaf pages storing up to 1024 nodes each. These are the pages that get versioned by the sliding snapshot algorithm.
Sliding Snapshot Versioning
SirixDB doesn’t just copy entire pages on every change. It versions RecordPages at a sub-page level, storing only changed records. The sliding snapshot algorithm, developed by Marc Kramis, avoids the trade-off between read performance and write amplification that plagues traditional approaches.
| Strategy | Reads to reconstruct | Write cost per revision | Write spikes? |
|---|---|---|---|
| Full | 1 page | Entire page (all records) | No |
| Incremental | All since last full dump | Only changed records | Yes (periodic full dump) |
| Differential | 2 pages (full + diff) | All changes since last full dump | Yes (growing deltas + full dump) |
| Sliding Snapshot | At most N fragments | Changed + expired records | No |
The sliding snapshot uses a window of size N (typically 3-5). Changed records are always written; in addition, any record that hasn’t been rewritten within the last N revisions is copied into the new fragment before its last copy slides out of the window. This guarantees that at most N page fragments need to be read to reconstruct any page — regardless of total revision count.
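Under those rules, the write and read sides look roughly like this. It is a toy model of one RecordPage's fragment history, reconstructed from the description above rather than taken from SirixDB's code:

```python
N = 3   # window size; the text above suggests 3-5 in practice

def write_fragment(fragments, changes, last_written, rev):
    """fragments: rev -> {record_id: value}; changes for this revision."""
    fragment = dict(changes)              # changed records are always written
    for rid, last_rev in last_written.items():
        # not rewritten for N revisions: carry before it leaves the window
        if rid not in fragment and rev - last_rev >= N:
            fragment[rid] = fragments[last_rev][rid]
    for rid in fragment:
        last_written[rid] = rev
    fragments[rev] = fragment

def reconstruct(fragments, rev):
    """Merge at most N fragments, newest first, to rebuild the full page."""
    page = {}
    for r in range(rev, max(rev - N, 0), -1):
        for rid, value in fragments.get(r, {}).items():
            page.setdefault(rid, value)   # newer fragments win
    return page
```

Note that `reconstruct` never looks further back than N fragments, no matter how many revisions exist, and `write_fragment` only rewrites a stable record once per window rather than in every revision.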
For details, see Marc Kramis’s thesis: Evolutionary Tree-Structured Storage: Concepts, Interfaces, and Applications.
Transaction Model
SirixDB uses a single-writer, multiple-reader (SWMR) concurrency model per resource. At most one write transaction can be active on a resource at any time, while an unlimited number of concurrent read transactions can proceed without locks.
Read transactions are bound to a specific revision and see an immutable snapshot — they never observe uncommitted changes and never block writers. This is true snapshot isolation without the overhead of MVCC bookkeeping at query time.
Uncommitted changes are held in an in-memory transaction intent log (TIL). On commit, changes are written sequentially to the append-only log and the new UberPage is flushed atomically. On abort, the in-memory state is simply discarded — nothing was written to disk.
Because each resource is an independent log-structured store, write transactions on different resources proceed in parallel with no coordination.
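A minimal sketch of the SWMR model and the TIL's commit/abort behavior, assuming a trivial in-memory resource (real SirixDB snapshots are on-disk page trees, and the class and method names here are illustrative):

```python
import threading

class Resource:
    def __init__(self):
        self._writer = threading.Lock()   # at most one writer per resource
        self.revisions = [{}]             # committed, immutable snapshots

    def begin_read(self):
        """Readers pin the latest committed snapshot; no locks taken."""
        return self.revisions[-1]

    def begin_write(self):
        self._writer.acquire()            # blocks a second concurrent writer
        return {}                         # empty transaction intent log (TIL)

    def commit(self, til):
        head = dict(self.revisions[-1])   # dict copy stands in for real COW
        head.update(til)
        self.revisions.append(head)       # new snapshot becomes visible
        self._writer.release()

    def abort(self, til):
        til.clear()                       # in-memory only; nothing hit disk
        self._writer.release()
```

A reader that begins before a commit keeps its original snapshot for its whole lifetime, which is the snapshot-isolation behavior described above.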
Secondary Indexes
SirixDB supports three types of user-defined secondary indexes, all stored in the same versioned trie structure as the data. Indexes are part of the transaction and version with the data — the index at revision 42 always matches the data at revision 42.
Path Summary
Every resource maintains a compact path summary — a tree of all unique paths in the document. Each unique path gets a path class reference (PCR), a stable integer ID. Nodes in the main data tree reference their PCR, enabling efficient path-based lookups.
Index Types
| Index | Key | Use case |
|---|---|---|
| Name | Field name hash → node keys | Find all nodes named "email" regardless of path |
| Path | PCR → node keys | Find all nodes at path /users/[]/name |
| CAS | (PCR + typed value) → node keys | Find all users where age > 30 on path /users/[]/age |
CAS (content-and-structure) indexes are the most selective — they index both the path and the typed value, enabling efficient range queries. All indexes are stored in Height Optimized Tries (HOT) within the same versioned page structure.
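The shape of a CAS index can be sketched with a sorted list standing in for the HOT trie: entries keyed by (PCR, typed value) so that a range scan touches only matching entries. Entry layout, class name, and the example PCR assignment are all assumptions:

```python
import bisect

class CASIndex:
    def __init__(self):
        self._entries = []                # kept sorted: (pcr, value, node_key)

    def insert(self, pcr, value, node_key):
        bisect.insort(self._entries, (pcr, value, node_key))

    def range(self, pcr, low, high):
        """Node keys with low <= value <= high on the given path class."""
        lo = bisect.bisect_left(self._entries, (pcr, low, -1))
        hi = bisect.bisect_right(self._entries, (pcr, high, float("inf")))
        return [nk for _, _, nk in self._entries[lo:hi]]

idx = CASIndex()
idx.insert(7, 25, 100)   # pretend PCR 7 = /users/[]/age
idx.insert(7, 31, 101)
idx.insert(7, 45, 102)
idx.insert(9, 40, 200)   # same value range, different path class
```

Because the path class leads the key, `idx.range(7, 31, 999)` answers "age > 30 under /users/[]/age" without ever touching entries for other paths with numeric values.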
For the JSONiq API to create and query indexes, see the Function Reference.
Further Reading
- Evolutionary Tree-Structured Storage — Marc Kramis’s thesis describing the sliding snapshot algorithm
- Flexible Secure Cloud Storage — Sebastian Graf’s PhD dissertation on the trie-based page structure and formal verification of the sliding snapshot algorithm
- HOT: A Height Optimized Trie Index — the trie structure used for SirixDB’s secondary indexes
- SirixDB on GitHub — source code and detailed docs/ARCHITECTURE.md
- REST API documentation — HTTP interface for SirixDB
- JSONiq API — query language guide