Most database systems either overwrite data in place or perform a copy-on-write operation followed by removal of the outdated data (the latter possibly some time later, by a background process). Data, however, naturally evolves over time, and it is often of great value to keep its history. For instance, we might record an employee's payroll on March 1st, 2019 as 5000€/month. Then, on April 15th, we notice that the recorded payroll was wrong and correct it to 5300€. Now, what is the answer to the question of what the payroll was on March 1st, 2019? Database systems that only preserve the most recent version do not even know that the payroll was ever wrong. Our answer to this question depends on which source we consider authoritative: the record or reality? The fact that they disagree effectively splits this payroll event into two tracks of time, rendering neither source entirely accurate on its own. Questions such as these are easily answered by temporal database systems such as Sirix. We always provide the system/transaction time, which is set once a transaction commits (when is a fact valid in the database/the record?). Application or valid time has to be set by the application itself (when is a fact valid in the real world/reality?).
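To make the two time dimensions concrete, here is a minimal, self-contained sketch of a bitemporal fact store in Python. This is illustrative only, not Sirix's actual API or data model: every fact carries both a valid time and a transaction time, and a query fixes both dimensions.

```python
# Minimal bitemporal fact store illustrating the payroll example above.
# Each fact records when it became true in reality (valid time) and when
# it was committed to the store (transaction/system time).
class BitemporalStore:
    def __init__(self):
        self.facts = []  # (valid_from, tx_time, key, value), append-only

    def record(self, valid_from, tx_time, key, value):
        self.facts.append((valid_from, tx_time, key, value))

    def query(self, key, valid_at, known_at):
        # The value believed true at `valid_at`, using only facts that
        # had been committed by `known_at`.
        candidates = [f for f in self.facts
                      if f[2] == key and f[0] <= valid_at and f[1] <= known_at]
        if not candidates:
            return None
        # The latest valid time wins; ties go to the latest commit.
        return max(candidates, key=lambda f: (f[0], f[1]))[3]

store = BitemporalStore()
store.record("2019-03-01", "2019-03-01", "payroll", 5000)  # initial record
store.record("2019-03-01", "2019-04-15", "payroll", 5300)  # correction

# What did the record say on March 1st? -> 5000 (error not yet known)
assert store.query("payroll", "2019-03-01", "2019-03-01") == 5000
# What do we believe now was true on March 1st? -> 5300
assert store.query("payroll", "2019-03-01", "2019-04-15") == 5300
```

Fixing `known_at` answers the audit question ("what did the record say?"), while fixing `valid_at` answers the real-world question ("what was actually true?").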
Thus, one usage scenario for Sirix is data auditing. We keep track of past revisions in our specialized index structure for (bi)temporal data, using our novel sliding snapshot versioning algorithm, which balances read and write performance while avoiding any peaks. Our system is very space efficient: depending on the versioning algorithm, we copy only the changed records (plus possibly a few more) during writes. For instance, we usually do not copy a whole database page if only a single record has changed, nor do we have to cluster data during every commit as in B+-trees and LSM-trees. Furthermore, we can keep track of changes through hooks and our user-defined secondary index structures, avoiding a potentially expensive diff of whole resources. Even without those, we can utilize hashes, which can optionally be built during update operations for integrity checks, and thus skip the comparison of whole subtrees if the hashes and the stable, unique node identifiers match; an ID-based diff algorithm is available if we do not want to track changes in an additional index. We never allow old revisions to be overridden or deleted (that said, we might look into how to prune old revisions if that is ever needed). A single read/write transaction appends data at all times. Of course, you can revert to a specific revision and commit a new revision from there, but all revisions in between remain accessible for data audits. Thus, we are able to help answer who changed what and when.
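The record-level copy-on-write idea can be sketched as follows. This is a toy illustration under simplifying assumptions, not Sirix's sliding snapshot algorithm or its internals: each commit appends a new revision that stores only the changed records and delegates everything else to its parent.

```python
# Toy append-only, record-level copy-on-write store. Each revision holds
# only the records changed in its commit; reads fall back to the parent
# revision, so old revisions stay intact for audits.
class Revision:
    def __init__(self, parent, changes):
        self.parent = parent    # previous revision (None for the first)
        self.changes = changes  # only the records touched by this commit

    def read(self, key):
        if key in self.changes:
            return self.changes[key]
        return self.parent.read(key) if self.parent else None

r0 = Revision(None, {"a": 1, "b": 2})
r1 = Revision(r0, {"b": 3})   # commit touching only "b"

assert r1.read("a") == 1      # unchanged record shared with r0
assert r1.read("b") == 3      # new value in the head revision
assert r0.read("b") == 2      # the audit trail still sees the old value
```

The point of the sketch: a commit's cost is proportional to what changed, not to the size of the resource, and no commit ever mutates an earlier revision.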
Time travel queries
Data audits are about how specific records/nodes have changed. Time travel queries can answer questions like these, but they also allow you to reconstruct a whole subtree as it looked at a specific time or during a specific time span, or to analyze how the whole document/resource changed over time. You might want to analyze the past to predict the future. Through additional temporal XPath axes and XQuery functions we encourage you to look into how your data has evolved.
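As a rough illustration of what a temporal axis computes (a conceptual sketch, not Sirix's XQuery syntax or evaluation strategy): walking one record across all revisions yields its timeline of changes.

```python
# Sketch: given a list of revision snapshots (oldest first), each a mapping
# of record keys to values, reconstruct the history of one record -- roughly
# what an "all revisions of this node" style temporal axis delivers.
def history(revisions, key):
    timeline = []
    for number, snapshot in enumerate(revisions):
        if key in snapshot:
            value = snapshot[key]
            # record only actual changes, not repeated identical values
            if not timeline or timeline[-1][1] != value:
                timeline.append((number, value))
    return timeline

revisions = [
    {"payroll": 5000},
    {"payroll": 5000, "bonus": 200},
    {"payroll": 5300, "bonus": 200},
]
assert history(revisions, "payroll") == [(0, 5000), (2, 5300)]
assert history(revisions, "bonus") == [(1, 200)]
```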
Fixing application or human errors / simple undo/redo operations / rollbacks / reproducing experiments
For all of the above-mentioned use cases: you can simply revert to a specific revision/point in time where everything was in a known-good state and commit that revision again, or you can select a specific record/node, correct the error, and commit the new revision.
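Revert-and-commit can be sketched like this (illustrative only, not the Sirix API): reverting never deletes anything; it appends a new head revision whose content equals the chosen old one, so the intermediate revisions stay queryable.

```python
# Toy append-only revision list: revert(n) commits the content of an old
# revision as a new head instead of rewriting history.
revisions = [{"x": 1}, {"x": 2}, {"x": 3}]  # revisions 0, 1, 2

def revert(revisions, n):
    revisions.append(dict(revisions[n]))  # append, never overwrite
    return len(revisions) - 1             # number of the new head revision

head = revert(revisions, 0)
assert head == 3
assert revisions[head] == {"x": 1}   # known-good state restored
assert revisions[2] == {"x": 3}      # the "bad" revision is still auditable
```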
Sirix is a storage system which brings versioning to a sub-file granular level while taking full advantage of flash-based drives such as SSDs. As such, per-revision as well as per-page deltas are stored. The time complexity for retrieval and storage of records/nodes is logarithmic (O(log n)); the space complexity is linear (O(n)). Currently, we provide several layered APIs:
- a very low-level page API, which handles the storage and retrieval of records at the page-fragment level (a buffer manager handles the caching of pages in memory, and the versioning takes place at an even lower layer, storing and reconstructing page fragments with CPU-friendly algorithms);
- a cursor-based API on top, to store and navigate through records (currently XML/XDM nodes as well as JSON nodes);
- a DOM-like node layer for simple in-memory processing of these nodes, which is used by Brackit, a sophisticated XQuery processor;
- and last but not least, a RESTful asynchronous HTTP API.
Our goal is to provide a seamless integration of a native JSON layer besides the XML node layer, that is, extending the XQuery Data Model (XDM) with other node types (support for JSONiq through the XQuery processor Brackit). In general, however, we could store any kind of data. We provide
- The current revision of the resource or any subset thereof;
- The full revision history of the resource or any subset thereof;
- The full modification history of the resource or any subset thereof.
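The logarithmic retrieval mentioned above comes from resolving record identifiers through a trie of indirect pages, so a lookup touches one page per level. A small sketch of the index arithmetic follows; the fanout and the details are assumptions chosen for illustration, not Sirix's actual constants or page layout.

```python
# Sketch of the idea behind O(log n) lookups: a record identifier is
# decomposed into one child-slot index per tree level, giving the path
# from the root indirect page down to the leaf page holding the record.
FANOUT = 4  # real systems use a much larger fanout, e.g. 512 or 1024

def page_path(record_id, height):
    """Child-slot indices followed from the root to the leaf page."""
    path = []
    for level in range(height - 1, -1, -1):
        path.append((record_id // FANOUT**level) % FANOUT)
    return path

assert page_path(0, 3) == [0, 0, 0]
assert page_path(5, 3) == [0, 1, 1]
assert page_path(63, 3) == [3, 3, 3]   # last record addressable at height 3
```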
We support not only all XPath axes (plus a few more, such as a PostOrderAxis) to query a resource within one revision, but also novel temporal axes which facilitate navigation in time. A transaction (cursor) on a resource can be started either by specifying a revision number (to open a revision/version/snapshot of a resource) or by a given point in time. The latter starts a transaction on the revision which was committed closest to the given timestamp.
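Opening a transaction by a point in time boils down to picking the revision whose commit timestamp is closest to the requested instant. A sketch of that lookup, assuming the behavior described above (this is not Sirix code):

```python
from bisect import bisect_left

# Given the sorted commit timestamps of all revisions, return the number
# of the revision committed closest to the requested instant.
def revision_for(commit_times, instant):
    i = bisect_left(commit_times, instant)
    if i == 0:
        return 0
    if i == len(commit_times):
        return len(commit_times) - 1
    # pick whichever neighbor is closer to the requested point in time
    before, after = commit_times[i - 1], commit_times[i]
    return i - 1 if instant - before <= after - instant else i

times = [100, 200, 400]          # commit timestamps of revisions 0, 1, 2
assert revision_for(times, 90) == 0    # before the first commit
assert revision_for(times, 210) == 1   # 200 is closer than 400
assert revision_for(times, 350) == 2   # 400 is closer than 200
```

Since commit timestamps are naturally sorted, the lookup itself is a binary search, i.e. logarithmic in the number of revisions.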
You may find a quick overview of the main features useful.
We provide several APIs to interact with Sirix.
- The transactional cursor API is a powerful low-level API.
- On top of this API we built a Brackit.org binding, which provides a more DOM-like API with in-memory nodes, as well as an XQuery API.
- We provide a powerful, asynchronous, non-blocking RESTful-API to interact with a Sirix HTTP-server. Authorization is done via Keycloak.
Articles published on Medium:
- Asynchronous, Temporal REST With Vert.x, Keycloak and Kotlin Coroutines
- Pushing Database Versioning to Its Limits by Means of a Novel Sliding Snapshot Algorithm and Efficient Time Travel Queries
- How we built an asynchronous, temporal RESTful API based on Vert.x, Keycloak and Kotlin/Coroutines for Sirix.io (Open Source)
- Why Copy-on-Write Semantics and Node-Level-Versioning are Key to Efficient Snapshots
Sirix was forked from Treetank (which is no longer maintained); as a university project, Treetank was the subject of several publications.
A lot of the ideas are still based on the Ph.D. thesis of Marc Kramis: Evolutionary Tree-Structured Storage: Concepts, Interfaces, and Applications
As well as on Sebastian Graf's work and thesis: Flexible Secure Cloud Storage
Other publications include:
- Versatile Key Management for Secure Cloud Storage (DISCCO12)
- A legal and technical perspective on secure cloud Storage (DFN Forum12)
- A Secure Cloud Gateway based upon XML and Web Services (ECOWS11, PhD Symposium)
- Treetank, Designing a Versioned XML Storage (XMLPrague11)
- Hecate, Managing Authorization with RESTful XML (WS-REST11)
- Rolling Boles, Optimal XML Structure Integrity for Updating Operations (WWW11, Poster)
- JAX-RX - Unified REST Access to XML Resources (TechReport10)
- Integrity Assurance for RESTful XML (WISM10)
- Temporal REST, How to really exploit XML (IADIS WWW/Internet08)
- Distributing XML with focus on parallel evaluation (DBISP2P08)