ActorDB is very cool, but from what I gather it's — at least nominally — designe...

derefr · on Feb 27, 2016

To me, that just sounds like it forces you to do the Right Thing from the beginning. Eventually, you will need to shard your data in some way†.

If you're already locked into a "global queries against tables containing Everyone's Stuff" data model, you end up doing this through Stupid Database Tricks like "see-other" redirection or silent persistent-hash load-balancing (with constant ops-heavy rebalancing as you grow.)

If you think about sharding from the beginning, though, you end up just dividing your data into little naturally-atomic "worlds." Like sharding email [headers, not bodies] by user, or sharding StackOverflow posts by community.

If you don't think you'll ever scale to the point where you'll be forced to make those calls, just pick a different database. (But then, if you don't think you'll ever reach that scale, then literally any database will do.)

---

† Unless you're EVE Online. They could probably write a really good whitepaper about how they've scaled their single-node MSSQL database so far. I presume they're mostly following the same patterns you would when running e.g. Oracle on big iron; but—according to most sources—they're just using some heavily-loaded commodity hardware with no fancy IO offloading et al. No idea how they do it.

calpaterson · on Feb 27, 2016

No, most businesses could run their datasets off normal databases. This becomes more true every year as stock hardware gets more RAM and bigger disks.

There are plenty of businesses running off single-node SQL databases. Not Google's scale, fine, but the same scale as the bottom 99% of businesses.

For the record, StackOverflow (your example) is run off a single MS-SQL cluster. It's not even that heavily loaded apparently:

http://highscalability.com/blog/2014/7/21/stackoverflow-upda...

derefr · on Feb 28, 2016

Yes, that's what I said: for most businesses (that aren't trying to scale to infinity the way that VC startups talk about), literally any database will do. Even single-node SQLite will do, because there won't be enough contention for its locking to ever matter.

atombender · on Feb 27, 2016

Absolutely. Keep in mind that ActorDB actually requires that you supply the sharding key yourself whenever you do a query. It's not like, say, Elasticsearch, which shards both reads and writes transparently.

Many apps do require large queries that span many "partitions": For example, listing all of a StackOverflow member's contributions across all communities, sorted by time. With ActorDB, you would have to plan for this by denormalizing a bit (biokoda might correct me here): Each community would be an actor with a table of QA items, and each user would also have an actor containing a table listing their own QA items for all communities. Since ActorDB apparently has transactions, you can maintain this duplicate data atomically, though you can't maintain foreign-key constraints across actors.
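To make the denormalization concrete, here is a minimal sketch. It uses plain SQLite with an attached second database as a stand-in for two actors; it is not ActorDB syntax, and the table names, column names and helper functions are all hypothetical. It only illustrates the idea of writing the community's copy and the user's copy in one transaction, so that either both rows exist or neither does.

```python
import sqlite3


def open_stores():
    """Open two in-memory SQLite databases on one connection.

    'main' stands in for a community actor, 'user_store' for a user actor.
    This is an assumption for illustration, not how ActorDB is addressed.
    """
    conn = sqlite3.connect(":memory:", isolation_level=None)  # manual BEGIN/COMMIT
    conn.execute("ATTACH DATABASE ':memory:' AS user_store")
    conn.execute(
        "CREATE TABLE main.qa_items ("
        "id INTEGER PRIMARY KEY, author TEXT NOT NULL, "
        "title TEXT NOT NULL, created_at INTEGER NOT NULL)"
    )
    conn.execute(
        "CREATE TABLE user_store.contributions ("
        "community TEXT NOT NULL, item_id INTEGER NOT NULL, "
        "title TEXT NOT NULL, created_at INTEGER NOT NULL)"
    )
    return conn


def post_item(conn, community, author, title, created_at):
    """Insert the item and its duplicate in a single transaction."""
    conn.execute("BEGIN")
    try:
        cur = conn.execute(
            "INSERT INTO main.qa_items (author, title, created_at) VALUES (?, ?, ?)",
            (author, title, created_at),
        )
        conn.execute(
            "INSERT INTO user_store.contributions "
            "(community, item_id, title, created_at) VALUES (?, ?, ?, ?)",
            (community, cur.lastrowid, title, created_at),
        )
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")  # undo both inserts together
        raise
    return cur.lastrowid


def user_timeline(conn):
    """List the user's contributions across communities, newest first."""
    return conn.execute(
        "SELECT community, item_id, title, created_at "
        "FROM user_store.contributions ORDER BY created_at DESC"
    ).fetchall()
```

The "all contributions, sorted by time" query then only touches the user's own store, at the cost of keeping the duplicate in sync on every write.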

derefr · on Feb 27, 2016

I work with CouchDB, and it sounds similar: many queries end up being map-reduce functions run across many CouchDB "databases" (partitions) on the same node/cluster.

One of the nice things about this approach, from my CouchDB experience, is that each DB/partition has its own permissions (fully extensible through design-documents shoved into that DB), so instead of needing to carefully write your business layer to ensure that users can only ever query their own data out of a table, you just do all your work for a particular user in a table that only contains stuff that user has a right to access to begin with.

It's a lot like working with S3 buckets, now that I think about it. Buckets containing tables, rather than buckets containing objects.

catnaroek · on Feb 28, 2016

Please excuse my ignorance - how do I enforce nontrivial system-wide invariants with lots of little databases rather than a single consolidated one?

The invariants I care about are:

(0) Referential integrity. If the table `Foo` has a `FOREIGN KEY (BarID) REFERENCES Bar (BarID)`, then no row in `Foo` must be seen as having a `BarID` whose value can't be found in the table `Bar`.

(1) Linearizability. There must exist a total order on the entire transaction history of a database, such that, starting from the empty database, and executing the transactions nonconcurrently in the given total order, the result is the current state of the database. (NOTE: The transaction history need not be physically stored anywhere. So this invariant can't be “tested” - it has to be proven to hold.)

These guarantees are so basic, so fundamental in my everyday use of RDBMSs, that I need to be convinced that they hold.

biokoda · on Feb 28, 2016

By finding the natural sharding factor. If you have one, which very often you do. A few completely different examples:

- Are you running a mail server? Every user is an actor (i.e. an sqlite instance).

- Are you running a backend for a dropbox/evernote/messaging type app? Every user is an actor.

- Do you have a distributed filesystem? Use the KV mode and shard on file hash.

- Do you need scalable counters? Use KV mode again, split every counter into 10 (or 100) and increment/decrement on those.
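The counter-splitting idea in the last bullet can be sketched in a few lines. This is an in-memory stand-in, assuming each shard would be a separate key in a KV store; the class and method names are hypothetical and nothing here is ActorDB's API. Writers pick a shard at random so they rarely contend on the same key, and a read sums all shards.

```python
import random


class ShardedCounter:
    """One logical counter split across several independent slots."""

    def __init__(self, shards=10):
        # In a real deployment each slot would be its own key/row (assumption).
        self._shards = [0] * shards

    def add(self, delta=1):
        # Spread writes across slots to reduce contention on any single one.
        index = random.randrange(len(self._shards))
        self._shards[index] += delta

    def value(self):
        # Reads are more expensive: every slot has to be visited.
        return sum(self._shards)
```

The trade-off is cheaper, less contended writes in exchange for reads that have to touch every shard.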

catnaroek · on Feb 28, 2016

Again, please excuse my ignorance - what is the definition of “natural sharding factor”, and how would I compute it for, say, an ERP system?

biokoda · on Feb 28, 2016

I have no experience developing ERP systems. I do have experience of being on the receiving end of a very poorly working one.

These systems seem (from my outside view) to have a tendency to become giant monoliths. So when developing you must fight against increasing monolithic complexity. Using something like ActorDB can be somewhat of a beneficial constraint. It forces you to maintain a clean design.

I would force sales, marketing, shipping, product planning (taking from wikipedia here..), to be their own separate actors with their own schemas. Then if possible shard within those types. So if something is customer service, have an actor per customer and have all his data there. If you're developing multiple products, every product has an actor.

catnaroek · on Feb 28, 2016

> These systems seem (from my outside view) to have a tendency to become giant monoliths. So when developing you must fight against increasing monolithic complexity.

You aren't wrong, that's my experience as well. It's just as annoying for programmers (or, at least, for me) as it is for users. The following question has popped up countless times in my head: “Why do I have to rely on an implicit convention that this application module never touches this database table?” There was never a good answer.

The only reason why I put up with such things is that I have no idea how to prevent more modular designs from turning into a data integrity nightmare. (I'll freely admit my lack of education is to blame here.) For instance, let's say we have three modules: inventory, sales and shipping. Furthermore, let's assume each module is its own actor and uses its own backend database. We must implement the use case “enter a sale in the system”. The expected sequence of actions is:

(0) The sales module queries the inventory module whether there is enough of a product in stock to satisfy a customer order.

(1) The inventory module “locks” the requested quantity/amount of the product [so that it can't be used, say, for another sale], and gives the sales module a “token” that can be used to confirm or cancel the withdrawal.

(2) The sales module queries the shipping module if there are enough available trucks/ships/whatever to ship the product to the customer's location by a given delivery date.

(3) The shipping module “locks” however many trucks/ships/whatever it deems necessary to ship the product, and gives the sales module a “token” that can be used to confirm or cancel the shipping.

(4) The sales module queries the user for the customer's credit card number and verification code, interfaces with the bank's system, blablabla...

(5) The sales module confirms to the inventory and shipping modules [in this specific order] that the product will be withdrawn and shipped.

Now some exception handling:

(6) If step 3 fails, the sales module cancels the product withdrawal.

(7) If step 4 fails, the sales module cancels the product withdrawal and shipping.

(8) If any system [inventory, sales, shipping] goes down, neither the product nor the trucks/ships/whatever can be kept locked forever. So each lock must have a timeout, and, if it's neither explicitly confirmed nor explicitly cancelled by the sales module, it will be implicitly cancelled by the inventory and/or shipping module when the timeout elapses.

(9) It may happen [unlikely, but not impossible] that the inventory and shipping modules' clocks get unsynchronized in such a way that, when the product withdrawal has been confirmed, the shipping lock has already expired. Oh, the nightmare.

Implementing all of this correctly in all cases is actually tricky! And if anything is implemented slightly wrong, the whole system goes kaboom! With a monolithic database, there is no need to “lock” any resources, nor issue “confirmation tokens” - just use the DBMS's built-in transaction system!
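To show how much hand-written machinery this is, here is a rough sketch of the reserve/confirm/cancel scheme with timeouts. Everything in it is hypothetical: the `Reservations` class, the `enter_sale` function and the `charge_card` callback are made-up names, time is passed in explicitly as `now`, and both modules share one clock, so the clock-skew problem from step 9 is not modeled at all. It also does nothing about refunding the card if a confirmation fails after payment.

```python
import itertools


class Reservations:
    """A pool of some resource that hands out time-limited holds."""

    def __init__(self, available, ttl):
        self.available = available
        self.ttl = ttl
        self._holds = {}  # token -> (quantity, expires_at)
        self._tokens = itertools.count(1)

    def _expire(self, now):
        # Implicitly cancel every hold whose timeout has elapsed.
        for token, (qty, expires_at) in list(self._holds.items()):
            if expires_at <= now:
                del self._holds[token]
                self.available += qty

    def reserve(self, qty, now):
        """Lock `qty` units and return a token, or None if not enough is free."""
        self._expire(now)
        if qty > self.available:
            return None
        self.available -= qty
        token = next(self._tokens)
        self._holds[token] = (qty, now + self.ttl)
        return token

    def confirm(self, token, now):
        """Make the withdrawal permanent; False if the hold already expired."""
        self._expire(now)
        return self._holds.pop(token, None) is not None

    def cancel(self, token, now):
        """Release the hold and return the units to the pool."""
        self._expire(now)
        hold = self._holds.pop(token, None)
        if hold is not None:
            self.available += hold[0]


def enter_sale(inventory, shipping, qty, trucks, charge_card, now):
    stock_token = inventory.reserve(qty, now)
    if stock_token is None:
        return False
    truck_token = shipping.reserve(trucks, now)
    if truck_token is None:
        inventory.cancel(stock_token, now)  # shipping lock failed
        return False
    if not charge_card():
        inventory.cancel(stock_token, now)  # payment failed
        shipping.cancel(truck_token, now)
        return False
    # Confirm inventory first, then shipping, as in step 5.
    return inventory.confirm(stock_token, now) and shipping.confirm(truck_token, now)
```

Even this toy version needs explicit compensation on every failure path, which is exactly the bookkeeping a single transaction would make unnecessary.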

biokoda · on Feb 28, 2016

> Furthermore, let's assume each module is its own actor and uses its own backend database.

I would make every module an actor type. There can be multiple types, and each type has its own schema. Within an actor type there are many actors: an actor for every product, for instance, so all X widgets are in one actor.

> With a monolithic database, there is no need to “lock” any resources, nor issue “confirmation tokens” - just use the DBMS's built-in transaction system!

ActorDB has distributed ACID transactions so I would use that. You can create a transaction over multiple actors. The reason I would split it into many actors is that you're always locking only small parts of the system for the duration of the transaction, not the entire DB.
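As a rough illustration of the "locking small parts" point, the sketch below takes one lock per actor and has a transaction acquire only the locks of the actors it touches. This is a generic in-process pattern with hypothetical names (`ActorLocks`, the actor ids), not a description of how ActorDB implements its transactions. Locks are taken in sorted order so that two overlapping transactions cannot deadlock each other.

```python
import threading
from contextlib import ExitStack


class ActorLocks:
    """One lock per actor; a transaction holds only the locks it needs."""

    def __init__(self):
        self._locks = {}
        self._guard = threading.Lock()  # protects the lock table itself

    def _lock_for(self, actor_id):
        with self._guard:
            return self._locks.setdefault(actor_id, threading.Lock())

    def is_locked(self, actor_id):
        return self._lock_for(actor_id).locked()

    def transaction(self, *actor_ids):
        """Return a context manager holding the locks of the given actors."""
        stack = ExitStack()
        # A consistent acquisition order avoids lock-ordering deadlocks.
        for actor_id in sorted(set(actor_ids)):
            stack.enter_context(self._lock_for(actor_id))
        return stack


locks = ActorLocks()
with locks.transaction("inventory/widget-1", "shipping/truck-7"):
    # Only these two actors are blocked; every other actor stays available.
    pass
```

A sale involving one product and one truck leaves all other products and trucks free for concurrent transactions.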

catnaroek · on Feb 28, 2016

> I would make every module an actor type. There can be multiple types, and each type has its own schema. Within an actor type there are many actors.

Noted.

> ActorDB has distributed ACID transactions so I would use that. You can create a transaction over multiple actors.

Sweet! In particular, the second sentence is exactly what I wanted to hear.

I will give ActorDB a try.

diroussel · on Feb 28, 2016

You won't find many RDBMS that can guarantee Linearizability.

See: https://martin.kleppmann.com/2015/09/26/transactions-at-stra...

catnaroek · on Feb 29, 2016

The ones that I use at home (PostgreSQL) and at work (SQL Server) allow serializable transactions. Of course, in many cases it's overkill, but it's good to know that it's there when needed.

biokoda · on Feb 27, 2016

You can actually get very far by not sharding. But it forces you to eventually throw out everything that makes relational databases awesome and turn it into a glorified KV store.

rodgerd · on Feb 27, 2016

> they're just using some heavily-loaded commodity hardware with no fancy IO offloading et al.

Aren't they backed by RAMSANs?

biokoda · on Feb 27, 2016

Well put.

biokoda · on Feb 27, 2016

It would perform well with one actor when compared to a single SQLite instance. It would not compare as well against PostgreSQL/MySQL, because the concurrency model is optimized for many concurrent actors, not concurrent access to a single actor. A single actor can still execute thousands of queries per second, however.

Compared to a single SQLite instance, ActorDB has two major advantages: no write multiplication, thanks to LMDB, and compression, which means reads/writes are significantly smaller. SQLite will always write everything twice, first to the WAL and then to the SQLite file, and it has no compression capabilities. It should completely trounce rqlite in the performance department.

atombender · on Feb 27, 2016

Thanks, that's useful. I know nothing about SQLite's concurrency model, but it's not surprising that it can't compete with Postgres.

Does this mean that ActorDB relies on LMDB to replace the WAL? I don't know enough about LMDB either, but I presume it has some kind of redo log.

biokoda · on Feb 27, 2016

Yes, ActorDB takes sqlite3.c, takes out the wal.c code and replaces it with calls to LMDB. There is no redo log; LMDB is designed in such a way that it does not require one. It is actually verified as the safest storage engine design out there.

thesz · on Feb 27, 2016

Do you have a link on the LMDB verification?

What I know is that SQLite IS safe to use even on unsafe file systems. There was a discussion here a few months ago (link was from danloo.com).

biokoda · on Feb 27, 2016

https://www.usenix.org/conference/osdi14/technical-sessions/...