All Your Base 2012Nov 2012 16 minute read
The first ever All Your Base conference took place in Oxford last Friday - casting aside the familiar format of focusing on a single aspect of database technology such as NoSQL or even a single product (as interesting as both those things often are) and instead tackling a much trickier brief: presenting the bigger picture - a cross-section of all the different database types currently available and showcasing their relative merits. If White October succeed in their ambition to host another one next year, you should go. Yes, you - although this was pitched at web developers, if you have any interest in storing data, I guarantee it’ll be worth your while.
You’ve built your application, so what next?
First up was Alvin Richards from MongoDB creators, 10gen to talk about scaling your application, introducing themes such as scaling and trade-offs that would be returned to again and again throughout the day.
The talk was suprisingly wide ranging, covering general areas rather than just being focussed on MongoDB. Topics included eventual consistency, memory ratios (if you hit 1:1, welcome to persistent memory stores) and how to improve them by sharding across multiple servers (including some finer detail about sharding strategies and Mongo’s autobalancing feature), and a reminder not to assume that using a scaleable datastore automatically means you’re building a scaleable application.
Another important concept was the idea that some data should be treated more equally than others (“do you love all your data equally?”) when it comes down to assurance and latency. Some data is more important - such as credit card transactions - requiring a greater focus on assurance than others - like tracking data - where the greater importance is not slowing the application down while waiting to check that the data has been stored without error. The old fire and forget style of database write that put a lot of developers off in the past is no longer the only way - with the newer drivers, MongoDB now lets you tune the level of data durability on a per query basis using the “w” option to set the level of concern about the data being sent for storage.
Switching from the Relational to the Graph model
Again rather than sticking to OrientDB specifics, Luca started by talking about graph databases in general - what they could do and why it might be worth the extra effort to understand such a different data storage paradigm.
In addition to some basic graph theory, this also included a handy relational database 101 explaining how indexes and foreign keys really work and why they can cause a significant performance hit. Where graph dbs score is by storing the relationships between objects rather than calculating them each time the query is run, doing away with indexes and foreign key relationships altogether, making scaling up the amount of stored data much more viable. As Claudio Martella noted last year, it’s not so much NoSQL as NoSPARQL.
PouchDB is the database that speaks sync - why does this matter?
They key take away here was that “offline is a fact” and “sync is hard”, along with the reminder that Mac-based task manager Things needed a 2 year run-up to get sync ready for public consumption.
Other similar projects include TouchDB which is an ObjectiveC port aimed at touchscreen devices(both iOS and Android versions are available). To me the most impressive feature of these CouchDB clones is that they’ve been carefully implemented so that replication works seamlessly between all of them so a CouchDB database can be pushed around to PouchDB or TouchDB (and back again) as required.
Eventual Consistency in the real world
Matt kicked things off with a nice bit of NoSQL history, urging us to track down a copy of Werner Vogels’ 2007 Amazon’s Dynamo Paper - Dynamo the shopping cart tech rather than the database, that came later - as essential background reading to the concept of “eventual consistency”. (If this is your thing, I’d throw in Werner’s 2012 blog post on Dynamo as well as it sets out the whole backstory.)
Life as with engineering (and CAP theorem) is all about trade-offs; in the CAP theorem triangle, Riak chooses Availability over Consistency. As Matt ably illustrated with some demo slides, the alternative to “Eventual Consistency” is “Eventual Availability” which really sucks in terms of the user experience. I’ll gloss over Matt’s description of how Eventual Consistency is implemented in Riak - I’ll only get it wrong - except to say that it has built-in mechanisms for replicating stored objects to multiple nodes automatically and, in the event of an unresolvable conflict (“divergent object versions” if my memory serves), it stores both copies so that the application can choose which one it wants.
On a personal note, I’m slightly disappointed that I’m no nearer to discovering the “correct” pronunciation of Riak as Matt used at least 3 different ones in the opening 5 minutes. Not sure whether I’m overthinking it or if Matt was messing with us.
MySQL and MariaDB story
Monty brought us the story of how MySQL was developed, why it was dual licensed (only the second product ever to do so - after Ghostscript) and how he turned down the initial offer of $50million to sell the company after 3 years of development. (He didn’t consider the product was “good enough” at that point and feared an outright sale would allow the product to move from an open to a closed source model, killing further development and possibly the product as well.)
Following the deal with Sun in 2008, many of the original core developers left the project about a year later and joined Monty in working on MariaDB - a fork of MySQL. A drop-in replacement for MySQL, MariaDB initially kept in sync with point version releases of MySQL with both products cherry picking the best innovations of the other along the way. However post MySQL 5.5, the products have started to diverge, as reflected by the fact that MariaDB’s release naming convention has jumped from 5.x to 10.0. Some parts of the latest round of code changes from Sun have been rejected by the MariaDB team and I can see the 2 products drifting further apart over the next few years.
I still haven’t got around to checking this out for myself - although apparently, provided that my typing speed’s up to the task, I can make the switch in around 15 seconds, MariaDB is faster, more stable, sporting helpful extra features for developers - such as progress reports during long operations - and providing better results for some queries. As I’ve been having trouble with a misfiring MySQL query lately, the comparison could be interesting (at the time I concluded that I was doing something wrong and routed around the problem, now I’m not so sure). Monty, unsurprisingly, backs MariaDB to be better “in every single case” and is making no secret of wanting to make MySQL obsolete.
Git: The NoSQL Database
Billed by our host as “a crazy tech talk”, we were invited to think beyond the obvious with this exploration of whether Git can be repurposed from a code repository to a database of sorts.
Yes is the simple answer - Brandon drilled down into Git’s storage model, how it handles conflicts and versioning and how it deals with transactions. He then went on to talk about programatic access to repositories with grit and libgit2 (expected to be grit’s replacement at Github) and even ORM wrappers such as GitModel and Toystore.
However on closer inspection… “all the features that make a great DB - yeah, git doesn’t have those”. There’s no native support for concurrency, merge conflicts are messy and at web scale issues surrounding paths and write speed limits make it hugely impractical. However for simpler applications, it’s an interesting possibility.
The overall message - “abuse your tools and imagine how to make them better”.
Redis, Steady, Go!
Next up was Peter Cooper for an @antirez-approved whistle-stop tour of Redis, the Remote dictionary structure.
An ultra-fast key/value store, Redis is “hard to categorize”. The Redis Manifesto positions it as a “DSL for abstract data types” while it is often marketed as a more functional (and persistent) alternative to memcache. In use, as demonstrated here, it behaves more like assembly language with syntax such as:
SET A "hello world" GET A SET counter 1 INCR counter
Single threaded in general (except where background-run processes are concerned) and all operations should be considered atomic. Highly configurable the recommendation from both Peter and creator Salvatore/@antirez is to run your own server, however a number of specialists hosts offering Redis-as-a-service are on hand if this doesn’t sound like a good starting point.
MySQL @ Twitter
Speaker: Lisa Phillips, Senior MySQL DBA @ Twitter
Another switch of pace and focus for this talk from Lisa Phillips, Senior DBA (within a newly expanded team of 8) at Twitter; this was the story of how a particular database technology is used in the real world - in this case the high volume world of a busy social networking site.
How busy? Twitter hosts an average of 4,629 Tweets per second. Popular events such as international football matches (“people all tweeting ‘GOAL!!’ in the same second”) push the network radically beyond that average figure although the record traffic peak was a mass Tweet orchestrated by a Japanese TV show which produced 25,088 posted messages in a second. (Of course, extra capacity and support can be planned for major sporting events - problems are more likely to come from unexpected news items such as the death of Michael Jackson.)
Surprisingly - and in an interesting reflection of Google’s early years - Twitter’s “thousands of database servers” don’t run on anything particular special, just the same mix of Dell and HP boxes as you’d find any everyone else’s datacentres.
Twitter regard themselves as highly pragmatic when it comes to technology choices, preferring the philosophy of “the right tool for the job”. MySQL was the obvious choice for the platform’s primary database as it offered a serious, stable (some of the servers have an uptime of 2 years+ and could easily have been around longer had it not been for an enforced physical move) solution for no money. Another plus point for choosing to use a recognised technology like MySQL is that it makes easier to bring developers onboard as well as hiring DBAs.
Facebook is also using MySQL, presumably for similar reasons.
Running at this sort of scale, they’ve found some interesting edge cases that few others are likely to have encountered - amusingly in this context “small datasets” are considered to be anything less than 1.5TB. For example it turns out that there’s a limit on the number of unique IDs a MySQL table is prepared to generate before it falls over (hence the whole Snowflake thing); unhelpfully once this limit is hit, it fails silently. In the light of these kinds of issue, they now take a particular interest in “people who’ve seen things break on a large scale” when hiring.
Mobile Web Persistence Strategies for Offline Clients
Set against a background of frustration with the latest cross-browser glitches and incompatibilities heralded in by the new era of mobile browsers and “packaged apps” - locally installed web code that looks and feels like a native app and needs to keep working when the mobile network slows to a crawl or disappears altogether - this was the story of why Brian created Lawnchair along with some stuff about future trends and illustrations of the kind of stuff that can go wrong.
In an attempt to bring a little order to the chaos, WebSimpleDB API was invented to set out a standard for data storage and retrieval which in turn has morphed into IndexedDB which still doesn’t enjoy full cross-platform support so now there are polyfill style hacks for that as well.
Back to more traditional databases with an eye-opener (or refresher if you’re already familiar with Postgres) from Craig. Dubbed “the emacs of databases”, Postgres - or PostgreSQL if you prefer - has a lot going for it. Using the psql command line tool gives you access not only to the data itself (welcome back SQL) but to a range of commands such as \d for describe \e and to open the last query in your default text editor, very reminiscent of the emacs or vim environments.
Postgres natively supports a wide range of datatypes including geospatial and network-related types as well as all the usual suspects like varchar and float. It also supports “extensions” that bring even more data types (and functions) into play. These include:
- hstore - a searchable key/value (hash?) solution that’s stored inside its own column
- citext - case insensitive text
- PLV8 - an implementation of the V8 engine inside Postgres
Other features include Range Types (new for 9.2) and built in full text search.
Apache Cassandra, and why BASE is great for real-time analytics
Cassandra originated at Facebook where elements of Google’s Bigtable were mixed together with Amazon’s Dynamo shopping cart engine to create an inbox search tool - a cross between a row-based system and a key/value store. It first went open source in 2008 and became an Apache “top level project” 2 years later.
Designed to run at scale, Cassandra is multi-datacentre aware, sporting multi-master architecture to realise it’s central philosophy of “no single point of failure” (SPOF). Keys (which can consist of single or multiple columns) provide access to individual rows, giving a richer data model than traditional key/value stores (but not as rich as MongoDB). Reading back a single (unlimited width) row is fast but reading across multiple rows is slow - depending on how the information is partitioned, it’s highly likely that these rows are located in different datacentres.
Two common uses of Cassandra are to provide a session store (read dominated, low latency, easy to update existing items with atomic counters) and real time analytics (write dominated, rarely updated, distributed). Twitter’s promoted tweets dashboard is a good example of this latter usage being run over massive quantities of data. Careful planning of keys and values prevents each query having to sift through 4.2TB of data. This sort of querying is where Acunu Analytics comes in - it sits over the top of Cassandra to help with the stuff that OOTB Cassandra is not that good at such as grouping query and optimizing query results for reporting purposes.